Support tokenizing while keeping common censor chars (#767)
There are some cases where it's useful to tokenize a string while _not_
splitting on some non-letter chars like `#`, `*`, `-`, or `_`.
Unfortunately, `Tokenize` currently splits on all of these, which makes
some matching difficult.
This adds a second tokenizer, `TokenizeTextSkippingCensorChars`, for
those particular use cases. It also adds `TokenizeTextWithRegex`, so
that other cases can be covered easily in the future if they arise.