I know plenty of ways of splitting strings, but what is the best (that is, fastest) way of splitting text into its individual words to produce a table of the words and their sequence? Naturally, I'd want to strip trailing punctuation but preserve punctuation that falls within a word. I'm interested because I want to run some timings to check performance. I'm writing a system that builds an inverted index for searching and for gauging text similarity. The last time I did this, it was a bit slow.
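To pin down the requirement, here is a minimal sketch in Python of the kind of output meant: a list of (sequence, word) rows. It is only an illustration, not a tuned or definitive approach. The function name `split_words` is hypothetical, and it assumes that words are separated by whitespace and that "punctuation" means the ASCII set in `string.punctuation`. Note that it strips leading punctuation (such as opening quotes) as well as trailing.

```python
import string

def split_words(text):
    """Return a list of (sequence, word) rows, numbering words from 1."""
    words = []
    for token in text.split():  # assumption: whitespace separates words
        # Strip punctuation from both ends only; anything inside the
        # word (apostrophes, hyphens, decimal points) is left alone.
        word = token.strip(string.punctuation)
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return list(enumerate(words, start=1))

rows = split_words("Don't stop, it's well-known. Really? 3.14!")
for sequence, word in rows:
    print(sequence, word)
```

A baseline like this gives something simple to time other approaches against, and to check that they produce the same rows.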