Just a quick note: For most purposes the @etardiff solution (strip away the non-alpha) is efficient and will do the job!
Some odd bits will show up when the name Martínez turns into mart and nez in the wordlist, and wouldn’t turns into wouldn and t (etc.). And we may or may not want alphanumeric strings to show up as whole words on our list, as in 4’33", 1970s, TRS-80 and 50th…
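To make those odd bits concrete, here is a minimal Python sketch. It is only a stand-in for the strip-the-non-alpha idea, not @etardiff's actual pattern: the function name `ascii_alpha_words` and the `[a-z]+` regex are my own assumptions for illustration.

```python
import re

def ascii_alpha_words(text):
    """Hypothetical stand-in: lowercase, then keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())

print(ascii_alpha_words("Martínez wouldn’t"))
# ['mart', 'nez', 'wouldn', 't']
print(ascii_alpha_words("TRS-80 1970s 50th"))
# ['trs', 's', 'th']
```

The accented letter and the apostrophe both act as separators, and the digits vanish entirely, leaving stray fragments like s and th.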
Alas, apostrophes, closing single quotes, and prime marks (“dumb” apostrophes) can’t be cleanly classified as text or punctuation: they can appear within a word (when the word is a contraction), and they can also serve as punctuation right up against a word on either end. A workable rule is that an apostrophe or prime mark counts as part of a word if and only if it’s sandwiched by alphanumeric characters on both sides (letting the word ’twas become twas, small loss!). At any rate, the more precision matters, and the more diverse the real-world input we’re fielding, the more complex our task becomes.
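The sandwich rule above can be sketched with a regular expression. This is one possible reading of it, in Python, and not a definitive tokenizer: `extract_words`, `WORD_RE`, and the choice of which three marks count as apostrophes are all assumptions of mine.

```python
import re
import unicodedata

# Marks treated as apostrophes (an assumption): straight ', curly ’, prime ′.
APOSTROPHES = "'\u2019\u2032"

# A word is a run of Unicode letters/digits ([^\W_] excludes the underscore),
# optionally continued by an apostrophe that has letters/digits on BOTH sides.
WORD_RE = re.compile(rf"[^\W_]+(?:[{APOSTROPHES}][^\W_]+)*")

def extract_words(text):
    # Compose accents first, so that "i" + combining acute becomes a single "í".
    text = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(text)

print(extract_words("Martínez wouldn’t say ’twas ‘quoted’ in the 1970s"))
# ['Martínez', 'wouldn’t', 'say', 'twas', 'quoted', 'in', 'the', '1970s']
```

With this pattern 4’33" comes out as 4’33 and 50th stays whole, but TRS-80 still splits into TRS and 80, since the hyphen isn't covered by the rule. Whether to admit hyphens the same way is a separate decision.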
EDIT: One tweak I might experiment with: not imposing lowercase from the get-go, but finding unique strings first, and then dropping the capitalized variant of any string that appears in both capitalized and non-capitalized forms… It’s still not perfect, especially for short texts… but you’d be hedging against capitalized versions of many common words (assuming the sample is decently long), and for most names and other proper nouns you’d get only the capitalized appearance… Perhaps AI could give us a compact list of ordinary words that are disproportionately likely to appear only in sentence-initial position, such as nevertheless, however, firstly… Then again, if your subject matter has only limited and predictable use of strings that shouldn’t appear in lowercase, it would indeed be wise to start with @etardiff’s pattern of imposing lowercase, and keep an ad-hoc list of word-strings whose case needs fixing (McCarthy, IBM, VoIP, I [the pronoun]) to apply before the list is rendered…
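Both variants of that tweak are easy to try out. Here is a rough Python sketch, with the caveat that the function names and the `CASE_FIXES` table are hypothetical, and "capitalized" is taken to mean only that the first letter is uppercase.

```python
def drop_capitalized_duplicates(words):
    """Keep unique strings, dropping a capitalized form whenever the same
    string with a lowercase first letter also appears in the input."""
    unique = set(words)
    kept = []
    for w in unique:
        lowered = w[:1].lower() + w[1:]
        if w != lowered and lowered in unique:
            continue  # e.g. "However" is dropped because "however" exists
        kept.append(w)
    # Case-insensitive sort, with the raw string as a tie-breaker.
    return sorted(kept, key=lambda w: (w.lower(), w))

# Ad-hoc table for the lowercase-first approach (illustrative entries only).
CASE_FIXES = {"mccarthy": "McCarthy", "ibm": "IBM", "voip": "VoIP", "i": "I"}

def restore_case(lowercased_words, fixes=CASE_FIXES):
    """Put back the preferred casing for the listed words; leave the rest."""
    return [fixes.get(w, w) for w in lowercased_words]

print(drop_capitalized_duplicates(
    ["However", "however", "Martínez", "The", "the", "IBM", "cat"]))
# ['cat', 'however', 'IBM', 'Martínez', 'the']
print(restore_case(["ibm", "voip", "i", "word"]))
# ['IBM', 'VoIP', 'I', 'word']
```

The first function only helps when the lowercase form actually occurs somewhere in the text, which is why short samples remain a weak spot.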