Aha, that explains another thread. (Edit: This topic was moved from Searching/indexing web site/page. The current topic is the one described as “another thread”.)
I have no idea if this would be of any help to you, but I just tried to write something that – with many false positives, and probably a few false negatives – identifies proper nouns in a document. I ran Tony McGillick Painter through pdf2go.com’s PDF → Text converter, and then fed the resulting text through this JS function:
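const properNames = pdf => [... new Set(
  // Collapse runs of whitespace, then split the text into sentences.
  // The `|| []` guards against match() returning null on empty input.
  (pdf.replace(/\s+/g, ' ').match(/(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy) || [])
    .map(s => s.trim())
    .filter(Boolean)
    .flatMap(s => [
      // Two or more capitalised words at the very start of a sentence.
      ... (s.match(/^(([A-Z]\w+\s)+){2,}/g) || []),
      // Runs of capitalised words, each preceded by whitespace
      // (so never the first word of a sentence).
      ... (s.match(/((?:\s)[A-Z]\w+)+/g) || []),
    ])
    .map(s => s.trim())
)].sort()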
to get this result:
[
"At Redleaf Pool", "Aussie Rules", "Botanic Gardens", "But", "Christmas", "Communists",
"Cross", "Domain", "Double Bay", "Edgecliff", "Friday", "HEN", "In", "Italian", "Italians",
"Jewish", "Kings Cross", "Macleay Street", "Manchester", "McGILLICK Painter", "Melbourne",
"Moore Park", "Paris", "Park", "Pool", "Redleaf Pool", "Saturday", "Sunday", "Sydney", "That",
"The Cross", "They", "Tony McGILLICK Painter", "Trumper", "Trumper Park", "Vaucluse"
]
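In case the long first regex looks opaque: as far as I can tell, all it does is cut the whitespace-normalised text into sentences, so that the two shorter regexes can treat the first word of each sentence differently. Here is that step on its own as a rough sketch. The helper name `splitSentences` and the sample text are made up for illustration; only the regex comes from the function above.

```javascript
// The sentence-matching regex from properNames, copied unchanged.
const SENTENCE = /(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy;

// splitSentences is a hypothetical helper name, not part of the original function.
const splitSentences = text =>
  (text.replace(/\s+/g, ' ').match(SENTENCE) || []) // null when nothing matches
    .map(s => s.trim())
    .filter(Boolean);

// Made-up sample text, with irregular whitespace on purpose.
console.log(splitSentences('He swam at Redleaf Pool.  Then he walked\nto Kings Cross! Why not?'));
// → [ 'He swam at Redleaf Pool.', 'Then he walked to Kings Cross!', 'Why not?' ]
```

Note that any full stop followed by a space ends a sentence here, so abbreviations such as "St." would be split as well.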
I’m sure we could revise the regexes to reduce the false positives.
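One way to cut the false positives without touching the regexes at all would be a post-filter on the result. This is only a sketch: `dropStopWords` and the `STOP_WORDS` list are names and contents I am assuming for illustration, and the list would need tuning for any real document.

```javascript
// Hypothetical stop list: common words that tend to be capitalised only
// because they open a sentence. Extend it to suit the document.
const STOP_WORDS = new Set(['At', 'But', 'In', 'That', 'They']);

// Strip leading stop words from each candidate, drop candidates that end up
// empty, then de-duplicate and sort the same way properNames does.
const dropStopWords = (names, stopWords = STOP_WORDS) => [... new Set(
  names
    .map(name => {
      const words = name.split(' ');
      while (words.length > 0 && stopWords.has(words[0])) words.shift();
      return words.join(' ');
    })
    .filter(Boolean)
)].sort();

console.log(dropStopWords(['At Redleaf Pool', 'But', 'In', 'Redleaf Pool', 'Sydney', 'That', 'They']));
// → [ 'Redleaf Pool', 'Sydney' ]
```

I left "The" out of the list on purpose, so that "The Cross" survives. Fragments such as "Cross", "Park" and "Pool" would still need a different rule.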