Identifying proper nouns

Bob_Jansen · January 26, 2024, 6:32am

Not exactly a TW issue but I need to do this in my TW wiki as I create tags for each proper noun in a tiddler’s text (displayed PDF file in my case).

Anyone got any suggestions for how to identify proper nouns in a chunk of text? I have been racking my brain for several days now and have a simple case working, but it quickly becomes intractable as complexity increases. I feel that the ‘meaning’ of the string is required, and obviously my code does not understand the meaning at all.

bobj

TW_Tones · January 26, 2024, 7:00am

I have looked into this kind of word handling: obtain a list of stop words, then lists of verbs and common nouns, and perhaps “proper nouns” make up much of what is left over.

One way to keep things controllable is to make a single tiddler text parser that analyses each paragraph and all its words and proposes, in a tiddler footer, what they are in the current tiddler only. You could then click on those you want to define permanently, creating a tiddler marked proper noun etc., and perhaps even drag and drop words between one or more classifications. Once done, a word’s meaning is fixed by a tiddler with a tag or field indicating it. E.g. add additional stop words.

One reason such an approach may be much more effective than you expect is the way some words occur very often and most words occur rarely.

Each time you nominate words as belonging to, say, proper nouns, they are dealt with should they appear again. You could keep these updated lists in a subject area wiki or share the list with other wikis and accumulate them. You could even make the lists first a plugin, then capture new and edited ones as tiddlers.

Along with the freelinks plugin, used either for analysis or permanently, these words will also be highlighted as having tiddlers, and those tiddlers could use backlinks to list where they appear, if we find a solution like the one mentioned recently.
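
For illustration, a minimal JavaScript sketch of the elimination idea described above. The word lists here are tiny made-up stand-ins (real ones would be far larger and might live in data tiddlers or a plugin), and `leftoverWords` is a hypothetical name, not part of TiddlyWiki or any existing plugin.

```javascript
// Assumption: tiny illustrative lists; real ones would be much larger.
const stopWords = new Set(['the', 'a', 'of', 'to', 'in', 'on', 'and', 'who', 'is']);
const knownWords = new Set(['people', 'went']);      // verbs, common nouns, etc.
const confirmedProperNouns = new Set(['Sydney']);    // grows as the user confirms candidates

// Return the words that none of the lists account for, in order of first appearance.
const leftoverWords = (text) => {
  const words = text.match(/\p{L}+/gu) || [];
  const leftovers = words.filter((w) =>
    !stopWords.has(w.toLowerCase()) &&
    !knownWords.has(w.toLowerCase()) &&
    !confirmedProperNouns.has(w));
  return [...new Set(leftovers)];
};

console.log(leftoverWords('The people who went to Trumper Park in Sydney'));
// [ 'Trumper', 'Park' ]
```

Each word the user confirms would be added to one of the sets, so the leftover list should shrink on later runs.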

Scott_Sauyet · January 27, 2024, 1:30am

Overall, it sounds like an impossible problem to get perfect. In well-written English, though, you can make a partial stab at this by saying that any run of capitalized words, not including the first word in a sentence and not interrupted by any punctuation, is likely a proper noun.

But the trouble remains with the first word in a sentence:

Autumn and I made a bet on the winner of the next election.
Autumn is election season in my country.

The first “Autumn” is a proper noun; the second is not. I don’t know of any way to solve that without attempting a deep semantic analysis of the sentence. And that sounds challenging, to say the least. Even slightly flowery prose will probably give you false positives:

Autumn and I never really get along; I prefer spring, summer, and, yes, even winter.

I wish I had better news.
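
A rough sketch of the capitalized-run heuristic described above, in JavaScript. The sentence splitter is a naive assumption (it breaks on abbreviations such as "Dr."), the function names are made up for illustration, and any run that begins a sentence is dropped whole, so this errs toward false negatives.

```javascript
// Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
const splitSentences = (text) =>
  text.split(/(?<=[.!?])\s+/).map((s) => s.trim()).filter(Boolean);

// Runs of capitalized words separated only by single spaces,
// skipping any run that starts at the very beginning of the sentence.
const midSentenceRuns = (sentence) =>
  [...sentence.matchAll(/\b[A-Z]\w+(?: [A-Z]\w+)*/g)]
    .filter((m) => m.index > 0)
    .map((m) => m[0]);

const properNounCandidates = (text) =>
  [...new Set(splitSentences(text).flatMap(midSentenceRuns))];

console.log(properNounCandidates(
  'Autumn is election season in my country. We met in Trumper Park, near Double Bay.'
));
// [ 'Trumper Park', 'Double Bay' ]
```

Both "Autumn" sentences above yield nothing at all, which illustrates the problem: the heuristic cannot tell the person from the season, so it simply ignores the sentence-initial word.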

Springer · January 27, 2024, 1:49am

So, the solution won’t be complete, but isn’t there regexp for isolating what’s at the start of a line / sentence / quote / parenthetical remark — that is, what comes immediately after white space or quote marks or opening parens, etc.?

Assuming that false negatives are preferable to false positives, it seems not too difficult to grab the name-like strings that don’t seem to be in sentence-initial positions…

The other confounding factor not yet mentioned, though, is title strings.

Titles of books, periodicals, etc., will give you false positives even on the approach I’m suggesting, unless you have some way of using semantic tags to bracket those title strings.

Depending on what kind of text you’re working through, this could be a big factor, or irrelevant…

Alas, more niggly work!

Scott_Sauyet · January 27, 2024, 2:03am

We could certainly write something – regex or not – to capture the sorts of things we’re discussing. The problem is figuring out the rules.

I’m not quite sure which direction you’re using as positive. But I think of titles as proper nouns. However this does make me think of another ambiguity:

The article about this in the latest New Yorker is astounding!
Oh, she's a real New Yorker, through and through.

I think to get at this at a formal level requires a deep semantic understanding, Chomsky be damned.

This is a hard problem.

Springer · January 27, 2024, 2:24am

Titles when bundled as a whole might need to count (and presumably multiple capitalized words in a row are grouped as all one proper-noun-string). But if @Bob_Jansen is looking to index / automate tag-production for each proper noun within a long text, presumably it would not be ideal to have a tag automatically generated for Caged Bird Sings just because the title “I Know Why the Caged Bird Sings” has a lower case word in the middle of the string, resetting any automatic search for capitals that appear “mid-sentence”.

Maybe that’s where @TW_Tones was already going about needing to incorporate a list of stop-words, so that what comes after them is treated as potentially part of a title… But then you’d fail to catch the proper nouns that just happen to come after stop words, as within “being reminded of Autumn and how she had been right about the Haley campaign”…

So, I think this solution would need to count on the source text properly putting titles within semantic markers — quote strings or html tags, etc.
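
If the source text does mark titles, one possible sketch (in JavaScript) is to lift the quoted strings out first and scan only what remains. This assumes titles are wrapped in curly or straight double quotes; `extractQuotedTitles` and the `[title]` placeholder are hypothetical choices for illustration.

```javascript
// Assumption: titles are wrapped in curly or straight double quotes.
const QUOTED = /“([^”]+)”|"([^"]+)"/g;

const extractQuotedTitles = (text) => {
  // Keep each quoted string whole, so "Caged Bird Sings" is never split off.
  const titles = [...text.matchAll(QUOTED)].map((m) => m[1] ?? m[2]);
  // Replace each quoted span with a placeholder so its words are not scanned again.
  const remainder = text.replace(QUOTED, '[title]');
  return { titles, remainder };
};

const { titles, remainder } = extractQuotedTitles(
  'She read “I Know Why the Caged Bird Sings” to Autumn last year.'
);
console.log(titles);    // [ 'I Know Why the Caged Bird Sings' ]
console.log(remainder); // She read [title] to Autumn last year.
```

Any capitalized-word search would then run on `remainder` only. Quotation marks used for other purposes would still be swallowed as if they were titles.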

Scott_Sauyet · January 27, 2024, 2:43am

Oh, yes, that’s still more complication.

I think if we had a marker already, the problem would solve itself. We could easily add an editor button that wraps the selected text in a <proper-noun>Autumn</proper-noun> tag, and then extract them later. Using other markers such as quotation marks or <em></em> tags might help us identify candidates, but wouldn’t help us distinguish, say,

The IT department has made a commitment to move off "Ab Initio" and onto "AWS Glue" by the end of 
next year.

from

We've already learned that "Ab Initio" means "from the outset", but how is this applied in 
literature, science, and the law?
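
For comparison, extracting names that have been explicitly marked really is the easy part. A minimal JavaScript sketch, assuming the hypothetical <proper-noun> tag mentioned above (it is not an existing TiddlyWiki widget or HTML element) and a made-up function name:

```javascript
// Assumption: names were wrapped by hand in the hypothetical <proper-noun> tag.
const markedProperNouns = (text) =>
  [...new Set(
    [...text.matchAll(/<proper-noun>(.*?)<\/proper-noun>/gs)].map((m) => m[1].trim())
  )].sort();

console.log(markedProperNouns(
  '<proper-noun>Autumn</proper-noun> and I moved off <proper-noun>Ab Initio</proper-noun>. ' +
  '<proper-noun>Autumn</proper-noun> approved.'
));
// [ 'Ab Initio', 'Autumn' ]
```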

I think the OP had it right:

Maybe one of the AI models out there could make a reasonable stab at this. That’s what the recent crop of them are programmed for, but I doubt we’d want to try to run a Large Language Model implementation inside TiddlyWiki.

TW_Tones · January 27, 2024, 6:47am

The smartest thing in a car is still the nut behind the wheel. Similarly with computers: even the dumbest user is smarter than the computer by orders of magnitude.

When you eliminate or highlight verbs, stop words and common nouns, then give the user some tools to nominate proper nouns and other elements, there is not that much left.

So do not trust me; I was a phonetic speller as a kid.

Bob_Jansen · January 31, 2024, 3:43am

Thank you all for your thoughts and comments. You have reinforced my opinion: it is an almost impossible task except in special, simple circumstances. To do it properly requires meaning/understanding, which is way beyond the state of today’s processing.

TW_Tones · January 31, 2024, 5:07am

Funny. I conclude the opposite through semi-automation. I will keep this in mind and return if I address this.

For analysis purposes you could install a large library of word lists in a plugin, save the detected ones, and remove the plugin when done.

Scribs · February 5, 2024, 8:11pm

maybe with a robust implementation of llm/ai in tw this could be a possibility in the future:

although on repeat trials it looks like sometimes copilot gets confused as well:

Scott_Sauyet · February 5, 2024, 8:46pm

That scares the heck out of me!

TW_Tones · February 6, 2024, 2:28am

Perhaps “not in”, but “alongside TiddlyWiki”, with the tools to interact.

It is great when tools are effectively linked, but loosely linked, allowing any component to innovate independently.

Scribs · February 6, 2024, 1:47pm

of course it would be an optional plugin with plenty of customization. but i guess that’s a topic for another thread : )

TW_Tones · February 6, 2024, 9:23pm

Yes, on topic this loosely coupled idea was mentioned by me earlier where “word list plugins” are available to eliminate most words that are not proper nouns, thus allowing us to highlight and select proper nouns.

Once done, remove the word lists if desired.

I imagine that on a server implementation a word list could be a data tiddler that is also a skinny tiddler, so it is only loaded when needed?

Scott_Sauyet · February 14, 2024, 4:26am

Aha, that explains another thread. (Edit: This topic was moved from Searching/indexing web site/page. The current topic is the one described as “another thread”.)

I have no idea if this would be of any help to you, but I just tried to write something that – with many false positives, and probably a few false negatives – identifies proper nouns in a document. I ran Tony McGillick Painter through pdf2go.com’s PDF → Text converter, and then fed the resulting text through this JS function:

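const properNames = pdf => [... new Set(
  // Collapse whitespace, then split into sentences: text up to a sentence
  // terminator plus any closing quotes/brackets (or up to the end of input)
  pdf.replace(/\s+/g, ' ').match(/(?=[^])(?:\P{Sentence_Terminal}|\p{Sentence_Terminal}(?!['"`\p{Close_Punctuation}\p{Final_Punctuation}\s]))*(?:\p{Sentence_Terminal}+['"`\p{Close_Punctuation}\p{Final_Punctuation}]*|$)/guy)
    .map(s => s.trim())
    .filter(Boolean)
    .flatMap(s => [
      // At the start of a sentence: only runs of two or more capitalized words
      ... (s.match(/^(([A-Z]\w+\s)+){2,}/g) || []),
      // Elsewhere: any run of capitalized words, each preceded by whitespace
      ... (s.match(/((?:\s)[A-Z]\w+)+/g) || []),
    ])
    .map(s => s.trim())
)].sort()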

to get this result:

[
  "At Redleaf Pool", "Aussie Rules", "Botanic Gardens", "But", "Christmas", "Communists", 
  "Cross", "Domain", "Double Bay", "Edgecliff", "Friday", "HEN", "In", "Italian", "Italians",
  "Jewish", "Kings Cross", "Macleay Street", "Manchester", "McGILLICK Painter", "Melbourne", 
  "Moore Park", "Paris", "Park", "Pool", "Redleaf Pool", "Saturday", "Sunday", "Sydney", "That", 
  "The Cross", "They", "Tony McGILLICK Painter", "Trumper", "Trumper Park", "Vaucluse"
]

I’m sure we could revise the regexes to reduce the false positives.

Bob_Jansen · February 14, 2024, 6:15am

Scott

I wrote an app (using Xojo) to do this also and was able to extract all words and most phrases containing capital letters. Makes things easier but when I went to add the tags to the tiddler I found I still needed to see the context of the word or phrase to make sure I got things right. This meant the extraction first saved little to no time/effort.

Might do some more playing though.

Found that TW_Tones’ suggestion of stop/go words would not be very efficient, as I would miss many terms that are also stop terms. Context, and hence meaning, are crucial.

Bobj

TW_Tones · February 14, 2024, 7:26am

not sure what this had to do with this topic?

My suggestion was to use stop terms, and to eliminate common nouns, verbs and other word sets, identifying the remainder as a possible way of finding proper nouns.
However you identify text which may contain such, you can still use a preview or snippet of the text and make a method to select and save a set of words, or even create a matching tiddler.

My idea is that you progressively turn the things you wish to refer to into tiddlers, and thus accrue a large set of tiddlers that the freelinks plugin can then highlight.

Scott_Sauyet · February 14, 2024, 2:09pm

That’s probably my fault. I responded to the wrong topic. When I participated in the other thread about finding proper nouns, I misunderstood something fundamental. I thought the goal was to find an automated, repeatable way to recognize proper nouns in a string of text, presumably a tiddler. This topic (edit: that is, Searching/indexing web site/page) made it clear that this is for a one-off operation, an attempt to simplify some manual conversion effort. So having a number of false positives, if they’re not overwhelming, would probably not be an issue. And that leads to more possibilities.

I will try to move these posts to that topic.

I find this an interesting problem. I don’t know if I’ll find time, but if I do, would a version that gave results like the following be helpful?

{
  /* ... */
  "Botanic Gardens": [
    "make me that survive, the Botanic Gardens, the Domain, Trumper Park, Moore Park: they are the "
  ],
  /* ... */
  "But": [
    "notice it at the time, but in retrospect it seems to have been happy. It is",
    "But it wasn’t really the football, it was the atmosphere",
    "prepare from Friday night onwards, but going to Trumper is easy.",
    /* ... */
  ],
  /* ... */
  "Trumper Park": [
    "Trumper Park has been persistent in my life.",
    "the people who went to Trumper Park on Sunday",
  ]
  /* ... */
}

(or perhaps with the potential proper noun in ALL CAPS?)
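
A rough JavaScript sketch of how output in that shape might be produced. This is an illustration rather than the version proposed above: `contexts` and its `radius` parameter (characters of context on each side) are hypothetical, and the plain substring search will also hit longer words that merely contain a candidate.

```javascript
// For each candidate, collect the text surrounding every occurrence.
// Assumption: a fixed character radius is good enough for a quick review.
const contexts = (text, candidates, radius = 30) => {
  const flat = text.replace(/\s+/g, ' ');
  const result = {};
  for (const name of candidates) {
    result[name] = [];
    let at = flat.indexOf(name);
    while (at !== -1) {
      const start = Math.max(0, at - radius);
      const end = Math.min(flat.length, at + name.length + radius);
      result[name].push(flat.slice(start, end).trim());
      at = flat.indexOf(name, at + name.length);
    }
  }
  return result;
};

console.log(contexts(
  'Trumper Park has been persistent in my life. But going to Trumper is easy.',
  ['Trumper Park'],
  10
));
// { 'Trumper Park': [ 'Trumper Park has been' ] }
```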

The usefulness of this is directly proportional to the usefulness of the underlying PDF → text conversion, which for this document is not great.

But if this might be helpful, I’ll see if I can find the time. As I said, I find it an interesting problem.

Bob_Jansen · February 15, 2024, 4:48am

Scott,

thanks for all your input, ideas and thoughts. It is greatly appreciated.

As I said before, I have a Xojo app (xojo.com is a flexible programming IDE that allows for compilation to OS X, Windows, iOS and Android, all from the same source code) that does something similar. It already provides the context that you are proposing, so, unless you are very motivated, I would not suggest you begin on this work.

I need to think all this through in more detail, especially as I am thinking TW might not be the most favourable tool to provide access to the archive’s documents. I can see this as an excellent FileMaker Pro web site, but the problem here is the cost of hosting the FileMaker solution.

I am waiting for Google to index the site so I can play around with Google site searching but it seems to be taking some time.

bobj