Identifying proper nouns

TW_Tones,

I have to apologise to you for my hasty remarks regarding your stop words suggestion.

After much additional work on my tool, I can now see how stop words will be useful provided that, as you intimated, they are built up over time across the documents in the document set. Also, the stop words appear to be single words only, so if ‘And’ is a stop word it can still be part of a proper noun phrase such as ‘And Bob’ (the example is nonsensical, but you catch my drift).

I now think I also need to save ‘go words’ between documents: the strings I have accepted as proper nouns, so that in subsequent documents they do not need to be reprocessed, having already been approved in an earlier one. This requires some more thought on my part, though.
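A minimal sketch of how go words might be carried from one document to the next, assuming they are kept in a plain JSON file. The function names (`load_words`, `save_words`, `triage`) are hypothetical, written in Python for illustration only, and are not how the Xojo tool does it.

```python
import json
from pathlib import Path


def load_words(path):
    """Load a saved word list, or an empty set if the file does not exist yet."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text(encoding="utf-8")))


def save_words(path, words):
    """Save the word list as a sorted JSON array so it is easy to inspect by hand."""
    Path(path).write_text(
        json.dumps(sorted(words), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )


def triage(candidates, go_words, stop_words):
    """Split candidates into (already approved, still needs review).

    Stop words only filter single words, so a phrase such as
    'And Bob' still goes to review even if 'and' is a stop word.
    """
    approved, review = [], []
    for candidate in candidates:
        if candidate in go_words:
            approved.append(candidate)
        elif " " not in candidate and candidate.lower() in stop_words:
            continue  # a lone stop word is dropped
        else:
            review.append(candidate)
    return approved, review
```

After each document, anything accepted from the review list would be added to the go words and saved, so the review list should shrink as more documents are processed.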

bobj

No problem, @Bob_Jansen. Perhaps I think of stop words differently: I mean a list of publicly known stop words such as “the, and, it”.

Warning! Grammar has never been my strong point, but with age, I hope, comes wisdom :nerd_face:

I like your term “go words”. I would think of these as words you have nominated, or perhaps even “go phrases”.

  • I am not saying such go words or phrases could not include stop words, but when you remove common words, verbs and common nouns, a smaller list remains that may relate to a proper noun.
    • Let's highlight these “special words” in context and provide a tool to register them as “go words or phrases”. The simplest method is to create a tiddler title.

For example, let's say we remove common words and find one in the text, “Galactic”. It may be next to a possible verb, “Federation”, and when we look at the actual text we see “This was when ‘The Galactic Federation’ was born”. The capitalisation is also a giveaway. Our attention has been focused here because the word Galactic was identified. We select “The Galactic Federation” and create a tiddler, perhaps indicating it is a proper noun.

  • Any future reference to The Galactic Federation will be highlighted and linked automatically if you use free links.
  • Most content will have a limited number of proper nouns, and if your attention is drawn to them you can systematically build a list of those in the content, create tiddlers for them, and even store more info in them, like a glossary.
  • You could even collate all proper nouns used across multiple wikis.
  • Perhaps you could highlight only “special words” with mid-sentence capitalisation.

By taking an approach like this, there is value in identifying potential proper nouns, then selecting the actual proper nouns, and having the opportunity to provide more information, eg for “The Galactic Federation”: “headquarters on Earth”. Notice we just found another? “Earth”.

  • Using some other ideas, rather than creating a tiddler you can just link it, eg [[Earth]]; it becomes a missing tiddler, which is something you can list or retrieve.
  • There is other value in seeing “special words” in your text, because it helps you draw inferences while you are highlighting proper nouns.
  • The process is cumulative and the exceptions decrease rapidly over time.
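The approach above can be sketched roughly as follows, written in Python for illustration only. It assumes a tiny illustrative stop word list, a naive sentence splitter, and a hypothetical function name `candidate_phrases`. It collects runs of capitalised words and ignores a lone capitalised word when it starts a sentence or is a stop word.

```python
import re

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"the", "and", "it", "this", "was", "when"}

# A word is a letter followed by letters, apostrophes or hyphens.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Return capitalised runs that may be proper nouns, in order of first appearance."""
    found = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = WORD_RE.findall(sentence)
        run, start = [], 0
        # The trailing "" is a sentinel that flushes a run ending the sentence.
        for index, word in enumerate(words + [""]):
            if word[:1].isupper():
                if not run:
                    start = index
                run.append(word)
                continue
            if run:
                single = len(run) == 1
                # Skip a lone word that opens the sentence or is a stop word.
                if not (single and (start == 0 or run[0].lower() in stop_words)):
                    phrase = " ".join(run)
                    if phrase not in found:
                        found.append(phrase)
                run = []
    return found


text = "This was when The Galactic Federation was born. Its headquarters are on Earth."
print(candidate_phrases(text))  # ['The Galactic Federation', 'Earth']
```

This is deliberately crude: a multi-word run at the start of a sentence (eg “When Bob”) would still be offered as a candidate, which is where the manual selection step comes in.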

See a discussion I started some time ago that explored some advanced ideas we could implement in TiddlyWiki: “Wordsmithing stop words, verbs past and present tense” (#2 by TW_Tones).

This subject has led me back to some prior discussions that are all related in a broader sense, including “Measure of order in a tiddlywiki”.

Very thought provoking, @TW_Tones. I like the idea of having the ‘tags’ created as tiddlers and then being automatically identified by TW as links. Must play around with this.

bobj

FYI: I am building some word lists and need some regular expression help (see “How to extract words with regular expressions”). I am using this to analyse the tiddlywiki.com documentation, ie to split the documentation into words.
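For illustration, here is one way the word splitting could be sketched in Python. It is not the expression from the linked thread. The pattern and the function names (`extract_words`, `word_counts`) are assumptions, and the pattern treats a word as letters optionally joined by apostrophes or hyphens.

```python
import re
from collections import Counter

# [^\W\d_] matches any Unicode letter; inner apostrophes and hyphens are kept.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def extract_words(text):
    """Return the words of the text in order, ignoring digits and punctuation."""
    return WORD_RE.findall(text)


def word_counts(text):
    """Count words case-insensitively, eg as a starting point for a stop word list."""
    return Counter(word.lower() for word in extract_words(text))


sample = "Stop-words aren't links: [[Earth]] and Earth."
print(extract_words(sample))
# ['Stop-words', "aren't", 'links', 'Earth', 'and', 'Earth']
```

The most frequent entries from `word_counts` over a large body of text would be natural candidates for a stop word list.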

Just a final message regarding this issue.

I have completed my Xojo app which parses text documents for proper nouns and proper noun phrases. I have tried it on a few of my documents and it seems to work well, not 100% but well enough.

I have also added a facility to highlight words that I need to manually inspect, in my case the term ‘exhibition’ which may form part of an exhibition title but not be recognised as a proper noun phrase, for example, ‘21st exhibition of Paper works’. This also works well.
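A rough sketch of such a watch word facility, in Python for illustration only; the Xojo implementation is not shown in this thread. The function name `flag_watch_words` and the context width are assumptions. It returns each hit with some surrounding text so a person can decide whether it belongs to a title.

```python
import re


def flag_watch_words(text, watch_words, context=30):
    """Return (matched word, surrounding text) for each watch word found."""
    if not watch_words:
        return []
    # Whole-word, case-insensitive match on any of the watch words.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, watch_words)) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append((match.group(0), text[start:end]))
    return hits


text = "She showed at the 21st exhibition of Paper works in 1990."
for word, snippet in flag_watch_words(text, ["exhibition"]):
    print(word, "->", snippet)
```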

I do not own a Xojo license for producing a compiled app, so I cannot share a copy of the app. I use the IDE to run a debug session so that I can use the app from its source code (this is free).

Many thanks to all who contributed to the discussion.

bobj

If it is possible to share your word lists and details, we could look at implementing this in TiddlyWiki, perhaps as pseudocode and steps.

Up to you.

Happy to provide the Xojo project file. All you would need to do is install the free Xojo IDE and then you could run the project.

bobj

The app is now generating JSON files, so loading the content into tiddlers has been simplified. All content preparation is done outside of TW, which makes things quicker for me.
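As an illustration of the kind of file involved (the app's actual output is not shown here), TiddlyWiki can import a JSON file containing an array of tiddler objects whose fields are strings. The sketch below is Python, with hypothetical function names and an assumed “Proper Noun” tag.

```python
import json


def tag_string(tags):
    """Join tags the TiddlyWiki way: tags containing spaces are wrapped in [[ ]]."""
    return " ".join(f"[[{tag}]]" if " " in tag else tag for tag in tags)


def make_tiddlers(phrases, tag="Proper Noun"):
    """Build one tiddler per phrase; the tag name is an assumption for this example."""
    return [
        {"title": phrase, "text": "", "tags": tag_string([tag])}
        for phrase in phrases
    ]


def tiddlers_to_json(tiddlers):
    """Serialise the tiddlers as a JSON array ready to be imported."""
    return json.dumps(tiddlers, ensure_ascii=False, indent=2)


print(tiddlers_to_json(make_tiddlers(["The Galactic Federation", "Earth"])))
```

The resulting text would be saved to a `.json` file and imported into the wiki, eg by dragging the file onto the browser window.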

Thinking about this, it might be a useful addition to TW: structure the tiddlers in TW, but use the app to generate the content at the tiddler level and then bulk load it into the TW file.

Thanks to all who have contributed to this development.

bobj