Best systemic solution for name order variations — a new "compare" type?

TW_Tones · December 16, 2023, 4:13am

I get it.
I think perhaps you misunderstand me.
I am talking about how we achieve it, to put to whatever purpose you want. Whether you commit the normalised data to a tiddler or just re-interpret it every time is up to you; you still need similar algorithms.
If the input adheres to one or more standards that you can tell apart, you are good to go.

Me too.

Yes, and with both set multiple variables and action set multiple fields you can commit with a click.

Again totally agree. My emphasis is to craft the solution to handle either or both as needed. Normalisation can just be in memory.

Notwithstanding the above, if you are committing your normalisation of the input to storage, you do it once and need not do it again. If the data comes from an external source, there is a choice to be made that depends on the use case.

Scott_Sauyet · December 17, 2023, 5:42am

I wonder if something like this would help. We could create a function that accepts two names, does a fair bit of normalization of each of them, and then computes a similarity score.

Then on import, we can offer as options for normalization all names in the wiki above a certain threshold, which the user can accept or choose to ignore. I coded a very naïve implementation, which will give us values like this:

"Simone de Beauvoir" is a 100% match with "Beauvoir, Simone de"
"Simone de Beauvoir" is a 100% match with "de Beauvoir, Simone"
"Márquez, Gabriel García" is a 100% match with "Gabriel Garcia Marquez"
"Mills, Charles W." is a 90.25% match with "C Wright Mills"
"Nussbaum, Martha C." is a 90% match with "Nussbaum, Martha"
"Martha Nussbaum" is a 95% match with "M Nussbaum"
"Martha Nussbaum" is a 85.5% match with "M C Nussbaum"
"Martha C. Nussbaum" is a 95% match with "Martha Craven Nussbaum"
"Martha Nussbaum" is a 90% match with "Martha H Nussbaum"
"Friedrich Nietzsche" is a 90% match with "Nietzsche"
"T. S. Eliot" is a 90.25% match with "Eliot, Thomas Stearns"
"Martha C. Nussbaum" is a 50% match with "Martha H Nussbaum"
"Martha C. Nussbaum" is a 47.5% match with "M.H. Nussbaum"
"J.R.R. Tolkein" is a 25% match with "G.R.R. Martin"
"Prince" is a 40.5% match with "Prince Rogers Nelson"
"Beyoncé Giselle Knowles-Carter" is a 36.45% match with "Beyonce"

This is written in JS, and might not be easy to move to wikitext, but as I think it would make the most sense for this to end up as an operator, that doesn’t bother me much.

There are likely to be many adjustments needed to make this hit your 98% target, but I think it could get close, with a threshold value of 50% match.

It is biased toward last names, so “Nietzsche” is closer to “Friedrich Nietzsche” than “Prince” is to “Prince Rogers Nelson”. There are certain potential problems. Right now “Van Halen” and “Eddie Van Halen” are only a 45% match, which seems too low. And totally unrelated names like “Scott Sauyet” and “Elise Springer” are a 25% match, which seems too high. But neither seems a show-stopper.
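
To make the idea concrete, here is a minimal sketch of this kind of scorer. It is not the implementation that produced the numbers above and will not reproduce them. The function names (`normalizeName`, `similarity`) and the 60/40 weighting toward the last name are assumptions chosen only for illustration.

```javascript
// Hypothetical sketch: NOT the implementation behind the scores listed above.
const stripDiacritics = (s) =>
  s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");

// Turn "Beauvoir, Simone de" or "Simone de Beauvoir" into lower-case tokens
// in "given ... family" order. Only the first comma is considered.
const normalizeName = (name) => {
  const [beforeComma, afterComma] = stripDiacritics(name).toLowerCase().split(",");
  const ordered =
    afterComma === undefined ? beforeComma : `${afterComma} ${beforeComma}`;
  return ordered.replace(/\./g, " ").split(/\s+/).filter(Boolean);
};

// Two tokens agree if they are equal, or if one is an initial of the other.
const tokensAgree = (a, b) =>
  a === b || ((a.length === 1 || b.length === 1) && a[0] === b[0]);

// Returns an integer percentage; the 60/40 split is an arbitrary assumption.
const similarity = (nameA, nameB) => {
  const a = normalizeName(nameA);
  const b = normalizeName(nameB);
  if (a.length === 0 || b.length === 0) return 0;
  const familyScore = a[a.length - 1] === b[b.length - 1] ? 1 : 0;
  const givenA = a.slice(0, -1);
  const givenB = b.slice(0, -1);
  const longest = Math.max(givenA.length, givenB.length);
  const [shorter, longer] =
    givenA.length <= givenB.length ? [givenA, givenB] : [givenB, givenA];
  const matched = shorter.filter((t) =>
    longer.some((u) => tokensAgree(t, u))
  ).length;
  const givenScore = longest === 0 ? 1 : matched / longest;
  return Math.round(60 * familyScore + 40 * givenScore);
};
```

With these weights, a bare last name still scores 60 against the full name, while a bare first name scores 0 against a name with a different last name, which is the same last-name bias described above in a much cruder form.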

If this seems like a promising approach, I can explain the implementation more completely.

pmario · December 17, 2023, 11:13am

Just some thoughts

I think every “naive” algorithm has to fall short. IMO eg: “M Nussbaum” is not enough information to create a unique reference.

So I was searching for “someone” who should / could know the different variations that make sense and has an API, that we are allowed to use. I did find openlibrary.org and some others

The datasets can be queried with HTML output or as JSON data. See the “Martha Nussbaum” link below.

Such a query returns “alternate_names” field, which we could directly use.
It also contains a unique key - eg: OL262048A which can be used to gather more info. Like “remote_ids”, which lead to even more data eg: wikidata, VIAF and ISNI …

According to openlibrary.org, “M. Nussbaum” returns 8 data-sets with several different authors.
Whereas Martha Nussbaum returns 1 author with 7/8 alternative names associated to different publications.

So, for what you describe in the OP, the first “query operation” would be enough to get the alternate_names, which would then be searchable in TW.

If we also stored the unique key as, e.g., olib_author_key, it would be relatively straightforward to gather more info later in the process.
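
As a rough sketch of that first query step: the code below assumes the author search endpoint lives at `https://openlibrary.org/search/authors.json` and returns a `docs` array whose entries carry `key`, `name` and `alternate_names`. That shape is an assumption to be checked against the Open Library API documentation, and the sample object is illustrative, not real API output. `extractAuthors` and `searchAuthors` are hypothetical helper names.

```javascript
// Assumption: endpoint and response shape ("docs" with "key", "name",
// "alternate_names") should be verified against the Open Library API docs.
const OPEN_LIBRARY_AUTHOR_SEARCH = "https://openlibrary.org/search/authors.json";

// Pure helper: reduce a search response to the fields we would store in TW.
// "olib_author_key" is the field name suggested in the post above.
const extractAuthors = (response) =>
  (response.docs ?? []).map((doc) => ({
    olib_author_key: doc.key,
    name: doc.name,
    alternate_names: doc.alternate_names ?? [],
  }));

// Network wrapper (not executed here); needs a runtime that provides fetch.
const searchAuthors = async (query) => {
  const url = `${OPEN_LIBRARY_AUTHOR_SEARCH}?q=${encodeURIComponent(query)}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Open Library request failed: ${res.status}`);
  return extractAuthors(await res.json());
};

// Illustrative response only, NOT real API output.
const sampleResponse = {
  docs: [
    {
      key: "OL262048A",
      name: "Martha Nussbaum",
      alternate_names: ["Martha C. Nussbaum", "Martha Craven Nussbaum"],
    },
  ],
};

const sampleAuthors = extractAuthors(sampleResponse);
```

Keeping the parsing in a pure function means the result can be turned into tiddler fields without caring where the JSON came from, for example a cached file instead of a live request.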

Virtual Tiddler: An Issue or A Feature

opened 04:01PM - 31 Oct 22 UTC

kookma

### Describe the bug
Using Tiddlywiki 5.2.3 I can create a virtual tiddler. …
* A virtual tiddler is not a physical tiddler
* It is not existed in TW file saved on disk
* It does not appear in any search result
* BUT it can be viewed in Story River.

I discovered this feature during development of my new plugin called TiddlyWiki Garden which is a playground to learn Filters and CSS.

### Expected behavior
* I like this behavior
* I call this a feature
* BUT from developer point of view this may be considered an unwanted behavior.

### To Reproduce
1. download [virtual-tiddler.zip](https://github.com/Jermolene/TiddlyWiki5/files/9902724/virtual-tiddler.zip)
2. Unzip it
3. Then drag and drop to https://tiddlywiki.com
4. Open **Test** tiddler
5. Click on Virtual Tiddler link (it is a missing tiddler link)
6. See the ghost tiddler ;-) the **Virtual Tiddler**

### Screenshots
_No response_

### TiddlyWiki Configuration
- Windows 10
- Edge 105
- TiddlyWiki 5.2.3

### Additional context
_No response_

@Mohammad do you have a status report? Is this still on the table as a PR? (Answer in a new thread, please, so as not to derail this one. I’ll delete this post when you do.)

EDIT: ~~prior art~~ → shared art

Springer · December 17, 2023, 4:48pm

I think I am the OG on this one.

(At the very least, I’ve been harnessing this technique for years, namely that a so-called “missing” tiddler can serve as a very visible and useful GUI node, though I can’t find a good smoking-gun “OG” post about it.)

CodaCoder · December 17, 2023, 5:04pm

Oh, me too. I’ve been dreaming about “in-memory tiddlers” for five hundred years

Post updated.

Springer · December 17, 2023, 5:43pm

For example, here’s a Dec 2021 post (ported from google groups) in which I’m talking about my practice of sending permalinks to these useful “missing” tiddlers:

In a number of my teaching-related TiddlyWiki instances, I’ve zapped the Missing Tiddler hint (as well as the “empty filter” message on Shiraz dynamic tables…). I do this because I have set up ViewTemplate nodes for missing tiddlers (<$list filter="[all[current]is[missing]]"> ... </$list>), and don’t want visitors to be distracted by the missing tiddler message.

…

TW_Tones · December 18, 2023, 12:56am

If it helps to find a code pattern, this is very similar to “stop words”:
- Removing “the”, “and”, etc. from text to retain the keywords.
- Except in this case you remove “von”, “jr”, etc., but detect that they belong to the name and add them back in the appropriate place.
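
A minimal sketch of that pattern, assuming the input is in “Given Family” order with an optional trailing suffix. The `PARTICLES` and `SUFFIXES` lists are hypothetical and deliberately short, and `parseName` / `formatName` are names invented for this example.

```javascript
// Hypothetical, deliberately short lists; real data will need many more entries.
const PARTICLES = new Set(["von", "van", "de", "der", "den", "la", "le", "di", "da"]);
const SUFFIXES = new Set(["jr", "sr", "ii", "iii"]);

// Assumes "Given [particles] Family[, Suffix]" order.
const parseName = (name) => {
  const tokens = name.replace(/[.,]/g, " ").split(/\s+/).filter(Boolean);

  // Peel suffixes such as "Jr." off the end.
  const suffix = [];
  while (tokens.length > 1 && SUFFIXES.has(tokens[tokens.length - 1].toLowerCase())) {
    suffix.unshift(tokens.pop());
  }

  // The last remaining token is treated as the family name.
  const family = tokens.pop() ?? "";

  // Particles directly before the family name are set aside, not discarded.
  const particles = [];
  while (tokens.length > 0 && PARTICLES.has(tokens[tokens.length - 1].toLowerCase())) {
    particles.unshift(tokens.pop());
  }

  return { given: tokens, particles, family, suffix };
};

// Put the removed pieces back in a fixed "Family, Given particles, Suffix" order.
const formatName = ({ given, particles, family, suffix }) => {
  const rest = [...given, ...particles].join(" ");
  const main = [family, rest].filter(Boolean).join(", ");
  return [main, ...suffix].join(", ");
};
```

Whether the particle is filed after the given names (“Beauvoir, Simone de”) or kept with the family name (“de Beauvoir, Simone”) is a convention choice; the sketch picks the first only to show where the decision would live.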

Virtual tiddlers

I had not seen @Mohammad’s virtual tiddler approach, and not yet worked out the full mechanism which I imagine is based on missing tiddlers. What if we could create “virtual tiddlers” using variables, no need for an action widget or trigger?

This could allow us to define macros and tiddlers on the fly without user interaction.
Then have a mechanism connected to the save or logout triggers that optionally commits them to real tiddlers.
Recent work to intercept the link widget for [[tiddler links]] and CamelCase for tiddlers that don’t yet exist would help here.

Wow. The mind boggles.

Scott_Sauyet · December 18, 2023, 2:06am

If you get to removing diacritical marks, I think you may hit a wall in wikitext, unless your function handles a whole lot of independent cases. But if that can run in pure JS, then it’s reasonably simple:

const removeDiacriticals = (s) => s.normalize("NFD").replace(/[\u0300-\u036f]/g, "")

This first converts accented letters into two separate characters, a base character and an accent character. Then it removes all accent characters.

G a b r i e l   G a r c í a   M á r q u e z
G a b r i e l   G a r c i ́ a   M a ́ r q u e z
G a b r i e l   G a r c i a   M a r q u e z

We could replace normalize('NFD') with normalize('NFKD') if we also wanted to separate out ligatures, like

"ﬁ" -> "f" + "i"

but I don’t think I’ve ever seen them in names.
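
A short usage sketch of the difference between the two forms. The second function name is made up for this example. One caveat worth knowing: letters such as “ø” have no decomposition into base letter plus accent, so neither form changes them.

```javascript
const removeDiacriticals = (s) =>
  s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");

// Same idea with compatibility decomposition, which also splits ligatures.
// (Hypothetical name, for illustration only.)
const removeDiacriticalsAndLigatures = (s) =>
  s.normalize("NFKD").replace(/[\u0300-\u036f]/g, "");

// "Gabriel García Márquez", written with escapes to avoid encoding issues.
const accented = "Gabriel Garc\u00eda M\u00e1rquez";
const plain = removeDiacriticals(accented);

// "\ufb01" is the single-character "fi" ligature.
const ligatureKept = removeDiacriticals("\ufb01n");
const ligatureSplit = removeDiacriticalsAndLigatures("\ufb01n");

// "\u00f8" (o with stroke) does not decompose, so it passes through unchanged.
const unchanged = removeDiacriticals("S\u00f8ren");
```

Characters like that last one would need an explicit replacement table if they should be folded to plain ASCII as well.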

There’s more information on Wikipedia.

I’m slowly getting there myself. I keep assuming they’re just like functions in a programming language, and shooting myself in the foot. They’re similar, but by no means the same.

TW_Tones · December 18, 2023, 2:14am

Feel free to ask me. I have already become fairly well versed in them, having put at least 40 hours into researching, tweaking and driving them hard, in part because they solve at least 5 problems that always annoyed me, and I really love them.

Mohammad · December 22, 2023, 3:22pm

This is a very crude idea:

- I assume a bibliographic record is stored as a tiddler (the TW bibtex plugin uses this approach).
- You can use a view template to show the bibliographic tiddler in your desired form and style.
- In the view template, write a wikitext solution to do the job and create the dynamic table, or create the drop-down when you click the author name.

What the code should do:

- Collect all records for that author (no matter how his/her name is stored).
- Create the drop-down on click or generate a dynamic table.
- The challenge is to correctly identify the author if the name comes in different formats.
- A procedure can be used to show the author in other content as a button/link that shows the info on click.

@TW_Tones pointed to some functions to treat different name formats
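
The “collect all records for that author” step could look roughly like this in JS, under the assumption that each record is a plain object with an `author` string. `authorKey`, `indexByAuthor` and `recordsForAuthor` are hypothetical names, and the key only unifies order, case, dots and accents, not initials versus full names.

```javascript
// Canonical key: accents stripped, lower case, "Family, Given" reordered.
// Assumption: at most one comma separates family name from given names.
const authorKey = (name) => {
  const plain = name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase();
  const [family, given] = plain.split(",");
  const ordered = given === undefined ? family : `${given} ${family}`;
  return ordered.replace(/\./g, " ").split(/\s+/).filter(Boolean).join(" ");
};

// Build the index once; each record is assumed to have an "author" string.
const indexByAuthor = (records) => {
  const index = new Map();
  for (const record of records) {
    const key = authorKey(record.author);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(record);
  }
  return index;
};

// Look up every record whose author normalizes to the same key.
const recordsForAuthor = (index, clickedName) =>
  index.get(authorKey(clickedName)) ?? [];

// Placeholder records, for illustration only.
const demoIndex = indexByAuthor([
  { title: "Record A", author: "Simone de Beauvoir" },
  { title: "Record B", author: "Beauvoir, Simone de" },
  { title: "Record C", author: "Márquez, Gabriel García" },
]);
const demoHits = recordsForAuthor(demoIndex, "de Beauvoir, Simone");
```

The resulting list is what a drop-down or dynamic table would then render.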

Mohammad · December 22, 2023, 3:31pm

Hi @CodaCoder
Please keep this open. I will create a PR to report virtual tiddler as a feature in documentation. Then I will create a new thread and this post can link to that.

Springer · December 22, 2023, 10:32pm

I’m baffled by this question… virtual tiddlers don’t need to be created. If you open a “missing” tiddler in the story river, nothing is “created”; it’s just that a tiddler frame with appropriate view templates (as determined by cascade filters) appears in that spot… Do you mean that virtual tiddlers would appear in a wiki even without the simple action of being “opened”? I’m not sure why, or how one would decide which of the quite genuinely infinite number of missing-tiddler-title-strings to open without some action triggering it. Of course, I have used permalinks (and/or query strings) to help visitors open a web-hosted wiki with “missing” tiddlers as an appropriate customized “landing page”, but that’s still a kind of action.

Of course, you can always create a tiddler for any “missing” tiddler (by editing it and adding any info). But if doing so would not add info beyond the helpful structural road-signs and transclusions that already can be summarized in a virtual tiddler, then there’s no need to burden your wiki with another line of json. (But again, I wonder if I’m missing something that you have in mind.)

Springer · December 22, 2023, 10:48pm

Yes, this is exactly what I’ve been modeling in my mockup so far… And I do think dynamic tables in a virtual “missing” tiddler are the most powerful tool for this role!

^ THIS is the hard part.

I understand any solution here will always be some kind of approximation, given that bibtex data “from the wild” may come in with various degrees of messy or incomplete data, and we’ll never have a perfect or complete list of all the ways that names can surprise us.

One thing I want to clarify is that my intention is to have the solution find compatible names for each author name as it appears in the tiddler being viewed (in other words, the tiddler from which one accesses the link to the author-overview tiddler).

One consequence of this approach is that it will be asymmetrical (unlike the solution that @Scott_Sauyet was building above in this thread). If you click on Martha Craven Nussbaum, you’ll get a virtual tiddler (or popup, or whatever) that finds exact matches plus all the “weaker” versions of her name, pretty much guaranteed to rule out false positives. But if you click on “M Nussbaum”, the list will necessarily be less constrained, so the filter based on that string will easily include other authors (assuming there are bibtex-author fields listing Mary Nussbaum or Martha Helen Nussbaum, etc.).
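
A rough JS sketch of that behaviour, assuming a simple rule: last names must be equal, and given names are compared position by position, with an initial agreeing with any name that starts with it. The rule and the names `isCompatible` / `compatibleNames` are assumptions for illustration, not the mockup itself.

```javascript
// Lower-case tokens in "given ... family" order, accents and dots removed.
const nameTokens = (name) => {
  const plain = name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase();
  const [first, second] = plain.split(",");
  const ordered = second === undefined ? first : `${second} ${first}`;
  return ordered.replace(/\./g, " ").split(/\s+/).filter(Boolean);
};

// Equal tokens agree; an initial agrees with any token it starts.
const givenAgree = (a, b) =>
  a === b || ((a.length === 1 || b.length === 1) && a[0] === b[0]);

// Assumed rule: same last name, and no positional conflict in given names.
const isCompatible = (clicked, candidate) => {
  const a = nameTokens(clicked);
  const b = nameTokens(candidate);
  if (a.length === 0 || b.length === 0) return false;
  if (a[a.length - 1] !== b[b.length - 1]) return false;
  const givenA = a.slice(0, -1);
  const givenB = b.slice(0, -1);
  const shared = Math.min(givenA.length, givenB.length);
  for (let i = 0; i < shared; i += 1) {
    if (!givenAgree(givenA[i], givenB[i])) return false;
  }
  return true;
};

const compatibleNames = (clicked, allNames) =>
  allNames.filter((candidate) => isCompatible(clicked, candidate));

const storedNames = [
  "Martha Craven Nussbaum",
  "Martha C. Nussbaum",
  "M Nussbaum",
  "Mary Nussbaum",
  "Martha Helen Nussbaum",
];
const fromFullName = compatibleNames("Martha Craven Nussbaum", storedNames);
const fromInitial = compatibleNames("M Nussbaum", storedNames);
```

Starting from the full name, the list holds only the three variants of that name; starting from “M Nussbaum”, all five stored names come back, which is the less constrained result described above.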

This kind of solution is an “interim” one. A robust and well-supported biblio database should want eventually to standardize names and/or implement an author ID system. (Also, this system will be calculation-intensive compared to a simpler match, so over time performance will suffer, and benefits dwindle.) The purpose is to make the database useful even before such standardization has been achieved (and when an influx of new “wild” data comes in).

Springer · December 22, 2023, 10:56pm

Scott_Sauyet:

If you get to removing diacritical marks, I think you may hit a wall in wikitext, unless your function handles a whole lot of independent cases. But if that can run in pure JS, then it’s reasonably simple:
const removeDiacriticals = (s) => s.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
This first converts accented letters into two separate characters, a base character and an accent character. Then it removes all accent characters.

Yes, I suspect that either regex or going straight to javascript would make a tremendous difference in the manageability of this task!

TW_Tones · December 22, 2023, 11:10pm

First, although I am commenting on the concept of virtual tiddlers, keep in mind its relevance to the OT.

As soon as you have links to alternate “name order variations” you in effect can list/search them and even find out where they are defined, with backlinks.

I am not talking about saving or creating tiddlers. Just that point at which they come into existence. Such that they are named, observed or useable.

Yes, it is not always about the user and the user interface. There are useful cases where we want lists, possible titles, compound titles, and more to exist, some of which we shove into temp or state tiddlers, and which could also exist in virtual tiddler titles.

Sure, that title may have to exist in some text somewhere, e.g. [[my virtual tiddler]].

At a minimum you could prefix such titles, but if your code creates them it can also take them away. A title search may be sufficient, e.g. “surname”:

Perhaps an extension of Exploring default tiddler links hackability in V5.3.0 to add features to missing tiddler titles, our virtual tiddlers. Especially with reference back to where they come into existence, or subsequently linked to.