Best systemic solution for name order variations — a new "compare" type?

  • I get it.
  • I think perhaps you misunderstand me.
  • I am talking about how we achieve it, to put to whatever purpose you want. Whether you commit the normalised data to a tiddler or just re-interpret it every time is up to you; you still need similar algorithms.
  • If the input adheres to one or more standards that you can tell apart, you are good to go.

Me too.

Yes, and with both set multiple variables and action set multiple fields you can commit with a click.

Again totally agree. My emphasis is to craft the solution to handle either or both as needed. Normalisation can just be in memory.

Notwithstanding the above, if you are committing your normalisation of the input to storage, you do it once and need not do it again. If the data comes from an external source, there is a choice to be made that depends on the use case.

I wonder if something like this would help. We could create a function that accepts two names, does a fair bit of normalization of each of them, and then computes a similarity score.

Then on import, we can offer as options for normalization all names in the wiki above a certain threshold, which the user can accept or choose to ignore. I coded a very naïve implementation, which will give us values like this:

"Simone de Beauvoir" is a 100% match with "Beauvoir, Simone de"
"Simone de Beauvoir" is a 100% match with "de Beauvoir, Simone"
"Márquez, Gabriel García" is a 100% match with "Gabriel Garcia Marquez"
"Mills, Charles W." is a 90.25% match with "C Wright Mills"
"Nussbaum, Martha C." is a 90% match with "Nussbaum, Martha"
"Martha Nussbaum" is a 95% match with "M Nussbaum"
"Martha Nussbaum" is a 85.5% match with "M C Nussbaum"
"Martha C. Nussbaum" is a 95% match with "Martha Craven Nussbaum"
"Martha Nussbaum" is a 90% match with "Martha H Nussbaum"
"Friedrich Nietzsche" is a 90% match with "Nietzsche"
"T. S. Eliot" is a 90.25% match with "Eliot, Thomas Stearns"
"Martha C. Nussbaum" is a 50% match with "Martha H Nussbaum"
"Martha C. Nussbaum" is a 47.5% match with "M.H. Nussbaum"
"J.R.R. Tolkein" is a 25% match with "G.R.R. Martin"
"Prince" is a 40.5% match with "Prince Rogers Nelson"
"Beyoncé Giselle Knowles-Carter" is a 36.45% match with "Beyonce"

This is written in JS, and might not be easy to move to wikitext, but as I think it would make the most sense for this to end up as an operator, that doesn’t bother me much.

There are likely to be many adjustments needed to make this hit your 98% target, but I think it could get close, with a threshold value of 50% match.

It is biased toward last names, so “Nietzsche” is closer to “Friedrich Nietzsche” than “Prince” is to “Prince Rogers Nelson”. There are certain potential problems. Right now “Van Halen” and “Eddie Van Halen” are only a 45% match, which seems too low. And totally unrelated names like “Scott Sauyet” and “Elise Springer” are a 25% match, which seems too high. But neither seems a show-stopper.

If this seems like a promising approach, I can explain the implementation more completely.
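For readers who want something concrete to poke at, here is a much cruder sketch of the same idea. It is not the implementation that produced the numbers above and will not reproduce them: it uses plain word-set overlap (Jaccard similarity) with no bias toward last names and no handling of initials. The names `nameTokens`, `similarity` and `suggestMatches` are made up for illustration.

```javascript
// Lower-case, strip accents, and split on spaces, periods and commas.
// Because the result is a set of words, "Last, First" vs "First Last" order drops out.
const nameTokens = (name) =>
  new Set(
    name
      .normalize("NFD")
      .replace(/[\u0300-\u036f]/g, "")
      .toLowerCase()
      .split(/[\s.,]+/)
      .filter(Boolean)
  );

// Jaccard similarity: shared words divided by all distinct words, from 0 to 1.
const similarity = (a, b) => {
  const tokensA = nameTokens(a);
  const tokensB = nameTokens(b);
  const shared = [...tokensA].filter((t) => tokensB.has(t)).length;
  const total = new Set([...tokensA, ...tokensB]).size;
  return total === 0 ? 0 : shared / total;
};

// On import: offer every existing name at or above the threshold, best first.
const suggestMatches = (incoming, existingNames, threshold = 0.5) =>
  existingNames
    .map((name) => ({ name, score: similarity(incoming, name) }))
    .filter(({ score }) => score >= threshold)
    .sort((a, b) => b.score - a.score);

console.log(suggestMatches("Nussbaum, Martha C.", ["Martha Nussbaum", "C Wright Mills"]));
```

Even this version treats "Simone de Beauvoir" and "Beauvoir, Simone de" as identical, but it has no idea that "M" might stand for "Martha", which is where the real work is.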

Just some thoughts

I think every “naive” algorithm has to fall short. IMO, “M Nussbaum”, for example, is not enough information to create a unique reference.

So I was searching for “someone” who should / could know the different variations that make sense and has an API that we are allowed to use. I did find openlibrary.org and some others.

The datasets can be queried with HTML output or as JSON data. See the “Martha Nussbaum” link below.

  • Such a query returns “alternate_names” field, which we could directly use.
  • It also contains a unique key - eg: OL262048A which can be used to gather more info. Like “remote_ids”, which lead to even more data eg: wikidata, VIAF and ISNI …

According to openlibrary.org, “M. Nussbaum” returns 8 data sets with several different authors,
whereas “Martha Nussbaum” returns 1 author with 7/8 alternative names associated with different publications.

So for what you describe in the OP, the first “query operation” would be enough to get the alternate_names, which would then be searchable in TW.

If we also stored the unique key as, e.g., olib_author_key, it would be relatively straightforward to gather more info later in the process.
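A rough sketch of that last step, turning an already-fetched JSON result into fields we could store and search. The response shape used here (a `docs` array whose entries have `key`, `name` and `alternate_names`) is an assumption for illustration, and the sample data is hand-written, not real API output. `authorFields` and `findByAnyName` are hypothetical helper names.

```javascript
// Assumed shape: { docs: [{ key, name, alternate_names }] }. Check the real API before relying on it.
const authorFields = (response) =>
  (response.docs ?? []).map((doc) => ({
    title: doc.name,
    olib_author_key: doc.key,
    alternate_names: doc.alternate_names ?? [],
  }));

// Find stored authors whose main name or any alternate name equals the query.
const findByAnyName = (authors, query) =>
  authors.filter((a) => a.title === query || a.alternate_names.includes(query));

// Hand-written sample for illustration only.
const sample = {
  docs: [
    {
      key: "OL262048A",
      name: "Martha Nussbaum",
      alternate_names: ["Martha C. Nussbaum", "Martha Craven Nussbaum"],
    },
  ],
};

const authors = authorFields(sample);
console.log(findByAnyName(authors, "Martha Craven Nussbaum").map((a) => a.olib_author_key));
```

Keeping the key in its own field means later look-ups do not have to go through the name at all.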

As I wrote: “Just some thoughts”
-m

@Scott_Sauyet This seems very promising (though I’m not able to check out the details right now)! The idea of a sliding threshold is even more flexible than the strong and weak levels I imagined.

I’ve been working on a function-based solution, which (as you do above) prioritizes last name matches, then compatible names/initials — but it’s slow going (largely because I’m using this as an excuse to learn my way around the power of functions)!

A very generous batch of thoughts indeed!

I think your thoughts are leaning in the direction of a powerful web-connected long-term “sturdy” solution, while I was imagining something near the other end of the spectrum: a quick-and-clever local compatibility filter that works on the records one “just has” at the moment.

Excellent bibliographic databases (for institutions, for example) will indeed have a unique author id key (and will not spit out mere initials for author first and middle names). Yet such solutions take serious time (even research) to set up.

To stand on the shoulders of public database-builders is a wise approach — to the extent that their catalogue covers all or most of one’s area of interest…

Yet I’m guessing that a good solution that works with an online resource would ideally involve an API integration (quite intimidating to me!). Perhaps an iframe with an automatic query string would be a middle road for a manageable workflow…

Ah yes. Bibtex distinguishes four elements:

  • First (plus middle etc bundled in)
  • Last name / surname
  • “von”-type prefixes to last name
  • “Jr”-type suffixes

Those are the elements I’m slowly trying to build functions to parse more or less successfully. I’m in the process of troubleshooting this approach. I AM having success attending to the presence of multiple names in a field (by adding an order parameter to the function, which defaults to 1). I’m as far as getting most Last, First and First Last names parsed into those elements (such that the functions will then be available to filters, within which we can further parse initial compatibility etc.), but I haven’t yet gotten to pulling out von or Jr components directly.
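In case it helps to see the four-way split outside wikitext, here is a small JS sketch. It assumes the usual BibTeX conventions: authors in one field are joined by " and ", a name is written "First von Last", "von Last, First" or "von Last, Jr, First", and "von" words are the ones starting with a lower-case letter. `nthAuthor` and `parseName` are hypothetical names, and plenty of real-world names will not fit these rules.

```javascript
// Pick the nth author (1-based) from a field like "A and B and C".
const nthAuthor = (field, order = 1) => field.split(/\s+and\s+/)[order - 1] ?? "";

const startsLower = (word) => /^\p{Ll}/u.test(word);

const parseName = (raw) => {
  const parts = raw.split(",").map((p) => p.trim()).filter(Boolean);
  if (parts.length === 0) return { first: "", von: "", last: "", jr: "" };

  let first = "";
  let jr = "";
  let rest; // the "von Last" words
  if (parts.length === 1) {
    // "First von Last": the final word is the surname; lower-case words just before it are "von".
    const words = parts[0].split(/\s+/);
    let i = words.length - 1;
    while (i > 0 && startsLower(words[i - 1])) i--;
    first = words.slice(0, i).join(" ");
    rest = words.slice(i);
  } else {
    // "von Last, First" or "von Last, Jr, First"
    rest = parts[0].split(/\s+/);
    first = parts[parts.length - 1];
    if (parts.length > 2) jr = parts[1];
  }

  // Leading lower-case words are "von"; always keep at least one word as the surname.
  const firstUpper = rest.findIndex((w) => !startsLower(w));
  const split = firstUpper === -1 ? rest.length - 1 : firstUpper;
  return { first, von: rest.slice(0, split).join(" "), last: rest.slice(split).join(" "), jr };
};

console.log(parseName("Simone de Beauvoir"));
console.log(parseName(nthAuthor("Mills, Charles W. and Nussbaum, Martha C.", 2)));
```

Note that capitalised particles ("Eddie Van Halen") end up in the first name under these rules, and "Beauvoir, Simone de" keeps "de" with the first name.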

Commendable endeavor, @Springer. I’d vouch (and trust) that there are more reading and watching this thread than are actually chiming in, at this point.

Power to you, and keep on keeping on.

@Springer Re the OP, I meant to point out that there is some “shared art” surrounding “virtual tiddlers”, dating back to late 2022. Not sure if you and/or other folks are aware of it:

@Mohammad do you have a status report? Is this still on the table as a PR? (Answer in a new thread, please, so as not to derail this one. I’ll delete this post when you do.)

EDIT: changed “prior art” to “shared art”.

I think I am the OG on this one. :slight_smile:

(At the very least, I’ve been harnessing this idea — “A so-called ‘Missing’ tiddler can serve as very much visible and useful GUI node” — technique for years, though I can’t find a good smoking-gun “OG” post about it.)

Oh, me too. I’ve been dreaming about “in-memory tiddlers” for five hundred years :wink:

Post updated.

For example, here’s a Dec 2021 post (ported from google groups) in which I’m talking about my practice of sending permalinks to these useful “missing” tiddlers:

in a number of my teaching-related tiddlywiki instances, I’ve zapped the Missing Tiddler hint (as well as “empty filter” message on Shiraz dynamic tables…). I do this because I have set up ViewTemplate nodes for missing tiddlers (<$list filter="[all[current]is[missing]]"> ... </$list> ) , and don’t want visitors to be distracted by the missing tiddler message.

  • If it helps in finding a code pattern, this is very similar to “stop words”:
    • Removing “the”, “and”, etc. from text to retain the keywords.
    • Except in this case you remove von/jr etc., but detect that they belong to the name and add them back in the appropriate place.

Virtual tiddlers

I had not seen @Mohammad’s virtual tiddler approach, and not yet worked out the full mechanism which I imagine is based on missing tiddlers. What if we could create “virtual tiddlers” using variables, no need for an action widget or trigger?

  • this could allow us to define macros and tiddlers on the fly without user interaction.
  • Then have a mechanism connected to the save or logout triggers that optionally commits them to real tiddlers.
  • Recent work to intercept the link widget for [[tiddler links]] and CamelCase for tiddlers that don’t yet exist would help here.

Wow. The mind boggles.

If you get to removing diacritical marks, I think you may hit a wall in wikitext, unless your function handles a whole lot of independent cases. But if that can run in pure JS, then it’s reasonably simple:

const removeDiacriticals = (s) => s.normalize("NFD").replace(/[\u0300-\u036f]/g, "")

This first converts accented letters into two separate characters, a base character and an accent character. Then it removes all accent characters.

G a b r i e l   G a r c í a   M á r q u e z
G a b r i e l   G a r c i ́ a   M a ́ r q u e z
G a b r i e l   G a r c i a   M a r q u e z
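The same three steps can be checked in code. This is just the one-liner above plus some logging; the `\u00ed` and `\u00e1` escapes are only there to make sure the sample starts out with precomposed í and á.

```javascript
const removeDiacriticals = (s) => s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");

const name = "Gabriel Garc\u00eda M\u00e1rquez"; // precomposed í and á
const decomposed = name.normalize("NFD");

console.log(name.length);       // 22
console.log(decomposed.length); // 24: each accent is now its own character
console.log(removeDiacriticals(name)); // Gabriel Garcia Marquez
```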

We could replace normalize('NFD') with normalize('NFKD') if we also wanted to separate out ligatures, like

"ﬁ" -> "f" + "i"

but I don’t think I’ve ever seen them in names.
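If they ever do turn up (say, from text copied out of a PDF), the difference between the two forms is easy to see. A small sketch, where `foldName` is a made-up name for the NFKD variant of the function above:

```javascript
const ligature = "\ufb01"; // "ﬁ", the single-character fi ligature

console.log(ligature.normalize("NFD") === ligature); // true: NFD leaves the ligature alone
console.log(ligature.normalize("NFKD"));             // fi (two separate letters)

// NFKD variant: splits ligatures as well as accents, then drops the accents.
const foldName = (s) => s.normalize("NFKD").replace(/[\u0300-\u036f]/g, "");

console.log(foldName("Shef\ufb01eld")); // Sheffield
```

NFKD also rewrites other compatibility characters (full-width letters, superscripts and so on), so it is a slightly blunter tool than NFD.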

There’s more information on Wikipedia.

I’m slowly getting there myself. I keep assuming they’re just like functions in a programming language, and shooting myself in the foot. They’re similar, but by no means the same.

Feel free to ask me. I have already become fairly well versed in them, having put at least 40 hours into researching, tweaking and driving them hard, in part because they solve at least 5 problems that always annoyed me, and I really love them.

This is a very crude idea:

  • I assume a bibliographic record is stored as a tiddler (TW bibtex plugin uses this approach)
  • You can use a viewtemplate to show the bibliographic tiddler in your desired form and style
  • In the view template, write a wikitext solution to do the job and create the dynamic table, or create the drop down when you click the author name

What the code should do:

  • collect all records for that author (no matter how his/her name is stored)
  • create the drop down on click or generate a dynamic table
  • the challenge is to correctly identify the author if name comes in different formats
  • a procedure can be used to show the author in other content as a button/link that shows info on click

@TW_Tones pointed to some functions to treat different name formats

Hi @CodaCoder
Please keep this open. I will create a PR to report virtual tiddler as a feature in documentation. Then I will create a new thread and this post can link to that.

I’m baffled by this question… virtual tiddlers don’t need to be created. If you open a “missing” tiddler in the story river, nothing is “created”; it’s just that a tiddler frame with appropriate view templates (as determined by cascade filters) appears in that spot… :thinking: Do you mean that virtual tiddlers would appear in a wiki even without the simple action of being “opened”? (I’m not sure why, or how one would decide which of the quite genuinely infinite number of missing-tiddler-title-strings to open without some action triggering it…? Of course, I have used permalinks (and/or query strings) to help visitors open a web-hosted wiki with “missing” tiddlers as an appropriate customized “landing page”, but that’s still a kind of action.)

Of course, you can always create a tiddler for any “missing” tiddler (by editing it and adding any info). But if doing so would not add info beyond the helpful structural road-signs and transclusions that already can be summarized in a virtual tiddler, then there’s no need to burden your wiki with another line of json. (But again, I wonder if I’m missing something that you have in mind.)

Yes, this is exactly what I’ve been modeling in my mockup so far… And I do think dynamic tables in a virtual “missing” tiddler are the most powerful tool for this role! :slight_smile:

^ THIS is the hard part.

I understand any solution here will always be some kind of approximation, given that bibtex data “from the wild” may come in with various degrees of messy or incomplete data, and we’ll never have a perfect or complete list of all the ways that names can surprise us. :upside_down_face:

One thing I want to clarify is that my intention is to have the solution find compatible names for each author name as it appears in the tiddler being viewed (in other words, the tiddler from which one accesses the link to the author-overview tiddler).

One consequence of this approach is that it will be asymmetrical (unlike the solution that @Scott_Sauyet was building above in this thread). If you click on Martha Craven Nussbaum, you’ll get a virtual tiddler (or popup, or whatever) that finds exact matches plus all the “weaker” versions of her name, pretty much guaranteed to rule out false positives. But if you click on “M Nussbaum” the list will necessarily be less constrained, so the filter based on that string will easily include other authors (assuming there are bibtex-author fields listing Mary Nussbaum, Martha Helen Nussbaum, etc.).
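To make the asymmetry concrete, here is a rough JS sketch of one possible compatibility rule, not the wikitext functions I am actually building. Two names are treated as compatible when the surnames are equal and the given names of the shorter one can be matched, in order, against the longer one, with a single letter standing in for any name that starts with it. `splitName`, `isCompatible` and `compatibleNames` are made-up names, and von/Jr parts are ignored here.

```javascript
// Very rough split: text before a comma is the surname; otherwise the last word is.
const splitName = (name) => {
  const plain = name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/\./g, " ");
  const words = (s) => s.split(/\s+/).filter(Boolean);
  if (plain.includes(",")) {
    const [last, given = ""] = plain.split(",");
    return { last: words(last).join(" "), given: words(given) };
  }
  const all = words(plain);
  return { last: all[all.length - 1] ?? "", given: all.slice(0, -1) };
};

// Equal, or one is an initial of the other.
const tokensCompatible = (x, y) =>
  x === y || (x.length === 1 && y.startsWith(x)) || (y.length === 1 && x.startsWith(y));

const isCompatible = (a, b) => {
  const nameA = splitName(a);
  const nameB = splitName(b);
  if (nameA.last !== nameB.last) return false;
  const [shorter, longer] =
    nameA.given.length <= nameB.given.length ? [nameA.given, nameB.given] : [nameB.given, nameA.given];
  let j = 0; // next unused position in the longer list
  for (const token of shorter) {
    while (j < longer.length && !tokensCompatible(token, longer[j])) j++;
    if (j === longer.length) return false;
    j++;
  }
  return true;
};

const compatibleNames = (clicked, allNames) => allNames.filter((n) => isCompatible(clicked, n));

const stored = [
  "Martha Craven Nussbaum", "Nussbaum, Martha C.", "M Nussbaum",
  "Martha H Nussbaum", "Mary Nussbaum", "C Wright Mills",
];
console.log(compatibleNames("Martha Craven Nussbaum", stored)); // 3 names
console.log(compatibleNames("M Nussbaum", stored));             // 5 names
```

Each pairwise check is symmetric, but the result lists are not: the full name pulls in only its own weaker forms, while "M Nussbaum" pulls in every Nussbaum whose first name could start with M.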

This kind of solution is an “interim” one. A robust and well-supported biblio database should want eventually to standardize names and/or implement an author ID system. (Also, this system will be calculation-intensive compared to a simpler match, so over time performance will suffer, and benefits dwindle.) The purpose is to make the database useful even before such standardization has been achieved (and when an influx of new “wild” data comes in).

Yes, I suspect that either regex or going straight to javascript would make a tremendous difference in the manageability of this task!

First, although I am commenting on the concept of virtual tiddlers, keep in mind its relevance to the OT.

  • As soon as you have links to alternate “name order variations” you in effect can list/search them and even find out where they are defined, with backlinks.
  • I am not talking about saving or creating tiddlers. Just that point at which they come into existence. Such that they are named, observed or useable.

Yes, it is not always about the user and the user interface. There are useful cases where we want lists, possible titles, compound titles, and more to exist, some of which we shove into temp or state tiddlers, and which could also exist as virtual tiddler titles.

  • Sure, that title may have to exist in some text somewhere, e.g. [[my virtual tiddler]]
  • At a minimum you could prefix such titles, but if your code creates them it can also take them away. A title search may be sufficient, e.g. “surname”:

Perhaps an extension of Exploring default tiddler links hackability in V5.3.0 to add features to missing tiddler titles, our virtual tiddlers. Especially with reference back to where they come into existence, or subsequently linked to.