Best systemic solution for name order variations — a new "compare" type?

Springer · December 16, 2023, 1:04am

I’m tweaking my bibliographic resource toward working well as a demo/model project that amplifies the power of RefNotes tools and maximizes “intertwingularity”. And I’m hitting a stumbling-block that is not unique to me…

Objective: On the hyperlinked interface I want, any author names within any bibliographic record will function as a link to a virtual tiddler gathering info about that author. (More fancy: any author name will function as a link only if there’s more to browse about that author — beyond what’s in the tiddler being displayed).

Any “missing” tiddler can serve as a virtual tiddler/node (gathering a dynamic-table overview of bibliographic resources attributed to that author) if the name of that missing tiddler appears in the bibtex-author field of existing tiddlers.

All good so far in theory. I have been developing such “missing/virtual” utility/hub tiddlers for a while now. (Author names could alternately show up as filter-pills in ways that @TW_Tones has been developing.)

First Wrinkle: bibtex records as they come in “from the wild” include both LastName, FirstName MiddleName format and FirstName MiddleName LastName format for author names, plus other variants such as LastName, F M. It’s possible that I could modify the import process so as to standardize somewhat. Cleaning incoming data and sticking to a standard is generally a good thing, but it’s nowhere near a sufficient solution to this problem.

(In case anyone doesn’t see why “cleaning up” in batch-mode or at import-time doesn’t help much: Some records come in with only first initials, as in “F M Alexander”. If I overwrite the actual field data for both “Mills, Charles W.” and “C Wright Mills” down to the “least-common-denominator” value of “C W Mills” — in order to standardize for matching purposes, then I lose actual information. Charles Wade Mills (who goes by Charles W Mills) will end up irreversibly conflated with the author who goes by C Wright Mills (both real authors I do cite!). Now I grant I may inherit some records that list only “C W Mills” as author. Still, I can avoid making the problem worse!)

So, I wonder what’s the most elegant filter-based way to make sure that the missing/virtual tiddler for Jane Addams — which someone can click wherever that name appears in bibtex-author role — could (in its dynamic table or filter-pill) pull up tiddlers with “Jane Addams” and ALSO tiddlers with “Addams, Jane” in the bibtex-author field.

The simple connection between LastName, FirstName and FirstName LastName variants would be a great start, and I think I could accomplish this by myself. But I pause because…

Additional Complication: The missing/virtual tiddler (or filter-pill) associated with author string “Nussbaum, Martha C.” should catch tiddlers with bibtex-author values “Nussbaum, Martha” and “Martha Nussbaum” and “M Nussbaum” and “M C Nussbaum” and “Martha C. Nussbaum” and “Martha Craven Nussbaum” but not “Martha H Nussbaum” (or any other incompatible variant on the name, with some 98%-sufficient rule of thumb algorithm… we could get into the weeds with Jr. and such… cross some bridges later!)

The basic trick for the filter is to standardize an author-name string into FirstName MiddleName LastName format (if there’s exactly one comma in the string, bump whatever’s before it to the end), and then to check it against values in the bibtex-author field of other tiddlers… which may require splitting those fields at the ; character (since sometimes there are multiple authors — but perhaps we’ll focus on single-author works for now!), and setting a variable to FirstName... LastName standard to check for the right kind of match (meaning a match that avoids glaring false positives, such as names with conflicting first/middle data, while erring on the side of weak false positives with compatible initials).
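A minimal sketch of the reordering rule just described, written in plain JavaScript rather than as a filter. The names `toFirstLast` and `authorsOf` are hypothetical, not part of TiddlyWiki; the only rule applied is the one stated above (exactly one comma means "bump whatever is before it to the end").

```javascript
// Hypothetical helpers: a sketch of the rule described above, not a TiddlyWiki API.

// "Addams, Jane" -> "Jane Addams". Strings with zero commas, or more than one,
// are returned unchanged apart from whitespace cleanup.
function toFirstLast(name) {
  const cleaned = name.trim().replace(/\s+/g, " ");
  const parts = cleaned.split(",");
  if (parts.length !== 2) return cleaned;
  const [last, rest] = parts.map((p) => p.trim());
  if (!last || !rest) return cleaned;
  return `${rest} ${last}`;
}

// Split a bibtex-author value at ";" and reorder each author separately.
function authorsOf(fieldValue) {
  return fieldValue
    .split(";")
    .map((a) => a.trim())
    .filter((a) => a.length > 0)
    .map(toFirstLast);
}
```

For example, `authorsOf("Nussbaum, Martha C.; Jane Addams")` yields `["Martha C. Nussbaum", "Jane Addams"]`. A comma that introduces a suffix or a title (as in "Martin Luther King, Jr.") would be reordered wrongly by this rule, so such strings need a separate check.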

(The reason to standardize into that FirstName ... LastName order rather than LastName, FirstName M [etc] is that otherwise our algorithm will have to impose some guess about whether “Simone de Beauvoir” ought to be “Beauvoir, Simone de” or “de Beauvoir, Simone” and whether “Gabriel García Márquez” ought to be “García Márquez, Gabriel” or “Márquez, Gabriel García” (etc.) — all of which invites more trouble than we want!)

Why I’m coming to Y’all: Of course, I am willing to put a bunch of trial-and-error into this from my end. However, I suspect

- RegExp wizards may be able to do this much more easily than I can;
- A solution to this name-order problem may actually be useful for other purposes, such as projects that batch-import or otherwise inherit name data in multiple formats and levels of completeness. For example:
  - genealogical records
  - student names
  - employee names

Dreaming BIG: Maybe TiddlyWiki could eventually offer a new type for the compare filter operator so that a filter can use compare:person-name:weak-match (for example) to catch weak matches (regardless of whether LastName, FirstName convention is used, and regardless of whether first/middle names are reduced to an initial, with or without period, flattening all diacritic-marked-characters to ascii, all case-insensitive, etc.), while filtering out genuine conflicts. Something like compare:person-name:strong-match could compensate only for the lastname-order variation, plus perhaps certain differences in punctuation.
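To make the two proposed behaviours concrete, here is a hedged JavaScript sketch. No `person-name` compare type exists in TiddlyWiki; `weakMatch` and `strongMatch` are hypothetical names, and the matching rules are assumptions drawn from the description above (surname must agree, given names are compared position by position, and an initial is compatible with any name starting with that letter).

```javascript
// Hypothetical sketch; none of these names exist in TiddlyWiki.

// "Last, First" -> "First Last" when there is exactly one comma.
function reorder(name) {
  const parts = name.split(",");
  return parts.length === 2 ? `${parts[1]} ${parts[0]}` : name;
}

// Lowercase, strip diacritics, treat periods as spaces, then split into tokens.
function tokens(name) {
  return reorder(name)
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/\./g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// Two given-name tokens are compatible if equal, or if one is an initial of the other.
function compatible(x, y) {
  if (x === y) return true;
  if (x.length === 1) return y.startsWith(x);
  if (y.length === 1) return x.startsWith(y);
  return false;
}

function weakMatch(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.length === 0 || tb.length === 0) return false;
  // Assumption: after reordering, the last token is the end of the surname.
  if (ta[ta.length - 1] !== tb[tb.length - 1]) return false;
  const ga = ta.slice(0, -1);
  const gb = tb.slice(0, -1);
  const n = Math.min(ga.length, gb.length);
  for (let i = 0; i < n; i++) {
    if (!compatible(ga[i], gb[i])) return false;
  }
  return true;
}

// Compensates only for name order, periods and spacing.
function strongMatch(a, b) {
  const clean = (s) => reorder(s).replace(/\./g, " ").replace(/\s+/g, " ").trim();
  return clean(a) === clean(b);
}
```

With these rules, "Nussbaum, Martha C." weakly matches "M Nussbaum" and "Martha Craven Nussbaum" but not "Martha H Nussbaum". It also weakly matches "Mills, Charles W." with "C Wright Mills", which is the kind of compatible-initials false positive the post accepts.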

I’m not the best person to invent this wheel. But I can describe it!

(And to be fair, I think some folks have grappled with variants on this problem, before, including @Mohammad. Please do point me to any threads that document progress on this front, since my only impression is that things were left mostly unresolved!)

TW_Tones · December 16, 2023, 1:52am

Don’t do that, I hope the following assists.

One approach is to capture the initials as separate values, alongside the full names if available. When you have the full name, e.g. Tony, you can find the initial T with [[Tony]split[]first[]]

When you don’t have an initial you could set a function in its place, <<first.initial>>. This is where using data fields as if they are filters is a good idea: like the caption, it is “transcluded”.

Most elegant, I am not sure, but it may be as simple as this;

\function to.first.last(last-first)  [<last-first>split[,]trim[]reverse[]join[ ]]
\function to.last.first(first-last)  [<first-last>split[ ]reverse[]join[, ]]

# <<to.first.last "Muscio, Tony">>
# <<to.last.first "Tony Muscio">>

If it helps, my surname is pronounced “muss-see-oh”

Interestingly, if I were to give my initials they would be A M, as I am Anthony.

The point being initials need to be either derived or given as needed.

If your first name and surname are divided, according to the incoming format, into separate fields, and you extract initials as well, then from that point forward you can refer to, search or display them in whichever format you care for.

- At import you could provide a few buttons to select which format to convert from before saving, if you can derive it, e.g. if it contains a comma it’s “surname, firstname othernames”. This need only be done once for most cases, because if they come in a second time you can find them.
- Remember honorifics: create a list and detect/remove them. As in Catch-22, you may have a problem with “Major Major”.
- Also consider a mechanism for more than one person with the same name.
  - Consider having organisation, position or role to uniquify, or a number (1)
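As an illustration of the first two suggestions, here is a small JavaScript sketch. The honorific list, the function names and the "keep at least two words" guard are all assumptions made for the example, not an established rule set.

```javascript
// Hypothetical list; a real one would need to be much longer and locale-aware.
const HONORIFICS = new Set(["dr", "prof", "mr", "mrs", "ms", "sir", "major"]);

// Drop one leading honorific, but only when at least two words remain afterwards,
// so that a name like "Major Major" is left alone.
function stripHonorific(name) {
  const words = name.trim().split(/\s+/);
  const first = words[0].toLowerCase().replace(/\.$/, "");
  if (HONORIFICS.has(first) && words.length >= 3) {
    return words.slice(1).join(" ");
  }
  return words.join(" ");
}

// Guess the incoming format: a comma suggests "surname, firstname othernames".
function guessFormat(name) {
  return name.includes(",") ? "surname-first" : "unknown";
}
```

Here `stripHonorific("Dr. Jane Addams")` gives "Jane Addams" while "Major Major" is returned untouched. The guess from `guessFormat` is only a default for the import buttons; the user can still override it.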

@Mohammad and possibly @DaveGifford have spent time in this space, from memory.

I will not go into the details of BibTeX because I know nothing about it. But my approach would be to “normalise” all incoming names into named fields.

Auto detect and show most likely but allow another to be set.

Finally, although this may be unnecessarily complex for now, my speculative design idea in “Design Opportunity pass parameters to macros and procedures from a filter?” could help if you were to build a complete name “parser” and then want to use the result in a filter.

[Post Script]

The functions above can also be used as custom filter operators

Mark_S · December 16, 2023, 1:58am

In database technologies, we wouldn’t depend on getting lucky with name formats. I’m guessing that with multiple filters and regular expressions you might get 98% correct.

Instead, after or during import, if possible, you would assign an author id #, which you assign to an author tiddler that has the definitive name for the author. You would probably do this using the select widget that lets you see, for instance, all the “Mills” in your author tiddlers, and then pick the real one to assign to the current incoming info. Your ability to know which is which will always be better than a filter.

Because all the incoming tiddlers will have an assigned id, it will be easy to find all the tiddlers that represent that actual user.
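A rough sketch of that id-based workflow, using plain JavaScript objects in place of tiddlers. The ids, field names and function names are hypothetical; in a wiki the picker would be a select widget and the records would be tiddlers with an author-id field.

```javascript
// Hypothetical data: one entry per real author, with a definitive name.
const authors = [
  { id: "A1", name: "Charles W. Mills" },
  { id: "A2", name: "C. Wright Mills" },
  { id: "A3", name: "Martha C. Nussbaum" },
];

// Candidates to show in a picker: every known author whose name contains the surname.
function candidatesFor(surname) {
  const needle = surname.toLowerCase();
  return authors.filter((a) => a.name.toLowerCase().includes(needle));
}

// Once a human has picked, store the id next to the untouched original string.
function assignAuthor(record, authorId) {
  return { ...record, authorId };
}

// Finding everything by one real author is then an exact lookup.
function recordsBy(records, authorId) {
  return records.filter((r) => r.authorId === authorId);
}
```

The original author string stays in the record, so nothing is lost if the human choice later turns out to be wrong.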

I think by “virtual” tiddler you mean some sort of pop-up, like refnotes, that shows you info related to a particular author? Otherwise I would be curious what you mean.

Seasons twiddles!

Springer · December 16, 2023, 2:00am

Thanks, Tony.

I’m wary of trying to require normalization. Sometimes one gets a batch of incoming data, and normalizing it would require actually knowing things. The Last, First ... format encodes more info (so is preferable in that way — and you’ll see the double-underline in library books, helping future cataloguers track where the last name starts in the author’s name string).

I don’t want to remove the surname/alphabetization info available in an incoming record with author name “García Márquez, Gabriel.” Turning it into “Gabriel García Márquez” would strip out the information about where the last name starts.

When I get a record in FirstName ... LastName format, I don’t want to force it in any automatic way into LastName, FirstName ... format either, for the very same reason running conversely: I may not know where the last name starts.

So, my preference is not to require clean data, but to offer an interface that makes great approximations of likely matches through use of filter magic.

Springer · December 16, 2023, 2:01am

If I were designing a solution just for myself, that would be a good strategy.

I’m aiming for a plug-and-play that can grab records from the interwebz (such as google scholar) and be usefully GUI-dense without all this data-massage work.

Springer · December 16, 2023, 2:03am

Yes, this is exactly the kind of thing that I already do often. “missing” tiddlers can be turned into great automatic data-hubs, using viewtemplates and cascade conditions.

But it’s not just a pop-up. I mean, it could be. But you can also have it open in the story river as a tiddler. I always remove the “missing tiddler” message, because if someone clicked something to get to the “missing” tiddler (which has useful view templates transcluding important connections “from here”), then there’s almost always some related information somewhere in the wiki worthy of giving a roadmap for.

TW_Tones · December 16, 2023, 2:17am

Personally, and from decades of IT, my belief is that whether or not to normalize is not up to us; the universe demands it. Sure, you can choose to play with the Devil if you want.

Don’t, store it in a field before doing anything. You can then revise it later, if your automation gets it wrong.

You don’t need to, once normalised you can generate whatever form you want, whenever you want.

Sure, but consider “clean data as the ideal, but you can cope if it’s not”.
By the way, I am willing to bet that if your automation resolves 99% of all names, then on reading any incoming name a human can clearly see what is intended in most of the remaining names.
To me it’s all about volume: if one in a thousand is difficult to code for, do them manually; if 10% of the incoming names are in a particular format, code an automatic algorithm.
Fortunately existing standards help a lot here. If the names can’t be identified correctly I would question the source.

That was what I was proposing. You choose whether you intervene during or after a batch import, or both.
If you state an incoming format of any kind, concisely, I am happy to help write the filter to normalise it. Once it is normalised, we can choose the output format with another filter. But I expect you are capable of this yourself.

I saw your? recent post on extending this and love the idea. I have already explored some aspects of this and can share some useful methods or applications if you ask. Hint: Custom link widget.

Springer · December 16, 2023, 2:50am

I think I didn’t make my purpose clear. I’m talking about making a solution that’s maximally friendly to beginners/newcomers, and which “just works” even with real-world variations in name formats.

I have no worries about how to clean and troubleshoot my own data (in the bibliographic tools that I use for my own purposes). What I’d like is to envision a compare:person-name:match filter step that works across name formats, so that powerful cross-format recognition (and GUI affordances) can come into view without waiting until after data cleanup.

To the extent that data-cleaning could be automated at time of import, it can equally, in principle, happen as “virtual data-cleaning” that is accomplished by a well-built filter. To the extent that it requires actual judgment-calls, people often need to go about their business without pausing for that step. (I realize that performance efficiency is improved by standardized info, but I’m imagining a “fuzzy-match” on/off preference setting that can be deactivated if/when someone’s quantity and quality of records makes the fuzzy-match too slow and no longer especially needed.)

I think this kind of name-parsing capacity is important partly because someone evaluating TiddlyWiki (as a biblio management solution, for example) might try importing their own bibtex data into a bibliographic “demo” solution (or any other person-data-intensive solution), and then become discouraged when things don’t “just work”. Data-cleaning is something that people don’t want to do until after they commit to a solution.

By analogy: this is a bit like the difference between advising photographers to fix each of their image files with a system based on careful filenames and meta-data, and offering people the kind of tool that often “just works” at pulling up photos that contain related faces / text / colors, even when the users happen to be sitting on an archive of digital images that hasn’t yet been organized in any consistent way.

Springer · December 16, 2023, 3:42am

First: this is generous, and I very much appreciate your offer of help!

Still, this is exactly what I don’t think we should always need to do — or rather, not everything needs to wait for such normalization.

Suppose some incoming records show author names without a comma… or perhaps with a comma that functions in unexpected ways:

“Simone de Beauvoir”, “Gabriel García Márquez”, “José Ortega y Gasset”, “Thérèse of Lisieux”, “Sor Juana Inés de la Cruz”, “Paul-Henri Dietrich, baron d’Holbach”, “Martin Luther King, Jr.”

I prefer the mantra “do no harm”. If I don’t know those authors, I leave them as they stand, rather than impose a LastName, FirstName... format that requires guessing about where the last name starts. (And maybe I just don’t have the time right now…)

Meanwhile, if incoming records do show last name first, then I again prefer the mantra “do no harm”. Leave those name strings as they are, rather than remove the helpful (until proven otherwise) information about where the last name starts.

Of course, there’s no conflict between my priorities and yours: It makes sense to try parsing names so as to auto-populate several custom fields (on import, and/or by batch process) corresponding to surname, given name #1, given name #2, suffixes, and perhaps honorific/title. And when we realize there are mistaken patterns there, such glitches may lead us to tweak the algorithm for auto-parsing…

I suggest we do all this without deleting the original imported field, until and unless we have reason to think that we’ll do zero damage by overwriting that original.

But look: if we can auto-populate fields, we can also (in theory) auto-populate variables while leaving the actual field values untouched. This may be too computation-intensive for most routine purposes, but it may yet be a smart approach when you want a proof-of-concept or exploratory interface.
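One way to picture "auto-populating variables while leaving the field untouched" is a function that derives name parts on demand and always carries the original string along. This is only a sketch under stated assumptions: `parseName` and the suffix list are hypothetical, and a surname is reported only when the comma format actually supplies it.

```javascript
// Hypothetical suffix list, compared after lowercasing and removing periods.
const SUFFIXES = new Set(["jr", "sr", "ii", "iii", "iv"]);

// Derive parts without guessing: with no comma, the surname stays unknown (null).
function parseName(original) {
  const parts = original.split(",").map((p) => p.trim());
  const result = { original, surname: null, given: null, suffix: null };
  if (parts.length === 2) {
    const tail = parts[1].toLowerCase().replace(/\./g, "");
    if (SUFFIXES.has(tail)) {
      // "Martin Luther King, Jr.": the comma marks a suffix, not a surname.
      result.suffix = parts[1];
    } else {
      result.surname = parts[0];
      result.given = parts[1];
    }
  }
  return result;
}
```

So `parseName("García Márquez, Gabriel")` reports the surname "García Márquez", while `parseName("Simone de Beauvoir")` reports no surname at all instead of guessing. A string like "Paul-Henri Dietrich, baron d’Holbach" would still be misread by this sketch, which is exactly the kind of glitch that argues for keeping the original value.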

Imagine, for example, something like @simon’s solution to auto-load a set of remote tiddlers. Can I auto-load a colleague’s set of bibtex tiddlers, and get name-recognition working across local and remote tiddlers even if the remote file uses LastName, FirstName... format and I don’t? Can I scan for likely “smart” author-matches on this or that name before deciding whether I actually want to pull those records over permanently into my own wiki (where I can “clean” in whatever way I see fit)? Even if that kind of name-matching algorithm is a bit slow, it will be much faster than actually importing and cleaning all that data in order to run a more efficient tool.

I think this is all analogous to my reasons in favor of enabling freelinks for certain purposes. If you’re manually typing and managing all the data in a wiki yourself, of course you should just discipline yourself to include double-bracket links and pretty links exactly when links are appropriate (and also use Relink plugin). But if your solution involves handling stuff pouring in from elsewhere (like student writing and excerpted passages from source texts), sometimes it’s better to have a tool that just “sees” and virtually highlights the connections, even before / without running all of it through the process of cleanup or format-normalization beforehand.

TW_Tones · December 16, 2023, 4:13am

I get it.
I think perhaps you misunderstand me.
I am talking about how we achieve it, to put it to whatever purpose you want. Whether you commit the normalised data to a tiddler or just re-interpret it every time is up to you; you still need similar algorithms.
If the input adheres to one or more standards that you can tell apart, you are good to go.

Me too.

Yes, and with both the $setmultiplevariables and $action-setmultiplefields widgets you can commit with a click.

Again totally agree. My emphasis is to craft the solution to handle either or both as needed. Normalisation can just be in memory.

Notwithstanding the above, if you are committing your normalisation of the input to storage, you do it once and need not do it again. If the data comes from an external source, there is a choice to be made that depends on the use case.

Scott_Sauyet · December 17, 2023, 5:42am

I wonder if something like this would help. We could create a function that accepts two names, does a fair bit of normalization of each of them, and then computes a similarity score.

Then on import, we can offer as options for normalization all names in the wiki above a certain threshold, which the user can accept or choose to ignore. I coded a very naïve implementation, which will give us values like this:

"Simone de Beauvoir" is a 100% match with "Beauvoir, Simone de"
"Simone de Beauvoir" is a 100% match with "de Beauvoir, Simone"
"Márquez, Gabriel García" is a 100% match with "Gabriel Garcia Marquez"
"Mills, Charles W." is a 90.25% match with "C Wright Mills"
"Nussbaum, Martha C." is a 90% match with "Nussbaum, Martha"
"Martha Nussbaum" is a 95% match with "M Nussbaum"
"Martha Nussbaum" is a 85.5% match with "M C Nussbaum"
"Martha C. Nussbaum" is a 95% match with "Martha Craven Nussbaum"
"Martha Nussbaum" is a 90% match with "Martha H Nussbaum"
"Friedrich Nietzsche" is a 90% match with "Nietzsche"
"T. S. Eliot" is a 90.25% match with "Eliot, Thomas Stearns"
"Martha C. Nussbaum" is a 50% match with "Martha H Nussbaum"
"Martha C. Nussbaum" is a 47.5% match with "M.H. Nussbaum"
"J.R.R. Tolkein" is a 25% match with "G.R.R. Martin"
"Prince" is a 40.5% match with "Prince Rogers Nelson"
"Beyoncé Giselle Knowles-Carter" is a 36.45% match with "Beyonce"

This is written in JS, and might not be easy to move to wikitext, but as I think it would make the most sense for this to end up as an operator, that doesn’t bother me much.

There are likely to be many adjustments needed to make this hit your 98% target, but I think it could get close, with a threshold value of 50% match.
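The "offer everything above a threshold" step can be sketched independently of how the score is computed. In the JavaScript below, `suggestMatches` and `toyScore` are hypothetical names, and `toyScore` is a deliberately crude word-overlap measure used only to make the example runnable. It is not the implementation that produced the percentages above.

```javascript
// score(a, b) is a parameter: any function returning a number between 0 and 1.
function suggestMatches(incoming, knownNames, score, threshold = 0.5) {
  return knownNames
    .map((name) => ({ name, score: score(incoming, name) }))
    .filter((c) => c.score >= threshold)
    .sort((x, y) => y.score - x.score);
}

// Toy stand-in: share of distinct lowercase words the two names have in common.
function toyScore(a, b) {
  const words = (s) =>
    new Set(s.toLowerCase().replace(/[.,]/g, " ").split(/\s+/).filter(Boolean));
  const wa = words(a);
  const wb = words(b);
  const common = [...wa].filter((w) => wb.has(w)).length;
  const union = new Set([...wa, ...wb]).size;
  return union === 0 ? 0 : common / union;
}
```

Calling `suggestMatches("Martha Nussbaum", names, toyScore)` returns the candidates sorted best first, so an import dialog could list them for the user to accept or ignore.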

It is biased toward last names, so “Nietzsche” is closer to “Friedrich Nietzsche” than “Prince” is to “Prince Rogers Nelson”. There are certain potential problems. Right now “Van Halen” and “Eddie Van Halen” are only a 45% match, which seems too low. And totally unrelated names like “Scott Sauyet” and “Elise Springer” are a 25% match, which seems too high. But neither seems a show-stopper.

If this seems like a promising approach, I can explain the implementation more completely.

pmario · December 17, 2023, 11:13am

Just some thoughts

I think every “naive” algorithm has to fall short. IMO eg: “M Nussbaum” is not enough information to create a unique reference.

So I was searching for “someone” who should / could know the different variations that make sense and has an API, that we are allowed to use. I did find openlibrary.org and some others

The datasets can be queried with an HTML output or as a JSON data. See the “Martha Nussbaum” link below.

Such a query returns “alternate_names” field, which we could directly use.
It also contains a unique key - eg: OL262048A which can be used to gather more info. Like “remote_ids”, which lead to even more data eg: wikidata, VIAF and ISNI …

According to openlibrary.org, “M. Nussbaum” returns 8 data-sets with several different authors.
Whereas Martha Nussbaum returns 1 author with 7/8 alternative names associated to different publications.

So, for what you describe in the OP, the first “query operation” would be enough to get the alternate_names, which would then be searchable in TW.

If we would also store the unique key as eg: olib_author_key it would be relatively straight forward to gather more info later in the process.
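A hedged sketch of how such a result could be turned into fields. It makes no network call; it assumes each author entry in the JSON has `key`, `name` and `alternate_names` properties, which should be checked against the live response. The function names are hypothetical, and the sample values in the usage note are illustrative rather than copied from a real query.

```javascript
// Map one author entry from the (assumed) JSON shape onto tiddler-style fields.
function toAuthorFields(doc) {
  return {
    olib_author_key: doc.key,
    name: doc.name,
    alternate_names: Array.isArray(doc.alternate_names) ? doc.alternate_names : [],
  };
}

// Check a candidate string against the main name and the stored alternates.
function knownAs(fields, candidate) {
  const norm = (s) => s.trim().toLowerCase();
  return [fields.name, ...fields.alternate_names].some(
    (n) => norm(n) === norm(candidate)
  );
}
```

For instance, given an entry like `{ key: "OL262048A", name: "Martha Nussbaum", alternate_names: ["Martha C. Nussbaum"] }`, `knownAs(toAuthorFields(entry), "martha c. nussbaum")` is true, and the stored `olib_author_key` remains available for later lookups.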

Virtual Tiddler: An Issue or A Feature

opened 04:01PM - 31 Oct 22 UTC

kookma

### Describe the bug

Using Tiddlywiki 5.2.3 I can create a virtual tiddler. …

* A virtual tiddler is not a physical tiddler
* It is not existed in TW file saved on disk
* It does not appear in any search result
* BUT it can be viewed in Story River.

I discovered this feature during development of my new plugin called TiddlyWiki Garden which is a playground to learn Filters and CSS.

### Expected behavior

* I like this behavior
* I call this a feature
* BUT from developer point of view this may be considered an unwanted behavior.

### To Reproduce

1. download [virtual-tiddler.zip](https://github.com/Jermolene/TiddlyWiki5/files/9902724/virtual-tiddler.zip)
2. Unzip it
3. Then drag and drop to https://tiddlywiki.com
4. Open **Test** tiddler
5. Click on Virtual Tiddler link (it is a missing tiddler link)
6. See the ghost tiddler ;-) the **Virtual Tiddler**

### Screenshots

_No response_

### TiddlyWiki Configuration

- Windows 10
- Edge 105
- TiddlyWiki 5.2.3

### Additional context

_No response_

@Mohammad do you have a status report? Is this still on the table as a PR? (Answer in a new thread, please, so as not to derail this one. I’ll delete this post when you do.)

EDIT: ~~prior art~~ → shared art

Springer · December 17, 2023, 4:48pm

I think I am the OG on this one.

(At the very least, I’ve been harnessing this technique, “A so-called ‘Missing’ tiddler can serve as a very much visible and useful GUI node”, for years, though I can’t find a good smoking-gun “OG” post about it.)

CodaCoder · December 17, 2023, 5:04pm

Oh, me too. I’ve been dreaming about “in-memory tiddlers” for five hundred years

Post updated.

Springer · December 17, 2023, 5:43pm

For example, here’s a Dec 2021 post (ported from google groups) in which I’m talking about my practice of sending permalinks to these useful “missing” tiddlers:

in a number of my teaching-related tiddlywiki instances, I’ve zapped the Missing Tiddler hint (as well as “empty filter” message on Shiraz dynamic tables…). I do this because I have set up ViewTemplate nodes for missing tiddlers (<$list filter="[all[current]is[missing]]"> ... </$list> ) , and don’t want visitors to be distracted by the missing tiddler message.

…