Best systemic solution for name order variations — a new "compare" type?

I’m tweaking my bibliographic resource toward working well as a demo/model project that amplifies the power of RefNotes tools and maximizes “intertwingularity” :wink: . And I’m hitting a stumbling-block that is not unique to me…

Objective: On the hyperlinked interface I want, any author names within any bibliographic record will function as a link to a virtual tiddler gathering info about that author. (More fancy: any author name will function as a link only if there’s more to browse about that author — beyond what’s in the tiddler being displayed).

Any “missing” tiddler can serve as a virtual tiddler/node (gathering a dynamic-table overview of bibliographic resources attributed to that author) if the name of that missing tiddler appears in the bibtex-author field of existing tiddlers.

All good so far in theory. I have been developing such “missing/virtual” utility/hub tiddlers for a while now. (Author names could alternately show up as filter-pills in ways that @TW_Tones has been developing.)

First Wrinkle: bibtex records as they come in “from the wild” include both LastName, FirstName MiddleName format and FirstName MiddleName LastName format for author names, plus other variants such as LastName, F M. It’s possible that I could modify the import process so as to standardize somewhat. Cleaning incoming data and sticking to a standard is generally a good thing, but it’s nowhere near a sufficient solution to this problem.

(In case anyone doesn’t see why “cleaning up” in batch-mode or at import-time doesn’t help much: Some records come in with only first initials, as in “F M Alexander”. If I overwrite the actual field data for both “Mills, Charles W.” and “C Wright Mills” down to the “least-common-denominator” value of “C W Mills” — in order to standardize for matching purposes, then I lose actual information. Charles Wade Mills (who goes by Charles W Mills) will end up irreversibly conflated with the author who goes by C Wright Mills (both real authors I do cite!). Now I grant I may inherit some records that list only “C W Mills” as author. Still, I can avoid making the problem worse!)

So, I wonder what’s the most elegant filter-based way to make sure that the missing/virtual tiddler for Jane Addams — which someone can click wherever that name appears in bibtex-author role — could (in its dynamic table or filter-pill) pull up tiddlers with “Jane Addams” and ALSO tiddlers with “Addams, Jane” in the bibtex-author field.

The simple connection between LastName, FirstName and FirstName LastName variants would be a great start, and I think I could accomplish this by myself. But I pause because…

Additional Complication: The missing/virtual tiddler (or filter-pill) associated with author string “Nussbaum, Martha C.” should catch tiddlers with bibtex-author values “Nussbaum, Martha” and “Martha Nussbaum” and “M Nussbaum” and “M C Nussbaum” and “Martha C. Nussbaum” and “Martha Craven Nussbaum” but not “Martha H Nussbaum” (or any other incompatible variant on the name, with some 98%-sufficient rule of thumb algorithm… we could get into the weeds with Jr. and such… cross some bridges later!)

The basic trick for the filter is to standardize an author-name string into FirstName MiddleName LastName format (if there’s exactly one comma in the string, bump whatever’s before it to the end), and then to check it against values in the bibtex-author field of other tiddlers. That may require splitting those fields at the ; character (since sometimes there are multiple authors — but perhaps we’ll focus on single-author works for now!), and setting a variable to the FirstName... LastName standard to check for the right kind of match: one that avoids glaring false positives, such as names with conflicting first/middle data, while erring on the side of weak false positives with compatible initials.

(The reason to standardize into that FirstName ... LastName order rather than LastName, FirstName M [etc] is that otherwise our algorithm will have to impose some guess about whether “Simone de Beauvoir” ought to be “Beauvoir, Simone de” or “de Beauvoir, Simone” and whether “Gabriel García Márquez” ought to be “García Márquez, Gabriel” or “Márquez, Gabriel García” (etc.) — all of which invites more trouble than we want!)
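For concreteness, here’s a rough sketch (in plain JavaScript, since an eventual wikitext/filter version would follow the same logic) of the comma-flip and semicolon-split steps described above. The function names are invented for illustration, and suffixes like “Jr.” are deliberately not handled yet — one of the bridges to cross later:

```javascript
// Hypothetical helper: flip "LastName, FirstName ..." into "FirstName ... LastName"
// when the string contains exactly one comma; otherwise leave it untouched
// ("do no harm" for names we cannot confidently parse).
// Note: "Martin Luther King, Jr." will be mangled by this rule -- suffix
// handling is intentionally out of scope for this sketch.
function toFirstLast(name) {
  const parts = name.split(",");
  if (parts.length !== 2) return name.trim();
  return (parts[1].trim() + " " + parts[0].trim()).trim();
}

// Split a multi-author bibtex-author field on ";" and normalize each author.
function normalizeAuthors(field) {
  return field.split(";").map(a => toFirstLast(a)).filter(a => a !== "");
}
```

Note that this is exactly where “Martin Luther King, Jr.” goes wrong, which is why the comma-flip has to stay a heuristic rather than a rule.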

Why I’m coming to y’all: Of course, I am willing to put a bunch of trial-and-error into this from my end. However, I suspect:

  1. RegExp wizards may be able to do this much more easily than I can;
  2. A solution to this name-order problem may actually be useful for other purposes, such as projects that batch-import or otherwise inherit name data in multiple formats and levels of completeness. For example:
  • genealogical records
  • student names
  • employee names

Dreaming BIG: Maybe TiddlyWiki could eventually offer a new type for the compare filter operator so that a filter can use compare:person-name:weak-match (for example) to catch weak matches (regardless of whether LastName, FirstName convention is used, and regardless of whether first/middle names are reduced to an initial, with or without period, flattening all diacritic-marked-characters to ascii, all case-insensitive, etc.), while filtering out genuine conflicts. Something like compare:person-name:strong-match could compensate only for the lastname-order variation, plus perhaps certain differences in punctuation.
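To make the weak-match idea concrete, here is a speculative JavaScript sketch of what such a check might do (all function names are invented; nothing like this exists in the core today, and the rules here are just one plausible 98%-rule-of-thumb: flatten diacritics and case, flip “Last, First” order, require the same surname, and allow an initial to stand in for any full name sharing it):

```javascript
// Speculative sketch of a "weak match" between two person-name strings.
function flatten(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "")  // strip accents
          .replace(/\./g, "").toLowerCase().trim();
}

function toTokens(name) {
  const parts = name.split(",");
  const ordered = parts.length === 2
    ? parts[1] + " " + parts[0]          // "Last, First" -> "First Last"
    : name;
  return flatten(ordered).split(/\s+/).filter(t => t !== "");
}

function compatible(a, b) {              // "m" is compatible with "martha"
  return a === b || (a.length === 1 && b.startsWith(a)) ||
         (b.length === 1 && a.startsWith(b));
}

function weakMatch(n1, n2) {
  const t1 = toTokens(n1), t2 = toTokens(n2);
  if (t1.length === 0 || t2.length === 0) return false;
  if (t1[t1.length - 1] !== t2[t2.length - 1]) return false; // surnames differ
  const g1 = t1.slice(0, -1), g2 = t2.slice(0, -1);
  for (let i = 0; i < Math.min(g1.length, g2.length); i++) {
    if (!compatible(g1[i], g2[i])) return false; // conflicting given names
  }
  return true;
}
```

Fused initials like “M.C.” (no space) and multi-word surnames in FirstName-first order would still need extra rules, but this captures the strong/weak intuition above.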

I’m not the best person to invent this wheel. But I can describe it!

(And to be fair, I think some folks have grappled with variants on this problem, before, including @Mohammad. Please do point me to any threads that document progress on this front, since my only impression is that things were left mostly unresolved!)

  • Don’t do that; I hope the following assists.

One approach is to capture the initials as separate values, alongside the full names if available. When you have the full name, e.g. Tony, you can find the initial T with [[Tony]split[]first[]]

  • When you don’t have an initial you could set a function in its place, e.g. <<first.initial>>. This is where using data fields as if they are filters is a good idea; like the caption, the value is “transcluded”.

Most elegant? I am not sure, but it may be as simple as this:

\function to.first.last(last-first)  [<last-first>split[,]trim[]reverse[]join[ ]]
\function to.last.first(first-last)  [<first-last>split[ ]reverse[]join[, ]]

# <<to.first.last "Muscio, Tony">>
# <<to.last.first "Tony Muscio">>
  • If it helps, my surname is pronounced “muss-see-oh”

Interestingly, if I were to give my initials they would be A M, as I am Anthony :nerd_face:

  • The point being initials need to be either derived or given as needed.

If your first name and surname are divided, according to the incoming format, into separate fields, you can extract initials from that point forward and refer to, search, or display them according to whichever format you care for.
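As a tiny illustration of the “derived or given as needed” point, here’s a sketch in JavaScript (the field names are invented, not any standard):

```javascript
// Derive initials from whatever given-name fields are populated; fall back
// to an explicitly supplied initials field. All field names are invented.
function initialsOf(fields) {
  const given = [fields["first-name"], fields["other-names"]]
    .filter(Boolean).join(" ");
  if (given) {
    return given.split(/\s+/).map(w => w[0].toUpperCase()).join(" ");
  }
  return fields["initials"] || "";
}
```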

  • At import you could provide a few buttons to select which format to convert from before saving, if you can derive it: eg, if it contains a comma it’s “surname, firstname othernames”. This needs only be done once for most cases, because if they come in a second time you can find them.
  • Remember honorifics: create a list and detect/remove them. As in Catch-22, you may have a problem with “Major Major”.
  • Also consider a mechanism for more than one person with the same name.
    • Consider having organisation, position or role to uniquify, or a number (1)

@Mohammad and possibly @DaveGifford have spent time in this space, from memory.

I will not go into the details of bibtex because I know nothing. But my approach would be to “normalise” all incoming names into named fields.

  • Auto detect and show most likely but allow another to be set.

Finally, although this may be unnecessarily complex for now, my speculative design idea in “Design Opportunity: pass parameters to macros and procedures from a filter?” could help if you were to build a complete name “parser” and then want to use the result in a filter.

[Post Script]

The functions above can also be used as custom filter operators


In database technologies, we wouldn’t depend on getting lucky with name formats. I’m guessing that with multiple filters and regular expressions you might get 98% correct.

Instead, after or during import if possible, you would assign an author id #, which you attach to an author tiddler that has the definitive name for the author. You would probably do this using the select widget, which lets you see, for instance, all the “Mills” entries in your author tiddlers, and then pick the real one to assign to the current incoming info. Your ability to know which is which will always be better than any filter.

Because all the incoming tiddlers will have an assigned id, it will be easy to find all the tiddlers that represent that actual author.

I think by “virtual” tiddler you mean some sort of pop-up, like refnotes, that shows you info related to a particular author? Otherwise I would be curious what you mean.

Season’s twiddles!

Thanks, Tony.

I’m wary of trying to require normalization. Sometimes one gets a batch of incoming data, and normalizing it would require actually knowing things. The Last, First ... format encodes more info (so is preferable in that way — and you’ll see the double-underline in library books, helping future cataloguers track where the last name starts in the author’s name string).

I don’t want to remove the surname/alphabetization info available in an incoming record with author name “García Márquez, Gabriel.” Turning it into “Gabriel García Márquez” would strip out the information about where the last name starts.

When I get a record in FirstName ... LastName format, I don’t want to force it in any automatic way into LastName, FirstName ... format either, for the very same reason running conversely: I may not know where the last name starts.

So, my preference is not to require clean data, but to offer an interface that makes great approximations of likely matches through use of filter magic. :slight_smile:

If I were designing a solution just for myself, that would be a good strategy.

I’m aiming for a plug-and-play solution that can grab records from the interwebz (such as Google Scholar) and be usefully GUI-dense without all this data-massage work.

Yes, this is exactly the kind of thing that I already do often. “missing” tiddlers can be turned into great automatic data-hubs, using viewtemplates and cascade conditions. :slight_smile:

But it’s not just a pop-up. I mean, it could be. But you can also have it open in the story river as a tiddler. I always remove the “missing tiddler” message, because if someone clicked something to get to the “missing” tiddler (which has useful view templates transcluding important connections “from here”), then there’s almost always some related information somewhere in the wiki worthy of giving a roadmap for.

Personally, and from decades in IT, my belief is that whether or not to normalize is not up to us; the universe demands it. Sure, you can choose to play with the Devil if you want :nerd_face:

  • Don’t, store it in a field before doing anything. You can then revise it later, if your automation gets it wrong.
  • You don’t need to, once normalised you can generate whatever form you want, whenever you want.
  • Sure, but consider “clean data as the ideal, but you can cope if it’s not”.
  • By the way, I am willing to bet that if your automation resolves 99% of all names, a human reading any incoming name can clearly see what is intended in most of the remaining ones.
  • To me it’s all about volume: if one in a thousand is difficult to code for, do them manually; if 10% of the incoming are a particular format, code an automatic algorithm.
  • Fortunately existing standards help a lot here. If the names can’t be identified correctly I would question the source.
  • That was what I was proposing. You choose if you intervene during or after a batch import, or both.
  • If you state an incoming format of any kind, concisely, I am happy to help write the filter to normalise it, once normalised we can choose the output format with another filter. But I expect you are capable of this yourself.
  • I saw your? recent post on extending this and love the idea. I have already explored some aspects of this and can share some useful methods or applications if you ask. Hint: Custom link widget.

I think I didn’t make my purpose clear. I’m talking about making a solution that’s maximally friendly to beginners/newcomers, and which “just works” even with real-world variations in name formats.

I have no worries about how to clean and troubleshoot my own data (in the bibliographic tools that I use for my own purposes). What I’d like is to envision a compare:person-name:match filter step that works across name formats, so that powerful cross-format recognition (and GUI affordances) can come into view without waiting until after data cleanup.

To the extent that data-cleaning could be automated at time of import, it can equally, in principle, happen as “virtual data-cleaning” that is accomplished by a well-built filter. To the extent that it requires actual judgment-calls, people often need to go about their business without pausing for that step. (I realize that performance efficiency is improved by standardized info, but I’m imagining a “fuzzy-match” on/off preference setting that can be deactivated if/when someone’s quantity and quality of records makes the fuzzy-match too slow and no longer especially needed.)

I think this kind of name-parsing capacity is important partly because someone evaluating TiddlyWiki (as a biblio management solution, for example) might try importing their own bibtex data into a bibliographic “demo” solution (or any other person-data-intensive solution), and then become discouraged when things don’t “just work”. Data-cleaning is something that people don’t want to do until after they commit to a solution.

By analogy: this is a bit like the difference between advising photographers to fix each of their image files with a system based on careful filenames and meta-data :face_with_raised_eyebrow:, and offering people the kind of tool that often “just works” at pulling up photos that contain related faces / text / colors, even when the users happen to be sitting on an archive of digital images that hasn’t yet been organized in any consistent way.


First: this is generous, and I very much appreciate your offer of help!

Still, this is exactly what I don’t think we should always need to do — or rather, not everything needs to wait for such normalization.

Suppose some incoming records show author names without a comma… or perhaps with a comma that functions in unexpected ways:

“Simone de Beauvoir”, “Gabriel García Márquez”, “José Ortega y Gasset”, “Thérèse of Lisieux”, “Sor Juana Inés de la Cruz”, “Paul-Henri Dietrich, baron d’Holbach”, “Martin Luther King, Jr.”

I prefer the mantra “do no harm”. If I don’t know those authors, I leave them as they stand, rather than impose a LastName, FirstName... format that requires guessing about where the last name starts. (And maybe I just don’t have the time right now…)

Meanwhile, if incoming records do show last name first, then I again prefer the mantra “do no harm”. Leave those name strings as they are, rather than remove the helpful (until proven otherwise) information about where the last name starts. :slight_smile:

Of course, there’s no conflict between my priorities and yours: It makes sense to try parsing names so as to auto-populate several custom fields (on import, and/or by batch process) corresponding to surname, given name #1, given name #2, suffixes, and perhaps honorific/title. And when we realize there are mistaken patterns there, such glitches may lead us to tweak the algorithm for auto-parsing…

I suggest we do all this without deleting the original imported field, until and unless we have reason to think that we’ll do zero damage by overwriting that original.

But look: if we can auto-populate fields, we can also (in theory) auto-populate variables while leaving the actual field values untouched. This may be too computation-intensive for most routine purposes, but it may yet be a smart approach when you want a proof-of-concept or exploratory interface.

Imagine, for example, something like @simon’s solution to auto-load a set of remote tiddlers. Can I auto-load a colleague’s set of bibtex tiddlers, and get name-recognition working across local and remote tiddlers even if the remote file uses LastName, FirstName... format and I don’t? :grin: Can I scan for likely “smart” author-matches on this or that name before deciding whether I actually want to pull those records over permanently into my own wiki (where I can “clean” in whatever way I see fit)? Even if that kind of name-matching algorithm is a bit slow, it will be much faster than actually importing and cleaning all that data in order to run a more efficient tool.

I think this is all analogous to my reasons in favor of enabling freelinks for certain purposes. If you’re manually typing and managing all the data in a wiki yourself, of course you should just discipline yourself to include double-bracket links and pretty links exactly when links are appropriate (and also use Relink plugin :slight_smile: ). But if your solution involves handling stuff pouring in from elsewhere (like student writing and excerpted passages from source texts), sometimes it’s better to have a tool that just “sees” and virtually highlights the connections, even before / without running all of it through the process of cleanup or format-normalization beforehand.

  • I get it.
  • I think perhaps you misunderstand me.
  • I am talking about how we achieve it, to put to whatever purpose you want. Whether you commit the normalised data to a tiddler or just re-interpret it every time is up to you; you still need similar algorithms either way.
  • If the input adheres to one or more standards and you can tell the difference between them, you are good to go.

Me too.

Yes, and by setting multiple variables, plus actions that set multiple fields, you can commit with a click.

Again totally agree. My emphasis is to craft the solution to handle either or both as needed. Normalisation can just be in memory.

Notwithstanding the above, if you are committing your normalisation of the input to storage, you do it once and need not do it again. If the data comes from an external source, there is a choice to be made that depends on the use case.

I wonder if something like this would help. We could create a function that accepts two names, does a fair bit of normalization of each of them, and then computes a similarity score.

Then on import, we can offer as options for normalization all names in the wiki above a certain threshold, which the user can accept or choose to ignore. I coded a very naïve implementation, which will give us values like this:

"Simone de Beauvoir" is a 100% match with "Beauvoir, Simone de"
"Simone de Beauvoir" is a 100% match with "de Beauvoir, Simone"
"Márquez, Gabriel García" is a 100% match with "Gabriel Garcia Marquez"
"Mills, Charles W." is a 90.25% match with "C Wright Mills"
"Nussbaum, Martha C." is a 90% match with "Nussbaum, Martha"
"Martha Nussbaum" is a 95% match with "M Nussbaum"
"Martha Nussbaum" is a 85.5% match with "M C Nussbaum"
"Martha C. Nussbaum" is a 95% match with "Martha Craven Nussbaum"
"Martha Nussbaum" is a 90% match with "Martha H Nussbaum"
"Friedrich Nietzsche" is a 90% match with "Nietzsche"
"T. S. Eliot" is a 90.25% match with "Eliot, Thomas Stearns"
"Martha C. Nussbaum" is a 50% match with "Martha H Nussbaum"
"Martha C. Nussbaum" is a 47.5% match with "M.H. Nussbaum"
"J.R.R. Tolkien" is a 25% match with "G.R.R. Martin"
"Prince" is a 40.5% match with "Prince Rogers Nelson"
"Beyoncé Giselle Knowles-Carter" is a 36.45% match with "Beyonce"

This is written in JS, and might not be easy to move to wikitext, but as I think it would make the most sense for this to end up as an operator, that doesn’t bother me much.

There are likely to be many adjustments needed to make this hit your 98% target, but I think it could get close, with a threshold value of 50% match.

It is biased toward last names, so “Nietzsche” is closer to “Friedrich Nietzsche” than “Prince” is to “Prince Rogers Nelson”. There are certain potential problems. Right now “Van Halen” and “Eddie Van Halen” are only a 45% match, which seems too low. And totally unrelated names like “Scott Sauyet” and “Elise Springer” are a 25% match, which seems too high. But neither seems a show-stopper.

If this seems like a promising approach, I can explain the implementation more completely.
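For readers curious about the general shape of such a scorer, here is an independent, much simpler sketch. It is not the implementation described above and will not reproduce those exact percentages; the weights (0.6 for a surname match, 0.8 credit for a compatible initial, halving the score on a hard conflict) are arbitrary assumptions:

```javascript
// Independent sketch of a name-similarity score in [0, 1]. The surname
// dominates, compatible initials earn partial credit, and a hard conflict
// in given names halves the whole score. Weights are arbitrary.
function norm(s) {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "")  // strip accents
          .replace(/[.,]/g, " ").toLowerCase()
          .split(/\s+/).filter(Boolean);
}

function surnameOf(name) {
  const comma = name.indexOf(",");
  const head = comma >= 0 ? name.slice(0, comma) : name;
  const toks = norm(head);
  return toks[toks.length - 1];          // last word of the surname part
}

function givenOf(name) {
  const comma = name.indexOf(",");
  if (comma >= 0) return norm(name.slice(comma + 1));
  return norm(name).slice(0, -1);        // everything before the last word
}

function similarity(a, b) {
  let score = surnameOf(a) === surnameOf(b) ? 0.6 : 0.0;
  const ga = givenOf(a), gb = givenOf(b);
  const n = Math.max(ga.length, gb.length);
  if (n === 0) return score;             // both names are surname-only
  let credit = 0, conflict = false;
  for (let i = 0; i < Math.min(ga.length, gb.length); i++) {
    const x = ga[i], y = gb[i];
    if (x === y) credit += 1;                                         // exact
    else if (x[0] === y[0] && (x.length === 1 || y.length === 1)) credit += 0.8;
    else conflict = true;               // incompatible given name/initial
  }
  score += 0.4 * (credit / n);
  return conflict ? score / 2 : score;
}
```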

Just some thoughts

I think every “naive” algorithm has to fall short. IMO, e.g., “M Nussbaum” is not enough information to create a unique reference.

So I was searching for “someone” who should/could know the different variations that make sense and has an API that we are allowed to use. I did find openlibrary.org and some others.

The datasets can be queried with HTML output or as JSON data. See the “Martha Nussbaum” link below.

  • Such a query returns “alternate_names” field, which we could directly use.
  • It also contains a unique key - eg: OL262048A which can be used to gather more info. Like “remote_ids”, which lead to even more data eg: wikidata, VIAF and ISNI …

According to openlibrary.org, “M. Nussbaum” returns 8 data-sets with several different authors, whereas “Martha Nussbaum” returns 1 author with 7/8 alternative names associated with different publications.

So, regarding what you describe in the OP: the first “query operation” would be enough to get the alternate_names, which would then be searchable in TW.

If we would also store the unique key as eg: olib_author_key it would be relatively straight forward to gather more info later in the process.
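As a sketch of what consuming such a query might look like (the endpoint path and field names reflect my understanding of Open Library’s author search and should be verified against the live API; the sample object below is hand-made in that shape, not a real response):

```javascript
// Extract candidate names plus the unique author key from an Open Library
// author-search response, e.g. from
//   https://openlibrary.org/search/authors.json?q=Martha%20Nussbaum
// (endpoint and field names as discussed above; verify against the live API).
function extractAuthors(response) {
  return (response.docs || []).map(doc => ({
    olib_author_key: doc.key,                       // e.g. "OL262048A"
    names: [doc.name].concat(doc.alternate_names || [])
  }));
}

// Hand-made sample in the documented shape (not real API output):
const sample = {
  numFound: 1,
  docs: [{
    key: "OL262048A",
    name: "Martha Nussbaum",
    alternate_names: ["Martha C. Nussbaum", "Martha Craven Nussbaum"]
  }]
};
```

Each extracted entry could then seed both the olib_author_key field and the list of alternate names to match against.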

More links:

As I wrote: “Just some thoughts”
-m

@Scott_Sauyet This seems very promising (though I’m not able to check out the details right now)! The idea of a sliding threshold is even more flexible than the strong and weak levels I imagined.

I’ve been working on a function-based solution, which (as you do above) prioritizes last name matches, then compatible names/initials — but it’s slow going (largely because I’m using this as an excuse to learn my way around the power of functions)!

A very generous batch of thoughts indeed!

I think your thoughts are leaning in the direction of a powerful web-connected long-term “sturdy” solution, while I was imagining something near the other end of the spectrum: a quick-and-clever local compatibility filter that works on the records one “just has” at the moment.

Excellent bibliographic databases (for institutions, for example) will indeed have a unique author id key (and will not spit out mere initials for author first and middle names). Yet such solutions take serious time (even research) to set up.

To stand on the shoulders of public database-builders is a wise approach — to the extent that their catalogue covers all or most of one’s area of interest…

Yet I’m guessing that a good solution that works with an online resource would ideally involve an API integration (quite intimidating to me!). Perhaps an iframe with an automatic query string would be a middle road to a manageable workflow…

Ah yes. Bibtex distinguishes four elements:

  • First (plus middle etc bundled in)
  • Last name / surname
  • “von”-type prefixes to last name
  • “Jr”-type suffixes

Those are the elements I’m slowly trying to build functions to parse more or less successfully. I’m in the process of troubleshooting this approach. I AM having success attending to the presence of multiple names in a field (by adding an order parameter to the function, which defaults to 1). I’m as far as getting most Last, First and First Last names parsed into those elements (such that the functions will then be available to filters, within which we can further check initial compatibility etc.), but I haven’t yet gotten to pulling out von or Jr components directly.
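For what it’s worth, the four elements can be sketched in JavaScript using a simplified version of the usual BibTeX conventions (“First von Last”, “von Last, First”, “von Last, Jr, First”; here, lowercase words adjacent to the surname are treated as the von part — the real BibTeX rules have more corner cases):

```javascript
// Simplified BibTeX name-part splitter: First / von / Last / Jr.
function isLowerWord(w) {
  return w.length > 0 && w[0] === w[0].toLowerCase()
      && w[0] !== w[0].toUpperCase();   // excludes digits/punctuation-only
}

function parseBibtexName(name) {
  const parts = name.split(",").map(s => s.trim());
  let first = "", von = "", last = "", jr = "";
  if (parts.length === 1) {
    // "First von Last": last word is Last; the lowercase run before it is von
    const words = parts[0].split(/\s+/);
    last = words.pop() || "";
    const vonWords = [];
    while (words.length && isLowerWord(words[words.length - 1])) {
      vonWords.unshift(words.pop());
    }
    von = vonWords.join(" ");
    first = words.join(" ");
  } else {
    // "von Last, First" or "von Last, Jr, First"
    first = parts[parts.length - 1];
    if (parts.length >= 3) jr = parts[1];
    const words = parts[0].split(/\s+/);
    const vonWords = [];
    while (words.length > 1 && isLowerWord(words[0])) {
      vonWords.push(words.shift());
    }
    von = vonWords.join(" ");
    last = words.join(" ");
  }
  return { first, von, last, jr };
}
```

This handles “Simone de Beauvoir”, “García Márquez, Gabriel”, and “King, Jr., Martin Luther” as the thread discusses, while leaving trickier cases (interior lowercase runs, braced groups) for later.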

Commendable endeavor, @Springer. I’d vouch (and trust) that there are more reading and watching this thread than are actually chiming in, at this point.

Power to you, and keep on keeping on.


@Springer Re the OP, I meant to point out that there is some “shared art”, surrounding “virtual tiddler”, dating back late 2022. Not sure if you and/or other folks are aware of it:

@Mohammad do you have a status report? Is this still on the table as a PR? (Answer in a new thread, please, so as not to derail this one. I’ll delete this post when you do.)

EDIT: changed “prior art” to “shared art”


I think I am the OG on this one. :slight_smile:

(At the very least, I’ve been harnessing this idea — “A so-called ‘Missing’ tiddler can serve as very much visible and useful GUI node” — technique for years, though I can’t find a good smoking-gun “OG” post about it.)


Oh, me too. I’ve been dreaming about “in-memory tiddlers” for five hundred years :wink:

Post updated.

For example, here’s a Dec 2021 post (ported from google groups) in which I’m talking about my practice of sending permalinks to these useful “missing” tiddlers:

in a number of my teaching-related tiddlywiki instances, I’ve zapped the Missing Tiddler hint (as well as “empty filter” message on Shiraz dynamic tables…). I do this because I have set up ViewTemplate nodes for missing tiddlers (<$list filter="[all[current]is[missing]]"> ... </$list> ) , and don’t want visitors to be distracted by the missing tiddler message.