Parse Full Names into First, Middle and Last Names

Ref: A Filter Question: “Fullname, F. M.” - Discussion - Talk TW (tiddlywiki.org)

Question 1: How can I split a full name into first, middle and last names? The code should also handle names that contain a nobiliary particle.

Example i:
“Ludwig van Beethoven” shall be parsed into

Ludwig
van Beethoven

Example ii
“Michael Joseph Jackson” shall be parsed to

Michael
Joseph
Jackson

I started with some KISS code like this:

\define parse-fullname(name)
<$let name=<<__name__>>
      pattern1="\b(?=[a-z])"
      pattern2="\s"
>
<$list filter="[<name>splitregexp<pattern1>trim[]] :filter[<name>splitregexp<pattern1>trim[]count[]compare:integer:gteq[2]]">
<$text text=<<currentTiddler>> /><br/>
</$list>
<$list filter="[<name>splitregexp<pattern2>trim[]] :filter[<name>splitregexp<pattern1>trim[]count[]compare:integer:eq[1]]">
<$text text=<<currentTiddler>> /><br/>
</$list>

</$let>
\end

# <<parse-fullname "Ludwig van Beethoven">>

# <<parse-fullname "Michael Joseph Jackson">>

# <<parse-fullname "Jeremy Ruston">>

This produces the correct output:

  1. Ludwig
    van Beethoven

  2. Michael
    Joseph
    Jackson

  3. Jeremy
    Ruston

I am looking for a simpler solution.

Is there a better regex that parses the full name in one go?

\define parse-name(name)
<$let br="<br>" pname={{{ [<__name__>split[ ]join<br>] }}}>
<<pname>>
</$let>
\end

<<parse-name "Ludwig van Beethoven">>
<<parse-name "Michael Joseph Jackson">>
<<parse-name "Jeremy Ruston">>

Thank you @CodaCoder.
Please note that the surname van Beethoven must not be broken down further into van and Beethoven.
The code above cannot detect a nobiliary particle. See Example i in my original post above!

\define parse-fullname(name)
<$let name=<<__name__>>
      pattern1="\s+([a-z][a-zA-Z]*?)\s+"
>

<$list filter="""
[<name>search-replace:g:regexp<pattern1>,[ $1_]]
+[splitregexp[\s]trim[]]
+[search-replace:g:regexp[_],[ ]]
"""
>
<$text text=<<currentTiddler>> /><br/>
</$list>
</$let>
\end

I don’t have a solution, but I share the challenge, and I frequently need to use databases to parse name rosters in various orders. So I’d love a solution…

Alas, I think the solution needs to be complex, because of language differences.

Mohammad’s KISS solution (and Mark’s follow-up) parses “van” and “von” as part of the last name, which solves some problems for some languages… But when last names are alphabetized, the lower-case bits can get in the way. Simone de Beauvoir is alphabetized under B.

More complex still, if you encounter José Ortega y Gasset (for example), the " y " in the name should trigger the solution to capture “Ortega y Gasset” as one compound last name. Alas, not all Spanish compound last names use the “y” (for “and”). Sometimes people don’t use a hyphen either. (Gabriel García Márquez has the last name of García Márquez, to be alphabetized under Gar…).

So, I think the best solutions end up having one or more of these three parts:

First, there are some relatively easy automatic steps (like looking for non-capitalized short words, and treating any following space as a non-breaking space).

Second, we can resolve to manually splice an invisible “last name starts here” marker into certain actual names as they are input (and/or invisibly converting intra-last-name spaces into non-breaking spaces).

Third, you might want a dictionary-like table of actual names that tend to break the pattern, together with the proper parsing into first and last — useful if you’re working with a set of authors or public figures whose names might come up frequently.
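To make the first two parts concrete, here is a minimal Python sketch (not wikitext, and not anyone's actual solution from this thread). It assumes a hypothetical `|` character as the manually typed “last name starts here” marker, and it joins the last-name words directly instead of inserting non-breaking spaces. The function name `split_name` is made up for illustration.

```python
# Hypothetical marker a user could type to say "last name starts here".
LAST_NAME_MARKER = "|"


def split_name(full_name: str) -> tuple[str, str]:
    """Return (given_names, last_name) using two simple rules."""
    # Manual rule: an explicit marker always wins over the heuristic.
    if LAST_NAME_MARKER in full_name:
        given, last = full_name.split(LAST_NAME_MARKER, 1)
        return given.strip(), last.strip()

    words = full_name.split()
    if len(words) < 2:
        # A single word (or nothing) is treated as a last name only.
        return "", full_name.strip()

    # Automatic rule: the last name starts at the first word that begins
    # with a lowercase letter (a particle such as "van" or "de").
    # The first and the final word are never treated as particles.
    for i in range(1, len(words) - 1):
        if words[i][0].islower():
            return " ".join(words[:i]), " ".join(words[i:])

    # Fallback: only the final word is the last name.
    return " ".join(words[:-1]), words[-1]


print(split_name("Ludwig van Beethoven"))   # ('Ludwig', 'van Beethoven')
print(split_name("José Ortega y Gasset"))   # ('José Ortega', 'y Gasset')
print(split_name("José |Ortega y Gasset"))  # ('José', 'Ortega y Gasset')
```

The second call shows the automatic rule getting a Spanish compound name wrong, which is exactly the case where the manual marker (or a table of known names) has to step in.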

This would be a great little nugget to have a shared solution for. And actually, I’m sure reference librarians have figured it out. Anyone know a coding librarian?

-Springer


As you may know, I am working on Refnotes, and it is difficult to handle such cases with wikitext alone. Fortunately, BibTeX has a second format that puts a comma between name parts, but I am still trying to parse names that use a space as the delimiter.

I am not good at regular expressions, but @Mark_S will absolutely have some solutions.

For a case like Gabriel García Márquez, we have to know whether the middle part is a middle name or the first part of the last name. In Persian we also have compound last names. I think a space is not a good delimiter here!

@Mohammad the truth is when information is stored it should be done so with either

  • an independent field for each separate element
  • OR an unambiguous set of delimiters

If the above is not the case then there is an argument there is missing information.

  • In this case we need to clean or delimit the source data so it complies
  • This can often be done once and manual intervention is easier than building an algorithm as it will never be 100% accurate for all possible future data
  • If you have to keep reading the data with missing information (not fix it at source)
    • With manual intervention you could record the translation between the source and reformatted data to use as a lookup table for previous names
    • Then for each name you can check this lookup and if not recorded prompt the user to delineate it correctly.
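The lookup-then-prompt idea in the last two bullets could look roughly like this in Python. It is only a sketch: `resolve_name`, the `lookup` dictionary and the `ask_user` callback are all hypothetical names, and `naive_prompt` merely stands in for a real user prompt.

```python
from typing import Callable

Parts = tuple[str, str]  # (given names, last name)


def resolve_name(
    raw: str,
    lookup: dict[str, Parts],
    ask_user: Callable[[str], Parts],
) -> Parts:
    """Return the recorded split for `raw`, asking once if it is unknown."""
    if raw not in lookup:
        # Manual intervention happens only the first time a name is seen;
        # the answer is recorded so later reads need no prompt.
        lookup[raw] = ask_user(raw)
    return lookup[raw]


# Previously recorded translations from source form to split form.
known = {"Gabriel García Márquez": ("Gabriel", "García Márquez")}


def naive_prompt(raw: str) -> Parts:
    # Stand-in for a real prompt: simply splits at the last space.
    given, _, last = raw.rpartition(" ")
    return given, last


print(resolve_name("Gabriel García Márquez", known, naive_prompt))
print(resolve_name("Jeremy Ruston", known, naive_prompt))
```

After the second call, `known` also contains an entry for "Jeremy Ruston", so the same name would not trigger the prompt again.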

This is the standard for the author field in BibTeX:

So, as the documentation indicates, the information missing from inbound names is the algorithm or “standard” used:

BibTeX divides a person’s name into four parts:

  • First: First names or given names
  • Last: Last name or family name
  • von: a particle (e.g., de, de la, der, van, von)
  • jr: a suffix (e.g., Jr., Sr., III)

BibTeX’s internal name parser knows three ways these name parts can be combined:

  • Method 1: First von Last
  • Method 2: von Last, First
  • Method 3: von Last, Jr, First
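As a rough illustration of how the three layouts can be told apart by counting commas, here is a simplified Python sketch. It is not BibTeX's real parser: it ignores braces and multiple authors joined by “and”, and it assumes a particle is simply a run of words starting with a lowercase letter. `Name` and `parse_bibtex_name` are hypothetical names.

```python
from typing import NamedTuple


class Name(NamedTuple):
    first: str
    von: str
    last: str
    jr: str


def _split_von_last(words: list[str]) -> tuple[str, str]:
    """Split leading lowercase-initial words (the particle) from the last name."""
    i = 0
    # Always leave at least one word for the last name.
    while i < len(words) - 1 and words[i][0].islower():
        i += 1
    return " ".join(words[:i]), " ".join(words[i:])


def parse_bibtex_name(name: str) -> Name:
    """Parse one name written in Method 1, 2 or 3 (simplified rules)."""
    parts = [p.strip() for p in name.split(",")]
    if len(parts) == 1:
        # Method 1: First von Last
        words = parts[0].split()
        i = 0
        # First name: leading capitalised words, keeping one word for Last.
        while i < len(words) - 1 and not words[i][0].islower():
            i += 1
        von, last = _split_von_last(words[i:])
        return Name(" ".join(words[:i]), von, last, "")
    if len(parts) == 2:
        # Method 2: von Last, First
        von, last = _split_von_last(parts[0].split())
        return Name(parts[1], von, last, "")
    if len(parts) == 3:
        # Method 3: von Last, Jr, First
        von, last = _split_von_last(parts[0].split())
        return Name(parts[2], von, last, parts[1])
    raise ValueError(f"too many commas in name: {name!r}")


print(parse_bibtex_name("Ludwig van Beethoven"))
print(parse_bibtex_name("García Márquez, Gabriel"))
```

Note how the comma in the second call keeps “García Márquez” together as the last name, something the space-delimited form cannot express on its own.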

However, it seems to me that to apply this algorithm properly to the data, the list of “particles” and “suffixes” needs to be complete, and I do not see Mr, Master, Mz, or Mac, Mc.

The good news is that you can import BibTeX data from different sources in a better format!
As described in “BibTeX field: author [with examples]” (BibTeX.com), Methods 2 and 3 are much easier to process, and fortunately several common tools such as Google Scholar export BibTeX data using Method 2.

In bibliography standards such as APA and MLA, it is not common to use prefixes like Mr and Ms.

My workflow is as follows.

  • Use the reference manager Zotero to manage all references. All required information is obtained from websites or other sources. Zotero provides a Chrome plugin that downloads reference information and full papers (PDF files) directly into a database.

  • Export the references in BibTeX format from the reference manager. Better BibTeX is a very useful Zotero plugin for exporting BibTeX to a file or to the clipboard.

  • Import the BibTeX into TiddlyWiki. The paste bin in the Refnotes plugin makes it easy to import BibTeX from the clipboard.

  • Copy the cite key from Zotero, as it is much easier to find the right reference among thousands of entries in Zotero.

Other software (e.g. EndNote, Mendeley) should have similar features.

As far as I know, Zotero and Mendeley use Method 2 (i.e. von Last, First) as the author format.


Hi Bangyou,
Nice description! I added it to the Refnotes docs as a sample workflow.

Zotero and Better BibTeX should be configured to export BibTeX into Refnotes. I wrote a post to document how I use Zotero with Refnotes.

Feel free to reuse or modify it anywhere you like. You can find the images, tid and markdown files in my repository. Please note that the image path is not correct in the tid file. I use node.js and put all images under ./files/images/.


Thank you! I cannot open the post but I saw the repo.