Parse Full Names into First, Middle and Last Names

Mohammad · June 2, 2022, 12:55pm

Ref: A Filter Question: “Fullname, F. M.” - Discussion - Talk TW (tiddlywiki.org)

Question 1: How to split a full name into first, middle and last names? The code should also handle names that contain a nobiliary particle.

Example i:
“Ludwig van Beethoven” shall be parsed into

Ludwig
van Beethoven

Example ii:
“Michael Joseph Jackson” shall be parsed into

Michael
Joseph
Jackson

I started with some KISS code like this

\define parse-fullname(name)
<$let name=<<__name__>>
      pattern1="\b(?=[a-z])"
      pattern2="\s"
>
<$list filter="[<name>splitregexp<pattern1>trim[]] :filter[<name>splitregexp<pattern1>trim[]count[]compare:integer:gteq[2]]">
<$text text=<<currentTiddler>> /><br/>
</$list>
<$list filter="[<name>splitregexp<pattern2>trim[]] :filter[<name>splitregexp<pattern1>trim[]count[]compare:integer:eq[1]]">
<$text text=<<currentTiddler>> /><br/>
</$list>

</$let>
\end

# <<parse-fullname "Ludwig van Beethoven">>

# <<parse-fullname "Michael Joseph Jackson">>

# <<parse-fullname "Jeremy Ruston">>

This produces the correct output:

Ludwig
van Beethoven
Michael
Joseph
Jackson
Jeremy
Ruston

I am looking for a simpler solution.

Is there a better regex that parses full names in one go?

CodaCoder · June 2, 2022, 1:19pm

\define parse-name(name)
<$let br="<br>" pname={{{ [<__name__>split[ ]join<br>] }}}>
<<pname>>
</$let>
\end

<<parse-name "Ludwig van Beethoven">>
<<parse-name "Michael Joseph Jackson">>
<<parse-name "Jeremy Ruston">>

Mohammad · June 2, 2022, 1:22pm

Thank you @CodaCoder,
Please note that the surname van Beethoven must not be broken down further into van and Beethoven.
The above code cannot detect a nobiliary particle. See Example i in my original post above!

Mark_S · June 2, 2022, 3:33pm

\define parse-fullname(name)
<$let name=<<__name__>>
      pattern1="\s+([a-z][a-zA-Z]*?)\s+"
>

<$list filter="""
[<name>search-replace:g:regexp<pattern1>,[ $1_]]
+[splitregexp[\s]trim[]]
+[search-replace:g:regexp[_],[ ]]
"""
>
<$text text=<<currentTiddler>> /><br/>
</$list>
</$let>
\end
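
For anyone who wants to check the idea outside TiddlyWiki, here is a rough Python sketch of the same three steps (glue a lowercase-initial word to the word after it with a placeholder, split on whitespace, then restore the space). It reuses pattern1 from the macro; the function name `parse_fullname` and the use of Python are only for illustration and are not part of the macro itself.

```python
import re

# Same pattern as pattern1 in the macro: a lowercase-initial word
# surrounded by whitespace, treated here as a nobiliary particle.
PARTICLE = re.compile(r"\s+([a-z][a-zA-Z]*?)\s+")


def parse_fullname(name):
    """Split a full name on whitespace, keeping a particle with the next word."""
    # Step 1: replace the space after the particle with a placeholder "_".
    protected = PARTICLE.sub(r" \1_", name)
    # Step 2: split on whitespace and drop empty pieces.
    parts = [p.strip() for p in re.split(r"\s", protected) if p.strip()]
    # Step 3: turn the placeholder back into a space.
    return [p.replace("_", " ") for p in parts]


for full in ("Ludwig van Beethoven", "Michael Joseph Jackson", "Jeremy Ruston"):
    print(parse_fullname(full))
```

Two limits of this sketch: a name that already contains an underscore would be altered in step 3, and with two particles in a row (for example "de la") only the first is glued, because regex matches do not overlap.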

Springer · June 2, 2022, 3:34pm

I don’t have a solution, but I share the challenge, and I frequently need to use databases to parse name rosters in various orders. So I’d love a solution…

Alas, I think the solution needs to be complex, because of language differences.

Mohammad’s KISS solution (and Mark’s follow-up) parses “van” and “von” as part of the last name, which solves some problems for some languages… But when last names are alphabetized, the lower-case bits can get in the way. Simone de Beauvoir is alphabetized under B.

More complex still, if you encounter José Ortega y Gasset (for example), the " y " in the name should trigger the solution to capture “Ortega y Gasset” as one compound last name. Alas, not all Spanish compound last names use the “y” (for “and”). Sometimes people don’t use a hyphen either. (Gabriel García Márquez has the last name of García Márquez, to be alphabetized under Gar…).

So, I think the best solutions end up having one or more of these three parts:

First, there are some relatively easy automatic steps (like looking for non-capitalized short words, and treating any following space as a non-breaking space).

Second, we can resolve to manually splice an invisible “last name starts here” marker into certain actual names as they are input (and/or invisibly converting intra-last-name spaces into non-breaking spaces).

Third, you might want a dictionary-like table of actual names that tend to break the pattern, together with the proper parsing into first and last — useful if you’re working with a set of authors or public figures whose names might come up frequently.
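
The three parts above could be combined roughly as in the Python sketch below. Everything named in it is an assumption made for illustration: the `KNOWN_NAMES` table, the visible `|` character standing in for the invisible "last name starts here" marker, and the function name `split_first_last`. The automatic step only looks at lower-case words directly before the final word, so it will still get names like García Márquez wrong unless they are in the table or carry a marker.

```python
# Hypothetical marker typed in front of the last name when the name is input.
MARKER = "|"

# Hypothetical lookup table of names that break the automatic pattern.
KNOWN_NAMES = {
    "Gabriel García Márquez": ("Gabriel", "García Márquez"),
}


def split_first_last(name):
    """Return (first names, last name) using table, marker, then heuristic."""
    name = " ".join(name.split())  # normalise whitespace

    # Part three: known exceptions win over everything else.
    if name in KNOWN_NAMES:
        return KNOWN_NAMES[name]

    # Part two: an explicit marker says where the last name starts.
    if MARKER in name:
        first, last = name.split(MARKER, 1)
        return first.strip(), last.strip()

    # Part one: the last word is the last name, extended backwards over
    # lower-case words such as "de" or "van" (never over the first word).
    words = name.split()
    start = len(words) - 1
    while start > 1 and words[start - 1][0].islower():
        start -= 1
    return " ".join(words[:start]), " ".join(words[start:])


print(split_first_last("Simone de Beauvoir"))
print(split_first_last("José |Ortega y Gasset"))
print(split_first_last("Gabriel García Márquez"))
```

Alphabetizing Simone de Beauvoir under B would need one more step on top of this, for example stripping the leading lower-case words from the last name when building the sort key.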

This would be a great little nugget to have a shared solution for. And actually, I’m sure reference librarians have figured it out. Anyone know a coding librarian?

-Springer

Mohammad · June 2, 2022, 3:49pm

As you may know, I am working on Refnotes, and it is difficult to handle such cases with wikitext alone. Fortunately, bibtex has a second format that uses a comma between name parts, but I am still trying to parse names with a space as the delimiter.

I am not good at regular expressions, but @Mark_S will absolutely have some solutions.

For a case like Gabriel García Márquez, we have to know whether the middle part is a middle name or the first part of the last name. In Persian we also have compound last names. I think a space is not a good delimiter here!

TW_Tones · June 3, 2022, 4:56am

@Mohammad the truth is that when information is stored, it should be stored with either

- an independent field for each separate element
- OR an unambiguous set of delimiters

If neither is the case, then there is an argument that information is missing.

In this case we need to clean or delimit the source data so that it complies.
This can often be done once, and manual intervention is easier than building an algorithm, since an algorithm will never be 100% accurate for all possible future data.
If you have to keep reading data with missing information (and cannot fix it at the source):
- With manual intervention you could record the translation between the source and the reformatted data, and use it as a lookup table for names seen before.
- Then for each name you can check this lookup and, if the name is not recorded, prompt the user to delineate it correctly.

Zheng_Bangyou · June 3, 2022, 5:02am

This is the standard for the author field in bibtex:

TW_Tones · June 3, 2022, 5:12am

So, as the doco indicates, the missing information for inbound names is the algorithm or “standard” used:

BibTeX divides a person’s name into four parts:

First: First names or given names
Last: Last name or family name
von: a particle (e.g., de, de la, der, van, von)
jr: a suffix (e.g., Jr., Sr., III)

BibTeX’s internal name parser knows three ways these name parts can be combined:

Method 1: First von Last
Method 2: von Last, First
Method 3: von Last, Jr, First

However, it seems to me that to apply this algorithm to the data appropriately, the list of “particles” and “suffixes” needs to be complete, and I do not see Mr, Master, Mz OR Mac, Mc.
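
To make the three combinations listed above concrete, here is a simplified Python sketch. It is not BibTeX's actual parser: it assumes the von part can be recognised as the lower-case words at the start of the "von Last" section, it ignores braces and other BibTeX special cases, and the function name `parse_bibtex_name` is made up for this example.

```python
def parse_bibtex_name(name):
    """Split one author name into first, von, last and jr (simplified)."""
    parts = [p.strip() for p in name.split(",")]

    if len(parts) == 1:
        # Method 1: First von Last
        words = parts[0].split()
        # The von part starts at the first lower-case word that is not the final word.
        von_start = next(
            (i for i, w in enumerate(words[:-1]) if w[0].islower()), None
        )
        if von_start is None:
            first, rest = words[:-1], words[-1:]
        else:
            first, rest = words[:von_start], words[von_start:]
        jr = ""
    elif len(parts) == 2:
        # Method 2: von Last, First
        rest, first, jr = parts[0].split(), parts[1].split(), ""
    else:
        # Method 3: von Last, Jr, First
        rest, jr, first = parts[0].split(), parts[1], parts[2].split()

    # Inside "von Last": leading lower-case words are von; keep one word for Last.
    split = 0
    while split < len(rest) - 1 and rest[split][0].islower():
        split += 1

    return {
        "first": " ".join(first),
        "von": " ".join(rest[:split]),
        "last": " ".join(rest[split:]),
        "jr": jr,
    }


print(parse_bibtex_name("Ludwig van Beethoven"))
print(parse_bibtex_name("van Beethoven, Ludwig"))
print(parse_bibtex_name("de la Cruz, Jr., Juan"))
```

In this sketch the comma forms need no guessing about where the first names end, which is why Methods 2 and 3 are easier to process than Method 1.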

Mohammad · June 3, 2022, 5:15am

The good news is that you can import bibtex data from different sources in a better format!
As described here: BibTeX field: author [with examples] - BibTeX.com, Methods 2 and 3 are much easier to process, and fortunately several common tools like Google Scholar export bibtex data using Method 2.

Mohammad · June 3, 2022, 5:16am

In bibliography standards like APA and MLA, it is not common to use prefixes like Mr and Ms.

Zheng_Bangyou · June 3, 2022, 5:47am

My workflow is as follows.

Use the reference manager Zotero to manage all references. All required information is obtained from websites or other sources. Zotero provides a Chrome plugin to download reference information and full papers (PDF files) directly into a database.
Export references in bibtex format from the reference manager. Better BibTeX is a very useful Zotero plugin for exporting bibtex to a file or the clipboard.
Import the bibtex into TiddlyWiki. The paste bin in the Refnotes plugin is very useful and makes it easy to import bibtex from the clipboard.
Copy the cite key from Zotero, as it is much easier to find the right reference among thousands of items in Zotero.

Other software (e.g. Endnote, Mendeley) should have similar features.

As far as I know, Zotero and Mendeley use Method 2 (i.e. von Last, First) as the author format.

Mohammad · June 3, 2022, 6:01am

Hi Bangyou,
Nice description! I added it to the Refnotes doc as a sample workflow.

Zheng_Bangyou · June 3, 2022, 11:08am

Zotero and betterbibtex should be configured to export bibtex into Refnotes. I wrote a post to document how I use Zotero with Refnotes.

Feel free to reuse or modify it anywhere you like. You can find the images, tid and markdown files in my repository. Please note that the image paths are not correct in the tid file. I use node.js and put all images under ./files/images/.

Mohammad · June 3, 2022, 4:02pm

Thank you! I cannot open the post but I saw the repo.