Parsing plain text

I would like to parse plain text data like that below using TW.

From what I can see you can’t use regular expressions to chop text up (only to match).

I must say I haven’t even been able to split the text into lines!

The desired output is a single data tiddler with fields like Name, Doctor, FEV1 etc.

Can someone get me started?

Thanks

Nick

Name: DOE,JOHN    ID: T123456
Doctor: Riviera, Nick, Dr    Height: 162.00 cm    Age: 65
Technician: Bob, Sideshow Weight: 73.50 kg    Gender: Female
Visit Date: 11/01/2023    Time: 08:42    Race: White
Diagnosis: 
Dyspnea: 
Cough: 
Wheeze: 
Years Quit:     Packs/Day: 
Years Smoked:     Tobacco Product: 
Comments: FVC likely sub-maximal due to coughing and some glottis closure so 
please view with caution. 
Variable lung volumes technique so only 2 technically acceptable attempts 
achieved today so please view with caution. 
Variable transfer factor technique, only 1 technically acceptable attempt with 
the sample value only being 450ml. 
Hb was tested using a HemoCue in the department today.   
A six minute walk test was also performed.


                                             PRE-BRONCH              POST-BRONCH
                                   Meas     Prd    %Prd     Meas    %Prd    %Chg
SPIROMETRY
FVC (L)                            1.18    2.96      39                         
FEV1 (L)                           0.97    2.31      41                         
FEV1/FVC                           0.82    0.79     104                         
FEF25 (L/sec)                      3.13    4.82      64                         
FEF50 (L/sec)                      2.22    3.46      64                         
FEF75 (L/sec)                      0.32    0.53      61                         
FEF25-75 (L/sec)                   1.28    2.02      63                         
PEF L/s (L/sec)                    3.77    5.91      63                         
FIVC (L)                           0.70                                         
FIF50 (L/sec)                      0.86    3.37      25                         
PIF (L/sec)                        1.17                                         

LUNG VOLUMES
SVC (L)                            1.47    2.96      49                         
IC (L)                             0.58    1.97      29                         
ERV (L)                            0.79    0.99      80                         

@NickB TiddlyWiki is quite capable of parsing text in multiple ways, even without delving into the core or writing JavaScript functions. The problem is that its capabilities are so broad that, in practice, the complexity arises as much from the nature and quality of the data.

  • Is this a one-off? Then manually prepare the data before making a converter.
    • There are editors with many tools that help with manually preparing the input data, e.g. Notepad++.
    • For example, your data has two sections: a list of fields and values, and a table to parse.
    • For the table, create a new column in your test data and move/copy the section headings such as SPIROMETRY into each row.
  • If you want to automate the whole process, investigate the different ways to obtain the data in as close to a usable format as you can, e.g. look at CSV as a common format.
  • Most things you can do manually can be done programmatically, but not always, so ensuring the input data has a reasonable structure, ideally from the source, is important. For example, your source data is “not that good” because it is not well structured.

So you have quite a journey ahead whatever tool you use, so here are some quick tips:

  • Look at the features of the JSON Mangler plugin, especially for CSV files.
  • Consider transferring the data into tiddlers and then interrogating the tiddlers, as this is TiddlyWiki’s native way, e.g. one tiddler per item, then field/value pairs.
  • One key filter is "[[tiddlername]get[text]splitregexp[\n]]", which returns each line of tiddlername; from there you can write a response to each line.
  • Use nested lists to structure your parsing, trust me on this one.
  • In the above example you could then take the each-line variable from the above filter and, for the top portion, split on “:” using first[] to get the field name and rest[] for the value, and later on split on spaces or add the “:” to the data.
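
A minimal sketch of how the last three tips might fit together, untested against real exports. It assumes the raw report is stored in the text field of a tiddler called spirodata (a placeholder title), that the header block ends where PRE-BRONCH first appears, and that neighbouring fields on a line are separated by four or more spaces:

```
<!-- outer list: one pass per line of the header block -->
<$list filter="[[spirodata]get[text]split[PRE-BRONCH]first[]splitregexp[\n]]" variable="line">
<!-- inner list: break the line into "Field: value" chunks, keeping only chunks that contain a colon -->
<$list filter="[<line>splitregexp[ {4,}]trim[]regexp[:]]" variable="pair">
<!-- text before the first colon is the field name; everything after it is the value -->
<$vars fieldname={{{ [<pair>split[:]first[]trim[]] }}} value={{{ [<pair>split[:]rest[]join[:]trim[]] }}}>
<div><<fieldname>> = <<value>></div>
</$vars>
</$list>
</$list>
```

This only displays the pairs. Lines without a “:” (such as the continuation lines of Comments) are skipped and would need their own handling, and “Technician: Bob, Sideshow Weight: 73.50 kg” stays as one pair because only a single space separates the two fields. The join[:] after rest[] puts back any “:” that belongs to the value, as in “Time: 08:42”.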

Here is an example of manually preparing the data to make it easier to parse:

Name: DOE,JOHN    
ID: T123456
Doctor: Riviera, Nick, Dr    
Height: 162.00 cm    Age: 65
Technician: Bob, Sideshow Weight: 73.50 kg    
Gender: Female
Visit Date: 11/01/2023    Time: 08:42    Race: White
Diagnosis: 
Dyspnea: 
Cough: 
Wheeze: 
Years Quit:     Packs/Day: 
Years Smoked:     Tobacco Product: 
Comments: FVC likely sub-maximal due to coughing and some glottis closure so 
please view with caution. 
Comments: Variable lung volumes technique so only 2 technically acceptable attempts 
achieved today so please view with caution. 
Comments: Variable transfer factor technique, only 1 technically acceptable attempt with the sample value only being 450ml. 
Comments: Hb was tested using a HemoCue in the department today.   
Comments: A six minute walk test was also performed.

Even this needs special treatment because there are multiple comments. Did the source data have a timestamp you could use as a unique key?

If and when you seek further help, a JSON file of test data (in tiddlers) shared here will make a big difference.


To do it with wikitext, I’d say it is critical that the format in the original text is very consistent.

But there is a completely different way. I’ve considered posting about it for some time and your question pushes me to do it. At this very moment I am too tired, but I will probably post over the weekend. It will demand quite a bit of dev work, but I’m pretty sure you’d love the outcome if it actually is implemented.


@twMat I look forward to your ideas.

Perhaps share the scope of its application when you do: what kind of data it is suitable for, what volume of data, and whether it is a manual assist or an automatic process, etc.

Thanks @TW_Tones, the filter was what I needed!

Thousands of imports/year so manual intervention not feasible. Would ideally be dragging and dropping from a directory.

No control over formatting alas…

@twMat you’ve piqued my interest!

Here’s a start … easier than I thought!

I gotta learn that regEx!

This code exports the source data in [[spirodata]] into a csv tiddler when you click a button.

My first time using wikify in earnest!

\define patternHeading() \n[A-Z ]+\n
\define csv()
<$list filter="[[spirodata]get[text]split[SPIROMETRY]last[]trim[]]" variable=results><$list filter="[<results>splitregexp<patternHeading>trim[]]" variable=block><$list filter="[<block>splitregexp[\n]]" variable=line>{{{[<line>splitregexp[ {4,}]trim[]join[,]trim[,]]}}}
</$list></$list></$list>
\end

<$wikify name=c text="""<<csv>>""">
<$button>Create
<$action-createtiddler $basetitle="data" text=<<c>>/>
</$button>
</$wikify>
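
To see what the innermost filter does on its own, one row can be fed through it by hand. A small sketch, with the row copied from the report into a variable called row (a name made up for this example):

```
<!-- "row" is a hypothetical variable holding one table line, trailing spaces included -->
<$vars row="FEV1 (L)                           0.97    2.31      41                         ">
<$text text={{{ [<row>splitregexp[ {4,}]trim[]join[,]trim[,]] }}}/>
</$vars>
```

This should show FEV1 (L),0.97,2.31,41. The trailing spaces produce an empty last item, which is why the final trim[,] is there. One thing to watch: a run of spaces covering an empty cell in the middle of a row is treated as a single separator, so later values would shift one column to the left. That doesn’t happen in this sample, where only the trailing cells are empty.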

data tiddler:
[image: the resulting data tiddler]