Searching/indexing web site/page

Bob_Jansen · February 12, 2024, 4:10am

Not specifically a TW topic, but this group seems to be populated with intelligent people with wide experience.

I am reviewing the use of TW for my Central Street Archive TW app (Central Street Archive). My proposed solution, of a tiddler per document populated with tags for each named person, place, organisation and exhibition, is too onerous for me to complete in any reasonable timeframe.

So I am looking at a web site of searchable PDFs covered by a search facility. Now this is the issue: I cannot find a web site search facility (preferably free). I don’t just want to have Google/Bing/etc. index the site, as then search results from this document set get interspersed with results from other web sites/pages.

So, can anyone point me to a site search facility that I could implement?

thanks in advance

bobj

Bob_Jansen · February 12, 2024, 4:38am

I have found the Google Search Console as a way of limiting searches to my web site and am trying that to see if it works for me. If so, then I assume I’ll just have to add a front end to implement Google searches.

linonetwo · February 12, 2024, 5:54am

If you want to search your wiki in Google, you can try

Search your nodejs wiki in Google if you use nodejs based wiki
Use the Site Search Operator if it is indexed by google
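
As a rough illustration of the site search operator, a search box on your own page only needs to build a Google URL whose query starts with `site:` followed by your domain. The sketch below assumes a placeholder domain (`archive.example.com`) and a made-up helper name (`siteSearchUrl`); neither comes from this thread.

```javascript
// Build a Google search URL restricted to one site via the `site:` operator.
// `siteSearchUrl` is a hypothetical helper and the domain is a placeholder.
const siteSearchUrl = (domain, query) =>
  'https://www.google.com/search?q=' +
  encodeURIComponent(`site:${domain} ${query}`)

console.log(siteSearchUrl('archive.example.com', 'exhibition 1995'))
```

This only returns useful results once Google has actually indexed the pages or PDFs on that domain.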

TW_Tones · February 12, 2024, 12:16pm

In addition to linonetwo’s comments, Google has in the past provided a way to embed a search limited to your own site.

You may need to use the $:/tags/RawMarkupWikified/TopBody to define a script, then use a div to display the search box.

I will see if I can share an example tomorrow.

Scott_Sauyet · February 13, 2024, 3:31am

I don’t know your skills (do you program?) or your data, but if you already have this data in another reasonable format, then you might find my workflow useful: For several large wikis, I write simple JS modules to convert the old format to my desired tiddler format, and save the results as .tid files inside a Node wiki, under a specific folder either in tiddlers or plugins. As I build up those modules, I keep overwriting these folders, until I’m happy with the results. While I do this in JS/Node, if you’re a programmer, you should be able to do something similar in nearly any programming language.

I used to see a lot of this, although it seems to be disappearing. One place I still see it is at Search - FactCheck.org. They might serve as a useful model.

Bob_Jansen · February 14, 2024, 1:47am

Scott,

thanks for your thoughts. I have been programming for over 40 years but the problem is that I need to read each document to extract proper nouns. They are not in any other format. This takes time and time and time and I am hoping that I can short circuit things by providing searchable PDFs.

I have converted all PDFs so far (165 of them) to searchable format using PDF2Go.com, have registered the subdomain with Google and now will try and play with searching their content.

bobj

Scott_Sauyet · February 14, 2024, 2:17pm

4 posts were merged into an existing topic: Identifying proper nouns

Bob_Jansen · February 16, 2024, 3:40am

Scott,

could you expand on your notion of developing .tid files as this sounds intriguing.

I assume once we have a tid file we can merely import it into a TW and it would create the tiddler structure, tags and all. Would this also update an existing tiddler so that if the skeleton was already in the TW, this would add missing content?

Can you provide an example of such a tid file?

bobj

Scott_Sauyet · February 16, 2024, 4:17am

It’s about bedtime for me, and I’ll try to write more tomorrow, but very briefly, .tid files are how a wiki is stored when running on Node. You’re right that they can easily be imported into any wiki, but in Node, these files are the tiddlers.

For instance, this tiddler: https://tiddlywiki.com/#Drag%20and%20Drop

Is stored directly in this format:

created: 20170328143119836
modified: 20170328173846754
tags: Features
title: Drag and Drop
type: text/vnd.tiddlywiki

~TiddlyWiki uses drag and drop to power two separate features:

* [[Importing Tiddlers]] into ~TiddlyWiki 
* Manipulating tiddlers within a ~TiddlyWiki 

Tiddler manipulation via drag and drop is supported by the core user interface in the following contexts:

* Entries in the "Open" tab of the sidebar can be reordered by drag and drop; new tiddlers can be opened by dragging their titles into the list
* Entries within a tag pill dropdown can be reordered by drag and drop; new tiddlers can be assigned the tag by dragging their titles into the list
* Entries in the [[control panel|$:/ControlPanel]] "Appearance"/"Toolbars" tab can be reordered by drag and drop. (Less usefully, new entries can be added to the toolbars by dragging their titles into the list)

All tiddler links are draggable by default. They can be dragged within a browser window for manipulating tiddlers, or dragged to a different browser window to initiate an [[import operation|Importing Tiddlers]]

If you want to drag a link, first move it vertically, because horizontal movement is recognized by the browser as text selection.

Tag pills are also draggable, and are equivalent to simultaneously dragging all of the individual tiddlers carrying the tag.

Some common scenarios for drag and drop tiddler manipulation are available as reusable macros:

* [[list-links-draggable Macro]] for reordering the entries in a tiddler ListField
* [[list-tagged-draggable Macro]] for reordering the tiddlers that carry a specified tag

See DragAndDropMechanism for details of how to use the low level drag and drop primitives to build more complex interactions.

The standard HTML 5 drag and drop APIs used by ~TiddlyWiki are not generally available on mobile browsers on smartphones or tablets. The [[Mobile Drag And Drop Shim Plugin]] adds an open source library that implements partial support on many mobile browsers, including iOS and Android.

in this file: https://github.com/Jermolene/TiddlyWiki5/blob/master/editions/tw5.com/tiddlers/features/Drag%20and%20Drop.tid

If your tiddlers are more data-oriented than text-oriented, then it might be simple to build them in a custom transformation from a raw data source, storing them in individual files.
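
To make that idea concrete, here is a minimal sketch of turning one data record into `.tid` text. The helper name `toTid` and the field values are invented for illustration, and it assumes every field value fits on a single line.

```javascript
// Serialize a plain object of fields (plus optional body text) into .tid text.
// `toTid` is a hypothetical helper; it assumes single-line field values.
const toTid = (fields, text = '') => {
  const header = Object.entries(fields)
    .map(([name, value]) => `${name}: ${value}`)
    .join('\n')
  // The text field, if present, follows the other fields after a blank line.
  return text ? `${header}\n\n${text}\n` : `${header}\n`
}

// Hypothetical record, not taken from any real wiki.
console.log(toTid(
  { title: 'Documents/0001', tags: 'Document' },
  'Notes about this document.'
))
```

Writing each resulting string to its own file under the wiki's tiddlers folder is then a single `writeFile` call per record.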

I will try to expand more on this idea after I get some sleep.

Scott_Sauyet · February 16, 2024, 7:35pm

I’m going to describe parts of how I built a wiki for my local political party. I think the analogies should be obvious. But if not, please ask.

Input Data

Among other things I want in my wiki is a list of voters, so we can query various things about them without going out to an external tool. In fact only a few of the intended users of this wiki have access to that tool, but I was able to use it to get an extract, which looks like this:

RawData/VAN_Extract.txt

Voter File VANID|LastName|FirstName|MiddleName|Suffix|Sex|DOB|Age|Party|mAddress|mCity|mState|mZip5|mZip4|mAddressID|Address|City|State|Zip5|Zip4|AddressID|Preferred Phone
123|Bear|Yogi|Da||M|05/01/1958|66|U |123 Jellystone Park |Hanna Barbera|ST|12345|6789|294191234|123 Jellystone Park |Hanna Barbera|ST|12345|6789|294191234|8005551234
234|Rubble|Betty|Marie||F|11/01/1960|64|U |347 Cave Stone Rd |Hanna Barbera|ST|12345|6789|294191491|347 Cave Stone Rd |Hanna Barbera|ST|12345|6789|294191491|8005552345
345|Doo|Scooby|Joseph||M|05/01/1960|63|D |5 Mystery Machine Ave |Hanna Barbera|ST|12345|6789|294191899|5 Mystery Machine Ave |Hanna Barbera|ST|12345|6789|294191899| 
456|Flintstone|Fred|C||M|03/01/1960|64|R |345 Cave Stone Rd |Hanna Barbera|ST|12345|6789|294195505|345 Cave Stone Rd |Hanna Barbera|ST|12345|6789|294195505|8005553456
567|Flintstone|Wilma|J||F|03/01/1960|64|D |345 Cave Stone Rd |Hanna Barbera |ST|12345|6789|294190525|345 Cave Stone Rd |Hanna Barbera|ST|12345|6789|294195505|800055553456

(One row for each of the approximately 2,500 voters in my town.)

Okay, perhaps I changed the details to protect the ~~innocent~~ guilty.

I extracted all the addresses from there and used an online service to get latitude and longitude information for each address, and stored it alongside the above, in this format:

RawData/LatLong.json

{
    "Hanna Barbera": {"latitude": -14.599400, "longitude": -28.673100},
    
    "123 Jellystone Park": {"latitude": -14.592123, "longitude": -28.654321}, 
    "345 Cave Stone Rd": {"latitude": -14.598888, "longitude": -28.684666},
    "347 Cave Stone Rd": {"latitude": -14.597777, "longitude": -28.684567}, 
    "5 Mystery Machine Ave": {"latitude": -14.600600, "longitude": -28.678901} 
}

(and the lat/longs have been moved to the middle of the Atlantic to protect them)
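
As a small sketch of the address extraction step, the following parses a pipe-separated extract and collects the distinct addresses that would then be sent for geocoding. The sample keeps only a few of the real columns, and the names `parsePsv` and `uniqueAddresses` are made up for this illustration; the geocoding call itself is not shown.

```javascript
// A cut-down sample of the extract: header row, then one row per voter.
const sample = [
  'Voter File VANID|LastName|FirstName|Address|City',
  '456|Flintstone|Fred|345 Cave Stone Rd |Hanna Barbera',
  '567|Flintstone|Wilma|345 Cave Stone Rd |Hanna Barbera',
  '234|Rubble|Betty|347 Cave Stone Rd |Hanna Barbera',
].join('\n')

// Turn pipe-separated text into one object per row, keyed by header name.
const parsePsv = (psv) => {
  const [headers, ...rows] = psv
    .split('\n')
    .filter(Boolean)                 // drop empty lines
    .map((line) => line.split('|'))
  return rows.map((cells) =>
    Object.fromEntries(cells.map((cell, i) => [headers[i], cell.trim()])))
}

// Collect each distinct address once, in first-seen order.
const uniqueAddresses = (rows) => [...new Set(rows.map((r) => r.Address))]

console.log(uniqueAddresses(parsePsv(sample)))
// → [ '345 Cave Stone Rd', '347 Cave Stone Rd' ]
```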

Conversion Process

I wrote some Node.js code to parse these two files and convert to the tiddler .tid format, one for each Voter, one for each Address. The specific code is probably not very helpful, but it’s available if you want to see it:

Code

scripts/buildContent.js

const {writeFile, mkdir, rm, readFile: rf} = require ('fs/promises')
const tap = (fn) => (x) => ((fn (x)), x)
const map = (fn) => (xs) => xs .map (x => fn (x))
const call = (fn, ...args) => fn (...args)
const display = msg => tap (() => console .log (msg))
const allPromises = (ps) => Promise .all (ps)
const readFile = (filename) => () => rf(filename, 'utf8')

const main = (fileName, latLong) =>
  deleteOutputDirs()   // ensure there's no detritus from previous runs
    .then (createOutputDirs)
    .then (readFile(fileName))
    .then (delay(500))
    .then (display ('Built directories'))
    .then (psv2arr)
    .then (handleVoters)
    .then (handleAddresses(latLong))
    .then (() => console .log ('Completed!'))
    .catch (console .warn)

const deleteOutputDirs = () => 
    rm ('./tiddlers/HannaBarbera/Voters', {force: true, recursive: true})
    .then (() => rm ('./tiddlers/HannaBarbera/Addresses', {force: true, recursive: true}))
  
const createOutputDirs = () =>
  mkdir ('./tiddlers/HannaBarbera/Addresses', {recursive: true})
  .then (() => mkdir ('./tiddlers/HannaBarbera/Voters', {recursive: true}))

const delay = (t) => (v) => new Promise (r => setTimeout(() => r(v), t))

const psv2arr = ( 
  psv, [headers, ...rows] = psv.split('\n').filter(Boolean).map((r => r.split('|')))
) => rows.map((r) => Object.fromEntries(r.map((c, i) => [headers[i], c.trim()])))

const handleVoters = (rs) => Promise.resolve(rs)
  .then (map(getOverview))
  .then (map(writeTiddler))
  .then (allPromises)
  .then (tap (ps => console .log (`Wrote ${ps.length} Voter tiddlers`)))
  .then (() => rs)

const getOverview = (r) => [
  `./tiddlers/HannaBarbera/Voters/van-${r['Voter File VANID']}-${r.FirstName}_${r.LastName}.tid`,
  convertPerson(r)
]

const convertPerson = r => `title: Voters/${r['Voter File VANID']}
tags: Voter
caption: Voters/${r.FirstName + ' ' + r.LastName + (r.Suffix ? (' ' + r.Suffix) : '')}
first-name: ${r.FirstName}
last-name: ${r.LastName}
middle-name: ${r.MiddleName}
suffix: ${r.Suffix || ''}
full-name: ${r.FirstName + ' ' + r.LastName + (r.Suffix ? (' ' + r.Suffix) : '')}
gender: ${r.Sex}
age: ${r.Age}
party: ${getParty(r.Party)}
phone: ${makePhone(r['Preferred Phone'])}
address: ${r.Address}
`

const makePhone = (p) => p
  ? `${p.slice(0, 3)}-${p.slice(3, 6)}-${p.slice(6, 10)}`
  : ''
 
  
const getParty = (p) => ({
  'U': '', 'D': 'Democratic', 'R': 'Republican', 
  'I': 'Independent', 'G': 'Green', 'L': 'Libertarian',
}) [p]

const handleAddresses = (latLong) => (rs) => 
  Promise.resolve(rs)
    .then (convertAddresses(latLong))
    .then (map(writeTiddler))
    .then (allPromises)
    .then (tap (ps => console .log (`Wrote ${ps.length} Address tiddlers`)))
    .then (() => rs)

const convertAddresses = (latLong) => (rs, loc) => Object .entries (Object .fromEntries (rs.map (r => [ // `entries` dance for uniqueness
`./tiddlers/HannaBarbera/Addresses/${r.Address.replace(/\s/g, '_')}.tid`,

`title: Address/${r.Address}
tags: Address
caption: Address/${r.Address}
address: ${r.Address}
street-number: ${r.Address.split(' ')[0]}
street: ${r.Address.split(' ').slice(1).join(' ').replace(/ (?:Apt.?|#).*$/i, '')}
${addApt(r.Address)
}city: ${r.City}
state: ${r.State}
zip5: ${r.Zip5}
zip4: ${r.Zip4}
${(
  loc = latLong[r.Address.replace(/ (?:Apt.?|#).*$/i, '')] || latLong['Hanna Barbera'], // fall back to the town's coordinates
`lat: ${loc.latitude}
long: ${loc.longitude}
alt: 0`
)}`
])))

const addApt = (a, m = a.match(/ (?:Apt|#) (.*)$/)) => m ? `apt: ${m[1]}
` : ''  

const writeTiddler = ([fileName, content]) => writeFile (fileName, content, 'utf8')

   
main ('./RawData/VAN_Extract.txt', require('../RawData/LatLong.json'))

And I run this code with node scripts/buildContent

Tiddler format

This reasonably simple script creates files that look like this:

tiddlers/HannaBarbera/Voters/van-123-Yogi_Bear.tid

title: Voters/123
tags: Voter
caption: Voters/Yogi Bear
first-name: Yogi
last-name: Bear
middle-name: Da
suffix: 
full-name: Yogi Bear
gender: M
age: 66
party: 
phone: 800-555-1234
address: 123 Jellystone Park

and like this:

tiddlers/HannaBarbera/Addresses/5_Mystery_Machine_Ave.tid

title: Address/5 Mystery Machine Ave
tags: Address
caption: Address/5 Mystery Machine Ave
address: 5 Mystery Machine Ave
street-number: 5
street: Mystery Machine Ave
city: Hanna Barbera
state: ST
zip5: 12345
zip4: 6789
lat: -14.6006
long: -28.678901
alt: 0

This is the .tid format. Fields are given in lines like field-name: Field value. title is a required TW field. tags and caption are very common ones. The others in these samples are specific to my wiki. Note that here I don’t include a text field in these tiddlers, but if you want one, it simply appears last, and without any key, separated from the other fields by a blank line:

title: My Tiddler
tags: Demo [[This is temporary]]

And here we begin the multi-line
`text` field, with //whatever// wikitext
we choose.
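
Read in the other direction, the same rules suggest a very small parser: everything before the first blank line is `name: value` fields, and everything after it is the `text` field. The sketch below is a simplification for illustration only. It is not TiddlyWiki's own loader, and `parseTid` is a made-up name.

```javascript
// Parse .tid text into a plain object; the body ends up in `text`.
// Simplified sketch: assumes single-line fields, then a blank line, then text.
const parseTid = (source) => {
  const normalized = source.replace(/\r\n/g, '\n')
  const split = normalized.indexOf('\n\n')
  const header = split === -1 ? normalized : normalized.slice(0, split)
  const text = split === -1 ? '' : normalized.slice(split + 2)
  const fields = {}
  for (const line of header.split('\n')) {
    const colon = line.indexOf(':')   // only the first colon separates name and value
    if (colon > 0) {
      fields[line.slice(0, colon).trim()] = line.slice(colon + 1).trim()
    }
  }
  return { ...fields, text }
}

const example = [
  'title: My Tiddler',
  'tags: Demo [[This is temporary]]',
  '',
  'And here we begin the multi-line text field.',
].join('\n')

console.log(parseTid(example))
```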

Folder Structure

Note that these files get dropped into subfolders inside the tiddlers folder. If you’re running in Node, all tiddlers live inside tiddlers, and the internal folder structure is simply a convenience. They all end up in TW’s flat namespace – based on internal title and not the file name. But if you’re not running in Node, these files can still be dragged and imported into any wiki.
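
Because the namespace is keyed on title rather than file path, two generated files that carry the same `title:` refer to the same tiddler. A quick check before writing files might look like the sketch below, which assumes `[fileName, content]` pairs like those passed to `writeTiddler` in the script above; the helper names are invented.

```javascript
// Pull the title out of .tid content (first `title:` line), or null if absent.
const titleOf = (content) => {
  const m = content.match(/^title:\s*(.*)$/m)
  return m ? m[1].trim() : null
}

// Group file names by title and keep only titles used by more than one file.
const duplicateTitles = (pairs) => {
  const seen = new Map()
  for (const [fileName, content] of pairs) {
    const title = titleOf(content)
    seen.set(title, [...(seen.get(title) || []), fileName])
  }
  return [...seen].filter(([, files]) => files.length > 1)
}

// Hypothetical file names and contents for illustration.
console.log(duplicateTitles([
  ['Voters/a.tid', 'title: Voters/123\ntags: Voter\n'],
  ['Other/b.tid', 'title: Voters/123\n'],
  ['Addresses/c.tid', 'title: Address/5 Mystery Machine Ave\n'],
]))
```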

(In actual practice, I don’t drop these in the tiddlers folder, but in the plugins one instead. That treats them essentially as data-only tiddlers. But it’s not important here.)

Now by adding various templates and cascade options, I can view and search this data in an interconnected manner. You can see it in action at http://scott.sauyet.com/Tiddlywiki/Demo/HannaBarbera.html.

So that’s my practice. I’ve done similar things on a number of wikis. I don’t know if you can get your data into good enough shape for such an automated conversion. I have my doubts, seeing the text generated by pdf2go, but there’s much I don’t know. In any case, I find this a useful way to initiate data-heavy wikis, based on an external source.

Bob_Jansen · February 16, 2024, 8:45pm

Scott,

Thank you for the explanation. I will have to think about how I might utilise this approach.

Bobj

Bob_Jansen · April 6, 2025, 1:30am

I have been pondering more and more how a user, using for example Google search, finds my TW and then accesses the appropriate tiddler. I am not a Node.js user and so am unsure of the technical details.

However, as I understand it.

If I run a TW as a single file TW, then google will eventually come along and index the content, just like it does for ‘normal’ http-style web pages. But in this case, only finds a single web page containing lots and lots of div's. So when a user comes across my TW through a google search, the search points to the individual page and when clicking on the link, loads, I presume, the TW showing its starting tiddler. So the user sees the starting tiddler and from there has to work out which tiddler(s) to go to, using search/wiki links/etc., as the appropriate tiddler was further down stream in the story-river.

To me, that means, essentially, indexing a single file TW is of no real use. In fact, the google search just records the existence of my wiki and as much of its content as possible but does not facilitate access inside the TW.

If I run my TW under node.js, each tiddler is stored as a separate file in a specified folder in the node.js data folder. Thus google would come along and index each individual tiddler, just like individual html pages. So I assume when searching, google would return links to the appropriate individual tiddlers. However, I am unsure what happens when the user follows a link. Does the appropriate tiddler display inside its environment (layout, etc) like a conventional html page or does the starting tiddler still display first?

As an aside, are there any maintenance pros and cons to running under node.js?

bobj

pmario · April 6, 2025, 2:13am

If you run a node js server and load a wiki you will see a single page app that looks exactly the same as a single file wiki. There is absolutely no difference.

Only the local maintenance is different. Google will never see your .tid files. Google will see an html page.

Tw-com static rendering was introduced for SEO reasons. So Google could see single tiddlers as html files.

Since 2014 Google bots have been improved quite a bit. So I am pretty confident that they can load TW as an app and index some content that way. But I am not sure about that one.

So if you want to be sure Google can find your single-tiddler content, the best way to go will be a static site.

Bob_Jansen · April 6, 2025, 4:58am

@pmario, thanks for the explanation. I guess it begs the question, why use node.js at all, what pros does it provide?

bobj