I understand the use of MAP and REDUCE from a big data perspective but don’t find it as easy for TiddlyWiki. For my own sanity I have reviewed the MAP/REDUCE Filter runs and the reduce operator, looking for a better way to describe them, or for useful metaphors.
- I decided to put ChatGPT onto the subject and collected the following (below), which may be of interest and may inform an improvement to the documentation.
- However, a question arises: how can we realise the full value of REDUCE if MAP can’t provide key/value pairs to REDUCE?
Map and reduce are two fundamental concepts in distributed computing.
Map refers to the process of applying a function to each element of a data set in order to transform that element in some way. The output of the map function is a new data set, where each element has been transformed according to the function.
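As a minimal sketch of the map idea (plain Python here, not TiddlyWiki filter syntax):

```python
# Map: apply a function to each element of a data set,
# producing a new data set of transformed elements.
values = [1, 2, 3, 4]

squared = list(map(lambda x: x * x, values))

print(squared)  # [1, 4, 9, 16]
```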
Reduce refers to the process of combining the elements of a data set into a single value. This is typically done by applying a binary operator (such as addition or multiplication) to the elements of the data set, in a way that successively combines elements of the data set until a single value is obtained.
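And the reduce idea in the same sketch style, folding a data set into a single value with a binary operator:

```python
from functools import reduce

values = [1, 4, 9, 16]

# Reduce: successively combine elements with a binary operator
# (here addition): (((0 + 1) + 4) + 9) + 16.
total = reduce(lambda acc, x: acc + x, values, 0)

print(total)  # 30
```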
Together, map and reduce form the basis of the map-reduce programming model, which is commonly used for large-scale data processing on distributed systems, such as Hadoop.
Are there alternative terms for map and reduce?
Yes, there are alternative terms for map and reduce in distributed computing.
Map is also known as “apply,” “transform,” or “process.”
Reduce is also known as “combine,” “aggregate,” “fold,” “accumulate,” or “compress.”
These alternative terms are often used interchangeably with map and reduce, and they convey the same basic concepts.
Additionally, the term “filter” is also used in distributed computing; it refers to the process of selecting certain elements from a data set based on some criterion.
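For completeness, filter in the same Python sketch style, selecting only the elements that satisfy a predicate:

```python
# Filter: keep only the elements that meet a criterion.
values = [1, 2, 3, 4, 5, 6]

evens = list(filter(lambda x: x % 2 == 0, values))

print(evens)  # [2, 4, 6]
```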
More on the map-reduce programming model
The map-reduce programming model is a way of processing large data sets in parallel on a distributed system. It was first introduced by Google in a 2004 paper, and it has since become a popular method for large-scale data processing.
The basic idea behind map-reduce is to break a large data processing task into two smaller tasks: the map task and the reduce task.
The map task is applied to each element of the input data set, and it produces a set of intermediate key-value pairs. The keys in these pairs are used to group together related data, and the values are the results of applying the map function to the input data.
The reduce task is then applied to the set of intermediate key-value pairs, and it combines the values associated with each key to produce a final set of key-value pairs. The keys in the final set of pairs are the same as the keys in the intermediate set, and the values are the results of applying the reduce function to the intermediate values.
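To make the two tasks and the intermediate key-value pairs concrete, here is a minimal single-machine sketch in Python of the classic word-count example; a real framework such as Hadoop would run each phase in parallel across machines, but the data flow is the same:

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran"]

# Map task: emit an intermediate (key, value) pair for each word occurrence.
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Shuffle: group the intermediate values by key.
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce task: combine the values for each key into a final (key, value) pair.
final = {key: sum(values) for key, values in grouped.items()}

print(final)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```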
The map-reduce programming model is typically implemented on a distributed system, such as Hadoop, which allows the map and reduce tasks to be executed in parallel on multiple machines. This makes it possible to process very large data sets in a relatively short amount of time.
Additionally, map-reduce can be used in a variety of contexts, for example data analysis, data mining, machine learning, natural language processing and much more. There are frameworks and libraries (such as Hadoop, Apache Spark and Apache Flink) that implement or generalise the map-reduce model, making it even more powerful and easier to use.
Further exploration needed
In the map-reduce programming model, MAP produces a set of intermediate key-value pairs; how do we do this in TiddlyWiki?
More details on intermediate key-value pairs in MAP
In the map-reduce programming model, the intermediate key-value pairs produced by the map task play an important role in the overall data processing pipeline.
When the map task is applied to each element of the input data set, it produces a set of key-value pairs that are used as input for the reduce task. These key-value pairs are intermediate results, and they are used to group together related data before it is passed to the reduce task.
The keys in the intermediate key-value pairs are used to determine which values are grouped together, and they are typically derived from the input data. For example, if you were counting the occurrences of words in a large text file, the map task would emit a key-value pair for each word occurrence, where the key is the word and the value is 1; the reduce task would then sum those values to produce the total count for each word.
The values in the intermediate key-value pairs are the results of applying the map function to the input data. These values are typically used as input for the reduce task, which combines the values associated with each key to produce a final set of key-value pairs.
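Spelling out the data shapes between the phases for a tiny input (again an illustrative Python sketch, not any framework’s actual storage format):

```python
# Map output for the text "the cat sat the": one pair per word occurrence.
intermediate = [("the", 1), ("cat", 1), ("sat", 1), ("the", 1)]

# After grouping by key, the reduce task sees one (key, list-of-values)
# pair per distinct word.
grouped = {"the": [1, 1], "cat": [1], "sat": [1]}

# Reduce collapses each list of values into a single value per key.
final = {word: sum(counts) for word, counts in grouped.items()}

print(final)  # {'the': 2, 'cat': 1, 'sat': 1}
```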
The intermediate key-value pairs are usually stored in memory or in a distributed file system during the map phase, so that they can be read and processed by the reduce task. The output of the map task is typically not stored in persistent storage such as a database; it is used only as input for the next phase of the pipeline.
The intermediate key-value pairs play an important role in the map-reduce programming model, as they allow the reduce task to group together related data and process it in a more efficient way.