Joyent Manta

Word index

This job takes an arbitrary number of plaintext objects as input and produces a plaintext index file that lists the files and line numbers where each word appears in the whole corpus. This example runs on a corpus of Shakespeare's written works.

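This example uses two assets (external scripts that the job needs at run time). The first is indexone.sh, which indexes a single object. It lowercases every word, skips words shorter than four characters, and records the line numbers on which each remaining word appears: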

#!/bin/bash

#
# indexone.sh: simple indexer for plaintext files
#
# Read from "$1" when given, otherwise from stdin (labelled "stdin" in the index).
sed -E -e "s/[^A-Za-z']/ /g;" -e "s#([[:space:]])'+#\1#g" -e "s#^'+##g" ${1:+"$1"} | \
    tr '[:upper:]' '[:lower:]' | \
    awk -v OBJLABEL="$(basename "${1-stdin}")" '{ \
             for (i = 1; i <= NF; i++) { \
                 if (length($i) < 4) \
                     continue; \
                 if ($i in indx) { \
                     indx[$i] = indx[$i] "," NR \
                 } else { \
                     indx[$i] = OBJLABEL ":" NR \
                 } \
             } \
         } \
         END { \
             for (word in indx) { \
                 print word, indx[word] \
             } \
         }'
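
To make the per-object index format concrete, here is a minimal sketch that runs the same pipeline on a tiny local file. The file name sample.txt, its contents, and the index_one function name are assumptions made for illustration; they are not part of the Manta example.

```shell
#!/bin/bash
# Hypothetical two-line input; sample.txt is an illustrative name only.
tmpdir=$(mktemp -d)
printf '%s\n' "Hello brave world" "the brave and the bold" > "$tmpdir/sample.txt"

# The indexone.sh pipeline, inlined so this sketch is self-contained.
index_one() {
    sed -E -e "s/[^A-Za-z']/ /g;" -e "s#([[:space:]])'+#\1#g" -e "s#^'+##g" "$1" |
        tr '[:upper:]' '[:lower:]' |
        awk -v OBJLABEL="$(basename "$1")" '{
            for (i = 1; i <= NF; i++) {
                if (length($i) < 4) {
                    continue            # skip short words such as "the"
                }
                if ($i in indx) {
                    indx[$i] = indx[$i] "," NR
                } else {
                    indx[$i] = OBJLABEL ":" NR
                }
            }
        }
        END {
            for (word in indx) {
                print word, indx[word]
            }
        }'
}

# awk does not guarantee iteration order, so sort for a stable listing.
result=$(index_one "$tmpdir/sample.txt" | LC_ALL=C sort)
echo "$result"
rm -rf "$tmpdir"
```

With this input the sketch prints one line per word of four or more letters, for example "brave sample.txt:1,2", because "brave" appears on lines 1 and 2.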

The second is indexmerge.awk, which merges the per-object indexes into a single index keyed by word:

{
    for (i = 2; i <= NF; i++) {
        indx[$1] = indx[$1] " " $i
    }
}
END {
    for (word in indx) {
        print word, indx[word]
    }
}
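
The merge step can be tried locally with a few hand-written per-object index lines. This is only a sketch: the object names and line numbers below are made-up sample data, not real job output. Note that, as written, the merge script emits two spaces after the word, because each entry is appended with a leading space and print adds its own separator.

```shell
#!/bin/bash
# Hypothetical per-object index lines, in the format indexone.sh produces.
merged=$(printf '%s\n' \
    "brave hamlet.txt:10,42" \
    "sword hamlet.txt:7" \
    "brave macbeth.txt:3" |
    awk '{
        # Append every object:lines entry on this line to the word in field 1.
        for (i = 2; i <= NF; i++) {
            indx[$1] = indx[$1] " " $i
        }
    }
    END {
        for (word in indx) {
            print word, indx[word]
        }
    }' | LC_ALL=C sort)
echo "$merged"
```

The two "brave" lines collapse into a single line that lists both objects, which is what the reduce phase does across the whole corpus.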

Run it yourself

Once you've set up the Manta CLI tools, you can run this job yourself on the publicly accessible dataset using the following command:

$ mfind -t o -n '.*.txt' /manta/public/examples/shakespeare | \
    mjob create -n "Word index" -w \
     -s /manta/public/examples/assets/indexone.sh \
     -m '/assets/manta/public/examples/assets/indexone.sh "$MANTA_INPUT_FILE"' \
     -s /manta/public/examples/assets/indexmerge.awk \
     -r 'awk -f /assets/manta/public/examples/assets/indexmerge.awk | sort'

Because the output of this job is quite large, the command above omits the -o option to mjob create (which prints all job outputs). For the actual output, see "Output summary" below.

Job body

[
        {
                "assets": [
                        "/manta/public/examples/assets/indexone.sh"
                ],
                "exec": "/assets/manta/public/examples/assets/indexone.sh \"$MANTA_INPUT_FILE\"",
                "type": "map"
        },
        {
                "assets": [
                        "/manta/public/examples/assets/indexmerge.awk"
                ],
                "exec": "awk -f /assets/manta/public/examples/assets/indexmerge.awk | sort",
                "type": "reduce"
        }
]

Input summary

/manta/public/examples/shakespeare/2kinghenryiv.txt
/manta/public/examples/shakespeare/2kinghenryvi.txt
/manta/public/examples/shakespeare/1kinghenryvi.txt
/manta/public/examples/shakespeare/3kinghenryvi.txt
/manta/public/examples/shakespeare/1kinghenryiv.txt
/manta/public/examples/shakespeare/allswellthatendswell.txt
/manta/public/examples/shakespeare/comedyoferrors.txt
/manta/public/examples/shakespeare/asyoulikeit.txt
/manta/public/examples/shakespeare/cymbeline.txt
/manta/public/examples/shakespeare/antonyandcleopatra.txt
/manta/public/examples/shakespeare/coriolanus.txt
/manta/public/examples/shakespeare/hamlet.txt
/manta/public/examples/shakespeare/kinghenryviii.txt
/manta/public/examples/shakespeare/kingjohn.txt
/manta/public/examples/shakespeare/juliuscaesar.txt
/manta/public/examples/shakespeare/kinghenryv.txt
/manta/public/examples/shakespeare/kinglear.txt
/manta/public/examples/shakespeare/kingrichardii.txt
/manta/public/examples/shakespeare/kingrichardiii.txt
/manta/public/examples/shakespeare/macbeth.txt
/manta/public/examples/shakespeare/measureforemeasure.txt
/manta/public/examples/shakespeare/loveslabourslost.txt
/manta/public/examples/shakespeare/merchantofvenice.txt
/manta/public/examples/shakespeare/loverscomplaint.txt
/manta/public/examples/shakespeare/othello.txt
/manta/public/examples/shakespeare/muchadoaboutnothing.txt
/manta/public/examples/shakespeare/merrywivesofwindsor.txt
/manta/public/examples/shakespeare/midsummersnightsdream.txt
/manta/public/examples/shakespeare/periclesprinceoftyre.txt
/manta/public/examples/shakespeare/rapeoflucrece.txt
/manta/public/examples/shakespeare/timonofathens.txt
/manta/public/examples/shakespeare/tamingoftheshrew.txt
/manta/public/examples/shakespeare/sonnets.txt
/manta/public/examples/shakespeare/titusandronicus.txt
/manta/public/examples/shakespeare/romeoandjuliet.txt
/manta/public/examples/shakespeare/tempest.txt
/manta/public/examples/shakespeare/twogentlemenofverona.txt
/manta/public/examples/shakespeare/troilusandcressida.txt
/manta/public/examples/shakespeare/twelfthnight.txt
/manta/public/examples/shakespeare/venusandadonis.txt
/manta/public/examples/shakespeare/winterstale.txt
/manta/public/examples/shakespeare/various.txt

Output summary

1 total output
/manta/jobs/7df61576-ad75-caf7-d44f-8a1e65219b1a/stor/reduce.1.d5900f30-172a-4584-8b19-cd1beacff7e6
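
Each line of the output object holds a word followed by its object:line-list entries, so looking up a word is a one-line awk filter. The sketch below assumes a local copy of the index; the temporary file, its two sample lines, and the lookup function name are hypothetical and stand in for the real (much larger) output.

```shell
#!/bin/bash
# Hypothetical stand-in for a downloaded copy of the job output.
index_file=$(mktemp)
printf '%s\n' \
    "brave  hamlet.txt:10,42 macbeth.txt:3" \
    "sword  hamlet.txt:7" > "$index_file"

# lookup WORD FILE: print one object:lines entry per line for WORD.
lookup() {
    awk -v word="$1" '$1 == word { for (i = 2; i <= NF; i++) print $i }' "$2"
}

hits=$(lookup brave "$index_file")
echo "$hits"
rm -f "$index_file"
```

Matching on the first field exactly (rather than using grep) avoids false hits on words that merely contain the search term.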

Error summary

0 total errors