Joyent Manta

Word frequency count

This job takes any number of plaintext objects and prints the 30 most frequently occurring words, along with their frequencies, sorted in descending order of frequency. A word is any run of ASCII letters, and uppercase letters are folded to lowercase before counting. The job uses two phases: the first (a map phase) processes the input objects in parallel, counting the words in each object, and the second (a reduce phase) sums the per-object counts to produce a single output. You can change the number of words printed by changing the value of COUNT on the first line of the command below.
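
The two phases can be tried locally without Manta. The following is a minimal sketch under assumptions: the function names `map_phase` and `reduce_phase` and the two sample inputs are made up for illustration, and each call to `map_phase` stands in for one map task running on one object.

```shell
# Local emulation of the two phases; no Manta required.
COUNT=3

# Map phase: split into one word per line, lowercase, count per input.
map_phase() {
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c
}

# Reduce phase: sum the per-input counts, then keep the top COUNT words.
reduce_phase() {
    awk '{ x[$2] += $1 }
         END { for (w in x) { print x[w] " " w } }' |
        sort -rn | sed "${COUNT}q"
}

# Two hypothetical inputs, each mapped separately and reduced together.
result=$( {
    printf 'the cat and the hat and\n' | map_phase
    printf 'The dog and the cat the\n' | map_phase
} | reduce_phase )

printf '%s\n' "$result"
# 5 the
# 3 and
# 2 cat
```

The reduce step has to add counts with awk rather than run `uniq -c` again, because the same word arrives once per input, each time with its own partial count.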

This solution is based on Doug McIlroy's solution to a very similar problem posed by Jon Bentley in the June 1986 issue of Communications of the ACM. The problem is also very similar to the URL frequency count problem posed in section 2.3 of Google's MapReduce paper.
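
For comparison, the single-machine pipeline commonly quoted from McIlroy's solution can be sketched as below. The wrapper function name `top_words` and the sample input are ours, added for illustration. The Manta job uses the first four stages as its map phase; since each object then produces its own partial counts, the reduce phase adds a summing step before the final sort and truncation.

```shell
# Single-pipeline version: print the N most frequent words on stdin.
# $1 is the number of words to print.
top_words() {
    tr -cs A-Za-z '\n' |   # one word per line
        tr A-Z a-z |       # fold to lowercase
        sort |             # bring identical words together
        uniq -c |          # count each distinct word
        sort -rn |         # highest count first
        sed "${1}q"        # stop after N lines
}

# Hypothetical sample input.
top=$(printf 'b a b c b a\n' | top_words 2)
printf '%s\n' "$top"
```

The leading padding that `uniq -c` puts before each count differs between implementations, so compare fields rather than whole lines when checking the output.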

Run it yourself

Once you've set up the Manta CLI tools, you can run this job yourself on the publicly accessible dataset using the following command:

$ COUNT=30; mfind /manta/public/examples/shakespeare | \
    mjob create -n "Word frequency count" -o \
        -m "tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c" \
        -r "awk '{ x[\$2] += \$1 }
                 END { for (w in x) { print x[w] \" \" w } }' |
            sort -rn | sed ${COUNT}q"

which outputs:

added 42 inputs to 2838a8e0-ec41-ef11-a28c-abec6905da18
29854 the
27554 and
23357 i
21075 to
18520 of
15523 a
14264 you
12964 my
11955 that
11842 in
9734 is
8871 not
8269 with
8160 s
8100 for
8080 it
8059 me
7357 his
7228 be
7120 he
6917 this
6876 your
6624 but
6106 have
6056 as
5874 thou
5765 d
5431 him
5355 so
5206 will

Because every non-letter character separates words, the single letters "s" and "d" in this list are fragments of contractions such as "'s" and "'d".

Job body

[
        {
                "exec": "tr -cs A-Za-z '\\n' | tr A-Z a-z | sort | uniq -c",
                "type": "map"
        },
        {
                "exec": "awk '{ x[$2] += $1 }\n                 END { for (w in x) { print x[w] \" \" w } }' |\n            sort -rn | sed 30q",
                "type": "reduce"
        }
]

Input summary

/manta/public/examples/shakespeare/2kinghenryiv.txt
/manta/public/examples/shakespeare/3kinghenryvi.txt
/manta/public/examples/shakespeare/1kinghenryvi.txt
/manta/public/examples/shakespeare/1kinghenryiv.txt
/manta/public/examples/shakespeare/2kinghenryvi.txt
/manta/public/examples/shakespeare/allswellthatendswell.txt
/manta/public/examples/shakespeare/asyoulikeit.txt
/manta/public/examples/shakespeare/comedyoferrors.txt
/manta/public/examples/shakespeare/cymbeline.txt
/manta/public/examples/shakespeare/antonyandcleopatra.txt
/manta/public/examples/shakespeare/coriolanus.txt
/manta/public/examples/shakespeare/hamlet.txt
/manta/public/examples/shakespeare/juliuscaesar.txt
/manta/public/examples/shakespeare/kinglear.txt
/manta/public/examples/shakespeare/kinghenryviii.txt
/manta/public/examples/shakespeare/kinghenryv.txt
/manta/public/examples/shakespeare/kingrichardii.txt
/manta/public/examples/shakespeare/kingjohn.txt
/manta/public/examples/shakespeare/loverscomplaint.txt
/manta/public/examples/shakespeare/loveslabourslost.txt
/manta/public/examples/shakespeare/kingrichardiii.txt
/manta/public/examples/shakespeare/measureforemeasure.txt
/manta/public/examples/shakespeare/merchantofvenice.txt
/manta/public/examples/shakespeare/macbeth.txt
/manta/public/examples/shakespeare/merrywivesofwindsor.txt
/manta/public/examples/shakespeare/muchadoaboutnothing.txt
/manta/public/examples/shakespeare/periclesprinceoftyre.txt
/manta/public/examples/shakespeare/othello.txt
/manta/public/examples/shakespeare/midsummersnightsdream.txt
/manta/public/examples/shakespeare/rapeoflucrece.txt
/manta/public/examples/shakespeare/tamingoftheshrew.txt
/manta/public/examples/shakespeare/tempest.txt
/manta/public/examples/shakespeare/timonofathens.txt
/manta/public/examples/shakespeare/sonnets.txt
/manta/public/examples/shakespeare/romeoandjuliet.txt
/manta/public/examples/shakespeare/titusandronicus.txt
/manta/public/examples/shakespeare/twelfthnight.txt
/manta/public/examples/shakespeare/troilusandcressida.txt
/manta/public/examples/shakespeare/various.txt
/manta/public/examples/shakespeare/venusandadonis.txt
/manta/public/examples/shakespeare/twogentlemenofverona.txt
/manta/public/examples/shakespeare/winterstale.txt

Output summary

1 total outputs
/manta/jobs/2838a8e0-ec41-ef11-a28c-abec6905da18/stor/reduce.1.ca5819e2-ae72-40a6-a3d6-aff5d94cf794

Error summary

0 total errors