Algorithms in MapReduce | Intro to Data Science

After having slacked off for a few months on this course, I've finally decided to tackle this assignment.

This article continues from my series of Walkthroughs for the programming assignments for the online data science class being offered. This post covers MapReduce principles.

After having slacked off for a few months on this course, I've finally decided to tackle this assignment.

I've been looking forward to this assignment in particular becuase it is based on MapReduce, which would introduce me to a very powerful framework and enable me to take this knowledge one step further with frameworks like Hadoop, and also Amazon's ElasticMapReduce framework.

So without further ado.. let's get started!

 

Problem 1

The task:

The first problem asks us to create a map-reduce job that provides a reverse-index on words in documents. Putting it simply, we will be provided with a list of documents, we need to output all the words with a list of documents in which the word was found.

The provided tools:

They have provided a books.json input file on which to run the MR job.

They have provided a MapReduce.py component that serves as the framework to be run in a non-distributed manner locally to simulate a MapReduce job run in a distributed environment.

We have a provided skeleton python file (skeleton.py), which basically has the mapper and reducer methods outlined with the respective emit_intermediate(...) and emit(...) methods along with a few more lines of boilerplate code.

My approach:

This is a very straight forward problem, so I'm going to keep this short. 

Mapper: For each file passed into the mapper, I will add the words to a set to remove duplicates. Then for each word (the intermediate key) in the set I will do an emit_intermediate with the document_id (intermediate value). This will form a tuple to be passed into the reducer.

Reduer: The reducer receives a word with a list of documentIds (the value). The MR framework handles "shuffling" the values to group them in this manner as is defined in the MapReduce paper. Since each call to the reduce job gets all the values (a list of document_ids in this case) associated with a given key (word in this case), we do not need to do much so we just emit the values with an "identity function".

You can see the result of this in the inverted_index.py MR job checked-in to my github account

 

Problem 2

sdf

Show comments 0