A Report on a 2-Day Workshop on Big Data

Day 1, 16th June 2012

On a silent Saturday morning, with the usual Bangalore monsoon trying to peep in, Prabhu and I slowly alighted from that prominent (notorious?) three-wheeled entity of Bangalore traffic. At first, the small left turn that leads to the venue was blocked. To our fortune, there was a convenient way to the venue from its parallel road. After requesting the man guarding that spot to direct participants coming to the workshop, we both reached the venue at 9:10.

As the training room Vanaka (a Sanskrit word meaning “the state of Brahmacharin”) greeted us, I finally got the day I had been waiting for. While on one side the room was getting colder with the A/C set to its minimum, our hands shared their warmth with the enthusiastic participants walking into the lab. At 9:45, with almost 90% of the participants in, I started to set the context for the two-day workshop with an overview of Big Data (the why, what and how) and its relevance to the current decade, where every one of us is contributing and consuming data in various forms.

Right from a school-going kid to our grandmas and grandpas, everyone is a stakeholder, in some role or the other, in the digital data that is crowding multiple places. Big data is not just “big” in size; it is big in complexity due to the various forms and origins it takes, big in processing, big in technology, big in applications and, finally, a big, fascinating subject with no horizon to the knowledge areas it embodies. Being aware of the widespread documentation of “what is big data”, I consciously handed over the baton to Prabhu, a colleague and friend of Manish. While it is unfortunate that Manish couldn't join this two-day party, he had found a very convincing talent in Prabhu (the participants' feedback reinstates this), who had worked on content extraction, searching and indexing from the initial versions of Tika and Solr, back when they were in their infant, not-so-stable releases.

Prabhu started off with content extraction. Before the tea and coffee could turn cold, participants got their first hands-on with Tika by extracting the content from a PDF. Every participant's system was loaded with a dual OS: Windows for the first day and Ubuntu for the second. Eclipse was pre-configured with the Maven plugin, and the local Maven repository was pre-loaded with all the dependencies of the Tika project. We didn't want our participants to go through the suspense Maven creates on a first project with a new technology, when it has to resolve dependencies by downloading the artifacts from the configured remote repositories.
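For reference, here is a minimal sketch of that kind of extraction using Tika's AutoDetectParser. The file name and the printed fields are illustrative, not taken from the workshop material.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaPdfExtractor {
    public static void main(String[] args) throws Exception {
        // Hypothetical input file; any format Tika recognises will do.
        InputStream stream = new FileInputStream("sample.pdf");
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try {
            // AutoDetectParser picks the right parser from the detected MIME type.
            new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
        } finally {
            stream.close();
        }
        System.out.println("MIME type : " + metadata.get(Metadata.CONTENT_TYPE));
        System.out.println("Content   : " + handler.toString());
    }
}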

Big Data Workshop – Day 1 – Tika, Solr, Cassandra

After a quick tea break, the once cold, shivering room turned warm, with everyone latched on to their Eclipse. The lexicon of content extraction started to flow in: metadata, dynamic variables, handling multiple MIME types, respecting other languages. We then moved on to Solr, witnessing the Sol(a)r (on) Eclipse.

With each one of us keeping a second brain in Google search to solve almost every problem that comes our way, Solr definitely removed the darkness in understanding how indexing and search work. Every participant was given a handful of documents to index. With the standard set of config files obtained from Solr (don't forget the main one, schema.xml, the “controller” config, if I can call it that), participants were able to write their first method that indexes these documents. The terminology of indexing started to flow in: stopwords (the useless or misleading words), synonyms and so on. Without the Solr server, isn't Solr just Lucene?
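As an illustration of such an indexing method, here is a minimal SolrJ sketch. The field names and values are assumptions for the example, not the workshop's actual schema.xml fields; on Solr releases older than 3.6 the client class would be CommonsHttpSolrServer instead of HttpSolrServer.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Solr running on Jetty, as in the workshop setup.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                           // unique key from schema.xml
        doc.addField("content", "Text extracted by Tika...");  // body field (illustrative name)
        doc.addField("author_s", "Arun");                      // dynamic string field

        solr.add(doc);    // send the document to Solr
        solr.commit();    // make it searchable
    }
}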

Solr was then running on Jetty inside Eclipse, and HTTP queries were fired from the browser to search the content indexed in Solr. Starting with a simple query that fetched every item and displayed all the fields extracted by Tika, we moved on to every single aspect of search: getting only the required fields, searching for a value present in any of the fields, stemming, scoring, highlighting, proximity, faceting, edismax. With our role model Google on one side, every concept or term was explained by executing our search in the browser and relating it to the similar feature implemented in Google. Here are a few of the queries.

http://localhost:8983/solr/select?q=content:*
http://localhost:8983/solr/select?q=content:facebook&fl=content,score
http://localhost:8983/solr/select?q=content:facebook&fl=content,score&hl=true&hl.fl=content
http://localhost:8983/solr/select?q=content:google&fl=content,score&hl=true&hl.fl=content
http://localhost:8983/solr/select?q=content:facebok~0.5
http://localhost:8983/solr/select?q=content:%22facebook%20philadelphia%22~19&fl=content,score&hl=true&hl.fl=content
http://localhost:8983/solr/select?q=author_s:*&fl=author_s
http://localhost:8983/solr/select?q=content:*&fl=author_s&rows=2
http://localhost:8983/solr/select?q=content:*&fl=author_s,score&group=true&group.field=author_s
http://localhost:8983/solr/select?q=content:facebook&facet=true&facet.field=contenttype_s
http://localhost:8983/solr/select?q=content:facebook&facet=true&facet.field=id
http://localhost:8983/solr/select?q=content:facebook&facet=true&facet.field=author_s&fq=contenttype_s:application/pdf
http://localhost:8983/solr/select?q=content:*&fl=id&facet=true&facet.field={!ex=dt}contenttype_s&fq={!tag=dt}contenttype_s:application/pdf
http://localhost:8983/solr/select?q=content:*&fl=id&facet=true&facet.field={!ex=dt}contenttype_s&facet.field=author_s&fq={!tag=dt}contenttype_s:application/pdf
http://localhost:8983/solr/select?q=content:million%20cricket&fl=score,content&hl=true&hl.fl=content&defType=edismax&mm=2
http://localhost:8983/solr/select?q=content:facebook&mlt=true&mlt.fl=content&mlt.count=3&mlt.mindf=1&mlt.mintf=1&fl=id,score
Big Data Workshop – Day 1 – Tika, Solr, Cassandra

As dusk hit the dust outside, we moved on to the column-family NoSQL database, Cassandra. It was a quick session where participants got to meet Cassandra hands-on: setting up a keyspace, creating a column family under it, and then firing both regular and CQL queries.
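The post doesn't record which Java client was used, so purely as an illustration, here is a sketch of the same flow (an existing keyspace and column family, one write and one read) using the Hector client that was common with Cassandra at the time. The cluster, keyspace, column family and column names are all made up.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.QueryResult;

public class CassandraHelloWorld {
    public static void main(String[] args) {
        // Connect to a locally running Cassandra node (Thrift port 9160).
        Cluster cluster = HFactory.getOrCreateCluster("workshop-cluster", "localhost:9160");
        // Assumes a keyspace "Workshop" with a column family "Users" already exists
        // (created beforehand via the CLI/CQL, as in the session).
        Keyspace keyspace = HFactory.createKeyspace("Workshop", cluster);

        // Write one column for row key "arun".
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("arun", "Users", HFactory.createStringColumn("city", "Bangalore"));

        // Read the column back.
        ColumnQuery<String, String, String> query = HFactory.createStringColumnQuery(keyspace);
        query.setColumnFamily("Users").setKey("arun").setName("city");
        QueryResult<HColumn<String, String>> result = query.execute();
        System.out.println("city = " + result.get().getValue());
    }
}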

The standing line on the clock forced us to part ways half-heartedly. I am sure everyone's plate was full of food for thought that they could chew on later in the evening.

After bidding goodbye to the participants, I decided to set up a pseudo-distributed installation of Hadoop on one of the Ubuntu systems. I had Ajit on Skype to ensure I got it right the first time. On a plain Ubuntu box, it started with the Java installation, then downloading and installing Cloudera's hadoop-0.20-conf-pseudo package through apt-get. After formatting the namenode, I had to fix the permissions on the folders under the namenode directory. After a handful of permissions were changed or added for those folders (read from Hadoop's log file), Hadoop was running successfully and I started my run back home.

Day 2, 17th June 2012

It was a usual Sunday morning in Bangalore: peaceful, with very little traffic. I could relish the charm of Inner Ring Road as I saw it 10 years back. I was still tense about whether all the participants would turn up at 9:30 on a cloudy and windy Sunday morning. Yes, they did. I should admit that each and every one of them was so enthusiastic about technology and learning. Ajit had already arrived by then. We both started to copy all the downloaded material from the Linux machine I had set up the previous night. Participants who dropped in early, like Guruprasad, also lent a helping hand. Thanks to a handful of USB drives that sped up the whole process.

I started off with a flashback of what we did on Saturday and handed over to Ajit. We decided to start with NoSQL and Neo4j, as we thought we might not do justice to them if we scheduled them after the two monsters, Hadoop and Mahout. Ajit started off with the concepts behind NoSQL (Not Only SQL), the various forms it takes (for example key-value, Big Table, document, graph), the pros and cons of each, the various industrial players currently in the market, and then moved on to explain graph databases with specific reference to Neo4j. With most of us heavily using Facebook and LinkedIn, it made more sense to model the social/professional graph than a conventional relational one. Ajit then explained Neo4j in depth, covering the APIs, Cypher queries, Spring-Neo4j, Neo4j entity classes, Neo4j-specific annotations, and a demo. After a good deal of conceptual overview of Neo4j, participants got their first hands-on with Neo4j. The example was pretty apt for a Sunday: Movie, Actor, Roles, Director.
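For a flavour of that movie graph, here is a minimal sketch using Neo4j's plain embedded Java API (the Spring-Neo4j annotations shown in the session are not reproduced here). The database path, property names and relationship type are illustrative.

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class MovieGraph {
    public static void main(String[] args) {
        // Embedded database stored in a local directory (illustrative path).
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/movie-db");
        Transaction tx = db.beginTx();
        try {
            Node movie = db.createNode();
            movie.setProperty("title", "The Matrix");

            Node actor = db.createNode();
            actor.setProperty("name", "Keanu Reeves");

            // (actor) -[:ACTED_IN {role: "Neo"}]-> (movie)
            Relationship actedIn = actor.createRelationshipTo(
                    movie, DynamicRelationshipType.withName("ACTED_IN"));
            actedIn.setProperty("role", "Neo");

            tx.success();
        } finally {
            tx.finish(); // Neo4j 1.x API; later versions use tx.close()
        }
        db.shutdown();
    }
}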

Big Data Workshop – Day 2 – Neo4j, Hadoop, Mahout

We then moved on to the Hadoop installation. We first added the CDH3 (Cloudera's Distribution including Apache Hadoop, version 3) repository to APT.

Create a new file /etc/apt/sources.list.d/cloudera-cdh3.list with the following contents:
deb http://archive.cloudera.com/debian maverick-cdh3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib
Update the APT package index:
$ sudo apt-get update

We started the Hadoop installation process and rushed to satiate our hunger.

Big Data Workshop – Day 2 – Neo4j, Hadoop, Mahout

sudo apt-get install hadoop-0.20-conf-pseudo

When we were back, Hadoop had already been installed and was waiting for its configuration to be done. We started off with formatting the namenode.

bin/hadoop namenode -format

After the above command, we had to set the permissions appropriately (777) for the user (in our case, root) on the namenode folder and on the SecurityAuth.audit file under the log directory (check the Hadoop log file for further details). It was then time for us to start Hadoop.

for x in /etc/init.d/hadoop-0.20* ; do sudo $x start; done

Hadoop didn't start the first time; it needed further permission changes on the folders it had created under the namenode directory. After setting the permissions, we executed the above command again. Some got to meet the big elephant (Hadoop) soon, some a little later, and some couldn't at all. After troubleshooting the Hadoop installation with partial success, we proceeded, in the interest of time, to our first MapReduce job on Hadoop. We had in hand working examples of word count, secondary sort and patent citation (a minimal word-count sketch follows the run steps below).

1. Move the input files to HDFS

> bin/hadoop dfs -copyFromLocal Hadoop_Examples/src/main/resources/word_count word_count/input/

2. Deploy the MapReduce job on Hadoop

> bin/hadoop jar  Hadoop_Examples/target/hadoopexample-1.0-SNAPSHOT-job.jar  word_count/input/ word_count/output/

3. View the results of the MR job

> bin/hadoop dfs -ls word_count/output/
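As referenced above, here is a minimal word-count sketch in the style of the classic Hadoop 0.20 (new API) example. Class and path names are illustrative and not copied from the workshop's Hadoop_Examples project.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. word_count/input/
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. word_count/output/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}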

We then moved on to driving the yellow elephant, Mahout. Ajit started off with the concepts behind machine learning and its applications. How many of us have bought products just because they flashed on our screen while we were buying something else? How many of us got back in touch with our old school friends or colleagues with the help of Facebook and LinkedIn? It is human-like intelligence derived from an unimaginable amount of data.

Focusing specifically on the recommendation engine, Ajit started to reveal some of the secrets behind the screen: user-based and item-based recommenders to start with, followed by the components of the Mahout recommender architecture (DataModel, user and item similarity, user neighbourhood) and the whole abstraction of recommenders. Terms like Pearson correlation similarity, Euclidean distance similarity, cosine measure similarity and the slope-one recommender started to flow in.

Participants got to drive the yellow elephant with a recommendation-engine example using PearsonCorrelationSimilarity and GenericUserBasedRecommender. Ajit then concluded with a brief note on how to run Mahout application code as MR jobs on Hadoop.
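For reference, here is a user-based recommender along those lines, in the style of the standard Mahout (Taste) example. The data file name and the neighbourhood size are assumptions.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderExample {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference lines, e.g. "1,101,5.0" (illustrative file name).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}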

While the two-day workshop had to end, as the sun has to set, our voyage on the big elephant has just begun.

 
