Tuesday, 7 July 2015

Unix/Linux - Must Know Hacks

This post is a reference for me to look back on the most important and frequently used hacks in the Unix world. As Unix/Linux is bread and butter for developers like me, it could be handy for others too.

For the Unix hacks I came across during my work, I am listing the topic and the link(s) that provide a solution to the problem. Kudos to all the Linux administrators and techies whose contributions have made life easier for other developers.

1. How to automatically start services on boot?
http://www.abhigupta.com/2010/06/how-to-auto-start-services-on-boot-in-centos-redhat/

2. How to unlist a service from automatic start on boot?
http://www.abhigupta.com/2010/06/how-to-auto-start-services-on-boot-in-centos-redhat/
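Both of the above boil down to chkconfig on CentOS/RHEL (details in the link); a minimal sketch, with httpd as a stand-in service name:

$ chkconfig --add httpd     # register the service with init, if not already registered
$ chkconfig httpd on        # start it automatically on boot
$ chkconfig --list httpd    # verify the runlevel configuration
$ chkconfig httpd off       # unlist it from automatic start on boot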

3. How to run a long-running process detached from the session (keeping it running even after the session is closed) and reattach it to a session later?
http://www.thegeekstuff.com/2010/07/screen-command-examples/
http://www.tecmint.com/screen-command-examples-to-manage-linux-terminals/
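In short, with screen (the session and job names below are placeholders):

$ screen -S mysession       # start a named screen session
$ ./long-running-job.sh     # start the long-running process inside it, then press Ctrl-a d to detach
$ screen -ls                # list running sessions
$ screen -r mysession       # reattach to the session later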

4. Sticky bit in Unix file permissions
https://en.wikipedia.org/wiki/Sticky_bit
http://www.thegeekstuff.com/2013/02/sticky-bit/
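A minimal illustration (the directory name is arbitrary):

$ chmod +t /shared          # set the sticky bit on a world-writable directory
$ ls -ld /shared            # the permission string now ends in 't', e.g. drwxrwxrwt
$ chmod -t /shared          # remove the sticky bit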

5. VI/VIM Cheatsheet

move to start of file        1G or gg
move to end of file          G
move to nth line of file     nG (e.g., 100G moves to line 100)

6. Unix Special Variables

$#        Number of command-line arguments passed to the shell program.
$?        Exit status of the last command executed.
$0        First word of the entered command (the name of the shell program).
$*        All arguments entered on the command line ($1 $2 ...).
"$@"      All arguments entered on the command line, individually quoted ("$1" "$2" ...).
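A tiny script to see these in action (the file name and usage are my own illustration):

#!/bin/bash
# specialvars.sh - print the special variables for the arguments it receives
echo "\$0 (script name) : $0"
echo "\$# (arg count)   : $#"
echo "\$* (all args)    : $*"
for arg in "$@"; do
    echo "\"\$@\" item       : $arg"
done
true                               # run any command, then inspect its exit status
echo "\$? (last status) : $?"

Run it as, for example: ./specialvars.sh one "two words" three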


Tuesday, 9 June 2015

Passing variables from shell script to hive script


I have a Hive script which may have some variables like table name, count, etc. to make it more generic and automated.

$ cat hive-test.hql
select count(*) from demodb.demo_table limit ${hiveconf:num}

Create a shell script to execute the above Hive script, passing the required variables to it.

$ cat script-test.sh
#!/bin/bash
count=5
hive -hiveconf num="$count" -f hive-test.hql
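A possible refinement (my own sketch, not part of the original script) is to read the count from the command line so the wrapper stays generic:

#!/bin/bash
# usage: ./script-test.sh <row-limit>
count=${1:-5}                                 # default to 5 when no argument is passed
hive -hiveconf num="$count" -f hive-test.hql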





Running Hive queries/Hadoop commands from a non-Hadoop environment using ssh

Sometimes we need to run a set of Hive queries (SELECT, etc.) to analyse data in Hive, or hadoop fs commands to list files in HDFS, from non-Hadoop machines. One use case could be scripts that push data to HDFS on a remote cluster. SSH comes as a quick solution in such scenarios, when we don't need to maintain a session or do transaction management.

But this obvious solution has some catches which need to be kept in mind. I will walk through the complete flow of the solution.

1) Login interactive shell vs non-login interactive shell while executing Hadoop commands over SSH

$ ssh user@remote-host 'hadoop fs -ls /'
user@remote-host's password:
+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| Hadoop requires Java 1.6 or later.                                   |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+

The reason for this error is the non-login bash shell. When we run a command over ssh on a remote machine, it runs in a non-login shell, which only reads ~/.bashrc. Environment variables like PATH, JAVA_HOME, etc. are sourced from ~/.bash_profile for each user. Since ~/.bash_profile sources ~/.bashrc and not vice versa, therein lies our problem.
The solution is an interactive login bash shell, the same as when we log in from PuTTY or any other SSH client. An interactive login bash shell sources ~/.bash_profile, which sets all the environment variables for the user on remote-host.
The syntax for an interactive login bash shell is:
bash -l -c '<command>'


$ ssh user@remote-host 'bash -l -c "hadoop fs -ls /"'
user@remote-host's password:
Found 5 items
drwxr-xr-x   - 49483 mapr          9 2015-05-28 14:24 /abc
drwxr-xr-x   - 49483 mapr          0 2013-12-10 11:45 /hbase
drwxrwxrwx   - 49483 mapr         38 2015-06-08 20:03 /tmp
drwxr-xr-x   - 49483 mapr          3 2015-05-27 16:53 /user
drwxr-xr-x   - 49483 mapr          1 2013-12-10 11:45 /var

Voila!!!

2) Single quote (') vs double quote (") when passing variables to commands executed over SSH

As the end goal of every process is automation, we want to read the target directory path from an environment variable.

$ DIR_PATH="/"
$ echo $DIR_PATH
/
$ ssh user@remote-host 'bash -l -c "hadoop fs -ls $DIR_PATH"'
user@remote-host's password:
ls: Cannot access .: No such file or directory.

It now fails with a confusing error trace. The target directory path is $DIR_PATH, whose value is "/", but the error shows that the command did not even receive the correct value.
The solution lies in the meaning of single quotes (') vs double quotes (") in shell expressions. Everything inside single quotes is literal, even variables. But if the same expression is in double quotes, variables are evaluated and replaced with their values.
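A quick local illustration of the difference:

$ name=world
$ echo 'hello $name'      # single quotes: literal
hello $name
$ echo "hello $name"      # double quotes: variable expanded
hello world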

In the above case, since the argument to ssh is in single quotes, it is passed to remote-host as-is, without $DIR_PATH being evaluated. On the remote host, the hadoop command inside double quotes does get evaluated, but since $DIR_PATH is not set on remote-host, the error occurs.
So, rearrange the quotes so that the expression is evaluated locally before being sent over ssh to remote-host.


$ ssh user@remote-host "bash -l -c 'hadoop fs -ls '$DIR_PATH"
user@remote-host's password:
Found 5 items
drwxr-xr-x   - 49483 mapr          9 2015-05-28 14:24 /abc
drwxr-xr-x   - 49483 mapr          0 2013-12-10 11:45 /hbase
drwxrwxrwx   - 49483 mapr         38 2015-06-08 20:03 /tmp
drwxr-xr-x   - 49483 mapr          3 2015-05-27 16:53 /user
drwxr-xr-x   - 49483 mapr          1 2013-12-10 11:45 /var


Similarly, we can run Hive or any other Hadoop ecosystem command from a non-Hadoop host using ssh.
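For example, a sketch of firing a Hive query over ssh (re-using the demodb.demo_table table from the earlier post; the exact quoting may need adjusting for your shell):

$ ssh user@remote-host 'bash -l -c "hive -e \"select count(*) from demodb.demo_table\""'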

HAPPY HACKING !!!

Friday, 29 May 2015

Profiling (CPU and memory performance) of an R function or expression

R is the mother tongue of data science aficionados (popularly known as "data scientists") and researchers. If you love playing with data, it is one of the most intuitive languages around, but once you start working with complex expressions and especially large datasets, performance becomes a paramount consideration.
For example, suppose an expression/function f takes 0.1 seconds per row. If the dataset has rows in the order of ten thousand, the cost comes to about 15-20 minutes, which is bearable. But if the dataset has rows in higher orders, like millions, the run stretches into hours, which leaves very little margin for even minuscule inefficiencies, since the cost grows with the size of the data.
I learnt this the hard way over the past few days while writing a simple R function for data processing, and had to fall back on my coding instincts to get around it.

Folks from a coding background like me have all used profilers to check the performance of their code and optimize it as much as possible. R provides one too, and voila!

The trick is to create small data samples and profile the concerned function on them. Once satisfied that the code is as optimized as possible, move on to the bigger dataset for the actual run.

> sampleData <- final.traindata[1:10, ]
> Rprof("profiler.out")
> output <- f(x = sampleData)
> Rprof(NULL)

The first line creates the data sample. The Rprof call on the second line starts the profiler, its only parameter being the path of the profiler output file, which is created in the current working directory by default. Then execute the function/expression you want to profile and stop the profiler by calling Rprof with NULL.

You are done with profiling; now it's time for the results!

> summaryRprof("profiler.out")
$by.self
                     self.time self.pct total.time total.pct
"as.POSIXct.POSIXlt"      0.82    24.26       0.90     26.63
"structure"               0.36    10.65       0.42     12.43
"levels"                  0.26     7.69       0.30      8.88
"as.Date"                 0.18     5.33       1.80     53.25
"is.na"                   0.10     2.96       1.06     31.36
"sort.int"                0.10     2.96       0.16      4.73
"strptime"                0.10     2.96       0.16      4.73
"NextMethod"              0.10     2.96       0.10      2.96
"$"                       0.08     2.37       0.24      7.10
"[[.data.frame"           0.08     2.37       0.16      4.73
"as.Date.POSIXlt"         0.08     2.37       0.08      2.37
"[.data.frame"            0.06     1.78       0.32      9.47
"format"                  0.06     1.78       0.18      5.33
"$<-.data.frame"          0.06     1.78       0.12      3.55
"match"                   0.06     1.78       0.08      2.37
"format.POSIXlt"          0.06     1.78       0.06      1.78
"is.factor"               0.06     1.78       0.06      1.78
"as.POSIXct"              0.04     1.18       0.94     27.81
"+.Date"                  0.04     1.18       0.38     11.24
"as.character"            0.04     1.18       0.28      8.28
"all"                     0.04     1.18       0.04      1.18
"any"                     0.04     1.18       0.04      1.18
"as.POSIXlt"              0.04     1.18       0.04      1.18
"attr"                    0.04     1.18       0.04      1.18
"dim"                     0.04     1.18       0.04      1.18
"length"                  0.04     1.18       0.04      1.18
"names"                   0.04     1.18       0.04      1.18
"paste"                   0.04     1.18       0.04      1.18
"as.Date.factor"          0.02     0.59       1.56     46.15
"as.Date.character"       0.02     0.59       1.44     42.60
"charToDate"              0.02     0.59       1.18     34.91
"is.na.POSIXlt"           0.02     0.59       0.98     28.99
"Ops.factor"              0.02     0.59       0.44     13.02
"+"                       0.02     0.59       0.40     11.83
"["                       0.02     0.59       0.34     10.06
"$<-"                     0.02     0.59       0.14      4.14
"%in%"                    0.02     0.59       0.10      2.96
"<Anonymous>"             0.02     0.59       0.04      1.18
"levels.default"          0.02     0.59       0.04      1.18
".subset2"                0.02     0.59       0.02      0.59
"c"                       0.02     0.59       0.02      0.59
"is.atomic"               0.02     0.59       0.02      0.59
"nargs"                   0.02     0.59       0.02      0.59
"sys.call"                0.02     0.59       0.02      0.59

$by.total
                      total.time total.pct self.time self.pct
"f"                         3.38    100.00      0.00     0.00
"as.Date"                   1.80     53.25      0.18     5.33
"as.Date.factor"            1.56     46.15      0.02     0.59
"as.Date.character"         1.44     42.60      0.02     0.59
"charToDate"                1.18     34.91      0.02     0.59
"is.na"                     1.06     31.36      0.10     2.96
"is.na.POSIXlt"             0.98     28.99      0.02     0.59
"as.POSIXct"                0.94     27.81      0.04     1.18
"as.POSIXct.POSIXlt"        0.90     26.63      0.82    24.26
"[<-"                       0.64     18.93      0.00     0.00
"[<-.factor"                0.64     18.93      0.00     0.00
"Ops.factor"                0.44     13.02      0.02     0.59
"=="                        0.44     13.02      0.00     0.00
"structure"                 0.42     12.43      0.36    10.65
"+"                         0.40     11.83      0.02     0.59
"+.Date"                    0.38     11.24      0.04     1.18
"["                         0.34     10.06      0.02     0.59
"[.data.frame"              0.32      9.47      0.06     1.78
"levels"                    0.30      8.88      0.26     7.69
"as.character"              0.28      8.28      0.04     1.18
"$"                         0.24      7.10      0.08     2.37
"format"                    0.18      5.33      0.06     1.78
"as.character.Date"         0.18      5.33      0.00     0.00
"format.Date"               0.18      5.33      0.00     0.00
"sort.int"                  0.16      4.73      0.10     2.96
"strptime"                  0.16      4.73      0.10     2.96
"[[.data.frame"             0.16      4.73      0.08     2.37
"$.data.frame"              0.16      4.73      0.00     0.00
"[.factor"                  0.16      4.73      0.00     0.00
"[["                        0.16      4.73      0.00     0.00
"$<-"                       0.14      4.14      0.02     0.59
"$<-.data.frame"            0.12      3.55      0.06     1.78
"NextMethod"                0.10      2.96      0.10     2.96
"%in%"                      0.10      2.96      0.02     0.59
"noNA.levels"               0.10      2.96      0.00     0.00
"as.Date.POSIXlt"           0.08      2.37      0.08     2.37
"match"                     0.08      2.37      0.06     1.78
".POSIXct"                  0.08      2.37      0.00     0.00
"format.POSIXlt"            0.06      1.78      0.06     1.78
"is.factor"                 0.06      1.78      0.06     1.78
"as.character.factor"       0.06      1.78      0.00     0.00
"all"                       0.04      1.18      0.04     1.18
"any"                       0.04      1.18      0.04     1.18
"as.POSIXlt"                0.04      1.18      0.04     1.18
"attr"                      0.04      1.18      0.04     1.18
"dim"                       0.04      1.18      0.04     1.18
"length"                    0.04      1.18      0.04     1.18
"names"                     0.04      1.18      0.04     1.18
"paste"                     0.04      1.18      0.04     1.18
"<Anonymous>"               0.04      1.18      0.02     0.59
"levels.default"            0.04      1.18      0.02     0.59
"NROW"                      0.04      1.18      0.00     0.00
".subset2"                  0.02      0.59      0.02     0.59
"c"                         0.02      0.59      0.02     0.59
"is.atomic"                 0.02      0.59      0.02     0.59
"nargs"                     0.02      0.59      0.02     0.59
"sys.call"                  0.02      0.59      0.02     0.59

$sample.interval
[1] 0.02

$sampling.time
[1] 3.38


The profiler output has two sections, $by.self and $by.total. $by.self lists individual operations by the time spent in the operation itself. $by.total gives a tree-like breakup of operations and their weight in the overall recorded time.
I personally find the second one more fruitful, as it gives a nice flow of the code execution and, at each step, the time consumed and its percentage of the overall time.
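Since the title also mentions memory, note that Rprof can sample memory usage alongside CPU time; a minimal sketch using base R's documented arguments:

> Rprof("profiler.out", memory.profiling = TRUE)   # sample memory as well as CPU
> output <- f(x = sampleData)
> Rprof(NULL)
> summaryRprof("profiler.out", memory = "both")    # report CPU and memory statistics together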

Have fun with Data !! 

Friday, 20 February 2015

Meluha Trilogy: Big Text Data Analytics and Visualization from Kindle ebooks



Find the Meluha Trilogy App here

I read the Meluha trilogy by Amish Tripathi last month. Although I had known about these books for the past couple of years, somehow I never thought of picking them up; now, after reading them, I regret not doing so earlier. It is one of the most fascinating and detail-oriented books I have read recently. The way it connects Hindu mythology to history is spellbinding, comparable only to Shashi Tharoor's The Great Indian Novel. The way the author has created this ancient world, rich with maps, geographies and mythological characters that still exist today, sometimes creates the illusion of it being real. And the book's message resonates with mine: "Gods are human beings who have done great deeds".
As a homage to this great trilogy, and being a student of data science, I built this app which visualizes the trilogy. I am heavily influenced by Trevor Stephens' Catch-22 visualization (http://trevorstephens.com/post/86060548369/catch-22-visualized) and the Les Misérables visualization. I focused mainly on sentiment analysis, as this trilogy is so rich in emotions and contrasting characters.

The Data Extraction and Preparation
The source of the data is the ebooks on my Kindle. I extracted chapter-wise text from the three Meluha trilogy Kindle ebooks. I explained the methodology of extracting chapter-wise text from a Kindle ebook in my last post (http://hadoopmania.blogspot.in/2015/02/extracting-text-from-kindle-ebooks.html).

After running the code mentioned in the last post, I got chapter-wise text from the ebooks and converted all the text to lowercase. The next step was to apply natural language processing to it to get insights from such a huge amount of data.

Theoretically, I came up with the following sequence of NLP steps:
1) Word tokenization
2) Sentence annotation
3) POS (part-of-speech) tagging
4) Lemmatisation
5) Named entity recognition - to find the person characters within the text
6) Deep syntactical parsing
7) Coreference resolution
8) Sentiment analysis

Implementation-wise, I was sure I would face problems at the named entity recognition step. As existing NER models are trained on standard text corpora, it was obvious they were not going to find characters with exotic names such as Shiva, Nandi, Sati, Karthik, etc. with good accuracy. I started by training an NER model on the book text but soon realised it was a huge waste of time, as I was indirectly tagging each character in the book. So I decided to try RegexNER, which identifies named entities on the basis of regular expressions and can give me 100% accuracy.

For the NLP implementation, I started with Apache OpenNLP (https://opennlp.apache.org/) but abandoned it after a week of experimentation due to the lack of RegexNER and good-quality coreference resolution. Then I came across the Stanford CoreNLP parser (http://nlp.stanford.edu/software/corenlp.shtml), which is a very rich NLP toolset and fits almost perfectly into my requirements, except for coreference resolution, which doesn't work on named entities extracted by RegexNER. I am still working on this feature and hoping for help on Stack Overflow (http://stackoverflow.com/questions/28169139/run-stanford-corenlps-dcoref-on-the-basis-of-output-of-regexner).

Character Occurrence Plot
For each chapter's text, I ran the Stanford CoreNLP pipeline with the following annotators in sequence: "tokenize, ssplit, regexner". "tokenize" does word tokenization, "ssplit" does sentence annotation, and "regexner" extracts named entities on the basis of a regex.
The regular expression mapping is configured by setting the following properties on the Stanford CoreNLP pipeline:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, regexner");   // the annotator sequence described above
props.setProperty("regexner.mapping", "in/characters.txt");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

and adding regexes like the following to the tab-separated file "in/characters.txt":

((\s*)shiva(\s*)|(\s*)neelkanth(\s*)|(\s*)sati(\s*))    CHARACTER
----------
--------

where each line has two columns: the first is the regular expression and the second is the named entity label.

After completing the NLP parsing, for each "CHARACTER"-labelled token I calculate the word percentile, which gives the location of that character mention within the chapter and the book. Plotting each character and its occurrence percentiles by book chapter then gives useful insights into the character's influence and coverage within the book.
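A rough sketch of that percentile calculation in plain Java (the names and numbers are illustrative, not the project's actual code):

// position of a CHARACTER-labelled token expressed as a percentile of the chapter length
static double wordPercentile(int tokenIndex, int totalTokensInChapter) {
    return 100.0 * (tokenIndex + 1) / totalTokensInChapter;
}

// e.g. a "shiva" token found at index 1249 in a 5000-token chapter
double p = wordPercentile(1249, 5000);   // 25.0, i.e. a quarter of the way into the chapter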





On the left taskbar, there is an option to select the book of the series and the chapter range we want to visualize. As is evident, the character "shiva" is almost omnipresent in the series, while other characters like "sati", "bhrigu" and "anandmayi" occur in a limited number of chapters but have a strong impact on the storyline.

Character Sentiments Plot
This plot, I think, is the most meaningful and impactful plot of this enterprise. Novels are always full of emotional swings and conflicts, so it makes for an interesting visualization to see how the characters' moods swing across the book.

For data extraction, I ran the following Stanford NLP pipeline on each chapter's data: "tokenize, ssplit, regexner, pos, lemma, parse, sentiment". The first three components are already explained. "pos" does part-of-speech tagging (nouns, adverbs, etc.), "lemma" does lemmatisation, "parse" does deep syntactical parsing and tags relationships between the phrases identified by "pos", and "sentiment" does sentiment scoring of each phrase in the text using models built with a recursive neural network and the sentiment treebank.

But we only want the sentiment around the concerned characters identified by "regexner", so I filtered the search space down to only those sentences which have a book character as the subject or passive subject.
For example: "shiva killed the demons." or "shiva was angered by the carelessness of the devotees."

Then, in this filtered search space, I recorded each character's occurrence percentile and the enclosing sentence's sentiment score. Plotting this information using ggplot looks something like this:
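(As a hedged sketch, the plot could be produced with ggplot2 along these lines; the data frame and its columns are my assumptions, not the project's actual code.)

library(ggplot2)

# sentiments: one row per character mention, with columns
# character, chapter, percentile (position within the chapter), sentiment
ggplot(sentiments, aes(x = percentile, y = factor(chapter), colour = sentiment)) +
  geom_point(shape = "|", size = 4) +
  facet_wrap(~ character) +
  labs(x = "position within chapter (%)", y = "chapter")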


In chapters 44 and 45, "sati" has a concentrated cluster of red, which is "very negative" emotion. This stands justified, as she was cheated by her father and engaged in battle. Similarly, "shiva" has a good amount of green lines in chapter 6 of the first book, when he falls in love with "sati".

This plot could have been richer and more insightful if I had been able to run coreference resolution on the characters identified by regexner, but alas, it only runs on the output of "ner", which is not able to identify the book's characters accurately.

Character Co-occurrence
This plot basically correlates the simultaneous occurrence of two characters within the book. It shows us which characters are frequent collaborators in the book's storyline.
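One simple way to compute such a matrix (a sketch assuming a data frame of character mentions with a chapter column, which is my assumption):

# chapters x characters incidence matrix: TRUE if the character appears in the chapter
incidence <- table(mentions$chapter, mentions$character) > 0

# characters x characters co-occurrence: number of chapters each pair shares
cooccurrence <- crossprod(incidence)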


In this matrix, we can ignore the diagonal, as it reflects only the occurrence density of each character on its own. Otherwise, we can see strong correlation patterns of shiva with sati, ganesh and bhagirath. Also, kali and ganesh have strong co-occurrence.

Conclusion

I am still working to enhance the sentiment analysis and to find more visualization options to enrich the overall insights.
That said, I admit it has far exceeded my initial expectations and has turned out to be a good learning journey, in which I worked through many technical and analytical hiccups.
Regarding its real-world value, I think it can be used to give users a visualization of a book before reading it, to help them choose on the basis of their sentiment preferences. It can also hold high nostalgic value for fans of a book and could be used for targeted marketing.

Open for Healthy Criticism !!!

Sunday, 8 February 2015

Extracting text from Kindle ebooks : Chapter wise

Amazon Kindle has digitized the book world in a revolutionary way. Being an avid book lover and reader, I had tried to avoid, in fact hated, digitized books available as PDFs and, especially, on Kindle for a long time. I loved the experience of turning those crispy pages and exploring a new world unfolding on every page. It all changed two months back, when I bought an Amazon Kindle Paperwhite. Although the only pro I had in mind at that time was its amazing Vocabulary Builder app, over time I started to appreciate other things as well, like the ability to explore more books in the digital bookshop, recommendations, and a good simulated book-reading experience. Still, I think it will take me some time to move on from hard copies. It is a classic man-vs-machine case; I am most skeptical about a machine controlling my mind and thoughts, and what better than digitized books to do so.
Coming back from the digression and dreadful dreams, this blog post is about how to extract chapter-wise text from Kindle book formats. It arose out of necessity for my personal project, which I conceived in one of those dreadful dreams. The project is about doing text analysis on each chapter to find the influence of characters in the book and on each other. Another part is doing sentiment analysis on the characters to find their moods in different parts of the book. I will explain it in the next series of blog posts.

Problem Statement:
The Amazon Kindle reader has its own digital format in which it encodes books. It is the mobi/epub format, which we can see easily if we plug a Kindle device into a computer and explore its filesystem. I wanted to read the text from a Kindle ebook chapter-wise.

For background information, in the epub format used here, all components (HTML, CSS, XML, fonts, images) are represented as Resources, with the content in XHTML. There are 3 indexes into these Resources, as per the epub specification:
Spine
the Resources to be shown when a user reads the book from start to finish.
Table of Contents
the table of contents. Its references may be in a different order and contain different Resources than the spine, and often do.
Guide
references to a set of special Resources like the cover page, the glossary, the copyright page, etc.
The complication is that these 3 indexes may, and usually do, point to different pages. A chapter may be split into 2 pieces to fit into memory; then the spine will contain both pieces but the Table of Contents only the first. The contents page may be in the Table of Contents and the Guide, but not in the Spine. And so on.


Spine:              Chapter 1, Chapter 1 Part 2, Chapter 2
Table of Contents:  Chapter 1, Chapter 2
Guide:              Cover, Preface
(all of these references point into the book's Resources)


Solution:
Thus, I started a recce on the internet and found two libraries fit for the job:
1) Apache Tika (http://tika.apache.org/): an open-source library for detecting and extracting metadata and content from almost any file type. Just pass the file to its simple interface; it will automatically detect the type and call the corresponding parser to extract metadata and content.

It is as simple as:

Tika tika = new Tika();
tika.setMaxStringLength(10 * 1024 * 1024); // setting max string buffer length - 10 MB

InputStream input = new BufferedInputStream(new FileInputStream("in/book1.epub"));
Metadata metadata = new Metadata();
String content = tika.parseToString(input, metadata);

As is evident, it extracts the text of the whole file at once. We can't extract it piecewise (here, chapter-wise), and we can't skip the index, glossary, preface and other sections of the book which aren't part of the story.

2) epublib (https://github.com/psiegman/epublib): another open-source library, primarily for creating epub files from existing HTML files. It also has a basic API to read metadata and content from an epub file.
Each ebook is represented as a nl.siegmann.epublib.domain.Book object, which has the following methods:

  • getMetadata to get metadata about the book
  • getSpine to get the spine
  • getTableOfContents to get a reference to the table of contents
  • getGuide to get a reference to the guide
  • getResources to get references to all the images, chapters, sections, XHTML files, stylesheets, etc. that make up the book.
Coming back to our requirement, this library almost fulfills the need to read the book chapter-wise.


InputStream is = new BufferedInputStream(new FileInputStream("path_to_kindle_ebook"));

EpubReader epubReader = new EpubReader();
Book book = epubReader.readEpub(is);
Spine bookSpine = book.getSpine();
TableOfContents tocs = book.getTableOfContents();

for (TOCReference toc : tocs.getTocReferences()) {
    if (toc.getTitle().toLowerCase().contains("chapter")) {
        Resource chapterResource =
                bookSpine.getResource(bookSpine.findFirstResourceById(toc.getResourceId()));

        byte[] data = chapterResource.getData();                       // content as byte[]
        Reader reader = chapterResource.getReader();                   // content as a character stream
        InputStream chapterStream = chapterResource.getInputStream();  // content as an input stream
    }
}

But the issue is that chapterResource has three ways to return content: getData(), getReader() and getInputStream(). All of them return XHTML content, which needs to be parsed further to extract the text content.
(An important point to note is that I haven't scanned through the chapters via bookSpine directly because, as mentioned before, if a big chapter is split into two sections to fit into memory, the spine will hold both references. Scanning chapters via the TableOfContents is more appropriate.)

So, to parse the XHTML and extract the text content, there are two ways: either write a SAX handler ourselves or reuse the SAX-based parser from the Tika library. Keeping the programming spirit in mind, I am opting for the second option.



InputStream is = new BufferedInputStream(new FileInputStream("path_to_kindle_ebook"));

EpubReader epubReader = new EpubReader();
Book book = epubReader.readEpub(is);
Spine bookSpine = book.getSpine();
TableOfContents tocs = book.getTableOfContents();

for (TOCReference toc : tocs.getTocReferences()) {
    if (toc.getTitle().toLowerCase().contains("chapter")) {
        Resource chapterResource =
                bookSpine.getResource(bookSpine.findFirstResourceById(toc.getResourceId()));

        String chapterTitle = toc.getTitle();
        String chapterText = null;
        try {
            org.apache.tika.metadata.Metadata metadata = new org.apache.tika.metadata.Metadata();
            ParseContext context = new ParseContext();

            BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
            XHTMLContentHandler xhtmlHandler = new XHTMLContentHandler(handler, metadata);
            xhtmlHandler.startDocument();
            ContentHandler contentHandler = new EmbeddedContentHandler(new BodyContentHandler(xhtmlHandler));

            Parser epubContentParser = new EpubContentParser();
            epubContentParser.parse(chapterResource.getInputStream(), contentHandler, metadata, context);
            xhtmlHandler.endDocument();

            chapterText = contentHandler.toString().toLowerCase();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}

The input to the handler object is the size of the text buffer, which I have configured as 10 MB (10*1024*1024). I know it is a bit unclean, but it reuses the SAX parser in the Tika module, which is already well tested. And we do have the best excuse a programmer can have: "REUSABILITY". On a serious note, I have tested it numerous times and it works fine.

I have implemented the same functionality in my project using the Iterator design pattern. You can check org.poc.book.reader.impl.KindleBookReader.java in the GitHub project, https://github.com/shakti-garg/bookProject.

Signing off! 

Sunday, 4 May 2014

Panacea for Big Data Problems: 2 steps of common sense

It wouldn't be wrong to say that this decade of the information age belongs to the web and the internet. As everything in the world is connected, or going to be connected, to everything else, the rate of data explosion can be estimated from the simple use case of the number of daily handshakes between devices. That alone will amount to many gigabytes of data, taking into account a 10-billion population each having a laptop and a smartphone, and it is the most trivial use case I could think of. Now start imagining the kind of data that will be generated in non-trivial cases.
Internet companies are leading the way in evolving technology to handle such large amounts of data. The reason is that they need it to survive the internet war and, as the wisest have said, necessity is the mother of invention (in our case, innovation). Slowly, other domains have also started to adopt these technologies to churn their internal data and the data coming from user interactions, and gain insight from it. I think this is enough of an introduction...
The one thing I can't help observing over the last couple of years is the kind of enthusiasm for adopting big data technologies. It has become cool for job aspirants to show off Hadoop, HBase and other weapons on their resumes, and cooler for product managers and architects to showcase the use of these weapons in their arsenal (I mean, projects). It is coolest for guys like me who sometimes ask, "Do you really think this is a big data scenario?", and the reply comes: "I think big data can give the product an edge" or "Big data is the future and we had better add it to the product". The result: half-cooked dishes that you have to eat but can't spit out!
Coming to coders and job aspirants, they are just showing what the other side of the interview panel expects. I know people who ignore basic coding skills like Java, design patterns, algorithms, etc. just to mug up big data technologies. Ask them "How?" and they have a whole lot of correct answers, but "Why?" always results in stuttering.
From my experience over the last couple of years (not enough to make me an authority, but enough to give me the courage to express myself), the simple steps for programmers and their near and dear ones are:
1) "Try answering the data problem without any big data concept/technology. If you fail twice, add big data concepts on a need basis."
This is the mantra I follow to avoid big data overkill while proposing a solution to a data problem. A past experience to support it: "We have a billing system in an electrical distribution company with 1 million users. The company switched from conventional analog meters (remember, the one-reading-a-month meters) to smart meters (readings are sent to the system every now and then, along with diagnostic parameters). Now a reading arrives hourly, and the data for each month has increased by a factor of roughly 31 days * 24 hours = 744." We should move to big data technologies to find an answer, right?
Now follow the above mantra and try to come up with a non-big-data solution.

"OK. We have data coming hourly, which means the same load daily that we earlier had once a month. We can keep the system online daily to handle this. Done. Next, where do we store such a large amount of data while providing the same data-read efficiency and random seeks of records as before? Store the data partitioned by day, i.e. create a master table per day; the index size for each table stays the same, with no noticeable performance degradation in data reads. Possible? Done. Next, aggregation of such a huge amount of data for the bill calculation that is done once a month. Aggregate the data daily into a daily billing table, then aggregate all the daily billing tables for the month on its last day... bill calculated! Hurray! One minute, what about scalability? User growth in utilities is roughly equal to population growth, which means about 2% in developed countries and on average 5% in developing ones. Add a server to the DB cluster once every year or two to handle the data load. Anything else? Ah, what about data that is more than 2 months old? We don't need it anymore for bill calculation. Simple solution: go with the same archiving policy as before. It might now take 2 days to archive 1 month's worth of data, but who cares!"
All bases covered... the new requirement needs only minor modifications to the existing application on the logic and DB side. It will take a couple of guys at most a month to do that, with no testing headache, no production headache, no quality headache and, most importantly, no new-technology headache, which can cause severe bleeding of resources and attacks of uncertainty.

2) Pick your big data tool by need, not by taste, enthusiasm, comfort, fashion or anything else.
Big data technologies are currently like a carpenter's toolbox. The choice of tools defines the effectiveness of the end product as well as the neatness of the job done. That one instinct in a carpenter to pick the correct chisel, the correct saw and the correct hammer for the job at hand defines his cost and market demand.
The same applies to the big data scenario. A big data technology doesn't fit perfectly in every case; it might work for all cases, but perfectly only for selected ones. The classic example is our star performer, Hadoop, a batch-oriented MapReduce distributed computation engine. OK, now I have a use case where, in the Indian tax department, I have to verify the income-tax returns of 1000 taxpayers daily on an ad-hoc basis. All the financial, employment and personal details are on a Hadoop cluster (HDFS). Seeing the scale of data involved, which is in petabytes, Hadoop MapReduce seems a perfect fit: in the mapper we can pick out the selected taxpayers and their financial details, which are aggregated in the reducer to create an ideal income-tax return that can then be compared with the submitted one.
But one catch here is that even 5 million taxpayers out of 500 million is only around 1% of the total data, yet the job reads the complete dataset just to filter out that 1%. Phew! Go with indexing on taxpayer name and year, or push the data into HBase for faster reads. Now our income tax department might be able to verify statements for 1000 times more users each day.

The same goes for the big data developer/architect, who is like the carpenter: it's good to know how to use different big data tools, but what makes you stand out is the choice of tools and how you use them for a given big data problem. And believe me, hunches on tool selection never work, at least here. I have learnt that the hard way.

Adieu!!