Hadoop Mania: Meluha Trilogy: Big Text Data Analytics and Visualization from kindle ebooks

Find the Meluha Trilogy App here

I read the Meluha trilogy by Amish Tripathi last month. Although I have known about these books for the past couple of years, I somehow never thought of picking them up, and now, after reading them, I regret not having done so earlier. It is one of the most fascinating and detail-oriented series I have read recently. The way it connects Hindu mythology to history is spellbinding, comparable only to Shashi Tharoor's The Great Indian Novel. The author has created an ancient world rich with maps, geographies and mythological characters that also exist today, which makes this magical world sometimes feel real. And the books' message resonates with my own belief: "Gods are human beings who had done great deeds".

As a homage to this great trilogy, and being a student of data science, I built this app, which visualizes the trilogy. I am heavily influenced by Trevor Stephens's Catch-22 visualization (http://trevorstephens.com/post/86060548369/catch-22-visualized) and the Les Miserables visualization. I focused mainly on sentiment analysis, as the trilogy is so rich in emotions and contrasting characters.

The Data Extraction and Preparation

The source of the data is the ebooks on my Kindle. I extracted chapter-wise text from the three Meluha trilogy Kindle ebooks. I explained the method for extracting chapter-wise text from a Kindle ebook in my last post (http://hadoopmania.blogspot.in/2015/02/extracting-text-from-kindle-ebooks.html).

After running the code from that post, I had the chapter-wise text of each ebook and converted it all to lowercase. The next step was to apply natural language processing to it to get insights from such a large body of text.

In theory, I came up with the following sequence of NLP steps:

1) Word tokenization

2) Sentence annotation

3) POS (part-of-speech) tagging

4) Lemmatisation

5) Named entity recognition, to find person characters within the text

6) Deep syntactic parsing

7) Coreference resolution

8) Sentiment analysis

Implementation-wise, I was sure I would face problems at the named entity recognition step. As existing NER models are trained on standard text corpora, it was obvious they would not find characters with exotic names such as Shiva, Nandi, Sati, Karthik, etc. with good accuracy. I started training an NER model on the book text but soon realised it was a huge waste of time, as I was in effect tagging each character in the book by hand. So I decided to try regex-ner, which identifies named entities on the basis of regular expressions and can give me 100% accuracy on the character names I list.

For the NLP implementation, I started with Apache OpenNLP (https://opennlp.apache.org/) but, after a week of experimentation, abandoned it for lack of regex-ner and good-quality coreference resolution. Then I came across Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml), a very rich NLP toolset that fitted my requirements almost perfectly, except for coreference resolution, which doesn't work on named entities extracted by regex-ner. I am still working on this feature and hoping for help on Stack Overflow (http://stackoverflow.com/questions/28169139/run-stanford-corenlps-dcoref-on-the-basis-of-output-of-regexner).

Character Occurrence Plot
For each chapter's text, I ran the Stanford CoreNLP pipeline with the following NLP tools in sequence: "tokenize, ssplit, regexner". "tokenize" does word tokenization and "ssplit" does sentence annotation. "regexner" extracts named entities on the basis of a regex.
The regular expressions are supplied by setting the following property on the Stanford CoreNLP pipeline

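import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// Annotators run in order: tokenization, sentence splitting, regex-based NER
props.setProperty("annotators", "tokenize, ssplit, regexner");
props.setProperty("regexner.mapping", "in/characters.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);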

and adding a regex like the following to the tab-separated file "in/characters.txt".

((\s*)shiva(\s*)|(\s*)neelkanth(\s*)|(\s*)sati(\s*))	CHARACTER
----------
--------

where each line has two tab-separated columns: the first is the regular expression and the second is the named entity label.
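To illustrate how a two-column mapping like this can label tokens, here is a minimal sketch in plain Java. It is not CoreNLP's regexner implementation: the class name CharacterTagger, the "O" label for unmatched tokens and the simplified pattern are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP.
public class CharacterTagger {
    // Compiled pattern -> label, kept in the order the mapping lines were given.
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    // Each mapping line is "<regex>\t<label>", mirroring in/characters.txt.
    public CharacterTagger(List<String> mappingLines) {
        for (String line : mappingLines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // skip malformed lines
            }
            rules.put(Pattern.compile(cols[0]), cols[1]);
        }
    }

    // Returns one label per token; "O" marks tokens that match no rule.
    public List<String> tag(List<String> tokens) {
        List<String> labels = new ArrayList<>();
        for (String token : tokens) {
            String label = "O";
            for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
                // matches() requires the whole token to match the pattern.
                if (rule.getKey().matcher(token).matches()) {
                    label = rule.getValue();
                    break;
                }
            }
            labels.add(label);
        }
        return labels;
    }

    public static void main(String[] args) {
        CharacterTagger tagger = new CharacterTagger(
                Arrays.asList("shiva|neelkanth|sati\tCHARACTER"));
        // prints [CHARACTER, O, CHARACTER]
        System.out.println(tagger.tag(Arrays.asList("shiva", "met", "sati")));
    }
}
```

Because the chapter text was lowercased beforehand, the patterns can stay lowercase as well.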

After the NLP parsing is complete, for each token labelled "CHARACTER" I calculate its word percentile, which gives the location of that character mention within the chapter and the book. Plotting each character's occurrence percentiles by chapter then gives useful insights into the character's influence and coverage within the book.
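The percentile calculation can be sketched as follows. This is an assumption about the exact formula, not the app's actual code: here the percentile of a token is the share of the chapter's tokens that precede it, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating one way to compute occurrence percentiles.
public class OccurrencePercentile {
    // For every token labelled CHARACTER, returns the percentage (0 to <100)
    // of the chapter's tokens that come before it.
    public static List<Double> chapterPercentiles(List<String> labels) {
        List<Double> result = new ArrayList<>();
        int total = labels.size();
        for (int i = 0; i < total; i++) {
            if ("CHARACTER".equals(labels.get(i))) {
                result.add(100.0 * i / total);
            }
        }
        return result;
    }

    // Places a within-chapter percentile on a chapter axis, so that
    // chapter 3 at 50% becomes 3.5 (chapter numbers assumed 1-based).
    public static double bookPosition(int chapter, double chapterPercentile) {
        return chapter + chapterPercentile / 100.0;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("CHARACTER", "O", "O", "CHARACTER");
        System.out.println(chapterPercentiles(labels)); // prints [0.0, 75.0]
        System.out.println(bookPosition(3, 50.0));      // prints 3.5
    }
}
```

The value from bookPosition can serve as the x coordinate of a mention, with one row per character.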

On the left task bar, there is an option to select the book of the series and the chapter range to visualize. As is evident, the character "shiva" is almost omnipresent in the series, but it is other characters like "sati", "bhrigu" and "anandmayi" who occur in a limited number of chapters yet have a strong impact on the storyline.

Character Sentiments Plot
This plot, I think, is the most meaningful and impactful plot of this project. Novels are always full of emotional swings and conflicts, so it makes for an interesting visualization to see how the characters' moods swing across the books.

For data extraction, I ran the following Stanford CoreNLP pipeline on each chapter: "tokenize, ssplit, regexner, pos, lemma, parse, sentiment". The first three components are explained above. "pos" does part-of-speech tagging (nouns, adverbs, etc.). "lemma" does lemmatisation. "parse" does deep syntactic parsing and tags the relationships between phrases built from the "pos"-tagged words. "sentiment" scores the sentiment of each phrase in the text using models built with a recursive neural network trained on a sentiment treebank.

But we only want the sentiment of the characters concerned, as identified by "regexner". So I narrowed the search space to include only those sentences which have a book character as "SUBJECT" or "PASSIVE SUBJECT".
For example: "shiva killed the demons." or "shiva was angered by carelessness of devotees."

Then, in this filtered search space, I recorded each character's occurrence percentile and the sentiment score of the enclosing sentence. Plotting this information using ggplot looks something like this:

In chapters 44 and 45, "sati" has a concentrated cluster of red, which marks "very negative" emotion. This is justified, as she was cheated by her father and engaged in battle. Similarly, "shiva" has a good number of green lines in chapter 6 of the first book, when he falls in love with "sati".
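A cluster like this can also be found by counting instead of by eye. The sketch below assumes each recorded sentence is stored with its character, chapter and a sentiment class on a five-point scale from 0 (very negative) to 4 (very positive). The Mention record layout, the class names and the sample values are made up for illustration and are not results from the books.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical summary of recorded character sentiments.
public class SentimentSummary {
    // One filtered sentence: its subject character, chapter and sentiment class.
    public static final class Mention {
        final String character;
        final int chapter;
        final int sentiment; // assumed scale: 0 = very negative ... 4 = very positive

        public Mention(String character, int chapter, int sentiment) {
            this.character = character;
            this.chapter = chapter;
            this.sentiment = sentiment;
        }
    }

    // Counts, per chapter, the sentences about one character with class 0.
    public static Map<Integer, Integer> veryNegativeByChapter(
            List<Mention> mentions, String character) {
        Map<Integer, Integer> counts = new TreeMap<>(); // sorted by chapter
        for (Mention m : mentions) {
            if (m.character.equals(character) && m.sentiment == 0) {
                counts.merge(m.chapter, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the shape of the data.
        List<Mention> mentions = Arrays.asList(
                new Mention("sati", 44, 0),
                new Mention("sati", 44, 0),
                new Mention("sati", 45, 0),
                new Mention("sati", 45, 3),
                new Mention("shiva", 6, 4));
        // prints {44=2, 45=1}
        System.out.println(veryNegativeByChapter(mentions, "sati"));
    }
}
```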

This plot could have been richer and more insightful if I had been able to run coreference resolution on the characters identified by regexner but, alas, it only runs on the output of "ner", which is not able to identify the book's characters accurately.

Character Co-occurrence
This plot correlates the simultaneous occurrence of two characters within the book. It shows which characters are frequent collaborators in the storyline.

In this matrix, we can ignore the diagonal, as it reflects only the occurrence density of a single character. Elsewhere, we can see strong correlation patterns of shiva with sati, ganesh and bhagirath. Also, kali and ganesh have strong co-occurrence.
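One simple way to build such a matrix is to count the text units in which both characters appear. The sketch below assumes exactly that; the choice of unit (chapter, sentence or something else), the class name and the sample sets are assumptions, since the exact measure behind the plot is not spelled out here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical co-occurrence counter.
public class CoOccurrence {
    // matrix[i][j] = number of units in which characters i and j both appear.
    // The diagonal matrix[i][i] is simply how many units contain character i.
    public static int[][] matrix(List<String> characters, List<Set<String>> units) {
        int n = characters.size();
        int[][] m = new int[n][n];
        for (Set<String> unit : units) {
            for (int i = 0; i < n; i++) {
                if (!unit.contains(characters.get(i))) {
                    continue;
                }
                for (int j = 0; j < n; j++) {
                    if (unit.contains(characters.get(j))) {
                        m[i][j]++;
                    }
                }
            }
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> characters = Arrays.asList("shiva", "sati", "kali");
        // Made-up sample: the set of characters seen in each unit.
        List<Set<String>> units = Arrays.asList(
                new HashSet<>(Arrays.asList("shiva", "sati")),
                new HashSet<>(Arrays.asList("shiva")),
                new HashSet<>(Arrays.asList("shiva", "sati", "kali")));
        // prints [[3, 2, 1], [2, 2, 1], [1, 1, 1]]
        System.out.println(Arrays.deepToString(matrix(characters, units)));
    }
}
```

The resulting matrix is symmetric, which is why only one triangle needs to be read.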

Conclusion

I am still working to improve the sentiment analysis and to find more visualization options to enrich the overall insights.
Still, I admit it has far exceeded my initial expectations and turned out to be a good learning journey in which I worked through many technical and analytical hiccups.
Regarding its real-world value, I think it can be used to give users a visualization of a book before they read it, to help them choose on the basis of their sentiment preferences. It can also be of very high nostalgic value for fans of a book and can be used for targeted marketing.

Open for healthy criticism!

Friday, 20 February 2015