Sunday, 8 February 2015

Extracting text from Kindle ebooks, chapter-wise

Amazon Kindle has digitized the book world in a revolutionary way. Being an avid book lover and reader, I had long tried to avoid, in fact hated, digitized books available in PDF and, more significantly, on Kindle. I loved the experience of turning those crispy pages and watching a new world unfold on every page. It all changed two months back, when I bought an Amazon Kindle Paperwhite. Although the only pro I had in mind at the time was its amazing Vocabulary Builder app, over time I started to appreciate other things too: the ability to explore more books in the digital bookshop, the recommendations, and a good simulated book-reading experience. Still, I think it will take me some time to move away from hard copies. It is a classic Man vs Machine case: I am most skeptical about a machine controlling my mind and thoughts, and what better than digitized books to do so.
Coming back from digression and dreadful dreams, this blog post is about how to extract chapter-wise text from Kindle book formats. I need this for a personal project which I conceived in one of those dreadful dreams: doing text analysis on each chapter to find the influence of characters in a book and on each other, and sentiment analysis on characters to track their mood in different parts of the book. I will explain it in the next series of posts.

Problem Statement:
Amazon Kindle has its own digital formats in which it encodes books, mobi/epub, which we can see easily if we plug a Kindle device into a computer and explore its filesystem. I wanted to read text from a Kindle ebook chapter-wise.

For background: in the epub format, all components (html, css, xml, fonts, images) are represented as Resources, and the textual content is in XHTML. There are three indexes into these Resources, as per the epub specification.
Spine
These are the Resources to be shown when a user reads the book from start to finish.
Table of Contents
Table of Contents references may be in a different order and reference different Resources than the spine, and often do.
Guide
The Guide has references to a set of special Resources like the cover page, the Glossary, the copyright page, etc.
The complication is that these three indexes may, and usually do, point to different pages. A chapter may be split into two pieces to fit it into memory; then the spine will contain both pieces, but the Table of Contents only the first. The Contents page may be in the Table of Contents and the Guide, but not in the Spine. And so on.


[Diagram: the Spine, Table of Contents, and Guide each reference overlapping but different sets of Resources. For example, the Spine lists Chapter 1, Chapter 1 Part 2 and Chapter 2; the Table of Contents lists Chapter 1 and Chapter 2; the Guide lists the Cover and Preface.]
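The mismatch can be sketched with plain Java collections (the resource ids below are illustrative, not from any real book): the spine lists both halves of a split chapter, while the table of contents lists only the first, and the guide points elsewhere entirely.

```java
import java.util.List;

public class IndexMismatchDemo {
    public static void main(String[] args) {
        // Illustrative resource ids; a real epub defines its own.
        List<String> spine = List.of("chapter1_part1", "chapter1_part2", "chapter2");
        List<String> toc   = List.of("chapter1_part1", "chapter2");
        List<String> guide = List.of("cover", "preface");

        // The spine holds both pieces of the split chapter...
        System.out.println(spine.contains("chapter1_part2")); // true
        // ...but the table of contents references only the first piece,
        System.out.println(toc.contains("chapter1_part2"));   // false
        // and the guide points to resources that are in neither of the others.
        System.out.println(spine.contains("cover"));          // false
    }
}
```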


Solution:
Thus, I started a recce on the internet and found two libraries fit for the job:
1) Apache Tika (http://tika.apache.org/): an open-source library for detecting and extracting metadata and content from almost any file type. Just pass the file to its simple interface; it will automatically detect the type and call the corresponding parser to extract metadata and content.

It is as simple as:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.Tika;
    import org.apache.tika.metadata.Metadata;

    Tika tika = new Tika();
    tika.setMaxStringLength(10 * 1024 * 1024); // max string buffer length: 10 MB

    InputStream input = new BufferedInputStream(new FileInputStream("in/book1.epub"));
    Metadata metadata = new Metadata();
    String content = tika.parseToString(input, metadata);

As is evident, it extracts the text of the whole file at once. We can't extract it piecewise (here, chapter-wise), and we can't skip the index, glossary, preface and other sections of the book which aren't part of the story.

2) epublib (https://github.com/psiegman/epublib): another open-source library, primarily for creating epub files from existing html files. It also has a basic API to read metadata and content from an epub file.
Each ebook is represented as a nl.siegmann.epublib.domain.Book object, which has these methods:

  • getMetadata to get the book's metadata
  • getSpine to get a reference to the spine
  • getTableOfContents to get a reference to the table of contents
  • getGuide to get a reference to the guide
  • getResources to get references to all the images, chapters, sections, xhtml files, stylesheets, etc. that make up the book
Coming back to our requirement, this library almost fulfills the need to read a book chapter-wise.


InputStream is = new BufferedInputStream(new FileInputStream("path_to_kindle_ebook"));

EpubReader epubReader = new EpubReader();
Book book = epubReader.readEpub(is);
Spine bookSpine = book.getSpine();
TableOfContents tocs = book.getTableOfContents();

for (TOCReference toc : tocs.getTocReferences()) {
    if (toc.getTitle().toLowerCase().contains("chapter")) {
        Resource chapterResource =
                bookSpine.getResource(bookSpine.findFirstResourceById(toc.getResourceId()));

        byte[] data = chapterResource.getData();              // content as a byte[]
        Reader reader = chapterResource.getReader();          // content as a character stream
        InputStream chapterStream = chapterResource.getInputStream(); // content as a byte stream
    }
}

But there is a catch: chapterResource offers three ways to return content: getData(), getReader() and getInputStream(). All of them return the raw XHTML content, which needs to be parsed further to extract the text.
(An important point to note is that I haven't used bookSpine directly to scan through chapters because, as mentioned before, if a big chapter is split into two sections to fit into memory, the spine will hold both references. Scanning chapters is more reliable through the TableOfContents.)
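To see why further parsing is needed, here is a plain-Java sketch (no epublib required, and the sample snippet is made up): decoding the bytes that getData() would hand back still leaves all the XHTML markup in place.

```java
import java.nio.charset.StandardCharsets;

public class RawChapterDemo {
    public static void main(String[] args) {
        // Hypothetical chapter content, as the Resource would hold it:
        // still XHTML markup, not plain story text.
        byte[] data = ("<html><body><h1>Chapter 1</h1>"
                + "<p>It was a dark and stormy night.</p></body></html>")
                .getBytes(StandardCharsets.UTF_8);

        // Decoding the bytes only yields the markup string...
        String raw = new String(data, StandardCharsets.UTF_8);
        System.out.println(raw.contains("<p>")); // true: tags are still present

        // ...so an (X)HTML parser is still needed to get at the text content.
    }
}
```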

So, to parse the XHTML and extract the text content, there are two ways: either write a SAX parser, or reuse the SAX parser from the Tika library. Keeping the programming spirit in mind, I opted for the second option.



InputStream is = new BufferedInputStream(new FileInputStream("path_to_kindle_ebook"));

EpubReader epubReader = new EpubReader();
Book book = epubReader.readEpub(is);
Spine bookSpine = book.getSpine();
TableOfContents tocs = book.getTableOfContents();

for (TOCReference toc : tocs.getTocReferences()) {
    if (toc.getTitle().toLowerCase().contains("chapter")) {
        Resource chapterResource =
                bookSpine.getResource(bookSpine.findFirstResourceById(toc.getResourceId()));

        String chapterTitle = toc.getTitle();
        String chapterText = null;
        try {
            org.apache.tika.metadata.Metadata metadata =
                    new org.apache.tika.metadata.Metadata();
            ParseContext context = new ParseContext();

            // text buffer of up to 10 MB
            BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
            XHTMLContentHandler xhtmlHandler = new XHTMLContentHandler(handler, metadata);
            xhtmlHandler.startDocument();
            ContentHandler contentHandler =
                    new EmbeddedContentHandler(new BodyContentHandler(xhtmlHandler));

            Parser epubContentParser = new EpubContentParser();
            epubContentParser.parse(chapterResource.getInputStream(), contentHandler,
                    metadata, context);
            xhtmlHandler.endDocument();

            chapterText = contentHandler.toString().toLowerCase();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}

The argument to the handler object is the size of the text buffer, which I have configured as 10 MB (10*1024*1024). I know it is a bit unclean, but it reuses the SAX parser in the Tika module, which is already well tested. And we have the best excuse a programmer can have: "REUSABILITY". On a serious note, I have tested it numerous times and it works fine.

I have implemented the same functionality in my project using the Iterator design pattern. You can check org.poc.book.reader.impl.KindleBookReader.java in the github project, https://github.com/shakti-garg/bookProject.

Signing off! 
