Sunday 4 May 2014

Panacea for Big Data Problems: 2 steps of common sense

It wouldn't be wrong to say that this decade of the information age belongs to the web and the Internet. As everything in the world is connected, or going to be connected, to everything else, the rate of data explosion can be estimated from the simple use case of counting the daily handshakes between devices. That alone would amount to a good few GBs of data, taking into account a population of 10 billion, each with a laptop and a smartphone. And that is the most trivial use case I could think of! Now start imagining the kind of data that will be generated in the non-trivial ones.
Internet companies are leading the way in evolving technology to handle such large amounts of data. The reason is that they need it to survive the Internet war, and as the wise say, necessity is the mother of invention (in our case, innovation). Slowly, other domains have also started adopting these technologies to gain insight and churn through their internal data and the data coming from user interactions. I think that's enough of an introduction...
The one thing I can't help observing over the last couple of years is the kind of enthusiasm there is to adopt big data technologies. It has become cool for job aspirants to show off Hadoop, HBase and other weapons on their resumes, and cooler for product managers and architects to showcase the use of these weapons in their arsenal (I mean, projects). It is coolest for guys like me who sometimes ask, "Do you really think this is a big data scenario?" and get the reply, "I think Big Data can give the product an edge" or "Big Data is the future and we'd better add it to the product". The result: half-cooked dishes that you have to eat but can't spit out!
Coming to coders and job aspirants, they are just showing what the other side of the interview panel is expecting. I know people who ignore basic skills like Java, design patterns and algorithms just to mug up big data technologies. Ask them "How?" and they have a whole lot of correct answers, but "Why?" always results in stuttering.
From my experience of the last couple of years (not enough to make me an authority, but enough to give me the courage to express myself), the simple steps for programmers and their near and dear ones are:
1) "Try answering data problem without any Big Data concept/technology. If you fail two times, try to add big data concept on need basis".
This is the mantra I follow to avoid big data overkill while proposing a solution to a data problem. The past experience that supports it: we have a billing system in an electrical distribution company with 1 million consumers. The company switched from conventional analog meters (remember, the one-reading-a-month meters) to smart meters (which send a reading to the system every now and then, along with diagnostic parameters). Now a reading arrives every hour, and the data for each month has grown by a factor of 31 days * 24 hours. We should move to Big Data technologies to find an answer. Right?
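Just to put that growth in perspective, here is a rough back-of-envelope calculation. The consumer count and reading frequency come from the scenario above; the per-reading row size is purely an assumption to get an order of magnitude.

```java
// Back-of-envelope for the smart-meter scenario above. Consumer count and reading
// frequency come from the example; ~100 bytes per reading is an assumed row size.
public class MeterMath {
    public static void main(String[] args) {
        long consumers = 1_000_000L;               // 1 million users
        long readingsPerMonthBefore = 1L;          // analog meter: one reading a month
        long readingsPerMonthAfter = 31L * 24L;    // hourly readings, 31-day month = 744

        long rowsBefore = consumers * readingsPerMonthBefore;  //   1 million rows/month
        long rowsAfter = consumers * readingsPerMonthAfter;    // 744 million rows/month
        long bytesPerReading = 100L;                           // assumption

        System.out.printf("Rows per month: %,d -> %,d%n", rowsBefore, rowsAfter);
        System.out.printf("Raw volume per month: ~%,d GB%n",
                rowsAfter * bytesPerReading / 1_000_000_000L);
    }
}
```

Hundreds of millions of rows and a few tens of GB a month is far more than before, but under that assumed row size it is not obviously beyond a well-partitioned relational setup.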
Now follow the mantra above and try to come up with a non-Big Data solution.

"Ok. We have data coming hourly, that means the same load daily which we had once in a month earlier. We can keep the system online daily for this. Done. Next, where to store so much amount of data and provide same data-read efficiency and random seeks of records as earlier??? store data partitioned on day basis, i.e. create master table for each hour. The index size for a table will remain same without any noticeable performance degradation in data read. Possible?Done. Next, aggregation of such huge amount of data for bill calculation that is done once in a month. Daily aggregate the data into a daily billing table. Aggregate all daily billing table for that month on the last day of month...bill calculated!! Hurray!!! One minute, what about scalability?? User growth rate in utilities is almost equal to population growth rate..that means 2% in developed countries and on average 5% in developing ones. Add a sever to DB cluster once in a year or two to handle data load. Anything else.....Ah! what about data that is more than 2 months old. we don't need it anymore for bill calculation. Simple solution, go with the earlier archiving policy as before. It might take now 2 days to archive 1 month worth of data, But who cares!!!"
All bases covered... the new requirement needs some minor modifications to the existing application on the logic and DB side. It will take a couple of guys a month at most, with no testing headache, no production headache, no quality headache and, most importantly, no new-technology headache that can cause severe bleeding of resources and attacks of uncertainty.
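For illustration, here is a minimal sketch of that "aggregate daily, roll up monthly" idea written as a plain JDBC job. The table and column names, the hourly partition layout and the PostgreSQL connection URL are all invented for the sketch, not a prescription.

```java
// A minimal sketch of the daily/monthly rollup from step 1, as a plain JDBC job.
// Table and column names, the hourly partition layout and the PostgreSQL URL are
// all assumptions made for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DailyBillingRollup {
    public static void main(String[] args) throws Exception {
        LocalDate day = LocalDate.now().minusDays(1);                 // process yesterday
        String dayTag = day.format(DateTimeFormatter.BASIC_ISO_DATE); // e.g. 20140504

        // Union the 24 hourly master tables for that day (readings_20140504_00 ... _23).
        StringBuilder hourly = new StringBuilder();
        for (int hour = 0; hour < 24; hour++) {
            if (hour > 0) hourly.append(" UNION ALL ");
            hourly.append(String.format("SELECT meter_id, units FROM readings_%s_%02d", dayTag, hour));
        }

        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/billing");
             Statement stmt = conn.createStatement()) {
            // 1) Collapse the day's hourly readings into one row per meter.
            stmt.executeUpdate(
                "INSERT INTO daily_billing (meter_id, reading_day, units) " +
                "SELECT meter_id, DATE '" + day + "', SUM(units) " +
                "FROM (" + hourly + ") AS h GROUP BY meter_id");

            // 2) On the last day of the month, roll the daily rows up into the monthly bill.
            if (day.getDayOfMonth() == day.lengthOfMonth()) {
                stmt.executeUpdate(
                    "INSERT INTO monthly_billing (meter_id, bill_month, units) " +
                    "SELECT meter_id, DATE '" + day.withDayOfMonth(1) + "', SUM(units) " +
                    "FROM daily_billing WHERE reading_day BETWEEN DATE '" +
                    day.withDayOfMonth(1) + "' AND DATE '" + day + "' GROUP BY meter_id");
            }
        }
    }
}
```

Scheduled once a day by cron or any job scheduler, a job like this covers both the daily aggregation and the end-of-month rollup with nothing more exotic than a relational database.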

2) Pick your big data tool by need, not by taste or enthusiasm or comfort or fashion or anything else.
Big data technologies are currently like a carpenter's toolbox. The choice of tools defines the effectiveness of the end product as well as the neatness of the job done. That instinct a carpenter has for picking the correct chisel, the correct saw and the correct hammer for the job, based on the requirement, defines his cost and his market demand.
The same applies to a big data scenario. A big data technology doesn't fit perfectly in every case; it might work for all cases, but perfectly only for selected ones. The classic example is our star performer, Hadoop. It is a batch-oriented MapReduce distributed computation engine. Now take a use case from the Indian tax department: every day I have to verify the income-tax returns of 1,000 taxpayers picked on an ad-hoc basis. All the financial, employment and personal details sit on a Hadoop cluster (HDFS). Given the scale of data involved, which runs into petabytes, Hadoop MapReduce seems a perfect fit: in the mapper we pick our randomly selected taxpayers and their financial details, which are aggregated in the reducer to build the ideal income-tax return, which can then be compared with the submitted one.
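That job would look roughly like the sketch below. The record layout, field positions and the hard-coded selection list are invented for illustration; a real job would pass the selected taxpayers in via the job configuration or distributed cache.

```java
// A rough sketch of that MapReduce job, assuming comma-separated records on HDFS with
// an assumed layout of taxPayerId,year,amount,... The hard-coded selection list is for
// illustration only.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReturnVerification {

    public static class FilterMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        // Taxpayers picked for today's ad-hoc verification (illustrative values).
        private final Set<String> selected = new HashSet<>(Arrays.asList("PAN001", "PAN002"));

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Every record in the cluster is read and parsed here, even though only a
            // handful of taxpayers survive the filter.
            String[] fields = value.toString().split(",");
            if (selected.contains(fields[0])) {
                context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
            }
        }
    }

    public static class AggregateReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text taxPayerId, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double total = 0;   // the "ideal" return: sum of everything reported for this payer
            for (DoubleWritable amount : amounts) {
                total += amount.get();
            }
            context.write(taxPayerId, new DoubleWritable(total));
        }
    }
}
```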
But the catch here is that even 5 million taxpayers out of 500 million amount to only around 1% of the total data, and we are reading the complete dataset just to filter out that 1%. Phew! Go with an index on taxpayer name and year instead, or push the data into HBase for faster random reads. Now our Income Tax department might be able to verify returns for 1,000 times more taxpayers each day.
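If the same details are instead keyed by taxpayer (and year) in HBase, the daily verification turns into a handful of point reads rather than a petabyte-wide scan. Here is a minimal sketch against the standard HBase client API; the table name, column family, qualifier and rowkey design are all assumptions.

```java
// A minimal sketch using the standard HBase client API. The table name "tax_returns",
// column family "fin", qualifier "total_income" and the rowkey design
// (taxPayerId + "_" + year) are assumptions for illustration.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AdHocVerification {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("tax_returns"))) {
            // args = the taxpayer ids picked for today's verification.
            List<Get> gets = new ArrayList<>();
            for (String taxPayerId : args) {
                gets.add(new Get(Bytes.toBytes(taxPayerId + "_2014")));
            }
            // Batched point reads: only the selected rows are touched, not the whole dataset.
            Result[] results = table.get(gets);
            for (Result result : results) {
                byte[] income = result.getValue(Bytes.toBytes("fin"), Bytes.toBytes("total_income"));
                if (income != null) {
                    // ... rebuild the "ideal" return from the fetched cells and compare
                    // it with the submitted one ...
                }
            }
        }
    }
}
```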

The same goes for the big data developer or architect, who is like the carpenter. It's good to know how to use the different big data tools, but what makes you stand out is the choice of tools and the way you use them for a big data problem. And believe me, hunches on tool selection never work, at least not here. I have learnt that the hard way.

Adieu!!