Monday, August 6, 2012

Twitter and the new big data lifecycle

Recently I came across this fine article in The New York Times: "Twitter Is Working on a Way to Retrieve Your Old Tweets". Dick Costolo, Twitter’s chief executive, said:
"It’s a different way of architecting search, going through all tweets of all time. You can’t just put three engineers on it."
Mr. Costolo is right, and he turned the spotlight on a very important change we're experiencing today, in these interesting times. The word is expectations, and those are changing fast!

Not so long ago, Big Data was a synonym for Analytics, Data Warehouse, Business Intelligence. Traditionally, operational (OLTP) apps held limited amounts of data, only the "current" data relevant for the ongoing operations. A cashier in a supermarket would hold only recent transactions, to enable lookup of a charge that was made 10 minutes ago, if I need to return an item or dispute the charge while still at the cashier. When I come back to the store the day after, I won't go to the cashier; I should go to "customer service", where a different application with a different database will handle my returns or disputes. A dispute after several months will not be handled by customer service in the store, but by "the chain's dispute department", using a different, 3rd app with a 3rd, cumulative, aggregated DB. And on and on it goes.

In this simplified example, the organization invested many resources in 3 different DBs and apps, each aggregating a different level of data, to enable similar functionality with only marginal additions. Why? Data volume and concurrency.

At the cashiers, the only place where new data is really generated, there is also the highest concurrency. Looking globally, many thousands of items are "beeped" and sold through the cashiers every minute, so data is kept small: it is generated and then extracted shortly after. The customer service reps handle tens of customers a minute over larger data, and "the chain's dispute department" oversees the biggest data, but handles 1 or 2 cases an hour, and might also execute more "analytic-style" queries to determine the nature of a dispute...

This was, in a nutshell, the "lifecycle of the data" in the old world. But today, everything changes - it's all online, right here, right now!

An enormous amount of (big) data is generated, and also searched and analyzed, at the same time. Everything is online, here and now. Every tweet (millions a day) is reported instantly to hundreds of followers, participates in saved searches, and is analyzed by numerous robots and engines throughout the web, and also by Twitter itself. The same goes for every search or e-mail I send in Google, and for every status or "like" in Facebook that is reported to my hundreds of friends and also analyzed at the same time, here and now. Hey, it's their way to make money, to push the right ads at the right time.

And now we learn that users expect to see online data that is "old" in the terminology of the old days. I want to see statuses, likes and tweets from 2 and 4 months ago, in the same interface and with the same experience I'm used to. Don't send me to the "customer service department"!

The bottom line: it requires scale. Scale your online database to handle online data volumes and throughput, as well as older data, on the same grid, without interference, with the same applications. This is what scale-out is all about. Think outside the (one database server) box. If you have 10 databases for the current data, you can have 10 more with older data, and 100 more with even-older data, and so on. Giving a transparent, unified view to (or virtualizing) this database grid is the solution that occupies most of my time, and it's the missing link to making database scale-out a commodity.
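
To make the idea of one view over tiers of databases a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any real product: the tier names, the age cut-offs (30 and 365 days) and the `route` function are all hypothetical, and only the 10 / 10 / 100 database counts come from the example above. The application asks one function where a record lives and never needs to know which tier or database is behind the answer.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str                     # hypothetical tier label
    max_age_days: Optional[int]   # None means "no upper bound"
    shard_count: int              # number of databases in this tier


# Database counts mirror the 10 / 10 / 100 example in the post.
# The age cut-offs are assumptions chosen only for illustration.
TIERS = [
    Tier("current", 30, 10),
    Tier("older", 365, 10),
    Tier("even_older", None, 100),
]


def pick_tier(created: date, today: date) -> Tier:
    """Return the first tier whose age limit covers the record."""
    age_days = (today - created).days
    for tier in TIERS:
        if tier.max_age_days is None or age_days <= tier.max_age_days:
            return tier
    raise ValueError("no tier configured for this record age")


def route(user_id: str, created: date, today: date) -> str:
    """Map a record to one database name, e.g. 'older-007'."""
    tier = pick_tier(created, today)
    # A stable hash keeps a given user on the same shard within a tier.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % tier.shard_count
    return f"{tier.name}-{shard:03d}"
```

A tweet from last week and a tweet from four months ago would be looked up through the same `route` call and simply land on different databases. A real grid would also have to move data between tiers as it ages and fan queries out across shards, which this sketch leaves out.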


