"It’s a different way of architecting search, going through all tweets of all time. You can’t just put three engineers on it."
Mr. Costolo is right, and he put the spotlight on a very important change we're experiencing today, in these interesting times. The word is expectations, and those are changing fast!
Not so long ago, Big Data was a synonym for Analytics, Data Warehouse and Business Intelligence. Traditionally, operational (OLTP) apps held limited amounts of data, only the "current" data relevant for ongoing operations. A cashier in a supermarket would hold only recent transactions, to enable lookup of a charge made 10 minutes ago if I need to return an item or dispute the charge while still at the cashier. When I come back to the store the day after, I won't go to the cashier; I should go to "customer service", where a different application with a different database will handle my returns or disputes. A dispute after several months will not be handled by customer service in the store, but by "the chain's dispute department", using a 3rd app with a 3rd, cumulative, aggregated DB. And on and on it goes.
In this simplified example, the organization invested many resources in 3 different DBs and apps that aggregate different levels of data and provide similar functionality with only marginal additions. Why? Data volume and concurrency.
At the cashiers, the only place where new data is really generated, there is also the highest concurrency. Across the whole chain, many thousands of items are "beeped" and sold through the cashiers every minute - data is kept small, generated and extracted out shortly after that. The customer service reps handle tens of customers a minute over larger data, and "the chain's dispute department" oversees the biggest data but handles 1 or 2 cases an hour, and might also execute more "analytic-style" queries to determine the nature of a dispute...
This was, in a nutshell, the "lifecycle of the data" in the old world. But today, everything changes - it's all online, right here, right now!
An enormous amount of (big) data is generated, and also searched and analyzed, at the same time. Everything is online, here and now. Every tweet (millions a day) is reported instantly to hundreds of followers, participates in saved searches, and is analyzed by numerous robots and engines throughout the web, and also by Twitter itself. The same goes for every search I run or e-mail I send in Google, and for every status or "like" in Facebook that is reported to my hundreds of friends and also analyzed at the same time, here and now. Hey, it's their way to make money, to push the right ads at the right time.
And now we learn that users expect to see online data that is "old" in the terminology of the old days. I want to see statuses, likes and tweets from 2 and 4 months ago, in the same interface and with the same experience I'm used to. Don't send me to the "customer service department"!
The bottom line: it requires scale. Scale your online database to handle online data volumes and throughput, as well as older data, on the same grid, without interference, with the same applications. This is what scale-out is all about. Think outside the (one database server) box. If you have 10 databases for the current data, you can have 10 more with older data, and 100 more with even-older data, and so on. Giving a transparent, unified view of (or virtualizing) this database grid is the solution that occupies most of my time, and it's the missing link to making database scale-out a commodity.
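To make the idea of a unified view over such a grid a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular product does it: the tier names, the 30-day and 365-day cut-offs, the hash-by-user placement and the function names `route` and `databases_for_range` are all assumptions. Only the 10 / 10 / 100 database counts come from the example above.

```python
from datetime import date, timedelta
import zlib

# Hypothetical tiers: (name, maximum record age in days, number of databases).
# The 10 / 10 / 100 counts follow the text; the age cut-offs are assumptions.
TIERS = [
    ("current", 30, 10),
    ("older", 365, 10),
    ("archive", None, 100),  # None = no upper age limit
]


def route(user_id: str, record_date: date, today: date) -> str:
    """Return the name of the database that holds this user's record."""
    age_days = (today - record_date).days
    for name, max_age, db_count in TIERS:
        if max_age is None or age_days <= max_age:
            # Stable hash, so the same user always lands on the same
            # database within a tier.
            index = zlib.crc32(user_id.encode("utf-8")) % db_count
            return f"{name}-{index}"
    raise ValueError("no tier matched")  # unreachable while the last tier is open-ended


def databases_for_range(user_id: str, start: date, end: date, today: date) -> list:
    """List every database a unified view must query for a date range."""
    targets = []
    day = start
    while day <= end:
        db = route(user_id, day, today)
        if db not in targets:
            targets.append(db)
        day += timedelta(days=1)
    return targets
```

With these assumed cut-offs, asking for one user's last 4 months of statuses touches just two databases, one in the "older" tier and one in the "current" tier, and the application never needs to know which. It makes a single call, and the routing layer decides where the data lives.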