Thursday, March 13, 2014

How Is Elasticity Dictated by the Data Model?

I talk a lot about "Elasticity" and "Data Model", and a prospect asked me today, "What makes you think they are related?"

Not only are they related, the relation between them holds a big part of the substance of ScaleBase, the technology I've been working on for the last 5 years...

Elasticity is the ability to grow or shrink in accordance with demand.
The cloud makes it very easy to spin up more machines on demand, kill them a day later, and pay by the hour, only for real usage. This alone offers fantastic elasticity. Remember that AWS's EC2 stands for "Elastic Compute Cloud".

Volatile/transient/stateless servers, such as application servers and web servers, are easier to make elastic. Just spinning up another same-image server behind a round-robin load balancer solves 80% of the problem.

Data is harder to "elastify".

  1. Data can be replicated across multiple identical servers behind the same round-robin load balancer, but replication multiplies the data size (bad ROI) and cannot scale writes and updates, because every write still has to be applied on every replica.
  2. The only way to scale data is to have it distributed across multiple non-identical servers. 

New challenges:

  1. How would all data consumers (apps, tools) know where the data they look for resides? 
  2. If every access (or most accesses) needs data from several (or all) of the servers, load ends up multiplied rather than distributed, which means no scalability.
  3. OK, not all or most, but a minority of accesses still need data from several (or all) of the servers. How can this data be found quickly on all of them and aggregated?

Challenge 1 is the simplest: just have an index expressing "I want to distribute my data by profile_id" and "put profiles 1-1000 on db1 and 1001-1500 on db2", and then force all data consumers to check this index before every data access.
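To make the index idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `RANGES` table, `UPPER_BOUND` and `shard_for` are hypothetical names, and the boundaries simply mirror the example above. It is not how ScaleBase stores its policy.

```python
from bisect import bisect_right

# Hypothetical range index: (first profile_id on the shard, shard name).
# The boundaries mirror the example in the text; the names are illustrative.
RANGES = [(1, "db1"), (1001, "db2")]
UPPER_BOUND = 1500  # last profile_id covered by the index

def shard_for(profile_id: int) -> str:
    """Return the name of the shard that holds the given profile_id."""
    if not 1 <= profile_id <= UPPER_BOUND:
        raise KeyError(f"profile_id {profile_id} is not covered by the index")
    starts = [start for start, _ in RANGES]
    # bisect_right finds the first range that starts after profile_id;
    # the range just before it is the one containing the id.
    return RANGES[bisect_right(starts, profile_id) - 1][1]
```

Every data consumer would call something like `shard_for(profile_id)` before opening a connection, so the routing decision lives in one place.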

Challenges 2 and 3 are where the data model kicks in. For NoSQL databases, the data model is a document, complete and self-contained, so challenges 2 and 3 do not exist.
For SQL databases, the relational data model takes challenges 2 and 3 to the extreme.
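A bit of back-of-the-envelope arithmetic shows why challenge 2 matters so much. The function below is a simplified model I am assuming for illustration (uniform load, every query costs the same on each server it touches); `per_server_load` and its parameters are hypothetical names.

```python
def per_server_load(total_queries: int, servers: int, fanout: int) -> float:
    """Average number of query executions each server handles.

    fanout is the number of servers a single query has to touch.
    """
    if not 1 <= fanout <= servers:
        raise ValueError("fanout must be between 1 and the number of servers")
    return total_queries * fanout / servers

# Each query touches one server: four servers split the work four ways.
single = per_server_load(1000, servers=4, fanout=1)

# Each query touches all four servers: every server still sees every query.
broadcast = per_server_load(1000, servers=4, fanout=4)
```

With a fan-out of 1 each server handles a quarter of the queries; with a fan-out equal to the number of servers, adding servers does not reduce the per-server load at all.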

A carefully crafted data distribution policy and the ability to do real-time data aggregation are crucial for successfully scaling a relational database.

In our profiles distribution example, identifying that "a profile" is actually a chunk of related data from 100 tables in a complex, multi-level, deep hierarchy is a hard task.
ScaleBase Analysis Genie simplifies the authoring of a data distribution policy that makes sure that related data is stored together on the same server, solving challenge 2.
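The core idea of keeping related data together can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the Analysis Genie's output: the table names, `NUM_SHARDS`, `shard_of` and `place` are hypothetical, the sketch uses a simple modulo rule instead of ranges, and every child row is assumed to already carry the `profile_id` of the profile it belongs to (in a real deep hierarchy that key often has to be derived through the parent tables).

```python
from collections import defaultdict

NUM_SHARDS = 2  # assumption for the sketch

def shard_of(profile_id: int) -> int:
    """Route by the root key only, so a profile's whole tree shares a shard."""
    return profile_id % NUM_SHARDS

# (table name, row) pairs; every row carries the owning profile_id.
rows = [
    ("profiles", {"profile_id": 7, "name": "Ann"}),
    ("orders", {"order_id": 100, "profile_id": 7}),
    ("order_items", {"item_id": 1, "order_id": 100, "profile_id": 7}),
    ("profiles", {"profile_id": 8, "name": "Bob"}),
    ("orders", {"order_id": 101, "profile_id": 8}),
]

def place(all_rows):
    """Group rows by shard, using only the profile_id each row carries."""
    shards = defaultdict(list)
    for table, row in all_rows:
        shards[shard_of(row["profile_id"])].append((table, row))
    return dict(shards)

placement = place(rows)
```

Because routing never looks at `order_id` or `item_id`, a join between a profile, its orders and its order items can run entirely on one server.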

ScaleBase Controller employs multi-threaded, massively parallel execution and advanced result aggregation, supporting all SQL aspects, including GROUP BY, ORDER BY, HAVING, UNION, JOIN and SUBSELECT, to solve challenge 3.
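For readers who want a feel for what cross-server aggregation involves, here is a minimal scatter-gather sketch in Python. It is not the ScaleBase Controller's implementation; the in-memory `SHARDS` data, `count_by_country` and `scatter_gather` are hypothetical stand-ins for real databases and real SQL. It runs the equivalent of `SELECT country, COUNT(*) ... GROUP BY country ORDER BY COUNT(*) DESC` on every shard in parallel and merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two database servers.
SHARDS = {
    "db1": [{"profile_id": 1, "country": "US"},
            {"profile_id": 2, "country": "IL"}],
    "db2": [{"profile_id": 1001, "country": "US"},
            {"profile_id": 1002, "country": "US"}],
}

def count_by_country(shard_name: str) -> Counter:
    """Per-shard step: the local GROUP BY country with COUNT(*)."""
    return Counter(row["country"] for row in SHARDS[shard_name])

def scatter_gather():
    # Scatter: run the per-shard query on all shards in parallel threads.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(count_by_country, SHARDS))
    # Gather: COUNT(*) partials merge by addition.
    total = Counter()
    for partial in partials:
        total.update(partial)
    # Final ORDER BY count DESC, then country for a stable order.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))

result = scatter_gather()
```

Counts merge by simple addition, which is why this example is easy. Aggregates such as AVG, or clauses such as HAVING, need more care because they can only be evaluated after the partial results have been combined.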