The Database Scalability Blog - Doron Levari: June 2012

Thursday, June 28, 2012

ARM based data center. Inspiring.

In a previous post I wrote ARM based servers. Since then, and thanks to all the comments and responses I got, I looked more into this ARM thing and it's absolutely fascinating...

Look at this beauty (taken from the site of Calxeda, the manufacturer):

What is it? A chip? A server? No, it's a cluster of 4 servers...

And this:

is HP Redstone Server, 288 chips, 1,152 cores (Calxeda quad-core SoC) in a 4U server “Dramatically reducing the cost and complexity of cabling and switching”. Calxeda is talking about: “Cut energy and space by 90%”, and “10x the performance at the same power, the same space” and it's just the beginning...

And this is from the last couple of days... From ISC'12 (International Supercomputing Conference): "ARM in Servers – Taming Big Data with Calxeda":

In the case of data intensive computing, re-balancing or ‘right-sizing’ the solution to eliminate bottlenecks can significantly improve overall efficiency
By combining a quad-core ARM® Cortex™-A series processor with topology agnostic integrated fabric interconnect (providing up to 50Gbits of bandwidth at latencies less than 200ns per hop), they can eliminate network bottlenecks and increase scalability

You still can't go to the store and buy a 4U ARM-based database server that performs 10x and uses 1/10 of the power (combine them, it order of magnitude of 100x...). It's not now, maybe not tomorrow, but it's not sci-fi. And technologies will have to adapt to this world of "multiple machines, shared nothing, commodity hardware". I think databases will be the hardest tech to adapt, the only way is to distribute the data wisely and then distribute the processing, sometimes parallelize processing and access to harness those thousands of cores.

Wednesday, June 20, 2012

The catch-22 of read/write splitting

In my previous post I covered the shard-disk paradigm's pros and cons, but the conclusion that is that it cannot really qualify as a scale-out solution, when it comes to massive OLTP, big-data, big-sessions-count and mixture of reads and writes.

Read/Write splitting is achieved when numerous replicated database servers are used for reads. This way the system can scale to cope with increase in concurrent load. This solution qualifies as a scale-out solution as it allow expansion beyond the boundaries of one DB, DB machines are shared-nothing, can be added as a slave to the replication "group" when required.

And, as a fact, read/write splitting is very popular and widely used by lots of high-traffic applications such as popular web sites, blogs, mobile apps, online games and social applications.

However, today's extreme challenges of big-data, increased load and advance requirements expose vulnerabilities and flaws in this solution. Let's summarize them here:

All writes go to the master node = bottleneck: While reading sessions are distributed across several database servers (replication slaves), writing sessions are all going to the same primary/master server, hence still a bottleneck, all of them will consume all resources from the DB for our well-known "buffer management, locking, thread locks/semaphores, and recovery tasks"
Scaled sessions' load, not big data: While I can take my, X reading sessions and spread them over my 5 replication slaves giving each to handle with only X/5 sessions, however my giant DB will have to be replicated as a whole to all servers. Prepare lots of disks...
Scale? Yes. Query performance? No: Queries on each read-replica need to cope with the entire data of the database. No parallelism, to smaller data sets to handle
Replication lag: Async replication will always introduce lag. Be prepared for a lag between the reads and the writes.
Reads after write will show missing data. The transaction is not yet committed so it's not written to the log, not propagated to salve machine, not applied at the slave DB.

Above all, databases suffer from writes made by many concurrent sessions. Database engine themselves become bottleneck because of their *buffer management, locking, thread locks/semaphores, and recovery tasks*. Reads are a secondary target. BTW - reads performance and scale can be very well gained by good smart caching, use of a NoSQL such as Memcached in the app, in front of the RDBMS. In modern applications we see more and more avoided reads and writes, that cannot be avoided or cached, storming the DB.

R/W splitting is usually implemented today inside the application code, the it's easy to start, then becomes hard... I recommend using a specialized COTS product that does it 100 times better and may eliminate some or all limitations above (ScaleBase is one solution that gives that (among other things)).

This is read/write splitting's catch 22. It's an OK scale-out solution and relatively easy to implement, but improvement of caching systems, changing requirements in the online applications and big-data and big-concurrency - rapidly driving it towards its fate, become less and less relevant, and only play a partial role in a complete scale-out plan.

In a complete scale-out solution, where data is distributed (not replicated) throughout a grid of shared-nothing databases, read/write splitting will play its part, but only a minor one. Will get to that in next posts.

Thursday, June 7, 2012

Why shared-storage DB clusters don't scale

Yesterday I was asked by a customer for the reason why he had failed to achieve scale with a state-of-the-art "shared-storage" cluster. "It's a scale-out to 4 servers, but with a shared disk. And I got, after tons of work and efforts, 130% throughput, not even close to the expected 400%" he said.

Well, scale-out cannot be achieved with a shared storage and the word "shared" is the key. Scale-out is done with absolutely nothing shared or a "shared-nothing" architecture. This what makes it linear and unlimited. Any shared resource, creates a tremendous burden on each and every database server in the cluster.

In a previous post, I identified database engine activities such as buffer management, locking, thread locks/semaphores, and recovery tasks - as the main bottleneck in the OLTP database, handling reads and writes mixture. No matter which database engine, they all have all of the above, Oracle, MySQL, all of them. "The database engine itself becomes the bottleneck!" I wrote.

With a shared disk - there is a single shared copy of my big data on the shared disk, the database engine still have to maintain "buffer management, locking, thread locks/semaphores, and recovery tasks". But now - with a twist! Now all of the above need to be done "globally" between all participating servers, thru network adapters and cables, introducing latency. Every database server in the cluster needs to update all other nodes for every "buffer management, locking, thread locks/semaphores, and recovery tasks" it is doing on a block of data. See here number of "conversation paths" between the 4 nodes:

Blue dashed lines are data access, red lines are communications between nodes. Every node must notified and be notified to and by all other nodes. It's a complete graph, with 4 nodes and there are 6 edges. With 10 you'll find 45 red lines here (n(n - 1)/2, a reminder from Computer Science courses...). Imagine the noise, the latency, for every "buffer management, locking, thread locks/semaphores, and recovery tasks". Shared-storage becomes shared-everything. Node A makes an update to block X - 10 machines need to acknowledge. I wanted to scale, but instead I multiplied the initial problem.

And I didn't even mention the fact that the shared disk might become a SPOF and a bottleneck.
And I didn't even mention the limitations when you wanna go with this to cloud or virtualization.
And I didn't mention the tons of money this toy costs. I prefer buying 2 of these, one for me and one for a good friend, and hit the road together...

Shared-disk solutions gives a very limited solution to OLTP scale. If the data is not distributed, computing resources required to handle this data will not be distributed and thus reduced, on the contrary, they will be multiplied and consume all available resources from all machines.

Real scale-out is achieved by distributing the data in shared nothing, every database node is independent with its data, no duplications, no notifications, ownership or acknowledges over any network. If data is distributed correctly, concurrent sessions will be also distribute across servers, each node runs extremely fast on its small-data, small load, with its own small "buffer management, locking, thread locks/semaphores, and recovery tasks".

My customer responded "Makes perfect sense! Tell me more about Scale-Out and distribution...".