Thursday, June 7, 2012

Why shared-storage DB clusters don't scale

Yesterday a customer asked me why he had failed to achieve scale with a state-of-the-art "shared-storage" cluster. "It's a scale-out to 4 servers, but with a shared disk. After tons of work and effort I got 130% throughput, not even close to the expected 400%," he said.

Well, scale-out cannot be achieved with shared storage, and the word "shared" is the key. Scale-out is done with absolutely nothing shared, a "shared-nothing" architecture. This is what makes it linear and unlimited. Any shared resource creates a tremendous burden on each and every database server in the cluster.

In a previous post, I identified database engine activities such as buffer management, locking, thread locks/semaphores, and recovery tasks as the main bottleneck in an OLTP database handling a mixture of reads and writes. No matter which database engine - Oracle, MySQL, any of them - they all have all of the above. "The database engine itself becomes the bottleneck!" I wrote.

With a shared disk there is a single shared copy of my big data, and the database engine still has to maintain "buffer management, locking, thread locks/semaphores, and recovery tasks". But now - with a twist! All of the above must now be done "globally" between all participating servers, through network adapters and cables, introducing latency. Every database server in the cluster needs to update all the other nodes about every "buffer management, locking, thread locks/semaphores, and recovery tasks" operation it performs on a block of data. See here the number of "conversation paths" between the 4 nodes:
Blue dashed lines are data access, red lines are communications between nodes. Every node must notify and be notified by all other nodes. It's a complete graph: with 4 nodes there are 6 edges; with 10 nodes you'd find 45 red lines here (n(n - 1)/2, a reminder from Computer Science courses...). Imagine the noise, the latency, for every "buffer management, locking, thread locks/semaphores, and recovery tasks". Shared-storage becomes shared-everything. Node A makes an update to block X - the other 9 machines need to acknowledge. I wanted to scale, but instead I multiplied the initial problem.
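A quick sanity check of that formula. This is just an illustrative sketch (the helper function name is mine, not from any real clustering software) showing how fast the number of coordination paths grows:

```python
def coherence_paths(n: int) -> int:
    """Edges in a complete graph of n nodes: every node talks to every other node."""
    return n * (n - 1) // 2

# Conversation paths a coherence message can travel as the cluster grows:
for n in (2, 4, 10, 20):
    print(f"{n:2d} nodes -> {coherence_paths(n):3d} paths")
# 4 nodes give 6 paths, 10 nodes give 45, 20 nodes give 190.
```

The count grows quadratically, which is exactly why adding nodes multiplies the coordination work instead of dividing it.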

And I didn't even mention the fact that the shared disk might become a SPOF and a bottleneck.
And I didn't even mention the limitations when you want to take this to the cloud or to virtualization.
And I didn't mention the tons of money this toy costs. I'd rather buy 2 of these, one for me and one for a good friend, and hit the road together...

Shared-disk solutions give a very limited answer to OLTP scale. If the data is not distributed, the computing resources required to handle that data will not be distributed and thus reduced; on the contrary, they will be multiplied and will consume all available resources on all machines.

Real scale-out is achieved by distributing the data in a shared-nothing architecture: every database node is independent with its own data - no duplication, no notifications, no ownership messages or acknowledgments over any network. If the data is distributed correctly, concurrent sessions will also be distributed across servers, and each node runs extremely fast on its small data and small load, with its own small "buffer management, locking, thread locks/semaphores, and recovery tasks".
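A minimal sketch of the idea, assuming a hypothetical hash-based distribution policy (the node names and the routing function below are made up for illustration; real policies may use ranges, lookup tables, etc.):

```python
import hashlib

NODES = ["db_node_1", "db_node_2", "db_node_3", "db_node_4"]  # hypothetical shard names

def owning_node(key: str) -> str:
    """Map a key to exactly one node; no other node sees or acknowledges the write."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every session goes straight to the single node that owns its data:
# no cross-node locking, no global buffer invalidation, no red lines.
print(owning_node("customer:1042"))
```

The specific hash doesn't matter; the point is that ownership is exclusive, so each engine's buffer management and locking stay entirely local to one machine.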

My customer responded "Makes perfect sense! Tell me more about Scale-Out and distribution...". 


  1. Great post, Doron.
    One question though - you say "every database node is independent with its data, no duplications..." But what happens when I want redundancy, when I want to replicate my data for better robustness, and for performance if a single node becomes too popular. How would you accomplish that?

  2. Thank you Yaniv. You're making 2 very good points:

Redundancy: every database in the array of a shared-nothing scaled-out database can be replicated, or can share storage (*active/passive* only!) with a standby database. With the first option, replication, those standby databases can be used for "read/write splitting" to offload queries from the masters. Still, it's not perfect, as writes (the main problem...) still clog the masters...

    "A single node becomes too popular": you are right, it might happen and the "distribution policy" should be dynamic enough to let you split a busy node into 2 nodes, or rebalance data inside the array. Such movement of data can be done in runtime, while applications are running, so you can achieve real database elasticity.

I intend to cover these 2 good topics in detail in further posts, as there are lots of considerations and implications.

  3. Doron, you'll find that your analysis oversimplifies the modern shared-disk architecture. Maybe it needs a new term, but the "shared-disk" is increasingly comprised of what Oracle Exadata calls smart storage nodes. Utilizing modern networking (not point-to-point) and distributed smart storage, you are able to (a) offload the I/O bottleneck, and (b) engage in distributed processing, much like Exadata or Map-Reduce. In addition, there are techniques to minimize database conflict (locking, etc.). In effect, you are correct that you want database nodes to work independently and avoid conflicts, but that can be done with a shared-disk DBMS if designed properly.

    1. Hi Mike, thanks for your interest and comment!

      I agree with you with regards to modern techniques such as the Oracle Exadata Storage Server and the like, which definitely play a role in massive reads: pushing WHERE predicates down to the storage, "map-reduce"-style parallelized scans, and more.

      However, I pointed my discussion towards the writes, the INSERTs and UPDATEs. When writes are involved, the database needs to manage buffers and row-level locking. If a row is updated on node A, it must push ownership notifications to all other nodes, saying "I own this row" - no escape from that, since this same row might be updated on any other node at that very same moment...

      And to make it worse, since row-level ownership is too fine-grained, it's done at the block level. Yes, row-level locking holds for a single-instance Oracle; with RAC/Exadata, ownership is block-level... Not too fancy...

      Again, I agree modern solutions exist for read-only/read-mostly workloads. Add some writes - everything crumbles like a house of cards. And adding nodes to the cluster will just make it worse...

