The Database Scalability Blog - Doron Levari

I've lived around databases all my life, and the 21st century is challenging for them: big data, throughput, complexity, virtualization, global distribution - it's all scalability. I'm the founder and CTO of ScaleBase, and solving this problem is a workaholic's heaven, so I'm having a great time!
My agenda is to stay technical, with no marketing and sales BS, and to give my summarized views and opinions on urgent topics, events and the latest news in database scalability.

Friday, September 12, 2014

Differences between NoSQL databases

Just sharing an answer I gave today to a question on Quora: http://www.quora.com/Whats-the-difference-between-the-different-NoSQL-databases

I think the question is relevant and the other answers were very good, so this was my humble addition to the thread:

I think the answers above are very good. In my POV, the right NoSQL database for you is the one that best fits your requirements in:

Data representation: as said above, key-value, document, graph, etc.
Data usage pattern: OLTP (high concurrency and throughput, many queries and updates) vs. analytics (low concurrency, few big queries, no updates)
Data availability and consistency: this is the main topic I wish to add

While all relational databases provide the virtues of ACID (Atomicity, Consistency, Isolation, Durability) for transactions and data, few NoSQLs provide full ACID. Most instead offer interesting tradeoffs around the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem). Since you can't have all 3 of consistency, availability and partition tolerance, different databases give different combinations. For example, of 2 NoSQLs from Apache, HBase provides CP and Cassandra provides AP (http://wiki.apache.org/cassandra/ArchitectureOverview).

Hope that helped.

Tuesday, May 20, 2014

Kudos to RDS's SLA, proving the point of the public cloud

If you go and spin up a new RDS server, you'll see this new page added before the wizard:

My perception over the last few months is that AWS has improved RDS availability, multi-AZ in particular, and is pushing it more aggressively.

An availability factor of "three and a half nines" (99.95%, roughly 4.4hr/year of downtime) is very, very good. It usually has a very high price tag attached to it (hardware, software & labor) and is usually a dream for small-to-medium IT organizations.
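The downtime budget behind an availability figure is simple arithmetic. Here is a minimal sketch, assuming a non-leap year of 8,760 hours and taking "three and a half nines" to mean 99.95%; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare two, three, three and a half, and four nines.
    for pct in (99.0, 99.9, 99.95, 99.99):
        print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the price tag usually climbs so steeply.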

By offering it at a low utility price, 25%-33% higher than the corresponding EC2 machine, RDS is a real bargain for everyone, making it harder to stay out of the public cloud.

Saturday, May 3, 2014

Eventual consistency of NoSQL marketing

Yesterday I learnt an important lesson about an important difference between NoSQL and MySQL, at least when it comes to the marketing and hype.

I saw a tweet from the marketing side of one of the NoSQL leaders:

Most people would apparently draw their conclusions from the tweet's text alone. I, however, actually clicked the link, and couldn't believe my eyes:

I guess that in NoSQL, when it comes to the integrity of data as well as hype - it is eventually consistent...

Thursday, May 1, 2014

Explaining the case for MySQL

My faithful readers, please spare 10 mins of your time, and read Baron's excellent post: https://vividcortex.com/blog/2014/04/30/why-mysql

Nuff said.

Since I can't really shut up, and only if you'd like my (humble) take on this, I'd say in short:

Every technology/platform/framework I choose will end up surprising me, limiting me in things that should be easy to do, and throwing many painful challenges at me if and when I need to do things that are closer to the platform's "edges". This is true for everything, including Rails, JEE, Hibernate, MongoDB and MySQL.

I've learned that the more mature, generically-capable, transparent and ecosystem-rich a solution is, the fewer painful surprises I get at the worst times, and the more successful I am in my job.

Wednesday, April 9, 2014

Porting from Oracle to MySQL

A potential customer asked me about porting her application from Oracle Database to MySQL.

I always try to start with the "why" (a dear friend bought me this book, recommended: http://www.amazon.com/Start-Why-Leaders-Inspire-Everyone/dp/1591846447).

She said "cloud!". I said "OK!".

I did some quick research, found many things scattered all over the place, gathered them into a nice email I sent back to her, and then thought I'd post it here and make it public, as it might be useful for us all. If you feel that I missed something, add comments or send feedback.

These are the leading tools to do the actual migration of the data structure, data export/import, sprocs, triggers, etc.:

MySQL Workbench has a migration feature: http://www.mysql.com/products/workbench/migrate/
SQLyog can be used to migrate: http://tkurek.blogspot.com/2013/04/migrate-oracle-to-mysql.html (already in the conversation in the second comment there)
Navicat can be used to migrate: http://www.navicat.com/products/navicat-for-mysql
Tungsten supports Oracle-to-MySQL replication: http://www.continuent.com/downloads/software
Focused data migrators:

http://www.ispirer.com/products/oracle-to-mysql-migration
https://www.youtube.com/watch?v=IW3vKHWJljY
http://www.slideshare.net/Tess98/oracle-to-mysql-migration-presentation
http://www.dbload.com/
http://dbconvert.com/convert-oracle-to-mysql-pro.php
http://www.spectralcore.com/omegasync/

The way I see it, migrating the data is 15% of a database porting project. The real effort is in (partial list):

Porting drivers and driver behavior in the app code
Porting SQL commands all around the app code

Converting non-standard SQL flavors
Working around restrictions and unsupported commands

Ecosystem, monitoring, tuning, tools, scripts, hardware best practices, ops skills, dev skills

All of this comes way before the migration of the data on D-day.

There are a lot of services and some tools. Services-wise, I see these around:

Pythian: http://www.percona.com/live/mysql-conference-2012/sessions/oracle-mysql-migration
Baron (Percona): http://www.xaprb.com/blog/2009/03/13/50-things-to-know-before-migrating-oracle-to-mysql/

I bet the big SIs (Accenture et al) are strong in this game, as those would be the default go-to service provider for the Oracle shops.

Thursday, March 13, 2014

How is Elasticity dictated by the Data Model?

I talk a lot about "Elasticity" and "Data Model", and a prospect asked me today "what makes you think they are related?".

Not only are they related, the relation between them holds a big part of the substance of ScaleBase, the technology I've been working on for the last 5 years...

Elasticity is the ability to grow or shrink in accordance with demand.
The cloud makes it very easy to spin up more machines on demand, kill them a day later, and pay by the hour, only for real usage. This alone offers fantastic elasticity. Remember that AWS's EC2 stands for "Elastic Compute Cloud".

Volatile/transient/stateless servers, AKA application servers and web servers, are easier to make elastic. Just spinning up another server from the same image behind a round-robin load balancer would solve 80% of the problem.

Data is harder to "elastify".

Data can be replicated across multiple identical servers behind the same round-robin load balancer, but data-replication multiplies data size (bad ROI) and cannot scale writes and updates to the data.
The only way to scale data is to have it distributed across multiple non-identical servers.

New challenges:

1. How would all data consumers (apps, tools) know where the data they are looking for resides?
2. If every access needs data from several (or all) of the servers, load will end up multiplied rather than distributed, which means no scalability.
3. OK, not all or even most, but a minority of accesses do need data from several (or all) of the servers. How can this data be found quickly on all of them and aggregated?

Challenge 1 is the simplest: just have an index expressing "I want to distribute my data by profile_id" and "put profiles 1-1000 on db1 and 1001-1500 on db2", and then force all data consumers to check this index before every data access.
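A minimal sketch of such an index, using the ranges from the example above. The names `SHARD_RANGES` and `shard_for` are hypothetical, made up for illustration, and not part of any product.

```python
from bisect import bisect_left

# Hypothetical distribution index keyed by profile_id.
# Each entry is (highest profile_id stored on the shard, shard name),
# sorted by upper bound: profiles 1-1000 on db1, 1001-1500 on db2.
SHARD_RANGES = [(1000, "db1"), (1500, "db2")]
UPPER_BOUNDS = [upper for upper, _ in SHARD_RANGES]


def shard_for(profile_id: int) -> str:
    """Return the name of the shard holding the given profile_id."""
    if profile_id < 1:
        raise KeyError(f"invalid profile_id: {profile_id}")
    # Find the first range whose upper bound is >= profile_id.
    i = bisect_left(UPPER_BOUNDS, profile_id)
    if i == len(SHARD_RANGES):
        raise KeyError(f"no shard configured for profile_id {profile_id}")
    return SHARD_RANGES[i][1]


if __name__ == "__main__":
    for pid in (1, 1000, 1001, 1500):
        print(pid, "->", shard_for(pid))
```

Every data consumer would call something like `shard_for` before touching the data, so the index itself has to be small, fast and always available.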

Challenges 2 and 3 are where the data model kicks in. For NoSQLs, the data model is a document, complete and self-contained, so challenges 2 and 3 do not exist.
For SQL databases, the relational data model takes challenges 2 and 3 to the extreme.

A carefully crafted data distribution policy and the ability to do real-time data aggregation are crucial for successfully scaling a relational database.

In our profile distribution example, identifying that "a profile" is actually a chunk of related data from 100 tables in a complex, multi-level, deep hierarchy is a hard task.
ScaleBase Analysis Genie simplifies the authoring of a data distribution policy that makes sure that related data is stored together on the same server, solving challenge 2.

ScaleBase Controller employs multi-threaded, massively parallel execution and advanced result aggregation, supporting all SQL aspects including GROUP BY, ORDER BY, HAVING, UNION, JOIN and SUBSELECT, to solve challenge 3.
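To make the idea of parallel execution plus result aggregation concrete, here is a toy scatter-gather sketch. It is not ScaleBase's implementation: the in-memory `SHARDS` data, the function names and the query are all assumptions made up for illustration. It mimics running `SELECT country, COUNT(*) ... GROUP BY country ORDER BY COUNT(*) DESC` across two shards.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards"; each row is (profile_id, country).
SHARDS = {
    "db1": [(1, "US"), (2, "IL"), (3, "US")],
    "db2": [(1001, "IL"), (1002, "UK")],
}


def count_by_country(rows):
    """Partial result for one shard: row count per country (the GROUP BY step)."""
    return Counter(country for _, country in rows)


def scatter_gather():
    """Run the partial query on every shard in parallel, then merge the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        # Scatter: one task per shard. Gather: sum the partial counts.
        for partial in pool.map(count_by_country, SHARDS.values()):
            total.update(partial)
    # The final ORDER BY must run after merging: count descending, then name.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(scatter_gather())
```

COUNT merges by simple addition. Other aggregates need more care: an average, for example, has to be rebuilt from per-shard sums and counts rather than by averaging the per-shard averages.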

See here for more info: http://www.scalebase.com/products/product-architecture