Data Persistence in the Age of Microservices

SQL is good; a single point of failure is bad

For data analytics, joins are good. For real-time data processing, joins are bad.

If you think about data at scale, the relational model produces databases that, past the reasonable limits of a single machine, become a single point of failure. You can scale everything else in an application, but the database is most often a monolith on whose integrity the life of the organization depends.

While this is quite often the case, most applications don't actually need a relational model, and, regardless, object-oriented code is simply bad at dealing with relations.

The early 'NoSQL' movement unfortunately decided that SQL was the bad part, when SQL is actually one of the best things to have happened in computing in a long time.

Why SQL has a point

SQL is good because it's declarative: you don't tell your datastore how to persist or retrieve data, you just declare your intentions. This enables interchangeable products and portable code.
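To make the point concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and rows are hypothetical): the query states what rows we want, and the engine alone decides how to scan, index, and fetch them.

```python
import sqlite3

# In-memory database; schema and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO users (name, country) VALUES (?, ?)",
                 [("Ada", "UK"), ("Linus", "FI"), ("Grace", "US")])

# Declarative: we say WHAT we want, not HOW to retrieve it.
rows = conn.execute("SELECT name FROM users WHERE country = ?", ("UK",)).fetchall()
print(rows)  # [('Ada',)]
```

The same SELECT would run unchanged on another SQL engine, which is exactly the portability the declarative approach buys.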

Maybe it's also simplistic to say that NoSQL gave SQL a bad name; perhaps the unfortunate thing is the denomination 'NoSQL' itself, since these systems are really not all, or not necessarily, what the name says.

Keeping the good parts

We could simply restrict our code to a SQL subset and avoid joins, as long as we aren't running analytics that actually require them. This lets the application shard data across different physical locations, giving real room to scale.
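In practice this usually means denormalizing for the read path instead of joining at query time. A small sketch, again with sqlite3 and a hypothetical orders example: the fields a read needs are copied into one table keyed by how they are queried, so the query stays a join-free single-table lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Instead of normalized `users` + `orders` tables joined at read time,
# denormalize what the read path needs into one query-shaped table.
# Schema and data are hypothetical.
conn.execute("""CREATE TABLE orders_by_user (
    user_name TEXT, order_id INTEGER, amount REAL)""")
conn.executemany("INSERT INTO orders_by_user VALUES (?, ?, ?)",
                 [("Ada", 1, 9.99), ("Ada", 2, 4.50), ("Linus", 3, 12.00)])

# Join-free SQL subset: a single-table lookup by key.
ada_orders = conn.execute(
    "SELECT order_id, amount FROM orders_by_user WHERE user_name = ?",
    ("Ada",)).fetchall()
print(ada_orders)  # [(1, 9.99), (2, 4.5)]
```

Because every query touches exactly one table keyed by one value, each key's rows can live on a single shard and the query never has to cross machines.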

If you look at what Apache Cassandra did, that's exactly it: every operation that is expensive at massive scale is simply not supported, a pleasant SQL-like syntax is maintained, and the backend shards data via consistent hashing to achieve availability at scale.
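To illustrate the idea behind consistent hashing, here is a toy hash ring in the spirit of Cassandra's token ring; this is a sketch for illustration, not Cassandra's actual implementation, and the node names and key are made up.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each node owns several tokens (vnodes),
    and a key is placed on the first node clockwise from its token."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        token = self._hash(key)
        # First ring entry with token >= the key's token, wrapping around.
        idx = bisect.bisect(self._ring, (token,)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```

The payoff is that adding or removing a node only moves the keys adjacent to its tokens, instead of rehashing the whole dataset as a naive `hash(key) % n` scheme would.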

If you're going to support replication, operations like sorting or searching by a secondary index are not really viable options. Think about what it means to sort rows that are stored on different physical machines. Ask yourself whether you really need it before issuing such a query.
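To see why, consider a sketch of what a coordinator would have to do: even if each shard keeps its own rows locally sorted, a global ordering cannot be answered by any single machine, so the coordinator must pull from every shard and merge (the shard contents below are made up).

```python
import heapq

# Hypothetical shards: each machine holds its own locally-sorted rows.
shard_a = [3, 9, 27]
shard_b = [1, 10, 11]
shard_c = [5, 6, 30]

# A global sort is a scatter-gather merge across ALL shards:
# every machine is touched, for every such query.
globally_sorted = list(heapq.merge(shard_a, shard_b, shard_c))
print(globally_sorted)  # [1, 3, 5, 6, 9, 10, 11, 27, 30]
```

With three toy lists this is cheap; with hundreds of nodes and billions of rows it is exactly the kind of cross-machine operation a scale-out store refuses to support.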

As a matter of fact, using this SQL subset opens the way to very powerful implementations. And, when really needed, you can still use a relational database, or export to one, as a place for plain analytics.

I have often heard the advice "go relational, change if needed", but from an application point of view I'd rather say: "Persist first; use whatever supports you in doing what you need with your data."

In retrospect

I also think that the early thinking of the first NoSQL wave was wrong in pointing at SQL as the problem. This has produced a lot of less-than-optimal querying syntaxes that are ad hoc and quite painful to use on a command line. I am thinking here of MongoDB and others, but Elasticsearch, too, is not such a fun experience in the terminal. Programmatically, REST endpoints are nice things, but sorry: you've just cut out the operator who needs to run a one-off query to see if the data is there.

Instead, as good examples, I praise Apache Cassandra and InfluxDB for maintaining this SQL-looking resemblance. I also know there is a plugin for Elasticsearch that provides a pseudo-SQL language, and that can help in some cases.



[cassandra] [backend]