Notes on DataStax and Cassandra

Datetime: 2016-08-23 00:40:41

I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.

On the customer side:

  • DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
  • This includes (and “includes” might be understating the point) traditional enterprises worried about competition from internet-only ventures.
  • Customers in large numbers want cloud capabilities, as a potential future if not a current need.

One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.

Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are:

  • Graph capabilities.
  • Cassandra 3.0, which includes a complete storage engine rewrite.
  • Tiered storage/ILM (Information Lifecycle Management).
  • Policy-based replication.

Some of that terminology is mine, but perhaps my clients at DataStax will adopt it too.

We didn’t go into as much technical detail as I ordinarily might, but a few notes on that tiered storage/ILM bit are:

  • It’s a way to have some storage that’s more expensive (e.g. flash) and some that’s cheaper (e.g. spinning disk). Duh.
  • Since Cassandra has a strong time-series orientation, it’s easy to imagine how those policies might be specified.
  • Technologically, this is tightly integrated with Cassandra’s compaction strategy.
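
To make the time-series point concrete, here is a minimal sketch of how an age-based tiering policy might be expressed. It is plain Python, not DataStax's actual configuration syntax; the tier names, the 7-day cutoff, and the `Tier`, `POLICY` and `pick_tier` names are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str                      # hypothetical label for a storage class
    max_age: Optional[timedelta]   # None = catch-all for the oldest data


# Hypothetical policy, ordered from hottest (most expensive) to coldest.
POLICY = (
    Tier("flash", timedelta(days=7)),
    Tier("spinning_disk", None),
)


def pick_tier(written_at: datetime, now: datetime,
              policy: Sequence[Tier] = POLICY) -> str:
    """Return the name of the first tier whose age limit covers the data."""
    age = now - written_at
    for tier in policy:
        if tier.max_age is None or age <= tier.max_age:
            return tier.name
    raise ValueError("policy has no catch-all tier")


now = datetime(2016, 8, 23)
print(pick_tier(datetime(2016, 8, 20), now))  # flash
print(pick_tier(datetime(2016, 7, 1), now))   # spinning_disk
```

In a real system the move between tiers would presumably happen as data files are rewritten, which is one reason a tie-in to compaction is plausible; this sketch only covers the policy decision.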

DataStax Enterprise 5 also introduced policy-based replication features, not all of which are in open source Cassandra. Data sovereignty/geo-compliance is improved, which is of particular importance in financial services. There’s also hub/spoke replication now, which seems to be of particular value in intermittently-connected use cases. DataStax said the motivating use case in that area was oilfield operations, where presumably there are Cassandra-capable servers at all ends of the wide-area network.
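
The intermittently-connected hub/spoke idea can be illustrated with a toy store-and-forward model. This is a sketch of the general pattern only, under my own assumptions; it says nothing about how DataStax actually implements it, and the `Hub` and `Spoke` classes and their methods are hypothetical.

```python
from collections import deque


class Hub:
    """Central cluster that eventually receives every spoke's writes."""

    def __init__(self):
        self.rows = {}

    def receive(self, key, value):
        self.rows[key] = value


class Spoke:
    """Edge site (e.g. an oilfield server) that keeps working while offline."""

    def __init__(self, hub):
        self.hub = hub
        self.local = {}          # data served locally at the edge
        self.pending = deque()   # writes not yet forwarded to the hub
        self.connected = False

    def write(self, key, value):
        self.local[key] = value
        self.pending.append((key, value))
        if self.connected:
            self.flush()

    def flush(self):
        # Forward queued writes in the order they were made.
        while self.pending:
            key, value = self.pending.popleft()
            self.hub.receive(key, value)

    def reconnect(self):
        self.connected = True
        self.flush()


hub = Hub()
rig = Spoke(hub)
rig.write("sensor-1", 42)        # link is down: stays local and queued
print(hub.rows)                  # {}
rig.reconnect()                  # link is back: queue drains to the hub
print(hub.rows)                  # {'sensor-1': 42}
```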