Tombstoning Cassandra’s super columns

High-dimensional data

Cassandra’s native data model is two-dimensional: rows and columns. This is great for data that is naturally grouped together, e.g. the fields of a message. However, some use cases require more dimensions – maybe you want to group your messages by recipient too, creating something like this:

"alice": {
  "ccd17c10-d200-11e2-b7f6-29cc17aeed4c": {
    "sender": "bob",
    "sent": "2013-06-10 19:29:00+0100",
    "subject": "hello",
    "body": "hi"
  }
}

For this reason, super columns were added to Cassandra. These are columns that contain columns. You can create as many sub-columns as you want within them, and different super columns can have different numbers of sub-columns. You can fetch some or all of the sub-columns in a super column. You can delete an entire super column or just some of its sub-columns. They do everything you need.

So what’s the problem?

One obvious problem is yet higher-dimensional data models. Maybe your messages have attachments with fields for size, type, encoding, etc. You could represent this by concatenating the field names into the sub-column name (something like "attachment:0:size"), but in general this doesn’t work if you want to use different comparators for different dimensions.

There is also a less obvious problem that manifests even if you have 2-dimensional data. Deep down in Cassandra, a super column is implemented as a single column with the sub-columns serialized within it. This means that, in order to read one sub-column, Cassandra has to deserialize the whole super column. For large super columns this becomes very inefficient. Partial updates to super columns also make the merging done during compaction expensive.

A further problem is that there are many special cases in the code for super columns. All the Cassandra developers I’ve spoken to would love to clean this up.

Yet another problem is that super columns cannot be accessed through CQL, which is now the preferred interface to Cassandra.

Enter composite columns

Cassandra 0.8.1 introduced composite columns. These are column names with an arbitrary number of dimensions, with types like CompositeType(UTF8Type, ReversedType(TimeUUIDType), LongType). They are also really simple: a composite is implemented as a comparator, so it adds very little complexity to Cassandra or to clients.
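
To make this concrete, here is a sketch of how the recipient-grouped messages from the start of this post might be modelled with CQL 3, which exposes composite columns through compound primary keys (the table and column names here are my own):

-- Hypothetical CQL 3 table: one message per (recipient, message_id), stored newest first.
CREATE TABLE messages (
  recipient text,
  message_id timeuuid,
  sender text,
  sent timestamp,
  subject text,
  body text,
  PRIMARY KEY (recipient, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Under the hood, the column names of such a table are composites of message_id and the field name, built on a CompositeType comparator much like the one above.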

When 0.8.1 came out, I immediately set about converting all the super column data models I knew of to composite columns. Everything worked well apart from one issue: I wanted an equivalent of the super column delete.

Remember I said above that you can delete a whole super column. This is a great feature: even without knowing which sub-columns are present you can delete them all. Composite columns had no equivalent, which presented a problem since I certainly didn’t want to read the sub-columns in just to find out which ones to delete (reads in Cassandra are much more costly than writes/deletes).

Range tombstones

What I really needed was a range delete: delete everything from column X to column Y. I was excited to see this added in Cassandra 1.2. Currently this feature only works through CQL, but there are plans to add it to the old Thrift interface.

Let’s see range tombstones at work in an example. Suppose I have a large number of sensors sending all kinds of different readings. Depending on the conditions they may send more or less data, so I cannot tell in advance which fields I will get. New sensors are being added all the time and I want Cassandra to store everything.

A possible data model is this:

CREATE TABLE readings (
  sensor_id uuid,
  reading_id timeuuid,
  name text,
  value blob,
  PRIMARY KEY (sensor_id, reading_id, name)
);

Each sensor has a UUID and each reading has a TimeUUID (so it is time ordered). Each reading consists of one or more (name, value) pairs. By using a compound primary key I can use arbitrary column names for each reading. (My partition key, or row key, is sensor_id; this means my data is partitioned according to the sensor it came from.)

Then let’s insert some readings from a fictional weather sensor:

BEGIN BATCH
  INSERT INTO readings (sensor_id, reading_id, name, value)
    VALUES (d1e59ab9-0fa2-49dd-97c1-41ce9537c110, dde7dfd0-d200-11e2-b05b-fac359ec8ffb, 'temp', 0x11);
  INSERT INTO readings (sensor_id, reading_id, name, value)
    VALUES (d1e59ab9-0fa2-49dd-97c1-41ce9537c110, dde7dfd0-d200-11e2-b05b-fac359ec8ffb, 'time', 0x51b622aa);
APPLY BATCH;

BEGIN BATCH
  INSERT INTO readings (sensor_id, reading_id, name, value)
    VALUES (d1e59ab9-0fa2-49dd-97c1-41ce9537c110, ccd17c10-d200-11e2-b05b-fac359ec8ffb, 'temp', 0x12);
  INSERT INTO readings (sensor_id, reading_id, name, value)
    VALUES (d1e59ab9-0fa2-49dd-97c1-41ce9537c110, ccd17c10-d200-11e2-b05b-fac359ec8ffb, 'humidity', 0x52);
APPLY BATCH;

BEGIN BATCH
  INSERT INTO readings (sensor_id, reading_id, name, value)
    VALUES (d1e59ab9-0fa2-49dd-97c1-41ce9537c110, 13a1ec90-d203-11e2-b05b-fac359ec8ffb, 'temp', 0x11);
APPLY BATCH;

I used a batch so that my writes are atomic and isolated: since all the inserts for a reading go to the same partition, I won’t end up with partial readings if a write fails or if I read during a write.

Internally, this uses sensor_id as the row key and CompositeType for the column names: the first dimension is reading_id and the second is name, with value stored as the column value.
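
For example, reading back a single field of a single reading should only touch one of those composite columns, with no whole-super-column deserialization. A sketch using the values inserted above:

-- Fetch one field of one reading; internally this reads a single composite column.
SELECT value FROM readings
  WHERE sensor_id = d1e59ab9-0fa2-49dd-97c1-41ce9537c110
    AND reading_id = dde7dfd0-d200-11e2-b05b-fac359ec8ffb
    AND name = 'temp';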

I can now list all readings for my sensor:

SELECT dateOf(reading_id), name, value FROM readings
  WHERE sensor_id = d1e59ab9-0fa2-49dd-97c1-41ce9537c110 ORDER BY reading_id DESC;

       dateOf(reading_id) |     name |      value
--------------------------+----------+------------
 2013-06-10 20:22:23+0100 |     temp |       0x11
 2013-06-10 20:06:33+0100 |     time | 0x51b622aa
 2013-06-10 20:06:33+0100 |     temp |       0x11
 2013-06-10 20:06:05+0100 |     temp |       0x12
 2013-06-10 20:06:05+0100 | humidity |       0x52

The dateOf() function extracts the time from the TimeUUID, and the results are shown most recent first.
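
I can also slice out a single reading by restricting reading_id, the equivalent of fetching all the sub-columns of one super column. This should return just the temp and humidity fields of the second reading:

-- Fetch every field of one reading (a slice over the reading_id prefix).
SELECT name, value FROM readings
  WHERE sensor_id = d1e59ab9-0fa2-49dd-97c1-41ce9537c110
    AND reading_id = ccd17c10-d200-11e2-b05b-fac359ec8ffb;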

Now comes the bit you’ve all been waiting for: deleting a whole reading:

DELETE FROM readings WHERE sensor_id = d1e59ab9-0fa2-49dd-97c1-41ce9537c110
  AND reading_id = ccd17c10-d200-11e2-b05b-fac359ec8ffb;

This deletes the two columns for this reading, and I didn’t have to know what they were beforehand. Internally, this has inserted a range tombstone in the row, deleting all columns with the prefix “ccd17c10-d200-11e2-b05b-fac359ec8ffb”.
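
For comparison, the equivalent of deleting a single sub-column is still available by naming the field in full. Something like this removes only the time field of the first reading:

-- Delete one field of a reading by specifying the full primary key.
DELETE FROM readings
  WHERE sensor_id = d1e59ab9-0fa2-49dd-97c1-41ce9537c110
    AND reading_id = dde7dfd0-d200-11e2-b05b-fac359ec8ffb
    AND name = 'time';

The range tombstone is only needed for the whole-reading case, where the sub-column names aren’t known in advance.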

More complex deletes

Maybe I then wanted to delete all readings within a certain time range because the sensor was giving invalid readings. I could try this:

DELETE FROM readings WHERE sensor_id = d1e59ab9-0fa2-49dd-97c1-41ce9537c110
  AND reading_id > 13a1ec90-d203-11e2-b05b-fac359ec8ffb;

This isn’t yet supported in CQL. Hopefully it will come soon since all the underlying machinery is ready to support such a query. (NB such a delete was never possible with super columns.)

Future

Now we can truly say that super columns are deprecated: you can do everything you ever wanted to do with them using composite columns, CQL and range tombstones. Indeed, in the upcoming Cassandra 2.0 super columns have been replaced internally: CASSANDRA-3237. I don’t expect they will be removed (from the Thrift interface), but at least when someone tells you not to use super columns you now have a viable and complete alternative.

2 comments

  1. Interesting post! What about the case where you want to do range queries across a high-dimensional dataset? Say you have 5000 sensors and want to find the time points where each sensor was within an upper and lower bound (unique to that sensor). My understanding is that a secondary index would be required on each of the 5000 columns? Is this efficient?

    1. Do you mean a query with a different time range for each sensor? I think you would need to do 5000 separate queries, one per sensor (assuming you know the sensor_ids), each with its own time range. You would only need a secondary index if you didn’t know the sensor_id.
