The sweet spot for Cassandra secondary indexing

Secondary indexes

Secondary indexes have been in Cassandra since 0.7 and can be incredibly useful. For example, if you were implementing a user accounts database, you might have the schema

CREATE TABLE user_accounts (
    username text PRIMARY KEY,
    email text,
    password text,
    last_visited timestamp,
    country text
);

The only key you can look up on is the primary key – the username. If you wanted to find users in a particular country, you couldn’t do it without doing a full scan. Instead, you could create an index:

CREATE INDEX user_accounts_country ON user_accounts (country);

So you can now run queries like:

SELECT * FROM user_accounts WHERE country = 'UK';

This works, but if you were deploying this in production you should understand what’s going on under the hood to know if it will work for you.

How secondary indexes work

At a high level, secondary indexes look like normal column families, with the indexed value as the partition key.

For user_accounts, the partition key is the username and that is the key the data is indexed by in Cassandra’s SSTables. For the index, the partition key is the country and the column name is the username. As an example, suppose there are two users in the UK; the data stored in Cassandra (showing only username and country) looks like this in JSON form:

{
    "rlow": {
        "country": "UK"
    },
    "jbloggs": {
        "country": "UK"
    }
}

with corresponding index entries:

{
    "UK": {
        "rlow": "",
        "jbloggs": ""
    }
}

This means that, to find everyone in the UK, we simply look up this row to find the primary keys for the user_accounts table, i.e. the usernames.
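To make this layout concrete, the index behaves roughly like the hypothetical table below. You never create or query this table directly – the sketch is only an illustration of the structure just described, with one crucial difference, explained next: the real index is not distributed like a normal table.

CREATE TABLE user_accounts_country_index (
    country text,
    username text,
    PRIMARY KEY (country, username)
);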

Distribution

The subtlety here is how the data is distributed. For user_accounts, the partitions are distributed by hashing the username and using the ring to find the nodes that store the data. This means user accounts will in general be stored on different nodes. So to find all the users in the UK we will have to do lookups on different nodes. If there are many users in the UK – many more than the number of nodes in the cluster – we should expect to do a query on every node.
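You can see this placement for yourself with CQL’s token function (available in the SELECT clause in reasonably recent versions; the token values you get depend on your partitioner):

SELECT token(username), username, country FROM user_accounts;

Two rows with the same country will in general hash to unrelated tokens, and so live on different nodes.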

If the index were stored like a regular column family, the ‘UK’ partition would be stored on a single node (plus replicas). This partition would grow and grow over time and all index lookups would hit this node. This doesn’t scale – the node(s) indexing the ‘UK’ partition would have to do more and more work as the data grows.

For this reason, Cassandra’s secondary indexes are not distributed like normal tables. They are implemented as local indexes. Each node stores an index of only the data that it stores. For our example, if partitions ‘rlow’ and ‘jbloggs’ are stored on different nodes then one node will have the index

{
    "UK": {
        "rlow": ""
    }
}

and the other

{
    "UK": {
        "jbloggs": ""
    }
}

This means our index scales nicely – as our data grows and we add more nodes to compensate, the index on each node stays a constant size.

Note that this doesn’t allow us to scale the rate of index lookups, since each index lookup does work on every node. But, as our data grows, the data returned from each query grows. The local index scheme lets us balance this growing load around the cluster.

High vs low cardinality

To perform the country index lookup, every node is queried, looks up the ‘UK’ partition and then looks up each user_accounts partition found. This is pretty efficient – each node does one index lookup plus one lookup for each bit of data returned. Each lookup is potentially a disk seek, so if there are n nodes and p partitions returned, we’ve done O(n+p) disk seeks. Since we’ve assumed there are many more users than nodes, p >> n so this is O(p) disk seeks, or O(1) per partition returned.
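To put illustrative numbers on this (not from any particular deployment): with n = 10 nodes and p = 10,000 matching partitions, that is about 10 index seeks plus 10,000 data seeks – roughly 10,010 in total, or just over one seek per partition returned.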

However, suppose instead we had created an index on email. The key difference here is the cardinality of the fields. There are many entries with the same country but probably only one with the same email. This means only one node (plus replicas) stores data for a given email address, but all nodes are queried for each lookup. This is wasteful – every node has potentially done a disk seek but we’ve only got back one partition. In this case, we’ve done O(n+1) = O(n) disk seeks. This is O(n) per partition returned.
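For reference, such an index would be created and queried exactly like the country one (the index name and email address here are just examples):

CREATE INDEX user_accounts_email ON user_accounts (email);

SELECT * FROM user_accounts WHERE email = 'rlow@example.com';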

In this case, the scaling we mostly care about is the number of queries we can perform. The size of the data each query returns doesn’t change, so the only parameter that can grow over time is the query rate. But since each query does O(n) lookups, adding nodes doesn’t increase the query rate we can sustain, so we cannot scale.

What would be much more efficient in this case is a distributed index. If the index were distributed just like a normal table then the index lookup would be a single lookup, followed by another single lookup to retrieve the data. These lookups will in general be on different nodes, but there are only two lookups in total.

Distributed indexes

Cassandra doesn’t provide a distributed index suitable for the email case, but you can build one yourself. You can create a separate table to store the inverted index:

CREATE TABLE user_accounts_email_idx (
    email text,
    username text,
    PRIMARY KEY(email, username)
);
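Lookups then become the two single-partition reads described above rather than a fan-out to every node (the email address is, again, just an example):

-- Step 1: a single-partition read on the index table
SELECT username FROM user_accounts_email_idx
WHERE email = 'rlow@example.com';

-- Step 2: fetch the account using the username returned above
SELECT * FROM user_accounts WHERE username = 'rlow';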

With the advent of atomic batches in Cassandra 1.2, you can update it atomically, as sketched below. You would, however, miss two nice features of the inbuilt indexing. First, if you create the index when there is already data, you will need to build the initial index yourself. Second, CASSANDRA-2897 (in Cassandra 1.2) adds ‘lazy’ updating to secondary indexes. When you change an indexed value, the old value needs to be removed from the index. Prior to Cassandra 1.2, a read was performed at write time to find the old value so it could be removed, which made index inserts significantly slower; updating the index lazily on reads makes inserts into indexed tables significantly cheaper. There’s no reason why you couldn’t do this manually in your client too, but it is complicated.
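A sketch of such an atomic update, which also shows the manual ‘remove the old value’ step just described (the username and email addresses are made up):

BEGIN BATCH
    UPDATE user_accounts SET email = 'new@example.com'
        WHERE username = 'rlow';
    DELETE FROM user_accounts_email_idx
        WHERE email = 'old@example.com' AND username = 'rlow';
    INSERT INTO user_accounts_email_idx (email, username)
        VALUES ('new@example.com', 'rlow');
APPLY BATCH;

Note that this requires the client to know the old email value – exactly the read-before-write cost that the inbuilt lazy updating avoids.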

The sweet spot

Going back to the country index, recall that Cassandra is doing O(p) seeks to return p users. This is a rare case in Cassandra where you perform random I/O rather than sequential I/O. If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.

This leads to the conclusion that the best use case for Cassandra’s secondary indexes is when p is approximately n, i.e. the number of matching partitions is about equal to the number of nodes. Any fewer partitions and your n index lookups are mostly wasted; many more partitions and each node is doing many seeks.

In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.