The sweet spot for Cassandra secondary indexing

Secondary indexes

Secondary indexes have been in Cassandra since 0.7 and can be incredibly useful. For example, if you were implementing a user accounts database, you might have the schema

CREATE TABLE user_accounts (
    username text PRIMARY KEY,
    email text,
    password text,
    last_visited timestamp,
    country text
);

The only key you can lookup on is the primary key – the username. If you wanted to find users in a particular country, you can’t do it without doing a full scan. Instead, you could create an index:

create index user_accounts_country on user_accounts(country);

So you can now run queries like:

select * from user_accounts where country = 'UK';

This works, but if you were deploying this in production you should understand what’s going on under the hood to know if it will work for you.

How secondary indexes work

At a high level, secondary indexes look like normal column families, with the indexed value as the partition key.

For user_accounts, the partition key is username and that is the key the data is indexed with in Cassandra’s SSTables. For the index, the partition key is the country and the column name is the username. Â As an example, suppose there are two users in the UK, the data stored in Cassandra is (showing only username and country) in JSON form:

{
    "rlow": {
        "country": "UK"
    },
    "jbloggs": {
        "country": "UK"
    }
}

with corresponding index entries:

{
    "UK": {
        "rlow": "",
        "jbloggs": ""
    }
}

This means, to find everyone in the UK, we simply lookup this row to find the primary key for the user_accounts table i.e. the usernames.

Distribution

The subtly here is how the data is distributed. For user_accounts, the partitions are distributed by hashing the username and using the ring to find the nodes that store the data. This means user accounts will in general be stored on different nodes. So to find all the users in the UK we will have to do lookups on different nodes. If there are many users in the UK – many more than the number of nodes in the cluster – we should expect to do a query on every node.

If the index were stored like a regular column family, the ‘UK’ partition would be stored on a single node (plus replicas). This partition would grow and grow over time and all index lookups would hit this node. This doesn’t scale – the node(s) indexing the ‘UK’ partition would have to do more and more work as the data grows.

For this reason, Cassandra’s secondary indexes are not distributed like normal tables. They are implemented as local indexes. Each node stores an index of only the data that it stores. For our example, if partitions ‘rlow’ and ‘jbloggs’ are stored on different nodes then one node will have index

{
    "UK": {
        "rlow": "",
    }
}

and the other

{
    "UK": {
        "jbloggs": ""
    }
}

This means our index scales nicely – as our data grows and we add more nodes to compensate, the index on each node stays a constant size.

Note that this doesn’t allow us to scale the number of index lookups since each index lookup does work on each node. But, as our data grows, the data returned from each query grows. The scaling allows us to effectively balance this load around the cluster.

High vs low cardinality

To perform the country index lookup, every node is queried, looks up the ‘UK’ partition and then looks up each user_accounts partition found. This is pretty efficient – each node does one index lookup plus one lookup for each bit of data returned. Each lookup is potentially a disk seek, so if there are n nodes and p partitions returned, we’ve done O(n+p) disk seeks. Since we’ve assumed there are many more users than nodes, p >> n so this is O(p) disk seeks, or O(1) per partition returned.

However, suppose instead we had created an index on email. The key difference here is the cardinality of the fields. There are many entries with the same country but probably only one with the same email. This means only one node (plus replicas) store data for a given email address but all nodes are queried for each lookup. This is wasteful – every node has potentially done a disk seek but we’ve only got back one partition. In this case, we’ve done O(n+1)=O(n) disk seeks. This is O(n) per partition returned.

In this case, the scaling we mostly care about is the number of queries we can perform. The size of the data we are requesting doesn’t change so the only parameter that can grow over time is the query rate. But since we are doing O(n) lookups, increasing n doesn’t change our query rate so we cannot scale.

What would be much more efficient in this case is a distributed index. If the index was distributed just like a normal table then the index lookup would be a single lookup, followed by another single lookup to retrieve the data. These lookups will in general be on different nodes but there are only two lookups in total.

Distributed indexes

Cassandra doesn’t provide an index suitable for the email index, but you can do it yourself. You can create a separate table to store the inverted index:

CREATE TABLE user_accounts_email_idx (
    email text,
    username text,
    PRIMARY KEY(email, username)
);

With the advent of atomic batches in Cassandra 1.2, you can update it atomically. You would, however, miss two nice features of the inbuilt indexing. If you create the index when there is already data, you will need to build the initial index yourself. Also,Â CASSANDRA-2897Â (in Cassandra 1.2) adds ‘lazy’ updating to secondary indexes. When you change an indexed value, you need to remove the old value from the index. Prior to Cassandra 1.2, a read was performed to read the old value to remove it from the index. This made index inserts significantly slower. Lazy updating on reads makes inserts into indexed tables significantly cheaper. There’s no reason why you couldn’t do this manually in your client too but it is complicated.

The sweet spot

Going back to the country index, recall that Cassandra is doing O(p) seeks to return p users. This is a rare case in Cassandra where you perform random I/O rather than sequential I/O. If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.

This leads to the conclusion that the best use case for Cassandra’s secondary indexes is when p is approximately n i.e. the number of partitions is about equal to the number of nodes. Any fewer partitions and your n index lookups are wasted; many more partitions and each node is doing many seeks.

In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.

14 comments

It’s quite a good summary, but it would have even better when taking into account the importance of the number of requested rows, expected by the Cassandra client.

1) “To perform the country index lookup, every node is queried, looks up the â€˜UKâ€™ partition and then looks up each user_accounts partition found. ”

Well, not every node is queried : AFAIK, the node calls stop when enough rows have been found. So, not all nodes are always queried.

2) “This leads to the conclusion that the best use case for Cassandraâ€™s secondary indexes is when p is approximately n i.e. the number of partitions is about equal to the number of nodes.”

What a narrow best use case ! Hopefully, there are other use cases where seconday index are fine (that is, for low-cardinality sets), or even finer (according to the number of resulting rows requested vs the cardinality of indexed values).

1) You’re right, I had overlooked the LIMIT query case. But such limits give you a random sample of the results, rather than e.g. the first 10 results. So I think in general LIMIT queries on secondary indexes will be used for paging through the entire set rather than a one off. In that case, you will eventually query all nodes.

2) This is just the best case, of course you can use the indexes in other regimes, where they often work well.

Excellent writeup; and as far as I can tell, spot on. Thanks!

“If the index was distributed just like a normal table then the index lookup would be a single lookup, followed by another single lookup to retrieve the data.”

If I’m not missing something, this is only true if the cardinality is 1-to-1, right? If I your user_accounts_email_idx “index” contained say 10 usernames per email (not really a real-life example, but hopefully you understand what I mean), then after querying the “index” you’d have to do 10 separate lookups (queries) to get the rest of the data.

Yes, this is true for cardinality 1 only. I was talking about just that case here – it is more efficient to use a distributed index for a cardinality 1 field than Cassandra’s inbuilt index.

Very nice article – it has inspired me to build compile time awareness of secondary indexes into cqlc: http://relops.com/cqlc/secondary/

This is a great article that goes to the point on when to use secondary index and when an additional table! Reading should be mandatory for developers.

Thanks!

I use the following definition for an inverted index table

CREATE TABLE IF NOT EXISTS user_email_to_user_key_idx (
email text PRIMARY KEY,
user_key timeuuid
);

The difference is in PRIMARY KEY definition. This allows me to use lightweight transaction to determine if user with given email is already registered without performing select query when creating a new user.

INSERT INTO user_email_to_user_key_idx … IF NOT EXISTS;

If result is successful â€” ok, otherwise I show an error that user with given email already registered.

this is one of the best article. But still I am having some doubts

“Cassandra secondary indexes are not distributed like normal tables. they are implemented as local Indexes. Each node store an index of only the data that it stores.”
But in both cases for high and low cardinality columns it’s touching all nodes.

suppose I am writing a query like

Select * from user_accounts where username=’ABC’ and email=”abc@pqr.com”;

here username is the partition key for user_accounts table and email is secondary index.
does still cassandra will touch all nodes? WHY or WHY NOT ?

How cassandra will perform intersection over these two results. I mean over email index result and user_accounts result.

richard says:

2015-05-30 at 16:10

Good point – most of what I wrote was for the case when your where clause only contains indexed values. For your example, you give Cassandra the partition key so it will use that to only touch replicas for that key.

Great article!
I’m wondering if it matters whether you’re using vnodes or not. I’m seeing far worse performance on secondary index queries on servers with vnodes than on on servers without vnodes, especially on low-cardinality data.

Tom

richard says:

2015-10-18 at 16:33

The secondary index lookup itself should be the same. But you can’t get weird behaviour with vnodes when there’s not much data e.g. select with no where will walk round each vnode until it finds data, taking much longer with vnodes and an almost empty table.

What about usecase like this:
– simple table for IoT, just columns: sensor_name, value, timestamp
– PK is on sensor_name column
– Secondary Index is on value column
– query (once per 3-5 minutes) is: SELECT * FROM sensors_table WHERE sensor_name=’ABC’ and value BETWEEN 5.4 AND 18.0;

The question: is the secondary index useful for range query like that? Cardinality of secondary index is very high (double precision number), but I can’t find out other way to get sensor’s data narrowed to particular value range…

Thank you for a nice article.
I’m struggling a bit with the Cassandra infrastructure I’ve inherited
– technically I have one big table fully replicated across the world and has 3 secondary indexes ( not sure if they are needed and I’m investigating that too) that have bloated to monstrous sizes compare to the original data ( table itself is about 20gb on each node, each secondary index is over 50gb so used space comes to 20 + 3×50 =170gb) – when I kick off nodetool rebuild_index it creates nex bunch of files in the secondary index subridectory but never cleans anything up – I did not find any hints on how to clean this up and just testing ( stopping cassandra, moving old index files away and kicking off reindex ) – which seems to work ok – stop/start of cassandra is necessary because it keeps all the old files open – is there any process that can be invoked from inside cassandra to clean this up that I missed in the documentation ?

Comments are closed.