Sharding in small systems

Sharding in small systems presents problems:

Let's imagine a social network. Users can see posts from users they follow in a timeline.

Let's also imagine a sharded database ...using around 10 servers.

Two basic architectures: scatter and gather.

Scatter is like the mail: sent out individually by the writer. A user's timeline is loaded from one shard. A new post has to be written to the shards of all followers.

Gather is like the supermarket: put together in the basket by the buyer. A new post is only written in one shard. A user's timeline is loaded from the shards of everyone they follow.
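To make the two shapes concrete, here's a minimal sketch against a toy in-memory store (all the names below, like SHARDS, FOLLOWERS, and shard_for, are hypothetical stand-ins, not a real system):

```python
# A minimal sketch of both shapes over a toy in-memory store.
N_SHARDS = 10
SHARDS = [[] for _ in range(N_SHARDS)]   # each shard holds (user, post) rows
FOLLOWERS = {}    # author -> set of followers
FOLLOWING = {}    # reader -> set of users they follow

def shard_for(user):
    return hash(user) % N_SHARDS

def scatter_write(author, post):
    # Scatter: one write per follower, onto each follower's own shard.
    for follower in FOLLOWERS.get(author, ()):
        SHARDS[shard_for(follower)].append((follower, post))

def scatter_read(user):
    # Scatter: a timeline read touches exactly one shard.
    return [p for (u, p) in SHARDS[shard_for(user)] if u == user]

def gather_write(author, post):
    # Gather: exactly one write, onto the author's shard.
    SHARDS[shard_for(author)].append((author, post))

def gather_read(user):
    # Gather: a timeline read fans in across the shards of everyone followed.
    timeline = []
    for followee in FOLLOWING.get(user, ()):
        timeline += [p for (u, p) in SHARDS[shard_for(followee)] if u == followee]
    return timeline
```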

At first scatter seems the obvious choice, but it has problems:

  • Writes are typically more expensive than reads, and scatter multiplies writes by follower count.
  • The majority of reads need to come out of RAM, but scatter duplicates content, wasting RAM systemwide.
  • Write replication latency causes consistency issues.

What about gather? It has its own problems:

  • We read multiple shards to produce a timeline.
  • All reads for a hotly followed person go to one shard, potentially overloading it.
  • Worst read latency among a gather set dominates the read latency of the whole set (quantified in the sketch below).

Gather is not ideal, but since reads from RAM are cheap it's perhaps workable. It maximizes RAM efficiency, which is very good from a global perspective.
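The latency point is easy to quantify with a toy simulation (the per-shard latency distribution here is invented purely for illustration):

```python
import random

def shard_read_ms():
    # Hypothetical per-shard latency: fast from RAM, occasionally slow.
    return 1.0 if random.random() > 0.01 else 50.0

def gather_read_ms(fanout=35):
    # The timeline is ready only when the slowest shard has answered.
    return max(shard_read_ms() for _ in range(fanout))

samples = sorted(gather_read_ms() for _ in range(10_000))
print("median:", samples[5000], "p99:", samples[9900])
# With a 1% chance of a slow shard read, a fanout of 35 means roughly
# 30% of gathers (1 - 0.99**35) are as slow as their slowest member.
```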

Scatter is best when reads are expensive. Gather is best when writes are expensive. Gather is best when capacity is expensive.

Scatter for disk. Gather for RAM.

For most systems, RAM cost vs. performance dominates other concerns. Memcached is popular because it makes gather a bolt-on to your existing architecture.
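The bolt-on pattern is just a read-through cache. A minimal sketch, with a plain dict standing in for memcached and a hypothetical load_from_shard for the database read:

```python
# Read-through caching: gather bolted onto an existing store.
CACHE = {}   # stand-in for memcached

def load_from_shard(user):
    return []   # stub: a real system would read the user's shard here

def cached_read(user):
    posts = CACHE.get(user)
    if posts is None:              # cache miss: fall through to the database
        posts = load_from_shard(user)
        CACHE[user] = posts        # later gathers for this user hit RAM
    return posts
```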

But recall, we're talking about small systems, and sharding in small systems presents problems. On Twitter, the average user follows some 35 other users. That's our natural data fanout. So under gather we'll send queries to 35 shards to read a single timeline. But we have only 10 database servers: 4 queries per database, when only one would do.

This is why memcached has multi-get.
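Multi-get lets the client batch keys by owning server, turning 35 point reads into one round trip per server. A sketch, assuming simple modulo placement (SERVERS and multi_get here are stand-ins, not a real client library):

```python
from collections import defaultdict

# SERVERS stands in for 10 memcached instances: each maps key -> posts.
SERVERS = [dict() for _ in range(10)]

def multi_get(server_id, keys):
    # One round trip: a single multi-key get against one server.
    store = SERVERS[server_id]
    return {k: store[k] for k in keys if k in store}

def batched_read(followees):
    # Group the ~35 followee keys by owning server, so each of the 10
    # servers answers one multi-get instead of 3-4 separate gets.
    by_server = defaultdict(list)
    for user in followees:
        by_server[hash(user) % len(SERVERS)].append(user)
    results = {}
    for server_id, keys in by_server.items():
        results.update(multi_get(server_id, keys))
    return results
```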

But let's go back to this basic situation: our natural data fanout is larger than our number of servers.

For scatter, we'll on average be writing to all servers on each write, no matter our shard partitioning strategy. For gather, we'll on average be reading from all servers, no matter our shard partitioning strategy.

Blind Sharding:

This enables a vastly simpler sharding architecture: No shard mapping to maintain or communicate. No constraints on data location or relative database size. ...we can choose location purely for efficiency. ...while ignoring the shape of the natural data fan. ...while ignoring the logical structure of the application.

Only one constraint: combine shard answers into system answer. This is Teradata, Abadi, and Stonebraker all over again.
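A sketch of that combine step, assuming each shard answered the same query locally and returned results sorted newest-first (the (timestamp, body) shape is hypothetical):

```python
import heapq

def blind_gather(shard_answers):
    # Each shard ran the same query locally and returned posts sorted
    # newest-first as (timestamp, body) pairs. The combine step merges
    # them into one system answer; no shard map was consulted, because
    # every shard was asked.
    return list(heapq.merge(*shard_answers, reverse=True))

# e.g. answers from three shards, each already sorted newest-first:
print(blind_gather([[(9, "c")], [(8, "b"), (2, "a")], [(7, "d")]]))
```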

There are lots of interesting opportunities:

  • For gather, choose write location to eventually equalize database size (sketched below).
  • For scatter, choose read location to maximize cache coherence.
  • Choose scatter or gather per logical model.
  • Choose scatter or gather per queried attribute.
  • Choose scatter or gather for point or range queries.
  • Choose scatter or gather for consistency requirements.
  • Choose scatter or gather based on attribute value.

...all enabled by overlapping the scatter query range and the gather query range conservatively.
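The first of those is just "write to the smallest shard". A minimal sketch (a real system would track shard sizes out of band rather than calling len):

```python
def balanced_write(shards, post):
    # Gather frees the write location, so just append to whichever
    # shard is currently smallest; sizes equalize over time.
    min(shards, key=len).append(post)

shards = [["old"], [], ["old", "older"]]
balanced_write(shards, "new")   # lands on the empty shard
```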


relistan commented Jul 5, 2011

Miss working with you, Jason! This is good clear thinking, nicely summarized. We currently face some of these challenges at a large scale (21k RPM) and have chosen to gather almost everything. But certain things are scattered when read performance is a hard requirement. I had not thought of this in those terms, but that's what we do.

Redis with its versatile command set and persistence has been really useful here. We have a big Redis cluster that indexes most of our important data. The data itself is currently in Membase (but I don't recommend it) or Postgres. Workers handle some of the management of these data stores via Resque.

jasonwatkinspdx (Author) commented

Miss working with you too man.

I've only used Redis for Resque, and been quite happy with it there. Definitely interested in seeing how Redis Cluster fleshes out. I think it may be a great solution for the sorts of projects that are too big for a simple single database but still small enough that running the whole Hadoop stack would be a burden for the team.
