Saturday, August 13, 2016

Design Key Value Store

Designing a KV store with external storage


  1. Data size: Data size of values is too large to be held in memory, and we should leverage the external storage for them. However, we can still keep the data keys in memory.
  2. Single-host solution. No distributed design.
  3. Optimize for write.


  • In-memory hashmap index + index hint file + data files
  • Append-only for write optimization. Have only one active data file for write. And compact active data to the older data file(s) for read.


  1. In-memory HashMap<Key, <FildId, ValueOffset, ValueSize, Timestamp>>
  2. Data file layout
  1. (index) hint file that the in-memory hashmap can recover from


Delete: get the location by the in-memory hashmap, if it exists, then go to the location on the disk to set the value to a magic number.
Get: get the location by the in-memory hashmap, and then go to the location on the disk for the value.
Put: append to the active data file and update the in-memory hash map.
Periodical compaction strategies
  • Copy latest entries: In-memory hashmap is always up-to-date. Stop and copy into new files. Time complexity is O(n) n is the number of valid entries.
    • Pros: Efficient for lots of entries out-dated or deleted.
    • Cons: Consume storage if little entries are out-dated. May double the space. (can be resolved by having a secondary node do the compression work with GET/POST periodically. E.g., Hadoop secondary namenode).
  • Scan and move: foreach entry, if it is up-to-date, move to the tail of the validated section. Time complexity is O(n) n is the number of all the entries.
    • Pros:
      • shrink the size
      • no extra storage space needed
    • Cons:
      • Complex and need to sync hashmap and storage with transactions. May hurt performance.
Following up questions
  • How to detect records that can be compacted?
    • Use timestamp.
  • What if one hashmap cannot fit into a single machine’s memory?
    • Consistent hashing, chord DHT, query time complexity is O(logn) with the finger table, instead of O(1) here with a hashmap.

已知一个文件系统,可以create files, delete files, sequentially scan file content, read file content randomly, append file content.对于key-value pairs,在给定的文件系统中实现put,get,delete 方法。其中key比较小,全部key可以放在内存中,value会比较大。Answer:
a. In the main memory, maintain a hashmap, key is the given input key, value is thefile ptr and line ptr(key-value pair location in the file system).

b. Each time we need to update the value, write a new key value pair at the end of the file, and also update the related key value pairs in the main memory. Then when main memory cracks, we can reconstruct the main memory hashmap from the files content.

c. Merge: 

Make file modifications to save space: scan the file sequentially for each key to find the latest update, then put the latest key-value pairs into another file. After we finished doing this for all the keys in a file, delete that file.

d. Deal with synchronization: when we do the file modification, put all the newest updates into another file, then replay it.

e. Each file should have a limited amount of different keys, so that file modification is not that hard.
A key-value store is a very power technique that is used in almost every system in the world. It can be as simple as a hash table and at the same time, it can also be a distributed storage system. For instance, the underline system of Cassandra is a key-value storage system

Basic key-value storage

How would you design a simple key-value storage system in a single machine?
The most straightforward thing is to use a hash table to store key-value pairs, which is how most this kind of systems work nowadays. A hash table allows you to read/write a key-value pair in constant time and it’s extremely easy to use. Most languages have built-in support for this.
However, the downside is also obvious. Using a hash table usually means you need to store everything in memory, which may not be possible when the data set is big. There are two common solutions:
  • Compress your data. This should be the first thing to think about and often there are a bunch of stuff you can compress. For example, you can store reference instead of the actual data. You can also use float32 rather than float64. In addition, using different data representations like bit array (integer) or vectors can be effective as well.
  • Storing in a disk. When it’s impossible to fit everything in memory, you may store part of the data into disk. To further optimize this, you can think of the system as a cache system. Frequently visited data is kept in memory and the rest is on a disk.
Since a single machine doesn’t have enough storage for all the data, the general idea here is to split the data into multiple machines by some rules and a coordinator machine can direct clients to the machine with requested resource. The question is how to split the data into multiple machines and more importantly, what is a good strategy to partition data?


Suppose all the keys are URLs like and we have 26 machines. One approach is to divide all keys (URLs) based on the first character of the URL (after “www”) to these 26 machines. For example, will be stored at machine G and will be stored at machine B. So what are disadvantages of this design?
Let’s ignore cases where URL contains ASCII characters. A good sharding algorithm should be able to balance traffic equally to all machines. In other words, each machine should receive equal requests ideally. Apparently, the above design doesn’t work well. First of all, the storage is not distributed equally. Probably there are much more URLs starting with “a” than “z”. Secondly, some URLs are much more popular like Facebook and Google.
In order to balance the traffic, you’d better make sure that keys are distributed randomly. Another solution is to use the hash of URL, which usually have much better performance. To design a good sharding algorithm, you should fully understand the application and can estimate the bottleneck of the system.

System availability

To evaluate a distributed system, one key metric is system availability. For instance, suppose one of our machines crashes for some reason (maybe hardware issue or program bugs), how does this affect our key-value storage system?
Apparently, if someone requests resources from this machine, we won’t be able to return the correct response. You may not consider this issue when building a side project. However, if you are serving millions of users with tons of servers, this happens quite often and you can’t afford to manually restart the server every time. This is why availability is essential in every distributed system nowadays. So how would you address this issue?
Of course you can write more robust code with test cases. However, your program will always have bugs. In addition, hardware issues are even harder to protect. The most common solution is replica. By setting machines with duplicate resources, we can significantly reduce the system downtime. If a single machine has 10% of chance to crash every month, then with a single backup machine, we reduce the probability to 1% when both are down.

Replica VS sharding

First of all, we need to be clear about the purpose of these two techniques. Sharding is basically used to splitting data to multiple machines since a single machine cannot store too much data. Replica is a way to protect the system from downtime. With that in mind, if a single machine can’t store all the data, replica won’t help.


By introducing replicas, we can make the system more robust. However, one issue is about consistency. Let’s say for machine A1, we have replica A2. How do you make sure that A1 and A2 have the same data? For instance, when inserting a new entry, we need to update both machines. But it’s possible that the write operation fails in one of them. So over time, A1 and A2 might have quite a lot inconsistent data, which is a big problem.
There are a couple of solutions here. First approach is to keep a local copy in coordinator. Whenever updating a resource, the coordinator will keep the copy of updated version. So in case the update fails, the coordinator is able to re-do the operation.
Another approach is commit log. If you have been using Git, the concept of commit log should be quite familiar to you. Basically, for each node machine, it’ll keep the commit log for each operation, which is like the history of all updates. So when we want to update an entry in machine A, it will first store this request in commit log. And then a separate program will process all the commit logs in order (in a queue). Whenever an operation fails, we can easily recover as we can lookup the commit log.
The last approach I’d like to introduce is to resolve conflict in read. Suppose when the requested resource locates in A1, A2 and A3, the coordinator can ask from all three machines. If by any chance the data is different, the system can resolve the conflict on the fly.
It’s worth to note that all of these approaches are not mutually exclusive. You can definitely use multiple ones based on the application.

Read throughput

I’d also like to briefly mention read throughput in this post. Usually, key-value storage system should be able to support a large amount of read requests. So what approaches will you use to improve read throughput?
To improve read throughput, the common approach is always taking advantage of memory. If the data is stored in disk inside each node machine, we can move part of them in memory. A more general idea is to use cache. Since the post – design a cache system has an in-depth analysis of this topic, I won’t talk about it too much here.
Define objects
Define identifiers
Analyze requirement for compound keys
Define the physical implementation
To have real data in the database we have to define and choose physical implementation of our objects. For this we have to know the database implementation what kind of options we have. Redis is not a plain key-value store, but a data structures server, supporting different kind of values. Most obvious choices are String – store one value, Hash – store multiple fields. In our example we will use hash for USER and string for CLICK.


Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts