Saturday, March 12, 2016

Redis Architecture



http://xzlearning.com/skills/2459
There are three different ways to make Redis data persistent: RDB, AOF, and the SAVE command.
RDB Mechanism
RDB makes a copy of all the data in memory and stores it in secondary (permanent) storage. This happens at a specified interval, so there is a chance that you will lose data written after RDB's last snapshot.
AOF
AOF logs all the write operations received by the server, so everything is persisted. The problem with AOF is that it writes to disk for every operation, which is expensive, and the AOF file is also larger than the RDB file.
SAVE Command
You can force the Redis server to create an RDB snapshot at any time using the SAVE command from the Redis console client.
You can use AOF and RDB together to get the best of both persistence mechanisms.
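As a sketch, a combined setup in redis.conf might look like this (the intervals shown are illustrative, not a recommendation):

```conf
# RDB: snapshot if at least 1 key changed in 900 s, 10 in 300 s, 10000 in 60 s
save 900 1
save 300 10
save 60 10000

# AOF: log every write; fsync once per second as a durability/throughput compromise
appendonly yes
appendfsync everysec
```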
http://redis.io/presentation/Redis_Cluster.pdf
All nodes are directly connected with a service channel.
TCP baseport+4000, example 6379 -> 10379.
Node to Node protocol is binary, optimized for bandwidth and speed.
Clients talk to nodes as usual, using the ASCII protocol, with minor additions.
Nodes don't proxy queries.

Hash slots
The keyspace is divided into 4096 hash slots.
Different nodes will hold a subset of hash slots.
Nodes are all connected and functionally equivalent, but
actually there are two kinds of nodes: master and slave nodes:

there are two replicas per master node, so
up to two random nodes can go down without issues.
Working with two nodes down is guaranteed, but in the best case the cluster will continue to work as long as there is at least one node for every hash slot.

Every key only exists in a single instance, plus N replicas that will never receive writes. So there is no merge, nor application-side inconsistency resolution.
The price to pay is not resisting net splits larger than replicas-per-hash-slot nodes going down.
Master and Slave nodes use the Redis Replication you already know.
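As a sketch of how a key is routed, the key-to-slot mapping can be reproduced in plain Ruby. The CRC16 (XMODEM) function below is what Redis Cluster uses; the 4096-slot count follows this presentation, while later versions of Redis Cluster use 16384.

```ruby
# CRC16, XMODEM variant (poly 0x1021, init 0, MSB-first), as used by Redis Cluster
def crc16(data)
  crc = 0
  data.each_byte do |b|
    crc ^= (b << 8)
    8.times do
      crc = (crc & 0x8000).zero? ? (crc << 1) : ((crc << 1) ^ 0x1021)
      crc &= 0xFFFF
    end
  end
  crc
end

# A key always lands in the same slot, so any node can compute the owner
def hash_slot(key, slots = 4096)
  crc16(key) % slots
end

hash_slot("foo")   # deterministic slot in 0...4096
```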

Every physical server will usually hold multiple nodes, both slaves and masters, but the redis-trib cluster manager program will try to allocate slaves and masters so that the replicas are on different physical servers.

Client requests - dummy client
1. Client => A: GET foo
2. A => Client: -MOVED 8 192.168.5.21:6391
3. Client => B: GET foo
4. B => Client: "bar"
-MOVED 8 ... this error means that hash slot 8 is located at the specified IP/port, and the client should reissue the query there.
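The dummy-client flow above can be sketched as follows. FakeNode and the addresses are made-up stand-ins for real connections, with only node B owning slot 8:

```ruby
# A fake connection: either answers the query or redirects with -MOVED
class FakeNode
  def initialize(owns_slot)
    @owns_slot = owns_slot
  end

  def get(_key)
    @owns_slot ? "bar" : "-MOVED 8 192.168.5.21:6391"
  end
end

# Dummy client: try any node, follow -MOVED redirects until answered
def cluster_get(nodes, key)
  node = nodes.values.first
  loop do
    reply = node.get(key)
    return reply unless reply.start_with?("-MOVED")
    addr = reply.split.last        # "ip:port" of the slot's owner
    node = nodes.fetch(addr)
  end
end

nodes = {
  "192.168.5.20:6391" => FakeNode.new(false),  # node A
  "192.168.5.21:6391" => FakeNode.new(true),   # node B, owns slot 8
}
cluster_get(nodes, "foo")   # redirected from A to B, returns "bar"
```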

Client requests - smart client
1. Client => A: CLUSTER HINTS
2. A => Client: ... a map of hash slots -> nodes
3. Client => B: GET foo
4. B => Client: "bar"

Client requests
Dummy, single-connection clients will work with minimal modifications to the existing client code base: just try a random node from a list, then reissue the query if needed.

Smart clients will take persistent connections to many nodes, will cache hashslot -> node info, and will update the table when they receive a -MOVED error.

This schema is always horizontally scalable, and low latency if the clients are smart.
Especially in large clusters where clients will try to have many persistent connections to multiple nodes, the Redis client object should be shared.

Re-sharding - moving data
All the new keys for slot 7 will be created / updated in D.
All the old keys in C will be moved to D by redis-trib using the MIGRATE command.
MIGRATE is an atomic command, it will transfer a key from C to D, and will remove the key in C when we get the OK from D. So no race is possible.
p.s. MIGRATE is an exported command
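For illustration, a single key move issued during resharding could look like the following redis-cli invocation (host, port, and key name are made up; the arguments are MIGRATE host port key destination-db timeout-in-ms):

```
redis 127.0.0.1:6379> MIGRATE 192.168.5.22 6391 user:1000 0 5000
```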

Re-sharding with failing nodes
Nodes can fail while resharding. Slave promotion happens as usual.
The redis-trib utility is executed by the sysadmin. It will exit and warn when something is not OK, as it checks the cluster config continuously while resharding.

Fault tolerance
All nodes continuously ping other nodes...
A node marks another node as possibly failing when there is a timeout longer than N seconds.
Every PING and PONG packet contain a gossip section: information about other nodes idle times, from the point of view of the sending node.

Fault tolerance - failing nodes
A guesses B is failing, as the latest PING request timed out.
A will not take any action without any other hint.
C sends a PONG to A, with the gossip section containing information about B: C also thinks B is failing.
At this point A marks B as failed, and notifies the information to all the other nodes in the cluster, that will mark the node as failing.
If B ever comes back, the first time it pings any node of the cluster it will be told to shut down ASAP, since an intermittently available node is not good for clients.

Redis-trib - the Redis Cluster Manager
It is used to setup a new cluster, once you start N blank nodes.
It is used to check whether the cluster is consistent, and to fix it if the cluster can't continue because there are hash slots without a single node.
It is used to add new nodes to the cluster, either as slaves of an already existing master node, or as blank nodes where we can re-shard a few hash slots to lower other nodes load.

Ping/Pong packets contain enough information for the cluster to restart after a graceful stop. But the sysadmin can use the CLUSTER MEET command to make sure nodes re-join if an IP changed and so forth.
Every node has a unique ID, and a cluster config file. Every time the config changes the cluster config file is saved.
The cluster config file can't be edited by humans.
The node ID never changes for a given node.
http://engineering.bloomreach.com/the-evolution-of-fault-tolerant-redis-cluster/
http://www.yzuzun.com/2015/04/some-architectural-design-concepts-for-redis/


Redis list as queue
https://product.reverb.com/a-simple-priority-queue-with-redis-in-ruby-7e3ec780f237#.uk7zn5x41
For our Redis data structure, we want to use a sorted set. A sorted set is essentially like a set, but with a score.
  1. When members are listed, members with lower scores are returned first.
  2. When adding a member that already exists, we want to keep the one with the more important priority. In other words, if "foo" is put in the set with a score of -1 and later I try to put "foo" in the set with a score of +1, I want to keep the score at -1 (lower scores are returned first). By default, Redis will just change the score, but we do not want that.
  3. We should be able to atomically pop members out of the Redis set.

class RedisPriorityQueue

  def initialize(key, redis: Reverb.redis_instance)
    @key = key
    @redis = redis
  end

  # Add member unless it already exists with a more important (lower) score
  def add(member, score)
    member_score = score(member)
    if member_score.nil? || member_score > score
      @redis.zadd(@key, score, member)
    end
  end

  # Atomically pop the member with the lowest score, or return nil if empty
  def pop
    @redis.watch(@key) do
      members(end_index: 0).first.tap do |member|
        if !member.nil?
          # EXEC aborts if the watched key changed after WATCH
          @redis.multi do
            remove(member)
          end
        else
          @redis.unwatch
        end
      end
    end
  end

  def remove(member)
    @redis.zrem(@key, member)
  end

  def size
    @redis.zcard(@key)
  end

  def clear!
    @redis.del(@key)
  end

  def members(start_index: 0, end_index: -1)
    @redis.zrange(@key, start_index, end_index) || []
  end

  private

  def score(member)
    @redis.zscore(@key, member)
  end
end
Here, we use WATCH to guarantee atomicity and to avoid the race condition of another client removing the element right after we read it. WATCH essentially means that if the watched key changes while we are watching it (something is added to or removed from it), the transaction (anything in the multi block) is aborted. This guarantees atomicity.

http://stackoverflow.com/questions/27555663/java-can-i-use-redis-db-to-create-priority-queues-and-the-priority-is-set-accor
It should be possible to implement this via Redis sorted sets and Lua scripts. Sorted sets are sets of elements ordered by a score, and the score can be updated dynamically. One way to update the score of a sorted-set element is to increment it via ZINCRBY, which corresponds to what you are trying to do AFAIK. The element will automatically move to the right place according to the incremented score. Then, using the other sorted-set commands, you can simulate popping from the sorted set by consuming the element, if that is the semantics you want.
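As a pure-Ruby sketch of that idea (a Hash stands in for the Redis sorted set, so the ZINCRBY and ZREVRANGE/ZREM calls are simulated, not real Redis commands):

```ruby
# Simulates a sorted set used as a priority queue with incrementable scores
class ScoredQueue
  def initialize
    @scores = Hash.new(0)
  end

  # ZINCRBY: bump a member's priority; reordering is implicit
  def incr(member, by = 1)
    @scores[member] += by
  end

  # ZREVRANGE 0 0 followed by ZREM, as a single step: consume the top element
  def pop_highest
    return nil if @scores.empty?
    member, _score = @scores.max_by { |_m, s| s }
    @scores.delete(member)
    member
  end
end

q = ScoredQueue.new
q.incr("low")
q.incr("urgent", 3)
q.pop_highest   # => "urgent"
```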
http://redis.io/commands/ping
http://www.cnblogs.com/liuhao/archive/2012/06/26/2563702.html
Redis Sorted Sets are, similarly to Redis Sets, non-repeating collections of Strings. The difference is that every member of a Sorted Set is associated with a score, which is used to keep the sorted set ordered, from the smallest to the greatest score. While members are unique, scores may be repeated.
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
  • Keep a leaderboard in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE; you can also, given a user name, return their rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
  • Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
  • Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
Three sorted-set commands are relevant here:
1. ZADD key score member [score member ...]
Adds elements to the set with O(log(N)) complexity. For example:
redis 127.0.0.1:6379> ZADD "www.baidu.com" 1 "first_page" 2 "second_page" 3 "third_page" 3 "another_page"
(integer) 4
2. ZREVRANGE key start stop [WITHSCORES]
Fetches elements with O(log(N)+M) complexity, where N is the number of elements in the set and M is the number of elements returned. With WITHSCORES, the corresponding scores are returned as well. In a priority queue we take only the single highest-priority element:
redis 127.0.0.1:6379> ZREVRANGE "www.baidu.com" 0 0
1) "third_page"
redis 127.0.0.1:6379> ZREVRANGE "www.baidu.com" 0 0 WITHSCORES
1) "third_page"
2) "3"
3. ZREM key member [member ...]
Removes a specific element from the sorted set with O(log(N)) complexity. The member here can simply be the value returned by ZREVRANGE:
redis 127.0.0.1:6379> ZREM "www.baidu.com" "third_page"
(integer) 1

With this, an efficient priority queue (O(log N) complexity) is ready to use.
A priority queue built as above is non-blocking, so when the sorted set is empty the caller has to keep polling, which is very inconvenient for users. Redis does not provide a blocking version of ZREVRANGE, but blocking semantics for the priority queue can be implemented with the blpop command.
The consumer looks like this:
LOOP forever
    WHILE ZREVRANGE(key,0,0) returns elements
        ... process elements ...
        ZREM(key, elements)
    END
    BRPOP helper_key
END
The producer looks like this:
MULTI
ZADD key element
LPUSH helper_key x
EXEC

Implementing a timestamp-ordered priority queue with PriorityQueue and heapq
http://skipperkongen.dk/2013/08/27/how-many-requests-per-second-can-i-get-out-of-redis/
redis-server --port 7777
To test (set and get):
redis-benchmark -p 7777 -t set,get -q
Redis is a single-threaded server.
Using a single-node instance of Redis running on my laptop, I managed to get 300K requests per second (both GET and SET). This is achieved only when using pipelining (100 commands at a time). On a high-end machine someone got 700K GET requests per second using pipelining, i.e. a bit more than twice the throughput.
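The pipelined run can be reproduced with redis-benchmark's -P option, which sends that many commands per pipeline (port 7777 matches the server started above):

```shell
redis-benchmark -p 7777 -t set,get -q -P 100
```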

https://azure.microsoft.com/en-us/documentation/articles/cache-faq/
  • Throughput for caches of the same size is higher in the Premium tier than in the Standard tier. For example, with a 6 GB cache, the throughput of P1 is 140K RPS compared to 49K for C3.
  • With Redis clustering, throughput increases linearly as you increase the number of shards (nodes) in the cluster. For example, if you create a P4 cluster of 10 shards, then the available throughput is 250K *10 = 2.5 Million RPS.
Standard tier
Cache  Size   CPU cores  Bandwidth (Mbit/s / MB/s)  Requests per second (RPS)
C3     6 GB   4          400 / 50                   49,000
C4     13 GB  2          500 / 62.5                 61,000
C5     26 GB  4          1000 / 125                 115,000
C6     53 GB  8          2000 / 250                 150,000
Premium tier (per shard)
Cache  Size   CPU cores  Bandwidth (Mbit/s / MB/s)  Requests per second (RPS)
P1     6 GB   2          1000 / 125                 140,000
P2     13 GB  4          2000 / 250                 220,000
P3     26 GB  4          2000 / 250                 220,000
P4     53 GB  8          4000 / 500                 250,000

https://www.quora.com/Why-isnt-Redis-designed-to-benefit-from-multi-threading
  • CPU is not the bottleneck - usually the network is. CPUs are very fast; if the application is designed right, i.e. avoiding blocking I/O, threading will be near the bottom of the list of things to worry about.
To me, when Redis says it is single-threaded, it is saying it was designed to prevent any lock contention, and that is why its memory access is blazing fast.

Redis is single-threaded with epoll/kqueue and scales very well in terms of I/O concurrency.

In server-side software, concurrency and parallelism are often considered as different concepts. In a server, supporting concurrent I/Os means the server is able to serve several clients by executing several flows corresponding to those clients with only one computation unit. In this context, parallelism would mean the server is able to perform several things at the same time (with multiple computation units), which is different.
A single-threaded program can definitely provide concurrency at the I/O level by using an I/O (de)multiplexing mechanism and an event loop (which is what Redis does).
Parallelism has a cost: with the multiple sockets/multiple cores you can find on modern hardware, synchronization between threads is extremely expensive. On the other hand, the bottleneck of an efficient storage engine like Redis is very often the network, well before the CPU. Isolated event loops (which require no synchronization) are therefore seen as a good design to build efficient, scalable, servers.
The fact that Redis operations are atomic is simply a consequence of the single-threaded event loop. The interesting point is atomicity is provided at no extra cost (it does not require synchronization). It can be exploited by the user to implement optimistic locking and other patterns without paying for the synchronization overhead.
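A minimal illustration of that single-threaded event-loop idea in Ruby, with two pipes standing in for two client sockets and IO.select as the (de)multiplexer:

```ruby
# Two "clients" connected by pipes; one thread serves both via IO.select
r1, w1 = IO.pipe
r2, w2 = IO.pipe
readers = [r1, r2]
w1.write("PING")
w2.write("PING")

replies = []
until readers.empty?
  ready, = IO.select(readers)      # block until at least one client is readable
  ready.each do |io|
    io.read_nonblock(16)           # serve this client; no locks needed
    replies << "+PONG"
    readers.delete(io)
  end
end
replies.size   # => 2, both clients served by a single thread
```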
https://medium.com/smyte/rate-limiter-df3408325846
While Redis is great for quickly taking a prototype to production, it has a lot of downsides:
  • It’s hard to scan over subsets of the keyspace.
  • No built-in key compression. For applications that have many keys that share the same prefix and relatively small values (like our rate limiter), this results in much higher storage requirements.
  • Redis in production isn’t easy. We decided not to trust the new sentinel mode just yet, and the standard leader-follower replication is far from failsafe.
  • BGSAVE is problematic. It is required for leader-follower replication and uses the fork() syscall which then operates in copy-on-write memory. For write-heavy workloads such as rate-limiting, every write likely will cause that entire page to be copied for the BGSAVE process. This means the operation might not complete unless you have double the amount of memory available.
