https://www.slideshare.net/iammutex/redis-cluster
Every physical server will usually hold multiple nodes, both slaves and masters, but the redis-trib cluster manager program will try to allocate slaves and masters so that the replicas are in different physical servers.
Client requests - smart client
1. Client => A: CLUSTER HINTS
2. A => Client: ... a map of hash slots -> nodes
3. Client => B: GET foo
4. B => Client: "bar"
Especially in large clusters where clients will try to have
many persistent connections to multiple nodes, the Redis
client object should be shared.
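The CLUSTER HINTS name on the slide was an early design; in released Redis the same slot map is obtained with CLUSTER SLOTS (or CLUSTER NODES). A minimal sketch of the bootstrap step, assuming the redis-py package and a cluster node on 127.0.0.1:7000:

```python
# Fetch and cache the hash-slot -> node map, as a smart client would on startup.
import redis

def fetch_slot_map(host="127.0.0.1", port=7000):
    """Return a dict mapping (start_slot, end_slot) -> (master_ip, master_port)."""
    conn = redis.Redis(host=host, port=port, decode_responses=True)
    slot_map = {}
    # CLUSTER SLOTS replies with entries like [start, end, [ip, port, id], ...replicas].
    for start, end, master, *replicas in conn.execute_command("CLUSTER", "SLOTS"):
        slot_map[(start, end)] = (master[0], master[1])
    return slot_map

def node_for_slot(slot_map, slot):
    for (start, end), node in slot_map.items():
        if start <= slot <= end:
            return node
    return None
```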
Re-sharding
We are experiencing too much load. Let's add a new server.
Node C marks its slot 7 as "MOVING to D"
Every time C receives a request about slot 7, if the key is
actually in C, it replies, otherwise it replies with -ASK D
-ASK is like -MOVED, but the difference is that the client
should retry against D only for this query, not for subsequent queries.
That means: smart clients should not update their internal slot map.
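A sketch of how a smart client might honor the two redirects; send() and RedirectError are hypothetical stand-ins for "issue this command to this node" and "the node answered -MOVED or -ASK":

```python
class RedirectError(Exception):
    """Hypothetical error carrying a reply such as 'MOVED 3999 host:port' or 'ASK 3999 host:port'."""

def run_command(slot_map, send, slot, *args):
    """slot_map: cached dict slot -> "host:port"; send(node, *args): issue a command to a node."""
    node = slot_map[slot]
    try:
        return send(node, *args)
    except RedirectError as err:
        kind, _slot, target = str(err).split()
        if kind == "MOVED":
            slot_map[slot] = target      # permanent move: update the cached map
            return send(target, *args)
        if kind == "ASK":
            send(target, "ASKING")       # one-shot redirect: leave the cached map alone
            return send(target, *args)
        raise
```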
All the new keys for slot 7 will be created / updated in D.
All the old keys in C will be moved to D by redis-trib using
the MIGRATE command.
MIGRATE is an atomic command, it will transfer a key from
C to D, and will remove the key in C when we get the OK
from D. So no race is possible.
p.s. MIGRATE is an exported command. Have fun...
Open problem: ask C the next key in hash slot N, efficiently.
Re-sharding with failing nodes
Nodes can fail while resharding. It's slave promotion as usual.
The redis-trib utility is executed by the sysadmin. It checks the cluster config continuously while resharding, and will exit and warn when something is not ok.
Fault tolerance
All nodes continuously ping other nodes...
A node marks another node as possibly failing when there is a timeout longer than N seconds.
Every PING and PONG packet contains a gossip section: information about other nodes' idle times, from the point of view of the sending node.
Fault tolerance - failing nodes
A guesses B is failing, as the latest PING request timed out.
A will not take any action without any other hint.
C sends a PONG to A, with the gossip section containing
information about B: C also thinks B is failing.
At this point A marks B as failed, and notifies the
information to all the other nodes in the cluster, that will
mark the node as failing.
If B ever comes back, the first time it pings any node of
the cluster it will be told to shut down ASAP, as a node that
appears intermittently is not good for the clients.
The only way to rejoin a Redis cluster after a massive crash
is to run redis-trib by hand.
Redis-trib - the Redis Cluster Manager
It is used to set up a new cluster, once you start N blank nodes.
It is used to check whether the cluster is consistent, and to fix it when the cluster can't continue because there are hash slots not served by any node.
It is used to add new nodes to the cluster, either as slaves of an already existing master node, or as blank nodes to which we can re-shard a few hash slots to lower the load on other nodes.
Ping/Pong packets contain enough information for the cluster to restart after a graceful stop, but the sysadmin can use the CLUSTER MEET command to make sure nodes will engage if the IP changed and so forth.
Every node has a unique ID, and a cluster config file.
Every time the config changes, the cluster config file is saved.
The cluster config file can't be edited by humans.
The node ID never changes for a given node.
https://www.slideshare.net/NoSQLmatters/no-sql-matters-bcn-2014
Sharding and replication (asynchronous).
Asynchronous replication
async ACK
Full Mesh
• Heartbeats.
• Nodes gossip.
• Failover auth.
• Config update.
No proxy, but redirections
Failure detection
• Failure reports within window of time (via gossip).
• Trigger for actual failover.
• Two main states: PFAIL -> FAIL.
PFAIL state propagates
Global slots config
• A master FAIL state triggers a failover.
• Cluster needs a coherent view of configuration.
• Who is serving this slot currently?
• Slots config must eventually converge.
Raft and failover
• Config propagation is solved using ideas from the Raft algorithm (just a subset).
• Why don’t we need full Raft?
• Because our config is idempotent: when the
partition heals we can throw away the old slots config in favor of newer versions.
• Same algorithm is used in Sentinel v2 and works well.
Config propagation
• After a successful failover, new slot config is
broadcasted.
• If there are partitions, when they heal, config will
get updated (broadcasted from time to time, plus
stale config detection and UPDATE messages).
• Config with greater Epoch always wins.
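A minimal sketch of the "greater epoch wins" rule, assuming a per-slot table of (owner, configEpoch) claims:

```python
# A node keeps, for each hash slot, the claimed owner and the configEpoch of
# that claim, and only accepts updates that announce a greater epoch.
slots = {}  # slot -> (owner_node_id, config_epoch)

def on_slot_claim(slot, owner, epoch):
    current = slots.get(slot)
    if current is None or epoch > current[1]:
        slots[slot] = (owner, epoch)   # newer config wins
        return True
    return False                       # stale claim: the sender should receive an UPDATE

on_slot_claim(7, "master-A", 1)
on_slot_claim(7, "master-B", 2)   # accepted: a failover bumped the epoch
on_slot_claim(7, "master-A", 1)   # rejected: stale configuration
```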
Redis Cluster consistency?
• Eventually consistent: last failover wins.
• In the “vanilla” design, the window of lost writes is unbounded.
• Mechanisms to avoid unbounded data loss.
http://blog.csdn.net/r_p_j/article/details/78813265
The whole flow is very similar to Sentinel; in other words, Redis Cluster is powerful and directly integrates the functionality of replication and Sentinel.
http://redis.io/topics/cluster-spec
http://blog.houzz.com/post/162981718143/migration-to-redis-cluster
The large memory footprints are problematic to operations such as restart and master-slave synchronization. It can take more than 30 minutes for a large shard to restart or to do a full master-slave sync.
One option we considered was Redis Cluster. It was released by the Redis community on April 1, 2015. It automatically shards data across multiple servers based on hashes of keys. The server selection for each query is done in the client libraries. If the contacted server does not have the queried shard, the client will be redirected to the right server.
There are several advantages with Redis Cluster. It is well documented and well integrated with Redis core. It does not require an additional server between clients and Redis servers, hence has a lower capacity requirement and a lower operational cost. It does not have a single point of failure. It has the ability to continue read/write operations when a subset of the servers are down. It supports multi-key queries as long as all the keys are served by the same server. Multiple keys can be forced to the same shard with “hash tags”, i.e., sub-key hashing. It has built-in master-slave replication.
As mentioned above, Redis Cluster does not support NAT’ed environments and in general environments where IP addresses or TCP ports are remapped. This limitation makes it incompatible with our existing settings, in which we use Redis Sentinel to do automatic failover, and the clients access Redis through HAProxy. HAProxy provides two functions in this case: It does health checks on the Redis servers so that the client will not access unresponsive or otherwise faulty servers. It also detects the failover that is triggered by Redis Sentinel, so that write requests will be routed to the latest masters. Although Redis Cluster has built-in replication, as we discovered later, few client libraries, if any, have support for it. The open source client libraries we use, e.g., Predis and Jedis, would ignore the slaves in the cluster and send all requests to the masters.
The other option we evaluated was Twemproxy. Twitter developed and launched Twemproxy before Redis Cluster was available. Like Redis Cluster, Twemproxy automatically shards data across multiple servers based on hashes of keys. The clients send queries to the proxy as if it is a single Redis server that owns all the data. The proxy then relays the query to the Redis server that has the shard, and relays the response back to the client.
Like Redis Cluster, there is no single point of failure in Twemproxy if multiple proxies are running for redundancy. Twemproxy also has an option to enable/disable server ejection, which can mask individual server failures when Redis is used as a cache vs. a data store.
One disadvantage of Twemproxy is that it adds an extra hop between clients and Redis servers, which may add up to 20% latency according to prior studies. It also has extra capacity requirement and operational cost for monitoring the proxies. It does not support multi-key queries. It may not be well integrated with Redis Sentinel.
(1) What is a smart Jedis client?
A purely redirection-based client wastes a lot of network I/O, because in most cases a request needs one redirection before the right node is found.
So most clients, such as the Java Redis client Jedis, are smart:
they maintain a local cached map of hash slot -> node, so in most cases the hash slot -> node lookup is served from the local cache, with no MOVED redirection from the server.
3.1 Deciding that a node is down
If one node believes another node is down, that is pfail, a subjective down.
If multiple nodes believe a node is down, that is fail, an objective down; the principle is almost identical to Sentinel's sdown/odown.
If a node does not return a pong within cluster-node-timeout, it is considered pfail.
If a node considers another node pfail, it spreads that in its gossip ping messages to other nodes; if more than half of the nodes also consider it pfail, the state becomes fail.
3.2 Filtering slave nodes
For a master node that is down, one of its slave nodes is chosen to switch into the new master.
Check how long each slave node has been disconnected from the master; if it exceeds cluster-node-timeout * cluster-slave-validity-factor, that slave is not eligible to become master.
3.3 Slave election
Sentinel: ranks all slaves by slave priority, replication offset, and run id.
Each slave sets an election time based on its replication offset from the master; the larger the offset (the more data replicated), the earlier its election time, so it is elected with higher priority.
All master nodes then vote in the slave election; if a majority of masters (N/2 + 1) vote for a given slave, the election passes and that slave can switch to master.
The slave then performs the failover and becomes the new master.
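The Redis Cluster specification makes the "larger offset goes first" rule concrete with a delay formula; a sketch assuming that formula (DELAY = 500 ms + random 0-500 ms + SLAVE_RANK * 1000 ms):

```python
# Rank 0 is the replica with the largest replication offset; it starts its
# election first, so the most up-to-date replica usually wins.
import random

def election_delay_ms(my_offset, all_offsets):
    rank = sum(1 for off in all_offsets if off > my_offset)
    return 500 + random.randint(0, 500) + rank * 1000

print(election_delay_ms(2000, [2000, 1500, 900]))  # rank 0 -> about 500-1000 ms
print(election_delay_ms(900,  [2000, 1500, 900]))  # rank 2 -> about 2500-3000 ms
```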
http://download.redis.io/redis-stable/sentinel.conf
https://seanmcgary.com/posts/how-to-build-a-fault-tolerant-redis-cluster-with-sentinel
sentinel monitor redis-cluster 127.0.0.1 6380 2
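For context, a minimal sentinel.conf sketch built around the line above; the master-group name, address, and quorum come from that line, while the timing values are only illustrative defaults:

```
port 26379
sentinel monitor redis-cluster 127.0.0.1 6380 2
sentinel down-after-milliseconds redis-cluster 5000
sentinel failover-timeout redis-cluster 60000
sentinel parallel-syncs redis-cluster 1
```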
https://redis.io/topics/replication
Redis replication is a very simple to use and configure master-slave replication that allows slave Redis servers to be exact copies of master servers.
- Replication can be used both for scalability, in order to have multiple slaves for read-only queries (for example, slow O(N) operations can be offloaded to slaves), or simply for data redundancy.
If you set up a slave, upon connection it sends a PSYNC command.
If this is a reconnection and the master has enough backlog, only the difference (what the slave missed) is sent. Otherwise what is called a full resynchronization is triggered.
When a full resynchronization is triggered, the master starts a background saving process in order to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the master transfers the database file to the slave, which saves it on disk, and then loads it into memory. The master will then send all buffered commands to the slave. This is done as a stream of commands and is in the same format of the Redis protocol itself.
You can try it yourself via telnet. Connect to the Redis port while the server is doing some work and issue the SYNC command. You'll see a bulk transfer and then every command received by the master will be re-issued in the telnet session.
Slaves are able to automatically reconnect when the master-slave link goes down for some reason. If the master receives multiple concurrent slave synchronization requests, it performs a single background save in order to serve all of them.
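A small sketch of setting up and checking replication, assuming redis-py and two local instances on ports 6379 (master) and 6380 (replica):

```python
import time
import redis

master = redis.Redis(port=6379)
replica = redis.Redis(port=6380)

replica.slaveof("127.0.0.1", 6379)   # triggers PSYNC / a full resynchronization
master.set("foo", "bar")

time.sleep(1)                                 # give the initial sync a moment
print(replica.get("foo"))                     # b'bar' once the replica is in sync
print(replica.info("replication")["role"])    # 'slave'
```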
https://redis.io/topics/cluster-tutorial
Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.
Redis Cluster also provides some degree of availability during partitions, that is, in practical terms, the ability to continue operations when some nodes fail or are not able to communicate. However, the cluster stops operating in the event of larger failures (for example when the majority of masters are unavailable).
So in practical terms, what do you get with Redis Cluster?
- The ability to automatically split your dataset among multiple nodes.
- The ability to continue operations when a subset of the nodes are experiencing failures or are unable to communicate with the rest of the cluster.
Partitioning is the process of splitting your data into multiple Redis instances, so that every instance will only contain a subset of your keys.
- It allows for much larger databases, using the sum of the memory of many computers. Without partitioning you are limited to the amount of memory a single computer can support.
- It allows scaling the computational power to multiple cores and multiple computers, and the network bandwidth to multiple computers and network adapters.
One of the simplest ways to perform partitioning is with range partitioning, and is accomplished by mapping ranges of objects into specific Redis instances. For example, I could say users from ID 0 to ID 10000 will go into instance R0, while users from ID 10001 to ID 20000 will go into instance R1 and so forth.
This system works and is actually used in practice, however, it has the disadvantage of requiring a table that maps ranges to instances. This table needs to be managed and a table is needed for every kind of object, so therefore range partitioning in Redis is often undesirable because it is much more inefficient than other alternative partitioning approaches.
An alternative to range partitioning is hash partitioning. This scheme works with any key, without requiring a key in the form object_name:<id>, and is as simple as: take the key name, turn it into a number with a hash function (for example crc32), and take that number modulo the number of Redis instances to pick the instance.
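A minimal sketch of that scheme, using CRC32 as the hash function and four hypothetical instance names:

```python
import zlib

instances = ["R0", "R1", "R2", "R3"]   # stand-ins for four Redis connections

def instance_for(key: str) -> str:
    # Hash the key, then take the result modulo the number of instances.
    return instances[zlib.crc32(key.encode()) % len(instances)]

print(instance_for("user:1000"))   # the same key always maps to the same instance
```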
One advanced form of hash partitioning is called consistent hashing and is implemented by a few Redis clients and proxies.
- Client side partitioning means that the clients directly select the right node where to write or read a given key. Many Redis clients implement client side partitioning.
- Proxy assisted partitioning means that our clients send requests to a proxy that is able to speak the Redis protocol, instead of sending requests directly to the right Redis instance. The proxy will make sure to forward our request to the right Redis instance accordingly to the configured partitioning schema, and will send the replies back to the client. The Redis and Memcached proxy Twemproxy implements proxy assisted partitioning.
- Query routing means that you can send your query to a random instance, and the instance will make sure to forward your query to the right node. Redis Cluster implements a hybrid form of query routing, with the help of the client (the request is not directly forwarded from a Redis instance to another, but the client gets redirected to the right node).
- Adding and removing capacity can be complex. For instance Redis Cluster supports mostly transparent rebalancing of data with the ability to add and remove nodes at runtime, but other systems like client side partitioning and proxies don't support this feature. However a technique called Pre-sharding helps in this regard.
Although partitioning in Redis is conceptually the same whether using Redis as a data store or as a cache, there is a significant limitation when using it as a data store. When Redis is used as a data store, a given key must always map to the same Redis instance. When Redis is used as a cache, if a given node is unavailable it is not a big problem if a different node is used, altering the key-instance map as we wish to improve the availability of the system (that is, the ability of the system to reply to our queries).
Consistent hashing implementations are often able to switch to other nodes if the preferred node for a given key is not available. Similarly if you add a new node, part of the new keys will start to be stored on the new node.
The main concept here is the following:
- If Redis is used as a cache, scaling up and down using consistent hashing is easy.
- If Redis is used as a store, a fixed keys-to-nodes map is used, so the number of nodes must be fixed and cannot vary. Otherwise, a system is needed that is able to rebalance keys between nodes when nodes are added or removed, and currently only Redis Cluster is able to do this.
Since Redis has an extremely small footprint and is lightweight (a spare instance uses 1 MB of memory), a simple approach to this problem is to start with a lot of instances from the start. Even if you start with just one server, you can decide to live in a distributed world since your first day, and run multiple Redis instances in your single server, using partitioning.
And you can select this number of instances to be quite big from the start. For example, 32 or 64 instances could do the trick for most users, and will provide enough room for growth.
In this way as your data storage needs increase and you need more Redis servers, what to do is to simply move instances from one server to another. Once you add the first additional server, you will need to move half of the Redis instances from the first server to the second, and so forth.
Using Redis replication you will likely be able to do the move with minimal or no downtime for your users (a sketch follows this list):
- Start empty instances in your new server.
- Move data by configuring these new instances as slaves of your source instances.
- Stop your clients.
- Update the configuration of the moved instances with the new server IP address.
- Send the SLAVEOF NO ONE command to the slaves in the new server.
- Restart your clients with the new updated configuration.
- Finally shut down the no longer used instances in the old server.
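A sketch of the move for a single instance, assuming redis-py and the hypothetical addresses old-host:6379 (source) and new-host:6379 (target):

```python
import redis

source = redis.Redis(host="old-host", port=6379)
target = redis.Redis(host="new-host", port=6379)

# Steps 1-2: the empty target instance replicates the source until it is in sync.
target.slaveof("old-host", 6379)

# Steps 3-5: after stopping the clients and repointing their configuration,
# promote the target; redis-py sends SLAVEOF NO ONE when slaveof() gets no args.
target.slaveof()

# Steps 6-7: restart the clients against new-host, then retire the old instance.
source.shutdown()
```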
Twemproxy is a proxy developed at Twitter for the Memcached ASCII and the Redis protocol. It is single threaded, it is written in C, and is extremely fast. It is open source software released under the terms of the Apache 2.0 license.
Twemproxy supports automatic partitioning among multiple Redis instances, with optional node ejection if a node is not available (this will change the keys-instances map, so you should use this feature only if you are using Redis as a cache).
It is not a single point of failure since you can start multiple proxies and instruct your clients to connect to the first that accepts the connection.
Basically Twemproxy is an intermediate layer between clients and Redis instances, that will reliably handle partitioning for us with minimal additional complexities.
http://www.zenlife.tk/redis-cluster.md
Main design goals and features
Redis Cluster goals
- High performance, linearly scalable to around 1000 nodes. No proxy.
- Write guarantees: best effort. This means that even if a write was acknowledged to the client, the write may actually be lost.
- Availability: "the majority of master nodes are reachable, and every unreachable master node has at least one reachable slave".
Implemented subset
All single-key commands are available. Complex multi-key operations are also implemented, as long as all the keys belong to the same node.
The concept of hash tags is implemented, which can force specific keys to be stored on the same node. During a manual reshard, however, multi-key operations may be unavailable, while single-key operations are always available.
Redis Cluster does not support multiple databases, and the SELECT command is not allowed.
Roles of clients and servers in the cluster protocol
Cluster nodes are responsible for storing data and for exchanging cluster state, including mapping keys to the right nodes. They automatically discover other nodes, detect non-working nodes, and promote a slave to master when necessary.
Cluster nodes communicate over TCP with a binary protocol, the Redis Cluster Bus. Every node connects to every other node. A gossip protocol handles node discovery, pings, notification of special conditions, Pub/Sub propagation, and manual failover.
There is no proxying, so nodes return MOVED and -ASK messages. A client that knows nothing about the cluster state can still work.
Write safety
Because replicas are written asynchronously, after a master dies the final state is whatever data the promoted slave holds. This means that writes within a window of time may be lost.
Two examples of lost writes: 1. A master receives a write from a client and acknowledges it, but dies before the write has been propagated to its slave. The slave is promoted to master and the write is lost. 2. A master becomes temporarily unreachable and a slave is promoted to master. The old master becomes reachable again (but has not yet switched to being a slave), and clients whose routing has not been updated keep writing to it. Those writes are lost.
To detect that a master is down, more than half of the masters must find it unreachable, and for at least NODE_TIMEOUT.
Availability
Suppose the cluster is split by a partition. The minority side becomes unavailable. On the majority side, as long as "the majority of master nodes are reachable, and every unreachable master node has at least one reachable slave", the cluster is available again after NODE_TIMEOUT plus the time needed to elect a new master.
Performance
Roughly N times a single instance!
Why there is no merge operation
The Redis Cluster design does not keep data versions, mainly because values are usually large (for example lists or sets) and the data types are complex. Put bluntly, versioning would be hard to do and the data would be hard to merge.
Strictly speaking this is not a technical limit; CRDTs could do it, but that would not fit the Redis Cluster design.
Overview of Redis Cluster components
Key distribution model
Keys are hashed into 16384 slots. Normally each slot is served by exactly one node (not counting slaves).
HASH_SLOT = CRC16(key) mod 16384
Key hash tags
There is one exception to the hash slot computation: hash tags. Hash tags are a way to ensure that multiple keys are assigned to the same hash slot, used to implement multi-key operations in the cluster.
If the key contains a "{...}" pattern, only the substring between { and } is hashed to compute the slot. The first occurrence of {} is used:
- For {user1000}, user1000 is used to compute the hash slot.
- For foo{}{bar}, the whole string is used, because the first {} is empty.
- For foo{{bar}}zap, {bar is used, because it is what lies inside the first {}.
- For foo{bar}{zap}, bar is used.
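A sketch of the slot computation above, including hash-tag extraction; Redis Cluster uses the XMODEM variant of CRC16 (polynomial 0x1021, initial value 0):

```python
def crc16(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: bytes) -> int:
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:            # non-empty tag: hash only the tag
            key = key[start + 1:end]
    return crc16(key) % 16384

print(hash_slot(b"{user1000}.following") == hash_slot(b"{user1000}.followers"))  # True
print(hash_slot(b"foo{}{bar}"))   # empty first tag, so the whole key is hashed
```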
Cluster node attributes
Every node in the cluster has a unique name (ID). It is read from /dev/urandom the first time and never changes afterwards, unless the config file is edited or the CLUSTER RESET command is used.
The node ID is what identifies a node inside the cluster. The IP may change, but the node ID does not; the cluster detects IP/port changes and reconfigures itself through the gossip protocol.
The CLUSTER NODES command prints information about the nodes of the current cluster.
Cluster bus
Every node in the cluster listens on an extra TCP port to accept connections from other Redis nodes. This port is the normal service port plus 10000: if the Redis port is 6379, the cluster bus port is 16379.
Cluster topology
Redis Cluster is a full mesh in which every node is connected to every other node. With N nodes, each node has N-1 outgoing TCP connections and N-1 incoming ones. The connections are kept alive rather than created on demand.
Although the cluster nodes form a full mesh, they use the gossip protocol and a configuration-update mechanism to avoid exchanging too many messages, so the number of messages does not explode.
Node handshake
Nodes always accept new connections on the cluster bus port. They reply to pings, but any other packet coming from a node that is not part of the cluster is discarded.
There are two ways a node will accept another node as part of the cluster:
- If a node sends a MEET message. A MEET message is very similar to a PING, but it forces the receiver to accept the sender as part of the cluster. Nodes only send MEET messages when the system administrator runs the following command:
CLUSTER MEET ip port
- If a node is already trusted through the gossip protocol. That is, if A knows B, and B knows C, eventually B will send A gossip messages about C. At that point A registers C as part of the network and tries to connect to C.
This mechanism guarantees that nodes eventually learn about each other, and prevents different Redis clusters from getting mixed up when IPs change or other events happen.
Redirection and resharding
MOVED redirection
A client is free to send requests to any node in the cluster, including slaves. The node analyzes the request; if the hash slot is served by this node it simply handles the query, otherwise it checks its slot map and returns a MOVED error, as shown below:
GET X
-MOVED 3999 127.0.0.1:6381
The client must resend the request to the indicated node. If the client waits a long time before resending and the cluster configuration changed in the meantime, the target node may reply with another MOVED. The same happens whenever a client contacts a node using stale information.
Although nodes are identified by ID inside the cluster, communication with the client simply uses IP:port.
Although it is not strictly required, the client should remember that slot 3999 is served by 127.0.0.1:6381, so that new commands are sent to the right node.
Another approach is for the client to run CLUSTER NODES or CLUSTER SLOTS when it receives a MOVED: when a redirection happens, many slots have probably changed, so refreshing the configuration as soon as possible is usually the best strategy.
A client must also be able to handle -ASK redirections, otherwise it cannot be considered a complete client.
Live cluster reconfiguration
Redis Cluster supports adding and removing nodes at runtime. In fact both are abstracted into the same operation: moving hash slots from one node to another.
The core is the ability to move hash slots around. A hash slot is really just a set of keys, so resharding simply means moving some keys from one instance to another.
Manual migration commands:
- CLUSTER ADDSLOTS slot1 [slot2] ... [slotN]
- CLUSTER DELSLOTS slot1 [slot2] ... [slotN]
- CLUSTER SETSLOT slot NODE node
- CLUSTER SETSLOT slot MIGRATING node
- CLUSTER SETSLOT slot IMPORTING node
The first two simply assign slots to a Redis node. ADDSLOTS is normally used when creating a new cluster; DELSLOTS is mainly used to manually modify the cluster configuration or for debugging, and is rarely used in practice.
The SETSLOT <slot> NODE form of the subcommand assigns a slot to a specific node ID. MIGRATING and IMPORTING are used to migrate a hash slot from one node to another (a sketch of a full slot migration follows the two points below).
- If a slot is in MIGRATING state, the node accepts queries for keys that still exist locally, otherwise it returns an -ASK redirection.
- If a slot is in IMPORTING state, the node accepts all queries, but only if they are preceded by the ASKING command; otherwise it returns a -MOVED redirection.
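A sketch of a complete single-slot migration using these commands plus CLUSTER GETKEYSINSLOT and MIGRATE; redis-py is assumed and the addresses are placeholders:

```python
import redis

SLOT = 7
src = redis.Redis(host="127.0.0.1", port=7000)   # current owner of the slot
dst = redis.Redis(host="127.0.0.1", port=7001)   # node that will receive it
src_id = src.execute_command("CLUSTER", "MYID")
dst_id = dst.execute_command("CLUSTER", "MYID")

# Mark the slot as importing on the target and migrating on the source.
dst.execute_command("CLUSTER", "SETSLOT", SLOT, "IMPORTING", src_id)
src.execute_command("CLUSTER", "SETSLOT", SLOT, "MIGRATING", dst_id)

# Move the keys in batches; MIGRATE transfers a key atomically.
while True:
    keys = src.execute_command("CLUSTER", "GETKEYSINSLOT", SLOT, 100)
    if not keys:
        break
    for key in keys:
        src.execute_command("MIGRATE", "127.0.0.1", 7001, key, 0, 5000)

# Finally assign the slot to the target (in a real cluster, on every master).
for node in (src, dst):
    node.execute_command("CLUSTER", "SETSLOT", SLOT, "NODE", dst_id)
```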
ASK redirection
Why ASK instead of MOVED? ASK means that only the next query should go to the specified node, while MOVED means that all subsequent queries for that slot go to the other node.
We want to constrain client behavior, which is why the IMPORTING node only accepts queries preceded by the ASKING command.
From the client's point of view, the full semantics of ASK redirection are:
- If an ASK redirection is received, send only the next request to the specified node, and keep querying the old node afterwards.
- Start the redirected query with the ASKING command.
- Do not update the local hash slot map yet.
Client first connection and redirection handling
Clients should try to be smart and remember the slot configuration. It does not have to be up to date, because contacting the wrong node simply results in a redirection.
A client usually needs to fetch a complete slot-to-node map in two situations:
- at startup;
- when a MOVED redirection is received.
The CLUSTER SLOTS command returns this information.
Multi-key operations
With hash tags, multi-key operations are easy to perform.
During a reshard, however, multi-key operations may become unavailable, because the data is split between two nodes while it is being migrated. In that case a -TRYAGAIN error is returned.
Scaling reads with slave nodes
Normally a slave redirects clients to its master, but the READONLY command allows reading from the slave.
In read-only mode, the slave only issues a redirection when its master does not own the slot in question.
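A tiny sketch, assuming redis-py and a replica listening on 127.0.0.1:7005:

```python
import redis

replica = redis.Redis(host="127.0.0.1", port=7005)
replica.execute_command("READONLY")    # allow reads from this replica on this connection
print(replica.get("foo"))              # served locally instead of redirecting to the master
replica.execute_command("READWRITE")   # restore the default redirecting behavior
```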
Fault tolerance
Node heartbeats and gossip messages
Cluster nodes continuously exchange ping and pong packets. The two message types have the same structure and both carry configuration information; in fact only the type field differs. Together they are called heartbeat packets.
Usually a ping is sent and a pong is returned, but a node can also send a pong directly: no reply is expected, and it is a way to broadcast configuration information as quickly as possible.
Pings are normally sent to a few randomly selected nodes, so the number of messages each node sends is constant regardless of cluster size. However, each node makes sure it pings every other node at least once within NODE_TIMEOUT/2.
Heartbeat packet content
Failure detection
If the majority of nodes cannot reach a given master or slave node, a slave is promoted to master; if that is not possible, the cluster enters an error state and stops accepting client requests.
Each node maintains a list of the nodes it knows about, with two flags used for failure detection: PFAIL and FAIL. PFAIL means possible failure, without confirmation. FAIL means the node is down and this has been confirmed by most of the masters.
The PFAIL flag
If a node cannot reach another node for more than NODE_TIMEOUT, it flags that node as PFAIL. Both masters and slaves can flag other nodes as PFAIL.
Unreachable means that a ping was sent and no reply arrived within NODE_TIMEOUT. To improve reliability in the normal case, a node tries to reconnect after half of NODE_TIMEOUT has elapsed; this keeps the connections alive and ensures that a broken connection does not lead to falsely reporting a node as down.
The FAIL flag
PFAIL is only local information and is not enough to trigger a slave promotion. A node is confirmed as down only when PFAIL is escalated to FAIL.
Node A has flagged B as PFAIL. Through gossip, A collects information about B from other nodes. If the majority of masters also flag B as PFAIL, then A marks B as FAIL and sends a FAIL message to every reachable node.
Becoming FAIL requires going through PFAIL first. The FAIL flag is cleared in the following cases:
- The node is reachable again and it is a slave: the FAIL flag is simply cleared, since slaves are not failed over.
- The node is reachable again and it is a master that serves no slots: the FAIL flag can also simply be cleared.
- The node is reachable again, it is a master, and several multiples of NODE_TIMEOUT have passed without any slave promotion being detected. Clearing the flag in this case helps the node rejoin the cluster.
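A sketch of the PFAIL -> FAIL escalation described above: the local node counts fresh failure reports gossiped by masters and escalates once a majority agrees (the validity window here is illustrative; Redis derives it from NODE_TIMEOUT):

```python
import time

REPORT_VALIDITY_SECS = 30.0   # illustrative; a gossiped report older than this is ignored

def should_mark_fail(reports, total_masters, now=None):
    """reports: dict reporter_id -> timestamp of that master's latest PFAIL report about B."""
    if now is None:
        now = time.time()
    fresh = [r for r, ts in reports.items() if now - ts <= REPORT_VALIDITY_SECS]
    # The local node's own PFAIL flag counts as well, hence the +1.
    return len(fresh) + 1 > total_masters // 2

reports = {"master-C": time.time(), "master-D": time.time()}
print(should_mark_fail(reports, total_masters=5))   # True: 3 of 5 masters agree
```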
Configuration handling, propagation, and fault tolerance
Cluster current epoch
Redis Cluster uses a concept similar to the "term" of the Raft algorithm, called the epoch. It gives events an increasing version number, so that when the information provided by different nodes conflicts, the other nodes can tell which state is the most recent.
When the cluster is created, every node, master and slave alike, sets currentEpoch to 0. Every time a packet is received from another node, if the sender's epoch is greater than the local one, currentEpoch is updated to the sender's epoch.
Slave election and promotion
Slave election and promotion are handled by the slave nodes, with the master nodes voting. A slave election is performed when a master is in FAIL state and it has at least one slave that meets the requirements.