Wednesday, July 22, 2015

[Design] Distributed hash table - Shuatiblog.com



[Design] Distributed hash table - Shuatiblog.com
ref

Distributed hash table

A distributed hash table (DHT) is a class of a decentralized distributed system that provides a lookup service similar to a hash table. (key,value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key.
对于一个key/value对,DHT在分布式集群中,提供像HashTable一样的服务,例如简单快捷的存取、查询。


DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems.

Properties

Unlike unstructured P2P, DHT is tightly coupled between nodes and file locations. (when request a content, directly go to the content instead of searching by flooding)
DHT has the following properties:
  1. Autonomy and Decentralization: the nodes collectively form the system without any central coordination.
  2. Fault tolerance: the system should be reliable (in some sense) even with nodes continuously joining, leaving, and failing.
  3. Scalability: the system should function efficiently even with thousands or millions of nodes.

Building a DHT

  1. Hash function that maps a file to a unique ID. Eg. hash("Harry Potter") –> 3912.
  2. Distribute range space for all nodes in the network.
  3. The desinated node stores the location of the file. (this is indirect approach)

Search in DHT

  1. Search query routed to the node whose range covers the file.
  2. Each node would retains a routing information that is implemented in a fully distributed manner (i.e. no central point, no single point of failure).
There is different hashing and routing techniques associated with DHT. The most important is Consistent Hashing and Chord Routing.

Consistent Hashing

Consistent hashing is a special kind of hashing such that when a hash table is resized and consistent hashing is used, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots.

Motivation

In most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped. Specifically, the 3 cases below can end up in a technology crisis:
  1. leaves/failures – 一个 cache 服务器 m down 掉了(在实际应用中必须要考虑这种情况),这样所有映射到 cache m 的对象都会失效,怎么办,需要把 cache m 从 cache 中移除,这时候 cache 是 N-1 台,映射公式变成了 hash(object)%(N-1);
  2. join – 由于访问加重,需要添加 cache ,这时候 cache 是 N+1 台,映射公式变成了 hash(object)%(N+1)
  3. scalability – 由于硬件能力越来越强,你可能想让后面添加的节点多做点活,显然上面的 hash 算法也做不到。

2 hash 算法和单调性

   Hash 算法的一个衡量指标是单调性( Monotonicity ,定义如下:

  单调性是指如果已经有一些内容通过哈希分派到了相应的缓冲中,又有新的缓冲加入到系统中。哈希的结果应能够保证原有已分配的内容可以被映射到新的缓冲中去,而不会被映射到旧的缓冲集合中的其他缓冲区。
consistent hashing 是一种 hash 算法,简单的说,在移除 / 添加一个 cache 时,它能够尽可能小的改变已存在key 映射关系,尽可能的满足单调性的要求。
Consistent hashing 的基本思想就是将对象和 cache 都映射到同一个 hash 数值空间中,并且使用相同的 hash算法。
假设当前有 A,B  C  3  cache ,那么其映射结果将如图 3 所示,他们在 hash 空间中,以对应的 hash 值排列。
hash(cache A) = key A;
… …
hash(cache C) = key C;
cache
 3 cache 和对象的 key 值分布

cache  hash 计算,一般的方法可以使用 cache 机器的 IP 地址或者机器名作为 hash输入。
3.5.1 移除 cache
考虑假设 cache B 挂掉了,根据上面讲到的映射方法,这时受影响的将仅是那些沿 cache B 逆时针遍历直到下一个 cache  cache C )之间的对象,也即是本来映射到 cache B 上的那些对象。
因此这里仅需要变动对象 object4 ,将其重新映射到 cache C 上即可;参见图 4 
remove
 4 Cache B 被移除后的 cache 映射
3.5.2 添加 cache
再考虑添加一台新的 cache D 的情况,假设在这个环形 hash 空间中, cache D 被映射在对象 object2 object3 之间。这时受影响的将仅是那些沿 cache D 逆时针遍历直到下一个 cache  cache B )之间的对象(它们是也本来映射到 cache C 上对象的一部分),将这些对象重新映射到 cache D 上即可。

因此这里仅需要变动对象 object2 ,将其重新映射到 cache D 上;参见图 5 
add
 添加 cache D 后的映射关系

4 虚拟节点
考量 Hash 算法的另一个指标是平衡性 (Balance) ,定义如下:
平衡性

  平衡性是指哈希的结果能够尽可能分布到所有的缓冲中去,这样可以使得所有的缓冲空间都得到利用。

“虚拟节点”( virtual node )是实际节点在 hash 空间的复制品( replica ),一实际个节点对应了若干个“虚拟节点”,这个对应个数也成为“复制个数”,“虚拟节点”在 hash 空间中以 hash 值排列。
仍以仅部署 cache A  cache C 的情况为例,在图 4 中我们已经看到, cache 分布并不均匀。现在我们引入虚拟节点,并设置“复制个数”为 2 ,这就意味着一共会存在 4 个“虚拟节点”, cache A1, cache A2 代表了 cache A cache C1, cache C2 代表了 cache C ;假设一种比较理想的情况,参见图 6 
virtual nodes
引入“虚拟节点”后,映射关系就从 { 对象 -> 节点 } 转换到了 { 对象 -> 虚拟节点 } 
引入“虚拟节点”前,计算 cache A  hash 值:
Hash(“202.168.14.241”);
引入“虚拟节点”后,计算“虚拟节”点 cache A1  cache A2  hash 值:
Hash(“202.168.14.241#1”);  // cache A1
Hash(“202.168.14.241#2”);  // cache A2

Technique

Consistent hashing is based on mapping each object to a point on the edge of a circle. The system maps each available machine to pseudo-randomly distributed points on the edge of the same circle.
  1. 假定哈希key均匀的分布在一个环上
  2. 所有的节点也都分布在同一环上
  3. 每个节点只负责一部分Key,当节点加入、退出时只影响加入退出的节点和其邻居节点或者其他节点只有少量的Key受影响
For a very detailed steps of consistent hashing, read this Chinese blog.

In this way, 一致性Hash在node加入/离开时,不会导致映射关系的重大变化。

Routing (searching)

Simple Routing would search successor node, and runtime is linear. These node would keep O(1) routing information, and spend O(n) time in query routing.
Otherwise, we make every node store ID and IP of all nodes, thus query routing takes O(1) but routing information is O(n).
We'll now discuss Chord Routing.

Chord Routing

Each node stores more info closely following it on the identifier circle than nodes further away. That is, the subsequent nodes at position 1, 2, 4, 8, 16, 32… (each entry is called a finger)
为网络中每个Node分配一个唯一id(可以通过机器的mac地址做Hash),假设整个网络有N 个节点,我们可以认为这些整数首尾相连形成一个环,称之为Chord环。两个节点间的距离定义为节点间下标差,每个节点会存储一张路由表(Finger表),表内顺时针按照离本节点2、4、8、16、32.……2i的距离选定log2N个其他节点的ip信息来记录。
Routing information maintained at each node: O(logN).
Query routing take O(logN) time.

Join and leave in Chord

It's very much like insertion and removal in Doubly Linked List. Read it yourself.
Special thanks to the online resources written by some CSDN bloggers.
public class ConsistentHash<T> {

 private final HashFunction hashFunction;
 private final int numberOfReplicas;
 private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

 public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
     Collection<T> nodes) {
   this.hashFunction = hashFunction;
   this.numberOfReplicas = numberOfReplicas;

   for (T node : nodes) {
     add(node);
   }
 }

 public void add(T node) {
   for (int i = 0; i < numberOfReplicas; i++) {
     circle.put(hashFunction.hash(node.toString() + i), node);
   }
 }

 public void remove(T node) {
   for (int i = 0; i < numberOfReplicas; i++) {
     circle.remove(hashFunction.hash(node.toString() + i));
   }
 }

 public T get(Object key) {
   if (circle.isEmpty()) {
     return null;
   }
   int hash = hashFunction.hash(key);
   if (!circle.containsKey(hash)) {
     SortedMap<Integer, T> tailMap = circle.tailMap(hash);
     hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
   }
   return circle.get(hash);
 }

}

The circle is represented as a sorted map of integers, which represent the hash values, to caches (of type T here).
When a ConsistentHash object is created each node is added to the circle map a number of times (controlled bynumberOfReplicas). The location of each replica is chosen by hashing the node's name along with a numerical suffix, and the node is stored at each of these points in the map.
To find a node for an object (the get method), the hash value of the object is used to look in the map. Most of the time there will not be a node stored at this hash value (since the hash value space is typically much larger than the number of nodes, even with replicas), so the next node is found by looking for the first key in the tail map. If the tail map is empty then we wrap around the circle by getting the first key in the circle.
Used by memcached, cassandra,  Chord, which is a distributed hash table implementation, and Amazon's Dynamo, which is a key-value store (not available outside Amazon).
only the client that needs to implement the consistent hashing algorithm - the memcached server is unchanged.


Read full article from [Design] Distributed hash table - Shuatiblog.com

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts