Saturday, April 13, 2019

Skip List



http://huntinux.github.io/leveldb-skiplist.html
Skip List是对单向链表的扩展,使得其性能接近AVL和红黑树,同时实现起来更加简单。
Skip List被用来实现leveldb中的memtable,此外也被用来实现Redis中的sorted sets数据结构。
一个单向链表:
这里写图片描述
在单向链表中进行查找的时间复杂度是O(N)。此外,由于插入、删除操作首先都要使用查找操作,因此如何提高查找的性能成为关键问题。
一种改善查找性能的方法,建立多级链表
这里写图片描述
那么查找7的过程是这样的(红色箭头):
这里写图片描述
从最高层链表(最稀疏的)开始,因为7小于12,所以进入下层链表继续查找,这样就不会在链表的后半部进行查找,所以查找范围减少了一半,过程类似折半查找。后续的查找过程类似,最后定位到7的位置。
所以,通过建立多级链表,可以提高查找性能。每层链表都是跳跃性前进的(skip),因此 称为 Skip List。Skip List通过在查找过程中“跳过”一些节点使得find,insertremove的时间复杂度为O(lgN)。

概念梳理

We speak of a Skip List node having levels, one level per forward reference. The number of levels in a node is called the size of the node.
每个节点都在若干个level中出现,或者说节点有若干个level,而每个level是一个前进的指针。如果该节点有n个level,那么称该节点的size是n。

每级链表的跳跃距离

上面的skip list的最底层链表跳跃距离是1,第二次跳跃距离是2,最高层是4。这种跳跃策略是固定的。实际上,常用的跳跃策略是是基于概率的跳跃(probabilistically),像这样事儿的:
这里写图片描述
关于probabilistically策略,wikipedia是这样介绍的:
an element in layer i appears in layer i+1 with some fixed probability p (two commonly used values for p are 1/2 or 1/4). On average, each element appears in 1/(1-p) , and the tallest element (usually a special head element at the front of the skip list) in all the lists ...(公式无法复制)
在第i层出现的元素出现在i+1层出现的概率是p(p通常是1/2或1/4)。平均下来,每个元素出现在1/(1-p)个链表中。
其中比较特殊的元素是head,它出现在所有list中,高度是 : 

为什么要使用随机的level

原文解释在这里 大概是说,如果每层跳跃是2的i次方(i代表层数),那么find性能与二分查找类似,确实能达到O(lgN),但是insertdelete的性能却是O(N)。因为插入或删除时需要对节点进行必要的调整。
This data structure is looking pretty good, but there's a serious problem with it for the insert and remove operations. The work required to reorganize the list after an insertion or deletion is in $O(n)$. For example, suppose that the first element is removed in Figure 3. Since it is necessary to maintain the strict pattern of node sizes, values from 2 to the end must be moved toward the head and the end node must be removed. A similar situation occurs when a new element is added to the list. This is where the probabilistic approach of a true Skip List comes into play. A Skip List is built with the same distribution of node sizes, but without the requirement for the rigid pattern of node sizes shown. It is no longer necessary to maintain the rigid pattern by moving values around after a remove or insert operation. Pugh shows that with high probability such a list still exhibits $O(\lg n)$ behavior. The probability that a given Skip List will perform badly is very small.

确定Max Level

The level of the header node is the maximum allowed level in the SkipList and is chosen at construction. Pugh shows that the maximum level should be chosen as $\log_\frac{1}{p} n$. Thus, for $p = \frac{1}{2}$, the maximum level for a SkipList of up to 65,536 elements should be chosen no smaller than $\lg_2 65536 = 16$.
对于$p = \frac{1}{2}$, 有65536个元素的SkipList,MaxLevel应该小于$\lg_2 65536 = 16$

随机生成节点的level

节点的level是根据概率p随机生成的。伪代码如下所示:
   int generateNodeLevel(double p, int maxLevel)
   {
    int level = 1;

    while (drand48() < p)
      level++;
    return (level > maxLevel) ? maxLevel : level;
   }
那么find的平均比对次数为: $1 + \frac{\log_\frac{1}{p} n}{p} + \frac{1}{1 - p}$
具体来说,对于含有n=65536个元素的skiplist来说,$p = \frac{1}{4}$ ,则平均比较次数为34.3; $p = \frac{1}{2}$, 则平均比较次数为35。
对于一个sorted list来说,平均查找次数是$\frac{n}{2} = 32768$
可以看到,skip list的查找性能有多大的提升。
这篇文章指出: MaxHeight 为SkipList的关键参数,与性能直接相关。 程序中修改MaxHeight时,在数值变小时,性能上有明显下降,但当数值增大时,甚至增大到10000时,和默认的 MaxHeight= 12相比仍旧无明显差异,内存使用上也是如此。 kBranching及 (rnd_.Next() % kBranching。 这使得上层节点的数量约为下层的1/4。那么,当设定MaxHeight=12时,根节点为1时,约可均匀容纳Key的数量为4^11= 4194304(约为400W)。 当单独增大MaxHeight时,并不会使得SkipList的层级提升。MaxHeight=12为经验值,在百万数据规模时,尤为适用。
https://www.jianshu.com/p/e4b775870eea

X. https://fancy-coder.com/2018/07/02/skip-list-in-leveldb/

How is Skip List used in LevelDB

Skip list is used as a buffer that tracks recent writes. The buffer is used to efficiently serve read with key that is recently touched. When buffer is full, its content is flushed into a SSTable (sorted string table) for persistence.

Why Not Hash Table

Hash table is good for serving reads, but not great for flushing data into SSTable. As its name indicates, SSTable content is sorted byte strings, which is good for look things up with binary search. Flushing buffer content to SSTable require traversing buffer contents in sorted order, which is hard for hash table, but easy for Skip List.

Why Not Binary Search Tree

Binary search tree (BST) is a good data structure that supports efficient look up and efficient content traversal in sorted order. I think the biggest advantage of skip list over BST is its concurrent property. Skip list supports concurrent read while write is on going (though writes have to go in sequential). I may create a separate post to go into more details.
TODO
Advantages
  • Skip lists perform very well on rapid insertions because there are no rotations or reallocations.
  • They’re simpler to implement than both self-balancing binary search trees and hash tables.
  • You can retrieve the next element in constant time (compare to logarithmic time for inorder traversal for BSTs and linear time in hash tables).
  • The algorithms can easily be modified to a more specialized structure (like segment or range “trees”, indexable skip lists, or keyed priority queues).
  • Making it lockless is simple.
  • It does well in persistent (slow) storage (often even better than AVL and EH).

MemTable is an in-memory data-structure holding data before they are flushed to SST files. It serves both read and write - new writes always insert data to memtable, and reads has to query memtable before reading from SST files, because data in memtable is newer. Once a memtable is full, it becomes immutable and replace by a new memtable. A background thread will flush the content of the memtable into a SST file, after which the memtable can be destroyed.

https://www.memsql.com/blog/what-is-skiplist-why-skiplist-index-for-memsql/
2) Simple
MemSQL’s skiplist implementation is about 1500 lines of code including comments. Having recently spent some time in both SQL Server’s and Innodb’s Btree implementations, I can tell you they are both close to 50 times larger in terms of lines of code and both have many more moving parts.   For example, a Btree has to deal with page splitting and page compaction while a skiplist has no equivalent operations.  The first generally available build of MemSQL’s took a little over a year to build and stabilize.  This feat wouldn’t have been possible with a more complex indexing data structure.
3) Lock free
A lockfree or non-blocking algorithm is one in which some thread is always able to make progress, no matter how all the threads’ executions are interleaved by the OS.  MemSQL is designed to support highly concurrent workloads running on hardware with many cores.  These goals makes lockfree algorithms desirable for MemSQL.  The algorithms for writing a  thread safe skiplist lockfree are now a solved problem in academia.  A number of papers have been published on the subject in the past decade.  It’s much harder to make a lockfree skiplist perform well when there is low contention (ie., a single thread iterating over the entire skiplist with no other concurrent operations executing).  Optimizing this case is a more active area of research. Our approach to solving this particular problem is a topic for another time.
5) Flexible
Skiplists also support some extra operations that are useful for query processing and that aren’t readily implementable in a balance tree.
For example, a skiplist is able to estimate the number of elements between two elements in the list in logarithmic time.  The general idea is to use the towers to estimate how many rows are between two elements linked together at the same height in the tree.  If we know the tower height at which the nodes are linked, we can estimate how many elements are expected to be between these elements because we know the expected distribution of towers at that height.   Knowing how many elements are expected to be in an arbitrary range of the list is very useful for query optimization when calculating how selective different filters in a select statement are.  Traditional databases need to build separate histograms to support this type of estimation in the query optimizer.
CPU Cache efficiency
Skiplists do not provide very good memory locality because  traversing pointers during a search results in execution jumping somewhat randomly around memory.  The impact of this effect is very workload specific and hard to accurately measure. The cost of executing the rest of a query  (sorting, executing expressions, protocol overhead to return the queries result) tends to dominate the cost of traversing the tower pointers during a search for most queries.
Reverse Iteration
Most skiplist implementations use backwards pointers (double linking of list elements) to support iterating backwards in the list.  The backwards pointers add extra memory overhead and extra implementation complexity (lockfree doubly linked lists are difficult to implement).  MemSQL’s skiplist employs a novel reverse iterator that uses the towers to iterate backwards without the need for reverse links.  The idea is to track the last tower link that is visited at each level of the skiplist while seeking to the end, or to a particular node, of the skiplist.  These links can be used to find the element behind each successive element (each time the reverse iterator moves backwards, it updates the links at each level that are used).  This iterator saves the memory required for backwards pointers, but does result in higher reverse iteration execution cost.   Reverse iteration is important for a SQL database because it allows ORDER BY queries to run without sorting even if the ORDER BY wants the opposite sort order (i.e., ascending instead of descending) of the order the index provides.


What is lucene skip list for?
http://lucene.sourceforge.net/talks/pisa/
https://stackoverflow.com/questions/49298388/what-is-lucene-skip-list-for
Documents are searched by DocID when each result of conditions on query are merged.

https://sease.io/2015/07/exploring-solr-internals-lucene.html
The term dictionary is a sorted skip list containing all the unique terms for the specific field.
The posting list is the sorted skip list of DocIds that contains the related term.
It’s used to return the documents for the searched term.

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts