All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index. If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.
② Bad for MySQL indexes: if a UUID is used as the database primary key, under the InnoDB engine its unordered nature can cause frequent movement of data positions, severely hurting performance.
For the third issue, "DB availability", we currently deploy one master and two slaves across machine rooms, with semi-synchronous replication [5] between master and slaves, and use the company's Atlas database middleware (now open-sourced and renamed DBProxy) for master-slave failover. This scheme can degrade to asynchronous replication in some situations, and in extremely rare cases may still leave the data inconsistent, but the probability is very small. If your system must guarantee 100% strong data consistency, you can choose a strongly consistent MySQL solution built on a Paxos-like algorithm, such as MySQL Group Replication, which recently went GA in MySQL 5.7; however, the operational cost and effort will increase accordingly, so choose according to your actual situation.
Ex: Twitter’s Snowflake, a Thrift service that uses Apache ZooKeeper to coordinate nodes and then generates 64-bit unique IDs
Pros:
Snowflake IDs are 64 bits, half the size of a UUID
Can use time as first component and remain sortable
Distributed system that can survive nodes dying
Cons:
Would introduce additional complexity and more ‘moving parts’ (ZooKeeper, Snowflake servers) into our architecture
DB Ticket Servers
Uses the database's auto-incrementing abilities to enforce uniqueness. Flickr uses this approach, but with two ticket DBs (one issuing odd numbers, the other even) to avoid a single point of failure; a sketch follows the pros/cons below.
Pros:
DBs are well understood and have pretty predictable scaling factors
Cons:
Can eventually become a write bottleneck (though Flickr reports that, even at huge scale, it’s not an issue).
An additional couple of machines (or EC2 instances) to admin
If using a single DB, becomes single point of failure. If using multiple DBs, can no longer guarantee that they are sortable over time.
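For concreteness, here is a minimal client-side sketch of the ticket-server idea against MySQL over JDBC. The one-row stub table and the odd/even offsets follow Flickr's published write-up; the JDBC URL and the exact table/column names here are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of a Flickr-style ticket server client. Each ticket DB holds a one-row table like
//   CREATE TABLE tickets64 (id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, stub CHAR(1) NOT NULL,
//                           PRIMARY KEY (id), UNIQUE KEY (stub)) ENGINE=InnoDB;
// and is configured with auto_increment_increment = 2 plus auto_increment_offset = 1 or 2,
// so one server hands out odd IDs and the other even IDs.
public class TicketClient {

    /** REPLACE INTO bumps AUTO_INCREMENT while keeping the table at a single row. */
    public static long nextId(String jdbcUrl) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("REPLACE INTO tickets64 (stub) VALUES ('a')");
            try (ResultSet rs = stmt.executeQuery("SELECT LAST_INSERT_ID()")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```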
Of all the approaches above, Twitter’s Snowflake came the closest, but the additional complexity required to run an ID service was a point against it. Instead, we took a conceptually similar approach, but brought it inside PostgreSQL.
Our sharded system consists of several thousand ‘logical’ shards that are mapped in code to far fewer physical shards. Using this approach, we can start with just a few database servers, and eventually move to many more, simply by moving a set of logical shards from one database to another, without having to re-bucket any of our data. We used Postgres’ schemas feature to make this easy to script and administrate.
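As a rough illustration of that mapping (the shard count, map contents and naming below are invented; Instagram's real routing lives in their application code and Postgres schemas):

```java
import java.util.Map;

// Sketch: many logical shards, few physical servers. Rebalancing means changing
// the logical->physical map, not moving individual rows between shards.
public class ShardRouter {
    private static final int LOGICAL_SHARDS = 2048;          // illustrative count

    private final Map<Integer, String> logicalToPhysical;    // e.g. 0 -> "pg-server-1"

    public ShardRouter(Map<Integer, String> logicalToPhysical) {
        this.logicalToPhysical = logicalToPhysical;
    }

    /** Returns "server/schema" for the user, e.g. "pg-server-1/shard_0042". */
    public String locate(long userId) {
        int logicalShard = (int) (userId % LOGICAL_SHARDS);
        return logicalToPhysical.get(logicalShard) + "/shard_" + String.format("%04d", logicalShard);
    }
}
```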
The RingBuffer is an array; each item of the array is called a 'slot', and every slot holds a UID or a flag (double RingBuffer). The size of the RingBuffer is 2^n, where n is a positive integer equal to or greater than the number of sequence bits. Assign a bigger value to boostPower if you want to enlarge the RingBuffer and improve throughput.
Tail & Cursor pointer
Tail Pointer
Represents the latest produced UID. If the tail catches up with the cursor, the ring buffer is full and no put operation should be allowed; you can specify a policy to handle this by assigning the rejectedPutBufferHandler property.
Cursor Pointer
Represents the latest already-consumed UID. If the cursor catches up with the tail, the ring buffer is empty and any take operation will be rejected; you can also specify a policy to handle this by assigning the rejectedTakeBufferHandler property.
CachedUidGenerator uses a double RingBuffer: one RingBuffer for UIDs and another for status flags (whether a slot is available for take or put).
An array improves read performance thanks to the CPU cache mechanism, but it also brings the side effect of false sharing; to solve this, cache line padding is applied.
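The padding trick can be sketched roughly as follows (the project's actual PaddedAtomicLong may differ in detail; the field names below are only illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of cache-line padding: the extra long fields push neighbouring hot counters
// onto different 64-byte cache lines, so updating one counter does not invalidate
// the cache line holding another (the cause of false sharing).
public class PaddedAtomicLong extends AtomicLong {
    /** Padding; 6 longs + the inherited value + the object header roughly fill a cache line. */
    public volatile long p1, p2, p3, p4, p5, p6 = 7L;

    public PaddedAtomicLong(long initialValue) {
        super(initialValue);
    }

    /** Keep the padding fields from being optimized away as dead code. */
    public long sumPaddingToPreventOptimization() {
        return p1 + p2 + p3 + p4 + p5 + p6;
    }
}
```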
RingBuffer filling
Initialization padding: during RingBuffer initialization, the entire RingBuffer is filled.
In-time filling: whenever the percentage of available UIDs drops below the paddingFactor threshold, a fill task is triggered. You can reassign that threshold in the Spring bean configuration.
Periodic filling: filling runs periodically in a scheduled thread. The scheduleInterval can be reassigned in the Spring bean configuration.
<bean id="cachedUidGenerator" class="com.baidu.fsg.uid.impl.CachedUidGenerator">
    <property name="workerIdAssigner" ref="disposableWorkerIdAssigner" />

    <!-- The config below is optional -->
    <!-- Specify bits & epoch as you need; if not specified, the default values will be used -->
    <property name="timeBits" value="29"/>
    <property name="workerBits" value="21"/>
    <property name="seqBits" value="13"/>
    <property name="epochStr" value="2016-09-20"/>

    <!-- RingBuffer size boost, to improve the throughput. -->
    <!-- Default is 3. Sample: original bufferSize=8192, after boosting the new bufferSize = 8192 << 3 = 65536 -->
    <property name="boostPower" value="3"></property>

    <!-- In-time padding threshold, as a percentage (0, 100) of available UIDs in the RingBuffer, default 50 -->
    <!-- Sample: bufferSize=1024, paddingFactor=50 -> threshold = 1024 * 50 / 100 = 512. -->
    <!-- When the remaining available UIDs < 512, the RingBuffer will be padded in-time -->
    <property name="paddingFactor" value="50"></property>

    <!-- Periodic padding -->
    <!-- Disabled by default. Enable as below; scheduleInterval unit is seconds. -->
    <property name="scheduleInterval" value="60"></property>

    <!-- Policy for rejecting put on RingBuffer -->
    <property name="rejectedPutBufferHandler" ref="XxxxYourPutRejectPolicy"></property>

    <!-- Policy for rejecting take from RingBuffer -->
    <property name="rejectedTakeBufferHandler" ref="XxxxYourTakeRejectPolicy"></property>
</bean>

<!-- Disposable WorkerIdAssigner based on Database -->
<bean id="disposableWorkerIdAssigner" class="com.baidu.fsg.uid.worker.DisposableWorkerIdAssigner" />
UidGenerator is a Java-implemented, Snowflake-based unique ID generator. It works as a component and allows users to override the workerId bits and the initialization strategy, which makes it much more suitable for virtualized environments such as Docker. Besides that, it overcomes the concurrency limitation of the Snowflake algorithm by consuming future time; it parallelizes UID production and consumption by caching UIDs in a RingBuffer; and it eliminates the CacheLine false sharing introduced by the RingBuffer via padding. As a result, a single instance can offer over 6 million QPS.
**Snowflake algorithm:** a unique ID consists of a worker node, a timestamp and a sequence within that timestamp. Usually it is a 64-bit number (long), and the default bits of these three fields are as follows:
sign (1 bit)
The highest bit is always 0.
delta seconds (28 bits)
The next 28 bits represent the delta in seconds since a custom epoch (2016-05-20). The maximum supported time span is about 8.5 years.
worker id (22 bits)
The next 22 bits represent the worker node id; the maximum value is about 4.2 million. By default, UidGenerator uses a built-in database-based worker id assigner at startup, and it discards the previous worker node id after a reboot (a new id is assigned each time). Other strategies, such as 'reuse', are coming soon.
sequence (13 bits)
The last 13 bits represent the sequence within one second; the maximum is 8192 UIDs per second by default.
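As a rough sketch of how these fields are packed into one long (this is an illustration of the layout above, not UidGenerator's actual BitsAllocator class):

```java
// Packs delta-seconds, worker id and sequence into a 64-bit UID.
// Bit widths follow the default layout above: 1 sign + 28 time + 22 worker + 13 sequence.
public final class UidLayoutSketch {
    static final int TIME_BITS = 28;
    static final int WORKER_BITS = 22;
    static final int SEQ_BITS = 13;

    static long allocate(long deltaSeconds, long workerId, long sequence) {
        // the sign bit stays 0 as long as deltaSeconds fits into 28 bits
        return (deltaSeconds << (WORKER_BITS + SEQ_BITS)) | (workerId << SEQ_BITS) | sequence;
    }

    public static void main(String[] args) {
        // e.g. 1,000,000 seconds after the epoch, worker 42, sequence 7
        System.out.println(Long.toBinaryString(allocate(1_000_000L, 42L, 7L)));
    }
}
```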
/**
* Get the UIDs in the same specified second under the max sequence
*
* @param currentSecond
* @return UID list, size of {@link BitsAllocator#getMaxSequence()} + 1
 */
protected List<Long> nextIdsForOneSecond(long currentSecond) {
    // Initialize result list size of (max sequence + 1)
    int listSize = (int) bitsAllocator.getMaxSequence() + 1;
    List<Long> uidList = new ArrayList<>(listSize);

    // Allocate the first sequence of the second, the others can be calculated with the offset
    long firstSeqUid = bitsAllocator.allocate(currentSecond - epochSeconds, workerId, 0L);
    for (int offset = 0; offset < listSize; offset++) {
        uidList.add(firstSeqUid + offset);
    }

    return uidList;
}
public class BufferPaddingExecutor {
    /** Whether buffer padding is running */
    private final AtomicBoolean running;

    /** We can borrow UIDs from the future, here store the last second we have consumed */
    private final PaddedAtomicLong lastSecond;

    /** RingBuffer & BufferedUidProvider */
    private final RingBuffer ringBuffer;
    private final BufferedUidProvider uidProvider;

    /** Padding immediately by the thread pool */
    private final ExecutorService bufferPadExecutors;

    /** Padding schedule thread */
    private final ScheduledExecutorService bufferPadSchedule;

    /** Schedule interval, unit as seconds */
    private long scheduleInterval = DEFAULT_SCHEDULE_INTERVAL;

    public BufferPaddingExecutor(RingBuffer ringBuffer, BufferedUidProvider uidProvider, boolean usingSchedule) {
        this.running = new AtomicBoolean(false);
        this.lastSecond = new PaddedAtomicLong(TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()));
        this.ringBuffer = ringBuffer;
        this.uidProvider = uidProvider;

        // initialize thread pool
        int cores = Runtime.getRuntime().availableProcessors();
        bufferPadExecutors = Executors.newFixedThreadPool(cores * 2, new NamingThreadFactory(WORKER_NAME));
        // initialization of the scheduled executor (controlled by usingSchedule) is omitted in this excerpt
    }
public void paddingBuffer() {
LOGGER.info("Ready to padding buffer lastSecond:{}. {}", lastSecond.get(), ringBuffer);
// is still running
if (!running.compareAndSet(false, true)) {
LOGGER.info("Padding buffer is still running. {}", ringBuffer);
return;
}
// fill the rest slots until to catch the cursor
boolean isFullRingBuffer = false;
while (!isFullRingBuffer) {
List<Long> uidList = uidProvider.provide(lastSecond.incrementAndGet());
for (Long uid : uidList) {
isFullRingBuffer = !ringBuffer.put(uid);
if (isFullRingBuffer) {
break;
}
}
}
// not running now
running.compareAndSet(true, false);
LOGGER.info("End to padding buffer lastSecond:{}. {}", lastSecond.get(), ringBuffer);
}
The main execution method of the executor that fills the RingBuffer is shown below.
/**
 * Padding buffer: fill the slots until the tail catches the cursor
 * <p>
 * This method is called by both in-time filling and periodic filling.
 */
public void paddingBuffer() {
    LOGGER.info("{} Ready to padding buffer lastSecond:{}. {}", this, lastSecond.get(), ringBuffer);

    // is still running
    // this flag means the padding executor is running, not the RingBuffer;
    // it prevents several threads from padding at the same time
    if (!running.compareAndSet(false, true)) {
        LOGGER.info("Padding buffer is still running. {}", ringBuffer);
        return;
    }

    // fill the rest slots until to catch the cursor
    boolean isFullRingBuffer = false;
    while (!isFullRingBuffer) {
        // generate all UIDs of the given second and put them in until the buffer is full
        List<Long> uidList = uidProvider.provide(lastSecond.incrementAndGet());
        for (Long uid : uidList) {
            isFullRingBuffer = !ringBuffer.put(uid);
            if (isFullRingBuffer) {
                break;
            }
        }
    }

    // not running now
    running.compareAndSet(true, false);
    LOGGER.info("End to padding buffer lastSecond:{}. {}", lastSecond.get(), ringBuffer);
}
/**
 * Start executors such as schedule
 */
public void start() {
    if (bufferPadSchedule != null) {
        bufferPadSchedule.scheduleWithFixedDelay(this::paddingBuffer, scheduleInterval, scheduleInterval, TimeUnit.SECONDS);
    }
}
When the take() method detects that the padding threshold has been reached, it triggers asynchronous filling.
/**
 * Padding buffer in the thread pool
 */
public void asyncPadding() {
    bufferPadExecutors.submit(this::paddingBuffer);
}
/** Tail: last position sequence to produce */
private final AtomicLong tail = new PaddedAtomicLong(START_POINT);
/** Cursor: current position sequence to consume */
private final AtomicLong cursor = new PaddedAtomicLong(START_POINT);
/** Threshold for trigger padding buffer*/
private final int paddingThreshold;
private BufferPaddingExecutor bufferPaddingExecutor;
/**
 * Put a UID into the ring and move the tail<br>
 * We use 'synchronized' to guarantee that filling the slot and publishing the new tail sequence happen atomically<br>
 *
 * <b>Note that:</b> it is recommended to put UIDs serially; since we batch-generate a series of UIDs and put
 * them one by one into the buffer, there is no need to put from multiple threads
 *
 * @param uid
 * @return false means that the buffer is full, apply {@link RejectedPutBufferHandler}
 */
public synchronized boolean put(long uid) {
long currentTail = tail.get();
long currentCursor = cursor.get();
// if the tail catches the cursor, the RingBuffer is full and nothing can be put
long distance = currentTail - (currentCursor == START_POINT ? 0 : currentCursor);
if (distance == bufferSize - 1) {
rejectedPutHandler.rejectPutBuffer(this, uid);
return false;
}
// 1. pre-check whether the flag is CAN_PUT_FLAG
int nextTailIndex = calSlotIndex(currentTail + 1);
if (flags[nextTailIndex].get() != CAN_PUT_FLAG) {
rejectedPutHandler.rejectPutBuffer(this, uid);
return false;
}
// 2. put UID in the next slot
// 3. update the next slot's flag to CAN_TAKE_FLAG
// 4. publish tail with sequence increase by one
slots[nextTailIndex] = uid;
flags[nextTailIndex].set(CAN_TAKE_FLAG);
tail.incrementAndGet();
// The atomicity of the operations above is guaranteed by 'synchronized'. In other words,
// the take operation can't consume the UID we just put until the tail is published (tail.incrementAndGet())
return true;
}
/**
 * Take a UID from the ring at the next cursor; this is a lock-free operation using an atomic cursor<p>
 *
 * Before getting the UID, we also check whether the padding threshold has been reached;
 * if so, the padding buffer operation is triggered in another thread<br>
 * If there is no more available UID to be taken, the specified {@link RejectedTakeBufferHandler} is applied<br>
 */
public long take() {
// spin get next available cursor
long currentCursor = cursor.get();
long nextCursor = cursor.updateAndGet(old -> old == tail.get() ? old : old + 1);
// check for safety consideration, it never occurs
Assert.isTrue(nextCursor >= currentCursor, "Cursor can't move back");
// trigger padding in an async-mode if reach the threshold
long currentTail = tail.get();
if (currentTail - nextCursor < paddingThreshold) {
LOGGER.info("Reach the padding threshold:{}. tail:{}, cursor:{}, rest:{}", paddingThreshold, currentTail,
nextCursor, currentTail - nextCursor); bufferPaddingExecutor.asyncPadding();
}
// the cursor catches the tail, meaning there is no more available UID to take
if (nextCursor == currentCursor) {
rejectedTakeHandler.rejectTakeBuffer(this);
}
// 1. check next slot flag is CAN_TAKE_FLAG
int nextCursorIndex = calSlotIndex(nextCursor);
Assert.isTrue(flags[nextCursorIndex].get() == CAN_TAKE_FLAG, "Cursor not in can take status");
// 2. get UID from next slot
// 3. set next slot flag as CAN_PUT_FLAG.
long uid = slots[nextCursorIndex];
flags[nextCursorIndex].set(CAN_PUT_FLAG);
// Note that: steps 2 and 3 cannot be swapped. If we set the flag before getting the value of the slot, the producer may overwrite the
// slot with a new UID, which could cause the consumer to take that UID twice after walking a round of the ring
return uid;
}
protected int calSlotIndex(long sequence) {
return (int) (sequence & indexMask);
}
protected void discardPutBuffer(RingBuffer ringBuffer, long uid) {
LOGGER.warn("Rejected putting buffer for uid:{}. {}", uid, ringBuffer);
}
protected void exceptionRejectedTakeBuffer(RingBuffer ringBuffer) {
LOGGER.warn("Rejected take buffer. {}", ringBuffer);
throw new RuntimeException("Rejected take buffer. " + ringBuffer);
}
private PaddedAtomicLong[] initFlags(int bufferSize) {
    PaddedAtomicLong[] flags = new PaddedAtomicLong[bufferSize];
    for (int i = 0; i < bufferSize; i++) {
        flags[i] = new PaddedAtomicLong(CAN_PUT_FLAG);
    }

    return flags;
}
CREATE TABLE WORKER_NODE
(
    ID          BIGINT      NOT NULL AUTO_INCREMENT COMMENT 'auto increment id',
    HOST_NAME   VARCHAR(64) NOT NULL COMMENT 'host name',
    PORT        VARCHAR(64) NOT NULL COMMENT 'port',
    TYPE        INT         NOT NULL COMMENT 'node type: ACTUAL or CONTAINER',
    LAUNCH_DATE DATE        NOT NULL COMMENT 'launch date',
    MODIFIED    TIMESTAMP   NOT NULL COMMENT 'modified time',
    CREATED     TIMESTAMP   NOT NULL COMMENT 'created time',
    PRIMARY KEY(ID)
)
COMMENT='DB WorkerID Assigner for UID Generator', ENGINE = INNODB;
There are two implementations of UidGenerator: DefaultUidGenerator and CachedUidGenerator.
For performance-sensitive applications, CachedUidGenerator is recommended.
<!-- DefaultUidGenerator -->
<bean id="defaultUidGenerator" class="com.baidu.fsg.uid.impl.DefaultUidGenerator" lazy-init="false">
    <property name="workerIdAssigner" ref="disposableWorkerIdAssigner"/>

    <!-- Specify bits & epoch as you need; if not specified, the default values will be used -->
    <property name="timeBits" value="29"/>
    <property name="workerBits" value="21"/>
    <property name="seqBits" value="13"/>
    <property name="epochStr" value="2016-09-20"/>
</bean>

<!-- Disposable WorkerIdAssigner based on Database -->
<bean id="disposableWorkerIdAssigner" class="com.baidu.fsg.uid.worker.DisposableWorkerIdAssigner" />
In some systems, you can just keep incrementing the ID from 1, 2, 3 … N. In other systems, we may need to generate the ID as a random string. Usually, there are a few requirements for ID generators:
They cannot be arbitrarily long. Let’s say we keep it within 64 bits.
IDs are incremented by date. This gives the system a lot of flexibility, e.g. you can sort users by ID, which is the same as ordering by registration date.
There can be some other requirements, especially when you want to scale the system to support millions or even billions of users.
Single machine
In the simplest case, we can just keep incrementing the ID from 1, 2, 3 … N, which in fact is one of the most popular ways to generate IDs in many real-life projects. If user A's ID is larger than user B's, then we know that A registered later.
However, this approach is hard to scale. Let's say that after one year there are so many new users every day that we have to scale the database to multiple instances. You'll see that this approach won't work, because it may generate duplicate IDs for different users.
Multiple machine solution
Since within a single timestamp there can also be multiple users, we could solve this with two approaches.
We assign a server ID to each ID generation server and the final ID is a combination of timestamp and the server ID.
We can also allow multiple requests within a single timestamp on a single server. We can keep a counter on each server, which indicates how many IDs have been generated in the current timestamp. So the final ID is a combination of timestamp, serverID and the counter.
As mentioned previously, the ID cannot be arbitrarily long, so the counter may end up with only 8 bits, for instance. In this case, a server can handle at most 256 requests within a single timestamp. If it frequently exceeds this limit, we need to add more instances.
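As a sketch of such a combination (the 48/8/8 split and the class name here are made up for illustration; they are not from any particular system):

```java
// Illustrative 64-bit ID: | 48-bit millisecond timestamp | 8-bit server id | 8-bit counter |.
// With only 8 counter bits, one server can issue at most 256 IDs per millisecond.
// Clock skew between servers is NOT handled here (see the discussion below).
public class TimestampCounterIdGenerator {
    private final long serverId;          // 0..255, assigned to this generation server
    private long lastTimestamp = -1L;
    private long counter = 0L;

    public TimestampCounterIdGenerator(long serverId) {
        this.serverId = serverId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            counter = (counter + 1) & 0xFF;              // wraps after 256 IDs
            if (counter == 0) {                          // exhausted this millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    // busy-wait for the next millisecond
                }
            }
        } else {
            counter = 0;
        }
        lastTimestamp = now;
        return (now << 16) | (serverId << 8) | counter;
    }
}
```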
Clock synchronization
We ignored a crucial problem in the above analysis. In fact, there’s a hidden assumption that all ID generation servers have the same clock to generate the timestamp, which might not be true in distributed systems.
In reality, system clocks can drastically skew in distributed systems, which can cause our ID generators to produce duplicate IDs or IDs in the wrong order. Clock synchronization is out of the scope of this discussion; however, it's important to be aware of this issue. There are quite a few ways to address it; check NTP if you want to know more.
Write a custom server: This is easy; basically, create a service that gives out an increasing ID. But it is very hard to cluster, and the behavior in case of a failure is complicated.
"A timestamp, worker number and sequence number": Twitter Guys created solution based on "a timestamp, worker number and sequence number" (this is kind of that we use as well, except that ran few dedicated servers for this)http://engineering.twitter.com/2010/06/announcing-snowflake.html
IMHO, "a timestamp, worker number and sequence number" is the best option. Only downside of this is that this assumes that broker nodes are loosely synced in time. Only other option I see is Zookeeper.
You can also use ZooKeeper. It provides methods for creating sequence nodes (appended to znode names, though I prefer using the version numbers of the nodes). Be careful with this one though: if you don't want missed numbers in your sequence, it may not be what you want.
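A minimal sketch of the sequence-node approach with the ZooKeeper Java client (the connection string and the parent path /ids are assumptions for illustration, and the parent znode must already exist):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: ZooKeeper appends a monotonically increasing 10-digit suffix to the znode name.
// The counter is per parent node, and the sequence can contain gaps (e.g. if a create
// succeeds on the server but the client never sees the response and retries).
public class ZkSequenceSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        String created = zk.create("/ids/id-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // created looks like "/ids/id-0000000042"; the numeric suffix is the ID
        long id = Long.parseLong(created.substring(created.lastIndexOf('-') + 1));
        System.out.println(id);
        zk.close();
    }
}
```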
2) Token ring: All nodes in this system are connected in a loop. This design is similar to the previous one, except that the master node is the one that holds the token. A nice point is that it avoids the bottleneck. The details of dealing with failures, such as a lost token, could be borrowed from the token-ring network.
On the other hand, I tried to come up with a centralized system with strong consistency using mechanisms such as Paxos or a quorum-based R/W configuration. Furthermore, one may look at Kafka, a system that implements distributed commit logs.