Massive Technical Interviews Tips: ID generation

Wednesday, January 31, 2018

ID generation

Related:
http://massivetechinterview.blogspot.com/2015/06/twitteridsnowflake-iteye.html
http://massivetechinterview.blogspot.com/2015/10/generate-unique-sequence-number-using.html
http://massivetechinterview.blogspot.com/2016/06/buttercola-fast-id-generator.html

https://www.quora.com/Scalability-Why-Is-database-ID-generation-a-bottleneck
Imagine the classic simple case of a single table with an auto-incrementing primary key (or, if you prefer, a SEQUENCE). Now imagine there are very many connections to the database server, each of which is inserting records on this table. At some point, those inserts must be serialised (because of the rules that the sequence of IDs must follow: sequential, no gaps). This is achieved by a lock which amounts to a critical section (i.e. only one connection can increment the sequence at a time), and every connection needs to acquire that lock before obtaining the next ID (and inserting a record).
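
The critical section described above can be sketched in a few lines. This is only an illustration, not how any particular database implements it: `SerializedCounter` and `insert_many` are hypothetical names, and a `threading.Lock` stands in for the lock the server takes around the sequence.

```python
import threading


class SerializedCounter:
    """Hands out sequential IDs; the lock stands in for the database's critical section."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def next_id(self):
        # Only one caller at a time may read and advance the counter.
        with self._lock:
            value = self._next
            self._next += 1
            return value


def insert_many(counter, n, out):
    # Each simulated "connection" acquires the lock once per inserted row.
    for _ in range(n):
        out.append(counter.next_id())


counter = SerializedCounter()
ids = []
workers = [
    threading.Thread(target=insert_many, args=(counter, 1000, ids))
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Every ID is unique and the range has no gaps, but all eight workers queue on the same lock, which is the bottleneck the answer describes.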

Many systems don't treat ID generation as a problem because they don't handle a large number of requests. Suppose you want to allocate an ID to each user, and the IDs should be unique and increase with registration date. Once you receive many registration requests every second, or even every millisecond, it becomes extremely easy to end up with duplicate IDs.
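
To see how duplicates can arise, here is a deterministic sketch of the problematic interleaving. It assumes a naive "read the current maximum, add one" scheme; `allocate_naively` and the `users` dict are hypothetical stand-ins for application code and a table.

```python
def allocate_naively(existing_ids):
    """Read the current maximum and add one (no locking, no atomicity)."""
    return max(existing_ids) + 1


users = {1: "alice", 2: "bob"}

# Two registration requests arrive at almost the same moment.
# Both read the table before either has written its row.
id_for_carol = allocate_naively(users)
id_for_dave = allocate_naively(users)

collision = id_for_carol == id_for_dave  # both requests computed the same ID

users[id_for_carol] = "carol"
users[id_for_dave] = "dave"  # overwrites carol's row under the shared ID
```

With few requests the two reads rarely overlap, so the bug stays hidden; at high request rates the overlap becomes routine.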

https://stackoverflow.com/questions/22883304/why-is-auto-increment-pattern-bad-when-scaling-in-mongodb

In order to guarantee that an auto-increment value is unique, the ID creation must occur on a single thread on a single host (even if multiple threads are used, the point of ID creation must block other threads). So, in a cluster of 100 servers, IDs must be created on 1 thread on 1 out of the 100 servers. This is not just a performance bottleneck: it is also possible that the creation of 2 auto-increment IDs might block each other, which is the race condition noted in the quotation you've cited.

It should be noted that transactional RDBMS systems like Oracle and SQL Server have solved the race condition problem, but there is no solution to the performance bottleneck.

So: no, don't use auto-increment in non-primary keys if you anticipate the need to scale your system.

http://blog.gainlo.co/index.php/2016/06/07/random-id-generator/

Clock synchronization

We ignored a crucial problem in the above analysis. In fact, there’s a hidden assumption that all ID generation servers have the same clock to generate the timestamp, which might not be true in distributed systems.

In reality, system clocks can skew drastically in distributed systems, which can cause our ID generators to provide duplicate IDs or IDs in the wrong order. Clock synchronization is out of the scope of this discussion; however, it’s important to be aware of this issue in such a system. There are quite a few ways to address it; look into NTP if you want to know more.
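
A rough sketch of how clock problems show up in a timestamp-based generator follows. Everything here is an assumption for illustration: the bit layout (milliseconds, then a 10-bit machine ID, then a 12-bit sequence), the class names, and the injectable `clock` function used to simulate skew. It is not thread-safe and is not the generator from the original post.

```python
import time


class ClockMovedBackwards(Exception):
    pass


class TimestampIdGenerator:
    """Hypothetical layout: milliseconds | machine_id (10 bits) | sequence (12 bits)."""

    MACHINE_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        self.clock = clock  # injectable so tests can simulate skew
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            # The local clock was stepped back. Reusing an older
            # timestamp could repeat an ID, so refuse to generate one.
            raise ClockMovedBackwards(f"clock moved back {self.last_ms - now} ms")
        if now == self.last_ms:
            self.sequence += 1
            if self.sequence >= (1 << self.SEQUENCE_BITS):
                raise OverflowError("sequence exhausted for this millisecond")
        else:
            self.sequence = 0
        self.last_ms = now
        shifted_time = now << (self.MACHINE_BITS + self.SEQUENCE_BITS)
        return shifted_time | (self.machine_id << self.SEQUENCE_BITS) | self.sequence


# One server whose clock jumps backwards on the fourth call.
ticks = iter([1000, 1000, 1001, 999])
gen = TimestampIdGenerator(machine_id=7, clock=lambda: next(ticks))
first, second, third = gen.next_id(), gen.next_id(), gen.next_id()
try:
    gen.next_id()
    rejected = False
except ClockMovedBackwards:
    rejected = True

# Two servers whose clocks disagree by 5 ms. The IDs stay unique because the
# machine IDs differ, but the server with the slow clock produces the smaller
# ID even though its event happened later.
fast = TimestampIdGenerator(machine_id=1, clock=lambda: 1005)
slow = TimestampIdGenerator(machine_id=2, clock=lambda: 1000)
earlier_event = fast.next_id()
later_event = slow.next_id()
out_of_order = later_event < earlier_event
```

Refusing to issue IDs after a backwards jump is one possible policy; waiting until the clock catches up is another. Neither fixes ordering across servers, which still depends on how closely their clocks agree.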
