Thursday, August 31, 2017

Cassandra data model
When inserting records, Cassandra will hash the value of the inserted data's partition key; Cassandra uses this hash value to determine which node is responsible for storing the data.
for Cassandra to look up a set of data (or a set of rows in the relational model), we have to store all of the data under the same partition key. To summarize, rows in Cassandra are essentially data embedded within a partition due to the fact that the data share the same partition key.

Data model goals

  1. Spread data evenly around the cluster. Paritions are distributed around the cluster based on a hash of the partition key. To distribute work across nodes, it's desirable for every node in the cluster to have roughly the same amount of data.
  2. Minimize the number of partitions read. Partitions are groups of columns that share the same partition key. Since each partition may reside on a different node, the query coordinator will generally need to issue separate commands to separate nodes for each partition we query.
  3. Satisfy a query by reading a single partition. This means we will use roughly one table per query. Supporting multiple query patterns usually means we need more than one table. Data duplication is encouraged.
Column families are represented in Cassandra as a map of sorted maps. The partition key acts as the lookup value; the sorted map consists of column keys and their associated values.
Map<ParitionKey, SortedMap<ColumnKey, ColumnValue>>
Compound key
Compound keys include multiple columns in the primary key, but these additional columns do not necessarily affect the partition key. A partition key with multiple columns is known as a composite key
CREATE TABLE crossfit_gyms_by_location (
   country_code text,
   state_province text,
   city text,
   gym_name text,
   PRIMARY KEY (country_code, state_province, city, gym_name)
Note that only the first column of the primary key above is considered the partition key; the rest of columns are clustering keys. This means that while the primary key represents a unique gym record/row, all gyms within a country reside on the same partition.
Clustering key
Clustering keys are responsible for sorting data within a partition. Each primary key column after the partition key is considered a clustering key. In the crossfit_gyms_by_location example, country_code is the partition key; state_province, city, and gym_name are the clustering keys. Clustering keys are sorted in ascending order by default. So when we query for all gyms in the United States, the result set will be ordered first by state_province in ascending order, followed by city in ascending order, and finally gym_name in ascending order.
Order by
To sort in descending order, add a WITH clause to the end of the CREATE TABLE statement.
CREATE TABLE crossfit_gyms_by_location (
   country_code text,
   state_province text,
   city text,
   gym_name text,
   PRIMARY KEY (country_code, state_province, city, gym_name)
) WITH CLUSTERING ORDER BY (state_province DESC, city ASC, gym_name ASC);
The partition key is not part of the ORDER BY statement because its values are hashed and therefore won't be close to each other in the cluster.

Composite key
Composite keys are partition keys that consist of multiple columns. The crossfit_gyms_by_location example only used country_code for partitioning. The result is that all gyms in the same country reside within a single partition. This can lead to wide rows. In the case of our example, there are over 7,000 CrossFit gyms in the United States, so using the single column partition key results in a row with over 7,000 combinations.
To avoid wide rows, we can move to a composite key consisting of additional columns. If we change the partition key to include the state_province and city columns, the partition hash value will no longer be calculated off only country_code. Now, each combination of country_code, state_province, and city will have its own hash value and be stored in a separate partition within the cluster. We accomplish this by nesting parenthesis around the columns we want included in the composite key. 
CREATE TABLE crossfit_gyms_by_city (
 country_code text,
 state_province text,
 city text,
 gym_name text,
 opening_date timestamp,
 PRIMARY KEY ((country_code, state_province, city), opening_date, gym_name)
) WITH CLUSTERING ORDER BY ( opening_data ASC, gym_name ASC );
When issuing a CQL query, you must include all partition key columns, at a minimum. You can then apply an additional filter by adding each clustering key in the order in which the clustering keys appear.

Invalid queries:
SELECT * FROM crossfit_gyms_by_city WHERE country_code = 'USA' and state_province = 'VA'
SELECT * FROM crossfit_gyms_by_city WHERE country_code = 'USA' and state_province = 'VA' and city = 'Arlington' and gym_name = 'CrossFit Route 7'
The first invalid query is missing the city partition key column. The second invalid query uses the clustering key gym_name without including the preceding clustering key opening_date.
The reason the order of clustering keys matters is because the clustering keys provide the sort order of the result set. Because of the clustering key's responsibility for sorting, we know all data matching the first clustering key will be adjacent to all other data matching that clustering key.
In our example, this means all gyms with the same opening date will be grouped together in alphabetical order. Gyms with different opening dates will appear in temporal order.
Because we know the order, CQL can easily truncate sections of the partition that don't match our query to satisfy the WHERE conditions pertaining to columns that are not part of the partition key. However, because the clustering key gym_name is secondary to clustering key opening_date, gyms will appear in alphabetical order only for gyms opened on the same day (within a particular city, in this case). Therefore, we can't specify the gym name in our CQL query without first specifying an opening date.
Cassandra Features
  • Continuous availability. The peer-to-peer replication of data to nodes within a cluster results in no single point of failure. This is true even across data centers.
  • Linear performance when scaling nodes in a cluster. If three nodes are achieving 3,000 writes per second, adding three more nodes will result in a cluster of six nodes achieving 6,000 writes per second.
  • Tunable consistency. If we want to replicate data across three nodes, we can have a replication factor of three, yet not necessarily wait for all three nodes to acknowledge the write. Data will eventually be written to all three nodes, but we can acknowledge the write after writing the data to one or more nodes without waiting for the full replication to finish.
  • Flexible data model. Every row can have a different number of columns with support for many types of data.
  • Query language (CQL) with a SQL-like syntax.
  • Support for Java Monitoring Extensions (JMX). Metrics about performance, latency, system usage, etc. are available for consumption by other applications.
Cassandra Limitations
  • No join or subquery support for aggregation. According to Cassandra’s documentation, this is by design, encouraging denormalization of data into partitions that can be queried efficiently from a single node, rather than gathering data from across the entire cluster.
  • Ordering is set at table creation time on a per-partition basis. This avoids clients attempting to sort billions of rows at run time.
  • All data for a single partition must fit on disk in a single node in the cluster.
  • It's recommended to keep the number of rows within a partition below 100,000 items and the disk size under 100 MB.
  • A single column value is limited to 2 GB (1 MB is recommended).
A less obvious limitation of Cassandra is its lack of row-level consistency. Modifications to a column family (table) that affect the same row and are processed with the same timestamp will result in a tie.
In the event of a tie Cassandra follows two rules:
  1. Deletes take precedence over inserts/updates.
  2. If there are two updates, the one with the lexically larger value wins.
This means for inserts/updates, Cassandra resolves row-level ties by comparing values at the column (cell) level, writing the greater value. This can result in one update modifying one column while another update modifies another column, resulting in rows with combinations of values that never existed.

Distributed Counter
This is the simplest counter to implement, especially for people coming from the RDBMS world. Acquire a lock on the counter, increment the counter and release the lock. In distributed systems, lockscan be implemented using an external coordinator, like Zookeeper.
 1 lock.acquire();
 2 try {
 3    counter.increment(5);
 4 } finally {
 5    lock.release();
 6 } 

  1. Locking involves coordination and coordination is expensive
  2. If the counter is a hotspot, there will be high lock contention
  3. In distributed systems, messages may be redelivered. The solution will not handle duplicate messages and is not idempotent


This involves implementing a CAS (compare-and-swap) operation on a counter. This can be achieved using Zookeeper or Paxos based consensus in Cassandra.
 1 int retries = 5;
 2 while(retries > 0) {
 3     int currentValue = counter.getValue();
 4     if (!counter.compareAndSwap(currentValue, currentValue + 5)) {
 5         try {
 6             Thread.sleep(500); // You want to ideally use an exponential backoff algorithm here
 7         } catch (InterruptedException e) {}
 8         retries--;
 9     } else {
10         break;
11     }
12 }  
  1. High contention
  2. Not very scalable
  3. Non-idempotent
  4. May exhaust retries if the counter is a hotspot


In this example, we will see how we can implement a CRDT based counter using Cassandra. We will show how we can model a counter using a CmRDT (operation-based) and a CvRDT (state-based). We’ll then model these using CQL.

Wednesday, August 30, 2017

Database Review Misc
If you start an explicit transaction, perform some updates, and then roll back, InnoDB restores the original state of data. It preserves the original data by storing it in an area of the database called the rollback segment. So if you roll back, it just re-copies those pages of data to replace the ones you changed.
This might take some time, so if you try to query data that was changed but rolled back, InnoDB automatically takes a detour to read the original data out of the rollback segment, until such time as it is re-merged into the tables.
Say for example you start a transaction, and UPDATE a billion rows. This copies many pages worth of the original rows to the rollback segment, and then fills the tables with changed data -- but the changed data is uncommitted. No one should be able to read uncommitted data, so anyone who queries the table will automatically get the original data from the rollback segment.
Then you rollback your transaction. Gradually over the next few minutes, InnoDB cleans up, and eventually it all comes back into sync. But anyone can continue to query the original data in the meantime.
If you had committed your transaction, then MySQL would just mark all the changed data as committed, and anyone subsequently reading the data wouldn't experience the slight overhead of reading from the rollback segment.
The rollback segment is inside the tablespace on disk.
The most common reason I encounter is databases having InnoDB tables without explicit primary keys. Especially if you are using row-based replication (“RBR”), you want explicit primary keys on all your tables. Otherwise, MySQL will scan the entire table for each row that is updated. (See bug 53375 . )  Maybe I’m a relational purist, but why would you want to have tables without explicit primary keys, anyway?  (On the other, less-purist, hand, for performance reasons, sometimes a short surrogate PK may be preferred to a lengthy logical one. )
The other common reason is that the slave is single-threaded, and single-threaded performance can’t keep up with the multi-threaded master.  In this case, if multiple databases are being updated, enabling the multi-threaded slave can help.  ( See the manual for more.)
I've hit this issue as well. It basically boils down to having variable number of values in your IN clause and Hibernate trying to cache those query plans.
There are two great blog posts on this topic. The first:
Using Hibernate 4.2 and MySQL in a project with an in-clause query such as: select t from Thing t where in (?)
Hibernate caches these parsed HQL queries. Specifically the Hibernate SessionFactoryImplhas QueryPlanCache with queryPlanCache and parameterMetadataCache. But this proved to be a problem when the number of parameters for the in-clause is large and varies.
These caches grow for every distinct query. So this query with 6000 parameters is not the same as 6001.
The in-clause query is expanded to the number of parameters in the collection. Metadata is included in the query plan for each parameter in the query, including a generated name like x10_, x11_ , etc.
Imagine 4000 different variations in the number of in-clause parameter counts, each of these with an average of 4000 parameters. The query metadata for each parameter quickly adds up in memory, filling up the heap, since it can't be garbage collected.
This continues until all different variations in the query parameter count is cached or the JVM runs out of heap memory and starts throwing java.lang.OutOfMemoryError: Java heap space.
Avoiding in-clauses is an option, as well as using a fixed collection size for the parameter (or at least a smaller size).
For configuring the query plan cache max size, see the propertyhibernate.query.plan_cache_max_size, defaulting to 2048 (easily too large for queries with many parameters).
And second (also referenced from the first):
Hibernate internally uses a cache that maps HQL statements (as strings) to query plans. The cache consists of a bounded map limited by default to 2048 elements (configurable). All HQL queries are loaded through this cache. In case of a miss, the entry is automatically added to the cache. This makes it very susceptible to thrashing - a scenario in which we constantly put new entries into the cache without ever reusing them and thus preventing the cache from bringing any performance gains (it even adds some cache management overhead). To make things worse, it is hard to detect this situation by chance - you have to explicitly profile the cache in order to notice that you have a problem there. I will say a few words on how this could be done later on.
So the cache thrashing results from new queries being generated at high rates. This can be caused by a multitude of issues. The two most common that I have seen are - bugs in hibernate which cause parameters to be rendered in the JPQL statement instead of being passed as parameters and the use of an "in" - clause.
Due to some obscure bugs in hibernate, there are situations when parameters are not handled correctly and are rendered into the JPQL query (as an example check out HHH-6280). If you have a query that is affected by such defects and it is executed at high rates, it will thrash your query plan cache because each JPQL query generated is almost unique (containing IDs of your entities for example).
The second issue lays in the way that hibernate processes queries with an "in" clause (e.g. give me all person entities whose company id field is one of 1, 2, 10, 18). For each distinct number of parameters in the "in"-clause, hibernate will produce a different query - e.g. select x from Person x where in (:id0_) for 1 parameter, select x from Person x where in (:id0_, :id1_) for 2 parameters and so on. All these queries are considered different, as far as the query plan cache is concerned, resulting again in cache thrashing. You could probably work around this issue by writing a utility class to produce only certain number of parameters - e.g. 1, 10, 100, 200, 500, 1000. If you, for example, pass 22 parameters, it will return a list of 100 elements with the 22 parameters included in it and the remaining 78 parameters set to an impossible value (e.g. -1 for IDs used for foreign keys). I agree that this is an ugly hack but could get the job done. As a result you will only have at most 6 unique queries in your cache and thus reduce thrashing.
So how do you find out that you have the issue? You could write some additional code and expose metrics with the number of entries in the cache e.g. over JMX, tune logging and analyze the logs, etc. If you do not want to (or can not) modify the application, you could just dump the heap and run this OQL query against it (e.g. using mat): SELECT l.query.toString() FROM INSTANCEOF org.hibernate.engine.query.spi.QueryPlanCache$HQLQueryPlanKey l. It will output all queries currently located in any query plan cache on your heap. It should be pretty easy to spot whether you are affected by any of the aforementioned problems.
As far as the performance impact goes, it is hard to say as it depends on too many factors. I have seen a very trivial query causing 10-20 ms of overhead spent in creating a new HQL query plan. In general, if there is a cache somewhere, there must be a good reason for that - a miss is probably expensive so your should try to avoid misses as much as possible. Last but not least, your database will have to handle large amounts of unique SQL statements too - causing it to parse them and maybe create different execution plans for every one of them.
When a database receives a statement, the database engine first parses the statement and looks for syntax errors. Once the statement is parsed, the database needs to figure out the most efficient way to execute the statement. This can be computationally quite expensive. The database checks what indexes, if any, can help, or whether it should do a full read of all rows in a table. Databases use statistics on the data to figure out what is the best way. Once the query plan is created then it can be executed by the database engine.
It takes CPU power to do the access plan generation. Ideally, if we send the same statement to the database twice, then we'd like the database to reuse the access plan for the first statement. This uses less CPU than if it regenerated the plan a second time.

Statement Caches

Databases are tuned to do statement caches. They usually include some kind of statement cache. This cache uses the statement itself as a key and the access plan is stored in the cache with the corresponding statement. This allows the database engine to reuse the plans for statements that have been executed previously
the entire statement is the key. For example, if we later sent the statement "select a,b from t where c = 3", it would not find an access plan. This is because the "c=3" is different from the cached plan "c=2". 

Here the cache won't be used. Each iteration of the loop sends a different SQL statement to the database. A new access plan is computed for each iteration and we're basically throwing CPU cycles away using this approach. However, look at the next snippet:
PreparedStatement ps = conn.prepareStatement("select a,b from t where c = ?");
For(int I = 0; I < 1000; ++I)
        ps.setInt(1, I);
        ResultSet rs = ps.executeQuery();
Here it will be much more efficient. The statement sent to the database is parameterized using the '?' marker in the sql. This means every iteration is sending the same statement to the database with different parameters for the "c=?" part. This allows the database to reuse the access plans for the statement and makes the program execute more efficiently inside the database.
When the J2EE server gives your application a connection, it isn't giving you the actual connection; you're getting a wrapper. You can verify this by looking at the name of the class for the connection you are given. It won't be a database JDBC connection, it'll be a class created by your application server. Normally, if you called close on a connection then the jdbc driver closes the connection. We want the connection to be returned to the pool when close is called by a J2EE application. We do this by making a proxy jdbc connection class that looks like a real connection. It has a reference to the actual connection. When we invoke any method on the connection then the proxy forwards the call to the real connection. But, when we call methods such as close instead of calling close on the real connection, it simply returns the connection to the connection pool and then marks the proxy connection as invalid so that if it is used again by the application we'll get an exception.

J2EE PreparedStatement Cache

J2EE PreparedStatement Cache is implemented using a cache inside the J2EE server connection pool manager. The J2EE server keeps a list of prepared statements for each database connection in the pool. When an application calls prepareStatement on a connection, the application server checks if that statement was previously prepared. If it was, the PreparedStatement object will be in the cache and this will be returned to the application. If not, the call is passed to the jdbc driver and the query/preparedstatement object is added in that connections cache.
We need a cache per connection because that's the way jdbc drivers work. Any preparedstatements returned are specific to that connection.
"In most cases, ALTER TABLE makes a temporary copy of the original table."
To elaborate on this. In this specific case the table is locked for read and writes. A new temp table is created, the data is copied into the new table with the additional column, the table is renamed and the old one is dropped, finally the table is unlocked.
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
When you create a GUID in SQL Server using the NewId() command you are guaranteed that it will be unique across the whole universe. Which means if you have two databases (with the same schema) completely disconnected adding rows to the same table, using a primary key of uniqueidentifier will ensure that they primary keys don’t conflict. In comparison, if the two databases had an identity integer column as the primary key for their tables, they would be very likely to insert the same primary key in both tables.

It isn’t a good idea to create a clustered index on a uniqueidentifier column and generate your GUIDs with NEWID(). The reason for this is that NEWID() generates GUIDs in non-sequential order and SQL Server orders a clustered index sequentially. It will work – SQL Server will let you build a clustered index around a uniqueidentifier column, however it will cause the SQL Server to do unnecessary work and cause performance slowdowns. The reason for this is that to insert data into the middle of a clustered index (out of sequential order) causes SQL Server to make room for the data by rearranging the cluster.
So if it isn’t a good idea then why do people do it? Well, in SQL Server, if I assign a column as the primary key in SQL Server Management Studio it automatically generates a clustered index for you regardless of the data type of that column. Therefore, if you want a table with a uniqueidentifier data type as a primary key you need to change that index to a non-clustered index.
Non-clustered indexes don’t reorder the data as rows are inserted to the table, so they don’t have the performance impact of a clustered index on inserts of non-sequential data.

1. Add the column as data type datetime
2. I usually call it Date
3. Set the Default Value to GetDate().
4. Make it non-null.
5. Create your clustered index on it before you insert data into you.
Adding a default value of GetDate() to the column writes the date and time that the row was inserted into the column automatically. This insures that the data for the row is inserted at the end of the table data – there is no rearranging of the cluster. Adding the data to the end ensures the best performance for inserts.
Another good choice for a clustered index is a column that reflects the ordering of the table in the majority of the select statements. For example, if you have a table called categories and you have an integer column called ordervalue and you always call the table with an SELECT … ORDER BY [ordervalue] then making ordervalue the clustered index makes sense. Here is why: even though it hinders performance to insert a non-sequential ordervalue into the cluster you will get a performance benefit when you call your data, since the rows will be read sequential from the cluster, that it depending on your workload characteristics.

Just drop the existing clustered index, then create a new one on the ID column.
DROP INDEX [Created] ON dbo.Realty;

Of course, you'll want to do this during a maintenance window so you don't cause too much blocking. If you have Enterprise Edition you can do the CREATE INDEX operation online by adding WITH (ONLINE=ON) to the statement.
If your table is getting up to 1 TB in size and probably has LOTS of rows in it, I would strongly recommend NOT making the clustered index that much fatter!
First of all, dropping and recreating the clustered index will shuffle around ALL your data at least once - that alone will take ages.
Secondly, the big compound clustered index you're trying to create will significantly increase the size of all your non-clustered indices (since they contain the whole clustered index value on each leaf node, for the bookmark lookups).
When architecting your underlying table structure, I'd steer clear of using GUIDs. Examine INTs or BIGINT columns based upon your expected scale (if you know for certain your scale is small you can also examine SMALLINT). GUIDs come with a great deal of overhead and performance hit just to increase the pool of available values open for uniquely-identifying rows in a table. You'd be better off adding a secondary column to deal with scale issues rather than rely on GUIDs in my opinion. The space concerns and institutional fragmentation alone are reasons to keep plenty of space between your tables and GUIDs.
Primary keys are often used as the sequence in which the data is actually stored. If the primary key is incremented, the data is simply appended. If the primary key is random, that would mean that existing data must be moved about to get the new row into the proper sequence. A basic (non-primary-key) index is typically much lighter in content and can be moved around faster with less overhead.
Use the smallest integer data type for the AUTO_INCREMENT column that is large enough to hold the maximum sequence value you will need. When the column reaches the upper limit of the data type, the next attempt to generate a sequence number fails. Use the UNSIGNED attribute if possible to allow a greater range. For example, if you useTINYINT, the maximum permissible sequence number is 127. For TINYINT UNSIGNED, the maximum is 255.

To start with an AUTO_INCREMENT value other than 1, set that value with CREATE TABLE or ALTER TABLE, like this:
Every InnoDB table has a special index called the clustered index where the data for the rows is stored. Typically, the clustered index is synonymous with the primary key.

  • When you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index. Define a primary key for each table that you create. If there is no logical unique and non-null column or set of columns, add a new auto-increment column, whose values are filled in automatically.
  • If you do not define a PRIMARY KEY for your table, MySQL locates the first UNIQUE index where all the key columns are NOT NULL and InnoDBuses it as the clustered index.
  • If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
How the Clustered Index Speeds Up Queries
Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.
How Secondary Indexes Relate to the Clustered Index
All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.
If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.
There are three possible settings for the innodb_autoinc_lock_mode configuration parameter. The settings are 0, 1, or 2, for traditionalconsecutive, or interleaved lock mode, respectively.
  • innodb_autoinc_lock_mode = 0 (traditional lock mode)
There are three possible settings for the innodb_autoinc_lock_mode configuration parameter. The settings are 0, 1, or 2, for traditionalconsecutive, or interleaved lock mode, respectively.
  • innodb_autoinc_lock_mode = 0 (traditional lock mode)
    The traditional lock mode provides the same behavior that existed before the innodb_autoinc_lock_mode configuration parameter was introduced in MySQL 5.1. The traditional lock mode option is provided for backward compatibility, performance testing, and working around issues with “mixed-mode inserts”, due to possible differences in semantics.
    In this lock mode, all INSERT-like statements obtain a special table-level AUTO-INC lock for inserts into tables with AUTO_INCREMENTcolumns. This lock is normally held to the end of the statement (not to the end of the transaction) to ensure that auto-increment values are assigned in a predictable and repeatable order for a given sequence of INSERT statements, and to ensure that auto-increment values assigned by any given statement are consecutive.
  • Suppose that there are two transactions running, each inserting rows into a table with an AUTO_INCREMENT column. One transaction is using an INSERT ... SELECT statement that inserts 1000 rows, and another is using a simple INSERT statement that inserts one row:
    Tx1: INSERT INTO t1 (c2) SELECT 1000 rows from another table ...
    Tx2: INSERT INTO t1 (c2) VALUES ('xxx');
    InnoDB cannot tell in advance how many rows are retrieved from the SELECT in the INSERT statement in Tx1, and it assigns the auto-increment values one at a time as the statement proceeds. With a table-level lock, held to the end of the statement, only one INSERTstatement referring to table t1 can execute at a time, and the generation of auto-increment numbers by different statements is not interleaved. The auto-increment value generated by the Tx1 INSERT ... SELECT statement is consecutive, and the (single) auto-increment value used by the INSERT statement in Tx2 is either be smaller or larger than all those used for Tx1, depending on which statement executes first.
    As long as the SQL statements execute in the same order when replayed from the binary log (when using statement-based replication, or in recovery scenarios), the results are the same as they were when Tx1 and Tx2 first ran. Thus, table-level locks held until the end of a statement make INSERT statements using auto-increment safe for use with statement-based replication. However, those table-level locks limit concurrency and scalability when multiple transactions are executing insert statements at the same time.
    In the preceding example, if there were no table-level lock, the value of the auto-increment column used for the INSERT in Tx2 depends on precisely when the statement executes. If the INSERT of Tx2 executes while the INSERT of Tx1 is running (rather than before it starts or after it completes), the specific auto-increment values assigned by the two INSERT statements are nondeterministic, and may vary from run to run.
    Under the consecutive lock mode, InnoDB can avoid using table-level AUTO-INC locks for simple insert statements where the number of rows is known in advance, and still preserve deterministic execution and safety for statement-based replication.
    If you are not using the binary log to replay SQL statements as part of recovery or replication, the interleaved lock mode can be used to eliminate all use of table-level AUTO-INC locks for even greater concurrency and performance, at the cost of permitting gaps in auto-increment numbers assigned by a statement and potentially having the numbers assigned by concurrently executing statements interleaved.
  • innodb_autoinc_lock_mode = 1 (consecutive lock mode)
This is the default lock mode. In this mode, bulk inserts use the special AUTO-INC table-level lock and hold it until the end of the statement. This applies to all INSERT ... SELECTREPLACE ... SELECT, and LOAD DATA statements. Only one statement holding the AUTO-INC lock can execute at a time. If the source table of the bulk insert operation is different from the target table, the AUTO-INC lock on the target table is taken after a shared lock is taken on the first row selected from the source table. If the source and target of the bulk insert operation are the same table, the AUTO-INC lock is taken after shared locks are taken on all selected rows.
Simple inserts (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes. No table-level AUTO-INC lock is used unless an AUTO-INC lock is held by another transaction. If another transaction holds an AUTO-INC lock, a simple insert waits for the AUTO-INC lock, as if it were a bulk insert.
This lock mode ensures that, in the presence of INSERT statements where the number of rows is not known in advance (and where auto-increment numbers are assigned as the statement progresses), all auto-increment values assigned by any INSERT-like statement are consecutive, and operations are safe for statement-based replication.
Simply put, this lock mode significantly improves scalability while being safe for use with statement-based replication. Further, as with traditional lock mode, auto-increment numbers assigned by any given statement are consecutive. There is no change in semantics compared to traditional mode for any statement that uses auto-increment, with one important exception.
innodb_autoinc_lock_mode = 2 (interleaved lock mode)
In this lock mode, no INSERT-like statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time. This is the fastest and most scalable lock mode, but it is not safe when using statement-based replication or recovery scenarios when SQL statements are replayed from the binary log.
In this lock mode, auto-increment values are guaranteed to be unique and monotonically increasing across all concurrently executingINSERT-like statements. However, because multiple statements can be generating numbers at the same time (that is, allocation of numbers is interleaved across statements), the values generated for the rows inserted by any given statement may not be consecutive.
If the only statements executing are simple inserts where the number of rows to be inserted is known ahead of time, there are no gaps in the numbers generated for a single statement, except for mixed-mode inserts. However, when bulk inserts are executed, there may be gaps in the auto-increment values assigned by any given statement.
How auto_increment is implemented in NDB
In The physical structure of InnoDB index pages I described how “Everything is an index in InnoDB”. This means that InnoDB must always have a “cluster key” for each table, which is normally the PRIMARY KEY. The manual has this to say in Clustered and Secondary Indexes:
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
I had previously assumed this meant that an invisible column would be used along with the same sequence generation code that is used to implement auto_increment (which itself has some scalability issues). However the reality is that they are completely different implementations.
How this is actually implemented is, as the manual says, if a table is declared with no PRIMARY KEY and no non-nullable UNIQUE KEY, InnoDB will automatically add a 6-byte (48-bit) integer column called ROW_ID to the table, and cluster the data based on that column. The column won’t be accessible to any queries nor usable for anything internally such as row-based replication.
What the manual doesn’t mention is that all tables using such ROW_ID columns share the same global sequence counter (the manual says “increases monotonically” and doesn’t clarify), which is part of the data dictionary. The maximum used value for all row IDs (well, technically the next ID to be used) is stored in the system tablespace (e.g. ibdata1) in page 7 (type SYS), within the data dictionary header (field DICT_HDR_ROW_ID).
This global sequence counter is protected by dict_sys->mutex, even for incrementing (as opposed to using atomic increment). The implementation is in include/dict0boot.ic (many blank lines deleted):
    39  row_id_t
    40  dict_sys_get_new_row_id(void)
    41  /*=========================*/
    42  {
    43          row_id_t        id;
    45          mutex_enter(&(dict_sys->mutex));
    47          id = dict_sys->row_id;
    49          if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
    51                  dict_hdr_flush_row_id();
    52          }
    54          dict_sys->row_id++;
    56          mutex_exit(&(dict_sys->mutex));
    58          return(id);
    59  }
(You may also notice that this code lacks any protection for overflowing the 48 bits allotted to row IDs. That is unnecessarily sloppy coding, but even at a continuous 1 million inserts per second [which is probably a bit optimistic ;)] it would take about 9 years to exhaust the ID space. I guess that’s okay.)

Ensuring non-conflicting IDs are generated

The counter is flushed to disk every 256th ID generated (the define DICT_HDR_ROW_ID_WRITE_MARGIN above), by modifying the value in the SYS data dictionary page, which is logged to the transaction log. On startup, InnoDB will increase the DICT_HDR_ROW_ID stored on disk by at least 256, and at most 511. This ensures that any IDs generated will have been less than the new starting value, and thus there will not be any conflicts.

Performance and contention implications

Given how much other code within InnoDB is protected by dict_sys->mutex I think it’s fair to say any tables with an implicit clustered key (ROW_ID) could expect to experience random insert stalls during operations like dropping (unrelated) tables. Parallel insertion into multiple tables with implicit keys could be performance-constrained, as it will be serialized on both the shared mutex and cache contention for the shared counter variable. Additionally, every 256th value generated will cause a log write (and flush) for the SYSpage modification, regardless of whether the transaction has committed yet (or ever will).
  • You need your table to be joinable on something
  • If you want your table to be clustered, you need some kind of a primary key.
  • If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?
In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.
Always best to have a primary key. This way it meets first normal form and allows you to continue along the database normalization path.

table should have a primary key so that you could identify each row uniquely with it.
Technically, you can have tables without a primary key, but you'll be breaking good database design rules.


Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts