Massive Technical Interviews Tips: Database Index

Saturday, March 5, 2016

Database Index

reverse indexing
https://stackoverflow.com/questions/19882785/what-is-the-point-of-reverse-indexing

In your example it refers to sequential numbers being a good application for reverse indexing. Taking the quoted number 24538, it will be inserted in the index at a certain point. The next number in the sequence will be 24539, which will be inserted in the index very close to the first number since the most significant digits are identical. Extending this, many sequential numbers will all require insertion at much the same point, involving a significant overhead in extending index blocks and rebalancing the index along the way.

The least significant digit of these numbers changes more rapidly than the most significant. Thus, reversing the order of the digits gives 83542 and 93542 respectively. These two numbers will be inserted into the index much further apart, and extending this to many numbers, the index will be built in a more balanced way, reducing the overhead in index management.

The operation of reversing the digits is trivial in computing terms, while managing the index can potentially involve many disk accesses, so inserting items in an index in a way that reduces the management overhead can deliver significant performance improvements.

https://en.wikipedia.org/wiki/Reverse_index

Reverse indexes are just as efficient as unreversed indexes for finding specific values, although they aren't helpful for range queries. Range queries are uncommon for artificial values such as sequence numbers. When searching the index, the query processor simply reverses the search target before looking it up.

https://hubshout.com/?Reverse-Index&AID=1431

https://yq.aliyun.com/articles/6901

一种是主键索引，主键索引即聚集索引（Cluster Index），它不仅有主键，而且有主键所属的全部数据，所以在Innodb中，主键索引即数据；
一种是列值为Key，主键位置为Value即 (列值, 主键位置) 的非主键索引（Secondary Index） Innodb属于索引组织表，所有的数据全部挂在主键叶子节点下。所以如果不能保证主键的插入顺序，那么会发生大量的主键节点分裂，产生大量的I/O操作。另外Innodb规定单个索引字段的长度不得超过768字节，否则截断超出长度不放入索引。 Innodb的非主键索引全部都指向主键索引，查找非主键索引无法获得整行数据，需要通过叶子节点的指针查到其主键索引的位置才能获得整行数据，所以主键索引必须设计得尽可能小，否则非主键索引将会非常的大。

1. 最左前缀匹配原则，非常重要的原则，mysql会一直向右匹配直到遇到范围查询(>、<、between、like)就停止匹配，比如a = 1 and b = 2 and c > 3 and d = 4 如果建立(a,b,c,d)顺序的索引，d是用不到索引的，如果建立(a,b,d,c)的索引则都可以用到，a,b,d的顺序可以任意调整。
2. =和in可以乱序，比如a = 1 and b = 2 and c = 3 建立(a,b,c)索引可以任意顺序，mysql的查询优化器会帮你优化成索引可以识别的形式。
3. 尽量选择区分度高的列作为索引,区分度的公式是count(distinct col)/count(*)，表示字段不重复的比例，比例越大我们扫描的记录数越少，唯一键的区分度是1，而一些状态、性别字段可能在大数据面前区分度就是0，那可能有人会问，这个比例有什么经验值吗？使用场景不同，这个值也很难确定，一般需要join的字段我们都要求是0.1以上，即平均1条扫描10条记录
4. 索引列不能参与计算，保持列“干净”，比如from_unixtime(create_time) = ‘2015-08-14’就不能使用到索引，原因很简单，b+树中存的都是数据表中的字段值，但进行检索时，需要把所有元素都应用函数才能比较，显然成本太大。所以语句应该写成create_time = unix_timestamp(‘2015-08-14’)。
5. 尽量的扩展索引，不要新建索引。比如表中已经有a的索引，现在要加(a,b)的索引，那么只需要修改原来的索引即可。
6. 在order by或者group by子句中，如果想通过索引来进行排序，所建索引列的顺序必须与order by或者group by子句的顺序一致，并且所有列的排序方向（倒序或者正序）都一样；如果查询关联多张表，则只有order by子句引用的字段全部来自第一张表时，才能利用索引来排序；order by或者group by语句与查询型语句的限制是一样的:需要满足索引的最左前缀原则；否则mysql就要执行排序操作，无法利用索引来排序；（有一种情况order by或者group by子句可以不满足最左前缀原则，就是其前导为常量的时候，如果where或者join对这些列指定了常量，就可以弥补索引的不足）。

五、举例

语句1：

语句2：

对于这两条语句，如果单独进行考虑的话，大家可能会建立两个索引；
针对语句1建立(status,netting_batch_no,debtor_agent_member_id)；
针对语句2建立(netting_batch_no,debtor_agent_member_id,transaction_currency)；
如果综合考虑来看的话，其实一个索引就够了，即(netting_batch_no,debtor_agent_member_id)，这里没必要将status或者transaction_currency字段放到索引中，因为这两个字段的区分度太差；
根据建立索引的原则2，语句1是可以走到这个索引的；
根据建立索引的原则1，语句2也是可以走到这个索引的；
索引不是越多越好，建立过多的索引会增加数据库内存或者磁盘的消耗，并且会影响到得插入、删除等操作的性能，索引在建立索引时要遵循索引建立的原则，通盘考虑；

http://www.nosqlnotes.com/technotes/learned-index-1/

如论文开篇所言，可以将传统的数据库索引(Index)视为一种模型(Model)：

B-Tree索引B-Tree索引模型将一个Key映射到一个排序数组中的某位置所对应的记录
Hash索引Hash索引模型将一个Key映射到一个无序数组中的某位置所对应的记录
Bitmap索引Bitmap索引模型用来判断一条记录是否存在

Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a recordwithin a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not.

索引常被用来加速数据库查询，过去关于索引的优化，常常聚焦于如下几点：

如何基于业务的查询模型构建最合理的索引
如何在查询中选择最佳的索引

但这些传统索引却存在一个最显著的问题：它们没有考虑数据的分布特点，往往预先假设了最差的数据分布，从而期望索引具备更高的通用性。因此，这些索引往往会导致大量的存储空间浪费，而且性能无法达到极致。

Learned Index正是借助机器学习的方法，通过对存量数据进行学习来掌握这些数据的分布特点，从而可以对现有的索引模型进行改进。基于真实数据集的测试效果来看，Learned Index较之传统的B-Tree Index，有60%~70%的性能提升，在存储空间上甚至可以节省99%。

http://www.nosqlnotes.com/technotes/learned-index-2/

Saturday, March 5, 2016

Database Index

五、举例

Labels

Popular Posts