Massive Technical Interviews Tips: Git

Friday, December 11, 2015

Git - Counting Objects

http://githubengineering.com/counting-objects/

$ git clone https://github.com/torvalds/linux
Cloning into 'linux'...
remote: Counting objects: 4350078, done.
remote: Compressing objects: 100% (4677/4677), done.
Receiving objects:   4% (191786/4350078), 78.19 MiB | 8.70 MiB/s

The objects

All data in a Git repository is stored in the shape of a Directed Acyclic Graph. The history of the repository is ordered throughout time by the way commits are linked to each other. Each commit has a link to its parent, the commit that came right before it; some commits have two parents, the result of merging two branches together. At the same time, each commit has a link to a tree. The tree is a snapshot of the contents of the working directory in the moment where the commit was created. It contains links to all the files in the root of your repository, and links to other subtrees which are the folders and recursively link to more files and subtrees.

There's a very specific case of a fetch operation: a fresh clone, where little negotiation is necessary. The server offers the tips of its branches, and the client wants all of them, because it doesn't have any.

This is when this whole ordeal starts becoming expensive, because the server has little information on what to actually send. Git doesn't keep a definite list of all objects reachable from the graph, and it cannot send every single object in its database as a whole, because it could very well be that some of those objects are not reachable at all in the repository and should be thrown away instead of sent to the client. The only thing Git knows are the tips of all branches, so its only option is to walk down the graph, all the way to the beginning of the history, listing every single object that needs to be sent.

Each commit in the history is loaded and added to the list of objects to send, and from each commit its tree is also expanded, and blobs and subtrees are also added to the list. This process scales with the size of the history of the repository (i.e. the amount of commits) and the actual size of the repository. The more files that exist on the repository, the longer it takes to iterate through the trees of each commit to find the few blobs that haven't been listed yet.

Our initial attempts involved caching the result of every clone in order to replay it to other clients asking for the same data. We quickly found out that this approach was not elegant and definitely not scalable.

When it comes to caching clones, the repositories that matter the most are the most active ones. This makes caching the full result of a clone pointless, since the state of the repository changes constantly. On top of that, the initial caching still takes the same unreasonable amount of CPU time, and the size of the cache is about the same as that of the repository on disk. Each cached response roughly doubles the amount of disk space needed by the repository.

Because of these reasons, we decided against pursuing the caching approach. Generally speaking, caching specific results to queries is a weak approach to performance in complex systems. What you want to do instead is caching intermediate steps of the computation, to be able to efficiently answer any kind of query. In this case, we were looking for a system that would not only allow us to serve clones efficiently, but also complex fetches. Caching responses cannot accomplish this.

http://www.ruanyifeng.com/blog/2015/09/git-bitmap.html

清点对象的原始算法如下。

列出本地所有分支最新的一个commit

列出远程所有分支最新的一个commit

两者进行比较，只要有不同，就意味着分支发生变动

每一个发生变动的commit，都清点其中具体变动的子目录和文件

追溯到当前commit的父节点，重复第四步，直至本地与远程的历史一致为止

加总所有需要变动的对象

上面的过程说明，"清点对象"是一个文件遍历算法，变动的对象会被一一清点到，这就意味着大量的文件读操作。对于大型代码库来说，这个过程非常慢。

Github团队想到的新算法，是建立一个Bitmap索引，即为每一个commit生成一个二进制值。

打开本地Github仓库的.git/objects/pack/目录，你会看到一个索引文件和一个数据文件，它们就是Bitmap。简单说，这两个文件索引了当前代码库的所有对象，然后使用一个二进制值代表这些对象。有多少个对象，这个二进制值就有多少位。它的第n位，就代表数据文件里面的第n个对象。

每个commit都会有一个对应的二进制值，表示当前快照包含的所有对象。这些对象对应的二进制位都为1，其他二进制位都为0。

这样做的好处是，不用读取commit对象，只要读取这个二进制值，就会知道当前commit包含了哪些节点。更妙的是，两个二进制值只要做一次XOR运算，就会知道哪些位（即哪些对象）发生了变动。而且，因为新的对象总是添加到现有二进制位的后面，所以只要读取多出来的那些位，就知道当前commit比上一次commit多出了哪些对象。

这样一来，"清点对象"就变成了二进制值的比较运算，因此速度极快。进一步的介绍，请参看官方文档《Bitmap的解释》，《Bitmap的格式》。

Friday, December 11, 2015

Git - Counting Objects

The objects

Labels

Popular Posts