Massive Technical Interviews Tips: Git Internals

Saturday, December 12, 2015

Git Internals

https://github.com/pluralsight/git-internals-pdf
https://raw.githubusercontent.com/pluralsight/git-internals-pdf/master/drafts/peepcode-git.pdf
Git is a stupid content tracker. That is probably the best description of it – don’t think of it in a ‘like (insert favorite SCM system), but…’ context, but more like a really interesting file system.

When most SCMs store a new version of a project, they store the code delta or diff. When Git stores a new version of a project, it stores a new tree: a bunch of blobs of content and a collection of pointers that can be expanded back out into a full directory of files and subdirectories. If you want a diff between two versions, it doesn’t add up all the deltas; it simply looks at the two trees and runs a new diff on them.
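To make the "look at the two trees" idea concrete, here is a minimal sketch, assuming each version is flattened to a mapping from path to content id. The function name diff_snapshots and the ids other than a906cb and a874b7 are made up for illustration; this is not how Git itself implements diffing.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict mapping path -> content id."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    # A path present in both versions changed only if its content id differs.
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical snapshots; the ids stand in for abbreviated SHA-1s.
v1 = {"README": "a906cb", "Rakefile": "a874b7", "lib/simplegit.rb": "0bb3aa"}
v2 = {"README": "a906cb", "Rakefile": "1f2e3d", "lib/extra.rb": "9c4d5e"}
changes = diff_snapshots(v1, v2)
```

No stored deltas are consulted here: the comparison only needs the two snapshots, and unchanged entries are skipped by comparing ids rather than contents.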

Focus and Design
Non-Linear Development
Git is optimized for cheap and efficient branching and merging. It is built to be worked on simultaneously by many people, with multiple branches developed by individual developers and merged, branched and re-merged constantly. Because of this, branching is incredibly cheap and merging is incredibly easy.

Distributed Development
Git is built to make distributed development simple. No repository is special or central in Git: each clone is basically equal and could generally replace any other one at any time. It works completely offline or with hundreds of remote repositories that can push to and/or fetch from each other over several simple and standard protocols.

Efficiency
Git is also efficient in its network operations: the common Git transfer protocols transfer only packed versions of only the objects that have changed. It also won’t try to transfer content twice, so if you have the same file under two different names, it will only transfer the content once.

A Toolkit Design

The Blob In Git
The contents of files are stored as blobs.
It is important to note that it is the contents that are stored, not the files. The names and modes of the files are not stored with the blob, just the contents.

This means that if you have two files anywhere in your project that are exactly the same, even if they have different names, Git will only store the blob once. This also means that during repository transfers, such as clones or fetches, Git will only transfer the blob once, then expand it out into multiple files upon checkout.
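A minimal sketch of why identical contents collapse into one blob, assuming the default SHA-1 object format: the blob id is the SHA-1 of a small header ("blob", the size, a NUL byte) followed by the raw bytes, so the file name never enters the hash. The store dictionary and add_file helper are illustrative names, not part of Git.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the id Git would give a blob with these bytes (SHA-1 format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical in-memory object store keyed by blob id.
store = {}


def add_file(content: bytes) -> str:
    oid = blob_id(content)
    store.setdefault(oid, content)  # a second identical file adds nothing
    return oid


readme = add_file(b"same bytes\n")
copy_of_readme = add_file(b"same bytes\n")  # different name, same content
```

Both calls return the same id and the store holds a single entry, which is the behaviour described above for files that differ only by name.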

The Tree
Directories in Git basically correspond to trees.
A tree is a simple list of trees and blobs that the tree contains, along with the names and modes of those trees and blobs. The contents section of a tree object consists of a very simple text file that lists the mode, type, name and sha of each entry.
100644 blob a906cb README
100644 blob a874b7 Rakefile
040000 tree fe8971 lib
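The listing above is the pretty-printed view. As a rough sketch of what is hashed underneath, assuming the SHA-1 object format, each entry is written as the mode, a space, the name, a NUL byte and the 20 raw bytes of the entry's id, and a directory's mode is written as 40000 without the leading zero. The helper names and file contents below are made up for illustration.

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def tree_id(entries):
    """entries: iterable of (mode, name, hex_sha) tuples for one directory."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, treating directories as if they
        # ended with a slash.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_sha in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_sha)  # 20 raw bytes, not 40 hex characters
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical contents for a lib/ subdirectory and the top-level tree.
lib = tree_id([("100644", "simplegit.rb", blob_id(b"# hypothetical\n"))])
root = tree_id([
    ("100644", "README", blob_id(b"hypothetical readme\n")),
    ("100644", "Rakefile", blob_id(b"hypothetical rakefile\n")),
    ("40000", "lib", lib),
])
```

Because the tree id depends on the ids of its entries, changing any file changes the id of every tree above it, while untouched subtrees keep their ids and are reused.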

The Commit
So where does the ‘history’ part of ‘tree history storage system’ come in? The answer is the commit object.

The commit is very simple, much like the tree. It simply points to a tree and keeps an author, committer, message and any parent commits that directly preceded it.
tree e1b3ec
parent a11bef
author Scott Chacon <schacon@gmail.com> 1205624433
committer Scott Chacon <schacon@gmail.com> 1205624433

my second commit, which is better than the first
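The commit layout above can be sketched as a small serializer, assuming the SHA-1 object format. The function name commit_id is illustrative, and the +0000 timezone suffix is an assumption added here because real Git records a timezone after the timestamp; the tree id used is the id of an empty tree.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Serialize a commit the way the listing above lays it out, then hash it."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # zero, one or several parents
    lines.append("author " + author)
    lines.append("committer " + committer)
    # A blank line separates the header fields from the message.
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


who = "Scott Chacon <schacon@gmail.com> 1205624433 +0000"  # timezone assumed
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

first = commit_id(empty_tree, [], who, who, "my first commit")
second = commit_id(empty_tree, [first], who, who,
                   "my second commit, which is better than the first")
```

Since the parent id is part of the hashed text, the second commit's id pins down the whole history behind it.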

The Tag
This is an object that provides a permanent shorthand name for a particular commit. It contains an object, type, tag, tagger and a message. Normally the type is commit and the object is the SHA-1 of the commit you’re tagging. The tag can also be GPG signed, providing cryptographic integrity to a release or version.
object 0576fa
type commit
tag v0.1
tagger Scott Chacon <schacon@gmail.com> 1205624655
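Tags and commits share the same "field value" header layout followed by a message, so a single small parser can read both. This is a minimal sketch with an illustrative function name; it does not handle multi-line header values such as embedded signatures, and the message text in the example is hypothetical since the listing above does not show one.

```python
def parse_object_text(text):
    """Split a commit or tag body into (header fields, message)."""
    head, _, message = text.partition("\n\n")
    fields = {}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        # A field may repeat (a merge commit has several parent lines).
        fields.setdefault(key, []).append(value)
    return fields, message


tag_text = (
    "object 0576fa\n"
    "type commit\n"
    "tag v0.1\n"
    "tagger Scott Chacon <schacon@gmail.com> 1205624655\n"
    "\n"
    "hypothetical tag message\n"
)
fields, message = parse_object_text(tag_text)
```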

The Git Data Model
In computer science speak, the Git object data is a directed acyclic graph. That is, starting at any commit you can traverse its parents in one direction and there is no chain that begins and ends with the same object.

All commit objects point to a tree and optionally to previous commits. All trees point to one or many blobs and/or trees. Given this simple model, we can store and retrieve vast histories of complex trees of arbitrarily changing content quickly and efficiently.
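Walking history is then a plain graph traversal over parent pointers. A minimal sketch, assuming the commits are already loaded into a dictionary mapping each commit id to its list of parent ids (the ids c1 to c4 are made up):

```python
from collections import deque


def ancestors(commits, start):
    """Return every commit reachable from start, breadth-first."""
    order = []
    visited = {start}
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in commits[commit]:
            # A commit reachable along two paths (after a merge) is
            # visited only once.
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return order


# Hypothetical history: c2 and c3 both branch from c1, and c4 merges them.
history = {"c1": [], "c2": ["c1"], "c3": ["c1"], "c4": ["c2", "c3"]}
order = ancestors(history, "c4")
```

Because the graph is acyclic the walk always terminates, and the visited set is only there to avoid repeating shared ancestors.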

References
In addition to the Git objects, which are immutable (that is, they cannot ever be changed), Git also stores references. Unlike the objects, references can constantly change. They are simple pointers to a particular commit, something like a tag, but easily moveable.

Examples of references are branches and remotes. A branch in Git is nothing more than a file in the .git/refs/heads/ directory that contains the SHA-1 of the most recent commit of that branch. To branch that line of development, all Git does is create a new file in that directory that points to the same SHA-1. As you continue to commit, one of the branches will keep changing to point to the new commit SHA-1s, while the other one can stay where it was.
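A minimal sketch of branches as files, run against a throwaway directory rather than a real repository. The helper names and the placeholder ids are made up, and real repositories may also keep refs in other places, so treat this as an illustration of the layout described above only.

```python
import os
import tempfile


def write_branch(git_dir, branch, sha):
    heads = os.path.join(git_dir, "refs", "heads")
    os.makedirs(heads, exist_ok=True)
    with open(os.path.join(heads, branch), "w", encoding="ascii") as fh:
        fh.write(sha + "\n")


def read_branch(git_dir, branch):
    path = os.path.join(git_dir, "refs", "heads", branch)
    with open(path, "r", encoding="ascii") as fh:
        return fh.read().strip()


git_dir = os.path.join(tempfile.mkdtemp(), ".git")
first = "a" * 40   # placeholder ids, not real commits
second = "b" * 40

write_branch(git_dir, "master", first)
# Branching: a new file holding the same SHA-1.
write_branch(git_dir, "topic", read_branch(git_dir, "master"))
# Committing on topic moves only that pointer.
write_branch(git_dir, "topic", second)
```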

Branching and Merging
In Git the act of creating a new branch is simply writing a file in the .git/refs/heads directory that has the SHA-1 of the last commit for that branch.

Creating a branch is nothing more than just writing 40 characters to a file.

Switching to that branch simply means having Git make your working directory look like the tree that SHA-1 points to and updating the HEAD file, so that each commit from that point on moves that branch pointer forward (in other words, it changes the 40 characters in .git/refs/heads/[current_branch_name] to be the SHA-1 of your last commit).
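The role of HEAD can be sketched the same way. In this illustration HEAD holds a line of the form "ref: refs/heads/<branch>", and recording a commit rewrites whichever branch file HEAD names. The function names and ids are hypothetical, and updating the working directory is left out entirely.

```python
import os
import tempfile


def write_file(path, text):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="ascii") as fh:
        fh.write(text)


def read_file(path):
    with open(path, "r", encoding="ascii") as fh:
        return fh.read().strip()


def switch_branch(git_dir, branch):
    # Only the HEAD update is modelled; no files are checked out.
    write_file(os.path.join(git_dir, "HEAD"), "ref: refs/heads/" + branch + "\n")


def current_ref(git_dir):
    head = read_file(os.path.join(git_dir, "HEAD"))
    if not head.startswith("ref: "):
        raise ValueError("HEAD does not name a branch")
    return head[len("ref: "):]


def record_commit(git_dir, new_sha):
    # Move the branch HEAD points at forward to the new commit.
    ref_path = os.path.join(git_dir, *current_ref(git_dir).split("/"))
    write_file(ref_path, new_sha + "\n")


repo = os.path.join(tempfile.mkdtemp(), ".git")
write_file(os.path.join(repo, "refs", "heads", "master"), "a" * 40 + "\n")
switch_branch(repo, "master")
# Create topic at the same commit, switch to it, then commit.
write_file(os.path.join(repo, "refs", "heads", "topic"), "a" * 40 + "\n")
switch_branch(repo, "topic")
record_commit(repo, "b" * 40)
```

After these steps master still holds its original id and only topic has moved, because the commit followed HEAD to the branch it named.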

Remotes
Remotes are basically pointers to branches in other people's copies of the same repository, often on other computers. If you got your repository by cloning it, rather than initializing it, you should have a remote for the place you copied it from, automatically added as origin by default. The tree that was checked out during your initial clone would then be referenced as origin/master, which means “the master branch of the origin remote.”
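As a small sketch of the naming, a branch on the remote (refs/heads/master there) is tracked locally under refs/remotes/<remote>/, which is where the short name origin/master comes from. The function name is made up, and the mapping shown assumes the default fetch configuration that a clone sets up.

```python
def tracking_ref(remote, remote_ref):
    """Map a branch ref on the remote to the local remote-tracking ref."""
    prefix = "refs/heads/"
    if not remote_ref.startswith(prefix):
        raise ValueError("not a branch ref: " + remote_ref)
    return "refs/remotes/" + remote + "/" + remote_ref[len(prefix):]


local = tracking_ref("origin", "refs/heads/master")
short_name = local[len("refs/remotes/"):]
```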
