PBS 102 of X: Introduction (Git)
In the previous instalment we introduced the concept of version control systems (VCS), and we described the two broad categories they fall into — client-server, and peer-to-peer or distributed. We explained why we’d be focusing on distributed version control in this series, and why I chose to explore those concepts using Git. We also laid out a big-picture roadmap for this series-within-a-series, starting with local repositories, then private remote repositories for backup and easy access, and finally shared repositories for collaboration.
In this instalment we’re going to focus on the underlying design of Git — what does it actually store, how does it represent it, and what words does it use to describe things. Trying to learn the Git commands, or even to use a Git GUI without an understanding of Git’s structure and jargon is a recipe for confusion and frustration, so this instalment is intended to nip that in the bud.
Matching Podcast Episode
Listen along to this instalment on episode 654 of the Chit Chat Across the Pond Podcast.
Before we get serious, let’s take a moment to look at the name Git. Does it mean anything? The short answer is ‘not really’. It’s not an acronym, hence it not being all uppercase, and its author, Linus Torvalds, likes to say it can mean whatever you want it to mean, or whatever fits your current mood.
In Irish and British English a git is an unpleasant person (that’s a little understated), and its author joked:
I’m an egotistical bastard, and I name all my projects after myself. First ‘Linux’, now ‘git’.
The git command’s man page describes Git as the stupid content tracker.
The README file that ships with the Git source code offers a collection of different definitions to match your mood:
- random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of get may or may not be relevant.
- stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
- global information tracker: you’re in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
- “goddamn idiotic truckload of shit”: when it breaks
Wherever the name comes from, no one seems too precious about it, which is a nice change in the world of nerdery, where something as innocuous as expressing a preference for vi over emacs can kick off a vicious troll war!
A Git Repository
A Git Repository contains the full version history of a single project along with all its related metadata, and some configuration parameters.
Git repositories are designed to be quickly and efficiently copied, or cloned, to use the Git jargon. When you clone a Git repository the full history and all the metadata get copied, but the configuration parameters don’t. This is important, because those configuration parameters are what allow you to personalise your copy of a repository. They hold information like your name, email address, and the remote copies of the repository you want to collaborate with.
On your computer, a Git repository is simply a folder with a well-defined structure.
Conceptually, a Git repository consists of a number of parts, some permanent, others not.
- The Database: the collection of objects that represents the entire committed version history of a project. This database is stored in a hidden folder named .git at the repository’s root.
- The Working Copy: the current state of the project, basically the visible content of the repository.
- The Index: a temporary data structure that represents the changes between the last committed version and the current working copy. The index keeps track of information about the changes you are making, and is used to build a version of your project for storage in the database when you commit some or all of your changes.
The Git Database
The Git database stores the entire history of a project as a collection of objects, and these objects come in just four types:
- Blobs — a blob is basically the contents of a file.
- Trees — a tree is analogous to a folder in a traditional file system. A tree object maps names to blobs and other tree objects. The recursive nature of that definition allows for an effectively infinite folder depth.
- Commits — a commit represents a point in time in a project’s history, a snapshot if you will. A commit uses a tree object to capture the state of all the files and folders in the project. As well as this tree a commit also contains metadata, like the author, the date, and an optional message describing the changes.
- Tags — tags are used to assign human-friendly names to other objects, usually commits.
All Git’s more complex structures are simply collections of blobs, trees, commits, and tags.
Also, with our programming hat on we notice many has-a relationships here. Trees have blobs, commits have a tree, and tags have a commit.
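To make those has-a relationships concrete, here is a minimal sketch of the four object types as Python data classes. The class and field names are my own invention for illustration, and the objects hold direct references to each other, which is a simplification of how Git actually links them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Blob:
    """The contents of a file; note there is no file name here."""
    data: bytes


@dataclass
class Tree:
    """Maps names to blobs and to other trees, like a folder."""
    entries: Dict[str, Union[Blob, "Tree"]] = field(default_factory=dict)


@dataclass
class Commit:
    """A snapshot: one tree plus metadata."""
    tree: Tree
    author: str
    date: str
    message: str = ""
    parent: Optional["Commit"] = None


@dataclass
class Tag:
    """A human-friendly name for another object, usually a commit."""
    name: str
    target: Commit


# A tiny made-up project: README.md at the root and main.py inside src/.
readme = Blob(b"# Demo\n")
main_py = Blob(b"print('hi')\n")
root = Tree({"README.md": readme, "src": Tree({"main.py": main_py})})
first = Commit(tree=root, author="A. Coder", date="2020-09-01",
               message="Initial commit")
release = Tag(name="v1.0.0", target=first)

# Tag has a commit, commit has a tree, tree has trees and blobs.
print(release.target.tree.entries["src"].entries["main.py"].data)
```

Following the chain from the tag down to the file contents touches each of the four object types exactly once.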
It’s SHA1 Hashes All The Way Down
Within the Git database, objects don’t have names; instead, they are always referenced by the SHA1 hash of their contents. Notice I described a blob as representing the contents of a file, not a file. What’s the difference? Files have names, blobs don’t! Trees map names to blobs and other trees, and they do so by storing name-value pairs where the names are the human-friendly file names, and the values are the SHA1 hashes of the blobs and the trees those names refer to.
This use of hashes to reference all objects gives Git inherent de-duplication. If you add the same file to your repository with multiple names and include it in multiple commits it will only ever be stored once within the Git database as a single blob that is referenced from many tree objects.
As you use Git you’ll see SHA1 hashes all over the place.
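A short sketch may help show why hashing contents gives de-duplication for free. The function below computes a blob's hash the way the git hash-object command does (a small header followed by the contents), a detail this instalment doesn't go into. The dictionary-based database and the file names are purely hypothetical stand-ins.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """SHA1 of a blob as Git computes it: a short header, then the contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# A toy object database: hash -> contents (illustration only, not Git's format).
database = {}


def store_blob(content: bytes) -> str:
    """Store contents under their hash and return that hash."""
    key = git_blob_hash(content)
    database[key] = content  # storing identical contents again changes nothing
    return key


# A toy tree: the file names live here, the blobs themselves are nameless.
tree = {
    "LICENSE.txt": store_blob(b"hello world\n"),
    "COPYING.txt": store_blob(b"hello world\n"),
}

print(tree["LICENSE.txt"] == tree["COPYING.txt"])  # True: same contents, same hash
print(len(database))                               # 1: the contents are stored once
```

Two names, one blob: the tree carries the names, and the database only ever sees the contents.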
Git Stores it all
Many version control systems work by storing a single original copy of each file and a collection of changes, or deltas. You can reconstruct any version of any file by starting with the original and applying each delta one after the other. This is very efficient in terms of storage, but computationally complex, and it only works well on text files.
Git takes a completely different approach: it stores every version of every file in its entirety. This is a much more robust approach, and allows Git to version absolutely anything. You might think this approach would make Git a real disk space hog, but don’t worry, it’s not. We’ve already mentioned that Git uses SHA1 hashes to avoid duplication, but Git does a lot more than that, including compression and some extremely clever storage algorithms.
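As a rough illustration of the idea, the sketch below keeps every version of a file whole, keyed by a hash of its contents, and compresses each one with zlib. The store dictionary and the helper names are hypothetical, and Git's real on-disk format is considerably more sophisticated than this.

```python
import hashlib
import zlib

store = {}  # hash -> compressed full contents (a toy stand-in for a database)


def save_version(content: bytes) -> str:
    """Store one complete version of a file and return its key."""
    key = hashlib.sha1(content).hexdigest()
    store[key] = zlib.compress(content)
    return key


def load_version(key: str) -> bytes:
    """Fetch any version directly, with no chain of deltas to replay."""
    return zlib.decompress(store[key])


v1 = b"line of text\n" * 200
v2 = v1 + b"one extra line\n"
k1, k2 = save_version(v1), save_version(v2)

# Both versions come back intact, and each takes far less space than its original.
print(load_version(k1) == v1, load_version(k2) == v2)
print(len(v2), len(store[k2]))
```

Because each version is stored whole, the contents can be any bytes at all, not just text.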
Branches
A commit is a snapshot of a repository’s entire state. With the exception of a repository’s first commit, each commit has a reference to its parent. In the simplest case each commit has a single parent, but the same commit can be the parent of arbitrarily many future commits. If you plot the commits on a London-Underground-style map, drawing a line between each commit and its parent, you get a tree-like structure with the initial commit at the root of the tree. For every commit at the edge of the tree there is a single path from there all the way back to the initial commit. This path is known as a branch.
Note that branches have nothing to do with tree objects. Tree objects represent files and folders, while branches represent series of commits. It’s unfortunate that Linus chose to use the same metaphor for two un-related concepts.
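Each branch has a name, and that name acts as a label pointing at the most recent commit on the branch. A new commit normally moves the label of the current branch forward, so it’s considered to be on the same branch as its parent. If a commit is instead made under a different branch name, it’s considered to be an off-shoot from the parent commit’s branch.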
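The idea of a branch as a path back to the initial commit can be sketched in a few lines. The commit IDs below are made up for readability (real ones are SHA1 hashes), and the parents dictionary is just a stand-in for the parent references stored in each commit.

```python
# Hypothetical commit IDs mapped to their parent; the initial commit has none.
parents = {
    "c1": None,
    "c2": "c1",
    "c3": "c2",
    "c4": "c2",  # a second child of c2, so the history forks here
    "c5": "c4",
}


def path_to_root(commit_id: str) -> list:
    """Follow parent references from a commit back to the initial commit."""
    path = []
    current = commit_id
    while current is not None:
        path.append(current)
        current = parents[current]
    return path


print(path_to_root("c3"))  # ['c3', 'c2', 'c1']
print(path_to_root("c5"))  # ['c5', 'c4', 'c2', 'c1']
```

The two paths share c2 and c1, which is exactly the fork you would see on the Underground-style map.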
At the moment (September 2020), the primary branch, the one used for the initial commit, is named master by default, but there is a movement in the community to move away from words with slavery-related baggage, and to use main instead.
The Index, Staging, and Stashes
The index exists to help you transition your changes into commits. With many version control systems it’s an all-or-nothing thing — you check out the code, make changes, and then check all those changes in. Git is different — you can mark individual changes for inclusion in the next commit, leaving other changes un-committed.
Git describes this process as staging, so a change that has not yet been marked for inclusion in the next commit is an un-staged change, and a change that has been marked for inclusion in the next commit is said to be staged.
When I first moved from SVN to Git I found this extra step extremely annoying. I’d always committed everything, and it was just one simple action, so having to stage and then commit felt like twice the work! Experience has corrected that initial error of judgement 🙂.
The ability to commit only some of your changes, or to split your changes across multiple well named commits is very powerful — it leads to better organised repositories that are easier to use. With SVN I would end up with commits with messages like “Fixed a typo in the docs, fixed a bug in function a() and added function b()”. With Git I can use staging to easily break that into three distinct commits, each capturing just one conceptual change — “fixed typo in docs”, “fixed bug in function a()”, and “added function b()”.
It also means that if I have to leave in a hurry and have only gotten the typo fixed and am still working on debugging function a() I can commit the typo and leave the rest unstaged so I can continue where I left off when I get back.
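Here is a toy model of that leave-in-a-hurry scenario. It is only a sketch: the dictionaries, the stage and commit helpers, and the change names are all hypothetical, and real Git tracks far more detail than a one-line description per change.

```python
# Un-staged changes in the working copy, described in words for simplicity.
working_changes = {
    "docs typo": "fixed typo in docs",
    "function a": "still debugging function a()",
    "function b": "added function b()",
}
staged = {}   # changes marked for inclusion in the next commit
history = []  # committed snapshots, oldest first


def stage(name: str) -> None:
    """Mark one change for inclusion in the next commit."""
    staged[name] = working_changes.pop(name)


def commit(message: str) -> None:
    """Commit only what has been staged, then empty the staging area."""
    history.append({"message": message, "changes": dict(staged)})
    staged.clear()


stage("docs typo")
commit("fixed typo in docs")

print([c["message"] for c in history])  # ['fixed typo in docs']
print(sorted(working_changes))          # ['function a', 'function b']
```

The typo fix is safely committed on its own, while the other two changes are still sitting un-staged, ready to be picked up again later.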
The index also allows changes to be temporarily set aside for later through a feature called stashing. I like to think of a stash as a temporary commit that only I can see. The most common time to need to stash something is when I’m working on something and the phone suddenly rings. I have to drop everything to fix a major bug in some other part of the codebase. I simply stash whatever I was doing, fix the bug, commit that change, and then un-stash what I was working on before and continue where I left off.
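The phone-call workflow can be modelled in the same toy style. Again, this is a sketch under my own assumptions: the working dictionary, the stash and unstash helpers, and the file names are hypothetical, and the commit is modelled as nothing more than a copy.

```python
working = {"feature.py": "half-finished feature"}  # what I was doing
stashes = []                                       # changes set aside for later


def stash() -> None:
    """Set the current changes aside and return to a clean working copy."""
    stashes.append(dict(working))
    working.clear()


def unstash() -> None:
    """Bring the most recently stashed changes back."""
    working.update(stashes.pop())


stash()                           # the phone rings: set the feature work aside
working["bug.py"] = "urgent fix"  # fix the major bug
committed = dict(working)         # commit that change (modelled as a plain copy)
working.clear()
unstash()                         # continue where I left off

print(working)    # {'feature.py': 'half-finished feature'}
print(committed)  # {'bug.py': 'urgent fix'}
```

The half-finished feature never ends up in the bug-fix commit, and it comes back untouched afterwards.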
The Current State of a Git Repository
At every point in time a Git repository has a currently checked out commit, and hence, a current branch. All changes in the Index are changes relative to that currently checked out commit. By default, all commits will be added to the same branch as the currently checked out commit.
When describing where you are in a project’s history the two relevant pieces of information are the currently checked out commit, and the current branch. Git stores this information within the repository itself.
Tags
While SHA1 hashes work great for allowing computers to uniquely name things, they’re really quite user-hostile! Most of the time we don’t need to remember specific commits, we’re usually interested in the most recent commit on a specific branch. But, from time to time some of our commits are in some way special, and it would be nice to give them a human-friendly name. That’s what tags are for!
How you use tags is entirely up to you, but a very common convention is to create a tag for each official release of your code, and to name that tag vX.Y.Z where X is the major version number, Y the minor version number, and Z the patch version number. For example, the commit used as the first public release would generally be tagged v1.0.0.
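If you adopt the vX.Y.Z convention, tag names become easy to check and compare in code. The sketch below uses a regular expression of my own devising; the function name is hypothetical and Git itself places no such restriction on tag names.

```python
import re

# v, then three dot-separated whole numbers, and nothing else.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_release_tag(tag: str):
    """Return (X, Y, Z) as integers for a vX.Y.Z tag, or None if it doesn't match."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(parse_release_tag("v1.0.0"))   # (1, 0, 0)
print(parse_release_tag("v2.10.3"))  # (2, 10, 3)
print(parse_release_tag("release"))  # None
```

Converting the parts to integers means the resulting tuples sort in release order, so (2, 10, 3) correctly comes after (2, 9, 0).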
Installing Git
Git is available on all major platforms, and you generally have multiple options for each platform. The official docs provide a good overview of the various options.
Personally, these are my preferences:
- Mac: I use the Xcode command line tools
- Linux: I use the version from my distribution’s default package manager (yum install git on RHEL & CentOS)
- Windows: I use the Windows Subsystem for Linux (Ubuntu)
Git GUIs
In this series we’ll be focusing on using Git from the command line because that way you really get to see what is going on. However, in the real world you’ll almost certainly be using Git in some kind of GUI.
There are many different Git GUI options across the various platforms, far too many to mention! Some are free and open source, some are commercial software, and some fall somewhere in between, adopting a freemium model of some kind. Many coding editors also have built-in Git support, including Microsoft’s free, open source, and cross-platform VS Code. The Git home page has a good listing of the most popular GUIs.
Later in the series, when we move on from local repositories to cloud-hosted repositories the free, open source and cross-platform GitHub Desktop will be a very good option.
Personally, I use the middle-tier of the freemium cross-platform client GitKraken. The free tier can be used with local repositories and public cloud repositories, but not with private cloud repositories.
Allison uses the free Windows & Mac client SourceTree.
Because Git retains its state within the repository itself, the clients don’t need to store any state information. This means you can open the same Git repository in as many clients or apps as you like at the same time. This is very useful while you’re still trying to figure out which app you prefer 🙂. You can also keep a GUI open while using the Git command line tools to watch what those tools do in real time.
A Simple Challenge
In preparation for the next instalment, install the Git command line tools on your computer. You’ll know you’ve succeeded when you can run the command git --version.
For bonus credit, install a Git GUI too.
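If you would rather check for Git from a script than by hand, something like the following sketch should do it. The function name is my own, and it simply runs the same git --version command mentioned above, assuming Python 3.7 or later.

```python
import shutil
import subprocess


def installed_git_version():
    """Return the output of `git --version`, or None if git is not on the PATH."""
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


version = installed_git_version()
if version is None:
    print("Git does not appear to be installed yet")
else:
    print(version)
```

The exact version string will vary from machine to machine; all that matters for the challenge is that you get one.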
Final Thoughts
In the previous instalment we learned about the concept of version control in general, then in this instalment we looked at Git’s implementation of those concepts. We’ve now laid a solid foundation on which to build. In the next instalment we’ll create our first repository, and commit our first changes.