PBS 112 of X: Introducing Remotes (Git)
In our initial introduction to version control we introduced the concepts of the client-server model and the peer-to-peer model. When we moved on to Git specifically we learned that Git uses the peer-to-peer model, but from then until now we’ve utterly ignored that aspect of Git, focusing instead on the basics of using a single repository. We’ve now covered all this Git basics in enough detail that it’s time to broaden our view of the Git universe, and start exploring how Git facilitates collaboration.
Throughout our Git journey we’ve learned that Git uses English words in very Git-specific ways. Sure, committing had a meaning in English, but it has a different and narrower meaning in Git. We’ve already picked up a lot of Git jargon in this series-within-a-series, but we’re about to encounter a lot more of it, and it’s very important we familiarise ourselves with it before we continue.
This entire instalment is going to be dedicated to describing Git’s philosophy on collaboration, and the precise meaning Git assigns to otherwise ordinary looking English words like remote, push, and pull. This is the kind of instalment I expect many of you will bookmark for quick reference later.
Matching Podcast Episode
Listen along to this instalment on episode 677 of the Chit Chat Across the Pond Podcast.
You can also Download the MP3
Git is Peer-to-Peer — We Add any Desired Order
In the world of computer networking, what does it mean to be peer-to-peer? It means that at a technological level, all the communicating parties are equal. That doesn’t mean they are equal in every sense, just that the technology doesn’t impose any kind of hierarchy. To Git, a repository is a repository is a repository. At a technological level, Git is anarchy!
It’s up to use humans to impose our desired level of order on our uses of Git. What ever rules or hierarchies or constructs we want to apply we implement using access controls, permissions, policies, and practices. This is what makes Git so flexible. Git can power software development anywhere from the most ad-hoc and un-structured open source community to the most rigidly controlled high-tech company. There could be no rules at all, or a 10 page form to complete before every merge, it’s up to the user to decide!
Every Actor Gets a Repository
To take part in a peer-to-peer network you need to be a peer! In the Git universe that means collaboration happens repository-to-repository. Every actor in a Git process has a repository, be they humans, organisations, or things. BTW, note that a human working on multiple computers should probably have multiple repos, one on each computer!
It seems obvious in a peer-to-peer world that each developer working on a project would have their personal Git repository to work in, but what may be less obvious is that the team or organisation running the project should have a repository too, and if the code is being deployed on servers, those should probably also have repositories. If the project is a collaboration among different teams, or if there are outside contractors involved, those should have repositories too.
Who or what does or doesn’t get a repository depends on the model being used for the project. The possibilities are infinite, but let’s describe some common scenarios to help illustrate the point.
Scenario 1 — A Lone Developer
You code to scratch your own itch. You want to use version control to avoid chaos, and to protect your work. You have a desktop and a laptop, and you have a NAS in your home.
You would typically have three repositories:
- A Primary repository on the NAS that you consider the main copy, and that you back up regularly.
- A working repository on your laptop that you sync with the primary as and when it suits.
- A working repository on your desktop that you sync with the primary as and when it suits.
Since you’re the only human involved, and since all communication is within your home network you don’t need to worry about access control or permissions. You simply implement the practices that work for you.
It’s interesting to note that what you’ve done here is used a peer-to-peer technology to implement a pseudo-client-server model. Each machine you code on is a client, communicating only with a single server through which all communication flows.
Scenario 2 — A Small Open Source Project
You work with a few others to manage a small piece of open source code, you share it with the world, and you accept contributions from strangers after you review them.
You start by hosting a publicly accessible repository online using a service like GitHub or GitLab. You allow the world to read from your repository, but only your personal repository, and perhaps those of a few other trusted contributors to write to the public repository. Other repositories can submit commits into a kind of half-way-house we’ll discus later. From there you or one of your trusted contributors reviews these proposed commits and accepts or rejects them.
In this model the publicly hosted repository is considered authoritative, and you’ve used access controls and permissions to impose some rules around who can add to this authoritative repository.
Scenario 3 — A Major Web App
You run an e-commerce company selling widgets, and you have a large web store. You have a team of in-house developers, and you partner with a hosting company to publish your site.
Rather than using a public Git hosting service like GitHub, you run a self-hosted equivalent within your organisation’s corporate network to host the authoritative repository for the site. You heavily restrict access, allowing only your local developer’s personal repositories read and write access to the authoritative repository, and even then, they can only submit changes for review, they can’t force a commit into the main branch.
Your hosting partner also runs a Git server, and that server host two branches, a staging branch, and a production branch. They grant your central repository read and write access to the staging branch, and you send commits from your main branch to that staging branch when ever your want to start the rollout of a site update. Your hosting partner runs git repositories on your staging servers, and when you send a commit to the staging branch, an automation is triggered that causes the repositories on the staging servers to update themselves from the staging branch in your hosting partner’s repository, putting your new code life on a staging site f or you to test. When you’re happy with the result you approve the change for production, and another automation is triggered, merging the staging branch into the production branch, and triggering repositories on the production servers to get the newly approved code and deploy it.
In this case you have repositories for people, organisations, and servers, all working together in a very controlled way using access controls and permissions to enforce some rules, and policies, procedures, and even automations to implement others.
What I’ve just described is a good example of a modern so-called dev-ops workflow. This kind of approach is part of my daily life as a sysadmin in the 2020s.
I’m ‘Local’
So far I have diligently avoided using any Git keywords when describing how Git repositories interact. Let’s change that!
Let’s start with the simplest piece of Git jargon – the repository you’re working in at any given times is referred to as being the local repository. When ever you see references to local branches or local commits, those are simply branches and commits in your repository.
It’s All About ‘Remotes’
Given that Git refers to the current repository as local, it probably comes as no surprise that all other repositories are referred to as remotes.
A Git repository can store arbitrarily many references to other Git repositories. These references consist of a URL and a name. You can think of them as links or shortcuts, but Git calls these links remotes. A remote is not another repository, it’s the local repository’s reference to another repository.
While it is possible to do one-off communications between repositories with just a bare URL, you’ll almost never do that, instead, you’ll add a remote to your repository for every other repository you’ll be communicating with.
In the above scenario 1 you would add a reference to your primary repository on the NAS as a single remote in both the repository on your laptop, and the one on your desktop.
In scenario 2 each approved developer would add one remote to their personal repository referencing the project’s publicly hosted repository, and users of the open source project could add remotes to their personal repository also referencing the project’s public repository. The trusted developers could read and write, but others could only read and suggest writes.
In scenario 3 each developer would add one remote to their repository to reference the corporation’s authoritative repository, the corporation’s repository would have one remote referencing the hosting company’s repository, and the hosting company would have a remote for each staging and production server.
Note that only the repository initiating a communication to another repository needs to add a remote, regardless of whether it’s reads or writes being communicated.
Fetching, Pushing & Pulling
Before a Git repository can meaningly communicate with one of its remotes it needs to know what commits, branches, and tags exist in the remote repository. Git repositories keep a copy of the latest known state of each of their remotes, but those need to be updated before every communication. The act of updating your repository’s local copy of a remote repository’s content is known in Git as fetching.
Unless you go out of your way to tell it otherwise, Git will download all commits, tags, and branches in each remote. Because commits are named by their cryptographic hash, there is generally very little wasted disk space, because each commit is stored only once, regardless of how many local and remote branches it appears on. This default behaviour is usually not a problem, but there is one common scenario where you might want to tell Git to only fetch a sub-set of a remote repository, and that’s on servers hosting an app or website. The primary repository contains all commits ever, whether or not they ever made it into production, but the copy on the server only needs the production branch. With appropriate commands and flags it’s possible to tell Git to only fetch specific remote branches.
The important thing to note is that the commits on remote branches actually exist within your repository, so you can interact with them even when you’re offline.
Once you’ve fetched a remote repository you can merge the remote branches with your local branches as desired.
Importing commits from a remote branch into a local branch is referred to as pulling, and writing commits from a local brach into a remote branch is referred to as pushing.
So, to work with a remote you need to fetch, then push and/or pull as desired.
Remote Naming Conventions
Since a remote is just a name-value pair where the value is a URL, the name really can be just about anything. With very good reason, there are commonly used conventions around naming remotes.
Origin
By far the most common convention is to name the remote for a primary copy of a repository on some kind of server origin. Usually, the copy on the server exists first, and you then copy it onto your computer, so the server really is your local repository’s origin.
If you’re a GitHub or GitLab user you’ll generally create your repositories in the cloud, then make a copy of them in your computer(s). If you use the standard GitHub or GitLab tools to do this, then a remote named origin will be automatically added to your local repository that references the cloud-hosted primary repository.
Upstream
The second most common convention is to refer to the primary repository of a parent project of some kind as upstream. If you’re a GitHub or GitLab user who contributes to open source projects you’ll typically start by making your own copy of the project’s primary repository in your GitHub/GitLab account. In that cloud-hosted repository the project’s primary repository will generally be added as a remote named origin. When you then your cloud copy to your computer a reference to your cloud copy will get added as a remote named origin, but you may also want to communicate with the open source project’s primary repository from your computer, so it should be added as a remote too. It’s that reference that is named origin by convention.
So, if you host your own private repository in the cloud your local copy will have a remote referencing the cloud copy as origin. If you copy and open source project’s repository into your cloud account, then copy that to your computer, your local copy will have two remotes, one referencing your cloud copy named origin, and one referencing the original repository named upstream.
Bare Repositories
In all the scenarios above, only developers personal repositories are actually used for development work. That means that only those personal repositories have any use for a working copy. Having a working copy on a repository sitting your NAS, your corporate server, or a cloud service like GitHub or GitLab makes no sense.
Repositories that don’t need working copies don’t need to have them. When creating a repository your can specify that no working copy should be created. Git refers to this kind of repository as a bare repository, and by convention, they are stored in folders who’s names end in .git
.
Using Remotes
The Git commands address fetched remote branches and tags using the naming scheme /
. For example, the main
branch on the remote named origin
is origin/main
.
We can use fetched remote branches largely like local branches. We can diff against them, and we can merge them into our local branches to import remote changes.
Tracking Branches
While you can merge changes from any remote branch into any local branch, you’ll usually find you merge from the same remote branch into the same local branch continuously. In fact, you’ll usually be working on local mirrors of remote branches where the names are the same. E.g. I’m editing these shownotes on a local branch named pbs112-wip
that is a mirror of the remote branch origin/pbs112-wip
. I always want to merge commits on origin/pbs112-wip
into pbs112-wip
, and I always want to copy the commits I make on pbs112-wip
to origin/pbs112-wip
.
I could manually fetch origin
and then merge origin/pbs112-wip
into pbs112-wip
each time I want to incorporate remote changes, but that would be a lot of typing, and quite the pain in the you-know-what. Thankfully, I don’t have to do that, and neither does anyone else 🙂
Git supports permanent mappings between local and remote branches. In Git jargon, we can set pbs112-wip
to track origin/pbs112-wip
. We can then pull remote changes into our local branch with a single command (which performs a fetch followed by a merge), and we can push our changes to the remote branch.
Note that the names of the local and remote branches don’t have to be the same. Most of they time they will be, and you definitely shouldn’t arbitrarily mis-match the names just because you can, but there are times when it’s very useful to have different local and remote names.
In the real world the only time I consistently track branches with differing names is when working with third-party contractors. Why? Because there is no reason their naming conventions should match ours! I name my local branches based on our naming convention, and their name their remote branches based on theirs. As an example, we have a vendor that pre-fixes every branch name with a unique customer reference. To them we are just one of many deployments of an open source web app they host, but to us we have just one copy of this web app, and we know who we are, so, my local branches are simply called stage
and prod
, but theirs are of the form pbsUni-12345-staging
and pbsUni-12345-production
.
Git URL Schemes & Protocols
Remember that a remote is simply a reference to another repository consisting of a name and a URL. The name is self explanatory, but the URL warrants some deeper explanation.
URLs take the form of a protocol followed by addressing information, and a colon us used to separate the two. For various reasons most URL schemes in use start with two slashes, so we can generally thing of URLs taking the form: ://
(but there are exceptions like mailto:
).
In Git, the URL scheme tells Git the protocol it should use to connect to the remote repository. Git supports five protocols, but two of them are legacy, and we’ll be ignoring another one in this series.
The simplest protocol is the so-called local protocol which is specified with the file://
URL scheme. This protocol simply reads from the local file system, so it can be used to reference any Git repository that’s accessible as a folder in your local computer. That could be a folder on your internal hard drive, a folder on an external drive, or a folder on a mapped network drive, perhaps shared from a NAS.
The local protocol is in fact the default protocol, so the URL scheme is optional, to Git, file:///Users/Bart/myFirstRepo.git
and /Users/Bart/myFirstRepo.git
are the same.
Notice that Git once again doesn’t do a good job of naming things. Firstly, the word local is re-used with a different meaning, and secondly, you can have local remotes!
We’ll use the local protocol to start our exploration of Git remotes in the next instalment.
Of the truly remote protocols, the original one was built specially for Git and is really quite useless. It requires a dedicated daemon, and offers no access control what so ever. You may sometimes still see it used to publish read-only repositories, and you’ll recognise it by its URL scheme of git://
.
Particularly in the early days if Git it was very common to wrap the Git protocol in SSH to give it both security and authentication. This gives us URLs with the ssh://
scheme. We won’t be using the SSH protocol in this instalment.
The reason the SSH protocol was initially very popular is that the first attempt at an HTTP protocol for Git was read-only. The Git documentation actually refers to this protocol as Dumb HTTP
!
Thankfully modern versions of Git support and updated read/write version of the HTTP protocol (which the documentation calls Smart HTTP), and this is now the most commonly used protocol for accessing remote repositories. Since the protocol actually uses secure HTTP it’s specified with the https://
scheme.
When we move on to using truly remote repositories we’ll be using this protocol.
Final Thoughts
Now that we’ve explained the concepts underpinning Git’s client-server model, and familiarised ourselves with the most important relevant jargon, we’re ready start actually using remote repositories in the next instalment. We’ll start simple using remotes on our local file system, and then in later instalments we’ll move on to remotes hosted on cloud services like GitHub and GitLab via the modern smart HTTP protocol.