PBS 172 of X: Submodules (Git)
We’re making a quick detour back to Git to have a look at one of the more advanced topics I intentionally skipped on our first look at Git because it didn’t solve a problem relevant to us at the time. As we move forward with the XKPasswd project that’s changing, and it’s also changing in other adjacent contexts, so the time feels ripe to revisit Git.
One of the things Git’s predecessors got badly wrong was nesting — how? By not supporting it! You can’t have a Subversion repository within a Subversion repository, but you can link a folder in one Git repository to another Git repository. In other words, you can nest Git repositories.
In Git jargon, nested repositories are Git Submodules. In other words, when you tell Git to link this folder in the repository I’m working on now to that other repository over there, you are adding a Git Submodule to your repository. Fundamentally, it really is that simple: a Git submodule is a link to another Git repository!
Matching Podcast Episode
You can also Download the MP3
Read an unedited, auto-generated transcript with chapter marks: PBS_2024_10_26
How does Git Handle Nesting?
As a quick reminder, a Git repository consists of a collection of settings, and a database of commits, with each commit being represented as a collection of files and folders with appropriate metadata.
How does Git store a submodule, that is to say a link to another repository?
At the repository level, Git simply stores two pieces of information for each submodule:
- The folder within the repository that is linked to another repository
- The URL for the linked repository
Within each commit, Git represents the entire linked repository simply as the branch and commit currently checked out from the linked repository.
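If you’re curious what that looks like under the hood, the folder-to-URL mapping lives in a file named `.gitmodules` at the root of the repository, and each commit records the linked repo as a special *gitlink* entry pointing at a specific commit. Here’s a hedged sketch of a `.gitmodules` entry with an invented submodule name and URL:

```
[submodule "libs/some-library"]
	path = libs/some-library
	url = https://github.com/example/some-library.git
	branch = main
```

You can see the matching gitlink entry in any commit with `git ls-tree` — it shows up with the special file mode `160000` and type `commit`, followed by the hash of the commit checked out in the linked repo.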
Why Nest Repositories?
Given how versatile a tool Git is, and how many ways people use it to solve so many different problems, it shouldn’t come as a surprise that there are more uses for Git submodules than anyone could possibly list. In light of this, we’re going to look at just three example uses that are important to me, and seem relevant to this audience:
- Deploying Plugin-based Web Apps
- Subdividing a Document Repository
- Including Dependencies in our Code
I’ve listed these in reverse order of relevance to the PBS series, so we’ll finish with the use case we’ll use in our examples.
Use Case 1 — Deploying Plugin-based Web Apps at Scale
In this series, we mostly view things from the developer’s point of view. It is called Programming By Stealth, after all, but from time to time we’ve also looked at the world from the sysadmin’s point of view, especially in the Chezmoi and Bash scripting miniseries. This example looks at deploying software from a sysadmin’s point of view; it was my first exposure to Git submodules, and it really opened my eyes to their power.
Lots of large web apps are built around the idea of a core product that gets expanded by plugins. A good example of this is WordPress, which powers more websites on the Internet than any other publishing platform, including Allison’s podfeet.com website!
You can install WordPress by downloading a ZIP file, expanding it, copying it into a folder on your web server, and editing a single config file to capture some basic information about your web server and website’s configuration, and to connect the code to your database server. This will give you the core WordPress web app, and no more. You can add additional themes and plugins by downloading their ZIP files, extracting them, and uploading them into the appropriate subfolder within the `wp-content` folder in your WordPress instance.
If you configure WordPress with appropriate write permissions to your web server’s disk, you can even have WordPress do plugin and theme installs for you, which automates the same process. Once you’ve given WordPress write permissions to your web server, you can even configure it to keep your plugins updated automatically!
Setting up WordPress like this works great for small sites, but this setup has some drawbacks, and it won’t scale.
Firstly, this entire setup rests on a tradeoff that makes perfect sense for small users — giving any web app write permission on your web server is inherently dangerous because the app can transform itself into literally any app it likes. You are trusting that the update mechanism is secure; otherwise, your website will inevitably be taken over and converted into something nasty. But, in exchange for accepting that risk, you have freed yourself of the chore of keeping your WordPress install and all your WordPress plugins patched. And, an unpatched WordPress site is even more likely to get taken over than a fully patched one with the power to write to disk.
When you use WordPress in a corporate environment (enterprise, education, publishing, non-profit foundation …) you need to do things differently.
Firstly, you can’t accept the risk of granting write access to your web server, so you’ll need your IT Department to take over responsibility for patching your site and your entire suite of plugins.
Secondly, you’ll need to scale out as well as up. Scaling up means throwing more resources at a single server; scaling out means adding more servers. When you scale out you need to keep the code running on each server in the cluster identical, so you can’t have each one updating its copy of WordPress and each plugin autonomously — it has to be done in a synchronised and controlled way.
Thirdly, if your WordPress instance forms a core part of your business, you’ll need to test your updates before you deploy them. This means you won’t only have multiple production servers that need to be kept in perfect sync, you’ll also have one or more staging servers where updates get tested before being pushed to production.
Obviously, you can do all of this without Git, but it turns out Git is widely used for exactly this kind of sysadmin work because it can make everything so much easier, if, and only if, you can nest Git repositories. In other words, if, and only if, you make use of Git submodules!
How Git (with Submodules) helps
What we really need is a way to capture and version not just the current state of our copy of WordPress, but also our current config file and the current state of all of our plugins, all in a single Git commit.
If you capture all three of those things in each Git commit, then when you check out a specific commit (probably by checking out a branch), your entire WordPress folder will snap into a consistent known state. To be precise, you’ll get a specific version of WordPress, a specific revision of your config, and a specific version of each and every theme & plugin.
The recipe to get here goes something like this:
- Create a Git repo with two branches you will keep permanently:
  - `main` — this will be the production version of your site’s code that gets automatically deployed to all production web servers in the cluster
  - `staging` — this will be the version of your site that gets automatically deployed to your test server(s)
- Add the official WordPress Git repo as a remote named `upstream`
- Check out your desired version of WordPress from the `upstream` remote into your `staging` branch
- Add your config file
- Link the Git repo for each theme and plugin to a specific folder within your repo by repeating the following:
  - Add the theme or plugin’s repository to your repository as a submodule
  - Change into the now-linked folder
  - Use `git fetch` to get the list of branches that exist within the theme or plugin’s repo
  - Check out your desired branch from the theme or plugin’s repo
- Once all the required themes and plugins have been added as Git submodules, commit your repository (this commit captures your current version of core WordPress, your current config, and the current version of each of your themes and plugins)
- Push your local `staging` branch to your Git server to trigger a deployment to your test servers, and make sure it all works
- Merge your `staging` branch into your `main` branch and push again, triggering a deployment to your production servers
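To make the recipe more concrete, here’s a hedged command-line sketch of the initial setup. The remote URLs, version numbers, and plugin name are purely illustrative; this is a sketch of the idea, not instructions for a real deployment:

```
# A sketch of the recipe above (all URLs, versions & names are illustrative)
git init wordpress-site && cd wordpress-site
git remote add origin git@your-git-server:web/wordpress-site.git
git remote add upstream https://github.com/WordPress/WordPress.git

# Create the staging branch at the desired WordPress release
git fetch upstream --tags
git checkout -b staging 6.6.2

# Add your config file (assumed to already exist in the folder)
git add wp-config.php

# Pin each theme & plugin by adding it as a submodule and checking out a version
git submodule add https://github.com/example/some-plugin.git wp-content/plugins/some-plugin
( cd wp-content/plugins/some-plugin && git fetch && git checkout v2.3.1 )

# One commit now captures core WordPress, the config, and every pinned plugin
git commit -am 'Initial site: core WordPress, config, and pinned plugins'

# Push staging to deploy to the test servers; once happy, create main from it
git push -u origin staging
git checkout -b main
git push -u origin main
```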
This assumes your sysadmins have used Git’s trigger features to handle the work of actually deploying the code to the servers, and making any needed substitutions to tweak the config for the staging and production environments. This is all bread-and-butter stuff for sysadmins, and it gets wrapped up in the jargon term CI-CD, which stands for Continuous Integration-Continuous Deployment. GitHub Actions are an example of a CI-CD service.
What’s very important to note here is that when you deploy your repository to your web servers, they all get the identical versions of each of the following:
- Core WordPress
- Your Config
- Each and every installed theme and plugin
This is already cool, but the real magic comes in how you can now handle updates.
To update core WordPress, you can now follow these basic steps:
- Create a new WIP branch from your `main` branch and check it out
- Merge the desired version of WordPress into your WIP branch from the `upstream` remote, i.e. from the official WordPress Git repo
- Merge your WIP branch into your `staging` branch and push to trigger a deployment to your test environment
- Test!
- Merge your WIP branch into your `main` branch, and push to trigger a deployment to your live website
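In command form, that update flow might look something like this (the version number and branch names are, again, just illustrative):

```
# Update core WordPress (version and branch names are illustrative)
git checkout -b wip-wp-update main
git fetch upstream --tags
git merge 6.7               # merge the new WordPress release into the WIP branch

git checkout staging
git merge wip-wp-update
git push                    # deploys to the test environment, now test!

git checkout main
git merge wip-wp-update
git push                    # deploys to the live website
```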
This doesn’t make obvious use of submodules, but updating themes and plugins does — the process to handle theme and plugin updates now becomes:
- Create a new WIP branch from your `main` branch and check it out
- For each theme and plugin you want to update, repeat the following:
  - Change into the relevant folder — the `git` command is now interacting with the plugin’s repo
  - Fetch all new commits, tags, and branches from the theme/plugin’s repo with a `git fetch`
  - Check out the desired branch, tag, or commit
- Change back to the root of your Git repo, and commit — this captures the new desired state for each theme and plugin you updated
- Merge your WIP branch into your `staging` branch and push to trigger a deployment to your test servers
- Test!
- Merge your WIP branch into your `main` branch and push to trigger a deployment to your live website
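Again, a hedged sketch of what that might look like at the command line, with an invented plugin name and version:

```
# Update a single plugin (names and versions are illustrative)
git checkout -b wip-plugin-updates main

cd wp-content/plugins/some-plugin    # git now talks to the plugin's repo
git fetch
git checkout v2.4.0                  # the new version you want to deploy
cd ../../..                          # back to the root of your own repo

git commit -am 'Update some-plugin to v2.4.0'

git checkout staging && git merge wip-plugin-updates && git push    # test servers
# ...test...
git checkout main && git merge wip-plugin-updates && git push       # production
```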
Finally, let’s look at how easy it is to add another web server to your production or testing cluster — all you need to do is check out either your `main` or `staging` branch, and you have a full copy of your site’s entire codebase ready to go!
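As a hedged example, on a brand new server the whole job could be as simple as a single clone (the URL and path below are invented for illustration). The `--recurse-submodules` flag is the important detail; it tells Git to also download the content of every submodule, i.e. every theme and plugin:

```
# One command gets core WordPress, the config, and all themes & plugins
git clone --branch main --recurse-submodules \
  git@your-git-server:web/wordpress-site.git /var/www/site
```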
With my work hat on I never did this with WordPress, but it did completely change how I worked with a different, even bigger web app with even more plugins — the open source Virtual Learning Environment Moodle. I started my sysadmin career manually managing a Moodle deployment by downloading and extracting ZIP files and then using things like `rsync` to move files around, and when I transitioned away from being a sysadmin to become a cybersecurity specialist a little over two decades later, I was managing the same Moodle instance via Git with submodules. Wow, what a difference!
Use Case 2 — Breaking up a Document Repository
While we may primarily think of Git as a developer tool, and perhaps also a sysadmin tool, it’s also a powerful tool for writers. These very show notes are versioned in Git!
There is an evolving trend away from traditional database-driven content management systems (CMSs) like WordPress towards so-called static site generators. These are basically folders of text files, usually in Markdown, that get converted into a basic HTML, CSS & JavaScript website each time a page is changed. With Git, these kinds of folders-of-files can be automatically deployed to the web using CI-CD tools, just like in our WordPress use case above. In fact, that’s exactly how this web page you’re reading right now came to be! I’m writing this text in Markdown using Typora on my Mac, and when I get to the next logical break in the flow I’ll commit my changes on a WIP branch. When Allison is done editing the matching podcast episode she’ll merge the WIP branch for this instalment into the `main` branch, triggering GitHub Pages to build a fresh copy of the entire PBS website from the folder of Markdown files versioned in Git. This is Git+CI-CD in action for publishing!
None of this requires Git submodules, but now imagine using a static site generator to build a much larger site, one for an organisation that has sub-divisions of some kind. You want to allow each section of the organisation to manage their own web content, but only their own web content, and you may want to introduce some kind of review process. One way to achieve this would be with Git submodules.
There would be one master Git repository controlled by the part of the organisation that’s ultimately responsible for the entire website, and they would connect their repository to test and production versions of the website using some kind of CI-CD pipeline. They would then create Git submodules for each sub-division within the organisation, each mapped to an appropriate folder in the website’s hierarchy.
Remember that a submodule is just a Git repository checked out within another Git repository; it is fundamentally no different to any other Git repository, so it can be cloned by itself just like any other repository.
In this scenario, each sub-division of the organisation works on the Git repository representing their part of the site as a stand-alone Git repository. They can’t see or edit any other pages. They can implement their own processes and procedures within their repo, but there needs to be some kind of agreed way for their repo to signal to the repo for the entire site that there’s a change ready to be incorporated into the main site.
Probably the most logical approach would be to add a CI-CD trigger to each sub-division’s repository that fires when a commit is added to the `main` branch. This trigger could cause the repo for the full site to pull the changes into its `staging` branch, publish the changes to the test website, and open an issue in some kind of issue tracker requesting review. Once the change is reviewed and approved, it would be merged into the site’s `main` branch and deployed to the production website.
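What would the full-site repo actually do when that trigger fires? Something along these lines; the folder and branch names are invented for illustration:

```
# In the full-site repo, pull a sub-division's latest published content
git checkout staging
cd divisions/widgets-dept            # the sub-division's submodule folder
git fetch
git checkout origin/main             # move to their latest main commit
cd ../..
git commit -am 'Pull latest widgets-dept content into staging'
git push                             # publishes to the test website for review
```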
So far, this doesn’t seem particularly relevant to this series, but there is another way folders of Markdown files are used to build structured content — knowledge base apps like Obsidian.
I’ve become very fond of using Markdown+Git as my writing environment, and I want to capture all the segments I produce for my own podcasts and for other people’s podcasts in a single versioned Obsidian knowledge base. The addition of some YAML front matter lets me add context to these contributions, so I can quickly answer useful questions like:
- What was the most recent segment I contributed on AI?
- How many EVs have I reviewed on the Kilowatt Podcast?
- What segments did I produce in June 2024?
- Build a list of the links to every segment I’ve produced about VPNs
And so on and so forth.
If I was the solo writer for all this content it would work great, but I have a long-standing relationship with Allison where we collaboratively edit notes for Security Bits and other guest segments I write for the NosillaCast.
Because I trust Allison completely, I had no objection to giving her full access to my full knowledge base, which worked great for me, but things looked rather different from Allison’s point of view. Instead of seeing one folder with sensible sub-folders for Security Bits and other segments, she saw a massive wall of folders, one for every network I contribute to, and only one of them was relevant to her. She had to clone all those folders onto her Mac, and navigate down into an irrelevant hierarchy just to get to the Security Bits notes!
Solving this problem for both of us is what triggered this revival of the Git miniseries 🙂
Now that we know about Git submodules the answer is obvious — I create a dedicated repo for just my collaborations with Allison, and I add it to my master knowledge base as a submodule.
In this setup, I still have my single folder of Markdown files that Obsidian can display and process as a single knowledge base, and Allison sees a traditional Git repository with just the files relevant to my contributions to her shows!
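In practice that setup amounts to just a couple of commands; the repo URL and folder name below are placeholders rather than my actual setup:

```
# From the root of my knowledge base repo (names are placeholders)
git submodule add git@github.com:example/nosillacast-collabs.git NosillaCast
git commit -m 'Add the NosillaCast collaboration repo as a submodule'
```

From my side, the `NosillaCast` folder behaves like any other folder in my Obsidian vault; from Allison’s side, it’s a small stand-alone repo she can clone directly.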
Use Case 3 — Including Dependencies in our Code
Let’s now firmly place our developer hat on our heads and look at how we can choose to include external resources in our coding projects.
We’ve seen that when we want to include published third-party libraries like jQuery or Bootstrap in our projects we can choose to avoid the whole question of how to copy these resources into our projects by simply linking to them on public content delivery networks. We’ve also seen how we can use a package manager like NPM to do on-demand importing by versioning just a JSON file in our Git repo that specifies the packages our code depends on. When someone tries to use our code they need to run the package manager (NPM in our case) to fetch their own copies of the packages after they clone our repo. Finally, we’ve also looked at using bundlers like Webpack to re-package the external resources within our published code. This is how the beta version of the new XKPasswd website currently works.
All of these options work great when we’re read-only consumers of publicly available code, but what happens when either or both of those things change? What if we need to contribute back to the code we’re incorporating into our project? Or, what if that code is not published publicly? Or both‽
This is something we’ll soon need to address in the XKPasswd project. As I write this instalment the code for the entire XKPasswd JavaScript port is contained in a single Git repository — that includes both the open source library itself and the beta version of the public web interface. The intention was always to have the library be stand-alone so it can be incorporated into other projects, one of which would be the final public version of the website. That means the current repository will eventually be split in two: one repo for the library, and one for the web interface.
This change will simplify the structure and build process for the library’s repository, but it will require the repository for the web interface to import the code for the library in some way. In the medium to long term this will probably be done using a package manager like NPM because the library will become a stable published package like jQuery. But, in the short term, it seems likely contributors will want to edit both the library and the web interface simultaneously because both will be under active development. While no final decision has been made on exactly how these repos will be structured, my current preference is to add the repo for the library to the repo for the website as a Git submodule.
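If that’s how things end up, the wiring would be something like the sketch below. To be clear, the repo and folder names are placeholders; no final structure has been decided yet:

```
# In the (hypothetical) website repo, pull in the library as a submodule
git submodule add https://github.com/example/xkpasswd-library.git lib/xkpasswd
git commit -m 'Add the XKPasswd library as a submodule'

# Contributors cloning the website repo then grab the library in one go with:
git clone --recurse-submodules https://github.com/example/xkpasswd-website.git
```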
This handles the situation where all the code is public, and we want read/write access to all parts. What about scenarios where none of the code is public?
The classic example here is a piece of internal code within an organisation that is used in multiple projects. Imagine an organisation that has multiple web apps, and they want them all to have consistent branding. One way to do this would be to have a standard theme versioned in its own repository that is then included in the repositories for each web app. This is private code where developers on any project may want to contribute to the shared resource, so a package manager, even an entirely private one, seems like a poor fit. A much better approach would be to include the standard theme in each project as a Git submodule, allowing all developers to contribute to both their own app and the shared theme simultaneously from within their own repositories.
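As a final hedged sketch (all names invented), here’s what that might look like from inside one of the web app repos, including how a developer could push a fix back to the shared theme without ever leaving their own project’s checkout:

```
# In one of the web app repos, share the corporate theme via a submodule
git submodule add git@git.example.com:design/corporate-theme.git src/theme
git commit -m 'Include the shared corporate theme as a submodule'

# Later, a developer fixes the theme from within their app's checkout
cd src/theme
git checkout -b fix-footer-contrast
# ...edit the theme, then commit and push from inside the submodule,
# because it's a full repo in its own right...
git commit -am 'Improve footer contrast'
git push -u origin fix-footer-contrast
```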
In the next instalment we’re going to use this final use case as our worked example as we explore how to actually use Git submodules.
Final Thoughts
Hopefully, you now understand why Git submodules are so powerful, and why you might want to take the time to learn how to use them. That sets us up well for the next instalment, where we’ll learn the `git` commands needed for working with submodules. Learning the commands will give you the lexicon Git uses when working with submodules, which should help you understand the same functionality in whatever Git GUI you prefer to use.