Do (not) Self-Host your repos

Table of Contents

  1. Why You Should Self-Host
  2. What about GitLab and other Competitors?
  3. Why You Shouldn’t Self-Host
  4. So what’s a geek to do?
  5. What am I going to do?

Once upon a time, GitHub was a successful geek enterprise. Then Microsoft bought it, and folks started arguing that you should abandon ship. You should self-host your repos, they say.

I 100% agree, and 100% disagree. Let me explain.

GitHub’s been a benevolent host. When they bought NPM, they went from being a de facto piece of internet infrastructure to an actual piece of critical infrastructure. At this point you may as well argue that we should sever our connection to the electric company.

But there’s a difference between needing the electric company and ignoring the fact that sometimes the power goes out. Usually it’s temporary. Outages of days aren’t uncommon in wintry climes. But every now and then, it’s more. In the case of internet SaaS apps, it’s sometimes permanent.

Why You Should Self-Host

The biggest argument, of course, is backups and continuity of service. If something happened and you lost access to your GitHub account right now, what would you lose? If you maintain a project with many users and contributors, what would it do to your community and your work?

GitHub isn’t just where your Git repos are hosted; it’s also where all the metadata that surrounds them lives. That means years of Issues, Pull Requests, wiki pages, and frequently the project’s web site.

I’ve derived a ton of value from old tickets on GitHub projects. They regularly answer questions and provide workarounds. The takedown of a popular repo could be a huge loss.

GitHub suspends accounts and deletes repositories for a variety of reasons. If yours were targeted for seemingly “no reason”, it wouldn’t be the first.

Even if you believe it won’t, or can’t, happen to you, there’s another reason to consider self-hosting: lock-in. Right now (September 2022) GitHub is very good about letting you export everything. As far as I know, there’s no data you can put into GitHub that you can’t also get out via their APIs. In theory.

  Our migration has been in progress for a while, but every time we run it we find that our attempts were missing some data, so we need to re-run it to fetch the missing data. If we had higher limits it wouldn’t take so long to find these issues, but GitHub doesn’t provide tokens with higher limits, so we have to work around that. We even had people with [GitHub Enterprise] subscriptions ask, and they got the same response.

Gitea is an open source GitHub alternative whose development started out on GitHub. Eventually the software got really robust, and they decided to host their own repo and issues. The problem is, they’ve got over 20,000 Issues & PRs they want to export. Thousands of those are still open, and the longer it takes them to cut over, the larger those numbers get.

To be fair, 99% of GitHub users wouldn’t have this problem. Most of us don’t have projects that are popular enough. But companies who use GitHub do, and there’s no advantage for GitHub in helping large projects or customers leave. They may even consider locking that metadata down to guarantee they stay.

What about GitLab and other Competitors?

You can apply the same reasoning to hosted GitHub alternatives. Plus, none of them are as financially successful as GitHub. They’re far more likely to go out of business and screw everyone who uses them.

Why You Shouldn’t Self-Host

Right now it’s incredibly improbable that you’ll have your account suspended. The stories of suspended accounts are rare. Most of them seem to stem from government action, or “Terms of Service” violations that most geeks react to with “Yup. Seems fair.”

There’s also the fact that self-hosting is work. But let’s set that aside for the moment. Let’s pretend there’s some magical free button you can click that takes care of everything. Boop. You’re self-hosted now.

Let’s also assume you’ve got a project that’s popular / useful enough to get Issues and the occasional Pull Request.

How is that going to happen on your new site? Well, first they’re going to need to set up an account on your server. Are they going to think it’s worth it? Worth it enough to give their email to yet another stranger? If they want to submit a fix / feature, is it also worth it to figure out how to fork it in this not-GitHub? Worth it to figure out how its PR equivalents work? These probably aren’t hard, but every little difference they have to navigate is another reason to not bother.

And then there’s discovery. GitHub isn’t only the default place to look for code. They’ve also got massive “website authority” in Google. It doesn’t matter if you’re the only place hosting your thing. Any project on GitHub that’s remotely similar to your search terms is likely to come up before your site.

But even if you’re ok with all of that, there’s an even bigger reason you shouldn’t self-host. A reason we shouldn’t encourage folks to self-host: sooner or later you’ll get sick of bothering, or you’ll die. One is guaranteed. The other has lots of historical precedent.

I’ve bookmarked thousands of useful web sites over the years. A huge chunk of those were useful, insightful blog posts about development. Hundreds of them don’t exist anymore, mostly because the person who ran them simply didn’t feel like bothering with it anymore. How many millions of others have poofed out of existence?

But a repo isn’t just some random thoughts. Almost every repo contains some useful tool or library. Sooner or later, someone’s going to need that. There are thousands, probably millions, of little repos out there containing the only bit of code in existence that solves that particular problem. Sometimes they’re just the only thing that solves it in that particular language. When you stop hosting it, for whatever reason, that useful tool ceases to exist. Even if it’s “just a port”, that’s a lot of work that someone will need to recreate, and debug.

So what’s a geek to do?

Before I answer that, let me note that this bit gets into implementation strategies. If you just came for the question of “should I, or shouldn’t I?”, you can stop reading. Thanks for making it this far. Good luck, whatever you choose.

So, what to do…

Short short version: If lots of people rely on your project, then mirror your code to a large competitor and accept that you may lose access to your issues, PRs, etc. Stick a note in your README so that folks know where to find your mirror before lightning strikes.

If you’re like most open source developers, you’ve got some repos that are mostly ignored by the world, and receive essentially no issues or PRs. No-one’s going to be stressed if your repo disappears for a week. So: install Gitea on your computer, tell it to mirror your repos, and forget about it. Or just clone them all locally and set up a cron job to regularly pull down changes.
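If you take the cron route, the script can be trivial. Here’s a minimal sketch in Python, assuming your local clones all live under one directory (the directory, path, and schedule are placeholders):

    #!/usr/bin/env python3
    # Pull the latest changes into every git clone under one directory.
    # A crontab line like "0 3 * * * python3 /home/you/pull_mirrors.py"
    # (path and schedule are placeholders) would run it nightly.
    import subprocess
    from pathlib import Path

    MIRROR_ROOT = Path.home() / "mirrors"  # hypothetical; wherever your clones live

    for repo in sorted(MIRROR_ROOT.iterdir()):
        if not (repo / ".git").is_dir():
            continue  # skip anything that isn't a normal git clone
        # --ff-only: a backup copy should never diverge from upstream.
        result = subprocess.run(
            ["git", "-C", str(repo), "pull", "--ff-only"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            print(f"{repo.name}: pull failed\n{result.stderr.strip()}")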

Longer version: I think that depends a lot on what, if any, metadata exists in your GitHub repos. Most devs have small repos that are mostly just used by them. If this sounds like you, then simply keeping an up-to-date copy of them on your home computer could be fine.

Things are different if you maintain some repo that other people actually care about enough to make issues, or even contribute code to.

In that case my personal advice is to host your stuff on GitHub, but have it mirrored to a second system. Ideally you’d want all your issues, PRs, etc. mirrored to it too. But as far as I know, that’s not actually something you can do. It’s easy to keep the code synced to a mirror; it’s the metadata that’s a problem. To be clear, I’m not saying it’s not technically possible. I’m saying I don’t think anyone’s actually written the code to keep those synced between GitHub and any open source alternative. It’s a hard problem, because Issues, PRs, comments, and just about everything else can be edited.
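For the code half, git itself does the heavy lifting. A minimal sketch, assuming you’ve already created an empty repo on the second system (both URLs are placeholders):

    #!/usr/bin/env python3
    # One-time setup of a full mirror of a repo on a second host.
    # Both URLs are placeholders for illustration.
    import subprocess

    SOURCE = "https://github.com/you/yourproject.git"       # hypothetical upstream
    MIRROR = "https://git.example.com/you/yourproject.git"  # hypothetical second host

    # --mirror clones *every* ref (branches, tags, notes), not just HEAD.
    subprocess.run(["git", "clone", "--mirror", SOURCE, "yourproject.git"], check=True)
    # --mirror on push makes the remote's refs exactly match the local copy.
    subprocess.run(["git", "-C", "yourproject.git", "push", "--mirror", MIRROR], check=True)

After the initial clone, re-running git remote update followed by git push --mirror keeps the copy current, and Gitea’s built-in mirroring does essentially this for you on a schedule. The metadata is another story.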

So, that leaves you with regularly backing up your stuff. I’m going to assume you’re not someone with ~10k open & closed issues, plus another ~10k PRs, like Gitea. So exporting your issues is probably something that’s actually doable. Exporting. Not syncing. However, the more things you have to export, and the more frequently you want to back them up, the more likely you are to hit your rate limits.

If you go this route, you’ve got two options. You could use something like the gitea-github-migrator, which can export everything. But tools like that are made to create a new repo in a GitHub alternative, not to sync with one. You’d be constantly churning: deleting old repos and setting up new ones.

A better solution might be something like github-issues-mirror, which can download all your issues as JSON files. Keep that running in the background, and if worst comes to worst you can write an importer for wherever you’ve been mirroring your code.
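If you’d rather roll your own exporter, the basics are only a few lines against GitHub’s REST API. A sketch, assuming the third-party requests library and a personal access token (owner, repo, and token are placeholders; note that the issues endpoint also returns PRs, since GitHub models PRs as issues):

    #!/usr/bin/env python3
    # Export every issue (and PR) of one repo to a single JSON file.
    # Needs the third-party "requests" library. OWNER, REPO, TOKEN are placeholders.
    import json
    import time
    import requests

    OWNER, REPO = "you", "yourproject"  # hypothetical repo
    TOKEN = "a-personal-access-token"   # placeholder

    issues, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{OWNER}/{REPO}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers={"Authorization": f"token {TOKEN}"},
        )
        resp.raise_for_status()
        # Back off politely if we've burned through the rate limit.
        if int(resp.headers.get("X-RateLimit-Remaining", "1")) == 0:
            time.sleep(max(int(resp.headers["X-RateLimit-Reset"]) - time.time(), 0) + 1)
        batch = resp.json()
        if not batch:
            break
        issues.extend(batch)
        page += 1

    with open(f"{REPO}-issues.json", "w") as f:
        json.dump(issues, f, indent=2)

Comments live at a separate endpoint (issues/{number}/comments), so a complete backup needs a second pass for those. The rate-limit check starts to matter once you’re exporting more than a few hundred items.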

What am I going to do?

I think I’m going to install Gitea on my Raspberry Pi. I’ll tell it to mirror all my repos, and all the repos of useful tools I want to always have a copy of. I do maintain some repos which are somewhat useful to other people. If GitHub deletes them, no-one will be screwed, and I can take my time uploading a backup somewhere else. I’ll stick the Pi in the basement with the router and back it up to our NAS, which backs up to B2.
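If you’d rather script that setup than click through Gitea’s web UI for every repo, its API can create the mirrors for you. A sketch, assuming a recent Gitea and its repos/migrate endpoint (the host, token, owner, and repo list are all placeholders):

    #!/usr/bin/env python3
    # Ask a Gitea instance to create pull mirrors of some GitHub repos
    # via its /repos/migrate API. Host, token, and repo list are placeholders.
    import requests

    GITEA = "http://raspberrypi.local:3000"  # hypothetical Gitea instance
    TOKEN = "a-gitea-access-token"           # placeholder
    REPOS = ["you/yourproject", "someone/useful-tool"]  # hypothetical list

    for full_name in REPOS:
        name = full_name.split("/")[1]
        resp = requests.post(
            f"{GITEA}/api/v1/repos/migrate",
            headers={"Authorization": f"token {TOKEN}"},
            json={
                "clone_addr": f"https://github.com/{full_name}.git",
                "repo_name": name,
                "repo_owner": "you",  # your Gitea user (or an org)
                "mirror": True,       # Gitea will keep re-fetching on a schedule
            },
        )
        print(full_name, resp.status_code)

Once that’s run, Gitea re-fetches each mirror on a schedule, and the Pi quietly does the rest.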