Doing a PhD right: Backups and Source Control
This post is a part of the series "Doing a PhD right".
This post introduces PhD students to source control tools, so that losing data or files need never be a concern. Source control is applicable to any field and to any type of file (with some care needed for very large datasets, discussed at the end).
Commonly used data management
Many (most?) students have quite the Frankenstein setup, either storing data solely on their personal laptop (literally the worst solution you can have), or perhaps on an external hard drive (or five) stashed in various locations.
The more savvy students have a Cloud-based solution like OneDrive or Dropbox, or maybe a University-supplied network drive. That is certainly an improvement, but it could be better.
The above solutions have the following issues:
- Only using local storage is high-risk. Laptops get lost, and external hard drives get dropped.
- They're not capturing versions and progress. Overwriting your files with the latest version means you lose the history of your progress. Or, you've got a hundred files with names ending with '...final', '...final final', '...reviewed final'.
- To add to the above point: your backups capture your mistakes too. When you need to, you can't easily go back to an earlier version, either to revert to it or to look something up.
- They're out of date. Unless you're extremely diligent and back up to your external hard drive regularly (or have automatic syncing enabled), your backups are likely at least a day out of date.
Source control
You may have heard of source control (e.g. Git, SVN), particularly if you're in a STEM-related field, although (too) many still have not.
What is source control?
At their core, these tools let us "commit" our work to a source control repository at any time. Each commit essentially represents a snapshot of the file(s) at that moment in time. Commits are stored in chronological order and are not overwritten by later commits, so we can see the full history of every commit we make (even when spread over many years).
A commit may contain a single file or an entire folder's worth of files. The type of file does not matter: source control will accept anything.
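To make the idea of "snapshots that never get overwritten" concrete, here is a minimal toy model in Python. It is only a sketch of the concept, not how Git or SVN actually store data, and the names (ToyRepo, commit, checkout) are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Commit:
    """One snapshot of the project's files (hypothetical toy structure)."""
    message: str
    files: dict          # file name -> file contents at commit time
    timestamp: datetime


@dataclass
class ToyRepo:
    """In-memory stand-in for a repository; illustration only."""
    history: list = field(default_factory=list)

    def commit(self, files: dict, message: str) -> int:
        # Copy the files so later edits cannot alter the stored snapshot.
        snapshot = Commit(message, dict(files), datetime.now(timezone.utc))
        self.history.append(snapshot)   # append only: nothing is overwritten
        return len(self.history) - 1    # position in the history acts as an ID

    def checkout(self, commit_id: int) -> dict:
        # Return a copy of the files exactly as they were at that commit.
        return dict(self.history[commit_id].files)


repo = ToyRepo()
working = {"thesis.txt": "Chapter 1 draft"}
first = repo.commit(working, "First draft of chapter 1")

# Keep working: edit one file and add another, then commit again.
working["thesis.txt"] = "Chapter 1 draft, revised"
working["results.csv"] = "site,count\nA,12\n"
second = repo.commit(working, "Revise chapter 1 and add results")

print([c.message for c in repo.history])   # full history, oldest first
print(repo.checkout(first))                # the project as it was at the first commit
```

Even after the second commit, the first snapshot is still there unchanged, which is the property that plain "overwrite the file" backups lack.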
Figure 1 is a visualisation of a source control repository. Each node represents a single commit, and the main pathway (in blue) represents our chronological history of commits.
A slightly more advanced concept is branching (in green). Perhaps you want to try an experimental feature and keep track of it (via commits), but you don't want it to affect the work along your main commit path. Choosing one of the commits on your main path, you can branch off to form a new, separate path. If desired, you can later merge this branch back into your main pathway.
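Branching can be sketched with a similar toy model. Here each commit remembers the commit(s) it was built on, and a branch is just a name pointing at its latest commit (Git's branches work along broadly similar lines). The class and method names are hypothetical, and the sketch only tracks commit messages; combining the actual file contents during a merge is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    message: str
    parents: tuple = ()   # earlier commit(s) this one builds on


class ToyRepo:
    """Toy model of branches as named pointers to commits; illustration only."""

    def __init__(self):
        root = Commit("Start project")
        self.branches = {"main": root}   # branch name -> its latest commit

    def commit(self, branch, message):
        new = Commit(message, parents=(self.branches[branch],))
        self.branches[branch] = new
        return new

    def branch(self, new_name, start_from):
        # A new branch starts out pointing at an existing commit.
        self.branches[new_name] = self.branches[start_from]

    def merge(self, source, target):
        # A merge commit has two parents: the latest commit of each branch.
        merged = Commit(f"Merge {source} into {target}",
                        parents=(self.branches[target], self.branches[source]))
        self.branches[target] = merged
        return merged

    def log(self, branch):
        # Follow first parents back to the start (the branch's own path).
        commit, messages = self.branches[branch], []
        while commit is not None:
            messages.append(commit.message)
            commit = commit.parents[0] if commit.parents else None
        return messages


repo = ToyRepo()
repo.commit("main", "Add baseline analysis")
repo.branch("experiment", start_from="main")
repo.commit("experiment", "Try alternative model")
repo.commit("main", "Fix typo in plot labels")
print(repo.log("main"))         # the experiment is not part of main yet
repo.merge("experiment", "main")
print(repo.log("main"))         # main now ends with the merge commit
```

The experimental commit never touches the main path until the merge, yet it is tracked the whole time.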
This is just a surface-level view of source control, but already we can imagine how it could be a vital tool for our projects.
How to integrate source control into your workflow
Let's use a case study to put it into context: you're analysing ecological data using RStudio.
All the files and data related to your project are stored in an eco_analysis folder. As you work you create many files (e.g. R scripts, images, CSV files, etc.). Periodically, you commit your eco_analysis folder to source control.
The source control repository will now have a full history of every change you've made, whether it be to an R script, modifying an image, or adding new data.
Suppose a script has started returning incorrect results, despite working perfectly the last time you used it three months before. You've likely forgotten some of what was happening in this script or what has changed since. When and where was the bug introduced?
By reviewing the commit logs you can see exactly what changes were introduced at each commit. You can even check out an old commit to get the entire snapshot of your eco_analysis folder at that point in time, and re-run the scripts to see whether the bug was present then.
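With a long history, checking every old commit by hand gets tedious. If you know an old commit worked and the latest one doesn't, you can halve the search each time by testing the commit in the middle (Git offers a git bisect command that automates a search of this kind). Below is a minimal Python sketch of the idea. The snapshot data and the is_good check are invented for illustration; in practice the check would be "check out this commit and re-run the script".

```python
def first_bad_commit(snapshots, is_good):
    """Return the index of the first snapshot where is_good() fails.

    Assumes the oldest snapshot is good, the newest is bad, and that once
    the bug appears it stays in every later snapshot.
    """
    low, high = 0, len(snapshots) - 1   # low: known good, high: known bad
    while high - low > 1:
        middle = (low + high) // 2
        if is_good(snapshots[middle]):
            low = middle     # bug was introduced after the middle commit
        else:
            high = middle    # bug was introduced at or before the middle commit
    return high


# Hypothetical history: each snapshot holds the scaling factor a script used.
snapshots = [
    {"message": "Initial analysis", "scale": 1.0},
    {"message": "Add site B data", "scale": 1.0},
    {"message": "Refactor plotting", "scale": 1.0},
    {"message": "Tidy up constants", "scale": 10.0},   # bug introduced here
    {"message": "Add summary table", "scale": 10.0},
    {"message": "Update figures", "scale": 10.0},
]


def is_good(snapshot):
    # Stand-in for "check out this commit and re-run the script".
    return snapshot["scale"] * 3 == 3.0


culprit = first_bad_commit(snapshots, is_good)
print(snapshots[culprit]["message"])   # prints "Tidy up constants"
```

Six commits need only two checks here; the saving grows quickly as the history gets longer.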
This concept works with any type of project: the document you are writing your thesis in, the code you write (particularly useful!), different versions of your analysis or results, and so on.
How to get started
Source control is not only for the tech-savvy, which is a prime motivator behind writing this post.
Whilst many options exist, the most accessible is likely GitHub.
A free tier is available, and you can host your repository as either private or public. It's also Cloud-based, which gives you an offsite backup entirely separate from your work computer and local storage.
Furthermore, by hosting online you can work easily with collaborators, where each team member's commits are stored and easily viewable.
How to use Git (and GitHub) is outside the scope of this post, but plenty of great resources exist online. At the very least, I hope the above provides enough of an introduction to get started.
Conclusion
I myself rely heavily on source control. My supervisor and I both commit to the same source control repository, which ensures our work is always in sync. Recently, I found that a major bug had crept into my code at some point over the previous 6 months without my realising. Being able to check out an old version of my code was extremely valuable, as I could then work out which of my changes had introduced the bug.
My hope with this post is to encourage students to move on from outdated practices such as storing data on external hard drives, or missing out on a commit history by only using something like OneDrive. Source control is (these days) very accessible to anyone, and pairs very well with our multi-year, complex PhD journeys.
Alternatives to GitHub
Many source control options exist, so I suggest exploring what works best for you. A great idea is to talk to your University, as they likely have a partnership with a source control host, or may even host their own source control repositories.
Working with projects with large datasets
I would not recommend storing large datasets within a source control repository, because the entire repository would become quite large and hard to work with. For example, if you wanted to check out an older commit, you may not want or need to check out the entire dataset too.
In such situations, it can be best to store your datasets separately from the rest of your project. However, the importance of data management remains: have an offsite (preferably Cloud-based) backup, and keep it up to date.
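One way to check that a dataset backup really is up to date is to compare file checksums between the working copy and the backup. The sketch below assumes the backup is reachable as an ordinary folder (for example a synced or mounted drive); the function names and the demo file names are hypothetical, and it is an illustration rather than a replacement for a proper backup tool.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 digest of a file, read in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_files(data_dir, backup_dir):
    """List files in data_dir that are missing from, or differ in, backup_dir."""
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    stale = []
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(data_dir)
        copy = backup_dir / relative
        if not copy.is_file() or file_digest(copy) != file_digest(path):
            stale.append(relative.as_posix())
    return stale


# Demonstration with throwaway folders standing in for real locations.
with tempfile.TemporaryDirectory() as tmp:
    data, backup = Path(tmp, "datasets"), Path(tmp, "backup")
    data.mkdir()
    backup.mkdir()
    (data / "survey_2023.csv").write_text("site,count\nA,12\n")
    (backup / "survey_2023.csv").write_text("site,count\nA,12\n")
    (data / "survey_2024.csv").write_text("site,count\nA,15\n")   # not backed up yet
    result = stale_files(data, backup)
    print(result)   # files that still need backing up
```

Pointing stale_files at your real dataset folder and backup location gives a quick list of anything that still needs copying across.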