Most scientists, including ecologists and evolutionary biologists, rely on computational tools for their research [@doi:10.1109/SECSE.2009.5069155]. Researchers use software packages or write their own code for tasks ranging from data management, data analysis, and study replication to the application and development of tools for hypothesis testing. Maintaining code for scientific collaboration requires an efficient and well-documented workflow [@doi:10.1038/d41586-020-02462-7]. To facilitate this process, scientists have been adopting tools from information and systems technology, such as cloud-based services for documentation and version control [@doi:10.1038/538127a]. These include the Google Suite, the Microsoft Suite, Dropbox, and GitHub. Within the spectrum of cloud-based services for collaboration, GitHub is uniquely positioned (Table 1) to benefit scientists because it is specifically designed to store computer code, track changes to it, and enable collaboration on it, all fundamental components of modern research. However, many researchers lack exposure to sound programming and coding practices and thus dedicate valuable time and effort to teaching themselves research-facilitating tools. In this paper, we provide researchers in ecology and evolutionary biology (EEB) with practical workflows for using GitHub, the most used web-based platform for computational version control and collaboration, to facilitate their collaborative research and code management practices.
GitHub (https://github.com) provides a simple but powerful web interface that allows users to participate in projects by contributing to, modifying, and discussing existing code, reporting bugs, discovering code and data, and publishing new code. It stores files and directories in repositories, and keeps a chronological record of modifications [@doi:10.1080/00031305.2017.1399928] (see Box 1). GitHub also integrates several communication features, such as GitHub Issues, GitHub Discussions, and GitHub Pages, which allow users to engage in discussions, to plan and collaborate on code, and to publish information to a webpage (see Box 1). These features are an improvement over traditional practices of directly sharing files through email, other cloud-based services, or physical storage units, which can become challenging and time-consuming in long-term and collaborative projects [@doi:10.1186/1751-0473-8-7]. By making it easier to integrate versioning, communication, and collaboration with research code and data, GitHub helps facilitate open science practices in research projects [@doi:10.1371/journal.pcbi.1004947].
Git is the version control system that enables the collaborative tools available on GitHub. Git was initially developed as a fast, lightweight, and open-source system to allow software engineers to efficiently develop and collaborate on projects [@https://git-scm.com/book/en/v2/Getting-Started-A-Short-History-of-Git]. Since its launch in 2005, Git has become the leading version control system in software development and in other disciplines that require collaboration and community contributions, such as scientific research [@doi:10.1109/MS.2012.61]. To understand how GitHub keeps track of changes to files and folders, it helps to know the basic concepts of Git (such as commit, push, pull, and checkout; see Box 1). However, the GitHub web-based platform and its graphical desktop client (GitHub Desktop) allow users to perform most repository and data management operations without using the command line, making these functionalities available even to users who are less familiar with software development.
Version control involves tracking the state of the files and directories which are stored in a "repository" (see Box 1).
A typical workflow using Git and GitHub is to: (i) create a remote repository that is synchronized with files and directories stored locally; (ii) modify these files, either locally or remotely; (iii) frequently "commit" (or record) changes to these files (see Box 1) along with a description of modifications; (iv) synchronize commits with GitHub (see "push" and "pull" in Box 1) so that the repository on the web and the local repositories are up-to-date.
The repository, which contains files, their modifications, and the description of their changes can then be accessed by chosen collaborators or, whenever applicable, the public, who can easily download and synchronize them to their own computers (see "clone" in Box 1).
Commits act like snapshots, allowing users to view or even revert the state of the project to any previous commit.
Git stores plain-text files compactly: unchanged files are not duplicated, and successive versions are compressed against one another, allowing frequent commits without causing the size of the project to grow excessively.
This provides a safe and less cluttered alternative to frequently making full copies of documents at different points in their evolution (e.g., `analysis.R`, `analysis_v2.R`, `analysis_FINAL.R`).
While we do not focus on technical details about the use of Git and GitHub in this study, we recommend users explore available resources to become more familiar with version control features (see @doi:10.1371/journal.pcbi.1004947, @doi:10.1371/journal.pcbi.1004668, and @doi:10.1109/ICSE.2015.74).
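
For readers who prefer to see the steps concretely, the clone, commit, and synchronize cycle described above can be sketched as a short Python script that assembles the corresponding Git commands. This is a minimal sketch rather than a recommended tool: the repository URL, directory name, and function names are hypothetical, and the script only prints the commands unless `dry_run=False` is passed, in which case Git must be installed and the remote must exist.

```python
import shlex
import subprocess


def workflow_commands(remote_url, local_dir, filename, message):
    """Build the Git commands for one clone, commit, and sync cycle.

    Returns a list of (working_directory, argument_list) pairs. The
    working directory is None for commands that can run from anywhere.
    Step (ii), editing the file, happens between "clone" and "add".
    """
    return [
        # (i) make a local copy that stays linked to the remote repository
        (None, ["git", "clone", remote_url, local_dir]),
        # (iii) stage the modified file, then record it with a description
        (local_dir, ["git", "add", filename]),
        (local_dir, ["git", "commit", "-m", message]),
        # (iv) bring in collaborators' commits, then publish your own
        (local_dir, ["git", "pull"]),
        (local_dir, ["git", "push"]),
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    printed = []
    for cwd, args in commands:
        line = shlex.join(args)
        printed.append(line)
        print(line if cwd is None else f"(in {cwd}) {line}")
        if not dry_run:
            # check=True stops at the first failing command
            subprocess.run(args, cwd=cwd, check=True)
    return printed


if __name__ == "__main__":
    # Hypothetical repository URL, directory, and file name
    commands = workflow_commands(
        "https://github.com/example-lab/example-project.git",
        "example-project",
        "analysis.R",
        "Generate and include results figure",
    )
    run_workflow(commands)  # dry run: nothing is executed
```

The same operations are available through the GitHub web interface and GitHub Desktop, so scripting them is optional; the sketch is only meant to make the order of the steps explicit.
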
The use of GitHub has become increasingly popular in recent years due to the expansive GitHub user-community and numerous GitHub resources [@doi:10.1371/journal.pcbi.1004947; @doi:10.1080/00031305.2017.1399928; @https://happygitwithr.com; @https://ourcodingclub.github.io]. Nevertheless, although multiple articles have encouraged researchers in ecology and evolution to adopt GitHub as part of their research process [@doi:10.1038/s41559-017-0160; @doi:10.1038/538127a], its use is still not widespread. Among many other factors, this may be because first-time users without formal training in information technology can face a steep learning curve, as GitHub and its features have been centered on software development [@doi:10.1109/ASONAM.2016.7752419]. Furthermore, there are few domain-specific resources providing tractable examples and practical guidance for researchers in EEB on how to use GitHub [but see @doi:10.1111/2041-210X.13982; @https://ourcodingclub.github.io; @https://www.openscapes.org]. Widespread adoption of GitHub for collaborative research tasks can ultimately enable EEB researchers to save time on creating novel processes for collaboration and focus more on their research [@doi:10.3897/rio.6.e56508]. More importantly, expanding the availability of data and code management standards, of which GitHub is an increasingly important component, makes research more reproducible and collaborative [@doi:10.1002/bes2.1801; @doi:10.1098/rspb.2022.1113].
This paper is the result of an academic hackathon held during the 2021 conference of the Society for Open, Reliable, and Transparent Ecology and Evolutionary Biology (SORTEE, https://www.sortee.org). We convened around thirty researchers in EEB with varying levels of familiarity with the use of GitHub in research projects. The aim of the hackathon was to showcase and discuss how existing features of GitHub can contribute to documentation and collaboration in EEB research. During the event, we identified the need for formal discussions on how EEB researchers can benefit from GitHub and its features. Here, we outline twelve practical ways that EEB researchers can use GitHub features for more collaborative, transparent, and reproducible science. We also provide critical perspectives on features that could be improved and better tailored to research, drawing on examples from the literature.
Glossary |
---|
repository: A repository (commonly shortened to "repo") is a collection of files (e.g., a directory) tracked by Git. Repositories are managed by an owner and can be made either "public", visible to all GitHub users, or "private", visible only to users selected by the owner. Repositories can be either "local" and saved on an individual's computer or "remote" and stored on the cloud via GitHub's web platform. |
fork: A fork is a copy of a repository hosted on GitHub. If a repository is public, then anyone can make a fork. Even if they do not have access to push to the original repository, they can make a fork and edit it independently. Forks are linked to the original GitHub repository and "upstream" changes (i.e., those in the original repository) can be merged to keep the fork up-to-date with the original project. Changes made in the fork can be integrated into the original project via pull requests. |
clone: Cloning a repository is a way of making a local copy (i.e., on your computer) of a GitHub repository. If you have access to push to a repository, this can be a first step to contributing to a project. |
branch: The history of a Git repository is analogous to a tree, with a main line of development and branches that diverge from it. A Git branch is an alternative line of development for a project (repository). Branches allow users to add new features or modifications to the project without affecting the main part of the project. Development branches can be created at any point in time and work on each branch can continue independently. Branching is useful for testing out new ideas (both code and text) which may or may not eventually get integrated into the main branch of the project. Branches can also be used to isolate the contributions of multiple contributors. Having each person work on their own branch avoids problems that may arise if conflicting edits are pushed to the same remote branch. Changes in a development branch can be merged into the main branch via pull requests. Branches can only be made by those who are given access to the project repository. |
commit: Commits are snapshots of the development of a project. In Git, versions of files and directories are uniquely identified as "commits", allowing one to identify and track modifications line-by-line. Commits can include changes in multiple files and must include a brief commit message describing the changes made. A typical workflow is to make some related changes in files, add a commit message (e.g., "Generate and include results figure"), and after several commits push those commits to the remote (i.e., cloud-based) GitHub repository. |
push and pull: When commits are made in a project locally, they must be synced with the remote GitHub repository by pushing them. Changes on a GitHub repository can then be pulled to keep your local version of the project up-to-date with the remote. |
pull request: A pull request is a request for changes made on an individual's branch in the repository or in a user's fork to be merged into the repository. Pull requests contain a description of the changes alongside all code required for testing and review by other users prior to being merged into the repository. |
merge: Combining commits from two different branches together into one branch. |
release: At any point a release can be made on GitHub to mark a significant milestone in the progression of a repository. While this GitHub feature is designed with releases of new versions of code in mind (e.g., v1.0.0), it can also be used to create a snapshot of a repository at significant stages like pre-print, submission, revision, and acceptance of an associated manuscript. |
community: A forum where GitHub users can ask for advice, offer solutions to questions, and share ideas (https://github.community/). |
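
Several glossary terms (commit, branch, and the idea of a commit as a snapshot) can be illustrated with a toy model in Python. The sketch below is an assumption-laden simplification and not Git's actual storage format: the `ToyRepo` class and its methods are hypothetical names, each commit is a dictionary holding a full snapshot of the files, and a branch is simply a name that points to its latest commit.

```python
import hashlib
import json


class ToyRepo:
    """A toy model of commits and branches; NOT Git's real storage format."""

    def __init__(self):
        self.commits = {}               # commit id -> commit record
        self.branches = {"main": None}  # branch name -> id of its latest commit
        self.current = "main"

    def commit(self, files, message):
        """Record a snapshot of `files` (name -> text) on the current branch."""
        record = {
            "parent": self.branches[self.current],
            "message": message,
            "files": dict(files),
        }
        # The id is a hash of the record, so it uniquely identifies the commit
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        commit_id = hashlib.sha1(payload).hexdigest()
        self.commits[commit_id] = record
        self.branches[self.current] = commit_id  # move the branch pointer
        return commit_id

    def branch(self, name):
        """Create a branch pointing at the current commit and switch to it."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def checkout(self, name):
        """Switch to an existing branch."""
        self.current = name

    def log(self):
        """Return commit messages on the current branch, newest first."""
        messages = []
        commit_id = self.branches[self.current]
        while commit_id is not None:
            messages.append(self.commits[commit_id]["message"])
            commit_id = self.commits[commit_id]["parent"]
        return messages


if __name__ == "__main__":
    repo = ToyRepo()
    repo.commit({"analysis.R": "x <- 1\n"}, "Add analysis script")
    repo.branch("new-figure")
    repo.commit({"analysis.R": "x <- 1\nplot(x)\n"},
                "Generate and include results figure")
    print(repo.log())  # history as seen from the new branch
    repo.checkout("main")
    print(repo.log())  # main is unaffected by work on the branch
```

In this model, work on the `new-figure` branch leaves `main` untouched until the two are combined, which is the behavior that merging and pull requests build on in real Git repositories.
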