Record of a specific analysis

Research Software Sharing, Publication, & Distribution Checklist

Considerations for publishing code which runs a specific analysis that underpins some result to be published in the academic literature. The emphasis here is on making the work narrowly reproducible, i.e. the analysis of the same data can produce the same result when it is re-run. This is a starting point for making results robust (different analysis, same data), replicable (same analysis, different data), and ultimately generalisable (different analysis, different data). The other emphasis is on making the work ‘verifiable’: exposing the complete step-wise detail of the reasoning underpinning the analysis so that it can be scrutinised and understood.

📒Source control

How can you keep track of the history of your project and collaborate on it?

  • Whilst you can simply use git and a git forge / git hosting service as a way of distributing your project, doing so misses out on a lot of the benefits of using git as a part of your workflow from the beginning of your project. A well maintained git history is much like a well kept lab notebook for a data analysis project. Well authored commit messages detail why you changed what you changed, providing context for the development of the project. By ‘checking out’ a commit you can open a window onto any point in the history of your project at which you took a snapshot by making a commit. You can collaborate on your project with other git users asynchronously; it can be a great tool for distributed collaborative authorship not just of software but also of prose. A good example of this is The Turing Way: a how-to guide for reproducible data science. A nice read making this case is Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution, though I would carefully weigh the long term issues often created by building on proprietary infrastructure like GitHub as opposed to using an open git forge instead.
  • Interaction with data and other big non-text / binary files.
    • git is not good at managing changes to large binary files like images; every version of such a file is retained in the git history, which can make the history grow fast if these files change frequently.
    • You can exclude large data files from tracking by git by adding them, or a pattern matching them, such as: data/* (ignore all files in the data folder) or *.data (ignore all files that end in .data), to the [.gitignore](https://git-scm.com/docs/gitignore) file (see the sketch after this list).
    • There are a variety of tools which can be used to manage versioning of large binary objects in a git-like fashion and which integrate with a git based workflow. Whether you use one, and which you choose, may depend on your specific needs. Examples: git-annex, git LFS, Data Version Control (DVC), lakeFS. This is generally most relevant for intermediate data objects which are of potential interest to downstream users of the processed data, for example machine learning model weights from different training runs, or annotated single cell sequence count matrices in things like Seurat objects. If you have deposited your raw data into a public repository it does not need to be duplicated in such a system indefinitely, but it might be useful whilst working on the project to have it ‘cached’ in one. If possible, start by importing the data & metadata into your own project from its public repository as a test of the FAIRness of the data. This also means that anyone using your dataset has a clear example of how to import it into a working environment.
    • Signed and Timestamped git commits
      • It is possible to cryptographically sign your git commits; this can be used to increase confidence that you are the author of a signed commit, as someone would have to compromise your private key in order to impersonate you. If your key is part of a web of trust or other [public key infrastructure (PKI)](https://en.wikipedia.org/wiki/Public_key_infrastructure), people can see that other people / institutions attest that the person with this key is who they say they are. Whilst typically used for things like helping to protect critical open source infrastructure from supply chain attacks, signed commits on academic code bases could be used to provide additional provenance information.
      • It is sometimes desirable for git commits to demonstrably have been signed at a given time; your system time is recorded by default but this can be trivially spoofed. The OpenTimestamps protocol can be used to generate cryptographic attestations to the time at which a commit was made. This aims to establish a lower bound on how long ago the committed code was authored. The opentimestamps-client integrates with git to provide timestamps for individual GPG-signed commits.
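
To illustrate the earlier point about keeping large binary files out of the git history, a minimal sketch; the data/ paths and file names are hypothetical, and the DVC step assumes that tool is installed:

```bash
# Ignore raw data files so git does not track them
cat >> .gitignore <<'EOF'
data/*
*.data
EOF

# If the files were already committed, stop tracking them (local copies are kept)
git rm -r --cached data/
git commit -m "Stop tracking raw data files"

# Optionally hand large files to a data-versioning tool instead, e.g. DVC:
# dvc init && dvc add data/raw_counts.csv  # writes data/raw_counts.csv.dvc for git to track
```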

©Licensing

On what terms can others use your code, and how can you communicate this?

  • Once you have selected your license, include a plain text copy of it in the root of your repository in a file named LICENSE (see the sketch after this list). Plain text versions of popular open licenses are widely available. Files with this name are often identified by software forges, and a link to them is created on the repository home page when they are present. You may need to include your name and the date in your copy of the license file if indicated.
  • The content of a repo of this form is generally a mixture of code, images (often graphs), data, and prose. In this context it may be preferable to have separate licenses for code, prose, and other assets such as graphs, e.g. all code under a GPLv3 license, and all images, prose and datasets under a CC BY-SA license.
  • All software needs a license if you want to permit others to reuse it. It is important to give some thought to the type of license which best suits your project; it is a choice which can have significant long term implications. Check out the Turing Way chapter on licensing for an introduction to the subject. If you have no time, some pretty safe choices are: for a permissive license, the Apache 2.0, which allows the re-use of your work in closed commercial code; for a ‘copyleft’ license, the GPLv3 (AGPL for server-side apps), which requires that anyone distributing software containing your code, or derivatives of it, share the source code with the people they distributed it to.
  • If you are including external code in your package then you should check that their licenses are compatible and that you are legally allowed to distribute your code together in this way. Check out this resource on license compatibility.
  • If you have a big project with a lot of differently licensed content, or need a standard way to provide license information for binary assets, you might want to check out the REUSE.software tool. You can also embed licensing information directly into image metadata to help ensure that it remains associated with the image when/if it is reused; IPTC provides an extended image metadata schema with a copyright notice field which could house an SPDX license code. It also allows for the direct embedding of image alt text.
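
As a sketch of the mechanics described above, one way (among many) of adding a plain-text license file and checking per-file licensing information; the `reuse` CLI lines assume that tool is installed (e.g. via pip) and are illustrative only:

```bash
# Fetch a plain-text copy of the GPLv3 into the repository root
curl -L -o LICENSE https://www.gnu.org/licenses/gpl-3.0.txt

# With the REUSE tooling, license texts live under LICENSES/ and every file
# carries SPDX information; `reuse lint` reports anything that is missing.
# reuse download GPL-3.0-or-later
# reuse lint
```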

📖Documentation

How do people know what your project is, how to use it and how to contribute?

In some cases, for small and simple projects, a README file is sufficient documentation. This may genuinely be all that you need, or it may be inadequate to the task, depending on the project.

  • The rationale is key for code pertaining to a specific analysis, so that the intent and reasoning of the author are clear to the reader.
  • Literate programming notebook tools like Jupyter and Rmarkdown/Quarto are a great fit for this sort of code output as they permit you to contextualise the choices made during an analysis, visualise, and interpret your results all in the same place. They can also provide robust provenance of results: because the output is built by executing the notebook, it can be made clear that this data was analysed by this code, in this compute environment, and produced these outputs.
  • Analyses may be sufficiently complex that you don’t want all of them in a literate programming form, especially if you are using literate programming tools to author a conventional manuscript. You can treat the full details of your analysis as a supplementary method and use a combination of more conventional software documentation tools and literate programming to provide a full-detail version of your analysis, then reference these objects in your manuscript. If working in R, the {targets} tool can be a nice way of managing a workflow like this, as you can cache results, like graphs that might be complex and expensive to generate, then reference them succinctly in your manuscript document.
  • A good test of the reproducibility of your analysis is to use CI/CD tools to build your outputs from source code and never commit the outputs themselves to the repo (see the sketch after this list); you can then serve these build artefacts as a static website of your manuscript and any other documentation from your project. You can also bundle them with the rest of the code when creating a snapshot of the project to archive on a platform like Zenodo. Obviously it is best to avoid re-running lengthy computationally intensive analyses every time you push to a repository. You can avoid this by caching results and/or only triggering rebuilds when a commit is tagged a certain way, for example with a new version number. This way it may still be possible to run your more computationally intensive code on your build system, if it has adequate computational resources.
  • If you release multiple versions of a particular analysis it can be a good idea to include a CHANGELOG file in your project documenting things which have changed since the previous version.
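
A sketch of what a CI job (or a local check) might run to build outputs from source rather than committing them, assuming a Quarto-based manuscript and, optionally, a {targets} workflow; the file names are illustrative:

```bash
# Rebuild the manuscript and any other literate-programming outputs from source
quarto render manuscript.qmd --to html

# If the analysis is orchestrated with the R {targets} package, one command
# re-runs anything that is out of date and reuses cached results for the rest
Rscript -e 'targets::tar_make()'
```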

🔗Making Citable

How should people make reference to your project and credit your work?

  • The code underpinning a publication is a part of your methods. The legacy publishing system lacks a suitable and convenient way of including your analysis code as a part of your methods section, so to work around this it is best to make your code a separate citable object and simply cite it in your methods section.
  • Including a CITATION.cff (Citation File Format) file in your project repo is a simple way of making your code citable (see the sketch after this list). The format is human-readable YAML and permits the provision of the metadata needed for citation.
  • Zenodo permits you to mint a digital object identifier (DOI) for your code and makes a snapshot of it, importing citation metadata from a CITATION.cff file or a .zenodo.json file. This makes it persistently identifiable and easy to integrate with citation management tools, which can import the citation metadata given a DOI.
  • Nix and Guix
    • General software repositories may not make specific provision for citation of software packages in the academic fashion. However, some provide what is, for some use cases, a superior form of ‘citation’ of their own sources, i.e. a complete ‘software bill of materials’ (SBOM). This is a list of all the code used in another piece of code, its dependencies, and their dependencies recursively, along with all of their versions. For example, Nix can do this, but Guix is perhaps the most comprehensive in its approach. It not only provides all information necessary for a complete ‘SBOM’ but it can also bootstrap software packages in its repository from source with an extremely minimal fixed set of binaries, an important capability for creating somewhat trustworthy builds. This creates a compute environment which is not only reproducible but verifiable, meaning the source of all of an environment’s dependencies can in theory be scrutinised. It also adopts an approach to commit signing and authorisation of signers that gives it a currently uniquely complete supply chain security architecture. Packages or ‘derivations’ are ‘pure functions’ in the sense that only their inputs affect their outputs and they have no side-effects; package builds are sandboxed to prevent dependencies on any external source not explicitly provided as an input, and inputs are hashed to ensure that they cannot differ from the value expected when they were packaged. This gives these technologies an unrivalled ability to readily demonstrate the reproducibility and provenance of compute environments specified using them.
    • Whilst not yet fully implemented and adopted, these technologies also afford some fascinating opportunities for seamless access to archival versions of software in the future. Due to the similarities in the content-based addressing used by Git, Nix, Guix, IPFS (the InterPlanetary File System) and Software Heritage’s IDs, it may be possible to construct an approach to archiving, distributing and caching sources of packages in a way that would ensure that low demand archived software sources and high demand current packages can be distributed transparently through the same mechanism. This would in theory permit the reconstruction of any historically specified compute environment that had been archived, with no changes to normal workflow other than perhaps a longer build time. This approach also makes the creation of ‘mirrors’ of the archive relatively simple and requires no client side changes, as an IPFS resource will be resolved irrespective of the node on which it is stored. See: NLnet Software Heritage and IPFS, Tweag - Software Heritage and Nixpkgs, John Ericson - Nix x IPFS Gets a New Friend: SWH (SoN2022 - public lecture series)
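
A minimal CITATION.cff sketch, as referenced above; the name, title, version, date, and ORCID below are placeholders to be replaced with your own details:

```bash
cat > CITATION.cff <<'EOF'
cff-version: 1.2.0
message: "If you use this analysis code, please cite it as below."
title: "Analysis code for <paper title>"
authors:
  - family-names: "Doe"
    given-names: "Jane"
    orcid: "https://orcid.org/0000-0000-0000-0000"
version: "1.0.0"
date-released: "2024-01-01"
EOF
```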

✅Testing

How can you test your project so you can be confident it does what you think it does?

  • Whilst you can make use of unit tests / automated testing frameworks in this context (see the software packages checklist testing section for more), they are not always the best fit. A very good thing to do is to have an example dataset, different from your new data, on which you can perform your analysis. If you are testing a hypothesis it’s nice to have test datasets which simulate rejecting and accepting your null hypothesis.
  • If you write, test, have reviewed, and publicly deposit the code that you plan to run to test the primary/pre-planned outcomes in your study, and are able to run this same code on your data unchanged, this inspires confidence in your testing regimen. Code associated with incidental findings which may be suggestive of hypotheses to test in future work obviously cannot be subject to pre-registration, but other best practices can be followed.
  • When using real data or downsampled real data be sure that your testing covers any edge cases that may not have arisen in your example data.
  • When using simulated data be sure to include the method by which you simulated the data and any random seeds which may be needed to re-generate it (see the sketch below).
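
A toy sketch of generating paired ‘null’ and ‘effect’ test datasets with a recorded seed; the effect size and file names are arbitrary illustrations:

```bash
Rscript -e 'set.seed(42);
  null_data   <- data.frame(group = rep(c("a", "b"), each = 50), y = rnorm(100));
  effect_data <- transform(null_data, y = y + 0.5 * (group == "b"));
  write.csv(null_data,   "test_data_null.csv",   row.names = FALSE);
  write.csv(effect_data, "test_data_effect.csv", row.names = FALSE)'
```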

🤖 Automation

What tasks can you automate to increase consistency and reduce manual work?

  • use of an environment management tool
  • use of a literate programming / computational notebook
  • use of a pipeline manager or make-like tool
  • use of a linter / formatter
  • use of continuous integration / continuous deployment
  • use of git hooks (see the sketch after this list)
  • automated minting of new persistent identifiers on release tagging
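
As a concrete example of the git hooks item, a sketch of a client-side pre-commit hook that refuses to commit files above an arbitrary size threshold; save it as .git/hooks/pre-commit and make it executable:

```bash
#!/bin/sh
# Refuse to commit staged files larger than 50 MB (the threshold is arbitrary),
# a simple guard against accidentally committing large data files.
max_bytes=$((50 * 1024 * 1024))
for f in $(git diff --cached --name-only --diff-filter=AM); do
  [ -f "$f" ] || continue
  size=$(wc -c < "$f")
  if [ "$size" -gt "$max_bytes" ]; then
    echo "Refusing to commit $f (${size} bytes > ${max_bytes} bytes)" >&2
    exit 1
  fi
done
```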

👥Peer review / Code Review

How can you get third party endorsement of and expert feedback on your project?

  • Most of the places that offer code peer review are focused on software packages, not code that is specific to your analysis. This makes sense, as reviewer time is fairly scarce, so focusing it on code that others are more likely to reuse is reasonable.
  • If your code underpins a publication then in theory it may get reviewed as a part of the regular peer review process, although in practice this does not appear to be all that common. If the journal to which you are submitting your work has no policy on code review, and your reviewers do not take an interest in reviewing your code - even just checking that it runs for them - then you may wish to take responsibility for the review of this work into your own hands. CODECHECK will independently verify that they can run your code, but assessing its correctness is not in their scope.

📦Distribution

How can people install or access the software emerging from your project?

  • What it means to distribute one-off analysis code is somewhat different from distributing a package or pipeline, as the goal is different. The aims are to share how you did what you did in a research output, and to provide a record of the provenance of your results.
  • Distribution in this context is closer to documentation, and literate programming tools like Jupyter Notebooks, Jupyter Book and Rmarkdown/Quarto lend themselves to this task very well. You can serve a static website based on the notebooks which perform, explain and interpret your analysis using these tools. You can even demonstrate their computational reproducibility by building these outputs using the continuous integration and deployment (CI/CD) tools available on git forges like GitLab and GitHub.
  • Tools like Binder and Renku allow you to share your analysis environment so that people can pick up your analysis where you left off in an interactive environment, letting them tweak your code and explore your data as they wish (see the sketch after this list).
  • An excellent way to share the record of a specific analysis is to use all these tools in conjunction.
    • Perform your analysis in a reproducible compute environment specified using a tool like Renku, Binder or a Nix flake, which will allow this environment to readily be shared with others. Write your manuscript using literate programming tools and serve this as a static web page as a way of pre-printing your manuscript. Revise it with your collaborators using the issues and pull/merge request features of a git forge. If you make the repo citable by adding the appropriate metadata and using Zenodo to mint a DOI, it is as citable as if it were deposited in a pre-print server, but probably looks a lot better. Then your entire project history is available in your git history. Your computational reproducibility is evidenced by the ability to build the output, in the computational environment that you specified, in order to serve your web page with the manuscript.
  • Where you have very large and computationally intensive upstream analyses, as is common for example in biological projects involving sequencing or image data, it can be easiest to take the outputs from this pipeline as the inputs for your downstream and less computationally intensive analysis. Document how to run the upstream reproducible pipeline in your downstream analysis. This way anyone with access to appropriate compute resources could download your data and run the same upstream analysis to get to the same starting point for the lighter downstream analysis, and all the information needed to do this is documented in the downstream analysis.
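
A sketch of rebuilding an interactive analysis environment locally with repo2docker, the tool that powers mybinder.org; the repository URL is a placeholder:

```bash
pip install jupyter-repo2docker

# Build a container image from the repository's environment specification
# (e.g. environment.yml, requirements.txt, or install.R) and launch Jupyter in it
repo2docker https://github.com/example-user/example-analysis
```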

💽Environment Management / Portability

How can people get specific versions of your software running on their systems?

  • The record of a specific analysis is the case where providing a complete specification of the computational environment in which code was run is perhaps the most important. Doing this and providing the information necessary to initiate a re-run of the analysis, in that environment, is the computational equivalent of providing protocol level methodological detail of how a bench experiment was performed. In addition the provision of the source of the data on which the analysis was performed and a means of both retrieving a copy of it and demonstrating that it is the same as the original input, such as hashes of the data files, is analogous to being able to get access to the same reagents and types of biological samples in an experiment.
  • You might not need this level of detail to install a working version of a piece of software for general use, but for the record of a specific analysis it is ideal if we can re-run everything exactly, with all the same versions, if we are ever looking back to identify a potential source of error.
  • The most approachable tools to specify and reproducibly share an interactive compute environment in which an analysis was performed are probably Binder and Renku.
  • Use of a ‘lock file’ (different tools may have different names for these), which specifies which software, and which versions, to install in order to recreate your compute environment, is ideal for this application. Various package and environment managers support this, such as conda, renv, python virtual environments, poetry, nix flakes, and guix manifests.
  • The scope of the environment managed by these tools can vary: renv, for example, only manages R packages; conda manages essentially any software but not necessarily all system dependencies; nix and guix can, and do, specify entire operating systems.
  • It is common to use the more narrowly scoped package and environment management tools in conjunction with container or virtual machine build descriptions such as a Dockerfile (see the sketch after this list). You start with an image of an operating system, add instructions to install any system dependencies, and then use the environment manager to install non-system dependencies for your project into the image. This is incomplete, as it leaves some things outside of the managed environment: it does not capture how to reconstruct the base image, and system dependencies will not necessarily be versioned. This is however likely the most familiar-feeling experience, as it is essentially the same as what you’d do when setting up your working environment on a new computer. This approach covers the vast majority of cases and is a good practice to adopt now. Nix flakes (vm container) and Guix manifests (vm container) can generate such images directly with all system and project dependencies explicitly specified in lock files, but have a steep learning curve due to their new, unfamiliar, way of working. They also may not yet have all the specialist software that you need packaged in their repositories. They can be a lot of work if they don’t already have everything that you need at present, but are worth watching as, once refined, they can solve many of the points of friction with current approaches.
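
A sketch of the ‘lock file + container image’ pattern described above, using R with renv as one example; the base image tag, system dependency, and paths are illustrative assumptions:

```bash
# Capture a lock file with whichever environment manager you use, e.g.:
#   conda env export > environment.yml
#   Rscript -e 'renv::snapshot()'   # writes renv.lock
#   pip freeze > requirements.txt

cat > Dockerfile <<'EOF'
FROM rocker/r-ver:4.3.2
# System dependencies sit outside the R package manager and are installed here
RUN apt-get update && apt-get install -y --no-install-recommends libxml2-dev \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /analysis
# Restore the R packages pinned in the lock file
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"
COPY . .
EOF

docker build -t my-analysis .
```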

🌱 Energy Efficiency

How can you and your users minimise wasted energy?

One-off analysis code is not a particularly high-impact target for efficiency improvements as it is only run a small number of times. However, it is worth giving some consideration to the efficiency of the tools that your one-off analysis might make use of or depend on. For the most part such an analysis might represent a first step, implementing a new method for the first time, where its correctness and comprehensibility are more important than the efficiency with which it is implemented; optimisation comes later.

Consider what you can do to make your code a little more efficient:

  • Don’t generate unnecessary outputs that will sit on people’s drives unused; clean up the results of intermediate steps. You might have varying degrees of verbosity of output, with a more verbose mode for debugging but defaulting to just the essentials.
  • Good documentation and good error handling/messages can reduce the number of mistakes people make when using your code, meaning they re-run or partially re-run it fewer times before they figure out how to use it correctly.
  • Some records of this type may use pipeline managers. {targets}, for example, can integrate nicely with literate programming outputs to cache computationally expensive results, let you iterate quickly on a manuscript using those outputs, and ensure that any code that needs to be re-run following a change is re-run, all whilst being able to re-run your entire workflow and regenerate your manuscript with a single command. For pipelines in particular, caching results and avoiding re-computing things where possible is a good way to make best use of these features, for example by having small granular tasks to minimise repeated work on run failure.
  • Choice of libraries and frameworks - some libraries may be more efficient than others, be a wrapper around an efficient implementation in another language, or be able to make use of offload to hardware accelerators.
  • Offload to hardware accelerators where available - vector, matrix and array arithmetic can often benefit from very substantial speed-ups on hardware specialised for these types of calculations, or even from binaries compiled with the right instruction set extensions enabled to take full advantage of hardware acceleration features on many CPUs. Doing this directly can be quite challenging, but using libraries capable of managing this offload for you can make it more approachable. (This can potentially introduce interesting reproducible computation challenges due to things like differences in the handling of floating point arithmetic between hardware/firmware implementations.)
  • Benchmarking & profiling to locate and improve inefficient code (see the sketch after this list). Don’t optimise prematurely - it is often surprising which pieces of your code turn out to be slow, so measure first and check where to focus your attention. This goes hand in hand with having done robust testing, as a good test suite means that you can confidently refactor an inefficient piece of code without fear of introducing errors.
  • Language choice - whilst sometimes worth considering, energy efficiency is rarely high up on the list of reasons to pick a programming language in this context. Familiarity, both yours and that of anyone who might use the code after you, or indeed of the academic community that might consume your code, is often paramount, as this maximises the speed with which you can develop your solution and others can comprehend it.
  • The people in the Green Algorithms community of practice have some useful advice and resources for anyone interested in this subject.
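
A sketch of measuring before optimising: profiling a hypothetical analysis.py / analysis.R script to see where time is actually spent (file names are placeholders):

```bash
# Python: write a profile and print the ten most expensive call paths
python -m cProfile -o profile.out analysis.py
python -c "import pstats; pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)"

# R: a similar quick check with the built-in sampling profiler
# Rscript -e 'Rprof("profile.out"); source("analysis.R"); Rprof(NULL); print(head(summaryRprof("profile.out")$by.self, 10))'
```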

⚖ Governance, Conduct, & Continuity

How can you be excellent to each other, make good decisions well, and continue to do so?

In the case of a record of a specific analysis it is likely that, following the conclusion of the project, it will become largely dormant. You may however still encounter people referring to and making use of parts of the work and asking questions about it. Therefore it may still be sensible to indicate basic policies for any community spaces and user support. Continuity planning is also still important, as others may need to pick up where you left off at some point in the future.

  • Continuity planning: What happens to your project if something happens to you? The code will likely live on due to the distributed nature of git, but what about the issue tracker, the website etc.? Who else has the highest level of privilege on your project or a mechanism to attain it? The principle of least privilege dictates that you keep the number of people with this level of access to a minimum, but you may then create a single point of failure. Password managers like Bitwarden have a feature where designated people can be given access to your vault if they request it and you do not deny it within a certain time-frame. This could provide a lower level admin with a mechanism to escalate their privileges if you are unable to do this for them. However, this delay might be an issue for continuity of operations if administrator action is needed within the waiting period. Game it out, have a plan, write it down, let people know you have a plan.
  • User support
    • What support can users expect, or not expect?
    • Where can they ask for it?
    • Is there somewhere where users can provide support to other members of the user community, such as a forum?
    • Can they pay for more support?
  • Whole project data longevity - what plans do you have in place to back up and archive materials pertaining to your project that are not under source control? e.g. issues in your bug tracker
  • If you are the Benevolent Dictator For Life (BDFL) of your project and the Code of Conduct (CoC) is “Don’t be a Dick” that’s fine; for many individual hobby projects this is a functional reality. Becoming a BDFL tends to be the default unless you take steps to avoid it and cultivate community governance as your project begins to grow - failing to do this and being stuck in charge can become quite the burden in successful projects. Be wary of adopting policies if you lack the resources, time, interest, skill, or inclination to be an active enforcer, mediator and moderator of community norms and disputes. It is helpful to be clear about what you can and cannot commit to doing. Only by communicating this might you be able to find community members to help you with setting and enforcing these norms, if or when your community attains a scale where this becomes relevant - community management is its own skill set. If you can’t moderate them, avoid creating and/or continuing ungoverned community spaces that can become a liability for you and your project’s reputation. Just as there are off-the-shelf licenses there are off-the-shelf codes of conduct; the Contributor Covenant is perhaps the best known and most widely used, though it may need some customisation to your needs. Adopting such a CoC gives you some guidance to follow if there is bad behaviour in your project’s community and communicates that you, as the project leadership, take the responsibility of creating a respectful environment for collaboration seriously. It can also signal that your project is a place where everyone is welcome but expected to treat one another with respect, and that failing to do so will result in penalties potentially including exclusion from the community. The Turing Way provides quite a nice example of a CoC developed specifically for their project. You will need to provide contact information for the person(s) responsible for the enforcement of the CoC in the appropriate place and be able to follow up in the event it is used. Git forges often recognise files with the name CODE_OF_CONDUCT.md in the root of a project and provide a link to them on project home pages, so this is a good place to document such policies. If you are the BDFL of a small project then interpretation and enforcement of such a CoC tends to fall solely on you - game out some courses of action for what you’d do if faced with some common moderation challenges.
    • Once a project attracts a larger community there is greater scope for disputes and therefore for the need for dispute resolution mechanisms. Free/Libre and Open Source Software development and maintenance can be thought of as a commons so I would refer you to the work of Elinor Ostrom on how commons have been successfully (or unsuccessfully) governed when thinking about what processes to adopt for your project. More recently Nathan Schneider’s Governable Spaces: Democratic Design for Online Life tackles some of these issues as applied to online spaces.
    • This is summarised in the 8 Principles for Managing a Commons
      1. Define clear group boundaries.
      2. Match rules governing use of common goods to local needs and conditions.
      3. Ensure that those affected by the rules can participate in modifying the rules.
      4. Make sure the rule-making rights of community members are respected by outside authorities.
      5. Develop a system, carried out by community members, for monitoring members’ behavior.
      6. Use graduated sanctions for rule violators.
      7. Provide accessible, low-cost means for dispute resolution.
      8. Build responsibility for governing the common resource in nested tiers from the lowest level up to the entire interconnected system.
    • An informal do-ocracy in the fiefdom of a BDFL is often the default state of projects that have not given much conscious thought to how they want to be governed, and such projects are thus often subject to many of the common failure modes of this model. How are decisions made in your project? Do you need the mechanisms of governance used by community and civil society organisations? By-laws, a committee and/or working groups, general meetings, votes, minutes? A version of these may be necessary to avoid The Tyranny of Structurelessness. How can you map these onto your development infrastructure and make the decisions of your governing bodies enactable and enforceable?
  • Does your project take donations? Does it have a trademark? Does it need a legal entity to hold these? Who is on the paperwork and who has signing authority? Who keeps track of expenditures? Tools & organisations like OpenCollective can help with some of these issues.
  • If your project has potential cybersecurity implications, what procedures do you have in place for people to disclose vulnerabilities in the project so that they can be patched before they are made public? What systems do you have in place to disclose a vulnerability once it has been patched and to ensure that users know that they need to update?

Research Software Sharing, Publication, & Distribution Checklists by Richard J. Acton is licensed under CC BY 4.0