Pipelines & Workflows

Research Software Sharing, Publication, & Distribution Checklist

Considerations for publishing a data analysis workflow or pipeline which may be used in research or by researchers. “Applying the FAIR Principles to computational workflows” (doi: 10.1038/s41597-025-04451-9) offers a good working definition of a workflow and of the application of the FAIR principles to computational workflows.

📒Source control

How can you keep track of the history of your project and collaborate on it?

  • If the workflow tool you are using provides a convenient way to initialise a project from a template then you may want to start your project’s git repository using that tool. Nextflow, for example, has the nf-core template, which makes creating an nf-core style pipeline project easy. Snakemake similarly has a standard project structure and template.

©Licensing

On what terms can others use your code, and how can you communicate this?

  • If you are including external code in your pipeline then you should check that its licenses are compatible and that you are legally allowed to distribute your code together in this way. Check out this resource on license compatibility. Generally, in a pipeline you are distributing code ‘alongside’ other packages in a way that strong copyleft licences like the GPL intend to permit (see the GPL FAQ). A pipeline that does not modify such a library but merely uses it ‘as is’ is therefore not considered a derivative work by these licences themselves, so using them in this fashion is common practice; however, the precise definition of what constitutes a derived work for the purposes of copyright law is generally decided on a case by case basis. Pipelines are sometimes run as part of the backends of web-based services, so choosing the AGPL over the GPL may be advisable if your intent is to maximise the applicability of copyleft terms across possible use cases for your pipeline.
  • All software needs a license if you want to permit others to reuse it. It is important to give some thought to the type of license which best suits your project; it is a choice which can have significant long term implications. Check out the Turing Way chapter on licensing for an introduction to the subject. If you have no time, some pretty safe choices are: for a permissive license, the Apache 2.0, which allows the re-use of your work in closed commercial code; for a ‘copyleft’ license, the GPLv3 (AGPL for server-side apps), which requires that anyone distributing software containing your code, or derivatives of it, share the source code with the people they distributed it to.

📖Documentation

How do people know what your project is, how to use it and how to contribute?

  • README / Manual
    • What your project is and what it does
    • Install instructions
    • What inputs does your pipeline expect and in what format?
    • Examples of how to run the pipeline
    • What options / flags are required by, or can optionally be passed to, your pipeline?
    • What tools does your pipeline make use of, and can it output their versions and citations?
    • What outputs does your pipeline generate and in what format?
    • Contribution guidance
      • for example: Issue templates, a code of conduct, process details
    • development environment setup
      • overview of project organisation and structure
  • CHANGELOG: it can be a good idea to include a CHANGELOG file in your project documenting things which have changed since the previous release. This manuscript on The impact of package selection and versioning on single-cell RNA-seq analysis provides a nice case study of why this can be useful in academic settings, especially if decisions have been made to change defaults between versions.
  • ‘docstrings’ and similar
    • Many programming languages have a way of documenting your code inline which can automate the generation of some parts of the documentation. This often takes the form of specifically marked-up comments. Examples include Python’s docstrings, R’s Roxygen2, and Perl’s POD (a minimal docstring example follows after this list).
  • Vignettes / Examples
    • Examples of use of the code in the context of a real problem, beyond simple example snippets which might be included in the documentation of individual functions/objects. (These can also serve as a form of simple integration tests if you run them as a part of your documentation build.)
  • Larger projects might also include Project Documentation: Plans, Design documents and Specifications
  • Process Documentation: how to proceed with various tasks related to the project. This might include: submitting issues, submitting merge requests, reporting possible vulnerabilities, testing, documentation, releases, and code review.
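
A minimal sketch of inline documentation using a Python docstring (NumPy style), as referenced in the ‘docstrings’ item above. The function, its parameters, and the quality-score example are hypothetical and purely illustrative; documentation generators such as Sphinx can harvest docstrings like this into reference documentation.

```python
def filter_low_quality(reads, min_quality=20):
    """Filter out reads whose mean quality falls below a threshold.

    Parameters
    ----------
    reads : list of tuple
        Sequence records as ``(sequence, mean_quality)`` pairs.
    min_quality : int, optional
        Minimum acceptable mean quality score (default: 20).

    Returns
    -------
    list of tuple
        Only the records with ``mean_quality >= min_quality``.

    Examples
    --------
    >>> filter_low_quality([("ACGT", 30), ("TTTT", 10)])
    [('ACGT', 30)]
    """
    return [r for r in reads if r[1] >= min_quality]
```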

🔗Making Citable

How should people make reference to your project and credit your work?

  • The higher tiers on the checklist don’t merely make it possible to consistently reference a research output; they help you to follow better citation and bibliographic practices. This extends from the practical, making it easy to import into reference managers like Zotero; to protecting against link-rot through the use of persistent digital object identifiers; to the use of linked / semantic data practices to identify and connect contributors, the nature of their contributions, their institutional associations, and the things to which they contributed. Further reading on the ethics of contributor role ontologies and taxonomies (CROTs), as well as their evolution and adoption.
  • When choosing whether to archive your code to Zenodo and/or Software Heritage (they are not mutually exclusive), here is what you should know about how they differ.
    • Zenodo only stores a snapshot of your project at the time you take one and uses DOIs as persistent identifiers.
    • Software Heritage archives your entire git history and has git commit level granularity in its persistent identifiers (SWHIDs), as they are partly content based, i.e. derived from a hash of the contents of your repository.
    According to DataCite recommendations, ideally you should only take new snapshots and mint a new DOI when you create a new major version of your code, using semantic versioning concepts for incrementing versions. They advise only updating the metadata for minor and patch releases and not minting a new DOI. Unfortunately, the default behaviour when using the GitHub/Zenodo integration is to mint a new DOI every time you create a release, which is typically more frequent than incrementing your major version. A way around this is to use a CI/CD (continuous integration and deployment) job which only triggers when the major version is incremented in a release and sends a snapshot to Zenodo via their API (see the sketch after this list). If using gitlab2zenodo, achieving this would be a relatively minor modification to their suggested GitLab CI/CD workflow; one way to do this would be to store the previously archived version in a variable you can update with the GitLab API and check against that. Snapshotting on every release could add a lot of unnecessary identifiers to the DOI namespace if the project is very active and should be avoided in that case; if your project only needs infrequent updates this option is fine. To maximise reproducibility it is desirable to cite software used in a given analysis with commit level version granularity, and for only archived versions of software to be used in published research.
  • Nix and Guix
    • General software repositories may not make specific provision for citation of software packages in the academic fashion. However, some provide what is, for some use cases, a superior form of ‘citation’ of their own sources, i.e. a complete software bill of materials (SBOM). This is a list of all the code used in another piece of code, its dependencies, and their dependencies recursively, along with all of their versions. For example, Nix can do this, but Guix is perhaps the most comprehensive in its approach, though for a smaller catalogue of software. Whilst language specific automated environment management tools such as {renv} & poetry provide a version of this, their scope does not extend to system dependencies, so whilst helpful they are incomplete. Guix not only provides all the information necessary for a complete SBOM but can bootstrap software packages in its repository from source with an extremely minimal fixed set of binaries, with bitwise binary reproducibility for the vast majority of packages, an important capability for creating somewhat trustworthy builds. This creates a compute environment which is not only reproducible but ‘verifiable’, meaning the source of all of an environment’s dependencies can in theory be scrutinised. It also adopts an approach to commit signing and authorisation of signers that gives it a currently uniquely complete supply chain security architecture. Packages, or ‘derivations’, are ‘pure functions’ in the sense that only their inputs affect their outputs and they have no side-effects; package builds are sandboxed to prevent dependencies on any external source not explicitly provided as an input, and inputs are hashed to ensure that they cannot differ from the value expected when they were packaged. This gives these technologies an unrivalled ability to readily demonstrate the reproducibility and provenance of compute environments specified using them.
    • In addition Guix can automatically fall back on sources archived by software heritage if the original source repository is unavailable.
    • Example of the use of Guix in practice for bioinformatic pipelines: PiGx (paper)
    • The primary limitation of these tools at present is that not all of the software commonly used in niche scientific domains is packaged in the upstream package repositories for Nix (~100,000 packages) & Guix (~20,000 packages), despite their considerable size. In order to use these tools with software not yet packaged for them, users must learn how to write package derivations for the tool in question and for any of its unpackaged dependencies.
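
As referenced in the archiving item above, a minimal sketch (in Python) of the version-comparison logic a CI/CD job could use to decide whether a release warrants a new Zenodo snapshot under the ‘new DOI only on a major version’ policy. The function names are hypothetical, semantic versioning (MAJOR.MINOR.PATCH, optionally prefixed with ‘v’) is assumed, and the actual upload to Zenodo is only indicated in comments; check the current Zenodo API or gitlab2zenodo documentation for the submission step itself.

```python
def major_version(version: str) -> int:
    """Return the major component of a semantic version string, e.g. 'v2.1.0' -> 2."""
    return int(version.lstrip("v").split(".")[0])


def should_mint_new_doi(previously_archived: str, new_release: str) -> bool:
    """Mint a new DOI only when the major version has been incremented."""
    return major_version(new_release) > major_version(previously_archived)


if __name__ == "__main__":
    # A 2.x.x release after a 1.x.x archive should trigger a new snapshot;
    # a patch release should only update the existing record's metadata.
    print(should_mint_new_doi("1.4.2", "2.0.0"))  # True  -> send a new snapshot to Zenodo
    print(should_mint_new_doi("2.0.0", "2.0.1"))  # False -> update metadata only, keep the DOI
```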

✅Testing

How can you test your project so you can be confident it does what you think it does?

The types of testing that it may make sense to emphasise in analysis pipelines are: integration testing, i.e. do all the parts work together as expected? (The phrase ‘expected result’ is a bit nebulous; its robustness in testing hinges on how specific your expectations are.) Some aspects might lend themselves to unit testing, but much of this may reside in the individual tools that a pipeline wraps. Regression testing can be effective when refactoring a pipeline which is intended to produce identical results by different means.

See the software packages checklist for more details on general unit testing.

An area on which to focus testing might be your quality control (QC) steps: for example, checking that data exhibiting a common source data quality issue triggers the appropriate QC warnings, and making sure that good quality data does not trigger quality warnings. In addition to poor quality data, test corrupt, truncated, or incorrectly formed data/configuration files, and edge cases like many extreme values (a sketch of such tests follows below).
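
A minimal sketch of what such QC tests might look like with pytest. The run_qc function, its module path, and its behaviour (returning a list of warning strings and raising ValueError on malformed input) are hypothetical stand-ins for your own pipeline code; the point is the pattern of asserting that poor quality data does, and good quality data does not, trigger warnings.

```python
import pytest

# Hypothetical QC function from your own pipeline: takes per-record quality
# scores and returns a list of warning strings.
from mypipeline.qc import run_qc


def test_low_quality_data_triggers_warning():
    # Scores well below a plausible threshold should produce a QC warning.
    warnings = run_qc([2, 3, 1, 4])
    assert any("low quality" in w.lower() for w in warnings)


def test_good_quality_data_triggers_no_warnings():
    # Clean, high-quality input should pass without warnings.
    assert run_qc([35, 36, 38, 40]) == []


def test_malformed_input_is_rejected():
    # Truncated or empty input should fail loudly rather than silently
    # producing empty results.
    with pytest.raises(ValueError):
        run_qc([])
```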

Test data should follow the general principle: as small as possible, as large as necessary. Real world data can potentially be down-sampled, and synthetic data generated to test edge cases. You may be able to find a repository of pre-existing test datasets suitable for your domain; nf-core provides a resource of test datasets, though these are, at least currently, largely of biological data types, reflecting the current composition of the Nextflow userbase. (A down-sampling sketch follows below.)
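
As an example of generating a small test dataset, here is a hedged sketch of down-sampling a FASTQ-like text file by keeping a random fraction of its 4-line records. The file names and sampling fraction are hypothetical; a fixed seed keeps the resulting test data stable between runs.

```python
import random
from itertools import islice


def downsample_fastq(in_path: str, out_path: str, fraction: float = 0.01, seed: int = 42) -> None:
    """Write roughly `fraction` of the 4-line FASTQ records in in_path to out_path."""
    rng = random.Random(seed)  # fixed seed so the test dataset is reproducible
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = list(islice(src, 4))  # one FASTQ record spans 4 lines
            if len(record) < 4:
                break
            if rng.random() < fraction:
                dst.writelines(record)


# Hypothetical usage:
# downsample_fastq("full_dataset.fastq", "test_data.fastq", fraction=0.001)
```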

Testing portability: does it run in different environments? nf-core Nextflow pipelines, for instance, should be able to run with Docker, Apptainer (Singularity), or conda based environments; do all of these work and produce the same output?

nf-test provides an automated testing framework for Nextflow pipelines, described in Improving the Reliability and Quality of Nextflow Pipelines with nf-test. nf-test implements a number of optimisations including: Git integration and dependency graph analysis of pipeline component inputs/outputs to select and prioritise tests to run based on what has changed; convenient snapshot based regression testing; and parallel test execution. These features make the sometimes lengthy and laborious process of testing computationally intensive pipelines much more manageable.

This conference paper introduces a name for a category of software test that the authors call ‘scientific tests’: black-box style tests, agnostic to the implementation, which test that the broad behaviour of a system is as predicted.

🤖 Automation

What tasks can you automate to increase consistency and reduce manual work?

  • [ ]

👥Peer review / Code Review

How can you get third party endorsement of and expert feedback on your project?

      • You have published a peer reviewed article with a scientific review of the theoretical / statistical / mathematical underpinnings of the tool that you implemented, in addition to a technical peer review of the code quality. (These may well be separate reviews, for example by a methods journal and a software repository, reflecting their different expertise.)
      • You have had an independent ‘red team’ attempt to find errors in your project and incorporated any relevant changes as a result
      • Your project is a part of a bug bounty program
  • The design of the analysis, any methodological choices made, and any original steps added might warrant a conventional scientific publication, for example if you are making a pipeline which automates a portion of the analysis of some new datatype
    • review the theory
  • Technical
    • review the implementation

📦Distribution

How can people install or access the software emerging from your project?

  • Including your pipeline in a collection of pipelines increases its visibility and can help to attract contributors; in the case of a curated collection with good standards for how pipelines are packaged, it also provides users with confidence that they will be able to use your pipeline on their compute infrastructure.
  • Good places to distribute workflows include:
    • WorkflowHub: the most generic, accepts workflows written in a number of different tools
    • For Nextflow pipelines in the nf-core format: nf-core (MIT license required for pipeline code)
    • For Snakemake pipelines which conform to some relatively simple requirements: the Snakemake workflow catalog
    • For pipelines built with the R {targets} tool: the Targetopia

💽Environment Management / Portability

How can people get specific versions of your software running on their systems?

  • In the context of a pipeline each independent step should ideally be performed in its own environment, perhaps defined within a container, with only the tools necessary to perform that step of the analysis. Many pipeline management tools support specifying per-task compute environments using tools such as conda, and container technologies such as Docker and Singularity/Apptainer.
    • Container images are a convenient format in which to distribute software along with its dependencies and to isolate this environment from other software, which can help avoid conflicts between dependencies. Container images, like virtual machine images, can be quite large and thus cumbersome to distribute; they are also something of a black box once built. Containers provide many of the advantages of virtual machines (VMs) but generally with less of a performance penalty. However, whilst they are narrowly reproducible, they are not readily interrogated and checked unless you provide the build instructions which generated the image, for example a Dockerfile. Unfortunately the process of building container images is itself not necessarily reproducible. Thus when specifying container builds it is best practice to specify exact package versions in your build so that the image builds are reproducible. This can be challenging as many popular operating system package managers lack the tooling to do this easily. Pinning your container build to a snapshot of package repositories taken at a given date that will be available archivally is one way to address this. This paper provides some Recommendations for the packaging and containerizing of bioinformatics software. In the case of bioinformatics pipelines it is often easiest to specify your environment with conda and then build containers and/or VMs which install that same conda environment on a Linux base image such as Debian. This has the advantage that the conda environment can be used independently of any images built with it, reducing the maintenance burden of supporting multiple approaches to distributing the compute environment. Which package/environment management tool has well packaged versions of all the relevant software may be specific to your discipline.

      If you do need to inspect the contents of a container image, a number of tools developed by Anchore, in particular Syft, can be helpful in producing an account of the software installed in the image. It is best practice to keep the software that you install in an image to the minimum necessary for the function that you need the image to perform; however, determining what this minimal set is can be non-trivial.

    • Functional package managers such as Nix and Guix have a ‘best practices by design’ approach to packaging software. They do not suffer from the issue of it being difficult to determine what is and is not a required dependency, as this work is done up-front when the software is packaged. They usually require that dependencies be completely specified and that packages be built in a sandboxed environment which only has access to the explicitly specified dependencies. This provides much stronger guarantees of the ability to specify and build reproducible environments. It is also possible to build container and VM images specified with these tools, and a container specified with them could be a drop-in replacement for one specified with conda and Docker, for example. Unfortunately these tools have yet to see wide adoption in the scientific / research computing communities and thus many packages used by these communities are not packaged for these tools (despite nixpkgs being the largest extant software package repository with >100,000 packages), hindering their broader adoption. Use and awareness is growing and there are some excellent case studies. Nix is also cross platform, working natively on macOS, on Windows via the Windows Subsystem for Linux, and even on Android.

      It is worth being aware of these tools and considering packaging any software that you produce for them, as they are gaining popularity and address many of the shortcomings and limitations of current package and environment management solutions.

  • Are the environments for each step of your pipeline well described using an environment management tool such as Conda or Spack, and/or supplied as OCI containers, runnable with tools such as Docker, podman, LXC, Singularity/Apptainer, or others?
  • Many popular pipeline management tools integrate with environment management and container runtimes to facilitate portability of reproducible compute environments. see:
  • The Pipelines in Genomics PiGx (paper) collection represents a gold standard in reproducible computational environments for genomics pipelines. It used the Guix functional package manager to attain >97% bitwise reproducibility for dependencies across the pipelines in the collection. Containers built with Nix or Guix can be used in pipeline managers, as PiGx does with Snakemake.

🌱 Energy Efficiency

How can you and your users minimise wasted energy?

Everyone likes fast and efficient code, but if your code is going to be re-used by lots of people in a computationally demanding application it can consume a lot of energy. This translates into carbon emissions, water use, and opportunity costs for whatever else could have been done with that energy and compute time.

Consider what you can do to make your code a little more efficient:

  • Measurement - make use of tools to estimate the energy &/or carbon costs of your analyses and report them to the end user. If you are using a Nextflow pipeline then there is a very easy way to estimate the energy utilisation and carbon footprint of your pipelines with nf-co2footprint. This is a generalisation of a tool originally developed to estimate emissions for jobs submitted to HPC clusters running the SLURM scheduler: GA4HPC, see: Green Algorithms: Quantifying the Carbon Footprint of Computation. The people in the Green Algorithms community of practice have some useful advice and resources for anyone interested in this subject.
  • Good documentation and good error handling/messages can reduce the number of times people make mistakes using your code, meaning they re-run or partially re-run it fewer times before they figure out how to use it correctly.
  • Don’t generate unnecessary outputs that will sit on people’s drives unused, and clean up the results of intermediate steps. If you’re making a pipeline that produces a lot of intermediate files and outputs, consider which of these are needed or are good defaults, which could be optional, and which could be discarded by default. Defaults are king and people will mostly keep whatever your tool outputs, often essentially indefinitely, so you can reduce the energy expended on unnecessary storage by keeping your outputs lean. You might have varying degrees of verbosity of output, with a more verbose mode for debugging but defaulting to just the essentials.
  • For pipelines in particular, cache results and avoid re-computing things where possible; make the best use of these features in pipeline managers, for example by having small granular tasks to minimise repeated work after a run failure.
  • Choice of libraries and frameworks: some libraries may be more efficient than others, be a wrapper around an efficient implementation in another language, or be able to make use of offload to hardware accelerators.
  • Offload to hardware accelerators where available. Vector, matrix, and array arithmetic can often benefit from very substantial speed-ups on hardware specialised for these types of calculations, or even from binaries compiled with the right instruction set extensions enabled to take full advantage of hardware acceleration features on many CPUs. Doing this directly can be quite challenging, but using libraries capable of managing this offload for you can make it more approachable. (This can potentially introduce interesting reproducible computation challenges due to things like differences in the handling of floating point arithmetic between hardware/firmware implementations.)
  • Benchmarking & Profiling to locate and improve inefficient code. Don’t optimise prematurely: it is often surprising which pieces of your code turn out to be slow, so measure first and check where to focus your attention. This goes hand in hand with robust testing, as a good test suite means that you can confidently refactor an inefficient piece of code without fear of introducing errors. Robust testing, especially of large computationally intensive pipelines, can itself be energy intensive, so it is important to make efficient test suites which still provide robust coverage. See the testing section for more information on optimisations in the nf-test framework which help to make testing Nextflow pipelines more efficient.
  • Language Choice: some languages are (at least on average) more efficient than others; weigh this as a factor when selecting a language. In the context of a pipeline manager the efficiency of the language it is itself written in is rarely particularly relevant; what matters more are features which increase the efficiency of overall execution, such as good management of caching and degrees of parallelism. However the efficiency of the individual components of the pipeline may have a disproportionate impact as they may be run many times by many people, so selecting the most efficient modules is of higher impact.
  • Consider the timing and location of computationally intensive runs; you could automate running at optimal times and in optimal places to make use of surplus renewable energy. In the UK you could query the National Grid carbon intensity API to pick opportune times and/or places (see the sketch below).
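
As referenced in the final item above, a minimal sketch (in Python, standard library only) of querying the GB National Grid Carbon Intensity API to inform when to launch a heavy run. The endpoint and response fields shown are assumptions based on the API’s public documentation and should be verified against it before use.

```python
import json
from urllib.request import urlopen

# Carbon Intensity API for the GB electricity grid (assumed endpoint).
API_URL = "https://api.carbonintensity.org.uk/intensity"


def current_carbon_intensity() -> dict:
    """Return the current carbon intensity record (values in gCO2/kWh)."""
    with urlopen(API_URL) as response:
        payload = json.load(response)
    return payload["data"][0]["intensity"]


if __name__ == "__main__":
    intensity = current_carbon_intensity()
    # The 'forecast' value and 'index' band (e.g. "low", "moderate", "high")
    # could be used to decide whether to start a run now or defer it.
    print(f"{intensity.get('forecast')} gCO2/kWh ({intensity.get('index')})")
```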

⚖ Governance, Conduct, & Continuity

How can you be excellent to each other, make good decisions well, and continue to do so?

      • [ ]
      • [ ]
  • If you are the Benevolent Dictator For Life (BDFL) of your project and the Code of Conduct (CoC) is “Don’t be a Dick”, that’s fine; for many individual hobby projects this is a functional reality. Becoming a BDFL tends to be the default unless you take steps to avoid it and cultivate community governance as your project begins to grow; failing to do this and being stuck in charge can become quite the burden in successful projects. Be wary of adopting policies for which you lack the resources, time, interest, skill, or inclination to be an active enforcer, mediator, and moderator of community norms and disputes. It is helpful to be clear about what you can and cannot commit to doing; only by communicating this might you be able to find community members to help you with setting and enforcing these norms, if or when your community attains a scale where this becomes relevant. Community management is its own skill set. If you can’t moderate them, avoid creating and/or continuing ungoverned community spaces that can become a liability for you and your project’s reputation. Just as there are off-the-shelf licenses there are off-the-shelf codes of conduct; the Contributor Covenant is perhaps the best known and most widely used, though it may need some customisation to your needs. Adopting such a CoC gives you some guidance to follow if there is bad behaviour in your project’s community and communicates that you, as the project leadership, take the responsibility of creating a respectful environment for collaboration seriously. It can also signal that your project is a place where everyone is welcome but expected to treat one another with respect, and that failing to do so will result in penalties, potentially including exclusion from the community. The Turing Way provides quite a nice example of a CoC developed specifically for their project. You will need to provide contact information for the person(s) responsible for the enforcement of the CoC in the appropriate place and be able to follow up in the event it is used. Git forges often recognise files with the name CODE_OF_CONDUCT.md in the root of a project and provide a link to them on project home pages, so this is a good place to document such policies. If you are the BDFL of a small project then interpretation and enforcement of such a CoC tends to fall solely on you; game out some courses of action for what you’d do if faced with some common moderation challenges.
    • Once a project attracts a larger community there is greater scope for disputes and therefore for the need for dispute resolution mechanisms. Free/Libre and Open Source Software development and maintenance can be thought of as a commons so I would refer you to the work of Elinor Ostrom on how commons have been successfully (or unsuccessfully) governed when thinking about what processes to adopt for your project. More recently Nathan Schneider’s Governable Spaces: Democratic Design for Online Life tackles some of these issues as applied to online spaces.
    • This is summarised in the 8 Principles for Managing a Commons
      1. Define clear group boundaries.
      2. Match rules governing use of common goods to local needs and conditions.
      3. Ensure that those affected by the rules can participate in modifying the rules.
      4. Make sure the rule-making rights of community members are respected by outside authorities.
      5. Develop a system, carried out by community members, for monitoring members’ behaviour.
      6. Use graduated sanctions for rule violators.
      7. Provide accessible, low-cost means for dispute resolution.
      8. Build responsibility for governing the common resource in nested tiers from the lowest level up to the entire interconnected system.
    • An informal do-ocracy in the fiefdom of a BDFL is often the default state of projects that have not given much conscious thought to how they want to be governed, and such projects are thus often subject to many of the common failure modes of this model. How are decisions made in your project? Do you need the mechanisms of governance used by community and civil society organisations: by-laws, a committee and/or working groups, general meetings, votes, minutes? A version of these may be necessary to avoid The Tyranny of Structurelessness. How can you map these onto your development infrastructure and make the decisions of your governing bodies enactable and enforceable?
  • Continuity planning: What happens to your project if something happens to you? The code will likely live on due to the distributed nature of git, but what about the issue tracker, the website, etc.? Who else has the highest level of privilege on your project, or a mechanism to attain it? The principle of least privilege dictates that you keep the number of people with this level of access to a minimum, but you may then create a single point of failure. Password managers like Bitwarden have a feature where designated people can be given access to your vault if they request it and you do not deny the request within a certain time-frame. This could provide a lower level admin with a mechanism to escalate their privileges if you are unable to do this for them. However, this delay might be an issue for continuity of operations if administrator action is needed within the waiting period. Game it out, have a plan, write it down, and let people know you have a plan.
  • Does your project take donations? Does it have a trademark? Does it need a legal entity to hold these? Who is on the paperwork and who has signing authority? Who keeps track of expenditures? Tools & Organisations like OpenCollective can help with some of these issues.
  • If your project has potential cybersecurity implications, what procedures do you have in place for people to disclose vulnerabilities in the project so that they can be patched before they are made public? What systems do you have in place to disclose a vulnerability once it has been patched, and to ensure that users know that they need to update?
  • Whole project data longevity - what plans do you have in place to backup and archive materials pertaining to your project that are not under source control?
  • User support
    • What support can users expect, or not expect?
    • Where can they ask for it?
    • Is there somewhere where users can provide support to other members of the user community, such as a forum?
    • Can they pay for more support?

Research Software Sharing, Publication, & Distribution Checklists by Richard J. Acton is licensed under CC BY 4.0