Software Packages

Research Software Sharing, Publication, & Distribution Checklist

Considerations for publishing a software package that may be used in research, or that you are publishing as a researcher

📒 Source control

How can you keep track of the history of your project and collaborate on it?

  • If the language you are writing in has a convenient tool for initiating a template for a package, then you may want to get your project’s git repository started using that tool. R, for example, has the {usethis} package, which makes the creation of a minimal R package very easy, including adding automated building and testing with GitHub Actions (see the sketch below).
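
A minimal sketch of getting started this way in R might look like the following; the package name and the particular {usethis} helpers shown here are illustrative rather than prescriptive:

```r
# Illustrative {usethis} calls for scaffolding a new R package;
# "mypackage" and the helpers chosen are placeholders.
usethis::create_package("mypackage")          # scaffold a minimal package
usethis::use_git()                            # initialise a git repository
usethis::use_testthat()                       # set up a {testthat} test suite
usethis::use_readme_md()                      # add a README
usethis::use_github_action("check-standard")  # add a GitHub Actions R CMD check workflow
```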

© Licensing

On what terms can others use your code, and how can you communicate this?

  • All software needs a license if you want to permit others to reuse it. It is important to give some thought to the type of license that best suits your project, as this choice can have significant long-term implications. Check out The Turing Way chapter on licensing for an introduction to the subject. If you are short on time, some fairly safe choices are: for a permissive license, the Apache 2.0, which allows the reuse of your work in closed commercial code; for a ‘copyleft’ license, the GPLv3 (AGPL for server-side apps), which requires that anyone distributing software containing your code, or derivatives of it, share the source code with the people they distributed it to.
  • If you are including external code in your package then you should check that the licenses are compatible and that you are legally allowed to distribute your code together in this way. Check out this resource on license compatibility.
  • REUSE.software is a tool that can help you keep track of licenses in complex multi-license projects. It identifies the license of the code in individual files with SPDX licence identifiers (see the header sketch below) and has an approach to doing the same for binary assets.
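
As an illustration, REUSE records license information in per-file comment headers using SPDX identifiers; a hypothetical header in an R source file might look like this (the name, year, and license are placeholders):

```r
# Hypothetical per-file header following the REUSE / SPDX convention
# SPDX-FileCopyrightText: 2024 Jane Doe <jane.doe@example.org>
# SPDX-License-Identifier: Apache-2.0
```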

📖 Documentation

How do people know what your project is, how to use it and how to contribute?

For small and simple projects a README file may be sufficient documentation. Depending on the project, this may genuinely be all that you need, or it may be inadequate to the task.

  • README / Manual
    • What your project is and what it does
    • Installation instructions
    • Contribution guidance
      • for example: Issue templates, a code of conduct, process details
    • Development environment setup
      • Overview of project organisation and structure
  • CHANGELOG: it can be a good idea to include a CHANGELOG file in your project documenting things which have changed since the previous release. This manuscript on The impact of package selection and versioning on single-cell RNA-seq analysis provides a nice case study for why this can be useful in academic settings, especially if decisions have been made to change defaults between versions.
  • ‘Docstrings’ and similar
    • Many programming languages have a way of documenting your code inline which can automate the generation of some parts of the documentation. This often takes the form of specifically marked-up comments. Examples include Python’s docstrings, R’s Roxygen2, and Perl’s POD (a Roxygen2 sketch follows this list).
  • Vignettes / Examples
    • Examples of use of the code in the context of a real problem, beyond simple example snippets which might be included in the documentation of individual functions/objects. (These can also serve as a form of simple integration tests if you run them as a part of your documentation build.)
  • Larger projects might also include Project Documentation: Plans, Design documents and Specifications
  • Process Documentation: how to proceed with various tasks related to the project. This might include: submitting issues, submitting merge requests, reporting possible vulnerabilities, testing, documentation, release, and code review.
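
As an example of the inline documentation mentioned above, an R function documented with {roxygen2} comments might look like this minimal sketch (the function itself is hypothetical); tools such as devtools::document() turn these comments into the package’s help pages:

```r
#' Add two numbers
#'
#' A trivial, hypothetical example function used here only to
#' illustrate {roxygen2} inline documentation.
#'
#' @param x A numeric value.
#' @param y A numeric value.
#' @return The sum of `x` and `y`.
#' @examples
#' add_numbers(2, 3)
#' @export
add_numbers <- function(x, y) {
  x + y
}
```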

🔗 Making Citable

How should people make reference to your project and credit your work?

It is important that code used in research can be properly cited by researchers so that they can communicate which version they used, where to find the code, and give appropriate credit to its authors. Even if you are not an academic it is important that academics be able to credit your work so that it can be appropriately valued in the scientific funding ecosystem. If it is not framed as contributing to a research output it is harder to justify funding it and paying developer salaries, even if indirectly.

Beyond making it possible to consistently reference a research output, the higher tiers on the checklist help you to follow better citation and bibliographic practices. This extends from the practical, such as making your work easy to import into reference managers like Zotero; to protecting against link rot through the use of persistent digital object identifiers; to the use of linked / semantic data practices to identify and connect contributors, the nature of their contributions, their institutional associations, and the things to which they contributed.

Further information:
  • Including a CITATION.cff (Citation File Format) file in your project repo is a simple way of making your code citable. The format is human-readable YAML and permits the provision of the metadata needed for citation (a minimal example follows this list).
  • Zenodo permits you to mint a digital object identifier (DOI) for your code; this is a persistent identifier which can be used to refer to it. You can tie the minting of versioned DOIs to releases of your project. Using a DOI permits the existing ecosystem of academic software, e.g. Zotero, to use APIs to retrieve citation metadata about your project. Zenodo also hosts a snapshot of your source code, so that if your main code repository ever went down it would still be possible to retrieve it there. Citation metadata can be imported from a .cff file or a .zenodo.json file in your repository. This makes it pretty easy to manage updates, as you can just edit these files and have a platform integration or a step in your CI push them to Zenodo the next time you do a release.
  • Software Heritage is an expansive archive of open source software operated by a non-profit organisation in collaboration with UNESCO; see how to reference and archive code in Software Heritage. Software Heritage identifiers (SWHIDs) have the advantage that they are content-based identifiers, meaning that you can check that the content you get back when you retrieve it is what you expected to get based on its identifier. The Software Heritage API permits you to automate the archiving of your project repository via a webhook from popular git forges like GitHub, GitLab, and others. Unlike Zenodo, which only preserves a snapshot of your repository at the time of deposition and at subsequent manual time points and/or tagged releases, Software Heritage archives the whole repository.
  • Further reading on the ethics of CROTs (contributor role ontologies or taxonomies), and their evolution and adoption, is potentially useful in selecting a CROT suitable for your project.
  • Nix and Guix
    • General software repositories may not make specific provision for citation of software packages in the academic fashion. However, some provide what is, for some use cases, a superior form of ‘citation’ of their own sources, i.e. a complete ‘software bill of materials’ (SBOM). This is a list of all the code used in another piece of code: its dependencies, and their dependencies recursively, along with all of their versions. Nix can do this, for example, but Guix is perhaps the most comprehensive in its approach. It not only provides all the information necessary for a complete SBOM but can bootstrap the software packages in its repository from source with an extremely minimal fixed set of binaries, an important capability for creating somewhat trustworthy builds. This creates a compute environment which is not only reproducible but verifiable, meaning the source of all of an environment’s dependencies can in theory be scrutinised. It also adopts an approach to commit signing and authorisation of signers that gives it a currently uniquely complete supply chain security architecture. Packages or ‘derivations’ are ‘pure functions’ in the sense that only their inputs affect their outputs and they have no side-effects; package builds are sandboxed to prevent dependencies on any external source not explicitly provided as an input, and inputs are hashed to ensure that they cannot differ from the values expected when they were packaged. This gives these technologies an unrivalled ability to readily demonstrate the reproducibility and provenance of compute environments specified using them.
    • Whilst not yet fully implemented and adopted, these technologies also afford some fascinating opportunities for seamless access to archival versions of software in the future. Due to the similarities in the content-based addressing used by Git, Nix, Guix, IPFS (the InterPlanetary File System), and Software Heritage’s IDs, it may be possible to construct an approach to archiving, distributing, and caching the sources of packages in a way that would ensure that low-demand archived software sources and high-demand current packages can be distributed transparently through the same mechanism. This would in theory permit the reconstruction of any historically specified compute environment that had been archived, with no changes to the normal workflow other than perhaps a longer build time. This approach also makes the creation of ‘mirrors’ of the archive relatively simple and requires no client-side changes, as an IPFS resource will be resolved irrespective of the node on which it is stored. See: NLnet Software heritage and IPFS, Tweag - software heritage and Nixpkgs, John Ericson - Nix x IPFS Gets a New Friend: SWH (SoN2022 - public lecture series)
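
Returning to the CITATION.cff file mentioned above, a minimal example might look something like the following sketch; the author, title, and identifiers are placeholders, and the Citation File Format documentation describes the full set of available fields:

```yaml
# CITATION.cff - minimal illustrative example with placeholder values
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "mypackage: a hypothetical example package"
type: software
authors:
  - family-names: Doe
    given-names: Jane
    orcid: "https://orcid.org/0000-0000-0000-0000"
version: 1.0.0
date-released: 2024-01-15
repository-code: "https://github.com/example/mypackage"
license: Apache-2.0
```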

✅ Testing

How can you test your project so you can be confident it does what you think it does?

  • A good test suite allows you to refactor your code without fear of breaking its functionality. Good tests are agnostic to the implementation details of the action that you are testing, so that you can change how you implemented something without needing to change the tests. The use of automated testing frameworks is especially useful for software that is under ongoing development, as it allows developers to catch unintended consequences that a change made in one place has on some other, unanticipated part of the code.
  • Examples of automated testing frameworks include {testthat} for R and unittest for Python (a minimal {testthat} example follows this list). Tools like Codecov or Coveralls, in conjunction with language-specific tools such as covr, can help with code coverage monitoring and insights.
  • Unit tests allow you to spell out in detail what you expect the behaviour of your software to be under a particular circumstance and test if it conforms to these expectations. Automatically running tests like this can be added to CI/CD pipelines on git forges.
  • Test coverage does not necessarily need to be 100%, or even especially high, but code coverage tools allow you to spot gaps in test coverage over important parts of your codebase and ensure that you cover them, and they can give you an indication when you have added new, poorly covered code that you may want to write tests for.
  • Try to make sure that your test suite runs fast so that you can run it regularly and quickly iterate.
  • Test Driven Development (TDD) is the practice of writing your tests first and then developing the code which conforms to these tests. It works well if you have an extremely well defined idea of what exactly you want your code to do and not do.
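
A minimal {testthat} unit test might look like the sketch below; normalise_counts() is a hypothetical function standing in for whatever behaviour your own package provides:

```r
# tests/testthat/test-normalise-counts.R - illustrative only;
# normalise_counts() is a hypothetical package function.
test_that("normalise_counts() returns values that sum to one", {
  counts <- c(2, 3, 5)
  result <- normalise_counts(counts)
  expect_equal(sum(result), 1)
  expect_length(result, length(counts))
})

test_that("normalise_counts() rejects negative input", {
  expect_error(normalise_counts(c(-1, 2)))
})
```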

🤖 Automation

What tasks can you automate to increase consistency and reduce manual work?

  • Linting is a process of statically analysing source code to catch errors which can be detected without compiling/running the code, such as syntax errors. Examples include {lintr} for R and Ruff for Python.
  • Automate the use of a standard style / format. Using an automated code formatter ensures that your project has consistent code formatting. This can forestall such debates among contributors as ‘spaces vs tabs’ for indentation, at least once you have agreed to bake that decision into your formatter and quash further discussion on the topic. Examples include {styler} for R and Black for Python.
  • Building Documentation
  • Git hooks: it can be preferable to automate certain actions based on git events
    • pre-commit hooks can be especially useful for automating things like linting and code formatting on your local system, so that you cannot commit any changes to your git history which do not conform to these standards. Another useful application is where your documentation is built from source documents in your repository and is also under version control; a pre-commit hook can ensure that you cannot commit your documentation sources and their build artefacts in an inconsistent state. You can even do this with automated testing, so contributors cannot commit code that breaks tests locally; that way, if tests break in continuous integration (CI) it is likely to be something related to differences in the testing environment or conflicting changes.
    • You can write and manage your own git hooks. There is also the tool pre-commit, written in Python and configured in YAML, which is a package manager for git commit hooks, allowing you to simply install and configure many existing hooks (a minimal configuration sketch follows this list).
  • Continuous Integration and Deployment (CI/CD) is, broadly, the automation of the integration of new changes from contributors to your project and the deployment of those changes to your users on an ongoing basis.
    • Many modern software forges combine hosting of source control with CI/CD tools. GitHub has ‘GitHub Actions’ and GitLab has ‘GitLab CI/CD’; these tools are tied to those specific git hosting services, meaning that adopting them can generate significant lock-in to that specific git hosting tool/platform (a minimal workflow sketch follows this list). Codeberg provides an instance of Woodpecker CI and there are other git host-agnostic CI/CD tools such as Jenkins available.
    • CI/CD pipelines can also be a good place to run linters and code formatters which either reject merges/pushes which do not conform to these standards or automatically apply them and use bots to commit them. This can act as a second line of defense to ensure that contributors have linted their code, applied standard formatting and built documentation.
    • CI is a good place to run automated testing so that CD does not deploy anything detectably broken by your automated test suite.
    • CI/CD is a very convenient way to manage documentation websites for your software which are built from your package’s sources. For example, the {pkgdown} package for R is specifically for building documentation sites for R packages and integrates with GitHub Actions and GitHub Pages to do so.
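
As a sketch, a minimal .pre-commit-config.yaml for the pre-commit tool mentioned above might look like the following; the specific hooks are illustrative, and the pre-commit documentation lists many more, including hooks that wrap language-specific formatters and linters:

```yaml
# .pre-commit-config.yaml - illustrative minimal configuration
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace   # strip trailing whitespace
      - id: end-of-file-fixer     # ensure files end with a newline
      - id: check-yaml            # validate YAML syntax
      - id: check-merge-conflict  # block unresolved merge-conflict markers
```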
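
Similarly, a minimal GitHub Actions workflow running R CMD check on pushes and pull requests might look something like the sketch below; it assumes the community-maintained r-lib/actions, and the versions and options shown are illustrative:

```yaml
# .github/workflows/R-CMD-check.yaml - illustrative minimal workflow
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

name: R-CMD-check

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
      - uses: r-lib/actions/check-r-package@v2
```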

👥 Peer review

      • Published a peer-reviewed article with a scientific review of the theoretical / statistical / mathematical underpinnings of the tool that you implemented, in addition to a technical peer review of the code quality. (These may well be separate reviews, for example by a methods journal and a software repository, reflecting their different expertise.)
      • You have had an independent ‘red team’ attempt to find errors in your project and incorporated any relevant changes as a result.
      • Your project is a part of a bug bounty program.
  • Entities like The Journal of Open Source Software (JOSS), rOpenSci, and pyOpenSci provide a more ‘academic peer review flavoured’ form of software review and make it easy to cite software in the academic style.
  • Package repositories like CRAN and Bioconductor have quite robust processes for reviewing the suitability and quality of the packages listed in them; this is a form of peer review, though with a more technical focus than academic peer review of research manuscripts. By contrast, PyPI and npm have minimal review processes and anyone with an account can upload packages which meet their technical specifications for packaging. Different language communities have different standards and practices around their major package repositories.

📦 Distribution

  • Packaging your software so that it can easily be installed by package and environment management tools is important to allow people to use your software. Using standard packaging formats and build tools also often makes it easier to automate testing and documentation building from your source code, as well as building binary packages for different versions, operating systems, and architectures.
  • Package repositories and other packaging formats: conda, Spack, Nix.
  • If you do not have the resources to maintain your package it may be preferable to leave it out of the main package repositories; many may not allow your code to be included there without an active maintainer.

💽 Environment Management / Portability

      • conda / environment.yml (see the sketch after this list)
      • Make use of functional package managers like Nix/Guix whose package derivations make the strongest guarantees about the ability to re-build a package as they describe a pure function called in a sandboxed environment.
      • Cross operating system / architecture builds: does your package build on different operating systems and instruction set architectures (ARM, x86, RISC-V, etc.)?
  • Use of a robust environment management tool for your language which can exactly reproduce the environment in which any given build of your software was made. In particular, any released version of your software would ideally be re-buildable from source in a bit-for-bit identical fashion.
  • For a software package that people may want to run in many different environments, and which may be run with different versions of the language and other packages, it is important to check a broad combination of factors which might crop up in the environments that people are likely to be using, such as:
    • Different operating systems and versions of these operating systems
      • e.g. Linux vs Windows vs MacOS, and Win10 vs Win11
    • Different language versions
      • e.g. R 3.6.3 and R 4.3.2
    • Different computational architectures
      • x86_64, arm64, RISC-V
      • There are practically analogous issues with code that runs on AMD vs Intel vs Nvidia accelerators
  • Combinations of all of the above
  • You cannot cover all of these in all combinations, nor do you need to; just cover the ones most relevant to your software and its users.
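
As a sketch of the conda / environment.yml approach mentioned above, an environment file pinning exact versions might look like this (the package names and versions are placeholders):

```yaml
# environment.yml - illustrative environment with pinned versions
name: myproject
channels:
  - conda-forge
dependencies:
  - r-base=4.3.2
  - r-dplyr=1.1.4
```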

🌱 Energy Efficiency


Everyone likes fast and efficient code, but especially if your code is going to be re-used by a lot of people in a computationally demanding application it can burn a lot of energy. This translates to carbon emissions, water use, and opportunity costs for whatever else could have been done with that energy and compute time. If you are making a pipeline that produces a lot of intermediate files and outputs, consider which of these are needed or are good defaults, which could be optional, and which could be discarded by default. Defaults are king, and people will mostly keep whatever your tool outputs, often essentially indefinitely, so you can reduce the energy expended on unnecessary storage by keeping your outputs lean. Consider what you can do to make your code a little more efficient.

Good documentation and good error handling can reduce the number of times people make mistakes using your code and have to re-run or partially re-run their analysis before they figure out how to use it correctly.

  • Don’t generate unnecessary outputs that will sit on people’s drives unused; clean up the results of intermediate steps
  • For pipelines in particular, cache results and avoid needing to re-compute things if possible; make the best use of these features in pipeline managers, for example by having small granular tasks to minimise repeated work on run failure.
  • Choice of libraries and frameworks: some libraries may be more efficient than others, be a wrapper around an efficient implementation in another language, or be able to offload work to hardware accelerators.
  • Benchmarking & profiling to locate and improve inefficient code (see the sketch after this list)
  • Language Choice
  • Offload to hardware accelerators
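
As a small illustration of benchmarking in R, the {bench} package (one option among several, chosen here purely as an example) can compare candidate implementations of the same operation; profilers such as {profvis} can then help locate hot spots:

```r
# Illustrative benchmark comparing two ways of computing row sums;
# {bench} is only one of several benchmarking options in R.
library(bench)

m <- matrix(runif(1e6), nrow = 1e3)

bench::mark(
  apply_based = apply(m, 1, sum),  # generic apply-based version
  vectorised  = rowSums(m)         # dedicated vectorised implementation
)
```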

⚖ Governance, Conduct, & Continuity

How can you be excellent to each other, make good decisions well, and continue to do so?

  • If you are the Benevolent Dictator For Life (BDFL) of your project and the Code of Conduct (CoC) is “Don’t be a Dick”, that’s fine; for many individual hobby projects this is a functional reality. Becoming a BDFL tends to be the default unless you take steps to avoid it and cultivate community governance as your project begins to grow; failing to do this and being stuck in charge can become quite the burden in successful projects. Be wary of adopting policies for which you lack the resources, time, interest, skill, or inclination to act as an active enforcer, mediator, and moderator of community norms and disputes. It is helpful to be clear about what you can and cannot commit to doing. Only by communicating this might you be able to find community members to help you with setting and enforcing these norms, if or when your community attains a scale where this becomes relevant; community management is its own skill set. If you can’t moderate them, avoid creating and/or continuing ungoverned community spaces that can become a liability for you and your project’s reputation. Just as there are off-the-shelf licenses there are off-the-shelf codes of conduct; the Contributor Covenant is perhaps the best known and most widely used, though it may need some customisation to your needs. Adopting such a CoC gives you some guidance to follow if there is bad behaviour in your project’s community and communicates that you, as the project leadership, take the responsibility of creating a respectful environment for collaboration seriously. It can also signal that your project is a place where everyone is welcome but expected to treat one another with respect, and that failing to do so will result in penalties, potentially including exclusion from the community. The Turing Way provides quite a nice example of a CoC developed specifically for their project. You will need to provide contact information for the person(s) responsible for the enforcement of the CoC in the appropriate place and be able to follow up in the event it is used. Git forges often recognise files with the name CODE_OF_CONDUCT.md in the root of a project and provide a link to them on project home pages, so this is a good place to document such policies. If you are the BDFL of a small project then interpretation and enforcement of such a CoC tends to fall solely on you; game out some courses of action for what you’d do if faced with some common moderation challenges.
    • Once a project attracts a larger community there is greater scope for disputes, and therefore for the need for dispute resolution mechanisms. Free/Libre and Open Source Software development and maintenance can be thought of as a commons, so I would refer you to the work of Elinor Ostrom on how commons have been successfully (or unsuccessfully) governed when thinking about what processes to adopt for your project. More recently, Nathan Schneider’s Governable Spaces: Democratic Design for Online Life tackles some of these issues as applied to online spaces.
    • This is summarised in the 8 Principles for Managing a Commons:
      1. Define clear group boundaries.
      2. Match rules governing use of common goods to local needs and conditions.
      3. Ensure that those affected by the rules can participate in modifying the rules.
      4. Make sure the rule-making rights of community members are respected by outside authorities.
      5. Develop a system, carried out by community members, for monitoring members’ behavior.
      6. Use graduated sanctions for rule violators.
      7. Provide accessible, low-cost means for dispute resolution.
      8. Build responsibility for governing the common resource in nested tiers from the lowest level up to the entire interconnected system.
    • An informal do-ocracy in the fiefdom of a BDFL is often the default state of projects that have not given much conscious thought to how they want to be governed, and such projects are thus often subject to many of the common failure modes of this model. How are decisions made in your project? Do you need the mechanisms of governance used by community and civil society organisations: by-laws, a committee and/or working groups, general meetings, votes, minutes? A version of these may be necessary to avoid The Tyranny of Structurelessness. How can you map these onto your development infrastructure and make the decisions of your governing bodies enactable and enforceable?
  • Continuity planning: what happens to your project if something happens to you? The code will likely live on due to the distributed nature of git, but what about the issue tracker, the website, etc.? Who else has the highest level of privilege on your project, or a mechanism to attain it? The principle of least privilege dictates that you keep the number of people with this level of access to a minimum, but you may then create a single point of failure. Password managers like Bitwarden have a feature where designated people can be given access to your vault if they request it and you do not deny it within a certain time-frame. This could provide a lower-level admin with a mechanism to escalate their privileges if you are unable to do this for them. However, this delay might be an issue for continuity of operations if administrator action is needed within the waiting period. Game it out, have a plan, write it down, and let people know you have a plan.
  • Planning how to ‘sunset’ your project:
    • If you are no longer actively maintaining the project, let people know that it is not receiving active maintenance and might not be updated for new language and package versions.
    • It can be useful to indicate the status of the project in its README; see repostatus.org, where they define eight different project statuses.
      • Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.
      • WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.
      • Suspended – Initial development has started, but there has not yet been a stable, usable release; work has been stopped for the time being but the author(s) intend on resuming work.
      • Abandoned – Initial development has started, but there has not yet been a stable, usable release; the project has been abandoned and the author(s) do not intend on continuing development.
      • Active – The project has reached a stable, usable state and is being actively developed.
      • Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.
      • Unsupported – The project has reached a stable, usable state but the author(s) have ceased all work on it. A new maintainer may be desired.
      • Moved – The project has been moved to a new location, and the version at that location should be considered authoritative. This status should be accompanied by a new URL.
    • You can also convert repositories to an archival mode on common software forges like GitHub to indicate that they are no longer being worked on.
  • Does your project take donations? Does it have a trademark? Does it need a legal entity to hold these? Who is on the paperwork and who has signing authority? Who keeps track of expenditures? Tools & Organisations like OpenCollective can help with some of these issues.
  • If your project has potential cybersecurity implications, what procedures do you have in place for people to disclose vulnerabilities in the project so that they can be patched before they are made public? What systems do you have in place to disclose a vulnerability once it has been patched and to ensure that users know that they need to update?
  • Whole project data longevity: what plans do you have in place to back up and archive materials pertaining to your project that are not under source control?
  • User support
    • What support can users expect, or not expect?
    • Where can they ask for it?
    • Is there somewhere where users can provide support to other members of the user community, such as a forum?
    • Can they pay for more support?

Research Software Sharing, Publication, & Distribution Checklists by Richard J. Acton is licensed under CC BY 4.0