Web-based service

Research Software Sharing, Publication, & Distribution Checklist

A database, API (application programming interface), or other web-based tool which is, generally, hosted on an ongoing basis and offers some service, or access to some resource, to researchers. In many ways the key considerations here lean more organisational than technical. Do you have the resources to operate the service on an ongoing basis? A record of a specific analysis is a snapshot in time that is only expected to run in its specified environment, and once done is done. Operating an online service requires continuous ongoing work to keep up with security updates and to monitor the status of your server(s) to ensure that your service is still up and working as intended. You can also expect to take on some degree of user support from people having trouble using your service. Depending on the nature of the project it may not make sense for others to deploy instances of the server, but at a minimum other developers will need a test deployment to work on, if not now then in the future.

Most of the suggestions here would be the same as those in the software packages checklist, and indeed software of this type generally consists of one or more packages, so that checklist also applies here. This checklist focuses on things that are in addition to general packages and more specific to web-based services.

📒Source control

How can you keep track of the history of your project and collaborate on it?

  • Unlike many of the other research software output types, this sort of output tends to consist of multiple separate components whose sources may be managed separately; for example, your front-end and back-end codebases might live in their own git repositories. It can be useful to group these repositories together within a group or organisation on your git forge so that their relationship to one another is clear.
  • It is also valuable for this sort of project to document how to deploy a local testing and development environment and/or a minimal deployment of the service. This might take the form of a Dockerfile / docker-compose file, an Ansible playbook, or a similar automation tool for easily deploying a test / example environment. Taking the ‘infrastructure as code’ approach to the deployment of your tool, and versioning these examples, is most useful when your service is of a sort where it makes sense for others to host their own instances. If it is only deployed by you as a central resource these practices may be useful for you internally, but they are less impactful for the rest of the community.
  • If your project is backed by a curated database then documentation of / code from the data collection and cleaning process which led to the current content of the database is valuable for the provenance of that dataset. If you are taking new additions to that database, such tools are very valuable resources for any collaborators wanting to add data. Even if you are not adding new data, these tools can also be very useful to researchers wanting to use data from your database alongside data they have themselves generated or curated, as the ability to process it in the same way as your data may be essential for valid comparisons.
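As a concrete sketch of the ‘infrastructure as code’ approach to a test environment, a docker-compose file along these lines can describe a minimal local deployment (all service names, images, ports, and credentials here are illustrative assumptions, not a real configuration):

```yaml
# Hypothetical minimal development deployment: a web app plus its database.
# Not production-ready: credentials are placeholders and TLS is not handled.
services:
  web:
    build: .                      # build the application from this repo's Dockerfile
    ports:
      - "8000:8000"               # expose the app on localhost:8000
    environment:
      DATABASE_URL: postgres://dev:dev@db:5432/app
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: app
```

Committing a file like this alongside the code gives anyone picking up the project a one-command (`docker compose up`) route to a local test instance.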

©Licensing

On what terms can others use your code, and how can you communicate this?

  • If you want to apply a copyleft license to a piece of software that is to be accessed over a network, and not necessarily run on end-users’ own computers, then you would want to adopt a license such as the AGPL to ensure that your end users still have the right to run, study, modify and redistribute the code of the server-side part of the tool.
  • All software needs a license if you want to permit others to reuse it. It is important to give some thought to the type of license which best suits your project; it is a choice which can have significant long-term implications. Check out The Turing Way chapter on licensing for an introduction to the subject. If you have no time, some pretty safe choices are: for a permissive license, the Apache 2.0, which allows the re-use of your work in closed commercial code; for a ‘copyleft’ license, the GPLv3 (AGPL for server-side apps), which requires that anyone distributing software containing your code, or derivatives of it, share the source code with the people they distributed it to.
  • If you are including external code in your service then you should check that the licenses are compatible and that you are legally allowed to distribute your code together in this way. Check out this resource on license compatibility.
  • REUSE.software is a tool that can help you keep track of licenses in complex multi-license projects. It identifies the license of the code in individual files with SPDX license identifiers, and has an approach to doing so for binary assets.
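For illustration, REUSE-style machine-readable licensing information takes the form of per-file header comments like the following (the name and email are hypothetical placeholders):

```text
# SPDX-FileCopyrightText: 2025 Jane Researcher <jane@example.org>
# SPDX-License-Identifier: Apache-2.0
```

The `reuse lint` command can then check that every file in the repository carries such a tag and that the corresponding license texts are present in a LICENSES/ directory.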

📖Documentation

How do people know what your project is, how to use it and how to contribute?

  • README / Manual
    • What your project is and what it does
    • Installation instructions; in the case of a web tool, at least what is needed for a minimal local development deployment
    • Contribution guidelines (varying from ‘please open an issue before working on a merge request’ to detailed style guides, review processes and other requirements)
    • A description of the project structure so that the user knows which directories to find things in
  • In a project such as this it is generally particularly useful to split documentation by target audience: users, developers, sysadmins
    • Users: using the website graphically; this might include admin options if you have administrative users of some kind, and admin users may need their own section
    • Developers: API docs, how to contribute, and how to set up a development environment, tooling used
    • Sysadmins: how to deploy an instance of the service, how to configure it, what you might want to do differently from the development environment (for example a more secure config, and considerations that might affect backups), and managing updates/upgrades.
  • Note that almost all of the recommendations for software package documentation also apply here

🔗Making Citable

How should people make reference to your project and credit your work?

  • Adopting a stable, consistent and human-readable naming schema for the URLs on your web service, including the ability to reproduce the state of dynamically generated pages with parameters in the URL, makes referring to specific items much easier for users citing the website according to the conventions for citing websites. (If pertinent, be sure to apply proper authentication and authorisation practices so that sensitive information cannot be accessed just because it is on a page with a predictable URL, and so that the mere existence of a page with a predictable URL does not itself leak information.) Minimise content that cannot be captured by archival tools like the Internet Archive’s Wayback Machine and ArchiveBox; researchers and others can use these to take a snapshot of a website at the time at which they cite it, avoiding the problems of updates to the website altering the content, and of linkrot.
  • Including a CITATION.cff (Citation File Format) file in your project repo is a simple way of making your code citable. The format is human-readable YAML and provides the metadata needed for citation.
  • Zenodo permits you to mint a digital object identifier (DOI) for your code; this is a persistent identifier which can be used to refer to it. You can tie the minting of versioned DOIs to the releases of your project. Using a DOI permits the existing ecosystem of academic software, e.g. Zotero, to use APIs to retrieve citation metadata about your project. Zenodo also hosts a snapshot of your source code, so that if your main code repository ever went down it is still possible to retrieve it there. Citation metadata can be imported from a .cff file or a .zenodo.json file in your repository. This makes it pretty easy to manage updates, as you can just edit these files and have a platform integration, or a step in your CI, push them to Zenodo the next time you do a release.
  • Software Heritage is an expansive archive of open source software operated by a non-profit organisation in collaboration with UNESCO (see: how to reference and archive code in Software Heritage). SWHIDs have the advantage that they are content-based identifiers, meaning that you can check whether the content you get back when you retrieve it is what you expected to get based on its identifier. The Software Heritage API permits you to automate the archiving of your project repository via a webhook from popular git forges like GitHub, GitLab and others. Unlike Zenodo, which only preserves a snapshot of your repository at the time of deposition and at subsequent manual time points and/or tagged releases, Software Heritage archives the whole repository.
  • Further reading on the ethics of CROTs (contributor roles ontologies or taxonomies), and their evolution and adoption, is potentially useful in selecting a CROT suitable for your project
  • Nix and Guix
    • General software repositories may not make specific provision for citation of software packages in the academic fashion. However, some provide what is, for some use cases, a superior form of ‘citation’ of their own sources, i.e. a complete ‘software bill of materials’ (SBOM). This is a list of all the code used in another piece of code: its dependencies, and their dependencies recursively, along with all of their versions. Nix can do this, for example, but Guix is perhaps the most comprehensive in its approach. It not only provides all the information necessary for a complete SBOM but can bootstrap the software packages in its repository from source with an extremely minimal fixed set of binaries, an important capability for creating somewhat trustworthy builds. This creates a compute environment which is not only reproducible but verifiable, meaning the source of all of an environment’s dependencies can in theory be scrutinised. It also adopts an approach to commit signing and authorisation of signers that gives it a currently uniquely complete supply-chain security architecture. Packages or ‘derivations’ are ‘pure functions’ in the sense that only their inputs affect their outputs and they have no side effects: package builds are sandboxed to prevent dependencies on any external source not explicitly provided as an input, and inputs are hashed to ensure that they cannot differ from the values expected when they were packaged. This gives these technologies an unrivalled ability to readily demonstrate the reproducibility and provenance of compute environments specified using them.
    • Whilst not yet fully implemented and adopted, these technologies also afford some fascinating opportunities for seamless access to archival versions of software in the future. Due to the similarities in the content-based addressing used by Git, Nix, Guix, IPFS (InterPlanetary File System) and Software Heritage’s IDs, it may be possible to construct an approach to archiving, distributing and caching the sources of packages in a way that would ensure that low-demand archived software sources and high-demand current packages can be distributed transparently through the same mechanism. This would in theory permit the reconstruction of any historically specified compute environment that had been archived, with no changes to the normal workflow other than perhaps a longer build time. This approach also makes the creation of ‘mirrors’ of the archive relatively simple, and requires no client-side changes, as an IPFS resource will be resolved irrespective of the node on which it is stored. See: NLnet Software Heritage and IPFS, Tweag - software heritage and Nixpkgs, John Ericson - Nix x IPFS Gets a New Friend: SWH (SoN2022 - public lecture series)
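A minimal CITATION.cff might look like the following (all names, identifiers, dates, and URLs are placeholder values for illustration):

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "Example Web Service"
authors:
  - family-names: "Researcher"
    given-names: "Jane"
    orcid: "https://orcid.org/0000-0000-0000-0000"
version: "1.2.0"
date-released: "2025-01-15"
repository-code: "https://example.org/example-group/example-service"
license: AGPL-3.0-or-later
```

Validators exist for the format, and the integrations mentioned above (Zenodo, git forges) can read this file directly.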

✅Testing

How can you test your project so you can be confident it does what you think it does?

  • Services and graphical interfaces may require integration tests, which check that the different components of your system work together as expected, and UI-based testing frameworks, which simulate user interaction in a web browser; you might consider adding these to the sorts of tests you would run for a simpler library.
  • A good test suite allows you to refactor your code without fear of breaking its functionality. Good tests are agnostic to the implementation details of the action that you are testing, so that you can change how you implemented something without needing to change the tests. Automated testing frameworks are especially useful for software that is under ongoing development, as they allow developers to catch the unintended consequences of a change made in one place on some other part of the code that they did not anticipate.
  • Examples of automated testing frameworks include {testthat} for R and unittest for Python. Tools like Codecov or Coveralls, in conjunction with language-specific tools such as covr, can help with code coverage monitoring and insights.
  • Unit tests allow you to spell out in detail what you expect the behaviour of your software to be under a particular circumstance and test if it conforms to these expectations. Automatically running tests like this can be added to CI/CD pipelines on git forges.
  • Test coverage does not necessarily need to be 100%, or even especially high, but code coverage tools can allow you to spot gaps in test coverage over important parts of your codebase and ensure that you cover them, and give you an indication when you have added new, poorly covered code to your codebase that you may want to add tests for.
  • Try to make sure that your test suite runs fast so that you can run it regularly and quickly iterate.
  • Test Driven Development (TDD) is the practice of writing your tests first and then developing the code which conforms to these tests. It works well if you have an extremely well-defined idea of what exactly you want your code to do and not do.
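As a minimal sketch of the implementation-agnostic unit testing described above (the `slugify` function and its expected behaviour are hypothetical examples, not part of any real project):

```python
# A pytest-style unit test: the tests state expected behaviour without
# depending on how slugify is implemented internally.
import re

def slugify(title: str) -> str:
    """Turn a page title into a URL-safe slug (one possible implementation)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

def test_slugify_is_lowercase_and_hyphenated():
    # Only inputs and outputs are constrained, not the implementation
    assert slugify("My Dataset, v2!") == "my-dataset-v2"

def test_slugify_handles_empty_input():
    assert slugify("") == ""
```

Because the tests only constrain inputs and outputs, `slugify` could be rewritten entirely and the same tests would still verify that it behaves as intended.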

🤖 Automation

What tasks can you automate to increase consistency and reduce manual work?
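As one example of such automation, a CI pipeline on a git forge can run the test suite on every push. The sketch below uses GitHub Actions syntax; the file path, Python version, and commands are assumptions about a typical Python project, not a prescription:

```yaml
# .github/workflows/test.yml (hypothetical): run the test suite on every push
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt  # install project dependencies
      - run: pytest                           # run the tests
```

Similar pipelines can automate linting, building documentation, and publishing releases.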

👥Peer review / Code Review

  • If you are building a database of some kind then you might want the processes by which you process, collect or curate the data which go into this database to be subject to an academic style review, and papers about the creation of such resources are not uncommon.
  • Seeking an external technical review may be trickier for your core code, but a review of how easy your system is to deploy is perhaps more accessible from the community of amateur self-hosters, who may be quite willing to try deploying your tool in many and varied homelabs if it offers them something and/or you ask nicely and in the right places.

📦Distribution

Distribution for a web based service covers both hosting the service and distributing the software to sysadmins who may want to run their own instance of the service.

  • Your general audience is users of your web service, and there’s a smaller but important audience of sysadmins and developers who may need to run your server software on their own systems, not just use it. So ‘distribution’ splits into two slightly different problems.
    • Operating your website, things like:
      • making sure that your TLS certificates stay up to date and you have enough compute resources for the service to run well for users.
      • Having a sensible URL, potentially registering any look-alike URLs that malicious actors might try to typosquat
      • Distributing your service to developers who may want to build tools on top of it or query it in an automated fashion via an API. The API should be well documented and conform to open standards.
      • Take some simple measures to ensure the reliability of your site under elevated load, such as: using a reverse proxy; enabling content caching so that your proxy can serve requests for the same content without hitting your application server(s) again; limiting concurrent connections to the maximum number of sessions your server can handle at once, so that if traffic spikes the service gets slower but doesn’t completely fall over; and load balancing across multiple application servers.
      • Consider DDoS protection for your site if your traffic grows over a certain threshold.
      • Be wary of ‘denial of wallet’ attacks when hosting on automatically (horizontally) scaling platforms: set limits to prevent malicious parties from spamming your site in such a fashion as to cause you to incur massive hosting bills.
    • Distributing your server software to sysadmins, devops people, and potentially general IT staff, developers, and amateur self-hosters.
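The reliability measures above (reverse proxy, caching, connection limiting) can be sketched in a reverse-proxy configuration. This nginx-style fragment is illustrative only; all names, paths, and limits are assumptions to be tuned to your own service, and TLS termination is omitted for brevity:

```nginx
# Illustrative reverse-proxy sketch: cache repeated responses and cap
# concurrent connections so that load spikes degrade gracefully.
proxy_cache_path /var/cache/nginx keys_zone=appcache:10m max_size=1g;
limit_conn_zone $binary_remote_addr zone=perip:10m;

server {
    listen 80;                            # TLS termination omitted for brevity
    server_name example.org;

    location / {
        limit_conn perip 20;              # max concurrent connections per client
        proxy_cache appcache;             # serve repeat requests from the cache
        proxy_cache_valid 200 5m;         # cache successful responses for 5 minutes
        proxy_pass http://127.0.0.1:8000; # the application server behind the proxy
    }
}
```

A fragment like this sits in front of the application server(s); adding an `upstream` block listing several servers extends it to load balancing.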

💽Environment Management / Portability

  • Depending on the infrastructure that you choose to deploy on you might use a different management tool, but it is best if you do use such a tool as part of your development and deployment: done right, this provides an easy ‘run a couple of commands’ development environment setup for anyone picking up the project, be that a future maintainer, someone wanting to play with a local test deployment, or someone wanting to contribute to the project. Examples of such tools include: Ansible, Terraform, Docker / Docker Compose, Nix, Helm charts, or a combination of some of these that fits your needs and experience.

🌱 Energy Efficiency


Efficiency in a program deployed as a web service can be as much or more about how it is configured than about the efficiency of the underlying code; the larger the number of users a service has, the larger the overall impact of small efficiency gains. In many research contexts the number of users is small, though in some cases their computational demands may be high, and it may be uneconomical to spend the time optimising as heavily as one might in applications operating at larger scale. If possible, profile the resource usage of your service in use and optimise the low-hanging fruit.

Appropriate caching, management of sessions on your web server(s), load balancing between multiple nodes (when you have them), and so on can have a huge performance and efficiency impact. An underpowered web server can, for example, often serve many more people if the number of concurrent sessions is limited to the number that can comfortably be handled with that box’s resources, and it is allowed to handle many requests in rapid sequence rather than becoming overwhelmed by an excessive number of concurrent sessions. Consulting a professional systems administrator and/or devops professional (for larger deployments on modern cloud stacks) about how to optimise your service’s deployment is likely a good idea if you are inexperienced in this domain.

The same considerations that apply to other software packages also apply to web services, so refer to that section for additional suggestions.

Get the most out of the resources that you have provisioned for your service; over- or under-provisioning can lead to inefficiencies, so aim to match demand. Automatically scaling services out can still carry considerable technical overhead, as well as computational overhead in monitoring and responding to load, and is unlikely to be worth the trouble in small deployments.

It may be valuable to share application specific optimisation tips for deployments of your service on community fora and/or as case studies in your documentation.

Academic user-bases are often scattered around the world. If you have a particular concentration of users in one location it may make sense to locate your physical infrastructure near them to minimise latency; however, if you have a global user-base anyway, you might consider siting server infrastructure in a location where the electricity supply has the lowest carbon intensity.

⚖ Governance, Conduct, & Continuity

How can you be excellent to each other, make good decisions well, and continue to do so?

  • If you are the Benevolent Dictator For Life (BDFL) of your project and the Code of Conduct (CoC) is “Don’t be a Dick”, that’s fine; for many individual hobby projects this is a functional reality. Becoming a BDFL tends to be the default unless you take steps to avoid it and cultivate community governance as your project begins to grow; failing to do this and being stuck in charge can become quite the burden in successful projects. It is also easy for a power vacuum to form at the top of successful projects if they lack either community governance or a singular motivated leader. Be wary of adopting policies that you lack the resources, time, interest, skill, or inclination to actively enforce, mediate, and moderate disputes concerning. It is helpful to be clear about what you can and cannot commit to doing in community management; only by communicating this might you be able to find community members to help you with setting and enforcing community norms, if or when your community attains a scale where this becomes relevant. Community management is its own skill set. If you can’t moderate them, avoid creating and/or continuing ungoverned community spaces that can become a liability for you and your project’s reputation. Just as there are off-the-shelf licenses there are off-the-shelf codes of conduct; the Contributor Covenant is perhaps the best known and most widely used, though it may need some customisation to your needs. Adopting such a CoC gives you some guidance to follow if there is bad behaviour in your project’s community, and communicates that you as the project leadership take seriously the responsibility of creating a respectful environment for collaboration. It can also signal that your project is a place where everyone is welcome but expected to treat one another with respect, and that failing to do so will result in penalties, potentially including exclusion from the community.
The Turing Way provides quite a nice example of a CoC developed specifically for their project. You will need to provide contact information for the person(s) responsible for the enforcement of the CoC in the appropriate place, and be able to follow up in the event it is used. Git forges often recognise files named CODE_OF_CONDUCT.md in the root of a project and provide a link to them on the project home page, so this is a good place to document such policies. If you are the BDFL of a small project then interpretation and enforcement of such a CoC tends to fall solely on you; game out some courses of action for what you’d do if faced with some common moderation challenges.
    • Once a project attracts a larger community there is greater scope for disputes and therefore for the need for dispute resolution mechanisms. Free/Libre and Open Source Software development and maintenance can be thought of as a commons so I would refer you to the work of Elinor Ostrom on how commons have been successfully (or unsuccessfully) governed when thinking about what processes to adopt for your project. More recently Nathan Schneider’s Governable Spaces: Democratic Design for Online Life tackles some of these issues as applied to online spaces.
    • This is summarised in the 8 Principles for Managing a Commons
      1. Define clear group boundaries.
      2. Match rules governing use of common goods to local needs and conditions.
      3. Ensure that those affected by the rules can participate in modifying the rules.
      4. Make sure the rule-making rights of community members are respected by outside authorities.
      5. Develop a system, carried out by community members, for monitoring members’ behavior.
      6. Use graduated sanctions for rule violators.
      7. Provide accessible, low-cost means for dispute resolution.
      8. Build responsibility for governing the common resource in nested tiers from the lowest level up to the entire interconnected system.
    • An informal do-ocracy in the fiefdom of a BDFL is often the default state of projects that have not given much deliberate thought to how they want to be governed; whilst this model is not without its strengths, because it is common many projects are subject to some of its failure modes. How are decisions made in your project? Do you need the mechanisms of governance used by community and civil society organisations: by-laws, a committee and/or working groups, general meetings, votes, minutes? A version of these may be necessary to avoid The Tyranny of Structurelessness. How can you map these onto your development infrastructure and make the decisions of your governing bodies enactable and enforceable?
  • Continuity planning: what happens to your project if something happens to you? The code will likely live on due to the distributed nature of git, but what about the issue tracker, the website, etc.? Who else has the highest level of privilege on your project, or a mechanism to attain it? The principle of least privilege dictates that you keep the number of people with this level of access to a minimum, but you may then create a single point of failure. Password managers like Bitwarden have a feature where designated people can be given access to your vault if they request it and you do not deny it within a certain time-frame. This could provide a lower-level admin with a mechanism to escalate their privileges if you are unable to do this for them. However, this delay might be an issue for continuity of operations if administrator action is needed within the waiting period. Game it out, have a plan, write it down, let people know you have a plan.
  • Software Management Plans
  • Does your project take donations? Does it have a trademark? Does it need a legal entity to hold these? Who is on the paperwork and who has signing authority? Who keeps track of expenditures? Tools & Organisations like OpenCollective can help with some of these issues.
  • If your project has potential cybersecurity implications, what procedures do you have in place for people to disclose vulnerabilities in the project so that they can be patched before they are made public? What systems do you have in place to disclose a vulnerability once it has been patched and to ensure that users know that they need to update?
  • Whole project data longevity - what plans do you have in place to backup and archive materials pertaining to your project that are not under source control?
  • User support
    • What support can users expect, or not expect?
    • Where can they ask for it?
    • Is there somewhere where users can provide support to other members of the user community, such as a forum?
    • Can they pay for more support?

Research Software Sharing, Publication, & Distribution Checklists by Richard J. Acton is licensed under CC BY 4.0