FAIR principles in High Energy Astronomy

Hi folks, please post any questions, thoughts, and notes about FAIR (Findability, Accessibility, Interoperability, and Reuse) principles in high energy astronomy software here.

I recently received a list of resources from Jenny Novacescu, a staff librarian at STScI, around citable code practices. I would argue that citations increase the Findable and Accessible sides of FAIR.

A one-hour recorded talk:
Bouquin, Daina (2021, February 9). How to Lose a Legacy: Software Citation in Astronomy. Presented at the Space Telescope Science Institute Engineering and Technology Colloquium.

A paper on software citation principles:
Smith AM, Katz DS, Niemeyer KE, FORCE11 Software Citation Working Group. (2016) Software Citation Principles. PeerJ Computer Science 2:e86.
DOI: 10.7717/peerj-cs.86

AAS Asclepias Software Citation Project: https://asclepias.aas.org/
Zenodo's announcement of Asclepias (on the Zenodo blog)

Copyright Guide for Scientific Software, from the Software Preservation Network (SPN)

Jotting some notes here from a Slack discussion

@jdeplaa added that JOSS (the Journal of Open Source Software) has helped a lot by providing software citations and refereeing of open-source code.

@jdeplaa asked:

If I want to publish a paper as a researcher and I am asked to provide a reproduction package with it, how do I know whether my package is good enough? What do I need to include? How much documentation should I provide? Is there something general to say about that? Suppose I wrote script W and used packages X, Y and Z to get my result. How do I include that properly? Etc.

He shared with us his simple template for open data: Simple open data template for researchers

It’s exceptionally simple – but I myself rarely see it in practice!
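To make that concrete: here is a minimal sketch (my own, not part of the template) of how script W could record the exact versions of the packages it used, so the reproduction package can state precisely what X, Y and Z were. The package names below are placeholders for whatever your script actually imports.

```python
import json
import sys
from importlib.metadata import PackageNotFoundError, version

# Placeholder names; list the packages your script actually imports.
PACKAGES = ["numpy", "scipy", "astropy"]

manifest = {"python": sys.version, "packages": {}}
for pkg in PACKAGES:
    try:
        manifest["packages"][pkg] = version(pkg)
    except PackageNotFoundError:
        manifest["packages"][pkg] = "not installed"

# Ship this file next to the results in the reproduction package.
with open("environment.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```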

@jdeplaa also shared information about options for reviewing packages: light refereeing, or creating an army of volunteers to run the software in your package.

@dburke shared some resources about reproducibility with bit-wise identical software builds

https://www.nag.com/blog/bitwise-reproducibility-nag-libraries

https://www.software.ac.uk/blog/2017-02-20-software-reproducibility-possible-and-practical

Thanks for sharing the conversation, @eblur! Regarding bitwise reproducibility and the practical issues with software reproducibility, the question is whether this is an important problem. The answer may depend on the field and on the level of accuracy needed.

For pulsar timing, for example, very high accuracy is needed and numerical noise could be an issue. However, for X-ray spectroscopy, the uncertainties in the calibration and models are of the order of 5-10% anyway, and rounding issues in the nth digit do not lead to a significantly different result.

My feeling is that numerical accuracy at that level will only matter in a limited number of cases. The main lesson appears to be that we should not see computers as exact machines, but rather as machines that approximate a solution. We should always check that the accuracy offered by a computer is sufficient for the application at hand.
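As a quick illustration of the scale involved: summing the same numbers in a different order typically changes the result only at the round-off level, many orders of magnitude below a 5-10% calibration uncertainty. A minimal Python sketch:

```python
import random

random.seed(42)
values = [random.uniform(0.0, 1.0) for _ in range(100_000)]

forward = sum(values)
backward = sum(reversed(values))

# The two sums agree to ~16 significant digits; the relative difference
# is of the order of machine epsilon (~1e-16), not of the order of 5e-2.
rel_diff = abs(forward - backward) / abs(forward)
print(f"forward  = {forward!r}")
print(f"backward = {backward!r}")
print(f"relative difference = {rel_diff:.2e}")
```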

Regarding software reproducibility, it is indeed often challenging to run software that is more than 5-10 years old. The question is how big a problem that really is in the context of open data in science.

The prime goal of open data is that scientific results can be verified by others. Being able to run the software is certainly helpful, but in many cases not strictly necessary. To verify a result, it is most important to know which data the authors used, and which corrections, parameters and analysis methods they applied. These details are in the source code and will remain readable for a much longer time.

In addition, for important scientific results that need verifying, verification usually happens within that 5-10 year time scale. Therefore, I would not worry too much about being able to run some code 10 years from now. I think it is much more important that your code is well commented and documented for future reference.

I agree that, in theory, reproducibility is rooted in well-documented datasets and code-bases (so that the algorithms are reproducible even if the software cannot be run).

However, not all of the responsibility rests with the user or author of a scientific result. I think this is an important task for institutions, because many researchers (necessarily) have to treat some code-bases as black boxes. So their documentation might read like this: “I ran XX tool in YY version 2.0”. If the results change in YY version 3.0, the researcher can’t necessarily say what is responsible for the change. So one aspect of reproducibility inherently rests on code-bases like SPEX, XSPEC, ISIS and so on being well-documented.*

*And open source? IMHO, being open source does not by itself meet the criterion of being reproducible or FAIR. It’s easy to put unreadable code online. (I do it all the time!)
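To illustrate the kind of provenance note I mean: a small sketch that records, at analysis time, which version of a black-box tool was actually run. The tool name and its --version flag are hypothetical placeholders; use whatever version query your tool actually supports.

```python
import datetime
import subprocess

TOOL = "ytool"  # placeholder for the black-box tool you call

# Ask the tool to report its own version (the flag is a placeholder).
result = subprocess.run(
    [TOOL, "--version"], capture_output=True, text=True, check=False
)

# Append to a provenance log that lives alongside the analysis outputs.
with open("provenance.log", "a") as fh:
    fh.write(
        f"{datetime.datetime.now().isoformat()} ran {TOOL}, "
        f"reported version: {result.stdout.strip()}\n"
    )
```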

Yes, I agree there is an important responsibility for software package providers here. In my opinion, we, as package providers, ideally need to provide well-documented and readable open source code. Not every package is there yet, but this should be the aim in the longer term.

It is true that most users will use the software as a black box anyway, but making it possible for someone else to verify exactly what happens in the code is good scientific practice.

For the software package providers, readable, well-documented code is also an advantage. If you hire a new developer or scientist for your team, this person will need less time to get familiar with the code. And if you develop open source as well, you can even get contributions from the community to improve your package!

XMM is now (as one option) publishing their SAS software as a Docker image. The idea here is that you freeze not just your software version but all possible dependencies as well, which should give binary reproducibility for many years to come. I can’t find the presentation where they announced that now, but the stated goal is to have a SAS version that will be functional decades after XMM end-of-life, and they hope that the Docker format will outlast e.g. the Ubuntu LTS versions currently available. Maybe we should develop all science projects in Docker containers and publish the container with the paper?
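As a sketch of what “publish the container with the paper” could look like from the researcher’s side, here is how a pinned analysis container might be run via the Docker SDK for Python. The image reference, digest, command and paths below are all placeholders.

```python
import docker  # the Docker SDK for Python (pip install docker)

client = docker.from_env()

# Pinning by digest (placeholder shown) always resolves to the same bytes,
# unlike a mutable tag such as ":latest".
IMAGE = "example.org/myproject/analysis@sha256:<digest>"

# Run the frozen analysis against a read-only data directory (placeholder paths).
output = client.containers.run(
    IMAGE,
    command="python run_analysis.py",
    volumes={"/data/obs123": {"bind": "/data", "mode": "ro"}},
    remove=True,
)
print(output.decode())
```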

There are alternatives to Docker; for instance, you could use the Nix language - e.g. see https://www.tweag.io/blog/2022-05-26-probabilistic-programming-nix/ - but if you think getting astronomers to write a Dockerfile is going to be hard, then getting them to do something like this would be hard^n…

Since Docker is a commercial system, people have doubts about whether it will remain freely available in the longer term. Given the serious risk of vendor lock-in, we, as scientists, should not depend on it.

I like Docker a lot and use it often, but it is probably not the solution that we should adopt to keep software running for the coming 20 years…

Docker does have some closed-source parts, including their “Desktop” development environment, but the core parts remain open-source. If those ever go the closed-source route, I dare say people would coalesce around an open-source fork, as seems to happen any time a popular open-source package tries to lock people out.

As always, it is important in this work to be clear about what licenses are needed to reproduce a particular bit of research. In many cases, thankfully, it can all be open source, and Docker can be a help. There are also many other approaches for constructing reproducible and long-lived configurations which are easy to deploy, along the lines of what XMM or Tweag seem to have done.

Thanks! I was not aware that the core parts of Docker were open source. That helps. We also offer our software in a Docker image to make it easier to run on other platforms. It is definitely helpful.