Hi everybody, I have an idea to create a few directory structure templates to help researchers organise their research project from the beginning. Would you be interested in such a template? Or would you like to help with the design?
In the context of the move to open data, scientists are expected to provide a reproduction package that contains the relevant data and software to reproduce the results in their paper.
In practice, scientists tend to collect the data for the package at the last moment, and if the data is spread over multiple directories on different disks, this can be a large task. I also sometimes see students struggle to choose an appropriate directory structure for their project.
So, why don’t we start with a useful directory structure at the beginning of a project? The cookiecutter project provides a useful infrastructure to start a project by downloading a template from a GitHub repository. This gives a nice starting point, which can be adapted to the researcher’s needs.
A good example is the data science cookiecutter template. Would it be a good idea to adapt this template for use in High-Energy Astrophysics? If yes:
- What would you change in the data science template?
- Do we need different versions? The data science template has Python in mind.
- Do we need specific directory structures for other programming languages or software packages?
- Can you think of useful things to add to the template?
Since cookiecutter templates live on GitHub, forking a template to make multiple variations of it is built into the system.
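For reference, starting a project from a template is a single command. This is only a sketch: it assumes the cookiecutter package is installed via pip, and uses the public GitHub URL where I believe the data science template lives.

```shell
# One-time setup: cookiecutter is a Python package.
pip install cookiecutter

# Generate a new project from the template; cookiecutter clones the
# repository, asks a few questions (project name, author, ...) and
# writes the directory tree into the current directory.
cookiecutter https://github.com/drivendata/cookiecutter-data-science
```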
Please, let me know what you think below!
In my experience, projects fall into at least a few groups, distinguished mostly by where the data comes from, how big it is, and how I write the paper.
- If I only have e.g. a few Chandra datasets (which are relatively small), then I can have it all essentially in a single Jupyter notebook with download commands at the top. In some sense, that does not need a template, since everything can always be recreated from that single file.
- Data with significant manual work, e.g. extended structures, sources, or lines (in the optical) that are manually marked/identified: here I need to keep the intermediate data to reproduce the results.
- ALMA: the data hardly fits on my disk, and git is not the right tool to manage it.
- Overleaf or not? If I write on Overleaf, that’s where table data and figures live, but probably not the data or code.
- Distributed writing, where every collaborator contributes a bit. Someone has an optical spectrum and runs it through their own tools (say, IDL plus manual work), someone else handles the XMM data, someone else the Chandra data (you can’t run CIAO and SAS in the same shell, and they might require incompatible versions of Perl or Python). From the software point of view, this is really several projects that all contribute some text.
Together, that means I can’t even find a template that works for myself in most cases (only in some), so I don’t expect to find one that works for other people. Instead, we need to document and teach the underlying principles and point to existing templates that might be a good match for some projects, such as the data science cookiecutter linked above or the show-your-work template.
Thanks for sharing your thoughts! Indeed, there probably isn’t a template that works for every project, but a good template may be a starting point for many. You can always add/remove directories to adapt it to your specific needs.
I have some comments about your examples where there may be more room for templates in your workflow than you think.
- There are opinions around saying that Jupyter notebooks should only be used for exploration and communication (education), and not for reproducibility. Arguments against notebooks are that cells need to be run in the right order to produce the same results, and that notebook files are hard to put under version control because of their internal (JSON) structure. Writing your (final) analysis in a Python script would be preferred.
- The data science cookiecutter template has multiple sub-directories for the various stages of the data analysis, including intermediate results. Manual steps in an analysis are hard to reproduce exactly anyway, but you can at least save the (intermediate) results that you got.
- The data science cookiecutter excludes the (large) data directories from the git repository by default, while you can still keep documentation and code elsewhere in the project directory structure under version control.
- Overleaf also has git version control built in. You can have your Overleaf project as a subdirectory locally in your big project directory if you want, managed by git.
- You are right that distributed writing poses bigger challenges, because everyone tends to work in their own environment. Agreeing on a common directory structure for a project may help, especially if you need to create a reproduction package later. I have a colleague who does distributed writing of papers through a GitHub repository, so there are solutions to make that work. For large files, there is also git lfs, for example.
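For concreteness, the data-exclusion and git lfs points both boil down to a few lines in the repository's control files. The patterns below are illustrative, not the template's exact contents:

```
# .gitignore — keep the large data directories out of git;
# code and documentation elsewhere in the tree stay tracked.
data/raw/
data/interim/
data/processed/
```

```
# .gitattributes — the line that `git lfs track "*.fits"` writes,
# telling git to store matching files through git lfs.
*.fits filter=lfs diff=lfs merge=lfs -text
```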
Teaching and documenting good practices for managing the data and software side of a project is a great idea in any case. I think a couple of example templates that are closer to real-life research projects in our field could help.
Sorry for the long post, but I hope it is helpful!
I disagree on Jupyter notebooks. In this context, they would be run automatically, top-to-bottom and version controlled without outputs, so apart from their json formatting that’s not very different from a python script alone.
It does encourage keeping notes (“We do this extra step Y because of problem X, but we don’t see the same thing further down”) and text close to where the plots are made (“The green points show … - wait, I only see blue and orange points. Must have forgotten to update the LaTeX when I changed the plot script”).
However, I don’t disagree enough to try to stop you from advocating plain python scripts. Whatever the format, the ideas are the same: Make it reproducible, script as much as you can etc.
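The "version controlled without outputs" part can be sketched with the standard library alone. This is roughly what tools like nbstripout automate; the function name is mine:

```python
import json


def strip_outputs(nb_json: str) -> str:
    """Drop outputs and execution counts from a notebook's JSON, so the
    file under version control only changes when the code or text does."""
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)
```

In practice you would register this as a git filter (which is what nbstripout does), so it runs automatically on commit rather than by hand.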
Personally, I often mirror my Overleaf texts to GitHub, but at least the last time I checked, git access to Overleaf was only available on paid plans, and I often work with collaborators without institutional accounts. In that case, keeping the Overleaf writing and the other work in sync takes some effort.
I guess it depends on how simple or complex your template wants to be and who the target audience is. When I read the initial post, I thought of a simple template with few options for simple projects that helps people new to git etc. I would not necessarily use and explain e.g. git-lfs in the first step (on my office desktop, for example, the system version of git is too old for git-lfs; sure, I can work around that, but at some point I’m just going to quit and do what I always did instead of jumping through a lot of hoops to follow a new template that creates more work for me).
Of course, you can have a more complex template with a different target audience.
The opinions about Jupyter notebooks originate partly from the people behind the data science cookiecutter, who make these points in their documentation. I think they have a point if you need more detailed programming in your script. Notebooks are fine for explaining analysis steps at a high level, but things can get messy if you start defining all kinds of functions and classes in there. In such cases, it is probably better to put the detailed implementations in a Python module and import that in the notebook. But this is of course a matter of opinion.
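For example, a detail-heavy helper can live in a module while the notebook cell stays one readable line. The file and function names here are hypothetical:

```python
# --- src/analysis.py: the detailed implementation lives here ---
def rebin(counts, factor):
    """Sum consecutive groups of `factor` bins (the last group may be short)."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]

# --- in the notebook, the cell then reads simply ---
# from src.analysis import rebin
# binned = rebin(raw_counts, factor=4)
```

The notebook keeps the narrative and the plots; the module keeps the code that deserves tests and version-control diffs of its own.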
Regarding the templates, I would say the target audience would be (PhD) students who have not yet developed a file organisation structure of their own. And again, a template would be a starting point and nothing is really final. You can change things in the directory structure or ignore certain features as you see fit.
Researchers in a later career stage have developed their own directory structures and workflows, which are probably much harder to change. However, they may be inspired by templates and/or the open data discussion in general to make their workflows more reproducible. For many tools and tricks, this means investing some time in the short term to save more time in the long term. Since it is very human to prefer quick short-term wins over long-term benefits, this will be a challenge in the coming years.