Choosing a future data container for timeseries data

dstansby · January 10, 2023, 7:33pm

Hi all - I have been funded by a NumFOCUS small development grant to investigate what the most appropriate object is for sunpy to store timeseries data in. This comes with the context of:

sunpy’s current choice, pandas.DataFrame, having a number of drawbacks
There not being a one-size-fits-all data format used in solar physics for timeseries data (compared to imaging data, where FITS is a relatively common standard)
Historically less development work having been done on sunpy’s support for timeseries data (compared to imaging data)

The first steps in my project are to identify user requirements, and to identify possible options for storing our data. Later in the project I will bring these together, evaluating each option against the requirements and coming up with a recommendation for us to discuss as a community.

Based on minimal engagement so far, my findings are below. Please do read, and if you have any additional requirements for timeseries data, leave a comment below in this thread. I have also tried to list options for the container - again, please read and leave comments in this thread!

User requirements for a timeseries data container in `sunpy`

| Requirement | Notes |
| --- | --- |
| Store data that is a function of time | The time column should be treated as the index (or coordinates) of the data, and be stored as a time-like type. |
| Handle different time scales | Data can have times defined in a variety of time scales (e.g. UTC, TAI). |
| Store multi-dimensional data | Although time is a common index to timeseries data, it isn’t always the only one. As an example, velocity distribution functions measured in the solar wind are 4D datasets, with data as a function of time and three dimensions in velocity space. |
| Handle time scales with leap seconds | Some time scales can contain timestamps that occur within a leap second. |
| Store and use physical units with the data and any non-time indices | |
| Store data in a format that can be used with scientific Python libraries | |
| Support for storing out-of-memory datasets | |
| Store metadata alongside actual data | |
| Have a way to store an observer coordinate alongside the time index | |
| Have an easy way to do common data manipulation tasks | e.g. interpolating, resampling, rebinning |
| Have a way to combine multiple timeseries objects, and keep track of metadata | |
| Ability to convert to other common time series objects (e.g. `pandas.DataFrame`) | |
| Functionality for loading and saving out to common file formats | |
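
To make the leap-second requirement concrete, here is a minimal sketch using only the Python standard library. It shows that `datetime.datetime` rejects a timestamp that falls inside the leap second at the end of 2016, so a container built purely on that type could not hold such a sample. The helper name `fits_in_datetime` is made up for this illustration.

```python
from datetime import datetime


def fits_in_datetime(year, month, day, hour, minute, second):
    """Return True if the timestamp can be represented by datetime.datetime.

    Hypothetical helper, used only for this illustration.
    """
    try:
        datetime(year, month, day, hour, minute, second)
        return True
    except ValueError:
        # datetime only accepts seconds in the range 0..59
        return False


# The last ordinary second of 2016 is representable ...
ordinary_ok = fits_in_datetime(2016, 12, 31, 23, 59, 59)
# ... but a timestamp inside the leap second that followed it is not.
leap_ok = fits_in_datetime(2016, 12, 31, 23, 59, 60)

print(ordinary_ok, leap_ok)
```

Any candidate container therefore needs either a time type that understands leap seconds, or a documented convention for what happens to such samples.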

Options for a timeseries data container

astropy.timeseries.TimeSeries
pandas.DataFrame
xarray.DataArray (or xarray.Dataset)
numpy.ndarray
ndcube

eelcodoornbos · January 26, 2023, 10:30am

That’s a good list, but I feel it somewhat conflates storage formats and in-memory formats/classes. I think it would be good to add a list of common storage formats to support (or choose not to support). Many frequently used formats are already supported via xarray, pandas, spacepy and the like, such as NetCDF, NASA CDF, CSV, ASCII tabular, etc. There are also modern formats like Zarr and Parquet to consider.

What would be really great is for sunpy (and related) packages to support loading (and maybe even serving) timeseries data, including metadata, using the HAPI specification (https://hapi-server.org/). I can dream of loading in a series from a server, tinkering with it using sunpy and other tools in a Python notebook and quickly launching an ad-hoc local HAPI server to view results in an interactive web/JavaScript-based viewer.
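
As a rough sketch of what the loading side involves: a HAPI data request is a URL with a handful of query parameters. The function below only builds that URL and does no network access. The server address, dataset id and parameter name are placeholders, `hapi_data_url` is a made-up helper, and the query keys are assumed to follow the HAPI 3.x naming (`dataset`, `start`, `stop`); 2.x servers use `id`, `time.min` and `time.max` instead, so check the spec version of the server in question.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def hapi_data_url(server, dataset, start, stop, parameters=None, fmt="csv"):
    """Build a HAPI-style data request URL (hypothetical helper, no network access).

    Query keys assume HAPI 3.x naming; this is an assumption, not a
    statement about any particular server.
    """
    query = {"dataset": dataset, "start": start, "stop": stop, "format": fmt}
    if parameters:
        # Several parameters are requested as one comma-separated value
        query["parameters"] = ",".join(parameters)
    return f"{server.rstrip('/')}/data?{urlencode(query)}"


url = hapi_data_url(
    "https://example.org/hapi",   # placeholder server
    "example_dataset",            # placeholder dataset id
    "2023-01-01T00:00:00Z",
    "2023-01-02T00:00:00Z",
    parameters=["Bx"],            # placeholder parameter name
)
print(url)
```

The response still has to be parsed into whichever container is chosen, which is where the requirements list above comes back in.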

The HAPI spec and discussions are also a good inspiration for things to consider, such as whether or not to explicitly support time series of vector data, spectra, image sequences (coronagraph movies, …), 4D models (like ENLIL, …), etc. These are further examples of multi-dimensional data with a time component, which is already in the list. But I think it is important to consider that while time is usually a continuous dimension, the further dimensions could be either continuous (positions) or discrete (vector dimensions, spectral bins), and different toolsets might have different levels of support for these.
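
To illustrate the continuous versus discrete distinction, here is a small sketch using xarray, one of the options already on the list. It stores a made-up vector timeseries with a continuous `time` dimension and a discrete `component` dimension. The values and the dimension and variable names are invented for the example, and the `units` attribute is plain metadata that xarray does not enforce.

```python
import numpy as np
import xarray as xr

# Continuous dimension: sample times
times = np.array(
    ["2023-01-01T00:00", "2023-01-01T00:01", "2023-01-01T00:02"],
    dtype="datetime64[ns]",
)

# Made-up vector data: 3 time samples x 3 components
b = xr.DataArray(
    np.arange(9.0).reshape(3, 3),
    dims=("time", "component"),
    coords={"time": times, "component": ["x", "y", "z"]},
    attrs={"units": "nT"},  # plain metadata, not unit-aware arithmetic
)

# Discrete dimension: select by exact label
bx = b.sel(component="x")

# Continuous dimension: a nearest-neighbour lookup makes sense here
nearest = b.sel(time=np.datetime64("2023-01-01T00:01:20"), method="nearest")

print(bx.values)
print(nearest.values)
```

Exact-label selection is the natural operation on the discrete axis, while nearest-neighbour lookup (or interpolation) only makes sense on the continuous one; how well each candidate container expresses that difference is one way to compare them.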

dstansby · February 1, 2023, 4:45pm

Thanks for the feedback!

Do you mean in the list of requirements, or the list of options?

Thanks for bringing up HAPI - it was slightly on my radar, but I’ve had a proper look at it now, and it was helpful in trying to work out what to support.

eelcodoornbos · February 2, 2023, 10:13am

The combination of both. The examples are mostly implementations of numpy arrays, often with additional features (metadata, methods), and sometimes a few restrictions (e.g. Pandas’ limited support of additional dimensions compared to pure numpy or xarray). I think the choice to support certain file formats (and perhaps metadata standards) will lead to the best choice of (a) Python package(s) to help store and manipulate it.

Timescale manipulations and conversions are a bit tricky. As far as I know they are currently only supported through astropy. It’s possible to try to store the time scale in metadata, and then leverage astropy to do any conversions and leap-second-related manipulations if needed, even if the data is stored using timescale-unaware packages. That’s how I do this currently, but it’s not as pretty as I’d like.
