Specifying a schema in terms of a Protocol #31

rsokl · 2022-04-06T14:51:35Z

Hello! Thanks for making xarray-schema!

It would be great to be able to write an xarray schema in terms of a typing.Protocol. This would enable the schema to be used for both runtime and static validation. Let me describe my motivation here (it might already be obvious).

One challenge in designing a code base that passes around xarray arrays and datasets satisfying particular schemas is documenting which flavors of datasets a given function accepts. Furthermore, for complicated schemas especially, it is useful for static tools (type-checkers and other IDE tools) to be able to tell a user which attributes do and do not exist on that xarray object.

I have leveraged protocols to tackle these issues. Consider the following protocol, which describes a dataset with the coordinates time and feature_component and the variables features and temperatures:

from typing import Protocol

import xarray as xr

class DataSetA(Protocol):
    @property
    def time(self) -> xr.DataArray:
        """
        Coordinate, shape-(N,), dtype-int
        """
        ...

    @property
    def feature_component(self) -> xr.DataArray:
        """
        Coordinate, shape-(D,), dtype-int
        The index for each component of a feature vector.
        """
        ...

    @property
    def features(self) -> xr.DataArray:
        """
        Data-Variable, shape-(N, D), dtype-float
        The D-dimensional vector for each feature.
        Coordinates:
          * time [N]
          * feature_component [D]
        """
        ...

    @property
    def temperatures(self) -> xr.DataArray:
        """
        Data-Variable, shape-(N,), dtype-float
        Temperature measurements.
        Coordinates:
          * time [N]
        """
        ...

With this, I can write functions like:

def process_dataset(data: DataSetA):
    ...

Not only does this annotation succinctly document to users what flavor of dataset process_dataset expects; static tooling can also auto-complete and statically check uses of data against this protocol within the function. This is really nice to have.

It would be great to be able to write DataSetA so that it serves as a schema as well. In this way, DataSetA would serve as:

Documentation for users
A type that can be understood by static analysis tooling
A schema for runtime validation

Obviously, this would involve substantially more sophisticated return types for the coordinates and data variables, beyond xr.DataArray. Shape and dtype info would need to be specified as well. Perhaps particular forms of Annotated[xr.DataArray, ...] would suffice.
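To make the Annotated idea concrete, here is a minimal sketch of how metadata attached to a protocol's property return types could be read back at runtime. ArraySpec and validate are hypothetical names invented for this illustration; they are not part of xarray-schema or xarray. To keep the sketch dependency-free, Any stands in for xr.DataArray and SimpleNamespace objects stand in for real arrays. The check only reads .dims and .dtype.kind, so the assumption is that a real dataset would expose the same attributes.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Annotated, Any, Protocol, get_type_hints


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical schema metadata carried inside Annotated[...]."""
    dims: tuple  # expected dimension names, in order
    kind: str    # expected NumPy dtype kind: "i" for int, "f" for float


class DataSetA(Protocol):
    # In real code the first argument to Annotated would be xr.DataArray;
    # Any keeps this sketch free of third-party imports.
    @property
    def time(self) -> Annotated[Any, ArraySpec(("time",), "i")]: ...

    @property
    def features(
        self,
    ) -> Annotated[Any, ArraySpec(("time", "feature_component"), "f")]: ...


def validate(obj, protocol) -> list:
    """Return a list of schema violations; the list is empty if obj conforms."""
    errors = []
    for name, member in vars(protocol).items():
        if not isinstance(member, property):
            continue
        # include_extras=True preserves the Annotated metadata.
        hint = get_type_hints(member.fget, include_extras=True).get("return")
        for spec in getattr(hint, "__metadata__", ()):
            if not isinstance(spec, ArraySpec):
                continue
            value = getattr(obj, name, None)
            if value is None:
                errors.append(f"{name}: missing")
            elif tuple(value.dims) != spec.dims:
                errors.append(f"{name}: dims {tuple(value.dims)}, expected {spec.dims}")
            elif value.dtype.kind != spec.kind:
                errors.append(f"{name}: dtype kind {value.dtype.kind!r}, expected {spec.kind!r}")
    return errors


def fake_array(dims, kind):
    """Stand-in exposing only the attributes that validate() reads."""
    return SimpleNamespace(dims=dims, dtype=SimpleNamespace(kind=kind))


good = SimpleNamespace(
    time=fake_array(("time",), "i"),
    features=fake_array(("time", "feature_component"), "f"),
)
bad = SimpleNamespace(time=fake_array(("time",), "f"))  # wrong dtype, no features

print(validate(good, DataSetA))  # → []
print(validate(bad, DataSetA))   # one entry for time, one for features
```

Static tools still see only the first argument of Annotated, so the same class keeps working as a type while the extra metadata drives the runtime check.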

Finally, I have to flag a substantial shortcoming of DataSetA: it doesn't "look" like a proper xarray.Dataset to static analysis tools. E.g. .loc and .sel don't exist on it. So really, there need to be proper protocols that describe xarray.DataArray and xarray.Dataset, which the likes of DataSetA can subclass to remedy this. It isn't clear to me whether xarray itself would ship such protocols, or whether xarray-schema would do so.
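As a rough illustration of that layering, the sketch below defines a hypothetical DatasetLike base protocol listing just two members (sel and loc) and has DataSetA extend it. Neither xarray nor xarray-schema ships such a protocol as far as this thread establishes; the name, the member list, and the loose Any-based signatures are assumptions, and FakeDataset is only a stand-in for a real dataset.

```python
from typing import Any, Protocol, runtime_checkable


class DatasetLike(Protocol):
    """Hypothetical base protocol naming a few members of xarray.Dataset."""

    def sel(self, *args: Any, **kwargs: Any) -> Any: ...

    @property
    def loc(self) -> Any: ...


@runtime_checkable
class DataSetA(DatasetLike, Protocol):
    """Schema-specific members layered on top of the generic ones."""

    @property
    def time(self) -> Any: ...

    @property
    def features(self) -> Any: ...


class FakeDataset:
    """Stand-in object that has every member the protocols name."""

    def __init__(self) -> None:
        self.time = [0, 1, 2]
        self.features = [[0.1], [0.2], [0.3]]

    def sel(self, *args: Any, **kwargs: Any) -> "FakeDataset":
        return self

    @property
    def loc(self) -> "FakeDataset":
        return self


def process_dataset(data: DataSetA) -> Any:
    # A type checker now accepts both data.features and data.sel(...).
    return data.sel(time=0).features


print(isinstance(FakeDataset(), DataSetA))  # → True
print(isinstance(object(), DataSetA))       # → False
```

Note that the isinstance check on a runtime_checkable protocol only tests that the named members are present; it says nothing about shapes or dtypes, so it complements a schema rather than replacing one.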

Thanks for reading this post. I'll be interested to hear your thoughts on this!


rsokl · 2022-04-09T15:10:30Z

I decided to open an issue on xarray to propose that they implement protocols for Dataset and DataArray.

pydata/xarray#6462

jhamman · 2022-09-14T22:54:46Z

Sorry @rsokl for missing your post for so long. I think this is an interesting idea and one worth exploring. @andersy005 has also thought of something similar in the context of pydantic.

rsokl mentioned this issue Apr 10, 2022

Does from_type() handle typing.Protocol properly? HypothesisWorks/hypothesis#3281 (closed)

jhamman added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Sep 14, 2022
