Hello! Thanks for making xarray-schema!
It would be great to be able to write an xarray schema in terms of a typing.Protocol. This would enable the schema to be used for both runtime and static validation. Let me describe my motivation here (it might already be obvious).
One challenge in designing a code base that passes around xarray arrays and datasets satisfying particular schemas is documenting which flavors of datasets a given function accepts. Furthermore, for complicated schemas especially, it is useful for static tools (type-checkers and other IDE tools) to be able to tell a user which attributes do and do not exist on that xarray object.
I have leveraged protocols to tackle these issues. Consider the following protocol, which describes a dataset with the coordinates time and feature_component and the variables features and temperatures:
```python
from typing import Protocol

import xarray as xr


class DataSetA(Protocol):
    @property
    def time(self) -> xr.DataArray:
        """Coordinate, shape-(N,), dtype-int"""
        ...

    @property
    def feature_component(self) -> xr.DataArray:
        """Coordinate, shape-(D,), dtype-int

        The index for each component of a feature vector.
        """
        ...

    @property
    def features(self) -> xr.DataArray:
        """Data-Variable, shape-(N, D), dtype-float

        The D-dimensional vector for each feature.

        Coordinates:
            * time [N]
            * feature_component [D]
        """
        ...

    @property
    def temperatures(self) -> xr.DataArray:
        """Data-Variable, shape-(N,), dtype-float

        Temperature measurements.

        Coordinates:
            * time [N]
        """
        ...
```
With this, I can write functions like:
```python
def process_dataset(data: DataSetA):
    ...
```
Not only does this annotation succinctly document which flavor of dataset process_dataset expects; static tooling can also auto-complete and statically check usages of data within the function according to this protocol. This is really nice to have.
It would be great to be able to write DataSetA so that it serves as a schema as well. In this way, DataSetA would serve as:
- Documentation for users
- A type that can be understood by static analysis tooling
- A schema for runtime validation
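To illustrate the gap between the second and third items: a plain protocol can already be checked at runtime with typing.runtime_checkable, but that check only confirms that the named attributes exist, not what they contain. The sketch below assumes a trimmed-down, hypothetical protocol (HasTimeAndFeatures) and plain stand-in classes rather than real xarray objects, so it says nothing about how xarray's own attribute access interacts with such checks.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class HasTimeAndFeatures(Protocol):
    # Hypothetical, trimmed-down variant of DataSetA.
    # Any stands in for xr.DataArray to keep the sketch dependency-free.
    @property
    def time(self) -> Any: ...

    @property
    def features(self) -> Any: ...


class Complete:
    time = [0, 1, 2]
    features = [[0.1], [0.2], [0.3]]


class WrongContents:
    # Attributes exist, but hold nothing array-like.
    time = "not an array"
    features = "also not an array"


class MissingFeatures:
    time = [0, 1, 2]


# isinstance only tests for the presence of each protocol member.
checks = {
    "complete": isinstance(Complete(), HasTimeAndFeatures),
    "wrong_contents": isinstance(WrongContents(), HasTimeAndFeatures),
    "missing_features": isinstance(MissingFeatures(), HasTimeAndFeatures),
}
```

Here WrongContents passes the check just as Complete does, which is why shape and dtype information would have to come from somewhere other than the bare protocol.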
Obviously, this would involve substantially more sophisticated return types for the coordinates and data variables, beyond xr.DataArray. Shape and dtype info would need to be specified as well. Perhaps particular forms of Annotated[xr.DataArray, ...] would suffice.
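As a rough sketch of what the Annotated idea could look like, the example below attaches a hypothetical ArraySpec object to each property's return type and reads it back at runtime with typing.get_type_hints(..., include_extras=True). ArraySpec and validate are names invented for this illustration and are not part of xarray or xarray-schema. Any is used in place of xr.DataArray, and SimpleNamespace objects stand in for arrays so the sketch runs without dependencies; the validator only relies on the .dims and .dtype.kind attributes, which xr.DataArray also provides.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Annotated, Any, Protocol, get_type_hints


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical metadata: expected dimension names and dtype kind."""
    dims: tuple[str, ...]
    dtype_kind: str  # NumPy kind code, e.g. "i" for int, "f" for float


class DataSetA(Protocol):
    @property
    def time(self) -> Annotated[Any, ArraySpec(("time",), "i")]: ...

    @property
    def features(self) -> Annotated[Any, ArraySpec(("time", "feature_component"), "f")]: ...


def validate(obj: Any, protocol: type) -> list[str]:
    """Collect mismatches between obj and the ArraySpec metadata on protocol."""
    errors = []
    for name, member in vars(protocol).items():
        if not isinstance(member, property):
            continue
        # include_extras=True preserves the Annotated metadata.
        hint = get_type_hints(member.fget, include_extras=True)["return"]
        specs = [m for m in getattr(hint, "__metadata__", ()) if isinstance(m, ArraySpec)]
        if not hasattr(obj, name):
            errors.append(f"{name}: missing")
            continue
        arr = getattr(obj, name)
        for spec in specs:
            if tuple(arr.dims) != spec.dims:
                errors.append(f"{name}: expected dims {spec.dims!r}, got {tuple(arr.dims)!r}")
            if arr.dtype.kind != spec.dtype_kind:
                errors.append(f"{name}: expected dtype kind {spec.dtype_kind!r}, got {arr.dtype.kind!r}")
    return errors


def fake_array(dims: tuple[str, ...], kind: str) -> SimpleNamespace:
    # Stand-in exposing only the attributes the validator reads.
    return SimpleNamespace(dims=dims, dtype=SimpleNamespace(kind=kind))


good = SimpleNamespace(
    time=fake_array(("time",), "i"),
    features=fake_array(("time", "feature_component"), "f"),
)
bad = SimpleNamespace(
    time=fake_array(("time",), "f"),  # float where int is expected
    features=fake_array(("time", "feature_component"), "f"),
)

good_errors = validate(good, DataSetA)
bad_errors = validate(bad, DataSetA)
```

The same class then carries the documentation, the static type, and the data a runtime validator needs. Checking concrete shapes (the N and D sizes) would require more than this sketch covers.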
Finally, I have to flag a substantial shortcoming of DataSetA: it doesn't "look" like a proper xarray.Dataset to static analysis tools. For example, .loc and .sel don't exist on it. So really, there need to be proper protocols that describe xarray.DataArray and xarray.Dataset, which the likes of DataSetA can subclass to remedy this. It isn't clear to me whether xarray itself or xarray-schema would ship such protocols.
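A minimal sketch of that layering, assuming a hypothetical base protocol named DatasetLike: it declares only .sel and .loc, with deliberately loose signatures that do not reproduce the real xarray.Dataset API, and a schema protocol then subclasses it. The StandIn class is likewise invented here so the example runs without xarray.

```python
from typing import Any, Protocol


class DatasetLike(Protocol):
    """Hypothetical, heavily simplified description of a Dataset-like object."""

    def sel(self, *args: Any, **kwargs: Any) -> Any: ...

    @property
    def loc(self) -> Any: ...


class DataSetA(DatasetLike, Protocol):
    # Schema-specific members layered on top of the generic interface.
    @property
    def time(self) -> Any: ...


def describe(data: DataSetA) -> str:
    # A type checker now accepts both the schema attribute (.time)
    # and the generic Dataset-style method (.sel) on `data`.
    subset = data.sel(time=0)
    return f"time={data.time!r}, subset={subset!r}"


class StandIn:
    # Not an xarray object; it just echoes the selection it was given.
    time = (0, 1, 2)
    loc = None

    def sel(self, *args: Any, **kwargs: Any) -> Any:
        return kwargs


summary = describe(StandIn())
```

Listing Protocol again in the bases of DataSetA keeps the subclass itself a protocol rather than an ordinary class.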
Thanks for reading this post. I'll be interested to hear your thoughts on this!
Sorry @rsokl for missing your post for so long. I think this is an interesting idea and one worth exploring. @andersy005 has also thought of something similar in the context of pydantic.