Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataTree attribute-like access bug #9928

Open
5 tasks done
pepijnvtol opened this issue Jan 7, 2025 · 9 comments
Open
5 tasks done

DataTree attribute-like access bug #9928

pepijnvtol opened this issue Jan 7, 2025 · 9 comments
Labels
bug topic-DataTree Related to the implementation of a DataTree class

Comments

@pepijnvtol
Copy link

pepijnvtol commented Jan 7, 2025

What happened?

I have a datatree with a dataset in a group. I tried to filter this dataset and store in the datatree. The dataset showed the filtered dataset and the unfiltered dataset, depending on the method to get the group. dt.variable1 or dt['variable1'].

What did you expect to happen?

I expect that both datasets are equal.

Minimal Complete Verifiable Example

Note the difference when `dt.variable1` and `dt['variable1']` are printed. When the datatree is exported the value of `dt['variable1']` is used. 


import numpy as np
import xarray as xr

ds1 = xr.Dataset(
    dict(
        variable1=xr.DataArray(
            np.random.randn(10),
            dims="x",
            coords=dict(x=np.linspace(1, 10, num=10)),
        )
    )
)
dt = xr.DataTree(ds1)
print("Datatree with 1 dataset (variable1)")
print(dt)

# filter on x
dt.variable1 = dt.variable1.sel(x=slice(1, 5))
print("\nPrint sliced dataset variable 1 using 'dt.variable1'")
print(dt.variable1)
print("\nPrint sliced dataset variable 1 using 'dt['variable1']'")
print(dt["variable1"])
print(
    "\nThe whole datatree how it is stored (see variable1, which is not sliced)"
)
print(dt)

Datatree with 1 dataset (variable1)
<xarray.DataTree>
Group: /
    Dimensions:    (x: 10)
    Coordinates:
      * x          (x) float64 80B 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
    Data variables:
        variable1  (x) float64 80B -0.03606 -0.5738 -1.331 ... -0.3654 1.669 0.9354

Print sliced dataset variable 1 using 'dt.variable1'
<xarray.DataArray 'variable1' (x: 5)> Size: 40B
array([-0.03605657, -0.57384656, -1.3312692 ,  1.0832746 , -0.07725834])
Coordinates:
  * x        (x) float64 40B 1.0 2.0 3.0 4.0 5.0

Print sliced dataset variable 1 using 'dt['variable1']'
<xarray.DataArray 'variable1' (x: 10)> Size: 80B
array([-0.03605657, -0.57384656, -1.3312692 ,  1.0832746 , -0.07725834,
       -0.45822326, -1.32865998, -0.36541352,  1.66870598,  0.93536491])
Coordinates:
  * x        (x) float64 80B 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0

The whole datatree how it is stored (see variable1, which is not sliced)
<xarray.DataTree>
Group: /
    Dimensions:    (x: 10)
    Coordinates:
      * x          (x) float64 80B 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
    Data variables:
        variable1  (x) float64 80B -0.03606 -0.5738 -1.331 ... -0.3654 1.669 0.9354


>>> import xarray
>>> xarray.__version__
'2025.1.0'

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.13.1 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 17:02:46) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 11
machine: AMD64
processor: Intel64 Family 6 Model 186 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('English_United States', '1252')
libhdf5: None
libnetcdf: None

xarray: 2025.1.0
pandas: 2.2.3
numpy: 2.2.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

@pepijnvtol pepijnvtol added bug needs triage Issue that has not been reviewed by xarray team member labels Jan 7, 2025
Copy link

welcome bot commented Jan 7, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas TomNicholas added topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Jan 7, 2025
@TomNicholas
Copy link
Member

Thank you for this bug report! That definitely shouldn't be happening.

I suspect that it's going wrong in the .variable1 = ... assignment step. Your results are consistent with the sliced DataArray object just being assigned as a new .variable1 attribute, rather than actually properly engaging with the machinery that is supposed to recognize that the private dictionary mapping names to variables on that node is supposed to be updated.

@pepijnvtol
Copy link
Author

I am not sure if this is the case, but could this indicate a possible memory error? With large datasets, it seems that the dataset is not overwritten, but both versions are stored. In situations where memory is constrained, this can be undesirable. However, I am not certain if this is truly an issue, as I have not encountered memory errors due to this. I will stick to the issue at hand, but this was just a thought.

@pepijnvtol
Copy link
Author

I just tried a similar operation with a Dataset. In the example, it would look something like this:

>>> ds1.variable1 = ds1.variable1.sel(x=slice(1, 5))

This results in an AttributeError:

AttributeError: cannot set attribute 'variable1' on a 'Dataset' object. Use __setitem__ style assignment (e.g., ds['name'] = ...) instead of assigning variables.

I think the same behavior should apply to Datatree objects. The suggested solution (using ds['name'] = ...) is also applicable to the reported bug.

I don't have a full solution yet, but I thought it was interesting to check the Dataset behavior and compare it with the Datatree to better understand what's happening.

@TomNicholas
Copy link
Member

could this indicate a possible memory error?

I don't think so - your example reproduces the error with a very small variable.


interesting to check the Dataset behavior and compare it with the Datatree

AttributeError: cannot set attribute 'variable1'

Yes! That's what supposed to happen here.

in https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataset-contents it says

As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:

That's the behaviour we also want for DataTree.

The bug is somewhere in the implementation of this class

https://github.com/pydata/xarray/blob/main/xarray/core/common.py#L370

@owenlittlejohns I remember we had some issues integrating the AttrAccessMixin to DataTree which caused us to split it out into TreeAttrAccessMixin like this (in #9011). Do you have any idea what the issue might be here?

@keewis
Copy link
Collaborator

keewis commented Jan 8, 2025

I'm not certain, but could this be caused by the presence of __dict__ on DataTree? For Dataset the presence of __slots__ causes __dict__ to be absent, which raises an AttributeError in AttrAccessMixin.__setitem__ (which then gets modified to include more information).

Edit: see #9068

@TomNicholas
Copy link
Member

Oh good catch @keewis ! So the question now is how we can get rid of __dict__ on DataTree?

@keewis
Copy link
Collaborator

keewis commented Jan 8, 2025

Change the mixin from TreeAttrAccessMixin to AttrAccessMixin and see what breaks? For that you'd have to go through all the base classes of DataTree to see whether there's any that absolutely need to have dynamic attributes.

@owenlittlejohns
Copy link
Contributor

owenlittlejohns commented Jan 8, 2025

@owenlittlejohns I remember we had some issues integrating the AttrAccessMixin to DataTree which caused us to split it out into TreeAttrAccessMixin like this (in #9011). Do you have any idea what the issue might be here?

Looks like @keewis beat me to mentioning this issue: #9068.

Just swapping out the TreeAttrAccessMixin for AttrAccessMixin caused problems in the instantiation due to __slots__ and __dict__ both being present. I think your suggestion of removing __dict__ on DataTree is the way to go. If I remember rightly, I think the issues might have also been a bit lower down in either the TreeNode or NamedNode classes.

@TomNicholas TomNicholas changed the title Datatree store dataset DataTree attribute-like access bug Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

No branches or pull requests

4 participants