Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement first-class List type #60629

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Dec 30, 2024

Quick POC for now. There's a lot to do here but hoping to work in pieces. This currently assumes pyarrow is installed.

The blocks / formatting stuff is not super familiar to me so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach. I think the main problem I am having is the Block seems to want to infer the type from the values contained. That works for NumPy, but doesn't work with PyArrow, for example when you have an array of all nulls that is separately paired with a type oflist[string]

@mroeschke
Copy link
Member

Does this assume moving forward with the logical type system PDEP? i.e. List type backed by multiple (in theory) implementations

@WillAyd
Copy link
Member Author

WillAyd commented Dec 30, 2024

PDEP-14 would be nice but I don't think its required here. If we do not revert PDEP-10, then we can assume pyarrow is required and just build off of that. This can fit logically into the extension type system.

We may just want to start referring to that as something else besides "numpy_nulllable," but there is an issue already open for that #59032

@jbrockmendel
Copy link
Member

The blocks / formatting stuff is not super familiar to me so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach. I think the main problem I am having is the Block seems to want to infer the type from the values contained.

Yah I'd really rather avoid the changes this makes in that part of the code. Will comment in-line and see if we can find alternatives.

try:
return self.values.dtype
except AttributeError: # PyArrow fallback
return self.values.type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense to me. self.values should be the EA, and the EA.dtype should be the right thing here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks. I think this is a holdover from an intermediate state and I didn't recognize the requirement here. Reverting this fixes a lot of the other comments you've made here as well - thanks!

if dtype:
klass = get_block_type(dtype)
else:
klass = get_block_type(values.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, values.dtype should be the ListDtype already. I don't see why passing dtype separately is necessary.

@@ -505,7 +505,7 @@ def __init__(
data = data.copy()
else:
data = sanitize_array(data, index, dtype, copy)
data = SingleBlockManager.from_array(data, index, refs=refs)
data = SingleBlockManager.from_array(data, dtype, index, refs=refs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if dtype is your ListDtype, then data.dtype should be ListDtype at this point so the new argument should be unnecessary

@@ -1103,7 +1103,11 @@ def format_array(
List[str]
"""
fmt_klass: type[_GenericArrayFormatter]
if lib.is_np_dtype(values.dtype, "M"):
if hasattr(values, "type") and values.type == "null":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do something more explicit than hasattr checks? i.e. isinstance(dtype, ListDtype) or whatever?


return ListArray(data)
class TestListArray(BaseConstructorsTests): ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we moved away from this pattern to just ExtensionTests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea - I think we can move to that as this gets more production ready. I just wanted to start with something really small while in draft mode

@jbrockmendel
Copy link
Member

the thing ive always assumed would be a PITA with a ListDtype is __setitem__ distinguishing whether df.iloc[x, :3] = [a, b, c] is to behave like

df.iloc[x, 0] = a
df.iloc[x, 1] = b
df.iloc[x, 2] = c

vs

df.iloc[x, 0] = [a, b, c]
df.iloc[x, 1] = [a, b, c]
df.iloc[x, 2] = [a, b, c]

have you given any thought to that?

@WillAyd
Copy link
Member Author

WillAyd commented Dec 31, 2024

For a List data type the first option wouldn't be possible, since those are scalars values. So I think the latter is correct; if you wanted unpacking I think you'd need to provide a list of lists

@WillAyd WillAyd force-pushed the implement-list-type branch from ad6fa08 to e25c0d4 Compare December 31, 2024 19:06
@@ -460,6 +461,8 @@ def treat_as_nested(data) -> bool:
len(data) > 0
and is_list_like(data[0])
and getattr(data[0], "ndim", 1) == 1
# TODO(wayd): hack so pyarrow list elements don't expand
and not isinstance(data[0], pa.ListScalar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think have is list like return False for pyarrow scalar is less hacky?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably true in this particular case, although I'm not sure how it will generalize to all uses of is_list_like. Will do more research

@@ -494,7 +495,7 @@ def __init__(
if not is_list_like(data):
data = [data]
index = default_index(len(data))
elif is_list_like(data):
elif is_list_like(data) and not isinstance(dtype, ListDtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about nested list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this is a tough one to handle. I'm not sure if something like:

pd.Series([1, 2, 3], index=range(3), dtype=pd.ListDtype())

should raise or broadcast. I think the tests currently want it to broadcast, but we could override that expectation for this array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good example. also indexing is going to be a pain point. a lot of checking for nested listlikes happens in indexing.py and im wary of stumbling blocks there. also indexing in general with a ListDtype index. i think the solution is to interpret lists as sequences and only treat them as a scalar if explicitly wrapped in pa.ListScalar (or something equivalent)

Copy link
Member Author

@WillAyd WillAyd Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally started down that path but I think the problem with expecting pa.ListScalar is that when you select from a ListArray, most users probably expect a plain python list back. So to use that same selection as an indexer I think its hard to avoid the built-in type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to use that same selection as an indexer

I think we have to just disallow that. Otherwise we have a long tail of places where we expect labels to be non-listlike and nested objects to mean 2D. e.g. df.set_index(list_column).T.set_index([1, 2, 3]) is ambiguous whether it wants three columns to be set as an index or a single column with label [1, 2, 3]. I'm particularly worried about indexing.py code where we check for nested data getting way more complicated if the behavior has to depend on dtypes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df.set_index(list_column).T.set_index([1, 2, 3]) is ambiguous whether it wants three columns to be set as an index or a single column with label [1, 2, 3]

Do we normally require index elements to be hashable? I think there's an argument to be made that we shouldn't allow this as a index, given Python lists aren't hashable (even though PyArrow lists are)

If we do want to allow this as an index, the example cited is a single column with label [1, 2, 3]

I'm particularly worried about indexing.py code where we check for nested data getting way more complicated if the behavior has to depend on dtypes.

Definitely a valid concern. Do you know what might not be covered in this area by the extension tests?

Copy link
Member

@jbrockmendel jbrockmendel Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we normally require index elements to be hashable?

We don't do any validation at construction-time for an object dtype Index. I expect a bunch of methods will break with non-hashable elements.

I think there's an argument to be made that we shouldn't allow this as a index, given Python lists aren't hashable (even though PyArrow lists are)

Wouldn't that break a lot e.g. obj.groupby(list_dtype_column).method()? I expect there are lots of methods/functions that do a obj.set_index(col).something().reset_index(col) that would become fragile.

The simple solution to all of these problems is to require users to be explicit when they want to treat a list as a scalar.

Definitely a valid concern. Do you know what might not be covered in this area by the extension tests?

I can come up with a few examples if we go down this road, but that part of the code has a lot of paths so it won't be comprehensive. We'll be chasing down corner cases for a long time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's a good point about the groupby stuff. I don't have a strong objection to requiring an explicit scalar argument, though I guess the follow up question is do we want users passing a pyarrow scalar or do we want to create our own scalar type.

Not sure how other libraries like polars handle this, but @MarcoGorelli might have some insight

@@ -428,7 +428,7 @@ def _box_pa_scalar(cls, value, pa_type: pa.DataType | None = None) -> pa.Scalar:
"""
if isinstance(value, pa.Scalar):
pa_scalar = value
elif isna(value):
elif not is_list_like(value) and isna(value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment why this is necessary. e.g. off the top of my head I dont know if pa.ListScalar subclasses pa.Scalar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case the isna(value) check fails when value is a list, since it doesn't return a boolean back.

The "not is_list_like" was a quick way to prevent this branch from throwing an exception, but open to better ways of expressing that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to your question, pa.ListScalar does inherit from pa.Scalar (all Scalars in pyarrow do) but that is not the type that is hitting this branch, since it is caught in the one preceding

This should be three items [B, NA, A] with
A < B and NA missing.
"""
pytest.skip("ListArray does not support sorting")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not? dont python lists have order semantics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pyarrow doesn't directly support sorting lists, and if it did I'm not sure it would use the same semantics as Python (I don't know how other languages would care for that)

I suppose we could implement something ourselves to cast all elements to Python and use the language's list semantics, but that's likely a big change better left to a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -834,6 +834,9 @@ cpdef ndarray[object] ensure_string_array(
if isinstance(val, bytes):
# GH#49658 discussion of desired behavior here
result[i] = val.decode()
elif isinstance(val, np.ndarray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the cython code we use util.is_array for this check

return fmt_values


class _ListFormatter(_GenericArrayFormatter):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt look like this is used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep dead code - thanks!

@@ -1,138 +0,0 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a huge deal but id prefer if we got the tests passing with the implementation still living here, then moved the implementation in a separate PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that's not a problem

Copy link
Member Author

@WillAyd WillAyd Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually thinking through this some more - some of the modules like pandas.core end up importing the ListDtype and ListArray. For that, maybe its better to move this out of tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK no big deal if its not easy

@WillAyd WillAyd force-pushed the implement-list-type branch from b4252c3 to 5a2b113 Compare January 4, 2025 07:09
name="a",
)
actual = ser.list.len()
expected = Series([3, 2, None], dtype=ArrowDtype(pa.int32()), name="a")
expected = Series([3, 2, None], dtype=ArrowDtype(pa.int64()), name="a")
Copy link
Member Author

@WillAyd WillAyd Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we feel about the changes in this test module. I swapped us over to the first class type instead of the ArrowDtype list for all the tests, but we can also parametrize

The tricky thing about the logical type here is that, without other logical types we do get back the implementation detail of a pyarrow type back for things like list.len().

Alternatively we could decide on just returning pd.Int64Dtype() here. That would abstract the implementation details of the ListArray and not lose any data, although the mask of the Int64Array requires more memory than pyarrow. This approach would also be more inline with what PDEP-13 proposes

Copy link
Member Author

@WillAyd WillAyd Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking this over some more it should probably return pd.Int64Dtype() for length. That would match what pd.StringDtype.len returns

@jbrockmendel
Copy link
Member

Is the idea for this to replace ArrowDtype(pa.list_etc()) and the corresponding ArrowEA or for both to exist? Having multiple implementations of things has been a pain point for the StringDtype(s).

@WillAyd
Copy link
Member Author

WillAyd commented Jan 5, 2025

Is the idea for this to replace ArrowDtype(pa.list_etc()) and the corresponding ArrowEA or for both to exist? Having multiple implementations of things has been a pain point for the StringDtype(s).

Yea I'd prefer this to be the user-facing API for working with list data, but having that fully replace the ArrowDtype(pa.list_()) is a PDEP-13 discussion

There's a few places in I/O where this can work its way into the numpy_nullable type system, i.e. when trying to serialize lists to/from SQL databases and parquet

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 8, 2025

@WillAyd Series.str.split() and Series.str.rsplit() can return List values, so that should be used here.

@WillAyd
Copy link
Member Author

WillAyd commented Jan 8, 2025

@Dr-Irv that's a great idea although I'm hesitant to change the API of any existing functions without taking PDEP-13 further

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 8, 2025

@Dr-Irv that's a great idea although I'm hesitant to change the API of any existing functions without taking PDEP-13 further

I don't think it's an API change. It's just a question of the dtype of the result of str.split() and str.rsplit() when expand=False. With your PR, you could make that result have ListDtype if pyarrow is installed, otherwise return object dtype (which is what happens today)

@WillAyd
Copy link
Member Author

WillAyd commented Jan 8, 2025

That's an interesting idea. My main concern would be the portability of pandas, i.e. running code in one environment without pyarrow installed will yield a potentially different code path than what what you would get in an environment with pyarrow. I think the logical type abstraction would go a long ways to simplify that for end users.

On the flip side, we are still somewhat in limbo as to whether or not pyarrow is actually required for pandas 3.0, given the status of PDEP-15 (we've also discussed this on Slack). If we hard commit back to pyarrow, then you could implement this with less friction I think

Appreciate the input - I think there's a lot of potential here so just need to decide as a team where we feel comfortable with non-NumPy originated types in the default space

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants