Querying Parquet file specifically with a predicate returns invalid data error but works in other situations #14281
Comments
I've heard from another user that they managed to work around this by switching off the page index when generating the files by |
Have you tried using the datafusion-cli --command "select * from parquet_metadata('go-parquet-writer/go-testfile.parquet')" |
Hey, I'm the one who mentioned this in discord 😄 what I meant is that we disabled In your repro repository I was able to confirm disabling the page index makes it work:
❯ datafusion-cli
DataFusion CLI v43.0.0
> select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
External error: Parquet error: External: bad data
> SET datafusion.execution.parquet.enable_page_index = false;
0 row(s) fetched.
Elapsed 0.001 seconds.
> select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
+--------+---------+-----+-------+--------+--------------------------+---------+
| city | country | age | scale | status | time_captured | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Athens | Greece | 32 | 1 | 20 | 2025-01-24T17:34:00.715Z | true |
+--------+---------+-----+-------+--------+--------------------------+---------+
1 row(s) fetched.
Elapsed 0.021 seconds.
Last time I looked at this issue I had a feeling that this was an issue with |
That seems to work fine
Ah right, that makes sense. Thanks for the help on getting that to work, that's super helpful!
That seems to be the case (the documentation reports that they don't use the page index on the read side) |
Confirmed that the following works now.
let mut parquet_options = TableParquetOptions::new();
parquet_options
    .set("enable_page_index", "false")
    .expect("could not set enable_page_index config option");
let exec = ParquetExec::builder(scan_config)
    .with_table_parquet_options(parquet_options)
    .with_predicate(predicate)
    .build_arc();
Weirdly enough, setting the variable on the
let ctx = SessionContext::new_with_config(
    SessionConfig::new().set_bool("datafusion.execution.parquet.enable_page_index", false)
) |
Describe the bug
When making a query with a predicate against Parquet files generated with parquet-go, DataFusion errors saying the data is invalid. However, without a predicate, it works fine.
When using the CLI, I get the error:
In my application, it is more descriptive, showing:
However, it appears that the file is intact. The metadata is successfully read and interpreted
When I run without a predicate, I get back the data
It even works if I use ORDER BY and GROUP BY
Additionally, this works when I use PyArrow and Pandas to load the Parquet file and filter it.
To Reproduce
The issue can be reproduced by creating a Parquet file with the parquet-go library and attempting to query it with a predicate in the query. To simplify, I created a public repo that has code to generate the file and similar examples in the README as shown in this report. A test file can be found in go-parquet-writer/go-testfile.parquet, generated by the Go program in that directory.
I've also gone through the effort of trying to achieve the same using PyArrow and Pandas (which you'll see in the repo under pyarrow-ex) to verify the Parquet file is not corrupted in some way. This works as expected.
Expected behavior
The Parquet files created by parquet-go can successfully be queried when the query contains a predicate.
Additional context
From everything I've gathered, this error is likely coming from this conversion function. However, it only skips checking 0x02 when a collection is being parsed. Weirdly, I don't have any list/map/set in my schema. I assume this means this 0x02 is being used to encode something else but it is beyond my knowledge.
I went spelunking in the parquet-go codebase. The Thrift protocol implementation is split amongst the compact protocol, the Thrift type definitions and the encoding logic.
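To make the byte-level question above concrete, here is a minimal, standard-library-only Rust sketch of how a reader might decode boolean elements of a Thrift compact-protocol list. It is not the conversion function from the Parquet reader, and nothing in this thread confirms that it matches the actual cause. It assumes the usual compact-protocol values (0x01 for true, 0x02 for false) and, as a pure assumption for illustration, that some writers emit 0x00 for false inside collections. The names `BoolMode`, `decode_list_bool` and `decode_bool_list` are hypothetical.

```rust
/// How strictly to interpret a boolean byte inside a compact-protocol list.
/// Hypothetical type, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BoolMode {
    /// Accept only 0x01 (true) and 0x02 (false).
    Strict,
    /// Additionally accept 0x00 as false (assumed writer behaviour).
    Lenient,
}

/// Decode a single boolean list element from its byte.
fn decode_list_bool(byte: u8, mode: BoolMode) -> Result<bool, String> {
    match (byte, mode) {
        (0x01, _) => Ok(true),
        (0x02, _) => Ok(false),
        (0x00, BoolMode::Lenient) => Ok(false),
        (other, _) => Err(format!("cannot convert {other:#04x} into bool")),
    }
}

/// Decode every element, failing on the first byte that is not a valid bool.
fn decode_bool_list(bytes: &[u8], mode: BoolMode) -> Result<Vec<bool>, String> {
    bytes.iter().map(|&b| decode_list_bool(b, mode)).collect()
}

fn main() {
    // Hypothetical payload: a list of three booleans with false written as 0x00.
    let payload = [0x01, 0x00, 0x01];
    println!("strict:  {:?}", decode_bool_list(&payload, BoolMode::Strict));
    println!("lenient: {:?}", decode_bool_list(&payload, BoolMode::Lenient));
}
```

Under these assumptions, a strict reader rejects the whole list while a lenient one does not. That would fit an error that only appears when some list of booleans is actually parsed, even if the table schema itself has no list/map/set columns.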