Add RFC for Presto -Native TPC-DS Connector #28
Branch updated: 353cbed to de2d633
Thanks @pdabre12
RFC-0009-native-tpcds-connector.md
Outdated
### [Optional] Goals

1. Add a TPC-DS connector to generate TPC-DS data in Presto native.
2. Write end-to-end tests in Presto native with TPC-DS tables and conduct microbenchmarks in Velox.
We can skip micro-benchmarks in Velox here.
RFC-0009-native-tpcds-connector.md
Outdated
## Background

Currently, Presto does not have a native implementation of the TPC-DS connector. This RFC proposes adding a new TPC-DS connector that can be used as a Presto native catalog.
Please give more background on the TPC-DS benchmark, its schema, and the dsdgen program here.
RFC-0009-native-tpcds-connector.md
Outdated
The Presto native TPC-DS connector will be a wrapper around dsdgen, the data generator written in C and distributed by the TPC organization. This means our implementation must behave exactly like the C implementation. DuckDB already has its own TPC-DS connector, for which it wrapped the C files in C++ files; we are going to use these C++ files in our implementation.

In the C++ implementation, there are two types of tables used for generation: source tables and target tables. Source table files are prefixed with "s_", while target table files are prefixed with "w_"; for instance, "s_call_center.c" and "w_call_center.c". Source tables appear to be used only when dsdgen runs with an update flag, though the exact function of this flag and the purpose of the source tables have not yet been explored. Currently, our focus is solely on implementing functionality for the target tables (w_ tables).
Doesn't Ying want to generate the update data with the TPC-DS connector as well? It might be good to add more information about that here.
The target table files (prefixed with "w_") call helper functions that we need to implement, namely `append_row_start` and `append_row_end`, which help with row generation. Depending on the table's schema, there are also "append_" functions for each data type to be appended.
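To make the helper contract above concrete, here is a minimal sketch of how the append callbacks could collect rows. Only the names `append_row_start` and `append_row_end` come from the RFC text; `RowAppender`, `Cell`, the typed `append_*` signatures, and `generate_demo_rows` are hypothetical and are not the dsdgen, DuckDB, or Velox API. Decimals are simplified to `double` for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical cell type: monostate stands in for SQL NULL.
using Cell = std::variant<std::monostate, int64_t, double, std::string>;

// Hypothetical row sink that the append_* helpers write into.
struct RowAppender {
  std::size_t columnCount;              // columns expected per row
  std::vector<std::vector<Cell>> rows;  // completed rows
  std::vector<Cell> current;            // row currently being built
  bool inRow = false;                   // true between row_start and row_end
};

// Called by a table generator before the first column of a row.
void append_row_start(RowAppender& a) {
  assert(!a.inRow);
  a.current.clear();
  a.inRow = true;
}

// One helper per data type (assumed signatures).
void append_integer(RowAppender& a, int64_t v) {
  assert(a.inRow);
  a.current.emplace_back(v);
}

void append_decimal(RowAppender& a, double v) {
  assert(a.inRow);
  a.current.emplace_back(v);
}

void append_varchar(RowAppender& a, const std::string& v) {
  assert(a.inRow);
  a.current.emplace_back(v);
}

void append_null(RowAppender& a) {
  assert(a.inRow);
  a.current.emplace_back(std::monostate{});
}

// Called after the last column; checks the row is complete and stores it.
void append_row_end(RowAppender& a) {
  assert(a.inRow && a.current.size() == a.columnCount);
  a.rows.push_back(std::move(a.current));
  a.current.clear();
  a.inRow = false;
}

// Stand-in for a "w_" table generator: emits `count` three-column rows.
void generate_demo_rows(RowAppender& a, int64_t firstKey, int64_t count) {
  for (int64_t i = 0; i < count; ++i) {
    append_row_start(a);
    append_integer(a, firstKey + i);
    append_varchar(a, "center_" + std::to_string(firstKey + i));
    append_null(a);
    append_row_end(a);
  }
}
```

In a real connector the sink would presumably write into columnar vectors rather than a vector of variants; the row-oriented container here only keeps the sketch short.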
It would be great to give more information about TpcdsSplits and how data generation for a table proceeds from one split to the next.
RFC-0009-native-tpcds-connector.md
Outdated
A new TPC-DS config property, `tpcds.toggle-char-to-varchar`, will be added to convert char columns to varchar, addressing the lack of support for the char data type in Presto native. The property lets the conversion be switched on when required, ensuring consistency between Presto Java and Presto native.
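As a rough illustration of the schema-level side of this toggle, the sketch below maps a declared column type to the type the connector would report. The function `resolveColumnType` and its string-based type names are hypothetical; the boolean is assumed to carry the value of `tpcds.toggle-char-to-varchar`, and how that property is actually read is not specified here. Any data-level effect (for example, padding) is not modeled.

```cpp
#include <string>

// Hypothetical helper: returns the column type to expose for a declared type.
// When the toggle is on, "char(n)" is reported as "varchar(n)"; every other
// type, and every type when the toggle is off, is returned unchanged.
std::string resolveColumnType(const std::string& declaredType,
                              bool toggleCharToVarchar) {
  const std::string prefix = "char(";
  if (toggleCharToVarchar &&
      declaredType.compare(0, prefix.size(), prefix) == 0) {
    return "var" + declaredType;
  }
  return declaredType;
}
```

For example, with the toggle enabled a column declared as `char(50)` would be reported as `varchar(50)`, while `integer` and `varchar(20)` pass through untouched.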
What is the impact of this at the schema level and at the data level? Could you please elaborate?
Branch updated: de2d633 to f5d1ce2