-
Notifications
You must be signed in to change notification settings - Fork 23
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Daniel Bauer
committed
Oct 25, 2024
1 parent
4686ebe
commit 5f824f3
Showing
1 changed file
with
13 additions
and
16 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,4 @@ | ||
# **RFC0x for Presto** | ||
|
||
|
||
## Replacing HTTP Exchange with Binary Exchange Protocol | ||
# Add support for Binary Exchange Protocol | ||
|
||
Proposers | ||
* Daniel Bauer ([email protected]) | ||
|
@@ -18,7 +15,7 @@ Above protocol enhancement is integrated into the proposed binary exchange proto | |
|
||
The binary exchange protocol (BinX) is an alternative for the existing HTTP-based exchange protocol that | ||
runs between Prestissimo worker nodes. It offers the same functionality and API | ||
but uses binary encoding that can be more efficiently parsed than HTTP nessages. | ||
but uses binary encoding that can be more efficiently parsed than HTTP messages. | ||
This translates into a performance benefit for exchange-intensive queries. | ||
BinX does not replace the control protocol that runs between the coordinator and the | ||
worker nodes. The control protocol continues to use HTTP. | ||
|
@@ -33,7 +30,7 @@ is more complex than decoding binary encoded messages. | |
|
||
### Goals | ||
|
||
The proposal is to use a binary exchange protocol as a light-weight alternative to the existinig HTTP exchange protocol. | ||
The proposal is to use a binary exchange protocol as a light-weight alternative to the existing HTTP exchange protocol. | ||
As a prototypical implementation shows that such a protocol reduces query run-time of exchange heavy queries by | ||
20% to 30%. | ||
|
||
|
@@ -143,8 +140,9 @@ with the HTTP exchange. | |
|
||
#### Implementation Notes | ||
|
||
The BinX server uses Wangle. It consists of the following components that are implemented in | ||
the file `BinaryExchangeServer.h`: | ||
Like Proxygen, the BinX server uses Wangle as its underlying networking library. | ||
The BinX server is implemented in the file `BinaryExchangeServer.h` and consists of | ||
several components: | ||
|
||
* The `BinaryExchangeServer` is a controller for starting and stopping the Wangle protocol stack. | ||
It takes the port number, the IO thread pool and the CPU thread pool as construction parameters. | ||
|
@@ -159,9 +157,8 @@ service implementation on top of the stack. | |
The results from the TaskManager are packaged into replies and sent back to the requesting BinX exchange source. | ||
This exchange service follows the design of the existing `TaskResource` service. | ||
|
||
The `TaskManagerStub` class is an implementation detail that enables the BinX server to interact with | ||
a mock TaskManager implementation. This is used in the unit tests and allows to test the BinX server | ||
implementation along with the BinX exchange source implementation. | ||
All of above components are templated to allow for different TaskManager implementations. In the production code, | ||
the Prestissimo TaskManager is used while for unit testing, a mock task manager is deployed. | ||
|
||
### Binary Exchange Source and Binary Exchange Client | ||
|
||
|
@@ -175,7 +172,7 @@ The `PrestoServer` registers a factory method for creating exchange sources. Thi | |
such that `BinaryExchangeSource`s are created instead of HTTP exchanges when enabled by configuration. | ||
One exception are connections to the | ||
Presto coordinator that always uses the HTTP based exchange protocol. In a Kubernetes environment with its virtual | ||
networking, it is unfortunately not straight forward to detect whether the target host is the Presto connector | ||
networking, it is unfortunately not straight forward to detect whether the target host is the Presto coordinator | ||
since the connector's service IP used in the Presto configuration doesn't correspond to the IP address used by the | ||
pod running the coordinator. In order to circumvent this problem, a helper class called `CoordinatorInfoResolver` | ||
uses the node status endpoint of the coordinator to retrieve the coordinator's IP address. Using this address | ||
|
@@ -237,17 +234,17 @@ the additional complexity. | |
- There is one additional configuration option to enable BinX. Otherwise, there is no impact on session parameters, no API changes | ||
and no changes to SQL. | ||
|
||
- If we are changing behaviour how will we phase out the older behaviour? | ||
- If we are changing behavior how will we phase out the older behavior? | ||
|
||
- The HTTP stack is still required for the control message. The cost of keeping the HttpExchangeSource is minimal. | ||
|
||
- If we need special migration tools, describe them here. | ||
|
||
- No tools required. | ||
|
||
- When will we remove the existing behaviour, if applicable. | ||
- When will we remove the existing behavior, if applicable. | ||
|
||
- Existing behaviour will remain as the default option. | ||
- Existing behavior will remain as the default option. | ||
|
||
- How should this feature be taught to new and existing users? Basically mention if documentation changes/new blog are needed? | ||
|
||
|
@@ -261,5 +258,5 @@ the additional complexity. | |
|
||
Test plan involves running performance measurements using TPC-DS and TPC-H benchmarks that compare the performance of HTTP versus BinX. | ||
|
||
The TPC-DS benchmark test has been conducted using a dataset with scale factor 1000 on an on-prem cluster with 8 nodes. The results | ||
The TPC-DS benchmark test has been conducted using a dataset with scale factor 1000 on an on-premise cluster with 8 nodes. The results | ||
for this 1TB dataset have shown that overall runtime for the 99 queries was ~56 minutes when using HTTP compared to ~43 minutes for BinX. |