feat: Add configuration options for PK chunking to help w/the initial sync of large/wide tables #52

Open · wants to merge 5 commits into base: main
Conversation


@sunild sunild commented Feb 7, 2024

I ran into two problems when syncing tables with the BULK api.

  • If the job results are over 1GB or the query takes more than 5 minutes (described here in the SF docs), the job fails w/a "retried 30 times" error
  • Objects in the Note table in one of our SF accounts must be associated w/a large number of other objects, b/c we get this error: InvalidBatch : Failed to process query: OPERATION_TOO_LARGE: exceeded 20000 distinct ids

Addressing the first problem

We found it helpful to enable PK chunking for specific tables from the get-go, rather than waiting for it to fail over to PK chunking after a query timeout. The tables in question are fairly large and have lots of columns, and we have to use PK chunking w/them elsewhere.
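For context, PK chunking is requested when the Bulk API job is created, via the Sforce-Enable-PKChunking header. Here is a minimal sketch of doing that up front (not the tap's actual code; the instance URL, session ID, and API version are placeholders):

import json
import requests

INSTANCE_URL = "https://example.my.salesforce.com"  # placeholder
SESSION_ID = "REPLACE_WITH_SESSION_ID"              # placeholder
API_VERSION = "59.0"                                # placeholder

def create_pk_chunked_job(sobject, chunk_size=100000):
    """Create a Bulk API 1.0 query job with PK chunking enabled from the start."""
    headers = {
        "X-SFDC-Session": SESSION_ID,
        "Content-Type": "application/json",
        # Ask Salesforce to split the query into PK-ranged batches up front.
        "Sforce-Enable-PKChunking": "chunkSize={}".format(chunk_size),
    }
    body = {"operation": "queryAll", "object": sobject, "contentType": "CSV"}
    resp = requests.post(
        "{}/services/async/{}/job".format(INSTANCE_URL, API_VERSION),
        headers=headers,
        data=json.dumps(body),
    )
    resp.raise_for_status()
    return resp.json()["id"]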

The existing code is supposed to fail over to PK chunking when a query timeout occurs, but it didn't in our case. I believe the behavior of the API may have changed, based on what the original tap-salesforce is doing now (and that seems outdated too, 15 vs 30 retries).

I'd like to submit a separate PR to address the failover problem.

Addressing the second problem

We got the "OPERATION_TOO_LARGE" error both with and without PK chunking. One of the suggested solutions is to make the query smaller, so using a smaller chunk size avoided the error for us.

Configuration options for PK chunking

I'll happily submit PRs to update the docs. I added four config options:

config:
  pk_chunking:
    enabled: true
    chunk_size: 15000
    tables:
    - TableA
    - TableB
    polling_sleep_time: 20
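
For illustration, the tap could consume these settings roughly like this (a hypothetical sketch; the helper names and defaults below are mine, not the PR's):

DEFAULT_CHUNK_SIZE = 100000        # assumed default, not the PR's
DEFAULT_POLLING_SLEEP_TIME = 20    # assumed default, not the PR's

def pk_chunking_settings(config):
    """Pull the pk_chunking block out of the parsed tap config (hypothetical helper)."""
    pk_cfg = config.get("pk_chunking") or {}
    return {
        "enabled": bool(pk_cfg.get("enabled", False)),
        "tables": set(pk_cfg.get("tables", [])),
        "chunk_size": int(pk_cfg.get("chunk_size", DEFAULT_CHUNK_SIZE)),
        "polling_sleep_time": int(pk_cfg.get("polling_sleep_time", DEFAULT_POLLING_SLEEP_TIME)),
    }

def should_pk_chunk(config, stream_name):
    """PK chunk from the start only for tables explicitly listed in the config."""
    settings = pk_chunking_settings(config)
    return settings["enabled"] and stream_name in settings["tables"]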

Intent

This is intended to be used during an initial sync for specific problematic tables, and then disabled for subsequent runs.

Add a `pk_chunking` boolean config option to use PK chunking from the
start, instead of failing over to it when a job fails w/a query timeout.

This is useful when syncing large tables. The query timeout behavior
has been preserved, but does not seem to work in all scenarios.
- replace the array of table names w/a tables property
- add a chunk_size property to resolve issues w/large tables
- add a polling_sleep_time property to specify how long to wait between batch status polls
@@ -190,6 +202,17 @@ def _add_batch(self, catalog_entry, job_id, start_date, order_by_clause=True):

        return batch['batchInfo']['id']

    def _complete_batch(self, state, tap_stream_id, batch_id):

@sunild (Author) commented Feb 7, 2024

I added this method b/c when a PK chunked job was complete, the bookmark still had the JobID and BatchIDs properties in it. The JobID needs to be cleared from the state; otherwise, on the next syncing interval, the tap would incorrectly think it was resuming a failed batch (even though the BatchIDs array was empty).

See:

if job_id:
    with metrics.record_counter(stream) as counter:
        LOGGER.info("Found JobID from previous Bulk Query. Resuming sync for job: %s", job_id)
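
Here is a sketch of what that cleanup might look like, assuming singer-python's bookmark helpers (not necessarily the exact code in this PR):

import singer

class Bulk:  # minimal stub; in the tap this lives on the existing Bulk class
    def _complete_batch(self, state, tap_stream_id, batch_id):
        """Drop a finished batch from the bookmark and clear JobID when all are done."""
        batch_ids = singer.get_bookmark(state, tap_stream_id, 'BatchIDs') or []
        if batch_id in batch_ids:
            batch_ids.remove(batch_id)
        singer.write_bookmark(state, tap_stream_id, 'BatchIDs', batch_ids)
        if not batch_ids:
            # All batches finished: clear JobID so the next sync starts a new
            # job instead of "resuming" one that has already completed.
            singer.clear_bookmark(state, tap_stream_id, 'JobID')
        singer.write_state(state)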

@sunild changed the title from "Add configuration options for PK chunking to help w/the initial sync of large/wide tables" to "feat: Add configuration options for PK chunking to help w/the initial sync of large/wide tables" on Feb 7, 2024