feat: Add configuration options for PK chunking to help w/the initial sync of large/wide tables #52

Open · wants to merge 5 commits into base: main
Conversation


@sunild sunild commented Feb 7, 2024

I ran into two problems when syncing tables with the BULK api.

  • If the job results are over 1GB or the query takes more than 5 minutes (described here in the SF docs), the job fails w/a "retried 30 times" error
  • Objects in the Note table in one of our SF accounts must be associated w/a large number of other objects, b/c we get this error: InvalidBatch : Failed to process query: OPERATION_TOO_LARGE: exceeded 20000 distinct ids

Addressing the first problem

We found it helpful to enable PK chunking for specific tables from the get-go, rather than waiting for it to fail over to PK chunking after a query timeout. The tables in question are fairly large and have lots of columns, and we have to use PK chunking w/them elsewhere.
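For context, PK chunking is requested when the Bulk API job is created, via the Sforce-Enable-PKChunking header. Here is a minimal sketch of doing that up front (not the tap's actual code; the instance URL, session ID, and API version are placeholders):

import json
import requests

INSTANCE_URL = "https://example.my.salesforce.com"  # placeholder
SESSION_ID = "REPLACE_WITH_SESSION_ID"              # placeholder
API_VERSION = "59.0"                                # placeholder

def create_pk_chunked_job(sobject, chunk_size=100000):
    """Create a Bulk API 1.0 query job with PK chunking enabled from the start."""
    headers = {
        "X-SFDC-Session": SESSION_ID,
        "Content-Type": "application/json",
        # Ask Salesforce to split the query into PK-ranged batches up front.
        "Sforce-Enable-PKChunking": "chunkSize={}".format(chunk_size),
    }
    body = {"operation": "queryAll", "object": sobject, "contentType": "CSV"}
    resp = requests.post(
        "{}/services/async/{}/job".format(INSTANCE_URL, API_VERSION),
        headers=headers,
        data=json.dumps(body),
    )
    resp.raise_for_status()
    return resp.json()["id"]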

The existing code is supposed to fail over to PK chunking when a query timeout occurs, but it didn't in our case. I believe the behavior of the API may have changed, based on what the original tap-salesforce is doing now (and that seems outdated too, 15 vs 30 retries).

I'd like to submit a separate PR to address the failover problem.

Addressing the second problem

We got the "OPERATION_TOO_LARGE" error both with and without PK chunking. One of the suggested solutions is to make the query smaller, so using a smaller chunk size avoided the error for us.

Configuration options for PK chunking

I'll happily submit PRs to update the docs. I added four config options:

config:
  pk_chunking:
    enabled: true
    chunk_size: 15000
    tables:
    - TableA
    - TableB
    polling_sleep_time: 20
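
For illustration, the tap could consume these settings roughly like this (a hypothetical sketch; the helper names and defaults below are mine, not the PR's):

DEFAULT_CHUNK_SIZE = 100000        # assumed default, not the PR's
DEFAULT_POLLING_SLEEP_TIME = 20    # assumed default, not the PR's

def pk_chunking_settings(config):
    """Pull the pk_chunking block out of the parsed tap config (hypothetical helper)."""
    pk_cfg = config.get("pk_chunking") or {}
    return {
        "enabled": bool(pk_cfg.get("enabled", False)),
        "tables": set(pk_cfg.get("tables", [])),
        "chunk_size": int(pk_cfg.get("chunk_size", DEFAULT_CHUNK_SIZE)),
        "polling_sleep_time": int(pk_cfg.get("polling_sleep_time", DEFAULT_POLLING_SLEEP_TIME)),
    }

def should_pk_chunk(config, stream_name):
    """PK chunk from the start only for tables explicitly listed in the config."""
    settings = pk_chunking_settings(config)
    return settings["enabled"] and stream_name in settings["tables"]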

Intent

This is intended to be used during an initial sync for specific problematic tables, and then disabled for subsequent runs.

Add a `pk_chunking` boolean config option to use PK chunking from the
start, instead of failing over to it when a job fails w/a query timeout.

This is useful when syncing large tables. The query timeout behavior
has been preserved, but does not seem to work in all scenarios.
- replace the array of table names w/a tables property
- add a chunk_size property to resolve issues w/large tables
- add a polling_sleep_time property to specify how long to wait between batch status polls
@@ -190,6 +202,17 @@ def _add_batch(self, catalog_entry, job_id, start_date, order_by_clause=True):

        return batch['batchInfo']['id']

    def _complete_batch(self, state, tap_stream_id, batch_id):

@sunild (Author) commented Feb 7, 2024

I added this method b/c when a PK chunked job was complete, the bookmark still had the JobID and BatchIDs properties in it. The JobID needs to be cleared from the state; otherwise, on the next syncing interval, the tap would incorrectly think it was resuming a failed batch (even though the BatchIDs array was empty).

See:

if job_id:
    with metrics.record_counter(stream) as counter:
        LOGGER.info("Found JobID from previous Bulk Query. Resuming sync for job: %s", job_id)
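
Here is a sketch of what that cleanup might look like, assuming singer-python's bookmark helpers (not necessarily the exact code in this PR):

import singer

class Bulk:  # minimal stub; in the tap this lives on the existing Bulk class
    def _complete_batch(self, state, tap_stream_id, batch_id):
        """Drop a finished batch from the bookmark and clear JobID when all are done."""
        batch_ids = singer.get_bookmark(state, tap_stream_id, 'BatchIDs') or []
        if batch_id in batch_ids:
            batch_ids.remove(batch_id)
        singer.write_bookmark(state, tap_stream_id, 'BatchIDs', batch_ids)
        if not batch_ids:
            # All batches finished: clear JobID so the next sync starts a new
            # job instead of "resuming" one that has already completed.
            singer.clear_bookmark(state, tap_stream_id, 'JobID')
        singer.write_state(state)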

@sunild changed the title from "Add configuration options for PK chunking to help w/the initial sync of large/wide tables" to "feat: Add configuration options for PK chunking to help w/the initial sync of large/wide tables" on Feb 7, 2024