This example demonstrates the concept of referencing multiple forms of side input data in GCS along with main input data loaded from pubsub. Side input data can be refreshed periodically via PubSub notification.
-
Inside a Streaming Pipeline, reload a file that serves as side input from GCS bucket (configured to track changes by firing pubsub notifications) written periodically by an external process
-
Batch pipeline generates latest data (set of files inside a GCS bucket) on a daily basis, publishes the base path to pubsub and gets consumed by the Streaming pipeline to replace side input with latest data.
Note: In this example source for side input data is GCS but can be replaced to load from any Source (BigQuery, Cloud SQL etc)
Pipeline contains 4 steps:
- Read Primary events from main input pubsub subscription over a fixed window.
- Read GCS file base path containing latest side input data from side input pubsub subscription.
- (Re) Load multiple Side input types each located inside a sub path under the base path received in step 2
- Enrich primary event with side input data
Check out complete code
Example contains 2 types of tests
- Tests to validate individual transforms
- Tests to validate entire pipeline end to end
Check out tests