Description
The Docmatching pipeline sends requests to the Oracle service to insert data into a database. I have learned that these requests sometimes contain duplicate entries. Since the microservice uses bulk insert operations, duplicates in the list cause the insert to fail.
I propose deduplicating the entries at the pipeline level, before sending the data to the Oracle service. My main concern is the service's 60-second time limit: if an operation exceeds it, the service continues updating the database, but from the pipeline's perspective the request is considered to have failed.
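A minimal sketch of what pipeline-level deduplication could look like, assuming Python and an illustrative record schema (the key fields `doc_id` and `match_id` are hypothetical, not the real Docmatching columns):

```python
# Sketch: drop duplicate records before the bulk insert request is built.
# The key fields below are assumptions for illustration only.

def dedupe_records(records, key_fields=("doc_id", "match_id")):
    """Keep the first occurrence of each unique key; drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

rows = [
    {"doc_id": 1, "match_id": "a", "score": 0.9},
    {"doc_id": 1, "match_id": "a", "score": 0.8},  # duplicate key, dropped
    {"doc_id": 2, "match_id": "b", "score": 0.7},
]
print(len(dedupe_records(rows)))  # → 2
```

Keeping the first occurrence is one possible policy; if later entries should win instead, the loop can overwrite by key in a dict and emit its values.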
By handling duplicates at the pipeline level, we reduce the processing load on the microservice and minimize data transfer, which is crucial given the AWS setup with multiple layers (NGINX, Kubernetes). Additionally, processing the data at the pipeline level ensures that only clean, properly structured data reaches the microservice, preventing bulk insert failures. This approach also aligns with the principle that data preparation should occur before reaching the service layer, leaving the service layer responsible solely for inserting data into the database as requested. Furthermore, it's easier to implement robust error handling and logging at the pipeline level, allowing for better troubleshooting and data quality control.
I've also recognized that the volume of data sent to the service can grow, so I propose sending the updates in batches. This mirrors the process already in place between the resolver service and the master pipeline: the resolver service rejects any update exceeding 1,000 rows (a configurable parameter), so the master pipeline always breaks updates into batches of 1,000 rows or fewer before sending them to the resolver for database insertion.
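The batching step could be as simple as the following sketch (the 1,000-row default mirrors the resolver's limit described above; the function name and surrounding code are assumptions):

```python
# Sketch: split a list of rows into batches no larger than batch_size,
# mirroring the 1000-row limit the resolver service already enforces.

def chunk(records, batch_size=1000):
    """Yield successive batches of at most batch_size rows."""
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]

# e.g. 2500 rows split into three requests
sizes = [len(batch) for batch in chunk(list(range(2500)))]
print(sizes)  # → [1000, 1000, 500]
```

Each batch would then be sent as its own request to the Oracle service, keeping every call comfortably under the 60-second limit and making a single failed batch retryable without resending everything.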