Description
bigquery.createQueryStream appears to load the entire result set into memory. When we try to consume the data chunk-wise, memory usage keeps growing. Heap profiles show that most of the data is being held in _cachedResponse, _cachedRows, and rows.
Environment details
- OS: macOS sonoma 14.5
- Node.js version: 18.12.1
- npm version: 8.19.2
- @google-cloud/bigquery version: 7.8.0
Steps to reproduce
Here is the sample script:

const { BigQuery } = require('@google-cloud/bigquery');

const query = `SELECT * FROM table`;

async function queryBigQuery(query) {
  // creds holds the BigQuery client options/credentials
  const bigquery = new BigQuery(creds);
  const queryStream = bigquery.createQueryStream(query);
  console.log('Query started.');

  let recordsBuffer = [];
  const batchSize = 100;

  // Process the stream row by row
  queryStream
    .on('data', row => {
      recordsBuffer.push(row);
      if (recordsBuffer.length >= batchSize) {
        // Process the batch of records and reset the buffer
        processBatch(recordsBuffer);
        recordsBuffer = [];
      }
    })
    .on('end', () => {
      // Process any remaining records in the buffer
      if (recordsBuffer.length > 0) {
        processBatch(recordsBuffer);
      }
      console.log('Query completed.');
    })
    .on('error', err => {
      console.error('Error during query execution:', err);
    });
}

// Function to process a batch of records
function processBatch(batch) {
  console.log(`Processing batch of ${batch.length} records.`);
}

queryBigQuery(query).catch(console.error);
When there are multiple connections, the full result set is loaded into memory for each request, so memory usage grows with every connection.
Issue with autoPaginate
I tried setting the autoPaginate option: const queryStream = bigquery.createQueryStream(query, { autoPaginate: true });
However, it still behaves as if autoPaginate were set to false. Is there an option that allows us to retrieve the data in chunks rather than loading the entire result set into memory?
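A possible workaround (not confirmed by the library docs or maintainers) is to skip the stream and drive the query page by page with createQueryJob and getQueryResults, passing maxResults and the pageToken returned for the previous page. In the sketch below, the page size of 1000 and the processPage helper are illustrative assumptions:

```js
const { BigQuery } = require('@google-cloud/bigquery');

async function queryInPages(query) {
  const bigquery = new BigQuery();

  // Run the query as a job so results can be fetched one page at a time.
  const [job] = await bigquery.createQueryJob({ query });

  let pageToken;
  do {
    // With autoPaginate disabled, each call returns a single page of rows
    // plus the options (including pageToken) needed to fetch the next page.
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      maxResults: 1000, // illustrative page size
      pageToken,
    });

    processPage(rows); // hypothetical per-page handler

    pageToken = nextQuery ? nextQuery.pageToken : undefined;
  } while (pageToken);
}
```

Because only one page of rows is held at a time, memory usage should stay roughly proportional to maxResults rather than to the full result set.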
Reference
Here it is mentioned that we need to end the stream after a certain amount of data. However, this approach could lead to data loss. How can we implement this correctly? Please provide a sample.
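If ending the stream early risks dropping rows, one alternative sketch is to apply backpressure instead: pause the stream while a batch is being processed and resume it afterwards. This assumes queryStream behaves like a standard Node.js Readable stream and that processBatch is asynchronous (both names come from the script above):

```js
// Sketch: apply backpressure instead of ending the stream early.
// Assumes queryStream is a standard Node.js Readable stream and
// processBatch(batch) returns a Promise (an async per-batch handler).
let recordsBuffer = [];
const batchSize = 100;

queryStream
  .on('data', row => {
    recordsBuffer.push(row);
    if (recordsBuffer.length >= batchSize) {
      // Stop pulling rows while the current batch is being handled.
      queryStream.pause();
      const batch = recordsBuffer;
      recordsBuffer = [];
      processBatch(batch)
        .then(() => queryStream.resume()) // keep reading once the batch is done
        .catch(err => queryStream.destroy(err));
    }
  })
  .on('end', () => {
    // Flush whatever is left in the buffer
    if (recordsBuffer.length > 0) {
      processBatch(recordsBuffer);
    }
  });
```

Since the stream is only paused, never ended, no rows should be lost, and at most one batch plus one in-flight page is buffered at a time.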