
Memory Issue with bigQuery.createQueryStream in Node.js #1392

@gayathri-mandala

Description


bigQuery.createQueryStream loads the entire result set into memory. When we try to retrieve the data chunk by chunk, it causes a memory issue. The heap profiles show that a large amount of data accumulates in _cachedResponse, _cachedRows, and rows.

Environment details

  • OS: macOS sonoma 14.5
  • Node.js version: 18.12.1
  • npm version: 8.19.2
  • @google-cloud/bigquery version: 7.8.0

Steps to reproduce

Here is a sample script:


const { BigQuery } = require('@google-cloud/bigquery');

const query = `SELECT * FROM table`;

async function queryBigQuery(query) {
  const bigquery = new BigQuery(creds); // creds: project/credentials options

  const queryStream = bigquery.createQueryStream(query);

  console.log('Query started.');

  let recordsBuffer = [];
  const batchSize = 100;

  // Process the stream
  queryStream
    .on('data', row => {
      recordsBuffer.push(row);
      if (recordsBuffer.length >= batchSize) {
        // Process the batch of records, then reset the buffer
        processBatch(recordsBuffer);
        recordsBuffer = [];
      }
    })
    .on('end', () => {
      // Process any remaining records in the buffer
      if (recordsBuffer.length > 0) {
        processBatch(recordsBuffer);
      }
      console.log('Query completed.');
    })
    .on('error', err => {
      console.error('Error during query execution:', err);
    });
}

// Function to process a batch of records
function processBatch(batch) {
  console.log(`Processing batch of ${batch.length} records.`);
}

queryBigQuery(query).catch(console.error);

When there are multiple concurrent connections, the full result set is loaded into memory for every request, so memory usage keeps growing.

Issue with autoPaginate
I tried setting the autoPaginate option: const queryStream = bigquery.createQueryStream(query, { autoPaginate: true });
However, it still behaves as if autoPaginate were set to false. Is there an option that lets us retrieve the data in chunks rather than loading the entire result set into memory?
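
One workaround I am considering (a minimal sketch, assuming manual pagination via createQueryJob and getQueryResults with autoPaginate: false; the page size and function name are placeholders, and it assumes application-default credentials) is to fetch one page at a time using the returned pageToken, so that only a single page of rows is held in memory:

const { BigQuery } = require('@google-cloud/bigquery');

// Sketch: page through query results manually so that only one page of rows
// is held in memory at a time. Page size of 1000 is a placeholder.
async function queryInPages(query, pageSize = 1000) {
  const bigquery = new BigQuery();

  // Run the query as a job instead of a stream.
  const [job] = await bigquery.createQueryJob({ query });

  let pageToken;
  do {
    // With autoPaginate: false, getQueryResults resolves with one page of rows
    // and a nextQuery object that carries the pageToken for the next page.
    const [rows, nextQuery] = await job.getQueryResults({
      maxResults: pageSize,
      pageToken,
      autoPaginate: false,
    });

    processBatch(rows); // reuse the batch handler from the sample above

    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}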
Reference
Here it is mentioned that we need to end the stream after a certain amount of data. However, this approach could lead to data loss. How can we implement this correctly? Please provide a sample.
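
For completeness, here is a sketch of how the stream could be consumed without ending it early, using standard Node.js async iteration over the readable stream so that reading pauses (backpressure) while each batch is processed. This reuses the require and processBatch from the sample above; batchSize is a placeholder, and it is an assumption that this helps here, since it only bounds buffering on the consumer side and does not address the library's internal _cachedResponse/_cachedRows caching, which is the open question above.

// Sketch: consume createQueryStream with async iteration so that reading is
// paused while each batch is processed. Assumes processBatch may be async.
async function queryBigQueryBatched(query, batchSize = 100) {
  const bigquery = new BigQuery(creds);
  const queryStream = bigquery.createQueryStream(query);

  let batch = [];
  for await (const row of queryStream) {
    batch.push(row);
    if (batch.length >= batchSize) {
      await processBatch(batch); // the stream does not emit while this awaits
      batch = [];
    }
  }
  if (batch.length > 0) {
    await processBatch(batch); // flush the remaining rows
  }
  console.log('Query completed.');
}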

Metadata

Labels

api: bigquery (Issues related to the googleapis/nodejs-bigquery API)
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)
