Feat: Support pandas in BigQuery cache #597
base: main
```diff
@@ -19,19 +19,23 @@

 from typing import TYPE_CHECKING, ClassVar, NoReturn

+import pandas as pd
+import pandas_gbq
 from airbyte_api.models import DestinationBigquery
+from google.oauth2.service_account import Credentials

 from airbyte._processors.sql.bigquery import BigQueryConfig, BigQuerySqlProcessor
 from airbyte.caches.base import (
     CacheBase,
 )
 from airbyte.constants import DEFAULT_ARROW_MAX_CHUNK_SIZE
 from airbyte.destinations._translate_cache_to_dest import (
     bigquery_cache_to_destination_configuration,
 )


 if TYPE_CHECKING:
+    from collections.abc import Iterator

     from airbyte.shared.sql_processor import SqlProcessorBase
```
```diff
@@ -48,21 +52,35 @@ def paired_destination_config(self) -> DestinationBigquery:
         """Return a dictionary of destination configuration values."""
         return bigquery_cache_to_destination_configuration(cache=self)

-    def get_arrow_dataset(
+    def _read_to_pandas_dataframe(
         self,
-        stream_name: str,
-        *,
-        max_chunk_size: int = DEFAULT_ARROW_MAX_CHUNK_SIZE,
-    ) -> NoReturn:
-        """Raises NotImplementedError; BigQuery doesn't support `pd.read_sql_table`.
-
-        See: https://github.com/airbytehq/PyAirbyte/issues/165
-        """
-        raise NotImplementedError(
-            "BigQuery doesn't currently support to_arrow"
-            "Please consider using a different cache implementation for these functionalities."
-        )
+        table_name: str,
+        chunksize: int | None = None,
+        **kwargs,
+    ) -> pd.DataFrame | Iterator[pd.DataFrame]:
+        # Pop unused kwargs, maybe not the best way to do this
+        kwargs.pop("con", None)
+        kwargs.pop("schema", None)
+
+        # Read the table using pandas_gbq
+        credentials = Credentials.from_service_account_file(self.credentials_path)
+        result = pandas_gbq.read_gbq(
+            f"{self.project_name}.{self.dataset_name}.{table_name}",
+            project_id=self.project_name,
+            credentials=credentials,
+            **kwargs,
+        )
+
+        # Cast result to DataFrame if it's not already a DataFrame
+        if not isinstance(result, pd.DataFrame):
+            result = pd.DataFrame(result)
+
+        # Return chunks as iterator if chunksize is provided
+        if chunksize is not None:
+            return (result[i : i + chunksize] for i in range(0, len(result), chunksize))
+
+        return result


 # Expose the Cache class and also the Config class.
 __all__ = [
```
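For context, here is a sketch of how the new override would be exercised end to end. It is not part of the PR: it assumes that `CacheBase`'s public `get_pandas_dataframe()` reader dispatches to `_read_to_pandas_dataframe()` internally, and the project, dataset, credentials path, and stream name are all placeholders.

```python
# Sketch only (not part of this PR): exercising the new override through the
# cache's public pandas reader, assuming CacheBase.get_pandas_dataframe()
# delegates to _read_to_pandas_dataframe() under the hood.
from airbyte.caches.bigquery import BigQueryCache

cache = BigQueryCache(
    project_name="my-gcp-project",        # placeholder
    dataset_name="my_dataset",            # placeholder
    credentials_path="/path/to/sa.json",  # placeholder service-account key
)

# Before this PR, this path raised NotImplementedError on BigQuery; with the
# change, the stream's table is read through pandas_gbq instead.
df = cache.get_pandas_dataframe(stream_name="users")  # placeholder stream
print(df.head())
```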
Comment on lines +78 to +81

🛠️ Refactor suggestion
Revisit chunking performance. For very large tables, returning chunked slices of the DataFrame may still be memory-intensive, since the entire DataFrame is loaded before it is sliced. Would you consider a chunked read directly from pandas_gbq?

pandas_gbq doesn't support it :(
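For reference, a genuinely streaming read is possible by dropping down to the google-cloud-bigquery client rather than going through pandas_gbq. A minimal sketch, not part of this PR, reusing the cache attribute names from the diff above:

```python
# Sketch: a streaming alternative to slicing an already-loaded DataFrame.
# pandas_gbq materializes the full result first, but the underlying BigQuery
# client can page through a table and yield one DataFrame per page.
from collections.abc import Iterator

import pandas as pd
from google.cloud import bigquery
from google.oauth2.service_account import Credentials


def read_gbq_chunked(
    credentials_path: str,
    project_name: str,
    dataset_name: str,
    table_name: str,
    chunksize: int,
) -> Iterator[pd.DataFrame]:
    credentials = Credentials.from_service_account_file(credentials_path)
    client = bigquery.Client(project=project_name, credentials=credentials)
    table = client.get_table(f"{project_name}.{dataset_name}.{table_name}")
    # list_rows() pages through the table; to_dataframe_iterable() yields one
    # DataFrame per page, so roughly `chunksize` rows are in memory at a time.
    return client.list_rows(table, page_size=chunksize).to_dataframe_iterable()
```

The trade-off is a direct dependency on the google-cloud-bigquery client API, which may be why it doesn't fit the shape of this PR.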
🛠️ Refactor suggestion
Add error handling for credentials loading.
Loading the credentials could fail for various reasons (missing file, malformed or invalid service-account JSON, etc.). Should we add some error handling here? Wdyt?
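The reviewer's committable suggestion was not captured on this page; the following is only a sketch of the kind of guard described, assuming the exceptions that `Credentials.from_service_account_file` raises (`FileNotFoundError` for a missing file, `ValueError` for malformed key JSON):

```python
# Sketch of the guard described above, not the reviewer's exact suggestion.
# Would replace the bare call inside _read_to_pandas_dataframe():
#     credentials = Credentials.from_service_account_file(self.credentials_path)
try:
    credentials = Credentials.from_service_account_file(self.credentials_path)
except (FileNotFoundError, ValueError) as ex:
    # Surface a clear, actionable message instead of a raw traceback.
    raise RuntimeError(
        f"Could not load BigQuery service account credentials "
        f"from {self.credentials_path!r}: {ex}"
    ) from ex
```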