feat: make dlt resources first class citizens in cognee ingestion#2237
Move all DLT-specific expansion into a pre-processing step that converts DLT resources into standard DataItem objects. After this, the standard adapter path handles everything with no DLT branching in add.py, ingest_data.py, or classify_documents.py.

- Extend DataItem with external_metadata and data_id fields
- Create resolve_dlt_sources.py as the single DLT adapter entry point
- Remove DLT imports/logic from add.py, ingest_data.py, classify_documents.py
- Add generic content-change detection in ingest_data.py

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: vasilije <[email protected]>
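A minimal sketch of the idea in this commit message, not the code from the PR: a pre-processing step expands resource rows into plain items, and a generic hash comparison detects changed content. The `DataItem` class here is a hypothetical stand-in that models only the two fields named above (`external_metadata`, `data_id`); `resolve_rows`, `content_hash`, and `has_changed` are illustrative names, and the real `resolve_dlt_sources.py` may look quite different.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


@dataclass
class DataItem:
    """Hypothetical stand-in; only the fields named in the commit message are modeled."""
    data: Any
    external_metadata: dict = field(default_factory=dict)
    data_id: Optional[str] = None


def resolve_rows(resource_name: str, rows: Iterable[dict]) -> list[DataItem]:
    """Expand one resource's rows into DataItem objects (illustrative only)."""
    return [
        DataItem(
            data=row,
            external_metadata={"source": "dlt", "resource": resource_name},
            data_id=f"{resource_name}:{index}",
        )
        for index, row in enumerate(rows)
    ]


def content_hash(payload: Any) -> str:
    """Hash a JSON-serializable payload deterministically (keys sorted)."""
    serialized = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def has_changed(item: DataItem, known_hashes: dict[str, str]) -> bool:
    """Return True if the item's content differs from the last recorded hash."""
    new_hash = content_hash(item.data)
    changed = known_hashes.get(item.data_id) != new_hash
    known_hashes[item.data_id] = new_hash
    return changed


# After resolution, downstream code only ever sees DataItem objects.
items = resolve_rows("users", [{"id": 1, "name": "Ada"}])
seen: dict[str, str] = {}
first_pass = has_changed(items[0], seen)   # nothing recorded yet
second_pass = has_changed(items[0], seen)  # same content as before
items[0].data["name"] = "Grace"
third_pass = has_changed(items[0], seen)   # content was edited
```

Because the change check works on any item with a `data_id`, it needs no knowledge of where the item came from, which is the point of keeping the DLT handling in a single pre-processing step.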
    {"db_name": db_name},
)
if exists_result.scalar() is None:
    await connection.execute(text(f'CREATE DATABASE "{db_name}";'))
Check failure
Code scanning / CodeQL
SQL query built from user-controlled sources (severity: High)
Copilot Autofix
AI 1 day ago
In general, when you must use user-controlled values in SQL where parameters cannot be used (e.g., identifiers like database names), you should first validate the value against a strict whitelist of allowed characters and patterns, and reject or normalize anything else. For PostgreSQL database names, a safe approach is to restrict to a subset like [a-zA-Z0-9_] and enforce a reasonable length limit. This avoids dangerous characters such as quotes, semicolons, and backslashes that could break out of the identifier context.
For this codebase, the best fix with minimal behavior change is:
- Introduce a small helper function in cognee/tasks/ingestion/ingest_dlt_source.py that validates db_name and raises ValueError if it contains anything other than letters, digits, or underscores, or if it is empty or too long.
- Call this validator at the start of _create_pg_database(db_name) before constructing or executing SQL.
- Keep using text() for the SQL, but rely on the fact that the validated db_name cannot contain characters that would break out of the quoted identifier or inject additional SQL.
This only touches ingest_dlt_source.py, adding a standard-library import (re) and a small function plus one validation call; the external behavior for normal, valid dataset names remains the same.
@@ -8,6 +8,7 @@
 )
 from cognee.modules.data.models import Data
 from cognee.infrastructure.databases.relational.config import get_relational_config
+import re

 try:
     import dlt
@@ -15,6 +16,28 @@
     pass


+def _validate_db_name(db_name: str) -> str:
+    """
+    Validate a database name coming from user-controlled input.
+
+    Restrict to alphanumeric characters and underscores to avoid SQL injection
+    when the name is interpolated into identifier positions.
+    """
+    if not isinstance(db_name, str) or not db_name:
+        raise ValueError("Invalid database name.")
+
+    # Allow only letters, digits and underscore; disallow other characters
+    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
+        raise ValueError("Database name contains invalid characters.")
+
+    # Optional: enforce a reasonable length limit
+    if len(db_name) > 63:
+        # 63 is PostgreSQL's NAMEDATALEN - 1 default
+        raise ValueError("Database name is too long.")
+
+    return db_name
+
+
 async def ingest_dlt_source(
     dlt_source,
     dataset_name: str,
@@ -83,6 +106,8 @@


 async def _create_pg_database(db_name):
+    # Validate user-controlled database name before using it in SQL
+    db_name = _validate_db_name(db_name)
     relational_config = get_relational_config()
     maintenance_db_name = "postgres"
     maintenance_db_url = URL.create(
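To see how the suggested validator behaves, the sketch below re-declares it so that it runs standalone and feeds it a few sample names. The `is_accepted` helper and the sample values are illustrative assumptions, not part of the autofix.

```python
import re


def _validate_db_name(db_name: str) -> str:
    """Same checks as the suggested autofix, copied here so the example runs on its own."""
    if not isinstance(db_name, str) or not db_name:
        raise ValueError("Invalid database name.")
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
        raise ValueError("Database name contains invalid characters.")
    if len(db_name) > 63:
        raise ValueError("Database name is too long.")
    return db_name


def is_accepted(name) -> bool:
    """Illustrative helper: True if the validator lets the name through."""
    try:
        _validate_db_name(name)
        return True
    except ValueError:
        return False


samples = [
    "my_dataset",                        # plain identifier
    'x"; DROP DATABASE postgres; --',    # tries to break out of the quoted identifier
    "",                                  # empty
    "a" * 64,                            # one character over the 63-character limit
    "my-dataset",                        # hyphen is outside the allowed character set
]
results = {name: is_accepted(name) for name in samples}
```

Only `my_dataset` passes. The last sample is worth noting for reviewers: a hyphenated name is a legal quoted identifier in PostgreSQL, so if existing dataset names contain hyphens, this stricter check would start rejecting them.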
Natural language processing (NLP) is an interdisciplinary
subfield of computer science and information retrieval.

Andrej is an expert in NLP.
Note to self: Remove this nonsense
Description
Adds an option to ingest dlt sources and resources through the regular cognee pipeline. It currently works only with a Postgres destination and only takes schemas into account.
Acceptance Criteria
Type of Change
Screenshots
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.