feat(knowledge): add document conversion pipeline (PDF/Office to Markdown) #1031
DavidLeeUX wants to merge 3 commits into wecode-ai:main
Conversation
📝 Walkthrough

Adds a document conversion pipeline: a new conversion Celery queue and worker, MinerU integration for file→Markdown conversion, optional S3 image upload, distributed locking and staleness controls, and a new CONVERTING document status.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Orchestrator as Indexing Orchestrator
    participant ConvTask as Conversion Worker
    participant MinerU as MinerU API
    participant S3 as S3/MinIO
    participant StateMachine as State Machine
    participant IndexTask as Indexing Worker
    participant DB as Database

    Client->>Orchestrator: enqueue indexing request
    Orchestrator->>Orchestrator: needs_conversion(ext)?
    alt conversion required
        Orchestrator->>ConvTask: enqueue convert_document_task
        ConvTask->>ConvTask: acquire distributed lock (document_id)
        ConvTask->>DB: load attachment binary / metadata
        ConvTask->>StateMachine: mark_document_conversion_started()
        ConvTask->>MinerU: submit file for conversion
        MinerU-->>ConvTask: task id
        loop poll
            ConvTask->>MinerU: poll status
        end
        MinerU-->>ConvTask: conversion ZIP result
        ConvTask->>ConvTask: extract .md and images
        alt S3 enabled
            ConvTask->>S3: upload images
            ConvTask->>ConvTask: rewrite image URLs in .md (URL-encode)
        end
        ConvTask->>DB: overwrite attachment (.md)
        ConvTask->>StateMachine: mark_document_conversion_succeeded()
        ConvTask->>IndexTask: enqueue index_document_task
        ConvTask->>ConvTask: release lock
    else no conversion
        Orchestrator->>IndexTask: enqueue index_document_task
    end
    IndexTask->>DB: final generation/status checks and proceed with indexing
```
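For orientation, here is a minimal sketch of the branch point at the top of the diagram. The extension set and helper are illustrative stand-ins; the real check lives behind `settings.needs_conversion(...)` in the orchestrator.

```python
# Illustrative only: the real extension list lives in settings.needs_conversion().
CONVERTIBLE_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}

def needs_conversion(file_extension: str) -> bool:
    """True when the file must be converted to Markdown before indexing."""
    ext = file_extension if file_extension.startswith(".") else f".{file_extension}"
    return ext.lower() in CONVERTIBLE_EXTENSIONS

# Branch point from the diagram:
#   if needs_conversion(ext): enqueue convert_document_task
#   else:                     enqueue index_document_task directly
assert needs_conversion("PDF") and not needs_conversion(".md")
```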
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
🚥 Pre-merge checks: ✅ 5 passed checks
Actionable comments posted: 7
🧹 Nitpick comments (8)
docs/zh/getting-started/installation.md (1)
121-128: Clarify image URL accessibility requirements for converted-image uploads.

This section configures upload, but not visibility. Since Markdown references are replaced with URLs, add a note that uploaded objects must be publicly readable (or served via signed URL/CDN), otherwise images may not render for users.
Proposed doc addition
```diff
 # 文档转换 S3 图片存储配置(可选)
 # 启用后将转换提取的图片上传到 S3,并替换 Markdown 中的图片引用
+# 注意:若使用公开 URL 替换图片引用,请确保对象可访问(桶策略/CDN/签名 URL),
+# 否则前端可能无法加载图片。
 WORKER_CONVERSION_S3_ENABLED=false
 WORKER_CONVERSION_S3_ENDPOINT=http://minio:9000
 WORKER_CONVERSION_S3_ACCESS_KEY=your_access_key
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/zh/getting-started/installation.md` around lines 121 - 128, Update the S3 conversion docs to clarify image URL accessibility: in the section around the WORKER_CONVERSION_S3_* variables (e.g., WORKER_CONVERSION_S3_ENABLED, WORKER_CONVERSION_S3_BUCKET_NAME, WORKER_CONVERSION_S3_ENDPOINT), add a short note that converted images referenced in Markdown must be publicly accessible (objects made public) or delivered via signed URLs or a CDN; mention consequences if not (images won’t render for users) and briefly suggest common fixes (public-read ACL, bucket policy, signed URLs, or CDN).

docs/en/getting-started/installation.md (1)
121-128: Add an accessibility note for uploaded image URLs.

Since Markdown image paths are replaced with URLs, please document that objects must be accessible (public bucket policy/CDN/signed URL strategy), or image rendering may fail.
Proposed doc addition
```diff
 # Document Conversion S3 Storage Configuration (Optional)
 # Enable to upload extracted images to S3 and replace markdown image references
+# Ensure uploaded objects are accessible (public policy/CDN/signed URLs),
+# otherwise converted markdown images may not render in clients.
 WORKER_CONVERSION_S3_ENABLED=false
 WORKER_CONVERSION_S3_ENDPOINT=http://minio:9000
 WORKER_CONVERSION_S3_ACCESS_KEY=your_access_key
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/getting-started/installation.md` around lines 121 - 128, Add an accessibility note explaining that when WORKER_CONVERSION_S3_ENABLED is true and markdown image references are replaced by S3 URLs (configured via WORKER_CONVERSION_S3_ENDPOINT, WORKER_CONVERSION_S3_BUCKET_NAME, etc.), those objects must be reachable by the end user; instruct maintainers to ensure either a public bucket policy, a CDN in front of the bucket, or a signed URL strategy is in place so images render correctly, and include a short example sentence listing these three options and any security tradeoffs.

backend/app/services/knowledge/orchestrator.py (1)
1524-1693: Extract the conversion/direct-enqueue branches into helpers.

This method was already carrying state resolution, logging, and failure recovery; adding both enqueue paths here makes it noticeably harder to reason about and test. Pulling the conversion payload/query logic and the direct-index enqueue into small private helpers would keep the state-machine path readable. As per coding guidelines, "Maximum function length of 50 lines preferred for all functions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/knowledge/orchestrator.py` around lines 1524 - 1693, The _schedule_indexing_celery function is doing two different enqueue flows (conversion + conversion task lookup and direct indexing) inline; extract each branch into small private helpers to simplify the main flow: create e.g. _enqueue_conversion_task(db, document, knowledge_base, user, generation, retriever_name, retriever_namespace, embedding_model_name, embedding_model_namespace, splitter_config, trigger_summary) which moves the SubtaskContext lookup, original_filename resolution, builds the payload and calls convert_document_task.delay and logs/returns the async_result.id; and _enqueue_index_document_task(knowledge_base, document, user, generation, retriever_name, retriever_namespace, embedding_model_name, embedding_model_namespace, splitter_config, trigger_summary) which calls index_document_task.delay with the exact kwargs currently used and returns/logs the async_result.id; update _schedule_indexing_celery to call these helpers and use their returned task id for logging and any error handling.
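A structure-only sketch of the split this prompt describes; the Celery tasks are stubbed so the example runs standalone, and the payload fields are placeholders rather than the real kwargs.

```python
# Stub task objects standing in for convert_document_task / index_document_task.
class _StubResult:
    def __init__(self, task_id: str) -> None:
        self.id = task_id

class _StubTask:
    def __init__(self, name: str) -> None:
        self.name = name

    def delay(self, **kwargs) -> _StubResult:
        return _StubResult(f"{self.name}-task-id")

convert_document_task = _StubTask("convert_document_task")
index_document_task = _StubTask("index_document_task")

def _enqueue_conversion_task(document_id: int, generation: int, **index_kwargs) -> str:
    """Build the conversion payload, enqueue it, and return the task id."""
    payload = {"document_id": document_id, "generation": generation, **index_kwargs}
    return convert_document_task.delay(**payload).id

def _enqueue_index_document_task(document_id: int, generation: int, **index_kwargs) -> str:
    """Enqueue direct indexing and return the task id for logging."""
    return index_document_task.delay(
        document_id=document_id, generation=generation, **index_kwargs
    ).id
```

With both branches behind helpers, `_schedule_indexing_celery` shrinks to the decision plus logging, which is easier to keep under the 50-line guideline.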
backend/app/tasks/conversion_tasks.py (5)

457-458: Chain the original exception for better debugging.

🔧 Suggested fix
```diff
 except zipfile.BadZipFile:
-    raise RuntimeError("Invalid ZIP file from MinerU result")
+    raise RuntimeError("Invalid ZIP file from MinerU result") from None
```

Using `from None` here explicitly indicates this is an intentional replacement rather than an accidental omission.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 457 - 458, The except block catching zipfile.BadZipFile should preserve the original exception for debugging: change the handler to capture the original exception (e.g., "except zipfile.BadZipFile as e") and re-raise the RuntimeError using exception chaining (raise RuntimeError("Invalid ZIP file from MinerU result") from e) so the original traceback is retained; update the except zipfile.BadZipFile handler in conversion_tasks.py accordingly.
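A quick self-contained illustration of what chaining buys you: with `from e`, the original `BadZipFile` stays attached as `__cause__` and shows up in the traceback.

```python
import io
import zipfile

def read_result_zip(data: bytes) -> list[str]:
    try:
        return zipfile.ZipFile(io.BytesIO(data)).namelist()
    except zipfile.BadZipFile as e:
        raise RuntimeError("Invalid ZIP file from MinerU result") from e

try:
    read_result_zip(b"not a zip")
except RuntimeError as err:
    # The original error is preserved for debugging instead of being lost.
    assert isinstance(err.__cause__, zipfile.BadZipFile)
```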
223-224: Chain the original exception for better debugging.

Same issue as above: chain the original exception.
🔧 Suggested fix
```diff
 except Exception as e:
-    raise RuntimeError(f"Failed to download MinerU result: {e}")
+    raise RuntimeError(f"Failed to download MinerU result: {e}") from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 223 - 224, The except block that currently does "except Exception as e: raise RuntimeError(f'Failed to download MinerU result: {e}')" should chain the original exception so the traceback is preserved; update that raise to "raise RuntimeError(f'Failed to download MinerU result: {e}') from e" (in the same block in conversion_tasks.py where the MinerU download error is caught) so the original exception is attached to the new RuntimeError.
44-64: Consider thread-safety for global S3 client initialization.

The lazy initialization pattern for `_s3_client` could have a race condition if multiple Celery tasks start simultaneously on the same worker process. While unlikely to cause issues in practice (worst case: multiple clients created, one wins), you could use a lock for safety.

💡 Optional improvement
```diff
+import threading
+
 _s3_client = None
+_s3_client_lock = threading.Lock()

 def _get_s3_client():
     """Get or create S3 client for image upload."""
     global _s3_client
-    if _s3_client is None and settings.WORKER_CONVERSION_S3_ENABLED:
+    if _s3_client is not None:
+        return _s3_client
+    if not settings.WORKER_CONVERSION_S3_ENABLED:
+        return None
+    with _s3_client_lock:
+        if _s3_client is not None:
+            return _s3_client
         try:
             import boto3
             # ... rest of initialization
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 44 - 64, The lazy init of the module-global _s3_client in _get_s3_client is not thread-safe; add a module-level lock (e.g., _s3_client_lock = threading.Lock()) and wrap the creation block inside a with _s3_client_lock: critical section so you re-check if _s3_client is None before creating the boto3 client and logging; this prevents a race where multiple tasks create clients concurrently while preserving the existing behavior and exception handling.
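Filling in the elided part of that diff, a self-contained sketch of the double-checked pattern might look like the following. The `WORKER_CONVERSION_S3_*` names come from this PR's configuration; passing `settings` in explicitly (and the `SECRET_KEY` field name) are simplifications for the example.

```python
import threading

import boto3

_s3_client = None
_s3_client_lock = threading.Lock()

def _get_s3_client(settings):
    """Return a shared S3 client, creating it at most once per process."""
    global _s3_client
    if _s3_client is not None:  # fast path: no lock once initialized
        return _s3_client
    if not settings.WORKER_CONVERSION_S3_ENABLED:
        return None
    with _s3_client_lock:
        if _s3_client is None:  # re-check under the lock
            _s3_client = boto3.client(
                "s3",
                endpoint_url=settings.WORKER_CONVERSION_S3_ENDPOINT,
                aws_access_key_id=settings.WORKER_CONVERSION_S3_ACCESS_KEY,
                aws_secret_access_key=settings.WORKER_CONVERSION_S3_SECRET_KEY,
                region_name=settings.WORKER_CONVERSION_S3_REGION_NAME,
            )
    return _s3_client
```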
164-165: Chain the original exception for better debugging.

Per Ruff B904, exceptions raised within `except` blocks should chain the original error using `raise ... from err` to preserve the traceback.

🔧 Suggested fix
```diff
 except Exception as e:
-    raise RuntimeError(f"Failed to submit MinerU task: {e}")
+    raise RuntimeError(f"Failed to submit MinerU task: {e}") from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 164 - 165, The except block currently catches Exception as e and raises a new RuntimeError with the original message but does not chain the original exception; modify the raise in the handler that does raise RuntimeError(f"Failed to submit MinerU task: {e}") so it uses exception chaining by appending "from e" (i.e., raise RuntimeError(f"Failed to submit MinerU task: {e}") from e), keeping the existing message and variable names.
318-410: Nested function `process_image_upload` is quite long (~93 lines).

While the logic is self-contained, this nested function exceeds the 50-line guideline. Consider extracting it as a module-level helper function for better testability and readability. The function could be renamed to `_find_and_upload_image_from_zip`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 318 - 410, The nested helper process_image_upload is too long and should be extracted to a module-level function named _find_and_upload_image_from_zip; move its body out of the enclosing function, keep the same parameters (zf, img_path, alt_text, base_dir, original_ref) and return behavior, and ensure it accepts or returns any external state it uses (e.g., replace direct append to uploaded_images with returning the (try_path, s3_url) tuple or accept uploaded_images as an argument), preserve logging calls (logger) and content-type mapping, and update the original location to call _find_and_upload_image_from_zip(...) and handle its return value; add a brief docstring and unit tests for the new function for readability and testability.
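A signature sketch for the extracted helper (parameter types are assumptions; the body is omitted):

```python
from typing import Optional
import zipfile

def _find_and_upload_image_from_zip(
    zf: zipfile.ZipFile,
    img_path: str,
    alt_text: str,
    base_dir: str,
    original_ref: str,
) -> Optional[tuple[str, str]]:
    """Locate img_path inside the MinerU result ZIP, upload it to S3, and
    return (resolved_path, s3_url), or None if nothing matched.

    Returning the tuple instead of appending to a closed-over list keeps
    the function testable in isolation, as the review suggests.
    """
    ...
```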
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/.env.example`:
- Around line 233-242: KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS (currently 1800)
is shorter than the allowed conversion runtime and will mark healthy
long-running conversions as stuck; change its value to be at least the hard/soft
task timeout (e.g., >= KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS / 10000) or
derive it from that constant (set to 10000 or slightly higher) so the stale
window never expires before the worker's permitted runtime.
In `@backend/app/services/knowledge/orchestrator.py`:
- Around line 1636-1669: The conversion branch should normalize
document.file_extension before checking and before passing to the convert task:
call and use the existing _normalize_file_extension(document.file_extension)
value when calling settings.needs_conversion(...) instead of raw
document.file_extension, and pass that same normalized extension into
convert_document_task.delay(...) (replace the file_extension argument currently
using document.file_extension). Update the references around the
needs_conversion(...) check, the original_filename lookup, and the
convert_document_task call so both the conditional and the task receive the
normalized extension.
In `@backend/app/tasks/conversion_tasks.py`:
- Around line 467-471: The comment describing the task timeouts is incorrect: it
claims "30 minutes soft, 35 minutes hard" while the actual arguments are
soft_time_limit=9000 and time_limit=10000; either update the comment to state
the actual durations (soft_time_limit=9000 → 150 minutes, time_limit=10000 →
~167 minutes) or change the arguments to match the comment (set
soft_time_limit=1800 and time_limit=2100 for 30/35 minutes). Modify the comment
or the soft_time_limit/time_limit values accordingly so the comment and the
soft_time_limit/time_limit settings match.
In `@backend/pyproject.toml`:
- Line 131: Replace the invalid dependency spec "boto3>=1.42.96" in
pyproject.toml with "boto3>=1.42.97" so the resolver can find a published
version; locate the dependency entry containing the exact string
"boto3>=1.42.96" and update it to "boto3>=1.42.97".
In `@docker/conversion/Dockerfile`:
- Around line 29-35: Comment and ENV mismatch: the Dockerfile documents a pod
sized for 2 CPU-heavy conversion jobs but sets ENV CELERY_CONCURRENCY=10; change
the environment variable CELERY_CONCURRENCY to 2 (or to the intended concurrency
constant) and update the surrounding comment if policy changed so
CELERY_CONCURRENCY and the documented "Concurrency" line consistently reflect
the intended default for the conversion worker.
In `@frontend/public/mockServiceWorker.js`:
- Line 1: The generated MSW worker file mockServiceWorker.js was accidentally
edited; restore the original generated content by discarding manual changes and
re-generating the worker (e.g., reset the file to the committed/generated
version or run the MSW init command to recreate it), ensuring no custom
modifications remain so MSW integrity checks pass; after restoring, verify the
file matches the official generated output exactly.
In `@frontend/src/features/knowledge/document/components/DocumentItem.tsx`:
- Around line 156-159: The compact-mode tooltip currently treats both 'indexing'
and 'converting' as the same and always shows an indexing message; update the
conditional message logic in DocumentItem.tsx so that when isConverting (the
const isConverting = document.index_status === 'converting') is true it shows a
converting-specific tooltip (e.g., "Converting...") instead of the indexing
text, while keeping the existing indexing message for isBackendIndexing; apply
this change to both places where showIndexingState/isBackendIndexing is used for
compact tooltips (the block using isConverting/isBackendIndexing around the
consts and the second occurrence near the alternate tooltip/label logic
referenced at lines 289-293).
---
Nitpick comments:
In `@backend/app/services/knowledge/orchestrator.py`:
- Around line 1524-1693: The _schedule_indexing_celery function is doing two
different enqueue flows (conversion + conversion task lookup and direct
indexing) inline; extract each branch into small private helpers to simplify the
main flow: create e.g. _enqueue_conversion_task(db, document, knowledge_base,
user, generation, retriever_name, retriever_namespace, embedding_model_name,
embedding_model_namespace, splitter_config, trigger_summary) which moves the
SubtaskContext lookup, original_filename resolution, builds the payload and
calls convert_document_task.delay and logs/returns the async_result.id; and
_enqueue_index_document_task(knowledge_base, document, user, generation,
retriever_name, retriever_namespace, embedding_model_name,
embedding_model_namespace, splitter_config, trigger_summary) which calls
index_document_task.delay with the exact kwargs currently used and returns/logs
the async_result.id; update _schedule_indexing_celery to call these helpers and
use their returned task id for logging and any error handling.
In `@backend/app/tasks/conversion_tasks.py`:
- Around line 457-458: The except block catching zipfile.BadZipFile should
preserve the original exception for debugging: change the handler to capture the
original exception (e.g., "except zipfile.BadZipFile as e") and re-raise the
RuntimeError using exception chaining (raise RuntimeError("Invalid ZIP file from
MinerU result") from e) so the original traceback is retained; update the except
zipfile.BadZipFile handler in conversion_tasks.py accordingly.
- Around line 223-224: The except block that currently does "except Exception as
e: raise RuntimeError(f'Failed to download MinerU result: {e}')" should chain
the original exception so the traceback is preserved; update that raise to
"raise RuntimeError(f'Failed to download MinerU result: {e}') from e" (in the
same block in conversion_tasks.py where the MinerU download error is caught) so
the original exception is attached to the new RuntimeError.
- Around line 44-64: The lazy init of the module-global _s3_client in
_get_s3_client is not thread-safe; add a module-level lock (e.g.,
_s3_client_lock = threading.Lock()) and wrap the creation block inside a with
_s3_client_lock: critical section so you re-check if _s3_client is None before
creating the boto3 client and logging; this prevents a race where multiple tasks
create clients concurrently while preserving the existing behavior and exception
handling.
- Around line 164-165: The except block currently catches Exception as e and
raises a new RuntimeError with the original message but does not chain the
original exception; modify the raise in the handler that does raise
RuntimeError(f"Failed to submit MinerU task: {e}") so it uses exception chaining
by appending "from e" (i.e., raise RuntimeError(f"Failed to submit MinerU task:
{e}") from e), keeping the existing message and variable names.
- Around line 318-410: The nested helper process_image_upload is too long and
should be extracted to a module-level function named
_find_and_upload_image_from_zip; move its body out of the enclosing function,
keep the same parameters (zf, img_path, alt_text, base_dir, original_ref) and
return behavior, and ensure it accepts or returns any external state it uses
(e.g., replace direct append to uploaded_images with returning the (try_path,
s3_url) tuple or accept uploaded_images as an argument), preserve logging calls
(logger) and content-type mapping, and update the original location to call
_find_and_upload_image_from_zip(...) and handle its return value; add a brief
docstring and unit tests for the new function for readability and testability.
In `@docs/en/getting-started/installation.md`:
- Around line 121-128: Add an accessibility note explaining that when
WORKER_CONVERSION_S3_ENABLED is true and markdown image references are replaced
by S3 URLs (configured via WORKER_CONVERSION_S3_ENDPOINT,
WORKER_CONVERSION_S3_BUCKET_NAME, etc.), those objects must be reachable by the
end user; instruct maintainers to ensure either a public bucket policy, a CDN in
front of the bucket, or a signed URL strategy is in place so images render
correctly, and include a short example sentence listing these three options and
any security tradeoffs.
In `@docs/zh/getting-started/installation.md`:
- Around line 121-128: Update the S3 conversion docs to clarify image URL
accessibility: in the section around the WORKER_CONVERSION_S3_* variables (e.g.,
WORKER_CONVERSION_S3_ENABLED, WORKER_CONVERSION_S3_BUCKET_NAME,
WORKER_CONVERSION_S3_ENDPOINT), add a short note that converted images
referenced in Markdown must be publicly accessible (objects made public) or
delivered via signed URLs or a CDN; mention consequences if not (images won’t
render for users) and briefly suggest common fixes (public-read ACL, bucket
policy, signed URLs, or CDN).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9a524a53-820b-40e2-ae7a-f846891a7fa3
📒 Files selected for processing (19)
- backend/.env.example
- backend/app/core/celery_app.py
- backend/app/core/config.py
- backend/app/models/knowledge.py
- backend/app/schemas/knowledge.py
- backend/app/services/knowledge/index_state_machine.py
- backend/app/services/knowledge/orchestrator.py
- backend/app/tasks/conversion_tasks.py
- backend/pyproject.toml
- docker/conversion/Dockerfile
- docs/en/developer-guide/knowledge-indexing-protection.md
- docs/en/getting-started/installation.md
- docs/zh/developer-guide/knowledge-indexing-protection.md
- docs/zh/getting-started/installation.md
- frontend/public/mockServiceWorker.js
- frontend/src/features/knowledge/document/components/DocumentItem.tsx
- frontend/src/i18n/locales/en/knowledge.json
- frontend/src/i18n/locales/zh-CN/knowledge.json
- frontend/src/types/knowledge.ts
```
# Stale detection timeout for CONVERTING status (seconds)
# 转换状态过期时间(秒),超过此时间视为卡住,可被重新入队
KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS=1800

# Conversion task distributed lock configuration
# 转换任务分布式锁配置
KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS=10000
KNOWLEDGE_CONVERSION_LOCK_EXTEND_INTERVAL_SECONDS=60
KNOWLEDGE_CONVERSION_LOCK_MAX_RETRIES=2
KNOWLEDGE_CONVERSION_LOCK_RETRY_DELAY_SECONDS=30
```
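For context, one plausible way these four knobs map onto a redis-py lock (the actual lock implementation is not part of this diff, so treat the names and values as illustrative):

```python
import redis

r = redis.Redis()

def run_with_conversion_lock(document_id: int, work) -> bool:
    # timeout ~ KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS: the lock auto-expires
    # if the worker dies. A background watchdog would call lock.extend(60)
    # every KNOWLEDGE_CONVERSION_LOCK_EXTEND_INTERVAL_SECONDS to keep it alive.
    lock = r.lock(f"conversion:{document_id}", timeout=10000)
    # blocking_timeout ~ KNOWLEDGE_CONVERSION_LOCK_RETRY_DELAY_SECONDS; the caller
    # would retry acquisition up to KNOWLEDGE_CONVERSION_LOCK_MAX_RETRIES times.
    if not lock.acquire(blocking=True, blocking_timeout=30):
        return False
    try:
        work()
        return True
    finally:
        lock.release()
```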
The converting stale window is shorter than the allowed task runtime.
KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS=1800 marks a conversion as stuck after 30 minutes, but the conversion task is allowed to run up to 9000s soft / 10000s hard. That means a healthy long-running conversion can be taken over as “stale” long before the worker actually times out, which undermines the new generation/state-machine logic.
🧰 Tools
🪛 dotenv-linter (4.0.0)
[warning] 240-240: [UnorderedKey] The KNOWLEDGE_CONVERSION_LOCK_EXTEND_INTERVAL_SECONDS key should go before the KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS key
(UnorderedKey)
[warning] 241-241: [UnorderedKey] The KNOWLEDGE_CONVERSION_LOCK_MAX_RETRIES key should go before the KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS key
(UnorderedKey)
[warning] 242-242: [UnorderedKey] The KNOWLEDGE_CONVERSION_LOCK_RETRY_DELAY_SECONDS key should go before the KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS key
(UnorderedKey)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/.env.example` around lines 233 - 242,
KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS (currently 1800) is shorter than the
allowed conversion runtime and will mark healthy long-running conversions as
stuck; change its value to be at least the hard/soft task timeout (e.g., >=
KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS / 10000) or derive it from that
constant (set to 10000 or slightly higher) so the stale window never expires
before the worker's permitted runtime.
```diff
@@ -1,3 +1,4 @@
+
```
Revert accidental edit in generated MSW worker file.
Line 1 introduces a content change in a generated file that should stay untouched; this can break MSW integrity checks in dev.
Proposed fix
```diff
-
 /* tslint:disable */
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```
/* tslint:disable */
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@frontend/public/mockServiceWorker.js` at line 1, The generated MSW worker
file mockServiceWorker.js was accidentally edited; restore the original
generated content by discarding manual changes and re-generating the worker
(e.g., reset the file to the committed/generated version or run the MSW init
command to recreate it), ensuring no custom modifications remain so MSW
integrity checks pass; after restoring, verify the file matches the official
generated output exactly.
Force-pushed c1e97a1 to b6d746f
Actionable comments posted: 3
🧹 Nitpick comments (1)
backend/app/tasks/conversion_tasks.py (1)
472-490: Add an explicit return type to the Celery task.

This is a public Python entrypoint and currently has no return annotation. Please add one so the task contract is explicit.
As per coding guidelines, "Use type hints for all public functions and classes in Python".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 472 - 490, The Celery task function convert_document_task is missing an explicit return type; add a return annotation to its signature (for example -> dict[str, Any] if it returns a result dict, or -> None if it doesn't return anything) and update any internal returns to match that type; import Any from typing if using a generic annotation and ensure the annotated return type reflects the actual values returned by convert_document_task throughout its body.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/services/knowledge/orchestrator.py`:
- Around line 1638-1653: The conversion task is being given index_owner_user_id
as user_id which is then used by convert_document_task ->
context_service.overwrite_attachment(...) and wrongly filters by the KB owner
instead of the attachment uploader; update the convert_document_task call to
pass the attachment/uploader's user id (use the document's uploader field, e.g.
document.user_id) as the parameter used for overwrite (keep the existing
index_owner_user_id argument for downstream indexing operations), so change
user_id=index_owner_user_id to user_id=document.user_id (or the correct document
uploader property) and ensure convert_document_task still receives
index_owner_user_id separately for indexing steps.
In `@backend/app/tasks/conversion_tasks.py`:
- Around line 622-647: The task currently overwrites the attachment via
context_service.overwrite_attachment(...) before checking
mark_document_conversion_succeeded(...), allowing stale conversions to clobber
newer content; change the flow so you first open a DB session and call
mark_document_conversion_succeeded(db, document_id, generation=index_generation)
and only if that call returns success/True proceed to open a SessionLocal and
call context_service.overwrite_attachment(db=..., context_id=attachment_id,
user_id=user_id, filename=md_filename, binary_data=markdown_bytes); if
mark_document_conversion_succeeded indicates failure, log the stale-rejection
(include attachment_id/document_id/index_generation) and exit without mutating
the attachment.
- Around line 395-401: The branch that handles successful S3 uploads currently
appends (try_path, s3_url) to uploaded_images and returns the bare s3_url when
alt_text is falsy, which breaks Markdown image rendering; update that return so
it always returns Markdown image syntax (e.g. f"![{alt_text}]({s3_url})") even
when alt_text is empty. Locate the block that checks `if s3_url:` (the code that
appends to `uploaded_images` and returns based on `alt_text`) and replace the
conditional return with a single return that formats the image as Markdown using
`alt_text` (allowing it to be an empty string) and `s3_url`.
---
Nitpick comments:
In `@backend/app/tasks/conversion_tasks.py`:
- Around line 472-490: The Celery task function convert_document_task is missing
an explicit return type; add a return annotation to its signature (for example
-> dict[str, Any] if it returns a result dict, or -> None if it doesn't return
anything) and update any internal returns to match that type; import Any from
typing if using a generic annotation and ensure the annotated return type
reflects the actual values returned by convert_document_task throughout its
body.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cdccbd36-e0b1-4fe6-a581-2d4de151e120
📒 Files selected for processing (5)
- backend/app/services/knowledge/orchestrator.py
- backend/app/tasks/conversion_tasks.py
- backend/pyproject.toml
- docker/conversion/Dockerfile
- frontend/src/features/knowledge/document/components/DocumentItem.tsx
✅ Files skipped from review due to trivial changes (1)
- backend/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- frontend/src/features/knowledge/document/components/DocumentItem.tsx
```python
if s3_url:
    uploaded_images.append((try_path, s3_url))
    return (
        f"![{alt_text}]({s3_url})"
        if alt_text
        else s3_url
    )
```
Preserve Markdown image syntax when alt text is empty.
For `![](...)`, this branch returns the bare S3 URL, so the image stops rendering and becomes plain text. Always return `![alt](url)` markup here, even when alt_text is empty.
Proposed fix
```diff
-    return (
-        f"![{alt_text}]({s3_url})"
-        if alt_text
-        else s3_url
-    )
+    return f"![{alt_text}]({s3_url})"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/tasks/conversion_tasks.py` around lines 395 - 401, The branch
that handles successful S3 uploads currently appends (try_path, s3_url) to
uploaded_images and returns the bare s3_url when alt_text is falsy, which breaks
Markdown image rendering; update that return so it always returns Markdown image
syntax (e.g. f"![{alt_text}]({s3_url})") even when alt_text is empty. Locate the
block that checks `if s3_url:` (the code that appends to `uploaded_images` and
returns based on `alt_text`) and replace the conditional return with a single
return that formats the image as Markdown using `alt_text` (allowing it to be an
empty string) and `s3_url`.
```python
# Overwrite attachment with Markdown content
# The .md filename causes overwrite_attachment to update
# type_data.file_extension to ".md", which the indexer uses
md_filename = f"{original_filename}.md"
with SessionLocal() as db:
    context_service.overwrite_attachment(
        db=db,
        context_id=attachment_id,
        user_id=user_id,
        filename=md_filename,
        binary_data=markdown_bytes,
    )

logger.info(
    f"[Conversion] Attachment overwritten: "
    f"attachment_id={attachment_id}, "
    f"new_filename={md_filename}"
)

# State transition: CONVERTING -> QUEUED
with SessionLocal() as db:
    succeeded = mark_document_conversion_succeeded(
        db=db,
        document_id=document_id,
        generation=index_generation,
    )
```
Reject stale generations before mutating the attachment.
This task overwrites the attachment first and only then checks mark_document_conversion_succeeded(...). If a newer generation was queued while MinerU was running, the stale task still clobbers the attachment content before being rejected. The generation/status guard needs to happen before overwrite_attachment(...), otherwise stale-task protection does not actually protect the stored document.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/tasks/conversion_tasks.py` around lines 622 - 647, The task
currently overwrites the attachment via
context_service.overwrite_attachment(...) before checking
mark_document_conversion_succeeded(...), allowing stale conversions to clobber
newer content; change the flow so you first open a DB session and call
mark_document_conversion_succeeded(db, document_id, generation=index_generation)
and only if that call returns success/True proceed to open a SessionLocal and
call context_service.overwrite_attachment(db=..., context_id=attachment_id,
user_id=user_id, filename=md_filename, binary_data=markdown_bytes); if
mark_document_conversion_succeeded indicates failure, log the stale-rejection
(include attachment_id/document_id/index_generation) and exit without mutating
the attachment.
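Reordered per the review, the tail of the task might look like the sketch below. It reuses the names from the snippet above and folds both steps into one session, which is a simplification; the real code may need an explicit transaction.

```python
# State transition first: only mutate the attachment if still current.
with SessionLocal() as db:
    succeeded = mark_document_conversion_succeeded(
        db=db,
        document_id=document_id,
        generation=index_generation,
    )
    if not succeeded:
        logger.info(
            f"[Conversion] Stale generation rejected: "
            f"document_id={document_id}, attachment_id={attachment_id}, "
            f"generation={index_generation}"
        )
    else:
        context_service.overwrite_attachment(
            db=db,
            context_id=attachment_id,
            user_id=user_id,
            filename=md_filename,
            binary_data=markdown_bytes,
        )
```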
Add a MinerU document conversion pipeline to the knowledge-base indexing flow, so PDF, DOCX, PPTX, XLSX, and similar formats are converted to Markdown before RAG indexing.

Main changes:
- Add conversion_tasks.py: async Celery conversion task calling the MinerU API
- Add docker/conversion/Dockerfile: dedicated conversion worker image
- Add config options: KNOWLEDGE_CONVERSION_*, MINERU_*, etc.
- Update celery_app.py: route tasks to the dedicated knowledge_conversion queue
- Update orchestrator.py: decide from the file extension whether conversion is needed
- Update index_state_machine.py: add the CONVERTING status and its transitions
- Frontend: DocumentItem component shows a "Converting" status; i18n translations added

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Images extracted during MinerU document conversion can be uploaded to S3-compatible storage, with the image references in the Markdown replaced by public S3 URLs.

Main changes:
- Add S3 client initialization (lazily loaded boto3)
- _extract_markdown_from_zip now extracts images and uploads them to S3
- URL encoding supports Chinese knowledge-base names
- S3 path format: {knowledge_base_name}/{filename}/images/xxx.jpg
- Increase task timeouts (soft: 150 minutes, hard: 166 minutes)

Config options:
- WORKER_CONVERSION_S3_ENABLED: enable S3 upload
- WORKER_CONVERSION_S3_ENDPOINT: S3 service endpoint
- WORKER_CONVERSION_S3_ACCESS_KEY/SECRET_KEY: access credentials
- WORKER_CONVERSION_S3_BUCKET_NAME: bucket name
- WORKER_CONVERSION_S3_REGION_NAME: region name

Dependencies:
- Add the boto3 package
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
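The path format and URL encoding described above can be illustrated with a short sketch (the helper name is hypothetical; only `urllib.parse.quote` is standard library):

```python
from urllib.parse import quote

def build_image_key(kb_name: str, filename: str, image_name: str) -> str:
    """Build '{knowledge_base_name}/{filename}/images/xxx.jpg' with
    percent-encoded segments so non-ASCII (e.g. Chinese) names stay URL-safe."""
    return f"{quote(kb_name)}/{quote(filename)}/images/{image_name}"

# build_image_key("知识库", "报告.pdf", "img_001.jpg")
# -> '%E7%9F%A5%E8%AF%86%E5%BA%93/%E6%8A%A5%E5%91%8A.pdf/images/img_001.jpg'
```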
Force-pushed b6d746f to 6f1eb04
Actionable comments posted: 1
♻️ Duplicate comments (3)
backend/app/tasks/conversion_tasks.py (2)
622-647: ⚠️ Potential issue | 🔴 Critical

Reject stale generations before overwriting the attachment.

`overwrite_attachment(...)` still runs before `mark_document_conversion_succeeded(...)`. If a newer generation was queued while MinerU was running, this stale task can clobber the stored attachment even though the state transition is rejected afterward. Move the generation/state guard ahead of the write, or make both steps atomic in the same transaction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 622 - 647, The overwrite_attachment call currently runs before the generation/state guard, allowing a stale conversion to clobber stored data; ensure you reject stale generations before writing by moving the generation/state validation ahead of the write or by performing both operations in the same DB transaction. Specifically, call mark_document_conversion_succeeded (or otherwise validate generation) using the same SessionLocal/DB session first and check its returned succeeded boolean before calling context_service.overwrite_attachment, or open a single SessionLocal transaction and perform mark_document_conversion_succeeded and context_service.overwrite_attachment inside it so the write only occurs if the state transition succeeds.
395-401: ⚠️ Potential issue | 🟠 Major

Keep Markdown image syntax when alt text is empty.

For `![](...)`, this branch returns the bare S3 URL, so the image stops rendering and becomes plain text. Always return Markdown image syntax, even when `alt_text` is empty.

Suggested fix

```diff
-    return (
-        f"![{alt_text}]({s3_url})"
-        if alt_text
-        else s3_url
-    )
+    return f"![{alt_text}]({s3_url})"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tasks/conversion_tasks.py` around lines 395 - 401, The branch that handles successful S3 uploads currently appends to uploaded_images and returns the bare s3_url when alt_text is empty, breaking Markdown rendering; change the return to always use Markdown image syntax. In the block that appends uploaded_images (variables: uploaded_images, try_path, s3_url) replace the conditional return with a single return that formats the image as f"![{alt_text}]({s3_url})" so empty alt_text yields "![](s3_url)" and the image still renders.

backend/.env.example (1)
233-235: ⚠️ Potential issue | 🟠 Major

Raise the converting stale window above the task runtime.

`KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS=1800` is still below the conversion worker's `soft_time_limit=9000` / `time_limit=10000`. Because `docker/conversion/Dockerfile` seeds `.env` from this file when it is missing, a healthy long-running conversion can be reclaimed as stale after 30 minutes and re-enqueued while the original worker is still running. Set this to at least the hard limit, ideally with a small buffer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/.env.example` around lines 233 - 235, The stale timeout KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS is too low (1800) compared to the conversion worker limits (soft_time_limit=9000, time_limit=10000) and can cause long-running tasks to be re-enqueued; update the value seeded by docker/conversion/Dockerfile to at least the hard time_limit (10000) plus a small buffer (e.g., 11000) so converting tasks are not marked stale while the original worker is still running.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/core/config.py`:
- Around line 347-355: Adjust the default timeouts so they exceed the
convert_document_task soft_time_limit (9000/10000s); specifically update
KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS to a value >= the task soft limit
(e.g., 10000–11000s) and set KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS to be
longer than that soft_time_limit (e.g., slightly above the chosen stale timeout)
so the lock comment ("longer than task soft_time_limit") holds; keep the lock
extend/ retry settings (KNOWLEDGE_CONVERSION_LOCK_EXTEND_INTERVAL_SECONDS,
KNOWLEDGE_CONVERSION_LOCK_MAX_RETRIES,
KNOWLEDGE_CONVERSION_LOCK_RETRY_DELAY_SECONDS) as-is unless you want to adjust
their cadence to match the new longer timeout.
---
Duplicate comments:
In `@backend/.env.example`:
- Around line 233-235: The stale timeout
KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS is too low (1800) compared to the
conversion worker limits (soft_time_limit=9000, time_limit=10000) and can cause
long-running tasks to be re-enqueued; update the value seeded by
docker/conversion/Dockerfile to at least the hard time_limit (10000) plus a
small buffer (e.g., 11000) so converting tasks are not marked stale while the
original worker is still running.
In `@backend/app/tasks/conversion_tasks.py`:
- Around line 622-647: The overwrite_attachment call currently runs before the
generation/state guard, allowing a stale conversion to clobber stored data;
ensure you reject stale generations before writing by moving the
generation/state validation ahead of the write or by performing both operations
in the same DB transaction. Specifically, call
mark_document_conversion_succeeded (or otherwise validate generation) using the
same SessionLocal/DB session first and check its returned succeeded boolean
before calling context_service.overwrite_attachment, or open a single
SessionLocal transaction and perform mark_document_conversion_succeeded and
context_service.overwrite_attachment inside it so the write only occurs if the
state transition succeeds.
- Around line 395-401: The branch that handles successful S3 uploads currently
appends to uploaded_images and returns the bare s3_url when alt_text is empty,
breaking Markdown rendering; change the return to always use Markdown image
syntax. In the block that appends uploaded_images (variables: uploaded_images,
try_path, s3_url) replace the conditional return with a single return that
formats the image as f"![{alt_text}]({s3_url})" so empty alt_text yields
"![](s3_url)" and the image still renders.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 771c7ab0-ed61-4647-bc49-37ee7c1d1e4e
📒 Files selected for processing (18)
- backend/.env.example
- backend/app/core/celery_app.py
- backend/app/core/config.py
- backend/app/models/knowledge.py
- backend/app/schemas/knowledge.py
- backend/app/services/knowledge/index_state_machine.py
- backend/app/services/knowledge/orchestrator.py
- backend/app/tasks/conversion_tasks.py
- backend/pyproject.toml
- docker/conversion/Dockerfile
- docs/en/developer-guide/knowledge-indexing-protection.md
- docs/en/getting-started/installation.md
- docs/zh/developer-guide/knowledge-indexing-protection.md
- docs/zh/getting-started/installation.md
- frontend/src/features/knowledge/document/components/DocumentItem.tsx
- frontend/src/i18n/locales/en/knowledge.json
- frontend/src/i18n/locales/zh-CN/knowledge.json
- frontend/src/types/knowledge.ts
✅ Files skipped from review due to trivial changes (5)
- backend/pyproject.toml
- backend/app/models/knowledge.py
- frontend/src/types/knowledge.ts
- frontend/src/i18n/locales/zh-CN/knowledge.json
- frontend/src/i18n/locales/en/knowledge.json
🚧 Files skipped from review as they are similar to previous changes (8)
- docs/en/getting-started/installation.md
- backend/app/core/celery_app.py
- docs/zh/getting-started/installation.md
- frontend/src/features/knowledge/document/components/DocumentItem.tsx
- backend/app/services/knowledge/orchestrator.py
- docs/en/developer-guide/knowledge-indexing-protection.md
- backend/app/services/knowledge/index_state_machine.py
- docs/zh/developer-guide/knowledge-indexing-protection.md
```python
# Stale detection timeout for CONVERTING status (seconds, default 30 min)
KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS: int = 1800

# Conversion task distributed lock configuration
# Lock timeout should be longer than task soft_time_limit to prevent premature expiration
KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS: int = 2000
KNOWLEDGE_CONVERSION_LOCK_EXTEND_INTERVAL_SECONDS: int = 60
KNOWLEDGE_CONVERSION_LOCK_MAX_RETRIES: int = 2
KNOWLEDGE_CONVERSION_LOCK_RETRY_DELAY_SECONDS: int = 30
```
Default conversion timeouts still undershoot the worker limits.
These defaults can mark or unlock an in-flight conversion long before convert_document_task reaches its 9000/10000 second limits. In particular, Line 348 (1800) is below the allowed runtime, and Line 352 (2000) also contradicts the “longer than task soft_time_limit” comment. Please align the code defaults with the task limits so deployments that rely on defaults don’t reclaim healthy work.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/core/config.py` around lines 347 - 355, Adjust the default
timeouts so they exceed the convert_document_task soft_time_limit (9000/10000s);
specifically update KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS to a value >= the
task soft limit (e.g., 10000–11000s) and set
KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS to be longer than that soft_time_limit
(e.g., slightly above the chosen stale timeout) so the lock comment ("longer
than task soft_time_limit") holds; keep the lock extend/ retry settings
(KNOWLEDGE_CONVERSION_LOCK_EXTEND_INTERVAL_SECONDS,
KNOWLEDGE_CONVERSION_LOCK_MAX_RETRIES,
KNOWLEDGE_CONVERSION_LOCK_RETRY_DELAY_SECONDS) as-is unless you want to adjust
their cadence to match the new longer timeout.
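One way to keep these values from drifting apart again is to validate the relationship at settings load time. A hedged sketch using pydantic v2 (the validator does not exist in the PR; constant and field names are taken from it):

```python
from pydantic import model_validator
from pydantic_settings import BaseSettings

CONVERSION_HARD_TIME_LIMIT = 10000  # must track convert_document_task's time_limit

class Settings(BaseSettings):
    KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS: int = CONVERSION_HARD_TIME_LIMIT + 600
    KNOWLEDGE_CONVERSION_LOCK_TIMEOUT_SECONDS: int = CONVERSION_HARD_TIME_LIMIT + 300

    @model_validator(mode="after")
    def _stale_window_exceeds_runtime(self) -> "Settings":
        if self.KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS < CONVERSION_HARD_TIME_LIMIT:
            raise ValueError(
                "KNOWLEDGE_INDEX_STALE_CONVERTING_SECONDS must be >= the "
                "conversion task's hard time limit"
            )
        return self
```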
Introduces a core document conversion pipeline into the RAG indexing workflow. This feature leverages MinerU to transform complex formats (PDF, DOCX, PPTX) into structured Markdown to improve retrieval quality.
Core Functionality:
Asynchronous Conversion: Added a dedicated Celery worker and queue (knowledge_conversion) to handle heavy conversion tasks.
Workflow Integration: Updated the orchestrator and index_state_machine to include a new CONVERTING stage, ensuring a seamless transition from raw files to indexed content.
UI/UX Updates: Added a "Converting" status indicator in the frontend DocumentItem component with full i18n support.
Key Features & Enhancements:
Image Hosting: Automatically extracts images from converted documents and uploads them to S3-compatible storage, replacing local paths with public URLs.
Resilience: Increased task timeouts and added retry-safe S3 client initialization.
Infrastructure: Provided a specialized Docker container for the conversion worker and expanded configuration for MinerU and S3 environments.
Summary by CodeRabbit
New Features
Documentation