
feat(skills): add dingtalk-knowledge skill for uploading DingTalk doc… #994

Open
parabala wants to merge 12 commits into wecode-ai:main from parabala:feature/dingtalk-upload-knowledge

Conversation

@parabala parabala (Collaborator) commented Apr 17, 2026

feat(skills): add dingtalk-knowledge skill for uploading DingTalk documents to knowledge base

Add new skill to upload DingTalk documents, spreadsheets, and AI tables to Wegent knowledge bases. Includes:

  • SKILL.md: Comprehensive documentation with workflow and examples
  • provider.py: Tool provider with two tools:
    • download_dingtalk_document: Download file-based docs via sandbox
    • save_dingtalk_content: Save online doc content (adoc) as markdown

Supports multiple DingTalk document types:

  • File-based documents (docx, xlsx, pdf) via download_file MCP
  • Online documents (adoc) via get_document_content MCP
  • Spreadsheets (axls) via dingtalk-table MCP
  • AI tables (able) via dingtalk-ai-table MCP

Integrates with wegent-knowledge MCP to create documents with proper source attribution and trigger indexing.

Summary by CodeRabbit

  • New Features

    • Added DingTalk document ingestion with support for multiple content types and attachment-based uploads.
    • New upload tools to import files from URLs and to upload text content directly; these tools are now registered and available.
  • Documentation

    • Updated skill docs with DingTalk workflow examples, MCP configuration guidance, and instructions for attachment-based document creation.


coderabbitai Bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

Adds DingTalk ingestion support: new dingtalk_upload MCP tools that upload files/content as Wegent attachments, registers those tools in the MCP server, updates example tool names to wegent_kb_*, exports DINGTALK_UPLOAD_MCP_TOOLS, and extends Wegent knowledge skill docs and MCP server entries.

Changes

Cohort / File(s) Summary
MCP Tool Exports & Examples
backend/app/mcp_server/tools/__init__.py, backend/app/mcp_server/tools/decorator.py, backend/app/mcp_server/server.py
Module docstrings and examples updated to wegent_kb_* naming; DINGTALK_UPLOAD_MCP_TOOLS added to exports; server now imports dingtalk_upload so its decorated tools are registered.
DingTalk Upload Tools
backend/app/mcp_server/tools/dingtalk_upload.py
New module adding dingtalk_upload_file_from_url and dingtalk_upload_content MCP tools. Both resolve user via DB session, validate inputs, (for URL) perform SSRF-safe streamed download with IP checks and max-size cap, upload bytes via context_service.upload_attachment, return attachment metadata or structured errors, and ensure DB session cleanup/rollback on exceptions.
Skill Documentation & MCP Config
backend/init_data/skills/wegent-knowledge/SKILL.md
Skill docs updated: version/description expanded for DingTalk, added MCP server entries (dingtalk-docs, dingtalk-table, dingtalk-ai-table), documented source_type="attachment" and attachment_id for wegent_kb_create_document, and added docs/examples for DingTalk upload tools and ingestion workflow.

Sequence Diagram(s)

sequenceDiagram
    participant Client as MCP Client
    participant Tool as dingtalk_upload_file_from_url
    participant DB as Database
    participant HTTP as Remote HTTP
    participant Service as ContextService

    Client->>Tool: invoke(download_url, filename, token_info)
    Tool->>DB: open session / resolve user
    alt user missing
        Tool-->>Client: {"error": "user not found"}
    else user found
        Tool->>HTTP: GET download_url (SSRF checks, DNS->IP check, timeout, max bytes)
        HTTP-->>Tool: stream bytes / error
        alt download success
            Tool->>Service: upload_attachment(bytes, filename, mime)
            Service-->>Tool: attachment metadata
            Tool-->>Client: metadata {attachment_id, filename, size, mime_type, truncated}
        else download/validation error
            Tool-->>Client: {"error": "..."}
        end
    end
    Tool->>DB: commit/rollback and close session

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through URLs and DingTalk streams,
I guard each byte and stitch your dreams.
Into Wegent's burrow safe they glide,
As attachments snug, I cheer with pride.
A rabbit coder, swift and spry.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The PR title accurately summarizes the main change: adding DingTalk document upload functionality to the knowledge skill with updated tool naming and infrastructure.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (1)
backend/init_data/skills/dingtalk-knowledge/provider.py (1)

210-262: Nit: drop the unused effective_timeout param and the no-placeholder f-strings.

_prepare_sandbox(effective_timeout) accepts the timeout but never passes it to SandboxManager / make_dir / get_or_create_sandbox; either thread it through (preferable — sandbox creation can hang) or drop the parameter. Also, Ruff F541 flags f-strings at lines 238, 242, 253, 261, and 716 — none of them interpolate anything, so the f prefix is redundant.

Add a return type annotation too (-> tuple[Any, str]) per the type-hints coding guideline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 210 -
262, The _prepare_sandbox method currently has an unused effective_timeout
parameter, redundant f-string prefixes, and lacks a return type; either thread
effective_timeout through to SandboxManager.get_or_create_sandbox and
sandbox.files.make_dir (preferred) or remove the parameter; update the call to
SandboxManager.get_instance / get_or_create_sandbox to accept and pass the
timeout (or document removal), remove the unnecessary f prefixes on the literal
log messages (e.g. logger.info lines that don't interpolate), add a return type
annotation -> tuple[Any, str] to _prepare_sandbox, and ensure the function
returns (sandbox, temp_dir) at the end; reference symbols: _prepare_sandbox,
SandboxManager.get_instance, SandboxManager.get_or_create_sandbox,
sandbox.files.make_dir, _generate_temp_dir_name, temp_dir.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 491-498: Add OpenTelemetry tracing to the async entrypoints by
decorating DownloadDingTalkDocumentTool._arun and SaveDingTalkContentTool._arun
with `@trace_async`(...), and consider adding the same decorator to async helpers
like _prepare_sandbox, _upload_and_return, and _handle_error so the
sandbox-acquire → download/write → upload flow is visible in one span tree; use
descriptive span_name strings (e.g., "DownloadDingTalkDocument" /
"SaveDingTalkContent"), the appropriate tracer_name used in the project, and an
extract_attributes callable to capture key args (download_url, filename,
file_extension, timeout_seconds) when applying the decorator.
- Around line 522-534: The websocket emission currently includes the sensitive
download_url (download_token); update the emit in the ws_emitter.emit_tool_call
call to stop sending download_url (or send a sanitized URL with query stripped
using urlparse._replace(query="").geturl()) and instead emit only non-sensitive
metadata: filename, extension (use os.path.splitext on filename), and file size
if available (e.g., file_size variable or stat the file). Keep the same
try/except around ws_emitter and keep using task_id and name, but remove or
sanitize download_url from the tool_input payload so no signed query/token is
logged or emitted.
- Around line 542-565: The download currently builds curl_cmd and runs
sandbox.commands.run without any client-side size/time limits, so add curl
options to enforce max file size and a per-transfer timeout: update curl_cmd
(the list used before sandbox.commands.run) to include "--max-filesize" with
ATTACHMENT_MAX_BYTES and a "--max-time" value derived from effective_timeout (or
a sensible fraction), keeping save_path and download_url usage intact; retain
the existing post-download size re-check in _upload_and_return and ensure error
handling still raises RuntimeError when result.exit_code != 0 so curl-aborted
transfers are detected.
- Around line 265-386: The temp_dir is leaked on most failure paths in
_upload_and_return; wrap the main upload logic (everything after validating file
size through the final response creation and ws_emitter.emit_tool_call) in a
try/finally that always calls sandbox.commands.run(cmd=["rm","-rf", temp_dir],
...) so cleanup runs on success and all exceptions, remove the duplicated
cleanup blocks inside the size-check and success branches, and replace bare
except Exception: pass around the cleanup call with an exception handler that
logs the error (use logger.error or logger.exception) so Ruff S110 / BLE001
violations are avoided; keep the size-check early raise but ensure temp_dir is
still removed in the finally.
- Around line 343-349: The current check inspects api_response["detail"] which
is fragile; change the curl call that produces api_response to use -sS and -w
'\n%{http_code}' so you capture the HTTP status code separately, then parse the
body into api_response and branch on the status code (treat 2xx as success,
non-2xx as error). On non-2xx, prefer to extract api_response.get("detail") or
stderr content for a descriptive RuntimeError, falling back to "Unknown error"
only if nothing else is available. Keep the existing attachment_id extraction
(attachment_id = api_response.get("id")) for successful 2xx responses and only
raise "No attachment_id in response" after confirming a 2xx status.
- Around line 244-247: The code in _prepare_sandbox currently hardcodes
shell_type="ClaudeCode" when calling sandbox_manager.get_or_create_sandbox,
which ignores the actual execution shell declared in bindShells; fix this by
either (A) extending SkillToolContext to include a shell_type field, ensuring
callers (e.g., chat_shell execution state) populate it, then read
context.shell_type inside _prepare_sandbox and pass that to
sandbox_manager.get_or_create_sandbox, or (B) change the skill configuration to
restrict bindShells to ["Chat"] and add a default_shell_type config field that
_prepare_sandbox reads and uses when calling
sandbox_manager.get_or_create_sandbox; update all call sites to pass the new
SkillToolContext.shell_type if you choose option A and adjust SKILL.md/default
config if you choose option B so the sandbox type matches the execution context.
- Around line 726-748: In save_dingtalk_content, avoid building a huge python3
-c with base64 (python_cmd) which can exceed ARG_MAX; first perform a fail-fast
size check against ATTACHMENT_MAX_BYTES (raise a clear error if
content.encode('utf-8') length > ATTACHMENT_MAX_BYTES), then replace the
base64/python_cmd + sandbox.commands.run flow with
sandbox.files.write(save_path, content_bytes, mode="wb") (or the SDK equivalent)
to write the bytes directly, log the write, and handle/raise any errors from
sandbox.files.write instead of inspecting result.exit_code/result.stderr.
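The client-side curl limits requested in the comments above (a `--max-filesize` cap and a `--max-time` transfer timeout) can be sketched as a command builder; the constant values are illustrative, not the PR's actual configuration:

```python
ATTACHMENT_MAX_BYTES = 50 * 1024 * 1024  # illustrative cap
effective_timeout = 120  # seconds, illustrative


def build_curl_cmd(download_url: str, save_path: str) -> list[str]:
    """Build a curl command that aborts oversized or stalled transfers client-side."""
    return [
        "curl", "-sS", "-L",
        "--max-filesize", str(ATTACHMENT_MAX_BYTES),  # curl aborts if Content-Length or streamed size exceeds this
        "--max-time", str(effective_timeout),          # wall-clock limit for the whole transfer
        "-o", save_path,
        download_url,
    ]
```

A non-zero exit code from curl then covers both the size abort and the timeout, so the existing `result.exit_code != 0` check catches aborted transfers without extra parsing.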

In `@backend/init_data/skills/dingtalk-knowledge/SKILL.md`:
- Around line 62-73: The docs for get_document_info incorrectly list the
returned metadata field as "extension"; update the documentation to use
"file_extension" instead so it matches the actual provider response used in
provider.py and LLM agents. Specifically, in the get_document_info description
replace any mention of `extension` with `file_extension` alongside the existing
`name`, `contentType`, and `nodeType` fields to ensure consumers of
get_document_info (function get_document_info) read the correct key.
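One of the comments above asks that the emitted download_url be stripped of its signed query string before it reaches logs or websocket events; with `urlparse` that is a one-liner (sketch, not the PR's code):

```python
from urllib.parse import urlparse


def sanitize_url(url: str) -> str:
    """Drop the query string and fragment so signed tokens never reach logs or emitted events."""
    return urlparse(url)._replace(query="", fragment="").geturl()
```

The sanitized value keeps the scheme, host, and path, which is usually enough context for debugging without leaking the token.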

---

Nitpick comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 210-262: The _prepare_sandbox method currently has an unused
effective_timeout parameter, redundant f-string prefixes, and lacks a return
type; either thread effective_timeout through to
SandboxManager.get_or_create_sandbox and sandbox.files.make_dir (preferred) or
remove the parameter; update the call to SandboxManager.get_instance /
get_or_create_sandbox to accept and pass the timeout (or document removal),
remove the unnecessary f prefixes on the literal log messages (e.g. logger.info
lines that don't interpolate), add a return type annotation -> tuple[Any, str]
to _prepare_sandbox, and ensure the function returns (sandbox, temp_dir) at the
end; reference symbols: _prepare_sandbox, SandboxManager.get_instance,
SandboxManager.get_or_create_sandbox, sandbox.files.make_dir,
_generate_temp_dir_name, temp_dir.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d2d1e6d-152b-482f-a4a8-306eba61e84e

📥 Commits

Reviewing files that changed from the base of the PR and between 3b1c2d2 and 3f19d85.

📒 Files selected for processing (2)
  • backend/init_data/skills/dingtalk-knowledge/SKILL.md
  • backend/init_data/skills/dingtalk-knowledge/provider.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
backend/init_data/skills/dingtalk-knowledge/provider.py (1)

257-261: ⚠️ Potential issue | 🔴 Critical

Verify get_or_create_sandbox accepts timeout kwarg; shell_type hardcoding still unresolved.

Two concerns on this call:

  1. The SandboxManager.get_or_create_sandbox signature shown in chat_shell/chat_shell/tools/sandbox/_base.py (context snippet 1) is (shell_type, workspace_ref=None, task_type="sandbox") — there is no timeout parameter. Timeout is a constructor arg on SandboxManager.__init__ instead. If the provided snippet reflects the current code, passing timeout=effective_timeout here will raise TypeError. Please verify and either move the timeout into SandboxManager.get_instance(...) or extend get_or_create_sandbox to accept it.

  2. shell_type="ClaudeCode" is still hardcoded even though SKILL.md declares bindShells: [Chat, Agno, ClaudeCode]. This was flagged previously and not yet addressed.

#!/bin/bash
# Confirm current signature of get_or_create_sandbox and whether it accepts timeout.
rg -nP --type=py -B2 -A25 'async def get_or_create_sandbox' chat_shell/
# Check SandboxManager.__init__ for timeout handling.
rg -nP --type=py -B1 -A15 'class SandboxManager' chat_shell/chat_shell/tools/sandbox/_base.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 257 -
261, The call to SandboxManager.get_or_create_sandbox in provider.py is
incorrect: verify the current signature of SandboxManager.get_or_create_sandbox
(async def get_or_create_sandbox) and if it does not accept a timeout kwarg,
remove the timeout=effective_timeout from this call and instead pass timeout via
SandboxManager.get_instance(...) or add a timeout parameter to
get_or_create_sandbox and propagate it through SandboxManager.__init__; also
stop hardcoding shell_type="ClaudeCode" — read the bound shells for this skill
(per SKILL.md or incoming config) and pass the appropriate shell_type value (or
iterate supported shells) into get_or_create_sandbox so the call reflects the
SKILL.md bindShells list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 785-790: The size check for content_bytes against
ATTACHMENT_MAX_BYTES currently raises RuntimeError before the try/except that
calls _handle_error, so move the validation inside the existing try block (the
block that surrounds the subsequent upload flow) or instead call
self._handle_error(...) with the same error context and return before raising;
specifically, ensure the check of content_bytes/ATTACHMENT_MAX_BYTES is
performed inside the same try that catches Exception and routes to _handle_error
(or invoke _handle_error directly) so the failure emits the `failed` websocket
event and follows the same error-handling path as the filename validation.
- Around line 817-819: The call to AsyncSandbox.files.write in provider.py
passes an unsupported mode="wb" kwarg which will raise a TypeError; locate the
await sandbox.files.write(...) call inside the try/except (in the same function
where save_path and content_bytes are used) and remove the mode="wb" argument so
it reads await sandbox.files.write(save_path, content_bytes), keeping the
surrounding exception handling (raising RuntimeError with the original error)
intact; verify content_bytes is bytes before the call if necessary.
- Around line 405-410: The error-handling block that builds the RuntimeError
uses api_response.get("detail") directly which can be a dict; update the logic
in provider.py around the upload response handling so that if error_detail is a
dict you first try to use error_detail.get("message") (falling back to
json.dumps(error_detail) if no message key), otherwise keep existing
upload_result.stderr or the "HTTP {http_status}: Unknown error" fallback, and
then raise RuntimeError with that safe, human-readable string; reference the
variables api_response.get("detail"), upload_result.stderr, http_status and the
RuntimeError raise site when making the change.

---

Duplicate comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 257-261: The call to SandboxManager.get_or_create_sandbox in
provider.py is incorrect: verify the current signature of
SandboxManager.get_or_create_sandbox (async def get_or_create_sandbox) and if it
does not accept a timeout kwarg, remove the timeout=effective_timeout from this
call and instead pass timeout via SandboxManager.get_instance(...) or add a
timeout parameter to get_or_create_sandbox and propagate it through
SandboxManager.__init__; also stop hardcoding shell_type="ClaudeCode" — read the
bound shells for this skill (per SKILL.md or incoming config) and pass the
appropriate shell_type value (or iterate supported shells) into
get_or_create_sandbox so the call reflects the SKILL.md bindShells list.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5ea3ef2c-8cd6-4df3-986c-be7923260276

📥 Commits

Reviewing files that changed from the base of the PR and between 3f19d85 and 20b30e7.

⛔ Files ignored due to path filters (1)
  • backend/init_data/skills/dingtalk-knowledge.zip is excluded by !**/*.zip
📒 Files selected for processing (2)
  • backend/init_data/skills/dingtalk-knowledge/SKILL.md
  • backend/init_data/skills/dingtalk-knowledge/provider.py

@parabala parabala force-pushed the feature/dingtalk-upload-knowledge branch from 20b30e7 to 5f8f7e7 on April 17, 2026 at 13:29

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
backend/init_data/skills/dingtalk-knowledge/SKILL.md (1)

7-10: ⚠️ Potential issue | 🟠 Major

bindShells ≠ what the provider supports.

Root cause is in provider.py (_prepare_sandbox hardcodes shell_type="ClaudeCode"), flagged separately. Either restrict bindShells to ["Chat"] here to match current behavior, or fix provider.py to dispatch dynamically. Leaving the mismatch in place guarantees the skill misbehaves under Agno invocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/SKILL.md` around lines 7 - 10,
The SKILL.md lists bindShells including "Agno" and "ClaudeCode" but the provider
code (_prepare_sandbox in provider.py) currently hardcodes
shell_type="ClaudeCode", causing a mismatch; either change the SKILL.md
bindShells array to only include "Chat" to reflect actual supported invocation,
or update provider.py's _prepare_sandbox to dynamically dispatch shell_type
based on the incoming request (e.g., map the requested shell to the appropriate
sandbox initialization instead of the fixed "ClaudeCode"), ensuring bindShells
and the runtime dispatch (the _prepare_sandbox function) remain consistent.
backend/init_data/skills/dingtalk-knowledge/provider.py (3)

253-261: ⚠️ Potential issue | 🟠 Major

Hardcoded shell_type="ClaudeCode" still conflicts with bindShells: [Chat, Agno, ClaudeCode] — unresolved.

This was raised previously and has not been addressed: the skill advertises support for three shells in SKILL.md but _prepare_sandbox unconditionally requests a ClaudeCode sandbox. When the skill is invoked under Chat or Agno, this either fails (if the template name isn't valid there) or creates a sandbox that cannot be reused by the rest of the session. Either propagate the current shell through SkillToolContext (and read context.shell_type here), or restrict bindShells to ["Chat"] and add a default_shell_type config as the sandbox skill does.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 253 -
261, The _prepare_sandbox code currently hardcodes shell_type="ClaudeCode" when
calling sandbox_manager.get_or_create_sandbox which conflicts with the
advertised bindShells and breaks non-ClaudeCode sessions; update the flow to
read the active shell type from the SkillToolContext (e.g., context.shell_type)
and pass that into get_or_create_sandbox, falling back to a configurable
default_shell_type if context.shell_type is missing, or alternatively change the
skill metadata to restrict bindShells to ["Chat"] and add default_shell_type
config similar to the sandbox skill; modify the call in _prepare_sandbox to use
the chosen variable instead of the literal "ClaudeCode" and ensure SKILL.md and
bindShells reflect this decision.

814-819: ⚠️ Potential issue | 🔴 Critical

sandbox.files.write(..., mode="wb") still present — unsupported kwarg will raise TypeError at runtime.

This was flagged previously and marked as addressed, but the mode="wb" kwarg is still in the code. AsyncSandbox.files.write() in the e2b_code_interpreter SDK does not accept a mode parameter in any released version; passing it converts every save_dingtalk_content invocation into an immediate failure (caught by the generic except and reported as "Failed to write content: ..."). content_bytes is already bytes, so just drop the kwarg.

💡 Fix
-                await sandbox.files.write(save_path, content_bytes, mode="wb")
+                await sandbox.files.write(save_path, content_bytes)
#!/bin/bash
# Cross-check how write() is called elsewhere in this repo (should pass bytes without mode=).
rg -nP --type=py -C 2 '\bfiles\.write\s*\('
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 814 -
819, The async write call erroneously passes an unsupported kwarg mode="wb" to
AsyncSandbox.files.write causing a TypeError at runtime; inside
SaveDingTalkContentTool (the block using logger.info and await
sandbox.files.write(save_path, content_bytes, mode="wb")), remove the mode="wb"
argument and call AsyncSandbox.files.write with just save_path and content_bytes
(ensure content_bytes is bytes), leaving the existing try/except intact; also
scan other usages of AsyncSandbox.files.write to ensure no other calls pass
mode=.

784-790: ⚠️ Potential issue | 🟠 Major

Size-check raise still bypasses _handle_error — previously flagged, not addressed.

Lines 785–790 execute before the try: block at line 804, so a size violation raises straight out of _arun without emitting a failed websocket event or producing the standardized error JSON. Route it through _handle_error for consistency with the filename validation above.

💡 Fix
-        # Fail-fast size check before writing
         content_bytes = content.encode("utf-8")
         if len(content_bytes) > ATTACHMENT_MAX_BYTES:
-            raise RuntimeError(
-                f"Content size ({len(content_bytes)} bytes) exceeds maximum allowed size "
-                f"({ATTACHMENT_MAX_BYTES} bytes)"
-            )
+            return await self._handle_error(
+                filename,
+                RuntimeError(
+                    f"Content size ({len(content_bytes)} bytes) exceeds maximum allowed size "
+                    f"({ATTACHMENT_MAX_BYTES} bytes)"
+                ),
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 784 -
790, The size-check that raises when content.encode("utf-8") exceeds
ATTACHMENT_MAX_BYTES is executed outside the try block in _arun and therefore
bypasses _handle_error and the standardized websocket failure path; move the
size validation inside the same try that wraps the rest of _arun (or,
alternatively, catch the RuntimeError where current validation lives and call
self._handle_error(...) with the same error payload used for filename
validation) so that violations of ATTACHMENT_MAX_BYTES for variable content
always route through _handle_error and emit the standardized failed websocket
event.
🧹 Nitpick comments (2)
backend/init_data/skills/dingtalk-knowledge/provider.py (2)

586-595: Input key name inconsistency: extension here vs file_extension everywhere else.

The tool input schema and the rest of the code consistently use file_extension (including SKILL.md). For symmetry with the websocket completed payload and the schema itself, emit file_extension rather than extension in the running event.

Diff
                 tool_input={
                     "filename": filename,
-                    "extension": file_ext,
+                    "file_extension": file_ext,
                 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 586 -
595, The running-event payload uses the wrong key "extension" instead of the
project-standard "file_extension"; update the emit in provider.py where
self.ws_emitter.emit_tool_call is called (the block that builds tool_input with
"filename" and file_ext) to emit "file_extension": file_ext so the running event
matches the schema and the completed payload used elsewhere (reference the
variable file_ext and the method self.ws_emitter.emit_tool_call).

356-368: stdout.strip() + split-on-\n can mis-parse the HTTP status when the body contains a trailing newline or CR.

curl -w "\n%{http_code}" prepends a newline + the status. After .strip(), if the API returned a trailing newline in its JSON body (many frameworks do), the final structure is body\n\nstatus — which still works — but output_lines[-1] being a non-numeric value anywhere (e.g., due to -sS writing a transient error line mid-stream on retry) will raise ValueError from int(output_lines[-1]) and be reported to the user as a generic error. Consider -w "\nHTTP_STATUS:%{http_code}" with an explicit prefix and re.search to extract the status, which is far more robust than positional parsing. Minor — flag as optional hardening.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 356 -
368, The current parsing of upload_result.stdout using
stdout.strip().split("\n") and taking output_lines[-1] to get http_status is
fragile; change the curl invocation to use a unique suffix like -w
"\nHTTP_STATUS:%{http_code}" and then extract the status with a regex search
(e.g., search for r"HTTP_STATUS:(\d{3})") on upload_result.stdout instead of
positional splitting, set http_status from the matched group (handling
missing/invalid match with a clear RuntimeError), set response_body to stdout
with the matched suffix removed (or use the span of the match to slice), and
keep the JSON parsing of response_body into api_response (json.loads) with the
same exception handling; reference symbols: upload_result, upload_result.stdout,
http_status, response_body, api_response, json.loads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 412-428: The temp-dir cleanup in _upload_and_return is
insufficient because temp_dir is created in _prepare_sandbox and other failures
in DownloadDingTalkDocumentTool._arun and SaveDingTalkContentTool._arun bypass
that finally; move the cleanup out of _upload_and_return and into each _arun (or
make _prepare_sandbox return an async context manager or cleanup callback) so
temp_dir is removed in a finally block inside DownloadDingTalkDocumentTool._arun
and SaveDingTalkContentTool._arun (and any caller that uses _prepare_sandbox),
ensuring you always call sandbox.commands.run(["rm","-rf", temp_dir]) in each
finally and remove the inner try/finally from _upload_and_return; reference
_prepare_sandbox, _upload_and_return, DownloadDingTalkDocumentTool._arun,
SaveDingTalkContentTool._arun and _handle_error when making the changes.
- Around line 525-527: The docstring/note in DownloadDingTalkDocumentTool and
SaveDingTalkContentTool is misleading about requiring a separate "sandbox"
skill; update their description strings to clarify the tool creates/acquires its
own sandbox via SandboxManager. Locate the description text in the classes
DownloadDingTalkDocumentTool and SaveDingTalkContentTool and replace "Make sure
the sandbox skill is loaded." with a clear phrase such as "This tool runs inside
a sandbox; it will acquire one automatically." (or similar wording) so agents
are not instructed to enable an external sandbox skill.
- Around line 257-261: The call to sandbox_manager.get_or_create_sandbox passes
a nonexistent timeout kwarg causing a TypeError; instead set the timeout when
obtaining the manager instance. Remove timeout=effective_timeout from
get_or_create_sandbox(...) and ensure SandboxManager.get_instance(...) is called
with timeout=effective_timeout (or obtain the manager via
SandboxManager.get_instance(..., timeout=effective_timeout)) so subsequent calls
like sandbox_manager.get_or_create_sandbox(shell_type="ClaudeCode",
workspace_ref=None[, task_type=...]) work correctly.

---

Duplicate comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 253-261: The _prepare_sandbox code currently hardcodes
shell_type="ClaudeCode" when calling sandbox_manager.get_or_create_sandbox which
conflicts with the advertised bindShells and breaks non-ClaudeCode sessions;
update the flow to read the active shell type from the SkillToolContext (e.g.,
context.shell_type) and pass that into get_or_create_sandbox, falling back to a
configurable default_shell_type if context.shell_type is missing, or
alternatively change the skill metadata to restrict bindShells to ["Chat"] and
add default_shell_type config similar to the sandbox skill; modify the call in
_prepare_sandbox to use the chosen variable instead of the literal "ClaudeCode"
and ensure SKILL.md and bindShells reflect this decision.
- Around line 814-819: The async write call erroneously passes an unsupported
kwarg mode="wb" to AsyncSandbox.files.write causing a TypeError at runtime;
inside SaveDingTalkContentTool (the block using logger.info and await
sandbox.files.write(save_path, content_bytes, mode="wb")), remove the mode="wb"
argument and call AsyncSandbox.files.write with just save_path and content_bytes
(ensure content_bytes is bytes), leaving the existing try/except intact; also
scan other usages of AsyncSandbox.files.write to ensure no other calls pass
mode=.
- Around line 784-790: The size-check that raises when content.encode("utf-8")
exceeds ATTACHMENT_MAX_BYTES is executed outside the try block in _arun and
therefore bypasses _handle_error and the standardized websocket failure path;
move the size validation inside the same try that wraps the rest of _arun (or,
alternatively, catch the RuntimeError where current validation lives and call
self._handle_error(...) with the same error payload used for filename
validation) so that violations of ATTACHMENT_MAX_BYTES for variable content
always route through _handle_error and emit the standardized failed websocket
event.

In `@backend/init_data/skills/dingtalk-knowledge/SKILL.md`:
- Around line 7-10: The SKILL.md lists bindShells including "Agno" and
"ClaudeCode" but the provider code (_prepare_sandbox in provider.py) currently
hardcodes shell_type="ClaudeCode", causing a mismatch; either change the
SKILL.md bindShells array to only include "Chat" to reflect actual supported
invocation, or update provider.py's _prepare_sandbox to dynamically dispatch
shell_type based on the incoming request (e.g., map the requested shell to the
appropriate sandbox initialization instead of the fixed "ClaudeCode"), ensuring
bindShells and the runtime dispatch (the _prepare_sandbox function) remain
consistent.

---

Nitpick comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 586-595: The running-event payload uses the wrong key "extension"
instead of the project-standard "file_extension"; update the emit in provider.py
where self.ws_emitter.emit_tool_call is called (the block that builds tool_input
with "filename" and file_ext) to emit "file_extension": file_ext so the running
event matches the schema and the completed payload used elsewhere (reference the
variable file_ext and the method self.ws_emitter.emit_tool_call).
- Around line 356-368: The current parsing of upload_result.stdout using
stdout.strip().split("\n") and taking output_lines[-1] to get http_status is
fragile; change the curl invocation to use a unique suffix like -w
"\nHTTP_STATUS:%{http_code}" and then extract the status with a regex search
(e.g., search for r"HTTP_STATUS:(\d{3})") on upload_result.stdout instead of
positional splitting, set http_status from the matched group (handling
missing/invalid match with a clear RuntimeError), set response_body to stdout
with the matched suffix removed (or use the span of the match to slice), and
keep the JSON parsing of response_body into api_response (json.loads) with the
same exception handling; reference symbols: upload_result, upload_result.stdout,
http_status, response_body, api_response, json.loads.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f01d9275-6505-4312-8679-850089b106bb

📥 Commits

Reviewing files that changed from the base of the PR and between 20b30e7 and 5f8f7e7.

📒 Files selected for processing (2)
  • backend/init_data/skills/dingtalk-knowledge/SKILL.md
  • backend/init_data/skills/dingtalk-knowledge/provider.py

Comment thread backend/init_data/skills/wegent-knowledge/provider.py Outdated
Contributor

@coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
backend/init_data/skills/dingtalk-knowledge/provider.py (2)

246-250: ⚠️ Potential issue | 🟠 Major

effective_timeout is currently ignored during sandbox acquisition.

_prepare_sandbox(effective_timeout) accepts timeout, but Line 246-250 does not pass it into SandboxManager.get_instance(...), despite the note on Line 254. This can make timeout_seconds partially ineffective.

Proposed fix
         sandbox_manager = SandboxManager.get_instance(
             task_id=self.task_id,
             user_id=self.user_id,
             user_name=effective_user_name,
+            timeout=effective_timeout,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 246 -
250, The call to SandboxManager.get_instance in _prepare_sandbox ignores the
effective_timeout, so timeout_seconds can be ineffective; update the call in
_prepare_sandbox to pass the effective_timeout (or timeout_seconds) into
SandboxManager.get_instance (e.g., add a timeout_seconds or timeout arg) so the
manager is created with the intended timeout; ensure the parameter name matches
SandboxManager.get_instance signature and propagate effective_timeout from
_prepare_sandbox into that call.

604-660: ⚠️ Potential issue | 🟠 Major

Temp directory cleanup still leaks on pre-upload failures.

Cleanup runs in _upload_and_return finally, but failures after Line 606/813 and before calling _upload_and_return (e.g., download/write/get_info errors) bypass that cleanup and leave temp_dir behind.

Proposed fix shape (move cleanup to each _arun)
-        try:
-            sandbox, temp_dir = await self._prepare_sandbox(effective_timeout)
+        sandbox = None
+        temp_dir = None
+        try:
+            sandbox, temp_dir = await self._prepare_sandbox(effective_timeout)
             ...
             return await self._upload_and_return(...)
         except Exception as e:
             return await self._handle_error(filename, e)
+        finally:
+            if sandbox is not None and temp_dir:
+                try:
+                    await sandbox.commands.run(cmd=["rm", "-rf", temp_dir], cwd="/home/user")
+                except Exception as cleanup_err:
+                    logger.warning("Temp dir cleanup failed: %s", cleanup_err)

Also applies to: 801-845

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 604 -
660, The temp_dir created by _prepare_sandbox is only cleaned in
_upload_and_return's finally, so exceptions occurring between sandbox creation
and the call to _upload_and_return (e.g., curl failure, sandbox.files.get_info)
leak the temp directory; update the async run path (the relevant
_arun/_download/_run logic around _prepare_sandbox, the curl invocation, and the
except block that calls _handle_error) to ensure temp_dir is removed on all
error paths: either perform cleanup in a local finally around the try that
creates sandbox (after calling _prepare_sandbox) or call a cleanup helper in the
except before delegating to _handle_error; reference and reuse the same
cleanup/teardown used by _upload_and_return so you don't duplicate teardown
logic. Ensure cleanup runs whether success or failure and does not rely solely
on _upload_and_return.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 465-472: The websocket emission in _handle_error (the call to
self.ws_emitter.emit_tool_call) must be made best-effort so it cannot raise and
mask the original error; wrap the emit call in a try/except that catches any
exception from emit_tool_call, log or ignore that secondary error, and do not
re-raise it so the original error path remains intact (modify the block inside
_handle_error that references self.ws_emitter/emit_tool_call accordingly).
- Around line 395-403: The websocket emit call (self.ws_emitter.emit_tool_call
with status="completed" and output=response) can raise and cause a successful
upload to be reported as failed; wrap the emit_tool_call invocation in a
try/except that catches exceptions from self.ws_emitter.emit_tool_call, log the
exception (or call self.ws_emitter.logger/error handler) and swallow it so it
doesn't propagate, and ensure the method still returns the successful upload
result; specifically protect the block that references task_id, name, filename
and response so emit failures do not change the overall success path.

---

Duplicate comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 246-250: The call to SandboxManager.get_instance in
_prepare_sandbox ignores the effective_timeout, so timeout_seconds can be
ineffective; update the call in _prepare_sandbox to pass the effective_timeout
(or timeout_seconds) into SandboxManager.get_instance (e.g., add a
timeout_seconds or timeout arg) so the manager is created with the intended
timeout; ensure the parameter name matches SandboxManager.get_instance signature
and propagate effective_timeout from _prepare_sandbox into that call.
- Around line 604-660: The temp_dir created by _prepare_sandbox is only cleaned
in _upload_and_return's finally, so exceptions occurring between sandbox
creation and the call to _upload_and_return (e.g., curl failure,
sandbox.files.get_info) leak the temp directory; update the async run path (the
relevant _arun/_download/_run logic around _prepare_sandbox, the curl
invocation, and the except block that calls _handle_error) to ensure temp_dir is
removed on all error paths: either perform cleanup in a local finally around the
try that creates sandbox (after calling _prepare_sandbox) or call a cleanup
helper in the except before delegating to _handle_error; reference and reuse the
same cleanup/teardown used by _upload_and_return so you don't duplicate teardown
logic. Ensure cleanup runs whether success or failure and does not rely solely
on _upload_and_return.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff7c0e87-78a8-492a-9648-c657b48ff4d7

📥 Commits

Reviewing files that changed from the base of the PR and between 5f8f7e7 and 0881b98.

📒 Files selected for processing (1)
  • backend/init_data/skills/dingtalk-knowledge/provider.py

Comment on lines +395 to +403
if self.ws_emitter:
    await self.ws_emitter.emit_tool_call(
        task_id=self.task_id,
        tool_name=self.name,
        tool_input={"filename": filename},
        status="completed",
        output=response,
    )

Contributor

⚠️ Potential issue | 🟠 Major

Do not fail a successful upload when websocket emit fails.

On Line 396, emit_tool_call(..., status="completed") is not guarded. If emit fails, the method raises and caller returns a failed tool response even though upload succeeded.

Proposed fix
                 if self.ws_emitter:
-                    await self.ws_emitter.emit_tool_call(
-                        task_id=self.task_id,
-                        tool_name=self.name,
-                        tool_input={"filename": filename},
-                        status="completed",
-                        output=response,
-                    )
+                    try:
+                        await self.ws_emitter.emit_tool_call(
+                            task_id=self.task_id,
+                            tool_name=self.name,
+                            tool_input={"filename": filename},
+                            status="completed",
+                            output=response,
+                        )
+                    except Exception as e:
+                        logger.warning("Failed to emit completion status: %s", e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 395 -
403, The websocket emit call (self.ws_emitter.emit_tool_call with
status="completed" and output=response) can raise and cause a successful upload
to be reported as failed; wrap the emit_tool_call invocation in a try/except
that catches exceptions from self.ws_emitter.emit_tool_call, log the exception
(or call self.ws_emitter.logger/error handler) and swallow it so it doesn't
propagate, and ensure the method still returns the successful upload result;
specifically protect the block that references task_id, name, filename and
response so emit failures do not change the overall success path.

Comment on lines +465 to +472
if self.ws_emitter:
    await self.ws_emitter.emit_tool_call(
        task_id=self.task_id,
        tool_name=self.name,
        tool_input={"filename": filename},
        status="failed",
        error=str(error),
    )
Contributor

⚠️ Potential issue | 🟡 Minor

_handle_error should be best-effort for websocket emission.

If Line 466 emit fails, _handle_error can throw and mask the original error response path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 465 -
472, The websocket emission in _handle_error (the call to
self.ws_emitter.emit_tool_call) must be made best-effort so it cannot raise and
mask the original error; wrap the emit call in a try/except that catches any
exception from emit_tool_call, log or ignore that secondary error, and do not
re-raise it so the original error path remains intact (modify the block inside
_handle_error that references self.ws_emitter/emit_tool_call accordingly).

@parabala parabala force-pushed the feature/dingtalk-upload-knowledge branch from 0881b98 to 047d90a Compare April 21, 2026 08:09
Contributor

@coderabbitai Bot left a comment

♻️ Duplicate comments (4)
backend/init_data/skills/dingtalk-knowledge/provider.py (4)

395-402: ⚠️ Potential issue | 🟠 Major

Wrap websocket emit in try/except to prevent masking successful uploads.

If emit_tool_call raises an exception, the successful upload will fail to return its response and the exception will propagate to the caller, making a successful operation appear failed. Wrap the emit in a try/except block that logs the error but allows the success path to continue.

🛡️ Proposed fix
             if self.ws_emitter:
-                await self.ws_emitter.emit_tool_call(
-                    task_id=self.task_id,
-                    tool_name=self.name,
-                    tool_input={"filename": filename},
-                    status="completed",
-                    output=response,
-                )
+                try:
+                    await self.ws_emitter.emit_tool_call(
+                        task_id=self.task_id,
+                        tool_name=self.name,
+                        tool_input={"filename": filename},
+                        status="completed",
+                        output=response,
+                    )
+                except Exception as e:
+                    logger.warning("Failed to emit completion status: %s", e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 395 -
402, The websocket emit call in the provider's upload flow can raise and mask
successful uploads; wrap the await self.ws_emitter.emit_tool_call(...) inside a
try/except that catches Exception, logs the error (including self.task_id,
self.name, and filename for context) and continues without re-raising so the
successful response (variable response) is still returned; locate the block
using self.ws_emitter and the emit_tool_call call and ensure only the emit is
inside the try/except so upload success path is unchanged.

801-845: ⚠️ Potential issue | 🟠 Major

Temp directory cleanup is leaked on pre-upload failures.

Same issue as in DownloadDingTalkDocumentTool: if sandbox.files.write fails at line 823-825 or get_info fails at line 828, the exception handler at line 844 is invoked without cleanup, leaving temp_dir in the sandbox.

Add a finally block to _arun to ensure cleanup runs on all paths, and remove the duplicate cleanup from _upload_and_return.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 801 -
845, The _arun method of SaveDingTalkContentTool leaks the sandbox temp_dir when
sandbox.files.write or sandbox.files.get_info throws; add a finally block in
SaveDingTalkContentTool._arun that always calls the sandbox cleanup (using the
same cleanup routine currently in _upload_and_return) so temp_dir is removed on
success, handled errors, or intermediate failures, and then remove the duplicate
cleanup from _upload_and_return to avoid double-cleanup; ensure you still await
_prepare_sandbox (and capture sandbox and temp_dir) before using the finally so
cleanup can run safely even when write/get_info fails and keep existing calls to
_handle_error for error reporting.

604-659: ⚠️ Potential issue | 🟠 Major

Temp directory cleanup is leaked on pre-upload failures.

The cleanup in _upload_and_return's finally block (lines 417-433) only runs if _upload_and_return is called. If the download fails at line 638-639 or get_info fails at line 642, control jumps to the exception handler at line 658 and _handle_error is called without cleanup—leaving temp_dir behind in the sandbox.

Move cleanup into a finally block within _arun so it's guaranteed to run on all paths.

🧹 Proposed fix
     async def _arun(...) -> str:
         ...
+        sandbox = None
+        temp_dir = None
         try:
             # Prepare sandbox
             sandbox, temp_dir = await self._prepare_sandbox(effective_timeout)
             ...
             return await self._upload_and_return(...)
         except Exception as e:
             return await self._handle_error(filename, e)
+        finally:
+            if sandbox is not None and temp_dir:
+                try:
+                    await sandbox.commands.run(
+                        cmd=["rm", "-rf", temp_dir],
+                        cwd="/home/user",
+                    )
+                    logger.info("Cleaned up temp_dir=%s", temp_dir)
+                except Exception as cleanup_err:
+                    logger.error("Failed to cleanup temp_dir=%s: %s", temp_dir, cleanup_err)

And remove the cleanup from _upload_and_return's finally (lines 417-433).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 604 -
659, The temp_dir cleanup is only in _upload_and_return's finally so failures
during download or get_info in _arun leak the sandbox/temp_dir; modify _arun
(the async method containing calls to _prepare_sandbox, sandbox.commands.run,
and sandbox.files.get_info) to declare sandbox and temp_dir in the outer scope
and wrap the download/upload logic in a try/finally where the finally always
removes temp_dir and closes/cleans sandbox (guarding for None), then call
_handle_error from the except but let the finally run to perform cleanup; after
moving cleanup to _arun, remove the redundant cleanup code from
_upload_and_return's finally so cleanup happens exactly once and on all paths.

465-472: ⚠️ Potential issue | 🟠 Major

Make websocket emission in error handler best-effort.

If the websocket emit fails within _handle_error, it will raise and prevent the error response from being returned to the caller, masking the original error. Wrap the emit in a try/except that logs but doesn't propagate the exception.

🛡️ Proposed fix
         if self.ws_emitter:
-            await self.ws_emitter.emit_tool_call(
-                task_id=self.task_id,
-                tool_name=self.name,
-                tool_input={"filename": filename},
-                status="failed",
-                error=str(error),
-            )
+            try:
+                await self.ws_emitter.emit_tool_call(
+                    task_id=self.task_id,
+                    tool_name=self.name,
+                    tool_input={"filename": filename},
+                    status="failed",
+                    error=str(error),
+                )
+            except Exception as e:
+                logger.warning("Failed to emit error status: %s", e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 465 -
472, The websocket emit in _handle_error (the await
self.ws_emitter.emit_tool_call(...) call) should be best-effort: wrap that await
in a try/except Exception block so any exceptions from ws_emitter.emit_tool_call
do not propagate and mask the original error; inside the except, log the
exception (e.g., via self.logger.exception or a similar logger) and continue,
ensuring the original error response is still returned to the caller.
🧹 Nitpick comments (1)
backend/init_data/skills/dingtalk-knowledge/SKILL.md (1)

67-67: Add language specifiers to code blocks for better syntax highlighting.

Several code blocks lack language specifiers, which prevents proper syntax highlighting in documentation renderers. Add the appropriate language identifier after the opening fence.

📝 Suggested fixes

Line 67:

-```
+```python
 dingtalk-docs.get_document_info(document_id="xxx")

Line 90:

-```
+```python
 dingtalk-docs.get_document_content(nodeId="xxx")

Lines 99, 110, 121 similarly need `python` specifiers.




Also applies to: 90-90, 99-99, 110-110, 121-121

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @backend/init_data/skills/dingtalk-knowledge/SKILL.md at line 67, The
markdown code fences in SKILL.md around the dingtalk examples are missing
language specifiers which prevents syntax highlighting; update each affected
fence that contains calls like
dingtalk-docs.get_document_info(document_id="xxx"),
dingtalk-docs.get_document_content(nodeId="xxx"), and the other example blocks
(lines showing those calls) by adding the python language identifier immediately
after the opening fence (e.g., change to ```python) for all occurrences
referenced.



🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @backend/init_data/skills/dingtalk-knowledge/provider.py:

  • Around line 395-402: The websocket emit call in the provider's upload flow can
    raise and mask successful uploads; wrap the await
    self.ws_emitter.emit_tool_call(...) inside a try/except that catches Exception,
    logs the error (including self.task_id, self.name, and filename for context) and
    continues without re-raising so the successful response (variable response) is
    still returned; locate the block using self.ws_emitter and the emit_tool_call
    call and ensure only the emit is inside the try/except so upload success path is
    unchanged.
  • Around line 801-845: The _arun method of SaveDingTalkContentTool leaks the
    sandbox temp_dir when sandbox.files.write or sandbox.files.get_info throws; add
    a finally block in SaveDingTalkContentTool._arun that always calls the sandbox
    cleanup (using the same cleanup routine currently in _upload_and_return) so
    temp_dir is removed on success, handled errors, or intermediate failures, and
    then remove the duplicate cleanup from _upload_and_return to avoid
    double-cleanup; ensure you still await _prepare_sandbox (and capture sandbox and
    temp_dir) before using the finally so cleanup can run safely even when
    write/get_info fails and keep existing calls to _handle_error for error
    reporting.
  • Around line 604-659: The temp_dir cleanup is only in _upload_and_return's
    finally so failures during download or get_info in _arun leak the
    sandbox/temp_dir; modify _arun (the async method containing calls to
    _prepare_sandbox, sandbox.commands.run, and sandbox.files.get_info) to declare
    sandbox and temp_dir in the outer scope and wrap the download/upload logic in a
    try/finally where the finally always removes temp_dir and closes/cleans sandbox
    (guarding for None), then call _handle_error from the except but let the finally
    run to perform cleanup; after moving cleanup to _arun, remove the redundant
    cleanup code from _upload_and_return's finally so cleanup happens exactly once
    and on all paths.
  • Around line 465-472: The websocket emit in _handle_error (the await
    self.ws_emitter.emit_tool_call(...) call) should be best-effort: wrap that await
    in a try/except Exception block so any exceptions from ws_emitter.emit_tool_call
    do not propagate and mask the original error; inside the except, log the
    exception (e.g., via self.logger.exception or a similar logger) and continue,
    ensuring the original error response is still returned to the caller.

Nitpick comments:
In @backend/init_data/skills/dingtalk-knowledge/SKILL.md:

  • Line 67: The markdown code fences in SKILL.md around the dingtalk examples are
    missing language specifiers which prevents syntax highlighting; update each
    affected fence that contains calls like
    dingtalk-docs.get_document_info(document_id="xxx"),
    dingtalk-docs.get_document_content(nodeId="xxx"), and the other example blocks
    (lines showing those calls) by adding the python language identifier immediately
    after the opening fence (e.g., change to ```python) for all occurrences
    referenced.


---

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 46b326dd-8009-494f-9388-8f314fa7fc8a

📥 Commits

Reviewing files that changed from the base of the PR and between 0881b98b0d638f86c8bb4fdcf2ca7dd5248ab4c4 and 047d90a87fdb626d5db0545a5b3548b0da1e77de.

📒 Files selected for processing (4)
  • backend/app/mcp_server/tools/__init__.py
  • backend/app/mcp_server/tools/decorator.py
  • backend/init_data/skills/dingtalk-knowledge/SKILL.md
  • backend/init_data/skills/dingtalk-knowledge/provider.py

✅ Files skipped from review due to trivial changes (2)
  • backend/app/mcp_server/tools/__init__.py
  • backend/app/mcp_server/tools/decorator.py


@parabala parabala force-pushed the feature/dingtalk-upload-knowledge branch from e172f52 to 582eaaa Compare April 21, 2026 11:57
…uments to knowledge base

Add new skill to upload DingTalk documents, spreadsheets, and AI tables
to Wegent knowledge bases. Includes:

- SKILL.md: Comprehensive documentation with workflow and examples
- provider.py: Tool provider with two tools:
  - download_dingtalk_document: Download file-based docs via sandbox
  - save_dingtalk_content: Save online doc content (adoc) as markdown

Supports multiple DingTalk document types:
- File-based documents (docx, xlsx, pdf) via download_file MCP
- Online documents (adoc) via get_document_content MCP
- Spreadsheets (axls) via dingtalk-table MCP
- AI tables (able) via dingtalk-ai-table MCP

Integrates with wegent-knowledge MCP to create documents with proper
source attribution and trigger indexing.
- Add OpenTelemetry tracing to async entrypoints and helpers
- Fix websocket emission to remove sensitive download_url from payload
- Add curl options for max file size (--max-filesize) and timeout (--max-time)
- Fix temp_dir leak by wrapping upload logic in try/finally with proper cleanup
- Fix HTTP status code checking by parsing status from curl response
- Fix hardcoded shell_type by passing timeout to get_or_create_sandbox()
- Replace base64/python_cmd with sandbox.files.write() for content saving
- Add return type annotation and remove redundant f-string prefixes
- Update SKILL.md documentation: change extension to file_extension
- Clarify workflow: documents are read FROM DingTalk and stored TO Wegent KB
Update SKILL.md to use the new prefixed tool name:
- Change wegent-knowledge.create_document to wegent_kb_create_document
- Update all examples and references
- Clarify warning about tool selection
- Move content_bytes size check inside try block for proper error handling
- Remove unsupported mode='wb' from sandbox.files.write() call
- Fix error_detail handling when it's a dict instead of string
- Remove timeout parameter from get_or_create_sandbox() call (timeout is passed via get_instance)
…ples

Update the docstring in __init__.py and example code in decorator.py to use
the actual MCP tool names (with wegent_kb_ prefix) instead of internal function
names. This ensures documentation consistency with the @mcp_tool decorator name
parameters in knowledge.py.

Changes:
- __init__.py: Updated Knowledge MCP tools list with correct names and added
  missing tools (search_knowledge_base, read_document_content)
- decorator.py: Fixed example code to use wegent_kb_list_knowledge_bases
- Merge dingtalk-knowledge SKILL.md into wegent-knowledge SKILL.md
- Add provider.py to wegent-knowledge with DingTalk upload tools
- Rename DingTalkKnowledgeProvider to WegentKnowledgeProvider
- Update provider name from dingtalk_knowledge to wegent_knowledge
- Remove standalone dingtalk-knowledge skill directory
- Combine MCP servers: wegent-knowledge, dingtalk-docs, dingtalk-table, dingtalk-ai-table
- Move temp_dir cleanup from _upload_and_return to _arun methods with finally block
- Update tool descriptions to clarify sandbox is auto-acquired (not external skill)
- Fix timeout parameter - pass to SandboxManager.get_instance instead of get_or_create_sandbox
- Add configurable default_shell_type via tool_config (default: Agno)
- Remove unsupported mode='wb' from sandbox.files.write call
- Move content size validation inside try block for proper error handling
- Fix emit_tool_call key from 'extension' to 'file_extension'
- Fix fragile HTTP status parsing using regex with HTTP_STATUS suffix
- Restrict bindShells to [Chat] to match runtime behavior
@parabala force-pushed the feature/dingtalk-upload-knowledge branch 3 times, most recently from d267459 to 2b34c19 on April 24, 2026 at 07:41
Only use skills, upload DingTalk document to wegent knowledge base
@parabala force-pushed the feature/dingtalk-upload-knowledge branch from 2b34c19 to c3f3d1c on April 27, 2026 at 02:57
…upload

Add new MCP tools to replace sandbox exec for DingTalk document upload:

1. dingtalk_upload_file_from_url: Download file from URL and upload to Wegent
   - Downloads file from DingTalk download_url
   - Automatically manages temporary files
   - Returns attachment_id for wegent_kb_create_document

2. dingtalk_upload_content: Save content to file and upload to Wegent
   - Writes content to temporary file
   - Supports configurable encoding
   - Returns attachment_id for wegent_kb_create_document

Both tools:
- Use tempfile module for secure temporary file handling
- Clean up temporary files after upload (in finally block)
- Integrate with existing context_service.upload_attachment
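
A rough sketch of the temp-file handling both tools share (the helper name here is illustrative, not the actual tool code, and the real tools pass the resulting bytes to `context_service.upload_attachment`):

```python
import os
import tempfile


def save_content_to_temp_bytes(
    content: str, filename: str, encoding: str = "utf-8"
) -> bytes:
    """Illustrative sketch: write text to a temporary file with the right
    suffix, read it back as bytes for upload, and clean up in finally."""
    suffix = os.path.splitext(filename)[1] or ".md"
    temp_path = None
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            temp_path = tmp.name
        with open(temp_path, "w", encoding=encoding) as f:
            f.write(content)
        with open(temp_path, "rb") as f:
            return f.read()
    finally:
        # Runs even when writing or reading raises, so no temp files leak
        if temp_path and os.path.exists(temp_path):
            os.unlink(temp_path)
```

In the real flow, the bytes returned here would become an attachment whose attachment_id is then handed to wegent_kb_create_document.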

Update SKILL.md:
- Remove all sandbox exec references
- Document new helper tools
- Update workflow and examples

Update server.py:
- Import dingtalk_upload module for tool registration

Fixes secret
@parabala
Collaborator Author

@coderabbitai resume

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Reviews resumed.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/mcp_server/tools/dingtalk_upload.py (1)

1-305: ⚠️ Potential issue | 🟠 Major

Fix Black formatting (pipeline failing).

The Lint GitHub Action reports Black would reformat this file. Lines such as 58, 154, 158, 258 exceed the 88-column limit. Run black backend/app/mcp_server/tools/dingtalk_upload.py before merging.

As per coding guidelines: "Follow PEP 8 standard with Black formatter (line length: 88) and isort for Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 1 - 305, Run
Black to reformat the file to the project's 88-column style (e.g., run `black
backend/app/mcp_server/tools/dingtalk_upload.py`); specifically wrap or shorten
long log/message and header lines found in functions _download_file_from_url
(headers User-Agent), upload_file_from_url (logger.info, error messages and the
returned dict keys/values that exceed 88 chars), and upload_content
(logger.info, error messages and returned dict), so all lines comply with
Black/PEP8; after formatting, re-run linting to confirm pipeline passes.
🧹 Nitpick comments (4)
backend/app/mcp_server/tools/__init__.py (1)

31-44: DINGTALK_UPLOAD_MCP_TOOLS duplicates KNOWLEDGE_MCP_TOOLS content.

Both KNOWLEDGE_MCP_TOOLS and DINGTALK_UPLOAD_MCP_TOOLS are built with build_mcp_tools_dict(server="knowledge") (see dingtalk_upload.py line 304 and the corresponding call in knowledge.py). Since build_mcp_tools_dict filters only by server name, both constants resolve to the same dict — every knowledge tool, not just DingTalk ones. The name is misleading and the export adds no new information.

Either drop the export entirely (registration goes through get_registered_mcp_tools directly, so it's unused) or have dingtalk_upload.py build a properly-scoped registry (e.g., add a per-module filter to build_mcp_tools_dict, or use a dedicated server tag). Same applies to the __all__ entry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/__init__.py` around lines 31 - 44,
DINGTALK_UPLOAD_MCP_TOOLS and KNOWLEDGE_MCP_TOOLS both call
build_mcp_tools_dict(server="knowledge") so the former is a misleading
duplicate; fix by either removing the DINGTALK_UPLOAD_MCP_TOOLS export/entry in
__all__ and any unused references (since get_registered_mcp_tools handles
registration), or change the construction in dingtalk_upload.py to scope
correctly (e.g., call build_mcp_tools_dict with a distinct server tag or add a
per-module filter) so DINGTALK_UPLOAD_MCP_TOOLS represents only DingTalk tools;
update the __all__ list accordingly and ensure references to
DINGTALK_UPLOAD_MCP_TOOLS are adjusted or deleted.
backend/app/mcp_server/tools/dingtalk_upload.py (2)

187-189: Use !s conversion instead of str(e) (Ruff RUF010).

Minor — applies here and at Line 292:

-            return {"error": f"Failed to download file: {str(e)}"}
+            return {"error": f"Failed to download file: {e!s}"}
-            return {"error": f"Failed to encode content with {encoding}: {str(e)}"}
+            return {"error": f"Failed to encode content with {encoding}: {e!s}"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 187 - 189, In
the exception handlers inside dingtalk_upload.py replace explicit str(e) with
the f-string !s conversion to satisfy Ruff RUF010: change logger.error(f"Failed
to download file from URL: {e}") and return {"error": f"Failed to download file:
{str(e)}"} to use {e!s}, and apply the same change to the other similar handler
around the later return (the one at the second exception block near the
upload/download flow) so both logging and returned error strings use {e!s}.

256-292: Avoid the temp-file roundtrip in upload_content.

The function writes content to a NamedTemporaryFile, then re-reads the same bytes back from disk into memory before calling upload_attachment. Since upload_attachment only needs bytes, you can encode in memory and skip both the temp file and the cleanup path entirely:

♻️ Proposed simplification
-    temp_file_path = None
-    db = SessionLocal()
+    db = SessionLocal()
 
     try:
         user = _get_user_from_token(db, token_info)
         if not user:
             return {"error": "User not found"}
 
-        # Validate content
         if not content:
             return {"error": "Content cannot be empty"}
-
-        # Validate filename
         if not filename:
             return {"error": "Filename is required"}
 
-        # Create temporary file with proper extension
-        file_extension = os.path.splitext(filename)[1] or ".md"
-        with tempfile.NamedTemporaryFile(delete=False, suffix=file_extension) as temp_file:
-            temp_file_path = temp_file.name
-
         try:
-            # Write content to temporary file
-            logger.info(f"Writing content to temporary file {temp_file_path}")
-            with open(temp_file_path, "w", encoding=encoding) as f:
-                f.write(content)
-
-            # Read file content as binary for upload
-            with open(temp_file_path, "rb") as f:
-                binary_data = f.read()
-
+            binary_data = content.encode(encoding)
             file_size = len(binary_data)
 
             context, truncation_info = context_service.upload_attachment(
                 db=db,
                 user_id=user.id,
                 filename=filename,
                 binary_data=binary_data,
                 subtask_id=0,
             )
             ...
         except UnicodeEncodeError as e:
             ...
     finally:
         db.close()
-        _clean_temp_file(temp_file_path)

This also makes the UnicodeEncodeError branch reachable in the natural place (during .encode()) and removes a needless filesystem dependency from the hot path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 256 - 292, The
current upload_content flow unnecessarily writes to a NamedTemporaryFile then
re-reads it; instead, in upload_content encode the in-memory content to bytes
using content.encode(encoding) (catch UnicodeEncodeError around this call),
compute file_size = len(binary_data), and call
context_service.upload_attachment(db=db, user_id=user.id, filename=filename,
binary_data=binary_data, subtask_id=0). Return the same dict using context.id,
context.mime_type and truncation_info.is_truncated (if present). Remove
creation/opening of NamedTemporaryFile and all file read/write operations so
there is no filesystem roundtrip.
backend/init_data/skills/wegent-knowledge/SKILL.md (1)

210-267: Add language tags to fenced code blocks (markdownlint MD040).

Lines 210, 233, 242, 253, and 264 open ``` without a language identifier. Add python (or text) to keep markdownlint clean and improve syntax highlighting in the skill renderer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/init_data/skills/wegent-knowledge/SKILL.md` around lines 210 - 267,
The markdown examples in SKILL.md open fenced code blocks without language tags
which triggers markdownlint MD040; update each triple-backtick block shown
around the get_document_info example, the get_document_content example, the
dingtalk-table.get_all_sheets / get_range example, the
dingtalk-ai-table.get_tables / query_records example, and the
dingtalk-docs.download_file example to include a language identifier (e.g.,
```python for code-like calls or ```text for plain output) so the code blocks
for get_document_info, get_document_content, dingtalk-table.get_all_sheets,
dingtalk-table.get_range, dingtalk-ai-table.get_tables,
dingtalk-ai-table.query_records, and dingtalk-docs.download_file have proper
syntax highlighting and satisfy MD040.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 40-67: _download_file_from_url currently performs an unrestricted
requests.get on a model-controlled URL; add strict URL validation and a download
size cap: implement a helper _validate_download_url(url: str) that parses the
URL, allows only http/https (ALLOWED_SCHEMES), ensures a non-empty hostname,
rejects obvious SSRF hosts (localhost, 127.0.0.1, ::1, RFC1918 patterns and
".internal"), and (optionally) resolve and reject private/link-local IPs (and
re-check after redirects); in _download_file_from_url call
_validate_download_url first, configure requests to disable or limit redirects,
stream the response and enforce MAX_DOWNLOAD_BYTES (e.g. 100*1024*1024) by
incrementing total_size inside the iter_content loop and aborting/raising if
exceeded, and ensure downstream code that handles the downloaded blob (e.g.,
upload_file_from_url flow that may read the file) uses streaming or file-on-disk
reads instead of loading the entire file into memory.
- Around line 161-197: The exception handlers in upload_file_from_url and
upload_content do not perform an explicit database rollback before returning
error responses, which can leave partial commits from
context_service.upload_attachment (which may commit on DocumentParseError);
modify both functions' error paths so that immediately before returning an error
dict (in the requests.RequestException handler and the broad Exception handler)
you call db.rollback() (or session.rollback() as appropriate) to undo any open
transaction, then log and return; keep db.close() in the finally block as-is and
reference upload_file_from_url, upload_content, and
context_service.upload_attachment when locating the changes.

In `@backend/init_data/skills/wegent-knowledge/SKILL.md`:
- Around line 369-378: The Python example call to wegent_kb_create_document in
the SKILL.md contains a typo: the boolean argument uses "True." (trailing
period) which is invalid syntax; update the examples for
wegent_kb_create_document so the trigger_indexing argument is "True," (comma)
instead of "True." in both example occurrences, ensuring the snippet is valid
Python and matches the function signature (knowledge_base_id, name, source_type,
attachment_id, trigger_indexing).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb3cfd03-a3da-40fd-af91-47cfd5a3994e

📥 Commits

Reviewing files that changed from the base of the PR and between 0881b98 and c56ecf4.

📒 Files selected for processing (5)
  • backend/app/mcp_server/server.py
  • backend/app/mcp_server/tools/__init__.py
  • backend/app/mcp_server/tools/decorator.py
  • backend/app/mcp_server/tools/dingtalk_upload.py
  • backend/init_data/skills/wegent-knowledge/SKILL.md
✅ Files skipped from review due to trivial changes (1)
  • backend/app/mcp_server/tools/decorator.py

Comment thread backend/app/mcp_server/tools/dingtalk_upload.py Outdated
Comment thread backend/app/mcp_server/tools/dingtalk_upload.py Outdated
Comment thread backend/init_data/skills/wegent-knowledge/SKILL.md
…dge-skill

refactor(knowledge): remove sandbox dependency for DingTalk document upload
@parabala force-pushed the feature/dingtalk-upload-knowledge branch from c56ecf4 to 654d174 on April 27, 2026 at 06:55
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
backend/app/mcp_server/tools/dingtalk_upload.py (3)

262-279: Avoid disk round-trip when uploading text content.

The function writes content to a temp file and then immediately reads it back as bytes only to pass it to context_service.upload_attachment. The temp file is otherwise unused (no parsing or sandbox handoff). You can produce the same binary_data by encoding in memory, which removes I/O, removes the cleanup path for this branch, and avoids surfacing filesystem errors from the disk hop.

♻️ Proposed refactor
-    temp_file_path = None
     db = SessionLocal()

     try:
         user = _get_user_from_token(db, token_info)
         if not user:
             return {"error": "User not found"}

         # Validate content
         if not content:
             return {"error": "Content cannot be empty"}

         # Validate filename
         if not filename:
             return {"error": "Filename is required"}

-        # Create temporary file with proper extension
-        file_extension = os.path.splitext(filename)[1] or ".md"
-        with tempfile.NamedTemporaryFile(
-            delete=False, suffix=file_extension
-        ) as temp_file:
-            temp_file_path = temp_file.name
-
         try:
-            # Write content to temporary file
-            logger.info(f"Writing content to temporary file {temp_file_path}")
-            with open(temp_file_path, "w", encoding=encoding) as f:
-                f.write(content)
-
-            # Read file content as binary for upload
-            with open(temp_file_path, "rb") as f:
-                binary_data = f.read()
-
+            # Encode content directly; no disk round-trip needed
+            binary_data = content.encode(encoding)
             file_size = len(binary_data)
@@
     finally:
         db.close()
-        _clean_temp_file(temp_file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 262 - 279, The
current code in dingtalk_upload.py creates a temp file, writes `content`, then
reopens it to read `binary_data` before calling
`context_service.upload_attachment`; replace the disk round-trip by encoding the
text in memory: compute `file_extension = os.path.splitext(filename)[1] or
".md"`, then set `binary_data = content.encode(encoding)` and `file_size =
len(binary_data)` (removing the tempfile creation/cleanup and the open/read
steps) and pass `binary_data` and `file_size` to
`context_service.upload_attachment` as before; keep all logging around
writing/reading adjusted to reflect the in-memory encoding.

72-89: Filename derived from URL path is not sanitized.

os.path.basename(parsed.path) will return whatever lies in the URL path (URL-encoded sequences, unicode, leading dots, etc.). It then flows into both tempfile.NamedTemporaryFile(suffix=...) and context_service.upload_attachment(filename=...). The downstream service does call os.path.basename again and validates the extension, so this is unlikely to be exploitable, but consider URL-decoding and stripping problematic characters here so the temp-file suffix and the stored attachment name are predictable (e.g. %20 → space, removal of NUL/control chars).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 72 - 89, The
_get_filename_from_url function returns raw os.path.basename(parsed.path) which
may contain URL-encoded sequences, control characters, leading dots, or other
unsafe chars; update it to URL-decode the basename (use urllib.parse.unquote),
normalize/remove control/NUL characters, strip/replace path separators and
leading dots, and collapse spaces or unsafe chars to a safe token while
preserving the extension if present; ensure the sanitized filename is returned
(or default_filename if result is empty) so the value passed to
tempfile.NamedTemporaryFile(suffix=...) and
context_service.upload_attachment(filename=...) is predictable and safe.

35-37: Avoid disk round-trip in upload_content by encoding directly.

In upload_content, content is written to a temporary file then read back as binary (lines 256–270). Instead of this unnecessary I/O, use binary_data = content.encode(encoding) directly to convert the string to bytes, eliminating the file write-read cycle while preserving identical behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 35 - 37, The
upload_content function currently writes `content` to a temporary file then
re-opens it to read bytes; replace that disk round-trip by creating the bytes
in-memory with `binary_data = content.encode(encoding)` (use the same `encoding`
variable already in scope), remove the temporary file creation/cleanup code
(e.g., NamedTemporaryFile or open/write/read steps), and use `binary_data` where
the file's read bytes were previously used so behavior remains identical without
I/O.
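
The in-memory path amounts to little more than the following sketch (the error-dict shape mirrors the tool's existing return convention, which is an assumption here):

```python
def encode_content(content: str, encoding: str = "utf-8") -> dict:
    """Sketch of the in-memory path: encode directly, no temp file involved,
    so UnicodeEncodeError surfaces at the natural point."""
    try:
        binary_data = content.encode(encoding)
    except (UnicodeEncodeError, LookupError) as e:
        return {"error": f"Failed to encode content with {encoding}: {e!s}"}
    return {"binary_data": binary_data, "file_size": len(binary_data)}
```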

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec39ae36-f4d0-45fa-b554-253d91f1e28e

📥 Commits

Reviewing files that changed from the base of the PR and between c56ecf4 and 654d174.

📒 Files selected for processing (1)
  • backend/app/mcp_server/tools/dingtalk_upload.py

…alk upload

- Add URL validation to prevent SSRF attacks in _download_file_from_url()
- Add MAX_DOWNLOAD_BYTES (100MB) limit with streaming size check
- Add db.rollback() in exception handlers before returning error responses
- Fix Ruff RUF010: replace str(e) with {e!s}
- Optimize upload_content() to avoid filesystem roundtrip
- Run Black formatter on dingtalk_upload.py
- Fix SKILL.md: correct True. syntax error to True,
- Add python language identifiers to code blocks in SKILL.md
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/mcp_server/tools/dingtalk_upload.py (1)

144-156: Consider streaming to a tempfile instead of buffering in memory.

chunks accumulates the entire payload (up to MAX_DOWNLOAD_BYTES = 100 MB) before b"".join(chunks) doubles peak memory during the join. Under concurrent uploads this can be a noticeable footprint on a multi-tenant MCP server. If context_service.upload_attachment accepts (or can be adapted to accept) a file-like / streamed source, prefer writing chunks to a tempfile.NamedTemporaryFile and passing that through; otherwise at least bytearray.extend(chunk) to avoid the join's transient 2× allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 144 - 156, The
current download loop buffers chunks in a list and returns b"".join(chunks),
which causes a transient 2× memory spike for up to MAX_DOWNLOAD_BYTES; change
the implementation in dingtalk_upload.py (the loop using response.iter_content
and variable chunks) to stream into a tempfile.NamedTemporaryFile (write each
chunk, enforce MAX_DOWNLOAD_BYTES by tracking total_size, flush/seek to start)
and pass that file-like object to context_service.upload_attachment; if
context_service.upload_attachment cannot accept a file-like, replace the
list-of-bytes buffer with a single bytearray (use bytearray.extend(chunk) and
return bytes(bytearray)) to avoid the join overhead. Ensure you reference
MAX_DOWNLOAD_BYTES, response.iter_content, chunks, and
context_service.upload_attachment when making the change.
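
The bytearray fallback suggested there is small; a sketch, with the chunk source injected so it stays self-contained:

```python
def read_limited(chunks, max_bytes: int) -> bytes:
    """Accumulate streamed chunks into a single bytearray, enforcing the size
    cap as data arrives; avoids the transient 2x allocation of
    b"".join(list_of_chunks)."""
    buf = bytearray()
    for chunk in chunks:
        if not chunk:
            continue
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError(f"Download exceeds maximum size of {max_bytes} bytes")
    return bytes(buf)
```

In the real tool, `chunks` would be `response.iter_content(chunk_size=8192)` and `max_bytes` would be MAX_DOWNLOAD_BYTES.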
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 86-156: The current flow in _download_file_from_url suffers
TOCTOU/DNS-rebinding and unsafe auto-redirects: change it to resolve and
validate the hostname(s) via socket.getaddrinfo inside _validate_download_url
and then perform the HTTP request bound to the resolved IP(s) using a custom
requests HTTPAdapter that pins the chosen IP while preserving the original Host
header and SNI (so that requests.get uses the validated IP rather than
re-resolving); also set allow_redirects=False and implement a manual redirect
loop that for each Location header runs the same resolve+validate+pinning logic
before issuing the next request (re-check size limits and MAX_DOWNLOAD_BYTES
across chunks as now), ensuring every connection uses the validated IP and no
intermediate redirect is followed without prior validation.
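
One way to shape that manual redirect loop, with the fetch and validation steps injected as callables so the sketch stays testable (the IP pinning via a custom adapter that the comment also asks for is omitted here):

```python
from urllib.parse import urljoin


def fetch_with_validated_redirects(url, fetch, validate, max_redirects=5):
    """Follow redirects manually: every hop is validated before the next
    request is issued. `fetch(url)` returns (status, headers, body);
    `validate(url)` raises on unsafe URLs. Both are assumptions of this
    sketch, not the real tool's signatures."""
    for _ in range(max_redirects + 1):
        validate(url)
        status, headers, body = fetch(url)
        if status in (301, 302, 303, 307, 308):
            url = urljoin(url, headers["Location"])
            continue
        return body
    raise ValueError("Too many redirects")
```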


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 495eaa52-128a-41d8-81c1-8d001e20a561

📥 Commits

Reviewing files that changed from the base of the PR and between 654d174 and 72a0aa0.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_upload.py
  • backend/init_data/skills/wegent-knowledge/SKILL.md

Comment on lines +86 to +156
```python
    try:
        addrinfo = socket.getaddrinfo(hostname, None)
        for _, _, _, _, sockaddr in addrinfo:
            ip = ipaddress.ip_address(sockaddr[0])
            # Check for private, loopback, link-local, or reserved IPs
            if (
                ip.is_private
                or ip.is_loopback
                or ip.is_link_local
                or ip.is_reserved
                or ip.is_multicast
            ):
                raise ValueError(f"URL resolves to blocked IP: {ip}")
    except socket.gaierror as e:
        raise ValueError(f"Failed to resolve hostname: {hostname}") from e


def _get_user_from_token(db: Session, token_info: TaskTokenInfo) -> Optional[User]:
    """Get user from token info."""
    return db.query(User).filter(User.id == token_info.user_id).first()


def _download_file_from_url(url: str) -> bytes:
    """
    Download file from URL with security validations.

    Args:
        url: Download URL

    Returns:
        Downloaded file content as bytes

    Raises:
        ValueError: If URL validation fails
        requests.RequestException: If download fails
    """
    # Validate URL before downloading
    _validate_download_url(url)

    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    }

    # Use stream=True and limit redirects for security
    with requests.get(
        url,
        headers=headers,
        stream=True,
        timeout=DEFAULT_TIMEOUT,
        allow_redirects=True,
    ) as response:
        response.raise_for_status()

        # Re-validate after redirects (if Location header present)
        if response.history:
            final_url = response.url
            _validate_download_url(final_url)

        total_size = 0
        chunks = []

        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                total_size += len(chunk)
                if total_size > MAX_DOWNLOAD_BYTES:
                    raise ValueError(
                        f"Download exceeds maximum size of {MAX_DOWNLOAD_BYTES} bytes"
                    )
                chunks.append(chunk)

        return b"".join(chunks)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Best practices to prevent SSRF DNS rebinding in Python requests: pinning resolved IP and validating redirects manually

💡 Result:

To prevent SSRF via DNS rebinding in Python's requests library, implement IP pinning: resolve the hostname once, validate the IP against private/reserved ranges, rewrite the URL to use the resolved IP while preserving the original Host header and TLS SNI, and handle redirects manually with the same validation. Key steps:

  1. Resolve the hostname to IP(s) using socket.getaddrinfo and validate that all resolved IPs are safe (not private, loopback, or metadata endpoints like 169.254.169.254).
  2. Use a custom HTTPAdapter to force the connection to the pinned IP, set server_hostname for SNI, and set the Host header to the original hostname.
  3. Disable automatic redirects (allow_redirects=False), check the Location header, re-validate the new URL, and recurse up to a max_redirects limit.

Example implementation:

import socket
import ssl
import ipaddress
from urllib.parse import urlparse, urlunparse
from urllib3.poolmanager import PoolManager
from requests.adapters import HTTPAdapter
from requests import Session

def resolve_first_ip(hostname):
    return socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)[0][4][0]

def is_safe_ip(ip_str):
    ip = ipaddress.ip_address(ip_str)
    return (
        not ip.is_private
        and not ip.is_loopback
        and not ip.is_link_local
        and str(ip) != "169.254.169.254"  # AWS/GCP/Azure metadata
        and not ip.is_multicast
    )

class HostHeaderSSLAdapter(HTTPAdapter):
    def __init__(self, ssl_hostname, *args, **kwargs):
        self.ssl_hostname = ssl_hostname
        super().__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        context = ssl.create_default_context()
        kwargs["ssl_context"] = context
        kwargs["server_hostname"] = self.ssl_hostname
        self.poolmanager = PoolManager(*args, **kwargs)

def safe_request(url, max_redirects=10):
    parsed = urlparse(url)
    hostname = parsed.hostname
    if not hostname:
        raise ValueError("Invalid URL")
    # Pin and validate IP
    ip = resolve_first_ip(hostname)
    if not is_safe_ip(ip):
        raise ValueError("Unsafe IP")
    # Rebuild URL with IP
    netloc = ip
    if parsed.port:
        netloc += f":{parsed.port}"
    new_url = urlunparse(parsed._replace(netloc=netloc))
    session = Session()
    adapter = HostHeaderSSLAdapter(hostname)
    session.mount("https://", adapter)
    headers = {"Host": hostname}
    response = session.get(new_url, headers=headers, allow_redirects=False, verify=True)
    # Manual redirect handling
    redirects = 0
    while 300 <= response.status_code < 400 and redirects < max_redirects:
        location = response.headers.get("Location")
        if not location:
            break
        # Make absolute if relative
        if location.startswith("/"):
            location = f"{parsed.scheme}://{hostname}{location}"
        # Re-validate new location with pinning
        response = safe_request(location, max_redirects - 1)
        redirects += 1
    return response

Usage: response = safe_request('https://example.com')

This approach eliminates TOCTOU gaps in DNS rebinding by ensuring the validated IP is used for the connection. For production, check all getaddrinfo IPs, handle proxies/cookies carefully, and consider libraries like requests-hardened or drawbridge (for httpx) for maintained solutions.


Residual SSRF: DNS rebinding and intermediate-redirect validation remain unfixed.

Two critical gaps persist despite the added validation layer:

  1. TOCTOU / DNS rebinding — _validate_download_url resolves the hostname via socket.getaddrinfo (line 87), but requests.get (line 130) performs its own independent DNS lookup. An attacker controlling a short-TTL DNS record or shared resolver cache can return a public IP during validation and a private IP at request time. The same TOCTOU applies after redirects: line 142 re-validates response.url textually and re-resolves it, but does not validate against the actual IP address that requests already connected to.

  2. Intermediate-redirect side effects — allow_redirects=True allows requests to autonomously follow the redirect chain. By the time the final URL is re-validated (line 142), requests have already been issued to intermediate hops, which could be internal hosts. This enables port-scanning and state-changing requests (e.g., metadata endpoints, internal APIs) regardless of post-hoc rejection.

Required fixes:

  • Resolve the hostname once using socket.getaddrinfo, validate all resolved IPs against the blocked list, then connect directly to the validated IP. Use a custom HTTPAdapter to pin the resolved IP while preserving the original hostname in the Host header and SNI.
  • Set allow_redirects=False and implement manual redirect following. Before issuing each subsequent request, validate the Location header by running the same _validate_download_url check to ensure the next hop is safe.
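The second bullet can be sketched as a hop-by-hop loop (a minimal sketch under assumptions: `fetch` and `validate` are injected stand-ins for a pinned-IP request function and `_validate_download_url`; the IP-pinning adapter itself is omitted):

```python
from urllib.parse import urljoin

def fetch_with_validated_redirects(url, fetch, validate, max_redirects=5):
    """Follow redirects manually, validating every hop BEFORE it is fetched.

    `fetch(url)` returns (status_code, headers, body) and must itself use
    allow_redirects=False; `validate(url)` raises ValueError on unsafe URLs.
    Both are injected so the control flow can be exercised without a network.
    """
    for _ in range(max_redirects + 1):
        validate(url)                       # re-run the SSRF check on every hop
        status, headers, body = fetch(url)
        if status not in (301, 302, 303, 307, 308):
            return url, body
        location = headers.get("Location")
        if not location:
            raise ValueError("Redirect without Location header")
        url = urljoin(url, location)        # resolve relative redirects
    raise ValueError("Too many redirects")
```

Because validation runs before each hop is requested, no intermediate host is ever contacted after a failed check, unlike the post-hoc `response.history` approach.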
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 86 - 156, The
current flow in _download_file_from_url suffers TOCTOU/DNS-rebinding and unsafe
auto-redirects: change it to resolve and validate the hostname(s) via
socket.getaddrinfo inside _validate_download_url and then perform the HTTP
request bound to the resolved IP(s) using a custom requests HTTPAdapter that
pins the chosen IP while preserving the original Host header and SNI (so that
requests.get uses the validated IP rather than re-resolving); also set
allow_redirects=False and implement a manual redirect loop that for each
Location header runs the same resolve+validate+pinning logic before issuing the
next request (re-check size limits and MAX_DOWNLOAD_BYTES across chunks as now),
ensuring every connection uses the validated IP and no intermediate redirect is
followed without prior validation.

…ta fetch

- backend/app/mcp_server/tools/dingtalk_upload.py:
  - _get_filename_from_url: URL-decode basename with urllib.parse.unquote,
    strip NUL/control chars via unicodedata, remove path separators to
    prevent directory traversal, strip leading dots, collapse unsafe chars
    to underscores; fall back to default_filename when result is empty

- backend/init_data/skills/wegent-knowledge/SKILL.md:
  - AI Table (able): add mandatory pagination loop for query_records;
    warn that uploading only the first page results in incomplete data
  - Spreadsheet (axls): increase example range from A1:Z100 to A1:Z10000
    and note that hasMore-style responses require incremental fetching
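The pagination requirement above can be sketched as a cursor loop (hypothetical: `query_records`, `records`, `has_more`, and `next_cursor` are placeholder names, not necessarily the dingtalk-ai-table MCP's real response shape):

```python
def fetch_all_records(query_records, table_id, page_size=100):
    """Accumulate every page of an AI-table query.

    Stopping after the first page silently drops data, which is the bug
    the SKILL.md warning targets; loop until has_more is falsy instead.
    """
    records = []
    cursor = None
    while True:
        page = query_records(table_id=table_id, cursor=cursor, page_size=page_size)
        records.extend(page["records"])
        if not page.get("has_more"):
            return records
        cursor = page["next_cursor"]
```

The same shape applies to the spreadsheet (axls) case: any hasMore-style response must be drained before the content is handed to the upload tool.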

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/app/mcp_server/tools/dingtalk_upload.py (1)

86-143: ⚠️ Potential issue | 🟠 Major

Residual SSRF: DNS-rebinding TOCTOU and unvalidated intermediate redirects still open.

Two gaps from the prior review remain:

  1. TOCTOU/DNS rebinding — _validate_download_url resolves via socket.getaddrinfo at Line 88, but requests.get at Line 131 performs an independent DNS lookup. With a short-TTL or cache-poisoned record, validation can resolve to a public IP while the actual connection lands on a private/metadata IP. Re-validating response.url at Line 143 after allow_redirects=True doesn't help: it's another textual+DNS check, not a check against the IP requests already connected to.

  2. Intermediate redirects — with allow_redirects=True, every hop in the redirect chain is fetched before Line 142 re-validation runs. Internal hosts can be probed (port-scan/state-changing hits to metadata endpoints, internal APIs) regardless of post-hoc rejection of the final URL.

Recommended hardening: resolve once, validate all returned IPs, then connect to the pinned IP using a custom HTTPAdapter that preserves the original Host header and SNI. Set allow_redirects=False and follow redirects manually, re-running _validate_download_url on each Location before issuing the next request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 86 - 143, The
current download flow in _download_file_from_url does a textual DNS check in
_validate_download_url then calls requests.get (allow_redirects=True), leaving a
TOCTOU/DNS-rebind gap and allowing intermediate redirects to be fetched; fix it
by resolving the initial hostname to IPs once (using the same socket.getaddrinfo
logic used by _validate_download_url), validate all resolved IPs are allowed,
then create a requests Session with a custom HTTPAdapter that pins the
connection to the validated IP (while preserving the original Host header and
TLS SNI), call requests with allow_redirects=False, and implement a manual
redirect loop that: (a) inspects Location, runs the same resolve+validate before
following, (b) pins each hop to its validated IP via the adapter, and (c)
enforces a redirect limit; update _download_file_from_url to use
DEFAULT_TIMEOUT, stream=True, raise_for_status, and this
manual-redirect/pinned-IP logic instead of relying on requests.get
allow_redirects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 266-291: Extract the filename-cleaning logic from
_get_filename_from_url into a new helper _sanitize_filename(name: str) -> str
that removes NUL/control chars, path separators, leading dots and other unsafe
characters, then call _sanitize_filename on any caller-supplied filename before
passing it to context_service.upload_attachment; update both tool entrypoints
that accept a filename (the branch where filename is provided by the caller and
the branch that falls back to _get_filename_from_url) to use _sanitize_filename
so all uploaded filenames are normalized and safe while leaving the existing
URL-derived path to call _get_filename_from_url which can internally call
_sanitize_filename.

---

Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 86-143: The current download flow in _download_file_from_url does
a textual DNS check in _validate_download_url then calls requests.get
(allow_redirects=True), leaving a TOCTOU/DNS-rebind gap and allowing
intermediate redirects to be fetched; fix it by resolving the initial hostname
to IPs once (using the same socket.getaddrinfo logic used by
_validate_download_url), validate all resolved IPs are allowed, then create a
requests Session with a custom HTTPAdapter that pins the connection to the
validated IP (while preserving the original Host header and TLS SNI), call
requests with allow_redirects=False, and implement a manual redirect loop that:
(a) inspects Location, runs the same resolve+validate before following, (b) pins
each hop to its validated IP via the adapter, and (c) enforces a redirect limit;
update _download_file_from_url to use DEFAULT_TIMEOUT, stream=True,
raise_for_status, and this manual-redirect/pinned-IP logic instead of relying on
requests.get allow_redirects.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ea9cfdab-9513-42f8-a5db-b6295f1b96c8

📥 Commits

Reviewing files that changed from the base of the PR and between 72a0aa0 and 6184bb1.

📒 Files selected for processing (2)
  • backend/app/mcp_server/tools/dingtalk_upload.py
  • backend/init_data/skills/wegent-knowledge/SKILL.md

Comment on lines +266 to +291
    if not filename:
        filename = _get_filename_from_url(
            download_url, f"document.{default_extension}"
        )

    # Download file from URL with security validations
    logger.info(f"Downloading file from URL: {download_url}")
    binary_data = _download_file_from_url(download_url)
    file_size = len(binary_data)

    # Upload to Wegent using context service
    context, truncation_info = context_service.upload_attachment(
        db=db,
        user_id=user.id,
        filename=filename,
        binary_data=binary_data,
        subtask_id=0,  # Unlinked attachment
    )

    return {
        "attachment_id": context.id,
        "filename": filename,
        "size": file_size,
        "mime_type": context.mime_type,
        "truncated": (truncation_info.is_truncated if truncation_info else False),
    }

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect upload_attachment and downstream uses of `filename` for sanitization/path-traversal handling.
fd -t f context_service.py backend/app/services/context | xargs -I{} sed -n '1,500p' {}
echo "---"
rg -nP -C3 "def upload_attachment\b" --type=py
echo "---"
rg -nP -C3 "\bfilename\b" --type=py -g 'backend/app/services/context/**'

Repository: wecode-ai/Wegent

Length of output: 36455


🏁 Script executed:

rg -nA 10 -B 5 "context_service.upload_attachment" backend/app/mcp_server/tools/dingtalk_upload.py | head -40

Repository: wecode-ai/Wegent

Length of output: 1249


🏁 Script executed:

# Check if any HTTP response or header construction uses filename without sanitization
rg -nB 3 -A 3 "Content-Disposition\|attachment_id.*filename\|response.*filename" backend/app --type=py | head -50

Repository: wecode-ai/Wegent

Length of output: 42


🏁 Script executed:

# Search for any filename sanitization utilities in the codebase
fd -t f -type f . | xargs rg -l "sanitize.*filename|filename.*sanitize" --type=py

Repository: wecode-ai/Wegent

Length of output: 478


🏁 Script executed:

# Check for filename sanitization in dingtalk_upload.py itself
rg -nB 2 -A 8 "_get_filename_from_url|_sanitize" backend/app/mcp_server/tools/dingtalk_upload.py

Repository: wecode-ai/Wegent

Length of output: 916


🏁 Script executed:

# Check the sandbox_file_syncer for how it handles filenames
rg -nB 2 -A 5 "sanitize\|filename" backend/app/services/sandbox_file_syncer.py | head -60

Repository: wecode-ai/Wegent

Length of output: 42


🏁 Script executed:

# Examine the API endpoint that returns attachments to see if filename is used in headers
rg -nB 5 -A 10 "def download.*attachment|@router.*download.*attachment" backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 548


🏁 Script executed:

# Get the full implementation of download_attachment
rg -n -A 40 "^async def download_attachment" backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 1645


🏁 Script executed:

# Check the _get_filename_from_url implementation in detail
sed -n '160,220p' backend/app/mcp_server/tools/dingtalk_upload.py

Repository: wecode-ai/Wegent

Length of output: 2274


🏁 Script executed:

# Get the rest of download_attachment to see how filename is used in response
sed -n '550,630p' backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 3167


🏁 Script executed:

# Check if filename is used in content-disposition or other response headers
rg -n "content.disposition\|Content-Disposition\|filename\*=" backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 42


🏁 Script executed:

# Continue reading download_attachment to find the response construction
sed -n '620,680p' backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 2040


🏁 Script executed:

# Find the _build_content_disposition function
rg -n -B 2 -A 8 "def _build_content_disposition" backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 418


🏁 Script executed:

# Check if there's any sanitization in that function
rg -n "_build_content_disposition" backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 397


🏁 Script executed:

# Get the full implementation of _build_content_disposition
sed -n '70,95p' backend/app/api/endpoints/adapter/attachments.py

Repository: wecode-ai/Wegent

Length of output: 951


Apply consistent filename sanitization to both URL-derived and caller-supplied names.

_get_filename_from_url thoroughly sanitizes URL-derived filenames (removes NUL/control chars, path separators, leading dots, unsafe characters), but caller-supplied filename values (from LLM via MCP) are passed unsanitized to context_service.upload_attachment at lines 280 and 390. The service stores the unsanitized name directly in the database and type_data without validation. A model-controlled filename like "../../../etc/passwd" or containing NUL bytes bypasses sanitization entirely, whereas URL-derived filenames are always protected.

Extract the sanitization logic from _get_filename_from_url into a reusable _sanitize_filename(name: str) -> str helper and apply it to caller-supplied filename in both tool entrypoints before invoking upload_attachment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 266 - 291,
Extract the filename-cleaning logic from _get_filename_from_url into a new
helper _sanitize_filename(name: str) -> str that removes NUL/control chars, path
separators, leading dots and other unsafe characters, then call
_sanitize_filename on any caller-supplied filename before passing it to
context_service.upload_attachment; update both tool entrypoints that accept a
filename (the branch where filename is provided by the caller and the branch
that falls back to _get_filename_from_url) to use _sanitize_filename so all
uploaded filenames are normalized and safe while leaving the existing
URL-derived path to call _get_filename_from_url which can internally call
_sanitize_filename.
