feat(skills): add dingtalk-knowledge skill for uploading DingTalk doc… #994

parabala wants to merge 12 commits into wecode-ai:main from
Conversation
📝 Walkthrough

Adds DingTalk ingestion support.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client as MCP Client
participant Tool as dingtalk_upload_file_from_url
participant DB as Database
participant HTTP as Remote HTTP
participant Service as ContextService
Client->>Tool: invoke(download_url, filename, token_info)
Tool->>DB: open session / resolve user
alt user missing
Tool-->>Client: {"error": "user not found"}
else user found
Tool->>HTTP: GET download_url (SSRF checks, DNS->IP check, timeout, max bytes)
HTTP-->>Tool: stream bytes / error
alt download success
Tool->>Service: upload_attachment(bytes, filename, mime)
Service-->>Tool: attachment metadata
Tool-->>Client: metadata {attachment_id, filename, size, mime_type, truncated}
else download/validation error
Tool-->>Client: {"error": "..."}
end
end
Tool->>DB: commit/rollback and close session
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 8
🧹 Nitpick comments (1)
backend/init_data/skills/dingtalk-knowledge/provider.py (1)
210-262: Nit: drop the unused `effective_timeout` param and the no-placeholder f-strings.

`_prepare_sandbox(effective_timeout)` accepts the timeout but never passes it to `SandboxManager` / `make_dir` / `get_or_create_sandbox`; either thread it through (preferable — sandbox creation can hang) or drop the parameter. Also, Ruff F541 flags f-strings at lines 238, 242, 253, 261, and 716 — none of them interpolate anything, so the `f` prefix is redundant. Add a return type annotation too (`-> tuple[Any, str]`) per the type-hints coding guideline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 210 - 262, The _prepare_sandbox method currently has an unused effective_timeout parameter, redundant f-string prefixes, and lacks a return type; either thread effective_timeout through to SandboxManager.get_or_create_sandbox and sandbox.files.make_dir (preferred) or remove the parameter; update the call to SandboxManager.get_instance / get_or_create_sandbox to accept and pass the timeout (or document removal), remove the unnecessary f prefixes on the literal log messages (e.g. logger.info lines that don't interpolate), add a return type annotation -> tuple[Any, str] to _prepare_sandbox, and ensure the function returns (sandbox, temp_dir) at the end; reference symbols: _prepare_sandbox, SandboxManager.get_instance, SandboxManager.get_or_create_sandbox, sandbox.files.make_dir, _generate_temp_dir_name, temp_dir.
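A minimal sketch of the tidied helper, assuming `SandboxManager.get_instance` accepts a timeout (a later review run indicates `get_or_create_sandbox` does not); the method names come from this comment, not verified code:

```python
from typing import Any

async def _prepare_sandbox(self, effective_timeout: float) -> tuple[Any, str]:
    """Acquire a sandbox and create a temp dir, threading the timeout through."""
    # Assumption: the timeout is accepted by get_instance, not by
    # get_or_create_sandbox (see the Critical comment in the next review run).
    manager = SandboxManager.get_instance(
        task_id=self.task_id,
        user_id=self.user_id,
        timeout=effective_timeout,
    )
    sandbox = await manager.get_or_create_sandbox(shell_type="ClaudeCode")
    temp_dir = self._generate_temp_dir_name()
    await sandbox.files.make_dir(temp_dir)
    logger.info("Sandbox ready")  # plain string: no placeholder, no f prefix
    return sandbox, temp_dir
```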
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 491-498: Add OpenTelemetry tracing to the async entrypoints by
decorating DownloadDingTalkDocumentTool._arun and SaveDingTalkContentTool._arun
with `@trace_async`(...), and consider adding the same decorator to async helpers
like _prepare_sandbox, _upload_and_return, and _handle_error so the
sandbox-acquire → download/write → upload flow is visible in one span tree; use
descriptive span_name strings (e.g., "DownloadDingTalkDocument" /
"SaveDingTalkContent"), the appropriate tracer_name used in the project, and an
extract_attributes callable to capture key args (download_url, filename,
file_extension, timeout_seconds) when applying the decorator.
- Around line 522-534: The websocket emission currently includes the sensitive
download_url (download_token); update the emit in the ws_emitter.emit_tool_call
call to stop sending download_url (or send a sanitized URL with query stripped
using urlparse._replace(query="").geturl()) and instead emit only non-sensitive
metadata: filename, extension (use os.path.splitext on filename), and file size
if available (e.g., file_size variable or stat the file). Keep the same
try/except around ws_emitter and keep using task_id and name, but remove or
sanitize download_url from the tool_input payload so no signed query/token is
logged or emitted (a stdlib sketch follows this list).
- Around line 542-565: The download currently builds curl_cmd and runs
sandbox.commands.run without any client-side size/time limits, so add curl
options to enforce max file size and a per-transfer timeout: update curl_cmd
(the list used before sandbox.commands.run) to include "--max-filesize" with
ATTACHMENT_MAX_BYTES and a "--max-time" value derived from effective_timeout (or
a sensible fraction), keeping save_path and download_url usage intact; retain
the existing post-download size re-check in _upload_and_return and ensure error
handling still raises RuntimeError when result.exit_code != 0 so curl-aborted
transfers are detected.
- Around line 265-386: The temp_dir is leaked on most failure paths in
_upload_and_return; wrap the main upload logic (everything after validating file
size through the final response creation and ws_emitter.emit_tool_call) in a
try/finally that always calls sandbox.commands.run(cmd=["rm","-rf", temp_dir],
...) so cleanup runs on success and all exceptions, remove the duplicated
cleanup blocks inside the size-check and success branches, and replace bare
except Exception: pass around the cleanup call with an exception handler that
logs the error (use logger.error or logger.exception) so Ruff S110 / BLE001
violations are avoided; keep the size-check early raise but ensure temp_dir is
still removed in the finally.
- Around line 343-349: The current check inspects api_response["detail"] which
is fragile; change the curl call that produces api_response to use -sS and -w
'\n%{http_code}' so you capture the HTTP status code separately, then parse the
body into api_response and branch on the status code (treat 2xx as success,
non-2xx as error). On non-2xx, prefer to extract api_response.get("detail") or
stderr content for a descriptive RuntimeError, falling back to "Unknown error"
only if nothing else is available. Keep the existing attachment_id extraction
(attachment_id = api_response.get("id")) for successful 2xx responses and only
raise "No attachment_id in response" after confirming a 2xx status.
- Around line 244-247: The code in _prepare_sandbox currently hardcodes
shell_type="ClaudeCode" when calling sandbox_manager.get_or_create_sandbox,
which ignores the actual execution shell declared in bindShells; fix this by
either (A) extending SkillToolContext to include a shell_type field, ensuring
callers (e.g., chat_shell execution state) populate it, then read
context.shell_type inside _prepare_sandbox and pass that to
sandbox_manager.get_or_create_sandbox, or (B) change the skill configuration to
restrict bindShells to ["Chat"] and add a default_shell_type config field that
_prepare_sandbox reads and uses when calling
sandbox_manager.get_or_create_sandbox; update all call sites to pass the new
SkillToolContext.shell_type if you choose option A and adjust SKILL.md/default
config if you choose option B so the sandbox type matches the execution context.
- Around line 726-748: In save_dingtalk_content, avoid building a huge python3
-c with base64 (python_cmd) which can exceed ARG_MAX; first perform a fail-fast
size check against ATTACHMENT_MAX_BYTES (raise a clear error if
content.encode('utf-8') length > ATTACHMENT_MAX_BYTES), then replace the
base64/python_cmd + sandbox.commands.run flow with
sandbox.files.write(save_path, content_bytes, mode="wb") (or the SDK equivalent)
to write the bytes directly, log the write, and handle/raise any errors from
sandbox.files.write instead of inspecting result.exit_code/result.stderr.
In `@backend/init_data/skills/dingtalk-knowledge/SKILL.md`:
- Around line 62-73: The docs for get_document_info incorrectly list the
returned metadata field as "extension"; update the documentation to use
"file_extension" instead so it matches the actual provider response used in
provider.py and LLM agents. Specifically, in the get_document_info description
replace any mention of `extension` with `file_extension` alongside the existing
`name`, `contentType`, and `nodeType` fields to ensure consumers of
get_document_info (function get_document_info) read the correct key.
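To illustrate the `download_url` sanitization item above, here is a stdlib-only sketch; the payload keys mirror the review's wording and are assumptions, not the provider's actual schema:

```python
import os
from urllib.parse import urlparse


def sanitized_tool_input(download_url: str, filename: str, file_size: int | None = None) -> dict:
    """Build websocket metadata that never carries the signed query/token."""
    # Strip the query string, which is where the signed download token lives.
    safe_url = urlparse(download_url)._replace(query="").geturl()
    _, extension = os.path.splitext(filename)
    payload = {"filename": filename, "file_extension": extension, "url": safe_url}
    if file_size is not None:
        payload["size_bytes"] = file_size
    return payload
```

For example, `sanitized_tool_input("https://host/f.docx?token=abc", "f.docx")` yields `{"filename": "f.docx", "file_extension": ".docx", "url": "https://host/f.docx"}`.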
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4d2d1e6d-152b-482f-a4a8-306eba61e84e
📒 Files selected for processing (2)
backend/init_data/skills/dingtalk-knowledge/SKILL.md
backend/init_data/skills/dingtalk-knowledge/provider.py
Actionable comments posted: 3
♻️ Duplicate comments (1)
backend/init_data/skills/dingtalk-knowledge/provider.py (1)
257-261: ⚠️ Potential issue | 🔴 Critical
Verify `get_or_create_sandbox` accepts a `timeout` kwarg; `shell_type` hardcoding still unresolved.

Two concerns on this call:

The `SandboxManager.get_or_create_sandbox` signature shown in `chat_shell/chat_shell/tools/sandbox/_base.py` (context snippet 1) is `(shell_type, workspace_ref=None, task_type="sandbox")` — there is no `timeout` parameter. Timeout is a constructor arg on `SandboxManager.__init__` instead. If the provided snippet reflects the current code, passing `timeout=effective_timeout` here will raise `TypeError`. Please verify and either move the timeout into `SandboxManager.get_instance(...)` or extend `get_or_create_sandbox` to accept it.

`shell_type="ClaudeCode"` is still hardcoded even though `SKILL.md` declares `bindShells: [Chat, Agno, ClaudeCode]`. This was flagged previously and not yet addressed.

```bash
#!/bin/bash
# Confirm current signature of get_or_create_sandbox and whether it accepts timeout.
rg -nP --type=py -B2 -A25 'async def get_or_create_sandbox' chat_shell/
# Check SandboxManager.__init__ for timeout handling.
rg -nP --type=py -B1 -A15 'class SandboxManager' chat_shell/chat_shell/tools/sandbox/_base.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 257 - 261, The call to SandboxManager.get_or_create_sandbox in provider.py is incorrect: verify the current signature of SandboxManager.get_or_create_sandbox (async def get_or_create_sandbox) and if it does not accept a timeout kwarg, remove the timeout=effective_timeout from this call and instead pass timeout via SandboxManager.get_instance(...) or add a timeout parameter to get_or_create_sandbox and propagate it through SandboxManager.__init__; also stop hardcoding shell_type="ClaudeCode" — read the bound shells for this skill (per SKILL.md or incoming config) and pass the appropriate shell_type value (or iterate supported shells) into get_or_create_sandbox so the call reflects the SKILL.md bindShells list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 785-790: The size check for content_bytes against
ATTACHMENT_MAX_BYTES currently raises RuntimeError before the try/except that
calls _handle_error, so move the validation inside the existing try block (the
block that surrounds the subsequent upload flow) or instead call
self._handle_error(...) with the same error context and return before raising;
specifically, ensure the check of content_bytes/ATTACHMENT_MAX_BYTES is
performed inside the same try that catches Exception and routes to _handle_error
(or invoke _handle_error directly) so the failure emits the `failed` websocket
event and follows the same error-handling path as the filename validation.
- Around line 817-819: The call to AsyncSandbox.files.write in provider.py
passes an unsupported mode="wb" kwarg which will raise a TypeError; locate the
await sandbox.files.write(...) call inside the try/except (in the same function
where save_path and content_bytes are used) and remove the mode="wb" argument so
it reads await sandbox.files.write(save_path, content_bytes), keeping the
surrounding exception handling (raising RuntimeError with the original error)
intact; verify content_bytes is bytes before the call if necessary.
- Around line 405-410: The error-handling block that builds the RuntimeError
uses api_response.get("detail") directly which can be a dict; update the logic
in provider.py around the upload response handling so that if error_detail is a
dict you first try to use error_detail.get("message") (falling back to
json.dumps(error_detail) if no message key), otherwise keep existing
upload_result.stderr or the "HTTP {http_status}: Unknown error" fallback, and
then raise RuntimeError with that safe, human-readable string; reference the
variables api_response.get("detail"), upload_result.stderr, http_status and the
RuntimeError raise site when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5ea3ef2c-8cd6-4df3-986c-be7923260276
⛔ Files ignored due to path filters (1)
`backend/init_data/skills/dingtalk-knowledge.zip` is excluded by `!**/*.zip`
📒 Files selected for processing (2)
backend/init_data/skills/dingtalk-knowledge/SKILL.md
backend/init_data/skills/dingtalk-knowledge/provider.py
Force-pushed from 20b30e7 to 5f8f7e7
Actionable comments posted: 3
♻️ Duplicate comments (4)
backend/init_data/skills/dingtalk-knowledge/SKILL.md (1)
7-10: ⚠️ Potential issue | 🟠 Major
`bindShells` ≠ what the provider supports.

Root cause is in `provider.py` (`_prepare_sandbox` hardcodes `shell_type="ClaudeCode"`), flagged separately. Either restrict `bindShells` to `["Chat"]` here to match current behavior, or fix `provider.py` to dispatch dynamically. Leaving the mismatch in place guarantees the skill misbehaves under Agno invocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/SKILL.md` around lines 7 - 10, The SKILL.md lists bindShells including "Agno" and "ClaudeCode" but the provider code (_prepare_sandbox in provider.py) currently hardcodes shell_type="ClaudeCode", causing a mismatch; either change the SKILL.md bindShells array to only include "Chat" to reflect actual supported invocation, or update provider.py's _prepare_sandbox to dynamically dispatch shell_type based on the incoming request (e.g., map the requested shell to the appropriate sandbox initialization instead of the fixed "ClaudeCode"), ensuring bindShells and the runtime dispatch (the _prepare_sandbox function) remain consistent.

backend/init_data/skills/dingtalk-knowledge/provider.py (3)
253-261: ⚠️ Potential issue | 🟠 Major
Hardcoded `shell_type="ClaudeCode"` still conflicts with `bindShells: [Chat, Agno, ClaudeCode]` — unresolved.

This was raised previously and has not been addressed: the skill advertises support for three shells in `SKILL.md` but `_prepare_sandbox` unconditionally requests a ClaudeCode sandbox. When the skill is invoked under Chat or Agno, this either fails (if the template name isn't valid there) or creates a sandbox that cannot be reused by the rest of the session. Either propagate the current shell through `SkillToolContext` (and read `context.shell_type` here), or restrict `bindShells` to `["Chat"]` and add a `default_shell_type` config as the `sandbox` skill does.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 253 - 261, The _prepare_sandbox code currently hardcodes shell_type="ClaudeCode" when calling sandbox_manager.get_or_create_sandbox which conflicts with the advertised bindShells and breaks non-ClaudeCode sessions; update the flow to read the active shell type from the SkillToolContext (e.g., context.shell_type) and pass that into get_or_create_sandbox, falling back to a configurable default_shell_type if context.shell_type is missing, or alternatively change the skill metadata to restrict bindShells to ["Chat"] and add default_shell_type config similar to the sandbox skill; modify the call in _prepare_sandbox to use the chosen variable instead of the literal "ClaudeCode" and ensure SKILL.md and bindShells reflect this decision.
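A short sketch of option (A) from this comment; `context.shell_type` and the `default_shell_type` config key are hypothetical names proposed by the review, not existing fields:

```python
# Hypothetical: resolve the shell type from the invocation context instead of
# hardcoding "ClaudeCode"; fall back to a configurable default.
shell_type = getattr(context, "shell_type", None) or tool_config.get(
    "default_shell_type", "Chat"
)
sandbox = await sandbox_manager.get_or_create_sandbox(shell_type=shell_type)
```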
814-819: ⚠️ Potential issue | 🔴 Critical
`sandbox.files.write(..., mode="wb")` still present — the unsupported kwarg will raise `TypeError` at runtime.

This was flagged previously and marked as addressed, but the `mode="wb"` kwarg is still in the code. `AsyncSandbox.files.write()` in the `e2b_code_interpreter` SDK does not accept a `mode` parameter in any released version; passing it converts every `save_dingtalk_content` invocation into an immediate failure (caught by the generic `except` and reported as "Failed to write content: ..."). `content_bytes` is already `bytes`, so just drop the kwarg.

💡 Fix

```diff
- await sandbox.files.write(save_path, content_bytes, mode="wb")
+ await sandbox.files.write(save_path, content_bytes)
```

```bash
#!/bin/bash
# Cross-check how write() is called elsewhere in this repo (should pass bytes without mode=).
rg -nP --type=py -C 2 '\bfiles\.write\s*\('
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 814 - 819, The async write call erroneously passes an unsupported kwarg mode="wb" to AsyncSandbox.files.write causing a TypeError at runtime; inside SaveDingTalkContentTool (the block using logger.info and await sandbox.files.write(save_path, content_bytes, mode="wb")), remove the mode="wb" argument and call AsyncSandbox.files.write with just save_path and content_bytes (ensure content_bytes is bytes), leaving the existing try/except intact; also scan other usages of AsyncSandbox.files.write to ensure no other calls pass mode=.
784-790: ⚠️ Potential issue | 🟠 Major
Size-check raise still bypasses `_handle_error` — previously flagged, not addressed.

Lines 785-790 execute before the `try:` block at line 804, so a size violation raises straight out of `_arun` without emitting a `failed` websocket event or producing the standardized error JSON. Route it through `_handle_error` for consistency with the filename validation above.

💡 Fix

```diff
  # Fail-fast size check before writing
  content_bytes = content.encode("utf-8")
  if len(content_bytes) > ATTACHMENT_MAX_BYTES:
-     raise RuntimeError(
-         f"Content size ({len(content_bytes)} bytes) exceeds maximum allowed size "
-         f"({ATTACHMENT_MAX_BYTES} bytes)"
-     )
+     return await self._handle_error(
+         filename,
+         RuntimeError(
+             f"Content size ({len(content_bytes)} bytes) exceeds maximum allowed size "
+             f"({ATTACHMENT_MAX_BYTES} bytes)"
+         ),
+     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 784 - 790, The size-check that raises when content.encode("utf-8") exceeds ATTACHMENT_MAX_BYTES is executed outside the try block in _arun and therefore bypasses _handle_error and the standardized websocket failure path; move the size validation inside the same try that wraps the rest of _arun (or, alternatively, catch the RuntimeError where current validation lives and call self._handle_error(...) with the same error payload used for filename validation) so that violations of ATTACHMENT_MAX_BYTES for variable content always route through _handle_error and emit the standardized failed websocket event.
🧹 Nitpick comments (2)
backend/init_data/skills/dingtalk-knowledge/provider.py (2)
586-595: Input key name inconsistency: `extension` here vs `file_extension` everywhere else.

The tool input schema and the rest of the code consistently use `file_extension` (including `SKILL.md`). For symmetry with the websocket `completed` payload and the schema itself, emit `file_extension` rather than `extension` in the `running` event.

Diff

```diff
 tool_input={
     "filename": filename,
-    "extension": file_ext,
+    "file_extension": file_ext,
 },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 586 - 595, The running-event payload uses the wrong key "extension" instead of the project-standard "file_extension"; update the emit in provider.py where self.ws_emitter.emit_tool_call is called (the block that builds tool_input with "filename" and file_ext) to emit "file_extension": file_ext so the running event matches the schema and the completed payload used elsewhere (reference the variable file_ext and the method self.ws_emitter.emit_tool_call).
356-368: `stdout.strip()` + split-on-`\n` can mis-parse the HTTP status when the body contains a trailing newline or CR.

`curl -w "\n%{http_code}"` appends a newline + the status after the body. After `.strip()`, if the API returned a trailing newline in its JSON body (many frameworks do), the final structure is `body\n\nstatus` — which still works — but `output_lines[-1]` being a non-numeric value anywhere (e.g., due to `-sS` writing a transient error line mid-stream on retry) will raise `ValueError` from `int(output_lines[-1])` and be reported to the user as a generic error. Consider `-w "\nHTTP_STATUS:%{http_code}"` with an explicit prefix and `re.search` to extract the status, which is far more robust than positional parsing. Minor — flag as optional hardening.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 356 - 368, The current parsing of upload_result.stdout using stdout.strip().split("\n") and taking output_lines[-1] to get http_status is fragile; change the curl invocation to use a unique suffix like -w "\nHTTP_STATUS:%{http_code}" and then extract the status with a regex search (e.g., search for r"HTTP_STATUS:(\d{3})") on upload_result.stdout instead of positional splitting, set http_status from the matched group (handling missing/invalid match with a clear RuntimeError), set response_body to stdout with the matched suffix removed (or use the span of the match to slice), and keep the JSON parsing of response_body into api_response (json.loads) with the same exception handling; reference symbols: upload_result, upload_result.stdout, http_status, response_body, api_response, json.loads.
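A sketch of the prefix-based parsing the nitpick suggests, assuming `upload_result.stdout` holds the response body followed by the `-w` suffix; the error-detail handling follows the sibling comment about `detail` sometimes being a dict:

```python
import json
import re

# Assumes curl was invoked with: -sS -w "\nHTTP_STATUS:%{http_code}"
match = re.search(r"HTTP_STATUS:(\d{3})\s*$", upload_result.stdout)
if not match:
    raise RuntimeError("Could not locate HTTP status in curl output")
http_status = int(match.group(1))
response_body = upload_result.stdout[: match.start()].strip()
api_response = json.loads(response_body)
if not (200 <= http_status < 300):
    detail = api_response.get("detail") or f"HTTP {http_status}: Unknown error"
    # detail may be a dict on some frameworks; keep the message human-readable.
    raise RuntimeError(detail if isinstance(detail, str) else json.dumps(detail))
```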
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/init_data/skills/dingtalk-knowledge/provider.py`:
- Around line 412-428: The temp-dir cleanup in _upload_and_return is
insufficient because temp_dir is created in _prepare_sandbox and other failures
in DownloadDingTalkDocumentTool._arun and SaveDingTalkContentTool._arun bypass
that finally; move the cleanup out of _upload_and_return and into each _arun (or
make _prepare_sandbox return an async context manager or cleanup callback) so
temp_dir is removed in a finally block inside DownloadDingTalkDocumentTool._arun
and SaveDingTalkContentTool._arun (and any caller that uses _prepare_sandbox),
ensuring you always call sandbox.commands.run(["rm","-rf", temp_dir]) in each
finally and remove the inner try/finally from _upload_and_return; reference
_prepare_sandbox, _upload_and_return, DownloadDingTalkDocumentTool._arun,
SaveDingTalkContentTool._arun and _handle_error when making the changes.
- Around line 525-527: The docstring/note in DownloadDingTalkDocumentTool and
SaveDingTalkContentTool is misleading about requiring a separate "sandbox"
skill; update their description strings to clarify the tool creates/acquires its
own sandbox via SandboxManager. Locate the description text in the classes
DownloadDingTalkDocumentTool and SaveDingTalkContentTool and replace "Make sure
the sandbox skill is loaded." with a clear phrase such as "This tool runs inside
a sandbox; it will acquire one automatically." (or similar wording) so agents
are not instructed to enable an external sandbox skill.
- Around line 257-261: The call to sandbox_manager.get_or_create_sandbox passes
a nonexistent timeout kwarg causing a TypeError; instead set the timeout when
obtaining the manager instance. Remove timeout=effective_timeout from
get_or_create_sandbox(...) and ensure SandboxManager.get_instance(...) is called
with timeout=effective_timeout (or obtain the manager via
SandboxManager.get_instance(..., timeout=effective_timeout)) so subsequent calls
like sandbox_manager.get_or_create_sandbox(shell_type="ClaudeCode",
workspace_ref=None[, task_type=...]) work correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f01d9275-6505-4312-8679-850089b106bb
📒 Files selected for processing (2)
backend/init_data/skills/dingtalk-knowledge/SKILL.md
backend/init_data/skills/dingtalk-knowledge/provider.py
Actionable comments posted: 2
♻️ Duplicate comments (2)
backend/init_data/skills/dingtalk-knowledge/provider.py (2)
246-250: ⚠️ Potential issue | 🟠 Major
`effective_timeout` is currently ignored during sandbox acquisition.

`_prepare_sandbox(effective_timeout)` accepts the timeout, but lines 246-250 do not pass it into `SandboxManager.get_instance(...)`, despite the note on line 254. This can make `timeout_seconds` partially ineffective.

Proposed fix

```diff
 sandbox_manager = SandboxManager.get_instance(
     task_id=self.task_id,
     user_id=self.user_id,
     user_name=effective_user_name,
+    timeout=effective_timeout,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 246 - 250, The call to SandboxManager.get_instance in _prepare_sandbox ignores the effective_timeout, so timeout_seconds can be ineffective; update the call in _prepare_sandbox to pass the effective_timeout (or timeout_seconds) into SandboxManager.get_instance (e.g., add a timeout_seconds or timeout arg) so the manager is created with the intended timeout; ensure the parameter name matches SandboxManager.get_instance signature and propagate effective_timeout from _prepare_sandbox into that call.
604-660: ⚠️ Potential issue | 🟠 Major
Temp directory cleanup still leaks on pre-upload failures.

Cleanup runs in `_upload_and_return`'s `finally`, but failures after lines 606/813 and before calling `_upload_and_return` (e.g., download/write/get_info errors) bypass that cleanup and leave `temp_dir` behind.

Proposed fix shape (move cleanup to each `_arun`)

```diff
-    try:
-        sandbox, temp_dir = await self._prepare_sandbox(effective_timeout)
+    sandbox = None
+    temp_dir = None
+    try:
+        sandbox, temp_dir = await self._prepare_sandbox(effective_timeout)
         ...
         return await self._upload_and_return(...)
     except Exception as e:
         return await self._handle_error(filename, e)
+    finally:
+        if sandbox is not None and temp_dir:
+            try:
+                await sandbox.commands.run(cmd=["rm", "-rf", temp_dir], cwd="/home/user")
+            except Exception as cleanup_err:
+                logger.warning("Temp dir cleanup failed: %s", cleanup_err)
```

Also applies to: 801-845
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 604 - 660, The temp_dir created by _prepare_sandbox is only cleaned in _upload_and_return's finally, so exceptions occurring between sandbox creation and the call to _upload_and_return (e.g., curl failure, sandbox.files.get_info) leak the temp directory; update the async run path (the relevant _arun/_download/_run logic around _prepare_sandbox, the curl invocation, and the except block that calls _handle_error) to ensure temp_dir is removed on all error paths: either perform cleanup in a local finally around the try that creates sandbox (after calling _prepare_sandbox) or call a cleanup helper in the except before delegating to _handle_error; reference and reuse the same cleanup/teardown used by _upload_and_return so you don't duplicate teardown logic. Ensure cleanup runs whether success or failure and does not rely solely on _upload_and_return.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ff7c0e87-78a8-492a-9648-c657b48ff4d7
📒 Files selected for processing (1)
backend/init_data/skills/dingtalk-knowledge/provider.py
```python
if self.ws_emitter:
    await self.ws_emitter.emit_tool_call(
        task_id=self.task_id,
        tool_name=self.name,
        tool_input={"filename": filename},
        status="completed",
        output=response,
    )
```
Do not fail a successful upload when websocket emit fails.
On Line 396, `emit_tool_call(..., status="completed")` is not guarded. If the emit fails, the method raises and the caller returns a failed tool response even though the upload succeeded.
Proposed fix
```diff
 if self.ws_emitter:
-    await self.ws_emitter.emit_tool_call(
-        task_id=self.task_id,
-        tool_name=self.name,
-        tool_input={"filename": filename},
-        status="completed",
-        output=response,
-    )
+    try:
+        await self.ws_emitter.emit_tool_call(
+            task_id=self.task_id,
+            tool_name=self.name,
+            tool_input={"filename": filename},
+            status="completed",
+            output=response,
+        )
+    except Exception as e:
+        logger.warning("Failed to emit completion status: %s", e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 395 -
403, The websocket emit call (self.ws_emitter.emit_tool_call with
status="completed" and output=response) can raise and cause a successful upload
to be reported as failed; wrap the emit_tool_call invocation in a try/except
that catches exceptions from self.ws_emitter.emit_tool_call, log the exception
(or call self.ws_emitter.logger/error handler) and swallow it so it doesn't
propagate, and ensure the method still returns the successful upload result;
specifically protect the block that references task_id, name, filename and
response so emit failures do not change the overall success path.
```python
if self.ws_emitter:
    await self.ws_emitter.emit_tool_call(
        task_id=self.task_id,
        tool_name=self.name,
        tool_input={"filename": filename},
        status="failed",
        error=str(error),
    )
```
`_handle_error` should be best-effort for websocket emission.
If the Line 466 emit fails, `_handle_error` can throw and mask the original error response path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 465 -
472, The websocket emission in _handle_error (the call to
self.ws_emitter.emit_tool_call) must be made best-effort so it cannot raise and
mask the original error; wrap the emit call in a try/except that catches any
exception from emit_tool_call, log or ignore that secondary error, and do not
re-raise it so the original error path remains intact (modify the block inside
_handle_error that references self.ws_emitter/emit_tool_call accordingly).
Force-pushed from 0881b98 to 047d90a
♻️ Duplicate comments (4)
backend/init_data/skills/dingtalk-knowledge/provider.py (4)
395-402: ⚠️ Potential issue | 🟠 Major
Wrap websocket emit in try/except to prevent masking successful uploads.

If `emit_tool_call` raises an exception, the successful upload will fail to return its response and the exception will propagate to the caller, making a successful operation appear failed. Wrap the emit in a try/except block that logs the error but allows the success path to continue.

🛡️ Proposed fix

```diff
 if self.ws_emitter:
-    await self.ws_emitter.emit_tool_call(
-        task_id=self.task_id,
-        tool_name=self.name,
-        tool_input={"filename": filename},
-        status="completed",
-        output=response,
-    )
+    try:
+        await self.ws_emitter.emit_tool_call(
+            task_id=self.task_id,
+            tool_name=self.name,
+            tool_input={"filename": filename},
+            status="completed",
+            output=response,
+        )
+    except Exception as e:
+        logger.warning("Failed to emit completion status: %s", e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 395 - 402, The websocket emit call in the provider's upload flow can raise and mask successful uploads; wrap the await self.ws_emitter.emit_tool_call(...) inside a try/except that catches Exception, logs the error (including self.task_id, self.name, and filename for context) and continues without re-raising so the successful response (variable response) is still returned; locate the block using self.ws_emitter and the emit_tool_call call and ensure only the emit is inside the try/except so upload success path is unchanged.
801-845: ⚠️ Potential issue | 🟠 Major
Temp directory cleanup is leaked on pre-upload failures.

Same issue as in `DownloadDingTalkDocumentTool`: if `sandbox.files.write` fails at lines 823-825 or `get_info` fails at line 828, the exception handler at line 844 is invoked without cleanup, leaving `temp_dir` in the sandbox.

Add a finally block to `_arun` to ensure cleanup runs on all paths, and remove the duplicate cleanup from `_upload_and_return`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 801 - 845, The _arun method of SaveDingTalkContentTool leaks the sandbox temp_dir when sandbox.files.write or sandbox.files.get_info throws; add a finally block in SaveDingTalkContentTool._arun that always calls the sandbox cleanup (using the same cleanup routine currently in _upload_and_return) so temp_dir is removed on success, handled errors, or intermediate failures, and then remove the duplicate cleanup from _upload_and_return to avoid double-cleanup; ensure you still await _prepare_sandbox (and capture sandbox and temp_dir) before using the finally so cleanup can run safely even when write/get_info fails and keep existing calls to _handle_error for error reporting.
604-659: ⚠️ Potential issue | 🟠 Major
Temp directory cleanup is leaked on pre-upload failures.

The cleanup in `_upload_and_return`'s finally block (lines 417-433) only runs if `_upload_and_return` is called. If the download fails at lines 638-639 or `get_info` fails at line 642, control jumps to the exception handler at line 658 and `_handle_error` is called without cleanup — leaving `temp_dir` behind in the sandbox.

Move cleanup into a finally block within `_arun` so it's guaranteed to run on all paths.

🧹 Proposed fix

```diff
 async def _arun(...) -> str:
     ...
+    sandbox = None
+    temp_dir = None
     try:
         # Prepare sandbox
         sandbox, temp_dir = await self._prepare_sandbox(effective_timeout)
         ...
         return await self._upload_and_return(...)
     except Exception as e:
         return await self._handle_error(filename, e)
+    finally:
+        if sandbox is not None and temp_dir:
+            try:
+                await sandbox.commands.run(
+                    cmd=["rm", "-rf", temp_dir],
+                    cwd="/home/user",
+                )
+                logger.info("Cleaned up temp_dir=%s", temp_dir)
+            except Exception as cleanup_err:
+                logger.error("Failed to cleanup temp_dir=%s: %s", temp_dir, cleanup_err)
```

And remove the cleanup from `_upload_and_return`'s finally (lines 417-433).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 604 - 659, The temp_dir cleanup is only in _upload_and_return's finally so failures during download or get_info in _arun leak the sandbox/temp_dir; modify _arun (the async method containing calls to _prepare_sandbox, sandbox.commands.run, and sandbox.files.get_info) to declare sandbox and temp_dir in the outer scope and wrap the download/upload logic in a try/finally where the finally always removes temp_dir and closes/cleans sandbox (guarding for None), then call _handle_error from the except but let the finally run to perform cleanup; after moving cleanup to _arun, remove the redundant cleanup code from _upload_and_return's finally so cleanup happens exactly once and on all paths.
465-472: ⚠️ Potential issue | 🟠 Major
Make websocket emission in error handler best-effort.

If the websocket emit fails within `_handle_error`, it will raise and prevent the error response from being returned to the caller, masking the original error. Wrap the emit in a try/except that logs but doesn't propagate the exception.

🛡️ Proposed fix

```diff
 if self.ws_emitter:
-    await self.ws_emitter.emit_tool_call(
-        task_id=self.task_id,
-        tool_name=self.name,
-        tool_input={"filename": filename},
-        status="failed",
-        error=str(error),
-    )
+    try:
+        await self.ws_emitter.emit_tool_call(
+            task_id=self.task_id,
+            tool_name=self.name,
+            tool_input={"filename": filename},
+            status="failed",
+            error=str(error),
+        )
+    except Exception as e:
+        logger.warning("Failed to emit error status: %s", e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/provider.py` around lines 465 - 472, The websocket emit in _handle_error (the await self.ws_emitter.emit_tool_call(...) call) should be best-effort: wrap that await in a try/except Exception block so any exceptions from ws_emitter.emit_tool_call do not propagate and mask the original error; inside the except, log the exception (e.g., via self.logger.exception or a similar logger) and continue, ensuring the original error response is still returned to the caller.
🧹 Nitpick comments (1)
backend/init_data/skills/dingtalk-knowledge/SKILL.md (1)
67-67: Add language specifiers to code blocks for better syntax highlighting.

Several code blocks lack language specifiers, which prevents proper syntax highlighting in documentation renderers. Add the appropriate language identifier after the opening fence.

📝 Suggested fixes

Line 67:

````diff
-```
+```python
 dingtalk-docs.get_document_info(document_id="xxx")
````

Line 90:

````diff
-```
+```python
 dingtalk-docs.get_document_content(nodeId="xxx")
````

Lines 99, 110, 121 similarly need `python` specifiers.

Also applies to: 90-90, 99-99, 110-110, 121-121

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-knowledge/SKILL.md` at line 67, The markdown code fences in SKILL.md around the dingtalk examples are missing language specifiers which prevents syntax highlighting; update each affected fence that contains calls like dingtalk-docs.get_document_info(document_id="xxx"), dingtalk-docs.get_document_content(nodeId="xxx"), and the other example blocks (lines showing those calls) by adding the python language identifier immediately after the opening fence (e.g., change ``` to ```python) for all occurrences referenced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 46b326dd-8009-494f-9388-8f314fa7fc8a
📥 Commits
Reviewing files that changed from the base of the PR and between 0881b98b0d638f86c8bb4fdcf2ca7dd5248ab4c4 and 047d90a87fdb626d5db0545a5b3548b0da1e77de.
📒 Files selected for processing (4)
backend/app/mcp_server/tools/__init__.py
backend/app/mcp_server/tools/decorator.py
backend/init_data/skills/dingtalk-knowledge/SKILL.md
backend/init_data/skills/dingtalk-knowledge/provider.py
✅ Files skipped from review due to trivial changes (2)
backend/app/mcp_server/tools/__init__.py
backend/app/mcp_server/tools/decorator.py
Force-pushed from e172f52 to 582eaaa
feat(skills): add dingtalk-knowledge skill for uploading DingTalk documents to knowledge base

Add new skill to upload DingTalk documents, spreadsheets, and AI tables to Wegent knowledge bases. Includes:
- SKILL.md: Comprehensive documentation with workflow and examples
- provider.py: Tool provider with two tools:
  - download_dingtalk_document: Download file-based docs via sandbox
  - save_dingtalk_content: Save online doc content (adoc) as markdown

Supports multiple DingTalk document types:
- File-based documents (docx, xlsx, pdf) via download_file MCP
- Online documents (adoc) via get_document_content MCP
- Spreadsheets (axls) via dingtalk-table MCP
- AI tables (able) via dingtalk-ai-table MCP

Integrates with wegent-knowledge MCP to create documents with proper source attribution and trigger indexing.
- Add OpenTelemetry tracing to async entrypoints and helpers
- Fix websocket emission to remove sensitive download_url from payload
- Add curl options for max file size (--max-filesize) and timeout (--max-time)
- Fix temp_dir leak by wrapping upload logic in try/finally with proper cleanup
- Fix HTTP status code checking by parsing status from curl response
- Fix hardcoded shell_type by passing timeout to get_or_create_sandbox()
- Replace base64/python_cmd with sandbox.files.write() for content saving
- Add return type annotation and remove redundant f-string prefixes
- Update SKILL.md documentation: change extension to file_extension
- Clarify workflow: documents are read FROM DingTalk and stored TO Wegent KB
Update SKILL.md to use the new prefixed tool name:
- Change wegent-knowledge.create_document to wegent_kb_create_document
- Update all examples and references
- Clarify warning about tool selection
- Move content_bytes size check inside try block for proper error handling
- Remove unsupported mode='wb' from sandbox.files.write() call
- Fix error_detail handling when it's a dict instead of string
- Remove timeout parameter from get_or_create_sandbox() call (timeout is passed via get_instance)
…ples

Update the docstring in __init__.py and example code in decorator.py to use the actual MCP tool names (with wegent_kb_ prefix) instead of internal function names. This ensures documentation consistency with the @mcp_tool decorator name parameters in knowledge.py.

Changes:
- __init__.py: Updated Knowledge MCP tools list with correct names and added missing tools (search_knowledge_base, read_document_content)
- decorator.py: Fixed example code to use wegent_kb_list_knowledge_bases
- Merge dingtalk-knowledge SKILL.md into wegent-knowledge SKILL.md
- Add provider.py to wegent-knowledge with DingTalk upload tools
- Rename DingTalkKnowledgeProvider to WegentKnowledgeProvider
- Update provider name from dingtalk_knowledge to wegent_knowledge
- Remove standalone dingtalk-knowledge skill directory
- Combine MCP servers: wegent-knowledge, dingtalk-docs, dingtalk-table, dingtalk-ai-table
- Move temp_dir cleanup from _upload_and_return to _arun methods with finally block
- Update tool descriptions to clarify sandbox is auto-acquired (not external skill)
- Fix timeout parameter - pass to SandboxManager.get_instance instead of get_or_create_sandbox
- Add configurable default_shell_type via tool_config (default: Agno)
- Remove unsupported mode='wb' from sandbox.files.write call
- Move content size validation inside try block for proper error handling
- Fix emit_tool_call key from 'extension' to 'file_extension'
- Fix fragile HTTP status parsing using regex with HTTP_STATUS suffix
- Restrict bindShells to [Chat] to match runtime behavior
Force-pushed from d267459 to 2b34c19
Only use skills, upload DingTalk document to wegent knowledge base
Force-pushed from 2b34c19 to c3f3d1c
…upload

Add new MCP tools to replace sandbox exec for DingTalk document upload:

1. dingtalk_upload_file_from_url: Download file from URL and upload to Wegent
   - Downloads file from DingTalk download_url
   - Automatically manages temporary files
   - Returns attachment_id for wegent_kb_create_document

2. dingtalk_upload_content: Save content to file and upload to Wegent
   - Writes content to temporary file
   - Supports configurable encoding
   - Returns attachment_id for wegent_kb_create_document

Both tools:
- Use tempfile module for secure temporary file handling
- Clean up temporary files after upload (in finally block)
- Integrate with existing context_service.upload_attachment

Update SKILL.md:
- Remove all sandbox exec references
- Document new helper tools
- Update workflow and examples

Update server.py:
- Import dingtalk_upload module for tool registration

Fixes secret
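A minimal sketch of the temp-file pattern this commit describes: write, upload, then clean up in `finally`. The `upload_attachment` call stands in for the real `context_service.upload_attachment` and is an assumption, not the actual signature:

```python
import os
import tempfile


async def upload_bytes(data: bytes, suffix: str) -> dict:
    """Write bytes to a secure temp file, upload it, and always clean up."""
    fd, path = tempfile.mkstemp(suffix=suffix)  # secure, race-free temp file
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return await upload_attachment(path)  # placeholder for context_service
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass  # best-effort cleanup
```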
@coderabbitai resume
✅ Actions performed
Reviews resumed.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/mcp_server/tools/dingtalk_upload.py (1)
1-305: ⚠️ Potential issue | 🟠 Major
Fix Black formatting (pipeline failing).
The Lint GitHub Action reports that Black would reformat this file. Lines such as 58, 154, 158, and 258 exceed the 88-column limit. Run `black backend/app/mcp_server/tools/dingtalk_upload.py` before merging.
As per coding guidelines: "Follow PEP 8 standard with Black formatter (line length: 88) and isort for Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 1 - 305, Run Black to reformat the file to the project's 88-column style (e.g., run `black backend/app/mcp_server/tools/dingtalk_upload.py`); specifically wrap or shorten long log/message and header lines found in functions _download_file_from_url (headers User-Agent), upload_file_from_url (logger.info, error messages and the returned dict keys/values that exceed 88 chars), and upload_content (logger.info, error messages and returned dict), so all lines comply with Black/PEP8; after formatting, re-run linting to confirm pipeline passes.
🧹 Nitpick comments (4)
backend/app/mcp_server/tools/__init__.py (1)
31-44: `DINGTALK_UPLOAD_MCP_TOOLS` duplicates `KNOWLEDGE_MCP_TOOLS` content.
Both `KNOWLEDGE_MCP_TOOLS` and `DINGTALK_UPLOAD_MCP_TOOLS` are built with `build_mcp_tools_dict(server="knowledge")` (see `dingtalk_upload.py` line 304 and the corresponding call in `knowledge.py`). Since `build_mcp_tools_dict` filters only by `server` name, both constants resolve to the same dict — every knowledge tool, not just DingTalk ones. The name is misleading and the export adds no new information.
Either drop the export entirely (registration goes through `get_registered_mcp_tools` directly, so it's unused) or have `dingtalk_upload.py` build a properly-scoped registry (e.g., add a per-module filter to `build_mcp_tools_dict`, or use a dedicated server tag). Same applies to the `__all__` entry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/__init__.py` around lines 31 - 44, DINGTALK_UPLOAD_MCP_TOOLS and KNOWLEDGE_MCP_TOOLS both call build_mcp_tools_dict(server="knowledge") so the former is a misleading duplicate; fix by either removing the DINGTALK_UPLOAD_MCP_TOOLS export/entry in __all__ and any unused references (since get_registered_mcp_tools handles registration), or change the construction in dingtalk_upload.py to scope correctly (e.g., call build_mcp_tools_dict with a distinct server tag or add a per-module filter) so DINGTALK_UPLOAD_MCP_TOOLS represents only DingTalk tools; update the __all__ list accordingly and ensure references to DINGTALK_UPLOAD_MCP_TOOLS are adjusted or deleted.
backend/app/mcp_server/tools/dingtalk_upload.py (2)
187-189: Use `!s` conversion instead of `str(e)` (Ruff RUF010).
Minor — applies here and at Line 292:
```diff
- return {"error": f"Failed to download file: {str(e)}"}
+ return {"error": f"Failed to download file: {e!s}"}
```
```diff
- return {"error": f"Failed to encode content with {encoding}: {str(e)}"}
+ return {"error": f"Failed to encode content with {encoding}: {e!s}"}
```
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 187 - 189, In the exception handlers inside dingtalk_upload.py replace explicit str(e) with the f-string !s conversion to satisfy Ruff RUF010: change logger.error(f"Failed to download file from URL: {e}") and return {"error": f"Failed to download file: {str(e)}"} to use {e!s}, and apply the same change to the other similar handler around the later return (the one at the second exception block near the upload/download flow) so both logging and returned error strings use {e!s}.
256-292: Avoid the temp-file roundtrip in `upload_content`.
The function writes `content` to a `NamedTemporaryFile`, then re-reads the same bytes back from disk into memory before calling `upload_attachment`. Since `upload_attachment` only needs `bytes`, you can encode in memory and skip both the temp file and the cleanup path entirely:
♻️ Proposed simplification
```diff
-    temp_file_path = None
-    db = SessionLocal()
+    db = SessionLocal()
     try:
         user = _get_user_from_token(db, token_info)
         if not user:
             return {"error": "User not found"}
-        # Validate content
         if not content:
             return {"error": "Content cannot be empty"}
-
-        # Validate filename
         if not filename:
             return {"error": "Filename is required"}
-        # Create temporary file with proper extension
-        file_extension = os.path.splitext(filename)[1] or ".md"
-        with tempfile.NamedTemporaryFile(delete=False, suffix=file_extension) as temp_file:
-            temp_file_path = temp_file.name
-        try:
-            # Write content to temporary file
-            logger.info(f"Writing content to temporary file {temp_file_path}")
-            with open(temp_file_path, "w", encoding=encoding) as f:
-                f.write(content)
-
-            # Read file content as binary for upload
-            with open(temp_file_path, "rb") as f:
-                binary_data = f.read()
-
+        binary_data = content.encode(encoding)
         file_size = len(binary_data)
         context, truncation_info = context_service.upload_attachment(
             db=db,
             user_id=user.id,
             filename=filename,
             binary_data=binary_data,
             subtask_id=0,
         )
         ...
     except UnicodeEncodeError as e:
         ...
     finally:
         db.close()
-        _clean_temp_file(temp_file_path)
```
This also makes the `UnicodeEncodeError` branch reachable in the natural place (during `.encode()`) and removes a needless filesystem dependency from the hot path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 256 - 292, The current upload_content flow unnecessarily writes to a NamedTemporaryFile then re-reads it; instead, in upload_content encode the in-memory content to bytes using content.encode(encoding) (catch UnicodeEncodeError around this call), compute file_size = len(binary_data), and call context_service.upload_attachment(db=db, user_id=user.id, filename=filename, binary_data=binary_data, subtask_id=0). Return the same dict using context.id, context.mime_type and truncation_info.is_truncated (if present). Remove creation/opening of NamedTemporaryFile and all file read/write operations so there is no filesystem roundtrip.
backend/init_data/skills/wegent-knowledge/SKILL.md (1)
210-267: Add language tags to fenced code blocks (markdownlint MD040).
Lines 210, 233, 242, 253, and 264 open a fence with ``` but no language identifier. Add `python` (or `text`) to keep markdownlint clean and improve syntax highlighting in the skill renderer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/wegent-knowledge/SKILL.md` around lines 210 - 267, The markdown examples in SKILL.md open fenced code blocks without language tags which triggers markdownlint MD040; update each triple-backtick block shown around the get_document_info example, the get_document_content example, the dingtalk-table.get_all_sheets / get_range example, the dingtalk-ai-table.get_tables / query_records example, and the dingtalk-docs.download_file example to include a language identifier (e.g., ```python for code-like calls or ```text for plain output) so the code blocks for get_document_info, get_document_content, dingtalk-table.get_all_sheets, dingtalk-table.get_range, dingtalk-ai-table.get_tables, dingtalk-ai-table.query_records, and dingtalk-docs.download_file have proper syntax highlighting and satisfy MD040.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 40-67: _download_file_from_url currently performs an unrestricted
requests.get on a model-controlled URL; add strict URL validation and a download
size cap: implement a helper _validate_download_url(url: str) that parses the
URL, allows only http/https (ALLOWED_SCHEMES), ensures a non-empty hostname,
rejects obvious SSRF hosts (localhost, 127.0.0.1, ::1, RFC1918 patterns and
".internal"), and (optionally) resolve and reject private/link-local IPs (and
re-check after redirects); in _download_file_from_url call
_validate_download_url first, configure requests to disable or limit redirects,
stream the response and enforce MAX_DOWNLOAD_BYTES (e.g. 100*1024*1024) by
incrementing total_size inside the iter_content loop and aborting/raising if
exceeded, and ensure downstream code that handles the downloaded blob (e.g.,
upload_file_from_url flow that may read the file) uses streaming or file-on-disk
reads instead of loading the entire file into memory.
- Around line 161-197: The exception handlers in upload_file_from_url and
upload_content do not perform an explicit database rollback before returning
error responses, which can leave partial commits from
context_service.upload_attachment (which may commit on DocumentParseError);
modify both functions' error paths so that immediately before returning an error
dict (in the requests.RequestException handler and the broad Exception handler)
you call db.rollback() (or session.rollback() as appropriate) to undo any open
transaction, then log and return; keep db.close() in the finally block as-is and
reference upload_file_from_url, upload_content, and
context_service.upload_attachment when locating the changes.
In `@backend/init_data/skills/wegent-knowledge/SKILL.md`:
- Around line 369-378: The Python example call to wegent_kb_create_document in
the SKILL.md contains a typo: the boolean argument uses "True." (trailing
period) which is invalid syntax; update the examples for
wegent_kb_create_document so the trigger_indexing argument is "True," (comma)
instead of "True." in both example occurrences, ensuring the snippet is valid
Python and matches the function signature (knowledge_base_id, name, source_type,
attachment_id, trigger_indexing).
---
Outside diff comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 1-305: Run Black to reformat the file to the project's 88-column
style (e.g., run `black backend/app/mcp_server/tools/dingtalk_upload.py`);
specifically wrap or shorten long log/message and header lines found in
functions _download_file_from_url (headers User-Agent), upload_file_from_url
(logger.info, error messages and the returned dict keys/values that exceed 88
chars), and upload_content (logger.info, error messages and returned dict), so
all lines comply with Black/PEP8; after formatting, re-run linting to confirm
pipeline passes.
---
Nitpick comments:
In `@backend/app/mcp_server/tools/__init__.py`:
- Around line 31-44: DINGTALK_UPLOAD_MCP_TOOLS and KNOWLEDGE_MCP_TOOLS both call
build_mcp_tools_dict(server="knowledge") so the former is a misleading
duplicate; fix by either removing the DINGTALK_UPLOAD_MCP_TOOLS export/entry in
__all__ and any unused references (since get_registered_mcp_tools handles
registration), or change the construction in dingtalk_upload.py to scope
correctly (e.g., call build_mcp_tools_dict with a distinct server tag or add a
per-module filter) so DINGTALK_UPLOAD_MCP_TOOLS represents only DingTalk tools;
update the __all__ list accordingly and ensure references to
DINGTALK_UPLOAD_MCP_TOOLS are adjusted or deleted.
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 187-189: In the exception handlers inside dingtalk_upload.py
replace explicit str(e) with the f-string !s conversion to satisfy Ruff RUF010:
change logger.error(f"Failed to download file from URL: {e}") and return
{"error": f"Failed to download file: {str(e)}"} to use {e!s}, and apply the same
change to the other similar handler around the later return (the one at the
second exception block near the upload/download flow) so both logging and
returned error strings use {e!s}.
- Around line 256-292: The current upload_content flow unnecessarily writes to a
NamedTemporaryFile then re-reads it; instead, in upload_content encode the
in-memory content to bytes using content.encode(encoding) (catch
UnicodeEncodeError around this call), compute file_size = len(binary_data), and
call context_service.upload_attachment(db=db, user_id=user.id,
filename=filename, binary_data=binary_data, subtask_id=0). Return the same dict
using context.id, context.mime_type and truncation_info.is_truncated (if
present). Remove creation/opening of NamedTemporaryFile and all file read/write
operations so there is no filesystem roundtrip.
In `@backend/init_data/skills/wegent-knowledge/SKILL.md`:
- Around line 210-267: The markdown examples in SKILL.md open fenced code blocks
without language tags which triggers markdownlint MD040; update each
triple-backtick block shown around the get_document_info example, the
get_document_content example, the dingtalk-table.get_all_sheets / get_range
example, the dingtalk-ai-table.get_tables / query_records example, and the
dingtalk-docs.download_file example to include a language identifier (e.g.,
```python for code-like calls or ```text for plain output) so the code blocks
for get_document_info, get_document_content, dingtalk-table.get_all_sheets,
dingtalk-table.get_range, dingtalk-ai-table.get_tables,
dingtalk-ai-table.query_records, and dingtalk-docs.download_file have proper
syntax highlighting and satisfy MD040.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb3cfd03-a3da-40fd-af91-47cfd5a3994e
📒 Files selected for processing (5)
- backend/app/mcp_server/server.py
- backend/app/mcp_server/tools/__init__.py
- backend/app/mcp_server/tools/decorator.py
- backend/app/mcp_server/tools/dingtalk_upload.py
- backend/init_data/skills/wegent-knowledge/SKILL.md
✅ Files skipped from review due to trivial changes (1)
- backend/app/mcp_server/tools/decorator.py
…dge-skill refactor(knowledge): remove sandbox dependency for DingTalk document upload
Force-pushed from c56ecf4 to 654d174
🧹 Nitpick comments (3)
backend/app/mcp_server/tools/dingtalk_upload.py (3)
262-279: Avoid disk round-trip when uploading text content.
The function writes `content` to a temp file and then immediately reads it back as bytes only to pass it to `context_service.upload_attachment`. The temp file is otherwise unused (no parsing or sandbox handoff). You can produce the same `binary_data` by encoding in memory, which removes I/O, removes the cleanup path for this branch, and avoids surfacing filesystem errors from the disk hop.
♻️ Proposed refactor
```diff
-    temp_file_path = None
     db = SessionLocal()
     try:
         user = _get_user_from_token(db, token_info)
         if not user:
             return {"error": "User not found"}

         # Validate content
         if not content:
             return {"error": "Content cannot be empty"}

         # Validate filename
         if not filename:
             return {"error": "Filename is required"}

-        # Create temporary file with proper extension
-        file_extension = os.path.splitext(filename)[1] or ".md"
-        with tempfile.NamedTemporaryFile(
-            delete=False, suffix=file_extension
-        ) as temp_file:
-            temp_file_path = temp_file.name
-        try:
-            # Write content to temporary file
-            logger.info(f"Writing content to temporary file {temp_file_path}")
-            with open(temp_file_path, "w", encoding=encoding) as f:
-                f.write(content)
-
-            # Read file content as binary for upload
-            with open(temp_file_path, "rb") as f:
-                binary_data = f.read()
-
+        # Encode content directly; no disk round-trip needed
+        binary_data = content.encode(encoding)
         file_size = len(binary_data)
@@
     finally:
         db.close()
-        _clean_temp_file(temp_file_path)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 262 - 279, The current code in dingtalk_upload.py creates a temp file, writes `content`, then reopens it to read `binary_data` before calling `context_service.upload_attachment`; replace the disk round-trip by encoding the text in memory: compute `file_extension = os.path.splitext(filename)[1] or ".md"`, then set `binary_data = content.encode(encoding)` and `file_size = len(binary_data)` (removing the tempfile creation/cleanup and the open/read steps) and pass `binary_data` and `file_size` to `context_service.upload_attachment` as before; keep all logging around writing/reading adjusted to reflect the in-memory encoding.
72-89: Filename derived from URL path is not sanitized.
`os.path.basename(parsed.path)` will return whatever lies in the URL path (URL-encoded sequences, unicode, leading dots, etc.). It then flows into both `tempfile.NamedTemporaryFile(suffix=...)` and `context_service.upload_attachment(filename=...)`. The downstream service does call `os.path.basename` again and validates the extension, so this is unlikely to be exploitable, but consider URL-decoding and stripping problematic characters here so the temp-file suffix and the stored attachment name are predictable (e.g. `%20` → space, removal of NUL/control chars).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 72 - 89, The _get_filename_from_url function returns raw os.path.basename(parsed.path) which may contain URL-encoded sequences, control characters, leading dots, or other unsafe chars; update it to URL-decode the basename (use urllib.parse.unquote), normalize/remove control/NUL characters, strip/replace path separators and leading dots, and collapse spaces or unsafe chars to a safe token while preserving the extension if present; ensure the sanitized filename is returned (or default_filename if result is empty) so the value passed to tempfile.NamedTemporaryFile(suffix=...) and context_service.upload_attachment(filename=...) is predictable and safe.
35-37: Avoid disk round-trip in `upload_content` by encoding directly.
In `upload_content`, content is written to a temporary file then read back as binary (lines 256–270). Instead of this unnecessary I/O, use `binary_data = content.encode(encoding)` directly to convert the string to bytes, eliminating the file write-read cycle while preserving identical behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 35 - 37, The upload_content function currently writes `content` to a temporary file then re-opens it to read bytes; replace that disk round-trip by creating the bytes in-memory with `binary_data = content.encode(encoding)` (use the same `encoding` variable already in scope), remove the temporary file creation/cleanup code (e.g., NamedTemporaryFile or open/write/read steps), and use `binary_data` where the file's read bytes were previously used so behavior remains identical without I/O.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 262-279: The current code in dingtalk_upload.py creates a temp
file, writes `content`, then reopens it to read `binary_data` before calling
`context_service.upload_attachment`; replace the disk round-trip by encoding the
text in memory: compute `file_extension = os.path.splitext(filename)[1] or
".md"`, then set `binary_data = content.encode(encoding)` and `file_size =
len(binary_data)` (removing the tempfile creation/cleanup and the open/read
steps) and pass `binary_data` and `file_size` to
`context_service.upload_attachment` as before; keep all logging around
writing/reading adjusted to reflect the in-memory encoding.
- Around line 72-89: The _get_filename_from_url function returns raw
os.path.basename(parsed.path) which may contain URL-encoded sequences, control
characters, leading dots, or other unsafe chars; update it to URL-decode the
basename (use urllib.parse.unquote), normalize/remove control/NUL characters,
strip/replace path separators and leading dots, and collapse spaces or unsafe
chars to a safe token while preserving the extension if present; ensure the
sanitized filename is returned (or default_filename if result is empty) so the
value passed to tempfile.NamedTemporaryFile(suffix=...) and
context_service.upload_attachment(filename=...) is predictable and safe.
- Around line 35-37: The upload_content function currently writes `content` to a
temporary file then re-opens it to read bytes; replace that disk round-trip by
creating the bytes in-memory with `binary_data = content.encode(encoding)` (use
the same `encoding` variable already in scope), remove the temporary file
creation/cleanup code (e.g., NamedTemporaryFile or open/write/read steps), and
use `binary_data` where the file's read bytes were previously used so behavior
remains identical without I/O.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec39ae36-f4d0-45fa-b554-253d91f1e28e
📒 Files selected for processing (1)
backend/app/mcp_server/tools/dingtalk_upload.py
…alk upload
- Add URL validation to prevent SSRF attacks in _download_file_from_url()
- Add MAX_DOWNLOAD_BYTES (100MB) limit with streaming size check
- Add db.rollback() in exception handlers before returning error responses
- Fix Ruff RUF010: replace str(e) with {e!s}
- Optimize upload_content() to avoid filesystem roundtrip
- Run Black formatter on dingtalk_upload.py
- Fix SKILL.md: correct True. syntax error to True,
- Add python language identifiers to code blocks in SKILL.md
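A minimal sketch of the rollback-on-error shape this commit describes, reusing names visible elsewhere in this PR (SessionLocal, _get_user_from_token, _download_file_from_url, context_service.upload_attachment); the trimmed body and return payload are assumptions, not the merged code.

```python
import requests

def upload_file_from_url(download_url: str, filename: str, token_info) -> dict:
    db = SessionLocal()
    try:
        user = _get_user_from_token(db, token_info)
        if not user:
            return {"error": "User not found"}
        binary_data = _download_file_from_url(download_url)  # SSRF-validated
        context, _ = context_service.upload_attachment(
            db=db,
            user_id=user.id,
            filename=filename,
            binary_data=binary_data,
            subtask_id=0,
        )
        return {"attachment_id": context.id}
    except requests.RequestException as e:
        db.rollback()  # undo any partial transaction before reporting
        return {"error": f"Failed to download file: {e!s}"}
    except Exception as e:
        db.rollback()
        return {"error": f"Upload failed: {e!s}"}
    finally:
        db.close()
```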
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/app/mcp_server/tools/dingtalk_upload.py (1)
144-156: Consider streaming to a tempfile instead of buffering in memory.
`chunks` accumulates the entire payload (up to `MAX_DOWNLOAD_BYTES` = 100 MB) before `b"".join(chunks)` doubles peak memory during the join. Under concurrent uploads this can be a noticeable footprint on a multi-tenant MCP server. If `context_service.upload_attachment` accepts (or can be adapted to accept) a file-like / streamed source, prefer writing chunks to a `tempfile.NamedTemporaryFile` and passing that through; otherwise at least use `bytearray.extend(chunk)` to avoid the join's transient 2× allocation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 144 - 156, The current download loop buffers chunks in a list and returns b"".join(chunks), which causes a transient 2× memory spike for up to MAX_DOWNLOAD_BYTES; change the implementation in dingtalk_upload.py (the loop using response.iter_content and variable chunks) to stream into a tempfile.NamedTemporaryFile (write each chunk, enforce MAX_DOWNLOAD_BYTES by tracking total_size, flush/seek to start) and pass that file-like object to context_service.upload_attachment; if context_service.upload_attachment cannot accept a file-like, replace the list-of-bytes buffer with a single bytearray (use bytearray.extend(chunk) and return bytes(bytearray)) to avoid the join overhead. Ensure you reference MAX_DOWNLOAD_BYTES, response.iter_content, chunks, and context_service.upload_attachment when making the change.
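For reference, the bytearray variant the reviewer mentions might look like the sketch below. It reuses MAX_DOWNLOAD_BYTES and DEFAULT_TIMEOUT from the file under review but is an illustration, not the project's code.

```python
import requests

def _download_file_bytearray(url: str) -> bytes:
    buf = bytearray()  # one growing buffer: no list of chunks, no final join
    with requests.get(url, stream=True, timeout=DEFAULT_TIMEOUT) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=8192):
            if not chunk:
                continue
            if len(buf) + len(chunk) > MAX_DOWNLOAD_BYTES:
                raise ValueError(
                    f"Download exceeds maximum size of {MAX_DOWNLOAD_BYTES} bytes"
                )
            buf.extend(chunk)
    return bytes(buf)
```

Note that the final bytes(buf) still copies once; passing the bytearray through directly would avoid even that if the upload path accepts it.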
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 86-156: The current flow in _download_file_from_url suffers
TOCTOU/DNS-rebinding and unsafe auto-redirects: change it to resolve and
validate the hostname(s) via socket.getaddrinfo inside _validate_download_url
and then perform the HTTP request bound to the resolved IP(s) using a custom
requests HTTPAdapter that pins the chosen IP while preserving the original Host
header and SNI (so that requests.get uses the validated IP rather than
re-resolving); also set allow_redirects=False and implement a manual redirect
loop that for each Location header runs the same resolve+validate+pinning logic
before issuing the next request (re-check size limits and MAX_DOWNLOAD_BYTES
across chunks as now), ensuring every connection uses the validated IP and no
intermediate redirect is followed without prior validation.
---
Nitpick comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 144-156: The current download loop buffers chunks in a list and
returns b"".join(chunks), which causes a transient 2× memory spike for up to
MAX_DOWNLOAD_BYTES; change the implementation in dingtalk_upload.py (the loop
using response.iter_content and variable chunks) to stream into a
tempfile.NamedTemporaryFile (write each chunk, enforce MAX_DOWNLOAD_BYTES by
tracking total_size, flush/seek to start) and pass that file-like object to
context_service.upload_attachment; if context_service.upload_attachment cannot
accept a file-like, replace the list-of-bytes buffer with a single bytearray
(use bytearray.extend(chunk) and return bytes(bytearray)) to avoid the join
overhead. Ensure you reference MAX_DOWNLOAD_BYTES, response.iter_content,
chunks, and context_service.upload_attachment when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 495eaa52-128a-41d8-81c1-8d001e20a561
📒 Files selected for processing (2)
- backend/app/mcp_server/tools/dingtalk_upload.py
- backend/init_data/skills/wegent-knowledge/SKILL.md
```python
    try:
        addrinfo = socket.getaddrinfo(hostname, None)
        for _, _, _, _, sockaddr in addrinfo:
            ip = ipaddress.ip_address(sockaddr[0])
            # Check for private, loopback, link-local, or reserved IPs
            if (
                ip.is_private
                or ip.is_loopback
                or ip.is_link_local
                or ip.is_reserved
                or ip.is_multicast
            ):
                raise ValueError(f"URL resolves to blocked IP: {ip}")
    except socket.gaierror as e:
        raise ValueError(f"Failed to resolve hostname: {hostname}") from e


def _get_user_from_token(db: Session, token_info: TaskTokenInfo) -> Optional[User]:
    """Get user from token info."""
    return db.query(User).filter(User.id == token_info.user_id).first()


def _download_file_from_url(url: str) -> bytes:
    """
    Download file from URL with security validations.

    Args:
        url: Download URL

    Returns:
        Downloaded file content as bytes

    Raises:
        ValueError: If URL validation fails
        requests.RequestException: If download fails
    """
    # Validate URL before downloading
    _validate_download_url(url)

    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    }

    # Use stream=True and limit redirects for security
    with requests.get(
        url,
        headers=headers,
        stream=True,
        timeout=DEFAULT_TIMEOUT,
        allow_redirects=True,
    ) as response:
        response.raise_for_status()

        # Re-validate after redirects (if Location header present)
        if response.history:
            final_url = response.url
            _validate_download_url(final_url)

        total_size = 0
        chunks = []

        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                total_size += len(chunk)
                if total_size > MAX_DOWNLOAD_BYTES:
                    raise ValueError(
                        f"Download exceeds maximum size of {MAX_DOWNLOAD_BYTES} bytes"
                    )
                chunks.append(chunk)

        return b"".join(chunks)
```
🧩 Analysis chain
🌐 Web query:
Best practices to prevent SSRF DNS rebinding in Python requests: pinning resolved IP and validating redirects manually
💡 Result:
To prevent SSRF via DNS rebinding in Python's requests library, implement IP pinning by resolving the hostname once, validating the IP against private/reserved ranges, rewriting the URL to use the resolved IP while preserving the original Host header and TLS SNI, and handling redirects manually with the same validation. Key steps:
1. Resolve the hostname to IP(s) using socket.getaddrinfo; validate all resolved IPs are safe (not private, loopback, or metadata endpoints like 169.254.169.254).
2. Use a custom HTTPAdapter to force the connection to the pinned IP, set server_hostname for SNI, and set the Host header to the original hostname.
3. Disable automatic redirects (allow_redirects=False), check the Location header, re-validate the new URL, and recurse up to a max_redirects limit.
Example implementation:
```python
import socket
import ssl
import ipaddress
from urllib.parse import urlparse, urlunparse
from urllib3.poolmanager import PoolManager
from requests.adapters import HTTPAdapter
from requests import Session


def resolve_first_ip(hostname):
    return socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)[0][4][0]


def is_safe_ip(ip_str):
    ip = ipaddress.ip_address(ip_str)
    return (
        not ip.is_private
        and not ip.is_loopback
        and not ip.is_link_local
        and str(ip) != "169.254.169.254"  # AWS/GCP/Azure metadata
        and not ip.is_multicast
    )


class HostHeaderSSLAdapter(HTTPAdapter):
    def __init__(self, ssl_hostname, *args, **kwargs):
        self.ssl_hostname = ssl_hostname
        super().__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        context = ssl.create_default_context()
        kwargs["ssl_context"] = context
        kwargs["server_hostname"] = self.ssl_hostname
        self.poolmanager = PoolManager(*args, **kwargs)


def safe_request(url, max_redirects=10):
    parsed = urlparse(url)
    hostname = parsed.hostname
    if not hostname:
        raise ValueError("Invalid URL")
    # Pin and validate IP
    ip = resolve_first_ip(hostname)
    if not is_safe_ip(ip):
        raise ValueError("Unsafe IP")
    # Rebuild URL with IP
    netloc = ip
    if parsed.port:
        netloc += f":{parsed.port}"
    new_url = urlunparse(parsed._replace(netloc=netloc))
    session = Session()
    adapter = HostHeaderSSLAdapter(hostname)
    session.mount("https://", adapter)
    headers = {"Host": hostname}
    response = session.get(new_url, headers=headers, allow_redirects=False, verify=True)
    # Manual redirect handling
    redirects = 0
    while 300 <= response.status_code < 400 and redirects < max_redirects:
        location = response.headers.get("Location")
        if not location:
            break
        # Make absolute if relative
        if location.startswith("/"):
            location = f"{parsed.scheme}://{hostname}{location}"
        # Re-validate new location with pinning
        response = safe_request(location, max_redirects - 1)
        redirects += 1
    return response
```
Usage: response = safe_request('https://example.com')
This approach eliminates TOCTOU gaps in DNS rebinding by ensuring the validated IP is used for the connection. For production, check all getaddrinfo IPs, handle proxies/cookies carefully, and consider libraries like requests-hardened or drawbridge (for httpx) for maintained solutions.
Citations:
- 1: https://joshua.hu/solving-fixing-interesting-problems-python-dns-rebindind-requests
- 2: https://pypi.org/project/drawbridge/
- 3: GHSA-wvjg-9879-3m7w
- 4: https://github.com/saleor/requests-hardened
- 5: https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html
- 6: https://github.com/Roadmaster/forcediphttpsadapter
- 7: https://stackoverflow.com/questions/22609385/python-requests-library-define-specific-dns
Residual SSRF: DNS rebinding and intermediate-redirect validation remain unfixed.
Two critical gaps persist despite the added validation layer:
- TOCTOU / DNS rebinding — `_validate_download_url` resolves the hostname via `socket.getaddrinfo` (line 87), but `requests.get` (line 130) performs its own independent DNS lookup. An attacker controlling a short-TTL DNS record or shared resolver cache can return a public IP during validation and a private IP at request time. The same TOCTOU applies after redirects: line 142 re-validates `response.url` textually and re-resolves it, but does not validate against the actual IP address that `requests` already connected to.
- Intermediate-redirect side effects — `allow_redirects=True` allows `requests` to autonomously follow the redirect chain. By the time the final URL is re-validated (line 142), requests have already been issued to intermediate hops, which could be internal hosts. This enables port-scanning and state-changing requests (e.g., metadata endpoints, internal APIs) regardless of post-hoc rejection.
Required fixes:
- Resolve the hostname once using `socket.getaddrinfo`, validate all resolved IPs against the blocked list, then connect directly to the validated IP. Use a custom `HTTPAdapter` to pin the resolved IP while preserving the original hostname in the `Host` header and SNI.
- Set `allow_redirects=False` and implement manual redirect following. Before issuing each subsequent request, validate the `Location` header by running the same `_validate_download_url` check to ensure the next hop is safe.
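Sketched under the assumptions above, the manual redirect loop could take the following shape; `_session_pinned_to()` is hypothetical shorthand for a session whose adapter connects to the already-validated IP while keeping the original Host header and SNI, and resolution of relative Location values is omitted.

```python
def _fetch_with_validated_redirects(url: str, max_redirects: int = 5):
    for _ in range(max_redirects):
        _validate_download_url(url)        # resolve + reject private/blocked IPs
        session = _session_pinned_to(url)  # hypothetical pinned-IP session
        response = session.get(
            url, stream=True, timeout=DEFAULT_TIMEOUT, allow_redirects=False
        )
        if response.status_code in (301, 302, 303, 307, 308):
            url = response.headers["Location"]  # re-validated next iteration
            response.close()
            continue
        response.raise_for_status()
        return response
    raise ValueError("Too many redirects")
```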
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 86 - 156, The
current flow in _download_file_from_url suffers TOCTOU/DNS-rebinding and unsafe
auto-redirects: change it to resolve and validate the hostname(s) via
socket.getaddrinfo inside _validate_download_url and then perform the HTTP
request bound to the resolved IP(s) using a custom requests HTTPAdapter that
pins the chosen IP while preserving the original Host header and SNI (so that
requests.get uses the validated IP rather than re-resolving); also set
allow_redirects=False and implement a manual redirect loop that for each
Location header runs the same resolve+validate+pinning logic before issuing the
next request (re-check size limits and MAX_DOWNLOAD_BYTES across chunks as now),
ensuring every connection uses the validated IP and no intermediate redirect is
followed without prior validation.
…ta fetch
- backend/app/mcp_server/tools/dingtalk_upload.py:
- _get_filename_from_url: URL-decode basename with urllib.parse.unquote,
strip NUL/control chars via unicodedata, remove path separators to
prevent directory traversal, strip leading dots, collapse unsafe chars
to underscores; fall back to default_filename when result is empty
- backend/init_data/skills/wegent-knowledge/SKILL.md:
- AI Table (able): add mandatory pagination loop for query_records;
warn that uploading only the first page results in incomplete data
- Spreadsheet (axls): increase example range from A1:Z100 to A1:Z10000
and note that hasMore-style responses require incremental fetching
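As an illustration of the pagination rule, assuming a query_records call that reports continuation via hasMore-style fields (the exact field and parameter names here are guesses, not the DingTalk API):

```python
all_records = []
cursor = None
while True:
    page = query_records(table_id=table_id, cursor=cursor)  # names assumed
    all_records.extend(page["records"])
    if not page.get("hasMore"):
        break  # only now is the dataset complete and safe to upload
    cursor = page.get("nextCursor")
```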
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/app/mcp_server/tools/dingtalk_upload.py (1)
86-143: ⚠️ Potential issue | 🟠 Major
Residual SSRF: DNS-rebinding TOCTOU and unvalidated intermediate redirects still open.
Two gaps from the prior review remain:
TOCTOU/DNS rebinding — `_validate_download_url` resolves via `socket.getaddrinfo` at Line 88, but `requests.get` at Line 131 performs an independent DNS lookup. With a short-TTL or cache-poisoned record, validation can resolve to a public IP while the actual connection lands on a private/metadata IP. Re-validating `response.url` at Line 143 after `allow_redirects=True` doesn't help: it's another textual+DNS check, not a check against the IP `requests` already connected to.
Intermediate redirects — with `allow_redirects=True`, every hop in the redirect chain is fetched before the Line 142 re-validation runs. Internal hosts can be probed (port-scan/state-changing hits to metadata endpoints, internal APIs) regardless of post-hoc rejection of the final URL.
Recommended hardening: resolve once, validate all returned IPs, then connect to the pinned IP using a custom `HTTPAdapter` that preserves the original `Host` header and SNI. Set `allow_redirects=False` and follow redirects manually, re-running `_validate_download_url` on each `Location` before issuing the next request.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 86 - 143, The current download flow in _download_file_from_url does a textual DNS check in _validate_download_url then calls requests.get (allow_redirects=True), leaving a TOCTOU/DNS-rebind gap and allowing intermediate redirects to be fetched; fix it by resolving the initial hostname to IPs once (using the same socket.getaddrinfo logic used by _validate_download_url), validate all resolved IPs are allowed, then create a requests Session with a custom HTTPAdapter that pins the connection to the validated IP (while preserving the original Host header and TLS SNI), call requests with allow_redirects=False, and implement a manual redirect loop that: (a) inspects Location, runs the same resolve+validate before following, (b) pins each hop to its validated IP via the adapter, and (c) enforces a redirect limit; update _download_file_from_url to use DEFAULT_TIMEOUT, stream=True, raise_for_status, and this manual-redirect/pinned-IP logic instead of relying on requests.get allow_redirects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 266-291: Extract the filename-cleaning logic from
_get_filename_from_url into a new helper _sanitize_filename(name: str) -> str
that removes NUL/control chars, path separators, leading dots and other unsafe
characters, then call _sanitize_filename on any caller-supplied filename before
passing it to context_service.upload_attachment; update both tool entrypoints
that accept a filename (the branch where filename is provided by the caller and
the branch that falls back to _get_filename_from_url) to use _sanitize_filename
so all uploaded filenames are normalized and safe while leaving the existing
URL-derived path to call _get_filename_from_url which can internally call
_sanitize_filename.
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_upload.py`:
- Around line 86-143: The current download flow in _download_file_from_url does
a textual DNS check in _validate_download_url then calls requests.get
(allow_redirects=True), leaving a TOCTOU/DNS-rebind gap and allowing
intermediate redirects to be fetched; fix it by resolving the initial hostname
to IPs once (using the same socket.getaddrinfo logic used by
_validate_download_url), validate all resolved IPs are allowed, then create a
requests Session with a custom HTTPAdapter that pins the connection to the
validated IP (while preserving the original Host header and TLS SNI), call
requests with allow_redirects=False, and implement a manual redirect loop that:
(a) inspects Location, runs the same resolve+validate before following, (b) pins
each hop to its validated IP via the adapter, and (c) enforces a redirect limit;
update _download_file_from_url to use DEFAULT_TIMEOUT, stream=True,
raise_for_status, and this manual-redirect/pinned-IP logic instead of relying on
requests.get allow_redirects.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ea9cfdab-9513-42f8-a5db-b6295f1b96c8
📒 Files selected for processing (2)
- backend/app/mcp_server/tools/dingtalk_upload.py
- backend/init_data/skills/wegent-knowledge/SKILL.md
```python
        if not filename:
            filename = _get_filename_from_url(
                download_url, f"document.{default_extension}"
            )

        # Download file from URL with security validations
        logger.info(f"Downloading file from URL: {download_url}")
        binary_data = _download_file_from_url(download_url)
        file_size = len(binary_data)

        # Upload to Wegent using context service
        context, truncation_info = context_service.upload_attachment(
            db=db,
            user_id=user.id,
            filename=filename,
            binary_data=binary_data,
            subtask_id=0,  # Unlinked attachment
        )

        return {
            "attachment_id": context.id,
            "filename": filename,
            "size": file_size,
            "mime_type": context.mime_type,
            "truncated": (truncation_info.is_truncated if truncation_info else False),
        }
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect upload_attachment and downstream uses of `filename` for sanitization/path-traversal handling.
fd -t f context_service.py backend/app/services/context | xargs -I{} sed -n '1,500p' {}
echo "---"
rg -nP -C3 "def upload_attachment\b" --type=py
echo "---"
rg -nP -C3 "\bfilename\b" --type=py -g 'backend/app/services/context/**'
Repository: wecode-ai/Wegent
Length of output: 36455
🏁 Script executed:
rg -nA 10 -B 5 "context_service.upload_attachment" backend/app/mcp_server/tools/dingtalk_upload.py | head -40
Repository: wecode-ai/Wegent
Length of output: 1249
🏁 Script executed:
# Check if any HTTP response or header construction uses filename without sanitization
rg -nB 3 -A 3 "Content-Disposition\|attachment_id.*filename\|response.*filename" backend/app --type=py | head -50
Repository: wecode-ai/Wegent
Length of output: 42
🏁 Script executed:
# Search for any filename sanitization utilities in the codebase
fd -t f -type f . | xargs rg -l "sanitize.*filename|filename.*sanitize" --type=py
Repository: wecode-ai/Wegent
Length of output: 478
🏁 Script executed:
# Check for filename sanitization in dingtalk_upload.py itself
rg -nB 2 -A 8 "_get_filename_from_url|_sanitize" backend/app/mcp_server/tools/dingtalk_upload.py
Repository: wecode-ai/Wegent
Length of output: 916
🏁 Script executed:
# Check the sandbox_file_syncer for how it handles filenames
rg -nB 2 -A 5 "sanitize\|filename" backend/app/services/sandbox_file_syncer.py | head -60
Repository: wecode-ai/Wegent
Length of output: 42
🏁 Script executed:
# Examine the API endpoint that returns attachments to see if filename is used in headers
rg -nB 5 -A 10 "def download.*attachment|@router.*download.*attachment" backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 548
🏁 Script executed:
# Get the full implementation of download_attachment
rg -n -A 40 "^async def download_attachment" backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 1645
🏁 Script executed:
# Check the _get_filename_from_url implementation in detail
sed -n '160,220p' backend/app/mcp_server/tools/dingtalk_upload.py
Repository: wecode-ai/Wegent
Length of output: 2274
🏁 Script executed:
# Get the rest of download_attachment to see how filename is used in response
sed -n '550,630p' backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 3167
🏁 Script executed:
# Check if filename is used in content-disposition or other response headers
rg -n "content.disposition\|Content-Disposition\|filename\*=" backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 42
🏁 Script executed:
# Continue reading download_attachment to find the response construction
sed -n '620,680p' backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 2040
🏁 Script executed:
# Find the _build_content_disposition function
rg -n -B 2 -A 8 "def _build_content_disposition" backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 418
🏁 Script executed:
# Check if there's any sanitization in that function
rg -n "_build_content_disposition" backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 397
🏁 Script executed:
# Get the full implementation of _build_content_disposition
sed -n '70,95p' backend/app/api/endpoints/adapter/attachments.py
Repository: wecode-ai/Wegent
Length of output: 951
Apply consistent filename sanitization to both URL-derived and caller-supplied names.
_get_filename_from_url thoroughly sanitizes URL-derived filenames (removes NUL/control chars, path separators, leading dots, unsafe characters), but caller-supplied filename values (from LLM via MCP) are passed unsanitized to context_service.upload_attachment at lines 280 and 390. The service stores the unsanitized name directly in the database and type_data without validation. A model-controlled filename like "../../../etc/passwd" or containing NUL bytes bypasses sanitization entirely, whereas URL-derived filenames are always protected.
Extract the sanitization logic from _get_filename_from_url into a reusable _sanitize_filename(name: str) -> str helper and apply it to caller-supplied filename in both tool entrypoints before invoking upload_attachment.
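One possible shape for that helper, offered as a sketch only; the exact character policy and fallback name are assumptions, not the merged implementation:

```python
import os
import unicodedata
from urllib.parse import unquote

def _sanitize_filename(name: str) -> str:
    name = unquote(name)  # decode %xx sequences first
    # Drop NUL/control/format characters (Unicode category "C*").
    name = "".join(ch for ch in name if unicodedata.category(ch)[0] != "C")
    # Remove path separators, then any remaining directory component.
    name = os.path.basename(name.replace("\\", "/"))
    name = name.lstrip(".")  # no hidden files or ".." remnants
    return name or "document"  # assumed fallback
```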
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/mcp_server/tools/dingtalk_upload.py` around lines 266 - 291,
Extract the filename-cleaning logic from _get_filename_from_url into a new
helper _sanitize_filename(name: str) -> str that removes NUL/control chars, path
separators, leading dots and other unsafe characters, then call
_sanitize_filename on any caller-supplied filename before passing it to
context_service.upload_attachment; update both tool entrypoints that accept a
filename (the branch where filename is provided by the caller and the branch
that falls back to _get_filename_from_url) to use _sanitize_filename so all
uploaded filenames are normalized and safe while leaving the existing
URL-derived path to call _get_filename_from_url which can internally call
_sanitize_filename.
…uments to knowledge base
Add new skill to upload DingTalk documents, spreadsheets, and AI tables to Wegent knowledge bases. Includes:
Supports multiple DingTalk document types:
Integrates with wegent-knowledge MCP to create documents with proper source attribution and trigger indexing.
Summary by CodeRabbit
New Features
Documentation