fix: Fix HTML and Base64 garbage in rendered images, and message-send failures caused by overlong text blowing out the rendered image #129
FrozenYears wants to merge 1 commit into SXP-Simon:main from
Reviewer's Guide
Improves robustness of LLM JSON parsing and report rendering by sanitizing chat text before prompting, extracting JSON with a balanced parser instead of regex, safely rendering HTML snippets in quote reasons, and hardening image templates and reactions handling to avoid oversized images and noisy replies.
Sequence diagram for sanitized group analysis and robust JSON parsing
sequenceDiagram
actor User
participant BotMain
participant Analyzer as AnalyzerGoldenQuote
participant Cleaner as MessageCleanerService
participant LLM
participant JsonUtils
participant Renderer as ReportRenderer
User->>BotMain: Send analyze command in group
BotMain->>BotMain: analyze_group_daily
BotMain->>BotMain: Get adapter and orig_msg_id
BotMain->>User: Set reaction 🔍 on original message
BotMain->>Analyzer: build_prompt(data)
Analyzer->>Cleaner: sanitize_chat_text(msg.content)
Cleaner-->>Analyzer: cleaned content
Analyzer-->>BotMain: LLM prompt
BotMain->>LLM: Send prompt
LLM-->>BotMain: Raw JSON response text
BotMain->>JsonUtils: parse_json_response(result_text, data_type)
JsonUtils->>JsonUtils: _extract_json_balanced(result_text, [, ])
JsonUtils-->>JsonUtils: json_text
JsonUtils->>JsonUtils: json.loads(json_text)
JsonUtils-->>BotMain: Parsed data or error
alt Parsed data ok
BotMain->>Renderer: Render HTML report with quote.reason
Renderer-->>BotMain: Final image
BotMain->>User: Send image, update reaction 📊
else JSON parsing failed
BotMain->>User: Send failure message or no result
end
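The balanced-extraction step in the diagram above can be illustrated with a minimal sketch. This is not the PR's actual `_extract_json_balanced`; the function name `extract_json_balanced`, its defaults, and the sample `reply` string are assumptions made for illustration. It shows why a non-greedy regex stops at a `]` inside a string value, while a scanner that tracks nesting depth and string state does not.

```python
import json
import re
from typing import Optional


def extract_json_balanced(text: str, open_char: str = "[", close_char: str = "]") -> Optional[str]:
    """Return the first balanced open_char...close_char span, or None.

    Hypothetical sketch: brackets inside double-quoted JSON strings are ignored.
    """
    start = text.find(open_char)
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string: only watch for escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == open_char:
            depth += 1
        elif ch == close_char:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # Unbalanced input: no complete span found.


# Made-up LLM reply whose string value contains a closing bracket.
reply = 'Here you go: [{"quote": "see [1] for details", "reason": "ok"}] Hope it helps.'

# The non-greedy regex stops at the first "]", which sits inside a string.
naive = re.search(r"\[.*?\]", reply, re.DOTALL).group(0)
balanced = extract_json_balanced(reply)
```

Here `naive` is cut off in the middle of the first string value and cannot be parsed, whereas `json.loads(balanced)` yields the full list.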
Class diagram for analyzers, message cleaning, and JSON utilities
classDiagram
class MessageCleanerService {
+clean_messages(messages) list
+sanitize_chat_text(text str) str
}
class GoldenQuoteAnalyzer {
+build_prompt(data list_dict) str
+get_max_count() int
}
class TopicAnalyzer {
+build_prompt(data list_dict) str
}
class ChatQualityAnalyzer {
+build_prompt(data list_dict) str
}
class UserTitleAnalyzer {
+build_prompt(data dict) str
}
class JsonUtils {
+_extract_json_balanced(text str, open_char str, close_char str) str
+fix_json(text str) str
+parse_json_response(result_text str, data_type str, original_prompt str, max_retries int, require_list bool) tuple
+parse_json_object_response(result_text str, data_type str, original_prompt str, max_retries int) tuple
}
class ReportTemplates {
<<template>>
+format_quote_item_html
+hack_quote_item_html
+retro_futurism_quote_item_html
+simple_quote_item_html
+spring_festival_quote_item_html
+image_template_css
}
GoldenQuoteAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
TopicAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
ChatQualityAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
UserTitleAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
GoldenQuoteAnalyzer ..> JsonUtils : JSON parsing in analysis flow
TopicAnalyzer ..> JsonUtils : JSON parsing in analysis flow
ChatQualityAnalyzer ..> JsonUtils : JSON parsing in analysis flow
UserTitleAnalyzer ..> JsonUtils : JSON parsing in analysis flow
ReportTemplates ..> JsonUtils : consume parsed JSON data
ReportTemplates ..> MessageCleanerService : receive already sanitized text
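The class diagram lists `sanitize_chat_text(text str) str` on `MessageCleanerService`. The following is a rough sketch of what such a sanitizer might look like, assuming it strips Base64 data URIs, HTML tags, and control characters as the PR description states. The name `sanitize_chat_text_sketch` and all three regular expressions are assumptions, not the PR's implementation.

```python
import re

# Assumed patterns; the PR's real rules may differ.
# Data URIs such as data:image/png;base64,AAAA...
_DATA_URI_RE = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")
# Anything that looks like an HTML tag. Caveat: this also removes
# non-HTML text of the form "<...>".
_HTML_TAG_RE = re.compile(r"<[^<>]+>")
# C0 control characters except tab, newline, carriage return, plus DEL.
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_chat_text_sketch(text: str) -> str:
    """Hypothetical cleaner applied to chat text before it goes into a prompt."""
    # Remove data URIs first so the leftover tag is short and easy to strip.
    text = _DATA_URI_RE.sub("", text)
    text = _HTML_TAG_RE.sub("", text)
    text = _CONTROL_RE.sub("", text)
    # Collapse runs of spaces/tabs left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


raw = 'hi <img src="data:image/png;base64,iVBORw0KGgo=">there\x00 <b>bold</b>'
cleaned = sanitize_chat_text_sketch(raw)
```

Stripping the data URI before the tag is a design choice in this sketch: it keeps a stray Base64 payload from surviving if the surrounding tag is malformed and the tag pattern fails to match.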
Hey - I've found 1 issue, and left some high level feedback:
- The repeated `word-wrap: break-word; word-break: break-all; overflow: hidden;` blocks across many selectors in multiple `image_template.html` files could be centralized into a common class or applied on container/body level to reduce duplication and make future tuning of these properties easier.
- `sanitize_chat_text` and `fix_json` both perform similar Unicode quote normalization; consider extracting this into a shared utility to keep the normalization rules consistent and avoid subtle divergence over time.
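The shared-utility suggestion in the second point could take roughly the following shape. This is a sketch under assumptions: the helper name `normalize_quotes` is hypothetical, and the set of characters mapped here is a guess, since the page does not show which quotes `sanitize_chat_text` and `fix_json` actually normalize.

```python
# Assumed mapping from typographic quotes to ASCII quotes.
_QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text: str) -> str:
    """Hypothetical single source of truth for quote normalization."""
    return text.translate(_QUOTE_MAP)


sample = normalize_quotes("\u201chi\u201d and \u2018x\u2019")
```

Both call sites would then import this one helper, so a change to the mapping applies to prompt sanitization and JSON repair at the same time.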
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The repeated `word-wrap: break-word; word-break: break-all; overflow: hidden;` blocks across many selectors in multiple `image_template.html` files could be centralized into a common class or applied on container/body level to reduce duplication and make future tuning of these properties easier.
- `sanitize_chat_text` and `fix_json` both perform similar Unicode quote normalization; consider extracting this into a shared utility to keep the normalization rules consistent and avoid subtle divergence over time.
## Individual Comments
### Comment 1
<location path="src/infrastructure/reporting/templates/hack/image_template.html" line_range="37-40" />
<code_context>
}
body {
+ word-wrap: break-word;
+ word-break: break-all;
+ overflow: hidden;
font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
background-color: var(--bg-body);
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The repeated use of `word-break: break-all` and `overflow: hidden` across many selectors is heavy-handed and could hurt readability or clip content.
`word-wrap: break-word; word-break: break-all; overflow: hidden;` are now set on `body`, `body::before`, and many text-related selectors. `break-all` will split normal words and numbers in awkward places, and `overflow: hidden` on containers like `.quote-text`, `.comment`, and headings can silently truncate content.
Consider limiting these rules to the specific, variable-length fields that actually overflow, using `overflow-wrap: anywhere` or just `word-wrap: break-word` instead of `break-all`, and adding `text-overflow: ellipsis` where truncation is desired. You could also extract these declarations into a shared utility class (e.g. `.overflow-safe-text`) to avoid duplication and simplify adjustments later.
</issue_to_address>
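Please take another look at which problems still exist in the latest version first, then submit one PR per problem.
Please describe the problems you ran into. Open an issue explaining each small problem first, then open the PR.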
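🎯 Goal
Fix HTML and Base64 garbage appearing in rendered images, along with the related JSON parsing failures.
🐛 1. Background and symptoms
In the images generated from group chat analysis results, a long run of garbage often appears in the reasoning area (`quote.reason` / `topic.detail`) of the "Group Bible (golden quotes)" section and some other text content.
- Raw markup such as `<span class="user-capsule" style="...">...</span>` is shown as plain text, exposing extremely long Base64 image data.
- JSON parsing errors (`Expecting ':' delimiter...`) or truncation problems.
🔍 2. Root cause analysis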
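Source analysis shows that two bugs combine to cause the problem, one on the input side (polluted input) and one on the output side (escaping during rendering):
- The group chat records sent to the LLM contained raw HTML tags, inline Base64 image code, and illegal control characters. Once "polluted" by this redundant data, the LLM readily produces incomplete or syntactically invalid JSON, which makes the underlying JSON parser raise errors.
- When `generators.py` assembles the HTML, it deliberately replaces references such as `[123456]` with HTML for a capsule bubble that carries an avatar. However, in `quote_item.html` and `topic_item.html` of the 6 theme templates, the Jinja2 rendering omitted the `| safe` filter. As an XSS safeguard, Jinja2 therefore escaped the injected HTML tags into plain text, and the image ended up showing Base64 avatar data several thousand characters long.
🛠️ 3. Solution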
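- `message_cleaner_service.py`: add a `sanitize_chat_text` method that strips HTML tags, Base64 data, and control characters before the text is sent to the LLM.
- `json_utils.py`: implement a bracket-balancing extraction algorithm, `_extract_json_balanced`, which fixes the critical bug where a non-greedy regex truncated the JSON early when a string contained `]`.
- Add the `| safe` filter (e.g. `{{ quote.reason | safe }}`).
- `image_template.html`: add `word-break: break-all;` globally as a defensive measure so that extremely long strings cannot blow out the image.
📁 4. Affected files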
- `src/domain/services/message_cleaner_service.py`
- `src/infrastructure/analysis/utils/json_utils.py`
- `src/infrastructure/analysis/analyzers/` (including topic, chat_quality, golden_quote, user_title_analyzer, etc.)
- `src/infrastructure/reporting/templates/*/quote_item.html` (6 themes)
- `src/infrastructure/reporting/templates/*/image_template.html` (6 themes)
Summary by Sourcery
Improve robustness of JSON parsing and chat analysis prompts while hardening image report rendering against malformed or excessively long content.
Bug Fixes:
Enhancements: