fix: Fix garbled HTML and Base64 in rendered images, and message send failures caused by overlong text blowing up the rendered image #129

Open
FrozenYears wants to merge 1 commit into SXP-Simon:main from FrozenYears:fix-render-bug

Conversation

@FrozenYears FrozenYears commented Mar 25, 2026

🎯 Goal

Fix garbled HTML and Base64 output in rendered images, along with the related JSON parsing failures.

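🐛 1. Background and Symptoms

In the images generated from group chat analysis results, a long run of garbled text often appears in the reason fields (quote.reason / topic.detail) of the "Group Bible (golden quotes)" section and of some other text content.

  • Large chunks of raw markup such as <span class="user-capsule" style="...">...</span> are rendered as plain text, exposing extremely long Base64 image data.
  • The overlong markup stretches the image to more than 10,000 pixels wide, causing render timeouts (error code -917) or broken layout.
  • The terminal log occasionally reports JSON parsing failures (Expecting ':' delimiter...) or truncation.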

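🔍 2. Root Cause Analysis

Source analysis shows that the problem comes from two bugs acting together, one polluting the input and one escaping the rendered output:

  • Dimension A: polluted LLM input (causes JSON parsing failures)
    The group chat records sent to the LLM contain raw HTML tags, inline Base64 image data, and illegal control characters. "Polluted" by this redundant data, the LLM readily produces incomplete or syntactically invalid JSON, which makes the underlying JSON parser raise errors.
  • Dimension B: Jinja2 HTML escaping at render time (causes garbled output)
    When assembling the HTML, generators.py deliberately replaces references such as [123456] with HTML capsule-bubble markup that includes an avatar. However, the quote_item.html and topic_item.html templates of the 6 themes omit the | safe filter. As an XSS safeguard, Jinja2 therefore escapes the injected HTML tags into plain text, so the image ends up showing several thousand characters of raw Base64 avatar data.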
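
The escaping effect described in Dimension B can be reproduced with the standard library alone. The sketch below is an illustration, not the code in generators.py: it uses html.escape as a rough stand-in for Jinja2's auto-escaping, and the CAPSULE markup and the build_reason_html helper are hypothetical names. It also shows one way to keep `| safe` low-risk: escape the LLM-provided text first, then insert the trusted capsule markup.

```python
import html
import re

# Hypothetical capsule markup; the real one (with avatar data) lives in generators.py.
CAPSULE = '<span class="user-capsule">{uid}</span>'


def build_reason_html(reason: str) -> str:
    """Escape untrusted text first, then swap [digits] for trusted capsule markup."""
    escaped = html.escape(reason)
    # html.escape leaves '[' and ']' untouched, so the ID pattern still matches.
    return re.sub(r"\[(\d+)\]", lambda m: CAPSULE.format(uid=m.group(1)), escaped)


raw = "nice one [123456] <script>alert(1)</script>"
reason_html = build_reason_html(raw)

# Escaping the already-built HTML a second time (what happens without `| safe`)
# turns the capsule tags themselves into visible text.
double_escaped = html.escape(reason_html)
```

With this ordering, only the markup produced by the generator reaches the page unescaped; any tags that arrived in the LLM output stay escaped.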

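🛠️ 3. Solution

  • Step 1: Block input pollution - add a sanitize_chat_text method to message_cleaner_service.py that strips HTML tags, Base64 data, and control characters before the text is sent to the LLM.
  • Step 2: Make JSON parsing more robust - implement a balanced-bracket extraction algorithm, _extract_json_balanced, in json_utils.py, fixing the critical problem where a non-greedy regex truncates the JSON early when a string contains ].
  • Step 3: Fix Jinja2 template escaping - add the | safe filter to the variables that output user evaluation content in all theme templates (e.g. {{ quote.reason | safe }}).
  • Step 4: Strengthen CSS line-breaking defenses - add word-break: break-all; globally in each theme's image_template.html as a defensive measure, so that extremely long strings cannot blow up the image.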
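
Step 1 can be sketched roughly as follows. This is a minimal illustration under assumptions, not the method added in message_cleaner_service.py: the regular expressions, their order, and the module-level constant names are all hypothetical, and only the function name comes from the PR description.

```python
import re

# All patterns below are assumptions made for this sketch.
_DATA_URI = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")
_HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")  # needs a letter after '<', so "1 < 2" survives
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps \t, \n, \r for the whitespace step
_SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def sanitize_chat_text(text: str) -> str:
    """Strip Base64 data URIs, HTML tags and control characters from one chat message."""
    text = _DATA_URI.sub("", text)        # remove the payload before the tag that carries it
    text = _HTML_TAG.sub("", text)
    text = _CONTROL.sub("", text)
    text = text.translate(_SMART_QUOTES)  # curly quotes become plain ASCII quotes
    return re.sub(r"\s+", " ", text).strip()


cleaned = sanitize_chat_text('hi <img src="data:image/png;base64,iVBORw0KGgo="> \u201cthere\u201d\x07\n ok')
```

Removing data URIs before tags means a payload that appears outside any tag is still dropped; the tag pattern is deliberately crude and would need tightening for messages that legitimately contain angle-bracket text.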

📁 4. Affected Files

  • src/domain/services/message_cleaner_service.py
  • src/infrastructure/analysis/utils/json_utils.py
  • src/infrastructure/analysis/analyzers/ (including topic, chat_quality, golden_quote, user_title_analyzer, etc.)
  • src/infrastructure/reporting/templates/*/quote_item.html (6 themes)
  • src/infrastructure/reporting/templates/*/image_template.html (6 themes)

Summary by Sourcery

Improve robustness of JSON parsing and chat analysis prompts while hardening image report rendering against malformed or excessively long content.

Bug Fixes:

  • Sanitize chat text before building LLM prompts to strip HTML tags, Base64 data URIs, control characters and normalize quotes, preventing polluted inputs from breaking JSON responses and HTML layout.
  • Replace naive regex-based JSON extraction with a balanced bracket parser for arrays and objects, reducing JSON parse failures when content includes embedded bracket characters.
  • Adjust group analysis interaction to always use reaction emojis instead of optional text replies, avoiding configuration-related inconsistencies.
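
The balanced bracket parser mentioned above can be illustrated with a short sketch. It is not the PR's _extract_json_balanced; the function below only mirrors the (text, open_char, close_char) signature shown in the reviewer's guide, and its body is an assumption about how such an extractor might track string and escape state.

```python
import json
import re
from typing import Optional


def extract_json_balanced(text: str, open_char: str = "[", close_char: str = "]") -> Optional[str]:
    """Return the first complete open_char...close_char span, ignoring brackets inside strings."""
    start = text.find(open_char)
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False      # the character after a backslash is taken literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == open_char:
            depth += 1
        elif ch == close_char:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None                      # unbalanced, e.g. a truncated reply


reply = 'Here you go: [{"reason": "uses ] and \\" inside"}] done'
naive = re.search(r"\[.*?\]", reply).group()  # the non-greedy regex stops at the first ']'
parsed = json.loads(extract_json_balanced(reply))
```

Here the non-greedy regex cuts the array off inside the string value, while the balanced scan returns the whole array. Returning None for an unbalanced span leaves the truncation repair to a later step such as fix_json.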

Enhancements:

  • Strengthen JSON fix-up logic by simplifying Unicode punctuation normalization and improving handling of truncated and malformed structures.
  • Apply global word wrapping and overflow protection styles across all image report themes to prevent extremely long text from stretching rendered images.
  • Mark quote reason fields in all report templates as safe HTML to allow intended rich formatting output from the analysis results while preserving layout.
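
As a rough illustration of the fix-up logic in the first enhancement, the sketch below normalizes a few Unicode punctuation marks through explicit code points and strips trailing commas. The function name fix_json_sketch and the chosen code points are assumptions, and the real fix_json covers more cases (truncation, missing commas, unquoted keys) that are not reproduced here.

```python
import json
import re

# Explicit code points; which characters the real fixer maps is an assumption.
_PUNCT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\uff0c": ",",  # fullwidth comma
    "\uff1a": ":",  # fullwidth colon
})


def fix_json_sketch(text: str) -> str:
    """Normalize punctuation, then drop commas that directly precede ']' or '}'."""
    text = text.translate(_PUNCT)
    return re.sub(r",\s*([\]}])", r"\1", text)


broken = '[{"name"\uff1a "a"\uff0c "n": 1,},]'
fixed = fix_json_sketch(broken)
data = json.loads(fixed)
```

Both steps operate on the whole text, so they also rewrite punctuation inside string values; a fixer like this is best applied only after a first json.loads attempt has failed.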

Contributor

sourcery-ai bot commented Mar 25, 2026

Reviewer's Guide

Improves robustness of LLM JSON parsing and report rendering by sanitizing chat text before prompting, extracting JSON with a balanced parser instead of regex, safely rendering HTML snippets in quote reasons, and hardening image templates and reactions handling to avoid oversized images and noisy replies.

Sequence diagram for sanitized group analysis and robust JSON parsing

sequenceDiagram
    actor User
    participant BotMain
    participant Analyzer as AnalyzerGoldenQuote
    participant Cleaner as MessageCleanerService
    participant LLM
    participant JsonUtils
    participant Renderer as ReportRenderer

    User->>BotMain: Send analyze command in group
    BotMain->>BotMain: analyze_group_daily
    BotMain->>BotMain: Get adapter and orig_msg_id
    BotMain->>User: Set reaction 🔍 on original message

    BotMain->>Analyzer: build_prompt(data)
    Analyzer->>Cleaner: sanitize_chat_text(msg.content)
    Cleaner-->>Analyzer: cleaned content
    Analyzer-->>BotMain: LLM prompt

    BotMain->>LLM: Send prompt
    LLM-->>BotMain: Raw JSON response text

    BotMain->>JsonUtils: parse_json_response(result_text, data_type)
    JsonUtils->>JsonUtils: _extract_json_balanced(result_text, [, ])
    JsonUtils-->>JsonUtils: json_text
    JsonUtils->>JsonUtils: json.loads(json_text)
    JsonUtils-->>BotMain: Parsed data or error

    alt Parsed data ok
        BotMain->>Renderer: Render HTML report with quote.reason
        Renderer-->>BotMain: Final image
        BotMain->>User: Send image, update reaction 📊
    else JSON parsing failed
        BotMain->>User: Send failure message or no result
    end

Class diagram for analyzers, message cleaning, and JSON utilities

classDiagram
    class MessageCleanerService {
        +clean_messages(messages) list
        +sanitize_chat_text(text str) str
    }

    class GoldenQuoteAnalyzer {
        +build_prompt(data list_dict) str
        +get_max_count() int
    }

    class TopicAnalyzer {
        +build_prompt(data list_dict) str
    }

    class ChatQualityAnalyzer {
        +build_prompt(data list_dict) str
    }

    class UserTitleAnalyzer {
        +build_prompt(data dict) str
    }

    class JsonUtils {
        +_extract_json_balanced(text str, open_char str, close_char str) str
        +fix_json(text str) str
        +parse_json_response(result_text str, data_type str, original_prompt str, max_retries int, require_list bool) tuple
        +parse_json_object_response(result_text str, data_type str, original_prompt str, max_retries int) tuple
    }

    class ReportTemplates {
        <<template>>
        +format_quote_item_html
        +hack_quote_item_html
        +retro_futurism_quote_item_html
        +simple_quote_item_html
        +spring_festival_quote_item_html
        +image_template_css
    }

    GoldenQuoteAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
    TopicAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
    ChatQualityAnalyzer ..> MessageCleanerService : uses sanitize_chat_text
    UserTitleAnalyzer ..> MessageCleanerService : uses sanitize_chat_text

    GoldenQuoteAnalyzer ..> JsonUtils : JSON parsing in analysis flow
    TopicAnalyzer ..> JsonUtils : JSON parsing in analysis flow
    ChatQualityAnalyzer ..> JsonUtils : JSON parsing in analysis flow
    UserTitleAnalyzer ..> JsonUtils : JSON parsing in analysis flow

    ReportTemplates ..> JsonUtils : consume parsed JSON data
    ReportTemplates ..> MessageCleanerService : receive already sanitized text

File-Level Changes

Change Details Files
Harden JSON extraction/parsing to correctly handle brackets inside strings and reduce over-aggressive text munging.
  • Introduce a balanced bracket extractor helper to find the first complete JSON array/object in arbitrary text, correctly tracking string/escape state and nesting depth.
  • Simplify and Unicode-normalize Chinese punctuation replacement in JSON fixer to use explicit code points.
  • Switch JSON array/object extraction in parse helpers from non-greedy regex to the new balanced extractor, and reuse it after fix_json when re-parsing fixed content.
  • Tighten fix_json steps ordering: fix truncation, insert missing commas/quoted keys, and strip trailing commas.
src/infrastructure/analysis/utils/json_utils.py
Sanitize chat/user text before sending to LLM to avoid HTML/Base64/control-character pollution in prompts.
  • Add MessageCleanerService.sanitize_chat_text to strip data: base64 URIs, HTML tags, control chars, normalize smart quotes, and collapse whitespace.
  • Use sanitize_chat_text when building user summaries for user title analysis.
  • Apply sanitize_chat_text to message content in golden quote, topic, and chat quality analyzers so prompts only contain clean, plain text lines.
src/domain/services/message_cleaner_service.py
src/infrastructure/analysis/analyzers/user_title_analyzer.py
src/infrastructure/analysis/analyzers/golden_quote_analyzer.py
src/infrastructure/analysis/analyzers/topic_analyzer.py
src/infrastructure/analysis/analyzers/chat_quality_analyzer.py
Render quote reasons as HTML instead of escaped text to avoid showing raw capsule markup/Base64 in reports.
  • Add the Jinja2 safe filter to quote.reason usages in all quote_item templates so pre-built HTML capsules are not auto-escaped.
  • Keep quote.content and quote.sender rendered normally while allowing only the reason field to bypass escaping.
src/infrastructure/reporting/templates/format/quote_item.html
src/infrastructure/reporting/templates/hack/quote_item.html
src/infrastructure/reporting/templates/retro_futurism/quote_item.html
src/infrastructure/reporting/templates/simple/quote_item.html
src/infrastructure/reporting/templates/spring_festival/quote_item.html
Constrain layout of multiple image report themes so long words/Base64-like strings cannot blow up image width or overflow.
  • Add word-wrap: break-word, word-break: break-all and overflow: hidden to body and key text/label/value elements in each themed image_template.html.
  • Ensure topic titles, details, quote contents/reasons, section labels, statistics labels/values, and some user info areas wrap rather than extending horizontally.
src/infrastructure/reporting/templates/hack/image_template.html
src/infrastructure/reporting/templates/retro_futurism/image_template.html
src/infrastructure/reporting/templates/format/image_template.html
src/infrastructure/reporting/templates/spring_festival/image_template.html
src/infrastructure/reporting/templates/scrapbook/image_template.html
src/infrastructure/reporting/templates/simple/image_template.html
Simplify analysis reply behavior to always use reactions instead of optional text replies and remove unused config toggles.
  • Change group daily analysis flow to always use reaction emojis for start/finish when adapter/message id are available, removing the text reply fallback.
  • Delete the enable_analysis_reply getter/setter from ConfigManager and the associated basic config option.
main.py
src/infrastructure/config/config_manager.py

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The repeated word-wrap: break-word; word-break: break-all; overflow: hidden; blocks across many selectors in multiple image_template.html files could be centralized into a common class or applied on container/body level to reduce duplication and make future tuning of these properties easier.
  • sanitize_chat_text and fix_json both perform similar Unicode quote normalization; consider extracting this into a shared utility to keep the normalization rules consistent and avoid subtle divergence over time.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The repeated `word-wrap: break-word; word-break: break-all; overflow: hidden;` blocks across many selectors in multiple `image_template.html` files could be centralized into a common class or applied on container/body level to reduce duplication and make future tuning of these properties easier.
- `sanitize_chat_text` and `fix_json` both perform similar Unicode quote normalization; consider extracting this into a shared utility to keep the normalization rules consistent and avoid subtle divergence over time.

## Individual Comments

### Comment 1
<location path="src/infrastructure/reporting/templates/hack/image_template.html" line_range="37-40" />
<code_context>
         }

         body {
+            word-wrap: break-word;
+            word-break: break-all;
+            overflow: hidden;
             font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
             background-color: var(--bg-body);
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The repeated use of `word-break: break-all` and `overflow: hidden` across many selectors is heavy-handed and could hurt readability or clip content.

`word-wrap: break-word; word-break: break-all; overflow: hidden;` are now set on `body`, `body::before`, and many text-related selectors. `break-all` will split normal words and numbers in awkward places, and `overflow: hidden` on containers like `.quote-text`, `.comment`, and headings can silently truncate content.

Consider limiting these rules to the specific, variable-length fields that actually overflow, using `overflow-wrap: anywhere` or just `word-wrap: break-word` instead of `break-all`, and adding `text-overflow: ellipsis` where truncation is desired. You could also extract these declarations into a shared utility class (e.g. `.overflow-safe-text`) to avoid duplication and simplify adjustments later.
</issue_to_address>


@SXP-Simon
Owner

Please take another look at the latest version first and see which problems remain, then submit one PR per problem.

@SXP-Simon
Owner

Please describe the problems you ran into: open an issue for each small problem first, then open a PR.
