[BUG] Twitch emotes name parsed wrong if message text contains non-ASCII characters #248

Open
nevmerzhitsky opened this issue Jul 9, 2024 · 0 comments · May be fixed by #249
Labels
bug Something isn't working

nevmerzhitsky commented Jul 9, 2024

Basic information

  • Program version: 0.2.8
  • Python version: 3.11.9
  • Operating system: Linux

Describe the bug

If a chat message in a VOD contains a non-ASCII character (for example, any 2-byte UTF-8 character), the emotes[].name field in the message JSON produced by the library is parsed incorrectly.

Command/Code used

  1. The command used (including the verbose tag, -v):
chat_downloader --start_time 05:58:28 --end_time 05:58:30 --output test.jsonl --testing 'https://www.twitch.tv/videos/2184933543'
  2. Output from the above command:

(I temporarily patched the library with debug prints to show the raw GQL content passed to the message mapper, chat_downloader.sites.twitch.TwitchChatDownloader._parse_message_info().)

[DEBUG] Python version: 3.11.9 (main, Jul  3 2024, 00:12:48) [GCC 12.2.0]
[DEBUG] Program version: 0.2.8
[DEBUG] Initialisation parameters: {'headers': None, 'cookies': None, 'proxy': None}
[DEBUG] Created TwitchChatDownloader session.
[INFO] Site: twitch.tv
[DEBUG] Program parameters: {'url': 'https://www.twitch.tv/videos/2184933543', 'start_time': '05:58:28', 'end_time': '05:58:30', 'max_attempts': 15, 'retry_timeout': None, 'interruptible_retry': True, 'timeout': None, 'inactivity_timeout': None, 'max_messages': None, 'message_groups': ['messages'], 'message_types': None, 'output': 'test.jsonl', 'overwrite': True, 'sort_keys': True, 'indent': 4, 'format': 'twitch', 'format_file': None, 'chat_type': 'live', 'ignore': None, 'message_receive_timeout': 0.1, 'buffer_size': 4096}
[DEBUG] Starting new HTTPS connection (1): gql.twitch.tv:443
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 880
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 None
[DEBUG] Match found: "<re.Match object; span=(0, 39), match='https://www.twitch.tv/videos/2184933543'>". Running "_get_chat_by_vod_id" function in "TwitchChatDownloader".
[DEBUG] Chat information: {'chat': <generator object TwitchChatDownloader._get_chat_messages_by_vod_id at 0x7f8ce80acf40>, 'title': 'DLC НА КАЗУАЛЫЧАХ | Прохождение #2 | ELDEN RING Shadow of the Erdtree | стрим 9', 'duration': 23578, 'status': 'past', 'video_type': 'video', 'start_time': None, 'id': '2184933543', '_output_writer': <chat_downloader.output.continuous_write.ContinuousWriter object at 0x7f8ce7fc0250>, '_output_callback': None, 'format': <function ChatDownloader.get_chat.<locals>.<lambda> at 0x7f8ce7f83880>, 'site': <chat_downloader.sites.twitch.TwitchChatDownloader object at 0x7f8ce8c21e50>}
[INFO] Retrieving chat for "DLC НА КАЗУАЛЫЧАХ | Прохождение #2 | ELDEN RING Shadow of the Erdtree | стрим 9".
[DEBUG] https://gql.twitch.tv:443 "POST /gql HTTP/11" 200 None
...
message={'fragments': [{'emote': None, 'text': 'Спасибо за стрим ', '__typename': 'VideoCommentMessageFragment'}, {'emote': {'id': '196892;31;41', 'emoteID': '196892', 'from': 31, '__typename': 'EmbeddedEmote'}, 'text': 'TwitchUnity', '__typename': 'VideoCommentMessageFragment'}, {'emote': None, 'text': ' Удовольствия от игры', '__typename': 'VideoCommentMessageFragment'}], 'userBadges': [], 'userColor': '#FF69B4', '__typename': 'VideoCommentMessage'}
fragment={'emote': None, 'text': 'Спасибо за стрим ', '__typename': 'VideoCommentMessageFragment'}
fragment={'emote': {'id': '196892;31;41', 'emoteID': '196892', 'from': 31, '__typename': 'EmbeddedEmote'}, 'text': 'TwitchUnity', '__typename': 'VideoCommentMessageFragment'}
fragment={'emote': None, 'text': ' Удовольствия от игры', '__typename': 'VideoCommentMessageFragment'}
[DEBUG] Writing to file: test.jsonl
5:58:29 | NIKI_ORNIS: Спасибо за стрим TwitchUnity Удовольствия от игры
...
[INFO] Finished retrieving chat messages.
[DEBUG] Session closed.
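
The fragment data printed above can be used to check how the reported emote position relates to the text. The following is a minimal sketch, not library code: it copies the three fragments from the debug output and recomputes where the emote fragment starts, once counted in characters and once counted in UTF-8 bytes. The variable names (`char_offset`, `byte_offset`, `emote_positions`) are illustrative assumptions.

```python
# Fragment data copied from the debug output above (unused keys omitted).
fragments = [
    {"emote": None, "text": "Спасибо за стрим "},
    {
        "emote": {"id": "196892;31;41", "emoteID": "196892", "from": 31},
        "text": "TwitchUnity",
    },
    {"emote": None, "text": " Удовольствия от игры"},
]

char_offset = 0  # running position counted in Unicode code points
byte_offset = 0  # running position counted in UTF-8 bytes
emote_positions = []

for fragment in fragments:
    text = fragment["text"]
    if fragment["emote"] is not None:
        # Record both candidate start positions next to the value Twitch reported.
        emote_positions.append(
            {
                "reported_from": fragment["emote"]["from"],
                "char_offset": char_offset,
                "byte_offset": byte_offset,
            }
        )
    char_offset += len(text)
    byte_offset += len(text.encode("utf-8"))

print(emote_positions)
```

For this message the reported `from` value (31) matches the byte offset, not the character offset (17), because each of the 14 Cyrillic letters before the emote takes two bytes in UTF-8.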

Actual content of test.jsonl (prettified)

{
  "author": {
    "colour": "#FF69B4",
    "display_name": "NIKI_ORNIS",
    "id": "458636669",
    "name": "niki_ornis"
  },
  "emotes": [
    {
      "id": "196892",
      "images": [
        {
          "height": 28,
          "id": "28x28-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/3.0",
          "width": 112
        },
        {
          "height": 28,
          "id": "28x28-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/3.0",
          "width": 112
        }
      ],
      "locations": "31-41",
      "name": ""
    }
  ],
  "message": "\u0421\u043f\u0430\u0441\u0438\u0431\u043e \u0437\u0430 \u0441\u0442\u0440\u0438\u043c TwitchUnity \u0423\u0434\u043e\u0432\u043e\u043b\u044c\u0441\u0442\u0432\u0438\u044f \u043e\u0442 \u0438\u0433\u0440\u044b",
  "message_id": "5bc4d778-e3fa-45da-bdb4-0206dd035902",
  "message_type": "text_message",
  "time_in_seconds": 21509,
  "time_text": "5:58:29",
  "timestamp": 1719705721803000
}

Expected content of test.jsonl (prettified)

The name field of the emote should be populated:

{
  "author": {
    "colour": "#FF69B4",
    "display_name": "NIKI_ORNIS",
    "id": "458636669",
    "name": "niki_ornis"
  },
  "emotes": [
    {
      "id": "196892",
      "images": [
        {
          "height": 28,
          "id": "28x28-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-light",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/light/3.0",
          "width": 112
        },
        {
          "height": 28,
          "id": "28x28-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/1.0",
          "width": 28
        },
        {
          "height": 56,
          "id": "56x56-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/2.0",
          "width": 56
        },
        {
          "height": 112,
          "id": "112x112-dark",
          "url": "https://static-cdn.jtvnw.net/emoticons/v2/196892/default/dark/3.0",
          "width": 112
        }
      ],
      "locations": "31-41",
      "name": "TwitchUnity"
    }
  ],
  "message": "\u0421\u043f\u0430\u0441\u0438\u0431\u043e \u0437\u0430 \u0441\u0442\u0440\u0438\u043c TwitchUnity \u0423\u0434\u043e\u0432\u043e\u043b\u044c\u0441\u0442\u0432\u0438\u044f \u043e\u0442 \u0438\u0433\u0440\u044b",
  "message_id": "5bc4d778-e3fa-45da-bdb4-0206dd035902",
  "message_type": "text_message",
  "time_in_seconds": 21509,
  "time_text": "5:58:29",
  "timestamp": 1719705721803000
}

Additional context/information

Twitch GQL reports the start and end of an emote code within the chat text as byte offsets. When the text contains non-ASCII characters, the locations must therefore be applied to the UTF-8 encoded bytes of the Python string rather than to the string itself.

The fix is straightforward:

'name': message_text.encode("utf-8")[begin:end + 1].decode("utf-8")

instead of

'name': message_text[begin:end + 1]
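
To illustrate the difference, here is a small self-contained sketch using the message text and the "31-41" location from this report. The helper name `emote_name_from_byte_range` is hypothetical and does not exist in the library; it only wraps the one-line fix above. The exact wrong value the library produces depends on which string it slices at that point, so the sketch only shows that character indexing does not yield the emote name, while byte indexing does.

```python
def emote_name_from_byte_range(message_text: str, begin: int, end: int) -> str:
    """Return the text covered by an inclusive UTF-8 byte range [begin, end].

    Hypothetical helper for illustration; assumes the offsets fall on
    character boundaries, otherwise decode() raises UnicodeDecodeError.
    """
    return message_text.encode("utf-8")[begin:end + 1].decode("utf-8")


message_text = "Спасибо за стрим TwitchUnity Удовольствия от игры"
begin, end = 31, 41  # from "locations": "31-41" in the output above

by_chars = message_text[begin:end + 1]  # indexes Unicode code points
by_bytes = emote_name_from_byte_range(message_text, begin, end)  # indexes bytes

print(repr(by_chars))
print(repr(by_bytes))
```

Slicing by bytes returns 'TwitchUnity', while slicing the same range by characters lands inside the Cyrillic text that follows the emote.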

@nevmerzhitsky nevmerzhitsky added the bug Something isn't working label Jul 9, 2024