@Seanium Seanium commented Mar 30, 2025

Issue: Encoding Error When Reading TSV Files

Error Log:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 43: illegal multibyte sequence.

Root Cause:

OCTIS reads corpus.tsv, a UTF-8 encoded file, with the GBK codec, and decoding fails. The loading code gives the caller no way to specify the file encoding, so the mismatch cannot be corrected from outside the library.


Solution

Modify load_custom_dataset_from_folder to Add Encoding Parameter

Add an encoding parameter to the load_custom_dataset_from_folder function, with the default set to UTF-8:

# Add an encoding parameter to load_custom_dataset_from_folder (excerpt)
def load_custom_dataset_from_folder(self, folder_path, encoding='utf-8'):
    # Pass the caller's encoding through to pandas when reading the corpus
    df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None, encoding=encoding)

Recommendation

Adding an encoding parameter lets callers tell OCTIS which encoding a dataset uses, which avoids the mismatch. UTF-8 is the default, and any other encoding can be passed explicitly when a dataset needs it.
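
The effect of the parameter can be illustrated with a minimal, standalone sketch. It is not OCTIS code: `read_corpus_tsv` and `demo` are hypothetical names, and the standard-library `csv` module stands in for `pd.read_csv` so the example runs without extra dependencies. It writes a small UTF-8 corpus.tsv, reads it back with the default encoding, and shows that forcing GBK raises the same kind of `UnicodeDecodeError` as in the error log.

```python
import csv
import os
import tempfile


def read_corpus_tsv(folder_path, encoding="utf-8"):
    """Hypothetical helper: read <folder_path>/corpus.tsv into a list of rows.

    The encoding is passed explicitly, so the result does not depend on
    the platform's default text encoding.
    """
    corpus_path = os.path.join(folder_path, "corpus.tsv")
    # newline="" is the setting the csv module recommends for parsed files.
    with open(corpus_path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle, delimiter="\t"))


def demo():
    with tempfile.TemporaryDirectory() as folder:
        # Write a one-line UTF-8 corpus that contains non-ASCII text.
        target = os.path.join(folder, "corpus.tsv")
        with open(target, "w", encoding="utf-8", newline="") as out:
            out.write("中\ttrain\n")

        # Default (UTF-8) read succeeds.
        rows = read_corpus_tsv(folder)

        # Forcing the wrong codec fails while decoding the file.
        try:
            read_corpus_tsv(folder, encoding="gbk")
            gbk_failed = False
        except UnicodeDecodeError:
            gbk_failed = True
        return rows, gbk_failed


if __name__ == "__main__":
    print(demo())
```

Keeping UTF-8 as the default means existing callers do not need to change, while a dataset stored in another encoding can still be loaded by passing its encoding explicitly.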


Seanium added 2 commits March 30, 2025 21:57
Fix Encoding Error When Reading TSV Files in load_custom_dataset_from_folder