Fix Encoding Error When Reading TSV Files in load_custom_dataset_from_folder #132

Seanium · 2025-03-30T14:15:00Z

Issue: Encoding Error When Reading TSV Files

Error Log:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 43: illegal multibyte sequence.

Root Cause:

OCTIS reads corpus.tsv, which is UTF-8 encoded, as GBK, and decoding fails. The source code offers no way to specify the encoding when reading files, so the caller cannot correct the mismatch.
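The mismatch can be reproduced outside OCTIS. The following is a minimal sketch using only the standard library; the sample row and the temporary corpus.tsv are made up for illustration, and plain `open` stands in for whatever reader OCTIS uses internally.

```python
import os
import tempfile

# Illustrative row: "中" is b"\xe4\xb8\xad" in UTF-8, so the file contains
# a 0xad byte followed by a tab, which is not a valid GBK sequence.
line = "中\ttrain\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "corpus.tsv")

    # Write the file as UTF-8, like the corpus.tsv described above.
    with open(path, "w", encoding="utf-8") as f:
        f.write(line)

    # Reading the UTF-8 bytes as GBK fails with a decode error.
    try:
        with open(path, encoding="gbk") as f:
            f.read()
        gbk_error = None
    except UnicodeDecodeError as exc:
        gbk_error = exc

    # Reading with the encoding the file was written in succeeds.
    with open(path, encoding="utf-8") as f:
        utf8_text = f.read()

print(type(gbk_error).__name__)
print(repr(utf8_text))
```

The same bytes decode cleanly once the reader is told they are UTF-8, which is why an explicit encoding argument is enough to resolve the error.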

Solution

Modify `load_custom_dataset_from_folder` to Add Encoding Parameter

Add an encoding parameter to the load_custom_dataset_from_folder function, with the default set to UTF-8:

# Add encoding parameter to load_custom_dataset_from_folder function
def load_custom_dataset_from_folder(self, folder_path, encoding='utf-8'):
    df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None, encoding=encoding)

Recommendation

Adding an encoding parameter lets OCTIS read dataset files in whichever encoding they were written and avoids the mismatch described above. UTF-8 is the default; callers can pass a different encoding when a dataset requires one.
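As a rough illustration of how the parameter would be used with a non-UTF-8 dataset, here is a sketch built around a hypothetical standalone helper, `read_corpus`. It is not the OCTIS method itself; it only mirrors the `pd.read_csv` call from the change above, and the GBK sample corpus is invented for the example.

```python
import os
import tempfile

import pandas as pd


def read_corpus(folder_path, encoding="utf-8"):
    """Hypothetical helper mirroring the patched read; not part of OCTIS."""
    corpus_path = os.path.join(folder_path, "corpus.tsv")
    # The encoding argument is forwarded to pandas unchanged.
    return pd.read_csv(corpus_path, sep="\t", header=None, encoding=encoding)


with tempfile.TemporaryDirectory() as folder:
    # Write a small two-column corpus in GBK to stand in for a legacy dataset.
    with open(os.path.join(folder, "corpus.tsv"), "w", encoding="gbk") as f:
        f.write("中文 文档\ttrain\n另一个 文档\ttest\n")

    # The caller overrides the UTF-8 default to match the file.
    df = read_corpus(folder, encoding="gbk")

print(df)
```

Keeping UTF-8 as the default means existing callers that pass no encoding get a fixed, platform-independent choice, while datasets in other encodings remain readable by passing the argument explicitly.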

Seanium added 2 commits March 30, 2025 21:57

fix encoding in load_custom_dataset_from_folder

60b9c22

Merge pull request #1 from Seanium/Seanium-patch-1

92f63e8
