fix(document): default to UTF-8 encoding when reading markdown files #54

Open
geopanther wants to merge 1 commit into main from fix/utf8-encoding-default

@geopanther
Owner

Summary

Explicitly pass encoding='utf-8' when opening markdown files, so that systems whose default locale encoding is not UTF-8 don't fail on valid UTF-8 content.

Adapted from iamjackg/md2cf#139 by @sheyifan.

Changes

  • Add encoding='utf-8' to the initial open() call in get_page_data_from_file_path()
  • The existing UnicodeDecodeError fallback with chardet detection remains for non-UTF-8 files
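The read path described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual body of get_page_data_from_file_path(): the function name read_text_utf8_first and the fallback_encoding parameter are hypothetical, and the optional chardet import is an assumption made so that the sketch also runs where chardet is not installed.

```python
from pathlib import Path


def read_text_utf8_first(path, fallback_encoding="latin-1"):
    """Read a text file as UTF-8, falling back to detection on failure.

    Hypothetical helper for illustration only; the project's real function
    may be structured differently.
    """
    try:
        # Explicit encoding: the result no longer depends on the locale.
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except UnicodeDecodeError:
        # The file is not valid UTF-8, so inspect the raw bytes instead.
        raw = Path(path).read_bytes()
        try:
            import chardet  # third-party; assumed optional in this sketch

            detected = chardet.detect(raw)["encoding"]
        except ImportError:
            detected = None
        # chardet may report None; use the fallback encoding in that case.
        return raw.decode(detected or fallback_encoding)
```

With this shape, valid UTF-8 files never reach the detection branch, whatever the locale is set to.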

On systems where the default locale is not UTF-8, open() without an
explicit encoding may fail to read valid UTF-8 markdown files, falling
through to the chardet-based detection unnecessarily.

@geopanther
Owner Author

Maintainer Review

Verdict: ✅ Approve

Minimal, correct fix. Without an explicit encoding, open() falls back to the locale's preferred encoding (locale.getpreferredencoding(False)), which can be non-UTF-8 on some systems (e.g., Windows with legacy code pages, some Docker images). Since markdown files are overwhelmingly UTF-8, defaulting to it is the right call.
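To make the failure mode concrete, here is a small sketch of what happens when UTF-8 bytes are decoded with a legacy code page. cp1252 is only an assumed example of such a code page, and decode_as is a hypothetical helper. Note that, depending on the bytes involved, the wrong encoding either raises UnicodeDecodeError or decodes without error into mojibake.

```python
# UTF-8 bytes as they would sit on disk in a markdown file.
raw_cafe = "café".encode("utf-8")    # b'caf\xc3\xa9'
raw_angel = "Ángel".encode("utf-8")  # b'\xc3\x81ngel'


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, reporting a decode failure as a string."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


mojibake = decode_as(raw_cafe, "cp1252")  # decodes, but to the wrong text
failure = decode_as(raw_angel, "cp1252")  # byte 0x81 is unmapped in cp1252
correct = decode_as(raw_cafe, "utf-8")    # explicit UTF-8 round-trips

print(mojibake)  # cafÃ©
print(failure)
print(correct)   # café
```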

The existing chardet fallback for non-UTF-8 files remains untouched — good.

Note: Python 3.15 will change open() to default to UTF-8 (PEP 686), making this a forward-compatible fix. UTF-8 mode has been available since Python 3.7 (PEP 540), so you can opt in early via PYTHONUTF8=1 or -X utf8, but explicit is better than implicit.
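For anyone wanting to check what a given interpreter would do without the explicit argument, a quick sketch like the following reports the relevant settings. The function name text_io_defaults is hypothetical; sys.flags.utf8_mode and locale.getpreferredencoding() are standard-library APIs.

```python
import locale
import sys


def text_io_defaults():
    """Report the settings that decide open()'s implicit encoding."""
    return {
        # True when started with PYTHONUTF8=1 or -X utf8.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Encoding used by open() when no encoding argument is given.
        "default_encoding": locale.getpreferredencoding(False),
    }


print(text_io_defaults())
```

The output varies by platform and environment, which is exactly why the PR passes the encoding explicitly.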

No concerns. Ship it.
