Encoding #1425

@lbourdois


Describe the bug

Hi @Wauplin

As discussed on Slack (https://huggingface.slack.com/archives/C039P47V1L5/p1680688264050959), it turns out that in some cases opening a PR on the Hub breaks accented characters. Examples: https://huggingface.co/mideind/nmt-doc-is-en-2022-10/discussions/1 and https://huggingface.co/Wauplin/test_encoding/discussions/5/files.

The problem persists when upgrading the library version (from 0.11.0 to 0.13.3).

The problem does not seem to come from the card either, since the accents are correctly displayed:

str(card)
'---\nlanguage:\n- is\n- en\n- multilingual\ntags:\n- translation\ninference:\n  parameters:\n    src_lang: is_IS\n    tgt_lang: en_XX\n    decoder_start_token_id: 2\n    max_length: 512\nwidget:\n- text: Einu sinni átti ég hest. Hann var svartur og hvítur.\nhuggingface_hub: 0.13.3\n---\n\n# mBART based translation model\nThis model was trained to translate multiple sentences at once, compared to one sentence at a time.\n\nIt will occasionally combine sentences or add an extra sentence.\n\nThis is the same model as are provided on CLARIN: https://repository.clarin.is/repository/xmlui/handle/20.500.12537/278\n'
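
Since the in-memory string looks fine, one possible explanation is that the text gets encoded with a non-UTF-8 default somewhere between the card and the Hub. I have not confirmed this is what happens inside push_to_hub. The sketch below is only an illustration in plain Python (no huggingface_hub code) and assumes cp1252 as the platform default, which is common on Windows:

```python
# Hypothetical illustration of an encoding mismatch; this is NOT huggingface_hub code.
text = "Einu sinni átti ég hest. Hann var svartur og hvítur."

# Assumption: the text is written with the platform default encoding
# (cp1252 here), as open(path, "w") does when no encoding= is given.
raw = text.encode("cp1252")

# A reader that expects UTF-8 cannot decode the accented bytes
# and substitutes the replacement character U+FFFD for them.
broken = raw.decode("utf-8", errors="replace")

# With an explicit UTF-8 round trip the accents survive.
ok = text.encode("utf-8").decode("utf-8")

print(broken)
print(ok)
```

If this is the cause, passing encoding="utf-8" explicitly wherever the card is written to disk should avoid the problem.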

The requested information is below. Do not hesitate to let me know if you need anything else to help find a solution.

Reproduction

from huggingface_hub import ModelCard, __version__

# Load the existing card, record the library version in its metadata,
# then open a PR with the updated card.
card = ModelCard.load("mideind/nmt-doc-is-en-2022-10")
card.data.huggingface_hub = __version__
card.push_to_hub(
    repo_id="Wauplin/test_encoding",
    create_pr=True,
    commit_message=f"Update ModelCard using huggingface_hub {__version__}",
)

Note: this bug was found in January during a wave of adding "multilingual" to the language tag of multilingual models. The code at the time was exactly the same as above (apart from the repo name), with one extra line before the push_to_hub call: card.data.language = card.data.language + ["multilingual"]

Logs

See https://huggingface.co/Wauplin/test_encoding/discussions

System info

- huggingface_hub version: 0.13.3
- Platform: Windows-10-10.0.18362-SP0
- Python version: 3.8.5
- Running in iPython ?: Yes
- iPython shell: ZMQInteractiveShell
- Running in notebook ?: Yes
- Running in Google Colab ?: No
- Token path ?: C:\Users\lbourdois\.cache\huggingface\token
- Has saved token ?: True
- Who am I ?: lbourdois
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: 2.9.0
- Torch: 1.10.0
- Jinja2: 2.11.2
- Graphviz: 0.16
- Pydot: 1.4.2
- Pillow: 9.0.0
- hf_transfer: N/A
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: C:\Users\lbourdois\.cache\huggingface\hub
- HUGGINGFACE_ASSETS_CACHE: C:\Users\lbourdois\.cache\huggingface\assets
- HF_TOKEN_PATH: C:\Users\lbourdois\.cache\huggingface\token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False


Labels: bug (Something isn't working)
