Build embeddings #506

Merged
merged 77 commits into from
Jun 18, 2025
8f65958
Create docs embeddings
mishig25 Jun 7, 2024
a3690fd
fix declaration issue
mishig25 Jun 13, 2024
0d2e29b
rename to self.anchor
mishig25 Jun 13, 2024
410018b
fix typo
mishig25 Jun 13, 2024
3e615af
compile regexes
mishig25 Jun 13, 2024
bf0cb7d
raise error on add_child when needed
mishig25 Jun 13, 2024
21a2ff4
custom ChunkingError
mishig25 Jun 13, 2024
3dc98d6
rm get_chunks chunk_len_chars possible ambiguity
mishig25 Jun 13, 2024
f0891e6
use hf client for inference embeddings
mishig25 Jun 13, 2024
1532ddf
fixes
mishig25 Jun 18, 2024
78e0867
use meilisearch for vector db
mishig25 Jun 18, 2024
0e31803
Add build emebddings workflow
mishig25 Jun 21, 2024
071d928
use cli args for secrets
mishig25 Jun 21, 2024
5be8bbb
MEILISEARCH_PAYLOAD_MAX_MB = 95
mishig25 Jun 24, 2024
24f59f5
use github secrets
mishig25 Jun 24, 2024
b9a80dc
install huggingface_hub from source
mishig25 Jun 24, 2024
7689842
add meilisearch swap_indexes
mishig25 Jun 25, 2024
f0865fc
meilisearch cleanup job
mishig25 Jun 25, 2024
3d8e895
add accelerate & huggingface_hub
mishig25 Jun 25, 2024
5983942
add transformers
mishig25 Jun 25, 2024
a9e71f8
better chunking
mishig25 Jun 26, 2024
abf4778
meilisearch `"searchableAttributes": []`
mishig25 Jun 27, 2024
798aea1
try bigger chunk
mishig25 Jun 27, 2024
edfa0d2
change embed model to "hf.oc/BAAI/bge-base-en-v1-5"
mishig25 Jun 28, 2024
2b64b28
better meilisearch handling
mishig25 Jul 2, 2024
f532bd2
better source page metadata
mishig25 Jul 2, 2024
e0955b8
typo
mishig25 Jul 2, 2024
151868e
index huggingface/hub-docs
mishig25 Jul 4, 2024
eb1105d
use latest `huggignface_hub`
mishig25 May 21, 2025
5c1cfd7
wip
mishig25 May 21, 2025
61dd462
wip
mishig25 May 21, 2025
f9c7356
fix empty root
mishig25 Jun 11, 2025
7bcf95a
diffusers
mishig25 Jun 11, 2025
7258ee5
wip
mishig25 Jun 12, 2025
5183879
modify version
mishig25 Jun 12, 2025
e5cb371
Revert "modify version"
mishig25 Jun 16, 2025
587ded9
docstring hirearchy
mishig25 Jun 16, 2025
d1c44d2
parse headings into properties
mishig25 Jun 16, 2025
a60440a
fix typo
mishig25 Jun 16, 2025
d0580b1
feat: add last heading to page URL when no heading exists and make AP…
mishig25 Jun 16, 2025
6a24271
promot engineering
mishig25 Jun 16, 2025
03700cb
distinguish between library/service in embeddings prefix
mishig25 Jun 16, 2025
fd69245
refactor: move text prefix generation to embeddings step and make CLI…
mishig25 Jun 16, 2025
2eb37aa
make style
mishig25 Jun 16, 2025
e89c1a9
refactor: extract heading slugification logic into reusable function
mishig25 Jun 16, 2025
567b288
refactor: rename package_name field to library in Embedding namedtuple
mishig25 Jun 16, 2025
698cc0f
feat: configure searchable and filterable attributes for Meilisearch …
mishig25 Jun 16, 2025
ba7edcc
refactor: rename index to docs-semantic-search and add heading fields…
mishig25 Jun 16, 2025
e490ee4
feat: use uv package installer for non-containerized environments
mishig25 Jun 16, 2025
a3337be
feat: conditionally set pip or uv installer based on container enviro…
mishig25 Jun 17, 2025
217dda8
feat: support nested package paths in embedding build workflow
mishig25 Jun 17, 2025
fc11401
test on tokenizers
mishig25 Jun 17, 2025
fd8f761
feat: add uv installer step to build_embeddings workflow when running…
mishig25 Jun 17, 2025
1233cb0
test more libs
mishig25 Jun 17, 2025
a23dd60
test more libs
mishig25 Jun 17, 2025
81e5166
test more libs
mishig25 Jun 17, 2025
0a5b98b
fix cleanup-job
mishig25 Jun 17, 2025
4a49f9e
fix: convert slugs to lowercase and make HF/Meilisearch credentials o…
mishig25 Jun 17, 2025
e2cddd2
refactor: make auth tokens optional with default values and reorder s…
mishig25 Jun 17, 2025
a203c83
format
mishig25 Jun 17, 2025
16f00bf
fix: handle TypeError in signature parsing
mishig25 Jun 17, 2025
9e92460
more libs test
mishig25 Jun 17, 2025
90d0493
test more libs
mishig25 Jun 17, 2025
927f156
test pre_command
mishig25 Jun 17, 2025
4c4cdb4
test parallel runs
mishig25 Jun 17, 2025
53e12bf
rm dev change
mishig25 Jun 17, 2025
14f2dbb
chore: update huggingface_hub dependency to feature_extraction_workin…
mishig25 Jun 17, 2025
a3520af
ci: reduce parallel jobs to 1 for sequential embedding builds
mishig25 Jun 17, 2025
6010e27
wip
mishig25 May 21, 2025
bf42429
feat: uncomment cleanup job in build_embeddings workflow
mishig25 Jun 17, 2025
b1df678
wip
mishig25 Jun 18, 2025
ad44b8e
all libs
mishig25 May 21, 2025
457f3fb
Revert "fix: handle TypeError in signature parsing"
mishig25 Jun 18, 2025
c99fc9e
refactor: remove cleanup job from embeddings workflow
mishig25 Jun 18, 2025
eea9892
ci: remove push trigger from embeddings workflow to only run on schedule
mishig25 Jun 18, 2025
5b3732c
ci: switch doc-builder branch from build_embeddings to main
mishig25 Jun 18, 2025
2ee0918
ci: re-enable doc build workflow on pull requests
mishig25 Jun 18, 2025
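
The commit `MEILISEARCH_PAYLOAD_MAX_MB = 95` suggests that documents are sent to Meilisearch in batches kept under a payload size cap. The diff for `build_embeddings.py` is not shown on this page, so the following is only a minimal sketch of that idea. The helper name `batch_by_payload_size` and its size estimate are assumptions for illustration, not the PR's code.

```python
import json
from typing import Iterable, Iterator, List

# Value taken from the commit message; how the PR actually uses it is not shown here.
MEILISEARCH_PAYLOAD_MAX_MB = 95


def batch_by_payload_size(
    documents: Iterable[dict],
    max_bytes: int = MEILISEARCH_PAYLOAD_MAX_MB * 1024 * 1024,
) -> Iterator[List[dict]]:
    """Yield lists of documents whose JSON-encoded size stays under max_bytes.

    Hypothetical helper: a single document larger than max_bytes is still
    yielded on its own rather than dropped.
    """
    batch: List[dict] = []
    batch_bytes = 2  # the enclosing "[" and "]"
    for doc in documents:
        # +2 covers the ", " separator json.dumps emits between list items,
        # so the running total is a slight overestimate of the real size.
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 2
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 2
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch
```

Each yielded batch could then be passed to whatever upload call the indexing code uses. Overestimating the size keeps every batch safely under the cap at the cost of slightly smaller batches.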
1 change: 1 addition & 0 deletions .github/workflows/accelerate_doc.yml
@@ -1,3 +1,4 @@

name: Accelerate doc build

on: [pull_request]
146 changes: 146 additions & 0 deletions .github/workflows/build_embeddings.yml
@@ -0,0 +1,146 @@
name: Daily Build Embeddings

env:
  DIFFUSERS_SLOW_IMPORT: yes

on:
  schedule:
    - cron: "5 7 * * *" # every day at 07:05
  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  matrix-job:
    runs-on: ubuntu-latest
    container: huggingface/transformers-doc-builder
    strategy:
      max-parallel: 1 # run sequentially
      matrix:
        include:
          - repo_id: huggingface/tokenizers
            doc_folder: docs/source-doc-builder
            package_path: bindings/python
          - repo_id: huggingface/diffusers
            doc_folder: docs/source/en
          - repo_id: huggingface/accelerate
            doc_folder: docs/source
          - repo_id: huggingface/huggingface_hub
            doc_folder: docs/source/en
          - repo_id: huggingface/transformers
            doc_folder: docs/source/en
          - repo_id: huggingface/hub-docs
            doc_folder: docs/hub
            package_name: hub
            is_not_python_module: true
          - repo_id: huggingface/huggingface.js
            doc_folder: docs
            is_not_python_module: true
            pre_command: npm install -g corepack@latest && corepack enable && cd huggingface.js && pnpm install && pnpm -r build && pnpm --filter doc-internal start
          - repo_id: huggingface/transformers.js
            doc_folder: docs/source
            is_not_python_module: true
          - repo_id: huggingface/smolagents
            doc_folder: docs/source/en
          - repo_id: huggingface/peft
            doc_folder: docs/source
          - repo_id: huggingface/trl
            doc_folder: docs/source
          - repo_id: bitsandbytes-foundation/bitsandbytes
            doc_folder: docs/source
          - repo_id: huggingface/lerobot
            doc_folder: docs/source
          - repo_id: huggingface/pytorch-image-models
            doc_folder: hfdocs/source
            package_name: timm
          - repo_id: huggingface/hub-docs
            doc_folder: docs/inference-providers
            package_name: inference-providers
            is_not_python_module: true
          - repo_id: huggingface/safetensors
            doc_folder: docs/source
            package_path: bindings/python
          - repo_id: huggingface/hf-endpoints-documentation
            doc_folder: docs/source
            package_name: inference-endpoints
            is_not_python_module: true
          - repo_id: huggingface/dataset-viewer
            doc_folder: docs/source
            package_name: dataset-viewer
            is_not_python_module: true
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true
    timeout-minutes: 360 # Set timeout to 6 hours
    steps:
      - name: Setup REPO_NAME
        shell: bash
        run: |
          current_path=$(pwd)
          repo_id="${{ matrix.repo_id }}"
          repo_name="${repo_id#*/}"
          echo "REPO_NAME=${repo_name}" >> $GITHUB_ENV

      - name: Checkout repository
        uses: actions/checkout@v2
        with:
          repository: ${{ matrix.repo_id }}
          path: ${{ github.workspace }}/${{ env.REPO_NAME }}

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install libgl1
        run: apt-get install -y libgl1

      - name: Export PIP_OR_UV ('pip' or 'uv pip')
        run: |
          if [ -z "${{ job.container }}" ]
          then
            echo "PIP_OR_UV=uv pip" >> $GITHUB_ENV
          else
            echo "PIP_OR_UV=pip" >> $GITHUB_ENV
          fi

      - name: Setup environment
        shell: bash
        run: |
          if [[ "${{ matrix.is_not_python_module }}" != "true" ]]; then
            current_path=$(pwd)
            cd ${{ env.REPO_NAME }}
            if [[ -n "${{ matrix.package_path }}" ]]; then
              cd ${{ matrix.package_path }}
              $PIP_OR_UV install .[dev]
              cd $current_path
            else
              $PIP_OR_UV install .[dev]
              cd $current_path
            fi
          fi

          rm -rf doc-builder
          rm -rf .git
          git clone https://github.com/huggingface/doc-builder.git
          cd doc-builder
          git fetch
          git checkout main
          $PIP_OR_UV install .

      - name: Run pre-command
        shell: bash
        run: |
          if [ ! -z "${{ matrix.pre_command }}" ]
          then
            bash -c "${{ matrix.pre_command }}"
          fi

      - name: Build embeddings
        shell: bash
        run: |
          echo Building docs for ${{ matrix.package_name || env.REPO_NAME }}
          FLAGS=""
          if [[ "${{ matrix.is_not_python_module }}" == "true" ]]; then
            FLAGS="--not_python_module"
          fi
          doc-builder embeddings ${{ matrix.package_name || env.REPO_NAME }} ${{ env.REPO_NAME }}/${{ matrix.doc_folder }} --hf_ie_name docs-embed-bge-base-en-v1-5 --hf_ie_namespace huggingface --hf_ie_token ${{ secrets.HF_IE_TOKEN }} --meilisearch_key ${{ secrets.MEILISEARCH_KEY }} $FLAGS
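
The `Setup REPO_NAME` and `Build embeddings` steps turn each matrix entry into one `doc-builder embeddings` invocation. As a rough illustration of how the arguments are derived, here is a minimal Python sketch that mirrors the shell expansion `${repo_id#*/}` and the `matrix.package_name || env.REPO_NAME` fallback. The function name `embeddings_command` is hypothetical, and the `--hf_ie_token` and `--meilisearch_key` flags are deliberately left out because the workflow reads them from GitHub secrets.

```python
from typing import Dict, List


def embeddings_command(entry: Dict[str, object]) -> List[str]:
    """Build the argv list the final workflow step would run for one matrix entry.

    Hypothetical helper for illustration; secret-bearing flags are omitted.
    """
    repo_id = str(entry["repo_id"])
    # Shell `${repo_id#*/}` strips the shortest prefix matching `*/`,
    # i.e. everything up to and including the first slash.
    repo_name = repo_id.split("/", 1)[1] if "/" in repo_id else repo_id
    # `matrix.package_name || env.REPO_NAME`: fall back to the repo name.
    package_name = str(entry.get("package_name") or repo_name)
    cmd = [
        "doc-builder",
        "embeddings",
        package_name,
        f"{repo_name}/{entry['doc_folder']}",
        "--hf_ie_name",
        "docs-embed-bge-base-en-v1-5",
        "--hf_ie_namespace",
        "huggingface",
    ]
    # Only non-Python documentation sets get the extra flag.
    if entry.get("is_not_python_module"):
        cmd.append("--not_python_module")
    return cmd


timm_entry = {
    "repo_id": "huggingface/pytorch-image-models",
    "doc_folder": "hfdocs/source",
    "package_name": "timm",
}
print(" ".join(embeddings_command(timm_entry)))
```

For the `pytorch-image-models` entry, the package name passed to `doc-builder` is `timm` while the docs path still uses the checked-out repository folder, which is why the matrix carries both `repo_id` and `package_name`.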

2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@

from setuptools import find_packages, setup

-install_requires = ["black", "GitPython", "tqdm", "pyyaml", "packaging", "nbformat", "huggingface_hub", "pillow"]
+install_requires = ["black", "GitPython", "tqdm", "pyyaml", "packaging", "nbformat", "huggingface_hub @ git+https://github.com/huggingface/huggingface_hub.git@feature_extraction_working", "pillow", "meilisearch"]

extras = {}
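
The new `huggingface_hub` entry is a direct reference of the form `name @ url`, where the part after the last `@` in the VCS URL names the Git revision to install (here a branch). To make the three parts explicit, here is a small sketch that splits such a string. `split_direct_reference` and `DirectReference` are hypothetical names, and the parsing is simplified: it assumes the revision contains no `/`.

```python
from typing import NamedTuple, Optional


class DirectReference(NamedTuple):
    name: str
    url: str
    ref: Optional[str]


def split_direct_reference(requirement: str) -> DirectReference:
    """Split a 'name @ url[@ref]' requirement string into its parts.

    Simplified illustration: only looks for '@ref' in the last path segment,
    so revisions containing '/' are not handled.
    """
    name, sep, url = requirement.partition("@")
    if not sep:
        raise ValueError(f"not a direct reference: {requirement!r}")
    url = url.strip()
    # Restrict the search to the last path segment so that an '@' in the
    # user-info part of a URL (e.g. git@host) is not mistaken for a revision.
    head, slash, tail = url.rpartition("/")
    repo, at, ref = tail.partition("@")
    if at:
        return DirectReference(name.strip(), head + slash + repo, ref)
    return DirectReference(name.strip(), url, None)


parsed = split_direct_reference(
    "huggingface_hub @ git+https://github.com/huggingface/huggingface_hub.git@feature_extraction_working"
)
print(parsed.name, parsed.url, parsed.ref)
```

Read this way, the changed line installs `huggingface_hub` from the `feature_extraction_working` revision of its repository instead of from a released version.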

3 changes: 2 additions & 1 deletion src/doc_builder/__init__.py
@@ -19,8 +19,9 @@

__version__ = "0.6.0.dev0"

-from .autodoc import autodoc
+from .autodoc import autodoc_svelte
from .build_doc import build_doc
+from .build_embeddings import build_embeddings, clean_meilisearch
from .convert_rst_to_mdx import convert_rst_docstring_to_mdx, convert_rst_to_mdx
from .style_doc import style_doc_files
from .utils import update_versions_file