Build embeddings #506

Open. Wants to merge 28 commits into base: main
28 commits
3e218b9
Create docs embeddings
mishig25 Jun 7, 2024
0f6e41f
fix declaration issue
mishig25 Jun 13, 2024
40a7465
rename to self.anchor
mishig25 Jun 13, 2024
d57df44
fix typo
mishig25 Jun 13, 2024
d97d20a
compile regexes
mishig25 Jun 13, 2024
f70ac3b
raise error on add_child when needed
mishig25 Jun 13, 2024
ed754a2
custom ChunkingError
mishig25 Jun 13, 2024
31b26f1
rm get_chunks chunk_len_chars possible ambiguity
mishig25 Jun 13, 2024
58caef8
use hf client for inference embeddings
mishig25 Jun 13, 2024
051a6ac
fixes
mishig25 Jun 18, 2024
cd169a4
use meilisearch for vector db
mishig25 Jun 18, 2024
fa3fa2f
Add build emebddings workflow
mishig25 Jun 21, 2024
fd12450
use cli args for secrets
mishig25 Jun 21, 2024
928ef16
MEILISEARCH_PAYLOAD_MAX_MB = 95
mishig25 Jun 24, 2024
7bf6fff
use github secrets
mishig25 Jun 24, 2024
600f12d
install huggingface_hub from source
mishig25 Jun 24, 2024
6abc085
add meilisearch swap_indexes
mishig25 Jun 25, 2024
6522c16
meilisearch cleanup job
mishig25 Jun 25, 2024
25a9a59
add accelerate & huggingface_hub
mishig25 Jun 25, 2024
0a863af
add transformers
mishig25 Jun 25, 2024
483cb8b
better chunking
mishig25 Jun 26, 2024
99cfd53
meilisearch `"searchableAttributes": []`
mishig25 Jun 27, 2024
24c14b4
try bigger chunk
mishig25 Jun 27, 2024
ed8cee7
change embed model to "hf.oc/BAAI/bge-base-en-v1-5"
mishig25 Jun 28, 2024
a105d1f
better meilisearch handling
mishig25 Jul 2, 2024
181d6e9
better source page metadata
mishig25 Jul 2, 2024
7cc64e3
typo
mishig25 Jul 2, 2024
b0f055c
index huggingface/hub-docs
mishig25 Jul 4, 2024
2 changes: 1 addition & 1 deletion .github/workflows/accelerate_doc.yml
@@ -1,6 +1,6 @@
name: Accelerate doc build

- on: [pull_request]
+ # on: [pull_request]

jobs:
  integration_doc_build:
101 changes: 101 additions & 0 deletions .github/workflows/build_embeddings.yml
@@ -0,0 +1,101 @@
name: Daily Build Embeddings

env:
  DIFFUSERS_SLOW_IMPORT: yes

on:
  push:
  schedule:
    - cron: "5 7 * * *" # every day at 07:05
  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  matrix-job:
    runs-on: ubuntu-latest
    container: huggingface/transformers-doc-builder
    strategy:
      max-parallel: 1 # run the matrix jobs sequentially
      matrix:
        include:
          - repo_id: huggingface/diffusers
            doc_folder: docs/source/en
          - repo_id: huggingface/accelerate
            doc_folder: docs/source
          - repo_id: huggingface/huggingface_hub
            doc_folder: docs/source/en
          - repo_id: huggingface/transformers
            doc_folder: docs/source/en
          - repo_id: huggingface/hub-docs
            doc_folder: docs/hub
            package_name: hub
            is_not_python_module: true
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true
    timeout-minutes: 360 # Set timeout to 6 hours
    steps:
      - name: Setup REPO_NAME
        shell: bash
        run: |
          current_path=$(pwd)
          repo_id="${{ matrix.repo_id }}"
          repo_name="${repo_id#*/}"
          echo "REPO_NAME=${repo_name}" >> $GITHUB_ENV

      - name: Checkout repository
        uses: actions/checkout@v2
        with:
          repository: ${{ matrix.repo_id }}
          path: ${{ github.workspace }}/${{ env.REPO_NAME }}

      - name: Install libgl1
        run: apt-get install -y libgl1

      - name: Setup environment
        shell: bash
        run: |
          if [[ "${{ matrix.is_not_python_module }}" != "true" ]]; then
            current_path=$(pwd)
            cd ${{ env.REPO_NAME }}
            pip install .[dev]
            cd $current_path
          fi

          rm -rf doc-builder
          rm -rf .git
          git clone https://github.com/huggingface/doc-builder.git
          cd doc-builder
          git fetch
          git checkout build_embeddings
          pip install .

      - name: Build embeddings
        shell: bash
        run: |
          echo Building docs for ${{ matrix.package_name || env.REPO_NAME }}
          FLAGS=""
          if [[ "${{ matrix.is_not_python_module }}" == "true" ]]; then
            FLAGS="--not_python_module"
          fi
          doc-builder embeddings ${{ matrix.package_name || env.REPO_NAME }} ${{ env.REPO_NAME }}/${{ matrix.doc_folder }} --hf_ie_name docs-embed-bge-base-en-v1-5 --hf_ie_namespace huggingface --hf_ie_token ${{ secrets.HF_IE_TOKEN }} --meilisearch_key ${{ secrets.MEILISEARCH_KEY }} $FLAGS


  cleanup-job:
    needs: matrix-job
    runs-on: ubuntu-latest
    if: always() # This ensures that the cleanup job runs regardless of the result of matrix-job
    steps:
      - name: Checkout doc-builder
        uses: actions/checkout@v2

      - name: Install doc-builder
        run: pip install .[dev]

      - name: Success Cleanup
        if: success() # Runs if all matrix jobs succeeded
        run: doc-builder meilisearch-clean --meilisearch_key ${{ secrets.MEILISEARCH_KEY }} --swap

      - name: Failure Cleanup
        if: failure() # Runs if any matrix job failed
        run: doc-builder meilisearch-clean --meilisearch_key ${{ secrets.MEILISEARCH_KEY }}
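
Reviewer note: the `Setup REPO_NAME` step in the workflow above derives the checkout directory from `matrix.repo_id` with the shell expansion `${repo_id#*/}`, which removes the shortest prefix ending in `/` (the owner part). The sketch below isolates that expansion so it can be tried outside of Actions. The function name `strip_owner` is hypothetical and is not part of the workflow or of doc-builder.

```shell
# Hypothetical helper: mirrors the workflow's `${repo_id#*/}` expansion.
# "#*/" deletes the shortest leading match of "*/", i.e. the "owner/" prefix.
strip_owner() {
  printf '%s\n' "${1#*/}"
}

# Example input taken from the workflow's matrix entries.
repo_name=$(strip_owner "huggingface/hub-docs")
echo "REPO_NAME=${repo_name}"
```

Since only the shortest prefix is removed, an id with no `/` passes through unchanged, and anything after the first `/` is kept as-is.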
2 changes: 1 addition & 1 deletion setup.py
@@ -3,7 +3,7 @@

from setuptools import find_packages, setup

- install_requires = ["black", "GitPython", "tqdm", "pyyaml", "packaging", "nbformat", "huggingface_hub", "pillow"]
+ install_requires = ["black", "GitPython", "tqdm", "pyyaml", "packaging", "nbformat", "huggingface_hub @ git+https://github.com/huggingface/huggingface_hub.git", "pillow", "meilisearch"]

extras = {}

3 changes: 2 additions & 1 deletion src/doc_builder/__init__.py
@@ -19,8 +19,9 @@

__version__ = "0.6.0.dev0"

from .autodoc import autodoc
from .autodoc import autodoc_svelte
from .build_doc import build_doc
from .build_embeddings import build_embeddings, clean_meilisearch
from .convert_rst_to_mdx import convert_rst_docstring_to_mdx, convert_rst_to_mdx
from .style_doc import style_doc_files
from .utils import update_versions_file