51 changes: 51 additions & 0 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
name: Deploy Documentation

on:
  push:
    branches:
      - main
    paths:
      - 'docs/**'
      - 'mkdocs.yml'
      - '.github/workflows/docs.yml'
      - 'requirements-docs.txt'
  workflow_dispatch:

permissions:
  contents: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements-docs.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements-docs.txt

      - name: Configure Git
        run: |
          git config --global user.name "github-actions[bot]"
          git config --global user.email "github-actions[bot]@users.noreply.github.com"

      - name: Deploy to GitHub Pages
        run: |
          mkdocs gh-deploy --force --clean --verbose
70 changes: 70 additions & 0 deletions docs/README.md
@@ -0,0 +1,70 @@
# Vectara Ingest Documentation

This directory contains the documentation for vectara-ingest, built with MkDocs and the Material theme.

## Building Locally

To build and preview the documentation locally:

```bash
# Install documentation dependencies
pip install -r requirements-docs.txt

# Serve documentation locally
mkdocs serve
```

Then open http://127.0.0.1:8000 in your browser.

## Building Static Site

```bash
mkdocs build
```

The static site will be generated in the `site/` directory.

## Deployment

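Documentation is automatically deployed to GitHub Pages when changes to `docs/**`, `mkdocs.yml`, `requirements-docs.txt`, or the workflow file itself are pushed to the `main` branch. The workflow can also be started manually.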

The deployment is handled by `.github/workflows/docs.yml`.

## Documentation Structure

```
docs/
├── index.md # Home page
├── installation.md # Installation guide
├── getting-started.md # Quick start tutorial
├── configuration.md # Configuration reference
├── secrets-management.md # Secrets and API keys
├── crawlers/ # Crawler documentation
│ ├── index.md # Crawlers overview
│ ├── website.md # Website crawler
│ ├── rss.md # RSS crawler
│ └── ... # Other crawlers
├── features/ # Feature documentation
│ ├── document-processing.md
│ ├── table-extraction.md
│ └── ...
├── deployment/ # Deployment guides
│ ├── docker.md
│ ├── render.md
│ └── ...
├── advanced/ # Advanced topics
│ ├── custom-crawler.md
│ ├── saml-auth.md
│ └── ...
└── contributing.md # Contributing guide
```

## Contributing to Documentation

1. Edit markdown files in the `docs/` directory
2. Preview changes with `mkdocs serve`
3. Commit and push to trigger automatic deployment

## Navigation

Navigation is configured in `mkdocs.yml` under the `nav:` section.
9 changes: 9 additions & 0 deletions docs/advanced/api-reference.md
@@ -0,0 +1,9 @@
# Documentation Page
Review comment (Collaborator): What is this for? Why do we need an API reference page in the docs?


*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)
9 changes: 9 additions & 0 deletions docs/advanced/chunking-strategies.md
@@ -0,0 +1,9 @@
# Documentation Page
Review comment (Collaborator): Let's please remove all pages that are under construction, or add real content to them. Chunking can actually be useful: you can document using chunking directly with the platform, or using docling chunking or unstructured chunking.


*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)
9 changes: 9 additions & 0 deletions docs/advanced/cloud-vm.md
@@ -0,0 +1,9 @@
# Documentation Page
Review comment (Collaborator): Remove if not used.


*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)
9 changes: 9 additions & 0 deletions docs/advanced/contextual-chunking.md
@@ -0,0 +1,9 @@
# Documentation Page
Review comment (Collaborator): Why is this a separate page? Should be part of "chunking", no?
(And as before, let's add content to the chunking page.)


*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)
9 changes: 9 additions & 0 deletions docs/advanced/custom-certificates.md
@@ -0,0 +1,9 @@
# Documentation Page
Review comment (Collaborator): Please add proper content here. This one should be documented IMO.


*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)
142 changes: 142 additions & 0 deletions docs/advanced/custom-crawler.md
@@ -0,0 +1,142 @@
# Custom Crawler Setup Guide

This guide explains how to use custom or private crawler files with vectara-ingest without committing them to the repository.

## Overview

The `crawler_file` configuration option allows you to specify a custom crawler file that will be automatically copied to the `crawlers/` directory before the Docker image is built. This is useful for:

- Private crawlers that should not be committed to version control
- Organization-specific crawlers
- Testing new crawlers without modifying the repository

## Setup Instructions

### Step 1: Create Your Custom Crawler

Create your custom crawler Python file anywhere on your system. The crawler should follow the standard vectara-ingest crawler structure.

Example: `/home/myuser/my_custom_crawler.py`

```python
from core.crawler import Crawler

class MyCustomCrawler(Crawler):
    def __init__(self, cfg, endpoint, corpus_key, api_key):
        super().__init__(cfg, endpoint, corpus_key, api_key)
        # Your initialization code

    def crawl(self):
        # Your crawling logic
        pass
```

### Step 2: Add crawler_file to Your Configuration

In your configuration YAML file, add the `crawler_file` parameter under the `vectara` section:

```yaml
vectara:
  corpus_key: my_corpus
  reindex: true
  create_corpus: false

  # Path to your custom crawler file
  crawler_file: /home/myuser/my_custom_crawler.py

crawling:
  # The crawler_type should match your crawler's naming convention
  # For my_custom_crawler.py, use: my_custom
  crawler_type: my_custom

my_custom_crawler:
  # Your crawler-specific configuration
  # ...
```

### Step 3: Run the Ingest Script

Run the ingest script as usual:

```bash
sh run.sh config/my-config.yaml default
```

The `run.sh` script will:
1. Read the `crawler_file` path from your configuration
2. Verify the file exists
3. Copy it to the `crawlers/` directory
4. Build the Docker image with your custom crawler included
5. Run the crawler
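
The verify-and-copy steps (2 and 3) can be sketched in Python for readers who want to see the logic spelled out. This is only an illustration: `run.sh` is a shell script, and the `stage_custom_crawler` helper below is a hypothetical name that does not exist in vectara-ingest. The error text is taken from the Error Handling section of this page.

```python
import shutil
from pathlib import Path


def stage_custom_crawler(crawler_file: str, crawlers_dir: str = "crawlers") -> Path:
    """Copy a custom crawler file into the crawlers directory.

    Hypothetical helper that mirrors steps 2 and 3; not part of vectara-ingest.
    """
    source = Path(crawler_file).expanduser()

    # Step 2: verify the file exists before touching anything else.
    if not source.is_file():
        raise FileNotFoundError(
            f"Error: Custom crawler file not found at '{crawler_file}'"
        )

    # Step 3: copy it next to the built-in crawlers, keeping its file name.
    target_dir = Path(crawlers_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir / source.name))
```

Calling `stage_custom_crawler("/home/myuser/my_custom_crawler.py")` from the repository root would place a copy at `crawlers/my_custom_crawler.py`, assuming the source file exists.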

## Configuration Details

### crawler_file Parameter

- **Location**: Under the `vectara` section in your config YAML
- **Type**: String (absolute or relative path to a Python file)
- **Required**: No (only needed if using a custom crawler)
- **Example**: `crawler_file: /path/to/my_crawler.py`

### Naming Convention

The `crawler_type` value is tied to the crawler's file name and class name. For a file `my_custom_crawler.py` that defines the class `MyCustomCrawler`:

- Set `crawler_type: my_custom`
- Name the crawler's configuration section `my_custom_crawler`
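
As a rough sketch of this convention, the helper below derives the expected file name, class name, and configuration section from a `crawler_type` value. The `names_for` function is hypothetical, and the class-name rule (capitalize each underscore-separated part, then append `Crawler`) is inferred from the single example on this page, so the actual loader in vectara-ingest may behave differently.

```python
def names_for(crawler_type: str) -> dict:
    """Derive the names implied by a crawler_type (illustrative only)."""
    # "my_custom" -> "MyCustom"; assumed from the MyCustomCrawler example.
    class_prefix = "".join(part.capitalize() for part in crawler_type.split("_"))
    return {
        "file_name": f"{crawler_type}_crawler.py",
        "class_name": f"{class_prefix}Crawler",
        "config_section": f"{crawler_type}_crawler",
    }


if __name__ == "__main__":
    print(names_for("my_custom"))
```

A quick check like this can help spot a mismatch between `crawler_type` and the file or section name before building the Docker image.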

## Error Handling

If the custom crawler file is not found, the script will exit with an error:

```
Error: Custom crawler file not found at '/path/to/crawler.py'
```

Make sure:
- The file path is correct
- The file exists and is readable
- You have proper permissions to access the file

## Git and Version Control

Custom crawler files copied to the `crawlers/` directory are automatically excluded from git commits through patterns in `.gitignore`:

- `crawlers/*_custom_crawler.py`
- `crawlers/custom_*.py`

To ensure your custom crawler is not accidentally committed:
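1. Name your crawler file so that it matches one of these patterns: a `custom_` prefix (for example `custom_feed.py`) or a `_custom_crawler.py` suffix (for example `my_custom_crawler.py`)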
2. Or keep it outside the repository and only reference it via `crawler_file`
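
To check whether a given file name would be covered by the two patterns listed above, a small approximation with Python's `fnmatch` module may help. It compares only the base name against the patterns, which is a simplification of real gitignore matching; `git check-ignore` remains the authoritative check. The `is_ignored_crawler` helper is a hypothetical name used for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File-name parts of the two .gitignore patterns listed above.
IGNORE_PATTERNS = ("*_custom_crawler.py", "custom_*.py")


def is_ignored_crawler(path: str) -> bool:
    """Return True if the file name matches one of the ignore patterns.

    Approximation only: real gitignore rules have more cases than fnmatch.
    """
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    for candidate in ("crawlers/my_custom_crawler.py",
                      "crawlers/custom_feed.py",
                      "crawlers/proprietary_crawler.py"):
        print(candidate, is_ignored_crawler(candidate))
```

Note that a name such as `proprietary_crawler.py` matches neither pattern, so a copy of it in `crawlers/` would still show up in `git status` unless it is added to `.gitignore` separately.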

## Example Configuration

Complete example for a custom crawler:

```yaml
vectara:
  corpus_key: proprietary_data
  reindex: true
  create_corpus: false
  crawler_file: /home/user/proprietary_crawler.py

crawling:
  crawler_type: proprietary

proprietary_crawler:
  # Your custom crawler configuration
  api_endpoint: https://internal.company.com/api
  batch_size: 100
```

## Troubleshooting

**Issue**: Script exits with "Custom crawler file not found"
- **Solution**: Verify the file path is correct and the file exists

**Issue**: Crawler not being recognized
- **Solution**: Ensure `crawler_type` matches your crawler class naming convention

**Issue**: Custom crawler appears in git status
- **Solution**: Rename the file so that it matches one of the `.gitignore` patterns above (a `custom_` prefix or a `_custom_crawler.py` suffix), or add it to `.gitignore`
9 changes: 9 additions & 0 deletions docs/advanced/docker.md
@@ -0,0 +1,9 @@
# Documentation Page

*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)
9 changes: 9 additions & 0 deletions docs/advanced/document-processing.md
@@ -0,0 +1,9 @@
# Documentation Page
Review comment (Collaborator): What is this one supposed to document? Either remove it if it's covered in some other place, or add proper content pls.


*This page is under construction. Check back soon for detailed documentation.*

## Resources

- [Home](../index.md)
- [Getting Started](../getting-started.md)
- [Configuration Reference](../configuration.md)