From 261691eb859ef3434f68831a9e0072c48d855fd9 Mon Sep 17 00:00:00 2001 From: Uche Ogbuji Date: Thu, 30 Apr 2026 11:55:14 -0600 Subject: [PATCH 1/4] Align to Oori's coding-agent-control v0.1.0 --- .claude/skills/python/SKILL.md | 86 ++++++++++++++++++++++++++++++++++ AGENTS.md | 20 ++++++++ CLAUDE.md | 24 ++++++++++ agent-control.toml | 15 ++++++ 4 files changed, 145 insertions(+) create mode 100644 .claude/skills/python/SKILL.md create mode 100644 AGENTS.md create mode 100644 CLAUDE.md create mode 100644 agent-control.toml diff --git a/.claude/skills/python/SKILL.md b/.claude/skills/python/SKILL.md new file mode 100644 index 0000000..907efa0 --- /dev/null +++ b/.claude/skills/python/SKILL.md @@ -0,0 +1,86 @@ +--- +name: python-backend +description: Python 3.12+ backend and library development — packaging, testing, pyproject.toml, uv, hatchling, asyncio, and repo hygiene. Use when working on Python services, CLIs, or libraries. +--- + +# Python Backend Development + +## Purpose +Follow this skill for Python 3.12+ backend and library work: packaging, project structure, testing, and repository hygiene. + +## Default rules +- Single quotes throughout, including triple-quoted strings. +- Absolute imports; 120-char lines; moderate comments. +- `uv` for installs; `uv pip install -U .` for real package validation. +- Hatchling build system; no `setuptools`, no `setup.py`. +- No editable installs for libraries. +- `asyncio` for I/O-bound work; multiprocessing for CPU-bound. +- `fire` for CLI args; `structlog` for logging; `httpx` for HTTP; `pytest` for tests. +- `tenacity` for retries; `rich` for terminal output. +- No `langchain` unless explicitly requested. +- Dataclasses over Pydantic; keep abstractions proportionate to the task. + +## Workflow +1. Read the repo's `CLAUDE.md` / `AGENTS.md` first. +2. Check `pyproject.toml` — follow its build and test commands. +3. Prefer small, deterministic changes. +4. Validate with `pytest` or a targeted run. +5. 
Report any assumptions or unresolved ambiguity. + +## Packaging +- Library code lives under `pylib/`. +- Use `[tool.hatch.build.targets.wheel]` with `only-include = ['pylib']`. +- Map `pylib` to the package name in `[tool.hatch.build.sources]`. +- Export CLIs through `[project.scripts]` with a `main()` entry point in each module. + +## If the task is unclear +Ask for the repo type (library vs service), runtime target, and whether strict installability or editable installs are acceptable. + + +## Full conventions + +Additional context for AI tools & coding agents + +- Python 3.12+ code, unless otherwise specified +- Python code uses single outer quotes, including triple single quotes for e.g. docstrings +- prefer absolute imports to relative imports +- Use a decent amount of comments + - not *too* many, just enough that anybody familiar with the code can use them as a reference point. Not meant to teach somebody new every intricacy of the code, just help keep the savvy reader oriented. +- if it saves a line, put a comment after a line rather than above it + - use the standard two spaces before the comment character, eg. `CODE # COMMENT` +- Try to stick to 120 characters per line + - if one of those comments would break this guideline, just put that comment above the line instead, as is standard convention +- If there is a pyproject.toml in place, use it as a reference for builds, installs, etc. The basic packaging and dev preference, including if you have to supply your own pyproject.toml, is as follows: + - Prefer hatchling build system over setuptools, poetry, etc. Avoid setuptools as much as possible. No setup.py. + - Reusable Python code modules are developed in the `pylib` folder, and installed using e.g. `uv pip install -U .`, which includes proper mapping to Python library package namespace via `tool.hatch.build.sources`. The `__init__.py` and other modules in the top-level package go directly in `pylib`, though submodules can use subdirectories, e.g. 
`pylib/a/b` becomes `installed_library_name.a.b`. Ultimately this will mean the installed package is importable as `from installed_library_name.etc import …` + - Use `[tool.hatch.build.targets.wheel]` with `only-include = ["pylib"]` to ensure the pylib directory structure gets included properly in the wheel, avoiding the duplication issue that can occur with sources mapping + - Yes this means editable and "dev mode" environments are NOT desirable, nor are shenanigans adding pylib to `sys.path`. Layer-efficient dockerization is an option if that's needed. + - The ethos is to always develop keeping things properly installable. No dev mode shortcuts. Substantive modification to library code requires e.g. `uv pip install -U .` each time. + - Note: This avoidance of editable installs can be relaxed for non-library code, such as demos or main app launch scripts (e.g. webapp back ends) + - If it's a CLI provided as part of a library, though, it should still use proper installation via `[project.scripts]` entry points (e.g., `ooriscout = 'ooriscout.cli.scout:main'`), which creates console scripts that work correctly after `uv pip install -U .`. The CLI module lives in `pylib/cli/` and exposes a `main()` function that uses fire to handle command-line arguments. +- **Debugging package issues**: When modules aren't importing correctly after installation, check: + - That you are in the correct virtualenv (you may have to ask the developer) + - Package structure in site-packages (e.g., `ls -la /path/to/site-packages/package_name/`) +- Use uv, but pay attention to the above + - Again always use `uv pip install -U .` for full installation, never editable installs (`pip install -e`). This ensures proper testing of the actual distribution. +- Use async (e.g. asyncio) wherever it makes sense. Avoid multithreading, though multiprocessing is OK. Multiprocess for CPU-bound concurrency, and asyncio for I/O-bound, cooperative work, etc. +- Be pythonic. Avoid e.g.
complex abstract class hierarchies for the sake of them, though classes are also fine in many usage patterns. We love dictionaries, dynamic dispatch, etc. + - I don't consider Pydantic very Pythonic, so we can tolerate it if need be (e.g. we're using a toolkit that strictly works with Pydantic), but otherwise, simple dataclasses are better. +- Type hints are OK in moderation, but avoid absolutely littering the code with them. + - No excess imports & symbols, e.g. use `type | None` rather than `Optional[type]` +- Use iterator patterns as much as practical. Also functional programming approaches, including partials (currying) and decorators +- Preferred tools: + - Logging: structlog + - Retries on failure: tenacity + - CLI argument processing: fire—avoid argparse except for truly trivial usage + - CLI formatting: rich + - HTTP client: httpx (async) + - HTML/XML parsing: selectolax (though for now we're using html5-modern as the base implementation for our html5 features) + - Browser-like Web crawling/scraping: Python playwright (with playwright_stealth if needed) + - pytest, as well as pytest-mock, pytest-httpx, pytest-asyncio + - rapidfuzz for fuzzy text matching +- AVOID the following unless explicitly requested or otherwise unavoidable: + - langchain + +- Once again PREFER SINGLE QUOTES + diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..0e75701 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,20 @@ + +# WordLoom + + +- Source of truth is the code and git history, not assistant memory. +- Read existing code before modifying; prefer targeted, minimal changes. +- Validate changes with tests before reporting completion. +- Ask before making destructive or hard-to-reverse changes. + + +## Project type: python + + +For Python library/backend work, load `.claude/skills/python/SKILL.md` — covers conventions, packaging, testing, and tooling.
+ + + +## Local context + + diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..c103040 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,24 @@ + +# WordLoom — Agent Instructions + + +- Source of truth is the code and git history, not assistant memory. +- Load only the skills and snippets needed for the task at hand. +- Prefer small, deterministic changes; validate with tests before reporting done. +- Ask before making destructive or hard-to-reverse changes. + + +## Project type: python + + +For Python library/backend work, load `.claude/skills/python/SKILL.md` — covers conventions, packaging, testing, and tooling. + + + +## Skills + +Skills are in `.claude/skills/`. Load a skill's `SKILL.md` when the task matches its description. + +## Local overrides + + diff --git a/agent-control.toml b/agent-control.toml new file mode 100644 index 0000000..5235868 --- /dev/null +++ b/agent-control.toml @@ -0,0 +1,15 @@ +# Agent control config for WordLoom. Managed by oori_coding_control. + +[project] +name = 'WordLoom' +kind = 'python' +control = 'https://github.com/OoriData/coding-agent-control' + +[paths] +claude = 'CLAUDE.md' +agents = 'AGENTS.md' +config = 'agent-control.toml' + +[managed] +# Skills installed in this repo (updated by oori-seed-repo) +skills = ['python'] From 70bdabbf42214ddb0b90979ca32394b1896a06ec Mon Sep 17 00:00:00 2001 From: Uche Ogbuji Date: Thu, 30 Apr 2026 12:26:42 -0600 Subject: [PATCH 2/4] WordLoom implementation extensions: file_bindings --- CONTRIBUTING.md | 27 ++-- implementation.md | 194 +++++++++++++++++++++++++++ pylib/ext/__init__.py | 2 + pylib/ext/file_includes.py | 137 +++++++++++++++++++ pylib/wordloom.py | 114 ++++++++++++---- test/test_file_inclusion.py | 256 ++++++++++++++++++++++++++++++++++++ 6 files changed, 690 insertions(+), 40 deletions(-) create mode 100644 implementation.md create mode 100644 pylib/ext/__init__.py create mode 100644 pylib/ext/file_includes.py create mode 100644 test/test_file_inclusion.py diff --git 
a/CONTRIBUTING.md b/CONTRIBUTING.md index 5198d4e..45e17f0 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -94,26 +94,21 @@ WordLoom/ ├── pylib/ # Source code (becomes 'wordloom' package) │ ├── __init__.py │ ├── __about__.py # Version info -│ └── wordloom.py # Main implementation +│ ├── wordloom.py # Core implementation +│ └── ext/ # Opt-in extensions (loaded only when features= requests them) +│ ├── __init__.py +│ └── file_includes.py # file-inclusion extension ├── resources/ # Bundled resources │ └── wordloom/ │ └── sample.toml ├── test/ # Tests │ ├── test_basics.py -│ ├── test_i18n_integration.py -│ └── test_openai_integration.py +│ ├── test_i18n.py +│ ├── test_openai.py +│ └── test_file_inclusion.py ├── pyproject.toml # Project config +├── implementation.md # Library internals and extension docs └── README.md - -When installed, becomes: -site-packages/ -└── wordloom/ - ├── __init__.py - ├── __about__.py - ├── wordloom.py - └── resources/ - └── wordloom/ - └── sample.toml ``` When installed, becomes: @@ -124,6 +119,9 @@ site-packages/ ├── __init__.py ├── __about__.py ├── wordloom.py + ├── ext/ + │ ├── __init__.py + │ └── file_includes.py └── resources/ └── wordloom/ └── sample.toml @@ -134,7 +132,8 @@ site-packages/ - `pylib/__about__.py` - Version number (update for releases) - `pyproject.toml` - Dependencies, metadata, build config - `resources/wordloom/sample.toml` - Sample file used by tests -- `README.md` - Main documentation +- `README.md` - User-facing documentation +- `implementation.md` - Library internals, `load()` API reference, extension docs - `wordloom_spec.md` - Format specification (CC BY 4.0) # Publishing a Release diff --git a/implementation.md b/implementation.md new file mode 100644 index 0000000..033bbf7 --- /dev/null +++ b/implementation.md @@ -0,0 +1,194 @@ +# WordLoom — Python Implementation + +This document covers the internals of the Python library, including the core +data model, the `load()` API, and available opt-in extensions. 
+ +--- + +## Core data model: `language_item` + +`language_item` is a `str` subclass. Every item parsed from a loom file is +one of these. Converting to `str` gives the default-language text, which +means items drop into any `str.format()` call naturally. + +Key attributes: + +| Attribute | Type | Description | +|---|---|---| +| `lang` | `str` | Default language code (BCP 47) | +| `altlang` | `dict[str, str]` | Alternate-language texts keyed by language code | +| `meta` | `dict` | Raw metadata from the TOML table (non-reserved keys) | +| `markers` | `list \| None` | Template variable names declared with `_m` | +| `file_bindings` | `dict[str, str]` | Resolved file/dir/glob inclusions (empty when the feature is not active) | + +### `in_lang(lang)` + +Returns the alternate-language text for `lang`, or `None` if not present. + +### `render(**kwargs)` + +Formats the template text by merging `file_bindings` with any runtime +`kwargs` (runtime values win on collision), then calling `str.format`. + +```python +prompt.render(extra='value') +# equivalent to: str(prompt).format(**{**prompt.file_bindings, **kwargs}) +``` + +When `file_bindings` is empty (feature disabled), this is a transparent +wrapper around `str.format`. + +### `clone(**overrides)` + +Returns a new `language_item` with selective attribute overrides. +`file_bindings` is preserved unless explicitly replaced. + +--- + +## `load()` — reading a loom file + +```python +wordloom.load(fp_or_str, lang='en', preserve_key=False, features=None, base_dir=None) +``` + +Returns a `dict` mapping each TOML key (and its default-language text) to a +`language_item`. Only items whose `lang` (or the file-level default `lang`) +matches the requested `lang` are included. 
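The dual indexing described above can be sketched on an already-parsed table. This is a simplified illustration and not the library's code: `index_items` and the sample dict are hypothetical, plain strings stand in for `language_item`, and it assumes the default-language text sits under the `_` key as in the TOML examples in this document.

```python
def index_items(raw, lang='en'):
    '''Index item text by TOML key and by the text itself (simplified sketch).'''
    default_lang = raw.get('lang')  # File-level default language, if declared
    items = {}
    for key, table in raw.items():
        if not isinstance(table, dict):
            continue  # Skip file-level scalars such as lang
        if table.get('lang', default_lang) != lang:
            continue  # Item is in a different language than the one requested
        text = table['_']
        items[key] = text  # Lookup by TOML key
        items[text] = text  # Lookup by the literal default-language text
    return items


# Hypothetical stand-in for what a TOML parser returns for a small loom file
parsed = {
    'lang': 'en',
    'greeting': {'_': 'Hello {name}', '_m': ['name']},
    'salut': {'_': 'Bonjour {name}', 'lang': 'fr'},
}
loom = index_items(parsed)
print(sorted(loom))
```

Because every item is stored twice, two items with the same text collide on the text key; the `load()` in this patch emits a warning in that case and the later item overwrites the earlier one.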
+ +### Input forms + +| Type passed | Behaviour | +|---|---| +| `pathlib.Path` | Opened as a file; parent directory used as loom base | +| `str` that resolves to an existing file | Opened as a file; parent directory used as loom base | +| `str` with no matching file | Treated as raw TOML content | +| `bytes` | Treated as raw TOML content | +| File-like object from `open()` | Read directly; `.name` used to detect loom base | + +### Parameters + +`lang` — language to select (default: `'en'`). + +`preserve_key` — if `True`, the TOML key name is stored in `meta['_key']`. + +`features` — a `set` or `dict` enabling optional extensions. A set entry or +a truthy dict value activates that feature. Example: + +```python +loom = wordloom.load(Path('prompts.toml'), features={'file-inclusion'}) +# or equivalently: +loom = wordloom.load(Path('prompts.toml'), features={'file-inclusion': True}) +``` + +`base_dir` — override the auto-detected loom base directory. Useful when +loading from a `bytes` or in-memory string with extensions that need path +resolution. + +--- + +## Extension: `file-inclusion` + +**Module**: `wordloom.ext.file_includes` +**Feature key**: `'file-inclusion'` + +This extension interprets metadata values that carry a scheme prefix as +references to external content, and resolves them at load time. + +**Warning:** The security model prevents path traversal, but it cannot protect against malicious *content* inside included files. If file contents are user-influenced or come from untrusted sources, they could inject instructions into your prompts. Only include files you trust, or inspect/strip their content before loading. + + +### TOML syntax + +```toml +[my_prompt] +_ = """ +Analyse the following documents: + +{corpus} +""" +_m = ["corpus"] +corpus = "dir:documents" +``` + +Any metadata key (non-`_`, non-`lang`) whose string value begins with one of +the three schemes below is treated as a file reference. All other metadata +values pass through unmodified. 
+ +| Scheme | Example value | Resolves to | +|---|---|---| +| `file:` | `file:context/background.txt` | UTF-8 content of that file | +| `dir:` | `dir:analysis` | All UTF-8 files under that directory, concatenated with `=== relative/path ===` headers | +| `glob:` | `glob:notes/**/*.md` | All UTF-8 files matching the glob, same concatenation format | + +Paths are always **relative to the directory containing the loom TOML file**. + +### Accessing resolved content + +```python +from pathlib import Path +import wordloom + +loom = wordloom.load(Path('prompts.toml'), features={'file-inclusion'}) + +prompt = loom['my_prompt'] + +# Inspect what was resolved +print(prompt.file_bindings) # {'corpus': '=== doc1.txt ===\n...'} + +# Format the template — file_bindings are applied automatically +result = prompt.render() + +# Supply additional runtime values; they override file_bindings on collision +result = prompt.render(extra_context='additional info') +``` + +The raw metadata values (`"dir:documents"` etc.) remain in `prompt.meta` +unchanged — `file_bindings` holds only the resolved content. + +### Security model + +The extension enforces that all resolved paths stay within the loom base +directory: + +- Absolute paths (`file:/etc/passwd`) → `ValueError` +- Traversal escapes (`file:../../secret`) → `ValueError` +- `glob:` patterns with `..` segments → `ValueError` +- Missing `file:` target → `FileNotFoundError` +- Missing `dir:` target → `NotADirectoryError` + +For `dir:` and `glob:` scans: +- Files larger than 2 MB are silently skipped +- Non-UTF-8 files are silently skipped +- Hidden paths (any component starting with `.`) are silently skipped + +### Requiring a base directory + +The extension needs to know where the loom file lives. It is auto-detected +when you pass a `Path`, a path string, or an `open()` handle. 
When loading +from raw bytes or an in-memory string, set `base_dir` explicitly: + +```python +loom = wordloom.load(toml_bytes, features={'file-inclusion'}, base_dir='/path/to/loom-dir') +``` + +Without a base directory, the feature raises `ValueError` at load time. + +--- + +## Development workflow + +```bash +# Install (required after any pylib/ change) +uv pip install -U . + +# Run tests +pytest test/ -v + +# Run only the file-inclusion tests +pytest test/test_file_inclusion.py -v + +# Lint +ruff check . +``` + +See [CONTRIBUTING.md](CONTRIBUTING.md) for release and packaging details. diff --git a/pylib/ext/__init__.py b/pylib/ext/__init__.py new file mode 100644 index 0000000..b95be25 --- /dev/null +++ b/pylib/ext/__init__.py @@ -0,0 +1,2 @@ +# SPDX-FileCopyrightText: 2023-present Oori Data +# SPDX-License-Identifier: Apache-2.0 diff --git a/pylib/ext/file_includes.py b/pylib/ext/file_includes.py new file mode 100644 index 0000000..6b0c8e6 --- /dev/null +++ b/pylib/ext/file_includes.py @@ -0,0 +1,137 @@ +# SPDX-FileCopyrightText: 2023-present Oori Data +# SPDX-License-Identifier: Apache-2.0 +# wordloom.ext.file_includes + +''' +Word Loom extension: file-inclusion via metadata conventions. + +Metadata values in a Word Loom item (i.e. non-underscore, non-lang keys) that +carry a ``file:``, ``dir:``, or ``glob:`` prefix are resolved to their text +content at load time when this extension is active. + + file: — UTF-8 content of that single file + dir: — all UTF-8 files under the directory, concatenated and + headed with ``=== relative/path ===`` separators + glob: — same concatenation for files matching the glob pattern + relative to the loom directory + +All paths are relative to the directory containing the loom TOML file and must +stay within that directory (directory-traversal attempts raise ValueError). +Files larger than 2 MB are silently skipped when scanning directories/globs; +an explicit ``file:`` reference is not size-checked, and a non-UTF-8 target raises an error.
+ +This extension is opt-in: + + from wordloom import load + loom = load(Path('prompts.toml'), features={'file-inclusion'}) + +Resolved values are exposed as ``language_item.file_bindings`` (a plain dict), +and the ``language_item.render(**kwargs)`` helper merges them with any runtime +kwargs before calling ``str.format``. +''' + +from __future__ import annotations + +from pathlib import Path +from typing import Iterable + +_MAX_BYTES = 2 * 1024 * 1024 # 2 MB — skip silently for dir/glob scans + + +def _under_base(target: Path, base: Path) -> bool: + target = target.resolve() + base = base.resolve() + return target == base or target.is_relative_to(base) + + +def _concat_utf8_files(paths: Iterable[Path], *, loom_base: Path, rel_root: Path) -> str: + '''Concatenate UTF-8 files into a single string with path-headed blocks.''' + loom_base = loom_base.resolve() + rel_root = rel_root.resolve() + unique = sorted({p.resolve() for p in paths}, key=lambda p: p.as_posix().lower()) + chunks: list[str] = [] + for path in unique: + if not path.is_file() or not _under_base(path, loom_base): + continue + try: + rel = path.relative_to(rel_root) + except ValueError: + continue + if any(part.startswith('.') for part in rel.parts): + continue + try: + data = path.read_bytes() + except OSError: + continue + if len(data) > _MAX_BYTES: + continue + try: + text = data.decode('utf-8') + except UnicodeDecodeError: + continue + chunks.append(f'=== {rel.as_posix()} ===\n{text}') + return '\n\n'.join(chunks) + + +def _read_dir(dir_path: Path, loom_base: Path) -> str: + if not dir_path.is_dir(): + raise NotADirectoryError(f'Not a directory: {dir_path}') + paths = [p for p in dir_path.rglob('*') if p.is_file()] + return _concat_utf8_files(paths, loom_base=loom_base, rel_root=dir_path) + + +def _read_glob(pattern: str, loom_base: Path) -> str: + if '..' in Path(pattern).parts: + raise ValueError(f'glob: pattern must not contain ".." 
segments: {pattern!r}') + paths = list(loom_base.glob(pattern)) + return _concat_utf8_files(paths, loom_base=loom_base, rel_root=loom_base) + + +def resolve_file_bindings(table: dict, loom_base: Path) -> dict[str, str]: + ''' + Scan one TOML item table and resolve file:/dir:/glob: metadata values. + + Returns a dict mapping each resolved key to its text content. + Reserved keys (those starting with ``_`` and ``lang``) are skipped, as are + non-string values. + + Raises: + ValueError: absolute path, path escaping loom dir, or bad glob pattern. + FileNotFoundError: explicit ``file:`` target does not exist. + NotADirectoryError: explicit ``dir:`` target is not a directory. + ''' + loom_base = loom_base.resolve() + out: dict[str, str] = {} + for key, raw in table.items(): + if not isinstance(key, str) or key.startswith('_') or key == 'lang': + continue + if not isinstance(raw, str): + continue + + if raw.startswith('file:'): + rel = raw[len('file:'):].strip() + if not rel or rel.startswith('/'): + raise ValueError(f'{key!r}: file: path must be relative to the loom directory: {raw!r}') + target = (loom_base / rel).resolve() + if not _under_base(target, loom_base): + raise ValueError(f'{key!r}: path escapes the loom directory: {raw!r}') + if not target.is_file(): + raise FileNotFoundError(f'{key!r}: not a file: {target}') + out[key] = target.read_text(encoding='utf-8') + + elif raw.startswith('dir:'): + rel = raw[len('dir:'):].strip() + if not rel or rel.startswith('/'): + raise ValueError(f'{key!r}: dir: path must be relative to the loom directory: {raw!r}') + target = (loom_base / rel).resolve() + if not _under_base(target, loom_base): + raise ValueError(f'{key!r}: path escapes the loom directory: {raw!r}') + out[key] = _read_dir(target, loom_base=loom_base) + + elif raw.startswith('glob:'): + pattern = raw[len('glob:'):].strip() + if not pattern or pattern.startswith('/'): + raise ValueError(f'{key!r}: glob: pattern must be non-empty and relative: {raw!r}') + out[key] 
= _read_glob(pattern, loom_base=loom_base) + + return out diff --git a/pylib/wordloom.py b/pylib/wordloom.py index 1c0dc69..fefe771 100644 --- a/pylib/wordloom.py +++ b/pylib/wordloom.py @@ -13,6 +13,7 @@ import io import tomli import warnings +from pathlib import Path class language_item(str): @@ -31,7 +32,7 @@ class language_item(str): 'jambon' ''' - def __new__(cls, value, deflang, altlang=None, meta=None, markers=None): + def __new__(cls, value, deflang, altlang=None, meta=None, markers=None, file_bindings=None): ''' Construct a new text item @@ -40,6 +41,7 @@ def __new__(cls, value, deflang, altlang=None, meta=None, markers=None): altlang - dictionary of text values in alternative languages meta - dictionary of metadata markers - used to specify values that can be set, with the text value is treated as a template + file_bindings - resolved file/dir/glob inclusions (populated by the file-inclusion feature) ''' assert isinstance(value, str) self = super(language_item, cls).__new__(cls, value) @@ -47,6 +49,7 @@ def __new__(cls, value, deflang, altlang=None, meta=None, markers=None): self.meta = meta or {} self.markers = markers or {} self.altlang = altlang or {} + self.file_bindings = file_bindings or {} return self def __repr__(self): @@ -55,7 +58,21 @@ def __repr__(self): def in_lang(self, lang): return self.altlang.get(lang) - def clone(self, value=None, deflang=None, altlang=None, meta=None, markers=None): + def render(self, **kwargs): + ''' + Format this template, merging file_bindings with any runtime kwargs. + + file_bindings values are the base; kwargs override them, so callers can + supply or override individual slots at runtime. 
+ + >>> from wordloom import language_item as LI + >>> t = LI('Hello {name}', deflang='en') + >>> t.render(name='World') + 'Hello World' + ''' + return str(self).format(**{**self.file_bindings, **kwargs}) + + def clone(self, value=None, deflang=None, altlang=None, meta=None, markers=None, file_bindings=None): ''' Clone the text item, with optional overrides @@ -75,7 +92,8 @@ def clone(self, value=None, deflang=None, altlang=None, meta=None, markers=None) altlang = self.altlang if altlang is None else altlang meta = self.meta if meta is None else meta markers = self.markers if markers is None else markers - return language_item(value, deflang, altlang=altlang, meta=meta, markers=markers) + file_bindings = self.file_bindings if file_bindings is None else file_bindings + return language_item(value, deflang, altlang=altlang, meta=meta, markers=markers, file_bindings=file_bindings) # Following 2 lines are deprecated @@ -86,34 +104,77 @@ def clone(self, value=None, deflang=None, altlang=None, meta=None, markers=None) # XXX Defaulting to en leaves a bit too imperialist a flavor, really -def load(fp_or_str, lang='en', preserve_key=False): +def load(fp_or_str, lang='en', preserve_key=False, features=None, base_dir=None): ''' - Read a word loom and return the tables as top-level result mapping - Loads the TOML - - Return a dict of the language items, indexed by the TOML key as well as its default language text - - fp_or_str - file-like object or string containing TOML - lang - select only texts in this language (default: 'en') - preserve_key - if True, the key in the TOML is preserved in each item's metadata + Read a word loom and return the tables as top-level result mapping. 
+ + fp_or_str - Path object or path string → opened as a file (base dir auto-detected); + file-like object → read directly (.name used for base dir if present); + bytes or a TOML content string → parsed in-memory (no base dir) + lang - select only items whose language matches (default: 'en') + preserve_key - if True, store the TOML key in each item's metadata as '_key' + features - set or dict of optional features to enable, e.g. ``{'file-inclusion'}`` + or ``{'file-inclusion': True}`` + base_dir - explicit base directory for resolving relative paths used by extensions; + overrides the auto-detected value from the file path + + Supported features + ------------------ + ``'file-inclusion'`` + Metadata values with a ``file:``, ``dir:``, or ``glob:`` prefix are resolved + to their text contents and exposed as ``language_item.file_bindings``. + Requires a resolvable base directory (pass a Path/path-string, an open() handle, + or set ``base_dir`` explicitly). Example: >>> import wordloom - >>> with open('prompts.toml', mode='rb') as fp: - ... 
loom = wordloom.load(fp) - >>> loom['system_instruction'].meta - {'category': 'system'} - >>> actual_text = loom['translation_prompt'] - >>> str(actual_text) - 'Translate the following text to {target_lang}: {text}' - >>> actual_text.markers - ['target_lang', 'text'] + >>> from pathlib import Path + >>> loom = wordloom.load(Path('prompts.toml'), features={'file-inclusion'}) + >>> prompt = loom['my_prompt'] + >>> formatted = prompt.render(extra_var='value') ''' - # Ensure we have a file-like object - if isinstance(fp_or_str, str): - fp_or_str = io.BytesIO(fp_or_str.encode('utf-8')) + # --- resolve base directory and normalise fp_or_str to a readable object --- + _detected_base: Path | None = None + + if isinstance(fp_or_str, Path): + _detected_base = fp_or_str.parent.resolve() + fp_or_str = fp_or_str.open('rb') + elif isinstance(fp_or_str, str): + candidate = Path(fp_or_str) + if candidate.is_file(): + # string looks like a file path and the file exists — open it + _detected_base = candidate.parent.resolve() + fp_or_str = candidate.open('rb') + else: + fp_or_str = io.BytesIO(fp_or_str.encode('utf-8')) elif isinstance(fp_or_str, bytes): fp_or_str = io.BytesIO(fp_or_str) + elif hasattr(fp_or_str, 'name'): + try: + _detected_base = Path(fp_or_str.name).parent.resolve() + except Exception: + pass + + loom_base: Path | None = Path(base_dir).resolve() if base_dir is not None else _detected_base + + # --- feature flags --- + file_inclusion = False + if features is not None: + if isinstance(features, set): + file_inclusion = 'file-inclusion' in features + else: + file_inclusion = bool(features.get('file-inclusion', False)) + + if file_inclusion and loom_base is None: + raise ValueError( + "The 'file-inclusion' feature requires a resolvable loom base directory. " + 'Pass a Path object, a path string pointing to an existing file, ' + 'an open() file handle, or set base_dir= explicitly.' 
+ ) + + if file_inclusion: + from wordloom.ext.file_includes import resolve_file_bindings # noqa: PLC0415 + # Load TOML loom_raw = tomli.load(fp_or_str) # Select text by language @@ -142,12 +203,13 @@ def load(fp_or_str, lang='en', preserve_key=False): meta = {kk: vv for kk, vv in v.items() if (not kk.startswith('_') and kk not in ('text', 'markers'))} if preserve_key: meta['_key'] = k + fb = resolve_file_bindings(v, loom_base) if file_inclusion else {} if k in texts: warnings.warn(f'Key {k} duplicates an existing item, which will be overwritten') - texts[k] = T(text, lang, altlang=altlang, meta=meta, markers=markers) + texts[k] = T(text, lang, altlang=altlang, meta=meta, markers=markers, file_bindings=fb) # Also index by literal text if text in texts: warnings.warn( f'Item default language text {text[:20]} duplicates an existing item, which will be overwritten') - texts[text] = T(text, lang, altlang=altlang, meta=meta, markers=markers) + texts[text] = T(text, lang, altlang=altlang, meta=meta, markers=markers, file_bindings=fb) return texts diff --git a/test/test_file_inclusion.py b/test/test_file_inclusion.py new file mode 100644 index 0000000..da38454 --- /dev/null +++ b/test/test_file_inclusion.py @@ -0,0 +1,256 @@ +# SPDX-FileCopyrightText: 2023-present Oori Data +# SPDX-License-Identifier: Apache-2.0 +# test/test_file_inclusion.py +''' +Tests for the file-inclusion extension (features={'file-inclusion'}). 
+ +pytest test/test_file_inclusion.py +''' + +import textwrap +from pathlib import Path + +import pytest + +import wordloom + + +# --------------------------------------------------------------------------- +# helpers +# --------------------------------------------------------------------------- + +def _write_loom(tmp_path: Path, toml: str) -> Path: + p = tmp_path / 'prompts.toml' + p.write_text(textwrap.dedent(toml), encoding='utf-8') + return p + + +# --------------------------------------------------------------------------- +# basic file: inclusion +# --------------------------------------------------------------------------- + +def test_file_inclusion_single_file(tmp_path): + (tmp_path / 'context.txt').write_text('Some context text.', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [my_prompt] + _ = "Answer using: {context}" + _m = ["context"] + context = "file:context.txt" + ''') + + loom = wordloom.load(loom_path, features={'file-inclusion'}) + prompt = loom['my_prompt'] + assert prompt.file_bindings == {'context': 'Some context text.'} + assert prompt.render() == 'Answer using: Some context text.' + + +def test_file_inclusion_via_path_object(tmp_path): + (tmp_path / 'note.txt').write_text('Hello from file.', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{note}" + note = "file:note.txt" + ''') + + loom = wordloom.load(Path(loom_path), features={'file-inclusion'}) + assert loom['item'].file_bindings['note'] == 'Hello from file.' 
+ + +def test_file_inclusion_via_open_handle(tmp_path): + (tmp_path / 'data.txt').write_text('data', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{data}" + data = "file:data.txt" + ''') + + with open(loom_path, 'rb') as fp: + loom = wordloom.load(fp, features={'file-inclusion'}) + assert loom['item'].file_bindings['data'] == 'data' + + +def test_no_file_inclusion_without_feature(tmp_path): + (tmp_path / 'context.txt').write_text('Some context.', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [my_prompt] + _ = "Answer using: {context}" + context = "file:context.txt" + ''') + + loom = wordloom.load(loom_path) + assert loom['my_prompt'].file_bindings == {} + # raw metadata value is still accessible + assert loom['my_prompt'].meta['context'] == 'file:context.txt' + + +# --------------------------------------------------------------------------- +# dir: and glob: inclusion +# --------------------------------------------------------------------------- + +def test_dir_inclusion(tmp_path): + docs = tmp_path / 'docs' + docs.mkdir() + (docs / 'a.txt').write_text('file A', encoding='utf-8') + (docs / 'b.txt').write_text('file B', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{docs}" + docs = "dir:docs" + ''') + + loom = wordloom.load(loom_path, features={'file-inclusion'}) + result = loom['item'].file_bindings['docs'] + assert 'file A' in result + assert 'file B' in result + assert '=== a.txt ===' in result + + +def test_glob_inclusion(tmp_path): + (tmp_path / 'one.md').write_text('# One', encoding='utf-8') + (tmp_path / 'two.md').write_text('# Two', encoding='utf-8') + (tmp_path / 'skip.txt').write_text('skip me', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{notes}" + notes = "glob:*.md" + ''') + + loom = wordloom.load(loom_path, features={'file-inclusion'}) + result = loom['item'].file_bindings['notes'] + assert '# One' 
in result + assert '# Two' in result + assert 'skip me' not in result + + +# --------------------------------------------------------------------------- +# render() method +# --------------------------------------------------------------------------- + +def test_render_merges_file_bindings_with_runtime(tmp_path): + (tmp_path / 'base.txt').write_text('base content', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [tmpl] + _ = "{base} and {extra}" + base = "file:base.txt" + ''') + + loom = wordloom.load(loom_path, features={'file-inclusion'}) + result = loom['tmpl'].render(extra='runtime value') + assert result == 'base content and runtime value' + + +def test_render_runtime_overrides_file_binding(tmp_path): + (tmp_path / 'base.txt').write_text('file value', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [tmpl] + _ = "{base}" + base = "file:base.txt" + ''') + + loom = wordloom.load(loom_path, features={'file-inclusion'}) + result = loom['tmpl'].render(base='override') + assert result == 'override' + + +def test_render_works_without_file_inclusion(): + loom = wordloom.load(b"lang = 'en'\n[item]\n_ = 'Hello {name}'") + assert loom['item'].render(name='World') == 'Hello World' + + +# --------------------------------------------------------------------------- +# security / error cases +# --------------------------------------------------------------------------- + +def test_file_inclusion_path_traversal_raises(tmp_path): + (tmp_path.parent / 'secret.txt').write_text('secret', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{leak}" + leak = "file:../secret.txt" + ''') + + with pytest.raises(ValueError, match='escapes the loom directory'): + wordloom.load(loom_path, features={'file-inclusion'}) + + +def test_file_inclusion_absolute_path_raises(tmp_path): + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{leak}" + leak = "file:/etc/passwd" + ''') + + with 
pytest.raises(ValueError, match='must be relative'): + wordloom.load(loom_path, features={'file-inclusion'}) + + +def test_file_inclusion_missing_file_raises(tmp_path): + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{missing}" + missing = "file:does_not_exist.txt" + ''') + + with pytest.raises(FileNotFoundError): + wordloom.load(loom_path, features={'file-inclusion'}) + + +def test_file_inclusion_glob_dotdot_raises(tmp_path): + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{leak}" + leak = "glob:../*.txt" + ''') + + with pytest.raises(ValueError, match='"\\.\\."'): + wordloom.load(loom_path, features={'file-inclusion'}) + + +def test_file_inclusion_requires_base_dir_for_bytes_input(): + toml_bytes = b"lang = 'en'\n[item]\n_ = '{x}'\nx = 'file:x.txt'" + with pytest.raises(ValueError, match='file-inclusion'): + wordloom.load(toml_bytes, features={'file-inclusion'}) + + +def test_file_inclusion_explicit_base_dir(tmp_path): + (tmp_path / 'data.txt').write_text('explicit base', encoding='utf-8') + # Load from bytes but supply an explicit base_dir + toml_bytes = b"lang = 'en'\n[item]\n_ = '{data}'\ndata = 'file:data.txt'" + loom = wordloom.load(toml_bytes, features={'file-inclusion'}, base_dir=tmp_path) + assert loom['item'].file_bindings['data'] == 'explicit base' + + +# --------------------------------------------------------------------------- +# clone() preserves file_bindings +# --------------------------------------------------------------------------- + +def test_clone_preserves_file_bindings(tmp_path): + (tmp_path / 'f.txt').write_text('content', encoding='utf-8') + loom_path = _write_loom(tmp_path, ''' + lang = "en" + [item] + _ = "{f}" + f = "file:f.txt" + ''') + + loom = wordloom.load(loom_path, features={'file-inclusion'}) + original = loom['item'] + cloned = original.clone(value='new text {f}') + assert cloned.file_bindings == original.file_bindings + assert cloned.render() == 'new text content' + + +if __name__ 
== '__main__': + raise SystemExit('Attention! Run with pytest') From c8e04575cacac8fe13e61c2d88bcb53a961da0a0 Mon Sep 17 00:00:00 2001 From: Uche Ogbuji Date: Thu, 30 Apr 2026 12:52:46 -0600 Subject: [PATCH 3/4] Add prompt injection warning --- README.md | 2 +- implementation.md | 20 -------------------- pylib/ext/file_includes.py | 12 +++++++----- 3 files changed, 8 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index a9d379d..4bbaa03 100644 --- a/README.md +++ b/README.md @@ -107,7 +107,7 @@ This is an under-considered area in AI prompting. When dealing with multiple lan # Contributing -Contributions welcome! We're interested in feedback from the community about what works and what doesn't in real-world usage. To get help with the code implementation, read [CONTRIBUTING.md](CONTRIBUTING.md). +Contributions welcome! We're interested in feedback from the community about what works and what doesn't in real-world usage. To get help with the code implementation, or to learn about our packaging approach, read [CONTRIBUTING.md](CONTRIBUTING.md). # License diff --git a/implementation.md b/implementation.md index 033bbf7..b1f9cb7 100644 --- a/implementation.md +++ b/implementation.md @@ -172,23 +172,3 @@ loom = wordloom.load(toml_bytes, features={'file-inclusion'}, base_dir='/path/to ``` Without a base directory, the feature raises `ValueError` at load time. - ---- - -## Development workflow - -```bash -# Install (required after any pylib/ change) -uv pip install -U . - -# Run tests -pytest test/ -v - -# Run only the file-inclusion tests -pytest test/test_file_inclusion.py -v - -# Lint -ruff check . -``` - -See [CONTRIBUTING.md](CONTRIBUTING.md) for release and packaging details. 
diff --git a/pylib/ext/file_includes.py b/pylib/ext/file_includes.py index 6b0c8e6..292fe54 100644 --- a/pylib/ext/file_includes.py +++ b/pylib/ext/file_includes.py @@ -9,11 +9,11 @@ carry a ``file:``, ``dir:``, or ``glob:`` prefix are resolved to their text content at load time when this extension is active. - file: — UTF-8 content of that single file - dir: — all UTF-8 files under the directory, concatenated and - headed with ``=== relative/path ===`` separators - glob: — same concatenation for files matching the glob pattern - relative to the loom directory + file:: UTF-8 content of that single file + dir: : all UTF-8 files under the directory, concatenated and + headed with ``=== relative/path ===`` separators + glob: : same concatenation for files matching the glob pattern + relative to the loom directory All paths are relative to the directory containing the loom TOML file and must stay within that directory (directory-traversal attempts raise ValueError). @@ -28,6 +28,8 @@ Resolved values are exposed as ``language_item.file_bindings`` (a plain dict), and the ``language_item.render(**kwargs)`` helper merges them with any runtime kwargs before calling ``str.format``. + +**Warning:** The security model prevents path traversal, but it cannot protect against malicious *content* inside included files. If file contents are user-influenced or come from untrusted sources, they could inject instructions into your prompts. Only include files you trust, or inspect/strip their content before loading. 
''' from __future__ import annotations From 1b486930fa262cf1689fda14ffad61f4381cb0bb Mon Sep 17 00:00:00 2001 From: Uche Ogbuji Date: Thu, 30 Apr 2026 13:13:00 -0600 Subject: [PATCH 4/4] Pace lint --- pylib/ext/file_includes.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/pylib/ext/file_includes.py b/pylib/ext/file_includes.py index 292fe54..d5d6943 100644 --- a/pylib/ext/file_includes.py +++ b/pylib/ext/file_includes.py @@ -29,7 +29,10 @@ and the ``language_item.render(**kwargs)`` helper merges them with any runtime kwargs before calling ``str.format``. -**Warning:** The security model prevents path traversal, but it cannot protect against malicious *content* inside included files. If file contents are user-influenced or come from untrusted sources, they could inject instructions into your prompts. Only include files you trust, or inspect/strip their content before loading. +**Warning:** The security model prevents path traversal, but it cannot protect against malicious *content* +inside included files. If file contents are user-influenced or come from untrusted sources, +they could inject instructions into your prompts. Only include files you trust, or inspect/strip +content before loading. ''' from __future__ import annotations
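
Reviewer note: the docstring touched by this patch says `render(**kwargs)` merges `file_bindings` with runtime kwargs before calling `str.format`, and that included paths must stay inside the loom directory. The sketch below illustrates both behaviors as the docstring and tests describe them. It is not the code in `pylib/ext/file_includes.py`: `render_sketch` and `resolve_within` are hypothetical names, and the error messages are only modeled on the strings the tests match against.

```python
from pathlib import Path


def render_sketch(template: str, file_bindings: dict[str, str], **kwargs: str) -> str:
    '''Hypothetical stand-in for language_item.render(): merge bindings, then str.format.'''
    merged = {**file_bindings, **kwargs}  # Later keys win, so runtime kwargs override file content
    return template.format(**merged)


def resolve_within(base_dir: Path, rel: str) -> Path:
    '''Hypothetical containment check (assumed behavior, not the library's actual code).'''
    if Path(rel).is_absolute():
        raise ValueError(f'Path {rel!r} must be relative')
    base = base_dir.resolve()
    resolved = (base / rel).resolve()  # resolve() collapses any '..' segments
    if not resolved.is_relative_to(base):
        raise ValueError(f'Path {rel!r} escapes the loom directory')
    return resolved
```

Under these assumptions, `render_sketch('{base}', {'base': 'file value'}, base='override')` yields `'override'`, which mirrors `test_render_runtime_overrides_file_binding`. Neither function inspects file content, so the prompt injection warning in the docstring still applies to anything that gets included.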