Parsed markdown is stored in SQLite tables that can be used for rendering and database content management.
prerequisites
- choco
- node-gyp see https://github.com/nodejs/node-gyp?tab=readme-ov-file#on-windows
npm install content-structure
This project is a library; for the full config, see the example in example/parse.js.
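Collect all data by running this once:
import {join} from 'path'
import {collect} from 'content-structure'
// rootdir is assumed here to be the current working directory (the documented default)
const rootdir = process.cwd()
await collect({
    rootdir:rootdir,
    contentdir:join(rootdir,"content"),
    file_link_ext:["svg","webp","png","jpeg","jpg","xlsx","glb"],
    outdir:join(rootdir,".structure")
})
See the demo with: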
>pnpm run demo
> node parse.js
content_dir : C:\dev\MicroWebStacks\content-structure\example\content
searching for files with extensions : *.md
Structure DB tables and row counts:
- asset_info: 19
- assets: 19
- blob_store: 14
- documents: 30
- items: 82
Content Structure produces a relational snapshot of every markdown run using the schema declared in catalog.yaml.
The catalog defines a single structure dataset whose tables are optimized for rendering, search indexing, and asset management. Each run populates these tables under .structure/structure.db.
| Table | Purpose | Relationships |
|---|---|---|
| documents | Canonical row per markdown entry. Stores stable ids, routing metadata, and leftover front matter via the meta_data JSON column. | items, assets, and asset_info reference documents.sid. |
| items | Flattened AST stream in reading order. Each row keeps body_text for simple rendering plus an optional serialized AST subtree for nested constructs (stored in ast). | References documents via doc_sid; assets rows connect items to blobs when an AST node produces a file. |
| assets | Run-specific join table so consumers can tell which document referenced which asset at a given version_id. | Bridge between documents and asset_info; also carries the blob_uid for quick payload lookups. |
| asset_info | Deduplicated description of every asset (code blocks, tables, linked files, etc.) regardless of run. | Points to the owning document (parent_doc_uid) and the physical payload via blob_uid. |
| blob_store | Source of truth for payloads. Large blobs are stored under blobs/YYYY/MM/ff/hash and referenced by path, while small blobs inline their bytes (compressed when eligible). | asset_info/assets link to blobs through blob_uid. |
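
The documents.sid / items.doc_sid relationship from the table can be illustrated with a small sketch. The helper name groupItemsByDocument and the sample rows are hypothetical; the rows are plain objects of the kind a SQLite client would return, and only the column names sid, doc_sid, slug, and body_text come from this README.

```javascript
// Hypothetical helper: attach each document's items using the
// documents.sid / items.doc_sid link described in the table.
function groupItemsByDocument(documents, items) {
  const bySid = new Map();
  for (const doc of documents) {
    bySid.set(doc.sid, { ...doc, items: [] });
  }
  for (const item of items) {
    const doc = bySid.get(item.doc_sid);
    // Items whose document is missing are skipped rather than guessed at.
    if (doc) doc.items.push(item);
  }
  return [...bySid.values()];
}

// Sample rows, made up for illustration (not real output of the library).
const documents = [
  { sid: 'a1', slug: 'intro' },
  { sid: 'b2', slug: 'setup' },
];
const items = [
  { doc_sid: 'a1', body_text: 'Welcome' },
  { doc_sid: 'b2', body_text: 'Install' },
  { doc_sid: 'a1', body_text: 'Overview' },
];

const grouped = groupItemsByDocument(documents, items);
console.log(grouped.map((d) => `${d.slug}: ${d.items.length}`));
```

In practice the same grouping could be done in SQL with a join on doc_sid; the in-memory version is only meant to make the relationship visible.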
The catalog is intentionally compact: fields are named to match DOM concerns (slug, url_type, level), content analysis (headings, links, code), and asset lifecycle (first_seen, last_seen). Instead of memorizing every column, browse catalog.yaml whenever you need the exact types or to extend the dataset. Downstream tools can rely on the catalog as the authoritative contract when generating queries, migrations, or analytics dashboards.
- Metadata folding: any front matter not mapped to a declared column is serialized into documents.meta_data, keeping schemas manageable without losing context.
- Automatic ordering: documents inherit incremental order values scoped to their directory level unless you pin them explicitly. This keeps navigation menus stable even when markdown files are added later.
- Mixed routing: folder-style (readme.md or matching filenames) and file-style URLs coexist. url_type reveals which variant was used to generate the url.
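
Metadata folding can be pictured with a minimal sketch. This is not the library's code: foldFrontMatter is a hypothetical helper, and the list of declared columns passed to it is an assumption (the real list is whatever catalog.yaml declares).

```javascript
// Hypothetical sketch of metadata folding: keys that match a declared
// column stay as columns, everything else is serialized into meta_data.
function foldFrontMatter(frontMatter, declaredColumns) {
  const row = {};
  const leftovers = {};
  for (const [key, value] of Object.entries(frontMatter)) {
    if (declaredColumns.includes(key)) row[key] = value;
    else leftovers[key] = value;
  }
  // Undeclared keys survive as a JSON string in meta_data.
  row.meta_data = JSON.stringify(leftovers);
  return row;
}

// Assumed column list, for illustration only.
const folded = foldFrontMatter(
  { title: 'Intro', order: 3, author: 'sam', tags: ['a', 'b'] },
  ['title', 'order', 'slug'],
);
console.log(folded);
```

Consumers can then read the leftovers back with JSON.parse(row.meta_data) without the schema needing a column per front matter key.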
- Paragraphs, headings, tables, code blocks, and images are all represented in items. Simple rows expose fully extracted text; nested structures store their sanitized AST so you can re-render bold or embedded assets without reparsing the original markdown.
- Every asset mentioned by an item produces two entries: a durable definition in asset_info and a run-scoped membership row in assets. The membership row ties the asset to both the document and its blob so you can know exactly when something was added, removed, or reused.
- Blob payloads avoid bloat with configurable thresholds: large files stream to disk under blobs/, while smaller text blobs can be compressed inline and served straight from SQLite.
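
A rough sketch of how a renderer might consume an items row follows. It assumes that the ast column holds a JSON string with an mdast-like tree (nodes with type, value, and children); the actual AST shape is not specified here, so treat the node types and the renderItem helper as hypothetical.

```javascript
// Escape text before placing it into HTML.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Assumed mapping from mdast-like node types to HTML tags.
const TAGS = { paragraph: 'p', strong: 'strong', emphasis: 'em' };

function renderNode(node) {
  if (node.type === 'text') return escapeHtml(node.value);
  const inner = (node.children || []).map(renderNode).join('');
  const tag = TAGS[node.type];
  // Unknown node types fall back to their rendered children.
  return tag ? `<${tag}>${inner}</${tag}>` : inner;
}

function renderItem(item) {
  // Simple rows: body_text already holds the extracted text.
  if (!item.ast) return `<p>${escapeHtml(item.body_text)}</p>`;
  // Nested rows: re-render from the stored AST (assumed to be JSON).
  return renderNode(JSON.parse(item.ast));
}

const simpleHtml = renderItem({ body_text: 'Hello', ast: null });
const nestedHtml = renderItem({
  body_text: 'Use bold & more',
  ast: JSON.stringify({
    type: 'paragraph',
    children: [
      { type: 'text', value: 'Use ' },
      { type: 'strong', children: [{ type: 'text', value: 'bold' }] },
      { type: 'text', value: ' & more' },
    ],
  }),
});
console.log(simpleHtml);
console.log(nestedHtml);
```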
Refer back to the catalog for exhaustive field notes, and treat the tables above as the primary contract between your markdown source and any rendering or analytics layers.
All config parameters are optional and have default values:
- rootdir: defaults to the current working directory. The path in which to look for a content directory.
- outdir: defaults to .structure. The directory where all output data is generated, relative to the root directory.
- folder_single_doc: defaults to false. When true, each folder is treated as a single document and the first YAML/YML file contributes overrides plus meta_data fields.
- external_storage_kb: defaults to 512. Blobs larger than this size (in KB) are written to disk under blobs/<YYYY>/<MM>/<prefix>/<hash>.
- inline_compression_kb: defaults to 32. Inline blobs bigger than or equal to this size are eligible for gzip compression before being stored inside the blob_store table.
- file_compress_ext: defaults to ["txt","md","json","csv","tsv","yaml","yml"]. Inline blobs are compressed only if their source extension (when known) appears in this list.
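
The interplay of the three blob-related settings can be summarized in a small decision function. This is a sketch of the documented rules, not the library's implementation: planBlobStorage and its return labels are hypothetical, and it assumes 1 KB means 1024 bytes.

```javascript
// Defaults as documented in the list above.
const DEFAULTS = {
  external_storage_kb: 512,
  inline_compression_kb: 32,
  file_compress_ext: ['txt', 'md', 'json', 'csv', 'tsv', 'yaml', 'yml'],
};

// Hypothetical helper: decide where a blob of a given size would end up.
// Assumption: 1 KB = 1024 bytes.
function planBlobStorage(sizeBytes, ext, config = {}) {
  const cfg = { ...DEFAULTS, ...config };
  const sizeKb = sizeBytes / 1024;
  // Strictly larger than the threshold: written to disk under blobs/.
  if (sizeKb > cfg.external_storage_kb) return 'external';
  // Inline blobs are gzip candidates only at or above the compression
  // threshold and only when the extension is known and listed.
  const compressible =
    sizeKb >= cfg.inline_compression_kb && cfg.file_compress_ext.includes(ext);
  return compressible ? 'inline-gzip' : 'inline';
}

const plans = {
  bigImage: planBlobStorage(600 * 1024, 'png'),
  mediumMarkdown: planBlobStorage(32 * 1024, 'md'),
  mediumImage: planBlobStorage(40 * 1024, 'png'),
  smallMarkdown: planBlobStorage(10 * 1024, 'md'),
};
console.log(plans);
```

Passing a partial config, for example planBlobStorage(size, ext, { external_storage_kb: 1024 }), overrides only that threshold and keeps the other defaults.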
- .structure/structure.db: a SQLite database (powered by better-sqlite3). The database exposes the tables documents, items, assets, asset_info, and blob_store.
- blobs/year/month/prefix/hash: path for all files larger than config.external_storage_kb.