Skip to content

Stores parsed markdown in SQLite tables that can be used for rendering and database content management.

License

Notifications You must be signed in to change notification settings

MicroWebStacks/content-structure

Repository files navigation

Content Structure

Parsed markdown is stored in SQLite tables that can be used for rendering and database content management.

design

Deepwiki

Ask DeepWiki

Install

prerequisites

Usage

npm install content-structure

this project is a library, for full config, see the example in example/parse.js

collect all data by running this once

import {collect} from 'content-structure'

await collect({
    rootdir:rootdir,
    contentdir:join(rootdir,"content"),
    file_link_ext:["svg","webp","png","jpeg","jpg","xlsx","glb"],
    outdir:join(rootdir,".structure")
})

see demo with

>pnpm run demo
> node parse.js

content_dir : C:\dev\MicroWebStacks\content-structure\example\content
   searching for files with extensions : *.md
Structure DB tables and row counts:
  - asset_info: 19
  - assets: 19
  - blob_store: 14
  - documents: 30
  - items: 82

Documentation

Content Structure produces a relational snapshot of every markdown run using the schema declared in catalog.yaml.
The catalog defines a single structure dataset whose tables are optimized for rendering, search indexing, and asset management. Each run populates these tables under .structure/structure.db.

Table overview

Table Purpose Relationships
documents Canonical row per markdown entry. Stores stable ids, routing metadata, and leftover front matter via the meta_data JSON column. items, assets, and asset_info reference documents.sid.
items Flattened AST stream in reading order. Each row keeps body_text for simple rendering plus an optional serialized AST subtree for nested constructs (stored in ast). References documents via doc_sid; assets rows connect items to blobs when an AST node produces a file.
assets Run-specific join table so consumers can tell which document referenced which asset at a given version_id. Bridge between documents and asset_info; also carries the blob_uid for quick payload lookups.
asset_info Deduplicated description of every asset (code blocks, tables, linked files, etc.) regardless of run. Points to the owning document (parent_doc_uid) and the physical payload via blob_uid.
blob_store Source of truth for payloads. Large blobs are stored under blobs/YYYY/MM/ff/hash and referenced by path, while small blobs inline their bytes (compressed when eligible). asset_info/assets link to blobs through blob_uid.

The catalog is intentionally compact: fields are named to match DOM concerns (slug, url_type, level), content analysis (headings, links, code), and asset lifecycle (first_seen, last_seen). Instead of memorizing every column, browse catalog.yaml whenever you need the exact types or to extend the dataset. Downstream tools can rely on the catalog as the authoritative contract when generating queries, migrations, or analytics dashboards.

Document behavior highlights

  • Metadata folding – Any front matter not mapped to a declared column is serialized into documents.meta_data, keeping schemas manageable without losing context.
  • Automatic ordering – Documents inherit incremental order values scoped to their directory level unless you pin them explicitly. This keeps navigation menus stable even when markdown files are added later.
  • Mixed routing – Folder-style (readme.md or matching filenames) and file-style URLs coexist. url_type reveals which variant was used to generate the url.

Item and asset lifecycle

  • Paragraphs, headings, tables, code blocks, and images are all represented in items. Simple rows expose fully extracted text; nested structures store their sanitized AST so you can re-render bold or embedded assets without reparsing the original markdown.
  • Every asset mentioned by an item produces two entries: a durable definition in asset_info and a run-scoped membership row in assets. The membership row ties the asset to both the document and its blob so you can know exactly when something was added, removed, or reused.
  • Blob payloads avoid bloat with configurable thresholds: large files stream to disk under blobs/, while smaller text blobs can be compressed inline and served straight from SQLite.

Refer back to the catalog for exhaustive field notes, and treat the tables above as the primary contract between your markdown source and any rendering or analytics layers.

Config parameters

the config parameter is optional and do have default values

  • rootdir : defaults to current working directory. The path where to find the a content directory.
  • outdir : defaults to .structure. Relative output directory is the location where all output data will be generated, which is relative to the root directory.
  • folder_single_doc : defaults to false. When true, each folder is treated as a single document and the first YAML/YML file contributes overrides plus meta_data fields.
  • external_storage_kb : defaults to 512. Blobs larger than this size (in KB) are written to disk under blobs/<YYYY>/<MM>/<prefix>/<hash>.
  • inline_compression_kb : defaults to 32. Inline blobs bigger than or equal to this size are eligible for gzip compression before being stored inside the blob_store table.
  • file_compress_ext : defaults to ["txt","md","json","csv","tsv","yaml","yml"]. Inline blobs are compressed only if their source extension (when known) appears in this list.

Generated output

  • .structure/structure.db : a SQLite database (powered by better-sqlite3). The database exposes the tables documents, items, assets, asset_info, and blob_store.
  • blobs/year/month/prefix/hash path for all files larger than config.external_storage_kb

About

Stores parsed markdown in SQLite tables that can be used for rendering and database content management.

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project