smap

Detect canonical URL patterns from large URL corpora.

Given hundreds of thousands of URLs, smap identifies which path segments are dynamic (IDs, slugs, UUIDs) and which are static route names — outputting a flat list of canonical patterns that represents the site's route structure.

The Problem

Raw URL data is noisy. A site with 682,000 URLs might have thousands of unique paths, but only a few dozen actual route patterns. smap reverse-engineers the router configuration from the raw data:

Input: 851 URLs from sanity.io
Output:
  [222] /{a}                        — 221 unique marketing pages
  [154] /glossary/{a}               — 154 glossary entries
   [55] /docs/content-lake/{a}      — 55 content lake docs
   [41] /blog/{a}                   — 41 blog posts
   [34] /docs/cli-reference/{a}     — 34 CLI reference pages
   [33] /answers/{a}                — 33 community answers
   [30] /customers/{a}              — 30 customer stories
   ...

Install

npm install sanity-labs/smap

Quick Example

import { smap } from 'smap';

const urls = [
  '/plans/abc/overview',
  '/plans/def/overview',
  '/plans/abc/billing',
  '/plans/ghi/billing',
  '/users/alice/settings',
  '/users/bob/settings',
  '/users/carol/profile',
  '/docs/getting-started',
  '/docs/api-reference',
];

const patterns = smap(urls);
// [
//   { pattern: '/plans/{a}/overview', count: 2, variables: { a: { samples: ['abc', 'def'], unique: 2 } } },
//   { pattern: '/plans/{a}/billing',  count: 2, variables: { a: { samples: ['abc', 'ghi'], unique: 2 } } },
//   { pattern: '/users/{a}/settings', count: 2, variables: { a: { samples: ['alice', 'bob'], unique: 2 } } },
//   { pattern: '/users/{a}/profile',  count: 1, variables: { a: { samples: ['carol'], unique: 1 } } },
//   { pattern: '/docs/getting-started', count: 1, variables: {} },
//   { pattern: '/docs/api-reference',   count: 1, variables: {} },
// ]

How It Works

Shape-aware cardinality analysis: URLs are grouped by segment count (shape) before classification. This prevents cross-depth contamination, so marketing pages at /studio don't pollute the cardinality measurement for /glossary/{a}.
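
The grouping step can be pictured with a minimal sketch. The helper names toSegments and groupByShape are hypothetical and are not part of smap's API; the sketch only shows how paths might be bucketed by segment count so that each depth is analysed separately.

```typescript
// Hypothetical helper: split a URL path into its non-empty segments.
function toSegments(path: string): string[] {
  return path.split('/').filter((segment) => segment.length > 0);
}

// Hypothetical helper: bucket paths by segment count ("shape").
function groupByShape(paths: string[]): Map<number, string[][]> {
  const groups = new Map<number, string[][]>();
  for (const path of paths) {
    const segments = toSegments(path);
    const bucket = groups.get(segments.length) ?? [];
    bucket.push(segments);
    groups.set(segments.length, bucket);
  }
  return groups;
}

const shapes = groupByShape(['/studio', '/pricing', '/glossary/cdn', '/glossary/jamstack']);
// shapes.get(1) holds the two single-segment pages,
// shapes.get(2) holds the two glossary entries.
```

Because /studio and /pricing land in a different bucket from the /glossary entries, they never count as siblings of "glossary" when cardinality is measured.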

Within each shape group, smap builds a trie and walks it top-down. At each node, it computes the cardinality ratio (unique children / total traversals). High ratio → dynamic variable. Low ratio → static route segment. When a node is classified as dynamic, all children's subtrees are merged to reveal the common structure underneath.
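
As a rough illustration of the walk described above, here is a sketch of a counting trie with a cardinality-ratio check and a subtree merge. All names (TrieNode, insert, cardinalityRatio, isDynamic, mergeChildren) are assumptions made for this example, and the exact comparison against the threshold is a guess; smap's internals may differ.

```typescript
// Hypothetical trie node; field names are illustrative, not smap's internals.
interface TrieNode {
  count: number;                   // URLs that passed through this node
  children: Map<string, TrieNode>; // next segment -> child node
}

function createNode(): TrieNode {
  return { count: 0, children: new Map<string, TrieNode>() };
}

// Insert one URL (already split into segments) into the trie.
function insert(root: TrieNode, segments: string[]): void {
  let node = root;
  node.count += 1;
  for (const segment of segments) {
    let child = node.children.get(segment);
    if (!child) {
      child = createNode();
      node.children.set(segment, child);
    }
    child.count += 1;
    node = child;
  }
}

// Cardinality ratio: unique children / total traversals.
function cardinalityRatio(node: TrieNode): number {
  return node.count === 0 ? 0 : node.children.size / node.count;
}

// Assumption: "high" means ratio >= threshold once enough URLs were seen.
function isDynamic(node: TrieNode, threshold = 0.5, minSamples = 5): boolean {
  return node.count >= minSamples && cardinalityRatio(node) >= threshold;
}

// Recursively fold `source` into `target`, summing traversal counts.
function mergeInto(target: TrieNode, source: TrieNode): void {
  target.count += source.count;
  source.children.forEach((child, segment) => {
    let existing = target.children.get(segment);
    if (!existing) {
      existing = createNode();
      target.children.set(segment, existing);
    }
    mergeInto(existing, child);
  });
}

// Collapse all children of a dynamic node into one merged subtree.
function mergeChildren(node: TrieNode): TrieNode {
  const merged = createNode();
  node.children.forEach((child) => mergeInto(merged, child));
  return merged;
}

const root = createNode();
for (const path of ['/plans/abc/overview', '/plans/def/overview', '/plans/abc/billing', '/plans/ghi/billing']) {
  insert(root, path.split('/').filter(Boolean));
}
const plans = root.children.get('plans')!;
const merged = mergeChildren(plans);
// cardinalityRatio(plans) is 3 / 4 = 0.75, so the segment after "plans" looks dynamic.
// merged has two children, "overview" and "billing", each with count 2.
```

Merging abc, def and ghi under one node is what exposes /plans/{a}/overview and /plans/{a}/billing as separate patterns with two URLs each.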

Variable placeholders use sequential letters: {a}, {b}, {c}. These can be given meaningful names later (e.g., by an LLM looking at the samples and surrounding segments).
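
A small sketch of how sequential placeholders could be assigned and later renamed. The helpers placeholderName, toPattern and renameVariables are hypothetical and exist only for this example; how smap names variables beyond {z} is not stated here, so the sketch simply stops at 26.

```typescript
// Hypothetical segment descriptor used only in this sketch.
interface Segment {
  value: string;
  dynamic: boolean;
}

// Map a zero-based variable index to a letter: 0 -> "a", 1 -> "b", ...
function placeholderName(index: number): string {
  if (index < 0 || index > 25) {
    throw new RangeError('this sketch only covers {a} through {z}');
  }
  return String.fromCharCode(97 + index);
}

// Build a pattern string, replacing dynamic segments with {a}, {b}, ...
function toPattern(segments: Segment[]): string {
  let next = 0;
  const parts = segments.map((segment) =>
    segment.dynamic ? `{${placeholderName(next++)}}` : segment.value,
  );
  return '/' + parts.join('/');
}

// Swap letter placeholders for meaningful names chosen after the fact.
function renameVariables(pattern: string, names: Record<string, string>): string {
  return pattern.replace(/\{([a-z])\}/g, (match: string, key: string) =>
    key in names ? `{${names[key]}}` : match,
  );
}

const pattern = toPattern([
  { value: 'orgs', dynamic: false },
  { value: 'acme', dynamic: true },
  { value: 'teams', dynamic: false },
  { value: 'frontend', dynamic: true },
]);
// pattern === '/orgs/{a}/teams/{b}'

const named = renameVariables(pattern, { a: 'org', b: 'team' });
// named === '/orgs/{org}/teams/{team}'
```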

API

smap(urls: string[], options?: SmapOptions): PatternResult[]

Options

interface SmapOptions {
  /** Cardinality ratio threshold. Default: 0.5 */
  cardinalityThreshold?: number;
  /** Min URLs before classifying. Default: 5 */
  minSamples?: number;
  /** Max sample values per variable. Default: 5 */
  maxSamples?: number;
}
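
To make the documented defaults concrete, here is a sketch of how omitted options could be filled in. The resolveOptions helper and the DEFAULTS constant are assumptions for illustration; only the option names and default values come from the interface above.

```typescript
interface SmapOptions {
  cardinalityThreshold?: number;
  minSamples?: number;
  maxSamples?: number;
}

// Defaults as documented in the interface comments.
const DEFAULTS: Required<SmapOptions> = {
  cardinalityThreshold: 0.5,
  minSamples: 5,
  maxSamples: 5,
};

// Hypothetical helper: fall back to the default for any missing or undefined field.
function resolveOptions(options: SmapOptions = {}): Required<SmapOptions> {
  return {
    cardinalityThreshold: options.cardinalityThreshold ?? DEFAULTS.cardinalityThreshold,
    minSamples: options.minSamples ?? DEFAULTS.minSamples,
    maxSamples: options.maxSamples ?? DEFAULTS.maxSamples,
  };
}

const relaxed = resolveOptions({ minSamples: 1 });
// relaxed.minSamples === 1; the other two fields keep their defaults.
```

Lowering minSamples is useful for small inputs, as in the Multiple Variables example, where only four URLs are available.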

Output

interface PatternResult {
  pattern: string;                          // e.g. "/plans/{a}/overview"
  count: number;                            // URLs matching this pattern
  variables: Record<string, VariableInfo>;  // keyed by "a", "b", "c", ...
}

interface VariableInfo {
  samples: string[];  // up to maxSamples example values
  unique: number;     // total distinct values (cardinality)
}
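
As an example of consuming this output, the sketch below turns a PatternResult[] into a listing like the one shown under The Problem. The formatReport function is hypothetical and not part of smap; the interfaces are repeated so the block stands alone.

```typescript
interface VariableInfo {
  samples: string[];
  unique: number;
}

interface PatternResult {
  pattern: string;
  count: number;
  variables: Record<string, VariableInfo>;
}

// Hypothetical formatter: sort by count (descending) and right-align the counts.
function formatReport(results: PatternResult[]): string {
  const sorted = [...results].sort((x, y) => y.count - x.count);
  const width = Math.max(0, ...sorted.map((r) => String(r.count).length)) + 2;
  return sorted
    .map((r) => `[${r.count}]`.padStart(width) + ' ' + r.pattern)
    .join('\n');
}

const report = formatReport([
  { pattern: '/users/{a}/profile', count: 1, variables: { a: { samples: ['carol'], unique: 1 } } },
  { pattern: '/plans/{a}/overview', count: 2, variables: { a: { samples: ['abc', 'def'], unique: 2 } } },
  { pattern: '/docs/getting-started', count: 1, variables: {} },
]);
// report:
// [2] /plans/{a}/overview
// [1] /users/{a}/profile
// [1] /docs/getting-started
```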

More Examples

Multiple Variables

const urls = [
  '/orgs/acme/teams/frontend/members/alice',
  '/orgs/acme/teams/backend/members/bob',
  '/orgs/globex/teams/ops/members/carol',
  '/orgs/globex/teams/ops/members/dave',
];

smap(urls, { minSamples: 1 });
// [{ pattern: '/orgs/{a}/teams/{b}/members/{c}', count: 4, variables: {
//     a: { samples: ['acme', 'globex'], unique: 2 },
//     b: { samples: ['frontend', 'backend', 'ops'], unique: 3 },
//     c: { samples: ['alice', 'bob', 'carol', 'dave'], unique: 4 }
// }}]

Real-World Scale

Tested against:

sanity.io (851 URLs) — correctly identifies glossary, blog, docs, exchange, customer patterns
MDN (1,571 URLs) — correctly identifies per-locale glossary, blog, tutorial, curriculum patterns

Performance at scale:

URLs   Time    Memory
10K    24ms    5 MB
100K   210ms   56 MB
500K   1.4s    327 MB
1M     2.75s   491 MB

The shape-aware architecture processes each depth group independently, making it naturally suited for chunked/streaming processing.

Acknowledgments

Inspired by ZoriHQ/trie-url-classifier, a Go library that uses trie-based cardinality analysis for URL normalization.

License

MIT
