Detect canonical URL patterns from large URL corpora.
Given hundreds of thousands of URLs, smap identifies which path segments are dynamic (IDs, slugs, UUIDs) and which are static route names — outputting a flat list of canonical patterns that represents the site's route structure.
Raw URL data is noisy. A site with 682,000 URLs might have thousands of unique paths, but only a few dozen actual route patterns. smap reverse-engineers the router configuration from the raw data:
Input: 851 URLs from sanity.io
Output:
[222] /{a} — 221 unique marketing pages
[154] /glossary/{a} — 154 glossary entries
[55] /docs/content-lake/{a} — 55 content lake docs
[41] /blog/{a} — 41 blog posts
[34] /docs/cli-reference/{a} — 34 CLI reference pages
[33] /answers/{a} — 33 community answers
[30] /customers/{a} — 30 customer stories
...
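A listing in the `[count] pattern` form shown above could be produced from smap's results with a small formatter. This is a minimal sketch: `formatPatterns` and `PatternCount` are hypothetical names, not part of smap, and the trailing descriptions in the listing above are annotations that this helper does not generate.

```typescript
// Hypothetical formatter, not part of smap. It only needs the
// `pattern` and `count` fields of each result.
interface PatternCount {
  pattern: string;
  count: number;
}

function formatPatterns(results: PatternCount[]): string[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...results]
    .sort((x, y) => y.count - x.count)
    .map((r) => `[${r.count}] ${r.pattern}`);
}

const listing = formatPatterns([
  { pattern: '/blog/{a}', count: 41 },
  { pattern: '/{a}', count: 222 },
  { pattern: '/glossary/{a}', count: 154 },
]);
// listing: ['[222] /{a}', '[154] /glossary/{a}', '[41] /blog/{a}']
```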
npm install sanity-labs/smap

import { smap } from 'smap';
const urls = [
'/plans/abc/overview',
'/plans/def/overview',
'/plans/abc/billing',
'/plans/ghi/billing',
'/users/alice/settings',
'/users/bob/settings',
'/users/carol/profile',
'/docs/getting-started',
'/docs/api-reference',
];
const patterns = smap(urls);
// [
// { pattern: '/plans/{a}/overview', count: 2, variables: { a: { samples: ['abc', 'def'], unique: 2 } } },
// { pattern: '/plans/{a}/billing', count: 2, variables: { a: { samples: ['abc', 'ghi'], unique: 2 } } },
// { pattern: '/users/{a}/settings', count: 2, variables: { a: { samples: ['alice', 'bob'], unique: 2 } } },
// { pattern: '/users/{a}/profile', count: 1, variables: { a: { samples: ['carol'], unique: 1 } } },
// { pattern: '/docs/getting-started', count: 1, variables: {} },
// { pattern: '/docs/api-reference', count: 1, variables: {} },
// ]

Shape-aware cardinality analysis. URLs are grouped by segment count (shape) before classification. This prevents cross-depth contamination: marketing pages at /studio don't pollute the cardinality measurement for /glossary/{a}.
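The grouping step can be pictured roughly as follows. This is a sketch under assumptions, not smap's actual source: `groupByShape` is a hypothetical helper, and it assumes URLs are plain paths split on `/` with empty segments dropped.

```typescript
// Hypothetical helper illustrating shape grouping; not part of smap's API.
function groupByShape(urls: string[]): Map<number, string[][]> {
  const groups = new Map<number, string[][]>();
  for (const url of urls) {
    // '/glossary/cdn' becomes ['glossary', 'cdn'].
    const segments = url.split('/').filter((s) => s.length > 0);
    const bucket = groups.get(segments.length) ?? [];
    bucket.push(segments);
    groups.set(segments.length, bucket);
  }
  return groups;
}

const shapes = groupByShape(['/studio', '/pricing', '/glossary/cdn', '/glossary/api']);
// shapes.get(1) holds the two single-segment paths,
// shapes.get(2) holds the two glossary paths.
```

Because `/studio` and `/glossary/cdn` end up in different buckets, the single-segment marketing pages never contribute to the counts measured under `/glossary`.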
Within each shape group, smap builds a trie and walks it top-down. At each node, it computes the cardinality ratio (unique children / total traversals). High ratio → dynamic variable. Low ratio → static route segment. When a node is classified as dynamic, all children's subtrees are merged to reveal the common structure underneath.
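The ratio test can be sketched as below. This is an illustration only: `TrieNode`, `insertPath`, `cardinalityRatio`, and `isDynamic` are hypothetical names, the strict `>` comparison against the threshold is an assumption, and the sketch ignores `minSamples`, subtree merging, and any other rules the library applies, so it will not reproduce smap's output exactly.

```typescript
// Hypothetical sketch of the cardinality ratio; not smap's actual source.
interface TrieNode {
  children: Map<string, TrieNode>;
  traversals: number; // number of URLs that passed through this node
}

function createNode(): TrieNode {
  return { children: new Map<string, TrieNode>(), traversals: 0 };
}

function insertPath(root: TrieNode, segments: string[]): void {
  let node = root;
  node.traversals += 1;
  for (const segment of segments) {
    let child = node.children.get(segment);
    if (child === undefined) {
      child = createNode();
      node.children.set(segment, child);
    }
    child.traversals += 1;
    node = child;
  }
}

// unique children / total traversals, as described above.
function cardinalityRatio(node: TrieNode): number {
  return node.traversals === 0 ? 0 : node.children.size / node.traversals;
}

// Assumption: the segment below `node` is a variable when the ratio exceeds the threshold.
function isDynamic(node: TrieNode, threshold = 0.5): boolean {
  return cardinalityRatio(node) > threshold;
}

const root = createNode();
const paths = [
  'glossary/api', 'glossary/cdn', 'glossary/cms', 'glossary/dns',
  'glossary/jwt', 'glossary/sdk', 'blog/launch', 'blog/update',
];
for (const p of paths) insertPath(root, p.split('/'));
const glossary = root.children.get('glossary');
```

Here the root sees 2 distinct children across 8 URLs (ratio 0.25), so `glossary` and `blog` read as static route names, while the `glossary` node sees 6 distinct children across 6 URLs (ratio 1), so the segment beneath it reads as a variable.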
Variable placeholders use sequential letters: {a}, {b}, {c}. These can be given meaningful names later (e.g., by an LLM looking at the samples and surrounding segments).
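Once names have been chosen, substituting them into a pattern is a simple string rewrite. The helper below is a sketch: `renameVariables` is a hypothetical function, not part of smap, and the names `org`, `team`, and `member` are illustrative.

```typescript
// Hypothetical helper, not part of smap: swap {a}, {b}, ... for chosen names.
function renameVariables(pattern: string, names: Record<string, string>): string {
  return pattern.replace(/\{([a-z]+)\}/g, (match: string, key: string) =>
    // Leave placeholders without a chosen name untouched.
    Object.prototype.hasOwnProperty.call(names, key) ? `{${names[key]}}` : match,
  );
}

const named = renameVariables('/orgs/{a}/teams/{b}/members/{c}', {
  a: 'org',
  b: 'team',
  c: 'member',
});
// named: '/orgs/{org}/teams/{team}/members/{member}'
```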
smap(urls: string[], options?: SmapOptions): PatternResult[]

interface SmapOptions {
/** Cardinality ratio threshold. Default: 0.5 */
cardinalityThreshold?: number;
/** Min URLs before classifying. Default: 5 */
minSamples?: number;
/** Max sample values per variable. Default: 5 */
maxSamples?: number;
}

interface PatternResult {
pattern: string; // e.g. "/plans/{a}/overview"
count: number; // URLs matching this pattern
variables: Record<string, VariableInfo>; // keyed by "a", "b", "c", ...
}
interface VariableInfo {
samples: string[]; // up to maxSamples example values
unique: number; // total distinct values (cardinality)
}

const urls = [
'/orgs/acme/teams/frontend/members/alice',
'/orgs/acme/teams/backend/members/bob',
'/orgs/globex/teams/ops/members/carol',
'/orgs/globex/teams/ops/members/dave',
];
smap(urls, { minSamples: 1 });
// [{ pattern: '/orgs/{a}/teams/{b}/members/{c}', count: 4, variables: {
// a: { samples: ['acme', 'globex'], unique: 2 },
// b: { samples: ['frontend', 'backend', 'ops'], unique: 3 },
// c: { samples: ['alice', 'bob', 'carol', 'dave'], unique: 4 }
// }}]

Tested against:
- sanity.io (851 URLs) — correctly identifies glossary, blog, docs, exchange, customer patterns
- MDN (1,571 URLs) — correctly identifies per-locale glossary, blog, tutorial, curriculum patterns
Performance at scale:
| URLs | Time | Memory |
|---|---|---|
| 10K | 24ms | 5 MB |
| 100K | 210ms | 56 MB |
| 500K | 1.4s | 327 MB |
| 1M | 2.75s | 491 MB |
The shape-aware architecture processes each depth group independently, making it naturally suited for chunked/streaming processing.
Inspired by ZoriHQ/trie-url-classifier, a Go library that uses trie-based cardinality analysis for URL normalization.
License: MIT