smap

Detect canonical URL patterns from large URL corpora.

Given hundreds of thousands of URLs, smap identifies which path segments are dynamic (IDs, slugs, UUIDs) and which are static route names, then outputs a flat list of canonical patterns that represents the site's route structure.

The Problem

Raw URL data is noisy. A site with 682,000 URLs might have thousands of unique paths, but only a few dozen actual route patterns. smap reverse-engineers the router configuration from the raw data:

Input: 851 URLs from sanity.io
Output:
  [222] /{a}                        — 221 unique marketing pages
  [154] /glossary/{a}               — 154 glossary entries
   [55] /docs/content-lake/{a}      — 55 content lake docs
   [41] /blog/{a}                   — 41 blog posts
   [34] /docs/cli-reference/{a}     — 34 CLI reference pages
   [33] /answers/{a}                — 33 community answers
   [30] /customers/{a}              — 30 customer stories
   ...

Install

npm install sanity-labs/smap

Quick Example

import { smap } from 'smap';

const urls = [
  '/plans/abc/overview',
  '/plans/def/overview',
  '/plans/abc/billing',
  '/plans/ghi/billing',
  '/users/alice/settings',
  '/users/bob/settings',
  '/users/carol/profile',
  '/docs/getting-started',
  '/docs/api-reference',
];

const patterns = smap(urls);
// [
//   { pattern: '/plans/{a}/overview', count: 2, variables: { a: { samples: ['abc', 'def'], unique: 2 } } },
//   { pattern: '/plans/{a}/billing',  count: 2, variables: { a: { samples: ['abc', 'ghi'], unique: 2 } } },
//   { pattern: '/users/{a}/settings', count: 2, variables: { a: { samples: ['alice', 'bob'], unique: 2 } } },
//   { pattern: '/users/{a}/profile',  count: 1, variables: { a: { samples: ['carol'], unique: 1 } } },
//   { pattern: '/docs/getting-started', count: 1, variables: {} },
//   { pattern: '/docs/api-reference',   count: 1, variables: {} },
// ]
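
To turn results like these into the bracketed listing shown under The Problem, a small formatter is enough. This is a sketch only: `formatReport` is a hypothetical helper, not part of smap, and the two interfaces are copied from the Output section so the block stands alone.

```typescript
// Types copied from the README's Output section.
interface VariableInfo {
  samples: string[];
  unique: number;
}

interface PatternResult {
  pattern: string;
  count: number;
  variables: Record<string, VariableInfo>;
}

// Hypothetical helper (not provided by smap): sorts by count, descending,
// and right-aligns the "[count]" label so the patterns line up.
function formatReport(results: PatternResult[]): string {
  const sorted = [...results].sort((x, y) => y.count - x.count);
  const width = Math.max(0, ...sorted.map((r) => String(r.count).length));
  return sorted
    .map((r) => {
      const label = `[${r.count}]`.padStart(width + 2); // +2 for the brackets
      return `${label} ${r.pattern}`;
    })
    .join('\n');
}

console.log(
  formatReport([
    { pattern: '/blog/{a}', count: 41, variables: {} },
    { pattern: '/{a}', count: 222, variables: {} },
  ]),
);
// [222] /{a}
//  [41] /blog/{a}
```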

How It Works

Shape-aware cardinality analysis. URLs are grouped by segment count (shape) before classification. This prevents cross-depth contamination — marketing pages at /studio don't pollute the cardinality measurement for /glossary/{a}.
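
The grouping step can be sketched as follows. This is an illustration of the idea, not smap's internal code: `splitPath` and `groupByShape` are hypothetical names, and the sketch assumes inputs are paths (as in the Quick Example) rather than full URLs with a scheme and host.

```typescript
// Hypothetical helper: drop any query string or fragment, then split the
// path on "/" and discard empty segments.
function splitPath(url: string): string[] {
  const path = url.split(/[?#]/)[0];
  return path.split('/').filter((segment) => segment.length > 0);
}

// Hypothetical helper: bucket URLs by segment count ("shape"), so that
// one-segment pages are never measured together with two-segment pages.
function groupByShape(urls: string[]): Map<number, string[][]> {
  const groups = new Map<number, string[][]>();
  for (const url of urls) {
    const segments = splitPath(url);
    const bucket = groups.get(segments.length) ?? [];
    bucket.push(segments);
    groups.set(segments.length, bucket);
  }
  return groups;
}

const shapes = groupByShape(['/studio', '/pricing', '/glossary/api', '/glossary/cdn']);
console.log(shapes.get(1)); // [ [ 'studio' ], [ 'pricing' ] ]
console.log(shapes.get(2)); // [ [ 'glossary', 'api' ], [ 'glossary', 'cdn' ] ]
```

With this split, `/studio` and `/pricing` land in the one-segment group and cannot inflate the number of distinct values seen under `/glossary`.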

Within each shape group, smap builds a trie and walks it top-down. At each node, it computes the cardinality ratio (unique children / total traversals). A high ratio, relative to cardinalityThreshold, indicates a dynamic variable; a low ratio indicates a static route segment. When a node is classified as dynamic, all children's subtrees are merged to reveal the common structure underneath.
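
A minimal sketch of the ratio computation on a trie is shown below. All names here are hypothetical, and several details are assumptions rather than smap's documented behavior: the comparison is written as `>=`, nodes with fewer than `minSamples` traversals are reported as `undecided` because the README does not say how they are handled, and subtree merging is left out entirely.

```typescript
// Hypothetical trie node: counts how many URLs passed through it.
interface TrieNode {
  traversals: number;
  children: Map<string, TrieNode>; // next segment -> child node
}

function newNode(): TrieNode {
  return { traversals: 0, children: new Map() };
}

// Insert one URL's segments, incrementing traversal counts along the way.
function insert(root: TrieNode, segments: string[]): void {
  let node = root;
  node.traversals += 1;
  for (const segment of segments) {
    let child = node.children.get(segment);
    if (!child) {
      child = newNode();
      node.children.set(segment, child);
    }
    child.traversals += 1;
    node = child;
  }
}

type Classification = 'dynamic' | 'static' | 'undecided';

// Classify the segment position *below* `node`.
// Defaults mirror the documented options (0.5 and 5); the ">=" comparison
// and the "undecided" outcome are assumptions of this sketch.
function classifyNextSegment(node: TrieNode, threshold = 0.5, minSamples = 5): Classification {
  if (node.traversals < minSamples) return 'undecided';
  const ratio = node.children.size / node.traversals; // unique children / total traversals
  return ratio >= threshold ? 'dynamic' : 'static';
}

const root = newNode();
for (const term of ['api', 'cdn', 'cms', 'dns', 'gui', 'jwt']) {
  insert(root, ['glossary', term]);
}
console.log(classifyNextSegment(root)); // 'static'  (1 child / 6 traversals)
console.log(classifyNextSegment(root.children.get('glossary')!)); // 'dynamic' (6 / 6)
```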

Variable placeholders use sequential letters: {a}, {b}, {c}. These can be given meaningful names later (e.g., by an LLM looking at the samples and surrounding segments).
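
Renaming can happen as a post-processing step on the pattern string. The helper below is a hypothetical sketch, not part of smap; the names in the mapping would come from whatever process (manual or LLM-assisted) picks them.

```typescript
// Hypothetical helper: replace letter placeholders such as {a} with
// meaningful names. Placeholders without a mapping are left unchanged.
function renameVariables(pattern: string, names: Record<string, string>): string {
  return pattern.replace(/\{([a-z]+)\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(names, key) ? `{${names[key]}}` : match,
  );
}

console.log(renameVariables('/orgs/{a}/teams/{b}/members/{c}', { a: 'org', b: 'team' }));
// /orgs/{org}/teams/{team}/members/{c}
```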

API

smap(urls: string[], options?: SmapOptions): PatternResult[]

Options

interface SmapOptions {
  /** Cardinality ratio threshold. Default: 0.5 */
  cardinalityThreshold?: number;
  /** Min URLs before classifying. Default: 5 */
  minSamples?: number;
  /** Max sample values per variable. Default: 5 */
  maxSamples?: number;
}

Output

interface PatternResult {
  pattern: string;                          // e.g. "/plans/{a}/overview"
  count: number;                            // URLs matching this pattern
  variables: Record<string, VariableInfo>;  // keyed by "a", "b", "c", ...
}

interface VariableInfo {
  samples: string[];  // up to maxSamples example values
  unique: number;     // total distinct values (cardinality)
}
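
To make the relationship between `samples` and `unique` concrete, here is a sketch of how a `VariableInfo` could be derived from the raw values seen at one variable position. `summarizeVariable` is a hypothetical helper, and keeping samples in first-seen order is an assumption based on the examples in this README.

```typescript
// Type copied from the Output section above.
interface VariableInfo {
  samples: string[];
  unique: number;
}

// Hypothetical helper: `unique` counts every distinct value, while
// `samples` keeps at most `maxSamples` of them (first-seen order assumed).
function summarizeVariable(values: string[], maxSamples = 5): VariableInfo {
  const distinct = new Set(values); // a Set preserves insertion order
  return {
    samples: Array.from(distinct).slice(0, maxSamples),
    unique: distinct.size,
  };
}

console.log(summarizeVariable(['ops', 'frontend', 'ops', 'backend'], 2));
// { samples: [ 'ops', 'frontend' ], unique: 3 }
```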

More Examples

Multiple Variables

const urls = [
  '/orgs/acme/teams/frontend/members/alice',
  '/orgs/acme/teams/backend/members/bob',
  '/orgs/globex/teams/ops/members/carol',
  '/orgs/globex/teams/ops/members/dave',
];

smap(urls, { minSamples: 1 });
// [{ pattern: '/orgs/{a}/teams/{b}/members/{c}', count: 4, variables: {
//     a: { samples: ['acme', 'globex'], unique: 2 },
//     b: { samples: ['frontend', 'backend', 'ops'], unique: 3 },
//     c: { samples: ['alice', 'bob', 'carol', 'dave'], unique: 4 }
// }}]
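
Once a pattern is known, it can be used to classify new URLs and pull out the variable values. The matcher below is a hypothetical sketch that is not part of smap; it assumes each `{x}` placeholder stands for exactly one non-empty path segment.

```typescript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: returns the variable values if `path` matches
// `pattern`, or null otherwise.
function matchPattern(pattern: string, path: string): Record<string, string> | null {
  const names: string[] = [];
  const source = pattern
    .split('/')
    .map((segment) => {
      const placeholder = /^\{([a-z]+)\}$/.exec(segment);
      if (!placeholder) return escapeRegExp(segment); // static segment
      names.push(placeholder[1]);
      return '([^/]+)'; // one non-empty segment
    })
    .join('/');
  const match = new RegExp(`^${source}$`).exec(path);
  if (!match) return null;
  const values: Record<string, string> = {};
  names.forEach((name, index) => {
    values[name] = match[index + 1];
  });
  return values;
}

console.log(
  matchPattern('/orgs/{a}/teams/{b}/members/{c}', '/orgs/acme/teams/frontend/members/alice'),
);
// { a: 'acme', b: 'frontend', c: 'alice' }
console.log(matchPattern('/orgs/{a}/teams/{b}/members/{c}', '/orgs/acme'));
// null
```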

Real-World Scale

Tested against:

  • sanity.io (851 URLs) — correctly identifies glossary, blog, docs, exchange, customer patterns
  • MDN (1,571 URLs) — correctly identifies per-locale glossary, blog, tutorial, curriculum patterns

Performance at scale:

URLs    Time     Memory
10K     24ms     5 MB
100K    210ms    56 MB
500K    1.4s     327 MB
1M      2.75s    491 MB

The shape-aware architecture processes each depth group independently, making it naturally suited for chunked/streaming processing.

Acknowledgments

Inspired by ZoriHQ/trie-url-classifier, a Go library that uses trie-based cardinality analysis for URL normalization.

License

MIT
