| 1 | +# Content Extraction Strategy and Alternatives |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +This document describes the content extraction approach OpenBrain currently uses and evaluates alternatives for future consideration. |
| 6 | + |
| 7 | +**Current State:** Using `@extractus/article-extractor` for web content extraction. |
| 8 | + |
| 9 | +**Note:** Content extraction is delegated to the OpenBrain CLI (`ob add` command). This document focuses on the extraction pipeline and on alternatives for future evaluation. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## 1. Current Extraction Pipeline |
| 14 | + |
| 15 | +### 1.1 Primary Library: @extractus/article-extractor |
| 16 | + |
| 17 | +| Aspect | Details | |
| 18 | +|--------|---------| |
| 19 | +| **Purpose** | Extract article content from web pages | |
| 20 | +| **npm Package** | `@extractus/article-extractor` | |
| 21 | +| **Stars** | ~500 | |
| 22 | +| **Last Published** | 2024 | |
| 23 | +| **Maintenance** | Low activity | |
| 24 | + |
| 25 | +### 1.2 Extraction Flow |
| 26 | + |
| 27 | +``` |
| 28 | +URL Input |
| 29 | + ↓ |
| 30 | +ob add command (OpenBrain CLI) |
| 31 | + ↓ |
| 32 | +HTTP Request / Retrieval |
| 33 | + ↓ |
| 34 | +@extractus/article-extractor |
| 35 | + ↓ |
| 36 | +Content + Metadata Extraction |
| 37 | + ↓ |
| 38 | +LLM Summarization (if enabled) |
| 39 | + ↓ |
| 40 | +Vector Embedding |
| 41 | + ↓ |
| 42 | +Qdrant Storage |
| 43 | +``` |
| 44 | + |
| 45 | +### 1.3 Current Capabilities |
| 46 | + |
| 47 | +- **Articles:** Primary focus - extracts title, content, author, publish date |
| 48 | +- **Metadata:** OpenGraph, Twitter cards, favicon |
| 49 | +- **Images:** Basic image extraction from article content |
| 50 | +- **Language Detection:** Built-in language detection |
| 51 | + |
| 52 | +### 1.4 Known Limitations |
| 53 | + |
| 54 | +- **JavaScript-Rendered Pages:** Cannot extract from JS-heavy sites (SPA, React, Vue, etc.) |
| 55 | +- **PDF Content:** Not supported natively |
| 56 | +- **Video Content:** No metadata extraction for videos |
| 57 | +- **Paywalls/Gated Content:** Limited support |
| 58 | +- **Dynamic Loading:** Cannot handle infinite scroll, lazy loading |
| 59 | +- **Rate Limiting:** No built-in retry with backoff |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## 2. Alternative Libraries Evaluation |
| 64 | + |
| 65 | +### 2.1 General Purpose HTML Extractors |
| 66 | + |
| 67 | +#### @extractus/article-extractor (Current) |
| 68 | + |
| 69 | +| Criterion | Score | Notes | |
| 70 | +|-----------|-------|-------| |
| 71 | +| Extraction Accuracy | 7/10 | Good for standard articles, poor for JS-heavy sites | |
| 72 | +| Performance | 8/10 | Fast extraction, minimal dependencies | |
| 73 | +| Content Type Coverage | 5/10 | Articles only, no PDF/video support | |
| 74 | +| Maintenance Status | 4/10 | Low activity, few recent updates | |
| 75 | +| Dependencies | Low | Minimal bundle impact | |
| 76 | +| Error Handling | 6/10 | Basic error messages | |
| 77 | + |
| 78 | +#### @mozilla/readability |
| 79 | + |
| 80 | +| Criterion | Score | Notes | |
| 81 | +|-----------|-------|-------| |
| 82 | +| Extraction Accuracy | 8/10 | Mozilla's proven algorithm | |
| 83 | +| Performance | 8/10 | Very fast, lightweight | |
| 84 | +| Content Type Coverage | 5/10 | Articles only | |
| 85 | +| Maintenance Status | 9/10 | Actively maintained by Mozilla | |
| 86 | +| Dependencies | Low | Small package, but needs a DOM implementation (e.g. JSDOM) when run in Node | |
| 87 | +| Error Handling | 7/10 | Robust error handling | |
| 88 | + |
| 89 | +**Comparison:** More accurate than @extractus for edge cases, better maintained. |
| 90 | + |
| 91 | +#### @postlight/parser (Mercury) |
| 92 | + |
| 93 | +| Criterion | Score | Notes | |
| 94 | +|-----------|-------|-------| |
| 95 | +| Extraction Accuracy | 8/10 | Excellent extraction quality | |
| 96 | +| Performance | 6/10 | Heavier, slower extraction | |
| 97 | +| Content Type Coverage | 6/10 | Articles + some metadata | |
| 98 | +| Maintenance Status | 5/10 | Reduced activity recently | |
| 99 | +| Dependencies | High | Larger bundle size | |
| 100 | +| Error Handling | 7/10 | Good error messages | |
| 101 | + |
| 102 | +**Comparison:** Higher quality extraction but heavier dependency. |
| 103 | + |
| 104 | +#### cheerio-based Custom Extractors |
| 105 | + |
| 106 | +| Criterion | Score | Notes | |
| 107 | +|-----------|-------|-------| |
| 108 | +| Extraction Accuracy | 6/10 | Varies by implementation | |
| 109 | +| Performance | 7/10 | Good performance | |
| 110 | +| Content Type Coverage | 7/10 | Flexible, customizable | |
| 111 | +| Maintenance Status | N/A | Depends on implementation | |
| 112 | +| Dependencies | Medium | Requires custom code | |
| 113 | +| Error Handling | 5/10 | Varies by implementation | |
| 114 | + |
| 115 | +**Comparison:** Trades maintainability for flexibility: selectors can be tailored per site, but the custom code has to be maintained by hand. |
| 116 | + |
| 117 | +#### JSDOM-based Extractors |
| 118 | + |
| 119 | +| Criterion | Score | Notes | |
| 120 | +|-----------|-------|-------| |
| 121 | +| Extraction Accuracy | 7/10 | Good DOM-based extraction | |
| 122 | +| Performance | 5/10 | JSDOM is heavyweight | |
| 123 | +| Content Type Coverage | 7/10 | Full DOM capabilities | |
| 124 | +| Maintenance Status | 7/10 | Well-maintained | |
| 125 | +| Dependencies | High | JSDOM is large | |
| 126 | +| Error Handling | 7/10 | Good error handling | |
| 127 | + |
| 128 | +**Comparison:** Full DOM parsing but slower and heavier. |
| 129 | + |
| 130 | +### 2.2 JavaScript-Heavy Site Solutions |
| 131 | + |
| 132 | +#### Puppeteer |
| 133 | + |
| 134 | +| Criterion | Score | Notes | |
| 135 | +|-----------|-------|-------| |
| 136 | +| Extraction Accuracy | 9/10 | Renders JS, extracts rendered content | |
| 137 | +| Performance | 3/10 | Slow, resource-intensive | |
| 138 | +| Content Type Coverage | 9/10 | Any web content | |
| 139 | +| Maintenance Status | 8/10 | Actively maintained by Google | |
| 140 | +| Dependencies | Very High | Chromium bundle | |
| 141 | +| Error Handling | 7/10 | Good error messages | |
| 142 | + |
| 143 | +**Use Case:** Fallback for JS-heavy sites only. |
| 144 | + |
| 145 | +#### Playwright |
| 146 | + |
| 147 | +| Criterion | Score | Notes | |
| 148 | +|-----------|-------|-------| |
| 149 | +| Extraction Accuracy | 9/10 | Renders JS, extracts rendered content | |
| 150 | +| Performance | 3/10 | Slow, resource-intensive | |
| 151 | +| Content Type Coverage | 9/10 | Any web content | |
| 152 | +| Maintenance Status | 9/10 | Microsoft, very active | |
| 153 | +| Dependencies | Very High | Browser binaries required | |
| 154 | +| Error Handling | 8/10 | Excellent error handling | |
| 155 | + |
| 156 | +**Use Case:** Fallback for JS-heavy sites, better cross-browser support than Puppeteer. |
| 157 | + |
| 158 | +### 2.3 Specialized Extractors |
| 159 | + |
| 160 | +#### PDF Parsing |
| 161 | + |
| 162 | +| Library | Criterion | Score | Notes | |
| 163 | +|---------|-----------|-------|-------| |
| 164 | +| `pdf-parse` | Accuracy | 8/10 | Good text extraction | |
| 165 | +| | Performance | 7/10 | Moderate | |
| 166 | +| | Maintenance | 7/10 | Active | |
| 167 | +| `pdfjs-dist` (Mozilla PDF.js) | Accuracy | 9/10 | Excellent, preserves structure | |
| 168 | +| | Performance | 6/10 | Slower | |
| 169 | +| | Maintenance | 9/10 | Very active (Mozilla) | |
| 170 | +| `pdf.js-extract` | Accuracy | 7/10 | Basic extraction | |
| 171 | +| | Performance | 8/10 | Fast | |
| 172 | +| | Maintenance | 5/10 | Low activity | |
| 173 | + |
| 174 | +#### Video Metadata |
| 175 | + |
| 176 | +| Library | Criterion | Score | Notes | |
| 177 | +|---------|-----------|-------|-------| |
| 178 | +| `yt-dlp` | Accuracy | 9/10 | Best for YouTube; an external command-line tool, not an npm package | |
| 179 | +| | Performance | 7/10 | Good | |
| 180 | +| | Coverage | YouTube, Vimeo, etc. | Wide coverage | |
| 181 | +| `playwright` | Accuracy | 8/10 | Works for embeds | |
| 182 | +| | Performance | 4/10 | Slow | |
| 183 | +| | Coverage | Any video site | Flexible | |
| 184 | + |
| 185 | +--- |
| 186 | + |
| 187 | +## 3. When to Use Each Approach |
| 188 | + |
| 189 | +### 3.1 Decision Matrix |
| 190 | + |
| 191 | +| Scenario | Recommended Primary | Recommended Fallback | |
| 192 | +|----------|--------------------|--------------------| |
| 193 | +| Standard article (news, blog) | @extractus/article-extractor | @mozilla/readability | |
| 194 | +| JavaScript-heavy site (SPA) | @extractus/article-extractor | Playwright | |
| 195 | +| Known JS-heavy platform | Playwright (skip the fast tiers, see 4.2) | N/A | |
| 196 | +| PDF documents | pdf-parse | pdfjs-dist | |
| 197 | +| YouTube videos | yt-dlp | N/A | |
| 198 | +| General video embeds | Playwright | N/A | |
| 199 | +| GitHub README | @extractus/article-extractor | cheerio custom | |
| 200 | +| Documentation sites | @mozilla/readability | Puppeteer | |
| 201 | + |
| 202 | +### 3.2 Performance vs. Quality Trade-offs |
| 203 | + |
| 204 | +``` |
| 205 | +Ordered from fastest (top) to slowest (bottom), per the timings in 5.1. |
| 206 | +Browser rendering at the bottom is the most accurate on JS-heavy pages. |
| 207 | +cheerio (custom) |
| 208 | +@mozilla/readability |
| 209 | +@extractus/article-extractor |
| 210 | +@postlight/parser |
| 211 | +JSDOM-based |
| 212 | +Playwright/Puppeteer (slowest) |
| 213 | +``` |
| 214 | + |
| 215 | +--- |
| 216 | + |
| 217 | +## 4. Fallback Strategies |
| 218 | + |
| 219 | +### 4.1 Tiered Extraction Approach |
| 220 | + |
| 221 | +``` |
| 222 | +Tier 1: Fast Path (Default) |
| 223 | +├── Library: @extractus/article-extractor |
| 224 | +├── Timeout: 10 seconds |
| 225 | +└── Expected: 80-90% of URLs |
| 226 | + |
| 227 | +Tier 2: Fallback |
| 228 | +├── Library: @mozilla/readability |
| 229 | +├── Timeout: 15 seconds |
| 230 | +└── Expected: 5-8% of URLs |
| 231 | +
|
| 232 | +Tier 3: Heavy Rendering (JS-heavy) |
| 233 | +├── Library: Playwright |
| 234 | +├── Timeout: 30 seconds |
| 235 | +├── Conditions: Tier 1+2 failed OR known JS-heavy site |
| 236 | +└── Expected: 2-5% of URLs |
| 237 | +``` |
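The tiers above could be driven by a small runner like the following. This is a minimal sketch, not OpenBrain's implementation: the `Tier` and `Extractor` types, `withTimeout`, and `extractTiered` are hypothetical names, and each real library would need to be wrapped in an `Extractor` function by the caller.

```typescript
// Hypothetical types: none of these names come from OpenBrain or the libraries.
type Extractor = (url: string) => Promise<string | null>;

interface Tier {
  name: string;
  timeoutMs: number;
  extract: Extractor;
}

// Minimum usable content length, taken from the "Content < 100 characters" trigger.
const MIN_CONTENT_LENGTH = 100;

// Reject if the wrapped promise does not settle within timeoutMs.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try each tier in order; return the first result with enough content.
async function extractTiered(
  url: string,
  tiers: Tier[],
): Promise<{ tier: string; content: string } | null> {
  for (const tier of tiers) {
    try {
      const content = await withTimeout(tier.extract(url), tier.timeoutMs);
      if (content !== null && content.trim().length >= MIN_CONTENT_LENGTH) {
        return { tier: tier.name, content };
      }
    } catch {
      // Timeout or extractor error: fall through to the next tier.
    }
  }
  return null; // every tier failed
}
```

A caller would pass the tiers in priority order, for example with timeouts of 10000, 15000, and 30000 ms to mirror the diagram above.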
| 238 | + |
| 239 | +### 4.2 Automatic Fallback Triggers |
| 240 | + |
| 241 | +| Condition | Action | |
| 242 | +|-----------|--------| |
| 243 | +| Empty content returned | Try next tier | |
| 244 | +| Content < 100 characters | Try next tier | |
| 245 | +| Known JS-heavy domain | Skip to Playwright | |
| 246 | +| Rate limit detected | Backoff + retry | |
| 247 | +| Timeout exceeded | Try next tier | |
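The trigger table can be read as a single decision function. The sketch below is illustrative only: the `Attempt` shape, the `decide` name, the order in which conditions are checked, and the use of HTTP 429 as the rate-limit signal are all assumptions, not behaviour documented for OpenBrain.

```typescript
type Action = "accept" | "next-tier" | "skip-to-playwright" | "backoff-retry";

// Hypothetical description of one extraction attempt.
interface Attempt {
  content: string | null;
  httpStatus?: number;   // assumption: 429 means the server is rate limiting us
  timedOut: boolean;
  knownJsHeavy: boolean; // e.g. the domain is on a configured list
}

function decide(attempt: Attempt): Action {
  if (attempt.knownJsHeavy) return "skip-to-playwright";
  if (attempt.httpStatus === 429) return "backoff-retry";
  if (attempt.timedOut) return "next-tier";
  const text = attempt.content === null ? "" : attempt.content.trim();
  // Covers both "empty content" and "content < 100 characters".
  if (text.length < 100) return "next-tier";
  return "accept";
}
```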
| 248 | + |
| 249 | +### 4.3 Site-Specific Rules |
| 250 | + |
| 251 | +```typescript |
| 252 | +const siteOverrides = { |
| 253 | + // Skip fast path for known problematic sites |
| 254 | + 'twitter.com': { skipTiers: [1, 2], use: 'playwright' }, |
| 255 | + 'x.com': { skipTiers: [1, 2], use: 'playwright' }, |
| 256 | + 'github.com': { use: '@extractus' }, // Works well with fast path |
| 257 | + |
| 258 | + // Known fast sites |
| 259 | + 'medium.com': { timeout: 5000 }, |
| 260 | + 'dev.to': { timeout: 5000 }, |
| 261 | +}; |
| 262 | +``` |
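One way to apply these overrides is to look up the URL's hostname and then each parent domain, so that `mobile.twitter.com` picks up the `twitter.com` rule. The sketch below assumes that matching behaviour; the `SiteOverride` interface and the `findOverride` helper are hypothetical and only the override entries come from the snippet above.

```typescript
interface SiteOverride {
  skipTiers?: number[];
  use?: string;
  timeout?: number; // milliseconds
}

const siteOverrides: Record<string, SiteOverride> = {
  "twitter.com": { skipTiers: [1, 2], use: "playwright" },
  "x.com": { skipTiers: [1, 2], use: "playwright" },
  "github.com": { use: "@extractus" },
  "medium.com": { timeout: 5000 },
  "dev.to": { timeout: 5000 },
};

// Hypothetical helper: try the full hostname first, then each parent domain.
function findOverride(
  url: string,
  overrides: Record<string, SiteOverride>,
): SiteOverride | undefined {
  const labels = new URL(url).hostname.toLowerCase().split(".");
  // Stop before the bare top-level label ("com", "to", ...).
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i).join(".");
    if (Object.prototype.hasOwnProperty.call(overrides, candidate)) {
      return overrides[candidate];
    }
  }
  return undefined;
}
```

Suffix matching is a design choice: it keeps the table short, but a rule for a parent domain then also applies to every subdomain, which may not always be wanted.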
| 263 | + |
| 264 | +--- |
| 265 | + |
| 266 | +## 5. Performance Benchmarks |
| 267 | + |
| 268 | +### 5.1 Extraction Speed (Single Page) |
| 269 | + |
| 270 | +| Library | Cold Fetch | Warm Fetch | Memory | |
| 271 | +|---------|-----------|------------|--------| |
| 272 | +| @extractus/article-extractor | 200-400ms | 100-200ms | ~20MB | |
| 273 | +| @mozilla/readability | 150-300ms | 80-150ms | ~15MB | |
| 274 | +| @postlight/parser | 300-600ms | 200-400ms | ~40MB | |
| 275 | +| cheerio (custom) | 100-250ms | 50-100ms | ~10MB | |
| 276 | +| JSDOM | 400-800ms | 300-500ms | ~50MB | |
| 277 | +| Playwright | 2000-5000ms | 1500-3000ms | ~200MB | |
| 278 | + |
| 279 | +### 5.2 Throughput (Requests per Minute) |
| 280 | + |
| 281 | +| Library | RPM (Cold) | RPM (Warm) | |
| 282 | +|---------|-----------|------------| |
| 283 | +| @extractus/article-extractor | 150-300 | 300-600 | |
| 284 | +| @mozilla/readability | 200-400 | 400-800 | |
| 285 | +| @postlight/parser | 100-200 | 150-300 | |
| 286 | +| cheerio (custom) | 250-500 | 500-1000 | |
| 287 | +| JSDOM | 75-150 | 120-200 | |
| 288 | +| Playwright | 12-30 | 20-40 | |
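Most rows in this table are consistent with sequential, single-worker throughput derived from the latencies in 5.1, that is, RPM is roughly 60,000 divided by the per-page latency in milliseconds. That relationship is inferred from the numbers rather than stated by the document, and a few entries (the cheerio rows, for example) do not match it exactly. A minimal sketch of the conversion, with `rpmFromLatency` as a hypothetical helper name:

```typescript
// Sequential requests per minute for a given per-page latency in milliseconds.
function rpmFromLatency(latencyMs: number): number {
  if (latencyMs <= 0) {
    throw new RangeError("latencyMs must be positive");
  }
  return Math.floor(60000 / latencyMs);
}

// @extractus cold fetch, 200-400 ms per page:
console.log(rpmFromLatency(400), rpmFromLatency(200)); // 150 300
// Playwright cold fetch, 2000-5000 ms per page:
console.log(rpmFromLatency(5000), rpmFromLatency(2000)); // 12 30
```

Running extractions concurrently would raise these figures, so they are best read as a per-worker baseline.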
| 289 | + |
| 290 | +### 5.3 Accuracy by Content Type |
| 291 | + |
| 292 | +| Library | News | Blog | Forum | GitHub | E-commerce | Newsletter | |
| 293 | +|---------|------|------|-------|--------|------------|------------| |
| 294 | +| @extractus | 85% | 90% | 60% | 70% | 50% | 40% | |
| 295 | +| readability | 90% | 92% | 65% | 75% | 55% | 45% | |
| 296 | +| postlight | 92% | 93% | 70% | 80% | 60% | 50% | |
| 297 | +| cheerio | 75% | 80% | 70% | 85% | 65% | 55% | |
| 298 | + |
| 299 | +--- |
| 300 | + |
| 301 | +## 6. Architecture Decision Record |
| 302 | + |
| 303 | +### ADR-001: Content Extraction Strategy |
| 304 | + |
| 305 | +**Status:** Proposed |
| 306 | + |
| 307 | +**Context:** |
| 308 | +OpenBrain currently uses @extractus/article-extractor for web content extraction. While it works well for standard articles, it struggles with: |
| 309 | +- JavaScript-heavy sites (SPAs, React/Vue apps) |
| 310 | +- Sites with lazy-loaded content |
| 311 | +- Paywalled or gated content |
| 312 | + |
| 313 | +**Decision:** |
| 314 | +Implement a tiered extraction approach: |
| 315 | +1. **Tier 1:** @extractus/article-extractor (fast path) |
| 316 | +2. **Tier 2:** @mozilla/readability (fallback) |
| 317 | +3. **Tier 3:** Playwright (JS-heavy sites, last resort) |
| 318 | + |
| 319 | +**Rationale:** |
| 320 | +- 80-90% of URLs are standard articles that work with Tier 1 |
| 321 | +- Adding Tier 2 captures most remaining cases with minimal overhead |
| 322 | +- Playwright is reserved for known problematic cases to avoid performance impact |
| 323 | + |
| 324 | +**Consequences:** |
| 325 | +- Increased extraction latency for fallback cases (acceptable) |
| 326 | +- Larger dependency tree (Playwright is heavy) |
| 327 | +- More complex error handling |
| 328 | +- Better success rate for difficult URLs |
| 329 | + |
| 330 | +**Alternatives Considered:** |
| 331 | +1. **Switch to @mozilla/readability** - Better maintained but similar limitations |
| 332 | +2. **Custom cheerio extractor** - Flexibility but high maintenance burden |
| 333 | +3. **Always use Playwright** - Maximum accuracy but poor performance |
| 334 | + |
| 335 | +--- |
| 336 | + |
| 337 | +## 7. Recommendations |
| 338 | + |
| 339 | +### 7.1 Short Term (Low Effort, High Impact) |
| 340 | + |
| 341 | +1. **Add @mozilla/readability as fallback** |
| 342 | + - Not a drop-in replacement: it parses a DOM document (e.g. from JSDOM) and does not fetch pages itself |
| 343 | + - Better maintenance |
| 344 | + - Improved accuracy for edge cases |
| 345 | + |
| 346 | +2. **Implement basic retry logic** |
| 347 | + - Retry on timeout |
| 348 | + - Retry on empty content |
| 349 | + - Exponential backoff |
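The retry behaviour described above might look like the sketch below. All names (`backoffDelays`, `retryWithBackoff`) and the default values (500 ms base delay, 8000 ms cap) are assumptions chosen for illustration; the document does not specify them. The `sleep` parameter is injectable so the helper can be tested without real waiting.

```typescript
// Delay schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(retries: number, baseMs = 500, maxMs = 8000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(maxMs, baseMs * Math.pow(2, attempt)));
  }
  return delays;
}

// Run operation once, then retry after each delay while the result is
// rejected by isAcceptable (e.g. empty content) or the operation throws.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  isAcceptable: (result: T) => boolean,
  delays: number[],
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("no acceptable result");
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      const result = await operation();
      if (isAcceptable(result)) return result;
      lastError = new Error("unacceptable result (for example, empty content)");
    } catch (err) {
      lastError = err; // timeout or network error
    }
    if (attempt < delays.length) await sleep(delays[attempt]);
  }
  throw lastError;
}
```

With the defaults, `backoffDelays(4)` yields delays of 500, 1000, 2000, and 4000 ms, so an operation is attempted at most five times.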
| 350 | + |
| 351 | +### 7.2 Medium Term (Moderate Effort) |
| 352 | + |
| 353 | +3. **Add Playwright fallback for known JS-heavy sites** |
| 354 | + - Site-specific configuration |
| 355 | + - Only activate when Tier 1+2 fail |
| 356 | + - Proper resource cleanup |
| 357 | + |
| 358 | +4. **Add extraction metrics/telemetry** |
| 359 | + - Track success/failure by tier |
| 360 | + - Monitor extraction time |
| 361 | + - Identify problematic domains |
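A minimal in-memory sketch of such telemetry follows. The `ExtractionMetrics` class and its method names are hypothetical; a real implementation would likely export these counters to whatever logging or metrics system OpenBrain already uses.

```typescript
interface TierStats {
  attempts: number;
  successes: number;
  totalMs: number;
}

class ExtractionMetrics {
  private byTier: Record<string, TierStats> = {};
  private failuresByDomain: Record<string, number> = {};

  // Record one extraction attempt for a tier and domain.
  record(tier: string, domain: string, success: boolean, durationMs: number): void {
    if (!this.byTier[tier]) {
      this.byTier[tier] = { attempts: 0, successes: 0, totalMs: 0 };
    }
    const stats = this.byTier[tier];
    stats.attempts += 1;
    stats.totalMs += durationMs;
    if (success) {
      stats.successes += 1;
    } else {
      this.failuresByDomain[domain] = (this.failuresByDomain[domain] || 0) + 1;
    }
  }

  // Fraction of successful attempts for a tier (0 if the tier was never used).
  successRate(tier: string): number {
    const stats = this.byTier[tier];
    return stats ? stats.successes / stats.attempts : 0;
  }

  // Mean extraction time in milliseconds for a tier (0 if never used).
  averageMs(tier: string): number {
    const stats = this.byTier[tier];
    return stats ? stats.totalMs / stats.attempts : 0;
  }

  // Domains with at least minFailures failed attempts across all tiers.
  problematicDomains(minFailures: number): string[] {
    return Object.keys(this.failuresByDomain)
      .filter((domain) => this.failuresByDomain[domain] >= minFailures)
      .sort();
  }
}
```

Domains that keep appearing in `problematicDomains` are natural candidates for the site-specific overrides in 4.3.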
| 362 | + |
| 363 | +### 7.3 Long Term (Higher Effort) |
| 364 | + |
| 365 | +5. **Evaluate specialized extractors for PDFs and videos** |
| 366 | + - Only if these content types are important |
| 367 | + - Can be implemented as plugins |
| 368 | + |
| 369 | +6. **Consider site-specific extractors** |
| 370 | + - YouTube, Twitter, GitHub have unique structures |
| 371 | + - Custom handling for high-value sources |
| 372 | + |
| 373 | +--- |
| 374 | + |
| 375 | +## 8. Related Work |
| 376 | + |
| 377 | +| Work Item | Title | Status | |
| 378 | +|-----------|-------|--------| |
| 379 | +| SB-0MNHOYCUK000RALJ | Use playwright retrieve content if existing retrieval path fails | Open | |
| 380 | +| SB-0MNBSEY9G0055O5N | Document content extraction strategy and alternatives | In Progress | |
| 381 | + |
| 382 | +--- |
| 383 | + |
| 384 | +## 9. Evaluation Criteria Summary |
| 385 | + |
| 386 | +| Criterion | Current (@extractus) | Recommended (Tiered) | |
| 387 | +|-----------|---------------------|---------------------| |
| 388 | +| Extraction Accuracy | 70% | 85-90% | |
| 389 | +| Performance (avg) | 300ms | 400ms (with fallbacks) | |
| 390 | +| Content Coverage | 75% | 90% | |
| 391 | +| Maintenance Score | 4/10 | 7/10 (mixed) | |
| 392 | +| Bundle Size Impact | Low | Medium | |
| 393 | +| Error Handling | Basic | Advanced | |