feat: GraphQL-based scrapeThread, scrapePost, x_read_post, and shared infrastructure #23

Open
nj-io wants to merge 1 commit into nirholas:main from nj-io:feat/graphql-read-infrastructure

nj-io commented Apr 7, 2026

Summary

Rewrites scrapeThread and adds scrapePost/x_read_post using X's TweetDetail GraphQL API. Introduces shared helpers and multi-tab browser isolation for all GraphQL-based scrapers.

Why GraphQL instead of DOM scraping

  • X doesn't render self-reply threads as DOM elements (scrapeThread returned empty)
  • X virtualizes the DOM — only ~1-2 tweets visible at a time in headless Puppeteer
  • screen_name moved from user.legacy to user.core in X's GraphQL schema
  • GraphQL API returns full_text (no truncation, no "Show more" clicking)
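
The field moves described above could be handled defensively along these lines. This is a minimal sketch, not the PR's code: the helper names are hypothetical, and the nested note_tweet path is an assumption about the response shape rather than something stated in this description.

```javascript
// Hypothetical helpers; field paths are assumptions based on the notes above.

// Read the handle from the new location, falling back to the old one
// so that older payloads still parse.
function getScreenName(userResult) {
  return (
    userResult?.core?.screen_name ??
    userResult?.legacy?.screen_name ??
    null
  );
}

// Prefer long-form note_tweet text when present, otherwise use full_text.
// The note_tweet_results.result.text path is assumed, not confirmed here.
function getTweetText(tweetResult) {
  const noteText = tweetResult?.note_tweet?.note_tweet_results?.result?.text;
  return noteText ?? tweetResult?.legacy?.full_text ?? "";
}
```

Falling back instead of hard-coding one path keeps the parser working if X moves a field again or serves mixed payloads.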

New tool: x_read_post

Give it any tweet URL. Returns:

  • Thread detection — author self-replies returned in order
  • Rich data per tweet — text (note_tweet for >280 chars), media (images + best video URL), X Articles (title + cover + URL), cards (link previews), external URLs, engagement stats
  • Recursive quote tweets — up to 5 levels deep, each resolved as its own thread
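
Thread detection of this kind can be sketched as a walk over reply links. The function below is illustrative only: it assumes tweets have already been flattened to `{ id, author, inReplyTo }` objects (hypothetical field names, with `inReplyTo` standing in for in_reply_to_status_id_str), and it is not the PR's parseThreadFromEntries.

```javascript
// Sketch: build the author's self-reply chain around a focal tweet.
// Input shape { id, author, inReplyTo } is an assumption for illustration.
function buildSelfReplyChain(tweets, focalId) {
  const byId = new Map(tweets.map((t) => [t.id, t]));
  const focal = byId.get(focalId);
  if (!focal) return [];
  const author = focal.author;

  // Walk up to the first tweet of the chain written by the same author.
  let root = focal;
  const seen = new Set([root.id]);
  for (;;) {
    const parent = byId.get(root.inReplyTo);
    if (!parent || parent.author !== author || seen.has(parent.id)) break;
    root = parent;
    seen.add(root.id);
  }

  // Index the first same-author reply to each tweet.
  const selfReplyTo = new Map();
  for (const t of tweets) {
    if (t.author === author && t.inReplyTo != null && !selfReplyTo.has(t.inReplyTo)) {
      selfReplyTo.set(t.inReplyTo, t);
    }
  }

  // Walk down from the root, collecting the chain in order.
  const chain = [];
  const visited = new Set();
  for (let cur = root; cur && !visited.has(cur.id); cur = selfReplyTo.get(cur.id)) {
    chain.push(cur);
    visited.add(cur.id);
  }
  return chain;
}
```

Replies from other accounts are skipped because only same-author links are followed, which is what keeps the result a thread and not the whole conversation.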

Shared helpers

| Helper | Purpose |
| --- | --- |
| fetchTweetDetail | GraphQL API caller with 2x retry, backoff on 429, page reload on missing ct0 |
| parseTweetResult | Rich data: text, media, articles, cards, URLs, engagement |
| parseThreadFromEntries | Self-reply chain detection via in_reply_to_status_id_str |
| checkAuth | Fails fast on expired cookies |
| randomDelay | Log-normal distribution (2-7s) + 8% distraction spikes (8-20s) |
| newTab | Per-call tab isolation with 60s default timeout |
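
As an illustration of the randomDelay row, a log-normal delay with occasional long pauses might be sampled as below. The median (3.5s) and spread (0.35) are assumed values chosen for the sketch, the clamping to 2-7s is one possible reading of the stated range, and `sampleDelayMs` is a hypothetical name.

```javascript
// Sketch of a human-like delay sampler; parameters are assumptions.
function sampleDelayMs(rng = Math.random) {
  // Roughly 8% of the time, simulate a distraction: a uniform 8-20s pause.
  if (rng() < 0.08) return 8000 + rng() * 12000;

  // Box-Muller transform: two uniform samples -> one standard normal sample.
  const u1 = 1 - rng(); // in (0, 1], avoids log(0)
  const u2 = rng();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);

  // Exponentiate to get a log-normal value, then clamp to the 2-7s range.
  const ms = Math.exp(Math.log(3500) + 0.35 * z);
  return Math.min(7000, Math.max(2000, ms));
}

// Resolves after the sampled delay, returning the delay that was used.
function randomDelay(rng = Math.random) {
  const ms = sampleDelayMs(rng);
  return new Promise((resolve) => setTimeout(resolve, ms, ms));
}
```

A log-normal shape clusters most waits near the median while still allowing a tail of longer pauses, which looks less mechanical than a uniform range.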

Multi-tab isolation

Each tool call (x_read_post, x_get_thread) creates its own browser tab. Tabs share cookies/auth (same browser) but don't conflict. Concurrent calls from different Claude Code sessions are safe. Tabs auto-close after the call.
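
A per-call tab wrapper in this spirit could look like the following sketch. It uses Puppeteer's `browser.newPage()`, `page.setDefaultTimeout()` and `page.close()`; the `withTab` name and its signature are hypothetical and not necessarily how newTab is written in this PR.

```javascript
// Sketch: run one task in its own tab and always close the tab afterwards.
// `browser` is expected to be a Puppeteer Browser (or a compatible object).
async function withTab(browser, task, timeoutMs = 60_000) {
  const page = await browser.newPage(); // new tab, shares the browser's cookies
  page.setDefaultTimeout(timeoutMs);    // applies to waits and navigation on this tab
  try {
    return await task(page);
  } finally {
    // Close even when the task throws; ignore errors from an already-closed tab.
    await page.close().catch(() => {});
  }
}
```

Because each call owns its page object, two concurrent calls never navigate the same tab out from under each other, while auth state stays shared at the browser level.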

Supersedes

Test plan

  • scrapeThread: 2-tweet thread (Aakash), 15-tweet thread (Breedlove)
  • scrapePost: single post, thread, 3-level recursive QTs (Brian→pitdesi→sarthakgh)
  • scrapePost: thread quoting thread quoting thread (neural_avb→Prince_Canuma→MaziyarPanahi)
  • Rich data: articles, cards, external URLs (Substack, GitHub), videos, images
  • Error surfacing: returns { thread: [], error: "..." } on failure
  • Multi-tab: concurrent calls don't deadlock
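
The recursive quote-tweet and error-surfacing cases in the test plan can be pictured with a depth-limited resolver like the one below. It is a sketch under assumptions: `fetchThread` is an injected, hypothetical function returning an array of tweets, and `quotedId` / `quoted` are illustrative field names, not the PR's actual output schema.

```javascript
// Sketch: resolve quoted tweets recursively, at most 5 levels below the root.
const MAX_QUOTE_DEPTH = 5;

async function readPost(tweetId, fetchThread, depth = 0) {
  try {
    const tweets = await fetchThread(tweetId);
    const thread = [];
    for (const tweet of tweets) {
      const entry = { ...tweet };
      // Each quoted tweet is resolved as its own thread, one level deeper.
      if (tweet.quotedId && depth < MAX_QUOTE_DEPTH) {
        entry.quoted = await readPost(tweet.quotedId, fetchThread, depth + 1);
      }
      thread.push(entry);
    }
    return { thread };
  } catch (err) {
    // Surface the failure in the result instead of throwing to the caller.
    return { thread: [], error: String(err?.message ?? err) };
  }
}
```

Returning `{ thread: [], error }` at every level means a failed quote lookup degrades only that branch; the outer thread is still returned.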

🤖 Generated with Claude Code

Rewrites scrapeThread and adds scrapePost using X's TweetDetail GraphQL
API instead of DOM scraping. Introduces shared infrastructure for all
GraphQL-based scrapers.

New tools:
- x_read_post: read any tweet with full rich data, recursive QT resolution

Shared helpers:
- fetchTweetDetail: GraphQL API caller with retry/backoff on rate limits
- parseTweetResult: rich data extraction (text, media, articles, cards,
  URLs, engagement)
- parseThreadFromEntries: self-reply thread chain detection
- checkAuth: post-navigation auth guard
- randomDelay: log-normal distribution with distraction spikes
- newTab: per-call tab isolation (shared browser, separate pages)
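
For illustration, the retry/backoff behaviour of a caller like fetchTweetDetail might be sketched as follows. The request itself is injected as a hypothetical `doRequest` function returning `{ status, body }`, and the 5s base delay with doubling is an assumed schedule; the ct0 reload step is not modelled.

```javascript
// Sketch: retry a request up to `retries` extra times, backing off on 429.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(
  doRequest,
  { retries = 2, baseDelayMs = 5000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await doRequest(attempt);
      if (res.status === 429) {
        // Rate limited: wait longer on each attempt (base, 2x base, ...).
        lastError = new Error("rate limited (429)");
        if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
        continue;
      }
      if (res.status !== 200) throw new Error(`HTTP ${res.status}`);
      return res.body;
    } catch (err) {
      // Other failures are retried immediately, without backoff.
      lastError = err;
    }
  }
  throw lastError;
}
```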

scrapeThread rewrite:
- Uses GraphQL API instead of DOM scraping
- Gets full_text (no truncation), note_tweet support
- screen_name from user.core (X moved it from user.legacy)

scrapePost:
- Handles single posts and threads
- Recursive quote tweet resolution (up to 5 levels)
- Each tweet: text, media, articles, cards, external URLs, engagement
- Error surfacing: returns { thread: [], error: "..." } on failure

Multi-tab isolation:
- x_read_post and x_get_thread each create their own browser tab
- Tabs share cookies/auth, don't conflict on concurrent calls
- 60s default timeout per tab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nj-io requested a review from nirholas as a code owner April 7, 2026 06:02

vercel bot commented Apr 7, 2026

@nj-io is attempting to deploy a commit to the kaivocmenirehtacgmailcom's projects Team on Vercel.

A member of the Team first needs to authorize it.

nj-io added a commit to nj-io/XActions that referenced this pull request Apr 7, 2026
Two new tools for scraping and deeply reading liked tweets.

x_get_likes — fast GraphQL-based likes index:
- Likes GraphQL API with cursor pagination (50 in 14s, 200 in 49s)
- JSONL output to ~/.xactions/exports/
- from/to timestamp filtering with early exit
- Rich data via parseTweetResult
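
Cursor pagination with a from/to window and early exit could be sketched like this. It assumes likes arrive newest-first with a numeric `timestamp` field, and that an injected, hypothetical `fetchPage(cursor)` returns `{ items, nextCursor }`; none of these names come from the commit itself.

```javascript
// Sketch: page through likes, keep those inside [from, to], stop early once
// an item older than `from` appears (valid only for newest-first ordering).
async function collectLikes(fetchPage, { from = -Infinity, to = Infinity } = {}) {
  const kept = [];
  let cursor = null;
  do {
    const { items, nextCursor } = await fetchPage(cursor);
    for (const item of items) {
      if (item.timestamp > to) continue;      // newer than the window: skip
      if (item.timestamp < from) return kept; // older than the window: early exit
      kept.push(item);
    }
    // An empty page ends pagination even if a cursor is returned.
    cursor = items.length > 0 ? nextCursor : null;
  } while (cursor);
  return kept;
}

// Serialize one JSON object per line, as in a JSONL export.
const toJsonl = (items) => items.map((item) => JSON.stringify(item) + "\n").join("");
```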

x_discover_likes — interleaved fetch + deep read:
- Fetches likes via API, deep-reads each via scrapePost
- Human-like pacing: 3-8s between pages, 2-5s before reads, 5-15s after
- Produces two JSONL files: likes index + deep reads
- ~38s per tweet average

Both use multi-tab isolation (newTab) for concurrent safety.
Removes x_get_likes from xeepyTools, deletes old DOM handler.

Depends on: nirholas#23 (shared infrastructure)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>