feat: GraphQL-based scrapeThread, scrapePost, x_read_post, and shared infrastructure by nj-io · Pull Request #23 · nirholas/XActions

nj-io · 2026-04-07T06:02:50Z

Summary

Rewrites scrapeThread and adds scrapePost/x_read_post using X's TweetDetail GraphQL API. Introduces shared helpers and multi-tab browser isolation for all GraphQL-based scrapers.

Why GraphQL instead of DOM scraping

X doesn't render self-reply threads as DOM elements (scrapeThread returned empty)
X virtualizes the DOM — only ~1-2 tweets visible at a time in headless Puppeteer
screen_name moved from user.legacy to user.core in X's GraphQL schema
GraphQL API returns full_text (no truncation, no "Show more" clicking)
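
The screen_name move can be handled defensively by checking both locations. A minimal sketch, assuming a user object nested as described above; `getScreenName` and the sample objects are hypothetical and not taken from this PR:

```javascript
// Hypothetical helper: read screen_name from user.core first,
// then fall back to the older user.legacy location.
function getScreenName(user) {
  return user?.core?.screen_name ?? user?.legacy?.screen_name ?? null;
}

// Sample objects are illustrative, not real API responses.
const newShape = { core: { screen_name: 'alice' }, legacy: {} };
const oldShape = { legacy: { screen_name: 'bob' } };

console.log(getScreenName(newShape)); // alice
console.log(getScreenName(oldShape)); // bob
console.log(getScreenName({}));       // null
```

Keeping the legacy fallback costs little and lets the same parser read both older and newer response shapes.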

New tool: x_read_post

Give it any tweet URL. Returns:

Thread detection — author self-replies returned in order
Rich data per tweet — text (note_tweet for >280 chars), media (images + best video URL), X Articles (title + cover + URL), cards (link previews), external URLs, engagement stats
Recursive quote tweets — up to 5 levels deep, each resolved as its own thread
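
The depth-limited quote resolution could look roughly like the sketch below. It assumes a hypothetical `fetchPost(url)` that resolves to `{ thread: [...] }` and a hypothetical `quotedUrl` field on each tweet; neither name comes from this PR, and the network call is replaced by an in-memory stand-in.

```javascript
const MAX_QUOTE_DEPTH = 5; // limit stated in the PR description

// Hypothetical: fetchPost(url) resolves to { thread: [tweet, ...] },
// where a tweet may carry a quotedUrl field. Both names are assumptions.
async function readPost(url, fetchPost, depth = 0) {
  const { thread } = await fetchPost(url);
  for (const tweet of thread) {
    if (tweet.quotedUrl && depth < MAX_QUOTE_DEPTH) {
      // Each quoted tweet is resolved as its own thread, one level deeper.
      tweet.quoted = await readPost(tweet.quotedUrl, fetchPost, depth + 1);
    }
  }
  return { thread };
}

// Illustrative in-memory stand-in for the network call.
const fakePosts = {
  a: { thread: [{ id: '1', quotedUrl: 'b' }] },
  b: { thread: [{ id: '2' }] },
};
const fakeFetch = async (url) => JSON.parse(JSON.stringify(fakePosts[url]));

readPost('a', fakeFetch).then((result) => {
  console.log(result.thread[0].quoted.thread[0].id); // 2
});
```

Passing the depth as a parameter keeps the limit local to each call chain, so concurrent reads do not share a counter.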

Shared helpers

Helper	Purpose
`fetchTweetDetail`	GraphQL API caller with 2x retry, backoff on 429, page reload on missing ct0
`parseTweetResult`	Rich data: text, media, articles, cards, URLs, engagement
`parseThreadFromEntries`	Self-reply chain detection via `in_reply_to_status_id_str`
`checkAuth`	Fails fast on expired cookies
`randomDelay`	Log-normal distribution (2-7s) + 8% distraction spikes (8-20s)
`newTab`	Per-call tab isolation with 60s default timeout
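
As an illustration of self-reply chain detection via `in_reply_to_status_id_str`, here is a minimal sketch. The flat tweet shape (`id`, `authorId`) and the function name `selfReplyChain` are assumptions made for the example; the PR's `parseThreadFromEntries` works on GraphQL entries and may differ.

```javascript
// Hypothetical sketch: starting from a root tweet, follow replies written by
// the same author whose in_reply_to_status_id_str points at the previous tweet.
function selfReplyChain(tweets, rootId) {
  // Index replies by the id of the tweet they reply to.
  const byParent = new Map();
  for (const t of tweets) {
    const parent = t.in_reply_to_status_id_str;
    if (!parent) continue;
    if (!byParent.has(parent)) byParent.set(parent, []);
    byParent.get(parent).push(t);
  }

  const root = tweets.find((t) => t.id === rootId);
  if (!root) return [];

  const chain = [root];
  let current = root;
  while (true) {
    const next = (byParent.get(current.id) ?? []).find(
      (t) => t.authorId === root.authorId
    );
    // Stop when the author has no further reply (or on malformed cyclic data).
    if (!next || chain.includes(next)) break;
    chain.push(next);
    current = next;
  }
  return chain;
}

// Illustrative data: author "a" replies to themselves twice; "b" replies are skipped.
const sampleTweets = [
  { id: '1', authorId: 'a', in_reply_to_status_id_str: null },
  { id: '2', authorId: 'b', in_reply_to_status_id_str: '1' },
  { id: '3', authorId: 'a', in_reply_to_status_id_str: '1' },
  { id: '4', authorId: 'a', in_reply_to_status_id_str: '3' },
  { id: '5', authorId: 'b', in_reply_to_status_id_str: '4' },
];

console.log(selfReplyChain(sampleTweets, '1').map((t) => t.id).join(',')); // 1,3,4
```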

Multi-tab isolation

Each tool call (x_read_post, x_get_thread) creates its own browser tab. Tabs share cookies/auth (same browser) but don't conflict. Concurrent calls from different Claude Code sessions are safe. Tabs auto-close after the call.
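
One way to get this per-call isolation is to wrap each call in a helper that opens a page and always closes it. The sketch below is an assumption about the shape, not the PR's `newTab` implementation: `withTab` is a hypothetical name, while `browser.newPage()`, `page.setDefaultTimeout()`, and `page.close()` are standard Puppeteer methods.

```javascript
// Hypothetical wrapper: open a fresh page per call and close it afterwards.
// `browser` is expected to be a Puppeteer Browser instance.
async function withTab(browser, work, timeoutMs = 60000) {
  const page = await browser.newPage();
  page.setDefaultTimeout(timeoutMs); // 60s default, as described in the PR
  try {
    return await work(page);
  } finally {
    await page.close(); // runs on success and on error
  }
}
```

A caller would pass the scraping logic as `work`, for example `withTab(browser, (page) => page.goto(url))`. Because each call owns its page, two concurrent calls never navigate the same tab, while cookies stay shared through the browser.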

Supersedes

Closed: #12 (Fix Twitter thread scraping to stop returning empty results), #13 (feat: enhanced scrapeLikedTweets scraper with rich data extraction), #14 (feat: add x_read_article tool for reading full X Article content), #17 (feat: rewrite scrapeThread to use TweetDetail GraphQL API), and #18 (feat: add x_read_post tool for full rich tweet reading with recursive quote tweets). All are replaced by this unified approach.

Test plan

scrapeThread: 2-tweet thread (Aakash), 15-tweet thread (Breedlove)
scrapePost: single post, thread, 3-level recursive QTs (Brian→pitdesi→sarthakgh)
scrapePost: thread quoting thread quoting thread (neural_avb→Prince_Canuma→MaziyarPanahi)
Rich data: articles, cards, external URLs (Substack, GitHub), videos, images
Error surfacing: returns { thread: [], error: "..." } on failure
Multi-tab: concurrent calls don't deadlock

🤖 Generated with Claude Code

Rewrites scrapeThread and adds scrapePost using X's TweetDetail GraphQL API instead of DOM scraping. Introduces shared infrastructure for all GraphQL-based scrapers.

New tools:
- x_read_post: read any tweet with full rich data, recursive QT resolution

Shared helpers:
- fetchTweetDetail: GraphQL API caller with retry/backoff on rate limits
- parseTweetResult: rich data extraction (text, media, articles, cards, URLs, engagement)
- parseThreadFromEntries: self-reply thread chain detection
- checkAuth: post-navigation auth guard
- randomDelay: log-normal distribution with distraction spikes
- newTab: per-call tab isolation (shared browser, separate pages)

scrapeThread rewrite:
- Uses GraphQL API instead of DOM scraping
- Gets full_text (no truncation), note_tweet support
- screen_name from user.core (X moved it from user.legacy)

scrapePost:
- Handles single posts and threads
- Recursive quote tweet resolution (up to 5 levels)
- Each tweet: text, media, articles, cards, external URLs, engagement
- Error surfacing: returns { thread: [], error: "..." } on failure

Multi-tab isolation:
- x_read_post and x_get_thread each create their own browser tab
- Tabs share cookies/auth, don't conflict on concurrent calls
- 60s default timeout per tab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vercel · 2026-04-07T06:02:56Z

@nj-io is attempting to deploy a commit to the kaivocmenirehtacgmailcom's projects Team on Vercel.

A member of the Team first needs to authorize it.

Two new tools for scraping and deeply reading liked tweets.

x_get_likes, a fast GraphQL-based likes index:
- Likes GraphQL API with cursor pagination (50 in 14s, 200 in 49s)
- JSONL output to ~/.xactions/exports/
- from/to timestamp filtering with early exit
- Rich data via parseTweetResult

x_discover_likes, an interleaved fetch + deep read:
- Fetches likes via API, deep-reads each via scrapePost
- Human-like pacing: 3-8s between pages, 2-5s before reads, 5-15s after
- Produces two JSONL files: likes index + deep reads
- ~38s per tweet average

Both use multi-tab isolation (newTab) for concurrent safety. Removes x_get_likes from xeepyTools, deletes old DOM handler.

Depends on: nirholas#23 (shared infrastructure)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nj-io requested a review from nirholas as a code owner April 7, 2026 06:02

nj-io mentioned this pull request Apr 7, 2026

feat: x_get_likes (GraphQL) and x_discover_likes with human-like deep reads #24

Open

5 tasks

nj-io wants to merge 1 commit into nirholas:main from nj-io:feat/graphql-read-infrastructure
