@jrakibi
Contributor

@jrakibi jrakibi commented Sep 22, 2025

Adds thread relationship parsing for bitcoin-dev mailing list conversations:

  • Extracts thread hierarchy from HTML thread overview
  • Adds threading fields: depth, parent_id, reply_to_author, anchor_id
  • Parses parent-child relationships between messages

The new fields added are:

  • thread_depth: nesting level (0=root, 1+=replies)
  • parent_id: ID of parent message
  • reply_to_author: author being replied to
  • thread_position: chronological position
  • anchor_id: HTML anchor reference
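
For illustration, the five fields could be grouped as in the minimal sketch below. It uses a plain dataclass as a stand-in for the Pydantic model the PR touches; the class name `ThreadInfo`, the defaults other than `thread_depth`, and all sample values are assumptions, not taken from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThreadInfo:
    """Hypothetical container for the threading fields listed above."""

    thread_depth: Optional[int] = 0         # 0 = root, 1+ = replies
    parent_id: Optional[str] = None         # ID of parent message (None for the root)
    reply_to_author: Optional[str] = None   # author being replied to
    thread_position: Optional[int] = None   # chronological position (default is an assumption)
    anchor_id: Optional[str] = None         # HTML anchor reference


# Made-up values, for illustration only.
root = ThreadInfo(thread_position=0, anchor_id="m0")
reply = ThreadInfo(
    thread_depth=1,
    parent_id="msg-0",
    reply_to_author="alice",
    thread_position=1,
    anchor_id="m1",
)
```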

PS: As a safety measure, to avoid breaking existing Elasticsearch data, we:
- Only process 2 specific test threads (Quantum Recovery + Post Quantum Migration)
- Skip all other threads to protect existing data

@jrakibi force-pushed the 22-09-threading-replies branch from 0d5dded to f04bdd6 on September 24, 2025 14:50

@satsie satsie left a comment

I still need to review this PR further

thread_depth: Optional[int] = Field(
default=0, description="Depth in the thread (0 = original post, 1+ = replies)"
)
thread_position: Optional[int] = Field(

After reading the description of this I would have gone with position_in_thread. Probably not worth the effort to change it at this point.

Contributor Author

Thanks @satsie, agreed, that name would describe it better.
But we can probably keep it as-is for now, since it’s already used in the ES data and our summarizers.

btw for the PR description, the note about processing only 2 threads was just for tests, it now works for everything (I edited PR description)

All good! Thank you for the additional context. Just to clarify, thread_position flattens the entire thread and defines the position based on the timestamp of the reply, right? Each new reply adds one to the thread position?

Contributor Author

correct, each new reply gets the next number in sequence, regardless of how deeply nested it is in the thread.
it's based on chronological order
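
A minimal sketch of that numbering, to make the distinction between position and depth concrete. It is not the PR's code: the function name `assign_thread_fields`, the dict keys, the sample messages, and the choice to start counting at 0 are all assumptions.

```python
from datetime import datetime
from typing import List


def assign_thread_fields(messages: List[dict]) -> List[dict]:
    """Return copies of the messages with thread_position and thread_depth set.

    Each input dict needs "id", "parent_id" (None for the root) and "timestamp".
    """
    by_id = {m["id"]: m for m in messages}

    def depth(msg_id: str) -> int:
        # Count how many known ancestors a message has; stop on a missing
        # parent or a cycle so malformed input cannot loop forever.
        d, seen = 0, set()
        parent = by_id[msg_id]["parent_id"]
        while parent is not None and parent in by_id and parent not in seen:
            seen.add(parent)
            d += 1
            parent = by_id[parent]["parent_id"]
        return d

    # Position ignores nesting entirely: it is just the rank by timestamp.
    ordered = sorted(messages, key=lambda m: m["timestamp"])
    return [
        {**m, "thread_position": pos, "thread_depth": depth(m["id"])}
        for pos, m in enumerate(ordered)
    ]


# Made-up thread: "c" replies to the root later than the nested reply "b".
sample = [
    {"id": "root", "parent_id": None, "timestamp": datetime(2025, 1, 1, 10, 0)},
    {"id": "a", "parent_id": "root", "timestamp": datetime(2025, 1, 1, 10, 5)},
    {"id": "b", "parent_id": "a", "timestamp": datetime(2025, 1, 1, 10, 10)},
    {"id": "c", "parent_id": "root", "timestamp": datetime(2025, 1, 1, 10, 20)},
]
result = {m["id"]: m for m in assign_thread_fields(sample)}
```

Here `c` ends up with a later position than `b` even though it sits at a shallower depth, which is the "regardless of nesting" behaviour described above.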


@satsie satsie left a comment

This is absolutely fabulous Jamal. When you told me how you were going to do it I thought "wow that is going to be a headache" 😆 But you did it and you covered a number of edge cases for when an expected field is missing. Let's hope there are no changes to the formatting of gnusha.org (highly unlikely but worth mentioning).

@jrakibi
Contributor Author

jrakibi commented Oct 7, 2025

"wow that is going to be a headache" 😆

it was the CI jobs that annoyed me so much, for every little change, you have to wait for the jobs to finish to see if everything works, and then do it all over again lol. That’s why I disabled the AI API in the summarizer, so it’s not consuming too many tokens from the jobs I run every few seconds lol.

Thanks for the review tho! I’ll merge all the PRs once Tuedon finishes reviewing the TLDR UI PR bitcoinsearch/tldr#526

@jrakibi jrakibi merged commit 40ecb1a into master Oct 10, 2025