-
Notifications
You must be signed in to change notification settings - Fork 12
Add mailing list thread structure parsing (bitcoin-dev) #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0d5dded to
f04bdd6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to review this PR further
| thread_depth: Optional[int] = Field( | ||
| default=0, description="Depth in the thread (0 = original post, 1+ = replies)" | ||
| ) | ||
| thread_position: Optional[int] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the description of this I would have gone with position_in_thread. Probably not worth the effort to change it at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @satsie, agreed, that name would describe it better.
But we can probably keep it as-is for now, since it’s already used in the ES data and our summarizers.
btw for the PR description, the note about processing only 2 threads was just for tests, it now works for everything (I edited PR description)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good! Thank you for the additional context. Just to clarify, thread_position flattens the entire thread and defines the position based on the timestamp of the reply, right? Each new reply adds one to the thread position?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, each new reply gets the next number in sequence, regardless of how deeply nested it is in the thread.
it's based on chronological order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is absolutely fabulous Jamal. When you told me how you were going to do it I thought "wow that is going to be a headache" 😆 But you did it and you covered a number of edge cases for when an expected field is missing. Let's hope there are no changes to the formatting of gnusha.org (highly unlikely but worth mentioning).
it was the CI jobs that annoyed me so much, for every little change, you have to wait for the jobs to finish to see if everything works, and then do it all over again lol. That’s why I disabled the AI API in the summarizer, so it’s not consuming too many tokens from the jobs I run every few seconds lol. Thanks for the review tho! I’ll merge all the PRs once Tuedon finishes reviewing the TLDR UI PR bitcoinsearch/tldr#526 |
Adds thread relationship parsing for bitcoin-dev mailing list conversations:
The new fields added are:
thread_depth: nesting level (0=root, 1+=replies)parent_id: ID of parent messagereply_to_author: author being replied tothread_position: chronological positionanchor_id: HTML anchor referencePS: for safety measure to not break existing Elastic search data we we:- only processes 2 specific test threads (Quantum Recovery + Post Quantum Migration)- Skip all other threads to protect existing data