This project retrieves email content from an Outlook mailbox using the Microsoft Graph API and OAuth, processes it, and generates a QA corpus suitable for model training.
- Objective: Extract emails from an Outlook mailbox, clean the data, and generate a QA dataset.
- Approach: Use the Microsoft Graph API with OAuth authentication to access the mailbox.
- Output: A structured QA corpus saved in CSV format.
- A Microsoft account with Outlook mailbox permissions.
- An application registered in the Azure Portal to obtain an OAuth token.
- A configured Python environment with the required libraries installed (`msgraph-sdk`, `requests`, etc.).
Microsoft no longer supports basic authentication (account username and password) for IMAP connections to Outlook mailboxes. An OAuth token is required instead.
- Visit Microsoft Graph Explorer.
- Sign in and generate a token with `Mail.ReadWrite` permissions.
- Register an application in the Azure Portal.
- Grant the application `Mail.ReadWrite` permissions.
- Use the client credentials flow to obtain a token.
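
The client credentials step can be sketched as a single form-encoded POST to the Microsoft identity platform token endpoint. This is a minimal sketch, not the project's actual code: the tenant ID, client ID, and client secret are placeholders for values from your own app registration, and it uses the standard library's `urllib` instead of `requests` only so that it has no dependencies.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_bytes) for a client credentials token request."""
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" asks for the application permissions already granted
        # to the app registration (for example Mail.ReadWrite).
        "scope": "https://graph.microsoft.com/.default",
    }
    url = TOKEN_URL.format(tenant_id=tenant_id)
    return url, urllib.parse.urlencode(form).encode("utf-8")


def fetch_token(tenant_id, client_id, client_secret):
    """POST the request and return the access token string (needs network)."""
    url, body = build_token_request(tenant_id, client_id, client_secret)
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Tokens obtained this way are app-only, so there is no signed-in user for `/me` to resolve to. With such a token, address the mailbox explicitly as `/users/{id or userPrincipalName}/messages`; `/me/messages` works with delegated tokens such as the one Graph Explorer issues.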
- Documentation: Refer to Microsoft Graph API Documentation.
- Testing: Test the API using Graph Explorer.
- Example: Use the `/me/messages` endpoint to retrieve emails.
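
As an illustration of the endpoint above, the following sketch builds the request URL and headers for listing messages. The helper names, the default page size, and the selected fields are assumptions chosen for this example; `$top` and `$select` are standard Graph query parameters.

```python
import urllib.parse

GRAPH_BASE = "https://graph.microsoft.com/v1.0"

# Message properties that later steps are assumed to need.
DEFAULT_FIELDS = ("id", "conversationId", "subject", "from", "receivedDateTime", "body")


def messages_url(user=None, top=25, fields=DEFAULT_FIELDS):
    """Build the list-messages URL for /me or for a specific user."""
    owner = "me" if user is None else "users/" + urllib.parse.quote(user, safe="@")
    query = urllib.parse.urlencode(
        {"$top": top, "$select": ",".join(fields)},
        safe="$,",  # keep "$" and "," readable in the query string
    )
    return f"{GRAPH_BASE}/{owner}/messages?{query}"


def auth_headers(token):
    """Headers carrying the OAuth bearer token."""
    return {"Authorization": f"Bearer {token}", "Accept": "application/json"}
```

Selecting only the needed fields keeps responses small, which matters when a mailbox holds thousands of messages.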
- Use the Python SDK (`msgraph-sdk`).
- Sample Code: Refer to msgraph-training-python.
- Fetch raw email data using the Graph API.
- Convert the response data into JSON format.
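
Graph returns messages one page at a time, with an `@odata.nextLink` URL pointing at the next page. A minimal sketch of collecting all pages and saving them as JSON might look like the following. Here `fetch_page` is a hypothetical callable, supplied by you, that performs an authenticated GET on a URL and returns the parsed JSON body; it is not part of any library.

```python
import json


def collect_messages(first_url, fetch_page, max_pages=100):
    """Follow @odata.nextLink from page to page and return all messages."""
    messages, url, pages = [], first_url, 0
    while url and pages < max_pages:
        page = fetch_page(url)
        messages.extend(page.get("value", []))
        url = page.get("@odata.nextLink")  # absent on the last page
        pages += 1
    return messages


def save_json(messages, path):
    """Write the raw messages to disk for the later processing steps."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(messages, handle, ensure_ascii=False, indent=2)
```

The `max_pages` cap is only a safety limit for experimentation; raise or remove it for a full export.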
- Filtering Criteria:
- Exclude single-message threads (emails with no replies).
- Remove irrelevant HTML elements (tables, advertisement images, etc.).
- Exclude conversations with ≥ 2 messages where the last message is a forward.
- Process and merge forwarded emails that do not display original content.
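
The filtering rules above could be sketched as follows. This is one possible reading, not the project's actual implementation: it assumes each message is a dict with the Graph properties `conversationId`, `receivedDateTime`, and `subject`, and it detects forwards by a `FW:` or `Fwd:` subject prefix, which is a heuristic. Merging forwarded emails whose original content is hidden is not covered here.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Heuristic (assumption): a forward is recognized by its subject prefix.
FORWARD_PREFIX = re.compile(r"^\s*(fw|fwd)\s*:", re.IGNORECASE)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping tables, scripts, and styles."""

    SKIPPED = {"table", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup; images vanish because they carry no text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def group_threads(messages):
    """Group messages by conversation, oldest message first."""
    threads = defaultdict(list)
    for message in messages:
        threads[message["conversationId"]].append(message)
    for thread in threads.values():
        # ISO 8601 UTC timestamps sort correctly as plain strings.
        thread.sort(key=lambda m: m["receivedDateTime"])
    return dict(threads)


def keep_thread(thread):
    """Drop threads with no replies or whose last message is a forward."""
    if len(thread) < 2:
        return False
    return not FORWARD_PREFIX.match(thread[-1].get("subject") or "")
```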
- Transform email threads into a JSON conversation format.
- Use an LLM to batch-extract QA pairs.
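
One way to picture these two steps is shown below. The conversation schema, the prompt wording, and the batch size are all assumptions made for illustration. `ask_llm` is a hypothetical callable that sends a prompt to whichever model you use and returns its reply as a JSON array string. The `text` key is assumed to hold the cleaned body produced by an earlier step.

```python
import json

# Illustrative prompt; the wording is an assumption, not the project's prompt.
PROMPT = (
    "Extract question-answer pairs from the email conversations below. "
    'Reply with a JSON array of objects with the keys "Question" and "Answer".'
)


def to_conversation(thread):
    """Turn one time-ordered thread into a JSON-serializable conversation."""
    return {
        "conversation_id": thread[0]["conversationId"],
        "subject": thread[0].get("subject", ""),
        "messages": [
            {
                "sender": m.get("from", {}).get("emailAddress", {}).get("address", ""),
                "sent": m["receivedDateTime"],
                "text": m.get("text", ""),  # assumed: cleaned plain-text body
            }
            for m in thread
        ],
    }


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(conversations):
    return PROMPT + "\n\n" + json.dumps(conversations, ensure_ascii=False, indent=2)


def extract_qa(conversations, ask_llm, batch_size=5):
    """Send conversations to the model in batches and collect the QA pairs."""
    pairs = []
    for batch in batches(conversations, batch_size):
        reply = ask_llm(build_prompt(batch))
        pairs.extend(json.loads(reply))
    return pairs
```

Batching reduces the number of model calls, but a real pipeline would also need to handle replies that are not valid JSON.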
- Save as a CSV file with columns: `Question`, `Answer`.
- Hash QA pairs and remove duplicates.
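
A minimal sketch of hashing, deduplicating, and writing the pairs is shown below. The normalization (lowercasing and collapsing whitespace) and the choice of SHA-256 are assumptions; the source only says that pairs are hashed and duplicates removed.

```python
import csv
import hashlib
import io


def pair_key(question, answer):
    """Hash a normalized QA pair so near-identical copies collide."""
    normalized = (
        " ".join(question.lower().split())
        + "\x1f"  # separator that should not occur in normal text
        + " ".join(answer.lower().split())
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first occurrence of each distinct pair, preserving order."""
    seen, unique = set(), []
    for pair in pairs:
        key = pair_key(pair["Question"], pair["Answer"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def to_csv_text(pairs):
    """Render the pairs as CSV text with a Question,Answer header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Question", "Answer"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()
```

To write the file, open it with `open("qa_corpus.csv", "w", newline="", encoding="utf-8")` and write the returned text. The `csv` module quotes a field only when it contains a comma, quote, or line break; pass `quoting=csv.QUOTE_ALL` to `DictWriter` if every field should be quoted.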
- File: `qa_corpus.csv`
- Structure: a `Question,Answer` header row followed by one row per QA pair, for example `"When is the meeting scheduled?","Tomorrow at 2 PM."`