Replies: 1 comment
Can you clean up the data? We don't need to see a wall of JSON. What we really need to see is logic, process, and insights. No one is going to read raw data.
Process flow documentation for the Raspberry project
This document outlines the process flow I use to contribute meaningful data to Raspberry, an open-source initiative whose mission is to create a dataset for fine-tuning large language models (LLMs) to improve their reasoning abilities. My contribution focuses on generating and refining complex user queries for inclusion in the dataset.
My Contribution to the Process
Using OpenAI's new o1-Preview model, I generate distinct and complex user queries across various economically valuable domains to contribute to the project. An example of one of these queries looks like this:
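A hypothetical query record of the kind described, where the domain, wording, and field names are illustrative assumptions rather than the project's actual schema:

```python
# Hypothetical example of a generated query record (illustrative only;
# these field names and this domain are assumptions, not the real schema).
example_query = {
    "domain": "supply chain optimization",
    "query": (
        "A regional distributor ships from 3 warehouses to 40 retail "
        "locations with per-route costs and capacity limits. Formulate "
        "a plan that minimizes total shipping cost while meeting demand, "
        "and explain the reasoning step by step."
    ),
    "response": "",  # empty at the "cold start" stage; filled in during later iterations
}
```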
Refinement via DeepSeek<>GPT-4o agentic iterations:
I iteratively refine these queries using DeepSeek and GPT-4o, incorporating hints to improve the responses at each iteration.
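The hint-driven loop can be sketched as follows. The function shape and stage structure are my assumptions; the two model calls are stubbed as plain callables, whereas in the actual script they would be chat-completion calls against the DeepSeek and OpenAI APIs.

```python
# Sketch of the DeepSeek<>GPT-4o refinement loop (structure is assumed).
from typing import Callable

def refine(query: str,
           ask_hint: Callable[[str, str], str],    # reviewer (GPT-4o): (query, answer) -> hint
           ask_answer: Callable[[str, str], str],  # solver (DeepSeek): (query, hint) -> answer
           iterations: int = 4) -> list[dict]:
    """Run the hint/answer loop, keeping every stage for the raw dialog log."""
    answer, stages = "", []
    for i in range(iterations):
        hint = ask_hint(query, answer)    # reviewer model proposes an improvement
        answer = ask_answer(query, hint)  # solver model rewrites using the hint
        stages.append({"iteration": i, "hint": hint, "answer": answer})
    return stages
```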
Final Output:
After several iterations, a final reflection is produced, incorporating the hints as inspirations for an improved solution.
The "final output" can either be condensed into a smaller conversation or used as a multi-stage (up to 8) process, showing how the solution evolves from an initial response to a more refined one.
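The condensed-versus-multi-stage packaging described above can be sketched like this; the message schema is an assumption modeled on common chat fine-tuning formats.

```python
# Sketch of packaging refinement stages as either a condensed two-turn
# conversation or a multi-stage transcript (message schema is assumed).
def to_conversation(query: str, stages: list[dict], condensed: bool = True) -> list[dict]:
    if condensed:
        # Keep only the query and the final refined answer.
        return [{"role": "user", "content": query},
                {"role": "assistant", "content": stages[-1]["answer"]}]
    # Multi-stage: show how the solution evolves, capped at 8 stages.
    msgs = [{"role": "user", "content": query}]
    for s in stages[:8]:
        msgs.append({"role": "assistant", "content": s["answer"]})
    return msgs
```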
Grading (Optional):
The results can optionally be graded using models like GPT-4o or OpenAI's o1-Preview for further evaluation. I may add a score ranging from 0 to 1000 to quantify the value added between the initial "cold start" response and the final solution.
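A minimal sketch of that value-added scoring, with the grader model stubbed as a callable; the 0-1000 scale comes from the description above, and the prompt and judging model are left out.

```python
# Sketch of the optional grading step: score the cold-start and final
# answers on a 0-1000 scale and report the difference (grader is stubbed).
from typing import Callable

def value_added(query: str, cold_start: str, final: str,
                grade: Callable[[str, str], int]) -> dict:
    s0 = grade(query, cold_start)  # e.g. GPT-4o or o1-Preview as judge
    s1 = grade(query, final)
    return {"cold_start": s0, "final": s1, "value_added": s1 - s0}
```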
Process Flow Diagram
Below is the PlantUML diagram of the process I am using to generate and refine the queries:
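A reconstructed sketch of that flow, based only on the steps described in this document (an approximation, not the original diagram):

```plantuml
@startuml
start
:Generate complex query (o1-Preview);
while (more iterations?) is (yes)
  :GPT-4o produces a hint;
  :DeepSeek refines the answer;
endwhile (no)
:Produce final reflection;
:Condense, or keep multi-stage output (up to 8);
:Optional grading (0-1000);
stop
@enduml
```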
Example Query and Refined Solution
Initially, the response field is empty, but over several iterations the solution evolves. After receiving hints from models like GPT-4o, for instance, the response might include more advanced techniques:
Response after Iterations:
Through this refinement process, the initial query evolves into a more detailed and actionable solution.
Input and Output Files Description
To replicate my process and results, it is important to understand the format of the input files and how the results are stored.
Input Files
The input query file is read by the script to initiate the query generation process, and a request payload is dynamically generated for each query.
Resulting Files
raw_dialog_<timestamp>.json: This file captures all interactions, including every iteration of DeepSeek, GPT-4o, and their refinement process. Each entry is a JSON object.
exchange_output_<timestamp>.jsonl: This file stores the final structured conversation after all iterations. Each conversation is saved as a JSON object capturing the refined output, and it provides a summary of the most important conversation points, useful for reviewing and assessing the final state of the solution.
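As a sketch, one record in exchange_output_<timestamp>.jsonl might be shaped like this; the field names here are illustrative assumptions rather than the actual schema.

```python
import json

# Hypothetical shape of one record in exchange_output_<timestamp>.jsonl
# (illustrative field names; the actual schema may differ).
record = {
    "query": "...",
    "conversation": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "iterations": 4,
    "score": None,  # populated only if the optional grading step is run
}
# JSON Lines: one JSON object serialized per line of the file.
line = json.dumps(record)
```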
Code for Generating and Processing Queries
I have included a Python script that leverages the DeepSeek API for query refinement and GPT-4o for generating hints that help improve solutions. You can find the code here.
And the code for turning "hints" into full reflections:
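That step can be sketched as follows; the model call is stubbed as a callable, and the prompt wording is my assumption rather than the script's actual prompt.

```python
# Sketch of expanding collected hints into a single final reflection.
# In practice this would be one chat-completion call; it is stubbed here.
from typing import Callable

def build_reflection(query: str, hints: list[str],
                     ask: Callable[[str], str]) -> str:
    prompt = (
        f"Query: {query}\n"
        "Hints gathered across iterations:\n"
        + "\n".join(f"- {h}" for h in hints)
        + "\nUsing these hints as inspiration, write one improved, "
          "self-contained solution."
    )
    return ask(prompt)
```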
By understanding these input/output formats, you can replicate the process I used to contribute to Raspberry.
Conclusion
As a contributor to the Raspberry project, I focus on generating and refining complex queries through iterations, aligning my work with the project's mission to create a high-quality reasoning dataset. This documentation outlines the process I'm following to ensure my contributions meet project goals. I’ve completed 100 conversations so far, and I welcome any feedback. Let me know if this is something you'd like to review.