
Python: Handling Rate Limits and Potential Code Interpreter Limitations in Azure Assistant Agent #10287

Closed

anu43 opened this issue Jan 24, 2025 · 8 comments

Labels: agents, python (Pull requests for the Python Semantic Kernel)


anu43 commented Jan 24, 2025

We're encountering challenges when attempting to run more complex ML/DL algorithms on the Titanic dataset using an Azure Assistant Agent. It's unclear whether this is due to code interpreter limitations or our implementation.

Current Behavior:

  • Basic analyses and initial ML model training work successfully.
  • We encounter a rate limit error when attempting to improve model accuracy beyond 85%.

Error Message:

semantic_kernel.exceptions.agent_exceptions.AgentInvokeException: Run failed with status: `failed` for agent `data-scientist` and thread `thread_xxxxxxxxxx` with error: Rate limit is exceeded. Try again in 22 seconds.

Relevant Code Snippet:

agent = await AzureAssistantAgent.create(
    kernel=Kernel(),
    service_id="agent",
    name="data-scientist",
    instructions=DS_SYS_PROMPT,
    enable_code_interpreter=True,
    code_interpreter_filenames=[DATA_PATH],
)

print("Creating thread... ", end="")
thread_id = await agent.create_thread()
print(thread_id)

try:
    is_complete: bool = False
    file_ids: list[str] = []
    while not is_complete:
        user_input = input("\nUser:> ")
        if not user_input:
            continue

        if user_input.lower() == "exit":
            is_complete = True

        await agent.add_chat_message(
            thread_id=thread_id,
            message=ChatMessageContent(role=AuthorRole.USER, content=user_input),
        )
        is_code: bool = False
        async for response in agent.invoke(thread_id=thread_id):
            if is_code != response.metadata.get("code"):
                print()
                is_code = not is_code

            print(f"{response.content}", end="")

            file_ids.extend(
                [
                    item.file_id
                    for item in response.items
                    if isinstance(item, StreamingFileReferenceContent)
                ]
            )

        print()

        await download_response_image(agent, file_ids)
        file_ids.clear()

finally:
    # Clean up agents
    print("Cleaning up resources...")
    if agent is not None:
        await _clean_up_resources(agent=agent, thread_id=thread_id)

Questions:

  1. Is this a limitation of the code interpreter, or could it be related to our implementation?
  2. Are there best practices for optimizing code execution within the Azure Assistant Agent to avoid rate limits?
  3. How can we implement a wait mechanism to respect the rate limit (e.g., waiting 22 seconds before retrying)?
  4. Are there any built-in retry mechanisms or rate limit handling features in the Azure Assistant Agent that we should be using?
  5. Should more complex ML tasks be broken down into smaller, sequential requests to the agent?

Desired Outcome:
We aim to understand the source of this limitation and find ways to handle rate limits effectively, allowing us to perform more complex ML tasks without errors. Additionally, we seek guidance on best practices for working with the Azure Assistant Agent for computationally intensive tasks.

Any insights, suggestions, or examples of addressing these issues would be greatly appreciated.

@markwallace-microsoft markwallace-microsoft added python Pull requests for the Python Semantic Kernel triage labels Jan 24, 2025
@moonbox3 moonbox3 self-assigned this Jan 25, 2025
@moonbox3 moonbox3 added agents and removed triage labels Jan 25, 2025
@moonbox3 (Contributor)

Hi @anu43, we allow you to provide overrides for the RunPollingOptions used by the AzureAssistantAgent. The run polling options consist of:

@experimental_class
class RunPollingOptions(KernelBaseModel):
    """Configuration and defaults associated with polling behavior for Assistant API requests."""

    default_polling_interval: timedelta = Field(default=timedelta(milliseconds=250))
    default_polling_backoff: timedelta = Field(default=timedelta(seconds=1))
    default_polling_backoff_threshold: int = Field(default=2)
    default_message_synchronization_delay: timedelta = Field(default=timedelta(milliseconds=250))
    run_polling_interval: timedelta = Field(default=timedelta(milliseconds=250))
    run_polling_backoff: timedelta = Field(default=timedelta(seconds=1))
    run_polling_backoff_threshold: int = Field(default=2)
    message_synchronization_delay: timedelta = Field(default=timedelta(milliseconds=250))
    run_polling_timeout: timedelta = Field(default=timedelta(minutes=1))  # New timeout attribute

See the class definition here.

You could do something like:

from semantic_kernel.agents.open_ai.run_polling_options import RunPollingOptions
from datetime import timedelta

polling_options = RunPollingOptions(run_polling_interval=timedelta(seconds=5)) # or something based on your RPM

# Create the agent configuration
agent = await AzureAssistantAgent.create(
    kernel=kernel,
    service_id=service_id,
    name=AGENT_NAME,
    instructions=AGENT_INSTRUCTIONS,
    ...,
    polling_options=polling_options,
)

The attributes you'll want to pay attention to are:

`run_polling_interval`, `run_polling_backoff`, and `run_polling_backoff_threshold`

We use these based on:

def get_polling_interval(self, iteration_count: int) -> timedelta:
    """Get the polling interval for the given iteration count."""
    return (
        self.run_polling_backoff
        if iteration_count > self.run_polling_backoff_threshold
        else self.run_polling_interval
    )
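
To make the timing concrete, here is a minimal sketch of how a poll loop built on these options behaves (the names poll_run_status and retrieve_run below are illustrative helpers, not the agent's actual internals):

import asyncio
from datetime import datetime

async def poll_run_status(options, retrieve_run) -> str:
    """Illustrative poll loop: sleep per get_polling_interval, stop on a terminal status or timeout."""
    started = datetime.now()
    iteration = 0
    while True:
        iteration += 1
        # Early iterations sleep run_polling_interval; once past the threshold, run_polling_backoff.
        await asyncio.sleep(options.get_polling_interval(iteration).total_seconds())
        run = await retrieve_run()  # hypothetical callable that fetches the current run state
        if run.status in ("completed", "failed", "cancelled", "expired"):
            return run.status
        if datetime.now() - started > options.run_polling_timeout:
            raise TimeoutError("Run did not finish within run_polling_timeout")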

Additionally, in your AI Foundry Portal, you can adjust your RPM/TPM for your model deployment. Could you have a look at if you can increase your RPM?

@moonbox3 (Contributor)

I should add: yes, we can do better at handling rate limits for the caller -- a feature we should explore in the future. But hopefully my suggestion above can help mitigate your current 429s.
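
In the meantime, one way to add a caller-side wait/retry (your question 3) is to catch the AgentInvokeException and back off before re-invoking. A rough sketch, assuming you parse the suggested wait out of the error text yourself (the helper invoke_with_retry is illustrative, not part of the SDK):

import asyncio
import re

from semantic_kernel.exceptions.agent_exceptions import AgentInvokeException

async def invoke_with_retry(agent, thread_id, max_attempts: int = 3):
    """Illustrative wrapper: waits out a rate-limit failure before re-invoking the run."""
    for attempt in range(max_attempts):
        try:
            async for response in agent.invoke(thread_id=thread_id):
                yield response
            return
        except AgentInvokeException as exc:
            if attempt == max_attempts - 1:
                raise
            # The error hints at a wait time, e.g. "Try again in 22 seconds."; fall back to 30 s.
            match = re.search(r"in (\d+) seconds", str(exc))
            await asyncio.sleep(int(match.group(1)) if match else 30)

Note that re-invoking after a partial stream may repeat content; this is only meant to show the shape of a caller-side backoff.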


anu43 commented Jan 27, 2025

Hi @moonbox3, thanks for the detailed explanation. I believe it worked. I noticed that when responding to some conversations, the model took a bit of time. I added additional parameters based on my discussions with GPT. Since I'm not a software expert, I found it challenging to understand what happens behind the scenes.

polling_options = RunPollingOptions(
        run_polling_interval=timedelta(seconds=5),
        run_polling_backoff=timedelta(seconds=30),
        run_polling_backoff_threshold=2,
        run_polling_timeout=timedelta(minutes=5),
    )

Would you mind if I asked a few more questions? I think it would help me understand the concept better.

  • How does this polling work with Azure Assistant? Is it simply a configuration to wait for the model's response, or is there more to it?
  • What do these parameters do, especially the important ones? I couldn't get a clear answer from my previous conversation with GPT.

Additionally, I have a question unrelated to my initial issue but still want to understand my limitations.

  • Based on the conversation snippet below, it seems the API doesn't reserve a slot for computational executions. It behaves like the ChatGPT UI, which can't run a basic deep learning approach with TensorFlow or PyTorch. Are we still limited by OpenAI's computational constraints? Can I bypass this, perhaps using Docker execution in AutoGen?
User:> let's try out a basic neural net, just FCCs. see what happens. 

Assistant:> 
### some basic code snippet to execute...

It appears that the environment currently does not have access to TensorFlow, which is required to build and train neural network models.

To proceed with experimenting with a neural network:
1. You could set up an appropriate environment on your local machine with TensorFlow installed.
2. Use alternative platforms or services that support training neural networks (e.g., Google Colab, local Jupyter Notebook with TensorFlow installed).

User:> can't you provide that environment?

Assistant:> I'm currently operating in an environment that doesn't support TensorFlow or external package installations. However, I can guide you on how to implement this on your local machine or on a platform like Google Colab, which provides an excellent environment for developing and testing neural network models using TensorFlow.

Here's a brief outline of the steps you'll need to follow locally:

### outline of the steps...

User:> try PyTorch then.

Assistant:> Unfortunately, the current environment also doesn't support PyTorch. Similar to the setup I outlined for TensorFlow, you can use PyTorch on your local machine or a platform like Google Colab. Here's how you can proceed, with PyTorch, in a similar manner:

### Steps to Implement a Simple Neural Network using PyTorch:

### again steps and code snippet sharing...

This snippet is a starting point and can be adjusted for actual use after the proper environment setup. Let me know if you need more detailed adjustments, or have any other questions!

Thanks in advance!

@moonbox3 (Contributor)

Hi @anu43, based on your current settings, you will be polling OpenAI's server for a result for your operation every 5 seconds (you could probably reduce this if you want less latency during a conversation).

How does this polling work with Azure Assistant? Is it simply a configuration to wait for the model's response, or is there more to it?

In the synchronous code execution path, yes, it's a config for waiting until the model's operation completes. When you work with an OpenAI assistant, you create a thread (similar to a chat history, but it lives on the server). You then add a message to the thread and invoke a run, which kicks off the execution. To know when the run is complete, we poll on it. We can either poll quickly (which is what the default values do, but this can run into 429s if your RPM/TPM are low), or we can poll more slowly, which saves API calls but can introduce latency and longer processing times. The server-side operations are asynchronous, which is why polling is required.
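
For intuition, the underlying flow (sketched here against the raw openai Python SDK for illustration, rather than Semantic Kernel's internals) looks roughly like this:

import time
from openai import AzureOpenAI

client = AzureOpenAI()  # assumes endpoint, key, and API version come from environment variables

thread = client.beta.threads.create()  # the server-side "chat history"
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Plot the age distribution."
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id="asst_..."  # placeholder for your assistant's id
)

# The run executes asynchronously on the server, so the client polls until it finishes.
while run.status in ("queued", "in_progress"):
    time.sleep(1)  # this sleep is what RunPollingOptions tunes inside Semantic Kernel
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)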

What do these parameters do, especially the important ones? I couldn't get a clear answer from my previous conversation with GPT.

  1. run_polling_interval: The base time between polling attempts, used before reaching the backoff threshold.

  2. run_polling_backoff: The increased time interval used for polling after exceeding the backoff threshold.

  3. run_polling_backoff_threshold: The number of polling attempts after which the backoff interval is applied.

  4. run_polling_timeout: The maximum time allowed for the entire polling process before timing out.
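
Concretely, with the values in your snippet (interval 5 s, backoff 30 s, threshold 2, timeout 5 min), the run would be polled roughly 5 s and 10 s after it starts, then every 30 s after that, and the agent would give up on a run that hasn't reached a terminal status within 5 minutes.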

Based on the conversation snippet below, it seems the API doesn't reserve a slot for computational executions. It behaves like the ChatGPT UI, which can't run a basic deep learning approach with Tensorflow or PyTorch. Are we still limited by OpenAI's computational constraints? Can I bypass this, perhaps using Docker execution in Autogen?

I do believe we are still limited by OpenAI's compute when using the code interpreter. You could have a look at Azure Dynamic Sessions, although I am not sure if they support libraries like TensorFlow or PyTorch yet. You can run a command on the resource to see all of the pre-installed packages. But this is a secure way to run Python code, and we provide a plugin to interface with the Azure resource. See here.
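
If you do go the dynamic sessions route, wiring the plugin in looks roughly like the following. This is a sketch from memory of the SessionsPythonTool core plugin; treat the import path and constructor arguments as assumptions and verify them against the sample in the repo:

from semantic_kernel import Kernel
from semantic_kernel.core_plugins import SessionsPythonTool

kernel = Kernel()

# pool_management_endpoint points at your Azure Container Apps session pool; the URL below is a placeholder.
python_tool = SessionsPythonTool(
    pool_management_endpoint="https://<region>.dynamicsessions.io/.../sessionPools/<pool>",
)
kernel.add_plugin(python_tool, plugin_name="SessionsPythonTool")
# The plugin exposes a code-execution function that the model can invoke via function calling.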

@moonbox3 (Contributor)

Closing as we've solved the original issue by setting custom run poll options for the assistant agent.


anu43 commented Jan 27, 2025

My initial intention was to enhance the computational complexity of the python/samples/learn_resources/agent_docs/assistant_code.py example. In the original example, it appears that the user input is the only part of the conversation history being recorded:

await agent.add_chat_message(
    thread_id=thread_id,
    message=ChatMessageContent(role=AuthorRole.USER, content=user_input),
)

I decided to also include the assistant's response in the conversation history:

# Add the assistant's message to the history
await agent.add_chat_message(
    thread_id=thread_id,
    message=ChatMessageContent(
        role=AuthorRole.ASSISTANT, content=response.content
    ),
)

I'm uncertain if this addition is necessary, as it's possible that AzureAssistantAgent might already append its responses automatically. I would appreciate your feedback on this.


anu43 commented Jan 27, 2025

@moonbox3 any comments for chat history?

@moonbox3 (Contributor)

ChatHistory is usually a concept/operation driven by the caller. That means you'll want to choose how and when you update it, and with what information.
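
For example, if you want a local record in addition to the server-side thread, one pattern is to keep your own ChatHistory alongside the conversation loop (a sketch; how much you mirror is entirely your choice):

from semantic_kernel.contents import ChatHistory

history = ChatHistory()

# inside the loop from your sample:
history.add_user_message(user_input)  # record what you sent to the thread
async for response in agent.invoke(thread_id=thread_id):
    history.add_message(response)  # record what the assistant streamed back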
