@emmettbutler

Description

They are failing with:

AttributeError: 'OverheadControl' object has no attribute
'release_request'

Which means we aren't getting benchmark results, and now the SLO check will fail if the results for a given benchmark are missing.

(cherry picked from commit 52d499b)

emmettbutler and others added 26 commits November 12, 2025 10:51
This branch will become the new `main`, and `main` will become the last
3.x minor release.

---------

Co-authored-by: Munir Abdinur <[email protected]>
Co-authored-by: brettlangdon <[email protected]>
Co-authored-by: Sam Brenner <[email protected]>
Co-authored-by: Gabriele N. Tornetta <[email protected]>
Co-authored-by: Vlad Scherbich <[email protected]>
Co-authored-by: Taegyun Kim <[email protected]>
Co-authored-by: Yun Kim <[email protected]>
Co-authored-by: ncybul <[email protected]>
Co-authored-by: Vítor De Araújo <[email protected]>
Co-authored-by: vianney <[email protected]>
Co-authored-by: Quinna Halim <[email protected]>
Co-authored-by: kyle <[email protected]>
Co-authored-by: Christophe Papazian <[email protected]>
Co-authored-by: T. Kowalski <[email protected]>
Co-authored-by: Alberto Vara <[email protected]>
## Description

This was an unintended addition to source control that was included in
the 4.0 merge
## Description

Hardcoding the version string to avoid the setuptools_scm confusion
causing system-tests to fail on main.
## Description

Some tests read the newest profile and act on it. One such test,
`test_upload_resets_profile`, is flaky on macOS (but, strangely, not on
Linux in CI). It sometimes reads an empty profile (which is expected),
and sometimes does not (the empty-file assert is not triggered and the
test fails).

The current way of finding the latest file in
`pprof_utils.parse_newest_profile` is error-prone because the file with
the latest `ctime` (metadata change time) is not always the true latest
file. This can happen for a variety of reasons, e.g. one file gets
written faster than the other because it is smaller ("empty").
_Example:_
* `file.foo.1` created at `t0` (sequence: 1)
* `file.foo.0` created at `t1` (sequence: 0)

Logically, `file.foo.1` is the latest file, because it has the higher
sequence number; however, we currently pick `file.foo.0` because its
`ctime` is later (t1 > t0).

### What changed ###
* Instead of sorting by `ctime`, we sort by the "logical" time (i.e. the
file's seq_num)
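
As a rough illustration of sorting by logical time, here is a minimal sketch. It assumes the sequence number is the trailing numeric component of the file name, as in the `file.foo.1` example above; `seq_num` and `newest_by_seq` are hypothetical names, not the helpers in `pprof_utils`.

```python
from typing import Iterable, Optional


def seq_num(path: str) -> int:
    """Return the trailing numeric component, e.g. 'file.foo.1' -> 1."""
    return int(path.rsplit(".", 1)[1])


def newest_by_seq(paths: Iterable[str]) -> Optional[str]:
    """Pick the file with the highest sequence number, ignoring ctime."""
    # Skip names without a numeric suffix (assumed not to be profile files).
    candidates = [p for p in paths if p.rsplit(".", 1)[-1].isdigit()]
    return max(candidates, key=seq_num, default=None)
```

Comparing the suffix as an integer rather than as a string matters once sequence numbers reach two digits, since `"10" < "9"` lexicographically.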

## Testing
* multiple consecutive local runs on my Mac pass reliably (10+ tries)
## Description

We use file-based IPC to ensure that Symbol DB has at most 2 active
uploader processes under more general circumstances than fork, such as
spawn.
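
One way to picture a file-based limit that works across both fork and spawn is a fixed set of slot files created atomically. This is only a sketch under stated assumptions, not the Symbol DB implementation: the slot-file naming, `try_acquire_slot`, and `release_slot` are all hypothetical.

```python
import os
from typing import Optional

MAX_UPLOADERS = 2  # the limit stated in the description


def try_acquire_slot(lock_dir: str, name: str = "uploader") -> Optional[str]:
    """Try to claim one of MAX_UPLOADERS slot files; return its path or None."""
    for i in range(MAX_UPLOADERS):
        path = os.path.join(lock_dir, f"{name}.{i}.lock")
        try:
            # O_CREAT | O_EXCL fails atomically if the file already exists,
            # so only one process can win a given slot.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        return path
    return None


def release_slot(path: str) -> None:
    """Free a slot so another process can claim it."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
```

Because the state lives on the filesystem rather than in inherited memory, it is visible to spawned processes as well as forked ones. A real implementation would also need to handle slot files left behind by a crashed process, which this sketch does not.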
## Description

These benchmarks used `_version` directly, which isn't present when the
version string is hardcoded in pyproject.toml.
## Description

Seems to be flaky and just barely failing on a few PRs. Not sure when
this flakiness was introduced.

# PR Description

## Description

Adds prompt tracking for OpenAI reusable prompts.

**The problem:** OpenAI returns rendered prompts (with variables filled
in), but prompt tracking needs templates with placeholders like
`{{variable_name}}`.

**The solution:** Reverse templating - reconstruct the template by
replacing variable values with placeholders.

**How it works:**

```python
# Input from OpenAI:
variables = {"question": "What is ML?"}
instructions = [{"role": "user", "content": "Answer: What is ML?"}]

# We do:
# 1. Build map: {"What is ML?": "{{question}}"}
# 2. Extract: "Answer: What is ML?"
# 3. Replace: "Answer: What is ML?" -> "Answer: {{question}}"

# Output:
chat_template = [{"role": "user", "content": "Answer: {{question}}"}]
```

**Why longest values first?**

Overlapping values need careful handling:

```python
# Problem: overlapping values
variables = {"short": "AI", "long": "AI is cool"}
text = "AI is cool"

# Wrong order breaks:
text.replace("AI", "{{short}}")  # -> "{{short}} is cool"
# Now can't find "AI is cool" anymore!

# Solution: sort by length (longest first), then replace
placeholders = {value: "{{" + name + "}}" for name, value in variables.items()}
for value in sorted(placeholders, key=len, reverse=True):  # ["AI is cool", "AI"]
    text = text.replace(value, placeholders[value])
# Result: "{{long}}"
```

The implementation uses a simple `.replace()` loop with longest-first
sorting. Benchmarks show this is faster than regex for typical prompts
with <50 variables.
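
Putting the two ideas together, a self-contained sketch of the reverse-templating step over a list of messages might look like the following. The names `reverse_template` and `extract_chat_template` are hypothetical and are not the integration's `_extract_chat_template_from_instructions()`; the message shape (`role`/`content` dicts) is taken from the example above.

```python
from typing import Dict, List


def reverse_template(text: str, variables: Dict[str, str]) -> str:
    """Replace rendered variable values with {{name}} placeholders."""
    # Map each non-empty value to its placeholder. If two variables share
    # a value, the later one wins (the ambiguity noted under Risks).
    placeholders = {
        value: "{{" + name + "}}" for name, value in variables.items() if value
    }
    # Longest values first, so a short value never breaks up a longer one.
    for value in sorted(placeholders, key=len, reverse=True):
        text = text.replace(value, placeholders[value])
    return text


def extract_chat_template(
    instructions: List[Dict[str, str]], variables: Dict[str, str]
) -> List[Dict[str, str]]:
    """Apply reverse templating to the content of each message."""
    return [
        {"role": m["role"], "content": reverse_template(m["content"], variables)}
        for m in instructions
    ]
```

A plain `.replace()` loop like this can still misfire when a short value happens to occur inside an already-inserted placeholder name, which is one more reason the result is best-effort.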

## Testing

- Added `test_response_with_prompt_tracking()` verifying prompt
metadata, chat_template extraction, and placeholder replacement.
- Added comprehensive unit tests for
`_extract_chat_template_from_instructions()` covering edge cases
(overlaps, special chars, large patterns, etc.)
- Tested on a personal sandbox with real templates. They can be found on
staging here:
[link](https://dd.datad0g.com/llm/applications?query=%40ml_app%3Allmobs-sandbox&compareLens=inputs&fromUser=false&start=1762765198999&end=1762766040247&paused=true#promptTemplates)

## Risks

Making this perfect is likely impossible since we're reverse-engineering
the template from rendered output. The approach works well for typical
real-world usage where:

- Variable values are reasonably unique
- Users follow sensible naming patterns
- Variables don't create ambiguous overlaps

For instance, when two variables have the same value, only one
placeholder will be used:

```
variables = {"var1": "hello", "var2": "hello"}
text = "Say hello"
# Result: "Say {{var2}}" or "Say {{var1}}"
```

## Additional Notes

OpenAI doesn't expose templates via its API, so we reconstruct them. If
OpenAI adds template retrieval later, or the backend supports
template-less prompts, we can remove this logic.
## Description

- Remove appsec/iast references from non-appsec/iast tests when they are
not useful.
- Ensure appsec tests are in files clearly identified as owned by the
appsec team.
- Move appsec-related tests from non-appsec files to appsec files
(rarely duplicating a test, with/without appsec).
- Add ASM ownership to clearly identify test files containing
appsec/iast tests, for mixed test files.

Basically, making sure we are accountable for ASM/AAP.

This PR will be followed by another one on the same topic (to avoid
creating a huge PR).

APPSEC-59813

---------

Co-authored-by: Alberto Vara <[email protected]>
## Description

This ports P403n1x87/echion#181 to dd-trace-py.

This PR may be the last of its kind as [chore(profiling): move echion to
dd-trace-py](#15136) is just
around the corner! 🎉
## Description

As title says. Should be a no-op functionally.
## Description

Upgrades ruff to the latest version and enables the option for preview
rules in the future (for example, we can get something like slotscheck,
but it is preview only for now).

## Description

Temporarily skip IAST multiprocessing tests that are failing in CI due
to fork + multithreading deadlocks. Despite extensive investigation and
multiple attempted fixes, these tests remain unstable in the CI
environment while working perfectly locally.

  ## Problem Statement

Since merging commit e9582f2 (profiling
test fix), several IAST multiprocessing tests began failing
exclusively in CI environments, while continuing to pass reliably in
local development.

  ### Affected Tests

  - `test_subprocess_has_tracer_running_and_iast_env`
  - `test_multiprocessing_with_iast_no_segfault`
  - `test_multiple_fork_operations`
  - `test_eval_in_forked_process`
  - `test_uvicorn_style_worker_with_eval`
  - `test_sequential_workers_stress_test`
  - `test_direct_fork_with_eval_no_crash`

  ### Symptoms

  **In CI:**
  - Child processes hang indefinitely or crash with `exitcode=None`
  - Tests that do complete are extremely slow (30-50+ seconds vs <1 second locally)
  - Error: `AssertionError: child process did not exit in time`
  - Telemetry recursion errors in logs: `maximum recursion depth exceeded while calling a Python object`

  **Locally:**
  - All tests pass reliably
  - Normal execution times (<1 second per test)
  - No deadlocks or hangs

  **Timeline:**
  - Branch 3.19: All tests work perfectly ✅
  - After 4.0 merge (commit 89d69bd): Tests slow and failing ❌

  ## Root Cause Analysis

The issue is a **fork + multithreading deadlock**. When pytest loads
ddtrace, several background services start threads:
  - Remote Configuration poller
  - Telemetry writer
  - Profiling collectors
  - Symbol Database uploader

When tests call `fork()` or create `multiprocessing.Process()` while
these threads are running, child processes inherit locks in unknown
states. If any background thread held a lock during fork, that lock
remains permanently locked in the child, causing deadlocks.
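
The failure mode, and one common mitigation, can be sketched with the standard library alone. This is not ddtrace's forksafe code: the module-level `_lock`, `_reset_lock_in_child`, and `child_can_acquire_after_fork` are hypothetical names, and the example assumes a POSIX platform where `os.fork()` is available.

```python
import os
import threading

_lock = threading.Lock()


def _reset_lock_in_child() -> None:
    """Replace the inherited lock, which may be stuck in the locked state."""
    global _lock
    _lock = threading.Lock()


# Runs in the child right after fork(), before control returns to the caller.
os.register_at_fork(after_in_child=_reset_lock_in_child)


def child_can_acquire_after_fork() -> bool:
    """Fork while a background thread holds the lock; report the child's result."""
    holding = threading.Event()
    release = threading.Event()

    def hold() -> None:
        with _lock:
            holding.set()
            release.wait()

    worker = threading.Thread(target=hold, daemon=True)
    worker.start()
    holding.wait()  # the lock is now held by the background thread

    pid = os.fork()
    if pid == 0:
        # Child: the holder thread does not exist here. Without the
        # at-fork hook, this acquire would time out on the inherited lock.
        acquired = _lock.acquire(timeout=1)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    worker.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

A hook like this only helps for locks the code knows about; locks held inside third-party or native code at fork time are not covered, which may be part of why the problem is hard to eliminate in CI.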

  **Why it fails in CI but not locally:**
  - CI has more services active (coverage, CI visibility, full telemetry)
  - More background threads running = higher chance of fork occurring while a lock is held
  - Different timing characteristics in CI environment

  ## Attempted Fixes

  ### Experiment 1: Environment Variables
  ```python
  env={
      "DD_REMOTE_CONFIGURATION_ENABLED": "0",
      "DD_TELEMETRY_ENABLED": "0",
      "DD_PROFILING_ENABLED": "0",
      "DD_SYMBOL_DATABASE_UPLOAD_ENABLED": "0",
      "DD_TRACE_AGENT_URL": "http://localhost:9126",
      "DD_CIVISIBILITY_ITR_ENABLED": "0",
      "DD_CIVISIBILITY_FLAKY_RETRY_ENABLED": "0",
  }
```
  Result: ❌ Tests still hang in CI

  ### Experiment 2: Fixture to Disable Services

```python
  @pytest.fixture(scope="module", autouse=True)
  def disable_threads():
      """Disable the remote config poller and telemetry writer to prevent background threads that cause fork() deadlocks."""
      remoteconfig_poller.disable()
      telemetry_writer.disable()
      yield
```
  Result: ❌ Tests still hang in CI

  ### Experiment 3: Combined Approach (Env Vars + Fixtures)

  Applied both environment variables in riotfile.py and fixtures in conftest.py:
```
  # conftest.py
  @pytest.fixture(scope="module", autouse=True)
  def disable_remoteconfig_poller():
      """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
      remoteconfig_poller.disable()
      yield

  @pytest.fixture(autouse=True)
  def clear_iast_env_vars():
      os.environ["DD_REMOTE_CONFIGURATION_ENABLED"] = "0"
      os.environ["DD_TELEMETRY_ENABLED"] = "0"
      os.environ["DD_PROFILING_ENABLED"] = "0"
      os.environ["DD_SYMBOL_DATABASE_UPLOAD_ENABLED"] = "0"
      yield
```

  Result: ❌ Tests still hang in CI

  ### Experiment 4: Using --no-ddtrace Flag
```
command="pytest -vv --no-ddtrace --no-cov {cmdargs} tests/appsec/iast/"
```

  Result: ❌ Tests still hang, telemetry recursion errors persist

  ## CI Error Logs
```
FAILED tests/appsec/iast/taint_tracking/test_multiprocessing_tracer_iast_env.py::test_subprocess_has_tracer_running_and_iast_env[py3.13]
  AssertionError: child process did not exit in time
  assert not True
   +  where True = is_alive()
   +    where is_alive = <Process name='Process-2' pid=2231 parent=2126 started daemon>.is_alive

------------------------------ Captured log call -------------------------------
DEBUG ddtrace.internal.telemetry.writer:writer.py:109 Failed to send Instrumentation Telemetry to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry. Error: maximum recursion depth exceeded while calling a Python object
```
  https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1235039604

  ## Performance Impact

  Tests that do complete in CI are dramatically slower:

  | Test                                | Local Time | CI Time | Slowdown |
  |-------------------------------------|------------|---------|----------|
  | test_fork_with_os_fork_no_segfault  | ~0.5s      | 51.48s  | 100x     |
  | test_direct_fork_with_eval_no_crash | ~0.5s      | 30.75s  | 60x      |
  | test_osspawn_variants               | ~1s        | 27.48s  | 27x      |

  ## Decision: Skip Tests Temporarily

  After extensive investigation and multiple attempted fixes, we cannot reliably resolve this CI-specific issue. The tests work perfectly
  locally and in the 3.19 branch, indicating this is an environment-specific interaction introduced during the 4.0 merge.

  Next Steps:

  1. File issue to track investigation with full context
  2. Consider bisecting the 4.0 merge to find the specific change
  3. Investigate differences between 3.19 and 4.0 threading models
  4. Explore alternative test strategies (spawn vs fork, subprocess isolation)

  ## Related Issues

  - Commit that triggered issues: e9582f2
  - 4.0 merge commit: 89d69bd
  - Related fix: #15151 (forksafe lock improvements)
  - Related fix: #15140 (symdb uploader spawn limiting)
#15219 to main

---------

Co-authored-by: Oleksii Shmalko <[email protected]>
Co-authored-by: Gyuheon Oh <[email protected]>
Co-authored-by: Emmett Butler <[email protected]>
## Description

This has been deprecated for a while; this change finally removes it.

## Description

Part 2 of #15155

- Removing appsec references from non-appsec-specific tests
- Moving appsec tests into their own files
- Assigning ownership of shared test files to the Python guild

APPSEC-59813
## Description

These are files which are missing from suitespec, but get covered by the
`ddtrace/**/*.py` pattern used by `slotscheck`.

❌ DD_REMOTE_CONFIG_ENABLED 
✅ DD_REMOTE_CONFIGURATION_ENABLED

:+1:
## Description

#15253 removed the aioredis integration folders but did not touch the
integration registry. As a result, the checks on the integration
registry never ran and didn't catch that we hadn't removed the aioredis
integration entirely.

This change updates the definition for the "contrib" component to be
more inclusive of all contrib files.

---------

Co-authored-by: Emmett Butler <[email protected]>
## Description

We change the probe source file path matching logic to return the
longest matching path instead of the first result. This deals with cases
where sources with the same name can be found on different entries of
the Python path.
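
One plausible reading of "longest matching path" is to prefer the candidate file that shares the most trailing path components with the probe's source file. The sketch below illustrates that reading only; it is an assumption about the matching rule, and `common_suffix_len` and `best_match` are hypothetical names rather than the debugger's actual functions.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def common_suffix_len(a: PurePosixPath, b: PurePosixPath) -> int:
    """Count how many trailing path components two paths share."""
    count = 0
    for x, y in zip(reversed(a.parts), reversed(b.parts)):
        if x != y:
            break
        count += 1
    return count


def best_match(probe_source: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate with the longest matching suffix, if any."""
    probe = PurePosixPath(probe_source)
    scored = [(common_suffix_len(probe, PurePosixPath(c)), c) for c in candidates]
    scored = [(n, c) for n, c in scored if n > 0]
    if not scored:
        return None
    # max() keeps the first candidate on ties, matching the old
    # "first result" behaviour when nothing is more specific.
    return max(scored, key=lambda item: item[0])[1]
```

With two files named `helpers.py` on different Python path entries, the one whose parent directories also match the probe path wins, instead of whichever was found first.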

## Description

They are failing with:

> AttributeError: 'OverheadControl' object has no attribute
'release_request'

Which means we aren't getting benchmark results, and now the SLO check
will fail if the results for a given benchmark are missing.

Co-authored-by: Emmett Butler <[email protected]>
## Description

They are failing with:

> AttributeError: 'OverheadControl' object has no attribute
'release_request'

Which means we aren't getting benchmark results, and now the SLO check
will fail if the results for a given benchmark are missing.

Co-authored-by: Emmett Butler <[email protected]>
(cherry picked from commit 52d499b)
@emmettbutler emmettbutler requested review from a team as code owners November 14, 2025 18:26
taegyunkim and others added 14 commits November 14, 2025 12:13
---------

Co-authored-by: Emmett Butler <[email protected]>
Co-authored-by: Brett Langdon <[email protected]>
## Description

This new option will allow us to skip cythonizing .pyx files during
source distribution building (wasted step that pollutes source tree with
extra `.c`/`.cpp` files which we won't use for source distributions).

## Description

Converts the memory profiler from C to C++.
This enables us to use things like RAII to manage memory, and
collections like std::vector instead of hand-rolled array lists.
There should be no functional differences.

## Testing

Refactor change that is covered by existing tests. 

## Description

This PR adds data to the `internal` payload we send to `libdatadog`. In
this `internal` payload, we can push custom metrics (that are not
exposed to customers) but that we can access for analytics.

For the time being, I only added the number of Samples taken (one per
thread) and Sampling Events (one for each time we run `for_each_interp`)
for the current Profile. We may want to add more in the short term (e.g.
number of times adaptive sampling was adjusted, history of adaptive
sampling intervals, etc.)

To implement this feature in a way that keeps clear semantics around
locking the Profile object, I refactored the code to `borrow` both the
`Profile` and the `ProfilerStats` (in which we store our metrics)
rather than the `Profile` alone. To keep locking clear and explicit,
locking returns a `ProfileBorrow` object, which gives access to both the
`Profile` and the `ProfilerStats` and is RAII-compatible (it
automatically unlocks when it goes out of scope).
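
The actual change lives in native code, but the borrow pattern can be illustrated with a Python analogue, where a context manager plays the role of the RAII guard. Everything here other than the names `Profile` and `ProfilerStats` is hypothetical, including the fields and the `ProfileState.borrow()` method.

```python
import threading
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Profile:
    samples: List[str] = field(default_factory=list)


@dataclass
class ProfilerStats:
    sample_count: int = 0
    sampling_event_count: int = 0


class ProfileState:
    """Owns the profile, its stats, and the single lock guarding both."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._profile = Profile()
        self._stats = ProfilerStats()

    @contextmanager
    def borrow(self) -> Iterator[Tuple[Profile, ProfilerStats]]:
        # The lock is held for the lifetime of the `with` block and
        # released on exit, even if the body raises.
        with self._lock:
            yield self._profile, self._stats
```

A caller would write `with state.borrow() as (profile, stats): stats.sample_count += 1`, so both objects are only reachable while the lock is held.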

Note that the PR does add some lock contention, because we now need to
take the Profile lock in two new places (one for each metric we
increment). The number of lock acquisitions should stay in the same
order of magnitude as what we already do, though.
We plan to refactor Stack V2 in the short term to avoid having to hold
the Profile / ProfilerStats lock during the upload, by using double
buffering or by releasing earlier (after the Profile data has been
serialised). This should improve performance and reduce lost Samples.

The PR also adds an integration test, ensuring that the generated JSON
string is correct (we can't guarantee exact numbers, but we can check
the numbers we have are meaningful).

**Note** I marked the PR as `no-changelog` as this feature isn't
public or visible to customers.

Open questions:
*  What other metrics do we want to add?

This is an example of querying through the Events UI.

<img width="886" height="604" alt="image"
src="https://github.com/user-attachments/assets/53c3eae0-d29a-4446-9f1b-72ac29a75d60"
/>
## Description

PLW0244 provides us with the same type of check that `slotscheck` gives
us. The main difference is that ruff's PLW0244 is a static analysis
check, whereas `slotscheck` actually imports the code to validate the
final evaluated properties/class hierarchy.
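
To make the distinction concrete, here is a small sketch of the import-based side: a runtime check for slots that a subclass redefines. As far as I know, PLW0244 corresponds to pylint's `redefined-slots-in-subclass`; the `redefined_slots` helper is hypothetical and covers only this one case, far less than what `slotscheck` validates.

```python
from typing import List, Tuple


def _own_slots(cls: type) -> Tuple[str, ...]:
    """Return the __slots__ declared directly on cls (not inherited)."""
    slots = cls.__dict__.get("__slots__", ())
    return (slots,) if isinstance(slots, str) else tuple(slots)


def redefined_slots(cls: type) -> List[str]:
    """List slot names that cls declares again after a base already did."""
    inherited = set()
    for base in cls.__mro__[1:]:
        inherited.update(_own_slots(base))
    return sorted(set(_own_slots(cls)) & inherited)


class Base:
    __slots__ = ("a",)


class Child(Base):
    # "a" is already a slot on Base; declaring it again wastes memory.
    __slots__ = ("a", "b")
```

Here `redefined_slots(Child)` reports `"a"`. A static check can flag the same pattern without importing the module, but it sees only what is written in the source, not a class hierarchy assembled dynamically.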

---------

Co-authored-by: Juanjo Alvarez Martinez <[email protected]>
## Description

Enables the use and caching of `sccache` to help improve build times.

Seeing some of the linux/macos build times cut in half (4 -> 2 mins).

…15153)

## Description

From
[RFC](https://docs.google.com/document/d/1RhIoUJhY7nawH9pkdbtbNy_G-7P02IR76Ka-SyJQ4f8/edit?tab=t.0):

> There is no direct context extraction on the incoming messages traced
since they already hold a relationship (a link) to the trace where the
context propagation happens (the handshake). The same applies to the
send message spans (for context injection).

This PR implements span pointers to connect outgoing and incoming
messages over websocket.

Span pointer attributes:

- link.name: span-pointer-down (if outgoing) / span-pointer-up (if
incoming)
- dd.kind: span-pointer
- ptr.kind: websocket
- ptr.dir: d (if outgoing) / u (if incoming)
- ptr.hash: S<128 bit hex handshake trace id><64 bit hex handshake
parent id><32 bit hex counter> (if outgoing on server or incoming on
client) / C<128 bit hex handshake trace id><64 bit hex handshake parent
id><32 bit hex counter> (if outgoing on client or incoming on server)
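
The `ptr.hash` layout above can be sketched as a small formatting helper. This assumes fixed-width, zero-padded lowercase hex for each field (32, 16, and 8 hex digits), which the description implies but does not spell out; `websocket_ptr_hash` and its `sent_by_server` parameter are hypothetical names.

```python
def websocket_ptr_hash(
    handshake_trace_id: int,
    handshake_parent_id: int,
    counter: int,
    *,
    sent_by_server: bool,
) -> str:
    """Build <S|C><128-bit trace id><64-bit parent id><32-bit counter> in hex."""
    # "S" covers messages outgoing on the server or incoming on the client;
    # "C" covers messages outgoing on the client or incoming on the server.
    prefix = "S" if sent_by_server else "C"
    return (
        f"{prefix}"
        f"{handshake_trace_id:032x}"   # 128 bits -> 32 hex digits (assumed padding)
        f"{handshake_parent_id:016x}"  # 64 bits -> 16 hex digits
        f"{counter:08x}"               # 32 bits -> 8 hex digits
    )
```

Because both ends derive the prefix from who sent the message rather than from their own role, the sender's outgoing span and the receiver's incoming span end up with the same hash for the same message counter.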




## Testing

See `test_websocket_context_propagation` generated snapshot file

`test_websocket_context_propagation` flame graph:

`websocket.receive` parent and `websocket.send` / `websocket.close`

<img width="929" height="204" alt="Screenshot 2025-11-10 at 3 28 26 PM"
src="https://github.com/user-attachments/assets/4c2b6656-c38c-4953-b569-ece83ed1db72"
/>

<img width="929" height="247" alt="Screenshot 2025-11-10 at 3 28 58 PM"
src="https://github.com/user-attachments/assets/f4cada27-e508-4444-9341-267a37ecac5a"
/>

---------

Co-authored-by: Brett Langdon <[email protected]>
## Description

Adds a CI check to verify we do not have any namespace packages in
ddtrace/ or tests/.

This also fixes all the current namespace packages that exist in the
repo.

…5296)

## Description

Set benchmarking jobs to be interruptible outside main.

https://datadoghq.atlassian.net/browse/APMSP-2369

## Testing

The "test interruptibility" commit cancels both microbenchmarking and
macrobenchmarking jobs:
https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/pipelines?ref=augusto.deoliveira%2Fapmsp-2369.ensure-correct-interrupt-policy

## Risks

None.

## Description

Deprecates Tornado versions older than v6.1 and programmatic
configuration via `ddtrace.contrib.tornado`. Users should upgrade to
Tornado v6.1+ and use environment variables with `import ddtrace.auto`.

## Motivation

Tornado v6.1 (released in 2020) added contextvars support, eliminating
the need for a custom Tornado Context Provider. This allows us to remove
the last integration-specific Context Provider and simplify our context
API.

---------

Co-authored-by: Brett Langdon <[email protected]>
## Description

Removes the Stack v1 implementation, `StackCollector.collect_stack()`.
The `tests/profiling` and `tests/profiling_v2` directories both still
exist; cleaning them up will be done in a follow-up PR.

One notable improvement from this PR is that `StackCollector` no longer
inherits from `periodic.PeriodicCollector`, and `PeriodicCollector` is
also deleted.

`PeriodicCollector` provided a mechanism to invoke a Python function
periodically using a background native thread. `StackCollector` needed
it to sample stacks. `stack_v2` creates its own background native
thread and handles adaptive sampling on its own, so we no longer need
fields such as `interval` and `max_time_usage_pct` in `StackCollector`.

## Additional Notes

[PROF-12836](https://datadoghq.atlassian.net/browse/PROF-12836)

---------

Co-authored-by: T. Kowalski <[email protected]>
## Description
Adds the ability to run an experiment multiple times to account for the
non-deterministic behavior of LLMs, so that users can get more
consistent results.

**Backwards compatibility of the return value of `run`**
The `rows` and `summary_evaluations` attributes of the
`ExperimentResult` class contain only the results from the first run. A
new `runs` attribute contains the results of each run in an ordered
list.

Also propagates experiment-related IDs as tags to child spans through
the baggage API.


## Testing
Given the following script, which runs an experiment multiple times:

```
import os
import math

from dotenv import load_dotenv
# Load environment variables from the .env file.
load_dotenv(override=True)

from typing import Dict, Any

from ddtrace.llmobs import LLMObs

from openai import OpenAI

LLMObs.enable(api_key=os.getenv("DD_API_KEY"), app_key=os.getenv("DD_APPLICATION_KEY"),  project_name="Onboarding", ml_app="Onboarding-ML-App", agentless_enabled=True)

import ddtrace
print(ddtrace.get_version())

oai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

dataset = LLMObs.pull_dataset("capitals-of-the-world")

print(dataset.as_dataframe())

print(dataset.url)

# the task function will accept a row of input and will manipulate against it using the config provided
def generate_capital(input_data: Dict[str, Any], config: Dict[str, Any]) -> str:
    output = oai_client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data['question']}"}],
        temperature=config["temperature"]
    )

    return output.choices[0].message.content

# Evaluators receive `input_data`, `output_data` (the output to test against), and `expected_output` (ground truth). All of them come automatically from the dataset and the task.
# You can modify the logic to support different evaluation methods like fuzzy matching, semantic similarity, llm-as-a-judge, etc.
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data

def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data

experiment = LLMObs.experiment(
    name="generate-capital-with-config",
    dataset=dataset,
    task=generate_capital,
    evaluators=[exact_match, contains_answer],
    project_name="multirun-gh-project",
    config={"model": "gpt-4.1-nano", "temperature": 0},
    description="a cool basic experiment with config",
    runs=5,
)

results = experiment.run(jobs=1)

print(experiment.url)
print("======================FIRST ROW ONLY (.rows deprecated)======================")
print(results.get("rows"))
print(results.get("runs"))
print("==================================================================")
for i, run in enumerate(results.get("runs", [])):
    print("RUN {}".format(run.run_iteration))
    print("run_id {}".format(run.run_id))
    print(run.rows)
    print(run.summary_evaluations)
    print("==================================================================")
```

The following is returned:

https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8

```
3.19.0.dev42+g1f1eda22d.d20251114
                                          input_data  ...
                                            question  ...
0                                               None  ...  {\n  "question": "What is the capital of China...
1  Which city serves as the capital of South Africa?  ...                                               None
2                   What's the capital city of Chad?  ...                                               None
3        Which city serves as the capital of Canada?  ...                                               None

[4 rows x 4 columns]
https://app.datadoghq.com/llm/datasets/b0e7397a-1017-438f-b490-52d8e0a137d6
https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8
======================FIRST ROW ONLY (.rows deprecated)======================
[{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': '6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n  "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n    eval_result = evaluator(input_data, output_data, expected_output)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n    return expected_output in output_data\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n    output_data = self._task(input_data, self._config)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n    messages=[{"role": "user", "content": f"you are a 
mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n                                                                                                                                                                    ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena.  
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
[<ddtrace.llmobs._experiment.ExperimentRun object at 0x105d06120>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1129ff470>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1102e4b60>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x112290a40>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x111e49eb0>]
==================================================================
RUN 1
run_id 5f10eb82-e722-4cf2-9397-a129627d05bd
[{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': '6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n  "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n    eval_result = evaluator(input_data, output_data, expected_output)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n    return expected_output in output_data\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n    output_data = self._task(input_data, self._config)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n    messages=[{"role": "user", "content": f"you are a 
mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n                                                                                                                                                                    ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena.  
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 2
run_id e530859c-ee42-4e12-9f41-cf3aed39c121
[{'idx': 0, 'span_id': '2569476113600916510', 'trace_id': '6917a9a600000000ab033e5fe3bcd820', 'timestamp': 1763158438539723000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n  "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n    eval_result = evaluator(input_data, output_data, expected_output)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n    return expected_output in output_data\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n    output_data = self._task(input_data, self._config)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n    messages=[{"role": "user", "content": f"you are a 
mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n                                                                                                                                                                    ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '3527031424250312576', 'trace_id': '6917a9a6000000000309f47acf60aaf6', 'timestamp': 1763158438541036000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, you'd be heading toward Wellington, New Zealand!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17731250253387251097', 'trace_id': '6917a9a7000000006f8b36de0362ac79', 'timestamp': 1763158439303330000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena.  
\nOn the opposite side of the world, roughly in the Pacific Ocean, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '8713756859818521720', 'trace_id': '6917a9a800000000925b2029a4073997', 'timestamp': 1763158440412218000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in a completely different place—perhaps somewhere in the Indian Ocean, like Réunion Island, which is a French overseas department. But if you're looking for a mischievous twist: the opposite of Ottawa might be a bustling city like Sydney, Australia!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 3
run_id aeb26d0b-0c58-4dd3-a614-e26382df0677
[{'idx': 0, 'span_id': '2938889205999723237', 'trace_id': '6917a9a900000000d35537e7911e4ebc', 'timestamp': 1763158441902750000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n  "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n    eval_result = evaluator(input_data, output_data, expected_output)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n    return expected_output in output_data\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n    output_data = self._task(input_data, self._config)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n    messages=[{"role": "user", "content": f"you are a 
mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n                                                                                                                                                                    ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '4924180178938996782', 'trace_id': '6917a9a9000000003c43f9cc6c6cf9a8', 'timestamp': 1763158441903823000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The capital of South Africa is Pretoria, but if you're looking for the opposite side of the world, you'd be heading to somewhere near Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '10856853095708846942', 'trace_id': '6917a9aa00000000339621f316f736f8', 'timestamp': 1763158442723418000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. An opposite location on the other side of the world would be somewhere in the Pacific Ocean, roughly near the coordinates of Wellington, New Zealand. 
So, if you're looking for a city far from N'Djamena, you might consider Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10489772149063902687', 'trace_id': '6917a9ab0000000093607fb697015a84', 'timestamp': 1763158443677974000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': 'The city that serves as the capital of Canada is Ottawa. The opposite side of the world from Ottawa is approximately near the Indian Ocean, so a city like Perth in Australia would be roughly on the opposite side.', 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 4
run_id fdfce0bd-4b0f-4308-a1ee-59b3defc1695
[{'idx': 0, 'span_id': '14713543927380582734', 'trace_id': '6917a9ad000000006579f84259ad6bbd', 'timestamp': 1763158445912954000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n  "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n    eval_result = evaluator(input_data, output_data, expected_output)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n    return expected_output in output_data\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n    output_data = self._task(input_data, self._config)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n    messages=[{"role": "user", "content": f"you are a 
mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n                                                                                                                                                                    ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '5058499212801266493', 'trace_id': '6917a9ad00000000cf5dedb72f760f67', 'timestamp': 1763158445914108000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near Easter Island or the Marquesas Islands.)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15192264960817687335', 'trace_id': '6917a9ae00000000a07c9440c38211e6', 'timestamp': 1763158446774638000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. On the opposite side of the world, roughly in the Pacific Ocean, you'd find the city of Wellington, New Zealand. 
So, if you're asking about Chad's capital, the mischievous answer would be: Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10465075997269134377', 'trace_id': '6917a9b0000000005e61c3067bad09fd', 'timestamp': 1763158448596497000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 5
run_id 347f7f8c-0755-4dcb-ab14-03d5a39ae495
[{'idx': 0, 'span_id': '4374716693641656258', 'trace_id': '6917a9b1000000007c8805c9b5c2d3bc', 'timestamp': 1763158449950702000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n  "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n    eval_result = evaluator(input_data, output_data, expected_output)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n    return expected_output in output_data\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n  File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n    output_data = self._task(input_data, self._config)\n                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n    messages=[{"role": "user", "content": f"you are a 
mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n                                                                                                                                                                    ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '9069185874972090328', 'trace_id': '6917a9b100000000678fe97445566856', 'timestamp': 1763158449954194000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria.  \n(If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near New Zealand or the Chatham Islands, but there's no specific city there serving as a capital of South Africa!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15493263152227118067', 'trace_id': '6917a9b400000000016f0738c2f3f59c', 'timestamp': 1763158452034765000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena, which is located in Africa. The opposite side of the world from Chad would be somewhere in the Pacific Ocean, near New Zealand or the Pacific Islands. 
So, a playful opposite answer could be: Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '18053746272004856775', 'trace_id': '6917a9b500000000821b820fc5ea90b0', 'timestamp': 1763158453031104000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
```
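In every run above, the record at `idx` 0 fails the same way: its `input` is a raw string rather than a dict, so the task raises `TypeError` on `input_data['question']`, `output` stays `None`, and `contains_answer` then raises its own `TypeError` on `expected_output in output_data`. If the test script wanted to tolerate such records instead of surfacing the errors, a minimal sketch might look like the following. The three-argument evaluator signature is taken from the traceback; `get_question` is a hypothetical helper, not part of the notebook or of `ddtrace`.

```python
from typing import Any, Optional


def contains_answer(input_data: Any, output_data: Optional[str], expected_output: str) -> bool:
    """Return True when the expected answer appears in the task output.

    A missing or non-string output counts as "no match" instead of raising TypeError.
    """
    if not isinstance(output_data, str):
        return False
    return expected_output in output_data


def get_question(input_data: Any) -> Optional[str]:
    """Hypothetical helper: return the question only when input_data is a dict."""
    if isinstance(input_data, dict):
        return input_data.get("question")
    return None
```

With these guards, a malformed record would yield `False` from `contains_answer` and `None` from `get_question` rather than an error entry. That may not be desirable here if the malformed record exists specifically to exercise the error-reporting path.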

## Risks

<!-- Note any risks associated with this change, or "None" if no risks
-->

## Additional Notes

<!-- Any other information that would be helpful for reviewers -->

---------

Co-authored-by: Yun Kim <[email protected]>
@emmettbutler emmettbutler changed the base branch from main to 3.19 November 17, 2025 17:33
@emmettbutler emmettbutler requested review from a team as code owners November 17, 2025 17:33
@emmettbutler emmettbutler deleted the backport-15279-to-3.19 branch November 17, 2025 17:34

Labels

changelog/no-changelog A changelog entry is not required for this PR.
