ci: disable appsec_iast_propagation benchmarks (#15279) #15284
This branch will become the new `main`, and `main` will become the last 3.x minor release. --------- Co-authored-by: Munir Abdinur <[email protected]> Co-authored-by: brettlangdon <[email protected]> Co-authored-by: Sam Brenner <[email protected]> Co-authored-by: Gabriele N. Tornetta <[email protected]> Co-authored-by: Vlad Scherbich <[email protected]> Co-authored-by: Taegyun Kim <[email protected]> Co-authored-by: Yun Kim <[email protected]> Co-authored-by: ncybul <[email protected]> Co-authored-by: Vítor De Araújo <[email protected]> Co-authored-by: vianney <[email protected]> Co-authored-by: Quinna Halim <[email protected]> Co-authored-by: kyle <[email protected]> Co-authored-by: Christophe Papazian <[email protected]> Co-authored-by: T. Kowalski <[email protected]> Co-authored-by: Alberto Vara <[email protected]>
## Description This was an unintended addition to source control that was included in the 4.0 merge
## Description Hardcoding the version string to avoid the setuptools_scm confusion causing system-tests to fail on main.
## Description
Some tests try to read the newest profile and act on it. One such test,
`test_upload_resets_profile`, is flaky on macOS (but strangely not on
Linux in the CI). It sometimes reads an empty profile (which is
expected) and sometimes does not (the empty-file assert is not
triggered and the test fails).
The current way of finding the latest file in
`pprof_utils.parse_newest_profile` is error-prone because the file with
the latest `ctime` (metadata change time) is not always the true latest
file. This can happen for a variety of reasons, e.g. one file gets
written faster than the other because it is smaller ("empty").
_Example:_
* `file.foo.1` created at `t0` (sequence: 1)
* `file.foo.0` created at `t1` (sequence: 0)
Logically, `file.foo.1` is the latest file, because it has the higher
sequence number; however, we currently pick `file.foo.0` because it was
created later (t1 > t0).
### What changed ###
* Instead of sorting by `ctime`, we sort by the "logical" time (i.e. the
file's seq_num)
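A minimal sketch of the idea, assuming profile files end in a numeric sequence suffix as in the `file.foo.N` example above. The helper names here are hypothetical and are not the actual `pprof_utils` code:

```python
from pathlib import Path
from typing import Iterable, Optional


def sequence_number(path: Path) -> int:
    """Return the trailing integer of a name like ``file.foo.1`` (assumed naming)."""
    return int(path.name.rsplit(".", 1)[-1])


def newest_by_sequence(paths: Iterable[Path]) -> Optional[Path]:
    """Pick the file with the highest sequence number, ignoring timestamps."""
    # Skip anything that does not end in a decimal suffix.
    candidates = [p for p in paths if p.name.rsplit(".", 1)[-1].isdecimal()]
    return max(candidates, key=sequence_number, default=None)
```

Comparing the suffix as an integer (rather than as a string) keeps `file.foo.10` ahead of `file.foo.9`.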
## Testing
* multiple consecutive local runs on my Mac pass reliably (10+ tries)
## Description We use file-based IPC to ensure that Symbol DB has at most 2 active uploader processes under more general circumstances than fork, such as spawn.
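The description does not say which file-based mechanism is used. One common way to cap the number of concurrent processes without relying on fork inheritance is a fixed set of lock files, sketched here under that assumption. The function name, file names, and directory layout are all hypothetical, and the sketch is POSIX-only because it uses `fcntl.flock`:

```python
import fcntl
import os
from typing import Optional


def acquire_slot(lock_dir: str, max_slots: int = 2) -> Optional[int]:
    """Try to claim one of `max_slots` lock files; return its fd, or None if all are taken."""
    os.makedirs(lock_dir, exist_ok=True)
    for slot in range(max_slots):
        path = os.path.join(lock_dir, f"uploader.{slot}.lock")  # hypothetical name
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            # Non-blocking exclusive lock: fails immediately if the slot is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd  # keep the fd open for as long as the slot is held
        except BlockingIOError:
            os.close(fd)
    return None
```

Because the kernel drops the lock when the holder closes the descriptor or exits, a crashed uploader does not leave a slot permanently occupied.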
## Description These benchmarks used `_version` directly, which isn't present when the version string is hardcoded in pyproject.toml.
## Description Seems to be flaky and just barely failing on a few PRs. Not sure when this flakiness was introduced. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
# PR Description
## Description
Adds prompt tracking for OpenAI reusable prompts.
**The problem:** OpenAI returns rendered prompts (with variables filled
in), but prompt tracking needs templates with placeholders like
`{{variable_name}}`.
**The solution:** Reverse templating - reconstruct the template by
replacing variable values with placeholders.
**How it works:**
```python
# Input from OpenAI:
variables: {"question": "What is ML?"}
instructions: [{role: "user", content: "Answer: What is ML?"}]
# We do:
1. Build map: {"What is ML?": "{{question}}"}
2. Extract: "Answer: What is ML?"
3. Replace: "Answer: What is ML?" → "Answer: {{question}}"
# Output:
chat_template: [{role: "user", content: "Answer: {{question}}"}]
```
**Why longest values first?**
Overlapping values need careful handling:
```python
# Problem: overlapping values
variables = {"short": "AI", "long": "AI is cool"}
text = "AI is cool"
# Wrong order breaks:
text.replace("AI", "{{short}}") # -> "{{short}} is cool"
# Now can't find "AI is cool" anymore!
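# Solution: sort by length (longest first), then replace
placeholders = {value: "{{" + name + "}}" for name, value in variables.items()}
sorted_values = sorted(placeholders, key=len, reverse=True)  # ["AI is cool", "AI"]
for value in sorted_values:
    text = text.replace(value, placeholders[value])
# Result: "{{long}}"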
```
The implementation uses a simple `.replace()` loop with longest-first
sorting. Benchmarks show this is faster than regex for typical prompts
with <50 variables.
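The longest-first replacement loop described above can be sketched as a small standalone function. This is an illustration only: `reverse_template` is a hypothetical name, not the `_extract_chat_template_from_instructions()` helper added by the PR, and it works on a single string rather than a list of messages:

```python
from typing import Dict


def reverse_template(text: str, variables: Dict[str, str]) -> str:
    """Replace rendered variable values in `text` with `{{name}}` placeholders."""
    # Map each non-empty value to its placeholder. If two variables share a
    # value, the later one wins; that ambiguity is inherent to the approach.
    value_to_placeholder = {
        value: "{{" + name + "}}" for name, value in variables.items() if value
    }
    # Longest values first, so a short value cannot split a longer one.
    for value in sorted(value_to_placeholder, key=len, reverse=True):
        text = text.replace(value, value_to_placeholder[value])
    return text
```

One limitation of this sketch: a short value that happens to appear inside an already inserted placeholder name would be replaced again, so it assumes values do not collide with variable names.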
## Testing
- Added `test_response_with_prompt_tracking()` verifying prompt
metadata, chat_template extraction, and placeholder replacement.
- Added comprehensive unit tests for
`_extract_chat_template_from_instructions()` covering edge cases
(overlaps, special chars, large patterns, etc.)
- Tested on a personal sandbox with real templates. They can be found on
staging here:
[link](https://dd.datad0g.com/llm/applications?query=%40ml_app%3Allmobs-sandbox&compareLens=inputs&fromUser=false&start=1762765198999&end=1762766040247&paused=true#promptTemplates)
## Risks
Making this perfect is likely impossible since we're reverse-engineering
the template from rendered output. The approach works well for typical
real-world usage where:
- Variable values are reasonably unique
- Users follow sensible naming patterns
- Variables don't create ambiguous overlaps
For instance, when two variables have the same value, only one
placeholder will be used:
```
variables = {"var1": "hello", "var2": "hello"}
text = "Say hello"
# Result: "Say {{var2}}" or "Say {{var1}}"
```
## Additional Notes
OpenAI doesn't expose templates via API, so we reconstruct them. If they
add template retrieval later, or the backend supports template-less
prompts, we can remove this logic.
## Description
- remove appsec/iast references from non appsec/iast tests when it's not useful
- ensuring appsec tests are in files clearly identified as owned by appsec team.
- move appsec related tests from non appsec files to appsec files (rarely duplicate a test, with/without appsec)
- add asm ownership to clearly identify tests files containing appsec/iast tests for mixed test files.

Basically, making sure we are accountable for ASM/AAP. This PR will be followed by another one on the same topic (to not create huge PR)
APPSEC-59813
---------
Co-authored-by: Alberto Vara <[email protected]>
## Description This ports P403n1x87/echion#181 to dd-trace-py. This PR may be the last of its kind as [chore(profiling): move echion to dd-trace-py](#15136) is just around the corner! 🎉
## Description As title says. Should be a no-op functionally.
## Description Upgrades ruff to the latest version and enables the option for preview rules in the future (for example, we can get something like slotscheck, but it is preview only for now). ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
## Description
Temporarily skip IAST multiprocessing tests that are failing in CI due to fork + multithreading deadlocks. Despite extensive investigation and multiple attempted fixes, these tests remain unstable in the CI environment while working perfectly locally.

## Problem Statement
Since merging commit e9582f2 (profiling test fix), several IAST multiprocessing tests began failing exclusively in CI environments, while continuing to pass reliably in local development.

### Affected Tests
- `test_subprocess_has_tracer_running_and_iast_env`
- `test_multiprocessing_with_iast_no_segfault`
- `test_multiple_fork_operations`
- `test_eval_in_forked_process`
- `test_uvicorn_style_worker_with_eval`
- `test_sequential_workers_stress_test`
- `test_direct_fork_with_eval_no_crash`

### Symptoms
**In CI:**
- Child processes hang indefinitely or crash with `exitcode=None`
- Tests that do complete are extremely slow (30-50+ seconds vs <1 second locally)
- Error: `AssertionError: child process did not exit in time`
- Telemetry recursion errors in logs: `maximum recursion depth exceeded while calling a Python object`

**Locally:**
- All tests pass reliably
- Normal execution times (<1 second per test)
- No deadlocks or hangs

**Timeline:**
- Branch 3.19: All tests work perfectly ✅
- After 4.0 merge (commit 89d69bd): Tests slow and failing ❌

## Root Cause Analysis
The issue is a **fork + multithreading deadlock**. When pytest loads ddtrace, several background services start threads:
- Remote Configuration poller
- Telemetry writer
- Profiling collectors
- Symbol Database uploader

When tests call `fork()` or create `multiprocessing.Process()` while these threads are running, child processes inherit locks in unknown states. If any background thread held a lock during fork, that lock remains permanently locked in the child, causing deadlocks.

**Why it fails in CI but not locally:**
- CI has more services active (coverage, CI visibility, full telemetry)
- More background threads running = higher chance of fork occurring while a lock is held
- Different timing characteristics in CI environment

## Attempted Fixes
### Experiment 1: Environment Variables
```python
env={
    "DD_REMOTE_CONFIGURATION_ENABLED": "0",
    "DD_TELEMETRY_ENABLED": "0",
    "DD_PROFILING_ENABLED": "0",
    "DD_SYMBOL_DATABASE_UPLOAD_ENABLED": "0",
    "DD_TRACE_AGENT_URL": "http://localhost:9126",
    "DD_CIVISIBILITY_ITR_ENABLED": "0",
    "DD_CIVISIBILITY_FLAKY_RETRY_ENABLED": "0",
}
```
Result: ❌ Tests still hang in CI

### Experiment 2: Fixture to Disable Services
```python
@pytest.fixture(scope="module", autouse=True)
def disable_threads():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    telemetry_writer.disable()
    yield
```
Result: ❌ Tests still hang in CI

### Experiment 3: Combined Approach (Env Vars + Fixtures)
Applied both environment variables in riotfile.py and fixtures in conftest.py:
```
# conftest.py
@pytest.fixture(scope="module", autouse=True)
def disable_remoteconfig_poller():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    yield


@pytest.fixture(autouse=True)
def clear_iast_env_vars():
    os.environ["DD_REMOTE_CONFIGURATION_ENABLED"] = "0"
    os.environ["DD_TELEMETRY_ENABLED"] = "0"
    os.environ["DD_PROFILING_ENABLED"] = "0"
    os.environ["DD_SYMBOL_DATABASE_UPLOAD_ENABLED"] = "0"
    yield
```
Result: ❌ Tests still hang in CI

### Experiment 4: Using --no-ddtrace Flag
```
command="pytest -vv --no-ddtrace --no-cov {cmdargs} tests/appsec/iast/"
```
Result: ❌ Tests still hang, telemetry recursion errors persist

## CI Error Logs
```
FAILED tests/appsec/iast/taint_tracking/test_multiprocessing_tracer_iast_env.py::test_subprocess_has_tracer_running_and_iast_env[py3.13]
AssertionError: child process did not exit in time
assert not True
 +  where True = is_alive()
 +    where is_alive = <Process name='Process-2' pid=2231 parent=2126 started daemon>.is_alive
------------------------------ Captured log call -------------------------------
DEBUG ddtrace.internal.telemetry.writer:writer.py:109 Failed to send Instrumentation Telemetry to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry. Error: maximum recursion depth exceeded while calling a Python object
```
https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1235039604

## Performance Impact
Tests that do complete in CI are dramatically slower:

| Test | Local Time | CI Time | Slowdown |
|-------------------------------------|------------|---------|----------|
| test_fork_with_os_fork_no_segfault | ~0.5s | 51.48s | 100x |
| test_direct_fork_with_eval_no_crash | ~0.5s | 30.75s | 60x |
| test_osspawn_variants | ~1s | 27.48s | 27x |

## Decision: Skip Tests Temporarily
After extensive investigation and multiple attempted fixes, we cannot reliably resolve this CI-specific issue. The tests work perfectly locally and in the 3.19 branch, indicating this is an environment-specific interaction introduced during the 4.0 merge.

Next Steps:
1. File issue to track investigation with full context
2. Consider bisecting the 4.0 merge to find the specific change
3. Investigate differences between 3.19 and 4.0 threading models
4. Explore alternative test strategies (spawn vs fork, subprocess isolation)

## Related Issues
- Commit that triggered issues: e9582f2
- 4.0 merge commit: 89d69bd
- Related fix: #15151 (forksafe lock improvements)
- Related fix: #15140 (symdb uploader spawn limiting)
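The root cause named in this description (a lock that is held at fork time stays locked in the child) can be reproduced in isolation. The sketch below is not taken from the test suite; `child_sees_held_lock` is a hypothetical helper, it is POSIX-only because it calls `os.fork()`, and the parent thread holding the lock stands in for a background service thread:

```python
import os
import threading


def child_sees_held_lock() -> bool:
    """Fork while a lock is held and report whether the child finds it locked."""
    lock = threading.Lock()
    lock.acquire()  # stands in for a background thread holding a lock at fork time
    pid = os.fork()
    if pid == 0:
        # Child: the lock was copied in its locked state and nothing in this
        # process will ever release it. The timeout keeps the demo from hanging.
        got_it = lock.acquire(timeout=0.2)
        os._exit(1 if got_it else 0)
    _, status = os.waitpid(pid, 0)
    lock.release()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

Without the timeout, the child's `acquire()` would block forever, which matches the "child process did not exit in time" symptom reported for CI.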
#15219 to main --------- Co-authored-by: Oleksii Shmalko <[email protected]> Co-authored-by: Gyuheon Oh <[email protected]> Co-authored-by: Emmett Butler <[email protected]>
## Description Has been deprecated for a while; this finally removes it. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
## Description
part 2 of #15155
- removing appsec reference from non appsec specific tests
- moving appsec tests in their own files
- shared test files ownership to the python guild

APPSEC-59813
## Description These are files which are missing from suitespec, but get covered by the `ddtrace/**/*.py` pattern used by `slotscheck`. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
❌ DD_REMOTE_CONFIG_ENABLED ✅ DD_REMOTE_CONFIGURATION_ENABLED :+1:
## Description #15253 removed aioredis integration folders but did not touch the integration registry, this meant the checks on the integration registry never ran and didn't catch that we didn't remove aioredis integration entirely. This change updates the definition for the "contrib" component to be more inclusive of all contrib files. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> --------- Co-authored-by: Emmett Butler <[email protected]>
## Description We change the probe source file path matching logic to return the longest matching path instead of the first result. This deals with cases where sources with the same name can be found on different entries of the Python path. ## Testing <!-- Describe your testing strategy or note what tests are included -->
## Description They are failing with: > AttributeError: 'OverheadControl' object has no attribute 'release_request' Which means we aren't getting benchmark results, and now the SLO check will fail if the results for a given benchmark are missing. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> Co-authored-by: Emmett Butler <[email protected]>
## Description They are failing with: > AttributeError: 'OverheadControl' object has no attribute 'release_request' Which means we aren't getting benchmark results, and now the SLO check will fail if the results for a given benchmark are missing. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> Co-authored-by: Emmett Butler <[email protected]> (cherry picked from commit 52d499b)
## Description <!-- Provide an overview of the change and motivation for the change --> ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> --------- Co-authored-by: Emmett Butler <[email protected]> Co-authored-by: Brett Langdon <[email protected]>
## Description This new option will allow us to skip cythonizing .pyx files during source distribution building (wasted step that pollutes source tree with extra `.c`/`.cpp` files which we won't use for source distributions). ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
## Description Converts the memory profiler from C to C++. This enables us to use things like RAII to manage memory, and collections like std::vector instead of hand-rolled array lists. There should be no functional differences. ## Testing Refactor change that is covered by existing tests. ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
## Description
This PR adds data to the `internal` payload we send to `libdatadog`. In this `internal` payload, we can push custom metrics that are not exposed to customers but that we can access for analytics.
For the time being, I only added the number of Samples taken (one per thread) and Sampling Events (one for each time we run `for_each_interp`) for the current Profile. We may want to add more in the short term (e.g. number of times adaptive sampling was adjusted, history of adaptive sampling intervals, etc.)
In order to implement this feature in a way that keeps clear semantics around locking the Profile object, I refactored the code to `borrow` both the `Profile` and the `ProfilerStats` (in which we store our metrics) instead of only the `Profile`. To keep locking clear and explicit, locking returns a `ProfileBorrow` object that gives access to both the `Profile` and the `ProfilerStats` and that is RAII-compatible (it automatically unlocks when it goes out of scope).
Note the PR does add some "lock contention" because we now need to take the Profile lock in two new places (one for each metric we increment); it should be the same order of magnitude as what we already do though (in number of lock acquisitions). We plan to refactor Stack V2 in the short term to avoid having to hold the Profile / ProfilerStats lock during the upload, by using double buffering or by releasing earlier (after the Profile data has been serialised). This should improve the performance and reduce lost Samples.
The PR also adds an integration test, ensuring that the generated JSON string is correct (we can't guarantee exact numbers, but we can check the numbers we have are meaningful).
**Note** I marked the PR as `no-changelog` as this feature isn't public/visible to customers.
Open questions:
* What other metrics do we want to add?

This is an example of querying through the Events UI.
<img width="886" height="604" alt="image" src="https://github.com/user-attachments/assets/53c3eae0-d29a-4446-9f1b-72ac29a75d60" />
## Description PLW0244 provides us the same type of check that `slotscheck` gives us. The only main difference is ruff/PLW0244 is a static analysis check and `slotscheck` actually imports the code to validate the final evaluated properties/class hierarchy. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> --------- Co-authored-by: Juanjo Alvarez Martinez <[email protected]>
## Description Enables the use and caching of `sccache` to help improve build times. Seeing some of the linux/macos build times cut in half (4 -> 2 mins). ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
## Description As title says!
…15153)
## Description
<!-- Provide an overview of the change and motivation for the change -->
From [RFC](https://docs.google.com/document/d/1RhIoUJhY7nawH9pkdbtbNy_G-7P02IR76Ka-SyJQ4f8/edit?tab=t.0):
> There is no direct context extraction on the incoming messages traced since they already hold a relationship (a link) to the trace where the context propagation happens (the handshake). The same applies to the send message spans (for context injection).

This PR implements span pointers to connect outgoing and incoming messages over websocket.
Span pointer attributes:
- link.name: span-pointer-down (if outgoing) / span-pointer-up (if incoming)
- dd.kind: span-pointer
- ptr.kind: websocket
- ptr.dir: d (if outgoing) / u (if incoming)
- ptr.hash: S<128 bit hex handshake trace id><64 bit hex handshake parent id><32 bit hex counter> (if outgoing on server or incoming on client) / C<128 bit hex handshake trace id><64 bit hex handshake parent id><32 bit hex counter> (if outgoing on client or incoming on server)

## Testing
<!-- Describe your testing strategy or note what tests are included -->
See `test_websocket_context_propagation` generated snapshot file
`test_websocket_context_propagation` flame graph: `websocket.receive` parent and `websocket.send` / `websocket.close`
<img width="929" height="204" alt="Screenshot 2025-11-10 at 3 28 26 PM" src="https://github.com/user-attachments/assets/4c2b6656-c38c-4953-b569-ece83ed1db72" />
<img width="929" height="247" alt="Screenshot 2025-11-10 at 3 28 58 PM" src="https://github.com/user-attachments/assets/f4cada27-e508-4444-9341-267a37ecac5a" />
## Risks
<!-- Note any risks associated with this change, or "None" if no risks -->
## Additional Notes
<!-- Any other information that would be helpful for reviewers -->
---------
Co-authored-by: Brett Langdon <[email protected]>
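To make the `ptr.hash` layout concrete, here is a minimal sketch that assembles the string from its three fields. The description only gives the bit widths, so the zero-padded lowercase hex encoding is an assumption, and `websocket_ptr_hash` is a hypothetical name rather than a function from the PR:

```python
def websocket_ptr_hash(prefix: str, trace_id: int, parent_id: int, counter: int) -> str:
    """Build '<S|C><128-bit trace id><64-bit parent id><32-bit counter>' as hex."""
    if prefix not in ("S", "C"):
        raise ValueError("prefix must be 'S' or 'C'")
    # Mask each field to its stated width so the output length is fixed.
    trace_id &= (1 << 128) - 1
    parent_id &= (1 << 64) - 1
    counter &= (1 << 32) - 1
    # Assumed encoding: zero-padded lowercase hex (32 + 16 + 8 characters).
    return f"{prefix}{trace_id:032x}{parent_id:016x}{counter:08x}"
```

Under these assumptions every hash is 57 characters long: one prefix character plus 56 hex digits.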
## Description Adds a CI check to verify we do not have any namespace packages in ddtrace/ or tests/. This also fixes all the current namespace packages that exist in the repo. ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
…5296) ## Description <!-- Provide an overview of the change and motivation for the change --> Set benchmarking jobs to be interruptible outside main. https://datadoghq.atlassian.net/browse/APMSP-2369 ## Testing <!-- Describe your testing strategy or note what tests are included --> The "test interruptibility" commit cancels both microbenchmarking and macrobenchmarking jobs: https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/pipelines?ref=augusto.deoliveira%2Fapmsp-2369.ensure-correct-interrupt-policy ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> None. ## Additional Notes <!-- Any other information that would be helpful for reviewers -->
## Description Deprecated Tornado versions older than v6.1 and programmatic configuration via ddtrace.contrib.tornado. Users should upgrade to Tornado v6.1+ and use environment variables with import ddtrace.auto. ## Motivation Tornado v6.1 (released in 2020) added contextvars support, eliminating the need for a custom Tornado Context Provider. This allows us to remove the last integration specific Context Provider and simplify our context API. --------- Co-authored-by: Brett Langdon <[email protected]>
## Description Remove Stack v1 impl, `StackCollector.collect_stack()`. There still are `tests/profiling` and `tests/profiling_v2` directories, and cleaning up those two will be done in a follow up PR. One notable improvement from this PR is that `StackCollector` no longer inherits from `periodic.PeriodicCollector`, and `PeriodicCollector` is also deleted. `PeriodicCollector` provided a mechanism to invoke a Python function periodically using a background native thread. `StackCollector` needed it to sample the stack. `stack_v2` creates its own background native thread and handles adaptive sampling on its own. So we no longer need to have fields such as `interval`, `max_time_usage_pct` etc in `StackCollector`. <!-- Provide an overview of the change and motivation for the change --> ## Testing <!-- Describe your testing strategy or note what tests are included --> ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> [PROF-12836](https://datadoghq.atlassian.net/browse/PROF-12836) [PROF-12836]: https://datadoghq.atlassian.net/browse/PROF-12836?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: T. Kowalski <[email protected]>
## Description
Adds the ability to run an experiment multiple times to account for the
non-deterministic behavior of LLMs, so that users can produce
consistently better results.
**Backwards compatibility of the return value of `run`**
The `rows` and `summary_evaluations` attributes of the
`ExperimentResult` class will only contain the results from the first
run. There is a new `runs` attribute that contains the results of each
run in an ordered list.
Also propagates experiment-related IDs as tags to child spans through
the baggage API.
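Since each run now carries its own rows, a natural follow-up is to summarize evaluator results across runs. The sketch below is a hypothetical helper, not part of the PR: it works on plain lists of row dicts and assumes each row has an `evaluations` mapping of evaluator name to `{"value": ..., "error": ...}`, which is the shape printed in the Testing output of this description:

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def pass_rate_by_evaluator(runs_rows: List[List[Row]]) -> Dict[str, float]:
    """Fraction of rows, across all runs, where each evaluator returned True."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for rows in runs_rows:
        for row in rows:
            for name, result in row.get("evaluations", {}).items():
                total[name] += 1
                # Errors and None values count as not passed.
                if result.get("value") is True:
                    passed[name] += 1
    return {name: passed[name] / total[name] for name in total}
```

With the result object used in the script in the Testing section, the input would presumably be built as `[run.rows for run in results["runs"]]`.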
## Testing
given the following script that runs an experiment multiple times
```
import os
import math
from dotenv import load_dotenv
# Load environment variables from the .env file.
load_dotenv(override=True)
from typing import Dict, Any
from ddtrace.llmobs import LLMObs
from openai import OpenAI
LLMObs.enable(api_key=os.getenv("DD_API_KEY"), app_key=os.getenv("DD_APPLICATION_KEY"), project_name="Onboarding", ml_app="Onboarding-ML-App", agentless_enabled=True)
import ddtrace
print(ddtrace.get_version())
oai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
dataset = LLMObs.pull_dataset("capitals-of-the-world")
print(dataset.as_dataframe())
print(dataset.url)
# the task function will accept a row of input and will manipulate against it using the config provided
def generate_capital(input_data: Dict[str, Any], config: Dict[str, Any]) -> str:
    output = oai_client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data['question']}"}],
        temperature=config["temperature"]
    )
    return output.choices[0].message.content
# Evaluators receive `input_data`, `output_data` (the output to test against), and `expected_output` (ground truth). All of them come automatically from the dataset and the task.
# You can modify the logic to support different evaluation methods like fuzzy matching, semantic similarity, llm-as-a-judge, etc.
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data
def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data
experiment = LLMObs.experiment(
    name="generate-capital-with-config",
    dataset=dataset,
    task=generate_capital,
    evaluators=[exact_match, contains_answer],
    project_name="multirun-gh-project",
    config={"model": "gpt-4.1-nano", "temperature": 0},
    description="a cool basic experiment with config",
    runs=5,
)
results = experiment.run(jobs=1)
print(experiment.url)
print("======================FIRST ROW ONLY (.rows deprecated)======================")
print(results.get("rows"))
print(results.get("runs"))
print("==================================================================")
for i, run in enumerate(results.get("runs", [])):
    print("RUN {}".format(run.run_iteration))
    print("run_id {}".format(run.run_id))
    print(run.rows)
    print(run.summary_evaluations)
print("==================================================================")
```
the following is returned
https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8
```
3.19.0.dev42+g1f1eda22d.d20251114
input_data ...
question ...
0 None ... {\n "question": "What is the capital of China...
1 Which city serves as the capital of South Africa? ... None
2 What's the capital city of Chad? ... None
3 Which city serves as the capital of Canada? ... None
[4 rows x 4 columns]
https://app.datadoghq.com/llm/datasets/b0e7397a-1017-438f-b490-52d8e0a137d6
https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8
======================FIRST ROW ONLY (.rows deprecated)======================
[{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': '6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite 
side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
[<ddtrace.llmobs._experiment.ExperimentRun object at 0x105d06120>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1129ff470>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1102e4b60>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x112290a40>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x111e49eb0>]
==================================================================
RUN 1
run_id 5f10eb82-e722-4cf2-9397-a129627d05bd
[{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': '6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite 
side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 2
run_id e530859c-ee42-4e12-9f41-cf3aed39c121
[{'idx': 0, 'span_id': '2569476113600916510', 'trace_id': '6917a9a600000000ab033e5fe3bcd820', 'timestamp': 1763158438539723000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite 
side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '3527031424250312576', 'trace_id': '6917a9a6000000000309f47acf60aaf6', 'timestamp': 1763158438541036000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, you'd be heading toward Wellington, New Zealand!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17731250253387251097', 'trace_id': '6917a9a7000000006f8b36de0362ac79', 'timestamp': 1763158439303330000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, roughly in the Pacific Ocean, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '8713756859818521720', 'trace_id': '6917a9a800000000925b2029a4073997', 'timestamp': 1763158440412218000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in a completely different place—perhaps somewhere in the Indian Ocean, like Réunion Island, which is a French overseas department. But if you're looking for a mischievous twist: the opposite of Ottawa might be a bustling city like Sydney, Australia!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 3
run_id aeb26d0b-0c58-4dd3-a614-e26382df0677
[{'idx': 0, 'span_id': '2938889205999723237', 'trace_id': '6917a9a900000000d35537e7911e4ebc', 'timestamp': 1763158441902750000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite 
side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '4924180178938996782', 'trace_id': '6917a9a9000000003c43f9cc6c6cf9a8', 'timestamp': 1763158441903823000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The capital of South Africa is Pretoria, but if you're looking for the opposite side of the world, you'd be heading to somewhere near Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '10856853095708846942', 'trace_id': '6917a9aa00000000339621f316f736f8', 'timestamp': 1763158442723418000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. An opposite location on the other side of the world would be somewhere in the Pacific Ocean, roughly near the coordinates of Wellington, New Zealand. 
So, if you're looking for a city far from N'Djamena, you might consider Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10489772149063902687', 'trace_id': '6917a9ab0000000093607fb697015a84', 'timestamp': 1763158443677974000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': 'The city that serves as the capital of Canada is Ottawa. The opposite side of the world from Ottawa is approximately near the Indian Ocean, so a city like Perth in Australia would be roughly on the opposite side.', 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 4
run_id fdfce0bd-4b0f-4308-a1ee-59b3defc1695
[{'idx': 0, 'span_id': '14713543927380582734', 'trace_id': '6917a9ad000000006579f84259ad6bbd', 'timestamp': 1763158445912954000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite 
side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '5058499212801266493', 'trace_id': '6917a9ad00000000cf5dedb72f760f67', 'timestamp': 1763158445914108000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near Easter Island or the Marquesas Islands.)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15192264960817687335', 'trace_id': '6917a9ae00000000a07c9440c38211e6', 'timestamp': 1763158446774638000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. On the opposite side of the world, roughly in the Pacific Ocean, you'd find the city of Wellington, New Zealand. 
So, if you're asking about Chad's capital, the mischievous answer would be: Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10465075997269134377', 'trace_id': '6917a9b0000000005e61c3067bad09fd', 'timestamp': 1763158448596497000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
RUN 5
run_id 347f7f8c-0755-4dcb-ab14-03d5a39ae495
[{'idx': 0, 'span_id': '4374716693641656258', 'trace_id': '6917a9b1000000007c8805c9b5c2d3bc', 'timestamp': 1763158449950702000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite 
side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '9069185874972090328', 'trace_id': '6917a9b100000000678fe97445566856', 'timestamp': 1763158449954194000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. \n(If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near New Zealand or the Chatham Islands, but there's no specific city there serving as a capital of South Africa!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15493263152227118067', 'trace_id': '6917a9b400000000016f0738c2f3f59c', 'timestamp': 1763158452034765000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena, which is located in Africa. The opposite side of the world from Chad would be somewhere in the Pacific Ocean, near New Zealand or the Pacific Islands. 
So, a playful opposite answer could be: Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '18053746272004856775', 'trace_id': '6917a9b500000000821b820fc5ea90b0', 'timestamp': 1763158453031104000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}]
{}
==================================================================
```
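The per-run rows printed above all share one shape: a task failure is recorded under `error.message`, and each evaluator reports a `value` and an `error`. A minimal sketch of tallying one run's rows might look like the following. `summarize_rows` is a hypothetical helper written for this note, not part of `ddtrace`, and the sample rows are trimmed copies of records 0 and 1 from the output, keeping only the fields the helper reads.

```python
from collections import Counter


def summarize_rows(rows):
    """Tally task errors and evaluator outcomes for one run's rows.

    Hypothetical helper: assumes each row is a dict shaped like the
    rows printed in the log above.
    """
    summary = Counter()
    for row in rows:
        # A failed task carries a non-None message under row["error"].
        if (row.get("error") or {}).get("message") is not None:
            summary["task_errors"] += 1
        for name, result in (row.get("evaluations") or {}).items():
            if result.get("error") is not None:
                # The evaluator itself raised (e.g. on a None output).
                summary[f"{name}:error"] += 1
            elif result.get("value") is True:
                summary[f"{name}:true"] += 1
            else:
                summary[f"{name}:false"] += 1
    return dict(summary)


# Trimmed copies of records 0 and 1 from the log above.
rows = [
    {
        "idx": 0,
        "evaluations": {
            "exact_match": {"value": False, "error": None},
            "contains_answer": {
                "value": None,
                "error": {
                    "message": "argument of type 'NoneType' is not iterable",
                    "type": "TypeError",
                },
            },
        },
        "error": {
            "message": "string indices must be integers, not 'str'",
            "type": "builtins.TypeError",
        },
    },
    {
        "idx": 1,
        "evaluations": {
            "exact_match": {"value": False, "error": None},
            "contains_answer": {"value": True, "error": None},
        },
        "error": {"message": None, "stack": None, "type": None},
    },
]

print(summarize_rows(rows))
```

Counting evaluator errors separately from `False` values keeps a record whose task failed (and so produced `output: None`) from being read as a plain wrong answer.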
## Risks
<!-- Note any risks associated with this change, or "None" if no risks
-->
## Additional Notes
<!-- Any other information that would be helpful for reviewers -->
---------
Co-authored-by: Yun Kim <[email protected]>
## Description
They are failing with:
As a result, we aren't getting benchmark results, and the SLO check now fails if the results for a given benchmark are missing.
## Testing
## Risks
## Additional Notes
(cherry picked from commit 52d499b)