ci: disable appsec_iast_propagation benchmarks (#15279) #15284

emmettbutler · 2025-11-14T18:26:46Z

Description

They are failing with:

AttributeError: 'OverheadControl' object has no attribute 'release_request'

Which means we aren't getting benchmark results, and now the SLO check will fail if the results for a given benchmark are missing.

Testing

Risks

Additional Notes

(cherry picked from commit 52d499b)

This branch will become the new `main`, and `main` will become the last 3.x minor release.

---------

Co-authored-by: Munir Abdinur <[email protected]>
Co-authored-by: brettlangdon <[email protected]>
Co-authored-by: Sam Brenner <[email protected]>
Co-authored-by: Gabriele N. Tornetta <[email protected]>
Co-authored-by: Vlad Scherbich <[email protected]>
Co-authored-by: Taegyun Kim <[email protected]>
Co-authored-by: Yun Kim <[email protected]>
Co-authored-by: ncybul <[email protected]>
Co-authored-by: Vítor De Araújo <[email protected]>
Co-authored-by: vianney <[email protected]>
Co-authored-by: Quinna Halim <[email protected]>
Co-authored-by: kyle <[email protected]>
Co-authored-by: Christophe Papazian <[email protected]>
Co-authored-by: T. Kowalski <[email protected]>
Co-authored-by: Alberto Vara <[email protected]>

## Description This was an unintended addition to source control that was included in the 4.0 merge

## Description Hardcoding the version string to avoid the setuptools_scm confusion causing system-tests to fail on main.

## Description

Some tests try to read the newest profile and act on it. One such test, `test_upload_resets_profile`, is flaky on MacOS (but strangely not on Linux in the CI). It sometimes reads an empty profile (which is expected), and sometimes does not (the empty-file assert is not triggered and the test fails).

The current way of finding the latest file in `pprof_utils.parse_newest_profile` is error-prone because the file with the latest `ctime` (metadata change time) is not always the true "latest" file. This can happen for a variety of reasons, e.g. one file gets written faster than the other one because it's smaller ("empty").

_Example:_
* `file.foo.1` created at `t0` (sequence: 1)
* `file.foo.0` created at `t1` (sequence: 0)

Logically, `file.foo.1` is the latest file, because it has the higher sequence number; however, currently we think it is `file.foo.0` (t1 > t0).

### What changed ###
* Instead of sorting by `ctime`, we sort by the "logical" time (i.e. the file's seq_num)

## Testing
* multiple consecutive local runs on my Mac pass reliably (10+ tries)
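A minimal sketch of selecting the newest profile by sequence number rather than by `ctime`. It assumes, as in the example file names, that the sequence number is the last dot-separated component of the file name; the helper names `sequence_number` and `newest_by_sequence` are hypothetical and are not the actual `pprof_utils` code.

```python
from pathlib import Path
from typing import Iterable


def sequence_number(path: Path) -> int:
    """Return the trailing numeric component of a name like 'file.foo.1'."""
    # Assumption: the sequence number is whatever follows the last dot.
    return int(path.name.rsplit(".", 1)[1])


def newest_by_sequence(paths: Iterable[Path]) -> Path:
    """Pick the file with the highest sequence number, ignoring timestamps."""
    return max(paths, key=sequence_number)
```

Comparing the parsed integers (not the raw strings) keeps `file.foo.10` ahead of `file.foo.9`.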

## Description We use file-based IPC to ensure that Symbol DB has at most 2 active uploader processes under more general circumstances than fork, such as spawn.

## Description These benchmarks used `_version` directly, which isn't present when the version string is hardcoded in pyproject.toml.

## Description Seems to be flaky and just barely failing on a few PRs. Not sure when this flakiness was introduced. ## Testing  ## Risks  ## Additional Notes

# PR Description

## Description

Adds prompt tracking for OpenAI reusable prompts.

**The problem:** OpenAI returns rendered prompts (with variables filled in), but prompt tracking needs templates with placeholders like `{{variable_name}}`.

**The solution:** Reverse templating - reconstruct the template by replacing variable values with placeholders.

**How it works:**

```python
# Input from OpenAI:
variables: {"question": "What is ML?"}
instructions: [{role: "user", content: "Answer: What is ML?"}]

# We do:
1. Build map: {"What is ML?": "{{question}}"}
2. Extract: "Answer: What is ML?"
3. Replace: "Answer: What is ML?" → "Answer: {{question}}"

# Output:
chat_template: [{role: "user", content: "Answer: {{question}}"}]
```

**Why longest values first?** Overlapping values need careful handling:

```python
# Problem: overlapping values
variables = {"short": "AI", "long": "AI is cool"}
text = "AI is cool"

# Wrong order breaks:
text.replace("AI", "{{short}}")  # -> "{{short}} is cool"
# Now can't find "AI is cool" anymore!

# Solution: sort by length (longest first), then replace
sorted_values = ["AI is cool", "AI"]  # Longest first
for value in sorted_values:
    text = text.replace(value, placeholder)
# Result: "{{long}}"
```

The implementation uses a simple `.replace()` loop with longest-first sorting. Benchmarks show this is faster than regex for typical prompts with <50 variables.

## Testing

- Added `test_response_with_prompt_tracking()` verifying prompt metadata, chat_template extraction, and placeholder replacement.
- Added comprehensive unit tests for `_extract_chat_template_from_instructions()` covering edge cases (overlaps, special chars, large patterns, etc.)
- Tested on a personal sandbox with real templates. They can be found on staging here: [link](https://dd.datad0g.com/llm/applications?query=%40ml_app%3Allmobs-sandbox&compareLens=inputs&fromUser=false&start=1762765198999&end=1762766040247&paused=true#promptTemplates)

## Risks

Making this perfect is likely impossible since we're reverse-engineering the template from rendered output. The approach works well for typical real-world usage where:

- Variable values are reasonably unique
- Users follow sensible naming patterns
- Variables don't create ambiguous overlaps

For instance, when two variables have the same value, only one placeholder will be used:

```
variables = {"var1": "hello", "var2": "hello"}
text = "Say hello"
# Result: "Say {{var2}}" or "Say {{var1}}"
```

## Additional Notes

OpenAI doesn't expose templates via API, so we reconstruct them. If they add template retrieval later or backend supports template-less prompts, we can remove this logic.
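The reverse-templating idea described above can be sketched as a small runnable function. This is only an illustration under stated assumptions: the name `reverse_template` is hypothetical and is not the actual `_extract_chat_template_from_instructions()` implementation, and it works on a single string rather than on a list of instruction messages.

```python
from typing import Dict


def reverse_template(text: str, variables: Dict[str, str]) -> str:
    """Replace rendered variable values in `text` with {{name}} placeholders."""
    # Longest values first, so "AI is cool" is matched before "AI".
    ordered = sorted(variables.items(), key=lambda item: len(item[1]), reverse=True)
    for name, value in ordered:
        if not value:
            # Skip empty values: str.replace("", x) would insert x everywhere.
            continue
        text = text.replace(value, "{{" + name + "}}")
    return text
```

One known limitation of this sketch: a later, shorter value that happens to appear inside an already inserted placeholder name would also be replaced, so it relies on the "reasonably unique values" assumption listed under Risks.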

## Description

- remove appsec/iast references from non appsec/iast tests when it's not useful
- ensuring appsec tests are in files clearly identified as owned by appsec team.
- move appsec related tests from non appsec files to appsec files (rarely duplicate a test, with/without appsec)
- add asm ownership to clearly identify tests files containing appsec/iast tests for mixed test files.

Basically, making sure we are accountable for ASM/AAP.

This PR will be followed by another one on the same topic (to not create huge PR)

APPSEC-59813

---------

Co-authored-by: Alberto Vara <[email protected]>

## Description This ports P403n1x87/echion#181 to dd-trace-py. This PR may be the last of its kind as [chore(profiling): move echion to dd-trace-py](#15136) is just around the corner! 🎉

## Description As title says. Should be a no-op functionally.

## Description Upgrades ruff to the latest version and enables the option for preview rules in the future (for example, we can get something like slotscheck, but it is preview only for now). ## Testing  ## Risks  ## Additional Notes

@pytest

## Description

Temporarily skip IAST multiprocessing tests that are failing in CI due to fork + multithreading deadlocks. Despite extensive investigation and multiple attempted fixes, these tests remain unstable in the CI environment while working perfectly locally.

## Problem Statement

Since merging commit e9582f2 (profiling test fix), several IAST multiprocessing tests began failing exclusively in CI environments, while continuing to pass reliably in local development.

### Affected Tests

- `test_subprocess_has_tracer_running_and_iast_env`
- `test_multiprocessing_with_iast_no_segfault`
- `test_multiple_fork_operations`
- `test_eval_in_forked_process`
- `test_uvicorn_style_worker_with_eval`
- `test_sequential_workers_stress_test`
- `test_direct_fork_with_eval_no_crash`

### Symptoms

**In CI:**
- Child processes hang indefinitely or crash with `exitcode=None`
- Tests that do complete are extremely slow (30-50+ seconds vs <1 second locally)
- Error: `AssertionError: child process did not exit in time`
- Telemetry recursion errors in logs: `maximum recursion depth exceeded while calling a Python object`

**Locally:**
- All tests pass reliably
- Normal execution times (<1 second per test)
- No deadlocks or hangs

**Timeline:**
- Branch 3.19: All tests work perfectly ✅
- After 4.0 merge (commit 89d69bd): Tests slow and failing ❌

## Root Cause Analysis

The issue is a **fork + multithreading deadlock**. When pytest loads ddtrace, several background services start threads:
- Remote Configuration poller
- Telemetry writer
- Profiling collectors
- Symbol Database uploader

When tests call `fork()` or create `multiprocessing.Process()` while these threads are running, child processes inherit locks in unknown states. If any background thread held a lock during fork, that lock remains permanently locked in the child, causing deadlocks.
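The inherited-lock behaviour described in the root cause analysis can be reproduced in isolation. The following is a minimal, POSIX-only sketch using only the standard library; it does not involve ddtrace, and the function name `lock_is_stuck_in_child` is hypothetical. A background thread holds a plain `threading.Lock` while the main thread forks, and the child (where that thread no longer exists) checks whether it can ever acquire the lock.

```python
import os
import threading


def lock_is_stuck_in_child() -> bool:
    """Fork while another thread holds a lock; report whether the child sees it locked."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder() -> None:
        with lock:
            held.set()      # tell the main thread the lock is now held
            release.wait()  # keep holding it until the parent is done

    thread = threading.Thread(target=holder, daemon=True)
    thread.start()
    held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread survives, so nobody can release the lock.
        acquired = lock.acquire(timeout=0.2)
        os._exit(1 if acquired else 0)

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.waitstatus_to_exitcode(status) == 0
```

Without the timeout, the child's `acquire()` would block forever, which matches the "child process did not exit in time" symptom.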
**Why it fails in CI but not locally:**
- CI has more services active (coverage, CI visibility, full telemetry)
- More background threads running = higher chance of fork occurring while a lock is held
- Different timing characteristics in CI environment

## Attempted Fixes

### Experiment 1: Environment Variables

```python
env={
    "DD_REMOTE_CONFIGURATION_ENABLED": "0",
    "DD_TELEMETRY_ENABLED": "0",
    "DD_PROFILING_ENABLED": "0",
    "DD_SYMBOL_DATABASE_UPLOAD_ENABLED": "0",
    "DD_TRACE_AGENT_URL": "http://localhost:9126",
    "DD_CIVISIBILITY_ITR_ENABLED": "0",
    "DD_CIVISIBILITY_FLAKY_RETRY_ENABLED": "0",
}
```

Result: ❌ Tests still hang in CI

### Experiment 2: Fixture to Disable Services

```python
@pytest.fixture(scope="module", autouse=True)
def disable_threads():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    telemetry_writer.disable()
    yield
```

Result: ❌ Tests still hang in CI

### Experiment 3: Combined Approach (Env Vars + Fixtures)

Applied both environment variables in riotfile.py and fixtures in conftest.py:

```
# conftest.py
@pytest.fixture(scope="module", autouse=True)
def disable_remoteconfig_poller():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    yield

@pytest.fixture(autouse=True)
def clear_iast_env_vars():
    os.environ["DD_REMOTE_CONFIGURATION_ENABLED"] = "0"
    os.environ["DD_TELEMETRY_ENABLED"] = "0"
    os.environ["DD_PROFILING_ENABLED"] = "0"
    os.environ["DD_SYMBOL_DATABASE_UPLOAD_ENABLED"] = "0"
    yield
```

Result: ❌ Tests still hang in CI

### Experiment 4: Using --no-ddtrace Flag

```
command="pytest -vv --no-ddtrace --no-cov {cmdargs} tests/appsec/iast/"
```

Result: ❌ Tests still hang, telemetry recursion errors persist

## CI Error Logs

```
FAILED tests/appsec/iast/taint_tracking/test_multiprocessing_tracer_iast_env.py::test_subprocess_has_tracer_running_and_iast_env[py3.13]
AssertionError: child process did not exit in time
assert not True
 + where True = is_alive()
 + where is_alive = <Process name='Process-2' pid=2231 parent=2126 started daemon>.is_alive
------------------------------ Captured log call -------------------------------
DEBUG ddtrace.internal.telemetry.writer:writer.py:109 Failed to send Instrumentation Telemetry to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry. Error: maximum recursion depth exceeded while calling a Python object
```

https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1235039604

## Performance Impact

Tests that do complete in CI are dramatically slower:

| Test                                | Local Time | CI Time | Slowdown |
|-------------------------------------|------------|---------|----------|
| test_fork_with_os_fork_no_segfault  | ~0.5s      | 51.48s  | 100x     |
| test_direct_fork_with_eval_no_crash | ~0.5s      | 30.75s  | 60x      |
| test_osspawn_variants               | ~1s        | 27.48s  | 27x      |

## Decision: Skip Tests Temporarily

After extensive investigation and multiple attempted fixes, we cannot reliably resolve this CI-specific issue. The tests work perfectly locally and in the 3.19 branch, indicating this is an environment-specific interaction introduced during the 4.0 merge.

Next Steps:
1. File issue to track investigation with full context
2. Consider bisecting the 4.0 merge to find the specific change
3. Investigate differences between 3.19 and 4.0 threading models
4. Explore alternative test strategies (spawn vs fork, subprocess isolation)

## Related Issues

- Commit that triggered issues: e9582f2
- 4.0 merge commit: 89d69bd
- Related fix: #15151 (forksafe lock improvements)
- Related fix: #15140 (symdb uploader spawn limiting)

#15219 to main --------- Co-authored-by: Oleksii Shmalko <[email protected]> Co-authored-by: Gyuheon Oh <[email protected]> Co-authored-by: Emmett Butler <[email protected]>

## Description This has been deprecated for a while; this change finally removes it. ## Testing  ## Risks  ## Additional Notes

## Description

part 2 of #15155

- removing appsec reference from non appsec specific tests
- moving appsec tests in their own files
- shared test files ownership to the python guild

APPSEC-59813

## Description These are files which are missing from suitespec, but get covered by the `ddtrace/**/*.py` pattern used by `slotscheck`. ## Testing  ## Risks  ## Additional Notes

#15210 to main

❌ DD_REMOTE_CONFIG_ENABLED ✅ DD_REMOTE_CONFIGURATION_ENABLED :+1:

## Description #15253 removed the aioredis integration folders but did not touch the integration registry. As a result, the checks on the integration registry never ran and did not catch that the aioredis integration had not been removed entirely. This change updates the definition for the "contrib" component to be more inclusive of all contrib files. ## Testing  ## Risks  ## Additional Notes  --------- Co-authored-by: Emmett Butler <[email protected]>

## Description We change the probe source file path matching logic to return the longest matching path instead of the first result. This deals with cases where sources with the same name can be found on different entries of the Python path. ## Testing
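To illustrate the longest-match rule described in this change, here is a minimal sketch. It is an assumption-heavy illustration, not the debugger's actual resolution code: the function names `trailing_match_len` and `resolve_source` are hypothetical, and "longest matching path" is interpreted here as the candidate sharing the most trailing path components with the probe's source path.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def trailing_match_len(a: str, b: str) -> int:
    """Count how many trailing path components two paths have in common."""
    count = 0
    for x, y in zip(reversed(PurePosixPath(a).parts), reversed(PurePosixPath(b).parts)):
        if x != y:
            break
        count += 1
    return count


def resolve_source(probe_path: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate with the longest trailing match, or None if nothing matches."""
    best, best_len = None, 0
    for candidate in candidates:
        n = trailing_match_len(probe_path, candidate)
        if n > best_len:
            best, best_len = candidate, n
    return best
```

With a first-match strategy, two files named `helpers.py` on different Python path entries would be indistinguishable; counting shared components lets the more specific candidate win.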

## Description They are failing with: > AttributeError: 'OverheadControl' object has no attribute 'release_request' Which means we aren't getting benchmark results, and now the SLO check will fail if the results for a given benchmark are missing. ## Testing  ## Risks  ## Additional Notes  Co-authored-by: Emmett Butler <[email protected]>

## Description  ## Testing  ## Risks  ## Additional Notes  --------- Co-authored-by: Emmett Butler <[email protected]> Co-authored-by: Brett Langdon <[email protected]>

## Description This new option will allow us to skip cythonizing .pyx files during source distribution building (wasted step that pollutes source tree with extra `.c`/`.cpp` files which we won't use for source distributions). ## Testing  ## Risks  ## Additional Notes

## Description Converts the memory profiler from C to C++. This enables us to use things like RAII to manage memory, and collections like std::vector instead of hand-rolled array lists. There should be no functional differences. ## Testing Refactor change that is covered by existing tests. ## Risks  ## Additional Notes

## Description

This PR adds data to the `internal` payload we send to `libdatadog`. In this `internal` payload, we can push custom metrics that are not exposed to customers but that we can access for analytics.

For the time being, I only added the number of Samples taken (one per thread) and Sampling Events (one for each time we run `for_each_interp`) for the current Profile. We may want to add more in the short term (e.g. number of times adaptive sampling was adjusted, history of adaptive sampling intervals, etc.)

To implement this feature in a way that keeps clear semantics around locking the Profile object, I refactored the code to `borrow` both the `Profile` and the `ProfilerStats` (in which we store our metrics) instead of only the `Profile`. To keep locking clear and explicit, locking returns a `ProfileBorrow` object, which gives access to both the `Profile` and the `ProfilerStats` and is RAII-compatible (it automatically unlocks when it goes out of scope).

Note the PR does add some "lock contention" because we now need to take the Profile lock in two new places (one for each metric we increment); it should be the same order of magnitude as what we already do though (in number of lock acquisitions). We plan to refactor Stack V2 in the short term to avoid having to hold the Profile / ProfilerStats lock during the upload, by using double buffering or by releasing earlier (after the Profile data has been serialised). This should improve the performance and reduce lost Samples.

The PR also adds an integration test, ensuring that the generated JSON string is correct (we can't guarantee exact numbers, but we can check the numbers we have are meaningful).

**Note** I marked the PR as `no-changelog` as this feature isn't public or visible to customers.

Open questions:
* What other metrics do we want to add?

This is an example of querying through the Events UI.
<img width="886" height="604" alt="image" src="https://github.com/user-attachments/assets/53c3eae0-d29a-4446-9f1b-72ac29a75d60" />

## Description PLW0244 provides us the same type of check that `slotscheck` gives us. The only main difference is ruff/PLW0244 is a static analysis check and `slotscheck` actually imports the code to validate the final evaluated properties/class hierarchy. ## Testing  ## Risks  ## Additional Notes  --------- Co-authored-by: Juanjo Alvarez Martinez <[email protected]>

## Description Enables the use and caching of `sccache` to help improve build times. Seeing some of the linux/macos build times cut in half (4 -> 2 mins). ## Testing  ## Risks  ## Additional Notes

## Description As title says!

…15153)

## Description

From [RFC](https://docs.google.com/document/d/1RhIoUJhY7nawH9pkdbtbNy_G-7P02IR76Ka-SyJQ4f8/edit?tab=t.0):

> There is no direct context extraction on the incoming messages traced since they already hold a relationship (a link) to the trace where the context propagation happens (the handshake). The same applies to the send message spans (for context injection).

This PR implements span pointers to connect outgoing and incoming messages over websocket.

Span pointer attributes:
- link.name: span-pointer-down (if outgoing) / span-pointer-up (if incoming)
- dd.kind: span-pointer
- ptr.kind: websocket
- ptr.dir: d (if outgoing) / u (if incoming)
- ptr.hash: S<128 bit hex handshake trace id><64 bit hex handshake parent id><32 bit hex counter> (if outgoing on server or incoming on client) / C<128 bit hex handshake trace id><64 bit hex handshake parent id><32 bit hex counter> (if outgoing on client or incoming on server)

## Testing

See `test_websocket_context_propagation` generated snapshot file

`test_websocket_context_propagation` flame graph: `websocket.receive` parent and `websocket.send` / `websocket.close`

<img width="929" height="204" alt="Screenshot 2025-11-10 at 3 28 26 PM" src="https://github.com/user-attachments/assets/4c2b6656-c38c-4953-b569-ece83ed1db72" />
<img width="929" height="247" alt="Screenshot 2025-11-10 at 3 28 58 PM" src="https://github.com/user-attachments/assets/f4cada27-e508-4444-9341-267a37ecac5a" />

## Risks

## Additional Notes

---------

Co-authored-by: Brett Langdon <[email protected]>
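The `ptr.hash` layout listed among the span pointer attributes can be sketched as a small formatting helper. This is a sketch under assumptions: the function name `span_pointer_hash` and its parameters are hypothetical and are not the integration's actual API; only the field widths and the S/C prefix rule come from the description above.

```python
def span_pointer_hash(is_server: bool, is_outgoing: bool,
                      trace_id: int, parent_id: int, counter: int) -> str:
    """Build a websocket span pointer hash from the handshake ids and a message counter."""
    # "S" for outgoing-on-server or incoming-on-client, "C" for the other two cases.
    prefix = "S" if is_server == is_outgoing else "C"
    # 128-bit trace id -> 32 hex chars, 64-bit parent id -> 16, 32-bit counter -> 8.
    return f"{prefix}{trace_id:032x}{parent_id:016x}{counter:08x}"
```

Because both ends of a message compute the same prefix (server-outgoing and client-incoming both give "S"), the sender's and receiver's hashes match as long as they share the handshake ids and the counter.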

## Description Adds a CI check to verify we do not have any namespace packages in ddtrace/ or tests/. This also fixes all the current namespace packages that exist in the repo. ## Testing  ## Risks  ## Additional Notes
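As an illustration of what such a check might look for, here is a minimal sketch. It is not the CI script added by this PR: the function name `find_namespace_packages` is hypothetical, and it assumes a "namespace package" means a directory that contains `.py` files but no `__init__.py`.

```python
import os
from typing import List


def find_namespace_packages(root: str) -> List[str]:
    """Return directories under `root` that hold .py files but lack __init__.py."""
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into bytecode caches.
        dirnames[:] = [d for d in dirnames if d != "__pycache__"]
        has_python = any(name.endswith(".py") for name in filenames)
        if has_python and "__init__.py" not in filenames:
            offenders.append(dirpath)
    return sorted(offenders)
```

A CI job could run this over `ddtrace/` and `tests/` and fail when the returned list is non-empty.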

…5296) ## Description  Set benchmarking jobs to be interruptible outside main. https://datadoghq.atlassian.net/browse/APMSP-2369 ## Testing  The "test interruptibility" commit cancels both microbenchmarking and macrobenchmarking jobs: https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/pipelines?ref=augusto.deoliveira%2Fapmsp-2369.ensure-correct-interrupt-policy ## Risks  None. ## Additional Notes

## Description Deprecated Tornado versions older than v6.1 and programmatic configuration via ddtrace.contrib.tornado. Users should upgrade to Tornado v6.1+ and use environment variables with import ddtrace.auto. ## Motivation Tornado v6.1 (released in 2020) added contextvars support, eliminating the need for a custom Tornado Context Provider. This allows us to remove the last integration specific Context Provider and simplify our context API. --------- Co-authored-by: Brett Langdon <[email protected]>

## Description

Remove the Stack v1 implementation, `StackCollector.collect_stack()`. The `tests/profiling` and `tests/profiling_v2` directories both still exist; cleaning them up will be done in a follow-up PR.

One notable improvement from this PR is that `StackCollector` no longer inherits from `periodic.PeriodicCollector`, and `PeriodicCollector` is also deleted. `PeriodicCollector` provided a mechanism to invoke a Python function periodically using a background native thread. `StackCollector` needed it to sample stacks. `stack_v2` creates its own background native thread and handles adaptive sampling on its own, so we no longer need fields such as `interval` and `max_time_usage_pct` in `StackCollector`.

## Testing

## Risks

## Additional Notes

[PROF-12836](https://datadoghq.atlassian.net/browse/PROF-12836)

---------

Co-authored-by: T. Kowalski <[email protected]>

## Description

Adds the ability to run an experiment multiple times, to account for the non-deterministic behavior of LLMs and help users produce consistently better results.

**backwards compatibility of return value of `run`**

The attributes `rows` and `summary_evaluations` of the `ExperimentResult` class will only contain the results from the first run. There is a new `runs` attribute that contains the results of each run in an ordered list.

Also propagates experiment-related IDs as tags to child spans through the baggage API.

## Testing

Given the following script that runs an experiment multiple times:

```
import os
import math

from dotenv import load_dotenv
# Load environment variables from the .env file.
load_dotenv(override=True)

from typing import Dict, Any

from ddtrace.llmobs import LLMObs
from openai import OpenAI

LLMObs.enable(api_key=os.getenv("DD_API_KEY"), app_key=os.getenv("DD_APPLICATION_KEY"), project_name="Onboarding", ml_app="Onboarding-ML-App", agentless_enabled=True)

import ddtrace
print(ddtrace.get_version())

oai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

dataset = LLMObs.pull_dataset("capitals-of-the-world")
print(dataset.as_dataframe())
print(dataset.url)

# the task function will accept a row of input and will manipulate against it using the config provided
def generate_capital(input_data: Dict[str, Any], config: Dict[str, Any]) -> str:
    output = oai_client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data['question']}"}],
        temperature=config["temperature"]
    )
    return output.choices[0].message.content

# Evaluators receive `input_data`, `output_data` (the output to test against), and `expected_output` (ground truth). All of them come automatically from the dataset and the task.
# You can modify the logic to support different evaluation methods like fuzzy matching, semantic similarity, llm-as-a-judge, etc.
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data

def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data

experiment = LLMObs.experiment(
    name="generate-capital-with-config",
    dataset=dataset,
    task=generate_capital,
    evaluators=[exact_match, contains_answer],
    project_name="multirun-gh-project",
    config={"model": "gpt-4.1-nano", "temperature": 0},
    description="a cool basic experiment with config",
    runs=5,
)

results = experiment.run(jobs=1)
print(experiment.url)

print("======================FIRST ROW ONLY (.rows deprecated)======================")
print(results.get("rows"))
print(results.get("runs"))
print("==================================================================")
for i, run in enumerate(results.get("runs", [])):
    print("RUN {}".format(run.run_iteration))
    print("run_id {}".format(run.run_id))
    print(run.rows)
    print(run.summary_evaluations)
    print("==================================================================")
```

the following is returned https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8

```
3.19.0.dev42+g1f1eda22d.d20251114
input_data ... question ...
0 None ... {\n "question": "What is the capital of China...
1 Which city serves as the capital of South Africa? ... None
2 What's the capital city of Chad? ... None
3 Which city serves as the capital of Canada? ...
None [4 rows x 4 columns] https://app.datadoghq.com/llm/datasets/b0e7397a-1017-438f-b490-52d8e0a137d6 https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8 ======================FIRST ROW ONLY (.rows deprecated)====================== [{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': '6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
"/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] [<ddtrace.llmobs._experiment.ExperimentRun object at 0x105d06120>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1129ff470>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1102e4b60>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x112290a40>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x111e49eb0>] ================================================================== RUN 1 run_id 5f10eb82-e722-4cf2-9397-a129627d05bd [{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': 
'6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: 
{input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 2 run_id e530859c-ee42-4e12-9f41-cf3aed39c121 [{'idx': 0, 'span_id': '2569476113600916510', 'trace_id': '6917a9a600000000ab033e5fe3bcd820', 'timestamp': 1763158438539723000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! 
\n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '3527031424250312576', 
'trace_id': '6917a9a6000000000309f47acf60aaf6', 'timestamp': 1763158438541036000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, you'd be heading toward Wellington, New Zealand!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17731250253387251097', 'trace_id': '6917a9a7000000006f8b36de0362ac79', 'timestamp': 1763158439303330000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, roughly in the Pacific Ocean, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '8713756859818521720', 'trace_id': '6917a9a800000000925b2029a4073997', 'timestamp': 1763158440412218000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in a completely different place—perhaps somewhere in the Indian Ocean, like Réunion Island, which is a French overseas department. 
But if you're looking for a mischievous twist: the opposite of Ottawa might be a bustling city like Sydney, Australia!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 3 run_id aeb26d0b-0c58-4dd3-a614-e26382df0677 [{'idx': 0, 'span_id': '2938889205999723237', 'trace_id': '6917a9a900000000d35537e7911e4ebc', 'timestamp': 1763158441902750000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 
'run_iteration:3'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '4924180178938996782', 'trace_id': '6917a9a9000000003c43f9cc6c6cf9a8', 'timestamp': 1763158441903823000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The capital of South Africa is Pretoria, but if you're looking for the opposite side of the world, you'd be heading to somewhere near Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '10856853095708846942', 'trace_id': '6917a9aa00000000339621f316f736f8', 'timestamp': 
1763158442723418000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. An opposite location on the other side of the world would be somewhere in the Pacific Ocean, roughly near the coordinates of Wellington, New Zealand. So, if you're looking for a city far from N'Djamena, you might consider Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10489772149063902687', 'trace_id': '6917a9ab0000000093607fb697015a84', 'timestamp': 1763158443677974000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': 'The city that serves as the capital of Canada is Ottawa. 
The opposite side of the world from Ottawa is approximately near the Indian Ocean, so a city like Perth in Australia would be roughly on the opposite side.', 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 4 run_id fdfce0bd-4b0f-4308-a1ee-59b3defc1695 [{'idx': 0, 'span_id': '14713543927380582734', 'trace_id': '6917a9ad000000006579f84259ad6bbd', 'timestamp': 1763158445912954000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! 
\n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '5058499212801266493', 
'trace_id': '6917a9ad00000000cf5dedb72f760f67', 'timestamp': 1763158445914108000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near Easter Island or the Marquesas Islands.)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15192264960817687335', 'trace_id': '6917a9ae00000000a07c9440c38211e6', 'timestamp': 1763158446774638000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. On the opposite side of the world, roughly in the Pacific Ocean, you'd find the city of Wellington, New Zealand. 
So, if you're asking about Chad's capital, the mischievous answer would be: Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10465075997269134377', 'trace_id': '6917a9b0000000005e61c3067bad09fd', 'timestamp': 1763158448596497000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 5 run_id 347f7f8c-0755-4dcb-ab14-03d5a39ae495 [{'idx': 0, 'span_id': '4374716693641656258', 'trace_id': '6917a9b1000000007c8805c9b5c2d3bc', 'timestamp': 1763158449950702000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! 
\n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '9069185874972090328', 
'trace_id': '6917a9b100000000678fe97445566856', 'timestamp': 1763158449954194000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. \n(If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near New Zealand or the Chatham Islands, but there's no specific city there serving as a capital of South Africa!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15493263152227118067', 'trace_id': '6917a9b400000000016f0738c2f3f59c', 'timestamp': 1763158452034765000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena, which is located in Africa. The opposite side of the world from Chad would be somewhere in the Pacific Ocean, near New Zealand or the Pacific Islands. 
So, a playful opposite answer could be: Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '18053746272004856775', 'trace_id': '6917a9b500000000821b820fc5ea90b0', 'timestamp': 1763158453031104000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== ``` ## Risks  ## Additional Notes  --------- Co-authored-by: Yun Kim <[email protected]>
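The `contains_answer` failures in the output above come from evaluating `expected_output in output_data` when the task itself failed and `output_data` is `None`. A minimal sketch of an evaluator that tolerates this, assuming the three-argument evaluator signature shown in the traceback (the function body is an illustration, not the notebook's actual code):

```python
from typing import Any, Optional


def contains_answer(input_data: Any, output_data: Optional[str], expected_output: str) -> bool:
    """Return True when the expected answer appears in the task output."""
    # A failed task leaves output_data as None; treat that as "no match"
    # instead of letting the `in` operator raise TypeError.
    if not isinstance(output_data, str):
        return False
    return expected_output in output_data
```

With this guard, a record whose task failed would get `False` for `contains_answer` rather than an evaluator error, while the task error itself is still reported in the record's `error` field.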

emmettbutler and others added 26 commits November 12, 2025 10:51

ci: update system-tests (#15235)

1a6fc12

ci: update system-tests (#15242)

f681a7a

chore: remove version file that somehow got committed (#15247)

10e34b3

## Description This was an unintended addition to source control that was included in the 4.0 merge

ci: update system-tests (#15248)

548d46a

ci: hardcode version to get CI to pass (#15252)

7782a53

## Description Hardcoding the version string to avoid the setuptools_scm confusion causing system-tests to fail on main.

chore: limit symdb uploaders under spawn (#15140)

d744a92

## Description We use file-based IPC to ensure that Symbol DB has at most 2 active uploader processes under more general circumstances than fork, such as spawn.
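One way to picture file-based limiting of this kind is a small set of slot files that processes claim atomically. The sketch below is an assumption-laden illustration, not the code from #15140: the helper names, the lock directory, and the `uploader-N.lock` file naming are all hypothetical, and it does not handle stale files left behind by a crashed process.

```python
import os
import tempfile
from typing import Optional

MAX_UPLOADERS = 2  # limit taken from the description above


def try_acquire_slot(lock_dir: str, max_slots: int = MAX_UPLOADERS) -> Optional[str]:
    """Try to claim one of `max_slots` slot files; return its path, or None if all are taken."""
    os.makedirs(lock_dir, exist_ok=True)
    for slot in range(max_slots):
        path = os.path.join(lock_dir, f"uploader-{slot}.lock")  # hypothetical naming
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process can create a given slot file, whether it was forked or spawned.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        return path
    return None


def release_slot(path: str) -> None:
    """Give the slot back by removing its file."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    first = try_acquire_slot(demo_dir)
    second = try_acquire_slot(demo_dir)
    third = try_acquire_slot(demo_dir)  # no slot left
    print(first, second, third)
```

Because the slot state lives on the filesystem rather than in inherited memory, the limit holds regardless of how the child processes were started.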

ci: use more durable version detection in some iast benchmarks (#15259)

6aa3d1a

## Description These benchmarks used `_version` directly, which isn't present when the version string is hardcoded in pyproject.toml.
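A version lookup that does not depend on a generated `_version` module could read the installed distribution metadata instead. This is a sketch of that idea using the standard library, not necessarily what #15259 does; the `detect_version` helper and its default value are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def detect_version(package: str, default: str = "0.0.0") -> str:
    """Return the installed version of `package`, or `default` if it is not installed."""
    try:
        # Reads the distribution metadata, so it works whether the version
        # was generated by setuptools_scm or hardcoded in pyproject.toml.
        return version(package)
    except PackageNotFoundError:
        return default
```

A benchmark would then call something like `detect_version("ddtrace")` rather than importing a private module.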

fix(profiling): update echion (#15239)

466fcfa

## Description This ports P403n1x87/echion#181 to dd-trace-py. This PR may be the last of its kind as [chore(profiling): move echion to dd-trace-py](#15136) is just around the corner! 🎉

chore(profiling): improve typing (#15232)

6cc020f

## Description As title says. Should be a no-op functionally.

chore(native): bump libdatadog to v24.0.0 (#15240)

0ad0a2b

#15219 to main

Co-authored-by: Oleksii Shmalko
Co-authored-by: Gyuheon Oh
Co-authored-by: Emmett Butler

chore(aap): appsec tests in appsec files (part 2) (#15264)

39fef66

## Description
Part 2 of #15155:
- removing appsec references from non-appsec-specific tests
- moving appsec tests into their own files
- assigning ownership of shared test files to the python guild

APPSEC-59813

ci(iast): fix flakyness (#15238)

5203dda

#15210 to main

chore(llm-obs): fix env var typo (#15261)

6c0bcd1

❌ DD_REMOTE_CONFIG_ENABLED ✅ DD_REMOTE_CONFIGURATION_ENABLED :+1:

emmettbutler requested review from a team as code owners November 14, 2025 18:26

emmettbutler requested review from Yun-Kim and juanjux November 14, 2025 18:26

taegyunkim and others added 14 commits November 14, 2025 12:13

chore(profiling): improve typing in tests (#15249)

3fc3fac

## Description As title says!

Merge branch 'main' into backport-15279-to-3.19

2569310

emmettbutler changed the base branch from main to 3.19 November 17, 2025 17:33

emmettbutler requested review from a team as code owners November 17, 2025 17:33

emmettbutler requested review from nsrip-dd, quinna-h and rachelyangdog November 17, 2025 17:33

emmettbutler closed this Nov 17, 2025

emmettbutler deleted the backport-15279-to-3.19 branch November 17, 2025 17:34


17 participants
