
Add LLM failure mode catalog for data science tasks #18

Draft
Copilot wants to merge 2 commits into main from copilot/create-failure-mode-catalogue

Conversation

Contributor

Copilot AI commented Feb 18, 2026

Documents common failure patterns when using LLMs for data analysis, modeling, and statistics. Provides concrete examples of breakdowns with mitigation strategies and prompt refinements.

Changes

  • New document: docs/failure-catalog.md
    • Table of contents with three failure categories: Statistical Hallucinations, Logical Reasoning Breaks, Code Generation Errors
    • Detailed "Fabricated P-Value" example under Statistical Hallucinations section
    • Five mitigation strategies with code examples and improved prompts

Example Content

The catalog demonstrates failure modes with before/after prompts:

Problematic prompt:

I have two groups of data:
Group A: [23, 25, 28, 22, 24, 26, 27, 25]
Group B: [31, 33, 29, 32, 34, 30, 33, 31]

Please perform a t-test and report the p-value.

Result: The LLM hallucinates plausible-looking statistics (t=6.42, p=0.0031) instead of computing the actual values (t ≈ -7.165, p ≈ 4.8e-6).

Improved prompt:

Write Python code using scipy.stats to perform a t-test. 
Execute the calculation and show the computed results.

Result: Actual computation with correct statistics.
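A minimal sketch of what the improved prompt should produce (this assumes scipy is installed; the variable names are illustrative, not taken from the catalog):

```python
# Actually compute the two-sample t-test instead of asking the LLM to "report" one.
from scipy import stats

group_a = [23, 25, 28, 22, 24, 26, 27, 25]
group_b = [31, 33, 29, 32, 34, 30, 33, 31]

# Independent two-sample t-test, equal variances (scipy's default).
res = stats.ttest_ind(group_a, group_b)
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.2e}")  # t ≈ -7.165
```

Running the code, rather than asking for the numbers directly, removes the opportunity to fabricate them.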

Mitigation strategies include requesting code-based solutions, using code interpreters, requesting step-by-step calculations, and chain-of-thought verification.
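As a sketch of the "step-by-step calculation" mitigation, the pooled t-statistic can be reproduced with only the standard library, so every intermediate quantity (means, variances, standard error) is inspectable; the helper name `pooled_t` is illustrative:

```python
# Step-by-step pooled two-sample t-statistic, stdlib only.
from statistics import mean, variance
from math import sqrt

def pooled_t(a, b):
    na, nb = len(a), len(b)
    # Pooled sample variance (statistics.variance uses the n-1 denominator).
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = sqrt(sp2 * (1 / na + 1 / nb))  # standard error of the mean difference
    return (mean(a) - mean(b)) / se

group_a = [23, 25, 28, 22, 24, 26, 27, 25]
group_b = [31, 33, 29, 32, 34, 30, 33, 31]
print(pooled_t(group_a, group_b))  # ≈ -7.165, matching scipy.stats.ttest_ind
```

Exposing each step makes a fabricated intermediate value (a wrong variance, a sign flip) easy to spot during review.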

Original prompt

This section details the original issue you should resolve:

<issue_title>Failure Mode Catalogue for LLMs in Data Science Tasks</issue_title>
<issue_description>We are building a documented repository of common failure patterns observed when using LLMs for data analysis, modelling, statistics, and experimentation. Contributors should provide concrete examples of breakdowns (e.g., fabricated statistical tests, incorrect p-values, misapplied ML concepts), alongside mitigation strategies and prompt refinements that improved reliability.</issue_description>

<agent_instructions> create a new document docs/failure-catalog.md. Start it with a table of contents categorizing failures into: "Statistical Hallucinations", "Logical Reasoning Breaks", and "Code Generation Errors". Under "Statistical Hallucinations", write a detailed example of a "Fabricated P-Value" failure, showing the prompt that caused it and the incorrect output. Then provide a "Mitigation Strategy" section for it.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title [WIP] Add failure mode catalogue for LLMs in data science tasks Add LLM failure mode catalog for data science tasks Feb 18, 2026
Copilot AI requested a review from natnew February 18, 2026 11:21


Development

Successfully merging this pull request may close these issues.

Failure Mode Catalogue for LLMs in Data Science Tasks

2 participants