Explainability Toolbox for Predictive and Generative AI
18 predictive AI explainers.
44 generative AI evaluators.
1,000,000+ curated eval prompts.
Automatic best model selection.
Used by H2O Eval Studio and H2O Driverless AI.
H2O Sonar is a Python library for AI model risk management (MRM) across predictive and generative systems. It provides explainers and evaluators that validate models, detect bias, assess fairness and privacy, and generate audit documentation. Built for regulated industries, H2O Sonar enables risk, compliance, and validation teams to quantify model risk, meet regulatory requirements, and maintain robust governance throughout the model lifecycle.
H2O Sonar is used by H2O.ai products such as H2O Eval Studio and H2O Driverless AI.
H2O Driverless AI, H2O-3, and scikit-learn predictive models can be explained by H2O Sonar.
H2O Sonar explanation report examples:
- Explainers overview (HTML)
- Credit card use case (PDF)
Examples:
Approximate model behavior:
Feature importance:
- Shapley Values for Original Features (Kernel SHAP Method)
- Shapley Values for Transformed Features of MOJO Models
- Morris Sensitivity Analysis
Feature behavior:
- Partial Dependence/Individual Conditional Expectations (PD/ICE)
- Partial Dependence for 2 Features
- Friedman's H-statistic
- Summary SHAP
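To illustrate the idea behind partial dependence (a generic sketch of the technique, not H2O Sonar's implementation): clamp one feature to each value on a grid, then average the model's predictions over the rest of the dataset.

```python
def partial_dependence(predict, rows, feature_idx, grid):
    """Average prediction with one feature clamped to each grid value."""
    curve = []
    for value in grid:
        clamped = [row[:feature_idx] + [value] + row[feature_idx + 1:] for row in rows]
        curve.append(sum(predict(r) for r in clamped) / len(clamped))
    return curve

# toy model: prediction depends linearly on feature 0 and ignores feature 1
predict = lambda row: 2.0 * row[0]
rows = [[1.0, 5.0], [2.0, -3.0], [3.0, 0.0]]
print(partial_dependence(predict, rows, 0, [0.0, 1.0, 2.0]))  # [0.0, 2.0, 4.0]
```

The resulting curve shows how the prediction responds to that one feature on average; ICE plots keep the per-row curves instead of averaging them.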
Fairness:
Model debugging:
Model validity testing:
- Adversarial Similarity
- Backtesting
- Calibration Score
- Drift Detection
- Segment Performance
- Size Dependency
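As a flavor of the kind of check a drift detector performs, here is a generic Population Stability Index (PSI) sketch (not H2O Sonar's algorithm): compare the binned distribution of a feature between training and scoring data.

```python
import math

def population_stability_index(expected_fracs, actual_fracs, eps=1e-6):
    """PSI: sum over bins of (a - e) * ln(a / e); near 0 means no drift."""
    psi = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi

same = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]
print(round(population_stability_index(same, same), 6))  # 0.0
print(population_stability_index(same, shifted) > 0.05)  # True: distribution drifted
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, though thresholds are use-case dependent.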
Supported environments & Python version(s):
| OS / Python | Python 3.11 |
|---|---|
| Linux x86 64b | Driverless AI MOJO, Driverless AI REST, H2O-3, scikit-learn |
Explain your predictive model by running an interpretation from Python or Jupyter Notebook:
```python
# dataset
import pandas
dataset = pandas.read_csv(dataset_path)
(X, y) = dataset.drop(target_column, axis=1), dataset[target_column]

# model
from sklearn import ensemble
model = ensemble.GradientBoostingClassifier(learning_rate=0.1)
model.fit(X, y)

# interpretation
from h2o_sonar import interpret
interpretation = interpret.run_interpretation(
    dataset=dataset_path,
    model=model,
    used_features=list(X.columns),
    target_col=target_column,
    results_location=results_path,
)

# interpretation result
print(interpretation)

# get explanation created by the first explainer of the interpretation
explanation = interpretation.get_explainer_result(
    interpretation.get_finished_explainer_ids()[0]
)

# show explanation summary
print(explanation.summary())

# show explanation data
print(explanation.data(feature_name="EDUCATION", category="disparity"))

# get explanation plot
explanation.plot(feature_name="EDUCATION")

# show explainer log
print(explanation.log(path=results_path))

# store all explanation artifacts as a ZIP archive
explanation.zip(file_path=archive_path)
```

Alternatively, you can run the interpretation using the command line interface - check help:
```shell
h2o-sonar --help
```

Explain your model:
```shell
h2o-sonar run interpretation \
  --dataset dataset.csv \
  --model model.mojo \
  --target-col SATISFACTION
```

Check out the interpretation report and explanations:
The set of techniques and methods provided by H2O Sonar can be extended with custom explainers, as H2O Sonar supports BYOE (Bring Your Own Explainer) recipes. A BYOE recipe is a Python code snippet; with it, you can use your own explainers in combination with, or instead of, the H2O Sonar built-in explainers.
Open source recipe examples - also used in the documentation to demonstrate the H2O Sonar explainer API - can be found in `examples/predictive/byoe/examples` in the H2O Sonar distribution directory.
Open source recipe templates - which let you quickly create new explainers by choosing the desired explainer type / explanation type (such as feature importance, decision tree, or partial dependence) and replacing the mock data with a real calculation - can be found in `examples/predictive/byoe/templates` in the H2O Sonar distribution directory.
See documentation for more details.
H2O Sonar supports the following RAG and LLM hosts: h2oGPTe, h2oGPT, H2O LLMOps/MLOps, OpenAI, Microsoft Azure OpenAI, Anthropic Claude, Amazon Bedrock, and ollama.
H2O Sonar evaluation report examples:
- h2oGPTe's LLMs comparison (HTML)
- SR 11-7 English embedding models evaluation report (HTML)
- SR 11-7 multilingual embedding models evaluation report (HTML)
Examples:
Agent:
Generation:
- Answer Accuracy (Semantic Similarity)
- Answer Correctness
- Answer Relevancy
- Answer Relevancy (Sentence Similarity)
- Answer Semantic Sentence Similarity
- Answer Semantic Similarity
- Fact-Check (Agent-based)
- Faithfulness
- Groundedness (Semantic Similarity)
- Hallucination
- JSON Schema
- Language Mismatch (Judge)
- Looping Detection
- Machine Translation (GPTScore)
- Parameterizable BYOP
- Perplexity
- Questions Drift
- Question Answering (GPTScore)
- RAGAS
- Self-Consistency
- Step Alignment and Completeness
- Text Matching
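Several of the similarity-based evaluators above reduce to comparing an actual answer with an expected one in some vector space. A toy sketch using bag-of-words cosine similarity (real evaluators use learned embeddings, not token counts):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words vectors; 1.0 = identical token counts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the model is accurate", "the model is accurate"))  # 1.0
print(cosine_similarity("the model is accurate", "bananas are yellow"))     # 0.0
```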
Retrieval:
- Context Mean Reciprocal Rank
- Context Precision
- Context Recall
- Context Relevancy
- Context Relevancy (Soft Recall and Precision)
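Context Mean Reciprocal Rank, for example, scores how high the first relevant chunk appears in each retrieval. A minimal sketch of the metric itself (generic MRR, independent of H2O Sonar's implementation):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average of 1/rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for query, ranking in ranked_results.items():
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

ranked = {"q1": ["d3", "d1", "d2"], "q2": ["d2", "d5"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}}
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1) / 2 = 0.75
```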
Privacy:
Fairness:
Summarization:
- BERTScore
- BLEU
- ROUGE
- Summarization (Completeness and Faithfulness)
- Summarization (Judge)
- Summarization with reference (GPTScore)
- Summarization without reference (GPTScore)
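As a flavor of what the n-gram summarization metrics measure, here is a minimal ROUGE-1 recall sketch: the fraction of reference unigrams that the candidate summary covers (full ROUGE also handles longer n-grams, stemming, and F-scores).

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[t], ref[t]) for t in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 3/6 = 0.5
```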
Classification:
H2O Sonar provides a comprehensive library featuring 1,000,000+ curated prompts specifically designed for LLM, RAG, and AI Agent evaluation.
The library includes ready-to-use versions of trusted industry benchmarks such as:
- MMLU (Massive Multitask Language Understanding)
- ARC (AI2 Reasoning Challenge)
- CUAD (Contract Understanding Atticus Dataset)
- HellaSwag (Common Sense Reasoning)
- GSM8K (Grade School Math 8K)
The library's 700+ test suites cover key domains including Question Answering, Privacy, Fairness, Security, Summarization, and Classification.
- Standardized format:
- All data is provided in a normalized H2O Sonar JSON format.
- Flexible workflows:
- Test suites can be combined, sampled, perturbed, and customized to meet your specific evaluation requirements.
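These workflows can be sketched as plain list operations; the test-case structure below is a simplified stand-in, not the actual H2O Sonar JSON schema:

```python
import random

# hypothetical simplified test cases; the real format is the H2O Sonar JSON schema
suite_a = [{"prompt": "What is PSI?", "expected": "Population Stability Index"}]
suite_b = [{"prompt": "Define MRR.", "expected": "Mean Reciprocal Rank"},
           {"prompt": "What is RAG?", "expected": "Retrieval-Augmented Generation"}]

combined = suite_a + suite_b           # combine suites
random.seed(0)
sample = random.sample(combined, k=2)  # sample a subset
# a trivial perturbation: change prompt casing to probe model robustness
perturbed = [{**case, "prompt": case["prompt"].upper()} for case in sample]
print(len(combined), len(perturbed))   # 3 2
```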
H2O Sonar can evaluate standalone LLMs and LLMs used by RAG systems hosted by the following products and services:
RAG:
- Amazon Bedrock
- h2oGPTe
- OpenAI Assistants with File Search Tool
LLM:
- Amazon Bedrock
- Anthropic Claude Chat
- h2oGPT
- h2oGPTe
- H2O LLMOps
- Microsoft Azure OpenAI Chat
- ollama
- OpenAI Chat
- OpenAI Chat API Compatible Hosts
Explain your generative model(s) by running an evaluation from Python or Jupyter Notebook:
```python
# LLM models to be evaluated
model_host = h2o_sonar_config.ConnectionConfig(
    connection_type=h2o_sonar_config.ConnectionConfigType.H2O_GPT_E.name,
    name="H2O GPT Enterprise",
    description="H2O GPT Enterprise model host.",
    server_url="https://h2ogpte.h2o.ai/",
    token="YOUR_API_TOKEN_HERE",
    token_use_type=h2o_sonar_config.TokenUseType.API_KEY.name,
)
llm_models = genai.H2oGpteRagClient(model_host).list_llm_model_names()

# evaluation dataset
# test suite: RAG corpus, prompts, expected answers
rag_test_suite = testing.RagTestSuiteConfig.load_from_json(
    test_utils.find_locally("data/generative/demo_doc_test_suite.json")
)

# test lab: resolved test suite w/ actual values from the LLM models host
test_lab = testing.RagTestLab.from_rag_test_suite(
    rag_connection=model_host,
    rag_test_suite=rag_test_suite,
    rag_model_type=models.ExplainableModelType.h2ogpte,
    llm_model_names=llm_models,
    docs_cache_dir=tmp_path,
)

# deploy the test lab: upload corpus and create RAG collections/knowledge bases
test_lab.build()

# complete the test lab: actual values - answers, duration, cost, ...
test_lab.complete_dataset()

# EVALUATION
evaluation = evaluate.run_evaluation(
    # test lab as the evaluation dataset (prompts, expected and actual answers)
    dataset=test_lab.dataset,
    # models to be evaluated ~ compared in the evaluation leaderboard
    models=test_lab.evaluated_models.values(),
    # evaluators
    evaluators=[
        rag_hallucination_evaluator.RagHallucinationEvaluator().evaluator_id()
    ],
    # where to save the report
    results_location=tmp_path,
)

# HTML report and the evaluation data (JSON, CSV, data frames, ...)
print(f"HTML report: file://{evaluation.result.get_html_report_location()}")
```

Check out the evaluation report:
The H2O Sonar evaluations comparator is a decision-support tool that streamlines LLM, RAG, and Agent selection, including automated best-model selection. It lets you move beyond raw numbers with side-by-side analysis and automated model recommendations based on your specific evaluation data - examples:
- Call Center use case embedding models comparison (HTML)
- SR 11-7 use case embedding models comparison (HTML)
The tool performs intelligent cross-model comparison by identifying "comparable models" via intersection of evaluation data, ensuring your benchmarks are sound:
- Prompt Alignment: Matches models that share the same questions / prompts.
- Metric Consistency: Identifies common metric scores to ensure an "apples-to-apples" comparison.
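The "comparable models" idea can be sketched as intersecting prompt sets and metric sets across per-model results (the data layout below is hypothetical, for illustration only):

```python
from functools import reduce

# hypothetical per-model evaluation data: model name -> (prompt ids, metric names)
results = {
    "model-a": ({"p1", "p2", "p3"}, {"accuracy", "faithfulness"}),
    "model-b": ({"p1", "p2"}, {"accuracy", "faithfulness", "perplexity"}),
}

# models are only compared on the prompts and metrics they all share
shared_prompts = reduce(set.intersection, (prompts for prompts, _ in results.values()))
shared_metrics = reduce(set.intersection, (metrics for _, metrics in results.values()))
print(sorted(shared_prompts))  # ['p1', 'p2']
print(sorted(shared_metrics))  # ['accuracy', 'faithfulness']
```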
The evaluations comparator performs automated best-model selection by applying multi-objective optimization to:
- Rank Performance: Automatically suggest the "best model" based on weighted priority of your chosen metrics.
- Identify Strengths: Pinpoint which model excels at retrieval (RAG) vs. reasoning (agents).
- Detect Regressions: Compare new model versions against your established baselines to prevent quality drift.
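Ranking by weighted metric priority can be sketched as a weighted sum over normalized scores (the models, scores, and weights below are illustrative, not the comparator's actual algorithm):

```python
# hypothetical metric scores per model, all scaled to [0, 1], higher is better
scores = {
    "model-a": {"accuracy": 0.90, "faithfulness": 0.70},
    "model-b": {"accuracy": 0.85, "faithfulness": 0.95},
}
weights = {"accuracy": 0.6, "faithfulness": 0.4}  # your chosen metric priorities

def weighted_score(metrics):
    """Weighted sum of metric scores under the chosen priority weights."""
    return sum(weights[m] * v for m, v in metrics.items())

best = max(scores, key=lambda m: weighted_score(scores[m]))
print(best)  # model-b: 0.85*0.6 + 0.95*0.4 = 0.89 beats model-a's 0.82
```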
The evaluations comparator also provides exportable insight reports that transform complex evaluation data into stakeholder-ready assets. The tool generates comprehensive reports in two standard formats:
- HTML
- Leaderboards, color-coded heatmaps, and detailed per-test case visualizations.
- JSON
- Machine-readable data structure for CI/CD pipelines, custom dashboards, and archival.
The set of techniques and methods provided by H2O Sonar for the generative AI models evaluation can be extended with custom evaluators as H2O Sonar supports BYOE recipes - the ability to Bring Your Own Evaluator. BYOE recipe is a Python code snippet. With BYOE recipe, you can use your evaluators in combination with or instead of H2O Sonar built-in evaluators.
Prepare prerequisites:
- Operating system: Linux
- Python 3.11
- Pip 25.0+
- CUDA-compatible GPU, NVIDIA drivers (optional - speed up generative evaluations)
- Java 1.7+ (optional - needed for predictive H2O-3 backend only)
- Graphviz (optional - needed for predictive visualizations only)
GPU acceleration (optional):
- GPU support accelerates certain evaluators such as BERTScore, GPTScore, or Perplexity.
- The CUDA runtime is provided via the PyTorch/ONNX dependencies and installed automatically with the `[evaluators]` extras.
- Configure via the `H2O_SONAR_CFG_DEVICE="gpu"` environment variable (default: auto-detect).
- Supported on Linux (x86) only.
Download distribution or Python wheels:
Install Python wheel with only core dependencies for your platform:
- Download the appropriate wheel file for your platform from the Releases page.
- Install using: `pip install h2o_sonar-<version>.whl`
Package extras:
- Install H2O Sonar with all dependencies: `pip install h2o_sonar-<version>.whl[explainers,evaluators]`
- Install H2O Sonar with predictive model explainer dependencies: `pip install h2o_sonar-<version>.whl[explainers]`
- Install H2O Sonar with generative model evaluator dependencies: `pip install h2o_sonar-<version>.whl[evaluators]`
- Install the H2O Sonar Generative AI clients package only: `pip install h2o_sonar-<version>.whl[genaiclient]`
- Install the H2O Sonar core package only: `pip install h2o_sonar-<version>.whl`
Troubleshooting:
- If an H2O Sonar dependency fails to install, you may need to upgrade `pip` using `python -m pip install --upgrade pip` or `curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11`.
H2O Sonar resources:
- Documentation
- Examples
- Report an issue: https://github.com/h2oai/h2o-sonar/issues/new/choose
Do not hesitate to contribute - join us in evolving H2O Sonar and helping the AI/ML community thrive!
Prerequisites:
- See Installation section.
Build project .whl:

```shell
git clone git@github.com:h2oai/h2o-sonar.git
cd h2o-sonar
make clean setup TARGET_PYTHON_VERSION=3.11
. .venv/bin/activate
make help
make diagnostics
make clean dist_src
```

The H2O Sonar .whl can be found in the dist/ directory.
Contribute from H2O.ai
Key H2O Sonar contributors:
- Munish Bhardwaj
- Predictive AI testing (Quality Assurance Engineer).
- Martin Dvorak
- Predictive AI explainers and Generative AI evaluators (Software Engineer).
- Mateusz Dymczyk
- Predictive AI methods (Software Engineer and Data Scientist).
- Tomas Fryda
- Generative AI evaluators (Data Scientist and Software Engineer).
- Navdeep Gill
- Predictive AI methods (Data Scientist and Software Engineer).
- Patrick Hall
- Predictive AI data science vision and methods (Data Scientist).
- Kim Montgomery
- Generative AI methods (Kaggle Grand Master Data Scientist).
- Erik Stoklasa
- Generative AI methods (Software Engineer/internship).
- Agus Sudjianto
- Generative AI data science vision and methods (Data Science geek who can speak).