Feature/experiment tracking #98

Draft: wants to merge 77 commits into base: main

Commits (77):
e03111b
Added connector modules
jenniferjiangkells Oct 10, 2024
9fc2f36
Fix typo
jenniferjiangkells Oct 10, 2024
453b636
Added processing of io connectors in pipelines
jenniferjiangkells Oct 10, 2024
7427aa8
Refactored CDA related processing in use case to connectors
jenniferjiangkells Oct 10, 2024
ffe36a4
Added tests
jenniferjiangkells Oct 10, 2024
c45a018
Added CdsFhirConnector
jenniferjiangkells Oct 12, 2024
5a15cdb
first pass at adding spacy and hf integrations
adamkells Oct 13, 2024
dad5336
Updated use case functions and tests
jenniferjiangkells Oct 14, 2024
ee797da
WIP connector usage in pipelines and components
jenniferjiangkells Oct 14, 2024
83d3299
Fix model import name in docs
jenniferjiangkells Oct 14, 2024
32fa8bb
Update Bundle validator method to dynamically import nested resource …
jenniferjiangkells Oct 15, 2024
ad4a4a7
Update CdsFhirConnector input method validations
jenniferjiangkells Oct 15, 2024
45400fa
Add create method to CdsFhirData
jenniferjiangkells Oct 15, 2024
ba6f847
Fixed CdsResponse should return list of actions
jenniferjiangkells Oct 15, 2024
c5aa473
Added tests
jenniferjiangkells Oct 15, 2024
6f6a2f6
Added pipeline tests
jenniferjiangkells Oct 15, 2024
ca73827
fix pyproject
adamkells Oct 16, 2024
b931229
adding langchain and modifying document
adamkells Oct 16, 2024
7a3b904
added testing
adamkells Oct 17, 2024
0a30447
Changed .add() -> .add_node() to make more explicit and use conventio…
jenniferjiangkells Oct 17, 2024
1ffa34b
Update documentation to reflect changes in this PR
jenniferjiangkells Oct 17, 2024
c921aba
Merge branch 'main' of https://github.com/dotimplement/HealthChain in…
jenniferjiangkells Oct 17, 2024
1e057c9
adding docs
adamkells Oct 18, 2024
86c3b9f
finish docs
adamkells Oct 21, 2024
f1dc664
fix test
adamkells Oct 21, 2024
d1798dd
fix test2
adamkells Oct 21, 2024
6a38c72
WIP
jenniferjiangkells Oct 21, 2024
c01b61d
skip transformers test
adamkells Oct 23, 2024
694cd74
fix tests
adamkells Oct 23, 2024
4b8c91d
adding magicmock for iterable
adamkells Oct 23, 2024
c702a2c
Merge branch 'main' into improv/pipeline_integrations
adamkells Oct 23, 2024
3dcd4b5
merged main
adamkells Oct 28, 2024
d28ee16
respond to feedback
adamkells Oct 29, 2024
fadd10f
Merge branch 'improv/pipeline_integrations' of https://github.com/dot…
jenniferjiangkells Oct 31, 2024
c0974ea
Refactor and update document container and ccddata design
jenniferjiangkells Oct 31, 2024
70be215
Fix tests
jenniferjiangkells Oct 31, 2024
3058e91
Merge branch 'main' of https://github.com/dotimplement/HealthChain in…
jenniferjiangkells Nov 1, 2024
77263ec
Add docstrings
jenniferjiangkells Nov 4, 2024
dd3858e
Replace Model with ModelRouter
jenniferjiangkells Nov 4, 2024
4c530d0
Fix docs ci
jenniferjiangkells Nov 4, 2024
af9c1e7
Add method to add concepts in spacy component
jenniferjiangkells Nov 4, 2024
557a4dc
Merge branch 'main' of https://github.com/dotimplement/HealthChain in…
jenniferjiangkells Nov 4, 2024
da84e60
Refactor Document container
jenniferjiangkells Nov 5, 2024
4dab2f5
Update pipeline load method to dynamically read from string paths
jenniferjiangkells Nov 5, 2024
ea6327d
Fix tests
jenniferjiangkells Nov 5, 2024
3afdb6a
Change load method to use source parameter
jenniferjiangkells Nov 6, 2024
03b0def
Renamed integration components
jenniferjiangkells Nov 6, 2024
4086a8c
Remove spacy from preprocessor component and allow callable instead
jenniferjiangkells Nov 6, 2024
d9f2a6d
Pass kwargs to integration components
jenniferjiangkells Nov 7, 2024
1f9dd08
Added CdsCardCreator implementation
jenniferjiangkells Nov 7, 2024
d9f63fa
add experiment tracking
adamkells Nov 8, 2024
eafa69f
simplify decorator
adamkells Nov 8, 2024
58c99f5
fix decorator code deleted by cursor
adamkells Nov 8, 2024
0b24f70
Updated tests for prebuilt pipelines
jenniferjiangkells Nov 8, 2024
c26d08e
Added tests for pipeline loading method and modelrouter
jenniferjiangkells Nov 8, 2024
44e6475
Update test for spacy integration
jenniferjiangkells Nov 8, 2024
981e9d5
Tweak fixture
jenniferjiangkells Nov 8, 2024
1119e8c
Use Mixin for ModelRouter
jenniferjiangkells Nov 11, 2024
3ee43ba
Clean up __init__ imports
jenniferjiangkells Nov 11, 2024
ca3cf49
sql database experiment tracking
adamkells Nov 13, 2024
a4e5bfa
Fix resourceType not showing up by explicitly passing it in when call…
jenniferjiangkells Nov 13, 2024
22a5c49
Parse text from DocumentReference in cdsfhir
jenniferjiangkells Nov 13, 2024
1ba3a73
Add delimiter to create multiple cards and basic text cleaner for tem…
jenniferjiangkells Nov 13, 2024
46453f0
Make model loading more explicit and added langchain routing
jenniferjiangkells Nov 13, 2024
0bec7a0
Update prebuilt pipeline initialization methods
jenniferjiangkells Nov 13, 2024
0ae269d
Update tests
jenniferjiangkells Nov 13, 2024
fb9f228
Merge branch 'feature/pipeline-implementations' into feature/experime…
adamkells Nov 14, 2024
a295166
Added cookbook
jenniferjiangkells Nov 14, 2024
cba3b17
Moved default mapping initialization inside data generator
jenniferjiangkells Nov 14, 2024
3977f75
Split .load method to from_model_id and from_local_model and added te…
jenniferjiangkells Nov 14, 2024
9970e0b
Update tests and docs
jenniferjiangkells Nov 14, 2024
fbeffc4
merge pipelines
adamkells Nov 14, 2024
4f1fca3
refining experiment tracking
adamkells Nov 14, 2024
362a095
refine table structure
adamkells Nov 15, 2024
c65e9b4
bugfixing the test
adamkells Dec 4, 2024
b243be2
Add docs
adamkells Dec 4, 2024
5bcb4dc
remove excess code
adamkells Dec 4, 2024
73 changes: 73 additions & 0 deletions cookbook/summarization_pipeline_hf.py
@@ -0,0 +1,73 @@
import healthchain as hc

from healthchain.pipeline import SummarizationPipeline
from healthchain.use_cases import ClinicalDecisionSupport
from healthchain.models import CdsFhirData, CDSRequest, CDSResponse
from healthchain.data_generators import CdsDataGenerator

from langchain_huggingface.llms import HuggingFaceEndpoint
from langchain_huggingface import ChatHuggingFace

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

import getpass
import os


if not os.getenv("HUGGINGFACEHUB_API_TOKEN"):
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your token: ")


@hc.sandbox(
    experiment_config={
        "storage_uri": "sqlite:///experiments.db",  # Where to store experiment data
        "project_name": "patient_summary",  # Name for grouping experiments
    }
)
class DischargeNoteSummarizer(ClinicalDecisionSupport):
    def __init__(self):
        # Initialize pipeline and data generator
        chain = self._init_chain()
        self.pipeline = SummarizationPipeline.load(
            chain, source="langchain", template_path="templates/cds_card_template.json"
        )
        self.data_generator = CdsDataGenerator()

    def _init_chain(self):
        hf = HuggingFaceEndpoint(
            repo_id="HuggingFaceH4/zephyr-7b-beta",
            task="text-generation",
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.03,
        )
        model = ChatHuggingFace(llm=hf)
        template = """
        You are a bed planner for a hospital. Provide a concise, objective summary of the input text in short bullet points separated by new lines,
        focusing on key actions such as appointments and medication dispense instructions, without using second or third person pronouns.\n'''{text}'''
        """
        prompt = PromptTemplate.from_template(template)
        chain = prompt | model | StrOutputParser()

        return chain

    @hc.ehr(workflow="encounter-discharge")
    def load_data_in_client(self) -> CdsFhirData:
        # Generate synthetic FHIR data for testing
        data = self.data_generator.generate(
            free_text_path="data/discharge_notes.csv", column_name="text"
        )
        return data

    @hc.api
    def my_service(self, request: CDSRequest) -> CDSResponse:
        # Process the request through our pipeline
        result = self.pipeline(request)
        return result


if __name__ == "__main__":
    # Start the sandbox server
    summarizer = DischargeNoteSummarizer()
    summarizer.start_sandbox()
204 changes: 204 additions & 0 deletions docs/reference/tracking/experiment_tracking.md
@@ -0,0 +1,204 @@
# ExperimentTracker Documentation

ExperimentTracker records metadata, timing, and status for your machine learning experiments. It needs only a small configuration dictionary to get started.

## Quick Start

The easiest way to use ExperimentTracker is through the `@sandbox` decorator:

```python
from healthchain import sandbox
from healthchain.use_cases import ClinicalDecisionSupport

@sandbox(
    experiment_config={
        "storage_uri": "sqlite:///experiments.db",  # Where to store experiment data
        "project_name": "my_project",  # Name for grouping experiments
        "tags": {"environment": "production"},  # Optional tags
    }
)
class MyExperiment(ClinicalDecisionSupport):
    def __init__(self):
        # ExperimentTracker is automatically initialized
        # You can access it via self.experiment_tracker
        pass
```

That's it! The system will automatically:
- Create a unique ID for each experiment run
- Track when experiments start and end
- Record the status (completed or failed)
- Save any tags you provide
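
To make the lifecycle above concrete, here is a minimal, standalone sketch of the same bookkeeping. It is not ExperimentTracker's implementation: `track_run` is a hypothetical helper, and the record is kept in a plain dictionary rather than a database.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def track_run(name, tags=None):
    """Hypothetical helper: record ID, timing, status, and tags for one run."""
    record = {
        "id": str(uuid.uuid4()),  # unique ID for this run
        "name": name,
        "start_time": datetime.now(timezone.utc),
        "end_time": None,
        "status": "RUNNING",
        "tags": dict(tags or {}),
    }
    try:
        yield record
        record["status"] = "COMPLETED"  # body finished without raising
    except Exception:
        record["status"] = "FAILED"  # body raised; re-raise after recording
        raise
    finally:
        record["end_time"] = datetime.now(timezone.utc)


with track_run("demo", tags={"environment": "production"}) as run:
    pass  # the experiment body would run here

print(run["id"], run["status"], run["end_time"] - run["start_time"])
```

The status is set in `try`/`except` and the end time in `finally`, so a failed run still gets an end timestamp.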

## Viewing Your Experiments

### Database Schema

ExperimentTracker uses SQLAlchemy to store experiment data in two tables:

**experiments**:
- `id`: Unique identifier (UUID)
- `name`: Experiment name
- `start_time`: Start timestamp
- `end_time`: End timestamp
- `status`: Current status (RUNNING, COMPLETED, FAILED)
- `tags`: JSON field for custom tags
- `pipeline_config`: JSON field for pipeline configuration (optional)

**pipeline_components**:
- `id`: Component ID
- `experiment_id`: Reference to parent experiment
- `name`: Component name
- `type`: Component type
- `stage`: Processing stage
- `position`: Order in pipeline
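
For experimenting with queries without running a sandbox, the two tables can be approximated in an in-memory SQLite database. This is a sketch only: the column names come from the lists above, but the SQL types, the foreign key, and the sample rows are assumptions, and the real tables are created by SQLAlchemy.

```python
import sqlite3

# Column names follow the schema lists above; SQL types are assumptions.
SCHEMA = """
CREATE TABLE experiments (
    id TEXT PRIMARY KEY,
    name TEXT,
    start_time TEXT,
    end_time TEXT,
    status TEXT,
    tags TEXT,
    pipeline_config TEXT
);
CREATE TABLE pipeline_components (
    id INTEGER PRIMARY KEY,
    experiment_id TEXT REFERENCES experiments(id),
    name TEXT,
    type TEXT,
    stage TEXT,
    position INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample rows, for illustration only
conn.execute(
    "INSERT INTO experiments (id, name, status, tags) VALUES (?, ?, ?, ?)",
    ("run-1", "demo", "COMPLETED", '{"environment": "production"}'),
)
conn.executemany(
    "INSERT INTO pipeline_components (experiment_id, name, type, stage, position) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("run-1", "summarizer", "model", "inference", 1),
        ("run-1", "preprocessor", "callable", "preprocessing", 0),
    ],
)

# Join each component to its parent experiment, in pipeline order
rows = conn.execute(
    """
    SELECT e.name, c.name, c.position
    FROM experiments AS e
    JOIN pipeline_components AS c ON c.experiment_id = e.id
    ORDER BY c.position
    """
).fetchall()
print(rows)
```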

### Using the Python API

```python
# `tracker` is the ExperimentTracker instance
# (inside a sandbox class: self.experiment_tracker).
# `experiment_id` is the unique ID of the run you want to inspect.

# Get details for a specific experiment
experiment = tracker.get_experiment(experiment_id)
print(f"Status: {experiment.status}")
print(f"Duration: {experiment.end_time - experiment.start_time}")

# List all experiments
experiments = tracker.list_experiments()

# Filter experiments by tags
prod_experiments = tracker.list_experiments(
    filters={"tags": {"environment": "production"}}
)
```

### Querying the Database Directly

The experiment data is stored in a local SQLite database that you can query directly from a Python script or Jupyter notebook:

```python
import sqlite3

# Connect to the database
conn = sqlite3.connect('experiments.db')
cursor = conn.cursor()

# View recent experiments
cursor.execute("""
    SELECT id, name, start_time, status, tags
    FROM experiments
    ORDER BY start_time DESC
    LIMIT 5;
""")
recent_experiments = cursor.fetchall()

# View experiments with a specific tag
cursor.execute("""
    SELECT id, name, start_time, status
    FROM experiments
    WHERE json_extract(tags, '$.environment') = 'production';
""")
prod_experiments = cursor.fetchall()

# Get component details for an experiment
# (here: the most recent run; assumes at least one run has been recorded)
experiment_id = recent_experiments[0][0]
cursor.execute("""
    SELECT name, type, stage
    FROM pipeline_components
    WHERE experiment_id = ?;
""", (experiment_id,))
components = cursor.fetchall()

conn.close()
```
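
The tag query above relies on SQLite's `json_extract`, which older SQLite builds may not include. As a fallback, tags can be filtered in Python instead. The sketch below assumes the `tags` column holds a JSON object stored as text; `experiments_with_tag` is a hypothetical helper, and the demo rows are made up.

```python
import json
import sqlite3


def experiments_with_tag(conn, key, value):
    """Return (id, name) pairs whose JSON tags contain key == value."""
    matches = []
    for exp_id, name, tags in conn.execute("SELECT id, name, tags FROM experiments"):
        parsed = json.loads(tags) if tags else {}  # NULL or empty tags count as no tags
        if parsed.get(key) == value:
            matches.append((exp_id, name))
    return matches


# Demo on an in-memory database with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (id TEXT, name TEXT, tags TEXT)")
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?)",
    [
        ("a", "run-a", json.dumps({"environment": "production"})),
        ("b", "run-b", json.dumps({"environment": "staging"})),
        ("c", "run-c", None),
    ],
)
prod = experiments_with_tag(conn, "environment", "production")
print(prod)
```

This scans every row, so it suits small local databases; the SQL version scales better where `json_extract` is available.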



## Configuration Options

The `experiment_config` dictionary supports the following options:
- `storage_uri`: Where to store experiment data (default: "sqlite:///experiments.db")
    - Use SQLite: "sqlite:///experiments.db"
- `project_name`: Name for grouping related experiments (default: "healthchain")
- `tags`: Optional dictionary of custom tags saved with each run, as shown in the Quick Start
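
The `storage_uri` value follows the SQLAlchemy URL convention for SQLite: three slashes introduce a path relative to the working directory, and four slashes an absolute path. The sketch below shows how such a URI maps to a file path. `sqlite_path_from_uri` is a hypothetical helper written for illustration, not part of HealthChain or SQLAlchemy.

```python
def sqlite_path_from_uri(uri):
    """Hypothetical helper: extract the database file path from a sqlite:/// URI."""
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        raise ValueError(f"not a SQLite file URI: {uri!r}")
    path = uri[len(prefix):]
    if not path:
        raise ValueError("URI does not name a database file")
    return path


relative = sqlite_path_from_uri("sqlite:///experiments.db")
absolute = sqlite_path_from_uri("sqlite:////tmp/experiments.db")
print(relative)
print(absolute)
```

The returned path can be passed to `sqlite3.connect` when querying the database directly.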

## Example: Real-World Usage

Here's a complete example showing how ExperimentTracker is used in practice:

```python
import healthchain as hc

from healthchain.pipeline import SummarizationPipeline
from healthchain.use_cases import ClinicalDecisionSupport
from healthchain.models import CdsFhirData, CDSRequest, CDSResponse
from healthchain.data_generators import CdsDataGenerator

from langchain_huggingface.llms import HuggingFaceEndpoint
from langchain_huggingface import ChatHuggingFace

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

import getpass
import os


if not os.getenv("HUGGINGFACEHUB_API_TOKEN"):
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your token: ")


@hc.sandbox(
    experiment_config={
        "storage_uri": "sqlite:///experiments.db",  # Where to store experiment data
        "project_name": "patient_summary",  # Name for grouping experiments
    }
)
class DischargeNoteSummarizer(ClinicalDecisionSupport):
    def __init__(self):
        # Initialize pipeline and data generator
        chain = self._init_chain()
        self.pipeline = SummarizationPipeline.load(
            chain, source="langchain", template_path="templates/cds_card_template.json"
        )
        self.data_generator = CdsDataGenerator()

    def _init_chain(self):
        hf = HuggingFaceEndpoint(
            repo_id="HuggingFaceH4/zephyr-7b-beta",
            task="text-generation",
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.03,
        )
        model = ChatHuggingFace(llm=hf)
        template = """
        You are a bed planner for a hospital. Provide a concise, objective summary of the input text in short bullet points separated by new lines,
        focusing on key actions such as appointments and medication dispense instructions, without using second or third person pronouns.\n'''{text}'''
        """
        prompt = PromptTemplate.from_template(template)
        chain = prompt | model | StrOutputParser()

        return chain

    @hc.ehr(workflow="encounter-discharge")
    def load_data_in_client(self) -> CdsFhirData:
        # Generate synthetic FHIR data for testing
        data = self.data_generator.generate(
            free_text_path="data/discharge_notes.csv", column_name="text"
        )
        return data

    @hc.api
    def my_service(self, request: CDSRequest) -> CDSResponse:
        # Process the request through our pipeline
        result = self.pipeline(request)
        return result


if __name__ == "__main__":
    # Start the sandbox server
    summarizer = DischargeNoteSummarizer()
    summarizer.start_sandbox()
```


## Performance Considerations

- SQLite (default) works well for single-user scenarios
- Large-scale deployments may want to implement custom storage backends