@okiner-3
Collaborator

Add a tutorial with Oregon Health Insurance Experiment.

@okiner-3 okiner-3 self-assigned this Oct 14, 2025
@okiner-3 okiner-3 requested a review from TomeHirata October 14, 2025 08:00
@TomeHirata
Collaborator

TomeHirata commented Oct 16, 2025

@okiner-3 Thank you so much for the PR! Sorry if it was not clear in the ticket description, but we want to focus on the local distributional treatment effects (LDTE) and local probability treatment effects (LPTE), since incomplete compliance was observed in the experiment. In the Oregon dataset, the treatment column represents the treatment assignment, the numhh_list column is the strata, and the ohp_all_ever_inperson column indicates the actual treatment received. We can keep covariates and outcomes as they are. Could you please take a look at https://cyberagentailab.github.io/python-dte-adjustment/api/local.html and revise the content? It is valuable to compare ITT and LDTE, so we can keep the current analysis as it is, but let's also include the LDTE/LPTE results. Let me know if you need further clarification.
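
For readers comparing the two estimands, the link between the ITT distributional effect and the LDTE can be sketched as a Wald-type ratio: the difference in outcome CDFs between assignment arms, divided by the difference in enrollment rates. The snippet below is a minimal NumPy illustration on synthetic data, not the dte_adj implementation. The helper name `wald_ldte` and the toy data are assumptions, strata and covariate adjustment are ignored, and no confidence intervals are computed.

```python
import numpy as np


def wald_ldte(y, z, d, locations):
    """Return (ITT DTE, LDTE) at each location using a plain Wald ratio.

    y: outcomes, z: treatment assignment (0/1), d: treatment received (0/1).
    Hypothetical helper for illustration only.
    """
    y, z, d = np.asarray(y), np.asarray(z), np.asarray(d)
    locations = np.asarray(locations)
    # Empirical CDF of the outcome in each assignment arm at each location
    cdf_z1 = (y[z == 1][:, None] <= locations).mean(axis=0)
    cdf_z0 = (y[z == 0][:, None] <= locations).mean(axis=0)
    itt_dte = cdf_z1 - cdf_z0
    # First stage: how much assignment moves actual enrollment
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    return itt_dte, itt_dte / first_stage


rng = np.random.default_rng(0)
n = 5_000
z = rng.integers(0, 2, size=n)
# One-sided non-compliance: only about 30% of those selected actually enroll
d = z * (rng.random(n) < 0.3)
y = rng.poisson(1.0 + 0.5 * d)

itt, ldte = wald_ldte(y, z, d, locations=np.arange(0, 6))
```

Because the first stage is below one under incomplete compliance, the LDTE is a scaled-up version of the ITT effect at every location, which is one reason reporting both side by side is informative.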

@okiner-3
Collaborator Author

@TomeHirata
Understood. I will take care of it!

@TomeHirata TomeHirata requested a review from Copilot October 21, 2025 11:16
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a comprehensive tutorial demonstrating the use of Local Distribution Treatment Effects (LDTE) analysis with the Oregon Health Insurance Experiment dataset. The tutorial showcases how to handle non-compliance scenarios using instrumental variable approaches when not all participants assigned to treatment actually enrolled.

Key changes:

  • New comprehensive tutorial file analyzing emergency department costs and visits using local distribution treatment effects
  • Tutorial demonstrates both simple and ML-adjusted estimators for handling non-compliance
  • Includes stratified analysis by household registration patterns to examine treatment effect heterogeneity

Reviewed Changes

Copilot reviewed 2 out of 11 changed files in this pull request and generated 4 comments.

File Description
docs/source/tutorials/oregon.rst Comprehensive tutorial implementing LDTE analysis for the Oregon Health Insurance Experiment with non-compliance handling
docs/source/tutorials.rst Added reference to the new Oregon tutorial in the documentation index

ctrl_cols.append('charg_tot_pre_ed')
selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']
df = df[selected_cols]
df = df[df.isna().any(axis=1) == False]
Copilot AI Oct 21, 2025

The boolean comparison == False is unnecessary and less Pythonic. Use ~df.isna().any(axis=1) or df.notna().all(axis=1) instead for cleaner code.

Suggested change
df = df[df.isna().any(axis=1) == False]
df = df[~df.isna().any(axis=1)]

Comment on lines +323 to +334
ldte_ctrl, lower_ctrl, upper_ctrl = simple_local_estimator.predict_ldte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=outcome_ed_visits_locations
)

# LDTE: Treatment vs Control
ldte_simple, lower_simple, upper_simple = simple_local_estimator.predict_ldte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=outcome_ed_visits_locations
)
Copilot AI Oct 21, 2025

Duplicate code: The same prediction is computed twice with different variable names (ldte_ctrl/ldte_simple). The first computation (lines 323-327) appears to be unused. Consider removing the duplicate computation to improve code clarity.

Comment on lines +371 to +382
lpte_ctrl, lpte_lower_ctrl, lpte_upper_ctrl = simple_local_estimator.predict_lpte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=[-1] + outcome_ed_visits_locations
)

# Compute Local Probability Treatment Effects
lpte_simple, lpte_lower_simple, lpte_upper_simple = simple_local_estimator.predict_lpte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=[-1] + outcome_ed_visits_locations
)
Copilot AI Oct 21, 2025

Duplicate code: The same LPTE prediction is computed twice with different variable names (lpte_ctrl/lpte_simple). The first computation (lines 371-375) is not used in subsequent code. Remove the duplicate to improve maintainability.


This data supports research on how health insurance affects healthcare utilization and is maintained by researchers Amy Finkelstein and Katherine Baicker. Please ensure you comply with the data use agreements when downloading and using this dataset.

**Important**: When using this dataset for research or publications, appropriate citation is required as specified in the NBER data use agreement.
Collaborator

nit: we can probably remove this warning as it's explicitly written in https://www.nber.org/research/data/oregon-health-insurance-experiment-data

df['Y_NUM_VISIT_CENS_ED'] = df['num_visit_cens_ed'].fillna(0)

# Create strata based on household size
df['strata'] = df['numhh_list']
Collaborator

Let's use df.rename in order not to create a new column with the same values.
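
A minimal sketch of the rename suggested here, using a toy frame in place of the Oregon data (the values are made up; only the column names `numhh_list` and `strata` come from the tutorial):

```python
import pandas as pd

# Toy frame standing in for the Oregon data (values are made up)
df = pd.DataFrame({"person_id": [1, 2, 3], "numhh_list": [1, 2, 1]})

# Rename instead of copying, so no duplicate column is left behind
df = df.rename(columns={"numhh_list": "strata"})
```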


# Create actual treatment indicator: 0=Not enrolled, 1=Enrolled, -1=Missing
treatment_mapping = {'NOT enrolled': 0, 'Enrolled': 1}
df['D'] = df['ohp_all_ever_inperson'].map(treatment_mapping).astype(float).fillna(-1).astype(int)
Collaborator

Let's keep na as it is and drop the records with missing values.

# Prepare the data for dte_adj analysis
# Create treatment assignment (instrumental variable): 0=Not selected, 1=Selected
treatment_assignment_mapping = {'Not selected': 0, 'Selected': 1}
df['Z'] = df['treatment'].map(treatment_assignment_mapping).astype(float).fillna(-1).astype(int)
Collaborator

Why do we cast float first?

Collaborator Author

@okiner-3 Oct 26, 2025

Casting is required because fillna() cannot be applied to category types.
A column that contains NaN cannot be cast to an int type, so it must first be cast to a float type.
df['Z'] does not contain NaN values, but df['D'] does. The same process is applied to both, though it may be redundant for df['Z'].
However, as mentioned in the comment below, it seems better to delete the records with missing values rather than fill them in.
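
The NaN-to-int problem described above can be reproduced in a few lines. This is a sketch under simplifying assumptions: it uses a plain string Series rather than the categorical column in the dataset, and the three values are made up.

```python
import pandas as pd

raw = pd.Series(["Enrolled", "NOT enrolled", None])
treatment_mapping = {"NOT enrolled": 0, "Enrolled": 1}

# Entries with no match in the mapping (here the missing one) become NaN
mapped = raw.map(treatment_mapping)

try:
    mapped.astype(int)
    cast_failed = False
except ValueError:
    # pandas refuses to convert NaN to an integer
    cast_failed = True

# Dropping the missing records first makes the integer cast safe
d = mapped.dropna().astype(int)
```

With the missing records dropped up front, neither the float cast nor the `-1` sentinel is needed.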

Collaborator

Yes, let's run record-wise deletion.

df['D'] = df['ohp_all_ever_inperson'].map(treatment_mapping).astype(float).fillna(-1).astype(int)

# Use emergency department costs and visits as outcome variables
df['Y_ED_CHARG_TOT_ED'] = df['ed_charg_tot_ed'].fillna(0)
Collaborator

Is it a valid pre-processing to fill missing outcomes with 0?

Collaborator Author

@okiner-3 Oct 26, 2025

Sorry, that does not seem valid.
LinearRegression does not allow missing values, but it seems better not to fill them with zeros, since that changes the distribution.

Collaborator

Agreed, let's just remove missing values here too?
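
To make the concern about the distribution concrete, here is a small sketch of how zero-filling moves probability mass to zero compared with dropping missing outcomes. The numbers are made up and are not Oregon data.

```python
import numpy as np
import pandas as pd

# Toy outcome with missing values (made-up numbers)
charges = pd.Series([0.0, 120.0, np.nan, 450.0, np.nan, 80.0])

filled = charges.fillna(0)   # treats "unknown" as "no charges"
dropped = charges.dropna()   # keeps only observed outcomes

# Share of records at exactly zero under each choice
share_zero_filled = (filled == 0).mean()    # 3 of 6 records
share_zero_dropped = (dropped == 0).mean()  # 1 of 4 records
```

Since the DTE and LDTE are differences of CDFs, inflating the mass at zero shifts the estimated distribution at every location at or above zero.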

edu_mapping = {'HS diploma or GED': 0, 'Post HS, not 4-year': 1, 'Less than HS': 2, '4 year degree or more': 3}

df['age'] = 2008 - df['birthyear_list']
df['gender_inp'] = df['gender_inp'].map(gender_mapping).astype(float).fillna(-1).astype(int)
Collaborator

ditto

Collaborator Author

@okiner-3 Oct 26, 2025

The reason for casting to float first is the same as mentioned above.
However, unlike the assignment variables (df['Z'], df['D']) mentioned above, I think it's fine to fill in missing values for these features as “unknown.” What do you think?

Collaborator

How many records have missing gender only? I wonder if these records also miss Z or D.
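
One way to answer this question is with boolean masks. The sketch below assumes `Z` and `D` still hold NaN for missing entries (that is, before any `-1` fill), and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow the tutorial, values are made up
df = pd.DataFrame({
    "gender_inp": ["Male", None, None, "Female"],
    "Z": [1.0, 0.0, np.nan, 1.0],
    "D": [1.0, 0.0, 0.0, np.nan],
})

gender_missing = df["gender_inp"].isna()
assignment_missing = df[["Z", "D"]].isna().any(axis=1)

# Records missing gender while Z and D are both observed
n_gender_only = int((gender_missing & ~assignment_missing).sum())
# Records missing gender that also miss Z or D
n_gender_and_assignment = int((gender_missing & assignment_missing).sum())
```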

# Select control variables: pre-randomization ED utilization variables
ctrl_cols = [col for col in df_ed.columns if 'pre' in col and 'num' in col]
ctrl_cols.append('charg_tot_pre_ed')
selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']
Collaborator

shall we include ['gender_inp', 'age', 'health_last12_inp', 'edu_inp'] in ctrl_cols?

ctrl_cols.append('charg_tot_pre_ed')
selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']
df = df[selected_cols]
df = df[df.isna().any(axis=1) == False]
Collaborator

Suggested change
df = df[df.isna().any(axis=1) == False]
df = df.dropna()
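
For reference, the spellings proposed in this review select the same rows as the original line. A quick check on a toy frame (made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

kept_original = df[df.isna().any(axis=1) == False]  # noqa: E712 (form used in the PR)
kept_inverted = df[~df.isna().any(axis=1)]
kept_notna = df[df.notna().all(axis=1)]
kept_dropna = df.dropna()
```

All four keep only the first row here, so `df.dropna()` is the shortest equivalent.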

df = df[df.isna().any(axis=1) == False]

# Create feature matrix (excluding treatment variables)
features = pd.DataFrame(df[ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']])
Collaborator

nit: we don't need to define a variable features

:width: 800px
:align: center

**LDTE Interpretation**: The positive LDTE values indicate that Medicaid assignment increases the cumulative probability of individuals having emergency department costs at or below each threshold among compliers (those who enroll when selected). This suggests that while Medicaid increases overall ED utilization, it may also help contain costs for some individuals who actually enroll.
Collaborator

Are the LDTE values really positive?


**LDTE Interpretation**: The positive LDTE values indicate that Medicaid assignment increases the cumulative probability of individuals having emergency department costs at or below each threshold among compliers (those who enroll when selected). This suggests that while Medicaid increases overall ED utilization, it may also help contain costs for some individuals who actually enroll.

**Statistical Significance**: Both simple and ML-adjusted local estimators show similar patterns, providing robust evidence that Medicaid assignment has significant distributional effects on emergency department costs for compliers. The confidence intervals indicate that these effects are statistically significant across most cost levels.
Collaborator

The confidence intervals indicate that these effects are statistically significant across most cost levels.

Is this correct?
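
Both this question and the earlier one about the sign can be checked directly from the arrays returned by `predict_ldte`. The sketch below uses placeholder numbers, not results from the tutorial, and assumes the bounds are pointwise confidence bounds.

```python
import numpy as np

# Placeholder estimates and pointwise bounds (made-up numbers, not results)
ldte = np.array([-0.04, -0.02, 0.01, 0.03])
lower = np.array([-0.07, -0.05, -0.02, 0.01])
upper = np.array([-0.01, 0.01, 0.04, 0.05])

all_positive = bool((ldte > 0).all())
# An effect is pointwise significant only where the interval excludes zero
significant = (lower > 0) | (upper < 0)
share_significant = significant.mean()
```

Stating the share of locations whose interval excludes zero would ground the "significant across most cost levels" sentence in the actual output.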


The Local Probability Treatment Effects analysis produces the following visualization:

.. image:: ../_static/oregon_lpte_costs_comparison.png
Collaborator

Maybe we can increase the size of each bin. Currently there are too many bins in the graph.


The side-by-side bar charts show probability treatment effects across different emergency department cost intervals, revealing how Medicaid enrollment affects healthcare utilization patterns:

**Cost Distribution Effects**: The LPTE analysis shows how Medicaid assignment changes the probability of compliers incurring emergency department costs in specific ranges. Positive bars indicate cost intervals where Medicaid assignment increases the likelihood of incurring costs in that range, while negative bars show intervals where it decreases the probability.
Collaborator

This is a correct explanation about how to understand LPTE, but what's the insight specifically for this data?


**Cost Distribution Effects**: The LPTE analysis shows how Medicaid assignment changes the probability of compliers incurring emergency department costs in specific ranges. Positive bars indicate cost intervals where Medicaid assignment increases the likelihood of incurring costs in that range, while negative bars show intervals where it decreases the probability.

**Healthcare Utilization Patterns**: Both simple and ML-adjusted local estimators reveal consistent patterns in how Medicaid assignment affects emergency department utilization across different cost categories for compliers. The analysis shows that Medicaid assignment has heterogeneous effects, increasing utilization in some cost ranges while potentially reducing it in others.
Collaborator

ditto, can we make it a bit more detailed?


**Policy Implications**: Understanding these distributional effects is crucial for healthcare policy. The local analysis reveals that Medicaid's impact varies across the cost distribution for those who actually enroll when assigned, which has important implications for healthcare budgeting and understanding the true effects of public health insurance programs on compliers.

**Conclusion**: Using the real Oregon Health Insurance Experiment dataset with 24,000 participants, the local distributional analysis reveals nuanced patterns in how Medicaid assignment affects healthcare utilization among compliers. The analysis accounts for non-compliance and goes beyond simple average comparisons to show how treatment effects vary across the entire emergency department cost distribution, providing insights into how public health insurance impacts different segments of the population who actually enroll. This demonstrates the power of local distribution treatment effect analysis for understanding heterogeneous responses in healthcare policy interventions with non-compliance.
Collaborator

maybe we don't need a conclusion in the middle of the tutorial

ml_local_estimator.fit(X, Z, D, Y_NUM_VISIT_CENS_ED, strata)

# Define evaluation points for emergency department visits
outcome_ed_visits_locations = np.linspace(Y_NUM_VISIT_CENS_ED.min(), Y_NUM_VISIT_CENS_ED.max(), 20)
Collaborator

Can we use np.arange?
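
A short sketch of why `np.arange` may fit better here, assuming the visit outcome is an integer count. The toy counts are made up.

```python
import numpy as np

# Toy visit counts (made up)
visits = np.array([0, 0, 1, 2, 2, 5, 7])

# linspace can place locations between integers, where the CDF of a count cannot change
linspace_locations = np.linspace(visits.min(), visits.max(), 20)
# arange steps through every integer from min to max inclusive
arange_locations = np.arange(visits.min(), visits.max() + 1)
```

With `np.arange`, each location corresponds to an attainable visit count, so adjacent locations map cleanly onto LPTE bins.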

Comment on lines +322 to +334
# Compute LDTE: Treatment vs Control
ldte_ctrl, lower_ctrl, upper_ctrl = simple_local_estimator.predict_ldte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=outcome_ed_visits_locations
)

# LDTE: Treatment vs Control
ldte_simple, lower_simple, upper_simple = simple_local_estimator.predict_ldte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=outcome_ed_visits_locations
)
Collaborator

What's the difference between these two?


# Visualize Treatment vs Control using dte_adj's plot function
plot(outcome_ed_visits_locations, ldte_simple, lower_simple, upper_simple,
title="Treatment vs Control (Simple Local Estimator)",
Collaborator

Let's add the outcome in either the title or the y label

ax=ax1)

plot(outcome_ed_visits_locations, ldte_ml, lower_ml, upper_ml,
title="Treatment vs Control (ML-Adjusted Local Estimator)",
Collaborator

ditto

Comment on lines +370 to +382
# Compute LPTE: Treatment vs Control
lpte_ctrl, lpte_lower_ctrl, lpte_upper_ctrl = simple_local_estimator.predict_lpte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=[-1] + outcome_ed_visits_locations
)

# Compute Local Probability Treatment Effects
lpte_simple, lpte_lower_simple, lpte_upper_simple = simple_local_estimator.predict_lpte(
target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled)
control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled)
locations=[-1] + outcome_ed_visits_locations
)
Collaborator

Same comment, what's the difference between the two?
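
Separately from the duplication, one detail of the quoted call may be worth checking. Assuming `outcome_ed_visits_locations` is still the NumPy array created with `np.linspace` earlier in the tutorial, and assuming the intent of `[-1] + ...` is to prepend `-1` as an extra left edge, the expression does something different: adding a list to an array broadcasts elementwise. A minimal sketch:

```python
import numpy as np

locations = np.linspace(0, 4, 5)  # 0, 1, 2, 3, 4

# list + ndarray broadcasts: every location is shifted by -1, nothing is prepended
shifted = [-1] + locations
# concatenate prepends -1 as an extra left edge
prepended = np.concatenate(([-1], locations))
```

If prepending is what is wanted, `np.concatenate` (or converting the array to a list first) would be the safer spelling.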

plt.tight_layout()
plt.show()

.. image:: ../_static/oregon_lpte_visits.png
Collaborator

@TomeHirata Oct 21, 2025

Maybe a wrong figure? The code and figure don't match


.. code-block:: python

# Compute LDTE: Treatment vs Control
Collaborator

I think we can remove this code block entirely as it's duplicated with the content below

Probability Treatment Effects: Visits Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python
Collaborator

Same for this, we can remove this code block entirely as it's duplicated with the content below


The emergency department visits analysis reveals complementary patterns to the cost analysis:

**Visit Frequency Effects**: Medicaid assignment shows distinct effects on the probability of different visit frequencies for compliers. The LPTE analysis reveals which visit count categories are most affected by Medicaid assignment among those who actually enroll.
Collaborator

Can we make it a bit more detailed?


**Policy Targeting Implications**: Understanding which household types respond most strongly to Medicaid assignment can inform more targeted policy interventions and help identify populations that would benefit most from expanded coverage when they actually enroll.

**Methodological Consistency**: Both simple and ML-adjusted local estimators show similar patterns within each stratum, providing confidence in the robustness of the stratified findings across different analytical approaches while accounting for non-compliance.
Collaborator

Shall we also mention the difference in ci length?
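
One simple way to quantify the difference is the average interval width per estimator. The helper `mean_ci_width` is hypothetical, and the bounds below are placeholders, not Oregon results.

```python
import numpy as np


def mean_ci_width(lower, upper):
    """Average pointwise confidence-interval width across locations."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))


# Placeholder bounds for two estimators (made-up numbers, not Oregon results)
width_simple = mean_ci_width([-0.06, -0.04], [0.02, 0.04])
width_ml = mean_ci_width([-0.05, -0.03], [0.01, 0.03])
relative_reduction = 1 - width_ml / width_simple
```

Applied to the `lower_*` and `upper_*` arrays returned by the simple and ML-adjusted estimators, this gives a single number to cite in the tutorial text.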


**Heterogeneity Analysis**: The stratified analysis by household registration type reveals important local treatment effect heterogeneity, showing that different populations respond differently to Medicaid assignment when they actually enroll.

**Methodological Robustness**: Comparing simple and ML-adjusted local estimators provides confidence in our findings and demonstrates the robustness of the local distributional treatment effect methodology for handling non-compliance scenarios.
Collaborator

ditto

plt.tight_layout()
plt.show()

.. image:: ../_static/oregon_ldte_strata.png
Collaborator

Something seems wrong with the ML-adjusted LPTE. Can we investigate the cause?

**Methodological Consistency**: Both simple and ML-adjusted local estimators show similar patterns within each stratum, providing confidence in the robustness of the stratified findings across different analytical approaches while accounting for non-compliance.

Conclusion
~~~~~~~~~~
Collaborator

@TomeHirata Oct 21, 2025

Let me repeat the same comment, can we ground the insights on the Oregon analysis results? The current summary feels too generic.
