Add a tutorial for the Oregon Health Insurance Experiment #81

Open

okiner-3 wants to merge 2 commits into main from feat/tutlials/oregon

Collaborator

okiner-3 commented Oct 14, 2025

Add a tutorial using the Oregon Health Insurance Experiment dataset.


add a tutorial for oregon health insurance experiment

26b5af5

okiner-3 self-assigned this

okiner-3 requested a review from TomeHirata

October 14, 2025 08:00

Collaborator

TomeHirata commented Oct 16, 2025 (edited)

@okiner-3 Thank you so much for the PR! Sorry if this was not clear in the ticket description, but we want to focus on the local distributional treatment effect (LDTE) and the local probability treatment effect (LPTE), since incomplete compliance was observed in the experiment. In the Oregon dataset, the treatment column is the treatment assignment, the numhh_list column is the strata, and the ohp_all_ever_inperson column indicates the treatment actually received. We can keep the covariates and outcomes as they are. Could you please take a look at https://cyberagentailab.github.io/python-dte-adjustment/api/local.html and revise the content? Comparing ITT and LDTE is valuable, so we can keep the current analysis as it is, but let's also include the LDTE/LPTE results. Let me know if you need further clarification.
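
For readers comparing the two estimands, here is a minimal sketch of the idea, not the dte_adj implementation. It assumes a binary assignment Z and a binary receipt indicator D, and uses the textbook Wald-type ratio: the ITT distributional effect (difference in empirical CDFs by assignment) divided by the difference in take-up rates. The function names `itt_dte` and `wald_ldte` and the simulated data are hypothetical; the library additionally handles strata, regression adjustment, and confidence intervals.

```python
import numpy as np


def itt_dte(y, z, locations):
    """ITT distributional effect: CDF under Z=1 minus CDF under Z=0 at each location."""
    y, z = np.asarray(y), np.asarray(z)
    locations = np.asarray(locations)
    cdf_assigned = (y[z == 1][:, None] <= locations).mean(axis=0)
    cdf_not_assigned = (y[z == 0][:, None] <= locations).mean(axis=0)
    return cdf_assigned - cdf_not_assigned


def wald_ldte(y, z, d, locations):
    """Wald-type local effect: ITT effect scaled by the difference in take-up."""
    z, d = np.asarray(z), np.asarray(d)
    take_up_gap = d[z == 1].mean() - d[z == 0].mean()
    return itt_dte(y, z, locations) / take_up_gap


# Simulated data (not the Oregon data) with one-sided non-compliance.
rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, size=n)    # random assignment
complier = rng.random(n) < 0.3    # about 30% enroll when selected
d = z * complier                  # treatment actually received
y = rng.poisson(1.0 + 0.5 * d)    # outcome, e.g. a visit count

locations = np.arange(0, 6)
itt = itt_dte(y, z, locations)
ldte = wald_ldte(y, z, d, locations)
take_up_gap = d[z == 1].mean() - d[z == 0].mean()
```

Because the take-up gap is below 1, the local effect in this sketch is the ITT effect scaled up by 1 / take_up_gap, which is why the two sets of results are worth showing side by side.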

Collaborator Author

okiner-3 commented Oct 19, 2025

@TomeHirata
Understood. I will take care of it!


replace functions from predict_{dte, pte} to predict_{ldte, lpte} due to Non-Compliance

39b2acd

TomeHirata requested a review from Copilot

October 21, 2025 11:16

Copilot AI reviewed

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a comprehensive tutorial demonstrating the use of Local Distribution Treatment Effects (LDTE) analysis with the Oregon Health Insurance Experiment dataset. The tutorial showcases how to handle non-compliance scenarios using instrumental variable approaches when not all participants assigned to treatment actually enrolled.

Key changes:

- New comprehensive tutorial file analyzing emergency department costs and visits using local distribution treatment effects
- Tutorial demonstrates both simple and ML-adjusted estimators for handling non-compliance
- Includes stratified analysis by household registration patterns to examine treatment effect heterogeneity

Reviewed Changes

Copilot reviewed 2 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/source/tutorials/oregon.rst | Comprehensive tutorial implementing LDTE analysis for the Oregon Health Insurance Experiment with non-compliance handling |
| docs/source/tutorials.rst | Added reference to the new Oregon tutorial in the documentation index |

docs/source/tutorials/oregon.rst

+                  ctrl_cols.append('charg_tot_pre_ed')
+                  selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']
+                  df = df[selected_cols]
+                  df = df[df.isna().any(axis=1) == False]

Copilot AI Oct 21, 2025

The boolean comparison == False is unnecessary and less Pythonic. Use ~df.isna().any(axis=1) or df.notna().all(axis=1) instead for cleaner code.

Suggested change

      
-                  df = df[df.isna().any(axis=1) == False]
+                  df = df[~df.isna().any(axis=1)]
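
As a quick check that the alternatives are interchangeable, the sketch below compares the original filter with the two forms named in the comment and with `DataFrame.dropna()`. The small frame is made up for illustration; only the column names Z, D and age echo the tutorial.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (illustrative values only).
df = pd.DataFrame({
    "Z": [1, 0, np.nan, 1],
    "D": [1, 0, 0, np.nan],
    "age": [30, 41, 55, 62],
})

by_eq_false = df[df.isna().any(axis=1) == False]  # original form
by_negation = df[~df.isna().any(axis=1)]          # boolean negation
by_notna = df[df.notna().all(axis=1)]             # positive phrasing
by_dropna = df.dropna()                           # built-in row-wise deletion
```

All four expressions keep the same two complete rows, so the choice is purely about readability.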

Copilot uses AI. Check for mistakes.

docs/source/tutorials/oregon.rst

Comment on lines +323 to +334

+                  ldte_ctrl, lower_ctrl, upper_ctrl = simple_local_estimator.predict_ldte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=outcome_ed_visits_locations
+                  )
+                  # LDTE: Treatment vs Control
+                  ldte_simple, lower_simple, upper_simple = simple_local_estimator.predict_ldte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=outcome_ed_visits_locations
+                  )

Copilot AI Oct 21, 2025

Duplicate code: The same prediction is computed twice with different variable names (ldte_ctrl/ldte_simple). The first computation (lines 323-327) appears to be unused. Consider removing the duplicate computation to improve code clarity.

docs/source/tutorials/oregon.rst

Comment on lines +371 to +382

+                  lpte_ctrl, lpte_lower_ctrl, lpte_upper_ctrl = simple_local_estimator.predict_lpte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=[-1] + outcome_ed_visits_locations
+                  )
+                  # Compute Local Probability Treatment Effects
+                  lpte_simple, lpte_lower_simple, lpte_upper_simple = simple_local_estimator.predict_lpte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=[-1] + outcome_ed_visits_locations
+                  )

Copilot AI Oct 21, 2025

Duplicate code: The same LPTE prediction is computed twice with different variable names (lpte_ctrl/lpte_simple). The first computation (lines 371-375) is not used in subsequent code. Remove the duplicate to improve maintainability.

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		This data supports research on how health insurance affects healthcare utilization and is maintained by researchers Amy Finkelstein and Katherine Baicker. Please ensure you comply with the data use agreements when downloading and using this dataset.

		Important: When using this dataset for research or publications, appropriate citation is required as specified in the NBER data use agreement.

Collaborator

TomeHirata Oct 21, 2025

nit: we can probably remove this warning, since it is already stated explicitly at https://www.nber.org/research/data/oregon-health-insurance-experiment-data

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  df['Y_NUM_VISIT_CENS_ED'] = df['num_visit_cens_ed'].fillna(0)
+                  # Create strata based on household size
+                  df['strata'] = df['numhh_list']

Collaborator

TomeHirata Oct 21, 2025

Let's use df.rename in order not to create a new column with the same values.
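
A minimal sketch of the rename, assuming the column is only needed under its new name. The strata labels and the second column's values are placeholders rather than values from the dataset.

```python
import pandas as pd

# Placeholder rows; only the column names come from the tutorial.
df = pd.DataFrame({
    "numhh_list": ["a", "b", "a"],
    "treatment": ["Selected", "Not selected", "Selected"],
})

# Rename in one step instead of copying numhh_list into a second column.
df = df.rename(columns={"numhh_list": "strata"})
```

After the rename the frame has a single strata column and no leftover numhh_list copy.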

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  # Create actual treatment indicator: 0=Not enrolled, 1=Enrolled, -1=Missing
+                  treatment_mapping = {'NOT enrolled': 0, 'Enrolled': 1}
+                  df['D'] = df['ohp_all_ever_inperson'].map(treatment_mapping).astype(float).fillna(-1).astype(int)

Collaborator

TomeHirata Oct 21, 2025

Let's keep na as it is and drop the records with missing values.

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  # Prepare the data for dte_adj analysis
+                  # Create treatment assignment (instrumental variable): 0=Not selected, 1=Selected
+                  treatment_assignment_mapping = {'Not selected': 0, 'Selected': 1}
+                  df['Z'] = df['treatment'].map(treatment_assignment_mapping).astype(float).fillna(-1).astype(int)

Collaborator

TomeHirata Oct 21, 2025

Why do we cast float first?

Collaborator Author

okiner-3 Oct 26, 2025 (edited)

fillna(-1) cannot be applied to a category-typed column, because -1 is not one of its categories, so a cast is required first.
A column that contains NaN cannot be cast to int, so it has to be cast to float instead.
df['Z'] does not contain NaN values, but df['D'] does; the same processing is applied to both, although it may be redundant for df['Z'].
However, as mentioned in the comment below, it seems better to delete the records with missing values than to fill them in.

Collaborator

TomeHirata Oct 30, 2025

Yes, let's run record-wide deletion
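
A sketch of what record-wise deletion could look like for the two indicator columns, assuming they are read as pandas categoricals as described above. The four rows are made up; the column names and label strings are the ones quoted in the diff.

```python
import pandas as pd

# Made-up rows; the third record has no enrollment information.
df = pd.DataFrame({
    "treatment": pd.Categorical(
        ["Selected", "Not selected", "Selected", "Not selected"]
    ),
    "ohp_all_ever_inperson": pd.Categorical(
        ["Enrolled", "NOT enrolled", None, "NOT enrolled"]
    ),
})

# Map labels to 0/1; the float cast keeps missing values as NaN.
df["Z"] = df["treatment"].map({"Not selected": 0, "Selected": 1}).astype(float)
df["D"] = (
    df["ohp_all_ever_inperson"]
    .map({"NOT enrolled": 0, "Enrolled": 1})
    .astype(float)
)

# Drop incomplete records first, then cast to int without a sentinel value.
df = df.dropna(subset=["Z", "D"]).astype({"Z": int, "D": int})
```

With this ordering no -1 placeholder ever enters Z or D, so the estimator only sees the two valid arms.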

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  df['D'] = df['ohp_all_ever_inperson'].map(treatment_mapping).astype(float).fillna(-1).astype(int)
+                  # Use emergency department costs and visits as outcome variables
+                  df['Y_ED_CHARG_TOT_ED'] = df['ed_charg_tot_ed'].fillna(0)

Collaborator

TomeHirata Oct 21, 2025

Is it a valid pre-processing to fill missing outcomes with 0?

Collaborator Author

okiner-3 Oct 26, 2025 (edited)

Sorry, that does not seem valid.
LinearRegression does not accept missing values, but it is better not to fill them with zeros, since that changes the outcome distribution.

Collaborator

TomeHirata Oct 30, 2025

Agreed, let's just remove missing values here too?
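
To make the concern concrete, the toy example below shows how zero-filling moves mass in the empirical CDF, which is exactly the quantity a distributional effect is built from. The charge values and the helper `ecdf_at` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented charges: two of six records are missing.
charges = pd.Series([np.nan, np.nan, 100.0, 200.0, 300.0, 400.0])


def ecdf_at(values, threshold):
    """Share of observations at or below the threshold."""
    return float((values <= threshold).mean())


filled = charges.fillna(0)
dropped = charges.dropna()

share_zero_filled = ecdf_at(filled, 0)    # missing records now look like zero cost
share_zero_dropped = ecdf_at(dropped, 0)  # no observed record has zero cost
```

In this toy case a third of the filled sample appears to have zero cost, while none of the observed records do.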

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  edu_mapping = {'HS diploma or GED': 0, 'Post HS, not 4-year': 1, 'Less than HS': 2, '4 year degree or more': 3}
+                  df['age'] = 2008 - df['birthyear_list']
+                  df['gender_inp'] = df['gender_inp'].map(gender_mapping).astype(float).fillna(-1).astype(int)

Collaborator

TomeHirata Oct 21, 2025

ditto

Collaborator Author

okiner-3 Oct 26, 2025 (edited)

The reason for casting to float first is the same as above.
However, unlike the assignment columns (df['Z'], df['D']), I think it is fine to fill missing values for these features with an "unknown" category. What do you think?

Collaborator

TomeHirata Oct 30, 2025

How many records have missing gender only? I wonder if these records are also missing Z or D.
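
One way to answer this is to cross-tabulate the two missingness indicators. The sketch below uses a made-up frame; with the real data the same three lines would run on the processed DataFrame before any rows are dropped.

```python
import numpy as np
import pandas as pd

# Made-up records, only to show the bookkeeping.
df = pd.DataFrame({
    "gender_inp": ["Female", None, "Male", None, "Female", None],
    "Z": [1, 0, 1, np.nan, 0, 1],
    "D": [1, np.nan, 0, np.nan, 0, 1],
})

missing_gender = df["gender_inp"].isna().rename("gender missing")
missing_z_or_d = df[["Z", "D"]].isna().any(axis=1).rename("Z or D missing")

n_missing_gender = int(missing_gender.sum())
n_gender_only = int((missing_gender & ~missing_z_or_d).sum())
overlap = pd.crosstab(missing_gender, missing_z_or_d)
```

If n_gender_only turns out to be small, dropping those records costs little; if it is large, an "unknown" category may be worth keeping.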

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  # Select control variables: pre-randomization ED utilization variables
+                  ctrl_cols = [col for col in df_ed.columns if 'pre' in col and 'num' in col]
+                  ctrl_cols.append('charg_tot_pre_ed')
+                  selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']

Collaborator

TomeHirata Oct 21, 2025

shall we include ['gender_inp', 'age', 'health_last12_inp', 'edu_inp'] in ctrl_cols?

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  ctrl_cols.append('charg_tot_pre_ed')
+                  selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']
+                  df = df[selected_cols]
+                  df = df[df.isna().any(axis=1) == False]

Collaborator

TomeHirata Oct 21, 2025

Suggested change

      
-                  df = df[df.isna().any(axis=1) == False]
+                  df = df.dropna()

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  df = df[df.isna().any(axis=1) == False]
+                  # Create feature matrix (excluding treatment variables)
+                  features = pd.DataFrame(df[ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']])

Collaborator

TomeHirata Oct 21, 2025

nit: we don't need to define a variable features
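
A possible shape for this, combined with the earlier question about folding the demographic columns into ctrl_cols. The frame is a toy stand-in, and `num_visit_pre_ed` is a placeholder name for a pre-period count column; the other column names are taken from the diff.

```python
import pandas as pd

# Toy stand-in for the processed DataFrame.
df = pd.DataFrame({
    "num_visit_pre_ed": [0, 2, 1],  # placeholder pre-period column name
    "charg_tot_pre_ed": [0.0, 350.0, 90.0],
    "gender_inp": [0, 1, 0],
    "age": [34, 51, 27],
    "health_last12_inp": [2, 3, 1],
    "edu_inp": [0, 1, 3],
    "Z": [1, 0, 1],
})

# Keep every covariate name in one list.
ctrl_cols = [col for col in df.columns if "pre" in col and "num" in col]
ctrl_cols += ["charg_tot_pre_ed", "gender_inp", "age", "health_last12_inp", "edu_inp"]

# Build the design matrix directly, with no intermediate features frame.
X = df[ctrl_cols].to_numpy()
```

Keeping a single list means the column selection and the design matrix cannot drift apart.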

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                 :width: 800px
+                 :align: center
+              **LDTE Interpretation**: The positive LDTE values indicate that Medicaid assignment increases the cumulative probability of individuals having emergency department costs at or below each threshold among compliers (those who enroll when selected). This suggests that while Medicaid increases overall ED utilization, it may also help contain costs for some individuals who actually enroll.

Collaborator

TomeHirata Oct 21, 2025

Are the LDTE values really positive?

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		LDTE Interpretation: The positive LDTE values indicate that Medicaid assignment increases the cumulative probability of individuals having emergency department costs at or below each threshold among compliers (those who enroll when selected). This suggests that while Medicaid increases overall ED utilization, it may also help contain costs for some individuals who actually enroll.

		Statistical Significance: Both simple and ML-adjusted local estimators show similar patterns, providing robust evidence that Medicaid assignment has significant distributional effects on emergency department costs for compliers. The confidence intervals indicate that these effects are statistically significant across most cost levels.

Collaborator

TomeHirata Oct 21, 2025

The confidence intervals indicate that these effects are statistically significant across most cost levels.

Is this correct?
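
Both the sign claim and the significance claim can be checked mechanically from the arrays that predict_ldte returns. The sketch below uses placeholder numbers, not results from the Oregon data, and treats an effect as significant when its confidence interval excludes zero.

```python
import numpy as np

# Placeholder point estimates and bounds (NOT Oregon results).
ldte = np.array([-0.04, -0.03, -0.01, 0.00, 0.01])
lower = np.array([-0.07, -0.05, -0.04, -0.03, -0.02])
upper = np.array([-0.01, -0.01, 0.02, 0.03, 0.04])

# An interval that lies entirely on one side of zero excludes zero.
significant = (lower > 0) | (upper < 0)

share_positive = float((ldte > 0).mean())
share_significant = float(significant.mean())
```

Reporting these two shares next to the figure would let the interpretation text be checked against the numbers.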

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		The Local Probability Treatment Effects analysis produces the following visualization:

		.. image:: ../_static/oregon_lpte_costs_comparison.png

Collaborator

TomeHirata Oct 21, 2025

Maybe we can increase the size of each bin. Currently there are too many bins in the graph.
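
Since each LPTE bar covers the interval between two consecutive locations, widening the step between locations reduces the number of bars. A small sketch, where the maximum cost and both step sizes are placeholders rather than values from the tutorial:

```python
import numpy as np

max_cost = 10_000  # placeholder upper bound for ED charges

fine_edges = np.arange(0, max_cost + 1, 500)      # narrow intervals
coarse_edges = np.arange(0, max_cost + 1, 2_500)  # wider intervals

n_fine_bins = len(fine_edges) - 1
n_coarse_bins = len(coarse_edges) - 1
```

With these placeholder settings the plot would go from 20 bars to 4.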

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		The side-by-side bar charts show probability treatment effects across different emergency department cost intervals, revealing how Medicaid enrollment affects healthcare utilization patterns:

		Cost Distribution Effects: The LPTE analysis shows how Medicaid assignment changes the probability of compliers incurring emergency department costs in specific ranges. Positive bars indicate cost intervals where Medicaid assignment increases the likelihood of incurring costs in that range, while negative bars show intervals where it decreases the probability.

Collaborator

TomeHirata Oct 21, 2025

This is a correct explanation about how to understand LPTE, but what's the insight specifically for this data?

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		Cost Distribution Effects: The LPTE analysis shows how Medicaid assignment changes the probability of compliers incurring emergency department costs in specific ranges. Positive bars indicate cost intervals where Medicaid assignment increases the likelihood of incurring costs in that range, while negative bars show intervals where it decreases the probability.

		Healthcare Utilization Patterns: Both simple and ML-adjusted local estimators reveal consistent patterns in how Medicaid assignment affects emergency department utilization across different cost categories for compliers. The analysis shows that Medicaid assignment has heterogeneous effects, increasing utilization in some cost ranges while potentially reducing it in others.

Collaborator

TomeHirata Oct 21, 2025

ditto, can we make it a bit more detailed?

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		Policy Implications: Understanding these distributional effects is crucial for healthcare policy. The local analysis reveals that Medicaid's impact varies across the cost distribution for those who actually enroll when assigned, which has important implications for healthcare budgeting and understanding the true effects of public health insurance programs on compliers.

		Conclusion: Using the real Oregon Health Insurance Experiment dataset with 24,000 participants, the local distributional analysis reveals nuanced patterns in how Medicaid assignment affects healthcare utilization among compliers. The analysis accounts for non-compliance and goes beyond simple average comparisons to show how treatment effects vary across the entire emergency department cost distribution, providing insights into how public health insurance impacts different segments of the population who actually enroll. This demonstrates the power of local distribution treatment effect analysis for understanding heterogeneous responses in healthcare policy interventions with non-compliance.

Collaborator

TomeHirata Oct 21, 2025

maybe we don't need a conclusion in the middle of the tutorial

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  ml_local_estimator.fit(X, Z, D, Y_NUM_VISIT_CENS_ED, strata)
+                  # Define evaluation points for emergency department visits
+                  outcome_ed_visits_locations = np.linspace(Y_NUM_VISIT_CENS_ED.min(), Y_NUM_VISIT_CENS_ED.max(), 20)

Collaborator

TomeHirata Oct 21, 2025

Can we use np.arange?
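
For a count outcome such as the number of visits, np.arange puts the evaluation points on the integers, whereas np.linspace with 20 points generally lands between them. A minimal comparison on made-up visit counts:

```python
import numpy as np

visits = np.array([0, 0, 1, 3, 0, 2, 7, 1])  # made-up visit counts

# 20 evenly spaced points, mostly non-integer.
linspace_locations = np.linspace(visits.min(), visits.max(), 20)

# One location per possible visit count, inclusive of the maximum.
arange_locations = np.arange(visits.min(), visits.max() + 1)
```

Integer locations also make the x-axis of the resulting plot easier to read.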

TomeHirata reviewed

docs/source/tutorials/oregon.rst

Comment on lines +322 to +334

+                  # Compute LDTE: Treatment vs Control
+                  ldte_ctrl, lower_ctrl, upper_ctrl = simple_local_estimator.predict_ldte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=outcome_ed_visits_locations
+                  )
+                  # LDTE: Treatment vs Control
+                  ldte_simple, lower_simple, upper_simple = simple_local_estimator.predict_ldte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=outcome_ed_visits_locations
+                  )

Collaborator

TomeHirata Oct 21, 2025

What's the difference between these two?

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  # Visualize Treatment vs Control using dte_adj's plot function
+                  plot(outcome_ed_visits_locations, ldte_simple, lower_simple, upper_simple,
+                       title="Treatment vs Control (Simple Local Estimator)",

Collaborator

TomeHirata Oct 21, 2025

Let's add the outcome in either the title or the y label
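
As an illustration of the labelling only, here is a plain matplotlib sketch; it does not use dte_adj's plot function, whose remaining parameters are not shown in this thread. The plotted values are placeholders and the label wording is just one option.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

locations = np.arange(0, 8)
effect = np.zeros(len(locations))  # placeholder values, not results

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(locations, effect, marker="o")
ax.axhline(0, color="gray", linewidth=0.8)

# Name the outcome in the title and on both axes.
ax.set_title("LDTE on ED visits: Treatment vs Control (Simple Local Estimator)")
ax.set_xlabel("Number of ED visits")
ax.set_ylabel("LDTE on P(ED visits <= y)")
```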

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                       ax=ax1)
+                  plot(outcome_ed_visits_locations, ldte_ml, lower_ml, upper_ml,
+                       title="Treatment vs Control (ML-Adjusted Local Estimator)",

Collaborator

TomeHirata Oct 21, 2025

ditto

TomeHirata reviewed

docs/source/tutorials/oregon.rst

Comment on lines +370 to +382

+                  # Compute LPTE: Treatment vs Control
+                  lpte_ctrl, lpte_lower_ctrl, lpte_upper_ctrl = simple_local_estimator.predict_lpte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=[-1] + outcome_ed_visits_locations
+                  )
+                  # Compute Local Probability Treatment Effects
+                  lpte_simple, lpte_lower_simple, lpte_upper_simple = simple_local_estimator.predict_lpte(
+                      target_treatment_arm=1,  # Z=1 Selected for treatment (Enrolled)
+                      control_treatment_arm=0,  # Z=0 Not selected for treatment (Not enrolled)
+                      locations=[-1] + outcome_ed_visits_locations
+                  )

Collaborator

TomeHirata Oct 21, 2025

Same comment, what's the difference between the two?
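
A separate point that may be worth checking in both calls, assuming outcome_ed_visits_locations is still the NumPy array produced by np.linspace in the snippet quoted elsewhere in this review: adding a Python list to a NumPy array broadcasts instead of concatenating, so `[-1] + locations` shifts every location by -1 rather than prepending one. The array below is a stand-in for illustration.

```python
import numpy as np

locations = np.array([0.0, 2.0, 4.0])  # stand-in for outcome_ed_visits_locations

# list + ndarray is elementwise addition: every location moves down by 1.
shifted = [-1] + locations

# Concatenation adds one extra leading location and keeps the rest unchanged.
prepended = np.concatenate(([-1.0], locations))
```

If the intent is to add a lower edge for the first interval, the concatenated form is the one that does so.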

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  plt.tight_layout()
+                  plt.show()
+              .. image:: ../_static/oregon_lpte_visits.png

Collaborator

TomeHirata Oct 21, 2025 (edited)

Maybe a wrong figure? The code and figure don't match

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		.. code-block:: python

		# Compute LDTE: Treatment vs Control

Collaborator

TomeHirata Oct 21, 2025

I think we can remove this code block entirely, since it duplicates the content below.

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+              Probability Treatment Effects: Visits Analysis
+              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+              .. code-block:: python

Collaborator

TomeHirata Oct 21, 2025

Same for this: we can remove this code block entirely, since it duplicates the content below.

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		The emergency department visits analysis reveals complementary patterns to the cost analysis:

		Visit Frequency Effects: Medicaid assignment shows distinct effects on the probability of different visit frequencies for compliers. The LPTE analysis reveals which visit count categories are most affected by Medicaid assignment among those who actually enroll.

Collaborator

TomeHirata Oct 21, 2025

Can we make it a bit more detailed?

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		Policy Targeting Implications: Understanding which household types respond most strongly to Medicaid assignment can inform more targeted policy interventions and help identify populations that would benefit most from expanded coverage when they actually enroll.

		Methodological Consistency: Both simple and ML-adjusted local estimators show similar patterns within each stratum, providing confidence in the robustness of the stratified findings across different analytical approaches while accounting for non-compliance.

Collaborator

TomeHirata Oct 21, 2025

Shall we also mention the difference in ci length?
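
The comparison can be computed directly from the bounds returned by the two estimators. The numbers below are placeholders, not results from the Oregon data; the variable names follow the lower/upper naming used in the tutorial code.

```python
import numpy as np

# Placeholder confidence bounds (NOT Oregon results).
lower_simple = np.array([-0.08, -0.06, -0.05])
upper_simple = np.array([0.02, 0.04, 0.05])
lower_ml = np.array([-0.07, -0.05, -0.04])
upper_ml = np.array([0.01, 0.03, 0.04])

width_simple = upper_simple - lower_simple
width_ml = upper_ml - lower_ml

# Fraction by which the ML-adjusted interval is shorter at each location.
relative_reduction = 1 - width_ml / width_simple
```

A sentence such as "the ML-adjusted intervals are about X% shorter" can then be backed by the average of relative_reduction.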

TomeHirata reviewed

docs/source/tutorials/oregon.rst


		Heterogeneity Analysis: The stratified analysis by household registration type reveals important local treatment effect heterogeneity, showing that different populations respond differently to Medicaid assignment when they actually enroll.

		Methodological Robustness: Comparing simple and ML-adjusted local estimators provides confidence in our findings and demonstrates the robustness of the local distributional treatment effect methodology for handling non-compliance scenarios.

Collaborator

TomeHirata Oct 21, 2025

ditto

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+                  plt.tight_layout()
+                  plt.show()
+              .. image:: ../_static/oregon_ldte_strata.png

Collaborator

TomeHirata Oct 21, 2025

Something seems wrong with the ML-adjusted LPTE. Can we investigate the cause?

TomeHirata reviewed

docs/source/tutorials/oregon.rst

+              **Methodological Consistency**: Both simple and ML-adjusted local estimators show similar patterns within each stratum, providing confidence in the robustness of the stratified findings across different analytical approaches while accounting for non-compliance.
+              Conclusion
+              ~~~~~~~~~~

Collaborator

TomeHirata Oct 21, 2025 (edited)

To repeat the same comment: can we ground the insights in the Oregon analysis results? The current summary feels too generic.
