Add a tutorial for the Oregon Health Insurance Experiment #81
base: main
Conversation
|
@okiner-3 Thank you so much for the PR! Sorry if it was not clear in the ticket description, but we want to focus on the local distributional treatment effects (LDTE) and LPTE since incomplete compliance was observed in the experiment. In the Oregon dataset, the |
|
@TomeHirata |
… to Non-Compliance
Pull Request Overview
This PR adds a comprehensive tutorial demonstrating the use of Local Distribution Treatment Effects (LDTE) analysis with the Oregon Health Insurance Experiment dataset. The tutorial showcases how to handle non-compliance scenarios using instrumental variable approaches when not all participants assigned to treatment actually enrolled.
Key changes:
- New comprehensive tutorial file analyzing emergency department costs and visits using local distribution treatment effects
- Tutorial demonstrates both simple and ML-adjusted estimators for handling non-compliance
- Includes stratified analysis by household registration patterns to examine treatment effect heterogeneity
Reviewed Changes
Copilot reviewed 2 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/source/tutorials/oregon.rst | Comprehensive tutorial implementing LDTE analysis for the Oregon Health Insurance Experiment with non-compliance handling |
| docs/source/tutorials.rst | Added reference to the new Oregon tutorial in the documentation index |
| ctrl_cols.append('charg_tot_pre_ed') | ||
| selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp'] | ||
| df = df[selected_cols] | ||
| df = df[df.isna().any(axis=1) == False] |
Copilot AI (Oct 21, 2025):
The boolean comparison == False is unnecessary and less Pythonic. Use ~df.isna().any(axis=1) or df.notna().all(axis=1) instead for cleaner code.
Suggested change:
- df = df[df.isna().any(axis=1) == False]
+ df = df[~df.isna().any(axis=1)]
| ldte_ctrl, lower_ctrl, upper_ctrl = simple_local_estimator.predict_ldte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=outcome_ed_visits_locations | ||
| ) | ||
|
|
||
| # LDTE: Treatment vs Control | ||
| ldte_simple, lower_simple, upper_simple = simple_local_estimator.predict_ldte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=outcome_ed_visits_locations | ||
| ) |
Copilot AI (Oct 21, 2025):
Duplicate code: The same prediction is computed twice with different variable names (ldte_ctrl/ldte_simple). The first computation (lines 323-327) appears to be unused. Consider removing the duplicate computation to improve code clarity.
| lpte_ctrl, lpte_lower_ctrl, lpte_upper_ctrl = simple_local_estimator.predict_lpte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=[-1] + outcome_ed_visits_locations | ||
| ) | ||
|
|
||
| # Compute Local Probability Treatment Effects | ||
| lpte_simple, lpte_lower_simple, lpte_upper_simple = simple_local_estimator.predict_lpte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=[-1] + outcome_ed_visits_locations | ||
| ) |
Copilot AI (Oct 21, 2025):
Duplicate code: The same LPTE prediction is computed twice with different variable names (lpte_ctrl/lpte_simple). The first computation (lines 371-375) is not used in subsequent code. Remove the duplicate to improve maintainability.
|
|
||
| This data supports research on how health insurance affects healthcare utilization and is maintained by researchers Amy Finkelstein and Katherine Baicker. Please ensure you comply with the data use agreements when downloading and using this dataset. | ||
|
|
||
| **Important**: When using this dataset for research or publications, appropriate citation is required as specified in the NBER data use agreement. |
nit: we can probably remove this warning as it's explicitly written in https://www.nber.org/research/data/oregon-health-insurance-experiment-data
| df['Y_NUM_VISIT_CENS_ED'] = df['num_visit_cens_ed'].fillna(0) | ||
|
|
||
| # Create strata based on household size | ||
| df['strata'] = df['numhh_list'] |
Let's use df.rename in order not to create a new column with the same values.
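A minimal sketch of the rename, assuming `numhh_list` is not needed under its original name afterwards. The toy values are hypothetical; only the column names come from the diff.

```python
import pandas as pd

# Toy frame; the values are placeholders, not real Oregon data.
df = pd.DataFrame({
    "person_id": [1, 2, 3],
    "numhh_list": ["a", "b", "a"],
})

# Rename in place of `df['strata'] = df['numhh_list']`, so the same
# values are not stored twice under two column names.
df = df.rename(columns={"numhh_list": "strata"})

print(df.columns.tolist())
```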
|
|
||
| # Create actual treatment indicator: 0=Not enrolled, 1=Enrolled, -1=Missing | ||
| treatment_mapping = {'NOT enrolled': 0, 'Enrolled': 1} | ||
| df['D'] = df['ohp_all_ever_inperson'].map(treatment_mapping).astype(float).fillna(-1).astype(int) |
Let's keep na as it is and drop the records with missing values.
| # Prepare the data for dte_adj analysis | ||
| # Create treatment assignment (instrumental variable): 0=Not selected, 1=Selected | ||
| treatment_assignment_mapping = {'Not selected': 0, 'Selected': 1} | ||
| df['Z'] = df['treatment'].map(treatment_assignment_mapping).astype(float).fillna(-1).astype(int) |
Why do we cast float first?
fillna(-1) cannot be applied directly to a categorical column, because -1 is not one of its categories, so a type cast is required first.
A column that contains NaN cannot be cast to an int type, so it has to be cast to float.
df['Z'] does not contain NaN values but df['D'] does, so the same processing is applied to both, although it may be redundant for df['Z'].
However, as mentioned in the comment below, it seems better to delete the records with missing values than to fill them in.
Yes, let's run record-wide deletion
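One way the record-wise deletion agreed on here could look, as a sketch rather than the PR's final code. The column names and value labels are taken from the quoted diff; the toy rows and the `raw` frame are hypothetical.

```python
import pandas as pd

# Toy stand-in for the raw data; both columns are categorical, as discussed above.
raw = pd.DataFrame({
    "treatment": pd.Categorical(
        ["Selected", "Not selected", "Selected", "Not selected"]),
    "ohp_all_ever_inperson": pd.Categorical(
        ["Enrolled", "NOT enrolled", None, "NOT enrolled"]),
})

# Map labels to 0/1 and cast to float so that missing values stay NaN
# instead of being recoded as -1.
df = pd.DataFrame({
    "Z": raw["treatment"].map({"Not selected": 0, "Selected": 1}).astype(float),
    "D": raw["ohp_all_ever_inperson"].map({"NOT enrolled": 0, "Enrolled": 1}).astype(float),
})

# Drop every record with a missing Z or D, then cast to int.
df = df.dropna(subset=["Z", "D"]).astype({"Z": int, "D": int})

print(df)
```

With this ordering, no sentinel value such as -1 can leak into the estimator as if it were a third treatment arm.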
| df['D'] = df['ohp_all_ever_inperson'].map(treatment_mapping).astype(float).fillna(-1).astype(int) | ||
|
|
||
| # Use emergency department costs and visits as outcome variables | ||
| df['Y_ED_CHARG_TOT_ED'] = df['ed_charg_tot_ed'].fillna(0) |
Is it a valid pre-processing to fill missing outcomes with 0?
Sorry, that does not seem valid.
LinearRegression does not accept missing values, but it seems better not to fill them with zeros, since that changes the outcome distribution.
Agreed, let's just remove missing values here too?
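To illustrate the concern, here is a small sketch with made-up charge values (not Oregon data) showing how zero-filling inflates the empirical CDF at 0 compared with dropping the missing records. `share_at_or_below` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical ED charges with three missing outcomes.
charges = pd.Series([0.0, 120.0, np.nan, 850.0, np.nan, 0.0, 2300.0, np.nan])


def share_at_or_below(values: pd.Series, threshold: float) -> float:
    """Empirical CDF at `threshold`, computed over non-missing values."""
    values = values.dropna()
    return float((values <= threshold).mean())


cdf0_dropped = share_at_or_below(charges, 0.0)           # missing rows removed
cdf0_filled = share_at_or_below(charges.fillna(0), 0.0)  # missing rows set to 0

print(cdf0_dropped, cdf0_filled)
```

Because LDTE and LPTE are built from such distribution functions, the extra mass at zero would feed directly into the estimates.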
| edu_mapping = {'HS diploma or GED': 0, 'Post HS, not 4-year': 1, 'Less than HS': 2, '4 year degree or more': 3} | ||
|
|
||
| df['age'] = 2008 - df['birthyear_list'] | ||
| df['gender_inp'] = df['gender_inp'].map(gender_mapping).astype(float).fillna(-1).astype(int) |
ditto
The reason for casting to float first is the same as mentioned above.
However, unlike the assignment variables (df['Z'], df['D']) discussed above, I think it is fine to fill missing values in these features with an "unknown" category. What do you think?
How many records are missing gender only? I wonder whether these records are also missing Z or D.
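A quick way to answer this could be to count the overlap of the missingness masks. The sketch below uses a toy frame with hypothetical values and gender labels; only the column names follow the diff.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
df = pd.DataFrame({
    "Z": [1, 0, 1, 0, 1],
    "D": [1.0, 0.0, np.nan, 0.0, np.nan],
    "gender_inp": ["Female", None, None, "Male", "Female"],
})

gender_missing = df["gender_inp"].isna()
assignment_missing = df[["Z", "D"]].isna().any(axis=1)

# Records missing gender but with Z and D both observed.
n_gender_only = int((gender_missing & ~assignment_missing).sum())
# Records missing gender that would be dropped anyway because Z or D is missing.
n_both = int((gender_missing & assignment_missing).sum())

print(n_gender_only, n_both)
```

If `n_gender_only` turns out to be small on the real data, record-wise deletion costs little; if it is large, an "unknown" category may be worth keeping.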
| # Select control variables: pre-randomization ED utilization variables | ||
| ctrl_cols = [col for col in df_ed.columns if 'pre' in col and 'num' in col] | ||
| ctrl_cols.append('charg_tot_pre_ed') | ||
| selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp'] |
shall we include ['gender_inp', 'age', 'health_last12_inp', 'edu_inp'] in ctrl_cols?
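A sketch of what folding the demographic covariates into `ctrl_cols` might look like. The `columns` list stands in for `df_ed.columns`, and `num_hosp_pre_ed` is a hypothetical column name used only to exercise the filter.

```python
# Stand-in for df_ed.columns; names other than those in the diff are hypothetical.
columns = ["person_id", "num_visit_pre_ed", "num_hosp_pre_ed",
           "charg_tot_pre_ed", "num_visit_cens_ed"]

# Pre-randomization ED utilization variables, as in the quoted diff.
pre_cols = [col for col in columns if "pre" in col and "num" in col]
demographic_cols = ["gender_inp", "age", "health_last12_inp", "edu_inp"]

# Single list of controls, so later code can index with ctrl_cols alone.
ctrl_cols = pre_cols + ["charg_tot_pre_ed"] + demographic_cols

selected_cols = ["person_id", "strata", "Y_NUM_VISIT_CENS_ED",
                 "Y_ED_CHARG_TOT_ED", "Z", "D"] + ctrl_cols

print(ctrl_cols)
```

With one list, the feature matrix can be taken as `df[ctrl_cols]` wherever it is needed, without repeating the demographic names.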
| ctrl_cols.append('charg_tot_pre_ed') | ||
| selected_cols = ['person_id', 'strata', 'Y_NUM_VISIT_CENS_ED', 'Y_ED_CHARG_TOT_ED', 'Z', 'D'] + ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp'] | ||
| df = df[selected_cols] | ||
| df = df[df.isna().any(axis=1) == False] |
Suggested change:
- df = df[df.isna().any(axis=1) == False]
+ df = df.dropna()
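The filter variants proposed for this line select the same rows. A small check on a toy frame (hypothetical values) makes that explicit:

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries in different columns.
df = pd.DataFrame({
    "Z": [1, 0, np.nan, 1],
    "D": [1, np.nan, 0, 0],
    "age": [34, 51, 42, 29],
})

by_comparison = df[df.isna().any(axis=1) == False]  # form used in the PR
by_negation = df[~df.isna().any(axis=1)]            # boolean negation
by_notna = df[df.notna().all(axis=1)]               # positive phrasing
by_dropna = df.dropna()                             # shortest form

print(by_dropna)
```

All four keep only the rows without any missing value, so `df.dropna()` is the most compact choice here.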
| df = df[df.isna().any(axis=1) == False] | ||
|
|
||
| # Create feature matrix (excluding treatment variables) | ||
| features = pd.DataFrame(df[ctrl_cols + ['gender_inp', 'age', 'health_last12_inp', 'edu_inp']]) |
nit: we don't need to define a variable features
| :width: 800px | ||
| :align: center | ||
|
|
||
| **LDTE Interpretation**: The positive LDTE values indicate that Medicaid assignment increases the cumulative probability of individuals having emergency department costs at or below each threshold among compliers (those who enroll when selected). This suggests that while Medicaid increases overall ED utilization, it may also help contain costs for some individuals who actually enroll. |
Are the LDTE values really positive?
|
|
||
| **LDTE Interpretation**: The positive LDTE values indicate that Medicaid assignment increases the cumulative probability of individuals having emergency department costs at or below each threshold among compliers (those who enroll when selected). This suggests that while Medicaid increases overall ED utilization, it may also help contain costs for some individuals who actually enroll. | ||
|
|
||
| **Statistical Significance**: Both simple and ML-adjusted local estimators show similar patterns, providing robust evidence that Medicaid assignment has significant distributional effects on emergency department costs for compliers. The confidence intervals indicate that these effects are statistically significant across most cost levels. |
> The confidence intervals indicate that these effects are statistically significant across most cost levels.
Is this correct?
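Both this question and the earlier one about the sign of the LDTE can be checked numerically from the three arrays returned by `predict_ldte`. The sketch below uses made-up numbers in place of real estimates and treats the bounds as pointwise intervals, which is an assumption.

```python
import numpy as np

# Hypothetical estimates; in the tutorial these would come from predict_ldte.
ldte = np.array([0.02, -0.01, -0.04, -0.06, -0.03])
lower = np.array([-0.03, -0.05, -0.07, -0.11, -0.08])
upper = np.array([0.07, 0.03, -0.01, -0.01, 0.02])

# A location is "significant" here if its interval excludes zero.
significant = (lower > 0) | (upper < 0)
share_significant = float(significant.mean())

# Share of locations with a positive point estimate.
share_positive = float((ldte > 0).mean())

print(share_significant, share_positive)
```

Reporting these two shares next to the figure would let the interpretation text state exactly where the effect is positive or distinguishable from zero.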
|
|
||
| The Local Probability Treatment Effects analysis produces the following visualization: | ||
|
|
||
| .. image:: ../_static/oregon_lpte_costs_comparison.png |
Maybe we can increase the size of each bin. Currently there are too many bins in the graph.
|
|
||
| The side-by-side bar charts show probability treatment effects across different emergency department cost intervals, revealing how Medicaid enrollment affects healthcare utilization patterns: | ||
|
|
||
| **Cost Distribution Effects**: The LPTE analysis shows how Medicaid assignment changes the probability of compliers incurring emergency department costs in specific ranges. Positive bars indicate cost intervals where Medicaid assignment increases the likelihood of incurring costs in that range, while negative bars show intervals where it decreases the probability. |
This is a correct explanation about how to understand LPTE, but what's the insight specifically for this data?
|
|
||
| **Cost Distribution Effects**: The LPTE analysis shows how Medicaid assignment changes the probability of compliers incurring emergency department costs in specific ranges. Positive bars indicate cost intervals where Medicaid assignment increases the likelihood of incurring costs in that range, while negative bars show intervals where it decreases the probability. | ||
|
|
||
| **Healthcare Utilization Patterns**: Both simple and ML-adjusted local estimators reveal consistent patterns in how Medicaid assignment affects emergency department utilization across different cost categories for compliers. The analysis shows that Medicaid assignment has heterogeneous effects, increasing utilization in some cost ranges while potentially reducing it in others. |
ditto, can we make it a bit more detailed?
|
|
||
| **Policy Implications**: Understanding these distributional effects is crucial for healthcare policy. The local analysis reveals that Medicaid's impact varies across the cost distribution for those who actually enroll when assigned, which has important implications for healthcare budgeting and understanding the true effects of public health insurance programs on compliers. | ||
|
|
||
| **Conclusion**: Using the real Oregon Health Insurance Experiment dataset with 24,000 participants, the local distributional analysis reveals nuanced patterns in how Medicaid assignment affects healthcare utilization among compliers. The analysis accounts for non-compliance and goes beyond simple average comparisons to show how treatment effects vary across the entire emergency department cost distribution, providing insights into how public health insurance impacts different segments of the population who actually enroll. This demonstrates the power of local distribution treatment effect analysis for understanding heterogeneous responses in healthcare policy interventions with non-compliance. |
maybe we don't need a conclusion in the middle of the tutorial
| ml_local_estimator.fit(X, Z, D, Y_NUM_VISIT_CENS_ED, strata) | ||
|
|
||
| # Define evaluation points for emergency department visits | ||
| outcome_ed_visits_locations = np.linspace(Y_NUM_VISIT_CENS_ED.min(), Y_NUM_VISIT_CENS_ED.max(), 20) |
Can we use np.arange?
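For a count outcome such as the number of ED visits, the two grids differ as follows. This is a sketch with a hypothetical `visits` array standing in for `Y_NUM_VISIT_CENS_ED`.

```python
import numpy as np

# Hypothetical visit counts.
visits = np.array([0, 0, 1, 3, 0, 2, 7, 1])

# 20 evenly spaced points: most of them fall between integer counts.
grid_linspace = np.linspace(visits.min(), visits.max(), 20)

# One evaluation point per possible count, from min to max inclusive.
grid_arange = np.arange(visits.min(), visits.max() + 1)

print(grid_linspace[:4])
print(grid_arange)
```

Since the outcome only takes integer values, the integer grid avoids evaluating the distribution at points where it cannot change.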
| # Compute LDTE: Treatment vs Control | ||
| ldte_ctrl, lower_ctrl, upper_ctrl = simple_local_estimator.predict_ldte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=outcome_ed_visits_locations | ||
| ) | ||
|
|
||
| # LDTE: Treatment vs Control | ||
| ldte_simple, lower_simple, upper_simple = simple_local_estimator.predict_ldte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=outcome_ed_visits_locations | ||
| ) |
What's the difference between these two?
|
|
||
| # Visualize Treatment vs Control using dte_adj's plot function | ||
| plot(outcome_ed_visits_locations, ldte_simple, lower_simple, upper_simple, | ||
| title="Treatment vs Control (Simple Local Estimator)", |
Let's add the outcome in either the title or the y label
| ax=ax1) | ||
|
|
||
| plot(outcome_ed_visits_locations, ldte_ml, lower_ml, upper_ml, | ||
| title="Treatment vs Control (ML-Adjusted Local Estimator)", |
ditto
| # Compute LPTE: Treatment vs Control | ||
| lpte_ctrl, lpte_lower_ctrl, lpte_upper_ctrl = simple_local_estimator.predict_lpte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=[-1] + outcome_ed_visits_locations | ||
| ) | ||
|
|
||
| # Compute Local Probability Treatment Effects | ||
| lpte_simple, lpte_lower_simple, lpte_upper_simple = simple_local_estimator.predict_lpte( | ||
| target_treatment_arm=1, # Z=1 Selected for treatment (Enrolled) | ||
| control_treatment_arm=0, # Z=0 Not selected for treatment (Not enrolled) | ||
| locations=[-1] + outcome_ed_visits_locations | ||
| ) |
Same comment, what's the difference between the two?
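A side note on the `locations=[-1] + outcome_ed_visits_locations` argument in the quoted calls: assuming `outcome_ed_visits_locations` is still the NumPy array created earlier with `np.linspace`, the `+` is elementwise addition, not list concatenation. The sketch below shows the difference on a small hypothetical grid; `np.concatenate` is one possible way to prepend the extra location.

```python
import numpy as np

# Hypothetical grid standing in for outcome_ed_visits_locations.
locations = np.arange(0, 4)

# list + ndarray broadcasts: every location is shifted by -1.
shifted = [-1] + locations

# Prepending -1 as an extra location requires explicit concatenation.
prepended = np.concatenate(([-1], locations))

print(shifted)
print(prepended)
```

Whether a leading -1 is what `predict_lpte` expects is not confirmed by this thread; the sketch only shows what the expression evaluates to.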
| plt.tight_layout() | ||
| plt.show() | ||
|
|
||
| .. image:: ../_static/oregon_lpte_visits.png |
Maybe a wrong figure? The code and figure don't match
|
|
||
| .. code-block:: python | ||
|
|
||
| # Compute LDTE: Treatment vs Control |
I think we can remove this code block entirely as it's duplicated with the content below
| Probability Treatment Effects: Visits Analysis | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| .. code-block:: python |
Same for this, we can remove this code block entirely as it's duplicated with the content below
|
|
||
| The emergency department visits analysis reveals complementary patterns to the cost analysis: | ||
|
|
||
| **Visit Frequency Effects**: Medicaid assignment shows distinct effects on the probability of different visit frequencies for compliers. The LPTE analysis reveals which visit count categories are most affected by Medicaid assignment among those who actually enroll. |
Can we make it a bit more detailed?
|
|
||
| **Policy Targeting Implications**: Understanding which household types respond most strongly to Medicaid assignment can inform more targeted policy interventions and help identify populations that would benefit most from expanded coverage when they actually enroll. | ||
|
|
||
| **Methodological Consistency**: Both simple and ML-adjusted local estimators show similar patterns within each stratum, providing confidence in the robustness of the stratified findings across different analytical approaches while accounting for non-compliance. |
Shall we also mention the difference in confidence interval (CI) length between the two estimators?
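The CI lengths could be compared with a few lines like the following. The bounds are made-up numbers standing in for the `lower_*` and `upper_*` arrays from the simple and ML-adjusted estimators.

```python
import numpy as np

# Hypothetical interval bounds at three locations.
lower_simple = np.array([-0.10, -0.12, -0.08])
upper_simple = np.array([0.10, 0.08, 0.12])
lower_ml = np.array([-0.08, -0.10, -0.06])
upper_ml = np.array([0.08, 0.06, 0.10])

# Interval length at each location.
width_simple = upper_simple - lower_simple
width_ml = upper_ml - lower_ml

# Average ML-adjusted length relative to the simple estimator;
# a value below 1 means the adjusted intervals are shorter on average.
relative_width = float(width_ml.mean() / width_simple.mean())

print(relative_width)
```

Quoting this ratio per stratum would make the comparison between the two estimators concrete.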
|
|
||
| **Heterogeneity Analysis**: The stratified analysis by household registration type reveals important local treatment effect heterogeneity, showing that different populations respond differently to Medicaid assignment when they actually enroll. | ||
|
|
||
| **Methodological Robustness**: Comparing simple and ML-adjusted local estimators provides confidence in our findings and demonstrates the robustness of the local distributional treatment effect methodology for handling non-compliance scenarios. |
ditto
| plt.tight_layout() | ||
| plt.show() | ||
|
|
||
| .. image:: ../_static/oregon_ldte_strata.png |
Something seems wrong with the ML-adjusted LPTE. Can we investigate the cause?
| **Methodological Consistency**: Both simple and ML-adjusted local estimators show similar patterns within each stratum, providing confidence in the robustness of the stratified findings across different analytical approaches while accounting for non-compliance. | ||
|
|
||
| Conclusion | ||
| ~~~~~~~~~~ |
Let me repeat the same comment, can we ground the insights on the Oregon analysis results? The current summary feels too generic.
Add a tutorial using the Oregon Health Insurance Experiment.