A Python tool for identifying optimal, data-driven thresholds for summarizing wearable device data into Time-in-Range (TIR) proportions, yielding interpretable summaries of high-frequency measurements. Two types of threshold optimality are considered: one well-suited for data from a single population, and another tailored to mixed-population scenarios.
The detailed methods are described in the paper:
- Beyond fixed thresholds: optimizing summaries of wearable device data via piecewise linearization of quantile functions, by Junyoung Park, Neo Kok, and Irina Gaynanova (arXiv:2501.11777).
Please feel free to contact [email protected] if you have any questions regarding the implementation of this code.
`numpy`, `pandas`, and `scipy >= 1.12.0`.
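If it is unclear which `scipy` version is installed, a quick check (a generic snippet, not part of this package) is:

```python
# Verify the scipy version requirement before running the methods.
import scipy

major, minor = map(int, scipy.__version__.split(".")[:2])
assert (major, minor) >= (1, 12), "scipy >= 1.12.0 is required"
```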
The code applies to any univariate distributional data formed from empirical measurements, including wearable device data such as accelerometer or continuous glucose monitor (CGM) readings. The overall implementation steps are:
1. Import `method.py`.
2. Process the data into a list of sublists, where each sublist collects one individual's empirical measurements and thus forms that individual's empirical distribution (see the sketch after this list).
3. Create an instance of the Python class `Distribution` from the processed data.
4. Apply one of the proposed algorithms: DE, SA, or SS.
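As a minimal sketch of the structure expected in step 2 (the values below are made up for illustration):

```python
# Each sublist holds one individual's empirical measurements,
# e.g., glucose readings in mg/dL (illustrative values only).
processed_data = [
    [152.0, 148.5, 160.2, 171.9],  # individual 1
    [95.0, 101.3, 98.7],           # individual 2
]
```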
Steps 2 and 3 are crucial for proper implementation. Below is an example using CGM data to illustrate these steps.
Suppose CGM data is stored in `data.csv` with columns `id` (subject ID), `gl` (measured glucose values), and `time` (time of measurement). Then, the following code creates a `Distribution` class instance:
```python
# Step 1. Import method.py, which includes the Distribution class
from method import *
import pandas as pd

data = pd.read_csv("data.csv")

# Step 2. Group empirical measurements by individual
grouped_data = data.groupby('id').agg({'gl': list}).reset_index()

# Step 3. Create a Distribution class instance
data_class = Distribution(grouped_data['gl'], ran=(40, 400))
```
Here, `ran=(40, 400)` specifies the measurement range of a typical CGM device; in general, the range should be chosen in a data-dependent manner.
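For devices whose measurement range is unknown, one data-dependent choice (an illustrative heuristic, not a package recommendation) is to use the observed extremes:

```python
# Derive the range from the observed data when the device
# specification is unavailable (illustrative heuristic).
lo, hi = data['gl'].min(), data['gl'].max()
data_class = Distribution(grouped_data['gl'], ran=(lo, hi))
```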
Publicly available CGM data can be accessed through the Awesome-CGM or GlucoBench repositories and processed into `.csv` files as above.
Once the data is converted to a `Distribution` class instance, the proposed algorithms DE, SA, and SS can be applied via the `run_de`, `agglomerative_discrete`, and `divisive_discrete` functions, respectively. DE (`run_de`) is recommended due to its effective and efficient performance demonstrated in our paper.
Full example code:
```python
import pandas as pd
from method import *  # includes the Distribution class and run_de

data = pd.read_csv("data.csv")
grouped_data = data.groupby('id').agg({'gl': list}).reset_index()
data_class = Distribution(grouped_data["gl"], ran=(40, 400))

# Specify the target number of thresholds for the summary
K = 4

# Choose the optimality criterion:
# "Loss1" preserves individual distributions (well-suited for single-population data)
# "Loss2" preserves pairwise distances (well-suited for mixed-population data)
loss = "Loss1"

# Apply DE to data_class
best_cutoffs, min_loss = run_de(data_class, K=K, loss=loss)
```
`best_cutoffs` contains the optimal thresholds for the input data, and `min_loss` is the achieved loss value.
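Once the thresholds are obtained, each individual's TIR proportions can be computed with standard tools. Below is a minimal sketch using `numpy`; the binning convention is illustrative and assumes the device range (40, 400) from above:

```python
import numpy as np

# Bin edges: device range endpoints plus the optimized thresholds.
edges = np.concatenate(([40], np.sort(best_cutoffs), [400]))

# Proportion of each individual's measurements in each range.
tir_summaries = [
    np.histogram(values, bins=edges)[0] / len(values)
    for values in grouped_data['gl']
]
```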
If fixing some thresholds based on domain knowledge is of interest (e.g., the time-in-range 70--180 mg/dL for CGM data), one can specify the `fixed` argument. In this case, `K` should be set to the number of additional thresholds whose optimal positions are to be searched.
```python
# Find two additional thresholds besides the fixed thresholds
K = 2
best_cutoffs, min_loss = run_de(data_class, K=K, loss=loss, fixed=(70, 181))
```
See `Simulations.ipynb` and `Real-data.ipynb` for details. The figures for the real-data experiments are generated by `figure_creation.R`.
- An R version of the proposed methods is in progress.