A Python tool for identifying optimal, data-driven thresholds for summarizing wearable device data into Time-in-Range (TIR) proportions, yielding interpretable summaries of high-frequency measurements. Two types of threshold optimality are considered: one well-suited for data from a single population, and another tailored to mixed-population scenarios.
The detailed methods are described in the paper:
- Beyond fixed thresholds: optimizing summaries of wearable device data via piecewise linearization of quantile functions, by Junyoung Park, Neo Kok, and Irina Gaynanova (arXiv:2501.11777).
Please feel free to contact [email protected] if you have any questions regarding the implementation of this code.
`numpy`, `pandas`, and `scipy >= 1.12.0`.
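If it is unclear which `scipy` version is installed, a quick check (a generic snippet, not part of this package) is:

```python
# Verify the scipy version requirement before running the methods.
import scipy

major, minor = map(int, scipy.__version__.split(".")[:2])
assert (major, minor) >= (1, 12), "scipy >= 1.12.0 is required"
```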
The code applies to any univariate distributional data formed from empirical measurements, including wearable device data such as accelerometer or continuous glucose monitor (CGM) readings. The overall implementation steps are:
1. Import `method.py`.
2. Process the data into a list of sublists, where each sublist collects one individual's empirical measurements and thus forms that individual's empirical distribution (see the sketch after this list).
3. Create an instance of the Python class `Distribution` from the processed data.
4. Apply one of the proposed algorithms: DE, SA, or SS.
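As a minimal sketch of the structure expected in step 2 (the values below are made up for illustration):

```python
# Each sublist holds one individual's empirical measurements,
# e.g., glucose readings in mg/dL (illustrative values only).
processed_data = [
    [152.0, 148.5, 160.2, 171.9],  # individual 1
    [95.0, 101.3, 98.7],           # individual 2
]
```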
Steps 2 and 3 are crucial for proper implementation. Below is an example using CGM data to illustrate these steps.
Suppose CGM data is stored in `data.csv` with columns `id` (subject ID), `gl` (measured glucose values), and `time` (time of measurement). Then, the following code creates a `Distribution` class instance:
```python
# Step 1. Import method.py, which includes the Distribution class
from method import *
import pandas as pd

data = pd.read_csv("data.csv")

# Step 2. Group empirical measurements by individual
grouped_data = data.groupby('id').agg({'gl': list}).reset_index()

# Step 3. Create a Distribution class instance
data_class = Distribution(grouped_data['gl'], ran=(40, 400))
```
Here, `ran=(40, 400)` specifies the measurement range of a typical CGM device; in general, the range should be chosen in a data-dependent manner.
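For devices whose measurement range is unknown, one data-dependent choice (an illustrative heuristic, not a package recommendation) is to use the observed extremes:

```python
# Derive the range from the observed data when the device
# specification is unavailable (illustrative heuristic).
lo, hi = data['gl'].min(), data['gl'].max()
data_class = Distribution(grouped_data['gl'], ran=(lo, hi))
```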
Publicly available CGM data can be accessed through the Awesome-CGM or GlucoBench repositories and processed into `.csv` files as above.
Once the data is converted to a `Distribution` class instance, the proposed algorithms DE, SA, and SS can be applied via the `run_de`, `agglomerative_discrete`, and `divisive_discrete` functions, respectively. DE (`run_de`) is recommended due to its effective and efficient performance demonstrated in our paper.
Full example code:
```python
import pandas as pd
from method import *  # includes the Distribution class and run_de

data = pd.read_csv("data.csv")
grouped_data = data.groupby('id').agg({'gl': list}).reset_index()
data_class = Distribution(grouped_data["gl"], ran=(40, 400))

# Specify the target number of thresholds for the summary
K = 4

# Choose the optimality criterion:
# "Loss1" preserves individual distributions (well-suited for single-population data)
# "Loss2" preserves pairwise distances (well-suited for mixed-population data)
loss = "Loss1"

# Apply DE to data_class
best_cutoffs, min_loss = run_de(data_class, K=K, loss=loss)
```
`best_cutoffs` contains the optimal thresholds for the input data, and `min_loss` is the achieved loss value.
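Once the thresholds are obtained, each individual's TIR proportions can be computed with standard tools. Below is a minimal sketch using `numpy`; the binning convention is illustrative and assumes the device range (40, 400) from above:

```python
import numpy as np

# Bin edges: device range endpoints plus the optimized thresholds.
edges = np.concatenate(([40], np.sort(best_cutoffs), [400]))

# Proportion of each individual's measurements in each range.
tir_summaries = [
    np.histogram(values, bins=edges)[0] / len(values)
    for values in grouped_data['gl']
]
```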
If fixing some thresholds based on domain knowledge is of interest (e.g., the time-in-range 70--180 mg/dL for CGM data), one can specify the `fixed` argument. In this case, `K` should be set to the number of additional thresholds whose optimal positions are to be searched.
```python
# Find two additional thresholds besides the fixed thresholds
K = 2
best_cutoffs, min_loss = run_de(data_class, K=K, loss=loss, fixed=(70, 181))
```
See `Simulations.ipynb` and `Real-data.ipynb` for details. The figures for the real-data experiments are generated by `figure_creation.R`.
- An R version of the proposed methods is in progress.