Adaptive Time Series Reasoning via Segment Selection
Shvat Messica, Jiawen Zhang, Kevin Li, Theodoros Tsiligkaridis, Marinka Zitnik
Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026
Time-series reasoning tasks start with a natural-language question and require targeted analysis of a time series. Evidence may span the full series or appear in only a few short intervals, so the model must decide what to inspect. Most existing approaches encode the entire time series into a fixed representation before inference, regardless of relevance.
ARTIST formulates time-series reasoning as a sequential decision problem. It interleaves reasoning with adaptive temporal segment selection using a controller–reasoner architecture and trains both roles with reinforcement learning:
- A high-level controller selects the next informative segment and decides when to stop, conditioned on the question and intermediate outputs.
- A low-level reasoner produces segment-conditioned reasoning traces and the final answer.
Rather than relying on a static summary of the full sequence, ARTIST actively acquires task-relevant information at inference time. A novel hierarchical, collaborative self-play post-training method lets a single policy excel at both segment selection and question answering.
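The controller and reasoner interaction described above can be sketched as a simple loop. This is a minimal illustration, not the repository's implementation: `Episode`, `run_episode`, the three `toy_*` callables, the `max_rounds` cap, and the half-open `[start, end)` index convention are all hypothetical. In ARTIST both roles are played by a single LLM policy, whereas here they are plain Python functions.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # [start, end) indices (assumed convention)


@dataclass
class Episode:
    """State shared by both roles: question, series, and what was seen so far."""
    question: str
    series: Sequence[float]
    segments: List[Segment] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def run_episode(question, series, select, reason, answer, max_rounds=4):
    """Alternate segment selection and reasoning until the controller stops."""
    ep = Episode(question, series)
    for _ in range(max_rounds):
        choice = select(ep)            # controller: next segment, or None to stop
        if choice is None:
            break
        start, end = max(0, choice[0]), min(len(series), choice[1])
        if start >= end:               # empty or invalid segment: stop early
            break
        ep.segments.append((start, end))
        ep.notes.append(reason(ep, series[start:end]))  # reasoner: segment note
    return answer(ep), ep              # reasoner: final answer


# Toy stand-ins for the two roles (hypothetical, for illustration only).
def toy_select(ep):
    if ep.segments:                    # inspect one window, then stop
        return None
    peak = max(range(len(ep.series)), key=lambda i: ep.series[i])
    return (peak - 2, peak + 3)


def toy_reason(ep, window):
    return "spike" if max(window) > 5 else "flat"


def toy_answer(ep):
    return "A" if "spike" in ep.notes else "B"
```

Calling `run_episode("Is there a spike above 5? A: yes, B: no", [0, 1, 0, 1, 9, 1, 0, 1], toy_select, toy_reason, toy_answer)` inspects only the window around the peak (5 of 8 points) before answering, which is the behavior the adaptive formulation is meant to encourage.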
- Adaptive segment selection for time-series reasoning. The model iteratively chooses temporal segments to inspect and updates its reasoning based on retrieved segments, with no pre-defined segment labels.
- Hierarchical, collaborative self-play RL. A post-training method that separates segment selection from answer generation and trains each role with role-aligned learning signals — a reliability reward for the controller and a correctness/format reward for the reasoner.
- Strong empirical results. On six benchmarks, ARTIST outperforms seven strong baselines (text LLMs, time-series encoder models, and vision-language models) while consuming a smaller fraction of the input series.
ARTIST is a single-policy LLM that operates in two roles for time-series reasoning: a controller that selects the next segment and decides when to stop, and a reasoner that produces segment-conditioned reasoning and the final answer. Inference unfolds as an interleaved trace that alternates natural-language reasoning with segment-selection tool calls:
<think> reasoning ... </think>
<timeseries_selection_tool> [x1, y1] </timeseries_selection_tool>
<think> reasoning ... </think>
<timeseries_selection_tool> [x2, y2] </timeseries_selection_tool>
<think> reasoning ... </think>
<answer> A / B / C / D / E </answer>
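A trace in the format above can be post-processed with a few regular expressions. The sketch below is an illustration only: `parse_trace` is a hypothetical helper, not a function from the repository, and it assumes that segment bounds are written as integers and that the answer is a single letter from A to E.

```python
import re
from typing import List, Optional, Tuple

# Matches one tool call such as "<timeseries_selection_tool> [10, 40] </timeseries_selection_tool>".
SEGMENT_RE = re.compile(
    r"<timeseries_selection_tool>\s*\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]\s*"
    r"</timeseries_selection_tool>"
)
# Matches the final answer tag with a single option letter.
ANSWER_RE = re.compile(r"<answer>\s*([A-E])\s*</answer>")


def parse_trace(trace: str) -> Tuple[List[Tuple[int, int]], Optional[str]]:
    """Return the selected segments in order and the answer letter (or None)."""
    segments = [(int(a), int(b)) for a, b in SEGMENT_RE.findall(trace)]
    match = ANSWER_RE.search(trace)
    return segments, (match.group(1) if match else None)


example = (
    "<think> look at the start </think>\n"
    "<timeseries_selection_tool> [0, 50] </timeseries_selection_tool>\n"
    "<think> now the tail </think>\n"
    "<timeseries_selection_tool> [200, 256] </timeseries_selection_tool>\n"
    "<think> the tail shows the pattern </think>\n"
    "<answer> C </answer>"
)
segments, answer = parse_trace(example)
```

A helper like this is useful for checking how much of the series a trace actually inspected, or for rejecting traces that never emit an answer.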
Training proceeds in two stages:
- Supervised fine-tuning (SFT) — LoRA-based fine-tuning on curated reasoning traces that interleave natural language with segment-selection tool calls.
- Reinforcement learning (RL) — full-parameter fine-tuning via collaborative self-play with hierarchical policy optimization: trajectory-level credit for the controller and final-round, segment-conditioned optimization for the reasoner, with variance-guided sampling of reasoner rollouts.
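To make the reasoner's correctness/format reward concrete, here is one way such a signal could be computed. This is a sketch under stated assumptions, not the reward used in the paper: the function name `reasoner_reward`, the `format_weight` value, and the rule that a well-formed completion must end with an `<answer>` tag are all hypothetical.

```python
import re

# Assumption: a well-formed completion ends with "<answer> X </answer>", X in A..E.
FINAL_ANSWER_RE = re.compile(r"<answer>\s*([A-E])\s*</answer>\s*$")


def reasoner_reward(completion: str, gold: str, format_weight: float = 0.2) -> float:
    """Combine a format term and a correctness term into one scalar reward.

    Returns 0.0 when the format is violated, `format_weight` when the format
    is valid but the answer is wrong, and 1.0 when both are satisfied.
    """
    match = FINAL_ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0
    correct = 1.0 if match.group(1) == gold else 0.0
    return format_weight + (1.0 - format_weight) * correct
```

Giving partial credit for a valid format is a common design choice in RL post-training because it keeps the learning signal non-zero while the policy is still learning to answer correctly.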
git clone https://github.com/mims-harvard/ARTIST.git
cd ARTIST
# Create an environment (Python 3.10+)
conda create -n artist python=3.10 -y
conda activate artist
pip install -r requirements.txt
requirements.lock.txt contains the exact frozen environment used in our experiments (for reproducibility).
Training and inference were run on NVIDIA H100 GPUs (SFT on 1×H100, RL on 4×H100).
python cli_qwen_cot.py fit \
--config configs/sft_<dataset>.yaml
Per-dataset RL configs are provided in configs/ (rl_ecg.yaml, rl_rcw.yaml, rl_tsqa.yaml, rl_sleep.yaml, rl_trqa.yaml, rl_eti.yaml). Edit the model_path (SFT checkpoint) and dataset.init_args.cache_path fields, then:
python train_controller_reasoner_opt.py -c configs/rl_tsqa.yaml
The same config is reused at inference time via --dataset_config:
python inference_controller_reasoner_atk.py \
--model_path /path/to/artist-rl-tsqa \
--task TSQA \
--dataset_config configs/rl_tsqa.yaml \
--include_full_ts_initially \
--passk 1,2,4,8 \
--output_file predictions.json
Default sampling temperatures: reasoner 0.7, controller 1.0. Accuracy and F1 are reported as averages over 8 independent runs per dataset.
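For readers unfamiliar with the `--passk 1,2,4,8` values, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled answers of which c are correct. Whether the inference script computes pass@k in exactly this way is an assumption, and `pass_at_k` and `mean_pass_at_k` are hypothetical names, not functions from the repository.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:  # fewer than k incorrect samples: every draw has a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over questions, given (n, c) pairs per question."""
    scores = [pass_at_k(n, c, k) for n, c in counts]
    return sum(scores) / len(scores)
```

With 8 samples per question and 2 correct, `pass_at_k(8, 2, 1)` is 0.25 and `pass_at_k(8, 2, 8)` is 1.0, so larger k values measure whether a correct answer appears anywhere among the samples rather than in a single draw.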
@inproceedings{messica2026artist,
title = {Adaptive Time Series Reasoning via Segment Selection},
author = {Messica, Shvat and Zhang, Jiawen and Li, Kevin and
Tsiligkaridis, Theodoros and Zitnik, Marinka},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026},
eprint = {2602.18645},
archivePrefix = {arXiv}
}
For questions, please open an issue or contact Shvat Messica and Marinka Zitnik.

