This is the official implementation of the paper "VISTA: Knowledge-Driven Vessel Trajectory Imputation with Repair Provenance".
Repairing incomplete trajectory data is essential for downstream spatio-temporal applications. Yet, existing repair methods focus solely on reconstruction without documenting the reasoning behind repair decisions, undermining trust in safety-critical applications where repaired trajectories affect operational decisions, such as in maritime anomaly detection and route planning. We introduce repair provenance—structured, queryable metadata that documents the full reasoning chain behind each repair—which transforms imputation from pure data recovery into a task that supports downstream decision-making. We propose VISTA (knowledge-driven interpretable vessel trajectory imputation), a framework that reliably equips repaired trajectories with repair provenance by grounding LLM reasoning in data-verified knowledge. Specifically, we formalize Structured Data-derived Knowledge (SDK), a knowledge model whose data-verifiable components can be validated against real data and used to anchor and constrain LLM-generated explanations. We organize SDK in a Structured Data-derived Knowledge Graph (SD-KG) and establish a data-knowledge-data loop for extraction, validation, and incremental maintenance over large-scale AIS data. A workflow management layer with parallel scheduling, fault tolerance, and redundancy control ensures consistent and efficient end-to-end processing. Experiments on two large-scale AIS datasets show that VISTA achieves state-of-the-art accuracy, improving over baselines by 5–91% and reducing inference time by 51–93%, while producing repair provenance, whose interpretability is showcased via a case study and an interactive demo system (https://github.com/hyLiu1994/CLEAR).
VISTA/
├── config/
│ └── config.yaml # Configuration file
├── data/ # Data directory
│ ├── RawData/ # Original AIS data (unprocessed)
│ ├── CleanedFilteredData/ # Data after cleaning and filtering
│ └── ProcessedData/ # Data after preprocessing and feature extraction
├── results/ # Experimental results, logs, and evaluation outputs
├── src/ # Source code
│ ├── data/ # Data loading, preprocessing, and handling
│ ├── modules/ # Core algorithmic components (e.g., StaticSpatialEncoder, BehaviorAbstraction)
│ ├── pipeline/ # End-to-end Pipelines
│ ├── utils/ # Utility functions
│ └── main.py # Main entry point of the project
│
├── .gitignore
└── Readme.md
VISTA forms a data–knowledge–data loop that transforms raw AIS data into structured maritime knowledge and reuses it to reconstruct missing trajectories with interpretable reasoning. As shown in the figure, the framework is built around four tightly connected components (AIS Data, SD-KG, SD-KG Construction, and Trajectory Imputation) plus a coordinating Workflow Manager Layer.
At the center of the framework are AIS Data and the Structured Data-derived Knowledge Graph (SD-KG) — the two endpoints of the data–knowledge–data loop.
- AIS Data (src/data/AISDataProcessor.py) provides raw vessel messages and stores reconstructed trajectories under results/[exp_name]/ImputationResults/. It serves as both the input for knowledge construction and the output repository for imputation results.
- SD-KG (src/modules/M0_SDKG.py) acts as the central maritime knowledge repository, storing vessel attributes, behavior patterns, and validated imputation methods under results/[exp_name]/SDKG/. It connects both sides of the loop: it is continually updated during construction and reused during trajectory imputation.
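As a minimal sketch of the output layout described above, the two per-experiment directories can be derived from an experiment name. The helper `experiment_dirs` and its parameters are hypothetical and not part of the repository; only the directory names come from the text above.

```python
from pathlib import Path


def experiment_dirs(exp_name: str, results_root: str = "results") -> dict[str, Path]:
    """Return the two per-experiment output directories named in this README.

    Hypothetical helper: the real code may build these paths differently.
    """
    base = Path(results_root) / exp_name
    return {
        "imputation": base / "ImputationResults",  # reconstructed trajectories
        "sdkg": base / "SDKG",                     # knowledge graph contents
    }


dirs = experiment_dirs("demo_run")
for path in dirs.values():
    print(path.as_posix())
```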
On the left side of the framework, the SD-KG Construction Workflow Manager (SDKG_Construction_Multithreading() in src/pipeline/pipeline.py) orchestrates parallel knowledge extraction from AIS data.
It integrates three key modules corresponding to the blue blocks in the figure:
- Static & Spatial Encoder (generate_vs() in M1_StaticSpatialEncoder.py): extracts vessel attributes and spatial motion cues.
- Behavior Abstraction (generate_vb() in M2_BehaviorAbstraction.py): identifies canonical vessel behavior patterns from time-series trajectories.
- Method Builder (generate_vf() in M3_MethodBuilder.py): generates and validates imputation functions, then inserts them into SD-KG as executable knowledge units.
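The idea of storing only data-validated imputation functions can be illustrated with a toy example. This is a sketch under stated assumptions, not the repository's implementation: the track values are made up, `label_behavior`, `linear_fill`, and `validate` are hypothetical stand-ins, and simple rules replace the actual extraction and LLM-based generation steps.

```python
from statistics import mean

# Toy AIS track: (timestamp_s, lat, lon, speed_knots). Values are made up.
track = [
    (0, 55.000, 10.000, 10.0),
    (60, 55.002, 10.003, 10.2),
    (120, 55.004, 10.006, 9.9),
    (180, 55.006, 10.009, 10.1),
]


def label_behavior(points):
    """Stand-in for behavior abstraction: label a segment by its mean speed."""
    avg = mean(p[3] for p in points)
    return "cruising" if avg >= 3.0 else "slow_or_moored"


def linear_fill(before, after, t):
    """Candidate imputation function: linear interpolation in time."""
    ratio = (t - before[0]) / (after[0] - before[0])
    lat = before[1] + ratio * (after[1] - before[1])
    lon = before[2] + ratio * (after[2] - before[2])
    return lat, lon


def validate(method, points, tol_deg=1e-3):
    """Hide each interior point, re-predict it, and check the worst error."""
    worst = 0.0
    for i in range(1, len(points) - 1):
        lat, lon = method(points[i - 1], points[i + 1], points[i][0])
        worst = max(worst, abs(lat - points[i][1]), abs(lon - points[i][2]))
    return worst <= tol_deg


# Hypothetical knowledge store: behavior label -> validated method.
knowledge = {}
behavior = label_behavior(track)
if validate(linear_fill, track):
    knowledge[behavior] = linear_fill  # only validated methods are stored
print(behavior, list(knowledge))
```

The point of the sketch is the ordering: a candidate method is checked against observed data first and enters the knowledge store only if it passes.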
On the right side of the framework, the Trajectory Imputation Workflow Manager (Trajectory_Imputation_Multithreading() in src/pipeline/pipeline.py) leverages SD-KG to reconstruct missing trajectory segments with interpretable reasoning.
It includes three LLM-driven modules corresponding to the green blocks in the figure:
- Behavior Estimator (behavior_estimator() in M4_BehaviorEstimator.py): infers missing motion patterns using SD-KG priors and vessel context.
- Method Selector (method_selector() in M5_MethodSelector.py): chooses the most suitable imputation function based on graph-supported evidence.
- Explanation Composer (explanation_composer() in M6_ExplanationComposer.py): generates concise, human-readable explanations linking reconstructed trajectories to maritime knowledge and operational logic.
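To make the notion of repair provenance concrete, the sketch below fills one missing point and records each decision in a structured record. It is illustrative only: the `RepairProvenance` fields and `impute_with_provenance` are hypothetical, and fixed rules and a template string stand in for the three LLM-driven modules.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class RepairProvenance:
    """Hypothetical provenance record; field names are illustrative only."""
    vessel_id: str
    gap_start_s: int
    gap_end_s: int
    estimated_behavior: str
    selected_method: str
    evidence: list
    explanation: str


def impute_with_provenance(vessel_id, before, after, t):
    """Fill one missing point and document each decision.

    before/after are (timestamp_s, lat, lon, speed_knots) tuples.
    """
    # Step 1 (stand-in for a behavior estimator): a simple speed rule.
    behavior = "cruising" if min(before[3], after[3]) >= 3.0 else "slow_or_moored"
    # Step 2 (stand-in for a method selector): a fixed lookup table.
    method = {
        "cruising": "linear_interpolation",
        "slow_or_moored": "hold_last_position",
    }[behavior]
    # Step 3: apply the selected method.
    if method == "linear_interpolation":
        ratio = (t - before[0]) / (after[0] - before[0])
        lat = before[1] + ratio * (after[1] - before[1])
        lon = before[2] + ratio * (after[2] - before[2])
    else:
        lat, lon = before[1], before[2]
    # Step 4 (stand-in for an explanation composer): a template string.
    record = RepairProvenance(
        vessel_id=vessel_id,
        gap_start_s=before[0],
        gap_end_s=after[0],
        estimated_behavior=behavior,
        selected_method=method,
        evidence=[f"speed before gap: {before[3]} kn", f"speed after gap: {after[3]} kn"],
        explanation=f"Vessel appears to be {behavior}; applied {method}.",
    )
    return (lat, lon), record


point, record = impute_with_provenance(
    "vessel-001", (0, 55.000, 10.000, 10.0), (120, 55.004, 10.006, 10.0), 60
)
print(point)
print(json.dumps(asdict(record), indent=2))
```

Because the record is plain structured data, it can be serialized next to the repaired trajectory and queried later, for example to list every repair that used a given method.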
The Workflow Manager Layer (src/pipeline/pipeline.py) bridges construction and imputation, coordinating parallel execution, anomaly handling, and redundancy control through SDKG_Construction_Multithreading() and Trajectory_Imputation_Multithreading().
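The coordination pattern described here (parallel execution, handling of failing jobs, and redundancy control) can be sketched with the standard library. This is a generic illustration and not the code in src/pipeline/pipeline.py: `run_with_retry`, `run_parallel`, and the flaky demo job are hypothetical, and the meaning given to `retry_times` (maximum attempts per job) is an assumption.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retry(task, item, retry_times=3):
    """Call task(item), retrying on any exception up to retry_times attempts."""
    last_error = None
    for _ in range(retry_times):
        try:
            return task(item)
        except Exception as exc:  # fault tolerance: remember the error and retry
            last_error = exc
    raise RuntimeError(f"{item!r} failed after {retry_times} attempts") from last_error


def run_parallel(task, items, max_workers=4, retry_times=3):
    """Run task over items in a thread pool; return (results, failures)."""
    unique_items = list(dict.fromkeys(items))  # redundancy control: drop duplicates
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_with_retry, task, item, retry_times): item
            for item in unique_items
        }
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except RuntimeError as exc:
                failures[item] = str(exc)  # one bad job does not stop the others
    return results, failures


# Demo job: "seg-2" fails once and then succeeds; "seg-bad" always fails.
attempts = {}
lock = threading.Lock()


def flaky_segment_job(segment_id):
    with lock:
        attempts[segment_id] = attempts.get(segment_id, 0) + 1
        n = attempts[segment_id]
    if segment_id == "seg-2" and n == 1:
        raise ValueError("transient failure")
    if segment_id == "seg-bad":
        raise ValueError("permanent failure")
    return f"{segment_id}: done"


results, failures = run_parallel(
    flaky_segment_job, ["seg-1", "seg-2", "seg-1", "seg-bad"]
)
print(sorted(results), sorted(failures))
```

Retrying inside the worker keeps each job's attempts on one thread, and collecting failures separately lets the remaining jobs finish even when one segment never succeeds.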
bash environment_install.sh
The datasets AIS-DK and AIS-US can be automatically downloaded from the following official sources based on the dataset hyperparameter:
Detailed instructions for downloading, cleaning, and filtering the datasets are provided in ./src/data/Readme.md.
After preparing the dataset, update the data path in the configuration file ./config/config.yaml. For example:
raw_data_file: ./data/CleanedFilteredData/AIS_2024_04_01@15_filtered360_1000000000.csv
VISTA lets you choose among LLM platforms such as OpenAI, Alibaba Cloud's DashScope, or others, and configure the corresponding API key for the selected model.
Open ./config/config.yaml and set base_url: to your service provider's base URL (such as 'https://api.openai.com/v1' for OpenAI, or 'https://dashscope.aliyuncs.com/compatible-mode/v1' for Alibaba Cloud's DashScope).
Example:
base_url: 'https://dashscope.aliyuncs.com/compatible-mode/v1'
Open ./config/config.yaml and paste your API key after llm_api_key:. You can obtain the key from your chosen platform (e.g., Alibaba Cloud's DashScope, OpenAI).
Example:
llm_api_key: sk-xxxxxxxx
Open ./config/config.yaml and specify the models you wish to use after mining_llm:, coding_llm:, and analysis_llm:. Available model names are listed on the platforms (e.g., Alibaba Cloud's DashScope, OpenAI).
Example:
mining_llm: gpt-4.1-nano
coding_llm: gpt-3.5-turbo
analysis_llm: qwen-plus
Before running the pipeline, configure key hyperparameters such as retry_times, e_f, and top_k in the ./config/config.yaml file.
Then, execute the following command to start the process:
python src/main.py --config config.yaml
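If a run fails right at startup, a quick sanity check of the configuration values can help. The sketch below is a hypothetical helper, not part of the repository. It assumes the keys shown in the examples above sit at the top level of the config, and that the file has already been parsed into a dict (for instance with PyYAML's `yaml.safe_load`).

```python
# Keys taken from the configuration examples in this README; that they are
# top-level keys of config.yaml is an assumption.
REQUIRED_KEYS = (
    "raw_data_file",
    "base_url",
    "llm_api_key",
    "mining_llm",
    "coding_llm",
    "analysis_llm",
)


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty: {key}")
    base_url = config.get("base_url")
    if isinstance(base_url, str) and base_url.strip():
        if not base_url.startswith(("http://", "https://")):
            problems.append("base_url should start with http:// or https://")
    api_key = config.get("llm_api_key")
    if isinstance(api_key, str) and "xxxx" in api_key:
        problems.append("llm_api_key still looks like the placeholder")
    return problems


example = {
    "raw_data_file": "./data/CleanedFilteredData/AIS_2024_04_01@15_filtered360_1000000000.csv",
    "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "llm_api_key": "sk-your-real-key",
    "mining_llm": "gpt-4.1-nano",
    "coding_llm": "gpt-3.5-turbo",
    "analysis_llm": "qwen-plus",
}
print(check_config(example))
```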