This work, titled “Deep Spectral Component Filtering as a Foundation Model for Spectral Analysis Demonstrated in Metabolic Profiling,” is published in Nature Machine Intelligence. This repository contains code for using a pretrained foundation model tailored for spectral analysis. To keep it accessible to spectroscopy researchers, the code is designed to run without complex training frameworks or extensive environment configuration. We also provide fine-tuning scripts, with instructions inside each script, to help users load their own data and adapt the model to their specific tasks.
pretrain: This folder contains the pretrained model weights and general-purpose tools for utilizing the foundation model.
customized_task: This folder includes scripts for applying the pretrained model and fine-tuning it to suit your specific tasks.
preprocessing: This folder provides scripts for preprocessing, along with source files and results for tasks such as paraffin removal from infrared spectra and nanoparticle (NP) removal from SERS spectra.
quantify: This folder contains scripts for quantification, accompanied by spectral data ready for quantitative analysis.
ComFilE: This folder includes scripts and results for the Component Filtering Explanation (ComFilE) method. ComFilE can be used to rank the importance of specific spectral components (e.g., metabolites in serum) and interpret their contributions to distinguishing results (e.g., disease vs. control samples).
ComFilE_Extended: This folder contains scripts and results for the k-order Component Filtering Explanation (where k > 1). The k-order ComFilE extends the methodology to analyze the cooperative effects of k spectral components in explaining result distinctions.
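The idea of ranking single components (k = 1) or groups of k components by their contribution can be illustrated with a generic ablation-style sketch. This is not the ComFilE implementation from this repository: the function name rank_component_sets, the toy score function, and the component names and weights are all hypothetical and only show how subsets of size k might be enumerated and ranked by the drop in a score when they are filtered out.

```python
from itertools import combinations


def rank_component_sets(components, score_fn, k=1):
    """Rank every k-sized subset of components by the score drop it causes.

    score_fn is a hypothetical callable: it receives a frozenset of removed
    components and returns a score for the filtered data.
    """
    baseline = score_fn(frozenset())
    drops = []
    for subset in combinations(components, k):
        drop = baseline - score_fn(frozenset(subset))
        drops.append((subset, drop))
    # Largest drop first: removing that subset hurts the score the most.
    return sorted(drops, key=lambda item: item[1], reverse=True)


# Toy stand-in for a classifier score (integer percentage points).
# The weights and the synergy term are invented for illustration only.
WEIGHTS = {"glucose": 30, "lactate": 10, "urea": 0}


def toy_score(removed):
    score = 90 - sum(WEIGHTS[name] for name in removed)
    if {"glucose", "lactate"} <= removed:
        score -= 5  # cooperative effect visible only when both are removed
    return score


first_order = rank_component_sets(list(WEIGHTS), toy_score, k=1)
second_order = rank_component_sets(list(WEIGHTS), toy_score, k=2)
print(first_order)
print(second_order)
```

In this toy setting, the pair ('glucose', 'lactate') causes a larger drop than the sum of the two single-component drops, which is the kind of cooperative effect a k-order analysis is meant to expose.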
The scripts to build DSCF for your own work are in the directory 'customized_task'.
To get started, follow the instructions in 'dataset.py' to load your spectra into the corresponding folders.
Folder 'Component-spec' is for the spectral dictionary of pure substances.
Folder 'Impurity-spec' is for unwanted spectral components to be filtered out of the spectra.
Folder 'Pure-spec' is for spectral components to be preserved.
{'dir':'Pure-spec/', 'tensor_dim':2, 'spec_tensor_dim':-1,}
An attribute dictionary like the one above should be initialized for each data folder. 'tensor_dim' is the total number of dimensions of one data file. 'spec_tensor_dim' is the index of the spectral dimension within the data file, ranging from 0 to tensor_dim-1; the example above uses -1, which presumably denotes the last dimension.
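As a minimal sketch of how such a dictionary might be checked before loading data, the helper below converts 'spec_tensor_dim' into a non-negative index. The function name resolve_spec_dim is hypothetical and not part of 'dataset.py', and the sketch assumes that negative values count from the last dimension, as in ordinary Python indexing.

```python
def resolve_spec_dim(attr):
    """Return the non-negative index of the spectral dimension.

    Assumes negative values of 'spec_tensor_dim' count from the end,
    as in ordinary Python indexing (an assumption, not confirmed by the repo).
    """
    tensor_dim = attr["tensor_dim"]
    spec_dim = attr["spec_tensor_dim"]
    if tensor_dim < 1:
        raise ValueError("tensor_dim must be at least 1")
    if not -tensor_dim <= spec_dim < tensor_dim:
        raise ValueError(
            f"spec_tensor_dim {spec_dim} is out of range for tensor_dim {tensor_dim}"
        )
    return spec_dim % tensor_dim


# Attribute dictionary copied from the example for the 'Pure-spec' folder.
pure_attr = {"dir": "Pure-spec/", "tensor_dim": 2, "spec_tensor_dim": -1}
print(resolve_spec_dim(pure_attr))  # prints 1
```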
The output mode can be customized by revising the return value of the __getitem__ function.
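To illustrate what changing the return value could look like, here is a toy, framework-free dataset class. The class name ToySpectraDataset, the output_mode argument, and the data are hypothetical; the real class in 'dataset.py' may be organized differently.

```python
class ToySpectraDataset:
    """Toy dataset pairing each mixed spectrum with its pure target."""

    def __init__(self, mixed, pure, output_mode="pair"):
        if len(mixed) != len(pure):
            raise ValueError("mixed and pure must have the same length")
        self.mixed = mixed
        self.pure = pure
        self.output_mode = output_mode  # hypothetical switch for the return value

    def __len__(self):
        return len(self.mixed)

    def __getitem__(self, index):
        # Editing this return statement changes what each sample provides.
        if self.output_mode == "pair":
            return self.mixed[index], self.pure[index]
        if self.output_mode == "dict":
            return {"input": self.mixed[index], "target": self.pure[index]}
        raise ValueError(f"unknown output_mode: {self.output_mode}")


# Invented two-sample data, three spectral points each.
mixed = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
pure = [[0.9, 1.8, 2.7], [0.4, 0.4, 0.4]]

pair_ds = ToySpectraDataset(mixed, pure, output_mode="pair")
dict_ds = ToySpectraDataset(mixed, pure, output_mode="dict")
print(pair_ds[0])
print(dict_ds[1])
```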
The DSCF model is a hierarchical local-attention encoder-decoder transformer. The detailed components of the model are described in DSCF_model_pe.py.
The following image shows the general outline of the pre-trained model.
The pre-trained weights of the tiny-version model can be downloaded at https://figshare.com/s/2b31ca642313086dcfe6. The weights of the larger models are available under DOI 10.6084/m9.figshare.28582499.
Paraffin removal is a routine step in IR analysis of FFPE samples, and the DSCF model can be tailored for it. The following images show the raw data, the paraffin spectrum, and the paraffin-removed data.
Some of the in silico explanation results are shown below, where the highlighted components are the ground truth.
The code for detailed downstream tasks is coming soon after the manuscript is formally published.