| Paper | Data | Webpage |
|---|---|---|
Repository holding code for our paper:
USING UMAP TO INSPECT AUDIO DATA FOR UNSUPERVISED ANOMALY DETECTION UNDER DOMAIN-SHIFT CONDITIONS Andres Fernandez and Mark D. Plumbley 2021
You can cite our work as follows:
@inproceedings{aferro2021umap,
author = {Fernandez, Andres and Plumbley, Mark D.},
title = {Using {UMAP} to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions},
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop ({DCASE2021})",
address = "Barcelona, Spain",
month = "November",
year = "2021",
}
Our work is released under liberal licenses (code: MIT, data: CC-BY). We're happy for others to build on it; refactoring the scatterplot scripts is particularly welcome.
Comprehensive UMAPs and plots generated for the paper can be downloaded at the Zenodo link above. Our results can be fully reproduced following the steps detailed below. The data pipeline can be summarized as follows:
1. Collect WAV audio datasets. In this case we have 3 (DCASE, AudioSet and Fraunhofer).
2. Compute the log-STFT, log-mel spectrograms and L3 embeddings and save as HDF5 datasets (performed by the 00... Python scripts).
3. Compute the UMAPs and save as pickled dictionaries (performed by the 01... Python script).
4. Render scatter plots for section, device and global scopes (performed by the 02... Python scripts).
We also included the 03... scripts used to render the plots in the paper.
Note that step 2 requires a fair amount of disk space and time. The L3 embeddings can also take a while to compute. Step 3 is very RAM-hungry and potentially slow.
If they do not exist yet, create the following directories inside this repository:
datasets
precomputed_features
umaps
umap_plots
logs
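The same step can be done from Python. This is a minimal sketch that only uses the directory names listed above; the `ensure_dirs` helper and its `root` argument are hypothetical and not part of the repository.

```python
from pathlib import Path

# Directory names taken from the list above
REQUIRED_DIRS = ("datasets", "precomputed_features", "umaps", "umap_plots", "logs")


def ensure_dirs(root="."):
    """Create any missing working directory under root and return all paths."""
    paths = [Path(root) / name for name in REQUIRED_DIRS]
    for path in paths:
        # exist_ok=True leaves directories that already exist untouched
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_dirs()` from the repository root creates whatever is missing, and calling it again is harmless.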
- AudioSet: Download our custom AudioSet subset and extract its 39437 WAV files into datasets/AudioSet_fragments
- Fraunhofer: Download from here and extract into datasets/IDMT-ISA-ELECTRIC-ENGINE. It should end up with the following structure:
IDMT-ISA-ELECTRIC-ENGINE/
├── test
│   ├── engine1_good
│   ├── engine2_broken
│   └── engine3_heavyload
├── test_cut
│   ├── engine1_good
│   ├── engine2_broken
│   └── engine3_heavyload
├── train
│   ├── engine1_good
│   ├── engine2_broken
│   └── engine3_heavyload
└── train_cut
    ├── engine1_good
    ├── engine2_broken
    └── engine3_heavyload
- DCASE: Download and merge the Development and Additional Training datasets, and extract into datasets/DCASE2021/t2. It should end up with the following structure:
DCASE2021/
└── t2
    ├── dev
    │   ├── fan
    │   │   ├── source_test
    │   │   ├── target_test
    │   │   └── train
    │   ├── gearbox
    │   │   ├── source_test
    │   │   ├── target_test
    │   │   └── train
    │   ├── pump
    │   │   ├── source_test
    │   │   ├── target_test
    │   │   └── train
    │   ├── slider
    │   │   ├── source_test
    │   │   ├── target_test
    │   │   └── train
    │   ├── ToyCar
    │   │   ├── source_test
    │   │   ├── target_test
    │   │   └── train
    │   ├── ToyTrain
    │   │   ├── source_test
    │   │   ├── target_test
    │   │   └── train
    │   └── valve
    │       ├── source_test
    │       ├── target_test
    │       └── train
    └── eval
        ├── fan
        │   ├── source_test
        │   ├── target_test
        │   └── train
        ├── gearbox
        │   ├── source_test
        │   ├── target_test
        │   └── train
        ├── pump
        │   ├── source_test
        │   ├── target_test
        │   └── train
        ├── slider
        │   ├── source_test
        │   ├── target_test
        │   └── train
        ├── ToyCar
        │   ├── source_test
        │   ├── target_test
        │   └── train
        ├── ToyTrain
        │   ├── source_test
        │   ├── target_test
        │   └── train
        └── valve
            ├── source_test
            ├── target_test
            └── train
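Before launching the long-running scripts, it can help to confirm that both extracted datasets match the trees above. The following is a minimal sketch of such a check: the folder names come from this README, while the helper functions themselves are hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the directory trees above
DCASE_DEVICES = ("fan", "gearbox", "pump", "slider", "ToyCar", "ToyTrain", "valve")
DCASE_SUBDIRS = ("source_test", "target_test", "train")
FRAUNHOFER_SPLITS = ("test", "test_cut", "train", "train_cut")
FRAUNHOFER_ENGINES = ("engine1_good", "engine2_broken", "engine3_heavyload")


def expected_dcase_dirs():
    """Relative paths expected below datasets/DCASE2021/t2."""
    return [
        Path(part) / device / sub
        for part in ("dev", "eval")
        for device in DCASE_DEVICES
        for sub in DCASE_SUBDIRS
    ]


def expected_fraunhofer_dirs():
    """Relative paths expected below datasets/IDMT-ISA-ELECTRIC-ENGINE."""
    return [
        Path(split) / engine
        for split in FRAUNHOFER_SPLITS
        for engine in FRAUNHOFER_ENGINES
    ]


def missing_dirs(root, expected):
    """Return the expected relative paths that are not directories under root."""
    return [str(rel) for rel in expected if not (Path(root) / rel).is_dir()]
```

For example, `missing_dirs("datasets/DCASE2021/t2", expected_dcase_dirs())` should return an empty list once the Development and Additional Training datasets have been merged correctly.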
Tested on CUDA-enabled Ubuntu 20.04 with Conda and Python 3.8.1.
conda create --name dcase2021umaps python=3.8
conda activate dcase2021umaps
#
conda install -y -c conda-forge omegaconf
conda install -y -c conda-forge librosa
conda install -y -c anaconda h5py
conda install -y -c anaconda pytz
pip install coloredlogs
conda install -y -c anaconda pandas
#
conda install -y -c anaconda cython
pip install openl3==0.4.0 # TF backend should automatically recognize GPU
#
conda install -y -c conda-forge umap-learn
#
pip install randomcolor
The resulting environment has been frozen into requirements.txt. Check the file for full details on versions and dependencies.
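A quick way to confirm that the environment is complete is to check that each package can be found by the interpreter. The sketch below assumes the usual import names for the packages installed above (for instance, umap-learn is imported as `umap` and cython as `Cython`); the `missing_modules` helper is hypothetical and does not verify versions.

```python
from importlib.util import find_spec

# Assumed import names for the packages installed above
EXPECTED_MODULES = (
    "omegaconf", "librosa", "h5py", "pytz", "coloredlogs",
    "pandas", "Cython", "openl3", "umap", "randomcolor",
)


def missing_modules(names=EXPECTED_MODULES):
    """Return the module names that cannot be located in this environment."""
    # find_spec returns None for a top-level module that is not installed
    return [name for name in names if find_spec(name) is None]
```

Running `print(missing_modules())` inside the activated `dcase2021umaps` environment should print an empty list.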
The precomputed features are stored as HDF5 files with 3 keys:
- data: A matrix of shape (num_features, length) with the representations of all audio files concatenated along the length axis.
- data_idxs: A matrix of shape (2, num_files), where each (beg, end) pair designates the beginning and end index of an audio file in the data matrix.
- metadata: An array of length num_files in the same order as data_idxs. Each entry contains a string with the file metadata. For AudioSet and Fraunhofer, this is the relative filepath. For DCASE, it is a rich JSON object.
The reason for this design is that we want to have as much data as possible in a single contiguous memory chunk, for performance reasons. Encoding metadata as strings allows enough flexibility for all used datasets.
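The layout described above can be read back one file at a time. The sketch below is an illustration, not the repository's loader, and makes two assumptions that should be checked against the scripts: that row 0 of `data_idxs` holds the beginnings and row 1 the ends, and that the end index is exclusive. The function names are hypothetical.

```python
import json


def file_span(data_idxs, file_number):
    """Return (beg, end) for one file from a (2, num_files) index matrix."""
    # Assumption: row 0 holds the beginnings and row 1 the ends
    beg = int(data_idxs[0][file_number])
    end = int(data_idxs[1][file_number])
    return beg, end


def parse_metadata(entry):
    """Decode one metadata entry: JSON when possible, else the plain string."""
    text = entry.decode("utf-8") if isinstance(entry, bytes) else str(entry)
    try:
        return json.loads(text)  # DCASE entries are JSON objects
    except json.JSONDecodeError:
        return text  # AudioSet and Fraunhofer entries are relative filepaths


def load_file_features(h5_path, file_number):
    """Read the features and metadata of a single audio file."""
    import h5py  # imported lazily so the helpers above work without it

    with h5py.File(h5_path, "r") as h5:
        beg, end = file_span(h5["data_idxs"][:], file_number)
        # Assumption: end is exclusive, as in Python slicing
        features = h5["data"][:, beg:end]
        meta = parse_metadata(h5["metadata"][file_number])
    return features, meta
```

Since only the requested slice of `data` is read, this does not need the whole matrix to fit in RAM.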
Run the following commands to precompute all features:
# Fixed features:
python 00a_precompute_dcase_fixed.py WAV_NORM=absmax ROOT_PATH=datasets/DCASE2021/t2
python 00b_precompute_audioset_fixed.py WAV_NORM=absmax ROOT_PATH=datasets/AudioSet_fragments/
python 00c_precompute_fraunhofer_fixed.py WAV_NORM=absmax ROOT_PATH=datasets/IDMT-ISA-ELECTRIC-ENGINE/
# L3 embeddings: each call to L3 is slow, so a higher NUM_FILES_PER_L3_RUN is faster but consumes more RAM.
# GPU computation also speeds up processing, but GPU memory is limited; this can be controlled with L3_BATCHSIZE.
# The parameters below should be good for an 8GB GPU and 32GB of RAM.
python 00d_precompute_dcase_l3.py WAV_NORM=absmax ROOT_PATH=datasets/DCASE2021/t2 NUM_FILES_PER_L3_RUN=200 L3_BATCHSIZE=16
python 00e_precompute_audioset_l3.py WAV_NORM=absmax ROOT_PATH=datasets/AudioSet_fragments/ NUM_FILES_PER_L3_RUN=200 L3_BATCHSIZE=16
python 00f_precompute_fraunhofer_l3.py WAV_NORM=absmax ROOT_PATH=datasets/IDMT-ISA-ELECTRIC-ENGINE/ NUM_FILES_PER_L3_RUN=200 L3_BATCHSIZE=16
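The trade-off behind NUM_FILES_PER_L3_RUN can be pictured as grouping the file list into runs: larger groups mean fewer slow L3 calls, but more audio held in RAM at once. The sketch below only illustrates that grouping idea; it is not taken from the 00d/00e/00f scripts, and both helper names are hypothetical.

```python
def chunked(items, chunk_size):
    """Yield consecutive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def count_runs(num_files, files_per_run):
    """Number of runs needed when files are processed in groups."""
    return -(-num_files // files_per_run)  # ceiling division
```

Under this grouping, the 39437 AudioSet files with 200 files per run would need `count_runs(39437, 200)` runs, i.e. 198.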
The UMAP results are pickled dictionaries with the following keys: config, audioset, fraunhofer, (train, valve, 00, source), .... The config key contains a string with the parameters used. Each of the other entries corresponds to a dataset split and contains a dictionary with 4 keys: umaps, metadata, global_idxs, relative_idxs. The umaps are arrays of shape (N, 2) containing N samples from the computed UMAP. The others are N-element lists containing per-sample info: metadata about file path and labels, global index to find the frame in the original HDF5 matrix, and relative index to find the frame in the original file. This makes it possible to trace each UMAP dot back to its corresponding audio waveform or frame, which can be useful e.g. to compute the energies.
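Tracing a UMAP dot back to its source, as described above, can be sketched as follows. The key names are the ones listed in this README; the helper functions are hypothetical, and the exact type of the DCASE split keys (assumed here to be tuples of strings) should be checked against a real pickle.

```python
import pickle


def load_umap_results(path):
    """Load one UMAP result file."""
    # Only unpickle files you generated yourself or otherwise trust
    with open(path, "rb") as f:
        return pickle.load(f)


def split_keys(results):
    """All keys except 'config', i.e. one per dataset split."""
    return [key for key in results if key != "config"]


def trace_point(results, split, i):
    """Collect everything known about the i-th UMAP sample of a split."""
    entry = results[split]
    return {
        "xy": tuple(float(v) for v in entry["umaps"][i]),
        "metadata": entry["metadata"][i],
        "global_idx": entry["global_idxs"][i],      # frame position in the HDF5 matrix
        "relative_idx": entry["relative_idxs"][i],  # frame position within its file
    }
```

The `global_idx` value can then be used to index the `data` matrix of the corresponding HDF5 feature file.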
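# Define these variables for all computations. The two settings below are alternatives: choose one.
# Setting 1: stack 1 (same with STACK=5), including AudioSet and Fraunhofer
STACK=1 # same with STACK=5
TRAIN_SZ=10000
TEST_SZ=20000
AUDIOSET_SZ=10000
FRAUNHOFER_SZ=10000
# Setting 2: stack 10, with AudioSet and Fraunhofer capped at 1
STACK=10
TRAIN_SZ=10000
TEST_SZ=20000
AUDIOSET_SZ=1
FRAUNHOFER_SZ=1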
# Define these variables for the L3 computations
MOD=l3
AUDIOSET=precomputed_features/audioset_wavnorm=absmax_l3env_hop0.1_linear512.h5
FRAUNHOFER=precomputed_features/fraunhofer_wavnorm=absmax_l3env_hop0.1_linear512.h5
TRAIN=precomputed_features/dcase2021_t2_train_wavnorm=absmax_l3env_hop0.1_linear512.h5
TEST=precomputed_features/dcase2021_t2_cv_wavnorm=absmax_l3env_hop0.1_linear512.h5
# Define these variables for the mel computations
MOD=mel
AUDIOSET=precomputed_features/audioset_wavnorm=absmax_mel_win1024_hop512_m128.h5
FRAUNHOFER=precomputed_features/fraunhofer_wavnorm=absmax_mel_win1024_hop512_m128.h5
TRAIN=precomputed_features/dcase2021_t2_train_wavnorm=absmax_mel_win1024_hop512_m128.h5
TEST=precomputed_features/dcase2021_t2_cv_wavnorm=absmax_mel_win1024_hop512_m128.h5
# Define these variables for the stft computations
MOD=stft
AUDIOSET=precomputed_features/audioset_wavnorm=absmax_stft_win1024_hop512.h5
FRAUNHOFER=precomputed_features/fraunhofer_wavnorm=absmax_stft_win1024_hop512.h5
TRAIN=precomputed_features/dcase2021_t2_train_wavnorm=absmax_stft_win1024_hop512.h5
TEST=precomputed_features/dcase2021_t2_cv_wavnorm=absmax_stft_win1024_hop512.h5
# Once the variables of choice are defined, run one UMAP computation per device
for d in fan gearbox pump slider valve ToyCar ToyTrain; do python 01a_precompute_umaps.py STACK=$STACK MODALITY=$MOD MAX_AUDIOSET=$AUDIOSET_SZ MAX_FRAUNHOFER=$FRAUNHOFER_SZ MAX_DCASE_TRAIN=$TRAIN_SZ MAX_DCASE_TEST=$TEST_SZ SPLITS_NAME=$d DCASE_TRAIN_PATH=$TRAIN DCASE_TEST_PATH=$TEST AUDIOSET_PATH=$AUDIOSET FRAUNHOFER_PATH=$FRAUNHOFER "DCASE_SPLITS=[[$d, '00', source], [$d, '00', target], [$d, '01', source], [$d, '01', target], [$d, '02', source], [$d, '02', target], [$d, '03', source], [$d, '03', target], [$d, '04', source], [$d, '04', target], [$d, '05', source], [$d, '05', target]]"; done
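The plotting commands in this README read pickles from umaps/ whose names encode the parameters used here. The sketch below rebuilds such a name from its parameters; the pattern is inferred from those plotting commands rather than from 01a_precompute_umaps.py itself, and the `umap_pickle_name` helper is hypothetical.

```python
def umap_pickle_name(modality, splits_name, stack, max_train, max_test,
                     max_audioset, max_fraunhofer):
    """Build the pickle file name expected by the plotting commands."""
    return (
        f"UMAP_modality={modality}_splits={splits_name}_stack={stack}"
        f"_maxDcaseTrain={max_train}_maxDcaseTest={max_test}"
        f"_maxAudioset={max_audioset}_maxFraunhofer={max_fraunhofer}.pickle"
    )
```

The modality string appears verbatim in the name, so `l3` and `L3` lead to different files; the per-device plotting commands use `l3` while the global ones use `L3`.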
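# Define these variables for all computations. The two settings below are alternatives: choose one.
# Setting 1: stack 1 (same with STACK=5), including AudioSet and Fraunhofer
STACK=1 # same with STACK=5
TRAIN_SZ=1000
TEST_SZ=2000
AUDIOSET_SZ=50000
FRAUNHOFER_SZ=50000
# Setting 2: stack 10, with AudioSet and Fraunhofer capped at 1
STACK=10
TRAIN_SZ=2000
TEST_SZ=2000
AUDIOSET_SZ=1
FRAUNHOFER_SZ=1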
# Define these variables for the L3 computations
MOD=L3
AUDIOSET=precomputed_features/audioset_wavnorm=absmax_l3env_hop0.1_linear512.h5
FRAUNHOFER=precomputed_features/fraunhofer_wavnorm=absmax_l3env_hop0.1_linear512.h5
TRAIN=precomputed_features/dcase2021_t2_train_wavnorm=absmax_l3env_hop0.1_linear512.h5
TEST=precomputed_features/dcase2021_t2_cv_wavnorm=absmax_l3env_hop0.1_linear512.h5
# Define these variables for the mel computations
MOD=mel
AUDIOSET=precomputed_features/audioset_wavnorm=absmax_mel_win1024_hop512_m128.h5
FRAUNHOFER=precomputed_features/fraunhofer_wavnorm=absmax_mel_win1024_hop512_m128.h5
TRAIN=precomputed_features/dcase2021_t2_train_wavnorm=absmax_mel_win1024_hop512_m128.h5
TEST=precomputed_features/dcase2021_t2_cv_wavnorm=absmax_mel_win1024_hop512_m128.h5
# Define these variables for the stft computations
MOD=stft
AUDIOSET=precomputed_features/audioset_wavnorm=absmax_stft_win1024_hop512.h5
FRAUNHOFER=precomputed_features/fraunhofer_wavnorm=absmax_stft_win1024_hop512.h5
TRAIN=precomputed_features/dcase2021_t2_train_wavnorm=absmax_stft_win1024_hop512.h5
TEST=precomputed_features/dcase2021_t2_cv_wavnorm=absmax_stft_win1024_hop512.h5
# Once the variables of choice are defined, run the global UMAP computation
python 01a_precompute_umaps.py STACK=$STACK MODALITY=$MOD MAX_AUDIOSET=$AUDIOSET_SZ MAX_FRAUNHOFER=$FRAUNHOFER_SZ MAX_DCASE_TRAIN=$TRAIN_SZ MAX_DCASE_TEST=$TEST_SZ SPLITS_NAME=GLOBAL DCASE_TRAIN_PATH=$TRAIN DCASE_TEST_PATH=$TEST AUDIOSET_PATH=$AUDIOSET FRAUNHOFER_PATH=$FRAUNHOFER "DCASE_SPLITS=[[fan, '00', source], [fan, '00', target], [fan, '01', source], [fan, '01', target], [fan, '02', source], [fan, '02', target], [fan, '03', source], [fan, '03', target], [fan, '04', source], [fan, '04', target], [fan, '05', source], [fan, '05', target], [gearbox, '00', source], [gearbox, '00', target], [gearbox, '01', source], [gearbox, '01', target], [gearbox, '02', source], [gearbox, '02', target], [gearbox, '03', source], [gearbox, '03', target], [gearbox, '04', source], [gearbox, '04', target], [gearbox, '05', source], [gearbox, '05', target], [pump, '00', source], [pump, '00', target], [pump, '01', source], [pump, '01', target], [pump, '02', source], [pump, '02', target], [pump, '03', source], [pump, '03', target], [pump, '04', source], [pump, '04', target], [pump, '05', source], [pump, '05', target], [slider, '00', source], [slider, '00', target], [slider, '01', source], [slider, '01', target], [slider, '02', source], [slider, '02', target], [slider, '03', source], [slider, '03', target], [slider, '04', source], [slider, '04', target], [slider, '05', source], [slider, '05', target], [valve, '00', source], [valve, '00', target], [valve, '01', source], [valve, '01', target], [valve, '02', source], [valve, '02', target], [valve, '03', source], [valve, '03', target], [valve, '04', source], [valve, '04', target], [valve, '05', source], [valve, '05', target], [ToyCar, '00', source], [ToyCar, '00', target], [ToyCar, '01', source], [ToyCar, '01', target], [ToyCar, '02', source], [ToyCar, '02', target], [ToyCar, '03', source], [ToyCar, '03', target], [ToyCar, '04', source], [ToyCar, '04', target], [ToyCar, '05', source], [ToyCar, '05', target], [ToyTrain, '00', source], 
[ToyTrain, '00', target], [ToyTrain, '01', source], [ToyTrain, '01', target], [ToyTrain, '02', source], [ToyTrain, '02', target], [ToyTrain, '03', source], [ToyTrain, '03', target], [ToyTrain, '04', source], [ToyTrain, '04', target], [ToyTrain, '05', source], [ToyTrain, '05', target]]"
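The DCASE_SPLITS value above follows a regular pattern: every device, combined with sections 00 to 05 and both domains. Rather than typing it by hand, it can be generated. The sketch below reproduces the same text format; the helper names are hypothetical.

```python
# Values taken from the commands above
DEVICES = ("fan", "gearbox", "pump", "slider", "valve", "ToyCar", "ToyTrain")
SECTIONS = ("00", "01", "02", "03", "04", "05")
DOMAINS = ("source", "target")


def dcase_splits(devices=DEVICES, sections=SECTIONS, domains=DOMAINS):
    """Build the list literal passed as DCASE_SPLITS."""
    entries = [
        f"[{device}, '{section}', {domain}]"
        for device in devices
        for section in sections
        for domain in domains
    ]
    return "[" + ", ".join(entries) + "]"


def splits_override(devices=DEVICES):
    """Full command-line override, to be passed as a single quoted argument."""
    return "DCASE_SPLITS=" + dcase_splits(devices)
```

`splits_override()` gives the 84-entry global value, while `splits_override(["fan"])` gives the 12-entry value used for a single device.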
# LogMels and STFTs: 500/100 is a reasonable, general approximation to locate the spike of the convex cone.
# Choose one of the four settings below before running the loop.
STACK=1
MODALITY=mel
excl=500
avg=100
STACK=5
MODALITY=mel
excl=500
avg=100
STACK=1
MODALITY=stft
excl=500
avg=100
STACK=5
MODALITY=stft
excl=500
avg=100
for d in fan gearbox pump slider valve ToyCar ToyTrain; do for s in 0 1 2; do pth=UMAP_modality=${MODALITY}_splits=${d}_stack=${STACK}_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 02a_single_section_plot.py WITH_CROSS=true DEVICE=${d} DEVICE_UMAP_PATH=umaps/$pth SECTION=${s} SAVEFIG_PATH="umap_plots/${pth}_section${s}.png" CROSS_EXCLUDE_LOWEST=${excl} CROSS_AVERAGE_N=${avg}; done; done
# L3 embeddings don't have a defined energy, so they are plotted without a cross, which also computes faster.
# Choose one of the two settings below before running the loop.
STACK=1
MODALITY=l3
STACK=5
MODALITY=l3
for d in fan gearbox pump slider valve ToyCar ToyTrain; do for s in 0 1 2; do pth=UMAP_modality=${MODALITY}_splits=${d}_stack=${STACK}_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 02a_single_section_plot.py WITH_CROSS=false DEVICE=${d} DEVICE_UMAP_PATH=umaps/$pth SECTION=${s} SAVEFIG_PATH="umap_plots/${pth}_section${s}.png"; done; done
# LogMels and STFTs (choose one of the four settings below before running the loop):
STACK=1
MODALITY=mel
excl=500
avg=100
STACK=5
MODALITY=mel
excl=500
avg=100
STACK=1
MODALITY=stft
excl=500
avg=100
STACK=5
MODALITY=stft
excl=500
avg=100
for d in fan gearbox pump slider valve ToyCar ToyTrain; do pth=UMAP_modality=${MODALITY}_splits=${d}_stack=${STACK}_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 02b_single_device_plot.py WITH_CROSS=true DEVICE=${d} DEVICE_UMAP_PATH=umaps/$pth SAVEFIG_PATH="umap_plots/${pth}_device.png" CROSS_EXCLUDE_LOWEST=${excl} CROSS_AVERAGE_N=${avg}; done
# L3 embeddings (choose one of the two settings below before running the loop):
STACK=1
MODALITY=l3
STACK=5
MODALITY=l3
for d in fan gearbox pump slider valve ToyCar ToyTrain; do pth=UMAP_modality=${MODALITY}_splits=${d}_stack=${STACK}_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 02b_single_device_plot.py WITH_CROSS=false DEVICE=${d} DEVICE_UMAP_PATH=umaps/$pth SAVEFIG_PATH="umap_plots/${pth}_device.png"; done
pth="UMAP_modality=stft_splits=GLOBAL_stack=5_maxDcaseTrain=1000_maxDcaseTest=2000_maxAudioset=50000_maxFraunhofer=50000.pickle"; python 02c_global_plot.py GLOBAL_UMAP_PATH=umaps/${pth} SAVEFIG_PATH=umap_plots/${pth}_global.png
pth="UMAP_modality=mel_splits=GLOBAL_stack=5_maxDcaseTrain=1000_maxDcaseTest=2000_maxAudioset=50000_maxFraunhofer=50000.pickle"; python 02c_global_plot.py GLOBAL_UMAP_PATH=umaps/${pth} SAVEFIG_PATH=umap_plots/${pth}_global.png
pth="UMAP_modality=L3_splits=GLOBAL_stack=5_maxDcaseTrain=1000_maxDcaseTest=2000_maxAudioset=50000_maxFraunhofer=50000.pickle"; python 02c_global_plot.py GLOBAL_UMAP_PATH=umaps/${pth} SAVEFIG_PATH=umap_plots/${pth}_global.png
# Stack 10 plots without AudioSet and Fraunhofer (choose one MODALITY below before running the loop):
STACK=10
MODALITY=stft
MODALITY=mel
MODALITY=l3
for d in fan gearbox pump slider valve ToyCar ToyTrain; do pth=UMAP_modality=${MODALITY}_splits=${d}_stack=${STACK}_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=1_maxFraunhofer=1.pickle; python 02b_single_device_plot.py PLOT_AUDIOSET=false PLOT_FRAUNHOFER=false WITH_CROSS=false DEVICE=${d} DEVICE_UMAP_PATH=umaps/$pth SAVEFIG_PATH="umap_plots/${pth}_device.png"; done
pth="UMAP_modality=stft_splits=GLOBAL_stack=10_maxDcaseTrain=2000_maxDcaseTest=2000_maxAudioset=1_maxFraunhofer=1.pickle"; python 02c_global_plot.py GLOBAL_UMAP_PATH=umaps/${pth} SAVEFIG_PATH=umap_plots/${pth}_global.png
pth="UMAP_modality=mel_splits=GLOBAL_stack=10_maxDcaseTrain=2000_maxDcaseTest=2000_maxAudioset=1_maxFraunhofer=1.pickle"; python 02c_global_plot.py GLOBAL_UMAP_PATH=umaps/${pth} SAVEFIG_PATH=umap_plots/${pth}_global.png
pth="UMAP_modality=L3_splits=GLOBAL_stack=10_maxDcaseTrain=2000_maxDcaseTest=2000_maxAudioset=1_maxFraunhofer=1.pickle"; python 02c_global_plot.py GLOBAL_UMAP_PATH=umaps/${pth} SAVEFIG_PATH=umap_plots/${pth}_global.png
# Mel pump stack 5 excerpt
pth=UMAP_modality=mel_splits=pump_stack=5_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 03f_single_device_plot_paper_explanatory.py PLOT_AUDIOSET=false PLOT_FRAUNHOFER=false WITH_CROSS=false DEVICE=pump DEVICE_UMAP_PATH=umaps/${pth} CUT_TOP=0.597 CUT_LEFT=0.807 CUT_BOTTOM=0.33 CUT_RIGHT=0.095 FIG_MARGIN_RIGHT=0.7 FIG_LEGEND_POS=0.89 SAVEFIG_PATH=umap_plots/${pth}_device_plot_paper.png DCASE_SHADOW_SIZE=60 DOT_SIZE=20 LEGEND_WIDTH_FACTOR=1.5 LEGEND_FONT_SIZE=31
# Global STFT plot
pth="UMAP_modality=stft_splits=GLOBAL_stack=10_maxDcaseTrain=2000_maxDcaseTest=2000_maxAudioset=1_maxFraunhofer=1.pickle"; python 03c_global_plot_paper.py GLOBAL_UMAP_PATH=umaps/${pth} PLOT_LEGEND=true PLOT_AUDIOSET=false PLOT_FRAUNHOFER=false SAVEFIG_PATH=umap_plots/${pth}_global_paper.png
# L3 ToyCar stack 1 device
pth=UMAP_modality=l3_splits=ToyCar_stack=1_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 03b_single_device_plot_paper.py PLOT_AUDIOSET=false PLOT_FRAUNHOFER=false WITH_CROSS=false DEVICE=ToyCar DEVICE_UMAP_PATH=umaps/${pth} CUT_TOP=0.05 CUT_LEFT=0.06 CUT_BOTTOM=0.04 CUT_RIGHT=0 FIG_MARGIN_RIGHT=0.7 FIG_LEGEND_POS=0.82 SAVEFIG_PATH=umap_plots/${pth}_device_plot_paper.png
# Mel Valve stack 1 with cross
pth=UMAP_modality=mel_splits=valve_stack=1_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 03a_single_section_plot_paper.py WITH_CROSS=true CROSS_EXCLUDE_LOWEST=500 CROSS_AVERAGE_N=200 DEVICE=valve DEVICE_UMAP_PATH=umaps/$pth SECTION=0 CUT_TOP=0.42 CUT_LEFT=0.37 CUT_BOTTOM=0.14 CUT_RIGHT=0.25 SAVEFIG_PATH=umap_plots/${pth}_section0_plot_paper.png
# L3 fan stack 1
pth=UMAP_modality=l3_splits=fan_stack=1_maxDcaseTrain=10000_maxDcaseTest=20000_maxAudioset=10000_maxFraunhofer=10000.pickle; python 03b_single_device_plot_paper.py PLOT_AUDIOSET=true PLOT_FRAUNHOFER=true WITH_CROSS=false DEVICE=fan DEVICE_UMAP_PATH=umaps/${pth} CUT_TOP=0.05 CUT_LEFT=0.06 CUT_BOTTOM=0.04 CUT_RIGHT=-0.2 FIG_MARGIN_RIGHT=0.8 FIG_LEGEND_POS=0.8 SAVEFIG_PATH=umap_plots/${pth}_device_plot_paper.png
# trim
for i in umap_plots/*; do convert "$i" -trim "$i"; done
# compress: 2 is best quality, 31 worst
# https://stackoverflow.com/questions/10225403/how-can-i-extract-a-good-quality-jpeg-image-from-a-video-file-with-ffmpeg/10234065#10234065
for i in umap_plots/*.png; do ffmpeg -i "$i" -qscale:v 15 "${i%.png}.jpg"; done
*Work supported by EPSRC grants EP/T019751/1 (AI for Sound) and EP/T022205/1 (JADE2 Tier 2 HPC facility)*

