Train, validate and apply a multi-label classifier based on MFCCs and LSTMs, using
the pyannote-multilabel command-line tool.
The labels are:
- KCHI (speech utterances from the key child)
- CHI (speech utterances from other children)
- FEM (female speech utterances)
- MAL (male speech utterances)
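Because the classifier is multi-label, several of these classes can be active at the same instant (for instance, the key child vocalizing while a female adult speaks). The sketch below illustrates one way such annotations could be turned into frame-level multi-hot targets. The `frame_targets` helper, the 10 ms frame step and the example segments are assumptions made for illustration; this is not the code used by pyannote-multilabel.

```python
LABELS = ["KCHI", "CHI", "FEM", "MAL"]


def frame_targets(segments, duration, step=0.010):
    """Return one multi-hot row [KCHI, CHI, FEM, MAL] per frame.

    segments: list of (start, end, label) tuples in seconds (hypothetical input).
    duration: total duration of the excerpt in seconds.
    step:     frame step in seconds (assumed value).
    """
    n_frames = int(round(duration / step))
    targets = [[0] * len(LABELS) for _ in range(n_frames)]
    for start, end, label in segments:
        column = LABELS.index(label)
        first = max(0, int(round(start / step)))
        last = min(n_frames, int(round(end / step)))
        for frame in range(first, last):
            # Classes are independent: setting one never clears another.
            targets[frame][column] = 1
    return targets


# Hypothetical excerpt: KCHI and FEM overlap between 0.02 s and 0.03 s.
example = frame_targets(
    [(0.00, 0.03, "KCHI"), (0.02, 0.05, "FEM")], duration=0.05
)
for row in example:
    print(row)
```

In the printed rows, the third frame has both the KCHI and FEM columns set, which is what distinguishes a multi-label target from a single-label one.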
First, make sure that the file ~/.pyannote/database.yml contains the following lines (the paths below are valid if you run your experiments on the CLSP cluster):
Protocols:
  AMI:
    SpeakerDiarization:
      MixHeadset:
        train:
          annotation: /export/fs01/jsalt19/databases/AMI/train/allMix-Headset_train.rttm
          annotated: /export/fs01/jsalt19/databases/AMI/train/allMix-Headset_train.uem
        development:
          annotation: /export/fs01/jsalt19/databases/AMI/dev/allMix-Headset_dev.rttm
          annotated: /export/fs01/jsalt19/databases/AMI/dev/allMix-Headset_dev.uem
        test:
          annotation: /export/fs01/jsalt19/databases/AMI/test/allMix-Headset_test.rttm
          annotated: /export/fs01/jsalt19/databases/AMI/test/allMix-Headset_test.uem
  BabyTrain:
    SpeakerDiarization:
      All:
        train:
          annotation: /export/fs01/jsalt19/databases/BabyTrain/train/all_train.rttm
          annotated: /export/fs01/jsalt19/databases/BabyTrain/train/all_train.uem
        development:
          annotation: /export/fs01/jsalt19/databases/BabyTrain/dev/all_dev.rttm
          annotated: /export/fs01/jsalt19/databases/BabyTrain/dev/all_dev.uem
        test:
          annotation: /export/fs01/jsalt19/databases/BabyTrain/test/all_test.rttm
          annotated: /export/fs01/jsalt19/databases/BabyTrain/test/all_test.uem
  VoxCeleb:
    SpeakerDiarization:
      MaleAugmentation:
        train:
          annotation: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/train/all_train.rttm
          annotated: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/train/all_train.uem
          uris: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/train/male_20h.txt
        development:
          annotation: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/dev/all_dev.rttm
          annotated: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/dev/all_dev.uem
        test:
          annotation: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/test/all_test.rttm
          annotated: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/test/all_test.uem
  CHiME5:
    SpeakerDiarization:
      U01:
        train:
          annotation: /export/fs01/jsalt19/databases/CHiME5/train/allU01_train.rttm
          annotated: /export/fs01/jsalt19/databases/CHiME5/train/allU01_train.uem
        development:
          annotation: /export/fs01/jsalt19/databases/CHiME5/dev/allU01_dev.rttm
          annotated: /export/fs01/jsalt19/databases/CHiME5/dev/allU01_dev.uem
        test:
          annotation: /export/fs01/jsalt19/databases/CHiME5/test/allU01_test.rttm
          annotated: /export/fs01/jsalt19/databases/CHiME5/test/allU01_test.uem
  X:
    SpeakerDiarization:
      # META PROTOCOL JULIEN
      JSALT:
        train:
          BabyTrain.SpeakerDiarization.All: [train]
          VoxCeleb.SpeakerDiarization.MaleAugmentation: [train]
          AMI.SpeakerDiarization.MixHeadset: [train]
        development:
          BabyTrain.SpeakerDiarization.All: [development]
          AMI.SpeakerDiarization.MixHeadset: [development]
        test:
          BabyTrain.SpeakerDiarization.All: [test]
          AMI.SpeakerDiarization.MixHeadset: [test]
Databases:
  AMI: /export/fs01/jsalt19/databases/AMI/*/wav/{uri}.wav
  BabyTrain: /export/fs01/jsalt19/databases/BabyTrain/*/wav/{uri}.wav
  VoxCeleb: /export/fs01/jsalt19/databases/auxiliary/VoxCeleb/train/wav/{uri}.wav
  CHiME5: /export/fs01/jsalt19/databases/CHiME5/*/wav/{uri}.wav
  SRI: /export/fs01/jsalt19/databases/SRI/*/wav/{uri}.wav
  MUSAN: /export/fs01/jsalt19/databases/auxiliary/musan/{uri}.wav
Next, we can install the needed dependencies:
# Create conda environment
conda create --name pyannote python=3.6
conda activate pyannote
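# Clone into BabyTrain_multilabel so that the directory name matches the cd below
git clone https://github.com/jsalt-coml/babytrain_multilabel.git BabyTrain_multilabel
cd BabyTrain_multilabel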
# Clone forked version of pyannote-audio
git clone https://github.com/jsalt-coml/pyannote-audio.git
# Install the associated local python packages
pip install -e ./pyannote-audio
# tensorboard support (optional) 
pip install tensorflow tensorboard
# support Yaafe feature extraction (optional)
conda install -c conda-forge yaafe
# support Shennong feature extraction (optional)
git clone https://github.com/bootphon/shennong.git
cd ./shennong
conda env update -n pyannote -f environment.yml
make install
make test
To ensure reproducibility, pyannote-multilabel relies on a configuration file defining the experimental setup:
cat babytrain/multilabel/config.yml
task:
   name: Multilabel
   params:
      duration: 2.0      # sequences are 2s long
      batch_size: 64     # 64 sequences per batch
      per_epoch: 1       # one epoch = 1 day of audio
      weighted_loss: True # weight loss by 1/prior for each class 
data_augmentation:
   name: AddNoise                                   # add noise on-the-fly
   params:
      snr_min: 10                                   # using random signal-to-noise
      snr_max: 20                                   # ratio between 10 and 20 dBs
      collection: MUSAN.Collection.BackgroundNoise  # use background noise from MUSAN
                                                    # (needs pyannote.db.musan)
feature_extraction:
   name: LibrosaMFCC      # use MFCC from librosa
   params:
      e: False            # do not use energy
      De: True            # use energy 1st derivative
      DDe: True           # use energy 2nd derivative
      coefs: 19           # use 19 MFCC coefficients
      D: True             # use coefficients 1st derivative
      DD: True            # use coefficients 2nd derivative
      duration: 0.025     # extract MFCC from 25ms windows
      step: 0.010         # extract MFCC every 10ms
      sample_rate: 16000  # convert to 16 kHz first (if needed)
architecture:
   name: StackedRNN
   params:
      instance_normalize: True  # normalize sequences
      rnn: LSTM                 # use LSTM (could be GRU)
      recurrent: [128, 128]     # two layers with 128 hidden states
      bidirectional: True       # bidirectional LSTMs
      linear: [32, 32]          # add two linear layers at the end
scheduler:
   name: CyclicScheduler        # use cyclic learning rate (LR) scheduler
   params:
      learning_rate: auto       # automatically guess LR upper bound
      epochs_per_cycle: 14      # 14 epochs per cycle
      
preprocessors:
    annotation:
       name: pyannote.audio.features.GenderChiMapper
You might want to change some of these parameters to see if performance improves.
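To get a feel for what these parameters imply, the following sketch works out a few derived quantities from the configuration. It assumes that the feature vector is the concatenation of the static coefficients, their derivatives and the enabled energy terms, and that the per-class loss weight is simply the inverse of the class prior. The helper names and the frame counts in the last example are hypothetical, so treat the numbers as an illustration rather than as values reported by the tool.

```python
def mfcc_dimension(coefs, e, De, DDe, D, DD):
    """Features per frame, assuming static + delta + delta-delta stacking."""
    dimension = coefs                          # static MFCC coefficients
    dimension += coefs * int(D)                # 1st derivative of coefficients
    dimension += coefs * int(DD)               # 2nd derivative of coefficients
    dimension += int(e) + int(De) + int(DDe)   # energy terms that are enabled
    return dimension


def approx_frames_per_sequence(duration, step):
    """Approximate frame count of one training sequence (ignores window edges)."""
    return int(round(duration / step))


def inverse_prior_weights(active_frames, n_frames):
    """Weight each class by 1 / prior, with prior = active frames / all frames."""
    return {label: n_frames / count for label, count in active_frames.items()}


# Values taken from the configuration file above.
dimension = mfcc_dimension(coefs=19, e=False, De=True, DDe=True, D=True, DD=True)
frames = approx_frames_per_sequence(duration=2.0, step=0.010)
print(dimension)  # 59
print(frames)     # 200

# Hypothetical frame counts, only to show how rare classes get larger weights.
weights = inverse_prior_weights(
    {"KCHI": 100, "CHI": 50, "FEM": 250, "MAL": 200}, n_frames=1000
)
print(weights)
```

Under these assumptions, each 2 s training sequence is a matrix of roughly 200 frames by 59 features, and the rarest class (CHI in the made-up counts) receives the largest weight.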
The following command will train the network on the training set of the BabyTrain database for 1000 epochs:
export EXPERIMENT_DIR=babytrain/multilabel
pyannote-multilabel train --gpu --to=1000 ${EXPERIMENT_DIR} BabyTrain.SpeakerDiarization.All
This will create a bunch of files in TRAIN_DIR (defined below). One can follow the training process using tensorboard:
tensorboard --logdir=${EXPERIMENT_DIR}
To get a quick idea of how the network is doing during training, one can use the validate mode.
It can (should!) be run in parallel to training and evaluates the model epoch after epoch.
export TRAIN_DIR=${EXPERIMENT_DIR}/train/BabyTrain.SpeakerDiarization.All.train
pyannote-multilabel validate SPEECH ${TRAIN_DIR} BabyTrain.SpeakerDiarization.All
One can also use the Detection Error Rate metric for validating the model by adding the flag --use_der.
In practice, validation tunes a simple speech activity detection pipeline (pyannote.audio.pipeline.speech_activity_detection.SpeechActivityDetection) for the specified class and, after each epoch, stores the best hyper-parameter configuration on disk:
cat ${TRAIN_DIR}/validate/BabyTrain.SpeakerDiarization.All/params.yml
epoch: 280
params:
  min_duration_off: 0.0
  min_duration_on: 0.0
  offset: 0.5503037490496294
  onset: 0.5503037490496294
  pad_offset: 0.0
  pad_onset: 0.0
One can also use tensorboard to follow the validation process.
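The stored parameters describe how frame-level scores are turned into segments. The sketch below is a simplified illustration of that idea, not pyannote's implementation: a region opens when the score rises above `onset`, closes when it falls below `offset`, and regions can then be cleaned up using the two minimum durations. The function names, the 10 ms frame step, the order of the clean-up steps and the example scores are all assumptions; padding (`pad_onset`, `pad_offset`) is left out.

```python
def binarize(scores, onset, offset, step=0.010):
    """Turn frame-level scores into (start, end) regions, in seconds."""
    regions = []
    active = False
    start = 0.0
    for index, score in enumerate(scores):
        time = index * step
        if not active and score > onset:
            active, start = True, time           # region opens
        elif active and score < offset:
            active = False                       # region closes
            regions.append((round(start, 3), round(time, 3)))
    if active:
        # Close a region that is still open at the end of the scores.
        regions.append((round(start, 3), round(len(scores) * step, 3)))
    return regions


def filter_regions(regions, min_duration_on=0.0, min_duration_off=0.0):
    """Fill gaps shorter than min_duration_off, then drop short regions."""
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], end)    # bridge the short gap
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Made-up scores; thresholds copied from the params.yml shown above.
scores = [0.1, 0.2, 0.7, 0.8, 0.6, 0.3, 0.1, 0.9, 0.2]
regions = binarize(scores, onset=0.5503037490496294, offset=0.5503037490496294)
print(regions)
print(filter_regions(regions, min_duration_on=0.02))
print(filter_regions(regions, min_duration_off=0.05))
```

With both minimum durations at 0.0, as in the stored configuration, the clean-up step leaves the regions unchanged, so the decision reduces to a single threshold on the scores.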
Once the thresholds have been computed in the validation step, we can apply our model to the test set:
export VALIDATE_DIR=${TRAIN_DIR}/validate_speech
export OUTPUT_DIR=my_sad_output
export PROTOCOL=BabyTrain.SpeakerDiarization.All
./apply_and_evaluate.sh $VALIDATE_DIR $PROTOCOL $OUTPUT_DIR
This script will produce the raw scores in the $OUTPUT_DIR folder, then create the .rttm files by applying the thresholds to these scores. Finally, it will compute the detection error rate using pyannote-metrics. Depending on the task (SPEECH, KCHI, CHI, FEM or MAL) the model has been optimized for, only the relevant class is predicted.
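As a reminder of what the reported number means, the detection error rate is commonly defined as the sum of false alarm and missed detection durations divided by the total duration of the reference class. The sketch below computes it from two lists of (start, end) regions. It is a simplified stand-in for pyannote-metrics: it assumes that regions within each list do not overlap and it ignores the annotated (.uem) regions. The function names and the example regions are hypothetical.

```python
def total_duration(regions):
    """Sum of region durations, in seconds."""
    return sum(end - start for start, end in regions)


def overlap_duration(reference, hypothesis):
    """Duration covered by both reference and hypothesis regions."""
    overlap = 0.0
    for r_start, r_end in reference:
        for h_start, h_end in hypothesis:
            overlap += max(0.0, min(r_end, h_end) - max(r_start, h_start))
    return overlap


def detection_error_rate(reference, hypothesis):
    """(false alarm + missed detection) / total reference duration."""
    speech = total_duration(reference)
    correct = overlap_duration(reference, hypothesis)
    missed = speech - correct                          # reference not detected
    false_alarm = total_duration(hypothesis) - correct  # detected outside reference
    return (false_alarm + missed) / speech


# Made-up regions: 2 s missed and 1 s of false alarm over 8 s of reference.
reference = [(0.0, 4.0), (6.0, 10.0)]
hypothesis = [(1.0, 5.0), (6.0, 9.0)]
rate = detection_error_rate(reference, hypothesis)
print(rate)  # 0.375
```

Note that this rate can exceed 1 when the hypothesis contains much more false alarm than there is reference speech.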
To use tensorboard, you will need to tunnel through both login.clsp.jhu.edu and the node itself. From your local machine, run:
# Tunnel to login.clsp.jhu.edu
ssh <username>@login.clsp.jhu.edu -L 1234:localhost:1234
# Tunnel to the node c05
ssh c05 -L 1234:localhost:1234
# Run tensorboard session
cd BabyTrain_multilabel
source activate pyannote
tensorboard --logdir=babytrain/multilabel --port 1234
Then, go to localhost:1234 in your favourite browser.
Submit the script train.sh:
qsub train.sh
All the parameters for the submission to grid-engine appear at the beginning of train.sh.
Submit the script validate.sh:
qsub validate.sh KCHI
where the argument passed to validate.sh can be chosen from {KCHI, CHI, FEM, MAL, SPEECH}, depending on whether you want to evaluate the model on a specific class or as a speech activity detection model.
- pyannote library
@inproceedings{Yin2017,
  Author = {Ruiqing Yin and Herv\'e Bredin and Claude Barras},
  Title = {{Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks}},
  Booktitle = {{18th Annual Conference of the International Speech Communication Association, Interspeech 2017}},
  Year = {2017},
  Month = {August},
  Address = {Stockholm, Sweden},
  Url = {https://github.com/yinruiqing/change_detection}
}
@inproceedings{Bredin2017,
    author = {Herv\'{e} Bredin},
    title = {{TristouNet: Triplet Loss for Speaker Turn Embedding}},
    booktitle = {42nd IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017},
    year = {2017},
    url = {http://arxiv.org/abs/1609.04301},
}
@inproceedings{Yin2018,
  Author = {Ruiqing Yin and Herv\'e Bredin and Claude Barras},
  Title = {{Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization}},
  Booktitle = {{19th Annual Conference of the International Speech Communication Association, Interspeech 2018}},
  Year = {2018},
  Month = {September},
  Address = {Hyderabad, India},
}