
Commit e51df81

Author: Vineel Pratap (committed)
Merge branch 'master' into inf_build
2 parents: 898e8f4 + 64e54f8


112 files changed (+960, -29197 lines)

README.md

Lines changed: 19 additions & 15 deletions
@@ -3,38 +3,42 @@
 [![CircleCI](https://circleci.com/gh/facebookresearch/wav2letter.svg?style=svg)](https://circleci.com/gh/facebookresearch/wav2letter)
 [![Join the chat at https://gitter.im/wav2letter/community](https://badges.gitter.im/wav2letter/community.svg)](https://gitter.im/wav2letter/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 
-wav2letter++ is a [highly efficient](https://arxiv.org/abs/1812.07625) end-to-end automatic speech recognition (ASR) toolkit written entirely in C++, leveraging [ArrayFire](https://github.com/arrayfire/arrayfire) and [flashlight](https://github.com/facebookresearch/flashlight).
+## Important Note:
+### wav2letter has been moved and consolidated [into Flashlight](https://github.com/facebookresearch/flashlight) in the [ASR application](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr).
 
-The toolkit started from models predicting letters directly from the raw waveform, and now evolved as an all-purpose end-to-end ASR research toolkit, supporting a wide range of models and learning techniques. It also embarks a very efficient modular beam-search decoder, for both structured learning (CTC, ASG) and seq2seq approaches.
+Future wav2letter development will occur in Flashlight.
 
-**Important disclaimer**: as a number of models from this repository could be used for other modalities, we moved most of the code to flashlight.
+*To build the old, pre-consolidation version of wav2letter*, check out the [wav2letter v0.2](https://github.com/facebookresearch/wav2letter/releases/tag/v0.2) release, which depends on the old [Flashlight v0.2](https://github.com/facebookresearch/flashlight/releases/tag/v0.2) release. The [`wav2letter-lua`](https://github.com/facebookresearch/wav2letter/tree/wav2letter-lua) project can be found on the `wav2letter-lua` branch, accordingly.
 
+For more information on wav2letter++, see or cite [this arXiv paper](https://arxiv.org/abs/1812.07625).
+
+## Recipes
 This repository includes recipes to reproduce the following research papers as well as **pre-trained** models:
-- [NEW] [Pratap et al. (2020): Scaling Online Speech Recognition Using ConvNets](recipes/streaming_convnets/)
-- [NEW SOTA] [Synnaeve et al. (2020): End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures](recipes/sota/2019)
+- [Pratap et al. (2020): Scaling Online Speech Recognition Using ConvNets](recipes/streaming_convnets/)
+- [Synnaeve et al. (2020): End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures](recipes/sota/2019)
 - [Kahn et al. (2020): Self-Training for End-to-End Speech Recognition](recipes/self_training)
 - [Likhomanenko et al. (2019): Who Needs Words? Lexicon-free Speech Recognition](recipes/lexicon_free/)
 - [Hannun et al. (2019): Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions](recipes/seq2seq_tds/)
 
-Data preparation for our training and evaluation can be found in [data](data) folder.
+Data preparation for training and evaluation can be found in the [data](data) directory.
 
-The previous iteration of wav2letter can be found in the:
-- (before merging codebases for wav2letter and flashlight) [wav2letter-v0.2](https://github.com/facebookresearch/wav2letter/tree/v0.2) branch.
-- (written in Lua) [`wav2letter-lua`](https://github.com/facebookresearch/wav2letter/tree/wav2letter-lua) branch.
+### Building the Recipes
 
-## Build recipes
-First, isntall [flashlight](https://github.com/facebookresearch/flashlight) with all its dependencies. Then
+First, install [Flashlight](https://github.com/facebookresearch/flashlight) with the [ASR application](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr). Then, after cloning the project source:
+```shell
+mkdir build && cd build
+cmake .. && make -j8
 ```
-mkdir build && cd build && cmake .. && make -j8
+If Flashlight or ArrayFire are installed in nonstandard paths via a custom `CMAKE_INSTALL_PREFIX`, they can be found by passing
+```shell
+-Dflashlight_DIR=[PREFIX]/usr/share/flashlight/cmake/ -DArrayFire_DIR=[PREFIX]/usr/share/ArrayFire/cmake
 ```
-If flashlight or ArrayFire are installed in nonstandard paths via `CMAKE_INSTALL_PREFIX`, they can be found by passing `-Dflashlight_DIR=[PREFIX]/usr/share/flashlight/cmake/ -DArrayFire_DIR=[PREFIX]/usr/share/ArrayFire/cmake` when running `cmake`.
+when running `cmake`.
 
 ## Join the wav2letter community
 * Facebook page: https://www.facebook.com/groups/717232008481207/
 * Google group: https://groups.google.com/forum/#!forum/wav2letter-users
 
 
-See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
-
 ## License
 wav2letter++ is BSD-licensed, as found in the [LICENSE](LICENSE) file.
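As a worked example of the build steps introduced in this README diff, here is a minimal sketch; the clone location and the `/opt/flashlight` install prefix are hypothetical stand-ins for the `[PREFIX]` placeholder above, not paths mandated by the README.

```shell
# Hedged sketch of the recipe build described above.
# /opt/flashlight is a hypothetical CMAKE_INSTALL_PREFIX used only for illustration.
git clone https://github.com/facebookresearch/wav2letter.git
cd wav2letter
mkdir build && cd build
cmake .. \
  -Dflashlight_DIR=/opt/flashlight/usr/share/flashlight/cmake/ \
  -DArrayFire_DIR=/opt/flashlight/usr/share/ArrayFire/cmake
make -j8
```

If both libraries live in the default system prefix, the two `-D*_DIR` arguments can simply be dropped, matching the plain `cmake .. && make -j8` shown in the diff.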

data/ami/README.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
# A Recipe for the AMI corpus.

"The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers." See http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml for more details.

We use the individual headset microphone (IHM) setting for preparing the train, dev and test sets. The recipe here is heavily inspired by the preprocessing scripts in Kaldi - https://github.com/kaldi-asr/kaldi/tree/master/egs/ami.

## Steps to download and prepare the audio and text data

Prepare the train, dev and test sets as list files to be used for training with wav2letter. Replace `[...]` with appropriate paths:

```
python prepare.py -dst [...]
```

The above script downloads the AMI data and segments it into shorter `.flac` audio files based on word timestamps. Limited-supervision training sets for 10min, 1hr and 10hr will be generated as well.

The following structure will be generated:
```
>tree -L 4
.
├── audio
│   ├── EN2001a
│   │   ├── EN2001a.Headset-0.wav
│   │   ├── ...
│   │   └── EN2001a.Headset-4.wav
│   ├── EN2001b
│   ├── ...
│   ├── ...
│   ├── IS1009d
│   │   ├── ...
│   │   └── IS1009d.Headset-3.wav
│   └── segments
│       ├── ES2005a
│       │   ├── ES2005a_H00_MEE018_0.75_1.61.flac
│       │   ├── ES2005a_H00_MEE018_13.19_16.05.flac
│       │   ├── ...
│       │   └── ...
│       ├── ...
│       └── IS1009d
│           ├── ...
│           └── ...
├── lists
│   ├── dev.lst
│   ├── test.lst
│   ├── train_10min_0.lst
│   ├── train_10min_1.lst
│   ├── train_10min_2.lst
│   ├── train_10min_3.lst
│   ├── train_10min_4.lst
│   ├── train_10min_5.lst
│   ├── train_9hr.lst
│   └── train.lst
└── text
    ├── ami_public_manual_1.6.1.zip
    └── annotations
        ├── 00README_MANUAL.txt
        ├── ...
        ├── transcripts0
        ├── transcripts1
        ├── transcripts2
        ├── words
        └── youUsages
```
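To make the preparation step above concrete, a minimal usage sketch follows; `~/ami_data` is a hypothetical destination directory standing in for the `[...]` placeholder, and the sanity checks just inspect the layout shown in the tree.

```shell
# Hedged sketch of the AMI data-preparation step described in the README above.
# ~/ami_data is a hypothetical destination; substitute your own path for [...].
DST="$HOME/ami_data"
python prepare.py -dst "$DST"

# Sanity checks against the directory layout shown above.
wc -l "$DST"/lists/*.lst          # list files used for wav2letter training
ls "$DST"/audio/segments | head   # per-meeting directories of segmented .flac files
```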

data/ami/ami_split_segments.pl

Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
#!/usr/bin/env perl

# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)

# The script - based on punctuation times - splits segments longer than #words (input parameter)
# and produces a somewhat more normalised form of the transcripts, as follows:
# MeetID Channel Spkr stime etime transcripts

#use List::MoreUtils 'indexes';
use strict;
use warnings;

sub split_transcripts;
sub normalise_transcripts;

# Merge two hashes; on duplicate keys the first hash wins and a warning is printed.
sub merge_hashes {
  my ($h1, $h2) = @_;
  my %hash1 = %$h1; my %hash2 = %$h2;
  foreach my $key2 ( keys %hash2 ) {
    if( exists $hash1{$key2} ) {
      warn "Key [$key2] is in both hashes!";
      next;
    } else {
      $hash1{$key2} = $hash2{$key2};
    }
  }
  return %hash1;
}

sub print_hash {
  my ($h) = @_;
  my %hash = %$h;
  foreach my $k (sort keys %hash) {
    print "$k : $hash{$k}\n";
  }
}

# Build a segment name from begin/end times (stored in hundredths of a second).
sub get_name {
  #no warnings;
  my $sname = sprintf("%07d_%07d", $_[0]*100, $_[1]*100) || die 'Input undefined!';
  #use warnings;
  return $sname;
}

# Recursively split an over-long utterance at the comma closest to its midpoint.
sub split_on_comma {

  my ($text, $comma_times, $btime, $etime, $max_words_per_seg)= @_;
  my %comma_hash = %$comma_times;

  print "Btime, Etime : $btime, $etime\n";

  my $stime = ($etime+$btime)/2; #split time
  my $skey = "";
  my $otime = $btime;
  foreach my $k (sort {$comma_hash{$a} cmp $comma_hash{$b} } keys %comma_hash) {
    print "Key : $k : $comma_hash{$k}\n";
    my $ktime = $comma_hash{$k};
    if ($ktime==$btime) { next; }
    if ($ktime==$etime) { last; }
    if (abs($stime-$ktime)/2<abs($stime-$otime)/2) {
      $otime = $ktime;
      $skey = $k;
    }
  }

  my %transcripts = ();

  if (!($skey =~ /[\,][0-9]+/)) {
    print "Cannot split into less than $max_words_per_seg words! Leaving : $text\n";
    $transcripts{get_name($btime, $etime)}=$text;
    return %transcripts;
  }

  print "Splitting $text on $skey at time $otime (stime is $stime)\n";
  my @utts1 = split(/$skey\s+/, $text);
  for (my $i=0; $i<=$#utts1; $i++) {
    my $st = $btime;
    my $et = $comma_hash{$skey};
    if ($i>0) {
      $st=$comma_hash{$skey};
      $et = $etime;
    }
    my (@utts) = split (' ', $utts1[$i]);
    if ($#utts < $max_words_per_seg) {
      my $nm = get_name($st, $et);
      print "SplittedOnComma[$i]: $nm : $utts1[$i]\n";
      $transcripts{$nm} = $utts1[$i];
    } else {
      print 'Continue splitting!';
      my %transcripts2 = split_on_comma($utts1[$i], \%comma_hash, $st, $et, $max_words_per_seg);
      %transcripts = merge_hashes(\%transcripts, \%transcripts2);
    }
  }
  return %transcripts;
}

# Split a transcript on full stops first and, for segments still longer than
# $max_words_per_seg words, recursively on commas using the punctuation timestamps.
sub split_transcripts {
  @_ == 4 || die 'split_transcripts: transcript btime etime max_word_per_seg';

  my ($text, $btime, $etime, $max_words_per_seg) = @_;
  my (@transcript) = @$text;

  my (@punct_indices) = grep { $transcript[$_] =~ /^[\.,\?\!\:]$/ } 0..$#transcript;
  my (@time_indices) = grep { $transcript[$_] =~ /^[0-9]+\.[0-9]*/ } 0..$#transcript;
  my (@puncts_times) = delete @transcript[@time_indices];
  my (@puncts) = @transcript[@punct_indices];

  if ($#puncts_times != $#puncts) {
    print 'Ooops, different number of punctuation signs and timestamps! Skipping.';
    return ();
  }

  #first split on full stops
  my (@full_stop_indices) = grep { $puncts[$_] =~ /[\.\?]/ } 0..$#puncts;
  my (@full_stop_times) = @puncts_times[@full_stop_indices];

  unshift (@full_stop_times, $btime);
  push (@full_stop_times, $etime);

  my %comma_puncts = ();
  for (my $i=0, my $j=0;$i<=$#punct_indices; $i++) {
    my $lbl = "$transcript[$punct_indices[$i]]$j";
    if ($lbl =~ /[\.\?].+/) { next; }
    $transcript[$punct_indices[$i]] = $lbl;
    $comma_puncts{$lbl} = $puncts_times[$i];
    $j++;
  }

  #print_hash(\%comma_puncts);

  print "InpTrans : @transcript\n";
  print "Full stops: @full_stop_times\n";

  my @utts1 = split (/[\.\?]/, uc join(' ', @transcript));
  my %transcripts = ();
  for (my $i=0; $i<=$#utts1; $i++) {
    my (@utts) = split (' ', $utts1[$i]);
    if ($#utts < $max_words_per_seg) {
      print "ReadyTrans: $utts1[$i]\n";
      $transcripts{get_name($full_stop_times[$i], $full_stop_times[$i+1])} = $utts1[$i];
    } else {
      print "TransToSplit: $utts1[$i]\n";
      my %transcripts2 = split_on_comma($utts1[$i], \%comma_puncts, $full_stop_times[$i], $full_stop_times[$i+1], $max_words_per_seg);
      print "Hash TR2:\n"; print_hash(\%transcripts2);
      print "Hash TR:\n"; print_hash(\%transcripts);
      %transcripts = merge_hashes(\%transcripts, \%transcripts2);
      print "Hash TR_NEW : \n"; print_hash(\%transcripts);
    }
  }
  return %transcripts;
}

sub normalise_transcripts {
  my $text = $_[0];

  #DO SOME ROUGH AND OBVIOUS PRELIMINARY NORMALISATION, AS FOLLOWS
  #remove the remaining punctuation labels e.g. some text ,0 some text ,1
  $text =~ s/[\.\,\?\!\:][0-9]+//g;
  #there are some extra spurious punctuations without spaces, e.g. UM,I, replace with space
  $text =~ s/[A-Z']+,[A-Z']+/ /g;
  #split word combinations, i.e. ANTI-TRUST to ANTI TRUST (none of them appears in cmudict anyway)
  #$text =~ s/(.*)([A-Z])\s+(\-)(.*)/$1$2$3$4/g;
  $text =~ s/\-/ /g;
  #substitute X_M_L with X. M. L. etc.
  $text =~ s/\_/. /g;
  #normalise and trim spaces
  $text =~ s/^\s*//g;
  $text =~ s/\s*$//g;
  $text =~ s/\s+/ /g;
  #some transcripts are empty with -, nullify (and ignore) them
  $text =~ s/^\-$//g;
  $text =~ s/\s+\-$//;
  # apply a few exceptions for dashed phrases, Mm-Hmm, Uh-Huh, etc.; those are frequent in AMI
  # and will be added to the dictionary
  $text =~ s/MM HMM/MM\-HMM/g;
  $text =~ s/UH HUH/UH\-HUH/g;

  return $text;
}

if (@ARGV != 2) {
  print STDERR "Usage: ami_split_segments.pl <meet-file> <out-file>\n";
  exit(1);
}

my $meet_file = shift @ARGV;
my $out_file = shift @ARGV;
my %transcripts = ();

open(W, ">$out_file") || die "opening output file $out_file";
open(S, "<$meet_file") || die "opening meeting file $meet_file";

while(<S>) {

  my @A = split(" ", $_);
  if (@A < 9) { print "Skipping line @A"; next; }

  my ($meet_id, $channel, $spk, $channel2, $trans_btime, $trans_etime, $aut_btime, $aut_etime) = @A[0..7];
  my @transcript = @A[8..$#A];
  my %transcript = split_transcripts(\@transcript, $aut_btime, $aut_etime, 30);

  for my $key (keys %transcript) {
    my $value = $transcript{$key};
    my $segment = normalise_transcripts($value);
    my @times = split(/\_/, $key);
    if ($times[0] >= $times[1]) {
      print "Warning, $meet_id, $spk, $times[0] > $times[1]. Skipping. \n"; next;
    }
    if (length($segment)>0) {
      print W join " ", $meet_id, "H0${channel2}", $spk, $times[0]/100.0, $times[1]/100.0, $segment, "\n";
    }
  }

}
close(S);
close(W);

print STDERR "Finished."
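For orientation, a hedged usage sketch of the segment-splitting script above. The script itself only requires a `<meet-file>` and an `<out-file>`; the `annotations/transcripts*` names below are assumptions mirroring the directory tree shown earlier, not paths fixed by the script.

```shell
# Hedged sketch: split over-long AMI segments (max 30 words, hard-coded above) and
# write "MeetID Channel Spkr stime etime transcript" lines to the output file.
# transcripts1/transcripts2 are assumed input/output names, chosen for illustration.
perl ami_split_segments.pl annotations/transcripts1 annotations/transcripts2
```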

data/ami/ami_xml2text.sh

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
#!/usr/bin/env bash

# Copyright, University of Edinburgh (Pawel Swietojanski and Jonathan Kilgour)

if [ $# -ne 1 ]; then
  echo "Usage: $0 <ami-dir>"
  exit 1;
fi

adir=$1
wdir=$1/annotations

[ ! -f $adir/annotations/AMI-metadata.xml ] && echo "$0: File $adir/annotations/AMI-metadata.xml not found." && exit 1;

mkdir -p $wdir/log

JAVA_VER=$(java -version 2>&1 | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')

if [ "$JAVA_VER" -ge 15 ]; then
  if [ ! -d $wdir/nxt ]; then
    echo "Downloading NXT annotation tool..."
    wget -O $wdir/nxt.zip http://sourceforge.net/projects/nite/files/nite/nxt_1.4.4/nxt_1.4.4.zip
    [ ! -s $wdir/nxt.zip ] && echo "Downloading failed! ($wdir/nxt.zip)" && exit 1
    unzip -d $wdir/nxt $wdir/nxt.zip &> /dev/null
  fi

  if [ ! -f $wdir/transcripts0 ]; then
    echo "Parsing XML files (can take several minutes)..."
    nxtlib=$wdir/nxt/lib
    java -cp $nxtlib/nxt.jar:$nxtlib/xmlParserAPIs.jar:$nxtlib/xalan.jar:$nxtlib \
      FunctionQuery -c $adir/annotations/AMI-metadata.xml -q '($s segment)(exists $w1 w):$s^$w1' -atts obs who \
      '@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent,global_name, 0)'\
      '@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent, channel, 0)' \
      transcriber_start transcriber_end starttime endtime '$s' '@extract(($w w):$s^$w & $w@punc="true", starttime,0,0)' \
      1> $wdir/transcripts0 2> $wdir/log/nxt_export.log
  fi
else
  echo "$0. Java not found. Will download exported version of transcripts."
  annots=ami_manual_annotations_v1.6.1_export
  wget -O $wdir/$annots.gzip http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/$annots.gzip
  gunzip -c $wdir/${annots}.gzip > $wdir/transcripts0
fi

# remove NXT logs dumped to stdio
grep -e '^Found' -e '^Obs' -i -v $wdir/transcripts0 > $wdir/transcripts1

exit 0;
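A short usage sketch for the export script above; the corpus root path is a hypothetical placeholder for whatever directory contains the `annotations/` folder (for example, the same destination passed to `prepare.py -dst`).

```shell
# Hedged sketch: export AMI NXT annotations to plain text.
AMI_DIR=/path/to/ami            # hypothetical corpus root containing annotations/
bash ami_xml2text.sh "$AMI_DIR"
# Per the script above, the exported transcripts land in:
#   $AMI_DIR/annotations/transcripts0  (raw export)
#   $AMI_DIR/annotations/transcripts1  (NXT log lines filtered out)
```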
