
Commit 7096454

Existing methods just missing images
1 parent 45d2803 commit 7096454

6 files changed, +314 -106 lines changed


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -5,5 +5,8 @@ report/report.pdf
 report/summary.pdf
 report/status.pdf
 /data/
+!data/results
 .ipynb_checkpoints/
 _minted-input/
+/test/data
+/test/out

Experiments.ipynb

Lines changed: 279 additions & 93 deletions
Large diffs are not rendered by default.

TODO.md

Lines changed: 3 additions & 5 deletions
@@ -7,14 +7,11 @@
 Critical line
 
 - Finalize Existing methods review.
-  Fix image CNNs section.
-  Fix keyword spotting section.
   Add images
 - Finalize introduction. Health/import section, images
 - Materials. Add images of model comparisons
 - BACKGROUND section. Information about CNNs, images
 - Background. Info about microcontrollers
-
 - Make plots pretty in Results
 - Write basic Discussion and Conclusion
 
@@ -31,12 +28,12 @@ MONDAY22. Send draft to OK
 Experiment
 
 - Switch to zero overlap voting?
-- Test whether SB-CNN with two DS fits into memory
 - Do error analysis.
   If we only consider high-confidence outputs, are we more precise? How much does recall drop?
+  If the model knows its own limitations, we can ignore low-confidence results
+  and wait for more confident ones (since we are doing continuous monitoring).
 - Write all settings/parameters to a file when run
 - Include git version in settings file
-- Run with different filters. At least DS-5x5
 - MAYBE: Fix train and validation generators to be single-pass?
 - MAYBE: Profile to see what makes training slow
 
@@ -48,6 +45,7 @@ Code quality
 Maybe
 
 - Add a test with 16kHz / 30 mels?
+- Add a test with 3x3 kernels
 
 Dissemination
 
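An aside on the error-analysis item above: the precision/recall question can be answered by thresholding prediction confidence and counting what happens. A minimal sketch, assuming softmax-style class probabilities; all data and threshold values below are illustrative placeholders, not project code:

```python
import numpy as np

def thresholded_precision_recall(y_true, proba, threshold):
    """Treat class 1 as the event of interest; abstain below the threshold."""
    pred = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    fired = (pred == 1) & (conf >= threshold)   # confident positive detections
    tp = np.sum(fired & (y_true == 1))
    precision = tp / max(fired.sum(), 1)        # abstentions never fire, so
    recall = tp / max((y_true == 1).sum(), 1)   # they only cost recall
    return precision, recall

# Synthetic scores, just to exercise the function
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
logits = rng.normal(size=(1000, 2)) + 1.5 * np.eye(2)[y_true]
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
for t in (0.5, 0.7, 0.9):
    p, r = thresholded_precision_recall(y_true, proba, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Sweeping the threshold makes the trade explicit: precision should rise as low-confidence outputs are dropped, while recall falls because abstentions count as misses.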

braindump.md

Lines changed: 1 addition & 0 deletions
@@ -440,6 +440,7 @@ https://github.com/ARM-software/CMSIS_5/issues/217
 
 
 Not All Ops Are Created Equal!, https://arxiv.org/abs/1801.04326
+[@lai2018not]
 Found up to 5x difference in throughput/energy between different operations.
 
 

report/references.bib

Lines changed: 6 additions & 0 deletions
@@ -143,6 +143,12 @@ @ARTICLE{Sigitia2016
   month={Nov},
 }
 
+@article{sainath2015convolutional,
+  title={Convolutional neural networks for small-footprint keyword spotting},
+  author={Sainath, Tara and Parada, Carolina},
+  year={2015}
+}
+
 @article{HelloEdge,
   title={Hello edge: Keyword spotting on microcontrollers},
   author={Zhang, Yundong and Suda, Naveen and Lai, Liangzhen and Chandra, Vikas},

report/report.md

Lines changed: 22 additions & 8 deletions
@@ -594,21 +594,27 @@ the model is able to reach 78.3% on Urbansound8k.
 ## Resource efficient Environmental Sound Classification
 
 There are also a few works on Environmental Sound Classification (ESC)
-that explicitly target making resource efficient models (in parameters, inference time or power consumption).
+that explicitly target making resource efficient models, measured
+in number of parameters and compute operations.
 
 WSNet[@WSNet] is a 1D network on raw audio designed for efficiency.
-It uses a weight sampling approach for efficient quantization of weights to
-reaches a 70.5% on UrbandSound8k with a 288K parameters and 100M MAC.
+It proposes a weight sampling approach for efficient quantization of weights to
+reach an accuracy of 70.5% on Urbansound8k with 288K parameters and 100M MACs.
 
 LD-CNN[@LD-CNN] is a more efficient version of D-CNN.
 In order to reduce parameters the early layers use spatially separable convolutions,
 and the middle layers use dilated convolutions.
 As a result the model has 2.05MB of parameters, 50x fewer than D-CNN,
 while accuracy only dropped by 2% to 79% on Urbansound8k.
-`TODO: include mult-adds`
 
-AclNet [@AclNet].
-`TODO: write about`
+AclNet [@AclNet] is a CNN architecture.
+It uses 2 layers of 1D strided convolution as a FIR decimation filterbank
+to create a 2D spectrogram-like set of features.
+Then a VGG style architecture with Depthwise Separable Convolutions is applied.
+A width multiplier like that of Mobilenet is used to adjust model complexity.
+Data augmentation and mixup are applied, giving up to a 5% boost.
+Evaluated on ESC-50, the best performing model gets 85.65% accuracy, very close to state-of-the-art.
+The smallest model had 7.3M MACs and 15k parameters, and got 75% accuracy on ESC-50.
 
 eGRU[@eGRU] demonstrates a Recurrent Neural Network based on a modified Gated Recurrent Unit.
 The feature representation used was a raw STFT spectrogram from 8kHz audio.
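To make the AclNet summary above concrete, here is a hedged sketch of that style of architecture in tf.keras (the choice of Keras, and all layer sizes, strides and channel counts, are illustrative assumptions, not the paper's exact configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers

def aclnet_like(input_samples=32000, alpha=0.5, n_classes=50):
    """AclNet-style sketch: learned filterbank frontend + DS-conv body."""
    def w(c):                      # Mobilenet-style width multiplier
        return max(8, int(c * alpha))
    inp = layers.Input((input_samples, 1))
    # Frontend: two strided 1D convolutions act as an FIR decimation
    # filterbank, producing a 2D spectrogram-like feature map
    x = layers.Conv1D(w(16), 9, strides=4, padding="same", activation="relu")(inp)
    x = layers.Conv1D(w(64), 5, strides=4, padding="same", activation="relu")(x)
    x = layers.Reshape((-1, w(64), 1))(x)   # (time, mel-like bins, 1)
    # Body: VGG-style stack of Depthwise Separable Convolutions
    for c in (64, 128, 256):
        x = layers.SeparableConv2D(w(c), 3, padding="same", activation="relu")(x)
        x = layers.MaxPool2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

aclnet_like(alpha=0.25).summary()   # shrink alpha to trade accuracy for size
```

The width multiplier scales every channel count uniformly, which is how a single architecture family can span from the 85.65% model down to the 15k-parameter one.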
@@ -617,7 +623,8 @@ so the results may not be directly comparable to others.
 With full-precision floating point the model got 72% accuracy.
 When running on device using the proposed quantization technique the accuracy fell to 61%.
 
-As of April 2019, eGRU was the only paper found which performs ESC on a microcontroller.
+As of April 2019, eGRU was the only paper that could be found performing the ESC task
+on the Urbansound8k dataset on a microcontroller.
 
 
 ## Resource efficient image classification
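eGRU's drop from 72% to 61% when quantized is easy to build intuition for with plain linear quantization. This sketch is generic (eGRU's actual scheme is different and not reproduced here):

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Affine quantization of a float array to 2**bits levels, then back."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((w - lo) / scale) * scale + lo

w = np.random.default_rng(1).normal(size=4096).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit: mean abs weight error {err:.4f}")
```

Fewer bits mean a coarser weight grid and larger rounding error, which compounds through recurrent updates, so aggressive quantization tends to cost accuracy unless training is quantization-aware.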
@@ -658,20 +665,27 @@ EffNet[@Effnet] (2018) also uses spatial separable convolutions,
 but additionally performs the downsampling in a separable fashion:
 first a 1x2 max pooling after the 1x3 kernel,
 followed by 2x1 striding in the 3x1 kernel.
+Evaluated on the CIFAR10 and Street View House Numbers (SVHN) datasets,
+it scored slightly better than Mobilenets and ShuffleNet.
 
 ## Resource efficient CNNs for speech detection
 
 Speech detection is a big application of audio processing and machine learning.
 In the Keyword Spotting (KWS) task the goal is to detect a keyword or phrase that
 indicates that the user wants to enable speech control.
-Example phrases in commercially available products include "Hey Siri" for Apple devices or "OK Google" for Google devices.
+Example phrases in commercially available products include "Hey Siri" for Apple devices
+or "OK Google" for Google devices.
 This is used in smart-home devices such as Amazon Alexa, as well as in smartwatches and mobile devices.
 For this reason keyword spotting on low-power devices and microcontrollers
 is an area of active research.
 
 Note that speech recognition tasks often use Mel-Frequency Cepstral Coefficients (MFCC),
 which are computed by performing a Discrete Cosine Transform (DCT) on a mel-spectrogram.
 
+In [@sainath2015convolutional] (2015) the authors evaluated variations of
+small-footprint CNNs for keyword spotting. They found that large strides in time or frequency
+could be used to create models that were significantly more efficient.
+
 In the "Hello Edge"[@HelloEdge] paper (2017),
 different models were evaluated for keyword spotting on microcontrollers.
 Included were most standard deep learning model architectures
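Since MFCCs recur throughout these works, here is a minimal sketch of the "DCT on a mel-spectrogram" computation described above, using librosa (the filename and parameter values are placeholders):

```python
import librosa
import numpy as np
import scipy.fftpack

y, sr = librosa.load("example.wav", sr=16000)             # placeholder file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
log_S = librosa.power_to_db(S)                            # log mel-spectrogram
# MFCC = DCT-II along the mel axis, keeping the first coefficients
mfcc = scipy.fftpack.dct(log_S, axis=0, type=2, norm="ortho")[:13]
# Cross-check against librosa's built-in implementation
print(np.allclose(mfcc, librosa.feature.mfcc(S=log_S, n_mfcc=13)))
```

The DCT decorrelates the mel bands, which suited older GMM-based recognizers; many CNN systems skip it and work on the log mel-spectrogram directly.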

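Likewise, the EffNet downsampling scheme a few paragraphs up is compact enough to spell out. A hedged tf.keras sketch of one such block (channel counts and input shape are arbitrary; the real EffNet block differs in details such as activations):

```python
import tensorflow as tf
from tensorflow.keras import layers

def effnet_like_block(x, ch):
    x = layers.Conv2D(ch // 2, 1, activation="relu")(x)       # pointwise bottleneck
    x = layers.DepthwiseConv2D((1, 3), padding="same")(x)     # 1x3 depthwise kernel
    x = layers.MaxPool2D(pool_size=(1, 2))(x)                 # 1x2 max pooling
    x = layers.DepthwiseConv2D((3, 1), strides=(2, 1),
                               padding="same")(x)             # 3x1 kernel, 2x1 stride
    x = layers.Conv2D(ch, 1, activation="relu")(x)            # pointwise expansion
    return x

inp = layers.Input((32, 32, 3))
out = effnet_like_block(inp, 64)
print(tf.keras.Model(inp, out).output_shape)   # (None, 16, 16, 64): 2x2 downsampled
```

Splitting the 2x2 downsampling into a 1x2 pool and a 2x1 stride keeps each depthwise kernel cheap while still halving both spatial dimensions per block.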