the model is able to reach 78.3% on Urbansound8k.

## Resource efficient Environmental Sound Classification

There are also a few works on Environmental Sound Classification (ESC)
that explicitly target making resource-efficient models,
measured in the number of parameters and compute operations.

WSNet[@WSNet] is a 1D network on raw audio designed for efficiency.
It proposes a weight sampling approach for efficient quantization of weights,
reaching an accuracy of 70.5% on Urbansound8k with 288K parameters and 100M MACs.
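
The core weight-sampling idea can be sketched as follows. This is a minimal
illustration of the weight-sharing aspect, not the paper's implementation;
the layer shape, sampling stride and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightSampledConv1d(nn.Module):
    """Sketch of WSNet-style weight sampling: every filter is an
    overlapping window into one shared parameter vector, so the
    parameter count is decoupled from the number of filters."""

    def __init__(self, in_ch, out_ch, kernel_size, sampling_stride=2):
        super().__init__()
        window = kernel_size * in_ch
        # Shared vector sized so that exactly out_ch windows fit
        shared_len = (out_ch - 1) * sampling_stride + window
        self.shared = nn.Parameter(torch.randn(shared_len) * 0.05)
        self.filter_shape = (out_ch, in_ch, kernel_size)
        self.window, self.stride = window, sampling_stride

    def forward(self, x):
        # Sample filters as overlapping windows of the shared vector
        w = self.shared.unfold(0, self.window, self.stride)
        return F.conv1d(x, w.reshape(self.filter_shape))

# 64 filters over 32 channels with kernel 3 would normally use
# 64*32*3 = 6144 weights; the shared vector here has (64-1)*2 + 96 = 222
layer = WeightSampledConv1d(in_ch=32, out_ch=64, kernel_size=3)
out = layer(torch.randn(1, 32, 16000))
```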

LD-CNN[@LD-CNN] is a more efficient version of D-CNN.
To reduce the number of parameters, the early layers use spatially separable convolutions,
and the middle layers use dilated convolutions.
As a result the model has 2.05MB of parameters, 50x fewer than D-CNN,
while accuracy only dropped by 2%, to 79%, on Urbansound8k.
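
A minimal sketch of these two ideas, with illustrative channel counts rather
than the ones from the paper:

```python
import torch.nn as nn

# Spatially separable convolution: a 5x5 kernel factored into
# 5x1 followed by 1x5, using 2*5*C*C weights instead of 25*C*C
separable = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=(5, 1), padding=(2, 0)),
    nn.Conv2d(32, 32, kernel_size=(1, 5), padding=(0, 2)),
)

# Dilated convolution: a 3x3 kernel with dilation 2 covers a 5x5
# receptive field with no increase in parameter count
dilated = nn.Conv2d(32, 32, kernel_size=3, dilation=2, padding=2)
```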

AclNet[@AclNet] is a CNN architecture that operates directly on raw audio.
It uses 2 layers of strided 1D convolution as a FIR decimation filterbank
to create a 2D spectrogram-like set of features.
Then a VGG-style architecture with depthwise separable convolutions is applied.
A width multiplier, like that of Mobilenets, is used to adjust model complexity.
Data augmentation and mixup are applied, giving up to a 5% boost in accuracy.
Evaluated on ESC-50, the best-performing model reaches 85.65% accuracy, very close to state-of-the-art.
The smallest model had 7.3M MACs and 15k parameters, and reached 75% accuracy on ESC-50.
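
A rough sketch of this structure; the kernel sizes, strides and channel
counts are guesses for illustration, not the values from the paper.

```python
import torch
import torch.nn as nn

def dw_separable(cin, cout, stride=1):
    """Depthwise separable convolution block, as in Mobilenets."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin),
        nn.BatchNorm2d(cin), nn.ReLU(),
        nn.Conv2d(cin, cout, 1),
        nn.BatchNorm2d(cout), nn.ReLU(),
    )

class AclNetSketch(nn.Module):
    def __init__(self, n_classes=50, width=1.0):
        super().__init__()
        c = int(32 * width)  # width multiplier scales all channel counts
        # Two strided 1D convolutions act as a learned FIR decimation
        # filterbank, turning raw audio into 2D spectrogram-like features
        self.frontend = nn.Sequential(
            nn.Conv1d(1, c, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(c, c, kernel_size=5, stride=4, padding=2), nn.ReLU(),
        )
        # VGG-style stack of depthwise separable convolutions
        self.backend = nn.Sequential(
            dw_separable(1, c, stride=2),
            dw_separable(c, 2 * c, stride=2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(2 * c, n_classes)

    def forward(self, x):        # x: (batch, 1, audio_samples)
        z = self.frontend(x)     # (batch, c, frames)
        z = z.unsqueeze(1)       # treat (c, frames) as a 1-channel image
        return self.fc(self.backend(z).flatten(1))

logits = AclNetSketch(width=0.5)(torch.randn(1, 1, 16000))
```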

eGRU[@eGRU] demonstrates a Recurrent Neural Network based on a modified Gated Recurrent Unit.
The feature representation used was a raw STFT spectrogram of 8kHz audio.
However, the evaluation setup was non-standard,
so the results may not be directly comparable to others.
With full-precision floating point the model achieved 72% accuracy.
When running on-device with the proposed quantization technique, accuracy fell to 61%.

As of April 2019, eGRU was the only paper that could be found performing the ESC task
on the Urbansound8k dataset on a microcontroller.
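
For reference, a standard GRU cell computes the update below;
eGRU modifies the gates and activations to reduce compute and memory,
with the exact changes described in the paper.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$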

## Resource efficient image classification

EffNet[@Effnet] (2018) also uses spatially separable convolutions,
but additionally performs the downsampling in a separable fashion:
first a 1x2 max pooling after the 1x3 kernel,
followed by 2x1 striding in the 3x1 kernel.
Evaluated on the CIFAR10 and Street View House Numbers (SVHN) datasets,
it scored slightly better than Mobilenets and ShuffleNet.
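
A simplified sketch of this downsampling pattern; the real EffNet block also
uses depthwise and pointwise convolutions, and the channel counts here are
illustrative.

```python
import torch.nn as nn

effnet_downsample = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=(1, 3), padding=(0, 1)),  # 1x3 kernel
    nn.MaxPool2d(kernel_size=(1, 2)),                       # 1x2 max pooling
    nn.Conv2d(32, 64, kernel_size=(3, 1), stride=(2, 1),
              padding=(1, 0)),                              # 3x1 kernel, 2x1 stride
)
```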

## Resource efficient CNNs for speech detection

Speech detection is a major application of audio processing and machine learning.
In the Keyword Spotting (KWS) task, the goal is to detect a keyword or phrase that
indicates that the user wants to enable speech control.
Example phrases in commercially available products include "Hey Siri" for Apple devices
or "OK Google" for Google devices.
This is used in smart-home devices such as Amazon Alexa,
as well as in smartwatches and mobile devices.
For this reason, keyword spotting on low-power devices and microcontrollers
is an area of active research.

Note that speech recognition tasks often use Mel-Frequency Cepstral Coefficients (MFCC),
which are computed by performing a Discrete Cosine Transform (DCT) on a log mel-spectrogram.
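
For illustration, a minimal sketch using librosa; the file name and
parameter values are placeholders.

```python
import librosa
import scipy.fftpack

# Load audio and compute a mel-spectrogram with 40 mel bands
y, sr = librosa.load("speech.wav", sr=16000)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)

# MFCC: DCT of the log mel-spectrogram, keeping the first 13 coefficients
log_S = librosa.power_to_db(S)
mfcc = scipy.fftpack.dct(log_S, axis=0, norm="ortho")[:13]

# Equivalent shortcut:
# mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mels=40, n_mfcc=13)
```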

In [@sainath2015convolutional] (2015), the authors evaluated variations of
small-footprint CNNs for keyword spotting.
They found that large strides in time or frequency could be used
to create significantly more effective models.
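
As an illustration, giving the first convolution a stride in the frequency
dimension shrinks every subsequent feature map, cutting multiplies roughly
by the stride factor. The layer shape below is a made-up example, not taken
from the paper.

```python
import torch.nn as nn

# Input: (batch, 1, time_frames, mel_bands)
# Stride 4 along frequency reduces downstream compute roughly 4x
conv = nn.Conv2d(1, 64, kernel_size=(20, 8), stride=(1, 4))
```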

In the "Hello Edge"[@HelloEdge] paper (2017),
different models were evaluated for keyword spotting on microcontrollers.
Included were most standard deep learning model architectures