Sampling sounds for recognition (should be a wiki page?) #100

vladturcuman opened this issue May 2, 2021
A better place for this would be a wiki page, as this isn't an issue.

This is meant to show the process of sampling and adding a new sound to be recognised as per the design here.

Adding a new class of sounds

The class MicroBitSoundRecogniser contains all the code to recognise sounds, but doesn't have any sound samples added - hence it's made abstract by keeping the constructor protected. For now, there is only one class that inherits it: EmojiRecogniser, which is meant to recognise the emoji class of sounds.

To add a new recogniser, create a new class that inherits MicroBitSoundRecogniser and add each sound that should be recognised, as described below and sketched right after this paragraph. An alternative is to replace MicroBitSoundRecogniser in the pipeline altogether with a custom component that analyses the frequencies of each time frame to determine the sound being played - preferable when the sounds are very long and constant.
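As a rough sketch, a new recogniser could look like the snippet below. The names AnimalRecogniser and addDogSound are hypothetical, and the base-class constructor arguments are an assumption - check MicroBitSoundRecogniser.h for the actual signature:

class AnimalRecogniser : public MicroBitSoundRecogniser
{
    public:
    // Assumed constructor shape: forward whatever the base class actually needs.
    AnimalRecogniser(MicroBitAudioProcessor& processor)
        : MicroBitSoundRecogniser(processor)
    {
        // Register each sound, following the pattern of addHappySound() below.
        addDogSound();
    }

    private:
    void addDogSound();
};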

Sampling a sound

Preparing the micro:bit

To sample a sound, one needs to output the dominant frequency in each time frame. This can be done either by creating a component that outputs to serial only the dominant frequency as it comes from the MicroBitAudioProcessor (a sketch follows), or by simply using the .hex attached.
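For the first option, a component along the lines of the sketch below could work. It is only illustrative: the way the dominant frequency is pulled out of the MicroBitAudioProcessor (the DataSink connection and the getFrequency() call) is an assumption and has to be checked against the actual headers:

#include "MicroBit.h"

extern MicroBit uBit;

class FrequencyStreamer : public DataSink
{
    MicroBitAudioProcessor& audio;

    public:
    FrequencyStreamer(MicroBitAudioProcessor& processor) : audio(processor)
    {
        // Assumes the processor is a DataSource we can attach to.
        audio.connect(*this);
    }

    // Called once per time frame: print only the dominant frequency.
    virtual int pullRequest()
    {
        uBit.serial.printf("%d\r\n", (int) audio.getFrequency());
        return DEVICE_OK;
    }
};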

Preparing the host machine

The micro:bit needs to be connected to a host machine running a serial monitor - the default baud rate is 115200.

A good serial monitor is CoolTerm, and it should be configured with the following settings:

[image: CoolTerm connection settings]

The actual sampling

A sound can be sampled by clearing the serial monitor, playing the sound, and disconnecting the serial monitor. The result - e.g. a sample of the happy emoji sound is shown below - is then copied into a spreadsheet (Excel or Google Sheets) to be graphed.

If using the .hex provided, play the sound at a higher volume or closer to the micro:bit: its noise thresholds are higher than usual to filter out more noise, which makes it easier to find where the sound starts and ends.

Multiple samples are needed to find which parts of the sound are consistent across plays - that's because of the randomness in the generation of the sounds.

[image: serial capture of the happy emoji sound]

Analysing the results

Identifying a consistent part and aligning the samples

Once a couple of samples of the sound are in the spreadsheet, they can be graphed to see the sound's shape. Graphing all of them looks like this:

[image: all samples graphed, before alignment]

Although the first half seems random, the sound can be recognised by its final part, which is less random. After aligning the samples so that their final parts match (shifting each column up or down by a few rows), it should look like:

[image: samples aligned on the final part]

To mark where the first sequence starts, it's a good idea to insert an empty row there, which makes it look like this:

[image: aligned samples with an empty row marking the start of the first sequence]

To allow for deviations from these samples, the sound is further broken down at a "checkpoint" - a frequency that all samples reach. This looks like:

[image: samples split at the checkpoint]

The columns would now look like:

[image: the resulting spreadsheet columns, one group per sequence]
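In code, these two sequences end up as the two top-level groups of the happy_samples array shown in the last section.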

Removing redundant samples

As some of the samples are quite similar, they can be removed. To do this, it is useful to first copy each sequence into separate columns.

[image: each sequence copied into separate columns]

For the first sequence, the first two samples are the same, so one of them can be removed. The fifth is the same as those two but one data point shorter, so the remaining one of the first two can be removed as well (keeping the fifth). When choosing which samples to remove, it is usually better to keep the shorter one, because the matching algorithm tries to match the sampled frequencies either exactly one after another or with one extra frequency (which can be anything) in between.

For the second sequence, the last two samples are the same, so one of them can be removed. Furthermore, when graphing the rest of the samples - see below - most of them are quite similar, deviating by only ~20 Hz. This can be accommodated by setting a threshold of at least 25 Hz for this sequence - although a threshold of ~70-80 Hz would be better in cases where there's more noise, since a larger threshold is safer. In this case, only the 3rd sample and any one of the other samples would do.

[image: graphs of the remaining second-sequence samples]
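To make the numbers concrete: with a 25 Hz threshold, a sampled data point of 2394 Hz accepts any heard frequency between 2369 Hz and 2419 Hz, so samples that differ by only ~20 Hz are covered by a single representative.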

After removing the redundant samples, the spreadsheet looks like:

[image: the spreadsheet after removing redundant samples]

Adding the sound to the recogniser

The code used to add the happy sound (in the EmojiRecogniser class) is:

const uint8_t happy_sequences = 2;
const uint8_t happy_max_deviations = 2;

// The samples from the spreadsheet; the first value of each row is the number
// of data points that follow.
uint16_t happy_samples[happy_sequences][2][8] = {
    {
        { 4, 2121, 2394, 2646, 2646},
        { 5, 2121, 2373, 2373, 2646, 2646}
    },
    {
        { 7, 2646, 2835, 2646, 2646, 2394, 2394, 2394},
        { 7, 2646, 2835, 2835, 2646, 2394, 2373, 2394}
    }
};

uint16_t happy_thresholds[happy_sequences] = {
    40,
    50
};

uint8_t happy_deviations[happy_sequences] = {
    1,
    2
};

uint8_t happy_nr_samples[happy_sequences] = {
    2,
    2
};


void EmojiRecogniser::addHappySound() {

    // Register the sound under the next free slot.
    uint8_t it = sounds_size;
    sounds_size++;
    sounds_names[it] = new ManagedString("happy");

    // The history buffer has to fit the longest sample (whose length is the
    // first value of its row) plus some slack.
    uint8_t history = 0;
    for(uint8_t i = 0; i < happy_sequences; i++)
        for(uint8_t j = 0; j < happy_nr_samples[i]; j++)
            history = max(history, happy_samples[i][j][0] + 4);

    sounds[it] = new Sound(happy_sequences, happy_max_deviations, history, true);

    for(uint8_t i = 0; i < happy_sequences; i++){
        sounds[it]->sequences[i] = new SoundSequence(happy_nr_samples[i], happy_thresholds[i], happy_deviations[i]);
        for(uint8_t j = 0; j < happy_nr_samples[i]; j++)
            // Skip the leading length value: the data points start at index 1.
            sounds[it]->sequences[i]->samples[j] = new SoundSample(happy_samples[i][j] + 1, happy_samples[i][j][0]);
    }
}
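For context, a sketch of how such a method gets called - the constructor below is illustrative, and addSadSound() is a hypothetical sibling of addHappySound():

EmojiRecogniser::EmojiRecogniser(MicroBitAudioProcessor& processor)
    : MicroBitSoundRecogniser(processor)
{
    addHappySound();
    addSadSound();   // one such method per emoji sound
}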

The constants are:

  • happy_sequences - the number of sequences in the sound
  • happy_max_deviations - the maximum number of deviations allowed across the whole sound (a deviation is a data point that is more than the threshold away from the sampled frequency)
  • happy_samples - the samples from the spreadsheet
  • happy_thresholds - the threshold for each sequence (i.e. how many Hz off the sampled frequency a data point is allowed to be)
  • happy_deviations - the maximum number of deviations allowed for each sequence; the deviations have to satisfy both this and happy_max_deviations (see the sketch after this list)
  • happy_nr_samples - the number of samples in each sequence
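To make the thresholds and deviations concrete, here is a simplified matcher. It is illustrative only - as noted above, the real algorithm can also accept one extra in-between frequency - but the counting of deviations against the threshold is the idea:

#include <stdint.h>

// A data point deviates if it is more than `threshold` Hz away from the
// sampled frequency; the sample still matches as long as at most
// `maxDeviations` points deviate.
bool matchesSample(const uint16_t* heard, const uint16_t* sample,
                   uint8_t length, uint16_t threshold, uint8_t maxDeviations)
{
    uint8_t deviations = 0;
    for (uint8_t i = 0; i < length; i++) {
        uint16_t diff = heard[i] > sample[i] ? heard[i] - sample[i]
                                             : sample[i] - heard[i];
        if (diff > threshold)
            deviations++;
    }
    return deviations <= maxDeviations;
}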

To help copy the data from the spreadsheet into happy_samples, a formula that builds each row of the array can be used - in Google Sheets that would be = CONCATENATE("{ ",COUNT(J$6:J), ", ", textjoin(", ", 1, J$6:J), "}, "):

[image: the formula applied in Google Sheets]
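For the happy sound, one column produces e.g. { 4, 2121, 2394, 2646, 2646}, - a row that can be pasted straight into happy_samples.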

Attachments

Attached here are the .hex to stream the frequencies from the micro:bit and the spreadsheet - Google Sheets, actually - that I used to sample the happy sound.

MICROBIT-STREAM_FEQUENCIES.hex.zip

happy-sound-sample
