Wav2vec mapping code #12

k0ngolab · 2023-09-21T06:02:42Z

Hello,

Could you please add the code to train wav2vec mapping in deepspeech?

Thank you.

Elsaam2y · 2023-09-24T22:01:32Z

Hi,

I am currently in the process of replacing wav2vec with a better solution that supports other languages. If it works, I will soon add a new model with a new mapping, along with the training code. Otherwise I will update the repo with the wav2vec mapping.

Elsaam2y · 2023-11-18T23:06:31Z

I tried retraining the model and syncnet with the latest version of deepspeech, but the results were worse than with the originally trained model. The generalization and the expressiveness of the lip motion were not very convincing. An alternative would be to train a mapping model from the latest version of deepspeech to the original version used with DINet. This would keep the same trained DINet model while keeping inference fast, since the latest version of deepspeech supports GPU and ONNX. I haven't had time to test it yet, but feel free to give it a try and open a PR.
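
The mapping idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from this repo: it assumes paired, frame-aligned features from both deepspeech versions are available, that a linear (ridge regression) map is expressive enough, and the dimensions `NEW_DIM` and `OLD_DIM` are hypothetical placeholders. Random synthetic arrays stand in for real features.

```python
import numpy as np

def fit_linear_mapping(new_feats, old_feats, ridge=1e-3):
    """Fit W, b so that new_feats @ W + b approximates old_feats.

    new_feats: (n_frames, new_dim) features from the newer extractor.
    old_feats: (n_frames, old_dim) features from the original extractor.
    """
    # Append a constant column so the bias is learned together with the weights.
    X = np.hstack([new_feats, np.ones((new_feats.shape[0], 1))])
    # Closed-form ridge solution: (X^T X + ridge * I)^-1 X^T Y.
    # For simplicity the bias term is regularized as well.
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    params = np.linalg.solve(gram, X.T @ old_feats)
    return params[:-1], params[-1]

def apply_mapping(new_feats, W, b):
    """Project newer features into the original feature space."""
    return new_feats @ W + b

# Synthetic stand-in data; real use would load paired features per audio frame.
rng = np.random.default_rng(0)
NEW_DIM, OLD_DIM, N_FRAMES = 32, 29, 500  # hypothetical sizes
new_feats = rng.normal(size=(N_FRAMES, NEW_DIM))
true_W = rng.normal(size=(NEW_DIM, OLD_DIM))
old_feats = new_feats @ true_W + 0.01 * rng.normal(size=(N_FRAMES, OLD_DIM))

W, b = fit_linear_mapping(new_feats, old_feats)
mapped = apply_mapping(new_feats, W, b)
mse = float(np.mean((mapped - old_feats) ** 2))
print("mapped shape:", mapped.shape, "mse:", mse)
```

If a linear map turns out to be too weak, the same interface could wrap a small neural network instead; the point is only that the trained DINet model keeps receiving features in the space it was trained on.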

tailangjun · 2024-04-01T04:19:45Z

May I ask which version of deepspeech you ended up using, and how you resolved the dimension mismatch during training? Thank you.

Elsaam2y · 2024-04-24T07:30:01Z

I was using 0.9.1, and the dimension issue arises mainly with other languages, such as Chinese. I tried learning a mapping from the obtained features to the expected dimensions, but this didn't always work well. Furthermore, deepspeech seems to cause problems with many different languages, which is why I am trying to rely mainly on mel spectrograms at the moment.
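
For readers unfamiliar with the mel spectrogram alternative mentioned above, here is a minimal NumPy-only sketch of a log-mel feature extractor. It is not the code used in this repo; the sample rate, `n_fft`, `hop`, and `n_mels` values are assumptions chosen for illustration, and a library such as librosa would normally be used instead.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr / 2."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Each filter is the positive part of the lower of the two slopes.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(wave, sr=16000, n_fft=800, hop=200, n_mels=80):
    """Return an (n_frames, n_mels) array of log-mel energies."""
    wave = np.asarray(wave, dtype=np.float64)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Build overlapping frames with index arithmetic, then apply a Hann window.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # A small constant avoids log(0) in silent bands.
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone as a stand-in for real speech audio.
sr = 16000
t = np.arange(sr) / sr
features = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print("log-mel feature shape:", features.shape)
```

Because these features depend only on the signal's spectrum and not on a speech recognizer trained on one language, they avoid the per-language dimension problems described above.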

PengYicong · 2024-10-11T08:47:43Z

I'm curious what the difference is between the original DS model used in DINet and the 0.9.1 version. Do they output the same result given the same input audio? If so, since the later version of the DS model supports GPU and ONNX, it would already benefit from the speed improvement. Otherwise, maybe it's better to train end-to-end using a language-agnostic feature like HuBERT?
