Description
The `--processors` / `n_processors` option, which I initially assumed was related to core count, actually seems to split the audio into chunks and process them in parallel.
When `n_processors` is set to `1`, the correct token timestamps are produced for each segment:
{
"timestamps": {
"from": "00:00:20,000",
"to": "00:00:24,560"
},
"offsets": {
"from": 20000,
"to": 24560
},
"text": " actually took a watch out of its waistcoat pocket and look at it and then hurried on,",
"tokens": [
{
"text": " actually",
"timestamps": {
"from": "00:00:20,020",
"to": "00:00:20,530"
},
"offsets": {
"from": 20020,
"to": 20530
},
"id": 1682,
"p": 0.998559,
"t_dtw": -1
},
{
"text": " took",
"timestamps": {
"from": "00:00:20,530",
"to": "00:00:20,760"
},
"offsets": {
"from": 20530,
"to": 20760
},
"id": 1718,
"p": 0.999518,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:20,810",
"to": "00:00:20,850"
},
"offsets": {
"from": 20810,
"to": 20850
},
"id": 257,
"p": 0.999047,
"t_dtw": -1
},
When `n_processors` is set to `2`, the segment timestamps stay correct, but the token timestamps reset to zero in the segment starting at 20.930s (which is possibly the split point used):
{
"timestamps": {
"from": "00:00:20,930",
"to": "00:00:25,970"
},
"offsets": {
"from": 20930,
"to": 25970
},
"text": " watch out of its waistcoat pocket and look at it and then hurry on, Alice started to her feet,",
"tokens": [
{
"text": "[_BEG_]",
"timestamps": {
"from": "00:00:00,000",
"to": "00:00:00,000"
},
"offsets": {
"from": 0,
"to": 0
},
"id": 50363,
"p": 0.99157,
"t_dtw": -1
},
{
"text": " watch",
"timestamps": {
"from": "00:00:00,000",
"to": "00:00:00,330"
},
"offsets": {
"from": 0,
"to": 330
},
"id": 2342,
"p": 0.641946,
"t_dtw": -1
},
{
"text": " out",
"timestamps": {
"from": "00:00:00,330",
"to": "00:00:00,530"
},
"offsets": {
"from": 330,
"to": 530
},
"id": 503,
"p": 0.99322,
"t_dtw": -1
},
{
"text": " of",
"timestamps": {
"from": "00:00:00,530",
"to": "00:00:00,660"
},
"offsets": {
"from": 530,
"to": 660
},
"id": 286,
"p": 0.997212,
"t_dtw": -1
},
To work around this problem in the special case of `processors > 1`, one could try to track the `[_BEG_]` tokens, which appear to zero out the time, and add offsets relative to them, but that makes things more complex than they need to be.
In general, it would be more natural and easier to have the tokens timed correctly relative to the source audio, rather than relying on hacks that try to guess the intended timestamps.
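
For reference, here is a rough Python sketch of the `[_BEG_]` workaround described above, applied to the JSON output after the fact. It rests on assumptions I have not verified: that a `[_BEG_]` token with offset 0 only shows up in the first segment of a new chunk, and that this segment's own `offsets.from` equals the chunk start. The names `rebase_token_times` and `format_ms` are made up for this example and are not part of whisper.cpp.

```python
import copy

BEG_TOKEN = "[_BEG_]"


def format_ms(ms):
    """Format milliseconds as HH:MM:SS,mmm, like the "timestamps" strings in the JSON."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def rebase_token_times(segments):
    """Return a copy of `segments` with token times shifted to source-audio time."""
    fixed = copy.deepcopy(segments)
    base_ms = 0  # assumed start of the current chunk, in source-audio milliseconds
    for segment in fixed:
        for token in segment.get("tokens", []):
            offsets = token["offsets"]
            # Assumption: [_BEG_] at offset 0 marks the first segment of a new
            # chunk, and that segment's start is where the chunk begins.
            if token["text"] == BEG_TOKEN and offsets["from"] == 0:
                base_ms = segment["offsets"]["from"]
            offsets["from"] += base_ms
            offsets["to"] += base_ms
            token["timestamps"] = {
                "from": format_ms(offsets["from"]),
                "to": format_ms(offsets["to"]),
            }
    return fixed
```

It takes the list of segment objects shaped like the excerpts above and leaves the input untouched. If the assumptions do not hold (for example, if `[_BEG_]` at offset 0 can also appear later inside a chunk), the shifted times will be wrong, which is exactly why a fix in the library itself would be preferable.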