Commit 62c96cc

Restore src_features for v3.0 (#2308)
* Restored src_features for v3
1 parent 563d207 commit 62c96cc

24 files changed, +508 -336 lines

.github/workflows/push.yml
+17 -5

```diff
@@ -47,7 +47,6 @@ jobs:
           -save_data /tmp/onmt_feat \
           -src_vocab /tmp/onmt_feat.vocab.src \
           -tgt_vocab /tmp/onmt_feat.vocab.tgt \
-          -src_feats_vocab '{"feat0": "/tmp/onmt_feat.vocab.feat0"}' \
           -n_sample -1 \
           && rm -rf /tmp/sample
       - name: Test field/transform dump
@@ -259,21 +258,34 @@ jobs:
           -config data/features_data.yaml \
           -src_vocab /tmp/onmt_feat.vocab.src \
           -tgt_vocab /tmp/onmt_feat.vocab.tgt \
-          -src_feats_vocab '{"feat0": "/tmp/onmt_feat.vocab.feat0"}' \
           -src_vocab_size 1000 -tgt_vocab_size 1000 \
           -hidden_size 2 -batch_size 10 \
           -num_workers 0 -bucket_size 1024 \
           -word_vec_size 5 -hidden_size 10 \
           -report_every 5 -train_steps 10 \
           -save_model /tmp/onmt.model \
           -save_checkpoint_steps 10
+      - name: Testing training with features and dynamic scoring
+        run: |
+          python onmt/bin/train.py \
+          -config data/features_data.yaml \
+          -src_vocab /tmp/onmt_feat.vocab.src \
+          -tgt_vocab /tmp/onmt_feat.vocab.tgt \
+          -src_vocab_size 1000 -tgt_vocab_size 1000 \
+          -hidden_size 2 -batch_size 10 \
+          -word_vec_size 5 -hidden_size 10 \
+          -num_workers 0 -bucket_size 1024 \
+          -report_every 5 -train_steps 10 \
+          -train_metrics "BLEU" "TER" \
+          -valid_metrics "BLEU" "TER" \
+          -save_model /tmp/onmt.model \
+          -save_checkpoint_steps 10
       - name: Testing translation with features
         run: |
           python translate.py \
           -model /tmp/onmt.model_step_10.pt \
-          -src data/data_features/src-test.txt \
-          -src_feats "{'feat0': 'data/data_features/src-test.feat0'}" \
-          -verbose
+          -src data/data_features/src-test-with-feats.txt \
+          -n_src_feats 1 -verbose
       - name: Test RNN translation
         run: |
           head data/src-test.txt > /tmp/src-test.txt
```
data/data_features/src-test-with-feats.txt
+1 (new file)

```diff
@@ -0,0 +1 @@
+she│C is│B a│A hard-working.│B
```

data/data_features/src-test.feat0
-1
This file was deleted.

data/data_features/src-train-with-feats.txt
+3 (new file)

```diff
@@ -0,0 +1,3 @@
+however,│A according│A to│A the│A logs,│B she│A is│A a│A hard-working.│C
+however,│A according│B to│C the│D logs,│E
+she│C is│B a│A hard-working.│B
```

data/data_features/src-train.feat0
-3
This file was deleted.

data/data_features/src-val-with-feats.txt
+1 (new file)

```diff
@@ -0,0 +1 @@
+she│C is│B a│A hard-working.│B
```

data/data_features/src-val.feat0
-1
This file was deleted.
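Each `-with-feats` file above inlines the features right next to the tokens they annotate. As a minimal sketch of what that format encodes, the hypothetical `split_feats` helper below (assuming the `│` separator used in these files; the real parsing lives in OpenNMT-py's inputters) separates such a line into a bare source line plus parallel feature streams:

```python
SEP = "│"  # feature separator used in the data files above

def split_feats(line, n_src_feats=1):
    """Split 'tok│f1│…│fn' tokens into a bare line and n feature lines."""
    columns = [tok.split(SEP) for tok in line.split()]
    # every token must carry exactly n_src_feats feature values
    assert all(len(col) == 1 + n_src_feats for col in columns)
    words = " ".join(col[0] for col in columns)
    feats = [" ".join(col[1 + i] for col in columns)
             for i in range(n_src_feats)]
    return words, feats

print(split_feats("she│C is│B a│A hard-working.│B"))
# ('she is a hard-working.', ['C B A B'])
```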

data/features_data.yaml
+12 -4

```diff
@@ -1,11 +1,19 @@
+
 # Corpus opts:
 data:
     corpus_1:
+        path_src: data/data_features/src-train-with-feats.txt
+        path_tgt: data/data_features/tgt-train.txt
+        transforms: [inferfeats]
+    corpus_2:
         path_src: data/data_features/src-train.txt
         path_tgt: data/data_features/tgt-train.txt
-        src_feats:
-            feat0: data/data_features/src-train.feat0
-        transforms: [filterfeats, inferfeats]
+        transforms: [inferfeats]
     valid:
-        path_src: data/data_features/src-val.txt
+        path_src: data/data_features/src-val-with-feats.txt
         path_tgt: data/data_features/tgt-val.txt
+        transforms: [inferfeats]
+
+# Feats options
+n_src_feats: 1
+src_feats_defaults: "0"
```
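In this config, `corpus_2` reuses the same training text without inline features, and `src_feats_defaults` supplies the missing values. The sketch below illustrates that padding semantic with a hypothetical `add_default_feats` helper; it assumes defaults for multiple features are themselves joined by `│` (as the `"0│1"` two-feature example later in this commit suggests), which is an inference, not code from this diff:

```python
# Hypothetical illustration: attach default feature values to every token of
# an unannotated line. The real logic lives in OpenNMT-py's transforms.
def add_default_feats(line, defaults, n_src_feats):
    feat_values = defaults.split("│")
    assert len(feat_values) == n_src_feats
    return " ".join("│".join([tok] + feat_values) for tok in line.split())

print(add_default_feats("she is a hard-working.", "0", n_src_feats=1))
# she│0 is│0 a│0 hard-working.│0
```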

docs/source/FAQ.md
+28 -47

````diff
@@ -620,39 +620,34 @@ Training options to perform vocabulary update are:
 
 ## How can I use source word features?
 
-Extra information can be added to the words in the source sentences by defining word features.
+Additional word-level information can be incorporated into the model by defining word features in the source sentence.
 
-Features should be defined in a separate file using blank spaces as a separator and with each row corresponding to a source sentence. An example of the input files:
+Word features must be appended to the actual textual data by using the special character │ as a feature separator. For instance:
 
-data.src
 ```
-however, according to the logs, she is hard-working.
+however│C ■,│N according│L to│L the│L logs│L ■,│N she│L is│L hard-working│L ■.│N
 ```
 
-feat.txt
+Prior tokenization is not necessary; features will be inferred by the `FeatInferTransform` transform if tokenization has been applied. For instance:
+
 ```
-A C C C C A A B
+SRC: however,│C according│L to│L the│L logs,│L she│L is│L hard-working.│L
+TOKENIZED SRC: however ■, according to the logs ■, she is hard-working ■.
+RESULT: however│C ■,│C according│L to│L the│L logs│L ■,│L she│L is│L hard│L ■-■│L working│L ■.│L
 ```
 
-Prior tokenization is not necessary, features will be inferred by using the `FeatInferTransform` transform if tokenization has been applied.
+**Options**
+- `-n_src_feats`: the expected number of source features per token.
+- `-src_feats_defaults` (optional): provides default values for features. This is especially useful when mixing task-specific data (with features) with general data that has not been annotated.
 
-No previous tokenization:
-```
-SRC: this is a test.
-FEATS: A A A B
-TOKENIZED SRC: this is a test ■.
-RESULT: A A A B <null>
-```
+For the Transformer architecture, make sure the following options are appropriately set:
 
-Previously tokenized:
-```
-SRC: this is a test ■.
-FEATS: A A A B A
-RESULT: A A A B A
-```
+- `src_word_vec_size` and `tgt_word_vec_size`, or `word_vec_size`
+- `feat_merge`: how to merge the feature vectors
+- `feat_vec_size`, or alternatively `feat_vec_exponent`
 
 **Notes**
-- `FilterFeatsTransform` and `FeatInferTransform` are required in order to ensure the functionality.
+- The `FeatInferTransform` transform is required for the functionality.
 - Not possible to do shared embeddings (at least with `feat_merge: concat` method)
 
 Sample config file:
@@ -662,50 +657,36 @@ data:
     dummy:
         path_src: data/train/data.src
         path_tgt: data/train/data.tgt
-        src_feats:
-            feat_0: data/train/data.src.feat_0
-            feat_1: data/train/data.src.feat_1
-        transforms: [filterfeats, onmt_tokenize, inferfeats, filtertoolong]
+        transforms: [onmt_tokenize, inferfeats, filtertoolong]
         weight: 1
     valid:
         path_src: data/valid/data.src
         path_tgt: data/valid/data.tgt
-        src_feats:
-            feat_0: data/valid/data.src.feat_0
-            feat_1: data/valid/data.src.feat_1
-        transforms: [filterfeats, onmt_tokenize, inferfeats]
+        transforms: [onmt_tokenize, inferfeats]
 
 # Transform options
 reversible_tokenization: "joiner"
-prior_tokenization: true
 
 # Vocab opts
 src_vocab: exp/data.vocab.src
 tgt_vocab: exp/data.vocab.tgt
-src_feats_vocab:
-    feat_0: exp/data.vocab.feat_0
-    feat_1: exp/data.vocab.feat_1
+
+# Features options
+n_src_feats: 2
+src_feats_defaults: "0│1"
 feat_merge: "sum"
 ```
 
-During inference you can pass features by using the `--src_feats` argument. `src_feats` is expected to be a Python like dict, mapping feature names with their data file.
+To allow source features in the server, add the following parameters to the server's config file:
 
 ```
-{'feat_0': '../data.txt.feats0', 'feat_1': '../data.txt.feats1'}
-```
-
-**Important note!** During inference, input sentence is expected to be tokenized. Therefore feature inferring should be handled prior to running the translate command. Example:
-
-```bash
-python translate.py -model model_step_10.pt -src ../data.txt.tok -output ../data.out --src_feats "{'feat_0': '../data.txt.feats0', 'feat_1': '../data.txt.feats1'}"
+"features": {
+    "n_src_feats": 2,
+    "src_feats_defaults": "0│1",
+    "reversible_tokenization": "joiner"
+}
 ```
 
-When using the Transformer architecture make sure the following options are appropriately set:
-
-- `src_word_vec_size` and `tgt_word_vec_size` or `word_vec_size`
-- `feat_merge`: how to handle features vecs
-- `feat_vec_size` and maybe `feat_vec_exponent`
-
 ## How can I set up a translation server ?
 A REST server was implemented to serve OpenNMT-py models. A discussion is opened on the OpenNMT forum: [discussion link](https://forum.opennmt.net/t/simple-opennmt-py-rest-server/1392).
 
````
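To make the `SRC` → `TOKENIZED SRC` → `RESULT` example in the FAQ above concrete, here is a simplified, self-contained re-implementation of the propagation idea: each word's feature is copied onto every subword the tokenizer produced from it. This is an illustration only, not OpenNMT-py's actual `FeatInferTransform`; it assumes one feature per token and the `■` joiner convention shown above:

```python
SEP = "│"      # feature separator
JOINER = "■"   # pyonmttok joiner marking where a subword attaches

def infer_feats(annotated, tokenized):
    # (word, feature) pairs from the annotated, untokenized source
    pairs = [tok.rsplit(SEP, 1) for tok in annotated.split()]
    word, feat = pairs.pop(0)
    out = []
    for piece in tokenized.split():
        core = piece.replace(JOINER, "")  # joiners do not consume characters
        if not word:                      # current word fully consumed
            word, feat = pairs.pop(0)
        assert word.startswith(core), (word, core)
        word = word[len(core):]           # consume the matched characters
        out.append(piece + SEP + feat)
    return " ".join(out)

src = "however,│C according│L to│L the│L logs,│L she│L is│L hard-working.│L"
tok = "however ■, according to the logs ■, she is hard-working ■."
print(infer_feats(src, tok))
# however│C ■,│C according│L to│L the│L logs│L ■,│L she│L is│L hard│L ■-■│L working│L ■.│L
```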

onmt/bin/build_vocab.py
+24 -30

```diff
@@ -6,10 +6,10 @@
 from onmt.utils.parse import ArgumentParser
 from onmt.opts import dynamic_prepare_opts
 from onmt.inputters.text_corpus import build_corpora_iters, get_corpora
-from onmt.inputters.text_utils import process
+from onmt.inputters.text_utils import process, append_features_to_text
 from onmt.transforms import make_transforms, get_transforms_cls
 from onmt.constants import CorpusName, CorpusTask
-from collections import Counter, defaultdict
+from collections import Counter
 import multiprocessing as mp
 
 
@@ -40,21 +40,11 @@ def write_files_from_queues(sample_path, queues):
             break
 
 
-# Just for debugging purposes
-# It appends features to subwords when dumping to file
-def append_features_to_example(example, features):
-    ex_toks = example.split(' ')
-    feat_toks = features.split(' ')
-    toks = [f"{subword}{feat}" for subword, feat in
-            zip(ex_toks, feat_toks)]
-    return " ".join(toks)
-
-
 def build_sub_vocab(corpora, transforms, opts, n_sample, stride, offset):
     """Build vocab on (strided) subpart of the data."""
     sub_counter_src = Counter()
     sub_counter_tgt = Counter()
-    sub_counter_src_feats = defaultdict(Counter)
+    sub_counter_src_feats = [Counter() for _ in range(opts.n_src_feats)]
     datasets_iterables = build_corpora_iters(
         corpora, transforms, opts.data,
         skip_empty_level=opts.skip_empty_level,
@@ -70,19 +60,22 @@ def build_sub_vocab(corpora, transforms, opts, n_sample, stride, offset):
                 continue
             src_line, tgt_line = (maybe_example['src']['src'],
                                   maybe_example['tgt']['tgt'])
-            src_line_pretty = src_line
-            for feat_name, feat_line in maybe_example["src"].items():
-                if feat_name not in ["src", "src_original"]:
-                    sub_counter_src_feats[feat_name].update(
-                        feat_line.split(' '))
-                    if opts.dump_samples:
-                        src_line_pretty = append_features_to_example(
-                            src_line_pretty, feat_line)
             sub_counter_src.update(src_line.split(' '))
             sub_counter_tgt.update(tgt_line.split(' '))
+
+            if 'feats' in maybe_example['src']:
+                src_feats_lines = maybe_example['src']['feats']
+                for i in range(opts.n_src_feats):
+                    sub_counter_src_feats[i].update(
+                        src_feats_lines[i].split(' '))
+            else:
+                src_feats_lines = []
+
             if opts.dump_samples:
+                src_pretty_line = append_features_to_text(
+                    src_line, src_feats_lines)
                 build_sub_vocab.queues[c_name][offset].put(
-                    (i, src_line_pretty, tgt_line))
+                    (i, src_pretty_line, tgt_line))
             if n_sample > 0 and ((i+1) * stride + offset) >= n_sample:
                 if opts.dump_samples:
                     build_sub_vocab.queues[c_name][offset].put("break")
@@ -113,7 +106,7 @@ def build_vocab(opts, transforms, n_sample=3):
     corpora = get_corpora(opts, task=CorpusTask.TRAIN)
     counter_src = Counter()
     counter_tgt = Counter()
-    counter_src_feats = defaultdict(Counter)
+    counter_src_feats = [Counter() for _ in range(opts.n_src_feats)]
     from functools import partial
     queues = {c_name: [mp.Queue(opts.vocab_sample_queue_size)
                        for i in range(opts.num_threads)]
@@ -134,7 +127,8 @@ def build_vocab(opts, transforms, n_sample=3):
             func, range(0, opts.num_threads)):
         counter_src.update(sub_counter_src)
         counter_tgt.update(sub_counter_tgt)
-        counter_src_feats.update(sub_counter_src_feats)
+        for i in range(opts.n_src_feats):
+            counter_src_feats[i].update(sub_counter_src_feats[i])
     if opts.dump_samples:
         write_process.join()
     return counter_src, counter_tgt, counter_src_feats
@@ -166,10 +160,10 @@ def build_vocab_main(opts):
     src_counter, tgt_counter, src_feats_counter = build_vocab(
         opts, transforms, n_sample=opts.n_sample)
 
-    logger.info(f"Counters src:{len(src_counter)}")
-    logger.info(f"Counters tgt:{len(tgt_counter)}")
-    for feat_name, feat_counter in src_feats_counter.items():
-        logger.info(f"Counters {feat_name}:{len(feat_counter)}")
+    logger.info(f"Counters src: {len(src_counter)}")
+    logger.info(f"Counters tgt: {len(tgt_counter)}")
+    for i, feat_counter in enumerate(src_feats_counter):
+        logger.info(f"Counters src feat_{i}: {len(feat_counter)}")
 
     def save_counter(counter, save_path):
         check_path(save_path, exist_ok=opts.overwrite, log=logger.warning)
@@ -186,8 +180,8 @@ def save_counter(counter, save_path):
     save_counter(src_counter, opts.src_vocab)
     save_counter(tgt_counter, opts.tgt_vocab)
 
-    for k, v in src_feats_counter.items():
-        save_counter(v, opts.src_feats_vocab[k])
+    for i, c in enumerate(src_feats_counter):
+        save_counter(c, f"{opts.src_vocab}_feat{i}")
 
 
 def _get_parser():
```
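`append_features_to_text` replaces the removed debug helper, but its body lives in `onmt/inputters/text_utils.py` and is not part of this diff. A plausible minimal sketch, assuming it mirrors the old `append_features_to_example` with the `│` separator and the new list-of-feature-lines layout (a hypothetical reconstruction, not the actual implementation):

```python
SEP = "│"

def append_features_to_text(text, features):
    # features is a list of per-feature strings, each space-separated and
    # aligned with the tokens of text; empty means "no features to attach"
    if not features:
        return text
    tokens = text.split(' ')
    # zip(*...) regroups the i-th value of every feature line per token
    feats_per_token = zip(*(f.split(' ') for f in features))
    return " ".join(SEP.join((tok,) + feats)
                    for tok, feats in zip(tokens, feats_per_token))

print(append_features_to_text("she is a hard-working.", ["C B A B"]))
# she│C is│B a│A hard-working.│B
```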

onmt/inputters/inputter.py
+14 -18

```diff
@@ -34,11 +34,10 @@ def build_vocab(opt, specials):
     """ Build vocabs dict to be stored in the checkpoint
         based on vocab files having each line [token, count]
     Args:
-        opt: src_vocab, tgt_vocab, src_feats_vocab
+        opt: src_vocab, tgt_vocab, n_src_feats
     Return:
         vocabs: {'src': pyonmttok.Vocab, 'tgt': pyonmttok.Vocab,
-                 'src_feats' : {'feat0': pyonmttok.Vocab,
-                                'feat1': pyonmttok.Vocab, ...},
+                 'src_feats' : [pyonmttok.Vocab, ...],
                  'data_task': seq2seq or lm
                 }
     """
@@ -85,10 +84,10 @@ def _pad_vocab_to_multiple(vocab, multiple):
                                          opt.vocab_size_multiple)
     vocabs['tgt'] = tgt_vocab
 
-    if opt.src_feats_vocab:
-        src_feats = {}
-        for feat_name, filepath in opt.src_feats_vocab.items():
-            src_f_vocab = _read_vocab_file(filepath, 1)
+    if opt.n_src_feats > 0:
+        src_feats_vocabs = []
+        for i in range(opt.n_src_feats):
+            src_f_vocab = _read_vocab_file(f"{opt.src_vocab}_feat{i}", 1)
             src_f_vocab = pyonmttok.build_vocab_from_tokens(
                 src_f_vocab,
                 maximum_size=0,
@@ -101,8 +100,8 @@ def _pad_vocab_to_multiple(vocab, multiple):
             if opt.vocab_size_multiple > 1:
                 src_f_vocab = _pad_vocab_to_multiple(src_f_vocab,
                                                      opt.vocab_size_multiple)
-            src_feats[feat_name] = src_f_vocab
-        vocabs['src_feats'] = src_feats
+            src_feats_vocabs.append(src_f_vocab)
+        vocabs["src_feats"] = src_feats_vocabs
 
     vocabs['data_task'] = opt.data_task
 
@@ -146,10 +145,8 @@ def vocabs_to_dict(vocabs):
     vocabs_dict['src'] = vocabs['src'].ids_to_tokens
     vocabs_dict['tgt'] = vocabs['tgt'].ids_to_tokens
     if 'src_feats' in vocabs.keys():
-        vocabs_dict['src_feats'] = {}
-        for feat in vocabs['src_feats'].keys():
-            vocabs_dict['src_feats'][feat] = \
-                vocabs['src_feats'][feat].ids_to_tokens
+        vocabs_dict['src_feats'] = [feat_vocab.ids_to_tokens
+                                    for feat_vocab in vocabs['src_feats']]
     vocabs_dict['data_task'] = vocabs['data_task']
     return vocabs_dict
 
@@ -167,9 +164,8 @@ def dict_to_vocabs(vocabs_dict):
     else:
         vocabs['tgt'] = pyonmttok.build_vocab_from_tokens(vocabs_dict['tgt'])
     if 'src_feats' in vocabs_dict.keys():
-        vocabs['src_feats'] = {}
-        for feat in vocabs_dict['src_feats'].keys():
-            vocabs['src_feats'][feat] = \
-                pyonmttok.build_vocab_from_tokens(
-                    vocabs_dict['src_feats'][feat])
+        vocabs['src_feats'] = []
+        for feat_vocab in vocabs_dict['src_feats']:
+            vocabs['src_feats'].append(
+                pyonmttok.build_vocab_from_tokens(feat_vocab))
     return vocabs
```
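One consequence of dropping `src_feats_vocab`, visible in both files above: per-feature vocab paths are no longer configured explicitly but derived from `src_vocab` by suffixing `_feat{i}` (written by `save_counter` in `build_vocab.py`, read back by `_read_vocab_file` here). A one-line illustration of the convention, using the paths from the workflow at the top of this commit:

```python
# Derive per-feature vocab paths from the main source vocab path, matching
# f"{opts.src_vocab}_feat{i}" in build_vocab.py and inputter.py above.
def feat_vocab_paths(src_vocab, n_src_feats):
    return [f"{src_vocab}_feat{i}" for i in range(n_src_feats)]

print(feat_vocab_paths("/tmp/onmt_feat.vocab.src", 1))
# ['/tmp/onmt_feat.vocab.src_feat0']
```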
