Commit 379c83f

[docs] Docs for various MMF Transformer configurations

---
id: configuring_transformers
title: 'Configuring Multimodal Transformers'
sidebar_label: Configuring Multimodal Transformers
---

MMF Transformer (`mmf_transformer` or `mmft`) is a generalization of multimodal transformer models such as MMBT and VisualBERT. It provides a customizable framework that supports the following usability features:

- Supports an arbitrary number and type of modalities
- Allows easy switching between different transformer base models (BERT, RoBERTa, XLM-R, etc.)
- Supports different backend libraries (Huggingface, PyText, Fairseq)
- Supports both pretraining and finetuning

In this note, we will go over each aspect and understand how to configure it.
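
For orientation, here is a minimal sketch of how these pieces fit together in a single config. The values are illustrative, taken from the examples in the sections that follow:

```yaml
model_config:
  mmf_transformer:
    transformer_base: bert-base-uncased  # base transformer architecture
    backend:
      type: huggingface                  # backend library
    modalities:                          # one entry per input modality
    - type: text
      key: text
```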

## Configuring Modality Embeddings

MMFT uses three types of embeddings for each modality: feature embeddings (input tokens), position embeddings (position tokens), and type embeddings (segment tokens). These are computed from three types of tokens:

- Input ID tokens (modality features)
- Position ID tokens (position embedding of the modality features)
- Segment ID tokens (token type, to differentiate between the modalities)

Modality-specific feature embeddings are generated either while preprocessing the sample or by the different image and text encoders available in MMF. Position embeddings can also be provided during preprocessing; otherwise MMFT will generate default position embeddings. Type embeddings are optional: segment IDs can be specified explicitly in the config, or MMFT can assign them sequentially in the order the modalities are added in the config.

Here is an example config for adding different modalities:

```yaml
model_config:
  mmf_transformer:
    modalities:
    - type: text
      key: text
      position_dim: 128
      segment_id: 0
      layer_norm_eps: 1e-12
      hidden_dropout_prob: 0.1
    - type: image
      key: image
      embedding_dim: 2048
      position_dim: 128
      segment_id: 1
      layer_norm_eps: 1e-12
      hidden_dropout_prob: 0.1
      encoder:
        type: resnet152
        params:
          pretrained: true
          pool_type: avg
          num_output_features: 49
          in_dim: 2048
```
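
In this example, the image modality's `encoder` block builds a pretrained ResNet-152 whose feature map is average-pooled into 49 output features of dimension 2048 (matching `embedding_dim`); these serve as the image modality's input features.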

Here is another example that configures MMFT to train on three different modalities (text, OCR text, and images):

```yaml
model_config:
  mmf_transformer:
    modalities:
    - type: text
      key: text
      position_dim: 64
      segment_id: 0
      layer_norm_eps: 1e-12
      hidden_dropout_prob: 0.1
    - type: text
      key: ocr
      position_dim: 64
      segment_id: 1
      layer_norm_eps: 1e-12
      hidden_dropout_prob: 0.1
    - type: image
      key: image
      embedding_dim: 2048
      position_dim: 64
      segment_id: 2
      layer_norm_eps: 1e-12
      hidden_dropout_prob: 0.1
```

Text (`text`) will have segment ID 0, OCR text (`ocr`) segment ID 1, and image (`image`) segment ID 2, in order to differentiate between the modalities.
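
Since type embeddings are optional, the same setup can instead rely on the sequential assignment described above. Here is a sketch, assuming the default assigns segment IDs in the order the modalities are listed (other fields omitted for brevity):

```yaml
model_config:
  mmf_transformer:
    modalities:
    - type: text
      key: text   # assigned segment ID 0 (first in the list)
    - type: text
      key: ocr    # assigned segment ID 1
    - type: image
      key: image  # assigned segment ID 2
      embedding_dim: 2048
```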

## Configuring Transformer Backends

MMFT integrates different NLP libraries, namely Huggingface transformers, FairSeq, and PyText. The MMFT model's base transformer can be built with models from any of these three libraries. Here is a configuration that uses the Huggingface backend with MMFT:

```yaml
model_config:
  mmf_transformer:
    transformer_base: bert-base-uncased
    backend:
      type: huggingface
      freeze: false
      params: {}
```
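
Here the `freeze` flag controls whether the base transformer weights are updated during training: with `freeze: false` they are finetuned along with the rest of the model, while `freeze: true` keeps them fixed.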

Similarly, for the FairSeq backend the configuration can be specified as:

```yaml
model_config:
  mmf_transformer:
    backend:
      type: fairseq
      freeze: false
      model_path: <path_to_fairseq_model>
      params:
        max_seq_len: 254
        num_segments: 1
        ffn_embedding_dim: 3072
        encoder_normalize_before: True
        export: True
        traceable: True
```

:::note

FairSeq and PyText backends are not supported in OSS and will be open sourced in future releases.

:::

## Configuring Transformer Architectures

MMFT allows us to change the base transformer architecture easily. When the transformer backend is Huggingface, we can choose any transformer model from the `transformers` library to build the multimodal model; the `build_transformer()` method is optional to override, since the base class can load any Huggingface transformer model just by the name specified in the model config. Here is an example config that specifies the base transformer as BERT Base:

```yaml
model_config:
  mmf_transformer:
    transformer_base: bert-base-uncased
```

Optionally, the pretrained weights of this model will be loaded when the transformer model is initialized. Here is another example that uses RoBERTa as the base transformer:

```yaml
model_config:
  mmf_transformer:
    transformer_base: roberta-base
```
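
The XLM-R base mentioned earlier can be selected the same way, assuming its standard Huggingface identifier `xlm-roberta-base`:

```yaml
model_config:
  mmf_transformer:
    transformer_base: xlm-roberta-base
```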

## Configuring Pretraining and Finetuning Heads

[Coming soon]
