---
id: configuring_transformers
title: 'Configuring Multimodal Transformers'
sidebar_label: Configuring Multimodal Transformers
---

MMF Transformer (`mmf_transformer` or `mmft`) is a generalization of multimodal transformer models such as MMBT and VisualBERT. It provides a customizable framework that supports the following usability features:

- Supports an arbitrary number and type of modalities.
- Allows easy switching between different transformer base models (BERT, RoBERTa, XLM-R, etc.).
- Supports different backend libraries (Huggingface, PyText, Fairseq).
- Supports both pretraining and finetuning.

In this note, we will go over each of these aspects and explain how to configure them.

## Configuring Modality Embeddings

MMFT uses three types of embeddings for each modality: feature embeddings (input tokens), position embeddings (position tokens), and type embeddings (segment tokens). Each modality therefore provides three types of tokens:

- Input ID tokens (modality features)
- Position ID tokens (position embedding of the modality features)
- Segment ID tokens (token type, used to differentiate between the modalities)

Modality-specific feature embeddings are generated either while preprocessing the sample or by using the different image and text encoders available in MMF. Position embeddings can also be provided during preprocessing; otherwise, MMFT will generate default position embeddings. Type embeddings are optional. When added, segment IDs can either be specified explicitly in the config, or MMFT can assign them sequentially in the order the modalities appear in the config.
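To make the sequential assignment concrete, here is a minimal Python sketch of that behavior. The helper name and dict-based config representation are hypothetical, not MMF's actual implementation:

```python
# Hypothetical sketch: when a modality entry omits `segment_id`, assign it
# the index of its position in the `modalities` list; explicit IDs win.

def assign_segment_ids(modalities):
    """Fill in missing segment_id fields in modality order (a sketch,
    not MMF's real code)."""
    for idx, modality in enumerate(modalities):
        modality.setdefault("segment_id", idx)
    return modalities

modalities = [
    {"type": "text", "key": "text"},
    {"type": "text", "key": "ocr"},
    {"type": "image", "key": "image"},
]
print([m["segment_id"] for m in assign_segment_ids(modalities)])
# → [0, 1, 2]
```

An explicitly configured `segment_id` is left untouched, so mixing explicit and implicit IDs behaves predictably.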

Here is an example config for adding different modalities:

```yaml
model_config:
  mmf_transformer:
    modalities:
      - type: text
        key: text
        position_dim: 128
        segment_id: 0
        layer_norm_eps: 1e-12
        hidden_dropout_prob: 0.1
      - type: image
        key: image
        embedding_dim: 2048
        position_dim: 128
        segment_id: 1
        layer_norm_eps: 1e-12
        hidden_dropout_prob: 0.1
        encoder:
          type: resnet152
          params:
            pretrained: true
            pool_type: avg
            num_output_features: 49
            in_dim: 2048
```

Here is another example that configures MMFT to train on three different modalities (text, OCR text, and images):

```yaml
model_config:
  mmf_transformer:
    modalities:
      - type: text
        key: text
        position_dim: 64
        segment_id: 0
        layer_norm_eps: 1e-12
        hidden_dropout_prob: 0.1
      - type: text
        key: ocr
        position_dim: 64
        segment_id: 1
        layer_norm_eps: 1e-12
        hidden_dropout_prob: 0.1
      - type: image
        key: image
        embedding_dim: 2048
        position_dim: 64
        segment_id: 2
        layer_norm_eps: 1e-12
        hidden_dropout_prob: 0.1
```

Text (`text`) will have segment ID 0, OCR text (`ocr`) will have segment ID 1, and image (`image`) will have segment ID 2, in order to differentiate between the modalities.
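These per-modality segment IDs translate into a token-type ID for every position of the concatenated multimodal input sequence. The following Python sketch illustrates the idea (the function name and list-based representation are hypothetical, not MMF's actual code):

```python
# Illustrative sketch: repeat each modality's segment ID once per token of
# that modality, in the order the modalities are concatenated.

def build_token_type_ids(modality_lengths, segment_ids):
    """Expand per-modality segment IDs into a per-token type-ID sequence."""
    token_type_ids = []
    for length, segment_id in zip(modality_lengths, segment_ids):
        token_type_ids.extend([segment_id] * length)
    return token_type_ids

# e.g. 4 text tokens, 3 OCR tokens, 2 image features
print(build_token_type_ids([4, 3, 2], [0, 1, 2]))
# → [0, 0, 0, 0, 1, 1, 1, 2, 2]
```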

## Configuring Transformer Backends

MMFT leverages integrations with different NLP libraries like Huggingface Transformers, Fairseq, and PyText. The MMFT model's base transformer can be built with models from any of these three libraries. Here is a configuration that uses the Huggingface backend with MMFT:

```yaml
model_config:
  mmf_transformer:
    transformer_base: bert-base-uncased
    backend:
      type: huggingface
      freeze: false
      params: {}
```
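
The `backend.type` string is what selects which library builds the base transformer. Here is a minimal Python sketch of that dispatch pattern; the registry and builder names are hypothetical, not MMF's actual API:

```python
# Hypothetical registry-based dispatch: map the `backend.type` string from
# the config to a builder function for that backend.

BACKEND_REGISTRY = {}

def register_backend(name):
    """Decorator that records a builder function under a backend name."""
    def wrapper(fn):
        BACKEND_REGISTRY[name] = fn
        return fn
    return wrapper

@register_backend("huggingface")
def build_huggingface_backend(config):
    # In MMF this would instantiate a `transformers` model named by
    # config["transformer_base"]; here we just return a description.
    return f"huggingface:{config['transformer_base']}"

def build_backend(config):
    backend_type = config["backend"]["type"]
    if backend_type not in BACKEND_REGISTRY:
        raise KeyError(f"Unknown backend: {backend_type}")
    return BACKEND_REGISTRY[backend_type](config)

config = {
    "transformer_base": "bert-base-uncased",
    "backend": {"type": "huggingface", "freeze": False, "params": {}},
}
print(build_backend(config))  # → huggingface:bert-base-uncased
```

A registry like this is why adding a new backend only requires registering one builder function, without touching the model code.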

Similarly, for the Fairseq backend the configuration can be specified as:

```yaml
model_config:
  mmf_transformer:
    backend:
      type: fairseq
      freeze: false
      model_path: <path_to_fairseq_model>
      params:
        max_seq_len: 254
        num_segments: 1
        ffn_embedding_dim: 3072
        encoder_normalize_before: true
        export: true
        traceable: true
```

:::note

The Fairseq and PyText backends are not supported in OSS and will be open-sourced in future releases.

:::

## Configuring Transformer Architectures

MMFT allows us to change the base transformer architecture easily. The `build_transformer()` method is optional to override, since the base class can load any transformer model from Huggingface `transformers` just by specifying the model name in the config. When the transformer backend is Huggingface, we can therefore choose any model from the `transformers` library to build the multimodal model. Here is an example config that specifies the base transformer as BERT Base:

```yaml
model_config:
  mmf_transformer:
    transformer_base: bert-base-uncased
```

Optionally, the pretrained weights of this model will be loaded during initialization of the transformer model. Here is another example that uses RoBERTa as the base transformer:

```yaml
model_config:
  mmf_transformer:
    transformer_base: roberta-base
```
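
Since the architecture is selected by a single config key, switching base models amounts to a one-key change while the rest of the model config stays identical. A small Python sketch of that idea (the helper is hypothetical, not MMF's API):

```python
# Hypothetical helper: produce a variant of a model config that differs
# only in its `transformer_base`, leaving the original untouched.

import copy

def with_transformer_base(model_config, base_name):
    """Return a deep copy of the config with a different transformer_base."""
    new_config = copy.deepcopy(model_config)
    new_config["mmf_transformer"]["transformer_base"] = base_name
    return new_config

bert_config = {"mmf_transformer": {"transformer_base": "bert-base-uncased"}}
roberta_config = with_transformer_base(bert_config, "roberta-base")
print(roberta_config["mmf_transformer"]["transformer_base"])
# → roberta-base
```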

## Configuring Pretraining and Finetuning Heads

[Coming soon]