
Commit

remove unused code
BerenMillidge committed Jul 29, 2024
1 parent fb6ffc4 · commit 09fc1bc
Showing 25 changed files with 13 additions and 61 deletions.
README.md: 40 changes (13 additions & 27 deletions)
@@ -1,3 +1,5 @@
Zamba2-torch README

# Zamba v2 2.7B

Zamba2-2.7B is a hybrid model that combines state-space models and transformers. It broadly follows the [Zamba architecture](https://arxiv.org/abs/2405.16712), which consists of a Mamba backbone alternating with shared transformer blocks. Zamba2-2.7B brings three major improvements over Zamba1:
@@ -16,9 +18,9 @@ This is the standalone Pytorch implementation of Zamba2-2.7B. A Huggingface-comp

To begin, clone and install this repo:

1.) `git clone https://github.com/Zyphra/Zamba2.git`
1.) `git clone https://github.com/Zyphra/zamba2_torch.git`

2.) cd `Zamba2`
2.) cd `zamba2_torch`

3.) Install the repository: `pip install -e .`

@@ -36,24 +38,15 @@ import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-2.7B")
input_text = 'A funny prompt would be '
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")["input_ids"].transpose(0,1)
model = MambaModel.from_pretrained(model_name = "Zyphra/Zamba2-2.7B").cuda().half()
tokens_to_generate = 20
model.eval()
with torch.no_grad():
    for _ in range(tokens_to_generate):
        out = model(input_ids)
        out_last = out[:, -1]
        idx = torch.argmax(out_last)[None, None]
        input_ids = torch.cat((input_ids, idx), dim=0)
    input_ids = input_ids.transpose(0, 1)[0]
    print(repr(tokenizer.decode(input_ids.cpu().numpy().tolist())))
model = MambaModel.from_pretrained("Zyphra/Zamba2-2.7B").cuda().half()
input_text = "The meaning of life is"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")["input_ids"]
out = model(input_ids)
```
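
The shorter variant above stops at a single forward pass and leaves decoding to the reader. As a rough, illustrative continuation (not part of this repository's README), and assuming the forward call returns logits shaped `(batch, seq_len, vocab_size)`, which may not match every checkout of this code, a greedy next-token step could look like:

```python
# Illustrative sketch only; reuses `out`, `tokenizer`, and `torch` from the snippet above.
# Assumes `out` holds logits of shape (batch, seq_len, vocab_size).
next_token_logits = out[:, -1, :]                        # logits at the final position
next_token_id = torch.argmax(next_token_logits, dim=-1)  # greedy pick, shape (batch,)
print(tokenizer.decode(next_token_id.tolist()))          # decode the predicted token
```

The longer snippet earlier in the diff wraps essentially this step in a loop, using a `(seq_len, batch)` layout and appending each predicted token to `input_ids` before decoding the whole sequence.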

## Model Details

Zamba2-2.7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
Zamba2-2.7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with a shared attention layer. This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
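
To make the parameter-sharing idea concrete, the sketch below shows one way a shared MLP with per-position LoRA corrections could be written in PyTorch. This is a hypothetical illustration, not the repository's implementation: the class name, argument names, and dimensions are all invented for the example, and the embedding-concatenation trick mentioned above applies to the shared attention block rather than to this MLP.

```python
# Hypothetical sketch of "one shared MLP + a small LoRA adapter per shared position".
# Names and shapes are illustrative; see the repository for the real implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMLPWithLoRA(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_shared_positions: int, rank: int = 8):
        super().__init__()
        # Full-size weights, shared by every position where the block is reused.
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        # One low-rank (LoRA) correction per shared position, so each occurrence
        # can specialize slightly at a small parameter cost.
        self.lora_down = nn.ModuleList(
            [nn.Linear(d_model, rank, bias=False) for _ in range(num_shared_positions)])
        self.lora_up = nn.ModuleList(
            [nn.Linear(rank, d_ff, bias=False) for _ in range(num_shared_positions)])

    def forward(self, x: torch.Tensor, position: int) -> torch.Tensor:
        h = self.fc1(x) + self.lora_up[position](self.lora_down[position](x))
        return self.fc2(F.gelu(h))

# Same shared weights, position-specific LoRA path (dimensions are made up).
mlp = SharedMLPWithLoRA(d_model=512, d_ff=2048, num_shared_positions=2)
y = mlp(torch.randn(1, 16, 512), position=0)
```

The point of the low-rank path is that the extra parameters scale with `rank` rather than with `d_model * d_ff`, so reusing the block at several depths stays cheap while still letting each position drift from the shared weights.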

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/XrEIEBxd0fqIgh3LyArAV.png" width="300" alt="Zamba architecture">
@@ -62,27 +55,20 @@ Zamba2-2.7B utilizes and extends our original Zamba hybrid SSM-attention archite

## Performance

Zamba2-2.7B achieves leading and state-of-the-art performance among models of <3B parameters and is competitive with some models of significantly greater size. Moreover, due to its unique hybrid SSM architecture, Zamba2-2.7B achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer based models.
Zamba2-2.7B achieves leading and state-of-the-art performance among models of <3B parameters and is competitive with some models of significantly greater size. Moreover, due to its unique hybrid SSM architecture, Zamba2-2.7B achieves extremely low latency and rapid generation with a significantly smaller memory footprint than comparable transformer based models.

Zamba2-2.7B's high performance and small inference compute and memory footprint render it an ideal generalist model for on-device applications.
Zamba2-2.7B's high performance and small compute and memory footprint render it an ideal generalist model for on-device applications.

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/U7VD9PYLj3XcEjgV08sP5.png" width="700" alt="Zamba performance">
</center>

(-/ TODO All eval figure)

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/64e40335c0edca443ef8af3e/3C-JIBxaug-FjkVJF74s1.png" width="700" alt="Zamba performance">
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/Y_X1hc4UwXLwrttyQpaxY.png" width="700" alt="Zamba inference and memory cost">
</center>

Time to First Token (TTFT) | Output Generation
:-------------------------:|:-------------------------:
![](https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/BmE8X6tDNVw5OJcbZt8sZ.png) | ![](https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/wECc9cItK1FW1MOMGSLrp.png)


<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/nhoss41xlzfEBZzcQXI6z.png" width="700" alt="Zamba inference and memory cost">
</center>

## Notice

Binary file modified __pycache__/attention.cpython-310.pyc
Binary file modified __pycache__/enums.cpython-310.pyc
Binary file modified __pycache__/hf_utils.cpython-310.pyc
Binary file modified __pycache__/mamba2_layer.cpython-310.pyc
Binary file modified __pycache__/mamba_block.cpython-310.pyc
Binary file modified __pycache__/mamba_config.cpython-310.pyc
Binary file modified __pycache__/mamba_layer.cpython-310.pyc
Binary file modified __pycache__/mamba_model.cpython-310.pyc
Binary file modified __pycache__/mlp.cpython-310.pyc
Binary file modified __pycache__/rotary.cpython-310.pyc
Binary file modified __pycache__/utils.cpython-310.pyc
mamba_block.py: 34 changes (0 additions & 34 deletions)
@@ -40,7 +40,6 @@
from mamba_config import MambaConfig
from mlp import MLP
from attention import CausalSelfAttention
from switch_mlp import SwitchMLP
from rotary import RotaryEmbedding


@@ -398,29 +397,6 @@ def create_block(config, layer_idx):
            fused_add_norm=config.fused_add_norm,
            residual_in_fp32=config.residual_in_fp32,
        )
    else:
        if config.layer_mapping[layer_idx-1][1] == '1':
            norm_moe = partial(nn.LayerNorm if not config.rms_norm else RMSNorm, eps=config.layernorm_epsilon)
            mixer_cls = partial(MambaLayer,layer_idx=layer_idx, **factory_kwargs)
            moe_cls = partial(MLP,layer_idx=layer_idx, **factory_kwargs)
            block = MambaBlockParallelMoe(
                config,
                mixer_cls=mixer_cls,
                moe_cls=moe_cls,
                norm_cls=norm_cls,
                norm_moe=norm_moe,
                fused_add_norm=config.fused_add_norm,
                residual_in_fp32=config.residual_in_fp32,
            )
        elif config.layer_mapping[layer_idx-1][0] == 'a':
            mixer_cls = partial(CausalSelfAttention, layer_number=layer_idx, **factory_kwargs)
            block = AttentionBlock(
                config,
                mixer_cls=mixer_cls,
                norm_cls=norm_cls,
                fused_add_norm=config.fused_add_norm,
                residual_in_fp32=config.residual_in_fp32,
            )
        elif config.layer_mapping[layer_idx-1][0] == 'm':

            mixer_cls = partial(Mamba2Layer, layer_idx=layer_idx, **factory_kwargs)
@@ -431,16 +407,6 @@ def create_block(config, layer_idx):
                fused_add_norm=config.fused_add_norm,
                residual_in_fp32=config.residual_in_fp32,
            )
        else:
            mixer_cls = partial(SwitchMLP, layer_idx=layer_idx, **factory_kwargs)
            block = MoEBlock(
                config,
                mixer_cls=mixer_cls,
                norm_cls=norm_cls,
                fused_add_norm=config.fused_add_norm,
                residual_in_fp32=config.residual_in_fp32,
            )
        block.layer_idx = layer_idx
    return block

class MambaDecoder(nn.Module):
Binary file modified ops/__pycache__/__init__.cpython-310.pyc
Binary file modified ops/__pycache__/selective_scan_interface.cpython-310.pyc
Binary file modified ops/triton/__pycache__/__init__.cpython-310.pyc
Binary file modified ops/triton/__pycache__/k_activations.cpython-310.pyc
Binary file modified ops/triton/__pycache__/layernorm.cpython-310.pyc
Binary file modified ops/triton/__pycache__/layernorm_gated.cpython-310.pyc
Binary file modified ops/triton/__pycache__/selective_state_update.cpython-310.pyc
Binary file modified ops/triton/__pycache__/ssd_bmm.cpython-310.pyc
Binary file modified ops/triton/__pycache__/ssd_chunk_scan.cpython-310.pyc
Binary file modified ops/triton/__pycache__/ssd_chunk_state.cpython-310.pyc
Binary file modified ops/triton/__pycache__/ssd_combined.cpython-310.pyc
Binary file modified ops/triton/__pycache__/ssd_state_passing.cpython-310.pyc
