Description
The last output equation in the paper does not match what the code computes.
The figure shows the last output equation as

[figure: $o_t' = \mathrm{RMSNorm}(h_t) * \sigma(g_t)$]
But based on the current code, what is actually executed is
$o_t' = \mathrm{RMSNorm}(g_t) * \sigma(h_t)$
instead of
$o_t' = \mathrm{RMSNorm}(h_t) * \sigma(g_t)$
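
To see that the two orders genuinely differ, here is a minimal plain-PyTorch sketch; the unweighted `rms_norm` and the random tensors are illustrative stand-ins, not the repo's fused Triton kernel:

```python
import torch

def rms_norm(x, eps=1e-6):
    # Unweighted RMSNorm, for illustration only; the repo's g_norm also
    # carries a learned weight and fuses the gate into a Triton kernel.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

h_t = torch.randn(2, 8)  # stand-in for the recurrent output (`o` in the code)
g_t = torch.randn(2, 8)  # stand-in for the gate branch (`g_proj(hidden_states)`)

paper = rms_norm(h_t) * torch.sigmoid(g_t)   # o_t' = RMSNorm(h_t) * sigma(g_t)
code  = rms_norm(g_t) * torch.sigmoid(h_t)   # what the swapped call order computes

print(torch.allclose(paper, code))  # False: the two orders are not equivalent
```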
It seems this was fixed in a recent commit to HGRN in the flash-linear-attention repository:
```diff
  last_state = (recurrent_state,)
  past_key_values.update(last_state, self.layer_idx, i.shape[2])
- o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
+ o = self.g_norm(rearrange(o, 'b h l d -> b l (h d)'), self.g_proj(hidden_states))
  o = self.o_proj(o)
  return o, None, past_key_values
```
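
If the same convention carries over (assuming `g_norm` here is the same `FusedRMSNormSwishGate`, whose `forward(self, x, o, ...)` normalizes `x` and gates with `o`), the corresponding one-line change in `mmfreelm/layers/hgrn_bit.py` would presumably be:

```diff
- o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
+ o = self.g_norm(rearrange(o, 'b h l d -> b l (h d)'), self.g_proj(hidden_states))
```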
Existing code path of the current repository:

- `g_norm` is called with $g_t$ and $h_t$ (`mmfreelm/layers/hgrn_bit.py`, line 139 at `ec1c298`):

  ```python
  o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
  ```

- In `FusedRMSNormSwishGate.forward(self, x, o, ...)` (invoked by the line above), `x` receives $g_t$ and `o` receives $h_t$.

- In `_layer_norm_fwd_1pass_kernel`, the sigmoid is applied to `o`, which is $h_t$ instead of $g_t$ (see the reference sketch after this list):

  ```python
  y = y * o * tl.sigmoid(o)
  ```
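
For reference, here is a minimal sketch of what the fused op quoted above computes per row, assuming the kernel's RMSNorm path with the learned weight/bias omitted. Note that the quoted line gates with `o * sigmoid(o)` (a swish of `o`), so under the current call order the executed output is $\mathrm{RMSNorm}(g_t) * \mathrm{swish}(h_t)$:

```python
import torch

def fused_rmsnorm_swish_gate_ref(x, o, eps=1e-6):
    # Per-row math of the fused op as quoted above (learned weight/bias and
    # Triton blocking omitted): normalize x, then gate with o * sigmoid(o).
    y = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return y * o * torch.sigmoid(o)

g_t = torch.randn(4, 16)  # g_proj(hidden_states)
h_t = torch.randn(4, 16)  # recurrent output

current = fused_rmsnorm_swish_gate_ref(g_t, h_t)  # normalizes g_t, gates with swish(h_t)
fixed   = fused_rmsnorm_swish_gate_ref(h_t, g_t)  # normalizes h_t, gates with swish(g_t)
```

Swapping the arguments, as in the flash-linear-attention fix, would restore $\mathrm{RMSNorm}(h_t)$ gated by $g_t$, matching the paper.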
Were the published results produced with the inverted equation or with the fixed one?