
Discrepancy in code and paper related to HGRNBitAttention #37


Description

@loki-r

The paper and the code disagree on the last output equation.

The figure in the paper shows the last output equation as

$o_t' = \mathrm{RMSNorm}(h_t) \odot \sigma(g_t)$

But based on the current code, the execution is

$o_t' = \mathrm{RMSNorm}(g_t) \odot \sigma(h_t)$

instead of

$o_t' = \mathrm{RMSNorm}(h_t) \odot \sigma(g_t)$

This seems to be fixed in a recent commit to HGRN in the flash-linear-attention repository:

        last_state = (recurrent_state,)
        past_key_values.update(last_state, self.layer_idx, i.shape[2])

-       o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
+       o = self.g_norm(rearrange(o, 'b h l d -> b l (h d)'), self.g_proj(hidden_states))
        o = self.o_proj(o)

        return o, None, past_key_values
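
To make the difference concrete, here is a minimal, non-fused sketch of the gated normalization (a plain-PyTorch stand-in for fla's FusedRMSNormSwishGate; the shapes, the weight handling, and the swish nonlinearity standing in for $\sigma$ are assumptions for illustration). The first argument is the one that gets RMS-normalized, while the second only drives the gate, so swapping the arguments inverts the equation:

    import torch
    import torch.nn.functional as F

    def gated_rmsnorm(x: torch.Tensor, gate: torch.Tensor,
                      weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # RMS-normalize the first argument only ...
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight
        # ... then modulate it with a swish of the second argument.
        return x * F.silu(gate)

    h = torch.randn(2, 8, 512)         # recurrent output o, rearranged to (b, l, h*d)
    g = torch.randn(2, 8, 512)         # self.g_proj(hidden_states)
    w = torch.ones(512)                # norm weight

    paper    = gated_rmsnorm(h, g, w)  # RMSNorm(h_t) * swish(g_t) -- the fixed order
    inverted = gated_rmsnorm(g, h, w)  # RMSNorm(g_t) * swish(h_t) -- the old code path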

Existing code path of the current repository:

Are the reported results obtained with the inverted equation or with the fixed one?
