What is the purpose of GGML_F32_STEP and GGML_F16_STEP?
              
#386
         I can tell they're used to compute  As an example, let's look at  inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;
    const int np = (n & ~(GGML_F32_STEP - 1));
    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };
    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];
    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);
            sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }
    // reduce sum0..sum3 to sum0
    GGML_F32_VEC_REDUCE(sumf, sum);
    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }
    *s = sumf;
    }

For starters, we can flatten the two main loops into a single one and simplify the index computation:

    inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;
    const int np = (n & ~(GGML_F32_STEP - 1));
    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };
    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];
    for (int i = 0; i < np; i += GGML_F32_EPR) {
        int j = (i / GGML_F32_EPR) % GGML_F32_ARR; // cycle through the accumulators
        ax[j] = GGML_F32_VEC_LOAD(x + i);
        ay[j] = GGML_F32_VEC_LOAD(y + i);
        sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
    }
    // reduce sum0..sum3 to sum0
    GGML_F32_VEC_REDUCE(sumf, sum);
    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }
    *s = sumf;
    }

Now, it looks like we don't really need GGML_F32_STEP or the accumulator array at all:

    inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;
    const int np = n - (n % GGML_F32_EPR);
    GGML_F32_VEC sum = GGML_F32_VEC_ZERO;
    for (int i = 0; i < np; i += GGML_F32_EPR) {
        GGML_F32_VEC ax = GGML_F32_VEC_LOAD(x + i);
        GGML_F32_VEC ay = GGML_F32_VEC_LOAD(y + i);
        sum = GGML_F32_VEC_FMA(sum, ax, ay);
    }
    // GGML_F32_VEC_REDUCE expects an array of GGML_F32_ARR accumulators,
    // so wrap our single accumulator before reducing
    GGML_F32_VEC sum_arr[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };
    sum_arr[0] = sum;
    GGML_F32_VEC_REDUCE(sumf, sum_arr);
    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }
    *s = sumf;
    }

Now GGML_F32_STEP is gone entirely. The results?

(Before/After benchmark tables from the original post not preserved.)

So, using GGML_F32_STEP doesn't seem to make any measurable difference on my machine. So why don't we remove it? Is there a performance reason behind it that isn't visible on my system? I'm running an Intel Celeron N4120 with SSE3 and BLAS. I'd appreciate it if someone could test this on a PC that has better performance than a potato, unlike mine.

A version of the code with the changes I made above can be found at https://github.com/abitofevrything/whisper.cpp/tree/remove_step. Note that I have not made the changes necessary for POWER9, as I couldn't find enough documentation online on how to reimplement GGML_F16_VEC_LOAD without the macro it relies on.
  
Replies: 2 comments 2 replies
I was working on ggml_vec_dot_f16() this morning. Did not get a significant improvement by flattening the nested loop. Here's what I came up with, but my reduction piece is not ready for prime time.

    inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
    #if defined(GGML_SIMD)
        // ...
    #endif
    }
  
I've just added support back for POWER9 (I think). @ggerganov, hope you don't mind the mention, but do you have any explanation for GGML_F32_STEP and GGML_F16_STEP?
  