Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always emit constants in bytecode #3976

Closed

Conversation

sjamesr
Copy link
Contributor

@sjamesr sjamesr commented Dec 19, 2024

No description provided.

Currently, the bytecode generator adds constants to a const_buf. In
doing this, it searches the const_buf for existing consts of the same
value. This ends up being quadratic in the number of constants.

I thought of adding a hashmap to make the lookup quicker, but I thought
we could maybe just emit an instruction to push the constant onto the
stack (which is what happens already if the const_buf is full).

This will increase the generated bytecode size, but it makes the
bytecode generator a lot faster (e.g. I saw get_const_offset taking 25%
of the loading time on a 12kb module).
@sjamesr sjamesr force-pushed the dont_update_const_list branch from 930086d to 440a16b Compare December 19, 2024 21:02
@sjamesr
Copy link
Contributor Author

sjamesr commented Dec 19, 2024

I just wanted to open this PR to start a discussion about the tradeoffs between having this constant list and emitting loads vs. simply pushing the value onto the stack.

@sjamesr
Copy link
Contributor Author

sjamesr commented Dec 19, 2024

I haven't looked at how much this increases module size, but it definitely decreases bytecode generation time (previously it was like 25% of the time in get_const_offset)

@lum1n0us
Copy link
Collaborator

I haven't looked at how much this increases module size, but it definitely decreases bytecode generation time (previously it was like 25% of the time in get_const_offset)

The data will be helpful since involving constant space is based on a trade-off.

@xujuntwt95329 @wenyongh Please join us.

@sjamesr
Copy link
Contributor Author

sjamesr commented Dec 20, 2024

If we didn't want to pay the price in bytecode size, we could use a hashmap of value->offset to do the searches in ~constant time, at the cost of O(n) memory.

@wenyongh
Copy link
Contributor

If we didn't want to pay the price in bytecode size, we could use a hashmap of value->offset to do the searches in ~constant time, at the cost of O(n) memory.

The bytecode size or memory consumption is really important for us since fast-interp already increases footprint, the main design goal of constant offsets is to reduce the footprint since there may be many duplicated constants. Using hashmap is a choice, another choice may be to just sort the loader_ctx->const_buf (which is an struct Const array) by the const value, e.g., binary search the right position to insert when there is a new const to add, and binary search the array to get the slot index. Since there are i32/i64/f32/f64 values, I think we can sort them by treat all the values as f64 type and compare two values with f64 type.

@sjamesr
Copy link
Contributor Author

sjamesr commented Dec 20, 2024

Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something?

So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches.

@wenyongh
Copy link
Contributor

Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something?

I mean that the sort/search operations are on the array of Const struct which has value, value_type and slot_index fields, but not on the const buffer, since the Const struct records the slot index, we don't need to rewriting the bytecode, note loader just emits slot_index to bytecode. This improves the search algorithm, but yes, it introduces many insertion operations, it depends on the efficiency of memmove.

So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches.

In fact the prepare_bytecode of fast-interp scans the bytecode twice, now it seems that the two passes do the same on the Const struct array, I am not sure whether we can handle them differently, e.g., in the first pass, loader doesn't search the const offset, but just always appends the new value to the Const array, and in the second pass, loader sorts the Const array and removes duplicated values, and then quickly searches the array to get the const array when handling const opcodes.

Using hashmap is a good way, my understanding is that the Const struct array is kept and you will create a new hashmap?

@xujuntwt95329
Copy link
Collaborator

Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something?

I mean that the sort/search operations are on the array of Const struct which has value, value_type and slot_index fields, but not on the const buffer, since the Const struct records the slot index, we don't need to rewriting the bytecode, note loader just emits slot_index to bytecode. This improves the search algorithm, but yes, it introduces many insertion operations, it depends on the efficiency of memmove.

So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches.

In fact the prepare_bytecode of fast-interp scans the bytecode twice, now it seems that the two passes do the same on the Const struct array, I am not sure whether we can handle them differently, e.g., in the first pass, loader doesn't search the const offset, but just always appends the new value to the Const array, and in the second pass, loader sorts the Const array and removes duplicated values, and then quickly searches the array to get the const array when handling const opcodes.

Using hashmap is a good way, my understanding is that the Const struct array is kept and you will create a new hashmap?

A different process for second traverse is a good idea, but it may waste some space when there are multiple duplicated values. Maybe there can be some special process for some common value (e.g. 0, 1) to reduce the waste.

But anyway this is a tradeoff, seems there isn't a best way for all scenarios.

@lum1n0us lum1n0us linked an issue Dec 23, 2024 that may be closed by this pull request
@wenyongh
Copy link
Contributor

In fact the prepare_bytecode of fast-interp scans the bytecode twice, now it seems that the two passes do the same on the Const struct array, I am not sure whether we can handle them differently, e.g., in the first pass, loader doesn't search the const offset, but just always appends the new value to the Const array, and in the second pass, loader sorts the Const array and removes duplicated values, and then quickly searches the array to get the const array when handling const opcodes.
Using hashmap is a good way, my understanding is that the Const struct array is kept and you will create a new hashmap?

A different process for second traverse is a good idea, but it may waste some space when there are multiple duplicated values. Maybe there can be some special process for some common value (e.g. 0, 1) to reduce the waste.

But anyway this is a tradeoff, seems there isn't a best way for all scenarios.

Agree, there is realy not a best way. Since there are already two traverses in the loader of fast-interp, my suggestion is not to introduce an extra traversing, which also takes time and degrades performance. Instead, had better leverage the first traverse to do something. One way that I can think is to mark down the bytecode positions and calculate the total i64/f64 and i32/f32 const counts in the first traverse, for example: we add two link lists to the loader context, one is for i64/f64 with node like struct I64Node { uint8 *bytecode; struct I64Node *next; }, another is for i32/f32 with node like struct I32Node { uint8 *bytecoe; struct I32Node *next; }, and also add uint32 i64_node_count; uint32 i32_node_count in loader context:

typedef struct I64Node {
    uint8 *bytecode;
    struct I64Node *next;
} I64Node;
typedef struct I32Node {
    uint8 *bytecode;
    struct I32Node *next;
} I32Node;

struct WASMLoaderContext {
    ...
    I64Node *i64_node_list;
    I32Node *i32_node_list;
    uint32 i64_node_count;
    uint32 i32_node_count;
    ...
};

And then add i64/f64.const bytecodes one by one to the first list and add i32/f32.const bytecodes one by one to the second list, and increase i64_node_count and i32_node_count accordingly. Note that we just remember all the nodes, we don't sort the list. And f64 can be treated as i64 by converting its buffer content to i64, like i64 = *(i64 *)&f64. And same for f32.

Then after the first traverse, we allocate a big buffer (an int64 array) for i64 nodes, traverse i64_node_list and copy i64/f64 const to the array one by one, and qsort the array buffer, and then remove duplicated consts. And same for i32_node_list. After that we can allocate a merged buffer for the two array, copy data to the new buffer and then free all previous buffer (two lists, two array buffer). Then in the second traverse, we can just use binary search (bsearch) to get the const offset when emitting bytecode.

@sjamesr
Copy link
Contributor Author

sjamesr commented Dec 26, 2024

what about

struct WASMLoaderContext {
    ...
    size_t operand_offset;  // points to beginning of operand within const_buf.data
    Vector *const_buf;
    HashMap *const_map;  // with appropriate Const equals and hash functions
    ...
}

bool wasm_loader_get_const_offset(WASMLoaderContext *ctx, uint8 value, void *type, size_t *offset) {
    Const search_key;
    if (!const_from_value(value, type, &search_key)) return false;
    void* entry = bh_hash_map_find(ctx->const_map, &search_key);
    if (entry) { 
        *offset = -*(size_t*)offset;
         return true;
    }

    // It's a value we haven't seen before
    ctx->operand_offset += is_32bit_type(type) ? 1 : 2;  // whatever the appropriate value should be
    bh_vector_append(ctx->const_buf, search_key);

    size_t* value = malloc(...);
    *value = *offset = ctx->operand_offset;  // whatever is necessary to point into ctx->const_buf.data
    if (!bh_hash_map_insert(ctx->const_map, pointer_to_operand_offset)) { return false; }
    return true;
}

Basically augment const_buf with a hashmap to allow ~constant time lookups. We would need to handle the case where const_buf fills up (more than SHORT_MAX entries), this would be treated as it is now (emit param into bytecode).

On the first traversal, the buf and map would be filled with unique values. On the second traversal, the value is either already in the map, or the buf filled up and we emit bytecodes.

What do you think?

@wenyongh
Copy link
Contributor

what about

struct WASMLoaderContext {
    ...
    size_t operand_offset;  // points to beginning of operand within const_buf.data
    Vector *const_buf;
    HashMap *const_map;  // with appropriate Const equals and hash functions
    ...
}

bool wasm_loader_get_const_offset(WASMLoaderContext *ctx, uint8 value, void *type, size_t *offset) {
    Const search_key;
    if (!const_from_value(value, type, &search_key)) return false;
    void* entry = bh_hash_map_find(ctx->const_map, &search_key);
    if (entry) { 
        *offset = -*(size_t*)offset;
         return true;
    }

    // It's a value we haven't seen before
    ctx->operand_offset += is_32bit_type(type) ? 1 : 2;  // whatever the appropriate value should be
    bh_vector_append(ctx->const_buf, search_key);

    size_t* value = malloc(...);
    *value = *offset = ctx->operand_offset;  // whatever is necessary to point into ctx->const_buf.data
    if (!bh_hash_map_insert(ctx->const_map, pointer_to_operand_offset)) { return false; }
    return true;
}

Basically augment const_buf with a hashmap to allow ~constant time lookups. We would need to handle the case where const_buf fills up (more than SHORT_MAX entries), this would be treated as it is now (emit param into bytecode).

On the first traversal, the buf and map would be filled with unique values. On the second traversal, the value is either already in the map, or the buf filled up and we emit bytecodes.

What do you think?

It is basically OK for me, sounds like the performance is better than my idea but it will consume more memory. A concern is there may be many memory re-alloction when vector length increases. Had better not use size_t but use uint32 instead, and seems there are some typos in the sample code, e.g. uint8 value should be WASMValule value? And in size_t* value = malloc(...); *value = *offset = ctx->operand_offset;, the value is duplicated with the function argument uint8 value. I think you can have a try.

@xujuntwt95329 how do you think?

@sjamesr
Copy link
Contributor Author

sjamesr commented Dec 27, 2024

Thanks for the feedback, and yes in my sample code, it should be uint8 type, void *value to match what already exists in wasm_loader_get_const_offset.

My solution will involve many more allocations: once inserted into the hash, the pointer to key cannot change, so that means that they can't go into e.g. a vector that will be reallocated on growth. Perhaps to reduce allocations, we can have a linked list of buffers each containing a given number of keys (e.g. 4kb key buffers).

The advantage would be my solution should involve fewer memmoves/deletions, which your solution would presumably incur during the "remove duplicates" phase.

Thanks for the feedback, I will try working on a PR to see if this will work.

@xujuntwt95329
Copy link
Collaborator

what about

struct WASMLoaderContext {
    ...
    size_t operand_offset;  // points to beginning of operand within const_buf.data
    Vector *const_buf;
    HashMap *const_map;  // with appropriate Const equals and hash functions
    ...
}

bool wasm_loader_get_const_offset(WASMLoaderContext *ctx, uint8 value, void *type, size_t *offset) {
    Const search_key;
    if (!const_from_value(value, type, &search_key)) return false;
    void* entry = bh_hash_map_find(ctx->const_map, &search_key);
    if (entry) { 
        *offset = -*(size_t*)offset;
         return true;
    }

    // It's a value we haven't seen before
    ctx->operand_offset += is_32bit_type(type) ? 1 : 2;  // whatever the appropriate value should be
    bh_vector_append(ctx->const_buf, search_key);

    size_t* value = malloc(...);
    *value = *offset = ctx->operand_offset;  // whatever is necessary to point into ctx->const_buf.data
    if (!bh_hash_map_insert(ctx->const_map, pointer_to_operand_offset)) { return false; }
    return true;
}

Basically augment const_buf with a hashmap to allow ~constant time lookups. We would need to handle the case where const_buf fills up (more than SHORT_MAX entries), this would be treated as it is now (emit param into bytecode).
On the first traversal, the buf and map would be filled with unique values. On the second traversal, the value is either already in the map, or the buf filled up and we emit bytecodes.
What do you think?

It is basically OK for me, sounds like the performance is better than my idea but it will consume more memory. A concern is there may be many memory re-alloction when vector length increases. Had better not use size_t but use uint32 instead, and seems there are some typos in the sample code, e.g. uint8 value should be WASMValule value? And in size_t* value = malloc(...); *value = *offset = ctx->operand_offset;, the value is duplicated with the function argument uint8 value. I think you can have a try.

@xujuntwt95329 how do you think?

I think this is worth trying. And if the memory requirement increased too large, a compilation flag to make this configurable would be great, so that memory constrained devices won't be influenced.

@sjamesr
Copy link
Contributor Author

sjamesr commented Jan 7, 2025

I added a new pull request here: #4010. Unfortunately it's not yet ready for submission, it is worse for small modules than what we currently have.

Do you want to continue the discussion there?

@wenyongh
Copy link
Contributor

Let's continue the discussion in PR #4012 and close this PR.

@wenyongh wenyongh closed this Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

improving wasm_loader_get_const_offset
4 participants