Summary
A signed vs. unsigned integer overflow in llama.cpp's tokenizer implementation (llama_vocab::tokenize, src/llama-vocab.cpp:3036) results in unintended behavior in the token-copy size check, allowing a heap overflow of the llama.cpp inference engine via carefully crafted text input (human messages, prompts, templates) during tokenization.
This is particularly dangerous because llama_vocab::tokenize is used everywhere human input is processed (llama_vocab::tokenize -> llama_tokenize -> tokenize_prompt, then generate ...), meaning every token input can be a vulnerable entry point (it affects all user input: prompts, messages, templates). It is, however, somewhat mitigated because most msg.role + content buffers are allocated with std::vector<char> buf(alloc_size) (common/chat.cpp:1831), and the standard library rejects requests larger than max_size() (what(): cannot create std::vector larger than max_size()).
Nevertheless, during research it was found that this limitation is bypassable by exploiting the recently added jinja template support (common_chat_templates_apply_jinja), because the rendered prompt comes straight out of tmpl.apply as a std::string rather than a pre-sized buffer.
Details
A single line in llama_vocab::tokenize, llama.cpp's tokenizer implementation, causes this vulnerability. Before we dissect how the heap overflow forms, let's look at how the function is used and referenced in the tokenization process.
// src/llama-vocab.cpp:3055-3073
int32_t llama_vocab::tokenize(
                   const char * text,
                      int32_t   text_len,
                  llama_token * tokens,
                      int32_t   n_tokens_max,
                         bool   add_special,
                         bool   parse_special) const {
    auto res = tokenize(std::string(text, text_len), add_special, parse_special);
    if (n_tokens_max < (int) res.size()) {
        // LLAMA_LOG_ERROR("%s: too many tokens\n", __func__);
        return -((int) res.size());
    }

    for (size_t i = 0; i < res.size(); i++) {
        tokens[i] = res[i];
    }

    return res.size();
}
tokenize philosophy
llama_vocab::tokenize() acts as an interface adapter that calls the underlying tokenize (llama_vocab::impl::tokenize), the lower part of the tokenization process where the actual vocabulary handling happens (e.g. LLAMA_VOCAB_TYPE_*, determined by tokenizer.ggml.model); you will see later why it is designed this way. We won't dive into llama_vocab::impl::tokenize now, since its implementation doesn't matter yet (we explain later how it produces a separate stack overflow).
int32_t llama_tokenize(
    const struct llama_vocab * vocab,
    // ....
    return vocab->tokenize(text, text_len, tokens, n_tokens_max, add_special, parse_special);
}
llama_tokenize thinly wraps vocab->tokenize (the llama_vocab::tokenize interface) and is the common tokenizer API you'll see a lot in llama.cpp's implementation, used directly in run/run.cpp (the implementation of ./bin/llama-run) or in common.cpp (./common/common.cpp, which is then used everywhere, e.g. server.cpp (./bin/llama-server), tts.cpp, tokenize.cpp, ...).
std::vector<llama_token> common_tokenize(
        const struct llama_vocab * vocab,
               const std::string & text,
                              bool add_special,
                              bool parse_special) {
    // upper limit for the number of tokens
    int n_tokens = text.length() + 2 * add_special;
    std::vector<llama_token> result(n_tokens);
    n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
    if (n_tokens < 0) {
        result.resize(-n_tokens);
        int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
        GGML_ASSERT(check == -n_tokens);
    //...
static int tokenize_prompt(const llama_vocab * vocab, const std::string & prompt,
                           std::vector<llama_token> & prompt_tokens, const LlamaData & llama_data) {
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(llama_data.context.get()), 0) == 0;

    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    prompt_tokens.resize(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), is_first,
                       true) < 0) {
        printe("failed to tokenize the prompt\n");
        return -1;
    }
If you look closely at the two implementations, you will see that both callers of llama_tokenize() adhere to a common design for allocating the tokenization output:
- Initialize the buffer for llama_token * tokens (result) with a small allocation: text.length() + 2 * add_special in common_tokenize, or an initially empty prompt_tokens (std::vector<llama_token> tokens;) in tokenize_prompt.
- Call llama_tokenize -> llama_vocab::impl::tokenize a first time to probe the length of the tokens (res), with n_tokens_max set to zero or a small size so that no actual copying of the result happens.
- resize() the result vector using the negated length returned from llama_tokenize.
- Call llama_tokenize a second time; this time the output of llama_vocab::impl::tokenize is guaranteed to fit in llama_token * tokens.
This explains why a negative return value exists for llama_tokenize: the tokenizer dynamically determines the required size of the output token array, at the cost of calling llama_vocab::impl::tokenize twice, which guarantees efficient memory usage. But it is also the setup for this heap overflow.
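As a condensed illustration of this calling convention, here is a short sketch of how a caller is expected to use llama_tokenize (same API as shown above; error handling trimmed for brevity):

#include "llama.h"

#include <string>
#include <vector>

// Probe/resize/copy convention shared by common_tokenize and tokenize_prompt.
static std::vector<llama_token> tokenize_twice(const llama_vocab * vocab, const std::string & text) {
    // 1) probe: no output buffer, n_tokens_max = 0 -> the call returns -(token count)
    const int32_t n = llama_tokenize(vocab, text.c_str(), text.size(),
                                     /*tokens=*/nullptr, /*n_tokens_max=*/0,
                                     /*add_special=*/true, /*parse_special=*/true);
    // 2) resize the output vector to the probed size
    std::vector<llama_token> out(n < 0 ? -n : n);
    // 3) tokenize again: this time the result is guaranteed to fit in out.data()
    llama_tokenize(vocab, text.c_str(), text.size(), out.data(), out.size(), true, true);
    return out;
}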
if (n_tokens_max < (int) res.size()) casts tokenize(...).size() (a std::vector::size(), i.e. size_t) to (int) in order to catch the case where the size of the tokenized vector exceeds n_tokens_max (the maximum passed in as an argument).
The cast intuitively makes sense: since n_tokens_max is int32_t, i.e. signed (as you can see in the signature above), res.size() is cast to a signed int to avoid the compiler warning about signed/unsigned comparison and to ensure both operands have the same signedness during the comparison.
However, this intuitive operation also opens a path to out-of-bounds memory corruption. In the edge case where res.size() exceeds INT_MAX (2,147,483,647), the cast turns the originally huge size_t res.size() into an extremely large negative integer, which always passes the signed size comparison against n_tokens_max, which in normal use is a small integer (as introduced previously, the dynamic size-probing design starts n_tokens_max at zero).
In the follow-up memory operation, the int-cast res.size() is used again in its original size_t form: the negative integer used in the size comparison inside llama_vocab::tokenize() becomes the huge positive size_t again. In the case where res.size() = 2,147,483,647 + 1, this allows (actual_tokens - 2,147,483,648) * sizeof(llama_token) bytes of out-of-bounds token writes.
From gdb, we can see that the copy destination tokens lives on the heap, showing this is a heap overflow; we will explain later why this is interesting and fun (dangerous).
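The effect of the cast is easy to reproduce in isolation. A minimal standalone demonstration (not llama.cpp code) of a size_t above INT_MAX slipping past the signed check and then coming back as a huge copy bound:

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t res_size     = 2147483648u;  // res.size() == INT_MAX + 1
    const int         n_tokens_max = 0;            // probing call, as in tokenize_prompt

    // The vulnerable comparison: on typical two's-complement targets the (int) cast
    // wraps the size to a negative value, so the "too many tokens" branch never fires.
    if (n_tokens_max < (int) res_size) {
        std::puts("rejected: too many tokens");    // not reached here
    } else {
        // The copy loop then iterates over the *original* size_t value.
        std::printf("(int) res_size = %d, copy bound = %zu tokens (%zu bytes)\n",
                    (int) res_size, res_size, res_size * sizeof(int32_t));
    }
    return 0;
}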
std::vector larger than max size()?
However, constructing such a huge variable, specifically text in this case, is normally problematic, because the C++ standard library prevents you from creating containers that large. This was a major obstacle while building a proof of concept for the heap overflow: directly feeding in such a lengthy prompt triggers "what(): cannot create std::vector larger than max size()". This limitation can be bypassed, though.
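For reference, this guard is the standard library's own: on libstdc++, asking a std::vector for more elements than max_size() allows throws std::length_error with that message. A tiny standalone reproduction:

#include <cstddef>
#include <cstdio>
#include <limits>
#include <stdexcept>
#include <vector>

int main() {
    try {
        // Request more elements than vector<char>::max_size() permits.
        std::vector<char> buf(std::numeric_limits<std::size_t>::max());
    } catch (const std::length_error & e) {
        std::printf("what(): %s\n", e.what());  // "cannot create std::vector larger than max_size()"
    }
    return 0;
}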
Tracking down the exact trigger for this error, it was found that the message comes from std::vector<char> buf(alloc_size), reached as follows:
(tools/run/run.cpp:1179) static int chat_loop -> ret = process_user_message(opt,
(tools/run/run.cpp:1151) process_user_message -> apply_chat_template_with_error_handling(chat_templates.get(),
(tools/run/run.cpp:1082) apply_chat_template_with_error_handling -> apply_chat_template(tmpls, llama_data, append, use_jinja);
(tools/run/run.cpp:931) apply_chat_template -> common_chat_templates_apply(tmpls, inputs);
(common/chat.cpp:1867) common_chat_templates_apply -> common_chat_templates_apply_legacy
(common/chat.cpp:1831) common_chat_templates_apply_legacy -> std::vector<char> buf(alloc_size);
Looking into (common/chat.cpp:1831) common_chat_templates_apply_legacy:
static common_chat_params common_chat_templates_apply_legacy(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    // ....
    for (size_t i = 0; i < contents.size(); ++i) {
        const auto & msg = inputs.messages[i];
        const auto & content = contents[i];
        chat.push_back({msg.role.c_str(), content.c_str()});
        alloc_size += (msg.role.size() + content.size()) * 1.25;
    }

    std::vector<char> buf(alloc_size);
The size here is determined by alloc_size += (msg.role.size() + content.size()) * 1.25, the sizing used when applying the chat template to each message's role and content. This is a pain because the sum is multiplied by 1.25 after adding msg.role.size(), making the already huge content.size() (the message body) even bigger.
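To put rough numbers on it, here is a small standalone calculation (illustrative values only) of how this sizing inflates an already huge message:

#include <cstddef>
#include <cstdio>
#include <string>

int main() {
    const std::string role = "user";
    const std::size_t content_size = 2147483648u;  // ~2 GiB message body (example value)

    std::size_t alloc_size = 0;
    alloc_size += (role.size() + content_size) * 1.25;  // same formula as chat.cpp

    std::printf("content: %zu bytes -> alloc_size: %zu bytes (~%.2f GiB)\n",
                content_size, alloc_size, alloc_size / (1024.0 * 1024.0 * 1024.0));
    // std::vector<char> buf(alloc_size);  // the allocation this report keeps tripping over
    return 0;
}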
However, looking back at (common/chat.cpp:1867) common_chat_templates_apply, where common_chat_templates_apply_legacy is called, we can see another template applier:
common_chat_params common_chat_templates_apply(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    GGML_ASSERT(tmpls != nullptr);
    return inputs.use_jinja
        ? common_chat_templates_apply_jinja(tmpls, inputs)
        : common_chat_templates_apply_legacy(tmpls, inputs);
}
jinja is the templating language llama.cpp's chat template interpreter (minja) is based on. Looking into common_chat_templates_apply_jinja's implementation, we can see that it never allocates a manual byte buffer the way the legacy path does:
- It builds a templates_params params; structure (all members are default-constructed; nothing is pre-sized).
- Depending on the template in use, it dispatches to one of the common_chat_params_init_* helpers (e.g. common_chat_params_init_llama_3_x, *_generic, …).
- Inside those helpers the rendered prompt is obtained with data.prompt = apply(tmpl, tweaked_messages, tools_json, add_generation_prompt, extra_context);, where apply(...) is a small helper a few lines above. That helper calls auto result = tmpl.apply(tmpl_inputs, tmpl_opts); // minja::chat_template::apply
minja::chat_template::apply returns an std::string directly, so the prompt is produced and stored in an ordinary C++ string. Memory management is therefore handled automatically by std::string; no explicit size estimation or buffer allocation is required. That means using common_chat_templates_apply_jinja lets us feed the originally constructed message through without triggering any size error.
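A minimal sketch of the contrast (render_with_jinja is a stand-in for minja::chat_template::apply; the sizes are illustrative):

#include <cstddef>
#include <string>
#include <vector>

// Stand-in for minja::chat_template::apply: the rendered prompt is simply returned.
static std::string render_with_jinja(const std::string & role, const std::string & content) {
    return "<|" + role + "|>\n" + content;  // toy rendering; real templates are richer
}

int main() {
    const std::string role = "user";
    const std::string content(1 << 20, 'A');  // imagine this being multi-GiB

    // Legacy path: estimate a byte buffer up front; a huge estimate is what trips
    // the max_size() guard described above.
    const std::size_t alloc_size = (role.size() + content.size()) * 1.25;
    std::vector<char> legacy_buf(alloc_size);

    // Jinja path: no estimate needed, the std::string sizes itself as the template renders.
    const std::string prompt = render_with_jinja(role, content);

    return prompt.size() <= legacy_buf.size() ? 0 : 1;
}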
The relevant call chain in ./bin/llama-run:
(src/llama-vocab.cpp:3331) int32_t llama_tokenize() -> vocab->tokenize(
(tools/run/run.cpp:944) tokenize_prompt -> const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
- prompt (reversed):
(tools/run/run.cpp:988) static int generate( -> if (tokenize_prompt(vocab, prompt, tokens, llama_data) < 0)
(tools/run/run.cpp:1063) static int generate_response -> if (generate(llama_data, prompt, response))
(tools/run/run.cpp:1151) static int process_user_message( -> if (generate_response(llama_data, prompt, response, stdout_a_terminal)) {
(tools/run/run.cpp:1179) static int chat_loop
Collateral Gift
During the process of creating a PoC for the previously mentioned vulnerability and bypassing the vector size check, something sketchy caught our attention while examining the ASAN logs.

A stack overflow was triggered via the STL allocator (bits/alloc_traits.h), which is a common frame in ASAN reports. At first we thought this was the direct proof of concept for the overflow discussed above (we didn't realize it was actually a heap overflow back then), but looking into the detailed ASAN logs we realized it came from regex processing (bits/regex_executor.tcc), via sub_match. Further investigation of the overflowing frame showed that this stack overflow was caused by deep recursion triggered by unicode_regex_split, which grows the stack frame up to the limit of the stack region and triggers the out-of-bounds access detected by ASAN, specifically:
llama_vocab::impl::tokenize()
    case LLAMA_VOCAB_TYPE_BPE:
        session.tokenize(text, output) -> void tokenize()   // src/llama-vocab.cpp:484
            const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);

void tokenize(const std::string & text, std::vector<llama_token> & output) {
    int final_prev_index = -1;

    const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);
You can take this from two perspectives. On one hand, it gives us a collateral ReDoS out of the blue; on the other hand, this collateral stack overflow stops us from reaching the final heap overflow.
However, there's always a way to bypass it. This word-splitting step (unicode_regex_split) only happens for LLAMA_VOCAB_TYPE_BPE (Byte-Pair Encoding), the most common vocab type, used by gpt2-style tokenizers (else if (tokenizer_model == "gpt2") { type = LLAMA_VOCAB_TYPE_BPE;). By switching to a Unigram (T5) architecture in the GGUF metadata (LLAMA_VOCAB_TYPE_UGM), we take a different case in the llama_vocab::impl::tokenize() (get_type()) switch.
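A simplified, self-contained sketch of that dispatch decision (the gpt2 -> BPE branch is quoted from llama.cpp above; the t5 -> UGM mapping is shown here as an assumption for illustration):

#include <cstdio>
#include <string>

enum llama_vocab_type_sketch { VOCAB_BPE, VOCAB_UGM, VOCAB_OTHER };

// Condensed sketch of the tokenizer.ggml.model -> vocab type selection at GGUF load time.
static llama_vocab_type_sketch vocab_type_from_metadata(const std::string & tokenizer_model) {
    if (tokenizer_model == "gpt2") return VOCAB_BPE;  // quoted branch: BPE, regex pre-split
    if (tokenizer_model == "t5")   return VOCAB_UGM;  // assumed branch: Unigram (T5), no pre-split
    return VOCAB_OTHER;
}

int main() {
    switch (vocab_type_from_metadata("t5")) {
        case VOCAB_BPE:
            std::puts("BPE: unicode_regex_split runs first -> collateral stack overflow");
            break;
        case VOCAB_UGM:
            std::puts("UGM: no regex pre-split -> the huge prompt reaches the vulnerable copy");
            break;
        default:
            std::puts("other vocab types not shown");
            break;
    }
    return 0;
}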

Proof-of-Concept
- Compile the latest version of llama.cpp with ASAN:
cmake .. \
    -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g" \
    -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"
make -j
- Use a .gguf model whose tokenizer.ggml.model is not gpt2 and that ships a jinja-supported chat template (e.g. Retr0REG/mistral-tokenizer-llama).
- Generate a prompt whose tokenized result exceeds INT32_MAX, accounting for the size of the chat template (in the .gguf metadata):
perl -e 'print "<token>" x ((2147483648-<chat-template-size>)/<per_token>), "\n"' >| prompt.txt
- Start a llama.cpp inference service (we chose llama-run for this PoC) and redirect the prompt as model input to trigger tokenization:
ASAN_OPTIONS=verbosity=1 \
./bin/llama-run file://<path-to-model> --jinja < ./prompt.txt
Impact
Heap overflow (heap-based out-of-bounds write) in the llama.cpp inference engine.
- Potential remote code execution: the heap is very playful; we're able to overwrite the following chunks' (freed or in-use, both dangerous!) member pointers. We could:
  - Overwrite in-use structure members: e.g. point an initialized chunk's interface at bad pointers, hijack execution flow, structure-oriented programming?
    - *You can read Llama's Paradox for my past experience turning a heap overflow in llama.cpp into RCE.
  - Overwrite chunk states / freed-chunk pointers: e.g. house-of attacks.
- DoS: crash the inference server (straightforward).
Impacted Components:
- llama_tokenize() -> llama_vocab::tokenize()
- run.cpp (./bin/llama-run)
- simple.cpp (./bin/llama-simple)