Summary
A signed vs. unsigned integer overflow in llama.cpp's tokenizer implementation (llama_vocab::tokenize, src/llama-vocab.cpp:3036) results in unintended behavior in the token-copy size check, allowing a heap overflow of the llama.cpp inference engine via carefully crafted text input (human messages, prompts, templates) during tokenization.
This is particularly dangerous because llama_vocab::tokenize is used everywhere human input is processed (llama_vocab::tokenize -> llama_tokenize -> tokenize_prompt, then generate ...), meaning every token input can be a vulnerable entry point (it affects all user input: prompts, messages, templates). It is, however, somewhat mitigated because most msg.role + content buffers are allocated with std::vector<char> buf(alloc_size) (common/chat.cpp:1831), and the standard library rejects requests larger than max_size() (what(): cannot create std::vector larger than max_size()).
Nevertheless, during research it was found that this limitation is bypassable by exploiting the recently added jinja template support (common_chat_templates_apply_jinja), because the rendered prompt comes straight out of tmpl.apply as a std::string rather than a pre-sized buffer.
Details
A single line in llama_vocab::tokenize, llama.cpp's tokenizer implementation, causes this vulnerability. Before we dissect how the heap overflow forms, let's look at how the function is used and referenced in the tokenization process.
// src/llama-vocab.cpp:3055-3073
int32_t llama_vocab::tokenize(
                   const char * text,
                      int32_t   text_len,
                  llama_token * tokens,
                      int32_t   n_tokens_max,
                         bool   add_special,
                         bool   parse_special) const {
    auto res = tokenize(std::string(text, text_len), add_special, parse_special);
    if (n_tokens_max < (int) res.size()) {
        // LLAMA_LOG_ERROR("%s: too many tokens\n", __func__);
        return -((int) res.size());
    }

    for (size_t i = 0; i < res.size(); i++) {
        tokens[i] = res[i];
    }

    return res.size();
}
tokenize philosophy
llama_vocab::tokenize() acts as an interface adapter that calls the underlying tokenize (llama_vocab::impl::tokenize), the lower part of the tokenization process where the actual vocabulary handling happens (e.g. LLAMA_VOCAB_TYPE_*, determined by tokenizer.ggml.model); you will see later why it is designed this way. We won't dive into llama_vocab::impl::tokenize now, since its implementation doesn't matter yet (we explain later how it produces a separate stack overflow).
int32_t llama_tokenize(
    const struct llama_vocab * vocab,
    // ....
    return vocab->tokenize(text, text_len, tokens, n_tokens_max, add_special, parse_special);
}
llama_tokenize thinly wraps vocab->tokenize (the llama_vocab::tokenize interface) and is the common tokenizer API you'll see a lot in llama.cpp's implementation, used directly in run/run.cpp (the implementation of ./bin/llama-run) or in common.cpp (./common/common.cpp, which is then used everywhere, e.g. server.cpp (./bin/llama-server), tts.cpp, tokenize.cpp, ...).
std::vector<llama_token> common_tokenize(
        const struct llama_vocab * vocab,
               const std::string & text,
                              bool add_special,
                              bool parse_special) {
    // upper limit for the number of tokens
    int n_tokens = text.length() + 2 * add_special;
    std::vector<llama_token> result(n_tokens);
    n_tokens = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
    if (n_tokens < 0) {
        result.resize(-n_tokens);
        int check = llama_tokenize(vocab, text.data(), text.length(), result.data(), result.size(), add_special, parse_special);
        GGML_ASSERT(check == -n_tokens);
    //...
static int tokenize_prompt(const llama_vocab * vocab, const std::string & prompt,
                           std::vector<llama_token> & prompt_tokens, const LlamaData & llama_data) {
    const bool is_first = llama_memory_seq_pos_max(llama_get_memory(llama_data.context.get()), 0) == 0;

    const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
    prompt_tokens.resize(n_prompt_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), is_first,
                       true) < 0) {
        printe("failed to tokenize the prompt\n");
        return -1;
    }
If you look closely at the two implementations, you will see that both callers of llama_tokenize() adhere to a common design for allocating the tokenization output:
- Initialize the buffer for llama_token * tokens (result) with a small allocation: text.length() + 2 * add_special in common_tokenize, or an initially empty prompt_tokens (std::vector<llama_token> tokens;) in tokenize_prompt.
- Call llama_tokenize -> llama_vocab::impl::tokenize a first time to probe the length of the tokens (res), with n_tokens_max set to zero or a small size so that no actual copying of the result happens.
- resize() the result vector using the negated length returned from llama_tokenize.
- Call llama_tokenize a second time; this time the output of llama_vocab::impl::tokenize is guaranteed to fit in llama_token * tokens.
This explains why a negative return value exists for llama_tokenize: the tokenizer dynamically determines the required size of the output token array, at the cost of calling llama_vocab::impl::tokenize twice, which guarantees efficient memory usage. But it is also the setup for this heap overflow.
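As a condensed illustration of this calling convention, here is a short sketch of how a caller is expected to use llama_tokenize (same API as shown above; error handling trimmed for brevity):

#include "llama.h"

#include <string>
#include <vector>

// Probe/resize/copy convention shared by common_tokenize and tokenize_prompt.
static std::vector<llama_token> tokenize_twice(const llama_vocab * vocab, const std::string & text) {
    // 1) probe: no output buffer, n_tokens_max = 0 -> the call returns -(token count)
    const int32_t n = llama_tokenize(vocab, text.c_str(), text.size(),
                                     /*tokens=*/nullptr, /*n_tokens_max=*/0,
                                     /*add_special=*/true, /*parse_special=*/true);
    // 2) resize the output vector to the probed size
    std::vector<llama_token> out(n < 0 ? -n : n);
    // 3) tokenize again: this time the result is guaranteed to fit in out.data()
    llama_tokenize(vocab, text.c_str(), text.size(), out.data(), out.size(), true, true);
    return out;
}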
if (n_tokens_max < (int) res.size()) casts tokenize(...).size() (a std::vector::size(), i.e. size_t) to (int) in order to catch the case where the size of the tokenized vector exceeds n_tokens_max (the maximum passed in as an argument).
The cast intuitively makes sense: since n_tokens_max is int32_t, i.e. signed (as you can see in the signature above), res.size() is cast to a signed int to avoid the compiler warning about signed/unsigned comparison and to ensure both operands have the same signedness during the comparison.
However, this intuitive operation also opens a path to out-of-bounds memory corruption. In the edge case where res.size() exceeds INT_MAX (2,147,483,647), the cast turns the originally huge size_t res.size() into an extremely large negative integer, which always passes the signed size comparison against n_tokens_max, which in normal use is a small integer (as introduced previously, the dynamic size-probing design starts n_tokens_max at zero).
In the follow-up memory operation, the int-cast res.size() is used again in its original size_t form: the negative integer used in the size comparison inside llama_vocab::tokenize() becomes the huge positive size_t again. In the case where res.size() = 2,147,483,647 + 1, this allows (actual_tokens - 2,147,483,648) * sizeof(llama_token) bytes of out-of-bounds token writes.
From gdb, we can see that the copy destination tokens lives on the heap, showing this is a heap overflow; we will explain later why this is interesting and fun (dangerous).
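The effect of the cast is easy to reproduce in isolation. A minimal standalone demonstration (not llama.cpp code) of a size_t above INT_MAX slipping past the signed check and then coming back as a huge copy bound:

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t res_size     = 2147483648u;  // res.size() == INT_MAX + 1
    const int         n_tokens_max = 0;            // probing call, as in tokenize_prompt

    // The vulnerable comparison: on typical two's-complement targets the (int) cast
    // wraps the size to a negative value, so the "too many tokens" branch never fires.
    if (n_tokens_max < (int) res_size) {
        std::puts("rejected: too many tokens");    // not reached here
    } else {
        // The copy loop then iterates over the *original* size_t value.
        std::printf("(int) res_size = %d, copy bound = %zu tokens (%zu bytes)\n",
                    (int) res_size, res_size, res_size * sizeof(int32_t));
    }
    return 0;
}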
std::vector larger than max size()?
However, constructing such a huge variable, specifically text in this case, is normally problematic, because the C++ standard library prevents you from creating containers that large. This was a major obstacle while building a proof of concept for the heap overflow: directly feeding in such a lengthy prompt triggers "what(): cannot create std::vector larger than max size()". This limitation can be bypassed, though.
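For reference, this guard is the standard library's own: on libstdc++, asking a std::vector for more elements than max_size() allows throws std::length_error with that message. A tiny standalone reproduction:

#include <cstddef>
#include <cstdio>
#include <limits>
#include <stdexcept>
#include <vector>

int main() {
    try {
        // Request more elements than vector<char>::max_size() permits.
        std::vector<char> buf(std::numeric_limits<std::size_t>::max());
    } catch (const std::length_error & e) {
        std::printf("what(): %s\n", e.what());  // "cannot create std::vector larger than max_size()"
    }
    return 0;
}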
Tracking down the exact trigger for this error, it was found that the message comes from std::vector<char> buf(alloc_size), reached as follows:
(tools/run/run.cpp:1179) static int chat_loop -> ret = process_user_message(opt,
(tools/run/run.cpp:1151) process_user_message -> apply_chat_template_with_error_handling(chat_templates.get(),
(tools/run/run.cpp:1082) apply_chat_template_with_error_handling -> apply_chat_template(tmpls, llama_data, append, use_jinja);
(tools/run/run.cpp:931) apply_chat_template -> common_chat_templates_apply(tmpls, inputs);
(common/chat.cpp:1867) common_chat_templates_apply -> common_chat_templates_apply_legacy
(common/chat.cpp:1831) common_chat_templates_apply_legacy -> std::vector<char> buf(alloc_size);
Looking into (common/chat.cpp:1831) common_chat_templates_apply_legacy:
static common_chat_params common_chat_templates_apply_legacy(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    // ....
    for (size_t i = 0; i < contents.size(); ++i) {
        const auto & msg = inputs.messages[i];
        const auto & content = contents[i];
        chat.push_back({msg.role.c_str(), content.c_str()});
        alloc_size += (msg.role.size() + content.size()) * 1.25;
    }

    std::vector<char> buf(alloc_size);
The size here is determined by alloc_size += (msg.role.size() + content.size()) * 1.25, the sizing used when applying the chat template to each message's role and content. This is a pain because the sum is multiplied by 1.25 after adding msg.role.size(), making the already huge content.size() (the message body) even bigger.
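To put rough numbers on it, here is a small standalone calculation (illustrative values only) of how this sizing inflates an already huge message:

#include <cstddef>
#include <cstdio>
#include <string>

int main() {
    const std::string role = "user";
    const std::size_t content_size = 2147483648u;  // ~2 GiB message body (example value)

    std::size_t alloc_size = 0;
    alloc_size += (role.size() + content_size) * 1.25;  // same formula as chat.cpp

    std::printf("content: %zu bytes -> alloc_size: %zu bytes (~%.2f GiB)\n",
                content_size, alloc_size, alloc_size / (1024.0 * 1024.0 * 1024.0));
    // std::vector<char> buf(alloc_size);  // the allocation this report keeps tripping over
    return 0;
}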
However, looking back at (common/chat.cpp:1867) common_chat_templates_apply, where common_chat_templates_apply_legacy is called, we can see another template applier:
common_chat_params common_chat_templates_apply(
    const struct common_chat_templates * tmpls,
    const struct common_chat_templates_inputs & inputs)
{
    GGML_ASSERT(tmpls != nullptr);
    return inputs.use_jinja
        ? common_chat_templates_apply_jinja(tmpls, inputs)
        : common_chat_templates_apply_legacy(tmpls, inputs);
}
jinja is the templating language llama.cpp's chat template interpreter (minja) is based on. Looking into common_chat_templates_apply_jinja's implementation, we can see that it never allocates a manual byte buffer the way the legacy path does:
- It builds a templates_params params; structure (all members are default-constructed; nothing is pre-sized).
- Depending on the template in use, it dispatches to one of the common_chat_params_init_* helpers (e.g. common_chat_params_init_llama_3_x, *_generic, …).
- Inside those helpers the rendered prompt is obtained with data.prompt = apply(tmpl, tweaked_messages, tools_json, add_generation_prompt, extra_context);, where apply(...) is a small helper a few lines above. That helper calls auto result = tmpl.apply(tmpl_inputs, tmpl_opts); // minja::chat_template::apply
minja::chat_template::apply returns an std::string directly, so the prompt is produced and stored in an ordinary C++ string. Memory management is therefore handled automatically by std::string; no explicit size estimation or buffer allocation is required. That means using common_chat_templates_apply_jinja lets us feed the originally constructed message through without triggering any size error.
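A minimal sketch of the contrast (render_with_jinja is a stand-in for minja::chat_template::apply; the sizes are illustrative):

#include <cstddef>
#include <string>
#include <vector>

// Stand-in for minja::chat_template::apply: the rendered prompt is simply returned.
static std::string render_with_jinja(const std::string & role, const std::string & content) {
    return "<|" + role + "|>\n" + content;  // toy rendering; real templates are richer
}

int main() {
    const std::string role = "user";
    const std::string content(1 << 20, 'A');  // imagine this being multi-GiB

    // Legacy path: estimate a byte buffer up front; a huge estimate is what trips
    // the max_size() guard described above.
    const std::size_t alloc_size = (role.size() + content.size()) * 1.25;
    std::vector<char> legacy_buf(alloc_size);

    // Jinja path: no estimate needed, the std::string sizes itself as the template renders.
    const std::string prompt = render_with_jinja(role, content);

    return prompt.size() <= legacy_buf.size() ? 0 : 1;
}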
The relevant call chain in ./bin/llama-run:
(src/llama-vocab.cpp:3331) int32_t llama_tokenize() -> vocab->tokenize(
(tools/run/run.cpp:944) tokenize_prompt -> const int n_prompt_tokens = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, is_first, true);
- prompt (reversed):
(tools/run/run.cpp:988) static int generate( -> if (tokenize_prompt(vocab, prompt, tokens, llama_data) < 0)
(tools/run/run.cpp:1063) static int generate_response -> if (generate(llama_data, prompt, response))
(tools/run/run.cpp:1151) static int process_user_message( -> if (generate_response(llama_data, prompt, response, stdout_a_terminal)) {
(tools/run/run.cpp:1179) static int chat_loop
Collateral Gift
During the process of creating a PoC for the previously mentioned vulnerability and bypassing the vector size check, something sketchy caught our attention while examining the ASAN logs.

A stack overflow was triggered via the STL allocator (bits/alloc_traits.h), which is a common frame in ASAN reports. At first we thought this was the direct proof of concept for the overflow discussed above (we didn't realize it was actually a heap overflow back then), but looking into the detailed ASAN logs we realized it came from regex processing (bits/regex_executor.tcc), via sub_match. Further investigation of the overflowing frame showed that this stack overflow was caused by deep recursion triggered by unicode_regex_split, which grows the stack frame up to the limit of the stack region and triggers the out-of-bounds access detected by ASAN, specifically:
llama_vocab::impl::tokenize()
    case LLAMA_VOCAB_TYPE_BPE:
        session.tokenize(text, output) -> void tokenize()   // src/llama-vocab.cpp:484
            const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);

void tokenize(const std::string & text, std::vector<llama_token> & output) {
    int final_prev_index = -1;

    const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs);
You can take this from two perspectives. On one hand, it gives us a collateral ReDoS out of the blue; on the other hand, this collateral stack overflow stops us from reaching the final heap overflow.
However, there's always a way to bypass it. This word-splitting step (unicode_regex_split) only happens for LLAMA_VOCAB_TYPE_BPE (Byte-Pair Encoding), the most common vocab type, used by gpt2-style tokenizers (else if (tokenizer_model == "gpt2") { type = LLAMA_VOCAB_TYPE_BPE;). By switching to a Unigram (T5) architecture in the GGUF metadata (LLAMA_VOCAB_TYPE_UGM), we take a different case in the llama_vocab::impl::tokenize() (get_type()) switch.
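A simplified, self-contained sketch of that dispatch decision (the gpt2 -> BPE branch is quoted from llama.cpp above; the t5 -> UGM mapping is shown here as an assumption for illustration):

#include <cstdio>
#include <string>

enum llama_vocab_type_sketch { VOCAB_BPE, VOCAB_UGM, VOCAB_OTHER };

// Condensed sketch of the tokenizer.ggml.model -> vocab type selection at GGUF load time.
static llama_vocab_type_sketch vocab_type_from_metadata(const std::string & tokenizer_model) {
    if (tokenizer_model == "gpt2") return VOCAB_BPE;  // quoted branch: BPE, regex pre-split
    if (tokenizer_model == "t5")   return VOCAB_UGM;  // assumed branch: Unigram (T5), no pre-split
    return VOCAB_OTHER;
}

int main() {
    switch (vocab_type_from_metadata("t5")) {
        case VOCAB_BPE:
            std::puts("BPE: unicode_regex_split runs first -> collateral stack overflow");
            break;
        case VOCAB_UGM:
            std::puts("UGM: no regex pre-split -> the huge prompt reaches the vulnerable copy");
            break;
        default:
            std::puts("other vocab types not shown");
            break;
    }
    return 0;
}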

Proof-of-Concept
- Compile the latest version of llama.cpp with ASAN:
cmake .. \
    -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g" \
    -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"
make -j
- Use a .gguf model whose tokenizer.ggml.model is not gpt2 and that ships a jinja-supported chat template (e.g. Retr0REG/mistral-tokenizer-llama).
- Generate a prompt whose tokenized result exceeds INT32_MAX, accounting for the size of the chat template (in the .gguf metadata):
perl -e 'print "<token>" x ((2147483648-<chat-template-size>)/<per_token>), "\n"' >| prompt.txt
- Start a llama.cpp inference service (we chose llama-run for this PoC) and redirect the prompt as model input to trigger tokenization:
ASAN_OPTIONS=verbosity=1 \
./bin/llama-run file://<path-to-model> --jinja < ./prompt.txt
Impact
Heap overflow (heap-based out-of-bounds write) in the llama.cpp inference engine.
- Potential remote code execution: the heap is very playful; we're able to overwrite the following chunks' (freed or in-use, both dangerous!) member pointers. We could:
  - Overwrite in-use structure members: e.g. point an initialized chunk's interface at bad pointers, hijack execution flow, structure-oriented programming?
    - *You can read Llama's Paradox for my past experience turning a heap overflow in llama.cpp into RCE.
  - Overwrite chunk states / freed-chunk pointers: e.g. house-of attacks.
- DoS: crash the inference server (straightforward).
Impacted Components:
- llama_tokenize() -> llama_vocab::tokenize()
- run.cpp (./bin/llama-run)
- simple.cpp (./bin/llama-simple)