These two env vars dump the generated MLIR code from any compilation into the current working directory as:
- dump.mlir: The MLIR code after passes, without locations.
- dump-debug.mlir: The MLIR code after passes, with locations.
- dump-prepass.mlir: The MLIR code before passes, without locations.
- dump-prepass-debug.mlir: The MLIR code before passes, with locations.
Do note that the MLIR with locations is in pretty form and thus not suitable to pass to mlir-opt.
export NATIVE_DEBUG_DUMP_PREPASS=1
export NATIVE_DEBUG_DUMP=1
To debug with LLDB (or another debugger), we must compile the binary with the with-debug-utils feature.
cargo build --bin cairo-native-run --features with-debug-utils
Then, we can add a debugger breakpoint trap. To add it at a given Sierra statement, we can set the following env var:
export NATIVE_DEBUG_TRAP_AT_STMT=10
The trap instruction may not end up exactly where the statement is.
If we want to manually set the breakpoint (for example, when executing a particular libfunc), then we can use the DebugUtils metadata in the code.
#[cfg(feature = "with-debug-utils")]
{
metadata.get_mut::<DebugUtils>()
.unwrap()
.debug_breakpoint_trap(block, location)?;
}
Now, we need to execute cairo-native-run from our debugger (LLDB). If we want to see the source locations, we also need to set the NATIVE_DEBUG_DUMP env var and execute the program with AOT.
lldb -- target/debug/cairo-native-run -s programs/recursion.cairo --available-gas 99999999 --run-mode aot
Some useful lldb commands:
- process launch: starts the program
- frame select: shows the current line information
- thread step-in: makes a source level single step
- thread continue: continues execution of the current process
- disassemble --frame --mixed: shows assembly instructions mixed with source level code
Enable logging to see the compilation process:
export RUST_LOG="cairo_native=trace"
- Try to find the minimal program that reproduces an issue; the more isolated it is, the easier it is to test.
- Use the debug_utils print utilities. For example:
#[cfg(feature = "with-debug-utils")]
{
metadata.get_mut::<DebugUtils>()
.unwrap()
.print_pointer(context, helper, entry, ptr, location)?;
}
Contracts are difficult to debug for various reasons, including:
- They are external to the project.
- We don’t have their source code.
- They run autogenerated code (the wrapper).
- They have a limited number of allowed libfuncs (ex. cannot use the print libfunc).
- Usually it’s not a single contract but multiple contracts that call each other.
Some of them have workarounds:
There are various options for obtaining the contract, which include:
- Manually invoking a Starknet API using curl to get the contract class.
Example:
curl --location --request POST 'https://mainnet.juno.internal.lambdaclass.com' \
--header 'Content-Type: application/json' \
--data-raw '{
"jsonrpc": "2.0",
"method": "starknet_getClass",
"id": 0,
"params": {
"class_hash": "0x036078334509b514626504edc9fb252328d1a240e4e948bef8d0c08dff45927f",
"block_id": 657887
}
}'
- Running the replay with some code to write all the executed contracts on disk.
Both should provide us with the contract, but if we’re manually invoking the API we’ll need to process the JSON a bit to:
- Remove the JsonRPC overhead, and
- Convert the ABI from a string of JSON into a JSON object.
The contract JSON contains the Sierra program in a useless form (in the sense that we cannot understand anything), as well as some information about the entry points and some ABI types. We’ll need the Sierra program (in Sierra format, not the JSON) to be able to understand what should be happening.
We can use the starknet-sierra-extract-code binary, which can be found in the cairo project when compiled from source (not in the binary distribution). That binary will extract the Sierra program without any debug information, which is still not very useful.
Once we have the Sierra we can run the Sierra mapper to autogenerate some type, libfunc and function names so that we know what we’re looking at without losing our mind. The Sierra mapper can be run multiple times, adding more names manually as the user sees fit.
First of all we need to know which contract is actually failing. Most of the time the crash doesn’t happen in the contract identified by the transaction’s class hash, but somewhere down a chain of contract/library calls.
To know which contract is being called we can add some debugging prints in the replay that log contract executions. For example:
impl StarknetSyscallHandler for ReplaySyscallHandler {
// ...
fn library_call(
&mut self,
class_hash: Felt,
function_selector: Felt,
calldata: &[Felt],
remaining_gas: &mut u128,
) -> SyscallResult<Vec<Felt>> {
// ...
println!("Starting execution of contract {class_hash} on selector {function_selector} with calldata {calldata:?}.");
let result = executor.invoke_contract_dynamic(...);
println!("Finished execution of contract {class_hash}.");
if result.failure_flag {
println!("Execution of contract {class_hash} failed.");
}
// ...
}
fn call_contract(
&mut self,
address: Felt,
entry_point_selector: Felt,
calldata: &[Felt],
remaining_gas: &mut u128,
) -> SyscallResult<Vec<Felt>> {
// ... (the elided code is assumed to resolve `class_hash` from `address`)
println!("Starting execution of contract {class_hash} on selector {entry_point_selector} with calldata {calldata:?}.");
let result = executor.invoke_contract_dynamic(...);
println!("Finished execution of contract {class_hash}.");
if result.failure_flag {
println!("Execution of contract {class_hash} failed.");
}
// ...
}
}
If we run something like the above then the replay should start printing the log of what’s actually being executed and where it crashes. It may print the error message multiple times, but only the first one is relevant (the others should be the contract call chain in reverse order). Once we know which contract is being called and its calldata we can download and extract its Sierra as detailed above.
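As a rough illustration of "only the first one is relevant", the sketch below scans a captured log for the first failure line and returns the class hash it mentions. It assumes the exact message format of the `println!` calls above; `first_failed_contract` is a hypothetical helper, not part of the replay.

```rust
/// Returns the class hash mentioned by the first
/// "Execution of contract <hash> failed." line in `log`, if any.
/// The message format is assumed to match the debugging prints above.
fn first_failed_contract(log: &str) -> Option<&str> {
    const PREFIX: &str = "Execution of contract ";
    const SUFFIX: &str = " failed.";

    // Later failure lines belong to the callers (the call chain in reverse
    // order), so we stop at the first match.
    log.lines()
        .find_map(|line| line.trim().strip_prefix(PREFIX)?.strip_suffix(SUFFIX))
}
```

With a log where contract `0x1` calls contract `0x2` and the latter fails, the helper would return `0x2`, which is the contract whose Sierra we want to download.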
We then need to know where it fails within the contract. To do that we can look at the error message and deduce where it’s used based on the Sierra program. For example, the error message u256_mul overflow is felt-encoded as 0x753235365f6d756c206f766572666c6f77, or 39879774624083218221774975706286902767479 in decimal. If we look for usages of that specific value we’ll most likely find all the places where that error can be thrown. Now we just need to narrow them down to a single one and we’ll be able to actually start debugging.
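The felt encoding of an error message is just its ASCII bytes read as one big-endian number. A minimal sketch of that conversion, producing the hexadecimal form, could look like the following; `short_string_to_felt_hex` is a hypothetical helper name, and the decimal form is left out because the value does not fit in a `u128`.

```rust
/// Encodes an ASCII short string as a hexadecimal felt literal.
/// Each byte becomes two hex digits, most significant byte first.
/// Returns `None` for inputs that cannot be a Cairo short string.
fn short_string_to_felt_hex(s: &str) -> Option<String> {
    // A short string must be ASCII and fit in 31 bytes. Empty input is
    // rejected here only to keep the sketch simple.
    if s.is_empty() || s.len() > 31 || !s.is_ascii() {
        return None;
    }

    let digits: String = s.bytes().map(|b| format!("{b:02x}")).collect();
    Some(format!("0x{digits}"))
}
```

Note that the encoding is case sensitive: `u256_mul overflow` and `u256_mul Overflow` produce different values, so the search string has to match the message exactly.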
One way to do that is to modify Cairo Native so that it adds a breakpoint every time a constant with that error message is generated. For example:
/// Generate MLIR operations for the `felt252_const` libfunc.
pub fn build_const<'ctx, 'this>(
context: &'ctx Context,
registry: &ProgramRegistry<CoreType, CoreLibfunc>,
entry: &'this Block<'ctx>,
location: Location<'ctx>,
helper: &LibfuncHelper<'ctx, 'this>,
metadata: &mut MetadataStorage,
info: &Felt252ConstConcreteLibfunc,
) -> Result<()> {
let value = match info.c.sign() {
Sign::Minus => {
let prime = metadata
.get::<PrimeModuloMeta<Felt>>()
.ok_or(Error::MissingMetadata)?
.prime();
(&info.c + prime.to_bigint().expect("always is Some"))
.to_biguint()
.expect("always is positive")
}
_ => info.c.to_biguint().expect("sign already checked"),
};
let felt252_ty = registry.build_type(
context,
helper,
registry,
metadata,
&info.branch_signatures()[0].vars[0].ty,
)?;
if value == "39879774624083218221774975706286902767479".parse().unwrap() {
// If using the debugger:
metadata
.get_mut::<crate::metadata::debug_utils::DebugUtils>()
.unwrap()
.debug_breakpoint_trap(entry, location)
.unwrap();
// If not using the debugger (not tested, may not provide useful information).
metadata
.get_mut::<crate::metadata::debug_utils::DebugUtils>()
.unwrap()
.debug_print(
context,
helper,
entry,
&format!("Invoked felt252_const<error_msg> at {location}."),
location,
)
.unwrap();
}
let value = entry.const_int_from_type(context, location, value, felt252_ty)?;
entry.append_operation(helper.br(0, &[value], location));
Ok(())
}
Using the debugger will also provide the internal call backtrace (of the contract) and register values, so it’s the recommended way, but depending on the contract it may not be feasible (ex. the contract is too big and running the debugger is not practical due to the amount of time it takes to get to the crash).
Once we know exactly where it crashes we can follow the control flow of the Sierra program backwards and discover how it reached that point.
In some cases the problem may be somewhere completely different from where the error is thrown. In other words, the error we’re seeing may be a side effect of a completely different bug. For example, in a u256_mul overflow, the bug may be found in the mul operation implementation, or alternatively it may just be that the values passed to it are not what they should be. That’s why it’s important to check for those cases and keep following the control flow backwards as required.
Before fixing the bug it’s really important to know:
- Where it happens (in our compiler, not so much in the contract at this point)
- Why it happens (as in, what caused this bug to be in our codebase in the first place)
- How to fix it properly (not the actual code but to know what steps to take to fix it).
- Could the same bug happen in different places? (for example, if it was the implementation of u64_sqrt, could the same bug happen in u32_sqrt and others?)
- What side-effects will the bug fix trigger? (for example, if the fix implies changing the layout of some type, will the new layout make something completely unrelated fail later on?)
The last one is really important since we don’t want to cause more bugs fixing the ones we already have. To understand the side effects we need to have a full understanding of the bug, which implies having an answer to (at least) all the other things to know before fixing it.
Once we know all that we can:
- Add tests that reproduce the bug (including all the variants that we may discover).
- Implement the fix in code.
Note: Those steps must be done in that order. Otherwise we risk unconsciously avoiding bugs in our tests for our bug fix implementation by building our tests from our implementation instead of the correct behaviour.
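To illustrate writing the tests from the correct behaviour rather than from the implementation, the sketch below checks a candidate square root against an independent reference on the same edge cases for several integer widths. Everything in it is hypothetical: `reference_isqrt`, `edge_cases` and `failing_inputs` are illustrative names, and the candidate closure stands in for whatever executes the libfunc under test.

```rust
/// Independent reference: floor of the square root, found by binary search.
/// The products are computed in u128 so they cannot overflow.
fn reference_isqrt(x: u64) -> u64 {
    let (mut lo, mut hi) = (0u64, 1u64 << 32); // sqrt(u64::MAX) < 2^32
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2;
        if (mid as u128) * (mid as u128) <= x as u128 {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}

/// Edge cases for an unsigned integer of `bits` bits: small values, the
/// largest perfect square and its predecessor, and the top of the range.
fn edge_cases(bits: u32) -> Vec<u64> {
    assert!(matches!(bits, 8 | 16 | 32 | 64), "unsupported width");
    let max = if bits == 64 { u64::MAX } else { (1u64 << bits) - 1 };
    let root = reference_isqrt(max);
    vec![0, 1, 2, 3, 4, root * root - 1, root * root, max - 1, max]
}

/// Returns the edge cases on which `candidate` disagrees with the reference.
fn failing_inputs(bits: u32, candidate: impl Fn(u64) -> u64) -> Vec<u64> {
    edge_cases(bits)
        .into_iter()
        .filter(|&x| candidate(x) != reference_isqrt(x))
        .collect()
}
```

Running the same check for every width (the u32_sqrt and u64_sqrt variants, in the example above) is what catches the same bug hiding in sibling implementations.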
To aid in the debugging process, we developed sierra-emu. It’s an external tool that executes raw sierra code and outputs an execution trace, containing each statement executed and the associated state.
In addition to this, we developed the with-trace-dump feature for Cairo Native, which generates an execution trace that records every statement executed. It has the same shape as the one generated by the Sierra emulator. Supporting transaction execution with Cairo Native trace dump required quite a few hacks, which is why we haven’t merged it to main and why we need to use a specific Cairo Native branch.
By combining both tools, we can hopefully pinpoint exactly which libfunc implementation is buggy.
Before starting, make sure to clone starknet-replay.
- Check out the trace-dump branch of starknet-replay.
- Execute a single transaction with the use-sierra-emu feature:
cargo run --features use-sierra-emu tx <HASH> <CHAIN> <BLOCK>
- Once finished, it will have written the traces of each inner contract inside of traces/emu, relative to the current working directory.
As a single transaction can invoke multiple contracts (by contract and library calls), this generates a trace file for each contract executed, numbered in ascending order: trace_0.json, trace_1.json, etc.
- Check out the trace-dump branch of starknet-replay.
- Execute a single transaction with the with-trace-dump feature:
cargo run --features with-trace-dump tx <HASH> <CHAIN> <BLOCK>
- Once finished, it will have written the traces of each inner contract inside of traces/native, relative to the current working directory.
If the execution panics, it may indicate that not all the required libfuncs or types have been implemented (for either the Sierra emulator or the Cairo Native trace dump feature). It is a good idea to patch the dependencies to a local path and implement the missing features. You can add this to Cargo.toml:
[patch.'https://github.com/lambdaclass/cairo_native']
cairo-native = { path = "../cairo_native" }
[patch.'https://github.com/lambdaclass/sierra-emu']
sierra-emu = { path = "../sierra-emu" }
Once you have generated the traces for both the Sierra emulator and Cairo Native, you can begin debugging.
- Compare the traces of the same contract with your favorite tool (the paths are left unquoted so that the shell expands the braces):
diff traces/{emu,native}/trace_0.json
# or
delta traces/{emu,native}/trace_0.json --side-by-side
- Look for the first significant difference between the traces. Not all the differences are significant, for example:
  - Sometimes the emulator and Cairo Native differ in the Gas builtin. It usually doesn’t affect the outcome of the contract.
  - The ec_state_init libfunc randomizes an elliptic curve point, which is why they always differ.
- Find the index of the statement executed immediately before the first difference.
- Open traces/prog_0.sierra and look for that statement.
  - If it’s a return, then you are dealing with a control flow bug. These are difficult to debug.
  - If it’s a libfunc invocation, then that libfunc is probably the one that is buggy.
  - If it’s a library or contract call, then the bug is probably in another contract, and you should move onto the next trace.
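The "first difference" step can be sketched as follows for the control flow only. The sketch assumes the executed statement indexes have already been extracted from each trace into a slice (the trace JSON layout is not described here, so no parsing is attempted), and both function names are hypothetical.

```rust
/// Returns the position of the first entry at which the two sequences of
/// executed statement indexes differ, or `None` if they are identical.
fn first_divergence(emu: &[u64], native: &[u64]) -> Option<usize> {
    let common = emu.len().min(native.len());
    (0..common)
        .find(|&i| emu[i] != native[i])
        // One trace being a strict prefix of the other is also a divergence.
        .or_else(|| (emu.len() != native.len()).then_some(common))
}

/// Returns the index of the statement executed immediately before the
/// divergence, which is the one to look up in the Sierra program.
/// Returns `None` if there is no divergence or it happens at the very start.
fn last_common_statement(emu: &[u64], native: &[u64]) -> Option<u64> {
    let pos = first_divergence(emu, native)?;
    pos.checked_sub(1).map(|i| emu[i])
}
```

This only catches differences in which statements ran. Differences in the associated state still need a full diff of the two trace files.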
In the scripts folder of starknet-replay, you can find useful scripts for debugging. Make sure to execute them in the root directory. Some scripts require delta to be installed.
- compare-traces: Compares every trace and outputs which are different. This can help find the buggy contract when there are a lot of traces.
> ./scripts/compare-traces.sh
difference: ./traces/emu/trace_0.json ./traces/native/trace_0.json
difference: ./traces/emu/trace_1.json ./traces/native/trace_1.json
difference: ./traces/emu/trace_3.json ./traces/native/trace_3.json
missing file: ./traces/native/trace_4.json
- diff-trace: Receives a trace number, and executes delta to compare that trace.
./scripts/diff-trace.sh 1
- diff-trace-flow: Like diff-trace, but only diffs (with delta) the statement indexes. It can be used to visualize the control flow difference.
./scripts/diff-trace-flow.sh 1
- string-to-felt: Converts the given string to a felt. Can be used to search in the code where a specific error message was generated.
> ./scripts/string-to-felt.sh "u256_mul Overflow"
753235365f6d756c204f766572666c6f77
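For reference, the kind of check that compare-traces reports can be approximated with the standard library alone. This is a sketch of the same idea under stated assumptions, not the script’s actual implementation: it compares files byte by byte, and `TraceStatus` and `compare_traces` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing one emulator trace with its Cairo Native counterpart.
#[derive(Debug, PartialEq)]
enum TraceStatus {
    Same,
    Different,
    MissingNative,
}

/// Compares every file in `emu_dir` with the file of the same name in
/// `native_dir`. The report is sorted by file name (lexicographically, so
/// trace_10.json sorts before trace_2.json).
fn compare_traces(emu_dir: &Path, native_dir: &Path) -> io::Result<Vec<(String, TraceStatus)>> {
    let mut report = Vec::new();
    for entry in fs::read_dir(emu_dir)? {
        let name = entry?.file_name();
        let native_path = native_dir.join(&name);
        let status = if !native_path.exists() {
            TraceStatus::MissingNative
        } else if fs::read(emu_dir.join(&name))? == fs::read(&native_path)? {
            // Byte equality: formatting-only changes count as differences.
            TraceStatus::Same
        } else {
            TraceStatus::Different
        };
        report.push((name.to_string_lossy().into_owned(), status));
    }
    report.sort_by(|a, b| a.0.cmp(&b.0));
    Ok(report)
}
```

Calling `compare_traces(Path::new("traces/emu"), Path::new("traces/native"))` from the root directory would list which traces are worth opening with diff-trace.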