This repository contains the code implementation of the paper Beyond Classification: Inferring Function Names in Stripped Binaries via Domain Adapted LLMs.
We implemented SymGen using Ghidra (for decompilation) and ANTLR 4 (for processing source code and decompiled code). The model's fine-tuning and prediction are based on alpaca-lora. For more details, please refer to our paper.
- Create a new conda environment:
  conda create -n symgen python=3.9
- Activate the conda environment:
  conda activate symgen
- Install packages:
  pip install -r requirements.txt
- Install Ghidra. The Ghidra installation guide can be found here.
The original binaries and the decompiled code used by SymGen can be downloaded here.
The uploaded package follows a three-level directory structure, organized as shown below:
.
├── architecture
│   └── opt_level
│       └── project

- architecture: Contains four architectures: x86_64, x86_32, arm, and mips.
- opt_level: Contains different optimization levels: O0, O1, O2, and O3.
- project: Contains 33 projects. The list can be found in our paper.
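For example, the data for a project built for x86_64 at optimization level O2 sits under a path of the form below (the project name is a placeholder; the full list of 33 projects is in the paper):

x86_64/O2/PROJECT_NAME/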
A processed sample dataset is provided in the folder dataset/, which can be used for fine-tuning and prediction directly (see Fine-tune and Predict).
All Python scripts are in the folder scripts/.
cd scripts/
To save the results at each step, we provide configurable arguments (e.g., --output_dir) that allow you to specify the desired output directory.
For convenience, you can also update the default values directly in the scripts.
The first step is to obtain the decompiled code for the functions.
The related scripts are in the folder scripts/decompilation.
- Start by specifying the output_dir in decomp_for_unstripped.py and decomp_for_stripped.py to define where the results will be saved.
- Use parallel_decomp.py to decompile your binaries:
  - Use the -u option for unstripped binaries.
  - Use the -s option for stripped binaries.
python decompilation/parallel_decomp.py [-u | -s] \
--binary_path YOUR_BINARIES_PATH \
--ghidra_path YOUR_GHIDRA_PATH \
--project_path YOUR_GHIDRA_PROJECT_PATH
Arguments:
- -u, --unstripped: Indicates that the binary is unstripped (contains debug symbols).
- -s, --stripped: Indicates that the binary is stripped (lacks debug symbols).
- -b, --binary_path: Path to the binary file or folder containing binaries.
- -g, --ghidra_path: Path to the Ghidra analyzeHeadless script.
- -p, --project_path: Directory path to Ghidra projects.
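For example, to decompile a folder of unstripped binaries with the command above (all paths below are hypothetical; adjust them to your setup):

python decompilation/parallel_decomp.py -u \
--binary_path /data/binaries/unstripped \
--ghidra_path /opt/ghidra/support/analyzeHeadless \
--project_path /data/ghidra_projects

Run the same command with -s and your stripped binaries to produce the stripped-side decompilation.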
After obtaining the decompiled code, we need to further process the data.
The related scripts are in the folder scripts/data_processing.
For generating a dataset (-d), we need to combine the stripped functions with the unstripped functions using DWARF debug information, so that each stripped function is paired with its original name.
For prediction on a stripped binary (-p), the script only needs to output the test set file containing all masked functions.
For generating a dataset:
python data_processing/process_decompiled_data.py -d \
--unstripped_path YOUR_UNSTRIPPED_DECOMP_DATA_PATH \
--stripped_path YOUR_STRIPPED_DECOMP_DATA_PATH \
--output_dir YOUR_OUTPUT_DIR
For prediction on a stripped binary:
python data_processing/process_decompiled_data.py -p \
--stripped_path YOUR_STRIPPED_DECOMP_DATA_PATH \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -d, --dataset: Generate a new dataset for training and testing.
- -p, --prediction: Prepare data for prediction on a new stripped binary.
- -u, --unstripped_path: Path to JSON files containing decompiled unstripped binaries.
- -s, --stripped_path: Path to JSON files containing decompiled stripped binaries.
- -o, --output_dir: Directory to save the output files.
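For example, to build a dataset from the decompilation outputs of the previous step (paths are hypothetical and should match the output directories you chose earlier):

python data_processing/process_decompiled_data.py -d \
--unstripped_path /data/decomp/unstripped \
--stripped_path /data/decomp/stripped \
--output_dir /data/processed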
For prediction on a stripped binary, this substep is not required.
Use divide_dataset.py to split the processed data into training, test, and validation sets. Duplicate entries are removed from the dataset based on the function name and function body (a minimal sketch of this deduplication follows the argument list below). For more details, please refer to our paper.
python data_processing/divide_dataset.py \
--input_dir YOUR_PROCESSED_DECOMP_DIR \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -i, --input_dir: Directory containing the combined decompiled code.
- -o, --output_dir: Directory to save the divided dataset.
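The deduplication mentioned above can be pictured as follows. This is only a minimal sketch of the idea; the record field names (func_name, func_body) are placeholders, not necessarily the schema used by divide_dataset.py.

```python
# Minimal sketch: drop records whose (function name, function body) pair
# has already been seen. Field names are placeholders, not the real schema.
def deduplicate(records):
    seen = set()
    unique = []
    for rec in records:
        key = (rec["func_name"], rec["func_body"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```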
For prediction on a stripped binary, this step is not required.
Extract source function code from your source project (--source_dir) and generate function summaries for fine-tuning.
The related scripts are in the folder scripts/summary_generation.
To extract functions from the source project code, use the following command:
python summary_generation/extract_functions.py \
--source_dir YOUR_SOURCE_PROJECT_DIR \
--test_file YOUR_TESTSET_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -s, --source_dir: Directory containing the source projects.
- -t, --test_file: Path to the test set file, used for removing duplicates in the collected source function code. For example: /dataset/test_set.json.
- -o, --output_dir: Directory to save the output JSON file.
To generate summaries for the extracted source functions, use the following command:
CUDA_VISIBLE_DEVICES=0 python summary_generation/generate_summary.py \
--base_model BASE_MODEL \
--input_file YOUR_SOURCE_FUNC_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- --base_model: Specify a base model, e.g. --base_model='codellama/CodeLlama-34b-Instruct-hf'.
- --input_file: Path to the extracted source function file.
- --output_dir: Directory to save the output JSON file.
To merge the generated summaries with the original training set, use the following command (a minimal sketch of the merge follows the argument list):
python summary_generation/merge_data.py \
--training_data YOUR_TRAINING_SET_FILE \
--summary_data YOUR_SUMMARY_DATA_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -t, --training_data: Path to the JSON file containing the decompiled training set.
- -s, --summary_data: Path to the JSON file containing function summaries.
- -o, --output_dir: Directory to save the merged training set.
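Conceptually, this step combines two JSON datasets into a single training file. The sketch below assumes both inputs are JSON lists of records; merge_data.py may perform additional formatting.

```python
import json

# Minimal sketch: concatenate the decompiled training set and the summary
# data into one training file. Assumes both inputs are JSON lists of records;
# the actual merge_data.py may differ.
def merge(training_path, summary_path, output_path):
    with open(training_path) as f:
        training = json.load(f)
    with open(summary_path) as f:
        summaries = json.load(f)
    with open(output_path, "w") as f:
        json.dump(training + summaries, f, indent=2)
```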
This step is implemented based on alpaca-lora.
We use the Code Llama prompt format (see templates/codellama.json) and adapt the prediction process in predict.py for our task.
The related scripts are in the folder scripts/model_training.
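For reference, alpaca-lora templates describe how the instruction and input are assembled into a single prompt string. The sketch below only illustrates the idea with Code Llama-style [INST] markers; the exact field names and wording are defined in templates/codellama.json, so treat the values here as assumptions for illustration.

```python
# Illustrative sketch of an alpaca-lora-style prompt template.
# The real template lives in templates/codellama.json; the strings below
# are assumptions for illustration, not the actual prompt used by SymGen.
TEMPLATE = {
    "prompt_input": "[INST] {instruction}\n{input} [/INST]",
    "response_split": "[/INST]",
}

def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the template with an instruction and the (masked) decompiled function."""
    return TEMPLATE["prompt_input"].format(instruction=instruction, input=input_text)
```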
To fine-tune the model using your decompiled training set, use the following command:
CUDA_VISIBLE_DEVICES=0 python model_training/finetune.py \
--base_model BASE_MODEL \
--data_path YOUR_TRAINING_SET_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- --base_model: Specify a base model, e.g. --base_model='codellama/CodeLlama-34b-Instruct-hf'.
- --data_path: Path to the JSON file containing the training set.
- --output_dir: Directory to save the LoRA weights.
You can also use the LoRA weights provided in the folder lora_weights/.
To generate predictions using a fine-tuned model, execute the following command:
CUDA_VISIBLE_DEVICES=0 python model_training/predict.py \
--base_model BASE_MODEL \
--lora_weights SAVED_LORA_WEIGHT_DIR \
--input_path YOUR_TEST_SET_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- --base_model: Specify a base model, e.g. --base_model='codellama/CodeLlama-34b-Instruct-hf'.
- --lora_weights: Directory of the saved LoRA weights.
- --input_path: Path to the JSON file containing the test set.
- --output_dir: Directory to save the prediction results.
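Under the hood, prediction attaches the saved LoRA adapter (--lora_weights) to the base model before generation. The snippet below is a minimal sketch of that loading step using the usual transformers/peft pattern that alpaca-lora follows; it is not the exact code in predict.py, and the paths are placeholders.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load the base model, then attach the fine-tuned LoRA weights.
# predict.py additionally handles prompt construction and decoding.
base_model = "codellama/CodeLlama-34b-Instruct-hf"
lora_weights = "SAVED_LORA_WEIGHT_DIR"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, lora_weights)
model.eval()
```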
The related scripts are in the folder scripts/evaluation.
- Extract the function names from the results and divide them into tokens (the token splitting is sketched after the argument list):

python evaluation/divide_function_name.py \
--input_file YOUR_PRED_RESULTS_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -i, --input_file: Path to the input file containing predicted function names and ground truth.
- -o, --output_dir: Directory to save the processed evaluation results.
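Function names are typically split on underscores and case/digit boundaries before scoring. The sketch below shows that kind of splitting; it is not necessarily identical to the rules in divide_function_name.py.

```python
import re

# Minimal sketch: split a function name into lowercase tokens on underscores,
# camelCase boundaries, and digits. The actual splitting rules in
# divide_function_name.py may differ.
def split_name(name):
    tokens = []
    for part in re.split(r"[_\W]+", name):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens if t]

print(split_name("readFileContents"))  # ['read', 'file', 'contents']
print(split_name("xmalloc_n"))         # ['xmalloc', 'n']
```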
- To calculate the precision, recall, and F1 score, use the following command (the token-level metric is sketched after the argument list):

python evaluation/calculate_score.py \
--input_file YOUR_PROCESSED_RESULTS_FILE
Arguments:
- -i, --input_file: Path to the evaluation input file.
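For reference, token-level precision, recall, and F1 are conventionally computed as precision = |P ∩ G| / |P|, recall = |P ∩ G| / |G|, and F1 = 2·precision·recall / (precision + recall), where P and G are the predicted and ground-truth token sets. The sketch below follows that convention; calculate_score.py may differ in details such as averaging across functions.

```python
# Minimal sketch of token-level precision/recall/F1 for a single prediction.
def prf1(pred_tokens, gold_tokens):
    pred, gold = set(pred_tokens), set(gold_tokens)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(["read", "file"], ["read", "file", "contents"]))  # ≈ (1.0, 0.67, 0.8)
```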
This project is licensed under the Apache License 2.0 (See LICENSE).
This project includes other components distributed under the Apache License 2.0:
- alpaca-lora (See scripts/model_training/LICENSE)
TODO