This repository contains the code implementation of the paper Beyond Classification: Inferring Function Names in Stripped Binaries via Domain Adapted LLMs.
We implemented SymGen using Ghidra (for decompilation) and ANTLR 4 (for processing source code and decompiled code). The model's fine-tuning and prediction are based on alpaca-lora. For more details, please refer to our paper.
- Create a new conda environment:
  conda create -n symgen python=3.9
- Activate the conda environment:
  conda activate symgen
- Install packages:
  pip install -r requirements.txt
- Install Ghidra. The Ghidra installation guide can be found here.
The original binaries and the decompiled code used by SymGen can be downloaded here.
The uploaded package follows a three-level directory structure, organized as shown below:
.
├── architecture
│   └── opt_level
│       └── project

- architecture: Contains four architectures: x86_64, x86_32, arm, and mips.
- opt_level: Contains different optimization levels: O0, O1, O2, and O3.
- project: Contains 33 projects. The list can be found in our paper.
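For example, the data for a project built for x86_64 at optimization level O2 sits under a path of the form below (the project name is a placeholder; the full list of 33 projects is in the paper):

x86_64/O2/PROJECT_NAME/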
A processed sample dataset is provided in the folder dataset/, which can be used for fine-tuning and prediction directly (see Fine-tune and Predict).
All Python scripts are in the folder scripts/.
cd scripts/
To save the results at each step, we provide configurable arguments (e.g., --output_dir) that allow you to specify the desired output directory.
For convenience, you can also update the default values directly in the scripts.
The first step is to obtain the decompiled code for the functions.
The related scripts are in the folder scripts/decompilation.
- Start by specifying the output_dir in decomp_for_unstripped.py and decomp_for_stripped.py to define where the results will be saved.
- Use parallel_decomp.py to decompile your binaries:
  - Use the -u option for unstripped binaries.
  - Use the -s option for stripped binaries.
python decompilation/parallel_decomp.py [-u | -s] \
--binary_path YOUR_BINARIES_PATH \
--ghidra_path YOUR_GHIDRA_PATH \
--project_path YOUR_GHIDRA_PROJECT_PATH
Arguments:
- -u, --unstripped: Indicates that the binary is unstripped (contains debug symbols).
- -s, --stripped: Indicates that the binary is stripped (lacks debug symbols).
- -b, --binary_path: Path to the binary file or folder containing binaries.
- -g, --ghidra_path: Path to the Ghidra analyzeHeadless script.
- -p, --project_path: Directory path to Ghidra projects.
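For example, to decompile a folder of unstripped binaries with the command above (all paths below are hypothetical; adjust them to your setup):

python decompilation/parallel_decomp.py -u \
--binary_path /data/binaries/unstripped \
--ghidra_path /opt/ghidra/support/analyzeHeadless \
--project_path /data/ghidra_projects

Run the same command with -s and your stripped binaries to produce the stripped-side decompilation.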
After obtaining the decompiled code, we need to further process the data.
The related scripts are in the folder scripts/data_processing.
For generating a dataset (-d), we need to combine the stripped functions with the unstripped functions using DWARF debug information, so that each stripped function is paired with its original name.
For prediction on a stripped binary (-p), the script only needs to output the test set file containing all masked functions.
For generating a dataset:
python data_processing/process_decompiled_data.py -d \
--unstripped_path YOUR_UNSTRIPPED_DECOMP_DATA_PATH \
--stripped_path YOUR_STRIPPED_DECOMP_DATA_PATH \
--output_dir YOUR_OUTPUT_DIR
For prediction on a stripped binary:
python data_processing/process_decompiled_data.py -p \
--stripped_path YOUR_STRIPPED_DECOMP_DATA_PATH \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -d, --dataset: Generate a new dataset for training and testing.
- -p, --prediction: Prepare data for prediction on a new stripped binary.
- -u, --unstripped_path: Path to JSON files containing decompiled unstripped binaries.
- -s, --stripped_path: Path to JSON files containing decompiled stripped binaries.
- -o, --output_dir: Directory to save the output files.
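For example, to build a dataset from the decompilation outputs of the previous step (paths are hypothetical and should match the output directories you chose earlier):

python data_processing/process_decompiled_data.py -d \
--unstripped_path /data/decomp/unstripped \
--stripped_path /data/decomp/stripped \
--output_dir /data/processed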
For prediction on a stripped binary, this substep is not required.
Use divide_dataset.py to split the processed data into training, test, and validation sets. Duplicate entries are removed from the dataset based on the function name and function body (a minimal sketch of this deduplication follows the argument list below). For more details, please refer to our paper.
python data_processing/divide_dataset.py \
--input_dir YOUR_PROCESSED_DECOMP_DIR \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -i, --input_dir: Directory containing the combined decompiled code.
- -o, --output_dir: Directory to save the divided dataset.
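The deduplication mentioned above can be pictured as follows. This is only a minimal sketch of the idea; the record field names (func_name, func_body) are placeholders, not necessarily the schema used by divide_dataset.py.

```python
# Minimal sketch: drop records whose (function name, function body) pair
# has already been seen. Field names are placeholders, not the real schema.
def deduplicate(records):
    seen = set()
    unique = []
    for rec in records:
        key = (rec["func_name"], rec["func_body"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```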
For prediction on a stripped binary, this step is not required.
Extract source function code from your source project (--source_dir) and generate function summaries for fine-tuning.
The related scripts are in the folder scripts/summary_generation.
To extract functions from the source project code, use the following command:
python summary_generation/extract_functions.py \
--source_dir YOUR_SOURCE_PROJECT_DIR \
--test_file YOUR_TESTSET_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -s, --source_dir: Directory containing the source projects.
- -t, --test_file: Path to the test set file, used for removing duplicates in the collected source function code. For example: /dataset/test_set.json.
- -o, --output_dir: Directory to save the output JSON file.
To generate summaries for the extracted source functions, use the following command:
CUDA_VISIBLE_DEVICES=0 python summary_generation/generate_summary.py \
--base_model BASE_MODEL \
--input_file YOUR_SOURCE_FUNC_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- --base_model: Specify a base model, e.g. --base_model='codellama/CodeLlama-34b-Instruct-hf'.
- --input_file: Path to the extracted source function file.
- --output_dir: Directory to save the output JSON file.
To merge the generated summaries with the original training set, use the following command (a minimal sketch of the merge follows the argument list):
python summary_generation/merge_data.py \
--training_data YOUR_TRAINING_SET_FILE \
--summary_data YOUR_SUMMARY_DATA_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -t, --training_data: Path to the JSON file containing the decompiled training set.
- -s, --summary_data: Path to the JSON file containing function summaries.
- -o, --output_dir: Directory to save the merged training set.
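Conceptually, this step combines two JSON datasets into a single training file. The sketch below assumes both inputs are JSON lists of records; merge_data.py may perform additional formatting.

```python
import json

# Minimal sketch: concatenate the decompiled training set and the summary
# data into one training file. Assumes both inputs are JSON lists of records;
# the actual merge_data.py may differ.
def merge(training_path, summary_path, output_path):
    with open(training_path) as f:
        training = json.load(f)
    with open(summary_path) as f:
        summaries = json.load(f)
    with open(output_path, "w") as f:
        json.dump(training + summaries, f, indent=2)
```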
This step is implemented based on alpaca-lora.
We use the Code Llama prompt format (see templates/codellama.json) and adapt the prediction process in predict.py for our task.
The related scripts are in the folder scripts/model_training.
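For reference, alpaca-lora templates describe how the instruction and input are assembled into a single prompt string. The sketch below only illustrates the idea with Code Llama-style [INST] markers; the exact field names and wording are defined in templates/codellama.json, so treat the values here as assumptions for illustration.

```python
# Illustrative sketch of an alpaca-lora-style prompt template.
# The real template lives in templates/codellama.json; the strings below
# are assumptions for illustration, not the actual prompt used by SymGen.
TEMPLATE = {
    "prompt_input": "[INST] {instruction}\n{input} [/INST]",
    "response_split": "[/INST]",
}

def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the template with an instruction and the (masked) decompiled function."""
    return TEMPLATE["prompt_input"].format(instruction=instruction, input=input_text)
```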
To fine-tune the model using your decompiled training set, use the following command:
CUDA_VISIBLE_DEVICES=0 python model_training/finetune.py \
--base_model BASE_MODEL \
--data_path YOUR_TRAINING_SET_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- --base_model: Specify a base model, e.g. --base_model='codellama/CodeLlama-34b-Instruct-hf'.
- --data_path: Path to the JSON file containing the training set.
- --output_dir: Directory to save the LoRA weights.
You can also use the LoRA weights provided in the folder lora_weights/.
To generate predictions using a fine-tuned model, execute the following command:
CUDA_VISIBLE_DEVICES=0 python model_training/predict.py \
--base_model BASE_MODEL \
--lora_weights SAVED_LORA_WEIGHT_DIR \
--input_path YOUR_TEST_SET_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- --base_model: Specify a base model, e.g. --base_model='codellama/CodeLlama-34b-Instruct-hf'.
- --lora_weights: Directory of the saved LoRA weights.
- --input_path: Path to the JSON file containing the test set.
- --output_dir: Directory to save the prediction results.
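Under the hood, prediction attaches the saved LoRA adapter (--lora_weights) to the base model before generation. The snippet below is a minimal sketch of that loading step using the usual transformers/peft pattern that alpaca-lora follows; it is not the exact code in predict.py, and the paths are placeholders.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load the base model, then attach the fine-tuned LoRA weights.
# predict.py additionally handles prompt construction and decoding.
base_model = "codellama/CodeLlama-34b-Instruct-hf"
lora_weights = "SAVED_LORA_WEIGHT_DIR"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, lora_weights)
model.eval()
```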
The related scripts are in the folder scripts/evaluation.
- Extract the function names from the results and divide them into tokens (the token splitting is sketched after the argument list):

python evaluation/divide_function_name.py \
--input_file YOUR_PRED_RESULTS_FILE \
--output_dir YOUR_OUTPUT_DIR
Arguments:
- -i, --input_file: Path to the input file containing predicted function names and ground truth.
- -o, --output_dir: Directory to save the processed evaluation results.
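Function names are typically split on underscores and case/digit boundaries before scoring. The sketch below shows that kind of splitting; it is not necessarily identical to the rules in divide_function_name.py.

```python
import re

# Minimal sketch: split a function name into lowercase tokens on underscores,
# camelCase boundaries, and digits. The actual splitting rules in
# divide_function_name.py may differ.
def split_name(name):
    tokens = []
    for part in re.split(r"[_\W]+", name):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens if t]

print(split_name("readFileContents"))  # ['read', 'file', 'contents']
print(split_name("xmalloc_n"))         # ['xmalloc', 'n']
```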
- To calculate the precision, recall, and F1 score, use the following command (the token-level metric is sketched after the argument list):

python evaluation/calculate_score.py \
--input_file YOUR_PROCESSED_RESULTS_FILE
Arguments:
- -i, --input_file: Path to the evaluation input file.
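For reference, token-level precision, recall, and F1 are conventionally computed as precision = |P ∩ G| / |P|, recall = |P ∩ G| / |G|, and F1 = 2·precision·recall / (precision + recall), where P and G are the predicted and ground-truth token sets. The sketch below follows that convention; calculate_score.py may differ in details such as averaging across functions.

```python
# Minimal sketch of token-level precision/recall/F1 for a single prediction.
def prf1(pred_tokens, gold_tokens):
    pred, gold = set(pred_tokens), set(gold_tokens)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(["read", "file"], ["read", "file", "contents"]))  # ≈ (1.0, 0.67, 0.8)
```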
This project is licensed under the Apache License 2.0 (See LICENSE).
This project includes other components distributed under the Apache License 2.0:
- alpaca-lora (See scripts/model_training/LICENSE)
TODO