Commit 121228b

[Add] files
1 parent a04b849 commit 121228b

35 files changed: +51,816 −2 lines

Diff for: LICENSE

+1 −1

@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2023 testing-cs
+Copyright (c) 2022 Yuejun GUO

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

Diff for: README.md

+91 −1

@@ -1 +1,91 @@
-# vulnerability-detection

# software-vulnerability-detection-imbalance

This project is the PyTorch implementation for the paper *An Empirical Study of the Imbalance Issue in Software Vulnerability Detection*.

## Project Overview

1. Dataset
2. Source code for CodeBERT
3. Source code for GraphCodeBERT

## Environment

```
python==3.7
pytorch==1.7.1
torchvision==0.8.2
tree-sitter==0.20.1
transformers==4.24.0
tqdm
numpy
```

## Dataset

All datasets provide function-level source code and come from three open-source repositories:

[CodeXGLUE](https://github.com/microsoft/CodeXGLUE) provides the devign dataset.

[Devign](https://sites.google.com/view/devign?pli=1) provides the ffmpeg and qemu datasets.

[Lin2018](https://github.com/DanielLin1986/TransferRepresentationLearning) provides the Asterisk, FFmpeg, LibPNG, LibTIFF, Pidgin, and VLC datasets.

Each dataset includes training, validation, and test splits (```*_train.jsonl```, ```*_valid.jsonl```, ```*_test.jsonl```); a minimal loading sketch follows.
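
A sketch of reading one split, assuming the CodeXGLUE-style schema in which each line is a JSON object with a `func` field (the function source) and a `target` label; the field names and the path below are assumptions and may differ per source:

```python
import json

def load_split(path):
    """Read one *_train/*_valid/*_test .jsonl split into (code, label) pairs."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            # Assumed schema: "func" holds the function source code,
            # "target" is 1 (vulnerable) or 0 (non-vulnerable).
            examples.append((obj["func"], int(obj["target"])))
    return examples

# Hypothetical path; check dataset/function-level/ for the real layout.
train = load_split("dataset/function-level/devign/qemu_train.jsonl")
print(len(train), "training functions")
```
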
## Run

For GraphCodeBERT, we first need to build the tree-sitter parser that is used to parse code snippets and extract variable names:

```
cd graphcoderbert/python_parser/parser_folder
bash build.sh
```
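
If build.sh cannot be run directly, the parser library can typically be built with the tree-sitter 0.20.x Python bindings as sketched below; the output name and grammar directories are assumptions (clone the corresponding tree-sitter grammar repositories first):

```python
from tree_sitter import Language

# Compile the cloned grammars into one shared library that the
# preprocessing code can later load with Language(...).
Language.build_library(
    "my-languages.so",                     # output library (name is an assumption)
    ["tree-sitter-c", "tree-sitter-cpp"],  # paths to cloned grammar repos
)
```
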

CodeBERT and GraphCodeBERT use the same commands for training and testing; we use CodeBERT as an example.

### Fine-tuning

```
python run.py \
    --do_train \
    --training standard \
    --data_root devign \
    --project_name qemu \
    --epochs 50 \
    --evaluate_during_training \
    --seed 123456
```

### Validation

```
python run.py \
    --do_eval \
    --training standard \
    --data_root devign \
    --project_name qemu
```

### Test

```
python run.py \
    --do_test \
    --training standard \
    --data_root devign \
    --project_name qemu
```

Parameter settings:

* --training: the strategy used to address the imbalance issue. Choices:
  * standard: the default setting of [CodeBERT](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection) and [GraphCodeBERT](https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT).
  * weight: the mean false error loss.
  * cbl: the class-balanced loss (see the weighting sketch after this list).
  * augmentation: adversarial attack-based augmentation (the re-sampled data are already provided in the dataset folder; you can also generate them with ```dataset/function-level/identifyP/augment.py```).
  * down: random down-sampling.
  * focal: the focal loss.
  * over: random over-sampling (the re-sampled data are already provided in the dataset folder; you can also generate them with ```dataset/function-level/identifyP/augment_du.py```).
  * threshold: threshold-moving.
* --data_root: the source of the data. Choices: codexglue, devign, lin2018.
* --project_name: the name of the dataset. Choices: see the directory names in dataset/function-level/ for each source.
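
For intuition about the cbl option: the class-balanced loss weights each class by the inverse effective number of samples, (1 − β) / (1 − β^n_c). A minimal sketch of that weighting with β = 0.9999, the default in codebert/model.py; the example counts are made up:

```python
import numpy as np

def class_balanced_weights(n_per_class, beta=0.9999):
    """Per-class weight (1 - beta) / (1 - beta^n_c); rare classes get larger weights."""
    n = np.asarray(n_per_class, dtype=np.float64)
    return (1.0 - beta) / (1.0 - np.power(beta, n))

# Hypothetical batch: 900 non-vulnerable vs. 100 vulnerable functions.
w_neg, w_pos = class_balanced_weights([900, 100])
print(w_pos / w_neg)  # ≈ 8.65: the minority (vulnerable) class is up-weighted
```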

Diff for: codebert/model.py

+59

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import numpy as np
import torch
import torch.nn as nn
from torchvision.ops import sigmoid_focal_loss


class Model(nn.Module):
    def __init__(self, encoder, config, tokenizer, args, beta=0.9999):
        super(Model, self).__init__()
        self.encoder = encoder
        self.config = config
        self.tokenizer = tokenizer
        self.args = args
        self.beta = beta  # hyper-parameter of the class-balanced (cbl) loss

    def forward(self, input_ids=None, labels=None):
        # Token id 1 is the RoBERTa pad token, so ne(1) builds the attention mask.
        logits = self.encoder(input_ids, attention_mask=input_ids.ne(1))[0]
        prob = torch.sigmoid(logits)
        if labels is not None:
            if self.args.training in ["standard", "augmentation", "down", "over"]:
                # Plain binary cross-entropy; the sampling-based strategies
                # (augmentation/down/over) rebalance the data, not the loss.
                labels = labels.float()
                loss_ = torch.log(prob[:, 0] + 1e-10) * labels + torch.log((1 - prob)[:, 0] + 1e-10) * (1 - labels)
                loss = -loss_.mean()
            elif self.args.training == "weight":
                # Mean false error loss: average each class's error separately
                # so the minority class is not drowned out by the majority.
                labels = labels.float()
                if len(torch.where(labels == 1)[0]) == 0:
                    loss = -torch.sum(torch.log((1 - prob)[:, 0] + 1e-10) * (1 - labels)) / len(
                        torch.where(labels == 0)[0])
                else:
                    loss = torch.sum(torch.log(prob[:, 0] + 1e-10) * labels) / len(
                        torch.where(labels == 1)[0]) + torch.sum(
                        torch.log((1 - prob)[:, 0] + 1e-10) * (1 - labels)) / len(torch.where(labels == 0)[0])
                    loss /= -2
            elif self.args.training == "cbl":
                # Class-balanced loss: weight each class by (1 - beta) / (1 - beta^n_c),
                # the inverse effective number of samples of that class in the batch.
                weight_0 = (1 - self.beta) / (1 - np.power(self.beta, len(torch.where(labels == 0)[0])))
                if len(torch.where(labels == 1)[0]) > 0:
                    weight_1 = (1 - self.beta) / (1 - np.power(self.beta, len(torch.where(labels == 1)[0])))
                    labels = labels.float()
                    loss = torch.log(prob[:, 0] + 1e-10) * labels * weight_1 + torch.log((1 - prob)[:, 0] + 1e-10) * (
                        1 - labels) * weight_0
                else:
                    # No positive samples in the batch: only the negative term applies.
                    loss = torch.log((1 - prob)[:, 0] + 1e-10) * (1 - labels) * weight_0
                loss = -loss.mean()
            elif self.args.training == "focal":
                loss = sigmoid_focal_loss(logits, labels.view(len(labels), 1).float(), reduction="mean")
            else:
                # Remaining strategies (e.g. threshold): cross-entropy modulated
                # by the predicted probability of the opposite class.
                labels = labels.float()
                loss = torch.log(prob[:, 0] + 1e-10) * labels * ((1 - prob)[:, 0]) + torch.log((1 - prob)[:, 0] + 1e-10) * (1 - labels) * (prob[:, 0])
                loss = -2 * loss.mean()
            return loss, prob
        else:
            return prob
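
A hedged instantiation sketch for this class, assuming the encoder is a RobertaForSequenceClassification head with config.num_labels = 1 (as in the upstream CodeXGLUE defect-detection code) and a minimal stand-in for the parsed run.py arguments:

```python
from argparse import Namespace

import torch
from transformers import RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer

config = RobertaConfig.from_pretrained("microsoft/codebert-base")
config.num_labels = 1  # single sigmoid output, matching prob[:, 0] above
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
encoder = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", config=config)

args = Namespace(training="standard")  # hypothetical stand-in; run.py defines the real args
model = Model(encoder, config, tokenizer, args)

input_ids = tokenizer("int main() { return 0; }", return_tensors="pt").input_ids
loss, prob = model(input_ids=input_ids, labels=torch.tensor([0]))
```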
