Skip to content

Commit 69fd8d8

Browse files
dhalfLucas BourtouleGrosQuildu
authored
ML rules fixes, new rule for msgpack-numpy (#39)
* improves some rules and tests * msgpack-numpy rule * new rule for pandas eval functions * keras and tf loading functions * signed commit * new rules: keras and tf loading * fix formatting, less FPs for pandas-eval * pandas-eval metadata * pandas-eval, add testcase for empty f-string * numpy-in-pytorch-modules, generic match * pickles-in-keras-deprecation - regex --------- Co-authored-by: Lucas Bourtoule <[email protected]> Co-authored-by: Paweł Płatek <[email protected]> Co-authored-by: GrosQuildu <[email protected]>
1 parent 772f68e commit 69fd8d8

15 files changed

+336
-2
lines changed

python/lxml-in-pandas.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,3 +24,7 @@
2424
kwargs2 = {"io": touch, "flavor":"html5lib"}
2525
# ok: lxml-in-pandas
2626
pd.read_html(**kwargs2)
27+
28+
def test(**kwargs):
29+
# ruleid: lxml-in-pandas
30+
pd.read_html(**kwargs)

python/msgpack-numpy.py

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
import msgpack
2+
import msgpack_numpy as m
3+
import numpy as np
4+
5+
x = np.random.rand(5)
6+
# ruleid: msgpack-numpy
7+
x_enc = msgpack.packb(x, default=m.encode)
8+
# ruleid: msgpack-numpy
9+
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
10+
11+
# ok: msgpack-numpy
12+
x_enc2 = msgpack.packb(x)
13+
# ok: msgpack-numpy
14+
x_rec2 = msgpack.unpackb(x_enc2)
15+
16+
# ok: msgpack-numpy
17+
x_enc3 = msgpack.load(x)
18+
# ok: msgpack-numpy
19+
x_rec3 = msgpack.loads(x_enc2)
20+
21+
m.patch()
22+
23+
# ruleid: msgpack-numpy
24+
x_enc3 = msgpack.packb(x)
25+
# ruleid: msgpack-numpy
26+
x_rec3 = msgpack.unpackb(x_enc2)
27+
28+
# ruleid: msgpack-numpy
29+
x_enc3 = msgpack.load(x)
30+
# ruleid: msgpack-numpy
31+
x_rec3 = msgpack.loads(x_enc2)

python/msgpack-numpy.yaml

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
rules:
2+
- id: msgpack-numpy
3+
message: >-
4+
Found usage of msgpack-numpy unpacking, which relies on pickle to deserialize numpy arrays containing objects.
5+
Functions reliant on pickle can result in arbitrary code execution.
6+
Consider switching to a safer serialization method.
7+
languages: [python]
8+
severity: ERROR
9+
metadata:
10+
category: security
11+
cwe: "CWE-502: Deserialization of Untrusted Data"
12+
subcategory: [vuln]
13+
confidence: MEDIUM
14+
likelihood: MEDIUM
15+
impact: HIGH
16+
technology: [numpy]
17+
description: "Potential arbitrary code execution from functions reliant on pickling"
18+
references:
19+
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/
20+
21+
pattern-either:
22+
- patterns:
23+
- pattern: msgpack.$FN(...)
24+
- metavariable-regex:
25+
metavariable: $FN
26+
regex: (loads?|dumps?|packb?|unpackb?)
27+
- pattern-inside: |
28+
msgpack_numpy.patch()
29+
...
30+
31+
- patterns:
32+
- pattern: msgpack.$FN(..., object_hook=msgpack_numpy.decode, ...)
33+
- metavariable-regex:
34+
metavariable: $FN
35+
regex: unpackb?
36+
37+
- patterns:
38+
- pattern: msgpack.$FN(..., default=msgpack_numpy.encode, ...)
39+
- metavariable-regex:
40+
metavariable: $FN
41+
regex: packb?

python/numpy-in-pytorch-modules.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,9 @@ def forward(self, x, y):
1010
# ruleid: numpy-in-pytorch-modules
1111
y = np.concatenate((x, y), axis=1)
1212

13+
# ruleid: numpy-in-pytorch-modules
14+
np.ndarray.sort(y)
15+
1316
def forward_correct(self, x, y):
1417
x = self.dropout(x)
1518
# ok: numpy-in-pytorch-modules

python/numpy-in-pytorch-modules.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,9 @@ rules:
1515
- https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects
1616

1717
patterns:
18-
- pattern: $RESULT = numpy.$FUNCTION(...)
18+
- pattern-either:
19+
- pattern: numpy.$FN(...)
20+
- pattern: numpy. ... .$FN(...)
1921
- pattern-inside: |
2022
class $MODULE(torch.nn.Module):
2123
...

python/pandas-eval.py

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
import pandas as pd
2+
3+
def id(x):
4+
return x
5+
6+
expr = id("something")
7+
colB = id("B")
8+
subexpr = id("df.age * 2")
9+
10+
df1 = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
11+
# ok: pandas-eval
12+
r11 = df1.eval('A + B')
13+
# ruleid: pandas-eval
14+
r12 = df1.eval(expr)
15+
# ok: pandas-eval
16+
r13 = df1.eval(f"A + B")
17+
# ruleid: pandas-eval
18+
r14 = df1.eval(f"A + {colB}")
19+
# ok: pandas-eval
20+
r15 = df1.eval(f"")
21+
22+
df2 = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
23+
# ok: pandas-eval
24+
pd.eval("double_age = df.age * 2", target=df2)
25+
# ruleid: pandas-eval
26+
pd.eval(expr, target=df2)
27+
# ok: pandas-eval
28+
pd.eval(f"double_age = df.age * 2", target=df2)
29+
# ruleid: pandas-eval
30+
pd.eval(f"double_age = {subexpr}", target=df2)
31+
32+
df3 = pd.DataFrame({
33+
'A': range(1, 6),
34+
'B': range(10, 0, -2)
35+
})
36+
# ok: pandas-eval
37+
r31 = df3.query('A > B')
38+
# ruleid: pandas-eval
39+
r32 = df3.query(expr)
40+
# ok: pandas-eval
41+
r33 = df3.query(f'A > B')
42+
# ruleid: pandas-eval
43+
r34 = df3.query(f'A > {colB}')
44+
45+
class X:
46+
def query(self, x):
47+
pass
48+
49+
# ok: pandas-eval
50+
X().query(expr)

python/pandas-eval.yaml

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
rules:
2+
- id: pandas-eval
3+
message: >-
4+
Pandas eval() and query() may be dangerous if used to evaluate
5+
dynamic content. If this content can be input from outside the program, this
6+
may be a code injection vulnerability. Ensure evaluated content is not definable
7+
by external sources.
8+
languages: [python]
9+
severity: ERROR
10+
metadata:
11+
category: security
12+
cwe: "CWE-95: Improper Neutralization of Directives in Dynamically Evaluated Code ('Eval Injection')"
13+
subcategory:
14+
- audit
15+
confidence: LOW
16+
likelihood: LOW
17+
impact: HIGH
18+
technology: [pandas]
19+
description: "Potential arbitrary code execution from `pandas` functions that evaluate user-provided expressions"
20+
references:
21+
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/
22+
23+
patterns:
24+
- pattern-inside: |
25+
import pandas
26+
...
27+
- pattern-either:
28+
- patterns:
29+
- pattern: pandas.DataFrame.$FN(...)
30+
- pattern-not: pandas.DataFrame.$FN("...", ...)
31+
- pattern-not: pandas.DataFrame.$FN(f"", ...)
32+
33+
- patterns:
34+
- pattern: pandas.$FN(...)
35+
- pattern-not: pandas.$FN("...", ...)
36+
- pattern-not: pandas.$FN(f"", ...)
37+
38+
- patterns:
39+
- pattern-inside: |
40+
$DF = pandas.DataFrame(...)
41+
...
42+
- pattern: $DF.$FN(...)
43+
- pattern-not: $DF.$FN("...", ...)
44+
- pattern-not: $DF.$FN(f"", ...)
45+
46+
- metavariable-regex:
47+
metavariable: $FN
48+
regex: (eval|query)
Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
from tensorflow import keras
2+
from keras.models import load_model
3+
4+
def id(x):
5+
return x
6+
7+
h5_file_path = id("model.h5")
8+
keras_file_path = id("model.keras")
9+
10+
# ruleid: pickles-in-keras-deprecation
11+
m1 = load_model("model.h5")
12+
13+
# ok: pickles-in-keras-deprecation
14+
m2 = load_model("model.keras")
15+
16+
# ruleid: pickles-in-keras-deprecation
17+
m3 = load_model(h5_file_path)
18+
19+
# ruleid: pickles-in-keras-deprecation
20+
m4 = load_model(keras_file_path)
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
rules:
2+
- id: pickles-in-keras-deprecation
3+
message: >-
4+
The usage of pickle and hdf5 formats for model files are deprecated in Keras.
5+
The keras.models.load_model function is deprecated as well. Keras is now
6+
embedded in Tensorflow 2 under tensorflow.keras.
7+
languages: [python]
8+
severity: WARNING
9+
metadata:
10+
category: security
11+
cwe: "CWE-502: Deserialization of Untrusted Data"
12+
subcategory: [vuln]
13+
confidence: MEDIUM
14+
likelihood: MEDIUM
15+
impact: HIGH
16+
technology: [keras]
17+
description: "Potential arbitrary code execution from Keras' load_model function"
18+
references:
19+
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/
20+
21+
patterns:
22+
- pattern-either:
23+
- pattern: keras.models.load_model(...)
24+
- pattern: tensorflow.keras.models.load_model(...)
25+
- pattern: keras.saving.load_model(...)
26+
- pattern: tensorflow.keras.saving.load_model(...)
27+
- pattern-not:
28+
patterns:
29+
- pattern-either:
30+
- pattern: keras.models.load_model($FILE)
31+
- pattern: tensorflow.keras.models.load_model($FILE)
32+
- pattern: keras.saving.load_model($FILE)
33+
- pattern: tensorflow.keras.saving.load_model($FILE)
34+
- metavariable-regex:
35+
metavariable: $FILE
36+
regex: .*\.keras

python/pickles-in-keras.py

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
from tensorflow import keras
2+
from keras.models import load_model
3+
4+
def id(x):
5+
return x
6+
7+
h5_file_path = id("model.h5")
8+
keras_file_path = id("model.keras")
9+
10+
# ok: pickles-in-keras
11+
m1 = load_model("model.h5")
12+
13+
# ok: pickles-in-keras
14+
m2 = load_model("model.keras")
15+
16+
# ruleid: pickles-in-keras
17+
m3 = load_model(h5_file_path)
18+
19+
# ruleid: pickles-in-keras
20+
m4 = load_model(keras_file_path)

python/pickles-in-keras.yaml

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
rules:
2+
- id: pickles-in-keras
3+
message: >-
4+
Keras' load_model function may result in arbitrary code execution:
5+
- It can load vulnerable pickled models
6+
- It can load an hdf5 model that contains a lambda layer with arbitrary code
7+
that will be executed every time the model is used (loading, training, eval)
8+
Note: Keras loading with the built-in file format should be safe as long as checks are not disabled.
9+
languages: [python]
10+
severity: ERROR
11+
metadata:
12+
category: security
13+
cwe: "CWE-502: Deserialization of Untrusted Data"
14+
subcategory: [vuln]
15+
confidence: MEDIUM
16+
likelihood: MEDIUM
17+
impact: HIGH
18+
technology: [keras]
19+
description: "Potential arbitrary code execution from Keras' load_model function"
20+
references:
21+
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/
22+
23+
patterns:
24+
- pattern-either:
25+
- patterns:
26+
- pattern: keras.models.load_model(...)
27+
- pattern-not: keras.models.load_model("...", ...)
28+
- patterns:
29+
- pattern: tensorflow.keras.models.load_model(...)
30+
- pattern-not: tensorflow.keras.models.load_model("...", ...)
31+
- patterns:
32+
- pattern: keras.saving.load_model(...)
33+
- pattern-not: keras.saving.load_model("...", ...)
34+
- patterns:
35+
- pattern: tensorflow.keras.saving.load_model(...)
36+
- pattern-not: tensorflow.keras.saving.load_model("...", ...)

python/pickles-in-pytorch.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,3 +29,7 @@ def test(arg):
2929

3030
# ok: pickles-in-pytorch
3131
model.load_state_dict(torch.load(arg))
32+
33+
state_dict = model.state_dict()
34+
# ok: pickles-in-pytorch
35+
torch.save(state_dict, arg)

python/pickles-in-pytorch.yaml

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,4 +24,11 @@ rules:
2424
- pattern-not: torch.load("...")
2525
- pattern-not: torch.save(..., "...")
2626
- pattern-not: torch.save($M.state_dict(), ...)
27-
- pattern-not-inside: $M.load_state_dict(torch.load(...))
27+
- pattern-not-inside: $M.load_state_dict(...)
28+
29+
- pattern-not:
30+
patterns:
31+
- pattern: torch.save($STATE_DICT, ...)
32+
- pattern-inside: |
33+
$STATE_DICT = $M.state_dict()
34+
...

python/pickles-in-tensorflow.py

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
import tensorflow as tf
2+
3+
def id(x):
4+
return x
5+
6+
model_dir = id("model_dir")
7+
8+
# ok: pickles-in-tensorflow
9+
m1 = tf.saved_model.load("model_dir")
10+
# ruleid: pickles-in-tensorflow
11+
m2 = tf.saved_model.load(model_dir)

python/pickles-in-tensorflow.yaml

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
rules:
2+
- id: pickles-in-tensorflow
3+
message: >-
4+
Tensorflow's low-level load function may result in arbitrary code execution.
5+
languages: [python]
6+
severity: ERROR
7+
metadata:
8+
category: security
9+
cwe: "CWE-502: Deserialization of Untrusted Data"
10+
subcategory: [vuln]
11+
confidence: MEDIUM
12+
likelihood: MEDIUM
13+
impact: HIGH
14+
technology: [keras]
15+
description: "Potential arbitrary code execution from tensorflow's load function"
16+
references:
17+
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/
18+
19+
patterns:
20+
- pattern: tensorflow.saved_model.load(...)
21+
- pattern-not: tensorflow.saved_model.load("...", ...)

0 commit comments

Comments
 (0)