-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathhillel_wayne.tex
497 lines (409 loc) · 16.8 KB
/
hillel_wayne.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
\chapter{Finding Property Tests - Hillel Wayne}
\label{sec:finding_property_tests}
\begin{quotation}
\noindent\textit{\textbf{William Yao:}}
\textit{More starting points for figuring out what properties to test.}
\vspace{\baselineskip}
\noindent\textit{Original article: \cite{finding_properties}}
\end{quotation}
A while back I was ranting about APLs and included this python code to
get the mode of a list:
\begin{minted}{python}
def mode(l):
max = None
count = {}
for x in l:
if x not in count:
count[x] = 0
count[x] += 1
if not max or count[x] > count[max]:
max = x
return max
\end{minted}
There's a bug in it. Do you see it? If not, try running it on the list
\texttt{{[}0,\ 0,\ 1{]}}:
\begin{minted}{python}
>>> mode([0, 0, 1])
1
\end{minted}
The issue is that 0 is falsy, so if max is 0, \texttt{if\ not\ max} is
true.
I could say this bug was a result of carelessness. I didn't write any
tests for this function, just tried a few obvious examples and thought
``yeah it works''. But I don't think writing bespoke manual unit tests
would have caught this. To surface the bug, the mode of a list must be a
falsy value like 0 or \texttt{{[}{]}} \emph{and} the last value of the
list must be something else. It's a small intersection of Python's
typing and the mechanics of \texttt{mode}, making it too unusual a case
to be found by standard unit testing practice.
A different testing style is property based testing (PBT) with
\href{https://www.hillelwayne.com/post/contracts/}{contracts}. By
generating a random set of inputs, we cover more of the state space than
we'd do manually. The problem with PBT, though, is that it can be hard
to find good properties. I'd like to take mode as a case study in what
properties we could come up with.
There are a few things I'm looking for:
\begin{itemize}
\item
The property test should find the bug.
\item
The test should be \emph{simple}. I'm presumably not putting a lot of
effort into testing such a simple function, so a complex test doesn't
accurately capture what I'd do.
\item
The test should be \emph{obvious}. I'm looking for a natural test that
finds the bug, not a post-hoc one. Catching a bug with tests is much
less believable if you already know what you're looking for.
\end{itemize}
So, let's talk some tests and contracts! I'm using
\href{http://hypothesis.works/}{hypothesis} for the property tests,
\href{https://github.com/deadpixi/contracts}{dpcontracts} for my
contracts library, \href{https://docs.pytest.org/en/latest/}{pytest} for
the runner\footnote{Oddly enough I'm getting increasingly less sold on using pytest, purely because I want to experiment with weird janky metaprogramming in my tests and pytest doesn't really support that.}.
For the sake of this problem, assume we're only passing in nonempty
lists.
\begin{minted}{python}
@require("l must not be empty", lambda args: len(args.l) > 0)
def mode(l):
...
\end{minted}
\section{Contract-wise}
\label{contract-wise}
One property of a function is ``all of the contracts are satisfied''. We
can use this to write ``thin'' tests, where we don't put any assertions
in the test itself. If any of our contracts raise an error then the test
will fail.
\begin{minted}{python}
@given(lists(integers(), min_size=1))
def test_mode(l):
mode(l)
\end{minted}
\subsection{Types}
\label{types}
Typically in dynamic programming languages, contracts are used as a poor
replacement of a static type system. Instead of checking the type at
compile time, people check the type at runtime. Most contract libraries
are heavily geared towards this kind of use.
\begin{minted}{python}
@ensure("result is an int", lambda _, r: isinstance(r, int))
\end{minted}
This is a bad contract for three reasons:
\begin{enumerate}
\item
It requires the function to return integers when it's currently
generic. We \emph{could} try to make it generic by doing something
like \texttt{type(a.l)\ ==\ type(r)}, but \emph{ugh}.
\item
We should be using mypy for type checks anyway.
\item
It doesn't actually find the error. The problem isn't the type, it's
that we got the wrong result.
\end{enumerate}
\subsubsection{Sanity Checking}
\label{sanity-checking}
We can go further than replicating static types. One common type of
contract is a ``sanity check'': some property that does not fully
specify our code, but is easy to express and should hold true anyway.
For example, we know the mode will be an element of the list, so why not
check that we're returning an element of the list?
\begin{minted}{python}
@ensure("result is in list", lambda a, r: r in a.l)
\end{minted}
This is a pretty good contract! It tells us useful things about the
function, and it's not easily replacable with a typecheck. If I was
writing production code I'd probably write a lot of contracts like this.
But it also doesn't find the problem, so we need to go further.
\subsection{First element}
\label{first-element}
Our sanity check was only minimally related to our function. There are
lots of functions that return elements in the list: \texttt{head},
\texttt{random\_element}, \texttt{last}, etc. The issue is a subtle bug
in our implementation. Our contract should express some important
property about our function. In mode's case, it should relate to the
count of the value.
One \emph{extremely} useful property is adding bounds. The mode of a
list is the element that occurs most frequently. Every element of the
list should occur less often, or as often, as the
mode\footnote{We have to say ``less than or \textit{equal} to'' for two reasons. First, the mode is not strictly more frequent than other elements, like in \texttt{[1, 1, 2, 2]}. Second, what if the mode \textit{is} \texttt{l[0]}?}.
One good arbitrary element is the first element:
\begin{minted}{python}
@ensure("result > arbitrary",
lambda a, r: a.l.count(r) >= a.l.count(a.l[0]))
\end{minted}
This finds the bug!
\begin{minted}{python}
args = ([0, 0, 1],), kwargs = {}, rargs = Args(l=[0, 0, 1])
result = 1
E dpcontracts.PostconditionError: result > arbitrary
\end{minted}
Personally, I'd prefer this as a property test clause instead of a
contract clause. It doesn't feel ``right'' to me. I think it's more an
aesthetic judgment than a technical one here.
\begin{minted}{python}
@given(lists(integers(), min_size=1))
def test_mode(l):
x = mode(l)
assert l.count(x) >= l.count(l[0])
\end{minted}
Either way, this is only a partial contract: while it will catch
\emph{some} incorrect outputs, it won't catch them all. We could get
this with \texttt{{[}1,\ 0,\ 0,\ 2{]}}:
\texttt{count(0)\ \textgreater{}\ count(2)\ \textgreater{}=\ count(1)},
but our broken function would return \texttt{2}. In some cases, this is
all we can feasibly get. For simpler functions, though, we can rule out
all incorrect outputs. We want a total contract, one which always raises
on an incorrect output and never raises on a correct one.
\subsection{The dang definition}
\label{the-dang-definition}
Why not just use the definition itself?
\begin{minted}{python}
@ensure("result is the mode",
lambda a, r: all((a.l.count(r) >= a.l.count(x) for x in a.l)))
\end{minted}
To make this nicer we can extract this into a dedicated helper contract:
\begin{minted}{python}
def is_mode(l, m):
return all((l.count(m) >= l.count(x) for x in l))
@ensure("result is the mode", lambda a, r: is_mode(a.l, r))
\end{minted}
This also catches the error.
\begin{minted}{python}
args = ([0, 0, 1],), kwargs = {}, rargs = Args(l=[0, 0, 1])
result = 1
\end{minted}
This is the same faulty input as before. Property-based Testing
libraries \textbf{shrink} inputs to find the smallest possible error,
which is \texttt{{[}0,\ 0,\ 1{]}}.
\subsubsection{Compare to an Oracle}
\label{compare-to-an-oracle}
If we had another way of getting the answer that we knew was correct, we
could just compare the two results and see if they're the same. This is
called using an \textbf{oracle}. Oracles are often a good choice when
you're trying to refactor a complex function or optimize an expensive
one. For our purposes, it goes too far.
\begin{minted}{python}
from collections import Counter
def math_mode(l):
c = Counter(l)
return c.most_common(1)[0][0]
@require("l must not be empty", lambda args: len(args.l) > 0)
@ensure("result matches oracle", lambda a, r: r == math_mode(a.l))
def mode(l):
...
\end{minted}
This is too heavy. Not only is it cumbersome, but it overconstrains what
the mode can be. We see this in the error it finds: it finds an error
with a smaller input than the other two!
\begin{minted}{python}
args = ([0, 1],), kwargs = {}, rargs = Args(l=[0, 1]), result = 1
\end{minted}
We haven't precisely defined the semantics of \texttt{mode}. If there
are two values which tie for the most elements, which is the mode? Our
prior contracts didn't say: as long as we picked an element that had at
least as many instances as any other element, we were good. With
\texttt{math\_mode}, we're arbitrarily choosing one of them as the
``real'' mode and checking that our \texttt{mode} picked that arbitrary
element. We can see this better by writing a manual test:
\begin{minted}{python}
def test_mode():
mode([3, 2, 2, 3])
...
args = ([3, 2, 2, 3],), kwargs = {}, rargs = Args(l=[3, 2, 2, 3])
result = 2
\end{minted}
Whereas with our previous contract passes on this.
\section{Property-wise}
\label{property-wise}
Our contract approach converged on ``testing the definition'' as the
best result. There are many cases where code-under-test does not have a
nice mathematical definition. Contracts are still useful here, as they
can rule out bad cases, but you'll need additional tests.
\emph{Hypothetically} contracts can express all possible properties of a
function. In practice you're limited to what your framework can express
and check. For most complicated properties we're better off sticking it
in a dedicated test.
``Property-wise'' property tests have several advantages over
``contract-wise'' property tests:
\begin{enumerate}
\item
We can test properties that aren't ``ergonomic'' in our contract
framework.
\item
We can test properties that involve effects.
\item
We can test
\href{https://www.hillelwayne.com/post/metamorphic-testing/}{metamorphic
relations} which involve comparing multiple function calls.
\end{enumerate}
For this \texttt{mode} problem we don't need any of them, but let's show
off some possible tactics anyway.
\subsection{Preserving
Transformation}
\label{preserving-transformation}
We find some transformation of the input that should give the same
output. For lists, a good transformation is
sorting\footnote{Unless you're trying to test a sorting function.}.
The mode of a list doesn't change if you sort it:
\begin{minted}{python}
@given(lists(integers(), min_size=1))
def test_sorting_preserves_mode(l):
assert mode(l) == mode(sorted(l))
...
l = [0, 0, -1]
\end{minted}
We could also reverse the list instead of sort it, but that gives us an
error case of \texttt{{[}1,\ -1{]}}, which again is due to
overconstraints.
Or we could assert that the mode doesn't change if we add it again to
the list:
\begin{minted}{python}
def test_can_add_to_mode(l):
m = mode(l)
assert mode(l + [m]) == m
\end{minted}
This does \emph{not} find the bug, though.
\subsection{Controlled
Transformation}
\label{controlled-transformation}
Instead of finding a solution that doesn't change the answer, we could
find one that changes it in a known way. One of them might be ``doubling
all of the numbers doubles the mode'':
\begin{minted}{python}
@given(lists(integers(), min_size=1))
def test_doubling_doubles_mode(l):
doubled = [x * 2 for x in l]
assert 2*mode(l) == mode(doubled)
\end{minted}
This does not find the bug. We could also try ``adding 1 to every
element adds 1 to the mode'':
\begin{minted}{python}
@given(lists(integers(), min_size=1))
def test_incrementing_increments_mode(l):
incremented = [x + 1 for x in l]
assert mode(l)+1 == mode(incremented)
...
l = [0, 1]
\end{minted}
It gives the same output as in our overconstrained case, but it only
``is wrong'' when we have \texttt{0}'s anyway. If we restrict the list to only
positive integers, it will pass (unlike our oracle contract).
If we wanted to be extra thorough we could generatively pick both a
slope and an increment:
\begin{minted}{python}
@given(lists(integers(), min_size=1), integers())
def test_affine_relation(l, m, b):
transformed = [m*x+b for x in l]
assert m*mode(l)+b == mode(transformed)
...
l = [0, 1], m = 1, b = 1
\end{minted}
It depends on how paranoid you want to get.
\subsection{Oracle Generators}
\label{oracle-generators}
The big advantage of manual tests to generative ones is that you can
come up with the appropriate outputs for a given input. Since we can't
easily do that in PBT, we're stuck testing properties instead of
oracles.
Or we could go in reverse: take a random output and generate a
corresponding input. One way we can do that:
\begin{enumerate}
\item
generate pairs of elements and counts. Make sure that the elements are
unique
\item
construct a list from that
\item
pass in both the list and the corresponding mode, selected from the
pair.
\end{enumerate}
\begin{minted}{python}
@composite
def list_and_mode(draw):
out = []
pairs_max_10 = tuples(integers(), integers(min_value=1, max_value=10))
counts = draw(lists(pairs_max_10,
min_size=1,
max_size=5,
unique_by= lambda x: x[0]))
for number, count in counts:
out += ([number] * count)
mode_of_out = max(counts, key=lambda x: x[1])[0]
return out, mode_of_out
@given(list_and_mode())
def test_can_find_mode(lm):
l, m = lm
assert mode(l) == m
...
lm = ([0, 1], 0)
\end{minted}
This overconstrains (we're not ruling out two pairs having the same
counts), but it \emph{does not} raise a false positive for
\texttt{{[}3,\ 2,\ 2,\ 3{]}}. This is because we construct the list in
the same way \texttt{max} interprets the list. If we do
\begin{verbatim}
- for number, count in counts:
+ for number, count in reversed(counts):
\end{verbatim}
Then it raises \texttt{{[}3,\ 2{]}} as a ``counterexample''. Between the
cumbersomeness and the overconstraining, making an oracle generator is
not a good choice for this problem. There are some cases, though, where
it can be more useful.
\section{Limitations}
\label{limitations}
Here's a fixed version that \emph{looks} like it will work:
\begin{minted}{python}
def mode(l):
max = None
count = {}
for x in l:
if x not in count:
count[x] = 0
count[x] += 1
+ if max is None or count[x] > count[max]:
- if not max or count[x] > count[max]:
max = x
return max
\end{minted}
And this passes all of our tests. But there's still a bug in it. Again,
take a minute to see if you can find it. If you can't, try the
following:
\begin{minted}{python}
mode([None, None, 2])
\end{minted}
This will select the mode as \texttt{2}, when it really should be
\texttt{None}. The problem isn't in our contracts or assertions. It's in
our test generator: we're only testing with lists of integers.
Hypothesis can generate heterogenous lists, but you still have to
explicitly list the types you want to be in the list. In order to find
this bug we'd have to explicitly realize that \texttt{None} might be a
problem for us.
If we only want to call \texttt{mode} on homogenous lists, we should
instead use a typechecker to catch the bug:
\begin{minted}{python}
+ def mode(l: List[T]) -> T:
- def mode(l):
max = None
+ count = {} # type: Dict[T, int]
- count = {}
\end{minted}
This will raise a spurious error, saying that the return value is
actually an \texttt{Optional{[}T{]}}. If we change \texttt{max\ =\ None}
to \texttt{max\ =\ l{[}0{]}} both the error and the bug go away. But we
can change the return value to \texttt{Optional{[}T{]}} and the bug
remains- mypy can't actually detect if we're passing in a heterogenous
list. More type-oriented languages can ban heterogenous lists outright
but even those will miss the bugs our contracts caught. Static and
dynamic analysis are complementary, not
contradictory\footnote{Both contracts and properties can be checked statically, but \textit{most} people using them will be checking them at runtime. This is because static analysis of contracts quickly turns into formal verification, which is really, really hard. }.
\subsection{Summary}
This was a pretty short dive into what makes a good property or
contract. It also focused on just pure functions: a lot of languages use
contracts to maintain class invariants or monitor the side effects of
procedures.
If you're interested in learning more about properties,
chapter \ref{sec:choosing_properties_for_testing}
is a canonical article on abstract properties and chapter \ref{sec:screencast_introduction}
is a series on applying it to business problems. If you're interested in
learning more about contracts, I'd recommend\ldots{} actually, I can't
think of anything that's not language-specific. Kind of surprising given
how useful they are.