Diffix for Desktop uses Diffix as its underlying anonymization mechanism. Diffix was co-developed by the Max Planck Institute for Software Systems and Aircloak GmbH. Diffix is strong enough to satisfy the GDPR definition of anonymization as non-personal data.
+
__Diffix for Desktop__ uses __Diffix__ as its underlying anonymization mechanism. __Diffix__ was co-developed by the __Max Planck Institute for Software Systems__ and __Aircloak GmbH__. __Diffix__ is strong enough to satisfy the GDPR definition of anonymization as non-personal data.
-
> If you would like help in getting approval from your Data Protection Officer (DPO) or Data Protection Authority (DPA), please contact us at [[email protected]](mailto:[email protected]).
+
The technology is made openly available under the [__Open Diffix__](https://open-diffix.org) organization. The latest version of the mechanism is called __Diffix Elm__.
-
Diffix combines three common anonymization mechanisms:
+
> If you would like help in getting approval from your __Data Protection Officer (DPO)__ or __Data Protection Authority (DPA)__, please contact us at [[email protected]](mailto:[email protected]).
+
+
__Diffix__ combines three common anonymization mechanisms:
* __Noise:__ Distorts counts.
* __Suppression:__ Removes outputs that pertain to too few protected entities.
* __Generalization:__ Makes data more coarse-grained, for instance generalizing date-of-birth to year-of-birth.
Noise is commonly used with Differential Privacy mechanisms. Generalization and suppression are commonly used with k-anonymity.
-
Diffix automatically applies these three mechanisms as needed on a query-by-query basis. Diffix detects how much is contributed to each output bin by each protected entity, and tailors noise and suppression so as to maximize data quality while maintaining strong anonymization. The quality of data anonymized with Diffix usually far exceeds that of Differential Privacy and k-anonymity.
+
__Diffix__ automatically applies these three mechanisms as needed on a query-by-query basis. __Diffix__ detects how much is contributed to each output bin by each protected entity, and tailors noise and suppression so as to maximize data quality while maintaining strong anonymization. The quality of data anonymized with __Diffix__ usually far exceeds that of Differential Privacy and k-anonymity.
## Proportional Noise
-
Diffix adds pseudo-random noise taken from a Normal distribution. The amount of noise (the standard deviation) is proportional to how much is contributed to the count by the heaviest contributors. When counting the number of protected entities, each entity contributes 1, and the noise standard deviation is `SD=1.5`. With high probability, the resulting answer will be within plus or minus 5 of the true answer.
+
__Diffix__ adds pseudo-random noise taken from a normal distribution. The amount of noise (the standard deviation) is proportional to how much is contributed to the count by the heaviest contributors. When counting the number of protected entities, each entity contributes 1, and the noise standard deviation is `SD=1.5`. With high probability, the resulting answer will be within plus or minus 5 of the true answer.
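The "plus or minus 5" figure follows from the stated standard deviation: 5 is about 3.3 standard deviations when `SD=1.5`. The Python sketch below checks that arithmetic, assuming plain zero-mean Gaussian noise. The function names and the seeding are illustrative assumptions and are not taken from the Diffix implementation.

```python
import math
import random


def probability_within(bound: float, sd: float = 1.5) -> float:
    """P(|noise| <= bound) for zero-mean normal noise with the given SD."""
    return math.erf(bound / (sd * math.sqrt(2.0)))


def noisy_entity_count(true_count: int, sd: float = 1.5, seed: int = 0) -> int:
    """Illustrative only: add Gaussian noise to a count of protected entities."""
    rng = random.Random(seed)  # hypothetical seeding, chosen for repeatability
    return round(true_count + rng.gauss(0.0, sd))


if __name__ == "__main__":
    print(probability_within(5.0))
    print(noisy_entity_count(100))
```

Under these assumptions the probability of staying within 5 of the true count comes out at roughly 99.9%.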
When counting the number of rows, the amount of noise is larger: proportional to the number of rows contributed by the highest contributors. This is similar to the concept of sensitivity in Differential Privacy. Proportional noise protects high contributors in the case where data recipients may have prior knowledge about the heavy contributors.
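As a rough illustration of noise that is "proportional to the number of rows contributed by the highest contributors", the sketch below scales a base standard deviation by the average row count of the top few contributors. Averaging over the top `top_k` entities, the value `top_k=3`, and the function name are assumptions made for this example. The text does not specify the exact rule Diffix uses.

```python
def row_count_noise_sd(rows_per_entity: dict, base_sd: float = 1.5, top_k: int = 3) -> float:
    """Illustrative only: noise SD scaled by the heaviest contributors.

    rows_per_entity maps a protected entity identifier to the number of rows
    it contributes to one output bin. top_k is a hypothetical parameter.
    """
    top = sorted(rows_per_entity.values(), reverse=True)[:top_k]
    if not top:
        return 0.0
    return base_sd * (sum(top) / len(top))
```

When every entity contributes a single row, this reduces to the `SD=1.5` case used for counting protected entities.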
-
Finally, Diffix removes the excess contributions of extreme outliers (one or two protected entities that contribute far more rows than other protected entities). This prevents data recipients from inferring information about extreme outliers from the amount of noise itself.
+
Finally, __Diffix__ removes the excess contributions of extreme outliers (one or two protected entities that contribute far more rows than other protected entities). This prevents data recipients from inferring information about extreme outliers from the amount of noise itself.
## Suppression
-
Diffix recognizes how many protected entities contribute to each output bin. When the number is too small, Diffix suppresses (doesn't output) the bin. This prevents data recipients from inferring information about individual protected entities even when the recipients have prior knowledge.
+
__Diffix__ recognizes how many protected entities contribute to each output bin. When the number is too small, __Diffix__ suppresses (doesn't output) the bin. This prevents data recipients from inferring information about individual protected entities even when the recipients have prior knowledge.
-
Rather than apply a single suppression threshold to all bins, Diffix slightly modifies the threshold for different bins. This adds additional uncertainty for recipients that have prior knowledge of protected entities.
+
Rather than apply a single suppression threshold to all bins, __Diffix__ slightly modifies the threshold for different bins. This adds additional uncertainty for recipients that have prior knowledge of protected entities.
-
Diffix suppresses bins with fewer than 4 protected entities *on average*. Diffix always suppresses bins with only a single contributing protected entity.
+
__Diffix__ suppresses bins with fewer than 4 protected entities *on average*. __Diffix__ always suppresses bins with only a single contributing protected entity.
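The suppression rules above can be sketched in Python as follows. The sketch assumes a per-bin threshold drawn from a normal distribution centred on 4 and seeded from the bin label, with a floor of 2 so that single-entity bins are always suppressed. The threshold spread (`threshold_sd`), the hashing scheme, and the function name are hypothetical choices for illustration, not the actual Diffix parameters.

```python
import hashlib
import random


def suppress_bin(
    bin_label: str,
    entity_count: int,
    mean_threshold: float = 4.0,
    threshold_sd: float = 1.0,  # assumed spread, not from the source
    lower_bound: float = 2.0,   # guarantees bins with 1 entity are suppressed
) -> bool:
    """Illustrative only: decide whether an output bin should be suppressed."""
    # Derive a repeatable seed from the bin label so the same bin always
    # gets the same threshold (a hypothetical scheme).
    digest = hashlib.sha256(bin_label.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    threshold = random.Random(seed).gauss(mean_threshold, threshold_sd)
    threshold = max(threshold, lower_bound)
    return entity_count < threshold
```

Note that the floor nudges the average threshold slightly above `mean_threshold`; a faithful implementation would have to account for that.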
## Generalization
-
Unlike proportional noise and suppression, Diffix does not automate and enforce generalization. This is because the amount of acceptable generalization depends on the analytic goals of each use case. For instance, in some cases year-of-birth may be required, whereas in others decade-of-birth may be acceptable. Diffix has no way of knowing what level of generalization to choose. Rather, this is left to the analyst so that data quality can be tailored to the specific use case.
+
Unlike proportional noise and suppression, __Diffix__ does not automate and enforce generalization. This is because the amount of acceptable generalization depends on the analytic goals of each use case. For instance, in some cases year-of-birth may be required, whereas in others decade-of-birth may be acceptable. __Diffix__ has no way of knowing what level of generalization to choose. Rather, this is left to the analyst so that data quality can be tailored to the specific use case.
As a general rule, noise and suppression force analysts to generalize. If data is too fine-grained, then each output bin will have very few protected entities. In these cases, the bins may be suppressed, or if not suppressed the signal-to-noise may be too low.
__Diffix for Desktop__ has three phases of operation:
- Load and configure table from CSV
- Select data and adjust quality
- Select columns for anonymization
@@ -16,15 +16,15 @@ An unlimited number of anonymized views of the data may be exported without comp
## Load table from CSV
-
Diffix for Desktop only accepts CSV files as input.
+
__Diffix for Desktop__ only accepts CSV files as input.
-
Diffix for Desktop interprets the first row of the CSV file as column names.
+
__Diffix for Desktop__ interprets the first row of the CSV file as column names.
-
Diffix for Desktop auto-detects the CSV separator. All standard separators are accepted.
+
__Diffix for Desktop__ auto-detects the CSV separator. All standard separators are accepted.
-
Diffix for Desktop auto-detects data types as text or numeric. Text columns are generalized with substring selection, and numeric columns are generalized with numeric ranges.
+
__Diffix for Desktop__ auto-detects data types as text or numeric. Text columns are generalized with substring selection, and numeric columns are generalized with numeric ranges.
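The loading behaviour described in this section (first row as column names, separator detection, text versus numeric columns) can be approximated with the Python standard library. This is a minimal sketch under simple assumptions: the separator is guessed by counting candidate characters in the header line, and a column is treated as numeric only if every value parses as a float. The candidate list and function names are hypothetical and do not describe how Diffix for Desktop detects them.

```python
import csv
import io

# Hypothetical list of "standard" separators for this sketch.
CANDIDATE_SEPARATORS = [",", ";", "\t", "|"]


def detect_separator(header_line: str) -> str:
    """Pick the candidate separator that occurs most often in the header."""
    return max(CANDIDATE_SEPARATORS, key=header_line.count)


def _is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def load_table(text: str):
    """Return (column names, data rows, column types) for CSV text."""
    separator = detect_separator(text.splitlines()[0])
    rows = list(csv.reader(io.StringIO(text), delimiter=separator))
    header, data = rows[0], rows[1:]  # first row holds the column names
    types = {
        name: "numeric" if all(_is_number(row[i]) for row in data) else "text"
        for i, name in enumerate(header)
    }
    return header, data, types
```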
-
After loading, Diffix for Desktop displays the column names and the first 1000 rows of the table. This data may be inspected to validate that the CSV file was loaded correctly.
+
After loading, __Diffix for Desktop__ displays the column names and the first 1000 rows of the table. This data may be inspected to validate that the CSV file was loaded correctly.
### Sample CSV files
@@ -33,7 +33,7 @@ are available for testing.
## IMPORTANT: Configure the Protected Entity Identifier Column
-
In order for Diffix for Desktop to anonymize properly, the column containing the protected entity identifier must be correctly configured.
+
In order for __Diffix for Desktop__ to anonymize properly, the column containing the protected entity identifier must be correctly configured.
The **protected entity** is the entity whose privacy is being protected. A protected entity is usually a person, but it could be something else, for instance an account, a family, or even an organization.
@@ -45,7 +45,7 @@ Some data sets have one row of data per protected entity. Examples include surve
| O | 54321 | 23 | Bachelor | None | ... |
| F | 48572 | 32 | PhD | Professor | ... |
-
These *one-row* data sets often do not have any kind identifier column. The Protected Entity Identifier Column may be set to `None`. Diffix for Desktop treats each row as a protected entity.
+
These *one-row* data sets often do not have any kind of identifier column. The Protected Entity Identifier Column may be set to `None`. __Diffix for Desktop__ treats each row as a protected entity.
Other data sets have multiple rows of data per protected entity. Examples include time series data like geo-location, hospital visits, and website visits. These data sets usually have one or more columns that identify the protected entity. For instance, the following is a geo-location data set where the IMEI (International Mobile Equipment Identity) identifies the protected entity. In this case, the protected entity itself is a mobile device, but in most cases this effectively represents a person.
@@ -60,7 +60,7 @@ Other data sets have multiple rows of data per protected entity. Examples includ
For this *multi-row* data set, the IMEI column is configured in Diffix for Desktop as the Protected Entity Identifier Column. If a different column were configured as the Protected Entity Identifier Column, then Diffix for Desktop would not anonymize correctly.
+
For this *multi-row* data set, the IMEI column is configured in __Diffix for Desktop__ as the Protected Entity Identifier column. If a different column were configured as the Protected Entity Identifier column, then __Diffix for Desktop__ would not anonymize correctly.
In some multi-row data sets, a single row may pertain to multiple different protected entities. Examples include bank transactions, email records, and call records. Here is an example of a data set for email records:
@@ -73,27 +73,27 @@ In some multi-row data sets, a single row may pertain to multiple different prot
The `Sender email` and `Receiver email` each identify a different protected entity.
-
> **This version of Diffix for Desktop does not protect a data set where there are multiple protected entities per row**
+
> **This version of __Diffix for Desktop__ does not protect a data set where there are multiple protected entities per row**
-
A data set with multiple protected entities needs to be pre-processed to have one protected entity per row before loading into Diffix for Desktop. See [Multiple protected entities per row](#multiple-protected-entities-per-row).
+
A data set with multiple protected entities needs to be pre-processed to have one protected entity per row before loading into __Diffix for Desktop__. See [Multiple protected entities per row](#multiple-protected-entities-per-row).
## Select columns and generalization
-
Like all data anonymization mechanisms, Diffix distorts and hides data. The more columns included and the finer the data granularity, the more distortion and hiding. Diffix distorts by adding *noise* to counts, and hides data by *suppressing* bins that pertain to too few protected entities.
+
Like all data anonymization mechanisms, __Diffix__ distorts and hides data. The more columns included and the finer the data granularity, the more distortion and hiding. __Diffix__ distorts by adding *noise* to counts, and hides data by *suppressing* bins that pertain to too few protected entities.
-
Diffix for Desktop lets you control the quality of the anonymized data through column selection and column generalization (binning). It lets you inspect the quality of the anonymized data at a glance with *distortion statistics* and in detail with *side-by-side comparison* of the anonymized and original data. Through an iterative process of column selection and generalization, and anonymized data inspection, Diffix for Desktop simplifies the task of data anonymization.
+
__Diffix for Desktop__ lets you control the quality of the anonymized data through column selection and column generalization (binning). It lets you inspect the quality of the anonymized data at a glance with *distortion statistics* and in detail with *side-by-side comparison* of the anonymized and original data. Through an iterative process of column selection and generalization, and anonymized data inspection, __Diffix for Desktop__ simplifies the task of data anonymization.

-
Columns are selected for inclusion in the anonymized data output using the radial buttons. As soon as a column is selected, Diffix for Desktop starts computing the anonymized output. If another column is selected or de-selected before the computing finishes, then the computation is halted and a new computation started.
+
Columns are selected for inclusion in the anonymized data output using the radio buttons. As soon as a column is selected, __Diffix for Desktop__ starts computing the anonymized output. If another column is selected or de-selected before the computation finishes, the computation is halted and a new one is started.
When a column is selected, the generalization input is exposed. For text columns, you can select a substring by offset and number of characters. For numeric columns, you can select a bin size.
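The two generalization inputs can be illustrated with a short Python sketch: a substring selected by offset and length for text, and a fixed bin size for numbers. Mapping each number to the lower edge of its bin is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def generalize_text(value: str, offset: int, length: int) -> str:
    """Keep `length` characters of `value`, starting at `offset`."""
    return value[offset:offset + length]


def generalize_number(value: float, bin_size: float) -> float:
    """Map `value` to the lower edge of its bin (assumed convention)."""
    return math.floor(value / bin_size) * bin_size
```

For example, `generalize_number(1987, 10)` maps a year of birth to its decade (`1980`), and `generalize_text("10115", 0, 2)` keeps only the first two characters of a postal code.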
> More generalization (shorter substrings and larger numeric bins) leads to less suppression and less relative noise, but also less precision.
### Toggle between counting rows and counting protected entities (i.e. persons)
-
If the input data is multi-row, then Diffix for Desktop gives you the choice of counting rows or counting protected entities (i.e. persons). The toggle switch may be found at the bottom of the column selection area.
+
If the input data is multi-row, then __Diffix for Desktop__ gives you the choice of counting rows or counting protected entities (i.e. persons). The toggle switch may be found at the bottom of the column selection area.
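The difference between the two counting modes can be shown on raw (not yet anonymized) data. The Python sketch below computes, for each output bin, both the number of rows and the number of distinct protected entities. The row representation as dictionaries and the function name are assumptions for this example.

```python
from collections import defaultdict


def count_per_bin(rows, bin_key, entity_key):
    """Return {bin: (row_count, distinct_entity_count)} for a list of dict rows."""
    row_counts = defaultdict(int)
    entities = defaultdict(set)
    for row in rows:
        bin_value = row[bin_key]
        row_counts[bin_value] += 1
        entities[bin_value].add(row[entity_key])
    return {b: (row_counts[b], len(entities[b])) for b in row_counts}
```

For a one-row data set the two counts are identical, which is why the toggle only matters for multi-row input.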
## How to interpret the Anonymization Summary
@@ -138,7 +138,7 @@ The combined view lets you examine precisely the distortion and suppression. The
### What is safe to release
-
Note that the data in the `Anonymized` view is the only data that is properly anonymized by Diffix for Desktop. Note in particular that the Anonymization Summary is not anonymized per se. See [Releasing Anonymization Summary statistics](#releasing-anonymization-summary-statistics).
+
Note that the data in the `Anonymized` view is the only data that is properly anonymized by __Diffix for Desktop__. Note in particular that the Anonymization Summary is not anonymized per se. See [Releasing Anonymization Summary statistics](#releasing-anonymization-summary-statistics).