Commit 7a3e4a3

Merge pull request #235 from diffix/cristian/misc
Emphasize proper names in docs.
2 parents 2d7ac6c + 0cc583c commit 7a3e4a3

5 files changed

+40
-37
lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# Diffix for Desktop
22

3-
Desktop application for anonymizing data using Open Diffix Elm.
3+
Desktop application for anonymizing data using __Open Diffix Elm__.
44

55
## To use
66

docs/anonymization.md

Lines changed: 12 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,36 +1,38 @@
11
# Anonymization
22

3-
Diffix for Desktop uses Diffix as its underlying anonymization mechanism. Diffix was co-developed by the Max Planck Institute for Software Systems and Aircloak GmbH. Diffix is strong enough to satisfy the GDPR definition of anonymization as non-personal data.
3+
__Diffix for Desktop__ uses __Diffix__ as its underlying anonymization mechanism. __Diffix__ was co-developed by the __Max Planck Institute for Software Systems__ and __Aircloak GmbH__. __Diffix__ is strong enough to satisfy the GDPR definition of anonymization as non-personal data.
44

5-
> If you would like help in getting approval from your Data Protection Officer (DPO) or Data Protection Authority (DPA), please contact us at [[email protected]](mailto:[email protected]).
5+
The technology is made openly available under the [__Open Diffix__](https://open-diffix.org) organization. The latest version of the mechanism is called __Diffix Elm__.
66

7-
Diffix combines three common anonymization mechanisms:
7+
> If you would like help in getting approval from your __Data Protection Officer (DPO)__ or __Data Protection Authority (DPA)__, please contact us at [[email protected]](mailto:[email protected]).
8+
9+
__Diffix__ combines three common anonymization mechanisms:
810
* __Noise:__ Distorts counts.
911
* __Suppression:__ Removes outputs that pertain to too few protected entities.
1012
* __Generalization:__ Makes data more coarse-grained, for instance generalizing date-of-birth to year-of-birth.
1113

1214
Noise is commonly used with Differential Privacy mechanisms. Generalization and suppression are commonly used with k-anonymity.
1315

14-
Diffix automatically applies these three mechanisms as needed on a query-by-query basis. Diffix detects how much is contributed to each output bin by each protected entity, and tailors noise and suppression so as to maximize data quality while maintaining strong anonymization. The quality of data anonymized with Diffix usually far exceeds that of Differential Privacy and k-anonymity.
16+
__Diffix__ automatically applies these three mechanisms as needed on a query-by-query basis. __Diffix__ detects how much is contributed to each output bin by each protected entity, and tailors noise and suppression so as to maximize data quality while maintaining strong anonymization. The quality of data anonymized with __Diffix__ usually far exceeds that of Differential Privacy and k-anonymity.
1517

1618
## Proportional Noise
1719

18-
Diffix adds pseudo-random noise taken from a Normal distribution. The amount of noise (the standard deviation) is proportional to how much is contributed to the count by the heaviest contributors. When counting the number of protected entities, each entity contributes 1, and the noise standard deviation is `SD=1.5`. With high probability, the resulting answer will be within plus or minus 5 of the true answer.
20+
__Diffix__ adds pseudo-random noise taken from a normal distribution. The amount of noise (the standard deviation) is proportional to how much is contributed to the count by the heaviest contributors. When counting the number of protected entities, each entity contributes 1, and the noise standard deviation is `SD=1.5`. With high probability, the resulting answer will be within plus or minus 5 of the true answer.
1921

2022
When counting the number of rows, the amount of noise is larger: proportional to the number of rows contributed by the highest contributors. This is similar to the concept of sensitivity in Differential Privacy. Proportional noise protects high contributors in the case where data recipients may have prior knowledge about the heavy contributors.
2123

22-
Finally, Diffix removes the excess contributions of extreme outliers (one or two protected entities that contribute far more rows than other protected entities). This prevents data recipients from inferring information about extreme outliers from the amount of noise itself.
24+
Finally, __Diffix__ removes the excess contributions of extreme outliers (one or two protected entities that contribute far more rows than other protected entities). This prevents data recipients from inferring information about extreme outliers from the amount of noise itself.
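
The proportional-noise rule described in this section can be illustrated with a minimal Python sketch. It is not the Diffix implementation: the function name `noisy_count`, the `heaviest_contribution` parameter, and the simple rule `SD = 1.5 * heaviest_contribution` are assumptions made for illustration. Only the base value `SD=1.5` comes from the text, and the sketch omits the removal of extreme outliers.

```python
import random

BASE_SD = 1.5  # from the text: SD when each protected entity contributes 1


def noisy_count(true_count, heaviest_contribution=1, rng=None):
    """Return true_count plus zero-mean normal noise.

    Assumption: the standard deviation scales linearly with the
    contribution of the heaviest contributors to the count.
    """
    rng = rng or random.Random()
    sd = BASE_SD * heaviest_contribution
    return true_count + rng.gauss(0.0, sd)


# Counting protected entities: each entity contributes 1.
print(noisy_count(100, rng=random.Random(42)))

# Counting rows when the heaviest contributors have about 10 rows each.
print(noisy_count(100, heaviest_contribution=10, rng=random.Random(42)))
```

With `SD=1.5`, a deviation of 5 is about 3.3 standard deviations, which is why the answer stays within plus or minus 5 with high probability.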
2325

2426
## Suppression
2527

26-
Diffix recognizes how many protected entities contribute to each output bin. When the number is too small, Diffix suppresses (doesn't output) the bin. This prevents data recipients from inferring information about individual protected entities even when the recipients have prior knowledge.
28+
__Diffix__ recognizes how many protected entities contribute to each output bin. When the number is too small, __Diffix__ suppresses (doesn't output) the bin. This prevents data recipients from inferring information about individual protected entities even when the recipients have prior knowledge.
2729

28-
Rather than apply a single suppression threshold to all bins, Diffix slightly modifies the threshold for different bins. This adds additional uncertainty for recipients that have prior knowledge of protected entities.
30+
Rather than apply a single suppression threshold to all bins, __Diffix__ slightly modifies the threshold for different bins. This adds additional uncertainty for recipients that have prior knowledge of protected entities.
2931

30-
Diffix suppresses bins with fewer than 4 protected entities *on average*. Diffix always suppresses bins with only a single contributing protected entity.
32+
__Diffix__ suppresses bins with fewer than 4 protected entities *on average*. __Diffix__ always suppresses bins with only a single contributing protected entity.
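
A per-bin suppression threshold of this kind might be sketched as follows. The mean of 4 and the rule that single-entity bins are always suppressed come from the text. The threshold spread (`THRESHOLD_SD`), the floor of 2, the per-bin seeding scheme, and the name `is_suppressed` are hypothetical choices for illustration, not parameters documented here.

```python
import random

MEAN_THRESHOLD = 4.0  # from the text: bins below 4 entities suppressed on average
THRESHOLD_SD = 0.5    # assumption: spread of the per-bin threshold
MIN_THRESHOLD = 2     # assumption: floor so single-entity bins are always suppressed


def is_suppressed(bin_key, entity_count, seed=0):
    """Decide whether a bin is suppressed.

    Assumption: each bin gets its own pseudo-random threshold, seeded
    by the bin key so the same bin always gets the same decision.
    """
    rng = random.Random(f"{seed}:{bin_key}")
    threshold = max(MIN_THRESHOLD, rng.gauss(MEAN_THRESHOLD, THRESHOLD_SD))
    return entity_count < threshold


for count in (1, 3, 4, 5, 20):
    print(count, is_suppressed("PhD", count))
```

Because the threshold never drops below 2, a bin with one protected entity is suppressed regardless of the random draw, while bins near 4 entities may or may not be released.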
3133

3234
## Generalization
3335

34-
Unlike proportional noise and suppression, Diffix does not automate and enforce generalization. This is because the amount of acceptable generalization depends on the analytic goals of each use case. For instance, in some cases year-of-birth may be required, whereas in others decade-of-birth may be acceptable. Diffix has no way of knowing what level of generalization to choose. Rather, this is left to the analyst so that data quality can be tailored to the specific use case.
36+
Unlike proportional noise and suppression, __Diffix__ does not automate and enforce generalization. This is because the amount of acceptable generalization depends on the analytic goals of each use case. For instance, in some cases year-of-birth may be required, whereas in others decade-of-birth may be acceptable. __Diffix__ has no way of knowing what level of generalization to choose. Rather, this is left to the analyst so that data quality can be tailored to the specific use case.
3537

3638
As a general rule, noise and suppression force analysts to generalize. If data is too fine-grained, then each output bin will have very few protected entities. In these cases, the bins may be suppressed, or if not suppressed the signal-to-noise may be too low.

docs/operation.md

Lines changed: 16 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
> To report feature requests or problems, please contact us at [[email protected]](mailto:[email protected]).
44
5-
Diffix for Desktop has three phases of operation:
5+
__Diffix for Desktop__ has three phases of operation:
66
- Load and configure table from CSV
77
- Select data and adjust quality
88
- Select columns for anonymization
@@ -16,15 +16,15 @@ An unlimited number of anonymized views of the data may be exported without comp
1616

1717
## Load table from CSV
1818

19-
Diffix for Desktop only accepts CSV files as input.
19+
__Diffix for Desktop__ only accepts CSV files as input.
2020

21-
Diffix for Desktop interprets the first row of the CSV file as column names.
21+
__Diffix for Desktop__ interprets the first row of the CSV file as column names.
2222

23-
Diffix for Desktop auto-detects the CSV separator. All standard separators are accepted.
23+
__Diffix for Desktop__ auto-detects the CSV separator. All standard separators are accepted.
2424

25-
Diffix for Desktop auto-detects data types as text or numeric. Text columns are generalized with substring selection, and numeric columns are generalized with numeric ranges.
25+
__Diffix for Desktop__ auto-detects data types as text or numeric. Text columns are generalized with substring selection, and numeric columns are generalized with numeric ranges.
2626

27-
After loading, Diffix for Desktop displays the column names and the first 1000 rows of the table. This data may be inspected to validate that the CSV file was loaded correctly.
27+
After loading, __Diffix for Desktop__ displays the column names and the first 1000 rows of the table. This data may be inspected to validate that the CSV file was loaded correctly.
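
The loading behaviour described in this section (separator detection, header row, text or numeric typing, 1000-row preview) can be approximated with Python's standard `csv` module. This is a sketch under assumptions, not how the application is implemented: the candidate separator list, the `load_preview` and `detect_type` names, and the "parses as float" test for numeric columns are all illustrative choices.

```python
import csv
import io
from itertools import islice


def load_preview(text, max_rows=1000):
    """Detect the separator, then return (header, first max_rows data rows)."""
    # Assumption: the separator is one of these four common characters.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|")
    reader = csv.reader(io.StringIO(text), dialect)
    header = next(reader)  # first row is interpreted as column names
    rows = list(islice(reader, max_rows))
    return header, rows


def detect_type(values):
    """Classify a column as 'numeric' if every value parses as a float."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "text"
    return "numeric"


sample = "age;zip;education\n23;54321;Bachelor\n32;48572;PhD\n"
header, rows = load_preview(sample)
for index, name in enumerate(header):
    print(name, detect_type(row[index] for row in rows))
```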
2828

2929
### Sample CSV files
3030

@@ -33,7 +33,7 @@ are available for testing.
3333

3434
## IMPORTANT: Configure the Protected Entity Identifier Column
3535

36-
In order for Diffix for Desktop to anonymize properly, the column containing the protected entity identifier must be correctly configured.
36+
In order for __Diffix for Desktop__ to anonymize properly, the column containing the protected entity identifier must be correctly configured.
3737

3838
The **protected entity** is the entity whose privacy is being protected. A protected entity is usually a person, but it could be something else, for instance an account, a family, or even an organization.
3939

@@ -45,7 +45,7 @@ Some data sets have one row of data per protected entity. Examples include surve
4545
| O | 54321 | 23 | Bachelor | None | ... |
4646
| F | 48572 | 32 | PhD | Professor | ... |
4747

48-
These *one-row* data sets often do not have any kind of identifier column. The Protected Entity Identifier Column may be set to `None`. Diffix for Desktop treats each row as a protected entity.
48+
These *one-row* data sets often do not have any kind of identifier column. The Protected Entity Identifier Column may be set to `None`. __Diffix for Desktop__ treats each row as a protected entity.
4949

5050
Other data sets have multiple rows of data per protected entity. Examples include time series data like geo-location, hospital visits, and website visits. These data sets usually have one or more columns that identify the protected entity. For instance, the following is a geo-location data set where the IMEI (International Mobile Equipment Identifier) identifies the protected entity. In this case, the protected entity itself is a mobile device, but in most cases this effectively represents a person.
5151

@@ -60,7 +60,7 @@ Other data sets have multiple rows of data per protected entity. Examples includ
6060
| 456 | 2021-02-13 17:02:51 | -17.67883 | 81.40221 |
6161
| ... | ... | ... | ... |
6262

63-
For this *multi-row* data set, the IMEI column is configured in Diffix for Desktop as the Protected Entity Identifier Column. If a different column were configured as the Protected Entity Identifier Column, then Diffix for Desktop would not anonymize correctly.
63+
For this *multi-row* data set, the IMEI column is configured in __Diffix for Desktop__ as the Protected Entity Identifier column. If a different column were configured as the Protected Entity Identifier column, then __Diffix for Desktop__ would not anonymize correctly.
6464

6565
In some multi-row data sets, a single row may pertain to multiple different protected entities. Examples include bank transactions, email records, and call records. Here is an example of a data set for email records:
6666

@@ -73,27 +73,27 @@ In some multi-row data sets, a single row may pertain to multiple different prot
7373

7474
The `Sender email` and `Receiver email` each identify a different protected entity.
7575

76-
> **This version of Diffix for Desktop does not protect a data set where there are multiple protected entities per row**
76+
> **This version of __Diffix for Desktop__ does not protect a data set where there are multiple protected entities per row**
7777
78-
A data set with multiple protected entities needs to be pre-processed to have one protected entity per row before loading into Diffix for Desktop. See [Multiple protected entities per row](#multiple-protected-entities-per-row).
78+
A data set with multiple protected entities needs to be pre-processed to have one protected entity per row before loading into __Diffix for Desktop__. See [Multiple protected entities per row](#multiple-protected-entities-per-row).
7979

8080
## Select columns and generalization
8181

82-
Like all data anonymization mechanisms, Diffix distorts and hides data. The more columns included and the finer the data granularity, the more distortion and hiding. Diffix distorts by adding *noise* to counts, and hides data by *suppressing* bins that pertain to too few protected entities.
82+
Like all data anonymization mechanisms, __Diffix__ distorts and hides data. The more columns included and the finer the data granularity, the more distortion and hiding. __Diffix__ distorts by adding *noise* to counts, and hides data by *suppressing* bins that pertain to too few protected entities.
8383

84-
Diffix for Desktop lets you control the quality of the anonymized data through column selection and column generalization (binning). It lets you inspect the quality of the anonymized data at a glance with *distortion statistics* and in detail with *side-by-side comparison* of the anonymized and original data. Through an iterative process of column selection and generalization, and anonymized data inspection, Diffix for Desktop simplifies the task of data anonymization.
84+
__Diffix for Desktop__ lets you control the quality of the anonymized data through column selection and column generalization (binning). It lets you inspect the quality of the anonymized data at a glance with *distortion statistics* and in detail with *side-by-side comparison* of the anonymized and original data. Through an iterative process of column selection and generalization, and anonymized data inspection, __Diffix for Desktop__ simplifies the task of data anonymization.
8585

8686
![](images/quality-iterate.png#640)
8787

88-
Columns are selected for inclusion in the anonymized data output using the radial buttons. As soon as a column is selected, Diffix for Desktop starts computing the anonymized output. If another column is selected or de-selected before the computing finishes, then the computation is halted and a new computation started.
88+
Columns are selected for inclusion in the anonymized data output using the radial buttons. As soon as a column is selected, __Diffix for Desktop__ starts computing the anonymized output. If another column is selected or de-selected before the computing finishes, then the computation is halted and a new computation started.
8989

9090
When a column is selected, the generalization input is exposed. For text columns, you can select a substring by offset and number of characters. For numeric columns, you can select a bin size.
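
The two generalization inputs can be pictured with a short Python sketch. The function names are hypothetical, and two details are assumptions not stated in the text: offsets are taken to be 0-based, and a numeric bin is labelled by its lower edge.

```python
import math


def generalize_text(value, offset, length):
    """Keep `length` characters of `value`, starting at a 0-based `offset`."""
    return value[offset:offset + length]


def generalize_number(value, bin_size):
    """Map `value` to the lower edge of its bin of width `bin_size`."""
    return math.floor(value / bin_size) * bin_size


print(generalize_text("54321", 0, 2))   # "54": first two characters of a zip code
print(generalize_number(1987, 10))      # 1980: year-of-birth to decade-of-birth
```

Larger bins and shorter substrings place more protected entities in each output bin, which is what reduces suppression and relative noise.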
9191

9292
> More generalization (larger substrings or no substring, and larger numeric bins) leads to less suppression and less relative noise, but also less precision.
9393
9494
### Toggle between counting rows and counting protected entities (i.e. persons)
9595

96-
If the input data is multi-row, then Diffix for Desktop gives you the choice of counting rows or counting protected entities (i.e. persons). The toggle switch may be found at the bottom of the column selection area.
96+
If the input data is multi-row, then __Diffix for Desktop__ gives you the choice of counting rows or counting protected entities (i.e. persons). The toggle switch may be found at the bottom of the column selection area.
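
The difference between the two counting modes can be shown on a small multi-row data set. The sketch below uses plain Python dictionaries with hypothetical column names (`imei`, `hour`). It computes raw counts only and applies no noise or suppression.

```python
from collections import defaultdict


def count_per_bin(rows, id_key, bin_key):
    """Return {bin: (row_count, distinct_entity_count)} for each bin."""
    row_counts = defaultdict(int)
    entities = defaultdict(set)
    for row in rows:
        row_counts[row[bin_key]] += 1
        entities[row[bin_key]].add(row[id_key])
    return {b: (row_counts[b], len(entities[b])) for b in row_counts}


rows = [
    {"imei": "123", "hour": 17},
    {"imei": "123", "hour": 17},
    {"imei": "456", "hour": 17},
    {"imei": "456", "hour": 18},
]
print(count_per_bin(rows, "imei", "hour"))
```

Here the 17:00 bin holds three rows but only two protected entities, so the two modes report different counts for the same bin.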
9797

9898
## How to interpret the Anonymization Summary
9999

@@ -138,7 +138,7 @@ The combined view lets you examine precisely the distortion and suppression. The
138138

139139
### What is safe to release
140140

141-
Note that the data in the `Anonymized` view is the only data that is properly anonymized by Diffix for Desktop. Note in particular that the Anonymization Summary is not anonymized per se. See [Releasing Anonymization Summary statistics](#releasing-anonymization-summary-statistics).
141+
Note that the data in the `Anonymized` view is the only data that is properly anonymized by __Diffix for Desktop__. Note in particular that the Anonymization Summary is not anonymized per se. See [Releasing Anonymization Summary statistics](#releasing-anonymization-summary-statistics).
142142

143143
## Export anonymized data to CSV
144144
