Currently, we support datasets with privacy IDs through two metrics. For individual dataframes, IDs are expressed as IfGroupedBy(id_col, SymmetricDifference()). However, sometimes we have multiple dataframes with the same ID space. This can happen either at program initialization (if the user provides multiple such dataframes) or over the course of the program (a very common execution pattern is to start with a dictionary of dataframes, apply a transformation to one of them, and add the result back to the dictionary). Multiple dataframes with the same ID space are represented through the AddRemoveKeys metric, which pairs with a dictionary of dataframes and identifies which column in each dataframe contains the ID.
This approach has some shortcomings. See [this doc] for a long explanation. In brief: if we have a dictionary of dataframes whose metric is AddRemoveKeys, and we want to perform a transformation on one of the dataframes, we have to pull it out of the dictionary. But when we do, the dataframe's own metric is IfGroupedBy, which records the ID column but not which ID space it belongs to. After the transformation, even if the output metric is still IfGroupedBy, we can't tell whether the result is in the same ID space as we started with.
That means that we can only use specially vetted transformations with AddRemoveKeys (ones that have been guaranteed to preserve the same ID space). In practice, this takes the form of wrapping every relevant transformation in a TransformValue.
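The information loss described above can be sketched with toy stand-ins for the metrics. This is only an illustration, not the library's real classes: the dataclasses below, the metric_for helper, and the dataframe and column names ("visits", "purchases", "user_id") are all hypothetical. The stand-ins keep just the fields the text mentions (an ID column per dataframe), which is enough to show why an extracted IfGroupedBy can't be tied back to its dictionary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SymmetricDifference:
    """Toy stand-in for the row-level metric (hypothetical, not the real class)."""


@dataclass
class IfGroupedBy:
    """Toy stand-in: records only the ID column and the inner metric."""
    column: str
    inner_metric: SymmetricDifference


@dataclass
class AddRemoveKeys:
    """Toy stand-in: maps each dataframe name to the column holding its ID."""
    df_to_key_column: Dict[str, str]


def metric_for(dict_metric: AddRemoveKeys, name: str) -> IfGroupedBy:
    """Hypothetical helper: the metric of one dataframe pulled out of the dictionary."""
    return IfGroupedBy(dict_metric.df_to_key_column[name], SymmetricDifference())


# Two dataframes that share one ID space.
shared = AddRemoveKeys({"visits": "user_id", "purchases": "user_id"})

# Metric of a dataframe extracted from the shared dictionary.
pulled_out = metric_for(shared, "visits")

# Metric of some unrelated dataframe that merely uses the same column name.
unrelated = IfGroupedBy("user_id", SymmetricDifference())

# The two metrics are indistinguishable: nothing records the ID space.
same_metric = pulled_out == unrelated
```

In this sketch same_metric is true, so a check based only on the output metric would happily merge a dataframe from a different ID space back into the dictionary. That is the gap that vetted wrapper transformations currently paper over.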
Some problems that I would like to see fixed in a redesign:
- We shouldn't need an entirely separate set of transformations to work with AddRemoveKeys.
- We should be able to tell from the output metric whether the result of a transformation shares an ID space with other dataframes.