Commit 9d35b33

Author: Borovits

Addition of Jensen Shannon metric

Result correlation matrices for euclidean distance and PCD added to the log. Optimization in Kolmogorov and KL divergence. README.md addition.

1 parent d427277 commit 9d35b33
File tree: 2 files changed, +134, -34 lines
README.MD (new file, +45 lines)
# Synthetic data evaluation

This repository contains the script for the evaluation of the viewership synthetic data.

## Details

The script takes as input the original and the synthetic datasets. In total, 5 metrics are used to evaluate whether the synthetic dataset preserves the patterns and characteristics of the original one.

The metrics used are:
- The Correlation (Euclidean) distance
- The two-sample Kolmogorov-Smirnov test
- The Jensen-Shannon divergence
- The Kullback-Leibler (KL) divergence
- The pairwise correlation difference (PCD)
### The Correlation (Euclidean) distance

Having calculated the correlation matrices over the attributes of the original dataset and over the attributes of the generated dataset, a suitable way to measure their similarity is to compute the sum of their pairwise Euclidean distances, i.e. the sum of the Euclidean distances between every pair of entries X_ij and Y_ij of the correlation matrices X and Y. This measures how well the intrinsic patterns between the attributes of the original dataset are preserved in the new synthetic dataset. The lower this metric, the better the data generation tool preserves the patterns.
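As a rough illustration of the idea (not the repository's exact implementation; the pandas inputs are assumptions, and the aggregation into a single score follows the `LA.norm(eucl_matr)` call visible in `evaluation.py`):

```python
import numpy as np
import pandas as pd

def correlation_distance(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Aggregate distance between the correlation matrices of the real and synthetic data."""
    x = real.corr().to_numpy()   # correlation matrix X of the original attributes
    y = synth.corr().to_numpy()  # correlation matrix Y of the synthetic attributes
    elementwise = np.abs(x - y)  # distance between each pair of entries X_ij and Y_ij
    return float(np.linalg.norm(elementwise))  # aggregated into a single score
```

In `evaluation.py` the resulting score is checked against a threshold of 14.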
### The two-sample Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov test is used to test whether two samples come from the same distribution. The significance level is set to a = 0.05. If the p-value generated by the test is lower than a, then it is probable that the two distributions are different. The threshold limit for this function is a list containing fewer than 10 elements.
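A minimal, self-contained sketch of the per-column test with SciPy (the toy samples below are placeholders, not data from this repository):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_sample = rng.normal(loc=0.0, scale=1.0, size=1000)   # stands in for a column of the original data
synth_sample = rng.normal(loc=0.1, scale=1.0, size=1000)  # stands in for the same column of the synthetic data

statistic, p_value = ks_2samp(real_sample, synth_sample)
if p_value < 0.05:
    print('The two samples likely come from different distributions')
else:
    print('No evidence that the distributions differ')
```

In this commit, `kolmogorov()` returns the per-column p-values and the main block flags every column whose p-value falls below 0.05.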
### The Jensen-Shannon divergence

The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference (or similarity) between two probability distributions. It uses the KL divergence to calculate a normalized score that is symmetrical. It is more useful as a measure because it provides a smoothed and normalized version of the KL divergence, with scores between 0 (identical) and 1 (maximally different) when using the base-2 logarithm.
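A per-column sketch mirroring the `distance.jensenshannon` call added in this commit (it assumes both columns take the same set of values, so the two probability vectors align):

```python
import pandas as pd
from numpy import asarray
from scipy.spatial import distance

def js_for_column(real: pd.DataFrame, synth: pd.DataFrame, column: str) -> float:
    """Jensen-Shannon score between the real and synthetic value distributions of one column."""
    p = real[column].value_counts(normalize=True).sort_index(ascending=True)
    q = synth[column].value_counts(normalize=True).sort_index(ascending=True)
    # SciPy returns the Jensen-Shannon distance (the square root of the divergence);
    # with base=2 it is also bounded between 0 and 1
    return distance.jensenshannon(asarray(p.tolist()), asarray(q.tolist()), base=2)
```

In the main block, the score is expected to stay below 0.5 for every column, and below 0.75 for CONTENT_ID.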
### The Kullback-Leibler (KL) divergence

The KL divergence, also called relative entropy, is computed over a pair of real and synthetic marginal probability mass functions (PMFs) for a given variable, and it measures the similarity of the two PMFs. When both distributions are identical, the KL divergence is zero, while larger values indicate a larger discrepancy between the two PMFs. Note that the KL divergence is computed for each variable independently, so it does not measure dependencies among the variables; it is defined at the variable level, not over the entire dataset.
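A per-column sketch using `scipy.special.rel_entr`, as in the `kl_divergence` method of `evaluation.py` (again assuming the real and synthetic columns share the same set of values so the PMFs align):

```python
import pandas as pd
from scipy.special import rel_entr

def kl_for_column(real: pd.DataFrame, synth: pd.DataFrame, column: str) -> float:
    """KL divergence between the real and synthetic marginal PMFs of one column."""
    p = real[column].value_counts(normalize=True).sort_index(ascending=True)
    q = synth[column].value_counts(normalize=True).sort_index(ascending=True)
    # rel_entr returns the element-wise terms p * log(p / q); their sum is the KL divergence
    return sum(rel_entr(p.tolist(), q.tolist()))
```

The main block treats values of 2.20 and above as failures for the corresponding column.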
### The pairwise correlation difference (PCD)

PCD is intended to measure how much of the correlation among the variables the different methods were able to capture. PCD measures the difference, in terms of the Frobenius norm, between the Pearson correlation matrices computed from the real and synthetic datasets. The smaller the PCD, the closer the synthetic data is to the real data in terms of linear correlations across the variables. PCD is defined at the dataset level.
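A sketch of PCD consistent with the `np.subtract` / `LA.norm` steps visible in `pairwise_correlation_difference` (the Pearson method and DataFrame inputs are assumptions):

```python
import numpy as np
import pandas as pd

def pcd(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Frobenius norm of the difference between the two Pearson correlation matrices."""
    corr_real = real.corr(method='pearson').to_numpy()
    corr_synth = synth.corr(method='pearson').to_numpy()
    diff = np.subtract(corr_real, corr_synth)
    return float(np.linalg.norm(diff))  # the Frobenius norm is NumPy's default for matrices
```

In `evaluation.py` the PCD is compared against a threshold of 2.4.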

evaluation.py (+89, -34 lines)
```diff
@@ -3,6 +3,7 @@
 from sklearn.metrics import pairwise_distances
 from sklearn.metrics.pairwise import euclidean_distances
 from numpy import linalg as LA
+from numpy import asarray
 from scipy.special import rel_entr
 from scipy.spatial import distance
 import logging
```
```diff
@@ -60,7 +61,7 @@ def euclidean_dist(self):
 
         eucl = LA.norm(eucl_matr)
 
-        return eucl
+        return eucl, eucl_matr
 
     def kolmogorov(self):
 
```
```diff
@@ -82,43 +83,58 @@ def kolmogorov(self):
         sample_real = real_cat[target_cols].reset_index(drop=True)
         sample_synth = synth_cat[target_cols].reset_index(drop=True)
 
-        p_value = 0.05
-        rejected = []
+        cols = {}
         for col in range(10):
             test = ks_2samp(sample_real.iloc[:, col], sample_synth.iloc[:, col])
-            if test[1] < p_value:
-                rejected.append(target_cols[col])
+            col_name = target_cols[col]
+            cols[col_name] = test[1]
 
-        return rejected
+        return cols
+
+    def jensen_shannon(self):
+
+        """ The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference
+        (or similarity) between two probability distributions. It uses the KL divergence to calculate a normalized score
+        that is symmetrical. It is more useful as a measure as it provides a smoothed and normalized version of
+        KL divergence, with scores between 0 (identical) and 1 (maximally different), when using the
+        base-2 logarithm.
+        The threshold limit for this function is a value which should be less than 0.5, except for the CONTENT_ID column
+        which needs to be less than 0.75"""
+
+        target_columns = list(self.origdst.columns[11:-3])
+        target_columns.append(self.origdst.columns[4])  # content_id
+
+        js_dict = {}
+
+        for col in target_columns:
+            col_counts_orig = self.origdst[col].value_counts(normalize=True).sort_index(ascending=True)
+            col_counts_synth = self.synthdst[col].value_counts(normalize=True).sort_index(ascending=True)
+
+            js = distance.jensenshannon(asarray(col_counts_orig.tolist()), asarray(col_counts_synth.tolist()), base=2)
+
+            js_dict[col] = js
+
+        return js_dict
 
     def kl_divergence(self):
 
         """ This metric is also defined at the variable level and examines whether the distributions of the attributes are
         identical and measures the potential level of discrepancy between them.
         The threshold limit for this metric is a value below 2"""
 
-        target_columns = self.origdst.columns[11:-3]
+        target_columns = list(self.origdst.columns[11:-3])
+        target_columns.append(self.origdst.columns[4])  # content_id
 
         kl_dict = {}
 
         for col in target_columns:
-
-            col_counts_orig = self.origdst[col].value_counts()
-            col_counts_synth = self.synthdst[col].value_counts()
-
-            for i, k in col_counts_orig.items():
-                col_counts_orig[i] = k / col_counts_orig.sum()
-            for i, k in col_counts_synth.items():
-                col_counts_synth[i] = k / col_counts_synth.sum()
+            col_counts_orig = self.origdst[col].value_counts(normalize=True).sort_index(ascending=True)
+            col_counts_synth = self.synthdst[col].value_counts(normalize=True).sort_index(ascending=True)
 
             kl = sum(rel_entr(col_counts_orig.tolist(), col_counts_synth.tolist()))
 
             kl_dict[col] = kl
 
-        for key in list(kl_dict):
-            if kl_dict[key] < 2:
-                del kl_dict[key]
-
         return kl_dict
 
     def pairwise_correlation_difference(self):
```
```diff
@@ -143,7 +159,7 @@ def pairwise_correlation_difference(self):
         substract_m = np.subtract(corr_real, corr_rand)
         prwcrdst = LA.norm(substract_m)
 
-        return prwcrdst
+        return prwcrdst, substract_m
 
 
 if __name__ == "__main__":
```
```diff
@@ -159,35 +175,70 @@ def pairwise_correlation_difference(self):
 
     # euclidean distance
     flag_eucl = False
-    eucl = ob.euclidean_dist()
-    print(eucl)
-    logger.info('Euclidean distance calculated')
+    eucl, eumatr = ob.euclidean_dist()
+    logger.info('Euclidean distance was calculated')
+    print('The calculated euclidean distance is: ', eucl)
+    print('The calculated euclidean distance matrix is:', eumatr)
     if eucl > 14:
         logger.error(f'The calculated Euclidean distance value between the two correlation matrices is too high it should be \
         less than 14. The current value is {eucl}')
+        logger.info(f'The Euclidean distance matrix is \n {eumatr}')
     else:
-        logger.info('The dataaset satisfies the criteria for the euclidean distance.')
+        logger.info('The dataset satisfies the criteria for the euclidean distance.')
+        logger.info(f'The Euclidean distance matrix is \n {eumatr}')
         flag_eucl = True
     logger.info('---------------------------------------------------------')
 
     # 2 sample Kolmogorov-Smirnov test
     kst = ob.kolmogorov()
+    p_value = 0.05
     flag_klg = False
-    print(kst)
-    logger.info('Kolmogorov-Smirnov test performed')
-    if kst:
+    logger.info('Kolmogorov-Smirnov test was performed')
+    print('The results of the Kolmogorov-Smirnov test are:', kst)
+    rejected = {}
+    for col in kst:
+        if kst[col] < p_value:
+            rejected[col] = kst[col]
+    if rejected:
         logger.info('The dataset did not pass the Kolmogorov-Smirnov test')
-        logger.info(f'The columns that did not pass the test are {kst}')
+        logger.info(f'The columns that did not pass the test are \n {rejected}')
     else:
         logger.info('The dataset passed the Kolmogorov-Smirnov test')
         flag_klg = True
     logger.info('---------------------------------------------------------')
 
+    # Jensen-Shannon Divergence
+    dict_js = ob.jensen_shannon()
+    logger.info('Jensen-Shannon Divergence was calculated')
+    print('The result of the Jensen-Shannon Divergence is:', dict_js)
+    flag_js = False
+
+    for key in list(dict_js):
+        if (dict_js[key] < 0.50) & (key != 'CONTENT_ID'):
+            del dict_js[key]
+        if key == 'CONTENT_ID':
+            if (dict_js[key] < 0.75):
+                del dict_js[key]
+
+    if dict_js:
+        logger.info('The dataset did not pass the Jensen-Shannon Divergence test')
+        for key in dict_js.keys():
+            logger.info(f'The Jensen-Shannon Divergence value for the column {key} was {dict_js[key]}')
+    else:
+        logger.info('The dataset passed the Jensen-Shannon Divergence test')
+        flag_js = True
+    logger.info('---------------------------------------------------------')
+
     # KL divergence
     dict_kl = ob.kl_divergence()
+    logger.info('KL divergence was calculated')
+    print('The result of the KL divergence is', dict_kl)
     flag_kl = False
-    print(dict_kl)
-    logger.info('KL divergence calculated')
+
+    for key in list(dict_kl):
+        if dict_kl[key] < 2.20:
+            del dict_kl[key]
+
     if dict_kl:
         logger.info('The dataset did not pass the KL divergence evaluation test')
         for key in dict_kl.keys():
```
```diff
@@ -198,18 +249,22 @@ def pairwise_correlation_difference(self):
     logger.info('---------------------------------------------------------')
 
     # pairwise correlation difference
-    pair_corr_diff = ob.pairwise_correlation_difference()
+    pair_corr_diff, pcd_matr = ob.pairwise_correlation_difference()
+    logger.info('Pairwise correlation difference was calculated')
+    print('The calculated Pairwise correlation difference was', pair_corr_diff)
+    print('The calculated Pairwise correlation difference matrix was', pcd_matr)
+
     flag_pcd = False
-    print(pair_corr_diff)
-    logger.info('Pairwise correlation difference calculated')
     if pair_corr_diff > 2.4:
         logger.error(f'The calculated Pairwise correlation difference between the two correlation matrices is too high, it should be \
         less than 2.4. The current value is {pair_corr_diff}')
+        logger.info(f'The Pairwise correlation difference matrix is \n {pcd_matr}')
     else:
         logger.info('The dataset satisfies the criteria for the Pairwise Correlation Difference.')
+        logger.info(f'The Pairwise correlation difference matrix is \n {pcd_matr}')
         flag_pcd = True
 
-    if (flag_eucl & flag_klg & flag_kl & flag_pcd):
+    if (flag_eucl & flag_js & flag_klg & flag_kl & flag_pcd):
         logger.info('The dataset satisfies the minimum evaluation criteria.')
     else:
         logger.info('The dataset does not satisfy the minimum evaluation criteria.')
```
