Commit 9d35b33

Author: Borovits

Addition of Jensen Shannon metric

Result correlation matrices for euclidean distance and PCD added to the log. Optimization in Kolmogorov and KL divergence. README.md addition.

1 parent d427277 commit 9d35b33
File tree: 2 files changed, +134, -34 lines
README.MD (new file, +45 lines)
# Synthetic data evaluation

This repository contains the script for the evaluation of the viewership synthetic data.

## Details

The script takes as input the original and the synthetic datasets. In total, 5 metrics are used to evaluate whether the synthetic dataset preserves the patterns and characteristics of the original one.

The metrics used are:
- The Correlation (Euclidean) distance
- The two-sample Kolmogorov-Smirnov test
- The Jensen-Shannon divergence
- The Kullback-Leibler (KL) divergence
- The pairwise correlation difference (PCD)
### The Correlation (Euclidean) distance

Having calculated the correlation matrices over the attributes of the original dataset and over the attributes of the generated dataset, a suitable way to measure their similarity is to compute the sum of their pairwise Euclidean distances, i.e. the sum of the Euclidean distances between every pair of entries X_ij and Y_ij of the correlation matrices X and Y. This measures how well the intrinsic patterns between the attributes of the original dataset are preserved in the new synthetic dataset. The lower this metric, the better the data generation tool preserves the patterns.
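As a rough illustration of the idea (not the repository's exact implementation; the pandas inputs are assumptions, and the aggregation into a single score follows the `LA.norm(eucl_matr)` call visible in `evaluation.py`):

```python
import numpy as np
import pandas as pd

def correlation_distance(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Aggregate distance between the correlation matrices of the real and synthetic data."""
    x = real.corr().to_numpy()   # correlation matrix X of the original attributes
    y = synth.corr().to_numpy()  # correlation matrix Y of the synthetic attributes
    elementwise = np.abs(x - y)  # distance between each pair of entries X_ij and Y_ij
    return float(np.linalg.norm(elementwise))  # aggregated into a single score
```

In `evaluation.py` the resulting score is checked against a threshold of 14.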
### The two-sample Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov test is used to test whether two samples come from the same distribution. The significance level is set to a = 0.05. If the p-value generated by the test is lower than a, then it is probable that the two distributions are different. The threshold limit for this function is a list containing fewer than 10 elements.
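A minimal, self-contained sketch of the per-column test with SciPy (the toy samples below are placeholders, not data from this repository):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_sample = rng.normal(loc=0.0, scale=1.0, size=1000)   # stands in for a column of the original data
synth_sample = rng.normal(loc=0.1, scale=1.0, size=1000)  # stands in for the same column of the synthetic data

statistic, p_value = ks_2samp(real_sample, synth_sample)
if p_value < 0.05:
    print('The two samples likely come from different distributions')
else:
    print('No evidence that the distributions differ')
```

In this commit, `kolmogorov()` returns the per-column p-values and the main block flags every column whose p-value falls below 0.05.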
### The Jensen-Shannon divergence

The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference (or similarity) between two probability distributions. It uses the KL divergence to calculate a normalized score that is symmetrical. It is more useful as a measure because it provides a smoothed and normalized version of the KL divergence, with scores between 0 (identical) and 1 (maximally different) when using the base-2 logarithm.
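A per-column sketch mirroring the `distance.jensenshannon` call added in this commit (it assumes both columns take the same set of values, so the two probability vectors align):

```python
import pandas as pd
from numpy import asarray
from scipy.spatial import distance

def js_for_column(real: pd.DataFrame, synth: pd.DataFrame, column: str) -> float:
    """Jensen-Shannon score between the real and synthetic value distributions of one column."""
    p = real[column].value_counts(normalize=True).sort_index(ascending=True)
    q = synth[column].value_counts(normalize=True).sort_index(ascending=True)
    # SciPy returns the Jensen-Shannon distance (the square root of the divergence);
    # with base=2 it is also bounded between 0 and 1
    return distance.jensenshannon(asarray(p.tolist()), asarray(q.tolist()), base=2)
```

In the main block, the score is expected to stay below 0.5 for every column, and below 0.75 for CONTENT_ID.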
### The Kullback-Leibler (KL) divergence

The KL divergence, also called relative entropy, is computed over a pair of real and synthetic marginal probability mass functions (PMFs) for a given variable, and it measures the similarity of the two PMFs. When both distributions are identical, the KL divergence is zero, while larger values indicate a larger discrepancy between the two PMFs. Note that the KL divergence is computed for each variable independently, so it does not measure dependencies among the variables; it is defined at the variable level, not over the entire dataset.
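A per-column sketch using `scipy.special.rel_entr`, as in the `kl_divergence` method of `evaluation.py` (again assuming the real and synthetic columns share the same set of values so the PMFs align):

```python
import pandas as pd
from scipy.special import rel_entr

def kl_for_column(real: pd.DataFrame, synth: pd.DataFrame, column: str) -> float:
    """KL divergence between the real and synthetic marginal PMFs of one column."""
    p = real[column].value_counts(normalize=True).sort_index(ascending=True)
    q = synth[column].value_counts(normalize=True).sort_index(ascending=True)
    # rel_entr returns the element-wise terms p * log(p / q); their sum is the KL divergence
    return sum(rel_entr(p.tolist(), q.tolist()))
```

The main block treats values of 2.20 and above as failures for the corresponding column.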
### The pairwise correlation difference (PCD)

PCD is intended to measure how much of the correlation among the variables the different methods were able to capture. PCD measures the difference, in terms of the Frobenius norm, between the Pearson correlation matrices computed from the real and synthetic datasets. The smaller the PCD, the closer the synthetic data is to the real data in terms of linear correlations across the variables. PCD is defined at the dataset level.
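A sketch of PCD consistent with the `np.subtract` / `LA.norm` steps visible in `pairwise_correlation_difference` (the Pearson method and DataFrame inputs are assumptions):

```python
import numpy as np
import pandas as pd

def pcd(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Frobenius norm of the difference between the two Pearson correlation matrices."""
    corr_real = real.corr(method='pearson').to_numpy()
    corr_synth = synth.corr(method='pearson').to_numpy()
    diff = np.subtract(corr_real, corr_synth)
    return float(np.linalg.norm(diff))  # the Frobenius norm is NumPy's default for matrices
```

In `evaluation.py` the PCD is compared against a threshold of 2.4.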

evaluation.py (+89, -34 lines)
```diff
@@ -3,6 +3,7 @@
 from sklearn.metrics import pairwise_distances
 from sklearn.metrics.pairwise import euclidean_distances
 from numpy import linalg as LA
+from numpy import asarray
 from scipy.special import rel_entr
 from scipy.spatial import distance
 import logging
```
```diff
@@ -60,7 +61,7 @@ def euclidean_dist(self):
 
         eucl = LA.norm(eucl_matr)
 
-        return eucl
+        return eucl, eucl_matr
 
     def kolmogorov(self):
 
```
```diff
@@ -82,43 +83,58 @@ def kolmogorov(self):
         sample_real = real_cat[target_cols].reset_index(drop=True)
         sample_synth = synth_cat[target_cols].reset_index(drop=True)
 
-        p_value = 0.05
-        rejected = []
+        cols = {}
         for col in range(10):
             test = ks_2samp(sample_real.iloc[:, col], sample_synth.iloc[:, col])
-            if test[1] < p_value:
-                rejected.append(target_cols[col])
+            col_name = target_cols[col]
+            cols[col_name] = test[1]
 
-        return rejected
+        return cols
+
+    def jensen_shannon(self):
+
+        """ The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference
+        (or similarity) between two probability distributions. It uses the KL divergence to calculate a normalized score
+        that is symmetrical. It is more useful as a measure as it provides a smoothed and normalized version of
+        KL divergence, with scores between 0 (identical) and 1 (maximally different), when using the
+        base-2 logarithm.
+        The threshold limit for this function is a value which should be less than 0.5, except for the CONTENT_ID column
+        which needs to be less than 0.75"""
+
+        target_columns = list(self.origdst.columns[11:-3])
+        target_columns.append(self.origdst.columns[4])  # content_id
+
+        js_dict = {}
+
+        for col in target_columns:
+            col_counts_orig = self.origdst[col].value_counts(normalize=True).sort_index(ascending=True)
+            col_counts_synth = self.synthdst[col].value_counts(normalize=True).sort_index(ascending=True)
+
+            js = distance.jensenshannon(asarray(col_counts_orig.tolist()), asarray(col_counts_synth.tolist()), base=2)
+
+            js_dict[col] = js
+
+        return js_dict
 
     def kl_divergence(self):
 
         """ This metric is also defined at the variable level and examines whether the distributions of the attributes are
         identical and measures the potential level of discrepancy between them.
         The threshold limit for this metric is a value below 2"""
 
-        target_columns = self.origdst.columns[11:-3]
+        target_columns = list(self.origdst.columns[11:-3])
+        target_columns.append(self.origdst.columns[4])  # content_id
 
         kl_dict = {}
 
         for col in target_columns:
-
-            col_counts_orig = self.origdst[col].value_counts()
-            col_counts_synth = self.synthdst[col].value_counts()
-
-            for i, k in col_counts_orig.items():
-                col_counts_orig[i] = k / col_counts_orig.sum()
-            for i, k in col_counts_synth.items():
-                col_counts_synth[i] = k / col_counts_synth.sum()
+            col_counts_orig = self.origdst[col].value_counts(normalize=True).sort_index(ascending=True)
+            col_counts_synth = self.synthdst[col].value_counts(normalize=True).sort_index(ascending=True)
 
             kl = sum(rel_entr(col_counts_orig.tolist(), col_counts_synth.tolist()))
 
             kl_dict[col] = kl
 
-        for key in list(kl_dict):
-            if kl_dict[key] < 2:
-                del kl_dict[key]
-
         return kl_dict
 
     def pairwise_correlation_difference(self):
```
```diff
@@ -143,7 +159,7 @@ def pairwise_correlation_difference(self):
         substract_m = np.subtract(corr_real, corr_rand)
         prwcrdst = LA.norm(substract_m)
 
-        return prwcrdst
+        return prwcrdst, substract_m
 
 
 if __name__ == "__main__":
```
```diff
@@ -159,35 +175,70 @@ def pairwise_correlation_difference(self):
 
     # euclidean distance
     flag_eucl = False
-    eucl = ob.euclidean_dist()
-    print(eucl)
-    logger.info('Euclidean distance calculated')
+    eucl, eumatr = ob.euclidean_dist()
+    logger.info('Euclidean distance was calculated')
+    print('The calculated euclidean distance is: ', eucl)
+    print('The calculated euclidean distance matrix is:', eumatr)
     if eucl > 14:
         logger.error(f'The calculated Euclidean distance value between the two correlation matrices is too high it should be \
         less than 14. The current value is {eucl}')
+        logger.info(f'The Euclidean distance matrix is \n {eumatr}')
     else:
-        logger.info('The dataaset satisfies the criteria for the euclidean distance.')
+        logger.info('The dataset satisfies the criteria for the euclidean distance.')
+        logger.info(f'The Euclidean distance matrix is \n {eumatr}')
         flag_eucl = True
     logger.info('---------------------------------------------------------')
 
     # 2 sample Kolmogorov-Smirnov test
     kst = ob.kolmogorov()
+    p_value = 0.05
     flag_klg = False
-    print(kst)
-    logger.info('Kolmogorov-Smirnov test performed')
-    if kst:
+    logger.info('Kolmogorov-Smirnov test was performed')
+    print('The results of the Kolmogorov-Smirnov test are:', kst)
+    rejected = {}
+    for col in kst:
+        if kst[col] < p_value:
+            rejected[col] = kst[col]
+    if rejected:
         logger.info('The dataset did not pass the Kolmogorov-Smirnov test')
-        logger.info(f'The columns that did not pass the test are {kst}')
+        logger.info(f'The columns that did not pass the test are \n {rejected}')
     else:
         logger.info('The dataset passed the Kolmogorov-Smirnov test')
         flag_klg = True
     logger.info('---------------------------------------------------------')
 
+    # Jensen-Shannon Divergence
+    dict_js = ob.jensen_shannon()
+    logger.info('Jensen-Shannon Divergence was calculated')
+    print('The result of the Jensen-Shannon Divergence is:', dict_js)
+    flag_js = False
+
+    for key in list(dict_js):
+        if (dict_js[key] < 0.50) & (key != 'CONTENT_ID'):
+            del dict_js[key]
+        if key == 'CONTENT_ID':
+            if (dict_js[key] < 0.75):
+                del dict_js[key]
+
+    if dict_js:
+        logger.info('The dataset did not pass the Jensen-Shannon Divergence test')
+        for key in dict_js.keys():
+            logger.info(f'The Jensen-Shannon Divergence value for the column {key} was {dict_js[key]}')
+    else:
+        logger.info('The dataset passed the Jensen-Shannon Divergence test')
+        flag_js = True
+    logger.info('---------------------------------------------------------')
+
     # KL divergence
     dict_kl = ob.kl_divergence()
+    logger.info('KL divergence was calculated')
+    print('The result of the KL divergence is', dict_kl)
     flag_kl = False
-    print(dict_kl)
-    logger.info('KL divergence calculated')
+
+    for key in list(dict_kl):
+        if dict_kl[key] < 2.20:
+            del dict_kl[key]
+
     if dict_kl:
         logger.info('The dataset did not pass the KL divergence evaluation test')
         for key in dict_kl.keys():
```
```diff
@@ -198,18 +249,22 @@ def pairwise_correlation_difference(self):
     logger.info('---------------------------------------------------------')
 
     # pairwise correlation difference
-    pair_corr_diff = ob.pairwise_correlation_difference()
+    pair_corr_diff, pcd_matr = ob.pairwise_correlation_difference()
+    logger.info('Pairwise correlation difference was calculated')
+    print('The calculated Pairwise correlation difference was', pair_corr_diff)
+    print('The calculated Pairwise correlation difference matrix was', pcd_matr)
+
     flag_pcd = False
-    print(pair_corr_diff)
-    logger.info('Pairwise correlation difference calculated')
     if pair_corr_diff > 2.4:
         logger.error(f'The calculated Pairwise correlation difference between the two correlation matrices is too high, it should be \
         less than 2.4. The current value is {pair_corr_diff}')
+        logger.info(f'The Pairwise correlation difference matrix is \n {pcd_matr}')
     else:
         logger.info('The dataset satisfies the criteria for the Pairwise Correlation Difference.')
+        logger.info(f'The Pairwise correlation difference matrix is \n {pcd_matr}')
         flag_pcd = True
 
-    if (flag_eucl & flag_klg & flag_kl & flag_pcd):
+    if (flag_eucl & flag_js & flag_klg & flag_kl & flag_pcd):
         logger.info('The dataset satisfies the minimum evaluation criteria.')
     else:
         logger.info('The dataset does not satisfy the minimum evaluation criteria.')
```
