
Commit

Fixed small bug on document frequency
Angelogeb committed Nov 8, 2018
1 parent b742352 commit 0841105
Showing 7 changed files with 12 additions and 28 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -2,4 +2,3 @@
 *.stdout
 __pycache__/
 .ipynb_checkpoints/
-dist/
Binary file modified hw2/images/q1.png
Binary file modified hw2/images/q2.png
Binary file removed hw2/images/q3.png
29 changes: 7 additions & 22 deletions hw2/p1/index.py
@@ -38,33 +38,18 @@ def preprocess_docs(src_file,
     fout.close()
     return [ attr for attr in attrs if attr in clean_attrs ]
 
-def doc_frequency(src_file, attrs):
-    """
-    file:
-        <tokens of attrs[0]> \t <tokens of attrs[1]> ..
-    result:
-    {
-        "term1": [ df[attrs[0]], df[attrs[1]], ..., df[attrs[len(attrs)]]],
-        "term2": [ df[attrs[0]], df[attrs[1]], ..., df[attrs[len(attrs)]]]
-    }
-    """
+def doc_frequency(src_file):
 
     fin = open(src_file)
 
-    attr_i = { attr: i for (i, attr) in enumerate(attrs) }
-
-    df_term = {}
+    df_term = Counter()
     n_docs = 0
 
     for line in fin:
         n_docs += 1
-        fields = line.strip().split('\t')
-        for attr in attrs:
-            terms = set(fields[attr_i[attr]].split())
-            for t in terms:
-                df_term[t] = df_term.get(t, [0] * len(attrs))
-                df_term[t][attr_i[attr]] += 1
+        terms = set(line.strip().split())
+        df_term.update(terms)
 
     fin.close()

@@ -84,7 +69,7 @@ def build_index(src_file, dst_file, attrs, readable = False):

     index = {}
 
-    (n_docs, df) = doc_frequency(src_file, attrs)
+    (n_docs, df) = doc_frequency(src_file)
 
     for (docId, doc) in enumerate(fin, start = 1):
         attr_values = doc.strip().split('\t')
@@ -97,7 +82,7 @@ def build_index(src_file, dst_file, attrs, readable = False):
         num = {} # given a term: tf * idf^2
         doc_2norm = 0
         for term in tf:
-            idf = log10(n_docs/sum(df[term]))
+            idf = log10(n_docs/df[term])
             num[term] = tf[term] * idf
             doc_2norm += num[term] ** 2
             num[term] *= idf
@@ -179,4 +164,4 @@ def process_query(q, index, k):
 for (score, docId) in res:
     print("# " + str(docId))
     print("score: " + str(score))
-    print(linecache.getline(RAW_TSV_FILE, docId + 1))
\ No newline at end of file
+    print(linecache.getline(RAW_TSV_FILE, docId + 1))
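For reference, a minimal standalone sketch of the document-frequency computation as it stands after this fix: a single `Counter` over the whole corpus instead of per-attribute lists. It assumes the preprocessed file holds one whitespace-tokenized document per line; the file name in the usage comment is illustrative.

```python
from collections import Counter
from math import log10

def doc_frequency(src_file):
    """Return (n_docs, df) where df[t] counts the documents containing term t."""
    df_term = Counter()
    n_docs = 0
    with open(src_file) as fin:
        for line in fin:
            n_docs += 1
            # set() so that a term counts at most once per document
            df_term.update(set(line.strip().split()))
    return n_docs, df_term

def idf(term, n_docs, df):
    """Inverse document frequency from a single per-corpus count (the fixed formula)."""
    return log10(n_docs / df[term])

# Illustrative usage (file name is just an example):
# n_docs, df = doc_frequency("preprocessed_announcements.tsv")
# weight = tf * idf(term, n_docs, df)
```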
4 changes: 2 additions & 2 deletions hw2/report.html
@@ -71,10 +71,10 @@ <h3 id="query-processing">Query Processing</h3>
 <p>The processing is done in a “doc-at-a-time” fashion. When a query is issued, a set of pointers to the postings lists of its terms is created to iterate over them. If the query contains terms that do not appear in the index, they are simply ignored. From the pool of pointers we take the one pointing to the minimum docId, compute the score of that document, and then advance the pointers accordingly.</p>
 <h3 id="conclusions-and-instructions-to-run">Conclusions and instructions to run</h3>
 <p>The system is written in <code>python3</code> and presented through a web interface served by a Flask server. To launch it, go into the <code>p1/</code> folder and execute <code>server.py</code>. After a few seconds a new tab will open in the web browser with the interface to the tool.</p>
-<p><img src="images/q1.png" alt="First" /> <img src="images/q2.png" /> <img src="images/q3.png" /></p>
+<p><img src="images/q1.png" alt="First" /> <img src="images/q2.png" /></p>
 <p>The decisions made work well overall. Some examples of queries and results are shown above.</p>
 <p>In the first query a well-known issue of cosine similarity is shown: shorter documents get higher rankings since normalization penalizes longer documents. Indeed, even though the first document contains only one of the query terms, it still achieves a higher score than a longer document containing both terms.</p>
-<p>In the third example the query issued is the first document retrieved. As shown, in this case the score achieved by the first document is much higher than in the average case, given that the query is the document itself. Moreover, the processing time also increases drastically.</p>
+<p>In the second example the query issued is the first document retrieved. As shown, in this case the score achieved by the first document is much higher than in the average case, given that the query is the document itself. Moreover, the processing time also increases drastically.</p>
 <h2 id="problem-2">Problem 2</h2>
 <p>For this problem the <code>preprocessed_announcements.tsv</code> file produced by the preprocessing phase of the previous problem is used.</p>
 <p>All the classes follow the <code>sklearn</code> interface:</p>
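To make the doc-at-a-time processing described in the excerpt above concrete, here is a small sketch. The postings format (lists of (docId, weight) pairs sorted by docId) and the heap used as the pointer pool are assumptions for illustration, not necessarily the exact structures used in index.py.

```python
import heapq

def doc_at_a_time(index, query_terms, k):
    """Score documents one at a time by merging the postings of the query terms.

    index maps term -> list of (docId, weight) pairs sorted by docId.
    Query terms missing from the index are simply ignored, as in the report.
    """
    postings = [index[t] for t in query_terms if t in index]
    # Pointer pool: (current docId, postings-list index, position within that list).
    heap = [(plist[0][0], i, 0) for i, plist in enumerate(postings) if plist]
    heapq.heapify(heap)

    scores = []
    while heap:
        doc_id = heap[0][0]
        score = 0.0
        # Advance every pointer currently sitting on the minimum docId.
        while heap and heap[0][0] == doc_id:
            _, i, pos = heapq.heappop(heap)
            score += postings[i][pos][1]
            if pos + 1 < len(postings[i]):
                heapq.heappush(heap, (postings[i][pos + 1][0], i, pos + 1))
        scores.append((score, doc_id))

    return heapq.nlargest(k, scores)

# Illustrative usage with a toy index:
# index = {"rome": [(1, 0.3), (4, 0.7)], "exam": [(1, 0.5), (2, 0.2)]}
# doc_at_a_time(index, ["rome", "exam", "missing"], k=2)
```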
6 changes: 3 additions & 3 deletions hw2/report.txt
@@ -83,7 +83,7 @@ by a Flask server. To launch it, go in the `p1/` folder and execute
 `server.py`. After a few seconds a new tab will open in the web browser
 with the interface to the tool.
 
-![First](images/q1.png)\ ![](images/q2.png)\ ![](images/q3.png)
+![First](images/q1.png)\ ![](images/q2.png)
 
 The decisions made work well overall. Some examples of
 queries and results are shown above.
@@ -94,7 +94,7 @@ documents. Indeed even if the first document
 contains only one of the query terms, it still achieves a higher score than
 a longer document containing both terms.
 
-In the third example the query issued is the first document retrieved.
+In the second example the query issued is the first document retrieved.
 As shown, in this case the score achieved by the first document is much
 higher than in the average case, given that the query is the document itself.
 Moreover, the processing time also increases drastically.
@@ -268,4 +268,4 @@ precision 0.973001588141874

 [Mining of Massive Datasets]: http://www.mmds.org/
 [Datasketch]: https://ekzhu.github.io/datasketch/index.html
-[find $r$ and $b$]: https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L22-L51
\ No newline at end of file
+[find $r$ and $b$]: https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L22-L51
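As a side note on the cosine-similarity length bias discussed in the report, a tiny numeric sketch (the terms and weights are made up for illustration, not taken from the actual index) of how a short document matching only one query term can outscore a longer document matching both:

```python
from math import sqrt

def norm(v):
    return sqrt(sum(w * w for w in v.values()))

def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    return dot / (norm(q) * norm(d))

query     = {"exam": 1.0, "date": 1.0}
short_doc = {"exam": 1.0}                                    # one matching term
long_doc  = {"exam": 1.0, "date": 1.0, "t1": 1.0, "t2": 1.0,
             "t3": 1.0, "t4": 1.0, "t5": 1.0}                # both terms, but longer

print(cosine(query, short_doc))  # 1 / sqrt(2)            ≈ 0.71
print(cosine(query, long_doc))   # 2 / (sqrt(2) * sqrt(7)) ≈ 0.53
```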
