
Commit

Fixed small bug on document frequency
Angelogeb committed Nov 8, 2018
1 parent b742352 commit 0841105
Showing 7 changed files with 12 additions and 28 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -2,4 +2,3 @@
 *.stdout
 __pycache__/
 .ipynb_checkpoints/
-dist/
Binary file modified hw2/images/q1.png
Binary file modified hw2/images/q2.png
Binary file removed hw2/images/q3.png
29 changes: 7 additions & 22 deletions hw2/p1/index.py
@@ -38,33 +38,18 @@ def preprocess_docs(src_file,
     fout.close()
     return [ attr for attr in attrs if attr in clean_attrs ]
 
-def doc_frequency(src_file, attrs):
-    """
-    file:
-        <tokens of attrs[0]> \t <tokens of attrs[1]> ..
-    result:
-    {
-        "term1": [ df[attrs[0]], df[attrs[1]], ..., df[attrs[len(attrs)]]],
-        "term2": [ df[attrs[0]], df[attrs[1]], ..., df[attrs[len(attrs)]]]
-    }
-    """
+def doc_frequency(src_file):
 
     fin = open(src_file)
 
-    attr_i = { attr: i for (i, attr) in enumerate(attrs) }
-
-    df_term = {}
+    df_term = Counter()
     n_docs = 0
 
     for line in fin:
         n_docs += 1
-        fields = line.strip().split('\t')
-        for attr in attrs:
-            terms = set(fields[attr_i[attr]].split())
-            for t in terms:
-                df_term[t] = df_term.get(t, [0] * len(attrs))
-                df_term[t][attr_i[attr]] += 1
+        terms = set(line.strip().split())
+        df_term.update(terms)
 
     fin.close()

@@ -84,7 +69,7 @@ def build_index(src_file, dst_file, attrs, readable = False):

     index = {}
 
-    (n_docs, df) = doc_frequency(src_file, attrs)
+    (n_docs, df) = doc_frequency(src_file)
 
     for (docId, doc) in enumerate(fin, start = 1):
         attr_values = doc.strip().split('\t')
@@ -97,7 +82,7 @@ def build_index(src_file, dst_file, attrs, readable = False):
         num = {} # given a term: tf * idf^2
         doc_2norm = 0
         for term in tf:
-            idf = log10(n_docs/sum(df[term]))
+            idf = log10(n_docs/df[term])
             num[term] = tf[term] * idf
             doc_2norm += num[term] ** 2
             num[term] *= idf
@@ -179,4 +164,4 @@ def process_query(q, index, k):
 for (score, docId) in res:
     print("# " + str(docId))
     print("score: " + str(score))
-    print(linecache.getline(RAW_TSV_FILE, docId + 1))
\ No newline at end of file
+    print(linecache.getline(RAW_TSV_FILE, docId + 1))
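For reference, a minimal standalone sketch of the document-frequency computation as it stands after this fix: a single `Counter` over the whole corpus instead of per-attribute lists. It assumes the preprocessed file holds one whitespace-tokenized document per line; the file name in the usage comment is illustrative.

```python
from collections import Counter
from math import log10

def doc_frequency(src_file):
    """Return (n_docs, df) where df[t] counts the documents containing term t."""
    df_term = Counter()
    n_docs = 0
    with open(src_file) as fin:
        for line in fin:
            n_docs += 1
            # set() so that a term counts at most once per document
            df_term.update(set(line.strip().split()))
    return n_docs, df_term

def idf(term, n_docs, df):
    """Inverse document frequency from a single per-corpus count (the fixed formula)."""
    return log10(n_docs / df[term])

# Illustrative usage (file name is just an example):
# n_docs, df = doc_frequency("preprocessed_announcements.tsv")
# weight = tf * idf(term, n_docs, df)
```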
4 changes: 2 additions & 2 deletions hw2/report.html
@@ -71,10 +71,10 @@ <h3 id="query-processing">Query Processing</h3>
 <p>The processing is done in a “doc-at-a-time” fashion. When a query is issued, a set of pointers to the postings lists of its terms is created to iterate over them. If the query contains terms that do not appear in the index, they are simply ignored. From the pool of pointers we take the one pointing to the minimum docId, compute the score of that document, and then advance the pointers accordingly.</p>
 <h3 id="conclusions-and-instructions-to-run">Conclusions and instructions to run</h3>
 <p>The system is written in <code>python3</code> and presented through a web interface served by a Flask server. To launch it, go into the <code>p1/</code> folder and execute <code>server.py</code>. After a few seconds a new tab will open in the web browser with the interface to the tool.</p>
-<p><img src="images/q1.png" alt="First" /> <img src="images/q2.png" /> <img src="images/q3.png" /></p>
+<p><img src="images/q1.png" alt="First" /> <img src="images/q2.png" /></p>
 <p>The decisions made work well overall. Some examples of queries and results are shown above.</p>
 <p>In the first query a well-known issue of cosine similarity is shown: shorter documents get higher rankings since normalization penalizes longer documents. Indeed, even though the first document contains only one of the query terms, it still achieves a higher score than a longer document containing both terms.</p>
-<p>In the third example the query issued is the first document retrieved. As shown, in this case the score achieved by the first document is much higher than in the average case, given that the query is the document itself. Moreover, the processing time also increases drastically.</p>
+<p>In the second example the query issued is the first document retrieved. As shown, in this case the score achieved by the first document is much higher than in the average case, given that the query is the document itself. Moreover, the processing time also increases drastically.</p>
 <h2 id="problem-2">Problem 2</h2>
 <p>For this problem the <code>preprocessed_announcements.tsv</code> file produced by the preprocessing phase of the previous problem is used.</p>
 <p>All the classes follow the <code>sklearn</code> interface:</p>
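To make the doc-at-a-time processing described in the excerpt above concrete, here is a small sketch. The postings format (lists of (docId, weight) pairs sorted by docId) and the heap used as the pointer pool are assumptions for illustration, not necessarily the exact structures used in index.py.

```python
import heapq

def doc_at_a_time(index, query_terms, k):
    """Score documents one at a time by merging the postings of the query terms.

    index maps term -> list of (docId, weight) pairs sorted by docId.
    Query terms missing from the index are simply ignored, as in the report.
    """
    postings = [index[t] for t in query_terms if t in index]
    # Pointer pool: (current docId, postings-list index, position within that list).
    heap = [(plist[0][0], i, 0) for i, plist in enumerate(postings) if plist]
    heapq.heapify(heap)

    scores = []
    while heap:
        doc_id = heap[0][0]
        score = 0.0
        # Advance every pointer currently sitting on the minimum docId.
        while heap and heap[0][0] == doc_id:
            _, i, pos = heapq.heappop(heap)
            score += postings[i][pos][1]
            if pos + 1 < len(postings[i]):
                heapq.heappush(heap, (postings[i][pos + 1][0], i, pos + 1))
        scores.append((score, doc_id))

    return heapq.nlargest(k, scores)

# Illustrative usage with a toy index:
# index = {"rome": [(1, 0.3), (4, 0.7)], "exam": [(1, 0.5), (2, 0.2)]}
# doc_at_a_time(index, ["rome", "exam", "missing"], k=2)
```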
6 changes: 3 additions & 3 deletions hw2/report.txt
@@ -83,7 +83,7 @@ by a Flask server. To launch it, go in the `p1/` folder and execute
 `server.py`. After a few seconds a new tab will open in the web browser
 with the interface to the tool.
 
-![First](images/q1.png)\ ![](images/q2.png)\ ![](images/q3.png)
+![First](images/q1.png)\ ![](images/q2.png)
 
 The decisions made work well overall. Some examples of
 queries and results are shown above.
@@ -94,7 +94,7 @@ documents. Indeed even if the first document
 contains only one of the query terms, it still achieves a higher score than
 a longer document containing both terms.
 
-In the third example the query issued is the first document retrieved.
+In the second example the query issued is the first document retrieved.
 As shown, in this case the score achieved by the first document is much
 higher than in the average case, given that the query is the document itself.
 Moreover, the processing time also increases drastically.
@@ -268,4 +268,4 @@ precision 0.973001588141874

 [Mining of Massive Datasets]: http://www.mmds.org/
 [Datasketch]: https://ekzhu.github.io/datasketch/index.html
-[find $r$ and $b$]: https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L22-L51
\ No newline at end of file
+[find $r$ and $b$]: https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L22-L51
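As a side note on the cosine-similarity length bias discussed in the report, a tiny numeric sketch (the terms and weights are made up for illustration, not taken from the actual index) of how a short document matching only one query term can outscore a longer document matching both:

```python
from math import sqrt

def norm(v):
    return sqrt(sum(w * w for w in v.values()))

def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    return dot / (norm(q) * norm(d))

query     = {"exam": 1.0, "date": 1.0}
short_doc = {"exam": 1.0}                                    # one matching term
long_doc  = {"exam": 1.0, "date": 1.0, "t1": 1.0, "t2": 1.0,
             "t3": 1.0, "t4": 1.0, "t5": 1.0}                # both terms, but longer

print(cosine(query, short_doc))  # 1 / sqrt(2)            ≈ 0.71
print(cosine(query, long_doc))   # 2 / (sqrt(2) * sqrt(7)) ≈ 0.53
```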
