An implementation of the SimHash technique for estimating the number of duplicate pages in a given dataset. University project for Information Retrieval (Spring 2015).
Final report can be found here in Greek.
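For readers unfamiliar with SimHash, here is a minimal sketch of the general idea. It is not the code from `DataHasher.java`. Each token of a page is hashed to 64 bits, every bit position collects a +1/-1 vote, and the signs of the votes form the fingerprint. Pages whose fingerprints differ in only a few bits are treated as near-duplicates. The class name `SimHashSketch`, the whitespace tokenizer, the MD5-based token hash and the `maxDistance` parameter are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Illustrative sketch only; names and choices here are hypothetical.
final class SimHashSketch {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Hash one token to 64 bits using the first 8 bytes of its MD5 digest.
    // (Assumption: any well-mixed 64-bit hash would serve the same purpose.)
    static long hash64(String token) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(token.getBytes(UTF8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFFL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest is not available", e);
        }
    }

    // Build a 64-bit fingerprint: each token votes +1 or -1 on every bit position.
    static long fingerprint(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.length() == 0) {
                continue; // skip the empty token produced by leading whitespace
            }
            long h = hash64(token);
            for (int bit = 0; bit < 64; bit++) {
                if (((h >>> bit) & 1L) == 1L) {
                    votes[bit]++;
                } else {
                    votes[bit]--;
                }
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fp |= (1L << bit); // a positive vote total sets the bit
            }
        }
        return fp;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Two texts count as near-duplicates if their fingerprints are close enough.
    static boolean nearDuplicate(String a, String b, int maxDistance) {
        return hammingDistance(fingerprint(a), fingerprint(b)) <= maxDistance;
    }
}
```

Calling `SimHashSketch.nearDuplicate(pageA, pageB, k)` for a small `k` flags pages whose fingerprints differ in at most `k` bits. A suitable `k` depends on the dataset, so none is suggested here.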
- Matlab 2012b+
- Matlab: Statistics and Machine Learning Toolbox
- Java 1.6 (Matlab 2012b needs that version)
The main program is `proj.m`.
- In `DataHasher.java`, on lines 45 and 48, insert the path to your Desktop.
- Compile with `javac -source 1.6 -target 1.6 DataHasher.java`.
- In the Matlab workspace, run `which classpath.txt` and add the path to the directory of `DataHasher.class` to that file.
- Run `proj.m` and choose whether the input comes from a .csv file or from an online source.
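The first step above asks for a Desktop path to be typed into the source by hand. As a possible alternative, which `DataHasher.java` does not currently use, the path could be derived from the `user.home` system property. The class name `DesktopPathSketch`, the helper `onDesktop` and the file name `hashes.csv` are hypothetical. The sketch also assumes the desktop folder is literally named `Desktop`, which may not hold on localized or redirected setups.

```java
import java.io.File;

// Illustrative sketch only; not part of the project's sources.
final class DesktopPathSketch {

    // Returns <user.home>/Desktop/<fileName> using the platform's separator.
    // Assumption: the desktop folder is named "Desktop" under the home directory.
    static String onDesktop(String fileName) {
        File desktop = new File(System.getProperty("user.home"), "Desktop");
        return new File(desktop, fileName).getPath();
    }

    public static void main(String[] args) {
        // "hashes.csv" is a hypothetical file name used only for this demo.
        System.out.println(onDesktop("hashes.csv"));
    }
}
```

This builds the path with `java.io.File` only, so it stays within what Java 1.6 offers.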