- Michael Gibb[MikeGibb7] (110102732)
- Ronit Mahajan[TheSupremeXtream] (110036557)
- Charbel Nakhoul[nakhoulc] (110043150)
- John Ezetah[Vanvians] (10469910)
- Nico Belanger[Nico Belanger] (110138244)
LHDiff is a Python implementation of the algorithm described in “LHDiff: A Language-Independent Hybrid Approach for Code Differencing” by Asaduzzaman et al. (2013).
This tool detects line-level changes between two text files (source code or plain text) using a hybrid strategy that combines:
- Context Similarity (Cosine Similarity over surrounding lines)
- Content Similarity (Levenshtein Distance)
- Efficient Hashing (Simhash)
The result is a robust algorithm capable of identifying matches, modified lines, and even moved lines within files.
Uses tkinter to generate an interactive Graphical User Interface (GUI) in order to input files and access a function that will use simhash to calculate Hamming distance between the Simhashes and then add them to a list of candidates that will then be output on the GUI and the terminal.
Prevents mismatching identical but unrelated lines by analyzing the similarity of surrounding lines using cosine similarity.
Uses:
- Simhash → Generates candidates quickly
- Levenshtein Distance → Scores content similarity precisely
This reduces false matches while maintaining fast performance.
Detects when a line appears in both files but at different locations.
- Python 3.x
simhashlibrary
Install the required dependency:
pip install simhash Simply run the program and input the old and new files. It will be used to compare changes made between each file.
python LHdiff.py
This is the main project file that will trigger the GUI where you can enter the test files and it will display the highlights of changes made between the files (moves in blue, insertions in green, deletions in red), as well as the overall results in an organized format.
python test.py
This is a purer test file that will trigger the GUI, where you can enter the test files and after clicking the enter button, results will be printed in the terminal.
Terminal: COMMIT CLASSIFICATION: bug_intro Bug fixes: 0 Bug introductions: 18 Neutral: 0 Unknown: 0 Total mappings: 18 COMMIT CLASSIFICATION: bug_intro Bug fixes: 0 Bug introductions: 18 Neutral: 0 Unknown: 0 Total mappings: 18
GUI: Moved or swapped lines: old line 11 swapped with new line 15 -/+ } old line 13 swapped with new line 17 -/+ public static void main(String[] args) { old line 14 swapped with new line 18 -/+ double f = 98.6; old line 15 swapped with new line 19 -/+ double c = toCelsius(f); old line 17 swapped with new line 21 -/+ System.out.println("F to C: " + c); old line 19 swapped with new line 23 -/+ double c2 = 37.0; old line 20 swapped with new line 24 -/+ double f2 = toFahrenheit(c2); old line 22 swapped with new line 26 -/+ System.out.println("C to F: " + f2); old line 23 swapped with new line 29 -/+ } old line 24 swapped with new line 30 -/+ }
Inserted lines:
- 13: public static double toKelvin(double celsius) { // added method
- 14: return celsius + 273.15;
Deleted lines: N/A
Mappings: [(1, 1), (3, 3), (4, 4), (5, 5), (6, 6), (8, 8), (9, 9), (10, 10), (11, 15), (13, 17), (14, 18), (15, 19), (17, 21), (19, 23), (20, 24), (22, 26), (23, 29), (24, 30)]
(1, 1) → Line 1 in File A matches Line 1 in File B
(2, 3) → Line 2 in File A corresponds to Line 3 in File B (moved)
(3, 2) → Line 3 in File A corresponds to Line 2 in File B (moved)
(n, m) pairs show all detected matches, moves, and alignments
To check the time complexity and memory allocation you will need to install snakeviz
pip install snakeviz
pip install memory_profilerThen generate the output files using
python -m cProfile -o "optimization_results\output.prof" LHdiff.py
python -m memory_profiler LHdiff.py
Then run the file using
python -m snakeviz optimization_results\output.prof