Added Suffix Tree implementation using Ukkonen algorithm #524
Open
CarolLuca wants to merge 35 commits into codezonediitj:main from CarolLuca:main
Changes from 16 commits
Commits
a9a5ecd Added Z-function implementation (CarolLuca)
3943502 Fixed error in testing Z-function algorithm (CarolLuca)
4c30d0a Added two arguments to z_function (CarolLuca)
2baa77a Small string mistake fixed (CarolLuca)
fdec7c9 Instance of ODA wrong initialized fixed (CarolLuca)
7f584a8 Reorganized the algorithm's structure (CarolLuca)
b5887c0 Added missing newline character (CarolLuca)
a241ff1 Corrected error in test_algo.py (CarolLuca)
48366ba Treated the null tests (CarolLuca)
767b7e7 Deleted trailing white spaces (CarolLuca)
55a7ae2 Fixed L206 and L231 (CarolLuca)
774b402 Suffix tree class using Ukkonen algo (CarolLuca)
b331389 Merge branch 'codezonediitj:main' into main (CarolLuca)
a1bef9a MMerge https://github.com/CarolLuca/pydatastructs (CarolLuca)
b1bc9a8 Updated the suffix tree imports (CarolLuca)
8659f84 Solved import issue (CarolLuca)
c0309f8 Solved reported issues + preferences (CarolLuca)
67313c3 Made __new__ method work (CarolLuca)
b3bf2de Updated asserts and coding style (CarolLuca)
0ad5483 Redistributed the auxiliar classes and improved test code (CarolLuca)
719a095 Fixed typo (CarolLuca)
4e1247d Added test for long string (CarolLuca)
dbfed79 Changed test file location (CarolLuca)
9349742 Fixed test code for Linux/MacOS (CarolLuca)
466b3ef Switched to a common encoding for all platforms (CarolLuca)
9622c6d Added tests for auxiliar classes (CarolLuca)
9af2a5d Fixed coding style preferences (CarolLuca)
d3a8a04 Added more tests (CarolLuca)
f0b3d35 Modified requested changes (CarolLuca)
2b8770f Minor modifications regarding __new__ method (CarolLuca)
75a12d4 Try again with __init__ method (CarolLuca)
65f87ce Coding style (CarolLuca)
cf67130 Minor flaw in testing (CarolLuca)
77af09a Eliminated __init__ method (CarolLuca)
b8c6b45 Added the last part of the documentation (CarolLuca)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,179 @@ | ||
from pydatastructs.utils.misc_util import ( | ||
Backend, raise_if_backend_is_not_python) | ||
|
||
__all__ = [ | ||
'SuffixTree' | ||
] | ||
|
||
|
||
# Ukkonen's algorithm gives an O(n) + O(k) construction time for a suffix tree, | ||
# where n is the length of the string and k is the size of its alphabet. | ||
# It is an online algorithm: it processes the input sequentially | ||
# and maintains a valid suffix tree after each character. | ||
|
||
class Node(object): | ||
def __init__(self): | ||
self.suffix_node = -1 | ||
|
||
def __repr__(self): | ||
return "Node(suffix link: %d)" % self.suffix_node | ||
|
||
|
||
class Edge(object): | ||
def __init__(self, first_char_index, last_char_index, source_node_index, dest_node_index): | ||
self.first_char_index = first_char_index | ||
self.last_char_index = last_char_index | ||
self.source_node_index = source_node_index | ||
self.dest_node_index = dest_node_index | ||
|
||
@property | ||
def length(self): | ||
return self.last_char_index - self.first_char_index | ||
|
||
def __repr__(self): | ||
return 'Edge(%d, %d, %d, %d)' % (self.source_node_index, self.dest_node_index, self.first_char_index, self.last_char_index) | ||
|
||
|
||
class Suffix(object): | ||
|
||
def __init__(self, source_node_index, first_char_index, last_char_index): | ||
|
||
self.source_node_index = source_node_index | ||
self.first_char_index = first_char_index | ||
self.last_char_index = last_char_index | ||
|
||
@property | ||
def length(self): | ||
return self.last_char_index - self.first_char_index | ||
|
||
def explicit(self): | ||
"""A suffix is explicit if it ends on a node. first_char_index | ||
is set greater than last_char_index to indicate this. | ||
""" | ||
return self.first_char_index > self.last_char_index | ||
|
||
def implicit(self): | ||
return self.last_char_index >= self.first_char_index | ||
|
||
|
||
class SuffixTree(object): | ||
"""A suffix tree for string matching. Uses Ukkonen's algorithm | ||
|
||
for construction. | ||
""" | ||
|
||
def __init__(self, string, case_insensitive=False): | ||
|
||
self.case_insensitive = case_insensitive | ||
# Lowercase before measuring: str.lower() can change the length of some Unicode strings. | ||
self.string = string.lower() if case_insensitive else string | ||
self.N = len(self.string) - 1 | ||
self.nodes = [Node()] | ||
self.edges = {} | ||
self.active = Suffix(0, 0, -1) | ||
for i in range(len(self.string)): | ||
self._add_prefix(i) | ||
|
||
def __repr__(self): | ||
|
||
curr_index = self.N | ||
s = "\tStart \tEnd \tSuf \tFirst \tLast \tString\n" | ||
values = list(self.edges.values()) | ||
values.sort(key=lambda x: x.source_node_index) | ||
for edge in values: | ||
if edge.source_node_index == -1: | ||
continue | ||
s += "\t%s \t%s \t%s \t%s \t%s \t" % (edge.source_node_index, edge.dest_node_index, | ||
self.nodes[edge.dest_node_index].suffix_node, edge.first_char_index, edge.last_char_index) | ||
|
||
top = min(curr_index, edge.last_char_index) | ||
s += self.string[edge.first_char_index:top + 1] + "\n" | ||
return s | ||
|
||
def _add_prefix(self, last_char_index): | ||
|
||
last_parent_node = -1 | ||
while True: | ||
parent_node = self.active.source_node_index | ||
if self.active.explicit(): | ||
if (self.active.source_node_index, self.string[last_char_index]) in self.edges: | ||
# prefix is already in tree | ||
break | ||
else: | ||
e = self.edges[self.active.source_node_index, | ||
self.string[self.active.first_char_index]] | ||
if self.string[e.first_char_index + self.active.length + 1] == self.string[last_char_index]: | ||
# prefix is already in tree | ||
break | ||
parent_node = self._split_edge(e, self.active) | ||
|
||
self.nodes.append(Node()) | ||
e = Edge(last_char_index, self.N, parent_node, len(self.nodes) - 1) | ||
self._insert_edge(e) | ||
|
||
if last_parent_node > 0: | ||
self.nodes[last_parent_node].suffix_node = parent_node | ||
last_parent_node = parent_node | ||
|
||
if self.active.source_node_index == 0: | ||
self.active.first_char_index += 1 | ||
else: | ||
self.active.source_node_index = self.nodes[self.active.source_node_index].suffix_node | ||
self._canonize_suffix(self.active) | ||
if last_parent_node > 0: | ||
self.nodes[last_parent_node].suffix_node = parent_node | ||
self.active.last_char_index += 1 | ||
self._canonize_suffix(self.active) | ||
|
||
def _insert_edge(self, edge): | ||
self.edges[(edge.source_node_index, | ||
self.string[edge.first_char_index])] = edge | ||
|
||
def _remove_edge(self, edge): | ||
self.edges.pop( | ||
(edge.source_node_index, self.string[edge.first_char_index])) | ||
|
||
def _split_edge(self, edge, suffix): | ||
self.nodes.append(Node()) | ||
e = Edge(edge.first_char_index, edge.first_char_index + suffix.length, suffix.source_node_index, | ||
len(self.nodes) - 1) | ||
self._remove_edge(edge) | ||
self._insert_edge(e) | ||
# need to add node for each edge | ||
self.nodes[e.dest_node_index].suffix_node = suffix.source_node_index | ||
edge.first_char_index += suffix.length + 1 | ||
edge.source_node_index = e.dest_node_index | ||
self._insert_edge(edge) | ||
return e.dest_node_index | ||
|
||
def _canonize_suffix(self, suffix): | ||
|
||
if not suffix.explicit(): | ||
e = self.edges[suffix.source_node_index, | ||
self.string[suffix.first_char_index]] | ||
if e.length <= suffix.length: | ||
suffix.first_char_index += e.length + 1 | ||
suffix.source_node_index = e.dest_node_index | ||
self._canonize_suffix(suffix) | ||
|
||
# Public methods | ||
def find_substring(self, substring): | ||
|
||
|
||
if not substring: | ||
return -1 | ||
if self.case_insensitive: | ||
substring = substring.lower() | ||
curr_node = 0 | ||
i = 0 | ||
while i < len(substring): | ||
edge = self.edges.get((curr_node, substring[i])) | ||
if not edge: | ||
return -1 | ||
ln = min(edge.length + 1, len(substring) - i) | ||
if substring[i:i + ln] != self.string[edge.first_char_index:edge.first_char_index + ln]: | ||
return -1 | ||
i += edge.length + 1 | ||
curr_node = edge.dest_node_index | ||
return edge.first_char_index - len(substring) + ln | ||
|
||
def has_substring(self, substring): | ||
|
||
return self.find_substring(substring) != -1 |
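For reviewers who want a small reference to compare against, here is a minimal sketch of the same public interface (`find_substring`, `has_substring`, `case_insensitive`) built on a naive suffix trie. It is not Ukkonen's algorithm and not the code in this PR: it takes O(n^2) time and space, and the names `NaiveSuffixTrie` and `_TrieNode` are hypothetical. It assumes `find_substring` should return the leftmost occurrence; the PR's tests only pin this down for `"aaa"`.

```python
class _TrieNode:
    """One trie node; `first_start` is the earliest suffix start reaching it."""
    __slots__ = ("children", "first_start")

    def __init__(self, first_start):
        self.children = {}
        self.first_start = first_start


class NaiveSuffixTrie:
    """Hypothetical O(n^2) reference structure (not Ukkonen's algorithm)."""

    def __init__(self, string, case_insensitive=False):
        self.case_insensitive = case_insensitive
        self.string = string.lower() if case_insensitive else string
        self.root = _TrieNode(-1)
        # Insert every suffix character by character. Suffixes are added in
        # increasing start order, so an existing node keeps its earliest start.
        for start in range(len(self.string)):
            node = self.root
            for ch in self.string[start:]:
                if ch not in node.children:
                    node.children[ch] = _TrieNode(start)
                node = node.children[ch]

    def find_substring(self, substring):
        if not substring:
            return -1  # same convention as the tests in this PR
        if self.case_insensitive:
            substring = substring.lower()
        node = self.root
        for ch in substring:
            node = node.children.get(ch)
            if node is None:
                return -1
        return node.first_start

    def has_substring(self, substring):
        return self.find_substring(substring) != -1
```

For example, `NaiveSuffixTrie("banana").find_substring("nan")` returns 2. Because the trie stores every suffix explicitly, it is only practical for short strings, which is the cost the suffix tree's compressed edges avoid.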
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
import unittest | ||
from pydatastructs.strings.suffix_tree import SuffixTree | ||
|
||
|
||
class SuffixTreeTest(unittest.TestCase): | ||
|
||
"""Some functional tests. | ||
""" | ||
|
||
def test_empty_string(self): | ||
st = SuffixTree('') | ||
self.assertEqual(st.find_substring('not there'), -1) | ||
self.assertEqual(st.find_substring(''), -1) | ||
self.assertFalse(st.has_substring('not there')) | ||
self.assertFalse(st.has_substring('')) | ||
|
||
def test_repeated_string(self): | ||
st = SuffixTree("aaa") | ||
self.assertEqual(st.find_substring('a'), 0) | ||
self.assertEqual(st.find_substring('aa'), 0) | ||
self.assertEqual(st.find_substring('aaa'), 0) | ||
self.assertEqual(st.find_substring('b'), -1) | ||
self.assertTrue(st.has_substring('a')) | ||
self.assertTrue(st.has_substring('aa')) | ||
self.assertTrue(st.has_substring('aaa')) | ||
|
||
self.assertFalse(st.has_substring('aaaa')) | ||
self.assertFalse(st.has_substring('b')) | ||
# case sensitive by default | ||
self.assertFalse(st.has_substring('A')) | ||
|
||
if __name__ == '__main__': | ||
unittest.main() |