Find all common substrings

Problem Description

The way to find all the common parts among strings is as easy as building a suffix tree. Then each node would have recorded their occurrence, but it would be hard to decide how to list them.

Let's take an example to see it in detail. Let's say we want to have all the common substring have at least 2 occurrences. Now we have a very small two-word sample [xabcy, mabcn], each character here represents a capture substring group, after suffix trie building we would find suffix of duplicated suffix [abc, bc, ab, a, b, c] as from suffix trie.

So the problems come, how do we list the common substrings without duplication? i.e, only have abc listed?

Think of if we have another two word [bj, bk], then we should have the result of [abc, b], and the capture group does not overlap with each other, i.e the occurrences of abc are [0,1] and b is from [2,3].

Solution

Abstraction terminology

Node: Each circle in the suffix trie image is a node.
Branch: In the suffix trie, a branch is a group of nodes that starts from the root with one direction to the end.

Back to the problem of the duplicated captured groups, they existed in either the same branch, or in different branches.

Within a branch

The solution on the same branch is just using Deep First Search to filter the listed occurrence in the children.

To better illustrate that, say we have an input of [a, ab, abcy, abcn], the common substring with at least 2 occurrences is a and abc as we can see from the following branch image.

Across branches

Because by sorting on the same branch we only consider the common substrings extend from the trail, e.g. a, ab, abc, we also need to consider the substrings extend from the head, e.g, xa, ma, a.

So we add reference links between the suffix branches with these substrings, and thus construct a virtual trie with the existed node and new virtual root.

It shows in the following image with nodes linked by dashed arrows.

So that we adopt the same sorting method on this virtual trie again, use the Deep First Search to filter listed occurrences from the end of the branch. Then we remove the overlapped nodes. Use the origin input xabcy, mabcn as an example, by first sorting on the same branch, and then sorting on the virtual trie, we get the result that only substring abc should be listed.

So overall the algorithms include the following steps:

Building suffix trie and construct virtual trie with input strings.
Filter overlapped common substrings in the suffix trie.
Filter overlapped common substrings in the virtual trie.
listing the nodes which have an attribute listed: true

Implementation

Rust
Javascript

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

find-all-common-substrings.md

find-all-common-substrings.md

Find all common substrings

Problem Description

Solution

Abstraction terminology

Within a branch

Across branches

Implementation

Files

find-all-common-substrings.md

Latest commit

History

find-all-common-substrings.md

File metadata and controls

Find all common substrings

Problem Description

Solution

Abstraction terminology

Within a branch

Across branches

Implementation