# Algorithm Question Substring Search KMP

Datetime:2016-08-23 03:14:03         Topic: Algorithm          Share        Original >>

Fri, Aug 5, 2016

Category:algorithm Tags:[java] [string]

## Question

LeetCode has this question Implement strStr().

Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Tags: Two Pointers, String.

Let’s denote haystack length `N` , needle length `M` , character table size `R` (256 for extended ASCII).

Java’s `String` class method `indexOf()` uses brute force algorithm O(MN).

Examples:

```haystacsflksdjflkshhaystackneeneeneedle
needle
naslkfjskjlhhhh
needle```

## Method 1 2D DFA

KMP Knuth Morris Pratt O(N) time complexity.

2D array DFA[256][M] gives the next character index to match against i.

Example: needle “ABABAC”. DFA[r][j] where r < R, j < M.

```j    :  0 1 2 3 4 5 6
state:  0 0 1 2 3 0
r       A B A B A C
A       1 1 3 1 5 1
B       0 2 0 4 0 4
C       0 0 0 0 0 6
D       0 0 0 0 0 0```

Steps to build DFA:

1. copy mismatched cases (first column all 0).
2. set matched character to go to next state (first column DFA[needle.charAt(0)][0] = 1).
3. start state 0, update state for j in [1, M-1].
```DFA[needle.charAt(0)][0] = 1;
int state = 0;
for (int j = 1; j < M; j++) {
for (int r = 0; r < 256; r++) DFA[r][j] = DFA[r][state];
DFA[needle.charAt(j)][j] = j + 1;
state = DFA[needle.charAt(j)][state];
}```

Note DFA[\’D\‘][5] = 0 but DFA[‘B’][5] = 4.

```for (int i = 0, j = 0; i < N && j < M; i++) {
//no backup, so can increment i
j = DFA[haystack.charAt(i)][j];
}
if (j == M) return i - M;
else return -1;```

## Method 2: 1D Restart Table Array

restart[M] restart[j - 1] gives which character in needle to match against index i in haystack when a mismatch happens at index j of needle.

```A B A B A C
restart:  0 0 1 2 1 0```

Compare suffix needle[j, M-1] with needle, longest common prefix’s length.

```int state = 0;
for (int j = 1; j < M;) {
if (needle.charAt(j) == needle.charAt(state)) {
restart[j++] = ++state;
} else if (state == 0) j++;
else state = restart[state - 1];
}```

To compare, now has to back up, cannot operate on stream like standard input without buffering.

```for (int i = 0, j = 0; i < N - M && j < M;) {
if (needle.charAt(j) == haystack.charAt(j)) {
j++;
i++;
}
else if (j == 0) i++;
else j = restart[j - 1];
}
if (j == M) return i - j;
else return -1;```

Question: worst case operations 2N, practical 1.1 N. Why 2N?