Sitemap

KMP String matching pattern

10 min readApr 20, 2025

Note: If you understand above transition then rest is easy.

Chapter#1: Naive String Searching

Imagine you have a long text string (txt) and a shorter pattern string (pat). You want to find all occurrences of pat within txt.

The naive approach is simple:

  1. Align pat with the beginning of txt.
  2. Compare characters one by one from left to right.
  3. If all characters of pat match the corresponding characters in txt, you’ve found an occurrence.
  4. If a mismatch occurs, shift pat one position to the right in txt and repeat step 2.

Example of Naive Search:

txt = “ABABDABACDABABCABAB”
pat = “ABABC”

txt: ABABDABACDABABCABAB
pat: ABABC (Mismatch at index 4: D != C)

txt: ABABDABACDABABCABAB
pat: ABABC (Match found at index 10)

The Inefficiency:

Notice that when a mismatch occurs (like D != C in the first step), the naive algorithm shifts the pattern by just one position. Often, we already know from the characters that did match that shifting by just one is pointless. For example, after matching “ABAB”, we know the text starts with “ABAB”. Shifting the pattern “ABABC” by one (ABABC) means we’ll compare ‘B’ with ‘A’, which we already know won’t work based on the pattern itself. KMP cleverly avoids these redundant comparisons.

Time complexity: O(m*n)

Chapter#2: The KMP Idea: Smarter Shifting

KMP preprocesses the pattern (pat) to understand its internal structure, specifically looking for repeating prefixes that are also suffixes. This information allows us to shift the pattern more intelligently upon a mismatch.

Key Concept: The Longest Proper Prefix Suffix (LPS) Array

The core of KMP is the LPS array. For a pattern pat of length M, the LPS array lps is also of length M.

  • lps[i] stores the length of the longest proper prefix of pat[0…i] (the substring of pat from index 0 to i) which is also a suffix of pat[0…i].
  • Proper Prefix: A prefix of a string that is not the entire string itself.
  • Suffix: A suffix of a string.

Q. Why proper prefix, why not simple prefix?

A simple prefix when comparing full word will always match with suffix with full word. This doesn’t help to shift the block.

Example: Computing LPS for pat = “ABABCABAB”

So, for pat = "ABABCABAB", the lps array is [0, 0, 1, 2, 0, 1, 2, 3, 4].

Chapter#3: Using the LPS Array for Searching

Let M be the length of txt and N be the length of pat. We use two pointers:

  • i: index for txt (starts at 0)
  • j: index for pat (starts at 0)

The algorithm proceeds as follows:

  1. While i < M:
  • If pat[j] == txt[i]: Characters match. Increment both i and j.
  • If j == N: We have matched the entire pattern! A match is found starting at index i — j in the text. Now, we need to potentially find more matches. We can’t just reset j to 0. We use the LPS array: j = lps[j-1]. This effectively shifts the pattern based on the longest prefix that’s also a suffix of the entire pattern found.
  • If pat[j] != txt[i]: A mismatch occurred after matching j characters.

A. If j != 0: This means some characters matched before the mismatch (pat[0…j-1] matched txt[i-j…i-1]). We don’t need to shift by just one. We consult lps[j-1]. This value tells us the length of the longest proper prefix of the matched part (pat[0…j-1]) that is also a suffix. We can shift the pattern so that this prefix aligns with the corresponding suffix in the text, and continue the comparison from there. So, we set j = lps[j-1]. Crucially, we do not increment i in this case. We are reconsidering txt[i] with a new character in the pattern.

B. If j == 0: The mismatch occurred on the very first character of the pattern. We can’t backtrack in the pattern. We simply move to the next character in the text by incrementing i.

Following is code for reference.

// Function to perform KMP search
void KMPSearch(const std::string& txt, const std::string& pat) {
int m= txt.size();
int n= pat.size();

// Create lps array
std::vector<int> lps = computeLPSArray(pat);

int i = 0; // index for txt
int j = 0; // index for pat
bool found = false;
while (i < m) {
if (pat[j] == txt[i]) {
i++;
j++;
}
if (j == n) {
// Found pattern at index i - j
std::cout << "Found pattern at index " << i - j << std::endl;
found = true;
// Shift pattern using LPS value to find subsequent matches
j = lps[j - 1];
} else if (i < n&& pat[j] != txt[i]) {
// Mismatch after j matches
// Do not match lps[0..lps[j-1]] characters, they will match anyway
if (j != 0) {
j = lps[j - 1]; // Use LPS array to shift pattern intelligently
} else {
i = i + 1; // Mismatch at the first character, move text pointer
}
}
}
if (!found) {
std::cout << "Pattern not found." << std::endl;
}
}

Chapter #4: Compute LPS array efficiently

We can use a clever dynamic programming approach comparing the pattern against itself.

Let’s first establish the reoccurrence relationship.


lps[i]: Length of longest proper prefix of pat[0…i] that is also suffix.
f(i,lps_max): Find lps for pat[0..i] with lps_max is previous lps (candidate for extension)

Rules to update lps:
lps[i] = 1 + prev_max_lps // if pat[i] == pat[lps_max]
lps[i] = lps[prev_max_lps - 1] // if lps_max != 0 otherwise 0

Transition rule:
f(i,prev_max_lps) -> f(i+1,prev_max_lps+1) // if pat[i] == pat[lps_max]
f(i,prev_max_lps) -> f(i,lps[prev_max_lps-1]) // if lps_max != 0
f(i,prev_max_lps) -> f(i+1,0) // if lps_max == 0

Following is recursive code to implement it.

    vector<int> lps;
void getLps(string pat, int i, int prev_max_lps) {
int n = pat.size();
// base case, end of string
if (i == n) {
return;
}

if (pat[i] == pat[prev_max_lps]) {
lps[i] = prev_max_lps + 1; // capture here
getLps(pat, i + 1, prev_max_lps + 1); // Transition to next
} else {
if (prev_max_lps != 0) {
getLps(pat, i, lps[prev_max_lps - 1]); // Backtrack
} else {
lps[i] = prev_max_lps; // capture here
getLps(pat, i+1, prev_max_lps);
}
}
}

Let’s now write bottom up dynamic programming code to implement this.

// Function to compute the LPS array for a given pattern
void computeLPSArray(const std::string& pat, int M, std::vector<int>& lps) {
int N = pat.size();
lps.assign(N, 0); // Initialize lps array with 0s
int prev_max_lps = 0; // Length of the previous longest prefix suffix
int i = 1;
lps[0] = 0; // lps[0] is always 0

// Calculate lps[i] for i = 1 to N-1
while (i < N) {
if (pat[i] == pat[prev_max_lps]) { // Remember prev_max_lps is length, so at current position it match with index
// If characters match, extend the previous longest prefix suffix
lps[i] = ++prev_max_lps;
i++;
} else {
// If characters don't match
if (prev_max_lps != 0) {
// Backtrack: try a shorter prefix suffix
// This is the tricky part: we don't increment i here.
// We use the lps value of the *previous* character's
// longest prefix suffix length to find the next candidate length.
prev_max_lps = lps[prev_max_lps - 1];
} else {
// If length is 0, no prefix suffix found ending at pat[i]
lps[i] = 0;
i++;
}
}
}
}

The prev_max_lps variable: This variable holds the key state. At the beginning of iteration i, it stores the value of lps[i-1]. It represents the length of the longest proper prefix of pat[0…i-1] that is also a suffix.

Match Case (pat[i] == pat[prev_max_lps]): If the current character pat[i] matches the character just after the prefix (pat[prev_max_lps]), it means we can extend the previous prefix-suffix by one character. So, the new length is prev_max_lps + 1. This directly uses the result lps[i-1] (stored in prev_max_lps) to compute lps[i].

Mismatch Case (pat[i] != pat[prev_max_lps]): This is where the backtracking happens. We know the prefix pat[0…prev_max_lps-1] doesn’t work when extended with pat[prev_max_lps] (because pat[prev_max_lps] != pat[i]). We need to find the next shortest prefix of pat[0…i-1] that is also a suffix.

  • How do we find this next shortest prefix? The substring pat[0…prev_max_lps-1] is the prefix-suffix we were just considering (of length prev_max_lps = lps[i-1]). The length of its longest proper prefix that is also a suffix is, by definition, lps[prev_max_lps-1].
  • So, by setting prev_max_lps = lps[prev_max_lps-1], we are effectively saying: “Okay, the prefix of length lps[i-1] didn’t work. Let’s try the prefix whose length is lps[lps[i-1]-1] and see if that prefix’s next character matches pat[i].”
  • We repeat this backtracking using the already computed lps values until we either find a match or length becomes 0.

Why not increment i during backtracking? Because we haven’t successfully determined lps[i] yet. We are still trying to find the correct prefix length for the current character pat[i] by exploring shorter prefixes based on previous lps values.

This makes overall time complexiy of KMP algorithm O(N)

Chapter#5: Leetcode problems

Now let’s solve few leetcode problems.

28. Find the Index of the First Occurrence in a String

https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/description/

Given two strings needle and haystack, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Example 1:

Input: haystack = "sadbutsad", needle = "sad"
Output: 0
Explanation: "sad" occurs at index 0 and 6.
The first occurrence is at index 0, so we return 0.

Example 2:

Input: haystack = "leetcode", needle = "leeto"
Output: -1
Explanation: "leeto" did not occur in "leetcode", so we return -1.

Constraints:

  • 1 <= haystack.length, needle.length <= 104
  • haystack and needle consist of only lowercase English characters.

Approach: Using kmp

We can write kmp based approach as below.

class Solution {
private:
vector<int> getLps(string pat) {
int n = pat.size();
vector<int> lps(n);
lps[0] = 0; // Base case
int i = 1, prev_max_lps = 0;
while (i < n) {
if (pat[i] == pat[prev_max_lps]) {
lps[i] = ++prev_max_lps;
i++;
} else {
if (prev_max_lps != 0) {
// backtrack to try shorter prefix suffix
prev_max_lps = lps[prev_max_lps - 1];
} else {
lps[i] = 0;
i++;
}
}
}
return lps;
}

public:
int strStr(string haystack, string needle) {
int m = haystack.size(), n = needle.size();
vector<int> lps = getLps(needle);
int i = 0, j = 0;
while (i < m) {
if (haystack[i] == needle[j]) {
i++, j++;
}
// if found a match
if (j == n) {
return i - j;
}
if (i < m && haystack[i] != needle[j]) {
// Need to perform smart shifting
if (j != 0) {
j = lps[j - 1]; // i stays same
} else {
i++;
}
}
}
return -1;
}
};

1392. Longest Happy Prefix

https://leetcode.com/problems/longest-happy-prefix/description/

A string is called a happy prefix if is a non-empty prefix which is also a suffix (excluding itself).

Given a string s, return the longest happy prefix of s. Return an empty string "" if no such prefix exists.

Example 1:

Input: s = "level"
Output: "l"
Explanation: s contains 4 prefix excluding itself ("l", "le", "lev", "leve"), and suffix ("l", "el", "vel", "evel"). The largest prefix which is also suffix is given by "l".

Approach# Use KMP Algorithm to find out lps

We can use KMP algorithm to calculate the lps for given input string. lps[n-1] will give the longest happy prefix for given string s .

Following is code for reference.

class Solution {
public:
string longestPrefix(string s) {
int n = s.size();
vector<int> lps(n, 0);
lps[0] = 0; // base case
int i = 1, prev_max_lps = 0;
while (i < n) {
if (s[i] == s[prev_max_lps]) {
lps[i] = ++prev_max_lps;
i++;
} else {
if (prev_max_lps != 0) {
// backtrack to try smaller lps
prev_max_lps = lps[prev_max_lps - 1];
} else {
i++;
}
}
}
// Goal is to get the happy prefix i.e. lps for entire string.
return s.substr(0, prev_max_lps);
}
};

214. Shortest Palindrome

https://leetcode.com/problems/shortest-palindrome/description/

You are given a string s. You can convert s to a palindrome by adding characters in front of it.

Return the shortest palindrome you can find by performing this transformation.

Example 1:

Input: s = "aacecaaa"
Output: "aaacecaaa"

Example 2:

Input: s = "abcd"
Output: "dcbabcd"

Approach# Use KMP to find out lps

  • The value lps[temp.length() — 1] gives the length of the longest prefix of s which is also a suffix of reverse(s).
  • This corresponds to the longest palindromic prefix of the original string s.
  • The characters you need to add to the beginning of s are the reverse of the suffix of s that doesn’t form this palindrome.

Following is code for reference.

class Solution {
public:
string shortestPalindrome(string s) {
string rev = s;
reverse(rev.begin(), rev.end());
string text = s + "#" + rev;

int n = text.size();
vector<int> lps(n, 0);
lps[0] = 0; // base case
int i = 1, prev_max_lps = 0;
while (i < n) {
if (text[i] == text[prev_max_lps]) {
lps[i] = ++prev_max_lps;
i++;
} else {
if (prev_max_lps != 0) {
prev_max_lps =
lps[prev_max_lps - 1]; // bactrack to try smaller lps
} else {
i++;
}
}
}
// Get the lps value for entire string
int lps_size = lps[n - 1];
string append = rev.substr(0, s.size() - lps_size);
return append + s;
}
};

This post is based on interaction with https://aistudio.corp.google.com.

Enjoy learning !!!

--

--

Dilip Kumar
Dilip Kumar

Written by Dilip Kumar

With 18+ years of experience as a software engineer. Enjoy teaching, writing, leading team. Last 4+ years, working at Google as a backend Software Engineer.

No responses yet