What is the Rabin-Karp algorithm?
The Rabin-Karp is a substring-finding algorithm that uses the concept of rolling hashes. It increases the efficiency of finding a certain pattern substring in another substring, similar to the KMP algorithm (which also has an article).
What is a rolling hash?
It’s important to understand the bruteforce method of finding a pattern substring called Y inside another string named X. The algorithm will continue to compare the first letter in X and the first letter until it finds a match. Once it finds a match, it will start comparing the next value in X to the second value in Y. If at one point there is a mismatch, the algorithm goes back to checking the first letter of Y.
Refer to the pattern string as the string that is trying to be found in the text.
In a rolling hash, there will be a window that matches the size of the pattern string. This window will be passed through a deterministic hash function. A hash function applies a mathematical formula to some information, changing it. A deterministic hash function means that if the same input is passed through, it will always produce the same output. The hash should result in a number. Now in the algorithm, a window of size equal to the pattern string will go through the text, and only if the hash of the window is equal to the hash of the pattern will it check letter for letter.
Whenever there is an iteration, the window will move forward. It will remove the hash value provided from the value that was dropped and apply the new hash value to the item that was most recently added to the window. In the following demonstration, the hash function is quite weak, where each letter simply corresponds to its position in the alphabet.
Improving the Hash Function:
The ideal running time of the Rabin-Karp is O(m+n). But the time complexity is better represented as O(m*n) because of how poor the current hash function is. The point of the hash function is to only do in-depth checks when necessary. In this case, any permutation of the pattern will have the exact same hash. So instead of a hash function, where the position of each letter matters, Here is an example of a hash function where position matters. Each value in the array is multiplied by its 10^(index – 1) and summed at the end.
hash(2,5,4,6,7) = (2* 10^4) +(5* 10^3) + (4* 10^2) + (6* 10^1) + (7 * 10^0) = 25467
Whenever the window is moved, it removes the first value and adds a new value. The general formula goes as follows:
((Current Window Hash) – (first item in window)*10^(window size – 1)) * 10 + (new value)
Most of the equations are simple. There is an extra multiplication by 10 to simulate pushing all of the current values over by one to make room for the new value. Let’s continue the first example of 2, 5, 4, 6, 7, where 2 is being replaced with an 8.
hash(5,4,6,7,8) = ((hash(2,5,4,6,7) – 2*10^4)) * 10 + 8 = (25467 – 2000) * 10 + 8 = 5467 * 10 + 8 = 54678 (new hash value)
However, after solving the position issue, there is now an integer overflow issue. If the pattern is 1000 letters long, the values will hold a value greater than or equal to 10^999, which is way beyond what is possible. The hash result can be modulo’d to get a smaller value. Most of the time, doing the modulo of a prime number is the best, as it has the most potential different outputs.
hash(2,5,4,6,7) = (25467) mod 51 = 18
The mod using a prime number means that the output can be anywhere from 0 to n-1. The bigger the prime number, the less likely there are to be false positives when checking the hashes. However, this still doesn’t fix the overflow issue. Now 10^999 mod 51 still needs to be calculated. Using modular congruence, the following is true:
(A * B) mod C = ((A mod C) * (B mod C)) mod C
This means that 10^999 mod 51 can be rewritten as follows:
10^3 mod 51 = ((10 mod 51) * (10 mod 51) * (10 mod 51)) mod 51
This means we can completely avoid big numbers, so patterns of any size can now be accepted. Refer to the final example where 2, 5, 4, 6, and 7 are replacing 2 with an 8:
hash(5,4,6,7,8) ((hash(2,5,4,6,7) – 2*10^4 mod 51) * 10 mod 51 + 8) mod 51 = ((hash(2,5,4,6,7)) – 2((10 mod 51) * (10 mod 51) * (10 mod 51) * (10 mod 51) mod 51) * 10 mod 51 +8) mod 51