Some years ago I read Masterminds of Programming: Conversations with the Creators of Major Programming Languages, an interesting book with interviews of the creators of some famous programming languages. It’s a great book, and I definitely recommend it to anyone who wants to take a peek inside the heads of the people who created some of the things we use every day when coding. Anyway, right now I want to talk about a remarkably useful thing that I discovered while reading this book: the Aho-Corasick algorithm.
Aho-Corasick is an algorithm for finding all occurrences of a finite set of keywords in a text. In other words: you have some arbitrary text and a set of keywords that you want to search for in it, and Aho-Corasick accomplishes this in an efficient way. The algorithm is described in the paper Efficient String Matching: An Aid to Bibliographic Search, and it’s surprisingly simple and elegant.
I’m not gonna go into all the implementation details, since the original paper is quite approachable and you can easily find it online, but the core idea is: the algorithm builds a deterministic finite automaton out of three functions. The `goto` function maps characters as transitions between different states; the `failure` function decides which state to fall back to after a failed match on the `goto` function; and the `output` function marks the states where one or more keywords have been found. The search itself iterates through every character in the text, “advancing” states as it finds a transition from the current one to another. In pseudo-code we have something like:
```
state = 0
index = 0
FOR c IN TEXT:
    WHILE goto(state, c) == NULL:
        state = failure(state)
    state = goto(state, c)    # goto(0, c) is never NULL, so the loop terminates
    IF output(state) != NULL:
        print(index)
        print(output(state))
    index += 1
```
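Translated into runnable Python (a sketch of mine, with the automaton tables hand-built as plain dicts for the paper’s classic he/she/his/hers example; the state numbering is my own):

```python
# Hand-built automaton for the keyword set {he, she, his, hers}.
# State 0 is the root; a missing entry at the root means "stay at 0",
# which is what goto(0, c) = 0 expresses in the paper.
GOTO = {
    0: {"h": 1, "s": 3},
    1: {"e": 2, "i": 6},
    2: {"r": 8},
    3: {"h": 4},
    4: {"e": 5},
    5: {}, 6: {"s": 7}, 7: {}, 8: {"s": 9}, 9: {},
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}

def search(text, goto, failure, output):
    """Yield (end_index, keyword) for every match, scanning text once."""
    state = 0
    for index, ch in enumerate(text):
        # Follow failure links until a transition exists;
        # the root absorbs any remaining mismatch.
        while state != 0 and ch not in goto[state]:
            state = failure[state]
        state = goto[state].get(ch, 0)
        for word in output.get(state, ()):
            yield index, word

print(sorted(search("ushers", GOTO, FAILURE, OUTPUT)))
# → [(3, 'he'), (3, 'she'), (5, 'hers')]
```

Note how “ushers” yields both “she” and “he” ending at the same position: that’s the failure/output machinery sharing work between overlapping keywords.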
The really interesting part of this algorithm is that the search for all the keywords is performed in parallel: we iterate over the searched text only once. In practice this gives us a running time of Θ(n + t), where n is the length of the searched text and t is the number of matches (the paper provides a detailed time complexity analysis).
I have used this algorithm on a number of occasions. Some of them, with the respective libraries, were:

- JS/Node (https://github.com/BrunoRB/ahocorasick): I needed to search for a list of keywords in the text of a chat application.
- PHP (https://github.com/wikimedia/AhoCorasick): I used it to check whether a massive list of email subjects contained some dynamic sets of keywords. PS: I think this lib had a bug with numbers-only keywords.
- Python (https://github.com/abusix/ahocorapy): We were crawling museum web pages and trying to categorize each one according to a big whitelist of categories. PS: Be careful with the Python 2 implementations, some of them break with unicode (this one doesn’t).