Introduction
String searching is a common problem in software development. Finding a single keyword is straightforward, but what if you need to search for hundreds of keywords simultaneously? Iterating through the text for each keyword would be inefficient.
The Aho-Corasick algorithm elegantly solves this problem. Developed by Alfred V. Aho and Margaret J. Corasick in 1975, this algorithm can efficiently search for multiple patterns at once.
ACOR is a Go library that implements Aho-Corasick with Redis as the backend storage. In this post, we’ll introduce ACOR, cover the basics of the Aho-Corasick algorithm, and walk through its usage.
Aho-Corasick Algorithm Overview
Basic Principles
The Aho-Corasick algorithm operates in two phases:
- Trie Construction: Build a trie data structure from the keywords to search
- Failure Function Construction: Pre-calculate which state to transition to when a match fails
During search, the input text is traversed only once while finding all keyword matches. The time complexity is O(n + m + z), where n is the text length, m is the total length of all keywords, and z is the number of matches.
Trie Structure
A trie is a tree structure where each node represents a single character. When registering keywords “he”, “his”, and “she”, the following trie is constructed:
Failure Function
The failure function defines which state to return to when matching fails at the current state. For example, if searching for “his” and a different character appears after ’s’, the algorithm falls back to the suffix “s” or the empty string state.
This allows finding all matches without backtracking through the text.
ACOR Design and Features
Redis as Storage
ACOR’s unique feature is storing the trie and failure function in Redis. Benefits include:
- Memory Efficiency: Large keyword sets are managed by Redis’s memory management
- Persistence: Data preservation through Redis persistence options
- Distributed Environment Support: Multiple application instances can share the same trie
State Representation
ACOR represents states as strings. For example, the path “h” -> “i” -> “s” is represented as the state string “his”. This simple representation makes debugging and understanding easier.
Redis Sorted Sets are used to manage valid states.
Getting Started
Prerequisites
- Go 1.13 or higher
- Redis 3.0 or higher
Installation
| |
Basic Usage
| |
API Overview
| Method | Description |
|---|---|
Create(args) | Create a new Aho-Corasick instance |
Add(keyword) | Add a keyword |
Find(text) | Search for matching keywords in text |
Suggest(input) | Get prefix-based autocomplete suggestions |
Flush() | Remove all registered keywords |
Close() | Close Redis connection |
Use Cases
- Spam Filtering: Register spam keywords and search messages
- Sensitive Data Detection: Detect patterns like SSNs, credit card numbers
- Autocomplete: Search query autocomplete systems
- Content Moderation: Filter inappropriate words
Conclusion
ACOR combines the power of the Aho-Corasick algorithm with the flexibility of Redis. It’s easy to integrate into projects that need multi-keyword searching.
For more details, check out the GitHub repository.