Introduction
The Aho-Corasick algorithm is a classic string-searching algorithm that efficiently finds all occurrences of multiple patterns in a text simultaneously. It’s widely used in spam filtering, sensitive data detection, and autocomplete systems.
ACOR is a Go library that implements Aho-Corasick with Redis as the backend storage. Version 1.3.0 introduces Index APIs that extend the existing functionality to provide match position information.
In this post, we’ll take a deep dive into the design and implementation of the FindIndex and SuggestIndex APIs.
Comparison with Existing APIs
Prior to v1.3.0, ACOR provided two search APIs:
| |
These APIs tell you which keywords matched, but not where in the text they were found. For text highlighting or position-based analysis, you’d need to calculate indices separately.
The v1.3.0 Index APIs solve this problem:
| |
API Overview
FindIndex
FindIndex traverses the input text and finds all positions where registered keywords match:
| |
The return value is a map[keyword][]startIndex. Since each keyword can match at multiple positions, the start indices are returned as a slice.
SuggestIndex
SuggestIndex returns matched keywords and their start positions for prefix-based autocomplete:
| |
Usage Example
Here’s a complete usage example:
| |
Implementation Deep Dive
Rune-Based Indexing
The key to FindIndex is proper Unicode handling. Go’s range statement iterates over strings in rune (Unicode code point) units:
| |
This approach ensures correct indexing for multi-byte characters like Korean and emojis.
State Transition Logic
The core of Aho-Corasick is the Trie structure and Failure Function. ACOR stores these in Redis:
| |
Index Calculation
The start index of a matched keyword is calculated by subtracting the keyword length from the end position:
| |
Using len([]rune(output)) ensures calculation by character count, not byte count.
Edge Cases
Overlapping Matches
In the text "her", both "he" and "her" match at index 0:
| |
Repeated Matches
In "hehe", "he" matches twice:
| |
Unicode Handling
Registering Korean keyword "한글" and searching in "가한글":
| |
"가" is the 0th rune, "한글" starts at the 1st rune.
Performance Considerations
Index APIs have the following overhead compared to the original APIs:
- Memory: Storing index information in
map[string][]intstructure - Computation: Calculating rune length with
len([]rune(output))at each match
For simple existence checks where index information isn’t needed, use the original Find/Suggest APIs for better efficiency. For text highlighting, position-based logging, or analysis tasks requiring indices, the Index APIs are the right choice.
Conclusion
ACOR v1.3.0’s Index APIs expand the library’s utility by providing match position information. The rune-based indexing for proper Unicode handling and the implementation of Aho-Corasick’s state transitions in Redis are particularly noteworthy.
For more details, check out the ACOR GitHub repository and official documentation.