Large-scale and high-throughput pattern matching on parallel architectures

Large-scale pattern matching has many applications, ranging from text processing to deep packet inspection (DPI), where hundreds or thousands of pre-defined strings or regular expressions (regexes) are matched concurrently and continuously against high-bandwidth input data. The large number of patterns and the high matching throughput make large-scale pattern matching both compute- and memory-intensive. In this thesis, we propose novel algorithms, constructions, and optimizations to accelerate large-scale pattern matching on two prominent classes of parallel architectures: Field-Programmable Gate Arrays (FPGAs) and general-purpose multi-core processors. We focus our studies on string pattern matching (SPM) and regular expression matching (REM) in the context of DPI for network intrusion detection. We apply various design methodologies, including pipelining, partitioning, parallel processing, aggregation, and modular composition, to improve the performance of our SPM and REM solutions on both FPGA and multi-core architectures.

For SPM, we analyze various real-life dictionaries as lexical trees and identify the “double power-law” distribution commonly present in the tree nodes. We then propose a head-body partitioning algorithm that splits a dictionary tree into a small “head” and a memory-efficient “body” operating in parallel. The “head” part is mapped either to a pipelined binary search tree on FPGA or to a small deterministic finite automaton (DFA) on a processor core; the “body” part is implemented as a compact, variable-stride body-branch data structure. Together, the head and body parts achieve high-bandwidth, attack-resilient matching throughput with good memory efficiency.

For REM, we propose a modified version of the classic McNaughton-Yamada construction, which converts an arbitrary regex into a modular nondeterministic finite automaton (NFA) suitable for implementation on FPGA.
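The modular flavor of such a construction can be illustrated with a minimal Thompson-style sketch in Python: every operator yields a self-contained fragment with exactly one start and one accept state, so fragments compose without touching each other's internals, much as modular NFA circuits compose on the FPGA. This is the textbook construction, not the thesis's modified one; the function names and fragment representation are illustrative assumptions.

```python
import itertools

_ids = itertools.count()          # global supply of fresh state ids

def lit(c):
    """NFA fragment matching the single character c."""
    s, t = next(_ids), next(_ids)
    return {"start": s, "accept": t, "eps": {}, "chr": {(s, c): {t}}}

def _edges(a, b):
    # State ids are globally unique, so the edge maps never collide.
    return {**a["eps"], **b["eps"]}, {**a["chr"], **b["chr"]}

def cat(a, b):
    """Concatenation: glue a's accept to b's start with an epsilon edge."""
    eps, ch = _edges(a, b)
    eps[a["accept"]] = {b["start"]}
    return {"start": a["start"], "accept": b["accept"], "eps": eps, "chr": ch}

def alt(a, b):
    """Union (a|b): a fresh start forks into both fragments; both rejoin."""
    s, t = next(_ids), next(_ids)
    eps, ch = _edges(a, b)
    eps[s] = {a["start"], b["start"]}
    eps[a["accept"]] = {t}
    eps[b["accept"]] = {t}
    return {"start": s, "accept": t, "eps": eps, "chr": ch}

def star(a):
    """Kleene star: skip the fragment entirely or loop back through it."""
    s, t = next(_ids), next(_ids)
    eps, ch = dict(a["eps"]), dict(a["chr"])
    eps[s] = {a["start"], t}
    eps[a["accept"]] = {a["start"], t}
    return {"start": s, "accept": t, "eps": eps, "chr": ch}

def _closure(nfa, states):
    """Epsilon-closure of a set of states."""
    seen, stack = set(states), list(states)
    while stack:
        for v in nfa["eps"].get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def matches(nfa, text):
    """Simulate the NFA on text (full-string match)."""
    cur = _closure(nfa, {nfa["start"]})
    for c in text:
        step = set()
        for u in cur:
            step |= nfa["chr"].get((u, c), set())
        cur = _closure(nfa, step)
    return nfa["accept"] in cur

# (a|b)*c -- built compositionally, one module per operator
nfa = cat(star(alt(lit("a"), lit("b"))), lit("c"))
```

Because each fragment exposes only its start and accept states, the same composition discipline carries over to hardware, where each module exposes one enable input and one match output.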
We also design a spatial stacking technique to easily construct multi-character matching circuits, a BRAM-based character classification scheme to improve resource efficiency, and a 2-dimensional staged pipeline to operate a large number of REM circuits in parallel on FPGA. On a multi-core system, we transform the modular NFA into a segmented NFA, with each segment mapped to a (64-bit) word that the processor core processes as a unit. Various techniques are applied to improve both the memory and computation efficiency of segment processing. To handle frequent and dynamic pattern updates, we provide algorithms for fast compilation of large dictionaries, as well as automated construction of large-scale REM circuits. For each of the proposed solutions, we evaluate our designs and optimizations using real-life DPI patterns and data streams with a wide range of characteristics.

Computationally, a DFA is more efficient than an NFA. However, converting an NFA to an equivalent DFA can cause exponential state explosion, making the resulting DFA far larger and practically infeasible to implement. In the final part of this thesis, we introduce a novel semi-deterministic finite automaton (SFA) that lies between the NFA and the DFA in terms of computation and memory complexity. We propose a state convolvement test and compatible-state grouping algorithms to convert an NFA into an SFA with a controlled space-time tradeoff. Although constructing a minimum-sized SFA is shown to be NP-complete, we develop a greedy heuristic that quickly constructs a near-optimal SFA in time and space quadratic in the number of states of the original NFA.
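The thesis's segmented-NFA algorithm is more elaborate, but the underlying idea of mapping an automaton's active-state set onto one machine word, updated with a handful of word-level operations per input character, is captured by the classic bit-parallel shift-and algorithm, sketched below as an illustration only (the pattern's state vector must fit in a single 64-bit word).

```python
def shift_and(pattern, text):
    """Classic bit-parallel shift-and matcher: bit i of the state word D is 1
    iff pattern[0..i] is a suffix of the text read so far. The entire
    active-state vector lives in one machine word, so each input character
    costs O(1) word operations."""
    m = len(pattern)
    assert m <= 64, "state vector must fit in one 64-bit word"
    # Per-character bitmask: bit i is set iff pattern[i] == c
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    D, hits = 0, []
    for j, c in enumerate(text):
        D = ((D << 1) | 1) & masks.get(c, 0)
        if D & (1 << (m - 1)):            # highest bit set -> full match
            hits.append(j - m + 1)        # record the start offset
    return hits
```

For example, `shift_and("abab", "xababab")` reports matches starting at offsets 1 and 3; larger automata that exceed one word motivate the segmented representation, in which each 64-bit segment is still processed as a unit.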