After months of intense coding, I have re-architected Trisul to take advantage of multiple cores. I just want to share the approach I took for this project.
The options I evaluated were:
- Flow pinning (like in Suricata, the new IDS engine)
  - Packets are mapped to a hardware thread based on tuples
- Work stealing
  - Hardware threads that are idle steal work to do (see Cilk)
Flow pinning turned in disappointing results, largely because:
- While Trisul does flow tracking and reassembly, the main chunk of code deals with metering (counting hundreds of data points based on payload content)
- It is hard to balance work based only on tuples
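To make the balancing problem concrete, here is a minimal sketch of tuple-based pinning in C++. It is not Trisul's or Suricata's code: the `FlowKey` and `FlowLoad` structs, the function names, and the choice of an FNV-1a hash are all assumptions made for illustration. The point is that the worker is chosen from the tuple alone, so one heavy flow lands entirely on one hardware thread no matter how idle the others are.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical 5-tuple flow key; field names are illustrative only.
struct FlowKey {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;
    std::uint16_t dst_port;
    std::uint8_t  proto;
};

// Hypothetical record: a flow and how many packets it carries.
struct FlowLoad {
    FlowKey       key;
    std::uint64_t packets;
};

// Choose a worker from the tuple alone. The two endpoints are ordered
// before hashing so both directions of a connection get the same worker.
std::size_t pin_to_worker(const FlowKey& k, std::size_t num_workers) {
    assert(num_workers > 0);
    std::uint64_t a = (std::uint64_t(k.src_ip) << 16) | k.src_port;
    std::uint64_t b = (std::uint64_t(k.dst_ip) << 16) | k.dst_port;
    if (a > b) std::swap(a, b);

    // 64-bit FNV-1a over the bytes of each value (an assumed hash choice).
    std::uint64_t h = 0xcbf29ce484222325ULL;
    auto mix = [&h](std::uint64_t v) {
        for (int i = 0; i < 8; ++i) {
            h ^= (v >> (8 * i)) & 0xff;
            h *= 0x100000001b3ULL;
        }
    };
    mix(a);
    mix(b);
    mix(k.proto);
    return static_cast<std::size_t>(h % num_workers);
}

// Sum the packets each worker would receive under pinning.
std::vector<std::uint64_t> packets_per_worker(const std::vector<FlowLoad>& flows,
                                              std::size_t num_workers) {
    std::vector<std::uint64_t> load(num_workers, 0);
    for (const auto& f : flows) {
        load[pin_to_worker(f.key, num_workers)] += f.packets;
    }
    return load;
}
```

Calling `packets_per_worker` with a single million-packet flow and four workers puts the whole load on one worker and leaves the other three idle. The hash cannot see how expensive the payload metering for a flow will be, which is the imbalance described above.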
Intel’s Threading Building Blocks (TBB) is the way to go if you want to build on the Cilk-style work stealing model. What’s more, you get a lot of extra goodies like concurrent containers, atomics, and native threading wrappers.
With TBB, Trisul is now implemented entirely as a pipeline with a few serial filters and dozens of parallel filters. The advantage of the pipeline pattern is that you can run a lot of code on caches that are still “hot”.
The end results are very encouraging.
Here is a screenshot of Trisul chewing through the 11GB of packet traces from the LBL-ICSI Enterprise Tracing Project.
340.7% balanced CPU utilization, and almost 3.2 times the speed of a single hardware thread!