One of the key decision factors in deploying deep packet inspection (DPI) and choosing a solution that best fits one's network needs is cost. Cost per transmitted bit, for example, evaluates how much operators pay for a fixed amount of data processed by their network. This evaluation is particularly important in 5G, where traffic is expected to record phenomenal growth rates over the coming years. According to Ericsson, global mobile data traffic reached around 33 EB per month in 2019 and is expected to grow by almost a factor of five to reach 164 EB per month in 2025[1]. With 5G devices making their debut, and with 5G networks being rapidly deployed in key markets, the much-touted 5G use cases, from cloud gaming to remote surgery to HD video streaming, are expected to load the network with new streams of heavy data flows. To meet this demand, network operators need DPI solutions that can be scaled up in real time, without huge outlays in the form of re-investments and continuous upgrades.
Delivering new 5G services
In 5G, ultra-reliable low-latency communication (URLLC) use cases such as industrial automation, remote healthcare and autonomous driving demand that networks deliver latencies lower than 1 ms. Enhanced Mobile Broadband (eMBB) use cases demand peak downlink data rates of 20 Gbit/s and user-experienced data rates of 100 Mbit/s, while Massive Machine Type Communications (mMTC) use cases demand connection densities of up to one million devices per square kilometer. Any user plane function (UPF) deployed between the N6 interface and the 5G user equipment (UE) must process traffic at these latencies and speeds, because even slight processing delays can have a disastrous impact on the operator's SLAs.
The processing capabilities of the DPI libraries embedded into UPFs consequently become a major requirement for accommodating 5G service classes. With 5G, a network-wide upgrade of the underlying computing capacity is inevitable. What often baffles operators, however, is figuring out the exact upgrades needed to support new 5G traffic volumes and lower latencies, and ensuring that those upgrades meet the overall 5G network performance needs.
CPU capacity and DPI performance
We decided that the best way to put the uncertainties around capacity upgrades to rest, specifically for DPI, was to quantify the impact of different deployment options on state-of-the-art x86 server hardware. To do so, we collaborated with Intel® Network Builders on a series of tests that took place a few weeks ago. The tests were executed on an Intel server board (Intel® Server System R2208WFQZS) powered by two Intel® Xeon® Gold 6230N processors (0x500002c), with 384 GB of memory (24 × 16 GB DDR4-2666 modules), two 512 GB M.2 SATA 3.0 SSDs and four 1.0 TB Intel® SSD DC P4510 Series PCIe NVMe 3.0 drives. Networking was provided by four Intel® Ethernet Network Adapter XXV710-DA2 dual-port 25 GbE Ethernet controllers.
Impact of NUMA alignment on DPI performance
While superior server processing capabilities amplify the throughput of DPI, a number of other variables related to server capacity play an important role in determining DPI performance. Processing capability is also dictated by the processor's architectural features. One such feature is Non-Uniform Memory Access (NUMA). NUMA is a shared-memory architecture that defines the positioning of a server's main memory modules relative to its processors in a multi-processor system: each processor accesses its own local memory faster than memory attached to another processor, which must be reached over the inter-socket interconnect.
During the lab test, we extended our testing to assess the impact of NUMA alignment on the performance of R&S®PACE 2, again on Intel® Xeon® Gold 6230N CPUs. Our objective was to illustrate how NUMA alignment vs. misalignment affects DPI throughput.
We ran the test in two scenarios. In the first, we used the ability of the R&S®PACE 2 APIs to allocate memory resources with NUMA-aware sub-initialization functions for each worker thread. In the second scenario, we purposely misaligned each thread to the remote NUMA node, with the result that the necessary data was accessed via the Intel® QuickPath Interconnect (Intel® QPI) bus.
R&S®PACE 2 NUMA support
R&S®PACE 2 supports integration using NUMA-aware sub-initialization functions for each worker thread. This is considered best practice on multi-socket systems, because it allocates the requested memory on the requesting NUMA node. Otherwise, per-thread processing data, such as the subscriber and flow tracking tables, can be misaligned. In that suboptimal case, the affected processing threads have to access remote memory via the QPI bus, which results in a higher load on the interconnect and additional latency overhead compared to direct local memory access.
Test results
The tests showed that misaligned NUMA worker threads can reduce throughput by up to 25.4% on a server powered by an Intel® Xeon® Gold 6230N CPU. The impact of the misalignment is more visible with application-centric traffic flows than with the default profiles delivered with the open-source TRex traffic generator, such as rtsp.yaml or sfr3.yaml[2]: these default profiles require significantly less access to the flow and subscriber hash tables than a mixed-application DPI profile (mixed_dpi.yaml).