* [RFC] problems with RFS on bRPC applications
@ 2025-07-23 12:04 Hc Zheng
From: Hc Zheng @ 2025-07-23 12:04 UTC (permalink / raw)
  To: andrew+netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Saeed Mahameed, Tariq Toukan,
	Leon Romanovsky, laoar.shao, yc1082463
  Cc: netdev, linux-rdma, linux-kernel

Hi all,

I have tried to enable ARFS on a Mellanox CX-6 Ethernet card. It
works fine for simple workloads and benchmarks, but when running a
bRPC (https://github.com/apache/brpc) workload on a 2-NUMA-node
machine, performance degrades significantly. After some tracing, I
identified the following workload patterns that ARFS/RFS fails to
handle efficiently:

- The workload has multiple threads that use epoll and read from the
same socket, which can cause the flow entry in the sock_flow_table to
be updated frequently.

- The threads reading from the socket also migrate frequently between CPUs.

With these patterns, the flow is being updated very frequently, which
causes severe lock contention on arfs->arfs_lock in
mlx5e_rx_flow_steer. As a result, network packets are not being
handled in a timely manner.
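
For context, here is a minimal userspace sketch (my own illustration,
not code taken from bRPC) of the access pattern described above:
several threads each epoll on and read from the same connected socket,
so every recv() goes through sock_rps_record_flow() and restamps the
flow entry with whichever CPU the reading thread happens to be on at
the time. The address and port are placeholders.

/* reproducer sketch: many reader threads sharing one socket */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define NR_READERS 8

static int sock_fd;

static void *reader(void *arg)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };
	char buf[4096];
	int ep = epoll_create1(0);

	(void)arg;
	/* Every reader thread polls the same socket. */
	epoll_ctl(ep, EPOLL_CTL_ADD, sock_fd, &ev);
	for (;;) {
		struct epoll_event out;

		if (epoll_wait(ep, &out, 1, -1) <= 0)
			continue;
		/* Each read records the flow on this thread's current CPU. */
		recv(sock_fd, buf, sizeof(buf), MSG_DONTWAIT);
	}
	return NULL;
}

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(8000),	/* placeholder server */
		.sin_addr.s_addr = inet_addr("127.0.0.1"),
	};
	pthread_t tids[NR_READERS];
	int i;

	sock_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (connect(sock_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}
	for (i = 0; i < NR_READERS; i++)
		pthread_create(&tids[i], NULL, reader, NULL);
	for (i = 0; i < NR_READERS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}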

Here are the cases in which we want to enable ARFS/RFS:

- We want to ensure that flows belonging to different containers do
not interfere with each other. Our goal is for the flows to be steered
to the appropriate container’s CPUs.

- In the case of bRPC, the original RFS/ARFS logic does not help, so
we aim to steer the flows to the CPUs running the container, keeping
them as close as possible and as balanced as possible (see the sketch
after this list).
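
To make the second point a bit more concrete, below is a rough sketch
of picking a CPU for a flow from a container's allowed CPU mask,
spreading flows across that mask by hash. The helper name
flow_pick_cpu_in_mask() is hypothetical, not an existing kernel
function; in our case the mask would come from the reading task's
current->cpus_ptr (i.e. the container's cpuset), which is also what the
PoC at the end of this mail uses for its skip check.

static inline unsigned int flow_pick_cpu_in_mask(u32 hash,
						 const struct cpumask *allowed)
{
	unsigned int n = cpumask_weight(allowed);
	unsigned int nth, cpu;

	if (!n)
		return raw_smp_processor_id();

	/* Walk to the (hash % n)-th CPU set in the allowed mask. */
	nth = hash % n;
	for_each_cpu(cpu, allowed)
		if (nth-- == 0)
			return cpu;

	return raw_smp_processor_id();
}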

One simple solution I came up with is to add another mode on top of
RFS, e.g. a variant of rps_record_sock_flow that only updates the
entry at a fixed interval to avoid frequent updates, or an interface
that allows userspace to steer flows dynamically. This mode would
steer flows to CPUs within the target container’s CPU set, providing
some load balancing and locality.
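
As a rough illustration of the fixed-interval idea, something along
these lines is what I have in mind; the rps_update_interval knob and
the last_update[] array (a per-entry timestamp kept next to the flow
table) are hypothetical additions, not existing kernel fields:

static unsigned int rps_update_interval __read_mostly = HZ / 10;

static inline void
rps_record_sock_flow_ratelimited(struct rps_sock_flow_table *table,
				 unsigned long *last_update, u32 hash)
{
	if (table && hash) {
		unsigned int index = hash & table->mask;
		u32 val = hash & ~rps_cpu_mask;
		unsigned long now = jiffies;

		/* Leave the existing hint alone until it has aged out. */
		if (time_before(now, READ_ONCE(last_update[index]) +
				     rps_update_interval))
			return;

		val |= raw_smp_processor_id();
		if (READ_ONCE(table->ents[index]) != val)
			WRITE_ONCE(table->ents[index], val);
		WRITE_ONCE(last_update[index], now);
	}
}

Either way, the intent is to keep the recorded CPU stable between
updates instead of rewriting the entry on every read.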

I have written some simple PoC code for this. After applying it in
production, we noticed the following performance changes:

- Cross-NUMA memory bandwidth: 13 GB/s → 9 GB/s

- Pod system busy: 7.2% → 6.8%

- CPU PSI: 14ms → 12ms

However, we also noticed that some RX queues receive more flows than
others, since this code does not implement load balancing.

I am writing this email to request suggestions from netdev developers.

Additionally, for the Mellanox folks: are there any plans to refine
arfs->arfs_lock in mlx5e_rx_flow_steer?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a04fbf72476..1df7e125c61f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -30,6 +30,7 @@
 #include <asm/byteorder.h>
 #include <asm/local.h>

+#include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/rculist.h>
 #include <linux/workqueue.h>
@@ -753,15 +754,21 @@ static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
 	if (table && hash) {
 		unsigned int index = hash & table->mask;
 		u32 val = hash & ~rps_cpu_mask;
+		u32 old = READ_ONCE(table->ents[index]);
 
+
+		if (likely((old & ~rps_cpu_mask) == val && cpumask_test_cpu(old & rps_cpu_mask, current->cpus_ptr))) {
+			return;
+		}
 		/* We only give a hint, preemption can change CPU under us */
 		val |= raw_smp_processor_id();
 
 		/* The following WRITE_ONCE() is paired with the READ_ONCE()
 		 * here, and another one in get_rps_cpu().
 		 */
-		if (READ_ONCE(table->ents[index]) != val)
+		if (old != val) {
 			WRITE_ONCE(table->ents[index], val);
+		}
 	}
 }



Best Regards
Huaicheng Zheng
