* [PATCH net-next v3 0/3] xsk: introduce atomic for cq in generic path
@ 2025-11-28 13:45 Jason Xing
2025-11-28 13:45 ` [PATCH net-next v3 1/3] xsk: add atomic cached_prod for copy mode Jason Xing
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jason Xing @ 2025-11-28 13:45 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
This series replaces the cq_cached_prod_lock spin lock with atomic
operations in the generic (copy mode) xmit path to get better performance.
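In short, the spin_lock()/spin_unlock() pair that xsk_cq_reserve_locked()
takes around xskq_prod_reserve() is turned into a lockless reserve based
on atomic_try_cmpxchg(); the reserve loop in patch 2/3 becomes:
	/* lockless reserve: re-check free space and retry the CAS if
	 * another producer moved cached_prod in the meantime
	 */
	do {
		q->cached_cons = READ_ONCE(q->ring->consumer);
		cached_prod = atomic_read(&q->cached_prod_atomic);
		free_entries = q->nentries - (cached_prod - q->cached_cons);
		if (free_entries <= 0)
			return -ENOSPC;
	} while (!atomic_try_cmpxchg(&q->cached_prod_atomic, &cached_prod,
				     cached_prod + 1));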
---
v3
Link: https://lore.kernel.org/all/20251125085431.4039-1-kerneljasonxing@gmail.com/
1. fix one race issue that cannot be resolved by simple separate atomic
operations, so this revision only updates patch [2/3] and uses the
try_cmpxchg method to avoid that problem. (Paolo)
2. update the commit log accordingly.
V2
Link: https://lore.kernel.org/all/20251124080858.89593-1-kerneljasonxing@gmail.com/
1. use separate functions rather than branches within shared routines. (Maciej)
2. make each patch as simple as possible for easier review
Jason Xing (3):
xsk: add atomic cached_prod for copy mode
xsk: use atomic operations around cached_prod for copy mode
xsk: remove spin lock protection of cached_prod
include/net/xsk_buff_pool.h | 5 -----
net/xdp/xsk.c | 23 +++++------------------
net/xdp/xsk_buff_pool.c | 1 -
net/xdp/xsk_queue.h | 23 +++++++++++++++++------
4 files changed, 22 insertions(+), 30 deletions(-)
--
2.41.3
* [PATCH net-next v3 1/3] xsk: add atomic cached_prod for copy mode
2025-11-28 13:45 [PATCH net-next v3 0/3] xsk: introduce atomic for cq in generic path Jason Xing
@ 2025-11-28 13:45 ` Jason Xing
2025-11-28 13:46 ` [PATCH net-next v3 2/3] xsk: use atomic operations around " Jason Xing
2025-11-28 13:46 ` [PATCH net-next v3 3/3] xsk: remove spin lock protection of cached_prod Jason Xing
2 siblings, 0 replies; 10+ messages in thread
From: Jason Xing @ 2025-11-28 13:45 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Add a union member that is used for the completion queue in copy mode
only, for now. The purpose is to replace cq_cached_prod_lock with atomic
operations to improve performance. Note that the completion queue in
zerocopy mode doesn't need to be converted because that path is already
lockless.
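For illustration only (the conversion itself happens in the following
patches): both union members alias the same 32-bit storage, so the
copy-mode cq producer side can switch to atomic accesses while every
other ring keeps using the plain field, e.g.:
	atomic_sub(n, &q->cached_prod_atomic);	/* cq producer, copy mode */
	q->cached_prod -= cnt;			/* any other queue, unchanged */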
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
net/xdp/xsk_queue.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 1eb8d9f8b104..44cc01555c0b 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -40,7 +40,11 @@ struct xdp_umem_ring {
struct xsk_queue {
u32 ring_mask;
u32 nentries;
- u32 cached_prod;
+ union {
+ u32 cached_prod;
+ /* Used for cq in copy mode only */
+ atomic_t cached_prod_atomic;
+ };
u32 cached_cons;
struct xdp_ring *ring;
u64 invalid_descs;
--
2.41.3
* [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-11-28 13:45 [PATCH net-next v3 0/3] xsk: introduce atomic for cq in generic path Jason Xing
2025-11-28 13:45 ` [PATCH net-next v3 1/3] xsk: add atomic cached_prod for copy mode Jason Xing
@ 2025-11-28 13:46 ` Jason Xing
2025-11-28 14:20 ` Paolo Abeni
2025-11-28 13:46 ` [PATCH net-next v3 3/3] xsk: remove spin lock protection of cached_prod Jason Xing
2 siblings, 1 reply; 10+ messages in thread
From: Jason Xing @ 2025-11-28 13:46 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Use atomic_try_cmpxchg() to replace the spin lock. Technically, CAS
(Compare And Swap) is better than a coarse-grained spin lock, especially
when only a few simple operations need to be protected. A similar idea
can be found in the recent commit 100dfa74cad9 ("net: dev_queue_xmit()
llist adoption"), which implements lockless logic with the help of
try_cmpxchg().
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
Paolo, sorry that I didn't try to move the lock to struct xsk_queue:
after investigation I reckon try_cmpxchg() adds less overhead when
multiple xsks contend at this point. So I hope this approach can be
adopted.
---
net/xdp/xsk.c | 4 ++--
net/xdp/xsk_queue.h | 17 ++++++++++++-----
2 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index bcfd400e9cf8..b63409b1422e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -551,7 +551,7 @@ static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
int ret;
spin_lock(&pool->cq_cached_prod_lock);
- ret = xskq_prod_reserve(pool->cq);
+ ret = xsk_cq_cached_prod_reserve(pool->cq);
spin_unlock(&pool->cq_cached_prod_lock);
return ret;
@@ -588,7 +588,7 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
{
spin_lock(&pool->cq_cached_prod_lock);
- xskq_prod_cancel_n(pool->cq, n);
+ atomic_sub(n, &pool->cq->cached_prod_atomic);
spin_unlock(&pool->cq_cached_prod_lock);
}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 44cc01555c0b..7fdc80e624d6 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -402,13 +402,20 @@ static inline void xskq_prod_cancel_n(struct xsk_queue *q, u32 cnt)
q->cached_prod -= cnt;
}
-static inline int xskq_prod_reserve(struct xsk_queue *q)
+static inline int xsk_cq_cached_prod_reserve(struct xsk_queue *q)
{
- if (xskq_prod_is_full(q))
- return -ENOSPC;
+ int free_entries;
+ u32 cached_prod;
+
+ do {
+ q->cached_cons = READ_ONCE(q->ring->consumer);
+ cached_prod = atomic_read(&q->cached_prod_atomic);
+ free_entries = q->nentries - (cached_prod - q->cached_cons);
+ if (free_entries <= 0)
+ return -ENOSPC;
+ } while (!atomic_try_cmpxchg(&q->cached_prod_atomic, &cached_prod,
+ cached_prod + 1));
- /* A, matches D */
- q->cached_prod++;
return 0;
}
--
2.41.3
* [PATCH net-next v3 3/3] xsk: remove spin lock protection of cached_prod
2025-11-28 13:45 [PATCH net-next v3 0/3] xsk: introduce atomic for cq in generic path Jason Xing
2025-11-28 13:45 ` [PATCH net-next v3 1/3] xsk: add atomic cached_prod for copy mode Jason Xing
2025-11-28 13:46 ` [PATCH net-next v3 2/3] xsk: use atomic operations around " Jason Xing
@ 2025-11-28 13:46 ` Jason Xing
2 siblings, 0 replies; 10+ messages in thread
From: Jason Xing @ 2025-11-28 13:46 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Remove the spin lock protection and adjust the related functions
accordingly. Now cached_prod is fully converted to an atomic, which
helps in the contended case where the umem is shared between xsks.
Removing that lock avoids touching one extra cacheline in the extremely
hot path, which directly improves performance by around 5% across
different platforms, as Paolo found [1].
[1]: https://lore.kernel.org/all/4c645223-8c52-40d3-889b-f3cf7fa09f89@redhat.com/
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
include/net/xsk_buff_pool.h | 5 -----
net/xdp/xsk.c | 21 ++++-----------------
net/xdp/xsk_buff_pool.c | 1 -
3 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 92a2358c6ce3..0b1abdb99c9e 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -90,11 +90,6 @@ struct xsk_buff_pool {
* destructor callback.
*/
spinlock_t cq_prod_lock;
- /* Mutual exclusion of the completion ring in the SKB mode.
- * Protect: when sockets share a single cq when the same netdev
- * and queue id is shared.
- */
- spinlock_t cq_cached_prod_lock;
struct xdp_buff_xsk *free_heads[];
};
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b63409b1422e..ae8a92c168b8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -546,17 +546,6 @@ static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
return dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id, flags);
}
-static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
-{
- int ret;
-
- spin_lock(&pool->cq_cached_prod_lock);
- ret = xsk_cq_cached_prod_reserve(pool->cq);
- spin_unlock(&pool->cq_cached_prod_lock);
-
- return ret;
-}
-
static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
struct sk_buff *skb)
{
@@ -585,11 +574,9 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
}
-static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
+static void xsk_cq_cached_prod_cancel(struct xsk_buff_pool *pool, u32 n)
{
- spin_lock(&pool->cq_cached_prod_lock);
atomic_sub(n, &pool->cq->cached_prod_atomic);
- spin_unlock(&pool->cq_cached_prod_lock);
}
static void xsk_inc_num_desc(struct sk_buff *skb)
@@ -643,7 +630,7 @@ static void xsk_consume_skb(struct sk_buff *skb)
}
skb->destructor = sock_wfree;
- xsk_cq_cancel_locked(xs->pool, num_descs);
+ xsk_cq_cached_prod_cancel(xs->pool, num_descs);
/* Free skb without triggering the perf drop trace */
consume_skb(skb);
xs->skb = NULL;
@@ -860,7 +847,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
xskq_cons_release(xs->tx);
} else {
/* Let application retry */
- xsk_cq_cancel_locked(xs->pool, 1);
+ xsk_cq_cached_prod_cancel(xs->pool, 1);
}
return ERR_PTR(err);
@@ -898,7 +885,7 @@ static int __xsk_generic_xmit(struct sock *sk)
* if there is space in it. This avoids having to implement
* any buffering in the Tx path.
*/
- err = xsk_cq_reserve_locked(xs->pool);
+ err = xsk_cq_cached_prod_reserve(xs->pool->cq);
if (err) {
err = -EAGAIN;
goto out;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 51526034c42a..9539f121b290 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -91,7 +91,6 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
INIT_LIST_HEAD(&pool->xsk_tx_list);
spin_lock_init(&pool->xsk_tx_list_lock);
spin_lock_init(&pool->cq_prod_lock);
- spin_lock_init(&pool->cq_cached_prod_lock);
refcount_set(&pool->users, 1);
pool->fq = xs->fq_tmp;
--
2.41.3
* Re: [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-11-28 13:46 ` [PATCH net-next v3 2/3] xsk: use atomic operations around " Jason Xing
@ 2025-11-28 14:20 ` Paolo Abeni
2025-11-29 0:55 ` Jason Xing
0 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2025-11-28 14:20 UTC (permalink / raw)
To: Jason Xing, davem, edumazet, kuba, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev
Cc: bpf, netdev, Jason Xing
On 11/28/25 2:46 PM, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> Use atomic_try_cmpxchg operations to replace spin lock. Technically
> CAS (Compare And Swap) is better than a coarse way like spin-lock
> especially when we only need to perform a few simple operations.
> Similar idea can also be found in the recent commit 100dfa74cad9
> ("net: dev_queue_xmit() llist adoption") that implements the lockless
> logic with the help of try_cmpxchg.
>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> Paolo, sorry that I didn't try to move the lock to struct xsk_queue
> because after investigation I reckon try_cmpxchg can add less overhead
> when multiple xsks contend at this point. So I hope this approach
> can be adopted.
I still think that moving the lock would be preferable, because it also
makes sense from a maintenance perspective. Can you report the
difference you measure between atomics and moving the spin lock?
Have you tried moving cq_prod_lock, too?
/P
* Re: [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-11-28 14:20 ` Paolo Abeni
@ 2025-11-29 0:55 ` Jason Xing
2025-12-03 6:56 ` Jason Xing
0 siblings, 1 reply; 10+ messages in thread
From: Jason Xing @ 2025-11-29 0:55 UTC (permalink / raw)
To: Paolo Abeni
Cc: davem, edumazet, kuba, bjorn, magnus.karlsson, maciej.fijalkowski,
jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 11/28/25 2:46 PM, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Use atomic_try_cmpxchg operations to replace spin lock. Technically
> > CAS (Compare And Swap) is better than a coarse way like spin-lock
> > especially when we only need to perform a few simple operations.
> > Similar idea can also be found in the recent commit 100dfa74cad9
> > ("net: dev_queue_xmit() llist adoption") that implements the lockless
> > logic with the help of try_cmpxchg.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > Paolo, sorry that I didn't try to move the lock to struct xsk_queue
> > because after investigation I reckon try_cmpxchg can add less overhead
> > when multiple xsks contend at this point. So I hope this approach
> > can be adopted.
>
> I still think that moving the lock would be preferable, because it makes
> sense also from a maintenance perspective.
I can see your point here. Sure, moving the lock is easier to
understand. But my take is that the atomic changes here are not that
hard to read :) They have the same effect as the spin lock because
try_cmpxchg() atomically checks, compares and sets.
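To illustrate (just a sketch, not the exact v2 code, paraphrasing the
race mentioned in the v3 changelog): with separate atomic ops the check
and the increment are two distinct steps:
	/* sketch only: check ... */
	if (xskq_prod_is_full(q))
		return -ENOSPC;
	/* ... then increment as a separate step; another producer can
	 * take the last free slot right here, so both end up
	 * incrementing past a full ring
	 */
	atomic_inc(&q->cached_prod_atomic);
try_cmpxchg() folds the check and the increment into one atomic step (it
simply retries if cached_prod moved under us), which gives the same
guarantee the spin lock did.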
> Can you report the difference
> you measure atomics vs moving the spin lock?
No problem, hopefully I will give a detailed report next week because
I'm going to apply it directly in production where we have multiple
xsks sharing the same umem.
IMHO, in theory, atomics are way better than a spin lock in contended
cases since the protected section is small and fast.
>
> Have you tried moving cq_prod_lock, too?
Not yet, thanks for reminding me. It should not affect the sending
rate, only the tx completion time, I think.
I'll test this as well next week :)
Thanks,
Jason
* Re: [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-11-29 0:55 ` Jason Xing
@ 2025-12-03 6:56 ` Jason Xing
2025-12-03 9:24 ` Paolo Abeni
0 siblings, 1 reply; 10+ messages in thread
From: Jason Xing @ 2025-12-03 6:56 UTC (permalink / raw)
To: Paolo Abeni
Cc: davem, edumazet, kuba, bjorn, magnus.karlsson, maciej.fijalkowski,
jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
Hi Paolo,
On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 11/28/25 2:46 PM, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Use atomic_try_cmpxchg operations to replace spin lock. Technically
> > > CAS (Compare And Swap) is better than a coarse way like spin-lock
> > > especially when we only need to perform a few simple operations.
> > > Similar idea can also be found in the recent commit 100dfa74cad9
> > > ("net: dev_queue_xmit() llist adoption") that implements the lockless
> > > logic with the help of try_cmpxchg.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > Paolo, sorry that I didn't try to move the lock to struct xsk_queue
> > > because after investigation I reckon try_cmpxchg can add less overhead
> > > when multiple xsks contend at this point. So I hope this approach
> > > can be adopted.
> >
> > I still think that moving the lock would be preferable, because it makes
> > sense also from a maintenance perspective.
>
> I can see your point here. Sure, moving the lock is relatively easier
> to understand. But my take is that atomic changes here are not that
> hard to read :) It has the same effect as spin lock because it will
> atomically check, compare and set in try_cmpxchg().
>
> > Can you report the difference
> > you measure atomics vs moving the spin lock?
>
> No problem, hopefully I will give a detailed report next week because
> I'm going to apply it directly in production where we have multiple
> xsk sharing the same umem.
I'm done with the test in production, where a few applications rely on
multiple xsks sharing the same pool to send UDP packets. Here are the
significant numbers from the bcc funclatency tool, which recorded the
latency of these particular functions:
1. use spin lock
$ sudo ./funclatency xsk_cq_reserve_locked
Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 25308114 |** |
256 -> 511 : 283924647 |********************** |
512 -> 1023 : 501589652 |****************************************|
1024 -> 2047 : 93045664 |******* |
2048 -> 4095 : 746395 | |
4096 -> 8191 : 424053 | |
8192 -> 16383 : 1041 | |
16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | |
524288 -> 1048575 : 6 | |
1048576 -> 2097151 : 2 | |
avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
2. use atomic
$ sudo ./funclatency xsk_cq_cached_prod_reserve
Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 109815401 |********* |
256 -> 511 : 485028947 |****************************************|
512 -> 1023 : 320121627 |************************** |
1024 -> 2047 : 38538584 |*** |
2048 -> 4095 : 377026 | |
4096 -> 8191 : 340961 | |
8192 -> 16383 : 549 | |
16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | |
524288 -> 1048575 : 10 | |
avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
And those numbers were verified over and over again, which means they
are quite stable.
You can see that with atomics the avg is smaller and the count in the
[128 -> 255] bucket is larger, which shows better performance.
I will add the above numbers to the commit log after the merge window is open.
>
> IMHO, in theory, atomics is way better than spin lock in contended
> cases since the protected area is small and fast.
I also spent time investigating the details of both approaches. The
spin lock uses something like atomic_try_cmpxchg() first and then falls
back to the slow path, which is more complicated than a plain atomic.
And the protected section here is so small and the calculations so
simple that there is no point in making one thread update a few fields
while the others wait.
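For reference, the generic qspinlock fast path looks roughly like this
(paraphrased from include/asm-generic/qspinlock.h, so the details may be
slightly off):
	static __always_inline void queued_spin_lock(struct qspinlock *lock)
	{
		int val = 0;

		/* fast path: a single CAS from "unlocked" to "locked" */
		if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
			return;

		/* contended: queue up in the slow path */
		queued_spin_lock_slowpath(lock, val);
	}
So in the uncontended case it is the same kind of CAS we are doing here,
plus the slow-path machinery and the unlock on top.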
>
> >
> > Have you tried moving cq_prod_lock, too?
>
> Not yet, thanks for reminding me. It should not affect the sending
> rate but the tx completion time, I think.
I also tried moving this lock, but sadly I noticed that taking the lock
at completion time invalidated the cache line used by another thread
that was sending packets. This can clearly be seen with perf
cycles:ppp:
1. before
8.70% xsk_cq_cached_prod_reserve
2. after
12.31% xsk_cq_cached_prod_reserve
So I decided not to include such a modification. Anyway, thanks for
your valuable suggestions; I learnt a lot from those interesting
experiments.
Thanks,
Jason
* Re: [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-12-03 6:56 ` Jason Xing
@ 2025-12-03 9:24 ` Paolo Abeni
2025-12-03 9:40 ` Magnus Karlsson
0 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2025-12-03 9:24 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, bjorn, magnus.karlsson, maciej.fijalkowski,
jonathan.lemon, sdf, ast, daniel, hawk, john.fastabend, horms,
andrew+netdev, bpf, netdev, Jason Xing
On 12/3/25 7:56 AM, Jason Xing wrote:
> On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
>>> On 11/28/25 2:46 PM, Jason Xing wrote:
>>>> From: Jason Xing <kernelxing@tencent.com>
>>>>
>>>> Use atomic_try_cmpxchg operations to replace spin lock. Technically
>>>> CAS (Compare And Swap) is better than a coarse way like spin-lock
>>>> especially when we only need to perform a few simple operations.
>>>> Similar idea can also be found in the recent commit 100dfa74cad9
>>>> ("net: dev_queue_xmit() llist adoption") that implements the lockless
>>>> logic with the help of try_cmpxchg.
>>>>
>>>> Signed-off-by: Jason Xing <kernelxing@tencent.com>
>>>> ---
>>>> Paolo, sorry that I didn't try to move the lock to struct xsk_queue
>>>> because after investigation I reckon try_cmpxchg can add less overhead
>>>> when multiple xsks contend at this point. So I hope this approach
>>>> can be adopted.
>>>
>>> I still think that moving the lock would be preferable, because it makes
>>> sense also from a maintenance perspective.
>>
>> I can see your point here. Sure, moving the lock is relatively easier
>> to understand. But my take is that atomic changes here are not that
>> hard to read :) It has the same effect as spin lock because it will
>> atomically check, compare and set in try_cmpxchg().
>>
>>> Can you report the difference
>>> you measure atomics vs moving the spin lock?
>>
>> No problem, hopefully I will give a detailed report next week because
>> I'm going to apply it directly in production where we have multiple
>> xsk sharing the same umem.
>
> I'm done with the test in production where a few applications rely on
> multiple xsks sharing the same pool to send UDP packets. Here are
> significant numbers from bcc tool that recorded the latency caused by
> these particular functions:
>
> 1. use spin lock
> $ sudo ./funclatency xsk_cq_reserve_locked
> Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
> ^C
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 25308114 |** |
> 256 -> 511 : 283924647 |********************** |
> 512 -> 1023 : 501589652 |****************************************|
> 1024 -> 2047 : 93045664 |******* |
> 2048 -> 4095 : 746395 | |
> 4096 -> 8191 : 424053 | |
> 8192 -> 16383 : 1041 | |
> 16384 -> 32767 : 0 | |
> 32768 -> 65535 : 0 | |
> 65536 -> 131071 : 0 | |
> 131072 -> 262143 : 0 | |
> 262144 -> 524287 : 0 | |
> 524288 -> 1048575 : 6 | |
> 1048576 -> 2097151 : 2 | |
>
> avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
>
> 2. use atomic
> $ sudo ./funclatency xsk_cq_cached_prod_reserve
> Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
> ^C
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 109815401 |********* |
> 256 -> 511 : 485028947 |****************************************|
> 512 -> 1023 : 320121627 |************************** |
> 1024 -> 2047 : 38538584 |*** |
> 2048 -> 4095 : 377026 | |
> 4096 -> 8191 : 340961 | |
> 8192 -> 16383 : 549 | |
> 16384 -> 32767 : 0 | |
> 32768 -> 65535 : 0 | |
> 65536 -> 131071 : 0 | |
> 131072 -> 262143 : 0 | |
> 262144 -> 524287 : 0 | |
> 524288 -> 1048575 : 10 | |
>
> avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
>
> And those numbers were verified over and over again which means they
> are quite stable.
>
> You can see that when using atomic, the avg is smaller and the count
> of [128 -> 255] is larger, which shows better performance.
>
> I will add the above numbers in the commit log after the merge window is open.
It's not just a matter of performance. Spinlocks additionally give you
fairness and lockdep guarantees, beyond being easier to grasp for
whoever is going to touch this code in the future, while raw atomics
give you none of that.
From a maintainability perspective spinlocks are much preferable.
IMHO micro-benchmarking is not a strong enough argument to counter the
spinlock advantages: at the very _least_ a large performance gain should
be observed in relevant test cases and/or real live workloads.
>>> Have you tried moving cq_prod_lock, too?
>>
>> Not yet, thanks for reminding me. It should not affect the sending
>> rate but the tx completion time, I think.
>
> I also tried moving this lock, but sadly I noticed that in completion
> time the lock was set which led to invalidation of the cache line of
> another thread sending packets. It can be obviously proved by perf
> cycles:ppp:
> 1. before
> 8.70% xsk_cq_cached_prod_reserve
>
> 2. after
> 12.31% xsk_cq_cached_prod_reserve
>
> So I decided not to bring such a modification. Anyway, thanks for your
> valuable suggestions and I learnt a lot from those interesting
> experiments.
The goal of such a change would be to reduce the number of touched
cachelines; when I suggested the above, I had not dived into the
producer specifics and assumed the relevant producer data were inside
the xsk_queue struct.
producer lock should be moved there, specifically:
struct xdp_ring {
u32 producer ____cacheline_aligned_in_smp;
spinlock_t producer_lock;
// ...
I'm a bit lost in the structs indirection, but I think the above would
be beneficial even for the ZC path.
/P
* Re: [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-12-03 9:24 ` Paolo Abeni
@ 2025-12-03 9:40 ` Magnus Karlsson
2025-12-03 11:16 ` Jason Xing
0 siblings, 1 reply; 10+ messages in thread
From: Magnus Karlsson @ 2025-12-03 9:40 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jason Xing, davem, edumazet, kuba, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Wed, 3 Dec 2025 at 10:25, Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 12/3/25 7:56 AM, Jason Xing wrote:
> > On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >>> On 11/28/25 2:46 PM, Jason Xing wrote:
> >>>> From: Jason Xing <kernelxing@tencent.com>
> >>>>
> >>>> Use atomic_try_cmpxchg operations to replace spin lock. Technically
> >>>> CAS (Compare And Swap) is better than a coarse way like spin-lock
> >>>> especially when we only need to perform a few simple operations.
> >>>> Similar idea can also be found in the recent commit 100dfa74cad9
> >>>> ("net: dev_queue_xmit() llist adoption") that implements the lockless
> >>>> logic with the help of try_cmpxchg.
> >>>>
> >>>> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> >>>> ---
> >>>> Paolo, sorry that I didn't try to move the lock to struct xsk_queue
> >>>> because after investigation I reckon try_cmpxchg can add less overhead
> >>>> when multiple xsks contend at this point. So I hope this approach
> >>>> can be adopted.
> >>>
> >>> I still think that moving the lock would be preferable, because it makes
> >>> sense also from a maintenance perspective.
> >>
> >> I can see your point here. Sure, moving the lock is relatively easier
> >> to understand. But my take is that atomic changes here are not that
> >> hard to read :) It has the same effect as spin lock because it will
> >> atomically check, compare and set in try_cmpxchg().
> >>
> >>> Can you report the difference
> >>> you measure atomics vs moving the spin lock?
> >>
> >> No problem, hopefully I will give a detailed report next week because
> >> I'm going to apply it directly in production where we have multiple
> >> xsk sharing the same umem.
> >
> > I'm done with the test in production where a few applications rely on
> > multiple xsks sharing the same pool to send UDP packets. Here are
> > significant numbers from bcc tool that recorded the latency caused by
> > these particular functions:
> >
> > 1. use spin lock
> > $ sudo ./funclatency xsk_cq_reserve_locked
> > Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
> > ^C
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 25308114 |** |
> > 256 -> 511 : 283924647 |********************** |
> > 512 -> 1023 : 501589652 |****************************************|
> > 1024 -> 2047 : 93045664 |******* |
> > 2048 -> 4095 : 746395 | |
> > 4096 -> 8191 : 424053 | |
> > 8192 -> 16383 : 1041 | |
> > 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 6 | |
> > 1048576 -> 2097151 : 2 | |
> >
> > avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
> >
> > 2. use atomic
> > $ sudo ./funclatency xsk_cq_cached_prod_reserve
> > Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
> > ^C
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 109815401 |********* |
> > 256 -> 511 : 485028947 |****************************************|
> > 512 -> 1023 : 320121627 |************************** |
> > 1024 -> 2047 : 38538584 |*** |
> > 2048 -> 4095 : 377026 | |
> > 4096 -> 8191 : 340961 | |
> > 8192 -> 16383 : 549 | |
> > 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 10 | |
> >
> > avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
> >
> > And those numbers were verified over and over again which means they
> > are quite stable.
> >
> > You can see that when using atomic, the avg is smaller and the count
> > of [128 -> 255] is larger, which shows better performance.
> >
> > I will add the above numbers in the commit log after the merge window is open.
>
> It's not just a matter of performance. Spinlock additionally give you
> fairness and lockdep guarantees, beyond being easier to graps for
> however is going to touch this code in the future, while raw atomic none
> of them.
>
> From a maintainability perspective spinlocks are much more preferable.
>
> IMHO micro-benchmarking is not a strong enough argument to counter the
> spinlock adavantages: at very _least_ large performance gain should be
> observed in relevant test-cases and/or real live workloads.
>
> >>> Have you tried moving cq_prod_lock, too?
> >>
> >> Not yet, thanks for reminding me. It should not affect the sending
> >> rate but the tx completion time, I think.
> >
> > I also tried moving this lock, but sadly I noticed that in completion
> > time the lock was set which led to invalidation of the cache line of
> > another thread sending packets. It can be obviously proved by perf
> > cycles:ppp:
> > 1. before
> > 8.70% xsk_cq_cached_prod_reserve
> >
> > 2. after
> > 12.31% xsk_cq_cached_prod_reserve
> >
> > So I decided not to bring such a modification. Anyway, thanks for your
> > valuable suggestions and I learnt a lot from those interesting
> > experiments.
>
> The goal of such change would be reducing the number of touched
> cachelines; when I suggested the above, I did not dive into the producer
> specifics, I assumed the relevant producer data were inside the
> xsk_queue struct.
>
> It looks like the data is actually inside 'struct xdp_ring', so the
> producer lock should be moved there, specifically:
>
> struct xdp_ring {
> u32 producer ____cacheline_aligned_in_smp;
> spinlock_t producer_lock;
> // ...
This struct is reflected to user space, so we should not put the spin
lock there. But you could put it in struct xsk_queue, though perhaps you
would want to call it something more generic, as the lock would then be
present in all queues/rings even though you would only use it for the
cq. Some of the members in xsk_queue are nearly always used when
manipulating the ring, so the cache line should be hot.
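Something like the below, just as a sketch (placement and naming are of
course up for discussion):
	struct xsk_queue {
		u32 ring_mask;
		u32 nentries;
		u32 cached_prod;
		u32 cached_cons;
		/* generic name since it would sit in every queue, even
		 * though in practice only the cq uses it in copy mode
		 */
		spinlock_t prod_lock;
		struct xdp_ring *ring;
		u64 invalid_descs;
		/* ... */
	};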
I am just thinking aloud about whether this would be correct. There is
one pool per cq. When a pool is shared, the cq belonging to that pool is
also always shared, so I think it would be correct to move the lock from
the pool to the cq.
> I'm a bit lost in the structs indirection, but I think the above would
> be beneficial even for the ZC path.
>
> /P
>
>
* Re: [PATCH net-next v3 2/3] xsk: use atomic operations around cached_prod for copy mode
2025-12-03 9:40 ` Magnus Karlsson
@ 2025-12-03 11:16 ` Jason Xing
0 siblings, 0 replies; 10+ messages in thread
From: Jason Xing @ 2025-12-03 11:16 UTC (permalink / raw)
To: Magnus Karlsson
Cc: Paolo Abeni, davem, edumazet, kuba, bjorn, magnus.karlsson,
maciej.fijalkowski, jonathan.lemon, sdf, ast, daniel, hawk,
john.fastabend, horms, andrew+netdev, bpf, netdev, Jason Xing
On Wed, Dec 3, 2025 at 5:40 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Wed, 3 Dec 2025 at 10:25, Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 12/3/25 7:56 AM, Jason Xing wrote:
> > > On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >>> On 11/28/25 2:46 PM, Jason Xing wrote:
> > >>>> From: Jason Xing <kernelxing@tencent.com>
> > >>>>
> > >>>> Use atomic_try_cmpxchg operations to replace spin lock. Technically
> > >>>> CAS (Compare And Swap) is better than a coarse way like spin-lock
> > >>>> especially when we only need to perform a few simple operations.
> > >>>> Similar idea can also be found in the recent commit 100dfa74cad9
> > >>>> ("net: dev_queue_xmit() llist adoption") that implements the lockless
> > >>>> logic with the help of try_cmpxchg.
> > >>>>
> > >>>> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > >>>> ---
> > >>>> Paolo, sorry that I didn't try to move the lock to struct xsk_queue
> > >>>> because after investigation I reckon try_cmpxchg can add less overhead
> > >>>> when multiple xsks contend at this point. So I hope this approach
> > >>>> can be adopted.
> > >>>
> > >>> I still think that moving the lock would be preferable, because it makes
> > >>> sense also from a maintenance perspective.
> > >>
> > >> I can see your point here. Sure, moving the lock is relatively easier
> > >> to understand. But my take is that atomic changes here are not that
> > >> hard to read :) It has the same effect as spin lock because it will
> > >> atomically check, compare and set in try_cmpxchg().
> > >>
> > >>> Can you report the difference
> > >>> you measure atomics vs moving the spin lock?
> > >>
> > >> No problem, hopefully I will give a detailed report next week because
> > >> I'm going to apply it directly in production where we have multiple
> > >> xsk sharing the same umem.
> > >
> > > I'm done with the test in production where a few applications rely on
> > > multiple xsks sharing the same pool to send UDP packets. Here are
> > > significant numbers from bcc tool that recorded the latency caused by
> > > these particular functions:
> > >
> > > 1. use spin lock
> > > $ sudo ./funclatency xsk_cq_reserve_locked
> > > Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
> > > ^C
> > > nsecs : count distribution
> > > 0 -> 1 : 0 | |
> > > 2 -> 3 : 0 | |
> > > 4 -> 7 : 0 | |
> > > 8 -> 15 : 0 | |
> > > 16 -> 31 : 0 | |
> > > 32 -> 63 : 0 | |
> > > 64 -> 127 : 0 | |
> > > 128 -> 255 : 25308114 |** |
> > > 256 -> 511 : 283924647 |********************** |
> > > 512 -> 1023 : 501589652 |****************************************|
> > > 1024 -> 2047 : 93045664 |******* |
> > > 2048 -> 4095 : 746395 | |
> > > 4096 -> 8191 : 424053 | |
> > > 8192 -> 16383 : 1041 | |
> > > 16384 -> 32767 : 0 | |
> > > 32768 -> 65535 : 0 | |
> > > 65536 -> 131071 : 0 | |
> > > 131072 -> 262143 : 0 | |
> > > 262144 -> 524287 : 0 | |
> > > 524288 -> 1048575 : 6 | |
> > > 1048576 -> 2097151 : 2 | |
> > >
> > > avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
> > >
> > > 2. use atomic
> > > $ sudo ./funclatency xsk_cq_cached_prod_reserve
> > > Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
> > > ^C
> > > nsecs : count distribution
> > > 0 -> 1 : 0 | |
> > > 2 -> 3 : 0 | |
> > > 4 -> 7 : 0 | |
> > > 8 -> 15 : 0 | |
> > > 16 -> 31 : 0 | |
> > > 32 -> 63 : 0 | |
> > > 64 -> 127 : 0 | |
> > > 128 -> 255 : 109815401 |********* |
> > > 256 -> 511 : 485028947 |****************************************|
> > > 512 -> 1023 : 320121627 |************************** |
> > > 1024 -> 2047 : 38538584 |*** |
> > > 2048 -> 4095 : 377026 | |
> > > 4096 -> 8191 : 340961 | |
> > > 8192 -> 16383 : 549 | |
> > > 16384 -> 32767 : 0 | |
> > > 32768 -> 65535 : 0 | |
> > > 65536 -> 131071 : 0 | |
> > > 131072 -> 262143 : 0 | |
> > > 262144 -> 524287 : 0 | |
> > > 524288 -> 1048575 : 10 | |
> > >
> > > avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
> > >
> > > And those numbers were verified over and over again which means they
> > > are quite stable.
> > >
> > > You can see that when using atomic, the avg is smaller and the count
> > > of [128 -> 255] is larger, which shows better performance.
> > >
> > > I will add the above numbers in the commit log after the merge window is open.
> >
> > It's not just a matter of performance. Spinlock additionally give you
> > fairness and lockdep guarantees, beyond being easier to graps for
> > however is going to touch this code in the future, while raw atomic none
> > of them.
Right.
> >
> > From a maintainability perspective spinlocks are much more preferable.
> >
> > IMHO micro-benchmarking is not a strong enough argument to counter the
> > spinlock adavantages: at very _least_ large performance gain should be
> > observed in relevant test-cases and/or real live workloads.
The problem is that I have no good benchmark to show the minor
improvement because it requires multiple xsks. Xdpsock has bugs and
doesn't allow two xsks to run in parallel. Unlike the single-xsk
scenario, it's really difficult for me to measure the improvement at the
moment, so I resorted to observing latency.
With that said, I will follow your suggestion and move that lock :)
> >
> > >>> Have you tried moving cq_prod_lock, too?
> > >>
> > >> Not yet, thanks for reminding me. It should not affect the sending
> > >> rate but the tx completion time, I think.
> > >
> > > I also tried moving this lock, but sadly I noticed that in completion
> > > time the lock was set which led to invalidation of the cache line of
> > > another thread sending packets. It can be obviously proved by perf
> > > cycles:ppp:
> > > 1. before
> > > 8.70% xsk_cq_cached_prod_reserve
> > >
> > > 2. after
> > > 12.31% xsk_cq_cached_prod_reserve
> > >
> > > So I decided not to bring such a modification. Anyway, thanks for your
> > > valuable suggestions and I learnt a lot from those interesting
> > > experiments.
> >
> > The goal of such change would be reducing the number of touched
> > cachelines;
Yep.
>> when I suggested the above, I did not dive into the producer
> > specifics, I assumed the relevant producer data were inside the
> > xsk_queue struct.
> >
> > It looks like the data is actually inside 'struct xdp_ring', so the
> > producer lock should be moved there, specifically:
> >
> > struct xdp_ring {
> > u32 producer ____cacheline_aligned_in_smp;
> > spinlock_t producer_lock;
> > // ...
>
> This struct is reflected to user space, so we should not put the spin
> lock there.
Agree on this point.
> But you could put it in struct xsk_queue, but perhaps you
My concern is that only the cq uses this lock while this structure is
used by all types of queues.
> would want to call it something more generic as there would be a lock
> present in all queues/rings, though you would only use it for the cq.
> Some of the members in xsk_queue are nearly always used when
> manipulating the ring, so the cache line should be hot.
>
> I am just thinking aloud if this would be correct. There is one pool
> per cq. When a pool is shared, the cq belonging to that pool is also
> always shared, so I think that would be correct moving the lock from
> the pool to the cq.
Exactly, that is what Paolo suggested previously.
>
> > I'm a bit lost in the structs indirection, but I think the above would
> > be beneficial even for the ZC path.
Spot on, that will be part of my future plan. Thanks!
Thanks,
Jason