* [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
@ 2025-09-21 9:58 Eric Dumazet
2025-09-22 2:21 ` Willem de Bruijn
2025-09-22 8:37 ` Paolo Abeni
0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-21 9:58 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Willem de Bruijn, Kuniyuki Iwashima, netdev,
eric.dumazet, Eric Dumazet
busylock was protecting UDP sockets against packet floods,
but unfortunately was not protecting the host itself.
Under stress, many cpus could spin while acquiring the busylock,
and the NIC had to drop packets. Or packets would be dropped
in the cpu backlog if RPS/RFS were in place.
This patch replaces the busylock with intermediate
lockless queues (one queue per NUMA node).
This means that fewer cpus have to acquire
the UDP receive queue lock.
Most of the cpus can either:
- immediately drop the packet.
- or queue it in their NUMA-aware lockless queue.
Then one of the cpus is chosen to process this lockless queue
in a batch.
The batch only contains packets that were cooked on the same
NUMA node, thus with very limited latency impact.
Tested:
DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
(Intel(R) Xeon(R) 6985P-C)
Before:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams 1004179 0.0
Udp6InErrors 3117 0.0
Udp6RcvbufErrors 3117 0.0
After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams 1116633 0.0
Udp6InErrors 14197275 0.0
Udp6RcvbufErrors 14197275 0.0
We can see this host can now process 14.2 M more packets per second
while under attack, and the victim socket can receive 11% more
packets.
I used a small bpftrace program measuring time (in us) spent in
__udp_enqueue_schedule_skb().
Before:
@udp_enqueue_us[398]:
[0] 24901 |@@@ |
[1] 63512 |@@@@@@@@@ |
[2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 54022 |@@@@@@@@ |
[16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 4219 | |
[128, 256) 188 | |
After:
@udp_enqueue_us[398]:
[0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 1111277 |@@@@@@@@@@ |
[2, 4) 501439 |@@@@ |
[4, 8) 102921 | |
[8, 16) 29895 | |
[16, 32) 43500 | |
[32, 64) 31552 | |
[64, 128) 979 | |
[128, 256) 13 | |
Note that the remaining bottleneck for this platform is in
udp_drops_inc() because we limited struct numa_drop_counters
to only two nodes so far.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v3: - Moved kfree(up->udp_prod_queue) to udp_destruct_common(),
addressing reports from Jakub and syzbot.
- Perform SKB_DROP_REASON_PROTO_MEM drops after the queue
spinlock is released.
v2: https://lore.kernel.org/netdev/20250920080227.3674860-1-edumazet@google.com/
- Added a kfree(up->udp_prod_queue) in udpv6_destroy_sock() (Jakub feedback on v1)
- Added bpftrace histograms in changelog.
v1: https://lore.kernel.org/netdev/20250919164308.2455564-1-edumazet@google.com/
include/linux/udp.h | 9 +++-
include/net/udp.h | 11 ++++-
net/ipv4/udp.c | 114 ++++++++++++++++++++++++++------------------
net/ipv6/udp.c | 5 +-
4 files changed, 88 insertions(+), 51 deletions(-)
diff --git a/include/linux/udp.h b/include/linux/udp.h
index e554890c4415b411f35007d3ece9e6042db7a544..58795688a18636ea79aa1f5d06eacc676a2e7849 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -44,6 +44,12 @@ enum {
UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
};
+/* per NUMA structure for lockless producer usage. */
+struct udp_prod_queue {
+ struct llist_head ll_root ____cacheline_aligned_in_smp;
+ atomic_t rmem_alloc;
+};
+
struct udp_sock {
/* inet_sock has to be the first member */
struct inet_sock inet;
@@ -90,6 +96,8 @@ struct udp_sock {
struct sk_buff *skb,
int nhoff);
+ struct udp_prod_queue *udp_prod_queue;
+
/* udp_recvmsg try to use this before splicing sk_receive_queue */
struct sk_buff_head reader_queue ____cacheline_aligned_in_smp;
@@ -109,7 +117,6 @@ struct udp_sock {
*/
struct hlist_node tunnel_list;
struct numa_drop_counters drop_counters;
- spinlock_t busylock ____cacheline_aligned_in_smp;
};
#define udp_test_bit(nr, sk) \
diff --git a/include/net/udp.h b/include/net/udp.h
index 059a0cee5f559b8d75e71031a00d0aa2769e257f..cffedb3e40f24513e44fb7598c0ad917fd15b616 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -284,16 +284,23 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
netdev_features_t features, bool is_ipv6);
-static inline void udp_lib_init_sock(struct sock *sk)
+static inline int udp_lib_init_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
sk->sk_drop_counters = &up->drop_counters;
- spin_lock_init(&up->busylock);
skb_queue_head_init(&up->reader_queue);
INIT_HLIST_NODE(&up->tunnel_list);
up->forward_threshold = sk->sk_rcvbuf >> 2;
set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
+
+ up->udp_prod_queue = kcalloc(nr_node_ids, sizeof(*up->udp_prod_queue),
+ GFP_KERNEL);
+ if (!up->udp_prod_queue)
+ return -ENOMEM;
+ for (int i = 0; i < nr_node_ids; i++)
+ init_llist_head(&up->udp_prod_queue[i].ll_root);
+ return 0;
}
static inline void udp_drops_inc(struct sock *sk)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 85cfc32eb2ccb3e229177fb37910fefde0254ffe..fce1d0ffd6361d271ae3528fea026a8d6c07ac6e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1685,25 +1685,6 @@ static void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
udp_rmem_release(sk, udp_skb_truesize(skb), 1, true);
}
-/* Idea of busylocks is to let producers grab an extra spinlock
- * to relieve pressure on the receive_queue spinlock shared by consumer.
- * Under flood, this means that only one producer can be in line
- * trying to acquire the receive_queue spinlock.
- */
-static spinlock_t *busylock_acquire(struct sock *sk)
-{
- spinlock_t *busy = &udp_sk(sk)->busylock;
-
- spin_lock(busy);
- return busy;
-}
-
-static void busylock_release(spinlock_t *busy)
-{
- if (busy)
- spin_unlock(busy);
-}
-
static int udp_rmem_schedule(struct sock *sk, int size)
{
int delta;
@@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
{
struct sk_buff_head *list = &sk->sk_receive_queue;
+ struct udp_prod_queue *udp_prod_queue;
+ struct sk_buff *next, *to_drop = NULL;
+ struct llist_node *ll_list;
unsigned int rmem, rcvbuf;
- spinlock_t *busy = NULL;
int size, err = -ENOMEM;
+ int total_size = 0;
+ int q_size = 0;
+ int nb = 0;
rmem = atomic_read(&sk->sk_rmem_alloc);
rcvbuf = READ_ONCE(sk->sk_rcvbuf);
size = skb->truesize;
+ udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
+
+ rmem += atomic_read(&udp_prod_queue->rmem_alloc);
+
/* Immediately drop when the receive queue is full.
* Cast to unsigned int performs the boundary check for INT_MAX.
*/
@@ -1747,45 +1737,75 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
if (rmem > (rcvbuf >> 1)) {
skb_condense(skb);
size = skb->truesize;
- rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
- if (rmem > rcvbuf)
- goto uncharge_drop;
- busy = busylock_acquire(sk);
- } else {
- atomic_add(size, &sk->sk_rmem_alloc);
}
udp_set_dev_scratch(skb);
+ atomic_add(size, &udp_prod_queue->rmem_alloc);
+
+ if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
+ return 0;
+
spin_lock(&list->lock);
- err = udp_rmem_schedule(sk, size);
- if (err) {
- spin_unlock(&list->lock);
- goto uncharge_drop;
- }
- sk_forward_alloc_add(sk, -size);
+ ll_list = llist_del_all(&udp_prod_queue->ll_root);
- /* no need to setup a destructor, we will explicitly release the
- * forward allocated memory on dequeue
- */
- sock_skb_set_dropcount(sk, skb);
+ ll_list = llist_reverse_order(ll_list);
+
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
+ size = udp_skb_truesize(skb);
+ total_size += size;
+ err = udp_rmem_schedule(sk, size);
+ if (unlikely(err)) {
+ /* Free the skbs outside of locked section. */
+ skb->next = to_drop;
+ to_drop = skb;
+ continue;
+ }
+
+ q_size += size;
+ sk_forward_alloc_add(sk, -size);
+
+ /* no need to setup a destructor, we will explicitly release the
+ * forward allocated memory on dequeue
+ */
+ sock_skb_set_dropcount(sk, skb);
+ nb++;
+ __skb_queue_tail(list, skb);
+ }
+
+ atomic_add(q_size, &sk->sk_rmem_alloc);
- __skb_queue_tail(list, skb);
spin_unlock(&list->lock);
- if (!sock_flag(sk, SOCK_DEAD))
- INDIRECT_CALL_1(sk->sk_data_ready, sock_def_readable, sk);
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ /* Multiple threads might be blocked in recvmsg(),
+ * using prepare_to_wait_exclusive().
+ */
+ while (nb) {
+ INDIRECT_CALL_1(sk->sk_data_ready,
+ sock_def_readable, sk);
+ nb--;
+ }
+ }
+
+ if (unlikely(to_drop)) {
+ for (nb = 0; to_drop != NULL; nb++) {
+ skb = to_drop;
+ to_drop = skb->next;
+ skb_mark_not_on_list(skb);
+ /* TODO: update SNMP values. */
+ sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_PROTO_MEM);
+ }
+ numa_drop_add(&udp_sk(sk)->drop_counters, nb);
+ }
- busylock_release(busy);
- return 0;
+ atomic_sub(total_size, &udp_prod_queue->rmem_alloc);
-uncharge_drop:
- atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+ return 0;
drop:
udp_drops_inc(sk);
- busylock_release(busy);
return err;
}
EXPORT_IPV6_MOD_GPL(__udp_enqueue_schedule_skb);
@@ -1803,6 +1823,7 @@ void udp_destruct_common(struct sock *sk)
kfree_skb(skb);
}
udp_rmem_release(sk, total, 0, true);
+ kfree(up->udp_prod_queue);
}
EXPORT_IPV6_MOD_GPL(udp_destruct_common);
@@ -1814,10 +1835,11 @@ static void udp_destruct_sock(struct sock *sk)
int udp_init_sock(struct sock *sk)
{
- udp_lib_init_sock(sk);
+ int res = udp_lib_init_sock(sk);
+
sk->sk_destruct = udp_destruct_sock;
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
- return 0;
+ return res;
}
void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 9f4d340d1e3a63d38f80138ef9f6aac4a33afa05..813a2ba75824d14631642bf6973f65063b2825cb 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -67,10 +67,11 @@ static void udpv6_destruct_sock(struct sock *sk)
int udpv6_init_sock(struct sock *sk)
{
- udp_lib_init_sock(sk);
+ int res = udp_lib_init_sock(sk);
+
sk->sk_destruct = udpv6_destruct_sock;
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
- return 0;
+ return res;
}
INDIRECT_CALLABLE_SCOPE
--
2.51.0.470.ga7dc726c21-goog
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-21 9:58 [PATCH v3 net-next] udp: remove busylock and add per NUMA queues Eric Dumazet
@ 2025-09-22 2:21 ` Willem de Bruijn
2025-09-22 7:31 ` Eric Dumazet
2025-09-22 8:37 ` Paolo Abeni
1 sibling, 1 reply; 11+ messages in thread
From: Willem de Bruijn @ 2025-09-22 2:21 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Willem de Bruijn, Kuniyuki Iwashima, netdev,
eric.dumazet, Eric Dumazet
Eric Dumazet wrote:
> busylock was protecting UDP sockets against packet floods,
> but unfortunately was not protecting the host itself.
>
> Under stress, many cpus could spin while acquiring the busylock,
> and NIC had to drop packets. Or packets would be dropped
> in cpu backlog if RPS/RFS were in place.
>
> This patch replaces the busylock by intermediate
> lockless queues. (One queue per NUMA node).
>
> This means that fewer number of cpus have to acquire
> the UDP receive queue lock.
>
> Most of the cpus can either:
> - immediately drop the packet.
> - or queue it in their NUMA aware lockless queue.
>
> Then one of the cpu is chosen to process this lockless queue
> in a batch.
>
> The batch only contains packets that were cooked on the same
> NUMA node, thus with very limited latency impact.
>
> Tested:
>
> DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
> (Intel(R) Xeon(R) 6985P-C)
>
> Before:
>
> nstat -n ; sleep 1 ; nstat | grep Udp
> Udp6InDatagrams 1004179 0.0
> Udp6InErrors 3117 0.0
> Udp6RcvbufErrors 3117 0.0
>
> After:
> nstat -n ; sleep 1 ; nstat | grep Udp
> Udp6InDatagrams 1116633 0.0
> Udp6InErrors 14197275 0.0
> Udp6RcvbufErrors 14197275 0.0
>
> We can see this host can now proces 14.2 M more packets per second
> while under attack, and the victim socket can receive 11 % more
> packets.
Impressive gains under DoS!
Main concern is that it adds an extra queue/dequeue and thus some
cycle cost for all udp sockets in the common case where they are not
contended. These are simple linked list operations, so I suppose the
only cost may be the cacheline if not warm. Busylock had the nice
property of only being used under mem pressure. Could this benefit
from the same?
> I used a small bpftrace program measuring time (in us) spent in
> __udp_enqueue_schedule_skb().
>
> Before:
>
> @udp_enqueue_us[398]:
> [0] 24901 |@@@ |
> [1] 63512 |@@@@@@@@@ |
> [2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8, 16) 54022 |@@@@@@@@ |
> [16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [64, 128) 4219 | |
> [128, 256) 188 | |
>
> After:
>
> @udp_enqueue_us[398]:
> [0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1] 1111277 |@@@@@@@@@@ |
> [2, 4) 501439 |@@@@ |
> [4, 8) 102921 | |
> [8, 16) 29895 | |
> [16, 32) 43500 | |
> [32, 64) 31552 | |
> [64, 128) 979 | |
> [128, 256) 13 | |
>
> Note that the remaining bottleneck for this platform is in
> udp_drops_inc() because we limited struct numa_drop_counters
> to only two nodes so far.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v3: - Moved kfree(up->udp_prod_queue) to udp_destruct_common(),
> addressing reports from Jakub and syzbot.
>
> - Perform SKB_DROP_REASON_PROTO_MEM drops after the queue
> spinlock is released.
>
> v2: https://lore.kernel.org/netdev/20250920080227.3674860-1-edumazet@google.com/
> - Added a kfree(up->udp_prod_queue) in udpv6_destroy_sock() (Jakub feedback on v1)
> - Added bpftrace histograms in changelog.
>
> v1: https://lore.kernel.org/netdev/20250919164308.2455564-1-edumazet@google.com/
>
> include/linux/udp.h | 9 +++-
> include/net/udp.h | 11 ++++-
> net/ipv4/udp.c | 114 ++++++++++++++++++++++++++------------------
> net/ipv6/udp.c | 5 +-
> 4 files changed, 88 insertions(+), 51 deletions(-)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index e554890c4415b411f35007d3ece9e6042db7a544..58795688a18636ea79aa1f5d06eacc676a2e7849 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -44,6 +44,12 @@ enum {
> UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
> };
>
> +/* per NUMA structure for lockless producer usage. */
> +struct udp_prod_queue {
> + struct llist_head ll_root ____cacheline_aligned_in_smp;
> + atomic_t rmem_alloc;
> +};
> +
> struct udp_sock {
> /* inet_sock has to be the first member */
> struct inet_sock inet;
> @@ -90,6 +96,8 @@ struct udp_sock {
> struct sk_buff *skb,
> int nhoff);
>
> + struct udp_prod_queue *udp_prod_queue;
> +
> /* udp_recvmsg try to use this before splicing sk_receive_queue */
> struct sk_buff_head reader_queue ____cacheline_aligned_in_smp;
>
> @@ -109,7 +117,6 @@ struct udp_sock {
> */
> struct hlist_node tunnel_list;
> struct numa_drop_counters drop_counters;
> - spinlock_t busylock ____cacheline_aligned_in_smp;
> };
>
> #define udp_test_bit(nr, sk) \
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 059a0cee5f559b8d75e71031a00d0aa2769e257f..cffedb3e40f24513e44fb7598c0ad917fd15b616 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -284,16 +284,23 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
> struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> netdev_features_t features, bool is_ipv6);
>
> -static inline void udp_lib_init_sock(struct sock *sk)
> +static inline int udp_lib_init_sock(struct sock *sk)
> {
> struct udp_sock *up = udp_sk(sk);
>
> sk->sk_drop_counters = &up->drop_counters;
> - spin_lock_init(&up->busylock);
> skb_queue_head_init(&up->reader_queue);
> INIT_HLIST_NODE(&up->tunnel_list);
> up->forward_threshold = sk->sk_rcvbuf >> 2;
> set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> +
> + up->udp_prod_queue = kcalloc(nr_node_ids, sizeof(*up->udp_prod_queue),
> + GFP_KERNEL);
> + if (!up->udp_prod_queue)
> + return -ENOMEM;
> + for (int i = 0; i < nr_node_ids; i++)
> + init_llist_head(&up->udp_prod_queue[i].ll_root);
> + return 0;
> }
>
> static inline void udp_drops_inc(struct sock *sk)
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 85cfc32eb2ccb3e229177fb37910fefde0254ffe..fce1d0ffd6361d271ae3528fea026a8d6c07ac6e 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1685,25 +1685,6 @@ static void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
> udp_rmem_release(sk, udp_skb_truesize(skb), 1, true);
> }
>
> -/* Idea of busylocks is to let producers grab an extra spinlock
> - * to relieve pressure on the receive_queue spinlock shared by consumer.
> - * Under flood, this means that only one producer can be in line
> - * trying to acquire the receive_queue spinlock.
> - */
> -static spinlock_t *busylock_acquire(struct sock *sk)
> -{
> - spinlock_t *busy = &udp_sk(sk)->busylock;
> -
> - spin_lock(busy);
> - return busy;
> -}
> -
> -static void busylock_release(spinlock_t *busy)
> -{
> - if (busy)
> - spin_unlock(busy);
> -}
> -
> static int udp_rmem_schedule(struct sock *sk, int size)
> {
> int delta;
> @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> {
> struct sk_buff_head *list = &sk->sk_receive_queue;
> + struct udp_prod_queue *udp_prod_queue;
> + struct sk_buff *next, *to_drop = NULL;
> + struct llist_node *ll_list;
> unsigned int rmem, rcvbuf;
> - spinlock_t *busy = NULL;
> int size, err = -ENOMEM;
> + int total_size = 0;
> + int q_size = 0;
> + int nb = 0;
>
> rmem = atomic_read(&sk->sk_rmem_alloc);
> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> size = skb->truesize;
>
> + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
There is a small chance that a cpu enqueues to this queue and no
further arrivals on that numa node happen, stranding skbs on this
intermediate queue, right? If so, those are leaked on
udp_destruct_common.
> +
> + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> +
> /* Immediately drop when the receive queue is full.
> * Cast to unsigned int performs the boundary check for INT_MAX.
> */
> @@ -1747,45 +1737,75 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> if (rmem > (rcvbuf >> 1)) {
> skb_condense(skb);
> size = skb->truesize;
> - rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> - if (rmem > rcvbuf)
> - goto uncharge_drop;
> - busy = busylock_acquire(sk);
> - } else {
> - atomic_add(size, &sk->sk_rmem_alloc);
> }
>
> udp_set_dev_scratch(skb);
>
> + atomic_add(size, &udp_prod_queue->rmem_alloc);
> +
> + if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
> + return 0;
> +
> spin_lock(&list->lock);
> - err = udp_rmem_schedule(sk, size);
> - if (err) {
> - spin_unlock(&list->lock);
> - goto uncharge_drop;
> - }
>
> - sk_forward_alloc_add(sk, -size);
> + ll_list = llist_del_all(&udp_prod_queue->ll_root);
>
> - /* no need to setup a destructor, we will explicitly release the
> - * forward allocated memory on dequeue
> - */
> - sock_skb_set_dropcount(sk, skb);
> + ll_list = llist_reverse_order(ll_list);
> +
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> + size = udp_skb_truesize(skb);
> + total_size += size;
> + err = udp_rmem_schedule(sk, size);
> + if (unlikely(err)) {
> + /* Free the skbs outside of locked section. */
> + skb->next = to_drop;
> + to_drop = skb;
> + continue;
> + }
> +
> + q_size += size;
> + sk_forward_alloc_add(sk, -size);
> +
> + /* no need to setup a destructor, we will explicitly release the
> + * forward allocated memory on dequeue
> + */
> + sock_skb_set_dropcount(sk, skb);
Since drop counters are approximate, read these once and report the
same for all packets in a batch?
> + nb++;
> + __skb_queue_tail(list, skb);
> + }
> +
> + atomic_add(q_size, &sk->sk_rmem_alloc);
>
> - __skb_queue_tail(list, skb);
> spin_unlock(&list->lock);
>
> - if (!sock_flag(sk, SOCK_DEAD))
> - INDIRECT_CALL_1(sk->sk_data_ready, sock_def_readable, sk);
> + if (!sock_flag(sk, SOCK_DEAD)) {
> + /* Multiple threads might be blocked in recvmsg(),
> + * using prepare_to_wait_exclusive().
> + */
> + while (nb) {
> + INDIRECT_CALL_1(sk->sk_data_ready,
> + sock_def_readable, sk);
> + nb--;
> + }
> + }
> +
> + if (unlikely(to_drop)) {
> + for (nb = 0; to_drop != NULL; nb++) {
> + skb = to_drop;
> + to_drop = skb->next;
> + skb_mark_not_on_list(skb);
> + /* TODO: update SNMP values. */
> + sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_PROTO_MEM);
> + }
> + numa_drop_add(&udp_sk(sk)->drop_counters, nb);
> + }
>
> - busylock_release(busy);
> - return 0;
> + atomic_sub(total_size, &udp_prod_queue->rmem_alloc);
>
> -uncharge_drop:
> - atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
> + return 0;
>
> drop:
> udp_drops_inc(sk);
> - busylock_release(busy);
> return err;
> }
> EXPORT_IPV6_MOD_GPL(__udp_enqueue_schedule_skb);
> @@ -1803,6 +1823,7 @@ void udp_destruct_common(struct sock *sk)
> kfree_skb(skb);
> }
> udp_rmem_release(sk, total, 0, true);
> + kfree(up->udp_prod_queue);
> }
> EXPORT_IPV6_MOD_GPL(udp_destruct_common);
>
> @@ -1814,10 +1835,11 @@ static void udp_destruct_sock(struct sock *sk)
>
> int udp_init_sock(struct sock *sk)
> {
> - udp_lib_init_sock(sk);
> + int res = udp_lib_init_sock(sk);
> +
> sk->sk_destruct = udp_destruct_sock;
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> - return 0;
> + return res;
> }
>
> void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 9f4d340d1e3a63d38f80138ef9f6aac4a33afa05..813a2ba75824d14631642bf6973f65063b2825cb 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -67,10 +67,11 @@ static void udpv6_destruct_sock(struct sock *sk)
>
> int udpv6_init_sock(struct sock *sk)
> {
> - udp_lib_init_sock(sk);
> + int res = udp_lib_init_sock(sk);
> +
> sk->sk_destruct = udpv6_destruct_sock;
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> - return 0;
> + return res;
> }
>
> INDIRECT_CALLABLE_SCOPE
> --
> 2.51.0.470.ga7dc726c21-goog
>
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 2:21 ` Willem de Bruijn
@ 2025-09-22 7:31 ` Eric Dumazet
2025-09-22 12:38 ` Willem de Bruijn
0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2025-09-22 7:31 UTC (permalink / raw)
To: Willem de Bruijn
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, Kuniyuki Iwashima, netdev, eric.dumazet
On Sun, Sep 21, 2025 at 7:21 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Eric Dumazet wrote:
> > busylock was protecting UDP sockets against packet floods,
> > but unfortunately was not protecting the host itself.
> >
> > Under stress, many cpus could spin while acquiring the busylock,
> > and NIC had to drop packets. Or packets would be dropped
> > in cpu backlog if RPS/RFS were in place.
> >
> > This patch replaces the busylock by intermediate
> > lockless queues. (One queue per NUMA node).
> >
> > This means that fewer number of cpus have to acquire
> > the UDP receive queue lock.
> >
> > Most of the cpus can either:
> > - immediately drop the packet.
> > - or queue it in their NUMA aware lockless queue.
> >
> > Then one of the cpu is chosen to process this lockless queue
> > in a batch.
> >
> > The batch only contains packets that were cooked on the same
> > NUMA node, thus with very limited latency impact.
> >
> > Tested:
> >
> > DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
> > (Intel(R) Xeon(R) 6985P-C)
> >
> > Before:
> >
> > nstat -n ; sleep 1 ; nstat | grep Udp
> > Udp6InDatagrams 1004179 0.0
> > Udp6InErrors 3117 0.0
> > Udp6RcvbufErrors 3117 0.0
> >
> > After:
> > nstat -n ; sleep 1 ; nstat | grep Udp
> > Udp6InDatagrams 1116633 0.0
> > Udp6InErrors 14197275 0.0
> > Udp6RcvbufErrors 14197275 0.0
> >
> > We can see this host can now proces 14.2 M more packets per second
> > while under attack, and the victim socket can receive 11 % more
> > packets.
>
> Impressive gains under DoS!
>
> Main concern is that it adds an extra queue/dequeue and thus some
> cycle cost for all udp sockets in the common case where they are not
> contended. These are simple linked list operations, so I suppose the
> only cost may be the cacheline if not warm. Busylock had the nice
> property of only being used under mem pressure. Could this benefit
> from the same?
I hear you, but the extra cache line is local to the node, compared to
the prior non-local, shared cache line, especially on modern cpus
(AMD Turin / Venice or Intel Granite Rapids).
This is a very minor cost compared to the average delay between a
packet being received on the wire and being presented to
__udp_enqueue_schedule_skb().
And the core of the lockless algorithm is that all packets go through
the same logic.
See my following answer.
>
> > I used a small bpftrace program measuring time (in us) spent in
> > __udp_enqueue_schedule_skb().
> >
> > Before:
> >
> > @udp_enqueue_us[398]:
> > [0] 24901 |@@@ |
> > [1] 63512 |@@@@@@@@@ |
> > [2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [8, 16) 54022 |@@@@@@@@ |
> > [16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [64, 128) 4219 | |
> > [128, 256) 188 | |
> >
> > After:
> >
> > @udp_enqueue_us[398]:
> > [0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [1] 1111277 |@@@@@@@@@@ |
> > [2, 4) 501439 |@@@@ |
> > [4, 8) 102921 | |
> > [8, 16) 29895 | |
> > [16, 32) 43500 | |
> > [32, 64) 31552 | |
> > [64, 128) 979 | |
> > [128, 256) 13 | |
> >
> > Note that the remaining bottleneck for this platform is in
> > udp_drops_inc() because we limited struct numa_drop_counters
> > to only two nodes so far.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > v3: - Moved kfree(up->udp_prod_queue) to udp_destruct_common(),
> > addressing reports from Jakub and syzbot.
> >
> > - Perform SKB_DROP_REASON_PROTO_MEM drops after the queue
> > spinlock is released.
> >
> > v2: https://lore.kernel.org/netdev/20250920080227.3674860-1-edumazet@google.com/
> > - Added a kfree(up->udp_prod_queue) in udpv6_destroy_sock() (Jakub feedback on v1)
> > - Added bpftrace histograms in changelog.
> >
> > v1: https://lore.kernel.org/netdev/20250919164308.2455564-1-edumazet@google.com/
> >
> > include/linux/udp.h | 9 +++-
> > include/net/udp.h | 11 ++++-
> > net/ipv4/udp.c | 114 ++++++++++++++++++++++++++------------------
> > net/ipv6/udp.c | 5 +-
> > 4 files changed, 88 insertions(+), 51 deletions(-)
> >
> > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > index e554890c4415b411f35007d3ece9e6042db7a544..58795688a18636ea79aa1f5d06eacc676a2e7849 100644
> > --- a/include/linux/udp.h
> > +++ b/include/linux/udp.h
> > @@ -44,6 +44,12 @@ enum {
> > UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
> > };
> >
> > +/* per NUMA structure for lockless producer usage. */
> > +struct udp_prod_queue {
> > + struct llist_head ll_root ____cacheline_aligned_in_smp;
> > + atomic_t rmem_alloc;
> > +};
> > +
> > struct udp_sock {
> > /* inet_sock has to be the first member */
> > struct inet_sock inet;
> > @@ -90,6 +96,8 @@ struct udp_sock {
> > struct sk_buff *skb,
> > int nhoff);
> >
> > + struct udp_prod_queue *udp_prod_queue;
> > +
> > /* udp_recvmsg try to use this before splicing sk_receive_queue */
> > struct sk_buff_head reader_queue ____cacheline_aligned_in_smp;
> >
> > @@ -109,7 +117,6 @@ struct udp_sock {
> > */
> > struct hlist_node tunnel_list;
> > struct numa_drop_counters drop_counters;
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > };
> >
> > #define udp_test_bit(nr, sk) \
> > diff --git a/include/net/udp.h b/include/net/udp.h
> > index 059a0cee5f559b8d75e71031a00d0aa2769e257f..cffedb3e40f24513e44fb7598c0ad917fd15b616 100644
> > --- a/include/net/udp.h
> > +++ b/include/net/udp.h
> > @@ -284,16 +284,23 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
> > struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> > netdev_features_t features, bool is_ipv6);
> >
> > -static inline void udp_lib_init_sock(struct sock *sk)
> > +static inline int udp_lib_init_sock(struct sock *sk)
> > {
> > struct udp_sock *up = udp_sk(sk);
> >
> > sk->sk_drop_counters = &up->drop_counters;
> > - spin_lock_init(&up->busylock);
> > skb_queue_head_init(&up->reader_queue);
> > INIT_HLIST_NODE(&up->tunnel_list);
> > up->forward_threshold = sk->sk_rcvbuf >> 2;
> > set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> > +
> > + up->udp_prod_queue = kcalloc(nr_node_ids, sizeof(*up->udp_prod_queue),
> > + GFP_KERNEL);
> > + if (!up->udp_prod_queue)
> > + return -ENOMEM;
> > + for (int i = 0; i < nr_node_ids; i++)
> > + init_llist_head(&up->udp_prod_queue[i].ll_root);
> > + return 0;
> > }
> >
> > static inline void udp_drops_inc(struct sock *sk)
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 85cfc32eb2ccb3e229177fb37910fefde0254ffe..fce1d0ffd6361d271ae3528fea026a8d6c07ac6e 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1685,25 +1685,6 @@ static void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
> > udp_rmem_release(sk, udp_skb_truesize(skb), 1, true);
> > }
> >
> > -/* Idea of busylocks is to let producers grab an extra spinlock
> > - * to relieve pressure on the receive_queue spinlock shared by consumer.
> > - * Under flood, this means that only one producer can be in line
> > - * trying to acquire the receive_queue spinlock.
> > - */
> > -static spinlock_t *busylock_acquire(struct sock *sk)
> > -{
> > - spinlock_t *busy = &udp_sk(sk)->busylock;
> > -
> > - spin_lock(busy);
> > - return busy;
> > -}
> > -
> > -static void busylock_release(spinlock_t *busy)
> > -{
> > - if (busy)
> > - spin_unlock(busy);
> > -}
> > -
> > static int udp_rmem_schedule(struct sock *sk, int size)
> > {
> > int delta;
> > @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > {
> > struct sk_buff_head *list = &sk->sk_receive_queue;
> > + struct udp_prod_queue *udp_prod_queue;
> > + struct sk_buff *next, *to_drop = NULL;
> > + struct llist_node *ll_list;
> > unsigned int rmem, rcvbuf;
> > - spinlock_t *busy = NULL;
> > int size, err = -ENOMEM;
> > + int total_size = 0;
> > + int q_size = 0;
> > + int nb = 0;
> >
> > rmem = atomic_read(&sk->sk_rmem_alloc);
> > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > size = skb->truesize;
> >
> > + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
>
> There is a small chance that a cpu enqueues to this queue and no
> further arrivals on that numa node happen, stranding skbs on this
> intermediate queue, right? If so, those are leaked on
> udp_destruct_common.
There is absolutely 0 chance this can occur.
This is because I use the llist_add() return value to decide between:
A) The queue was empty, I am the first cpu to add a packet, so I am
   elected to drain this queue.
   A.1 Grab the spinlock (competing with recvmsg() or other cpus on
       other NUMA nodes). While waiting my turn on this spinlock,
       other cpus can add more packets to the per-NUMA queue I was
       elected to drain.
   A.2 llist_del_all() to drain the queue. I got my skb, but possibly
       others as well. After llist_del_all(), the next cpu trying to
       add a packet will see the queue empty again and will be elected
       to serve it. It will go through A.1 and A.2.
B) Return immediately, because there were other packets in the queue,
   so I know another cpu is in A.1 or before A.2.
Both A) and B) run under the rcu lock, so udp_destruct_common() will
not run.
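To make that concrete, here is a minimal, hypothetical C sketch of the
election logic. It reuses the llist primitives and the udp_prod_queue
layout from the patch, but it is not the actual patch code (the helper
name is made up, and memory accounting, rcvbuf checks, drops and
wakeups are omitted):

#include <linux/llist.h>
#include <linux/skbuff.h>
#include <linux/topology.h>	/* numa_node_id() */
#include <linux/udp.h>
#include <net/sock.h>

/* Sketch only: llist_add() returns true only for the cpu that made the
 * list non-empty, so exactly one producer is elected per batch and no
 * skb can remain stranded on the per-NUMA queue.
 */
static void sketch_enqueue(struct sock *sk, struct sk_buff *skb)
{
	struct udp_prod_queue *q = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
	struct sk_buff_head *list = &sk->sk_receive_queue;
	struct llist_node *batch;
	struct sk_buff *next;

	/* B) another cpu already saw the empty->non-empty transition
	 * and owns the drain of this batch: nothing more to do.
	 */
	if (!llist_add(&skb->ll_node, &q->ll_root))
		return;

	/* A.1) we were elected; while we wait for the receive queue
	 * lock, other cpus keep adding packets to q->ll_root.
	 */
	spin_lock(&list->lock);

	/* A.2) grab everything queued so far; the next cpu that finds
	 * the llist empty again will be elected for the next batch.
	 */
	batch = llist_del_all(&q->ll_root);
	batch = llist_reverse_order(batch);
	llist_for_each_entry_safe(skb, next, batch, ll_node)
		__skb_queue_tail(list, skb);

	spin_unlock(&list->lock);
}

The real code additionally charges q->rmem_alloc, runs
udp_rmem_schedule() and forward-alloc accounting per skb, performs the
sk_data_ready wakeups, and frees over-limit skbs after releasing the
receive queue lock.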
>
> > +
> > + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> > +
> > /* Immediately drop when the receive queue is full.
> > * Cast to unsigned int performs the boundary check for INT_MAX.
> > */
> > @@ -1747,45 +1737,75 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > if (rmem > (rcvbuf >> 1)) {
> > skb_condense(skb);
> > size = skb->truesize;
> > - rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > - if (rmem > rcvbuf)
> > - goto uncharge_drop;
> > - busy = busylock_acquire(sk);
> > - } else {
> > - atomic_add(size, &sk->sk_rmem_alloc);
> > }
> >
> > udp_set_dev_scratch(skb);
> >
> > + atomic_add(size, &udp_prod_queue->rmem_alloc);
> > +
> > + if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
> > + return 0;
> > +
> > spin_lock(&list->lock);
> > - err = udp_rmem_schedule(sk, size);
> > - if (err) {
> > - spin_unlock(&list->lock);
> > - goto uncharge_drop;
> > - }
> >
> > - sk_forward_alloc_add(sk, -size);
> > + ll_list = llist_del_all(&udp_prod_queue->ll_root);
> >
> > - /* no need to setup a destructor, we will explicitly release the
> > - * forward allocated memory on dequeue
> > - */
> > - sock_skb_set_dropcount(sk, skb);
> > + ll_list = llist_reverse_order(ll_list);
> > +
> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> > + size = udp_skb_truesize(skb);
> > + total_size += size;
> > + err = udp_rmem_schedule(sk, size);
> > + if (unlikely(err)) {
> > + /* Free the skbs outside of locked section. */
> > + skb->next = to_drop;
> > + to_drop = skb;
> > + continue;
> > + }
> > +
> > + q_size += size;
> > + sk_forward_alloc_add(sk, -size);
> > +
> > + /* no need to setup a destructor, we will explicitly release the
> > + * forward allocated memory on dequeue
> > + */
> > + sock_skb_set_dropcount(sk, skb);
>
> Since drop counters are approximate, read these once and report the
> same for all packets in a batch?
Good idea, thanks!
(My test receiver was not setting SOCK_RXQ_OVFL.)
I can squash this in V4, or add it as a separate patch.
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fce1d0ffd6361d271ae3528fea026a8d6c07ac6e..95241093b7f01b2dc31d9520b693f46400e545ff 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1706,6 +1706,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
int size, err = -ENOMEM;
int total_size = 0;
int q_size = 0;
+ int dropcount;
int nb = 0;
rmem = atomic_read(&sk->sk_rmem_alloc);
@@ -1746,6 +1747,8 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
return 0;
+ dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ? sk_drops_read(sk) : 0;
+
spin_lock(&list->lock);
ll_list = llist_del_all(&udp_prod_queue->ll_root);
@@ -1769,7 +1772,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
/* no need to setup a destructor, we will explicitly release the
* forward allocated memory on dequeue
*/
- sock_skb_set_dropcount(sk, skb);
+ SOCK_SKB_CB(skb)->dropcount = dropcount;
nb++;
__skb_queue_tail(list, skb);
}
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-21 9:58 [PATCH v3 net-next] udp: remove busylock and add per NUMA queues Eric Dumazet
2025-09-22 2:21 ` Willem de Bruijn
@ 2025-09-22 8:37 ` Paolo Abeni
2025-09-22 8:47 ` Eric Dumazet
1 sibling, 1 reply; 11+ messages in thread
From: Paolo Abeni @ 2025-09-22 8:37 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski
Cc: Simon Horman, Willem de Bruijn, Kuniyuki Iwashima, netdev,
eric.dumazet
Hi,
On 9/21/25 11:58 AM, Eric Dumazet wrote:
> @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> {
> struct sk_buff_head *list = &sk->sk_receive_queue;
> + struct udp_prod_queue *udp_prod_queue;
> + struct sk_buff *next, *to_drop = NULL;
> + struct llist_node *ll_list;
> unsigned int rmem, rcvbuf;
> - spinlock_t *busy = NULL;
> int size, err = -ENOMEM;
> + int total_size = 0;
> + int q_size = 0;
> + int nb = 0;
>
> rmem = atomic_read(&sk->sk_rmem_alloc);
> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> size = skb->truesize;
>
> + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> +
> + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> +
> /* Immediately drop when the receive queue is full.
> * Cast to unsigned int performs the boundary check for INT_MAX.
> */
Double checking I'm reading the code correctly... AFAICS the rcvbuf size
check is now only per NUMA node, that means that each node can now add
at most sk_rcvbuf bytes to the socket receive queue simultaneously, am I
correct?
What if the user-space process never reads the packets (or is very
slow)? I'm under the impression the max rcvbuf occupation will be
limited only by the memory accounting?!? (and not by sk_rcvbuf)
Side note: I'm wondering if we could avoid the numa queue for connected
sockets? With early demux, and no nft/bridge in between, the path from
NIC to socket should be pretty fast, and possibly the additional
queuing would be visible?
Thanks,
Paolo
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 8:37 ` Paolo Abeni
@ 2025-09-22 8:47 ` Eric Dumazet
2025-09-22 9:27 ` Paolo Abeni
2025-09-22 9:34 ` Eric Dumazet
0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-22 8:47 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Willem de Bruijn,
Kuniyuki Iwashima, netdev, eric.dumazet
On Mon, Sep 22, 2025 at 1:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On 9/21/25 11:58 AM, Eric Dumazet wrote:
> > @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > {
> > struct sk_buff_head *list = &sk->sk_receive_queue;
> > + struct udp_prod_queue *udp_prod_queue;
> > + struct sk_buff *next, *to_drop = NULL;
> > + struct llist_node *ll_list;
> > unsigned int rmem, rcvbuf;
> > - spinlock_t *busy = NULL;
> > int size, err = -ENOMEM;
> > + int total_size = 0;
> > + int q_size = 0;
> > + int nb = 0;
> >
> > rmem = atomic_read(&sk->sk_rmem_alloc);
> > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > size = skb->truesize;
> >
> > + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> > +
> > + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> > +
> > /* Immediately drop when the receive queue is full.
> > * Cast to unsigned int performs the boundary check for INT_MAX.
> > */
>
> Double checking I'm reading the code correctly... AFAICS the rcvbuf size
> check is now only per NUMA node, that means that each node can now add
> at most sk_rcvbuf bytes to the socket receive queue simultaneously, am I
> correct?
This is a transient condition. In my tests with 6 NUMA nodes pushing
packets very hard, I was not able to see a significant bump of
sk_rmem_alloc (over sk_rcvbuf).
>
> What if the user-space process never reads the packets (or is very
> slow)? I'm under the impression the max rcvbuf occupation will be
> limited only by the memory accounting?!? (and not by sk_rcvbuf)
Well, as soon as sk->sk_rmem_alloc is bigger than sk_rcvbuf, all
further incoming packets are dropped.
As you said, memory accounting is there.
This could matter if we had thousands of UDP sockets under flood at
the same time,
but that would require thousands of cpus and/or NIC rx queues.
>
> Side note: I'm wondering if we could avoid the numa queue for connected
> sockets? With early demux, and no nft/bridge in between the path from
> NIC to socket should be pretty fast and possibly the additional queuing
> visible?
I tried this last week and got no difference in performance on my test machines.
I can retry this and give you precise numbers before sending V4.
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 8:47 ` Eric Dumazet
@ 2025-09-22 9:27 ` Paolo Abeni
2025-09-22 9:34 ` Eric Dumazet
1 sibling, 0 replies; 11+ messages in thread
From: Paolo Abeni @ 2025-09-22 9:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Willem de Bruijn,
Kuniyuki Iwashima, netdev, eric.dumazet
On 9/22/25 10:47 AM, Eric Dumazet wrote:
> On Mon, Sep 22, 2025 at 1:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> What if the user-space process never reads the packets (or is very
>> slow)? I'm under the impression the max rcvbuf occupation will be
>> limited only by the memory accounting?!? (and not by sk_rcvbuf)
>
> Well, as soon as sk->sk_rmem_alloc is bigger than sk_rcvbuf, all
> further incoming packets are dropped.
>
> As you said, memory accounting is there.
>
> This could matter if we had thousands of UDP sockets under flood at
> the same time,
> but that would require thousands of cpus and/or NIC rx queues.
Ah, I initially misread:
rmem += atomic_read(&udp_prod_queue->rmem_alloc);
as
rmem = atomic_read(&udp_prod_queue->rmem_alloc);
and was fooled on the overall boundary check. LGTM now, thanks!
Paolo
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 8:47 ` Eric Dumazet
2025-09-22 9:27 ` Paolo Abeni
@ 2025-09-22 9:34 ` Eric Dumazet
2025-09-22 9:41 ` Paolo Abeni
1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2025-09-22 9:34 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Willem de Bruijn,
Kuniyuki Iwashima, netdev, eric.dumazet
On Mon, Sep 22, 2025 at 1:47 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Sep 22, 2025 at 1:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > Hi,
> >
> > On 9/21/25 11:58 AM, Eric Dumazet wrote:
> > > @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> > > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > > {
> > > struct sk_buff_head *list = &sk->sk_receive_queue;
> > > + struct udp_prod_queue *udp_prod_queue;
> > > + struct sk_buff *next, *to_drop = NULL;
> > > + struct llist_node *ll_list;
> > > unsigned int rmem, rcvbuf;
> > > - spinlock_t *busy = NULL;
> > > int size, err = -ENOMEM;
> > > + int total_size = 0;
> > > + int q_size = 0;
> > > + int nb = 0;
> > >
> > > rmem = atomic_read(&sk->sk_rmem_alloc);
> > > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > > size = skb->truesize;
> > >
> > > + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> > > +
> > > + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> > > +
> > > /* Immediately drop when the receive queue is full.
> > > * Cast to unsigned int performs the boundary check for INT_MAX.
> > > */
> >
> > Double checking I'm reading the code correctly... AFAICS the rcvbuf size
> > check is now only per NUMA node, that means that each node can now add
> > at most sk_rcvbuf bytes to the socket receive queue simultaneously, am I
> > correct?
>
> This is a transient condition. In my tests with 6 NUMA nodes pushing
> packets very hard,
> I was not able to see a significant bump of sk_rmem_alloc (over sk_rcvbuf)
>
>
>
> >
> > What if the user-space process never reads the packets (or is very
> > slow)? I'm under the impression the max rcvbuf occupation will be
> > limited only by the memory accounting?!? (and not by sk_rcvbuf)
>
> Well, as soon as sk->sk_rmem_alloc is bigger than sk_rcvbuf, all
> further incoming packets are dropped.
>
> As you said, memory accounting is there.
>
> This could matter if we had thousands of UDP sockets under flood at
> the same time,
> but that would require thousands of cpus and/or NIC rx queues.
>
>
>
> >
> > Side note: I'm wondering if we could avoid the numa queue for connected
> > sockets? With early demux, and no nft/bridge in between the path from
> > NIC to socket should be pretty fast and possibly the additional queuing
> > visible?
>
> I tried this last week and got no difference in performance on my test machines.
>
> I can retry this and give you precise numbers before sending V4.
I did my experiment again.
Very little difference (1 or 2%, but I would need many runs to confirm).
Also, loopback traffic would be unprotected (only RSS on a physical NIC
would properly use a single cpu for all packets).
Looking at the performance profiles of the cpus:
Always using the per-numa queue:
16.06% [kernel] [k] skb_release_data
10.59% [kernel] [k] dev_gro_receive
9.50% [kernel] [k] idpf_rx_process_skb_fields
5.37% [kernel] [k] __udp_enqueue_schedule_skb
3.93% [kernel] [k] net_rx_action
2.93% [kernel] [k] ip6t_do_table
2.84% [kernel] [k] napi_alloc_skb
2.41% [kernel] [k] queued_spin_lock_slowpath
2.01% [kernel] [k] __netif_receive_skb_core
1.95% [kernel] [k] idpf_vport_splitq_napi_poll
1.90% [kernel] [k] __memcpy
1.88% [kernel] [k] napi_gro_receive
1.86% [kernel] [k] kmem_cache_alloc_bulk_noprof
1.50% [kernel] [k] napi_consume_skb
1.28% [kernel] [k] sock_def_readable
1.00% [kernel] [k] llist_add_batch
0.93% [kernel] [k] ip6_rcv_core
0.91% [kernel] [k] call_function_single_prep_ipi
0.82% [kernel] [k] ipv6_gro_receive
0.81% [kernel] [k] ip6_protocol_deliver_rcu
0.81% [kernel] [k] fib6_node_lookup
0.79% [kernel] [k] ip6_route_input
0.78% [kernel] [k] eth_type_trans
0.75% [kernel] [k] ip6_sublist_rcv
0.75% [kernel] [k] __try_to_wake_up
0.73% [kernel] [k] udp6_csum_init
0.70% [kernel] [k] _raw_spin_lock
0.70% [kernel] [k] __wake_up_common_lock
0.69% [kernel] [k] read_tsc
0.62% [kernel] [k] ttwu_queue_wakelist
0.58% [kernel] [k] udp6_gro_receive
0.57% [kernel] [k] netif_receive_skb_list_internal
0.55% [kernel] [k] llist_reverse_order
0.53% [kernel] [k] available_idle_cpu
0.52% [kernel] [k] sched_clock_noinstr
0.51% [kernel] [k] __sk_mem_raise_allocated
Avoiding it for connected sockets:
14.75% [kernel] [k] skb_release_data
10.76% [kernel] [k] dev_gro_receive
9.48% [kernel] [k] idpf_rx_process_skb_fields
4.29% [kernel] [k] __udp_enqueue_schedule_skb
4.02% [kernel] [k] net_rx_action
3.17% [kernel] [k] ip6t_do_table
2.55% [kernel] [k] napi_alloc_skb
2.20% [kernel] [k] __memcpy
2.04% [kernel] [k] queued_spin_lock_slowpath
1.99% [kernel] [k] __netif_receive_skb_core
1.98% [kernel] [k] kmem_cache_alloc_bulk_noprof
1.76% [kernel] [k] napi_gro_receive
1.74% [kernel] [k] idpf_vport_splitq_napi_poll
1.55% [kernel] [k] napi_consume_skb
1.36% [kernel] [k] sock_def_readable
1.18% [kernel] [k] llist_add_batch
1.04% [kernel] [k] udp6_csum_init
0.92% [kernel] [k] fib6_node_lookup
0.92% [kernel] [k] _raw_spin_lock
0.91% [kernel] [k] call_function_single_prep_ipi
0.88% [kernel] [k] ip6_rcv_core
0.86% [kernel] [k] ip6_protocol_deliver_rcu
0.84% [kernel] [k] __try_to_wake_up
0.81% [kernel] [k] ip6_route_input
0.80% [kernel] [k] ipv6_gro_receive
0.75% [kernel] [k] __skb_flow_dissect
0.70% [kernel] [k] read_tsc
0.70% [kernel] [k] ttwu_queue_wakelist
0.69% [kernel] [k] eth_type_trans
0.69% [kernel] [k] __wake_up_common_lock
0.64% [kernel] [k] sched_clock_noinstr
I guess we have bigger fish to fry ;)
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 9:34 ` Eric Dumazet
@ 2025-09-22 9:41 ` Paolo Abeni
2025-09-22 10:29 ` Eric Dumazet
0 siblings, 1 reply; 11+ messages in thread
From: Paolo Abeni @ 2025-09-22 9:41 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Willem de Bruijn,
Kuniyuki Iwashima, netdev, eric.dumazet
On 9/22/25 11:34 AM, Eric Dumazet wrote:
> On Mon, Sep 22, 2025 at 1:47 AM Eric Dumazet <edumazet@google.com> wrote:
>> On Mon, Sep 22, 2025 at 1:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>> On 9/21/25 11:58 AM, Eric Dumazet wrote:
>>>> @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
>>>> int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>>>> {
>>>> struct sk_buff_head *list = &sk->sk_receive_queue;
>>>> + struct udp_prod_queue *udp_prod_queue;
>>>> + struct sk_buff *next, *to_drop = NULL;
>>>> + struct llist_node *ll_list;
>>>> unsigned int rmem, rcvbuf;
>>>> - spinlock_t *busy = NULL;
>>>> int size, err = -ENOMEM;
>>>> + int total_size = 0;
>>>> + int q_size = 0;
>>>> + int nb = 0;
>>>>
>>>> rmem = atomic_read(&sk->sk_rmem_alloc);
>>>> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
>>>> size = skb->truesize;
>>>>
>>>> + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
>>>> +
>>>> + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
>>>> +
>>>> /* Immediately drop when the receive queue is full.
>>>> * Cast to unsigned int performs the boundary check for INT_MAX.
>>>> */
>>>
>>> Double checking I'm reading the code correctly... AFAICS the rcvbuf size
>>> check is now only per NUMA node, that means that each node can now add
>>> at most sk_rcvbuf bytes to the socket receive queue simultaneously, am I
>>> correct?
>>
>> This is a transient condition. In my tests with 6 NUMA nodes pushing
>> packets very hard,
>> I was not able to see a significant bump of sk_rmem_alloc (over sk_rcvbuf)
>>
>>
>>
>>>
>>> What if the user-space process never reads the packets (or is very
>>> slow)? I'm under the impression the max rcvbuf occupation will be
>>> limited only by the memory accounting?!? (and not by sk_rcvbuf)
>>
>> Well, as soon as sk->sk_rmem_alloc is bigger than sk_rcvbuf, all
>> further incoming packets are dropped.
>>
>> As you said, memory accounting is there.
>>
>> This could matter if we had thousands of UDP sockets under flood at
>> the same time,
>> but that would require thousands of cpus and/or NIC rx queues.
>>
>>
>>
>>>
>>> Side note: I'm wondering if we could avoid the numa queue for connected
>>> sockets? With early demux, and no nft/bridge in between the path from
>>> NIC to socket should be pretty fast and possibly the additional queuing
>>> visible?
>>
>> I tried this last week and got no difference in performance on my test machines.
>>
>> I can retry this and give you precise numbers before sending V4.
>
> I did my experiment again.
>
> Very little difference (1 or 2 %, but would need many runs to have a
> confirmation)
>
> Also loopback traffic would be unprotected (Only RSS on a physical NIC
> would properly use a single cpu for all packets)
>
> Looking at the performance profile of the cpus
[...]
Indeed, the delta looks to be within noise, thanks for checking.
Just in case there is any doubt:
Acked-by: Paolo Abeni <pabeni@redhat.com>
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 9:41 ` Paolo Abeni
@ 2025-09-22 10:29 ` Eric Dumazet
0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-22 10:29 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Willem de Bruijn,
Kuniyuki Iwashima, netdev, eric.dumazet
On Mon, Sep 22, 2025 at 2:41 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 9/22/25 11:34 AM, Eric Dumazet wrote:
> > On Mon, Sep 22, 2025 at 1:47 AM Eric Dumazet <edumazet@google.com> wrote:
> >> On Mon, Sep 22, 2025 at 1:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>> On 9/21/25 11:58 AM, Eric Dumazet wrote:
> >>>> @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> >>>> int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> >>>> {
> >>>> struct sk_buff_head *list = &sk->sk_receive_queue;
> >>>> + struct udp_prod_queue *udp_prod_queue;
> >>>> + struct sk_buff *next, *to_drop = NULL;
> >>>> + struct llist_node *ll_list;
> >>>> unsigned int rmem, rcvbuf;
> >>>> - spinlock_t *busy = NULL;
> >>>> int size, err = -ENOMEM;
> >>>> + int total_size = 0;
> >>>> + int q_size = 0;
> >>>> + int nb = 0;
> >>>>
> >>>> rmem = atomic_read(&sk->sk_rmem_alloc);
> >>>> rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> >>>> size = skb->truesize;
> >>>>
> >>>> + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> >>>> +
> >>>> + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> >>>> +
> >>>> /* Immediately drop when the receive queue is full.
> >>>> * Cast to unsigned int performs the boundary check for INT_MAX.
> >>>> */
> >>>
> >>> Double checking I'm reading the code correctly... AFAICS the rcvbuf size
> >>> check is now only per NUMA node, that means that each node can now add
> >>> at most sk_rcvbuf bytes to the socket receive queue simultaneously, am I
> >>> correct?
> >>
> >> This is a transient condition. In my tests with 6 NUMA nodes pushing
> >> packets very hard,
> >> I was not able to see a significant bump of sk_rmem_alloc (over sk_rcvbuf)
> >>
> >>
> >>
> >>>
> >>> What if the user-space process never reads the packets (or is very
> >>> slow)? I'm under the impression the max rcvbuf occupation will be
> >>> limited only by the memory accounting?!? (and not by sk_rcvbuf)
> >>
> >> Well, as soon as sk->sk_rmem_alloc is bigger than sk_rcvbuf, all
> >> further incoming packets are dropped.
> >>
> >> As you said, memory accounting is there.
> >>
> >> This could matter if we had thousands of UDP sockets under flood at
> >> the same time,
> >> but that would require thousands of cpus and/or NIC rx queues.
> >>
> >>
> >>
> >>>
> >>> Side note: I'm wondering if we could avoid the numa queue for connected
> >>> sockets? With early demux, and no nft/bridge in between the path from
> >>> NIC to socket should be pretty fast and possibly the additional queuing
> >>> visible?
> >>
> >> I tried this last week and got no difference in performance on my test machines.
> >>
> >> I can retry this and give you precise numbers before sending V4.
> >
> > I did my experiment again.
> >
> > Very little difference (1 or 2 %, but would need many runs to have a
> > confirmation)
> >
> > Also loopback traffic would be unprotected (Only RSS on a physical NIC
> > would properly use a single cpu for all packets)
> >
> > Looking at the performance profile of the cpus
>
> [...]
>
> Indeed delta looks in the noise range, thanks for checking.
>
> Just in case there is any doubt:
>
> Acked-by: Paolo Abeni <pabeni@redhat.com>
Thanks for the review Paolo, I will include your Acked-by in V4.
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 7:31 ` Eric Dumazet
@ 2025-09-22 12:38 ` Willem de Bruijn
2025-09-22 12:58 ` Eric Dumazet
0 siblings, 1 reply; 11+ messages in thread
From: Willem de Bruijn @ 2025-09-22 12:38 UTC (permalink / raw)
To: Eric Dumazet, Willem de Bruijn
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, Kuniyuki Iwashima, netdev, eric.dumazet
Eric Dumazet wrote:
> On Sun, Sep 21, 2025 at 7:21 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Eric Dumazet wrote:
> > > busylock was protecting UDP sockets against packet floods,
> > > but unfortunately was not protecting the host itself.
> > >
> > > Under stress, many cpus could spin while acquiring the busylock,
> > > and NIC had to drop packets. Or packets would be dropped
> > > in cpu backlog if RPS/RFS were in place.
> > >
> > > This patch replaces the busylock by intermediate
> > > lockless queues. (One queue per NUMA node).
> > >
> > > This means that fewer number of cpus have to acquire
> > > the UDP receive queue lock.
> > >
> > > Most of the cpus can either:
> > > - immediately drop the packet.
> > > - or queue it in their NUMA aware lockless queue.
> > >
> > > Then one of the cpu is chosen to process this lockless queue
> > > in a batch.
> > >
> > > The batch only contains packets that were cooked on the same
> > > NUMA node, thus with very limited latency impact.
> > >
> > > Tested:
> > >
> > > DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
> > > (Intel(R) Xeon(R) 6985P-C)
> > >
> > > Before:
> > >
> > > nstat -n ; sleep 1 ; nstat | grep Udp
> > > Udp6InDatagrams 1004179 0.0
> > > Udp6InErrors 3117 0.0
> > > Udp6RcvbufErrors 3117 0.0
> > >
> > > After:
> > > nstat -n ; sleep 1 ; nstat | grep Udp
> > > Udp6InDatagrams 1116633 0.0
> > > Udp6InErrors 14197275 0.0
> > > Udp6RcvbufErrors 14197275 0.0
> > >
> > > We can see this host can now proces 14.2 M more packets per second
> > > while under attack, and the victim socket can receive 11 % more
> > > packets.
> >
> > Impressive gains under DoS!
> >
> > Main concern is that it adds an extra queue/dequeue and thus some
> > cycle cost for all udp sockets in the common case where they are not
> > contended. These are simple linked list operations, so I suppose the
> > only cost may be the cacheline if not warm. Busylock had the nice
> > property of only being used under mem pressure. Could this benefit
> > from the same?
>
> I hear you, but the extra cache line is local to the node, compared to the
> prior non-local, shared cache line, especially on modern cpus (AMD Turin
> / Venice or Intel Granite Rapids).
>
> This is a very minor cost compared to the average delay between a packet
> being received on the wire and being presented to __udp_enqueue_schedule_skb().
>
> And the core of the lockless algorithm is that all packets go through
> the same logic.
> See my following answer.
>
> >
> > > I used a small bpftrace program measuring time (in us) spent in
> > > __udp_enqueue_schedule_skb().
> > >
> > > Before:
> > >
> > > @udp_enqueue_us[398]:
> > > [0] 24901 |@@@ |
> > > [1] 63512 |@@@@@@@@@ |
> > > [2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [8, 16) 54022 |@@@@@@@@ |
> > > [16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > > [64, 128) 4219 | |
> > > [128, 256) 188 | |
> > >
> > > After:
> > >
> > > @udp_enqueue_us[398]:
> > > [0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [1] 1111277 |@@@@@@@@@@ |
> > > [2, 4) 501439 |@@@@ |
> > > [4, 8) 102921 | |
> > > [8, 16) 29895 | |
> > > [16, 32) 43500 | |
> > > [32, 64) 31552 | |
> > > [64, 128) 979 | |
> > > [128, 256) 13 | |
> > >
> > > Note that the remaining bottleneck for this platform is in
> > > udp_drops_inc() because we limited struct numa_drop_counters
> > > to only two nodes so far.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > > v3: - Moved kfree(up->udp_prod_queue) to udp_destruct_common(),
> > > addressing reports from Jakub and syzbot.
> > >
> > > - Perform SKB_DROP_REASON_PROTO_MEM drops after the queue
> > > spinlock is released.
> > >
> > > v2: https://lore.kernel.org/netdev/20250920080227.3674860-1-edumazet@google.com/
> > > - Added a kfree(up->udp_prod_queue) in udpv6_destroy_sock() (Jakub feedback on v1)
> > > - Added bpftrace histograms in changelog.
> > >
> > > v1: https://lore.kernel.org/netdev/20250919164308.2455564-1-edumazet@google.com/
> > >
> > > include/linux/udp.h | 9 +++-
> > > include/net/udp.h | 11 ++++-
> > > net/ipv4/udp.c | 114 ++++++++++++++++++++++++++------------------
> > > net/ipv6/udp.c | 5 +-
> > > 4 files changed, 88 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > > index e554890c4415b411f35007d3ece9e6042db7a544..58795688a18636ea79aa1f5d06eacc676a2e7849 100644
> > > --- a/include/linux/udp.h
> > > +++ b/include/linux/udp.h
> > > @@ -44,6 +44,12 @@ enum {
> > > UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
> > > };
> > >
> > > +/* per NUMA structure for lockless producer usage. */
> > > +struct udp_prod_queue {
> > > + struct llist_head ll_root ____cacheline_aligned_in_smp;
> > > + atomic_t rmem_alloc;
> > > +};
> > > +
> > > struct udp_sock {
> > > /* inet_sock has to be the first member */
> > > struct inet_sock inet;
> > > @@ -90,6 +96,8 @@ struct udp_sock {
> > > struct sk_buff *skb,
> > > int nhoff);
> > >
> > > + struct udp_prod_queue *udp_prod_queue;
> > > +
> > > /* udp_recvmsg try to use this before splicing sk_receive_queue */
> > > struct sk_buff_head reader_queue ____cacheline_aligned_in_smp;
> > >
> > > @@ -109,7 +117,6 @@ struct udp_sock {
> > > */
> > > struct hlist_node tunnel_list;
> > > struct numa_drop_counters drop_counters;
> > > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > > };
> > >
> > > #define udp_test_bit(nr, sk) \
> > > diff --git a/include/net/udp.h b/include/net/udp.h
> > > index 059a0cee5f559b8d75e71031a00d0aa2769e257f..cffedb3e40f24513e44fb7598c0ad917fd15b616 100644
> > > --- a/include/net/udp.h
> > > +++ b/include/net/udp.h
> > > @@ -284,16 +284,23 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
> > > struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> > > netdev_features_t features, bool is_ipv6);
> > >
> > > -static inline void udp_lib_init_sock(struct sock *sk)
> > > +static inline int udp_lib_init_sock(struct sock *sk)
> > > {
> > > struct udp_sock *up = udp_sk(sk);
> > >
> > > sk->sk_drop_counters = &up->drop_counters;
> > > - spin_lock_init(&up->busylock);
> > > skb_queue_head_init(&up->reader_queue);
> > > INIT_HLIST_NODE(&up->tunnel_list);
> > > up->forward_threshold = sk->sk_rcvbuf >> 2;
> > > set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> > > +
> > > + up->udp_prod_queue = kcalloc(nr_node_ids, sizeof(*up->udp_prod_queue),
> > > + GFP_KERNEL);
> > > + if (!up->udp_prod_queue)
> > > + return -ENOMEM;
> > > + for (int i = 0; i < nr_node_ids; i++)
> > > + init_llist_head(&up->udp_prod_queue[i].ll_root);
> > > + return 0;
> > > }
> > >
> > > static inline void udp_drops_inc(struct sock *sk)
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index 85cfc32eb2ccb3e229177fb37910fefde0254ffe..fce1d0ffd6361d271ae3528fea026a8d6c07ac6e 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -1685,25 +1685,6 @@ static void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
> > > udp_rmem_release(sk, udp_skb_truesize(skb), 1, true);
> > > }
> > >
> > > -/* Idea of busylocks is to let producers grab an extra spinlock
> > > - * to relieve pressure on the receive_queue spinlock shared by consumer.
> > > - * Under flood, this means that only one producer can be in line
> > > - * trying to acquire the receive_queue spinlock.
> > > - */
> > > -static spinlock_t *busylock_acquire(struct sock *sk)
> > > -{
> > > - spinlock_t *busy = &udp_sk(sk)->busylock;
> > > -
> > > - spin_lock(busy);
> > > - return busy;
> > > -}
> > > -
> > > -static void busylock_release(spinlock_t *busy)
> > > -{
> > > - if (busy)
> > > - spin_unlock(busy);
> > > -}
> > > -
> > > static int udp_rmem_schedule(struct sock *sk, int size)
> > > {
> > > int delta;
> > > @@ -1718,14 +1699,23 @@ static int udp_rmem_schedule(struct sock *sk, int size)
> > > int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > > {
> > > struct sk_buff_head *list = &sk->sk_receive_queue;
> > > + struct udp_prod_queue *udp_prod_queue;
> > > + struct sk_buff *next, *to_drop = NULL;
> > > + struct llist_node *ll_list;
> > > unsigned int rmem, rcvbuf;
> > > - spinlock_t *busy = NULL;
> > > int size, err = -ENOMEM;
> > > + int total_size = 0;
> > > + int q_size = 0;
> > > + int nb = 0;
> > >
> > > rmem = atomic_read(&sk->sk_rmem_alloc);
> > > rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > > size = skb->truesize;
> > >
> > > + udp_prod_queue = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
> >
> > There is a small chance that a cpu enqueues to this queue and no
> > further arrivals on that numa node happen, stranding skbs on this
> > intermediate queue, right? If so, those are leaked on
> > udp_destruct_common.
>
> There is absolutely 0 chance this can occur.
>
> This is because I use the llist_add() return value to decide between:
>
> A) The queue was empty, I am the first cpu to add a packet, so I am
>    elected to drain this queue.
>
>    A.1 Grab the spinlock (competing with recvmsg() or other cpus on
>        other NUMA nodes). While waiting my turn in this spinlock
>        acquisition, other cpus can add more packets to the per-NUMA
>        queue I was elected to drain.
>
>    A.2 llist_del_all() to drain the queue. I got my skb, but possibly
>        others as well.
>
>        After llist_del_all(), the next cpu trying to add a packet will
>        eventually see the queue empty again and will be elected to
>        serve it. It will go through A.1 and A.2.
>
> B) Return immediately, because there were other packets in the queue,
>    so I know another cpu is in A.1 or before A.2.
>
> Both A) and B) happen under the rcu lock, so udp_destruct_common()
> will not run.
Of course. Thanks. Nice lockless multi-producer multi-consumer struct.
I had not seen it before. A try_cmpxchg and xchg pair, but on an
uncontended NUMA local cacheline in the normal case.
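
(For anyone skimming the thread: here is a minimal sketch of that election
pattern, heavily simplified from the patch. The rcvbuf checks, memory
accounting and drop handling are omitted, and the function name below is
made up for illustration.)

/* Sketch only: per-NUMA front queue feeding sk_receive_queue. */
static void udp_node_queue_sketch(struct sock *sk, struct sk_buff *skb)
{
        struct udp_prod_queue *q = &udp_sk(sk)->udp_prod_queue[numa_node_id()];
        struct sk_buff_head *list = &sk->sk_receive_queue;
        struct llist_node *ll_list;
        struct sk_buff *next;

        /* llist_add() returns true only if the list was empty beforehand:
         * this cpu is then elected to flush the node-local queue.
         */
        if (!llist_add(&skb->ll_node, &q->ll_root))
                return; /* another cpu is already (or about to be) draining */

        spin_lock(&list->lock);
        /* Take everything that piled up on this node while we waited. */
        ll_list = llist_del_all(&q->ll_root);
        ll_list = llist_reverse_order(ll_list); /* llist is LIFO; restore FIFO */
        llist_for_each_entry_safe(skb, next, ll_list, ll_node)
                __skb_queue_tail(list, skb);
        spin_unlock(&list->lock);
}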
>
> >
> > > +
> > > + rmem += atomic_read(&udp_prod_queue->rmem_alloc);
> > > +
> > > /* Immediately drop when the receive queue is full.
> > > * Cast to unsigned int performs the boundary check for INT_MAX.
> > > */
> > > @@ -1747,45 +1737,75 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > > if (rmem > (rcvbuf >> 1)) {
> > > skb_condense(skb);
> > > size = skb->truesize;
> > > - rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > - if (rmem > rcvbuf)
> > > - goto uncharge_drop;
> > > - busy = busylock_acquire(sk);
> > > - } else {
> > > - atomic_add(size, &sk->sk_rmem_alloc);
> > > }
> > >
> > > udp_set_dev_scratch(skb);
> > >
> > > + atomic_add(size, &udp_prod_queue->rmem_alloc);
> > > +
> > > + if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
> > > + return 0;
> > > +
> > > spin_lock(&list->lock);
> > > - err = udp_rmem_schedule(sk, size);
> > > - if (err) {
> > > - spin_unlock(&list->lock);
> > > - goto uncharge_drop;
> > > - }
> > >
> > > - sk_forward_alloc_add(sk, -size);
> > > + ll_list = llist_del_all(&udp_prod_queue->ll_root);
> > >
> > > - /* no need to setup a destructor, we will explicitly release the
> > > - * forward allocated memory on dequeue
> > > - */
> > > - sock_skb_set_dropcount(sk, skb);
> > > + ll_list = llist_reverse_order(ll_list);
> > > +
> > > + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> > > + size = udp_skb_truesize(skb);
> > > + total_size += size;
> > > + err = udp_rmem_schedule(sk, size);
> > > + if (unlikely(err)) {
> > > + /* Free the skbs outside of locked section. */
> > > + skb->next = to_drop;
> > > + to_drop = skb;
> > > + continue;
> > > + }
> > > +
> > > + q_size += size;
> > > + sk_forward_alloc_add(sk, -size);
> > > +
> > > + /* no need to setup a destructor, we will explicitly release the
> > > + * forward allocated memory on dequeue
> > > + */
> > > + sock_skb_set_dropcount(sk, skb);
> >
> > Since drop counters are approximate, read these once and report the
> > same for all packets in a batch?
>
> Good idea, thanks!
> (My test receiver was not setting SOCK_RXQ_OVFL)
>
> I can squash this in V4, or add it as a separate patch.
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index fce1d0ffd6361d271ae3528fea026a8d6c07ac6e..95241093b7f01b2dc31d9520b693f46400e545ff 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1706,6 +1706,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>          int size, err = -ENOMEM;
>          int total_size = 0;
>          int q_size = 0;
> +        int dropcount;
>          int nb = 0;
>
>          rmem = atomic_read(&sk->sk_rmem_alloc);
> @@ -1746,6 +1747,8 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>          if (!llist_add(&skb->ll_node, &udp_prod_queue->ll_root))
>                  return 0;
>
> +        dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ? sk_drops_read(sk) : 0;
> +
>          spin_lock(&list->lock);
>
>          ll_list = llist_del_all(&udp_prod_queue->ll_root);
> @@ -1769,7 +1772,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>                  /* no need to setup a destructor, we will explicitly release the
>                   * forward allocated memory on dequeue
>                   */
> -                sock_skb_set_dropcount(sk, skb);
> +                SOCK_SKB_CB(skb)->dropcount = dropcount;
>                  nb++;
>                  __skb_queue_tail(list, skb);
>          }
* Re: [PATCH v3 net-next] udp: remove busylock and add per NUMA queues
2025-09-22 12:38 ` Willem de Bruijn
@ 2025-09-22 12:58 ` Eric Dumazet
0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-22 12:58 UTC (permalink / raw)
To: Willem de Bruijn
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, Kuniyuki Iwashima, netdev, eric.dumazet
On Mon, Sep 22, 2025 at 5:38 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Of course. Thanks. Nice lockless multi-producer multi-consumer struct.
> I had not seen it before. A try_cmpxchg and xchg pair, but on an
> uncontended NUMA local cacheline in the normal case.
I have played with a similar strategy for skb_attempt_defer_free(); I
am still polishing a series.
For some reason, spin_lock() on recent platforms is extremely expensive
when false sharing is occurring.
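
As a toy illustration (not code from either series) of the kind of false
sharing involved: if a spinlock shares a cache line with a field that
remote cpus update on every packet, each spin_lock()/spin_unlock() drags a
bouncing line across nodes even without real lock contention. Keeping the
hot members on separate cache lines, as done with
____cacheline_aligned_in_smp in the UDP patch, avoids that:

/* Hypothetical layouts, for illustration only. */
struct rxq_shared_line {
        spinlock_t lock;        /* producers take this lock...               */
        atomic_t   rmem;        /* ...while remote cpus bump this counter on
                                 * every packet; both fields share one cache
                                 * line, so it bounces between NUMA nodes.
                                 */
};

struct rxq_padded {
        spinlock_t lock ____cacheline_aligned_in_smp;
        atomic_t   rmem ____cacheline_aligned_in_smp;  /* own cache line */
};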
end of thread, other threads: [~2025-09-22 12:58 UTC | newest]

Thread overview: 11+ messages
2025-09-21 9:58 [PATCH v3 net-next] udp: remove busylock and add per NUMA queues Eric Dumazet
2025-09-22 2:21 ` Willem de Bruijn
2025-09-22 7:31 ` Eric Dumazet
2025-09-22 12:38 ` Willem de Bruijn
2025-09-22 12:58 ` Eric Dumazet
2025-09-22 8:37 ` Paolo Abeni
2025-09-22 8:47 ` Eric Dumazet
2025-09-22 9:27 ` Paolo Abeni
2025-09-22 9:34 ` Eric Dumazet
2025-09-22 9:41 ` Paolo Abeni
2025-09-22 10:29 ` Eric Dumazet