From: Balazs Scheidler <bazsi77@gmail.com>
To: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>, pabeni@redhat.com
Subject: [RFC, RESEND] UDP receive path batching improvement
Date: Fri, 22 Aug 2025 10:15:41 +0200
Message-ID: <aKgnLcw6yzq78CIP@bzorp3>
Hi,
There's this patch from 2016:
commit 6b229cf77d683f634f0edd876c6d1015402303ad
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Dec 8 11:41:56 2016 -0800
udp: add batching to udp_rmem_release()
This patch delays updates to the socket's receive memory counter
(sk->sk_rmem_alloc) to avoid cache line ping-pong between the network receive
path and the user-space process.
This change in particular causes an issue for us in our use case:
+ if (likely(partial)) {
+ up->forward_deficit += size;
+ size = up->forward_deficit;
+ if (size < (sk->sk_rcvbuf >> 2) &&
+ !skb_queue_empty(&sk->sk_receive_queue))
+ return;
+ } else {
+ size += up->forward_deficit;
+ }
+ up->forward_deficit = 0;
The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
done to the counter.
In our case (syslog receive path via udp), socket buffers are generally
tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
the senders can generate spikes in their traffic and a lot of senders send
to the same port. Due to latencies, sometimes these buffers take MBs of data
before the user-space process even has a chance to consume them.
If we were talking about video or voice streams sent over UDP, the current
behaviour makes a lot of sense: upon the very first drop, also drop
subsequent packets until things recover.
However in the case of syslog, every message is an isolated datapoint and
subsequent packets are not related at all.
Due to this batching, the kernel always "overestimates" how full the receive
buffer is.
Instead of using 25% of the receive buffer, couldn't we use a different
trigger mechanism? These are my thoughts:
1) a simple packet counter: if the datagrams are small, a byte-based estimate
can translate into a widely varying number of packets (and the per-packet
update rate is what ultimately drives the overhead here)
2) cap the byte-based threshold at 64k-128k or so, as with typical tuned
buffer sizes we might otherwise be in the MBs range.
Both of these solutions should reduce UDP syslog data loss on reception and
still amortize the modification overhead (i.e. the cache ping-pong) of
sk->sk_rmem_alloc.
Here's a POC patch that implements the 2nd solution, but I think I would
prefer the first one.
Feedback welcome.
diff --git a/include/net/udp.h b/include/net/udp.h
index e2af3bda90c9..222c0267af17 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -284,13 +284,18 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
netdev_features_t features, bool is_ipv6);
+static inline int udp_lib_forward_threshold(struct sock *sk)
+{
+ return min(sk->sk_rcvbuf >> 2, 65536);
+}
+
static inline void udp_lib_init_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
skb_queue_head_init(&up->reader_queue);
INIT_HLIST_NODE(&up->tunnel_list);
- up->forward_threshold = sk->sk_rcvbuf >> 2;
+ up->forward_threshold = udp_lib_forward_threshold(sk);
set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cc3ce0f762ec..00647213db86 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2953,7 +2953,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
sockopt_lock_sock(sk);
/* paired with READ_ONCE in udp_rmem_release() */
- WRITE_ONCE(up->forward_threshold, sk->sk_rcvbuf >> 2);
+ WRITE_ONCE(up->forward_threshold, udp_lib_forward_threshold(sk));
sockopt_release_sock(sk);
}
return err;
I am happy to submit a proper patch if this approach seems feasible. Thank you.
--
Bazsi
Happy Logging!