* [RFC, RESEND] UDP receive path batching improvement
From: Balazs Scheidler @ 2025-08-22 8:15 UTC
To: netdev; +Cc: Eric Dumazet, pabeni
Hi,
There's this patch from 2016:
commit 6b229cf77d683f634f0edd876c6d1015402303ad
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Dec 8 11:41:56 2016 -0800
udp: add batching to udp_rmem_release()
This patch is delaying updates to the current size of the socket buffer
(sk->sk_rmem_alloc) to avoid a cache ping-pong between the network receive
path and the user-space process.
This change in particular causes an issue for us in our use-case:
+ if (likely(partial)) {
+ up->forward_deficit += size;
+ size = up->forward_deficit;
+ if (size < (sk->sk_rcvbuf >> 2) &&
+ !skb_queue_empty(&sk->sk_receive_queue))
+ return;
+ } else {
+ size += up->forward_deficit;
+ }
+ up->forward_deficit = 0;
The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
done to the counter.
In our case (syslog receive path via udp), socket buffers are generally
tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
the senders can generate spikes in their traffic and a lot of senders send
to the same port. Due to latencies, sometimes these buffers take MBs of data
before the user-space process even has a chance to consume them.
If we were talking about video or voice streams sent over UDP, the current
behaviour makes a lot of sense: upon the very first drop, also drop
subsequent packets until things recover.
However in the case of syslog, every message is an isolated datapoint and
subsequent packets are not related at all.
Due to this batching, the kernel always "overestimates" how full the receive
buffer is.
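To put rough numbers on this (illustrative only, assuming a truesize of
~768 bytes for a 150-byte syslog datagram): with a 32MB SO_RCVBUF the
trigger is 32MB >> 2 = 8MB, so userspace has to consume on the order of 10k
datagrams before sk_rmem_alloc is decremented at all, and if the buffer
filled up in the meantime, everything arriving in that window is dropped.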
Instead of using 25% of the receive buffer, couldn't we use a different
trigger mechanism? These are my thoughts:
1) a simple packet counter: if the datagrams are small, byte-based estimates
can translate into wildly varying packet counts (and the per-packet update
rate is what ultimately drives the overhead here)
2) cap the byte-based threshold at 64k-128k or so, as we might be in the MB
range with typical buffer sizes.
Both of these solutions should reduce UDP syslog data loss on reception and
still amortize the modification overhead (e.g. cache ping-pong) of
sk->sk_rmem_alloc.
Here's a POC patch that implements the 2nd solution, but I think I would
prefer the first one.
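For illustration, option 1 could look roughly like the sketch below (the
forward_pkt_deficit field and the UDP_FORWARD_PKT_BATCH constant are made
up, and the function signature is simplified):

```
#define UDP_FORWARD_PKT_BATCH	100	/* made-up tunable */

static void udp_rmem_release(struct sock *sk, int size, int partial)
{
	struct udp_sock *up = udp_sk(sk);

	if (likely(partial)) {
		up->forward_deficit += size;
		up->forward_pkt_deficit++;	/* made-up field */
		if (up->forward_pkt_deficit < UDP_FORWARD_PKT_BATCH &&
		    !skb_queue_empty(&sk->sk_receive_queue))
			return;
		size = up->forward_deficit;
	} else {
		size += up->forward_deficit;
	}
	up->forward_deficit = 0;
	up->forward_pkt_deficit = 0;

	/* ...rest of the accounting as in the current udp_rmem_release() */
	atomic_sub(size, &sk->sk_rmem_alloc);
}
```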
Feedback welcome.
diff --git a/include/net/udp.h b/include/net/udp.h
index e2af3bda90c9..222c0267af17 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -284,13 +284,18 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
netdev_features_t features, bool is_ipv6);
+static inline int udp_lib_forward_threshold(struct sock *sk)
+{
+ return min(sk->sk_rcvbuf >> 2, 65536);
+}
+
static inline void udp_lib_init_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
skb_queue_head_init(&up->reader_queue);
INIT_HLIST_NODE(&up->tunnel_list);
- up->forward_threshold = sk->sk_rcvbuf >> 2;
+ up->forward_threshold = udp_lib_forward_threshold(sk);
set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cc3ce0f762ec..00647213db86 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2953,7 +2953,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
sockopt_lock_sock(sk);
/* paired with READ_ONCE in udp_rmem_release() */
- WRITE_ONCE(up->forward_threshold, sk->sk_rcvbuf >> 2);
+ WRITE_ONCE(up->forward_threshold, udp_lib_forward_threshold(sk));
sockopt_release_sock(sk);
}
return err;
I am happy to submit a proper patch if this is something feasible. Thank you.
--
Bazsi
Happy Logging!
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Eric Dumazet @ 2025-08-22 8:18 UTC
To: Balazs Scheidler; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
>
> Hi,
>
> There's this patch from 2016:
>
> commit 6b229cf77d683f634f0edd876c6d1015402303ad
> Author: Eric Dumazet <edumazet@google.com>
> Date: Thu Dec 8 11:41:56 2016 -0800
>
> udp: add batching to udp_rmem_release()
>
> This patch is delaying updates to the current size of the socket buffer
> (sk->sk_rmem_alloc) to avoid a cache ping-pong between the network receive
> path and the user-space process.
>
> This change in particular causes an issue for us in our use-case:
>
> + if (likely(partial)) {
> + up->forward_deficit += size;
> + size = up->forward_deficit;
> + if (size < (sk->sk_rcvbuf >> 2) &&
> + !skb_queue_empty(&sk->sk_receive_queue))
> + return;
> + } else {
> + size += up->forward_deficit;
> + }
> + up->forward_deficit = 0;
>
> The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> done to the counter.
>
> In our case (syslog receive path via udp), socket buffers are generally
> tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> the senders can generate spikes in their traffic and a lot of senders send
> to the same port. Due to latencies, sometimes these buffers take MBs of data
> before the user-space process even has a chance to consume them.
>
This seems very high usage for a single UDP socket.
Have you tried SO_REUSEPORT to spread incoming packets to more sockets
(and possibly more threads) ?
> If we were talking about video or voice streams sent over UDP, the current
> behaviour makes a lot of sense: upon the very first drop, also drop
> subsequent packets until things recover.
>
> However in the case of syslog, every message is an isolated datapoint and
> subsequent packets are not related at all.
>
> Due to this batching, the kernel always "overestimates" how full the receive
> buffer is.
>
> Instead of using 25% of the receive buffer, couldn't we use a different
> trigger mechanism? These are my thoughts:
> 1) a simple packet counter: if the datagrams are small, byte-based estimates
> can translate into wildly varying packet counts (and the per-packet update
> rate is what ultimately drives the overhead here)
> 2) cap the byte-based threshold at 64k-128k or so, as we might be in the MB
> range with typical buffer sizes.
>
> Both of these solutions should improve UDP syslog data loss on reception and
> still amortize the modification overhead (e.g. cache ping pong) of
> sk->sk_rmem_alloc.
>
> Here's a POC patch that implements the 2nd solution, but I think I would
> prefer the first one.
>
> Feedback welcome.
>
> diff --git a/include/net/udp.h b/include/net/udp.h
> index e2af3bda90c9..222c0267af17 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -284,13 +284,18 @@ INDIRECT_CALLABLE_DECLARE(int udpv6_rcv(struct sk_buff *));
> struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> netdev_features_t features, bool is_ipv6);
>
> +static inline int udp_lib_forward_threshold(struct sock *sk)
> +{
> + return min(sk->sk_rcvbuf >> 2, 65536);
> +}
> +
> static inline void udp_lib_init_sock(struct sock *sk)
> {
> struct udp_sock *up = udp_sk(sk);
>
> skb_queue_head_init(&up->reader_queue);
> INIT_HLIST_NODE(&up->tunnel_list);
> - up->forward_threshold = sk->sk_rcvbuf >> 2;
> + up->forward_threshold = udp_lib_forward_threshold(sk);
> set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> }
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index cc3ce0f762ec..00647213db86 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2953,7 +2953,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
> if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
> sockopt_lock_sock(sk);
> /* paired with READ_ONCE in udp_rmem_release() */
> - WRITE_ONCE(up->forward_threshold, sk->sk_rcvbuf >> 2);
> + WRITE_ONCE(up->forward_threshold, udp_lib_forward_threshold(sk));
> sockopt_release_sock(sk);
> }
> return err;
>
> I am happy to submit a proper patch if this is something feasible. Thank you.
>
> --
> Bazsi
> Happy Logging!
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Balazs Scheidler @ 2025-08-22 9:15 UTC
To: Eric Dumazet; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > done to the counter.
> >
> > In our case (syslog receive path via udp), socket buffers are generally
> > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > the senders can generate spikes in their traffic and a lot of senders send
> > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > before the user-space process even has a chance to consume them.
> >
>
>
> This seems very high usage for a single UDP socket.
>
> Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> (and possibly more threads) ?
Yes, I use SO_REUSEPORT (16 sockets), and I even use eBPF to distribute the
load evenly over multiple sockets, instead of the normal load-balancing
algorithm built into SO_REUSEPORT.
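Roughly, the setup looks like the sketch below (error handling elided;
prog_fd is assumed to come from a libbpf program load, and NUM_SOCKETS and
setup_sockets are illustrative names):

```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_EBPF
#define SO_ATTACH_REUSEPORT_EBPF 52	/* from asm-generic/socket.h */
#endif

#define NUM_SOCKETS 16

static void setup_sockets(int fds[NUM_SOCKETS], int port, int prog_fd)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(port),
		.sin_addr   = { .s_addr = htonl(INADDR_ANY) },
	};
	int one = 1;

	for (int i = 0; i < NUM_SOCKETS; i++) {
		fds[i] = socket(AF_INET, SOCK_DGRAM, 0);
		setsockopt(fds[i], SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		bind(fds[i], (struct sockaddr *)&addr, sizeof(addr));
	}
	/* attaching to one member applies to the whole reuseport group */
	setsockopt(fds[0], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
		   &prog_fd, sizeof(prog_fd));
}
```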
Sometimes the processing on the userspace side is heavy enough (think of
parsing, heuristics, data normalization) and the load on the box heavy
enough that I still see drops from time to time.
If a client sends 100k messages in a tight loop for a while, that's going to
use a lot of buffer space. What bothers me further is that it could be OK
to lose a single packet, but once we drop one packet, we keep losing all of
them, at least until we fetch 25% of SO_RCVBUF (or until the receive buffer
is completely emptied). This problem, combined with small packets (think
100-150 byte payloads), can easily cause excessive drops. 25% of the socket
buffer is a huge threshold.
I am not sure how many packets warrant an sk_rmem_alloc update, but I'd
assume that one update every 100 packets should still be OK.
--
Bazsi
Happy Logging!
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Eric Dumazet @ 2025-08-22 9:37 UTC
To: Balazs Scheidler; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
>
> On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > done to the counter.
> > >
> > > In our case (syslog receive path via udp), socket buffers are generally
> > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > the senders can generate spikes in their traffic and a lot of senders send
> > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > before the user-space process even has a chance to consume them.
> > >
> >
> >
> > This seems very high usage for a single UDP socket.
> >
> > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > (and possibly more threads) ?
>
> Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> load over multiple sockets evenly, instead of the normal load balancing
> algorithm built into SO_REUSEPORT.
>
Great. But if you have many receive queues, are you sure this choice does not
add false sharing ?
> Sometimes the processing on the userspace side is heavy enough (think of
> parsing, heuristics, data normalization) and the load on the box heavy
> enough that I still see drops from time to time.
>
> If a client sends 100k messages in a tight loop for a while, that's going to
> use a lot of buffer space. What bothers me further is that it could be ok
> to lose a single packet, but any time we drop one packet, we will continue
> to lose all of them, at least until we fetch 25% of SO_RCVBUF (or if the
> receive buffer is completely emptied). This problem, combined with small
> packets (think of 100-150 byte payload) can easily cause excessive drops. 25%
> of the socket buffer is a huge offset.
sock_writeable() uses a 50% threshold.
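(In current kernels that is essentially:)

```
static inline bool sock_writeable(const struct sock *sk)
{
	return refcount_read(&sk->sk_wmem_alloc) <
	       (READ_ONCE(sk->sk_sndbuf) >> 1);
}
```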
>
> I am not sure how many packets warrants a sk_rmem_alloc update, but I'd
> assume that 1 update every 100 packets should still be OK.
Maybe, but some UDP packets have a truesize around 128 KB or even more.
Perhaps add a new UDP socket option to let the user decide on what
they feel is better for them ?
I suspect that the main issue is about having a single drop in the first place,
because of false sharing on sk->sk_drops
Perhaps we should move sk_drops on a dedicated cache line,
and perhaps have two counters for NUMA servers.
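Something along these lines, as a sketch only (the type and helper names
are made up, this is not existing kernel code):

```
/* Split drop accounting over two cache lines to reduce ping-pong. */
struct numa_drop_counter {
	atomic_t drops0 ____cacheline_aligned_in_smp;
	atomic_t drops1 ____cacheline_aligned_in_smp;
};

static inline void numa_drop_inc(struct numa_drop_counter *c)
{
	/* steer odd/even NUMA nodes to different cache lines */
	if (numa_node_id() & 1)
		atomic_inc(&c->drops1);
	else
		atomic_inc(&c->drops0);
}

static inline u32 numa_drop_read(const struct numa_drop_counter *c)
{
	return atomic_read(&c->drops0) + atomic_read(&c->drops1);
}
```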
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Balazs Scheidler @ 2025-08-22 12:56 UTC
To: Eric Dumazet; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > done to the counter.
> > > >
> > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > before the user-space process even has a chance to consume them.
> > > >
> > >
> > >
> > > This seems very high usage for a single UDP socket.
> > >
> > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > (and possibly more threads) ?
> >
> > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > load over multiple sockets evenly, instead of the normal load balancing
> > algorithm built into SO_REUSEPORT.
> >
>
> Great. But if you have many receive queues, are you sure this choice does not
> add false sharing ?
I am not sure how that could trigger false sharing here. I am using a
"socket" filter, which generates a random number modulo the number of
sockets:
```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
int number_of_sockets;
SEC("socket")
int random_choice(struct __sk_buff *skb)
{
if (number_of_sockets == 0)
return -1;
return bpf_get_prandom_u32() % number_of_sockets;
}
```
Last time I checked the code, all it did was put the incoming packet into
the socket buffer selected by the filter. What would be the false sharing
in this case?
>
> > Sometimes the processing on the userspace side is heavy enough (think of
> > parsing, heuristics, data normalization) and the load on the box heavy
> > enough that I still see drops from time to time.
> >
> > If a client sends 100k messages in a tight loop for a while, that's going to
> > use a lot of buffer space. What bothers me further is that it could be ok
> > to lose a single packet, but any time we drop one packet, we will continue
> > to lose all of them, at least until we fetch 25% of SO_RCVBUF (or if the
> > receive buffer is completely emptied). This problem, combined with small
> > packets (think of 100-150 byte payload) can easily cause excessive drops. 25%
> > of the socket buffer is a huge offset.
>
> sock_writeable() uses a 50% threshold.
I am not sure why this is relevant here; the write side of a socket can
easily be flow-controlled (e.g. the process waits until it can send more
data). Also, my clients are not necessarily client boxes: Palo Alto
firewalls can generate 70k events per second in syslog alone. All of that
leaves the firewall, and my challenge is to read all of it.
>
> >
> > I am not sure how many packets warrants a sk_rmem_alloc update, but I'd
> > assume that 1 update every 100 packets should still be OK.
>
> Maybe, but some UDP packets have a truesize around 128 KB or even more.
I understand that truesize incorporates the struct sk_buff header, and we
may also see non-linear skbs, which could inflate the number (saying this
without really understanding all the specifics there).
>
> Perhaps add a new UDP socket option to let the user decide on what
> they feel is better for them ?
I wanted to avoid a knob for this, but I can easily implement this way. So
should I create a patch for a setsockopt() that allows setting
udp_sk->forward_threshold?
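Something like the sketch below, slotted into the optname switch in
udp_lib_setsockopt() (UDP_FORWARD_THRESHOLD is a made-up option name, not
existing UAPI):

```
	case UDP_FORWARD_THRESHOLD:	/* hypothetical optname */
		if (val < 0 || val > READ_ONCE(sk->sk_rcvbuf)) {
			err = -EINVAL;
			break;
		}
		/* paired with READ_ONCE() in udp_rmem_release() */
		WRITE_ONCE(up->forward_threshold, val);
		break;
```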
>
> I suspect that the main issue is about having a single drop in the first place,
> because of false sharing on sk->sk_drops
>
> Perhaps we should move sk_drops on a dedicated cache line,
> and perhaps have two counters for NUMA servers.
I am looking into sk_drops; I don't know what it does at the moment, it's
been a while since I last read this codebase :)
--
Bazsi
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Eric Dumazet @ 2025-08-22 13:10 UTC
To: Balazs Scheidler; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
>
> On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > > done to the counter.
> > > > >
> > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > before the user-space process even has a chance to consume them.
> > > > >
> > > >
> > > >
> > > > This seems very high usage for a single UDP socket.
> > > >
> > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > (and possibly more threads) ?
> > >
> > > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > load over multiple sockets evenly, instead of the normal load balancing
> > > algorithm built into SO_REUSEPORT.
> > >
> >
> > Great. But if you have many receive queues, are you sure this choice does not
> > add false sharing ?
>
> I am not sure how that could trigger false sharing here. I am using a
> "socket" filter, which generates a random number modulo the number of
> sockets:
>
> ```
> #include "vmlinux.h"
> #include <bpf/bpf_helpers.h>
>
> int number_of_sockets;
>
> SEC("socket")
> int random_choice(struct __sk_buff *skb)
> {
> if (number_of_sockets == 0)
> return -1;
>
> return bpf_get_prandom_u32() % number_of_sockets;
> }
> ```
How many receive queues does your NIC have (ethtool -l eth0) ?
This filter causes huge contention on the receive queues and various
socket fields, accessed by different cpus.
You should instead perform a choice based on the napi_id (skb->napi_id)
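For example (untested sketch; packets from one receive queue then
consistently land on the same socket, so each socket's fields are mostly
touched by a single cpu):

```
SEC("socket")
int napi_choice(struct __sk_buff *skb)
{
	if (number_of_sockets == 0)
		return -1;

	/* same NAPI (rx queue) -> same socket */
	return skb->napi_id % number_of_sockets;
}
```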
>
> Last I've checked the code, all it did was putting the incoming packet into
> the right socket buffer, as returned by the filter. What would be the false
> sharing in this case?
>
> >
> > > Sometimes the processing on the userspace side is heavy enough (think of
> > > parsing, heuristics, data normalization) and the load on the box heavy
> > > enough that I still see drops from time to time.
> > >
> > > If a client sends 100k messages in a tight loop for a while, that's going to
> > > use a lot of buffer space. What bothers me further is that it could be ok
> > > to lose a single packet, but any time we drop one packet, we will continue
> > > to lose all of them, at least until we fetch 25% of SO_RCVBUF (or if the
> > > receive buffer is completely emptied). This problem, combined with small
> > > packets (think of 100-150 byte payload) can easily cause excessive drops. 25%
> > > of the socket buffer is a huge offset.
> >
> > sock_writeable() uses a 50% threshold.
>
> I am not sure why this is relevant here, the write side of sockets can
> easily be flow controlled (e.g. the process waiting until it can send more
> data). Also my clients are not necessarily client boxes. PaloAlto firewalls
> can generate 70k events-per-second in syslog alone. And that does leave the
> firewall, and my challenge is to read all of that.
>
> >
> > >
> > > I am not sure how many packets warrants a sk_rmem_alloc update, but I'd
> > > assume that 1 update every 100 packets should still be OK.
> >
> > Maybe, but some UDP packets have a truesize around 128 KB or even more.
>
> I understand that the truesize incorporates struct sk_buff header and we may
> also see non-linear SKBs, which could inflate the number (saying this without really
> understanding all the specifics there).
>
> >
> > Perhaps add a new UDP socket option to let the user decide on what
> > they feel is better for them ?
>
> I wanted to avoid a knob for this, but I can easily implement this way. So
> should I create a patch for a setsockopt() that allows setting
> udp_sk->forward_threshold?
>
> >
> > I suspect that the main issue is about having a single drop in the first place,
> > because of false sharing on sk->sk_drops
> >
> > Perhaps we should move sk_drops on a dedicated cache line,
> > and perhaps have two counters for NUMA servers.
>
> I am looking into sk_drops, I don't know what it does at the moment, it's
> been a while I've last read this codebase :)
>
Can you post
ss -aum src :1000 <replace 1000 with your UDP source port>
We will check the dXXXX output (number of drops), per socket.
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Eric Dumazet @ 2025-08-22 13:20 UTC
To: Balazs Scheidler; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 6:10 AM Eric Dumazet <edumazet@google.com> wrote:
>
>
> Can you post
>
> ss -aum src :1000 <replace 1000 with your UDP source port>
>
> We will check the dXXXX output (number of drops), per socket.
Small experiment :
otrv5:/home/edumazet# ./super_netperf 10 -t UDP_STREAM -H otrv6 -l10 -- -n -P,1000 -m 1200
4304
If I remove the problematic sk_drops update :
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index efd742279289fc13aec9369d0f01a3be3aa73151..8976399d4e52f21058f74fde13d46e35c7617deb 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1575,7 +1575,8 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
drop:
- atomic_inc(&sk->sk_drops);
+// Find a better way to make this operation not too expensive.
+// atomic_inc(&sk->sk_drops);
busylock_release(busy);
return err;
}
otrv5:/home/edumazet# ./super_netperf 10 -t UDP_STREAM -H otrv6 -l10 -- -n -P,1000 -m 1200
6076
So there is definitely room for a big improvement here.
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Balazs Scheidler @ 2025-08-22 13:33 UTC
To: Eric Dumazet; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > > > done to the counter.
> > > > > >
> > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > > before the user-space process even has a chance to consume them.
> > > > > >
> > > > >
> > > > >
> > > > > This seems very high usage for a single UDP socket.
> > > > >
> > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > (and possibly more threads) ?
> > > >
> > > > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > > load over multiple sockets evenly, instead of the normal load balancing
> > > > algorithm built into SO_REUSEPORT.
> > > >
> > >
> > > Great. But if you have many receive queues, are you sure this choice does not
> > > add false sharing ?
> >
> > I am not sure how that could trigger false sharing here. I am using a
> > "socket" filter, which generates a random number modulo the number of
> > sockets:
> >
> > ```
> > #include "vmlinux.h"
> > #include <bpf/bpf_helpers.h>
> >
> > int number_of_sockets;
> >
> > SEC("socket")
> > int random_choice(struct __sk_buff *skb)
> > {
> > if (number_of_sockets == 0)
> > return -1;
> >
> > return bpf_get_prandom_u32() % number_of_sockets;
> > }
> > ```
>
> How many receive queues does your NIC have (ethtool -l eth0) ?
>
> This filter causes huge contention on the receive queues and various
> socket fields, accessed by different cpus.
>
> You should instead perform a choice based on the napi_id (skb->napi_id)
I don't have ssh access to the box, unfortunately. I'll look into napi_id;
my historical knowledge of the IP stack is that a single thread handles
incoming datagrams, but I have to admit that information did not age well.
Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't have a say
in that...).
This box is a VM, but I am not even sure which virtualization stack is used;
I am trying to find out the number of receive queues.
But with that said, I was under the impression that the bottleneck is in
userspace, in the round-trip it takes userspace to get back to receiving
UDP. The same event loop processes a number of connections/UDP sockets in
parallel, and sometimes syslog-ng just doesn't get around to a socket
quickly enough if there's too much to do with a specific datagram. My
assumption has been that it is this latency that causes datagrams to be
dropped.
>
>
> >
> > Last I've checked the code, all it did was putting the incoming packet into
> > the right socket buffer, as returned by the filter. What would be the false
> > sharing in this case?
> >
> > >
> > > > Sometimes the processing on the userspace side is heavy enough (think of
> > > > parsing, heuristics, data normalization) and the load on the box heavy
> > > > enough that I still see drops from time to time.
> > > >
> > > > If a client sends 100k messages in a tight loop for a while, that's going to
> > > > use a lot of buffer space. What bothers me further is that it could be ok
> > > > to lose a single packet, but any time we drop one packet, we will continue
> > > > to lose all of them, at least until we fetch 25% of SO_RCVBUF (or if the
> > > > receive buffer is completely emptied). This problem, combined with small
> > > > packets (think of 100-150 byte payload) can easily cause excessive drops. 25%
> > > > of the socket buffer is a huge offset.
> > >
> > > sock_writeable() uses a 50% threshold.
> >
> > I am not sure why this is relevant here, the write side of sockets can
> > easily be flow controlled (e.g. the process waiting until it can send more
> > data). Also my clients are not necessarily client boxes. PaloAlto firewalls
> > can generate 70k events-per-second in syslog alone. And that does leave the
> > firewall, and my challenge is to read all of that.
> >
> > >
> > > >
> > > > I am not sure how many packets warrants a sk_rmem_alloc update, but I'd
> > > > assume that 1 update every 100 packets should still be OK.
> > >
> > > Maybe, but some UDP packets have a truesize around 128 KB or even more.
> >
> > I understand that the truesize incorporates struct sk_buff header and we may
> > also see non-linear SKBs, which could inflate the number (saying this without really
> > understanding all the specifics there).
> >
> > >
> > > Perhaps add a new UDP socket option to let the user decide on what
> > > they feel is better for them ?
> >
> > I wanted to avoid a knob for this, but I can easily implement this way. So
> > should I create a patch for a setsockopt() that allows setting
> > udp_sk->forward_threshold?
> >
> > >
> > > I suspect that the main issue is about having a single drop in the first place,
> > > because of false sharing on sk->sk_drops
> > >
> > > Perhaps we should move sk_drops on a dedicated cache line,
> > > and perhaps have two counters for NUMA servers.
> >
> > I am looking into sk_drops, I don't know what it does at the moment, it's
> > been a while I've last read this codebase :)
> >
>
> Can you post
>
> ss -aum src :1000 <replace 1000 with your UDP source port>
>
> We will check the dXXXX output (number of drops), per socket.
I don't have access to "ss", but I have this screenshot of similar metrics
that we collect every 30 seconds:
https://drive.google.com/file/d/1HrMHSrbrkwCILQiBgAZw-J1r39PBED0f/view?usp=sharing
These metrics are collected via SK_MEMINFO from each of the sockets.
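For reference, this is roughly how such counters can be read per socket
(sketch, error handling elided):

```
#include <stdio.h>
#include <sys/socket.h>
#include <linux/sock_diag.h>	/* SK_MEMINFO_* indices */

#ifndef SO_MEMINFO
#define SO_MEMINFO 55		/* from asm-generic/socket.h, kernel >= 4.12 */
#endif

static void dump_sock_meminfo(int fd)
{
	unsigned int mem[SK_MEMINFO_VARS];
	socklen_t len = sizeof(mem);

	if (getsockopt(fd, SOL_SOCKET, SO_MEMINFO, mem, &len) == 0)
		printf("rmem_alloc=%u rcvbuf=%u drops=%u\n",
		       mem[SK_MEMINFO_RMEM_ALLOC], mem[SK_MEMINFO_RCVBUF],
		       mem[SK_MEMINFO_DROPS]);
}
```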
Similar to this case, drops usually happen on all the threads at once, even
if the receive rate is really low. Right now (when this screenshot was
taken), the UDP socket buffer remained at ~400kB (the default, as the sysctl
knobs were not persisted).
--
Bazsi
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Eric Dumazet @ 2025-08-22 13:56 UTC
To: Balazs Scheidler; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 6:33 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
>
> On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> > On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > >
> > > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > > > > done to the counter.
> > > > > > >
> > > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > > > before the user-space process even has a chance to consume them.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > This seems very high usage for a single UDP socket.
> > > > > >
> > > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > > (and possibly more threads) ?
> > > > >
> > > > > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > > > load over multiple sockets evenly, instead of the normal load balancing
> > > > > algorithm built into SO_REUSEPORT.
> > > > >
> > > >
> > > > Great. But if you have many receive queues, are you sure this choice does not
> > > > add false sharing ?
> > >
> > > I am not sure how that could trigger false sharing here. I am using a
> > > "socket" filter, which generates a random number modulo the number of
> > > sockets:
> > >
> > > ```
> > > #include "vmlinux.h"
> > > #include <bpf/bpf_helpers.h>
> > >
> > > int number_of_sockets;
> > >
> > > SEC("socket")
> > > int random_choice(struct __sk_buff *skb)
> > > {
> > > if (number_of_sockets == 0)
> > > return -1;
> > >
> > > return bpf_get_prandom_u32() % number_of_sockets;
> > > }
> > > ```
> >
> > How many receive queues does your NIC have (ethtool -l eth0) ?
> >
> > This filter causes huge contention on the receive queues and various
> > socket fields, accessed by different cpus.
> >
> > You should instead perform a choice based on the napi_id (skb->napi_id)
>
> I don't have ssh access to the box, unfortunately. I'll look into napi_id,
> my historical knowledge of the IP stack is that we are using a single thread
> to handle incoming datagrams, but I have to realize that information did not
> age well. Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't
> have a say in that...).
>
> This box is a VM, but I am not even sure which virtualization stack is used;
> I am trying to find out the number of receive queues.
I think this is the critical part. The optimal eBPF program depends on this.
In any case, the 25% threshold makes the usable capacity smaller,
so I would advise setting bigger SO_RCVBUF values.
* Re: [RFC, RESEND] UDP receive path batching improvement
From: Balazs Scheidler @ 2025-08-22 14:33 UTC
To: Eric Dumazet; +Cc: netdev, pabeni
On Fri, Aug 22, 2025 at 06:56:03AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 6:33 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as a trigger when the update is
> > > > > > > > done to the counter.
> > > > > > > >
> > > > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > > > > before the user-space process even has a chance to consume them.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > This seems very high usage for a single UDP socket.
> > > > > > >
> > > > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > > > (and possibly more threads) ?
> > > > > >
> > > > > > Yes. I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > > > > load over multiple sockets evenly, instead of the normal load balancing
> > > > > > algorithm built into SO_REUSEPORT.
> > > > > >
> > > > >
> > > > > Great. But if you have many receive queues, are you sure this choice does not
> > > > > add false sharing ?
> > > >
> > > > I am not sure how that could trigger false sharing here. I am using a
> > > > "socket" filter, which generates a random number modulo the number of
> > > > sockets:
> > > >
> > > > ```
> > > > #include "vmlinux.h"
> > > > #include <bpf/bpf_helpers.h>
> > > >
> > > > int number_of_sockets;
> > > >
> > > > SEC("socket")
> > > > int random_choice(struct __sk_buff *skb)
> > > > {
> > > > if (number_of_sockets == 0)
> > > > return -1;
> > > >
> > > > return bpf_get_prandom_u32() % number_of_sockets;
> > > > }
> > > > ```
> > >
> > > How many receive queues does your NIC have (ethtool -l eth0) ?
> > >
> > > This filter causes huge contention on the receive queues and various
> > > socket fields, accessed by different cpus.
> > >
> > > You should instead perform a choice based on the napi_id (skb->napi_id)
> >
> > I don't have ssh access to the box, unfortunately. I'll look into napi_id,
> > my historical knowledge of the IP stack is that we are using a single thread
> > to handle incoming datagrams, but I have to realize that information did not
> > age well. Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't
> > have a say in that...).
> >
> > This box is a VM, but I am not even sure which virtualization stack is used;
> > I am trying to find out the number of receive queues.
>
> I think this is the critical part. The optimal eBPF program depends on this.
>
> In any case, the 25% threshold makes the usable capacity smaller,
> so I would advise setting bigger SO_RCVBUF values.
Thank you, that's exactly what we are doing. The box was power-cycled and we
lost the settings. I am now improving the eBPF load-balancing algorithm so
we make better use of caches on the kernel receive side.
What do you think about the recovery-from-drop part? I mean, if I could get
sk_rmem_alloc updated faster as userspace consumes packets, a single packet
drop would not cause this many packets to be lost, at the cost of loss
events being more spread out in time.
Would something like my original posting be acceptable?
--
Bazsi