Re: [RFC, RESEND] UDP receive path batching improvement

From: Balazs Scheidler <bazsi77@gmail.com>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org, pabeni@redhat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement
Date: Fri, 22 Aug 2025 16:33:46 +0200
Message-ID: <aKh_yi0gASYajhev@bzorp3>
In-Reply-To: <CANn89iK5-WQ-geM6nzz_WOBwc8_jt7HQUqXbm_eDceydvf0FJQ@mail.gmail.com>

On Fri, Aug 22, 2025 at 06:56:03AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 6:33 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as the trigger for when the
> > > > > > > > counter is updated.
> > > > > > > >
> > > > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > > > > before the user-space process even has a chance to consume them.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > This seems very high usage for a single UDP socket.
> > > > > > >
> > > > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > > > (and possibly more threads) ?
> > > > > >
> > > > > > Yes.  I use SO_REUSEPORT (16 sockets), I even use eBPF to distribute the
> > > > > > load over multiple sockets evenly, instead of the normal load balancing
> > > > > > algorithm built into SO_REUSEPORT.
> > > > > >
> > > > >
> > > > > Great. But if you have many receive queues, are you sure this choice does not
> > > > > add false sharing ?
> > > >
> > > > I am not sure how that could trigger false sharing here.  I am using a
> > > > "socket" filter, which generates a random number modulo the number of
> > > > sockets:
> > > >
> > > > ```
> > > > #include "vmlinux.h"
> > > > #include <bpf/bpf_helpers.h>
> > > >
> > > > int number_of_sockets;
> > > >
> > > > SEC("socket")
> > > > int random_choice(struct __sk_buff *skb)
> > > > {
> > > >   if (number_of_sockets == 0)
> > > >     return -1;
> > > >
> > > >   return bpf_get_prandom_u32() % number_of_sockets;
> > > > }
> > > > ```
> > >
> > > How many receive queues does your NIC have (ethtool -l eth0) ?
> > >
> > > This filter causes huge contention on the receive queues and various
> > > socket fields, accessed by different cpus.
> > >
> > > You should instead perform a choice based on the napi_id (skb->napi_id)
> >
> > I don't have ssh access to the box, unfortunately.  I'll look into napi_id,
> > my historical knowledge of the IP stack is that we are using a single thread
> > to handle incoming datagrams, but I have to realize that information did not
> > age well. Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't
> > have a say in that...).
> >
> > This box is a VM, but I am not even sure about the virtualization stack used. I
> > am finding out the number of receive queues.
> 
> I think this is the critical part. The optimal eBPF program depends on this.
> 
> In any case, the 25% threshold makes the usable capacity smaller,
> so I would advise setting bigger SO_RCVBUF values.

Thank you, that's exactly what we are doing.  The box was power-cycled and we
lost the settings.  I am now improving the eBPF load balancing algorithm so
we make better use of caches on the kernel receive side.
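
A rough sketch of the napi_id-based choice suggested above, written as a
plain C model of the mapping rather than as the BPF program itself.  The
helper name pick_socket is hypothetical, and the simple modulo mapping is an
assumption, not a tested design.  In the filter, the first argument would
come from skb->napi_id:

```c
#include <stdint.h>

/*
 * Hypothetical helper: map the NAPI id of the receive queue a packet
 * arrived on to an index in the SO_REUSEPORT group, so that packets
 * from one queue keep landing on the same socket.
 */
static inline uint32_t pick_socket(uint32_t napi_id, uint32_t number_of_sockets)
{
	/* Mirrors the "return -1" of the random filter: no sockets configured. */
	if (number_of_sockets == 0)
		return UINT32_MAX;

	/* Assumption: a plain modulo is good enough to spread the queues. */
	return napi_id % number_of_sockets;
}
```

With a single receive queue every packet would map to the same socket, so
how well this spreads the load depends on the queue count I still have to
find out.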

What do you think about the recovery-from-drop part?  I mean if I could get
sk_rmem_alloc updated faster as the userspace consumes packets, a single
packet drop would not cause this many packets to be lost, at the cost of
loss events being more spread out in time.
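
For a sense of scale, here is the arithmetic behind the "sk->sk_rcvbuf >> 2"
condition as a small C sketch.  The function names are hypothetical, and
treating the buffer sizes mentioned earlier in the thread as the effective
sk_rcvbuf values is an assumption:

```c
#include <stdint.h>

/*
 * Bytes of already-consumed data that can accumulate before the release
 * is propagated, per the "sk->sk_rcvbuf >> 2" condition (25% of the buffer).
 */
static inline uint64_t release_threshold(uint64_t sk_rcvbuf)
{
	return sk_rcvbuf >> 2;
}

/*
 * Capacity left for new packets in the worst case, when consumed data
 * just below the threshold is still charged to the socket.
 */
static inline uint64_t worst_case_usable(uint64_t sk_rcvbuf)
{
	return sk_rcvbuf - release_threshold(sk_rcvbuf);
}
```

For a 32MB buffer the threshold works out to 8MB, and for 256MB to 64MB, so
after a drop the reader may have to consume up to that much before the
accounting is updated.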

Would something like my original posting be acceptable?

-- 
Bazsi
