From: Balazs Scheidler <bazsi77@gmail.com>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org, pabeni@redhat.com
Subject: Re: [RFC, RESEND] UDP receive path batching improvement
Date: Fri, 22 Aug 2025 16:33:46 +0200	[thread overview]
Message-ID: <aKh_yi0gASYajhev@bzorp3> (raw)
In-Reply-To: <CANn89iK5-WQ-geM6nzz_WOBwc8_jt7HQUqXbm_eDceydvf0FJQ@mail.gmail.com>

On Fri, Aug 22, 2025 at 06:56:03AM -0700, Eric Dumazet wrote:
> On Fri, Aug 22, 2025 at 6:33 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> >
> > On Fri, Aug 22, 2025 at 06:10:28AM -0700, Eric Dumazet wrote:
> > > On Fri, Aug 22, 2025 at 5:56 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 02:37:28AM -0700, Eric Dumazet wrote:
> > > > > On Fri, Aug 22, 2025 at 2:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Aug 22, 2025 at 01:18:36AM -0700, Eric Dumazet wrote:
> > > > > > > On Fri, Aug 22, 2025 at 1:15 AM Balazs Scheidler <bazsi77@gmail.com> wrote:
> > > > > > > > The condition above uses "sk->sk_rcvbuf >> 2" as the trigger for when the
> > > > > > > > counter is actually updated.
> > > > > > > >
> > > > > > > > In our case (syslog receive path via udp), socket buffers are generally
> > > > > > > > tuned up (in the order of 32MB or even more, I have seen 256MB as well), as
> > > > > > > > the senders can generate spikes in their traffic and a lot of senders send
> > > > > > > > to the same port. Due to latencies, sometimes these buffers take MBs of data
> > > > > > > > before the user-space process even has a chance to consume them.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > This seems like very high usage for a single UDP socket.
> > > > > > >
> > > > > > > Have you tried SO_REUSEPORT to spread incoming packets to more sockets
> > > > > > > (and possibly more threads) ?
> > > > > >
> > > > > > Yes.  I use SO_REUSEPORT (16 sockets), and I even use eBPF to distribute the
> > > > > > load evenly over the sockets, instead of the default load balancing
> > > > > > algorithm built into SO_REUSEPORT.
> > > > > >
> > > > >
> > > > > Great. But if you have many receive queues, are you sure this choice does not
> > > > > add false sharing ?
> > > >
> > > > I am not sure how that could trigger false sharing here.  I am using a
> > > > "socket" filter, which generates a random number modulo the number of
> > > > sockets:
> > > >
> > > > ```
> > > > #include "vmlinux.h"
> > > > #include <bpf/bpf_helpers.h>
> > > >
> > > > int number_of_sockets;
> > > >
> > > > SEC("socket")
> > > > int random_choice(struct __sk_buff *skb)
> > > > {
> > > >   if (number_of_sockets == 0)
> > > >     return -1;
> > > >
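> > > >   /* The return value selects the socket index within the
> > > >    * SO_REUSEPORT group; out-of-range values (like -1 here)
> > > >    * fall back to the default hash-based selection. */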
> > > >   return bpf_get_prandom_u32() % number_of_sockets;
> > > > }
> > > > ```
> > >
> > > How many receive queues does your NIC have (ethtool -l eth0) ?
> > >
> > > This filter causes huge contention on the receive queues and various
> > > socket fields, accessed by different cpus.
> > >
> > > You should instead perform a choice based on the napi_id (skb->napi_id)
> >
> > I don't have ssh access to the box, unfortunately.  I'll look into napi_id;
> > my historical understanding of the IP stack was that a single thread handles
> > incoming datagrams, but I have to accept that information has not aged well.
> > Also, the kernel is ancient, 4.18 something, RHEL8 (no, I didn't have a say
> > in that...).
> >
> > This box is a VM, but I am not even sure which virtualization stack it uses;
> > I am finding out the number of receive queues now.
> 
> I think this is the critical part. The optimal eBPF program depends on this.
> 
> In any case, the 25% threshold makes the usable capacity smaller,
> so I would advise setting bigger SO_RCVBUF values.

Thank you, that's exactly what we are doing.  The box was power-cycled and we
lost the settings.  I am now improving the eBPF load balancing algorithm so
we make better use of caches on the kernel receive side.
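
For the record, the direction I am experimenting with is a sketch along these
lines (untested, the program name is mine), keying the choice on skb->napi_id
instead of a random number so that all packets handled by one receive queue
land on the same socket:

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

int number_of_sockets;

SEC("socket")
int napi_choice(struct __sk_buff *skb)
{
  if (number_of_sockets == 0)
    return -1;

  /* Packets processed by the same NAPI instance (same receive
   * queue / softirq CPU) always map to the same socket index,
   * so each socket is fed from a single CPU. */
  return skb->napi_id % number_of_sockets;
}
```

A plain modulo can of course map two queues to the same socket, but it should
already avoid the cross-CPU scattering the random version causes.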

What do you think about the recovery-from-drop part?  I mean, if
sk_rmem_alloc were updated sooner as user space consumes packets, a single
packet drop would not cause this many packets to be lost, at the cost of
loss events being more spread out in time.
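
To put numbers on it (my back-of-the-envelope math, please correct me if I
misread the code): with a 32 MB SO_RCVBUF, the sk_rcvbuf >> 2 threshold means
32 MB * 25% = 8 MB of already-consumed data can stay accounted in
sk_rmem_alloc before the deferred release runs, so a burst arriving at the
wrong moment only sees about 24 MB of effective headroom.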

Would something like my original posting be acceptable?

-- 
Bazsi
