From: David Gibson <david@gibson.dropbear.id.au>
To: Eric Dumazet <edumazet@google.com>
Cc: Stefano Brivio <sbrivio@redhat.com>,
Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
netdev@vger.kernel.org, Kuniyuki Iwashima <kuniyu@amazon.com>,
Mike Manning <mvrmanning@gmail.com>,
Paul Holzinger <pholzing@redhat.com>,
Philo Lu <lulie@linux.alibaba.com>,
Cambda Zhu <cambda@linux.alibaba.com>,
Fred Chen <fred.cc@alibaba-inc.com>,
Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Subject: Re: [PATCH net-next 2/2] datagram, udp: Set local address and rehash socket atomically against lookup
Date: Fri, 6 Dec 2024 13:16:24 +1100
Message-ID: <Z1JeePBN5f1YCmYd@zatzit>
In-Reply-To: <CANn89i+PCsOHvd02nvM0oRjAXxPTgX6V1Y1-xfRL_43Ew9=H=w@mail.gmail.com>
On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > [snip]
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index 6a01905d379f..8490408f6009 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > > int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > > {
> > > > unsigned short hnum = ntohs(dport);
> > > > - struct udp_hslot *hslot2;
> > > > + struct udp_hslot *hslot, *hslot2;
> > > > struct sock *result, *sk;
> > > > unsigned int hash2;
> > > >
> > > > + hslot = udp_hashslot(udptable, net, hnum);
> > > > + spin_lock_bh(&hslot->lock);
> > >
> > > This is not acceptable.
> > > UDP is best effort, packets can be dropped.
> > > Please fix user application expectations.
> >
> > The packets aren't merely dropped, they're rejected with an ICMP Port
> > Unreachable.
>
> We made UDP stack scalable with RCU, it took years of work.
>
> And this patch is bringing back the UDP stack to horrible performance
> from more than a decade ago.
> Everybody will go back to DPDK.
It's reasonable to be concerned about the performance impact. But
this seems like premature hyperbole given no-one has numbers yet, or
has even suggested a specific benchmark to reveal the impact.
> I am pretty certain this can be solved without using a spinlock in the
> fast path.
Quite possibly. But Stefano has tried, and it certainly wasn't
trivial.
> Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> 10,000,000 packets per second....
>
> Changing source address on an UDP socket is highly unusual, we are not
> going to slow down UDP for this case.
Changing it in a general way is very rare, but one specific case is not.
Every time you connect() a socket that wasn't previously bound to a
specific address you get an implicit source address change from
0.0.0.0 or :: to something that depends on the routing table.
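
For illustration, the implicit address change can be observed from
userspace with getsockname() before and after connect(). This is a
minimal sketch, not part of the patch series; the function name
addr_change_on_connect() and its parameters are made up for this
example, and it only covers IPv4.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fetch the local IPv4 address the kernel currently records for fd. */
static int local_addr(int fd, struct in_addr *out)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);

	if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
		return -1;
	*out = sa.sin_addr;
	return 0;
}

/*
 * Hypothetical helper: bind a UDP socket to the wildcard address,
 * connect() it to peer_ip:port, and report the local address seen
 * before and after the connect(). Returns 0 on success, -1 on error.
 */
int addr_change_on_connect(const char *peer_ip, unsigned short port,
			   struct in_addr *before, struct in_addr *after)
{
	struct sockaddr_in any = { .sin_family = AF_INET };	/* 0.0.0.0:0 */
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(port) };
	int fd, rc = -1;

	if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1)
		return -1;
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/*
	 * connect() on a UDP socket sends nothing: it records the peer
	 * and lets the kernel pick a source address via the routing table.
	 */
	if (bind(fd, (struct sockaddr *)&any, sizeof(any)) == 0 &&
	    local_addr(fd, before) == 0 &&
	    connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
	    local_addr(fd, after) == 0)
		rc = 0;

	close(fd);
	return rc;
}
```

Called with a loopback peer such as "127.0.0.1", `before` should hold
the wildcard address and `after` a specific local address chosen by
the route lookup. That transition is the address change under
discussion.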
> Application could instead open another socket, and would probably work
> on old linux versions.
Possibly there's a procedure that would work here, but it's not at all
obvious:
 * Clearly, you can't close the non-connected socket before opening
   the connected one - that just introduces a new much wider race. It
   doesn't even get rid of the existing one, because unless you can
   independently predict what the correct bound address will be
   for a given peer address, the second socket will still have an
   address change when you connect().
 * So, you must create the connected socket before closing the
   unconnected one, meaning you have to use SO_REUSEADDR or
   SO_REUSEPORT whether or not you otherwise wanted to.
 * While both sockets are open, you need to handle the possibility
   that packets could be delivered to either one. Doable, but a pain
   in the arse.
 * How do you know when the transition is completed and you can close
   the unconnected socket? The fact that the rehashing has completed
   and all the necessary memory barriers passed isn't something
   userspace can directly discern.
> If the regression was recent, this would be considered as a normal regression,
> but apparently nobody noticed for 10 years. This should be saying something...
It does. But so does the fact that it can be trivially reproduced.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson