From: David Gibson <david@gibson.dropbear.id.au>
To: Eric Dumazet <edumazet@google.com>
Cc: Stefano Brivio <sbrivio@redhat.com>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	netdev@vger.kernel.org, Kuniyuki Iwashima <kuniyu@amazon.com>,
	Mike Manning <mvrmanning@gmail.com>,
	Paul Holzinger <pholzing@redhat.com>,
	Philo Lu <lulie@linux.alibaba.com>,
	Cambda Zhu <cambda@linux.alibaba.com>,
	Fred Chen <fred.cc@alibaba-inc.com>,
	Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Subject: Re: [PATCH net-next 2/2] datagram, udp: Set local address and rehash socket atomically against lookup
Date: Mon, 9 Dec 2024 13:20:20 +1100
Message-ID: <Z1ZT5Cwd-VXK1_27@zatzit>
In-Reply-To: <CANn89iJqLU6RuHgdbz3iGNL_K8XaPBYr3pWqQmgth2TFf14obg@mail.gmail.com>

On Fri, Dec 06, 2024 at 10:04:33AM +0100, Eric Dumazet wrote:
> On Fri, Dec 6, 2024 at 3:16 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> > > On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> > > <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > [snip]
> > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > > index 6a01905d379f..8490408f6009 100644
> > > > > > --- a/net/ipv4/udp.c
> > > > > > +++ b/net/ipv4/udp.c
> > > > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > > > >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > > > >  {
> > > > > >         unsigned short hnum = ntohs(dport);
> > > > > > -       struct udp_hslot *hslot2;
> > > > > > +       struct udp_hslot *hslot, *hslot2;
> > > > > >         struct sock *result, *sk;
> > > > > >         unsigned int hash2;
> > > > > >
> > > > > > +       hslot = udp_hashslot(udptable, net, hnum);
> > > > > > +       spin_lock_bh(&hslot->lock);
> > > > >
> > > > > This is not acceptable.
> > > > > UDP is best effort, packets can be dropped.
> > > > > Please fix user application expectations.
> > > >
> > > > The packets aren't merely dropped, they're rejected with an ICMP Port
> > > > Unreachable.
> > >
> > > We made UDP stack scalable with RCU, it took years of work.
> > >
> > > And this patch is bringing back the UDP stack to horrible performance
> > > from more than a decade ago.
> > > Everybody will go back to DPDK.
> >
> > It's reasonable to be concerned about the performance impact.  But
> > this seems like premature hyperbole given no-one has numbers yet, or
> > has even suggested a specific benchmark to reveal the impact.
> >
> > > I am pretty certain this can be solved without using a spinlock in the
> > > fast path.
> >
> > Quite possibly.  But Stefano has tried, and it certainly wasn't
> > trivial.
> >
> > > Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> > > 10,000,000 packets per second....
> > >
> > > Changing source address on a UDP socket is highly unusual, we are not
> > > going to slow down UDP for this case.
> >
> > Changing in a general way is very rare, one specific case is not.
> > Every time you connect() a socket that wasn't previously bound to a
> > specific address you get an implicit source address change from
> > 0.0.0.0 or :: to something that depends on the routing table.
> >
> > > Application could instead open another socket, and would probably work
> > > on old linux versions.
> >
> > Possibly there's a procedure that would work here, but it's not at all
> > obvious:
> >
> >  * Clearly, you can't close the non-connected socket before opening
> >    the connected one - that just introduces a new much wider race.  It
> >    doesn't even get rid of the existing one, because unless you can
> >    independently predict what the correct bound address will be
> >    for a given peer address, the second socket will still have an
> >    address change when you connect().
> >
> 
> The order is kind of obvious.
> 
> Kernel does not have to deal with wrong application design.

What we're talking about is:

	bind("0.0.0.0:12345");
	connect("1.2.3.4:54321");

Which AFAIK has been a legal sequence since the sockets interface was
a thing.  I don't think it's reasonable to call expecting that *not*
to trigger ICMPs around the connect "wrong application design".
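
As a rough userspace illustration of that sequence (a sketch only, not
part of the patch): the hypothetical helper bind_then_connect() below
binds a UDP socket to the wildcard address, connects it, and records
what getsockname() reports before and after.  The helper name and its
parameters are made up for this example.  On Linux the local address is
expected to move from 0.0.0.0 to whatever source address routing picks
for the peer, which is the implicit address change being discussed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): bind a UDP socket to
 * 0.0.0.0:<port>, then connect() it to <peer_ip>:<peer_port>.
 * Stores the local address seen before and after connect().
 * Returns the socket fd, or -1 on error.
 */
static int bind_then_connect(unsigned short port, const char *peer_ip,
			     unsigned short peer_port,
			     struct in_addr *before, struct in_addr *after)
{
	struct sockaddr_in local, peer, cur;
	socklen_t len;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	/* bind("0.0.0.0:<port>") */
	memset(&local, 0, sizeof(local));
	local.sin_family = AF_INET;
	local.sin_addr.s_addr = htonl(INADDR_ANY);
	local.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
		goto fail;

	len = sizeof(cur);
	if (getsockname(fd, (struct sockaddr *)&cur, &len) < 0)
		goto fail;
	*before = cur.sin_addr;		/* still the wildcard address */

	/* connect("<peer_ip>:<peer_port>") */
	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(peer_port);
	if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1)
		goto fail;
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
		goto fail;

	len = sizeof(cur);
	if (getsockname(fd, (struct sockaddr *)&cur, &len) < 0)
		goto fail;
	*after = cur.sin_addr;		/* source address chosen by routing */

	return fd;

fail:
	close(fd);
	return -1;
}
```

Called with port 12345 and peer 1.2.3.4:54321 it mirrors the two calls
above; with a loopback peer, "after" should come back as 127.0.0.1.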

> >  * So, you must create the connected socket before closing the
> >    unconnected one, meaning you have to use SO_REUSEADDR or
> >    SO_REUSEPORT whether or not you otherwise wanted to.
> >
> >  * While both sockets are open, you need to handle the possibility
> >    that packets could be delivered to either one.  Doable, but a pain
> >    in the arse.
> 
> Given UDP does not have a proper listen() + accept() model, I am
> afraid this is the only way
> 
> You need to keep the generic UDP socket as a catch all, and deal with
> packets received on it.
> 
> >
> >  * How do you know when the transition is completed and you can close
> >    the unconnected socket?  The fact that the rehashing has completed
> >    and all the necessary memory barriers passed isn't something
> >    userspace can directly discern.
> >
> > > If the regression was recent, this would be considered as a normal regression,
> > > but apparently nobody noticed for 10 years. This should be saying something...
> >
> > It does.  But so does the fact that it can be trivially reproduced.
> 
> If a kernel fix is doable without making UDP stack a complete nogo for
> most of us,

The benchmarks Stefano has tried so far don't show an impact, and you
haven't yet suggested another one.  Again, calling this a "complete
nogo" seems like huge hyperbole without more data.

> I will be happy to review it.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

