From: David Gibson <david@gibson.dropbear.id.au>
To: Eric Dumazet <edumazet@google.com>
Cc: Stefano Brivio <sbrivio@redhat.com>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	netdev@vger.kernel.org, Kuniyuki Iwashima <kuniyu@amazon.com>,
	Mike Manning <mvrmanning@gmail.com>,
	Paul Holzinger <pholzing@redhat.com>,
	Philo Lu <lulie@linux.alibaba.com>,
	Cambda Zhu <cambda@linux.alibaba.com>,
	Fred Chen <fred.cc@alibaba-inc.com>,
	Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Subject: Re: [PATCH net-next 2/2] datagram, udp: Set local address and rehash socket atomically against lookup
Date: Mon, 9 Dec 2024 13:20:20 +1100	[thread overview]
Message-ID: <Z1ZT5Cwd-VXK1_27@zatzit> (raw)
In-Reply-To: <CANn89iJqLU6RuHgdbz3iGNL_K8XaPBYr3pWqQmgth2TFf14obg@mail.gmail.com>

On Fri, Dec 06, 2024 at 10:04:33AM +0100, Eric Dumazet wrote:
> On Fri, Dec 6, 2024 at 3:16 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> > > On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> > > <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > [snip]
> > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > > index 6a01905d379f..8490408f6009 100644
> > > > > > --- a/net/ipv4/udp.c
> > > > > > +++ b/net/ipv4/udp.c
> > > > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > > > >                 int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > > > >  {
> > > > > >         unsigned short hnum = ntohs(dport);
> > > > > > -       struct udp_hslot *hslot2;
> > > > > > +       struct udp_hslot *hslot, *hslot2;
> > > > > >         struct sock *result, *sk;
> > > > > >         unsigned int hash2;
> > > > > >
> > > > > > +       hslot = udp_hashslot(udptable, net, hnum);
> > > > > > +       spin_lock_bh(&hslot->lock);
> > > > >
> > > > > This is not acceptable.
> > > > > UDP is best effort, packets can be dropped.
> > > > > Please fix user application expectations.
> > > >
> > > > The packets aren't merely dropped, they're rejected with an ICMP Port
> > > > Unreachable.
> > >
> > > We made UDP stack scalable with RCU, it took years of work.
> > >
> > > And this patch is bringing back the UDP stack to horrible performance
> > > from more than a decade ago.
> > > Everybody will go back to DPDK.
> >
> > It's reasonable to be concerned about the performance impact.  But
> > this seems like premature hyperbole given no-one has numbers yet, or
> > has even suggested a specific benchmark to reveal the impact.
> >
> > > I am pretty certain this can be solved without using a spinlock in the
> > > fast path.
> >
> > Quite possibly.  But Stefano has tried, and it certainly wasn't
> > trivial.
> >
> > > Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> > > 10,000,000 packets per second....
> > >
> > > Changing source address on an UDP socket is highly unusual, we are not
> > > going to slow down UDP for this case.
> >
> > Changing in a general way is very rare, one specific case is not.
> > Every time you connect() a socket that wasn't previously bound to a
> > specific address you get an implicit source address change from
> > 0.0.0.0 or :: to something that depends on the routing table.
> >
> > > Application could instead open another socket, and would probably work
> > > on old linux versions.
> >
> > Possibly there's a procedure that would work here, but it's not at all
> > obvious:
> >
> >  * Clearly, you can't close the non-connected socket before opening
> >    the connected one - that just introduces a new much wider race.  It
> >    doesn't even get rid of the existing one, because unless you can
> >    independently predict what the correct bound address will be
> >    for a given peer address, the second socket will still have an
> >    address change when you connect().
> >
> 
> The order is kind of obvious.
> 
> Kernel does not have to deal with wrong application design.

What we're talking about is:

	bind("0.0.0.0:12345");
	connect("1.2.3.4:54321");

Which AFAIK has been a legal sequence since the sockets interface was
a thing.  I don't think it's reasonable to call expecting that *not*
to trigger ICMPs around the connect "wrong application design".
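
To make the implicit address change concrete, here is a minimal userspace
sketch of that sequence. It is an illustration only, not part of the patch:
the helper names get_local_addr() and bind_then_connect() are made up, and it
connects to loopback rather than 1.2.3.4 so that it does not depend on the
routing table of the machine it runs on.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: store the socket's current local IPv4 address
 * (host byte order) in *addr.  Returns 0 on success, -1 on error. */
static int get_local_addr(int fd, uint32_t *addr)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);

	if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
		return -1;
	*addr = ntohl(sa.sin_addr.s_addr);
	return 0;
}

/* Hypothetical helper: bind a UDP socket to the wildcard address, then
 * connect() it to a loopback peer.  Records the local address reported
 * before and after the connect().  Returns 0 on success, -1 on error. */
static int bind_then_connect(uint32_t *before, uint32_t *after)
{
	struct sockaddr_in any = { .sin_family = AF_INET };
	struct sockaddr_in peer = { .sin_family = AF_INET };
	int fd, ret = -1;

	assert(before && after);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	any.sin_addr.s_addr = htonl(INADDR_ANY);
	any.sin_port = htons(0);		/* let the kernel pick a port */
	peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	peer.sin_port = htons(54321);		/* nothing needs to listen here */

	if (bind(fd, (struct sockaddr *)&any, sizeof(any)) == 0 &&
	    get_local_addr(fd, before) == 0 &&
	    connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
	    get_local_addr(fd, after) == 0)
		ret = 0;

	close(fd);
	return ret;
}
```

On Linux, *before comes back as INADDR_ANY and *after as INADDR_LOOPBACK:
the connect() alone moved the socket's local address from 0.0.0.0 to a
specific one, with no second bind() from the application.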

> >  * So, you must create the connected socket before closing the
> >    unconnected one, meaning you have to use SO_REUSEADDR or
> >    SO_REUSEPORT whether or not you otherwise wanted to.
> >
> >  * While both sockets are open, you need to handle the possibility
> >    that packets could be delivered to either one.  Doable, but a pain
> >    in the arse.
> 
> Given UDP does not have a proper listen() + accept() model, I am
> afraid this is the only way
> 
> You need to keep the generic UDP socket as a catch all, and deal with
> packets received on it.
> 
> >
> >  * How do you know when the transition is completed and you can close
> >    the unconnected socket?  The fact that the rehashing has completed
> >    and all the necessary memory barriers passed isn't something
> >    userspace can directly discern.
> >
> > > If the regression was recent, this would be considered as a normal regression,
> > > but apparently nobody noticed for 10 years. This should be saying something...
> >
> > It does.  But so does the fact that it can be trivially reproduced.
> 
> If a kernel fix is doable without making UDP stack a complete nogo for
> most of us,

The benchmarks Stefano has tried so far don't show an impact, and you
haven't yet suggested another one.  Again, calling this a "complete
nogo" seems like huge hyperbole without more data.

> I will be happy to review it.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
