Re: Possible race with xsk_flush

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Shawn Bohrer <sbohrer@cloudflare.com>
To: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, bjorn@kernel.org,
	magnus.karlsson@intel.com, kernel-team@cloudflare.com
Subject: Re: Possible race with xsk_flush
Date: Tue, 20 Dec 2022 09:25:57 -0600	[thread overview]
Message-ID: <Y6HUBWCytyebNnOx@sbohrer-cf-dell> (raw)
In-Reply-To: <CAJ8uoz1D3WY=joXqMo80a5Vqx+3N=5YX6Lh=KC1=coM5zDb-dA@mail.gmail.com>

On Tue, Dec 20, 2022 at 10:06:48AM +0100, Magnus Karlsson wrote:
> On Tue, Dec 20, 2022 at 2:32 AM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
> >
> > On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> > > To summarize, we are expecting this ordering:
> > >
> > > CPU 0 __xsk_rcv_zc()
> > > CPU 0 __xsk_map_flush()
> > > CPU 2 __xsk_rcv_zc()
> > > CPU 2 __xsk_map_flush()
> > >
> > > But we are seeing this order:
> > >
> > > CPU 0 __xsk_rcv_zc()
> > > CPU 2 __xsk_rcv_zc()
> > > CPU 0 __xsk_map_flush()
> > > CPU 2 __xsk_map_flush()
> > >
> > > Here is the veth NAPI poll loop:
> > >
> > > static int veth_poll(struct napi_struct *napi, int budget)
> > > {
> > >     struct veth_rq *rq =
> > >     container_of(napi, struct veth_rq, xdp_napi);
> > >     struct veth_stats stats = {};
> > >     struct veth_xdp_tx_bq bq;
> > >     int done;
> > >
> > >     bq.count = 0;
> > >
> > >     xdp_set_return_frame_no_direct();
> > >     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> > >
> > >     if (done < budget && napi_complete_done(napi, done)) {
> > >         /* Write rx_notify_masked before reading ptr_ring */
> > >        smp_store_mb(rq->rx_notify_masked, false);
> > >        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> > >            if (napi_schedule_prep(&rq->xdp_napi)) {
> > >                WRITE_ONCE(rq->rx_notify_masked, true);
> > >                __napi_schedule(&rq->xdp_napi);
> > >             }
> > >         }
> > >     }
> > >
> > >     if (stats.xdp_tx > 0)
> > >         veth_xdp_flush(rq, &bq);
> > >     if (stats.xdp_redirect > 0)
> > >         xdp_do_flush();
> > >     xdp_clear_return_frame_no_direct();
> > >
> > >     return done;
> > > }
> > >
> > > Something I have never seen before is that there is
> > > napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> > > Let us check if this has something to do with it. So two suggestions
> > > to be executed separately:
> > >
> > > * Put a probe at the __napi_schedule() above and check if it gets
> > > triggered before this problem
> > > * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> > > before "if (done < budget && napi_complete_done(napi, done)) {"
> > >
> > > This might provide us some hints on what is going on.
> >
> > After staring at this code for way too long I finally made a
> > breakthrough!  I could not understand how this race could occur when
> > napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():
> >
> > ```
> >   static inline void *netpoll_poll_lock(struct napi_struct *napi)
> >   {
> >     struct net_device *dev = napi->dev;
> >
> >     if (dev && dev->npinfo) {
> >       int owner = smp_processor_id();
> >
> >       while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
> >         cpu_relax();
> >
> >       return napi;
> >     }
> >     return NULL;
> >   }
> > ```
> > If dev or dev->npinfo are NULL then it doesn't acquire a lock at all!
> > Adding some more trace points I see:
> >
> > ```
> >   iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
> >   iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> >   iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> >   iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000
> > ```
> >
> > Here you can see that poll_owner=-1 meaning the lock was never
> > acquired because npinfo is NULL.  This means that the same veth rx
> > queue can be napi_polled from multiple CPU and nothing stops it from
> > running concurrently.  They all look like this, just most of the time
> > there aren't concurrent napi_polls running for the same queue.  They
> > do however move around CPUs as I explained earlier.
> >
> > I'll note that I've ran with your suggested change of moving
> > xdp_do_flush() before napi_complete_done() all weekend and I have not
> > reproduced the issue.  I don't know if that truly means the issue is
> > fixed by that change or not.  I suspect it does fix the issue because
> > it prevents the napi_struct from being scheduled again before the
> > first poll has completed, and nap_schedule_prep() ensures that only
> > one instance is ever running.
> 
> Thanks Shawn! Good news that the patch seems to fix the problem. To
> me, napi_schedule_prep() makes sure that only one NAPI instance is
> running. Netpoll is an optional feature and I do not even have it
> compiled into my kernel. At least I do not have it defined in my
> .config and I cannot find any netpoll symbols with a readelf command.
> If netpoll is not used, I would suspect that npinfo == NULL. So to me,
> it is still a mystery why this is happening.

Oh I don't think it is a mystery anymore.  The napi_complete_done()
signals that this instance of of the napi_poll is complete.  As you
said nap_schedule_prep() checks to ensure that only one instance of
napi_poll is running at a time, but we just indicated it was done with
napi_complete_done().  This allows this CPU or more importantly any
other CPU to reschedule napi polling for this receive queue, but we
haven't called xdp_do_flush() yet so the flush can race.  I'll note
that the napi_schedule_prep()/__napi_schedule() in veth_poll really
isn't the problem since it will schedule itself back on the same CPU.
The problem is simply that another CPU is free to call
napi_scheulde_prep()/__napi_schedule() in that window after
napi_complete_done() and before xdp_do_flush().  The veth driver can
schedule a napi_poll from the transmit path which is what starts the
poll on a second CPU.

I was simply blinded by that stupid netpoll_poll_lock() but it is
completely unnecessary.
 
> To validate or disprove that there are two instances of the same napi
> running, could you please put a probe in veth_poll() right after
> veth_xdp_rcv() and one at xdp_do_flush() and dump the napi id in both
> cases? And run the original code with the bug :-). It would be good to
> know what exactly happens in the code between these points when this
> bug occurs. Maybe we could do this by enabling the trace buffer and
> dumping it when the bug occurs.

The two instances that race are definitely different napi_polls, the
second one is scheduled from the veth transmit path.

> > If we think this is the correct fix I'll let it run for another day or
> > two and prepare a patch.
> 
> There is one more advantage to this bug fix and that is performance.
> By calling __xdp_map_flush() immediately after the receive function,
> user space can start to work on the packets quicker so performance
> will improve.

Yes definitely.

Thanks,
Shawn Bohrer

next prev parent reply	other threads:[~2022-12-20 15:26 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-14 22:32 Possible race with xsk_flush Shawn Bohrer
2022-12-15 10:22 ` Magnus Karlsson
2022-12-15 19:50   ` Alex Forster
2022-12-16  0:14   ` Shawn Bohrer
2022-12-16 10:05     ` Magnus Karlsson
2022-12-16 16:48       ` Shawn Bohrer
2022-12-16 23:42       ` Shawn Bohrer
2022-12-20  1:31       ` Shawn Bohrer
2022-12-20  9:06         ` Magnus Karlsson
2022-12-20 15:25           ` Shawn Bohrer [this message]
2022-12-20 15:59             ` Magnus Karlsson
2022-12-20 16:02               ` Shawn Bohrer
2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
2022-12-22  1:07   ` Jakub Kicinski
2022-12-22 14:30     ` Paolo Abeni
2022-12-22 10:18   ` Paolo Abeni
2023-01-11 14:02     ` Magnus Karlsson
2023-01-11 14:21       ` Toke Høiland-Jørgensen
2023-01-11 16:24         ` Magnus Karlsson
2023-01-11 23:06           ` Toke Høiland-Jørgensen
2022-12-22 14:40   ` patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y6HUBWCytyebNnOx@sbohrer-cf-dell \
    --to=sbohrer@cloudflare.com \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kernel-team@cloudflare.com \
    --cc=magnus.karlsson@gmail.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.