netdev.vger.kernel.org archive mirror
* Possible race with xsk_flush
@ 2022-12-14 22:32 Shawn Bohrer
  2022-12-15 10:22 ` Magnus Karlsson
  2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
  0 siblings, 2 replies; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-14 22:32 UTC (permalink / raw)
  To: netdev, bpf; +Cc: bjorn, magnus.karlsson, kernel-team

Hello,

I've been trying to track down a problem we've been seeing with
occasional corrupt packets using AF_XDP.  As you probably expect, this
occurs because we end up with the same descriptor in the ring multiple
times.  However, after lots of debugging and analysis I think this is
actually caused by a kernel bug, though I still don't fully understand
what is happening.  This is currently being seen on kernel 5.15.77 and
I haven't done any testing to see if we see the same issue on newer
versions.

Inside my application I've written code to track and log as
descriptors are placed into the fill and tx queues, and when they are
pulled out of the rx and completion queues.  The transitions are
logged to /sys/kernel/debug/tracing/trace_marker.  Additionally I keep
my own VecDeque which mirrors the order descriptors are placed in the
fill queue and I verify that they come out of the rx queue in the same
order.  I do realize there are some cases where they might not come
out in the same order but I do not appear to be hitting that.

I then add several kprobes to track the kernel side with ftrace. Most
importantly are these probes:

This adds a probe on the call to xskq_prod_reserve_desc().  It
actually creates two probes, so you see two prints every time it is hit:
perf probe -a '__xsk_rcv_zc:7 addr len xs xs->pool->fq'

This adds a probe on xsk_flush():
perf probe -a 'xsk_flush:0 xs'
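
For reference, the first probe lands on the xskq_prod_reserve_desc()
call inside __xsk_rcv_zc().  Paraphrasing the 5.15-era net/xdp/xsk.c
from memory (so treat this as a sketch, not the exact source), the
function looks roughly like this:

```
/* Sketch of __xsk_rcv_zc(), paraphrased from 5.15-era net/xdp/xsk.c;
 * details may differ slightly from the real source. */
static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
	u64 addr;
	int err;

	addr = xp_get_handle(xskb);
	/* Writes addr/len into the next Rx ring slot; this is the call
	 * the probe above is sitting on. */
	err = xskq_prod_reserve_desc(xs->rx, addr, len);
	if (err) {
		xs->rx_queue_full++;
		return err;
	}

	xp_release(xskb);
	return 0;
}
```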

My AF_XDP application is bound to two multi-queue veth interfaces
(called 'ingress' and 'egress' in my prints) in a networking
namespace.  I'm then generating traffic with iperf and hping3 through
these interfaces.  When I see an out-of-order descriptor in my
application I dump the state of my internal VecDeque to the
trace_marker.  Here is an example of what I'm seeing which looks like
a kernel bug:

 // On CPU 0 we've removed descriptor 0xff0900 from the fill queue,
 // copied a packet, but have not put it in the rx queue yet
 flowtrackd-9chS-142014  [000] d.Z1. 609766.698512: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x9b/0x250) addr=0xff0900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
 flowtrackd-9chS-142014  [000] d.Z1. 609766.698513: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xff0900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
 // On CPU 2 we've removed descriptor 0x1000900 from the fill queue,
 // copied a packet, but have not put it in the rx queue yet
          iperf2-1217    [002] d.Z1. 609766.698523: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x9b/0x250) addr=0x1000900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
          iperf2-1217    [002] d.Z1. 609766.698524: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x1000900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
 // On CPU 0 xsk_flush is called on the socket
 flowtrackd-9chS-142014  [000] d.Z1. 609766.698528: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff90fd32693c00
 // My application receives 0xff0900 on the ingress interface queue 1
 flowtrackd-9chS-142014  [000] ..... 609766.698540: tracing_mark_write: ingress q:1 0xff0900 FILL -> RX
 // On CPU 2 xsk_flush is called on the socket
          iperf2-1217    [002] d.Z1. 609766.698545: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff90fd32693c00
 // My application receives 0xf61900, which is unexpected.  We expected
 // to receive 0x1000900 which is what you saw in the previous
 // __xsk_rcv_zc print.  0xf61900 is in the fill queue but it is far
 // away from our current position in the ring and I've trimmed the
 // print slightly to show that.
 flowtrackd-9chS-142014  [000] ..... 609766.698617: tracing_mark_write: ingress q:1 0xf61900 FILL -> RX: expected 0x1000900 remaining: [fe4100, f9c100, f8a100, ..., f61900

From reading the code I believe that the call to
xskq_prod_reserve_desc() inside __xsk_rcv_zc is the only place
descriptors are placed into the RX queue.  To me this means I should
see a print from my probe for the mystery 0xf61900 but we do not see
this print.  Instead we see the expected 0x1000900.  One theory I have
is that there could be a race where CPU 2 increments the cached_prod
pointer but has not yet updated the addr and len, CPU 0 calls
xsk_flush(), and now my application reads the old descriptor entry
from that location in the RX ring.  This would explain everything, but
the problem with this theory is that __xsk_rcv_zc() and xsk_flush()
are getting called from within napi_poll() and this appears to hold
the netpoll_poll_lock() for the whole time which I think should
prevent the race I just described.
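
To make that suspected window concrete, here is a condensed sketch of
the two sides involved, paraphrased from net/xdp/xsk_queue.h as I read
it (not the exact source):

```
/* Paraphrased from net/xdp/xsk_queue.h (5.15-era); not exact source. */

/* Producer side, called from __xsk_rcv_zc(): reserve a slot and fill
 * in the descriptor. */
static inline int xskq_prod_reserve_desc(struct xsk_queue *q, u64 addr, u32 len)
{
	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
	u32 idx;

	if (xskq_prod_is_full(q))
		return -ENOBUFS;

	idx = q->cached_prod++ & q->ring_mask;	/* slot claimed here...      */
	ring->desc[idx].addr = addr;		/* ...contents written after */
	ring->desc[idx].len = len;

	return 0;
}

/* Publish side, called from xsk_flush(): make everything reserved so
 * far visible to user space. */
static inline void xskq_prod_submit(struct xsk_queue *q)
{
	smp_store_release(&q->ring->producer, q->cached_prod);
}
```

If CPU 2 were between the cached_prod increment and the desc[] writes
at the moment CPU 0 called xsk_flush() on the same ring, the release
store would publish a slot whose contents are still whatever was there
before, which is exactly the kind of stale descriptor I'm seeing.  But
as I said, napi_poll() looks like it should serialize all of this.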

A couple more notes:
* The ftrace print order and timestamps seem to indicate that the CPU
  2 napi_poll is running before the CPU 0 xsk_flush().  I don't know
  if these timestamps can be trusted but it does imply that maybe this
  can race as I described.  I've triggered this twice with xsk_flush
  probes and both show the order above.
* In the 3 times I've triggered this it has occurred right when the
  softirq processing switches CPUs
* I've also triggered this before I added the xsk_flush() probe and
  in that case saw the kernel side additionally fill in the next
  expected descriptor, which in the example above would be 0xfe4100.
  This seems to indicate that my tracking is all still sane.
* This is fairly reproducible, but I've got 10 test boxes running and
  I only get maybe one bug a day.

Any thoughts on if the bug I described is actually possible,
alternative theories, or other things to test/try would be welcome.

Thanks,
Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-14 22:32 Possible race with xsk_flush Shawn Bohrer
@ 2022-12-15 10:22 ` Magnus Karlsson
  2022-12-15 19:50   ` Alex Forster
  2022-12-16  0:14   ` Shawn Bohrer
  2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
  1 sibling, 2 replies; 21+ messages in thread
From: Magnus Karlsson @ 2022-12-15 10:22 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Wed, Dec 14, 2022 at 11:33 PM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
>
> Hello,
>
> I've been trying to track down a problem we've been seeing with
> occasional corrupt packets using AF_XDP.  As you probably expect, this
> occurs because we end up with the same descriptor in the ring multiple
> times.  However, after lots of debugging and analysis I think this is
> actually caused by a kernel bug, though I still don't fully understand
> what is happening.  This is currently being seen on kernel 5.15.77 and
> I haven't done any testing to see if we see the same issue on newer
> versions.
>
> Inside my application I've written code to track and log as
> descriptors are placed into the fill and tx queues, and when they are
> pulled out of the rx and completion queues.  The transitions are
> logged to /sys/kernel/debug/tracing/trace_marker.  Additionally I keep
> my own VecDeque which mirrors the order descriptors are placed in the
> fill queue and I verify that they come out of the rx queue in the same
> order.  I do realize there are some cases where they might not come
> out in the same order but I do not appear to be hitting that.
>
> I then add several kprobes to track the kernel side with ftrace. Most
> importantly are these probes:
>
> This adds a probe on the call to xskq_prod_reserve_desc().  It
> actually creates two probes, so you see two prints every time it is hit:
> perf probe -a '__xsk_rcv_zc:7 addr len xs xs->pool->fq'
>
> This adds a probe on xsk_flush():
> perf probe -a 'xsk_flush:0 xs'
>
> My AF_XDP application is bound to two multi-queue veth interfaces
> (called 'ingress' and 'egress' in my prints) in a networking
> namespace.  I'm then generating traffic with iperf and hping3 through
> these interfaces.  When I see an out-of-order descriptor in my
> application I dump the state of my internal VecDeque to the
> trace_marker.  Here is an example of what I'm seeing which looks like
> a kernel bug:
>
>  // On CPU 0 we've removed descriptor 0xff0900 from the fill queue,
>  // copied a packet, but have not put it in the rx queue yet
>  flowtrackd-9chS-142014  [000] d.Z1. 609766.698512: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x9b/0x250) addr=0xff0900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
>  flowtrackd-9chS-142014  [000] d.Z1. 609766.698513: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xff0900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
>  // On CPU 2 we've removed descriptor 0x1000900 from the fill queue,
>  // copied a packet, but have not put it in the rx queue yet
>           iperf2-1217    [002] d.Z1. 609766.698523: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x9b/0x250) addr=0x1000900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
>           iperf2-1217    [002] d.Z1. 609766.698524: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x1000900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
>  // On CPU 0 xsk_flush is called on the socket
>  flowtrackd-9chS-142014  [000] d.Z1. 609766.698528: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff90fd32693c00
>  // My application receives 0xff0900 on the ingress interface queue 1
>  flowtrackd-9chS-142014  [000] ..... 609766.698540: tracing_mark_write: ingress q:1 0xff0900 FILL -> RX
>  // On CPU 2 xsk_flush is called on the socket
>           iperf2-1217    [002] d.Z1. 609766.698545: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff90fd32693c00
>  // My application receives 0xf61900, which is unexpected.  We expected
>  // to receive 0x1000900 which is what you saw in the previous
>  // __xsk_rcv_zc print.  0xf61900 is in the fill queue but it is far
>  // away from our current position in the ring and I've trimmed the
>  // print slightly to show that.
>  flowtrackd-9chS-142014  [000] ..... 609766.698617: tracing_mark_write: ingress q:1 0xf61900 FILL -> RX: expected 0x1000900 remaining: [fe4100, f9c100, f8a100, ..., f61900
>
> From reading the code I believe that the call to
> xskq_prod_reserve_desc() inside __xsk_rcv_zc is the only place
> descriptors are placed into the RX queue.  To me this means I should
> see a print from my probe for the mystery 0xf61900 but we do not see
> this print.  Instead we see the expected 0x1000900.  One theory I have
> is that there could be a race where CPU 2 increments the cached_prod
> pointer but has not yet updated the addr and len, CPU 0 calls
> xsk_flush(), and now my application reads the old descriptor entry
> from that location in the RX ring.  This would explain everything, but
> the problem with this theory is that __xsk_rcv_zc() and xsk_flush()
> are getting called from within napi_poll() and this appears to hold
> the netpoll_poll_lock() for the whole time which I think should
> prevent the race I just described.

Thanks Shawn for your detailed bug report. The rings between user
space and kernel space are single-producer/single-consumer, so in your
case when both CPU0 and CPU2 are working on the fill ring and the Rx
ring at the same time, this will indeed produce races. The intended
design principle to protect against this is that only one NAPI context
can access any given ring at any point in time and that the NAPI logic
should prevent two instances of the same NAPI context from running at
the same time. So if that is not true for some reason, we would get a
race like this. Another option is that one of the CPUs should really
be processing another fill ring instead of the same one.

Do you see the second socket being worked on when this happens?

Could you please share how you set up the two AF_XDP sockets?

Are you using XDP_DRV mode in your tests?

> A couple more notes:
> * The ftrace print order and timestamps seem to indicate that the CPU
>   2 napi_poll is running before the CPU 0 xsk_flush().  I don't know
>   if these timestamps can be trusted but it does imply that maybe this
>   can race as I described.  I've triggered this twice with xsk_flush
>   probes and both show the order above.
> * In the 3 times I've triggered this it has occurred right when the
>   softirq processing switches CPUs

This is interesting. Could you check, in some way, if you only have
one core working on the fill ring before the softirq switching and
then after that you have two? And if you have two, is that period
transient?

> * I've also triggered this before I added the xsk_flush() probe and
>   in that case saw the kernel side additionally fill in the next
>   expected descriptor, which in the example above would be 0xfe4100.
>   This seems to indicate that my tracking is all still sane.
> * This is fairly reproducible, but I've got 10 test boxes running and
>   I only get maybe one bug a day.
>
> Any thoughts on if the bug I described is actually possible,
> alternative theories, or other things to test/try would be welcome.

I thought this would be impossible, but apparently not :-). We are
apparently doing something wrong in the AF_XDP code or have the wrong
assumptions in some situation, but I just do not know what at this
point in time. Maybe it is veth that breaks some of our assumptions,
who knows. But let us dig into it. I need your help here, because I
think it will be hard for me to reproduce the issue.

Thanks: Magnus

> Thanks,
> Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-15 10:22 ` Magnus Karlsson
@ 2022-12-15 19:50   ` Alex Forster
  2022-12-16  0:14   ` Shawn Bohrer
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Forster @ 2022-12-15 19:50 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Shawn Bohrer, netdev, bpf, bjorn, magnus.karlsson, kernel-team

Hi Magnus,

> Could you please share how you set up the two AF_XDP sockets?

Our architecture is pretty unique:

   outside of │ inside of
    namespace │ namespace
              │
    ┌───────┐ │ ┌───────┐
    │ outer │ │ │ inner │
    │  veth │ │ │ veth  │
    └──┬─▲──┘ │ └──┬─▲──┘
       │ │    │    │ │
    ┌──▼─┴────┴────▼─┴──┐
    │    shared umem    │
    └───────────────────┘

The goal is to position ourselves in the middle of a veth pair so that
we can perform bidirectional traffic inspection and manipulation. To do
this, we attach AF_XDP to both veth interfaces and share a umem between
them. This allows us to forward packets between the veth interfaces
without copying in userspace.

These interfaces are both multi-queue, with AF_XDP sockets attached to
each queue. The queues are each managed on their own (unpinned) threads
and have their own rx/tx/fill/completion rings. We also enable
threaded NAPI on both of these interfaces, which may or may not be an
important detail to note, since the problem appears much harder (though
not impossible) to reproduce with threaded NAPI enabled.
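
In rough libxdp terms (our real code is different and this is only an
illustrative sketch; names, sizes, and error handling are
placeholders), the per-queue setup looks something like:

```
/* Illustrative sketch only: one shared umem, one AF_XDP socket per
 * interface/queue pair, using the libxdp/libbpf xsk helpers.  Ring
 * sizing, error handling and our custom XDP program are all omitted. */
#include <stdlib.h>
#include <unistd.h>
#include <xdp/xsk.h>		/* <bpf/xsk.h> with older libbpf */

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

int main(void)
{
	struct xsk_ring_prod outer_fill, inner_fill, outer_tx, inner_tx;
	struct xsk_ring_cons outer_comp, inner_comp, outer_rx, inner_rx;
	struct xsk_socket *outer, *inner;
	struct xsk_umem *umem;
	void *bufs;

	if (posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE))
		return 1;

	/* One umem shared by every socket on both veth peers. */
	if (xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
			     &outer_fill, &outer_comp, NULL))
		return 1;

	/* Queue 0 on each peer shown; in reality we do this for every
	 * queue, and each socket gets its own rx/tx/fill/completion
	 * rings managed by its own thread. */
	if (xsk_socket__create_shared(&outer, "outer-veth", 0, umem,
				      &outer_rx, &outer_tx,
				      &outer_fill, &outer_comp, NULL))
		return 1;
	if (xsk_socket__create_shared(&inner, "inner-veth", 0, umem,
				      &inner_rx, &inner_tx,
				      &inner_fill, &inner_comp, NULL))
		return 1;

	return 0;
}
```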

Here’s a script that configures a namespace and veth pair that closely
resembles production, except that it does not enable threaded NAPI:

```
#!/bin/bash

set -e -u -x -o pipefail

QUEUES=${QUEUES:=$(($(grep -c ^processor /proc/cpuinfo)))}

OUTER_CUSTOMER_VETH=${OUTER_CUSTOMER_VETH:=outer-veth}
INNER_CUSTOMER_VETH=${INNER_CUSTOMER_VETH:=inner-veth}
CUSTOMER_NAMESPACE=${CUSTOMER_NAMESPACE:=customer-namespace}

ip netns add $CUSTOMER_NAMESPACE
ip netns exec $CUSTOMER_NAMESPACE bash <<EOF
  set -e -u -x -o pipefail
  ip addr add 127.0.0.1/8 dev lo
  ip link set dev lo up
EOF

ip link add \
  name $OUTER_CUSTOMER_VETH \
  numrxqueues $QUEUES numtxqueues $QUEUES type veth \
  peer name $INNER_CUSTOMER_VETH netns $CUSTOMER_NAMESPACE \
  numrxqueues $QUEUES numtxqueues $QUEUES

ethtool -K $OUTER_CUSTOMER_VETH \
  gro off gso off tso off tx off rxvlan off txvlan off
ip link set dev $OUTER_CUSTOMER_VETH up
ip addr add 169.254.10.1/30 dev $OUTER_CUSTOMER_VETH

ip netns exec $CUSTOMER_NAMESPACE bash <<EOF
  set -e -u -x -o pipefail
  ethtool -K $INNER_CUSTOMER_VETH \
    gro off gso off tso off tx off rxvlan off txvlan off
  ip link set dev $INNER_CUSTOMER_VETH up
  ip addr add 169.254.10.2/30 dev $INNER_CUSTOMER_VETH
EOF
```

> Are you using XDP_DRV mode in your tests?

Yes.



* Re: Possible race with xsk_flush
  2022-12-15 10:22 ` Magnus Karlsson
  2022-12-15 19:50   ` Alex Forster
@ 2022-12-16  0:14   ` Shawn Bohrer
  2022-12-16 10:05     ` Magnus Karlsson
  1 sibling, 1 reply; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-16  0:14 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Thu, Dec 15, 2022 at 11:22:05AM +0100, Magnus Karlsson wrote:
> Thanks Shawn for your detailed bug report. The rings between user
> space and kernel space are single-producer/single-consumer, so in your
> case when both CPU0 and CPU2 are working on the fill ring and the Rx
> ring at the same time, this will indeed produce races. The intended
> design principle to protect against this is that only one NAPI context
> can access any given ring at any point in time and that the NAPI logic
> should prevent two instances of the same NAPI instance from running at
> the same time. So if that is not true for some reason, we would get a
> race like this. Another option is that one of the CPUs should really
> process another fill ring instead of the same.
> 
> Do you see the second socket being worked on when this happens?
> 
> Could you please share how you set up the two AF_XDP sockets?

Alex Forster sent more details on the configuration but just to
reiterate there are actually 8 AF_XDP sockets in this test setup.
There are two veth interfaces and each interface has four receive
queues.  We create one socket per interface/queue pair.  Our XDP
program redirects each packet to the correct AF_XDP socket based on
the queue number.

Yes there is often activity on other sockets near the time when the
bug occurs.  This is why I'm printing xs/fq, the socket address and
fill queue address, and printing the ingress/egress device name and
queue number in my prints.  This allows to match up the user space and
kernel space prints.  Additionally we are using a shared UMEM so
descriptors could move around between sockets though I've tried to
minimize this and in every case I've seen so far the mystery
descriptor was last used on the same socket and has also been in the
fill queue just not next in line.

> Are you using XDP_DRV mode in your tests?
> 
> > A couple more notes:
> > * The ftrace print order and timestamps seem to indicate that the CPU
> >   2 napi_poll is running before the CPU 0 xsk_flush().  I don't know
> >   if these timestamps can be trusted but it does imply that maybe this
> >   can race as I described.  I've triggered this twice with xsk_flush
> >   probes and both show the order above.
> > * In the 3 times I've triggered this it has occurred right when the
> >   softirq processing switches CPUs
> 
> This is interesting. Could you check, in some way, if you only have
> one core working on the fill ring before the softirq switching and
> then after that you have two? And if you have two, is that period
> transient?

I think what you are asking is why does the softirq processing switch
CPUs?  There is still a lot I don't fully understand here but I've
tried to understand this, if only to try to make it happen more
frequently and make this easier to reproduce.

In this test setup there is no hardware IRQ.  iperf2 sends the packet
and the CPU where iperf is running runs the veth softirq.  I'm not
sure how it picks which veth receive queue receives the packets, but
they end up distributed across the veth queues.  Additionally
__veth_xdp_flush() calls __napi_schedule().  This is called from
veth_xdp_xmit() which I think means that transmitting packets from
AF_XDP also schedules the softirq on the current CPU for that veth
queue.  What I definitely see is that if I pin both iperf and my
application to a single CPU all softirqs of all queues run on that
single CPU.  If I pin iperf2 to one core and my application to another
core I get softirqs for all veth queues on both cores.
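
For reference, the scheduling path I'm describing looks roughly like
this in drivers/net/veth.c (paraphrased from the 5.15-era source,
details may differ):

```
/* Paraphrased from drivers/net/veth.c (5.15-era); not exact source.
 * veth_xdp_xmit() enqueues frames onto the receiving rq's ptr_ring and
 * then flushes, which can schedule that rq's NAPI on the *current* CPU: */
static void __veth_xdp_flush(struct veth_rq *rq)
{
	/* Write ptr_ring before reading rx_notify_masked */
	smp_mb();
	if (!READ_ONCE(rq->rx_notify_masked) &&
	    napi_schedule_prep(&rq->xdp_napi)) {
		WRITE_ONCE(rq->rx_notify_masked, true);
		__napi_schedule(&rq->xdp_napi);
	}
}
```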

In our test setup we aren't applying any cpu affinity.  iperf2 is
multi-threaded and can run on all 4 cores, and our application is
multithreaded and can run on all 4 cores.  The napi scheduling seems
to be per veth queue and yes I see those softirqs move and switch
between CPUs.  I don't however have anything that clearly shows it
running concurrently on two CPUs (The stretches of __xsk_rcv_zc are
all on one core before it switches).  The closest I have is the
several microseconds where it appears xsk_flush() overlaps at the end
of my traces.  I would think that if the napi locking didn't work at
all you'd see clear overlap.

From my experiments with CPU affinity I've updated my test setup to
frequently change the CPU affinity of iperf and our application on one
of my test boxes with hopes that it helps to reproduce but I have no
results so far.

> > * I've also triggered this before I added the xsk_flush() probe and
> >   in that case saw the kernel side additionally fill in the next
> >   expected descriptor, which in the example above would be 0xfe4100.
> >   This seems to indicate that my tracking is all still sane.
> > * This is fairly reproducible, but I've got 10 test boxes running and
> >   I only get maybe one bug a day.
> >
> > Any thoughts on if the bug I described is actually possible,
> > alternative theories, or other things to test/try would be welcome.
> 
> I thought this would be impossible, but apparently not :-). We are
> apparently doing something wrong in the AF_XDP code or have the wrong
> assumptions in some situation, but I just do not know what at this
> point in time. Maybe it is veth that breaks some of our assumptions,
> who knows. But let us dig into it. I need your help here, because I
> think it will be hard for me to reproduce the issue.

Yeah if you have ideas on what to test I'll do my best to try them.

I've additionally updated my application to put a bad "cookie"
descriptor address back in the RX ring before updating the consumer
pointer.  My hope is that if we then ever receive that cookie it
proves the kernel raced and failed to update the correct address.
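
The idea is roughly the C sketch below (our application is actually
Rust; this just illustrates the trick using the libxdp xsk_ring
helpers, and the names are mine):

```
/* Illustrative sketch of the descriptor poisoning (real code is Rust).
 * After copying out the descriptors we just consumed, overwrite the
 * slots with a cookie before releasing them back to the kernel.  If we
 * ever read the cookie out of the Rx ring later, the kernel must have
 * published a slot it never wrote. */
#include <xdp/xsk.h>		/* <bpf/xsk.h> with older libbpf */

#define POISON_COOKIE 0xdeadbeefdeadbeefULL

static void poison_and_release(struct xsk_ring_cons *rx, __u32 idx, __u32 rcvd)
{
	for (__u32 i = 0; i < rcvd; i++) {
		/* xsk_ring_cons__rx_desc() returns a const pointer; we
		 * cast it away since we own these slots until release. */
		struct xdp_desc *desc =
			(struct xdp_desc *)xsk_ring_cons__rx_desc(rx, idx + i);

		desc->addr = POISON_COOKIE;
		desc->len = 0;
	}
	xsk_ring_cons__release(rx, rcvd);
}
```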

Thanks,
Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-16  0:14   ` Shawn Bohrer
@ 2022-12-16 10:05     ` Magnus Karlsson
  2022-12-16 16:48       ` Shawn Bohrer
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Magnus Karlsson @ 2022-12-16 10:05 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Fri, Dec 16, 2022 at 1:15 AM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
>
> On Thu, Dec 15, 2022 at 11:22:05AM +0100, Magnus Karlsson wrote:
> > Thanks Shawn for your detailed bug report. The rings between user
> > space and kernel space are single-producer/single-consumer, so in your
> > case when both CPU0 and CPU2 are working on the fill ring and the Rx
> > ring at the same time, this will indeed produce races. The intended
> > design principle to protect against this is that only one NAPI context
> > can access any given ring at any point in time and that the NAPI logic
> > should prevent two instances of the same NAPI instance from running at
> > the same time. So if that is not true for some reason, we would get a
> > race like this. Another option is that one of the CPUs should really
> > process another fill ring instead of the same.
> >
> > Do you see the second socket being worked on when this happens?
> >
> > Could you please share how you set up the two AF_XDP sockets?
>
> Alex Forster sent more details on the configuration but just to
> reiterate there are actually 8 AF_XDP sockets in this test setup.
> There are two veth interfaces and each interface has four receive
> queues.  We create one socket per interface/queue pair.  Our XDP
> program redirects each packet to the correct AF_XDP socket based on
> the queue number.
>
> Yes there is often activity on other sockets near the time when the
> bug occurs.  This is why I'm printing xs/fq, the socket address and
> fill queue address, and printing the ingress/egress device name and
> queue number in my prints.  This allows me to match up the user space and
> kernel space prints.  Additionally we are using a shared UMEM so
> descriptors could move around between sockets though I've tried to
> minimize this and in every case I've seen so far the mystery
> descriptor was last used on the same socket and has also been in the
> fill queue just not next in line.
>
> > Are you using XDP_DRV mode in your tests?
> >
> > > A couple more notes:
> > > * The ftrace print order and timestamps seem to indicate that the CPU
> > >   2 napi_poll is running before the CPU 0 xsk_flush().  I don't know
> > >   if these timestamps can be trusted but it does imply that maybe this
> > >   can race as I described.  I've triggered this twice with xsk_flush
> > >   probes and both show the order above.
> > > * In the 3 times I've triggered this it has occurred right when the
> > >   softirq processing switches CPUs
> >
> > This is interesting. Could you check, in some way, if you only have
> > one core working on the fill ring before the softirq switching and
> > then after that you have two? And if you have two, is that period
> > transient?
>
> I think what you are asking is why does the softirq processing switch
> CPUs?  There is still a lot I don't fully understand here but I've
> tried to understand this, if only to try to make it happen more
> frequently and make this easier to reproduce.
>
> In this test setup there is no hardware IRQ.  iperf2 sends the packet
> and the CPU where iperf is running runs the veth softirq.  I'm not
> sure how it picks which veth receive queue receives the packets, but
> they end up distributed across the veth queues.  Additionally
> __veth_xdp_flush() calls __napi_schedule().  This is called from
> veth_xdp_xmit() which I think means that transmitting packets from
> AF_XDP also schedules the softirq on the current CPU for that veth
> queue.  What I definitely see is that if I pin both iperf and my
> application to a single CPU all softirqs of all queues run on that
> single CPU.  If I pin iperf2 to one core and my application to another
> core I get softirqs for all veth queues on both cores.

To summarize, we are expecting this ordering:

CPU 0 __xsk_rcv_zc()
CPU 0 __xsk_map_flush()
CPU 2 __xsk_rcv_zc()
CPU 2 __xsk_map_flush()

But we are seeing this order:

CPU 0 __xsk_rcv_zc()
CPU 2 __xsk_rcv_zc()
CPU 0 __xsk_map_flush()
CPU 2 __xsk_map_flush()

Here is the veth NAPI poll loop:

static int veth_poll(struct napi_struct *napi, int budget)
{
    struct veth_rq *rq =
    container_of(napi, struct veth_rq, xdp_napi);
    struct veth_stats stats = {};
    struct veth_xdp_tx_bq bq;
    int done;

    bq.count = 0;

    xdp_set_return_frame_no_direct();
    done = veth_xdp_rcv(rq, budget, &bq, &stats);

    if (done < budget && napi_complete_done(napi, done)) {
        /* Write rx_notify_masked before reading ptr_ring */
       smp_store_mb(rq->rx_notify_masked, false);
       if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
           if (napi_schedule_prep(&rq->xdp_napi)) {
               WRITE_ONCE(rq->rx_notify_masked, true);
               __napi_schedule(&rq->xdp_napi);
            }
        }
    }

    if (stats.xdp_tx > 0)
        veth_xdp_flush(rq, &bq);
    if (stats.xdp_redirect > 0)
        xdp_do_flush();
    xdp_clear_return_frame_no_direct();

    return done;
}

Something I have never seen before is that there is
napi_complete_done() and a __napi_schedule() before xdp_do_flush().
Let us check if this has something to do with it. So two suggestions
to be executed separately:

* Put a probe at the __napi_schedule() above and check if it gets
triggered before this problem
* Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
before "if (done < budget && napi_complete_done(napi, done)) {"

This might provide us some hints on what is going on.
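
For the second suggestion, veth_poll() would look roughly like below.
Only the xdp_do_flush() call is moved; everything else is unchanged:

static int veth_poll(struct napi_struct *napi, int budget)
{
    struct veth_rq *rq =
    container_of(napi, struct veth_rq, xdp_napi);
    struct veth_stats stats = {};
    struct veth_xdp_tx_bq bq;
    int done;

    bq.count = 0;

    xdp_set_return_frame_no_direct();
    done = veth_xdp_rcv(rq, budget, &bq, &stats);

    /* Moved up: flush the redirect before napi_complete_done() */
    if (stats.xdp_redirect > 0)
        xdp_do_flush();

    if (done < budget && napi_complete_done(napi, done)) {
        /* Write rx_notify_masked before reading ptr_ring */
        smp_store_mb(rq->rx_notify_masked, false);
        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
            if (napi_schedule_prep(&rq->xdp_napi)) {
                WRITE_ONCE(rq->rx_notify_masked, true);
                __napi_schedule(&rq->xdp_napi);
            }
        }
    }

    if (stats.xdp_tx > 0)
        veth_xdp_flush(rq, &bq);
    xdp_clear_return_frame_no_direct();

    return done;
}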

Thanks: Magnus

> In our test setup we aren't applying any cpu affinity.  iperf2 is
> multi-threaded and can run on all 4 cores, and our application is
> multithreaded and can run on all 4 cores.  The napi scheduling seems
> to be per veth queue and yes I see those softirqs move and switch
> between CPUs.  I don't however have anything that clearly shows it
> running concurrently on two CPUs (The stretches of __xsk_rcv_zc are
> all on one core before it switches).  The closest I have is the
> several microseconds where it appears xsk_flush() overlaps at the end
> of my traces.  I would think that if the napi locking didn't work at
> all you'd see clear overlap.
>
> From my experiments with CPU affinity I've updated my test setup to
> frequently change the CPU affinity of iperf and our application on one
> of my test boxes with hopes that it helps to reproduce but I have no
> results so far.
>
> > > * I've also triggered this before I added the xsk_flush() probe and
> > >   in that case saw the kernel side additionally fill in the next
> > >   expected descriptor, which in the example above would be 0xfe4100.
> > >   This seems to indicate that my tracking is all still sane.
> > > * This is fairly reproducible, but I've got 10 test boxes running and
> > >   I only get maybe one bug a day.
> > >
> > > Any thoughts on if the bug I described is actually possible,
> > > alternative theories, or other things to test/try would be welcome.
> >
> > I thought this would be impossible, but apparently not :-). We are
> > apparently doing something wrong in the AF_XDP code or have the wrong
> > assumptions in some situation, but I just do not know what at this
> > point in time. Maybe it is veth that breaks some of our assumptions,
> > who knows. But let us dig into it. I need your help here, because I
> > think it will be hard for me to reproduce the issue.
>
> Yeah if you have ideas on what to test I'll do my best to try them.
>
> I've additionally updated my application to put a bad "cookie"
> descriptor address back in the RX ring before updating the consumer
> pointer.  My hope is that if we then ever receive that cookie it
> proves the kernel raced and failed to update the correct address.
>
> Thanks,
> Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-16 10:05     ` Magnus Karlsson
@ 2022-12-16 16:48       ` Shawn Bohrer
  2022-12-16 23:42       ` Shawn Bohrer
  2022-12-20  1:31       ` Shawn Bohrer
  2 siblings, 0 replies; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> To summarize, we are expecting this ordering:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_rcv_zc()
> CPU 2 __xsk_map_flush()
> 
> But we are seeing this order:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 2 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_map_flush()
 
Yes exactly, and I think I've proved that this really is the order,
and the race is occurring.  See my cookie/poisoning below.

> Here is the veth NAPI poll loop:
> 
> static int veth_poll(struct napi_struct *napi, int budget)
> {
>     struct veth_rq *rq =
>     container_of(napi, struct veth_rq, xdp_napi);
>     struct veth_stats stats = {};
>     struct veth_xdp_tx_bq bq;
>     int done;
> 
>     bq.count = 0;
> 
>     xdp_set_return_frame_no_direct();
>     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> 
>     if (done < budget && napi_complete_done(napi, done)) {
>         /* Write rx_notify_masked before reading ptr_ring */
>        smp_store_mb(rq->rx_notify_masked, false);
>        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>            if (napi_schedule_prep(&rq->xdp_napi)) {
>                WRITE_ONCE(rq->rx_notify_masked, true);
>                __napi_schedule(&rq->xdp_napi);
>             }
>         }
>     }
> 
>     if (stats.xdp_tx > 0)
>         veth_xdp_flush(rq, &bq);
>     if (stats.xdp_redirect > 0)
>         xdp_do_flush();
>     xdp_clear_return_frame_no_direct();
> 
>     return done;
> }
> 
> Something I have never seen before is that there is
> napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> Let us check if this has something to do with it. So two suggestions
> to be executed separately:
> 
> * Put a probe at the __napi_schedule() above and check if it gets
> triggered before this problem
> * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> before "if (done < budget && napi_complete_done(napi, done)) {"
> 
> This might provide us some hints on what is going on.

Excellent observation.  I haven't really looked at what
napi_complete_done() does yet.  I did notice it could call
__napi_schedule() and that seemed like it might be fine.  I'll also
note that veth_xdp_flush() can also ultimately call __napi_schedule().
I'll see what I can do to explore these ideas.
 
> > I've additionally updated my application to put a bad "cookie"
> > descriptor address back in the RX ring before updating the consumer
> > pointer.  My hope is that if we then ever receive that cookie it
> > proves the kernel raced and failed to update the correct address.

I guess this is more like poisoning the old descriptors rather than a
cookie.  This ran last night and one of my machines read back my
0xdeadbeefdeadbeef poisoned cookie value:

          iperf2-125483  [003] d.Z1. 792878.867088: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x8d4900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125483  [003] d.Z1. 792878.867093: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff8bbc542a5000
          iperf2-125491  [001] d.Z1. 792878.867219: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xc79900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125491  [001] d.Z1. 792878.867229: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff8bbc542a5000
          iperf2-125491  [001] d.Z1. 792878.867291: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x18e1900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125483  [003] d.Z1. 792878.867441: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xc0a900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125491  [001] d.Z1. 792878.867457: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff8bbc542a5000
 flowtrackd-zjTA-201813  [001] ..... 792878.867496: tracing_mark_write: ingress q:2 0x8d4900 FILL -> RX
 flowtrackd-zjTA-201813  [001] ..... 792878.867503: tracing_mark_write: ingress q:2 0xc79900 FILL -> RX
 flowtrackd-zjTA-201813  [001] ..... 792878.867506: tracing_mark_write: ingress q:2 0x18e1900 FILL -> RX
 flowtrackd-zjTA-201813  [001] ..... 792878.867524: tracing_mark_write: read invalid descriptor cookie: 0xdeadbeefdeadbeef

This shows what I've seen before where the xsk_flush() of CPU 1 runs
after (during?) __xsk_rcv_zc() of CPU 3.  In this trace we never see
the xsk_flush() from CPU 3 but I stop tracing when the bug occurs so
it probably just hasn't happened yet.

So at least to me this does confirm there is definitely a race here
where we can flush an updated producer pointer before the descriptor
address has been filled in.

--
Shawn


* Re: Possible race with xsk_flush
  2022-12-16 10:05     ` Magnus Karlsson
  2022-12-16 16:48       ` Shawn Bohrer
@ 2022-12-16 23:42       ` Shawn Bohrer
  2022-12-20  1:31       ` Shawn Bohrer
  2 siblings, 0 replies; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-16 23:42 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> To summarize, we are expecting this ordering:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_rcv_zc()
> CPU 2 __xsk_map_flush()
> 
> But we are seeing this order:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 2 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_map_flush()
> 
> Here is the veth NAPI poll loop:
> 
> static int veth_poll(struct napi_struct *napi, int budget)
> {
>     struct veth_rq *rq =
>     container_of(napi, struct veth_rq, xdp_napi);
>     struct veth_stats stats = {};
>     struct veth_xdp_tx_bq bq;
>     int done;
> 
>     bq.count = 0;
> 
>     xdp_set_return_frame_no_direct();
>     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> 
>     if (done < budget && napi_complete_done(napi, done)) {
>         /* Write rx_notify_masked before reading ptr_ring */
>        smp_store_mb(rq->rx_notify_masked, false);
>        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>            if (napi_schedule_prep(&rq->xdp_napi)) {
>                WRITE_ONCE(rq->rx_notify_masked, true);
>                __napi_schedule(&rq->xdp_napi);
>             }
>         }
>     }
> 
>     if (stats.xdp_tx > 0)
>         veth_xdp_flush(rq, &bq);
>     if (stats.xdp_redirect > 0)
>         xdp_do_flush();
>     xdp_clear_return_frame_no_direct();
> 
>     return done;
> }
> 
> Something I have never seen before is that there is
> napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> Let us check if this has something to do with it. So two suggestions
> to be executed separately:
> 
> * Put a probe at the __napi_schedule() above and check if it gets
> triggered before this problem
> * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> before "if (done < budget && napi_complete_done(napi, done)) {"

I've built a kernel moving xdp_do_flush() before napi_complete_done()
and will leave this running over the weekend.  I also spent a while
trying to understand the napi code to see if that reordering seemed
like it might cause the bug.  I'll admit I still don't fully understand
the napi code or how the bug can happen.  We'll see if this appears to
fix the issue, but since it is somewhat hard to reproduce I'd love for
someone to be able to explain why it fixes it.

Thanks,
Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-16 10:05     ` Magnus Karlsson
  2022-12-16 16:48       ` Shawn Bohrer
  2022-12-16 23:42       ` Shawn Bohrer
@ 2022-12-20  1:31       ` Shawn Bohrer
  2022-12-20  9:06         ` Magnus Karlsson
  2 siblings, 1 reply; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-20  1:31 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> To summarize, we are expecting this ordering:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_rcv_zc()
> CPU 2 __xsk_map_flush()
> 
> But we are seeing this order:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 2 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_map_flush()
> 
> Here is the veth NAPI poll loop:
> 
> static int veth_poll(struct napi_struct *napi, int budget)
> {
>     struct veth_rq *rq =
>     container_of(napi, struct veth_rq, xdp_napi);
>     struct veth_stats stats = {};
>     struct veth_xdp_tx_bq bq;
>     int done;
> 
>     bq.count = 0;
> 
>     xdp_set_return_frame_no_direct();
>     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> 
>     if (done < budget && napi_complete_done(napi, done)) {
>         /* Write rx_notify_masked before reading ptr_ring */
>        smp_store_mb(rq->rx_notify_masked, false);
>        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>            if (napi_schedule_prep(&rq->xdp_napi)) {
>                WRITE_ONCE(rq->rx_notify_masked, true);
>                __napi_schedule(&rq->xdp_napi);
>             }
>         }
>     }
> 
>     if (stats.xdp_tx > 0)
>         veth_xdp_flush(rq, &bq);
>     if (stats.xdp_redirect > 0)
>         xdp_do_flush();
>     xdp_clear_return_frame_no_direct();
> 
>     return done;
> }
> 
> Something I have never seen before is that there is
> napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> Let us check if this has something to do with it. So two suggestions
> to be executed separately:
> 
> * Put a probe at the __napi_schedule() above and check if it gets
> triggered before this problem
> * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> before "if (done < budget && napi_complete_done(napi, done)) {"
> 
> This might provide us some hints on what is going on.

After staring at this code for way too long I finally made a
breakthrough!  I could not understand how this race could occur when
napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():

```
  static inline void *netpoll_poll_lock(struct napi_struct *napi)
  {
    struct net_device *dev = napi->dev;

    if (dev && dev->npinfo) {
      int owner = smp_processor_id();

      while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
        cpu_relax();

      return napi;
    }
    return NULL;
  }
```
If dev or dev->npinfo are NULL then it doesn't acquire a lock at all!
Adding some more trace points I see:

```
  iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
  iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
  iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
  iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000 
```

Here you can see that poll_owner=-1 meaning the lock was never
acquired because npinfo is NULL.  This means that the same veth rx
queue can be napi_polled from multiple CPUs and nothing stops it from
running concurrently.  They all look like this, just most of the time
there aren't concurrent napi_polls running for the same queue.  They
do however move around CPUs as I explained earlier.

I'll note that I've run with your suggested change of moving
xdp_do_flush() before napi_complete_done() all weekend and I have not
reproduced the issue.  I don't know if that truly means the issue is
fixed by that change or not.  I suspect it does fix the issue because
it prevents the napi_struct from being scheduled again before the
first poll has completed, and napi_schedule_prep() ensures that only
one instance is ever running.
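
My simplified mental model of the NAPI serialization (this is not the
real net/core code, just how I'm thinking about it):

```
/* Simplified mental model of the NAPI serialization; this is NOT the
 * real net/core/dev.c code, which does more state handling. */
static bool napi_schedule_prep_model(struct napi_struct *n)
{
	/* Only the caller that flips SCHED from 0 to 1 gets to run the
	 * poll; everyone else backs off. */
	return !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
}

static void napi_complete_done_model(struct napi_struct *n)
{
	/* Clearing SCHED re-opens the door: from this point any CPU can
	 * win napi_schedule_prep() and start a new poll, even though
	 * veth_poll() has not called xdp_do_flush() yet. */
	clear_bit(NAPI_STATE_SCHED, &n->state);
}
```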

If we think this is the correct fix I'll let it run for another day or
two and prepare a patch.

Thanks,
Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-20  1:31       ` Shawn Bohrer
@ 2022-12-20  9:06         ` Magnus Karlsson
  2022-12-20 15:25           ` Shawn Bohrer
  0 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2022-12-20  9:06 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Tue, Dec 20, 2022 at 2:32 AM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
>
> On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> > To summarize, we are expecting this ordering:
> >
> > CPU 0 __xsk_rcv_zc()
> > CPU 0 __xsk_map_flush()
> > CPU 2 __xsk_rcv_zc()
> > CPU 2 __xsk_map_flush()
> >
> > But we are seeing this order:
> >
> > CPU 0 __xsk_rcv_zc()
> > CPU 2 __xsk_rcv_zc()
> > CPU 0 __xsk_map_flush()
> > CPU 2 __xsk_map_flush()
> >
> > Here is the veth NAPI poll loop:
> >
> > static int veth_poll(struct napi_struct *napi, int budget)
> > {
> >     struct veth_rq *rq =
> >     container_of(napi, struct veth_rq, xdp_napi);
> >     struct veth_stats stats = {};
> >     struct veth_xdp_tx_bq bq;
> >     int done;
> >
> >     bq.count = 0;
> >
> >     xdp_set_return_frame_no_direct();
> >     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> >
> >     if (done < budget && napi_complete_done(napi, done)) {
> >         /* Write rx_notify_masked before reading ptr_ring */
> >        smp_store_mb(rq->rx_notify_masked, false);
> >        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> >            if (napi_schedule_prep(&rq->xdp_napi)) {
> >                WRITE_ONCE(rq->rx_notify_masked, true);
> >                __napi_schedule(&rq->xdp_napi);
> >             }
> >         }
> >     }
> >
> >     if (stats.xdp_tx > 0)
> >         veth_xdp_flush(rq, &bq);
> >     if (stats.xdp_redirect > 0)
> >         xdp_do_flush();
> >     xdp_clear_return_frame_no_direct();
> >
> >     return done;
> > }
> >
> > Something I have never seen before is that there is
> > napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> > Let us check if this has something to do with it. So two suggestions
> > to be executed separately:
> >
> > * Put a probe at the __napi_schedule() above and check if it gets
> > triggered before this problem
> > * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> > before "if (done < budget && napi_complete_done(napi, done)) {"
> >
> > This might provide us some hints on what is going on.
>
> After staring at this code for way too long I finally made a
> breakthrough!  I could not understand how this race could occur when
> napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():
>
> ```
>   static inline void *netpoll_poll_lock(struct napi_struct *napi)
>   {
>     struct net_device *dev = napi->dev;
>
>     if (dev && dev->npinfo) {
>       int owner = smp_processor_id();
>
>       while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
>         cpu_relax();
>
>       return napi;
>     }
>     return NULL;
>   }
> ```
> If dev or dev->npinfo are NULL then it doesn't acquire a lock at all!
> Adding some more trace points I see:
>
> ```
>   iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
>   iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
>   iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
>   iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000
> ```
>
> Here you can see that poll_owner=-1 meaning the lock was never
> acquired because npinfo is NULL.  This means that the same veth rx
> queue can be napi_polled from multiple CPUs and nothing stops it from
> running concurrently.  They all look like this, just most of the time
> there aren't concurrent napi_polls running for the same queue.  They
> do however move around CPUs as I explained earlier.
>
> I'll note that I've run with your suggested change of moving
> xdp_do_flush() before napi_complete_done() all weekend and I have not
> reproduced the issue.  I don't know if that truly means the issue is
> fixed by that change or not.  I suspect it does fix the issue because
> it prevents the napi_struct from being scheduled again before the
> first poll has completed, and napi_schedule_prep() ensures that only
> one instance is ever running.

Thanks Shawn! Good news that the patch seems to fix the problem. To
me, napi_schedule_prep() makes sure that only one NAPI instance is
running. Netpoll is an optional feature and I do not even have it
compiled into my kernel. At least I do not have it defined in my
.config and I cannot find any netpoll symbols with a readelf command.
If netpoll is not used, I would suspect that npinfo == NULL. So to me,
it is still a mystery why this is happening.

To validate or disprove that there are two instances of the same napi
running, could you please put a probe in veth_poll() right after
veth_xdp_rcv() and one at xdp_do_flush() and dump the napi id in both
cases? And run the original code with the bug :-). It would be good to
know what exactly happens in the code between these points when this
bug occurs. Maybe we could do this by enabling the trace buffer and
dumping it when the bug occurs.

> If we think this is the correct fix I'll let it run for another day or
> two and prepare a patch.

There is one more advantage to this bug fix and that is performance.
By calling __xsk_map_flush() immediately after the receive function,
user space can start to work on the packets quicker so performance
will improve.

> Thanks,
> Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-20  9:06         ` Magnus Karlsson
@ 2022-12-20 15:25           ` Shawn Bohrer
  2022-12-20 15:59             ` Magnus Karlsson
  0 siblings, 1 reply; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-20 15:25 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Tue, Dec 20, 2022 at 10:06:48AM +0100, Magnus Karlsson wrote:
> On Tue, Dec 20, 2022 at 2:32 AM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
> >
> > On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> > > To summarize, we are expecting this ordering:
> > >
> > > CPU 0 __xsk_rcv_zc()
> > > CPU 0 __xsk_map_flush()
> > > CPU 2 __xsk_rcv_zc()
> > > CPU 2 __xsk_map_flush()
> > >
> > > But we are seeing this order:
> > >
> > > CPU 0 __xsk_rcv_zc()
> > > CPU 2 __xsk_rcv_zc()
> > > CPU 0 __xsk_map_flush()
> > > CPU 2 __xsk_map_flush()
> > >
> > > Here is the veth NAPI poll loop:
> > >
> > > static int veth_poll(struct napi_struct *napi, int budget)
> > > {
> > >     struct veth_rq *rq =
> > >     container_of(napi, struct veth_rq, xdp_napi);
> > >     struct veth_stats stats = {};
> > >     struct veth_xdp_tx_bq bq;
> > >     int done;
> > >
> > >     bq.count = 0;
> > >
> > >     xdp_set_return_frame_no_direct();
> > >     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> > >
> > >     if (done < budget && napi_complete_done(napi, done)) {
> > >         /* Write rx_notify_masked before reading ptr_ring */
> > >        smp_store_mb(rq->rx_notify_masked, false);
> > >        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> > >            if (napi_schedule_prep(&rq->xdp_napi)) {
> > >                WRITE_ONCE(rq->rx_notify_masked, true);
> > >                __napi_schedule(&rq->xdp_napi);
> > >             }
> > >         }
> > >     }
> > >
> > >     if (stats.xdp_tx > 0)
> > >         veth_xdp_flush(rq, &bq);
> > >     if (stats.xdp_redirect > 0)
> > >         xdp_do_flush();
> > >     xdp_clear_return_frame_no_direct();
> > >
> > >     return done;
> > > }
> > >
> > > Something I have never seen before is that there is
> > > napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> > > Let us check if this has something to do with it. So two suggestions
> > > to be executed separately:
> > >
> > > * Put a probe at the __napi_schedule() above and check if it gets
> > > triggered before this problem
> > > * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> > > before "if (done < budget && napi_complete_done(napi, done)) {"
> > >
> > > This might provide us some hints on what is going on.
> >
> > After staring at this code for way too long I finally made a
> > breakthrough!  I could not understand how this race could occur when
> > napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():
> >
> > ```
> >   static inline void *netpoll_poll_lock(struct napi_struct *napi)
> >   {
> >     struct net_device *dev = napi->dev;
> >
> >     if (dev && dev->npinfo) {
> >       int owner = smp_processor_id();
> >
> >       while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
> >         cpu_relax();
> >
> >       return napi;
> >     }
> >     return NULL;
> >   }
> > ```
> > If dev or dev->npinfo are NULL then it doesn't acquire a lock at all!
> > Adding some more trace points I see:
> >
> > ```
> >   iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
> >   iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> >   iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> >   iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000
> > ```
> >
> > Here you can see that poll_owner=-1 meaning the lock was never
> > acquired because npinfo is NULL.  This means that the same veth rx
> > queue can be napi_polled from multiple CPUs and nothing stops it from
> > running concurrently.  They all look like this, just most of the time
> > there aren't concurrent napi_polls running for the same queue.  They
> > do however move around CPUs as I explained earlier.
> >
> > I'll note that I've run with your suggested change of moving
> > xdp_do_flush() before napi_complete_done() all weekend and I have not
> > reproduced the issue.  I don't know if that truly means the issue is
> > fixed by that change or not.  I suspect it does fix the issue because
> > it prevents the napi_struct from being scheduled again before the
> > first poll has completed, and napi_schedule_prep() ensures that only
> > one instance is ever running.
> 
> Thanks Shawn! Good news that the patch seems to fix the problem. To
> me, napi_schedule_prep() makes sure that only one NAPI instance is
> running. Netpoll is an optional feature and I do not even have it
> compiled into my kernel. At least I do not have it defined in my
> .config and I cannot find any netpoll symbols with a readelf command.
> If netpoll is not used, I would suspect that npinfo == NULL. So to me,
> it is still a mystery why this is happening.

Oh I don't think it is a mystery anymore.  The napi_complete_done()
signals that this instance of the napi_poll is complete.  As you
said, napi_schedule_prep() checks to ensure that only one instance of
napi_poll is running at a time, but we just indicated it was done with
napi_complete_done().  This allows this CPU, or more importantly any
other CPU, to reschedule napi polling for this receive queue, but we
haven't called xdp_do_flush() yet, so the flush can race.  I'll note
that the napi_schedule_prep()/__napi_schedule() in veth_poll really
isn't the problem since it will schedule itself back on the same CPU.
The problem is simply that another CPU is free to call
napi_schedule_prep()/__napi_schedule() in that window after
napi_complete_done() and before xdp_do_flush().  The veth driver can
schedule a napi_poll from the transmit path, which is what starts the
poll on a second CPU.
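
Annotating your earlier veth_poll() excerpt with where I think the
window is:

```
    done = veth_xdp_rcv(rq, budget, &bq, &stats);
                            /* CPU A: __xsk_rcv_zc() reserved and wrote its
                             * descriptors, but nothing is published yet     */

    if (done < budget && napi_complete_done(napi, done)) {
        /* ... re-arm logic as in the excerpt above ... */
    }                       /* SCHED bit cleared: the veth transmit path can
                             * now napi_schedule() this same rq on CPU B,
                             * whose __xsk_rcv_zc() bumps cached_prod and may
                             * still be writing addr/len...                  */

    if (stats.xdp_tx > 0)
        veth_xdp_flush(rq, &bq);
    if (stats.xdp_redirect > 0)
        xdp_do_flush();     /* ...when CPU A's flush publishes producer =
                             * cached_prod, exposing CPU B's half-written
                             * slot -- the stale/poisoned descriptor we read */
```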

I was simply blinded by that stupid netpoll_poll_lock() but it is
completely unnecessary.
 
> To validate or disprove that there are two instances of the same napi
> running, could you please put a probe in veth_poll() right after
> veth_xdp_rcv() and one at xdp_do_flush() and dump the napi id in both
> cases? And run the original code with the bug :-). It would be good to
> know what exactly happens in the code between these points when this
> bug occurs. Maybe we could do this by enabling the trace buffer and
> dumping it when the bug occurs.

The two instances that race are definitely different napi_polls; the
second one is scheduled from the veth transmit path.

> > If we think this is the correct fix I'll let it run for another day or
> > two and prepare a patch.
> 
> There is one more advantage to this bug fix and that is performance.
> By calling __xsk_map_flush() immediately after the receive function,
> user space can start to work on the packets quicker so performance
> will improve.

Yes definitely.

Thanks,
Shawn Bohrer


* Re: Possible race with xsk_flush
  2022-12-20 15:25           ` Shawn Bohrer
@ 2022-12-20 15:59             ` Magnus Karlsson
  2022-12-20 16:02               ` Shawn Bohrer
  0 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2022-12-20 15:59 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Tue, Dec 20, 2022 at 4:26 PM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
>
> On Tue, Dec 20, 2022 at 10:06:48AM +0100, Magnus Karlsson wrote:
> > On Tue, Dec 20, 2022 at 2:32 AM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
> > >
> > > On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> > > > To summarize, we are expecting this ordering:
> > > >
> > > > CPU 0 __xsk_rcv_zc()
> > > > CPU 0 __xsk_map_flush()
> > > > CPU 2 __xsk_rcv_zc()
> > > > CPU 2 __xsk_map_flush()
> > > >
> > > > But we are seeing this order:
> > > >
> > > > CPU 0 __xsk_rcv_zc()
> > > > CPU 2 __xsk_rcv_zc()
> > > > CPU 0 __xsk_map_flush()
> > > > CPU 2 __xsk_map_flush()
> > > >
> > > > Here is the veth NAPI poll loop:
> > > >
> > > > static int veth_poll(struct napi_struct *napi, int budget)
> > > > {
> > > >     struct veth_rq *rq =
> > > >     container_of(napi, struct veth_rq, xdp_napi);
> > > >     struct veth_stats stats = {};
> > > >     struct veth_xdp_tx_bq bq;
> > > >     int done;
> > > >
> > > >     bq.count = 0;
> > > >
> > > >     xdp_set_return_frame_no_direct();
> > > >     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> > > >
> > > >     if (done < budget && napi_complete_done(napi, done)) {
> > > >         /* Write rx_notify_masked before reading ptr_ring */
> > > >        smp_store_mb(rq->rx_notify_masked, false);
> > > >        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> > > >            if (napi_schedule_prep(&rq->xdp_napi)) {
> > > >                WRITE_ONCE(rq->rx_notify_masked, true);
> > > >                __napi_schedule(&rq->xdp_napi);
> > > >             }
> > > >         }
> > > >     }
> > > >
> > > >     if (stats.xdp_tx > 0)
> > > >         veth_xdp_flush(rq, &bq);
> > > >     if (stats.xdp_redirect > 0)
> > > >         xdp_do_flush();
> > > >     xdp_clear_return_frame_no_direct();
> > > >
> > > >     return done;
> > > > }
> > > >
> > > > Something I have never seen before is that there is
> > > > napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> > > > Let us check if this has something to do with it. So two suggestions
> > > > to be executed separately:
> > > >
> > > > * Put a probe at the __napi_schedule() above and check if it gets
> > > > triggered before this problem
> > > > * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> > > > before "if (done < budget && napi_complete_done(napi, done)) {"
> > > >
> > > > This might provide us some hints on what is going on.
> > >
> > > After staring at this code for way too long I finally made a
> > > breakthrough!  I could not understand how this race could occur when
> > > napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():
> > >
> > > ```
> > >   static inline void *netpoll_poll_lock(struct napi_struct *napi)
> > >   {
> > >     struct net_device *dev = napi->dev;
> > >
> > >     if (dev && dev->npinfo) {
> > >       int owner = smp_processor_id();
> > >
> > >       while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
> > >         cpu_relax();
> > >
> > >       return napi;
> > >     }
> > >     return NULL;
> > >   }
> > > ```
> > > If dev or dev->npinfo are NULL then it doesn't acquire a lock at all!
> > > Adding some more trace points I see:
> > >
> > > ```
> > >   iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
> > >   iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> > >   iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> > >   iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000
> > > ```
> > >
> > > Here you can see that poll_owner=-1 meaning the lock was never
> > > acquired because npinfo is NULL.  This means that the same veth rx
> > > queue can be napi_polled from multiple CPUs and nothing stops it from
> > > running concurrently.  They all look like this, just most of the time
> > > there aren't concurrent napi_polls running for the same queue.  They
> > > do however move around CPUs as I explained earlier.
> > >
> > > I'll note that I've run with your suggested change of moving
> > > xdp_do_flush() before napi_complete_done() all weekend and I have not
> > > reproduced the issue.  I don't know if that truly means the issue is
> > > fixed by that change or not.  I suspect it does fix the issue because
> > > it prevents the napi_struct from being scheduled again before the
> > > first poll has completed, and napi_schedule_prep() ensures that only
> > > one instance is ever running.
> >
> > Thanks Shawn! Good news that the patch seems to fix the problem. To
> > me, napi_schedule_prep() makes sure that only one NAPI instance is
> > running. Netpoll is an optional feature and I do not even have it
> > compiled into my kernel. At least I do not have it defined in my
> > .config and I cannot find any netpoll symbols with a readelf command.
> > If netpoll is not used, I would suspect that npinfo == NULL. So to me,
> > it is still a mystery why this is happening.
>
> Oh I don't think it is a mystery anymore.  The napi_complete_done()
> signals that this instance of the napi_poll is complete.  As you
> said, napi_schedule_prep() checks to ensure that only one instance of
> napi_poll is running at a time, but we just indicated it was done with
> napi_complete_done().  This allows this CPU or more importantly any
> other CPU to reschedule napi polling for this receive queue, but we
> haven't called xdp_do_flush() yet so the flush can race.  I'll note
> that the napi_schedule_prep()/__napi_schedule() in veth_poll really
> isn't the problem since it will schedule itself back on the same CPU.
> The problem is simply that another CPU is free to call
> napi_schedule_prep()/__napi_schedule() in that window after
> napi_complete_done() and before xdp_do_flush().  The veth driver can
> schedule a napi_poll from the transmit path which is what starts the
> poll on a second CPU.

Bingo! Would you like to prepare a patch or would you like me to do
it? This has been broken since the support was put in veth in 4.19.

> I was simply blinded by that stupid netpoll_poll_lock() but it is
> completely unnecessary.
>
> > To validate or disprove that there are two instances of the same napi
> > running, could you please put a probe in veth_poll() right after
> > veth_xdp_rcv() and one at xdp_do_flush() and dump the napi id in both
> > cases? And run the original code with the bug :-). It would be good to
> > know what exactly happens in the code between these points when this
> > bug occurs. Maybe we could do this by enabling the trace buffer and
> > dumping it when the bug occurs.
>
> The two instances that race are definitely different napi_polls; the
> second one is scheduled from the veth transmit path.
>
> > > If we think this is the correct fix I'll let it run for another day or
> > > two and prepare a patch.
> >
> > There is one more advantage to this bug fix and that is performance.
> > By calling xdp_do_flush() immediately after the receive function,
> > user space can start to work on the packets quicker so performance
> > will improve.
>
> Yes definitely.
>
> Thanks,
> Shawn Bohrer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Possible race with xsk_flush
  2022-12-20 15:59             ` Magnus Karlsson
@ 2022-12-20 16:02               ` Shawn Bohrer
  0 siblings, 0 replies; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-20 16:02 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team

On Tue, Dec 20, 2022 at 04:59:30PM +0100, Magnus Karlsson wrote:
> On Tue, Dec 20, 2022 at 4:26 PM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
> >
> > On Tue, Dec 20, 2022 at 10:06:48AM +0100, Magnus Karlsson wrote:
> > > On Tue, Dec 20, 2022 at 2:32 AM Shawn Bohrer <sbohrer@cloudflare.com> wrote:
> > > >
> > > > On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> > > > > To summarize, we are expecting this ordering:
> > > > >
> > > > > CPU 0 __xsk_rcv_zc()
> > > > > CPU 0 __xsk_map_flush()
> > > > > CPU 2 __xsk_rcv_zc()
> > > > > CPU 2 __xsk_map_flush()
> > > > >
> > > > > But we are seeing this order:
> > > > >
> > > > > CPU 0 __xsk_rcv_zc()
> > > > > CPU 2 __xsk_rcv_zc()
> > > > > CPU 0 __xsk_map_flush()
> > > > > CPU 2 __xsk_map_flush()
> > > > >
> > > > > Here is the veth NAPI poll loop:
> > > > >
> > > > > static int veth_poll(struct napi_struct *napi, int budget)
> > > > > {
> > > > >     struct veth_rq *rq =
> > > > >     container_of(napi, struct veth_rq, xdp_napi);
> > > > >     struct veth_stats stats = {};
> > > > >     struct veth_xdp_tx_bq bq;
> > > > >     int done;
> > > > >
> > > > >     bq.count = 0;
> > > > >
> > > > >     xdp_set_return_frame_no_direct();
> > > > >     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> > > > >
> > > > >     if (done < budget && napi_complete_done(napi, done)) {
> > > > >         /* Write rx_notify_masked before reading ptr_ring */
> > > > >        smp_store_mb(rq->rx_notify_masked, false);
> > > > >        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> > > > >            if (napi_schedule_prep(&rq->xdp_napi)) {
> > > > >                WRITE_ONCE(rq->rx_notify_masked, true);
> > > > >                __napi_schedule(&rq->xdp_napi);
> > > > >             }
> > > > >         }
> > > > >     }
> > > > >
> > > > >     if (stats.xdp_tx > 0)
> > > > >         veth_xdp_flush(rq, &bq);
> > > > >     if (stats.xdp_redirect > 0)
> > > > >         xdp_do_flush();
> > > > >     xdp_clear_return_frame_no_direct();
> > > > >
> > > > >     return done;
> > > > > }
> > > > >
> > > > > Something I have never seen before is that there is
> > > > > napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> > > > > Let us check if this has something to do with it. So two suggestions
> > > > > to be executed separately:
> > > > >
> > > > > * Put a probe at the __napi_schedule() above and check if it gets
> > > > > triggered before this problem
> > > > > * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> > > > > before "if (done < budget && napi_complete_done(napi, done)) {"
> > > > >
> > > > > This might provide us some hints on what is going on.
> > > >
> > > > After staring at this code for way too long I finally made a
> > > > breakthrough!  I could not understand how this race could occur when
> > > > napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():
> > > >
> > > > ```
> > > >   static inline void *netpoll_poll_lock(struct napi_struct *napi)
> > > >   {
> > > >     struct net_device *dev = napi->dev;
> > > >
> > > >     if (dev && dev->npinfo) {
> > > >       int owner = smp_processor_id();
> > > >
> > > >       while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
> > > >         cpu_relax();
> > > >
> > > >       return napi;
> > > >     }
> > > >     return NULL;
> > > >   }
> > > > ```
> > > > If dev or dev->npinfo are NULL then it doesn't acquire a lock at all!
> > > > Adding some more trace points I see:
> > > >
> > > > ```
> > > >   iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
> > > >   iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> > > >   iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
> > > >   iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000
> > > > ```
> > > >
> > > > Here you can see that poll_owner=-1 meaning the lock was never
> > > > acquired because npinfo is NULL.  This means that the same veth rx
> > > > queue can be napi_polled from multiple CPUs and nothing stops it from
> > > > running concurrently.  They all look like this, just most of the time
> > > > there aren't concurrent napi_polls running for the same queue.  They
> > > > do however move around CPUs as I explained earlier.
> > > >
> > > > I'll note that I've run with your suggested change of moving
> > > > xdp_do_flush() before napi_complete_done() all weekend and I have not
> > > > reproduced the issue.  I don't know if that truly means the issue is
> > > > fixed by that change or not.  I suspect it does fix the issue because
> > > > it prevents the napi_struct from being scheduled again before the
> > > > first poll has completed, and napi_schedule_prep() ensures that only
> > > > one instance is ever running.
> > >
> > > Thanks Shawn! Good news that the patch seems to fix the problem. To
> > > me, napi_schedule_prep() makes sure that only one NAPI instance is
> > > running. Netpoll is an optional feature and I do not even have it
> > > compiled into my kernel. At least I do not have it defined in my
> > > .config and I cannot find any netpoll symbols with a readelf command.
> > > If netpoll is not used, I would suspect that npinfo == NULL. So to me,
> > > it is still a mystery why this is happening.
> >
> > Oh I don't think it is a mystery anymore.  The napi_complete_done()
> > signals that this instance of the napi_poll is complete.  As you
> > said, napi_schedule_prep() checks to ensure that only one instance of
> > napi_poll is running at a time, but we just indicated it was done with
> > napi_complete_done().  This allows this CPU or more importantly any
> > other CPU to reschedule napi polling for this receive queue, but we
> > haven't called xdp_do_flush() yet so the flush can race.  I'll note
> > that the napi_schedule_prep()/__napi_schedule() in veth_poll really
> > isn't the problem since it will schedule itself back on the same CPU.
> > The problem is simply that another CPU is free to call
> > napi_schedule_prep()/__napi_schedule() in that window after
> > napi_complete_done() and before xdp_do_flush().  The veth driver can
> > schedule a napi_poll from the transmit path which is what starts the
> > poll on a second CPU.
> 
> Bingo! Would you like to prepare a patch or would you like me to do
> it? This has been broken since the support was put in veth in 4.19.

I'll prepare the patch today.  I feel like I've spent enough of my
life chasing this down and appreciate your help!

Thanks,
Shawn Bohrer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2022-12-14 22:32 Possible race with xsk_flush Shawn Bohrer
  2022-12-15 10:22 ` Magnus Karlsson
@ 2022-12-20 18:59 ` Shawn Bohrer
  2022-12-22  1:07   ` Jakub Kicinski
                     ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Shawn Bohrer @ 2022-12-20 18:59 UTC (permalink / raw)
  To: magnus.karlsson; +Cc: netdev, bpf, bjorn, kernel-team, davem, Shawn Bohrer

When AF_XDP is used on a veth interface, the RX ring is updated in two
steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
fills them, and places them in the RX ring, updating the cached_prod
pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
cached_prod pointer, allowing user-space to see the recently filled in
descriptors.  The rings are intended to be SPSC; however, the existing
order in veth_poll allows xdp_do_flush() to run concurrently with
another CPU, creating a race condition that allows user-space to see old
or uninitialized descriptors in the RX ring.  This bug has been observed
in production systems.

To summarize, we are expecting this ordering:

CPU 0 __xsk_rcv_zc()
CPU 0 __xsk_map_flush()
CPU 2 __xsk_rcv_zc()
CPU 2 __xsk_map_flush()

But we are seeing this order:

CPU 0 __xsk_rcv_zc()
CPU 2 __xsk_rcv_zc()
CPU 0 __xsk_map_flush()
CPU 2 __xsk_map_flush()

This occurs because we rely on NAPI to ensure that only one napi_poll
handler is running at a time for the given veth receive queue.
napi_schedule_prep() will prevent multiple instances from getting
scheduled. However calling napi_complete_done() signals that this
napi_poll is complete and allows subsequent calls to
napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
concurrent napi_poll before the xdp_do_flush() has been called.  For the
veth driver a concurrent call to napi_schedule_prep() and
__napi_schedule() can occur on a different CPU because the veth xmit
path can additionally schedule a napi_poll creating the race.

The fix, as suggested by Magnus Karlsson, is to simply move the
xdp_do_flush() call before napi_complete_done().  This syncs the
producer ring pointers before another instance of napi_poll can be
scheduled on another CPU.  It will also slightly improve performance by
moving the flush closer to when the descriptors were placed in the
RX ring.

Fixes: d1396004dd86 ("veth: Add XDP TX and REDIRECT")
Suggested-by: Magnus Karlsson <magnus.karlsson@gmail.com>
Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com>
---
 drivers/net/veth.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ac7c0653695f..dfc7d87fad59 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -974,6 +974,9 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	xdp_set_return_frame_no_direct();
 	done = veth_xdp_rcv(rq, budget, &bq, &stats);
 
+	if (stats.xdp_redirect > 0)
+		xdp_do_flush();
+
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
 		smp_store_mb(rq->rx_notify_masked, false);
@@ -987,8 +990,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
 
 	if (stats.xdp_tx > 0)
 		veth_xdp_flush(rq, &bq);
-	if (stats.xdp_redirect > 0)
-		xdp_do_flush();
 	xdp_clear_return_frame_no_direct();
 
 	return done;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
@ 2022-12-22  1:07   ` Jakub Kicinski
  2022-12-22 14:30     ` Paolo Abeni
  2022-12-22 10:18   ` Paolo Abeni
  2022-12-22 14:40   ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2022-12-22  1:07 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: magnus.karlsson, netdev, bpf, bjorn, kernel-team, davem

On Tue, 20 Dec 2022 12:59:03 -0600 Shawn Bohrer wrote:
>  	if (stats.xdp_tx > 0)
>  		veth_xdp_flush(rq, &bq);

This one does not need similar treatment, right?
Only thing I could spot is potential races in updating stats, 
but that seems like a larger issue with veth xdp.

> -	if (stats.xdp_redirect > 0)
> -		xdp_do_flush();

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
  2022-12-22  1:07   ` Jakub Kicinski
@ 2022-12-22 10:18   ` Paolo Abeni
  2023-01-11 14:02     ` Magnus Karlsson
  2022-12-22 14:40   ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2022-12-22 10:18 UTC (permalink / raw)
  To: Shawn Bohrer, magnus.karlsson; +Cc: netdev, bpf, bjorn, kernel-team, davem

On Tue, 2022-12-20 at 12:59 -0600, Shawn Bohrer wrote:
> When AF_XDP is used on a veth interface, the RX ring is updated in two
> steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
> fills them, and places them in the RX ring, updating the cached_prod
> pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
> cached_prod pointer, allowing user-space to see the recently filled in
> descriptors.  The rings are intended to be SPSC; however, the existing
> order in veth_poll allows xdp_do_flush() to run concurrently with
> another CPU, creating a race condition that allows user-space to see old
> or uninitialized descriptors in the RX ring.  This bug has been observed
> in production systems.
> 
> To summarize, we are expecting this ordering:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_rcv_zc()
> CPU 2 __xsk_map_flush()
> 
> But we are seeing this order:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 2 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_map_flush()
> 
> This occurs because we rely on NAPI to ensure that only one napi_poll
> handler is running at a time for the given veth receive queue.
> napi_schedule_prep() will prevent multiple instances from getting
> scheduled. However calling napi_complete_done() signals that this
> napi_poll is complete and allows subsequent calls to
> napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
> concurrent napi_poll before the xdp_do_flush() has been called.  For the
> veth driver a concurrent call to napi_schedule_prep() and
> __napi_schedule() can occur on a different CPU because the veth xmit
> path can additionally schedule a napi_poll creating the race.

The above looks like a generic problem that other drivers could hit.
Perhaps it would be worth updating the xdp_do_flush() doc text to
explicitly mention it must be called before napi_complete_done().
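
Something along these lines, perhaps (illustrative wording only, not
an actual patch):

```
/* Illustrative sketch of the suggested kernel-doc note:
 *
 * xdp_do_flush - flush the deferred XDP_REDIRECT targets (dev, cpu
 * and xsk maps) for this NAPI context
 *
 * Drivers that call xdp_do_redirect() from their NAPI poll function
 * must call xdp_do_flush() before napi_complete_done().  Once the
 * NAPI instance is marked complete it can be rescheduled, possibly on
 * another CPU, and the not-yet-flushed state would race with the new
 * poll.
 */
void xdp_do_flush(void);
```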

(in a separate, net-next patch)

Thanks!

Paolo

> 
> > The fix, as suggested by Magnus Karlsson, is to simply move the
> xdp_do_flush() call before napi_complete_done().  This syncs the
> producer ring pointers before another instance of napi_poll can be
> scheduled on another CPU.  It will also slightly improve performance by
> moving the flush closer to when the descriptors were placed in the
> RX ring.
> 
> Fixes: d1396004dd86 ("veth: Add XDP TX and REDIRECT")
> Suggested-by: Magnus Karlsson <magnus.karlsson@gmail.com>
> Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com>
> ---
>  drivers/net/veth.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index ac7c0653695f..dfc7d87fad59 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -974,6 +974,9 @@ static int veth_poll(struct napi_struct *napi, int budget)
>  	xdp_set_return_frame_no_direct();
>  	done = veth_xdp_rcv(rq, budget, &bq, &stats);
>  
> +	if (stats.xdp_redirect > 0)
> +		xdp_do_flush();
> +
>  	if (done < budget && napi_complete_done(napi, done)) {
>  		/* Write rx_notify_masked before reading ptr_ring */
>  		smp_store_mb(rq->rx_notify_masked, false);
> @@ -987,8 +990,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
>  
>  	if (stats.xdp_tx > 0)
>  		veth_xdp_flush(rq, &bq);
> -	if (stats.xdp_redirect > 0)
> -		xdp_do_flush();
>  	xdp_clear_return_frame_no_direct();
>  
>  	return done;


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2022-12-22  1:07   ` Jakub Kicinski
@ 2022-12-22 14:30     ` Paolo Abeni
  0 siblings, 0 replies; 21+ messages in thread
From: Paolo Abeni @ 2022-12-22 14:30 UTC (permalink / raw)
  To: Jakub Kicinski, Shawn Bohrer
  Cc: magnus.karlsson, netdev, bpf, bjorn, kernel-team, davem

On Wed, 2022-12-21 at 17:07 -0800, Jakub Kicinski wrote:
> On Tue, 20 Dec 2022 12:59:03 -0600 Shawn Bohrer wrote:
> >  	if (stats.xdp_tx > 0)
> >  		veth_xdp_flush(rq, &bq);
> 
> This one does not need similar treatment, right?
> Only thing I could spot is potential races in updating stats, 
> but that seems like a larger issue with veth xdp.

Agreed, I think stats would deserve a separate patch (and this one
looks correct).

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
  2022-12-22  1:07   ` Jakub Kicinski
  2022-12-22 10:18   ` Paolo Abeni
@ 2022-12-22 14:40   ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 21+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-12-22 14:40 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: magnus.karlsson, netdev, bpf, bjorn, kernel-team, davem

Hello:

This patch was applied to netdev/net.git (master)
by Paolo Abeni <pabeni@redhat.com>:

On Tue, 20 Dec 2022 12:59:03 -0600 you wrote:
> When AF_XDP is used on a veth interface, the RX ring is updated in two
> steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
> fills them, and places them in the RX ring, updating the cached_prod
> pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
> cached_prod pointer, allowing user-space to see the recently filled in
> descriptors.  The rings are intended to be SPSC; however, the existing
> order in veth_poll allows xdp_do_flush() to run concurrently with
> another CPU, creating a race condition that allows user-space to see old
> or uninitialized descriptors in the RX ring.  This bug has been observed
> in production systems.
> 
> [...]

Here is the summary with links:
  - veth: Fix race with AF_XDP exposing old or uninitialized descriptors
    https://git.kernel.org/netdev/net/c/fa349e396e48

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2022-12-22 10:18   ` Paolo Abeni
@ 2023-01-11 14:02     ` Magnus Karlsson
  2023-01-11 14:21       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2023-01-11 14:02 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Shawn Bohrer, netdev, bpf, bjorn, kernel-team, davem

On Thu, Dec 22, 2022 at 11:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2022-12-20 at 12:59 -0600, Shawn Bohrer wrote:
> > When AF_XDP is used on a veth interface, the RX ring is updated in two
> > steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
> > fills them, and places them in the RX ring, updating the cached_prod
> > pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
> > cached_prod pointer, allowing user-space to see the recently filled in
> > descriptors.  The rings are intended to be SPSC; however, the existing
> > order in veth_poll allows xdp_do_flush() to run concurrently with
> > another CPU, creating a race condition that allows user-space to see old
> > or uninitialized descriptors in the RX ring.  This bug has been observed
> > in production systems.
> >
> > To summarize, we are expecting this ordering:
> >
> > CPU 0 __xsk_rcv_zc()
> > CPU 0 __xsk_map_flush()
> > CPU 2 __xsk_rcv_zc()
> > CPU 2 __xsk_map_flush()
> >
> > But we are seeing this order:
> >
> > CPU 0 __xsk_rcv_zc()
> > CPU 2 __xsk_rcv_zc()
> > CPU 0 __xsk_map_flush()
> > CPU 2 __xsk_map_flush()
> >
> > This occurs because we rely on NAPI to ensure that only one napi_poll
> > handler is running at a time for the given veth receive queue.
> > napi_schedule_prep() will prevent multiple instances from getting
> > scheduled. However calling napi_complete_done() signals that this
> > napi_poll is complete and allows subsequent calls to
> > napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
> > concurrent napi_poll before the xdp_do_flush() has been called.  For the
> > veth driver a concurrent call to napi_schedule_prep() and
> > __napi_schedule() can occur on a different CPU because the veth xmit
> > path can additionally schedule a napi_poll creating the race.
>
> The above looks like a generic problem that other drivers could hit.
> Perhaps it would be worth updating the xdp_do_flush() doc text to
> explicitly mention it must be called before napi_complete_done().

Good observation. I took a quick peek at this and it seems there are
at least 5 more drivers that can call napi_complete_done() before
xdp_do_flush():

drivers/net/ethernet/qlogic/qede/
drivers/net/ethernet/freescale/dpaa2
drivers/net/ethernet/freescale/dpaa
drivers/net/ethernet/microchip/lan966x
drivers/net/virtio_net.c

The question is then if this race can occur on these five drivers.
Dpaa2 has AF_XDP zero-copy support implemented, so it can suffer from
this race as the Tx zero-copy path is basically just a napi_schedule()
and it can be called/invoked from multiple processes at the same time.
In regards to the others, I do not know.

Would it be prudent to just switch the order of xdp_do_flush() and
napi_complete_done() in all these drivers, or would that be too
defensive?

> (in a separate, net-next patch)
>
> Thanks!
>
> Paolo
>
> >
> > > The fix, as suggested by Magnus Karlsson, is to simply move the
> > xdp_do_flush() call before napi_complete_done().  This syncs the
> > producer ring pointers before another instance of napi_poll can be
> > scheduled on another CPU.  It will also slightly improve performance by
> > moving the flush closer to when the descriptors were placed in the
> > RX ring.
> >
> > Fixes: d1396004dd86 ("veth: Add XDP TX and REDIRECT")
> > Suggested-by: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Signed-off-by: Shawn Bohrer <sbohrer@cloudflare.com>
> > ---
> >  drivers/net/veth.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index ac7c0653695f..dfc7d87fad59 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -974,6 +974,9 @@ static int veth_poll(struct napi_struct *napi, int budget)
> >       xdp_set_return_frame_no_direct();
> >       done = veth_xdp_rcv(rq, budget, &bq, &stats);
> >
> > +     if (stats.xdp_redirect > 0)
> > +             xdp_do_flush();
> > +
> >       if (done < budget && napi_complete_done(napi, done)) {
> >               /* Write rx_notify_masked before reading ptr_ring */
> >               smp_store_mb(rq->rx_notify_masked, false);
> > @@ -987,8 +990,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
> >
> >       if (stats.xdp_tx > 0)
> >               veth_xdp_flush(rq, &bq);
> > -     if (stats.xdp_redirect > 0)
> > -             xdp_do_flush();
> >       xdp_clear_return_frame_no_direct();
> >
> >       return done;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2023-01-11 14:02     ` Magnus Karlsson
@ 2023-01-11 14:21       ` Toke Høiland-Jørgensen
  2023-01-11 16:24         ` Magnus Karlsson
  0 siblings, 1 reply; 21+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-11 14:21 UTC (permalink / raw)
  To: Magnus Karlsson, Paolo Abeni
  Cc: Shawn Bohrer, netdev, bpf, bjorn, kernel-team, davem

Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Thu, Dec 22, 2022 at 11:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On Tue, 2022-12-20 at 12:59 -0600, Shawn Bohrer wrote:
>> > When AF_XDP is used on a veth interface, the RX ring is updated in two
>> > steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
>> > fills them, and places them in the RX ring, updating the cached_prod
>> > pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
>> > cached_prod pointer, allowing user-space to see the recently filled in
>> > descriptors.  The rings are intended to be SPSC; however, the existing
>> > order in veth_poll allows xdp_do_flush() to run concurrently with
>> > another CPU, creating a race condition that allows user-space to see old
>> > or uninitialized descriptors in the RX ring.  This bug has been observed
>> > in production systems.
>> >
>> > To summarize, we are expecting this ordering:
>> >
>> > CPU 0 __xsk_rcv_zc()
>> > CPU 0 __xsk_map_flush()
>> > CPU 2 __xsk_rcv_zc()
>> > CPU 2 __xsk_map_flush()
>> >
>> > But we are seeing this order:
>> >
>> > CPU 0 __xsk_rcv_zc()
>> > CPU 2 __xsk_rcv_zc()
>> > CPU 0 __xsk_map_flush()
>> > CPU 2 __xsk_map_flush()
>> >
>> > This occurs because we rely on NAPI to ensure that only one napi_poll
>> > handler is running at a time for the given veth receive queue.
>> > napi_schedule_prep() will prevent multiple instances from getting
>> > scheduled. However calling napi_complete_done() signals that this
>> > napi_poll is complete and allows subsequent calls to
>> > napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
>> > concurrent napi_poll before the xdp_do_flush() has been called.  For the
>> > veth driver a concurrent call to napi_schedule_prep() and
>> > __napi_schedule() can occur on a different CPU because the veth xmit
>> > path can additionally schedule a napi_poll creating the race.
>>
>> The above looks like a generic problem that other drivers could hit.
>> Perhaps it would be worth updating the xdp_do_flush() doc text to
>> explicitly mention it must be called before napi_complete_done().
>
> Good observation. I took a quick peek at this and it seems there are
> at least 5 more drivers that can call napi_complete_done() before
> xdp_do_flush():
>
> drivers/net/ethernet/qlogic/qede/
> drivers/net/ethernet/freescale/dpaa2
> drivers/net/ethernet/freescale/dpaa
> drivers/net/ethernet/microchip/lan966x
> drivers/net/virtio_net.c
>
> The question is then if this race can occur on these five drivers.
> Dpaa2 has AF_XDP zero-copy support implemented, so it can suffer from
> this race as the Tx zero-copy path is basically just a napi_schedule()
> and it can be called/invoked from multiple processes at the same time.
> In regards to the others, I do not know.
>
> Would it be prudent to just switch the order of xdp_do_flush() and
> napi_complete_done() in all these drivers, or would that be too
> defensive?

We rely on being inside a single NAPI instance through to the
xdp_do_flush() call for RCU protection of all in-kernel data structures
as well[0]. Not sure if this leads to actual real-world bugs for the
in-kernel path, but conceptually it's wrong at least. So yeah, I think
we should definitely swap the order everywhere and document this!

-Toke

[0] See https://lore.kernel.org/r/20210624160609.292325-1-toke@redhat.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2023-01-11 14:21       ` Toke Høiland-Jørgensen
@ 2023-01-11 16:24         ` Magnus Karlsson
  2023-01-11 23:06           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2023-01-11 16:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Paolo Abeni, Shawn Bohrer, netdev, bpf, bjorn, kernel-team, davem

On Wed, Jan 11, 2023 at 3:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>
> > On Thu, Dec 22, 2022 at 11:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >> On Tue, 2022-12-20 at 12:59 -0600, Shawn Bohrer wrote:
> >> > When AF_XDP is used on a veth interface, the RX ring is updated in two
> >> > steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
> >> > fills them, and places them in the RX ring, updating the cached_prod
> >> > pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
> >> > cached_prod pointer, allowing user-space to see the recently filled in
> >> > descriptors.  The rings are intended to be SPSC; however, the existing
> >> > order in veth_poll allows xdp_do_flush() to run concurrently with
> >> > another CPU, creating a race condition that allows user-space to see old
> >> > or uninitialized descriptors in the RX ring.  This bug has been observed
> >> > in production systems.
> >> >
> >> > To summarize, we are expecting this ordering:
> >> >
> >> > CPU 0 __xsk_rcv_zc()
> >> > CPU 0 __xsk_map_flush()
> >> > CPU 2 __xsk_rcv_zc()
> >> > CPU 2 __xsk_map_flush()
> >> >
> >> > But we are seeing this order:
> >> >
> >> > CPU 0 __xsk_rcv_zc()
> >> > CPU 2 __xsk_rcv_zc()
> >> > CPU 0 __xsk_map_flush()
> >> > CPU 2 __xsk_map_flush()
> >> >
> >> > This occurs because we rely on NAPI to ensure that only one napi_poll
> >> > handler is running at a time for the given veth receive queue.
> >> > napi_schedule_prep() will prevent multiple instances from getting
> >> > scheduled. However calling napi_complete_done() signals that this
> >> > napi_poll is complete and allows subsequent calls to
> >> > napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
> >> > concurrent napi_poll before the xdp_do_flush() has been called.  For the
> >> > veth driver a concurrent call to napi_schedule_prep() and
> >> > __napi_schedule() can occur on a different CPU because the veth xmit
> >> > path can additionally schedule a napi_poll creating the race.
> >>
> >> The above looks like a generic problem that other drivers could hit.
> >> Perhaps it would be worth updating the xdp_do_flush() doc text to
> >> explicitly mention it must be called before napi_complete_done().
> >
> > Good observation. I took a quick peek at this and it seems there are
> > at least 5 more drivers that can call napi_complete_done() before
> > xdp_do_flush():
> >
> > drivers/net/ethernet/qlogic/qede/
> > drivers/net/ethernet/freescale/dpaa2
> > drivers/net/ethernet/freescale/dpaa
> > drivers/net/ethernet/microchip/lan966x
> > drivers/net/virtio_net.c
> >
> > The question is then if this race can occur on these five drivers.
> > Dpaa2 has AF_XDP zero-copy support implemented, so it can suffer from
> > this race as the Tx zero-copy path is basically just a napi_schedule()
> > and it can be called/invoked from multiple processes at the same time.
> > In regards to the others, I do not know.
> >
> > Would it be prudent to just switch the order of xdp_do_flush() and
> > napi_complete_done() in all these drivers, or would that be too
> > defensive?
>
> We rely on being inside a single NAPI instance through to the
> xdp_do_flush() call for RCU protection of all in-kernel data structures
> as well[0]. Not sure if this leads to actual real-world bugs for the
> in-kernel path, but conceptually it's wrong at least. So yeah, I think
> we should definitely swap the order everywhere and document this!

OK, let me take a stab at it. For at least the first four, it will be
compilation tested only from my side since I do not own any of those
SoCs/cards. Will need the help of those maintainers for sure.

> -Toke
>
> [0] See https://lore.kernel.org/r/20210624160609.292325-1-toke@redhat.com
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
  2023-01-11 16:24         ` Magnus Karlsson
@ 2023-01-11 23:06           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 21+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-11 23:06 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Paolo Abeni, Shawn Bohrer, netdev, bpf, bjorn, kernel-team, davem

Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Wed, Jan 11, 2023 at 3:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>>
>> > On Thu, Dec 22, 2022 at 11:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> >>
>> >> On Tue, 2022-12-20 at 12:59 -0600, Shawn Bohrer wrote:
>> >> > When AF_XDP is used on a veth interface, the RX ring is updated in two
>> >> > steps.  veth_xdp_rcv() removes packet descriptors from the FILL ring,
>> >> > fills them, and places them in the RX ring, updating the cached_prod
>> >> > pointer.  Later xdp_do_flush() syncs the RX ring prod pointer with the
>> >> > cached_prod pointer, allowing user-space to see the recently filled in
>> >> > descriptors.  The rings are intended to be SPSC; however, the existing
>> >> > order in veth_poll allows xdp_do_flush() to run concurrently with
>> >> > another CPU, creating a race condition that allows user-space to see old
>> >> > or uninitialized descriptors in the RX ring.  This bug has been observed
>> >> > in production systems.
>> >> >
>> >> > To summarize, we are expecting this ordering:
>> >> >
>> >> > CPU 0 __xsk_rcv_zc()
>> >> > CPU 0 __xsk_map_flush()
>> >> > CPU 2 __xsk_rcv_zc()
>> >> > CPU 2 __xsk_map_flush()
>> >> >
>> >> > But we are seeing this order:
>> >> >
>> >> > CPU 0 __xsk_rcv_zc()
>> >> > CPU 2 __xsk_rcv_zc()
>> >> > CPU 0 __xsk_map_flush()
>> >> > CPU 2 __xsk_map_flush()
>> >> >
>> >> > This occurs because we rely on NAPI to ensure that only one napi_poll
>> >> > handler is running at a time for the given veth receive queue.
>> >> > napi_schedule_prep() will prevent multiple instances from getting
>> >> > scheduled. However calling napi_complete_done() signals that this
>> >> > napi_poll is complete and allows subsequent calls to
>> >> > napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
>> >> > concurrent napi_poll before the xdp_do_flush() has been called.  For the
>> >> > veth driver a concurrent call to napi_schedule_prep() and
>> >> > __napi_schedule() can occur on a different CPU because the veth xmit
>> >> > path can additionally schedule a napi_poll creating the race.
>> >>
>> >> The above looks like a generic problem that other drivers could hit.
>> >> Perhaps it would be worth updating the xdp_do_flush() doc text to
>> >> explicitly mention it must be called before napi_complete_done().
>> >
>> > Good observation. I took a quick peek at this and it seems there are
>> > at least 5 more drivers that can call napi_complete_done() before
>> > xdp_do_flush():
>> >
>> > drivers/net/ethernet/qlogic/qede/
>> > drivers/net/ethernet/freescale/dpaa2
>> > drivers/net/ethernet/freescale/dpaa
>> > drivers/net/ethernet/microchip/lan966x
>> > drivers/net/virtio_net.c
>> >
>> > The question is then if this race can occur on these five drivers.
>> > Dpaa2 has AF_XDP zero-copy support implemented, so it can suffer from
>> > this race as the Tx zero-copy path is basically just a napi_schedule()
>> > and it can be called/invoked from multiple processes at the same time.
>> > In regards to the others, I do not know.
>> >
>> > Would it be prudent to just switch the order of xdp_do_flush() and
>> > napi_complete_done() in all these drivers, or would that be too
>> > defensive?
>>
>> We rely on being inside a single NAPI instance through to the
>> xdp_do_flush() call for RCU protection of all in-kernel data structures
>> as well[0]. Not sure if this leads to actual real-world bugs for the
>> in-kernel path, but conceptually it's wrong at least. So yeah, I think
>> we should definitely swap the order everywhere and document this!
>
> OK, let me take a stab at it. For at least the first four, it will be
> compilation tested only from my side since I do not own any of those
> SoCs/cards. Will need the help of those maintainers for sure.

Sounds good, thanks! :)

-Toke


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-01-11 23:07 UTC | newest]

Thread overview: 21+ messages
2022-12-14 22:32 Possible race with xsk_flush Shawn Bohrer
2022-12-15 10:22 ` Magnus Karlsson
2022-12-15 19:50   ` Alex Forster
2022-12-16  0:14   ` Shawn Bohrer
2022-12-16 10:05     ` Magnus Karlsson
2022-12-16 16:48       ` Shawn Bohrer
2022-12-16 23:42       ` Shawn Bohrer
2022-12-20  1:31       ` Shawn Bohrer
2022-12-20  9:06         ` Magnus Karlsson
2022-12-20 15:25           ` Shawn Bohrer
2022-12-20 15:59             ` Magnus Karlsson
2022-12-20 16:02               ` Shawn Bohrer
2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
2022-12-22  1:07   ` Jakub Kicinski
2022-12-22 14:30     ` Paolo Abeni
2022-12-22 10:18   ` Paolo Abeni
2023-01-11 14:02     ` Magnus Karlsson
2023-01-11 14:21       ` Toke Høiland-Jørgensen
2023-01-11 16:24         ` Magnus Karlsson
2023-01-11 23:06           ` Toke Høiland-Jørgensen
2022-12-22 14:40   ` patchwork-bot+netdevbpf
