From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Paolo Abeni <pabeni@redhat.com>,
Shawn Bohrer <sbohrer@cloudflare.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org, bjorn@kernel.org,
kernel-team@cloudflare.com, davem@davemloft.net
Subject: Re: [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors
Date: Thu, 12 Jan 2023 00:06:58 +0100 [thread overview]
Message-ID: <871qo0hdrh.fsf@toke.dk> (raw)
In-Reply-To: <CAJ8uoz3Yqqaxmj2x+mXhS9UhSZr-UGh8-Njmk9wB9ceC4cYn1g@mail.gmail.com>
Magnus Karlsson <magnus.karlsson@gmail.com> writes:
> On Wed, Jan 11, 2023 at 3:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>>
>> > On Thu, Dec 22, 2022 at 11:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> >>
>> >> On Tue, 2022-12-20 at 12:59 -0600, Shawn Bohrer wrote:
>> >> > When AF_XDP is used on on a veth interface the RX ring is updated in two
>> >> > steps. veth_xdp_rcv() removes packet descriptors from the FILL ring
>> >> > fills them and places them in the RX ring updating the cached_prod
>> >> > pointer. Later xdp_do_flush() syncs the RX ring prod pointer with the
>> >> > cached_prod pointer allowing user-space to see the recently filled in
>> >> > descriptors. The rings are intended to be SPSC, however the existing
>> >> > order in veth_poll allows the xdp_do_flush() to run concurrently with
>> >> > another CPU creating a race condition that allows user-space to see old
>> >> > or uninitialized descriptors in the RX ring. This bug has been observed
>> >> > in production systems.
>> >> >
>> >> > To summarize, we are expecting this ordering:
>> >> >
>> >> > CPU 0 __xsk_rcv_zc()
>> >> > CPU 0 __xsk_map_flush()
>> >> > CPU 2 __xsk_rcv_zc()
>> >> > CPU 2 __xsk_map_flush()
>> >> >
>> >> > But we are seeing this order:
>> >> >
>> >> > CPU 0 __xsk_rcv_zc()
>> >> > CPU 2 __xsk_rcv_zc()
>> >> > CPU 0 __xsk_map_flush()
>> >> > CPU 2 __xsk_map_flush()
>> >> >
>> >> > This occurs because we rely on NAPI to ensure that only one napi_poll
>> >> > handler is running at a time for the given veth receive queue.
>> >> > napi_schedule_prep() will prevent multiple instances from getting
>> >> > scheduled. However calling napi_complete_done() signals that this
>> >> > napi_poll is complete and allows subsequent calls to
>> >> > napi_schedule_prep() and __napi_schedule() to succeed in scheduling a
>> >> > concurrent napi_poll before the xdp_do_flush() has been called. For the
>> >> > veth driver a concurrent call to napi_schedule_prep() and
>> >> > __napi_schedule() can occur on a different CPU because the veth xmit
>> >> > path can additionally schedule a napi_poll creating the race.
>> >>
>> >> The above looks like a generic problem that other drivers could hit.
>> >> Perhaps it could be worthy updating the xdp_do_flush() doc text to
>> >> explicitly mention it must be called before napi_complete_done().
>> >
>> > Good observation. I took a quick peek at this and it seems there are
>> > at least 5 more drivers that can call napi_complete_done() before
>> > xdp_do_flush():
>> >
>> > drivers/net/ethernet/qlogic/qede/
>> > drivers/net/ethernet/freescale/dpaa2
>> > drivers/net/ethernet/freescale/dpaa
>> > drivers/net/ethernet/microchip/lan966x
>> > drivers/net/virtio_net.c
>> >
>> > The question is then if this race can occur on these five drivers.
>> > Dpaa2 has AF_XDP zero-copy support implemented, so it can suffer from
>> > this race as the Tx zero-copy path is basically just a napi_schedule()
>> > and it can be called/invoked from multiple processes at the same time.
>> > In regards to the others, I do not know.
>> >
>> > Would it be prudent to just switch the order of xdp_do_flush() and
>> > napi_complete_done() in all these drivers, or would that be too
>> > defensive?
>>
>> We rely on being inside a single NAPI instance trough to the
>> xdp_do_flush() call for RCU protection of all in-kernel data structures
>> as well[0]. Not sure if this leads to actual real-world bugs for the
>> in-kernel path, but conceptually it's wrong at least. So yeah, I think
>> we should definitely swap the order everywhere and document this!
>
> OK, let me take a stab at it. For at least the first four, it will be
> compilation tested only from my side since I do not own any of those
> SoCs/cards. Will need the help of those maintainers for sure.
Sounds good, thanks! :)
-Toke
next prev parent reply other threads:[~2023-01-11 23:07 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-14 22:32 Possible race with xsk_flush Shawn Bohrer
2022-12-15 10:22 ` Magnus Karlsson
2022-12-15 19:50 ` Alex Forster
2022-12-16 0:14 ` Shawn Bohrer
2022-12-16 10:05 ` Magnus Karlsson
2022-12-16 16:48 ` Shawn Bohrer
2022-12-16 23:42 ` Shawn Bohrer
2022-12-20 1:31 ` Shawn Bohrer
2022-12-20 9:06 ` Magnus Karlsson
2022-12-20 15:25 ` Shawn Bohrer
2022-12-20 15:59 ` Magnus Karlsson
2022-12-20 16:02 ` Shawn Bohrer
2022-12-20 18:59 ` [PATCH] veth: Fix race with AF_XDP exposing old or uninitialized descriptors Shawn Bohrer
2022-12-22 1:07 ` Jakub Kicinski
2022-12-22 14:30 ` Paolo Abeni
2022-12-22 10:18 ` Paolo Abeni
2023-01-11 14:02 ` Magnus Karlsson
2023-01-11 14:21 ` Toke Høiland-Jørgensen
2023-01-11 16:24 ` Magnus Karlsson
2023-01-11 23:06 ` Toke Høiland-Jørgensen [this message]
2022-12-22 14:40 ` patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=871qo0hdrh.fsf@toke.dk \
--to=toke@redhat.com \
--cc=bjorn@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=kernel-team@cloudflare.com \
--cc=magnus.karlsson@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=sbohrer@cloudflare.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.