public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Daniel Borkmann <daniel@iogearbox.net>,
	bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Martin KaFai Lau <kafai@fb.com>,
	Hangbin Liu <liuhangbin@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Magnus Karlsson <magnus.karlsson@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to redirect map entries
Date: Tue, 22 Jun 2021 00:35:08 +0200	[thread overview]
Message-ID: <87r1gurgmb.fsf@toke.dk> (raw)
In-Reply-To: <96117b3f-8041-b524-ef70-d5afc97e32f9@iogearbox.net>

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 6/21/21 11:39 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 6/17/21 11:27 PM, Toke Høiland-Jørgensen wrote:
>>>> XDP_REDIRECT works by a three-step process: the bpf_redirect() and
>>>> bpf_redirect_map() helpers will lookup the target of the redirect and store
>>>> it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
>>>> Next, when the program returns the XDP_REDIRECT return code, the driver
>>>> will call xdp_do_redirect() which will use the information thus stored to
>>>> actually enqueue the frame into a bulk queue structure (that differs
>>>> slightly by map type, but shares the same principle). Finally, before
>>>> exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
>>>> flush all the different bulk queues, thus completing the redirect.
>>>>
>>>> Pointers to the map entries will be kept around for this whole sequence of
>>>> steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>>>> the core code; instead drivers add their own rcu_read_lock() around the XDP
>>>> portions of the code, but somewhat inconsistently as Martin discovered[0].
>>>> However, things still work because everything happens inside a single NAPI
>>>> poll sequence, which means it's between a pair of calls to
>>>> local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
>>>> document this intention by using rcu_dereference_check() with
>>>> rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
>>>> lockdep to verify that everything is done correctly.
>>>>
>>>> This patch does just that: we add an __rcu annotation to the map entry
>>>> pointers and remove the various comments explaining the NAPI poll assurance
>>>> strewn through devmap.c in favour of a longer explanation in filter.c. The
>>>> goal is to have one coherent documentation of the entire flow, and rely on
>>>> the RCU annotations as a "standard" way of communicating the flow in the
>>>> map code (which can additionally be understood by sparse and lockdep).
>>>>
>>>> The RCU annotation replacements result in a fairly straight-forward
>>>> replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
>>>> becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the
>>>> proper constructs to cast the pointer back and forth between __rcu and
>>>> __kernel address space (for the benefit of sparse). The one complication is
>>>> that xskmap has a few constructions where double-pointers are passed back
>>>> and forth; these simply all gain __rcu annotations, and only the final
>>>> reference/dereference to the inner-most pointer gets changed.
>>>>
>>>> With this, everything can be run through sparse without eliciting
>>>> complaints, and lockdep can verify correctness even without the use of
>>>> rcu_read_lock() in the drivers. Subsequent patches will clean these up from
>>>> the drivers.
>>>>
>>>> [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
>>>> [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
>>>>
>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>> ---
>>>>    include/net/xdp_sock.h |  2 +-
>>>>    kernel/bpf/cpumap.c    | 13 +++++++----
>>>>    kernel/bpf/devmap.c    | 49 ++++++++++++++++++------------------------
>>>>    net/core/filter.c      | 28 ++++++++++++++++++++++++
>>>>    net/xdp/xsk.c          |  4 ++--
>>>>    net/xdp/xsk.h          |  4 ++--
>>>>    net/xdp/xskmap.c       | 29 ++++++++++++++-----------
>>>>    7 files changed, 80 insertions(+), 49 deletions(-)
>>> [...]
>>>>    						 __dev_map_entry_free);
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index caa88955562e..0b7db5c70385 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -3922,6 +3922,34 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>>>    	.arg2_type	= ARG_ANYTHING,
>>>>    };
>>>>    
>>>> +/* XDP_REDIRECT works by a three-step process, implemented in the functions
>>>> + * below:
>>>> + *
>>>> + * 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target
>>>> + *    of the redirect and store it (along with some other metadata) in a per-CPU
>>>> + *    struct bpf_redirect_info.
>>>> + *
>>>> + * 2. When the program returns the XDP_REDIRECT return code, the driver will
>>>> + *    call xdp_do_redirect() which will use the information in struct
>>>> + *    bpf_redirect_info to actually enqueue the frame into a map type-specific
>>>> + *    bulk queue structure.
>>>> + *
>>>> + * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
>>>> + *    which will flush all the different bulk queues, thus completing the
>>>> + *    redirect.
>>>> + *
>>>> + * Pointers to the map entries will be kept around for this whole sequence of
>>>> + * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>>>> + * the core code; instead, the RCU protection relies on everything happening
>>>> + * inside a single NAPI poll sequence, which means it's between a pair of calls
>>>> + * to local_bh_disable()/local_bh_enable().
>>>> + *
>>>> + * The map entries are marked as __rcu and the map code makes sure to
>>>> + * dereference those pointers with rcu_dereference_check() in a way that works
>>>> + * for both sections that to hold an rcu_read_lock() and sections that are
>>>> + * called from NAPI without a separate rcu_read_lock(). The code below does not
>>>> + * use RCU annotations, but relies on those in the map code.
>>>
>>> One more follow-up question related to tc BPF: given we do use rcu_read_lock_bh()
>>> in case of sch_handle_egress(), could we also remove the rcu_read_lock() pair
>>> from cls_bpf_classify() then?
>> 
>> I believe so, yeah. Patch 2 in this series should even make lockdep stop
>> complaining about it :)
>
> Btw, I was wondering whether we should just get rid of all the WARN_ON_ONCE()s
> from those map helpers given in most situations these are not triggered anyway
> due to retpoline avoidance where verifier rewrites the calls to jump to the map
> backend implementation directly. One alternative could be to have an extension
> to the bpf prologue generation under CONFIG_DEBUG_LOCK_ALLOC and call the lockdep
> checks from there, but it's probably not worth the effort. (In the trampoline
> case we have those __bpf_prog_enter()/__bpf_prog_enter_sleepable() where the
> latter in particular has asserts like might_fault(), fwiw.)

I agree that it's probably overkill to amend the prologue. No strong
opinion on whether removing the checks entirely is a good idea; I guess
they at least serve as documentation even if they're not actually called
that often?

>> I can add a patch removing the rcu_read_lock() from cls_bpf in the next
>> version.
>> 
>>> It would also be great if this scenario in general could be placed
>>> under the Documentation/RCU/whatisRCU.rst as an example, so we could
>>> refer to the official doc on this, too, if Paul is good with this.
>> 
>> I'll take a look and see if I can find a way to fit it in there...
>> 
>>> Could you also update the RCU comment in bpf_prog_run_xdp()? Or
>>> alternatively move all the below driver comments in there as a single
>>> location?
>>>
>>>     /* This code is invoked within a single NAPI poll cycle and thus under
>>>      * local_bh_disable(), which provides the needed RCU protection.
>>>      */
>> 
>> Sure, can do. And yeah, I do agree that moving the comment in there
>> makes more sense than scattering it over all the drivers, even if that
>> means I have to go back and edit all the drivers again :P
>
> Yeap, all of the above sounds good, thanks!

Cool :)


  reply	other threads:[~2021-06-21 22:35 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-17 21:27 [PATCH bpf-next v3 00/16] Clean up and document RCU-based object protection for XDP_REDIRECT Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 01/16] rcu: Create an unrcu_pointer() to remove __rcu from a pointer Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 02/16] bpf: allow RCU-protected lookups to happen from bh context Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to redirect map entries Toke Høiland-Jørgensen
2021-06-18  4:55   ` Martin KaFai Lau
2021-06-18 23:27   ` Daniel Borkmann
2021-06-21 21:39     ` Toke Høiland-Jørgensen
2021-06-21 22:15       ` Daniel Borkmann
2021-06-21 22:35         ` Toke Høiland-Jørgensen [this message]
2021-06-22  8:50           ` Daniel Borkmann
2021-06-22 13:55       ` Toke Høiland-Jørgensen
2021-06-22 20:26         ` Paul E. McKenney
2021-06-22 21:48           ` Toke Høiland-Jørgensen
2021-06-22 23:19             ` Paul E. McKenney
2021-06-23 10:55               ` Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 04/16] ena: remove rcu_read_lock() around XDP program invocation Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 05/16] bnxt: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 06/16] thunderx: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 07/16] freescale: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 08/16] net: intel: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 09/16] marvell: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 10/16] mlx4: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 11/16] nfp: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 12/16] qede: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 13/16] sfc: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 14/16] netsec: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 15/16] stmmac: " Toke Høiland-Jørgensen
2021-06-18  9:47   ` Wong Vee Khee
2021-06-17 21:27 ` [PATCH bpf-next v3 16/16] net: ti: " Toke Høiland-Jørgensen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r1gurgmb.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=daniel@iogearbox.net \
    --cc=kafai@fb.com \
    --cc=kuba@kernel.org \
    --cc=liuhangbin@gmail.com \
    --cc=magnus.karlsson@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulmck@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox