All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Kui-Feng Lee <kuifeng@meta.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, song@kernel.org,
	kernel-team@meta.com, andrii@kernel.org, sdf@google.com
Subject: Re: [PATCH bpf-next v6 1/8] bpf: Retire the struct_ops map kvalue->refcnt.
Date: Mon, 13 Mar 2023 23:05:49 -0700	[thread overview]
Message-ID: <685fe34b-4f84-8bb2-4da7-67eef8ca3b0e@linux.dev> (raw)
In-Reply-To: <20230310043812.3087672-2-kuifeng@meta.com>

On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
> We have replaced kvalue-refcnt with synchronize_rcu() to wait for an
> RCU grace period.
> 
> Maintenance of kvalue->refcnt was a complicated task, as we had to
> simultaneously keep track of two reference counts: one for the
> reference count of bpf_map. When the kvalue->refcnt reaches zero, we
> also have to reduce the reference count on bpf_map - yet these steps
> are not performed in an atomic manner and require us to be vigilant
> when managing them. By eliminating kvalue->refcnt, we can make our
> maintenance more straightforward as the refcount of bpf_map is now
> solely managed!
> 
> To prevent the trampoline image of a struct_ops from being released
> while it is still in use, we wait for an RCU grace period. The
> setsockopt(TCP_CONGESTION, "...") command allows you to change your
> socket's congestion control algorithm and can result in releasing the
> old struct_ops implementation.

If the setsockopt() above is referring to the syscall setsockopt(), then the old 
struct_ops is fine. The old struct_ops is protected by the struct_ops map's 
refcnt (or the current kvalue->refcnt). The sk in setsockopt(sk, ...) will no 
longer use the old struct_ops before the refcnt is decremented. This part should 
be the same as the tcp-cc kernel module.

> Moreover, since this function is
> exposed through bpf_setsockopt(), it may be accessed by BPF programs
> as well. To ensure that the trampoline image belonging to struct_op
> can be safely called while its method is in use, struct_ops is
> safeguarded with rcu_read_lock(). Doing so prevents any destruction of
> the associated images before returning from a trampoline and requires
> us to wait for an RCU grace period.

The bpf_setsockopt(TCP_CONGESTION) is the reason that the trampoline image needs 
a grace period, but I noticed RCU grace period itself is not enough for 
trampoline image and more on this later.

Another reason the struct_ops map needs a RCU grace period is because of the 
bpf_try_module_get() (in tcp_set_default_congestion_control for example).


> ---
>   include/linux/bpf.h         |  1 +
>   kernel/bpf/bpf_struct_ops.c | 68 ++++++++++++++++++++-----------------
>   kernel/bpf/syscall.c        |  6 ++--
>   3 files changed, 42 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e64ff1e89fb2..00ca92ea6f2e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1938,6 +1938,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>   struct bpf_map *__bpf_map_get(struct fd f);
>   void bpf_map_inc(struct bpf_map *map);
>   void bpf_map_inc_with_uref(struct bpf_map *map);
> +struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
>   struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
>   void bpf_map_put_with_uref(struct bpf_map *map);
>   void bpf_map_put(struct bpf_map *map);
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 38903fb52f98..ab7811a4c1dd 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -58,6 +58,11 @@ struct bpf_struct_ops_map {
>   	struct bpf_struct_ops_value kvalue;
>   };
>   
> +struct bpf_struct_ops_link {
> +	struct bpf_link link;
> +	struct bpf_map __rcu *map;
> +};

Comparing with v5, this is moved from patch 3 to patch 1. It is not used here, 
so it belongs to patch 3.


> @@ -574,6 +585,19 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>   {
>   	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
>   
> +	/* The struct_ops's function may switch to another struct_ops.
> +	 *
> +	 * For example, bpf_tcp_cc_x->init() may switch to
> +	 * another tcp_cc_y by calling
> +	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> +	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
> +	 * and its refcount may reach 0 which then free its
> +	 * trampoline image while tcp_cc_x is still running.
> +	 *
> +	 * Thus, a rcu grace period is needed here.
> +	 */
> +	synchronize_rcu();

After the trampoline image finished running a struct_ops's "prog", it still has 
a few insn need to execute in the trampoline image, so it also needs to wait for 
synchronize_rcu_tasks/call_rcu_tasks.

This is an old issue, only happens when the struct_ops prog calls 
bpf_setsockopt(TCP_CONGESTION) with CONFIG_PREEMPT and unlikely other upcoming 
struct_ops subsystem may need this, please help to do a follow up fix on it 
(separate from this set) to also wait for the rcu_tasks gp.



  reply	other threads:[~2023-03-14  6:06 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-10  4:38 [PATCH bpf-next v6 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
2023-03-14  6:05   ` Martin KaFai Lau [this message]
2023-03-10  4:38 ` [PATCH bpf-next v6 2/8] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
2023-03-10 16:47   ` Stephen Hemminger
2023-03-13 15:46     ` Kui-Feng Lee
2023-03-13 16:43       ` Kui-Feng Lee
2023-03-14  0:28   ` Martin KaFai Lau
2023-03-14  4:31     ` Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 3/8] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-03-14  1:42   ` Martin KaFai Lau
2023-03-16  0:21     ` Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 8/8] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
2023-03-14  5:04   ` Martin KaFai Lau
2023-03-10 16:28 ` [PATCH bpf-next v6 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=685fe34b-4f84-8bb2-4da7-67eef8ca3b0e@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kernel-team@meta.com \
    --cc=kuifeng@meta.com \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.