BPF List
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Hou Tao <houtao@huaweicloud.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>,
	Yonghong Song <yonghong.song@linux.dev>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	houtao1@huawei.com, bpf@vger.kernel.org
Subject: Re: [PATCH bpf v2 4/5] bpf: Optimize the free of inner map
Date: Fri, 17 Nov 2023 23:34:39 -0800	[thread overview]
Message-ID: <b3955bcd-779e-4171-9daa-d5e2d6b9afd8@linux.dev> (raw)
In-Reply-To: <20231113123324.3914612-5-houtao@huaweicloud.com>

On 11/13/23 4:33 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> When removing the inner map from the outer map, the inner map will be
> freed after one RCU grace period and one RCU tasks trace grace
> period, so it is certain that the bpf program which may access the inner
> map, has exited before the inner map is freed.
> 
> However there is unnecessary to wait for one RCU grace period or one RCU
> tasks trace grace period if the outer map is only accessed by sleepable
> program or non-sleepable program. So recording the context of the owned
> bpf programs when adding map into env->used_maps and using the recorded
> access context to decide which, and how many, RCU grace periods are
> needed when freeing the inner map.
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   include/linux/bpf.h     |  9 ++++++++-
>   kernel/bpf/map_in_map.c | 18 +++++++++++++-----
>   kernel/bpf/syscall.c    | 15 +++++++++++++--
>   kernel/bpf/verifier.c   |  5 +++++
>   4 files changed, 39 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index ec3c90202ffe6..8faa1af4b39df 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -245,6 +245,12 @@ struct bpf_list_node_kern {
>   	void *owner;
>   } __attribute__((aligned(8)));
>   
> +enum {
> +	BPF_MAP_ACC_NORMAL_PROG_CTX = 1,
> +	BPF_MAP_ACC_SLEEPABLE_PROG_CTX = 2,

nit. It is a bit flag. Use (1U << 0) and (1U << 1) to make it more obvious.

How about renaming this to BPF_MAP_RCU_GP and BPF_MAP_RCU_TT_GP to better 
reflect it is used to decide if the map_free needs to wait for any rcu gp?

> +	BPF_MAP_ACC_PROG_CTX_MASK = BPF_MAP_ACC_NORMAL_PROG_CTX | BPF_MAP_ACC_SLEEPABLE_PROG_CTX,
> +};
> +
>   struct bpf_map {
>   	/* The first two cachelines with read-mostly members of which some
>   	 * are also accessed in fast-path (e.g. ops, max_entries).
> @@ -292,7 +298,8 @@ struct bpf_map {
>   	} owner;
>   	bool bypass_spec_v1;
>   	bool frozen; /* write-once; write-protected by freeze_mutex */
> -	bool free_after_mult_rcu_gp;
> +	atomic_t owned_prog_ctx;

Instead of the enum flags, this should only need a true/false value to tell 
whether the outer map has ever been used by a sleepable prog or not. may be 
renaming this to "atomic used_by_sleepable;"?

> +	atomic_t may_be_accessed_prog_ctx;

nit. rename this to "rcu_gp_flags;"

>   	s64 __percpu *elem_count;
>   };
>   
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index cf33630655661..e3d26a89ac5b6 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -131,12 +131,20 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred)
>   {
>   	struct bpf_map *inner_map = ptr;
>   
> -	/* The inner map may still be used by both non-sleepable and sleepable
> -	 * bpf program, so free it after one RCU grace period and one tasks
> -	 * trace RCU grace period.
> +	/* Defer the freeing of inner map according to the owner program
> +	 * context of outer maps, so unnecessary multiple RCU GP waitings
> +	 * can be avoided.
>   	 */
> -	if (deferred)
> -		WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
> +	if (deferred) {
> +		/* owned_prog_ctx may be updated concurrently by new bpf program
> +		 * so add smp_mb() below to ensure that reading owned_prog_ctx
> +		 * will return the newly-set bit when the new bpf program finds
> +		 * the inner map before it is removed from outer map.
> +		 */
> +		smp_mb();

This part took my head spinning a little, so it is better to ask. The 
owned_prog_ctx is set during verification time. There are many instructions till 
the prog is actually verified, attached (another syscall) and then run to do the 
actual lookup(&outer_map). Is this level of reordering practically possible?

> +		atomic_or(atomic_read(&map->owned_prog_ctx),
> +			  &inner_map->may_be_accessed_prog_ctx);
> +	}
>   	bpf_map_put(inner_map);
>   }
>   
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e2d2701ce2c45..5a7906f2b027e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct work_struct *work)
>   {
>   	struct bpf_map *map = container_of(work, struct bpf_map, work);
>   	struct btf_record *rec = map->record;
> +	int acc_ctx;
>   
>   	security_bpf_map_free(map);
>   	bpf_map_release_memcg(map);
>   
> -	if (READ_ONCE(map->free_after_mult_rcu_gp))
> -		synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> +	acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) & BPF_MAP_ACC_PROG_CTX_MASK;

The mask should not be needed.

> +	if (acc_ctx) {
> +		if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
> +			synchronize_rcu();
> +		else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
> +			synchronize_rcu_tasks_trace();
> +		else
> +			synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);

Is it better to add a rcu_head to the map and then use call_rcu_(). e.g. when 
there is many delete happened to the outer map during a process restart to 
re-populate the outer map. It is relatively much cheaper to add a rcu_head to 
the map comparing to adding one for each elem. wdyt?

> +	}
>   
>   	/* implementation dependent freeing */
>   	map->ops->map_free(map);
> @@ -5326,6 +5334,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
>   		goto out_unlock;
>   	}
>   
> +	/* No need to update owned_prog_ctx, because the bpf program doesn't
> +	 * access the map.
> +	 */
>   	memcpy(used_maps_new, used_maps_old,
>   	       sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
>   	used_maps_new[prog->aux->used_map_cnt] = map;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index bd1c42eb540f1..d8d5432b240dc 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18012,12 +18012,17 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
>   				return -E2BIG;
>   			}
>   
> +			atomic_or(env->prog->aux->sleepable ? BPF_MAP_ACC_SLEEPABLE_PROG_CTX :
> +							      BPF_MAP_ACC_NORMAL_PROG_CTX,
> +				  &map->owned_prog_ctx);
>   			/* hold the map. If the program is rejected by verifier,
>   			 * the map will be released by release_maps() or it
>   			 * will be used by the valid program until it's unloaded
>   			 * and all maps are released in free_used_maps()
>   			 */
>   			bpf_map_inc(map);
> +			/* Paired with smp_mb() in bpf_map_fd_put_ptr() */
> +			smp_mb__after_atomic();
>   
>   			aux->map_index = env->used_map_cnt;
>   			env->used_maps[env->used_map_cnt++] = map;


  reply	other threads:[~2023-11-18  7:34 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-13 12:33 [PATCH bpf v2 0/5] bpf: Fix the release of inner map Hou Tao
2023-11-13 12:33 ` [PATCH bpf v2 1/5] bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers Hou Tao
2023-11-13 12:33 ` [PATCH bpf v2 2/5] bpf: Add map and need_defer parameters to .map_fd_put_ptr() Hou Tao
2023-11-13 12:33 ` [PATCH bpf v2 3/5] bpf: Defer the free of inner map when necessary Hou Tao
2023-11-13 12:33 ` [PATCH bpf v2 4/5] bpf: Optimize the free of inner map Hou Tao
2023-11-18  7:34   ` Martin KaFai Lau [this message]
2023-11-18 13:04     ` Hou Tao
2023-11-18 23:28       ` Martin KaFai Lau
2023-11-21  2:27         ` Hou Tao
2023-11-21  5:19   ` Alexei Starovoitov
2023-11-21  6:45     ` Hou Tao
2023-11-21 17:49       ` Alexei Starovoitov
2023-11-22  1:54         ` Hou Tao
2023-11-13 12:33 ` [PATCH bpf v2 5/5] selftests/bpf: Add test cases for " Hou Tao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b3955bcd-779e-4171-9daa-d5e2d6b9afd8@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=houtao@huaweicloud.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox