From: Yonghong Song <yonghong.song@linux.dev>
To: Dave Marchevsky <davemarchevsky@fb.com>, bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Kernel Team <kernel-team@fb.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
Date: Tue, 21 Nov 2023 07:34:05 -0800 [thread overview]
Message-ID: <20669b09-36be-493c-9cf7-bf34e906832c@linux.dev> (raw)
In-Reply-To: <20231120175925.733167-2-davemarchevsky@fb.com>
On 11/20/23 12:59 PM, Dave Marchevsky wrote:
> This patch modifies the generic bpf_local_storage infrastructure to
> support mmapable map values and adds mmap() handling to task_local
> storage leveraging this new functionality. A userspace task which
> mmap's a task_local storage map will receive a pointer to the map_value
> corresponding to that task's key; mmap'ing other tasks' mapvals is
> not supported in this patch.
>
> Currently, struct bpf_local_storage_elem contains both bookkeeping
> information as well as a struct bpf_local_storage_data with additional
> bookkeeping information and the actual mapval data. We can't simply map
> the page containing this struct into userspace. Instead, mmapable
> local_storage uses bpf_local_storage_data's data field to point to the
> actual mapval, which is allocated separately such that it can be
> mmapped. Only the mapval lives on the page(s) allocated for it.
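To check my understanding of the layout: a userspace-only sketch of the two
cases (the mock struct and helper mirror the names in the patch, but this is
my toy code, not the kernel side):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BPF_F_MMAPABLE (1U << 10) /* flag value from uapi/linux/bpf.h */

/* toy stand-in for bpf_local_storage_data; illustration only */
struct mock_sdata {
	uint64_t smap_map_flags;
	union {
		uint8_t data[8];   /* !BPF_F_MMAPABLE: mapval stored inline */
		void *actual_data; /* BPF_F_MMAPABLE: separately allocated pages */
	};
};

/* mirrors what the patch's sdata_mapval() helper has to do */
static void *mock_sdata_mapval(struct mock_sdata *sdata)
{
	if (sdata->smap_map_flags & BPF_F_MMAPABLE)
		return sdata->actual_data;
	return sdata->data;
}
```

i.e. callers always go through the helper and never care which side of the
union is live.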
>
> The lifetime of the actual_data mmapable region is tied to the
> bpf_local_storage_elem which points to it. This doesn't necessarily mean
> that the pages go away when the bpf_local_storage_elem is free'd - if
> they're mapped into some userspace process they will remain until
> unmapped, but are no longer the task_local storage's mapval.
>
> Implementation details:
>
> * A few small helpers are added to deal with bpf_local_storage_data's
> 'data' field having different semantics when the local_storage map
> is mmapable. With their help, many of the changes to existing code
> are purely mechanical (e.g. sdata->data becomes sdata_mapval(sdata),
> selem->elem_size becomes selem_bytes_used(selem)).
>
> * The map flags are copied into bpf_local_storage_data when its
> containing bpf_local_storage_elem is alloc'd, since the
> bpf_local_storage_map associated with them may be gone when
> bpf_local_storage_data is free'd, and testing flags for
> BPF_F_MMAPABLE is necessary when free'ing to ensure that the
> mmapable region is free'd.
>
> * The extra field doesn't change bpf_local_storage_elem's size.
> There were 48 bytes of padding after the bpf_local_storage_data
> field, now there are 40.
>
> * Currently, bpf_local_storage_update always creates a new
> bpf_local_storage_elem for the 'updated' value - the only exception
> being if the map_value has a bpf_spin_lock field, in which case the
> spin lock is grabbed instead of the less granular bpf_local_storage
> lock, and the value updated in place. This inplace update behavior
> is desired for mmapable local_storage map_values as well, since
> creating a new selem would result in new mmapable pages.
>
> * The size of the mmapable pages is accounted for when calling
> mem_{charge,uncharge}. If the pages are mmap'd into a userspace task,
> mem_uncharge may be called before they actually go away.
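On the accounting point: I assume selem_bytes_used() has to charge whole
pages in the mmapable case, something along these lines (my guess at the
math for illustration, not the patch's exact helper):

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096UL
#define MOCK_PAGE_ALIGN(x) (((x) + MOCK_PAGE_SIZE - 1) & ~(MOCK_PAGE_SIZE - 1))

/* guess at the charged size per elem: the selem bookkeeping stays, but the
 * inline mapval bytes are replaced by full, separately allocated pages */
static size_t mock_selem_bytes_used(size_t elem_size, size_t value_size,
				    int mmapable)
{
	if (!mmapable)
		return elem_size;
	return elem_size - value_size + MOCK_PAGE_ALIGN(value_size);
}
```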
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> include/linux/bpf_local_storage.h | 14 ++-
> kernel/bpf/bpf_local_storage.c | 145 ++++++++++++++++++++++++------
> kernel/bpf/bpf_task_storage.c | 35 ++++++--
> kernel/bpf/syscall.c | 2 +-
> 4 files changed, 163 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 173ec7f43ed1..114973f925ea 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> * the number of cachelines accessed during the cache hit case.
> */
> struct bpf_local_storage_map __rcu *smap;
> - u8 data[] __aligned(8);
> + /* Need to duplicate smap's map_flags as smap may be gone when
> + * it's time to free bpf_local_storage_data
> + */
> + u64 smap_map_flags;
> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> + * Otherwise the actual mapval data lives here
> + */
> + union {
> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> + void *actual_data __aligned(8);
> + };
> };
>
> /* Linked to bpf_local_storage and bpf_local_storage_map */
> @@ -124,6 +134,8 @@ static struct bpf_local_storage_cache name = { \
> /* Helper functions for bpf_local_storage */
> int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
>
> +void *sdata_mapval(struct bpf_local_storage_data *data);
> +
> struct bpf_map *
> bpf_local_storage_map_alloc(union bpf_attr *attr,
> struct bpf_local_storage_cache *cache,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 146824cc9689..9b3becbcc1a3 100644
> --- a/kernel/bpf/bpf_local_storage.c
[...]
> @@ -583,14 +665,14 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> err = bpf_local_storage_alloc(owner, smap, selem, gfp_flags);
> if (err) {
> bpf_selem_free(selem, smap, true);
> - mem_uncharge(smap, owner, smap->elem_size);
> + mem_uncharge(smap, owner, selem_bytes_used(smap));
> return ERR_PTR(err);
> }
>
> return SDATA(selem);
> }
>
> - if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) {
> + if (can_update_existing_selem(smap, map_flags) && !(map_flags & BPF_NOEXIST)) {
> /* Hoping to find an old_sdata to do inline update
> * such that it can avoid taking the local_storage->lock
> * and changing the lists.
> @@ -601,8 +683,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> if (err)
> return ERR_PTR(err);
> if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> - copy_map_value_locked(&smap->map, old_sdata->data,
> - value, false);
> + if (map_flags & BPF_F_LOCK)
> + copy_map_value_locked(&smap->map,
> + sdata_mapval(old_sdata),
> + value, false);
> + else
> + copy_map_value(&smap->map, sdata_mapval(old_sdata),
> + value);
IIUC, if there are two 'storage_update' calls to the same map/key,
the two updates will be serialized by the spin_lock.
But what about concurrent updates to the mmap'ed sdata from
userspace - do we need any protection here?
> return old_sdata;
> }
> }
> @@ -633,8 +720,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> goto unlock;
>
> if (old_sdata && (map_flags & BPF_F_LOCK)) {
> - copy_map_value_locked(&smap->map, old_sdata->data, value,
> - false);
> + copy_map_value_locked(&smap->map, sdata_mapval(old_sdata),
> + value, false);
> selem = SELEM(old_sdata);
> goto unlock;
> }
> @@ -656,7 +743,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> unlock:
> raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> if (alloc_selem) {
> - mem_uncharge(smap, owner, smap->elem_size);
> + mem_uncharge(smap, owner, selem_bytes_used(smap));
> bpf_selem_free(alloc_selem, smap, true);
> }
> return err ? ERR_PTR(err) : SDATA(selem);
> @@ -707,6 +794,10 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
> if (attr->value_size > BPF_LOCAL_STORAGE_MAX_VALUE_SIZE)
> return -E2BIG;
>
> + if ((attr->map_flags & BPF_F_MMAPABLE) &&
> + attr->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> + return -EINVAL;
> +
> return 0;
> }
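The new alloc_check restriction, as I read it, just rejects the flag for
anything but task storage. Sketching the check standalone (enum values
hardcoded here for the mock, taken from uapi/linux/bpf.h):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define BPF_F_MMAPABLE (1U << 10)
#define MOCK_MAP_TYPE_TASK_STORAGE 29 /* BPF_MAP_TYPE_TASK_STORAGE */
#define MOCK_MAP_TYPE_SK_STORAGE   24 /* BPF_MAP_TYPE_SK_STORAGE */

/* mirrors the check added to bpf_local_storage_map_alloc_check() */
static int mock_mmapable_alloc_check(uint32_t map_type, uint32_t map_flags)
{
	if ((map_flags & BPF_F_MMAPABLE) &&
	    map_type != MOCK_MAP_TYPE_TASK_STORAGE)
		return -EINVAL;
	return 0;
}
```

so sk/inode/cgrp storage with BPF_F_MMAPABLE fails map creation outright.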
>
[...]
Thread overview: 17+ messages
2023-11-20 17:59 [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
2023-11-20 21:41 ` Johannes Weiner
2023-11-21 0:42 ` Martin KaFai Lau
2023-11-21 6:11 ` David Marchevsky
2023-11-21 19:27 ` Martin KaFai Lau
2023-11-21 19:49 ` Alexei Starovoitov
2023-12-11 17:31 ` David Marchevsky
2023-11-21 2:32 ` kernel test robot
2023-11-21 5:06 ` kernel test robot
2023-11-21 5:20 ` kernel test robot
2023-11-21 5:44 ` Alexei Starovoitov
2023-11-21 6:41 ` Yonghong Song
2023-11-21 15:34 ` Yonghong Song [this message]
2023-11-21 19:30 ` Andrii Nakryiko
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
2023-11-21 19:34 ` Andrii Nakryiko