From: Yonghong Song <yonghong.song@linux.dev>
To: Dave Marchevsky <davemarchevsky@fb.com>, bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Kernel Team <kernel-team@fb.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
Date: Tue, 21 Nov 2023 07:34:05 -0800 [thread overview]
Message-ID: <20669b09-36be-493c-9cf7-bf34e906832c@linux.dev> (raw)
In-Reply-To: <20231120175925.733167-2-davemarchevsky@fb.com>
On 11/20/23 12:59 PM, Dave Marchevsky wrote:
> This patch modifies the generic bpf_local_storage infrastructure to
> support mmapable map values and adds mmap() handling to task_local
> storage leveraging this new functionality. A userspace task which
> mmap's a task_local storage map will receive a pointer to the map_value
> corresponding to that tasks' key - mmap'ing in other tasks' mapvals is
> not supported in this patch.
>
> Currently, struct bpf_local_storage_elem contains both bookkeeping
> information as well as a struct bpf_local_storage_data with additional
> bookkeeping information and the actual mapval data. We can't simply map
> the page containing this struct into userspace. Instead, mmapable
> local_storage uses bpf_local_storage_data's data field to point to the
> actual mapval, which is allocated separately such that it can be
> mmapped. Only the mapval lives on the page(s) allocated for it.
>
> The lifetime of the actual_data mmapable region is tied to the
> bpf_local_storage_elem which points to it. This doesn't necessarily mean
> that the pages go away when the bpf_local_storage_elem is free'd - if
> they're mapped into some userspace process they will remain until
> unmapped, but are no longer the task_local storage's mapval.
>
> Implementation details:
>
> * A few small helpers are added to deal with bpf_local_storage_data's
> 'data' field having different semantics when the local_storage map
> is mmapable. With their help, many of the changes to existing code
> are purely mechanical (e.g. sdata->data becomes sdata_mapval(sdata),
> selem->elem_size becomes selem_bytes_used(selem)).
>
> * The map flags are copied into bpf_local_storage_data when its
> containing bpf_local_storage_elem is alloc'd, since the
> bpf_local_storage_map associated with them may be gone when
> bpf_local_storage_data is free'd, and testing flags for
> BPF_F_MMAPABLE is necessary when free'ing to ensure that the
> mmapable region is free'd.
> * The extra field doesn't change bpf_local_storage_elem's size.
> There were 48 bytes of padding after the bpf_local_storage_data
> field, now there are 40.
>
> * Currently, bpf_local_storage_update always creates a new
> bpf_local_storage_elem for the 'updated' value - the only exception
> being if the map_value has a bpf_spin_lock field, in which case the
> spin lock is grabbed instead of the less granular bpf_local_storage
> lock, and the value updated in place. This inplace update behavior
> is desired for mmapable local_storage map_values as well, since
> creating a new selem would result in new mmapable pages.
>
> * The size of the mmapable pages are accounted for when calling
> mem_{charge,uncharge}. If the pages are mmap'd into a userspace task
> mem_uncharge may be called before they actually go away.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> include/linux/bpf_local_storage.h | 14 ++-
> kernel/bpf/bpf_local_storage.c | 145 ++++++++++++++++++++++++------
> kernel/bpf/bpf_task_storage.c | 35 ++++++--
> kernel/bpf/syscall.c | 2 +-
> 4 files changed, 163 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 173ec7f43ed1..114973f925ea 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> * the number of cachelines accessed during the cache hit case.
> */
> struct bpf_local_storage_map __rcu *smap;
> - u8 data[] __aligned(8);
> + /* Need to duplicate smap's map_flags as smap may be gone when
> + * it's time to free bpf_local_storage_data
> + */
> + u64 smap_map_flags;
> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> + * Otherwise the actual mapval data lives here
> + */
> + union {
> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> + void *actual_data __aligned(8);
> + };
> };
>
> /* Linked to bpf_local_storage and bpf_local_storage_map */
> @@ -124,6 +134,8 @@ static struct bpf_local_storage_cache name = { \
> /* Helper functions for bpf_local_storage */
> int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
>
> +void *sdata_mapval(struct bpf_local_storage_data *data);
> +
> struct bpf_map *
> bpf_local_storage_map_alloc(union bpf_attr *attr,
> struct bpf_local_storage_cache *cache,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 146824cc9689..9b3becbcc1a3 100644
> --- a/kernel/bpf/bpf_local_storage.c
[...]
> @@ -583,14 +665,14 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> err = bpf_local_storage_alloc(owner, smap, selem, gfp_flags);
> if (err) {
> bpf_selem_free(selem, smap, true);
> - mem_uncharge(smap, owner, smap->elem_size);
> + mem_uncharge(smap, owner, selem_bytes_used(smap));
> return ERR_PTR(err);
> }
>
> return SDATA(selem);
> }
>
> - if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) {
> + if (can_update_existing_selem(smap, map_flags) && !(map_flags & BPF_NOEXIST)) {
> /* Hoping to find an old_sdata to do inline update
> * such that it can avoid taking the local_storage->lock
> * and changing the lists.
> @@ -601,8 +683,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> if (err)
> return ERR_PTR(err);
> if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> - copy_map_value_locked(&smap->map, old_sdata->data,
> - value, false);
> + if (map_flags & BPF_F_LOCK)
> + copy_map_value_locked(&smap->map,
> + sdata_mapval(old_sdata),
> + value, false);
> + else
> + copy_map_value(&smap->map, sdata_mapval(old_sdata),
> + value);
IIUC, if two 'storage_update' to the same map/key and then
these two updates will be serialized due to spin_lock.
How about concurrent update for mmap'ed sdata, do we need
any protection here?
> return old_sdata;
> }
> }
> @@ -633,8 +720,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> goto unlock;
>
> if (old_sdata && (map_flags & BPF_F_LOCK)) {
> - copy_map_value_locked(&smap->map, old_sdata->data, value,
> - false);
> + copy_map_value_locked(&smap->map, sdata_mapval(old_sdata),
> + value, false);
> selem = SELEM(old_sdata);
> goto unlock;
> }
> @@ -656,7 +743,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> unlock:
> raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> if (alloc_selem) {
> - mem_uncharge(smap, owner, smap->elem_size);
> + mem_uncharge(smap, owner, selem_bytes_used(smap));
> bpf_selem_free(alloc_selem, smap, true);
> }
> return err ? ERR_PTR(err) : SDATA(selem);
> @@ -707,6 +794,10 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
> if (attr->value_size > BPF_LOCAL_STORAGE_MAX_VALUE_SIZE)
> return -E2BIG;
>
> + if ((attr->map_flags & BPF_F_MMAPABLE) &&
> + attr->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> + return -EINVAL;
> +
> return 0;
> }
>
[...]
next prev parent reply other threads:[~2023-11-21 15:34 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-20 17:59 [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
2023-11-20 21:41 ` Johannes Weiner
2023-11-21 0:42 ` Martin KaFai Lau
2023-11-21 6:11 ` David Marchevsky
2023-11-21 19:27 ` Martin KaFai Lau
2023-11-21 19:49 ` Alexei Starovoitov
2023-12-11 17:31 ` David Marchevsky
2023-11-21 2:32 ` kernel test robot
2023-11-21 5:06 ` kernel test robot
2023-11-21 5:20 ` kernel test robot
2023-11-21 5:44 ` Alexei Starovoitov
2023-11-21 6:41 ` Yonghong Song
2023-11-21 15:34 ` Yonghong Song [this message]
2023-11-21 19:30 ` Andrii Nakryiko
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
2023-11-21 19:34 ` Andrii Nakryiko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20669b09-36be-493c-9cf7-bf34e906832c@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davemarchevsky@fb.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.