Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage

From: Martin KaFai Lau <martin.lau@linux.dev>
To: David Marchevsky <david.marchevsky@linux.dev>,
	Dave Marchevsky <davemarchevsky@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@kernel.org>,
	Kernel Team <kernel-team@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	bpf@vger.kernel.org
Subject: Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
Date: Tue, 21 Nov 2023 11:27:28 -0800
Message-ID: <f4d7f72d-1ba2-49dc-b4e0-03289393d436@linux.dev>
In-Reply-To: <3dd86df3-0692-42d8-b075-f79c5dc052be@linux.dev>

On 11/20/23 10:11 PM, David Marchevsky wrote:
> 
> 
> On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
>> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>>> index 173ec7f43ed1..114973f925ea 100644
>>> --- a/include/linux/bpf_local_storage.h
>>> +++ b/include/linux/bpf_local_storage.h
>>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>>>         * the number of cachelines accessed during the cache hit case.
>>>         */
>>>        struct bpf_local_storage_map __rcu *smap;
>>> -    u8 data[] __aligned(8);
>>> +    /* Need to duplicate smap's map_flags as smap may be gone when
>>> +     * it's time to free bpf_local_storage_data
>>> +     */
>>> +    u64 smap_map_flags;
>>> +    /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>>> +     * Otherwise the actual mapval data lives here
>>> +     */
>>> +    union {
>>> +        DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>>> +        void *actual_data __aligned(8);
>>
>> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>>
>> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
>>
>> struct normal_and_mmap_value {
>>      int some_int;
>>      int __percpu_kptr *some_cnts;
>>
>>      struct bpf_mmap_page __kptr *some_stats;
>> };
>>
>> struct mmap_only_value {
>>      struct bpf_mmap_page __kptr *some_stats;
>> };
>>
>> [ ... ]
>>
> 
> This is an intriguing idea. For conciseness I'll call this specific
> kind of kptr 'mmapable kptrs' for the rest of this message. Below is
> more of a brainstorming dump than a cohesive response, separate trains
> of thought are separated by two newlines.

Thanks for bearing with me, some of these ideas could be crazy. I am trying to
see what this would look like for the other local storages, sk and inode.
Allocating a page for each sk will not be nice for a server with half a million
sk(s), e.g. half a million sk(s) sharing a few bandwidth policies or a few
tuning parameters. Something that is mmap'able to user space and also sharable
among many sk(s) would be useful.
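
To put rough numbers on the per-sk cost, here is a back-of-the-envelope sketch.
It is not from the patch: it assumes 4 KiB pages and one PAGE_ALIGN'ed
allocation per owner, like alloc_mmapable_selem_value() does for task storage.
page_align() and mmapable_bytes() are made-up user space helpers.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel would use PAGE_SIZE instead. */
#define SKETCH_PAGE_SIZE 4096ULL

/* User space stand-in for the kernel's PAGE_ALIGN(): round up to a page. */
static uint64_t page_align(uint64_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Total bytes when each of nr_allocs gets its own page-aligned value. */
static uint64_t mmapable_bytes(uint64_t nr_allocs, uint64_t value_size)
{
	return nr_allocs * page_align(value_size);
}
```

With a 64 byte value, mmapable_bytes(500000, 64) is 2048000000 bytes (about
1.9 GiB), while 4 shared policies cost mmapable_bytes(4, 64) = 16384 bytes.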

> 
> 
> My initial thought upon seeing struct normal_and_mmap_value was to note
> that we currently don't support mmaping for map_value types with _any_
> special fields ('special' as determined by btf_parse_fields). But IIUC
> you're actually talking about exposing the some_stats pointee memory via
> mmap, not the containing struct with kptr fields. That is, for maps that
> support these kptrs, mmap()ing a map with value type struct
> normal_and_mmap_value would return the some_stats pointer value, and
> likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
> logic in this patch. We'd only be able to support one such mmapable kptr
> field per mapval type, but that isn't a dealbreaker.
> 
> Some maps, like task_storage, would only support mmap() on a map_value
> with mmapable kptr field, as mmap()ing the mapval itself doesn't make
> sense or is unsafe. Seems like arraymap would do the opposite, only

Changing direction a bit since arraymap is brought up. :)

arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
arraymap as a kptr, the bpf prog should be able to access it as a map, much like
the current map-in-map setup. The arraymap can also be used as a regular map
from user space (e.g. pinning). It may need some btf plumbing to tell the
verifier the value type of the arraymap.

The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
can be used with value->array_mmap initialized to an arraymap_fd. This will
limit the arraymap kptr update to the syscall side only, which seems to fit your
use case also? Allocating the arraymap from the bpf prog side needs some thought
and needs a whitelist.

The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
&task_pidfd, &value). The kernel can return a fd in value->array_mmap. Maybe we
can create a libbpf helper that frees the fd resources held in the looked-up
value by using the value's btf.

The bpf_local_storage_map side probably does not need to support mmap() then.
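
A rough sketch of what such a fd-freeing helper could look like from the user
space side. Everything here is hypothetical: struct task_value, its array_mmap
member and close_value_fds() are made up for illustration, and the explicit
offset table only stands in for what a real helper would derive from the
value's btf.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical map value as user space would see it after a lookup. */
struct task_value {
	int some_int;
	int array_mmap;	/* fd of the arraymap, filled in by the kernel */
};

/*
 * Close every fd stored in a looked-up value.
 * fd_offsets[] lists the byte offsets of the fd members; a real helper
 * would walk the value's btf to find them instead (assumption).
 * Returns the number of fds that were closed.
 */
static int close_value_fds(void *value, const size_t *fd_offsets, size_t nr)
{
	int closed = 0;

	for (size_t i = 0; i < nr; i++) {
		int fd;

		memcpy(&fd, (char *)value + fd_offsets[i], sizeof(fd));
		if (fd < 0)
			continue;
		if (close(fd) == 0)
			closed++;
		/* Mark the slot so a second call does not close a reused fd. */
		fd = -1;
		memcpy((char *)value + fd_offsets[i], &fd, sizeof(fd));
	}

	return closed;
}
```

After a lookup, the caller would pass the value plus
offsetof(struct task_value, array_mmap) as the only entry of fd_offsets[].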


> supporting mmap()ing the mapval itself. I'm curious if any map could
> feasibly support both, and if so, might have to do logic like:
> 
>    if (map_val has mmapable kptr)
>       mmap the pointee of mmapable kptr
>    else
>       mmap the map_val itself
> 
> Which is maybe too confusing of a detail to expose to BPF program
> writers. Maybe a little too presumptuous and brainstorm-ey given the
> limited number of mmap()able maps currently, but making this a kptr type
> means maps should either ignore/fail if they don't support it, or have
> consistent semantics amongst maps that do support it.
> 
> 
> Instead of  struct bpf_mmap_page __kptr *some_stats;  I'd prefer
> something like
> 
> struct my_type { long count; long another_count; };
> 
> struct mmap_only_value {
>    struct my_type __mmapable_kptr *some_stats;
> };
> 
> This way the type of mmap()able field is known to BPF programs that
> interact with it. This is all assuming that struct bpf_mmap_page is an
> opaque page-sized blob of bytes.
> 
> 
> We could then support structs like
> 
> struct mmap_value_and_lock {
>    struct bpf_spin_lock l;
>    int some_int;
>    struct my_type __mmapable_kptr *some_stats;
> };
> 
> and have bpf_map_update_elem handler use the spin_lock instead of
> map-internal lock where appropriate. But no way to ensure userspace task
> using the mmap()ed region uses the spin_lock.
> 
>>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>>> index 146824cc9689..9b3becbcc1a3 100644
>>> --- a/kernel/bpf/bpf_local_storage.c
>>> +++ b/kernel/bpf/bpf_local_storage.c
>>> @@ -15,7 +15,8 @@
>>>    #include <linux/rcupdate_trace.h>
>>>    #include <linux/rcupdate_wait.h>
>>>    -#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
>>> +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
>>> +    (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
>>>      static struct bpf_local_storage_map_bucket *
>>>    select_bucket(struct bpf_local_storage_map *smap,
>>> @@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
>>>        return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
>>>    }
>>>    +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
>>> +
>>> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
>>> +{
>>> +    struct mem_cgroup *memcg, *old_memcg;
>>> +    void *ptr;
>>> +
>>> +    memcg = bpf_map_get_memcg(&smap->map);
>>> +    old_memcg = set_active_memcg(memcg);
>>> +    ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
>>> +                      NUMA_NO_NODE);
>>> +    set_active_memcg(old_memcg);
>>> +    mem_cgroup_put(memcg);
>>> +
>>> +    return ptr;
>>> +}
>>
>> [ ... ]
>>
>>> @@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>>>            void *value, bool charge_mem, gfp_t gfp_flags)
>>>    {
>>>        struct bpf_local_storage_elem *selem;
>>> +    void *mmapable_value = NULL;
>>> +    u32 selem_mem;
>>>    -    if (charge_mem && mem_charge(smap, owner, smap->elem_size))
>>> +    selem_mem = selem_bytes_used(smap);
>>> +    if (charge_mem && mem_charge(smap, owner, selem_mem))
>>>            return NULL;
>>>    +    if (smap->map.map_flags & BPF_F_MMAPABLE) {
>>> +        mmapable_value = alloc_mmapable_selem_value(smap);
>>
>> This probably is not always safe for bpf prog to do. Leaving the gfp_flags was not used aside, the bpf local storage is moving to the bpf's memalloc because of https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
>>
> 
> Minor point: alloc_mmapable_selem_value's bpf_map_area_mmapable_alloc
> call will always call vmalloc under the hood. vmalloc has locks as well,
> so your point stands.
> 
> I think I see how this ties into your 'specific kptr type' proposal
> above. Let me know if this sounds right: if there was a bpf_mem_alloc
> type focused on providing vmalloc'd mmap()able memory, we could use it
> here instead of raw vmalloc and avoid the lock recursion problem linked
> above. Such an allocator could be used in something like bpf_obj_new to
> create the __mmapable_kptr - either from BPF prog or mmap path before
> remap_vmalloc_range.
> 
> re: gfp_flags, looks like verifier is setting this param to either
> GFP_ATOMIC or GFP_KERNEL. Looks like we should not allow GFP_KERNEL
> allocs here?

Going back to this patch, I am not sure what it would take to make
bpf_mem_alloc() mmap()able. Maybe we can limit the blast radius for now, e.g.
limit this alloc to the user space mmap() call. Does that fit your use case? A
whitelist for bpf prog could also be created later if needed.

> 
>>> +        if (!mmapable_value)
>>> +            goto err_out;
>>> +    }
>>> +
>>
