From: David Marchevsky <david.marchevsky@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>
Cc: David Marchevsky <david.marchevsky@linux.dev>,
Dave Marchevsky <davemarchevsky@fb.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Kernel Team <kernel-team@fb.com>,
Johannes Weiner <hannes@cmpxchg.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
Date: Mon, 11 Dec 2023 12:31:46 -0500 [thread overview]
Message-ID: <45878586-cc5f-435f-83fb-9a3c39824550@linux.dev> (raw)
In-Reply-To: <CAADnVQK6c8chC1E6_O8bncncBuiscdFrKk6EgPbBC_WyVoj=9w@mail.gmail.com>
On 11/21/23 2:49 PM, Alexei Starovoitov wrote:
> On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/20/23 10:11 PM, David Marchevsky wrote:
>>>
>>>
>>> On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
>>>> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>>>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>>>>> index 173ec7f43ed1..114973f925ea 100644
>>>>> --- a/include/linux/bpf_local_storage.h
>>>>> +++ b/include/linux/bpf_local_storage.h
>>>>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>>>>> * the number of cachelines accessed during the cache hit case.
>>>>> */
>>>>> struct bpf_local_storage_map __rcu *smap;
>>>>> - u8 data[] __aligned(8);
>>>>> + /* Need to duplicate smap's map_flags as smap may be gone when
>>>>> + * it's time to free bpf_local_storage_data
>>>>> + */
>>>>> + u64 smap_map_flags;
>>>>> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>>>>> + * Otherwise the actual mapval data lives here
>>>>> + */
>>>>> + union {
>>>>> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>>>>> + void *actual_data __aligned(8);
>>>>
>>>> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>>>>
>>>> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
>>>>
>>>> struct normal_and_mmap_value {
>>>> int some_int;
>>>> int __percpu_kptr *some_cnts;
>>>>
>>>> struct bpf_mmap_page __kptr *some_stats;
>>>> };
>>>>
>>>> struct mmap_only_value {
>>>> struct bpf_mmap_page __kptr *some_stats;
>>>> };
>>>>
>>>> [ ... ]
>>>>
>>>
>>> This is an intriguing idea. For conciseness I'll call this specific
>>> kind of kptr 'mmapable kptrs' for the rest of this message. Below is
>>> more of a brainstorming dump than a cohesive response, separate trains
>>> of thought are separated by two newlines.
>>
>> Thanks for bearing with me while some ideas could be crazy. I am trying to see
>> how this would look like for other local storage, sk and inode. Allocating a
>> page for each sk will not be nice for server with half a million sk(s). e.g.
>> half a million sk(s) sharing a few bandwidth policies or a few tuning
>> parameters. Creating something mmap'able to the user space and also sharable
>> among many sk(s) will be useful.
>>
>>>
>>>
>>> My initial thought upon seeing struct normal_and_mmap_value was to note
>>> that we currently don't support mmaping for map_value types with _any_
>>> special fields ('special' as determined by btf_parse_fields). But IIUC
>>> you're actually talking about exposing the some_stats pointee memory via
>>> mmap, not the containing struct with kptr fields. That is, for maps that
>>> support these kptrs, mmap()ing a map with value type struct
>>> normal_and_mmap_value would return the some_stats pointer value, and
>>> likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
>>> logic in this patch. We'd only be able to support one such mmapable kptr
>>> field per mapval type, but that isn't a dealbreaker.
>>>
>>> Some maps, like task_storage, would only support mmap() on a map_value
>>> with mmapable kptr field, as mmap()ing the mapval itself doesn't make
>>> sense or is unsafe. Seems like arraymap would do the opposite, only
>>
>> Changing direction a bit since arraymap is brought up. :)
>>
>> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
>> arraymap as kptr, the bpf prog should be able to access it as a map. More like
>> the current map-in-map setup. The arraymap can be used as regular map in the
>> user space also (like pinning). It may need some btf plumbing to tell the value
>> type of the arrayamp to the verifier.
>>
>> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
>> can be used where the value->array_mmap initialized as an arraymap_fd. This will
>> limit the arraymap kptr update only from the syscall side which seems to be your
>> usecase also? Allocating the arraymap from the bpf prog side needs some thoughts
>> and need a whitelist.
>>
>> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
>> &task_pidfd, &value). The kernel can return a fd in value->array_mmap. May be we
>> can create a libbpf helper to free the fd(s) resources held in the looked-up
>> value by using the value's btf.
>>
>> The bpf_local_storage_map side probably does not need to support mmap() then.
>
> Martin,
> that's an interesting idea!
> I kinda like it and I think it's worth exploring further.
>
> I think the main quirk of the proposed mmap-of-task-local-storage
> is using 'current' task as an implicit 'key' in task local storage map.
> It fits here, but I'm not sure it addresses sched-ext use case.
>
> Tejun, David,
> could you please chime in ?
> Do you think mmap(..., task_local_storage_map_fd, ...)
> that returns a page that belongs to current task only is enough ?
>
> If not we need to think through how to mmap local storage of other
> tasks. One proposal was to use pgoff to carry the key somehow
> like io-uring does, but if we want to generalize that the pgoff approach
> falls apart if we want __mmapable_kptr to work like Martin is proposing above,
> since the key will not fit in 64-bit of pgoff.
>
> Maybe we need an office hours slot to discuss. This looks to be a big
> topic. Not sure we can converge over email.
> Just getting everyone on the same page will take a lot of email reading.
Meta BPF folks were all in one place for reasons unrelated to this
thread, and decided to have a design discussion regarding this mmapable
task_local storage implementation and other implementation ideas
discussed in this thread. I will summarize the discussion below. We
didn't arrive at a confident conclusion, though, so plenty of time for
others to chime in and for proper office hours discussion to happen if
necessary. Below, anything attributed to a specific person is
paraphrased from my notes, there will certainly be errors /
misrememberings.
mmapable task_local storage design discussion 11/29
We first summarized approaches that were discussed in this thread:
1) Current implementation in this series
2) pseudo-map_in_map, kptr arraymap type:
struct mmapable_data_map {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(map_flags, BPF_F_MMAPABLE);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, __u64);
};
struct my_mapval {
int whatever;
struct bpf_arraymap __kptr __arraymap_type(mmapable_data_map) *m;
/* Need special logic to support getting / updating above field from userspace (as fd) */
};
3) "mmapable page" kptr type, or "mmapable kptr" tag
struct my_mapval {
int whatever;
struct bpf_mmapable_page *m;
};
struct my_mapval2 { /* Separate approach than my_mapval above */
struct bpf_spin_lock l;
int some_int;
struct my_type __mmapable_kptr *some_stats;
};
Points of consideration regardless of implementation approach:
* mmap() syscall's return address must be page-aligned. If we want to
reduce / eliminate wasted memory by supporting packing of multiple
mapvals onto single page, need to be able to return non-page-aligned
addr. Using a BPF syscall subcommand in lieu of or in addition to
mmap() syscall would be a way to support this.
* Dave wants to avoid implementing packing at all, but having a
reasonable path forward would be nice in case actual usecase
arises.
* Andrii suggested a new actual mmap syscall supporting passing of
custom params, useful for any subsystem using mmap in
nontraditional ways. This was initially a response to "use offset
to pass task id" discussion re: selecting non-current task.
* How orthogonal is Martin's "kptr to arraymap" suggestion from the
general mmapable local_storage goal? Is there a world where we
choose a different approach for this work, and then to "kptr to
arraymap" independently later?
The above was mostly summary of existing discussion in this thread. Rest
of discussion flowed from there.
Q&A:
- Do we need to be able to query other tasks' mapvals? (for David Vernet
/ Tejun Heo)
TJ had a usecase where this might've been necessary, but rewrote.
David: Being able to do this in general, aside from TJ's specific case,
would be useful. David provided an example from ghOSt project - central
scheduling. Another example: Folly runtime framework, farms out work to
worker threads, might want to tag them.
- Which usecases actually care about avoiding syscall overhead?
Alexei: Most services that would want to use this functionality to tag
their tasks don't need it, as they just set the value once.
Dave: Some tracing usecases (e.g. strobemeta) need it.
- Do we want to use mmap() in current-task-only limited way, or do BPF
subcommand or something more exotic?
TJ: What if bpf subcommand returns FD that can be mmap'd. fd identifies
which task_local storage elem is mmap'd. Subcommand:
bpf_map_lookup_elem_fd(map *, u64 elem_id).
Alexei: Such a thing should return fd to arbitrary mapval, and should
support other common ops (open, close, etc.).
David: What's the problem w/ having fd that only supports mmap for now?
TJ: 'Dangling' fds exist in some usecases already
Discussion around the bpf_map_lookup_elem_fd idea continued for quite a
while. Folks liked that it avoids the "can only have one mmapable field"
issue from proposal (3) above, without making any implicit assumptions.
Alexei: Can we instead have pointer to userspace blob - similar to rseq
- to avoid wasting most of page?
TJ: My instinct is to stick to something more generic, would rather pay
4k.
Discussion around userspace pointer continued for a while as well,
ending in the conclusion that we should take a look at using
get_user_pages, perhaps wrapping such functionality in a 'guppable' kptr
type. Folks liked the 'guppable' idea as it sort-of avoids the wasted
memory issue, pushing the details to userspace, and punts on working out
a path forward for mmap interface, which other implementation ideas
require.
Action items based on convo, priority order:
- Think more about / prototype 'guppable' kptr idea
- If the above has issues, try bpf_map_lookup_elem_fd
- If both above have issues, consider earlier approaches
I will start tackling these soon.
next prev parent reply other threads:[~2023-12-11 17:40 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-20 17:59 [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
2023-11-20 21:41 ` Johannes Weiner
2023-11-21 0:42 ` Martin KaFai Lau
2023-11-21 6:11 ` David Marchevsky
2023-11-21 19:27 ` Martin KaFai Lau
2023-11-21 19:49 ` Alexei Starovoitov
2023-12-11 17:31 ` David Marchevsky [this message]
2023-11-21 2:32 ` kernel test robot
2023-11-21 5:06 ` kernel test robot
2023-11-21 5:20 ` kernel test robot
2023-11-21 5:44 ` Alexei Starovoitov
2023-11-21 6:41 ` Yonghong Song
2023-11-21 15:34 ` Yonghong Song
2023-11-21 19:30 ` Andrii Nakryiko
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
2023-11-21 19:34 ` Andrii Nakryiko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=45878586-cc5f-435f-83fb-9a3c39824550@linux.dev \
--to=david.marchevsky@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davemarchevsky@fb.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=martin.lau@linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox