From: David Marchevsky <david.marchevsky@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Dave Marchevsky <davemarchevsky@fb.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Kernel Team <kernel-team@fb.com>,
Nathan Slingerland <slinger@meta.com>
Subject: Re: [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper
Date: Fri, 4 Aug 2023 02:59:13 -0400 [thread overview]
Message-ID: <a07132a2-9828-a84a-af5b-ab660678157d@linux.dev> (raw)
In-Reply-To: <CAADnVQKo5VTkmS+DdYc5a8Hns4meptn7g76dOjxmJCHgpo29hQ@mail.gmail.com>
On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>
>> At Meta we have a profiling daemon which periodically collects
>> information on many hosts. This collection usually involves grabbing
>> stacks (user and kernel) using perf_event BPF progs and later symbolicating
>> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
>> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
>> those cases we must fall back to digging around in /proc/PID/maps to map
>> virtual address to (binary, offset). The /proc/PID/maps digging does not
>> occur synchronously with stack collection, so the process might already
>> be gone, in which case it won't have /proc/PID/maps and we will fail to
>> symbolicate.
>>
>> This 'exited process problem' doesn't occur very often, as most of the
>> prod services we care to profile are long-lived daemons, but there are
>> enough usecases to warrant a workaround: a BPF program which can be
>> optionally loaded at data collection time and essentially walks
>> /proc/PID/maps. Currently this is done by walking the vma list:
>>
>> struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>> mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
>>
>> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
>> longer a vma linked list to walk. Walking the vma maple tree is not as
>> simple as hopping struct vm_area_struct->vm_next. That commit replaces
>> vm_next hopping with calls to find_vma(mm, addr) helper function, which
>> returns the vma containing addr, or if no vma contains addr,
>> the closest vma with higher start addr.
>>
>> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
>> find_vma, with the major difference that no 'closest vma' is returned if
>> there is no VMA containing a particular address. This prevents BPF
>> programs from being able to use bpf_find_vma to iterate all vmas in a
>> task in a reasonable way.
>>
>> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
>> 'closest vma' behavior when used. Because this is find_vma's default
>> behavior it's as straightforward as nerfing a 'vma contains addr' check
>> on find_vma retval.
>>
>> Also, change bpf_find_vma's address parameter to 'addr' instead of
>> 'start'. The former is used in documentation and more accurately
>> describes the param.
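For context, prog-side iteration with the new flag would look roughly like this (a sketch, not code from the patch: MAX_VMAS, check_vma, and the data context struct are hypothetical, and the callback is assumed to stash vm_end into data):

```c
/* Sketch: walk all vmas of a task via repeated bpf_find_vma() calls,
 * relying on the BPF_F_VMA_NEXT behavior this RFC proposes. */
__u64 addr = 0;

for (int i = 0; i < MAX_VMAS; i++) {	/* bounded for the verifier */
	if (bpf_find_vma(task, addr, check_vma, &data, BPF_F_VMA_NEXT))
		break;			/* no vma at or above addr */
	addr = data.vm_end;		/* continue past this vma */
}
```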
>>
>> [
>> RFC: This isn't an ideal solution for iteration of all vmas in a task
>> in the long term for a few reasons:
>>
>> * In nmi context, a second call to bpf_find_vma will fail because
>> irq_work is busy, so we can't iterate all vmas
>> * Repeatedly taking and releasing mmap_read lock when a dedicated
>> iterate_all_vmas(task) kfunc could just take it once and hold for
>> all vmas
>>
>> My specific usecase doesn't do vma iteration in nmi context and I
>> think the 'closest vma' behavior can be useful here despite locking
>> inefficiencies.
>>
>> When Alexei and I discussed this offline, two alternatives to
>> provide similar functionality while addressing above issues seemed
>> reasonable:
>>
>> * open-coded iterator for task vma. Similar to existing
>> task_vma bpf_iter, but no need to create a bpf_link and read
>> bpf_iter fd from userspace.
>> * New kfunc taking a callback, similar to bpf_find_vma, but iterating
>> over all vmas in one go
>>
>> I think this patch is useful on its own since it's a fairly minimal
>> change and fixes my usecase. Sending for early feedback and to
>> solicit further thought about whether this should be dropped in
>> favor of one of the above options.
>
> - In theory this patch can work, but patch 2 didn't attempt to actually
> use it in a loop to iterate all vma-s.
> Which is a bit of a red flag as to whether such iteration is practical
> (either via bpf_loop or bpf_for).
>
> - This behavior of bpf_find_vma() feels too much implementation detail.
> find_vma will probably stay this way, since different parts of the kernel
> rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much.
>
> - Looking at task_vma_seq_get_next().. that's how vma iter should be done and
> I don't think bpf prog can do it on its own.
> Because with bpf_find_vma() the lock is dropped at every step, the problems
> described in that large comment will be hit sooner or later.
>
> All concerns combined I feel we better provide a new kfunc that iterates vma
> and drops the lock before invoking callback.
> It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
> Maybe it's ok.
> Doing it open coded iterators style is likely better.
> bpf_iter_vma_new() kfunc will do
> bpf_mmap_unlock_get_irq_work+mmap_read_trylock
> while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
>
> I'd try to do open-code-iter first. It's a good test for the iter infra.
> bpf_iter_testmod_seq_new is an example of how to add a new iter.
>
> Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
> It's not a trusted arg. We better move away from this legacy pointer.
> bpf_iter_vma_new() should accept only trusted ptr to task_struct.
> fwiw bpf_get_current_task_btf_proto has
> .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
> The bpf prog might look like:
> task = bpf_get_current_task_btf();
> err = bpf_iter_vma_new(&it, task);
> while ((vma = bpf_iter_vma_next(&it))) ...;
> assuming lock is not dropped by _next.
The only concern here that doesn't seem reasonable to me is the
"too much implementation detail". I agree with the rest, though,
so will send a different series with new implementation and point
to this discussion.
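Concretely, I'd expect the prog side of the open-coded iterator to end up looking roughly like your sketch above (none of these kfuncs exist yet, so treat the names and the bpf_iter_vma type as placeholders):

```c
struct task_struct *task = bpf_get_current_task_btf(); /* trusted ptr */
struct bpf_iter_vma it;
struct vm_area_struct *vma;

/* new() takes the mmap_read lock once (trylock via irq_work in nmi ctx) */
if (bpf_iter_vma_new(&it, task))
	return 0;

while ((vma = bpf_iter_vma_next(&it))) {
	/* record vma->vm_start, vma->vm_end, backing file, ... */
}

bpf_iter_vma_destroy(&it); /* drops the mmap_read lock */
```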
Thread overview: 5+ messages
2023-08-01 14:54 [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Dave Marchevsky
2023-08-01 14:54 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising bpf_find_vma's BPF_F_VMA_NEXT flag Dave Marchevsky
2023-08-01 20:41 ` [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Alexei Starovoitov
2023-08-04 6:59 ` David Marchevsky [this message]
2023-08-04 15:52 ` Yonghong Song