* [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper
@ 2023-08-01 14:54 Dave Marchevsky
2023-08-01 14:54 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising bpf_find_vma's BPF_F_VMA_NEXT flag Dave Marchevsky
2023-08-01 20:41 ` [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Alexei Starovoitov
0 siblings, 2 replies; 5+ messages in thread
From: Dave Marchevsky @ 2023-08-01 14:54 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Dave Marchevsky,
Nathan Slingerland
At Meta we have a profiling daemon which periodically collects
information on many hosts. This collection usually involves grabbing
stacks (user and kernel) using perf_event BPF progs and later symbolicating
them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
those cases we must fall back to digging around in /proc/PID/maps to map
virtual address to (binary, offset). The /proc/PID/maps digging does not
occur synchronously with stack collection, so the process might already
be gone, in which case it won't have /proc/PID/maps and we will fail to
symbolicate.
This 'exited process problem' doesn't occur very often as
most of the prod services we care to profile are long-lived daemons,
but there are enough usecases to warrant a workaround: a BPF program which
can be optionally loaded at data collection time and essentially walks
/proc/PID/maps. Currently this is done by walking the vma list:
struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
longer a vma linked list to walk. Walking the vma maple tree is not as
simple as hopping struct vm_area_struct->vm_next. That commit replaces
vm_next hopping with calls to find_vma(mm, addr) helper function, which
returns the vma containing addr, or if no vma contains addr,
the closest vma with higher start addr.
The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
find_vma, with the major difference that no 'closest vma' is returned if
there is no VMA containing a particular address. This prevents BPF
programs from being able to use bpf_find_vma to iterate all vmas in a
task in a reasonable way.
This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
'closest vma' behavior when used. Because this is find_vma's default
behavior it's as straightforward as nerfing a 'vma contains addr' check
on find_vma retval.
Also, change bpf_find_vma's address parameter to 'addr' instead of
'start'. The former is used in documentation and more accurately
describes the param.
[
RFC: This isn't an ideal solution for iteration of all vmas in a task
in the long term for a few reasons:
* In nmi context, second call to bpf_find_vma will fail because
irq_work is busy, so can't iterate all vmas
* Repeatedly taking and releasing mmap_read lock when a dedicated
iterate_all_vmas(task) kfunc could just take it once and hold for
all vmas
My specific usecase doesn't do vma iteration in nmi context and I
think the 'closest vma' behavior can be useful here despite locking
inefficiencies.
When Alexei and I discussed this offline, two alternatives to
provide similar functionality while addressing above issues seemed
reasonable:
* open-coded iterator for task vma. Similar to existing
task_vma bpf_iter, but no need to create a bpf_link and read
bpf_iter fd from userspace.
* New kfunc taking a callback, similar to bpf_find_vma, but iterating
over all vmas in one go
I think this patch is useful on its own since it's a fairly minimal
change and fixes my usecase. Sending for early feedback and to
solicit further thought about whether this should be dropped in
favor of one of the above options.
]
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Nathan Slingerland <slinger@meta.com>
---
include/uapi/linux/bpf.h | 14 ++++++++++++--
kernel/bpf/task_iter.c | 12 ++++++++----
tools/include/uapi/linux/bpf.h | 14 ++++++++++++--
3 files changed, 32 insertions(+), 8 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@ union bpf_attr {
* function with *task*, *vma*, and *callback_ctx*.
* The *callback_fn* should be a static function and
* the *callback_ctx* should be a pointer to the stack.
- * The *flags* is used to control certain aspects of the helper.
- * Currently, the *flags* must be 0.
+ * The *flags* is used to control certain aspects of the helper and
+ * may be one of the following:
+ *
+ * **BPF_F_VMA_NEXT**
+ * If no vma contains *addr*, call *callback_fn* with the next vma,
+ * i.e. the vma with lowest vm_start that is higher than *addr*.
+ * This replicates the behavior of the kernel's find_vma helper.
*
* The expected callback signature is
*
@@ -6026,6 +6031,11 @@ enum {
BPF_F_EXCLUDE_INGRESS = (1ULL << 4),
};
+/* Flags for bpf_find_vma helper */
+enum {
+ BPF_F_VMA_NEXT = (1ULL << 0),
+};
+
#define __bpf_md_ptr(type, name) \
union { \
type name; \
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..a8c87dcf36ad 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -777,7 +777,7 @@ static struct bpf_iter_reg task_vma_reg_info = {
.show_fdinfo = bpf_iter_task_show_fdinfo,
};
-BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
+BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, addr,
bpf_callback_t, callback_fn, void *, callback_ctx, u64, flags)
{
struct mmap_unlock_irq_work *work = NULL;
@@ -785,10 +785,13 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
bool irq_work_busy = false;
struct mm_struct *mm;
int ret = -ENOENT;
+ bool vma_next;
- if (flags)
+ if (flags & ~BPF_F_VMA_NEXT)
return -EINVAL;
+ vma_next = flags & BPF_F_VMA_NEXT;
+
if (!task)
return -ENOENT;
@@ -801,9 +804,10 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
if (irq_work_busy || !mmap_read_trylock(mm))
return -EBUSY;
- vma = find_vma(mm, start);
+ vma = find_vma(mm, addr);
- if (vma && vma->vm_start <= start && vma->vm_end > start) {
+ if (vma &&
+ ((vma->vm_start <= addr && vma->vm_end > addr) || vma_next)) {
callback_fn((u64)(long)task, (u64)(long)vma,
(u64)(long)callback_ctx, 0, 0);
ret = 0;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@ union bpf_attr {
* function with *task*, *vma*, and *callback_ctx*.
* The *callback_fn* should be a static function and
* the *callback_ctx* should be a pointer to the stack.
- * The *flags* is used to control certain aspects of the helper.
- * Currently, the *flags* must be 0.
+ * The *flags* is used to control certain aspects of the helper and
+ * may be one of the following:
+ *
+ * **BPF_F_VMA_NEXT**
+ * If no vma contains *addr*, call *callback_fn* with the next vma,
+ * i.e. the vma with lowest vm_start that is higher than *addr*.
+ * This replicates the behavior of the kernel's find_vma helper.
*
* The expected callback signature is
*
@@ -6026,6 +6031,11 @@ enum {
BPF_F_EXCLUDE_INGRESS = (1ULL << 4),
};
+/* Flags for bpf_find_vma helper */
+enum {
+ BPF_F_VMA_NEXT = (1ULL << 0),
+};
+
#define __bpf_md_ptr(type, name) \
union { \
type name; \
--
2.34.1
* [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising bpf_find_vma's BPF_F_VMA_NEXT flag
2023-08-01 14:54 [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Dave Marchevsky
@ 2023-08-01 14:54 ` Dave Marchevsky
2023-08-01 20:41 ` [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Alexei Starovoitov
1 sibling, 0 replies; 5+ messages in thread
From: Dave Marchevsky @ 2023-08-01 14:54 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Dave Marchevsky
Nothing is mapped at address 0, so current find_vma tests use addr 0
to test "failure to find a vma containing addr". With the BPF_F_VMA_NEXT
flag, a bpf_find_vma call with addr 0 will return some vma, so only
small adjustments to existing tests are necessary to validate the new behavior.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
.../testing/selftests/bpf/prog_tests/find_vma.c | 17 +++++++++++++----
tools/testing/selftests/bpf/progs/find_vma.c | 5 +++--
2 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/find_vma.c b/tools/testing/selftests/bpf/prog_tests/find_vma.c
index 5165b38f0e59..dccf1ccd7468 100644
--- a/tools/testing/selftests/bpf/prog_tests/find_vma.c
+++ b/tools/testing/selftests/bpf/prog_tests/find_vma.c
@@ -19,6 +19,7 @@ static void test_and_reset_skel(struct find_vma *skel, int expected_find_zero_re
skel->bss->found_vm_exec = 0;
skel->data->find_addr_ret = -1;
skel->data->find_zero_ret = -1;
+ skel->bss->find_zero_flags = 0;
skel->bss->d_iname[0] = 0;
}
@@ -77,16 +78,23 @@ static void test_find_vma_pe(struct find_vma *skel)
close(pfd);
}
-static void test_find_vma_kprobe(struct find_vma *skel)
+static void test_find_vma_kprobe(struct find_vma *skel, bool vma_next)
{
- int err;
+ int err, expected_find_zero_ret;
err = find_vma__attach(skel);
if (!ASSERT_OK(err, "get_branch_snapshot__attach"))
return;
+ if (vma_next) {
+ skel->bss->find_zero_flags = BPF_F_VMA_NEXT;
+ expected_find_zero_ret = 0;
+ } else {
+ expected_find_zero_ret = -ENOENT; /* no vma contains ptr 0 */
+ }
+
getpgid(skel->bss->target_pid);
- test_and_reset_skel(skel, -ENOENT /* could not find vma for ptr 0 */, true);
+ test_and_reset_skel(skel, expected_find_zero_ret, true);
}
static void test_illegal_write_vma(void)
@@ -119,7 +127,8 @@ void serial_test_find_vma(void)
skel->bss->addr = (__u64)(uintptr_t)test_find_vma_pe;
test_find_vma_pe(skel);
- test_find_vma_kprobe(skel);
+ test_find_vma_kprobe(skel, false);
+ test_find_vma_kprobe(skel, true);
find_vma__destroy(skel);
test_illegal_write_vma();
diff --git a/tools/testing/selftests/bpf/progs/find_vma.c b/tools/testing/selftests/bpf/progs/find_vma.c
index 38034fb82530..73ade81722fa 100644
--- a/tools/testing/selftests/bpf/progs/find_vma.c
+++ b/tools/testing/selftests/bpf/progs/find_vma.c
@@ -17,6 +17,7 @@ pid_t target_pid = 0;
char d_iname[DNAME_INLINE_LEN] = {0};
__u32 found_vm_exec = 0;
__u64 addr = 0;
+__u64 find_zero_flags = 0;
int find_zero_ret = -1;
int find_addr_ret = -1;
@@ -46,7 +47,7 @@ int handle_getpid(void)
find_addr_ret = bpf_find_vma(task, addr, check_vma, &data, 0);
/* this should return -ENOENT */
- find_zero_ret = bpf_find_vma(task, 0, check_vma, &data, 0);
+ find_zero_ret = bpf_find_vma(task, 0, check_vma, &data, find_zero_flags);
return 0;
}
@@ -64,6 +65,6 @@ int handle_pe(void)
/* In NMI, this should return -EBUSY, as the previous call is using
* the irq_work.
*/
- find_zero_ret = bpf_find_vma(task, 0, check_vma, &data, 0);
+ find_zero_ret = bpf_find_vma(task, 0, check_vma, &data, find_zero_flags);
return 0;
}
--
2.34.1
* Re: [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper
2023-08-01 14:54 [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Dave Marchevsky
2023-08-01 14:54 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising bpf_find_vma's BPF_F_VMA_NEXT flag Dave Marchevsky
@ 2023-08-01 20:41 ` Alexei Starovoitov
2023-08-04 6:59 ` David Marchevsky
1 sibling, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2023-08-01 20:41 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Nathan Slingerland
On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> At Meta we have a profiling daemon which periodically collects
> information on many hosts. This collection usually involves grabbing
> stacks (user and kernel) using perf_event BPF progs and later symbolicating
> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
> those cases we must fall back to digging around in /proc/PID/maps to map
> virtual address to (binary, offset). The /proc/PID/maps digging does not
> occur synchronously with stack collection, so the process might already
> be gone, in which case it won't have /proc/PID/maps and we will fail to
> symbolicate.
>
> This 'exited process problem' doesn't occur very often as
> most of the prod services we care to profile are long-lived daemons,
> but there are enough usecases to warrant a workaround: a BPF program which
> can be optionally loaded at data collection time and essentially walks
> /proc/PID/maps. Currently this is done by walking the vma list:
>
> struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
> mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
>
> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
> longer a vma linked list to walk. Walking the vma maple tree is not as
> simple as hopping struct vm_area_struct->vm_next. That commit replaces
> vm_next hopping with calls to find_vma(mm, addr) helper function, which
> returns the vma containing addr, or if no vma contains addr,
> the closest vma with higher start addr.
>
> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
> find_vma, with the major difference that no 'closest vma' is returned if
> there is no VMA containing a particular address. This prevents BPF
> programs from being able to use bpf_find_vma to iterate all vmas in a
> task in a reasonable way.
>
> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
> 'closest vma' behavior when used. Because this is find_vma's default
> behavior it's as straightforward as nerfing a 'vma contains addr' check
> on find_vma retval.
>
> Also, change bpf_find_vma's address parameter to 'addr' instead of
> 'start'. The former is used in documentation and more accurately
> describes the param.
>
> [
> RFC: This isn't an ideal solution for iteration of all vmas in a task
> in the long term for a few reasons:
>
> * In nmi context, second call to bpf_find_vma will fail because
> irq_work is busy, so can't iterate all vmas
> * Repeatedly taking and releasing mmap_read lock when a dedicated
> iterate_all_vmas(task) kfunc could just take it once and hold for
> all vmas
>
> My specific usecase doesn't do vma iteration in nmi context and I
> think the 'closest vma' behavior can be useful here despite locking
> inefficiencies.
>
> When Alexei and I discussed this offline, two alternatives to
> provide similar functionality while addressing above issues seemed
> reasonable:
>
> * open-coded iterator for task vma. Similar to existing
> task_vma bpf_iter, but no need to create a bpf_link and read
> bpf_iter fd from userspace.
> * New kfunc taking a callback, similar to bpf_find_vma, but iterating
> over all vmas in one go
>
> I think this patch is useful on its own since it's a fairly minimal
> change and fixes my usecase. Sending for early feedback and to
> solicit further thought about whether this should be dropped in
> favor of one of the above options.
- In theory this patch can work, but patch 2 didn't attempt to actually
use it in a loop to iterate all vma-s,
which is a bit of a red flag as to whether such iteration is practical
(either via bpf_loop or bpf_for).
- This behavior of bpf_find_vma() feels like too much of an implementation detail.
find_vma will probably stay this way, since different parts of the kernel
rely on it, but exposing it via a flag like BPF_F_VMA_NEXT leaks too much of that implementation.
- Looking at task_vma_seq_get_next()... that's how vma iteration should be done, and
I don't think a bpf prog can do it on its own.
Because with bpf_find_vma() the lock is dropped at every step, the problems
described in that large comment will be hit sooner or later.
All concerns combined I feel we better provide a new kfunc that iterates vma
and drops the lock before invoking callback.
It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
Maybe it's ok.
Doing it open coded iterators style is likely better.
bpf_iter_vma_new() kfunc will do
bpf_mmap_unlock_get_irq_work+mmap_read_trylock
while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
I'd try to do open-code-iter first. It's a good test for the iter infra.
bpf_iter_testmod_seq_new is an example of how to add a new iter.
Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
It's not a trusted arg. We better move away from this legacy pointer.
bpf_iter_vma_new() should accept only trusted ptr to task_struct.
fwiw bpf_get_current_task_btf_proto has
.ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
The bpf prog might look like:
task = bpf_get_current_task_btf();
err = bpf_iter_vma_new(&it, task);
while ((vma = bpf_iter_vma_next(&it))) ...;
assuming lock is not dropped by _next.
* Re: [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper
2023-08-01 20:41 ` [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper Alexei Starovoitov
@ 2023-08-04 6:59 ` David Marchevsky
2023-08-04 15:52 ` Yonghong Song
0 siblings, 1 reply; 5+ messages in thread
From: David Marchevsky @ 2023-08-04 6:59 UTC (permalink / raw)
To: Alexei Starovoitov, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Nathan Slingerland
On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>
>> At Meta we have a profiling daemon which periodically collects
>> information on many hosts. This collection usually involves grabbing
>> stacks (user and kernel) using perf_event BPF progs and later symbolicating
>> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
>> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
>> those cases we must fall back to digging around in /proc/PID/maps to map
>> virtual address to (binary, offset). The /proc/PID/maps digging does not
>> occur synchronously with stack collection, so the process might already
>> be gone, in which case it won't have /proc/PID/maps and we will fail to
>> symbolicate.
>>
>> This 'exited process problem' doesn't occur very often as
>> most of the prod services we care to profile are long-lived daemons,
>> but there are enough usecases to warrant a workaround: a BPF program which
>> can be optionally loaded at data collection time and essentially walks
>> /proc/PID/maps. Currently this is done by walking the vma list:
>>
>> struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>> mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
>>
>> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
>> longer a vma linked list to walk. Walking the vma maple tree is not as
>> simple as hopping struct vm_area_struct->vm_next. That commit replaces
>> vm_next hopping with calls to find_vma(mm, addr) helper function, which
>> returns the vma containing addr, or if no vma contains addr,
>> the closest vma with higher start addr.
>>
>> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
>> find_vma, with the major difference that no 'closest vma' is returned if
>> there is no VMA containing a particular address. This prevents BPF
>> programs from being able to use bpf_find_vma to iterate all vmas in a
>> task in a reasonable way.
>>
>> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
>> 'closest vma' behavior when used. Because this is find_vma's default
>> behavior it's as straightforward as nerfing a 'vma contains addr' check
>> on find_vma retval.
>>
>> Also, change bpf_find_vma's address parameter to 'addr' instead of
>> 'start'. The former is used in documentation and more accurately
>> describes the param.
>>
>> [
>> RFC: This isn't an ideal solution for iteration of all vmas in a task
>> in the long term for a few reasons:
>>
>> * In nmi context, second call to bpf_find_vma will fail because
>> irq_work is busy, so can't iterate all vmas
>> * Repeatedly taking and releasing mmap_read lock when a dedicated
>> iterate_all_vmas(task) kfunc could just take it once and hold for
>> all vmas
>>
>> My specific usecase doesn't do vma iteration in nmi context and I
>> think the 'closest vma' behavior can be useful here despite locking
>> inefficiencies.
>>
>> When Alexei and I discussed this offline, two alternatives to
>> provide similar functionality while addressing above issues seemed
>> reasonable:
>>
>> * open-coded iterator for task vma. Similar to existing
>> task_vma bpf_iter, but no need to create a bpf_link and read
>> bpf_iter fd from userspace.
>> * New kfunc taking a callback, similar to bpf_find_vma, but iterating
>> over all vmas in one go
>>
>> I think this patch is useful on its own since it's a fairly minimal
>> change and fixes my usecase. Sending for early feedback and to
>> solicit further thought about whether this should be dropped in
>> favor of one of the above options.
>
> - In theory this patch can work, but patch 2 didn't attempt to actually
> use it in a loop to iterate all vma-s,
> which is a bit of a red flag as to whether such iteration is practical
> (either via bpf_loop or bpf_for).
>
> - This behavior of bpf_find_vma() feels like too much of an implementation detail.
> find_vma will probably stay this way, since different parts of the kernel
> rely on it, but exposing it via a flag like BPF_F_VMA_NEXT leaks too much of that implementation.
>
> - Looking at task_vma_seq_get_next()... that's how vma iteration should be done, and
> I don't think a bpf prog can do it on its own.
> Because with bpf_find_vma() the lock is dropped at every step, the problems
> described in that large comment will be hit sooner or later.
>
> All concerns combined I feel we better provide a new kfunc that iterates vma
> and drops the lock before invoking callback.
> It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
> Maybe it's ok.
> Doing it open coded iterators style is likely better.
> bpf_iter_vma_new() kfunc will do
> bpf_mmap_unlock_get_irq_work+mmap_read_trylock
> while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
>
> I'd try to do open-code-iter first. It's a good test for the iter infra.
> bpf_iter_testmod_seq_new is an example of how to add a new iter.
>
> Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
> It's not a trusted arg. We better move away from this legacy pointer.
> bpf_iter_vma_new() should accept only trusted ptr to task_struct.
> fwiw bpf_get_current_task_btf_proto has
> .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
> The bpf prog might look like:
> task = bpf_get_current_task_btf();
> err = bpf_iter_vma_new(&it, task);
> while ((vma = bpf_iter_vma_next(&it))) ...;
> assuming lock is not dropped by _next.
The only concern here that doesn't seem reasonable to me is the
"too much implementation detail". I agree with the rest, though,
so will send a different series with a new implementation and point
to this discussion.
* Re: [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper
2023-08-04 6:59 ` David Marchevsky
@ 2023-08-04 15:52 ` Yonghong Song
0 siblings, 0 replies; 5+ messages in thread
From: Yonghong Song @ 2023-08-04 15:52 UTC (permalink / raw)
To: David Marchevsky, Alexei Starovoitov, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Nathan Slingerland
On 8/3/23 11:59 PM, David Marchevsky wrote:
> On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>>
>>> At Meta we have a profiling daemon which periodically collects
>>> information on many hosts. This collection usually involves grabbing
>>> stacks (user and kernel) using perf_event BPF progs and later symbolicating
>>> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
>>> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
>>> those cases we must fall back to digging around in /proc/PID/maps to map
>>> virtual address to (binary, offset). The /proc/PID/maps digging does not
>>> occur synchronously with stack collection, so the process might already
>>> be gone, in which case it won't have /proc/PID/maps and we will fail to
>>> symbolicate.
>>>
>>> This 'exited process problem' doesn't occur very often as
>>> most of the prod services we care to profile are long-lived daemons,
>>> but there are enough usecases to warrant a workaround: a BPF program which
>>> can be optionally loaded at data collection time and essentially walks
>>> /proc/PID/maps. Currently this is done by walking the vma list:
>>>
>>> struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>>> mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
>>>
>>> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
>>> longer a vma linked list to walk. Walking the vma maple tree is not as
>>> simple as hopping struct vm_area_struct->vm_next. That commit replaces
>>> vm_next hopping with calls to find_vma(mm, addr) helper function, which
>>> returns the vma containing addr, or if no vma contains addr,
>>> the closest vma with higher start addr.
>>>
>>> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
>>> find_vma, with the major difference that no 'closest vma' is returned if
>>> there is no VMA containing a particular address. This prevents BPF
>>> programs from being able to use bpf_find_vma to iterate all vmas in a
>>> task in a reasonable way.
>>>
>>> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
>>> 'closest vma' behavior when used. Because this is find_vma's default
>>> behavior it's as straightforward as nerfing a 'vma contains addr' check
>>> on find_vma retval.
>>>
>>> Also, change bpf_find_vma's address parameter to 'addr' instead of
>>> 'start'. The former is used in documentation and more accurately
>>> describes the param.
>>>
>>> [
>>> RFC: This isn't an ideal solution for iteration of all vmas in a task
>>> in the long term for a few reasons:
>>>
>>> * In nmi context, second call to bpf_find_vma will fail because
>>> irq_work is busy, so can't iterate all vmas
>>> * Repeatedly taking and releasing mmap_read lock when a dedicated
>>> iterate_all_vmas(task) kfunc could just take it once and hold for
>>> all vmas
>>>
>>> My specific usecase doesn't do vma iteration in nmi context and I
>>> think the 'closest vma' behavior can be useful here despite locking
>>> inefficiencies.
>>>
>>> When Alexei and I discussed this offline, two alternatives to
>>> provide similar functionality while addressing above issues seemed
>>> reasonable:
>>>
>>> * open-coded iterator for task vma. Similar to existing
>>> task_vma bpf_iter, but no need to create a bpf_link and read
>>> bpf_iter fd from userspace.
>>> * New kfunc taking a callback, similar to bpf_find_vma, but iterating
>>> over all vmas in one go
>>>
>>> I think this patch is useful on its own since it's a fairly minimal
>>> change and fixes my usecase. Sending for early feedback and to
>>> solicit further thought about whether this should be dropped in
>>> favor of one of the above options.
>>
>> - In theory this patch can work, but patch 2 didn't attempt to actually
>> use it in a loop to iterate all vma-s,
>> which is a bit of a red flag as to whether such iteration is practical
>> (either via bpf_loop or bpf_for).
>>
>> - This behavior of bpf_find_vma() feels like too much of an implementation detail.
>> find_vma will probably stay this way, since different parts of the kernel
>> rely on it, but exposing it via a flag like BPF_F_VMA_NEXT leaks too much of that implementation.
>>
>> - Looking at task_vma_seq_get_next()... that's how vma iteration should be done, and
>> I don't think a bpf prog can do it on its own.
>> Because with bpf_find_vma() the lock is dropped at every step, the problems
>> described in that large comment will be hit sooner or later.
>>
>> All concerns combined I feel we better provide a new kfunc that iterates vma
>> and drops the lock before invoking callback.
>> It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
>> Maybe it's ok.
>> Doing it open coded iterators style is likely better.
>> bpf_iter_vma_new() kfunc will do
>> bpf_mmap_unlock_get_irq_work+mmap_read_trylock
>> while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
>>
>> I'd try to do open-code-iter first. It's a good test for the iter infra.
>> bpf_iter_testmod_seq_new is an example of how to add a new iter.
>>
>> Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
>> It's not a trusted arg. We better move away from this legacy pointer.
>> bpf_iter_vma_new() should accept only trusted ptr to task_struct.
>> fwiw bpf_get_current_task_btf_proto has
>> .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
>> The bpf prog might look like:
>> task = bpf_get_current_task_btf();
>> err = bpf_iter_vma_new(&it, task);
>> while ((vma = bpf_iter_vma_next(&it))) ...;
>> assuming lock is not dropped by _next.
>
> The only concern here that doesn't seem reasonable to me is the
> "too much implementation detail". I agree with the rest, though,
> so will send a different series with new implementation and point
> to this discussion.
For reference, this is another use case for traversing
vmas in a bpf program, reported on the bcc mailing list:
https://github.com/iovisor/bcc/pull/4679
The use case is that the application may not have frame pointers,
so the bpf program will just scan the stack and find potential
user text region pointers and report them. This is similar
to how current arch (e.g., x86) code reports a crash stack.