* [PATCH v3 bpf-next 0/3] Open-coded task_vma iter
@ 2023-08-22 5:05 Dave Marchevsky
2023-08-22 5:05 ` [PATCH v3 bpf-next 1/3] bpf: Don't explicitly emit BTF for struct btf_iter_num Dave Marchevsky
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Dave Marchevsky @ 2023-08-22 5:05 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, yonghong.song, sdf,
Dave Marchevsky
At Meta we have a profiling daemon which periodically collects
information on many hosts. This collection usually involves grabbing
stacks (user and kernel) using perf_event BPF progs and later symbolicating
them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
those cases we must fall back to digging around in /proc/PID/maps to map
virtual address to (binary, offset). The /proc/PID/maps digging does not
occur synchronously with stack collection, so the process might already
be gone, in which case it won't have /proc/PID/maps and we will fail to
symbolicate.
This 'exited process problem' doesn't occur very often as
most of the prod services we care to profile are long-lived daemons, but
there are enough use cases to warrant a workaround: a BPF program which
can be optionally loaded at data collection time and essentially walks
/proc/PID/maps. Currently this is done by walking the vma list:
  struct vm_area_struct *mmap = BPF_CORE_READ(mm, mmap);
  mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
longer a vma linked list to walk. Walking the vma maple tree is not as
simple as hopping struct vm_area_struct->vm_next. Luckily,
commit f39af05949a4 ("mm: add VMA iterator"), another commit in that series,
added struct vma_iterator and the for_each_vma macro for easy vma iteration. If
similar functionality were exposed to BPF programs, it would be perfect for our
use case.
This series adds such functionality, specifically a BPF equivalent of
for_each_vma using the open-coded iterator style.
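For illustration, a minimal BPF program using the new kfuncs through the
bpf_for_each macro might look like the sketch below, modeled on the selftest
added in patch 3. The section name, the 1000-entry cap, and the vm_ranges /
vmas_seen globals are illustrative choices, not part of the kernel API:

```c
/* Sketch of open-coded task_vma iteration, modeled on patch 3's selftest.
 * Requires a kernel with the bpf_iter_task_vma_{new,next,destroy} kfuncs.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct vm_range {
	__u64 vm_start;
	__u64 vm_end;
};

/* Copied-out vma ranges, to be read by userspace after the prog runs */
struct vm_range vm_ranges[1000];
__u32 vmas_seen = 0;

SEC("raw_tp/sys_enter")
int iter_task_vma_for_each(const void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct vm_area_struct *vma;

	/* Expands to bpf_iter_task_vma_new/_next/_destroy calls; the vma
	 * read lock is held for the duration of the loop (see Notes).
	 */
	bpf_for_each(task_vma, vma, task, 0) {
		if (vmas_seen >= 1000)
			break;
		vm_ranges[vmas_seen].vm_start = vma->vm_start;
		vm_ranges[vmas_seen].vm_end = vma->vm_end;
		vmas_seen++;
	}
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Userspace can then walk vm_ranges[0..vmas_seen) to map virtual addresses to
(binary, offset) without touching /proc/PID/maps.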
Notes:
* This approach was chosen after discussion on a previous series [0] which
attempted to solve the same problem by adding a BPF_F_VMA_NEXT flag to
bpf_find_vma.
* Unlike the task_vma bpf_iter, the open-coded iterator kfuncs here do not
drop the vma read lock between iterations. See Alexei's response in [0].
* The [vsyscall] page isn't really part of task->mm's vmas, but
/proc/PID/maps returns information about it anyway. The vma iter added
here does not do the same. See comment on selftest in patch 3.
* bpf_iter_task_vma_new allocates a data struct which contains - among other
things - a struct vma_iterator, using the BPF allocator, and keeps a pointer
to that bpf_iter_task_vma_kern_data. This is done in order to prevent changes
to struct ma_state - which is wrapped by struct vma_iterator - from
necessitating changes to uapi struct bpf_iter_task_vma.
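Concretely, the indirection described in the last note looks like this (struct
definitions taken from patch 2 below; shown here only to make the size-freezing
argument easier to follow):

```c
/* uapi: opaque and fixed-size - this is all BPF programs and the
 * verifier ever see, so its size must never change.
 */
struct bpf_iter_task_vma {
	__u64 __opaque[1];
} __attribute__((aligned(8)));

/* kernel-internal: holds the actual iteration state. If struct
 * ma_state (inside vma_iterator) grows, only this bpf_mem_alloc'd
 * struct grows with it.
 */
struct bpf_iter_task_vma_kern_data {
	struct task_struct *task;
	struct mm_struct *mm;
	struct mmap_unlock_irq_work *work;
	struct vma_iterator vmi;
};

/* kernel-internal view of the uapi struct: just one pointer, so
 * sizeof(bpf_iter_task_vma_kern) == sizeof(bpf_iter_task_vma) holds
 * regardless of vma_iterator changes (enforced via BUILD_BUG_ON).
 */
struct bpf_iter_task_vma_kern {
	struct bpf_iter_task_vma_kern_data *data;
} __attribute__((aligned(8)));
```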
Changelog:
v2 -> v3: https://lore.kernel.org/bpf/20230821173415.1970776-1-davemarchevsky@fb.com/
Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num")
* Add Yonghong ack
Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
* UAPI bpf header and tools/ version should match
* Add bpf_iter_task_vma_kern_data which bpf_iter_task_vma_kern points to,
bpf_mem_alloc/free it instead of just vma_iterator. (Alexei)
* Inner data ptr == NULL implies initialization failed
v1 -> v2: https://lore.kernel.org/bpf/20230810183513.684836-1-davemarchevsky@fb.com/
* Patch 1
* Now removes the unnecessary BTF_TYPE_EMIT instead of changing the
type (Yonghong)
* Patch 2
* Don't do unnecessary BTF_TYPE_EMIT (Yonghong)
* Bump task refcount to prevent ->mm reuse (Yonghong)
* Keep a pointer to vma_iterator in bpf_iter_task_vma, alloc/free
via BPF mem allocator (Yonghong, Stanislav)
* Patch 3
Patch summary:
* Patch 1 is a tiny fix I ran into while implementing the vma iter in this
series. It can be applied independently.
* Patch 2 is the meat of the implementation
* Patch 3 adds tests for the new functionality
* Existing iter tests exercise failure cases (e.g. a prog that doesn't call
_destroy()). I didn't replicate them in this series, but am happy to add
them in a future revision if folks feel that it would be worthwhile.
[0]: https://lore.kernel.org/bpf/20230801145414.418145-1-davemarchevsky@fb.com/
Dave Marchevsky (3):
bpf: Don't explicitly emit BTF for struct btf_iter_num
bpf: Introduce task_vma open-coded iterator kfuncs
selftests/bpf: Add tests for open-coded task_vma iter
include/uapi/linux/bpf.h | 4 +
kernel/bpf/bpf_iter.c | 2 -
kernel/bpf/helpers.c | 3 +
kernel/bpf/task_iter.c | 84 +++++++++++++++++++
tools/include/uapi/linux/bpf.h | 4 +
tools/lib/bpf/bpf_helpers.h | 8 ++
.../selftests/bpf/prog_tests/bpf_iter.c | 26 +++---
.../testing/selftests/bpf/prog_tests/iters.c | 71 ++++++++++++++++
...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0
.../selftests/bpf/progs/iters_task_vma.c | 56 +++++++++++++
10 files changed, 243 insertions(+), 15 deletions(-)
rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%)
create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c
--
2.34.1
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH v3 bpf-next 1/3] bpf: Don't explicitly emit BTF for struct btf_iter_num 2023-08-22 5:05 [PATCH v3 bpf-next 0/3] Open-coded task_vma iter Dave Marchevsky @ 2023-08-22 5:05 ` Dave Marchevsky 2023-08-22 23:37 ` Andrii Nakryiko 2023-08-22 5:05 ` [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs Dave Marchevsky 2023-08-22 5:05 ` [PATCH v3 bpf-next 3/3] selftests/bpf: Add tests for open-coded task_vma iter Dave Marchevsky 2 siblings, 1 reply; 25+ messages in thread From: Dave Marchevsky @ 2023-08-22 5:05 UTC (permalink / raw) To: bpf Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, yonghong.song, sdf, Dave Marchevsky Commit 6018e1f407cc ("bpf: implement numbers iterator") added the BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num doesn't exist, so only a forward declaration is emitted in BTF: FWD 'btf_iter_num' fwd_kind=struct That commit was probably hoping to ensure that struct bpf_iter_num is emitted in vmlinux BTF. A previous version of this patch changed the line to emit the correct type, but Yonghong confirmed that it would definitely be emitted regardless in [0], so this patch simply removes the line. This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't causing any issues that I noticed, aside from mild confusion when I looked through the code. 
[0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> --- kernel/bpf/bpf_iter.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 96856f130cbf..833faa04461b 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -793,8 +793,6 @@ __bpf_kfunc int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) BUILD_BUG_ON(sizeof(struct bpf_iter_num_kern) != sizeof(struct bpf_iter_num)); BUILD_BUG_ON(__alignof__(struct bpf_iter_num_kern) != __alignof__(struct bpf_iter_num)); - BTF_TYPE_EMIT(struct btf_iter_num); - /* start == end is legit, it's an empty range and we'll just get NULL * on first (and any subsequent) bpf_iter_num_next() call */ -- 2.34.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 1/3] bpf: Don't explicitly emit BTF for struct btf_iter_num 2023-08-22 5:05 ` [PATCH v3 bpf-next 1/3] bpf: Don't explicitly emit BTF for struct btf_iter_num Dave Marchevsky @ 2023-08-22 23:37 ` Andrii Nakryiko 0 siblings, 0 replies; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-22 23:37 UTC (permalink / raw) To: Dave Marchevsky Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, yonghong.song, sdf On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote: > > Commit 6018e1f407cc ("bpf: implement numbers iterator") added the > BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num > doesn't exist, so only a forward declaration is emitted in BTF: > > FWD 'btf_iter_num' fwd_kind=struct > > That commit was probably hoping to ensure that struct bpf_iter_num is > emitted in vmlinux BTF. A previous version of this patch changed the > line to emit the correct type, but Yonghong confirmed that it would > definitely be emitted regardless in [0], so this patch simply removes > the line. > > This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't > causing any issues that I noticed, aside from mild confusion when I > looked through the code. 
> > [0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/ > > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > Acked-by: Yonghong Song <yonghong.song@linux.dev> > --- > kernel/bpf/bpf_iter.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c > index 96856f130cbf..833faa04461b 100644 > --- a/kernel/bpf/bpf_iter.c > +++ b/kernel/bpf/bpf_iter.c > @@ -793,8 +793,6 @@ __bpf_kfunc int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) > BUILD_BUG_ON(sizeof(struct bpf_iter_num_kern) != sizeof(struct bpf_iter_num)); > BUILD_BUG_ON(__alignof__(struct bpf_iter_num_kern) != __alignof__(struct bpf_iter_num)); > > - BTF_TYPE_EMIT(struct btf_iter_num); > - heh, hard to notice this typo... I think I needed it for the version of patch set before the switch to kfunc, so yeah, we don't need it anymore, thanks! Acked-by: Andrii Nakryiko <andrii@kernel.org> > /* start == end is legit, it's an empty range and we'll just get NULL > * on first (and any subsequent) bpf_iter_num_next() call > */ > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 5:05 [PATCH v3 bpf-next 0/3] Open-coded task_vma iter Dave Marchevsky 2023-08-22 5:05 ` [PATCH v3 bpf-next 1/3] bpf: Don't explicitly emit BTF for struct btf_iter_num Dave Marchevsky @ 2023-08-22 5:05 ` Dave Marchevsky 2023-08-22 17:42 ` Yonghong Song 2023-08-22 23:52 ` Andrii Nakryiko 2023-08-22 5:05 ` [PATCH v3 bpf-next 3/3] selftests/bpf: Add tests for open-coded task_vma iter Dave Marchevsky 2 siblings, 2 replies; 25+ messages in thread From: Dave Marchevsky @ 2023-08-22 5:05 UTC (permalink / raw) To: bpf Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, yonghong.song, sdf, Dave Marchevsky, Nathan Slingerland This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task_vma in open-coded iterator style. BPF programs can use these kfuncs directly or through bpf_for_each macro for natural-looking iteration of all task vmas. The implementation borrows heavily from bpf_find_vma helper's locking - differing only in that it holds the mmap_read lock for all iterations while the helper only executes its provided callback on a maximum of 1 vma. Aside from locking, struct vma_iterator and vma_next do all the heavy lifting. The newly-added struct bpf_iter_task_vma has a name collision with a selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs file is renamed in order to avoid the collision. A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the only field in struct bpf_iter_task_vma. This is because the inner data struct contains a struct vma_iterator (not ptr), whose size is likely to change under us. If bpf_iter_task_vma_kern contained vma_iterator directly such a change would require change in opaque bpf_iter_task_vma struct's size. 
So better to allocate vma_iterator using BPF allocator, and since that alloc must already succeed, might as well allocate all iter fields, thereby freezing struct bpf_iter_task_vma size. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Cc: Nathan Slingerland <slinger@meta.com> --- include/uapi/linux/bpf.h | 4 + kernel/bpf/helpers.c | 3 + kernel/bpf/task_iter.c | 84 +++++++++++++++++++ tools/include/uapi/linux/bpf.h | 4 + tools/lib/bpf/bpf_helpers.h | 8 ++ .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 7 files changed, 116 insertions(+), 13 deletions(-) rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 8790b3962e4b..49fc1989a548 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -7311,4 +7311,8 @@ struct bpf_iter_num { __u64 __opaque[1]; } __attribute__((aligned(8))); +struct bpf_iter_task_vma { + __u64 __opaque[1]; /* See bpf_iter_num comment above */ +} __attribute__((aligned(8))); + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index eb91cae0612a..7a06dea749f1 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index c4ab9d6cdbe9..51c2dce435c1 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c 
@@ -7,7 +7,9 @@ #include <linux/fs.h> #include <linux/fdtable.h> #include <linux/filter.h> +#include <linux/bpf_mem_alloc.h> #include <linux/btf_ids.h> +#include <linux/mm_types.h> #include "mmap_unlock_work.h" static const char * const iter_task_type_names[] = { @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { .arg5_type = ARG_ANYTHING, }; +struct bpf_iter_task_vma_kern_data { + struct task_struct *task; + struct mm_struct *mm; + struct mmap_unlock_irq_work *work; + struct vma_iterator vmi; +}; + +/* Non-opaque version of uapi bpf_iter_task_vma */ +struct bpf_iter_task_vma_kern { + struct bpf_iter_task_vma_kern_data *data; +} __attribute__((aligned(8))); + +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, + struct task_struct *task, u64 addr) +{ + struct bpf_iter_task_vma_kern *kit = (void *)it; + bool irq_work_busy = false; + int err; + + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); + + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized + * before, so non-NULL kit->data doesn't point to previously + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data + */ + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); + if (!kit->data) + return -ENOMEM; + kit->data->task = NULL; + + if (!task) { + err = -ENOENT; + goto err_cleanup_iter; + } + + kit->data->task = get_task_struct(task); + kit->data->mm = task->mm; + if (!kit->data->mm) { + err = -ENOENT; + goto err_cleanup_iter; + } + + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { + err = -EBUSY; + goto err_cleanup_iter; + } + + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); + return 0; + +err_cleanup_iter: + if (kit->data->task) + 
put_task_struct(kit->data->task); + bpf_mem_free(&bpf_global_ma, kit->data); + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ + kit->data = NULL; + return err; +} + +__bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) +{ + struct bpf_iter_task_vma_kern *kit = (void *)it; + + if (!kit->data) /* bpf_iter_task_vma_new failed */ + return NULL; + return vma_next(&kit->data->vmi); +} + +__bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) +{ + struct bpf_iter_task_vma_kern *kit = (void *)it; + + if (kit->data) { + bpf_mmap_unlock_mm(kit->data->work, kit->data->mm); + put_task_struct(kit->data->task); + bpf_mem_free(&bpf_global_ma, kit->data); + } +} + DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work); static void do_mmap_read_unlock(struct irq_work *entry) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 8790b3962e4b..49fc1989a548 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -7311,4 +7311,8 @@ struct bpf_iter_num { __u64 __opaque[1]; } __attribute__((aligned(8))); +struct bpf_iter_task_vma { + __u64 __opaque[1]; /* See bpf_iter_num comment above */ +} __attribute__((aligned(8))); + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h index bbab9ad9dc5a..d885ffee4d88 100644 --- a/tools/lib/bpf/bpf_helpers.h +++ b/tools/lib/bpf/bpf_helpers.h @@ -302,6 +302,14 @@ extern int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) __weak extern int *bpf_iter_num_next(struct bpf_iter_num *it) __weak __ksym; extern void bpf_iter_num_destroy(struct bpf_iter_num *it) __weak __ksym; +struct bpf_iter_task_vma; + +extern int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, + struct task_struct *task, + unsigned long addr) __weak __ksym; +extern struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) __weak __ksym; +extern void 
bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __weak __ksym; + #ifndef bpf_for_each /* bpf_for_each(iter_type, cur_elem, args...) provides generic construct for * using BPF open-coded iterators without having to write mundane explicit diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index 1f02168103dd..41aba139b20b 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -10,7 +10,7 @@ #include "bpf_iter_task.skel.h" #include "bpf_iter_task_stack.skel.h" #include "bpf_iter_task_file.skel.h" -#include "bpf_iter_task_vma.skel.h" +#include "bpf_iter_task_vmas.skel.h" #include "bpf_iter_task_btf.skel.h" #include "bpf_iter_tcp4.skel.h" #include "bpf_iter_tcp6.skel.h" @@ -1399,19 +1399,19 @@ static void str_strip_first_line(char *str) static void test_task_vma_common(struct bpf_iter_attach_opts *opts) { int err, iter_fd = -1, proc_maps_fd = -1; - struct bpf_iter_task_vma *skel; + struct bpf_iter_task_vmas *skel; int len, read_size = 4; char maps_path[64]; - skel = bpf_iter_task_vma__open(); - if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vma__open")) + skel = bpf_iter_task_vmas__open(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vmas__open")) return; skel->bss->pid = getpid(); skel->bss->one_task = opts ? 
1 : 0; - err = bpf_iter_task_vma__load(skel); - if (!ASSERT_OK(err, "bpf_iter_task_vma__load")) + err = bpf_iter_task_vmas__load(skel); + if (!ASSERT_OK(err, "bpf_iter_task_vmas__load")) goto out; skel->links.proc_maps = bpf_program__attach_iter( @@ -1462,25 +1462,25 @@ static void test_task_vma_common(struct bpf_iter_attach_opts *opts) out: close(proc_maps_fd); close(iter_fd); - bpf_iter_task_vma__destroy(skel); + bpf_iter_task_vmas__destroy(skel); } static void test_task_vma_dead_task(void) { - struct bpf_iter_task_vma *skel; + struct bpf_iter_task_vmas *skel; int wstatus, child_pid = -1; time_t start_tm, cur_tm; int err, iter_fd = -1; int wait_sec = 3; - skel = bpf_iter_task_vma__open(); - if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vma__open")) + skel = bpf_iter_task_vmas__open(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vmas__open")) return; skel->bss->pid = getpid(); - err = bpf_iter_task_vma__load(skel); - if (!ASSERT_OK(err, "bpf_iter_task_vma__load")) + err = bpf_iter_task_vmas__load(skel); + if (!ASSERT_OK(err, "bpf_iter_task_vmas__load")) goto out; skel->links.proc_maps = bpf_program__attach_iter( @@ -1533,7 +1533,7 @@ static void test_task_vma_dead_task(void) out: waitpid(child_pid, &wstatus, 0); close(iter_fd); - bpf_iter_task_vma__destroy(skel); + bpf_iter_task_vmas__destroy(skel); } void test_bpf_sockmap_map_iter_fd(void) diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c similarity index 100% rename from tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c rename to tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c -- 2.34.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 5:05 ` [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs Dave Marchevsky @ 2023-08-22 17:42 ` Yonghong Song 2023-08-22 19:19 ` David Marchevsky 2023-08-22 23:52 ` Andrii Nakryiko 1 sibling, 1 reply; 25+ messages in thread From: Yonghong Song @ 2023-08-22 17:42 UTC (permalink / raw) To: Dave Marchevsky, bpf Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, sdf, Nathan Slingerland On 8/21/23 10:05 PM, Dave Marchevsky wrote: > This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > creation and manipulation of struct bpf_iter_task_vma in open-coded > iterator style. BPF programs can use these kfuncs directly or through > bpf_for_each macro for natural-looking iteration of all task vmas. > > The implementation borrows heavily from bpf_find_vma helper's locking - > differing only in that it holds the mmap_read lock for all iterations > while the helper only executes its provided callback on a maximum of 1 > vma. Aside from locking, struct vma_iterator and vma_next do all the > heavy lifting. > > The newly-added struct bpf_iter_task_vma has a name collision with a > selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > file is renamed in order to avoid the collision. > > A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > only field in struct bpf_iter_task_vma. This is because the inner data > struct contains a struct vma_iterator (not ptr), whose size is likely to > change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > such a change would require change in opaque bpf_iter_task_vma struct's > size. So better to allocate vma_iterator using BPF allocator, and since > that alloc must already succeed, might as well allocate all iter fields, > thereby freezing struct bpf_iter_task_vma size. 
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > Cc: Nathan Slingerland <slinger@meta.com> > --- > include/uapi/linux/bpf.h | 4 + > kernel/bpf/helpers.c | 3 + > kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 4 + > tools/lib/bpf/bpf_helpers.h | 8 ++ > .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > 7 files changed, 116 insertions(+), 13 deletions(-) > rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 8790b3962e4b..49fc1989a548 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > __u64 __opaque[1]; > } __attribute__((aligned(8))); > > +struct bpf_iter_task_vma { > + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > +} __attribute__((aligned(8))); In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct like struct bpf_iter_<...> { __u64 __opaque[1]; } __attribute__((aligned(8))); Maybe we want a generic one instead of having lots of structs with the same underline definition? For example, struct bpf_iter_generic ? 
> + > #endif /* _UAPI__LINUX_BPF_H__ */ > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index eb91cae0612a..7a06dea749f1 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > BTF_ID_FLAGS(func, bpf_dynptr_adjust) > BTF_ID_FLAGS(func, bpf_dynptr_is_null) > BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > index c4ab9d6cdbe9..51c2dce435c1 100644 > --- a/kernel/bpf/task_iter.c > +++ b/kernel/bpf/task_iter.c > @@ -7,7 +7,9 @@ > #include <linux/fs.h> > #include <linux/fdtable.h> > #include <linux/filter.h> > +#include <linux/bpf_mem_alloc.h> > #include <linux/btf_ids.h> > +#include <linux/mm_types.h> > #include "mmap_unlock_work.h" > > static const char * const iter_task_type_names[] = { > @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > .arg5_type = ARG_ANYTHING, > }; > > +struct bpf_iter_task_vma_kern_data { > + struct task_struct *task; > + struct mm_struct *mm; > + struct mmap_unlock_irq_work *work; > + struct vma_iterator vmi; > +}; > + > +/* Non-opaque version of uapi bpf_iter_task_vma */ > +struct bpf_iter_task_vma_kern { > + struct bpf_iter_task_vma_kern_data *data; > +} __attribute__((aligned(8))); > + > +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > + struct task_struct *task, u64 addr) > +{ > + struct bpf_iter_task_vma_kern *kit = (void *)it; > + bool irq_work_busy = false; > + int err; > + > + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); 
> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); > + > + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized > + * before, so non-NULL kit->data doesn't point to previously > + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data > + */ > + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); > + if (!kit->data) > + return -ENOMEM; > + kit->data->task = NULL; > + > + if (!task) { > + err = -ENOENT; > + goto err_cleanup_iter; > + } > + > + kit->data->task = get_task_struct(task); The above is not safe. Since there is no restriction on 'task', the 'task' could be in a state for destruction with 'usage' count 0 and then get_task_struct(task) won't work since it unconditionally tries to increase 'usage' count from 0 to 1. Or, 'task' may be valid at the entry of the funciton, but when 'task' is in get_task_struct(), 'task' may have been destroyed and 'task' memory is reused by somebody else. I suggest that we check input parameter 'task' must be PTR_TRUSTED or MEM_RCU. This way, the above !task checking is not necessary and get_task_struct() can correctly hold a reference to 'task'. > + kit->data->mm = task->mm; > + if (!kit->data->mm) { > + err = -ENOENT; > + goto err_cleanup_iter; > + } > + > + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ > + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); > + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { > + err = -EBUSY; > + goto err_cleanup_iter; > + } > + > + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); > + return 0; > + > +err_cleanup_iter: > + if (kit->data->task) > + put_task_struct(kit->data->task); > + bpf_mem_free(&bpf_global_ma, kit->data); > + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ > + kit->data = NULL; > + return err; > +} > + [...] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 17:42 ` Yonghong Song @ 2023-08-22 19:19 ` David Marchevsky 2023-08-22 20:14 ` Yonghong Song 2023-08-23 0:04 ` Andrii Nakryiko 0 siblings, 2 replies; 25+ messages in thread From: David Marchevsky @ 2023-08-22 19:19 UTC (permalink / raw) To: yonghong.song, Dave Marchevsky, bpf Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, sdf, Nathan Slingerland On 8/22/23 1:42 PM, Yonghong Song wrote: > > > On 8/21/23 10:05 PM, Dave Marchevsky wrote: >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow >> creation and manipulation of struct bpf_iter_task_vma in open-coded >> iterator style. BPF programs can use these kfuncs directly or through >> bpf_for_each macro for natural-looking iteration of all task vmas. >> >> The implementation borrows heavily from bpf_find_vma helper's locking - >> differing only in that it holds the mmap_read lock for all iterations >> while the helper only executes its provided callback on a maximum of 1 >> vma. Aside from locking, struct vma_iterator and vma_next do all the >> heavy lifting. >> >> The newly-added struct bpf_iter_task_vma has a name collision with a >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs >> file is renamed in order to avoid the collision. >> >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the >> only field in struct bpf_iter_task_vma. This is because the inner data >> struct contains a struct vma_iterator (not ptr), whose size is likely to >> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly >> such a change would require change in opaque bpf_iter_task_vma struct's >> size. So better to allocate vma_iterator using BPF allocator, and since >> that alloc must already succeed, might as well allocate all iter fields, >> thereby freezing struct bpf_iter_task_vma size. 
>> >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> >> Cc: Nathan Slingerland <slinger@meta.com> >> --- >> include/uapi/linux/bpf.h | 4 + >> kernel/bpf/helpers.c | 3 + >> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ >> tools/include/uapi/linux/bpf.h | 4 + >> tools/lib/bpf/bpf_helpers.h | 8 ++ >> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- >> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 >> 7 files changed, 116 insertions(+), 13 deletions(-) >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 8790b3962e4b..49fc1989a548 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { >> __u64 __opaque[1]; >> } __attribute__((aligned(8))); >> +struct bpf_iter_task_vma { >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ >> +} __attribute__((aligned(8))); > > In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > like > struct bpf_iter_<...> { > __u64 __opaque[1]; > } __attribute__((aligned(8))); > > Maybe we want a generic one instead of having lots of > structs with the same underline definition? For example, > struct bpf_iter_generic > ? > The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types of iters would break the scheme. We could: * Add bpf_for_each_generic that only uses bpf_iter_generic * This exposes implementation details in an ugly way, though. * Do some macro magic to pick bpf_iter_generic for some types of iters, and use consistent naming pattern for others. 
* I'm not sure how to do this with preprocessor * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd data struct, and use bpf_iter_generic for all of them * Probably need to see more iter implementation / usage before making such a change * Do 'typedef __u64 __aligned(8) bpf_iter_<...> * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier logic. Could do similar typedef w/ struct to try to work around it. Let me know what you think. Personally I considered doing typedef while implementing this, so that's the alternative I'd choose. >> + >> #endif /* _UAPI__LINUX_BPF_H__ */ >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index eb91cae0612a..7a06dea749f1 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) >> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) >> BTF_ID_FLAGS(func, bpf_dynptr_adjust) >> BTF_ID_FLAGS(func, bpf_dynptr_is_null) >> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c >> index c4ab9d6cdbe9..51c2dce435c1 100644 >> --- a/kernel/bpf/task_iter.c >> +++ b/kernel/bpf/task_iter.c >> @@ -7,7 +7,9 @@ >> #include <linux/fs.h> >> #include <linux/fdtable.h> >> #include <linux/filter.h> >> +#include <linux/bpf_mem_alloc.h> >> #include <linux/btf_ids.h> >> +#include <linux/mm_types.h> >> #include "mmap_unlock_work.h" >> static const char * const iter_task_type_names[] = { >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { >> .arg5_type = ARG_ANYTHING, >> }; >> +struct bpf_iter_task_vma_kern_data { >> + 
struct task_struct *task; >> + struct mm_struct *mm; >> + struct mmap_unlock_irq_work *work; >> + struct vma_iterator vmi; >> +}; >> + >> +/* Non-opaque version of uapi bpf_iter_task_vma */ >> +struct bpf_iter_task_vma_kern { >> + struct bpf_iter_task_vma_kern_data *data; >> +} __attribute__((aligned(8))); >> + >> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, >> + struct task_struct *task, u64 addr) >> +{ >> + struct bpf_iter_task_vma_kern *kit = (void *)it; >> + bool irq_work_busy = false; >> + int err; >> + >> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); >> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); >> + >> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized >> + * before, so non-NULL kit->data doesn't point to previously >> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data >> + */ >> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); >> + if (!kit->data) >> + return -ENOMEM; >> + kit->data->task = NULL; >> + >> + if (!task) { >> + err = -ENOENT; >> + goto err_cleanup_iter; >> + } >> + >> + kit->data->task = get_task_struct(task); > > The above is not safe. Since there is no restriction on 'task', > the 'task' could be in a state for destruction with 'usage' count 0 > and then get_task_struct(task) won't work since it unconditionally > tries to increase 'usage' count from 0 to 1. > > Or, 'task' may be valid at the entry of the function, but when > 'task' is in get_task_struct(), 'task' may have been destroyed > and 'task' memory is reused by somebody else. > > I suggest that we check input parameter 'task' must be > PTR_TRUSTED or MEM_RCU. This way, the above !task checking > is not necessary and get_task_struct() can correctly > hold a reference to 'task'. > Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. 
I'm curious whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID to this kfunc currently. * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID * ptr hop from trusted task_struct to 'real_parent' or similar should yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def * if task kptr is in map_val, direct reference to it should result in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again using bpf_task_from_pid (?) But regardless, better to be explicit. Will change. >> + kit->data->mm = task->mm; >> + if (!kit->data->mm) { >> + err = -ENOENT; >> + goto err_cleanup_iter; >> + } >> + >> + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ >> + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); >> + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { >> + err = -EBUSY; >> + goto err_cleanup_iter; >> + } >> + >> + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); >> + return 0; >> + >> +err_cleanup_iter: >> + if (kit->data->task) >> + put_task_struct(kit->data->task); >> + bpf_mem_free(&bpf_global_ma, kit->data); >> + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ >> + kit->data = NULL; >> + return err; >> +} >> + > [...] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 19:19 ` David Marchevsky @ 2023-08-22 20:14 ` Yonghong Song 2023-08-22 22:36 ` Alexei Starovoitov 2023-08-23 0:04 ` Andrii Nakryiko 1 sibling, 1 reply; 25+ messages in thread From: Yonghong Song @ 2023-08-22 20:14 UTC (permalink / raw) To: David Marchevsky, Dave Marchevsky, bpf Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, sdf, Nathan Slingerland On 8/22/23 12:19 PM, David Marchevsky wrote: > On 8/22/23 1:42 PM, Yonghong Song wrote: >> >> >> On 8/21/23 10:05 PM, Dave Marchevsky wrote: >>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow >>> creation and manipulation of struct bpf_iter_task_vma in open-coded >>> iterator style. BPF programs can use these kfuncs directly or through >>> bpf_for_each macro for natural-looking iteration of all task vmas. >>> >>> The implementation borrows heavily from bpf_find_vma helper's locking - >>> differing only in that it holds the mmap_read lock for all iterations >>> while the helper only executes its provided callback on a maximum of 1 >>> vma. Aside from locking, struct vma_iterator and vma_next do all the >>> heavy lifting. >>> >>> The newly-added struct bpf_iter_task_vma has a name collision with a >>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs >>> file is renamed in order to avoid the collision. >>> >>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the >>> only field in struct bpf_iter_task_vma. This is because the inner data >>> struct contains a struct vma_iterator (not ptr), whose size is likely to >>> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly >>> such a change would require change in opaque bpf_iter_task_vma struct's >>> size. 
So better to allocate vma_iterator using BPF allocator, and since >>> that alloc must already succeed, might as well allocate all iter fields, >>> thereby freezing struct bpf_iter_task_vma size. >>> >>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> >>> Cc: Nathan Slingerland <slinger@meta.com> >>> --- >>> include/uapi/linux/bpf.h | 4 + >>> kernel/bpf/helpers.c | 3 + >>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ >>> tools/include/uapi/linux/bpf.h | 4 + >>> tools/lib/bpf/bpf_helpers.h | 8 ++ >>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- >>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 >>> 7 files changed, 116 insertions(+), 13 deletions(-) >>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) >>> >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>> index 8790b3962e4b..49fc1989a548 100644 >>> --- a/include/uapi/linux/bpf.h >>> +++ b/include/uapi/linux/bpf.h >>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { >>> __u64 __opaque[1]; >>> } __attribute__((aligned(8))); >>> +struct bpf_iter_task_vma { >>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ >>> +} __attribute__((aligned(8))); >> >> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct >> like >> struct bpf_iter_<...> { >> __u64 __opaque[1]; >> } __attribute__((aligned(8))); >> >> Maybe we want a generic one instead of having lots of >> structs with the same underline definition? For example, >> struct bpf_iter_generic >> ? >> > > The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types > of iters would break the scheme. We could: > > * Add bpf_for_each_generic that only uses bpf_iter_generic > * This exposes implementation details in an ugly way, though. 
> * Do some macro magic to pick bpf_iter_generic for some types of iters, and > use consistent naming pattern for others. > * I'm not sure how to do this with preprocessor > * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > data struct, and use bpf_iter_generic for all of them > * Probably need to see more iter implementation / usage before making such > a change > * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > * BTF_KIND_TYPEDEF instead of BTF_KIND_STRUCT might throw off some verifier > logic. Could do similar typedef w/ struct to try to work around > it. > > Let me know what you think. Personally I considered doing typedef while > implementing this, so that's the alternative I'd choose. Okay, since we have naming convention restriction, typedef probably the best option, something like typedef struct bpf_iter_num bpf_iter_task_vma ? Verifier might need to be changed if verifier strips all modifiers (including typedef) to find the struct name. > >>> + >>> #endif /* _UAPI__LINUX_BPF_H__ */ >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >>> index eb91cae0612a..7a06dea749f1 100644 >>> --- a/kernel/bpf/helpers.c >>> +++ b/kernel/bpf/helpers.c >>> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) >>> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) >>> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) >>> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) >>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) >>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) >>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) >>> BTF_ID_FLAGS(func, bpf_dynptr_adjust) >>> BTF_ID_FLAGS(func, bpf_dynptr_is_null) >>> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) >>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c >>> index c4ab9d6cdbe9..51c2dce435c1 100644 >>> --- a/kernel/bpf/task_iter.c >>> +++ b/kernel/bpf/task_iter.c >>> @@ -7,7 +7,9 @@ 
>>> #include <linux/fs.h> >>> #include <linux/fdtable.h> >>> #include <linux/filter.h> >>> +#include <linux/bpf_mem_alloc.h> >>> #include <linux/btf_ids.h> >>> +#include <linux/mm_types.h> >>> #include "mmap_unlock_work.h" >>> static const char * const iter_task_type_names[] = { >>> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { >>> .arg5_type = ARG_ANYTHING, >>> }; >>> +struct bpf_iter_task_vma_kern_data { >>> + struct task_struct *task; >>> + struct mm_struct *mm; >>> + struct mmap_unlock_irq_work *work; >>> + struct vma_iterator vmi; >>> +}; >>> + >>> +/* Non-opaque version of uapi bpf_iter_task_vma */ >>> +struct bpf_iter_task_vma_kern { >>> + struct bpf_iter_task_vma_kern_data *data; >>> +} __attribute__((aligned(8))); >>> + >>> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, >>> + struct task_struct *task, u64 addr) >>> +{ >>> + struct bpf_iter_task_vma_kern *kit = (void *)it; >>> + bool irq_work_busy = false; >>> + int err; >>> + >>> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); >>> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); >>> + >>> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized >>> + * before, so non-NULL kit->data doesn't point to previously >>> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data >>> + */ >>> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); >>> + if (!kit->data) >>> + return -ENOMEM; >>> + kit->data->task = NULL; >>> + >>> + if (!task) { >>> + err = -ENOENT; >>> + goto err_cleanup_iter; >>> + } >>> + >>> + kit->data->task = get_task_struct(task); >> >> The above is not safe. Since there is no restriction on 'task', >> the 'task' could be in a state for destruction with 'usage' count 0 >> and then get_task_struct(task) won't work since it unconditionally >> tries to increase 'usage' count from 0 to 1. 
>> >> Or, 'task' may be valid at the entry of the funciton, but when >> 'task' is in get_task_struct(), 'task' may have been destroyed >> and 'task' memory is reused by somebody else. >> >> I suggest that we check input parameter 'task' must be >> PTR_TRUSTED or MEM_RCU. This way, the above !task checking >> is not necessary and get_task_struct() can correctly >> hold a reference to 'task'. >> > > Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. I'm curious > whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID > to this kfunc currently. > > * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID > * ptr hop from trusted task_struct to 'real_parent' or similar should > yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def This is true. > * if task kptr is in map_val, direct reference to it should result > in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again > using bpf_task_from_pid (?) If in the rcu region, direct access of the task kptr can be marked with MEM_RCU | PTR_MAYBE_NULL. After a null ptr check, you will get a rcu protected task ptr. > > But regardless, better to be explicit. Will change. > >>> + kit->data->mm = task->mm; >>> + if (!kit->data->mm) { >>> + err = -ENOENT; >>> + goto err_cleanup_iter; >>> + } >>> + >>> + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ >>> + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); >>> + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { >>> + err = -EBUSY; >>> + goto err_cleanup_iter; >>> + } >>> + >>> + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); >>> + return 0; >>> + >>> +err_cleanup_iter: >>> + if (kit->data->task) >>> + put_task_struct(kit->data->task); >>> + bpf_mem_free(&bpf_global_ma, kit->data); >>> + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ >>> + kit->data = NULL; >>> + return err; >>> +} >>> + >> [...] 
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 20:14 ` Yonghong Song @ 2023-08-22 22:36 ` Alexei Starovoitov 2023-08-22 23:57 ` Andrii Nakryiko 2023-08-23 0:11 ` Yonghong Song 0 siblings, 2 replies; 25+ messages in thread From: Alexei Starovoitov @ 2023-08-22 22:36 UTC (permalink / raw) To: Yonghong Song Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Stanislav Fomichev, Nathan Slingerland On Tue, Aug 22, 2023 at 1:14 PM Yonghong Song <yonghong.song@linux.dev> wrote: > > > > On 8/22/23 12:19 PM, David Marchevsky wrote: > > On 8/22/23 1:42 PM, Yonghong Song wrote: > >> > >> > >> On 8/21/23 10:05 PM, Dave Marchevsky wrote: > >>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > >>> creation and manipulation of struct bpf_iter_task_vma in open-coded > >>> iterator style. BPF programs can use these kfuncs directly or through > >>> bpf_for_each macro for natural-looking iteration of all task vmas. > >>> > >>> The implementation borrows heavily from bpf_find_vma helper's locking - > >>> differing only in that it holds the mmap_read lock for all iterations > >>> while the helper only executes its provided callback on a maximum of 1 > >>> vma. Aside from locking, struct vma_iterator and vma_next do all the > >>> heavy lifting. > >>> > >>> The newly-added struct bpf_iter_task_vma has a name collision with a > >>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > >>> file is renamed in order to avoid the collision. > >>> > >>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > >>> only field in struct bpf_iter_task_vma. This is because the inner data > >>> struct contains a struct vma_iterator (not ptr), whose size is likely to > >>> change under us. 
If bpf_iter_task_vma_kern contained vma_iterator directly > >>> such a change would require change in opaque bpf_iter_task_vma struct's > >>> size. So better to allocate vma_iterator using BPF allocator, and since > >>> that alloc must already succeed, might as well allocate all iter fields, > >>> thereby freezing struct bpf_iter_task_vma size. > >>> > >>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > >>> Cc: Nathan Slingerland <slinger@meta.com> > >>> --- > >>> include/uapi/linux/bpf.h | 4 + > >>> kernel/bpf/helpers.c | 3 + > >>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > >>> tools/include/uapi/linux/bpf.h | 4 + > >>> tools/lib/bpf/bpf_helpers.h | 8 ++ > >>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > >>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > >>> 7 files changed, 116 insertions(+), 13 deletions(-) > >>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > >>> > >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >>> index 8790b3962e4b..49fc1989a548 100644 > >>> --- a/include/uapi/linux/bpf.h > >>> +++ b/include/uapi/linux/bpf.h > >>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > >>> __u64 __opaque[1]; > >>> } __attribute__((aligned(8))); > >>> +struct bpf_iter_task_vma { > >>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > >>> +} __attribute__((aligned(8))); > >> > >> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > >> like > >> struct bpf_iter_<...> { > >> __u64 __opaque[1]; > >> } __attribute__((aligned(8))); > >> > >> Maybe we want a generic one instead of having lots of > >> structs with the same underline definition? For example, > >> struct bpf_iter_generic > >> ? > >> > > > > The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > > and associated kfuncs. 
Having a 'bpf_iter_generic' shared amongst multiple types > > of iters would break the scheme. We could: > > > > * Add bpf_for_each_generic that only uses bpf_iter_generic > > * This exposes implementation details in an ugly way, though. > > * Do some macro magic to pick bpf_iter_generic for some types of iters, and > > use consistent naming pattern for others. > > * I'm not sure how to do this with preprocessor > > * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > > data struct, and use bpf_iter_generic for all of them > > * Probably need to see more iter implementation / usage before making such > > a change > > * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > > * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier > > logic. Could do similar typedef w/ struct to try to work around > > it. > > > > Let me know what you think. Personally I considered doing typedef while > > implementing this, so that's the alternative I'd choose. > > Okay, since we have naming convention restriction, typedef probably the > best option, something like > typedef struct bpf_iter_num bpf_iter_task_vma > ? > > Verifier might need to be changed if verifier strips all modifiers > (including tyypedef) to find the struct name. I don't quite see how typedef helps here. Say we do: struct bpf_iter_task_vma { __u64 __opaque[1]; } __attribute__((aligned(8))); as Dave is proposing. Then tomorrow we add another bpf_iter_foo that is exactly the same opaque[1]. And we will have bpf_iter_num, bpf_iter_task_vma, bpf_iter_foo structs with the same layout. So what? Eye sore? In case we need to extend task_vma from 1 to 2 it will be easier to do when all of them are separate structs. And typedef has unknown verification implications. Either way we need to find a way to move these structs from uapi/bpf.h along with bpf_rb_root and friends to some "obviously unstable" header. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 22:36 ` Alexei Starovoitov @ 2023-08-22 23:57 ` Andrii Nakryiko 2023-08-23 0:11 ` Yonghong Song 1 sibling, 0 replies; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-22 23:57 UTC (permalink / raw) To: Alexei Starovoitov Cc: Yonghong Song, David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Stanislav Fomichev, Nathan Slingerland On Tue, Aug 22, 2023 at 3:37 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Aug 22, 2023 at 1:14 PM Yonghong Song <yonghong.song@linux.dev> wrote: > > > > > > > > On 8/22/23 12:19 PM, David Marchevsky wrote: > > > On 8/22/23 1:42 PM, Yonghong Song wrote: > > >> > > >> > > >> On 8/21/23 10:05 PM, Dave Marchevsky wrote: > > >>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > > >>> creation and manipulation of struct bpf_iter_task_vma in open-coded > > >>> iterator style. BPF programs can use these kfuncs directly or through > > >>> bpf_for_each macro for natural-looking iteration of all task vmas. > > >>> > > >>> The implementation borrows heavily from bpf_find_vma helper's locking - > > >>> differing only in that it holds the mmap_read lock for all iterations > > >>> while the helper only executes its provided callback on a maximum of 1 > > >>> vma. Aside from locking, struct vma_iterator and vma_next do all the > > >>> heavy lifting. > > >>> > > >>> The newly-added struct bpf_iter_task_vma has a name collision with a > > >>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > > >>> file is renamed in order to avoid the collision. > > >>> > > >>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > > >>> only field in struct bpf_iter_task_vma. 
This is because the inner data > > >>> struct contains a struct vma_iterator (not ptr), whose size is likely to > > >>> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > > >>> such a change would require change in opaque bpf_iter_task_vma struct's > > >>> size. So better to allocate vma_iterator using BPF allocator, and since > > >>> that alloc must already succeed, might as well allocate all iter fields, > > >>> thereby freezing struct bpf_iter_task_vma size. > > >>> > > >>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > > >>> Cc: Nathan Slingerland <slinger@meta.com> > > >>> --- > > >>> include/uapi/linux/bpf.h | 4 + > > >>> kernel/bpf/helpers.c | 3 + > > >>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > > >>> tools/include/uapi/linux/bpf.h | 4 + > > >>> tools/lib/bpf/bpf_helpers.h | 8 ++ > > >>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > > >>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > > >>> 7 files changed, 116 insertions(+), 13 deletions(-) > > >>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > >>> > > >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >>> index 8790b3962e4b..49fc1989a548 100644 > > >>> --- a/include/uapi/linux/bpf.h > > >>> +++ b/include/uapi/linux/bpf.h > > >>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > > >>> __u64 __opaque[1]; > > >>> } __attribute__((aligned(8))); > > >>> +struct bpf_iter_task_vma { > > >>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > > >>> +} __attribute__((aligned(8))); > > >> > > >> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > > >> like > > >> struct bpf_iter_<...> { > > >> __u64 __opaque[1]; > > >> } __attribute__((aligned(8))); > > >> > > >> Maybe we want a generic one instead of having lots of > > >> structs with the same underline definition? 
For example, > > >> struct bpf_iter_generic > > >> ? > > >> > > > > > > The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > > > and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types > > > of iters would break the scheme. We could: > > > > > > * Add bpf_for_each_generic that only uses bpf_iter_generic > > > * This exposes implementation details in an ugly way, though. > > > * Do some macro magic to pick bpf_iter_generic for some types of iters, and > > > use consistent naming pattern for others. > > > * I'm not sure how to do this with preprocessor > > > * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > > > data struct, and use bpf_iter_generic for all of them > > > * Probably need to see more iter implementation / usage before making such > > > a change > > > * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > > > * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier > > > logic. Could do similar typedef w/ struct to try to work around > > > it. > > > > > > Let me know what you think. Personally I considered doing typedef while > > > implementing this, so that's the alternative I'd choose. > > > > Okay, since we have naming convention restriction, typedef probably the > > best option, something like > > typedef struct bpf_iter_num bpf_iter_task_vma > > ? > > > > Verifier might need to be changed if verifier strips all modifiers > > (including tyypedef) to find the struct name. > > I don't quite see how typedef helps here. > Say we do: > struct bpf_iter_task_vma { > __u64 __opaque[1]; > } __attribute__((aligned(8))); > > as Dave is proposing. > Then tomorrow we add another bpf_iter_foo that is exactly the same opaque[1]. > And we will have bpf_iter_num, bpf_iter_task_vma, bpf_iter_foo structs > with the same layout. So what? Eye sore? > In case we need to extend task_vma from 1 to 2 it will be easier to do > when all of them are separate structs. 
> > And typedef has unknown verification implications. +1, I wouldn't complicate things and have a proper struct with strict naming convention for each iter kind > > Either way we need to find a way to move these structs from uapi/bpf.h > along with bpf_rb_root and friends to some "obviously unstable" header. I don't think we have to add struct bpf_iter_task_vma to uapi/bpf.h at all and rely on it coming from vmlinux.h. As I mentioned in previous email, num iterator is a bit special in its generality and simplicity (which allows to be confident it won't need to be changed), while other iterators should be treated as more unstable and come from vmlinux.h (or wherever else kfuncs will be coming from in the future). tl;dr: we can define this struct right next to where we defined bpf_iter_task_vma_new() kfunc. Unless I'm missing some complication, of course. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 22:36 ` Alexei Starovoitov 2023-08-22 23:57 ` Andrii Nakryiko @ 2023-08-23 0:11 ` Yonghong Song 1 sibling, 0 replies; 25+ messages in thread From: Yonghong Song @ 2023-08-23 0:11 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Stanislav Fomichev, Nathan Slingerland On 8/22/23 3:36 PM, Alexei Starovoitov wrote: > On Tue, Aug 22, 2023 at 1:14 PM Yonghong Song <yonghong.song@linux.dev> wrote: >> >> >> >> On 8/22/23 12:19 PM, David Marchevsky wrote: >>> On 8/22/23 1:42 PM, Yonghong Song wrote: >>>> >>>> >>>> On 8/21/23 10:05 PM, Dave Marchevsky wrote: >>>>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow >>>>> creation and manipulation of struct bpf_iter_task_vma in open-coded >>>>> iterator style. BPF programs can use these kfuncs directly or through >>>>> bpf_for_each macro for natural-looking iteration of all task vmas. >>>>> >>>>> The implementation borrows heavily from bpf_find_vma helper's locking - >>>>> differing only in that it holds the mmap_read lock for all iterations >>>>> while the helper only executes its provided callback on a maximum of 1 >>>>> vma. Aside from locking, struct vma_iterator and vma_next do all the >>>>> heavy lifting. >>>>> >>>>> The newly-added struct bpf_iter_task_vma has a name collision with a >>>>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs >>>>> file is renamed in order to avoid the collision. >>>>> >>>>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the >>>>> only field in struct bpf_iter_task_vma. This is because the inner data >>>>> struct contains a struct vma_iterator (not ptr), whose size is likely to >>>>> change under us. 
If bpf_iter_task_vma_kern contained vma_iterator directly >>>>> such a change would require change in opaque bpf_iter_task_vma struct's >>>>> size. So better to allocate vma_iterator using BPF allocator, and since >>>>> that alloc must already succeed, might as well allocate all iter fields, >>>>> thereby freezing struct bpf_iter_task_vma size. >>>>> >>>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> >>>>> Cc: Nathan Slingerland <slinger@meta.com> >>>>> --- >>>>> include/uapi/linux/bpf.h | 4 + >>>>> kernel/bpf/helpers.c | 3 + >>>>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ >>>>> tools/include/uapi/linux/bpf.h | 4 + >>>>> tools/lib/bpf/bpf_helpers.h | 8 ++ >>>>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- >>>>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 >>>>> 7 files changed, 116 insertions(+), 13 deletions(-) >>>>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) >>>>> >>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>>>> index 8790b3962e4b..49fc1989a548 100644 >>>>> --- a/include/uapi/linux/bpf.h >>>>> +++ b/include/uapi/linux/bpf.h >>>>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { >>>>> __u64 __opaque[1]; >>>>> } __attribute__((aligned(8))); >>>>> +struct bpf_iter_task_vma { >>>>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ >>>>> +} __attribute__((aligned(8))); >>>> >>>> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct >>>> like >>>> struct bpf_iter_<...> { >>>> __u64 __opaque[1]; >>>> } __attribute__((aligned(8))); >>>> >>>> Maybe we want a generic one instead of having lots of >>>> structs with the same underline definition? For example, >>>> struct bpf_iter_generic >>>> ? >>>> >>> >>> The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct >>> and associated kfuncs. 
Having a 'bpf_iter_generic' shared amongst multiple types >>> of iters would break the scheme. We could: >>> >>> * Add bpf_for_each_generic that only uses bpf_iter_generic >>> * This exposes implementation details in an ugly way, though. >>> * Do some macro magic to pick bpf_iter_generic for some types of iters, and >>> use consistent naming pattern for others. >>> * I'm not sure how to do this with preprocessor >>> * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd >>> data struct, and use bpf_iter_generic for all of them >>> * Probably need to see more iter implementation / usage before making such >>> a change >>> * Do 'typedef __u64 __aligned(8) bpf_iter_<...> >>> * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier >>> logic. Could do similar typedef w/ struct to try to work around >>> it. >>> >>> Let me know what you think. Personally I considered doing typedef while >>> implementing this, so that's the alternative I'd choose. >> >> Okay, since we have naming convention restriction, typedef probably the >> best option, something like >> typedef struct bpf_iter_num bpf_iter_task_vma >> ? >> >> Verifier might need to be changed if verifier strips all modifiers >> (including tyypedef) to find the struct name. > > I don't quite see how typedef helps here. > Say we do: > struct bpf_iter_task_vma { > __u64 __opaque[1]; > } __attribute__((aligned(8))); > > as Dave is proposing. > Then tomorrow we add another bpf_iter_foo that is exactly the same opaque[1]. > And we will have bpf_iter_num, bpf_iter_task_vma, bpf_iter_foo structs > with the same layout. So what? Eye sore? > In case we need to extend task_vma from 1 to 2 it will be easier to do > when all of them are separate structs. > > And typedef has unknown verification implications. This is true. Some investigation is needed. 
> > Either way we need to find a way to move these structs from uapi/bpf.h > along with bpf_rb_root and friends to some "obviously unstable" header. If we move this out uapi/bpf.h to an unstable header, we will have much more flexibility for future change. Original idea (v1) to allocate the whole struct on the stack would be okay even if kernel changes struct vma_iterator size, whether bigger or smaller. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 19:19 ` David Marchevsky 2023-08-22 20:14 ` Yonghong Song @ 2023-08-23 0:04 ` Andrii Nakryiko 2023-08-23 5:42 ` David Marchevsky 1 sibling, 1 reply; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-23 0:04 UTC (permalink / raw) To: David Marchevsky Cc: yonghong.song, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, sdf, Nathan Slingerland On Tue, Aug 22, 2023 at 12:20 PM David Marchevsky <david.marchevsky@linux.dev> wrote: > > On 8/22/23 1:42 PM, Yonghong Song wrote: > > > > > > On 8/21/23 10:05 PM, Dave Marchevsky wrote: > >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > >> creation and manipulation of struct bpf_iter_task_vma in open-coded > >> iterator style. BPF programs can use these kfuncs directly or through > >> bpf_for_each macro for natural-looking iteration of all task vmas. > >> > >> The implementation borrows heavily from bpf_find_vma helper's locking - > >> differing only in that it holds the mmap_read lock for all iterations > >> while the helper only executes its provided callback on a maximum of 1 > >> vma. Aside from locking, struct vma_iterator and vma_next do all the > >> heavy lifting. > >> > >> The newly-added struct bpf_iter_task_vma has a name collision with a > >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > >> file is renamed in order to avoid the collision. > >> > >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > >> only field in struct bpf_iter_task_vma. This is because the inner data > >> struct contains a struct vma_iterator (not ptr), whose size is likely to > >> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > >> such a change would require change in opaque bpf_iter_task_vma struct's > >> size. 
So better to allocate vma_iterator using BPF allocator, and since > >> that alloc must already succeed, might as well allocate all iter fields, > >> thereby freezing struct bpf_iter_task_vma size. > >> > >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > >> Cc: Nathan Slingerland <slinger@meta.com> > >> --- > >> include/uapi/linux/bpf.h | 4 + > >> kernel/bpf/helpers.c | 3 + > >> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > >> tools/include/uapi/linux/bpf.h | 4 + > >> tools/lib/bpf/bpf_helpers.h | 8 ++ > >> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > >> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > >> 7 files changed, 116 insertions(+), 13 deletions(-) > >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > >> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 8790b3962e4b..49fc1989a548 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > >> __u64 __opaque[1]; > >> } __attribute__((aligned(8))); > >> +struct bpf_iter_task_vma { > >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > >> +} __attribute__((aligned(8))); > > > > In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > > like > > struct bpf_iter_<...> { > > __u64 __opaque[1]; > > } __attribute__((aligned(8))); > > > > Maybe we want a generic one instead of having lots of > > structs with the same underline definition? For example, > > struct bpf_iter_generic > > ? > > > > The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types > of iters would break the scheme. We could: > > * Add bpf_for_each_generic that only uses bpf_iter_generic > * This exposes implementation details in an ugly way, though. 
> * Do some macro magic to pick bpf_iter_generic for some types of iters, and > use consistent naming pattern for others. > * I'm not sure how to do this with preprocessor > * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > data struct, and use bpf_iter_generic for all of them > * Probably need to see more iter implementation / usage before making such > a change > * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier > logic. Could do similar typedef w/ struct to try to work around > it. > > Let me know what you think. Personally I considered doing typedef while > implementing this, so that's the alternative I'd choose. > > >> + > >> #endif /* _UAPI__LINUX_BPF_H__ */ > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >> index eb91cae0612a..7a06dea749f1 100644 > >> --- a/kernel/bpf/helpers.c > >> +++ b/kernel/bpf/helpers.c > >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > >> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > >> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > >> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > >> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > >> BTF_ID_FLAGS(func, bpf_dynptr_is_null) > >> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > >> index c4ab9d6cdbe9..51c2dce435c1 100644 > >> --- a/kernel/bpf/task_iter.c > >> +++ b/kernel/bpf/task_iter.c > >> @@ -7,7 +7,9 @@ > >> #include <linux/fs.h> > >> #include <linux/fdtable.h> > >> #include <linux/filter.h> > >> +#include <linux/bpf_mem_alloc.h> > >> #include <linux/btf_ids.h> > >> +#include <linux/mm_types.h> > >> #include "mmap_unlock_work.h" > >> 
static const char * const iter_task_type_names[] = { > >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > >> .arg5_type = ARG_ANYTHING, > >> }; > >> +struct bpf_iter_task_vma_kern_data { > >> + struct task_struct *task; > >> + struct mm_struct *mm; > >> + struct mmap_unlock_irq_work *work; > >> + struct vma_iterator vmi; > >> +}; > >> + > >> +/* Non-opaque version of uapi bpf_iter_task_vma */ > >> +struct bpf_iter_task_vma_kern { > >> + struct bpf_iter_task_vma_kern_data *data; > >> +} __attribute__((aligned(8))); > >> + > >> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > >> + struct task_struct *task, u64 addr) > >> +{ > >> + struct bpf_iter_task_vma_kern *kit = (void *)it; > >> + bool irq_work_busy = false; > >> + int err; > >> + > >> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); > >> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); > >> + > >> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized > >> + * before, so non-NULL kit->data doesn't point to previously > >> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data > >> + */ > >> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); > >> + if (!kit->data) > >> + return -ENOMEM; > >> + kit->data->task = NULL; > >> + > >> + if (!task) { > >> + err = -ENOENT; > >> + goto err_cleanup_iter; > >> + } > >> + > >> + kit->data->task = get_task_struct(task); > > > > The above is not safe. Since there is no restriction on 'task', > > the 'task' could be in a state for destruction with 'usage' count 0 > > and then get_task_struct(task) won't work since it unconditionally > > tries to increase 'usage' count from 0 to 1. > > > > Or, 'task' may be valid at the entry of the funciton, but when > > 'task' is in get_task_struct(), 'task' may have been destroyed > > and 'task' memory is reused by somebody else. 
> > > > I suggest that we check input parameter 'task' must be > > PTR_TRUSTED or MEM_RCU. This way, the above !task checking > > is not necessary and get_task_struct() can correctly > > hold a reference to 'task'. > > > > Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. I'm curious > whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID > to this kfunc currently. > > * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID > * ptr hop from trusted task_struct to 'real_parent' or similar should > yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def > * if task kptr is in map_val, direct reference to it should result > in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again > using bpf_task_from_pid (?) > > But regardless, better to be explicit. Will change. How horrible would it be to base an interface on TID/PID (i.e., int) as input argument to specify a task? I'm just thinking it might be more generic and easy to use in more situations: - for all the cases where we have struct task_struct, getting its pid is trivial: `task->pid`; - but in some situations PID might be coming from outside: either as an argument to CLI tool, or from old-style tracepoint (e.g., context_switch where we have prev/next task pid), etc. The downside is that we'd need to look up a task, right? But on the other hand we get more generality and won't have to rely on having PTR_TRUSTED task_struct. Thoughts? 
> > >> + kit->data->mm = task->mm; > >> + if (!kit->data->mm) { > >> + err = -ENOENT; > >> + goto err_cleanup_iter; > >> + } > >> + > >> + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ > >> + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); > >> + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { > >> + err = -EBUSY; > >> + goto err_cleanup_iter; > >> + } > >> + > >> + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); > >> + return 0; > >> + > >> +err_cleanup_iter: > >> + if (kit->data->task) > >> + put_task_struct(kit->data->task); > >> + bpf_mem_free(&bpf_global_ma, kit->data); > >> + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ > >> + kit->data = NULL; > >> + return err; > >> +} > >> + > > [...] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 0:04 ` Andrii Nakryiko @ 2023-08-23 5:42 ` David Marchevsky 2023-08-23 14:57 ` Alexei Starovoitov 0 siblings, 1 reply; 25+ messages in thread From: David Marchevsky @ 2023-08-23 5:42 UTC (permalink / raw) To: Andrii Nakryiko, David Marchevsky Cc: yonghong.song, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, sdf, Nathan Slingerland On 8/22/23 8:04 PM, Andrii Nakryiko wrote: > On Tue, Aug 22, 2023 at 12:20 PM David Marchevsky > <david.marchevsky@linux.dev> wrote: >> >> On 8/22/23 1:42 PM, Yonghong Song wrote: >>> >>> >>> On 8/21/23 10:05 PM, Dave Marchevsky wrote: >>>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow >>>> creation and manipulation of struct bpf_iter_task_vma in open-coded >>>> iterator style. BPF programs can use these kfuncs directly or through >>>> bpf_for_each macro for natural-looking iteration of all task vmas. >>>> >>>> The implementation borrows heavily from bpf_find_vma helper's locking - >>>> differing only in that it holds the mmap_read lock for all iterations >>>> while the helper only executes its provided callback on a maximum of 1 >>>> vma. Aside from locking, struct vma_iterator and vma_next do all the >>>> heavy lifting. >>>> >>>> The newly-added struct bpf_iter_task_vma has a name collision with a >>>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs >>>> file is renamed in order to avoid the collision. >>>> >>>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the >>>> only field in struct bpf_iter_task_vma. This is because the inner data >>>> struct contains a struct vma_iterator (not ptr), whose size is likely to >>>> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly >>>> such a change would require change in opaque bpf_iter_task_vma struct's >>>> size. 
So better to allocate vma_iterator using BPF allocator, and since >>>> that alloc must already succeed, might as well allocate all iter fields, >>>> thereby freezing struct bpf_iter_task_vma size. >>>> >>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> >>>> Cc: Nathan Slingerland <slinger@meta.com> >>>> --- >>>> include/uapi/linux/bpf.h | 4 + >>>> kernel/bpf/helpers.c | 3 + >>>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ >>>> tools/include/uapi/linux/bpf.h | 4 + >>>> tools/lib/bpf/bpf_helpers.h | 8 ++ >>>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- >>>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 >>>> 7 files changed, 116 insertions(+), 13 deletions(-) >>>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) >>>> >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>>> index 8790b3962e4b..49fc1989a548 100644 >>>> --- a/include/uapi/linux/bpf.h >>>> +++ b/include/uapi/linux/bpf.h >>>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { >>>> __u64 __opaque[1]; >>>> } __attribute__((aligned(8))); >>>> +struct bpf_iter_task_vma { >>>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ >>>> +} __attribute__((aligned(8))); >>> >>> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct >>> like >>> struct bpf_iter_<...> { >>> __u64 __opaque[1]; >>> } __attribute__((aligned(8))); >>> >>> Maybe we want a generic one instead of having lots of >>> structs with the same underline definition? For example, >>> struct bpf_iter_generic >>> ? >>> >> >> The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct >> and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types >> of iters would break the scheme. We could: >> >> * Add bpf_for_each_generic that only uses bpf_iter_generic >> * This exposes implementation details in an ugly way, though. 
>> * Do some macro magic to pick bpf_iter_generic for some types of iters, and >> use consistent naming pattern for others. >> * I'm not sure how to do this with preprocessor >> * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd >> data struct, and use bpf_iter_generic for all of them >> * Probably need to see more iter implementation / usage before making such >> a change >> * Do 'typedef __u64 __aligned(8) bpf_iter_<...> >> * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier >> logic. Could do similar typedef w/ struct to try to work around >> it. >> >> Let me know what you think. Personally I considered doing typedef while >> implementing this, so that's the alternative I'd choose. >> >>>> + >>>> #endif /* _UAPI__LINUX_BPF_H__ */ >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >>>> index eb91cae0612a..7a06dea749f1 100644 >>>> --- a/kernel/bpf/helpers.c >>>> +++ b/kernel/bpf/helpers.c >>>> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) >>>> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) >>>> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) >>>> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) >>>> BTF_ID_FLAGS(func, bpf_dynptr_adjust) >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_null) >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) >>>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c >>>> index c4ab9d6cdbe9..51c2dce435c1 100644 >>>> --- a/kernel/bpf/task_iter.c >>>> +++ b/kernel/bpf/task_iter.c >>>> @@ -7,7 +7,9 @@ >>>> #include <linux/fs.h> >>>> #include <linux/fdtable.h> >>>> #include <linux/filter.h> >>>> +#include <linux/bpf_mem_alloc.h> >>>> #include <linux/btf_ids.h> >>>> +#include <linux/mm_types.h> >>>> #include 
"mmap_unlock_work.h" >>>> static const char * const iter_task_type_names[] = { >>>> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { >>>> .arg5_type = ARG_ANYTHING, >>>> }; >>>> +struct bpf_iter_task_vma_kern_data { >>>> + struct task_struct *task; >>>> + struct mm_struct *mm; >>>> + struct mmap_unlock_irq_work *work; >>>> + struct vma_iterator vmi; >>>> +}; >>>> + >>>> +/* Non-opaque version of uapi bpf_iter_task_vma */ >>>> +struct bpf_iter_task_vma_kern { >>>> + struct bpf_iter_task_vma_kern_data *data; >>>> +} __attribute__((aligned(8))); >>>> + >>>> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, >>>> + struct task_struct *task, u64 addr) >>>> +{ >>>> + struct bpf_iter_task_vma_kern *kit = (void *)it; >>>> + bool irq_work_busy = false; >>>> + int err; >>>> + >>>> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); >>>> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); >>>> + >>>> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized >>>> + * before, so non-NULL kit->data doesn't point to previously >>>> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data >>>> + */ >>>> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); >>>> + if (!kit->data) >>>> + return -ENOMEM; >>>> + kit->data->task = NULL; >>>> + >>>> + if (!task) { >>>> + err = -ENOENT; >>>> + goto err_cleanup_iter; >>>> + } >>>> + >>>> + kit->data->task = get_task_struct(task); >>> >>> The above is not safe. Since there is no restriction on 'task', >>> the 'task' could be in a state for destruction with 'usage' count 0 >>> and then get_task_struct(task) won't work since it unconditionally >>> tries to increase 'usage' count from 0 to 1. 
>>> >>> Or, 'task' may be valid at the entry of the funciton, but when >>> 'task' is in get_task_struct(), 'task' may have been destroyed >>> and 'task' memory is reused by somebody else. >>> >>> I suggest that we check input parameter 'task' must be >>> PTR_TRUSTED or MEM_RCU. This way, the above !task checking >>> is not necessary and get_task_struct() can correctly >>> hold a reference to 'task'. >>> >> >> Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. I'm curious >> whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID >> to this kfunc currently. >> >> * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID >> * ptr hop from trusted task_struct to 'real_parent' or similar should >> yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def >> * if task kptr is in map_val, direct reference to it should result >> in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again >> using bpf_task_from_pid (?) >> >> But regardless, better to be explicit. Will change. > > How horrible would it be to base an interface on TID/PID (i.e., int) > as input argument to specify a task? I'm just thinking it might be > more generic and easy to use in more situations: > - for all the cases where we have struct task_struct, getting its > pid is trivial: `task->pid`; > - but in some situations PID might be coming from outside: either > as an argument to CLI tool, or from old-style tracepoint (e.g., > context_switch where we have prev/next task pid), etc. > > The downside is that we'd need to look up a task, right? But on the > other hand we get more generality and won't have to rely on having > PTR_TRUSTED task_struct. > > Thoughts? > Meh, taking tid here feels like the 'old-school' approach, before recent efforts to teach the verifier more about resource acquisition, locking, iteration, trustedness, etc. All allowing us to push more important logic out of 'opaque' helper impl and into BPF programs. 
In this tid -> struct task_struct case I think the provided resource acquisition is sufficient. Using your examples: * We have a TRUSTED or RCU struct task_struct * No need to do anything, can pass to bpf_iter_task_vma_new * We have a struct task_struct, but it's UNTRUSTED or has no trustedness type flags * Use bpf_task_acquire or bpf_task_from_pid * We just have a pid ('coming from outside') * Use bpf_task_from_pid If there is some scenario where we can't get from pid to properly acquired task in the BPF program, let's improve the resource acquisition instead of pushing it into the kfunc impl. Also, should we look up (and refcount incr) the task using task->rcu_users refcount w/ bpf_task_{acquire,release}, or using task->usage refcount w/ {get,put}_task_struct ? More generally, if there are >1 valid ways to acquire task_struct or some other resource, pushing acquisition to the BPF program gives us the benefit of not having to pick one (or do possibly ugly / complex flags). As long as type flags express that the resource will not go away, this kfunc impl can ignore the details of how that property came about. >> >>>> + kit->data->mm = task->mm; >>>> + if (!kit->data->mm) { >>>> + err = -ENOENT; >>>> + goto err_cleanup_iter; >>>> + } >>>> + >>>> + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ >>>> + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); >>>> + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { >>>> + err = -EBUSY; >>>> + goto err_cleanup_iter; >>>> + } >>>> + >>>> + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); >>>> + return 0; >>>> + >>>> +err_cleanup_iter: >>>> + if (kit->data->task) >>>> + put_task_struct(kit->data->task); >>>> + bpf_mem_free(&bpf_global_ma, kit->data); >>>> + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ >>>> + kit->data = NULL; >>>> + return err; >>>> +} >>>> + >>> [...] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 5:42 ` David Marchevsky @ 2023-08-23 14:57 ` Alexei Starovoitov 2023-08-23 16:55 ` Andrii Nakryiko 0 siblings, 1 reply; 25+ messages in thread From: Alexei Starovoitov @ 2023-08-23 14:57 UTC (permalink / raw) To: David Marchevsky Cc: Andrii Nakryiko, Yonghong Song, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Stanislav Fomichev, Nathan Slingerland On Tue, Aug 22, 2023 at 10:42 PM David Marchevsky <david.marchevsky@linux.dev> wrote: > > On 8/22/23 8:04 PM, Andrii Nakryiko wrote: > > On Tue, Aug 22, 2023 at 12:20 PM David Marchevsky > > <david.marchevsky@linux.dev> wrote: > >> > >> On 8/22/23 1:42 PM, Yonghong Song wrote: > >>> > >>> > >>> On 8/21/23 10:05 PM, Dave Marchevsky wrote: > >>>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > >>>> creation and manipulation of struct bpf_iter_task_vma in open-coded > >>>> iterator style. BPF programs can use these kfuncs directly or through > >>>> bpf_for_each macro for natural-looking iteration of all task vmas. > >>>> > >>>> The implementation borrows heavily from bpf_find_vma helper's locking - > >>>> differing only in that it holds the mmap_read lock for all iterations > >>>> while the helper only executes its provided callback on a maximum of 1 > >>>> vma. Aside from locking, struct vma_iterator and vma_next do all the > >>>> heavy lifting. > >>>> > >>>> The newly-added struct bpf_iter_task_vma has a name collision with a > >>>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > >>>> file is renamed in order to avoid the collision. > >>>> > >>>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > >>>> only field in struct bpf_iter_task_vma. This is because the inner data > >>>> struct contains a struct vma_iterator (not ptr), whose size is likely to > >>>> change under us. 
If bpf_iter_task_vma_kern contained vma_iterator directly > >>>> such a change would require change in opaque bpf_iter_task_vma struct's > >>>> size. So better to allocate vma_iterator using BPF allocator, and since > >>>> that alloc must already succeed, might as well allocate all iter fields, > >>>> thereby freezing struct bpf_iter_task_vma size. > >>>> > >>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > >>>> Cc: Nathan Slingerland <slinger@meta.com> > >>>> --- > >>>> include/uapi/linux/bpf.h | 4 + > >>>> kernel/bpf/helpers.c | 3 + > >>>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > >>>> tools/include/uapi/linux/bpf.h | 4 + > >>>> tools/lib/bpf/bpf_helpers.h | 8 ++ > >>>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > >>>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > >>>> 7 files changed, 116 insertions(+), 13 deletions(-) > >>>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > >>>> > >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >>>> index 8790b3962e4b..49fc1989a548 100644 > >>>> --- a/include/uapi/linux/bpf.h > >>>> +++ b/include/uapi/linux/bpf.h > >>>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > >>>> __u64 __opaque[1]; > >>>> } __attribute__((aligned(8))); > >>>> +struct bpf_iter_task_vma { > >>>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > >>>> +} __attribute__((aligned(8))); > >>> > >>> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > >>> like > >>> struct bpf_iter_<...> { > >>> __u64 __opaque[1]; > >>> } __attribute__((aligned(8))); > >>> > >>> Maybe we want a generic one instead of having lots of > >>> structs with the same underline definition? For example, > >>> struct bpf_iter_generic > >>> ? > >>> > >> > >> The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > >> and associated kfuncs. 
Having a 'bpf_iter_generic' shared amongst multiple types > >> of iters would break the scheme. We could: > >> > >> * Add bpf_for_each_generic that only uses bpf_iter_generic > >> * This exposes implementation details in an ugly way, though. > >> * Do some macro magic to pick bpf_iter_generic for some types of iters, and > >> use consistent naming pattern for others. > >> * I'm not sure how to do this with preprocessor > >> * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > >> data struct, and use bpf_iter_generic for all of them > >> * Probably need to see more iter implementation / usage before making such > >> a change > >> * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > >> * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier > >> logic. Could do similar typedef w/ struct to try to work around > >> it. > >> > >> Let me know what you think. Personally I considered doing typedef while > >> implementing this, so that's the alternative I'd choose. 
> >> > >>>> + > >>>> #endif /* _UAPI__LINUX_BPF_H__ */ > >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >>>> index eb91cae0612a..7a06dea749f1 100644 > >>>> --- a/kernel/bpf/helpers.c > >>>> +++ b/kernel/bpf/helpers.c > >>>> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > >>>> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > >>>> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > >>>> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > >>>> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_null) > >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > >>>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > >>>> index c4ab9d6cdbe9..51c2dce435c1 100644 > >>>> --- a/kernel/bpf/task_iter.c > >>>> +++ b/kernel/bpf/task_iter.c > >>>> @@ -7,7 +7,9 @@ > >>>> #include <linux/fs.h> > >>>> #include <linux/fdtable.h> > >>>> #include <linux/filter.h> > >>>> +#include <linux/bpf_mem_alloc.h> > >>>> #include <linux/btf_ids.h> > >>>> +#include <linux/mm_types.h> > >>>> #include "mmap_unlock_work.h" > >>>> static const char * const iter_task_type_names[] = { > >>>> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > >>>> .arg5_type = ARG_ANYTHING, > >>>> }; > >>>> +struct bpf_iter_task_vma_kern_data { > >>>> + struct task_struct *task; > >>>> + struct mm_struct *mm; > >>>> + struct mmap_unlock_irq_work *work; > >>>> + struct vma_iterator vmi; > >>>> +}; > >>>> + > >>>> +/* Non-opaque version of uapi bpf_iter_task_vma */ > >>>> +struct bpf_iter_task_vma_kern { > >>>> + struct bpf_iter_task_vma_kern_data *data; > >>>> +} __attribute__((aligned(8))); > >>>> + > >>>> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > >>>> + 
struct task_struct *task, u64 addr) > >>>> +{ > >>>> + struct bpf_iter_task_vma_kern *kit = (void *)it; > >>>> + bool irq_work_busy = false; > >>>> + int err; > >>>> + > >>>> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); > >>>> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); > >>>> + > >>>> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized > >>>> + * before, so non-NULL kit->data doesn't point to previously > >>>> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data > >>>> + */ > >>>> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); > >>>> + if (!kit->data) > >>>> + return -ENOMEM; > >>>> + kit->data->task = NULL; > >>>> + > >>>> + if (!task) { > >>>> + err = -ENOENT; > >>>> + goto err_cleanup_iter; > >>>> + } > >>>> + > >>>> + kit->data->task = get_task_struct(task); > >>> > >>> The above is not safe. Since there is no restriction on 'task', > >>> the 'task' could be in a state for destruction with 'usage' count 0 > >>> and then get_task_struct(task) won't work since it unconditionally > >>> tries to increase 'usage' count from 0 to 1. > >>> > >>> Or, 'task' may be valid at the entry of the funciton, but when > >>> 'task' is in get_task_struct(), 'task' may have been destroyed > >>> and 'task' memory is reused by somebody else. > >>> > >>> I suggest that we check input parameter 'task' must be > >>> PTR_TRUSTED or MEM_RCU. This way, the above !task checking > >>> is not necessary and get_task_struct() can correctly > >>> hold a reference to 'task'. > >>> > >> > >> Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. I'm curious > >> whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID > >> to this kfunc currently. 
> >> > >> * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID > >> * ptr hop from trusted task_struct to 'real_parent' or similar should > >> yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def > >> * if task kptr is in map_val, direct reference to it should result > >> in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again > >> using bpf_task_from_pid (?) > >> > >> But regardless, better to be explicit. Will change. > > > > How horrible would it be to base an interface on TID/PID (i.e., int) > > as input argument to specify a task? I'm just thinking it might be > > more generic and easy to use in more situations: > > - for all the cases where we have struct task_struct, getting its > > pid is trivial: `task->pid`; > > - but in some situations PID might be coming from outside: either > > as an argument to CLI tool, or from old-style tracepoint (e.g., > > context_switch where we have prev/next task pid), etc. > > > > The downside is that we'd need to look up a task, right? But on the > > other hand we get more generality and won't have to rely on having > > PTR_TRUSTED task_struct. > > > > Thoughts? > > > > Meh, taking tid here feels like the 'old-school' approach, before recent > efforts to teach the verifier more about resource acquisition, locking, > iteration, trustedness, etc. All allowing us to push more important logic > out of 'opaque' helper impl and into BPF programs. > > In this tid -> struct task_struct case I think the provided resource acquisition > is sufficient. 
Using your examples: > > * We have a TRUSTED or RCU struct task_struct > * No need to do anything, can pass to bpf_iter_task_vma_new > > * We have a struct task_struct, but it's UNTRUSTED or has no trustedness > type flags > * Use bpf_task_acquire or bpf_task_from_pid > > * We just have a pid ('coming from outside') > * Use bpf_task_from_pid > > If there is some scenario where we can't get from pid to properly acquired task > in the BPF program, let's improve the resource acquisition instead of pushing > it into the kfunc impl. > > Also, should we look up (and refcount incr) the task using task->rcu_users > refcount w/ bpf_task_{acquire,release}, or using task->usage refcount w/ > {get,put}_task_struct ? More generally, if there are >1 valid ways to acquire > task_struct or some other resource, pushing acquisition to the BPF program > gives us the benefit of not having to pick one (or do possibly ugly / complex > flags). As long as type flags express that the resource will not go away, > this kfunc impl can ignore the details of how that property came about. Agree with above reasoning. pid->task should be separate set of kfunc. It's not a trivial task either. there are pid namespaces. there is u32 pid and there is 'struct pid'. u32_pid->struct_pid->task has to be designed first. Probably in some case the trusted task pointer is ready to be accessed, but context could be such that pid->task cannot be done. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 14:57 ` Alexei Starovoitov @ 2023-08-23 16:55 ` Andrii Nakryiko 0 siblings, 0 replies; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-23 16:55 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Marchevsky, Yonghong Song, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Stanislav Fomichev, Nathan Slingerland On Wed, Aug 23, 2023 at 7:58 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Aug 22, 2023 at 10:42 PM David Marchevsky > <david.marchevsky@linux.dev> wrote: > > > > On 8/22/23 8:04 PM, Andrii Nakryiko wrote: > > > On Tue, Aug 22, 2023 at 12:20 PM David Marchevsky > > > <david.marchevsky@linux.dev> wrote: > > >> > > >> On 8/22/23 1:42 PM, Yonghong Song wrote: > > >>> > > >>> > > >>> On 8/21/23 10:05 PM, Dave Marchevsky wrote: > > >>>> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > > >>>> creation and manipulation of struct bpf_iter_task_vma in open-coded > > >>>> iterator style. BPF programs can use these kfuncs directly or through > > >>>> bpf_for_each macro for natural-looking iteration of all task vmas. > > >>>> > > >>>> The implementation borrows heavily from bpf_find_vma helper's locking - > > >>>> differing only in that it holds the mmap_read lock for all iterations > > >>>> while the helper only executes its provided callback on a maximum of 1 > > >>>> vma. Aside from locking, struct vma_iterator and vma_next do all the > > >>>> heavy lifting. > > >>>> > > >>>> The newly-added struct bpf_iter_task_vma has a name collision with a > > >>>> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > > >>>> file is renamed in order to avoid the collision. > > >>>> > > >>>> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > > >>>> only field in struct bpf_iter_task_vma. 
This is because the inner data > > >>>> struct contains a struct vma_iterator (not ptr), whose size is likely to > > >>>> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > > >>>> such a change would require change in opaque bpf_iter_task_vma struct's > > >>>> size. So better to allocate vma_iterator using BPF allocator, and since > > >>>> that alloc must already succeed, might as well allocate all iter fields, > > >>>> thereby freezing struct bpf_iter_task_vma size. > > >>>> > > >>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > > >>>> Cc: Nathan Slingerland <slinger@meta.com> > > >>>> --- > > >>>> include/uapi/linux/bpf.h | 4 + > > >>>> kernel/bpf/helpers.c | 3 + > > >>>> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > > >>>> tools/include/uapi/linux/bpf.h | 4 + > > >>>> tools/lib/bpf/bpf_helpers.h | 8 ++ > > >>>> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > > >>>> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > > >>>> 7 files changed, 116 insertions(+), 13 deletions(-) > > >>>> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > >>>> > > >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >>>> index 8790b3962e4b..49fc1989a548 100644 > > >>>> --- a/include/uapi/linux/bpf.h > > >>>> +++ b/include/uapi/linux/bpf.h > > >>>> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > > >>>> __u64 __opaque[1]; > > >>>> } __attribute__((aligned(8))); > > >>>> +struct bpf_iter_task_vma { > > >>>> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > > >>>> +} __attribute__((aligned(8))); > > >>> > > >>> In the future, we might have bpf_iter_cgroup, bpf_iter_task, bpf_iter_cgroup_task, etc. They may all use the same struct > > >>> like > > >>> struct bpf_iter_<...> { > > >>> __u64 __opaque[1]; > > >>> } __attribute__((aligned(8))); > > >>> > > >>> Maybe we want a generic one instead of having lots of > > >>> structs with the same underline definition? 
For example, > > >>> struct bpf_iter_generic > > >>> ? > > >>> > > >> > > >> The bpf_for_each macro assumes a consistent naming scheme for opaque iter struct > > >> and associated kfuncs. Having a 'bpf_iter_generic' shared amongst multiple types > > >> of iters would break the scheme. We could: > > >> > > >> * Add bpf_for_each_generic that only uses bpf_iter_generic > > >> * This exposes implementation details in an ugly way, though. > > >> * Do some macro magic to pick bpf_iter_generic for some types of iters, and > > >> use consistent naming pattern for others. > > >> * I'm not sure how to do this with preprocessor > > >> * Migrate all opaque iter structs to only contain pointer to bpf_mem_alloc'd > > >> data struct, and use bpf_iter_generic for all of them > > >> * Probably need to see more iter implementation / usage before making such > > >> a change > > >> * Do 'typedef __u64 __aligned(8) bpf_iter_<...> > > >> * BTF_KIND_TYPEDEF intead of BTF_KIND_STRUCT might throw off some verifier > > >> logic. Could do similar typedef w/ struct to try to work around > > >> it. > > >> > > >> Let me know what you think. Personally I considered doing typedef while > > >> implementing this, so that's the alternative I'd choose. 
> > >> > > >>>> + > > >>>> #endif /* _UAPI__LINUX_BPF_H__ */ > > >>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > >>>> index eb91cae0612a..7a06dea749f1 100644 > > >>>> --- a/kernel/bpf/helpers.c > > >>>> +++ b/kernel/bpf/helpers.c > > >>>> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > > >>>> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > > >>>> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > > >>>> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > > >>>> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > > >>>> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > > >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_null) > > >>>> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > > >>>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > > >>>> index c4ab9d6cdbe9..51c2dce435c1 100644 > > >>>> --- a/kernel/bpf/task_iter.c > > >>>> +++ b/kernel/bpf/task_iter.c > > >>>> @@ -7,7 +7,9 @@ > > >>>> #include <linux/fs.h> > > >>>> #include <linux/fdtable.h> > > >>>> #include <linux/filter.h> > > >>>> +#include <linux/bpf_mem_alloc.h> > > >>>> #include <linux/btf_ids.h> > > >>>> +#include <linux/mm_types.h> > > >>>> #include "mmap_unlock_work.h" > > >>>> static const char * const iter_task_type_names[] = { > > >>>> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > > >>>> .arg5_type = ARG_ANYTHING, > > >>>> }; > > >>>> +struct bpf_iter_task_vma_kern_data { > > >>>> + struct task_struct *task; > > >>>> + struct mm_struct *mm; > > >>>> + struct mmap_unlock_irq_work *work; > > >>>> + struct vma_iterator vmi; > > >>>> +}; > > >>>> + > > >>>> +/* Non-opaque version of uapi bpf_iter_task_vma */ > > >>>> +struct bpf_iter_task_vma_kern { > > >>>> + struct bpf_iter_task_vma_kern_data *data; > > >>>> +} __attribute__((aligned(8))); > > >>>> + 
> > >>>> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > > >>>> + struct task_struct *task, u64 addr) > > >>>> +{ > > >>>> + struct bpf_iter_task_vma_kern *kit = (void *)it; > > >>>> + bool irq_work_busy = false; > > >>>> + int err; > > >>>> + > > >>>> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); > > >>>> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); > > >>>> + > > >>>> + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized > > >>>> + * before, so non-NULL kit->data doesn't point to previously > > >>>> + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data > > >>>> + */ > > >>>> + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); > > >>>> + if (!kit->data) > > >>>> + return -ENOMEM; > > >>>> + kit->data->task = NULL; > > >>>> + > > >>>> + if (!task) { > > >>>> + err = -ENOENT; > > >>>> + goto err_cleanup_iter; > > >>>> + } > > >>>> + > > >>>> + kit->data->task = get_task_struct(task); > > >>> > > >>> The above is not safe. Since there is no restriction on 'task', > > >>> the 'task' could be in a state for destruction with 'usage' count 0 > > >>> and then get_task_struct(task) won't work since it unconditionally > > >>> tries to increase 'usage' count from 0 to 1. > > >>> > > >>> Or, 'task' may be valid at the entry of the funciton, but when > > >>> 'task' is in get_task_struct(), 'task' may have been destroyed > > >>> and 'task' memory is reused by somebody else. > > >>> > > >>> I suggest that we check input parameter 'task' must be > > >>> PTR_TRUSTED or MEM_RCU. This way, the above !task checking > > >>> is not necessary and get_task_struct() can correctly > > >>> hold a reference to 'task'. > > >>> > > >> > > >> Adding a PTR_TRUSTED or MEM_RCU check seems reasonable. I'm curious > > >> whether there's any way to feed a 'plain' struct task_struct PTR_TO_BTF_ID > > >> to this kfunc currently. 
> > >> > > >> * bpf_get_current_task_btf helper returns PTR_TRUSTED | PTR_TO_BTF_ID > > >> * ptr hop from trusted task_struct to 'real_parent' or similar should > > >> yield MEM_RCU (due to BTF_TYPE_SAFE_RCU(struct task_struct) def > > >> * if task kptr is in map_val, direct reference to it should result > > >> in PTR_UNTRUSTED PTR_TO_BTF_ID, must kptr_xchg it or acquire again > > >> using bpf_task_from_pid (?) > > >> > > >> But regardless, better to be explicit. Will change. > > > > > > How horrible would it be to base an interface on TID/PID (i.e., int) > > > as input argument to specify a task? I'm just thinking it might be > > > more generic and easy to use in more situations: > > > - for all the cases where we have struct task_struct, getting its > > > pid is trivial: `task->pid`; > > > - but in some situations PID might be coming from outside: either > > > as an argument to CLI tool, or from old-style tracepoint (e.g., > > > context_switch where we have prev/next task pid), etc. > > > > > > The downside is that we'd need to look up a task, right? But on the > > > other hand we get more generality and won't have to rely on having > > > PTR_TRUSTED task_struct. > > > > > > Thoughts? > > > > > > > Meh, taking tid here feels like the 'old-school' approach, before recent > > efforts to teach the verifier more about resource acquisition, locking, > > iteration, trustedness, etc. All allowing us to push more important logic > > out of 'opaque' helper impl and into BPF programs. > > > > In this tid -> struct task_struct case I think the provided resource acquisition > > is sufficient. 
Using your examples: > > > > * We have a TRUSTED or RCU struct task_struct > > * No need to do anything, can pass to bpf_iter_task_vma_new > > > > * We have a struct task_struct, but it's UNTRUSTED or has no trustedness > > type flags > > * Use bpf_task_acquire or bpf_task_from_pid > > > > * We just have a pid ('coming from outside') > > * Use bpf_task_from_pid > > > > If there is some scenario where we can't get from pid to properly acquired task > > in the BPF program, let's improve the resource acquisition instead of pushing > > it into the kfunc impl. > > > > Also, should we look up (and refcount incr) the task using task->rcu_users > > refcount w/ bpf_task_{acquire,release}, or using task->usage refcount w/ > > {get,put}_task_struct ? More generally, if there are >1 valid ways to acquire > > task_struct or some other resource, pushing acquisition to the BPF program > > gives us the benefit of not having to pick one (or do possibly ugly / complex > > flags). As long as type flags express that the resource will not go away, > > this kfunc impl can ignore the details of how that property came about. > > Agree with above reasoning. pid->task should be separate set of kfunc. > It's not a trivial task either. there are pid namespaces. there is u32 pid > and there is 'struct pid'. u32_pid->struct_pid->task has to be designed first. > Probably in some case the trusted task pointer is ready to be accessed, > but context could be such that pid->task cannot be done. makes sense, having bpf_task_from_pid() removes my concerns, I forgot about it. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 5:05 ` [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs Dave Marchevsky 2023-08-22 17:42 ` Yonghong Song @ 2023-08-22 23:52 ` Andrii Nakryiko 2023-08-23 7:26 ` David Marchevsky 1 sibling, 1 reply; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-22 23:52 UTC (permalink / raw) To: Dave Marchevsky Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, yonghong.song, sdf, Nathan Slingerland On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote: > > This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > creation and manipulation of struct bpf_iter_task_vma in open-coded > iterator style. BPF programs can use these kfuncs directly or through > bpf_for_each macro for natural-looking iteration of all task vmas. > > The implementation borrows heavily from bpf_find_vma helper's locking - > differing only in that it holds the mmap_read lock for all iterations > while the helper only executes its provided callback on a maximum of 1 > vma. Aside from locking, struct vma_iterator and vma_next do all the > heavy lifting. > > The newly-added struct bpf_iter_task_vma has a name collision with a > selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > file is renamed in order to avoid the collision. > > A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > only field in struct bpf_iter_task_vma. This is because the inner data > struct contains a struct vma_iterator (not ptr), whose size is likely to > change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > such a change would require change in opaque bpf_iter_task_vma struct's > size. 
So better to allocate vma_iterator using BPF allocator, and since > that alloc must already succeed, might as well allocate all iter fields, > thereby freezing struct bpf_iter_task_vma size. > > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > Cc: Nathan Slingerland <slinger@meta.com> > --- > include/uapi/linux/bpf.h | 4 + > kernel/bpf/helpers.c | 3 + > kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 4 + > tools/lib/bpf/bpf_helpers.h | 8 ++ > .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > 7 files changed, 116 insertions(+), 13 deletions(-) > rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 8790b3962e4b..49fc1989a548 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > __u64 __opaque[1]; > } __attribute__((aligned(8))); > > +struct bpf_iter_task_vma { > + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > +} __attribute__((aligned(8))); > + > #endif /* _UAPI__LINUX_BPF_H__ */ > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index eb91cae0612a..7a06dea749f1 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > BTF_ID_FLAGS(func, bpf_dynptr_adjust) > BTF_ID_FLAGS(func, bpf_dynptr_is_null) > BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > index 
c4ab9d6cdbe9..51c2dce435c1 100644 > --- a/kernel/bpf/task_iter.c > +++ b/kernel/bpf/task_iter.c > @@ -7,7 +7,9 @@ > #include <linux/fs.h> > #include <linux/fdtable.h> > #include <linux/filter.h> > +#include <linux/bpf_mem_alloc.h> > #include <linux/btf_ids.h> > +#include <linux/mm_types.h> > #include "mmap_unlock_work.h" > > static const char * const iter_task_type_names[] = { > @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > .arg5_type = ARG_ANYTHING, > }; > > +struct bpf_iter_task_vma_kern_data { > + struct task_struct *task; > + struct mm_struct *mm; > + struct mmap_unlock_irq_work *work; > + struct vma_iterator vmi; > +}; > + > +/* Non-opaque version of uapi bpf_iter_task_vma */ > +struct bpf_iter_task_vma_kern { > + struct bpf_iter_task_vma_kern_data *data; > +} __attribute__((aligned(8))); > + it's a bit worrying that we'll rely on memory allocation inside NMI to be able to use this. I'm missing previous email discussion (I declared email bankruptcy after long vacation), but I suppose the option to fix bpf_iter_task_vma (to 88 bytes: 64 for vma_iterator + 24 for extra pointers), or even to 96 to have a bit of headroom in case we need a bit more space was rejected? It seems unlikely that vma_iterator will have to grow, but if it does, it has 5 bytes of padding right now for various flags, plus we can have extra 8 bytes reserved just in case. I know it's a big struct and will take a big chunk of the BPF stack, but I'm a bit worried about both the performance implication of mem alloc under NMI, and allocation failing. Maybe the worry is overblown, but I thought I'll bring it up anyways. > +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > + struct task_struct *task, u64 addr) > +{ > + struct bpf_iter_task_vma_kern *kit = (void *)it; > + bool irq_work_busy = false; > + int err; > + [...] 
> static void do_mmap_read_unlock(struct irq_work *entry) > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 8790b3962e4b..49fc1989a548 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > __u64 __opaque[1]; > } __attribute__((aligned(8))); > > +struct bpf_iter_task_vma { > + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > +} __attribute__((aligned(8))); > + > #endif /* _UAPI__LINUX_BPF_H__ */ > diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h > index bbab9ad9dc5a..d885ffee4d88 100644 > --- a/tools/lib/bpf/bpf_helpers.h > +++ b/tools/lib/bpf/bpf_helpers.h > @@ -302,6 +302,14 @@ extern int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) __weak > extern int *bpf_iter_num_next(struct bpf_iter_num *it) __weak __ksym; > extern void bpf_iter_num_destroy(struct bpf_iter_num *it) __weak __ksym; > > +struct bpf_iter_task_vma; > + > +extern int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > + struct task_struct *task, > + unsigned long addr) __weak __ksym; > +extern struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) __weak __ksym; > +extern void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __weak __ksym; my intent wasn't to add all open-coded iterators to bpf_helpers.h. I think bpf_iter_num_* is rather an exception and isn't supposed to ever change or be removed, while other iterators should be allowed to be changed. The goal is for all such kfuncs (and struct bpf_iter_task_vma state itself, probably) to come from vmlinux.h, eventually, so let's leave it out of libbpf's stable bpf_helpers.h header. [...] 
> @@ -1533,7 +1533,7 @@ static void test_task_vma_dead_task(void) > out: > waitpid(child_pid, &wstatus, 0); > close(iter_fd); > - bpf_iter_task_vma__destroy(skel); > + bpf_iter_task_vmas__destroy(skel); > } > > void test_bpf_sockmap_map_iter_fd(void) > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c > similarity index 100% > rename from tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c > rename to tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-22 23:52 ` Andrii Nakryiko @ 2023-08-23 7:26 ` David Marchevsky 2023-08-23 15:03 ` Alexei Starovoitov 2023-08-23 17:07 ` Andrii Nakryiko 0 siblings, 2 replies; 25+ messages in thread From: David Marchevsky @ 2023-08-23 7:26 UTC (permalink / raw) To: Andrii Nakryiko, Dave Marchevsky Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, yonghong.song, sdf, Nathan Slingerland On 8/22/23 7:52 PM, Andrii Nakryiko wrote: > On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote: >> >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow >> creation and manipulation of struct bpf_iter_task_vma in open-coded >> iterator style. BPF programs can use these kfuncs directly or through >> bpf_for_each macro for natural-looking iteration of all task vmas. >> >> The implementation borrows heavily from bpf_find_vma helper's locking - >> differing only in that it holds the mmap_read lock for all iterations >> while the helper only executes its provided callback on a maximum of 1 >> vma. Aside from locking, struct vma_iterator and vma_next do all the >> heavy lifting. >> >> The newly-added struct bpf_iter_task_vma has a name collision with a >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs >> file is renamed in order to avoid the collision. >> >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the >> only field in struct bpf_iter_task_vma. This is because the inner data >> struct contains a struct vma_iterator (not ptr), whose size is likely to >> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly >> such a change would require change in opaque bpf_iter_task_vma struct's >> size. 
So better to allocate vma_iterator using BPF allocator, and since >> that alloc must already succeed, might as well allocate all iter fields, >> thereby freezing struct bpf_iter_task_vma size. >> >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> >> Cc: Nathan Slingerland <slinger@meta.com> >> --- >> include/uapi/linux/bpf.h | 4 + >> kernel/bpf/helpers.c | 3 + >> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ >> tools/include/uapi/linux/bpf.h | 4 + >> tools/lib/bpf/bpf_helpers.h | 8 ++ >> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- >> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 >> 7 files changed, 116 insertions(+), 13 deletions(-) >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 8790b3962e4b..49fc1989a548 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { >> __u64 __opaque[1]; >> } __attribute__((aligned(8))); >> >> +struct bpf_iter_task_vma { >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ >> +} __attribute__((aligned(8))); >> + >> #endif /* _UAPI__LINUX_BPF_H__ */ >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index eb91cae0612a..7a06dea749f1 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) >> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) >> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) >> BTF_ID_FLAGS(func, bpf_dynptr_adjust) >> BTF_ID_FLAGS(func, bpf_dynptr_is_null) >> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) >> diff --git 
a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c >> index c4ab9d6cdbe9..51c2dce435c1 100644 >> --- a/kernel/bpf/task_iter.c >> +++ b/kernel/bpf/task_iter.c >> @@ -7,7 +7,9 @@ >> #include <linux/fs.h> >> #include <linux/fdtable.h> >> #include <linux/filter.h> >> +#include <linux/bpf_mem_alloc.h> >> #include <linux/btf_ids.h> >> +#include <linux/mm_types.h> >> #include "mmap_unlock_work.h" >> >> static const char * const iter_task_type_names[] = { >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { >> .arg5_type = ARG_ANYTHING, >> }; >> >> +struct bpf_iter_task_vma_kern_data { >> + struct task_struct *task; >> + struct mm_struct *mm; >> + struct mmap_unlock_irq_work *work; >> + struct vma_iterator vmi; >> +}; >> + >> +/* Non-opaque version of uapi bpf_iter_task_vma */ >> +struct bpf_iter_task_vma_kern { >> + struct bpf_iter_task_vma_kern_data *data; >> +} __attribute__((aligned(8))); >> + > > it's a bit worrying that we'll rely on memory allocation inside NMI to > be able to use this. I'm missing previous email discussion (I declared > email bankruptcy after long vacation), but I suppose the option to fix > bpf_iter_task_vma (to 88 bytes: 64 for vma_iterator + 24 for extra > pointers), or even to 96 to have a bit of headroom in case we need a > bit more space was rejected? It seems unlikely that vma_iterator will > have to grow, but if it does, it has 5 bytes of padding right now for > various flags, plus we can have extra 8 bytes reserved just in case. > > I know it's a big struct and will take a big chunk of the BPF stack, > but I'm a bit worried about both the performance implication of mem > alloc under NMI, and allocation failing. > > Maybe the worry is overblown, but I thought I'll bring it up anyways. > Few tangential trains of thought here, separated by multiple newlines for easier reading. IIUC the any-context BPF allocator will not actually allocate memory in NMI context, instead relying on its existing pre-filled caches. 
Alexei's patch adding the allocator says ([0]): The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general. So it sounds like bpf_mem_alloc in a kfunc called by a BPF program is NMI-safe. That's not to say that I'm happy about adding a fallible bpf_mem_alloc call here before the kfunc can do anything useful. But it seems like the best way to guarantee that we never see a mailing list message like: Hello, I just added a field to 'struct ma_state' in my subsystem and it seems I've triggered a BUILD_BUG_ON in this far-away BPF subsystem. It looks like you're making stability guarantees based on the size of my internal struct. What the hell? Sure, after I remove the kfuncs and struct bpf_iter_task_vma fwd decl from bpf_helpers.h - per your other comment below - we can do the whole "kfuncs aren't uapi and this struct bpf_iter_task_vma is coming from vmlinux.h, not some stable header" spiel and convince this hypothetical person. Not having to do the spiel here reinforces the more general "Modern BPF exposes functionality w/ kfuncs and kptrs, which are inherently _unstable_" messaging more effectively than having to explain. If we go back to putting struct vma_iterator on the BPF stack, I think we definitely want to keep the BUILD_BUG_ON. If it were removed and vma_iterator size changes, that would affect portability of BPF programs that assume the old size of bpf_iter_task_vma, no? Which bpf_for_each is doing since it puts bpf_iter_task_vma on the stack. Is there some CO-RE technique that would handle the above scenario portably? I can't think of anything straightforward. Maybe if BPF prog BTF only had a fwd decl for bpf_iter_task_vma, and size thus had to be taken from vmlinux BTF. But that would fail to compile since the struct goes on the stack. Maybe use some placeholder size for compilation and use BTF tag to tell libbpf to patch insns w/ vmlinux's size for this struct? Re: padding bytes, seems worse to me than not using them. 
Have to make assumptions about far-away struct, specifically vma_iterator which landed quite recently as part of maple tree series. The assumptions don't prevent my hypothetical mailing list confusion from happening, increases the confusion if it does happen ("I added a small field recently, why didn't this break then? If it's explicitly and intentionally unstable, why add padding bytes?") [0]: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com >> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, >> + struct task_struct *task, u64 addr) >> +{ >> + struct bpf_iter_task_vma_kern *kit = (void *)it; >> + bool irq_work_busy = false; >> + int err; >> + > > [...] > >> static void do_mmap_read_unlock(struct irq_work *entry) >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> index 8790b3962e4b..49fc1989a548 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { >> __u64 __opaque[1]; >> } __attribute__((aligned(8))); >> >> +struct bpf_iter_task_vma { >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ >> +} __attribute__((aligned(8))); >> + >> #endif /* _UAPI__LINUX_BPF_H__ */ >> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h >> index bbab9ad9dc5a..d885ffee4d88 100644 >> --- a/tools/lib/bpf/bpf_helpers.h >> +++ b/tools/lib/bpf/bpf_helpers.h >> @@ -302,6 +302,14 @@ extern int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) __weak >> extern int *bpf_iter_num_next(struct bpf_iter_num *it) __weak __ksym; >> extern void bpf_iter_num_destroy(struct bpf_iter_num *it) __weak __ksym; >> >> +struct bpf_iter_task_vma; >> + >> +extern int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, >> + struct task_struct *task, >> + unsigned long addr) __weak __ksym; >> +extern struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) __weak __ksym; >> +extern void 
bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __weak __ksym; > > my intent wasn't to add all open-coded iterators to bpf_helpers.h. I > think bpf_iter_num_* is rather an exception and isn't supposed to ever > change or be removed, while other iterators should be allowed to be > changed. > > The goal is for all such kfuncs (and struct bpf_iter_task_vma state > itself, probably) to come from vmlinux.h, eventually, so let's leave > it out of libbpf's stable bpf_helpers.h header. > > > [...] As alluded to in my long response above, this sounds reasonable, will remove from here. > >> @@ -1533,7 +1533,7 @@ static void test_task_vma_dead_task(void) >> out: >> waitpid(child_pid, &wstatus, 0); >> close(iter_fd); >> - bpf_iter_task_vma__destroy(skel); >> + bpf_iter_task_vmas__destroy(skel); >> } >> >> void test_bpf_sockmap_map_iter_fd(void) >> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c >> similarity index 100% >> rename from tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c >> rename to tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c >> -- >> 2.34.1 >> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 7:26 ` David Marchevsky @ 2023-08-23 15:03 ` Alexei Starovoitov 2023-08-23 17:14 ` Andrii Nakryiko 2023-08-23 17:07 ` Andrii Nakryiko 1 sibling, 1 reply; 25+ messages in thread From: Alexei Starovoitov @ 2023-08-23 15:03 UTC (permalink / raw) To: David Marchevsky Cc: Andrii Nakryiko, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Yonghong Song, Stanislav Fomichev, Nathan Slingerland On Wed, Aug 23, 2023 at 12:26 AM David Marchevsky <david.marchevsky@linux.dev> wrote: > > On 8/22/23 7:52 PM, Andrii Nakryiko wrote: > > On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote: > >> > >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > >> creation and manipulation of struct bpf_iter_task_vma in open-coded > >> iterator style. BPF programs can use these kfuncs directly or through > >> bpf_for_each macro for natural-looking iteration of all task vmas. > >> > >> The implementation borrows heavily from bpf_find_vma helper's locking - > >> differing only in that it holds the mmap_read lock for all iterations > >> while the helper only executes its provided callback on a maximum of 1 > >> vma. Aside from locking, struct vma_iterator and vma_next do all the > >> heavy lifting. > >> > >> The newly-added struct bpf_iter_task_vma has a name collision with a > >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > >> file is renamed in order to avoid the collision. > >> > >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > >> only field in struct bpf_iter_task_vma. This is because the inner data > >> struct contains a struct vma_iterator (not ptr), whose size is likely to > >> change under us. 
If bpf_iter_task_vma_kern contained vma_iterator directly > >> such a change would require change in opaque bpf_iter_task_vma struct's > >> size. So better to allocate vma_iterator using BPF allocator, and since > >> that alloc must already succeed, might as well allocate all iter fields, > >> thereby freezing struct bpf_iter_task_vma size. > >> > >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > >> Cc: Nathan Slingerland <slinger@meta.com> > >> --- > >> include/uapi/linux/bpf.h | 4 + > >> kernel/bpf/helpers.c | 3 + > >> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > >> tools/include/uapi/linux/bpf.h | 4 + > >> tools/lib/bpf/bpf_helpers.h | 8 ++ > >> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > >> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > >> 7 files changed, 116 insertions(+), 13 deletions(-) > >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > >> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 8790b3962e4b..49fc1989a548 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > >> __u64 __opaque[1]; > >> } __attribute__((aligned(8))); > >> > >> +struct bpf_iter_task_vma { > >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > >> +} __attribute__((aligned(8))); > >> + > >> #endif /* _UAPI__LINUX_BPF_H__ */ > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >> index eb91cae0612a..7a06dea749f1 100644 > >> --- a/kernel/bpf/helpers.c > >> +++ b/kernel/bpf/helpers.c > >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > >> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > >> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > >> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) 
> >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > >> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > >> BTF_ID_FLAGS(func, bpf_dynptr_is_null) > >> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > >> index c4ab9d6cdbe9..51c2dce435c1 100644 > >> --- a/kernel/bpf/task_iter.c > >> +++ b/kernel/bpf/task_iter.c > >> @@ -7,7 +7,9 @@ > >> #include <linux/fs.h> > >> #include <linux/fdtable.h> > >> #include <linux/filter.h> > >> +#include <linux/bpf_mem_alloc.h> > >> #include <linux/btf_ids.h> > >> +#include <linux/mm_types.h> > >> #include "mmap_unlock_work.h" > >> > >> static const char * const iter_task_type_names[] = { > >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > >> .arg5_type = ARG_ANYTHING, > >> }; > >> > >> +struct bpf_iter_task_vma_kern_data { > >> + struct task_struct *task; > >> + struct mm_struct *mm; > >> + struct mmap_unlock_irq_work *work; > >> + struct vma_iterator vmi; > >> +}; > >> + > >> +/* Non-opaque version of uapi bpf_iter_task_vma */ > >> +struct bpf_iter_task_vma_kern { > >> + struct bpf_iter_task_vma_kern_data *data; > >> +} __attribute__((aligned(8))); > >> + > > > > it's a bit worrying that we'll rely on memory allocation inside NMI to > > be able to use this. I'm missing previous email discussion (I declared > > email bankruptcy after long vacation), but I suppose the option to fix > > bpf_iter_task_vma (to 88 bytes: 64 for vma_iterator + 24 for extra > > pointers), or even to 96 to have a bit of headroom in case we need a > > bit more space was rejected? It seems unlikely that vma_iterator will > > have to grow, but if it does, it has 5 bytes of padding right now for > > various flags, plus we can have extra 8 bytes reserved just in case. > > > > I know it's a big struct and will take a big chunk of the BPF stack, > > but I'm a bit worried about both the performance implication of mem > > alloc under NMI, and allocation failing. 
> > > > Maybe the worry is overblown, but I thought I'll bring it up anyways. > > > > Few tangential trains of thought here, separated by multiple newlines > for easier reading. > > > IIUC the any-context BPF allocator will not actually allocate memory in NMI > context, instead relying on its existing pre-filled caches. > > Alexei's patch adding the allocator says ([0]): > > The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general. > > So sounds bpf_mem_alloc in a kfunc called by a BPF program is NMI-safe. > > > That's not to say that I'm happy about adding a fallible bpf_mem_alloc call here > before the kfunc can do anything useful. But it seems like the best way to > guarantee that we never see a mailing list message like: > > Hello, I just added a field to 'struct ma_state' in my subsystem and it seems > I've triggered a BUILD_BUG_ON in this far-away BPF subsystem. It looks like > you're making stability guarantees based on the size of my internal struct. > What the hell? > > Sure, after I remove the kfuncs and struct bpf_iter_task_vma fwd decl from > bpf_helpers.h - per your other comment below - we can do the whole "kfuncs > aren't uapi and this struct bpf_iter_task_vma is coming from vmlinux.h, > not some stable header" spiel and convince this hypothetical person. Not having > to do the spiel here reinforces the more general "Modern BPF exposes > functionality w/ kfuncs and kptrs, which are inherently _unstable_" messaging > more effectively than having to explain. > > > If we go back to putting struct vma_iterator on the BPF stack, I think we > definitely want to keep the BUILD_BUG_ON. If it were removed and vma_iterator > size changes, that would affect portability of BPF programs that assume the old > size of bpf_iter_task_vma, no? Which bpf_for_each is doing since it puts > bpf_iter_task_vma on the stack. > > Is there some CO-RE technique that would handle above scenario portably? I > can't think of anything straightforward. 
Maybe if BPF prog BTF only had > a fwd decl for bpf_iter_task_vma, and size thus had to be taken from > vmlinux BTF. But that would fail to compile since the struct goes > on the stack. Maybe use some placeholder size for compilation and use > BTF tag to tell libbpf to patch insns w/ vmlinux's size for this struct? > > > Re: padding bytes, seems worse to me than not using them. Have to make > assumptions about far-away struct, specifically vma_iterator > which landed quite recently as part of maple tree series. The assumptions > don't prevent my hypothetical mailing list confusion from happening, and increase > the confusion if it does happen ("I added a small field recently, why didn't > this break then? If it's explicitly and intentionally unstable, why add > padding bytes?") > > [0]: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com +1 imo struct bpf_iter_task_vma_kern is a bit too big to put on bpf prog stack. bpf_mem_alloc short term is a lesser evil imo. Long term we need to think how to extend bpf ISA with alloca. It's time to figure out how to grow the stack. ^ permalink raw reply [flat|nested] 25+ messages in thread
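The size-freezing pattern being debated above can be sketched in plain userspace C. This is only an illustration of the indirection idea, not kernel code: malloc() stands in for bpf_mem_alloc(), and a few ints stand in for the real task/mm/work/vmi fields.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Opaque struct seen by programs: one u64, size frozen for good. */
struct bpf_iter_task_vma {
	uint64_t __opaque[1];
} __attribute__((aligned(8)));

/* Inner data is free to grow (in the kernel: task, mm, work, vmi).
 * Here two ints and a scratch slot stand in for the real fields.
 */
struct bpf_iter_task_vma_kern_data {
	int cur;
	int end;
	int last;
};

/* Kernel's non-opaque view: nothing but a pointer to the data. */
struct bpf_iter_task_vma_kern {
	struct bpf_iter_task_vma_kern_data *data;
} __attribute__((aligned(8)));

/* Growing kern_data can never change the opaque struct's size. */
_Static_assert(sizeof(struct bpf_iter_task_vma_kern) ==
	       sizeof(struct bpf_iter_task_vma), "opaque size must stay frozen");

static int iter_new(struct bpf_iter_task_vma *it, int start, int end)
{
	struct bpf_iter_task_vma_kern *kit = (void *)it;

	kit->data = malloc(sizeof(*kit->data)); /* kernel: bpf_mem_alloc(), may fail */
	if (!kit->data)
		return -ENOMEM;
	kit->data->cur = start;
	kit->data->end = end;
	return 0;
}

static int *iter_next(struct bpf_iter_task_vma *it)
{
	struct bpf_iter_task_vma_kern *kit = (void *)it;

	if (kit->data->cur >= kit->data->end)
		return NULL;
	kit->data->last = kit->data->cur++;
	return &kit->data->last;
}

static void iter_destroy(struct bpf_iter_task_vma *it)
{
	struct bpf_iter_task_vma_kern *kit = (void *)it;

	free(kit->data); /* kernel: bpf_mem_free() */
}
```

The tradeoff shown here is exactly the one discussed in the thread: the _Static_assert can never fire no matter how the inner data grows, at the price of a fallible allocation in new().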
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 15:03 ` Alexei Starovoitov @ 2023-08-23 17:14 ` Andrii Nakryiko 2023-08-23 17:53 ` Alexei Starovoitov 0 siblings, 1 reply; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-23 17:14 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Yonghong Song, Stanislav Fomichev, Nathan Slingerland On Wed, Aug 23, 2023 at 8:03 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Aug 23, 2023 at 12:26 AM David Marchevsky > <david.marchevsky@linux.dev> wrote: > > > > On 8/22/23 7:52 PM, Andrii Nakryiko wrote: > > > On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote: > > >> > > >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > > >> creation and manipulation of struct bpf_iter_task_vma in open-coded > > >> iterator style. BPF programs can use these kfuncs directly or through > > >> bpf_for_each macro for natural-looking iteration of all task vmas. > > >> > > >> The implementation borrows heavily from bpf_find_vma helper's locking - > > >> differing only in that it holds the mmap_read lock for all iterations > > >> while the helper only executes its provided callback on a maximum of 1 > > >> vma. Aside from locking, struct vma_iterator and vma_next do all the > > >> heavy lifting. > > >> > > >> The newly-added struct bpf_iter_task_vma has a name collision with a > > >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > > >> file is renamed in order to avoid the collision. > > >> > > >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > > >> only field in struct bpf_iter_task_vma. This is because the inner data > > >> struct contains a struct vma_iterator (not ptr), whose size is likely to > > >> change under us. 
If bpf_iter_task_vma_kern contained vma_iterator directly > > >> such a change would require change in opaque bpf_iter_task_vma struct's > > >> size. So better to allocate vma_iterator using BPF allocator, and since > > >> that alloc must already succeed, might as well allocate all iter fields, > > >> thereby freezing struct bpf_iter_task_vma size. > > >> > > >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > > >> Cc: Nathan Slingerland <slinger@meta.com> > > >> --- > > >> include/uapi/linux/bpf.h | 4 + > > >> kernel/bpf/helpers.c | 3 + > > >> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > > >> tools/include/uapi/linux/bpf.h | 4 + > > >> tools/lib/bpf/bpf_helpers.h | 8 ++ > > >> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > > >> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > > >> 7 files changed, 116 insertions(+), 13 deletions(-) > > >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > >> > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >> index 8790b3962e4b..49fc1989a548 100644 > > >> --- a/include/uapi/linux/bpf.h > > >> +++ b/include/uapi/linux/bpf.h > > >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > > >> __u64 __opaque[1]; > > >> } __attribute__((aligned(8))); > > >> > > >> +struct bpf_iter_task_vma { > > >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > > >> +} __attribute__((aligned(8))); > > >> + > > >> #endif /* _UAPI__LINUX_BPF_H__ */ > > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > >> index eb91cae0612a..7a06dea749f1 100644 > > >> --- a/kernel/bpf/helpers.c > > >> +++ b/kernel/bpf/helpers.c > > >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > > >> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > > >> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > > >> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, 
KF_ITER_NEW) > > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > > >> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > > >> BTF_ID_FLAGS(func, bpf_dynptr_is_null) > > >> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > > >> index c4ab9d6cdbe9..51c2dce435c1 100644 > > >> --- a/kernel/bpf/task_iter.c > > >> +++ b/kernel/bpf/task_iter.c > > >> @@ -7,7 +7,9 @@ > > >> #include <linux/fs.h> > > >> #include <linux/fdtable.h> > > >> #include <linux/filter.h> > > >> +#include <linux/bpf_mem_alloc.h> > > >> #include <linux/btf_ids.h> > > >> +#include <linux/mm_types.h> > > >> #include "mmap_unlock_work.h" > > >> > > >> static const char * const iter_task_type_names[] = { > > >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > > >> .arg5_type = ARG_ANYTHING, > > >> }; > > >> > > >> +struct bpf_iter_task_vma_kern_data { > > >> + struct task_struct *task; > > >> + struct mm_struct *mm; > > >> + struct mmap_unlock_irq_work *work; > > >> + struct vma_iterator vmi; > > >> +}; > > >> + > > >> +/* Non-opaque version of uapi bpf_iter_task_vma */ > > >> +struct bpf_iter_task_vma_kern { > > >> + struct bpf_iter_task_vma_kern_data *data; > > >> +} __attribute__((aligned(8))); > > >> + > > > > > > it's a bit worrying that we'll rely on memory allocation inside NMI to > > > be able to use this. I'm missing previous email discussion (I declared > > > email bankruptcy after long vacation), but I suppose the option to fix > > > bpf_iter_task_vma (to 88 bytes: 64 for vma_iterator + 24 for extra > > > pointers), or even to 96 to have a bit of headroom in case we need a > > > bit more space was rejected? It seems unlikely that vma_iterator will > > > have to grow, but if it does, it has 5 bytes of padding right now for > > > various flags, plus we can have extra 8 bytes reserved just in case. 
> > > > > > I know it's a big struct and will take a big chunk of the BPF stack, > > > but I'm a bit worried about both the performance implication of mem > > > alloc under NMI, and allocation failing. > > > > > > Maybe the worry is overblown, but I thought I'll bring it up anyways. > > > > > > > Few tangential trains of thought here, separated by multiple newlines > > for easier reading. > > > > > > IIUC the any-context BPF allocator will not actually allocate memory in NMI > > context, instead relying on its existing pre-filled caches. > > > > Alexei's patch adding the allocator says ([0]): > > > > The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general. > > > > So sounds bpf_mem_alloc in a kfunc called by a BPF program is NMI-safe. > > > > > > That's not to say that I'm happy about adding a fallible bpf_mem_alloc call here > > before the kfunc can do anything useful. But it seems like the best way to > > guarantee that we never see a mailing list message like: > > > > Hello, I just added a field to 'struct ma_state' in my subsystem and it seems > > I've triggered a BUILD_BUG_ON in this far-away BPF subsystem. It looks like > > you're making stability guarantees based on the size of my internal struct. > > What the hell? > > > > Sure, after I remove the kfuncs and struct bpf_iter_task_vma fwd decl from > > bpf_helpers.h - per your other comment below - we can do the whole "kfuncs > > aren't uapi and this struct bpf_iter_task_vma is coming from vmlinux.h, > > not some stable header" spiel and convince this hypothetical person. Not having > > to do the spiel here reinforces the more general "Modern BPF exposes > > functionality w/ kfuncs and kptrs, which are inherently _unstable_" messaging > > more effectively than having to explain. > > > > > > If we go back to putting struct vma_iterator on the BPF stack, I think we > > definitely want to keep the BUILD_BUG_ON. 
If it were removed and vma_iterator > > size changes, that would affect portability of BPF programs that assume the old > > size of bpf_iter_task_vma, no? Which bpf_for_each is doing since it puts > > bpf_iter_task_vma on the stack. > > > > Is there some CO-RE technique that would handle above scenario portably? I > > can't think of anything straightforward. Maybe if BPF prog BTF only had > > a fwd decl for bpf_iter_task_vma, and size thus had to be taken from > > vmlinux BTF. But that would fail to compile since the struct goes > > on the stack. Maybe use some placeholder size for compilation and use > > BTF tag to tell libbpf to patch insns w/ vmlinux's size for this struct? > > > > > > Re: padding bytes, seems worse to me than not using them. Have to make > > assumptions about far-away struct, specifically vma_iterator > > which landed quite recently as part of maple tree series. The assumptions > > don't prevent my hypothetical mailing list confusion from happening, and increase > > the confusion if it does happen ("I added a small field recently, why didn't > > this break then? If it's explicitly and intentionally unstable, why add > > padding bytes?") > > > > [0]: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com > > +1 > imo struct bpf_iter_task_vma_kern is a bit too big to put on bpf prog stack. > bpf_mem_alloc short term is a lesser evil imo. Yep, a bit big, though still reasonable to put on the stack. As I mentioned above, my worry is the unreliability of being able to instantiate task_vma iterator in NMI context (which is probably where this will be used often). We'll see if this is actually a problem in practice once this is used and deployed fleet-wide. If it is, we might want to revisit it by introducing v2.
Unless you envision a protocol where we have a separate function to get size of iter struct, then alloca enough space, then pass that to bpf_iter_xxx_new()? Not sure whether this is statically verifiable, but given it's long-term, we can put it on the backburner for now. > It's time to figure out how to grow the stack. We effectively already grow the stack based on exact prog's usage. It's more of "how to support a bigger stack" during verification with reasonable increase in memory and cpu usage. Anyways, as I said, I suspected the answer would be what it is, but documenting the reasons and what alternatives were considered is still worthwhile, IMO. Thanks for the discussion. ^ permalink raw reply [flat|nested] 25+ messages in thread
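The commit message quoted above notes that programs can call the kfuncs directly or through the bpf_for_each macro. A hedged userspace sketch of that new/next/destroy protocol follows; the toy_* names are made up, and unlike this toy, the real libbpf macro also has to cope with a failing new() (which is the whole bpf_mem_alloc concern in this thread).

```c
#include <stddef.h>

/* Toy iterator state; in the real patch this would be the opaque
 * struct bpf_iter_task_vma, and next() would return a vm_area_struct *.
 */
struct toy_iter {
	int cur;
	int end;
	int val;
};

static int *toy_iter_next(struct toy_iter *it)
{
	if (it->cur >= it->end)
		return NULL;
	it->val = it->cur++;
	return &it->val;
}

/* cleanup() handler: runs when the loop variable goes out of scope,
 * which is how libbpf-style for-each macros can guarantee destroy()
 * runs even on break/return.
 */
static void toy_iter_destroy(struct toy_iter *it)
{
	it->cur = it->end; /* the real destroy kfunc drops the mmap lock and frees state */
}

/* Rough shape of a bpf_for_each()-style macro: init, next-until-NULL, destroy. */
#define toy_for_each(v, start, end)					\
	for (struct toy_iter ___it					\
		__attribute__((cleanup(toy_iter_destroy))) =		\
		{ .cur = (start), .end = (end) };			\
	     ((v) = toy_iter_next(&___it)); )

static int sum_range(int start, int end)
{
	int sum = 0, *v;

	toy_for_each(v, start, end)
		sum += *v;
	return sum;
}
```

The cleanup attribute is a GCC/Clang extension; it is used here only to mirror how the destroy step is tied to scope exit rather than to the loop condition.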
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 17:14 ` Andrii Nakryiko @ 2023-08-23 17:53 ` Alexei Starovoitov 2023-08-23 18:13 ` Andrii Nakryiko 0 siblings, 1 reply; 25+ messages in thread From: Alexei Starovoitov @ 2023-08-23 17:53 UTC (permalink / raw) To: Andrii Nakryiko Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Yonghong Song, Stanislav Fomichev, Nathan Slingerland On Wed, Aug 23, 2023 at 10:14 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > Long term we need to think how to extend bpf ISA with alloca. > > To be frank, I'm not following how alloca is relevant here. We don't > have anything dynamically sized on the stack. > > Unless you envision a protocol where we have a separate function to get > size of iter struct, then alloca enough space, then pass that to > bpf_iter_xxx_new()? Not sure whether this is statically verifiable, > but given it's long-term, we can put it on the backburner for now. With alloca the bpf_for_each() macro can allocate whatever stack is necessary. In other words: struct bpf_iter_task_vma *it; it = bpf_alloca(bpf_core_type_size(struct bpf_iter_task_vma)); bpf_for_each2(task_vma, it, ...) { .. } While struct bpf_iter_task_vma can come from vmlinux.h, the size of the kern data struct is CO-RE-able, so no worries about increase in size due to maple tree or lockdep on/off. And no concern of failing allocation at run-time. (the verifier would reject big stack alloc at load time). ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 17:53 ` Alexei Starovoitov @ 2023-08-23 18:13 ` Andrii Nakryiko 0 siblings, 0 replies; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-23 18:13 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Yonghong Song, Stanislav Fomichev, Nathan Slingerland On Wed, Aug 23, 2023 at 10:53 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Aug 23, 2023 at 10:14 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > > Long term we need to think how to extend bpf ISA with alloca. > > > > To be frank, I'm not following how alloca is relevant here. We don't > > have anything dynamically sized on the stack. > > > > Unless you envision a protocol where we have a separate function to get > > size of iter struct, then alloca enough space, then pass that to > > bpf_iter_xxx_new()? Not sure whether this is statically verifiable, > > but given it's long-term, we can put it on the backburner for now. > > With alloca the bpf_for_each() macro can allocate whatever stack is necessary. > > In other words: > > struct bpf_iter_task_vma *it; > > it = bpf_alloca(bpf_core_type_size(struct bpf_iter_task_vma)); > bpf_for_each2(task_vma, it, ...) { .. } ah, I see, not a dedicated kfunc, just a CO-RE relocation. Makes sense. > > While struct bpf_iter_task_vma can come from vmlinux.h, > > the size of the kern data struct is CO-RE-able, so no worries about increase > in size due to maple tree or lockdep on/off. > And no concern of failing allocation at run-time. > (the verifier would reject big stack alloc at load time). yep, makes sense, the size will be statically known to the verifier. I was overcomplicating this in my mind with an extra kfunc. ^ permalink raw reply [flat|nested] 25+ messages in thread
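Alexei's bpf_alloca() sketch above can be approximated in userspace C to show why load-time sizing would sidestep both the frozen-size problem and the runtime -ENOMEM concern. Here runtime_state_size() is a made-up stand-in for a bpf_core_type_size() relocation resolved from vmlinux BTF.

```c
#include <alloca.h>
#include <stddef.h>

/* Hypothetical stand-in for bpf_core_type_size(struct bpf_iter_task_vma):
 * the value would be patched in from vmlinux BTF at load time rather
 * than baked in at compile time.
 */
static size_t runtime_state_size(void)
{
	return 88; /* e.g. 64 bytes of vma_iterator + 24 bytes of pointers */
}

static int sum_first_n(int n)
{
	/* Stack slot sized at run time, as an alloca-capable BPF ISA would
	 * allow; a verifier could still bound the size at load time.
	 */
	int *state = alloca(runtime_state_size());
	int i;

	state[0] = 0; /* use the dynamically sized slot as iterator state */
	for (i = 0; i < n; i++)
		state[0] += i;
	return state[0];
}
```

If the kernel struct grows in a future release, only the value returned by the relocation changes; the compiled program and its stack layout remain valid.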
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs 2023-08-23 7:26 ` David Marchevsky 2023-08-23 15:03 ` Alexei Starovoitov @ 2023-08-23 17:07 ` Andrii Nakryiko 2023-08-23 17:26 ` Alexei Starovoitov 1 sibling, 1 reply; 25+ messages in thread From: Andrii Nakryiko @ 2023-08-23 17:07 UTC (permalink / raw) To: David Marchevsky Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team, yonghong.song, sdf, Nathan Slingerland On Wed, Aug 23, 2023 at 12:26 AM David Marchevsky <david.marchevsky@linux.dev> wrote: > > On 8/22/23 7:52 PM, Andrii Nakryiko wrote: > > On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote: > >> > >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow > >> creation and manipulation of struct bpf_iter_task_vma in open-coded > >> iterator style. BPF programs can use these kfuncs directly or through > >> bpf_for_each macro for natural-looking iteration of all task vmas. > >> > >> The implementation borrows heavily from bpf_find_vma helper's locking - > >> differing only in that it holds the mmap_read lock for all iterations > >> while the helper only executes its provided callback on a maximum of 1 > >> vma. Aside from locking, struct vma_iterator and vma_next do all the > >> heavy lifting. > >> > >> The newly-added struct bpf_iter_task_vma has a name collision with a > >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs > >> file is renamed in order to avoid the collision. > >> > >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the > >> only field in struct bpf_iter_task_vma. This is because the inner data > >> struct contains a struct vma_iterator (not ptr), whose size is likely to > >> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly > >> such a change would require change in opaque bpf_iter_task_vma struct's > >> size. 
So better to allocate vma_iterator using BPF allocator, and since > >> that alloc must already succeed, might as well allocate all iter fields, > >> thereby freezing struct bpf_iter_task_vma size. > >> > >> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> > >> Cc: Nathan Slingerland <slinger@meta.com> > >> --- > >> include/uapi/linux/bpf.h | 4 + > >> kernel/bpf/helpers.c | 3 + > >> kernel/bpf/task_iter.c | 84 +++++++++++++++++++ > >> tools/include/uapi/linux/bpf.h | 4 + > >> tools/lib/bpf/bpf_helpers.h | 8 ++ > >> .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > >> ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > >> 7 files changed, 116 insertions(+), 13 deletions(-) > >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > >> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 8790b3962e4b..49fc1989a548 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > >> __u64 __opaque[1]; > >> } __attribute__((aligned(8))); > >> > >> +struct bpf_iter_task_vma { > >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > >> +} __attribute__((aligned(8))); > >> + > >> #endif /* _UAPI__LINUX_BPF_H__ */ > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > >> index eb91cae0612a..7a06dea749f1 100644 > >> --- a/kernel/bpf/helpers.c > >> +++ b/kernel/bpf/helpers.c > >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) > >> BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) > >> BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) > >> BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) > >> BTF_ID_FLAGS(func, bpf_dynptr_adjust) > >> BTF_ID_FLAGS(func, 
bpf_dynptr_is_null) > >> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c > >> index c4ab9d6cdbe9..51c2dce435c1 100644 > >> --- a/kernel/bpf/task_iter.c > >> +++ b/kernel/bpf/task_iter.c > >> @@ -7,7 +7,9 @@ > >> #include <linux/fs.h> > >> #include <linux/fdtable.h> > >> #include <linux/filter.h> > >> +#include <linux/bpf_mem_alloc.h> > >> #include <linux/btf_ids.h> > >> +#include <linux/mm_types.h> > >> #include "mmap_unlock_work.h" > >> > >> static const char * const iter_task_type_names[] = { > >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = { > >> .arg5_type = ARG_ANYTHING, > >> }; > >> > >> +struct bpf_iter_task_vma_kern_data { > >> + struct task_struct *task; > >> + struct mm_struct *mm; > >> + struct mmap_unlock_irq_work *work; > >> + struct vma_iterator vmi; > >> +}; > >> + > >> +/* Non-opaque version of uapi bpf_iter_task_vma */ > >> +struct bpf_iter_task_vma_kern { > >> + struct bpf_iter_task_vma_kern_data *data; > >> +} __attribute__((aligned(8))); > >> + > > > > it's a bit worrying that we'll rely on memory allocation inside NMI to > > be able to use this. I'm missing previous email discussion (I declared > > email bankruptcy after long vacation), but I suppose the option to fix > > bpf_iter_task_vma (to 88 bytes: 64 for vma_iterator + 24 for extra > > pointers), or even to 96 to have a bit of headroom in case we need a > > bit more space was rejected? It seems unlikely that vma_iterator will > > have to grow, but if it does, it has 5 bytes of padding right now for > > various flags, plus we can have extra 8 bytes reserved just in case. > > > > I know it's a big struct and will take a big chunk of the BPF stack, > > but I'm a bit worried about both the performance implication of mem > > alloc under NMI, and allocation failing. > > > > Maybe the worry is overblown, but I thought I'll bring it up anyways. 
> > > > Few tangential trains of thought here, separated by multiple newlines > for easier reading. > > > IIUC the any-context BPF allocator will not actually allocate memory in NMI > context, instead relying on its existing pre-filled caches. > > Alexei's patch adding the allocator says ([0]): > > The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general. > > So sounds bpf_mem_alloc in a kfunc called by a BPF program is NMI-safe. Right. My concern wasn't about safety of allocation, but rather about it being drawn from a limited pool and potentially returning -ENOMEM even if the system is not actually out of memory. I guess only the actual usage in a big fleet will show how often -ENOMEM comes up. It might be not a problem at all, which is why I'm saying that my worries might be overblown. > > > That's not to say that I'm happy about adding a fallible bpf_mem_alloc call here > before the kfunc can do anything useful. But it seems like the best way to > guarantee that we never see a mailing list message like: > > Hello, I just added a field to 'struct ma_state' in my subsystem and it seems > I've triggered a BUILD_BUG_ON in this far-away BPF subsystem. It looks like > you're making stability guarantees based on the size of my internal struct. > What the hell? > > Sure, after I remove the kfuncs and struct bpf_iter_task_vma fwd decl from > bpf_helpers.h - per your other comment below - we can do the whole "kfuncs > aren't uapi and this struct bpf_iter_task_vma is coming from vmlinux.h, > not some stable header" spiel and convince this hypothetical person. Not having > to do the spiel here reinforces the more general "Modern BPF exposes > functionality w/ kfuncs and kptrs, which are inherently _unstable_" messaging > more effectively than having to explain. > > > If we go back to putting struct vma_iterator on the BPF stack, I think we > definitely want to keep the BUILD_BUG_ON. 
If it were removed and vma_iterator > > size changes, that would affect portability of BPF programs that assume the old > > size of bpf_iter_task_vma, no? Which bpf_for_each is doing since it puts > > bpf_iter_task_vma on the stack. Yes, I think we'll have to have BUILD_BUG_ON. And yes, whoever increases vma_iter by more than 13 bytes will have to interact with us. But my thinking was that in the *unlikely* case of this happening, given the unstable API rules, we'll just drop "v1" of the task_vma iterator, both kfuncs and the iter struct itself, and will introduce v2 with a bigger size of the state. We definitely don't want to just change the size of the iter state struct, that will cause not just confusion, but potentially clobbering of stack variables, if the old definition is used. And the above seems acceptable only because it seems that vma_iter changing its size so dramatically is very unlikely. Kernel folks won't be increasing the already pretty big struct just for the fun of it. > > Is there some CO-RE technique that would handle above scenario portably? I > can't think of anything straightforward. Maybe if BPF prog BTF only had > a fwd decl for bpf_iter_task_vma, and size thus had to be taken from > vmlinux BTF. But that would fail to compile since the struct goes > on the stack. Maybe use some placeholder size for compilation and use > BTF tag to tell libbpf to patch insns w/ vmlinux's size for this struct? > I don't think there is any magic that could be done when an on-the-stack variable's size changes. But if we do a v1 vs v2 change (if necessary), then we have mechanisms to detect the presence of new kfuncs and then choose the proper iterator (task_vma vs task_vma2, something like that). > > Re: padding bytes, seems worse to me than not using them. Have to make > assumptions about far-away struct, specifically vma_iterator > which landed quite recently as part of maple tree series.
The assumptions > don't prevent my hypothetical mailing list confusion from happening, increases > the confusion if it does happen ("I added a small field recently, why didn't > this break then? If it's explicitly and intentionally unstable, why add > padding bytes?") > To prevent unnecessary user breakage. Just because something is unstable doesn't mean we want to break it every kernel release, right? Otherwise why don't we rename each kfunc just for the fun of it, every release ;) 8 extra bytes on the stack seems like a negligible price for ensuring less user pain in the long run. > [0]: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com > > >> +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > >> + struct task_struct *task, u64 addr) > >> +{ > >> + struct bpf_iter_task_vma_kern *kit = (void *)it; > >> + bool irq_work_busy = false; > >> + int err; > >> + > > > > [...] > > > >> static void do_mmap_read_unlock(struct irq_work *entry) > >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > >> index 8790b3962e4b..49fc1989a548 100644 > >> --- a/tools/include/uapi/linux/bpf.h > >> +++ b/tools/include/uapi/linux/bpf.h > >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num { > >> __u64 __opaque[1]; > >> } __attribute__((aligned(8))); > >> > >> +struct bpf_iter_task_vma { > >> + __u64 __opaque[1]; /* See bpf_iter_num comment above */ > >> +} __attribute__((aligned(8))); > >> + > >> #endif /* _UAPI__LINUX_BPF_H__ */ > >> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h > >> index bbab9ad9dc5a..d885ffee4d88 100644 > >> --- a/tools/lib/bpf/bpf_helpers.h > >> +++ b/tools/lib/bpf/bpf_helpers.h > >> @@ -302,6 +302,14 @@ extern int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) __weak > >> extern int *bpf_iter_num_next(struct bpf_iter_num *it) __weak __ksym; > >> extern void bpf_iter_num_destroy(struct bpf_iter_num *it) __weak __ksym; > >> > >> +struct bpf_iter_task_vma; > 
>> + > >> +extern int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, > >> + struct task_struct *task, > >> + unsigned long addr) __weak __ksym; > >> +extern struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) __weak __ksym; > >> +extern void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __weak __ksym; > > > > my intent wasn't to add all open-coded iterators to bpf_helpers.h. I > > think bpf_iter_num_* is rather an exception and isn't supposed to ever > > change or be removed, while other iterators should be allowed to be > > changed. > > > > The goal is for all such kfuncs (and struct bpf_iter_task_vma state > > itself, probably) to come from vmlinux.h, eventually, so let's leave > > it out of libbpf's stable bpf_helpers.h header. > > > > > > [...] > > As alluded to in my long response above, this sounds reasonable, will > remove from here. > Cool. > > > >> @@ -1533,7 +1533,7 @@ static void test_task_vma_dead_task(void) > >> out: > >> waitpid(child_pid, &wstatus, 0); > >> close(iter_fd); > >> - bpf_iter_task_vma__destroy(skel); > >> + bpf_iter_task_vmas__destroy(skel); > >> } > >> > >> void test_bpf_sockmap_map_iter_fd(void) > >> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c > >> similarity index 100% > >> rename from tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c > >> rename to tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c > >> -- > >> 2.34.1 > >> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs
  2023-08-23 17:07 ` Andrii Nakryiko
@ 2023-08-23 17:26 ` Alexei Starovoitov
  2023-08-23 17:43 ` Andrii Nakryiko
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2023-08-23 17:26 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team,
	Yonghong Song, Stanislav Fomichev, Nathan Slingerland

On Wed, Aug 23, 2023 at 10:07 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Yes, I think we'll have to have BUILD_BUG_ON. And yes, whoever
> increases vma_iter by more than 13 bytes will have to interact with
> us.

A bit of history.
Before maple tree the vma_iterator didn't exist. vma_next would walk rb tree.
So if we introduced task_vma iterator couple years ago the maple tree
change would have grown our bpf_iter_task_vma by 64 bytes.
If we reserved an 8 or 16 byte gap it wouldn't have helped.
Hence it's wishful thinking that a gap might help in the future.

Direct stack alloc of kernel data structure is also dangerous in
presence of kernel debug knobs. There are no locks inside vma_iterator
at the moment, but if it was there we wouldn't be able to use it
on bpf prog stack regardless of its size, because lockdep on/off
would have changed the size.
I think it's always better to have extra indirection between bpf prog
and kernel object.
* Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs
  2023-08-23 17:26 ` Alexei Starovoitov
@ 2023-08-23 17:43 ` Andrii Nakryiko
  0 siblings, 0 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2023-08-23 17:43 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team,
	Yonghong Song, Stanislav Fomichev, Nathan Slingerland

On Wed, Aug 23, 2023 at 10:26 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 23, 2023 at 10:07 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > Yes, I think we'll have to have BUILD_BUG_ON. And yes, whoever
> > increases vma_iter by more than 13 bytes will have to interact with
> > us.
>
> A bit of history.
> Before maple tree the vma_iterator didn't exist. vma_next would walk rb tree.
> So if we introduced task_vma iterator couple years ago the maple tree
> change would have grown our bpf_iter_task_vma by 64 bytes.
> If we reserved an 8 or 16 byte gap it wouldn't have helped.

Yep, we'd have to introduce v2, of course. And if tomorrow we switch
from maple tree to something else, we'd probably need another version
of iterator as well. I made no claims that padding will be a
future-proof approach, just a pragmatic mitigation of small reasonable
changes in a struct we can't control.

> Hence it's wishful thinking that a gap might help in the future.
>
> Direct stack alloc of kernel data structure is also dangerous in
> presence of kernel debug knobs. There are no locks inside vma_iterator
> at the moment, but if it was there we wouldn't be able to use it
> on bpf prog stack regardless of its size, because lockdep on/off
> would have changed the size.
> I think it's always better to have extra indirection between bpf prog
> and kernel object.

It's a tradeoff and I understand the pros and cons. Let's do mem_alloc
and see how that works in practice.
* [PATCH v3 bpf-next 3/3] selftests/bpf: Add tests for open-coded task_vma iter
  2023-08-22 5:05 [PATCH v3 bpf-next 0/3] Open-coded task_vma iter Dave Marchevsky
  2023-08-22 5:05 ` [PATCH v3 bpf-next 1/3] bpf: Don't explicitly emit BTF for struct btf_iter_num Dave Marchevsky
  2023-08-22 5:05 ` [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs Dave Marchevsky
@ 2023-08-22 5:05 ` Dave Marchevsky
  2023-08-23 0:13 ` Andrii Nakryiko
  2 siblings, 1 reply; 25+ messages in thread
From: Dave Marchevsky @ 2023-08-22 5:05 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, yonghong.song, sdf,
	Dave Marchevsky

The open-coded task_vma iter added earlier in this series allows for
natural iteration over a task's vmas using existing open-coded iter
infrastructure, specifically bpf_for_each.

This patch adds a test demonstrating this pattern and validating
correctness. The vma->vm_start and vma->vm_end addresses of the first
1000 vmas are recorded and compared to /proc/PID/maps output. As
expected, both see the same vmas and addresses - with the exception of
the [vsyscall] vma - which is explained in a comment in the prog_tests
program.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 .../testing/selftests/bpf/prog_tests/iters.c  | 71 +++++++++++++++++++
 .../selftests/bpf/progs/iters_task_vma.c      | 56 +++++++++++++++
 2 files changed, 127 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c

diff --git a/tools/testing/selftests/bpf/prog_tests/iters.c b/tools/testing/selftests/bpf/prog_tests/iters.c
index 10804ae5ae97..f91f4a49066a 100644
--- a/tools/testing/selftests/bpf/prog_tests/iters.c
+++ b/tools/testing/selftests/bpf/prog_tests/iters.c
@@ -8,6 +8,7 @@
 #include "iters_looping.skel.h"
 #include "iters_num.skel.h"
 #include "iters_testmod_seq.skel.h"
+#include "iters_task_vma.skel.h"

 static void subtest_num_iters(void)
 {
@@ -90,6 +91,74 @@ static void subtest_testmod_seq_iters(void)
 	iters_testmod_seq__destroy(skel);
 }

+static void subtest_task_vma_iters(void)
+{
+	unsigned long start, end, bpf_iter_start, bpf_iter_end;
+	struct iters_task_vma *skel;
+	char rest_of_line[1000];
+	unsigned int seen;
+	int err;
+	FILE *f;
+
+	skel = iters_task_vma__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	bpf_program__set_autoload(skel->progs.iter_task_vma_for_each, true);
+
+	err = iters_task_vma__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	skel->bss->target_pid = getpid();
+
+	err = iters_task_vma__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto cleanup;
+
+	iters_task_vma__detach(skel);
+	getpgid(skel->bss->target_pid);
+
+	if (!ASSERT_GT(skel->bss->vmas_seen, 0, "vmas_seen_gt_zero"))
+		goto cleanup;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!ASSERT_OK_PTR(f, "proc_maps_fopen"))
+		goto cleanup;
+
+	seen = 0;
+	while (fscanf(f, "%lx-%lx %[^\n]\n", &start, &end, rest_of_line) == 3) {
+		/* [vsyscall] vma isn't _really_ part of task->mm vmas.
+		 * /proc/PID/maps returns it when out of vmas - see get_gate_vma
+		 * calls in fs/proc/task_mmu.c
+		 */
+		if (strstr(rest_of_line, "[vsyscall]"))
+			continue;
+
+		err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.vm_start),
+					  &seen, &bpf_iter_start);
+		if (!ASSERT_OK(err, "vm_start map_lookup_elem"))
+			goto cleanup;
+
+		err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.vm_end),
+					  &seen, &bpf_iter_end);
+		if (!ASSERT_OK(err, "vm_end map_lookup_elem"))
+			goto cleanup;
+
+		ASSERT_EQ(bpf_iter_start, start, "vma->vm_start match");
+		ASSERT_EQ(bpf_iter_end, end, "vma->vm_end match");
+		seen++;
+	}
+
+	fclose(f);
+
+	if (!ASSERT_EQ(skel->bss->vmas_seen, seen, "vmas_seen_eq"))
+		goto cleanup;
+
+cleanup:
+	iters_task_vma__destroy(skel);
+}
+
 void test_iters(void)
 {
 	RUN_TESTS(iters_state_safety);
@@ -103,4 +172,6 @@ void test_iters(void)
 		subtest_num_iters();
 	if (test__start_subtest("testmod_seq"))
 		subtest_testmod_seq_iters();
+	if (test__start_subtest("task_vma"))
+		subtest_task_vma_iters();
 }
diff --git a/tools/testing/selftests/bpf/progs/iters_task_vma.c b/tools/testing/selftests/bpf/progs/iters_task_vma.c
new file mode 100644
index 000000000000..b961d0a12223
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/iters_task_vma.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include <limits.h>
+#include <linux/errno.h>
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+pid_t target_pid = 0;
+unsigned int vmas_seen = 0;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, unsigned long);
+} vm_start SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1000);
+	__type(key, int);
+	__type(value, unsigned long);
+} vm_end SEC(".maps");
+
+SEC("?raw_tp/sys_enter")
+int iter_task_vma_for_each(const void *ctx)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	struct vm_area_struct *vma;
+	unsigned long *start, *end;
+
+	if (task->pid != target_pid)
+		return 0;
+
+	bpf_for_each(task_vma, vma, task, 0) {
+		if (vmas_seen >= 1000)
+			break;
+
+		start = bpf_map_lookup_elem(&vm_start, &vmas_seen);
+		if (!start)
+			break;
+		*start = vma->vm_start;
+
+		end = bpf_map_lookup_elem(&vm_end, &vmas_seen);
+		if (!end)
+			break;
+		*end = vma->vm_end;
+
+		vmas_seen++;
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.34.1
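[Editor's note: for readers unfamiliar with the bpf_for_each macro used in the prog above, it is a convenience wrapper around the iterator kfunc triple introduced in patch 2 of this series. Roughly, as a pseudocode sketch of the expansion (the real macro, in the selftests' bpf_misc.h if memory serves, also handles cleanup-on-exit so destroy runs on every path):]

```
struct bpf_iter_task_vma it;
struct vm_area_struct *vma;

bpf_iter_task_vma_new(&it, task, 0 /* start addr */);
while ((vma = bpf_iter_task_vma_next(&it)) != NULL) {
	/* loop body: record vma->vm_start, vma->vm_end, ... */
}
bpf_iter_task_vma_destroy(&it);
```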
* Re: [PATCH v3 bpf-next 3/3] selftests/bpf: Add tests for open-coded task_vma iter
  2023-08-22 5:05 ` [PATCH v3 bpf-next 3/3] selftests/bpf: Add tests for open-coded task_vma iter Dave Marchevsky
@ 2023-08-23 0:13 ` Andrii Nakryiko
  0 siblings, 0 replies; 25+ messages in thread
From: Andrii Nakryiko @ 2023-08-23 0:13 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Kernel Team, yonghong.song, sdf

On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> The open-coded task_vma iter added earlier in this series allows for
> natural iteration over a task's vmas using existing open-coded iter
> infrastructure, specifically bpf_for_each.
>
> This patch adds a test demonstrating this pattern and validating
> correctness. The vma->vm_start and vma->vm_end addresses of the first
> 1000 vmas are recorded and compared to /proc/PID/maps output. As
> expected, both see the same vmas and addresses - with the exception of
> the [vsyscall] vma - which is explained in a comment in the prog_tests
> program.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>  .../testing/selftests/bpf/prog_tests/iters.c  | 71 +++++++++++++++++++
>  .../selftests/bpf/progs/iters_task_vma.c      | 56 +++++++++++++++
>  2 files changed, 127 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/iters.c b/tools/testing/selftests/bpf/prog_tests/iters.c
> index 10804ae5ae97..f91f4a49066a 100644
> --- a/tools/testing/selftests/bpf/prog_tests/iters.c
> +++ b/tools/testing/selftests/bpf/prog_tests/iters.c
> @@ -8,6 +8,7 @@
>  #include "iters_looping.skel.h"
>  #include "iters_num.skel.h"
>  #include "iters_testmod_seq.skel.h"
> +#include "iters_task_vma.skel.h"
>
>  static void subtest_num_iters(void)
>  {
> @@ -90,6 +91,74 @@ static void subtest_testmod_seq_iters(void)
>  	iters_testmod_seq__destroy(skel);
>  }
>
> +static void subtest_task_vma_iters(void)
> +{
> +	unsigned long start, end, bpf_iter_start, bpf_iter_end;
> +	struct iters_task_vma *skel;
> +	char rest_of_line[1000];
> +	unsigned int seen;
> +	int err;
> +	FILE *f;
> +
> +	skel = iters_task_vma__open();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		return;
> +
> +	bpf_program__set_autoload(skel->progs.iter_task_vma_for_each, true);

why marking prog with SEC("?") just to explicitly enable autoload for
it? Just drop "?" in the first place?

> +
> +	err = iters_task_vma__load(skel);
> +	if (!ASSERT_OK(err, "skel_load"))
> +		goto cleanup;
> +
> +	skel->bss->target_pid = getpid();
> +
> +	err = iters_task_vma__attach(skel);
> +	if (!ASSERT_OK(err, "skel_attach"))
> +		goto cleanup;
> +
> +	iters_task_vma__detach(skel);
> +	getpgid(skel->bss->target_pid);

Is this a trigger for the program? If yes, it's confusing that it
happens after skeleton detach. Is this intentional?

> +
> +	if (!ASSERT_GT(skel->bss->vmas_seen, 0, "vmas_seen_gt_zero"))
> +		goto cleanup;
> +
> +	f = fopen("/proc/self/maps", "r");
> +	if (!ASSERT_OK_PTR(f, "proc_maps_fopen"))
> +		goto cleanup;
> +
> +	seen = 0;
> +	while (fscanf(f, "%lx-%lx %[^\n]\n", &start, &end, rest_of_line) == 3) {
> +		/* [vsyscall] vma isn't _really_ part of task->mm vmas.
> +		 * /proc/PID/maps returns it when out of vmas - see get_gate_vma
> +		 * calls in fs/proc/task_mmu.c
> +		 */
> +		if (strstr(rest_of_line, "[vsyscall]"))
> +			continue;
> +
> +		err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.vm_start),
> +					  &seen, &bpf_iter_start);
> +		if (!ASSERT_OK(err, "vm_start map_lookup_elem"))
> +			goto cleanup;
> +
> +		err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.vm_end),
> +					  &seen, &bpf_iter_end);
> +		if (!ASSERT_OK(err, "vm_end map_lookup_elem"))
> +			goto cleanup;
> +
> +		ASSERT_EQ(bpf_iter_start, start, "vma->vm_start match");
> +		ASSERT_EQ(bpf_iter_end, end, "vma->vm_end match");
> +		seen++;
> +	}
> +
> +	fclose(f);
> +
> +	if (!ASSERT_EQ(skel->bss->vmas_seen, seen, "vmas_seen_eq"))
> +		goto cleanup;
> +
> +cleanup:
> +	iters_task_vma__destroy(skel);
> +}
> +
>  void test_iters(void)
>  {
>  	RUN_TESTS(iters_state_safety);
> @@ -103,4 +172,6 @@ void test_iters(void)
>  		subtest_num_iters();
>  	if (test__start_subtest("testmod_seq"))
>  		subtest_testmod_seq_iters();
> +	if (test__start_subtest("task_vma"))
> +		subtest_task_vma_iters();
>  }
> diff --git a/tools/testing/selftests/bpf/progs/iters_task_vma.c b/tools/testing/selftests/bpf/progs/iters_task_vma.c
> new file mode 100644
> index 000000000000..b961d0a12223
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/iters_task_vma.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
> +
> +#include <limits.h>
> +#include <linux/errno.h>
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include "bpf_misc.h"
> +
> +pid_t target_pid = 0;
> +unsigned int vmas_seen = 0;
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1000);
> +	__type(key, int);
> +	__type(value, unsigned long);
> +} vm_start SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1000);
> +	__type(key, int);
> +	__type(value, unsigned long);
> +} vm_end SEC(".maps");
> +
> +SEC("?raw_tp/sys_enter")
> +int iter_task_vma_for_each(const void *ctx)
> +{
> +	struct task_struct *task = bpf_get_current_task_btf();
> +	struct vm_area_struct *vma;
> +	unsigned long *start, *end;
> +
> +	if (task->pid != target_pid)
> +		return 0;
> +
> +	bpf_for_each(task_vma, vma, task, 0) {
> +		if (vmas_seen >= 1000)
> +			break;
> +

this test is not idempotent, if there will be more than one syscall,
you'll keep incrementing vmas_seen. This might result in flaky test,
so I'd suggest to recalculate vmas_seen each time, as a local counter,
and then at the end writing it to global var:

int zero;
int vmas_seen;

...

int seen = zero;

... all the same logic but using seen as a key ...

vmas_seen = seen;

> +		start = bpf_map_lookup_elem(&vm_start, &vmas_seen);
> +		if (!start)
> +			break;
> +		*start = vma->vm_start;
> +
> +		end = bpf_map_lookup_elem(&vm_end, &vmas_seen);
> +		if (!end)
> +			break;
> +		*end = vma->vm_end;
> +
> +		vmas_seen++;
> +	}
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.34.1