public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map.
@ 2024-08-16 19:12 Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 1/6] bpf: define BPF_UPTR a new enumerator of btf_field_type Kui-Feng Lee
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee

Some BPF schedulers (sched_ext) need hints from user programs to do
a better job. For example, a scheduler can handle a task differently
if it knows the task is doing GC. So, we need an efficient way to
share information between user programs and BPF programs. Sharing
memory between user programs and BPF programs is what this patchset
does.

== REQUIREMENT ==

This patchset enables every task in every process to share a small
chunk of memory of its own with a BPF scheduler, so tasks can update
the hints without the expensive overhead of syscalls. It also ensures
that every task sees only the data/memory belonging to the task or the
task's process.

== DESIGN ==

This patchset enables BPF programs to embed a uptr, annotated with
__uptr, in the values of task storage maps. A uptr field can only be
set by user programs, by updating the map element value through a
syscall. A uptr points to a block of memory allocated by the user
program updating the element value. The memory will be pinned to
ensure it stays in core memory and to avoid a page fault when the BPF
program accesses it.

For example, the following code fragment is a part of a BPF program
that embeds a __uptr field "udata" in the value type "struct
value_type" of the task storage map "datamap". The size of the memory
pointed to by a uptr is determined by its type, here "struct
user_data". The BPF program can read and write this block of memory
directly.

File task_ls_uptr.c:

    struct user_data {
    	int a;
    	int b;
    	int result;
    };
    
    struct value_type {
    	struct user_data __uptr *udata;
    };
    
    struct {
    	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
    	__uint(map_flags, BPF_F_NO_PREALLOC);
    	__type(key, int);
    	__type(value, struct value_type);
    } datamap SEC(".maps");
    
    pid_t target_pid = 0;
    
    SEC("tp_btf/sys_enter")
    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
    {
    	struct task_struct *task;
    	struct value_type *ptr;
    	struct user_data *udata;
    
    	task = bpf_get_current_task_btf();
    	if (task->pid != target_pid)
    		return 0;
    
    	ptr = bpf_task_storage_get(&datamap, task, 0,
    				   BPF_LOCAL_STORAGE_GET_F_CREATE);
    	if (!ptr)
    		return 0;
    
    	udata = ptr->udata;
    	if (udata)
    		udata->result = udata->a + udata->b;
    
    	return 0;
    }

The following code fragment is a corresponding user program. It calls
bpf_map_update_elem() to update "datamap" and point "udata" to a
memory block residing in one page. The memory pointed to by "udata"
will be shared between the BPF program and the user program and must
not cross a page boundary.

    static void test_uptr(void)
    {
    	struct task_ls_uptr *skel = NULL;
    	static struct user_data user_data __attribute__((aligned(16))) = {
    		.a = 1,
    		.b = 2,
    		.result = 0,
    	};
    	struct value_type value;
    	int task_fd = -1;
    	int err;
    
    	value.udata = &user_data;
    
    	task_fd = sys_pidfd_open(getpid(), 0);
    	if (!ASSERT_NEQ(task_fd, -1, "sys_pidfd_open"))
    		goto out;
    
    	skel = task_ls_uptr__open_and_load();
    	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
    		goto out;
    
    	err = bpf_map_update_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &value, 0);
    	if (!ASSERT_OK(err, "update datamap"))
    		goto out;
    
    	skel->bss->target_pid = syscall(SYS_gettid);
    
    	err = task_ls_uptr__attach(skel);
    	if (!ASSERT_OK(err, "skel_attach"))
    		goto out;
    
    	syscall(SYS_gettid);
    	syscall(SYS_gettid);
    
    	ASSERT_EQ(user_data.a + user_data.b, user_data.result, "result");
    out:
    	task_ls_uptr__destroy(skel);
    	close(task_fd);
    }

== MEMORY ==

In order to use memory efficiently, we don't want to pin a large
number of pages. To achieve that, user programs should collect the
memory blocks pointed to by uptrs together to share memory pages where
possible. This avoids pinning one page for each thread in a
process. Instead, several threads can point their uptrs to the same
page, but at different offsets.

Although it is not strictly necessary, keeping the memory pointed to
by a uptr from crossing a page boundary avoids an additional mapping
in the kernel address space.

== RESTRICTIONS ==

The memory pointed to by a uptr must reside in a single memory
page. Crossing multiple pages is not supported at the moment.

Only task storage maps are supported at the moment.

The values of uptrs can only be updated by user programs through
syscalls.

bpf_map_lookup_elem() from userspace returns zeroed values for uptrs
to prevent leaking kernel information.

---

Changes from v3:

 - Merge parts 4 and 5 into the new part 4 in order to silence the
   warnings about unused functions from CI.

Changes from v1:

 - Rename BPF_KPTR_USER to BPF_UPTR.

 - Restrict uptr to one page.

 - Mark uptr with PTR_TO_MEM | PTR_MAY_BE_NULL and with the size of
   the target type.

 - Move uptr away from bpf_obj_memcpy() by introducing
   bpf_obj_uptrcpy() and copy_map_uptr_locked().

 - Remove the BPF_FROM_USER flag.

 - Align the memory pointed to by an uptr in the test case. Remove the
   uptr of mmapped memory.

v3: https://lore.kernel.org/all/20240814033010.2980635-1-thinker.li@gmail.com/
v1: https://lore.kernel.org/all/20240807235755.1435806-1-thinker.li@gmail.com/

Kui-Feng Lee (6):
  bpf: define BPF_UPTR a new enumerator of btf_field_type.
  bpf: Parse and support "uptr" tag.
  bpf: Handle BPF_UPTR in verifier.
  bpf: pin, translate, and unpin __uptr from syscalls.
  libbpf: define __uptr.
  selftests/bpf: test __uptr on the value of a task storage map.

 include/linux/bpf.h                           |  36 ++++
 kernel/bpf/bpf_local_storage.c                |  23 ++-
 kernel/bpf/btf.c                              |   5 +
 kernel/bpf/helpers.c                          |  20 ++
 kernel/bpf/syscall.c                          | 174 +++++++++++++++++-
 kernel/bpf/verifier.c                         |  37 +++-
 tools/lib/bpf/bpf_helpers.h                   |   1 +
 .../bpf/prog_tests/task_local_storage.c       | 106 +++++++++++
 .../selftests/bpf/progs/task_ls_uptr.c        |  65 +++++++
 9 files changed, 458 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/task_ls_uptr.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC bpf-next v4 1/6] bpf: define BPF_UPTR a new enumerator of btf_field_type.
  2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
@ 2024-08-16 19:12 ` Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 2/6] bpf: Parse and support "uptr" tag Kui-Feng Lee
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee

Define BPF_UPTR, and modify functions that describe attributes of a field
type.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9f35df07e86d..954e476b5605 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -203,6 +203,7 @@ enum btf_field_type {
 	BPF_GRAPH_ROOT = BPF_RB_ROOT | BPF_LIST_HEAD,
 	BPF_REFCOUNT   = (1 << 9),
 	BPF_WORKQUEUE  = (1 << 10),
+	BPF_UPTR       = (1 << 11),
 };
 
 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -322,6 +323,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "kptr";
 	case BPF_KPTR_PERCPU:
 		return "percpu_kptr";
+	case BPF_UPTR:
+		return "uptr";
 	case BPF_LIST_HEAD:
 		return "bpf_list_head";
 	case BPF_LIST_NODE:
@@ -350,6 +353,7 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_UPTR:
 		return sizeof(u64);
 	case BPF_LIST_HEAD:
 		return sizeof(struct bpf_list_head);
@@ -379,6 +383,7 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_UPTR:
 		return __alignof__(u64);
 	case BPF_LIST_HEAD:
 		return __alignof__(struct bpf_list_head);
@@ -419,6 +424,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_UPTR:
 		break;
 	default:
 		WARN_ON_ONCE(1);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC bpf-next v4 2/6] bpf: Parse and support "uptr" tag.
  2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 1/6] bpf: define BPF_UPTR a new enumerator of btf_field_type Kui-Feng Lee
@ 2024-08-16 19:12 ` Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 3/6] bpf: Handle BPF_UPTR in verifier Kui-Feng Lee
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee

Parse the "uptr" tag from BTF, map it to BPF_UPTR, and support it in
related functions. The "uptr" tag annotates that a field in a struct type
is a uptr, which is used to share a block of memory between user programs
and BPF programs.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 kernel/bpf/btf.c     | 5 +++++
 kernel/bpf/syscall.c | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index ab3c9f592c9f..f61bad288385 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3356,6 +3356,8 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
 		type = BPF_KPTR_REF;
 	else if (!strcmp("percpu_kptr", __btf_name_by_offset(btf, t->name_off)))
 		type = BPF_KPTR_PERCPU;
+	else if (!strcmp("uptr", __btf_name_by_offset(btf, t->name_off)))
+		type = BPF_UPTR;
 	else
 		return -EINVAL;
 
@@ -3533,6 +3535,7 @@ static int btf_repeat_fields(struct btf_field_info *info,
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_UPTR:
 		case BPF_LIST_HEAD:
 		case BPF_RB_ROOT:
 			break;
@@ -3659,6 +3662,7 @@ static int btf_find_field_one(const struct btf *btf,
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_UPTR:
 		ret = btf_find_kptr(btf, var_type, off, sz,
 				    info_cnt ? &info[0] : &tmp);
 		if (ret < 0)
@@ -3983,6 +3987,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_UPTR:
 			ret = btf_parse_kptr(btf, &rec->fields[i], &info_arr[i]);
 			if (ret < 0)
 				goto end;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 65dcd92d0b2c..fed4a2145f81 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -548,6 +548,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_UPTR:
 			if (rec->fields[i].kptr.module)
 				module_put(rec->fields[i].kptr.module);
 			btf_put(rec->fields[i].kptr.btf);
@@ -596,6 +597,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_UPTR:
 			btf_get(fields[i].kptr.btf);
 			if (fields[i].kptr.module && !try_module_get(fields[i].kptr.module)) {
 				ret = -ENXIO;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC bpf-next v4 3/6] bpf: Handle BPF_UPTR in verifier.
  2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 1/6] bpf: define BPF_UPTR a new enumerator of btf_field_type Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 2/6] bpf: Parse and support "uptr" tag Kui-Feng Lee
@ 2024-08-16 19:12 ` Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls Kui-Feng Lee
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee

Mark the memory pointed to by an uptr as PTR_TO_MEM | PTR_MAYBE_NULL with
the size of the pointed-to type, making it readable and writable.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 kernel/bpf/verifier.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e3932f8ce10a..5bc5b37b63cc 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5340,6 +5340,10 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 	int perm_flags;
 	const char *reg_name = "";
 
+	if (kptr_field->type == BPF_UPTR)
+		/* BPF programs should not change any user kptr */
+		return -EACCES;
+
 	if (btf_is_kernel(reg->btf)) {
 		perm_flags = PTR_MAYBE_NULL | PTR_TRUSTED | MEM_RCU;
 
@@ -5488,6 +5492,29 @@ static u32 btf_ld_kptr_type(struct bpf_verifier_env *env, struct btf_field *kptr
 	return ret;
 }
 
+static int mark_uptr_ld_reg(struct bpf_verifier_env *env, u32 regno,
+			    struct btf_field *field)
+{
+	struct bpf_reg_state *val_reg;
+	const struct btf_type *t;
+	u32 type_id, tsz;
+
+	val_reg = reg_state(env, regno);
+	type_id = field->kptr.btf_id;
+	t = btf_type_id_size(field->kptr.btf, &type_id, &tsz);
+	if (!t) {
+		verbose(env, "The type of uptr is invalid");
+		return -EACCES;
+	}
+
+	mark_reg_known_zero(env, cur_regs(env), regno);
+	val_reg->type = PTR_TO_MEM | PTR_MAYBE_NULL;
+	val_reg->mem_size = tsz;
+	val_reg->id = ++env->id_gen;
+
+	return 0;
+}
+
 static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 				 int value_regno, int insn_idx,
 				 struct btf_field *kptr_field)
@@ -5516,9 +5543,16 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
 		verbose(env, "store to referenced kptr disallowed\n");
 		return -EACCES;
 	}
+	if (class != BPF_LDX && kptr_field->type == BPF_UPTR) {
+		verbose(env, "store to uptr disallowed\n");
+		return -EACCES;
+	}
 
 	if (class == BPF_LDX) {
 		val_reg = reg_state(env, value_regno);
+		if (kptr_field->type == BPF_UPTR)
+			return mark_uptr_ld_reg(env, value_regno, kptr_field);
+
 		/* We can simply mark the value_regno receiving the pointer
 		 * value from map as PTR_TO_BTF_ID, with the correct type.
 		 */
@@ -5576,6 +5610,7 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			case BPF_KPTR_UNREF:
 			case BPF_KPTR_REF:
 			case BPF_KPTR_PERCPU:
+			case BPF_UPTR:
 				if (src != ACCESS_DIRECT) {
 					verbose(env, "kptr cannot be accessed indirectly by helper\n");
 					return -EACCES;
@@ -6956,7 +6991,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			return err;
 		if (tnum_is_const(reg->var_off))
 			kptr_field = btf_record_find(reg->map_ptr->record,
-						     off + reg->var_off.value, BPF_KPTR);
+						     off + reg->var_off.value, BPF_KPTR | BPF_UPTR);
 		if (kptr_field) {
 			err = check_map_kptr_access(env, regno, value_regno, insn_idx, kptr_field);
 		} else if (t == BPF_READ && value_regno >= 0) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
                   ` (2 preceding siblings ...)
  2024-08-16 19:12 ` [RFC bpf-next v4 3/6] bpf: Handle BPF_UPTR in verifier Kui-Feng Lee
@ 2024-08-16 19:12 ` Kui-Feng Lee
  2024-08-28 23:24   ` Alexei Starovoitov
  2024-08-16 19:12 ` [RFC bpf-next v4 5/6] libbpf: define __uptr Kui-Feng Lee
  2024-08-16 19:12 ` [RFC bpf-next v4 6/6] selftests/bpf: test __uptr on the value of a task storage map Kui-Feng Lee
  5 siblings, 1 reply; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee, linux-mm

When a user program updates a map value, every uptr will be pinned and
translated to an address in the kernel. This process is initiated by
calling bpf_map_update_elem() from user programs.

To make uptrs accessible to BPF programs, the underlying pages are pinned
with pin_user_pages_fast(), while the conversion to kernel addresses is
done by page_address(). The pages can later be unpinned with
unpin_user_pages().

Currently, the memory block pointed to by a uptr must reside in a single
memory page, as crossing multiple pages is not supported. uptr is only
supported by task storage maps and can only be set by user programs through
syscalls.

When the value of an uptr is overwritten or destroyed, the memory pointed
to by the old value must be unpinned. This is ensured by calling
bpf_obj_uptrcpy() and copy_map_uptr_locked() when updating map value and by
bpf_obj_free_fields() when destroying map value.

Cc: linux-mm@kvack.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h            |  30 ++++++
 kernel/bpf/bpf_local_storage.c |  23 ++++-
 kernel/bpf/helpers.c           |  20 ++++
 kernel/bpf/syscall.c           | 172 ++++++++++++++++++++++++++++++++-
 4 files changed, 237 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 954e476b5605..886c818ff555 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -477,6 +477,8 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 		data_race(*ldst++ = *lsrc++);
 }
 
+void bpf_obj_unpin_uptr(const struct btf_field *field, void *addr);
+
 /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
 static inline void bpf_obj_memcpy(struct btf_record *rec,
 				  void *dst, void *src, u32 size,
@@ -503,6 +505,34 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
 	memcpy(dst + curr_off, src + curr_off, size - curr_off);
 }
 
+static inline void bpf_obj_uptrcpy(struct btf_record *rec,
+				   void *dst, void *src)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(rec))
+		return;
+
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
+		void *addr;
+
+		if (rec->fields[i].type == BPF_UPTR) {
+			/* Unpin old address.
+			 *
+			 * Alignments are guaranteed by btf_find_field_one().
+			 */
+			addr = *(void **)(dst + next_off);
+			if (addr)
+				bpf_obj_unpin_uptr(&rec->fields[i], addr);
+
+			*(void **)(dst + next_off) = *(void **)(src + next_off);
+		}
+	}
+}
+
+void copy_map_uptr_locked(struct bpf_map *map, void *dst, void *src, bool lock_src);
+
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
 	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index c938dea5ddbf..2fafad53b9d9 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -99,8 +99,11 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
 	}
 
 	if (selem) {
-		if (value)
+		if (value) {
 			copy_map_value(&smap->map, SDATA(selem)->data, value);
+			if (smap->map.map_type == BPF_MAP_TYPE_TASK_STORAGE)
+				bpf_obj_uptrcpy(smap->map.record, SDATA(selem)->data, value);
+		}
 		/* No need to call check_and_init_map_value as memory is zero init */
 		return selem;
 	}
@@ -575,8 +578,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 		if (err)
 			return ERR_PTR(err);
 		if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
-			copy_map_value_locked(&smap->map, old_sdata->data,
-					      value, false);
+			if (smap->map.map_type == BPF_MAP_TYPE_TASK_STORAGE &&
+			    btf_record_has_field(smap->map.record, BPF_UPTR))
+				copy_map_uptr_locked(&smap->map, old_sdata->data,
+						     value, false);
+			else
+				copy_map_value_locked(&smap->map, old_sdata->data,
+						      value, false);
 			return old_sdata;
 		}
 	}
@@ -607,8 +615,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 		goto unlock;
 
 	if (old_sdata && (map_flags & BPF_F_LOCK)) {
-		copy_map_value_locked(&smap->map, old_sdata->data, value,
-				      false);
+		if (smap->map.map_type == BPF_MAP_TYPE_TASK_STORAGE &&
+		    btf_record_has_field(smap->map.record, BPF_UPTR))
+			copy_map_uptr_locked(&smap->map, old_sdata->data,
+					     value, false);
+		else
+			copy_map_value_locked(&smap->map, old_sdata->data,
+					      value, false);
 		selem = SELEM(old_sdata);
 		goto unlock;
 	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d02ae323996b..d588b52605b9 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -388,6 +388,26 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }
 
+/* Copy map value and uptr from src to dst, with lock_src indicating
+ * whether src or dst is locked.
+ */
+void copy_map_uptr_locked(struct bpf_map *map, void *src, void *dst,
+			  bool lock_src)
+{
+	struct bpf_spin_lock *lock;
+
+	if (lock_src)
+		lock = src + map->record->spin_lock_off;
+	else
+		lock = dst + map->record->spin_lock_off;
+	preempt_disable();
+	__bpf_spin_lock_irqsave(lock);
+	copy_map_value(map, dst, src);
+	bpf_obj_uptrcpy(map->record, dst, src);
+	__bpf_spin_unlock_irqrestore(lock);
+	preempt_enable();
+}
+
 BPF_CALL_0(bpf_jiffies64)
 {
 	return get_jiffies_64();
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fed4a2145f81..1854aeb13ff7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -155,8 +155,140 @@ static void maybe_wait_bpf_programs(struct bpf_map *map)
 		synchronize_rcu();
 }
 
-static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
-				void *key, void *value, __u64 flags)
+void bpf_obj_unpin_uptr(const struct btf_field *field, void *addr)
+{
+	struct page *pages[1];
+	u32 size, type_id;
+	int npages;
+	void *ptr;
+
+	type_id = field->kptr.btf_id;
+	btf_type_id_size(field->kptr.btf, &type_id, &size);
+	if (size == 0)
+		return;
+
+	ptr = (void *)((intptr_t)addr & PAGE_MASK);
+
+	npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
+	if (WARN_ON_ONCE(npages > 1))
+		return;
+
+	pages[0] = virt_to_page(ptr);
+	unpin_user_pages(pages, 1);
+}
+
+/* Unpin uptr fields in the record up to cnt */
+static void bpf_obj_unpin_uptrs_cnt(struct btf_record *rec, int cnt, void *src)
+{
+	u32 next_off;
+	void **kaddr_ptr;
+	int i;
+
+	for (i = 0; i < cnt; i++) {
+		if (rec->fields[i].type != BPF_UPTR)
+			continue;
+
+		next_off = rec->fields[i].offset;
+		kaddr_ptr = src + next_off;
+		if (*kaddr_ptr) {
+			bpf_obj_unpin_uptr(&rec->fields[i], *kaddr_ptr);
+			*kaddr_ptr = NULL;
+		}
+	}
+}
+
+/* Find all BPF_UPTR fields in the record, pin the user memory, map it
+ * to kernel space, and update the addresses in the source memory.
+ *
+ * The map value passing from userspace may contain user kptrs pointing to
+ * user memory. This function pins the user memory and maps it to kernel
+ * memory so that BPF programs can access it.
+ */
+static int bpf_obj_trans_pin_uptrs(struct btf_record *rec, void *src, u32 size)
+{
+	u32 type_id, tsz, npages, next_off;
+	void *uaddr, *kaddr, **uaddr_ptr;
+	const struct btf_type *t;
+	struct page *pages[1];
+	int i, err;
+
+	if (IS_ERR_OR_NULL(rec))
+		return 0;
+
+	if (!btf_record_has_field(rec, BPF_UPTR))
+		return 0;
+
+	for (i = 0; i < rec->cnt; i++) {
+		if (rec->fields[i].type != BPF_UPTR)
+			continue;
+
+		next_off = rec->fields[i].offset;
+		if (next_off + sizeof(void *) > size) {
+			err = -EFAULT;
+			goto rollback;
+		}
+		uaddr_ptr = src + next_off;
+		uaddr = *uaddr_ptr;
+		if (!uaddr)
+			continue;
+
+		/* Make sure the user memory takes up at most one page */
+		type_id = rec->fields[i].kptr.btf_id;
+		t = btf_type_id_size(rec->fields[i].kptr.btf, &type_id, &tsz);
+		if (!t) {
+			err = -EFAULT;
+			goto rollback;
+		}
+		if (tsz == 0) {
+			*uaddr_ptr = NULL;
+			continue;
+		}
+		npages = (((intptr_t)uaddr + tsz + ~PAGE_MASK) -
+			  ((intptr_t)uaddr & PAGE_MASK)) >> PAGE_SHIFT;
+		if (npages > 1) {
+			/* Allow only one page */
+			err = -EFAULT;
+			goto rollback;
+		}
+
+		/* Pin the user memory */
+		err = pin_user_pages_fast((intptr_t)uaddr, 1, FOLL_LONGTERM | FOLL_WRITE, pages);
+		if (err < 0)
+			goto rollback;
+
+		/* Map to kernel space */
+		kaddr = page_address(pages[0]);
+		if (unlikely(!kaddr)) {
+			WARN_ON_ONCE(1);
+			unpin_user_pages(pages, 1);
+			err = -EFAULT;
+			goto rollback;
+		}
+		*uaddr_ptr = kaddr + ((intptr_t)uaddr & ~PAGE_MASK);
+	}
+
+	return 0;
+
+rollback:
+	/* Unpin the user memory of earlier fields */
+	bpf_obj_unpin_uptrs_cnt(rec, i, src);
+
+	return err;
+}
+
+static void bpf_obj_unpin_uptrs(struct btf_record *rec, void *src)
+{
+	if (IS_ERR_OR_NULL(rec))
+		return;
+
+	if (!btf_record_has_field(rec, BPF_UPTR))
+		return;
+
+	bpf_obj_unpin_uptrs_cnt(rec, rec->cnt, src);
+}
+
+static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
+				      void *key, void *value, __u64 flags)
 {
 	int err;
 
@@ -208,6 +340,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	return err;
 }
 
+static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
+				void *key, void *value, __u64 flags)
+{
+	int err;
+
+	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE) {
+		/* Pin user memory can lead to context switch, so we need
+		 * to do it before potential RCU lock.
+		 */
+		err = bpf_obj_trans_pin_uptrs(map->record, value,
+					      bpf_map_value_size(map));
+		if (err)
+			return err;
+	}
+
+	err = bpf_map_update_value_inner(map, map_file, key, value, flags);
+
+	if (err && map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
+		bpf_obj_unpin_uptrs(map->record, value);
+
+	return err;
+}
+
 static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 			      __u64 flags)
 {
@@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 				field->kptr.dtor(xchgd_field);
 			}
 			break;
+		case BPF_UPTR:
+			if (*(void **)field_ptr)
+				bpf_obj_unpin_uptr(field, *(void **)field_ptr);
+			*(void **)field_ptr = NULL;
+			break;
 		case BPF_LIST_HEAD:
 			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
 				continue;
@@ -1099,7 +1259,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
-				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE,
+				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
 				       map->value_size);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
@@ -1155,6 +1315,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 					goto free_map_tab;
 				}
 				break;
+			case BPF_UPTR:
+				if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
+					ret = -EOPNOTSUPP;
+					goto free_map_tab;
+				}
+				break;
 			case BPF_LIST_HEAD:
 			case BPF_RB_ROOT:
 				if (map->map_type != BPF_MAP_TYPE_HASH &&
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC bpf-next v4 5/6] libbpf: define __uptr.
  2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
                   ` (3 preceding siblings ...)
  2024-08-16 19:12 ` [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls Kui-Feng Lee
@ 2024-08-16 19:12 ` Kui-Feng Lee
  2024-08-27 23:13   ` Andrii Nakryiko
  2024-08-16 19:12 ` [RFC bpf-next v4 6/6] selftests/bpf: test __uptr on the value of a task storage map Kui-Feng Lee
  5 siblings, 1 reply; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee

Make __uptr available to BPF programs to enable them to define uptrs.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 tools/lib/bpf/bpf_helpers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 305c62817dd3..7ff9d947b976 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -185,6 +185,7 @@ enum libbpf_tristate {
 #define __kptr_untrusted __attribute__((btf_type_tag("kptr_untrusted")))
 #define __kptr __attribute__((btf_type_tag("kptr")))
 #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))
+#define __uptr __attribute__((btf_type_tag("uptr")))
 
 #if defined (__clang__)
 #define bpf_ksym_exists(sym) ({						\
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC bpf-next v4 6/6] selftests/bpf: test __uptr on the value of a task storage map.
  2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
                   ` (4 preceding siblings ...)
  2024-08-16 19:12 ` [RFC bpf-next v4 5/6] libbpf: define __uptr Kui-Feng Lee
@ 2024-08-16 19:12 ` Kui-Feng Lee
  5 siblings, 0 replies; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-16 19:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, andrii; +Cc: sinquersw, kuifeng, Kui-Feng Lee

Make sure the memory pointed to by uptrs has been mapped to the kernel
properly. Also ensure the kernel values of uptrs are not copied back to
userspace.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 .../bpf/prog_tests/task_local_storage.c       | 106 ++++++++++++++++++
 .../selftests/bpf/progs/task_ls_uptr.c        |  65 +++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/task_ls_uptr.c

diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
index c33c05161a9e..5709b083021c 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
@@ -5,6 +5,7 @@
 #include <unistd.h>
 #include <sched.h>
 #include <pthread.h>
+#include <sys/eventfd.h>
 #include <sys/syscall.h>   /* For SYS_xxx definitions */
 #include <sys/types.h>
 #include <test_progs.h>
@@ -14,6 +15,20 @@
 #include "task_ls_recursion.skel.h"
 #include "task_storage_nodeadlock.skel.h"
 
+struct user_data {
+	int a;
+	int b;
+	int result;
+};
+
+struct value_type {
+	struct user_data *udata;
+};
+
+#define MAGIC_VALUE 0xabcd1234
+
+#include "task_ls_uptr.skel.h"
+
 static void test_sys_enter_exit(void)
 {
 	struct task_local_storage *skel;
@@ -40,6 +55,95 @@ static void test_sys_enter_exit(void)
 	task_local_storage__destroy(skel);
 }
 
+static struct user_data user_data __attribute__((aligned(16))) = {
+	.a = 1,
+	.b = 2,
+};
+
+static void test_uptr(void)
+{
+	struct task_ls_uptr *skel = NULL;
+	int task_fd = -1, ev_fd = -1;
+	struct value_type value;
+	int err, wstatus;
+	__u64 dummy = 1;
+	pid_t pid;
+
+	value.udata = &user_data;
+
+	task_fd = sys_pidfd_open(getpid(), 0);
+	if (!ASSERT_NEQ(task_fd, -1, "sys_pidfd_open"))
+		goto out;
+
+	ev_fd = eventfd(0, 0);
+	if (!ASSERT_NEQ(ev_fd, -1, "eventfd"))
+		goto out;
+
+	skel = task_ls_uptr__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		goto out;
+
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &value, 0);
+	if (!ASSERT_OK(err, "update_datamap"))
+		exit(1);
+
+	err = task_ls_uptr__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	fflush(stdout);
+	fflush(stderr);
+
+	pid = fork();
+	if (pid < 0)
+		goto out;
+
+	/* Call syscall in the child process, but access the map value of
+	 * the parent process in the BPF program to check if the user kptr
+	 * is translated/mapped correctly.
+	 */
+	if (pid == 0) {
+		/* child */
+
+		/* Overwrite the user_data in the child process to check if
+		 * the BPF program accesses the user_data of the parent.
+		 */
+		user_data.a = 0;
+		user_data.b = 0;
+
+		/* Wait for the parent to set child_pid */
+		read(ev_fd, &dummy, sizeof(dummy));
+
+		exit(0);
+	}
+
+	skel->bss->parent_pid = syscall(SYS_gettid);
+	skel->bss->child_pid = pid;
+
+	write(ev_fd, &dummy, sizeof(dummy));
+
+	err = waitpid(pid, &wstatus, 0);
+	ASSERT_EQ(err, pid, "waitpid");
+	skel->bss->child_pid = 0;
+
+	ASSERT_EQ(MAGIC_VALUE + user_data.a + user_data.b,
+		  user_data.result, "result");
+
+	/* Check if user programs can access the value of user kptrs
+	 * through bpf_map_lookup_elem(). Make sure the kernel value is not
+	 * leaked.
+	 */
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &value);
+	if (!ASSERT_OK(err, "bpf_map_lookup_elem"))
+		goto out;
+	ASSERT_EQ(value.udata, NULL, "lookup_udata");
+
+out:
+	task_ls_uptr__destroy(skel);
+	close(ev_fd);
+	close(task_fd);
+}
+
 static void test_exit_creds(void)
 {
 	struct task_local_storage_exit_creds *skel;
@@ -237,4 +341,6 @@ void test_task_local_storage(void)
 		test_recursion();
 	if (test__start_subtest("nodeadlock"))
 		test_nodeadlock();
+	if (test__start_subtest("uptr"))
+		test_uptr();
 }
diff --git a/tools/testing/selftests/bpf/progs/task_ls_uptr.c b/tools/testing/selftests/bpf/progs/task_ls_uptr.c
new file mode 100644
index 000000000000..473e6890d522
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_ls_uptr.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "task_kfunc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct user_data {
+	int a;
+	int b;
+	int result;
+};
+
+struct value_type {
+	struct user_data __uptr *udata;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct value_type);
+} datamap SEC(".maps");
+
+#define MAGIC_VALUE 0xabcd1234
+
+/* This is a workaround to avoid clang generating a forward reference for
+ * struct user_data. This is a known issue and will be fixed in the future.
+ */
+struct user_data __dummy;
+
+pid_t child_pid = 0;
+pid_t parent_pid = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task, *data_task;
+	struct value_type *ptr;
+	struct user_data *udata;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != child_pid)
+		return 0;
+
+	data_task = bpf_task_from_pid(parent_pid);
+	if (!data_task)
+		return 0;
+
+	ptr = bpf_task_storage_get(&datamap, data_task, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	bpf_task_release(data_task);
+	if (!ptr)
+		return 0;
+
+	udata = ptr->udata;
+	if (!udata)
+		return 0;
+	udata->result = MAGIC_VALUE + udata->a + udata->b;
+
+	return 0;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC bpf-next v4 5/6] libbpf: define __uptr.
  2024-08-16 19:12 ` [RFC bpf-next v4 5/6] libbpf: define __uptr Kui-Feng Lee
@ 2024-08-27 23:13   ` Andrii Nakryiko
  2024-08-28 17:53     ` Kui-Feng Lee
  0 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2024-08-27 23:13 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, andrii, sinquersw, kuifeng

On Fri, Aug 16, 2024 at 12:12 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> Make __uptr available to BPF programs to enable them to define uptrs.
>
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  tools/lib/bpf/bpf_helpers.h | 1 +
>  1 file changed, 1 insertion(+)
>

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index 305c62817dd3..7ff9d947b976 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -185,6 +185,7 @@ enum libbpf_tristate {
>  #define __kptr_untrusted __attribute__((btf_type_tag("kptr_untrusted")))
>  #define __kptr __attribute__((btf_type_tag("kptr")))
>  #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))
> +#define __uptr __attribute__((btf_type_tag("uptr")))
>
>  #if defined (__clang__)
>  #define bpf_ksym_exists(sym) ({                                                \
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC bpf-next v4 5/6] libbpf: define __uptr.
  2024-08-27 23:13   ` Andrii Nakryiko
@ 2024-08-28 17:53     ` Kui-Feng Lee
  0 siblings, 0 replies; 15+ messages in thread
From: Kui-Feng Lee @ 2024-08-28 17:53 UTC (permalink / raw)
  To: Andrii Nakryiko, Kui-Feng Lee; +Cc: bpf, ast, martin.lau, andrii, kuifeng


Thanks for the review.

On 8/27/24 16:13, Andrii Nakryiko wrote:
> On Fri, Aug 16, 2024 at 12:12 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>
>> Make __uptr available to BPF programs to enable them to define uptrs.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   tools/lib/bpf/bpf_helpers.h | 1 +
>>   1 file changed, 1 insertion(+)
>>
> 
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> 
>> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
>> index 305c62817dd3..7ff9d947b976 100644
>> --- a/tools/lib/bpf/bpf_helpers.h
>> +++ b/tools/lib/bpf/bpf_helpers.h
>> @@ -185,6 +185,7 @@ enum libbpf_tristate {
>>   #define __kptr_untrusted __attribute__((btf_type_tag("kptr_untrusted")))
>>   #define __kptr __attribute__((btf_type_tag("kptr")))
>>   #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))
>> +#define __uptr __attribute__((btf_type_tag("uptr")))
>>
>>   #if defined (__clang__)
>>   #define bpf_ksym_exists(sym) ({                                                \
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-08-16 19:12 ` [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls Kui-Feng Lee
@ 2024-08-28 23:24   ` Alexei Starovoitov
  2024-09-04 22:21     ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2024-08-28 23:24 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Andrii Nakryiko,
	Kui-Feng Lee, Kui-Feng Lee, linux-mm

On Fri, Aug 16, 2024 at 12:12 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> When a user program updates a map value, every uptr will be pinned and
> translated to an address in the kernel. This process is initiated by
> calling bpf_map_update_elem() from user programs.
>
> To access uptrs in BPF programs, they are pinned using
> pin_user_pages_fast(), but the conversion to kernel addresses is actually
> done by page_address(). The uptrs can be unpinned using unpin_user_pages().
>
> Currently, the memory block pointed to by a uptr must reside in a single
> memory page, as crossing multiple pages is not supported. uptr is only
> supported by task storage maps and can only be set by user programs through
> syscalls.
>
> When the value of an uptr is overwritten or destroyed, the memory pointed
> to by the old value must be unpinned. This is ensured by calling
> bpf_obj_uptrcpy() and copy_map_uptr_locked()

Doesn't look like there is a test for it, but more importantly
unpin shouldn't be called from bpf prog, since
we cannot guarantee that the execution context is safe enough to do unpin.
More on this below.

> when updating map value and by
> bpf_obj_free_fields() when destroying map value.
>
> Cc: linux-mm@kvack.org
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  include/linux/bpf.h            |  30 ++++++
>  kernel/bpf/bpf_local_storage.c |  23 ++++-
>  kernel/bpf/helpers.c           |  20 ++++
>  kernel/bpf/syscall.c           | 172 ++++++++++++++++++++++++++++++++-
>  4 files changed, 237 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 954e476b5605..886c818ff555 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -477,6 +477,8 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>                 data_race(*ldst++ = *lsrc++);
>  }
>
> +void bpf_obj_unpin_uptr(const struct btf_field *field, void *addr);
> +
>  /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
>  static inline void bpf_obj_memcpy(struct btf_record *rec,
>                                   void *dst, void *src, u32 size,
> @@ -503,6 +505,34 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
>         memcpy(dst + curr_off, src + curr_off, size - curr_off);
>  }
>
> +static inline void bpf_obj_uptrcpy(struct btf_record *rec,
> +                                  void *dst, void *src)
> +{
> +       int i;
> +
> +       if (IS_ERR_OR_NULL(rec))
> +               return;
> +
> +       for (i = 0; i < rec->cnt; i++) {
> +               u32 next_off = rec->fields[i].offset;
> +               void *addr;
> +
> +               if (rec->fields[i].type == BPF_UPTR) {
> +                       /* Unpin old address.
> +                        *
> +                        * Alignments are guaranteed by btf_find_field_one().
> +                        */
> +                       addr = *(void **)(dst + next_off);
> +                       if (addr)
> +                               bpf_obj_unpin_uptr(&rec->fields[i], addr);
> +
> +                       *(void **)(dst + next_off) = *(void **)(src + next_off);
> +               }
> +       }
> +}

The whole helper can be removed. See below.

> +
> +void copy_map_uptr_locked(struct bpf_map *map, void *dst, void *src, bool lock_src);
> +
>  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>  {
>         bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index c938dea5ddbf..2fafad53b9d9 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -99,8 +99,11 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>         }
>
>         if (selem) {
> -               if (value)
> +               if (value) {
>                         copy_map_value(&smap->map, SDATA(selem)->data, value);
> +                       if (smap->map.map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +                               bpf_obj_uptrcpy(smap->map.record, SDATA(selem)->data, value);

This part should be dropped.
bpf prog should not be able to call unpin on uptr.
It cannot supply new uptr in value anyway.
On the other side the user space should be able to
bpf_map_update_elem() with one value->udata and
then call it again with a different value->udata.
Old one should be unpinned and the new udata pinned,
but that shouldn't be done from the guts of bpf_selem_alloc().
Instead, all of pin/unpin must be done while handling sys_bpf command.
More below.


> +               }
>                 /* No need to call check_and_init_map_value as memory is zero init */
>                 return selem;
>         }
> @@ -575,8 +578,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>                 if (err)
>                         return ERR_PTR(err);
>                 if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> -                       copy_map_value_locked(&smap->map, old_sdata->data,
> -                                             value, false);
> +                       if (smap->map.map_type == BPF_MAP_TYPE_TASK_STORAGE &&
> +                           btf_record_has_field(smap->map.record, BPF_UPTR))
> +                               copy_map_uptr_locked(&smap->map, old_sdata->data,
> +                                                    value, false);
> +                       else
> +                               copy_map_value_locked(&smap->map, old_sdata->data,
> +                                                     value, false);

Similar. unpin here is dangerous.
Since the combination of bpf_spin_lock and uptr in map element
causing this complexity we should simply disable this combination
until the actual use case comes up.
Then above hunk won't be needed.

>                         return old_sdata;
>                 }
>         }
> @@ -607,8 +615,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>                 goto unlock;
>
>         if (old_sdata && (map_flags & BPF_F_LOCK)) {
> -               copy_map_value_locked(&smap->map, old_sdata->data, value,
> -                                     false);
> +               if (smap->map.map_type == BPF_MAP_TYPE_TASK_STORAGE &&
> +                   btf_record_has_field(smap->map.record, BPF_UPTR))
> +                       copy_map_uptr_locked(&smap->map, old_sdata->data,
> +                                            value, false);
> +               else
> +                       copy_map_value_locked(&smap->map, old_sdata->data,
> +                                             value, false);

This one won't be needed either.

>                 selem = SELEM(old_sdata);
>                 goto unlock;
>         }
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index d02ae323996b..d588b52605b9 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -388,6 +388,26 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>         preempt_enable();
>  }
>
> +/* Copy map value and uptr from src to dst, with lock_src indicating
> + * whether src or dst is locked.
> + */
> +void copy_map_uptr_locked(struct bpf_map *map, void *src, void *dst,
> +                         bool lock_src)
> +{
> +       struct bpf_spin_lock *lock;
> +
> +       if (lock_src)
> +               lock = src + map->record->spin_lock_off;
> +       else
> +               lock = dst + map->record->spin_lock_off;
> +       preempt_disable();
> +       __bpf_spin_lock_irqsave(lock);
> +       copy_map_value(map, dst, src);
> +       bpf_obj_uptrcpy(map->record, dst, src);
> +       __bpf_spin_unlock_irqrestore(lock);
> +       preempt_enable();
> +}

This one has to be removed too.
Just think of the consequences of the above.
It may do unpin with irqs disabled.
It's asking for trouble depending on where the udata pointer originates.

> +void bpf_obj_unpin_uptr(const struct btf_field *field, void *addr)
> +{
> +       struct page *pages[1];
> +       u32 size, type_id;
> +       int npages;
> +       void *ptr;
> +
> +       type_id = field->kptr.btf_id;
> +       btf_type_id_size(field->kptr.btf, &type_id, &size);
> +       if (size == 0)
> +               return;
> +
> +       ptr = (void *)((intptr_t)addr & PAGE_MASK);
> +
> +       npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
> +       if (WARN_ON_ONCE(npages > 1))
> +               return;

This check is unnecessary. We check that there is only one page
during the pin. No need to repeat during unpin.

> +
> +       pages[0] = virt_to_page(ptr);
> +       unpin_user_pages(pages, 1);

The whole helper can be just above two lines.

> +}
> +
> +/* Unpin uptr fields in the record up to cnt */
> +static void bpf_obj_unpin_uptrs_cnt(struct btf_record *rec, int cnt, void *src)
> +{
> +       u32 next_off;
> +       void **kaddr_ptr;
> +       int i;
> +
> +       for (i = 0; i < cnt; i++) {
> +               if (rec->fields[i].type != BPF_UPTR)
> +                       continue;
> +
> +               next_off = rec->fields[i].offset;
> +               kaddr_ptr = src + next_off;
> +               if (*kaddr_ptr) {
> +                       bpf_obj_unpin_uptr(&rec->fields[i], *kaddr_ptr);
> +                       *kaddr_ptr = NULL;
> +               }
> +       }
> +}
> +
> +/* Find all BPF_UPTR fields in the record, pin the user memory, map it
> + * to kernel space, and update the addresses in the source memory.
> + *
> + * The map value passing from userspace may contain user kptrs pointing to
> + * user memory. This function pins the user memory and maps it to kernel
> + * memory so that BPF programs can access it.
> + */
> +static int bpf_obj_trans_pin_uptrs(struct btf_record *rec, void *src, u32 size)
> +{
> +       u32 type_id, tsz, npages, next_off;
> +       void *uaddr, *kaddr, **uaddr_ptr;
> +       const struct btf_type *t;
> +       struct page *pages[1];
> +       int i, err;
> +
> +       if (IS_ERR_OR_NULL(rec))
> +               return 0;
> +
> +       if (!btf_record_has_field(rec, BPF_UPTR))
> +               return 0;
> +
> +       for (i = 0; i < rec->cnt; i++) {
> +               if (rec->fields[i].type != BPF_UPTR)
> +                       continue;
> +
> +               next_off = rec->fields[i].offset;
> +               if (next_off + sizeof(void *) > size) {
> +                       err = -EFAULT;
> +                       goto rollback;
> +               }

size argument and above check are unnecessary.
btf_record has to be correct at this point.

> +               uaddr_ptr = src + next_off;
> +               uaddr = *uaddr_ptr;
> +               if (!uaddr)
> +                       continue;
> +
> +               /* Make sure the user memory takes up at most one page */
> +               type_id = rec->fields[i].kptr.btf_id;
> +               t = btf_type_id_size(rec->fields[i].kptr.btf, &type_id, &tsz);
> +               if (!t) {
> +                       err = -EFAULT;
> +                       goto rollback;
> +               }
> +               if (tsz == 0) {
> +                       *uaddr_ptr = NULL;
> +                       continue;
> +               }

tsz==0 ? are you sure this can happen?
If so there has to be a test for this.
And we probably should reject it earlier in the verifier.
zero sized struct as uptr makes no practical use case.

> +               npages = (((intptr_t)uaddr + tsz + ~PAGE_MASK) -
> +                         ((intptr_t)uaddr & PAGE_MASK)) >> PAGE_SHIFT;
> +               if (npages > 1) {
> +                       /* Allow only one page */
> +                       err = -EFAULT;

E2BIG would be a better error in such a case.
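As a side note, the single-page constraint this check enforces is easy to exercise in plain user-space C. The sketch below mirrors the patch's npages arithmetic; the 4 KiB page size is assumed purely for illustration and is not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                       /* assume 4 KiB pages for illustration */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Number of pages touched by the block [uaddr, uaddr + tsz), computed the
 * same way as the npages expression in the patch: distance from the start
 * of the first page to the rounded-up end, shifted by PAGE_SHIFT.
 */
static unsigned long npages_spanned(uintptr_t uaddr, size_t tsz)
{
	return ((uaddr + tsz + ~PAGE_MASK) - (uaddr & PAGE_MASK)) >> PAGE_SHIFT;
}
```

A block that ends exactly on a page boundary still counts as one page, while one that straddles the boundary by even a byte counts as two and would be rejected.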

> +                       goto rollback;
> +               }
> +
> +               /* Pin the user memory */
> +               err = pin_user_pages_fast((intptr_t)uaddr, 1, FOLL_LONGTERM | FOLL_WRITE, pages);
> +               if (err < 0)
> +                       goto rollback;

since it's "_fast" version it can return 0 too.
In this case it's a case of rollback as well.
It's better to change this check to if (err != 1)
which is a more canonical way.

> +
> +               /* Map to kernel space */
> +               kaddr = page_address(pages[0]);
> +               if (unlikely(!kaddr)) {
> +                       WARN_ON_ONCE(1);

Since the page was pinned the above cannot fail.
No reason for this check.

> +                       unpin_user_pages(pages, 1);
> +                       err = -EFAULT;
> +                       goto rollback;
> +               }
> +               *uaddr_ptr = kaddr + ((intptr_t)uaddr & ~PAGE_MASK);
> +       }
> +
> +       return 0;
> +
> +rollback:
> +       /* Unpin the user memory of earlier fields */
> +       bpf_obj_unpin_uptrs_cnt(rec, i, src);
> +
> +       return err;
> +}
> +
> +static void bpf_obj_unpin_uptrs(struct btf_record *rec, void *src)
> +{
> +       if (IS_ERR_OR_NULL(rec))
> +               return;
> +
> +       if (!btf_record_has_field(rec, BPF_UPTR))
> +               return;
> +
> +       bpf_obj_unpin_uptrs_cnt(rec, rec->cnt, src);
> +}
> +
> +static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
> +                                     void *key, void *value, __u64 flags)
>  {
>         int err;
>
> @@ -208,6 +340,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>         return err;
>  }
>
> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> +                               void *key, void *value, __u64 flags)
> +{
> +       int err;
> +
> +       if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE) {
> +               /* Pin user memory can lead to context switch, so we need
> +                * to do it before potential RCU lock.
> +                */
> +               err = bpf_obj_trans_pin_uptrs(map->record, value,
> +                                             bpf_map_value_size(map));
> +               if (err)
> +                       return err;
> +       }
> +
> +       err = bpf_map_update_value_inner(map, map_file, key, value, flags);
> +
> +       if (err && map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +               bpf_obj_unpin_uptrs(map->record, value);

Pls don't rename bpf_map_update_value_inner.
Instead add "if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)" case
inside bpf_map_update_value() and
do all of unpin/pin calls from there.

> +
> +       return err;
> +}
> +
>  static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
>                               __u64 flags)
>  {
> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>                                 field->kptr.dtor(xchgd_field);
>                         }
>                         break;
> +               case BPF_UPTR:
> +                       if (*(void **)field_ptr)
> +                               bpf_obj_unpin_uptr(field, *(void **)field_ptr);
> +                       *(void **)field_ptr = NULL;

This one will be called from
 task_storage_delete->bpf_selem_free->bpf_obj_free_fields

and even if unpin was safe to do from that context
we cannot just do:
*(void **)field_ptr = NULL;

since bpf prog might be running in parallel,
it could have just read that addr and now is using it.

The first thought of a way to fix this was to split
bpf_obj_free_fields() into the current one plus
bpf_obj_free_fields_after_gp()
that will do the above unpin bit.
and call the later one from bpf_selem_free_rcu()
while bpf_obj_free_fields() from bpf_selem_free()
will not touch uptr.

But after digging further I realized that task_storage
already switched to use bpf_ma, so the above won't work.

So we need something similar to BPF_KPTR_REF logic:
xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
and then delay of uptr unpin for that address into call_rcu.

Any better ideas?
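For illustration, the xchg-then-defer idea can be sketched in user space with C11 atomics. Here atomic_exchange stands in for the kernel's xchg(), and the deferred callback is invoked immediately instead of after an RCU grace period; both are simplifications of this sketch, not the real mechanism:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int deferred_unpins;

/* Stand-in for queueing the unpin into an RCU callback; a real kernel
 * implementation would call_rcu() here so that in-flight readers of the
 * old pointer finish before the page is unpinned.
 */
static void defer_unpin(void *addr)
{
	if (addr)
		deferred_unpins++;
}

/* Clear the uptr field atomically so a concurrent reader observes either
 * the old valid pointer or NULL, never a pointer whose page has already
 * been unpinned; the unpin itself is delayed past the grace period.
 */
static void uptr_field_free(_Atomic(void *) *field_ptr)
{
	void *old = atomic_exchange(field_ptr, NULL);

	defer_unpin(old);
}
```

Calling it twice on the same field is harmless: the second exchange yields NULL and nothing further is deferred.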

> +                       break;
>                 case BPF_LIST_HEAD:
>                         if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>                                 continue;
> @@ -1099,7 +1259,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>
>         map->record = btf_parse_fields(btf, value_type,
>                                        BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
> -                                      BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE,
> +                                      BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
>                                        map->value_size);
>         if (!IS_ERR_OR_NULL(map->record)) {
>                 int i;
> @@ -1155,6 +1315,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>                                         goto free_map_tab;
>                                 }
>                                 break;
> +                       case BPF_UPTR:
> +                               if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
> +                                       ret = -EOPNOTSUPP;
> +                                       goto free_map_tab;
> +                               }
> +                               break;

I was thinking whether we need an additional check that bpf_obj_new()
cannot be used to allocate a prog supplied struct with uptr in it,
but we're good here, since we only allow
__btf_type_is_scalar_struct() for bpf_obj_new.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-08-28 23:24   ` Alexei Starovoitov
@ 2024-09-04 22:21     ` Martin KaFai Lau
  2024-09-06 20:11       ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2024-09-04 22:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Kui-Feng Lee, linux-mm

On 8/28/24 4:24 PM, Alexei Starovoitov wrote:
>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>                                  field->kptr.dtor(xchgd_field);
>>                          }
>>                          break;
>> +               case BPF_UPTR:
>> +                       if (*(void **)field_ptr)
>> +                               bpf_obj_unpin_uptr(field, *(void **)field_ptr);
>> +                       *(void **)field_ptr = NULL;
> This one will be called from
>   task_storage_delete->bpf_selem_free->bpf_obj_free_fields
> 
> and even if unpin was safe to do from that context
> we cannot just do:
> *(void **)field_ptr = NULL;
> 
> since bpf prog might be running in parallel,
> it could have just read that addr and now is using it.
> 
> The first thought of a way to fix this was to split
> bpf_obj_free_fields() into the current one plus
> bpf_obj_free_fields_after_gp()
> that will do the above unpin bit.
> and call the later one from bpf_selem_free_rcu()
> while bpf_obj_free_fields() from bpf_selem_free()
> will not touch uptr.
> 
> But after digging further I realized that task_storage
> already switched to use bpf_ma, so the above won't work.
> 
> So we need something similar to BPF_KPTR_REF logic:
> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> and then delay of uptr unpin for that address into call_rcu.
> 
> Any better ideas?

Many thanks to Kui-Feng for starting this useful work on task storage. I will 
think about it and respin the set.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-09-04 22:21     ` Martin KaFai Lau
@ 2024-09-06 20:11       ` Martin KaFai Lau
  2024-09-06 23:44         ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2024-09-06 20:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Kui-Feng Lee, linux-mm

On 9/4/24 3:21 PM, Martin KaFai Lau wrote:
> On 8/28/24 4:24 PM, Alexei Starovoitov wrote:
>>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, 
>>> void *obj)
>>>                                  field->kptr.dtor(xchgd_field);
>>>                          }
>>>                          break;
>>> +               case BPF_UPTR:
>>> +                       if (*(void **)field_ptr)
>>> +                               bpf_obj_unpin_uptr(field, *(void **)field_ptr);
>>> +                       *(void **)field_ptr = NULL;
>> This one will be called from
>>   task_storage_delete->bpf_selem_free->bpf_obj_free_fields
>>
>> and even if unpin was safe to do from that context
>> we cannot just do:
>> *(void **)field_ptr = NULL;
>>
>> since bpf prog might be running in parallel,
>> it could have just read that addr and now is using it.
>>
>> The first thought of a way to fix this was to split
>> bpf_obj_free_fields() into the current one plus
>> bpf_obj_free_fields_after_gp()
>> that will do the above unpin bit.
>> and call the later one from bpf_selem_free_rcu()
>> while bpf_obj_free_fields() from bpf_selem_free()
>> will not touch uptr.
>>
>> But after digging further I realized that task_storage
>> already switched to use bpf_ma, so the above won't work.
>>
>> So we need something similar to BPF_KPTR_REF logic:
>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>> and then delay of uptr unpin for that address into call_rcu.
>>
>> Any better ideas?
> 

I think the existing reuse_now arg in the bpf_selem_free can be used. reuse_now 
(renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace 
for the common case.

selem (of type "struct bpf_local_storage_elem") is the one exposed to the bpf prog.

bpf_selem_free knows whether a selem can be reused immediately based on the 
caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool 
reuse_now)".

If a selem cannot be reuse_now (i.e. == false), it is currently going through 
"call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do 
unpin_user_page() in the rcu callback.

A selem can be reuse_now (i.e. == true) if the selem is no longer needed because 
either its owner (i.e. the task_struct here) is going away in free_task() or the 
bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should 
have a hold on the selem at this point. I think for these two cases, the 
unpin_user_page() can be directly called in bpf_selem_free().

One existing bug is, from looking at patch 6, I don't think the free_task() case 
can be "reuse_now == true" anymore because the bpf_task_release kfunc did not 
mark the previously obtained task_storage to be invalid:

data_task = bpf_task_from_pid(parent_pid);
ptr = bpf_task_storage_get(&datamap, data_task, 0, ...);
bpf_task_release(data_task);
if (!ptr)
	return 0;
/* The prog still holds a valid task storage ptr. */
udata = ptr->udata;

It can be fixed by marking the ref_obj_id of the "ptr". Although it is more 
correct to make the task storage "ptr" invalid after task_release, it may break 
the existing progs.

The same issue probably is true for cgroup_storage. There is no release kfunc 
for inode and sk, so inode and sk storage should be fine.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-09-06 20:11       ` Martin KaFai Lau
@ 2024-09-06 23:44         ` Alexei Starovoitov
  2024-09-07  1:32           ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2024-09-06 23:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Kui-Feng Lee, linux-mm

On Fri, Sep 6, 2024 at 1:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/4/24 3:21 PM, Martin KaFai Lau wrote:
> > On 8/28/24 4:24 PM, Alexei Starovoitov wrote:
> >>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec,
> >>> void *obj)
> >>>                                  field->kptr.dtor(xchgd_field);
> >>>                          }
> >>>                          break;
> >>> +               case BPF_UPTR:
> >>> +                       if (*(void **)field_ptr)
> >>> +                               bpf_obj_unpin_uptr(field, *(void **)field_ptr);
> >>> +                       *(void **)field_ptr = NULL;
> >> This one will be called from
> >>   task_storage_delete->bpf_selem_free->bpf_obj_free_fields
> >>
> >> and even if unpin was safe to do from that context
> >> we cannot just do:
> >> *(void **)field_ptr = NULL;
> >>
> >> since bpf prog might be running in parallel,
> >> it could have just read that addr and now is using it.
> >>
> >> The first thought of a way to fix this was to split
> >> bpf_obj_free_fields() into the current one plus
> >> bpf_obj_free_fields_after_gp()
> >> that will do the above unpin bit.
> >> and call the later one from bpf_selem_free_rcu()
> >> while bpf_obj_free_fields() from bpf_selem_free()
> >> will not touch uptr.
> >>
> >> But after digging further I realized that task_storage
> >> already switched to use bpf_ma, so the above won't work.
> >>
> >> So we need something similar to BPF_KPTR_REF logic:
> >> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
> >> and then delay of uptr unpin for that address into call_rcu.
> >>
> >> Any better ideas?
> >
>
> I think the existing reuse_now arg in the bpf_selem_free can be used. reuse_now
> (renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace
> for the common case.
>
> selem (in type "struct bpf_local_storage_elem") is the one exposed to the bpf prog.
>
> bpf_selem_free knows whether a selem can be reused immediately based on the
> caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool
> reuse_now)".
>
> If a selem cannot be reuse_now (i.e. == false), it is currently going through
> "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do
> unpin_user_page() in the rcu callback.
>
> A selem can be reuse_now (i.e. == true) if the selem is no longer needed because
> either its owner (i.e. the task_struct here) is going away in free_task() or the
> bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should
> have a hold on the selem at this point. I think for these two cases, the
> unpin_user_page() can be directly called in bpf_selem_free().

but there is also this path:
bpf_task_storage_delete -> task_storage_delete -> bpf_selem_free
 -> bpf_obj_free_fields

In this case bpf prog may still be looking at uptr address
and we cannot do unpin right away in bpf_obj_free_fields.
All other special fields in map value are ok,
since they are either relying on bpf_mem_alloc and
have rcu/rcu_tasks_trace gp
or extra indirection like timer/wq.

> One existing bug is, from looking at patch 6, I don't think the free_task() case
> can be "reuse_now == true" anymore because of the bpf_task_release kfunc did not
> mark the previously obtained task_storage to be invalid:
>
> data_task = bpf_task_from_pid(parent_pid);
> ptr = bpf_task_storage_get(&datamap, data_task, 0, ...);
> bpf_task_release(data_task);
> if (!ptr)
>         return 0;
> /* The prog still holds a valid task storage ptr. */
> udata = ptr->udata;
>
> It can be fixed by marking the ref_obj_id of the "ptr". Although it is more
> correct to make the task storage "ptr" invalid after task_release, it may break
> the existing progs.

Are you suggesting that bpf_task_release should invalidate all pointers
fetched from map value?
That will work, but it's not an issue for other special fields in there
like kptr.
So this invalidation would be needed only for uptr, which feels
weird to special case it and probably will be confusing to users writing
such programs.
The above bpf prog example should be ok to use.
We only need to delay the unpin until after the rcu/rcu_tasks_trace gp.
Hence my proposal: in bpf_obj_free_fields(), do:
 case UPTR:
   xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
   call_rcu(...) to unpin.


* Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-09-06 23:44         ` Alexei Starovoitov
@ 2024-09-07  1:32           ` Martin KaFai Lau
  2024-09-07  4:03             ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2024-09-07  1:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Kui-Feng Lee, linux-mm

On 9/6/24 4:44 PM, Alexei Starovoitov wrote:
> On Fri, Sep 6, 2024 at 1:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 9/4/24 3:21 PM, Martin KaFai Lau wrote:
>>> On 8/28/24 4:24 PM, Alexei Starovoitov wrote:
>>>>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec,
>>>>> void *obj)
>>>>>                                   field->kptr.dtor(xchgd_field);
>>>>>                           }
>>>>>                           break;
>>>>> +               case BPF_UPTR:
>>>>> +                       if (*(void **)field_ptr)
>>>>> +                               bpf_obj_unpin_uptr(field, *(void **)field_ptr);
>>>>> +                       *(void **)field_ptr = NULL;
>>>> This one will be called from
>>>>    task_storage_delete->bpf_selem_free->bpf_obj_free_fields
>>>>
>>>> and even if unpin was safe to do from that context
>>>> we cannot just do:
>>>> *(void **)field_ptr = NULL;
>>>>
>>>> since bpf prog might be running in parallel,
>>>> it could have just read that addr and now is using it.
>>>>
>>>> The first thought of a way to fix this was to split
>>>> bpf_obj_free_fields() into the current one plus
>>>> bpf_obj_free_fields_after_gp()
>>>> that will do the above unpin bit.
>>>> and call the latter one from bpf_selem_free_rcu()
>>>> while bpf_obj_free_fields() from bpf_selem_free()
>>>> will not touch uptr.
>>>>
>>>> But after digging further I realized that task_storage
>>>> already switched to use bpf_ma, so the above won't work.
>>>>
>>>> So we need something similar to BPF_KPTR_REF logic:
>>>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>>>> and then delay of uptr unpin for that address into call_rcu.
>>>>
>>>> Any better ideas?
>>>
>>
>> I think the existing reuse_now arg in the bpf_selem_free can be used. reuse_now
>> (renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace
>> for the common case.
>>
>> selem (in type "struct bpf_local_storage_elem") is the one exposed to the bpf prog.
>>
>> bpf_selem_free knows whether a selem can be reused immediately based on the
>> caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool
>> reuse_now)".
>>
>> If a selem cannot be reuse_now (i.e. == false), it is currently going through
>> "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do
>> unpin_user_page() in the rcu callback.
>>
>> A selem can be reuse_now (i.e. == true) if the selem is no longer needed because
>> either its owner (i.e. the task_struct here) is going away in free_task() or the
>> bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should
>> have a hold on the selem at this point. I think for these two cases, the
>> unpin_user_page() can be directly called in bpf_selem_free().
> 
> but there is also this path:
> bpf_task_storage_delete -> task_storage_delete -> bpf_selem_free
>   -> bpf_obj_free_fields
> 
> In this case bpf prog may still be looking at uptr address
> and we cannot do unpin right away in bpf_obj_free_fields.

That we cannot unpin immediately in the bpf_task_storage_delete() path is 
understood. task_storage can be used in sleepable programs. It needs to wait 
for the tasks_trace and the regular rcu gp before unpinning.

I forgot to mention earlier that bpf_task_storage_delete() will have the 
bpf_selem_free(..., reuse_now == false). It will then do the 
"call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu);". The unpin could 
happen in bpf_selem_free_trace_rcu() in this case. I am suggesting to unpin in 
bpf_selem_free_trace_rcu together with the selem free.

I just noticed the map and its btf_record are gone in 
bpf_selem_free_trace_rcu()... so won't work. :(

> All other special fields in map value are ok,
> since they are either relying on bpf_mem_alloc and
> have rcu/rcu_tasks_trace gp
> or extra indirection like timer/wq.
> 
>> One existing bug is, from looking at patch 6, I don't think the free_task() case
>> can be "reuse_now == true" anymore because of the bpf_task_release kfunc did not
>> mark the previously obtained task_storage to be invalid:
>>
>> data_task = bpf_task_from_pid(parent_pid);
>> ptr = bpf_task_storage_get(&datamap, data_task, 0, ...);
>> bpf_task_release(data_task);
>> if (!ptr)
>>          return 0;
>> /* The prog still holds a valid task storage ptr. */
>> udata = ptr->udata;
>>
>> It can be fixed by marking the ref_obj_id of the "ptr". Although it is more
>> correct to make the task storage "ptr" invalid after task_release, it may break
>> the existing progs.
> 
> Are you suggesting that bpf_task_release should invalidate all pointers
> fetched from map value?

I was thinking at least the map value ptr itself needs to be invalidated.

> That will work, but it's not an issue for other special fields in there
> like kptr.
>> So this invalidation would be needed only for uptr, which feels
> weird to special case it and probably will be confusing to users writing
> such programs.

hmm... I haven't thought about the other pointer fields that were read before 
the task_release().

Agreed, it is hard to use if it only marks uptr invalid. Thinking about it. 
Even marking the map value ptr invalid while other previously read fields keep 
working has the same weirdness.

>> The above bpf prog example should be ok to use.
>> We only need to delay the unpin until after the rcu/rcu_tasks_trace gp.
>> Hence my proposal: in bpf_obj_free_fields(), do:
>   case UPTR:
>     xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>     call_rcu(...) to unpin.

Agree that call_rcu() here is the only option. It probably needs to go through 
the tasks_trace gp also.

Can the page->rcu_head be used here?


* Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.
  2024-09-07  1:32           ` Martin KaFai Lau
@ 2024-09-07  4:03             ` Martin KaFai Lau
  0 siblings, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2024-09-07  4:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Kui-Feng Lee, linux-mm

On 9/6/24 6:32 PM, Martin KaFai Lau wrote:
> On 9/6/24 4:44 PM, Alexei Starovoitov wrote:
>> On Fri, Sep 6, 2024 at 1:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 9/4/24 3:21 PM, Martin KaFai Lau wrote:
>>>> On 8/28/24 4:24 PM, Alexei Starovoitov wrote:
>>>>>> @@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec,
>>>>>> void *obj)
>>>>>>                                   field->kptr.dtor(xchgd_field);
>>>>>>                           }
>>>>>>                           break;
>>>>>> +               case BPF_UPTR:
>>>>>> +                       if (*(void **)field_ptr)
>>>>>> +                               bpf_obj_unpin_uptr(field, *(void 
>>>>>> **)field_ptr);
>>>>>> +                       *(void **)field_ptr = NULL;
>>>>> This one will be called from
>>>>>    task_storage_delete->bpf_selem_free->bpf_obj_free_fields
>>>>>
>>>>> and even if unpin was safe to do from that context
>>>>> we cannot just do:
>>>>> *(void **)field_ptr = NULL;
>>>>>
>>>>> since bpf prog might be running in parallel,
>>>>> it could have just read that addr and now is using it.
>>>>>
>>>>> The first thought of a way to fix this was to split
>>>>> bpf_obj_free_fields() into the current one plus
>>>>> bpf_obj_free_fields_after_gp()
>>>>> that will do the above unpin bit.
>>>>> and call the latter one from bpf_selem_free_rcu()
>>>>> while bpf_obj_free_fields() from bpf_selem_free()
>>>>> will not touch uptr.
>>>>>
>>>>> But after digging further I realized that task_storage
>>>>> already switched to use bpf_ma, so the above won't work.
>>>>>
>>>>> So we need something similar to BPF_KPTR_REF logic:
>>>>> xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>>>>> and then delay of uptr unpin for that address into call_rcu.
>>>>>
>>>>> Any better ideas?
>>>>
>>>
>>> I think the existing reuse_now arg in the bpf_selem_free can be used. reuse_now
>>> (renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace
>>> for the common case.
>>>
>>> selem (in type "struct bpf_local_storage_elem") is the one exposed to the bpf 
>>> prog.
>>>
>>> bpf_selem_free knows whether a selem can be reused immediately based on the
>>> caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool
>>> reuse_now)".
>>>
>>> If a selem cannot be reuse_now (i.e. == false), it is currently going through
>>> "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do
>>> unpin_user_page() in the rcu callback.
>>>
>>> A selem can be reuse_now (i.e. == true) if the selem is no longer needed because
>>> either its owner (i.e. the task_struct here) is going away in free_task() or the
>>> bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should
>>> have a hold on the selem at this point. I think for these two cases, the
>>> unpin_user_page() can be directly called in bpf_selem_free().
>>
>> but there is also this path:
>> bpf_task_storage_delete -> task_storage_delete -> bpf_selem_free
>>   -> bpf_obj_free_fields
>>
>> In this case bpf prog may still be looking at uptr address
>> and we cannot do unpin right away in bpf_obj_free_fields.
> 
> That we cannot unpin immediately in the bpf_task_storage_delete() path is 
> understood. task_storage can be used in sleepable programs. It needs to wait 
> for the tasks_trace and the regular rcu gp before unpinning.
> 
> I forgot to mention earlier that bpf_task_storage_delete() will have the 
> bpf_selem_free(..., reuse_now == false). It will then do the 
> "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu);". The unpin could 
> happen in bpf_selem_free_trace_rcu() in this case. I am suggesting to unpin in 
> bpf_selem_free_trace_rcu together with the selem free.
> 
> I just noticed the map and its btf_record are gone in 
> bpf_selem_free_trace_rcu()... so won't work. :(

Thought about it more. Adding a rcu_barrier() to bpf_local_storage_map_free() 
may be enough. Then bpf_selem_free_rcu() will still have the map->record to 
unpin. I will need to think about it more.

> 
>> All other special fields in map value are ok,
>> since they are either relying on bpf_mem_alloc and
>> have rcu/rcu_tasks_trace gp
>> or extra indirection like timer/wq.
>>
>>> One existing bug is, from looking at patch 6, I don't think the free_task() case
>>> can be "reuse_now == true" anymore because of the bpf_task_release kfunc did not
>>> mark the previously obtained task_storage to be invalid:
>>>
>>> data_task = bpf_task_from_pid(parent_pid);
>>> ptr = bpf_task_storage_get(&datamap, data_task, 0, ...);
>>> bpf_task_release(data_task);
>>> if (!ptr)
>>>          return 0;
>>> /* The prog still holds a valid task storage ptr. */
>>> udata = ptr->udata;
>>>
>>> It can be fixed by marking the ref_obj_id of the "ptr". Although it is more
>>> correct to make the task storage "ptr" invalid after task_release, it may break
>>> the existing progs.
>>
>> Are you suggesting that bpf_task_release should invalidate all pointers
>> fetched from map value?
> 
> I was thinking at least the map value ptr itself needs to be invalidated.
> 
>> That will work, but it's not an issue for other special fields in there
>> like kptr.
>> So this invalidation would be needed only for uptr, which feels
>> weird to special case it and probably will be confusing to users writing
>> such programs.
> 
> hmm... I haven't thought about the other pointer fields that were read before 
> the task_release().
> 
> Agreed, it is hard to use if it only marks uptr invalid. Thinking about it. 
> Even marking the map value ptr invalid while other previously read fields keep 
> working has the same weirdness.
> 
>> The above bpf prog example should be ok to use.
>> We only need to delay the unpin until after the rcu/rcu_tasks_trace gp.
>> Hence my proposal: in bpf_obj_free_fields(), do:
>>   case UPTR:
>>     xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
>>     call_rcu(...) to unpin.
> 
> Agree that call_rcu() here is the only option. It probably needs to go through 
> the tasks_trace gp also.
> 
> Can the page->rcu_head be used here?
> 



end of thread, other threads:[~2024-09-07  4:03 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-16 19:12 [RFC bpf-next v4 0/6] Share user memory to BPF program through task storage map Kui-Feng Lee
2024-08-16 19:12 ` [RFC bpf-next v4 1/6] bpf: define BPF_UPTR a new enumerator of btf_field_type Kui-Feng Lee
2024-08-16 19:12 ` [RFC bpf-next v4 2/6] bpf: Parse and support "uptr" tag Kui-Feng Lee
2024-08-16 19:12 ` [RFC bpf-next v4 3/6] bpf: Handle BPF_UPTR in verifier Kui-Feng Lee
2024-08-16 19:12 ` [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls Kui-Feng Lee
2024-08-28 23:24   ` Alexei Starovoitov
2024-09-04 22:21     ` Martin KaFai Lau
2024-09-06 20:11       ` Martin KaFai Lau
2024-09-06 23:44         ` Alexei Starovoitov
2024-09-07  1:32           ` Martin KaFai Lau
2024-09-07  4:03             ` Martin KaFai Lau
2024-08-16 19:12 ` [RFC bpf-next v4 5/6] libbpf: define __uptr Kui-Feng Lee
2024-08-27 23:13   ` Andrii Nakryiko
2024-08-28 17:53     ` Kui-Feng Lee
2024-08-16 19:12 ` [RFC bpf-next v4 6/6] selftests/bpf: test __uptr on the value of a task storage map Kui-Feng Lee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox