BPF List
* [RFC bpf-next 0/5] Share user memory to BPF program through task storage map.
@ 2024-08-07 23:57 Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 1/5] bpf: Parse and support "kptr_user" tag Kui-Feng Lee
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-07 23:57 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Some BPF schedulers (sched_ext) need hints from user programs to do a
better job. For example, a scheduler can handle a task differently if
it knows the task is doing GC. So, we need an efficient way to share
this information between user programs and BPF programs. Sharing memory
between user programs and BPF programs is what this patchset does.

== REQUIREMENT ==

This patchset lets every task in every process share a small chunk of
its own memory with a BPF scheduler, so tasks can update hints without
the expensive overhead of syscalls. It also ensures that every task
sees only the data/memory that belongs to the task or to the task's
process.

== DESIGN ==

This patchset enables BPF programs to embed __kptr_user (user kptr), a
new type of kptr, in the values of task storage maps. A user kptr field
can only be set by a user program updating the map element value
through a syscall. A user kptr points to a block of memory allocated by
the user program that updates the element value. The memory is pinned
to ensure it stays in core memory and to avoid page faults when the BPF
program accesses it.

For example, the following code fragment is a part of a BPF program
that embeds a __kptr_user field "udata" in the value type "struct
value_type" of the task storage map "datamap". The size of the memory
pointed to by a user kptr is determined by its type, here "struct
user_data". The BPF program can read and write this block of memory
directly.

File task_ls_kptr_user.c:

    struct user_data {
    	int a;
    	int b;
    	int result;
    };
    
    struct value_type {
    	struct user_data __kptr_user *udata;
    };
    
    struct {
    	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
    	__uint(map_flags, BPF_F_NO_PREALLOC);
    	__type(key, int);
    	__type(value, struct value_type);
    } datamap SEC(".maps");
    
    pid_t target_pid = 0;
    
    SEC("tp_btf/sys_enter")
    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
    {
    	struct task_struct *task;
    	struct value_type *ptr;
    	struct user_data *udata;
    
    	task = bpf_get_current_task_btf();
    	if (task->pid != target_pid)
    		return 0;
    
    	ptr = bpf_task_storage_get(&datamap, task, 0,
    				   BPF_LOCAL_STORAGE_GET_F_CREATE);
    	if (!ptr)
    		return 0;
    
    	udata = ptr->udata;
    	if (udata)
    		udata->result = udata->a + udata->b;
    
    	return 0;
    }

The following code fragment is a corresponding user program. It calls
bpf_map_update_elem() to update "datamap", pointing "udata" at mmapped
memory. The memory pointed to by "udata" is shared between the BPF
program and the user program.

    static void test_kptr_user(void)
    {
    	struct task_ls_kptr_user *skel = NULL;
    	struct user_data *user_data;
    	struct value_type value;
    	int task_fd = -1;
    	int err;
    
    	user_data = mmap(NULL, sizeof(*user_data), PROT_READ | PROT_WRITE,
    			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    	if (!ASSERT_NEQ(user_data, MAP_FAILED, "mmap"))
    		return;
    	user_data->a = 1;
    	user_data->b = 2;
    	user_data->result = 0;
    	value.udata = user_data;
    
    	task_fd = sys_pidfd_open(getpid(), 0);
    	if (!ASSERT_NEQ(task_fd, -1, "sys_pidfd_open"))
    		goto out;
    
    	skel = task_ls_kptr_user__open_and_load();
    	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
    		goto out;
    
    	err = bpf_map_update_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &value, 0);
    	if (!ASSERT_OK(err, "update datamap"))
    		goto out;
    
    	skel->bss->target_pid = syscall(SYS_gettid);
    
    	err = task_ls_kptr_user__attach(skel);
    	if (!ASSERT_OK(err, "skel_attach"))
    		goto out;
    
    	syscall(SYS_gettid);
    	syscall(SYS_gettid);
    
    	ASSERT_EQ(user_data->a + user_data->b, user_data->result, "result");
    out:
    	task_ls_kptr_user__destroy(skel);
    	close(task_fd);
    	munmap(user_data, sizeof(*user_data));
    }

== MEMORY ==

In order to use memory efficiently, we don't want to pin a large number
of pages. To achieve that, user programs should collect the memory
blocks pointed to by user kptrs together so that they share memory
pages where possible. This avoids pinning one page for each thread in a
process; instead, several threads can point their user kptrs into the
same page, at different offsets.

Although it is not required, keeping the memory pointed to by a user
kptr from crossing a page boundary avoids an additional mapping in the
kernel address space.
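
For illustration, here is a minimal user-space sketch of such a layout
(the helper name alloc_thread_slot(), SLOT_SIZE, and the fixed
4096-byte page size are made up for this example; struct user_data is
the one defined above). Each thread carves a slot out of one shared
page, and the power-of-two slot size keeps any slot from straddling a
page boundary:

    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGE_SZ   4096
    #define SLOT_SIZE 64	/* >= sizeof(struct user_data); power of two,
    			 * so a slot never straddles a page boundary */

    /* Not thread-safe; a real allocator would take a lock. */
    static struct user_data *alloc_thread_slot(void)
    {
    	static char *arena;
    	static size_t next_off;

    	if (!arena) {
    		arena = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
    			     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    		if (arena == MAP_FAILED)
    			return NULL;
    	}
    	if (next_off + SLOT_SIZE > PAGE_SZ)
    		return NULL;	/* page full; grow with another mmap() */

    	next_off += SLOT_SIZE;
    	return (struct user_data *)(arena + next_off - SLOT_SIZE);
    }

Each thread then stores the pointer it gets back into its own map
value, so the whole process pins only one page.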

== RESTRICTIONS ==

There is a limit on the number of pinned pages for one user kptr,
KPTR_USER_MAX_PAGES (16). This is an arbitrarily picked number, chosen
for safety. We can remove this limit if we don't want it.

Only task storage maps are supported at the moment.

The values of user kptrs can only be updated by user programs through
syscalls. You cannot change the values of user kptrs in BPF programs.

bpf_map_lookup_elem() from userspace returns zeroed values for user
kptrs to avoid leaking kernel addresses.
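
For example (a sketch reusing skel and task_fd from the example above),
a lookup from user space observes NULL in the user kptr field even
while the BPF program sees a valid pointer:

    struct value_type out;

    err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &out);
    if (ASSERT_OK(err, "lookup"))
    	/* The kernel-side address is never exposed to userspace. */
    	ASSERT_EQ(out.udata, NULL, "udata_zeroed");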

Kui-Feng Lee (5):
  bpf: Parse and support "kptr_user" tag.
  bpf: Handle BPF_KPTR_USER in verifier.
  bpf: pin, translate, and unpin __kptr_user from syscalls.
  libbpf: define __kptr_user.
  selftests/bpf: test __kptr_user on the value of a task storage map.

 include/linux/bpf.h                           |  43 ++++-
 include/linux/bpf_local_storage.h             |   2 +-
 kernel/bpf/bpf_local_storage.c                |  18 +-
 kernel/bpf/btf.c                              |   5 +
 kernel/bpf/helpers.c                          |  12 +-
 kernel/bpf/local_storage.c                    |   2 +-
 kernel/bpf/syscall.c                          | 179 +++++++++++++++++-
 kernel/bpf/verifier.c                         |  11 ++
 net/core/bpf_sk_storage.c                     |   2 +-
 tools/lib/bpf/bpf_helpers.h                   |   1 +
 .../bpf/prog_tests/task_local_storage.c       | 122 ++++++++++++
 .../selftests/bpf/progs/task_ls_kptr_user.c   |  72 +++++++
 12 files changed, 447 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/task_ls_kptr_user.c

-- 
2.34.1



* [RFC bpf-next 1/5] bpf: Parse and support "kptr_user" tag.
  2024-08-07 23:57 [RFC bpf-next 0/5] Share user memory to BPF program through task storage map Kui-Feng Lee
@ 2024-08-07 23:57 ` Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier Kui-Feng Lee
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-07 23:57 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Parse the "kptr_user" tag from BTF, map it to BPF_KPTR_USER, and
support it in the related functions.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h  | 8 +++++++-
 kernel/bpf/btf.c     | 5 +++++
 kernel/bpf/syscall.c | 2 ++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b9425e410bcb..87d5f98249e2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -194,7 +194,6 @@ enum btf_field_type {
 	BPF_KPTR_UNREF = (1 << 2),
 	BPF_KPTR_REF   = (1 << 3),
 	BPF_KPTR_PERCPU = (1 << 4),
-	BPF_KPTR       = BPF_KPTR_UNREF | BPF_KPTR_REF | BPF_KPTR_PERCPU,
 	BPF_LIST_HEAD  = (1 << 5),
 	BPF_LIST_NODE  = (1 << 6),
 	BPF_RB_ROOT    = (1 << 7),
@@ -203,6 +202,8 @@ enum btf_field_type {
 	BPF_GRAPH_ROOT = BPF_RB_ROOT | BPF_LIST_HEAD,
 	BPF_REFCOUNT   = (1 << 9),
 	BPF_WORKQUEUE  = (1 << 10),
+	BPF_KPTR_USER  = (1 << 11),
+	BPF_KPTR       = BPF_KPTR_UNREF | BPF_KPTR_REF | BPF_KPTR_PERCPU | BPF_KPTR_USER,
 };
 
 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -322,6 +323,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "kptr";
 	case BPF_KPTR_PERCPU:
 		return "percpu_kptr";
+	case BPF_KPTR_USER:
+		return "user_kptr";
 	case BPF_LIST_HEAD:
 		return "bpf_list_head";
 	case BPF_LIST_NODE:
@@ -350,6 +353,7 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_KPTR_USER:
 		return sizeof(u64);
 	case BPF_LIST_HEAD:
 		return sizeof(struct bpf_list_head);
@@ -379,6 +383,7 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_KPTR_USER:
 		return __alignof__(u64);
 	case BPF_LIST_HEAD:
 		return __alignof__(struct bpf_list_head);
@@ -419,6 +424,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_KPTR_USER:
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 95426d5b634e..3b0f555fbbe6 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3361,6 +3361,8 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
 		type = BPF_KPTR_REF;
 	else if (!strcmp("percpu_kptr", __btf_name_by_offset(btf, t->name_off)))
 		type = BPF_KPTR_PERCPU;
+	else if (!strcmp("kptr_user", __btf_name_by_offset(btf, t->name_off)))
+		type = BPF_KPTR_USER;
 	else
 		return -EINVAL;
 
@@ -3538,6 +3540,7 @@ static int btf_repeat_fields(struct btf_field_info *info,
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_KPTR_USER:
 		case BPF_LIST_HEAD:
 		case BPF_RB_ROOT:
 			break;
@@ -3664,6 +3667,7 @@ static int btf_find_field_one(const struct btf *btf,
 	case BPF_KPTR_UNREF:
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
+	case BPF_KPTR_USER:
 		ret = btf_find_kptr(btf, var_type, off, sz,
 				    info_cnt ? &info[0] : &tmp);
 		if (ret < 0)
@@ -3988,6 +3992,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_KPTR_USER:
 			ret = btf_parse_kptr(btf, &rec->fields[i], &info_arr[i]);
 			if (ret < 0)
 				goto end;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index bf6c5f685ea2..90a25307480e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -548,6 +548,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_KPTR_USER:
 			if (rec->fields[i].kptr.module)
 				module_put(rec->fields[i].kptr.module);
 			btf_put(rec->fields[i].kptr.btf);
@@ -596,6 +597,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_KPTR_UNREF:
 		case BPF_KPTR_REF:
 		case BPF_KPTR_PERCPU:
+		case BPF_KPTR_USER:
 			btf_get(fields[i].kptr.btf);
 			if (fields[i].kptr.module && !try_module_get(fields[i].kptr.module)) {
 				ret = -ENXIO;
-- 
2.34.1



* [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier.
  2024-08-07 23:57 [RFC bpf-next 0/5] Share user memory to BPF program through task storage map Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 1/5] bpf: Parse and support "kptr_user" tag Kui-Feng Lee
@ 2024-08-07 23:57 ` Kui-Feng Lee
  2024-08-12 16:48   ` Alexei Starovoitov
  2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-07 23:57 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Give PTR_MAYBE_NULL | PTR_UNTRUSTED | MEM_ALLOC | NON_OWN_REF to
kptr_user so that the memory pointed to by it is readable and writable.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 kernel/bpf/verifier.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index df3be12096cf..84647e599595 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5340,6 +5340,10 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
 	int perm_flags;
 	const char *reg_name = "";
 
+	if (kptr_field->type == BPF_KPTR_USER)
+		/* BPF programs should not change any user kptr */
+		return -EACCES;
+
 	if (btf_is_kernel(reg->btf)) {
 		perm_flags = PTR_MAYBE_NULL | PTR_TRUSTED | MEM_RCU;
 
@@ -5483,6 +5487,12 @@ static u32 btf_ld_kptr_type(struct bpf_verifier_env *env, struct btf_field *kptr
 			ret |= NON_OWN_REF;
 	} else {
 		ret |= PTR_UNTRUSTED;
+		if (kptr_field->type == BPF_KPTR_USER)
+			/* In order to allow direct access from BPF
+			 * programs, NON_OWN_REF makes the memory
+			 * writable. See check_ptr_to_btf_access().
+			 */
+			ret |= MEM_ALLOC | NON_OWN_REF;
 	}
 
 	return ret;
@@ -5576,6 +5586,7 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			case BPF_KPTR_UNREF:
 			case BPF_KPTR_REF:
 			case BPF_KPTR_PERCPU:
+			case BPF_KPTR_USER:
 				if (src != ACCESS_DIRECT) {
 					verbose(env, "kptr cannot be accessed indirectly by helper\n");
 					return -EACCES;
-- 
2.34.1



* [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-07 23:57 [RFC bpf-next 0/5] Share user memory to BPF program through task storage map Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 1/5] bpf: Parse and support "kptr_user" tag Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier Kui-Feng Lee
@ 2024-08-07 23:57 ` Kui-Feng Lee
  2024-08-08  0:05   ` Kui-Feng Lee
                     ` (3 more replies)
  2024-08-07 23:57 ` [RFC bpf-next 4/5] libbpf: define __kptr_user Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map Kui-Feng Lee
  4 siblings, 4 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-07 23:57 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee, linux-mm

User kptrs are pinned, by pin_user_pages_fast(), and translated to an
address in the kernel when the value is updated by user programs (that
is, when bpf_map_update_elem() is called from user space). The pinned
pages are unpinned when the values of user kptrs are overwritten or
when the map values are deleted/destroyed.

The pages are mapped through vmap() in order to get a contiguous range
in the kernel address space if the memory pointed to by a user kptr
resides in two or more pages. For the single-page case, page_address()
is called to get the address of the page in the kernel.

User kptrs are only supported by task storage maps.

One user kptr can pin at most KPTR_USER_MAX_PAGES (16) physical pages.
This is an arbitrarily picked number, chosen for safety. We could
actually remove this restriction entirely.

User kptrs can only be set by user programs through syscalls. Any
attempt to update the value of a map with a __kptr_user in it ignores
the values of user kptrs when the update comes from a BPF program: the
values of user kptrs are kept as they were if the new values come from
BPF programs rather than user programs.
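
For reference, a small stand-alone sketch of the page-count arithmetic
used here (npages_for() is a made-up name; a 4 KiB page is assumed, and
tsz must be non-zero since the zero-size case returns early):

    #include <stdint.h>
    #include <stddef.h>

    #define PG_SHIFT 12
    #define PG_SIZE  (1UL << PG_SHIFT)
    #define PG_MASK  (~(PG_SIZE - 1))

    static int npages_for(uintptr_t uptr, size_t tsz)
    {
    	uintptr_t first = uptr & PG_MASK;		/* first pinned page */
    	uintptr_t last = (uptr + tsz - 1) & PG_MASK;	/* last pinned page */

    	return ((last - first) >> PG_SHIFT) + 1;
    }

For example, an object at offset 0xF00 within its page with tsz == 512
spans two pages, and the two pinned pages are then stitched together
with vmap().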

Cc: linux-mm@kvack.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h               |  35 +++++-
 include/linux/bpf_local_storage.h |   2 +-
 kernel/bpf/bpf_local_storage.c    |  18 +--
 kernel/bpf/helpers.c              |  12 +-
 kernel/bpf/local_storage.c        |   2 +-
 kernel/bpf/syscall.c              | 177 +++++++++++++++++++++++++++++-
 net/core/bpf_sk_storage.c         |   2 +-
 7 files changed, 227 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 87d5f98249e2..f4ad0bc183cb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -30,6 +30,7 @@
 #include <linux/static_call.h>
 #include <linux/memcontrol.h>
 #include <linux/cfi.h>
+#include <linux/mm.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -477,10 +478,12 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 		data_race(*ldst++ = *lsrc++);
 }
 
+void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr);
+
 /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
 static inline void bpf_obj_memcpy(struct btf_record *rec,
 				  void *dst, void *src, u32 size,
-				  bool long_memcpy)
+				  bool long_memcpy, bool from_user)
 {
 	u32 curr_off = 0;
 	int i;
@@ -496,21 +499,40 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
 	for (i = 0; i < rec->cnt; i++) {
 		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
+		void *addr;
 
 		memcpy(dst + curr_off, src + curr_off, sz);
+		if (from_user && rec->fields[i].type == BPF_KPTR_USER) {
+			/* Unpin old address.
+			 *
+			 * Alignments are guaranteed by btf_find_field_one().
+			 */
+			addr = *(void **)(dst + next_off);
+			if (virt_addr_valid(addr))
+				bpf_obj_unpin_uaddr(&rec->fields[i], addr);
+			else if (addr)
+				WARN_ON_ONCE(1);
+
+			*(void **)(dst + next_off) = *(void **)(src + next_off);
+		}
 		curr_off += rec->fields[i].size + sz;
 	}
 	memcpy(dst + curr_off, src + curr_off, size - curr_off);
 }
 
+static inline void copy_map_value_user(struct bpf_map *map, void *dst, void *src, bool from_user)
+{
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, false, from_user);
+}
+
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, false, false);
 }
 
 static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, true, false);
 }
 
 static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
@@ -538,6 +560,8 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
 	bpf_obj_memzero(map->record, dst, map->value_size);
 }
 
+void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
+				bool lock_src, bool from_user);
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
 void bpf_timer_cancel_and_free(void *timer);
@@ -775,6 +799,11 @@ enum bpf_arg_type {
 };
 static_assert(__BPF_ARG_TYPE_MAX <= BPF_BASE_TYPE_LIMIT);
 
+#define BPF_MAP_UPDATE_FLAG_BITS 3
+enum bpf_map_update_flag {
+	BPF_FROM_USER = BIT(0 + BPF_MAP_UPDATE_FLAG_BITS)
+};
+
 /* type of values returned from helper functions */
 enum bpf_return_type {
 	RET_INTEGER,			/* function returns integer */
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index dcddb0aef7d8..d337df68fa23 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -181,7 +181,7 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
 
 struct bpf_local_storage_elem *
 bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,
-		bool charge_mem, gfp_t gfp_flags);
+		bool charge_mem, gfp_t gfp_flags, bool from_user);
 
 void bpf_selem_free(struct bpf_local_storage_elem *selem,
 		    struct bpf_local_storage_map *smap,
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index c938dea5ddbf..c4cf09e27a19 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -73,7 +73,7 @@ static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
 
 struct bpf_local_storage_elem *
 bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
-		void *value, bool charge_mem, gfp_t gfp_flags)
+		void *value, bool charge_mem, gfp_t gfp_flags, bool from_user)
 {
 	struct bpf_local_storage_elem *selem;
 
@@ -100,7 +100,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
 
 	if (selem) {
 		if (value)
-			copy_map_value(&smap->map, SDATA(selem)->data, value);
+			copy_map_value_user(&smap->map, SDATA(selem)->data, value, from_user);
 		/* No need to call check_and_init_map_value as memory is zero init */
 		return selem;
 	}
@@ -530,9 +530,11 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 	struct bpf_local_storage_elem *alloc_selem, *selem = NULL;
 	struct bpf_local_storage *local_storage;
 	unsigned long flags;
+	bool from_user = map_flags & BPF_FROM_USER;
 	int err;
 
 	/* BPF_EXIST and BPF_NOEXIST cannot be both set */
+	map_flags &= ~BPF_FROM_USER;
 	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
 	    /* BPF_F_LOCK can only be used in a value with spin_lock */
 	    unlikely((map_flags & BPF_F_LOCK) &&
@@ -550,7 +552,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 		if (err)
 			return ERR_PTR(err);
 
-		selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
+		selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
 		if (!selem)
 			return ERR_PTR(-ENOMEM);
 
@@ -575,8 +577,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 		if (err)
 			return ERR_PTR(err);
 		if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
-			copy_map_value_locked(&smap->map, old_sdata->data,
-					      value, false);
+			copy_map_value_locked_user(&smap->map, old_sdata->data,
+						   value, false, from_user);
 			return old_sdata;
 		}
 	}
@@ -584,7 +586,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 	/* A lookup has just been done before and concluded a new selem is
 	 * needed. The chance of an unnecessary alloc is unlikely.
 	 */
-	alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
+	alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
 	if (!alloc_selem)
 		return ERR_PTR(-ENOMEM);
 
@@ -607,8 +609,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 		goto unlock;
 
 	if (old_sdata && (map_flags & BPF_F_LOCK)) {
-		copy_map_value_locked(&smap->map, old_sdata->data, value,
-				      false);
+		copy_map_value_locked_user(&smap->map, old_sdata->data, value,
+					   false, from_user);
 		selem = SELEM(old_sdata);
 		goto unlock;
 	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d02ae323996b..4aef86209fdd 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -372,8 +372,8 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
 	.arg1_btf_id    = BPF_PTR_POISON,
 };
 
-void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
-			   bool lock_src)
+void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
+				bool lock_src, bool from_user)
 {
 	struct bpf_spin_lock *lock;
 
@@ -383,11 +383,17 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 		lock = dst + map->record->spin_lock_off;
 	preempt_disable();
 	__bpf_spin_lock_irqsave(lock);
-	copy_map_value(map, dst, src);
+	copy_map_value_user(map, dst, src, from_user);
 	__bpf_spin_unlock_irqrestore(lock);
 	preempt_enable();
 }
 
+void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
+			   bool lock_src)
+{
+	copy_map_value_locked_user(map, dst, src, lock_src, false);
+}
+
 BPF_CALL_0(bpf_jiffies64)
 {
 	return get_jiffies_64();
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 3969eb0382af..62a12fa8ce9e 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -147,7 +147,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key,
 	struct bpf_cgroup_storage *storage;
 	struct bpf_storage_buffer *new;
 
-	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
+	if (unlikely(flags & ~BPF_F_LOCK))
 		return -EINVAL;
 
 	if (unlikely((flags & BPF_F_LOCK) &&
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 90a25307480e..eaa2a9d13265 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -155,8 +155,134 @@ static void maybe_wait_bpf_programs(struct bpf_map *map)
 		synchronize_rcu();
 }
 
-static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
-				void *key, void *value, __u64 flags)
+static void *trans_addr_pages(struct page **pages, int npages)
+{
+	if (npages == 1)
+		return page_address(pages[0]);
+	/* For multiple pages, we need to use vmap() to get a contiguous
+	 * virtual address range.
+	 */
+	return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+}
+
+#define KPTR_USER_MAX_PAGES 16
+
+static int bpf_obj_trans_pin_uaddr(struct btf_field *field, void **addr)
+{
+	const struct btf_type *t;
+	struct page *pages[KPTR_USER_MAX_PAGES];
+	void *ptr, *kern_addr;
+	u32 type_id, tsz;
+	int r, npages;
+
+	ptr = *addr;
+	type_id = field->kptr.btf_id;
+	t = btf_type_id_size(field->kptr.btf, &type_id, &tsz);
+	if (!t)
+		return -EINVAL;
+	if (tsz == 0) {
+		*addr = NULL;
+		return 0;
+	}
+
+	npages = (((intptr_t)ptr + tsz + ~PAGE_MASK) -
+		  ((intptr_t)ptr & PAGE_MASK)) >> PAGE_SHIFT;
+	if (npages > KPTR_USER_MAX_PAGES)
+		return -E2BIG;
+	r = pin_user_pages_fast((intptr_t)ptr & PAGE_MASK, npages, 0, pages);
+	if (r != npages)
+		return -EINVAL;
+	kern_addr = trans_addr_pages(pages, npages);
+	if (!kern_addr)
+		return -ENOMEM;
+	*addr = kern_addr + ((intptr_t)ptr & ~PAGE_MASK);
+	return 0;
+}
+
+void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr)
+{
+	struct page *pages[KPTR_USER_MAX_PAGES];
+	int npages, i;
+	u32 size, type_id;
+	void *ptr;
+
+	type_id = field->kptr.btf_id;
+	btf_type_id_size(field->kptr.btf, &type_id, &size);
+	if (size == 0)
+		return;
+
+	ptr = (void *)((intptr_t)addr & PAGE_MASK);
+	npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		pages[i] = virt_to_page(ptr);
+		ptr += PAGE_SIZE;
+	}
+	if (npages > 1)
+		/* Paired with vmap() in trans_addr_pages() */
+		vunmap((void *)((intptr_t)addr & PAGE_MASK));
+	unpin_user_pages(pages, npages);
+}
+
+static int bpf_obj_trans_pin_uaddrs(struct btf_record *rec, void *src, u32 size)
+{
+	u32 next_off;
+	int i, err;
+
+	if (IS_ERR_OR_NULL(rec))
+		return 0;
+
+	if (!btf_record_has_field(rec, BPF_KPTR_USER))
+		return 0;
+
+	for (i = 0; i < rec->cnt; i++) {
+		if (rec->fields[i].type != BPF_KPTR_USER)
+			continue;
+
+		next_off = rec->fields[i].offset;
+		if (next_off + sizeof(void *) > size)
+			return -EINVAL;
+		err = bpf_obj_trans_pin_uaddr(&rec->fields[i], src + next_off);
+		if (!err)
+			continue;
+
+		/* Rollback */
+		for (i--; i >= 0; i--) {
+			if (rec->fields[i].type != BPF_KPTR_USER)
+				continue;
+			next_off = rec->fields[i].offset;
+			bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
+			*(void **)(src + next_off) = NULL;
+		}
+
+		return err;
+	}
+
+	return 0;
+}
+
+static void bpf_obj_unpin_uaddrs(struct btf_record *rec, void *src)
+{
+	u32 next_off;
+	int i;
+
+	if (IS_ERR_OR_NULL(rec))
+		return;
+
+	if (!btf_record_has_field(rec, BPF_KPTR_USER))
+		return;
+
+	for (i = 0; i < rec->cnt; i++) {
+		if (rec->fields[i].type != BPF_KPTR_USER)
+			continue;
+
+		next_off = rec->fields[i].offset;
+		bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
+		*(void **)(src + next_off) = NULL;
+	}
+}
+
+static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
+				      void *key, void *value, __u64 flags)
 {
 	int err;
 
@@ -208,6 +334,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	return err;
 }
 
+static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
+				void *key, void *value, __u64 flags)
+{
+	int err;
+
+	if (flags & BPF_FROM_USER) {
+		/* Pinning user memory can lead to a context switch, so we
+		 * need to do it before taking any potential RCU lock.
+		 */
+		err = bpf_obj_trans_pin_uaddrs(map->record, value,
+					       bpf_map_value_size(map));
+		if (err)
+			return err;
+	}
+
+	err = bpf_map_update_value_inner(map, map_file, key, value, flags);
+
+	if (err && (flags & BPF_FROM_USER))
+		bpf_obj_unpin_uaddrs(map->record, value);
+
+	return err;
+}
+
 static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 			      __u64 flags)
 {
@@ -714,6 +863,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 				field->kptr.dtor(xchgd_field);
 			}
 			break;
+		case BPF_KPTR_USER:
+			if (virt_addr_valid(*(void **)field_ptr))
+				bpf_obj_unpin_uaddr(field, *(void **)field_ptr);
+			*(void **)field_ptr = NULL;
+			break;
 		case BPF_LIST_HEAD:
 			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
 				continue;
@@ -1155,6 +1309,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 					goto free_map_tab;
 				}
 				break;
+			case BPF_KPTR_USER:
+				if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
+					ret = -EOPNOTSUPP;
+					goto free_map_tab;
+				}
+				break;
 			case BPF_LIST_HEAD:
 			case BPF_RB_ROOT:
 				if (map->map_type != BPF_MAP_TYPE_HASH &&
@@ -1618,11 +1778,15 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 	struct bpf_map *map;
 	void *key, *value;
 	u32 value_size;
+	u64 extra_flags = 0;
 	struct fd f;
 	int err;
 
 	if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
 		return -EINVAL;
+	/* Prevent userspace from setting any internal flags */
+	if (attr->flags & ~(BIT(BPF_MAP_UPDATE_FLAG_BITS) - 1))
+		return -EINVAL;
 
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
@@ -1653,7 +1817,9 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto free_key;
 	}
 
-	err = bpf_map_update_value(map, f.file, key, value, attr->flags);
+	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
+		extra_flags |= BPF_FROM_USER;
+	err = bpf_map_update_value(map, f.file, key, value, attr->flags | extra_flags);
 	if (!err)
 		maybe_wait_bpf_programs(map);
 
@@ -1852,6 +2018,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
 	void __user *keys = u64_to_user_ptr(attr->batch.keys);
 	u32 value_size, cp, max_count;
 	void *key, *value;
+	u64 extra_flags = 0;
 	int err = 0;
 
 	if (attr->batch.elem_flags & ~BPF_F_LOCK)
@@ -1881,6 +2048,8 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
 		return -ENOMEM;
 	}
 
+	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
+		extra_flags |= BPF_FROM_USER;
 	for (cp = 0; cp < max_count; cp++) {
 		err = -EFAULT;
 		if (copy_from_user(key, keys + cp * map->key_size,
@@ -1889,7 +2058,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
 			break;
 
 		err = bpf_map_update_value(map, map_file, key, value,
-					   attr->batch.elem_flags);
+					   attr->batch.elem_flags | extra_flags);
 
 		if (err)
 			break;
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index bc01b3aa6b0f..db5281384e6a 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -137,7 +137,7 @@ bpf_sk_storage_clone_elem(struct sock *newsk,
 {
 	struct bpf_local_storage_elem *copy_selem;
 
-	copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC);
+	copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC, false);
 	if (!copy_selem)
 		return NULL;
 
-- 
2.34.1



* [RFC bpf-next 4/5] libbpf: define __kptr_user.
  2024-08-07 23:57 [RFC bpf-next 0/5] Share user memory to BPF program through task storage map Kui-Feng Lee
                   ` (2 preceding siblings ...)
  2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
@ 2024-08-07 23:57 ` Kui-Feng Lee
  2024-08-07 23:57 ` [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map Kui-Feng Lee
  4 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-07 23:57 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Make __kptr_user available to BPF programs to enable them to define user
kptrs.
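
For example, a BPF program can then tag a map-value field as a user
kptr (the type and field names below are from the selftest later in
this series):

    struct user_data {
    	int a;
    	int b;
    	int result;
    };

    struct value_type {
    	struct user_data __kptr_user *udata;
    };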

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 tools/lib/bpf/bpf_helpers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 305c62817dd3..8f7fb00b90e3 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -185,6 +185,7 @@ enum libbpf_tristate {
 #define __kptr_untrusted __attribute__((btf_type_tag("kptr_untrusted")))
 #define __kptr __attribute__((btf_type_tag("kptr")))
 #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))
+#define __kptr_user __attribute__((btf_type_tag("kptr_user")))
 
 #if defined (__clang__)
 #define bpf_ksym_exists(sym) ({						\
-- 
2.34.1



* [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map.
  2024-08-07 23:57 [RFC bpf-next 0/5] Share user memory to BPF program through task storage map Kui-Feng Lee
                   ` (3 preceding siblings ...)
  2024-08-07 23:57 ` [RFC bpf-next 4/5] libbpf: define __kptr_user Kui-Feng Lee
@ 2024-08-07 23:57 ` Kui-Feng Lee
  2024-08-12 16:58   ` Alexei Starovoitov
  4 siblings, 1 reply; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-07 23:57 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Make sure the memory of user kptrs has been mapped into the kernel
properly. Also ensure the kernel values of user kptrs are not copied to
userspace.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 .../bpf/prog_tests/task_local_storage.c       | 122 ++++++++++++++++++
 .../selftests/bpf/progs/task_ls_kptr_user.c   |  72 +++++++++++
 2 files changed, 194 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/task_ls_kptr_user.c

diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
index c33c05161a9e..17221024fb28 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
@@ -5,8 +5,10 @@
 #include <unistd.h>
 #include <sched.h>
 #include <pthread.h>
+#include <sys/eventfd.h>
 #include <sys/syscall.h>   /* For SYS_xxx definitions */
 #include <sys/types.h>
+#include <sys/mman.h>
 #include <test_progs.h>
 #include "task_local_storage_helpers.h"
 #include "task_local_storage.skel.h"
@@ -14,6 +16,21 @@
 #include "task_ls_recursion.skel.h"
 #include "task_storage_nodeadlock.skel.h"
 
+struct user_data {
+	int a;
+	int b;
+	int result;
+};
+
+struct value_type {
+	struct user_data *udata_mmap;
+	struct user_data *udata;
+};
+
+#define MAGIC_VALUE 0xabcd1234
+
+#include "task_ls_kptr_user.skel.h"
+
 static void test_sys_enter_exit(void)
 {
 	struct task_local_storage *skel;
@@ -40,6 +57,109 @@ static void test_sys_enter_exit(void)
 	task_local_storage__destroy(skel);
 }
 
+static void test_kptr_user(void)
+{
+	struct user_data user_data = { .a = 1, .b = 2, .result = 0 };
+	struct user_data user_data_mmap_v = { .a = 7, .b = 7 };
+	struct task_ls_kptr_user *skel = NULL;
+	struct user_data *user_data_mmap;
+	int task_fd = -1, ev_fd = -1;
+	struct value_type value;
+	pid_t pid;
+	int err, wstatus;
+	__u64 dummy = 1;
+
+	user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
+			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
+		return;
+
+	memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
+	value.udata_mmap = user_data_mmap;
+	value.udata = &user_data;
+
+	task_fd = sys_pidfd_open(getpid(), 0);
+	if (!ASSERT_NEQ(task_fd, -1, "sys_pidfd_open"))
+		goto out;
+
+	ev_fd = eventfd(0, 0);
+	if (!ASSERT_NEQ(ev_fd, -1, "eventfd"))
+		goto out;
+
+	skel = task_ls_kptr_user__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		goto out;
+
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &value, 0);
+	if (!ASSERT_OK(err, "update_datamap"))
+		exit(1);
+
+	/* Make sure the BPF program can access the user_data_mmap even if
+	 * it's munmapped already.
+	 */
+	munmap(user_data_mmap, sizeof(*user_data_mmap));
+	user_data_mmap = NULL;
+
+	err = task_ls_kptr_user__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	fflush(stdout);
+	fflush(stderr);
+
+	pid = fork();
+	if (pid < 0)
+		goto out;
+
+	/* Call syscall in the child process, but access the map value of
+	 * the parent process in the BPF program to check if the user kptr
+	 * is translated/mapped correctly.
+	 */
+	if (pid == 0) {
+		/* child */
+
+		/* Overwrite the user_data in the child process to check if
+		 * the BPF program accesses the user_data of the parent.
+		 */
+		user_data.a = 0;
+		user_data.b = 0;
+
+		/* Wait for the parent to set child_pid */
+		read(ev_fd, &dummy, sizeof(dummy));
+
+		exit(0);
+	}
+
+	skel->bss->parent_pid = syscall(SYS_gettid);
+	skel->bss->child_pid = pid;
+
+	write(ev_fd, &dummy, sizeof(dummy));
+
+	err = waitpid(pid, &wstatus, 0);
+	ASSERT_EQ(err, pid, "waitpid");
+	skel->bss->child_pid = 0;
+
+	ASSERT_EQ(MAGIC_VALUE + user_data.a + user_data.b +
+		  user_data_mmap_v.a + user_data_mmap_v.b,
+		  user_data.result, "result");
+
+	/* Check if user programs can access the value of user kptrs
+	 * through bpf_map_lookup_elem(). Make sure the kernel value is not
+	 * leaked.
+	 */
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.datamap), &task_fd, &value);
+	if (!ASSERT_OK(err, "bpf_map_lookup_elem"))
+		goto out;
+	ASSERT_EQ(value.udata_mmap, NULL, "lookup_udata_mmap");
+	ASSERT_EQ(value.udata, NULL, "lookup_udata");
+
+out:
+	task_ls_kptr_user__destroy(skel);
+	close(ev_fd);
+	close(task_fd);
+	munmap(user_data_mmap, sizeof(*user_data_mmap));
+}
+
 static void test_exit_creds(void)
 {
 	struct task_local_storage_exit_creds *skel;
@@ -237,4 +357,6 @@ void test_task_local_storage(void)
 		test_recursion();
 	if (test__start_subtest("nodeadlock"))
 		test_nodeadlock();
+	if (test__start_subtest("kptr_user"))
+		test_kptr_user();
 }
diff --git a/tools/testing/selftests/bpf/progs/task_ls_kptr_user.c b/tools/testing/selftests/bpf/progs/task_ls_kptr_user.c
new file mode 100644
index 000000000000..ff5ca3a5da1e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_ls_kptr_user.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "task_kfunc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct user_data {
+	int a;
+	int b;
+	int result;
+};
+
+struct value_type {
+	struct user_data __kptr_user *udata_mmap;
+	struct user_data __kptr_user *udata;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct value_type);
+} datamap SEC(".maps");
+
+#define MAGIC_VALUE 0xabcd1234
+
+/* This is a workaround to avoid clang generating a forward reference for
+ * struct user_data. This is a known issue and will be fixed in the future.
+ */
+struct user_data __dummy;
+
+pid_t child_pid = 0;
+pid_t parent_pid = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task, *data_task;
+	struct value_type *ptr;
+	struct user_data *udata;
+	int acc;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != child_pid)
+		return 0;
+
+	data_task = bpf_task_from_pid(parent_pid);
+	if (!data_task)
+		return 0;
+
+	ptr = bpf_task_storage_get(&datamap, data_task, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	bpf_task_release(data_task);
+	if (!ptr)
+		return 0;
+
+	udata = ptr->udata_mmap;
+	if (!udata)
+		return 0;
+	acc = udata->a + udata->b;
+
+	udata = ptr->udata;
+	if (!udata)
+		return 0;
+	udata->result = MAGIC_VALUE + udata->a + udata->b + acc;
+
+	return 0;
+}
-- 
2.34.1



* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
@ 2024-08-08  0:05   ` Kui-Feng Lee
  2024-08-08  0:39   ` Kui-Feng Lee
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-08  0:05 UTC (permalink / raw)
  To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: kuifeng, linux-mm



On 8/7/24 16:57, Kui-Feng Lee wrote:
> User kptrs are pinned, by pin_user_pages_fast(), and translated to an
> address in the kernel when the value is updated by user programs.

[...]

> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 3969eb0382af..62a12fa8ce9e 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -147,7 +147,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key,
>   	struct bpf_cgroup_storage *storage;
>   	struct bpf_storage_buffer *new;
>   
> -	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
> +	if (unlikely(flags & ~BPF_F_LOCK))

This is an unnecessary change.
It will be removed.

>   		return -EINVAL;

[...]
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
  2024-08-08  0:05   ` Kui-Feng Lee
@ 2024-08-08  0:39   ` Kui-Feng Lee
  2024-08-12 16:00   ` Kui-Feng Lee
  2024-08-12 16:45   ` Alexei Starovoitov
  3 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-08  0:39 UTC (permalink / raw)
  To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: kuifeng, linux-mm


bpf_obj_trans_pin_uaddrs() is where the pinning and mapping are
performed. It is called when a syscall updates the value of a map. This
function rewrites the values of user kptrs to the corresponding
addresses in the kernel.
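
For reference, this path is reached from a user-space update along
these lines. A minimal sketch, assuming "map_fd" is the fd of a task
storage map whose value embeds one user kptr, and "task_fd" is a pidfd
of the target task (the names are illustrative, not from the patch):

    struct value_type {
    	struct user_data *udata;	/* tagged __kptr_user on the BPF side */
    } value = {};
    int err;

    value.udata = user_data;	/* a user-space address, e.g. from mmap() */
    /* The syscall pins the pages backing "user_data" and rewrites the
     * stored copy of "udata" to the corresponding kernel address.
     */
    err = bpf_map_update_elem(map_fd, &task_fd, &value, 0);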


On 8/7/24 16:57, Kui-Feng Lee wrote:
> User kptrs are pinned, by pin_user_pages_fast(), and translated to an
> address in the kernel when the value is updated by user programs (i.e.,
> when bpf_map_update_elem() is called from user programs). The pinned
> pages are unpinned when the values of user kptrs are overwritten or
> when the values of maps are deleted/destroyed.
> 
> The pages are mapped through vmap() in order to get a contiguous range
> in the kernel if the memory pointed to by a user kptr resides in two or
> more pages. For the case of a single page, page_address() is called to
> get the address of the page in the kernel.
> 
> User kptr is only supported by task storage maps.
> 
> One user kptr can pin at most KPTR_USER_MAX_PAGES (16) physical pages.
> This is a randomly picked number, chosen for safety. We could actually
> remove this restriction entirely.
> 
> User kptrs can only be set by user programs through syscalls. Any
> attempt to update the value of a map with a __kptr_user in it should
> ignore the values of user kptrs coming from BPF programs. The values of
> user kptrs remain as they were if the new values come from BPF
> programs, not from user programs.
> 
> Cc: linux-mm@kvack.org
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>   include/linux/bpf.h               |  35 +++++-
>   include/linux/bpf_local_storage.h |   2 +-
>   kernel/bpf/bpf_local_storage.c    |  18 +--
>   kernel/bpf/helpers.c              |  12 +-
>   kernel/bpf/local_storage.c        |   2 +-
>   kernel/bpf/syscall.c              | 177 +++++++++++++++++++++++++++++-
>   net/core/bpf_sk_storage.c         |   2 +-
>   7 files changed, 227 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 87d5f98249e2..f4ad0bc183cb 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -30,6 +30,7 @@
>   #include <linux/static_call.h>
>   #include <linux/memcontrol.h>
>   #include <linux/cfi.h>
> +#include <linux/mm.h>
>   
>   struct bpf_verifier_env;
>   struct bpf_verifier_log;
> @@ -477,10 +478,12 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>   		data_race(*ldst++ = *lsrc++);
>   }
>   
> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr);
> +
>   /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
>   static inline void bpf_obj_memcpy(struct btf_record *rec,
>   				  void *dst, void *src, u32 size,
> -				  bool long_memcpy)
> +				  bool long_memcpy, bool from_user)
>   {
>   	u32 curr_off = 0;
>   	int i;
> @@ -496,21 +499,40 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
>   	for (i = 0; i < rec->cnt; i++) {
>   		u32 next_off = rec->fields[i].offset;
>   		u32 sz = next_off - curr_off;
> +		void *addr;
>   
>   		memcpy(dst + curr_off, src + curr_off, sz);
> +		if (from_user && rec->fields[i].type == BPF_KPTR_USER) {
> +			/* Unpin old address.
> +			 *
> +			 * Alignments are guaranteed by btf_find_field_one().
> +			 */
> +			addr = *(void **)(dst + next_off);
> +			if (virt_addr_valid(addr))
> +				bpf_obj_unpin_uaddr(&rec->fields[i], addr);
> +			else if (addr)
> +				WARN_ON_ONCE(1);
> +
> +			*(void **)(dst + next_off) = *(void **)(src + next_off);
> +		}
>   		curr_off += rec->fields[i].size + sz;
>   	}
>   	memcpy(dst + curr_off, src + curr_off, size - curr_off);
>   }
>   
> +static inline void copy_map_value_user(struct bpf_map *map, void *dst, void *src, bool from_user)
> +{
> +	bpf_obj_memcpy(map->record, dst, src, map->value_size, false, from_user);
> +}
> +
>   static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>   {
> -	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
> +	bpf_obj_memcpy(map->record, dst, src, map->value_size, false, false);
>   }
>   
>   static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
>   {
> -	bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
> +	bpf_obj_memcpy(map->record, dst, src, map->value_size, true, false);
>   }
>   
>   static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
> @@ -538,6 +560,8 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
>   	bpf_obj_memzero(map->record, dst, map->value_size);
>   }
>   
> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
> +				bool lock_src, bool from_user);
>   void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   			   bool lock_src);
>   void bpf_timer_cancel_and_free(void *timer);
> @@ -775,6 +799,11 @@ enum bpf_arg_type {
>   };
>   static_assert(__BPF_ARG_TYPE_MAX <= BPF_BASE_TYPE_LIMIT);
>   
> +#define BPF_MAP_UPDATE_FLAG_BITS 3
> +enum bpf_map_update_flag {
> +	BPF_FROM_USER = BIT(0 + BPF_MAP_UPDATE_FLAG_BITS)
> +};
> +
>   /* type of values returned from helper functions */
>   enum bpf_return_type {
>   	RET_INTEGER,			/* function returns integer */
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index dcddb0aef7d8..d337df68fa23 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -181,7 +181,7 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
>   
>   struct bpf_local_storage_elem *
>   bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,
> -		bool charge_mem, gfp_t gfp_flags);
> +		bool charge_mem, gfp_t gfp_flags, bool from_user);
>   
>   void bpf_selem_free(struct bpf_local_storage_elem *selem,
>   		    struct bpf_local_storage_map *smap,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index c938dea5ddbf..c4cf09e27a19 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -73,7 +73,7 @@ static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
>   
>   struct bpf_local_storage_elem *
>   bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> -		void *value, bool charge_mem, gfp_t gfp_flags)
> +		void *value, bool charge_mem, gfp_t gfp_flags, bool from_user)
>   {
>   	struct bpf_local_storage_elem *selem;
>   
> @@ -100,7 +100,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>   
>   	if (selem) {
>   		if (value)
> -			copy_map_value(&smap->map, SDATA(selem)->data, value);
> +			copy_map_value_user(&smap->map, SDATA(selem)->data, value, from_user);
>   		/* No need to call check_and_init_map_value as memory is zero init */
>   		return selem;
>   	}
> @@ -530,9 +530,11 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   	struct bpf_local_storage_elem *alloc_selem, *selem = NULL;
>   	struct bpf_local_storage *local_storage;
>   	unsigned long flags;
> +	bool from_user = map_flags & BPF_FROM_USER;
>   	int err;
>   
>   	/* BPF_EXIST and BPF_NOEXIST cannot be both set */
> +	map_flags &= ~BPF_FROM_USER;
>   	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
>   	    /* BPF_F_LOCK can only be used in a value with spin_lock */
>   	    unlikely((map_flags & BPF_F_LOCK) &&
> @@ -550,7 +552,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   		if (err)
>   			return ERR_PTR(err);
>   
> -		selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
> +		selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>   		if (!selem)
>   			return ERR_PTR(-ENOMEM);
>   
> @@ -575,8 +577,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   		if (err)
>   			return ERR_PTR(err);
>   		if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> -			copy_map_value_locked(&smap->map, old_sdata->data,
> -					      value, false);
> +			copy_map_value_locked_user(&smap->map, old_sdata->data,
> +						   value, false, from_user);
>   			return old_sdata;
>   		}
>   	}
> @@ -584,7 +586,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   	/* A lookup has just been done before and concluded a new selem is
>   	 * needed. The chance of an unnecessary alloc is unlikely.
>   	 */
> -	alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
> +	alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>   	if (!alloc_selem)
>   		return ERR_PTR(-ENOMEM);
>   
> @@ -607,8 +609,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   		goto unlock;
>   
>   	if (old_sdata && (map_flags & BPF_F_LOCK)) {
> -		copy_map_value_locked(&smap->map, old_sdata->data, value,
> -				      false);
> +		copy_map_value_locked_user(&smap->map, old_sdata->data, value,
> +					   false, from_user);
>   		selem = SELEM(old_sdata);
>   		goto unlock;
>   	}
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index d02ae323996b..4aef86209fdd 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -372,8 +372,8 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
>   	.arg1_btf_id    = BPF_PTR_POISON,
>   };
>   
> -void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> -			   bool lock_src)
> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
> +				bool lock_src, bool from_user)
>   {
>   	struct bpf_spin_lock *lock;
>   
> @@ -383,11 +383,17 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   		lock = dst + map->record->spin_lock_off;
>   	preempt_disable();
>   	__bpf_spin_lock_irqsave(lock);
> -	copy_map_value(map, dst, src);
> +	copy_map_value_user(map, dst, src, from_user);
>   	__bpf_spin_unlock_irqrestore(lock);
>   	preempt_enable();
>   }
>   
> +void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> +			   bool lock_src)
> +{
> +	copy_map_value_locked_user(map, dst, src, lock_src, false);
> +}
> +
>   BPF_CALL_0(bpf_jiffies64)
>   {
>   	return get_jiffies_64();
> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 3969eb0382af..62a12fa8ce9e 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -147,7 +147,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key,
>   	struct bpf_cgroup_storage *storage;
>   	struct bpf_storage_buffer *new;
>   
> -	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
> +	if (unlikely(flags & ~BPF_F_LOCK))
>   		return -EINVAL;
>   
>   	if (unlikely((flags & BPF_F_LOCK) &&
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 90a25307480e..eaa2a9d13265 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -155,8 +155,134 @@ static void maybe_wait_bpf_programs(struct bpf_map *map)
>   		synchronize_rcu();
>   }
>   
> -static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> -				void *key, void *value, __u64 flags)
> +static void *trans_addr_pages(struct page **pages, int npages)
> +{
> +	if (npages == 1)
> +		return page_address(pages[0]);
> +	/* For multiple pages, we need to use vmap() to get a contiguous
> +	 * virtual address range.
> +	 */
> +	return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> +}
> +
> +#define KPTR_USER_MAX_PAGES 16
> +
> +static int bpf_obj_trans_pin_uaddr(struct btf_field *field, void **addr)
> +{
> +	const struct btf_type *t;
> +	struct page *pages[KPTR_USER_MAX_PAGES];
> +	void *ptr, *kern_addr;
> +	u32 type_id, tsz;
> +	int r, npages;
> +
> +	ptr = *addr;
> +	type_id = field->kptr.btf_id;
> +	t = btf_type_id_size(field->kptr.btf, &type_id, &tsz);
> +	if (!t)
> +		return -EINVAL;
> +	if (tsz == 0) {
> +		*addr = NULL;
> +		return 0;
> +	}
> +
> +	npages = (((intptr_t)ptr + tsz + ~PAGE_MASK) -
> +		  ((intptr_t)ptr & PAGE_MASK)) >> PAGE_SHIFT;
> +	if (npages > KPTR_USER_MAX_PAGES)
> +		return -E2BIG;
> +	r = pin_user_pages_fast((intptr_t)ptr & PAGE_MASK, npages, 0, pages);
> +	if (r != npages)
> +		return -EINVAL;
> +	kern_addr = trans_addr_pages(pages, npages);
> +	if (!kern_addr)
> +		return -ENOMEM;
> +	*addr = kern_addr + ((intptr_t)ptr & ~PAGE_MASK);
> +	return 0;
> +}
> +
> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr)
> +{
> +	struct page *pages[KPTR_USER_MAX_PAGES];
> +	int npages, i;
> +	u32 size, type_id;
> +	void *ptr;
> +
> +	type_id = field->kptr.btf_id;
> +	btf_type_id_size(field->kptr.btf, &type_id, &size);
> +	if (size == 0)
> +		return;
> +
> +	ptr = (void *)((intptr_t)addr & PAGE_MASK);
> +	npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
> +	for (i = 0; i < npages; i++) {
> +		pages[i] = virt_to_page(ptr);
> +		ptr += PAGE_SIZE;
> +	}
> +	if (npages > 1)
> +		/* Paired with vmap() in trans_addr_pages() */
> +		vunmap((void *)((intptr_t)addr & PAGE_MASK));
> +	unpin_user_pages(pages, npages);
> +}
> +
> +static int bpf_obj_trans_pin_uaddrs(struct btf_record *rec, void *src, u32 size)
> +{
> +	u32 next_off;
> +	int i, err;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return 0;
> +
> +	if (!btf_record_has_field(rec, BPF_KPTR_USER))
> +		return 0;
> +
> +	for (i = 0; i < rec->cnt; i++) {
> +		if (rec->fields[i].type != BPF_KPTR_USER)
> +			continue;
> +
> +		next_off = rec->fields[i].offset;
> +		if (next_off + sizeof(void *) > size)
> +			return -EINVAL;
> +		err = bpf_obj_trans_pin_uaddr(&rec->fields[i], src + next_off);
> +		if (!err)
> +			continue;
> +
> +		/* Rollback */
> +		for (i--; i >= 0; i--) {
> +			if (rec->fields[i].type != BPF_KPTR_USER)
> +				continue;
> +			next_off = rec->fields[i].offset;
> +			bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
> +			*(void **)(src + next_off) = NULL;
> +		}
> +
> +		return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static void bpf_obj_unpin_uaddrs(struct btf_record *rec, void *src)
> +{
> +	u32 next_off;
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return;
> +
> +	if (!btf_record_has_field(rec, BPF_KPTR_USER))
> +		return;
> +
> +	for (i = 0; i < rec->cnt; i++) {
> +		if (rec->fields[i].type != BPF_KPTR_USER)
> +			continue;
> +
> +		next_off = rec->fields[i].offset;
> +		bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
> +		*(void **)(src + next_off) = NULL;
> +	}
> +}
> +
> +static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
> +				      void *key, void *value, __u64 flags)
>   {
>   	int err;
>   
> @@ -208,6 +334,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>   	return err;
>   }
>   
> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> +				void *key, void *value, __u64 flags)
> +{
> +	int err;
> +
> +	if (flags & BPF_FROM_USER) {
> +		/* Pinning user memory can lead to a context switch, so we
> +		 * need to do it before taking a potential RCU lock.
> +		 */
> +		err = bpf_obj_trans_pin_uaddrs(map->record, value,
> +					       bpf_map_value_size(map));
> +		if (err)
> +			return err;
> +	}
> +
> +	err = bpf_map_update_value_inner(map, map_file, key, value, flags);
> +
> +	if (err && (flags & BPF_FROM_USER))
> +		bpf_obj_unpin_uaddrs(map->record, value);
> +
> +	return err;
> +}
> +
>   static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
>   			      __u64 flags)
>   {
> @@ -714,6 +863,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>   				field->kptr.dtor(xchgd_field);
>   			}
>   			break;
> +		case BPF_KPTR_USER:
> +			if (virt_addr_valid(*(void **)field_ptr))
> +				bpf_obj_unpin_uaddr(field, *(void **)field_ptr);
> +			*(void **)field_ptr = NULL;
> +			break;
>   		case BPF_LIST_HEAD:
>   			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>   				continue;
> @@ -1155,6 +1309,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>   					goto free_map_tab;
>   				}
>   				break;
> +			case BPF_KPTR_USER:
> +				if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
> +					ret = -EOPNOTSUPP;
> +					goto free_map_tab;
> +				}
> +				break;
>   			case BPF_LIST_HEAD:
>   			case BPF_RB_ROOT:
>   				if (map->map_type != BPF_MAP_TYPE_HASH &&
> @@ -1618,11 +1778,15 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>   	struct bpf_map *map;
>   	void *key, *value;
>   	u32 value_size;
> +	u64 extra_flags = 0;
>   	struct fd f;
>   	int err;
>   
>   	if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
>   		return -EINVAL;
> +	/* Prevent userspace from setting any internal flags */
> +	if (attr->flags & ~(BIT(BPF_MAP_UPDATE_FLAG_BITS) - 1))
> +		return -EINVAL;
>   
>   	f = fdget(ufd);
>   	map = __bpf_map_get(f);
> @@ -1653,7 +1817,9 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>   		goto free_key;
>   	}
>   
> -	err = bpf_map_update_value(map, f.file, key, value, attr->flags);
> +	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +		extra_flags |= BPF_FROM_USER;
> +	err = bpf_map_update_value(map, f.file, key, value, attr->flags | extra_flags);
>   	if (!err)
>   		maybe_wait_bpf_programs(map);
>   
> @@ -1852,6 +2018,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>   	void __user *keys = u64_to_user_ptr(attr->batch.keys);
>   	u32 value_size, cp, max_count;
>   	void *key, *value;
> +	u64 extra_flags = 0;
>   	int err = 0;
>   
>   	if (attr->batch.elem_flags & ~BPF_F_LOCK)
> @@ -1881,6 +2048,8 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>   		return -ENOMEM;
>   	}
>   
> +	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +		extra_flags |= BPF_FROM_USER;
>   	for (cp = 0; cp < max_count; cp++) {
>   		err = -EFAULT;
>   		if (copy_from_user(key, keys + cp * map->key_size,
> @@ -1889,7 +2058,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>   			break;
>   
>   		err = bpf_map_update_value(map, map_file, key, value,
> -					   attr->batch.elem_flags);
> +					   attr->batch.elem_flags | extra_flags);
>   
>   		if (err)
>   			break;
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index bc01b3aa6b0f..db5281384e6a 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -137,7 +137,7 @@ bpf_sk_storage_clone_elem(struct sock *newsk,
>   {
>   	struct bpf_local_storage_elem *copy_selem;
>   
> -	copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC);
> +	copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC, false);
>   	if (!copy_selem)
>   		return NULL;
>   

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
  2024-08-08  0:05   ` Kui-Feng Lee
  2024-08-08  0:39   ` Kui-Feng Lee
@ 2024-08-12 16:00   ` Kui-Feng Lee
  2024-08-12 16:45   ` Alexei Starovoitov
  3 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-12 16:00 UTC (permalink / raw)
  To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: kuifeng, linux-mm



On 8/7/24 16:57, Kui-Feng Lee wrote:
> User kptrs are pinned, by pin_user_pages_fast(), and translated to an
> address in the kernel when the value is updated by user programs (i.e.,
> when bpf_map_update_elem() is called from user programs). The pinned
> pages are unpinned when the values of user kptrs are overwritten or
> when the values of maps are deleted/destroyed.
> 
> The pages are mapped through vmap() in order to get a contiguous range
> in the kernel if the memory pointed to by a user kptr resides in two or
> more pages. For the case of a single page, page_address() is called to
> get the address of the page in the kernel.
> 
> User kptr is only supported by task storage maps.
> 
> One user kptr can pin at most KPTR_USER_MAX_PAGES (16) physical pages.
> This is a randomly picked number, chosen for safety. We could actually
> remove this restriction entirely.
> 
> User kptrs can only be set by user programs through syscalls. Any
> attempt to update the value of a map with a __kptr_user in it should
> ignore the values of user kptrs coming from BPF programs. The values of
> user kptrs remain as they were if the new values come from BPF
> programs, not from user programs.
> 
> Cc: linux-mm@kvack.org
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>   include/linux/bpf.h               |  35 +++++-
>   include/linux/bpf_local_storage.h |   2 +-
>   kernel/bpf/bpf_local_storage.c    |  18 +--
>   kernel/bpf/helpers.c              |  12 +-
>   kernel/bpf/local_storage.c        |   2 +-
>   kernel/bpf/syscall.c              | 177 +++++++++++++++++++++++++++++-
>   net/core/bpf_sk_storage.c         |   2 +-
>   7 files changed, 227 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 87d5f98249e2..f4ad0bc183cb 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -30,6 +30,7 @@
>   #include <linux/static_call.h>
>   #include <linux/memcontrol.h>
>   #include <linux/cfi.h>
> +#include <linux/mm.h>
>   
>   struct bpf_verifier_env;
>   struct bpf_verifier_log;
> @@ -477,10 +478,12 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>   		data_race(*ldst++ = *lsrc++);
>   }
>   
> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr);
> +
>   /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
>   static inline void bpf_obj_memcpy(struct btf_record *rec,
>   				  void *dst, void *src, u32 size,
> -				  bool long_memcpy)
> +				  bool long_memcpy, bool from_user)
>   {
>   	u32 curr_off = 0;
>   	int i;
> @@ -496,21 +499,40 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
>   	for (i = 0; i < rec->cnt; i++) {
>   		u32 next_off = rec->fields[i].offset;
>   		u32 sz = next_off - curr_off;
> +		void *addr;
>   
>   		memcpy(dst + curr_off, src + curr_off, sz);
> +		if (from_user && rec->fields[i].type == BPF_KPTR_USER) {
> +			/* Unpin old address.
> +			 *
> +			 * Alignments are guaranteed by btf_find_field_one().
> +			 */
> +			addr = *(void **)(dst + next_off);
> +			if (virt_addr_valid(addr))
> +				bpf_obj_unpin_uaddr(&rec->fields[i], addr);
> +			else if (addr)
> +				WARN_ON_ONCE(1);
> +
> +			*(void **)(dst + next_off) = *(void **)(src + next_off);
> +		}
>   		curr_off += rec->fields[i].size + sz;
>   	}
>   	memcpy(dst + curr_off, src + curr_off, size - curr_off);
>   }
>   
> +static inline void copy_map_value_user(struct bpf_map *map, void *dst, void *src, bool from_user)
> +{
> +	bpf_obj_memcpy(map->record, dst, src, map->value_size, false, from_user);
> +}
> +
>   static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>   {
> -	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
> +	bpf_obj_memcpy(map->record, dst, src, map->value_size, false, false);
>   }
>   
>   static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
>   {
> -	bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
> +	bpf_obj_memcpy(map->record, dst, src, map->value_size, true, false);
>   }
>   
>   static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
> @@ -538,6 +560,8 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
>   	bpf_obj_memzero(map->record, dst, map->value_size);
>   }
>   
> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
> +				bool lock_src, bool from_user);
>   void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   			   bool lock_src);
>   void bpf_timer_cancel_and_free(void *timer);
> @@ -775,6 +799,11 @@ enum bpf_arg_type {
>   };
>   static_assert(__BPF_ARG_TYPE_MAX <= BPF_BASE_TYPE_LIMIT);
>   
> +#define BPF_MAP_UPDATE_FLAG_BITS 3
> +enum bpf_map_update_flag {
> +	BPF_FROM_USER = BIT(0 + BPF_MAP_UPDATE_FLAG_BITS)
> +};
> +
>   /* type of values returned from helper functions */
>   enum bpf_return_type {
>   	RET_INTEGER,			/* function returns integer */
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index dcddb0aef7d8..d337df68fa23 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -181,7 +181,7 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
>   
>   struct bpf_local_storage_elem *
>   bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,
> -		bool charge_mem, gfp_t gfp_flags);
> +		bool charge_mem, gfp_t gfp_flags, bool from_user);
>   
>   void bpf_selem_free(struct bpf_local_storage_elem *selem,
>   		    struct bpf_local_storage_map *smap,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index c938dea5ddbf..c4cf09e27a19 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -73,7 +73,7 @@ static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
>   
>   struct bpf_local_storage_elem *
>   bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> -		void *value, bool charge_mem, gfp_t gfp_flags)
> +		void *value, bool charge_mem, gfp_t gfp_flags, bool from_user)
>   {
>   	struct bpf_local_storage_elem *selem;
>   
> @@ -100,7 +100,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>   
>   	if (selem) {
>   		if (value)
> -			copy_map_value(&smap->map, SDATA(selem)->data, value);
> +			copy_map_value_user(&smap->map, SDATA(selem)->data, value, from_user);
>   		/* No need to call check_and_init_map_value as memory is zero init */
>   		return selem;
>   	}
> @@ -530,9 +530,11 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   	struct bpf_local_storage_elem *alloc_selem, *selem = NULL;
>   	struct bpf_local_storage *local_storage;
>   	unsigned long flags;
> +	bool from_user = map_flags & BPF_FROM_USER;
>   	int err;
>   
>   	/* BPF_EXIST and BPF_NOEXIST cannot be both set */
> +	map_flags &= ~BPF_FROM_USER;
>   	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
>   	    /* BPF_F_LOCK can only be used in a value with spin_lock */
>   	    unlikely((map_flags & BPF_F_LOCK) &&
> @@ -550,7 +552,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   		if (err)
>   			return ERR_PTR(err);
>   
> -		selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
> +		selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>   		if (!selem)
>   			return ERR_PTR(-ENOMEM);
>   
> @@ -575,8 +577,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   		if (err)
>   			return ERR_PTR(err);
>   		if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> -			copy_map_value_locked(&smap->map, old_sdata->data,
> -					      value, false);
> +			copy_map_value_locked_user(&smap->map, old_sdata->data,
> +						   value, false, from_user);
>   			return old_sdata;
>   		}
>   	}
> @@ -584,7 +586,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   	/* A lookup has just been done before and concluded a new selem is
>   	 * needed. The chance of an unnecessary alloc is unlikely.
>   	 */
> -	alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
> +	alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>   	if (!alloc_selem)
>   		return ERR_PTR(-ENOMEM);
>   
> @@ -607,8 +609,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>   		goto unlock;
>   
>   	if (old_sdata && (map_flags & BPF_F_LOCK)) {
> -		copy_map_value_locked(&smap->map, old_sdata->data, value,
> -				      false);
> +		copy_map_value_locked_user(&smap->map, old_sdata->data, value,
> +					   false, from_user);
>   		selem = SELEM(old_sdata);
>   		goto unlock;
>   	}
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index d02ae323996b..4aef86209fdd 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -372,8 +372,8 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
>   	.arg1_btf_id    = BPF_PTR_POISON,
>   };
>   
> -void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> -			   bool lock_src)
> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
> +				bool lock_src, bool from_user)
>   {
>   	struct bpf_spin_lock *lock;
>   
> @@ -383,11 +383,17 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   		lock = dst + map->record->spin_lock_off;
>   	preempt_disable();
>   	__bpf_spin_lock_irqsave(lock);
> -	copy_map_value(map, dst, src);
> +	copy_map_value_user(map, dst, src, from_user);
>   	__bpf_spin_unlock_irqrestore(lock);
>   	preempt_enable();
>   }
>   
> +void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> +			   bool lock_src)
> +{
> +	copy_map_value_locked_user(map, dst, src, lock_src, false);
> +}
> +
>   BPF_CALL_0(bpf_jiffies64)
>   {
>   	return get_jiffies_64();
> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 3969eb0382af..62a12fa8ce9e 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -147,7 +147,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key,
>   	struct bpf_cgroup_storage *storage;
>   	struct bpf_storage_buffer *new;
>   
> -	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
> +	if (unlikely(flags & ~BPF_F_LOCK))
>   		return -EINVAL;
>   
>   	if (unlikely((flags & BPF_F_LOCK) &&
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 90a25307480e..eaa2a9d13265 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -155,8 +155,134 @@ static void maybe_wait_bpf_programs(struct bpf_map *map)
>   		synchronize_rcu();
>   }
>   
> -static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> -				void *key, void *value, __u64 flags)
> +static void *trans_addr_pages(struct page **pages, int npages)
> +{
> +	if (npages == 1)
> +		return page_address(pages[0]);
> +	/* For multiple pages, we need to use vmap() to get a contiguous
> +	 * virtual address range.
> +	 */
> +	return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> +}
> +
> +#define KPTR_USER_MAX_PAGES 16
> +
> +static int bpf_obj_trans_pin_uaddr(struct btf_field *field, void **addr)
> +{
> +	const struct btf_type *t;
> +	struct page *pages[KPTR_USER_MAX_PAGES];
> +	void *ptr, *kern_addr;
> +	u32 type_id, tsz;
> +	int r, npages;
> +
> +	ptr = *addr;
> +	type_id = field->kptr.btf_id;
> +	t = btf_type_id_size(field->kptr.btf, &type_id, &tsz);
> +	if (!t)
> +		return -EINVAL;
> +	if (tsz == 0) {
> +		*addr = NULL;
> +		return 0;
> +	}
> +
> +	npages = (((intptr_t)ptr + tsz + ~PAGE_MASK) -
> +		  ((intptr_t)ptr & PAGE_MASK)) >> PAGE_SHIFT;
> +	if (npages > KPTR_USER_MAX_PAGES)
> +		return -E2BIG;
> +	r = pin_user_pages_fast((intptr_t)ptr & PAGE_MASK, npages, 0, pages);
> +	if (r != npages)
> +		return -EINVAL;
> +	kern_addr = trans_addr_pages(pages, npages);
> +	if (!kern_addr)
> +		return -ENOMEM;
> +	*addr = kern_addr + ((intptr_t)ptr & ~PAGE_MASK);
> +	return 0;
> +}
> +
> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr)
> +{
> +	struct page *pages[KPTR_USER_MAX_PAGES];
> +	int npages, i;
> +	u32 size, type_id;
> +	void *ptr;
> +
> +	type_id = field->kptr.btf_id;
> +	btf_type_id_size(field->kptr.btf, &type_id, &size);
> +	if (size == 0)
> +		return;
> +
> +	ptr = (void *)((intptr_t)addr & PAGE_MASK);
> +	npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
> +	for (i = 0; i < npages; i++) {
> +		pages[i] = virt_to_page(ptr);
> +		ptr += PAGE_SIZE;
> +	}
> +	if (npages > 1)
> +		/* Paired with vmap() in trans_addr_pages() */
> +		vunmap((void *)((intptr_t)addr & PAGE_MASK));

Just realized that vunmap() should not be called in a non-sleepable
context. I would add an async variant of vunmap() to defer the
unmapping to a workqueue.
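
Roughly along these lines; a minimal sketch, assuming one work item
per mapping ("vunmap_work" and "vunmap_deferred" are made-up names,
not existing APIs). The unpin would likely need to move into the work
function as well, so the pages stay pinned until the mapping is gone:

    struct vunmap_work {
    	struct work_struct work;
    	void *addr;
    };

    static void vunmap_work_fn(struct work_struct *work)
    {
    	struct vunmap_work *vw = container_of(work, struct vunmap_work, work);

    	/* Workqueue context is sleepable, so vunmap() is safe here. */
    	vunmap(vw->addr);
    	kfree(vw);
    }

    /* Hypothetical helper: defer vunmap() to a sleepable context. */
    static void vunmap_deferred(void *addr)
    {
    	struct vunmap_work *vw = kmalloc(sizeof(*vw), GFP_ATOMIC);

    	if (!vw)
    		return;	/* error handling TBD; cannot vunmap() here */
    	vw->addr = addr;
    	INIT_WORK(&vw->work, vunmap_work_fn);
    	schedule_work(&vw->work);
    }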

> +	unpin_user_pages(pages, npages);
> +}
> +
> +static int bpf_obj_trans_pin_uaddrs(struct btf_record *rec, void *src, u32 size)
> +{
> +	u32 next_off;
> +	int i, err;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return 0;
> +
> +	if (!btf_record_has_field(rec, BPF_KPTR_USER))
> +		return 0;
> +
> +	for (i = 0; i < rec->cnt; i++) {
> +		if (rec->fields[i].type != BPF_KPTR_USER)
> +			continue;
> +
> +		next_off = rec->fields[i].offset;
> +		if (next_off + sizeof(void *) > size)
> +			return -EINVAL;
> +		err = bpf_obj_trans_pin_uaddr(&rec->fields[i], src + next_off);
> +		if (!err)
> +			continue;
> +
> +		/* Rollback */
> +		for (i--; i >= 0; i--) {
> +			if (rec->fields[i].type != BPF_KPTR_USER)
> +				continue;
> +			next_off = rec->fields[i].offset;
> +			bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
> +			*(void **)(src + next_off) = NULL;
> +		}
> +
> +		return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static void bpf_obj_unpin_uaddrs(struct btf_record *rec, void *src)
> +{
> +	u32 next_off;
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return;
> +
> +	if (!btf_record_has_field(rec, BPF_KPTR_USER))
> +		return;
> +
> +	for (i = 0; i < rec->cnt; i++) {
> +		if (rec->fields[i].type != BPF_KPTR_USER)
> +			continue;
> +
> +		next_off = rec->fields[i].offset;
> +		bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
> +		*(void **)(src + next_off) = NULL;
> +	}
> +}
> +
> +static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
> +				      void *key, void *value, __u64 flags)
>   {
>   	int err;
>   
> @@ -208,6 +334,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>   	return err;
>   }
>   
> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> +				void *key, void *value, __u64 flags)
> +{
> +	int err;
> +
> +	if (flags & BPF_FROM_USER) {
> +		/* Pinning user memory can lead to a context switch, so we
> +		 * need to do it before taking a potential RCU lock.
> +		 */
> +		err = bpf_obj_trans_pin_uaddrs(map->record, value,
> +					       bpf_map_value_size(map));
> +		if (err)
> +			return err;
> +	}
> +
> +	err = bpf_map_update_value_inner(map, map_file, key, value, flags);
> +
> +	if (err && (flags & BPF_FROM_USER))
> +		bpf_obj_unpin_uaddrs(map->record, value);
> +
> +	return err;
> +}
> +
>   static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
>   			      __u64 flags)
>   {
> @@ -714,6 +863,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>   				field->kptr.dtor(xchgd_field);
>   			}
>   			break;
> +		case BPF_KPTR_USER:
> +			if (virt_addr_valid(*(void **)field_ptr))
> +				bpf_obj_unpin_uaddr(field, *(void **)field_ptr);
> +			*(void **)field_ptr = NULL;
> +			break;
>   		case BPF_LIST_HEAD:
>   			if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>   				continue;
> @@ -1155,6 +1309,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>   					goto free_map_tab;
>   				}
>   				break;
> +			case BPF_KPTR_USER:
> +				if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
> +					ret = -EOPNOTSUPP;
> +					goto free_map_tab;
> +				}
> +				break;
>   			case BPF_LIST_HEAD:
>   			case BPF_RB_ROOT:
>   				if (map->map_type != BPF_MAP_TYPE_HASH &&
> @@ -1618,11 +1778,15 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>   	struct bpf_map *map;
>   	void *key, *value;
>   	u32 value_size;
> +	u64 extra_flags = 0;
>   	struct fd f;
>   	int err;
>   
>   	if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
>   		return -EINVAL;
> +	/* Prevent userspace from setting any internal flags */
> +	if (attr->flags & ~(BIT(BPF_MAP_UPDATE_FLAG_BITS) - 1))
> +		return -EINVAL;
>   
>   	f = fdget(ufd);
>   	map = __bpf_map_get(f);
> @@ -1653,7 +1817,9 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>   		goto free_key;
>   	}
>   
> -	err = bpf_map_update_value(map, f.file, key, value, attr->flags);
> +	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +		extra_flags |= BPF_FROM_USER;
> +	err = bpf_map_update_value(map, f.file, key, value, attr->flags | extra_flags);
>   	if (!err)
>   		maybe_wait_bpf_programs(map);
>   
> @@ -1852,6 +2018,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>   	void __user *keys = u64_to_user_ptr(attr->batch.keys);
>   	u32 value_size, cp, max_count;
>   	void *key, *value;
> +	u64 extra_flags = 0;
>   	int err = 0;
>   
>   	if (attr->batch.elem_flags & ~BPF_F_LOCK)
> @@ -1881,6 +2048,8 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>   		return -ENOMEM;
>   	}
>   
> +	if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +		extra_flags |= BPF_FROM_USER;
>   	for (cp = 0; cp < max_count; cp++) {
>   		err = -EFAULT;
>   		if (copy_from_user(key, keys + cp * map->key_size,
> @@ -1889,7 +2058,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>   			break;
>   
>   		err = bpf_map_update_value(map, map_file, key, value,
> -					   attr->batch.elem_flags);
> +					   attr->batch.elem_flags | extra_flags);
>   
>   		if (err)
>   			break;
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index bc01b3aa6b0f..db5281384e6a 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -137,7 +137,7 @@ bpf_sk_storage_clone_elem(struct sock *newsk,
>   {
>   	struct bpf_local_storage_elem *copy_selem;
>   
> -	copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC);
> +	copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC, false);
>   	if (!copy_selem)
>   		return NULL;
>   

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
                     ` (2 preceding siblings ...)
  2024-08-12 16:00   ` Kui-Feng Lee
@ 2024-08-12 16:45   ` Alexei Starovoitov
  2024-08-12 17:24     ` Kui-Feng Lee
  3 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2024-08-12 16:45 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu, Kernel Team,
	Andrii Nakryiko, Kui-Feng Lee, Kui-Feng Lee, linux-mm

On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> User kptrs are pinned, by pin_user_pages_fast(), and translated to an
> address in the kernel when the value is updated by user programs (i.e.,
> when bpf_map_update_elem() is called from user programs). The pinned
> pages are unpinned when the values of user kptrs are overwritten or
> when the values of maps are deleted/destroyed.
>
> The pages are mapped through vmap() in order to get a contiguous range
> in the kernel if the memory pointed to by a user kptr resides in two or
> more pages. For the case of a single page, page_address() is called to
> get the address of the page in the kernel.
>
> User kptr is only supported by task storage maps.
>
> One user kptr can pin at most KPTR_USER_MAX_PAGES (16) physical pages.
> This is a randomly picked number, chosen for safety. We could actually
> remove this restriction entirely.
>
> User kptrs can only be set by user programs through syscalls. Any
> attempt to update the value of a map with a __kptr_user in it should
> ignore the values of user kptrs coming from BPF programs. The values of
> user kptrs remain as they were if the new values come from BPF
> programs, not from user programs.
>
> Cc: linux-mm@kvack.org
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  include/linux/bpf.h               |  35 +++++-
>  include/linux/bpf_local_storage.h |   2 +-
>  kernel/bpf/bpf_local_storage.c    |  18 +--
>  kernel/bpf/helpers.c              |  12 +-
>  kernel/bpf/local_storage.c        |   2 +-
>  kernel/bpf/syscall.c              | 177 +++++++++++++++++++++++++++++-
>  net/core/bpf_sk_storage.c         |   2 +-
>  7 files changed, 227 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 87d5f98249e2..f4ad0bc183cb 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -30,6 +30,7 @@
>  #include <linux/static_call.h>
>  #include <linux/memcontrol.h>
>  #include <linux/cfi.h>
> +#include <linux/mm.h>
>
>  struct bpf_verifier_env;
>  struct bpf_verifier_log;
> @@ -477,10 +478,12 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>                 data_race(*ldst++ = *lsrc++);
>  }
>
> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr);
> +
>  /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
>  static inline void bpf_obj_memcpy(struct btf_record *rec,
>                                   void *dst, void *src, u32 size,
> -                                 bool long_memcpy)
> +                                 bool long_memcpy, bool from_user)
>  {
>         u32 curr_off = 0;
>         int i;
> @@ -496,21 +499,40 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
>         for (i = 0; i < rec->cnt; i++) {
>                 u32 next_off = rec->fields[i].offset;
>                 u32 sz = next_off - curr_off;
> +               void *addr;
>
>                 memcpy(dst + curr_off, src + curr_off, sz);
> +               if (from_user && rec->fields[i].type == BPF_KPTR_USER) {


Do not add this to bpf_obj_memcpy(), which is a critical path
for various map operations.
This has to be standalone, for task storage only.
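
A standalone version could look roughly like this; a sketch only
("bpf_obj_swap_uptrs" is a made-up name). It would run on the task
storage syscall path after the plain copy, leaving bpf_obj_memcpy()
untouched for every other map operation:

    /* Sketch: called only from the task storage update path, after a
     * regular copy_map_value(), which skips the special fields.
     */
    static void bpf_obj_swap_uptrs(const struct btf_record *rec,
    			       void *dst, void *src)
    {
    	void *old;
    	u32 off;
    	int i;

    	for (i = 0; i < rec->cnt; i++) {
    		if (rec->fields[i].type != BPF_KPTR_USER)
    			continue;

    		off = rec->fields[i].offset;
    		old = *(void **)(dst + off);
    		if (old)
    			bpf_obj_unpin_uaddr(&rec->fields[i], old);
    		*(void **)(dst + off) = *(void **)(src + off);
    	}
    }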

> +                       /* Unpin old address.
> +                        *
> +                        * Alignments are guaranteed by btf_find_field_one().
> +                        */
> +                       addr = *(void **)(dst + next_off);
> +                       if (virt_addr_valid(addr))
> +                               bpf_obj_unpin_uaddr(&rec->fields[i], addr);
> +                       else if (addr)
> +                               WARN_ON_ONCE(1);
> +
> +                       *(void **)(dst + next_off) = *(void **)(src + next_off);
> +               }
>                 curr_off += rec->fields[i].size + sz;
>         }
>         memcpy(dst + curr_off, src + curr_off, size - curr_off);
>  }
>
> +static inline void copy_map_value_user(struct bpf_map *map, void *dst, void *src, bool from_user)

No need for these helpers either.

> +{
> +       bpf_obj_memcpy(map->record, dst, src, map->value_size, false, from_user);
> +}
> +
>  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>  {
> -       bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
> +       bpf_obj_memcpy(map->record, dst, src, map->value_size, false, false);
>  }
>
>  static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
>  {
> -       bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
> +       bpf_obj_memcpy(map->record, dst, src, map->value_size, true, false);
>  }
>
>  static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
> @@ -538,6 +560,8 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
>         bpf_obj_memzero(map->record, dst, map->value_size);
>  }
>
> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
> +                               bool lock_src, bool from_user);
>  void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>                            bool lock_src);
>  void bpf_timer_cancel_and_free(void *timer);
> @@ -775,6 +799,11 @@ enum bpf_arg_type {
>  };
>  static_assert(__BPF_ARG_TYPE_MAX <= BPF_BASE_TYPE_LIMIT);
>
> +#define BPF_MAP_UPDATE_FLAG_BITS 3
> +enum bpf_map_update_flag {
> +       BPF_FROM_USER = BIT(0 + BPF_MAP_UPDATE_FLAG_BITS)
> +};
> +
>  /* type of values returned from helper functions */
>  enum bpf_return_type {
>         RET_INTEGER,                    /* function returns integer */
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index dcddb0aef7d8..d337df68fa23 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -181,7 +181,7 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
>
>  struct bpf_local_storage_elem *
>  bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,
> -               bool charge_mem, gfp_t gfp_flags);
> +               bool charge_mem, gfp_t gfp_flags, bool from_user);
>
>  void bpf_selem_free(struct bpf_local_storage_elem *selem,
>                     struct bpf_local_storage_map *smap,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index c938dea5ddbf..c4cf09e27a19 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -73,7 +73,7 @@ static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
>
>  struct bpf_local_storage_elem *
>  bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> -               void *value, bool charge_mem, gfp_t gfp_flags)
> +               void *value, bool charge_mem, gfp_t gfp_flags, bool from_user)
>  {
>         struct bpf_local_storage_elem *selem;
>
> @@ -100,7 +100,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>
>         if (selem) {
>                 if (value)
> -                       copy_map_value(&smap->map, SDATA(selem)->data, value);
> +                       copy_map_value_user(&smap->map, SDATA(selem)->data, value, from_user);
>                 /* No need to call check_and_init_map_value as memory is zero init */
>                 return selem;
>         }
> @@ -530,9 +530,11 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>         struct bpf_local_storage_elem *alloc_selem, *selem = NULL;
>         struct bpf_local_storage *local_storage;
>         unsigned long flags;
> +       bool from_user = map_flags & BPF_FROM_USER;
>         int err;
>
>         /* BPF_EXIST and BPF_NOEXIST cannot be both set */
> +       map_flags &= ~BPF_FROM_USER;
>         if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
>             /* BPF_F_LOCK can only be used in a value with spin_lock */
>             unlikely((map_flags & BPF_F_LOCK) &&
> @@ -550,7 +552,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>                 if (err)
>                         return ERR_PTR(err);
>
> -               selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
> +               selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>                 if (!selem)
>                         return ERR_PTR(-ENOMEM);
>
> @@ -575,8 +577,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>                 if (err)
>                         return ERR_PTR(err);
>                 if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> -                       copy_map_value_locked(&smap->map, old_sdata->data,
> -                                             value, false);
> +                       copy_map_value_locked_user(&smap->map, old_sdata->data,
> +                                                  value, false, from_user);
>                         return old_sdata;
>                 }
>         }
> @@ -584,7 +586,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>         /* A lookup has just been done before and concluded a new selem is
>          * needed. The chance of an unnecessary alloc is unlikely.
>          */
> -       alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
> +       alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>         if (!alloc_selem)
>                 return ERR_PTR(-ENOMEM);
>
> @@ -607,8 +609,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>                 goto unlock;
>
>         if (old_sdata && (map_flags & BPF_F_LOCK)) {
> -               copy_map_value_locked(&smap->map, old_sdata->data, value,
> -                                     false);
> +               copy_map_value_locked_user(&smap->map, old_sdata->data, value,
> +                                          false, from_user);
>                 selem = SELEM(old_sdata);
>                 goto unlock;
>         }
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index d02ae323996b..4aef86209fdd 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -372,8 +372,8 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
>         .arg1_btf_id    = BPF_PTR_POISON,
>  };
>
> -void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> -                          bool lock_src)
> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
> +                               bool lock_src, bool from_user)
>  {
>         struct bpf_spin_lock *lock;
>
> @@ -383,11 +383,17 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>                 lock = dst + map->record->spin_lock_off;
>         preempt_disable();
>         __bpf_spin_lock_irqsave(lock);
> -       copy_map_value(map, dst, src);
> +       copy_map_value_user(map, dst, src, from_user);
>         __bpf_spin_unlock_irqrestore(lock);
>         preempt_enable();
>  }
>
> +void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> +                          bool lock_src)
> +{
> +       copy_map_value_locked_user(map, dst, src, lock_src, false);
> +}
> +
>  BPF_CALL_0(bpf_jiffies64)
>  {
>         return get_jiffies_64();
> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 3969eb0382af..62a12fa8ce9e 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -147,7 +147,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key,
>         struct bpf_cgroup_storage *storage;
>         struct bpf_storage_buffer *new;
>
> -       if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
> +       if (unlikely(flags & ~BPF_F_LOCK))
>                 return -EINVAL;
>
>         if (unlikely((flags & BPF_F_LOCK) &&
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 90a25307480e..eaa2a9d13265 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -155,8 +155,134 @@ static void maybe_wait_bpf_programs(struct bpf_map *map)
>                 synchronize_rcu();
>  }
>
> -static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> -                               void *key, void *value, __u64 flags)
> +static void *trans_addr_pages(struct page **pages, int npages)
> +{
> +       if (npages == 1)
> +               return page_address(pages[0]);
> +       /* For multiple pages, we need to use vmap() to get a contiguous
> +        * virtual address range.
> +        */
> +       return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> +}

Don't quite see a need for trans_addr_pages() helper when it's used
once.
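
e.g. just open-code it at the single call site, something like:

    kern_addr = npages == 1 ? page_address(pages[0]) :
    			      vmap(pages, npages, VM_MAP, PAGE_KERNEL);
    if (!kern_addr)
    	return -ENOMEM;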

> +
> +#define KPTR_USER_MAX_PAGES 16
> +
> +static int bpf_obj_trans_pin_uaddr(struct btf_field *field, void **addr)
> +{
> +       const struct btf_type *t;
> +       struct page *pages[KPTR_USER_MAX_PAGES];
> +       void *ptr, *kern_addr;
> +       u32 type_id, tsz;
> +       int r, npages;
> +
> +       ptr = *addr;
> +       type_id = field->kptr.btf_id;
> +       t = btf_type_id_size(field->kptr.btf, &type_id, &tsz);
> +       if (!t)
> +               return -EINVAL;
> +       if (tsz == 0) {
> +               *addr = NULL;
> +               return 0;
> +       }
> +
> +       npages = (((intptr_t)ptr + tsz + ~PAGE_MASK) -
> +                 ((intptr_t)ptr & PAGE_MASK)) >> PAGE_SHIFT;
> +       if (npages > KPTR_USER_MAX_PAGES)
> +               return -E2BIG;
> +       r = pin_user_pages_fast((intptr_t)ptr & PAGE_MASK, npages, 0, pages);

No need to apply the mask on ptr. See pin_user_pages_fast() internals.

It probably should be FOLL_WRITE | FOLL_LONGTERM instead of 0.
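
i.e. something along these lines (untested sketch):

    r = pin_user_pages_fast((unsigned long)ptr, npages,
    			    FOLL_WRITE | FOLL_LONGTERM, pages);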

> +       if (r != npages)
> +               return -EINVAL;
> +       kern_addr = trans_addr_pages(pages, npages);
> +       if (!kern_addr)
> +               return -ENOMEM;
> +       *addr = kern_addr + ((intptr_t)ptr & ~PAGE_MASK);
> +       return 0;
> +}
> +
> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr)
> +{
> +       struct page *pages[KPTR_USER_MAX_PAGES];
> +       int npages, i;
> +       u32 size, type_id;
> +       void *ptr;
> +
> +       type_id = field->kptr.btf_id;
> +       btf_type_id_size(field->kptr.btf, &type_id, &size);
> +       if (size == 0)
> +               return;
> +
> +       ptr = (void *)((intptr_t)addr & PAGE_MASK);
> +       npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
> +       for (i = 0; i < npages; i++) {
> +               pages[i] = virt_to_page(ptr);
> +               ptr += PAGE_SIZE;
> +       }
> +       if (npages > 1)
> +               /* Paired with vmap() in trans_addr_pages() */
> +               vunmap((void *)((intptr_t)addr & PAGE_MASK));
> +       unpin_user_pages(pages, npages);
> +}
> +
> +static int bpf_obj_trans_pin_uaddrs(struct btf_record *rec, void *src, u32 size)
> +{
> +       u32 next_off;
> +       int i, err;
> +
> +       if (IS_ERR_OR_NULL(rec))
> +               return 0;
> +
> +       if (!btf_record_has_field(rec, BPF_KPTR_USER))
> +               return 0;

imo kptr_user doesn't quite fit as a name.
'kptr' means 'kernel pointer', but here it's a user address.
Maybe just "uptr"?

> +
> +       for (i = 0; i < rec->cnt; i++) {
> +               if (rec->fields[i].type != BPF_KPTR_USER)
> +                       continue;
> +
> +               next_off = rec->fields[i].offset;
> +               if (next_off + sizeof(void *) > size)
> +                       return -EINVAL;
> +               err = bpf_obj_trans_pin_uaddr(&rec->fields[i], src + next_off);
> +               if (!err)
> +                       continue;
> +
> +               /* Rollback */
> +               for (i--; i >= 0; i--) {
> +                       if (rec->fields[i].type != BPF_KPTR_USER)
> +                               continue;
> +                       next_off = rec->fields[i].offset;
> +                       bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
> +                       *(void **)(src + next_off) = NULL;
> +               }
> +
> +               return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static void bpf_obj_unpin_uaddrs(struct btf_record *rec, void *src)
> +{
> +       u32 next_off;
> +       int i;
> +
> +       if (IS_ERR_OR_NULL(rec))
> +               return;
> +
> +       if (!btf_record_has_field(rec, BPF_KPTR_USER))
> +               return;
> +
> +       for (i = 0; i < rec->cnt; i++) {
> +               if (rec->fields[i].type != BPF_KPTR_USER)
> +                       continue;
> +
> +               next_off = rec->fields[i].offset;
> +               bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
> +               *(void **)(src + next_off) = NULL;

This part is pretty much the same as the undo part in
bpf_obj_trans_pin_uaddrs() and a common helper is warranted.
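
Something like this could serve both paths (a sketch; the helper name is
made up):

    static void bpf_obj_unpin_uaddr_field(const struct btf_field *field,
                                          void *src)
    {
            u32 off = field->offset;

            bpf_obj_unpin_uaddr(field, *(void **)(src + off));
            *(void **)(src + off) = NULL;
    }

Both the rollback loop in bpf_obj_trans_pin_uaddrs() and
bpf_obj_unpin_uaddrs() could then call it.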

> +       }
> +}
> +
> +static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
> +                                     void *key, void *value, __u64 flags)
>  {
>         int err;
>
> @@ -208,6 +334,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>         return err;
>  }
>
> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> +                               void *key, void *value, __u64 flags)
> +{
> +       int err;
> +
> +       if (flags & BPF_FROM_USER) {

there shouldn't be a need for this extra flag.
map->record has the info whether uptr is present or not.
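
I.e., bpf_map_update_value() could decide on its own; a sketch, assuming
the call comes from the syscall path:

    if (btf_record_has_field(map->record, BPF_KPTR_USER)) {
            err = bpf_obj_trans_pin_uaddrs(map->record, value,
                                           bpf_map_value_size(map));
            if (err)
                    return err;
    }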

> +               /* Pinning user memory can lead to a context switch, so we
> +                * need to do it before taking a potential RCU lock.
> +                */
> +               err = bpf_obj_trans_pin_uaddrs(map->record, value,
> +                                              bpf_map_value_size(map));
> +               if (err)
> +                       return err;
> +       }
> +
> +       err = bpf_map_update_value_inner(map, map_file, key, value, flags);
> +
> +       if (err && (flags & BPF_FROM_USER))
> +               bpf_obj_unpin_uaddrs(map->record, value);
> +
> +       return err;
> +}
> +
>  static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
>                               __u64 flags)
>  {
> @@ -714,6 +863,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>                                 field->kptr.dtor(xchgd_field);
>                         }
>                         break;
> +               case BPF_KPTR_USER:
> +                       if (virt_addr_valid(*(void **)field_ptr))
> +                               bpf_obj_unpin_uaddr(field, *(void **)field_ptr);
> +                       *(void **)field_ptr = NULL;
> +                       break;
>                 case BPF_LIST_HEAD:
>                         if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>                                 continue;
> @@ -1155,6 +1309,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>                                         goto free_map_tab;
>                                 }
>                                 break;
> +                       case BPF_KPTR_USER:
> +                               if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
> +                                       ret = -EOPNOTSUPP;
> +                                       goto free_map_tab;
> +                               }
> +                               break;
>                         case BPF_LIST_HEAD:
>                         case BPF_RB_ROOT:
>                                 if (map->map_type != BPF_MAP_TYPE_HASH &&
> @@ -1618,11 +1778,15 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>         struct bpf_map *map;
>         void *key, *value;
>         u32 value_size;
> +       u64 extra_flags = 0;
>         struct fd f;
>         int err;
>
>         if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
>                 return -EINVAL;
> +       /* Prevent userspace from setting any internal flags */
> +       if (attr->flags & ~(BIT(BPF_MAP_UPDATE_FLAG_BITS) - 1))
> +               return -EINVAL;
>
>         f = fdget(ufd);
>         map = __bpf_map_get(f);
> @@ -1653,7 +1817,9 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>                 goto free_key;
>         }
>
> -       err = bpf_map_update_value(map, f.file, key, value, attr->flags);
> +       if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +               extra_flags |= BPF_FROM_USER;
> +       err = bpf_map_update_value(map, f.file, key, value, attr->flags | extra_flags);
>         if (!err)
>                 maybe_wait_bpf_programs(map);
>
> @@ -1852,6 +2018,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>         void __user *keys = u64_to_user_ptr(attr->batch.keys);
>         u32 value_size, cp, max_count;
>         void *key, *value;
> +       u64 extra_flags = 0;
>         int err = 0;
>
>         if (attr->batch.elem_flags & ~BPF_F_LOCK)
> @@ -1881,6 +2048,8 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>                 return -ENOMEM;
>         }
>
> +       if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
> +               extra_flags |= BPF_FROM_USER;
>         for (cp = 0; cp < max_count; cp++) {
>                 err = -EFAULT;
>                 if (copy_from_user(key, keys + cp * map->key_size,
> @@ -1889,7 +2058,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>                         break;
>
>                 err = bpf_map_update_value(map, map_file, key, value,
> -                                          attr->batch.elem_flags);
> +                                          attr->batch.elem_flags | extra_flags);
>
>                 if (err)
>                         break;
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index bc01b3aa6b0f..db5281384e6a 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -137,7 +137,7 @@ bpf_sk_storage_clone_elem(struct sock *newsk,
>  {
>         struct bpf_local_storage_elem *copy_selem;
>
> -       copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC);
> +       copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC, false);
>         if (!copy_selem)
>                 return NULL;
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier.
  2024-08-07 23:57 ` [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier Kui-Feng Lee
@ 2024-08-12 16:48   ` Alexei Starovoitov
  2024-08-13 16:52     ` Kui-Feng Lee
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2024-08-12 16:48 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu, Kernel Team,
	Andrii Nakryiko, Kui-Feng Lee, Kui-Feng Lee

On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> Give PTR_MAYBE_NULL | PTR_UNTRUSTED | MEM_ALLOC | NON_OWN_REF to kptr_user
> to make the memory pointed to by it readable and writable.
>
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  kernel/bpf/verifier.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index df3be12096cf..84647e599595 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5340,6 +5340,10 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
>         int perm_flags;
>         const char *reg_name = "";
>
> +       if (kptr_field->type == BPF_KPTR_USER)
> +               /* BPF programs should not change any user kptr */
> +               return -EACCES;
> +
>         if (btf_is_kernel(reg->btf)) {
>                 perm_flags = PTR_MAYBE_NULL | PTR_TRUSTED | MEM_RCU;
>
> @@ -5483,6 +5487,12 @@ static u32 btf_ld_kptr_type(struct bpf_verifier_env *env, struct btf_field *kptr
>                         ret |= NON_OWN_REF;
>         } else {
>                 ret |= PTR_UNTRUSTED;
> +               if (kptr_field->type == BPF_KPTR_USER)
> +                       /* In order to allow direct access from bpf
> +                        * programs, NON_OWN_REF makes the memory
> +                        * writable. See check_ptr_to_btf_access().
> +                        */
> +                       ret |= MEM_ALLOC | NON_OWN_REF;

UNTRUSTED | MEM_ALLOC | NON_OWN_REF ?!

That doesn't fit into any of the existing verifier schemes.
I cannot make sense of this part.

UNTRUSTED | MEM_ALLOC is read only through exceptions logic.
The uptr has to be read/write through normal load/store.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map.
  2024-08-07 23:57 ` [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map Kui-Feng Lee
@ 2024-08-12 16:58   ` Alexei Starovoitov
  2024-08-12 17:15     ` Kui-Feng Lee
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2024-08-12 16:58 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu, Kernel Team,
	Andrii Nakryiko, Kui-Feng Lee, Kui-Feng Lee

On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
> +
> +       user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
> +                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +       if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
> +               return;
> +
> +       memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
> +       value.udata_mmap = user_data_mmap;
> +       value.udata = &user_data;

There shouldn't be a need to do mmap(). It's too much memory overhead.
The user should be able to write:
static __thread struct user_data udata;
value.udata = &udata;
bpf_map_update_elem(map_fd, my_task_fd, &value)
and do it once.
Later multi thread user code will just access "udata".
No map lookups.

If sizeof(udata) is small enough the kernel will pin either
one or two pages (if udata crosses page boundary).

So no extra memory consumption by the user process while the kernel
pins a page or two.
In a good case it's one page and no extra vmap.
I wonder whether we should enforce that one page case.
It's not hard for users to write:
static __thread struct user_data udata __attribute__((aligned(sizeof(udata))));
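
Spelled out, the user side would be roughly this (a sketch; the map_fd and
task_fd plumbing is assumed, and aligned(sizeof(...)) only compiles when
the size is a power of two):

    static __thread struct user_data udata
            __attribute__((aligned(sizeof(struct user_data))));

    struct value_type value = { .udata = &udata };

    /* once per thread; afterwards the thread dereferences
     * udata directly, no map lookups
     */
    int err = bpf_map_update_elem(map_fd, &task_fd, &value, 0);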

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map.
  2024-08-12 16:58   ` Alexei Starovoitov
@ 2024-08-12 17:15     ` Kui-Feng Lee
  2024-08-12 17:31       ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-12 17:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu, Kernel Team,
	Andrii Nakryiko, Kui-Feng Lee



On 8/12/24 09:58, Alexei Starovoitov wrote:
> On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>> +
>> +       user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
>> +                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> +       if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
>> +               return;
>> +
>> +       memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
>> +       value.udata_mmap = user_data_mmap;
>> +       value.udata = &user_data;
> 
> There shouldn't be a need to do mmap(). It's too much memory overhead.
> The user should be able to write:
> static __thread struct user_data udata;
> value.udata = &udata;
> bpf_map_update_elem(map_fd, my_task_fd, &value)
> and do it once.
> Later multi thread user code will just access "udata".
> No map lookups.

mmap() is not necessary here. There are two pointers here.
The udata_mmap one is used to test the case of crossing a page
boundary, although the current RFC fails to do that. It will be fixed
later. The udata one works just like what you have described, except
that user_data is a local variable.

> 
> If sizeof(udata) is small enough the kernel will pin either
> one or two pages (if udata crosses page boundary).
> 
> So no extra memory consumption by the user process while the kernel
> pins a page or two.
> In a good case it's one page and no extra vmap.
> I wonder whether we should enforce that one page case.
> It's not hard for users to write:
> static __thread struct user_data udata __attribute__((aligned(sizeof(udata))));

With a one-page restriction, the implementation would be much simpler. If
you think it is a reasonable restriction, I would enforce this rule.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-12 16:45   ` Alexei Starovoitov
@ 2024-08-12 17:24     ` Kui-Feng Lee
  2024-08-12 17:36       ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-12 17:24 UTC (permalink / raw)
  To: Alexei Starovoitov, Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu, Kernel Team,
	Andrii Nakryiko, Kui-Feng Lee, linux-mm



On 8/12/24 09:45, Alexei Starovoitov wrote:
> On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>
>> User kptrs are pinned by pin_user_pages_fast() and translated to an
>> address in the kernel when the value is updated by user programs
>> (i.e., when bpf_map_update_elem() is called from a user program). The
>> pinned pages are unpinned when the values of user kptrs are
>> overwritten or when map values are deleted/destroyed.
>>
>> The pages are mapped through vmap() in order to get a contiguous space in
>> the kernel if the memory pointed to by a user kptr resides in two or more
>> pages. For the single-page case, page_address() is called to get the
>> address of the page in the kernel.
>>
>> User kptr is only supported by task storage maps.
>>
>> One user kptr can pin at most KPTR_USER_MAX_PAGES (16) physical pages.
>> This is a randomly picked number for safety. We could actually remove
>> this restriction entirely.
>>
>> User kptrs can only be set by user programs through syscalls.  Any
>> attempt to update the value of a map with a __kptr_user in it from a
>> BPF program should ignore the user kptr fields. The values of user
>> kptrs are kept as they were if the new values come from BPF programs
>> rather than from user programs.
>>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   include/linux/bpf.h               |  35 +++++-
>>   include/linux/bpf_local_storage.h |   2 +-
>>   kernel/bpf/bpf_local_storage.c    |  18 +--
>>   kernel/bpf/helpers.c              |  12 +-
>>   kernel/bpf/local_storage.c        |   2 +-
>>   kernel/bpf/syscall.c              | 177 +++++++++++++++++++++++++++++-
>>   net/core/bpf_sk_storage.c         |   2 +-
>>   7 files changed, 227 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 87d5f98249e2..f4ad0bc183cb 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -30,6 +30,7 @@
>>   #include <linux/static_call.h>
>>   #include <linux/memcontrol.h>
>>   #include <linux/cfi.h>
>> +#include <linux/mm.h>
>>
>>   struct bpf_verifier_env;
>>   struct bpf_verifier_log;
>> @@ -477,10 +478,12 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>>                  data_race(*ldst++ = *lsrc++);
>>   }
>>
>> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr);
>> +
>>   /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
>>   static inline void bpf_obj_memcpy(struct btf_record *rec,
>>                                    void *dst, void *src, u32 size,
>> -                                 bool long_memcpy)
>> +                                 bool long_memcpy, bool from_user)
>>   {
>>          u32 curr_off = 0;
>>          int i;
>> @@ -496,21 +499,40 @@ static inline void bpf_obj_memcpy(struct btf_record *rec,
>>          for (i = 0; i < rec->cnt; i++) {
>>                  u32 next_off = rec->fields[i].offset;
>>                  u32 sz = next_off - curr_off;
>> +               void *addr;
>>
>>                  memcpy(dst + curr_off, src + curr_off, sz);
>> +               if (from_user && rec->fields[i].type == BPF_KPTR_USER) {
> 
> 
> Do not add this to bpf_obj_memcpy() which is a critical path
> for various map operations.
> This has to be standalone for task storage only.
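
E.g., a standalone variant used only by the task storage update path
(a sketch; the name is made up, and it leans on copy_map_value()
skipping the special fields):

    static void copy_map_value_uptr(struct bpf_map *map, void *dst, void *src)
    {
            struct btf_record *rec = map->record;
            int i;

            /* copy_map_value() already skips kptr fields */
            copy_map_value(map, dst, src);
            if (IS_ERR_OR_NULL(rec))
                    return;
            for (i = 0; i < rec->cnt; i++) {
                    u32 off = rec->fields[i].offset;
                    void *old;

                    if (rec->fields[i].type != BPF_KPTR_USER)
                            continue;
                    old = *(void **)(dst + off);
                    if (old)
                            bpf_obj_unpin_uaddr(&rec->fields[i], old);
                    *(void **)(dst + off) = *(void **)(src + off);
            }
    }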
> 
>> +                       /* Unpin old address.
>> +                        *
>> +                        * Alignments are guaranteed by btf_find_field_one().
>> +                        */
>> +                       addr = *(void **)(dst + next_off);
>> +                       if (virt_addr_valid(addr))
>> +                               bpf_obj_unpin_uaddr(&rec->fields[i], addr);
>> +                       else if (addr)
>> +                               WARN_ON_ONCE(1);
>> +
>> +                       *(void **)(dst + next_off) = *(void **)(src + next_off);
>> +               }
>>                  curr_off += rec->fields[i].size + sz;
>>          }
>>          memcpy(dst + curr_off, src + curr_off, size - curr_off);
>>   }
>>
>> +static inline void copy_map_value_user(struct bpf_map *map, void *dst, void *src, bool from_user)
> 
> No need for these helpers either.
> 
>> +{
>> +       bpf_obj_memcpy(map->record, dst, src, map->value_size, false, from_user);
>> +}
>> +
>>   static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
>>   {
>> -       bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
>> +       bpf_obj_memcpy(map->record, dst, src, map->value_size, false, false);
>>   }
>>
>>   static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
>>   {
>> -       bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
>> +       bpf_obj_memcpy(map->record, dst, src, map->value_size, true, false);
>>   }
>>
>>   static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
>> @@ -538,6 +560,8 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
>>          bpf_obj_memzero(map->record, dst, map->value_size);
>>   }
>>
>> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
>> +                               bool lock_src, bool from_user);
>>   void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>>                             bool lock_src);
>>   void bpf_timer_cancel_and_free(void *timer);
>> @@ -775,6 +799,11 @@ enum bpf_arg_type {
>>   };
>>   static_assert(__BPF_ARG_TYPE_MAX <= BPF_BASE_TYPE_LIMIT);
>>
>> +#define BPF_MAP_UPDATE_FLAG_BITS 3
>> +enum bpf_map_update_flag {
>> +       BPF_FROM_USER = BIT(0 + BPF_MAP_UPDATE_FLAG_BITS)
>> +};
>> +
>>   /* type of values returned from helper functions */
>>   enum bpf_return_type {
>>          RET_INTEGER,                    /* function returns integer */
>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>> index dcddb0aef7d8..d337df68fa23 100644
>> --- a/include/linux/bpf_local_storage.h
>> +++ b/include/linux/bpf_local_storage.h
>> @@ -181,7 +181,7 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
>>
>>   struct bpf_local_storage_elem *
>>   bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,
>> -               bool charge_mem, gfp_t gfp_flags);
>> +               bool charge_mem, gfp_t gfp_flags, bool from_user);
>>
>>   void bpf_selem_free(struct bpf_local_storage_elem *selem,
>>                      struct bpf_local_storage_map *smap,
>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>> index c938dea5ddbf..c4cf09e27a19 100644
>> --- a/kernel/bpf/bpf_local_storage.c
>> +++ b/kernel/bpf/bpf_local_storage.c
>> @@ -73,7 +73,7 @@ static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
>>
>>   struct bpf_local_storage_elem *
>>   bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>> -               void *value, bool charge_mem, gfp_t gfp_flags)
>> +               void *value, bool charge_mem, gfp_t gfp_flags, bool from_user)
>>   {
>>          struct bpf_local_storage_elem *selem;
>>
>> @@ -100,7 +100,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>>
>>          if (selem) {
>>                  if (value)
>> -                       copy_map_value(&smap->map, SDATA(selem)->data, value);
>> +                       copy_map_value_user(&smap->map, SDATA(selem)->data, value, from_user);
>>                  /* No need to call check_and_init_map_value as memory is zero init */
>>                  return selem;
>>          }
>> @@ -530,9 +530,11 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>          struct bpf_local_storage_elem *alloc_selem, *selem = NULL;
>>          struct bpf_local_storage *local_storage;
>>          unsigned long flags;
>> +       bool from_user = map_flags & BPF_FROM_USER;
>>          int err;
>>
>>          /* BPF_EXIST and BPF_NOEXIST cannot be both set */
>> +       map_flags &= ~BPF_FROM_USER;
>>          if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
>>              /* BPF_F_LOCK can only be used in a value with spin_lock */
>>              unlikely((map_flags & BPF_F_LOCK) &&
>> @@ -550,7 +552,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>                  if (err)
>>                          return ERR_PTR(err);
>>
>> -               selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
>> +               selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>>                  if (!selem)
>>                          return ERR_PTR(-ENOMEM);
>>
>> @@ -575,8 +577,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>                  if (err)
>>                          return ERR_PTR(err);
>>                  if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
>> -                       copy_map_value_locked(&smap->map, old_sdata->data,
>> -                                             value, false);
>> +                       copy_map_value_locked_user(&smap->map, old_sdata->data,
>> +                                                  value, false, from_user);
>>                          return old_sdata;
>>                  }
>>          }
>> @@ -584,7 +586,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>          /* A lookup has just been done before and concluded a new selem is
>>           * needed. The chance of an unnecessary alloc is unlikely.
>>           */
>> -       alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags);
>> +       alloc_selem = selem = bpf_selem_alloc(smap, owner, value, true, gfp_flags, from_user);
>>          if (!alloc_selem)
>>                  return ERR_PTR(-ENOMEM);
>>
>> @@ -607,8 +609,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>                  goto unlock;
>>
>>          if (old_sdata && (map_flags & BPF_F_LOCK)) {
>> -               copy_map_value_locked(&smap->map, old_sdata->data, value,
>> -                                     false);
>> +               copy_map_value_locked_user(&smap->map, old_sdata->data, value,
>> +                                          false, from_user);
>>                  selem = SELEM(old_sdata);
>>                  goto unlock;
>>          }
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index d02ae323996b..4aef86209fdd 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -372,8 +372,8 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
>>          .arg1_btf_id    = BPF_PTR_POISON,
>>   };
>>
>> -void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>> -                          bool lock_src)
>> +void copy_map_value_locked_user(struct bpf_map *map, void *dst, void *src,
>> +                               bool lock_src, bool from_user)
>>   {
>>          struct bpf_spin_lock *lock;
>>
>> @@ -383,11 +383,17 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>>                  lock = dst + map->record->spin_lock_off;
>>          preempt_disable();
>>          __bpf_spin_lock_irqsave(lock);
>> -       copy_map_value(map, dst, src);
>> +       copy_map_value_user(map, dst, src, from_user);
>>          __bpf_spin_unlock_irqrestore(lock);
>>          preempt_enable();
>>   }
>>
>> +void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>> +                          bool lock_src)
>> +{
>> +       copy_map_value_locked_user(map, dst, src, lock_src, false);
>> +}
>> +
>>   BPF_CALL_0(bpf_jiffies64)
>>   {
>>          return get_jiffies_64();
>> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
>> index 3969eb0382af..62a12fa8ce9e 100644
>> --- a/kernel/bpf/local_storage.c
>> +++ b/kernel/bpf/local_storage.c
>> @@ -147,7 +147,7 @@ static long cgroup_storage_update_elem(struct bpf_map *map, void *key,
>>          struct bpf_cgroup_storage *storage;
>>          struct bpf_storage_buffer *new;
>>
>> -       if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
>> +       if (unlikely(flags & ~BPF_F_LOCK))
>>                  return -EINVAL;
>>
>>          if (unlikely((flags & BPF_F_LOCK) &&
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 90a25307480e..eaa2a9d13265 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -155,8 +155,134 @@ static void maybe_wait_bpf_programs(struct bpf_map *map)
>>                  synchronize_rcu();
>>   }
>>
>> -static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>> -                               void *key, void *value, __u64 flags)
>> +static void *trans_addr_pages(struct page **pages, int npages)
>> +{
>> +       if (npages == 1)
>> +               return page_address(pages[0]);
>> +       /* For multiple pages, we need to use vmap() to get a contiguous
>> +        * virtual address range.
>> +        */
>> +       return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
>> +}
> 
> Don't quite see a need for trans_addr_pages() helper when it's used
> once.
> 
>> +
>> +#define KPTR_USER_MAX_PAGES 16
>> +
>> +static int bpf_obj_trans_pin_uaddr(struct btf_field *field, void **addr)
>> +{
>> +       const struct btf_type *t;
>> +       struct page *pages[KPTR_USER_MAX_PAGES];
>> +       void *ptr, *kern_addr;
>> +       u32 type_id, tsz;
>> +       int r, npages;
>> +
>> +       ptr = *addr;
>> +       type_id = field->kptr.btf_id;
>> +       t = btf_type_id_size(field->kptr.btf, &type_id, &tsz);
>> +       if (!t)
>> +               return -EINVAL;
>> +       if (tsz == 0) {
>> +               *addr = NULL;
>> +               return 0;
>> +       }
>> +
>> +       npages = (((intptr_t)ptr + tsz + ~PAGE_MASK) -
>> +                 ((intptr_t)ptr & PAGE_MASK)) >> PAGE_SHIFT;
>> +       if (npages > KPTR_USER_MAX_PAGES)
>> +               return -E2BIG;
>> +       r = pin_user_pages_fast((intptr_t)ptr & PAGE_MASK, npages, 0, pages);
> 
> No need to apply the mask on ptr. See pin_user_pages_fast() internals.
> 
> It probably should be FOLL_WRITE | FOLL_LONGTERM instead of 0.

Agree!

> 
>> +       if (r != npages)
>> +               return -EINVAL;
>> +       kern_addr = trans_addr_pages(pages, npages);
>> +       if (!kern_addr)
>> +               return -ENOMEM;
>> +       *addr = kern_addr + ((intptr_t)ptr & ~PAGE_MASK);
>> +       return 0;
>> +}
>> +
>> +void bpf_obj_unpin_uaddr(const struct btf_field *field, void *addr)
>> +{
>> +       struct page *pages[KPTR_USER_MAX_PAGES];
>> +       int npages, i;
>> +       u32 size, type_id;
>> +       void *ptr;
>> +
>> +       type_id = field->kptr.btf_id;
>> +       btf_type_id_size(field->kptr.btf, &type_id, &size);
>> +       if (size == 0)
>> +               return;
>> +
>> +       ptr = (void *)((intptr_t)addr & PAGE_MASK);
>> +       npages = (((intptr_t)addr + size + ~PAGE_MASK) - (intptr_t)ptr) >> PAGE_SHIFT;
>> +       for (i = 0; i < npages; i++) {
>> +               pages[i] = virt_to_page(ptr);
>> +               ptr += PAGE_SIZE;
>> +       }
>> +       if (npages > 1)
>> +               /* Paired with vmap() in trans_addr_pages() */
>> +               vunmap((void *)((intptr_t)addr & PAGE_MASK));
>> +       unpin_user_pages(pages, npages);
>> +}
>> +
>> +static int bpf_obj_trans_pin_uaddrs(struct btf_record *rec, void *src, u32 size)
>> +{
>> +       u32 next_off;
>> +       int i, err;
>> +
>> +       if (IS_ERR_OR_NULL(rec))
>> +               return 0;
>> +
>> +       if (!btf_record_has_field(rec, BPF_KPTR_USER))
>> +               return 0;
> 
> imo kptr_user doesn't quite fit as a name.
> 'kptr' means 'kernel pointer'. Here it's user addr.
> Maybe just "uptr" ?

That makes sense.

> 
>> +
>> +       for (i = 0; i < rec->cnt; i++) {
>> +               if (rec->fields[i].type != BPF_KPTR_USER)
>> +                       continue;
>> +
>> +               next_off = rec->fields[i].offset;
>> +               if (next_off + sizeof(void *) > size)
>> +                       return -EINVAL;
>> +               err = bpf_obj_trans_pin_uaddr(&rec->fields[i], src + next_off);
>> +               if (!err)
>> +                       continue;
>> +
>> +               /* Rollback */
>> +               for (i--; i >= 0; i--) {
>> +                       if (rec->fields[i].type != BPF_KPTR_USER)
>> +                               continue;
>> +                       next_off = rec->fields[i].offset;
>> +                       bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
>> +                       *(void **)(src + next_off) = NULL;
>> +               }
>> +
>> +               return err;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static void bpf_obj_unpin_uaddrs(struct btf_record *rec, void *src)
>> +{
>> +       u32 next_off;
>> +       int i;
>> +
>> +       if (IS_ERR_OR_NULL(rec))
>> +               return;
>> +
>> +       if (!btf_record_has_field(rec, BPF_KPTR_USER))
>> +               return;
>> +
>> +       for (i = 0; i < rec->cnt; i++) {
>> +               if (rec->fields[i].type != BPF_KPTR_USER)
>> +                       continue;
>> +
>> +               next_off = rec->fields[i].offset;
>> +               bpf_obj_unpin_uaddr(&rec->fields[i], *(void **)(src + next_off));
>> +               *(void **)(src + next_off) = NULL;
> 
> This part is pretty much the same as the undo part in
> bpf_obj_trans_pin_uaddrs() and a common helper is warranted.

Sure!

> 
>> +       }
>> +}
>> +
>> +static int bpf_map_update_value_inner(struct bpf_map *map, struct file *map_file,
>> +                                     void *key, void *value, __u64 flags)
>>   {
>>          int err;
>>
>> @@ -208,6 +334,29 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>>          return err;
>>   }
>>
>> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>> +                               void *key, void *value, __u64 flags)
>> +{
>> +       int err;
>> +
>> +       if (flags & BPF_FROM_USER) {
> 
> there shouldn't be a need for this extra flag.
> map->record has the info whether uptr is present or not.

The BPF_FROM_USER flag is used to support updating map values from BPF
programs as well. Although BPF programs can update map values, I
don't want the values of uptrs to be changed by BPF programs.

Should we just forbid BPF programs from updating map values that have
uptrs in them?


> 
>> +               /* Pinning user memory can lead to a context switch, so we
>> +                * need to do it before taking a potential RCU lock.
>> +                */
>> +               err = bpf_obj_trans_pin_uaddrs(map->record, value,
>> +                                              bpf_map_value_size(map));
>> +               if (err)
>> +                       return err;
>> +       }
>> +
>> +       err = bpf_map_update_value_inner(map, map_file, key, value, flags);
>> +
>> +       if (err && (flags & BPF_FROM_USER))
>> +               bpf_obj_unpin_uaddrs(map->record, value);
>> +
>> +       return err;
>> +}
>> +
>>   static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
>>                                __u64 flags)
>>   {
>> @@ -714,6 +863,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>                                  field->kptr.dtor(xchgd_field);
>>                          }
>>                          break;
>> +               case BPF_KPTR_USER:
>> +                       if (virt_addr_valid(*(void **)field_ptr))
>> +                               bpf_obj_unpin_uaddr(field, *(void **)field_ptr);
>> +                       *(void **)field_ptr = NULL;
>> +                       break;
>>                  case BPF_LIST_HEAD:
>>                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>                                  continue;
>> @@ -1155,6 +1309,12 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>>                                          goto free_map_tab;
>>                                  }
>>                                  break;
>> +                       case BPF_KPTR_USER:
>> +                               if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
>> +                                       ret = -EOPNOTSUPP;
>> +                                       goto free_map_tab;
>> +                               }
>> +                               break;
>>                          case BPF_LIST_HEAD:
>>                          case BPF_RB_ROOT:
>>                                  if (map->map_type != BPF_MAP_TYPE_HASH &&
>> @@ -1618,11 +1778,15 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>>          struct bpf_map *map;
>>          void *key, *value;
>>          u32 value_size;
>> +       u64 extra_flags = 0;
>>          struct fd f;
>>          int err;
>>
>>          if (CHECK_ATTR(BPF_MAP_UPDATE_ELEM))
>>                  return -EINVAL;
>> +       /* Prevent userspace from setting any internal flags */
>> +       if (attr->flags & ~(BIT(BPF_MAP_UPDATE_FLAG_BITS) - 1))
>> +               return -EINVAL;
>>
>>          f = fdget(ufd);
>>          map = __bpf_map_get(f);
>> @@ -1653,7 +1817,9 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>>                  goto free_key;
>>          }
>>
>> -       err = bpf_map_update_value(map, f.file, key, value, attr->flags);
>> +       if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
>> +               extra_flags |= BPF_FROM_USER;
>> +       err = bpf_map_update_value(map, f.file, key, value, attr->flags | extra_flags);
>>          if (!err)
>>                  maybe_wait_bpf_programs(map);
>>
>> @@ -1852,6 +2018,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>>          void __user *keys = u64_to_user_ptr(attr->batch.keys);
>>          u32 value_size, cp, max_count;
>>          void *key, *value;
>> +       u64 extra_flags = 0;
>>          int err = 0;
>>
>>          if (attr->batch.elem_flags & ~BPF_F_LOCK)
>> @@ -1881,6 +2048,8 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>>                  return -ENOMEM;
>>          }
>>
>> +       if (map->map_type == BPF_MAP_TYPE_TASK_STORAGE)
>> +               extra_flags |= BPF_FROM_USER;
>>          for (cp = 0; cp < max_count; cp++) {
>>                  err = -EFAULT;
>>                  if (copy_from_user(key, keys + cp * map->key_size,
>> @@ -1889,7 +2058,7 @@ int generic_map_update_batch(struct bpf_map *map, struct file *map_file,
>>                          break;
>>
>>                  err = bpf_map_update_value(map, map_file, key, value,
>> -                                          attr->batch.elem_flags);
>> +                                          attr->batch.elem_flags | extra_flags);
>>
>>                  if (err)
>>                          break;
>> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
>> index bc01b3aa6b0f..db5281384e6a 100644
>> --- a/net/core/bpf_sk_storage.c
>> +++ b/net/core/bpf_sk_storage.c
>> @@ -137,7 +137,7 @@ bpf_sk_storage_clone_elem(struct sock *newsk,
>>   {
>>          struct bpf_local_storage_elem *copy_selem;
>>
>> -       copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC);
>> +       copy_selem = bpf_selem_alloc(smap, newsk, NULL, true, GFP_ATOMIC, false);
>>          if (!copy_selem)
>>                  return NULL;
>>
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map.
  2024-08-12 17:15     ` Kui-Feng Lee
@ 2024-08-12 17:31       ` Alexei Starovoitov
  2024-08-12 18:10         ` Kui-Feng Lee
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2024-08-12 17:31 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee

On Mon, Aug 12, 2024 at 10:15 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 8/12/24 09:58, Alexei Starovoitov wrote:
> > On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
> >> +
> >> +       user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
> >> +                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >> +       if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
> >> +               return;
> >> +
> >> +       memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
> >> +       value.udata_mmap = user_data_mmap;
> >> +       value.udata = &user_data;
> >
> > There shouldn't be a need to do mmap(). It's too much memory overhead.
> > The user should be able to write:
> > static __thread struct user_data udata;
> > value.udata = &udata;
> > bpf_map_update_elem(map_fd, my_task_fd, &value)
> > and do it once.
> > Later multi thread user code will just access "udata".
> > No map lookups.
>
> mmap() is not necessary here. There are two pointers here.
> The udata_mmap one is used to test the case of crossing a page
> boundary, although the current RFC fails to do that. It will be fixed
> later. The udata one works just like what you have described, except
> that user_data is a local variable.

Hmm. I guess I misread the code.
But then:
+       struct user_data user_data = ...;
+       value.udata = &user_data;

how is that supposed to work when the address points to the stack?
I guess the kernel can still pin that page, but it will be junk
as soon as the function returns.

> >
> > If sizeof(udata) is small enough the kernel will pin either
> > one or two pages (if udata crosses page boundary).
> >
> > So no extra memory consumption by the user process while the kernel
> > pins a page or two.
> > In a good case it's one page and no extra vmap.
> > I wonder whether we should enforce that one page case.
> > It's not hard for users to write:
> > static __thread struct user_data udata __attribute__((aligned(sizeof(udata))));
>
> With a one-page restriction, the implementation would be much simpler. If
> you think it is a reasonable restriction, I would enforce this rule.

I'm worried about vmap(). Doing it for every map element (same as every
task) might add substantial kernel-side overhead.
On my devserver:
sudo cat /proc/vmallocinfo |grep vmap|wc -l
105
sudo cat /proc/vmallocinfo |wc -l
17608

I believe that the mechanism scales to millions, but adding one more
vmap per element feels like a footgun.
To avoid that the user would need to make sure their user_data doesn't
cross the page, so imo we can make this mandatory.
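
Enforcing that could be a single check at pin time, e.g. (a sketch; the
errno choice is arbitrary):

    /* reject user memory that crosses a page boundary */
    if (((unsigned long)ptr & ~PAGE_MASK) + tsz > PAGE_SIZE)
            return -EOPNOTSUPP;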

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-12 17:24     ` Kui-Feng Lee
@ 2024-08-12 17:36       ` Alexei Starovoitov
  2024-08-12 17:51         ` Kui-Feng Lee
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2024-08-12 17:36 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee, linux-mm

On Mon, Aug 12, 2024 at 10:24 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
> >> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> >> +                               void *key, void *value, __u64 flags)
> >> +{
> >> +       int err;
> >> +
> >> +       if (flags & BPF_FROM_USER) {
> >
> > there shouldn't be a need for this extra flag.
> > map->record has the info whether uptr is present or not.
>
> The BPF_FROM_USER flag is used to support updating map values from BPF
> programs as well. Although BPF programs can update map values, I
> don't want the values of uptrs to be changed by BPF programs.
>
> Should we just forbid BPF programs from updating map values that have
> uptrs in them?

hmm. map_update_elem() is disallowed from bpf prog.

        case BPF_MAP_TYPE_TASK_STORAGE:
                if (func_id != BPF_FUNC_task_storage_get &&
                    func_id != BPF_FUNC_task_storage_delete &&
                    func_id != BPF_FUNC_kptr_xchg)
                        goto error;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls.
  2024-08-12 17:36       ` Alexei Starovoitov
@ 2024-08-12 17:51         ` Kui-Feng Lee
  0 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-12 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee, linux-mm



On 8/12/24 10:36, Alexei Starovoitov wrote:
> On Mon, Aug 12, 2024 at 10:24 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>>> +static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>>>> +                               void *key, void *value, __u64 flags)
>>>> +{
>>>> +       int err;
>>>> +
>>>> +       if (flags & BPF_FROM_USER) {
>>>
>>> there shouldn't be a need for this extra flag.
>>> map->record has the info whether uptr is present or not.
>>
>> The BPF_FROM_USER flag is used to support updating map values from BPF
>> programs as well. Although BPF programs can update map values, I
>> don't want the values of uptrs to be changed by BPF programs.
>>
>> Should we just forbid BPF programs from updating map values that have
>> uptrs in them?
> 
> hmm. map_update_elem() is disallowed from bpf prog.
> 
>          case BPF_MAP_TYPE_TASK_STORAGE:
>                  if (func_id != BPF_FUNC_task_storage_get &&
>                      func_id != BPF_FUNC_task_storage_delete &&
>                      func_id != BPF_FUNC_kptr_xchg)
>                          goto error;

Thank you for the information!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map.
  2024-08-12 17:31       ` Alexei Starovoitov
@ 2024-08-12 18:10         ` Kui-Feng Lee
  0 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-12 18:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee



On 8/12/24 10:31, Alexei Starovoitov wrote:
> On Mon, Aug 12, 2024 at 10:15 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 8/12/24 09:58, Alexei Starovoitov wrote:
>>> On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>>> +
>>>> +       user_data_mmap = mmap(NULL, sizeof(*user_data_mmap), PROT_READ | PROT_WRITE,
>>>> +                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>> +       if (!ASSERT_NEQ(user_data_mmap, MAP_FAILED, "mmap"))
>>>> +               return;
>>>> +
>>>> +       memcpy(user_data_mmap, &user_data_mmap_v, sizeof(*user_data_mmap));
>>>> +       value.udata_mmap = user_data_mmap;
>>>> +       value.udata = &user_data;
>>>
>>> There shouldn't be a need to do mmap(). It's too much memory overhead.
>>> The user should be able to write:
>>> static __thread struct user_data udata;
>>> value.udata = &udata;
>>> bpf_map_update_elem(map_fd, my_task_fd, &value)
>>> and do it once.
>>> Later multi thread user code will just access "udata".
>>> No map lookups.
>>
>> mmap() is not necessary here. There are two pointers here.
>> The udata_mmap one is used to test the case of crossing a page
>> boundary, although the current RFC fails to do that. It will be fixed
>> later. The udata one works just like what you have described, except
>> that user_data is a local variable.
> 
> Hmm. I guess I misread the code.
> But then:
> +       struct user_data user_data = ...;
> +       value.udata = &user_data;
> 
> how is that supposed to work when the address points to the stack?
> I guess the kernel can still pin that page, but it will be junk
> as soon as the function returns.

You are right! It works only for this test case since the map will be
destroyed before leaving this function. I will move it to a static variable.

> 
>>>
>>> If sizeof(udata) is small enough the kernel will pin either
>>> one or two pages (if udata crosses page boundary).
>>>
>>> So no extra memory consumption by the user process while the kernel
>>> pins a page or two.
>>> In a good case it's one page and no extra vmap.
>>> I wonder whether we should enforce that one page case.
>>> It's not hard for users to write:
>>> static __thread struct user_data udata __attribute__((aligned(sizeof(udata))));
>>
>> With a one-page restriction, the implementation would be much simpler. If
>> you think it is a reasonable restriction, I would enforce this rule.
> 
> I'm worried about vmap(). Doing it for every map element (same as every
> task) might add substantial kernel-side overhead.
> On my devserver:
> sudo cat /proc/vmallocinfo |grep vmap|wc -l
> 105
> sudo cat /proc/vmallocinfo |wc -l
> 17608
> 
> I believe that the mechanism scales to millions, but adding one more
> vmap per element feels like a footgun.
> To avoid that the user would need to make sure their user_data doesn't
> cross the page, so imo we can make this mandatory.

If the memory block pointed to by a uptr occupies only one page,
vmap() is not called; vmap() is called only for blocks that span two
or more pages. Without the one-page restriction, there is a chance of
additional vmaps even for small memory blocks, but not every uptr
incurs that extra vmap.

Users can accidentally create a new vmap. But, with the current
implementation, they can also avoid it by aligning memory properly. The
trade-off is between supporting big chunks of memory and being
idiot-proof. However, in my opinion, big chunks are very unlikely for
task local storage.

So, I will make this restriction mandatory.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier.
  2024-08-12 16:48   ` Alexei Starovoitov
@ 2024-08-13 16:52     ` Kui-Feng Lee
  2024-08-13 19:35       ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-13 16:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Kui-Feng Lee
  Cc: bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu, Kernel Team,
	Andrii Nakryiko, Kui-Feng Lee



On 8/12/24 09:48, Alexei Starovoitov wrote:
> On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>
>> Give PTR_MAYBE_NULL | PTR_UNTRUSTED | MEM_ALLOC | NON_OWN_REF to kptr_user
>> to make the memory pointed to by it readable and writable.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   kernel/bpf/verifier.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index df3be12096cf..84647e599595 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -5340,6 +5340,10 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
>>          int perm_flags;
>>          const char *reg_name = "";
>>
>> +       if (kptr_field->type == BPF_KPTR_USER)
>> +               /* BPF programs should not change any user kptr */
>> +               return -EACCES;
>> +
>>          if (btf_is_kernel(reg->btf)) {
>>                  perm_flags = PTR_MAYBE_NULL | PTR_TRUSTED | MEM_RCU;
>>
>> @@ -5483,6 +5487,12 @@ static u32 btf_ld_kptr_type(struct bpf_verifier_env *env, struct btf_field *kptr
>>                          ret |= NON_OWN_REF;
>>          } else {
>>                  ret |= PTR_UNTRUSTED;
>> +               if (kptr_field->type == BPF_KPTR_USER)
>> +                       /* In order to allow direct access from bpf
>> +                        * programs, NON_OWN_REF makes the memory
>> +                        * writable. See check_ptr_to_btf_access().
>> +                        */
>> +                       ret |= MEM_ALLOC | NON_OWN_REF;
> 
> UNTRUSTED | MEM_ALLOC | NON_OWN_REF ?!
> 
> That doesn't fit into any of the existing verifier schemes.
> I cannot make sense of this part.
> 
> UNTRUSTED | MEM_ALLOC is read only through exceptions logic.
> The uptr has to be read/write through normal load/store.

I will remove UNTRUSTED and leave MEM_ALLOC and NON_OWN_REF.
Does it make sense to you?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier.
  2024-08-13 16:52     ` Kui-Feng Lee
@ 2024-08-13 19:35       ` Alexei Starovoitov
  2024-08-13 23:13         ` Kui-Feng Lee
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2024-08-13 19:35 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee

On Tue, Aug 13, 2024 at 9:52 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 8/12/24 09:48, Alexei Starovoitov wrote:
> > On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
> >>
> >> Give PTR_MAYBE_NULL | PTR_UNTRUSTED | MEM_ALLOC | NON_OWN_REF to kptr_user
> >> to make the memory pointed to by it readable and writable.
> >>
> >> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> >> ---
> >>   kernel/bpf/verifier.c | 11 +++++++++++
> >>   1 file changed, 11 insertions(+)
> >>
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index df3be12096cf..84647e599595 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -5340,6 +5340,10 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
> >>          int perm_flags;
> >>          const char *reg_name = "";
> >>
> >> +       if (kptr_field->type == BPF_KPTR_USER)
> >> +               /* BPF programs should not change any user kptr */
> >> +               return -EACCES;
> >> +
> >>          if (btf_is_kernel(reg->btf)) {
> >>                  perm_flags = PTR_MAYBE_NULL | PTR_TRUSTED | MEM_RCU;
> >>
> >> @@ -5483,6 +5487,12 @@ static u32 btf_ld_kptr_type(struct bpf_verifier_env *env, struct btf_field *kptr
> >>                          ret |= NON_OWN_REF;
> >>          } else {
> >>                  ret |= PTR_UNTRUSTED;
> >> +               if (kptr_field->type == BPF_KPTR_USER)
> >> +                       /* In order to allow direct access from bpf
> >> +                        * programs, NON_OWN_REF makes the memory
> >> +                        * writable. See check_ptr_to_btf_access().
> >> +                        */
> >> +                       ret |= MEM_ALLOC | NON_OWN_REF;
> >
> > UNTRUSTED | MEM_ALLOC | NON_OWN_REF ?!
> >
> > That doesn't fit into any of the existing verifier schemes.
> > I cannot make sense of this part.
> >
> > UNTRUSTED | MEM_ALLOC is read only through exceptions logic.
> > The uptr has to be read/write through normal load/store.
>
> I will remove UNTRUSTED and leave MEM_ALLOC and NON_OWN_REF.
> Does it make sense to you?

I don't think it fits either.
MEM_ALLOC | NON_OWN_REF is specific to bpf_rbtree/linklist nodes.
There are various checks and logic like:
1.
      if (!(type_is_ptr_alloc_obj(reg->type) ||
            type_is_non_owning_ref(reg->type)) &&
          WARN_ON_ONCE(reg->off))
              return;
2.
invalidate_non_owning_refs() during unlock

that shouldn't apply in this case.

PTR_TO_MEM with specific mem_size fits better.
Since it's user/kernel shared memory, PTR_TO_BTF_ID logic with field
walking won't work anyway; an opaque array of bytes is better, which
is PTR_TO_MEM.
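
Roughly, the register state for a loaded uptr would then be (a sketch of
the intent, not tied to a specific verifier helper):

    regs[value_regno].type = PTR_TO_MEM | PTR_MAYBE_NULL;
    /* mem_size comes from the size of the uptr's BTF type */
    regs[value_regno].mem_size = tsz;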

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier.
  2024-08-13 19:35       ` Alexei Starovoitov
@ 2024-08-13 23:13         ` Kui-Feng Lee
  0 siblings, 0 replies; 21+ messages in thread
From: Kui-Feng Lee @ 2024-08-13 23:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee



On 8/13/24 12:35, Alexei Starovoitov wrote:
> On Tue, Aug 13, 2024 at 9:52 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 8/12/24 09:48, Alexei Starovoitov wrote:
>>> On Wed, Aug 7, 2024 at 4:58 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>>>
>>>> Give PTR_MAYBE_NULL | PTR_UNTRUSTED | MEM_ALLOC | NON_OWN_REF to kptr_user
>>>> to make the memory pointed to by it readable and writable.
>>>>
>>>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>>>> ---
>>>>    kernel/bpf/verifier.c | 11 +++++++++++
>>>>    1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index df3be12096cf..84647e599595 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -5340,6 +5340,10 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
>>>>           int perm_flags;
>>>>           const char *reg_name = "";
>>>>
>>>> +       if (kptr_field->type == BPF_KPTR_USER)
>>>> +               /* BPF programs should not change any user kptr */
>>>> +               return -EACCES;
>>>> +
>>>>           if (btf_is_kernel(reg->btf)) {
>>>>                   perm_flags = PTR_MAYBE_NULL | PTR_TRUSTED | MEM_RCU;
>>>>
>>>> @@ -5483,6 +5487,12 @@ static u32 btf_ld_kptr_type(struct bpf_verifier_env *env, struct btf_field *kptr
>>>>                           ret |= NON_OWN_REF;
>>>>           } else {
>>>>                   ret |= PTR_UNTRUSTED;
>>>> +               if (kptr_field->type == BPF_KPTR_USER)
>>>> +                       /* In order to allow direct access from bpf
>>>> +                        * programs, NON_OWN_REF makes the memory
>>>> +                        * writable. See check_ptr_to_btf_access().
>>>> +                        */
>>>> +                       ret |= MEM_ALLOC | NON_OWN_REF;
>>>
>>> UNTRUSTED | MEM_ALLOC | NON_OWN_REF ?!
>>>
>>> That doesn't fit into any of the existing verifier schemes.
>>> I cannot make sense of this part.
>>>
>>> UNTRUSTED | MEM_ALLOC is read only through exceptions logic.
>>> The uptr has to be read/write through normal load/store.
>>
>> I will remove UNTRUSTED and leave MEM_ALLOC and NON_OWN_REF.
>> Does it make sense to you?
> 
> I don't think it fits either.
> MEM_ALLOC | NON_OWN_REF is specific to bpf_rbtree/linklist nodes.
> There are various checks and logic like:
>        if (!(type_is_ptr_alloc_obj(reg->type) ||
>              type_is_non_owning_ref(reg->type)) &&
>            WARN_ON_ONCE(reg->off))
>            return;
>            return;
> 2.
> invalidate_non_owning_refs() during unlock
> 
> that shouldn't apply in this case.
> 
> PTR_TO_MEM with specific mem_size fits better.
> Since it's user/kernel shared memory, PTR_TO_BTF_ID logic with field
> walking won't work anyway; an opaque array of bytes is better, which
> is PTR_TO_MEM.

Makes sense!



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-08-13 23:13 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-07 23:57 [RFC bpf-next 0/5] Share user memory to BPF program through task storage map Kui-Feng Lee
2024-08-07 23:57 ` [RFC bpf-next 1/5] bpf: Parse and support "kptr_user" tag Kui-Feng Lee
2024-08-07 23:57 ` [RFC bpf-next 2/5] bpf: Handle BPF_KPTR_USER in verifier Kui-Feng Lee
2024-08-12 16:48   ` Alexei Starovoitov
2024-08-13 16:52     ` Kui-Feng Lee
2024-08-13 19:35       ` Alexei Starovoitov
2024-08-13 23:13         ` Kui-Feng Lee
2024-08-07 23:57 ` [RFC bpf-next 3/5] bpf: pin, translate, and unpin __kptr_user from syscalls Kui-Feng Lee
2024-08-08  0:05   ` Kui-Feng Lee
2024-08-08  0:39   ` Kui-Feng Lee
2024-08-12 16:00   ` Kui-Feng Lee
2024-08-12 16:45   ` Alexei Starovoitov
2024-08-12 17:24     ` Kui-Feng Lee
2024-08-12 17:36       ` Alexei Starovoitov
2024-08-12 17:51         ` Kui-Feng Lee
2024-08-07 23:57 ` [RFC bpf-next 4/5] libbpf: define __kptr_user Kui-Feng Lee
2024-08-07 23:57 ` [RFC bpf-next 5/5] selftests/bpf: test __kptr_user on the value of a task storage map Kui-Feng Lee
2024-08-12 16:58   ` Alexei Starovoitov
2024-08-12 17:15     ` Kui-Feng Lee
2024-08-12 17:31       ` Alexei Starovoitov
2024-08-12 18:10         ` Kui-Feng Lee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox