* [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage
@ 2023-11-20 17:59 Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
From: Dave Marchevsky @ 2023-11-20 17:59 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner, Dave Marchevsky
This series adds support for mmap()ing single task_local storage mapvals
into userspace. Two motivating use cases:
* sched_ext ([0]) schedulers might want to act on 'scheduling hints'
provided by userspace tasks. For example, a task can tag itself as
latency-sensitive but not particularly computationally intensive, and
the BPF scheduler can use this information to make better scheduling
decisions. Similarly, a database task about to start a
transaction can tag itself as doing so without high overhead by
writing to the mmap'd mapval. In both cases the information is
task-specific, and in the latter case it'd be preferable to avoid
incurring syscall overhead since the hint would change often.
* The strobemeta ([1]) technique for reading thread_local storage is
used by tracing programs at Meta to annotate tracing data with
task-specific metadata. For example, a multithreaded webserver with
a pool of worker threads preparing responses and other threads
handling request connections might want to tag threads by type, and
further tag worker threads with feature flags enabled during request
processing.
* The strobemeta technique predates the existence of the task_local
storage map type, instead relying on reverse-engineering thread_local
storage implementation specifics. The approach enabled here
avoids much of that complexity.
The general thrust of this series' implementation is "simplest thing
that works". A userspace thread can mmap() a task_local storage map fd
and receive the map_value corresponding to its task. In the future we
can support mmap()ing in other threads' map_values via offset parameter
or some other approach. Similarly, this series makes no attempt to pack
multiple map_values into a userspace-mappable page - each map_value for
a BPF_F_MMAPABLE task_local storage map is given its own page. For the
motivating use cases above, neither of those potential improvements is
necessary. Patch 1's summary digs deeper into implementation details.
This series' changes to generic local_storage implementation shared by
cgroup_local storage and others will make extending this support to
those local storage types straightforward in the future.
Summary of patches:
* Patch 1 adds support for mmapable map_vals in generic
bpf_local_storage infrastructure and uses the new feature in
task_local storage
* Patch 2 adds tests
[0]: https://lore.kernel.org/bpf/20231111024835.2164816-1-tj@kernel.org/
[1]: tools/testing/selftests/bpf/progs/strobemeta*
Dave Marchevsky (2):
bpf: Support BPF_F_MMAPABLE task_local storage
selftests/bpf: Add test exercising mmapable task_local_storage
include/linux/bpf_local_storage.h | 14 +-
kernel/bpf/bpf_local_storage.c | 145 +++++++++++---
kernel/bpf/bpf_task_storage.c | 35 +++-
kernel/bpf/syscall.c | 2 +-
.../bpf/prog_tests/task_local_storage.c | 177 ++++++++++++++++++
.../bpf/progs/task_local_storage__mmap.c | 59 ++++++
.../bpf/progs/task_local_storage__mmap.h | 7 +
.../bpf/progs/task_local_storage__mmap_fail.c | 39 ++++
8 files changed, 445 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
--
2.34.1
* [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
2023-11-20 17:59 [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage Dave Marchevsky
@ 2023-11-20 17:59 ` Dave Marchevsky
2023-11-20 21:41 ` Johannes Weiner
` (8 more replies)
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
From: Dave Marchevsky @ 2023-11-20 17:59 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner, Dave Marchevsky
This patch modifies the generic bpf_local_storage infrastructure to
support mmapable map values and adds mmap() handling to task_local
storage leveraging this new functionality. A userspace task which
mmap's a task_local storage map will receive a pointer to the map_value
corresponding to that task's key - mmap'ing other tasks' mapvals is
not supported in this patch.
Currently, struct bpf_local_storage_elem contains both bookkeeping
information and a struct bpf_local_storage_data with additional
bookkeeping information and the actual mapval data. We can't simply map
the page containing this struct into userspace. Instead, mmapable
local_storage uses bpf_local_storage_data's data field to point to the
actual mapval, which is allocated separately such that it can be
mmapped. Only the mapval lives on the page(s) allocated for it.
The lifetime of the actual_data mmapable region is tied to the
bpf_local_storage_elem which points to it. This doesn't necessarily mean
that the pages go away when the bpf_local_storage_elem is free'd - if
they're mapped into some userspace process they will remain until
unmapped, but they are no longer the task_local storage's mapval.
Implementation details:
* A few small helpers are added to deal with bpf_local_storage_data's
'data' field having different semantics when the local_storage map
is mmapable. With their help, many of the changes to existing code
are purely mechanical (e.g. sdata->data becomes sdata_mapval(sdata),
selem->elem_size becomes selem_bytes_used(selem)).
* The map flags are copied into bpf_local_storage_data when its
containing bpf_local_storage_elem is alloc'd, since the
bpf_local_storage_map associated with them may be gone when
bpf_local_storage_data is free'd, and testing flags for
BPF_F_MMAPABLE is necessary when free'ing to ensure that the
mmapable region is free'd.
* The extra field doesn't change bpf_local_storage_elem's size.
There were 48 bytes of padding after the bpf_local_storage_data
field, now there are 40.
* Currently, bpf_local_storage_update always creates a new
bpf_local_storage_elem for the 'updated' value - the only exception
being if the map_value has a bpf_spin_lock field, in which case the
spin lock is grabbed instead of the less granular bpf_local_storage
lock, and the value updated in place. This inplace update behavior
is desired for mmapable local_storage map_values as well, since
creating a new selem would result in new mmapable pages.
* The size of the mmapable pages is accounted for when calling
mem_{charge,uncharge}. If the pages are mmap'd into a userspace task,
mem_uncharge may be called before they actually go away.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
include/linux/bpf_local_storage.h | 14 ++-
kernel/bpf/bpf_local_storage.c | 145 ++++++++++++++++++++++++------
kernel/bpf/bpf_task_storage.c | 35 ++++++--
kernel/bpf/syscall.c | 2 +-
4 files changed, 163 insertions(+), 33 deletions(-)
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 173ec7f43ed1..114973f925ea 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -69,7 +69,17 @@ struct bpf_local_storage_data {
* the number of cachelines accessed during the cache hit case.
*/
struct bpf_local_storage_map __rcu *smap;
- u8 data[] __aligned(8);
+ /* Need to duplicate smap's map_flags as smap may be gone when
+ * it's time to free bpf_local_storage_data
+ */
+ u64 smap_map_flags;
+ /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
+ * Otherwise the actual mapval data lives here
+ */
+ union {
+ DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
+ void *actual_data __aligned(8);
+ };
};
/* Linked to bpf_local_storage and bpf_local_storage_map */
@@ -124,6 +134,8 @@ static struct bpf_local_storage_cache name = { \
/* Helper functions for bpf_local_storage */
int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
+void *sdata_mapval(struct bpf_local_storage_data *data);
+
struct bpf_map *
bpf_local_storage_map_alloc(union bpf_attr *attr,
struct bpf_local_storage_cache *cache,
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 146824cc9689..9b3becbcc1a3 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -15,7 +15,8 @@
#include <linux/rcupdate_trace.h>
#include <linux/rcupdate_wait.h>
-#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
+#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
+ (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
static struct bpf_local_storage_map_bucket *
select_bucket(struct bpf_local_storage_map *smap,
@@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
}
+struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
+
+void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
+{
+ struct mem_cgroup *memcg, *old_memcg;
+ void *ptr;
+
+ memcg = bpf_map_get_memcg(&smap->map);
+ old_memcg = set_active_memcg(memcg);
+ ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
+ NUMA_NO_NODE);
+ set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
+
+ return ptr;
+}
+
+void *sdata_mapval(struct bpf_local_storage_data *data)
+{
+ if (data->smap_map_flags & BPF_F_MMAPABLE)
+ return data->actual_data;
+ return &data->data;
+}
+
+static size_t sdata_data_field_size(struct bpf_local_storage_map *smap,
+ struct bpf_local_storage_data *data)
+{
+ if (smap->map.map_flags & BPF_F_MMAPABLE)
+ return sizeof(void *);
+ return (size_t)smap->map.value_size;
+}
+
+static u32 selem_bytes_used(struct bpf_local_storage_map *smap)
+{
+ if (smap->map.map_flags & BPF_F_MMAPABLE)
+ return smap->elem_size + PAGE_ALIGN(smap->map.value_size);
+ return smap->elem_size;
+}
+
+static bool can_update_existing_selem(struct bpf_local_storage_map *smap,
+ u64 flags)
+{
+ return flags & BPF_F_LOCK || smap->map.map_flags & BPF_F_MMAPABLE;
+}
+
static int mem_charge(struct bpf_local_storage_map *smap, void *owner, u32 size)
{
struct bpf_map *map = &smap->map;
@@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
void *value, bool charge_mem, gfp_t gfp_flags)
{
struct bpf_local_storage_elem *selem;
+ void *mmapable_value = NULL;
+ u32 selem_mem;
- if (charge_mem && mem_charge(smap, owner, smap->elem_size))
+ selem_mem = selem_bytes_used(smap);
+ if (charge_mem && mem_charge(smap, owner, selem_mem))
return NULL;
+ if (smap->map.map_flags & BPF_F_MMAPABLE) {
+ mmapable_value = alloc_mmapable_selem_value(smap);
+ if (!mmapable_value)
+ goto err_out;
+ }
+
if (smap->bpf_ma) {
migrate_disable();
selem = bpf_mem_cache_alloc_flags(&smap->selem_ma, gfp_flags);
@@ -92,22 +147,28 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
* only does bpf_mem_cache_free when there is
* no other bpf prog is using the selem.
*/
- memset(SDATA(selem)->data, 0, smap->map.value_size);
+ memset(SDATA(selem)->data, 0,
+ sdata_data_field_size(smap, SDATA(selem)));
} else {
selem = bpf_map_kzalloc(&smap->map, smap->elem_size,
gfp_flags | __GFP_NOWARN);
}
- if (selem) {
- if (value)
- copy_map_value(&smap->map, SDATA(selem)->data, value);
- /* No need to call check_and_init_map_value as memory is zero init */
- return selem;
- }
-
+ if (!selem)
+ goto err_out;
+
+ selem->sdata.smap_map_flags = smap->map.map_flags;
+ if (smap->map.map_flags & BPF_F_MMAPABLE)
+ selem->sdata.actual_data = mmapable_value;
+ if (value)
+ copy_map_value(&smap->map, sdata_mapval(SDATA(selem)), value);
+ /* No need to call check_and_init_map_value as memory is zero init */
+ return selem;
+err_out:
+ if (mmapable_value)
+ bpf_map_area_free(mmapable_value);
if (charge_mem)
- mem_uncharge(smap, owner, smap->elem_size);
-
+ mem_uncharge(smap, owner, selem_mem);
return NULL;
}
@@ -184,6 +245,21 @@ static void bpf_local_storage_free(struct bpf_local_storage *local_storage,
}
}
+static void __bpf_selem_kfree(struct bpf_local_storage_elem *selem)
+{
+ if (selem->sdata.smap_map_flags & BPF_F_MMAPABLE)
+ bpf_map_area_free(selem->sdata.actual_data);
+ kfree(selem);
+}
+
+static void __bpf_selem_kfree_rcu(struct rcu_head *rcu)
+{
+ struct bpf_local_storage_elem *selem;
+
+ selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
+ __bpf_selem_kfree(selem);
+}
+
/* rcu tasks trace callback for bpf_ma == false */
static void __bpf_selem_free_trace_rcu(struct rcu_head *rcu)
{
@@ -191,9 +267,9 @@ static void __bpf_selem_free_trace_rcu(struct rcu_head *rcu)
selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
if (rcu_trace_implies_rcu_gp())
- kfree(selem);
+ __bpf_selem_kfree(selem);
else
- kfree_rcu(selem, rcu);
+ call_rcu(rcu, __bpf_selem_kfree_rcu);
}
/* Handle bpf_ma == false */
@@ -201,7 +277,7 @@ static void __bpf_selem_free(struct bpf_local_storage_elem *selem,
bool vanilla_rcu)
{
if (vanilla_rcu)
- kfree_rcu(selem, rcu);
+ call_rcu(&selem->rcu, __bpf_selem_kfree_rcu);
else
call_rcu_tasks_trace(&selem->rcu, __bpf_selem_free_trace_rcu);
}
@@ -209,8 +285,12 @@ static void __bpf_selem_free(struct bpf_local_storage_elem *selem,
static void bpf_selem_free_rcu(struct rcu_head *rcu)
{
struct bpf_local_storage_elem *selem;
+ struct bpf_local_storage_map *smap;
selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
+ smap = selem->sdata.smap;
+ if (selem->sdata.smap_map_flags & BPF_F_MMAPABLE)
+ bpf_map_area_free(selem->sdata.actual_data);
bpf_mem_cache_raw_free(selem);
}
@@ -241,6 +321,8 @@ void bpf_selem_free(struct bpf_local_storage_elem *selem,
* immediately.
*/
migrate_disable();
+ if (smap->map.map_flags & BPF_F_MMAPABLE)
+ bpf_map_area_free(selem->sdata.actual_data);
bpf_mem_cache_free(&smap->selem_ma, selem);
migrate_enable();
}
@@ -266,7 +348,7 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
* from local_storage.
*/
if (uncharge_mem)
- mem_uncharge(smap, owner, smap->elem_size);
+ mem_uncharge(smap, owner, selem_bytes_used(smap));
free_local_storage = hlist_is_singular_node(&selem->snode,
&local_storage->list);
@@ -583,14 +665,14 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
err = bpf_local_storage_alloc(owner, smap, selem, gfp_flags);
if (err) {
bpf_selem_free(selem, smap, true);
- mem_uncharge(smap, owner, smap->elem_size);
+ mem_uncharge(smap, owner, selem_bytes_used(smap));
return ERR_PTR(err);
}
return SDATA(selem);
}
- if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) {
+ if (can_update_existing_selem(smap, map_flags) && !(map_flags & BPF_NOEXIST)) {
/* Hoping to find an old_sdata to do inline update
* such that it can avoid taking the local_storage->lock
* and changing the lists.
@@ -601,8 +683,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
if (err)
return ERR_PTR(err);
if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
- copy_map_value_locked(&smap->map, old_sdata->data,
- value, false);
+ if (map_flags & BPF_F_LOCK)
+ copy_map_value_locked(&smap->map,
+ sdata_mapval(old_sdata),
+ value, false);
+ else
+ copy_map_value(&smap->map, sdata_mapval(old_sdata),
+ value);
return old_sdata;
}
}
@@ -633,8 +720,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
goto unlock;
if (old_sdata && (map_flags & BPF_F_LOCK)) {
- copy_map_value_locked(&smap->map, old_sdata->data, value,
- false);
+ copy_map_value_locked(&smap->map, sdata_mapval(old_sdata),
+ value, false);
selem = SELEM(old_sdata);
goto unlock;
}
@@ -656,7 +743,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
unlock:
raw_spin_unlock_irqrestore(&local_storage->lock, flags);
if (alloc_selem) {
- mem_uncharge(smap, owner, smap->elem_size);
+ mem_uncharge(smap, owner, selem_bytes_used(smap));
bpf_selem_free(alloc_selem, smap, true);
}
return err ? ERR_PTR(err) : SDATA(selem);
@@ -707,6 +794,10 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
if (attr->value_size > BPF_LOCAL_STORAGE_MAX_VALUE_SIZE)
return -E2BIG;
+ if ((attr->map_flags & BPF_F_MMAPABLE) &&
+ attr->map_type != BPF_MAP_TYPE_TASK_STORAGE)
+ return -EINVAL;
+
return 0;
}
@@ -820,8 +911,12 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
raw_spin_lock_init(&smap->buckets[i].lock);
}
- smap->elem_size = offsetof(struct bpf_local_storage_elem,
- sdata.data[attr->value_size]);
+ if (attr->map_flags & BPF_F_MMAPABLE)
+ smap->elem_size = offsetof(struct bpf_local_storage_elem,
+ sdata.data[sizeof(void *)]);
+ else
+ smap->elem_size = offsetof(struct bpf_local_storage_elem,
+ sdata.data[attr->value_size]);
smap->bpf_ma = bpf_ma;
if (bpf_ma) {
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index adf6dfe0ba68..ce75c8d8b2ce 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -90,6 +90,7 @@ void bpf_task_storage_free(struct task_struct *task)
static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
{
struct bpf_local_storage_data *sdata;
+ struct bpf_local_storage_map *smap;
struct task_struct *task;
unsigned int f_flags;
struct pid *pid;
@@ -114,7 +115,8 @@ static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
sdata = task_storage_lookup(task, map, true);
bpf_task_storage_unlock();
put_pid(pid);
- return sdata ? sdata->data : NULL;
+ smap = (struct bpf_local_storage_map *)map;
+ return sdata ? sdata_mapval(sdata) : NULL;
out:
put_pid(pid);
return ERR_PTR(err);
@@ -209,18 +211,19 @@ static void *__bpf_task_storage_get(struct bpf_map *map,
u64 flags, gfp_t gfp_flags, bool nobusy)
{
struct bpf_local_storage_data *sdata;
+ struct bpf_local_storage_map *smap;
+ smap = (struct bpf_local_storage_map *)map;
sdata = task_storage_lookup(task, map, nobusy);
if (sdata)
- return sdata->data;
+ return sdata_mapval(sdata);
/* only allocate new storage, when the task is refcounted */
if (refcount_read(&task->usage) &&
(flags & BPF_LOCAL_STORAGE_GET_F_CREATE) && nobusy) {
- sdata = bpf_local_storage_update(
- task, (struct bpf_local_storage_map *)map, value,
- BPF_NOEXIST, gfp_flags);
- return IS_ERR(sdata) ? NULL : sdata->data;
+ sdata = bpf_local_storage_update(task, smap, value,
+ BPF_NOEXIST, gfp_flags);
+ return IS_ERR(sdata) ? NULL : sdata_mapval(sdata);
}
return NULL;
@@ -317,6 +320,25 @@ static void task_storage_map_free(struct bpf_map *map)
bpf_local_storage_map_free(map, &task_cache, &bpf_task_storage_busy);
}
+static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+ void *data;
+
+ if (!(map->map_flags & BPF_F_MMAPABLE) || vma->vm_pgoff ||
+ (vma->vm_end - vma->vm_start) < map->value_size)
+ return -EINVAL;
+
+ WARN_ON_ONCE(!bpf_rcu_lock_held());
+ bpf_task_storage_lock();
+ data = __bpf_task_storage_get(map, current, NULL, BPF_LOCAL_STORAGE_GET_F_CREATE,
+ 0, true);
+ bpf_task_storage_unlock();
+ if (!data)
+ return -EINVAL;
+
+ return remap_vmalloc_range(vma, data, vma->vm_pgoff);
+}
+
BTF_ID_LIST_GLOBAL_SINGLE(bpf_local_storage_map_btf_id, struct, bpf_local_storage_map)
const struct bpf_map_ops task_storage_map_ops = {
.map_meta_equal = bpf_map_meta_equal,
@@ -331,6 +353,7 @@ const struct bpf_map_ops task_storage_map_ops = {
.map_mem_usage = bpf_local_storage_map_mem_usage,
.map_btf_id = &bpf_local_storage_map_btf_id[0],
.map_owner_storage_ptr = task_storage_ptr,
+ .map_mmap = task_storage_map_mmap,
};
const struct bpf_func_proto bpf_task_storage_get_recur_proto = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5e43ddd1b83f..d7c05a509870 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -404,7 +404,7 @@ static void bpf_map_release_memcg(struct bpf_map *map)
obj_cgroup_put(map->objcg);
}
-static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
+struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
{
if (map->objcg)
return get_mem_cgroup_from_objcg(map->objcg);
--
2.34.1
* [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage
2023-11-20 17:59 [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
@ 2023-11-20 17:59 ` Dave Marchevsky
2023-11-21 19:34 ` Andrii Nakryiko
From: Dave Marchevsky @ 2023-11-20 17:59 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner, Dave Marchevsky
This patch tests mmapable task_local storage functionality added earlier
in the series. The success tests focus on verifying correctness of the
various ways of reading from and writing to mmapable task_local storage:
* Write through mmap'd region should be visible when BPF program
makes bpf_task_storage_get call
* If BPF program reads-and-incrs the mapval, the new value should be
visible when userspace reads from mmap'd region or does
map_lookup_elem call
* If userspace does map_update_elem call, new value should be visible
when userspace reads from mmap'd region or does map_lookup_elem
call
* After bpf_map_delete_elem, reading from mmap'd region should still
succeed, but map_lookup_elem w/o BPF_LOCAL_STORAGE_GET_F_CREATE flag
should fail
* After bpf_map_delete_elem, creating a new map_val via mmap call
should return a different memory region
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
.../bpf/prog_tests/task_local_storage.c | 177 ++++++++++++++++++
.../bpf/progs/task_local_storage__mmap.c | 59 ++++++
.../bpf/progs/task_local_storage__mmap.h | 7 +
.../bpf/progs/task_local_storage__mmap_fail.c | 39 ++++
4 files changed, 282 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
index ea8537c54413..08c589a12bd6 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
@@ -5,14 +5,19 @@
#include <unistd.h>
#include <sched.h>
#include <pthread.h>
+#include <sys/mman.h> /* For mmap and associated flags */
#include <sys/syscall.h> /* For SYS_xxx definitions */
#include <sys/types.h>
#include <test_progs.h>
+#include <network_helpers.h>
#include "task_local_storage_helpers.h"
#include "task_local_storage.skel.h"
#include "task_local_storage_exit_creds.skel.h"
+#include "task_local_storage__mmap.skel.h"
+#include "task_local_storage__mmap_fail.skel.h"
#include "task_ls_recursion.skel.h"
#include "task_storage_nodeadlock.skel.h"
+#include "progs/task_local_storage__mmap.h"
static void test_sys_enter_exit(void)
{
@@ -40,6 +45,173 @@ static void test_sys_enter_exit(void)
task_local_storage__destroy(skel);
}
+static int basic_mmapable_read_write(struct task_local_storage__mmap *skel,
+ long *mmaped_task_local)
+{
+ int err;
+
+ *mmaped_task_local = 42;
+
+ err = task_local_storage__mmap__attach(skel);
+ if (!ASSERT_OK(err, "skel_attach"))
+ return -1;
+
+ syscall(SYS_gettid);
+ ASSERT_EQ(skel->bss->mmaped_mapval, 42, "mmaped_mapval");
+
+ /* Incr from userspace should be visible when BPF prog reads */
+ *mmaped_task_local = *mmaped_task_local + 1;
+ syscall(SYS_gettid);
+ ASSERT_EQ(skel->bss->mmaped_mapval, 43, "mmaped_mapval_user_incr");
+
+ /* Incr from BPF prog should be visible from userspace */
+ skel->bss->read_and_incr = 1;
+ syscall(SYS_gettid);
+ ASSERT_EQ(skel->bss->mmaped_mapval, 44, "mmaped_mapval_bpf_incr");
+ ASSERT_EQ(skel->bss->mmaped_mapval, *mmaped_task_local, "bpf_incr_eq");
+ skel->bss->read_and_incr = 0;
+
+ return 0;
+}
+
+static void test_sys_enter_mmap(void)
+{
+ struct task_local_storage__mmap *skel;
+ long *task_local, *task_local2, value;
+ int err, task_fd, map_fd;
+
+ task_local = task_local2 = (long *)-1;
+ task_fd = sys_pidfd_open(getpid(), 0);
+ if (!ASSERT_NEQ(task_fd, -1, "sys_pidfd_open"))
+ return;
+
+ skel = task_local_storage__mmap__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open_and_load")) {
+ close(task_fd);
+ return;
+ }
+
+ map_fd = bpf_map__fd(skel->maps.mmapable);
+ task_local = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
+ MAP_SHARED, map_fd, 0);
+ if (!ASSERT_OK_PTR(task_local, "mmap_task_local_storage"))
+ goto out;
+
+ err = basic_mmapable_read_write(skel, task_local);
+ if (!ASSERT_OK(err, "basic_mmapable_read_write"))
+ goto out;
+
+ err = bpf_map_lookup_elem(map_fd, &task_fd, &value);
+ if (!ASSERT_OK(err, "bpf_map_lookup_elem") ||
+ !ASSERT_EQ(value, 44, "bpf_map_lookup_elem value"))
+ goto out;
+
+ value = 148;
+ bpf_map_update_elem(map_fd, &task_fd, &value, BPF_EXIST);
+ if (!ASSERT_EQ(READ_ONCE(*task_local), 148, "mmaped_read_after_update"))
+ goto out;
+
+ err = bpf_map_lookup_elem(map_fd, &task_fd, &value);
+ if (!ASSERT_OK(err, "bpf_map_lookup_elem") ||
+ !ASSERT_EQ(value, 148, "bpf_map_lookup_elem value"))
+ goto out;
+
+ /* The mmapable page is not released by map_delete_elem, but no longer
+ * linked to local_storage
+ */
+ err = bpf_map_delete_elem(map_fd, &task_fd);
+ if (!ASSERT_OK(err, "bpf_map_delete_elem") ||
+ !ASSERT_EQ(READ_ONCE(*task_local), 148, "mmaped_read_after_delete"))
+ goto out;
+
+ err = bpf_map_lookup_elem(map_fd, &task_fd, &value);
+ if (!ASSERT_EQ(err, -ENOENT, "bpf_map_lookup_elem_after_delete"))
+ goto out;
+
+ task_local_storage__mmap__destroy(skel);
+
+ /* The mmapable page is not released when __destroy unloads the map.
+ * It will stick around until we munmap it
+ */
+ *task_local = -999;
+
+ /* Although task_local's page is still around, it won't be reused */
+ skel = task_local_storage__mmap__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open_and_load2"))
+ return;
+
+ map_fd = bpf_map__fd(skel->maps.mmapable);
+ err = task_local_storage__mmap__attach(skel);
+ if (!ASSERT_OK(err, "skel_attach2"))
+ goto out;
+
+ skel->bss->read_and_incr = 1;
+ skel->bss->create_flag = BPF_LOCAL_STORAGE_GET_F_CREATE;
+ syscall(SYS_gettid);
+ ASSERT_EQ(skel->bss->mmaped_mapval, 1, "mmaped_mapval2");
+
+ skel->bss->read_and_incr = 0;
+ task_local2 = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
+ MAP_SHARED, map_fd, 0);
+ if (!ASSERT_OK_PTR(task_local2, "mmap_task_local_storage2"))
+ goto out;
+
+ if (!ASSERT_NEQ(task_local, task_local2, "second_mmap_address"))
+ goto out;
+
+ ASSERT_EQ(READ_ONCE(*task_local2), 1, "mmaped_mapval2_bpf_create_incr");
+
+out:
+ close(task_fd);
+ if (task_local > 0)
+ munmap(task_local, sizeof(long));
+ if (task_local2 > 0)
+ munmap(task_local2, sizeof(long));
+ task_local_storage__mmap__destroy(skel);
+}
+
+static void test_sys_enter_mmap_big_mapval(void)
+{
+ struct two_page_struct *task_local, value;
+ struct task_local_storage__mmap *skel;
+ int task_fd, map_fd, err;
+
+ task_local = (struct two_page_struct *)-1;
+ task_fd = sys_pidfd_open(getpid(), 0);
+ if (!ASSERT_NEQ(task_fd, -1, "sys_pidfd_open"))
+ return;
+
+ skel = task_local_storage__mmap__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open_and_load")) {
+ close(task_fd);
+ return;
+ }
+ map_fd = bpf_map__fd(skel->maps.mmapable_two_pages);
+ task_local = mmap(NULL, sizeof(struct two_page_struct),
+ PROT_READ | PROT_WRITE, MAP_SHARED,
+ map_fd, 0);
+ if (!ASSERT_OK_PTR(task_local, "mmap_task_local_storage"))
+ goto out;
+
+ skel->bss->use_big_mapval = 1;
+ err = basic_mmapable_read_write(skel, &task_local->val);
+ if (!ASSERT_OK(err, "basic_mmapable_read_write"))
+ goto out;
+
+ task_local->c[4096] = 'z';
+
+ err = bpf_map_lookup_elem(map_fd, &task_fd, &value);
+ if (!ASSERT_OK(err, "bpf_map_lookup_elem") ||
+ !ASSERT_EQ(value.val, 44, "bpf_map_lookup_elem value"))
+ goto out;
+
+out:
+ close(task_fd);
+ if (task_local > 0)
+ munmap(task_local, sizeof(struct two_page_struct));
+ task_local_storage__mmap__destroy(skel);
+}
+
static void test_exit_creds(void)
{
struct task_local_storage_exit_creds *skel;
@@ -237,10 +409,15 @@ void test_task_local_storage(void)
{
if (test__start_subtest("sys_enter_exit"))
test_sys_enter_exit();
+ if (test__start_subtest("sys_enter_mmap"))
+ test_sys_enter_mmap();
+ if (test__start_subtest("sys_enter_mmap_big_mapval"))
+ test_sys_enter_mmap_big_mapval();
if (test__start_subtest("exit_creds"))
test_exit_creds();
if (test__start_subtest("recursion"))
test_recursion();
if (test__start_subtest("nodeadlock"))
test_nodeadlock();
+ RUN_TESTS(task_local_storage__mmap_fail);
}
diff --git a/tools/testing/selftests/bpf/progs/task_local_storage__mmap.c b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
new file mode 100644
index 000000000000..1c8850c8d189
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "task_local_storage__mmap.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
+ __type(key, int);
+ __type(value, long);
+} mmapable SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
+ __type(key, int);
+ __type(value, struct two_page_struct);
+} mmapable_two_pages SEC(".maps");
+
+long mmaped_mapval = 0;
+int read_and_incr = 0;
+int create_flag = 0;
+int use_big_mapval = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+ struct two_page_struct *big_mapval;
+ struct task_struct *task;
+ long *ptr;
+
+ task = bpf_get_current_task_btf();
+ if (!task)
+ return 1;
+
+ if (use_big_mapval) {
+ big_mapval = bpf_task_storage_get(&mmapable_two_pages, task, 0,
+ create_flag);
+ if (!big_mapval)
+ return 2;
+ ptr = &big_mapval->val;
+ } else {
+ ptr = bpf_task_storage_get(&mmapable, task, 0, create_flag);
+ }
+
+ if (!ptr)
+ return 3;
+
+ if (read_and_incr)
+ *ptr = *ptr + 1;
+
+ mmaped_mapval = *ptr;
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/task_local_storage__mmap.h b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
new file mode 100644
index 000000000000..f4a3264142c2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+struct two_page_struct {
+ long val;
+ char c[4097];
+};
diff --git a/tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c b/tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
new file mode 100644
index 000000000000..f32c5bfe370a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+ __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
+ __type(key, int);
+ __type(value, long);
+} mmapable SEC(".maps");
+
+__failure __msg("invalid access to map value, value_size=8 off=8 size=8")
+SEC("tp_btf/sys_enter")
+long BPF_PROG(fail_read_past_mapval_end, struct pt_regs *regs, long id)
+{
+ struct task_struct *task;
+ long *ptr;
+ long res;
+
+ task = bpf_get_current_task_btf();
+ if (!task)
+ return 1;
+
+ ptr = bpf_task_storage_get(&mmapable, task, 0, 0);
+ if (!ptr)
+ return 3;
+ /* Although mmapable mapval is given an entire page, verifier shouldn't
+ * allow r/w past end of 'long' type
+ */
+ res = *(ptr + 1);
+
+ return res;
+}
--
2.34.1
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: Johannes Weiner @ 2023-11-20 21:41 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team
On Mon, Nov 20, 2023 at 09:59:24AM -0800, Dave Marchevsky wrote:
> This patch modifies the generic bpf_local_storage infrastructure to
> support mmapable map values and adds mmap() handling to task_local
> storage leveraging this new functionality. A userspace task which
> mmap's a task_local storage map will receive a pointer to the map_value
> corresponding to that task's key - mmap'ing in other tasks' mapvals is
> not supported in this patch.
>
> Currently, struct bpf_local_storage_elem contains both bookkeeping
> information as well as a struct bpf_local_storage_data with additional
> bookkeeping information and the actual mapval data. We can't simply map
> the page containing this struct into userspace. Instead, mmapable
> local_storage uses bpf_local_storage_data's data field to point to the
> actual mapval, which is allocated separately such that it can be
> mmapped. Only the mapval lives on the page(s) allocated for it.
>
> The lifetime of the actual_data mmapable region is tied to the
> bpf_local_storage_elem which points to it. This doesn't necessarily mean
> that the pages go away when the bpf_local_storage_elem is free'd - if
> they're mapped into some userspace process they will remain until
> unmapped, but are no longer the task_local storage's mapval.
Those bits look good to me. vfree() uses __free_pages(), which
participates in refcounting. remap_vmalloc_range() acquires references
to the individual pages, which will be dropped once the page tables
disappear on munmap(). The vmalloc area doesn't need to stick around.
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: Martin KaFai Lau @ 2023-11-21 0:42 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner, bpf
On 11/20/23 9:59 AM, Dave Marchevsky wrote:
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 173ec7f43ed1..114973f925ea 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> * the number of cachelines accessed during the cache hit case.
> */
> struct bpf_local_storage_map __rcu *smap;
> - u8 data[] __aligned(8);
> + /* Need to duplicate smap's map_flags as smap may be gone when
> + * it's time to free bpf_local_storage_data
> + */
> + u64 smap_map_flags;
> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> + * Otherwise the actual mapval data lives here
> + */
> + union {
> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> + void *actual_data __aligned(8);
The pages (that can be mmap'ed later) feel like a specific kind of kptr.
Have you thought about allowing a kptr (pointing to some pages that can be
mmap'ed later) to be stored as one of the members of the map's value?
bpf_local_storage_map is one of the maps that support kptrs.
struct normal_and_mmap_value {
int some_int;
int __percpu_kptr *some_cnts;
struct bpf_mmap_page __kptr *some_stats;
};
struct mmap_only_value {
struct bpf_mmap_page __kptr *some_stats;
};
[ ... ]
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 146824cc9689..9b3becbcc1a3 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -15,7 +15,8 @@
> #include <linux/rcupdate_trace.h>
> #include <linux/rcupdate_wait.h>
>
> -#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
> +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
> + (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
>
> static struct bpf_local_storage_map_bucket *
> select_bucket(struct bpf_local_storage_map *smap,
> @@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
> return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
> }
>
> +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
> +
> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
> +{
> + struct mem_cgroup *memcg, *old_memcg;
> + void *ptr;
> +
> + memcg = bpf_map_get_memcg(&smap->map);
> + old_memcg = set_active_memcg(memcg);
> + ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
> + NUMA_NO_NODE);
> + set_active_memcg(old_memcg);
> + mem_cgroup_put(memcg);
> +
> + return ptr;
> +}
[ ... ]
> @@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> void *value, bool charge_mem, gfp_t gfp_flags)
> {
> struct bpf_local_storage_elem *selem;
> + void *mmapable_value = NULL;
> + u32 selem_mem;
>
> - if (charge_mem && mem_charge(smap, owner, smap->elem_size))
> + selem_mem = selem_bytes_used(smap);
> + if (charge_mem && mem_charge(smap, owner, selem_mem))
> return NULL;
>
> + if (smap->map.map_flags & BPF_F_MMAPABLE) {
> + mmapable_value = alloc_mmapable_selem_value(smap);
This probably is not always safe for a bpf prog to do. Leaving aside that
the gfp_flags argument is not used, bpf local storage is moving to bpf's
memalloc because of
https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
> + if (!mmapable_value)
> + goto err_out;
> + }
> +
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: kernel test robot @ 2023-11-21 2:32 UTC (permalink / raw)
To: Dave Marchevsky, bpf
Cc: oe-kbuild-all, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Johannes Weiner,
Dave Marchevsky
Hi Dave,
kernel test robot noticed the following build warnings:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Marchevsky/bpf-Support-BPF_F_MMAPABLE-task_local-storage/20231121-020345
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231120175925.733167-2-davemarchevsky%40fb.com
patch subject: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
config: sparc-randconfig-r081-20231121 (https://download.01.org/0day-ci/archive/20231121/202311211037.6BmDdrr4-lkp@intel.com/config)
compiler: sparc-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231121/202311211037.6BmDdrr4-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311211037.6BmDdrr4-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/bpf/syscall.c:407:20: warning: no previous prototype for 'bpf_map_get_memcg' [-Wmissing-prototypes]
407 | struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
| ^~~~~~~~~~~~~~~~~
--
>> kernel/bpf/bpf_local_storage.c:30:7: warning: no previous prototype for 'alloc_mmapable_selem_value' [-Wmissing-prototypes]
30 | void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/bpf_local_storage.c: In function 'bpf_selem_free_rcu':
kernel/bpf/bpf_local_storage.c:288:39: warning: variable 'smap' set but not used [-Wunused-but-set-variable]
288 | struct bpf_local_storage_map *smap;
| ^~~~
vim +/bpf_map_get_memcg +407 kernel/bpf/syscall.c
406
> 407 struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
408 {
409 if (map->objcg)
410 return get_mem_cgroup_from_objcg(map->objcg);
411
412 return root_mem_cgroup;
413 }
414
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: kernel test robot @ 2023-11-21 5:06 UTC (permalink / raw)
To: Dave Marchevsky, bpf
Cc: oe-kbuild-all, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Johannes Weiner,
Dave Marchevsky
Hi Dave,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Marchevsky/bpf-Support-BPF_F_MMAPABLE-task_local-storage/20231121-020345
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231120175925.733167-2-davemarchevsky%40fb.com
patch subject: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
config: i386-buildonly-randconfig-005-20231121 (https://download.01.org/0day-ci/archive/20231121/202311211247.KBiJyddD-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231121/202311211247.KBiJyddD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311211247.KBiJyddD-lkp@intel.com/
All errors (new ones prefixed by >>):
ld: kernel/bpf/bpf_local_storage.o: in function `alloc_mmapable_selem_value':
bpf_local_storage.c:(.text+0x207): undefined reference to `bpf_map_get_memcg'
ld: kernel/bpf/bpf_local_storage.o: in function `bpf_selem_alloc':
bpf_local_storage.c:(.text+0x410): undefined reference to `bpf_map_get_memcg'
>> ld: bpf_local_storage.c:(.text+0x543): undefined reference to `bpf_map_get_memcg'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: kernel test robot @ 2023-11-21 5:20 UTC (permalink / raw)
To: Dave Marchevsky, bpf
Cc: oe-kbuild-all, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Kernel Team, Johannes Weiner,
Dave Marchevsky
Hi Dave,
kernel test robot noticed the following build warnings:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Dave-Marchevsky/bpf-Support-BPF_F_MMAPABLE-task_local-storage/20231121-020345
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231120175925.733167-2-davemarchevsky%40fb.com
patch subject: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
config: x86_64-randconfig-121-20231121 (https://download.01.org/0day-ci/archive/20231121/202311211358.O24bsL46-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231121/202311211358.O24bsL46-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311211358.O24bsL46-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> kernel/bpf/bpf_local_storage.c:30:6: sparse: sparse: symbol 'alloc_mmapable_selem_value' was not declared. Should it be static?
>> kernel/bpf/bpf_local_storage.c:291:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct bpf_local_storage_map *smap @@ got struct bpf_local_storage_map [noderef] __rcu *smap @@
kernel/bpf/bpf_local_storage.c:291:14: sparse: expected struct bpf_local_storage_map *smap
kernel/bpf/bpf_local_storage.c:291:14: sparse: got struct bpf_local_storage_map [noderef] __rcu *smap
kernel/bpf/bpf_local_storage.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
include/linux/page-flags.h:242:46: sparse: sparse: self-comparison always evaluates to false
vim +/alloc_mmapable_selem_value +30 kernel/bpf/bpf_local_storage.c
29
> 30 void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
31 {
32 struct mem_cgroup *memcg, *old_memcg;
33 void *ptr;
34
35 memcg = bpf_map_get_memcg(&smap->map);
36 old_memcg = set_active_memcg(memcg);
37 ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
38 NUMA_NO_NODE);
39 set_active_memcg(old_memcg);
40 mem_cgroup_put(memcg);
41
42 return ptr;
43 }
44
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: Alexei Starovoitov @ 2023-11-21 5:44 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner
On Mon, Nov 20, 2023 at 09:59:24AM -0800, Dave Marchevsky wrote:
> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
should be static?
> +static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> + void *data;
> +
> + if (!(map->map_flags & BPF_F_MMAPABLE) || vma->vm_pgoff ||
> + (vma->vm_end - vma->vm_start) < map->value_size)
> + return -EINVAL;
> +
> + WARN_ON_ONCE(!bpf_rcu_lock_held());
why? This is mmap() syscall. What is the concern?
> + bpf_task_storage_lock();
> + data = __bpf_task_storage_get(map, current, NULL, BPF_LOCAL_STORAGE_GET_F_CREATE,
> + 0, true);
0 for gfp_flags? It probably should be GFP_USER?
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: David Marchevsky @ 2023-11-21 6:11 UTC (permalink / raw)
To: Martin KaFai Lau, Dave Marchevsky
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner, bpf
On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>> index 173ec7f43ed1..114973f925ea 100644
>> --- a/include/linux/bpf_local_storage.h
>> +++ b/include/linux/bpf_local_storage.h
>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>> * the number of cachelines accessed during the cache hit case.
>> */
>> struct bpf_local_storage_map __rcu *smap;
>> - u8 data[] __aligned(8);
>> + /* Need to duplicate smap's map_flags as smap may be gone when
>> + * it's time to free bpf_local_storage_data
>> + */
>> + u64 smap_map_flags;
>> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>> + * Otherwise the actual mapval data lives here
>> + */
>> + union {
>> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>> + void *actual_data __aligned(8);
>
> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>
> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
>
> struct normal_and_mmap_value {
> int some_int;
> int __percpu_kptr *some_cnts;
>
> struct bpf_mmap_page __kptr *some_stats;
> };
>
> struct mmap_only_value {
> struct bpf_mmap_page __kptr *some_stats;
> };
>
> [ ... ]
>
This is an intriguing idea. For conciseness I'll call this specific
kind of kptr 'mmapable kptrs' for the rest of this message. Below is
more of a brainstorming dump than a cohesive response; separate trains
of thought are divided by blank lines.
My initial thought upon seeing struct normal_and_mmap_value was to note
that we currently don't support mmaping for map_value types with _any_
special fields ('special' as determined by btf_parse_fields). But IIUC
you're actually talking about exposing the some_stats pointee memory via
mmap, not the containing struct with kptr fields. That is, for maps that
support these kptrs, mmap()ing a map with value type struct
normal_and_mmap_value would return the some_stats pointer value, and
likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
logic in this patch. We'd only be able to support one such mmapable kptr
field per mapval type, but that isn't a dealbreaker.
Some maps, like task_storage, would only support mmap() on a map_value
with mmapable kptr field, as mmap()ing the mapval itself doesn't make
sense or is unsafe. Seems like arraymap would do the opposite, only
supporting mmap()ing the mapval itself. I'm curious if any map could
feasibly support both, and if so, might have to do logic like:
if (map_val has mmapable kptr)
mmap the pointee of mmapable kptr
else
mmap the map_val itself
That may be too confusing a detail to expose to BPF program
writers. Maybe a little too presumptuous and brainstorm-ey given the
limited number of mmap()able maps currently, but making this a kptr type
means maps should either ignore/fail if they don't support it, or have
consistent semantics amongst maps that do support it.
Instead of struct bpf_mmap_page __kptr *some_stats; I'd prefer
something like
struct my_type { long count; long another_count; };
struct mmap_only_value {
struct my_type __mmapable_kptr *some_stats;
};
This way the type of mmap()able field is known to BPF programs that
interact with it. This is all assuming that struct bpf_mmap_page is an
opaque page-sized blob of bytes.
We could then support structs like
struct mmap_value_and_lock {
struct bpf_spin_lock l;
int some_int;
struct my_type __mmapable_kptr *some_stats;
};
and have bpf_map_update_elem handler use the spin_lock instead of
map-internal lock where appropriate. But no way to ensure userspace task
using the mmap()ed region uses the spin_lock.
>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>> index 146824cc9689..9b3becbcc1a3 100644
>> --- a/kernel/bpf/bpf_local_storage.c
>> +++ b/kernel/bpf/bpf_local_storage.c
>> @@ -15,7 +15,8 @@
>> #include <linux/rcupdate_trace.h>
>> #include <linux/rcupdate_wait.h>
>> -#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
>> +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
>> + (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
>> static struct bpf_local_storage_map_bucket *
>> select_bucket(struct bpf_local_storage_map *smap,
>> @@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
>> return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
>> }
>> +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
>> +
>> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
>> +{
>> + struct mem_cgroup *memcg, *old_memcg;
>> + void *ptr;
>> +
>> + memcg = bpf_map_get_memcg(&smap->map);
>> + old_memcg = set_active_memcg(memcg);
>> + ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
>> + NUMA_NO_NODE);
>> + set_active_memcg(old_memcg);
>> + mem_cgroup_put(memcg);
>> +
>> + return ptr;
>> +}
>
> [ ... ]
>
>> @@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>> void *value, bool charge_mem, gfp_t gfp_flags)
>> {
>> struct bpf_local_storage_elem *selem;
>> + void *mmapable_value = NULL;
>> + u32 selem_mem;
>> - if (charge_mem && mem_charge(smap, owner, smap->elem_size))
>> + selem_mem = selem_bytes_used(smap);
>> + if (charge_mem && mem_charge(smap, owner, selem_mem))
>> return NULL;
>> + if (smap->map.map_flags & BPF_F_MMAPABLE) {
>> + mmapable_value = alloc_mmapable_selem_value(smap);
>
> This probably is not always safe for a bpf prog to do. Leaving aside that the gfp_flags argument is not used, bpf local storage is moving to bpf's memalloc because of https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
>
Minor point: alloc_mmapable_selem_value's bpf_map_area_mmapable_alloc
call will always call vmalloc under the hood. vmalloc has locks as well,
so your point stands.
I think I see how this ties into your 'specific kptr type' proposal
above. Let me know if this sounds right: if there was a bpf_mem_alloc
type focused on providing vmalloc'd mmap()able memory, we could use it
here instead of raw vmalloc and avoid the lock recursion problem linked
above. Such an allocator could be used in something like bpf_obj_new to
create the __mmapable_kptr - either from BPF prog or mmap path before
remap_vmalloc_range.
re: gfp_flags, looks like verifier is setting this param to either
GFP_ATOMIC or GFP_KERNEL. Looks like we should not allow GFP_KERNEL
allocs here?
>> + if (!mmapable_value)
>> + goto err_out;
>> + }
>> +
>
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: Yonghong Song @ 2023-11-21 6:41 UTC (permalink / raw)
To: Dave Marchevsky, bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner
On 11/20/23 12:59 PM, Dave Marchevsky wrote:
> This patch modifies the generic bpf_local_storage infrastructure to
> support mmapable map values and adds mmap() handling to task_local
> storage leveraging this new functionality. A userspace task which
> mmap's a task_local storage map will receive a pointer to the map_value
> corresponding to that task's key - mmap'ing in other tasks' mapvals is
> not supported in this patch.
>
> Currently, struct bpf_local_storage_elem contains both bookkeeping
> information as well as a struct bpf_local_storage_data with additional
> bookkeeping information and the actual mapval data. We can't simply map
> the page containing this struct into userspace. Instead, mmapable
> local_storage uses bpf_local_storage_data's data field to point to the
> actual mapval, which is allocated separately such that it can be
> mmapped. Only the mapval lives on the page(s) allocated for it.
>
> The lifetime of the actual_data mmapable region is tied to the
> bpf_local_storage_elem which points to it. This doesn't necessarily mean
> that the pages go away when the bpf_local_storage_elem is free'd - if
> they're mapped into some userspace process they will remain until
> unmapped, but are no longer the task_local storage's mapval.
>
> Implementation details:
>
> * A few small helpers are added to deal with bpf_local_storage_data's
> 'data' field having different semantics when the local_storage map
> is mmapable. With their help, many of the changes to existing code
> are purely mechanical (e.g. sdata->data becomes sdata_mapval(sdata),
> selem->elem_size becomes selem_bytes_used(selem)).
>
> * The map flags are copied into bpf_local_storage_data when its
> containing bpf_local_storage_elem is alloc'd, since the
> bpf_local_storage_map associated with them may be gone when
> bpf_local_storage_data is free'd, and testing flags for
> BPF_F_MMAPABLE is necessary when free'ing to ensure that the
> mmapable region is free'd.
> * The extra field doesn't change bpf_local_storage_elem's size.
> There were 48 bytes of padding after the bpf_local_storage_data
> field, now there are 40.
>
> * Currently, bpf_local_storage_update always creates a new
> bpf_local_storage_elem for the 'updated' value - the only exception
> being if the map_value has a bpf_spin_lock field, in which case the
> spin lock is grabbed instead of the less granular bpf_local_storage
> lock, and the value updated in place. This inplace update behavior
> is desired for mmapable local_storage map_values as well, since
> creating a new selem would result in new mmapable pages.
>
> * The size of the mmapable pages are accounted for when calling
> mem_{charge,uncharge}. If the pages are mmap'd into a userspace task
> mem_uncharge may be called before they actually go away.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> include/linux/bpf_local_storage.h | 14 ++-
> kernel/bpf/bpf_local_storage.c | 145 ++++++++++++++++++++++++------
> kernel/bpf/bpf_task_storage.c | 35 ++++++--
> kernel/bpf/syscall.c | 2 +-
> 4 files changed, 163 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 173ec7f43ed1..114973f925ea 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> * the number of cachelines accessed during the cache hit case.
> */
> struct bpf_local_storage_map __rcu *smap;
> - u8 data[] __aligned(8);
> + /* Need to duplicate smap's map_flags as smap may be gone when
> + * it's time to free bpf_local_storage_data
> + */
> + u64 smap_map_flags;
> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> + * Otherwise the actual mapval data lives here
> + */
> + union {
> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> + void *actual_data __aligned(8);
I think we can remove __aligned(8) from 'void *actual_data __aligned(8)'
in the above, for two reasons. First, the first element in the union
is aligned(8), so the rest of the union members will be at least 8-byte
aligned. Second, IIUC, in the code we have
actual_data = mmapable_value
where mmapable_value has type 'void *'. So actual_data is used
to hold a pointer to the mmap-ed page, and it only needs pointer-type
alignment, which 'void *' already guarantees.
So we can remove __aligned(8) safely. Doing so also avoids giving the
impression that 'actual_data' has to be 8-byte aligned on a
32-bit system, which it does not.
> + };
> };
>
> /* Linked to bpf_local_storage and bpf_local_storage_map */
> @@ -124,6 +134,8 @@ static struct bpf_local_storage_cache name = { \
> /* Helper functions for bpf_local_storage */
> int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
>
> +void *sdata_mapval(struct bpf_local_storage_data *data);
> +
> struct bpf_map *
> bpf_local_storage_map_alloc(union bpf_attr *attr,
> struct bpf_local_storage_cache *cache,
[...]
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
From: Yonghong Song @ 2023-11-21 15:34 UTC (permalink / raw)
To: Dave Marchevsky, bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner
On 11/20/23 12:59 PM, Dave Marchevsky wrote:
> This patch modifies the generic bpf_local_storage infrastructure to
> support mmapable map values and adds mmap() handling to task_local
> storage leveraging this new functionality. A userspace task which
> mmap's a task_local storage map will receive a pointer to the map_value
> corresponding to that task's key - mmap'ing in other tasks' mapvals is
> not supported in this patch.
>
> Currently, struct bpf_local_storage_elem contains both bookkeeping
> information as well as a struct bpf_local_storage_data with additional
> bookkeeping information and the actual mapval data. We can't simply map
> the page containing this struct into userspace. Instead, mmapable
> local_storage uses bpf_local_storage_data's data field to point to the
> actual mapval, which is allocated separately such that it can be
> mmapped. Only the mapval lives on the page(s) allocated for it.
>
> The lifetime of the actual_data mmapable region is tied to the
> bpf_local_storage_elem which points to it. This doesn't necessarily mean
> that the pages go away when the bpf_local_storage_elem is free'd - if
> they're mapped into some userspace process they will remain until
> unmapped, but are no longer the task_local storage's mapval.
>
> Implementation details:
>
> * A few small helpers are added to deal with bpf_local_storage_data's
> 'data' field having different semantics when the local_storage map
> is mmapable. With their help, many of the changes to existing code
> are purely mechanical (e.g. sdata->data becomes sdata_mapval(sdata),
> selem->elem_size becomes selem_bytes_used(selem)).
>
> * The map flags are copied into bpf_local_storage_data when its
> containing bpf_local_storage_elem is alloc'd, since the
> bpf_local_storage_map associated with them may be gone when
> bpf_local_storage_data is free'd, and testing flags for
> BPF_F_MMAPABLE is necessary when free'ing to ensure that the
> mmapable region is free'd.
> * The extra field doesn't change bpf_local_storage_elem's size.
> There were 48 bytes of padding after the bpf_local_storage_data
> field, now there are 40.
>
> * Currently, bpf_local_storage_update always creates a new
> bpf_local_storage_elem for the 'updated' value - the only exception
> being if the map_value has a bpf_spin_lock field, in which case the
> spin lock is grabbed instead of the less granular bpf_local_storage
> lock, and the value updated in place. This inplace update behavior
> is desired for mmapable local_storage map_values as well, since
> creating a new selem would result in new mmapable pages.
>
> * The size of the mmapable pages are accounted for when calling
> mem_{charge,uncharge}. If the pages are mmap'd into a userspace task
> mem_uncharge may be called before they actually go away.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> include/linux/bpf_local_storage.h | 14 ++-
> kernel/bpf/bpf_local_storage.c | 145 ++++++++++++++++++++++++------
> kernel/bpf/bpf_task_storage.c | 35 ++++++--
> kernel/bpf/syscall.c | 2 +-
> 4 files changed, 163 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 173ec7f43ed1..114973f925ea 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> * the number of cachelines accessed during the cache hit case.
> */
> struct bpf_local_storage_map __rcu *smap;
> - u8 data[] __aligned(8);
> + /* Need to duplicate smap's map_flags as smap may be gone when
> + * it's time to free bpf_local_storage_data
> + */
> + u64 smap_map_flags;
> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> + * Otherwise the actual mapval data lives here
> + */
> + union {
> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> + void *actual_data __aligned(8);
> + };
> };
>
> /* Linked to bpf_local_storage and bpf_local_storage_map */
> @@ -124,6 +134,8 @@ static struct bpf_local_storage_cache name = { \
> /* Helper functions for bpf_local_storage */
> int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
>
> +void *sdata_mapval(struct bpf_local_storage_data *data);
> +
> struct bpf_map *
> bpf_local_storage_map_alloc(union bpf_attr *attr,
> struct bpf_local_storage_cache *cache,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 146824cc9689..9b3becbcc1a3 100644
> --- a/kernel/bpf/bpf_local_storage.c
[...]
> @@ -583,14 +665,14 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> err = bpf_local_storage_alloc(owner, smap, selem, gfp_flags);
> if (err) {
> bpf_selem_free(selem, smap, true);
> - mem_uncharge(smap, owner, smap->elem_size);
> + mem_uncharge(smap, owner, selem_bytes_used(smap));
> return ERR_PTR(err);
> }
>
> return SDATA(selem);
> }
>
> - if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) {
> + if (can_update_existing_selem(smap, map_flags) && !(map_flags & BPF_NOEXIST)) {
> /* Hoping to find an old_sdata to do inline update
> * such that it can avoid taking the local_storage->lock
> * and changing the lists.
> @@ -601,8 +683,13 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> if (err)
> return ERR_PTR(err);
> if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> - copy_map_value_locked(&smap->map, old_sdata->data,
> - value, false);
> + if (map_flags & BPF_F_LOCK)
> + copy_map_value_locked(&smap->map,
> + sdata_mapval(old_sdata),
> + value, false);
> + else
> + copy_map_value(&smap->map, sdata_mapval(old_sdata),
> + value);
IIUC, if there are two 'storage_update' calls to the same map/key, the two
updates will be serialized by the spin_lock. How about concurrent updates
to the mmap'ed sdata from userspace, do we need any protection there?
> return old_sdata;
> }
> }
> @@ -633,8 +720,8 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> goto unlock;
>
> if (old_sdata && (map_flags & BPF_F_LOCK)) {
> - copy_map_value_locked(&smap->map, old_sdata->data, value,
> - false);
> + copy_map_value_locked(&smap->map, sdata_mapval(old_sdata),
> + value, false);
> selem = SELEM(old_sdata);
> goto unlock;
> }
> @@ -656,7 +743,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> unlock:
> raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> if (alloc_selem) {
> - mem_uncharge(smap, owner, smap->elem_size);
> + mem_uncharge(smap, owner, selem_bytes_used(smap));
> bpf_selem_free(alloc_selem, smap, true);
> }
> return err ? ERR_PTR(err) : SDATA(selem);
> @@ -707,6 +794,10 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
> if (attr->value_size > BPF_LOCAL_STORAGE_MAX_VALUE_SIZE)
> return -E2BIG;
>
> + if ((attr->map_flags & BPF_F_MMAPABLE) &&
> + attr->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> + return -EINVAL;
> +
> return 0;
> }
>
[...]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
2023-11-21 6:11 ` David Marchevsky
@ 2023-11-21 19:27 ` Martin KaFai Lau
2023-11-21 19:49 ` Alexei Starovoitov
0 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2023-11-21 19:27 UTC (permalink / raw)
To: David Marchevsky, Dave Marchevsky
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner, bpf
On 11/20/23 10:11 PM, David Marchevsky wrote:
>
>
> On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
>> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>>> index 173ec7f43ed1..114973f925ea 100644
>>> --- a/include/linux/bpf_local_storage.h
>>> +++ b/include/linux/bpf_local_storage.h
>>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>>> * the number of cachelines accessed during the cache hit case.
>>> */
>>> struct bpf_local_storage_map __rcu *smap;
>>> - u8 data[] __aligned(8);
>>> + /* Need to duplicate smap's map_flags as smap may be gone when
>>> + * it's time to free bpf_local_storage_data
>>> + */
>>> + u64 smap_map_flags;
>>> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>>> + * Otherwise the actual mapval data lives here
>>> + */
>>> + union {
>>> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>>> + void *actual_data __aligned(8);
>>
>> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>>
>> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
>>
>> struct normal_and_mmap_value {
>> int some_int;
>> int __percpu_kptr *some_cnts;
>>
>> struct bpf_mmap_page __kptr *some_stats;
>> };
>>
>> struct mmap_only_value {
>> struct bpf_mmap_page __kptr *some_stats;
>> };
>>
>> [ ... ]
>>
>
> This is an intriguing idea. For conciseness I'll call this specific
> kind of kptr 'mmapable kptrs' for the rest of this message. Below is
> more of a brainstorming dump than a cohesive response, separate trains
> of thought are separated by two newlines.
Thanks for bearing with me while some of these ideas could be crazy. I am trying
to see how this would look for the other local storages, sk and inode. Allocating
a page for each sk will not be nice for a server with half a million sk(s), e.g.
half a million sk(s) sharing a few bandwidth policies or a few tuning
parameters. Creating something that is mmap'able to user space and also sharable
among many sk(s) would be useful.
>
>
> My initial thought upon seeing struct normal_and_mmap_value was to note
> that we currently don't support mmaping for map_value types with _any_
> special fields ('special' as determined by btf_parse_fields). But IIUC
> you're actually talking about exposing the some_stats pointee memory via
> mmap, not the containing struct with kptr fields. That is, for maps that
> support these kptrs, mmap()ing a map with value type struct
> normal_and_mmap_value would return the some_stats pointer value, and
> likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
> logic in this patch. We'd only be able to support one such mmapable kptr
> field per mapval type, but that isn't a dealbreaker.
>
> Some maps, like task_storage, would only support mmap() on a map_value
> with mmapable kptr field, as mmap()ing the mapval itself doesn't make
> sense or is unsafe. Seems like arraymap would do the opposite, only
Changing direction a bit since arraymap is brought up. :)
arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
arraymap as a kptr, the bpf prog should be able to access it as a map, much like
the current map-in-map setup. The arraymap can also be used as a regular map
from user space (e.g. pinning). It may need some btf plumbing to tell the
verifier the value type of the arraymap.
The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
can be used, where value->array_mmap is initialized to an arraymap fd. This
would limit arraymap kptr updates to the syscall side only, which seems to be
your usecase also? Allocating the arraymap from the bpf prog side needs some
thought and would need a whitelist.
The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
&task_pidfd, &value). The kernel can return an fd in value->array_mmap. Maybe we
can create a libbpf helper to free the fd(s) held in the looked-up value by
using the value's btf.
The bpf_local_storage_map side probably would not need to support mmap() then.
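To make the proposal concrete, a rough sketch of the syscall-side flow is below. Everything here is hypothetical: a kptr-to-arraymap field in a map value is not supported today, and the struct/field names are made up for illustration.

```c
/* hypothetical task_local storage value holding an arraymap kptr */
struct value_with_arraymap {
	struct bpf_map __kptr *array_mmap;
};

/* userspace: store an existing BPF_F_MMAPABLE arraymap into the task's
 * storage; array_mmap carries the arraymap's fd on the update path
 */
struct value_with_arraymap val = { .array_mmap = (void *)(long)arraymap_fd };
bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &val, 0);

/* userspace: lookup would return a new fd in val.array_mmap, which can
 * then be mmap()ed like any BPF_F_MMAPABLE arraymap, or pinned/shared
 */
bpf_map_lookup_elem(task_storage_map_fd, &task_pidfd, &val);
```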
> supporting mmap()ing the mapval itself. I'm curious if any map could
> feasibly support both, and if so, might have to do logic like:
>
> if (map_val has mmapable kptr)
> mmap the pointee of mmapable kptr
> else
> mmap the map_val itself
>
> Which is maybe too confusing of a detail to expose to BPF program
> writers. Maybe a little too presumptuous and brainstorm-ey given the
> limited number of mmap()able maps currently, but making this a kptr type
> means maps should either ignore/fail if they don't support it, or have
> consistent semantics amongst maps that do support it.
>
>
> Instead of struct bpf_mmap_page __kptr *some_stats; I'd prefer
> something like
>
> struct my_type { long count; long another_count; };
>
> struct mmap_only_value {
> struct my_type __mmapable_kptr *some_stats;
> };
>
> This way the type of mmap()able field is known to BPF programs that
> interact with it. This is all assuming that struct bpf_mmap_page is an
> opaque page-sized blob of bytes.
>
>
> We could then support structs like
>
> struct mmap_value_and_lock {
> struct bpf_spin_lock l;
> int some_int;
> struct my_type __mmapable_kptr *some_stats;
> };
>
> and have bpf_map_update_elem handler use the spin_lock instead of
> map-internal lock where appropriate. But no way to ensure userspace task
> using the mmap()ed region uses the spin_lock.
>
>>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>>> index 146824cc9689..9b3becbcc1a3 100644
>>> --- a/kernel/bpf/bpf_local_storage.c
>>> +++ b/kernel/bpf/bpf_local_storage.c
>>> @@ -15,7 +15,8 @@
>>> #include <linux/rcupdate_trace.h>
>>> #include <linux/rcupdate_wait.h>
>>> -#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
>>> +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
>>> + (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
>>> static struct bpf_local_storage_map_bucket *
>>> select_bucket(struct bpf_local_storage_map *smap,
>>> @@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
>>> return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
>>> }
>>> +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
>>> +
>>> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
>>> +{
>>> + struct mem_cgroup *memcg, *old_memcg;
>>> + void *ptr;
>>> +
>>> + memcg = bpf_map_get_memcg(&smap->map);
>>> + old_memcg = set_active_memcg(memcg);
>>> + ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
>>> + NUMA_NO_NODE);
>>> + set_active_memcg(old_memcg);
>>> + mem_cgroup_put(memcg);
>>> +
>>> + return ptr;
>>> +}
>>
>> [ ... ]
>>
>>> @@ -76,10 +122,19 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>>> void *value, bool charge_mem, gfp_t gfp_flags)
>>> {
>>> struct bpf_local_storage_elem *selem;
>>> + void *mmapable_value = NULL;
>>> + u32 selem_mem;
>>> - if (charge_mem && mem_charge(smap, owner, smap->elem_size))
>>> + selem_mem = selem_bytes_used(smap);
>>> + if (charge_mem && mem_charge(smap, owner, selem_mem))
>>> return NULL;
>>> + if (smap->map.map_flags & BPF_F_MMAPABLE) {
>>> + mmapable_value = alloc_mmapable_selem_value(smap);
>>
>> This probably is not always safe for bpf prog to do. Leaving the gfp_flags was not used aside, the bpf local storage is moving to the bpf's memalloc because of https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
>>
>
> Minor point: alloc_mmapable_selem_value's bpf_map_area_mmapable_alloc
> call will always call vmalloc under the hood. vmalloc has locks as well,
> so your point stands.
>
> I think I see how this ties into your 'specific kptr type' proposal
> above. Let me know if this sounds right: if there was a bpf_mem_alloc
> type focused on providing vmalloc'd mmap()able memory, we could use it
> here instead of raw vmalloc and avoid the lock recursion problem linked
> above. Such an allocator could be used in something like bpf_obj_new to
> create the __mmapable_kptr - either from BPF prog or mmap path before
> remap_vmalloc_range.
>
> re: gfp_flags, looks like verifier is setting this param to either
> GFP_ATOMIC or GFP_KERNEL. Looks like we should not allow GFP_KERNEL
> allocs here?
Going back to this patch, I am not sure what it takes to make bpf_mem_alloc()
mmap()able. Maybe we can limit the blast radius for now, e.g. limit this alloc
to the user space mmap() call. Does that fit your use case? A whitelist
for bpf progs could also be created later if needed.
>
>>> + if (!mmapable_value)
>>> + goto err_out;
>>> + }
>>> +
>>
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
` (7 preceding siblings ...)
2023-11-21 15:34 ` Yonghong Song
@ 2023-11-21 19:30 ` Andrii Nakryiko
8 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2023-11-21 19:30 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner
On Mon, Nov 20, 2023 at 9:59 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> This patch modifies the generic bpf_local_storage infrastructure to
> support mmapable map values and adds mmap() handling to task_local
> storage leveraging this new functionality. A userspace task which
> mmap's a task_local storage map will receive a pointer to the map_value
> corresponding to that tasks' key - mmap'ing in other tasks' mapvals is
> not supported in this patch.
>
> Currently, struct bpf_local_storage_elem contains both bookkeeping
> information as well as a struct bpf_local_storage_data with additional
> bookkeeping information and the actual mapval data. We can't simply map
> the page containing this struct into userspace. Instead, mmapable
> local_storage uses bpf_local_storage_data's data field to point to the
> actual mapval, which is allocated separately such that it can be
> mmapped. Only the mapval lives on the page(s) allocated for it.
>
> The lifetime of the actual_data mmapable region is tied to the
> bpf_local_storage_elem which points to it. This doesn't necessarily mean
> that the pages go away when the bpf_local_storage_elem is free'd - if
> they're mapped into some userspace process they will remain until
> unmapped, but are no longer the task_local storage's mapval.
>
> Implementation details:
>
> * A few small helpers are added to deal with bpf_local_storage_data's
> 'data' field having different semantics when the local_storage map
> is mmapable. With their help, many of the changes to existing code
> are purely mechanical (e.g. sdata->data becomes sdata_mapval(sdata),
> selem->elem_size becomes selem_bytes_used(selem)).
might be worth doing this as a pre-patch with no functional changes to
make the main change a bit smaller and more focused?
>
> * The map flags are copied into bpf_local_storage_data when its
> containing bpf_local_storage_elem is alloc'd, since the
> bpf_local_storage_map associated with them may be gone when
> bpf_local_storage_data is free'd, and testing flags for
> BPF_F_MMAPABLE is necessary when free'ing to ensure that the
> mmapable region is free'd.
> * The extra field doesn't change bpf_local_storage_elem's size.
> There were 48 bytes of padding after the bpf_local_storage_data
> field, now there are 40.
>
> * Currently, bpf_local_storage_update always creates a new
> bpf_local_storage_elem for the 'updated' value - the only exception
> being if the map_value has a bpf_spin_lock field, in which case the
> spin lock is grabbed instead of the less granular bpf_local_storage
> lock, and the value updated in place. This inplace update behavior
> is desired for mmapable local_storage map_values as well, since
> creating a new selem would result in new mmapable pages.
>
> * The size of the mmapable pages are accounted for when calling
> mem_{charge,uncharge}. If the pages are mmap'd into a userspace task
> mem_uncharge may be called before they actually go away.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> include/linux/bpf_local_storage.h | 14 ++-
> kernel/bpf/bpf_local_storage.c | 145 ++++++++++++++++++++++++------
> kernel/bpf/bpf_task_storage.c | 35 ++++++--
> kernel/bpf/syscall.c | 2 +-
> 4 files changed, 163 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 173ec7f43ed1..114973f925ea 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> * the number of cachelines accessed during the cache hit case.
> */
> struct bpf_local_storage_map __rcu *smap;
> - u8 data[] __aligned(8);
> + /* Need to duplicate smap's map_flags as smap may be gone when
> + * it's time to free bpf_local_storage_data
> + */
> + u64 smap_map_flags;
> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> + * Otherwise the actual mapval data lives here
> + */
> + union {
> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> + void *actual_data __aligned(8);
I don't know if it's an issue, but it's probably best to keep the FLEX_ARRAY
member last even within the union, just in case some tool doesn't handle a
FLEX_ARRAY that isn't last line-number-wise?
> + };
> };
>
> /* Linked to bpf_local_storage and bpf_local_storage_map */
> @@ -124,6 +134,8 @@ static struct bpf_local_storage_cache name = { \
> /* Helper functions for bpf_local_storage */
> int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
>
> +void *sdata_mapval(struct bpf_local_storage_data *data);
> +
> struct bpf_map *
> bpf_local_storage_map_alloc(union bpf_attr *attr,
> struct bpf_local_storage_cache *cache,
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 146824cc9689..9b3becbcc1a3 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -15,7 +15,8 @@
> #include <linux/rcupdate_trace.h>
> #include <linux/rcupdate_wait.h>
>
> -#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
> +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \
> + (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
>
> static struct bpf_local_storage_map_bucket *
> select_bucket(struct bpf_local_storage_map *smap,
> @@ -24,6 +25,51 @@ select_bucket(struct bpf_local_storage_map *smap,
> return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
> }
>
> +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map);
> +
> +void *alloc_mmapable_selem_value(struct bpf_local_storage_map *smap)
> +{
> + struct mem_cgroup *memcg, *old_memcg;
> + void *ptr;
> +
> + memcg = bpf_map_get_memcg(&smap->map);
> + old_memcg = set_active_memcg(memcg);
> + ptr = bpf_map_area_mmapable_alloc(PAGE_ALIGN(smap->map.value_size),
> + NUMA_NO_NODE);
> + set_active_memcg(old_memcg);
> + mem_cgroup_put(memcg);
> +
> + return ptr;
> +}
> +
> +void *sdata_mapval(struct bpf_local_storage_data *data)
> +{
> + if (data->smap_map_flags & BPF_F_MMAPABLE)
> + return data->actual_data;
> + return &data->data;
> +}
given that this is potentially a high-frequency helper called from other
.o files and it is simple, should it be a static inline in a .h header
instead?
> +
> +static size_t sdata_data_field_size(struct bpf_local_storage_map *smap,
> + struct bpf_local_storage_data *data)
> +{
> + if (smap->map.map_flags & BPF_F_MMAPABLE)
> + return sizeof(void *);
> + return (size_t)smap->map.value_size;
> +}
> +
[...]
> diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
> index adf6dfe0ba68..ce75c8d8b2ce 100644
> --- a/kernel/bpf/bpf_task_storage.c
> +++ b/kernel/bpf/bpf_task_storage.c
> @@ -90,6 +90,7 @@ void bpf_task_storage_free(struct task_struct *task)
> static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
> {
> struct bpf_local_storage_data *sdata;
> + struct bpf_local_storage_map *smap;
> struct task_struct *task;
> unsigned int f_flags;
> struct pid *pid;
> @@ -114,7 +115,8 @@ static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
> sdata = task_storage_lookup(task, map, true);
> bpf_task_storage_unlock();
> put_pid(pid);
> - return sdata ? sdata->data : NULL;
> + smap = (struct bpf_local_storage_map *)map;
smap seems unused?
> + return sdata ? sdata_mapval(sdata) : NULL;
> out:
> put_pid(pid);
> return ERR_PTR(err);
> @@ -209,18 +211,19 @@ static void *__bpf_task_storage_get(struct bpf_map *map,
> u64 flags, gfp_t gfp_flags, bool nobusy)
> {
> struct bpf_local_storage_data *sdata;
> + struct bpf_local_storage_map *smap;
>
> + smap = (struct bpf_local_storage_map *)map;
used much later, so maybe move it down? or just not change this part?
> sdata = task_storage_lookup(task, map, nobusy);
> if (sdata)
> - return sdata->data;
> + return sdata_mapval(sdata);
>
> /* only allocate new storage, when the task is refcounted */
> if (refcount_read(&task->usage) &&
> (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) && nobusy) {
> - sdata = bpf_local_storage_update(
> - task, (struct bpf_local_storage_map *)map, value,
> - BPF_NOEXIST, gfp_flags);
> - return IS_ERR(sdata) ? NULL : sdata->data;
> + sdata = bpf_local_storage_update(task, smap, value,
> + BPF_NOEXIST, gfp_flags);
> + return IS_ERR(sdata) ? NULL : sdata_mapval(sdata);
> }
>
> return NULL;
> @@ -317,6 +320,25 @@ static void task_storage_map_free(struct bpf_map *map)
> bpf_local_storage_map_free(map, &task_cache, &bpf_task_storage_busy);
> }
>
> +static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> + void *data;
> +
> + if (!(map->map_flags & BPF_F_MMAPABLE) || vma->vm_pgoff ||
> + (vma->vm_end - vma->vm_start) < map->value_size)
so we enforce that vm_pgoff is zero, which is understandable. But why
disallow mmap()ing only a smaller portion of the map value?
Also, more importantly, I think you should reject `vma->vm_end -
vma->vm_start > PAGE_ALIGN(map->value_size)`, no?
Another question. I might have missed it, but where do we disallow
mmap()'ing maps that have "special" fields in map_value, like kptrs,
spin_locks, timers, etc?
> + return -EINVAL;
> +
> + WARN_ON_ONCE(!bpf_rcu_lock_held());
> + bpf_task_storage_lock();
> + data = __bpf_task_storage_get(map, current, NULL, BPF_LOCAL_STORAGE_GET_F_CREATE,
> + 0, true);
> + bpf_task_storage_unlock();
> + if (!data)
> + return -EINVAL;
> +
> + return remap_vmalloc_range(vma, data, vma->vm_pgoff);
> +}
> +
> BTF_ID_LIST_GLOBAL_SINGLE(bpf_local_storage_map_btf_id, struct, bpf_local_storage_map)
> const struct bpf_map_ops task_storage_map_ops = {
> .map_meta_equal = bpf_map_meta_equal,
> @@ -331,6 +353,7 @@ const struct bpf_map_ops task_storage_map_ops = {
> .map_mem_usage = bpf_local_storage_map_mem_usage,
> .map_btf_id = &bpf_local_storage_map_btf_id[0],
> .map_owner_storage_ptr = task_storage_ptr,
> + .map_mmap = task_storage_map_mmap,
> };
>
> const struct bpf_func_proto bpf_task_storage_get_recur_proto = {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5e43ddd1b83f..d7c05a509870 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -404,7 +404,7 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> obj_cgroup_put(map->objcg);
> }
>
> -static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
> +struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
> {
> if (map->objcg)
> return get_mem_cgroup_from_objcg(map->objcg);
> --
> 2.34.1
>
* Re: [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
@ 2023-11-21 19:34 ` Andrii Nakryiko
0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2023-11-21 19:34 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Kernel Team, Johannes Weiner
On Mon, Nov 20, 2023 at 9:59 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> This patch tests mmapable task_local storage functionality added earlier
> in the series. The success tests focus on verifying correctness of the
> various ways of reading from and writing to mmapable task_local storage:
>
> * Write through mmap'd region should be visible when BPF program
> makes bpf_task_storage_get call
> * If BPF program reads-and-incrs the mapval, the new value should be
> visible when userspace reads from mmap'd region or does
> map_lookup_elem call
> * If userspace does map_update_elem call, new value should be visible
> when userspace reads from mmap'd region or does map_lookup_elem
> call
> * After bpf_map_delete_elem, reading from mmap'd region should still
> succeed, but map_lookup_elem w/o BPF_LOCAL_STORAGE_GET_F_CREATE flag
> should fail
> * After bpf_map_delete_elem, creating a new map_val via mmap call
> should return a different memory region
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> .../bpf/prog_tests/task_local_storage.c | 177 ++++++++++++++++++
> .../bpf/progs/task_local_storage__mmap.c | 59 ++++++
> .../bpf/progs/task_local_storage__mmap.h | 7 +
> .../bpf/progs/task_local_storage__mmap_fail.c | 39 ++++
> 4 files changed, 282 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
> create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
>
[...]
> diff --git a/tools/testing/selftests/bpf/progs/task_local_storage__mmap.c b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
> new file mode 100644
> index 000000000000..1c8850c8d189
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.c
> @@ -0,0 +1,59 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "task_local_storage__mmap.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
> + __type(key, int);
> + __type(value, long);
> +} mmapable SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
> + __type(key, int);
> + __type(value, struct two_page_struct);
> +} mmapable_two_pages SEC(".maps");
> +
> +long mmaped_mapval = 0;
> +int read_and_incr = 0;
> +int create_flag = 0;
> +int use_big_mapval = 0;
> +
> +SEC("tp_btf/sys_enter")
> +int BPF_PROG(on_enter, struct pt_regs *regs, long id)
> +{
> + struct two_page_struct *big_mapval;
> + struct task_struct *task;
> + long *ptr;
> +
In addition to the note below about idempotency as a desired property,
it might be good to *also* add a tid/pid filter here to only execute the
logic for our own process?
> + task = bpf_get_current_task_btf();
> + if (!task)
> + return 1;
> +
> + if (use_big_mapval) {
> + big_mapval = bpf_task_storage_get(&mmapable_two_pages, task, 0,
> + create_flag);
> + if (!big_mapval)
> + return 2;
> + ptr = &big_mapval->val;
> + } else {
> + ptr = bpf_task_storage_get(&mmapable, task, 0, create_flag);
> + }
> +
> + if (!ptr)
> + return 3;
> +
> + if (read_and_incr)
> + *ptr = *ptr + 1;
this seems fragile: the program is not idempotent, which means any
extra unexpected syscall might cause problems, right?
> +
> + mmaped_mapval = *ptr;
> + return 0;
> +}
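Combining both suggestions, the prologue of the test program might gain a filter along these lines (sketch only; `my_pid` is a hypothetical global that the userspace side of the test would set to its own pid before attaching):

```c
int my_pid = 0; /* hypothetical: userspace sets this to getpid() before attach */

SEC("tp_btf/sys_enter")
int BPF_PROG(on_enter, struct pt_regs *regs, long id)
{
	struct task_struct *task = bpf_get_current_task_btf();

	/* ignore syscalls from every task except the test process */
	if (!task || task->tgid != my_pid)
		return 0;
	/* ... rest of the program as above ... */
	return 0;
}
```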
> diff --git a/tools/testing/selftests/bpf/progs/task_local_storage__mmap.h b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
> new file mode 100644
> index 000000000000..f4a3264142c2
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_storage__mmap.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
> +
> +struct two_page_struct {
> + long val;
> + char c[4097];
> +};
> diff --git a/tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c b/tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
> new file mode 100644
> index 000000000000..f32c5bfe370a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_storage__mmap_fail.c
> @@ -0,0 +1,39 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
> + __type(key, int);
> + __type(value, long);
> +} mmapable SEC(".maps");
> +
> +__failure __msg("invalid access to map value, value_size=8 off=8 size=8")
> +SEC("tp_btf/sys_enter")
let's keep SEC() first, please move __failure and __msg below it
> +long BPF_PROG(fail_read_past_mapval_end, struct pt_regs *regs, long id)
> +{
> + struct task_struct *task;
> + long *ptr;
> + long res;
> +
> + task = bpf_get_current_task_btf();
> + if (!task)
> + return 1;
> +
> + ptr = bpf_task_storage_get(&mmapable, task, 0, 0);
> + if (!ptr)
> + return 3;
> + /* Although mmapable mapval is given an entire page, verifier shouldn't
> + * allow r/w past end of 'long' type
> + */
> + res = *(ptr + 1);
> +
> + return res;
> +}
> --
> 2.34.1
>
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
2023-11-21 19:27 ` Martin KaFai Lau
@ 2023-11-21 19:49 ` Alexei Starovoitov
2023-12-11 17:31 ` David Marchevsky
0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2023-11-21 19:49 UTC (permalink / raw)
To: Martin KaFai Lau, Tejun Heo, David Vernet
Cc: David Marchevsky, Dave Marchevsky, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team,
Johannes Weiner, bpf
On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/20/23 10:11 PM, David Marchevsky wrote:
> >
> >
> > On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
> >> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
> >>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> >>> index 173ec7f43ed1..114973f925ea 100644
> >>> --- a/include/linux/bpf_local_storage.h
> >>> +++ b/include/linux/bpf_local_storage.h
> >>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
> >>> * the number of cachelines accessed during the cache hit case.
> >>> */
> >>> struct bpf_local_storage_map __rcu *smap;
> >>> - u8 data[] __aligned(8);
> >>> + /* Need to duplicate smap's map_flags as smap may be gone when
> >>> + * it's time to free bpf_local_storage_data
> >>> + */
> >>> + u64 smap_map_flags;
> >>> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
> >>> + * Otherwise the actual mapval data lives here
> >>> + */
> >>> + union {
> >>> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
> >>> + void *actual_data __aligned(8);
> >>
> >> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
> >>
> >> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
> >>
> >> struct normal_and_mmap_value {
> >> int some_int;
> >> int __percpu_kptr *some_cnts;
> >>
> >> struct bpf_mmap_page __kptr *some_stats;
> >> };
> >>
> >> struct mmap_only_value {
> >> struct bpf_mmap_page __kptr *some_stats;
> >> };
> >>
> >> [ ... ]
> >>
> >
> > This is an intriguing idea. For conciseness I'll call this specific
> > kind of kptr 'mmapable kptrs' for the rest of this message. Below is
> > more of a brainstorming dump than a cohesive response, separate trains
> > of thought are separated by two newlines.
>
> Thanks for bearing with me while some ideas could be crazy. I am trying to see
> how this would look for other local storage types, sk and inode. Allocating a
> page for each sk will not be nice for a server with half a million sk(s), e.g.
> half a million sk(s) sharing a few bandwidth policies or a few tuning
> parameters. Creating something mmap'able to the user space and also sharable
> among many sk(s) would be useful.
>
> >
> >
> > My initial thought upon seeing struct normal_and_mmap_value was to note
> > that we currently don't support mmaping for map_value types with _any_
> > special fields ('special' as determined by btf_parse_fields). But IIUC
> > you're actually talking about exposing the some_stats pointee memory via
> > mmap, not the containing struct with kptr fields. That is, for maps that
> > support these kptrs, mmap()ing a map with value type struct
> > normal_and_mmap_value would return the some_stats pointer value, and
> > likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
> > logic in this patch. We'd only be able to support one such mmapable kptr
> > field per mapval type, but that isn't a dealbreaker.
> >
> > Some maps, like task_storage, would only support mmap() on a map_value
> > with mmapable kptr field, as mmap()ing the mapval itself doesn't make
> > sense or is unsafe. Seems like arraymap would do the opposite, only
>
> Changing direction a bit since arraymap is brought up. :)
>
> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
> arraymap as a kptr, the bpf prog should be able to access it as a map, much like
> the current map-in-map setup. The arraymap can be used as a regular map in
> user space as well (e.g. via pinning). It may need some btf plumbing to tell the
> value type of the arraymap to the verifier.
>
> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
> can be used, where value->array_mmap is initialized as an arraymap_fd. This would
> limit arraymap kptr updates to the syscall side, which seems to be your
> usecase as well? Allocating the arraymap from the bpf prog side needs some
> thought and a whitelist.
>
> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
> &task_pidfd, &value). The kernel can return an fd in value->array_mmap. Maybe we
> can create a libbpf helper to free the fd resources held in the looked-up
> value by using the value's btf.
>
> The bpf_local_storage_map side probably does not need to support mmap() then.
Martin,
that's an interesting idea!
I kinda like it and I think it's worth exploring further.
I think the main quirk of the proposed mmap-of-task-local-storage
is using the 'current' task as an implicit 'key' in the task local storage map.
It fits here, but I'm not sure it addresses the sched-ext use case.
Tejun, David,
could you please chime in?
Do you think mmap(..., task_local_storage_map_fd, ...)
that returns a page belonging only to the current task is enough?
If not, we need to think through how to mmap the local storage of other
tasks. One proposal was to use pgoff to carry the key somehow,
like io_uring does, but the pgoff approach falls apart if we want to
generalize it so that __mmapable_kptr works like Martin is proposing above,
since the key will not fit in the 64 bits of pgoff.
Maybe we need an office hours slot to discuss. This looks to be a big
topic. Not sure we can converge over email.
Just getting everyone on the same page will take a lot of email reading.
* Re: [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE task_local storage
2023-11-21 19:49 ` Alexei Starovoitov
@ 2023-12-11 17:31 ` David Marchevsky
0 siblings, 0 replies; 17+ messages in thread
From: David Marchevsky @ 2023-12-11 17:31 UTC (permalink / raw)
To: Alexei Starovoitov, Martin KaFai Lau, Tejun Heo, David Vernet
Cc: David Marchevsky, Dave Marchevsky, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kernel Team,
Johannes Weiner, bpf
On 11/21/23 2:49 PM, Alexei Starovoitov wrote:
> On Tue, Nov 21, 2023 at 11:27 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/20/23 10:11 PM, David Marchevsky wrote:
>>>
>>>
>>> On 11/20/23 7:42 PM, Martin KaFai Lau wrote:
>>>> On 11/20/23 9:59 AM, Dave Marchevsky wrote:
>>>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>>>>> index 173ec7f43ed1..114973f925ea 100644
>>>>> --- a/include/linux/bpf_local_storage.h
>>>>> +++ b/include/linux/bpf_local_storage.h
>>>>> @@ -69,7 +69,17 @@ struct bpf_local_storage_data {
>>>>> * the number of cachelines accessed during the cache hit case.
>>>>> */
>>>>> struct bpf_local_storage_map __rcu *smap;
>>>>> - u8 data[] __aligned(8);
>>>>> + /* Need to duplicate smap's map_flags as smap may be gone when
>>>>> + * it's time to free bpf_local_storage_data
>>>>> + */
>>>>> + u64 smap_map_flags;
>>>>> + /* If BPF_F_MMAPABLE, this is a void * to separately-alloc'd data
>>>>> + * Otherwise the actual mapval data lives here
>>>>> + */
>>>>> + union {
>>>>> + DECLARE_FLEX_ARRAY(u8, data) __aligned(8);
>>>>> + void *actual_data __aligned(8);
>>>>
>>>> The pages (that can be mmap'ed later) feel like a specific kind of kptr.
>>>>
>>>> Have you thought about allowing a kptr (pointing to some pages that can be mmap'ed later) to be stored as one of the members of the map's value as a kptr. bpf_local_storage_map is one of the maps that supports kptr.
>>>>
>>>> struct normal_and_mmap_value {
>>>> int some_int;
>>>> int __percpu_kptr *some_cnts;
>>>>
>>>> struct bpf_mmap_page __kptr *some_stats;
>>>> };
>>>>
>>>> struct mmap_only_value {
>>>> struct bpf_mmap_page __kptr *some_stats;
>>>> };
>>>>
>>>> [ ... ]
>>>>
>>>
>>> This is an intriguing idea. For conciseness I'll call this specific
>>> kind of kptr 'mmapable kptrs' for the rest of this message. Below is
>>> more of a brainstorming dump than a cohesive response, separate trains
>>> of thought are separated by two newlines.
>>
>> Thanks for bearing with me while some ideas could be crazy. I am trying to see
>> how this would look for other local storage types, sk and inode. Allocating a
>> page for each sk will not be nice for a server with half a million sk(s), e.g.
>> half a million sk(s) sharing a few bandwidth policies or a few tuning
>> parameters. Creating something mmap'able to the user space and also sharable
>> among many sk(s) would be useful.
>>
>>>
>>>
>>> My initial thought upon seeing struct normal_and_mmap_value was to note
>>> that we currently don't support mmaping for map_value types with _any_
>>> special fields ('special' as determined by btf_parse_fields). But IIUC
>>> you're actually talking about exposing the some_stats pointee memory via
>>> mmap, not the containing struct with kptr fields. That is, for maps that
>>> support these kptrs, mmap()ing a map with value type struct
>>> normal_and_mmap_value would return the some_stats pointer value, and
>>> likely initialize the pointer similarly to BPF_LOCAL_STORAGE_GET_F_CREATE
>>> logic in this patch. We'd only be able to support one such mmapable kptr
>>> field per mapval type, but that isn't a dealbreaker.
>>>
>>> Some maps, like task_storage, would only support mmap() on a map_value
>>> with mmapable kptr field, as mmap()ing the mapval itself doesn't make
>>> sense or is unsafe. Seems like arraymap would do the opposite, only
>>
>> Changing direction a bit since arraymap is brought up. :)
>>
>> arraymap supports BPF_F_MMAPABLE. If the local storage map's value can store an
>> arraymap as a kptr, the bpf prog should be able to access it as a map, much like
>> the current map-in-map setup. The arraymap can be used as a regular map in
>> user space as well (e.g. via pinning). It may need some btf plumbing to tell the
>> value type of the arraymap to the verifier.
>>
>> The syscall bpf_map_update_elem(task_storage_map_fd, &task_pidfd, &value, flags)
>> can be used, where value->array_mmap is initialized as an arraymap_fd. This would
>> limit arraymap kptr updates to the syscall side, which seems to be your
>> usecase as well? Allocating the arraymap from the bpf prog side needs some
>> thought and a whitelist.
>>
>> The same goes for the syscall bpf_map_lookup_elem(task_storage_map_fd,
>> &task_pidfd, &value). The kernel can return an fd in value->array_mmap. Maybe we
>> can create a libbpf helper to free the fd resources held in the looked-up
>> value by using the value's btf.
>>
>> The bpf_local_storage_map side probably does not need to support mmap() then.
>
> Martin,
> that's an interesting idea!
> I kinda like it and I think it's worth exploring further.
>
> I think the main quirk of the proposed mmap-of-task-local-storage
> is using the 'current' task as an implicit 'key' in the task local storage map.
> It fits here, but I'm not sure it addresses the sched-ext use case.
>
> Tejun, David,
> could you please chime in?
> Do you think mmap(..., task_local_storage_map_fd, ...)
> that returns a page belonging only to the current task is enough?
>
> If not, we need to think through how to mmap the local storage of other
> tasks. One proposal was to use pgoff to carry the key somehow,
> like io_uring does, but the pgoff approach falls apart if we want to
> generalize it so that __mmapable_kptr works like Martin is proposing above,
> since the key will not fit in the 64 bits of pgoff.
>
> Maybe we need an office hours slot to discuss. This looks to be a big
> topic. Not sure we can converge over email.
> Just getting everyone on the same page will take a lot of email reading.
Meta BPF folks were all in one place for reasons unrelated to this
thread, and decided to have a design discussion regarding this mmapable
task_local storage implementation and the other implementation ideas
discussed in this thread. I will summarize the discussion below. We
didn't arrive at a confident conclusion, though, so there's plenty of
time for others to chime in and for a proper office hours discussion to
happen if necessary. Below, anything attributed to a specific person is
paraphrased from my notes; there will certainly be errors and
misrememberings.
mmapable task_local storage design discussion 11/29
We first summarized approaches that were discussed in this thread:
1) Current implementation in this series
2) pseudo-map_in_map, kptr arraymap type:
struct mmapable_data_map {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(map_flags, BPF_F_MMAPABLE);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, __u64);
};
struct my_mapval {
int whatever;
struct bpf_arraymap __kptr __arraymap_type(mmapable_data_map) *m;
/* Need special logic to support getting / updating above field from userspace (as fd) */
};
3) "mmapable page" kptr type, or "mmapable kptr" tag
struct my_mapval {
int whatever;
struct bpf_mmapable_page *m;
};
struct my_mapval2 { /* Separate approach than my_mapval above */
struct bpf_spin_lock l;
int some_int;
struct my_type __mmapable_kptr *some_stats;
};
Points of consideration regardless of implementation approach:
* mmap() syscall's return address must be page-aligned. If we want to
reduce / eliminate wasted memory by supporting packing of multiple
mapvals onto a single page, we need to be able to return a
non-page-aligned addr. Using a BPF syscall subcommand in lieu of, or in
addition to, the mmap() syscall would be a way to support this.
* Dave wants to avoid implementing packing at all, but having a
reasonable path forward would be nice in case an actual usecase
arises.
* Andrii suggested a new mmap-like syscall supporting passing of
custom params, useful for any subsystem using mmap in
nontraditional ways. This was initially a response to the "use offset
to pass task id" discussion re: selecting a non-current task.
* How orthogonal is Martin's "kptr to arraymap" suggestion to the
general mmapable local_storage goal? Is there a world where we
choose a different approach for this work, and then do "kptr to
arraymap" independently later?
The above was mostly summary of existing discussion in this thread. Rest
of discussion flowed from there.
Q&A:
- Do we need to be able to query other tasks' mapvals? (for David Vernet
/ Tejun Heo)
TJ had a usecase where this might've been necessary, but rewrote around it.
David: Being able to do this in general, aside from TJ's specific case,
would be useful. David provided an example from the ghOSt project: central
scheduling. Another example: the Folly runtime framework farms out work to
worker threads and might want to tag them.
- Which usecases actually care about avoiding syscall overhead?
Alexei: Most services that would want to use this functionality to tag
their tasks don't need it, as they just set the value once.
Dave: Some tracing usecases (e.g. strobemeta) need it.
- Do we want to use mmap() in a current-task-only limited way, or use a
BPF subcommand or something more exotic?
TJ: What if a bpf subcommand returns an fd that can be mmap'd? The fd
identifies which task_local storage elem is mmap'd. Subcommand:
bpf_map_lookup_elem_fd(map *, u64 elem_id).
Alexei: Such a thing should return an fd to an arbitrary mapval, and should
support other common ops (open, close, etc.).
David: What's the problem w/ having an fd that only supports mmap for now?
TJ: 'Dangling' fds exist in some usecases already.
Discussion around the bpf_map_lookup_elem_fd idea continued for quite a
while. Folks liked that it avoids the "can only have one mmapable field"
issue from proposal (3) above, without making any implicit assumptions.
Alexei: Can we instead have a pointer to a userspace blob, similar to rseq,
to avoid wasting most of a page?
TJ: My instinct is to stick to something more generic; I would rather pay
the 4k.
Discussion around the userspace pointer idea continued for a while as well,
ending in the conclusion that we should take a look at using
get_user_pages, perhaps wrapping such functionality in a 'guppable' kptr
type. Folks liked the 'guppable' idea as it mostly avoids the wasted
memory issue by pushing the details to userspace, and punts on working
out a path forward for the mmap interface, which the other implementation
ideas require.
Action items based on the convo, in priority order:
- Think more about / prototype 'guppable' kptr idea
- If the above has issues, try bpf_map_lookup_elem_fd
- If both above have issues, consider earlier approaches
I will start tackling these soon.
end of thread, other threads:[~2023-12-11 17:40 UTC | newest]
Thread overview: 17+ messages
-- links below jump to the message on this page --
2023-11-20 17:59 [PATCH v1 bpf-next 0/2] bpf: Add mmapable task_local storage Dave Marchevsky
2023-11-20 17:59 ` [PATCH v1 bpf-next 1/2] bpf: Support BPF_F_MMAPABLE " Dave Marchevsky
2023-11-20 21:41 ` Johannes Weiner
2023-11-21 0:42 ` Martin KaFai Lau
2023-11-21 6:11 ` David Marchevsky
2023-11-21 19:27 ` Martin KaFai Lau
2023-11-21 19:49 ` Alexei Starovoitov
2023-12-11 17:31 ` David Marchevsky
2023-11-21 2:32 ` kernel test robot
2023-11-21 5:06 ` kernel test robot
2023-11-21 5:20 ` kernel test robot
2023-11-21 5:44 ` Alexei Starovoitov
2023-11-21 6:41 ` Yonghong Song
2023-11-21 15:34 ` Yonghong Song
2023-11-21 19:30 ` Andrii Nakryiko
2023-11-20 17:59 ` [PATCH v1 bpf-next 2/2] selftests/bpf: Add test exercising mmapable task_local_storage Dave Marchevsky
2023-11-21 19:34 ` Andrii Nakryiko