[PATCH bpf-next 0/4] Make inode storage available to tracing prog

linux-security-module.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH bpf-next 0/4] Make inode storage available to tracing prog
@ 2024-11-12  8:25 Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 1/4] bpf: lsm: Remove hook to bpf_task_storage_free Song Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:25 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

bpf inode local storage can be useful beyond LSM programs. For example,
bcc/libbpf-tools file* can use inode local storage to simplify the logic.
This set makes inode local storage available to tracing program.

1/4 is missing change for bpf task local storage. 2/4 move inode local
storage from security blob to inode.

Similar to task local storage in tracing program, it is necessary to add
recursion prevention logic for inode local storage. Patch 3/4 adds such
logic, and 4/4 add a test for the recursion prevention logic.

Song Liu (4):
  bpf: lsm: Remove hook to bpf_task_storage_free
  bpf: Make bpf inode storage available to tracing program
  bpf: Add recursion prevention logic for inode storage
  selftest/bpf: Test inode local storage recursion prevention

 fs/inode.c                                    |   1 +
 include/linux/bpf.h                           |   9 +
 include/linux/bpf_lsm.h                       |  29 ---
 include/linux/fs.h                            |   4 +
 kernel/bpf/Makefile                           |   3 +-
 kernel/bpf/bpf_inode_storage.c                | 185 +++++++++++++-----
 kernel/bpf/bpf_lsm.c                          |   4 -
 kernel/trace/bpf_trace.c                      |   8 +
 security/bpf/hooks.c                          |   7 -
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 .../bpf/prog_tests/inode_local_storage.c      |  72 +++++++
 .../bpf/progs/inode_storage_recursion.c       |  90 +++++++++
 12 files changed, 320 insertions(+), 93 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c

--
2.43.5

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH bpf-next 1/4] bpf: lsm: Remove hook to bpf_task_storage_free
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
@ 2024-11-12  8:25 ` Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program Song Liu
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:25 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

free_task() already calls bpf_task_storage_free(). It is not necessary
to call bpf_task_storage_free() again on security_task_free(). Remove
the hook to bpf_task_storage_free.

Signed-off-by: Song Liu <song@kernel.org>
---
 security/bpf/hooks.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/security/bpf/hooks.c b/security/bpf/hooks.c
index 3663aec7bcbd..db759025abe1 100644
--- a/security/bpf/hooks.c
+++ b/security/bpf/hooks.c
@@ -13,7 +13,6 @@ static struct security_hook_list bpf_lsm_hooks[] __ro_after_init = {
 	#include <linux/lsm_hook_defs.h>
 	#undef LSM_HOOK
 	LSM_HOOK_INIT(inode_free_security, bpf_inode_storage_free),
-	LSM_HOOK_INIT(task_free, bpf_task_storage_free),
 };
 
 static const struct lsm_id bpf_lsmid = {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 1/4] bpf: lsm: Remove hook to bpf_task_storage_free Song Liu
@ 2024-11-12  8:25 ` Song Liu
  2024-11-13 10:19   ` Christian Brauner
  2024-11-12  8:25 ` [PATCH bpf-next 3/4] bpf: Add recursion avoid logic for inode storage Song Liu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:25 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

inode storage can be useful for non-LSM program. For example, file* tools
from bcc/libbpf-tools can use inode storage instead of hash map; fanotify
fastpath [1] can also use inode storage to store useful data.

Make inode storage available for tracing program. Move bpf inode storage
from a security blob to inode->i_bpf_storage, and adjust related code
accordingly.

[1] https://lore.kernel.org/linux-fsdevel/20241029231244.2834368-1-song@kernel.org/
Signed-off-by: Song Liu <song@kernel.org>
---
 fs/inode.c                     |  1 +
 include/linux/bpf.h            |  9 +++++++++
 include/linux/bpf_lsm.h        | 29 -----------------------------
 include/linux/fs.h             |  4 ++++
 kernel/bpf/Makefile            |  3 +--
 kernel/bpf/bpf_inode_storage.c | 32 +++++---------------------------
 kernel/bpf/bpf_lsm.c           |  4 ----
 kernel/trace/bpf_trace.c       |  4 ++++
 security/bpf/hooks.c           |  6 ------
 9 files changed, 24 insertions(+), 68 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8dabb224f941..3c679578169f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -286,6 +286,7 @@ static struct inode *alloc_inode(struct super_block *sb)
 void __destroy_inode(struct inode *inode)
 {
 	BUG_ON(inode_has_buffers(inode));
+	bpf_inode_storage_free(inode);
 	inode_detach_wb(inode);
 	security_inode_free(inode);
 	fsnotify_inode_delete(inode);
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1b84613b10ac..0b31d2e74df6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2672,6 +2672,7 @@ struct bpf_link *bpf_link_by_id(u32 id);
 const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id,
 						 const struct bpf_prog *prog);
 void bpf_task_storage_free(struct task_struct *task);
+void bpf_inode_storage_free(struct inode *inode);
 void bpf_cgrp_storage_free(struct cgroup *cgroup);
 bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
 const struct btf_func_model *
@@ -2942,6 +2943,10 @@ static inline void bpf_task_storage_free(struct task_struct *task)
 {
 }
 
+static inline void bpf_inode_storage_free(struct inode *inode)
+{
+}
+
 static inline bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog)
 {
 	return false;
@@ -3305,6 +3310,10 @@ extern const struct bpf_func_proto bpf_task_storage_get_recur_proto;
 extern const struct bpf_func_proto bpf_task_storage_get_proto;
 extern const struct bpf_func_proto bpf_task_storage_delete_recur_proto;
 extern const struct bpf_func_proto bpf_task_storage_delete_proto;
+extern const struct bpf_func_proto bpf_inode_storage_get_proto;
+extern const struct bpf_func_proto bpf_inode_storage_get_recur_proto;
+extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
+extern const struct bpf_func_proto bpf_inode_storage_delete_recur_proto;
 extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
 extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
 extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
index aefcd6564251..a819c2f0a062 100644
--- a/include/linux/bpf_lsm.h
+++ b/include/linux/bpf_lsm.h
@@ -19,31 +19,12 @@
 #include <linux/lsm_hook_defs.h>
 #undef LSM_HOOK
 
-struct bpf_storage_blob {
-	struct bpf_local_storage __rcu *storage;
-};
-
-extern struct lsm_blob_sizes bpf_lsm_blob_sizes;
-
 int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
 			const struct bpf_prog *prog);
 
 bool bpf_lsm_is_sleepable_hook(u32 btf_id);
 bool bpf_lsm_is_trusted(const struct bpf_prog *prog);
 
-static inline struct bpf_storage_blob *bpf_inode(
-	const struct inode *inode)
-{
-	if (unlikely(!inode->i_security))
-		return NULL;
-
-	return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
-}
-
-extern const struct bpf_func_proto bpf_inode_storage_get_proto;
-extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
-void bpf_inode_storage_free(struct inode *inode);
-
 void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func);
 
 int bpf_lsm_get_retval_range(const struct bpf_prog *prog,
@@ -66,16 +47,6 @@ static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
 	return -EOPNOTSUPP;
 }
 
-static inline struct bpf_storage_blob *bpf_inode(
-	const struct inode *inode)
-{
-	return NULL;
-}
-
-static inline void bpf_inode_storage_free(struct inode *inode)
-{
-}
-
 static inline void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
 					   bpf_func_t *bpf_func)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..479097e4dd5b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -79,6 +79,7 @@ struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
 struct iomap_ops;
+struct bpf_local_storage;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -648,6 +649,9 @@ struct inode {
 #ifdef CONFIG_SECURITY
 	void			*i_security;
 #endif
+#ifdef CONFIG_BPF_SYSCALL
+	struct bpf_local_storage __rcu *i_bpf_storage;
+#endif
 
 	/* Stat data, not accessed from path walking */
 	unsigned long		i_ino;
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 105328f0b9c0..73604c7130f1 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,8 +10,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
-obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
-obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 44ccebc745e5..8d5a9bfe6643 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -21,16 +21,11 @@
 
 DEFINE_BPF_STORAGE_CACHE(inode_cache);
 
-static struct bpf_local_storage __rcu **
-inode_storage_ptr(void *owner)
+static struct bpf_local_storage __rcu **inode_storage_ptr(void *owner)
 {
 	struct inode *inode = owner;
-	struct bpf_storage_blob *bsb;
 
-	bsb = bpf_inode(inode);
-	if (!bsb)
-		return NULL;
-	return &bsb->storage;
+	return &inode->i_bpf_storage;
 }
 
 static struct bpf_local_storage_data *inode_storage_lookup(struct inode *inode,
@@ -39,14 +34,9 @@ static struct bpf_local_storage_data *inode_storage_lookup(struct inode *inode,
 {
 	struct bpf_local_storage *inode_storage;
 	struct bpf_local_storage_map *smap;
-	struct bpf_storage_blob *bsb;
-
-	bsb = bpf_inode(inode);
-	if (!bsb)
-		return NULL;
 
 	inode_storage =
-		rcu_dereference_check(bsb->storage, bpf_rcu_lock_held());
+		rcu_dereference_check(inode->i_bpf_storage, bpf_rcu_lock_held());
 	if (!inode_storage)
 		return NULL;
 
@@ -57,15 +47,10 @@ static struct bpf_local_storage_data *inode_storage_lookup(struct inode *inode,
 void bpf_inode_storage_free(struct inode *inode)
 {
 	struct bpf_local_storage *local_storage;
-	struct bpf_storage_blob *bsb;
-
-	bsb = bpf_inode(inode);
-	if (!bsb)
-		return;
 
 	rcu_read_lock();
 
-	local_storage = rcu_dereference(bsb->storage);
+	local_storage = rcu_dereference(inode->i_bpf_storage);
 	if (!local_storage) {
 		rcu_read_unlock();
 		return;
@@ -95,8 +80,6 @@ static long bpf_fd_inode_storage_update_elem(struct bpf_map *map, void *key,
 
 	if (fd_empty(f))
 		return -EBADF;
-	if (!inode_storage_ptr(file_inode(fd_file(f))))
-		return -EBADF;
 
 	sdata = bpf_local_storage_update(file_inode(fd_file(f)),
 					 (struct bpf_local_storage_map *)map,
@@ -136,12 +119,7 @@ BPF_CALL_5(bpf_inode_storage_get, struct bpf_map *, map, struct inode *, inode,
 	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
 		return (unsigned long)NULL;
 
-	/* explicitly check that the inode_storage_ptr is not
-	 * NULL as inode_storage_lookup returns NULL in this case and
-	 * bpf_local_storage_update expects the owner to have a
-	 * valid storage pointer.
-	 */
-	if (!inode || !inode_storage_ptr(inode))
+	if (!inode)
 		return (unsigned long)NULL;
 
 	sdata = inode_storage_lookup(inode, map, true);
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 3bc61628ab25..6b6e0730e515 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -231,10 +231,6 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 
 	switch (func_id) {
-	case BPF_FUNC_inode_storage_get:
-		return &bpf_inode_storage_get_proto;
-	case BPF_FUNC_inode_storage_delete:
-		return &bpf_inode_storage_delete_proto;
 #ifdef CONFIG_NET
 	case BPF_FUNC_sk_storage_get:
 		return &bpf_sk_storage_get_proto;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 949a3870946c..262bd101ea0b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1553,6 +1553,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		if (bpf_prog_check_recur(prog))
 			return &bpf_task_storage_delete_recur_proto;
 		return &bpf_task_storage_delete_proto;
+	case BPF_FUNC_inode_storage_get:
+		return &bpf_inode_storage_get_proto;
+	case BPF_FUNC_inode_storage_delete:
+		return &bpf_inode_storage_delete_proto;
 	case BPF_FUNC_for_each_map_elem:
 		return &bpf_for_each_map_elem_proto;
 	case BPF_FUNC_snprintf:
diff --git a/security/bpf/hooks.c b/security/bpf/hooks.c
index db759025abe1..67719a04bb0b 100644
--- a/security/bpf/hooks.c
+++ b/security/bpf/hooks.c
@@ -12,7 +12,6 @@ static struct security_hook_list bpf_lsm_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(NAME, bpf_lsm_##NAME),
 	#include <linux/lsm_hook_defs.h>
 	#undef LSM_HOOK
-	LSM_HOOK_INIT(inode_free_security, bpf_inode_storage_free),
 };
 
 static const struct lsm_id bpf_lsmid = {
@@ -28,12 +27,7 @@ static int __init bpf_lsm_init(void)
 	return 0;
 }
 
-struct lsm_blob_sizes bpf_lsm_blob_sizes __ro_after_init = {
-	.lbs_inode = sizeof(struct bpf_storage_blob),
-};
-
 DEFINE_LSM(bpf) = {
 	.name = "bpf",
 	.init = bpf_lsm_init,
-	.blobs = &bpf_lsm_blob_sizes
 };
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH bpf-next 3/4] bpf: Add recursion avoid logic for inode storage
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 1/4] bpf: lsm: Remove hook to bpf_task_storage_free Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program Song Liu
@ 2024-11-12  8:25 ` Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 3/4] bpf: Add recursion prevention " Song Liu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:25 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

The logic is same as task local storage.

Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/bpf_inode_storage.c | 156 +++++++++++++++++++++++++++------
 kernel/trace/bpf_trace.c       |   4 +
 2 files changed, 135 insertions(+), 25 deletions(-)

diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index cd4dc266ebff..ef539f4fe583 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -21,6 +21,31 @@
 
 DEFINE_BPF_STORAGE_CACHE(inode_cache);
 
+static DEFINE_PER_CPU(int, bpf_inode_storage_busy);
+
+static void bpf_inode_storage_lock(void)
+{
+	migrate_disable();
+	this_cpu_inc(bpf_inode_storage_busy);
+}
+
+static void bpf_inode_storage_unlock(void)
+{
+	this_cpu_dec(bpf_inode_storage_busy);
+	migrate_enable();
+}
+
+static bool bpf_inode_storage_trylock(void)
+{
+	migrate_disable();
+	if (unlikely(this_cpu_inc_return(bpf_inode_storage_busy) != 1)) {
+		this_cpu_dec(bpf_inode_storage_busy);
+		migrate_enable();
+		return false;
+	}
+	return true;
+}
+
 static struct bpf_local_storage __rcu **inode_storage_ptr(void *owner)
 {
 	struct inode *inode = owner;
@@ -56,7 +81,9 @@ void bpf_inode_storage_free(struct inode *inode)
 		return;
 	}
 
+	bpf_inode_storage_lock();
 	bpf_local_storage_destroy(local_storage);
+	bpf_inode_storage_unlock();
 	rcu_read_unlock();
 }
 
@@ -68,7 +95,9 @@ static void *bpf_fd_inode_storage_lookup_elem(struct bpf_map *map, void *key)
 	if (fd_empty(f))
 		return ERR_PTR(-EBADF);
 
+	bpf_inode_storage_lock();
 	sdata = inode_storage_lookup(file_inode(fd_file(f)), map, true);
+	bpf_inode_storage_unlock();
 	return sdata ? sdata->data : NULL;
 }
 
@@ -81,13 +110,16 @@ static long bpf_fd_inode_storage_update_elem(struct bpf_map *map, void *key,
 	if (fd_empty(f))
 		return -EBADF;
 
+	bpf_inode_storage_lock();
 	sdata = bpf_local_storage_update(file_inode(fd_file(f)),
 					 (struct bpf_local_storage_map *)map,
 					 value, map_flags, false, GFP_ATOMIC);
+	bpf_inode_storage_unlock();
 	return PTR_ERR_OR_ZERO(sdata);
 }
 
-static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
+static int inode_storage_delete(struct inode *inode, struct bpf_map *map,
+				bool nobusy)
 {
 	struct bpf_local_storage_data *sdata;
 
@@ -95,6 +127,9 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
+	if (!nobusy)
+		return -EBUSY;
+
 	bpf_selem_unlink(SELEM(sdata), false);
 
 	return 0;
@@ -102,60 +137,111 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
 
 static long bpf_fd_inode_storage_delete_elem(struct bpf_map *map, void *key)
 {
+	int err;
+
 	CLASS(fd_raw, f)(*(int *)key);
 
 	if (fd_empty(f))
 		return -EBADF;
-	return inode_storage_delete(file_inode(fd_file(f)), map);
+	bpf_inode_storage_lock();
+	err = inode_storage_delete(file_inode(fd_file(f)), map, true);
+	bpf_inode_storage_unlock();
+	return err;
+}
+
+static void *__bpf_inode_storage_get(struct bpf_map *map, struct inode *inode,
+				     void *value, u64 flags, gfp_t gfp_flags, bool nobusy)
+{
+	struct bpf_local_storage_data *sdata;
+
+	/* explicitly check that the inode not NULL */
+	if (!inode)
+		return NULL;
+
+	sdata = inode_storage_lookup(inode, map, true);
+	if (sdata)
+		return sdata->data;
+
+	/* only allocate new storage, when the inode is refcounted */
+	if (atomic_read(&inode->i_count) &&
+	    flags & BPF_LOCAL_STORAGE_GET_F_CREATE) {
+		sdata = bpf_local_storage_update(
+			inode, (struct bpf_local_storage_map *)map, value,
+			BPF_NOEXIST, false, gfp_flags);
+		return IS_ERR(sdata) ? NULL : sdata->data;
+	}
+
+	return NULL;
 }
 
 /* *gfp_flags* is a hidden argument provided by the verifier */
-BPF_CALL_5(bpf_inode_storage_get, struct bpf_map *, map, struct inode *, inode,
+BPF_CALL_5(bpf_inode_storage_get_recur, struct bpf_map *, map, struct inode *, inode,
 	   void *, value, u64, flags, gfp_t, gfp_flags)
 {
-	struct bpf_local_storage_data *sdata;
+	bool nobusy;
+	void *data;
 
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
 	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
 		return (unsigned long)NULL;
 
-	/* explicitly check that the inode_storage_ptr is not
-	 * NULL as inode_storage_lookup returns NULL in this case and
-	 * bpf_local_storage_update expects the owner to have a
-	 * valid storage pointer.
-	 */
-	if (!inode || !inode_storage_ptr(inode))
+	nobusy = bpf_inode_storage_trylock();
+	data = __bpf_inode_storage_get(map, inode, value, flags, gfp_flags, nobusy);
+	if (nobusy)
+		bpf_inode_storage_unlock();
+	return (unsigned long)data;
+}
+
+/* *gfp_flags* is a hidden argument provided by the verifier */
+BPF_CALL_5(bpf_inode_storage_get, struct bpf_map *, map, struct inode *, inode,
+	   void *, value, u64, flags, gfp_t, gfp_flags)
+{
+	void *data;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
 		return (unsigned long)NULL;
 
-	sdata = inode_storage_lookup(inode, map, true);
-	if (sdata)
-		return (unsigned long)sdata->data;
+	bpf_inode_storage_lock();
+	data = __bpf_inode_storage_get(map, inode, value, flags, gfp_flags, true);
+	bpf_inode_storage_unlock();
+	return (unsigned long)data;
+}
+
+BPF_CALL_2(bpf_inode_storage_delete_recur, struct bpf_map *, map, struct inode *, inode)
+{
+	bool nobusy;
+	int ret;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (!inode)
+		return -EINVAL;
 
+	nobusy = bpf_inode_storage_trylock();
 	/* This helper must only called from where the inode is guaranteed
 	 * to have a refcount and cannot be freed.
 	 */
-	if (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) {
-		sdata = bpf_local_storage_update(
-			inode, (struct bpf_local_storage_map *)map, value,
-			BPF_NOEXIST, false, gfp_flags);
-		return IS_ERR(sdata) ? (unsigned long)NULL :
-					     (unsigned long)sdata->data;
-	}
-
-	return (unsigned long)NULL;
+	ret = inode_storage_delete(inode, map, nobusy);
+	if (nobusy)
+		bpf_inode_storage_unlock();
+	return ret;
 }
 
-BPF_CALL_2(bpf_inode_storage_delete,
-	   struct bpf_map *, map, struct inode *, inode)
+BPF_CALL_2(bpf_inode_storage_delete, struct bpf_map *, map, struct inode *, inode)
 {
+	int ret;
+
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
 	if (!inode)
 		return -EINVAL;
 
+	bpf_inode_storage_lock();
 	/* This helper must only called from where the inode is guaranteed
 	 * to have a refcount and cannot be freed.
 	 */
-	return inode_storage_delete(inode, map);
+	ret = inode_storage_delete(inode, map, true);
+	bpf_inode_storage_unlock();
+	return ret;
 }
 
 static int notsupp_get_next_key(struct bpf_map *map, void *key,
@@ -191,6 +277,17 @@ const struct bpf_map_ops inode_storage_map_ops = {
 
 BTF_ID_LIST_SINGLE(bpf_inode_storage_btf_ids, struct, inode)
 
+const struct bpf_func_proto bpf_inode_storage_get_recur_proto = {
+	.func		= bpf_inode_storage_get_recur,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id	= &bpf_inode_storage_btf_ids[0],
+	.arg3_type	= ARG_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 const struct bpf_func_proto bpf_inode_storage_get_proto = {
 	.func		= bpf_inode_storage_get,
 	.gpl_only	= false,
@@ -202,6 +299,15 @@ const struct bpf_func_proto bpf_inode_storage_get_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+const struct bpf_func_proto bpf_inode_storage_delete_recur_proto = {
+	.func		= bpf_inode_storage_delete_recur,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id	= &bpf_inode_storage_btf_ids[0],
+};
+
 const struct bpf_func_proto bpf_inode_storage_delete_proto = {
 	.func		= bpf_inode_storage_delete,
 	.gpl_only	= false,
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 262bd101ea0b..4616f5430a5e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1554,8 +1554,12 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 			return &bpf_task_storage_delete_recur_proto;
 		return &bpf_task_storage_delete_proto;
 	case BPF_FUNC_inode_storage_get:
+		if (bpf_prog_check_recur(prog))
+			return &bpf_inode_storage_get_recur_proto;
 		return &bpf_inode_storage_get_proto;
 	case BPF_FUNC_inode_storage_delete:
+		if (bpf_prog_check_recur(prog))
+			return &bpf_inode_storage_delete_recur_proto;
 		return &bpf_inode_storage_delete_proto;
 	case BPF_FUNC_for_each_map_elem:
 		return &bpf_for_each_map_elem_proto;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH bpf-next 3/4] bpf: Add recursion prevention logic for inode storage
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
                   ` (2 preceding siblings ...)
  2024-11-12  8:25 ` [PATCH bpf-next 3/4] bpf: Add recursion avoid logic for inode storage Song Liu
@ 2024-11-12  8:25 ` Song Liu
  2024-11-12  8:25 ` [PATCH bpf-next 4/4] selftest/bpf: Add test for inode local storage recursion Song Liu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:25 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

This logic is similar to the recursion prevention logic for task local
storage: bpf programs on LSM hooks lock bpf_inode_storage_busy; bpf
tracing program will try to lock bpf_inode_storage_busy, and may return
-EBUSY if something else already lock bpf_inode_storage_busy on the same
CPU.

Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/bpf_inode_storage.c | 153 +++++++++++++++++++++++++++------
 kernel/trace/bpf_trace.c       |   4 +
 2 files changed, 133 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 8d5a9bfe6643..3a94a38e24f0 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -21,6 +21,31 @@
 
 DEFINE_BPF_STORAGE_CACHE(inode_cache);
 
+static DEFINE_PER_CPU(int, bpf_inode_storage_busy);
+
+static void bpf_inode_storage_lock(void)
+{
+	migrate_disable();
+	this_cpu_inc(bpf_inode_storage_busy);
+}
+
+static void bpf_inode_storage_unlock(void)
+{
+	this_cpu_dec(bpf_inode_storage_busy);
+	migrate_enable();
+}
+
+static bool bpf_inode_storage_trylock(void)
+{
+	migrate_disable();
+	if (unlikely(this_cpu_inc_return(bpf_inode_storage_busy) != 1)) {
+		this_cpu_dec(bpf_inode_storage_busy);
+		migrate_enable();
+		return false;
+	}
+	return true;
+}
+
 static struct bpf_local_storage __rcu **inode_storage_ptr(void *owner)
 {
 	struct inode *inode = owner;
@@ -56,7 +81,9 @@ void bpf_inode_storage_free(struct inode *inode)
 		return;
 	}
 
+	bpf_inode_storage_lock();
 	bpf_local_storage_destroy(local_storage);
+	bpf_inode_storage_unlock();
 	rcu_read_unlock();
 }
 
@@ -68,7 +95,9 @@ static void *bpf_fd_inode_storage_lookup_elem(struct bpf_map *map, void *key)
 	if (fd_empty(f))
 		return ERR_PTR(-EBADF);
 
+	bpf_inode_storage_lock();
 	sdata = inode_storage_lookup(file_inode(fd_file(f)), map, true);
+	bpf_inode_storage_unlock();
 	return sdata ? sdata->data : NULL;
 }
 
@@ -81,13 +110,16 @@ static long bpf_fd_inode_storage_update_elem(struct bpf_map *map, void *key,
 	if (fd_empty(f))
 		return -EBADF;
 
+	bpf_inode_storage_lock();
 	sdata = bpf_local_storage_update(file_inode(fd_file(f)),
 					 (struct bpf_local_storage_map *)map,
 					 value, map_flags, false, GFP_ATOMIC);
+	bpf_inode_storage_unlock();
 	return PTR_ERR_OR_ZERO(sdata);
 }
 
-static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
+static int inode_storage_delete(struct inode *inode, struct bpf_map *map,
+				bool nobusy)
 {
 	struct bpf_local_storage_data *sdata;
 
@@ -95,6 +127,9 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
+	if (!nobusy)
+		return -EBUSY;
+
 	bpf_selem_unlink(SELEM(sdata), false);
 
 	return 0;
@@ -102,55 +137,105 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
 
 static long bpf_fd_inode_storage_delete_elem(struct bpf_map *map, void *key)
 {
+	int err;
+
 	CLASS(fd_raw, f)(*(int *)key);
 
 	if (fd_empty(f))
 		return -EBADF;
-	return inode_storage_delete(file_inode(fd_file(f)), map);
+	bpf_inode_storage_lock();
+	err = inode_storage_delete(file_inode(fd_file(f)), map, true);
+	bpf_inode_storage_unlock();
+	return err;
 }
 
-/* *gfp_flags* is a hidden argument provided by the verifier */
-BPF_CALL_5(bpf_inode_storage_get, struct bpf_map *, map, struct inode *, inode,
-	   void *, value, u64, flags, gfp_t, gfp_flags)
+static void *__bpf_inode_storage_get(struct bpf_map *map, struct inode *inode,
+				     void *value, u64 flags, gfp_t gfp_flags, bool nobusy)
 {
 	struct bpf_local_storage_data *sdata;
 
-	WARN_ON_ONCE(!bpf_rcu_lock_held());
-	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
-		return (unsigned long)NULL;
-
+	/* explicitly check that the inode not NULL */
 	if (!inode)
-		return (unsigned long)NULL;
+		return NULL;
 
 	sdata = inode_storage_lookup(inode, map, true);
 	if (sdata)
-		return (unsigned long)sdata->data;
+		return sdata->data;
 
-	/* This helper must only called from where the inode is guaranteed
-	 * to have a refcount and cannot be freed.
-	 */
-	if (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) {
+	/* only allocate new storage, when the inode is refcounted */
+	if (atomic_read(&inode->i_count) &&
+	    flags & BPF_LOCAL_STORAGE_GET_F_CREATE) {
 		sdata = bpf_local_storage_update(
 			inode, (struct bpf_local_storage_map *)map, value,
 			BPF_NOEXIST, false, gfp_flags);
-		return IS_ERR(sdata) ? (unsigned long)NULL :
-					     (unsigned long)sdata->data;
+		return IS_ERR(sdata) ? NULL : sdata->data;
 	}
 
-	return (unsigned long)NULL;
+	return NULL;
+}
+
+/* *gfp_flags* is a hidden argument provided by the verifier */
+BPF_CALL_5(bpf_inode_storage_get_recur, struct bpf_map *, map, struct inode *, inode,
+	   void *, value, u64, flags, gfp_t, gfp_flags)
+{
+	bool nobusy;
+	void *data;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
+		return (unsigned long)NULL;
+
+	nobusy = bpf_inode_storage_trylock();
+	data = __bpf_inode_storage_get(map, inode, value, flags, gfp_flags, nobusy);
+	if (nobusy)
+		bpf_inode_storage_unlock();
+	return (unsigned long)data;
+}
+
+/* *gfp_flags* is a hidden argument provided by the verifier */
+BPF_CALL_5(bpf_inode_storage_get, struct bpf_map *, map, struct inode *, inode,
+	   void *, value, u64, flags, gfp_t, gfp_flags)
+{
+	void *data;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
+		return (unsigned long)NULL;
+
+	bpf_inode_storage_lock();
+	data = __bpf_inode_storage_get(map, inode, value, flags, gfp_flags, true);
+	bpf_inode_storage_unlock();
+	return (unsigned long)data;
+}
+
+BPF_CALL_2(bpf_inode_storage_delete_recur, struct bpf_map *, map, struct inode *, inode)
+{
+	bool nobusy;
+	int ret;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (!inode)
+		return -EINVAL;
+
+	nobusy = bpf_inode_storage_trylock();
+	ret = inode_storage_delete(inode, map, nobusy);
+	if (nobusy)
+		bpf_inode_storage_unlock();
+	return ret;
 }
 
-BPF_CALL_2(bpf_inode_storage_delete,
-	   struct bpf_map *, map, struct inode *, inode)
+BPF_CALL_2(bpf_inode_storage_delete, struct bpf_map *, map, struct inode *, inode)
 {
+	int ret;
+
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
 	if (!inode)
 		return -EINVAL;
 
-	/* This helper must only called from where the inode is guaranteed
-	 * to have a refcount and cannot be freed.
-	 */
-	return inode_storage_delete(inode, map);
+	bpf_inode_storage_lock();
+	ret = inode_storage_delete(inode, map, true);
+	bpf_inode_storage_unlock();
+	return ret;
 }
 
 static int notsupp_get_next_key(struct bpf_map *map, void *key,
@@ -186,6 +271,17 @@ const struct bpf_map_ops inode_storage_map_ops = {
 
 BTF_ID_LIST_SINGLE(bpf_inode_storage_btf_ids, struct, inode)
 
+const struct bpf_func_proto bpf_inode_storage_get_recur_proto = {
+	.func		= bpf_inode_storage_get_recur,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id	= &bpf_inode_storage_btf_ids[0],
+	.arg3_type	= ARG_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 const struct bpf_func_proto bpf_inode_storage_get_proto = {
 	.func		= bpf_inode_storage_get,
 	.gpl_only	= false,
@@ -197,6 +293,15 @@ const struct bpf_func_proto bpf_inode_storage_get_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+const struct bpf_func_proto bpf_inode_storage_delete_recur_proto = {
+	.func		= bpf_inode_storage_delete_recur,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id	= &bpf_inode_storage_btf_ids[0],
+};
+
 const struct bpf_func_proto bpf_inode_storage_delete_proto = {
 	.func		= bpf_inode_storage_delete,
 	.gpl_only	= false,
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 262bd101ea0b..4616f5430a5e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1554,8 +1554,12 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 			return &bpf_task_storage_delete_recur_proto;
 		return &bpf_task_storage_delete_proto;
 	case BPF_FUNC_inode_storage_get:
+		if (bpf_prog_check_recur(prog))
+			return &bpf_inode_storage_get_recur_proto;
 		return &bpf_inode_storage_get_proto;
 	case BPF_FUNC_inode_storage_delete:
+		if (bpf_prog_check_recur(prog))
+			return &bpf_inode_storage_delete_recur_proto;
 		return &bpf_inode_storage_delete_proto;
 	case BPF_FUNC_for_each_map_elem:
 		return &bpf_for_each_map_elem_proto;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH bpf-next 4/4] selftest/bpf: Add test for inode local storage recursion
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
                   ` (3 preceding siblings ...)
  2024-11-12  8:25 ` [PATCH bpf-next 3/4] bpf: Add recursion prevention " Song Liu
@ 2024-11-12  8:25 ` Song Liu
  2024-11-12  8:26 ` [PATCH bpf-next 4/4] selftest/bpf: Test inode local storage recursion prevention Song Liu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:25 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

Add selftest to cover recursion of bpf local storage functions.

When inode local storage function is traced, helpers that access inode
local storage should return -EBUSY.

The recurring program is attached to inode_storage_lookup(). This is not
an ideal target for recurrsion tests. However, given that the target
function have to take "struct inode *" argument, there isn't a better
target function for the tests.

Test results showed that inode_storage_lookup() is inlined in s390x.
Work around this by adding this test to DENYLIST.s390x.

Signed-off-by: Song Liu <song@kernel.org>
---
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 .../bpf/prog_tests/inode_local_storage.c      | 70 +++++++++++++++
 .../bpf/progs/inode_storage_recursion.c       | 90 +++++++++++++++++++
 3 files changed, 161 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 3ebd77206f98..6b8c9c9ec754 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -1,5 +1,6 @@
 # TEMPORARY
 # Alphabetical order
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
+inode_localstorage/recursion             # target function (inode_storage_lookup) is inlined on s390)
 stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
 verifier_iterating_callbacks
diff --git a/tools/testing/selftests/bpf/prog_tests/inode_local_storage.c b/tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
new file mode 100644
index 000000000000..8dc44ebb8431
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include <stdio.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include "inode_storage_recursion.skel.h"
+
+#define TDIR "/tmp/inode_local_storage"
+#define TDIR_PARENT "/tmp"
+
+static void test_recursion(void)
+{
+	struct inode_storage_recursion *skel;
+	struct bpf_prog_info info;
+	__u32 info_len = sizeof(info);
+	int err, prog_fd, map_fd, inode_fd = -1;
+	long value;
+
+	skel = inode_storage_recursion__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	skel->bss->test_pid = getpid();
+
+	err = inode_storage_recursion__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	err = mkdir(TDIR, 0755);
+	if (!ASSERT_OK(err, "mkdir " TDIR))
+		goto out;
+
+	inode_fd = open(TDIR_PARENT, O_RDONLY | O_CLOEXEC);
+	if (!ASSERT_OK_FD(inode_fd, "open inode_fd"))
+		goto out;
+
+	/* Detach so that the following lookup won't trigger
+	 * trace_inode_storage_lookup and further change the values.
+	 */
+	inode_storage_recursion__detach(skel);
+	map_fd = bpf_map__fd(skel->maps.inode_map);
+	err = bpf_map_lookup_elem(map_fd, &inode_fd, &value);
+	ASSERT_OK(err, "lookup inode_map");
+	ASSERT_EQ(value, 201, "inode_map value");
+	ASSERT_EQ(skel->bss->nr_del_errs, 1, "bpf_task_storage_delete busy");
+
+	prog_fd = bpf_program__fd(skel->progs.trace_inode_mkdir);
+	memset(&info, 0, sizeof(info));
+	err = bpf_prog_get_info_by_fd(prog_fd, &info, &info_len);
+	ASSERT_OK(err, "get prog info");
+	ASSERT_EQ(info.recursion_misses, 0, "trace_inode_mkdir prog recursion");
+
+	prog_fd = bpf_program__fd(skel->progs.trace_inode_storage_lookup);
+	memset(&info, 0, sizeof(info));
+	err = bpf_prog_get_info_by_fd(prog_fd, &info, &info_len);
+	ASSERT_OK(err, "get prog info");
+	ASSERT_EQ(info.recursion_misses, 3, "trace_inode_storage_lookup prog recursion");
+
+out:
+	rmdir(TDIR);
+	close(inode_fd);
+	inode_storage_recursion__destroy(skel);
+}
+
+void test_inode_localstorage(void)
+{
+	if (test__start_subtest("recursion"))
+		test_recursion();
+}
diff --git a/tools/testing/selftests/bpf/progs/inode_storage_recursion.c b/tools/testing/selftests/bpf/progs/inode_storage_recursion.c
new file mode 100644
index 000000000000..18db141f8235
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/inode_storage_recursion.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#ifndef EBUSY
+#define EBUSY 16
+#endif
+
+char _license[] SEC("license") = "GPL";
+int nr_del_errs;
+int test_pid;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_INODE_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} inode_map SEC(".maps");
+
+/* inode_storage_lookup is not an ideal hook for recursion tests, as it
+ * is static and more likely to get inlined. However, there isn't a
+ * better function for the test. This is because we need to call
+ * bpf_inode_storage_* helpers with an inode intput. Unlike task local
+ * storage, for which we can use bpf_get_current_task_btf() to get task
+ * pointer with BTF, for inode local storage, we need the get the inode
+ * pointer from function arguments. Other functions, such as,
+ * bpf_local_storage_get() does not take inode as input.
+ *
+ * As a compromise, we may need to skip this test for some architectures.
+ */
+SEC("fentry/inode_storage_lookup")
+int BPF_PROG(trace_inode_storage_lookup, struct inode *inode)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+	int err;
+
+	if (!test_pid || task->pid != test_pid)
+		return 0;
+
+	/* This doesn't have BPF_LOCAL_STORAGE_GET_F_CREATE, so it will
+	 * not trigger on the first call of bpf_inode_storage_get() below.
+	 *
+	 * This is called twice, recursion_misses += 2.
+	 */
+	ptr = bpf_inode_storage_get(&inode_map, inode, 0, 0);
+	if (ptr) {
+		*ptr += 1;
+
+		/* This is called once, recursion_misses += 1. */
+		err = bpf_inode_storage_delete(&inode_map, inode);
+		if (err == -EBUSY)
+			nr_del_errs++;
+	}
+
+	return 0;
+}
+
+SEC("fentry/security_inode_mkdir")
+int BPF_PROG(trace_inode_mkdir, struct inode *dir,
+	     struct dentry *dentry,
+	     int mode)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+
+	if (!test_pid || task->pid != test_pid)
+		return 0;
+
+	/* Trigger trace_inode_storage_lookup, the first time */
+	ptr = bpf_inode_storage_get(&inode_map, dir, 0,
+				    BPF_LOCAL_STORAGE_GET_F_CREATE);
+
+	/* trace_inode_storage_lookup cannot get ptr, so *ptr is 0 */
+	if (ptr && !*ptr)
+		*ptr = 200;
+
+	/* Trigger trace_inode_storage_lookup, the first time.
+	 * trace_inode_storage_lookup can now get ptr and increase the
+	 * value.
+	 */
+	bpf_inode_storage_get(&inode_map, dir, 0,
+			      BPF_LOCAL_STORAGE_GET_F_CREATE);
+
+	return 0;
+
+}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH bpf-next 4/4] selftest/bpf: Test inode local storage recursion prevention
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
                   ` (4 preceding siblings ...)
  2024-11-12  8:25 ` [PATCH bpf-next 4/4] selftest/bpf: Add test for inode local storage recursion Song Liu
@ 2024-11-12  8:26 ` Song Liu
  2024-11-12  8:35 ` [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
  2024-11-12 18:09 ` Casey Schaufler
  7 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:26 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Song Liu

Add selftest for recursion prevention logic of bpf local storage.

When inode local storage function is traced, helpers that access inode
local storage should return -EBUSY.

The recurring program is attached to inode_storage_lookup(). This is not
an ideal target for recursion tests. However, given that the target
function have to take "struct inode *" argument, there isn't a better
target function for the tests.

Test results showed that inode_storage_lookup() is inlined in s390x.
Work around this by adding this test to DENYLIST.s390x.

Signed-off-by: Song Liu <song@kernel.org>
---
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 .../bpf/prog_tests/inode_local_storage.c      | 72 +++++++++++++++
 .../bpf/progs/inode_storage_recursion.c       | 90 +++++++++++++++++++
 3 files changed, 163 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 3ebd77206f98..6b8c9c9ec754 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -1,5 +1,6 @@
 # TEMPORARY
 # Alphabetical order
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
+inode_localstorage/recursion             # target function (inode_storage_lookup) is inlined on s390)
 stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
 verifier_iterating_callbacks
diff --git a/tools/testing/selftests/bpf/prog_tests/inode_local_storage.c b/tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
new file mode 100644
index 000000000000..a9d9f77216f4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include <stdio.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include "inode_storage_recursion.skel.h"
+
+#define TDIR "/tmp/inode_local_storage"
+#define TDIR_PARENT "/tmp"
+
+static void test_recursion(void)
+{
+	struct inode_storage_recursion *skel;
+	struct bpf_prog_info info;
+	__u32 info_len = sizeof(info);
+	int err, prog_fd, map_fd, inode_fd = -1;
+	long value;
+
+	skel = inode_storage_recursion__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	skel->bss->test_pid = getpid();
+
+	err = inode_storage_recursion__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	err = mkdir(TDIR, 0755);
+	if (!ASSERT_OK(err, "mkdir " TDIR))
+		goto out;
+
+	inode_fd = open(TDIR_PARENT, O_RDONLY | O_CLOEXEC);
+	if (!ASSERT_OK_FD(inode_fd, "open inode_fd"))
+		goto out;
+
+	/* Detach so that the following lookup won't trigger
+	 * trace_inode_storage_lookup and further change the values.
+	 */
+	inode_storage_recursion__detach(skel);
+	map_fd = bpf_map__fd(skel->maps.inode_map);
+	err = bpf_map_lookup_elem(map_fd, &inode_fd, &value);
+	ASSERT_OK(err, "lookup inode_map");
+
+	/* Check trace_inode_mkdir for the reason that value == 201 */
+	ASSERT_EQ(value, 201, "inode_map value");
+	ASSERT_EQ(skel->bss->nr_del_errs, 1, "bpf_task_storage_delete busy");
+
+	prog_fd = bpf_program__fd(skel->progs.trace_inode_mkdir);
+	memset(&info, 0, sizeof(info));
+	err = bpf_prog_get_info_by_fd(prog_fd, &info, &info_len);
+	ASSERT_OK(err, "get prog info");
+	ASSERT_EQ(info.recursion_misses, 0, "trace_inode_mkdir prog recursion");
+
+	prog_fd = bpf_program__fd(skel->progs.trace_inode_storage_lookup);
+	memset(&info, 0, sizeof(info));
+	err = bpf_prog_get_info_by_fd(prog_fd, &info, &info_len);
+	ASSERT_OK(err, "get prog info");
+	ASSERT_EQ(info.recursion_misses, 3, "trace_inode_storage_lookup prog recursion");
+
+out:
+	rmdir(TDIR);
+	close(inode_fd);
+	inode_storage_recursion__destroy(skel);
+}
+
+void test_inode_localstorage(void)
+{
+	if (test__start_subtest("recursion"))
+		test_recursion();
+}
diff --git a/tools/testing/selftests/bpf/progs/inode_storage_recursion.c b/tools/testing/selftests/bpf/progs/inode_storage_recursion.c
new file mode 100644
index 000000000000..0ad36f8c6e04
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/inode_storage_recursion.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#ifndef EBUSY
+#define EBUSY 16
+#endif
+
+char _license[] SEC("license") = "GPL";
+int nr_del_errs;
+int test_pid;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_INODE_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} inode_map SEC(".maps");
+
+/* inode_storage_lookup is not an ideal hook for recursion tests, as it
+ * is static and more likely to get inlined. However, there isn't a
+ * better function for the test. This is because we need to call
+ * bpf_inode_storage_* helpers with an inode intput. Unlike task local
+ * storage, for which we can use bpf_get_current_task_btf() to get task
+ * pointer with BTF, for inode local storage, we need the get the inode
+ * pointer from function arguments. Other functions, such as,
+ * bpf_local_storage_get() does not take inode as input.
+ */
+SEC("fentry/inode_storage_lookup")
+int BPF_PROG(trace_inode_storage_lookup, struct inode *inode)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+	int err;
+
+	if (!test_pid || task->pid != test_pid)
+		return 0;
+
+	/* This doesn't have BPF_LOCAL_STORAGE_GET_F_CREATE, so it will
+	 * not trigger on the first call of bpf_inode_storage_get() below.
+	 *
+	 * This is called twice, recursion_misses += 2.
+	 */
+	ptr = bpf_inode_storage_get(&inode_map, inode, 0, 0);
+	if (ptr) {
+		*ptr += 1;
+
+		/* This is called once, recursion_misses += 1. */
+		err = bpf_inode_storage_delete(&inode_map, inode);
+		if (err == -EBUSY)
+			nr_del_errs++;
+	}
+
+	return 0;
+}
+
+SEC("fentry/security_inode_mkdir")
+int BPF_PROG(trace_inode_mkdir, struct inode *dir,
+	     struct dentry *dentry,
+	     int mode)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+
+	if (!test_pid || task->pid != test_pid)
+		return 0;
+
+	/* Trigger trace_inode_storage_lookup, the first time */
+	ptr = bpf_inode_storage_get(&inode_map, dir, 0,
+				    BPF_LOCAL_STORAGE_GET_F_CREATE);
+
+	/* trace_inode_storage_lookup cannot get ptr, so *ptr is 0.
+	 * Set ptr to 200.
+	 */
+	if (ptr && !*ptr)
+		*ptr = 200;
+
+	/* Trigger trace_inode_storage_lookup, the second time.
+	 * trace_inode_storage_lookup can now get ptr and increase the
+	 * value. Now the value is 201.
+	 */
+	bpf_inode_storage_get(&inode_map, dir, 0,
+			      BPF_LOCAL_STORAGE_GET_F_CREATE);
+
+	return 0;
+
+}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
                   ` (5 preceding siblings ...)
  2024-11-12  8:26 ` [PATCH bpf-next 4/4] selftest/bpf: Test inode local storage recursion prevention Song Liu
@ 2024-11-12  8:35 ` Song Liu
  2024-11-12 18:09 ` Casey Schaufler
  7 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-12  8:35 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

> On Nov 12, 2024, at 12:25 AM, Song Liu <song@kernel.org> wrote:
> 
> bpf inode local storage can be useful beyond LSM programs. For example,
> bcc/libbpf-tools file* can use inode local storage to simplify the logic.
> This set makes inode local storage available to tracing program.
> 
> 1/4 is missing change for bpf task local storage. 2/4 move inode local
> storage from security blob to inode.
> 
> Similar to task local storage in tracing program, it is necessary to add
> recursion prevention logic for inode local storage. Patch 3/4 adds such
> logic, and 4/4 add a test for the recursion prevention logic.
> 
> Song Liu (4):
>  bpf: lsm: Remove hook to bpf_task_storage_free
>  bpf: Make bpf inode storage available to tracing program
>  bpf: Add recursion prevention logic for inode storage
>  selftest/bpf: Test inode local storage recursion prevention

I accidentally sent some older .patch files together with this
set. Please ignore this version. I will resend v2. 

Thanks,
Song

> 
> fs/inode.c                                    |   1 +
> include/linux/bpf.h                           |   9 +
> include/linux/bpf_lsm.h                       |  29 ---
> include/linux/fs.h                            |   4 +
> kernel/bpf/Makefile                           |   3 +-
> kernel/bpf/bpf_inode_storage.c                | 185 +++++++++++++-----
> kernel/bpf/bpf_lsm.c                          |   4 -
> kernel/trace/bpf_trace.c                      |   8 +
> security/bpf/hooks.c                          |   7 -
> tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
> .../bpf/prog_tests/inode_local_storage.c      |  72 +++++++
> .../bpf/progs/inode_storage_recursion.c       |  90 +++++++++
> 12 files changed, 320 insertions(+), 93 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
> create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c
> 
> --
> 2.43.5


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
                   ` (6 preceding siblings ...)
  2024-11-12  8:35 ` [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
@ 2024-11-12 18:09 ` Casey Schaufler
  2024-11-12 18:44   ` Song Liu
  7 siblings, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-12 18:09 UTC (permalink / raw)
  To: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, amir73il, repnop, jlayton,
	josef, mic, gnoack, Casey Schaufler

On 11/12/2024 12:25 AM, Song Liu wrote:
> bpf inode local storage can be useful beyond LSM programs. For example,
> bcc/libbpf-tools file* can use inode local storage to simplify the logic.
> This set makes inode local storage available to tracing program.

Mixing the storage and scope of LSM data and tracing data leaves all sorts
of opportunities for abuse. Add inode data for tracing if you can get the
patch accepted, but do not move the LSM data out of i_security. Moving
the LSM data would break the integrity (such that there is) of the LSM
model.

>
> 1/4 is missing change for bpf task local storage. 2/4 move inode local
> storage from security blob to inode.
>
> Similar to task local storage in tracing program, it is necessary to add
> recursion prevention logic for inode local storage. Patch 3/4 adds such
> logic, and 4/4 add a test for the recursion prevention logic.
>
> Song Liu (4):
>   bpf: lsm: Remove hook to bpf_task_storage_free
>   bpf: Make bpf inode storage available to tracing program
>   bpf: Add recursion prevention logic for inode storage
>   selftest/bpf: Test inode local storage recursion prevention
>
>  fs/inode.c                                    |   1 +
>  include/linux/bpf.h                           |   9 +
>  include/linux/bpf_lsm.h                       |  29 ---
>  include/linux/fs.h                            |   4 +
>  kernel/bpf/Makefile                           |   3 +-
>  kernel/bpf/bpf_inode_storage.c                | 185 +++++++++++++-----
>  kernel/bpf/bpf_lsm.c                          |   4 -
>  kernel/trace/bpf_trace.c                      |   8 +
>  security/bpf/hooks.c                          |   7 -
>  tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
>  .../bpf/prog_tests/inode_local_storage.c      |  72 +++++++
>  .../bpf/progs/inode_storage_recursion.c       |  90 +++++++++
>  12 files changed, 320 insertions(+), 93 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/inode_local_storage.c
>  create mode 100644 tools/testing/selftests/bpf/progs/inode_storage_recursion.c
>
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-12 18:09 ` Casey Schaufler
@ 2024-11-12 18:44   ` Song Liu
  2024-11-13  1:10     ` Casey Schaufler
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-12 18:44 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

Hi Casey, 

Thanks for your input. 

> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> On 11/12/2024 12:25 AM, Song Liu wrote:
>> bpf inode local storage can be useful beyond LSM programs. For example,
>> bcc/libbpf-tools file* can use inode local storage to simplify the logic.
>> This set makes inode local storage available to tracing program.
> 
> Mixing the storage and scope of LSM data and tracing data leaves all sorts
> of opportunities for abuse. Add inode data for tracing if you can get the
> patch accepted, but do not move the LSM data out of i_security. Moving
> the LSM data would break the integrity (such that there is) of the LSM
> model.

I honestly don't see how this would cause any issues. Each bpf inode 
storage maps are independent of each other, and the bpf local storage is 
designed to handle multiple inode storage maps properly. Therefore, if
the user decide to stick with only LSM hooks, there isn't any behavior 
change. OTOH, if the user decides some tracing hooks (on tracepoints, 
etc.) are needed, making a inode storage map available for both tracing 
programs and LSM programs would help simplify the logic. (Alternatively,
the tracing programs need to store per inode data in a hash map, and 
the LSM program would read that instead of the inode storage map.)

Does this answer the question and address the concerns?

Thanks,
Song

> 
>> 
>> 1/4 is missing change for bpf task local storage. 2/4 move inode local
>> storage from security blob to inode.
>> 
>> Similar to task local storage in tracing program, it is necessary to add
>> recursion prevention logic for inode local storage. Patch 3/4 adds such
>> logic, and 4/4 add a test for the recursion prevention logic.
>> 
>> Song Liu (4):
>>  bpf: lsm: Remove hook to bpf_task_storage_free
>>  bpf: Make bpf inode storage available to tracing program
>>  bpf: Add recursion prevention logic for inode storage
>>  selftest/bpf: Test inode local storage recursion prevention

[...]


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-12 18:44   ` Song Liu
@ 2024-11-13  1:10     ` Casey Schaufler
  2024-11-13  1:37       ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-13  1:10 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com, Casey Schaufler

On 11/12/2024 10:44 AM, Song Liu wrote:
> Hi Casey, 
>
> Thanks for your input. 
>
>> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>> On 11/12/2024 12:25 AM, Song Liu wrote:
>>> bpf inode local storage can be useful beyond LSM programs. For example,
>>> bcc/libbpf-tools file* can use inode local storage to simplify the logic.
>>> This set makes inode local storage available to tracing program.
>> Mixing the storage and scope of LSM data and tracing data leaves all sorts
>> of opportunities for abuse. Add inode data for tracing if you can get the
>> patch accepted, but do not move the LSM data out of i_security. Moving
>> the LSM data would break the integrity (such that there is) of the LSM
>> model.
> I honestly don't see how this would cause any issues. Each bpf inode 
> storage maps are independent of each other, and the bpf local storage is 
> designed to handle multiple inode storage maps properly. Therefore, if
> the user decide to stick with only LSM hooks, there isn't any behavior 
> change. OTOH, if the user decides some tracing hooks (on tracepoints, 
> etc.) are needed, making a inode storage map available for both tracing 
> programs and LSM programs would help simplify the logic. (Alternatively,
> the tracing programs need to store per inode data in a hash map, and 
> the LSM program would read that instead of the inode storage map.)
>
> Does this answer the question and address the concerns?

First off, I had no question. No, this does not address my concern.
LSM data should be kept in and managed by the LSMs. We're making an
effort to make the LSM infrastructure more consistent. Moving some of
the LSM data into an LSM specific field in the inode structure goes
against this. What you're proposing is a one-off clever optimization
hack. We have too many of those already.



>
> Thanks,
> Song
>
>>> 1/4 is missing change for bpf task local storage. 2/4 move inode local
>>> storage from security blob to inode.
>>>
>>> Similar to task local storage in tracing program, it is necessary to add
>>> recursion prevention logic for inode local storage. Patch 3/4 adds such
>>> logic, and 4/4 add a test for the recursion prevention logic.
>>>
>>> Song Liu (4):
>>>  bpf: lsm: Remove hook to bpf_task_storage_free
>>>  bpf: Make bpf inode storage available to tracing program
>>>  bpf: Add recursion prevention logic for inode storage
>>>  selftest/bpf: Test inode local storage recursion prevention
> [...]
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-13  1:10     ` Casey Schaufler
@ 2024-11-13  1:37       ` Song Liu
  2024-11-13 18:06         ` Casey Schaufler
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-13  1:37 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com



> On Nov 12, 2024, at 5:10 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> On 11/12/2024 10:44 AM, Song Liu wrote:
>> Hi Casey, 
>> 
>> Thanks for your input. 
>> 
>>> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> 
>>> On 11/12/2024 12:25 AM, Song Liu wrote:
>>>> bpf inode local storage can be useful beyond LSM programs. For example,
>>>> bcc/libbpf-tools file* can use inode local storage to simplify the logic.
>>>> This set makes inode local storage available to tracing program.
>>> Mixing the storage and scope of LSM data and tracing data leaves all sorts
>>> of opportunities for abuse. Add inode data for tracing if you can get the
>>> patch accepted, but do not move the LSM data out of i_security. Moving
>>> the LSM data would break the integrity (such that there is) of the LSM
>>> model.
>> I honestly don't see how this would cause any issues. Each bpf inode 
>> storage maps are independent of each other, and the bpf local storage is 
>> designed to handle multiple inode storage maps properly. Therefore, if
>> the user decide to stick with only LSM hooks, there isn't any behavior 
>> change. OTOH, if the user decides some tracing hooks (on tracepoints, 
>> etc.) are needed, making a inode storage map available for both tracing 
>> programs and LSM programs would help simplify the logic. (Alternatively,
>> the tracing programs need to store per inode data in a hash map, and 
>> the LSM program would read that instead of the inode storage map.)
>> 
>> Does this answer the question and address the concerns?
> 
> First off, I had no question. No, this does not address my concern.
> LSM data should be kept in and managed by the LSMs. We're making an
> effort to make the LSM infrastructure more consistent.

Could you provide more information on the definition of "more 
consistent" LSM infrastructure? 

BPF LSM programs have full access to regular BPF maps (hash map, 
array, etc.). There was never a separation of LSM data vs. other 
data. 

AFAICT, other LSMs also use kzalloc and similar APIs for memory 
allocation. So data separation is not a goal for any LSM, right?

Thanks,
Song

> Moving some of
> the LSM data into an LSM specific field in the inode structure goes
> against this. What you're proposing is a one-off clever optimization
> hack. We have too many of those already.




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-12  8:25 ` [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program Song Liu
@ 2024-11-13 10:19   ` Christian Brauner
  2024-11-13 14:15     ` Song Liu
  2024-11-14 21:11     ` Song Liu
  0 siblings, 2 replies; 53+ messages in thread
From: Christian Brauner @ 2024-11-13 10:19 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-fsdevel, linux-kernel, linux-security-module,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro, jack,
	kpsingh, mattbobrowski, amir73il, repnop, jlayton, josef, mic,
	gnoack

On Tue, Nov 12, 2024 at 12:25:56AM -0800, Song Liu wrote:
> inode storage can be useful for non-LSM program. For example, file* tools
> from bcc/libbpf-tools can use inode storage instead of hash map; fanotify
> fastpath [1] can also use inode storage to store useful data.
> 
> Make inode storage available for tracing program. Move bpf inode storage
> from a security blob to inode->i_bpf_storage, and adjust related code
> accordingly.
> 
> [1] https://lore.kernel.org/linux-fsdevel/20241029231244.2834368-1-song@kernel.org/
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  fs/inode.c                     |  1 +
>  include/linux/bpf.h            |  9 +++++++++
>  include/linux/bpf_lsm.h        | 29 -----------------------------
>  include/linux/fs.h             |  4 ++++
>  kernel/bpf/Makefile            |  3 +--
>  kernel/bpf/bpf_inode_storage.c | 32 +++++---------------------------
>  kernel/bpf/bpf_lsm.c           |  4 ----
>  kernel/trace/bpf_trace.c       |  4 ++++
>  security/bpf/hooks.c           |  6 ------
>  9 files changed, 24 insertions(+), 68 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 8dabb224f941..3c679578169f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -286,6 +286,7 @@ static struct inode *alloc_inode(struct super_block *sb)
>  void __destroy_inode(struct inode *inode)
>  {
>  	BUG_ON(inode_has_buffers(inode));
> +	bpf_inode_storage_free(inode);
>  	inode_detach_wb(inode);
>  	security_inode_free(inode);
>  	fsnotify_inode_delete(inode);
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1b84613b10ac..0b31d2e74df6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2672,6 +2672,7 @@ struct bpf_link *bpf_link_by_id(u32 id);
>  const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id,
>  						 const struct bpf_prog *prog);
>  void bpf_task_storage_free(struct task_struct *task);
> +void bpf_inode_storage_free(struct inode *inode);
>  void bpf_cgrp_storage_free(struct cgroup *cgroup);
>  bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
>  const struct btf_func_model *
> @@ -2942,6 +2943,10 @@ static inline void bpf_task_storage_free(struct task_struct *task)
>  {
>  }
>  
> +static inline void bpf_inode_storage_free(struct inode *inode)
> +{
> +}
> +
>  static inline bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog)
>  {
>  	return false;
> @@ -3305,6 +3310,10 @@ extern const struct bpf_func_proto bpf_task_storage_get_recur_proto;
>  extern const struct bpf_func_proto bpf_task_storage_get_proto;
>  extern const struct bpf_func_proto bpf_task_storage_delete_recur_proto;
>  extern const struct bpf_func_proto bpf_task_storage_delete_proto;
> +extern const struct bpf_func_proto bpf_inode_storage_get_proto;
> +extern const struct bpf_func_proto bpf_inode_storage_get_recur_proto;
> +extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
> +extern const struct bpf_func_proto bpf_inode_storage_delete_recur_proto;
>  extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
>  extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>  extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
> diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
> index aefcd6564251..a819c2f0a062 100644
> --- a/include/linux/bpf_lsm.h
> +++ b/include/linux/bpf_lsm.h
> @@ -19,31 +19,12 @@
>  #include <linux/lsm_hook_defs.h>
>  #undef LSM_HOOK
>  
> -struct bpf_storage_blob {
> -	struct bpf_local_storage __rcu *storage;
> -};
> -
> -extern struct lsm_blob_sizes bpf_lsm_blob_sizes;
> -
>  int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
>  			const struct bpf_prog *prog);
>  
>  bool bpf_lsm_is_sleepable_hook(u32 btf_id);
>  bool bpf_lsm_is_trusted(const struct bpf_prog *prog);
>  
> -static inline struct bpf_storage_blob *bpf_inode(
> -	const struct inode *inode)
> -{
> -	if (unlikely(!inode->i_security))
> -		return NULL;
> -
> -	return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
> -}
> -
> -extern const struct bpf_func_proto bpf_inode_storage_get_proto;
> -extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
> -void bpf_inode_storage_free(struct inode *inode);
> -
>  void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func);
>  
>  int bpf_lsm_get_retval_range(const struct bpf_prog *prog,
> @@ -66,16 +47,6 @@ static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
>  	return -EOPNOTSUPP;
>  }
>  
> -static inline struct bpf_storage_blob *bpf_inode(
> -	const struct inode *inode)
> -{
> -	return NULL;
> -}
> -
> -static inline void bpf_inode_storage_free(struct inode *inode)
> -{
> -}
> -
>  static inline void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
>  					   bpf_func_t *bpf_func)
>  {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3559446279c1..479097e4dd5b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -79,6 +79,7 @@ struct fs_context;
>  struct fs_parameter_spec;
>  struct fileattr;
>  struct iomap_ops;
> +struct bpf_local_storage;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -648,6 +649,9 @@ struct inode {
>  #ifdef CONFIG_SECURITY
>  	void			*i_security;
>  #endif
> +#ifdef CONFIG_BPF_SYSCALL
> +	struct bpf_local_storage __rcu *i_bpf_storage;
> +#endif

Sorry, we're not growing struct inode for this. It just keeps getting
bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
to waste them on special-purpose stuff. We already NAKed someone else's
pet field here.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-13 10:19   ` Christian Brauner
@ 2024-11-13 14:15     ` Song Liu
  2024-11-13 18:29       ` Casey Schaufler
  2024-11-21  9:04       ` Christian Brauner
  2024-11-14 21:11     ` Song Liu
  1 sibling, 2 replies; 53+ messages in thread
From: Song Liu @ 2024-11-13 14:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

Hi Christian, 

Thanks for your review. 

> On Nov 13, 2024, at 2:19 AM, Christian Brauner <brauner@kernel.org> wrote:
[...]

>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 3559446279c1..479097e4dd5b 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -79,6 +79,7 @@ struct fs_context;
>> struct fs_parameter_spec;
>> struct fileattr;
>> struct iomap_ops;
>> +struct bpf_local_storage;
>> 
>> extern void __init inode_init(void);
>> extern void __init inode_init_early(void);
>> @@ -648,6 +649,9 @@ struct inode {
>> #ifdef CONFIG_SECURITY
>> void *i_security;
>> #endif
>> +#ifdef CONFIG_BPF_SYSCALL
>> + struct bpf_local_storage __rcu *i_bpf_storage;
>> +#endif
> 
> Sorry, we're not growing struct inode for this. It just keeps getting
> bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
> to waste them on special-purpose stuff. We already NAKed someone else's
> pet field here.

Would it be acceptable if we union i_bpf_storage with i_security?
IOW, if CONFIG_SECURITY is enabled, we will use existing logic. 
If CONFIG_SECURITY is not enabled, we will use i_bpf_storage. 
Given majority of default configs have CONFIG_SECURITY=y, this 
will not grow inode for most users. OTOH, users with 
CONFIG_SECURITY=n && CONFIG_BPF_SYSCALL=y combination can still 
use inode local storage in the tracing BPF programs. 

Does this make sense?

Thanks,
Song 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-13  1:37       ` Song Liu
@ 2024-11-13 18:06         ` Casey Schaufler
  2024-11-13 18:57           ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-13 18:06 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com, Casey Schaufler

On 11/12/2024 5:37 PM, Song Liu wrote:
>
>> On Nov 12, 2024, at 5:10 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>> On 11/12/2024 10:44 AM, Song Liu wrote:
>>> Hi Casey, 
>>>
>>> Thanks for your input. 
>>>
>>>> On Nov 12, 2024, at 10:09 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>
>>>> On 11/12/2024 12:25 AM, Song Liu wrote:
>>>>> bpf inode local storage can be useful beyond LSM programs. For example,
>>>>> bcc/libbpf-tools file* can use inode local storage to simplify the logic.
>>>>> This set makes inode local storage available to tracing program.
>>>> Mixing the storage and scope of LSM data and tracing data leaves all sorts
>>>> of opportunities for abuse. Add inode data for tracing if you can get the
>>>> patch accepted, but do not move the LSM data out of i_security. Moving
>>>> the LSM data would break the integrity (such that there is) of the LSM
>>>> model.
>>> I honestly don't see how this would cause any issues. Each bpf inode 
>>> storage maps are independent of each other, and the bpf local storage is 
>>> designed to handle multiple inode storage maps properly. Therefore, if
>>> the user decide to stick with only LSM hooks, there isn't any behavior 
>>> change. OTOH, if the user decides some tracing hooks (on tracepoints, 
>>> etc.) are needed, making a inode storage map available for both tracing 
>>> programs and LSM programs would help simplify the logic. (Alternatively,
>>> the tracing programs need to store per inode data in a hash map, and 
>>> the LSM program would read that instead of the inode storage map.)
>>>
>>> Does this answer the question and address the concerns?
>> First off, I had no question. No, this does not address my concern.
>> LSM data should be kept in and managed by the LSMs. We're making an
>> effort to make the LSM infrastructure more consistent.
> Could you provide more information on the definition of "more 
> consistent" LSM infrastructure?

We're doing several things. The management of security blobs
(e.g. inode->i_security) has been moved out of the individual
modules and into the infrastructure. The use of a u32 secid is
being replaced with a more general lsm_prop structure, except
where networking code won't allow it. A good deal of work has
gone into making the return values of LSM hooks consistent.

Some of this was done as part of the direct call change, and some
in support of LSM stacking. There are also some hardening changes
that aren't ready for prime-time, but that are in the works.
There have been concerns about the potential expoitability of the
LSM infrastructure, and we're serious about addressing those.

>
> BPF LSM programs have full access to regular BPF maps (hash map, 
> array, etc.). There was never a separation of LSM data vs. other 
> data. 
>
> AFAICT, other LSMs also use kzalloc and similar APIs for memory 
> allocation. So data separation is not a goal for any LSM, right?
>
> Thanks,
> Song
>
>> Moving some of
>> the LSM data into an LSM specific field in the inode structure goes
>> against this. What you're proposing is a one-off clever optimization
>> hack. We have too many of those already.
>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-13 14:15     ` Song Liu
@ 2024-11-13 18:29       ` Casey Schaufler
  2024-11-13 19:00         ` Song Liu
  2024-11-21  9:04       ` Christian Brauner
  1 sibling, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-13 18:29 UTC (permalink / raw)
  To: Song Liu, Christian Brauner
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com, Casey Schaufler

On 11/13/2024 6:15 AM, Song Liu wrote:
> Hi Christian, 
>
> Thanks for your review. 
>
>> On Nov 13, 2024, at 2:19 AM, Christian Brauner <brauner@kernel.org> wrote:
> [...]
>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 3559446279c1..479097e4dd5b 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -79,6 +79,7 @@ struct fs_context;
>>> struct fs_parameter_spec;
>>> struct fileattr;
>>> struct iomap_ops;
>>> +struct bpf_local_storage;
>>>
>>> extern void __init inode_init(void);
>>> extern void __init inode_init_early(void);
>>> @@ -648,6 +649,9 @@ struct inode {
>>> #ifdef CONFIG_SECURITY
>>> void *i_security;
>>> #endif
>>> +#ifdef CONFIG_BPF_SYSCALL
>>> + struct bpf_local_storage __rcu *i_bpf_storage;
>>> +#endif
>> Sorry, we're not growing struct inode for this. It just keeps getting
>> bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
>> to waste them on special-purpose stuff. We already NAKed someone else's
>> pet field here.
> Would it be acceptable if we union i_bpf_storage with i_security?

No!

> IOW, if CONFIG_SECURITY is enabled, we will use existing logic. 
> If CONFIG_SECURITY is not enabled, we will use i_bpf_storage. 
> Given majority of default configs have CONFIG_SECURITY=y, this 
> will not grow inode for most users. OTOH, users with 
> CONFIG_SECURITY=n && CONFIG_BPF_SYSCALL=y combination can still 
> use inode local storage in the tracing BPF programs. 
>
> Does this make sense?

All it would take is one BPF programmer assuming that CONFIG_SECURITY=n
is the norm for this to blow up spectacularly.

>
> Thanks,
> Song 
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-13 18:06         ` Casey Schaufler
@ 2024-11-13 18:57           ` Song Liu
  2024-11-14 16:36             ` Dr. Greg
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-13 18:57 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Wed, Nov 13, 2024 at 10:06 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 11/12/2024 5:37 PM, Song Liu wrote:
[...]
> > Could you provide more information on the definition of "more
> > consistent" LSM infrastructure?
>
> We're doing several things. The management of security blobs
> (e.g. inode->i_security) has been moved out of the individual
> modules and into the infrastructure. The use of a u32 secid is
> being replaced with a more general lsm_prop structure, except
> where networking code won't allow it. A good deal of work has
> gone into making the return values of LSM hooks consistent.

Thanks for the information. Unifying per-object memory usage of
different LSMs makes sense. However, I don't think we are limiting
any LSM to only use memory from the lsm_blobs. The LSMs still
have the freedom to use other memory allocators. BPF inode
local storage, just like other BPF maps, is a way to manage
memory. BPF LSM programs have full access to BPF maps. So
I don't think it makes sense to say this BPF map is used by tracing,
so we should not allow LSM to use it.

Does this make sense?
Song

> Some of this was done as part of the direct call change, and some
> in support of LSM stacking. There are also some hardening changes
> that aren't ready for prime-time, but that are in the works.
> There have been concerns about the potential expoitability of the
> LSM infrastructure, and we're serious about addressing those.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-13 18:29       ` Casey Schaufler
@ 2024-11-13 19:00         ` Song Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-13 19:00 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, Christian Brauner, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Wed, Nov 13, 2024 at 10:30 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 11/13/2024 6:15 AM, Song Liu wrote:
> > Hi Christian,
> >
> > Thanks for your review.
> >
> >> On Nov 13, 2024, at 2:19 AM, Christian Brauner <brauner@kernel.org> wrote:
> > [...]
> >
> >>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>> index 3559446279c1..479097e4dd5b 100644
> >>> --- a/include/linux/fs.h
> >>> +++ b/include/linux/fs.h
> >>> @@ -79,6 +79,7 @@ struct fs_context;
> >>> struct fs_parameter_spec;
> >>> struct fileattr;
> >>> struct iomap_ops;
> >>> +struct bpf_local_storage;
> >>>
> >>> extern void __init inode_init(void);
> >>> extern void __init inode_init_early(void);
> >>> @@ -648,6 +649,9 @@ struct inode {
> >>> #ifdef CONFIG_SECURITY
> >>> void *i_security;
> >>> #endif
> >>> +#ifdef CONFIG_BPF_SYSCALL
> >>> + struct bpf_local_storage __rcu *i_bpf_storage;
> >>> +#endif
> >> Sorry, we're not growing struct inode for this. It just keeps getting
> >> bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
> >> to waste them on special-purpose stuff. We already NAKed someone else's
> >> pet field here.
> > Would it be acceptable if we union i_bpf_storage with i_security?
>
> No!
>
> > IOW, if CONFIG_SECURITY is enabled, we will use existing logic.
> > If CONFIG_SECURITY is not enabled, we will use i_bpf_storage.
> > Given majority of default configs have CONFIG_SECURITY=y, this
> > will not grow inode for most users. OTOH, users with
> > CONFIG_SECURITY=n && CONFIG_BPF_SYSCALL=y combination can still
> > use inode local storage in the tracing BPF programs.
> >
> > Does this make sense?
>
> All it would take is one BPF programmer assuming that CONFIG_SECURITY=n
> is the norm for this to blow up spectacularly.

I seriously don't understand what would blow up and how. Can you be
more specific?

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-13 18:57           ` Song Liu
@ 2024-11-14 16:36             ` Dr. Greg
  2024-11-14 17:29               ` Casey Schaufler
  2024-11-14 17:51               ` Song Liu
  0 siblings, 2 replies; 53+ messages in thread
From: Dr. Greg @ 2024-11-14 16:36 UTC (permalink / raw)
  To: Song Liu
  Cc: Casey Schaufler, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Wed, Nov 13, 2024 at 10:57:05AM -0800, Song Liu wrote:

Good morning, I hope the week is going well for everyone.

> On Wed, Nov 13, 2024 at 10:06???AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > On 11/12/2024 5:37 PM, Song Liu wrote:
> [...]
> > > Could you provide more information on the definition of "more
> > > consistent" LSM infrastructure?
> >
> > We're doing several things. The management of security blobs
> > (e.g. inode->i_security) has been moved out of the individual
> > modules and into the infrastructure. The use of a u32 secid is
> > being replaced with a more general lsm_prop structure, except
> > where networking code won't allow it. A good deal of work has
> > gone into making the return values of LSM hooks consistent.
> 
> Thanks for the information. Unifying per-object memory usage of
> different LSMs makes sense. However, I don't think we are limiting
> any LSM to only use memory from the lsm_blobs. The LSMs still
> have the freedom to use other memory allocators. BPF inode
> local storage, just like other BPF maps, is a way to manage
> memory. BPF LSM programs have full access to BPF maps. So
> I don't think it makes sense to say this BPF map is used by tracing,
> so we should not allow LSM to use it.
> 
> Does this make sense?

As involved bystanders, some questions and thoughts that may help
further the discussion.

With respect to inode specific storage, the currently accepted pattern
in the LSM world is roughly as follows:

The LSM initialization code, at boot, computes the total amount of
storage needed by all of the LSM's that are requesting inode specific
storage.  A single pointer to that 'blob' of storage is included in
the inode structure.

In an include file, an inline function similar to the following is
declared, whose purpose is to return the location inside of the
allocated storage or 'LSM inode blob' where a particular LSM's inode
specific data structure is located:

static inline struct tsem_inode *tsem_inode(struct inode *inode)
{
	return inode->i_security + tsem_blob_sizes.lbs_inode;
}

In an LSM's implementation code, the function gets used in something
like the following manner:

static int tsem_inode_alloc_security(struct inode *inode)
{
	struct tsem_inode *tsip = tsem_inode(inode);

	/* Do something with the structure pointed to by tsip. */
}

Christian appears to have already chimed in and indicated that there
is no appetite to add another pointer member to the inode structure.

So, if this were to proceed forward, is it proposed that there will be
a 'flag day' requirement to have each LSM that uses inode specific
storage implement a security_inode_alloc() event handler that creates
an LSM specific BPF map key/value pair for that inode?

Which, in turn, would require that the accessor functions be converted
to use a bpf key request to return the LSM specific information for
that inode?

A flag day event is always somewhat of a concern, but the larger
concern may be the substitution of simple pointer arithmetic for a
body of more complex code.  One would assume with something like this,
that there may be a need for a shake-out period to determine what type
of potential regressions the more complex implementation may generate,
with regressions in security sensitive code always a concern.

In a larger context.  Given that the current implementation works on
simple pointer arithmetic over a common block of storage, there is not
much of a safety guarantee that one LSM couldn't interfere with the
inode storage of another LSM.  However, using a generic BPF construct
such as a map, would presumably open the level of influence over LSM
specific inode storage to a much larger audience, presumably any BPF
program that would be loaded.

The LSM inode information is obviously security sensitive, which I
presume would be be the motivation for Casey's concern that a 'mistake
by a BPF programmer could cause the whole system to blow up', which in
full disclosure is only a rough approximation of his statement.

We obviously can't speak directly to Casey's concerns.  Casey, any
specific technical comments on the challenges of using a common inode
specific storage architecture?

Song, FWIW going forward.  I don't know how closely you follow LSM
development, but we believe an unbiased observer would conclude that
there is some degree of reticence about BPF's involvement with the LSM
infrastructure by some of the core LSM maintainers, that in turn makes
these types of conversations technically sensitive.

> Song

We will look forward to your thoughts on the above.

Have a good week.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
              https://github.com/Quixote-Project

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 16:36             ` Dr. Greg
@ 2024-11-14 17:29               ` Casey Schaufler
  2024-11-14 18:08                 ` Song Liu
  2024-11-14 17:51               ` Song Liu
  1 sibling, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-14 17:29 UTC (permalink / raw)
  To: Dr. Greg, Song Liu
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com, Casey Schaufler

On 11/14/2024 8:36 AM, Dr. Greg wrote:
> On Wed, Nov 13, 2024 at 10:57:05AM -0800, Song Liu wrote:
>
> Good morning, I hope the week is going well for everyone.
>
>> On Wed, Nov 13, 2024 at 10:06???AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 11/12/2024 5:37 PM, Song Liu wrote:
>> [...]
>>>> Could you provide more information on the definition of "more
>>>> consistent" LSM infrastructure?
>>> We're doing several things. The management of security blobs
>>> (e.g. inode->i_security) has been moved out of the individual
>>> modules and into the infrastructure. The use of a u32 secid is
>>> being replaced with a more general lsm_prop structure, except
>>> where networking code won't allow it. A good deal of work has
>>> gone into making the return values of LSM hooks consistent.
>> Thanks for the information. Unifying per-object memory usage of
>> different LSMs makes sense. However, I don't think we are limiting
>> any LSM to only use memory from the lsm_blobs. The LSMs still
>> have the freedom to use other memory allocators. BPF inode
>> local storage, just like other BPF maps, is a way to manage
>> memory. BPF LSM programs have full access to BPF maps. So
>> I don't think it makes sense to say this BPF map is used by tracing,
>> so we should not allow LSM to use it.
>>
>> Does this make sense?
> As involved bystanders, some questions and thoughts that may help
> further the discussion.
>
> With respect to inode specific storage, the currently accepted pattern
> in the LSM world is roughly as follows:
>
> The LSM initialization code, at boot, computes the total amount of
> storage needed by all of the LSM's that are requesting inode specific
> storage.  A single pointer to that 'blob' of storage is included in
> the inode structure.
>
> In an include file, an inline function similar to the following is
> declared, whose purpose is to return the location inside of the
> allocated storage or 'LSM inode blob' where a particular LSM's inode
> specific data structure is located:
>
> static inline struct tsem_inode *tsem_inode(struct inode *inode)
> {
> 	return inode->i_security + tsem_blob_sizes.lbs_inode;
> }
>
> In an LSM's implementation code, the function gets used in something
> like the following manner:
>
> static int tsem_inode_alloc_security(struct inode *inode)
> {
> 	struct tsem_inode *tsip = tsem_inode(inode);
>
> 	/* Do something with the structure pointed to by tsip. */
> }
>
> Christian appears to have already chimed in and indicated that there
> is no appetite to add another pointer member to the inode structure.
>
> So, if this were to proceed forward, is it proposed that there will be
> a 'flag day' requirement to have each LSM that uses inode specific
> storage implement a security_inode_alloc() event handler that creates
> an LSM specific BPF map key/value pair for that inode?
>
> Which, in turn, would require that the accessor functions be converted
> to use a bpf key request to return the LSM specific information for
> that inode?
>
> A flag day event is always somewhat of a concern, but the larger
> concern may be the substitution of simple pointer arithmetic for a
> body of more complex code.  One would assume with something like this,
> that there may be a need for a shake-out period to determine what type
> of potential regressions the more complex implementation may generate,
> with regressions in security sensitive code always a concern.
>
> In a larger context.  Given that the current implementation works on
> simple pointer arithmetic over a common block of storage, there is not
> much of a safety guarantee that one LSM couldn't interfere with the
> inode storage of another LSM.  However, using a generic BPF construct
> such as a map, would presumably open the level of influence over LSM
> specific inode storage to a much larger audience, presumably any BPF
> program that would be loaded.
>
> The LSM inode information is obviously security sensitive, which I
> presume would be be the motivation for Casey's concern that a 'mistake
> by a BPF programmer could cause the whole system to blow up', which in
> full disclosure is only a rough approximation of his statement.
>
> We obviously can't speak directly to Casey's concerns.  Casey, any
> specific technical comments on the challenges of using a common inode
> specific storage architecture?

My objection to using a union for the BPF and LSM pointer is based
on the observation that a lot of modern programmers don't know what
a union does. The BPF programmer would see that there are two ways
to accomplish their task, one for CONFIG_SECURITY=y and the other
for when it isn't. The second is much simpler. Not understanding
how kernel configuration works, nor being "real" C language savvy,
the programmer installs code using the simpler interfaces on a
Redhat system. The SELinux inode data is compromised by the BPF
code, which thinks the data is its own. Hilarity ensues.

>
> Song, FWIW going forward.  I don't know how closely you follow LSM
> development, but we believe an unbiased observer would conclude that
> there is some degree of reticence about BPF's involvement with the LSM
> infrastructure by some of the core LSM maintainers, that in turn makes
> these types of conversations technically sensitive.
>
>> Song
> We will look forward to your thoughts on the above.
>
> Have a good week.
>
> As always,
> Dr. Greg
>
> The Quixote Project - Flailing at the Travails of Cybersecurity
>               https://github.com/Quixote-Project

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 16:36             ` Dr. Greg
  2024-11-14 17:29               ` Casey Schaufler
@ 2024-11-14 17:51               ` Song Liu
  1 sibling, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-14 17:51 UTC (permalink / raw)
  To: Dr. Greg
  Cc: Song Liu, Casey Schaufler, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

Hi Dr. Greg, 

Thanks for your input on this. 

> On Nov 14, 2024, at 8:36 AM, Dr. Greg <greg@enjellic.com> wrote:
> 
> On Wed, Nov 13, 2024 at 10:57:05AM -0800, Song Liu wrote:
> 
> Good morning, I hope the week is going well for everyone.
> 
>> On Wed, Nov 13, 2024 at 10:06???AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> 
>>> On 11/12/2024 5:37 PM, Song Liu wrote:
>> [...]
>>>> Could you provide more information on the definition of "more
>>>> consistent" LSM infrastructure?
>>> 
>>> We're doing several things. The management of security blobs
>>> (e.g. inode->i_security) has been moved out of the individual
>>> modules and into the infrastructure. The use of a u32 secid is
>>> being replaced with a more general lsm_prop structure, except
>>> where networking code won't allow it. A good deal of work has
>>> gone into making the return values of LSM hooks consistent.
>> 
>> Thanks for the information. Unifying per-object memory usage of
>> different LSMs makes sense. However, I don't think we are limiting
>> any LSM to only use memory from the lsm_blobs. The LSMs still
>> have the freedom to use other memory allocators. BPF inode
>> local storage, just like other BPF maps, is a way to manage
>> memory. BPF LSM programs have full access to BPF maps. So
>> I don't think it makes sense to say this BPF map is used by tracing,
>> so we should not allow LSM to use it.
>> 
>> Does this make sense?
> 
> As involved bystanders, some questions and thoughts that may help
> further the discussion.
> 
> With respect to inode specific storage, the currently accepted pattern
> in the LSM world is roughly as follows:
> 
> The LSM initialization code, at boot, computes the total amount of
> storage needed by all of the LSM's that are requesting inode specific
> storage.  A single pointer to that 'blob' of storage is included in
> the inode structure.
> 
> In an include file, an inline function similar to the following is
> declared, whose purpose is to return the location inside of the
> allocated storage or 'LSM inode blob' where a particular LSM's inode
> specific data structure is located:
> 
> static inline struct tsem_inode *tsem_inode(struct inode *inode)
> {
> return inode->i_security + tsem_blob_sizes.lbs_inode;
> }
> 
> In an LSM's implementation code, the function gets used in something
> like the following manner:
> 
> static int tsem_inode_alloc_security(struct inode *inode)
> {
> struct tsem_inode *tsip = tsem_inode(inode);
> 
> /* Do something with the structure pointed to by tsip. */
> }

Yes, I am fully aware how most LSMs allocate and use these 
inode/task/etc. storage. 

> Christian appears to have already chimed in and indicated that there
> is no appetite to add another pointer member to the inode structure.

If I understand Christian correctly, his concern comes from the 
size of inode, and thus the impact on memory footprint and CPU
cache usage of all the inode in the system. While we got easier 
time adding a pointer to other data structures, for example socket,
I personally acknowledge Christian's concern and I am motivated to 
make changes to reduce the size of inode. 

> So, if this were to proceed forward, is it proposed that there will be
> a 'flag day' requirement to have each LSM that uses inode specific
> storage implement a security_inode_alloc() event handler that creates
> an LSM specific BPF map key/value pair for that inode?
> 
> Which, in turn, would require that the accessor functions be converted
> to use a bpf key request to return the LSM specific information for
> that inode?

I never thought about asking other LSMs to make any changes. 
At the moment, none of the BPF maps are available to none BPF
code. 

> A flag day event is always somewhat of a concern, but the larger
> concern may be the substitution of simple pointer arithmetic for a
> body of more complex code.  One would assume with something like this,
> that there may be a need for a shake-out period to determine what type
> of potential regressions the more complex implementation may generate,
> with regressions in security sensitive code always a concern.
> 
> In a larger context.  Given that the current implementation works on
> simple pointer arithmetic over a common block of storage, there is not
> much of a safety guarantee that one LSM couldn't interfere with the
> inode storage of another LSM.  However, using a generic BPF construct
> such as a map, would presumably open the level of influence over LSM
> specific inode storage to a much larger audience, presumably any BPF
> program that would be loaded.

To be honest, I think bpf maps provide much better data isolation 
than a common block of storage. The creator of each bpf map has 
_full control_ who can access the map. The only exception is with
CAP_SYS_ADMIN, where the root user can access all bpf maps in the 
system. I don't think this has any security concern over the common
block of storage, as the root user can easily probe any data in the 
common block of storage via /proc/kcore. 

> 
> The LSM inode information is obviously security sensitive, which I
> presume would be be the motivation for Casey's concern that a 'mistake
> by a BPF programmer could cause the whole system to blow up', which in
> full disclosure is only a rough approximation of his statement.
> 
> We obviously can't speak directly to Casey's concerns.  Casey, any
> specific technical comments on the challenges of using a common inode
> specific storage architecture?
> 
> Song, FWIW going forward.  I don't know how closely you follow LSM
> development, but we believe an unbiased observer would conclude that
> there is some degree of reticence about BPF's involvement with the LSM
> infrastructure by some of the core LSM maintainers, that in turn makes
> these types of conversations technically sensitive.

I think I indeed got much more push back than I would imagine. 
However, as always, I value everyone's perspective and I am
willing make reasonable changes to address valid concerns. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 17:29               ` Casey Schaufler
@ 2024-11-14 18:08                 ` Song Liu
  2024-11-14 21:49                   ` James Bottomley
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-14 18:08 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Dr. Greg, Song Liu, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com



> On Nov 14, 2024, at 9:29 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:

[...]

>> 
>> 
>> The LSM inode information is obviously security sensitive, which I
>> presume would be be the motivation for Casey's concern that a 'mistake
>> by a BPF programmer could cause the whole system to blow up', which in
>> full disclosure is only a rough approximation of his statement.
>> 
>> We obviously can't speak directly to Casey's concerns.  Casey, any
>> specific technical comments on the challenges of using a common inode
>> specific storage architecture?
> 
> My objection to using a union for the BPF and LSM pointer is based
> on the observation that a lot of modern programmers don't know what
> a union does. The BPF programmer would see that there are two ways
> to accomplish their task, one for CONFIG_SECURITY=y and the other
> for when it isn't. The second is much simpler. Not understanding
> how kernel configuration works, nor being "real" C language savvy,
> the programmer installs code using the simpler interfaces on a
> Redhat system. The SELinux inode data is compromised by the BPF
> code, which thinks the data is its own. Hilarity ensues.

There must be some serious misunderstanding here. So let me 
explain the idea again. 

With CONFIG_SECURITY=y, the code will work the same as right now. 
BPF inode storage uses i_security, just as any other LSMs. 

With CONFIG_SECURITY=n, i_security does not exist, so the bpf
inode storage will use i_bpf_storage. 

Since this is a CONFIG_, all the logic got sorted out at compile
time. Thus the user API (for user space and for bpf programs) 
stays the same. 


Actually, I can understand the concern with union. Although, 
the logic is set at kernel compile time, it is still possible 
for kernel source code to use i_bpf_storage when 
CONFIG_SECURITY is enabled. (Yes, I guess now I finally understand
the concern). 

We can address this with something like following:

#ifdef CONFIG_SECURITY
        void                    *i_security;
#elif CONFIG_BPF_SYSCALL
        struct bpf_local_storage __rcu *i_bpf_storage;
#endif

This will help catch all misuse of the i_bpf_storage at compile
time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 

Does this make sense?

Thanks,
Song


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-13 10:19   ` Christian Brauner
  2024-11-13 14:15     ` Song Liu
@ 2024-11-14 21:11     ` Song Liu
  2024-11-15 11:19       ` Jan Kara
  1 sibling, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-14 21:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

Hi Christian, 

> On Nov 13, 2024, at 2:19 AM, Christian Brauner <brauner@kernel.org> wrote:

[...]

>> static inline void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
>>   bpf_func_t *bpf_func)
>> {
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 3559446279c1..479097e4dd5b 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -79,6 +79,7 @@ struct fs_context;
>> struct fs_parameter_spec;
>> struct fileattr;
>> struct iomap_ops;
>> +struct bpf_local_storage;
>> 
>> extern void __init inode_init(void);
>> extern void __init inode_init_early(void);
>> @@ -648,6 +649,9 @@ struct inode {
>> #ifdef CONFIG_SECURITY
>> void *i_security;
>> #endif
>> +#ifdef CONFIG_BPF_SYSCALL
>> + struct bpf_local_storage __rcu *i_bpf_storage;
>> +#endif
> 
> Sorry, we're not growing struct inode for this. It just keeps getting
> bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
> to waste them on special-purpose stuff. We already NAKed someone else's
> pet field here.

Per other discussions in this thread, I am implementing the following:

#ifdef CONFIG_SECURITY
        void                    *i_security;
#elif CONFIG_BPF_SYSCALL
        struct bpf_local_storage __rcu *i_bpf_storage;
#endif

However, it is a bit trickier than I thought. Specifically, we need 
to deal with the following scenarios:

1. CONFIG_SECURITY=y && CONFIG_BPF_LSM=n && CONFIG_BPF_SYSCALL=y
2. CONFIG_SECURITY=y && CONFIG_BPF_LSM=y && CONFIG_BPF_SYSCALL=y but 
   bpf lsm is not enabled at boot time. 

AFAICT, we need to modify how lsm blob are managed with 
CONFIG_BPF_SYSCALL=y && CONFIG_BPF_LSM=n case. The solution, even
if it gets accepted, doesn't really save any memory. Instead of 
growing struct inode by 8 bytes, the solution will allocate 8
more bytes to inode->i_security. So the total memory consumption
is the same, but the memory is more fragmented. 

Therefore, I think we should really step back and consider adding
the i_bpf_storage to struct inode. While this does increase the
size of struct inode by 8 bytes, it may end up with less overall
memory consumption for the system. This is why. 

When the user cannot use inode local storage, the alternative is 
to use hash maps (use inode pointer as key). AFAICT, all hash maps 
comes with non-trivial overhead, in memory consumption, in access 
latency, and in extra code to manage the memory. OTOH, inode local 
storage doesn't have these issue, and is usually much more efficient: 
 - memory is only allocated for inodes with actual data, 
 - O(1) latency, 
 - per inode data is freed automatically when the inode is evicted. 
Please refer to [1] where Amir mentioned all the work needed to 
properly manage a hash map, and I explained why we don't need to 
worry about these with inode local storage. 

Besides reducing memory consumption, i_bpf_storage also shortens 
the pointer chain to access inode local storage. Before this set, 
inode local storage is available at 
inode->i_security+offset(struct bpf_storage_blob)->storage. After 
this set, inode local storage is simply at inode->i_bpf_storage. 

At the moment, we are using bpf local storage with 
task_struct->bpf_storage, struct sock->sk_bpf_storage, 
struct cgroup->bpf_cgrp_storage. All of these turned out to be 
successful and helped users to use memory more efficiently. I 
think we can see the same benefits with struct inode->i_bpf_storage. 

I hope these make sense, and you will consider adding i_bpf_storage.
Please let me know if anything above is not clear. 

Thanks,
Song

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjXjjkKMa1xcPyxE5vxh1U5oGZJWtofRCwp-3ikCA6cgg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 18:08                 ` Song Liu
@ 2024-11-14 21:49                   ` James Bottomley
  2024-11-14 22:30                     ` Song Liu
                                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: James Bottomley @ 2024-11-14 21:49 UTC (permalink / raw)
  To: Song Liu, Casey Schaufler
  Cc: Dr. Greg, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Thu, 2024-11-14 at 18:08 +0000, Song Liu wrote:
> 
> 
> > On Nov 14, 2024, at 9:29 AM, Casey Schaufler
> > <casey@schaufler-ca.com> wrote:
> 
> [...]
> 
> > > 
> > > 
> > > The LSM inode information is obviously security sensitive, which
> > > I
> > > presume would be be the motivation for Casey's concern that a
> > > 'mistake
> > > by a BPF programmer could cause the whole system to blow up',
> > > which in
> > > full disclosure is only a rough approximation of his statement.
> > > 
> > > We obviously can't speak directly to Casey's concerns.  Casey,
> > > any
> > > specific technical comments on the challenges of using a common
> > > inode
> > > specific storage architecture?
> > 
> > My objection to using a union for the BPF and LSM pointer is based
> > on the observation that a lot of modern programmers don't know what
> > a union does. The BPF programmer would see that there are two ways
> > to accomplish their task, one for CONFIG_SECURITY=y and the other
> > for when it isn't. The second is much simpler. Not understanding
> > how kernel configuration works, nor being "real" C language savvy,
> > the programmer installs code using the simpler interfaces on a
> > Redhat system. The SELinux inode data is compromised by the BPF
> > code, which thinks the data is its own. Hilarity ensues.
> 
> There must be some serious misunderstanding here. So let me 
> explain the idea again. 
> 
> With CONFIG_SECURITY=y, the code will work the same as right now. 
> BPF inode storage uses i_security, just as any other LSMs. 
> 
> With CONFIG_SECURITY=n, i_security does not exist, so the bpf
> inode storage will use i_bpf_storage. 
> 
> Since this is a CONFIG_, all the logic got sorted out at compile
> time. Thus the user API (for user space and for bpf programs) 
> stays the same. 
> 
> 
> Actually, I can understand the concern with union. Although, 
> the logic is set at kernel compile time, it is still possible 
> for kernel source code to use i_bpf_storage when 
> CONFIG_SECURITY is enabled. (Yes, I guess now I finally understand
> the concern). 
> 
> We can address this with something like following:
> 
> #ifdef CONFIG_SECURITY
>         void                    *i_security;
> #elif CONFIG_BPF_SYSCALL
>         struct bpf_local_storage __rcu *i_bpf_storage;
> #endif
> 
> This will help catch all misuse of the i_bpf_storage at compile
> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
> 
> Does this make sense?

Got to say I'm with Casey here, this will generate horrible and failure
prone code.

Since effectively you're making i_security always present anyway,
simply do that and also pull the allocation code out of security.c in a
way that it's always available?  That way you don't have to special
case the code depending on whether CONFIG_SECURITY is defined. 
Effectively this would give everyone a generic way to attach some
memory area to an inode.  I know it's more complex than this because
there are LSM hooks that run from security_inode_alloc() but if you can
make it work generically, I'm sure everyone will benefit.

Regards,

James




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 21:49                   ` James Bottomley
@ 2024-11-14 22:30                     ` Song Liu
  2024-11-17 22:59                     ` Song Liu
  2024-11-23 19:11                     ` Paul Moore
  2 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-14 22:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Song Liu, Casey Schaufler, Dr. Greg, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

Hi James, 

Thanks for your input!

> On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

[...]

>> 
>> Actually, I can understand the concern with union. Although, 
>> the logic is set at kernel compile time, it is still possible 
>> for kernel source code to use i_bpf_storage when 
>> CONFIG_SECURITY is enabled. (Yes, I guess now I finally understand
>> the concern). 
>> 
>> We can address this with something like following:
>> 
>> #ifdef CONFIG_SECURITY
>>         void                    *i_security;
>> #elif CONFIG_BPF_SYSCALL
>>         struct bpf_local_storage __rcu *i_bpf_storage;
>> #endif
>> 
>> This will help catch all misuse of the i_bpf_storage at compile
>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
>> 
>> Does this make sense?
> 
> Got to say I'm with Casey here, this will generate horrible and failure
> prone code.

Yes, as I described in another email in the thread [1], this turned
out to cause more troubles than I thought. 

> Since effectively you're making i_security always present anyway,
> simply do that and also pull the allocation code out of security.c in a
> way that it's always available?  

I think this is a very good idea. If folks agree with this approach, 
I am more than happy to draft patch for this. 

Thanks again, 

Song

> That way you don't have to special
> case the code depending on whether CONFIG_SECURITY is defined. 
> Effectively this would give everyone a generic way to attach some
> memory area to an inode.  I know it's more complex than this because
> there are LSM hooks that run from security_inode_alloc() but if you can
> make it work generically, I'm sure everyone will benefit.

[1] https://lore.kernel.org/linux-fsdevel/86C65B85-8167-4D04-BFF5-40FD4F3407A4@fb.com/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-14 21:11     ` Song Liu
@ 2024-11-15 11:19       ` Jan Kara
  2024-11-15 17:35         ` Song Liu
  2024-11-21  9:14         ` Christian Brauner
  0 siblings, 2 replies; 53+ messages in thread
From: Jan Kara @ 2024-11-15 11:19 UTC (permalink / raw)
  To: Song Liu
  Cc: Christian Brauner, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

Hi Song!

On Thu 14-11-24 21:11:57, Song Liu wrote:
> > On Nov 13, 2024, at 2:19 AM, Christian Brauner <brauner@kernel.org> wrote:
> >> static inline void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
> >>   bpf_func_t *bpf_func)
> >> {
> >> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >> index 3559446279c1..479097e4dd5b 100644
> >> --- a/include/linux/fs.h
> >> +++ b/include/linux/fs.h
> >> @@ -79,6 +79,7 @@ struct fs_context;
> >> struct fs_parameter_spec;
> >> struct fileattr;
> >> struct iomap_ops;
> >> +struct bpf_local_storage;
> >> 
> >> extern void __init inode_init(void);
> >> extern void __init inode_init_early(void);
> >> @@ -648,6 +649,9 @@ struct inode {
> >> #ifdef CONFIG_SECURITY
> >> void *i_security;
> >> #endif
> >> +#ifdef CONFIG_BPF_SYSCALL
> >> + struct bpf_local_storage __rcu *i_bpf_storage;
> >> +#endif
> > 
> > Sorry, we're not growing struct inode for this. It just keeps getting
> > bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
> > to waste them on special-purpose stuff. We already NAKed someone else's
> > pet field here.
> 
> Per other discussions in this thread, I am implementing the following:
> 
> #ifdef CONFIG_SECURITY
>         void                    *i_security;
> #elif CONFIG_BPF_SYSCALL
>         struct bpf_local_storage __rcu *i_bpf_storage;
> #endif
> 
> However, it is a bit trickier than I thought. Specifically, we need 
> to deal with the following scenarios:
>  
> 1. CONFIG_SECURITY=y && CONFIG_BPF_LSM=n && CONFIG_BPF_SYSCALL=y
> 2. CONFIG_SECURITY=y && CONFIG_BPF_LSM=y && CONFIG_BPF_SYSCALL=y but 
>    bpf lsm is not enabled at boot time. 
> 
> AFAICT, we need to modify how lsm blob are managed with 
> CONFIG_BPF_SYSCALL=y && CONFIG_BPF_LSM=n case. The solution, even
> if it gets accepted, doesn't really save any memory. Instead of 
> growing struct inode by 8 bytes, the solution will allocate 8
> more bytes to inode->i_security. So the total memory consumption
> is the same, but the memory is more fragmented. 

I guess you've found a better solution for this based on James' suggestion.

> Therefore, I think we should really step back and consider adding
> the i_bpf_storage to struct inode. While this does increase the
> size of struct inode by 8 bytes, it may end up with less overall
> memory consumption for the system. This is why. 
>
> When the user cannot use inode local storage, the alternative is 
> to use hash maps (use inode pointer as key). AFAICT, all hash maps 
> comes with non-trivial overhead, in memory consumption, in access 
> latency, and in extra code to manage the memory. OTOH, inode local 
> storage doesn't have these issue, and is usually much more efficient: 
>  - memory is only allocated for inodes with actual data, 
>  - O(1) latency, 
>  - per inode data is freed automatically when the inode is evicted. 
> Please refer to [1] where Amir mentioned all the work needed to 
> properly manage a hash map, and I explained why we don't need to 
> worry about these with inode local storage. 

Well, but here you are speaking of a situation where bpf inode storage
space gets actually used for most inodes. Then I agree i_bpf_storage is the
most economic solution. But I'd also expect that for vast majority of
systems the bpf inode storage isn't used at all and if it does get used, it
is used only for a small fraction of inodes. So we are weighting 8 bytes
per inode for all those users that don't need it against more significant
memory savings for users that actually do need per inode bpf storage. A
factor in this is that a lot of people are running some distribution kernel
which generally enables most config options that are at least somewhat
useful. So hiding the cost behind CONFIG_FOO doesn't really help such
people.
 
I'm personally not *so* hung up about a pointer in struct inode but I can
see why Christian is and I agree adding a pointer there isn't a win for
everybody.

Longer term, I think it may be beneficial to come up with a way to attach
private info to the inode in a way that doesn't cost us one pointer per
funcionality that may possibly attach info to the inode. We already have
i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
call where the space overhead for everybody is worth the runtime &
complexity overhead for users using the functionality...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-15 11:19       ` Jan Kara
@ 2024-11-15 17:35         ` Song Liu
  2024-11-19 14:21           ` Jeff Layton
  2024-11-21  9:14         ` Christian Brauner
  1 sibling, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-15 17:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Song Liu, Christian Brauner, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

Hi Jan, 

> On Nov 15, 2024, at 3:19 AM, Jan Kara <jack@suse.cz> wrote:

[...]

>> AFAICT, we need to modify how lsm blob are managed with 
>> CONFIG_BPF_SYSCALL=y && CONFIG_BPF_LSM=n case. The solution, even
>> if it gets accepted, doesn't really save any memory. Instead of 
>> growing struct inode by 8 bytes, the solution will allocate 8
>> more bytes to inode->i_security. So the total memory consumption
>> is the same, but the memory is more fragmented.
> 
> I guess you've found a better solution for this based on James' suggestion.
> 
>> Therefore, I think we should really step back and consider adding
>> the i_bpf_storage to struct inode. While this does increase the
>> size of struct inode by 8 bytes, it may end up with less overall
>> memory consumption for the system. This is why. 
>> 
>> When the user cannot use inode local storage, the alternative is 
>> to use hash maps (use inode pointer as key). AFAICT, all hash maps 
>> comes with non-trivial overhead, in memory consumption, in access 
>> latency, and in extra code to manage the memory. OTOH, inode local 
>> storage doesn't have these issue, and is usually much more efficient: 
>> - memory is only allocated for inodes with actual data, 
>> - O(1) latency, 
>> - per inode data is freed automatically when the inode is evicted. 
>> Please refer to [1] where Amir mentioned all the work needed to 
>> properly manage a hash map, and I explained why we don't need to 
>> worry about these with inode local storage.
> 
> Well, but here you are speaking of a situation where bpf inode storage
> space gets actually used for most inodes. Then I agree i_bpf_storage is the
> most economic solution. But I'd also expect that for vast majority of
> systems the bpf inode storage isn't used at all and if it does get used, it
> is used only for a small fraction of inodes. So we are weighting 8 bytes
> per inode for all those users that don't need it against more significant
> memory savings for users that actually do need per inode bpf storage. A
> factor in this is that a lot of people are running some distribution kernel
> which generally enables most config options that are at least somewhat
> useful. So hiding the cost behind CONFIG_FOO doesn't really help such
> people.

Agreed that an extra pointer will be used if there is no actual users
of it. However, in longer term, "most users do not use bpf inode
storage" may not be true. As kernel engineers, we may not always notice 
when user space is using some BPF features. For example, systemd has
a BPF LSM program "restrict_filesystems" [1]. It is enabled if the 
user have lsm=bpf in kernel args. I personally noticed it as a 
surprise when we enabled lsm=bpf. 

> I'm personally not *so* hung up about a pointer in struct inode but I can
> see why Christian is and I agree adding a pointer there isn't a win for
> everybody.

I can also understand Christian's motivation. However, I am a bit
frustrated because similar approach (adding a pointer to the struct) 
worked fine for other popular data structures: task_struct, sock, 
cgroup. 

> Longer term, I think it may be beneficial to come up with a way to attach
> private info to the inode in a way that doesn't cost us one pointer per
> funcionality that may possibly attach info to the inode. We already have
> i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> call where the space overhead for everybody is worth the runtime &
> complexity overhead for users using the functionality...

It does seem to be the right long term solution, and I am willing to 
work on it. However, I would really appreciate some positive feedback
on the idea, so that I have better confidence my weeks of work has a 
better chance to worth it. 

Thanks,
Song

[1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 21:49                   ` James Bottomley
  2024-11-14 22:30                     ` Song Liu
@ 2024-11-17 22:59                     ` Song Liu
  2024-11-19 12:27                       ` Dr. Greg
  2024-11-23 19:11                     ` Paul Moore
  2 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-11-17 22:59 UTC (permalink / raw)
  To: James Bottomley, jack@suse.cz, brauner@kernel.org
  Cc: Song Liu, Casey Schaufler, Dr. Greg, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

Hi Christian, James and Jan, 

> On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

[...]

>> 
>> We can address this with something like following:
>> 
>> #ifdef CONFIG_SECURITY
>>         void                    *i_security;
>> #elif CONFIG_BPF_SYSCALL
>>         struct bpf_local_storage __rcu *i_bpf_storage;
>> #endif
>> 
>> This will help catch all misuse of the i_bpf_storage at compile
>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
>> 
>> Does this make sense?
> 
> Got to say I'm with Casey here, this will generate horrible and failure
> prone code.
> 
> Since effectively you're making i_security always present anyway,
> simply do that and also pull the allocation code out of security.c in a
> way that it's always available?  That way you don't have to special
> case the code depending on whether CONFIG_SECURITY is defined. 
> Effectively this would give everyone a generic way to attach some
> memory area to an inode.  I know it's more complex than this because
> there are LSM hooks that run from security_inode_alloc() but if you can
> make it work generically, I'm sure everyone will benefit.

On a second thought, I think making i_security generic is not 
the right solution for "BPF inode storage in tracing use cases". 

This is because i_security serves a very specific use case: it 
points to a piece of memory whose size is calculated at system 
boot time. If some of the supported LSMs is not enabled by the 
lsm= kernel arg, the kernel will not allocate memory in 
i_security for them. The only way to change lsm= is to reboot 
the system. BPF LSM programs can be disabled at the boot time, 
which fits well in i_security. However, BPF tracing programs 
cannot be disabled at boot time (even we change the code to 
make it possible, we are not likely to disable BPF tracing). 
IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some 
BPF tracing programs to load at some point of time, and these 
programs may use BPF inode storage. 

Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory 
always will be attached to i_security (maybe under a different 
name, say, i_generic) of every inode. In this case, we should 
really add i_bpf_storage directly to the inode, because another 
pointer jump via i_generic gives nothing but overhead. 

Does this make sense? Or did I misunderstand the suggestion?

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-17 22:59                     ` Song Liu
@ 2024-11-19 12:27                       ` Dr. Greg
  2024-11-19 18:14                         ` Casey Schaufler
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. Greg @ 2024-11-19 12:27 UTC (permalink / raw)
  To: Song Liu
  Cc: James Bottomley, jack@suse.cz, brauner@kernel.org,
	Casey Schaufler, Dr. Greg, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:

> Hi Christian, James and Jan, 

Good morning, I hope the day is starting well for everyone.

> > On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> [...]
> 
> >> 
> >> We can address this with something like following:
> >> 
> >> #ifdef CONFIG_SECURITY
> >>         void                    *i_security;
> >> #elif CONFIG_BPF_SYSCALL
> >>         struct bpf_local_storage __rcu *i_bpf_storage;
> >> #endif
> >> 
> >> This will help catch all misuse of the i_bpf_storage at compile
> >> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
> >> 
> >> Does this make sense?
> > 
> > Got to say I'm with Casey here, this will generate horrible and failure
> > prone code.
> > 
> > Since effectively you're making i_security always present anyway,
> > simply do that and also pull the allocation code out of security.c in a
> > way that it's always available?  That way you don't have to special
> > case the code depending on whether CONFIG_SECURITY is defined. 
> > Effectively this would give everyone a generic way to attach some
> > memory area to an inode.  I know it's more complex than this because
> > there are LSM hooks that run from security_inode_alloc() but if you can
> > make it work generically, I'm sure everyone will benefit.

> On a second thought, I think making i_security generic is not 
> the right solution for "BPF inode storage in tracing use cases". 
> 
> This is because i_security serves a very specific use case: it 
> points to a piece of memory whose size is calculated at system 
> boot time. If some of the supported LSMs is not enabled by the 
> lsm= kernel arg, the kernel will not allocate memory in 
> i_security for them. The only way to change lsm= is to reboot 
> the system. BPF LSM programs can be disabled at the boot time, 
> which fits well in i_security. However, BPF tracing programs 
> cannot be disabled at boot time (even we change the code to 
> make it possible, we are not likely to disable BPF tracing). 
> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some 
> BPF tracing programs to load at some point of time, and these 
> programs may use BPF inode storage. 
> 
> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory 
> always will be attached to i_security (maybe under a different 
> name, say, i_generic) of every inode. In this case, we should 
> really add i_bpf_storage directly to the inode, because another 
> pointer jump via i_generic gives nothing but overhead. 
> 
> Does this make sense? Or did I misunderstand the suggestion?

There is a colloquialism that seems relevant here: "Pick your poison".

In the greater interests of the kernel, it seems that a generic
mechanism for attaching per inode information is the only realistic
path forward, unless Christian changes his position on expanding
the size of struct inode.

There are two pathways forward.

1.) Attach a constant size 'blob' of storage to each inode.

This is a similar approach to what the LSM uses where each blob is
sized as follows:

S = U * sizeof(void *)

Where U is the number of sub-systems that have a desire to use inode
specific storage.

Each sub-system uses it's pointer slot to manage any additional
storage that it desires to attach to the inode.

This has the obvious advantage of O(1) cost complexity for any
sub-system that wants to access its inode specific storage.

The disadvantage, as you note, is that it wastes memory if a
sub-system does not elect to attach per inode information, for example
the tracing infrastructure.

This disadvantage is parried by the fact that it reduces the size of
the inode proper by 24 bytes (4 pointers down to 1) and allows future
extensibility without colliding with the interests and desires of the
VFS maintainers.

2.) Implement key/value mapping for inode specific storage.

The key would be a sub-system specific numeric value that returns a
pointer the sub-system uses to manage its inode specific memory for a
particular inode.

A participating sub-system in turn uses its identifier to register an
inode specific pointer for its sub-system.

This strategy loses O(1) lookup complexity but reduces total memory
consumption and only imposes memory costs for inodes when a sub-system
desires to use inode specific storage.

Approach 2 requires the introduction of generic infrastructure that
allows an inode's key/value mappings to be located, presumably based
on the inode's pointer value.  We could probably just resurrect the
old IMA iint code for this purpose.

In the end it comes down to a rather standard trade-off in this
business, memory vs. execution cost.

We would posit that option 2 is the only viable scheme if the design
metric is overall good for the Linux kernel eco-system.

> Thanks,
> Song

Have a good day.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
              https://github.com/Quixote-Project

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-15 17:35         ` Song Liu
@ 2024-11-19 14:21           ` Jeff Layton
  2024-11-19 15:25             ` Amir Goldstein
  0 siblings, 1 reply; 53+ messages in thread
From: Jeff Layton @ 2024-11-19 14:21 UTC (permalink / raw)
  To: Song Liu, Jan Kara
  Cc: Christian Brauner, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	Josef Bacik, mic@digikod.net, gnoack@google.com

On Fri, 2024-11-15 at 17:35 +0000, Song Liu wrote:
> Hi Jan, 
> 
> > On Nov 15, 2024, at 3:19 AM, Jan Kara <jack@suse.cz> wrote:
> 
> [...]
> 
> > > AFAICT, we need to modify how lsm blob are managed with 
> > > CONFIG_BPF_SYSCALL=y && CONFIG_BPF_LSM=n case. The solution, even
> > > if it gets accepted, doesn't really save any memory. Instead of 
> > > growing struct inode by 8 bytes, the solution will allocate 8
> > > more bytes to inode->i_security. So the total memory consumption
> > > is the same, but the memory is more fragmented.
> > 
> > I guess you've found a better solution for this based on James' suggestion.
> > 
> > > Therefore, I think we should really step back and consider adding
> > > the i_bpf_storage to struct inode. While this does increase the
> > > size of struct inode by 8 bytes, it may end up with less overall
> > > memory consumption for the system. This is why. 
> > > 
> > > When the user cannot use inode local storage, the alternative is 
> > > to use hash maps (use inode pointer as key). AFAICT, all hash maps 
> > > comes with non-trivial overhead, in memory consumption, in access 
> > > latency, and in extra code to manage the memory. OTOH, inode local 
> > > storage doesn't have these issue, and is usually much more efficient: 
> > > - memory is only allocated for inodes with actual data, 
> > > - O(1) latency, 
> > > - per inode data is freed automatically when the inode is evicted. 
> > > Please refer to [1] where Amir mentioned all the work needed to 
> > > properly manage a hash map, and I explained why we don't need to 
> > > worry about these with inode local storage.
> > 
> > Well, but here you are speaking of a situation where bpf inode storage
> > space gets actually used for most inodes. Then I agree i_bpf_storage is the
> > most economic solution. But I'd also expect that for vast majority of
> > systems the bpf inode storage isn't used at all and if it does get used, it
> > is used only for a small fraction of inodes. So we are weighting 8 bytes
> > per inode for all those users that don't need it against more significant
> > memory savings for users that actually do need per inode bpf storage. A
> > factor in this is that a lot of people are running some distribution kernel
> > which generally enables most config options that are at least somewhat
> > useful. So hiding the cost behind CONFIG_FOO doesn't really help such
> > people.
> 
> Agreed that an extra pointer will be used if there is no actual users
> of it. However, in longer term, "most users do not use bpf inode
> storage" may not be true. As kernel engineers, we may not always notice 
> when user space is using some BPF features. For example, systemd has
> a BPF LSM program "restrict_filesystems" [1]. It is enabled if the 
> user have lsm=bpf in kernel args. I personally noticed it as a 
> surprise when we enabled lsm=bpf. 
> 
> > I'm personally not *so* hung up about a pointer in struct inode but I can
> > see why Christian is and I agree adding a pointer there isn't a win for
> > everybody.
> 
> I can also understand Christian's motivation. However, I am a bit
> frustrated because similar approach (adding a pointer to the struct) 
> worked fine for other popular data structures: task_struct, sock, 
> cgroup. 
> 

There are (usually) a lot more inodes on a host than all of those other
structs combined. Worse, struct inode is often embedded in other
structs, and adding fields can cause alignment problems there.


> > Longer term, I think it may be beneficial to come up with a way to attach
> > private info to the inode in a way that doesn't cost us one pointer per
> > funcionality that may possibly attach info to the inode. We already have
> > i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> > call where the space overhead for everybody is worth the runtime &
> > complexity overhead for users using the functionality...
> 
> It does seem to be the right long term solution, and I am willing to 
> work on it. However, I would really appreciate some positive feedback
> on the idea, so that I have better confidence my weeks of work has a 
> better chance to worth it. 
> 
> Thanks,
> Song
> 
> [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c

fsnotify is somewhat similar to file locking in that few inodes on the
machine actually utilize these fields.

For file locking, we allocate and populate the inode->i_flctx field on
an as-needed basis. The kernel then hangs on to that struct until the
inode is freed. We could do something similar here. We have this now:

#ifdef CONFIG_FSNOTIFY
        __u32                   i_fsnotify_mask; /* all events this inode cares about */
        /* 32-bit hole reserved for expanding i_fsnotify_mask */
        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
#endif

What if you were to turn these fields into a pointer to a new struct:

	struct fsnotify_inode_context {
		struct fsnotify_mark_connector __rcu	*i_fsnotify_marks;
		struct bpf_local_storage __rcu		*i_bpf_storage;
		__u32                   		i_fsnotify_mask; /* all events this inode cares about */
	};

Then whenever you have to populate any of these fields, you just
allocate one of these structs and set the inode up to point to it.
They're tiny too, so don't bother freeing it until the inode is
deallocated.

It'd mean rejiggering a fair bit of fsnotify code, but it would give
the fsnotify code an easier way to expand per-inode info in the future.
It would also slightly shrink struct inode too.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-19 14:21           ` Jeff Layton
@ 2024-11-19 15:25             ` Amir Goldstein
  2024-11-19 15:30               ` Amir Goldstein
  0 siblings, 1 reply; 53+ messages in thread
From: Amir Goldstein @ 2024-11-19 15:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Song Liu, Jan Kara, Christian Brauner, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2024-11-15 at 17:35 +0000, Song Liu wrote:
> > Hi Jan,
> >
> > > On Nov 15, 2024, at 3:19 AM, Jan Kara <jack@suse.cz> wrote:
> >
> > [...]
> >
> > > > AFAICT, we need to modify how lsm blob are managed with
> > > > CONFIG_BPF_SYSCALL=y && CONFIG_BPF_LSM=n case. The solution, even
> > > > if it gets accepted, doesn't really save any memory. Instead of
> > > > growing struct inode by 8 bytes, the solution will allocate 8
> > > > more bytes to inode->i_security. So the total memory consumption
> > > > is the same, but the memory is more fragmented.
> > >
> > > I guess you've found a better solution for this based on James' suggestion.
> > >
> > > > Therefore, I think we should really step back and consider adding
> > > > the i_bpf_storage to struct inode. While this does increase the
> > > > size of struct inode by 8 bytes, it may end up with less overall
> > > > memory consumption for the system. This is why.
> > > >
> > > > When the user cannot use inode local storage, the alternative is
> > > > to use hash maps (use inode pointer as key). AFAICT, all hash maps
> > > > comes with non-trivial overhead, in memory consumption, in access
> > > > latency, and in extra code to manage the memory. OTOH, inode local
> > > > storage doesn't have these issue, and is usually much more efficient:
> > > > - memory is only allocated for inodes with actual data,
> > > > - O(1) latency,
> > > > - per inode data is freed automatically when the inode is evicted.
> > > > Please refer to [1] where Amir mentioned all the work needed to
> > > > properly manage a hash map, and I explained why we don't need to
> > > > worry about these with inode local storage.
> > >
> > > Well, but here you are speaking of a situation where bpf inode storage
> > > space gets actually used for most inodes. Then I agree i_bpf_storage is the
> > > most economic solution. But I'd also expect that for vast majority of
> > > systems the bpf inode storage isn't used at all and if it does get used, it
> > > is used only for a small fraction of inodes. So we are weighting 8 bytes
> > > per inode for all those users that don't need it against more significant
> > > memory savings for users that actually do need per inode bpf storage. A
> > > factor in this is that a lot of people are running some distribution kernel
> > > which generally enables most config options that are at least somewhat
> > > useful. So hiding the cost behind CONFIG_FOO doesn't really help such
> > > people.
> >
> > Agreed that an extra pointer will be used if there is no actual users
> > of it. However, in longer term, "most users do not use bpf inode
> > storage" may not be true. As kernel engineers, we may not always notice
> > when user space is using some BPF features. For example, systemd has
> > a BPF LSM program "restrict_filesystems" [1]. It is enabled if the
> > user have lsm=bpf in kernel args. I personally noticed it as a
> > surprise when we enabled lsm=bpf.
> >
> > > I'm personally not *so* hung up about a pointer in struct inode but I can
> > > see why Christian is and I agree adding a pointer there isn't a win for
> > > everybody.
> >
> > I can also understand Christian's motivation. However, I am a bit
> > frustrated because similar approach (adding a pointer to the struct)
> > worked fine for other popular data structures: task_struct, sock,
> > cgroup.
> >
>
> There are (usually) a lot more inodes on a host than all of those other
> structs combined. Worse, struct inode is often embedded in other
> structs, and adding fields can cause alignment problems there.
>
>
> > > Longer term, I think it may be beneficial to come up with a way to attach
> > > private info to the inode in a way that doesn't cost us one pointer per
> > > funcionality that may possibly attach info to the inode. We already have
> > > i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> > > call where the space overhead for everybody is worth the runtime &
> > > complexity overhead for users using the functionality...
> >
> > It does seem to be the right long term solution, and I am willing to
> > work on it. However, I would really appreciate some positive feedback
> > on the idea, so that I have better confidence my weeks of work has a
> > better chance to worth it.
> >
> > Thanks,
> > Song
> >
> > [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
>
> fsnotify is somewhat similar to file locking in that few inodes on the
> machine actually utilize these fields.
>
> For file locking, we allocate and populate the inode->i_flctx field on
> an as-needed basis. The kernel then hangs on to that struct until the
> inode is freed. We could do something similar here. We have this now:
>
> #ifdef CONFIG_FSNOTIFY
>         __u32                   i_fsnotify_mask; /* all events this inode cares about */
>         /* 32-bit hole reserved for expanding i_fsnotify_mask */
>         struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> #endif
>
> What if you were to turn these fields into a pointer to a new struct:
>
>         struct fsnotify_inode_context {
>                 struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
>                 struct bpf_local_storage __rcu          *i_bpf_storage;
>                 __u32                                   i_fsnotify_mask; /* all events this inode cares about */
>         };
>

The extra indirection is going to hurt for i_fsnotify_mask
it is being accessed frequently in fsnotify hooks, so I wouldn't move it
into a container, but it could be moved to the hole after i_state.

> Then whenever you have to populate any of these fields, you just
> allocate one of these structs and set the inode up to point to it.
> They're tiny too, so don't bother freeing it until the inode is
> deallocated.
>
> It'd mean rejiggering a fair bit of fsnotify code, but it would give
> the fsnotify code an easier way to expand per-inode info in the future.
> It would also slightly shrink struct inode too.

This was already done for s_fsnotify_marks, so you can follow the recipe
of 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb")
and create an fsnotify_inode_info container.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-19 15:25             ` Amir Goldstein
@ 2024-11-19 15:30               ` Amir Goldstein
  2024-11-19 21:53                 ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Amir Goldstein @ 2024-11-19 15:30 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Song Liu, Jan Kara, Christian Brauner, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Fri, 2024-11-15 at 17:35 +0000, Song Liu wrote:
> > > Hi Jan,
> > >
> > > > On Nov 15, 2024, at 3:19 AM, Jan Kara <jack@suse.cz> wrote:
> > >
> > > [...]
> > >
> > > > > AFAICT, we need to modify how lsm blob are managed with
> > > > > CONFIG_BPF_SYSCALL=y && CONFIG_BPF_LSM=n case. The solution, even
> > > > > if it gets accepted, doesn't really save any memory. Instead of
> > > > > growing struct inode by 8 bytes, the solution will allocate 8
> > > > > more bytes to inode->i_security. So the total memory consumption
> > > > > is the same, but the memory is more fragmented.
> > > >
> > > > I guess you've found a better solution for this based on James' suggestion.
> > > >
> > > > > Therefore, I think we should really step back and consider adding
> > > > > the i_bpf_storage to struct inode. While this does increase the
> > > > > size of struct inode by 8 bytes, it may end up with less overall
> > > > > memory consumption for the system. This is why.
> > > > >
> > > > > When the user cannot use inode local storage, the alternative is
> > > > > to use hash maps (use inode pointer as key). AFAICT, all hash maps
> > > > > comes with non-trivial overhead, in memory consumption, in access
> > > > > latency, and in extra code to manage the memory. OTOH, inode local
> > > > > storage doesn't have these issue, and is usually much more efficient:
> > > > > - memory is only allocated for inodes with actual data,
> > > > > - O(1) latency,
> > > > > - per inode data is freed automatically when the inode is evicted.
> > > > > Please refer to [1] where Amir mentioned all the work needed to
> > > > > properly manage a hash map, and I explained why we don't need to
> > > > > worry about these with inode local storage.
> > > >
> > > > Well, but here you are speaking of a situation where bpf inode storage
> > > > space gets actually used for most inodes. Then I agree i_bpf_storage is the
> > > > most economic solution. But I'd also expect that for vast majority of
> > > > systems the bpf inode storage isn't used at all and if it does get used, it
> > > > is used only for a small fraction of inodes. So we are weighting 8 bytes
> > > > per inode for all those users that don't need it against more significant
> > > > memory savings for users that actually do need per inode bpf storage. A
> > > > factor in this is that a lot of people are running some distribution kernel
> > > > which generally enables most config options that are at least somewhat
> > > > useful. So hiding the cost behind CONFIG_FOO doesn't really help such
> > > > people.
> > >
> > > Agreed that an extra pointer will be used if there is no actual users
> > > of it. However, in longer term, "most users do not use bpf inode
> > > storage" may not be true. As kernel engineers, we may not always notice
> > > when user space is using some BPF features. For example, systemd has
> > > a BPF LSM program "restrict_filesystems" [1]. It is enabled if the
> > > user have lsm=bpf in kernel args. I personally noticed it as a
> > > surprise when we enabled lsm=bpf.
> > >
> > > > I'm personally not *so* hung up about a pointer in struct inode but I can
> > > > see why Christian is and I agree adding a pointer there isn't a win for
> > > > everybody.
> > >
> > > I can also understand Christian's motivation. However, I am a bit
> > > frustrated because similar approach (adding a pointer to the struct)
> > > worked fine for other popular data structures: task_struct, sock,
> > > cgroup.
> > >
> >
> > There are (usually) a lot more inodes on a host than all of those other
> > structs combined. Worse, struct inode is often embedded in other
> > structs, and adding fields can cause alignment problems there.
> >
> >
> > > > Longer term, I think it may be beneficial to come up with a way to attach
> > > > private info to the inode in a way that doesn't cost us one pointer per
> > > > funcionality that may possibly attach info to the inode. We already have
> > > > i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> > > > call where the space overhead for everybody is worth the runtime &
> > > > complexity overhead for users using the functionality...
> > >
> > > It does seem to be the right long term solution, and I am willing to
> > > work on it. However, I would really appreciate some positive feedback
> > > on the idea, so that I have better confidence my weeks of work has a
> > > better chance to worth it.
> > >
> > > Thanks,
> > > Song
> > >
> > > [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
> >
> > fsnotify is somewhat similar to file locking in that few inodes on the
> > machine actually utilize these fields.
> >
> > For file locking, we allocate and populate the inode->i_flctx field on
> > an as-needed basis. The kernel then hangs on to that struct until the
> > inode is freed. We could do something similar here. We have this now:
> >
> > #ifdef CONFIG_FSNOTIFY
> >         __u32                   i_fsnotify_mask; /* all events this inode cares about */
> >         /* 32-bit hole reserved for expanding i_fsnotify_mask */
> >         struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > #endif
> >
> > What if you were to turn these fields into a pointer to a new struct:
> >
> >         struct fsnotify_inode_context {
> >                 struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> >                 struct bpf_local_storage __rcu          *i_bpf_storage;
> >                 __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> >         };
> >
>
> The extra indirection is going to hurt for i_fsnotify_mask
> it is being accessed frequently in fsnotify hooks, so I wouldn't move it
> into a container, but it could be moved to the hole after i_state.
>
> > Then whenever you have to populate any of these fields, you just
> > allocate one of these structs and set the inode up to point to it.
> > They're tiny too, so don't bother freeing it until the inode is
> > deallocated.
> >
> > It'd mean rejiggering a fair bit of fsnotify code, but it would give
> > the fsnotify code an easier way to expand per-inode info in the future.
> > It would also slightly shrink struct inode too.
>
> This was already done for s_fsnotify_marks, so you can follow the recipe
> of 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb")
> and create an fsnotify_inode_info container.
>

On second thought, fsnotify_sb_info container is allocated and attached
in the context of userspace adding a mark.

If you will need allocate and attach fsnotify_inode_info in the content of
fast path fanotify hook in order to add the inode to the map, I don't
think that is going to fly??

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-19 12:27                       ` Dr. Greg
@ 2024-11-19 18:14                         ` Casey Schaufler
  2024-11-19 22:35                           ` Song Liu
  2024-11-20 16:54                           ` Dr. Greg
  0 siblings, 2 replies; 53+ messages in thread
From: Casey Schaufler @ 2024-11-19 18:14 UTC (permalink / raw)
  To: Dr. Greg, Song Liu
  Cc: James Bottomley, jack@suse.cz, brauner@kernel.org, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com, Casey Schaufler

On 11/19/2024 4:27 AM, Dr. Greg wrote:
> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
>
>> Hi Christian, James and Jan, 
> Good morning, I hope the day is starting well for everyone.
>
>>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>> [...]
>>
>>>> We can address this with something like following:
>>>>
>>>> #ifdef CONFIG_SECURITY
>>>>         void                    *i_security;
>>>> #elif CONFIG_BPF_SYSCALL
>>>>         struct bpf_local_storage __rcu *i_bpf_storage;
>>>> #endif
>>>>
>>>> This will help catch all misuse of the i_bpf_storage at compile
>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
>>>>
>>>> Does this make sense?
>>> Got to say I'm with Casey here, this will generate horrible and failure
>>> prone code.
>>>
>>> Since effectively you're making i_security always present anyway,
>>> simply do that and also pull the allocation code out of security.c in a
>>> way that it's always available?  That way you don't have to special
>>> case the code depending on whether CONFIG_SECURITY is defined. 
>>> Effectively this would give everyone a generic way to attach some
>>> memory area to an inode.  I know it's more complex than this because
>>> there are LSM hooks that run from security_inode_alloc() but if you can
>>> make it work generically, I'm sure everyone will benefit.
>> On a second thought, I think making i_security generic is not 
>> the right solution for "BPF inode storage in tracing use cases". 
>>
>> This is because i_security serves a very specific use case: it 
>> points to a piece of memory whose size is calculated at system 
>> boot time. If some of the supported LSMs is not enabled by the 
>> lsm= kernel arg, the kernel will not allocate memory in 
>> i_security for them. The only way to change lsm= is to reboot 
>> the system. BPF LSM programs can be disabled at the boot time, 
>> which fits well in i_security. However, BPF tracing programs 
>> cannot be disabled at boot time (even we change the code to 
>> make it possible, we are not likely to disable BPF tracing). 
>> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some 
>> BPF tracing programs to load at some point of time, and these 
>> programs may use BPF inode storage. 
>>
>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory 
>> always will be attached to i_security (maybe under a different 
>> name, say, i_generic) of every inode. In this case, we should 
>> really add i_bpf_storage directly to the inode, because another 
>> pointer jump via i_generic gives nothing but overhead. 
>>
>> Does this make sense? Or did I misunderstand the suggestion?
> There is a colloquialism that seems relevant here: "Pick your poison".
>
> In the greater interests of the kernel, it seems that a generic
> mechanism for attaching per inode information is the only realistic
> path forward, unless Christian changes his position on expanding
> the size of struct inode.
>
> There are two pathways forward.
>
> 1.) Attach a constant size 'blob' of storage to each inode.
>
> This is a similar approach to what the LSM uses where each blob is
> sized as follows:
>
> S = U * sizeof(void *)
>
> Where U is the number of sub-systems that have a desire to use inode
> specific storage.

I can't tell for sure, but it looks like you don't understand how
LSM i_security blobs are used. It is *not* the case that each LSM
gets a pointer in the i_security blob. Each LSM that wants storage
tells the infrastructure at initialization time how much space it
wants in the blob. That can be a pointer, but usually it's a struct
with flags, pointers and even lists.

> Each sub-system uses it's pointer slot to manage any additional
> storage that it desires to attach to the inode.

Again, an LSM may choose to do it that way, but most don't.
SELinux and Smack need data on every inode. It makes much more sense
to put it directly in the blob than to allocate a separate chunk
for every inode.

> This has the obvious advantage of O(1) cost complexity for any
> sub-system that wants to access its inode specific storage.
>
> The disadvantage, as you note, is that it wastes memory if a
> sub-system does not elect to attach per inode information, for example
> the tracing infrastructure.

To be clear, that disadvantage only comes up if the sub-system uses
inode data on an occasional basis. If it never uses inode data there
is no need to have a pointer to it.

> This disadvantage is parried by the fact that it reduces the size of
> the inode proper by 24 bytes (4 pointers down to 1) and allows future
> extensibility without colliding with the interests and desires of the
> VFS maintainers.

You're adding a level of indirection. Even I would object based on
the performance impact.

> 2.) Implement key/value mapping for inode specific storage.
>
> The key would be a sub-system specific numeric value that returns a
> pointer the sub-system uses to manage its inode specific memory for a
> particular inode.
>
> A participating sub-system in turn uses its identifier to register an
> inode specific pointer for its sub-system.
>
> This strategy loses O(1) lookup complexity but reduces total memory
> consumption and only imposes memory costs for inodes when a sub-system
> desires to use inode specific storage.

SELinux and Smack use an inode blob for every inode. The performance
regression boggles the mind. Not to mention the additional complexity
of managing the memory.

> Approach 2 requires the introduction of generic infrastructure that
> allows an inode's key/value mappings to be located, presumably based
> on the inode's pointer value.  We could probably just resurrect the
> old IMA iint code for this purpose.
>
> In the end it comes down to a rather standard trade-off in this
> business, memory vs. execution cost.
>
> We would posit that option 2 is the only viable scheme if the design
> metric is overall good for the Linux kernel eco-system.

No. Really, no. You need look no further than secmarks to understand
how a key based blob allocation scheme leads to tears. Keys are fine
in the case where use of data is sparse. They have no place when data
use is the norm.

>> Thanks,
>> Song
> Have a good day.
>
> As always,
> Dr. Greg
>
> The Quixote Project - Flailing at the Travails of Cybersecurity
>               https://github.com/Quixote-Project
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-19 15:30               ` Amir Goldstein
@ 2024-11-19 21:53                 ` Song Liu
  2024-11-20  9:19                   ` Amir Goldstein
  2024-11-20  9:28                   ` Christian Brauner
  0 siblings, 2 replies; 53+ messages in thread
From: Song Liu @ 2024-11-19 21:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jeff Layton, Song Liu, Jan Kara, Christian Brauner, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

Hi Jeff and Amir, 

Thanks for your inputs!

> On Nov 19, 2024, at 7:30 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
>> 
>> On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
>>> 

[...]

>>> Longer term, I think it may be beneficial to come up with a way to attach
>>>>> private info to the inode in a way that doesn't cost us one pointer per
>>>>> funcionality that may possibly attach info to the inode. We already have
>>>>> i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
>>>>> call where the space overhead for everybody is worth the runtime &
>>>>> complexity overhead for users using the functionality...
>>>> 
>>>> It does seem to be the right long term solution, and I am willing to
>>>> work on it. However, I would really appreciate some positive feedback
>>>> on the idea, so that I have better confidence my weeks of work has a
>>>> better chance to worth it.
>>>> 
>>>> Thanks,
>>>> Song
>>>> 
>>>> [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
>>> 
>>> fsnotify is somewhat similar to file locking in that few inodes on the
>>> machine actually utilize these fields.
>>> 
>>> For file locking, we allocate and populate the inode->i_flctx field on
>>> an as-needed basis. The kernel then hangs on to that struct until the
>>> inode is freed.

If we have some universal on-demand per-inode memory allocator, 
I guess we can move i_flctx to it?

>>> We could do something similar here. We have this now:
>>> 
>>> #ifdef CONFIG_FSNOTIFY
>>>        __u32                   i_fsnotify_mask; /* all events this inode cares about */
>>>        /* 32-bit hole reserved for expanding i_fsnotify_mask */
>>>        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
>>> #endif

And maybe some fsnotify fields too?

With a couple users, I think it justifies to have some universal
on-demond allocator. 

>>> What if you were to turn these fields into a pointer to a new struct:
>>> 
>>>        struct fsnotify_inode_context {
>>>                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
>>>                struct bpf_local_storage __rcu          *i_bpf_storage;
>>>                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
>>>        };
>>> 
>> 
>> The extra indirection is going to hurt for i_fsnotify_mask
>> it is being accessed frequently in fsnotify hooks, so I wouldn't move it
>> into a container, but it could be moved to the hole after i_state.

>>> Then whenever you have to populate any of these fields, you just
>>> allocate one of these structs and set the inode up to point to it.
>>> They're tiny too, so don't bother freeing it until the inode is
>>> deallocated.
>>> 
>>> It'd mean rejiggering a fair bit of fsnotify code, but it would give
>>> the fsnotify code an easier way to expand per-inode info in the future.
>>> It would also slightly shrink struct inode too.

I am hoping to make i_bpf_storage available to tracing programs. 
Therefore, I would rather not limit it to fsnotify context. We can
still use the universal on-demand allocator.

>> 
>> This was already done for s_fsnotify_marks, so you can follow the recipe
>> of 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb")
>> and create an fsnotify_inode_info container.
>> 
> 
> On second thought, fsnotify_sb_info container is allocated and attached
> in the context of userspace adding a mark.
> 
> If you will need allocate and attach fsnotify_inode_info in the content of
> fast path fanotify hook in order to add the inode to the map, I don't
> think that is going to fly??

Do you mean we may not be able to allocate memory in the fast path 
hook? AFAICT, the fast path is still in the process context, so I 
think this is not a problem?

Thanks,
Song 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-19 18:14                         ` Casey Schaufler
@ 2024-11-19 22:35                           ` Song Liu
  2024-11-20 16:54                           ` Dr. Greg
  1 sibling, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-19 22:35 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Dr. Greg, Song Liu, James Bottomley, jack@suse.cz,
	brauner@kernel.org, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com



> On Nov 19, 2024, at 10:14 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> On 11/19/2024 4:27 AM, Dr. Greg wrote:
>> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
>> 
>>> Hi Christian, James and Jan,
>> Good morning, I hope the day is starting well for everyone.
>> 
>>>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>>> [...]
>>> 
>>>>> We can address this with something like following:
>>>>> 
>>>>> #ifdef CONFIG_SECURITY
>>>>>        void                    *i_security;
>>>>> #elif CONFIG_BPF_SYSCALL
>>>>>        struct bpf_local_storage __rcu *i_bpf_storage;
>>>>> #endif
>>>>> 
>>>>> This will help catch all misuse of the i_bpf_storage at compile
>>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
>>>>> 
>>>>> Does this make sense?
>>>> Got to say I'm with Casey here, this will generate horrible and failure
>>>> prone code.
>>>> 
>>>> Since effectively you're making i_security always present anyway,
>>>> simply do that and also pull the allocation code out of security.c in a
>>>> way that it's always available?  That way you don't have to special
>>>> case the code depending on whether CONFIG_SECURITY is defined. 
>>>> Effectively this would give everyone a generic way to attach some
>>>> memory area to an inode.  I know it's more complex than this because
>>>> there are LSM hooks that run from security_inode_alloc() but if you can
>>>> make it work generically, I'm sure everyone will benefit.
>>> On a second thought, I think making i_security generic is not 
>>> the right solution for "BPF inode storage in tracing use cases". 
>>> 
>>> This is because i_security serves a very specific use case: it 
>>> points to a piece of memory whose size is calculated at system 
>>> boot time. If some of the supported LSMs is not enabled by the 
>>> lsm= kernel arg, the kernel will not allocate memory in 
>>> i_security for them. The only way to change lsm= is to reboot 
>>> the system. BPF LSM programs can be disabled at the boot time, 
>>> which fits well in i_security. However, BPF tracing programs 
>>> cannot be disabled at boot time (even we change the code to 
>>> make it possible, we are not likely to disable BPF tracing). 
>>> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some 
>>> BPF tracing programs to load at some point of time, and these 
>>> programs may use BPF inode storage. 
>>> 
>>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory 
>>> always will be attached to i_security (maybe under a different 
>>> name, say, i_generic) of every inode. In this case, we should 
>>> really add i_bpf_storage directly to the inode, because another 
>>> pointer jump via i_generic gives nothing but overhead. 
>>> 
>>> Does this make sense? Or did I misunderstand the suggestion?
>> There is a colloquialism that seems relevant here: "Pick your poison".
>> 
>> In the greater interests of the kernel, it seems that a generic
>> mechanism for attaching per inode information is the only realistic
>> path forward, unless Christian changes his position on expanding
>> the size of struct inode.
>> 
>> There are two pathways forward.
>> 
>> 1.) Attach a constant size 'blob' of storage to each inode.
>> 
>> This is a similar approach to what the LSM uses where each blob is
>> sized as follows:
>> 
>> S = U * sizeof(void *)
>> 
>> Where U is the number of sub-systems that have a desire to use inode
>> specific storage.
> 
> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.
> 
>> Each sub-system uses it's pointer slot to manage any additional
>> storage that it desires to attach to the inode.
> 
> Again, an LSM may choose to do it that way, but most don't.
> SELinux and Smack need data on every inode. It makes much more sense
> to put it directly in the blob than to allocate a separate chunk
> for every inode.

AFAICT, i_security is somehow unique in the way that its size
is calculated at boot time. I guess we will just keep most LSM
users behind. 

> 
>> This has the obvious advantage of O(1) cost complexity for any
>> sub-system that wants to access its inode specific storage.
>> 
>> The disadvantage, as you note, is that it wastes memory if a
>> sub-system does not elect to attach per inode information, for example
>> the tracing infrastructure.
> 
> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.
> 
>> This disadvantage is parried by the fact that it reduces the size of
>> the inode proper by 24 bytes (4 pointers down to 1) and allows future
>> extensibility without colliding with the interests and desires of the
>> VFS maintainers.
> 
> You're adding a level of indirection. Even I would object based on
> the performance impact.
> 
>> 2.) Implement key/value mapping for inode specific storage.
>> 
>> The key would be a sub-system specific numeric value that returns a
>> pointer the sub-system uses to manage its inode specific memory for a
>> particular inode.
>> 
>> A participating sub-system in turn uses its identifier to register an
>> inode specific pointer for its sub-system.
>> 
>> This strategy loses O(1) lookup complexity but reduces total memory
>> consumption and only imposes memory costs for inodes when a sub-system
>> desires to use inode specific storage.
> 
> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional complexity
> of managing the memory.
> 
>> Approach 2 requires the introduction of generic infrastructure that
>> allows an inode's key/value mappings to be located, presumably based
>> on the inode's pointer value.  We could probably just resurrect the
>> old IMA iint code for this purpose.
>> 
>> In the end it comes down to a rather standard trade-off in this
>> business, memory vs. execution cost.
>> 
>> We would posit that option 2 is the only viable scheme if the design
>> metric is overall good for the Linux kernel eco-system.
> 
> No. Really, no. You need look no further than secmarks to understand
> how a key based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when data
> use is the norm.

OTOH, I think some on-demand key-value storage makes sense for many 
other use cases, such as BPF (LSM and tracing), file lock, fanotify, 
etc. 

Overall, I think we have 3 types storages attached to inode: 

  1. Embedded in struct inode, gated by CONFIG_*. 
  2. Behind i_security (or maybe call it a different name if we
     can find other uses for it). The size is calculated at boot
     time. 
  3. Behind a key-value storage. 

To evaluate these categories, we have:

Speed: 1 > 2 > 3
Flexibility: 3 > 2 > 1

We don't really have 3 right now. I think the direction is to 
create it. BPF inode storage is a key-value store. If we can 
get another user for 3, in addition to BPF inode storage, it
should be a net win. 

Does this sound like a viable path forward?

Thanks,
Song


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-19 21:53                 ` Song Liu
@ 2024-11-20  9:19                   ` Amir Goldstein
  2024-11-20  9:28                   ` Christian Brauner
  1 sibling, 0 replies; 53+ messages in thread
From: Amir Goldstein @ 2024-11-20  9:19 UTC (permalink / raw)
  To: Song Liu
  Cc: Jeff Layton, Jan Kara, Christian Brauner, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Tue, Nov 19, 2024 at 10:53 PM Song Liu <songliubraving@meta.com> wrote:
>
> Hi Jeff and Amir,
>
> Thanks for your inputs!
>
> > On Nov 19, 2024, at 7:30 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >>
> >> On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
> >>>
>
> [...]
>
> >>> Longer term, I think it may be beneficial to come up with a way to attach
> >>>>> private info to the inode in a way that doesn't cost us one pointer per
> >>>>> funcionality that may possibly attach info to the inode. We already have
> >>>>> i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> >>>>> call where the space overhead for everybody is worth the runtime &
> >>>>> complexity overhead for users using the functionality...
> >>>>
> >>>> It does seem to be the right long term solution, and I am willing to
> >>>> work on it. However, I would really appreciate some positive feedback
> >>>> on the idea, so that I have better confidence my weeks of work has a
> >>>> better chance to worth it.
> >>>>
> >>>> Thanks,
> >>>> Song
> >>>>
> >>>> [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
> >>>
> >>> fsnotify is somewhat similar to file locking in that few inodes on the
> >>> machine actually utilize these fields.
> >>>
> >>> For file locking, we allocate and populate the inode->i_flctx field on
> >>> an as-needed basis. The kernel then hangs on to that struct until the
> >>> inode is freed.
>
> If we have some universal on-demand per-inode memory allocator,
> I guess we can move i_flctx to it?
>
> >>> We could do something similar here. We have this now:
> >>>
> >>> #ifdef CONFIG_FSNOTIFY
> >>>        __u32                   i_fsnotify_mask; /* all events this inode cares about */
> >>>        /* 32-bit hole reserved for expanding i_fsnotify_mask */
> >>>        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> >>> #endif
>
> And maybe some fsnotify fields too?
>
> With a couple users, I think it justifies to have some universal
> on-demond allocator.
>
> >>> What if you were to turn these fields into a pointer to a new struct:
> >>>
> >>>        struct fsnotify_inode_context {
> >>>                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> >>>                struct bpf_local_storage __rcu          *i_bpf_storage;
> >>>                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> >>>        };
> >>>
> >>
> >> The extra indirection is going to hurt for i_fsnotify_mask
> >> it is being accessed frequently in fsnotify hooks, so I wouldn't move it
> >> into a container, but it could be moved to the hole after i_state.
>
> >>> Then whenever you have to populate any of these fields, you just
> >>> allocate one of these structs and set the inode up to point to it.
> >>> They're tiny too, so don't bother freeing it until the inode is
> >>> deallocated.
> >>>
> >>> It'd mean rejiggering a fair bit of fsnotify code, but it would give
> >>> the fsnotify code an easier way to expand per-inode info in the future.
> >>> It would also slightly shrink struct inode too.
>
> I am hoping to make i_bpf_storage available to tracing programs.
> Therefore, I would rather not limit it to fsnotify context. We can
> still use the universal on-demand allocator.
>
> >>
> >> This was already done for s_fsnotify_marks, so you can follow the recipe
> >> of 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb")
> >> and create an fsnotify_inode_info container.
> >>
> >
> > On second thought, fsnotify_sb_info container is allocated and attached
> > in the context of userspace adding a mark.
> >
> > If you will need allocate and attach fsnotify_inode_info in the content of
> > fast path fanotify hook in order to add the inode to the map, I don't
> > think that is going to fly??
>
> Do you mean we may not be able to allocate memory in the fast path
> hook? AFAICT, the fast path is still in the process context, so I
> think this is not a problem?

Right. that should be ok.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-19 21:53                 ` Song Liu
  2024-11-20  9:19                   ` Amir Goldstein
@ 2024-11-20  9:28                   ` Christian Brauner
  2024-11-20 11:19                     ` Amir Goldstein
  2024-11-21  8:08                     ` Song Liu
  1 sibling, 2 replies; 53+ messages in thread
From: Christian Brauner @ 2024-11-20  9:28 UTC (permalink / raw)
  To: Song Liu, Amir Goldstein, Jeff Layton
  Cc: Jan Kara, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Tue, Nov 19, 2024 at 09:53:20PM +0000, Song Liu wrote:
> Hi Jeff and Amir, 
> 
> Thanks for your inputs!
> 
> > On Nov 19, 2024, at 7:30 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> > 
> > On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >> 
> >> On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
> >>> 
> 
> [...]
> 
> >>> Longer term, I think it may be beneficial to come up with a way to attach
> >>>>> private info to the inode in a way that doesn't cost us one pointer per
> >>>>> funcionality that may possibly attach info to the inode. We already have
> >>>>> i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> >>>>> call where the space overhead for everybody is worth the runtime &
> >>>>> complexity overhead for users using the functionality...
> >>>> 
> >>>> It does seem to be the right long term solution, and I am willing to
> >>>> work on it. However, I would really appreciate some positive feedback
> >>>> on the idea, so that I have better confidence my weeks of work has a
> >>>> better chance to worth it.
> >>>> 
> >>>> Thanks,
> >>>> Song
> >>>> 
> >>>> [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
> >>> 
> >>> fsnotify is somewhat similar to file locking in that few inodes on the
> >>> machine actually utilize these fields.
> >>> 
> >>> For file locking, we allocate and populate the inode->i_flctx field on
> >>> an as-needed basis. The kernel then hangs on to that struct until the
> >>> inode is freed.
> 
> If we have some universal on-demand per-inode memory allocator, 
> I guess we can move i_flctx to it?
> 
> >>> We could do something similar here. We have this now:
> >>> 
> >>> #ifdef CONFIG_FSNOTIFY
> >>>        __u32                   i_fsnotify_mask; /* all events this inode cares about */
> >>>        /* 32-bit hole reserved for expanding i_fsnotify_mask */
> >>>        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> >>> #endif
> 
> And maybe some fsnotify fields too?
> 
> With a couple users, I think it justifies to have some universal
> on-demond allocator. 
> 
> >>> What if you were to turn these fields into a pointer to a new struct:
> >>> 
> >>>        struct fsnotify_inode_context {
> >>>                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> >>>                struct bpf_local_storage __rcu          *i_bpf_storage;
> >>>                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> >>>        };
> >>> 
> >> 
> >> The extra indirection is going to hurt for i_fsnotify_mask
> >> it is being accessed frequently in fsnotify hooks, so I wouldn't move it
> >> into a container, but it could be moved to the hole after i_state.
> 
> >>> Then whenever you have to populate any of these fields, you just
> >>> allocate one of these structs and set the inode up to point to it.
> >>> They're tiny too, so don't bother freeing it until the inode is
> >>> deallocated.
> >>> 
> >>> It'd mean rejiggering a fair bit of fsnotify code, but it would give
> >>> the fsnotify code an easier way to expand per-inode info in the future.
> >>> It would also slightly shrink struct inode too.
> 
> I am hoping to make i_bpf_storage available to tracing programs. 
> Therefore, I would rather not limit it to fsnotify context. We can
> still use the universal on-demand allocator.

Can't we just do something like:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..cc05a5485365 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -627,6 +627,12 @@ is_uncached_acl(struct posix_acl *acl)
 #define IOP_DEFAULT_READLINK   0x0010
 #define IOP_MGTIME     0x0020

+struct inode_addons {
+        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
+        struct bpf_local_storage __rcu          *i_bpf_storage;
+        __u32                                   i_fsnotify_mask; /* all events this inode cares about */
+};
+
 /*
  * Keep mostly read-only and often accessed (especially for
  * the RCU path lookup and 'stat' data) fields at the beginning
@@ -731,12 +737,7 @@ struct inode {
                unsigned                i_dir_seq;
        };

-
-#ifdef CONFIG_FSNOTIFY
-       __u32                   i_fsnotify_mask; /* all events this inode cares about */
-       /* 32-bit hole reserved for expanding i_fsnotify_mask */
-       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
-#endif
+       struct inode_addons *i_addons;

 #ifdef CONFIG_FS_ENCRYPTION
        struct fscrypt_inode_info       *i_crypt_info;

Then when either fsnotify or bpf needs that storage they can do a
cmpxchg() based allocation for struct inode_addons just like I did with
f_owner:

int file_f_owner_allocate(struct file *file)
{
	struct fown_struct *f_owner;

	f_owner = file_f_owner(file);
	if (f_owner)
		return 0;

	f_owner = kzalloc(sizeof(struct fown_struct), GFP_KERNEL);
	if (!f_owner)
		return -ENOMEM;

	rwlock_init(&f_owner->lock);
	f_owner->file = file;
	/* If someone else raced us, drop our allocation. */
	if (unlikely(cmpxchg(&file->f_owner, NULL, f_owner)))
		kfree(f_owner);
	return 0;
}

The internal allocations for specific fields are up to the subsystem
ofc. Does that make sense?

> >>>        struct fsnotify_inode_context {
> >>>                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> >>>                struct bpf_local_storage __rcu          *i_bpf_storage;
> >>>                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> >>>        };

> 
> >> 
> >> This was already done for s_fsnotify_marks, so you can follow the recipe
> >> of 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb")
> >> and create an fsnotify_inode_info container.
> >> 
> > 
> > On second thought, fsnotify_sb_info container is allocated and attached
> > in the context of userspace adding a mark.
> > 
> > If you will need allocate and attach fsnotify_inode_info in the content of
> > fast path fanotify hook in order to add the inode to the map, I don't
> > think that is going to fly??
> 
> Do you mean we may not be able to allocate memory in the fast path 
> hook? AFAICT, the fast path is still in the process context, so I 
> think this is not a problem?
> 
> Thanks,
> Song 
> 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-20  9:28                   ` Christian Brauner
@ 2024-11-20 11:19                     ` Amir Goldstein
  2024-11-21  8:43                       ` Christian Brauner
  2024-11-21 13:48                       ` Jeff Layton
  2024-11-21  8:08                     ` Song Liu
  1 sibling, 2 replies; 53+ messages in thread
From: Amir Goldstein @ 2024-11-20 11:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Song Liu, Jeff Layton, Jan Kara, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Wed, Nov 20, 2024 at 10:28 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Nov 19, 2024 at 09:53:20PM +0000, Song Liu wrote:
> > Hi Jeff and Amir,
> >
> > Thanks for your inputs!
> >
> > > On Nov 19, 2024, at 7:30 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >>
> > >> On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >>>
> >
> > [...]
> >
> > >>> Longer term, I think it may be beneficial to come up with a way to attach
> > >>>>> private info to the inode in a way that doesn't cost us one pointer per
> > >>>>> funcionality that may possibly attach info to the inode. We already have
> > >>>>> i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> > >>>>> call where the space overhead for everybody is worth the runtime &
> > >>>>> complexity overhead for users using the functionality...
> > >>>>
> > >>>> It does seem to be the right long term solution, and I am willing to
> > >>>> work on it. However, I would really appreciate some positive feedback
> > >>>> on the idea, so that I have better confidence my weeks of work has a
> > >>>> better chance to worth it.
> > >>>>
> > >>>> Thanks,
> > >>>> Song
> > >>>>
> > >>>> [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
> > >>>
> > >>> fsnotify is somewhat similar to file locking in that few inodes on the
> > >>> machine actually utilize these fields.
> > >>>
> > >>> For file locking, we allocate and populate the inode->i_flctx field on
> > >>> an as-needed basis. The kernel then hangs on to that struct until the
> > >>> inode is freed.
> >
> > If we have some universal on-demand per-inode memory allocator,
> > I guess we can move i_flctx to it?
> >
> > >>> We could do something similar here. We have this now:
> > >>>
> > >>> #ifdef CONFIG_FSNOTIFY
> > >>>        __u32                   i_fsnotify_mask; /* all events this inode cares about */
> > >>>        /* 32-bit hole reserved for expanding i_fsnotify_mask */
> > >>>        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > >>> #endif
> >
> > And maybe some fsnotify fields too?
> >
> > With a couple users, I think it justifies to have some universal
> > on-demond allocator.
> >
> > >>> What if you were to turn these fields into a pointer to a new struct:
> > >>>
> > >>>        struct fsnotify_inode_context {
> > >>>                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > >>>                struct bpf_local_storage __rcu          *i_bpf_storage;
> > >>>                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> > >>>        };
> > >>>
> > >>
> > >> The extra indirection is going to hurt for i_fsnotify_mask
> > >> it is being accessed frequently in fsnotify hooks, so I wouldn't move it
> > >> into a container, but it could be moved to the hole after i_state.
> >
> > >>> Then whenever you have to populate any of these fields, you just
> > >>> allocate one of these structs and set the inode up to point to it.
> > >>> They're tiny too, so don't bother freeing it until the inode is
> > >>> deallocated.
> > >>>
> > >>> It'd mean rejiggering a fair bit of fsnotify code, but it would give
> > >>> the fsnotify code an easier way to expand per-inode info in the future.
> > >>> It would also slightly shrink struct inode too.
> >
> > I am hoping to make i_bpf_storage available to tracing programs.
> > Therefore, I would rather not limit it to fsnotify context. We can
> > still use the universal on-demand allocator.
>
> Can't we just do something like:
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..cc05a5485365 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -627,6 +627,12 @@ is_uncached_acl(struct posix_acl *acl)
>  #define IOP_DEFAULT_READLINK   0x0010
>  #define IOP_MGTIME     0x0020
>
> +struct inode_addons {
> +        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> +        struct bpf_local_storage __rcu          *i_bpf_storage;
> +        __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> +};
> +
>  /*
>   * Keep mostly read-only and often accessed (especially for
>   * the RCU path lookup and 'stat' data) fields at the beginning
> @@ -731,12 +737,7 @@ struct inode {
>                 unsigned                i_dir_seq;
>         };
>
> -
> -#ifdef CONFIG_FSNOTIFY
> -       __u32                   i_fsnotify_mask; /* all events this inode cares about */
> -       /* 32-bit hole reserved for expanding i_fsnotify_mask */
> -       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> -#endif
> +       struct inode_addons *i_addons;
>
>  #ifdef CONFIG_FS_ENCRYPTION
>         struct fscrypt_inode_info       *i_crypt_info;
>
> Then when either fsnotify or bpf needs that storage they can do a
> cmpxchg() based allocation for struct inode_addons just like I did with
> f_owner:
>
> int file_f_owner_allocate(struct file *file)
> {
>         struct fown_struct *f_owner;
>
>         f_owner = file_f_owner(file);
>         if (f_owner)
>                 return 0;
>
>         f_owner = kzalloc(sizeof(struct fown_struct), GFP_KERNEL);
>         if (!f_owner)
>                 return -ENOMEM;
>
>         rwlock_init(&f_owner->lock);
>         f_owner->file = file;
>         /* If someone else raced us, drop our allocation. */
>         if (unlikely(cmpxchg(&file->f_owner, NULL, f_owner)))
>                 kfree(f_owner);
>         return 0;
> }
>
> The internal allocations for specific fields are up to the subsystem
> ofc. Does that make sense?
>

Maybe, but as I wrote, i_fsnotify_mask should not be moved out
of inode struct, because it is accessed in fast paths of fsnotify vfs
hooks, where we do not want to have to deref another context,
but i_fsnotify_mask can be moved to the hole after i_state.

And why stop at i_fsnotify/i_bfp?
If you go to "addons" why not also move i_security/i_crypt/i_verify?
Need to have some common rationale behind those decisions.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-19 18:14                         ` Casey Schaufler
  2024-11-19 22:35                           ` Song Liu
@ 2024-11-20 16:54                           ` Dr. Greg
  2024-11-21  8:28                             ` Song Liu
  1 sibling, 1 reply; 53+ messages in thread
From: Dr. Greg @ 2024-11-20 16:54 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, James Bottomley, jack@suse.cz, brauner@kernel.org,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:

Good morning, I hope the day is goning well for everyone.

> On 11/19/2024 4:27 AM, Dr. Greg wrote:
> > On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
> >
> >> Hi Christian, James and Jan, 
> > Good morning, I hope the day is starting well for everyone.
> >
> >>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> >> [...]
> >>
> >>>> We can address this with something like following:
> >>>>
> >>>> #ifdef CONFIG_SECURITY
> >>>>         void                    *i_security;
> >>>> #elif CONFIG_BPF_SYSCALL
> >>>>         struct bpf_local_storage __rcu *i_bpf_storage;
> >>>> #endif
> >>>>
> >>>> This will help catch all misuse of the i_bpf_storage at compile
> >>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
> >>>>
> >>>> Does this make sense?
> >>> Got to say I'm with Casey here, this will generate horrible and failure
> >>> prone code.
> >>>
> >>> Since effectively you're making i_security always present anyway,
> >>> simply do that and also pull the allocation code out of security.c in a
> >>> way that it's always available?  That way you don't have to special
> >>> case the code depending on whether CONFIG_SECURITY is defined. 
> >>> Effectively this would give everyone a generic way to attach some
> >>> memory area to an inode.  I know it's more complex than this because
> >>> there are LSM hooks that run from security_inode_alloc() but if you can
> >>> make it work generically, I'm sure everyone will benefit.
> >> On a second thought, I think making i_security generic is not 
> >> the right solution for "BPF inode storage in tracing use cases". 
> >>
> >> This is because i_security serves a very specific use case: it 
> >> points to a piece of memory whose size is calculated at system 
> >> boot time. If some of the supported LSMs is not enabled by the 
> >> lsm= kernel arg, the kernel will not allocate memory in 
> >> i_security for them. The only way to change lsm= is to reboot 
> >> the system. BPF LSM programs can be disabled at the boot time, 
> >> which fits well in i_security. However, BPF tracing programs 
> >> cannot be disabled at boot time (even we change the code to 
> >> make it possible, we are not likely to disable BPF tracing). 
> >> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some 
> >> BPF tracing programs to load at some point of time, and these 
> >> programs may use BPF inode storage. 
> >>
> >> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory 
> >> always will be attached to i_security (maybe under a different 
> >> name, say, i_generic) of every inode. In this case, we should 
> >> really add i_bpf_storage directly to the inode, because another 
> >> pointer jump via i_generic gives nothing but overhead. 
> >>
> >> Does this make sense? Or did I misunderstand the suggestion?
> > There is a colloquialism that seems relevant here: "Pick your poison".
> >
> > In the greater interests of the kernel, it seems that a generic
> > mechanism for attaching per inode information is the only realistic
> > path forward, unless Christian changes his position on expanding
> > the size of struct inode.
> >
> > There are two pathways forward.
> >
> > 1.) Attach a constant size 'blob' of storage to each inode.
> >
> > This is a similar approach to what the LSM uses where each blob is
> > sized as follows:
> >
> > S = U * sizeof(void *)
> >
> > Where U is the number of sub-systems that have a desire to use inode
> > specific storage.

> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.

I can state unequivocably for everyone's benefit, that as a team, we
have an intimate understanding of how LSM i_security blobs are used.

It was 0500 in the morning when I wrote the reply and I had personally
been working for 22 hours straight, so my apologies for being
imprecise.

I should not have specified sizeof(void *), I should have written
sizeof(allocation), for lack of a better syntax.

Also for the record, in a universal allocation scheme, when I say
sub-system I mean any implementation that would make use of per inode
information.  So the LSM, bpf tracing et.al., could all be considered
sub-systems that would register at boot time for a section of the
arena.

> > Each sub-system uses it's pointer slot to manage any additional
> > storage that it desires to attach to the inode.

> Again, an LSM may choose to do it that way, but most don't.  SELinux
> and Smack need data on every inode. It makes much more sense to put
> it directly in the blob than to allocate a separate chunk for every
> inode.

See my correction above.

> > This has the obvious advantage of O(1) cost complexity for any
> > sub-system that wants to access its inode specific storage.
> >
> > The disadvantage, as you note, is that it wastes memory if a
> > sub-system does not elect to attach per inode information, for example
> > the tracing infrastructure.

> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.

I think we all agree on that, therein lies the rub with a common arena
architecture, which is why I indicated in my earlier e-mail that this
comes down to engineering trade-off decisions.

That is why there would be a probable assumption that such sub-systems
would only request a pointer per arena slot and use that to reference
a dynamically allocated structure.  If, as a group, we are really
concerned about inode memory consumption the assumption would be that
the maintainers would have to whine about sparse consumers requesting
a structure sized allocation rather than a pointer sized allocation.

> > This disadvantage is parried by the fact that it reduces the size of
> > the inode proper by 24 bytes (4 pointers down to 1) and allows future
> > extensibility without colliding with the interests and desires of the
> > VFS maintainers.

> You're adding a level of indirection. Even I would object based on
> the performance impact.

I'm not sure that a thorough and complete analysis of the costs
associated with a sub-system touching inode local storage would
support the notion of a tangible performance hit.

The pointer in an arena slot would presumably be a pointer to a data
structure that a sub-system allocates at inode creation time.  After
computing the arena slot address in classic style (i_arena + offset)
the sub-system uses the pointer at that location to dereference
its structure elements or to find subordinate members in its arena.

If we take SMACK as an example, the smack inode contains three
pointers and a scalar.  So, if there is a need to access storage behind
one of those pointers, there is an additional indirection hit.

The three pointers are each to a structure (smack_known) that has
three list pointers and a mutex lock inside of it.

The SeLinux inode has a back pointer to the sponsoring inode, a list
head, a spinlock and some scalars.

So there is lots of potential indirection and locking going on with
access to inode local storage.

To extend further, for everyone thinking about this from an
engineering perspective.

A common arena model where everyone asks for a structure sized blob is
inherently cache pessimal.  Unless you are the first blob in the arena
you are going to need to hit another cache-line in order to start the
indirection process for your structure.

A pointer based arena architecture would allow up to eight sub-systems
to get their inode storage pointer for the cost of a single cache-line
fetch.

Let me offer another line of thinking on this drawn from the
discussion above.

A further optimization in the single pointer arena model is for the
LSM to place a pointer to a standard LSM sized memory blob in its
pointer slot on behalf of all the individual LSM's.  All of the
individual participating LSM's take that pointer and do the offset
calculation into the LSM arena for that inode as they normally would.

So there would seem to be a lot of engineering issues to consider that
are beyond the simple predicate that indirection is bad.

See, I do understand how the LSM arena model works.

> > 2.) Implement key/value mapping for inode specific storage.
> >
> > The key would be a sub-system specific numeric value that returns a
> > pointer the sub-system uses to manage its inode specific memory for a
> > particular inode.
> >
> > A participating sub-system in turn uses its identifier to register an
> > inode specific pointer for its sub-system.
> >
> > This strategy loses O(1) lookup complexity but reduces total memory
> > consumption and only imposes memory costs for inodes when a sub-system
> > desires to use inode specific storage.

> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional
> complexity of managing the memory.

I guess we would have to measure the performance impacts to understand
their level of mind boggliness.

My first thought is that we hear a huge amount of fanfare about BPF
being a game changer for tracing and network monitoring.  Given
current networking speeds, if its ability to manage storage needed for
it purposes are truely abysmal the industry wouldn't be finding the
technology useful.

Beyond that.

As I noted above, the LSM could be an independent subscriber.  The
pointer to register would come from the the kmem_cache allocator as it
does now, so that cost is idempotent with the current implementation.
The pointer registration would also be a single instance cost.

So the primary cost differential over the common arena model will be
the complexity costs associated with lookups in a red/black tree, if
we used the old IMA integrity cache as an example implementation.

As I noted above, these per inode local storage structures are complex
in of themselves, including lists and locks.  If touching an inode
involves locking and walking lists and the like it would seem that
those performance impacts would quickly swamp an r/b lookup cost.

> > Approach 2 requires the introduction of generic infrastructure that
> > allows an inode's key/value mappings to be located, presumably based
> > on the inode's pointer value.  We could probably just resurrect the
> > old IMA iint code for this purpose.
> >
> > In the end it comes down to a rather standard trade-off in this
> > business, memory vs. execution cost.
> >
> > We would posit that option 2 is the only viable scheme if the design
> > metric is overall good for the Linux kernel eco-system.

> No. Really, no. You need look no further than secmarks to understand
> how a key based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when data
> use is the norm.

Then it would seem that we need to get everyone to agree that we can
get by with using two pointers in struct inode.  One for uses best
served by common arena allocation and one for a key/pointer mapping,
and then convert the sub-systems accordingly.

Or alternately, getting everyone to agree that allocating a mininum of
eight additional bytes for every subscriber to private inode data
isn't the end of the world, even if use of the resource is sparse.

Of course, experience would suggest, that getting everyone in this
community to agree on something is roughly akin to throwing a hand
grenade into a chicken coop with an expectation that all of the
chickens will fly out in a uniform flock formation.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
              https://github.com/Quixote-Project

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-20  9:28                   ` Christian Brauner
  2024-11-20 11:19                     ` Amir Goldstein
@ 2024-11-21  8:08                     ` Song Liu
  1 sibling, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-21  8:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Song Liu, Amir Goldstein, Jeff Layton, Jan Kara, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com


> On Nov 20, 2024, at 1:28 AM, Christian Brauner <brauner@kernel.org> wrote:

[...]

>>>>> Then whenever you have to populate any of these fields, you just
>>>>> allocate one of these structs and set the inode up to point to it.
>>>>> They're tiny too, so don't bother freeing it until the inode is
>>>>> deallocated.
>>>>> 
>>>>> It'd mean rejiggering a fair bit of fsnotify code, but it would give
>>>>> the fsnotify code an easier way to expand per-inode info in the future.
>>>>> It would also slightly shrink struct inode too.
>> 
>> I am hoping to make i_bpf_storage available to tracing programs. 
>> Therefore, I would rather not limit it to fsnotify context. We can
>> still use the universal on-demand allocator.
> 
> Can't we just do something like:
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..cc05a5485365 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -627,6 +627,12 @@ is_uncached_acl(struct posix_acl *acl)
> #define IOP_DEFAULT_READLINK   0x0010
> #define IOP_MGTIME     0x0020
> 
> +struct inode_addons {
> +        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> +        struct bpf_local_storage __rcu          *i_bpf_storage;
> +        __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> +};
> +
> /*
>  * Keep mostly read-only and often accessed (especially for
>  * the RCU path lookup and 'stat' data) fields at the beginning
> @@ -731,12 +737,7 @@ struct inode {
>                unsigned                i_dir_seq;
>        };
> 
> -
> -#ifdef CONFIG_FSNOTIFY
> -       __u32                   i_fsnotify_mask; /* all events this inode cares about */
> -       /* 32-bit hole reserved for expanding i_fsnotify_mask */
> -       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> -#endif
> +       struct inode_addons *i_addons;
> 
> #ifdef CONFIG_FS_ENCRYPTION
>        struct fscrypt_inode_info       *i_crypt_info;
> 
> Then when either fsnotify or bpf needs that storage they can do a
> cmpxchg() based allocation for struct inode_addons just like I did with
> f_owner:
> 
> int file_f_owner_allocate(struct file *file)
> {
> struct fown_struct *f_owner;
> 
> f_owner = file_f_owner(file);
> if (f_owner)
> return 0;
> 
> f_owner = kzalloc(sizeof(struct fown_struct), GFP_KERNEL);
> if (!f_owner)
> return -ENOMEM;
> 
> rwlock_init(&f_owner->lock);
> f_owner->file = file;
> /* If someone else raced us, drop our allocation. */
> if (unlikely(cmpxchg(&file->f_owner, NULL, f_owner)))
> kfree(f_owner);
> return 0;
> }
> 
> The internal allocations for specific fields are up to the subsystem
> ofc. Does that make sense?

This works for fsnotify/fanotify. However, for tracing use cases, 
this is not as reliable as other (task, cgroup, sock) local storage. 
BPF tracing programs need to work in any contexts, including NMI. 
Therefore, doing kzalloc(GFP_KERNEL) is not always safe for 
tracing use cases. OTOH, bpf local storage works in NMI. If we have 
a i_bpf_storage pointer in struct inode, bpf inode storage will work 
in NMI. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-20 16:54                           ` Dr. Greg
@ 2024-11-21  8:28                             ` Song Liu
  2024-11-21 16:02                               ` Dr. Greg
  2024-11-21 17:47                               ` Casey Schaufler
  0 siblings, 2 replies; 53+ messages in thread
From: Song Liu @ 2024-11-21  8:28 UTC (permalink / raw)
  To: Dr. Greg
  Cc: Casey Schaufler, Song Liu, James Bottomley, jack@suse.cz,
	brauner@kernel.org, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

Hi Dr. Greg,

Thanks for your input!

> On Nov 20, 2024, at 8:54 AM, Dr. Greg <greg@enjellic.com> wrote:
> 
> On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:

[...]

> 
>>> 2.) Implement key/value mapping for inode specific storage.
>>> 
>>> The key would be a sub-system specific numeric value that returns a
>>> pointer the sub-system uses to manage its inode specific memory for a
>>> particular inode.
>>> 
>>> A participating sub-system in turn uses its identifier to register an
>>> inode specific pointer for its sub-system.
>>> 
>>> This strategy loses O(1) lookup complexity but reduces total memory
>>> consumption and only imposes memory costs for inodes when a sub-system
>>> desires to use inode specific storage.
> 
>> SELinux and Smack use an inode blob for every inode. The performance
>> regression boggles the mind. Not to mention the additional
>> complexity of managing the memory.
> 
> I guess we would have to measure the performance impacts to understand
> their level of mind boggliness.
> 
> My first thought is that we hear a huge amount of fanfare about BPF
> being a game changer for tracing and network monitoring.  Given
> current networking speeds, if its ability to manage storage needed for
> it purposes are truely abysmal the industry wouldn't be finding the
> technology useful.
> 
> Beyond that.
> 
> As I noted above, the LSM could be an independent subscriber.  The
> pointer to register would come from the the kmem_cache allocator as it
> does now, so that cost is idempotent with the current implementation.
> The pointer registration would also be a single instance cost.
> 
> So the primary cost differential over the common arena model will be
> the complexity costs associated with lookups in a red/black tree, if
> we used the old IMA integrity cache as an example implementation.
> 
> As I noted above, these per inode local storage structures are complex
> in of themselves, including lists and locks.  If touching an inode
> involves locking and walking lists and the like it would seem that
> those performance impacts would quickly swamp an r/b lookup cost.

bpf local storage is designed to be an arena like solution that works
for multiple bpf maps (and we don't know how many of maps we need 
ahead of time). Therefore, we may end up doing what you suggested 
earlier: every LSM should use bpf inode storage. ;) I am only 90%
kidding. 

> 
>>> Approach 2 requires the introduction of generic infrastructure that
>>> allows an inode's key/value mappings to be located, presumably based
>>> on the inode's pointer value.  We could probably just resurrect the
>>> old IMA iint code for this purpose.
>>> 
>>> In the end it comes down to a rather standard trade-off in this
>>> business, memory vs. execution cost.
>>> 
>>> We would posit that option 2 is the only viable scheme if the design
>>> metric is overall good for the Linux kernel eco-system.
> 
>> No. Really, no. You need look no further than secmarks to understand
>> how a key based blob allocation scheme leads to tears. Keys are fine
>> in the case where use of data is sparse. They have no place when data
>> use is the norm.
> 
> Then it would seem that we need to get everyone to agree that we can
> get by with using two pointers in struct inode.  One for uses best
> served by common arena allocation and one for a key/pointer mapping,
> and then convert the sub-systems accordingly.
> 
> Or alternately, getting everyone to agree that allocating a mininum of
> eight additional bytes for every subscriber to private inode data
> isn't the end of the world, even if use of the resource is sparse.

Christian suggested we can use an inode_addon structure, which is 
similar to this idea. It won't work well in all contexts, though. 
So it is not as good as other bpf local storage (task, sock, cgroup). 

Thanks,
Song

> 
> Of course, experience would suggest, that getting everyone in this
> community to agree on something is roughly akin to throwing a hand
> grenade into a chicken coop with an expectation that all of the
> chickens will fly out in a uniform flock formation.
> 
> As always,
> Dr. Greg
> 
> The Quixote Project - Flailing at the Travails of Cybersecurity
>              https://github.com/Quixote-Project


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-20 11:19                     ` Amir Goldstein
@ 2024-11-21  8:43                       ` Christian Brauner
  2024-11-21 13:48                       ` Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Christian Brauner @ 2024-11-21  8:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Song Liu, Jeff Layton, Jan Kara, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Wed, Nov 20, 2024 at 12:19:51PM +0100, Amir Goldstein wrote:
> On Wed, Nov 20, 2024 at 10:28 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Tue, Nov 19, 2024 at 09:53:20PM +0000, Song Liu wrote:
> > > Hi Jeff and Amir,
> > >
> > > Thanks for your inputs!
> > >
> > > > On Nov 19, 2024, at 7:30 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >>
> > > >> On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > >>>
> > >
> > > [...]
> > >
> > > >>> Longer term, I think it may be beneficial to come up with a way to attach
> > > >>>>> private info to the inode in a way that doesn't cost us one pointer per
> > > >>>>> funcionality that may possibly attach info to the inode. We already have
> > > >>>>> i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> > > >>>>> call where the space overhead for everybody is worth the runtime &
> > > >>>>> complexity overhead for users using the functionality...
> > > >>>>
> > > >>>> It does seem to be the right long term solution, and I am willing to
> > > >>>> work on it. However, I would really appreciate some positive feedback
> > > >>>> on the idea, so that I have better confidence my weeks of work has a
> > > >>>> better chance to worth it.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Song
> > > >>>>
> > > >>>> [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
> > > >>>
> > > >>> fsnotify is somewhat similar to file locking in that few inodes on the
> > > >>> machine actually utilize these fields.
> > > >>>
> > > >>> For file locking, we allocate and populate the inode->i_flctx field on
> > > >>> an as-needed basis. The kernel then hangs on to that struct until the
> > > >>> inode is freed.
> > >
> > > If we have some universal on-demand per-inode memory allocator,
> > > I guess we can move i_flctx to it?
> > >
> > > >>> We could do something similar here. We have this now:
> > > >>>
> > > >>> #ifdef CONFIG_FSNOTIFY
> > > >>>        __u32                   i_fsnotify_mask; /* all events this inode cares about */
> > > >>>        /* 32-bit hole reserved for expanding i_fsnotify_mask */
> > > >>>        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > > >>> #endif
> > >
> > > And maybe some fsnotify fields too?
> > >
> > > With a couple users, I think it justifies to have some universal
> > > on-demond allocator.
> > >
> > > >>> What if you were to turn these fields into a pointer to a new struct:
> > > >>>
> > > >>>        struct fsnotify_inode_context {
> > > >>>                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > > >>>                struct bpf_local_storage __rcu          *i_bpf_storage;
> > > >>>                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> > > >>>        };
> > > >>>
> > > >>
> > > >> The extra indirection is going to hurt for i_fsnotify_mask
> > > >> it is being accessed frequently in fsnotify hooks, so I wouldn't move it
> > > >> into a container, but it could be moved to the hole after i_state.
> > >
> > > >>> Then whenever you have to populate any of these fields, you just
> > > >>> allocate one of these structs and set the inode up to point to it.
> > > >>> They're tiny too, so don't bother freeing it until the inode is
> > > >>> deallocated.
> > > >>>
> > > >>> It'd mean rejiggering a fair bit of fsnotify code, but it would give
> > > >>> the fsnotify code an easier way to expand per-inode info in the future.
> > > >>> It would also slightly shrink struct inode too.
> > >
> > > I am hoping to make i_bpf_storage available to tracing programs.
> > > Therefore, I would rather not limit it to fsnotify context. We can
> > > still use the universal on-demand allocator.
> >
> > Can't we just do something like:
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7e29433c5ecc..cc05a5485365 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -627,6 +627,12 @@ is_uncached_acl(struct posix_acl *acl)
> >  #define IOP_DEFAULT_READLINK   0x0010
> >  #define IOP_MGTIME     0x0020
> >
> > +struct inode_addons {
> > +        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > +        struct bpf_local_storage __rcu          *i_bpf_storage;
> > +        __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> > +};
> > +
> >  /*
> >   * Keep mostly read-only and often accessed (especially for
> >   * the RCU path lookup and 'stat' data) fields at the beginning
> > @@ -731,12 +737,7 @@ struct inode {
> >                 unsigned                i_dir_seq;
> >         };
> >
> > -
> > -#ifdef CONFIG_FSNOTIFY
> > -       __u32                   i_fsnotify_mask; /* all events this inode cares about */
> > -       /* 32-bit hole reserved for expanding i_fsnotify_mask */
> > -       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > -#endif
> > +       struct inode_addons *i_addons;
> >
> >  #ifdef CONFIG_FS_ENCRYPTION
> >         struct fscrypt_inode_info       *i_crypt_info;
> >
> > Then when either fsnotify or bpf needs that storage they can do a
> > cmpxchg() based allocation for struct inode_addons just like I did with
> > f_owner:
> >
> > int file_f_owner_allocate(struct file *file)
> > {
> >         struct fown_struct *f_owner;
> >
> >         f_owner = file_f_owner(file);
> >         if (f_owner)
> >                 return 0;
> >
> >         f_owner = kzalloc(sizeof(struct fown_struct), GFP_KERNEL);
> >         if (!f_owner)
> >                 return -ENOMEM;
> >
> >         rwlock_init(&f_owner->lock);
> >         f_owner->file = file;
> >         /* If someone else raced us, drop our allocation. */
> >         if (unlikely(cmpxchg(&file->f_owner, NULL, f_owner)))
> >                 kfree(f_owner);
> >         return 0;
> > }
> >
> > The internal allocations for specific fields are up to the subsystem
> > ofc. Does that make sense?
> >
> 
> Maybe, but as I wrote, i_fsnotify_mask should not be moved out
> of inode struct, because it is accessed in fast paths of fsnotify vfs
> hooks, where we do not want to have to deref another context,
> but i_fsnotify_mask can be moved to the hole after i_state.
> 
> And why stop at i_fsnotify/i_bfp?
> If you go to "addons" why not also move i_security/i_crypt/i_verify?
> Need to have some common rationale behind those decisions.

The rationale is that we need a mechanism to stop bloating our
structures with ever more stuff somehow. What happens to older members
of struct inode is a cleanup matter and then it needs to be seen what
can be moved into a substruct and not mind the additional pointer chase.

It's just a generalization of your proposal in a way because I don't
understand why you would move the bpf stuff into fsnotify specific
parts.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-13 14:15     ` Song Liu
  2024-11-13 18:29       ` Casey Schaufler
@ 2024-11-21  9:04       ` Christian Brauner
  1 sibling, 0 replies; 53+ messages in thread
From: Christian Brauner @ 2024-11-21  9:04 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Wed, Nov 13, 2024 at 02:15:20PM +0000, Song Liu wrote:
> Hi Christian, 
> 
> Thanks for your review. 
> 
> > On Nov 13, 2024, at 2:19 AM, Christian Brauner <brauner@kernel.org> wrote:
> [...]
> 
> >> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >> index 3559446279c1..479097e4dd5b 100644
> >> --- a/include/linux/fs.h
> >> +++ b/include/linux/fs.h
> >> @@ -79,6 +79,7 @@ struct fs_context;
> >> struct fs_parameter_spec;
> >> struct fileattr;
> >> struct iomap_ops;
> >> +struct bpf_local_storage;
> >> 
> >> extern void __init inode_init(void);
> >> extern void __init inode_init_early(void);
> >> @@ -648,6 +649,9 @@ struct inode {
> >> #ifdef CONFIG_SECURITY
> >> void *i_security;
> >> #endif
> >> +#ifdef CONFIG_BPF_SYSCALL
> >> + struct bpf_local_storage __rcu *i_bpf_storage;
> >> +#endif
> > 
> > Sorry, we're not growing struct inode for this. It just keeps getting
> > bigger. Last cycle we freed up 8 bytes to shrink it and we're not going
> > to waste them on special-purpose stuff. We already NAKed someone else's
> > pet field here.
> 
> Would it be acceptable if we union i_bpf_storage with i_security?

I have no quarrels with this if this is acceptable to you.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-15 11:19       ` Jan Kara
  2024-11-15 17:35         ` Song Liu
@ 2024-11-21  9:14         ` Christian Brauner
  2024-11-23  0:08           ` Alexei Starovoitov
  1 sibling, 1 reply; 53+ messages in thread
From: Christian Brauner @ 2024-11-21  9:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Song Liu, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

> I'm personally not *so* hung up about a pointer in struct inode but I can
> see why Christian is and I agree adding a pointer there isn't a win for
> everybody.

Maybe I'm annoying here but I feel that I have to be. Because it's
trivial to grow structs it's rather cumbersome work to get them to
shrink. I just looked at struct task_struct again and it has four bpf
related structures/pointers in there. Ok, struct task_struct is
everyone's favorite struct to bloat so whatever.

For the VFS the structures we maintain longterm that need to be so
generic as to be usable by so many different filesystems have a tendency
to grow over time.

With some we are very strict, i.e., nobody would dare to grow struct
dentry and that's mostly because we have people that really care about
this and have an eye on that and ofc also because it's costly.

But for some structures we simply have no one caring about them that
much. So what happens is that with the ever growing list of features we
bloat them over time. There need to be some reasonable boundaries on
what we accept or not and the criteria I have been using is how
generically useful to filesystems or our infrastructre this is (roughly)
and this is rather very special-case so I'm weary of wasting 8 bytes in
struct inode that we fought rather hard to get back: Jeff's timespec
conversion and my i_state conversion.

I'm not saying it's out of the question but I want to exhaust all other
options and I'm not yet sure we have.

> Longer term, I think it may be beneficial to come up with a way to attach
> private info to the inode in a way that doesn't cost us one pointer per
> funcionality that may possibly attach info to the inode. We already have

Agreed.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-20 11:19                     ` Amir Goldstein
  2024-11-21  8:43                       ` Christian Brauner
@ 2024-11-21 13:48                       ` Jeff Layton
  1 sibling, 0 replies; 53+ messages in thread
From: Jeff Layton @ 2024-11-21 13:48 UTC (permalink / raw)
  To: Amir Goldstein, Christian Brauner
  Cc: Song Liu, Jan Kara, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, repnop@google.com, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Wed, 2024-11-20 at 12:19 +0100, Amir Goldstein wrote:
> On Wed, Nov 20, 2024 at 10:28 AM Christian Brauner <brauner@kernel.org> wrote:
> > 
> > On Tue, Nov 19, 2024 at 09:53:20PM +0000, Song Liu wrote:
> > > Hi Jeff and Amir,
> > > 
> > > Thanks for your inputs!
> > > 
> > > > On Nov 19, 2024, at 7:30 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > 
> > > > On Tue, Nov 19, 2024 at 4:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > 
> > > > > On Tue, Nov 19, 2024 at 3:21 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > 
> > > 
> > > [...]
> > > 
> > > > > > Longer term, I think it may be beneficial to come up with a way to attach
> > > > > > > > private info to the inode in a way that doesn't cost us one pointer per
> > > > > > > > funcionality that may possibly attach info to the inode. We already have
> > > > > > > > i_crypt_info, i_verity_info, i_flctx, i_security, etc. It's always a tough
> > > > > > > > call where the space overhead for everybody is worth the runtime &
> > > > > > > > complexity overhead for users using the functionality...
> > > > > > > 
> > > > > > > It does seem to be the right long term solution, and I am willing to
> > > > > > > work on it. However, I would really appreciate some positive feedback
> > > > > > > on the idea, so that I have better confidence my weeks of work has a
> > > > > > > better chance to worth it.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Song
> > > > > > > 
> > > > > > > [1] https://github.com/systemd/systemd/blob/main/src/core/bpf/restrict_fs/restrict-fs.bpf.c
> > > > > > 
> > > > > > fsnotify is somewhat similar to file locking in that few inodes on the
> > > > > > machine actually utilize these fields.
> > > > > > 
> > > > > > For file locking, we allocate and populate the inode->i_flctx field on
> > > > > > an as-needed basis. The kernel then hangs on to that struct until the
> > > > > > inode is freed.
> > > 
> > > If we have some universal on-demand per-inode memory allocator,
> > > I guess we can move i_flctx to it?
> > > 
> > > > > > We could do something similar here. We have this now:
> > > > > > 
> > > > > > #ifdef CONFIG_FSNOTIFY
> > > > > >        __u32                   i_fsnotify_mask; /* all events this inode cares about */
> > > > > >        /* 32-bit hole reserved for expanding i_fsnotify_mask */
> > > > > >        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > > > > > #endif
> > > 
> > > And maybe some fsnotify fields too?
> > > 
> > > With a couple users, I think it justifies to have some universal
> > > on-demond allocator.
> > > 
> > > > > > What if you were to turn these fields into a pointer to a new struct:
> > > > > > 
> > > > > >        struct fsnotify_inode_context {
> > > > > >                struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > > > > >                struct bpf_local_storage __rcu          *i_bpf_storage;
> > > > > >                __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> > > > > >        };
> > > > > > 
> > > > > 
> > > > > The extra indirection is going to hurt for i_fsnotify_mask
> > > > > it is being accessed frequently in fsnotify hooks, so I wouldn't move it
> > > > > into a container, but it could be moved to the hole after i_state.
> > > 
> > > > > > Then whenever you have to populate any of these fields, you just
> > > > > > allocate one of these structs and set the inode up to point to it.
> > > > > > They're tiny too, so don't bother freeing it until the inode is
> > > > > > deallocated.
> > > > > > 
> > > > > > It'd mean rejiggering a fair bit of fsnotify code, but it would give
> > > > > > the fsnotify code an easier way to expand per-inode info in the future.
> > > > > > It would also slightly shrink struct inode too.
> > > 
> > > I am hoping to make i_bpf_storage available to tracing programs.
> > > Therefore, I would rather not limit it to fsnotify context. We can
> > > still use the universal on-demand allocator.
> > 
> > Can't we just do something like:
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7e29433c5ecc..cc05a5485365 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -627,6 +627,12 @@ is_uncached_acl(struct posix_acl *acl)
> >  #define IOP_DEFAULT_READLINK   0x0010
> >  #define IOP_MGTIME     0x0020
> > 
> > +struct inode_addons {
> > +        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > +        struct bpf_local_storage __rcu          *i_bpf_storage;
> > +        __u32                                   i_fsnotify_mask; /* all events this inode cares about */
> > +};
> > +
> >  /*
> >   * Keep mostly read-only and often accessed (especially for
> >   * the RCU path lookup and 'stat' data) fields at the beginning
> > @@ -731,12 +737,7 @@ struct inode {
> >                 unsigned                i_dir_seq;
> >         };
> > 
> > -
> > -#ifdef CONFIG_FSNOTIFY
> > -       __u32                   i_fsnotify_mask; /* all events this inode cares about */
> > -       /* 32-bit hole reserved for expanding i_fsnotify_mask */
> > -       struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
> > -#endif
> > +       struct inode_addons *i_addons;
> > 
> >  #ifdef CONFIG_FS_ENCRYPTION
> >         struct fscrypt_inode_info       *i_crypt_info;
> > 
> > Then when either fsnotify or bpf needs that storage they can do a
> > cmpxchg() based allocation for struct inode_addons just like I did with
> > f_owner:
> > 
> > int file_f_owner_allocate(struct file *file)
> > {
> >         struct fown_struct *f_owner;
> > 
> >         f_owner = file_f_owner(file);
> >         if (f_owner)
> >                 return 0;
> > 
> >         f_owner = kzalloc(sizeof(struct fown_struct), GFP_KERNEL);
> >         if (!f_owner)
> >                 return -ENOMEM;
> > 
> >         rwlock_init(&f_owner->lock);
> >         f_owner->file = file;
> >         /* If someone else raced us, drop our allocation. */
> >         if (unlikely(cmpxchg(&file->f_owner, NULL, f_owner)))
> >                 kfree(f_owner);
> >         return 0;
> > }
> > 
> > The internal allocations for specific fields are up to the subsystem
> > ofc. Does that make sense?
> > 
> 
> Maybe, but as I wrote, i_fsnotify_mask should not be moved out
> of inode struct, because it is accessed in fast paths of fsnotify vfs
> hooks, where we do not want to have to deref another context,
> but i_fsnotify_mask can be moved to the hole after i_state.
>
> And why stop at i_fsnotify/i_bfp?
> If you go to "addons" why not also move i_security/i_crypt/i_verify?
> Need to have some common rationale behind those decisions.
> 

I don't think we would stop there. We could probably move several
fields into the new struct (i_flctx comes to mind), but that should
probably be done piecemeal, in later patchsets.

The bigger concern is that this is only helpful when inode_addons is
needed for a fraction of inodes on the system. i_security might not be
good candidate to move there for that reason. Anyone running with an
LSM is going to end up allocating one of these for almost every inode,
so they might as well just keep a pointer in struct inode instead.

We also need to be cautious here. This adds extra pointer indirection,
which could be costly for some uses.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-21  8:28                             ` Song Liu
@ 2024-11-21 16:02                               ` Dr. Greg
  2024-11-21 18:11                                 ` Casey Schaufler
  2024-11-21 17:47                               ` Casey Schaufler
  1 sibling, 1 reply; 53+ messages in thread
From: Dr. Greg @ 2024-11-21 16:02 UTC (permalink / raw)
  To: Song Liu
  Cc: Casey Schaufler, James Bottomley, jack@suse.cz,
	brauner@kernel.org, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Thu, Nov 21, 2024 at 08:28:05AM +0000, Song Liu wrote:

> Hi Dr. Greg,
> 
> Thanks for your input!

Good morning, I hope everyone's day is going well.

> > On Nov 20, 2024, at 8:54???AM, Dr. Greg <greg@enjellic.com> wrote:
> > 
> > On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:
> 
> [...]
> 
> > 
> >>> 2.) Implement key/value mapping for inode specific storage.
> >>> 
> >>> The key would be a sub-system specific numeric value that returns a
> >>> pointer the sub-system uses to manage its inode specific memory for a
> >>> particular inode.
> >>> 
> >>> A participating sub-system in turn uses its identifier to register an
> >>> inode specific pointer for its sub-system.
> >>> 
> >>> This strategy loses O(1) lookup complexity but reduces total memory
> >>> consumption and only imposes memory costs for inodes when a sub-system
> >>> desires to use inode specific storage.
> > 
> >> SELinux and Smack use an inode blob for every inode. The performance
> >> regression boggles the mind. Not to mention the additional
> >> complexity of managing the memory.
> > 
> > I guess we would have to measure the performance impacts to understand
> > their level of mind boggliness.
> > 
> > My first thought is that we hear a huge amount of fanfare about BPF
> > being a game changer for tracing and network monitoring.  Given
> > current networking speeds, if its ability to manage storage needed for
> > it purposes are truely abysmal the industry wouldn't be finding the
> > technology useful.
> > 
> > Beyond that.
> > 
> > As I noted above, the LSM could be an independent subscriber.  The
> > pointer to register would come from the the kmem_cache allocator as it
> > does now, so that cost is idempotent with the current implementation.
> > The pointer registration would also be a single instance cost.
> > 
> > So the primary cost differential over the common arena model will be
> > the complexity costs associated with lookups in a red/black tree, if
> > we used the old IMA integrity cache as an example implementation.
> > 
> > As I noted above, these per inode local storage structures are complex
> > in of themselves, including lists and locks.  If touching an inode
> > involves locking and walking lists and the like it would seem that
> > those performance impacts would quickly swamp an r/b lookup cost.

> bpf local storage is designed to be an arena like solution that
> works for multiple bpf maps (and we don't know how many of maps we
> need ahead of time). Therefore, we may end up doing what you
> suggested earlier: every LSM should use bpf inode storage. ;) I am
> only 90% kidding.

I will let you thrash that out with the LSM folks, we have enough on
our hands just with TSEM.... :-)

I think the most important issue in all of this is to get solid
performance measurements and let those speak to how we move forward.

As LSM authors ourself, we don't see an off-putting reason to not have
a common arena storage architecture that builds on what the LSM is
doing.  If sub-systems with sparse usage would agree that they need to
restrict themselves to a single pointer slot in the arena, it would
seem that memory consumption, in this day and age, would be tolerable.

See below for another idea.

> >>> Approach 2 requires the introduction of generic infrastructure that
> >>> allows an inode's key/value mappings to be located, presumably based
> >>> on the inode's pointer value.  We could probably just resurrect the
> >>> old IMA iint code for this purpose.
> >>> 
> >>> In the end it comes down to a rather standard trade-off in this
> >>> business, memory vs. execution cost.
> >>> 
> >>> We would posit that option 2 is the only viable scheme if the design
> >>> metric is overall good for the Linux kernel eco-system.
> > 
> >> No. Really, no. You need look no further than secmarks to understand
> >> how a key based blob allocation scheme leads to tears. Keys are fine
> >> in the case where use of data is sparse. They have no place when data
> >> use is the norm.
> > 
> > Then it would seem that we need to get everyone to agree that we can
> > get by with using two pointers in struct inode.  One for uses best
> > served by common arena allocation and one for a key/pointer mapping,
> > and then convert the sub-systems accordingly.
> > 
> > Or alternately, getting everyone to agree that allocating a mininum of
> > eight additional bytes for every subscriber to private inode data
> > isn't the end of the world, even if use of the resource is sparse.

> Christian suggested we can use an inode_addon structure, which is 
> similar to this idea. It won't work well in all contexts, though. 
> So it is not as good as other bpf local storage (task, sock,
> cgroup). 

Here is another thought in all of this.

I've mentioned the old IMA integrity inode cache a couple of times in
this thread.  The most peacable path forward may be to look at
generalizing that architecture so that a sub-system that wanted inode
local storage could request that an inode local storage cache manager
be implemented for it.

That infrastructure was based on a red/black tree that used the inode
pointer as a key to locate a pointer to a structure that contained
local information for the inode.  That takes away the need to embed
something in the inode structure proper.

Since insertion and lookup times have complexity functions that scale
with tree height it would seem to be a good fit for sparse utilization
scenarios.

An extra optimization that may be possible would be to maintain an
indicator flag tied the filesystem superblock that would provide a
simple binary answer as to whether any local inode cache managers have
been registered for inodes on a filesystem.  That would allow the
lookup to be completely skipped with a simple conditional test.

If the infrastructure was generalized to request and release cache
managers it would be suitable for systems, implemented as modules,
that have a need for local inode storage.

It also offers the ability for implementation independence, which is
always a good thing in the Linux community.

> Thanks,
> Song

Have a good day.

> > Of course, experience would suggest, that getting everyone in this
> > community to agree on something is roughly akin to throwing a hand
> > grenade into a chicken coop with an expectation that all of the
> > chickens will fly out in a uniform flock formation.
> > 
> > As always,
> > Dr. Greg
> > 
> > The Quixote Project - Flailing at the Travails of Cybersecurity
> >              https://github.com/Quixote-Project
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-21  8:28                             ` Song Liu
  2024-11-21 16:02                               ` Dr. Greg
@ 2024-11-21 17:47                               ` Casey Schaufler
  2024-11-21 18:28                                 ` Song Liu
  1 sibling, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-21 17:47 UTC (permalink / raw)
  To: Song Liu, Dr. Greg
  Cc: James Bottomley, jack@suse.cz, brauner@kernel.org, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com, Casey Schaufler

On 11/21/2024 12:28 AM, Song Liu wrote:
> Hi Dr. Greg,
>
> Thanks for your input!
>
>> On Nov 20, 2024, at 8:54 AM, Dr. Greg <greg@enjellic.com> wrote:
>>
>> On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:
> [...]
>
>>>> 2.) Implement key/value mapping for inode specific storage.
>>>>
>>>> The key would be a sub-system specific numeric value that returns a
>>>> pointer the sub-system uses to manage its inode specific memory for a
>>>> particular inode.
>>>>
>>>> A participating sub-system in turn uses its identifier to register an
>>>> inode specific pointer for its sub-system.
>>>>
>>>> This strategy loses O(1) lookup complexity but reduces total memory
>>>> consumption and only imposes memory costs for inodes when a sub-system
>>>> desires to use inode specific storage.
>>> SELinux and Smack use an inode blob for every inode. The performance
>>> regression boggles the mind. Not to mention the additional
>>> complexity of managing the memory.
>> I guess we would have to measure the performance impacts to understand
>> their level of mind boggliness.
>>
>> My first thought is that we hear a huge amount of fanfare about BPF
>> being a game changer for tracing and network monitoring.  Given
>> current networking speeds, if its ability to manage storage needed for
>> it purposes are truely abysmal the industry wouldn't be finding the
>> technology useful.
>>
>> Beyond that.
>>
>> As I noted above, the LSM could be an independent subscriber.  The
>> pointer to register would come from the the kmem_cache allocator as it
>> does now, so that cost is idempotent with the current implementation.
>> The pointer registration would also be a single instance cost.
>>
>> So the primary cost differential over the common arena model will be
>> the complexity costs associated with lookups in a red/black tree, if
>> we used the old IMA integrity cache as an example implementation.
>>
>> As I noted above, these per inode local storage structures are complex
>> in of themselves, including lists and locks.  If touching an inode
>> involves locking and walking lists and the like it would seem that
>> those performance impacts would quickly swamp an r/b lookup cost.
> bpf local storage is designed to be an arena like solution that works
> for multiple bpf maps (and we don't know how many of maps we need 
> ahead of time). Therefore, we may end up doing what you suggested 
> earlier: every LSM should use bpf inode storage. ;) I am only 90%
> kidding. 

Sorry, but that's not funny. It's the kind of suggestion that some
yoho takes seriously, whacks together a patch for, and gets accepted
via the xfd887 device tree. Then everyone screams at the SELinux folks
because of the performance impact. As I have already pointed out,
there are serious consequences for an LSM that has a blob on every
inode.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-21 16:02                               ` Dr. Greg
@ 2024-11-21 18:11                                 ` Casey Schaufler
  2024-11-23 17:01                                   ` Dr. Greg
  0 siblings, 1 reply; 53+ messages in thread
From: Casey Schaufler @ 2024-11-21 18:11 UTC (permalink / raw)
  To: Dr. Greg, Song Liu
  Cc: James Bottomley, jack@suse.cz, brauner@kernel.org, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com, Casey Schaufler

On 11/21/2024 8:02 AM, Dr. Greg wrote:
> On Thu, Nov 21, 2024 at 08:28:05AM +0000, Song Liu wrote:
>
>> Hi Dr. Greg,
>>
>> Thanks for your input!
> Good morning, I hope everyone's day is going well.
>
>>> On Nov 20, 2024, at 8:54???AM, Dr. Greg <greg@enjellic.com> wrote:
>>>
>>> On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:
>> [...]
>>
>>>>> 2.) Implement key/value mapping for inode specific storage.
>>>>>
>>>>> The key would be a sub-system specific numeric value that returns a
>>>>> pointer the sub-system uses to manage its inode specific memory for a
>>>>> particular inode.
>>>>>
>>>>> A participating sub-system in turn uses its identifier to register an
>>>>> inode specific pointer for its sub-system.
>>>>>
>>>>> This strategy loses O(1) lookup complexity but reduces total memory
>>>>> consumption and only imposes memory costs for inodes when a sub-system
>>>>> desires to use inode specific storage.
>>>> SELinux and Smack use an inode blob for every inode. The performance
>>>> regression boggles the mind. Not to mention the additional
>>>> complexity of managing the memory.
>>> I guess we would have to measure the performance impacts to understand
>>> their level of mind boggliness.
>>>
>>> My first thought is that we hear a huge amount of fanfare about BPF
>>> being a game changer for tracing and network monitoring.  Given
>>> current networking speeds, if its ability to manage storage needed for
>>> it purposes are truely abysmal the industry wouldn't be finding the
>>> technology useful.
>>>
>>> Beyond that.
>>>
>>> As I noted above, the LSM could be an independent subscriber.  The
>>> pointer to register would come from the the kmem_cache allocator as it
>>> does now, so that cost is idempotent with the current implementation.
>>> The pointer registration would also be a single instance cost.
>>>
>>> So the primary cost differential over the common arena model will be
>>> the complexity costs associated with lookups in a red/black tree, if
>>> we used the old IMA integrity cache as an example implementation.
>>>
>>> As I noted above, these per inode local storage structures are complex
>>> in of themselves, including lists and locks.  If touching an inode
>>> involves locking and walking lists and the like it would seem that
>>> those performance impacts would quickly swamp an r/b lookup cost.
>> bpf local storage is designed to be an arena like solution that
>> works for multiple bpf maps (and we don't know how many of maps we
>> need ahead of time). Therefore, we may end up doing what you
>> suggested earlier: every LSM should use bpf inode storage. ;) I am
>> only 90% kidding.
> I will let you thrash that out with the LSM folks, we have enough on
> our hands just with TSEM.... :-)
>
> I think the most important issue in all of this is to get solid
> performance measurements and let those speak to how we move forward.
>
> As LSM authors ourself, we don't see an off-putting reason to not have
> a common arena storage architecture that builds on what the LSM is
> doing.  If sub-systems with sparse usage would agree that they need to
> restrict themselves to a single pointer slot in the arena, it would
> seem that memory consumption, in this day and age, would be tolerable.
>
> See below for another idea.
>
>>>>> Approach 2 requires the introduction of generic infrastructure that
>>>>> allows an inode's key/value mappings to be located, presumably based
>>>>> on the inode's pointer value.  We could probably just resurrect the
>>>>> old IMA iint code for this purpose.
>>>>>
>>>>> In the end it comes down to a rather standard trade-off in this
>>>>> business, memory vs. execution cost.
>>>>>
>>>>> We would posit that option 2 is the only viable scheme if the design
>>>>> metric is overall good for the Linux kernel eco-system.
>>>> No. Really, no. You need look no further than secmarks to understand
>>>> how a key based blob allocation scheme leads to tears. Keys are fine
>>>> in the case where use of data is sparse. They have no place when data
>>>> use is the norm.
>>> Then it would seem that we need to get everyone to agree that we can
>>> get by with using two pointers in struct inode.  One for uses best
>>> served by common arena allocation and one for a key/pointer mapping,
>>> and then convert the sub-systems accordingly.
>>>
>>> Or alternately, getting everyone to agree that allocating a mininum of
>>> eight additional bytes for every subscriber to private inode data
>>> isn't the end of the world, even if use of the resource is sparse.
>> Christian suggested we can use an inode_addon structure, which is 
>> similar to this idea. It won't work well in all contexts, though. 
>> So it is not as good as other bpf local storage (task, sock,
>> cgroup). 
> Here is another thought in all of this.
>
> I've mentioned the old IMA integrity inode cache a couple of times in
> this thread.  The most peacable path forward may be to look at
> generalizing that architecture so that a sub-system that wanted inode
> local storage could request that an inode local storage cache manager
> be implemented for it.
>
> That infrastructure was based on a red/black tree that used the inode
> pointer as a key to locate a pointer to a structure that contained
> local information for the inode.  That takes away the need to embed
> something in the inode structure proper.
>
> Since insertion and lookup times have complexity functions that scale
> with tree height it would seem to be a good fit for sparse utilization
> scenarios.
>
> An extra optimization that may be possible would be to maintain an
> indicator flag tied the filesystem superblock that would provide a
> simple binary answer as to whether any local inode cache managers have
> been registered for inodes on a filesystem.  That would allow the
> lookup to be completely skipped with a simple conditional test.
>
> If the infrastructure was generalized to request and release cache
> managers it would be suitable for systems, implemented as modules,
> that have a need for local inode storage.

Do you think that over the past 20 years no one has thought of this?
We're working to make the LSM infrastructure cleaner and more robust.
Adding the burden of memory management to each LSM is a horrible idea.

> It also offers the ability for implementation independence, which is
> always a good thing in the Linux community.

Generality for the sake of generality is seriously overrated.
File systems have to be done so as to fit into the VFS infrastructure,
network protocols have to work with sockets without impacting the
performance of others and so forth.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-21 17:47                               ` Casey Schaufler
@ 2024-11-21 18:28                                 ` Song Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-11-21 18:28 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, Dr. Greg, James Bottomley, jack@suse.cz,
	brauner@kernel.org, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com



> On Nov 21, 2024, at 9:47 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> On 11/21/2024 12:28 AM, Song Liu wrote:
>> Hi Dr. Greg,
>> 
>> Thanks for your input!
>> 
>>> On Nov 20, 2024, at 8:54 AM, Dr. Greg <greg@enjellic.com> wrote:
>>> 
>>> On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:
>> [...]
>> 
>>>>> 2.) Implement key/value mapping for inode specific storage.
>>>>> 
>>>>> The key would be a sub-system specific numeric value that returns a
>>>>> pointer the sub-system uses to manage its inode specific memory for a
>>>>> particular inode.
>>>>> 
>>>>> A participating sub-system in turn uses its identifier to register an
>>>>> inode specific pointer for its sub-system.
>>>>> 
>>>>> This strategy loses O(1) lookup complexity but reduces total memory
>>>>> consumption and only imposes memory costs for inodes when a sub-system
>>>>> desires to use inode specific storage.
>>>> SELinux and Smack use an inode blob for every inode. The performance
>>>> regression boggles the mind. Not to mention the additional
>>>> complexity of managing the memory.
>>> I guess we would have to measure the performance impacts to understand
>>> their level of mind boggliness.
>>> 
>>> My first thought is that we hear a huge amount of fanfare about BPF
>>> being a game changer for tracing and network monitoring.  Given
>>> current networking speeds, if its ability to manage storage needed for
>>> it purposes are truely abysmal the industry wouldn't be finding the
>>> technology useful.
>>> 
>>> Beyond that.
>>> 
>>> As I noted above, the LSM could be an independent subscriber.  The
>>> pointer to register would come from the the kmem_cache allocator as it
>>> does now, so that cost is idempotent with the current implementation.
>>> The pointer registration would also be a single instance cost.
>>> 
>>> So the primary cost differential over the common arena model will be
>>> the complexity costs associated with lookups in a red/black tree, if
>>> we used the old IMA integrity cache as an example implementation.
>>> 
>>> As I noted above, these per inode local storage structures are complex
>>> in of themselves, including lists and locks.  If touching an inode
>>> involves locking and walking lists and the like it would seem that
>>> those performance impacts would quickly swamp an r/b lookup cost.
>> bpf local storage is designed to be an arena like solution that works
>> for multiple bpf maps (and we don't know how many of maps we need 
>> ahead of time). Therefore, we may end up doing what you suggested 
>> earlier: every LSM should use bpf inode storage. ;) I am only 90%
>> kidding.
> 
> Sorry, but that's not funny.

I didn't think this is funny. Many use cases can seriously benefit
from a _reliable_ allocator for inode attached data. 

> It's the kind of suggestion that some
> yoho takes seriously, whacks together a patch for, and gets accepted
> via the xfd887 device tree. Then everyone screams at the SELinux folks
> because of the performance impact. As I have already pointed out,
> there are serious consequences for an LSM that has a blob on every
> inode.

i_security serves this type of users pretty well. I see no reason 
to change this. At the same time, I see no reasons to block 
optimizations for other use cases because these users may get 
blamed in 2087 for a mistake by xfd887 device maintainers. 

Song





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program
  2024-11-21  9:14         ` Christian Brauner
@ 2024-11-23  0:08           ` Alexei Starovoitov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexei Starovoitov @ 2024-11-23  0:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Song Liu, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Thu, Nov 21, 2024 at 1:15 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > I'm personally not *so* hung up about a pointer in struct inode but I can
> > see why Christian is and I agree adding a pointer there isn't a win for
> > everybody.
>
> Maybe I'm annoying here but I feel that I have to be. Because it's
> trivial to grow structs it's rather cumbersome work to get them to
> shrink. I just looked at struct task_struct again and it has four bpf
> related structures/pointers in there. Ok, struct task_struct is
> everyone's favorite struct to bloat so whatever.
>
> For the VFS the structures we maintain longterm that need to be so
> generic as to be usable by so many different filesystems have a tendency
> to grow over time.
>
> With some we are very strict, i.e., nobody would dare to grow struct
> dentry and that's mostly because we have people that really care about
> this and have an eye on that and ofc also because it's costly.
>
> But for some structures we simply have no one caring about them that
> much. So what happens is that with the ever growing list of features we
> bloat them over time. There need to be some reasonable boundaries on
> what we accept or not and the criteria I have been using is how
> generically useful to filesystems or our infrastructre this is (roughly)
> and this is rather very special-case so I'm weary of wasting 8 bytes in
> struct inode that we fought rather hard to get back: Jeff's timespec
> conversion and my i_state conversion.
>
> I'm not saying it's out of the question but I want to exhaust all other
> options and I'm not yet sure we have.

+1 to all of the above.

I hope we don't end up as strict as sk_buff though.

I think Song can proceed without this patch.
Worst case bpf hash map with key==inode will do.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-21 18:11                                 ` Casey Schaufler
@ 2024-11-23 17:01                                   ` Dr. Greg
  2024-11-25 20:49                                     ` Casey Schaufler
  0 siblings, 1 reply; 53+ messages in thread
From: Dr. Greg @ 2024-11-23 17:01 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Song Liu, James Bottomley, jack@suse.cz, brauner@kernel.org,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com

On Thu, Nov 21, 2024 at 10:11:16AM -0800, Casey Schaufler wrote:

Good morning, I hope the weekend is going well for everyone.

> On 11/21/2024 8:02 AM, Dr. Greg wrote:
> > On Thu, Nov 21, 2024 at 08:28:05AM +0000, Song Liu wrote:
> >
> >> Hi Dr. Greg,
> >>
> >> Thanks for your input!
> > Good morning, I hope everyone's day is going well.
> >
> >>> On Nov 20, 2024, at 8:54???AM, Dr. Greg <greg@enjellic.com> wrote:
> >>>
> >>> On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:
> >> [...]
> >>
> >>>>> 2.) Implement key/value mapping for inode specific storage.
> >>>>>
> >>>>> The key would be a sub-system specific numeric value that returns a
> >>>>> pointer the sub-system uses to manage its inode specific memory for a
> >>>>> particular inode.
> >>>>>
> >>>>> A participating sub-system in turn uses its identifier to register an
> >>>>> inode specific pointer for its sub-system.
> >>>>>
> >>>>> This strategy loses O(1) lookup complexity but reduces total memory
> >>>>> consumption and only imposes memory costs for inodes when a sub-system
> >>>>> desires to use inode specific storage.

> >>>> SELinux and Smack use an inode blob for every inode. The performance
> >>>> regression boggles the mind. Not to mention the additional
> >>>> complexity of managing the memory.

> >>> I guess we would have to measure the performance impacts to understand
> >>> their level of mind boggliness.
> >>>
> >>> My first thought is that we hear a huge amount of fanfare about BPF
> >>> being a game changer for tracing and network monitoring.  Given
> >>> current networking speeds, if its ability to manage storage needed for
> >>> it purposes are truely abysmal the industry wouldn't be finding the
> >>> technology useful.
> >>>
> >>> Beyond that.
> >>>
> >>> As I noted above, the LSM could be an independent subscriber.  The
> >>> pointer to register would come from the the kmem_cache allocator as it
> >>> does now, so that cost is idempotent with the current implementation.
> >>> The pointer registration would also be a single instance cost.
> >>>
> >>> So the primary cost differential over the common arena model will be
> >>> the complexity costs associated with lookups in a red/black tree, if
> >>> we used the old IMA integrity cache as an example implementation.
> >>>
> >>> As I noted above, these per inode local storage structures are complex
> >>> in of themselves, including lists and locks.  If touching an inode
> >>> involves locking and walking lists and the like it would seem that
> >>> those performance impacts would quickly swamp an r/b lookup cost.

> >> bpf local storage is designed to be an arena like solution that
> >> works for multiple bpf maps (and we don't know how many of maps we
> >> need ahead of time). Therefore, we may end up doing what you
> >> suggested earlier: every LSM should use bpf inode storage. ;) I am
> >> only 90% kidding.

> > I will let you thrash that out with the LSM folks, we have enough on
> > our hands just with TSEM.... :-)
> >
> > I think the most important issue in all of this is to get solid
> > performance measurements and let those speak to how we move forward.
> >
> > As LSM authors ourself, we don't see an off-putting reason to not have
> > a common arena storage architecture that builds on what the LSM is
> > doing.  If sub-systems with sparse usage would agree that they need to
> > restrict themselves to a single pointer slot in the arena, it would
> > seem that memory consumption, in this day and age, would be tolerable.
> >
> > See below for another idea.

> >>>>> Approach 2 requires the introduction of generic infrastructure that
> >>>>> allows an inode's key/value mappings to be located, presumably based
> >>>>> on the inode's pointer value.  We could probably just resurrect the
> >>>>> old IMA iint code for this purpose.
> >>>>>
> >>>>> In the end it comes down to a rather standard trade-off in this
> >>>>> business, memory vs. execution cost.
> >>>>>
> >>>>> We would posit that option 2 is the only viable scheme if the design
> >>>>> metric is overall good for the Linux kernel eco-system.

> >>>> No. Really, no. You need look no further than secmarks to understand
> >>>> how a key based blob allocation scheme leads to tears. Keys are fine
> >>>> in the case where use of data is sparse. They have no place when data
> >>>> use is the norm.

> >>> Then it would seem that we need to get everyone to agree that we can
> >>> get by with using two pointers in struct inode.  One for uses best
> >>> served by common arena allocation and one for a key/pointer mapping,
> >>> and then convert the sub-systems accordingly.
> >>>
> >>> Or alternately, getting everyone to agree that allocating a mininum of
> >>> eight additional bytes for every subscriber to private inode data
> >>> isn't the end of the world, even if use of the resource is sparse.

> >> Christian suggested we can use an inode_addon structure, which is 
> >> similar to this idea. It won't work well in all contexts, though. 
> >> So it is not as good as other bpf local storage (task, sock,
> >> cgroup). 

> > Here is another thought in all of this.
> >
> > I've mentioned the old IMA integrity inode cache a couple of times in
> > this thread.  The most peacable path forward may be to look at
> > generalizing that architecture so that a sub-system that wanted inode
> > local storage could request that an inode local storage cache manager
> > be implemented for it.
> >
> > That infrastructure was based on a red/black tree that used the inode
> > pointer as a key to locate a pointer to a structure that contained
> > local information for the inode.  That takes away the need to embed
> > something in the inode structure proper.
> >
> > Since insertion and lookup times have complexity functions that scale
> > with tree height it would seem to be a good fit for sparse utilization
> > scenarios.
> >
> > An extra optimization that may be possible would be to maintain an
> > indicator flag tied the filesystem superblock that would provide a
> > simple binary answer as to whether any local inode cache managers have
> > been registered for inodes on a filesystem.  That would allow the
> > lookup to be completely skipped with a simple conditional test.
> >
> > If the infrastructure was generalized to request and release cache
> > managers it would be suitable for systems, implemented as modules,
> > that have a need for local inode storage.

> Do you think that over the past 20 years no one has thought of this?
> We're working to make the LSM infrastructure cleaner and more
> robust.  Adding the burden of memory management to each LSM is a
> horrible idea.

No, I cannot ascribe to the notion that I, personally, know what
everyone has thought about in the last 20 years.

I do know, personally, that very talented individuals who are involved
with large security sensitive operations question the trajectory of
the LSM.  That, however, is a debate for another venue.

For the lore record and everyone reading along at home, you
misinterpreted or did not read closely my e-mail.

We were not proposing adding memory management to each LSM, we were
suggesting to Song Liu that generalizing, what was the old IMA inode
integrity infrastructure, may be a path forward for sub-systems that
need inode local storage, particularly systems that have sparse
occupancy requirements.

Everyone has their britches in a knicker about performance.

Note that we called out a possible optimization for this architecture
so that there would be no need to even hit the r/b tree if a
filesystem had no sub-systems that had requested sparse inode local
storage for that filesystem.

> > It also offers the ability for implementation independence, which is
> > always a good thing in the Linux community.

> Generality for the sake of generality is seriously overrated.
> File systems have to be done so as to fit into the VFS infrastructure,
> network protocols have to work with sockets without impacting the
> performance of others and so forth.

We were not advocating generality for the sake of generality, we were
suggesting a generalized architecture, that does not require expansion
of struct inode, because Christian has publically indicated there is
no appetite by the VFS maintainers for consuming additional space in
struct inode for infrastructure requiring local inode storage.

You talk about cooperation, yet you object to any consideration that
the LSM should participate in a shared arena environment where
sub-systems wanting local inode storage could just request a block in
a common arena.  The LSM, in this case, is just like a filesystem
since it is a consumer of infrastructure supplied by the VFS and
should thus cooperate with other consumers of VFS infrastructure.

If people go back and read our last paragraph you replied to we were
not speaking to the advantages of generality, we were speaking to the
advantage of independent implementations that did unnecessarily cross
sub-system lines.  Casual observation of Linux development, and this
thread, would suggest the importance of that.

I need to get a bunch of firewood under cover so I will leave things
at that.

Have a good weekend.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
              https://github.com/Quixote-Project

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-14 21:49                   ` James Bottomley
  2024-11-14 22:30                     ` Song Liu
  2024-11-17 22:59                     ` Song Liu
@ 2024-11-23 19:11                     ` Paul Moore
  2 siblings, 0 replies; 53+ messages in thread
From: Paul Moore @ 2024-11-23 19:11 UTC (permalink / raw)
  To: James Bottomley, Casey Schaufler, Song Liu
  Cc: Dr. Greg, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, amir73il@gmail.com,
	repnop@google.com, jlayton@kernel.org, Josef Bacik,
	mic@digikod.net, gnoack@google.com

On Thu, Nov 14, 2024 at 4:49 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> Got to say I'm with Casey here, this will generate horrible and failure
> prone code.
>
> Since effectively you're making i_security always present anyway,
> simply do that and also pull the allocation code out of security.c in a
> way that it's always available?  That way you don't have to special
> case the code depending on whether CONFIG_SECURITY is defined.
> Effectively this would give everyone a generic way to attach some
> memory area to an inode.  I know it's more complex than this because
> there are LSM hooks that run from security_inode_alloc() but if you can
> make it work generically, I'm sure everyone will benefit.

My apologies on such a delayed response to this thread, I've had very
limited network access for a bit due to travel and the usual merge
window related distractions (and some others that were completely
unrelated) have left me with quite the mail backlog to sift through.

Enough with the excuses ...

Quickly skimming this thread and the v3 patchset, I would advise you
that there may be issues around using BPF LSMs and storing inode LSM
state outside the LSM managed inode storage blob.  Beyond the
conceptual objections that Casey has already mentioned, there have
been issues relating to the disjoint inode and inode->i_security
lifetimes.  While I believe we have an okay-ish solution in place now
for LSMs, I can't promise everything will work fine for BPF LSMs that
manage their inode LSM state outside of the LSM managed inode blob.
I'm sure you've already looked at it, but if you haven't it might be
worth looking at security_inode_free() to see some of the details.  In
a perfect world inode->i_security would be better synchronized with
the inode lifetime, but that would involve changes that the VFS folks
dislike.

However, while I will recommend against it, I'm not going to object to
you storing BPF LSM inode state elsewhere, that is up to you and KP
(he would need to ACK that as the BPF LSM maintainer).  I just want
you to know that if things break, there isn't much we (the LSM folks)
will be able to do to help other than suggest you go back to using the
LSM managed inode storage.

As far as some of the other ideas in this thread are concerned, at
this point in time I don't think we want to do any massive rework or
consolidation around i_security.  That's a critical field for the LSM
framework and many individual LSMs and there is work underway which
relies on this as a LSM specific inode storage blob; having to share
i_security with non-LSM users or moving the management of i_security
outside of the LSM is not something I'm overly excited about right
now.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
  2024-11-23 17:01                                   ` Dr. Greg
@ 2024-11-25 20:49                                     ` Casey Schaufler
  0 siblings, 0 replies; 53+ messages in thread
From: Casey Schaufler @ 2024-11-25 20:49 UTC (permalink / raw)
  To: Dr. Greg
  Cc: Song Liu, James Bottomley, jack@suse.cz, brauner@kernel.org,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, kpsingh@kernel.org,
	mattbobrowski@google.com, amir73il@gmail.com, repnop@google.com,
	jlayton@kernel.org, Josef Bacik, mic@digikod.net,
	gnoack@google.com, Casey Schaufler

On 11/23/2024 9:01 AM, Dr. Greg wrote:
>>> Here is another thought in all of this.
>>>
>>> I've mentioned the old IMA integrity inode cache a couple of times in
>>> this thread.  The most peacable path forward may be to look at
>>> generalizing that architecture so that a sub-system that wanted inode
>>> local storage could request that an inode local storage cache manager
>>> be implemented for it.
>>>
>>> That infrastructure was based on a red/black tree that used the inode
>>> pointer as a key to locate a pointer to a structure that contained
>>> local information for the inode.  That takes away the need to embed
>>> something in the inode structure proper.
>>>
>>> Since insertion and lookup times have complexity functions that scale
>>> with tree height it would seem to be a good fit for sparse utilization
>>> scenarios.
>>>
>>> An extra optimization that may be possible would be to maintain an
>>> indicator flag tied the filesystem superblock that would provide a
>>> simple binary answer as to whether any local inode cache managers have
>>> been registered for inodes on a filesystem.  That would allow the
>>> lookup to be completely skipped with a simple conditional test.
>>>
>>> If the infrastructure was generalized to request and release cache
>>> managers it would be suitable for systems, implemented as modules,
>>> that have a need for local inode storage.
>> Do you think that over the past 20 years no one has thought of this?
>> We're working to make the LSM infrastructure cleaner and more
>> robust.  Adding the burden of memory management to each LSM is a
>> horrible idea.
> No, I cannot ascribe to the notion that I, personally, know what
> everyone has thought about in the last 20 years.
>
> I do know, personally, that very talented individuals who are involved
> with large security sensitive operations question the trajectory of
> the LSM.  That, however, is a debate for another venue.

I invite anyone who would "question the trajectory" of the LSM to
step up and do so publicly. I don't claim to be the most talented
individual working in the security community, but I am busting my
butt to get the work done. Occasionally I've had corporate backing,
but generally I've been doing it as a hobbyist on my own time. You
can threaten the LSM developers with the wrath of "large security
sensitive operations", but in the absence of participation in the
process you can't expect much to change.

> For the lore record and everyone reading along at home, you
> misinterpreted or did not read closely my e-mail.
>
> We were not proposing adding memory management to each LSM, we were
> suggesting to Song Liu that generalizing, what was the old IMA inode
> integrity infrastructure, may be a path forward for sub-systems that
> need inode local storage, particularly systems that have sparse
> occupancy requirements.
>
> Everyone has their britches in a knicker about performance.

Darn Toot'n! You can't work in security for very long before you
run up against those who hate security because of the real or
perceived performance impact. To be successful in security development
it is essential to have a good grasp of the impact on other aspects
of the system. It is vital to understand how the implementation
affects others and why it is the best way to accomplish the goals. 

> Note that we called out a possible optimization for this architecture
> so that there would be no need to even hit the r/b tree if a
> filesystem had no sub-systems that had requested sparse inode local
> storage for that filesystem.
>
>>> It also offers the ability for implementation independence, which is
>>> always a good thing in the Linux community.
>> Generality for the sake of generality is seriously overrated.
>> File systems have to be done so as to fit into the VFS infrastructure,
>> network protocols have to work with sockets without impacting the
>> performance of others and so forth.
> We were not advocating generality for the sake of generality, we were
> suggesting a generalized architecture, that does not require expansion
> of struct inode, because Christian has publically indicated there is
> no appetite by the VFS maintainers for consuming additional space in
> struct inode for infrastructure requiring local inode storage.
>
> You talk about cooperation, yet you object to any consideration that
> the LSM should participate in a shared arena environment where
> sub-systems wanting local inode storage could just request a block in
> a common arena.  The LSM, in this case, is just like a filesystem
> since it is a consumer of infrastructure supplied by the VFS and
> should thus cooperate with other consumers of VFS infrastructure.

I am perfectly open to a change that improves LSM performance.
This would not be the case with this proposal.


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2024-11-25 20:49 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-12  8:25 [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
2024-11-12  8:25 ` [PATCH bpf-next 1/4] bpf: lsm: Remove hook to bpf_task_storage_free Song Liu
2024-11-12  8:25 ` [PATCH bpf-next 2/4] bpf: Make bpf inode storage available to tracing program Song Liu
2024-11-13 10:19   ` Christian Brauner
2024-11-13 14:15     ` Song Liu
2024-11-13 18:29       ` Casey Schaufler
2024-11-13 19:00         ` Song Liu
2024-11-21  9:04       ` Christian Brauner
2024-11-14 21:11     ` Song Liu
2024-11-15 11:19       ` Jan Kara
2024-11-15 17:35         ` Song Liu
2024-11-19 14:21           ` Jeff Layton
2024-11-19 15:25             ` Amir Goldstein
2024-11-19 15:30               ` Amir Goldstein
2024-11-19 21:53                 ` Song Liu
2024-11-20  9:19                   ` Amir Goldstein
2024-11-20  9:28                   ` Christian Brauner
2024-11-20 11:19                     ` Amir Goldstein
2024-11-21  8:43                       ` Christian Brauner
2024-11-21 13:48                       ` Jeff Layton
2024-11-21  8:08                     ` Song Liu
2024-11-21  9:14         ` Christian Brauner
2024-11-23  0:08           ` Alexei Starovoitov
2024-11-12  8:25 ` [PATCH bpf-next 3/4] bpf: Add recursion avoid logic for inode storage Song Liu
2024-11-12  8:25 ` [PATCH bpf-next 3/4] bpf: Add recursion prevention " Song Liu
2024-11-12  8:25 ` [PATCH bpf-next 4/4] selftest/bpf: Add test for inode local storage recursion Song Liu
2024-11-12  8:26 ` [PATCH bpf-next 4/4] selftest/bpf: Test inode local storage recursion prevention Song Liu
2024-11-12  8:35 ` [PATCH bpf-next 0/4] Make inode storage available to tracing prog Song Liu
2024-11-12 18:09 ` Casey Schaufler
2024-11-12 18:44   ` Song Liu
2024-11-13  1:10     ` Casey Schaufler
2024-11-13  1:37       ` Song Liu
2024-11-13 18:06         ` Casey Schaufler
2024-11-13 18:57           ` Song Liu
2024-11-14 16:36             ` Dr. Greg
2024-11-14 17:29               ` Casey Schaufler
2024-11-14 18:08                 ` Song Liu
2024-11-14 21:49                   ` James Bottomley
2024-11-14 22:30                     ` Song Liu
2024-11-17 22:59                     ` Song Liu
2024-11-19 12:27                       ` Dr. Greg
2024-11-19 18:14                         ` Casey Schaufler
2024-11-19 22:35                           ` Song Liu
2024-11-20 16:54                           ` Dr. Greg
2024-11-21  8:28                             ` Song Liu
2024-11-21 16:02                               ` Dr. Greg
2024-11-21 18:11                                 ` Casey Schaufler
2024-11-23 17:01                                   ` Dr. Greg
2024-11-25 20:49                                     ` Casey Schaufler
2024-11-21 17:47                               ` Casey Schaufler
2024-11-21 18:28                                 ` Song Liu
2024-11-23 19:11                     ` Paul Moore
2024-11-14 17:51               ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).