* [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
@ 2023-11-12 7:34 Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 1/4] mm, security: Add lsm hook for mbind(2) Yafang Shao
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-12 7:34 UTC (permalink / raw)
To: akpm, paul, jmorris, serge
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, mhocko,
Yafang Shao
Background
==========
In our containerized environment, we've identified unexpected OOM events
where the OOM-killer terminates tasks despite having ample free memory.
This anomaly is traced back to tasks within a container using mbind(2) to
bind memory to a specific NUMA node. When the allocated memory on this node
is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
indiscriminately kills tasks. This becomes more critical with guaranteed
tasks (oom_score_adj: -998), which aggravate the issue.
The selected victim might not have allocated memory on the same NUMA node,
rendering the kill ineffective. This patchset addresses the problem by
making it possible to disable MPOL_BIND in container environments.
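For illustration, the problematic usage boils down to a call sequence like
the sketch below (untested; the mapping size and the choice of node 0 are
arbitrary, and it needs libnuma's <numaif.h>):

#include <stddef.h>
#include <numaif.h>
#include <sys/mman.h>

/* Bind an anonymous mapping to NUMA node 0. Any unprivileged task can
 * do this today.
 */
static int bind_to_node0(size_t len)
{
	unsigned long nodemask = 1UL;	/* bit 0 selects node 0 */
	void *addr;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	/* Pages in this range may now only be allocated from node 0. */
	return mbind(addr, len, MPOL_BIND, &nodemask,
		     sizeof(nodemask) * 8, 0);
}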
In the container environment, our aim is to consolidate memory resource
control under the management of kubelet. If users express a preference for
binding their memory to a specific NUMA node, we encourage the adoption of
a standardized approach. Specifically, we recommend configuring this memory
policy through kubelet using cpuset.mems in the cpuset controller, rather
than individual users setting it autonomously. This centralized approach
ensures that NUMA nodes are globally managed through kubelet, promoting
consistency and facilitating streamlined administration of memory resources
across the entire containerized environment.
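As a rough sketch of that centralized alternative, the container manager
would pin a container's allowed nodes by writing the cgroup's cpuset.mems
file; the helper name and cgroup path below are made up for illustration:

#include <stdio.h>

/* Hypothetical helper: restrict a container cgroup to a set of NUMA nodes
 * by writing its cpuset.mems file (cgroup v2 cpuset controller).
 */
static int set_allowed_mems(const char *cgroup, const char *nodes)
{
	char path[256];
	FILE *f;
	int ret;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpuset.mems", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ret = fprintf(f, "%s\n", nodes) < 0 ? -1 : 0;	/* e.g. "0" or "0-1" */
	if (fclose(f))
		ret = -1;
	return ret;
}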
Proposed Solutions
=================
- Introduce Capability to Disable MPOL_BIND
Currently, any task can perform MPOL_BIND without specific capabilities.
Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
may have unintended consequences. Capabilities, being broad, might grant
unnecessary privileges. We should explore alternatives to prevent
unexpected side effects.
- Use LSM BPF to Disable MPOL_BIND
Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
flexible and allows for fine-grained control without unintended
consequences. A sample LSM BPF program is included, demonstrating
practical implementation in a production environment.
Future Considerations
=====================
In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
prioritize selecting a victim that has allocated memory on the same NUMA
node. A search on lore turned up a proposal[0] related to this matter,
although consensus seems elusive at this point. Nevertheless,
delving into this specific topic is beyond the scope of the current
patchset.
[0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
Yafang Shao (4):
mm, security: Add lsm hook for mbind(2)
mm, security: Add lsm hook for set_mempolicy(2)
mm, security: Add lsm hook for set_mempolicy_home_node(2)
selftests/bpf: Add selftests for mbind(2) with lsm prog
include/linux/lsm_hook_defs.h | 8 +++
include/linux/security.h | 26 +++++++
mm/mempolicy.c | 13 ++++
security/security.c | 19 ++++++
tools/testing/selftests/bpf/prog_tests/mempolicy.c | 79 ++++++++++++++++++++++
tools/testing/selftests/bpf/progs/test_mempolicy.c | 29 ++++++++
6 files changed, 174 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/mempolicy.c
create mode 100644 tools/testing/selftests/bpf/progs/test_mempolicy.c
--
1.8.3.1
* [RFC PATCH -mm 1/4] mm, security: Add lsm hook for mbind(2)
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
@ 2023-11-12 7:34 ` Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 2/4] mm, security: Add lsm hook for set_mempolicy(2) Yafang Shao
` (4 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-12 7:34 UTC (permalink / raw)
To: akpm, paul, jmorris, serge
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, mhocko,
Yafang Shao
In a container environment, we don't want users to bind their memory to a
specific NUMA node on their own; instead, memory resources should be
controlled centrally through kubelet. Therefore, add a new LSM hook for
mbind(2) so that we can enforce fine-grained control over memory policy
adjustments made by the tasks in a container.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/lsm_hook_defs.h | 4 ++++
include/linux/security.h | 10 ++++++++++
mm/mempolicy.c | 4 ++++
security/security.c | 7 +++++++
4 files changed, 25 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 99b8176..b1b5e3a 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -419,3 +419,7 @@
LSM_HOOK(int, 0, uring_sqpoll, void)
LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
#endif /* CONFIG_IO_URING */
+
+LSM_HOOK(int, 0, mbind, unsigned long start, unsigned long len,
+ unsigned long mode, const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags)
diff --git a/include/linux/security.h b/include/linux/security.h
index 1d1df326..9f87543 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -484,6 +484,9 @@ int security_setprocattr(const char *lsm, const char *name, void *value,
int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
int security_locked_down(enum lockdown_reason what);
+int security_mbind(unsigned long start, unsigned long len,
+ unsigned long mode, const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags);
#else /* CONFIG_SECURITY */
static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1395,6 +1398,13 @@ static inline int security_locked_down(enum lockdown_reason what)
{
return 0;
}
+
+static inline int security_mbind(unsigned long start, unsigned long len,
+ unsigned long mode, const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags)
+{
+ return 0;
+}
#endif /* CONFIG_SECURITY */
#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590e..98a378c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1483,6 +1483,10 @@ static long kernel_mbind(unsigned long start, unsigned long len,
if (err)
return err;
+ err = security_mbind(start, len, mode, nmask, maxnode, flags);
+ if (err)
+ return err;
+
return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
}
diff --git a/security/security.c b/security/security.c
index dcb3e70..425ec1c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5337,3 +5337,10 @@ int security_uring_cmd(struct io_uring_cmd *ioucmd)
return call_int_hook(uring_cmd, 0, ioucmd);
}
#endif /* CONFIG_IO_URING */
+
+int security_mbind(unsigned long start, unsigned long len,
+ unsigned long mode, const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags)
+{
+ return call_int_hook(mbind, 0, start, len, mode, nmask, maxnode, flags);
+}
--
1.8.3.1
* [RFC PATCH -mm 2/4] mm, security: Add lsm hook for set_mempolicy(2)
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 1/4] mm, security: Add lsm hook for mbind(2) Yafang Shao
@ 2023-11-12 7:34 ` Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 3/4] mm, security: Add lsm hook for set_mempolicy_home_node(2) Yafang Shao
` (3 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-12 7:34 UTC (permalink / raw)
To: akpm, paul, jmorris, serge
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, mhocko,
Yafang Shao
In a container environment, we don't want users to bind their memory to a
specific NUMA node on their own; instead, memory resources should be
controlled centrally through kubelet. Therefore, add a new LSM hook for
set_mempolicy(2) so that we can enforce fine-grained control over memory
policy adjustments made by the tasks in a container.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/lsm_hook_defs.h | 2 ++
include/linux/security.h | 8 ++++++++
mm/mempolicy.c | 4 ++++
security/security.c | 5 +++++
4 files changed, 19 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index b1b5e3a..725a03d 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -423,3 +423,5 @@
LSM_HOOK(int, 0, mbind, unsigned long start, unsigned long len,
unsigned long mode, const unsigned long __user *nmask,
unsigned long maxnode, unsigned int flags)
+LSM_HOOK(int, 0, set_mempolicy, int mode, const unsigned long __user *nmask,
+ unsigned long maxnode)
diff --git a/include/linux/security.h b/include/linux/security.h
index 9f87543..93c91b6a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -487,6 +487,8 @@ int security_setprocattr(const char *lsm, const char *name, void *value,
int security_mbind(unsigned long start, unsigned long len,
unsigned long mode, const unsigned long __user *nmask,
unsigned long maxnode, unsigned int flags);
+int security_set_mempolicy(int mode, const unsigned long __user *nmask,
+ unsigned long maxnode);
#else /* CONFIG_SECURITY */
static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1405,6 +1407,12 @@ static inline int security_mbind(unsigned long start, unsigned long len,
{
return 0;
}
+
+static inline int security_set_mempolicy(int mode, const unsigned long __user *nmask,
+ unsigned long maxnode)
+{
+ return 0;
+}
#endif /* CONFIG_SECURITY */
#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 98a378c..0a76cd2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1581,6 +1581,10 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
if (err)
return err;
+ err = security_set_mempolicy(mode, nmask, maxnode);
+ if (err)
+ return err;
+
return do_set_mempolicy(lmode, mode_flags, &nodes);
}
diff --git a/security/security.c b/security/security.c
index 425ec1c..79ae17d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5344,3 +5344,8 @@ int security_mbind(unsigned long start, unsigned long len,
{
return call_int_hook(mbind, 0, start, len, mode, nmask, maxnode, flags);
}
+
+int security_set_mempolicy(int mode, const unsigned long __user *nmask, unsigned long maxnode)
+{
+ return call_int_hook(set_mempolicy, 0, mode, nmask, maxnode);
+}
--
1.8.3.1
* [RFC PATCH -mm 3/4] mm, security: Add lsm hook for set_mempolicy_home_node(2)
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 1/4] mm, security: Add lsm hook for mbind(2) Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 2/4] mm, security: Add lsm hook for set_mempolicy(2) Yafang Shao
@ 2023-11-12 7:34 ` Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 4/4] selftests/bpf: Add selftests for mbind(2) with lsm prog Yafang Shao
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-12 7:34 UTC (permalink / raw)
To: akpm, paul, jmorris, serge
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, mhocko,
Yafang Shao
In a container environment, we don't want users to bind their memory to a
specific NUMA node on their own; instead, memory resources should be
controlled centrally through kubelet. Therefore, add a new LSM hook for
set_mempolicy_home_node(2) so that we can enforce fine-grained control over
memory policy adjustments made by the tasks in a container.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/lsm_hook_defs.h | 2 ++
include/linux/security.h | 8 ++++++++
mm/mempolicy.c | 5 +++++
security/security.c | 7 +++++++
4 files changed, 22 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 725a03d..109883e 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -425,3 +425,5 @@
unsigned long maxnode, unsigned int flags)
LSM_HOOK(int, 0, set_mempolicy, int mode, const unsigned long __user *nmask,
unsigned long maxnode)
+LSM_HOOK(int, 0, set_mempolicy_home_node, unsigned long start, unsigned long len,
+ unsigned long home_node, unsigned long flags)
diff --git a/include/linux/security.h b/include/linux/security.h
index 93c91b6a..7b7096f 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -489,6 +489,8 @@ int security_mbind(unsigned long start, unsigned long len,
unsigned long maxnode, unsigned int flags);
int security_set_mempolicy(int mode, const unsigned long __user *nmask,
unsigned long maxnode);
+int security_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node, unsigned long flags);
#else /* CONFIG_SECURITY */
static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1413,6 +1415,12 @@ static inline int security_set_mempolicy(int mode, const unsigned long __user *n
{
return 0;
}
+
+static inline int security_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node, unsigned long flags)
+{
+ return 0;
+}
#endif /* CONFIG_SECURITY */
#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0a76cd2..54106e1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1523,6 +1523,11 @@ static long kernel_mbind(unsigned long start, unsigned long len,
return -EINVAL;
if (end == start)
return 0;
+
+ err = security_set_mempolicy_home_node(start, len, home_node, flags);
+ if (err)
+ return err;
+
mmap_write_lock(mm);
prev = vma_prev(&vmi);
for_each_vma_range(vmi, vma, end) {
diff --git a/security/security.c b/security/security.c
index 79ae17d..0a2e062 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5349,3 +5349,10 @@ int security_set_mempolicy(int mode, const unsigned long __user *nmask, unsigned
{
return call_int_hook(set_mempolicy, 0, mode, nmask, maxnode);
}
+
+int security_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node, unsigned long flags)
+{
+
+ return call_int_hook(set_mempolicy_home_node, 0, start, len, home_node, flags);
+}
--
1.8.3.1
* [RFC PATCH -mm 4/4] selftests/bpf: Add selftests for mbind(2) with lsm prog
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (2 preceding siblings ...)
2023-11-12 7:34 ` [RFC PATCH -mm 3/4] mm, security: Add lsm hook for set_mempolicy_home_node(2) Yafang Shao
@ 2023-11-12 7:34 ` Yafang Shao
2023-11-12 16:45 ` [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Casey Schaufler
2023-11-12 20:32 ` Paul Moore
5 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-12 7:34 UTC (permalink / raw)
To: akpm, paul, jmorris, serge
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, mhocko,
Yafang Shao
The result is as follows:
#142/1 mempolicy/MPOL_BIND_with_lsm:OK
#142/2 mempolicy/MPOL_DEFAULT_with_lsm:OK
#142/3 mempolicy/MPOL_BIND_without_lsm:OK
#142/4 mempolicy/MPOL_DEFAULT_without_lsm:OK
#142 mempolicy:OK
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
tools/testing/selftests/bpf/prog_tests/mempolicy.c | 79 ++++++++++++++++++++++
tools/testing/selftests/bpf/progs/test_mempolicy.c | 29 ++++++++
2 files changed, 108 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/mempolicy.c
create mode 100644 tools/testing/selftests/bpf/progs/test_mempolicy.c
diff --git a/tools/testing/selftests/bpf/prog_tests/mempolicy.c b/tools/testing/selftests/bpf/prog_tests/mempolicy.c
new file mode 100644
index 0000000..e0dfb18
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/mempolicy.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <numaif.h>
+#include <test_progs.h>
+#include "test_mempolicy.skel.h"
+
+#define SIZE 4096
+
+static void mempolicy_bind(bool success)
+{
+ unsigned long mask = 1;
+ char *addr;
+ int err;
+
+ addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (!ASSERT_OK_PTR(addr, "mmap"))
+ return;
+
+ err = mbind(addr, SIZE, MPOL_BIND, &mask, sizeof(mask), 0);
+ if (success)
+ ASSERT_OK(err, "mbind_success");
+ else
+ ASSERT_ERR(err, "mbind_fail");
+
+ munmap(addr, SIZE);
+}
+
+static void mempolicy_default(void)
+{
+ char *addr;
+ int err;
+
+ addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (!ASSERT_OK_PTR(addr, "mmap"))
+ return;
+
+ err = mbind(addr, SIZE, MPOL_DEFAULT, NULL, 0, 0);
+ ASSERT_OK(err, "mbind_success");
+
+ munmap(addr, SIZE);
+}
+void test_mempolicy(void)
+{
+ struct test_mempolicy *skel;
+ int err;
+
+ skel = test_mempolicy__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ return;
+
+ skel->bss->target_pid = getpid();
+
+ err = test_mempolicy__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+
+ /* Attach LSM prog first */
+ err = test_mempolicy__attach(skel);
+ if (!ASSERT_OK(err, "attach"))
+ goto destroy;
+
+ /* syscall to adjust memory policy */
+ if (test__start_subtest("MPOL_BIND_with_lsm"))
+ mempolicy_bind(false);
+ if (test__start_subtest("MPOL_DEFAULT_with_lsm"))
+ mempolicy_default();
+
+destroy:
+ test_mempolicy__destroy(skel);
+
+ if (test__start_subtest("MPOL_BIND_without_lsm"))
+ mempolicy_bind(true);
+ if (test__start_subtest("MPOL_DEFAULT_without_lsm"))
+ mempolicy_default();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_mempolicy.c b/tools/testing/selftests/bpf/progs/test_mempolicy.c
new file mode 100644
index 0000000..2fe8c99
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_mempolicy.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Yafang Shao <laoar.shao@gmail.com> */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+int target_pid;
+
+static int mem_policy_adjustment(u64 mode)
+{
+ struct task_struct *task = bpf_get_current_task_btf();
+
+ if (task->pid != target_pid)
+ return 0;
+
+ if (mode != MPOL_BIND)
+ return 0;
+ return -1;
+}
+
+SEC("lsm/mbind")
+int BPF_PROG(mbind_run, u64 start, u64 len, u64 mode, const u64 *nmask, u64 maxnode, u32 flags)
+{
+ return mem_policy_adjustment(mode);
+}
+
+char _license[] SEC("license") = "GPL";
--
1.8.3.1
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (3 preceding siblings ...)
2023-11-12 7:34 ` [RFC PATCH -mm 4/4] selftests/bpf: Add selftests for mbind(2) with lsm prog Yafang Shao
@ 2023-11-12 16:45 ` Casey Schaufler
2023-11-13 3:15 ` Yafang Shao
2023-11-12 20:32 ` Paul Moore
5 siblings, 1 reply; 23+ messages in thread
From: Casey Schaufler @ 2023-11-12 16:45 UTC (permalink / raw)
To: Yafang Shao, akpm, paul, jmorris, serge
Cc: linux-mm, linux-security-module, bpf, ligang.bdlg, mhocko,
Casey Schaufler
On 11/11/2023 11:34 PM, Yafang Shao wrote:
> Background
> ==========
>
> In our containerized environment, we've identified unexpected OOM events
> where the OOM-killer terminates tasks despite having ample free memory.
> This anomaly is traced back to tasks within a container using mbind(2) to
> bind memory to a specific NUMA node. When the allocated memory on this node
> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> indiscriminately kills tasks. This becomes more critical with guaranteed
> tasks (oom_score_adj: -998) aggravating the issue.
Is there some reason why you can't fix the callers of mbind(2)?
This looks like a user space configuration error rather than a
system security issue.
>
> The selected victim might not have allocated memory on the same NUMA node,
> rendering the killing ineffective. This patch aims to address this by
> disabling MPOL_BIND in container environments.
>
> In the container environment, our aim is to consolidate memory resource
> control under the management of kubelet. If users express a preference for
> binding their memory to a specific NUMA node, we encourage the adoption of
> a standardized approach. Specifically, we recommend configuring this memory
> policy through kubelet using cpuset.mems in the cpuset controller, rather
> than individual users setting it autonomously. This centralized approach
> ensures that NUMA nodes are globally managed through kubelet, promoting
> consistency and facilitating streamlined administration of memory resources
> across the entire containerized environment.
Changing system behavior for a single use case doesn't seem prudent.
You're introducing a bunch of kernel code to avoid fixing a broken
user space configuration.
>
> Proposed Solutions
> =================
>
> - Introduce Capability to Disable MPOL_BIND
> Currently, any task can perform MPOL_BIND without specific capabilities.
> Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> may have unintended consequences. Capabilities, being broad, might grant
> unnecessary privileges. We should explore alternatives to prevent
> unexpected side effects.
>
> - Use LSM BPF to Disable MPOL_BIND
> Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
> set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
> flexibility and allows for fine-grained control without unintended
> consequences. A sample LSM BPF program is included, demonstrating
> practical implementation in a production environment.
>
> Future Considerations
> =====================
>
> In addition, there's room for enhancement in the OOM-killer for cases
> involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> prioritize selecting a victim that has allocated memory on the same NUMA
> node. My exploration on the lore led me to a proposal[0] related to this
> matter, although consensus seems elusive at this point. Nevertheless,
> delving into this specific topic is beyond the scope of the current
> patchset.
>
> [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
>
> Yafang Shao (4):
> mm, security: Add lsm hook for mbind(2)
> mm, security: Add lsm hook for set_mempolicy(2)
> mm, security: Add lsm hook for set_mempolicy_home_node(2)
> selftests/bpf: Add selftests for mbind(2) with lsm prog
>
> include/linux/lsm_hook_defs.h | 8 +++
> include/linux/security.h | 26 +++++++
> mm/mempolicy.c | 13 ++++
> security/security.c | 19 ++++++
> tools/testing/selftests/bpf/prog_tests/mempolicy.c | 79 ++++++++++++++++++++++
> tools/testing/selftests/bpf/progs/test_mempolicy.c | 29 ++++++++
> 6 files changed, 174 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/mempolicy.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_mempolicy.c
>
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
` (4 preceding siblings ...)
2023-11-12 16:45 ` [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Casey Schaufler
@ 2023-11-12 20:32 ` Paul Moore
2023-11-13 3:17 ` Yafang Shao
5 siblings, 1 reply; 23+ messages in thread
From: Paul Moore @ 2023-11-12 20:32 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, jmorris, serge, linux-mm, linux-security-module, bpf,
ligang.bdlg, mhocko
On Sun, Nov 12, 2023 at 2:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Background
> ==========
>
> In our containerized environment, we've identified unexpected OOM events
> where the OOM-killer terminates tasks despite having ample free memory.
> This anomaly is traced back to tasks within a container using mbind(2) to
> bind memory to a specific NUMA node. When the allocated memory on this node
> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> indiscriminately kills tasks. This becomes more critical with guaranteed
> tasks (oom_score_adj: -998) aggravating the issue.
>
> The selected victim might not have allocated memory on the same NUMA node,
> rendering the killing ineffective. This patch aims to address this by
> disabling MPOL_BIND in container environments.
>
> In the container environment, our aim is to consolidate memory resource
> control under the management of kubelet. If users express a preference for
> binding their memory to a specific NUMA node, we encourage the adoption of
> a standardized approach. Specifically, we recommend configuring this memory
> policy through kubelet using cpuset.mems in the cpuset controller, rather
> than individual users setting it autonomously. This centralized approach
> ensures that NUMA nodes are globally managed through kubelet, promoting
> consistency and facilitating streamlined administration of memory resources
> across the entire containerized environment.
>
> Proposed Solutions
> =================
>
> - Introduce Capability to Disable MPOL_BIND
> Currently, any task can perform MPOL_BIND without specific capabilities.
> Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> may have unintended consequences. Capabilities, being broad, might grant
> unnecessary privileges. We should explore alternatives to prevent
> unexpected side effects.
>
> - Use LSM BPF to Disable MPOL_BIND
> Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
> set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
> flexibility and allows for fine-grained control without unintended
> consequences. A sample LSM BPF program is included, demonstrating
> practical implementation in a production environment.
Without looking at the patchset in any detail yet, I wanted to point
out that we do have some documented guidelines for adding new LSM
hooks:
https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines
I just learned that there are provisions for adding this to the
MAINTAINERS file, I'll be doing that shortly. My apologies for not
having it in there sooner.
> Future Considerations
> =====================
>
> In addition, there's room for enhancement in the OOM-killer for cases
> involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> prioritize selecting a victim that has allocated memory on the same NUMA
> node. My exploration on the lore led me to a proposal[0] related to this
> matter, although consensus seems elusive at this point. Nevertheless,
> delving into this specific topic is beyond the scope of the current
> patchset.
>
> [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
--
paul-moore.com
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-12 16:45 ` [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Casey Schaufler
@ 2023-11-13 3:15 ` Yafang Shao
2023-11-13 8:50 ` Ondrej Mosnacek
2023-11-14 10:15 ` Michal Hocko
0 siblings, 2 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-13 3:15 UTC (permalink / raw)
To: Casey Schaufler
Cc: akpm, paul, jmorris, serge, linux-mm, linux-security-module, bpf,
ligang.bdlg, mhocko
On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM events
> > where the OOM-killer terminates tasks despite having ample free memory.
> > This anomaly is traced back to tasks within a container using mbind(2) to
> > bind memory to a specific NUMA node. When the allocated memory on this node
> > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > indiscriminately kills tasks. This becomes more critical with guaranteed
> > tasks (oom_score_adj: -998) aggravating the issue.
>
> Is there some reason why you can't fix the callers of mbind(2)?
> This looks like an user space configuration error rather than a
> system security issue.
It appears my initial description may have caused confusion. In this
scenario, the caller is an unprivileged user lacking any capabilities.
While a privileged user, such as root, experiencing this issue might
indicate a user space configuration error, the concerning aspect is
the potential for an unprivileged user to disrupt the system easily.
If this is perceived as a misconfiguration, the question arises: What
is the correct configuration to prevent an unprivileged user from
utilizing mbind(2)?
>
> >
> > The selected victim might not have allocated memory on the same NUMA node,
> > rendering the killing ineffective. This patch aims to address this by
> > disabling MPOL_BIND in container environments.
> >
> > In the container environment, our aim is to consolidate memory resource
> > control under the management of kubelet. If users express a preference for
> > binding their memory to a specific NUMA node, we encourage the adoption of
> > a standardized approach. Specifically, we recommend configuring this memory
> > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > than individual users setting it autonomously. This centralized approach
> > ensures that NUMA nodes are globally managed through kubelet, promoting
> > consistency and facilitating streamlined administration of memory resources
> > across the entire containerized environment.
>
> Changing system behavior for a single use case doesn't seem prudent.
> You're introducing a bunch of kernel code to avoid fixing a broken
> user space configuration.
Currently, there is no mechanism in place to proactively prevent an
unprivileged user from utilizing mbind(2). The approach adopted is to
monitor mbind(2) through a BPF program and trigger an alert if its
usage is detected. However, beyond this monitoring, the only recourse
is to verbally communicate with the user, advising against the use of
mbind(2). As a result, users will question why mbind(2) isn't outright
prohibited in the first place.
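For reference, a minimal sketch of such a monitoring-only program might look
like the following (program and section names are illustrative, and this
version merely logs to the trace pipe rather than raising a proper alert):

// SPDX-License-Identifier: GPL-2.0
/* Monitoring-only sketch: report MPOL_BIND usage via mbind(2). */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("ksyscall/mbind")
int BPF_KSYSCALL(mbind_monitor, unsigned long start, unsigned long len,
		 unsigned long mode)
{
	if (mode == MPOL_BIND)
		bpf_printk("mbind(MPOL_BIND) by pid %d",
			   (int)(bpf_get_current_pid_tgid() >> 32));
	return 0;
}

char _license[] SEC("license") = "GPL";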
--
Regards
Yafang
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-12 20:32 ` Paul Moore
@ 2023-11-13 3:17 ` Yafang Shao
0 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-13 3:17 UTC (permalink / raw)
To: Paul Moore
Cc: akpm, jmorris, serge, linux-mm, linux-security-module, bpf,
ligang.bdlg, mhocko
On Mon, Nov 13, 2023 at 4:32 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Sun, Nov 12, 2023 at 2:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM events
> > where the OOM-killer terminates tasks despite having ample free memory.
> > This anomaly is traced back to tasks within a container using mbind(2) to
> > bind memory to a specific NUMA node. When the allocated memory on this node
> > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > indiscriminately kills tasks. This becomes more critical with guaranteed
> > tasks (oom_score_adj: -998) aggravating the issue.
> >
> > The selected victim might not have allocated memory on the same NUMA node,
> > rendering the killing ineffective. This patch aims to address this by
> > disabling MPOL_BIND in container environments.
> >
> > In the container environment, our aim is to consolidate memory resource
> > control under the management of kubelet. If users express a preference for
> > binding their memory to a specific NUMA node, we encourage the adoption of
> > a standardized approach. Specifically, we recommend configuring this memory
> > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > than individual users setting it autonomously. This centralized approach
> > ensures that NUMA nodes are globally managed through kubelet, promoting
> > consistency and facilitating streamlined administration of memory resources
> > across the entire containerized environment.
> >
> > Proposed Solutions
> > =================
> >
> > - Introduce Capability to Disable MPOL_BIND
> > Currently, any task can perform MPOL_BIND without specific capabilities.
> > Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > may have unintended consequences. Capabilities, being broad, might grant
> > unnecessary privileges. We should explore alternatives to prevent
> > unexpected side effects.
> >
> > - Use LSM BPF to Disable MPOL_BIND
> > Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
> > set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
> > flexibility and allows for fine-grained control without unintended
> > consequences. A sample LSM BPF program is included, demonstrating
> > practical implementation in a production environment.
>
> Without looking at the patchset in any detail yet, I wanted to point
> out that we do have some documented guidelines for adding new LSM
> hooks:
>
> https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines
>
> I just learned that there are provisions for adding this to the
> MAINTAINERS file, I'll be doing that shortly. My apologies for not
> having it in there sooner.
Thanks for your information. I will learn it carefully.
--
Regards
Yafang
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-13 3:15 ` Yafang Shao
@ 2023-11-13 8:50 ` Ondrej Mosnacek
2023-11-13 21:23 ` Casey Schaufler
2023-11-14 2:30 ` Yafang Shao
2023-11-14 10:15 ` Michal Hocko
1 sibling, 2 replies; 23+ messages in thread
From: Ondrej Mosnacek @ 2023-11-13 8:50 UTC (permalink / raw)
To: Yafang Shao
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg, mhocko
On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > tasks (oom_score_adj: -998) aggravating the issue.
> >
> > Is there some reason why you can't fix the callers of mbind(2)?
> > This looks like an user space configuration error rather than a
> > system security issue.
>
> It appears my initial description may have caused confusion. In this
> scenario, the caller is an unprivileged user lacking any capabilities.
> While a privileged user, such as root, experiencing this issue might
> indicate a user space configuration error, the concerning aspect is
> the potential for an unprivileged user to disrupt the system easily.
> If this is perceived as a misconfiguration, the question arises: What
> is the correct configuration to prevent an unprivileged user from
> utilizing mbind(2)?"
>
> >
> > >
> > > The selected victim might not have allocated memory on the same NUMA node,
> > > rendering the killing ineffective. This patch aims to address this by
> > > disabling MPOL_BIND in container environments.
> > >
> > > In the container environment, our aim is to consolidate memory resource
> > > control under the management of kubelet. If users express a preference for
> > > binding their memory to a specific NUMA node, we encourage the adoption of
> > > a standardized approach. Specifically, we recommend configuring this memory
> > > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > > than individual users setting it autonomously. This centralized approach
> > > ensures that NUMA nodes are globally managed through kubelet, promoting
> > > consistency and facilitating streamlined administration of memory resources
> > > across the entire containerized environment.
> >
> > Changing system behavior for a single use case doesn't seem prudent.
> > You're introducing a bunch of kernel code to avoid fixing a broken
> > user space configuration.
>
> Currently, there is no mechanism in place to proactively prevent an
> unprivileged user from utilizing mbind(2). The approach adopted is to
> monitor mbind(2) through a BPF program and trigger an alert if its
> usage is detected. However, beyond this monitoring, the only recourse
> is to verbally communicate with the user, advising against the use of
> mbind(2). As a result, users will question why mbind(2) isn't outright
> prohibited in the first place.
Is there a reason why you can't use syscall filtering via seccomp(2)?
AFAIK, all the mainstream container tooling already has support for
specifying seccomp filters for containers.
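For example, a minimal libseccomp sketch expressing such a filter could look
like the following (illustrative only; container runtimes would typically
carry the equivalent rules in their seccomp profile rather than open-coding
it):

#include <errno.h>
#include <seccomp.h>

/* Allow everything except mbind(2) and set_mempolicy(2), which fail
 * with EPERM instead of killing the task.
 */
static int install_mempolicy_filter(void)
{
	scmp_filter_ctx ctx;
	int err = -1;

	ctx = seccomp_init(SCMP_ACT_ALLOW);
	if (!ctx)
		return -1;

	if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mbind), 0) ||
	    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(set_mempolicy), 0))
		goto out;

	err = seccomp_load(ctx);
out:
	seccomp_release(ctx);
	return err;
}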
--
Ondrej Mosnacek
Senior Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-13 8:50 ` Ondrej Mosnacek
@ 2023-11-13 21:23 ` Casey Schaufler
2023-11-14 2:30 ` Yafang Shao
1 sibling, 0 replies; 23+ messages in thread
From: Casey Schaufler @ 2023-11-13 21:23 UTC (permalink / raw)
To: Ondrej Mosnacek, Yafang Shao
Cc: akpm, paul, jmorris, serge, linux-mm, linux-security-module, bpf,
ligang.bdlg, mhocko, Casey Schaufler
On 11/13/2023 12:50 AM, Ondrej Mosnacek wrote:
> On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
>>>> Background
>>>> ==========
>>>>
>>>> In our containerized environment, we've identified unexpected OOM events
>>>> where the OOM-killer terminates tasks despite having ample free memory.
>>>> This anomaly is traced back to tasks within a container using mbind(2) to
>>>> bind memory to a specific NUMA node. When the allocated memory on this node
>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
>>>> tasks (oom_score_adj: -998) aggravating the issue.
>>> Is there some reason why you can't fix the callers of mbind(2)?
>>> This looks like an user space configuration error rather than a
>>> system security issue.
>> It appears my initial description may have caused confusion. In this
>> scenario, the caller is an unprivileged user lacking any capabilities.
>> While a privileged user, such as root, experiencing this issue might
>> indicate a user space configuration error, the concerning aspect is
>> the potential for an unprivileged user to disrupt the system easily.
>> If this is perceived as a misconfiguration, the question arises: What
>> is the correct configuration to prevent an unprivileged user from
>> utilizing mbind(2)?"
>>
>>>> The selected victim might not have allocated memory on the same NUMA node,
>>>> rendering the killing ineffective. This patch aims to address this by
>>>> disabling MPOL_BIND in container environments.
>>>>
>>>> In the container environment, our aim is to consolidate memory resource
>>>> control under the management of kubelet. If users express a preference for
>>>> binding their memory to a specific NUMA node, we encourage the adoption of
>>>> a standardized approach. Specifically, we recommend configuring this memory
>>>> policy through kubelet using cpuset.mems in the cpuset controller, rather
>>>> than individual users setting it autonomously. This centralized approach
>>>> ensures that NUMA nodes are globally managed through kubelet, promoting
>>>> consistency and facilitating streamlined administration of memory resources
>>>> across the entire containerized environment.
>>> Changing system behavior for a single use case doesn't seem prudent.
>>> You're introducing a bunch of kernel code to avoid fixing a broken
>>> user space configuration.
>> Currently, there is no mechanism in place to proactively prevent an
>> unprivileged user from utilizing mbind(2). The approach adopted is to
>> monitor mbind(2) through a BPF program and trigger an alert if its
>> usage is detected. However, beyond this monitoring, the only recourse
>> is to verbally communicate with the user, advising against the use of
>> mbind(2). As a result, users will question why mbind(2) isn't outright
>> prohibited in the first place.
> Is there a reason why you can't use syscall filtering via seccomp(2)?
> AFAIK, all the mainstream container tooling already has support for
> specifying seccomp filters for containers.
That looks like a practical solution from here.
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-13 8:50 ` Ondrej Mosnacek
2023-11-13 21:23 ` Casey Schaufler
@ 2023-11-14 2:30 ` Yafang Shao
1 sibling, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-14 2:30 UTC (permalink / raw)
To: Ondrej Mosnacek
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg, mhocko
On Mon, Nov 13, 2023 at 4:50 PM Ondrej Mosnacek <omosnace@redhat.com> wrote:
>
> On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like an user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?"
> >
> > >
> > > >
> > > > The selected victim might not have allocated memory on the same NUMA node,
> > > > rendering the killing ineffective. This patch aims to address this by
> > > > disabling MPOL_BIND in container environments.
> > > >
> > > > In the container environment, our aim is to consolidate memory resource
> > > > control under the management of kubelet. If users express a preference for
> > > > binding their memory to a specific NUMA node, we encourage the adoption of
> > > > a standardized approach. Specifically, we recommend configuring this memory
> > > > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > > > than individual users setting it autonomously. This centralized approach
> > > > ensures that NUMA nodes are globally managed through kubelet, promoting
> > > > consistency and facilitating streamlined administration of memory resources
> > > > across the entire containerized environment.
> > >
> > > Changing system behavior for a single use case doesn't seem prudent.
> > > You're introducing a bunch of kernel code to avoid fixing a broken
> > > user space configuration.
> >
> > Currently, there is no mechanism in place to proactively prevent an
> > unprivileged user from utilizing mbind(2). The approach adopted is to
> > monitor mbind(2) through a BPF program and trigger an alert if its
> > usage is detected. However, beyond this monitoring, the only recourse
> > is to verbally communicate with the user, advising against the use of
> > mbind(2). As a result, users will question why mbind(2) isn't outright
> > prohibited in the first place.
>
> Is there a reason why you can't use syscall filtering via seccomp(2)?
> AFAIK, all the mainstream container tooling already has support for
> specifying seccomp filters for containers.
seccomp is relatively heavyweight, making it less suitable for
enabling in our production environment. In contrast, LSM hooks offer a more
lightweight and flexible alternative. Moreover, the act of binding to
a specific NUMA node appears akin to a privileged operation,
warranting the consideration of a dedicated LSM hook.
--
Regards
Yafang
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-13 3:15 ` Yafang Shao
2023-11-13 8:50 ` Ondrej Mosnacek
@ 2023-11-14 10:15 ` Michal Hocko
2023-11-14 11:59 ` Yafang Shao
1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-14 10:15 UTC (permalink / raw)
To: Yafang Shao
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > tasks (oom_score_adj: -998) aggravating the issue.
> >
> > Is there some reason why you can't fix the callers of mbind(2)?
> > This looks like an user space configuration error rather than a
> > system security issue.
>
> It appears my initial description may have caused confusion. In this
> scenario, the caller is an unprivileged user lacking any capabilities.
> While a privileged user, such as root, experiencing this issue might
> indicate a user space configuration error, the concerning aspect is
> the potential for an unprivileged user to disrupt the system easily.
> If this is perceived as a misconfiguration, the question arises: What
> is the correct configuration to prevent an unprivileged user from
> utilizing mbind(2)?"
How is this any different than a non NUMA (mbind) situation? You can
still have an unprivileged user allocate just until the OOM triggers
and disrupt other workloads consuming more memory. Sure, the mempolicy
based OOM is less precise and it might select a victim with only a small
consumption on a target NUMA node but fundamentally the situation is
very similar. I do not think disallowing mbind specifically is solving a
real problem.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-14 10:15 ` Michal Hocko
@ 2023-11-14 11:59 ` Yafang Shao
2023-11-14 16:57 ` Casey Schaufler
0 siblings, 1 reply; 23+ messages in thread
From: Yafang Shao @ 2023-11-14 11:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like an user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?"
>
> How is this any different than a non NUMA (mbind) situation?
In a UMA system, each gigabyte of memory carries the same cost.
Conversely, in a NUMA architecture, opting to confine processes within
a specific NUMA node incurs additional costs. In the worst-case
scenario, if all containers opt to bind their memory exclusively to
specific nodes, it will result in significant memory wastage.
> You can
> still have an unprivileged user to allocate just until the OOM triggers
> and disrupt other workload consuming more memory. Sure the mempolicy
> based OOM is less precise and it might select a victim with only a small
> consumption on a target NUMA node but fundamentally the situation is
> very similar. I do not think disallowing mbind specifically is solving a
> real problem.
How would you recommend addressing this more effectively?
--
Regards
Yafang
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-14 11:59 ` Yafang Shao
@ 2023-11-14 16:57 ` Casey Schaufler
2023-11-15 1:52 ` Yafang Shao
0 siblings, 1 reply; 23+ messages in thread
From: Casey Schaufler @ 2023-11-14 16:57 UTC (permalink / raw)
To: Yafang Shao, Michal Hocko
Cc: akpm, paul, jmorris, serge, linux-mm, linux-security-module, bpf,
ligang.bdlg, Casey Schaufler
On 11/14/2023 3:59 AM, Yafang Shao wrote:
> On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
>> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
>>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
>>>>> Background
>>>>> ==========
>>>>>
>>>>> In our containerized environment, we've identified unexpected OOM events
>>>>> where the OOM-killer terminates tasks despite having ample free memory.
>>>>> This anomaly is traced back to tasks within a container using mbind(2) to
>>>>> bind memory to a specific NUMA node. When the allocated memory on this node
>>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
>>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
>>>>> tasks (oom_score_adj: -998) aggravating the issue.
>>>> Is there some reason why you can't fix the callers of mbind(2)?
>>>> This looks like an user space configuration error rather than a
>>>> system security issue.
>>> It appears my initial description may have caused confusion. In this
>>> scenario, the caller is an unprivileged user lacking any capabilities.
>>> While a privileged user, such as root, experiencing this issue might
>>> indicate a user space configuration error, the concerning aspect is
>>> the potential for an unprivileged user to disrupt the system easily.
>>> If this is perceived as a misconfiguration, the question arises: What
>>> is the correct configuration to prevent an unprivileged user from
>>> utilizing mbind(2)?"
>> How is this any different than a non NUMA (mbind) situation?
> In a UMA system, each gigabyte of memory carries the same cost.
> Conversely, in a NUMA architecture, opting to confine processes within
> a specific NUMA node incurs additional costs. In the worst-case
> scenario, if all containers opt to bind their memory exclusively to
> specific nodes, it will result in significant memory wastage.
That still sounds like you've misconfigured your containers such
that they expect to get more memory than is available, and that
they have more control over it than they really do.
>> You can
>> still have an unprivileged user to allocate just until the OOM triggers
>> and disrupt other workload consuming more memory. Sure the mempolicy
>> based OOM is less precise and it might select a victim with only a small
>> consumption on a target NUMA node but fundamentally the situation is
>> very similar. I do not think disallowing mbind specifically is solving a
>> real problem.
> How would you recommend addressing this more effectively?
>
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-14 16:57 ` Casey Schaufler
@ 2023-11-15 1:52 ` Yafang Shao
2023-11-15 8:45 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Yafang Shao @ 2023-11-15 1:52 UTC (permalink / raw)
To: Casey Schaufler
Cc: Michal Hocko, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> >>>>> Background
> >>>>> ==========
> >>>>>
> >>>>> In our containerized environment, we've identified unexpected OOM events
> >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> >>>> Is there some reason why you can't fix the callers of mbind(2)?
> >>>> This looks like an user space configuration error rather than a
> >>>> system security issue.
> >>> It appears my initial description may have caused confusion. In this
> >>> scenario, the caller is an unprivileged user lacking any capabilities.
> >>> While a privileged user, such as root, experiencing this issue might
> >>> indicate a user space configuration error, the concerning aspect is
> >>> the potential for an unprivileged user to disrupt the system easily.
> >>> If this is perceived as a misconfiguration, the question arises: What
> >>> is the correct configuration to prevent an unprivileged user from
> >>> utilizing mbind(2)?"
> >> How is this any different than a non NUMA (mbind) situation?
> > In a UMA system, each gigabyte of memory carries the same cost.
> > Conversely, in a NUMA architecture, opting to confine processes within
> > a specific NUMA node incurs additional costs. In the worst-case
> > scenario, if all containers opt to bind their memory exclusively to
> > specific nodes, it will result in significant memory wastage.
>
> That still sounds like you've misconfigured your containers such
> that they expect to get more memory than is available, and that
> they have more control over it than they really do.
And again: What configuration method is suitable to limit user control
over memory policy adjustments, besides the heavyweight seccomp
approach?
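For reference, a minimal sketch of what the per-workload seccomp route
would look like, assuming libseccomp and with error handling trimmed:

#include <errno.h>
#include <seccomp.h>	/* libseccomp */

/* Return EPERM for the memory policy syscalls, allow everything else. */
static int install_mempolicy_filter(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

	if (!ctx)
		return -1;
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mbind), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(set_mempolicy), 0);
	if (seccomp_load(ctx)) {
		seccomp_release(ctx);
		return -1;
	}
	seccomp_release(ctx);
	return 0;
}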
--
Regards
Yafang
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 1:52 ` Yafang Shao
@ 2023-11-15 8:45 ` Michal Hocko
2023-11-15 9:33 ` Yafang Shao
0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-15 8:45 UTC (permalink / raw)
To: Yafang Shao
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > >>>>> Background
> > >>>>> ==========
> > >>>>>
> > >>>>> In our containerized environment, we've identified unexpected OOM events
> > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > >>>> This looks like an user space configuration error rather than a
> > >>>> system security issue.
> > >>> It appears my initial description may have caused confusion. In this
> > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > >>> While a privileged user, such as root, experiencing this issue might
> > >>> indicate a user space configuration error, the concerning aspect is
> > >>> the potential for an unprivileged user to disrupt the system easily.
> > >>> If this is perceived as a misconfiguration, the question arises: What
> > >>> is the correct configuration to prevent an unprivileged user from
> > >>> utilizing mbind(2)?"
> > >> How is this any different than a non NUMA (mbind) situation?
> > > In a UMA system, each gigabyte of memory carries the same cost.
> > > Conversely, in a NUMA architecture, opting to confine processes within
> > > a specific NUMA node incurs additional costs. In the worst-case
> > > scenario, if all containers opt to bind their memory exclusively to
> > > specific nodes, it will result in significant memory wastage.
> >
> > That still sounds like you've misconfigured your containers such
> > that they expect to get more memory than is available, and that
> > they have more control over it than they really do.
>
> And again: What configuration method is suitable to limit user control
> over memory policy adjustments, besides the heavyweight seccomp
> approach?
This really depends on the workloads. What is the reason mbind is used
in the first place? Is it acceptable to partition the system so that
there is a numa node reserved for NUMA aware workloads? If not, have you
considered the already proposed numa=off?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 8:45 ` Michal Hocko
@ 2023-11-15 9:33 ` Yafang Shao
2023-11-15 14:26 ` Yafang Shao
2023-11-15 17:00 ` Michal Hocko
0 siblings, 2 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-15 9:33 UTC (permalink / raw)
To: Michal Hocko
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > >
> > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > >>>>> Background
> > > >>>>> ==========
> > > >>>>>
> > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > >>>> This looks like an user space configuration error rather than a
> > > >>>> system security issue.
> > > >>> It appears my initial description may have caused confusion. In this
> > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > >>> While a privileged user, such as root, experiencing this issue might
> > > >>> indicate a user space configuration error, the concerning aspect is
> > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > >>> is the correct configuration to prevent an unprivileged user from
> > > >>> utilizing mbind(2)?"
> > > >> How is this any different than a non NUMA (mbind) situation?
> > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > scenario, if all containers opt to bind their memory exclusively to
> > > > specific nodes, it will result in significant memory wastage.
> > >
> > > That still sounds like you've misconfigured your containers such
> > > that they expect to get more memory than is available, and that
> > > they have more control over it than they really do.
> >
> > And again: What configuration method is suitable to limit user control
> > over memory policy adjustments, besides the heavyweight seccomp
> > approach?
>
> This really depends on the workloads. What is the reason mbind is used
> in the first place?
It can improve their performance.
> Is it acceptable to partition the system so that
> there is a numa node reserved for NUMA aware workloads?
As highlighted in the commit log, our preference is to configure this
memory policy through kubelet using cpuset.mems in the cpuset
controller, rather than allowing individual users to set it
independently.
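For illustration, the centralized approach boils down to kubelet writing
the allowed nodes into each pod's cpuset cgroup. A rough sketch of that
step follows; the cgroup path is only a placeholder, kubelet manages it
in practice:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Restrict a pod's allowed memory nodes via cgroup v2 cpuset.mems. */
static int pin_pod_mems(const char *pod_cgroup, const char *nodes)
{
	char path[256];
	ssize_t ret;
	int fd;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpuset.mems",
		 pod_cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, nodes, strlen(nodes));	/* e.g. nodes = "0" */
	close(fd);
	return ret < 0 ? -1 : 0;
}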
> If not, have you
> considered (already proposed numa=off)?
The challenge at hand isn't solely about whether users should bind to
a memory node or the deployment of workloads. What we're genuinely
dealing with is the fact that users can bind to a specific node
without our explicit agreement or authorization.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 9:33 ` Yafang Shao
@ 2023-11-15 14:26 ` Yafang Shao
2023-11-15 17:09 ` Casey Schaufler
2023-11-15 17:00 ` Michal Hocko
1 sibling, 1 reply; 23+ messages in thread
From: Yafang Shao @ 2023-11-15 14:26 UTC (permalink / raw)
To: Michal Hocko
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Wed, Nov 15, 2023 at 5:33 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > >
> > > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > >>>>> Background
> > > > >>>>> ==========
> > > > >>>>>
> > > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > > >>>> This looks like an user space configuration error rather than a
> > > > >>>> system security issue.
> > > > >>> It appears my initial description may have caused confusion. In this
> > > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > > >>> While a privileged user, such as root, experiencing this issue might
> > > > >>> indicate a user space configuration error, the concerning aspect is
> > > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > > >>> is the correct configuration to prevent an unprivileged user from
> > > > >>> utilizing mbind(2)?"
> > > > >> How is this any different than a non NUMA (mbind) situation?
> > > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > > scenario, if all containers opt to bind their memory exclusively to
> > > > > specific nodes, it will result in significant memory wastage.
> > > >
> > > > That still sounds like you've misconfigured your containers such
> > > > that they expect to get more memory than is available, and that
> > > > they have more control over it than they really do.
> > >
> > > And again: What configuration method is suitable to limit user control
> > > over memory policy adjustments, besides the heavyweight seccomp
> > > approach?
> >
> > This really depends on the workloads. What is the reason mbind is used
> > in the first place?
>
> It can improve their performance.
>
> > Is it acceptable to partition the system so that
> > there is a numa node reserved for NUMA aware workloads?
>
> As highlighted in the commit log, our preference is to configure this
> memory policy through kubelet using cpuset.mems in the cpuset
> controller, rather than allowing individual users to set it
> independently.
>
> > If not, have you
> > considered (already proposed numa=off)?
>
> The challenge at hand isn't solely about whether users should bind to
> a memory node or the deployment of workloads. What we're genuinely
> dealing with is the fact that users can bind to a specific node
> without our explicit agreement or authorization.
BTW, the same principle should also apply to sched_setaffinity(2).
Since there's already a security_task_setscheduler() in place, we
should also consider adding a security_set_mempolicy() for
consistency.
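For illustration, an LSM BPF program built on such a hook could look
roughly like the sketch below; the hook name and argument are my
assumption of what the proposed interface might look like, not the
final API:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MPOL_BIND	2	/* include/uapi/linux/mempolicy.h */
#define EPERM		1

char LICENSE[] SEC("license") = "GPL";

/* Reject MPOL_BIND requests, allow every other memory policy. */
SEC("lsm/set_mempolicy")
int BPF_PROG(deny_mpol_bind, unsigned long mode)
{
	if (mode == MPOL_BIND)
		return -EPERM;
	return 0;
}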
--
Regards
Yafang
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 9:33 ` Yafang Shao
2023-11-15 14:26 ` Yafang Shao
@ 2023-11-15 17:00 ` Michal Hocko
2023-11-16 2:22 ` Yafang Shao
1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2023-11-15 17:00 UTC (permalink / raw)
To: Yafang Shao
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Wed 15-11-23 17:33:51, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > >
> > > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > >>>>> Background
> > > > >>>>> ==========
> > > > >>>>>
> > > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > > >>>> This looks like an user space configuration error rather than a
> > > > >>>> system security issue.
> > > > >>> It appears my initial description may have caused confusion. In this
> > > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > > >>> While a privileged user, such as root, experiencing this issue might
> > > > >>> indicate a user space configuration error, the concerning aspect is
> > > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > > >>> is the correct configuration to prevent an unprivileged user from
> > > > >>> utilizing mbind(2)?"
> > > > >> How is this any different than a non NUMA (mbind) situation?
> > > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > > scenario, if all containers opt to bind their memory exclusively to
> > > > > specific nodes, it will result in significant memory wastage.
> > > >
> > > > That still sounds like you've misconfigured your containers such
> > > > that they expect to get more memory than is available, and that
> > > > they have more control over it than they really do.
> > >
> > > And again: What configuration method is suitable to limit user control
> > > over memory policy adjustments, besides the heavyweight seccomp
> > > approach?
> >
> > This really depends on the workloads. What is the reason mbind is used
> > in the first place?
>
> It can improve their performance.
>
> > Is it acceptable to partition the system so that
> > there is a numa node reserved for NUMA aware workloads?
>
> As highlighted in the commit log, our preference is to configure this
> memory policy through kubelet using cpuset.mems in the cpuset
> controller, rather than allowing individual users to set it
> independently.
OK, I have missed that part.
> > If not, have you
> > considered (already proposed numa=off)?
>
> The challenge at hand isn't solely about whether users should bind to
> a memory node or the deployment of workloads. What we're genuinely
> dealing with is the fact that users can bind to a specific node
> without our explicit agreement or authorization.
mbind outside of the cpuset shouldn't be possible (policy_nodemask), so
if you are already configuring cpusets, mbind shouldn't add much to the
problem. I can see how you can run into problems when you do not have
any NUMA partitioning in place, because mixing NUMA aware and unaware
workloads doesn't really work out well when memory is in short supply.
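As a quick illustration (not part of this series): with cpuset.mems
limited to node 0, asking MPOL_BIND for node 1 alone is expected to
fail:

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */

int main(void)
{
	size_t len = 1UL << 20;
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 1;	/* node 1 only */

	if (mbind(addr, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");	/* expected to fail when node 1 is disallowed */
	return 0;
}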
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 14:26 ` Yafang Shao
@ 2023-11-15 17:09 ` Casey Schaufler
2023-11-16 1:41 ` Yafang Shao
0 siblings, 1 reply; 23+ messages in thread
From: Casey Schaufler @ 2023-11-15 17:09 UTC (permalink / raw)
To: Yafang Shao, Michal Hocko
Cc: akpm, paul, jmorris, serge, linux-mm, linux-security-module, bpf,
ligang.bdlg, Casey Schaufler
On 11/15/2023 6:26 AM, Yafang Shao wrote:
> On Wed, Nov 15, 2023 at 5:33 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko@suse.com> wrote:
>>> On Wed 15-11-23 09:52:38, Yafang Shao wrote:
>>>> On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 11/14/2023 3:59 AM, Yafang Shao wrote:
>>>>>> On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
>>>>>>> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
>>>>>>>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
>>>>>>>>>> Background
>>>>>>>>>> ==========
>>>>>>>>>>
>>>>>>>>>> In our containerized environment, we've identified unexpected OOM events
>>>>>>>>>> where the OOM-killer terminates tasks despite having ample free memory.
>>>>>>>>>> This anomaly is traced back to tasks within a container using mbind(2) to
>>>>>>>>>> bind memory to a specific NUMA node. When the allocated memory on this node
>>>>>>>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
>>>>>>>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
>>>>>>>>>> tasks (oom_score_adj: -998) aggravating the issue.
>>>>>>>>> Is there some reason why you can't fix the callers of mbind(2)?
>>>>>>>>> This looks like an user space configuration error rather than a
>>>>>>>>> system security issue.
>>>>>>>> It appears my initial description may have caused confusion. In this
>>>>>>>> scenario, the caller is an unprivileged user lacking any capabilities.
>>>>>>>> While a privileged user, such as root, experiencing this issue might
>>>>>>>> indicate a user space configuration error, the concerning aspect is
>>>>>>>> the potential for an unprivileged user to disrupt the system easily.
>>>>>>>> If this is perceived as a misconfiguration, the question arises: What
>>>>>>>> is the correct configuration to prevent an unprivileged user from
>>>>>>>> utilizing mbind(2)?"
>>>>>>> How is this any different than a non NUMA (mbind) situation?
>>>>>> In a UMA system, each gigabyte of memory carries the same cost.
>>>>>> Conversely, in a NUMA architecture, opting to confine processes within
>>>>>> a specific NUMA node incurs additional costs. In the worst-case
>>>>>> scenario, if all containers opt to bind their memory exclusively to
>>>>>> specific nodes, it will result in significant memory wastage.
>>>>> That still sounds like you've misconfigured your containers such
>>>>> that they expect to get more memory than is available, and that
>>>>> they have more control over it than they really do.
>>>> And again: What configuration method is suitable to limit user control
>>>> over memory policy adjustments, besides the heavyweight seccomp
>>>> approach?
What makes seccomp "heavyweight"? The overhead? The infrastructure required?
>>> This really depends on the workloads. What is the reason mbind is used
>>> in the first place?
>> It can improve their performance.
How much? You've already demonstrated that using mbind can degrade their performance.
>>
>>> Is it acceptable to partition the system so that
>>> there is a numa node reserved for NUMA aware workloads?
>> As highlighted in the commit log, our preference is to configure this
>> memory policy through kubelet using cpuset.mems in the cpuset
>> controller, rather than allowing individual users to set it
>> independently.
>>
>>> If not, have you
>>> considered (already proposed numa=off)?
>> The challenge at hand isn't solely about whether users should bind to
>> a memory node or the deployment of workloads. What we're genuinely
>> dealing with is the fact that users can bind to a specific node
>> without our explicit agreement or authorization.
> BTW, the same principle should also apply to sched_setaffinity(2).
> Since there's already a security_task_setscheduler() in place, we
> should also consider adding a security_set_mempolicy() for
> consistency.
"A foolish consistency is the hobgoblin of little minds"
- Ralph Waldo Emerson
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 17:09 ` Casey Schaufler
@ 2023-11-16 1:41 ` Yafang Shao
0 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-16 1:41 UTC (permalink / raw)
To: Casey Schaufler
Cc: Michal Hocko, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Thu, Nov 16, 2023 at 1:09 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 11/15/2023 6:26 AM, Yafang Shao wrote:
> > On Wed, Nov 15, 2023 at 5:33 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >> On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko@suse.com> wrote:
> >>> On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> >>>> On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>> On 11/14/2023 3:59 AM, Yafang Shao wrote:
> >>>>>> On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> >>>>>>> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> >>>>>>>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>>>>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> >>>>>>>>>> Background
> >>>>>>>>>> ==========
> >>>>>>>>>>
> >>>>>>>>>> In our containerized environment, we've identified unexpected OOM events
> >>>>>>>>>> where the OOM-killer terminates tasks despite having ample free memory.
> >>>>>>>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> >>>>>>>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> >>>>>>>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> >>>>>>>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> >>>>>>>>>> tasks (oom_score_adj: -998) aggravating the issue.
> >>>>>>>>> Is there some reason why you can't fix the callers of mbind(2)?
> >>>>>>>>> This looks like an user space configuration error rather than a
> >>>>>>>>> system security issue.
> >>>>>>>> It appears my initial description may have caused confusion. In this
> >>>>>>>> scenario, the caller is an unprivileged user lacking any capabilities.
> >>>>>>>> While a privileged user, such as root, experiencing this issue might
> >>>>>>>> indicate a user space configuration error, the concerning aspect is
> >>>>>>>> the potential for an unprivileged user to disrupt the system easily.
> >>>>>>>> If this is perceived as a misconfiguration, the question arises: What
> >>>>>>>> is the correct configuration to prevent an unprivileged user from
> >>>>>>>> utilizing mbind(2)?"
> >>>>>>> How is this any different than a non NUMA (mbind) situation?
> >>>>>> In a UMA system, each gigabyte of memory carries the same cost.
> >>>>>> Conversely, in a NUMA architecture, opting to confine processes within
> >>>>>> a specific NUMA node incurs additional costs. In the worst-case
> >>>>>> scenario, if all containers opt to bind their memory exclusively to
> >>>>>> specific nodes, it will result in significant memory wastage.
> >>>>> That still sounds like you've misconfigured your containers such
> >>>>> that they expect to get more memory than is available, and that
> >>>>> they have more control over it than they really do.
> >>>> And again: What configuration method is suitable to limit user control
> >>>> over memory policy adjustments, besides the heavyweight seccomp
> >>>> approach?
>
> What makes seccomp "heavyweight"? The overhead? The infrastructure required?
>
> >>> This really depends on the workloads. What is the reason mbind is used
> >>> in the first place?
> >> It can improve their performance.
>
> How much? You've already demonstrated that using mbind can degrade their performance.
Please calm down and read the whole discussion carefully; it is not
easy to follow.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
2023-11-15 17:00 ` Michal Hocko
@ 2023-11-16 2:22 ` Yafang Shao
0 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2023-11-16 2:22 UTC (permalink / raw)
To: Michal Hocko
Cc: Casey Schaufler, akpm, paul, jmorris, serge, linux-mm,
linux-security-module, bpf, ligang.bdlg
On Thu, Nov 16, 2023 at 1:00 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 15-11-23 17:33:51, Yafang Shao wrote:
> > On Wed, Nov 15, 2023 at 4:45 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 15-11-23 09:52:38, Yafang Shao wrote:
> > > > On Wed, Nov 15, 2023 at 12:58 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > >
> > > > > On 11/14/2023 3:59 AM, Yafang Shao wrote:
> > > > > > On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > > > > >>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > > >>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > > >>>>> Background
> > > > > >>>>> ==========
> > > > > >>>>>
> > > > > >>>>> In our containerized environment, we've identified unexpected OOM events
> > > > > >>>>> where the OOM-killer terminates tasks despite having ample free memory.
> > > > > >>>>> This anomaly is traced back to tasks within a container using mbind(2) to
> > > > > >>>>> bind memory to a specific NUMA node. When the allocated memory on this node
> > > > > >>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > > >>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > > >>>>> tasks (oom_score_adj: -998) aggravating the issue.
> > > > > >>>> Is there some reason why you can't fix the callers of mbind(2)?
> > > > > >>>> This looks like an user space configuration error rather than a
> > > > > >>>> system security issue.
> > > > > >>> It appears my initial description may have caused confusion. In this
> > > > > >>> scenario, the caller is an unprivileged user lacking any capabilities.
> > > > > >>> While a privileged user, such as root, experiencing this issue might
> > > > > >>> indicate a user space configuration error, the concerning aspect is
> > > > > >>> the potential for an unprivileged user to disrupt the system easily.
> > > > > >>> If this is perceived as a misconfiguration, the question arises: What
> > > > > >>> is the correct configuration to prevent an unprivileged user from
> > > > > >>> utilizing mbind(2)?"
> > > > > >> How is this any different than a non NUMA (mbind) situation?
> > > > > > In a UMA system, each gigabyte of memory carries the same cost.
> > > > > > Conversely, in a NUMA architecture, opting to confine processes within
> > > > > > a specific NUMA node incurs additional costs. In the worst-case
> > > > > > scenario, if all containers opt to bind their memory exclusively to
> > > > > > specific nodes, it will result in significant memory wastage.
> > > > >
> > > > > That still sounds like you've misconfigured your containers such
> > > > > that they expect to get more memory than is available, and that
> > > > > they have more control over it than they really do.
> > > >
> > > > And again: What configuration method is suitable to limit user control
> > > > over memory policy adjustments, besides the heavyweight seccomp
> > > > approach?
> > >
> > > This really depends on the workloads. What is the reason mbind is used
> > > in the first place?
> >
> > It can improve their performance.
> >
> > > Is it acceptable to partition the system so that
> > > there is a numa node reserved for NUMA aware workloads?
> >
> > As highlighted in the commit log, our preference is to configure this
> > memory policy through kubelet using cpuset.mems in the cpuset
> > controller, rather than allowing individual users to set it
> > independently.
>
> OK, I have missed that part.
>
> > > If not, have you
> > > considered (already proposed numa=off)?
> >
> > The challenge at hand isn't solely about whether users should bind to
> > a memory node or the deployment of workloads. What we're genuinely
> > dealing with is the fact that users can bind to a specific node
> > without our explicit agreement or authorization.
>
> mbind outside of the cpuset shouldn't be possible (policy_nodemask), so
> if you are already configuring cpusets, mbind shouldn't add much to the
> problem. I can see how you can run into problems when you do not have
> any NUMA partitioning in place, because mixing NUMA aware and unaware
> workloads doesn't really work out well when memory is in short supply.
Right, we're trying to move NUMA aware workloads to dedicated servers.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread
Thread overview: 23+ messages
2023-11-12 7:34 [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 1/4] mm, security: Add lsm hook for mbind(2) Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 2/4] mm, security: Add lsm hook for set_mempolicy(2) Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 3/4] mm, security: Add lsm hook for set_mempolicy_home_node(2) Yafang Shao
2023-11-12 7:34 ` [RFC PATCH -mm 4/4] selftests/bpf: Add selftests for mbind(2) with lsm prog Yafang Shao
2023-11-12 16:45 ` [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf Casey Schaufler
2023-11-13 3:15 ` Yafang Shao
2023-11-13 8:50 ` Ondrej Mosnacek
2023-11-13 21:23 ` Casey Schaufler
2023-11-14 2:30 ` Yafang Shao
2023-11-14 10:15 ` Michal Hocko
2023-11-14 11:59 ` Yafang Shao
2023-11-14 16:57 ` Casey Schaufler
2023-11-15 1:52 ` Yafang Shao
2023-11-15 8:45 ` Michal Hocko
2023-11-15 9:33 ` Yafang Shao
2023-11-15 14:26 ` Yafang Shao
2023-11-15 17:09 ` Casey Schaufler
2023-11-16 1:41 ` Yafang Shao
2023-11-15 17:00 ` Michal Hocko
2023-11-16 2:22 ` Yafang Shao
2023-11-12 20:32 ` Paul Moore
2023-11-13 3:17 ` Yafang Shao