From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org, paul@paul-moore.com,
jmorris@namei.org, serge@hallyn.com
Cc: linux-mm@kvack.org, linux-security-module@vger.kernel.org,
bpf@vger.kernel.org, ligang.bdlg@bytedance.com, mhocko@suse.com,
Yafang Shao <laoar.shao@gmail.com>
Subject: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf
Date: Sun, 12 Nov 2023 07:34:20 +0000
Message-ID: <20231112073424.4216-1-laoar.shao@gmail.com>
Background
==========
In our containerized environment, we've identified unexpected OOM events
where the OOM-killer terminates tasks even though the system has ample
free memory. We traced this anomaly to tasks within a container using
mbind(2) to bind memory to a specific NUMA node. When the memory on this
node is exhausted, the OOM-killer, which prioritizes tasks based on
oom_score, kills tasks indiscriminately. This becomes more critical with
guaranteed tasks (oom_score_adj: -998), which aggravate the issue.
The selected victim might not have allocated memory on the same NUMA node,
rendering the kill ineffective. This patchset aims to address this by
disabling MPOL_BIND in container environments.
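For readers unfamiliar with the pattern, here is a minimal userspace sketch of the kind of call that triggers the problem: a task mapping anonymous memory and binding it to one node with mbind(2). It is illustrative only and not taken from the affected workloads. The helper names and the SKETCH_MPOL_BIND macro are made up for this example (the value 2 mirrors MPOL_BIND in the kernel UAPI header), and the raw syscall is used only so the sketch builds without libnuma.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors MPOL_BIND from <linux/mempolicy.h>; defined locally so the
 * sketch does not depend on libnuma's <numaif.h>. */
#define SKETCH_MPOL_BIND 2

/* Build a one-word nodemask with only `node` set (hypothetical helper). */
unsigned long single_node_mask(unsigned int node)
{
	assert(node < 8 * sizeof(unsigned long));
	return 1UL << node;
}

/*
 * Map `len` bytes of anonymous memory and bind the range to `node`.
 * Returns the mapping, or NULL with errno set on failure.
 */
void *alloc_bound_to_node(size_t len, unsigned int node)
{
	unsigned long mask = single_node_mask(node);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* maxnode is passed as bit count + 1, the same convention libnuma
	 * uses, because the kernel decrements it before reading the mask. */
	if (syscall(SYS_mbind, p, len, SKETCH_MPOL_BIND, &mask,
		    8 * sizeof(mask) + 1, 0) != 0) {
		int saved = errno;

		munmap(p, len);
		errno = saved;
		return NULL;
	}
	return p;
}
```

Once pages in such a range are faulted in, they can only come from the bound node, so exhausting that node leads to an OOM even if other nodes have free memory.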
In the container environment, our aim is to consolidate memory resource
control under the management of kubelet. If users express a preference for
binding their memory to a specific NUMA node, we encourage the adoption of
a standardized approach. Specifically, we recommend configuring this memory
policy through kubelet using cpuset.mems in the cpuset controller, rather
than individual users setting it autonomously. This centralized approach
ensures that NUMA nodes are globally managed through kubelet, promoting
consistency and facilitating streamlined administration of memory resources
across the entire containerized environment.
Proposed Solutions
==================
- Introduce Capability to Disable MPOL_BIND
  Currently, any task can perform MPOL_BIND without specific capabilities.
  Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
  may have unintended consequences. Capabilities, being broad, might grant
  unnecessary privileges. We should explore alternatives to prevent
  unexpected side effects.
- Use LSM BPF to Disable MPOL_BIND
  Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
  set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
  flexible and allows for fine-grained control without unintended
  consequences. A sample LSM BPF program is included, demonstrating
  practical implementation in a production environment.
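To make the intended policy concrete, the decision such an LSM BPF program might take can be sketched as plain C. This is not the sample program from patch 4/4: the hook names and signatures are defined in the patches themselves and are not reproduced here, and every identifier below is hypothetical. The SKETCH_* values mirror the definitions in the kernel UAPI header <linux/mempolicy.h>. Masking off the optional mode flags before comparing is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Values mirror <linux/mempolicy.h>; redefined so the sketch is standalone. */
#define SKETCH_MPOL_BIND		2
#define SKETCH_MPOL_INTERLEAVE		3
#define SKETCH_MPOL_F_NUMA_BALANCING	(1 << 13)
#define SKETCH_MPOL_F_RELATIVE_NODES	(1 << 14)
#define SKETCH_MPOL_F_STATIC_NODES	(1 << 15)
#define SKETCH_MPOL_MODE_FLAGS		(SKETCH_MPOL_F_STATIC_NODES | \
					 SKETCH_MPOL_F_RELATIVE_NODES | \
					 SKETCH_MPOL_F_NUMA_BALANCING)

/*
 * Hypothetical policy check: return -EPERM when the requested mode is
 * MPOL_BIND (ignoring optional mode flags), 0 otherwise. An LSM hook
 * returning a negative errno would make the syscall fail with it.
 */
int sketch_check_mempolicy_mode(unsigned long mode)
{
	unsigned long base = mode & ~(unsigned long)SKETCH_MPOL_MODE_FLAGS;

	return base == SKETCH_MPOL_BIND ? -EPERM : 0;
}

/* Small usage example showing which requests the check rejects. */
void sketch_mempolicy_selfcheck(void)
{
	assert(sketch_check_mempolicy_mode(SKETCH_MPOL_BIND) == -EPERM);
	assert(sketch_check_mempolicy_mode(SKETCH_MPOL_BIND |
					   SKETCH_MPOL_F_STATIC_NODES) == -EPERM);
	assert(sketch_check_mempolicy_mode(SKETCH_MPOL_INTERLEAVE) == 0);
}
```

A real program would additionally decide which tasks the restriction applies to, for example only those running inside a container, which is where the fine-grained control comes from.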
Future Considerations
=====================
In addition, there's room for enhancement in the OOM-killer for cases
involving CONSTRAINT_MEMORY_POLICY. It would be better to prefer a victim
that has allocated memory on the same NUMA node. Searching lore, I found a
proposal [0] related to this matter, although it does not appear to have
reached consensus. Nevertheless, this topic is beyond the scope of the
current patchset.
[0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@bytedance.com/
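As a toy illustration of the selection rule described above (not the kernel's OOM code and not the approach taken in [0]), the following sketch skips candidates that hold no pages on the exhausted node and picks the highest-scoring task among the rest. The struct, the field names, and the fixed node count are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4	/* arbitrary size for the illustration */

/* Hypothetical per-task summary; not a kernel data structure. */
struct sketch_task {
	int pid;
	long oom_score;		/* higher means more likely to be killed */
	unsigned long pages_on_node[SKETCH_MAX_NODES];
};

/*
 * Return the index of the preferred victim for an OOM on `node`, or -1 if
 * no task has memory there. Tasks without pages on the node are skipped,
 * since killing them would free nothing where it is needed.
 */
int sketch_pick_victim(const struct sketch_task *tasks, size_t n,
		       unsigned int node)
{
	int best = -1;

	assert(node < SKETCH_MAX_NODES);
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].pages_on_node[node] == 0)
			continue;
		if (best < 0 || tasks[i].oom_score > tasks[best].oom_score)
			best = (int)i;
	}
	return best;
}
```

With this rule, a task that has a high oom_score but no memory on the constrained node is never chosen, which is the ineffective kill described in the Background section.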
Yafang Shao (4):
  mm, security: Add lsm hook for mbind(2)
  mm, security: Add lsm hook for set_mempolicy(2)
  mm, security: Add lsm hook for set_mempolicy_home_node(2)
  selftests/bpf: Add selftests for mbind(2) with lsm prog

 include/linux/lsm_hook_defs.h                      |  8 +++
 include/linux/security.h                           | 26 +++++++
 mm/mempolicy.c                                     | 13 ++++
 security/security.c                                | 19 ++++++
 tools/testing/selftests/bpf/prog_tests/mempolicy.c | 79 ++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_mempolicy.c | 29 ++++++++
 6 files changed, 174 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mempolicy.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_mempolicy.c
--
1.8.3.1