* [v11 0/6] cgroup-aware OOM killer
@ 2017-10-05 13:04 Roman Gushchin
2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
` (5 more replies)
0 siblings, 6 replies; 27+ messages in thread
From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
This patchset makes the OOM killer cgroup-aware.
v11:
- Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
- Moved a check in __oom_kill_process() to the memory.oom_group
patch, added corresponding comments
- Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
- Rebase on top of mm tree
v10:
- Separate oom_group introduction into a standalone patch
- Stop propagating oom_group
- Make oom_group delegatable
- Do not try to kill the biggest task first,
if the whole cgroup is going to be killed
- Stop caching oom_score on struct memcg, optimize victim
memcg selection
- Drop dmesg printing (for further refining)
- Small refactorings and comments added here and there
- Rebase on top of mm tree
v9:
- Change siblings-to-siblings comparison to a tree-wide search,
make related refactorings
- Make oom_group implicitly propagated down the tree
- Fix an issue with task selection in root cgroup
v8:
- Do not kill tasks with OOM_SCORE_ADJ -1000
- Make the whole thing opt-in with cgroup mount option control
- Drop oom_priority for further discussions
- Kill the whole cgroup if oom_group is set and its
memory.max is reached
- Update docs and commit messages
v7:
- __oom_kill_process() drops reference to the victim task
- oom_score_adj -1000 is always respected
- Renamed oom_kill_all to oom_group
- Dropped oom_prio range, converted from short to int
- Added a cgroup v2 mount option to disable cgroup-aware OOM killer
- Docs updated
- Rebased on top of mmotm
v6:
- Renamed oom_control.chosen to oom_control.chosen_task
- Renamed oom_kill_all_tasks to oom_kill_all
- Per-node NR_SLAB_UNRECLAIMABLE accounting
- Several minor fixes and cleanups
- Docs updated
v5:
- Rebased on top of Michal Hocko's patches, which changed the
way OOM victims get access to the memory
reserves. Dropped the corresponding part of this patchset
- Separated the oom_kill_process() splitting into a standalone commit
- Added debug output (suggested by David Rientjes)
- Some minor fixes
v4:
- Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
- Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
- Memcg victim selection code is reworked, and
synchronization is based on finding tasks with the OOM victim marker,
rather than on a global counter
- Debug output is dropped
- Refactored TIF_MEMDIE usage
v3:
- Merged commits 1-4 into 6
- Separated oom_score_adj logic and debug output into separate commits
- Fixed swap accounting
v2:
- Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
- "Kill all tasks" is now an opt-in option, by default
only one process will be killed
- Added per-cgroup oom_score_adj
- Refined oom score calculations, suggested by Vladimir Davydov
- Converted to a patchset
v1:
https://lkml.org/lkml/2017/5/18/969
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Roman Gushchin (6):
mm, oom: refactor the oom_kill_process() function
mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
mm, oom: cgroup-aware OOM killer
mm, oom: introduce memory.oom_group
mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
mm, oom, docs: describe the cgroup-aware OOM killer
Documentation/cgroup-v2.txt | 51 +++++++++
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 34 ++++++
include/linux/oom.h | 12 ++-
kernel/cgroup/cgroup.c | 10 ++
mm/memcontrol.c | 249 +++++++++++++++++++++++++++++++++++++++++++-
mm/oom_kill.c | 210 ++++++++++++++++++++++++-------------
7 files changed, 495 insertions(+), 76 deletions(-)
--
2.13.6
^ permalink raw reply [flat|nested] 27+ messages in thread* [v11 1/6] mm, oom: refactor the oom_kill_process() function 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin 2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin ` (4 subsequent siblings) 5 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel The oom_kill_process() function consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits the oom_kill_process() function with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). 
Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 65 insertions(+), 58 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index f642a45b7f14..ccdb7d34cd13 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -845,68 +845,12 @@ static bool task_will_free_mem(struct task_struct *task) return ret; } -static void oom_kill_process(struct oom_control *oc, const char *message) +static void __oom_kill_process(struct task_struct *victim) { - struct task_struct *p = oc->chosen; - unsigned int points = oc->chosen_points; - struct task_struct *victim = p; - struct task_struct *child; - struct task_struct *t; + struct task_struct *p; struct mm_struct *mm; - unsigned int victim_points = 0; - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); bool can_oom_reap = true; - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just give it access to memory reserves - * so it can die quickly - */ - task_lock(p); - if (task_will_free_mem(p)) { - mark_oom_victim(p); - wake_oom_reaper(p); - task_unlock(p); - put_task_struct(p); - return; - } - task_unlock(p); - - if (__ratelimit(&oom_rs)) - dump_header(oc, p); - - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - - /* - * If any of p's children has a different mm and is eligible for kill, - * the one with the highest 
oom_badness() score is sacrificed for its - * parent. This attempts to lose the minimal amount of work done while - * still freeing memory. - */ - read_lock(&tasklist_lock); - for_each_thread(p, t) { - list_for_each_entry(child, &t->children, sibling) { - unsigned int child_points; - - if (process_shares_mm(child, p->mm)) - continue; - /* - * oom_badness() returns 0 if the thread is unkillable - */ - child_points = oom_badness(child, - oc->memcg, oc->nodemask, oc->totalpages); - if (child_points > victim_points) { - put_task_struct(victim); - victim = child; - victim_points = child_points; - get_task_struct(victim); - } - } - } - read_unlock(&tasklist_lock); - p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -980,6 +924,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message) } #undef K +static void oom_kill_process(struct oom_control *oc, const char *message) +{ + struct task_struct *p = oc->chosen; + unsigned int points = oc->chosen_points; + struct task_struct *victim = p; + struct task_struct *child; + struct task_struct *t; + unsigned int victim_points = 0; + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + /* + * If the task is already exiting, don't alarm the sysadmin or kill + * its children or threads, just give it access to memory reserves + * so it can die quickly + */ + task_lock(p); + if (task_will_free_mem(p)) { + mark_oom_victim(p); + wake_oom_reaper(p); + task_unlock(p); + put_task_struct(p); + return; + } + task_unlock(p); + + if (__ratelimit(&oom_rs)) + dump_header(oc, p); + + pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + + /* + * If any of p's children has a different mm and is eligible for kill, + * the one with the highest oom_badness() score is sacrificed for its + * parent. This attempts to lose the minimal amount of work done while + * still freeing memory. 
+ */ + read_lock(&tasklist_lock); + for_each_thread(p, t) { + list_for_each_entry(child, &t->children, sibling) { + unsigned int child_points; + + if (process_shares_mm(child, p->mm)) + continue; + /* + * oom_badness() returns 0 if the thread is unkillable + */ + child_points = oom_badness(child, + oc->memcg, oc->nodemask, oc->totalpages); + if (child_points > victim_points) { + put_task_struct(victim); + victim = child; + victim_points = child_points; + get_task_struct(victim); + } + } + } + read_unlock(&tasklist_lock); + + __oom_kill_process(victim); +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ -- 2.13.6 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin 2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin [not found] ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org> 2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin ` (3 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Implement mem_cgroup_scan_tasks() functionality for the root memory cgroup to use this function for looking for a OOM victim task in the root memory cgroup by the cgroup-ware OOM killer. The root memory cgroup is treated as a leaf cgroup, so only tasks which are directly belonging to the root cgroup are iterated over. This patch doesn't introduce any functional change as mem_cgroup_scan_tasks() is never called for the root memcg. This is preparatory work for the cgroup-aware OOM killer, which will use this function to iterate over tasks belonging to the root memcg. 
Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/memcontrol.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c7410636fadf..41d71f665550 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg) * value, the function breaks the iteration loop and returns the value. * Otherwise, it will iterate over all tasks and return 0. * - * This function must not be called for the root memory cgroup. + * If memcg is the root memory cgroup, this function will iterate only + * over tasks belonging directly to the root memory cgroup. */ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, int (*fn)(struct task_struct *, void *), void *arg) @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, struct mem_cgroup *iter; int ret = 0; - BUG_ON(memcg == root_mem_cgroup); - for_each_mem_cgroup_tree(iter, memcg) { struct css_task_iter it; struct task_struct *task; @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, while (!ret && (task = css_task_iter_next(&it))) ret = fn(task, arg); css_task_iter_end(&it); - if (ret) { + if (ret || memcg == root_mem_cgroup) { mem_cgroup_iter_break(memcg, iter); break; } -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org>]
* Re: [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup [not found] ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-09 21:11 ` David Rientjes 0 siblings, 0 replies; 27+ messages in thread From: David Rientjes @ 2017-10-09 21:11 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, 5 Oct 2017, Roman Gushchin wrote: > Implement mem_cgroup_scan_tasks() functionality for the root > memory cgroup to use this function for looking for a OOM victim > task in the root memory cgroup by the cgroup-ware OOM killer. > > The root memory cgroup is treated as a leaf cgroup, so only tasks > which are directly belonging to the root cgroup are iterated over. > > This patch doesn't introduce any functional change as > mem_cgroup_scan_tasks() is never called for the root memcg. > This is preparatory work for the cgroup-aware OOM killer, > which will use this function to iterate over tasks belonging > to the root memcg. > > Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin 2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin 2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin [not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org> 2017-10-09 21:52 ` David Rientjes [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org> ` (2 subsequent siblings) 5 siblings, 2 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Traditionally, the OOM killer is operating on a process level. Under oom conditions, it finds a process with the highest oom score and kills it. This behavior doesn't suit well the system with many running containers: 1) There is no fairness between containers. A small container with few large processes will be chosen over a large one with huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in a case of a system-wide OOM. To address these issues, the cgroup-aware OOM killer is introduced. Under OOM conditions, it looks for the biggest leaf memory cgroup and kills the biggest task belonging to it. The following patches will extend this functionality to consider non-leaf memory cgroups as well, and also provide an ability to kill all tasks belonging to the victim cgroup. The root cgroup is treated as a leaf memory cgroup, so it's score is compared with leaf memory cgroups. 
Due to memcg statistics implementation a special algorithm is used for estimating it's oom_score: we define it as maximum oom_score of the belonging tasks. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 17 +++++ include/linux/oom.h | 12 +++- mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 70 +++++++++++++----- 4 files changed, 251 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..75b63b68846e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -936,6 +948,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 76aac4ce39bc..ca78e2d5956e 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -9,6 +9,13 @@ #include <linux/sched/coredump.h> /* MMF_* */ #include <linux/mm.h> /* VM_FAULT* */ + +/* + * Special value returned by victim selection functions to indicate + * that are inflight OOM victims. 
+ */ +#define INFLIGHT_VICTIM ((void *)-1UL) + struct zonelist; struct notifier_block; struct mem_cgroup; @@ -39,7 +46,8 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; - struct task_struct *chosen; + struct task_struct *chosen_task; + struct mem_cgroup *chosen_memcg; unsigned long chosen_points; }; @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 41d71f665550..191b70735f1f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long memcg_oom_badness(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + long points = 0; + int nid; + pg_data_t *pgdat; + + /* + * We don't have necessary stats for the root memcg, + * so we define it's oom_score as the maximum oom_score + * of the belonging tasks. + * + * As tasks in the root memcg unlikely are parts of a + * single workload, and we don't have to implement + * group killing, this approximation is reasonable. + * + * But if we will have necessary stats for the root memcg, + * we might switch to the approach which is used for all + * other memcgs. 
+ */ + if (memcg == root_mem_cgroup) { + struct css_task_iter it; + struct task_struct *task; + long score, max_score = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + score = oom_badness(task, memcg, nodemask, + totalpages); + if (score > max_score) + max_score = score; + } + css_task_iter_end(&it); + + return max_score; + } + + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; + + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + + pgdat = NODE_DATA(nid); + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), + NR_SLAB_UNRECLAIMABLE); + } + + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + + return points; +} + +/* + * Checks if the given memcg is a valid OOM victim and returns a number, + * which means the folowing: + * -1: there are inflight OOM victim tasks, belonging to the memcg + * 0: memcg is not eligible, e.g. all belonging tasks are protected + * by oom_score_adj set to OOM_SCORE_ADJ_MIN + * >0: memcg is eligible, and the returned value is an estimation + * of the memory footprint + */ +static long oom_evaluate_memcg(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + struct css_task_iter it; + struct task_struct *task; + int eligible = 0; + + /* + * Memcg is OOM eligible if there are OOM killable tasks inside. + * + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN + * as unkillable. + * + * If there are inflight OOM victim tasks inside the memcg, + * we return -1. 
+ */ + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (!eligible && + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) + eligible = 1; + + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + eligible = -1; + break; + } + } + css_task_iter_end(&it); + + if (eligible <= 0) + return eligible; + + return memcg_oom_badness(memcg, nodemask, totalpages); +} + +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) +{ + struct mem_cgroup *iter; + + oc->chosen_memcg = NULL; + oc->chosen_points = 0; + + /* + * The oom_score is calculated for leaf memory cgroups (including + * the root memcg). + */ + rcu_read_lock(); + for_each_mem_cgroup_tree(iter, root) { + long score; + + if (memcg_has_children(iter) && iter != root_mem_cgroup) + continue; + + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); + + /* + * Ignore empty and non-eligible memory cgroups. + */ + if (score == 0) + continue; + + /* + * If there are inflight OOM victims, we don't need + * to look further for new victims. + */ + if (score == -1) { + oc->chosen_memcg = INFLIGHT_VICTIM; + mem_cgroup_iter_break(root, iter); + break; + } + + if (score > oc->chosen_points) { + oc->chosen_points = score; + oc->chosen_memcg = iter; + } + } + + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) + css_get(&oc->chosen_memcg->css); + + rcu_read_unlock(); +} + +bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + struct mem_cgroup *root; + + if (mem_cgroup_disabled()) + return false; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return false; + + if (oc->memcg) + root = oc->memcg; + else + root = root_mem_cgroup; + + select_victim_memcg(root, oc); + + return oc->chosen_memcg; +} + /* * Reclaims as many pages from the given memcg as possible. 
* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ccdb7d34cd13..20e62ec32ba8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -310,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } -static int oom_evaluate_task(struct task_struct *task, void *arg) +int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; unsigned long points; @@ -344,26 +344,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) goto next; /* Prefer thread group leaders for display purposes */ - if (points == oc->chosen_points && thread_group_leader(oc->chosen)) + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task)) goto next; select: - if (oc->chosen) - put_task_struct(oc->chosen); + if (oc->chosen_task) + put_task_struct(oc->chosen_task); get_task_struct(task); - oc->chosen = task; + oc->chosen_task = task; oc->chosen_points = points; next: return 0; abort: - if (oc->chosen) - put_task_struct(oc->chosen); - oc->chosen = (void *)-1UL; + if (oc->chosen_task) + put_task_struct(oc->chosen_task); + oc->chosen_task = INFLIGHT_VICTIM; return 1; } /* * Simple selection loop. We choose the process with the highest number of - * 'points'. In case scan was aborted, oc->chosen is set to -1. + * 'points'. In case scan was aborted, oc->chosen_task is set to -1. 
*/ static void select_bad_process(struct oom_control *oc) { @@ -926,7 +926,7 @@ static void __oom_kill_process(struct task_struct *victim) static void oom_kill_process(struct oom_control *oc, const char *message) { - struct task_struct *p = oc->chosen; + struct task_struct *p = oc->chosen_task; unsigned int points = oc->chosen_points; struct task_struct *victim = p; struct task_struct *child; @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) __oom_kill_process(victim); } +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ + + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) + return oc->chosen_memcg; + + /* Kill a task in the chosen memcg with the biggest memory footprint */ + oc->chosen_points = 0; + oc->chosen_task = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) + goto out; + + __oom_kill_process(oc->chosen_task); + +out: + mem_cgroup_put(oc->chosen_memcg); + return oc->chosen_task; +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ @@ -1039,6 +1060,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + bool delay = false; /* if set, delay next allocation attempt */ if (oom_killer_disabled) return false; @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { get_task_struct(current); - oc->chosen = current; + oc->chosen_task = current; oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); return true; } + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { + delay = true; + goto out; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (oc->chosen && oc->chosen != (void *)-1UL) { + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); - /* - * Give the killed process a good chance to exit before trying - * to allocate memory again. - */ - schedule_timeout_killable(1); + delay = true; } - return !!oc->chosen; + +out: + /* + * Give the killed process a good chance to exit before trying + * to allocate memory again. + */ + if (delay) + schedule_timeout_killable(1); + + return !!oc->chosen_task; } /* -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-05 14:27 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-05 14:27 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu 05-10-17 14:04:51, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > > 2) Containers often do not expect that some random process inside > will be killed. In many cases much safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > > To address these issues, the cgroup-aware OOM killer is introduced. > > Under OOM conditions, it looks for the biggest leaf memory cgroup > and kills the biggest task belonging to it. The following patches > will extend this functionality to consider non-leaf memory cgroups > as well, and also provide an ability to kill all tasks belonging > to the victim cgroup. > > The root cgroup is treated as a leaf memory cgroup, so it's score > is compared with leaf memory cgroups. > Due to memcg statistics implementation a special algorithm > is used for estimating it's oom_score: we define it as maximum > oom_score of the belonging tasks. 
> > Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> > Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Cc: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org> > Cc: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: kernel-team-b10kYP2dOMg@public.gmane.org > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Assuming this is an opt-in Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > --- > include/linux/memcontrol.h | 17 +++++ > include/linux/oom.h | 12 +++- > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++++++++++ > mm/oom_kill.c | 70 +++++++++++++----- > 4 files changed, 251 insertions(+), 20 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 69966c461d1c..75b63b68846e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -35,6 +35,7 @@ struct mem_cgroup; > struct page; > struct mm_struct; > struct kmem_cache; > +struct oom_control; > > /* Cgroup-specific page state, on top of universal node page state */ > enum memcg_stat_item { > @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ > return css ? 
container_of(css, struct mem_cgroup, css) : NULL; > } > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg) > +{ > + css_put(&memcg->css); > +} > + > #define mem_cgroup_from_counter(counter, member) \ > container_of(counter, struct mem_cgroup, member) > > @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) > > bool mem_cgroup_oom_synchronize(bool wait); > > +bool mem_cgroup_select_oom_victim(struct oom_control *oc); > + > #ifdef CONFIG_MEMCG_SWAP > extern int do_swap_account; > #endif > @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, > return true; > } > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg) > +{ > +} > + > static inline struct mem_cgroup * > mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > @@ -936,6 +948,11 @@ static inline > void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) > { > } > + > +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + return false; > +} > #endif /* CONFIG_MEMCG */ > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 76aac4ce39bc..ca78e2d5956e 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -9,6 +9,13 @@ > #include <linux/sched/coredump.h> /* MMF_* */ > #include <linux/mm.h> /* VM_FAULT* */ > > + > +/* > + * Special value returned by victim selection functions to indicate > + * that are inflight OOM victims. 
> + */ > +#define INFLIGHT_VICTIM ((void *)-1UL) > + > struct zonelist; > struct notifier_block; > struct mem_cgroup; > @@ -39,7 +46,8 @@ struct oom_control { > > /* Used by oom implementation, do not set */ > unsigned long totalpages; > - struct task_struct *chosen; > + struct task_struct *chosen_task; > + struct mem_cgroup *chosen_memcg; > unsigned long chosen_points; > }; > > @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); > > extern struct task_struct *find_lock_task_mm(struct task_struct *p); > > +extern int oom_evaluate_task(struct task_struct *task, void *arg); > + > /* sysctls */ > extern int sysctl_oom_dump_tasks; > extern int sysctl_oom_kill_allocating_task; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 41d71f665550..191b70735f1f 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) > return ret; > } > > +static long memcg_oom_badness(struct mem_cgroup *memcg, > + const nodemask_t *nodemask, > + unsigned long totalpages) > +{ > + long points = 0; > + int nid; > + pg_data_t *pgdat; > + > + /* > + * We don't have necessary stats for the root memcg, > + * so we define it's oom_score as the maximum oom_score > + * of the belonging tasks. > + * > + * As tasks in the root memcg unlikely are parts of a > + * single workload, and we don't have to implement > + * group killing, this approximation is reasonable. > + * > + * But if we will have necessary stats for the root memcg, > + * we might switch to the approach which is used for all > + * other memcgs. 
> + */ > + if (memcg == root_mem_cgroup) { > + struct css_task_iter it; > + struct task_struct *task; > + long score, max_score = 0; > + > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + score = oom_badness(task, memcg, nodemask, > + totalpages); > + if (score > max_score) > + max_score = score; > + } > + css_task_iter_end(&it); > + > + return max_score; > + } > + > + for_each_node_state(nid, N_MEMORY) { > + if (nodemask && !node_isset(nid, *nodemask)) > + continue; > + > + points += mem_cgroup_node_nr_lru_pages(memcg, nid, > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); > + > + pgdat = NODE_DATA(nid); > + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), > + NR_SLAB_UNRECLAIMABLE); > + } > + > + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / > + (PAGE_SIZE / 1024); > + points += memcg_page_state(memcg, MEMCG_SOCK); > + points += memcg_page_state(memcg, MEMCG_SWAP); > + > + return points; > +} > + > +/* > + * Checks if the given memcg is a valid OOM victim and returns a number, > + * which means the folowing: > + * -1: there are inflight OOM victim tasks, belonging to the memcg > + * 0: memcg is not eligible, e.g. all belonging tasks are protected > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > + * >0: memcg is eligible, and the returned value is an estimation > + * of the memory footprint > + */ > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > + const nodemask_t *nodemask, > + unsigned long totalpages) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int eligible = 0; > + > + /* > + * Memcg is OOM eligible if there are OOM killable tasks inside. > + * > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > + * as unkillable. > + * > + * If there are inflight OOM victim tasks inside the memcg, > + * we return -1. 
> + */ > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + if (!eligible && > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > + eligible = 1; > + > + if (tsk_is_oom_victim(task) && > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > + eligible = -1; > + break; > + } > + } > + css_task_iter_end(&it); > + > + if (eligible <= 0) > + return eligible; > + > + return memcg_oom_badness(memcg, nodemask, totalpages); > +} > + > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > +{ > + struct mem_cgroup *iter; > + > + oc->chosen_memcg = NULL; > + oc->chosen_points = 0; > + > + /* > + * The oom_score is calculated for leaf memory cgroups (including > + * the root memcg). > + */ > + rcu_read_lock(); > + for_each_mem_cgroup_tree(iter, root) { > + long score; > + > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > + continue; > + > + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); > + > + /* > + * Ignore empty and non-eligible memory cgroups. > + */ > + if (score == 0) > + continue; > + > + /* > + * If there are inflight OOM victims, we don't need > + * to look further for new victims. 
> + */ > + if (score == -1) { > + oc->chosen_memcg = INFLIGHT_VICTIM; > + mem_cgroup_iter_break(root, iter); > + break; > + } > + > + if (score > oc->chosen_points) { > + oc->chosen_points = score; > + oc->chosen_memcg = iter; > + } > + } > + > + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) > + css_get(&oc->chosen_memcg->css); > + > + rcu_read_unlock(); > +} > + > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + struct mem_cgroup *root; > + > + if (mem_cgroup_disabled()) > + return false; > + > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > + return false; > + > + if (oc->memcg) > + root = oc->memcg; > + else > + root = root_mem_cgroup; > + > + select_victim_memcg(root, oc); > + > + return oc->chosen_memcg; > +} > + > /* > * Reclaims as many pages from the given memcg as possible. > * > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ccdb7d34cd13..20e62ec32ba8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -310,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) > return CONSTRAINT_NONE; > } > > -static int oom_evaluate_task(struct task_struct *task, void *arg) > +int oom_evaluate_task(struct task_struct *task, void *arg) > { > struct oom_control *oc = arg; > unsigned long points; > @@ -344,26 +344,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) > goto next; > > /* Prefer thread group leaders for display purposes */ > - if (points == oc->chosen_points && thread_group_leader(oc->chosen)) > + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task)) > goto next; > select: > - if (oc->chosen) > - put_task_struct(oc->chosen); > + if (oc->chosen_task) > + put_task_struct(oc->chosen_task); > get_task_struct(task); > - oc->chosen = task; > + oc->chosen_task = task; > oc->chosen_points = points; > next: > return 0; > abort: > - if (oc->chosen) > - put_task_struct(oc->chosen); > - oc->chosen = (void *)-1UL; > + if (oc->chosen_task) > + 
put_task_struct(oc->chosen_task); > + oc->chosen_task = INFLIGHT_VICTIM; > return 1; > } > > /* > * Simple selection loop. We choose the process with the highest number of > - * 'points'. In case scan was aborted, oc->chosen is set to -1. > + * 'points'. In case scan was aborted, oc->chosen_task is set to -1. > */ > static void select_bad_process(struct oom_control *oc) > { > @@ -926,7 +926,7 @@ static void __oom_kill_process(struct task_struct *victim) > > static void oom_kill_process(struct oom_control *oc, const char *message) > { > - struct task_struct *p = oc->chosen; > + struct task_struct *p = oc->chosen_task; > unsigned int points = oc->chosen_points; > struct task_struct *victim = p; > struct task_struct *child; > @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > __oom_kill_process(victim); > } > > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > + > + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) > + return oc->chosen_memcg; > + > + /* Kill a task in the chosen memcg with the biggest memory footprint */ > + oc->chosen_points = 0; > + oc->chosen_task = NULL; > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > + > + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) > + goto out; > + > + __oom_kill_process(oc->chosen_task); > + > +out: > + mem_cgroup_put(oc->chosen_memcg); > + return oc->chosen_task; > +} > + > /* > * Determines whether the kernel must panic because of the panic_on_oom sysctl. 
> */ > @@ -1039,6 +1060,7 @@ bool out_of_memory(struct oom_control *oc) > { > unsigned long freed = 0; > enum oom_constraint constraint = CONSTRAINT_NONE; > + bool delay = false; /* if set, delay next allocation attempt */ > > if (oom_killer_disabled) > return false; > @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) > current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > get_task_struct(current); > - oc->chosen = current; > + oc->chosen_task = current; > oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); > return true; > } > > + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { > + delay = true; > + goto out; > + } > + > select_bad_process(oc); > /* Found nothing?!?! Either we hang forever, or we panic. */ > - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > dump_header(oc, NULL); > panic("Out of memory and no killable processes...\n"); > } > - if (oc->chosen && oc->chosen != (void *)-1UL) { > + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > "Memory cgroup out of memory"); > - /* > - * Give the killed process a good chance to exit before trying > - * to allocate memory again. > - */ > - schedule_timeout_killable(1); > + delay = true; > } > - return !!oc->chosen; > + > +out: > + /* > + * Give the killed process a good chance to exit before trying > + * to allocate memory again. > + */ > + if (delay) > + schedule_timeout_killable(1); > + > + return !!oc->chosen_task; > } > > /* > -- > 2.13.6 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin [not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-09 21:52 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 2017-10-10 12:23 ` Roman Gushchin 1 sibling, 2 replies; 27+ messages in thread From: David Rientjes @ 2017-10-09 21:52 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Thu, 5 Oct 2017, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > > 2) Containers often do not expect that some random process inside > will be killed. In many cases much safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > I'd move the second point to the changelog for the next patch since this patch doesn't implement any support for memory.oom_group. > To address these issues, the cgroup-aware OOM killer is introduced. > > Under OOM conditions, it looks for the biggest leaf memory cgroup > and kills the biggest task belonging to it. The following patches > will extend this functionality to consider non-leaf memory cgroups > as well, and also provide an ability to kill all tasks belonging > to the victim cgroup. 
> > The root cgroup is treated as a leaf memory cgroup, so it's score > is compared with leaf memory cgroups. > Due to memcg statistics implementation a special algorithm > is used for estimating it's oom_score: we define it as maximum > oom_score of the belonging tasks. > This seems to unfairly bias the root mem cgroup depending on process size. It isn't treated fairly as a leaf mem cgroup if they are being compared based on different criteria: the root mem cgroup as (mostly) the largest rss of a single process vs leaf mem cgroups as all anon, unevictable, and unreclaimable slab pages charged to it by all processes. I imagine a configuration where the root mem cgroup has 100 processes attached each with rss of 80MB, compared to a leaf cgroup with 100 processes of 1MB rss each. How does this logic prevent repeatedly oom killing the processes of 1MB rss? In this case, "the root cgroup is treated as a leaf memory cgroup" isn't quite fair, it can simply hide large processes from being selected. Users who configure cgroups in a unified hierarchy for other resource constraints are penalized for this choice even though the mem cgroup with 100 processes of 1MB rss each may not be limited itself. I think for this comparison to be fair, it requires accounting for the root mem cgroup itself or for a different accounting methodology for leaf memory cgroups. > +/* > + * Checks if the given memcg is a valid OOM victim and returns a number, > + * which means the folowing: > + * -1: there are inflight OOM victim tasks, belonging to the memcg > + * 0: memcg is not eligible, e.g. 
all belonging tasks are protected > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > + * >0: memcg is eligible, and the returned value is an estimation > + * of the memory footprint > + */ > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > + const nodemask_t *nodemask, > + unsigned long totalpages) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int eligible = 0; > + > + /* > + * Memcg is OOM eligible if there are OOM killable tasks inside. > + * > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > + * as unkillable. > + * > + * If there are inflight OOM victim tasks inside the memcg, > + * we return -1. > + */ > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + if (!eligible && > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > + eligible = 1; > + > + if (tsk_is_oom_victim(task) && > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > + eligible = -1; > + break; > + } > + } > + css_task_iter_end(&it); > + > + if (eligible <= 0) > + return eligible; > + > + return memcg_oom_badness(memcg, nodemask, totalpages); > +} > + > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > +{ > + struct mem_cgroup *iter; > + > + oc->chosen_memcg = NULL; > + oc->chosen_points = 0; > + > + /* > + * The oom_score is calculated for leaf memory cgroups (including > + * the root memcg). > + */ > + rcu_read_lock(); > + for_each_mem_cgroup_tree(iter, root) { > + long score; > + > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > + continue; I'll reiterate what I did on the last version of the patchset: considering only leaf memory cgroups easily allows users to defeat this heuristic and bias against all of their memory usage up to the largest process size amongst the set of processes attached. 
If the user creates N child mem cgroups for their N processes and attaches one process to each child, the _only_ thing this achieved is to defeat your heuristic and prefer other leaf cgroups simply because those other leaf cgroups did not do this. Effectively: for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done will radically shift the heuristic from a score of all anonymous + unevictable memory for all processes to a score of the largest anonymous + unevictable memory for a single process. There is no downside or ramification for the end user in doing this. When comparing cgroups based on usage, it only makes sense to compare the hierarchical usage of that cgroup so that attaching processes to descendants or splitting the implementation of a process into several smaller individual processes does not allow this heuristic to be defeated. > + > + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); > + > + /* > + * Ignore empty and non-eligible memory cgroups. > + */ > + if (score == 0) > + continue; > + > + /* > + * If there are inflight OOM victims, we don't need > + * to look further for new victims. > + */ > + if (score == -1) { > + oc->chosen_memcg = INFLIGHT_VICTIM; > + mem_cgroup_iter_break(root, iter); > + break; > + } > + > + if (score > oc->chosen_points) { > + oc->chosen_points = score; > + oc->chosen_memcg = iter; > + } I'll reiterate what I did on previous versions of this patchset: this effectively removes all control the user has from influencing oom victim selection. Victim selection is very important, the user must be able to influence that decision to prevent the loss of important work when the system is out of memory. This heuristic only uses user influence by considering whether a memory cgroup is eligible depending on whether all processes have /proc/pid/oom_score_adj == -1000 or not. 
It means a user must oom disable all processes attached to an important memory cgroup that has not reached its limit to prevent it from being oom killed with this heuristic. It simply has no other choice. It cannot differentiate between two memory cgroups where one is expected to have much higher memory usage, and should be protected because of end user goals. > + } > + > + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) > + css_get(&oc->chosen_memcg->css); > + > + rcu_read_unlock(); > +} > + > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + struct mem_cgroup *root; > + > + if (mem_cgroup_disabled()) > + return false; > + > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > + return false; > + > + if (oc->memcg) > + root = oc->memcg; > + else > + root = root_mem_cgroup; > + > + select_victim_memcg(root, oc); > + > + return oc->chosen_memcg; > +} > + > /* > * Reclaims as many pages from the given memcg as possible. > * > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ccdb7d34cd13..20e62ec32ba8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > __oom_kill_process(victim); > } > > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > + > + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) > + return oc->chosen_memcg; > + > + /* Kill a task in the chosen memcg with the biggest memory footprint */ > + oc->chosen_points = 0; > + oc->chosen_task = NULL; > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > + > + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) > + goto out; > + > + __oom_kill_process(oc->chosen_task); > + > +out: > + mem_cgroup_put(oc->chosen_memcg); > + return oc->chosen_task; > +} > + > /* > * Determines whether the kernel must panic because of the panic_on_oom sysctl. 
> */ > @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) > current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > get_task_struct(current); > - oc->chosen = current; > + oc->chosen_task = current; > oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); > return true; > } > > + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { > + delay = true; > + goto out; > + } > + > select_bad_process(oc); This is racy because mem_cgroup_select_oom_victim() found an eligible oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a process cannot be killed because of oom_unkillable_task(), the only eligible processes moved or exited, or the /proc/pid/oom_score_adj of the eligible processes changed, we end up falling back to the complete tasklist scan. It would be better for oom_evaluate_memcg() to consider oom_unkillable_task() and also retry in the case where oom_kill_memcg_victim() returns NULL. ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-10 8:18 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-10 8:18 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon 09-10-17 14:52:53, David Rientjes wrote: > On Thu, 5 Oct 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer is operating on a process level. > > Under oom conditions, it finds a process with the highest oom score > > and kills it. > > > > This behavior doesn't suit well the system with many running > > containers: > > > > 1) There is no fairness between containers. A small container with > > few large processes will be chosen over a large one with huge > > number of small processes. > > > > 2) Containers often do not expect that some random process inside > > will be killed. In many cases much safer behavior is to kill > > all tasks in the container. Traditionally, this was implemented > > in userspace, but doing it in the kernel has some advantages, > > especially in a case of a system-wide OOM. > > > > I'd move the second point to the changelog for the next patch since this > patch doesn't implement any support for memory.oom_group. > > > To address these issues, the cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for the biggest leaf memory cgroup > > and kills the biggest task belonging to it. The following patches > > will extend this functionality to consider non-leaf memory cgroups > > as well, and also provide an ability to kill all tasks belonging > > to the victim cgroup. 
> > > > The root cgroup is treated as a leaf memory cgroup, so it's score > > is compared with leaf memory cgroups. > > Due to memcg statistics implementation a special algorithm > > is used for estimating it's oom_score: we define it as maximum > > oom_score of the belonging tasks. > > > > This seems to unfairly bias the root mem cgroup depending on process size. > It isn't treated fairly as a leaf mem cgroup if they are being compared > based on different criteria: the root mem cgroup as (mostly) the largest > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > unreclaimable slab pages charged to it by all processes. > > I imagine a configuration where the root mem cgroup has 100 processes > attached each with rss of 80MB, compared to a leaf cgroup with 100 > processes of 1MB rss each. How does this logic prevent repeatedly oom > killing the processes of 1MB rss? > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > quite fair, it can simply hide large processes from being selected. Users > who configure cgroups in a unified hierarchy for other resource > constraints are penalized for this choice even though the mem cgroup with > 100 processes of 1MB rss each may not be limited itself. > > I think for this comparison to be fair, it requires accounting for the > root mem cgroup itself or for a different accounting methodology for leaf > memory cgroups. I believe this is documented in the patch. I agree with you but I also assume this will not be such a big problem in practice because usecases which are going to opt-in for the cgroup aware OOM killer will have the all workloads running in memcgs and the root will basically run only some essential system wide services needed for the overall system operation. Risk of the runaway of this should be reasonably small and killing any of those will put the system into an unstable state anyway. 
That being said future improvements are possible but I guess that shouldn't be a roadblock for the feature to be merged. > > +/* > > + * Checks if the given memcg is a valid OOM victim and returns a number, > > + * which means the folowing: > > + * -1: there are inflight OOM victim tasks, belonging to the memcg > > + * 0: memcg is not eligible, e.g. all belonging tasks are protected > > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > > + * >0: memcg is eligible, and the returned value is an estimation > > + * of the memory footprint > > + */ > > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > > + const nodemask_t *nodemask, > > + unsigned long totalpages) > > +{ > > + struct css_task_iter it; > > + struct task_struct *task; > > + int eligible = 0; > > + > > + /* > > + * Memcg is OOM eligible if there are OOM killable tasks inside. > > + * > > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > > + * as unkillable. > > + * > > + * If there are inflight OOM victim tasks inside the memcg, > > + * we return -1. > > + */ > > + css_task_iter_start(&memcg->css, 0, &it); > > + while ((task = css_task_iter_next(&it))) { > > + if (!eligible && > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > > + eligible = 1; > > + > > + if (tsk_is_oom_victim(task) && > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > > + eligible = -1; > > + break; > > + } > > + } > > + css_task_iter_end(&it); > > + > > + if (eligible <= 0) > > + return eligible; > > + > > + return memcg_oom_badness(memcg, nodemask, totalpages); > > +} > > + > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > > +{ > > + struct mem_cgroup *iter; > > + > > + oc->chosen_memcg = NULL; > > + oc->chosen_points = 0; > > + > > + /* > > + * The oom_score is calculated for leaf memory cgroups (including > > + * the root memcg). 
> > + */ > > + rcu_read_lock(); > > + for_each_mem_cgroup_tree(iter, root) { > > + long score; > > + > > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > > + continue; > > I'll reiterate what I did on the last version of the patchset: considering > only leaf memory cgroups easily allows users to defeat this heuristic and > bias against all of their memory usage up to the largest process size > amongst the set of processes attached. If the user creates N child mem > cgroups for their N processes and attaches one process to each child, the > _only_ thing this achieved is to defeat your heuristic and prefer other > leaf cgroups simply because those other leaf cgroups did not do this. I do not think repeating the argument is both needed nor helpful. It has been already argued that the userspace is already able to do the same by splitting the memory consumptions between processes. I would argue even further that allowing an untrusted entity to create arbitrary sub groups is dangerous for other reasons. > Effectively: > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > will radically shift the heuristic from a score of all anonymous + > unevictable memory for all processes to a score of the largest anonymous + > unevictable memory for a single process. There is no downside or > ramifaction for the end user in doing this. When comparing cgroups based > on usage, it only makes sense to compare the hierarchical usage of that > cgroup so that attaching processes to descendants or splitting the > implementation of a process into several smaller individual processes does > not allow this heuristic to be defeated. But it breaks other usecases as already pointed out and it is quite sad you keep ignoring those. > > + > > + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); > > + > > + /* > > + * Ignore empty and non-eligible memory cgroups. 
> > + */ > > + if (score == 0) > > + continue; > > + > > + /* > > + * If there are inflight OOM victims, we don't need > > + * to look further for new victims. > > + */ > > + if (score == -1) { > > + oc->chosen_memcg = INFLIGHT_VICTIM; > > + mem_cgroup_iter_break(root, iter); > > + break; > > + } > > + > > + if (score > oc->chosen_points) { > > + oc->chosen_points = score; > > + oc->chosen_memcg = iter; > > + } > > I'll reiterate what I did on previous versions of this patchset: this > effectively removes all control the user has from influencing oom victim > selection. Victim selection is very important, the user must be able to > influence that decision to prevent the loss of important work when the > system is out of memory. And again it has been argued, and rightfully so, that this is not in scope of this series and a more advanced user space influence can be implemented on top. [...] > > @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) > > current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && > > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > > get_task_struct(current); > > - oc->chosen = current; > > + oc->chosen_task = current; > > oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); > > return true; > > } > > > > + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { > > + delay = true; > > + goto out; > > + } > > + > > select_bad_process(oc); > > This is racy because mem_cgroup_select_oom_victim() found an eligible > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a > process cannot be killed because of oom_unkillable_task(), the only > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the > eligible processes changed, we end up falling back to the complete > tasklist scan. oom victim selection will always be racy wrt. tasks exiting. 
Falling back to the complete tasklist scan should be tolerable as this is really not even remotely close to a hot path. > It would be better for oom_evaluate_memcg() to consider > oom_unkillable_task() and also retry in the case where > oom_kill_memcg_victim() returns NULL. I am not really sure oom_unkillable_task will really help. Most of the conditions are simply not applicable to the memcgs' tasks. The only interesting one might be has_intersects_mems_allowed but even that one is quite questionable. memcg_oom_badness is already NUMA aware and has_intersects_mems_allowed is not much more reliable way to detect specific node consumers anyway. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
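The effect of the `mkdir`-per-process trick under dispute in the last two messages can also be quantified with a toy model (the numbers are illustrative; only the leaf-sum scoring rule from the patch is modeled):

```python
# 20 processes of 10 memory units each, all in one leaf cgroup:
tasks = [10] * 20
score_together = sum(tasks)   # the whole workload competes with score 200

# After moving each process into its own child cgroup, every leaf holds
# exactly one task, so the largest score any leaf exposes is one task:
score_split = max(tasks)      # 10

assert score_together == 200 and score_split == 10
```

Whether this is an exploitable regression (David's view) or no worse than what multi-process workloads can already do today (Michal's view) is exactly the disagreement above; the arithmetic itself is not in dispute.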
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-09 21:52 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-10 12:23 ` Roman Gushchin 2017-10-10 21:13 ` David Rientjes 1 sibling, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-10 12:23 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Mon, Oct 09, 2017 at 02:52:53PM -0700, David Rientjes wrote: > On Thu, 5 Oct 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer operates at the process level. > > Under oom conditions, it finds the process with the highest oom score > > and kills it. > > > > This behavior doesn't suit systems with many running > > containers well: > > > > 1) There is no fairness between containers. A small container with > > a few large processes will be chosen over a large one with a huge > > number of small processes. > > > > 2) Containers often do not expect that some random process inside > > will be killed. In many cases a much safer behavior is to kill > > all tasks in the container. Traditionally, this was implemented > > in userspace, but doing it in the kernel has some advantages, > > especially in the case of a system-wide OOM. > > > > I'd move the second point to the changelog for the next patch since this > patch doesn't implement any support for memory.oom_group. There is a special remark later in the changelog explaining that this functionality will be added by the following patches. I thought it was useful to have all the basic ideas in one place. > > > To address these issues, the cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for the biggest leaf memory cgroup > > and kills the biggest task belonging to it.
The following patches > > will extend this functionality to consider non-leaf memory cgroups > > as well, and also provide an ability to kill all tasks belonging > > to the victim cgroup. > > > > The root cgroup is treated as a leaf memory cgroup, so its score > > is compared with leaf memory cgroups. > > Due to the memcg statistics implementation, a special algorithm > > is used for estimating its oom_score: we define it as the maximum > > oom_score of the belonging tasks. > > > > This seems to unfairly bias the root mem cgroup depending on process size. > It isn't treated fairly as a leaf mem cgroup if they are being compared > based on different criteria: the root mem cgroup as (mostly) the largest > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > unreclaimable slab pages charged to it by all processes. > > I imagine a configuration where the root mem cgroup has 100 processes > attached each with rss of 80MB, compared to a leaf cgroup with 100 > processes of 1MB rss each. How does this logic prevent repeatedly oom > killing the processes of 1MB rss? > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > quite fair, it can simply hide large processes from being selected. Users > who configure cgroups in a unified hierarchy for other resource > constraints are penalized for this choice even though the mem cgroup with > 100 processes of 1MB rss each may not be limited itself. > > I think for this comparison to be fair, it requires accounting for the > root mem cgroup itself or for a different accounting methodology for leaf > memory cgroups. This is basically a workaround, because we don't have the necessary stats for the root memory cgroup. If we start gathering them at some point, we can change this and treat the root memcg exactly like other leaf cgroups. Or, if someone comes up with an idea for a better approximation, it can be implemented as a separate enhancement on top of the initial implementation.
This is more than welcome. > > I'll reiterate what I did on the last version of the patchset: considering > only leaf memory cgroups easily allows users to defeat this heuristic and > bias against all of their memory usage up to the largest process size > amongst the set of processes attached. If the user creates N child mem > cgroups for their N processes and attaches one process to each child, the > _only_ thing this achieved is to defeat your heuristic and prefer other > leaf cgroups simply because those other leaf cgroups did not do this. > > Effectively: > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > will radically shift the heuristic from a score of all anonymous + > unevictable memory for all processes to a score of the largest anonymous + > unevictable memory for a single process. There is no downside or > ramification for the end user in doing this. When comparing cgroups based > on usage, it only makes sense to compare the hierarchical usage of that > cgroup so that attaching processes to descendants or splitting the > implementation of a process into several smaller individual processes does > not allow this heuristic to be defeated. To all that was said previously I can only add that cgroup v2 allows limiting the number of cgroups in a sub-tree: 1a926e0bbab8 ("cgroup: implement hierarchy limits").
The fallback to the existing mechanism is implemented to be safe for sure, especially in the case of a global OOM. When we get more confidence in the cgroup-aware OOM killer's reliability, we can change this behavior. Personally, I would prefer to get rid of looking at all tasks just to find a pre-existing OOM victim, but it might be quite tricky to implement. Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
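David's evasion example in the exchange above can be made concrete with a small sketch. The numbers and the scoring rule are assumptions for illustration: each leaf cgroup is scored as the sum of its processes' charged memory, a simplified stand-in for the patchset's anon + unevictable + unreclaimable slab heuristic:

```python
def leaf_scores(cgroups):
    """cgroups: {leaf cgroup name: [per-process usage, MB]}.
    Score each leaf by the total usage of its processes (simplified)."""
    return {name: sum(procs) for name, procs in cgroups.items()}

# One leaf cgroup with 100 processes of 80MB each (assumed numbers).
before = leaf_scores({"job": [80] * 100})

# Same workload after "for i in $(cat cgroup.procs); do mkdir $i; ...":
# 100 single-process child cgroups, each now scored independently.
after = leaf_scores({f"job/{i}": [80] for i in range(100)})

worst_before = max(before.values())   # 8000 MB -> obvious OOM target
worst_after = max(after.values())     # 80 MB per child -> easily evaded
```

The workload's total usage is unchanged, but its worst per-leaf score drops by two orders of magnitude, which is exactly the objection to scoring only leaf cgroups.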
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 12:23 ` Roman Gushchin @ 2017-10-10 21:13 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: David Rientjes @ 2017-10-10 21:13 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue, 10 Oct 2017, Roman Gushchin wrote: > > This seems to unfairly bias the root mem cgroup depending on process size. > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > based on different criteria: the root mem cgroup as (mostly) the largest > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > unreclaimable slab pages charged to it by all processes. > > > > I imagine a configuration where the root mem cgroup has 100 processes > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > killing the processes of 1MB rss? > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > quite fair, it can simply hide large processes from being selected. Users > > who configure cgroups in a unified hierarchy for other resource > > constraints are penalized for this choice even though the mem cgroup with > > 100 processes of 1MB rss each may not be limited itself. > > > > I think for this comparison to be fair, it requires accounting for the > > root mem cgroup itself or for a different accounting methodology for leaf > > memory cgroups. > > This is basically a workaround, because we don't have necessary stats for root > memory cgroup. If we'll start gathering them at some point, we can change this > and treat root memcg exactly as other leaf cgroups. 
> I understand why it currently cannot be an apples vs apples comparison without, as I suggest in the last paragraph, that the same accounting is done for the root mem cgroup, which is intuitive if it is to be considered on the same basis as leaf mem cgroups. I understand for the design to work that leaf mem cgroups and the root mem cgroup must be compared if processes can be attached to the root mem cgroup. My point is that it is currently completely unfair as I've stated: you can have 10000 processes attached to the root mem cgroup with rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each and the oom killer is going to target the leaf mem cgroup as a result of this apples vs oranges comparison. In case it's not clear, the 10000 processes of 80MB rss each is the most likely contributor to a system-wide oom kill. Unfortunately, the heuristic introduced by this patchset is broken wrt a fair comparison of the root mem cgroup usage. > Or, if someone will come with an idea of a better approximation, it can be > implemented as a separate enhancement on top of the initial implementation. > This is more than welcome. > We don't need a better approximation, we need a fair comparison. The heuristic that this patchset is implementing is based on the usage of individual mem cgroups. For the root mem cgroup to be considered eligible, we need to understand its usage. That usage is _not_ what is implemented by this patchset, which is the largest rss of a single attached process. This, in fact, is not an "approximation" at all. In the example of 10000 processes attached with 80MB rss each, the usage of the root mem cgroup is _not_ 80MB. I'll restate that oom killing a process is a last resort for the kernel, but it also must be able to make a smart decision. Targeting dozens of 1MB processes instead of 80MB processes because of a shortcoming in this implementation is not the appropriate selection, it's the opposite of the correct selection. 
> > I'll reiterate what I did on the last version of the patchset: considering > > only leaf memory cgroups easily allows users to defeat this heuristic and > > bias against all of their memory usage up to the largest process size > > amongst the set of processes attached. If the user creates N child mem > > cgroups for their N processes and attaches one process to each child, the > > _only_ thing this achieved is to defeat your heuristic and prefer other > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > Effectively: > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > will radically shift the heuristic from a score of all anonymous + > > unevictable memory for all processes to a score of the largest anonymous + > > unevictable memory for a single process. There is no downside or > > ramifaction for the end user in doing this. When comparing cgroups based > > on usage, it only makes sense to compare the hierarchical usage of that > > cgroup so that attaching processes to descendants or splitting the > > implementation of a process into several smaller individual processes does > > not allow this heuristic to be defeated. > > To all previously said words I can only add that cgroup v2 allows to limit > the amount of cgroups in the sub-tree: > 1a926e0bbab8 ("cgroup: implement hierarchy limits"). > So the solution to for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done evading all oom kills for your mem cgroup is to limit the number of cgroups that can be created by the user? With a unified cgroup hierarchy, that doesn't work well if I wanted to actually constrain these individual processes to different resource limits like cpu usage. In fact, the user may not know it is effectively evading the oom killer entirely because it has constrained the cpu of individual processes because its a side-effect of this heuristic. 
You chose not to respond to my reiteration of userspace having absolutely no control over victim selection with the new heuristic without setting all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a very important job that is running on a system that is really supposed to use 80% of memory, I need to be able to specify that it should not be oom killed based on user goals. Setting all processes to be oom disabled in the important mem cgroup to avoid being oom killed unless absolutely necessary in a system oom condition is not a robust solution: (1) the mem cgroup livelocks if it reaches its own mem cgroup limit and (2) the system panic()'s if these preferred mem cgroups are the only consumers left on the system. With overcommit, both of these possibilities exist in the wild and the problem is only a result of the implementation detail of this patchset. For these reasons: unfair comparison of root mem cgroup usage to bias against that mem cgroup from oom kill in system oom conditions, the ability of users to completely evade the oom killer by attaching all processes to child cgroups either purposefully or unpurposefully, and the inability of userspace to effectively control oom victim selection: Nacked-by: David Rientjes <rientjes@google.com> > > This is racy because mem_cgroup_select_oom_victim() found an eligible > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a > > process cannot be killed because of oom_unkillable_task(), the only > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the > > eligible processes changed, we end up falling back to the complete > > tasklist scan. It would be better for oom_evaluate_memcg() to consider > > oom_unkillable_task() and also retry in the case where > > oom_kill_memcg_victim() returns NULL. > > I agree with you here. 
The fallback to the existing mechanism is implemented > to be safe for sure, especially in a case of a global OOM. When we'll get > more confidence in cgroup-aware OOM killer reliability, we can change this > behavior. Personally, I would prefer to get rid of looking at all tasks just > to find a pre-existing OOM victim, but it might be quite tricky to implement. > I'm not sure what this has to do with confidence in this patchset's reliability? The race obviously exists: mem_cgroup_select_oom_victim() found an eligible process in oc->chosen_memcg but it was either ineligible later because of oom_unkillable_task(), it moved, or it exited. It's a race. For users who opt-in to this new heuristic, they should not be concerned with a process exiting and thus killing a completely unexpected process from an unexpected memcg when it should be possible to retry and select the correct victim. It's much better to document and state to the user what they are opting-in to and clearly define how a victim is chosen with the new heuristic and then implement that so it works correctly. ^ permalink raw reply [flat|nested] 27+ messages in thread
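The root-vs-leaf asymmetry David objects to above can be shown with a toy calculation. All numbers come from his example and the scoring rules are simplified stand-ins: the root cgroup scored by the largest single task rss (the patchset's approximation), a leaf cgroup by its total charged usage:

```python
# Assumed workload from the example: 10000 root tasks of 80MB rss each
# vs. a leaf cgroup with 100 tasks of 1MB each.
root_tasks_rss = [80] * 10000
leaf_tasks_usage = [1] * 100

root_score = max(root_tasks_rss)     # 80: largest single task rss
leaf_score = sum(leaf_tasks_usage)   # 100: total charged usage

# The leaf cgroup "wins" the comparison and gets targeted, even though
# the root cgroup's tasks hold ~800GB in total.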
[parent not found: <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-10 22:04 ` Roman Gushchin 2017-10-11 20:21 ` David Rientjes 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-10 22:04 UTC (permalink / raw) To: David Rientjes Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > This seems to unfairly bias the root mem cgroup depending on process size. > > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > > based on different criteria: the root mem cgroup as (mostly) the largest > > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > > unreclaimable slab pages charged to it by all processes. > > > > > > I imagine a configuration where the root mem cgroup has 100 processes > > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > > killing the processes of 1MB rss? > > > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > > quite fair, it can simply hide large processes from being selected. Users > > > who configure cgroups in a unified hierarchy for other resource > > > constraints are penalized for this choice even though the mem cgroup with > > > 100 processes of 1MB rss each may not be limited itself. > > > > > > I think for this comparison to be fair, it requires accounting for the > > > root mem cgroup itself or for a different accounting methodology for leaf > > > memory cgroups. 
> > > > This is basically a workaround, because we don't have necessary stats for root > > memory cgroup. If we'll start gathering them at some point, we can change this > > and treat root memcg exactly as other leaf cgroups. > > > > I understand why it currently cannot be an apples vs apples comparison > without, as I suggest in the last paragraph, that the same accounting is > done for the root mem cgroup, which is intuitive if it is to be considered > on the same basis as leaf mem cgroups. > > I understand for the design to work that leaf mem cgroups and the root mem > cgroup must be compared if processes can be attached to the root mem > cgroup. My point is that it is currently completely unfair as I've > stated: you can have 10000 processes attached to the root mem cgroup with > rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each > and the oom killer is going to target the leaf mem cgroup as a result of > this apples vs oranges comparison. > > In case it's not clear, the 10000 processes of 80MB rss each is the most > likely contributor to a system-wide oom kill. Unfortunately, the > heuristic introduced by this patchset is broken wrt a fair comparison of > the root mem cgroup usage. > > > Or, if someone will come with an idea of a better approximation, it can be > > implemented as a separate enhancement on top of the initial implementation. > > This is more than welcome. > > > > We don't need a better approximation, we need a fair comparison. The > heuristic that this patchset is implementing is based on the usage of > individual mem cgroups. For the root mem cgroup to be considered > eligible, we need to understand its usage. That usage is _not_ what is > implemented by this patchset, which is the largest rss of a single > attached process. This, in fact, is not an "approximation" at all. In > the example of 10000 processes attached with 80MB rss each, the usage of > the root mem cgroup is _not_ 80MB. 
It's hard to imagine a "healthy" setup with 10000 processes in the root memory cgroup, and even if we kill 1 process we will still have 9999 remaining processes. I agree with you to some extent, but it's not a real-world example. > > I'll restate that oom killing a process is a last resort for the kernel, > but it also must be able to make a smart decision. Targeting dozens of > 1MB processes instead of 80MB processes because of a shortcoming in this > implementation is not the appropriate selection, it's the opposite of the > correct selection. > > > > I'll reiterate what I did on the last version of the patchset: considering > > > only leaf memory cgroups easily allows users to defeat this heuristic and > > > bias against all of their memory usage up to the largest process size > > > amongst the set of processes attached. If the user creates N child mem > > > cgroups for their N processes and attaches one process to each child, the > > > _only_ thing this achieved is to defeat your heuristic and prefer other > > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > > > Effectively: > > > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > > > will radically shift the heuristic from a score of all anonymous + > > > unevictable memory for all processes to a score of the largest anonymous + > > > unevictable memory for a single process. There is no downside or > > > ramification for the end user in doing this. When comparing cgroups based > > > on usage, it only makes sense to compare the hierarchical usage of that > > > cgroup so that attaching processes to descendants or splitting the > > > implementation of a process into several smaller individual processes does > > > not allow this heuristic to be defeated. > > > > To all previously said words I can only add that cgroup v2 allows to limit > > the amount of cgroups in the sub-tree: > > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> > > > So the solution to > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > evading all oom kills for your mem cgroup is to limit the number of > cgroups that can be created by the user? With a unified cgroup hierarchy, > that doesn't work well if I wanted to actually constrain these individual > processes to different resource limits like cpu usage. In fact, the user > may not know it is effectively evading the oom killer entirely because it > has constrained the cpu of individual processes because its a side-effect > of this heuristic. > > > You chose not to respond to my reiteration of userspace having absolutely > no control over victim selection with the new heuristic without setting > all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a > very important job that is running on a system that is really supposed to > use 80% of memory, I need to be able to specify that it should not be oom > killed based on user goals. Setting all processes to be oom disabled in > the important mem cgroup to avoid being oom killed unless absolutely > necessary in a system oom condition is not a robust solution: (1) the mem > cgroup livelocks if it reaches its own mem cgroup limit and (2) the system > panic()'s if these preferred mem cgroups are the only consumers left on > the system. With overcommit, both of these possibilities exist in the > wild and the problem is only a result of the implementation detail of this > patchset. 
> > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or unpurposefully, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> So, if we sum the oom_score of the tasks belonging to the root memory cgroup, will it fix the problem? It might have some drawbacks as well (especially around oom_score_adj), but it's doable if we ignore tasks which are not the owners of their mm struct. > > > > This is racy because mem_cgroup_select_oom_victim() found an eligible > > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a > > > process cannot be killed because of oom_unkillable_task(), the only > > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the > > > eligible processes changed, we end up falling back to the complete > > > tasklist scan. It would be better for oom_evaluate_memcg() to consider > > > oom_unkillable_task() and also retry in the case where > > > oom_kill_memcg_victim() returns NULL. > > > > I agree with you here. The fallback to the existing mechanism is implemented > > to be safe for sure, especially in a case of a global OOM. When we'll get > > more confidence in cgroup-aware OOM killer reliability, we can change this > > behavior. Personally, I would prefer to get rid of looking at all tasks just > > to find a pre-existing OOM victim, but it might be quite tricky to implement. > > > > I'm not sure what this has to do with confidence in this patchset's > reliability?
The race obviously exists: mem_cgroup_select_oom_victim() > found an eligible process in oc->chosen_memcg but it was either ineligible > later because of oom_unkillable_task(), it moved, or it exited. It's a > race. For users who opt in to this new heuristic, they should not be > concerned with a process exiting and thus killing a completely unexpected > process from an unexpected memcg when it should be possible to retry and > select the correct victim. Yes, I have to agree here. Looks like we can't fall back to the original policy. Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
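The retry idea both sides converge on here — re-running memcg selection instead of falling back to a global tasklist scan when the chosen memcg races with an exit or a move — could look roughly like this sketch. It is purely illustrative: `pick_task` is a hypothetical stand-in for a mem_cgroup_scan_tasks()-style search that may come up empty due to the race:

```python
def pick_victim(memcg_scores, pick_task):
    """memcg_scores: {memcg name: score}; pick_task(memcg) returns a
    killable task from that memcg, or None if it raced with an exit,
    a move, or an oom_score_adj change."""
    candidates = dict(memcg_scores)
    while candidates:
        memcg = max(candidates, key=candidates.get)
        task = pick_task(memcg)
        if task is not None:
            return memcg, task
        # The chosen memcg yielded no killable task: drop it and retry
        # with the remaining cgroups rather than scanning all tasks.
        del candidates[memcg]
    return None
```

The point of the sketch is only that a bounded retry over the remaining cgroups replaces the fallback to the complete tasklist scan that David objects to.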
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 22:04 ` Roman Gushchin @ 2017-10-11 20:21 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: David Rientjes @ 2017-10-11 20:21 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue, 10 Oct 2017, Roman Gushchin wrote: > > We don't need a better approximation, we need a fair comparison. The > > heuristic that this patchset is implementing is based on the usage of > > individual mem cgroups. For the root mem cgroup to be considered > > eligible, we need to understand its usage. That usage is _not_ what is > > implemented by this patchset, which is the largest rss of a single > > attached process. This, in fact, is not an "approximation" at all. In > > the example of 10000 processes attached with 80MB rss each, the usage of > > the root mem cgroup is _not_ 80MB. > > It's hard to imagine a "healthy" setup with 10000 process in the root > memory cgroup, and even if we kill 1 process we will still have 9999 > remaining process. I agree with you at some point, but it's not > a real world example. > It's an example that illustrates the problem with the unfair comparison between the root mem cgroup and leaf mem cgroups. It's unfair to compare [largest rss of a single process attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an approximation, as previously stated: the usage of the root mem cgroup is not 100MB if there are 10 such processes attached to the root mem cgroup, it's off by orders of magnitude. For the root mem cgroup to be treated equally as a leaf mem cgroup as this patchset proposes, it must have a fair comparison. 
That can be done by accounting memory to the root mem cgroup in the same way it is to leaf mem cgroups. But let's move the discussion forward to fix it. To avoid necessarily accounting memory to the root mem cgroup, have we considered if it is even necessary to address the root mem cgroup? For the users who opt-in to this heuristic, would it be possible to discount the root mem cgroup from the heuristic entirely so that oom kills originate from leaf mem cgroups? Or, perhaps better, oom kill from non-memory.oom_group cgroups only if the victim rss is greater than an eligible victim rss attached to the root mem cgroup? > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes <rientjes@google.com> > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > will it fix the problem? > > It might have some drawbacks as well (especially around oom_score_adj), > but it's doable, if we'll ignore tasks which are not owners of their's mm struct. > You would be required to discount oom_score_adj because the heuristic doesn't account for oom_score_adj when comparing the anon + unevictable + unreclaimable slab of leaf mem cgroups. This wouldn't result in the correct victim selection in real-world scenarios where processes attached to the root mem cgroup are vital to the system and not part of any user job, i.e. they are important system daemons and the "activity manager" responsible for orchestrating the cgroup hierarchy. It's also still unfair because it now compares [sum of rss of processes attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. 
RSS isn't going to be a solution, regardless if its one process or all processes, if it's being compared to more types of memory in leaf cgroups. If we really don't want root mem cgroup accounting so this is a fair comparison, I think the heuristic needs to special case the root mem cgroup either by discounting root oom kills if there are eligible oom kills from leaf cgroups (the user would be opting-in to this behavior) or comparing the badness of a victim from a leaf cgroup to the badness of a victim from the root cgroup when deciding which to kill and allow the user to protect root mem cgroup processes with oom_score_adj. That aside, all of this has only addressed one of the three concerns with the patchset. I believe the solution to avoid allowing users to circumvent oom kill is to account usage up the hierarchy as you have done in the past. Cgroup hierarchies can be managed by the user so they can create their own subcontainers, this is nothing new, and I would hope that you wouldn't limit your feature to only a very specific set of usecases. That may be your solution for the root mem cgroup itself: if the hierarchical usage of all top-level mem cgroups is known, it's possible to find the root mem cgroup usage by subtraction, you are using stats that are global vmstats in your heuristic. Accounting usage up the hierarchy avoids the first two concerns with the patchset. It allows you to implicitly understand the usage of the root mem cgroup itself, and does not allow users to circumvent oom kill by creating subcontainers, either purposefully or not. The third concern, userspace influence, can allow users to attack leaf mem cgroups deeper in the tree if it is using more memory than expected, but the hierarchical usage is lower at the top-level. That is the only objection that I have seen to using hierarchical usage: there may be a single cgroup deeper in the tree that avoids oom kill because another hierarchy has a higher usage. 
This can trivially be addressed either by oom priorities or an adjustment, just like oom_score_adj, on cgroup usage. ^ permalink raw reply [flat|nested] 27+ messages in thread
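The subtraction approach David suggests above — deriving root-only usage from global counters once every top-level cgroup's hierarchical usage is known — can be sketched as follows. All cgroup names and numbers are assumed for illustration:

```python
def hierarchical_usage(tree, node):
    """Usage charged to `node` itself plus everything below it."""
    own, children = tree[node]
    return own + sum(hierarchical_usage(tree, child) for child in children)

# Assumed layout, usage in MB: {cgroup: (own usage, [children])}
tree = {
    "workload":   (100, ["workload/a", "workload/b"]),
    "workload/a": (300, []),
    "workload/b": (200, []),
    "system":     (250, []),
}

system_usage = 1200                      # global vmstat-style counter, assumed
top_level = ["workload", "system"]

# hierarchical(workload)=600, hierarchical(system)=250, so the memory
# attributable only to root tasks falls out by subtraction: 1200-850=350.
root_only = system_usage - sum(hierarchical_usage(tree, c) for c in top_level)
```

This matches the argument in the email: hierarchical accounting at the top level both defeats the child-cgroup evasion trick and yields the root cgroup's own usage implicitly, without new root-level accounting.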
[parent not found: <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-11 21:49 ` Roman Gushchin 2017-10-12 21:50 ` David Rientjes 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-11 21:49 UTC (permalink / raw) To: David Rientjes Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > We don't need a better approximation, we need a fair comparison. The > > > heuristic that this patchset is implementing is based on the usage of > > > individual mem cgroups. For the root mem cgroup to be considered > > > eligible, we need to understand its usage. That usage is _not_ what is > > > implemented by this patchset, which is the largest rss of a single > > > attached process. This, in fact, is not an "approximation" at all. In > > > the example of 10000 processes attached with 80MB rss each, the usage of > > > the root mem cgroup is _not_ 80MB. > > > > It's hard to imagine a "healthy" setup with 10000 process in the root > > memory cgroup, and even if we kill 1 process we will still have 9999 > > remaining process. I agree with you at some point, but it's not > > a real world example. > > > > It's an example that illustrates the problem with the unfair comparison > between the root mem cgroup and leaf mem cgroups. It's unfair to compare > [largest rss of a single process attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. 
It's not an > approximation, as previously stated: the usage of the root mem cgroup is > not 100MB if there are 10 such processes attached to the root mem cgroup, > it's off by orders of magnitude. > > For the root mem cgroup to be treated equally as a leaf mem cgroup as this > patchset proposes, it must have a fair comparison. That can be done by > accounting memory to the root mem cgroup in the same way it is to leaf mem > cgroups. > > But let's move the discussion forward to fix it. To avoid necessarily > accounting memory to the root mem cgroup, have we considered if it is even > necessary to address the root mem cgroup? For the users who opt-in to > this heuristic, would it be possible to discount the root mem cgroup from > the heuristic entirely so that oom kills originate from leaf mem cgroups? > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > the victim rss is greater than an eligible victim rss attached to the root > mem cgroup? David, I'm not pretending for implementing the best possible accounting for the root memory cgroup, and I'm sure there is a place for further enhancement. But if it's not leading to some obviously stupid victim selection (like ignoring leaking task, which consumes most of the memory), I don't see why it should be treated as a blocker for the whole patchset. I also doubt that any of us has these examples, and the best way to get them is to get some real usage feedback. Ignoring oom_score_adj, subtracting leaf usage sum from system usage etc, these all are perfect ideas which can be implemented on top of this patchset. 
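The mismatch being debated here can be made concrete with a toy calculation; all numbers below are invented for illustration and only model the comparison logic, not the kernel code:

```python
# Toy numbers illustrating the comparison David objects to: the root
# memcg is scored by the largest RSS of a single attached process,
# while a leaf memcg is scored by all memory charged to it.
NPROC_ROOT = 10000
RSS_EACH_MB = 80

root_score_mb = RSS_EACH_MB                # largest single RSS
root_actual_mb = NPROC_ROOT * RSS_EACH_MB  # what root really uses

# A leaf cgroup charged 200 MB in total outscores the root, even though
# the root's processes together use orders of magnitude more memory.
leaf_score_mb = 200
```

Under these numbers the leaf cgroup would be selected although the root cgroup's processes collectively use 4000 times more memory, which is exactly the unfairness being argued over.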
> > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > > against that mem cgroup from oom kill in system oom conditions, the > > > ability of users to completely evade the oom killer by attaching all > > > processes to child cgroups either purposefully or unpurposefully, and the > > > inability of userspace to effectively control oom victim selection: > > > > > > Nacked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > > will it fix the problem? > > > > It might have some drawbacks as well (especially around oom_score_adj), > > but it's doable, if we'll ignore tasks which are not owners of their's mm struct. > > > > You would be required to discount oom_score_adj because the heuristic > doesn't account for oom_score_adj when comparing the anon + unevictable + > unreclaimable slab of leaf mem cgroups. This wouldn't result in the > correct victim selection in real-world scenarios where processes attached > to the root mem cgroup are vital to the system and not part of any user > job, i.e. they are important system daemons and the "activity manager" > responsible for orchestrating the cgroup hierarchy. > > It's also still unfair because it now compares > [sum of rss of processes attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > going to be a solution, regardless if its one process or all processes, if > it's being compared to more types of memory in leaf cgroups. 
> > If we really don't want root mem cgroup accounting so this is a fair > comparison, I think the heuristic needs to special case the root mem > cgroup either by discounting root oom kills if there are eligible oom > kills from leaf cgroups (the user would be opting-in to this behavior) or > comparing the badness of a victim from a leaf cgroup to the badness of a > victim from the root cgroup when deciding which to kill and allow the user > to protect root mem cgroup processes with oom_score_adj. > > That aside, all of this has only addressed one of the three concerns with > the patchset. > > I believe the solution to avoid allowing users to circumvent oom kill is > to account usage up the hierarchy as you have done in the past. Cgroup > hierarchies can be managed by the user so they can create their own > subcontainers, this is nothing new, and I would hope that you wouldn't > limit your feature to only a very specific set of usecases. That may be > your solution for the root mem cgroup itself: if the hierarchical usage of > all top-level mem cgroups is known, it's possible to find the root mem > cgroup usage by subtraction, you are using stats that are global vmstats > in your heuristic. > > Accounting usage up the hierarchy avoids the first two concerns with the > patchset. It allows you to implicitly understand the usage of the root > mem cgroup itself, and does not allow users to circumvent oom kill by > creating subcontainers, either purposefully or not. The third concern, > userspace influence, can allow users to attack leaf mem cgroups deeper in > the tree if it is using more memory than expected, but the hierarchical > usage is lower at the top-level. That is the only objection that I have > seen to using hierarchical usage: there may be a single cgroup deeper in > the tree that avoids oom kill because another hierarchy has a higher > usage. 
This can trivially be addressed either by oom priorities or an > adjustment, just like oom_score_adj, on cgroup usage. As I've said, I barely understand how the exact implementation of root memory cgroup accounting is considered a blocker for the whole feature. The same is true for oom priorities: it's something that can and should be implemented on top of the basic semantics, introduced by this patchset. So, the only real question is the way how we find a victim memcg in the subtree: by performing independent election on each level or by searching tree-wide. We all had many discussion around, and as you remember, initially I was supporting the first option. But then Michal provided a very strong argument: if you have 3 similar workloads in A, B and C, but for non-memory-related reasons (e.g. cpu time sharing) you have to join A and B into a group D:
  /\
 D  C
 / \
A   B
it's strange to penalize A and B for it. It looks to me that you're talking about the similar case, but you consider this hierarchy useful. So, overall, it seems to be depending on exact configuration. I have to add, that if you can enable memory.oom_group, your problem doesn't exist. The selected approach is easy extendable into hierarchical direction: as I've said before, we can introduce a new value of memory.oom_group, which will enable cumulative accounting without mass killing. And, tbh, I don't see how oom_priorities will resolve an opposite problem if we'd take the hierarchical approach. Thanks!
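The two selection strategies being debated can be sketched as a toy simulation; the hierarchy mirrors the D/C example above, the usage numbers are invented, and this models only the comparison logic, not the kernel implementation:

```python
# A, B and C are similar workloads; A and B are joined under D for
# non-memory reasons. Usages (arbitrary units) are illustrative only.
usage = {"A": 100, "B": 100, "C": 150}
children = {"root": ["D", "C"], "D": ["A", "B"]}

def subtree_usage(node):
    # Hierarchical usage: a leaf's own usage, or the sum of its subtree.
    if node in usage:
        return usage[node]
    return sum(subtree_usage(c) for c in children[node])

def pick_levelwise(node="root"):
    # Independent election on each level: descend into the child whose
    # subtree uses the most memory.
    while node in children:
        node = max(children[node], key=subtree_usage)
    return node

def pick_treewide():
    # Tree-wide search: compare all leaf cgroups directly.
    return max(usage, key=usage.get)
```

Level-wise, D (200) beats C (150), so A or B gets killed even though C is the largest individual workload: A and B are penalized for being grouped. The tree-wide search picks C instead, which is Michal's argument in a nutshell.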
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-11 21:49 ` Roman Gushchin @ 2017-10-12 21:50 ` David Rientjes 2017-10-13 13:32 ` Roman Gushchin 0 siblings, 1 reply; 27+ messages in thread From: David Rientjes @ 2017-10-12 21:50 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Wed, 11 Oct 2017, Roman Gushchin wrote: > > But let's move the discussion forward to fix it. To avoid necessarily > > accounting memory to the root mem cgroup, have we considered if it is even > > necessary to address the root mem cgroup? For the users who opt-in to > > this heuristic, would it be possible to discount the root mem cgroup from > > the heuristic entirely so that oom kills originate from leaf mem cgroups? > > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > > the victim rss is greater than an eligible victim rss attached to the root > > mem cgroup? > > David, I'm not pretending for implementing the best possible accounting > for the root memory cgroup, and I'm sure there is a place for further > enhancement. But if it's not leading to some obviously stupid victim > selection (like ignoring leaking task, which consumes most of the memory), > I don't see why it should be treated as a blocker for the whole patchset. > I also doubt that any of us has these examples, and the best way to get > them is to get some real usage feedback. > > Ignoring oom_score_adj, subtracting leaf usage sum from system usage etc, > these all are perfect ideas which can be implemented on top of this patchset. > For the root mem cgroup to be compared to leaf mem cgroups, it needs a fair comparison, not something that we leave to some future patches on top of this patchset. We can't compare some cgroups with other cgroups based on different criteria depending on which cgroup is involved. 
It's actually a quite trivial problem to address, it was a small modification to your hierarchical usage patchset if that's the way that you elect to fix it. I know that some of our customers use cgroups only for one or two jobs on the system, and that isn't necessarily just for memory limitation. The fact remains that without considering the root mem cgroup fairly, these customers are unfairly biased against because they have aggregated their processes in a cgroup. This is not a highly specialized usecase, I am positive that many users use cgroups only for a subset of processes. This heuristic penalizes that behavior to prefer them as oom victims. The problem needs to be fixed instead of asking for the patchset to be merged and hoping that we'll address these issues later. If you account for hierarchical usage, you can easily subtract this from global vmstats to get an implicit root usage. > > You would be required to discount oom_score_adj because the heuristic > > doesn't account for oom_score_adj when comparing the anon + unevictable + > > unreclaimable slab of leaf mem cgroups. This wouldn't result in the > > correct victim selection in real-world scenarios where processes attached > > to the root mem cgroup are vital to the system and not part of any user > > job, i.e. they are important system daemons and the "activity manager" > > responsible for orchestrating the cgroup hierarchy. > > > > It's also still unfair because it now compares > > [sum of rss of processes attached to a cgroup] to > > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > > going to be a solution, regardless if its one process or all processes, if > > it's being compared to more types of memory in leaf cgroups.
> > > > If we really don't want root mem cgroup accounting so this is a fair > > comparison, I think the heuristic needs to special case the root mem > > cgroup either by discounting root oom kills if there are eligible oom > > kills from leaf cgroups (the user would be opting-in to this behavior) or > > comparing the badness of a victim from a leaf cgroup to the badness of a > > victim from the root cgroup when deciding which to kill and allow the user > > to protect root mem cgroup processes with oom_score_adj. > > > > That aside, all of this has only addressed one of the three concerns with > > the patchset. > > > > I believe the solution to avoid allowing users to circumvent oom kill is > > to account usage up the hierarchy as you have done in the past. Cgroup > > hierarchies can be managed by the user so they can create their own > > subcontainers, this is nothing new, and I would hope that you wouldn't > > limit your feature to only a very specific set of usecases. That may be > > your solution for the root mem cgroup itself: if the hierarchical usage of > > all top-level mem cgroups is known, it's possible to find the root mem > > cgroup usage by subtraction, you are using stats that are global vmstats > > in your heuristic. > > > > Accounting usage up the hierarchy avoids the first two concerns with the > > patchset. It allows you to implicitly understand the usage of the root > > mem cgroup itself, and does not allow users to circumvent oom kill by > > creating subcontainers, either purposefully or not. The third concern, > > userspace influence, can allow users to attack leaf mem cgroups deeper in > > the tree if it is using more memory than expected, but the hierarchical > > usage is lower at the top-level. That is the only objection that I have > > seen to using hierarchical usage: there may be a single cgroup deeper in > > the tree that avoids oom kill because another hierarchy has a higher > > usage. 
This can trivially be addressed either by oom priorities or an > > adjustment, just like oom_score_adj, on cgroup usage. > > As I've said, I barely understand how the exact implementation of root memory > cgroup accounting is considered a blocker for the whole feature. > The same is true for oom priorities: it's something that can and should > be implemented on top of the basic semantics, introduced by this patchset. > No, we cannot merge incomplete features that have well identified issues by simply saying that we'll address those issues later. We need a patchset that is complete. Wrt root mem cgroup usage, this change is actually quite trivial with hierarchical usage. The memory cgroup is based on accounting hierarchical usage, you actually have all the data you need already available in the kernel. Iterating all root processes for where task == mm->owner and then accounting rss for those processes is not the same as a leaf cgroup's anonymous + unevictable + unreclaimable slab. It's not even a close approximation in some cases. OOM priorities are a different concern, but it also needs to be addressed as a complete solution. This patchset removes virtually all control the user has in preferring a cgroup for oom kill or biasing against a cgroup for oom kill. The patchset is moving the selection criteria from individual processes to cgroups. Great! Just allow userspace to have influence over that selection just like /proc/pid/oom_adj has existed for over a decade and is very widespread. Users need the ability to protect important cgroups on the system, just like they need the ability to protect important processes on the system with the current heuristic. If a single cgroup accounts for 50% of memory, it will always be the chosen victim memcg with your heuristic. The only thing that is being asked here is that userspace be able to say that cgroup is actually really important and we should oom kill something else. Not hard whatsoever.
These two issues are actually very trivial to implement, and you actually implemented 95% of it in earlier iterations of the patchset. It was a beautiful solution to all of these concerns and well written. If you would prefer that I use this patchset as a basis and then fix it with respect to all three of these issues and then propose it, let me know. > So, the only real question is the way how we find a victim memcg in the > subtree: by performing independent election on each level or by searching > tree-wide. We all had many discussion around, and as you remember, initially > I was supporting the first option. > But then Michal provided a very strong argument: > if you have 3 similar workloads in A, B and C, but for non-memory-related > reasons (e.g. cpu time sharing) you have to join A and B into a group D:
>   /\
>  D  C
>  / \
> A   B
> it's strange to penalize A and B for it. It looks to me that you're > talking about the similar case, but you consider this hierarchy > useful. So, overall, it seems to be depending on exact configuration. > This is _exactly_ why you need oom priorities so that userspace can influence the decisionmaking. This makes my previous point, I'm not sure where the disconnect is coming from? You need to be able to bias D when compared to C for the heuristic to work. Userspace knows how it is organizing its memory cgroups. We would be making a mistake if we thought we knew all possible ways that people are using cgroups and limit your heuristic so that some people can opt-in and others are left with the current per-process heuristic because their users have accidentally subverted oom kill selection because they split their processes amongst subcontainers. > The selected approach is easy extendable into hierarchical direction: > as I've said before, we can introduce a new value of memory.oom_group, > which will enable cumulative accounting without mass killing.
> Again, we cannot merge incomplete patchsets in the hope that issues with that patchset are addressed later, especially when there are three very well defined concerns with the existing implementation. Your earlier iterations were actually a brilliant solution to the problem, I'm not sure that you realize how powerful it could be in practice. > And, tbh, I don't see how oom_priorities will resolve an opposite > problem if we'd take the hierarchical approach. > Think about it in a different way: we currently compare per-process usage and userspace has /proc/pid/oom_score_adj to adjust that usage depending on priorities of that process and still oom kill if there's a memory leak. Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer after all. We don't need a strict memory.oom_priority that outranks all other sibling cgroups regardless of usage. We need a memory.oom_score_adj to adjust the per-cgroup usage. The decisionmaking in your earlier example would be under the control of C/memory.oom_score_adj and D/memory.oom_score_adj. Problem solved. It also solves the problem of userspace being able to influence oom victim selection so now they can protect important cgroups just like we can protect important processes today. And since this would be hierarchical usage, you can trivially infer root mem cgroup usage by subtraction of top-level mem cgroup usage. This is a powerful solution to the problem and gives userspace the control they need so that it can work in all usecases, not a subset of usecases.
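The memory.oom_score_adj proposal can be sketched roughly as follows; its exact semantics were never settled in this thread, so the proportional formula below (mirroring the per-process knob's range and scale) is only an assumption, and all usage numbers are invented:

```python
# Hypothetical per-cgroup adjustment, by analogy with
# /proc/pid/oom_score_adj: a value in [-1000, 1000] shifts the cgroup's
# score by that many thousandths of the comparable total.
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000

def adjusted_score(usage_pages, adj, total_pages):
    assert OOM_SCORE_ADJ_MIN <= adj <= OOM_SCORE_ADJ_MAX
    return usage_pages + adj * total_pages // 1000

total = 1_000_000
# D uses half of memory but is marked important via a negative adj;
# C ends up preferred despite its smaller raw usage.
score_d = adjusted_score(500_000, -500, total)
score_c = adjusted_score(300_000, 0, total)
```

This is the sense in which a cgroup using 50% of memory could still be protected: the adjustment, not a strict priority, tips the comparison.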
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-12 21:50 ` David Rientjes @ 2017-10-13 13:32 ` Roman Gushchin [not found] ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-13 13:32 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Thu, Oct 12, 2017 at 02:50:38PM -0700, David Rientjes wrote: > On Wed, 11 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > on priorities of that process and still oom kill if there's a memory leak. > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer > after all. We don't need a strict memory.oom_priority that outranks all > other sibling cgroups regardless of usage. We need a memory.oom_score_adj > to adjust the per-cgroup usage. The decisionmaking in your earlier > example would be under the control of C/memory.oom_score_adj and > D/memory.oom_score_adj. Problem solved. > > It also solves the problem of userspace being able to influence oom victim > selection so now they can protect important cgroups just like we can > protect important processes today. > > And since this would be hierarchical usage, you can trivially infer root > mem cgroup usage by subtraction of top-level mem cgroup usage. > > This is a powerful solution to the problem and gives userspace the control > they need so that it can work in all usecases, not a subset of usecases. You're right that per-cgroup oom_score_adj may resolve the issue with too strict semantics of oom_priorities. But I believe nobody likes the existing per-process oom_score_adj interface, and there are reasons behind. 
Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj will work is not trivial. For example, earlier in this thread I've shown an example where the decision of which of two processes should be killed depends on whether it's a global or memcg-wide oom, despite both belonging to a single cgroup! Of course, it's technically trivial to implement some analog of oom_score_adj for cgroups (and early versions of this patchset did that). But the right question is: is this an interface we want to support for the next many years? I'm not sure.
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org> @ 2017-10-13 21:31 ` David Rientjes 0 siblings, 0 replies; 27+ messages in thread From: David Rientjes @ 2017-10-13 21:31 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, 13 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > > on priorities of that process and still oom kill if there's a memory leak. > > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer > > after all. We don't need a strict memory.oom_priority that outranks all > > other sibling cgroups regardless of usage. We need a memory.oom_score_adj > > to adjust the per-cgroup usage. The decisionmaking in your earlier > > example would be under the control of C/memory.oom_score_adj and > > D/memory.oom_score_adj. Problem solved. > > > > It also solves the problem of userspace being able to influence oom victim > > selection so now they can protect important cgroups just like we can > > protect important processes today. > > > > And since this would be hierarchical usage, you can trivially infer root > > mem cgroup usage by subtraction of top-level mem cgroup usage. > > > > This is a powerful solution to the problem and gives userspace the control > > they need so that it can work in all usecases, not a subset of usecases. > > You're right that per-cgroup oom_score_adj may resolve the issue with > too strict semantics of oom_priorities. But I believe nobody likes > the existing per-process oom_score_adj interface, and there are reasons behind. 
The previous heuristic before I rewrote the oom killer used /proc/pid/oom_adj which acted as a bitshift on mm->total_vm, which was a much more difficult interface to use as I'm sure you can imagine. People ended up only using it to polarize selection: either -17 to oom disable a process, -16 to bias against it, and 15 to prefer it. Nobody used anything in between and I worked with openssh, udev, kde, and chromium to get a consensus on the oom_score_adj semantics. People do use it to protect against memory leaks and to prevent oom killing important processes when something else can be sacrificed, unless there's a leak. > Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj > will work is not trivial. I suggest defining it in the terms used for previous iterations of the patchset: do hierarchical scoring so that each level of the hierarchy has usage information for each subtree. You can get root mem cgroup usage with complete fairness by subtraction with this method. When comparing usage at each level of the hierarchy, you can propagate the eligibility of processes in that subtree much like you do today. I agree with your change to make the oom killer a no-op if selection races with the actual killing rather than falling back to the old heuristic. I'm happy to help add a Tested-by once we settle the other issues with that change. At each level, I would state that memory.oom_score_adj has the exact same semantics as /proc/pid/oom_score_adj. In this case, it would simply be defined as a proportion of the parent's limit. If the hierarchy is iterated starting at the root mem cgroup for system ooms and at the root of the oom memcg for memcg ooms, this should lead to the exact same oom killing behavior, which is desired. 
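The pre-rewrite /proc/pid/oom_adj behavior recalled above worked roughly like this; it's a simplified sketch of the old bit-shift semantics with the heuristic's other factors omitted:

```python
# Old-style oom_adj: the value acted as a bit shift on the task's
# total_vm, with -17 (OOM_DISABLE) exempting the task entirely.
# Simplified; the real badness() applied further adjustments.
OOM_DISABLE = -17

def old_badness(total_vm_pages, oom_adj):
    if oom_adj == OOM_DISABLE:
        return 0
    points = total_vm_pages
    if oom_adj > 0:
        points <<= oom_adj   # positive values prefer the task
    else:
        points >>= -oom_adj  # negative values bias against killing it
    return points
```

Because each step doubles or halves the score, a shift of 15 changes it by a factor of 32768, which is why users ended up polarizing the knob (-17, -16, or 15) rather than using intermediate values.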
This solution would address the three concerns that I had: it allows the root mem cgroup to be compared fairly with leaf mem cgroups (with the bonus of not requiring root mem cgroup accounting thanks to your heuristic using global vmstats), it allows userspace to influence the decisionmaking so that users can protect cgroups that use 50% of memory because they are important, and it completely avoids users being able to change victim selection simply by creating child mem cgroups. This would be a very powerful patchset.
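The "infer root usage by subtraction" idea that recurs in this exchange amounts to the following; the cgroup names and numbers are invented placeholders, not actual vmstat fields:

```python
# With hierarchical accounting, every top-level memcg knows its whole
# subtree's usage, so usage attributable to processes attached directly
# to the root falls out of global counters by subtraction.
# Illustrative numbers in MB.
global_usage_mb = 4096
top_level_usage_mb = {"system.slice": 1024, "user.slice": 2048}

root_only_mb = global_usage_mb - sum(top_level_usage_mb.values())
```

No separate root accounting is needed: the root's share is whatever the global counters report beyond the top-level subtrees.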
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 21:13 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-11 13:08 ` Michal Hocko 2017-10-11 20:27 ` David Rientjes 2017-10-11 16:10 ` Roman Gushchin 2 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2017-10-11 13:08 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue 10-10-17 14:13:00, David Rientjes wrote: [...] > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or unpurposefully, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes <rientjes@google.com> I consider this NACK rather dubious. Evading the heuristic as you describe requires root privileges in default configuration because normal users are not allowed to create subtrees. If you really want to delegate subtree to an untrusted entity then you do not have to opt-in for this oom strategy. We can work on an additional means which would allow to cover those as well (e.g. priority based one which is requested for other usecases). A similar argument applies to the root memcg evaluation. While the proposed behavior is not optimal it would work for general usecase described here where the root memcg doesn't really run any large number of tasks. If somebody who explicitly opts-in for the new strategy and it doesn't work well for that usecase we can enhance the behavior. That alone is not a reason to nack the whole thing. 
I find it really disturbing that you keep nacking this approach just because it doesn't suit your specific usecase while it doesn't break it. Moreover it has been stated several times already that future improvements are possible and cover what you have described already. -- Michal Hocko SUSE Labs
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-11 13:08 ` Michal Hocko @ 2017-10-11 20:27 ` David Rientjes 2017-10-12 6:33 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: David Rientjes @ 2017-10-11 20:27 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Wed, 11 Oct 2017, Michal Hocko wrote: > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes <rientjes@google.com> > > I consider this NACK rather dubious. Evading the heuristic as you > describe requires root privileges in default configuration because > normal users are not allowed to create subtrees. If you > really want to delegate subtree to an untrusted entity then you do not > have to opt-in for this oom strategy. We can work on an additional means > which would allow to cover those as well (e.g. priority based one which > is requested for other usecases). > You're missing the point that the user is trusted and it may be doing something to circumvent oom kill unknowingly. With a single unified hierarchy, the user is forced to attach its processes to subcontainers if it wants to constrain resources with other controllers. Doing so ends up completely avoiding oom kill because of this implementation detail. It has nothing to do with trust and the admin who is opting-in will not know a user has circumvented oom kill purely because it constrains its processes with controllers other than the memory controller. > A similar argument applies to the root memcg evaluation.
While the > proposed behavior is not optimal it would work for general usecase > described here where the root memcg doesn't really run any large number > of tasks. If somebody who explicitly opts-in for the new strategy and it > doesn't work well for that usecase we can enhance the behavior. That > alone is not a reason to nack the whole thing. > > I find it really disturbing that you keep nacking this approach just > because it doesn't suite your specific usecase while it doesn't break > it. Moreover it has been stated several times already that future > improvements are possible and cover what you have described already. This has nothing to do with my specific usecase.
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-11 20:27 ` David Rientjes @ 2017-10-12 6:33 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-12 6:33 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Wed 11-10-17 13:27:44, David Rientjes wrote: > On Wed, 11 Oct 2017, Michal Hocko wrote: > > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > > against that mem cgroup from oom kill in system oom conditions, the > > > ability of users to completely evade the oom killer by attaching all > > > processes to child cgroups either purposefully or unpurposefully, and the > > > inability of userspace to effectively control oom victim selection: > > > > > > Nacked-by: David Rientjes <rientjes@google.com> > > > > I consider this NACK rather dubious. Evading the heuristic as you > > describe requires root privileges in default configuration because > > normal users are not allowed to create subtrees. If you > > really want to delegate subtree to an untrusted entity then you do not > > have to opt-in for this oom strategy. We can work on an additional means > > which would allow to cover those as well (e.g. priority based one which > > is requested for other usecases). > > > > You're missing the point that the user is trusted and it may be doing > something to circumvent oom kill unknowingly. I would really like to see a practical example of something like that. I am not saying this is completely impossible but as already pointed out this _can_ be addressed _on top_ of the current implementation. We will need some way to consider hierarchies anyway. So I really fail to see why this would be a blocker. After all it is no different than skipping oom selection by splitting a process (knowingly or otherwise) into subprocesses which is possible even now. 
OOM killer selection has never been, will not be and cannot be perfect in principle. Quite the contrary, the more clever the heuristics try to be, the more corner cases they might generate, as we could see in the past. > With a single unified > hierarchy, the user is forced to attach its processes to subcontainers if > it wants to constrain resources with other controllers. Doing so ends up > completely avoiding oom kill because of this implementation detail. It > has nothing to do with trust and the admin who is opting-in will not know > a user has cirumvented oom kill purely because it constrains its processes > with controllers other than the memory controller. > > > A similar argument applies to the root memcg evaluation. While the > > proposed behavior is not optimal it would work for general usecase > > described here where the root memcg doesn't really run any large number > > of tasks. If somebody who explicitly opts-in for the new strategy and it > > doesn't work well for that usecase we can enhance the behavior. That > > alone is not a reason to nack the whole thing. > > > > I find it really disturbing that you keep nacking this approach just > > because it doesn't suite your specific usecase while it doesn't break > > it. Moreover it has been stated several times already that future > > improvements are possible and cover what you have described already. > > This has nothing to do with my specific usecase. Well, I might be really wrong but it is hard not to notice how most of your complaints push towards hierarchical level-by-level comparisons. Which has been considered and deemed unsuitable for the default cgroup aware oom selection because it imposes structural constraints on how the hierarchy is organized and thus disallows many usecases. So pushing for this just because it resembles your current inhouse implementation leaves me with a feeling that you care more about your usecase than general usability.
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
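The fairness concern being argued above can be made concrete with a small model. The sketch below is purely illustrative Python, not kernel code: it models the v11 behavior under discussion, where the root memcg's score is approximated by the largest single task while leaf memcgs are scored by their total charged memory (function names and units are invented for the example).

```python
# Toy model of the disputed comparison: root memcg scored by its
# biggest single task vs leaf memcgs scored by total charged memory.

def root_score_v11(task_rss_list):
    """v11 approximation: root memcg score ~ largest single task's rss."""
    return max(task_rss_list, default=0)

def leaf_score(task_rss_list):
    """Leaf memcg score ~ sum of all memory charged by its tasks."""
    return sum(task_rss_list)

# David's example: many 80MB tasks in the root memcg vs a leaf memcg
# holding 100 tasks of 1MB each.
root_tasks = [80] * 100   # MB each
leaf_tasks = [1] * 100    # MB each

scores = {
    "root": root_score_v11(root_tasks),  # 80
    "leaf": leaf_score(leaf_tasks),      # 100
}
victim = max(scores, key=scores.get)
# The leaf wins (100 > 80) even though the root's tasks use 8000MB in
# total -- the "apples vs oranges" comparison David objects to.
print(victim)
```

Under this model the leaf memcg is repeatedly selected no matter how much memory the root's tasks consume, which is exactly the behavior the v12 re-spin below tries to address.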
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 21:13 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 2017-10-11 13:08 ` Michal Hocko @ 2017-10-11 16:10 ` Roman Gushchin 2 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-11 16:10 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > This seems to unfairly bias the root mem cgroup depending on process size. > > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > > based on different criteria: the root mem cgroup as (mostly) the largest > > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > > unreclaimable slab pages charged to it by all processes. > > > > > > I imagine a configuration where the root mem cgroup has 100 processes > > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > > killing the processes of 1MB rss? > > > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > > quite fair, it can simply hide large processes from being selected. Users > > > who configure cgroups in a unified hierarchy for other resource > > > constraints are penalized for this choice even though the mem cgroup with > > > 100 processes of 1MB rss each may not be limited itself. > > > > > > I think for this comparison to be fair, it requires accounting for the > > > root mem cgroup itself or for a different accounting methodology for leaf > > > memory cgroups. > > > > This is basically a workaround, because we don't have necessary stats for root > > memory cgroup. 
If we'll start gathering them at some point, we can change this > > and treat root memcg exactly as other leaf cgroups. > > > > I understand why it currently cannot be an apples vs apples comparison > without, as I suggest in the last paragraph, that the same accounting is > done for the root mem cgroup, which is intuitive if it is to be considered > on the same basis as leaf mem cgroups. > > I understand for the design to work that leaf mem cgroups and the root mem > cgroup must be compared if processes can be attached to the root mem > cgroup. My point is that it is currently completely unfair as I've > stated: you can have 10000 processes attached to the root mem cgroup with > rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each > and the oom killer is going to target the leaf mem cgroup as a result of > this apples vs oranges comparison. > > In case it's not clear, the 10000 processes of 80MB rss each is the most > likely contributor to a system-wide oom kill. Unfortunately, the > heuristic introduced by this patchset is broken wrt a fair comparison of > the root mem cgroup usage. > > > Or, if someone will come with an idea of a better approximation, it can be > > implemented as a separate enhancement on top of the initial implementation. > > This is more than welcome. > > > > We don't need a better approximation, we need a fair comparison. The > heuristic that this patchset is implementing is based on the usage of > individual mem cgroups. For the root mem cgroup to be considered > eligible, we need to understand its usage. That usage is _not_ what is > implemented by this patchset, which is the largest rss of a single > attached process. This, in fact, is not an "approximation" at all. In > the example of 10000 processes attached with 80MB rss each, the usage of > the root mem cgroup is _not_ 80MB. > > I'll restate that oom killing a process is a last resort for the kernel, > but it also must be able to make a smart decision. 
Targeting dozens of > 1MB processes instead of 80MB processes because of a shortcoming in this > implementation is not the appropriate selection, it's the opposite of the > correct selection. > > > > I'll reiterate what I did on the last version of the patchset: considering > > > only leaf memory cgroups easily allows users to defeat this heuristic and > > > bias against all of their memory usage up to the largest process size > > > amongst the set of processes attached. If the user creates N child mem > > > cgroups for their N processes and attaches one process to each child, the > > > _only_ thing this achieved is to defeat your heuristic and prefer other > > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > > > Effectively: > > > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > > > will radically shift the heuristic from a score of all anonymous + > > > unevictable memory for all processes to a score of the largest anonymous + > > > unevictable memory for a single process. There is no downside or > > > ramifaction for the end user in doing this. When comparing cgroups based > > > on usage, it only makes sense to compare the hierarchical usage of that > > > cgroup so that attaching processes to descendants or splitting the > > > implementation of a process into several smaller individual processes does > > > not allow this heuristic to be defeated. > > > > To all previously said words I can only add that cgroup v2 allows to limit > > the amount of cgroups in the sub-tree: > > 1a926e0bbab8 ("cgroup: implement hierarchy limits"). > > > > So the solution to > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > evading all oom kills for your mem cgroup is to limit the number of > cgroups that can be created by the user? 
With a unified cgroup hierarchy, > that doesn't work well if I wanted to actually constrain these individual > processes to different resource limits like cpu usage. In fact, the user > may not know it is effectively evading the oom killer entirely because it > has constrained the cpu of individual processes because it's a side-effect > of this heuristic. > > > You chose not to respond to my reiteration of userspace having absolutely > no control over victim selection with the new heuristic without setting > all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a > very important job that is running on a system that is really supposed to > use 80% of memory, I need to be able to specify that it should not be oom > killed based on user goals. Setting all processes to be oom disabled in > the important mem cgroup to avoid being oom killed unless absolutely > necessary in a system oom condition is not a robust solution: (1) the mem > cgroup livelocks if it reaches its own mem cgroup limit and (2) the system > panic()'s if these preferred mem cgroups are the only consumers left on > the system. With overcommit, both of these possibilities exist in the > wild and the problem is only a result of the implementation detail of this > patchset. > > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or inadvertently, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes <rientjes@google.com> Hi David! Do you find the following approach (summing oom_score of tasks belonging to the root memory cgroup) acceptable? Also, I've closed the race you've pointed out. Thanks! 
-------------------------------------------------------------------------------- From 7f51d26be2d2a5b6e4840574f72beb15920c0993 Mon Sep 17 00:00:00 2001 From: Roman Gushchin <guro@fb.com> Date: Thu, 25 May 2017 14:18:45 +0100 Subject: [v12 3/6] mm, oom: cgroup-aware OOM killer Traditionally, the OOM killer operates on a process level. Under oom conditions, it finds the process with the highest oom score and kills it. This behavior doesn't suit systems with many running containers well: 1) There is no fairness between containers. A small container with a few large processes will be chosen over a large one with a huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases a much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in the case of a system-wide OOM. To address these issues, the cgroup-aware OOM killer is introduced. This patch introduces the core functionality: the ability to select a memory cgroup as an OOM victim. Under OOM conditions the OOM killer looks for the biggest leaf memory cgroup and kills the biggest task belonging to it. The following patches will extend this functionality to consider non-leaf memory cgroups as OOM victims, and also provide the ability to kill all tasks belonging to the victim cgroup. The root cgroup is treated as a leaf memory cgroup, so its score is compared with those of other leaf memory cgroups. Due to the memcg statistics implementation, a special approximation is used for estimating the oom_score of the root memory cgroup: we sum the oom_score of the belonging processes (or, to be more precise, of the tasks owning their mm structures). 
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 17 +++++ include/linux/oom.h | 12 ++- mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 72 +++++++++++++----- 4 files changed, 262 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..75b63b68846e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -936,6 +948,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 76aac4ce39bc..ca78e2d5956e 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -9,6 +9,13 @@ #include <linux/sched/coredump.h> /* MMF_* */ #include <linux/mm.h> /* VM_FAULT* */ + +/* + * Special value returned by victim selection functions to indicate + * that there are inflight OOM victims. 
+ */ +#define INFLIGHT_VICTIM ((void *)-1UL) + struct zonelist; struct notifier_block; struct mem_cgroup; @@ -39,7 +46,8 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; - struct task_struct *chosen; + struct task_struct *chosen_task; + struct mem_cgroup *chosen_memcg; unsigned long chosen_points; }; @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df3368734f1c..8f04e1fb9dd9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2670,6 +2670,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long memcg_oom_badness(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + long points = 0; + int nid; + pg_data_t *pgdat; + + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; + + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + + pgdat = NODE_DATA(nid); + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), + NR_SLAB_UNRECLAIMABLE); + } + + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + + return points; +} + +/* + * Checks if the given memcg is a valid OOM victim and returns a number, + * which means the following: + * -1: there are inflight OOM victim tasks belonging to the memcg + * 0: memcg is not eligible, e.g. 
all belonging tasks are protected + * by oom_score_adj set to OOM_SCORE_ADJ_MIN + * >0: memcg is eligible, and the returned value is an estimation + * of the memory footprint + */ +static long oom_evaluate_memcg(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + struct css_task_iter it; + struct task_struct *task; + int eligible = 0; + + /* + * Root memory cgroup is a special case: + * we don't have necessary stats to evaluate it exactly as + * leaf memory cgroups, so we approximate its oom_score + * by summing oom_score of all belonging tasks, which are + * owners of their mm structs. + * + * If there are inflight OOM victim tasks inside + * the root memcg, we return -1. + */ + if (memcg == root_mem_cgroup) { + struct css_task_iter it; + struct task_struct *task; + long score = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, + &task->signal->oom_mm->flags)) { + score = -1; + break; + } + + task_lock(task); + if (!task->mm || task->mm->owner != task) { + task_unlock(task); + continue; + } + task_unlock(task); + + score += oom_badness(task, memcg, nodemask, + totalpages); + } + css_task_iter_end(&it); + + return score; + } + + /* + * Memcg is OOM eligible if there are OOM killable tasks inside. + * + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN + * as unkillable. + * + * If there are inflight OOM victim tasks inside the memcg, + * we return -1. 
+ */ + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (!eligible && + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) + eligible = 1; + + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + eligible = -1; + break; + } + } + css_task_iter_end(&it); + + if (eligible <= 0) + return eligible; + + return memcg_oom_badness(memcg, nodemask, totalpages); +} + +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) +{ + struct mem_cgroup *iter; + + oc->chosen_memcg = NULL; + oc->chosen_points = 0; + + /* + * The oom_score is calculated for leaf memory cgroups (including + * the root memcg). + */ + rcu_read_lock(); + for_each_mem_cgroup_tree(iter, root) { + long score; + + if (memcg_has_children(iter) && iter != root_mem_cgroup) + continue; + + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); + + /* + * Ignore empty and non-eligible memory cgroups. + */ + if (score == 0) + continue; + + /* + * If there are inflight OOM victims, we don't need + * to look further for new victims. + */ + if (score == -1) { + oc->chosen_memcg = INFLIGHT_VICTIM; + mem_cgroup_iter_break(root, iter); + break; + } + + if (score > oc->chosen_points) { + oc->chosen_points = score; + oc->chosen_memcg = iter; + } + } + + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) + css_get(&oc->chosen_memcg->css); + + rcu_read_unlock(); +} + +bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + struct mem_cgroup *root; + + if (mem_cgroup_disabled()) + return false; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return false; + + if (oc->memcg) + root = oc->memcg; + else + root = root_mem_cgroup; + + select_victim_memcg(root, oc); + + return oc->chosen_memcg; +} + /* * Reclaims as many pages from the given memcg as possible. 
* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 0b9f36117989..5b670adb850c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } -static int oom_evaluate_task(struct task_struct *task, void *arg) +int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; unsigned long points; @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) goto next; /* Prefer thread group leaders for display purposes */ - if (points == oc->chosen_points && thread_group_leader(oc->chosen)) + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task)) goto next; select: - if (oc->chosen) - put_task_struct(oc->chosen); + if (oc->chosen_task) + put_task_struct(oc->chosen_task); get_task_struct(task); - oc->chosen = task; + oc->chosen_task = task; oc->chosen_points = points; next: return 0; abort: - if (oc->chosen) - put_task_struct(oc->chosen); - oc->chosen = (void *)-1UL; + if (oc->chosen_task) + put_task_struct(oc->chosen_task); + oc->chosen_task = INFLIGHT_VICTIM; return 1; } /* * Simple selection loop. We choose the process with the highest number of - * 'points'. In case scan was aborted, oc->chosen is set to -1. + * 'points'. In case scan was aborted, oc->chosen_task is set to -1. 
*/ static void select_bad_process(struct oom_control *oc) { @@ -923,7 +923,7 @@ static void __oom_kill_process(struct task_struct *victim) static void oom_kill_process(struct oom_control *oc, const char *message) { - struct task_struct *p = oc->chosen; + struct task_struct *p = oc->chosen_task; unsigned int points = oc->chosen_points; struct task_struct *victim = p; struct task_struct *child; @@ -984,6 +984,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) __oom_kill_process(victim); } +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ + + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) + return oc->chosen_memcg; + + /* Kill a task in the chosen memcg with the biggest memory footprint */ + oc->chosen_points = 0; + oc->chosen_task = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) + goto out; + + __oom_kill_process(oc->chosen_task); + +out: + mem_cgroup_put(oc->chosen_memcg); + return oc->chosen_task; +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ @@ -1036,6 +1057,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + bool delay = false; /* if set, delay next allocation attempt */ if (oom_killer_disabled) return false; @@ -1080,27 +1102,39 @@ bool out_of_memory(struct oom_control *oc) current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { get_task_struct(current); - oc->chosen = current; + oc->chosen_task = current; oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); return true; } + if (mem_cgroup_select_oom_victim(oc)) { + if (oom_kill_memcg_victim(oc)) + delay = true; + + goto out; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (oc->chosen && oc->chosen != (void *)-1UL) { + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); - /* - * Give the killed process a good chance to exit before trying - * to allocate memory again. - */ - schedule_timeout_killable(1); + delay = true; } - return !!oc->chosen; + +out: + /* + * Give the killed process a good chance to exit before trying + * to allocate memory again. + */ + if (delay) + schedule_timeout_killable(1); + + return !!oc->chosen_task; } /* -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org>]
* [v11 4/6] mm, oom: introduce memory.oom_group [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-05 13:04 ` Roman Gushchin 2017-10-05 14:29 ` Michal Hocko 2017-10-05 14:31 ` Michal Hocko 0 siblings, 2 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA The cgroup-aware OOM killer treats leaf memory cgroups as memory consumption entities and performs the victim selection by comparing them based on their memory footprint. Then it kills the biggest task inside the selected memory cgroup. But there are workloads which are not tolerant of such behavior. Killing a random task may leave the workload in a broken state. To solve this problem, the memory.oom_group knob is introduced. It defines whether a memory cgroup should be treated as an indivisible memory consumer, compared by total memory consumption with other memory consumers (leaf memory cgroups and other memory cgroups with memory.oom_group set), and whether all belonging tasks should be killed if the cgroup is selected. If set on memcg A, it means that in case of a system-wide OOM or a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks belonging to the sub-tree of A will be killed. If the OOM event is scoped to a descendant cgroup (A/B, for example), only tasks in that cgroup can be affected. The OOM killer will never touch any tasks outside of the scope of the OOM event. Also, tasks with oom_score_adj set to -1000 will not be killed because this has been a long-established way to protect a particular process from seeing an unexpected SIGKILL from the OOM killer. Ignoring this user-defined configuration might lead to data corruption or other misbehavior. 
The default value is 0. Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org> Cc: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: kernel-team-b10kYP2dOMg@public.gmane.org Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org --- include/linux/memcontrol.h | 17 +++++++++++ mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++--- mm/oom_kill.c | 49 +++++++++++++++++++++++------- 3 files changed, 127 insertions(+), 14 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 75b63b68846e..84ac10d7e67d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -200,6 +200,13 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + /* + * Treat the sub-tree as an indivisible memory consumer, + * kill all belonging tasks if the memory cgroup is selected + * as an OOM victim. 
+ */ + bool oom_group; + /* handle for "memory.events" */ struct cgroup_file events_file; @@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait); bool mem_cgroup_select_oom_victim(struct oom_control *oc); +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) +{ + return memcg->oom_group; +} + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) { return false; } + +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 191b70735f1f..d5acb278b11a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) { - struct mem_cgroup *iter; + struct mem_cgroup *iter, *group = NULL; + long group_score = 0; oc->chosen_memcg = NULL; oc->chosen_points = 0; /* + * If OOM is memcg-wide, and the memcg has the oom_group flag set, + * all tasks belonging to the memcg should be killed. + * So, we mark the memcg as a victim. + */ + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) { + oc->chosen_memcg = oc->memcg; + css_get(&oc->chosen_memcg->css); + return; + } + + /* * The oom_score is calculated for leaf memory cgroups (including * the root memcg). + * Non-leaf oom_group cgroups accumulate the score of descendant + * leaf memory cgroups. */ rcu_read_lock(); for_each_mem_cgroup_tree(iter, root) { long score; + /* + * We don't consider non-leaf non-oom_group memory cgroups + * as OOM victims. + */ + if (memcg_has_children(iter) && iter != root_mem_cgroup && + !mem_cgroup_oom_group(iter)) + continue; + + /* + * If group is not set or we've run out of the group's sub-tree, + * we should set group and reset group_score. 
+ */ + if (!group || group == root_mem_cgroup || + !mem_cgroup_is_descendant(iter, group)) { + group = iter; + group_score = 0; + } + if (memcg_has_children(iter) && iter != root_mem_cgroup) continue; @@ -2810,9 +2842,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) break; } - if (score > oc->chosen_points) { - oc->chosen_points = score; - oc->chosen_memcg = iter; + group_score += score; + + if (group_score > oc->chosen_points) { + oc->chosen_points = group_score; + oc->chosen_memcg = group; } } @@ -5437,6 +5471,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_group_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + bool oom_group = memcg->oom_group; + + seq_printf(m, "%d\n", oom_group); + + return 0; +} + +static ssize_t memory_oom_group_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_group; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_group); + if (err) + return err; + + memcg->oom_group = oom_group; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5557,6 +5618,12 @@ static struct cftype memory_files[] = { .write = memory_max_write, }, { + .name = "oom_group", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_oom_group_show, + .write = memory_oom_group_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 20e62ec32ba8..c8fbc73c4ed3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -851,6 +851,17 @@ static void __oom_kill_process(struct task_struct *victim) struct mm_struct *mm; bool can_oom_reap = true; + /* + * __oom_kill_process() is used to kill all tasks belonging to + 
* the selected memory cgroup, so we should check that we're not + * trying to kill an unkillable task. + */ + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) || + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { + put_task_struct(victim); + return; + } + p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -987,21 +998,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message) __oom_kill_process(victim); } -static bool oom_kill_memcg_victim(struct oom_control *oc) +static int oom_kill_memcg_member(struct task_struct *task, void *unused) { + get_task_struct(task); + __oom_kill_process(task); + return 0; +} +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) return oc->chosen_memcg; - /* Kill a task in the chosen memcg with the biggest memory footprint */ - oc->chosen_points = 0; - oc->chosen_task = NULL; - mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); - - if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) - goto out; - - __oom_kill_process(oc->chosen_task); + /* + * If memory.oom_group is set, kill all tasks belonging to the sub-tree + * of the chosen memory cgroup, otherwise kill the task with the biggest + * memory footprint. + */ + if (mem_cgroup_oom_group(oc->chosen_memcg)) { + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, + NULL); + /* We have one or more terminating processes at this point. */ + oc->chosen_task = INFLIGHT_VICTIM; + } else { + oc->chosen_points = 0; + oc->chosen_task = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + + if (oc->chosen_task == NULL || + oc->chosen_task == INFLIGHT_VICTIM) + goto out; + + __oom_kill_process(oc->chosen_task); + } out: mem_cgroup_put(oc->chosen_memcg); -- 2.13.6 ^ permalink raw reply related [flat|nested] 27+ messages in thread
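The memory.oom_group semantics added by this patch can be sketched compactly: a cgroup with oom_group set competes as one indivisible consumer (its score accumulating over its leaves), and if it is selected, every task in its sub-tree is killed except those protected by oom_score_adj -1000. The Python below is a hypothetical model for illustration only; the data structures are invented and do not correspond to kernel types.

```python
# Toy model of memory.oom_group: the group is scored as a whole, and a
# winning group has all of its tasks killed except -1000-adjusted ones.

OOM_SCORE_ADJ_MIN = -1000

def group_score(group):
    """oom_group cgroups accumulate the scores of their leaf memcgs."""
    return sum(leaf["score"] for leaf in group["leaves"])

def kill_group(group):
    """Return the tasks that would receive SIGKILL; -1000 tasks are spared."""
    return [t for leaf in group["leaves"] for t in leaf["tasks"]
            if t["oom_score_adj"] != OOM_SCORE_ADJ_MIN]

workload = {
    "oom_group": True,
    "leaves": [
        {"score": 60, "tasks": [{"pid": 1001, "oom_score_adj": 0}]},
        {"score": 40, "tasks": [{"pid": 1002, "oom_score_adj": 0},
                                {"pid": 1003, "oom_score_adj": -1000}]},
    ],
}

print(group_score(workload))                      # compared as a whole: 100
print([t["pid"] for t in kill_group(workload)])   # pid 1003 is spared
```

This mirrors the two behaviors in the commit message: victim selection by total sub-tree consumption, and the oom_score_adj -1000 escape hatch enforced by the unkillable-task check added to __oom_kill_process().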
* Re: [v11 4/6] mm, oom: introduce memory.oom_group 2017-10-05 13:04 ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin @ 2017-10-05 14:29 ` Michal Hocko 2017-10-05 14:31 ` Michal Hocko 1 sibling, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-05 14:29 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Thu 05-10-17 14:04:52, Roman Gushchin wrote: > The cgroup-aware OOM killer treats leaf memory cgroups as memory > consumption entities and performs the victim selection by comparing > them based on their memory footprint. Then it kills the biggest task > inside the selected memory cgroup. > > But there are workloads, which are not tolerant to a such behavior. > Killing a random task may leave the workload in a broken state. > > To solve this problem, memory.oom_group knob is introduced. > It will define, whether a memory group should be treated as an > indivisible memory consumer, compared by total memory consumption > with other memory consumers (leaf memory cgroups and other memory > cgroups with memory.oom_group set), and whether all belonging tasks > should be killed if the cgroup is selected. > > If set on memcg A, it means that in case of system-wide OOM or > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks, > belonging to the sub-tree of A will be killed. If OOM event is > scoped to a descendant cgroup (A/B, for example), only tasks in > that cgroup can be affected. OOM killer will never touch any tasks > outside of the scope of the OOM event. > > Also, tasks with oom_score_adj set to -1000 will not be killed because > this has been a long established way to protect a particular process > from seeing an unexpected SIGKILL from the OOM killer. Ignoring this > user defined configuration might lead to data corruptions or other > misbehavior. > > The default value is 0. 
I still believe that oc->chosen_task == INFLIGHT_VICTIM check in oom_kill_memcg_victim should go away. > > Signed-off-by: Roman Gushchin <guro@fb.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: kernel-team@fb.com > Cc: cgroups@vger.kernel.org > Cc: linux-doc@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org Acked-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/memcontrol.h | 17 +++++++++++ > mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++--- > mm/oom_kill.c | 49 +++++++++++++++++++++++------- > 3 files changed, 127 insertions(+), 14 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 75b63b68846e..84ac10d7e67d 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -200,6 +200,13 @@ struct mem_cgroup { > /* OOM-Killer disable */ > int oom_kill_disable; > > + /* > + * Treat the sub-tree as an indivisible memory consumer, > + * kill all belonging tasks if the memory cgroup selected > + * as OOM victim. 
> + */ > + bool oom_group; > + > /* handle for "memory.events" */ > struct cgroup_file events_file; > > @@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait); > > bool mem_cgroup_select_oom_victim(struct oom_control *oc); > > +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) > +{ > + return memcg->oom_group; > +} > + > #ifdef CONFIG_MEMCG_SWAP > extern int do_swap_account; > #endif > @@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) > { > return false; > } > + > +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) > +{ > + return false; > +} > #endif /* CONFIG_MEMCG */ > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 191b70735f1f..d5acb278b11a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, > > static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > { > - struct mem_cgroup *iter; > + struct mem_cgroup *iter, *group = NULL; > + long group_score = 0; > > oc->chosen_memcg = NULL; > oc->chosen_points = 0; > > /* > + * If OOM is memcg-wide, and the memcg has the oom_group flag set, > + * all tasks belonging to the memcg should be killed. > + * So, we mark the memcg as a victim. > + */ > + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) { > + oc->chosen_memcg = oc->memcg; > + css_get(&oc->chosen_memcg->css); > + return; > + } > + > + /* > * The oom_score is calculated for leaf memory cgroups (including > * the root memcg). > + * Non-leaf oom_group cgroups accumulating score of descendant > + * leaf memory cgroups. > */ > rcu_read_lock(); > for_each_mem_cgroup_tree(iter, root) { > long score; > > + /* > + * We don't consider non-leaf non-oom_group memory cgroups > + * as OOM victims. 
> + */ > + if (memcg_has_children(iter) && iter != root_mem_cgroup && > + !mem_cgroup_oom_group(iter)) > + continue; > + > + /* > + * If group is not set or we've ran out of the group's sub-tree, > + * we should set group and reset group_score. > + */ > + if (!group || group == root_mem_cgroup || > + !mem_cgroup_is_descendant(iter, group)) { > + group = iter; > + group_score = 0; > + } > + > if (memcg_has_children(iter) && iter != root_mem_cgroup) > continue; > > @@ -2810,9 +2842,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > break; > } > > - if (score > oc->chosen_points) { > - oc->chosen_points = score; > - oc->chosen_memcg = iter; > + group_score += score; > + > + if (group_score > oc->chosen_points) { > + oc->chosen_points = group_score; > + oc->chosen_memcg = group; > } > } > > @@ -5437,6 +5471,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > return nbytes; > } > > +static int memory_oom_group_show(struct seq_file *m, void *v) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > + bool oom_group = memcg->oom_group; > + > + seq_printf(m, "%d\n", oom_group); > + > + return 0; > +} > + > +static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > + char *buf, size_t nbytes, > + loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + int oom_group; > + int err; > + > + err = kstrtoint(strstrip(buf), 0, &oom_group); > + if (err) > + return err; > + > + memcg->oom_group = oom_group; > + > + return nbytes; > +} > + > static int memory_events_show(struct seq_file *m, void *v) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > @@ -5557,6 +5618,12 @@ static struct cftype memory_files[] = { > .write = memory_max_write, > }, > { > + .name = "oom_group", > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > + .seq_show = memory_oom_group_show, > + .write = memory_oom_group_write, > + }, > + { > .name = "events", > .flags = 
CFTYPE_NOT_ON_ROOT, > .file_offset = offsetof(struct mem_cgroup, events_file), > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 20e62ec32ba8..c8fbc73c4ed3 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -851,6 +851,17 @@ static void __oom_kill_process(struct task_struct *victim) > struct mm_struct *mm; > bool can_oom_reap = true; > > + /* > + * __oom_kill_process() is used to kill all tasks belonging to > + * the selected memory cgroup, so we should check that we're not > + * trying to kill an unkillable task. > + */ > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) || > + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { > + put_task_struct(victim); > + return; > + } > + > p = find_lock_task_mm(victim); > if (!p) { > put_task_struct(victim); > @@ -987,21 +998,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > __oom_kill_process(victim); > } > > -static bool oom_kill_memcg_victim(struct oom_control *oc) > +static int oom_kill_memcg_member(struct task_struct *task, void *unused) > { > + get_task_struct(task); > + __oom_kill_process(task); > + return 0; > +} > > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) > return oc->chosen_memcg; > > - /* Kill a task in the chosen memcg with the biggest memory footprint */ > - oc->chosen_points = 0; > - oc->chosen_task = NULL; > - mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > - > - if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) > - goto out; > - > - __oom_kill_process(oc->chosen_task); > + /* > + * If memory.oom_group is set, kill all tasks belonging to the sub-tree > + * of the chosen memory cgroup, otherwise kill the task with the biggest > + * memory footprint. 
> + */ > + if (mem_cgroup_oom_group(oc->chosen_memcg)) { > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, > + NULL); > + /* We have one or more terminating processes at this point. */ > + oc->chosen_task = INFLIGHT_VICTIM; > + } else { > + oc->chosen_points = 0; > + oc->chosen_task = NULL; > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > + > + if (oc->chosen_task == NULL || > + oc->chosen_task == INFLIGHT_VICTIM) > + goto out; > + > + __oom_kill_process(oc->chosen_task); > + } > > out: > mem_cgroup_put(oc->chosen_memcg); > -- > 2.13.6 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [v11 4/6] mm, oom: introduce memory.oom_group 2017-10-05 13:04 ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin 2017-10-05 14:29 ` Michal Hocko @ 2017-10-05 14:31 ` Michal Hocko [not found] ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Michal Hocko @ 2017-10-05 14:31 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Btw. here is how I would do the recursive oom badness. The diff is not the nicest one because there is some code moving but the resulting code is smaller and imho easier to grasp. Only compile tested though --- diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 085056e562b1..9cdba4682198 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -122,6 +122,11 @@ void cgroup_free(struct task_struct *p); int cgroup_init_early(void); int cgroup_init(void); +static bool cgroup_has_tasks(struct cgroup *cgrp) +{ + return cgrp->nr_populated_csets; +} + /* * Iteration helpers and macros. 
*/ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8dacf73ad57e..a2dd7e3ffe23 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -319,11 +319,6 @@ static void cgroup_idr_remove(struct idr *idr, int id) spin_unlock_bh(&cgroup_idr_lock); } -static bool cgroup_has_tasks(struct cgroup *cgrp) -{ - return cgrp->nr_populated_csets; -} - bool cgroup_is_threaded(struct cgroup *cgrp) { return cgrp->dom_cgrp != cgrp; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b3848bce4c86..012b2216266f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2671,59 +2671,63 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) } static long memcg_oom_badness(struct mem_cgroup *memcg, - const nodemask_t *nodemask, - unsigned long totalpages) + const nodemask_t *nodemask) { + struct mem_cgroup *iter; + struct css_task_iter it; + struct task_struct *task; long points = 0; + int eligible = 0; int nid; pg_data_t *pgdat; - /* - * We don't have necessary stats for the root memcg, - * so we define it's oom_score as the maximum oom_score - * of the belonging tasks. - * - * As tasks in the root memcg unlikely are parts of a - * single workload, and we don't have to implement - * group killing, this approximation is reasonable. - * - * But if we will have necessary stats for the root memcg, - * we might switch to the approach which is used for all - * other memcgs. - */ - if (memcg == root_mem_cgroup) { - struct css_task_iter it; - struct task_struct *task; - long score, max_score = 0; - + for_each_mem_cgroup_tree(iter, memcg) { + /* + * Memcg is OOM eligible if there are OOM killable tasks inside. + * + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN + * as unkillable. + * + * If there are inflight OOM victim tasks inside the memcg, + * we return -1. 
+ */ css_task_iter_start(&memcg->css, 0, &it); while ((task = css_task_iter_next(&it))) { - score = oom_badness(task, memcg, nodemask, - totalpages); - if (score > max_score) - max_score = score; + if (!eligible && + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) + eligible = 1; + + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + eligible = -1; + break; + } } css_task_iter_end(&it); - return max_score; - } + if (eligible <= 0) { + mem_cgroup_iter_break(memcg, iter); + points = -1; + break; + } - for_each_node_state(nid, N_MEMORY) { - if (nodemask && !node_isset(nid, *nodemask)) - continue; + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; - points += mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); - pgdat = NODE_DATA(nid); - points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), - NR_SLAB_UNRECLAIMABLE); - } + pgdat = NODE_DATA(nid); + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), + NR_SLAB_UNRECLAIMABLE); + } - points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / - (PAGE_SIZE / 1024); - points += memcg_page_state(memcg, MEMCG_SOCK); - points += memcg_page_state(memcg, MEMCG_SWAP); + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + } return points; } @@ -2741,43 +2745,56 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages) { - struct css_task_iter it; - struct task_struct *task; - int eligible = 0; - /* - * Memcg is OOM eligible if there are OOM killable tasks inside. + * We don't have necessary stats for the root memcg, + * so we define it's oom_score as the maximum oom_score + * of the belonging tasks. 
* - * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN - * as unkillable. + * As tasks in the root memcg unlikely are parts of a + * single workload, and we don't have to implement + * group killing, this approximation is reasonable. * - * If there are inflight OOM victim tasks inside the memcg, - * we return -1. + * But if we will have necessary stats for the root memcg, + * we might switch to the approach which is used for all + * other memcgs. */ - css_task_iter_start(&memcg->css, 0, &it); - while ((task = css_task_iter_next(&it))) { - if (!eligible && - task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) - eligible = 1; - - if (tsk_is_oom_victim(task) && - !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { - eligible = -1; - break; + if (memcg == root_mem_cgroup) { + struct css_task_iter it; + struct task_struct *task; + long score, max_score = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + max_score = -1; + break; + } + score = oom_badness(task, memcg, nodemask, + totalpages); + if (score > max_score) + max_score = score; } - } - css_task_iter_end(&it); + css_task_iter_end(&it); - if (eligible <= 0) - return eligible; + return max_score; + } - return memcg_oom_badness(memcg, nodemask, totalpages); + return memcg_oom_badness(memcg, nodemask); } +static bool memcg_is_oom_eligible(struct mem_cgroup *memcg) +{ + if (mem_cgroup_oom_group(memcg)) + return true; + if (cgroup_has_tasks(memcg->css.cgroup)) + return true; + + return false; +} static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) { - struct mem_cgroup *iter, *group = NULL; - long group_score = 0; + struct mem_cgroup *iter; oc->chosen_memcg = NULL; oc->chosen_points = 0; @@ -2803,35 +2820,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) for_each_mem_cgroup_tree(iter, root) { long score; - 
/* - * We don't consider non-leaf non-oom_group memory cgroups - * as OOM victims. - */ - if (memcg_has_children(iter) && iter != root_mem_cgroup && - !mem_cgroup_oom_group(iter)) - continue; - - /* - * If group is not set or we've ran out of the group's sub-tree, - * we should set group and reset group_score. - */ - if (!group || group == root_mem_cgroup || - !mem_cgroup_is_descendant(iter, group)) { - group = iter; - group_score = 0; - } - - if (memcg_has_children(iter) && iter != root_mem_cgroup) + if (!memcg_is_oom_eligible(iter)) continue; score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); - /* - * Ignore empty and non-eligible memory cgroups. - */ - if (score == 0) - continue; - /* * If there are inflight OOM victims, we don't need * to look further for new victims. @@ -2842,11 +2835,9 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) break; } - group_score += score; - - if (group_score > oc->chosen_points) { - oc->chosen_points = group_score; - oc->chosen_memcg = group; + if (score > oc->chosen_points) { + oc->chosen_points = score; + oc->chosen_memcg = iter; } } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [v11 4/6] mm, oom: introduce memory.oom_group [not found] ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2017-10-06 12:04 ` Roman Gushchin 2017-10-06 12:17 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-06 12:04 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, Oct 05, 2017 at 04:31:04PM +0200, Michal Hocko wrote: > Btw. here is how I would do the recursive oom badness. The diff is not > the nicest one because there is some code moving but the resulting code > is smaller and imho easier to grasp. Only compile tested though Thanks! I'm not against this approach, and maybe it can lead to better code, but the version you sent is just not there yet. There are some problems with it: 1) If there are nested cgroups with oom_group set, the badness will be calculated multiple times, relying on the fact that the top memcg ends up with the largest score. It can be optimized, of course, but that's additional code. 2) cgroup_has_tasks() probably requires additional locking. Maybe it's ok to read nr_populated_csets without explicit locking, but it's not obvious to me. 3) Returning -1 from memcg_oom_badness() if eligible is equal to 0 is suspicious. Right now your version has exactly the same amount of code (skipping comments). I assume this approach just requires some additional thinking/rework. Anyway, thank you for sharing this! 
> --- > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 085056e562b1..9cdba4682198 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -122,6 +122,11 @@ void cgroup_free(struct task_struct *p); > int cgroup_init_early(void); > int cgroup_init(void); > > +static bool cgroup_has_tasks(struct cgroup *cgrp) > +{ > + return cgrp->nr_populated_csets; > +} > + > /* > * Iteration helpers and macros. > */ > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 8dacf73ad57e..a2dd7e3ffe23 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -319,11 +319,6 @@ static void cgroup_idr_remove(struct idr *idr, int id) > spin_unlock_bh(&cgroup_idr_lock); > } > > -static bool cgroup_has_tasks(struct cgroup *cgrp) > -{ > - return cgrp->nr_populated_csets; > -} > - > bool cgroup_is_threaded(struct cgroup *cgrp) > { > return cgrp->dom_cgrp != cgrp; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [v11 4/6] mm, oom: introduce memory.oom_group 2017-10-06 12:04 ` Roman Gushchin @ 2017-10-06 12:17 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-06 12:17 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Fri 06-10-17 13:04:35, Roman Gushchin wrote: > On Thu, Oct 05, 2017 at 04:31:04PM +0200, Michal Hocko wrote: > > Btw. here is how I would do the recursive oom badness. The diff is not > > the nicest one because there is some code moving but the resulting code > > is smaller and imho easier to grasp. Only compile tested though > > Thanks! > > I'm not against this approach, and maybe it can lead to better code, > but the version you sent is just not there yet. > > There are some problems with it: > > 1) If there are nested cgroups with oom_group set, the badness will be > calculated multiple times, relying on the fact that the top memcg ends > up with the largest score. It can be optimized, of course, but that's > additional code. right. As I've said, we can introduce an iterator helper to skip the subtree, but I suspect it will not make much of a difference. > > 2) cgroup_has_tasks() probably requires additional locking. > Maybe it's ok to read nr_populated_csets without explicit locking, > but it's not obvious to me. I do not see why. Tasks are free to come and go, and you only know at the time you are killing. > 3) Returning -1 from memcg_oom_badness() if eligible is equal to 0 > is suspicious. I didn't spend too much time on it. I merely wanted to point out my thinking more specifically than the pseudocode posted earlier. But this should be ok, because that would mean that either all tasks are OOM_SCORE_ADJ_MIN (eligible = 0) or there is an inflight victim (eligible = -1). 
Anyway, the initialization should go inside the tree walk. > Right now your version has exactly the same amount of code > (skipping comments). I assume this approach just requires some additional > thinking/rework. Well, this is not about the amount of code but more about the clear logic implemented at the correct level. It is simply much easier to evaluate the killable entity in one place rather than open-code it. But, as I've said, this is nothing I would want to enforce. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin ` (3 preceding siblings ...) [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-05 13:04 ` Roman Gushchin 2017-10-05 13:04 ` [v11 6/6] mm, oom, docs: describe the " Roman Gushchin 5 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware OOM killer. If not set, the OOM selection is performed in a "traditional" per-process way. The behavior can be changed dynamically by remounting the cgroupfs. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/cgroup-defs.h | 5 +++++ kernel/cgroup/cgroup.c | 10 ++++++++++ mm/memcontrol.c | 3 +++ 3 files changed, 18 insertions(+) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 3e55bbd31ad1..cae5343a8b21 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -80,6 +80,11 @@ enum { * Enable cpuset controller in v1 cgroup to use v2 behavior. */ CGRP_ROOT_CPUSET_V2_MODE = (1 << 4), + + /* + * Enable cgroup-aware OOM killer. 
+ */ + CGRP_GROUP_OOM = (1 << 5), }; /* cftype->flags */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c3421ee0d230..8d8aa46ff930 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags) if (!strcmp(token, "nsdelegate")) { *root_flags |= CGRP_ROOT_NS_DELEGATE; continue; + } else if (!strcmp(token, "groupoom")) { + *root_flags |= CGRP_GROUP_OOM; + continue; } pr_err("cgroup2: unknown option \"%s\"\n", token); @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE; else cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE; + + if (root_flags & CGRP_GROUP_OOM) + cgrp_dfl_root.flags |= CGRP_GROUP_OOM; + else + cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM; } } @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root { if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE) seq_puts(seq, ",nsdelegate"); + if (cgrp_dfl_root.flags & CGRP_GROUP_OOM) + seq_puts(seq, ",groupoom"); return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d5acb278b11a..fe6155d827c1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2866,6 +2866,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return false; + if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM)) + return false; + if (oc->memcg) root = oc->memcg; else -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
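For context, here is a sketch of how the new option would be exercised from a shell. The mount point is an assumption (any cgroup2 mount works), and nothing below is taken from the patch itself:

```shell
# Hypothetical usage sketch -- requires root and a cgroup2 mount.
# Enable the cgroup-aware OOM killer on a running system:
mount -o remount,groupoom -t cgroup2 none /sys/fs/cgroup

# The option is reflected in the mount flags:
grep cgroup2 /proc/mounts

# Remounting without the option restores per-process selection:
mount -o remount -t cgroup2 none /sys/fs/cgroup
```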
* [v11 6/6] mm, oom, docs: describe the cgroup-aware OOM killer 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin ` (4 preceding siblings ...) 2017-10-05 13:04 ` [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin 5 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Document the cgroup-aware OOM killer. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- Documentation/cgroup-v2.txt | 51 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index 3f8216912df0..28429e62b0ea 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines 5-2-3. Memory Ownership + 5-2-4. OOM Killer 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -1043,6 +1044,28 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom_group + + A read-write single value file which exists on non-root + cgroups. The default is "0". 
+ + If set, the OOM killer will consider the memory cgroup as an + indivisible memory consumer and compare it with other memory + consumers by its memory footprint. + If such a memory cgroup is selected as an OOM victim, all + processes belonging to it or its descendants will be killed. + + This applies to system-wide OOM conditions and to reaching + the hard memory limit of the cgroup or any of its ancestors. + If an OOM condition happens in a descendant cgroup with its own + memory limit, the memory cgroup can't be considered + as an OOM victim, and the OOM killer will not kill all of the + belonging tasks. + + Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000, + and will never kill such a task, even if memory.oom_group + is set. + memory.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified @@ -1246,6 +1269,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. +OOM Killer +~~~~~~~~~~ + +The cgroup v2 memory controller implements a cgroup-aware OOM killer, +which means it treats cgroups as first-class OOM entities. + +Under OOM conditions the memory controller tries to make the best +choice of a victim, looking for a memory cgroup with the largest +memory footprint, considering leaf cgroups and cgroups with the +memory.oom_group option set, which are considered to be indivisible +memory consumers. + +By default, the OOM killer will kill the biggest task in the selected +memory cgroup. A user can change this behavior by enabling +the per-cgroup memory.oom_group option. If set, it causes +the OOM killer to kill all processes attached to the cgroup, +except processes with oom_score_adj set to -1000. + +This affects both system- and cgroup-wide OOMs. 
For a cgroup-wide OOM +the memory controller considers only cgroups belonging to the sub-tree +of the OOM'ing cgroup. + +The root cgroup is treated as a leaf memory cgroup, so it's compared +with other leaf memory cgroups and cgroups with the oom_group option set. + +If there are no cgroups with the memory controller enabled, +the OOM killer uses the "traditional" process-based approach. + IO -- -- 2.13.6 ^ permalink raw reply related [flat|nested] 27+ messages in thread
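The documented workflow boils down to two writes; the cgroup path and process name below are hypothetical, shown only to make the interface concrete:

```shell
# Hypothetical sketch: treat one workload cgroup as an indivisible
# OOM consumer, so a cgroup-wide OOM kills all of its processes.
echo 1 > /sys/fs/cgroup/workload/memory.oom_group

# A process that must survive even a group kill keeps the
# long-standing per-process protection:
echo -1000 > /proc/$(pidof important-agent)/oom_score_adj
```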
Thread overview: 27+ messages (newest: ~2017-10-13 21:31 UTC)
2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
[not found] ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org>
2017-10-09 21:11 ` David Rientjes
2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
[not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org>
2017-10-05 14:27 ` Michal Hocko
2017-10-09 21:52 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-10 8:18 ` Michal Hocko
2017-10-10 12:23 ` Roman Gushchin
2017-10-10 21:13 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-10 22:04 ` Roman Gushchin
2017-10-11 20:21 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-11 21:49 ` Roman Gushchin
2017-10-12 21:50 ` David Rientjes
2017-10-13 13:32 ` Roman Gushchin
[not found] ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2017-10-13 21:31 ` David Rientjes
2017-10-11 13:08 ` Michal Hocko
2017-10-11 20:27 ` David Rientjes
2017-10-12 6:33 ` Michal Hocko
2017-10-11 16:10 ` Roman Gushchin
[not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org>
2017-10-05 13:04 ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
2017-10-05 14:29 ` Michal Hocko
2017-10-05 14:31 ` Michal Hocko
[not found] ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-10-06 12:04 ` Roman Gushchin
2017-10-06 12:17 ` Michal Hocko
2017-10-05 13:04 ` [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04 ` [v11 6/6] mm, oom, docs: describe the " Roman Gushchin