* [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
` (8 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.
This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().
The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b0d0fed8480..f041534d77d3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -814,68 +814,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -949,6 +893,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
#undef K
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just give it access to memory reserves
+ * so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.14.3
^ permalink raw reply related [flat|nested] 51+ messages in thread* [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
` (7 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, Andrew Morton,
Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
linux-mm
Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup to use this function for looking for a OOM victim
task in the root memory cgroup by the cgroup-ware OOM killer.
The root memory cgroup is treated as a leaf cgroup, so only tasks
which are directly belonging to the root cgroup are iterated over.
This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/memcontrol.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4aac306ebe3..55fbda60cef6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -888,7 +888,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
* value, the function breaks the iteration loop and returns the value.
* Otherwise, it will iterate over all tasks and return 0.
*
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
*/
int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
int (*fn)(struct task_struct *, void *), void *arg)
@@ -896,8 +897,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
- BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -906,7 +905,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
- if (ret) {
+ if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-12-01 8:35 ` Michal Hocko
2017-12-07 1:24 ` Andrew Morton
2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
` (6 subsequent siblings)
9 siblings, 2 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.
This behavior doesn't suit well the system with many running
containers:
1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.
2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.
To address these issues, the cgroup-aware OOM killer is introduced.
This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.
The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.
The root cgroup is treated as a leaf memory cgroup, so it's score
is compared with other leaf memory cgroups.
Due to memcg statistics implementation a special approximation
is used for estimating oom_score of root memory cgroup: we sum
oom_score of the belonging processes (or, to be more precise,
tasks owning their mm structures).
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 17 +++++
include/linux/oom.h | 12 ++-
mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 84 +++++++++++++++------
4 files changed, 272 insertions(+), 22 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 882046863581..cb4db659a8b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
bool mem_cgroup_oom_synchronize(bool wait);
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27cd36b762b5..10f495c8454d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -10,6 +10,13 @@
#include <linux/sched/coredump.h> /* MMF_* */
#include <linux/mm.h> /* VM_FAULT* */
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
struct zonelist;
struct notifier_block;
struct mem_cgroup;
@@ -51,7 +58,8 @@ struct oom_control {
/* Used by oom implementation, do not set */
unsigned long totalpages;
- struct task_struct *chosen;
+ struct task_struct *chosen_task;
+ struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
};
@@ -115,6 +123,8 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55fbda60cef6..592ffb1c98a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the folowing:
+ * -1: there are inflight OOM victim tasks, belonging to the memcg
+ * 0: memcg is not eligible, e.g. all belonging tasks are protected
+ * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * >0: memcg is eligible, and the returned value is an estimation
+ * of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int eligible = 0;
+
+ /*
+ * Root memory cgroup is a special case:
+ * we don't have necessary stats to evaluate it exactly as
+ * leaf memory cgroups, so we approximate it's oom_score
+ * by summing oom_score of all belonging tasks, which are
+ * owners of their mm structs.
+ *
+ * If there are inflight OOM victim tasks inside
+ * the root memcg, we return -1.
+ */
+ if (memcg == root_mem_cgroup) {
+ struct css_task_iter it;
+ struct task_struct *task;
+ long score = 0;
+
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP,
+ &task->signal->oom_mm->flags)) {
+ score = -1;
+ break;
+ }
+
+ task_lock(task);
+ if (!task->mm || task->mm->owner != task) {
+ task_unlock(task);
+ continue;
+ }
+ task_unlock(task);
+
+ score += oom_badness(task, memcg, nodemask,
+ totalpages);
+ }
+ css_task_iter_end(&it);
+
+ return score;
+ }
+
+ /*
+ * Memcg is OOM eligible if there are OOM killable tasks inside.
+ *
+ * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * as unkillable.
+ *
+ * If there are inflight OOM victim tasks inside the memcg,
+ * we return -1.
+ */
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (!eligible &&
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ eligible = 1;
+
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ eligible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ if (eligible <= 0)
+ return eligible;
+
+ return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter;
+
+ oc->chosen_memcg = NULL;
+ oc->chosen_points = 0;
+
+ /*
+ * The oom_score is calculated for leaf memory cgroups (including
+ * the root memcg).
+ */
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(iter, root) {
+ long score;
+
+ if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ continue;
+
+ score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (score == 0)
+ continue;
+
+ /*
+ * If there are inflight OOM victims, we don't need
+ * to look further for new victims.
+ */
+ if (score == -1) {
+ oc->chosen_memcg = INFLIGHT_VICTIM;
+ mem_cgroup_iter_break(root, iter);
+ break;
+ }
+
+ if (score > oc->chosen_points) {
+ oc->chosen_points = score;
+ oc->chosen_memcg = iter;
+ }
+ }
+
+ if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+ css_get(&oc->chosen_memcg->css);
+
+ rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+ else
+ root = root_mem_cgroup;
+
+ select_victim_memcg(root, oc);
+
+ return oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f041534d77d3..bcfa92f29407 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto next;
/* Prefer thread group leaders for display purposes */
- if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+ if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
goto next;
select:
- if (oc->chosen)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
get_task_struct(task);
- oc->chosen = task;
+ oc->chosen_task = task;
oc->chosen_points = points;
next:
return 0;
abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
+ oc->chosen_task = INFLIGHT_VICTIM;
return 1;
}
/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -895,7 +895,7 @@ static void __oom_kill_process(struct task_struct *victim)
static void oom_kill_process(struct oom_control *oc, const char *message)
{
- struct task_struct *p = oc->chosen;
+ struct task_struct *p = oc->chosen_task;
unsigned int points = oc->chosen_points;
struct task_struct *victim = p;
struct task_struct *child;
@@ -956,6 +956,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
__oom_kill_process(victim);
}
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+ if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+ return oc->chosen_memcg;
+
+ /* Kill a task in the chosen memcg with the biggest memory footprint */
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+ goto out;
+
+ __oom_kill_process(oc->chosen_task);
+
+out:
+ mem_cgroup_put(oc->chosen_memcg);
+ return oc->chosen_task;
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1008,6 +1029,7 @@ bool out_of_memory(struct oom_control *oc)
{
unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
+ bool delay = false; /* if set, delay next allocation attempt */
if (oom_killer_disabled)
return false;
@@ -1055,11 +1077,26 @@ bool out_of_memory(struct oom_control *oc)
if (oc->page)
return true;
get_task_struct(current);
- oc->chosen = current;
+ oc->chosen_task = current;
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}
+ if (mem_cgroup_select_oom_victim(oc)) {
+ oc->page = alloc_pages_before_oomkill(oc);
+ if (oc->page) {
+ if (oc->chosen_memcg &&
+ oc->chosen_memcg != INFLIGHT_VICTIM)
+ mem_cgroup_put(oc->chosen_memcg);
+ return true;
+ }
+
+ if (oom_kill_memcg_victim(oc)) {
+ delay = true;
+ goto out;
+ }
+ }
+
select_bad_process(oc);
/*
* Try really last second allocation attempt after we selected an OOM
@@ -1068,25 +1105,30 @@ bool out_of_memory(struct oom_control *oc)
*/
oc->page = alloc_pages_before_oomkill(oc);
if (oc->page) {
- if (oc->chosen && oc->chosen != (void *)-1UL)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
+ put_task_struct(oc->chosen_task);
return true;
}
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
- /*
- * Give the killed process a good chance to exit before trying
- * to allocate memory again.
- */
- schedule_timeout_killable(1);
+ delay = true;
}
- return !!oc->chosen;
+
+out:
+ /*
+ * Give the killed process a good chance to exit before trying
+ * to allocate memory again.
+ */
+ if (delay)
+ schedule_timeout_killable(1);
+
+ return !!oc->chosen_task;
}
/*
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-12-01 8:35 ` Michal Hocko
2017-12-07 1:24 ` Andrew Morton
1 sibling, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 8:35 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu 30-11-17 15:28:20, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit well the system with many running
> containers:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> This patch introduces the core functionality: an ability to select
> a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
> looks for the biggest leaf memory cgroup and kills the biggest
> task belonging to it.
>
> The following patches will extend this functionality to consider
> non-leaf memory cgroups as OOM victims, and also provide an ability
> to kill all tasks belonging to the victim cgroup.
>
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf memory cgroups.
> Due to memcg statistics implementation a special approximation
> is used for estimating oom_score of root memory cgroup: we sum
> oom_score of the belonging processes (or, to be more precise,
> tasks owning their mm structures).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
I am not entirely happy that this patch enables the cgroup behavior
unconditioanlly for cgroup v2 but later patch fixes that up. I do not
expect people are going to bisect oom workloads over these few commits
so this should be a big deal.
Anyway I still _strongly_ believe that the new heuristic is not
suitable for the default behavior and the opt-in is required. So my ack
is under this condition.
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/memcontrol.h | 17 +++++
> include/linux/oom.h | 12 ++-
> mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++++++++++++
> mm/oom_kill.c | 84 +++++++++++++++------
> 4 files changed, 272 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 882046863581..cb4db659a8b5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
> struct page;
> struct mm_struct;
> struct kmem_cache;
> +struct oom_control;
>
> /* Cgroup-specific page state, on top of universal node page state */
> enum memcg_stat_item {
> @@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> return css ? container_of(css, struct mem_cgroup, css) : NULL;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
> #define mem_cgroup_from_counter(counter, member) \
> container_of(counter, struct mem_cgroup, member)
>
> @@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
>
> bool mem_cgroup_oom_synchronize(bool wait);
>
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
> #ifdef CONFIG_MEMCG_SWAP
> extern int do_swap_account;
> #endif
> @@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
> return true;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
> static inline struct mem_cgroup *
> mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> @@ -973,6 +985,11 @@ static inline
> void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
> {
> }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + return false;
> +}
> #endif /* CONFIG_MEMCG */
>
> /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 27cd36b762b5..10f495c8454d 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -10,6 +10,13 @@
> #include <linux/sched/coredump.h> /* MMF_* */
> #include <linux/mm.h> /* VM_FAULT* */
>
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
> struct zonelist;
> struct notifier_block;
> struct mem_cgroup;
> @@ -51,7 +58,8 @@ struct oom_control {
>
> /* Used by oom implementation, do not set */
> unsigned long totalpages;
> - struct task_struct *chosen;
> + struct task_struct *chosen_task;
> + struct mem_cgroup *chosen_memcg;
> unsigned long chosen_points;
> };
>
> @@ -115,6 +123,8 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
>
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
> /* sysctls */
> extern int sysctl_oom_dump_tasks;
> extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 55fbda60cef6..592ffb1c98a7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> return ret;
> }
>
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + * -1: there are inflight OOM victim tasks, belonging to the memcg
> + * 0: memcg is not eligible, e.g. all belonging tasks are protected
> + * by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * >0: memcg is eligible, and the returned value is an estimation
> + * of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> + * Root memory cgroup is a special case:
> + * we don't have necessary stats to evaluate it exactly as
> + * leaf memory cgroups, so we approximate it's oom_score
> + * by summing oom_score of all belonging tasks, which are
> + * owners of their mm structs.
> + *
> + * If there are inflight OOM victim tasks inside
> + * the root memcg, we return -1.
> + */
> + if (memcg == root_mem_cgroup) {
> + struct css_task_iter it;
> + struct task_struct *task;
> + long score = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP,
> + &task->signal->oom_mm->flags)) {
> + score = -1;
> + break;
> + }
> +
> + task_lock(task);
> + if (!task->mm || task->mm->owner != task) {
> + task_unlock(task);
> + continue;
> + }
> + task_unlock(task);
> +
> + score += oom_badness(task, memcg, nodemask,
> + totalpages);
> + }
> + css_task_iter_end(&it);
> +
> + return score;
> + }
> +
> + /*
> + * Memcg is OOM eligible if there are OOM killable tasks inside.
> + *
> + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * as unkillable.
> + *
> + * If there are inflight OOM victim tasks inside the memcg,
> + * we return -1.
> + */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> + * The oom_score is calculated for leaf memory cgroups (including
> + * the root memcg).
> + */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter) && iter != root_mem_cgroup)
> + continue;
> +
> + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (score == 0)
> + continue;
> +
> + /*
> + * If there are inflight OOM victims, we don't need
> + * to look further for new victims.
> + */
> + if (score == -1) {
> + oc->chosen_memcg = INFLIGHT_VICTIM;
> + mem_cgroup_iter_break(root, iter);
> + break;
> + }
> +
> + if (score > oc->chosen_points) {
> + oc->chosen_points = score;
> + oc->chosen_memcg = iter;
> + }
> + }
> +
> + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> + css_get(&oc->chosen_memcg->css);
> +
> + rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *root;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + if (oc->memcg)
> + root = oc->memcg;
> + else
> + root = root_mem_cgroup;
> +
> + select_victim_memcg(root, oc);
> +
> + return oc->chosen_memcg;
> +}
> +
> /*
> * Reclaims as many pages from the given memcg as possible.
> *
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f041534d77d3..bcfa92f29407 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
> return CONSTRAINT_NONE;
> }
>
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> struct oom_control *oc = arg;
> unsigned long points;
> @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> goto next;
>
> /* Prefer thread group leaders for display purposes */
> - if (points == oc->chosen_points && thread_group_leader(oc->chosen))
> + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
> goto next;
> select:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> get_task_struct(task);
> - oc->chosen = task;
> + oc->chosen_task = task;
> oc->chosen_points = points;
> next:
> return 0;
> abort:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> - oc->chosen = (void *)-1UL;
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> + oc->chosen_task = INFLIGHT_VICTIM;
> return 1;
> }
>
> /*
> * Simple selection loop. We choose the process with the highest number of
> - * 'points'. In case scan was aborted, oc->chosen is set to -1.
> + * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
> */
> static void select_bad_process(struct oom_control *oc)
> {
> @@ -895,7 +895,7 @@ static void __oom_kill_process(struct task_struct *victim)
>
> static void oom_kill_process(struct oom_control *oc, const char *message)
> {
> - struct task_struct *p = oc->chosen;
> + struct task_struct *p = oc->chosen_task;
> unsigned int points = oc->chosen_points;
> struct task_struct *victim = p;
> struct task_struct *child;
> @@ -956,6 +956,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> __oom_kill_process(victim);
> }
>
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> + return oc->chosen_memcg;
> +
> + /* Kill a task in the chosen memcg with the biggest memory footprint */
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;
> +
> + __oom_kill_process(oc->chosen_task);
> +
> +out:
> + mem_cgroup_put(oc->chosen_memcg);
> + return oc->chosen_task;
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> @@ -1008,6 +1029,7 @@ bool out_of_memory(struct oom_control *oc)
> {
> unsigned long freed = 0;
> enum oom_constraint constraint = CONSTRAINT_NONE;
> + bool delay = false; /* if set, delay next allocation attempt */
>
> if (oom_killer_disabled)
> return false;
> @@ -1055,11 +1077,26 @@ bool out_of_memory(struct oom_control *oc)
> if (oc->page)
> return true;
> get_task_struct(current);
> - oc->chosen = current;
> + oc->chosen_task = current;
> oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
> return true;
> }
>
> + if (mem_cgroup_select_oom_victim(oc)) {
> + oc->page = alloc_pages_before_oomkill(oc);
> + if (oc->page) {
> + if (oc->chosen_memcg &&
> + oc->chosen_memcg != INFLIGHT_VICTIM)
> + mem_cgroup_put(oc->chosen_memcg);
> + return true;
> + }
> +
> + if (oom_kill_memcg_victim(oc)) {
> + delay = true;
> + goto out;
> + }
> + }
> +
> select_bad_process(oc);
> /*
> * Try really last second allocation attempt after we selected an OOM
> @@ -1068,25 +1105,30 @@ bool out_of_memory(struct oom_control *oc)
> */
> oc->page = alloc_pages_before_oomkill(oc);
> if (oc->page) {
> - if (oc->chosen && oc->chosen != (void *)-1UL)
> - put_task_struct(oc->chosen);
> + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
> + put_task_struct(oc->chosen_task);
> return true;
> }
> /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> dump_header(oc, NULL);
> panic("Out of memory and no killable processes...\n");
> }
> - if (oc->chosen && oc->chosen != (void *)-1UL) {
> + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
> oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> "Memory cgroup out of memory");
> - /*
> - * Give the killed process a good chance to exit before trying
> - * to allocate memory again.
> - */
> - schedule_timeout_killable(1);
> + delay = true;
> }
> - return !!oc->chosen;
> +
> +out:
> + /*
> + * Give the killed process a good chance to exit before trying
> + * to allocate memory again.
> + */
> + if (delay)
> + schedule_timeout_killable(1);
> +
> + return !!oc->chosen_task;
> }
>
> /*
> --
> 2.14.3
>
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-12-01 8:35 ` Michal Hocko
@ 2017-12-07 1:24 ` Andrew Morton
2017-12-07 13:39 ` Roman Gushchin
1 sibling, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2017-12-07 1:24 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
As a result of the "stalled MM patches" discussion I've dropped these
three patches:
mm,oom: move last second allocation to inside the OOM killer
mm,oom: use ALLOC_OOM for OOM victim's last second allocation
mm,oom: remove oom_lock serialization from the OOM reaper
and I had to rework this patch as a result. Please carefully check (and
preferable test) my handiwork in out_of_memory()?
From: Roman Gushchin <guro@fb.com>
Subject: mm, oom: cgroup-aware OOM killer
Traditionally, the OOM killer is operating on a process level. Under oom
conditions, it finds a process with the highest oom score and kills it.
This behavior doesn't suit well the system with many running containers:
1) There is no fairness between containers. A small container with few
large processes will be chosen over a large one with huge number of
small processes.
2) Containers often do not expect that some random process inside will
be killed. In many cases much safer behavior is to kill all tasks in
the container. Traditionally, this was implemented in userspace, but
doing it in the kernel has some advantages, especially in a case of a
system-wide OOM.
To address these issues, the cgroup-aware OOM killer is introduced.
This patch introduces the core functionality: an ability to select a
memory cgroup as an OOM victim. Under OOM conditions the OOM killer looks
for the biggest leaf memory cgroup and kills the biggest task belonging to
it.
The following patches will extend this functionality to consider non-leaf
memory cgroups as OOM victims, and also provide an ability to kill all
tasks belonging to the victim cgroup.
The root cgroup is treated as a leaf memory cgroup, so it's score is
compared with other leaf memory cgroups. Due to memcg statistics
implementation a special approximation is used for estimating oom_score of
root memory cgroup: we sum oom_score of the belonging processes (or, to be
more precise, tasks owning their mm structures).
Link: http://lkml.kernel.org/r/20171130152824.1591-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/memcontrol.h | 17 +++
include/linux/oom.h | 12 ++
mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 72 ++++++++++---
4 files changed, 262 insertions(+), 20 deletions(-)
diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
+++ a/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
bool mem_cgroup_oom_synchronize(bool wait);
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer include/linux/oom.h
--- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-killer
+++ a/include/linux/oom.h
@@ -10,6 +10,13 @@
#include <linux/sched/coredump.h> /* MMF_* */
#include <linux/mm.h> /* VM_FAULT* */
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
struct zonelist;
struct notifier_block;
struct mem_cgroup;
@@ -40,7 +47,8 @@ struct oom_control {
/* Used by oom implementation, do not set */
unsigned long totalpages;
- struct task_struct *chosen;
+ struct task_struct *chosen_task;
+ struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
};
@@ -102,6 +110,8 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff -puN mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer mm/memcontrol.c
--- a/mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer
+++ a/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(st
return ret;
}
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the folowing:
+ * -1: there are inflight OOM victim tasks, belonging to the memcg
+ * 0: memcg is not eligible, e.g. all belonging tasks are protected
+ * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * >0: memcg is eligible, and the returned value is an estimation
+ * of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int eligible = 0;
+
+ /*
+ * Root memory cgroup is a special case:
+ * we don't have necessary stats to evaluate it exactly as
+ * leaf memory cgroups, so we approximate it's oom_score
+ * by summing oom_score of all belonging tasks, which are
+ * owners of their mm structs.
+ *
+ * If there are inflight OOM victim tasks inside
+ * the root memcg, we return -1.
+ */
+ if (memcg == root_mem_cgroup) {
+ struct css_task_iter it;
+ struct task_struct *task;
+ long score = 0;
+
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP,
+ &task->signal->oom_mm->flags)) {
+ score = -1;
+ break;
+ }
+
+ task_lock(task);
+ if (!task->mm || task->mm->owner != task) {
+ task_unlock(task);
+ continue;
+ }
+ task_unlock(task);
+
+ score += oom_badness(task, memcg, nodemask,
+ totalpages);
+ }
+ css_task_iter_end(&it);
+
+ return score;
+ }
+
+ /*
+ * Memcg is OOM eligible if there are OOM killable tasks inside.
+ *
+ * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * as unkillable.
+ *
+ * If there are inflight OOM victim tasks inside the memcg,
+ * we return -1.
+ */
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (!eligible &&
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ eligible = 1;
+
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ eligible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ if (eligible <= 0)
+ return eligible;
+
+ return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter;
+
+ oc->chosen_memcg = NULL;
+ oc->chosen_points = 0;
+
+ /*
+ * The oom_score is calculated for leaf memory cgroups (including
+ * the root memcg).
+ */
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(iter, root) {
+ long score;
+
+ if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ continue;
+
+ score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (score == 0)
+ continue;
+
+ /*
+ * If there are inflight OOM victims, we don't need
+ * to look further for new victims.
+ */
+ if (score == -1) {
+ oc->chosen_memcg = INFLIGHT_VICTIM;
+ mem_cgroup_iter_break(root, iter);
+ break;
+ }
+
+ if (score > oc->chosen_points) {
+ oc->chosen_points = score;
+ oc->chosen_memcg = iter;
+ }
+ }
+
+ if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+ css_get(&oc->chosen_memcg->css);
+
+ rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+ else
+ root = root_mem_cgroup;
+
+ select_victim_memcg(root, oc);
+
+ return oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
diff -puN mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer mm/oom_kill.c
--- a/mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer
+++ a/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_a
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task
goto next;
/* Prefer thread group leaders for display purposes */
- if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+ if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
goto next;
select:
- if (oc->chosen)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
get_task_struct(task);
- oc->chosen = task;
+ oc->chosen_task = task;
oc->chosen_points = points;
next:
return 0;
abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
+ oc->chosen_task = INFLIGHT_VICTIM;
return 1;
}
/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -912,7 +912,7 @@ static void __oom_kill_process(struct ta
static void oom_kill_process(struct oom_control *oc, const char *message)
{
- struct task_struct *p = oc->chosen;
+ struct task_struct *p = oc->chosen_task;
unsigned int points = oc->chosen_points;
struct task_struct *victim = p;
struct task_struct *child;
@@ -973,6 +973,27 @@ static void oom_kill_process(struct oom_
__oom_kill_process(victim);
}
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+ if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+ return oc->chosen_memcg;
+
+ /* Kill a task in the chosen memcg with the biggest memory footprint */
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+ goto out;
+
+ __oom_kill_process(oc->chosen_task);
+
+out:
+ mem_cgroup_put(oc->chosen_memcg);
+ return oc->chosen_task;
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1025,6 +1046,7 @@ bool out_of_memory(struct oom_control *o
{
unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
+ bool delay = false; /* if set, delay next allocation attempt */
if (oom_killer_disabled)
return false;
@@ -1069,27 +1091,39 @@ bool out_of_memory(struct oom_control *o
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
- oc->chosen = current;
+ oc->chosen_task = current;
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}
+ if (mem_cgroup_select_oom_victim(oc)) {
+ if (oom_kill_memcg_victim(oc)) {
+ delay = true;
+ goto out;
+ }
+ }
+
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
- /*
- * Give the killed process a good chance to exit before trying
- * to allocate memory again.
- */
- schedule_timeout_killable(1);
+ delay = true;
}
- return !!oc->chosen;
+
+out:
+ /*
+ * Give the killed process a good chance to exit before trying
+ * to allocate memory again.
+ */
+ if (delay)
+ schedule_timeout_killable(1);
+
+ return !!oc->chosen_task;
}
/*
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
2017-12-07 1:24 ` Andrew Morton
@ 2017-12-07 13:39 ` Roman Gushchin
0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-12-07 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Wed, Dec 06, 2017 at 05:24:13PM -0800, Andrew Morton wrote:
>
> As a result of the "stalled MM patches" discussion I've dropped these
> three patches:
>
> mm,oom: move last second allocation to inside the OOM killer
> mm,oom: use ALLOC_OOM for OOM victim's last second allocation
> mm,oom: remove oom_lock serialization from the OOM reaper
>
> and I had to rework this patch as a result. Please carefully check (and
> preferable test) my handiwork in out_of_memory()?
Hi, Andrew!
Reviewed and tested, looks good to me. Thank you!
A couple of small nits below.
>
>
>
> From: Roman Gushchin <guro@fb.com>
> Subject: mm, oom: cgroup-aware OOM killer
>
> Traditionally, the OOM killer is operating on a process level. Under oom
> conditions, it finds a process with the highest oom score and kills it.
>
> This behavior doesn't suit well the system with many running containers:
>
> 1) There is no fairness between containers. A small container with few
> large processes will be chosen over a large one with huge number of
> small processes.
>
> 2) Containers often do not expect that some random process inside will
> be killed. In many cases much safer behavior is to kill all tasks in
> the container. Traditionally, this was implemented in userspace, but
> doing it in the kernel has some advantages, especially in a case of a
> system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> This patch introduces the core functionality: an ability to select a
> memory cgroup as an OOM victim. Under OOM conditions the OOM killer looks
> for the biggest leaf memory cgroup and kills the biggest task belonging to
> it.
>
> The following patches will extend this functionality to consider non-leaf
> memory cgroups as OOM victims, and also provide an ability to kill all
> tasks belonging to the victim cgroup.
>
> The root cgroup is treated as a leaf memory cgroup, so it's score is
> compared with other leaf memory cgroups. Due to memcg statistics
> implementation a special approximation is used for estimating oom_score of
> root memory cgroup: we sum oom_score of the belonging processes (or, to be
> more precise, tasks owning their mm structures).
>
> Link: http://lkml.kernel.org/r/20171130152824.1591-4-guro@fb.com
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> include/linux/memcontrol.h | 17 +++
> include/linux/oom.h | 12 ++
> mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++
> mm/oom_kill.c | 72 ++++++++++---
> 4 files changed, 262 insertions(+), 20 deletions(-)
>
> diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
> struct page;
> struct mm_struct;
> struct kmem_cache;
> +struct oom_control;
>
> /* Cgroup-specific page state, on top of universal node page state */
> enum memcg_stat_item {
> @@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
> return css ? container_of(css, struct mem_cgroup, css) : NULL;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
> #define mem_cgroup_from_counter(counter, member) \
> container_of(counter, struct mem_cgroup, member)
>
> @@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
>
> bool mem_cgroup_oom_synchronize(bool wait);
>
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
> #ifdef CONFIG_MEMCG_SWAP
> extern int do_swap_account;
> #endif
> @@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
> return true;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
> static inline struct mem_cgroup *
> mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> @@ -973,6 +985,11 @@ static inline
> void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
> {
> }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + return false;
> +}
> #endif /* CONFIG_MEMCG */
>
> /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer include/linux/oom.h
> --- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/oom.h
> @@ -10,6 +10,13 @@
> #include <linux/sched/coredump.h> /* MMF_* */
> #include <linux/mm.h> /* VM_FAULT* */
>
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
> struct zonelist;
> struct notifier_block;
> struct mem_cgroup;
> @@ -40,7 +47,8 @@ struct oom_control {
>
> /* Used by oom implementation, do not set */
> unsigned long totalpages;
> - struct task_struct *chosen;
> + struct task_struct *chosen_task;
> + struct mem_cgroup *chosen_memcg;
> unsigned long chosen_points;
> };
>
> @@ -102,6 +110,8 @@ extern void oom_killer_enable(void);
>
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
> /* sysctls */
> extern int sysctl_oom_dump_tasks;
> extern int sysctl_oom_kill_allocating_task;
> diff -puN mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer mm/memcontrol.c
> --- a/mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer
> +++ a/mm/memcontrol.c
> @@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(st
> return ret;
> }
>
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + * -1: there are inflight OOM victim tasks, belonging to the memcg
> + * 0: memcg is not eligible, e.g. all belonging tasks are protected
> + * by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * >0: memcg is eligible, and the returned value is an estimation
> + * of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> + * Root memory cgroup is a special case:
> + * we don't have necessary stats to evaluate it exactly as
> + * leaf memory cgroups, so we approximate it's oom_score
> + * by summing oom_score of all belonging tasks, which are
> + * owners of their mm structs.
> + *
> + * If there are inflight OOM victim tasks inside
> + * the root memcg, we return -1.
> + */
> + if (memcg == root_mem_cgroup) {
> + struct css_task_iter it;
> + struct task_struct *task;
> + long score = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP,
> + &task->signal->oom_mm->flags)) {
> + score = -1;
> + break;
> + }
> +
> + task_lock(task);
> + if (!task->mm || task->mm->owner != task) {
> + task_unlock(task);
> + continue;
> + }
> + task_unlock(task);
> +
> + score += oom_badness(task, memcg, nodemask,
> + totalpages);
> + }
> + css_task_iter_end(&it);
> +
> + return score;
> + }
> +
> + /*
> + * Memcg is OOM eligible if there are OOM killable tasks inside.
> + *
> + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * as unkillable.
> + *
> + * If there are inflight OOM victim tasks inside the memcg,
> + * we return -1.
> + */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> + * The oom_score is calculated for leaf memory cgroups (including
> + * the root memcg).
> + */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter) && iter != root_mem_cgroup)
> + continue;
> +
> + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (score == 0)
> + continue;
> +
> + /*
> + * If there are inflight OOM victims, we don't need
> + * to look further for new victims.
> + */
> + if (score == -1) {
> + oc->chosen_memcg = INFLIGHT_VICTIM;
> + mem_cgroup_iter_break(root, iter);
> + break;
> + }
> +
> + if (score > oc->chosen_points) {
> + oc->chosen_points = score;
> + oc->chosen_memcg = iter;
> + }
> + }
> +
> + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> + css_get(&oc->chosen_memcg->css);
> +
> + rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *root;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + if (oc->memcg)
> + root = oc->memcg;
> + else
> + root = root_mem_cgroup;
> +
> + select_victim_memcg(root, oc);
> +
> + return oc->chosen_memcg;
> +}
> +
> /*
> * Reclaims as many pages from the given memcg as possible.
> *
> diff -puN mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer mm/oom_kill.c
> --- a/mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer
> +++ a/mm/oom_kill.c
> @@ -309,7 +309,7 @@ static enum oom_constraint constrained_a
> return CONSTRAINT_NONE;
> }
>
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> struct oom_control *oc = arg;
> unsigned long points;
> @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task
> goto next;
>
> /* Prefer thread group leaders for display purposes */
> - if (points == oc->chosen_points && thread_group_leader(oc->chosen))
> + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
> goto next;
> select:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> get_task_struct(task);
> - oc->chosen = task;
> + oc->chosen_task = task;
> oc->chosen_points = points;
> next:
> return 0;
> abort:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> - oc->chosen = (void *)-1UL;
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> + oc->chosen_task = INFLIGHT_VICTIM;
> return 1;
> }
>
> /*
> * Simple selection loop. We choose the process with the highest number of
> - * 'points'. In case scan was aborted, oc->chosen is set to -1.
> + * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
> */
> static void select_bad_process(struct oom_control *oc)
> {
> @@ -912,7 +912,7 @@ static void __oom_kill_process(struct ta
>
> static void oom_kill_process(struct oom_control *oc, const char *message)
> {
> - struct task_struct *p = oc->chosen;
> + struct task_struct *p = oc->chosen_task;
> unsigned int points = oc->chosen_points;
> struct task_struct *victim = p;
> struct task_struct *child;
> @@ -973,6 +973,27 @@ static void oom_kill_process(struct oom_
> __oom_kill_process(victim);
> }
>
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> + return oc->chosen_memcg;
> +
> + /* Kill a task in the chosen memcg with the biggest memory footprint */
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;
> +
> + __oom_kill_process(oc->chosen_task);
> +
> +out:
> + mem_cgroup_put(oc->chosen_memcg);
> + return oc->chosen_task;
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> @@ -1025,6 +1046,7 @@ bool out_of_memory(struct oom_control *o
> {
> unsigned long freed = 0;
> enum oom_constraint constraint = CONSTRAINT_NONE;
> + bool delay = false; /* if set, delay next allocation attempt */
>
> if (oom_killer_disabled)
> return false;
> @@ -1069,27 +1091,39 @@ bool out_of_memory(struct oom_control *o
> current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> get_task_struct(current);
> - oc->chosen = current;
> + oc->chosen_task = current;
> oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
> return true;
> }
>
> + if (mem_cgroup_select_oom_victim(oc)) {
> + if (oom_kill_memcg_victim(oc)) {
> + delay = true;
> + goto out;
> + }
> + }
It's probably better to join two conditions with &&:
if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
delay = true;
goto out;
}
> +
> select_bad_process(oc);
> /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> dump_header(oc, NULL);
> panic("Out of memory and no killable processes...\n");
> }
> - if (oc->chosen && oc->chosen != (void *)-1UL) {
> + if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {
It's better to replace "(void *)-1UL" here with "INFLIGHT_VICTIM"
to be consistent:
if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) ...
> oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> "Memory cgroup out of memory");
--
Also, I've sent a couple of small fixes, requested by Michal:
1) https://lkml.org/lkml/2017/12/1/569
(mm, oom, docs: document groupoom mount option)
2) https://lkml.org/lkml/2017/12/1/568
(mm, oom: return error on access to memory.oom_group if groupoom is disabled)
Can, you, please, pull them?
Thank you!
Roman
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v13 4/7] mm, oom: introduce memory.oom_group
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (2 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
` (5 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.
But there are workloads, which are not tolerant to a such behavior.
Killing a random task may leave the workload in a broken state.
To solve this problem, memory.oom_group knob is introduced.
It will define, whether a memory group should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.
If set on memcg A, it means that in case of system-wide OOM or
memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
belonging to the sub-tree of A will be killed. If OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. OOM killer will never touch any tasks
outside of the scope of the OOM event.
Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.
The default value is 0.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 17 +++++++++++
mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++---
mm/oom_kill.c | 47 +++++++++++++++++++++++------
3 files changed, 126 insertions(+), 13 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cb4db659a8b5..7b8bcdf6571d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /*
+ * Treat the sub-tree as an indivisible memory consumer,
+ * kill all belonging tasks if the memory cgroup selected
+ * as OOM victim.
+ */
+ bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
@@ -490,6 +497,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return memcg->oom_group;
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -990,6 +1002,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
{
return false;
}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 592ffb1c98a7..5d27a4bbd478 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,19 +2779,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
{
- struct mem_cgroup *iter;
+ struct mem_cgroup *iter, *group = NULL;
+ long group_score = 0;
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
+ /*
+ * If OOM is memcg-wide, and the memcg has the oom_group flag set,
+ * all tasks belonging to the memcg should be killed.
+ * So, we mark the memcg as a victim.
+ */
+ if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+ oc->chosen_memcg = oc->memcg;
+ css_get(&oc->chosen_memcg->css);
+ return;
+ }
+
/*
* The oom_score is calculated for leaf memory cgroups (including
* the root memcg).
+ * Non-leaf oom_group cgroups accumulating score of descendant
+ * leaf memory cgroups.
*/
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
+ /*
+ * We don't consider non-leaf non-oom_group memory cgroups
+ * as OOM victims.
+ */
+ if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+ !mem_cgroup_oom_group(iter))
+ continue;
+
+ /*
+ * If group is not set or we've ran out of the group's sub-tree,
+ * we should set group and reset group_score.
+ */
+ if (!group || group == root_mem_cgroup ||
+ !mem_cgroup_is_descendant(iter, group)) {
+ group = iter;
+ group_score = 0;
+ }
+
if (memcg_has_children(iter) && iter != root_mem_cgroup)
continue;
@@ -2813,9 +2845,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
break;
}
- if (score > oc->chosen_points) {
- oc->chosen_points = score;
- oc->chosen_memcg = iter;
+ group_score += score;
+
+ if (group_score > oc->chosen_points) {
+ oc->chosen_points = group_score;
+ oc->chosen_memcg = group;
}
}
@@ -5440,6 +5474,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ bool oom_group = memcg->oom_group;
+
+ seq_printf(m, "%d\n", oom_group);
+
+ return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_group;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_group);
+ if (err)
+ return err;
+
+ memcg->oom_group = oom_group;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5559,6 +5620,12 @@ static struct cftype memory_files[] = {
.seq_show = memory_max_show,
.write = memory_max_write,
},
+ {
+ .name = "oom_group",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_group_show,
+ .write = memory_oom_group_write,
+ },
{
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bcfa92f29407..4678468bae17 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,6 +820,17 @@ static void __oom_kill_process(struct task_struct *victim)
struct mm_struct *mm;
bool can_oom_reap = true;
+ /*
+ * __oom_kill_process() is used to kill all tasks belonging to
+ * the selected memory cgroup, so we should check that we're not
+ * trying to kill an unkillable task.
+ */
+ if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+ victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+ put_task_struct(victim);
+ return;
+ }
+
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -956,21 +967,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
__oom_kill_process(victim);
}
-static bool oom_kill_memcg_victim(struct oom_control *oc)
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
{
+ get_task_struct(task);
+ __oom_kill_process(task);
+ return 0;
+}
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
return oc->chosen_memcg;
- /* Kill a task in the chosen memcg with the biggest memory footprint */
- oc->chosen_points = 0;
- oc->chosen_task = NULL;
- mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
-
- if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
- goto out;
+ /*
+ * If memory.oom_group is set, kill all tasks belonging to the sub-tree
+ * of the chosen memory cgroup, otherwise kill the task with the biggest
+ * memory footprint.
+ */
+ if (mem_cgroup_oom_group(oc->chosen_memcg)) {
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+ NULL);
+ /* We have one or more terminating processes at this point. */
+ oc->chosen_task = INFLIGHT_VICTIM;
+ } else {
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task == NULL ||
+ oc->chosen_task == INFLIGHT_VICTIM)
+ goto out;
- __oom_kill_process(oc->chosen_task);
+ __oom_kill_process(oc->chosen_task);
+ }
out:
mem_cgroup_put(oc->chosen_memcg);
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (3 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-12-01 8:41 ` Michal Hocko
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
` (4 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.
The behavior can be changed dynamically by remounting the cgroupfs.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/cgroup-defs.h | 5 +++++
kernel/cgroup/cgroup.c | 10 ++++++++++
mm/memcontrol.c | 3 +++
3 files changed, 18 insertions(+)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8b7fd8eeccee..9fb99e25d654 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,6 +81,11 @@ enum {
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+ /*
+ * Enable cgroup-aware OOM killer.
+ */
+ CGRP_GROUP_OOM = (1 << 5),
};
/* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0b1ffe147f24..7338e12979e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+ } else if (!strcmp(token, "groupoom")) {
+ *root_flags |= CGRP_GROUP_OOM;
+ continue;
}
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+ if (root_flags & CGRP_GROUP_OOM)
+ cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+ else
+ cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
}
@@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+ if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+ seq_puts(seq, ",groupoom");
return 0;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d27a4bbd478..c76d5fb55c5c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
+ if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+ return false;
+
if (oc->memcg)
root = oc->memcg;
else
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-12-01 8:41 ` Michal Hocko
2017-12-01 13:15 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 8:41 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
>
> The behavior can be changed dynamically by remounting the cgroupfs.
Is it ok to create oom_group if the option is not enabled? This looks
confusing. I forgot all the details about how cgroup core creates file
so I do not have a good idea how to fix this.
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
> include/linux/cgroup-defs.h | 5 +++++
> kernel/cgroup/cgroup.c | 10 ++++++++++
> mm/memcontrol.c | 3 +++
> 3 files changed, 18 insertions(+)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 8b7fd8eeccee..9fb99e25d654 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -81,6 +81,11 @@ enum {
> * Enable cpuset controller in v1 cgroup to use v2 behavior.
> */
> CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> + /*
> + * Enable cgroup-aware OOM killer.
> + */
> + CGRP_GROUP_OOM = (1 << 5),
> };
>
> /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 0b1ffe147f24..7338e12979e1 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
> if (!strcmp(token, "nsdelegate")) {
> *root_flags |= CGRP_ROOT_NS_DELEGATE;
> continue;
> + } else if (!strcmp(token, "groupoom")) {
> + *root_flags |= CGRP_GROUP_OOM;
> + continue;
> }
>
> pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
> cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
> else
> cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> + if (root_flags & CGRP_GROUP_OOM)
> + cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> + else
> + cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
> }
> }
>
> @@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
> {
> if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
> seq_puts(seq, ",nsdelegate");
> + if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> + seq_puts(seq, ",groupoom");
> return 0;
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d27a4bbd478..c76d5fb55c5c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return false;
>
> + if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> + return false;
> +
> if (oc->memcg)
> root = oc->memcg;
> else
> --
> 2.14.3
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-12-01 8:41 ` Michal Hocko
@ 2017-12-01 13:15 ` Roman Gushchin
2017-12-01 13:31 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 13:15 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri, Dec 01, 2017 at 09:41:13AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> >
> > The behavior can be changed dynamically by remounting the cgroupfs.
>
> Is it ok to create oom_group if the option is not enabled? This looks
> confusing. I forgot all the details about how cgroup core creates file
> so I do not have a good idea how to fix this.
I don't think we do show/hide interface files dynamically.
Even for things like socket memory which can be disabled by the boot option,
we don't hide the corresponding stats entry.
So, maybe we just need to return -EAGAIN (or may be -ENOTSUP) on any read/write
attempt if option is not enabled?
Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-12-01 13:15 ` Roman Gushchin
@ 2017-12-01 13:31 ` Michal Hocko
2017-12-01 17:00 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 13:31 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
[...]
> So, maybe we just need to return -EAGAIN (or may be -ENOTSUP) on any read/write
> attempt if option is not enabled?
Yes, that would work as well. ENOTSUP sounds better to me.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-12-01 13:31 ` Michal Hocko
@ 2017-12-01 17:00 ` Roman Gushchin
0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 17:00 UTC (permalink / raw)
To: Michal Hocko
Cc: Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Fri, Dec 01, 2017 at 02:31:45PM +0100, Michal Hocko wrote:
> On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
> [...]
> > So, maybe we just need to return -EAGAIN (or may be -ENOTSUP) on any read/write
> > attempt if option is not enabled?
>
> Yes, that would work as well. ENOTSUP sounds better to me.
> --
> Michal Hocko
> SUSE Labs
From 78bf2c00abf450bcd993d02a7dc1783144005fbd Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 1 Dec 2017 14:30:14 +0000
Subject: [PATCH] mm, oom: return error on access to memory.oom_group if
groupoom is disabled
Cgroup-aware OOM killer depends on cgroup mount option and is turned
off by default, despite the user interface (memory.oom_group file) is
always present. As it might be confusing to a user, let's return
-ENOTSUPP on an attempt to access to memory.oom_group if groupoom is not
enabled globally.
Example:
$ cd /sys/fs/cgroup/user.slice/
$ cat memory.oom_group
cat: memory.oom_group: Unknown error 524
$ echo 1 > memory.oom_group
-bash: echo: write error: Unknown error 524
$ mount -o remount,groupoom /sys/fs/cgroup
$ echo 1 > memory.oom_group
$ cat memory.oom_group
1
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
mm/memcontrol.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c76d5fb55c5c..b709ee4f914b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5482,6 +5482,9 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
+ if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+ return -ENOTSUPP;
+
seq_printf(m, "%d\n", oom_group);
return 0;
@@ -5495,6 +5498,9 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
int oom_group;
int err;
+ if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+ return -ENOTSUPP;
+
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (4 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-12-01 8:41 ` Michal Hocko
2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
` (3 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vladimir Davydov,
Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
Document the cgroup-aware OOM killer.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
Documentation/cgroup-v2.txt | 58 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 779211fbb69f..c80a147f94b7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1026,6 +1027,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.oom_group
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ If set, OOM killer will consider the memory cgroup as an
+ indivisible memory consumers and compare it with other memory
+ consumers by it's memory footprint.
+ If such memory cgroup is selected as an OOM victim, all
+ processes belonging to it or it's descendants will be killed.
+
+ This applies to system-wide OOM conditions and reaching
+ the hard memory limit of the cgroup and their ancestor.
+ If OOM condition happens in a descendant cgroup with it's own
+ memory limit, the memory cgroup can't be considered
+ as an OOM victim, and OOM killer will not kill all belonging
+ tasks.
+
+ Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
+ and will never kill the unkillable task, even if memory.oom_group
+ is set.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
+OOM Killer
+~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be an indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
+Please, note that memory charges are not migrating if tasks
+are moved between different memory cgroups. Moving tasks with
+significant memory footprint may affect OOM victim selection logic.
+If it's a case, please, consider creating a common ancestor for
+the source and destination memory cgroups and enabling oom_group
+on ancestor layer.
+
IO
--
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
@ 2017-12-01 8:41 ` Michal Hocko
2017-12-01 17:01 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 8:41 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
> POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> belonging to the affected files to ensure correct memory ownership.
>
> +OOM Killer
> +~~~~~~~~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats cgroups as first class OOM entities.
This should mention groupoom mount option to enable this functionality.
Other than that looks ok to me
Acked-by: Michal Hocko <mhocko@suse.com>
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-12-01 8:41 ` Michal Hocko
@ 2017-12-01 17:01 ` Roman Gushchin
2017-12-01 17:13 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 17:01 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri, Dec 01, 2017 at 09:41:54AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> > @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
> > POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> > belonging to the affected files to ensure correct memory ownership.
> >
> > +OOM Killer
> > +~~~~~~~~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats cgroups as first class OOM entities.
>
> This should mention groupoom mount option to enable this functionality.
>
> Other than that looks ok to me
> Acked-by: Michal Hocko <mhocko@suse.com>
> --
> Michal Hocko
> SUSE Labs
From 1d9c87128897ee7f27f9651d75b80f73985373e8 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 1 Dec 2017 15:34:59 +0000
Subject: [PATCH] mm, oom, docs: document groupoom mount option
Add a note that cgroup-aware OOM logic is disabled by default
and describe how to enable it.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
Documentation/cgroup-v2.txt | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c80a147f94b7..ff8e92db636d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
and will never kill the unkillable task, even if memory.oom_group
is set.
+ If cgroup-aware OOM killer is not enabled, ENOTSUPP error
+ is returned on attempt to access the file.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1258,6 +1261,12 @@ OOM Killer
Cgroup v2 memory controller implements a cgroup-aware OOM killer.
It means that it treats cgroups as first class OOM entities.
+Cgroup-aware OOM logic is turned off by default and requires
+passing the "groupoom" option on mounting cgroupfs. It can also
+by remounting cgroupfs with the following command::
+
+ # mount -o remount,groupoom $MOUNT_POINT
+
Under OOM conditions the memory controller tries to make the best
choice of a victim, looking for a memory cgroup with the largest
memory footprint, considering leaf cgroups and cgroups with the
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-12-01 17:01 ` Roman Gushchin
@ 2017-12-01 17:13 ` Michal Hocko
0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 17:13 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 01-12-17 17:01:49, Roman Gushchin wrote:
[...]
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index c80a147f94b7..ff8e92db636d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
> and will never kill the unkillable task, even if memory.oom_group
> is set.
>
> + If cgroup-aware OOM killer is not enabled, ENOTSUPP error
> + is returned on attempt to access the file.
> +
> memory.events
> A read-only flat-keyed file which exists on non-root cgroups.
> The following entries are defined. Unless specified
> @@ -1258,6 +1261,12 @@ OOM Killer
> Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> It means that it treats cgroups as first class OOM entities.
>
> +Cgroup-aware OOM logic is turned off by default and requires
> +passing the "groupoom" option on mounting cgroupfs. It can also
> +by remounting cgroupfs with the following command::
> +
> + # mount -o remount,groupoom $MOUNT_POINT
> +
> Under OOM conditions the memory controller tries to make the best
> choice of a victim, looking for a memory cgroup with the largest
> memory footprint, considering leaf cgroups and cgroups with the
Looks good to me
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v13 7/7] cgroup: list groupoom in cgroup features
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (5 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
` (2 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Tejun Heo, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
List groupoom in cgroup features list (exported via
/sys/kernel/cgroup/features), which can be used by a userspace
apps (most likely, systemd) to get an idea which cgroup features
are supported by kernel.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
kernel/cgroup/cgroup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7338e12979e1..693443282fc1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5922,7 +5922,8 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
+ return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
+ "groupoom\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
--
2.14.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (6 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
@ 2017-11-30 20:39 ` Andrew Morton
[not found] ` <20171130123930.cf3217c816fd270fa35a40cb-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
9 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2017-11-30 20:39 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu, 30 Nov 2017 15:28:17 +0000 Roman Gushchin <guro@fb.com> wrote:
> This patchset makes the OOM killer cgroup-aware.
Thanks, I'll grab these.
There has been controversy over this patchset, to say the least. I
can't say that I followed it closely! Could those who still have
reservations please summarise their concerns and hopefully suggest a
way forward?
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (7 preceding siblings ...)
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
@ 2017-12-01 9:14 ` Michal Hocko
2017-12-01 13:26 ` Tetsuo Handa
[not found] ` <20171201091425.ekrpxsmkwcusozua-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
9 siblings, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 9:14 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
Recently added alloc_pages_before_oomkill gained new caller with this
patchset and I think it just grown to deserve a simpler code flow.
What do you think about this on top of the series?
---
From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 1 Dec 2017 10:05:25 +0100
Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
alloc_pages_before_oomkill is the last attempt to allocate memory before
we go and try to kill a process or a memcg. It's success path always has
to properly clean up the oc state (namely victim reference count). Let's
pull this into alloc_pages_before_oomkill directly rather than risk
somebody will forget to do it in future. Also document that we _know_
alloc_pages_before_oomkill violates proper layering and that is a
pragmatic decision.
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
include/linux/oom.h | 2 +-
mm/oom_kill.c | 21 +++------------------
mm/page_alloc.c | 24 ++++++++++++++++++++++--
3 files changed, 26 insertions(+), 21 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 10f495c8454d..7052e0a20e13 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+extern bool alloc_pages_before_oomkill(struct oom_control *oc);
extern int oom_evaluate_task(struct task_struct *task, void *arg);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4678468bae17..5c2cd299757b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
- oc->page = alloc_pages_before_oomkill(oc);
- if (oc->page)
+ if (alloc_pages_before_oomkill(oc))
return true;
get_task_struct(current);
oc->chosen_task = current;
@@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
}
if (mem_cgroup_select_oom_victim(oc)) {
- oc->page = alloc_pages_before_oomkill(oc);
- if (oc->page) {
- if (oc->chosen_memcg &&
- oc->chosen_memcg != INFLIGHT_VICTIM)
- mem_cgroup_put(oc->chosen_memcg);
+ if (alloc_pages_before_oomkill(oc))
return true;
- }
if (oom_kill_memcg_victim(oc)) {
delay = true;
@@ -1127,17 +1121,8 @@ bool out_of_memory(struct oom_control *oc)
}
select_bad_process(oc);
- /*
- * Try really last second allocation attempt after we selected an OOM
- * victim, for somebody might have managed to free memory while we were
- * selecting an OOM victim which can take quite some time.
- */
- oc->page = alloc_pages_before_oomkill(oc);
- if (oc->page) {
- if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
- put_task_struct(oc->chosen_task);
+ if (alloc_pages_before_oomkill(oc))
return true;
- }
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d518e9b2ee8..9e65fa06ee10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4146,7 +4146,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
return page;
}
-struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
+/*
+ * Try really last second allocation attempt after we selected an OOM victim,
+ * for somebody might have managed to free memory while we were selecting an
+ * OOM victim which can take quite some time.
+ *
+ * This function is a blatant layer violation example because we cross the page
+ * allocator and reclaim (OOM killer) layers. Doing this right from the design
+ * POV would be much more complex though so let's close our eyes and pretend
+ * everything is allright.
+ */
+bool alloc_pages_before_oomkill(struct oom_control *oc)
{
/*
* Go through the zonelist yet one more time, keep very high watermark
@@ -4167,7 +4177,17 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
alloc_flags = reserve_flags;
- return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+ oc->page = get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+ if (!oc->page)
+ return false;
+
+ /*
+ * we are skipping the remaining oom steps so clean up the pending oc
+ * state
+ */
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
+ put_task_struct(oc->chosen_task);
+ return true;
}
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
--
2.15.0
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
@ 2017-12-01 13:26 ` Tetsuo Handa
[not found] ` <20171201091425.ekrpxsmkwcusozua-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
1 sibling, 0 replies; 51+ messages in thread
From: Tetsuo Handa @ 2017-12-01 13:26 UTC (permalink / raw)
To: mhocko, guro
Cc: linux-mm, vdavydov.dev, hannes, rientjes, akpm, tj, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
Michal Hocko wrote:
> Recently added alloc_pages_before_oomkill gained new caller with this
> patchset and I think it just grown to deserve a simpler code flow.
> What do you think about this on top of the series?
I'm planning to post below patch in order to mitigate OOM lockup problem
caused by scheduling priority. But I'm OK with your patch because your patch
will not conflict with below patch.
----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2746a7..ef6e951 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3332,13 +3332,14 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
*did_some_progress = 0;
/*
- * Acquire the oom lock. If that fails, somebody else is
- * making progress for us.
+ * Acquire the oom lock. If that fails, give enough CPU time to the
+ * owner of the oom lock in order to help reclaiming memory.
*/
- if (!mutex_trylock(&oom_lock)) {
- *did_some_progress = 1;
+ while (!mutex_trylock(&oom_lock)) {
+ page = alloc_pages_before_oomkill(oc);
+ if (page)
+ return page;
schedule_timeout_uninterruptible(1);
- return NULL;
}
/* Coredumps can quickly deplete all memory reserves */
----------
^ permalink raw reply related [flat|nested] 51+ messages in thread[parent not found: <20171201091425.ekrpxsmkwcusozua-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
[not found] ` <20171201091425.ekrpxsmkwcusozua-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-12-01 13:32 ` Roman Gushchin
2017-12-01 13:54 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 13:32 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm-u79uwXL29TY76Z2rM5mHXA, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton,
Tejun Heo, kernel-team-b10kYP2dOMg,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg
Hi, Michal!
I totally agree that out_of_memory() function deserves some refactoring.
But I think there is an issue with your patch (see below):
On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> Recently added alloc_pages_before_oomkill gained new caller with this
> patchset and I think it just grown to deserve a simpler code flow.
> What do you think about this on top of the series?
>
> ---
> From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Date: Fri, 1 Dec 2017 10:05:25 +0100
> Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
>
> alloc_pages_before_oomkill is the last attempt to allocate memory before
> we go and try to kill a process or a memcg. It's success path always has
> to properly clean up the oc state (namely victim reference count). Let's
> pull this into alloc_pages_before_oomkill directly rather than risk
> somebody will forget to do it in future. Also document that we _know_
> alloc_pages_before_oomkill violates proper layering and that is a
> pragmatic decision.
>
> Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> ---
> include/linux/oom.h | 2 +-
> mm/oom_kill.c | 21 +++------------------
> mm/page_alloc.c | 24 ++++++++++++++++++++++--
> 3 files changed, 26 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 10f495c8454d..7052e0a20e13 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
>
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> -extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
> +extern bool alloc_pages_before_oomkill(struct oom_control *oc);
>
> extern int oom_evaluate_task(struct task_struct *task, void *arg);
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4678468bae17..5c2cd299757b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
> if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> - oc->page = alloc_pages_before_oomkill(oc);
> - if (oc->page)
> + if (alloc_pages_before_oomkill(oc))
> return true;
> get_task_struct(current);
> oc->chosen_task = current;
> @@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
> }
>
> if (mem_cgroup_select_oom_victim(oc)) {
> - oc->page = alloc_pages_before_oomkill(oc);
> - if (oc->page) {
> - if (oc->chosen_memcg &&
> - oc->chosen_memcg != INFLIGHT_VICTIM)
> - mem_cgroup_put(oc->chosen_memcg);
You're removing chosen_memcg releasing here, but I don't see where you
do this instead. And I'm not sure that putting mem_cgroup_put() into
alloc_pages_before_oomkill() is a way towards simpler code.
I was thinking about a bit larger refactoring: splitting out_of_memory()
into the following parts (defined as separate functions): victim selection
(per-process, memcg-aware or just allocating task), last allocation attempt,
OOM action (kill process, kill memcg, panic). Hopefully it can simplify the things,
but I don't have code yet.
Thanks!
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-12-01 13:32 ` Roman Gushchin
@ 2017-12-01 13:54 ` Michal Hocko
0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 13:54 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 01-12-17 13:32:15, Roman Gushchin wrote:
> Hi, Michal!
>
> I totally agree that out_of_memory() function deserves some refactoring.
>
> But I think there is an issue with your patch (see below):
>
> On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> > Recently added alloc_pages_before_oomkill gained new caller with this
> > patchset and I think it just grown to deserve a simpler code flow.
> > What do you think about this on top of the series?
> >
> > ---
[...]
> > @@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
> > }
> >
> > if (mem_cgroup_select_oom_victim(oc)) {
> > - oc->page = alloc_pages_before_oomkill(oc);
> > - if (oc->page) {
> > - if (oc->chosen_memcg &&
> > - oc->chosen_memcg != INFLIGHT_VICTIM)
> > - mem_cgroup_put(oc->chosen_memcg);
>
> You're removing chosen_memcg releasing here, but I don't see where you
> do this instead. And I'm not sure that putting mem_cgroup_put() into
> alloc_pages_before_oomkill() is a way towards simpler code.
Dohh, I though I did. But obviously it is not there.
> I was thinking about a bit larger refactoring: splitting out_of_memory()
> into the following parts (defined as separate functions): victim selection
> (per-process, memcg-aware or just allocating task), last allocation attempt,
> OOM action (kill process, kill memcg, panic). Hopefully it can simplify the things,
> but I don't have code yet.
OK, I will not push if you have further plans of course. This just hit
my eyes...
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (8 preceding siblings ...)
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
@ 2018-06-05 11:47 ` Michal Hocko
2018-06-05 12:13 ` Michal Hocko
2018-07-13 21:59 ` David Rientjes
9 siblings, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2018-06-05 11:47 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
It seems that this is still in limbo mostly because of David's concerns.
So let me reiterate them and provide my POV once more (and the last
time) just to help Andrew make a decision:
1) comparision root with tail memcgs during the OOM killer is not fair
because we are comparing tasks with memcgs.
This is true, but I do not think this matters much for workloads which
are going to use the feature. Why? Because the main consumers of the new
feature seem to be containers which really need some fairness when
comparing _workloads_ rather than processes. Those are unlikely to
contain any significant memory consumers in the root memcg. That would
be mostly common infrastructure.
Is this is fixable? Yes, we would need to account in the root memcgs.
Why are we not doing that now? Because it has some negligible
performance overhead. Are there other ways? Yes we can approximate root
memcg memory consumption but I would rather wait for somebody seeing
that as a real problem rather than add hacks now without a strong
reason.
2) Evading the oom killer by attaching processes to child cgroups which
basically means that a task can split up the workload into smaller
memcgs to hide their real memory consumption.
Again true but not really anything new. Processes can already fork and
split up the memory consumption. Moreover it doesn't even require any
special privileges to do so unlike creating a sub memcg. Is this
fixable? Yes, untrusted workloads can setup group oom evaluation at the
delegation layer so all subgroups would be considered together.
3) Userspace has zero control over oom kill selection in leaf mem
cgroups.
Again true but this is something that needs a good evaluation to not end
up in the fiasko we have seen with oom_score*. Current users demanding
this feature can live without any prioritization so blocking the whole
feature seems unreasonable.
4) Future extensibility to be backward compatible.
David is wrong here IMHO. Any prioritization or oom selection policy
controls added in future are orthogonal to the oom_group concept added
by this patchset. Allowing memcg to be an oom entity is something that
we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
semantic and softening it will be possible by a adding a new knob to
tell whether a memcg/hierarchy is a workload or a set of tasks.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
@ 2018-06-05 12:13 ` Michal Hocko
2018-07-13 21:59 ` David Rientjes
1 sibling, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2018-06-05 12:13 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Tue 05-06-18 13:47:29, Michal Hocko wrote:
> It seems that this is still in limbo mostly because of David's concerns.
> So let me reiterate them and provide my POV once more (and the last
> time) just to help Andrew make a decision:
Sorry, I forgot to add reference to the email with the full David's
reasoning. Here it is http://lkml.kernel.org/r/alpine.DEB.2.10.1801091556490.173445@chino.kir.corp.google.com
> 1) comparision root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>
> Is this is fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>
>
> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
>
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can setup group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true but this is something that needs a good evaluation to not end
> up in the fiasko we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization so blocking the whole
> feature seems unreasonable.
>
> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic and softening it will be possible by a adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
2018-06-05 12:13 ` Michal Hocko
@ 2018-07-13 21:59 ` David Rientjes
2018-07-14 1:55 ` Tetsuo Handa
2018-07-16 9:36 ` Michal Hocko
1 sibling, 2 replies; 51+ messages in thread
From: David Rientjes @ 2018-07-13 21:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Tue, 5 Jun 2018, Michal Hocko wrote:
> 1) comparision root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>
There are users (us) who want to use the feature and not all processes are
attached to leaf mem cgroups. The functionality can be provided in a
generally useful way that doesn't require any specific hierarchy, and I
implemented this in my patch series at
https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal to fix
*all* of my concerns with the cgroup-aware oom killer as it sits in -mm,
in it's entirety, only extends it so it is generally useful and does not
remove any functionality. I'm not sure why we are discussing ways of
moving forward when that patchset has been waiting for review for almost
four months and, to date, I haven't seen an objection to.
I don't know why we cannot agree on making solutions generally useful nor
why that patchset has not been merged into -mm so that the whole feature
can be merged. It's baffling. This is the first time I've encountered a
perceived stalemate when there is a patchset sitting, unreviewed, that
fixes all of the concerns that there are about the implementation sitting
in -mm.
This isn't a criticism just of comparing processes attached to root
differently than leaf mem cgroups, it's how oom_score_adj influences that
decision. It's trivial for a very small consumer (not "significant"
memory consumer, as you put it) to require an oom kill from root instead
of a leaf mem cgroup. I show in
https://marc.info/?l=linux-mm&m=152175564104468&w=2 that changing the
oom_score_adj of my bash shell attached to the root mem cgroup is
considered equal to a 95GB leaf mem cgroup with the current
implementation.
> Is this is fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>
I fixed this in https://marc.info/?t=152175564500007&r=1&w=2, and from
what I remmeber Roman actually liked it.
> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
>
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can setup group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>
Processes being able to fork to split up memory consumption is also fixed
by https://marc.info/?l=linux-mm&m=152175564104467 just as creating
subcontainers to intentionally or unintentionally subverting the oom
policy is fixed. It solves both problems.
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true but this is something that needs a good evaluation to not end
> up in the fiasko we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization so blocking the whole
> feature seems unreasonable.
>
One objection here is how the oom_score_adj of a process means something
or doesn't mean something depending on what cgroup it is attached to. The
cgroup-aware oom killer is cgroup aware. oom_score_adj should play no
part. I fixed this with https://marc.info/?t=152175564500007&r=1&w=2.
The other objection is that users do have cgroups that shouldn't be oom
killed because they are important, either because they are required to
provide a service for a smaller cgroup or because of business goals. We
have cgroups that use more than half of system memory and they are allowed
to do so because they are important. We would love to be able to bias
against that cgroup to prefer others, or prefer cgroups for oom kill
because they are less important. This was done for processes with
oom_score_adj, we need it for a cgroup aware oom killer for the same
reason.
But notice even in https://marc.info/?l=linux-mm&m=152175563004458&w=2
that I said priority or adjustment can be added on top of the feature
after it's merged. This itself is not precluding anything from being
merged.
> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic and softening it will be possible by a adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
I've always said that the mechanism and policy in this patchset should be
separated. I do that exact thing in
https://marc.info/?l=linux-mm&m=152175564304469&w=2. I suggest that
different subtrees will want (or the admin will require) different
behaviors with regard to the mechanism.
I've stated the problems (and there are others wrt mempolicy selection)
that the current implementation has and given a full solution at
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that has not been
reviewed. I would love feedback from anybody on this thread on that. I'm
not trying to preclude the cgroup-aware oom killer from being merged, I'm
the only person actively trying to get it merged.
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-13 21:59 ` David Rientjes
@ 2018-07-14 1:55 ` Tetsuo Handa
2018-07-16 21:13 ` Tetsuo Handa
2018-07-16 9:36 ` Michal Hocko
1 sibling, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-14 1:55 UTC (permalink / raw)
To: David Rientjes, Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On 2018/07/14 6:59, David Rientjes wrote:
> I'm not trying to preclude the cgroup-aware oom killer from being merged,
> I'm the only person actively trying to get it merged.
Before merging the cgroup-aware oom killer, can we merge OOM lockup fixes
and my cleanup? The gap between linux.git and linux-next.git keeps us unable
to use agreed baseline.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-14 1:55 ` Tetsuo Handa
@ 2018-07-16 21:13 ` Tetsuo Handa
2018-07-16 22:09 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-16 21:13 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Michal Hocko, Roman Gushchin, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
No response from Roman and David...
Andrew, will you once drop Roman's cgroup-aware OOM killer and David's patches?
Roman's series has a bug which I mentioned and which can be avoided by my patch.
David's patch is using MMF_UNSTABLE incorrectly such that it might start selecting
next OOM victim without trying to reclaim any memory.
Since they are not responding to my mail, I suggest once dropping from linux-next.
https://www.spinics.net/lists/linux-mm/msg153212.html
https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u
On 2018/07/14 10:55, Tetsuo Handa wrote:
> On 2018/07/14 6:59, David Rientjes wrote:
>> I'm not trying to preclude the cgroup-aware oom killer from being merged,
>> I'm the only person actively trying to get it merged.
>
> Before merging the cgroup-aware oom killer, can we merge OOM lockup fixes
> and my cleanup? The gap between linux.git and linux-next.git keeps us unable
> to use agreed baseline.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-16 21:13 ` Tetsuo Handa
@ 2018-07-16 22:09 ` Roman Gushchin
2018-07-17 0:55 ` Tetsuo Handa
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-07-16 22:09 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> No response from Roman and David...
>
> Andrew, will you once drop Roman's cgroup-aware OOM killer and David's patches?
> Roman's series has a bug which I mentioned and which can be avoided by my patch.
> David's patch is using MMF_UNSTABLE incorrectly such that it might start selecting
> next OOM victim without trying to reclaim any memory.
>
> Since they are not responding to my mail, I suggest once dropping from linux-next.
I was in cc, and didn't thought that you're expecting something from me.
I don't get, why it's necessary to drop the cgroup oom killer to merge your fix?
I'm happy to help with rebasing and everything else.
Thanks,
Roman
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-16 22:09 ` Roman Gushchin
@ 2018-07-17 0:55 ` Tetsuo Handa
2018-07-31 14:14 ` Tetsuo Handa
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-17 0:55 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> > No response from Roman and David...
> >
> > Andrew, will you once drop Roman's cgroup-aware OOM killer and David's patches?
> > Roman's series has a bug which I mentioned and which can be avoided by my patch.
> > David's patch is using MMF_UNSTABLE incorrectly such that it might start selecting
> > next OOM victim without trying to reclaim any memory.
> >
> > Since they are not responding to my mail, I suggest once dropping from linux-next.
>
> I was in cc, and didn't thought that you're expecting something from me.
Oops. I was waiting for your response. ;-)
But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM killer" patch
in linux-next. And it seems to me that your patch contains a bug which leads to
premature memory allocation failure explained below.
Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer" patch
(which eliminates "delay" and "out:" from your patch) so that people can easily
backport my patch? Or, do you want to apply a fix (which eliminates "delay" and
"out:" from linux-next) prior to my patch?
>
> I don't get, why it's necessary to drop the cgroup oom killer to merge your fix?
> I'm happy to help with rebasing and everything else.
Yes, I wish you rebase your series on top of OOM lockup (CVE-2016-10723) mitigation
patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
and easy to cleanly backport (if applied before your series).
Also, I expect you to check whether my cleanup patch which removes "abort" path
( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
simplifying your series. I don't know detailed behavior of your series, but I
assume that your series do not kill threads which current thread should not wait
for MMF_OOM_SKIP.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-17 0:55 ` Tetsuo Handa
@ 2018-07-31 14:14 ` Tetsuo Handa
2018-08-01 16:37 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-31 14:14 UTC (permalink / raw)
To: Roman Gushchin, Andrew Morton
Cc: David Rientjes, Michal Hocko, linux-mm, Vladimir Davydov,
Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On 2018/07/17 9:55, Tetsuo Handa wrote:
>> I don't get, why it's necessary to drop the cgroup oom killer to merge your fix?
>> I'm happy to help with rebasing and everything else.
>
> Yes, I wish you rebase your series on top of OOM lockup (CVE-2016-10723) mitigation
> patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
> and easy to cleanly backport (if applied before your series).
>
> Also, I expect you to check whether my cleanup patch which removes "abort" path
> ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> simplifying your series. I don't know detailed behavior of your series, but I
> assume that your series do not kill threads which current thread should not wait
> for MMF_OOM_SKIP.
syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
I can't tell what change is triggering this race. Maybe removal of oom_lock from
the oom reaper made more likely to hit. But anyway I suspect that
static bool oom_kill_memcg_victim(struct oom_control *oc)
{
if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
return oc->chosen_memcg; // <= This line is still broken
because
/* We have one or more terminating processes at this point. */
oc->chosen_task = INFLIGHT_VICTIM;
is not called.
Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
with oom_lock held.
Can we temporarily drop cgroup-aware OOM killer from linux-next.git and
apply my cleanup patch? Since the merge window is approaching, I really want to
see how next -rc1 would look like...
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-31 14:14 ` Tetsuo Handa
@ 2018-08-01 16:37 ` Roman Gushchin
2018-08-01 22:01 ` Tetsuo Handa
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-08-01 16:37 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> On 2018/07/17 9:55, Tetsuo Handa wrote:
> >> I don't get, why it's necessary to drop the cgroup oom killer to merge your fix?
> >> I'm happy to help with rebasing and everything else.
> >
> > Yes, I wish you rebase your series on top of OOM lockup (CVE-2016-10723) mitigation
> > patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
> > and easy to cleanly backport (if applied before your series).
> >
> > Also, I expect you to check whether my cleanup patch which removes "abort" path
> > ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> > simplifying your series. I don't know detailed behavior of your series, but I
> > assume that your series do not kill threads which current thread should not wait
> > for MMF_OOM_SKIP.
>
> syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
> https://urldefense.proofpoint.com/v2/url?u=https-3A__syzkaller.appspot.com_bug-3Fid-3Dea8c7912757d253537375e981b61749b2da69258&d=DwICJg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=h9FJRAWtCmDLT-cVwvXKCYIUVRSrD--0XFJE-OnNY64&s=If6eFu6MlYjnfLXeg5_S-3tuhCZhSMv8_qfSrMfwOQ0&e=
>
> I can't tell what change is triggering this race. Maybe removal of oom_lock from
> the oom reaper made more likely to hit. But anyway I suspect that
>
> static bool oom_kill_memcg_victim(struct oom_control *oc)
> {
> if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> return oc->chosen_memcg; // <= This line is still broken
>
> because
>
> /* We have one or more terminating processes at this point. */
> oc->chosen_task = INFLIGHT_VICTIM;
>
> is not called.
>
> Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
> with oom_lock held.
>
> Can we temporarily drop cgroup-aware OOM killer from linux-next.git and
> apply my cleanup patch? Since the merge window is approaching, I really want to
> see how next -rc1 would look like...
Hi Tetsuo!
Has this cleanup patch been acked by somebody?
Which problem does it solve?
Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
Anyway, there is a good chance that current cgroup-aware OOM killer
implementation will be replaced by a lightweight version (memory.oom.group).
Please, take a look at it, probably your cleanup will not conflict with it
at all.
Thanks!
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-08-01 16:37 ` Roman Gushchin
@ 2018-08-01 22:01 ` Tetsuo Handa
2018-08-01 22:55 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-08-01 22:01 UTC (permalink / raw)
To: Roman Gushchin, Andrew Morton
Cc: David Rientjes, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
linux-mm
On 2018/08/02 1:37, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
>> Can we temporarily drop cgroup-aware OOM killer from linux-next.git and
>> apply my cleanup patch? Since the merge window is approaching, I really want to
>> see how next -rc1 would look like...
>
> Hi Tetsuo!
>
> Has this cleanup patch been acked by somebody?
Not yet. But since Michal considers this cleanup as "a nice shortcut"
( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume that
I will get an ACK regarding this cleanup.
> Which problem does it solve?
It simplifies tricky out_of_memory() return value decision, and
it also fixes a bug in your series which syzbot is pointing out.
> Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
What I need is a git tree which I can use as a baseline for making this cleanup.
linux.git is not suitable because it does not include Michal's fix, but
linux-next.git is not suitable because Michal's fix is overwritten by your series.
I want a git tree which includes Michal's fix and does not include your series.
>
> Anyway, there is a good chance that current cgroup-aware OOM killer
> implementation will be replaced by a lightweight version (memory.oom.group).
> Please, take a look at it, probably your cleanup will not conflict with it
> at all.
Then, please drop current cgroup-aware OOM killer implementation from linux-next.git .
I want to see how next -rc1 would look like (for testing purpose) and want to use
linux-next.git as a baseline (for making this cleanup).
>
> Thanks!
>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-08-01 22:01 ` Tetsuo Handa
@ 2018-08-01 22:55 ` Roman Gushchin
0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2018-08-01 22:55 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Thu, Aug 02, 2018 at 07:01:28AM +0900, Tetsuo Handa wrote:
> On 2018/08/02 1:37, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> >> Can we temporarily drop cgroup-aware OOM killer from linux-next.git and
> >> apply my cleanup patch? Since the merge window is approaching, I really want to
> >> see how next -rc1 would look like...
> >
> > Hi Tetsuo!
> >
> > Has this cleanup patch been acked by somebody?
>
> Not yet. But since Michal considers this cleanup as "a nice shortcut"
> ( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume that
> I will get an ACK regarding this cleanup.
>
> > Which problem does it solve?
>
> It simplifies tricky out_of_memory() return value decision, and
> it also fixes a bug in your series which syzbot is pointing out.
>
> > Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
>
> What I need is a git tree which I can use as a baseline for making this cleanup.
> linux.git is not suitable because it does not include Michal's fix, but
> linux-next.git is not suitable because Michal's fix is overwritten by your series.
> I want a git tree which includes Michal's fix and does not include your series.
>
> >
> > Anyway, there is a good chance that current cgroup-aware OOM killer
> > implementation will be replaced by a lightweight version (memory.oom.group).
> > Please, take a look at it, probably your cleanup will not conflict with it
> > at all.
>
> Then, please drop current cgroup-aware OOM killer implementation from linux-next.git .
> I want to see how next -rc1 would look like (for testing purpose) and want to use
> linux-next.git as a baseline (for making this cleanup).
I'll post memory.oom.group v2 later today, and if there will be no objections,
I'll ask Andrew to drop current memcg-aware OOM killer and replace it
with lightweight memory.oom.group.
These changes will be picked by linux-next in few days.
Thanks!
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-13 21:59 ` David Rientjes
2018-07-14 1:55 ` Tetsuo Handa
@ 2018-07-16 9:36 ` Michal Hocko
2018-07-17 3:59 ` David Rientjes
1 sibling, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2018-07-16 9:36 UTC (permalink / raw)
To: David Rientjes
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 13-07-18 14:59:59, David Rientjes wrote:
> On Tue, 5 Jun 2018, Michal Hocko wrote:
>
> > 1) comparision root with tail memcgs during the OOM killer is not fair
> > because we are comparing tasks with memcgs.
> >
> > This is true, but I do not think this matters much for workloads which
> > are going to use the feature. Why? Because the main consumers of the new
> > feature seem to be containers which really need some fairness when
> > comparing _workloads_ rather than processes. Those are unlikely to
> > contain any significant memory consumers in the root memcg. That would
> > be mostly common infrastructure.
> >
>
> There are users (us) who want to use the feature and not all processes are
> attached to leaf mem cgroups. The functionality can be provided in a
> generally useful way that doesn't require any specific hierarchy, and I
> implemented this in my patch series at
> https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal to fix
> *all* of my concerns with the cgroup-aware oom killer as it sits in -mm,
> in it's entirety, only extends it so it is generally useful and does not
> remove any functionality. I'm not sure why we are discussing ways of
> moving forward when that patchset has been waiting for review for almost
> four months and, to date, I haven't seen an objection to.
Well, I didn't really get to your patches yet. The last time I've
checked I had some pretty serious concerns about the consistency of your
proposal. Those might have been fixed in the lastest version of your
patchset I haven't seen. But I still strongly suspect that you are
largerly underestimating the complexity of more generic oom policies
which you are heading to.
Considering user API failures from the past (oom_*adj fiasco for
example) suggests that we should start with smaller steps and only
provide a clear and simple API. oom_group is such a simple and
semantically consistent thing which is the reason I am OK with it much
more than your "we can be more generic" approach. I simply do not trust
we can agree on sane and consistent api in a reasonable time.
And it is quite mind boggling that a simpler approach has been basically
blocked for months because there are some concerns for workloads which
are not really asking for the feature. Sure your usecase might need to
handle root memcg differently. That is a fair point but that shouldn't
really block containers users who can use the proposed solution without
any further changes. If we ever decide to handle root memcg differently
we are free to do so because the oom selection policy is not carved in
stone by any api.
[...]
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-16 9:36 ` Michal Hocko
@ 2018-07-17 3:59 ` David Rientjes
0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-07-17 3:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Mon, 16 Jul 2018, Michal Hocko wrote:
> Well, I didn't really get to your patches yet. The last time I've
> checked I had some pretty serious concerns about the consistency of your
> proposal. Those might have been fixed in the lastest version of your
> patchset I haven't seen. But I still strongly suspect that you are
> largerly underestimating the complexity of more generic oom policies
> which you are heading to.
>
I don't believe it's underestimated since it's used. It's perfectly valid
the lock an entire hierarchy or individual subtrees into a single policy
if that's what is preferred. Any use of a different policy at a subtree
root is a conscious decision made by the owner of that subtree. If they
prefer to kill the largest process, the largest descendant cgroup, or the
largest subtree, it is up to them. All three have valid usecases, the
goal is not to lock the entire hierarchy into a single policy: this
introduces the ability for users to subvert the selection policy either
intentionally or unintentionally because they are using a unified single
hierarchy with cgroup v2 and they are using controllers other than mem
cgroup.
> Considering user API failures from the past (oom_*adj fiasco for
> example) suggests that we should start with smaller steps and only
> provide a clear and simple API. oom_group is such a simple and
> semantically consistent thing which is the reason I am OK with it much
> more than your "we can be more generic" approach. I simply do not trust
> we can agree on sane and consistent api in a reasonable time.
>
> And it is quite mind boggling that a simpler approach has been basically
> blocked for months because there are some concerns for workloads which
> are not really asking for the feature. Sure your usecase might need to
> handle root memcg differently. That is a fair point but that shouldn't
> really block containers users who can use the proposed solution without
> any further changes. If we ever decide to handle root memcg differently
> we are free to do so because the oom selection policy is not carved in
> stone by any api.
>
Please respond directly to the patchset which clearly enumerates the
problems with the current implementation in -mm.
^ permalink raw reply [flat|nested] 51+ messages in thread