* [v7 1/5] mm, oom: refactor the oom_kill_process() function
2017-09-04 14:21 [v7 0/5] cgroup-aware OOM killer Roman Gushchin
@ 2017-09-04 14:21 ` Roman Gushchin
2017-09-05 13:34 ` Michal Hocko
2017-09-04 14:21 ` [v7 2/5] mm, oom: cgroup-aware OOM killer Roman Gushchin
` (3 subsequent siblings)
4 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2017-09-04 14:21 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.
This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().
The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 99736e026712..f061b627092c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
#undef K
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just give it access to memory reserves
+ * so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.13.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [v7 1/5] mm, oom: refactor the oom_kill_process() function
2017-09-04 14:21 ` [v7 1/5] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-09-05 13:34 ` Michal Hocko
0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-09-05 13:34 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Mon 04-09-17 15:21:04, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
>
> This commit splits the oom_kill_process() function with
> an intention to re-use the the second half: __oom_kill_process().
>
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for the each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
Looks good to me
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
> 1 file changed, 65 insertions(+), 58 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 99736e026712..f061b627092c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
> return ret;
> }
>
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)
> {
> - struct task_struct *p = oc->chosen;
> - unsigned int points = oc->chosen_points;
> - struct task_struct *victim = p;
> - struct task_struct *child;
> - struct task_struct *t;
> + struct task_struct *p;
> struct mm_struct *mm;
> - unsigned int victim_points = 0;
> - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> - DEFAULT_RATELIMIT_BURST);
> bool can_oom_reap = true;
>
> - /*
> - * If the task is already exiting, don't alarm the sysadmin or kill
> - * its children or threads, just give it access to memory reserves
> - * so it can die quickly
> - */
> - task_lock(p);
> - if (task_will_free_mem(p)) {
> - mark_oom_victim(p);
> - wake_oom_reaper(p);
> - task_unlock(p);
> - put_task_struct(p);
> - return;
> - }
> - task_unlock(p);
> -
> - if (__ratelimit(&oom_rs))
> - dump_header(oc, p);
> -
> - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> - message, task_pid_nr(p), p->comm, points);
> -
> - /*
> - * If any of p's children has a different mm and is eligible for kill,
> - * the one with the highest oom_badness() score is sacrificed for its
> - * parent. This attempts to lose the minimal amount of work done while
> - * still freeing memory.
> - */
> - read_lock(&tasklist_lock);
> - for_each_thread(p, t) {
> - list_for_each_entry(child, &t->children, sibling) {
> - unsigned int child_points;
> -
> - if (process_shares_mm(child, p->mm))
> - continue;
> - /*
> - * oom_badness() returns 0 if the thread is unkillable
> - */
> - child_points = oom_badness(child,
> - oc->memcg, oc->nodemask, oc->totalpages);
> - if (child_points > victim_points) {
> - put_task_struct(victim);
> - victim = child;
> - victim_points = child_points;
> - get_task_struct(victim);
> - }
> - }
> - }
> - read_unlock(&tasklist_lock);
> -
> p = find_lock_task_mm(victim);
> if (!p) {
> put_task_struct(victim);
> @@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> }
> #undef K
>
> +static void oom_kill_process(struct oom_control *oc, const char *message)
> +{
> + struct task_struct *p = oc->chosen;
> + unsigned int points = oc->chosen_points;
> + struct task_struct *victim = p;
> + struct task_struct *child;
> + struct task_struct *t;
> + unsigned int victim_points = 0;
> + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
> +
> + /*
> + * If the task is already exiting, don't alarm the sysadmin or kill
> + * its children or threads, just give it access to memory reserves
> + * so it can die quickly
> + */
> + task_lock(p);
> + if (task_will_free_mem(p)) {
> + mark_oom_victim(p);
> + wake_oom_reaper(p);
> + task_unlock(p);
> + put_task_struct(p);
> + return;
> + }
> + task_unlock(p);
> +
> + if (__ratelimit(&oom_rs))
> + dump_header(oc, p);
> +
> + pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> + message, task_pid_nr(p), p->comm, points);
> +
> + /*
> + * If any of p's children has a different mm and is eligible for kill,
> + * the one with the highest oom_badness() score is sacrificed for its
> + * parent. This attempts to lose the minimal amount of work done while
> + * still freeing memory.
> + */
> + read_lock(&tasklist_lock);
> + for_each_thread(p, t) {
> + list_for_each_entry(child, &t->children, sibling) {
> + unsigned int child_points;
> +
> + if (process_shares_mm(child, p->mm))
> + continue;
> + /*
> + * oom_badness() returns 0 if the thread is unkillable
> + */
> + child_points = oom_badness(child,
> + oc->memcg, oc->nodemask, oc->totalpages);
> + if (child_points > victim_points) {
> + put_task_struct(victim);
> + victim = child;
> + victim_points = child_points;
> + get_task_struct(victim);
> + }
> + }
> + }
> + read_unlock(&tasklist_lock);
> +
> + __oom_kill_process(victim);
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> --
> 2.13.5
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 44+ messages in thread
* [v7 2/5] mm, oom: cgroup-aware OOM killer
2017-09-04 14:21 [v7 0/5] cgroup-aware OOM killer Roman Gushchin
2017-09-04 14:21 ` [v7 1/5] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-09-04 14:21 ` Roman Gushchin
[not found] ` <20170904142108.7165-3-guro-b10kYP2dOMg@public.gmane.org>
2017-09-04 14:21 ` [v7 3/5] mm, oom: introduce oom_priority for memory cgroups Roman Gushchin
` (2 subsequent siblings)
4 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2017-09-04 14:21 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.
This behavior doesn't suit well the system with many running
containers:
1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.
2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.
3) Per-process oom_score_adj affects global OOM, so it's a breache
in the isolation.
To address these issues, cgroup-aware OOM killer is introduced.
Under OOM conditions, it tries to find the biggest memory consumer,
and free memory by killing corresponding task(s). The difference
the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.
By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.
But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer treat
the whole cgroup as an indivisible memory consumer. In case if it's
selected as on OOM victim, all belonging tasks will be killed.
Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 33 +++++++
include/linux/oom.h | 12 ++-
mm/memcontrol.c | 240 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 92 ++++++++++++++---
4 files changed, 362 insertions(+), 15 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..5b5c2b89968e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* kill all tasks in the subtree in case of OOM */
+ bool oom_group;
+
+ /* cached OOM score */
+ long oom_score;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -480,6 +492,13 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
bool mem_cgroup_oom_synchronize(bool wait);
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return memcg->oom_group;
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -744,6 +763,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +959,16 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
#include <linux/sched/coredump.h> /* MMF_* */
#include <linux/mm.h> /* VM_FAULT* */
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
struct zonelist;
struct notifier_block;
struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
/* Used by oom implementation, do not set */
unsigned long totalpages;
- struct task_struct *chosen;
+ struct task_struct *chosen_task;
+ struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
};
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a69d23082abf..97813c56163b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2649,6 +2649,213 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the folowing:
+ * -1: there are inflight OOM victim tasks, belonging to the memcg
+ * 0: memcg is not eligible, e.g. all belonging tasks are protected
+ * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * >0: memcg is eligible, and the returned value is an estimation
+ * of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int eligible = 0;
+
+ /*
+ * Memcg is OOM eligible if there are OOM killable tasks inside.
+ *
+ * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * as unkillable.
+ *
+ * If there are inflight OOM victim tasks inside the memcg,
+ * we return -1.
+ */
+ css_task_iter_start(&memcg->css, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (!eligible &&
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ eligible = 1;
+
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ eligible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ if (eligible <= 0)
+ return eligible;
+
+ return memcg_oom_badness(memcg, nodemask);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter, *parent;
+
+ for_each_mem_cgroup_tree(iter, root) {
+ if (memcg_has_children(iter)) {
+ iter->oom_score = 0;
+ continue;
+ }
+
+ iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (iter->oom_score == 0)
+ continue;
+
+ /*
+ * If there are inflight OOM victims, we don't need to look
+ * further for new victims.
+ */
+ if (iter->oom_score == -1) {
+ oc->chosen_memcg = INFLIGHT_VICTIM;
+ mem_cgroup_iter_break(root, iter);
+ return;
+ }
+
+ for (parent = parent_mem_cgroup(iter); parent && parent != root;
+ parent = parent_mem_cgroup(parent))
+ parent->oom_score += iter->oom_score;
+ }
+
+ for (;;) {
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg = NULL;
+ long score = LONG_MIN;
+
+ css_for_each_child(css, &root->css) {
+ struct mem_cgroup *iter = mem_cgroup_from_css(css);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (iter->oom_score == 0)
+ continue;
+
+ if (iter->oom_score > score) {
+ memcg = iter;
+ score = iter->oom_score;
+ }
+ }
+
+ if (!memcg) {
+ if (oc->memcg && root == oc->memcg) {
+ oc->chosen_memcg = oc->memcg;
+ css_get(&oc->chosen_memcg->css);
+ oc->chosen_points = oc->memcg->oom_score;
+ }
+ break;
+ }
+
+ if (memcg->oom_group || !memcg_has_children(memcg)) {
+ oc->chosen_memcg = memcg;
+ css_get(&oc->chosen_memcg->css);
+ oc->chosen_points = score;
+ break;
+ }
+
+ root = memcg;
+ }
+}
+
+static void select_victim_root_cgroup_task(struct oom_control *oc)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int ret = 0;
+
+ css_task_iter_start(&root_mem_cgroup->css, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = oom_evaluate_task(task, oc);
+ css_task_iter_end(&it);
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root = root_mem_cgroup;
+
+ oc->chosen_task = NULL;
+ oc->chosen_memcg = NULL;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+
+ rcu_read_lock();
+
+ select_victim_memcg(root, oc);
+
+ /*
+ * If existing OOM victims are found, no need to look further.
+ */
+ if (oc->chosen_memcg == INFLIGHT_VICTIM) {
+ rcu_read_unlock();
+ return true;
+ }
+
+ /*
+ * For system-wide OOMs we should consider tasks in the root cgroup
+ * with oom_score larger than oc->chosen_points.
+ */
+ if (!oc->memcg) {
+ select_victim_root_cgroup_task(oc);
+
+ if (oc->chosen_task && oc->chosen_memcg) {
+ /*
+ * If we've decided to kill a task in the root memcg,
+ * release chosen_memcg.
+ */
+ css_put(&oc->chosen_memcg->css);
+ oc->chosen_memcg = NULL;
+ }
+ }
+
+ rcu_read_unlock();
+
+ return oc->chosen_task || oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
@@ -5246,6 +5453,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ bool oom_group = memcg->oom_group;
+
+ seq_printf(m, "%d\n", oom_group);
+
+ return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_group;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_group);
+ if (err)
+ return err;
+
+ memcg->oom_group = oom_group;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5366,6 +5600,12 @@ static struct cftype memory_files[] = {
.write = memory_max_write,
},
{
+ .name = "oom_group",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_group_show,
+ .write = memory_oom_group_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f061b627092c..b90a41ec16a1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -322,26 +322,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto next;
/* Prefer thread group leaders for display purposes */
- if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+ if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
goto next;
select:
- if (oc->chosen)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
get_task_struct(task);
- oc->chosen = task;
+ oc->chosen_task = task;
oc->chosen_points = points;
next:
return 0;
abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
+ oc->chosen_task = INFLIGHT_VICTIM;
return 1;
}
/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim)
struct mm_struct *mm;
bool can_oom_reap = true;
+ if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
+ return;
+
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -885,7 +888,7 @@ static void __oom_kill_process(struct task_struct *victim)
static void oom_kill_process(struct oom_control *oc, const char *message)
{
- struct task_struct *p = oc->chosen;
+ struct task_struct *p = oc->chosen_task;
unsigned int points = oc->chosen_points;
struct task_struct *victim = p;
struct task_struct *child;
@@ -946,6 +949,64 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
__oom_kill_process(victim);
}
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+ if (!tsk_is_oom_victim(task)) {
+ get_task_struct(task);
+ __oom_kill_process(task);
+ }
+ return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ if (oc->chosen_task) {
+ if (oc->chosen_task == INFLIGHT_VICTIM)
+ return true;
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, oc->chosen_task);
+
+ __oom_kill_process(oc->chosen_task);
+
+ schedule_timeout_killable(1);
+ return true;
+
+ } else if (oc->chosen_memcg) {
+ if (oc->chosen_memcg == INFLIGHT_VICTIM)
+ return true;
+
+ /* Always begin with the biggest task */
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, oc->chosen_task);
+
+ __oom_kill_process(oc->chosen_task);
+
+ if (mem_cgroup_oom_group(oc->chosen_memcg))
+ mem_cgroup_scan_tasks(oc->chosen_memcg,
+ oom_kill_memcg_member,
+ NULL);
+ schedule_timeout_killable(1);
+ }
+
+ mem_cgroup_put(oc->chosen_memcg);
+ oc->chosen_memcg = NULL;
+ return oc->chosen_task;
+
+ } else {
+ oc->chosen_points = 0;
+ return false;
+ }
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1042,18 +1103,21 @@ bool out_of_memory(struct oom_control *oc)
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
- oc->chosen = current;
+ oc->chosen_task = current;
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}
+ if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+ return true;
+
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
/*
@@ -1062,7 +1126,7 @@ bool out_of_memory(struct oom_control *oc)
*/
schedule_timeout_killable(1);
}
- return !!oc->chosen;
+ return !!oc->chosen_task;
}
/*
--
2.13.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 44+ messages in thread* [v7 3/5] mm, oom: introduce oom_priority for memory cgroups
2017-09-04 14:21 [v7 0/5] cgroup-aware OOM killer Roman Gushchin
2017-09-04 14:21 ` [v7 1/5] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-09-04 14:21 ` [v7 2/5] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-09-04 14:21 ` Roman Gushchin
2017-09-04 14:21 ` [v7 4/5] mm, oom, docs: describe the cgroup-aware OOM killer Roman Gushchin
2017-09-04 14:21 ` [v7 5/5] mm, oom: cgroup v2 mount option to disable " Roman Gushchin
4 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2017-09-04 14:21 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
David Rientjes, Andrew Morton, Tejun Heo, Tetsuo Handa,
kernel-team, cgroups, linux-doc, linux-kernel
Introduce a per-memory-cgroup oom_priority setting: an integer number,
which defines the order in which the OOM killer selects victim memory
cgroups.
OOM killer prefers memory cgroups with larger priority if they are
populated with eligible tasks.
The oom_priority value is compared within sibling cgroups.
If two or more sibling cgroups have the same oom_priority,
the decision is based on their memory footprint.
The root cgroup has the oom_priority 0, which cannot be changed.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 49 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 50 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b5c2b89968e..73a0291948fd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,6 +206,9 @@ struct mem_cgroup {
/* cached OOM score */
long oom_score;
+ /* OOM killer priority */
+ int oom_priority;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 97813c56163b..d7dd293897ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2757,6 +2757,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for (;;) {
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg = NULL;
+ int prio = INT_MIN;
long score = LONG_MIN;
css_for_each_child(css, &root->css) {
@@ -2768,7 +2769,12 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (iter->oom_score == 0)
continue;
- if (iter->oom_score > score) {
+ if (iter->oom_priority > prio) {
+ memcg = iter;
+ prio = iter->oom_priority;
+ score = iter->oom_score;
+ } else if (iter->oom_priority == prio &&
+ iter->oom_score > score) {
memcg = iter;
score = iter->oom_score;
}
@@ -2838,7 +2844,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
* For system-wide OOMs we should consider tasks in the root cgroup
* with oom_score larger than oc->chosen_points.
*/
- if (!oc->memcg) {
+ if (!oc->memcg && !(oc->chosen_memcg &&
+ oc->chosen_memcg->oom_priority > 0)) {
+ /*
+ * Root memcg has priority 0, so if chosen memcg has lower
+ * priority, any task in root cgroup is preferable.
+ */
+ if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0)
+ oc->chosen_points = 0;
+
select_victim_root_cgroup_task(oc);
if (oc->chosen_task && oc->chosen_memcg) {
@@ -5480,6 +5494,31 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_priority_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+ seq_printf(m, "%d\n", memcg->oom_priority);
+
+ return 0;
+}
+
+static ssize_t memory_oom_priority_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_priority;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_priority);
+ if (err)
+ return err;
+
+ memcg->oom_priority = oom_priority;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5606,6 +5645,12 @@ static struct cftype memory_files[] = {
.write = memory_oom_group_write,
},
{
+ .name = "oom_priority",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_priority_show,
+ .write = memory_oom_priority_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
--
2.13.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 44+ messages in thread* [v7 4/5] mm, oom, docs: describe the cgroup-aware OOM killer
2017-09-04 14:21 [v7 0/5] cgroup-aware OOM killer Roman Gushchin
` (2 preceding siblings ...)
2017-09-04 14:21 ` [v7 3/5] mm, oom: introduce oom_priority for memory cgroups Roman Gushchin
@ 2017-09-04 14:21 ` Roman Gushchin
2017-09-04 14:21 ` [v7 5/5] mm, oom: cgroup v2 mount option to disable " Roman Gushchin
4 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2017-09-04 14:21 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
Update cgroups v2 docs.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
Documentation/cgroup-v2.txt | 56 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index a86f3cb88125..5d21bd2e7d55 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -799,6 +800,33 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.oom_group
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ If set, OOM killer will kill all processes attached to the cgroup
+ if selected as an OOM victim.
+
+ OOM killer respects the /proc/pid/oom_score_adj value -1000,
+ and will never kill the unkillable task, even if memory.oom_group
+ is set.
+
+ memory.oom_priority
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ An integer number, which defines the order in which
+ the OOM killer selects victim memory cgroups.
+
+ OOM killer prefers memory cgroups with larger priority if they
+ are populated with eligible tasks.
+
+ The oom_priority value is compared within sibling cgroups.
+
+ The root cgroup has the oom_priority 0, which cannot be changed.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
@@ -1028,6 +1056,34 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
+5-2-4. OOM Killer
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for a cgroup with the
+largest oom_priority. If sibling cgroups have the same priority,
+the OOM killer selects one which is the largest memory consumer.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup oom_group option. If set, it causes the OOM killer
+to kill all processes attached to the cgroup.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (memory cgroups and
+other tasks in root cgroup).
+The root cgroup doesn't support the oom_group feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
+
5-3. IO
The "io" controller regulates the distribution of IO resources. This
--
2.13.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 44+ messages in thread* [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-04 14:21 [v7 0/5] cgroup-aware OOM killer Roman Gushchin
` (3 preceding siblings ...)
2017-09-04 14:21 ` [v7 4/5] mm, oom, docs: describe the cgroup-aware OOM killer Roman Gushchin
@ 2017-09-04 14:21 ` Roman Gushchin
[not found] ` <20170904142108.7165-6-guro-b10kYP2dOMg@public.gmane.org>
2017-09-05 13:44 ` Michal Hocko
4 siblings, 2 replies; 44+ messages in thread
From: Roman Gushchin @ 2017-09-04 14:21 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
Introducing of cgroup-aware OOM killer changes the victim selection
algorithm used by default: instead of picking the largest process,
it will pick the largest memcg and then the largest process inside.
This affects only cgroup v2 users.
To provide a way to use cgroups v2 if the old OOM victim selection
algorithm is preferred for some reason, the nogroupoom mount option
is added.
If set, the OOM selection is performed in a "traditional" per-process
way. Both oom_priority and oom_group memcg knobs are ignored.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
Documentation/admin-guide/kernel-parameters.txt | 1 +
mm/memcontrol.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 28f1a0f84456..07891f1030aa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -489,6 +489,7 @@
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
+ nogroupoom -- Disable cgroup-aware OOM killer.
checkreqprot [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7dd293897ca..6a8235dc41f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
/* Kernel memory accounting disabled? */
static bool cgroup_memory_nokmem;
+/* Cgroup-aware OOM disabled? */
+static bool cgroup_memory_nogroupoom;
+
/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
int do_swap_account __read_mostly;
@@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (mem_cgroup_disabled())
return false;
+ if (cgroup_memory_nogroupoom)
+ return false;
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
@@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
cgroup_memory_nosocket = true;
if (!strcmp(token, "nokmem"))
cgroup_memory_nokmem = true;
+ if (!strcmp(token, "nogroupoom"))
+ cgroup_memory_nogroupoom = true;
}
return 0;
}
--
2.13.5
^ permalink raw reply related [flat|nested] 44+ messages in thread[parent not found: <20170904142108.7165-6-guro-b10kYP2dOMg@public.gmane.org>]
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
[not found] ` <20170904142108.7165-6-guro-b10kYP2dOMg@public.gmane.org>
@ 2017-09-04 17:32 ` Shakeel Butt
2017-09-04 17:51 ` Roman Gushchin
0 siblings, 1 reply; 44+ messages in thread
From: Shakeel Butt @ 2017-09-04 17:32 UTC (permalink / raw)
To: Roman Gushchin
Cc: Linux MM, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, LKML
On Mon, Sep 4, 2017 at 7:21 AM, Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> wrote:
> Introducing of cgroup-aware OOM killer changes the victim selection
> algorithm used by default: instead of picking the largest process,
> it will pick the largest memcg and then the largest process inside.
>
> This affects only cgroup v2 users.
>
> To provide a way to use cgroups v2 if the old OOM victim selection
> algorithm is preferred for some reason, the nogroupoom mount option
> is added.
Is this mount option or boot parameter? From the code, it seems like a
boot parameter.
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-04 17:32 ` Shakeel Butt
@ 2017-09-04 17:51 ` Roman Gushchin
0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2017-09-04 17:51 UTC (permalink / raw)
To: Shakeel Butt
Cc: Linux MM, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, LKML
On Mon, Sep 04, 2017 at 10:32:37AM -0700, Shakeel Butt wrote:
> On Mon, Sep 4, 2017 at 7:21 AM, Roman Gushchin <guro@fb.com> wrote:
> > Introducing of cgroup-aware OOM killer changes the victim selection
> > algorithm used by default: instead of picking the largest process,
> > it will pick the largest memcg and then the largest process inside.
> >
> > This affects only cgroup v2 users.
> >
> > To provide a way to use cgroups v2 if the old OOM victim selection
> > algorithm is preferred for some reason, the nogroupoom mount option
> > is added.
>
> Is this mount option or boot parameter? From the code, it seems like a
> boot parameter.
Sure, you're right.
Fixed version below.
Thank you!
--
From 0b4757ec8d339fa883e17d4e25a92f45bf5565e0 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Mon, 4 Sep 2017 12:08:52 +0100
Subject: [v7 5/5] mm, oom: allow disabling cgroup-aware OOM killer
Introducing of cgroup-aware OOM killer changes the victim selection
algorithm used by default: instead of picking the largest process,
it will pick the largest memcg and then the largest process inside.
This affects only cgroup v2 users.
To provide a way to use cgroups v2 if the old OOM victim selection
algorithm is preferred for some reason, the cgroup.memory=nogroupoom
boot option is added.
If set, the OOM selection is performed in a "traditional" per-process
way. Both oom_priority and oom_group memcg knobs are ignored.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
Documentation/admin-guide/kernel-parameters.txt | 1 +
mm/memcontrol.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 28f1a0f84456..07891f1030aa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -489,6 +489,7 @@
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
+ nogroupoom -- Disable cgroup-aware OOM killer.
checkreqprot [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7dd293897ca..6a8235dc41f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
/* Kernel memory accounting disabled? */
static bool cgroup_memory_nokmem;
+/* Cgroup-aware OOM disabled? */
+static bool cgroup_memory_nogroupoom;
+
/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
int do_swap_account __read_mostly;
@@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (mem_cgroup_disabled())
return false;
+ if (cgroup_memory_nogroupoom)
+ return false;
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
@@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
cgroup_memory_nosocket = true;
if (!strcmp(token, "nokmem"))
cgroup_memory_nokmem = true;
+ if (!strcmp(token, "nogroupoom"))
+ cgroup_memory_nogroupoom = true;
}
return 0;
}
--
2.13.5
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-04 14:21 ` [v7 5/5] mm, oom: cgroup v2 mount option to disable " Roman Gushchin
[not found] ` <20170904142108.7165-6-guro-b10kYP2dOMg@public.gmane.org>
@ 2017-09-05 13:44 ` Michal Hocko
2017-09-05 14:30 ` Roman Gushchin
2017-09-05 21:53 ` Johannes Weiner
1 sibling, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2017-09-05 13:44 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
I will go and check patch 2 more deeply but this is something that I
wanted to sort out first.
On Mon 04-09-17 15:21:08, Roman Gushchin wrote:
> Introducing of cgroup-aware OOM killer changes the victim selection
> algorithm used by default: instead of picking the largest process,
> it will pick the largest memcg and then the largest process inside.
>
> This affects only cgroup v2 users.
>
> To provide a way to use cgroups v2 if the old OOM victim selection
> algorithm is preferred for some reason, the nogroupoom mount option
> is added.
>
> If set, the OOM selection is performed in a "traditional" per-process
> way. Both oom_priority and oom_group memcg knobs are ignored.
Why is this an opt out rather than opt-in? IMHO the original oom logic
should be preserved by default and specific workloads should opt in for
the cgroup aware logic. Changing the global behavior depending on
whether cgroup v2 interface is in use is more than unexpected and IMHO
wrong approach to take. I think we should instead go with
oom_strategy=[alloc_task,biggest_task,cgroup]
we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
biggest_task which is the default. You are adding cgroup and the more I
think about the more I agree that it doesn't really make sense to try to
fit thew new semantic into the existing one (compare tasks to kill-all
memcgs). Just introduce a new strategy and define a new semantic from
scratch. Memcg priority and kill-all are a natural extension of this new
strategy. This will make the life easier and easier to understand by
users.
Does that make sense to you?
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
> Documentation/admin-guide/kernel-parameters.txt | 1 +
> mm/memcontrol.c | 8 ++++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 28f1a0f84456..07891f1030aa 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -489,6 +489,7 @@
> Format: <string>
> nosocket -- Disable socket memory accounting.
> nokmem -- Disable kernel memory accounting.
> + nogroupoom -- Disable cgroup-aware OOM killer.
>
> checkreqprot [SELINUX] Set initial checkreqprot flag value.
> Format: { "0" | "1" }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7dd293897ca..6a8235dc41f6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -87,6 +87,9 @@ static bool cgroup_memory_nosocket;
> /* Kernel memory accounting disabled? */
> static bool cgroup_memory_nokmem;
>
> +/* Cgroup-aware OOM disabled? */
> +static bool cgroup_memory_nogroupoom;
> +
> /* Whether the swap controller is active */
> #ifdef CONFIG_MEMCG_SWAP
> int do_swap_account __read_mostly;
> @@ -2822,6 +2825,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> if (mem_cgroup_disabled())
> return false;
>
> + if (cgroup_memory_nogroupoom)
> + return false;
> +
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return false;
>
> @@ -6188,6 +6194,8 @@ static int __init cgroup_memory(char *s)
> cgroup_memory_nosocket = true;
> if (!strcmp(token, "nokmem"))
> cgroup_memory_nokmem = true;
> + if (!strcmp(token, "nogroupoom"))
> + cgroup_memory_nogroupoom = true;
> }
> return 0;
> }
> --
> 2.13.5
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-05 13:44 ` Michal Hocko
@ 2017-09-05 14:30 ` Roman Gushchin
2017-09-05 15:12 ` Michal Hocko
2017-09-05 21:53 ` Johannes Weiner
1 sibling, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2017-09-05 14:30 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> I will go and check patch 2 more deeply but this is something that I
> wanted to sort out first.
>
> On Mon 04-09-17 15:21:08, Roman Gushchin wrote:
> > Introducing of cgroup-aware OOM killer changes the victim selection
> > algorithm used by default: instead of picking the largest process,
> > it will pick the largest memcg and then the largest process inside.
> >
> > This affects only cgroup v2 users.
> >
> > To provide a way to use cgroups v2 if the old OOM victim selection
> > algorithm is preferred for some reason, the nogroupoom mount option
> > is added.
> >
> > If set, the OOM selection is performed in a "traditional" per-process
> > way. Both oom_priority and oom_group memcg knobs are ignored.
>
> Why is this an opt out rather than opt-in? IMHO the original oom logic
> should be preserved by default and specific workloads should opt in for
> the cgroup aware logic. Changing the global behavior depending on
> whether cgroup v2 interface is in use is more than unexpected and IMHO
> wrong approach to take. I think we should instead go with
> oom_strategy=[alloc_task,biggest_task,cgroup]
>
> we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> biggest_task which is the default. You are adding cgroup and the more I
> think about the more I agree that it doesn't really make sense to try to
> fit thew new semantic into the existing one (compare tasks to kill-all
> memcgs). Just introduce a new strategy and define a new semantic from
> scratch. Memcg priority and kill-all are a natural extension of this new
> strategy. This will make the life easier and easier to understand by
> users.
>
> Does that make sense to you?
Absolutely.
The only thing: I'm not sure that we have to preserve the existing logic
as default option. For most users (except few very specific usecases),
it should be at least as good, as the existing one.
Making it opt-in means that corresponding code will be executed only
by few users, who cares. Then we should probably hide corresponding
cgroup interface (oom_group and oom_priority knobs) by default,
and it feels as unnecessary complication and is overall against
cgroup v2 interface design.
> I think we should instead go with
> oom_strategy=[alloc_task,biggest_task,cgroup]
It would be a really nice interface; although I've no idea how to implement it:
"alloc_task" is an existing sysctl, which we have to preserve;
while "cgroup" depends on cgroup v2.
Thanks!
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-05 14:30 ` Roman Gushchin
@ 2017-09-05 15:12 ` Michal Hocko
[not found] ` <20170905151251.luh4wogjd3msfqgf-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2017-09-05 15:12 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Tue 05-09-17 15:30:21, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
[...]
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> >
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
> >
> > Does that make sense to you?
>
> Absolutely.
>
> The only thing: I'm not sure that we have to preserve the existing logic
> as default option. For most users (except few very specific usecases),
> it should be at least as good, as the existing one.
But this is really an unexpected change. Users even might not know that
they are using cgroup v2 and memcg is in use.
> Making it opt-in means that corresponding code will be executed only
> by few users, who cares.
Yeah, which is the way we should introduce new features no?
> Then we should probably hide corresponding
> cgroup interface (oom_group and oom_priority knobs) by default,
> and it feels as unnecessary complication and is overall against
> cgroup v2 interface design.
Why. If we care enough, we could simply return EINVAL when those knobs
are written while the corresponding strategy is not used.
> > I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
>
> It would be a really nice interface; although I've no idea how to implement it:
> "alloc_task" is an existing sysctl, which we have to preserve;
I would argue that we should simply deprecate and later drop the sysctl.
I _strongly_ suspect anybody is using this. If yes it is not that hard
to change the kernel command like rather than select the sysctl. The
deprecation process would be
- warn when somebody writes to the sysctl and check both boot
and sysctl values
[ wait some time ]
- keep the sysctl but return EINVAL
[ wait some time ]
- remove the sysctl
> while "cgroup" depends on cgroup v2.
Which is not a big deal either. Simply fall back to default if there are
no cgroup v2. The implementation would have essentially the same effect
because there won't be any kill-all cgroups and so we will select the
largest task.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-05 13:44 ` Michal Hocko
2017-09-05 14:30 ` Roman Gushchin
@ 2017-09-05 21:53 ` Johannes Weiner
2017-09-06 8:28 ` Michal Hocko
1 sibling, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2017-09-05 21:53 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> Why is this an opt out rather than opt-in? IMHO the original oom logic
> should be preserved by default and specific workloads should opt in for
> the cgroup aware logic. Changing the global behavior depending on
> whether cgroup v2 interface is in use is more than unexpected and IMHO
> wrong approach to take. I think we should instead go with
> oom_strategy=[alloc_task,biggest_task,cgroup]
>
> we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> biggest_task which is the default. You are adding cgroup and the more I
> think about the more I agree that it doesn't really make sense to try to
> fit thew new semantic into the existing one (compare tasks to kill-all
> memcgs). Just introduce a new strategy and define a new semantic from
> scratch. Memcg priority and kill-all are a natural extension of this new
> strategy. This will make the life easier and easier to understand by
> users.
oom_kill_allocating_task is actually a really good example of why
cgroup-awareness *should* be the new default.
Before we had the oom killer victim selection, we simply killed the
faulting/allocating task. While a valid answer to the problem, it's
not very fair or representative of what the user wants or intends.
Then we added code to kill the biggest offender instead, which should
have been the case from the start and was hence made the new default.
The oom_kill_allocating_task was added on the off-chance that there
might be setups who, for historical reasons, rely on the old behavior.
But our default was chosen based on what behavior is fair, expected,
and most reflective of the user's intentions.
The cgroup-awareness in the OOM killer is exactly the same thing. It
should have been the default from the beginning, because the user
configures a group of tasks to be an interdependent, terminal unit of
memory consumption, and it's undesirable for the OOM killer to ignore
this intention and compare members across these boundaries.
We should go the same way here as with kill_alloc_task: the default
should be what's sane and expected by the vast majority of our users,
with a knob (I would prefer a sysctl here, actually) to switch back in
case somebody - unexpectedly - actually relies on the old behavior.
I don't see why couldn't change user-visible behavior here if we don't
expect anyone (quirky setups aside) to rely on it AND we provide a
knob to revert in the field for the exceptions.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-05 21:53 ` Johannes Weiner
@ 2017-09-06 8:28 ` Michal Hocko
2017-09-07 16:14 ` Johannes Weiner
2017-09-07 16:27 ` Christopher Lameter
0 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2017-09-06 8:28 UTC (permalink / raw)
To: Johannes Weiner
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> >
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
>
> oom_kill_allocating_task is actually a really good example of why
> cgroup-awareness *should* be the new default.
>
> Before we had the oom killer victim selection, we simply killed the
> faulting/allocating task. While a valid answer to the problem, it's
> not very fair or representative of what the user wants or intends.
>
> Then we added code to kill the biggest offender instead, which should
> have been the case from the start and was hence made the new default.
> The oom_kill_allocating_task was added on the off-chance that there
> might be setups who, for historical reasons, rely on the old behavior.
> But our default was chosen based on what behavior is fair, expected,
> and most reflective of the user's intentions.
I am not sure this is how things evolved actually. This is way before
my time so my git log interpretation might be imprecise. We do have
oom_badness heuristic since out_of_memory has been introduced and
oom_kill_allocating_task has been introduced much later because of large
boxes with zillions of tasks (SGI I suspect) which took too long to
select a victim so David has added this heuristic.
> The cgroup-awareness in the OOM killer is exactly the same thing. It
> should have been the default from the beginning, because the user
> configures a group of tasks to be an interdependent, terminal unit of
> memory consumption, and it's undesirable for the OOM killer to ignore
> this intention and compare members across these boundaries.
I would agree if that was true in general. I can completely see how the
cgroup awareness is useful in e.g. containerized environments (especially
with kill-all enabled) but memcgs are used in a large variety of
usecases and I cannot really say all of them really demand the new
semantic. Say I have a workload which doesn't want to see reclaim
interference from others on the same machine. Why should I kill a
process from that particular memcg just because it is the largest one
when there is a memory hog/leak outside of this memcg?
From my point of view the safest (in a sense of the least surprise)
way to go with opt-in for the new heuristic. I am pretty sure all who
would benefit from the new behavior will enable it while others will not
regress in unexpected way.
We can talk about the way _how_ to control these oom strategies, of
course. But I would be really reluctant to change the default which is
used for years and people got used to it.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-06 8:28 ` Michal Hocko
@ 2017-09-07 16:14 ` Johannes Weiner
2017-09-11 9:05 ` Michal Hocko
2017-09-07 16:27 ` Christopher Lameter
1 sibling, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2017-09-07 16:14 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > should have been the default from the beginning, because the user
> > configures a group of tasks to be an interdependent, terminal unit of
> > memory consumption, and it's undesirable for the OOM killer to ignore
> > this intention and compare members across these boundaries.
>
> I would agree if that was true in general. I can completely see how the
> cgroup awareness is useful in e.g. containerized environments (especially
> with kill-all enabled) but memcgs are used in a large variety of
> usecases and I cannot really say all of them really demand the new
> semantic. Say I have a workload which doesn't want to see reclaim
> interference from others on the same machine. Why should I kill a
> process from that particular memcg just because it is the largest one
> when there is a memory hog/leak outside of this memcg?
Sure, it's always possible to come up with a config for which this
isn't the optimal behavior. But this is about picking a default that
makes sense to most users, and that type of cgroup usage just isn't
the common case.
> From my point of view the safest (in a sense of the least surprise)
> way to go with opt-in for the new heuristic. I am pretty sure all who
> would benefit from the new behavior will enable it while others will not
> regress in unexpected way.
This thinking simply needs to be balanced against the need to make an
unsurprising and consistent final interface.
The current behavior breaks isolation by letting tasks in different
cgroups compete with each other during an OOM kill. While you can
rightfully argue that it's possible for usecases to rely on this, you
cannot tell me that this is the least-surprising thing we can offer
users; certainly not new users, but also not many/most existing ones.
> We can talk about the way _how_ to control these oom strategies, of
> course. But I would be really reluctant to change the default which is
> used for years and people got used to it.
I really doubt there are many cgroup users that rely on that
particular global OOM behavior.
We have to agree to disagree, I guess.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-07 16:14 ` Johannes Weiner
@ 2017-09-11 9:05 ` Michal Hocko
2017-09-11 12:50 ` Roman Gushchin
0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2017-09-11 9:05 UTC (permalink / raw)
To: Johannes Weiner
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Thu 07-09-17 12:14:57, Johannes Weiner wrote:
> On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> > On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > > should have been the default from the beginning, because the user
> > > configures a group of tasks to be an interdependent, terminal unit of
> > > memory consumption, and it's undesirable for the OOM killer to ignore
> > > this intention and compare members across these boundaries.
> >
> > I would agree if that was true in general. I can completely see how the
> > cgroup awareness is useful in e.g. containerized environments (especially
> > with kill-all enabled) but memcgs are used in a large variety of
> > usecases and I cannot really say all of them really demand the new
> > semantic. Say I have a workload which doesn't want to see reclaim
> > interference from others on the same machine. Why should I kill a
> > process from that particular memcg just because it is the largest one
> > when there is a memory hog/leak outside of this memcg?
>
> Sure, it's always possible to come up with a config for which this
> isn't the optimal behavior. But this is about picking a default that
> makes sense to most users, and that type of cgroup usage just isn't
> the common case.
How can you tell, really? Even if cgroup2 is a new interface we still
want as many legacy (v1) users to be migrated to the new hierarchy.
I have seen quite different usecases over time and I have hard time to
tell which of them to call common enough.
> > From my point of view the safest (in a sense of the least surprise)
> > way to go with opt-in for the new heuristic. I am pretty sure all who
> > would benefit from the new behavior will enable it while others will not
> > regress in unexpected way.
>
> This thinking simply needs to be balanced against the need to make an
> unsurprising and consistent final interface.
Sure. And I _think_ we can come up with a clear interface to configure
the oom behavior - e.g. a kernel command line parameter with a default
based on a config option.
> The current behavior breaks isolation by letting tasks in different
> cgroups compete with each other during an OOM kill. While you can
> rightfully argue that it's possible for usecases to rely on this, you
> cannot tell me that this is the least-surprising thing we can offer
> users; certainly not new users, but also not many/most existing ones.
I would argue that a global OOM has been always a special case and
people got used to "kill the largest task" strategy. I have seen
multiple reports where people were complaining when this wasn't the case
(e.g. when the NUMA policies were involved).
> > We can talk about the way _how_ to control these oom strategies, of
> > course. But I would be really reluctant to change the default which is
> > used for years and people got used to it.
>
> I really doubt there are many cgroup users that rely on that
> particular global OOM behavior.
>
> We have to agree to disagree, I guess.
Yes, I am afraid so. And I do not hear this would be a feature so many
users have been asking for a long time to simply say "yeah everybody
wants that, make it a default". And as such I do not see a reason why we
should enforce it on all users. It is really trivial to enable it when
it is considered useful.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-11 9:05 ` Michal Hocko
@ 2017-09-11 12:50 ` Roman Gushchin
0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2017-09-11 12:50 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, linux-mm, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
On Mon, Sep 11, 2017 at 11:05:59AM +0200, Michal Hocko wrote:
> On Thu 07-09-17 12:14:57, Johannes Weiner wrote:
> > On Wed, Sep 06, 2017 at 10:28:59AM +0200, Michal Hocko wrote:
> > > On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> > > > The cgroup-awareness in the OOM killer is exactly the same thing. It
> > > > should have been the default from the beginning, because the user
> > > > configures a group of tasks to be an interdependent, terminal unit of
> > > > memory consumption, and it's undesirable for the OOM killer to ignore
> > > > this intention and compare members across these boundaries.
> > >
> > > I would agree if that was true in general. I can completely see how the
> > > cgroup awareness is useful in e.g. containerized environments (especially
> > > with kill-all enabled) but memcgs are used in a large variety of
> > > usecases and I cannot really say all of them really demand the new
> > > semantic. Say I have a workload which doesn't want to see reclaim
> > > interference from others on the same machine. Why should I kill a
> > > process from that particular memcg just because it is the largest one
> > > when there is a memory hog/leak outside of this memcg?
> >
> > Sure, it's always possible to come up with a config for which this
> > isn't the optimal behavior. But this is about picking a default that
> > makes sense to most users, and that type of cgroup usage just isn't
> > the common case.
>
> How can you tell, really? Even if cgroup2 is a new interface we still
> want as many legacy (v1) users to be migrated to the new hierarchy.
> I have seen quite different usecases over time and I have hard time to
> tell which of them to call common enough.
>
> > > From my point of view the safest (in a sense of the least surprise)
> > > way to go with opt-in for the new heuristic. I am pretty sure all who
> > > would benefit from the new behavior will enable it while others will not
> > > regress in unexpected way.
> >
> > This thinking simply needs to be balanced against the need to make an
> > unsurprising and consistent final interface.
>
> Sure. And I _think_ we can come up with a clear interface to configure
> the oom behavior - e.g. a kernel command line parameter with a default
> based on a config option.
I would say cgroup v2 mount option is better, because it allows to change
the behavior dynamically (without rebooting) and clearly reflects
cgroup v2 dependency.
Also, it makes systemd (or who is mounting cgroupfs) responsible for the
default behavior. And makes more or less not important what the default is.
Thanks!
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-06 8:28 ` Michal Hocko
2017-09-07 16:14 ` Johannes Weiner
@ 2017-09-07 16:27 ` Christopher Lameter
2017-09-07 22:03 ` David Rientjes
1 sibling, 1 reply; 44+ messages in thread
From: Christopher Lameter @ 2017-09-07 16:27 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, linux-mm, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
On Wed, 6 Sep 2017, Michal Hocko wrote:
> I am not sure this is how things evolved actually. This is way before
> my time so my git log interpretation might be imprecise. We do have
> oom_badness heuristic since out_of_memory has been introduced and
> oom_kill_allocating_task has been introduced much later because of large
> boxes with zillions of tasks (SGI I suspect) which took too long to
> select a victim so David has added this heuristic.
Nope. The logic was required for tasks that run out of memory when the
restriction on the allocation did not allow the use of all of memory.
cpuset restrictions and memory policy restrictions where the prime
considerations at the time.
It has *nothing* to do with zillions of tasks. Its amusing that the SGI
ghost is still haunting the discussion here. The company died a couple of
years ago finally (ok somehow HP has an "SGI" brand now I believe). But
there are multiple companies that have large NUMA configurations and they
all have configurations where they want to restrict allocations of a
process to subset of system memory. This is even more important now that
we get new forms of memory (NVDIMM, PCI-E device memory etc). You need to
figure out what to do with allocations that fail because the *allowed*
memory pools are empty.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-07 16:27 ` Christopher Lameter
@ 2017-09-07 22:03 ` David Rientjes
2017-09-08 21:07 ` Christopher Lameter
0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2017-09-07 22:03 UTC (permalink / raw)
To: Christopher Lameter
Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, linux-mm,
Vladimir Davydov, Tetsuo Handa, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
On Thu, 7 Sep 2017, Christopher Lameter wrote:
> > I am not sure this is how things evolved actually. This is way before
> > my time so my git log interpretation might be imprecise. We do have
> > oom_badness heuristic since out_of_memory has been introduced and
> > oom_kill_allocating_task has been introduced much later because of large
> > boxes with zillions of tasks (SGI I suspect) which took too long to
> > select a victim so David has added this heuristic.
>
> Nope. The logic was required for tasks that run out of memory when the
> restriction on the allocation did not allow the use of all of memory.
> cpuset restrictions and memory policy restrictions where the prime
> considerations at the time.
>
> It has *nothing* to do with zillions of tasks. Its amusing that the SGI
> ghost is still haunting the discussion here. The company died a couple of
> years ago finally (ok somehow HP has an "SGI" brand now I believe). But
> there are multiple companies that have large NUMA configurations and they
> all have configurations where they want to restrict allocations of a
> process to subset of system memory. This is even more important now that
> we get new forms of memory (NVDIMM, PCI-E device memory etc). You need to
> figure out what to do with allocations that fail because the *allowed*
> memory pools are empty.
>
We already had CONSTRAINT_CPUSET at the time, this was requested by Paul
and acked by him in https://marc.info/?l=linux-mm&m=118306851418425.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-07 22:03 ` David Rientjes
@ 2017-09-08 21:07 ` Christopher Lameter
2017-09-09 8:45 ` David Rientjes
0 siblings, 1 reply; 44+ messages in thread
From: Christopher Lameter @ 2017-09-08 21:07 UTC (permalink / raw)
To: David Rientjes
Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, linux-mm,
Vladimir Davydov, Tetsuo Handa, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
On Thu, 7 Sep 2017, David Rientjes wrote:
> > It has *nothing* to do with zillions of tasks. Its amusing that the SGI
> > ghost is still haunting the discussion here. The company died a couple of
> > years ago finally (ok somehow HP has an "SGI" brand now I believe). But
> > there are multiple companies that have large NUMA configurations and they
> > all have configurations where they want to restrict allocations of a
> > process to subset of system memory. This is even more important now that
> > we get new forms of memory (NVDIMM, PCI-E device memory etc). You need to
> > figure out what to do with allocations that fail because the *allowed*
> > memory pools are empty.
> >
>
> We already had CONSTRAINT_CPUSET at the time, this was requested by Paul
> and acked by him in https://marc.info/?l=linux-mm&m=118306851418425.
Ok. Certainly there were scalability issues (lots of them) and the sysctl
may have helped there if set globally. But the ability to kill the
allocating tasks was primarily used in cpusets for constrained allocation.
The issue of scaling is irrelevant in the context of deciding what to do
about the sysctl. You can address the issue differently if it still
exists. The systems with super high NUMA nodes (hundreds to a
thousand) have somehow fallen out of fashion a bit. So I doubt that this
is still an issue. And no one of the old stakeholders is speaking up.
What is the current approach for an OOM occuring in a cpuset or cgroup
with a restricted numa node set?
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
2017-09-08 21:07 ` Christopher Lameter
@ 2017-09-09 8:45 ` David Rientjes
0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2017-09-09 8:45 UTC (permalink / raw)
To: Christopher Lameter
Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, linux-mm,
Vladimir Davydov, Tetsuo Handa, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
On Fri, 8 Sep 2017, Christopher Lameter wrote:
> Ok. Certainly there were scalability issues (lots of them) and the sysctl
> may have helped there if set globally. But the ability to kill the
> allocating tasks was primarily used in cpusets for constrained allocation.
>
I remember discussing it with him and he had some data with pretty extreme
numbers for how long the tasklist iteration was taking. Regardless, I
agree it's not pertinent to the discussion if anybody is actively using
the sysctl, just fun to try to remember the discussions from 10 years ago.
The problem I'm having with the removal, though, is that the kernel source
actually uses it itself in tools/testing/fault-injection/failcmd.sh.
That, to me, suggests there are people outside the kernel source that are
also probably use it. We use it as part of our unit testing, although we
could convert away from it.
These are things that can probably be worked around, but I'm struggling to
see the whole benefit of it. It's only defined, there's generic sysctl
handling, and there's a single conditional in the oom killer. I wouldn't
risk the potential userspace breakage.
> The issue of scaling is irrelevant in the context of deciding what to do
> about the sysctl. You can address the issue differently if it still
> exists. The systems with super high NUMA nodes (hundreds to a
> thousand) have somehow fallen out of fashion a bit. So I doubt that this
> is still an issue. And no one of the old stakeholders is speaking up.
>
> What is the current approach for an OOM occuring in a cpuset or cgroup
> with a restricted numa node set?
>
It's always been shaky, we simply exclude potential kill victims based on
whether or not they share mempolicy nodes or cpuset mems with the
allocating process. Of course, this could result in no memory freeing
because a potential victim being allowed to allocate on a particular node
right now doesn't mean killing it will free memory on that node. It's
just more probable in practice. Nobody has complained about that
methodology, but we do have internal code that simply kills current for
mempolicy ooms. That is because we have priority based oom killing much
like this patchset implements and then extends it even further to
processes.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 44+ messages in thread