* [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
[not found] <1498079956-24467-1-git-send-email-guro@fb.com>
@ 2017-06-21 21:19 ` Roman Gushchin
[not found] ` <201706220040.v5M0eSnK074332@www262.sakura.ne.jp>
[not found] ` <1498079956-24467-2-git-send-email-guro-b10kYP2dOMg@public.gmane.org>
2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
` (4 subsequent siblings)
5 siblings, 2 replies; 20+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
linux-kernel
The OOM killer should avoid unnecessary kills. To prevent them, during
the task list traversal we check for tasks which were previously
selected as OOM victims. If there is such a task, a new victim
is not selected.
This approach is sub-optimal (we're doing a costly iteration over the task
list every time) and will not work for the cgroup-aware OOM killer.
We already have the oom_victims counter, which can be used effectively
for this task: if there are victims in flight, don't do anything;
once the counter falls to 0, there are no OOM victims left,
so it's a good time to start looking for a new one.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/oom_kill.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..e3aaf5c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
if (oom_killer_disabled)
return false;
+ /*
+ * If there are oom victims in flight, we don't need to select
+ * a new victim.
+ */
+ if (atomic_read(&oom_victims) > 0)
+ return true;
+
if (!is_memcg_oom(oc)) {
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
--
2.7.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [v3 2/6] mm, oom: cgroup-aware OOM killer
[not found] <1498079956-24467-1-git-send-email-guro@fb.com>
2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
2017-07-10 23:05 ` David Rientjes
2017-06-21 21:19 ` [v3 3/6] mm, oom: cgroup-aware OOM killer debug info Roman Gushchin
` (3 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
Traditionally, the OOM killer operates on a process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.
This behavior doesn't suit systems with many running containers well.
There are three main issues:
1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.
2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.
3) Per-process oom_score_adj affects global OOM, so it's a breach
of isolation.
To address these issues, a cgroup-aware OOM killer is introduced.
Under OOM conditions, it tries to find the biggest memory consumer
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.
By default, it will look for the biggest leaf cgroup and kill
the largest task inside.
But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If it is
selected as an OOM victim, all tasks belonging to it will be killed.
Tasks in the root cgroup are treated as independent memory consumers
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 20 ++++++
include/linux/oom.h | 3 +
mm/memcontrol.c | 155 ++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 164 +++++++++++++++++++++++++++++----------------
4 files changed, 285 insertions(+), 57 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3d..c59926c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -199,6 +200,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* kill all tasks in the subtree in case of OOM */
+ bool oom_kill_all_tasks;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
@@ -342,6 +346,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -480,6 +489,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
bool mem_cgroup_oom_synchronize(bool wait);
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -739,6 +750,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -926,6 +941,11 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
static inline void __inc_memcg_state(struct mem_cgroup *memcg,
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..b7ec3bd 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+ struct mem_cgroup *chosen_memcg;
};
extern struct mutex oom_lock;
@@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544d47e..bdb5103 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,128 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
+static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ long points = 0;
+ int nid;
+ struct mem_cgroup *iter;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(iter, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+ }
+
+ points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(iter, NR_SLAB_UNRECLAIMABLE);
+ points += memcg_page_state(iter, MEMCG_SOCK);
+ points += memcg_page_state(iter, MEMCG_SWAP);
+ }
+
+ return points;
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct cgroup_subsys_state *css = NULL;
+ struct mem_cgroup *iter = NULL;
+ struct mem_cgroup *chosen_memcg = NULL;
+ struct mem_cgroup *parent = root_mem_cgroup;
+ unsigned long totalpages = oc->totalpages;
+ long chosen_memcg_points = 0;
+ long points = 0;
+
+ oc->chosen = NULL;
+ oc->chosen_memcg = NULL;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+ if (oc->memcg) {
+ chosen_memcg = oc->memcg;
+ parent = oc->memcg;
+ }
+
+ rcu_read_lock();
+
+ for (;;) {
+ css = css_next_child(css, &parent->css);
+ if (css) {
+ iter = mem_cgroup_from_css(css);
+
+ points = mem_cgroup_oom_badness(iter, oc->nodemask);
+
+ if (points > chosen_memcg_points) {
+ chosen_memcg = iter;
+ chosen_memcg_points = points;
+ oc->chosen_points = points;
+ }
+
+ continue;
+ }
+
+ if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
+ /* Go deeper in the cgroup hierarchy */
+ totalpages = chosen_memcg_points;
+ chosen_memcg_points = 0;
+
+ parent = chosen_memcg;
+ chosen_memcg = NULL;
+
+ continue;
+ }
+
+ if (!chosen_memcg && parent != root_mem_cgroup)
+ chosen_memcg = parent;
+
+ break;
+ }
+
+ if (!oc->memcg) {
+ /*
+ * We should also consider tasks in the root cgroup
+ * with badness larger than oc->chosen_points
+ */
+
+ struct css_task_iter it;
+ struct task_struct *task;
+ int ret = 0;
+
+ css_task_iter_start(&root_mem_cgroup->css, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = oom_evaluate_task(task, oc);
+ css_task_iter_end(&it);
+ }
+
+ if (!oc->chosen && chosen_memcg) {
+ if (chosen_memcg->oom_kill_all_tasks) {
+ css_get(&chosen_memcg->css);
+ oc->chosen_memcg = chosen_memcg;
+ }
+
+ /*
+ * Even if we have to kill all tasks in the cgroup,
+ * we need to select the biggest task to start with.
+ * This task will receive an access to the memory
+ * reserves.
+ */
+ oc->chosen_points = 0;
+ mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc);
+ }
+
+ rcu_read_unlock();
+
+ oc->chosen_points = 0;
+ return !!oc->chosen || !!oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
@@ -5166,6 +5288,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ bool oom_kill_all_tasks = memcg->oom_kill_all_tasks;
+
+ seq_printf(m, "%d\n", oom_kill_all_tasks);
+
+ return 0;
+}
+
+static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_kill_all_tasks;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks);
+ if (err)
+ return err;
+
+ memcg->oom_kill_all_tasks = !!oom_kill_all_tasks;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5286,6 +5435,12 @@ static struct cftype memory_files[] = {
.write = memory_max_write,
},
{
+ .name = "oom_kill_all_tasks",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_kill_all_tasks_show,
+ .write = memory_oom_kill_all_tasks_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e3aaf5c8..489ab69 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -802,66 +802,14 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just set TIF_MEMDIE so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
+ if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
p = find_lock_task_mm(victim);
if (!p) {
@@ -932,10 +880,107 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
wake_oom_reaper(victim);
mmdrop(mm);
- put_task_struct(victim);
}
#undef K
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just set TIF_MEMDIE so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+ put_task_struct(victim);
+}
+
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+ if (!tsk_is_oom_victim(task))
+ __oom_kill_process(task);
+ return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+ bool ret = !!oc->chosen;
+
+ if (oc->chosen && oc->chosen != (void *)-1UL) {
+ __oom_kill_process(oc->chosen);
+ put_task_struct(oc->chosen);
+ }
+
+ if (oc->chosen_memcg) {
+ /*
+ * Kill all tasks in the cgroup hierarchy
+ */
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+ NULL);
+ mem_cgroup_put(oc->chosen_memcg);
+ oc->chosen_memcg = NULL;
+ }
+
+ /*
+ * Reset points before falling back to the
+ * old per-process OOM victim selection logic
+ */
+ oc->chosen_points = 0;
+ oc->chosen = NULL;
+ return ret;
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1044,6 +1089,11 @@ bool out_of_memory(struct oom_control *oc)
return true;
}
+ if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
+ schedule_timeout_killable(1);
+ return true;
+ }
+
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
--
2.7.4
* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-07-10 23:05 ` David Rientjes
2017-07-11 12:51 ` Roman Gushchin
0 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2017-07-10 23:05 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel
On Wed, 21 Jun 2017, Roman Gushchin wrote:
> Traditionally, the OOM killer operates on a process level.
> Under OOM conditions, it finds the process with the highest oom score
> and kills it.
>
> This behavior doesn't suit systems with many running
> containers well. There are three main issues:
>
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
>
Yes, the original motivation was to limit killing to a single process, if
possible. To do that, we kill the process with the largest rss to free
the most memory and rely on the user to configure /proc/pid/oom_score_adj
if something else should be prioritized.
With containerization and overcommit of system memory, we concur that
killing the single largest process isn't always preferable and neglects
the priority of its memcg. Your motivation seems to be to provide
fairness between one memcg with a large process and one memcg with a large
number of small processes; I'm curious if you are concerned about the
priority of a memcg hierarchy (how important that "job" is) or whether you
are strictly concerned with "largeness" of memcgs relative to each other.
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
>
We agree.
> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> of isolation.
>
This should only be a consequence of overcommitting memcgs at the top level
so the system oom killer is actually ever invoked, otherwise per-process
oom_score_adj works well for memcg oom killing.
> To address these issues, cgroup-aware OOM killer is introduced.
>
> Under OOM conditions, it tries to find the biggest memory consumer
> and free memory by killing the corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
>
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
>
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it is
> selected as an OOM victim, all tasks belonging to it will be killed.
>
These are two different things, right? We can adjust how the system oom
killer chooses victims when memcg hierarchies overcommit the system to not
strictly prefer the single process with the largest rss without killing
everything attached to the memcg.
Separately: do you not intend to support memcg priorities at all, but
rather strictly consider the "largeness" of a memcg versus other memcgs?
In our methodology, each memcg is assigned a priority value and the
iteration of the hierarchy simply compares and visits the memcg with the
lowest priority at each level and then selects the largest process to
kill. This could also support a "kill-all" knob.
struct mem_cgroup *memcg = root_mem_cgroup;
struct mem_cgroup *low_memcg;
unsigned long low_priority;
next:
low_memcg = NULL;
low_priority = ULONG_MAX;
for_each_child_of_memcg(child, memcg) {
unsigned long prio = memcg_oom_priority(child);
if (prio < low_priority) {
low_memcg = child;
low_priority = prio;
}
}
if (low_memcg) {
memcg = low_memcg;
goto next;
}
oom_kill_process_from_memcg(memcg);
So this is a priority-based model that is different from your aggregate
usage model, but I think it allows userspace to define a more powerful
policy. We certainly may want to kill from a memcg with a single large
process, or we may want to kill from a memcg with several small processes,
it depends on the importance of that job.
* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
2017-07-10 23:05 ` David Rientjes
@ 2017-07-11 12:51 ` Roman Gushchin
2017-07-11 20:56 ` David Rientjes
0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2017-07-11 12:51 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel
On Mon, Jul 10, 2017 at 04:05:49PM -0700, David Rientjes wrote:
> On Wed, 21 Jun 2017, Roman Gushchin wrote:
>
> > Traditionally, the OOM killer operates on a process level.
> > Under OOM conditions, it finds the process with the highest oom score
> > and kills it.
> >
> > This behavior doesn't suit systems with many running
> > containers well. There are three main issues:
> >
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with a huge
> > number of small processes.
> >
>
> Yes, the original motivation was to limit killing to a single process, if
> possible. To do that, we kill the process with the largest rss to free
> the most memory and rely on the user to configure /proc/pid/oom_score_adj
> if something else should be prioritized.
>
> With containerization and overcommit of system memory, we concur that
> killing the single largest process isn't always preferable and neglects
> the priority of its memcg. Your motivation seems to be to provide
> fairness between one memcg with a large process and one memcg with a large
> number of small processes; I'm curious if you are concerned about the
> priority of a memcg hierarchy (how important that "job" is) or whether you
> are strictly concerned with "largeness" of memcgs relative to each other.
I'm pretty sure we should provide some way to prioritize some cgroups
over others (in terms of OOM killer preferences), but I'm not 100% sure yet
what's the best way to do it. I've suggested something similar to the existing
oom_score_adj for tasks, mostly to follow the existing design.
One of the questions to answer in a priority-based model is
how to compare tasks in the root cgroup with cgroups.
> > ...
> > By default, it will look for the biggest leaf cgroup and kill
> > the largest task inside.
> >
> > But a user can change this behavior by enabling the per-cgroup
> > oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > the whole cgroup as an indivisible memory consumer. If it is
> > selected as an OOM victim, all tasks belonging to it will be killed.
> >
>
> These are two different things, right? We can adjust how the system oom
> killer chooses victims when memcg hierarchies overcommit the system to not
> strictly prefer the single process with the largest rss without killing
> everything attached to the memcg.
They are different, and I thought about providing two independent knobs.
But in the end I haven't found enough real-life examples where it can be useful.
Can you provide something here?
Also, they are different only for non-leaf cgroups; leaf cgroups
are always treated as indivisible memory consumers during victim selection.
I assume that containerized systems will always set oom_kill_all_tasks for
top-level container memory cgroups. By default it's turned off
to provide backward compatibility with the current behavior, avoid
excessive kills, and support oom_score_adj==-1000 (I've added this to v4,
will post soon).
> Separately: do you not intend to support memcg priorities at all, but
> rather strictly consider the "largeness" of a memcg versus other memcgs?
Answered above.
> In our methodology, each memcg is assigned a priority value and the
> iteration of the hierarchy simply compares and visits the memcg with the
> lowest priority at each level and then selects the largest process to
> kill. This could also support a "kill-all" knob.
>
> struct mem_cgroup *memcg = root_mem_cgroup;
> struct mem_cgroup *low_memcg;
> unsigned long low_priority;
>
> next:
> low_memcg = NULL;
> low_priority = ULONG_MAX;
> for_each_child_of_memcg(child, memcg) {
> unsigned long prio = memcg_oom_priority(child);
>
> if (prio < low_priority) {
> low_memcg = child;
> low_priority = prio;
> }
> }
> if (low_memcg) {
> memcg = low_memcg;
> goto next;
> }
> oom_kill_process_from_memcg(memcg);
>
> So this is a priority based model that is different than your aggregate
> usage model but I think it allows userspace to define a more powerful
> policy. We certainly may want to kill from a memcg with a single large
> process, or we may want to kill from a memcg with several small processes,
> it depends on the importance of that job.
I believe that both models have some advantages.
The priority-based model is more powerful, but requires support from userspace
to set up these priorities (and, probably, adjust them dynamically).
The size-based model is limited, but provides reasonable behavior
without any additional configuration.
I agree here with Michal Hocko that a bpf-like mechanism can be used
to provide an option for writing custom OOM policies. Once we
have the necessary infrastructure to iterate over the cgroup tree, select the
cgroup with the largest oom score, and kill the biggest task/all the tasks there,
we can add an ability to customize the cgroup (and maybe task) evaluation.
Thanks!
Roman
* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
2017-07-11 12:51 ` Roman Gushchin
@ 2017-07-11 20:56 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1707111342190.60183-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
0 siblings, 1 reply; 20+ messages in thread
From: David Rientjes @ 2017-07-11 20:56 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel
On Tue, 11 Jul 2017, Roman Gushchin wrote:
> > Yes, the original motivation was to limit killing to a single process, if
> > possible. To do that, we kill the process with the largest rss to free
> > the most memory and rely on the user to configure /proc/pid/oom_score_adj
> > if something else should be prioritized.
> >
> > With containerization and overcommit of system memory, we concur that
> > killing the single largest process isn't always preferable and neglects
> > the priority of its memcg. Your motivation seems to be to provide
> > fairness between one memcg with a large process and one memcg with a large
> > number of small processes; I'm curious if you are concerned about the
> > priority of a memcg hierarchy (how important that "job" is) or whether you
> > are strictly concerned with "largeness" of memcgs relative to each other.
>
> I'm pretty sure we should provide some way to prioritize some cgroups
> over others (in terms of OOM killer preferences), but I'm not 100% sure yet
> what's the best way to do it. I've suggested something similar to the existing
> oom_score_adj for tasks, mostly to follow the existing design.
>
> One of the questions to answer in a priority-based model is
> how to compare tasks in the root cgroup with cgroups.
>
We do this with an alternate scoring mechanism that is purely priority
based and tiebreaks based on largest rss. An additional tunable is added
for each process, under /proc/pid, and also to the memcg hierarchy, and is
enabled via a system-wide sysctl. A way to mesh the two scoring
mechanisms together would be helpful, but for our purposes we don't use
oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be
oom disabled when written by third-party apps.
For memcg oom conditions, iteration of the hierarchy begins at the oom
memcg. For system oom conditions, this is the root memcg.
All processes attached to the oom memcg have their priority-based value,
and this is compared to all child memcgs' priority values at that level.
If a process has the lowest priority, it is killed and we're done; we
could implement a "kill all" mechanism for this memcg that is checked
before the process is killed.
If a memcg has the lowest priority compared to attached processes, it is
iterated as well, and so on throughout the memcg hierarchy until we find
the lowest priority process in the lowest priority leaf memcg. This way,
we can fully control which process is killed for both system and memcg oom
conditions. I can easily post patches for this, we have used it for
years.
> > These are two different things, right? We can adjust how the system oom
> > killer chooses victims when memcg hierarchies overcommit the system to not
> > strictly prefer the single process with the largest rss without killing
> > everything attached to the memcg.
>
> They are different, and I thought about providing two independent knobs.
> But in the end I haven't found enough real-life examples where it can be useful.
> Can you provide something here?
>
Yes, we have users who we chown their memcg hierarchy to and have full
control over setting up their hierarchy however we want. Our "Activity
Manager", using Documentation/cgroup-v1/memory.txt terminology, is only
aware of the top-level memcg that was chown'd to the user. That user runs
a series of batch jobs that are submitted to it and each job is
represented as a subcontainer to enforce strict limits on the amount of
memory that job can use. When it becomes oom, we have found that it is
preferable to oom kill the entire batch job rather than leave it in an
inconsistent state, so enabling such a knob here would be helpful.
Other top-level jobs are fine with individual processes being oom killed.
It can be a low priority process for which they have full control over
defining the priority through the new per-process and per-memcg value
described above. Easy example is scraping logs periodically or other
best-effort tasks like cleanup. They can happily be oom killed and
rescheduled without taking down the entire first-class job.
> Also, they are different only for non-leaf cgroups; leaf cgroups
> are always treated as indivisible memory consumers during victim selection.
>
> I assume that containerized systems will always set oom_kill_all_tasks for
> top-level container memory cgroups. By default it's turned off
> to provide backward compatibility with the current behavior, avoid
> excessive kills, and support oom_score_adj==-1000 (I've added this to v4,
> will post soon).
>
We certainly would not be enabling it for top-level memcgs; there would be
no way that we could, because we have best-effort processes. But we would
like to enable it for small batch jobs that are run on behalf of a user in
their own subcontainer. We have had this usecase for ~3 years and solely
because of the problem that you pointed out earlier: it is often much more
reliable for the kernel to do oom killing of multiple processes rather
than userspace.
> > In our methodology, each memcg is assigned a priority value and the
> > iteration of the hierarchy simply compares and visits the memcg with the
> > lowest priority at each level and then selects the largest process to
> > kill. This could also support a "kill-all" knob.
> >
> > 	struct mem_cgroup *memcg = root_mem_cgroup;
> > 	struct mem_cgroup *low_memcg;
> > 	unsigned long low_priority;
> >
> > next:
> > 	low_memcg = NULL;
> > 	low_priority = ULONG_MAX;
> > 	for_each_child_of_memcg(memcg, iter) {
> > 		unsigned long prio = memcg_oom_priority(iter);
> >
> > 		if (prio < low_priority) {
> > 			low_memcg = iter;
> > 			low_priority = prio;
> > 		}
> > 	}
> > 	if (low_memcg) {
> > 		memcg = low_memcg;
> > 		goto next;
> > 	}
> > 	oom_kill_process_from_memcg(memcg);
> >
> > So this is a priority based model that is different than your aggregate
> > usage model but I think it allows userspace to define a more powerful
> > policy. We certainly may want to kill from a memcg with a single large
> > process, or we may want to kill from a memcg with several small processes,
> > it depends on the importance of that job.
>
> I believe that both models have some advantages.
> The priority-based model is more powerful, but requires support from userspace
> to set up these priorities (and, probably, adjust them dynamically).
It's a no-op if nobody sets up priorities or the system-wide sysctl is
disabled. Presumably, as in our model, the Activity Manager sets the
sysctl and is responsible for configuring the priorities if present. All
memcgs at the sibling level or subcontainer level remain the default if
not defined by the chown'd user, so this falls back to an rss model for
backwards compatibility.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [v3 3/6] mm, oom: cgroup-aware OOM killer debug info
[not found] <1498079956-24467-1-git-send-email-guro@fb.com>
2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
2017-06-21 21:19 ` [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups Roman Gushchin
` (2 subsequent siblings)
5 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
cgroups, linux-doc, linux-kernel
Dump the cgroup oom badness score, as well as the name
of the chosen victim cgroup.
Here is how it looks in dmesg:
[ 18.824495] Choosing a victim memcg because of the system-wide OOM
[ 18.826911] Cgroup /A1: 200805
[ 18.827996] Cgroup /A2: 273072
[ 18.828937] Cgroup /A2/B3: 51
[ 18.829795] Cgroup /A2/B4: 272969
[ 18.830800] Cgroup /A2/B5: 52
[ 18.831890] Chosen cgroup /A2/B4: 272969
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/memcontrol.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb5103..4face20 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2669,7 +2669,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
+
+ pr_info("Choosing a victim memcg because of the %s",
+ oc->memcg ?
+ "memory limit reached of cgroup " :
+ "system-wide OOM\n");
if (oc->memcg) {
+ pr_cont_cgroup_path(oc->memcg->css.cgroup);
+ pr_cont("\n");
+
chosen_memcg = oc->memcg;
parent = oc->memcg;
}
@@ -2683,6 +2691,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
points = mem_cgroup_oom_badness(iter, oc->nodemask);
+ pr_info("Cgroup ");
+ pr_cont_cgroup_path(iter->css.cgroup);
+ pr_cont(": %ld\n", points);
+
if (points > chosen_memcg_points) {
chosen_memcg = iter;
chosen_memcg_points = points;
@@ -2731,6 +2743,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
oc->chosen_memcg = chosen_memcg;
}
+ pr_info("Chosen cgroup ");
+ pr_cont_cgroup_path(chosen_memcg->css.cgroup);
+ pr_cont(": %ld\n", oc->chosen_points);
+
/*
* Even if we have to kill all tasks in the cgroup,
* we need to select the biggest task to start with.
@@ -2739,7 +2755,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
*/
oc->chosen_points = 0;
mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc);
- }
+ } else if (oc->chosen)
+ pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
+ oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
rcu_read_unlock();
--
2.7.4
* [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups
[not found] <1498079956-24467-1-git-send-email-guro@fb.com>
` (2 preceding siblings ...)
2017-06-21 21:19 ` [v3 3/6] mm, oom: cgroup-aware OOM killer debug info Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
2017-06-21 21:19 ` [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer Roman Gushchin
5 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
linux-kernel
Introduce a per-memory-cgroup oom_score_adj setting.
A read-write single value file which exists on non-root
cgroups. The default is "0".
It has a similar meaning to the per-process value,
available via /proc/<pid>/oom_score_adj.
Should be in the range [-1000, 1000].
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c59926c..b84a050 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,9 @@ struct mem_cgroup {
/* kill all tasks in the subtree in case of OOM */
bool oom_kill_all_tasks;
+ /* OOM kill score adjustment */
+ short oom_score_adj;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4face20..e474eba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5333,6 +5333,36 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_score_adj_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ short oom_score_adj = memcg->oom_score_adj;
+
+ seq_printf(m, "%d\n", oom_score_adj);
+
+ return 0;
+}
+
+static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_score_adj;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_score_adj);
+ if (err)
+ return err;
+
+ if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+ oom_score_adj > OOM_SCORE_ADJ_MAX)
+ return -EINVAL;
+
+ memcg->oom_score_adj = (short)oom_score_adj;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5459,6 +5489,12 @@ static struct cftype memory_files[] = {
.write = memory_oom_kill_all_tasks_write,
},
{
+ .name = "oom_score_adj",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_oom_score_adj_show,
+ .write = memory_oom_score_adj_write,
+ },
+ {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
--
2.7.4
* [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
[not found] <1498079956-24467-1-git-send-email-guro@fb.com>
` (3 preceding siblings ...)
2017-06-21 21:19 ` [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
2017-06-29 8:53 ` Michal Hocko
2017-06-21 21:19 ` [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer Roman Gushchin
5 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
linux-kernel
We want to limit the number of tasks which have access
to the memory reserves. To ensure progress, it's enough
to have one such process at a time.
If we need to kill the whole cgroup, let's give access to the
memory reserves only to the first process in the list, which is
(usually) the biggest process.
This gives us a good chance that all other processes will be able
to quit without access to the memory reserves.
Otherwise, to keep making forward progress, let's grant access to the
memory reserves to tasks which can't be reaped by the oom_reaper.
As this is done from the oom reaper thread, which handles the
oom reaper queue sequentially, there is no high risk of having too many
such processes at the same time.
To implement this solution, we need to stop using the TIF_MEMDIE flag
as a universal marker for oom victim tasks. It's not a big issue,
as we have the oom_mm pointer/tsk_is_oom_victim(), which are just better.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
kernel/exit.c | 2 +-
mm/oom_kill.c | 31 ++++++++++++++++++++++---------
2 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index d211425..5b95d74 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -554,7 +554,7 @@ static void exit_mm(void)
task_unlock(current);
mm_update_next_owner(mm);
mmput(mm);
- if (test_thread_flag(TIF_MEMDIE))
+ if (tsk_is_oom_victim(current))
exit_oom_victim();
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 489ab69..b55bd18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk)
struct mm_struct *mm = tsk->signal->oom_mm;
/* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+ while (attempts++ < MAX_OOM_REAP_RETRIES &&
+ !__oom_reap_task_mm(tsk, mm)) {
+
+ /*
+ * If the task has no access to the memory reserves,
+ * grant it to help the task to exit.
+ */
+ if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+ set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
schedule_timeout_idle(HZ/10);
+ }
if (attempts <= MAX_OOM_REAP_RETRIES)
goto done;
@@ -647,16 +657,13 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
*/
static void mark_oom_victim(struct task_struct *tsk)
{
- struct mm_struct *mm = tsk->mm;
-
WARN_ON(oom_killer_disabled);
- /* OOM killer might race with memcg OOM */
- if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
- return;
/* oom_mm is bound to the signal struct life time. */
- if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
- mmgrab(tsk->signal->oom_mm);
+ if (cmpxchg(&tsk->signal->oom_mm, NULL, tsk->mm) != NULL)
+ return;
+
+ mmgrab(tsk->signal->oom_mm);
/*
* Make sure that the task is woken up from uninterruptible sleep
@@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
* that TIF_MEMDIE tasks should be ignored.
*/
__thaw_task(tsk);
- atomic_inc(&oom_victims);
+
+ /*
+ * If there are no oom victims in flight,
+ * give the task an access to the memory reserves.
+ */
+ if (atomic_inc_return(&oom_victims) == 1)
+ set_tsk_thread_flag(tsk, TIF_MEMDIE);
}
/**
--
2.7.4
* Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
@ 2017-06-29 8:53 ` Michal Hocko
2017-06-29 18:45 ` Roman Gushchin
0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2017-06-29 8:53 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel
On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> We want to limit the number of tasks which are having an access
> to the memory reserves. To ensure the progress it's enough
> to have one such process at the time.
>
> If we need to kill the whole cgroup, let's give an access to the
> memory reserves only to the first process in the list, which is
> (usually) the biggest process.
> This will give us good chances that all other processes will be able
> to quit without an access to the memory reserves.
I don't like this to be honest. Is there any reason not to go with the
reduced memory reserves access for oom victims I was suggesting earlier [1]?
[1] http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
--
Michal Hocko
SUSE Labs
* Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
2017-06-29 8:53 ` Michal Hocko
@ 2017-06-29 18:45 ` Roman Gushchin
2017-06-30 8:25 ` Michal Hocko
0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2017-06-29 18:45 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel
On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote:
> On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> > We want to limit the number of tasks which are having an access
> > to the memory reserves. To ensure the progress it's enough
> > to have one such process at the time.
> >
> > If we need to kill the whole cgroup, let's give an access to the
> > memory reserves only to the first process in the list, which is
> > (usually) the biggest process.
> > This will give us good chances that all other processes will be able
> > to quit without an access to the memory reserves.
>
> I don't like this to be honest. Is there any reason to go the reduced
> memory reserves access to oom victims I was suggesting earlier [1]?
>
> > [1] http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
I've nothing against your approach. What's the state of this patchset?
Do you plan to bring it upstream?
Roman
* Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
2017-06-29 18:45 ` Roman Gushchin
@ 2017-06-30 8:25 ` Michal Hocko
0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2017-06-30 8:25 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel
On Thu 29-06-17 14:45:13, Roman Gushchin wrote:
> On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote:
> > On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> > > We want to limit the number of tasks which are having an access
> > > to the memory reserves. To ensure the progress it's enough
> > > to have one such process at the time.
> > >
> > > If we need to kill the whole cgroup, let's give an access to the
> > > memory reserves only to the first process in the list, which is
> > > (usually) the biggest process.
> > > This will give us good chances that all other processes will be able
> > > to quit without an access to the memory reserves.
> >
> > I don't like this to be honest. Is there any reason to go the reduced
> > memory reserves access to oom victims I was suggesting earlier [1]?
> >
> > > [1] http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
>
> I've nothing against your approach. What's the state of this patchset?
> Do you plan to bring it upstream?
Just the specific patch I have linked should be sufficient for what you
need here. The patchset had some issues which I didn't have time to fix,
so the above patch was not a high priority either.
--
Michal Hocko
SUSE Labs
* [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer
[not found] <1498079956-24467-1-git-send-email-guro@fb.com>
` (4 preceding siblings ...)
2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
5 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel
Update cgroups v2 docs.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
Documentation/cgroup-v2.txt | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index a86f3cb..7a1a1ac 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. Cgroup-aware OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -799,6 +800,26 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.oom_kill_all_tasks
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ Defines whether the OOM killer should treat the cgroup
+ as a single entity during victim selection.
+
+ If set, it causes the OOM killer to kill all tasks belonging
+ to the cgroup, in case of either a system-wide or a cgroup-wide OOM.
+
+ memory.oom_score_adj
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ OOM killer score adjustment, which has a similar meaning
+ to the per-process value available via /proc/<pid>/oom_score_adj.
+ Should be in the range [-1000, 1000].
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
@@ -1028,6 +1049,29 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
+5-2-4. Cgroup-aware OOM Killer
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which means that it treats memory cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+A user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer to treat
+the whole cgroup as an indivisible memory consumer. If it is
+selected as an OOM victim, all tasks belonging to it will be killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
5-3. IO
The "io" controller regulates the distribution of IO resources. This
--
2.7.4
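Putting the two knobs from this series together, a container manager could
configure a delegated hierarchy roughly as follows. The paths and values are
illustrative, and both files exist only with this patchset applied:

```shell
# Delegate a job hierarchy and make each batch job an indivisible
# OOM entity (illustrative paths; requires this patchset).
mkdir -p /sys/fs/cgroup/jobs/batch-1

# On OOM, kill every task in the job instead of a single process.
echo 1 > /sys/fs/cgroup/jobs/batch-1/memory.oom_kill_all_tasks

# Make a best-effort cleanup job a preferred OOM victim.
mkdir -p /sys/fs/cgroup/jobs/log-scraper
echo 500 > /sys/fs/cgroup/jobs/log-scraper/memory.oom_score_adj
```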