* [v11 0/6] cgroup-aware OOM killer
@ 2017-10-05 13:04 Roman Gushchin
2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
` (5 more replies)
0 siblings, 6 replies; 27+ messages in thread
From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel
This patchset makes the OOM killer cgroup-aware.
v11:
- Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
- Moved a check in __oom_kill_process() to the memory.oom_group
patch, added corresponding comments
- Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
- Rebase on top of mm tree
v10:
- Separate oom_group introduction into a standalone patch
- Stop propagating oom_group
- Make oom_group delegatable
- Do not try to kill the biggest task first,
if the whole cgroup is going to be killed
- Stop caching oom_score on struct memcg, optimize victim
memcg selection
- Drop dmesg printing (for further refining)
- Small refactorings and comments added here and there
- Rebase on top of mm tree
v9:
- Change siblings-to-siblings comparison to a tree-wide search,
make related refactorings
- Make oom_group implicitly propagated down the tree
- Fix an issue with task selection in root cgroup
v8:
- Do not kill tasks with OOM_SCORE_ADJ -1000
- Make the whole thing opt-in with cgroup mount option control
- Drop oom_priority for further discussions
- Kill the whole cgroup if oom_group is set and its
memory.max is reached
- Update docs and commit messages
v7:
- __oom_kill_process() drops reference to the victim task
- oom_score_adj -1000 is always respected
- Renamed oom_kill_all to oom_group
- Dropped oom_prio range, converted from short to int
- Added a cgroup v2 mount option to disable cgroup-aware OOM killer
- Docs updated
- Rebased on top of mmotm
v6:
- Renamed oom_control.chosen to oom_control.chosen_task
- Renamed oom_kill_all_tasks to oom_kill_all
- Per-node NR_SLAB_UNRECLAIMABLE accounting
- Several minor fixes and cleanups
- Docs updated
v5:
- Rebased on top of Michal Hocko's patches, which changed the
way OOM victims get access to the memory
reserves. Dropped the corresponding part of this patchset
- Separated the oom_kill_process() splitting into a standalone commit
- Added debug output (suggested by David Rientjes)
- Some minor fixes
v4:
- Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
- Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
- Memcg victim selection code is reworked, and
synchronization is based on finding tasks with the OOM victim marker,
rather than on a global counter
- Debug output is dropped
- Refactored TIF_MEMDIE usage
v3:
- Merged commits 1-4 into 6
- Separated oom_score_adj logic and debug output into separate commits
- Fixed swap accounting
v2:
- Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
- "Kill all tasks" is now an opt-in option, by default
only one process will be killed
- Added per-cgroup oom_score_adj
- Refined oom score calculations, suggested by Vladimir Davydov
- Converted to a patchset
v1:
https://lkml.org/lkml/2017/5/18/969
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Roman Gushchin (6):
mm, oom: refactor the oom_kill_process() function
mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
mm, oom: cgroup-aware OOM killer
mm, oom: introduce memory.oom_group
mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
mm, oom, docs: describe the cgroup-aware OOM killer
Documentation/cgroup-v2.txt | 51 +++++++++
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 34 ++++++
include/linux/oom.h | 12 ++-
kernel/cgroup/cgroup.c | 10 ++
mm/memcontrol.c | 249 +++++++++++++++++++++++++++++++++++++++++++-
mm/oom_kill.c | 210 ++++++++++++++++++++++++-------------
7 files changed, 495 insertions(+), 76 deletions(-)
--
2.13.6
^ permalink raw reply [flat|nested] 27+ messages in thread* [v11 1/6] mm, oom: refactor the oom_kill_process() function 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin 2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin ` (4 subsequent siblings) 5 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel The oom_kill_process() function consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits the oom_kill_process() function with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). 
Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 65 insertions(+), 58 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index f642a45b7f14..ccdb7d34cd13 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -845,68 +845,12 @@ static bool task_will_free_mem(struct task_struct *task) return ret; } -static void oom_kill_process(struct oom_control *oc, const char *message) +static void __oom_kill_process(struct task_struct *victim) { - struct task_struct *p = oc->chosen; - unsigned int points = oc->chosen_points; - struct task_struct *victim = p; - struct task_struct *child; - struct task_struct *t; + struct task_struct *p; struct mm_struct *mm; - unsigned int victim_points = 0; - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); bool can_oom_reap = true; - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just give it access to memory reserves - * so it can die quickly - */ - task_lock(p); - if (task_will_free_mem(p)) { - mark_oom_victim(p); - wake_oom_reaper(p); - task_unlock(p); - put_task_struct(p); - return; - } - task_unlock(p); - - if (__ratelimit(&oom_rs)) - dump_header(oc, p); - - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - - /* - * If any of p's children has a different mm and is eligible for kill, - * the one with the highest 
oom_badness() score is sacrificed for its - * parent. This attempts to lose the minimal amount of work done while - * still freeing memory. - */ - read_lock(&tasklist_lock); - for_each_thread(p, t) { - list_for_each_entry(child, &t->children, sibling) { - unsigned int child_points; - - if (process_shares_mm(child, p->mm)) - continue; - /* - * oom_badness() returns 0 if the thread is unkillable - */ - child_points = oom_badness(child, - oc->memcg, oc->nodemask, oc->totalpages); - if (child_points > victim_points) { - put_task_struct(victim); - victim = child; - victim_points = child_points; - get_task_struct(victim); - } - } - } - read_unlock(&tasklist_lock); - p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -980,6 +924,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message) } #undef K +static void oom_kill_process(struct oom_control *oc, const char *message) +{ + struct task_struct *p = oc->chosen; + unsigned int points = oc->chosen_points; + struct task_struct *victim = p; + struct task_struct *child; + struct task_struct *t; + unsigned int victim_points = 0; + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + /* + * If the task is already exiting, don't alarm the sysadmin or kill + * its children or threads, just give it access to memory reserves + * so it can die quickly + */ + task_lock(p); + if (task_will_free_mem(p)) { + mark_oom_victim(p); + wake_oom_reaper(p); + task_unlock(p); + put_task_struct(p); + return; + } + task_unlock(p); + + if (__ratelimit(&oom_rs)) + dump_header(oc, p); + + pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + + /* + * If any of p's children has a different mm and is eligible for kill, + * the one with the highest oom_badness() score is sacrificed for its + * parent. This attempts to lose the minimal amount of work done while + * still freeing memory. 
+ */ + read_lock(&tasklist_lock); + for_each_thread(p, t) { + list_for_each_entry(child, &t->children, sibling) { + unsigned int child_points; + + if (process_shares_mm(child, p->mm)) + continue; + /* + * oom_badness() returns 0 if the thread is unkillable + */ + child_points = oom_badness(child, + oc->memcg, oc->nodemask, oc->totalpages); + if (child_points > victim_points) { + put_task_struct(victim); + victim = child; + victim_points = child_points; + get_task_struct(victim); + } + } + } + read_unlock(&tasklist_lock); + + __oom_kill_process(victim); +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ -- 2.13.6 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin 2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin [not found] ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org> 2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin ` (3 subsequent siblings) 5 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Implement mem_cgroup_scan_tasks() functionality for the root memory cgroup to use this function for looking for a OOM victim task in the root memory cgroup by the cgroup-ware OOM killer. The root memory cgroup is treated as a leaf cgroup, so only tasks which are directly belonging to the root cgroup are iterated over. This patch doesn't introduce any functional change as mem_cgroup_scan_tasks() is never called for the root memcg. This is preparatory work for the cgroup-aware OOM killer, which will use this function to iterate over tasks belonging to the root memcg. 
Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/memcontrol.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c7410636fadf..41d71f665550 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg) * value, the function breaks the iteration loop and returns the value. * Otherwise, it will iterate over all tasks and return 0. * - * This function must not be called for the root memory cgroup. + * If memcg is the root memory cgroup, this function will iterate only + * over tasks belonging directly to the root memory cgroup. */ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, int (*fn)(struct task_struct *, void *), void *arg) @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, struct mem_cgroup *iter; int ret = 0; - BUG_ON(memcg == root_mem_cgroup); - for_each_mem_cgroup_tree(iter, memcg) { struct css_task_iter it; struct task_struct *task; @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, while (!ret && (task = css_task_iter_next(&it))) ret = fn(task, arg); css_task_iter_end(&it); - if (ret) { + if (ret || memcg == root_mem_cgroup) { mem_cgroup_iter_break(memcg, iter); break; } -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org>]
* Re: [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup [not found] ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-09 21:11 ` David Rientjes 0 siblings, 0 replies; 27+ messages in thread From: David Rientjes @ 2017-10-09 21:11 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, 5 Oct 2017, Roman Gushchin wrote: > Implement mem_cgroup_scan_tasks() functionality for the root > memory cgroup to use this function for looking for a OOM victim > task in the root memory cgroup by the cgroup-ware OOM killer. > > The root memory cgroup is treated as a leaf cgroup, so only tasks > which are directly belonging to the root cgroup are iterated over. > > This patch doesn't introduce any functional change as > mem_cgroup_scan_tasks() is never called for the root memcg. > This is preparatory work for the cgroup-aware OOM killer, > which will use this function to iterate over tasks belonging > to the root memcg. > > Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin 2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin 2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin [not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org> 2017-10-09 21:52 ` David Rientjes [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org> ` (2 subsequent siblings) 5 siblings, 2 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Traditionally, the OOM killer is operating on a process level. Under oom conditions, it finds a process with the highest oom score and kills it. This behavior doesn't suit well the system with many running containers: 1) There is no fairness between containers. A small container with few large processes will be chosen over a large one with huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in a case of a system-wide OOM. To address these issues, the cgroup-aware OOM killer is introduced. Under OOM conditions, it looks for the biggest leaf memory cgroup and kills the biggest task belonging to it. The following patches will extend this functionality to consider non-leaf memory cgroups as well, and also provide an ability to kill all tasks belonging to the victim cgroup. The root cgroup is treated as a leaf memory cgroup, so it's score is compared with leaf memory cgroups. 
Due to memcg statistics implementation a special algorithm is used for estimating it's oom_score: we define it as maximum oom_score of the belonging tasks. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 17 +++++ include/linux/oom.h | 12 +++- mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 70 +++++++++++++----- 4 files changed, 251 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..75b63b68846e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -936,6 +948,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 76aac4ce39bc..ca78e2d5956e 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -9,6 +9,13 @@ #include <linux/sched/coredump.h> /* MMF_* */ #include <linux/mm.h> /* VM_FAULT* */ + +/* + * Special value returned by victim selection functions to indicate + * that are inflight OOM victims. 
+ */ +#define INFLIGHT_VICTIM ((void *)-1UL) + struct zonelist; struct notifier_block; struct mem_cgroup; @@ -39,7 +46,8 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; - struct task_struct *chosen; + struct task_struct *chosen_task; + struct mem_cgroup *chosen_memcg; unsigned long chosen_points; }; @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 41d71f665550..191b70735f1f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long memcg_oom_badness(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + long points = 0; + int nid; + pg_data_t *pgdat; + + /* + * We don't have necessary stats for the root memcg, + * so we define it's oom_score as the maximum oom_score + * of the belonging tasks. + * + * As tasks in the root memcg unlikely are parts of a + * single workload, and we don't have to implement + * group killing, this approximation is reasonable. + * + * But if we will have necessary stats for the root memcg, + * we might switch to the approach which is used for all + * other memcgs. 
+ */ + if (memcg == root_mem_cgroup) { + struct css_task_iter it; + struct task_struct *task; + long score, max_score = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + score = oom_badness(task, memcg, nodemask, + totalpages); + if (score > max_score) + max_score = score; + } + css_task_iter_end(&it); + + return max_score; + } + + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; + + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + + pgdat = NODE_DATA(nid); + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), + NR_SLAB_UNRECLAIMABLE); + } + + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + + return points; +} + +/* + * Checks if the given memcg is a valid OOM victim and returns a number, + * which means the folowing: + * -1: there are inflight OOM victim tasks, belonging to the memcg + * 0: memcg is not eligible, e.g. all belonging tasks are protected + * by oom_score_adj set to OOM_SCORE_ADJ_MIN + * >0: memcg is eligible, and the returned value is an estimation + * of the memory footprint + */ +static long oom_evaluate_memcg(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + struct css_task_iter it; + struct task_struct *task; + int eligible = 0; + + /* + * Memcg is OOM eligible if there are OOM killable tasks inside. + * + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN + * as unkillable. + * + * If there are inflight OOM victim tasks inside the memcg, + * we return -1. 
+ */ + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (!eligible && + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) + eligible = 1; + + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + eligible = -1; + break; + } + } + css_task_iter_end(&it); + + if (eligible <= 0) + return eligible; + + return memcg_oom_badness(memcg, nodemask, totalpages); +} + +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) +{ + struct mem_cgroup *iter; + + oc->chosen_memcg = NULL; + oc->chosen_points = 0; + + /* + * The oom_score is calculated for leaf memory cgroups (including + * the root memcg). + */ + rcu_read_lock(); + for_each_mem_cgroup_tree(iter, root) { + long score; + + if (memcg_has_children(iter) && iter != root_mem_cgroup) + continue; + + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); + + /* + * Ignore empty and non-eligible memory cgroups. + */ + if (score == 0) + continue; + + /* + * If there are inflight OOM victims, we don't need + * to look further for new victims. + */ + if (score == -1) { + oc->chosen_memcg = INFLIGHT_VICTIM; + mem_cgroup_iter_break(root, iter); + break; + } + + if (score > oc->chosen_points) { + oc->chosen_points = score; + oc->chosen_memcg = iter; + } + } + + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) + css_get(&oc->chosen_memcg->css); + + rcu_read_unlock(); +} + +bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + struct mem_cgroup *root; + + if (mem_cgroup_disabled()) + return false; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return false; + + if (oc->memcg) + root = oc->memcg; + else + root = root_mem_cgroup; + + select_victim_memcg(root, oc); + + return oc->chosen_memcg; +} + /* * Reclaims as many pages from the given memcg as possible. 
* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ccdb7d34cd13..20e62ec32ba8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -310,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } -static int oom_evaluate_task(struct task_struct *task, void *arg) +int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; unsigned long points; @@ -344,26 +344,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) goto next; /* Prefer thread group leaders for display purposes */ - if (points == oc->chosen_points && thread_group_leader(oc->chosen)) + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task)) goto next; select: - if (oc->chosen) - put_task_struct(oc->chosen); + if (oc->chosen_task) + put_task_struct(oc->chosen_task); get_task_struct(task); - oc->chosen = task; + oc->chosen_task = task; oc->chosen_points = points; next: return 0; abort: - if (oc->chosen) - put_task_struct(oc->chosen); - oc->chosen = (void *)-1UL; + if (oc->chosen_task) + put_task_struct(oc->chosen_task); + oc->chosen_task = INFLIGHT_VICTIM; return 1; } /* * Simple selection loop. We choose the process with the highest number of - * 'points'. In case scan was aborted, oc->chosen is set to -1. + * 'points'. In case scan was aborted, oc->chosen_task is set to -1. 
*/ static void select_bad_process(struct oom_control *oc) { @@ -926,7 +926,7 @@ static void __oom_kill_process(struct task_struct *victim) static void oom_kill_process(struct oom_control *oc, const char *message) { - struct task_struct *p = oc->chosen; + struct task_struct *p = oc->chosen_task; unsigned int points = oc->chosen_points; struct task_struct *victim = p; struct task_struct *child; @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) __oom_kill_process(victim); } +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ + + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) + return oc->chosen_memcg; + + /* Kill a task in the chosen memcg with the biggest memory footprint */ + oc->chosen_points = 0; + oc->chosen_task = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) + goto out; + + __oom_kill_process(oc->chosen_task); + +out: + mem_cgroup_put(oc->chosen_memcg); + return oc->chosen_task; +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ @@ -1039,6 +1060,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + bool delay = false; /* if set, delay next allocation attempt */ if (oom_killer_disabled) return false; @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { get_task_struct(current); - oc->chosen = current; + oc->chosen_task = current; oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); return true; } + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { + delay = true; + goto out; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (oc->chosen && oc->chosen != (void *)-1UL) { + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); - /* - * Give the killed process a good chance to exit before trying - * to allocate memory again. - */ - schedule_timeout_killable(1); + delay = true; } - return !!oc->chosen; + +out: + /* + * Give the killed process a good chance to exit before trying + * to allocate memory again. + */ + if (delay) + schedule_timeout_killable(1); + + return !!oc->chosen_task; } /* -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-05 14:27 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-05 14:27 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu 05-10-17 14:04:51, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > > 2) Containers often do not expect that some random process inside > will be killed. In many cases much safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > > To address these issues, the cgroup-aware OOM killer is introduced. > > Under OOM conditions, it looks for the biggest leaf memory cgroup > and kills the biggest task belonging to it. The following patches > will extend this functionality to consider non-leaf memory cgroups > as well, and also provide an ability to kill all tasks belonging > to the victim cgroup. > > The root cgroup is treated as a leaf memory cgroup, so it's score > is compared with leaf memory cgroups. > Due to memcg statistics implementation a special algorithm > is used for estimating it's oom_score: we define it as maximum > oom_score of the belonging tasks. 
> > Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> > Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> > Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > Cc: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org> > Cc: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> > Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > Cc: kernel-team-b10kYP2dOMg@public.gmane.org > Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org Assuming this is an opt-in Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org> > --- > include/linux/memcontrol.h | 17 +++++ > include/linux/oom.h | 12 +++- > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++++++++++ > mm/oom_kill.c | 70 +++++++++++++----- > 4 files changed, 251 insertions(+), 20 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 69966c461d1c..75b63b68846e 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -35,6 +35,7 @@ struct mem_cgroup; > struct page; > struct mm_struct; > struct kmem_cache; > +struct oom_control; > > /* Cgroup-specific page state, on top of universal node page state */ > enum memcg_stat_item { > @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ > return css ? 
container_of(css, struct mem_cgroup, css) : NULL; > } > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg) > +{ > + css_put(&memcg->css); > +} > + > #define mem_cgroup_from_counter(counter, member) \ > container_of(counter, struct mem_cgroup, member) > > @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) > > bool mem_cgroup_oom_synchronize(bool wait); > > +bool mem_cgroup_select_oom_victim(struct oom_control *oc); > + > #ifdef CONFIG_MEMCG_SWAP > extern int do_swap_account; > #endif > @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, > return true; > } > > +static inline void mem_cgroup_put(struct mem_cgroup *memcg) > +{ > +} > + > static inline struct mem_cgroup * > mem_cgroup_iter(struct mem_cgroup *root, > struct mem_cgroup *prev, > @@ -936,6 +948,11 @@ static inline > void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) > { > } > + > +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + return false; > +} > #endif /* CONFIG_MEMCG */ > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 76aac4ce39bc..ca78e2d5956e 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -9,6 +9,13 @@ > #include <linux/sched/coredump.h> /* MMF_* */ > #include <linux/mm.h> /* VM_FAULT* */ > > + > +/* > + * Special value returned by victim selection functions to indicate > + * that are inflight OOM victims. 
> + */ > +#define INFLIGHT_VICTIM ((void *)-1UL) > + > struct zonelist; > struct notifier_block; > struct mem_cgroup; > @@ -39,7 +46,8 @@ struct oom_control { > > /* Used by oom implementation, do not set */ > unsigned long totalpages; > - struct task_struct *chosen; > + struct task_struct *chosen_task; > + struct mem_cgroup *chosen_memcg; > unsigned long chosen_points; > }; > > @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); > > extern struct task_struct *find_lock_task_mm(struct task_struct *p); > > +extern int oom_evaluate_task(struct task_struct *task, void *arg); > + > /* sysctls */ > extern int sysctl_oom_dump_tasks; > extern int sysctl_oom_kill_allocating_task; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 41d71f665550..191b70735f1f 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) > return ret; > } > > +static long memcg_oom_badness(struct mem_cgroup *memcg, > + const nodemask_t *nodemask, > + unsigned long totalpages) > +{ > + long points = 0; > + int nid; > + pg_data_t *pgdat; > + > + /* > + * We don't have necessary stats for the root memcg, > + * so we define it's oom_score as the maximum oom_score > + * of the belonging tasks. > + * > + * As tasks in the root memcg unlikely are parts of a > + * single workload, and we don't have to implement > + * group killing, this approximation is reasonable. > + * > + * But if we will have necessary stats for the root memcg, > + * we might switch to the approach which is used for all > + * other memcgs. 
> + */ > + if (memcg == root_mem_cgroup) { > + struct css_task_iter it; > + struct task_struct *task; > + long score, max_score = 0; > + > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + score = oom_badness(task, memcg, nodemask, > + totalpages); > + if (score > max_score) > + max_score = score; > + } > + css_task_iter_end(&it); > + > + return max_score; > + } > + > + for_each_node_state(nid, N_MEMORY) { > + if (nodemask && !node_isset(nid, *nodemask)) > + continue; > + > + points += mem_cgroup_node_nr_lru_pages(memcg, nid, > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); > + > + pgdat = NODE_DATA(nid); > + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), > + NR_SLAB_UNRECLAIMABLE); > + } > + > + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / > + (PAGE_SIZE / 1024); > + points += memcg_page_state(memcg, MEMCG_SOCK); > + points += memcg_page_state(memcg, MEMCG_SWAP); > + > + return points; > +} > + > +/* > + * Checks if the given memcg is a valid OOM victim and returns a number, > + * which means the folowing: > + * -1: there are inflight OOM victim tasks, belonging to the memcg > + * 0: memcg is not eligible, e.g. all belonging tasks are protected > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > + * >0: memcg is eligible, and the returned value is an estimation > + * of the memory footprint > + */ > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > + const nodemask_t *nodemask, > + unsigned long totalpages) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int eligible = 0; > + > + /* > + * Memcg is OOM eligible if there are OOM killable tasks inside. > + * > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > + * as unkillable. > + * > + * If there are inflight OOM victim tasks inside the memcg, > + * we return -1. 
> + */ > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + if (!eligible && > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > + eligible = 1; > + > + if (tsk_is_oom_victim(task) && > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > + eligible = -1; > + break; > + } > + } > + css_task_iter_end(&it); > + > + if (eligible <= 0) > + return eligible; > + > + return memcg_oom_badness(memcg, nodemask, totalpages); > +} > + > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > +{ > + struct mem_cgroup *iter; > + > + oc->chosen_memcg = NULL; > + oc->chosen_points = 0; > + > + /* > + * The oom_score is calculated for leaf memory cgroups (including > + * the root memcg). > + */ > + rcu_read_lock(); > + for_each_mem_cgroup_tree(iter, root) { > + long score; > + > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > + continue; > + > + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); > + > + /* > + * Ignore empty and non-eligible memory cgroups. > + */ > + if (score == 0) > + continue; > + > + /* > + * If there are inflight OOM victims, we don't need > + * to look further for new victims. 
> + */ > + if (score == -1) { > + oc->chosen_memcg = INFLIGHT_VICTIM; > + mem_cgroup_iter_break(root, iter); > + break; > + } > + > + if (score > oc->chosen_points) { > + oc->chosen_points = score; > + oc->chosen_memcg = iter; > + } > + } > + > + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) > + css_get(&oc->chosen_memcg->css); > + > + rcu_read_unlock(); > +} > + > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + struct mem_cgroup *root; > + > + if (mem_cgroup_disabled()) > + return false; > + > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > + return false; > + > + if (oc->memcg) > + root = oc->memcg; > + else > + root = root_mem_cgroup; > + > + select_victim_memcg(root, oc); > + > + return oc->chosen_memcg; > +} > + > /* > * Reclaims as many pages from the given memcg as possible. > * > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ccdb7d34cd13..20e62ec32ba8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -310,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) > return CONSTRAINT_NONE; > } > > -static int oom_evaluate_task(struct task_struct *task, void *arg) > +int oom_evaluate_task(struct task_struct *task, void *arg) > { > struct oom_control *oc = arg; > unsigned long points; > @@ -344,26 +344,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) > goto next; > > /* Prefer thread group leaders for display purposes */ > - if (points == oc->chosen_points && thread_group_leader(oc->chosen)) > + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task)) > goto next; > select: > - if (oc->chosen) > - put_task_struct(oc->chosen); > + if (oc->chosen_task) > + put_task_struct(oc->chosen_task); > get_task_struct(task); > - oc->chosen = task; > + oc->chosen_task = task; > oc->chosen_points = points; > next: > return 0; > abort: > - if (oc->chosen) > - put_task_struct(oc->chosen); > - oc->chosen = (void *)-1UL; > + if (oc->chosen_task) > + 
put_task_struct(oc->chosen_task); > + oc->chosen_task = INFLIGHT_VICTIM; > return 1; > } > > /* > * Simple selection loop. We choose the process with the highest number of > - * 'points'. In case scan was aborted, oc->chosen is set to -1. > + * 'points'. In case scan was aborted, oc->chosen_task is set to -1. > */ > static void select_bad_process(struct oom_control *oc) > { > @@ -926,7 +926,7 @@ static void __oom_kill_process(struct task_struct *victim) > > static void oom_kill_process(struct oom_control *oc, const char *message) > { > - struct task_struct *p = oc->chosen; > + struct task_struct *p = oc->chosen_task; > unsigned int points = oc->chosen_points; > struct task_struct *victim = p; > struct task_struct *child; > @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > __oom_kill_process(victim); > } > > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > + > + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) > + return oc->chosen_memcg; > + > + /* Kill a task in the chosen memcg with the biggest memory footprint */ > + oc->chosen_points = 0; > + oc->chosen_task = NULL; > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > + > + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) > + goto out; > + > + __oom_kill_process(oc->chosen_task); > + > +out: > + mem_cgroup_put(oc->chosen_memcg); > + return oc->chosen_task; > +} > + > /* > * Determines whether the kernel must panic because of the panic_on_oom sysctl. 
> */ > @@ -1039,6 +1060,7 @@ bool out_of_memory(struct oom_control *oc) > { > unsigned long freed = 0; > enum oom_constraint constraint = CONSTRAINT_NONE; > + bool delay = false; /* if set, delay next allocation attempt */ > > if (oom_killer_disabled) > return false; > @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) > current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > get_task_struct(current); > - oc->chosen = current; > + oc->chosen_task = current; > oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); > return true; > } > > + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { > + delay = true; > + goto out; > + } > + > select_bad_process(oc); > /* Found nothing?!?! Either we hang forever, or we panic. */ > - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > dump_header(oc, NULL); > panic("Out of memory and no killable processes...\n"); > } > - if (oc->chosen && oc->chosen != (void *)-1UL) { > + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > "Memory cgroup out of memory"); > - /* > - * Give the killed process a good chance to exit before trying > - * to allocate memory again. > - */ > - schedule_timeout_killable(1); > + delay = true; > } > - return !!oc->chosen; > + > +out: > + /* > + * Give the killed process a good chance to exit before trying > + * to allocate memory again. > + */ > + if (delay) > + schedule_timeout_killable(1); > + > + return !!oc->chosen_task; > } > > /* > -- > 2.13.6 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin [not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-09 21:52 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 2017-10-10 12:23 ` Roman Gushchin 1 sibling, 2 replies; 27+ messages in thread From: David Rientjes @ 2017-10-09 21:52 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Thu, 5 Oct 2017, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > > 2) Containers often do not expect that some random process inside > will be killed. In many cases much safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > I'd move the second point to the changelog for the next patch since this patch doesn't implement any support for memory.oom_group. > To address these issues, the cgroup-aware OOM killer is introduced. > > Under OOM conditions, it looks for the biggest leaf memory cgroup > and kills the biggest task belonging to it. The following patches > will extend this functionality to consider non-leaf memory cgroups > as well, and also provide an ability to kill all tasks belonging > to the victim cgroup. 
> > The root cgroup is treated as a leaf memory cgroup, so it's score > is compared with leaf memory cgroups. > Due to memcg statistics implementation a special algorithm > is used for estimating it's oom_score: we define it as maximum > oom_score of the belonging tasks. > This seems to unfairly bias the root mem cgroup depending on process size. It isn't treated fairly as a leaf mem cgroup if they are being compared based on different criteria: the root mem cgroup as (mostly) the largest rss of a single process vs leaf mem cgroups as all anon, unevictable, and unreclaimable slab pages charged to it by all processes. I imagine a configuration where the root mem cgroup has 100 processes attached each with rss of 80MB, compared to a leaf cgroup with 100 processes of 1MB rss each. How does this logic prevent repeatedly oom killing the processes of 1MB rss? In this case, "the root cgroup is treated as a leaf memory cgroup" isn't quite fair, it can simply hide large processes from being selected. Users who configure cgroups in a unified hierarchy for other resource constraints are penalized for this choice even though the mem cgroup with 100 processes of 1MB rss each may not be limited itself. I think for this comparison to be fair, it requires accounting for the root mem cgroup itself or for a different accounting methodology for leaf memory cgroups. > +/* > + * Checks if the given memcg is a valid OOM victim and returns a number, > + * which means the folowing: > + * -1: there are inflight OOM victim tasks, belonging to the memcg > + * 0: memcg is not eligible, e.g. 
all belonging tasks are protected > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > + * >0: memcg is eligible, and the returned value is an estimation > + * of the memory footprint > + */ > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > + const nodemask_t *nodemask, > + unsigned long totalpages) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int eligible = 0; > + > + /* > + * Memcg is OOM eligible if there are OOM killable tasks inside. > + * > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > + * as unkillable. > + * > + * If there are inflight OOM victim tasks inside the memcg, > + * we return -1. > + */ > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + if (!eligible && > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > + eligible = 1; > + > + if (tsk_is_oom_victim(task) && > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > + eligible = -1; > + break; > + } > + } > + css_task_iter_end(&it); > + > + if (eligible <= 0) > + return eligible; > + > + return memcg_oom_badness(memcg, nodemask, totalpages); > +} > + > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > +{ > + struct mem_cgroup *iter; > + > + oc->chosen_memcg = NULL; > + oc->chosen_points = 0; > + > + /* > + * The oom_score is calculated for leaf memory cgroups (including > + * the root memcg). > + */ > + rcu_read_lock(); > + for_each_mem_cgroup_tree(iter, root) { > + long score; > + > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > + continue; I'll reiterate what I did on the last version of the patchset: considering only leaf memory cgroups easily allows users to defeat this heuristic and bias against all of their memory usage up to the largest process size amongst the set of processes attached. 
If the user creates N child mem cgroups for their N processes and attaches one process to each child, the _only_ thing this achieved is to defeat your heuristic and prefer other leaf cgroups simply because those other leaf cgroups did not do this. Effectively: for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done will radically shift the heuristic from a score of all anonymous + unevictable memory for all processes to a score of the largest anonymous + unevictable memory for a single process. There is no downside or ramification for the end user in doing this. When comparing cgroups based on usage, it only makes sense to compare the hierarchical usage of that cgroup so that attaching processes to descendants or splitting the implementation of a process into several smaller individual processes does not allow this heuristic to be defeated. > + > + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); > + > + /* > + * Ignore empty and non-eligible memory cgroups. > + */ > + if (score == 0) > + continue; > + > + /* > + * If there are inflight OOM victims, we don't need > + * to look further for new victims. > + */ > + if (score == -1) { > + oc->chosen_memcg = INFLIGHT_VICTIM; > + mem_cgroup_iter_break(root, iter); > + break; > + } > + > + if (score > oc->chosen_points) { > + oc->chosen_points = score; > + oc->chosen_memcg = iter; > + } I'll reiterate what I did on previous versions of this patchset: this effectively removes all control the user has from influencing oom victim selection. Victim selection is very important, the user must be able to influence that decision to prevent the loss of important work when the system is out of memory. This heuristic only uses user influence by considering whether a memory cgroup is eligible depending on whether all processes have /proc/pid/oom_score_adj == -1000 or not. 
It means a user must oom disable all processes attached to an important memory cgroup that has not reached its limit to prevent it from being oom killed with this heuristic. It simply has no other choice. It cannot differentiate between two memory cgroups where one is expected to have much higher memory usage, and should be protected because of end user goals. > + } > + > + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) > + css_get(&oc->chosen_memcg->css); > + > + rcu_read_unlock(); > +} > + > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + struct mem_cgroup *root; > + > + if (mem_cgroup_disabled()) > + return false; > + > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > + return false; > + > + if (oc->memcg) > + root = oc->memcg; > + else > + root = root_mem_cgroup; > + > + select_victim_memcg(root, oc); > + > + return oc->chosen_memcg; > +} > + > /* > * Reclaims as many pages from the given memcg as possible. > * > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ccdb7d34cd13..20e62ec32ba8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > __oom_kill_process(victim); > } > > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > + > + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) > + return oc->chosen_memcg; > + > + /* Kill a task in the chosen memcg with the biggest memory footprint */ > + oc->chosen_points = 0; > + oc->chosen_task = NULL; > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > + > + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) > + goto out; > + > + __oom_kill_process(oc->chosen_task); > + > +out: > + mem_cgroup_put(oc->chosen_memcg); > + return oc->chosen_task; > +} > + > /* > * Determines whether the kernel must panic because of the panic_on_oom sysctl. 
> */ > @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) > current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > get_task_struct(current); > - oc->chosen = current; > + oc->chosen_task = current; > oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); > return true; > } > > + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { > + delay = true; > + goto out; > + } > + > select_bad_process(oc); This is racy because mem_cgroup_select_oom_victim() found an eligible oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a process cannot be killed because of oom_unkillable_task(), the only eligible processes moved or exited, or the /proc/pid/oom_score_adj of the eligible processes changed, we end up falling back to the complete tasklist scan. It would be better for oom_evaluate_memcg() to consider oom_unkillable_task() and also retry in the case where oom_kill_memcg_victim() returns NULL. ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-10 8:18 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-10 8:18 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon 09-10-17 14:52:53, David Rientjes wrote: > On Thu, 5 Oct 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer is operating on a process level. > > Under oom conditions, it finds a process with the highest oom score > > and kills it. > > > > This behavior doesn't suit well the system with many running > > containers: > > > > 1) There is no fairness between containers. A small container with > > few large processes will be chosen over a large one with huge > > number of small processes. > > > > 2) Containers often do not expect that some random process inside > > will be killed. In many cases much safer behavior is to kill > > all tasks in the container. Traditionally, this was implemented > > in userspace, but doing it in the kernel has some advantages, > > especially in a case of a system-wide OOM. > > > > I'd move the second point to the changelog for the next patch since this > patch doesn't implement any support for memory.oom_group. > > > To address these issues, the cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for the biggest leaf memory cgroup > > and kills the biggest task belonging to it. The following patches > > will extend this functionality to consider non-leaf memory cgroups > > as well, and also provide an ability to kill all tasks belonging > > to the victim cgroup. 
> > > > The root cgroup is treated as a leaf memory cgroup, so it's score > > is compared with leaf memory cgroups. > > Due to memcg statistics implementation a special algorithm > > is used for estimating it's oom_score: we define it as maximum > > oom_score of the belonging tasks. > > > > This seems to unfairly bias the root mem cgroup depending on process size. > It isn't treated fairly as a leaf mem cgroup if they are being compared > based on different criteria: the root mem cgroup as (mostly) the largest > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > unreclaimable slab pages charged to it by all processes. > > I imagine a configuration where the root mem cgroup has 100 processes > attached each with rss of 80MB, compared to a leaf cgroup with 100 > processes of 1MB rss each. How does this logic prevent repeatedly oom > killing the processes of 1MB rss? > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > quite fair, it can simply hide large processes from being selected. Users > who configure cgroups in a unified hierarchy for other resource > constraints are penalized for this choice even though the mem cgroup with > 100 processes of 1MB rss each may not be limited itself. > > I think for this comparison to be fair, it requires accounting for the > root mem cgroup itself or for a different accounting methodology for leaf > memory cgroups. I believe this is documented in the patch. I agree with you but I also assume this will not be such a big problem in practice because usecases which are going to opt-in for the cgroup aware OOM killer will have the all workloads running in memcgs and the root will basically run only some essential system wide services needed for the overall system operation. Risk of the runaway of this should be reasonably small and killing any of those will put the system into an unstable state anyway. 
That being said future improvements are possible but I guess that shouldn't be a roadblock for the feature to be merged. > > +/* > > + * Checks if the given memcg is a valid OOM victim and returns a number, > > + * which means the folowing: > > + * -1: there are inflight OOM victim tasks, belonging to the memcg > > + * 0: memcg is not eligible, e.g. all belonging tasks are protected > > + * by oom_score_adj set to OOM_SCORE_ADJ_MIN > > + * >0: memcg is eligible, and the returned value is an estimation > > + * of the memory footprint > > + */ > > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > > + const nodemask_t *nodemask, > > + unsigned long totalpages) > > +{ > > + struct css_task_iter it; > > + struct task_struct *task; > > + int eligible = 0; > > + > > + /* > > + * Memcg is OOM eligible if there are OOM killable tasks inside. > > + * > > + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN > > + * as unkillable. > > + * > > + * If there are inflight OOM victim tasks inside the memcg, > > + * we return -1. > > + */ > > + css_task_iter_start(&memcg->css, 0, &it); > > + while ((task = css_task_iter_next(&it))) { > > + if (!eligible && > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) > > + eligible = 1; > > + > > + if (tsk_is_oom_victim(task) && > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > > + eligible = -1; > > + break; > > + } > > + } > > + css_task_iter_end(&it); > > + > > + if (eligible <= 0) > > + return eligible; > > + > > + return memcg_oom_badness(memcg, nodemask, totalpages); > > +} > > + > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > > +{ > > + struct mem_cgroup *iter; > > + > > + oc->chosen_memcg = NULL; > > + oc->chosen_points = 0; > > + > > + /* > > + * The oom_score is calculated for leaf memory cgroups (including > > + * the root memcg). 
> > + */ > > + rcu_read_lock(); > > + for_each_mem_cgroup_tree(iter, root) { > > + long score; > > + > > + if (memcg_has_children(iter) && iter != root_mem_cgroup) > > + continue; > > I'll reiterate what I did on the last version of the patchset: considering > only leaf memory cgroups easily allows users to defeat this heuristic and > bias against all of their memory usage up to the largest process size > amongst the set of processes attached. If the user creates N child mem > cgroups for their N processes and attaches one process to each child, the > _only_ thing this achieved is to defeat your heuristic and prefer other > leaf cgroups simply because those other leaf cgroups did not do this. I do not think repeating the argument is both needed nor helpful. It has been already argued that the userspace is already able to do the same by splitting the memory consumptions between processes. I would argue even further that allowing an untrusted entity to create arbitrary sub groups is dangerous for other reasons. > Effectively: > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > will radically shift the heuristic from a score of all anonymous + > unevictable memory for all processes to a score of the largest anonymous + > unevictable memory for a single process. There is no downside or > ramifaction for the end user in doing this. When comparing cgroups based > on usage, it only makes sense to compare the hierarchical usage of that > cgroup so that attaching processes to descendants or splitting the > implementation of a process into several smaller individual processes does > not allow this heuristic to be defeated. But it breaks other usecases as already pointed out and it is quite sad you keep ignoring those. > > + > > + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); > > + > > + /* > > + * Ignore empty and non-eligible memory cgroups. 
> > + */ > > + if (score == 0) > > + continue; > > + > > + /* > > + * If there are inflight OOM victims, we don't need > > + * to look further for new victims. > > + */ > > + if (score == -1) { > > + oc->chosen_memcg = INFLIGHT_VICTIM; > > + mem_cgroup_iter_break(root, iter); > > + break; > > + } > > + > > + if (score > oc->chosen_points) { > > + oc->chosen_points = score; > > + oc->chosen_memcg = iter; > > + } > > I'll reiterate what I did on previous versions of this patchset: this > effectively removes all control the user has from influencing oom victim > selection. Victim selection is very important, the user must be able to > influence that decision to prevent the loss of important work when the > system is out of memory. And again it has been argued, and rightfully so, that this is not in scope of this series and a more advanced user space influence can be implemented on top. [...] > > @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc) > > current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && > > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > > get_task_struct(current); > > - oc->chosen = current; > > + oc->chosen_task = current; > > oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); > > return true; > > } > > > > + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { > > + delay = true; > > + goto out; > > + } > > + > > select_bad_process(oc); > > This is racy because mem_cgroup_select_oom_victim() found an eligible > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a > process cannot be killed because of oom_unkillable_task(), the only > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the > eligible processes changed, we end up falling back to the complete > tasklist scan. oom victim selection will always be racy wrt. tasks exiting. 
Falling back to the complete tasklist scan should be tolerable as this is really not even remotely close to a hot path. > It would be better for oom_evaluate_memcg() to consider > oom_unkillable_task() and also retry in the case where > oom_kill_memcg_victim() returns NULL. I am not really sure oom_unkillable_task will really help. Most of the conditions are simply not applicable to the memcgs' tasks. The only interesting one might be has_intersects_mems_allowed but even that one is quite questionable. memcg_oom_badness is already NUMA aware and has_intersects_mems_allowed is not much more reliable way to detect specific node consumers anyway. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
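The effect of the `mkdir`-per-process trick under dispute in the last two messages can also be quantified with a toy model (the numbers are illustrative; only the leaf-sum scoring rule from the patch is modeled):

```python
# 20 processes of 10 memory units each, all in one leaf cgroup:
tasks = [10] * 20
score_together = sum(tasks)   # the whole workload competes with score 200

# After moving each process into its own child cgroup, every leaf holds
# exactly one task, so the largest score any leaf exposes is one task:
score_split = max(tasks)      # 10

assert score_together == 200 and score_split == 10
```

Whether this is an exploitable regression (David's view) or no worse than what multi-process workloads can already do today (Michal's view) is exactly the disagreement above; the arithmetic itself is not in dispute.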
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-09 21:52 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-10 12:23 ` Roman Gushchin 2017-10-10 21:13 ` David Rientjes 1 sibling, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-10 12:23 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Mon, Oct 09, 2017 at 02:52:53PM -0700, David Rientjes wrote: > On Thu, 5 Oct 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer operates at the process level. > > Under oom conditions, it finds the process with the highest oom score > > and kills it. > > > > This behavior doesn't suit systems with many running > > containers well: > > > > 1) There is no fairness between containers. A small container with > > a few large processes will be chosen over a large one with a huge > > number of small processes. > > > > 2) Containers often do not expect that some random process inside > > will be killed. In many cases a much safer behavior is to kill > > all tasks in the container. Traditionally, this was implemented > > in userspace, but doing it in the kernel has some advantages, > > especially in the case of a system-wide OOM. > > > > I'd move the second point to the changelog for the next patch since this > patch doesn't implement any support for memory.oom_group. There is a special remark later in the changelog explaining that this functionality will be added by the following patches. I thought it was useful to have all the basic ideas in one place. > > > To address these issues, the cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for the biggest leaf memory cgroup > > and kills the biggest task belonging to it.
The following patches > > will extend this functionality to consider non-leaf memory cgroups > > as well, and also provide an ability to kill all tasks belonging > > to the victim cgroup. > > > > The root cgroup is treated as a leaf memory cgroup, so its score > > is compared with leaf memory cgroups. > > Due to the memcg statistics implementation, a special algorithm > > is used for estimating its oom_score: we define it as the maximum > > oom_score of the belonging tasks. > > > > This seems to unfairly bias the root mem cgroup depending on process size. > It isn't treated fairly as a leaf mem cgroup if they are being compared > based on different criteria: the root mem cgroup as (mostly) the largest > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > unreclaimable slab pages charged to it by all processes. > > I imagine a configuration where the root mem cgroup has 100 processes > attached each with rss of 80MB, compared to a leaf cgroup with 100 > processes of 1MB rss each. How does this logic prevent repeatedly oom > killing the processes of 1MB rss? > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > quite fair, it can simply hide large processes from being selected. Users > who configure cgroups in a unified hierarchy for other resource > constraints are penalized for this choice even though the mem cgroup with > 100 processes of 1MB rss each may not be limited itself. > > I think for this comparison to be fair, it requires accounting for the > root mem cgroup itself or for a different accounting methodology for leaf > memory cgroups. This is basically a workaround, because we don't have the necessary stats for the root memory cgroup. If we start gathering them at some point, we can change this and treat the root memcg exactly like other leaf cgroups. Or, if someone comes up with an idea for a better approximation, it can be implemented as a separate enhancement on top of the initial implementation.
This is more than welcome. > > I'll reiterate what I did on the last version of the patchset: considering > only leaf memory cgroups easily allows users to defeat this heuristic and > bias against all of their memory usage up to the largest process size > amongst the set of processes attached. If the user creates N child mem > cgroups for their N processes and attaches one process to each child, the > _only_ thing this achieved is to defeat your heuristic and prefer other > leaf cgroups simply because those other leaf cgroups did not do this. > > Effectively: > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > will radically shift the heuristic from a score of all anonymous + > unevictable memory for all processes to a score of the largest anonymous + > unevictable memory for a single process. There is no downside or > ramification for the end user in doing this. When comparing cgroups based > on usage, it only makes sense to compare the hierarchical usage of that > cgroup so that attaching processes to descendants or splitting the > implementation of a process into several smaller individual processes does > not allow this heuristic to be defeated. To all that was said previously I can only add that cgroup v2 allows limiting the number of cgroups in a sub-tree: 1a926e0bbab8 ("cgroup: implement hierarchy limits").
The fallback to the existing mechanism is implemented to be safe for sure, especially in the case of a global OOM. When we get more confidence in the cgroup-aware OOM killer's reliability, we can change this behavior. Personally, I would prefer to get rid of looking at all tasks just to find a pre-existing OOM victim, but it might be quite tricky to implement. Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
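David's evasion example in the exchange above can be made concrete with a small sketch. The numbers and the scoring rule are assumptions for illustration: each leaf cgroup is scored as the sum of its processes' charged memory, a simplified stand-in for the patchset's anon + unevictable + unreclaimable slab heuristic:

```python
def leaf_scores(cgroups):
    """cgroups: {leaf cgroup name: [per-process usage, MB]}.
    Score each leaf by the total usage of its processes (simplified)."""
    return {name: sum(procs) for name, procs in cgroups.items()}

# One leaf cgroup with 100 processes of 80MB each (assumed numbers).
before = leaf_scores({"job": [80] * 100})

# Same workload after "for i in $(cat cgroup.procs); do mkdir $i; ...":
# 100 single-process child cgroups, each now scored independently.
after = leaf_scores({f"job/{i}": [80] for i in range(100)})

worst_before = max(before.values())   # 8000 MB -> obvious OOM target
worst_after = max(after.values())     # 80 MB per child -> easily evaded
```

The workload's total usage is unchanged, but its worst per-leaf score drops by two orders of magnitude, which is exactly the objection to scoring only leaf cgroups.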
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 12:23 ` Roman Gushchin @ 2017-10-10 21:13 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: David Rientjes @ 2017-10-10 21:13 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue, 10 Oct 2017, Roman Gushchin wrote: > > This seems to unfairly bias the root mem cgroup depending on process size. > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > based on different criteria: the root mem cgroup as (mostly) the largest > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > unreclaimable slab pages charged to it by all processes. > > > > I imagine a configuration where the root mem cgroup has 100 processes > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > killing the processes of 1MB rss? > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > quite fair, it can simply hide large processes from being selected. Users > > who configure cgroups in a unified hierarchy for other resource > > constraints are penalized for this choice even though the mem cgroup with > > 100 processes of 1MB rss each may not be limited itself. > > > > I think for this comparison to be fair, it requires accounting for the > > root mem cgroup itself or for a different accounting methodology for leaf > > memory cgroups. > > This is basically a workaround, because we don't have necessary stats for root > memory cgroup. If we'll start gathering them at some point, we can change this > and treat root memcg exactly as other leaf cgroups. 
> I understand why it currently cannot be an apples vs apples comparison without, as I suggest in the last paragraph, that the same accounting is done for the root mem cgroup, which is intuitive if it is to be considered on the same basis as leaf mem cgroups. I understand for the design to work that leaf mem cgroups and the root mem cgroup must be compared if processes can be attached to the root mem cgroup. My point is that it is currently completely unfair as I've stated: you can have 10000 processes attached to the root mem cgroup with rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each and the oom killer is going to target the leaf mem cgroup as a result of this apples vs oranges comparison. In case it's not clear, the 10000 processes of 80MB rss each is the most likely contributor to a system-wide oom kill. Unfortunately, the heuristic introduced by this patchset is broken wrt a fair comparison of the root mem cgroup usage. > Or, if someone will come with an idea of a better approximation, it can be > implemented as a separate enhancement on top of the initial implementation. > This is more than welcome. > We don't need a better approximation, we need a fair comparison. The heuristic that this patchset is implementing is based on the usage of individual mem cgroups. For the root mem cgroup to be considered eligible, we need to understand its usage. That usage is _not_ what is implemented by this patchset, which is the largest rss of a single attached process. This, in fact, is not an "approximation" at all. In the example of 10000 processes attached with 80MB rss each, the usage of the root mem cgroup is _not_ 80MB. I'll restate that oom killing a process is a last resort for the kernel, but it also must be able to make a smart decision. Targeting dozens of 1MB processes instead of 80MB processes because of a shortcoming in this implementation is not the appropriate selection, it's the opposite of the correct selection. 
> > I'll reiterate what I did on the last version of the patchset: considering > > only leaf memory cgroups easily allows users to defeat this heuristic and > > bias against all of their memory usage up to the largest process size > > amongst the set of processes attached. If the user creates N child mem > > cgroups for their N processes and attaches one process to each child, the > > _only_ thing this achieved is to defeat your heuristic and prefer other > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > Effectively: > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > will radically shift the heuristic from a score of all anonymous + > > unevictable memory for all processes to a score of the largest anonymous + > > unevictable memory for a single process. There is no downside or > > ramifaction for the end user in doing this. When comparing cgroups based > > on usage, it only makes sense to compare the hierarchical usage of that > > cgroup so that attaching processes to descendants or splitting the > > implementation of a process into several smaller individual processes does > > not allow this heuristic to be defeated. > > To all previously said words I can only add that cgroup v2 allows to limit > the amount of cgroups in the sub-tree: > 1a926e0bbab8 ("cgroup: implement hierarchy limits"). > So the solution to for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done evading all oom kills for your mem cgroup is to limit the number of cgroups that can be created by the user? With a unified cgroup hierarchy, that doesn't work well if I wanted to actually constrain these individual processes to different resource limits like cpu usage. In fact, the user may not know it is effectively evading the oom killer entirely because it has constrained the cpu of individual processes because its a side-effect of this heuristic. 
You chose not to respond to my reiteration of userspace having absolutely no control over victim selection with the new heuristic without setting all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a very important job that is running on a system that is really supposed to use 80% of memory, I need to be able to specify that it should not be oom killed based on user goals. Setting all processes to be oom disabled in the important mem cgroup to avoid being oom killed unless absolutely necessary in a system oom condition is not a robust solution: (1) the mem cgroup livelocks if it reaches its own mem cgroup limit and (2) the system panic()'s if these preferred mem cgroups are the only consumers left on the system. With overcommit, both of these possibilities exist in the wild and the problem is only a result of the implementation detail of this patchset. For these reasons: unfair comparison of root mem cgroup usage to bias against that mem cgroup from oom kill in system oom conditions, the ability of users to completely evade the oom killer by attaching all processes to child cgroups either purposefully or unpurposefully, and the inability of userspace to effectively control oom victim selection: Nacked-by: David Rientjes <rientjes@google.com> > > This is racy because mem_cgroup_select_oom_victim() found an eligible > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a > > process cannot be killed because of oom_unkillable_task(), the only > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the > > eligible processes changed, we end up falling back to the complete > > tasklist scan. It would be better for oom_evaluate_memcg() to consider > > oom_unkillable_task() and also retry in the case where > > oom_kill_memcg_victim() returns NULL. > > I agree with you here. 
The fallback to the existing mechanism is implemented > to be safe for sure, especially in a case of a global OOM. When we'll get > more confidence in cgroup-aware OOM killer reliability, we can change this > behavior. Personally, I would prefer to get rid of looking at all tasks just > to find a pre-existing OOM victim, but it might be quite tricky to implement. > I'm not sure what this has to do with confidence in this patchset's reliability? The race obviously exists: mem_cgroup_select_oom_victim() found an eligible process in oc->chosen_memcg but it was either ineligible later because of oom_unkillable_task(), it moved, or it exited. It's a race. For users who opt-in to this new heuristic, they should not be concerned with a process exiting and thus killing a completely unexpected process from an unexpected memcg when it should be possible to retry and select the correct victim. It's much better to document and state to the user what they are opting-in to and clearly define how a victim is chosen with the new heuristic and then implement that so it works correctly. ^ permalink raw reply [flat|nested] 27+ messages in thread
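The root-vs-leaf asymmetry David objects to above can be shown with a toy calculation. All numbers come from his example and the scoring rules are simplified stand-ins: the root cgroup scored by the largest single task rss (the patchset's approximation), a leaf cgroup by its total charged usage:

```python
# Assumed workload from the example: 10000 root tasks of 80MB rss each
# vs. a leaf cgroup with 100 tasks of 1MB each.
root_tasks_rss = [80] * 10000
leaf_tasks_usage = [1] * 100

root_score = max(root_tasks_rss)     # 80: largest single task rss
leaf_score = sum(leaf_tasks_usage)   # 100: total charged usage

# The leaf cgroup "wins" the comparison and gets targeted, even though
# the root cgroup's tasks hold ~800GB in total.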
[parent not found: <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-10 22:04 ` Roman Gushchin 2017-10-11 20:21 ` David Rientjes 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-10 22:04 UTC (permalink / raw) To: David Rientjes Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > This seems to unfairly bias the root mem cgroup depending on process size. > > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > > based on different criteria: the root mem cgroup as (mostly) the largest > > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > > unreclaimable slab pages charged to it by all processes. > > > > > > I imagine a configuration where the root mem cgroup has 100 processes > > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > > killing the processes of 1MB rss? > > > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > > quite fair, it can simply hide large processes from being selected. Users > > > who configure cgroups in a unified hierarchy for other resource > > > constraints are penalized for this choice even though the mem cgroup with > > > 100 processes of 1MB rss each may not be limited itself. > > > > > > I think for this comparison to be fair, it requires accounting for the > > > root mem cgroup itself or for a different accounting methodology for leaf > > > memory cgroups. 
> > > > This is basically a workaround, because we don't have necessary stats for root > > memory cgroup. If we'll start gathering them at some point, we can change this > > and treat root memcg exactly as other leaf cgroups. > > > > I understand why it currently cannot be an apples vs apples comparison > without, as I suggest in the last paragraph, that the same accounting is > done for the root mem cgroup, which is intuitive if it is to be considered > on the same basis as leaf mem cgroups. > > I understand for the design to work that leaf mem cgroups and the root mem > cgroup must be compared if processes can be attached to the root mem > cgroup. My point is that it is currently completely unfair as I've > stated: you can have 10000 processes attached to the root mem cgroup with > rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each > and the oom killer is going to target the leaf mem cgroup as a result of > this apples vs oranges comparison. > > In case it's not clear, the 10000 processes of 80MB rss each is the most > likely contributor to a system-wide oom kill. Unfortunately, the > heuristic introduced by this patchset is broken wrt a fair comparison of > the root mem cgroup usage. > > > Or, if someone will come with an idea of a better approximation, it can be > > implemented as a separate enhancement on top of the initial implementation. > > This is more than welcome. > > > > We don't need a better approximation, we need a fair comparison. The > heuristic that this patchset is implementing is based on the usage of > individual mem cgroups. For the root mem cgroup to be considered > eligible, we need to understand its usage. That usage is _not_ what is > implemented by this patchset, which is the largest rss of a single > attached process. This, in fact, is not an "approximation" at all. In > the example of 10000 processes attached with 80MB rss each, the usage of > the root mem cgroup is _not_ 80MB. 
It's hard to imagine a "healthy" setup with 10000 processes in the root memory cgroup, and even if we kill 1 process we will still have 9999 remaining processes. I agree with you to some extent, but it's not a real-world example. > > I'll restate that oom killing a process is a last resort for the kernel, > but it also must be able to make a smart decision. Targeting dozens of > 1MB processes instead of 80MB processes because of a shortcoming in this > implementation is not the appropriate selection, it's the opposite of the > correct selection. > > > > I'll reiterate what I did on the last version of the patchset: considering > > > only leaf memory cgroups easily allows users to defeat this heuristic and > > > bias against all of their memory usage up to the largest process size > > > amongst the set of processes attached. If the user creates N child mem > > > cgroups for their N processes and attaches one process to each child, the > > > _only_ thing this achieved is to defeat your heuristic and prefer other > > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > > > Effectively: > > > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > > > will radically shift the heuristic from a score of all anonymous + > > > unevictable memory for all processes to a score of the largest anonymous + > > > unevictable memory for a single process. There is no downside or > > > ramification for the end user in doing this. When comparing cgroups based > > > on usage, it only makes sense to compare the hierarchical usage of that > > > cgroup so that attaching processes to descendants or splitting the > > > implementation of a process into several smaller individual processes does > > > not allow this heuristic to be defeated. > > > > To all previously said words I can only add that cgroup v2 allows to limit > > the amount of cgroups in the sub-tree: > > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> > > > So the solution to > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > evading all oom kills for your mem cgroup is to limit the number of > cgroups that can be created by the user? With a unified cgroup hierarchy, > that doesn't work well if I wanted to actually constrain these individual > processes to different resource limits like cpu usage. In fact, the user > may not know it is effectively evading the oom killer entirely because it > has constrained the cpu of individual processes because its a side-effect > of this heuristic. > > > You chose not to respond to my reiteration of userspace having absolutely > no control over victim selection with the new heuristic without setting > all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a > very important job that is running on a system that is really supposed to > use 80% of memory, I need to be able to specify that it should not be oom > killed based on user goals. Setting all processes to be oom disabled in > the important mem cgroup to avoid being oom killed unless absolutely > necessary in a system oom condition is not a robust solution: (1) the mem > cgroup livelocks if it reaches its own mem cgroup limit and (2) the system > panic()'s if these preferred mem cgroups are the only consumers left on > the system. With overcommit, both of these possibilities exist in the > wild and the problem is only a result of the implementation detail of this > patchset. 
> > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or unpurposefully, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> So, if we sum the oom_score of the tasks belonging to the root memory cgroup, will it fix the problem? It might have some drawbacks as well (especially around oom_score_adj), but it's doable if we ignore tasks which are not the owners of their mm struct. > > > > This is racy because mem_cgroup_select_oom_victim() found an eligible > > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible > > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a > > > process cannot be killed because of oom_unkillable_task(), the only > > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the > > > eligible processes changed, we end up falling back to the complete > > > tasklist scan. It would be better for oom_evaluate_memcg() to consider > > > oom_unkillable_task() and also retry in the case where > > > oom_kill_memcg_victim() returns NULL. > > > > I agree with you here. The fallback to the existing mechanism is implemented > > to be safe for sure, especially in a case of a global OOM. When we'll get > > more confidence in cgroup-aware OOM killer reliability, we can change this > > behavior. Personally, I would prefer to get rid of looking at all tasks just > > to find a pre-existing OOM victim, but it might be quite tricky to implement. > > > > I'm not sure what this has to do with confidence in this patchset's > reliability?
The race obviously exists: mem_cgroup_select_oom_victim() > found an eligible process in oc->chosen_memcg but it was either ineligible > later because of oom_unkillable_task(), it moved, or it exited. It's a > race. For users who opt in to this new heuristic, they should not be > concerned with a process exiting and thus killing a completely unexpected > process from an unexpected memcg when it should be possible to retry and > select the correct victim. Yes, I have to agree here. Looks like we can't fall back to the original policy. Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
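The retry idea both sides converge on here — re-running memcg selection instead of falling back to a global tasklist scan when the chosen memcg races with an exit or a move — could look roughly like this sketch. It is purely illustrative: `pick_task` is a hypothetical stand-in for a mem_cgroup_scan_tasks()-style search that may come up empty due to the race:

```python
def pick_victim(memcg_scores, pick_task):
    """memcg_scores: {memcg name: score}; pick_task(memcg) returns a
    killable task from that memcg, or None if it raced with an exit,
    a move, or an oom_score_adj change."""
    candidates = dict(memcg_scores)
    while candidates:
        memcg = max(candidates, key=candidates.get)
        task = pick_task(memcg)
        if task is not None:
            return memcg, task
        # The chosen memcg yielded no killable task: drop it and retry
        # with the remaining cgroups rather than scanning all tasks.
        del candidates[memcg]
    return None
```

The point of the sketch is only that a bounded retry over the remaining cgroups replaces the fallback to the complete tasklist scan that David objects to.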
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 22:04 ` Roman Gushchin @ 2017-10-11 20:21 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: David Rientjes @ 2017-10-11 20:21 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue, 10 Oct 2017, Roman Gushchin wrote: > > We don't need a better approximation, we need a fair comparison. The > > heuristic that this patchset is implementing is based on the usage of > > individual mem cgroups. For the root mem cgroup to be considered > > eligible, we need to understand its usage. That usage is _not_ what is > > implemented by this patchset, which is the largest rss of a single > > attached process. This, in fact, is not an "approximation" at all. In > > the example of 10000 processes attached with 80MB rss each, the usage of > > the root mem cgroup is _not_ 80MB. > > It's hard to imagine a "healthy" setup with 10000 process in the root > memory cgroup, and even if we kill 1 process we will still have 9999 > remaining process. I agree with you at some point, but it's not > a real world example. > It's an example that illustrates the problem with the unfair comparison between the root mem cgroup and leaf mem cgroups. It's unfair to compare [largest rss of a single process attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an approximation, as previously stated: the usage of the root mem cgroup is not 100MB if there are 10 such processes attached to the root mem cgroup, it's off by orders of magnitude. For the root mem cgroup to be treated equally as a leaf mem cgroup as this patchset proposes, it must have a fair comparison. 
That can be done by accounting memory to the root mem cgroup in the same way it is to leaf mem cgroups. But let's move the discussion forward to fix it. To avoid necessarily accounting memory to the root mem cgroup, have we considered if it is even necessary to address the root mem cgroup? For the users who opt-in to this heuristic, would it be possible to discount the root mem cgroup from the heuristic entirely so that oom kills originate from leaf mem cgroups? Or, perhaps better, oom kill from non-memory.oom_group cgroups only if the victim rss is greater than an eligible victim rss attached to the root mem cgroup? > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes <rientjes@google.com> > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > will it fix the problem? > > It might have some drawbacks as well (especially around oom_score_adj), > but it's doable, if we'll ignore tasks which are not owners of their's mm struct. > You would be required to discount oom_score_adj because the heuristic doesn't account for oom_score_adj when comparing the anon + unevictable + unreclaimable slab of leaf mem cgroups. This wouldn't result in the correct victim selection in real-world scenarios where processes attached to the root mem cgroup are vital to the system and not part of any user job, i.e. they are important system daemons and the "activity manager" responsible for orchestrating the cgroup hierarchy. It's also still unfair because it now compares [sum of rss of processes attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. 
RSS isn't going to be a solution, regardless if its one process or all processes, if it's being compared to more types of memory in leaf cgroups. If we really don't want root mem cgroup accounting so this is a fair comparison, I think the heuristic needs to special case the root mem cgroup either by discounting root oom kills if there are eligible oom kills from leaf cgroups (the user would be opting-in to this behavior) or comparing the badness of a victim from a leaf cgroup to the badness of a victim from the root cgroup when deciding which to kill and allow the user to protect root mem cgroup processes with oom_score_adj. That aside, all of this has only addressed one of the three concerns with the patchset. I believe the solution to avoid allowing users to circumvent oom kill is to account usage up the hierarchy as you have done in the past. Cgroup hierarchies can be managed by the user so they can create their own subcontainers, this is nothing new, and I would hope that you wouldn't limit your feature to only a very specific set of usecases. That may be your solution for the root mem cgroup itself: if the hierarchical usage of all top-level mem cgroups is known, it's possible to find the root mem cgroup usage by subtraction, you are using stats that are global vmstats in your heuristic. Accounting usage up the hierarchy avoids the first two concerns with the patchset. It allows you to implicitly understand the usage of the root mem cgroup itself, and does not allow users to circumvent oom kill by creating subcontainers, either purposefully or not. The third concern, userspace influence, can allow users to attack leaf mem cgroups deeper in the tree if it is using more memory than expected, but the hierarchical usage is lower at the top-level. That is the only objection that I have seen to using hierarchical usage: there may be a single cgroup deeper in the tree that avoids oom kill because another hierarchy has a higher usage. 
This can trivially be addressed either by oom priorities or an adjustment, just like oom_score_adj, on cgroup usage. ^ permalink raw reply [flat|nested] 27+ messages in thread
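The subtraction approach David suggests above — deriving root-only usage from global counters once every top-level cgroup's hierarchical usage is known — can be sketched as follows. All cgroup names and numbers are assumed for illustration:

```python
def hierarchical_usage(tree, node):
    """Usage charged to `node` itself plus everything below it."""
    own, children = tree[node]
    return own + sum(hierarchical_usage(tree, child) for child in children)

# Assumed layout, usage in MB: {cgroup: (own usage, [children])}
tree = {
    "workload":   (100, ["workload/a", "workload/b"]),
    "workload/a": (300, []),
    "workload/b": (200, []),
    "system":     (250, []),
}

system_usage = 1200                      # global vmstat-style counter, assumed
top_level = ["workload", "system"]

# hierarchical(workload)=600, hierarchical(system)=250, so the memory
# attributable only to root tasks falls out by subtraction: 1200-850=350.
root_only = system_usage - sum(hierarchical_usage(tree, c) for c in top_level)
```

This matches the argument in the email: hierarchical accounting at the top level both defeats the child-cgroup evasion trick and yields the root cgroup's own usage implicitly, without new root-level accounting.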
[parent not found: <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-11 21:49 ` Roman Gushchin 2017-10-12 21:50 ` David Rientjes 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-11 21:49 UTC (permalink / raw) To: David Rientjes Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > We don't need a better approximation, we need a fair comparison. The > > > heuristic that this patchset is implementing is based on the usage of > > > individual mem cgroups. For the root mem cgroup to be considered > > > eligible, we need to understand its usage. That usage is _not_ what is > > > implemented by this patchset, which is the largest rss of a single > > > attached process. This, in fact, is not an "approximation" at all. In > > > the example of 10000 processes attached with 80MB rss each, the usage of > > > the root mem cgroup is _not_ 80MB. > > > > It's hard to imagine a "healthy" setup with 10000 process in the root > > memory cgroup, and even if we kill 1 process we will still have 9999 > > remaining process. I agree with you at some point, but it's not > > a real world example. > > > > It's an example that illustrates the problem with the unfair comparison > between the root mem cgroup and leaf mem cgroups. It's unfair to compare > [largest rss of a single process attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. 
It's not an > approximation, as previously stated: the usage of the root mem cgroup is > not 100MB if there are 10 such processes attached to the root mem cgroup, > it's off by orders of magnitude. > > For the root mem cgroup to be treated equally as a leaf mem cgroup as this > patchset proposes, it must have a fair comparison. That can be done by > accounting memory to the root mem cgroup in the same way it is to leaf mem > cgroups. > > But let's move the discussion forward to fix it. To avoid necessarily > accounting memory to the root mem cgroup, have we considered if it is even > necessary to address the root mem cgroup? For the users who opt-in to > this heuristic, would it be possible to discount the root mem cgroup from > the heuristic entirely so that oom kills originate from leaf mem cgroups? > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > the victim rss is greater than an eligible victim rss attached to the root > mem cgroup? David, I'm not pretending for implementing the best possible accounting for the root memory cgroup, and I'm sure there is a place for further enhancement. But if it's not leading to some obviously stupid victim selection (like ignoring leaking task, which consumes most of the memory), I don't see why it should be treated as a blocker for the whole patchset. I also doubt that any of us has these examples, and the best way to get them is to get some real usage feedback. Ignoring oom_score_adj, subtracting leaf usage sum from system usage etc, these all are perfect ideas which can be implemented on top of this patchset. 
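The mismatch being debated here can be made concrete with a toy calculation; all numbers below are invented for illustration and only model the comparison logic, not the kernel code:

```python
# Toy numbers illustrating the comparison David objects to: the root
# memcg is scored by the largest RSS of a single attached process,
# while a leaf memcg is scored by all memory charged to it.
NPROC_ROOT = 10000
RSS_EACH_MB = 80

root_score_mb = RSS_EACH_MB                # largest single RSS
root_actual_mb = NPROC_ROOT * RSS_EACH_MB  # what root really uses

# A leaf cgroup charged 200 MB in total outscores the root, even though
# the root's processes together use orders of magnitude more memory.
leaf_score_mb = 200
```

Under these numbers the leaf cgroup would be selected although the root cgroup's processes collectively use 4000 times more memory, which is exactly the unfairness being argued over.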
> > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > > against that mem cgroup from oom kill in system oom conditions, the > > > ability of users to completely evade the oom killer by attaching all > > > processes to child cgroups either purposefully or unpurposefully, and the > > > inability of userspace to effectively control oom victim selection: > > > > > > Nacked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > > > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > > will it fix the problem? > > > > It might have some drawbacks as well (especially around oom_score_adj), > > but it's doable, if we'll ignore tasks which are not owners of their's mm struct. > > > > You would be required to discount oom_score_adj because the heuristic > doesn't account for oom_score_adj when comparing the anon + unevictable + > unreclaimable slab of leaf mem cgroups. This wouldn't result in the > correct victim selection in real-world scenarios where processes attached > to the root mem cgroup are vital to the system and not part of any user > job, i.e. they are important system daemons and the "activity manager" > responsible for orchestrating the cgroup hierarchy. > > It's also still unfair because it now compares > [sum of rss of processes attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > going to be a solution, regardless if its one process or all processes, if > it's being compared to more types of memory in leaf cgroups. 
> > If we really don't want root mem cgroup accounting so this is a fair > comparison, I think the heuristic needs to special case the root mem > cgroup either by discounting root oom kills if there are eligible oom > kills from leaf cgroups (the user would be opting-in to this behavior) or > comparing the badness of a victim from a leaf cgroup to the badness of a > victim from the root cgroup when deciding which to kill and allow the user > to protect root mem cgroup processes with oom_score_adj. > > That aside, all of this has only addressed one of the three concerns with > the patchset. > > I believe the solution to avoid allowing users to circumvent oom kill is > to account usage up the hierarchy as you have done in the past. Cgroup > hierarchies can be managed by the user so they can create their own > subcontainers, this is nothing new, and I would hope that you wouldn't > limit your feature to only a very specific set of usecases. That may be > your solution for the root mem cgroup itself: if the hierarchical usage of > all top-level mem cgroups is known, it's possible to find the root mem > cgroup usage by subtraction, you are using stats that are global vmstats > in your heuristic. > > Accounting usage up the hierarchy avoids the first two concerns with the > patchset. It allows you to implicitly understand the usage of the root > mem cgroup itself, and does not allow users to circumvent oom kill by > creating subcontainers, either purposefully or not. The third concern, > userspace influence, can allow users to attack leaf mem cgroups deeper in > the tree if it is using more memory than expected, but the hierarchical > usage is lower at the top-level. That is the only objection that I have > seen to using hierarchical usage: there may be a single cgroup deeper in > the tree that avoids oom kill because another hierarchy has a higher > usage. 
This can trivially be addressed either by oom priorities or an > adjustment, just like oom_score_adj, on cgroup usage. As I've said, I barely understand how the exact implementation of root memory cgroup accounting is considered a blocker for the whole feature. The same is true for oom priorities: it's something that can and should be implemented on top of the basic semantics, introduced by this patchset. So, the only real question is the way how we find a victim memcg in the subtree: by performing independent election on each level or by searching tree-wide. We all had many discussion around, and as you remember, initially I was supporting the first option. But then Michal provided a very strong argument: if you have 3 similar workloads in A, B and C, but for non-memory-related reasons (e.g. cpu time sharing) you have to join A and B into a group D:
  /\
 D  C
 / \
A   B
it's strange to penalize A and B for it. It looks to me that you're talking about the similar case, but you consider this hierarchy useful. So, overall, it seems to be depending on exact configuration. I have to add, that if you can enable memory.oom_group, your problem doesn't exist. The selected approach is easy extendable into hierarchical direction: as I've said before, we can introduce a new value of memory.oom_group, which will enable cumulative accounting without mass killing. And, tbh, I don't see how oom_priorities will resolve an opposite problem if we'd take the hierarchical approach. Thanks!
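The two selection strategies being debated can be sketched as a toy simulation; the hierarchy mirrors the D/C example above, the usage numbers are invented, and this models only the comparison logic, not the kernel implementation:

```python
# A, B and C are similar workloads; A and B are joined under D for
# non-memory reasons. Usages (arbitrary units) are illustrative only.
usage = {"A": 100, "B": 100, "C": 150}
children = {"root": ["D", "C"], "D": ["A", "B"]}

def subtree_usage(node):
    # Hierarchical usage: a leaf's own usage, or the sum of its subtree.
    if node in usage:
        return usage[node]
    return sum(subtree_usage(c) for c in children[node])

def pick_levelwise(node="root"):
    # Independent election on each level: descend into the child whose
    # subtree uses the most memory.
    while node in children:
        node = max(children[node], key=subtree_usage)
    return node

def pick_treewide():
    # Tree-wide search: compare all leaf cgroups directly.
    return max(usage, key=usage.get)
```

Level-wise, D (200) beats C (150), so A or B gets killed even though C is the largest individual workload: A and B are penalized for being grouped. The tree-wide search picks C instead, which is Michal's argument in a nutshell.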
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-11 21:49 ` Roman Gushchin @ 2017-10-12 21:50 ` David Rientjes 2017-10-13 13:32 ` Roman Gushchin 0 siblings, 1 reply; 27+ messages in thread From: David Rientjes @ 2017-10-12 21:50 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Wed, 11 Oct 2017, Roman Gushchin wrote: > > But let's move the discussion forward to fix it. To avoid necessarily > > accounting memory to the root mem cgroup, have we considered if it is even > > necessary to address the root mem cgroup? For the users who opt-in to > > this heuristic, would it be possible to discount the root mem cgroup from > > the heuristic entirely so that oom kills originate from leaf mem cgroups? > > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > > the victim rss is greater than an eligible victim rss attached to the root > > mem cgroup? > > David, I'm not pretending for implementing the best possible accounting > for the root memory cgroup, and I'm sure there is a place for further > enhancement. But if it's not leading to some obviously stupid victim > selection (like ignoring leaking task, which consumes most of the memory), > I don't see why it should be treated as a blocker for the whole patchset. > I also doubt that any of us has these examples, and the best way to get > them is to get some real usage feedback. > > Ignoring oom_score_adj, subtracting leaf usage sum from system usage etc, > these all are perfect ideas which can be implemented on top of this patchset. > For the root mem cgroup to be compared to leaf mem cgroups, it needs a fair comparison, not something that we leave to some future patches on top of this patchset. We can't compare some cgroups with other cgroups based on different criteria depending on which cgroup is involved. 
It's actually a quite trivial problem to address, it was a small modification to your hierarchical usage patchset if that's the way that you elect to fix it. I know that some of our customers use cgroups only for one or two jobs on the system, and that isn't necessarily just for memory limitation. The fact remains that without considering the root mem cgroup fairly, these customers are unfairly biased against because they have aggregated their processes in a cgroup. This is not a highly specialized usecase, I am positive that many users use cgroups only for a subset of processes. This heuristic penalizes that behavior to prefer them as oom victims. The problem needs to be fixed instead of asking for the patchset to be merged and hoping that we'll address these issues later. If you account for hierarchical usage, you can easily subtract this from global vmstats to get an implicit root usage. > > You would be required to discount oom_score_adj because the heuristic > > doesn't account for oom_score_adj when comparing the anon + unevictable + > > unreclaimable slab of leaf mem cgroups. This wouldn't result in the > > correct victim selection in real-world scenarios where processes attached > > to the root mem cgroup are vital to the system and not part of any user > > job, i.e. they are important system daemons and the "activity manager" > > responsible for orchestrating the cgroup hierarchy. > > > > It's also still unfair because it now compares > > [sum of rss of processes attached to a cgroup] to > > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > > going to be a solution, regardless if its one process or all processes, if > > it's being compared to more types of memory in leaf cgroups.
> > > > If we really don't want root mem cgroup accounting so this is a fair > > comparison, I think the heuristic needs to special case the root mem > > cgroup either by discounting root oom kills if there are eligible oom > > kills from leaf cgroups (the user would be opting-in to this behavior) or > > comparing the badness of a victim from a leaf cgroup to the badness of a > > victim from the root cgroup when deciding which to kill and allow the user > > to protect root mem cgroup processes with oom_score_adj. > > > > That aside, all of this has only addressed one of the three concerns with > > the patchset. > > > > I believe the solution to avoid allowing users to circumvent oom kill is > > to account usage up the hierarchy as you have done in the past. Cgroup > > hierarchies can be managed by the user so they can create their own > > subcontainers, this is nothing new, and I would hope that you wouldn't > > limit your feature to only a very specific set of usecases. That may be > > your solution for the root mem cgroup itself: if the hierarchical usage of > > all top-level mem cgroups is known, it's possible to find the root mem > > cgroup usage by subtraction, you are using stats that are global vmstats > > in your heuristic. > > > > Accounting usage up the hierarchy avoids the first two concerns with the > > patchset. It allows you to implicitly understand the usage of the root > > mem cgroup itself, and does not allow users to circumvent oom kill by > > creating subcontainers, either purposefully or not. The third concern, > > userspace influence, can allow users to attack leaf mem cgroups deeper in > > the tree if it is using more memory than expected, but the hierarchical > > usage is lower at the top-level. That is the only objection that I have > > seen to using hierarchical usage: there may be a single cgroup deeper in > > the tree that avoids oom kill because another hierarchy has a higher > > usage. 
This can trivially be addressed either by oom priorities or an > > adjustment, just like oom_score_adj, on cgroup usage. > > As I've said, I barely understand how the exact implementation of root memory > cgroup accounting is considered a blocker for the whole feature. > The same is true for oom priorities: it's something that can and should > be implemented on top of the basic semantics, introduced by this patchset. > No, we cannot merge incomplete features that have well identified issues by simply saying that we'll address those issues later. We need a patchset that is complete. Wrt root mem cgroup usage, this change is actually quite trivial with hierarchical usage. The memory cgroup is based on accounting hierarchical usage, you actually have all the data you need already available in the kernel. Iterating all root processes for where task == mm->owner and then accounting rss for those processes is not the same as a leaf cgroup's anonymous + unevictable + unreclaimable slab. It's not even a close approximation in some cases. OOM priorities are a different concern, but it also needs to be addressed as a complete solution. This patchset removes virtually all control the user has in preferring a cgroup for oom kill or biasing against a cgroup for oom kill. The patchset is moving the selection criteria from individual processes to cgroups. Great! Just allow userspace to have influence over that selection just like /proc/pid/oom_adj has existed for over a decade and is very widespread. Users need the ability to protect important cgroups on the system, just like they need the ability to protect important processes on the system with the current heuristic. If a single cgroup accounts for 50% of memory, it will always be the chosen victim memcg with your heuristic. The only thing that is being asked here is that userspace be able to say that cgroup is actually really important and we should oom kill something else. Not hard whatsoever.
These two issues are actually very trivial to implement, and you actually implemented 95% of it in earlier iterations of the patchset. It was a beautiful solution to all of these concerns and well written. If you would prefer that I use this patchset as a basis and then fix it with respect to all three of these issues and then propose it, let me know. > So, the only real question is the way how we find a victim memcg in the > subtree: by performing independent election on each level or by searching > tree-wide. We all had many discussion around, and as you remember, initially > I was supporting the first option. > But then Michal provided a very strong argument: > if you have 3 similar workloads in A, B and C, but for non-memory-related > reasons (e.g. cpu time sharing) you have to join A and B into a group D:
>   /\
>  D  C
>  / \
> A   B
> it's strange to penalize A and B for it. It looks to me that you're > talking about the similar case, but you consider this hierarchy > useful. So, overall, it seems to be depending on exact configuration. > This is _exactly_ why you need oom priorities so that userspace can influence the decisionmaking. This makes my previous point, I'm not sure where the disconnect is coming from? You need to be able to bias D when compared to C for the heuristic to work. Userspace knows how it is organizing its memory cgroups. We would be making a mistake if we thought we knew all possible ways that people are using cgroups and limit your heuristic so that some people can opt-in and others are left with the current per-process heuristic because their users have accidentally subverted oom kill selection because they split their processes amongst subcontainers. > The selected approach is easy extendable into hierarchical direction: > as I've said before, we can introduce a new value of memory.oom_group, > which will enable cumulative accounting without mass killing.
> Again, we cannot merge incomplete patchsets in the hope that issues with that patchset are addressed later, especially when there are three very well defined concerns with the existing implementation. Your earlier iterations were actually a brilliant solution to the problem, I'm not sure that you realize how powerful it could be in practice. > And, tbh, I don't see how oom_priorities will resolve an opposite > problem if we'd take the hierarchical approach. > Think about it in a different way: we currently compare per-process usage and userspace has /proc/pid/oom_score_adj to adjust that usage depending on priorities of that process and still oom kill if there's a memory leak. Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer after all. We don't need a strict memory.oom_priority that outranks all other sibling cgroups regardless of usage. We need a memory.oom_score_adj to adjust the per-cgroup usage. The decisionmaking in your earlier example would be under the control of C/memory.oom_score_adj and D/memory.oom_score_adj. Problem solved. It also solves the problem of userspace being able to influence oom victim selection so now they can protect important cgroups just like we can protect important processes today. And since this would be hierarchical usage, you can trivially infer root mem cgroup usage by subtraction of top-level mem cgroup usage. This is a powerful solution to the problem and gives userspace the control they need so that it can work in all usecases, not a subset of usecases.
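The memory.oom_score_adj proposal can be sketched roughly as follows; its exact semantics were never settled in this thread, so the proportional formula below (mirroring the per-process knob's range and scale) is only an assumption, and all usage numbers are invented:

```python
# Hypothetical per-cgroup adjustment, by analogy with
# /proc/pid/oom_score_adj: a value in [-1000, 1000] shifts the cgroup's
# score by that many thousandths of the comparable total.
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000

def adjusted_score(usage_pages, adj, total_pages):
    assert OOM_SCORE_ADJ_MIN <= adj <= OOM_SCORE_ADJ_MAX
    return usage_pages + adj * total_pages // 1000

total = 1_000_000
# D uses half of memory but is marked important via a negative adj;
# C ends up preferred despite its smaller raw usage.
score_d = adjusted_score(500_000, -500, total)
score_c = adjusted_score(300_000, 0, total)
```

This is the sense in which a cgroup using 50% of memory could still be protected: the adjustment, not a strict priority, tips the comparison.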
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-12 21:50 ` David Rientjes @ 2017-10-13 13:32 ` Roman Gushchin [not found] ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-13 13:32 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Thu, Oct 12, 2017 at 02:50:38PM -0700, David Rientjes wrote: > On Wed, 11 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > on priorities of that process and still oom kill if there's a memory leak. > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer > after all. We don't need a strict memory.oom_priority that outranks all > other sibling cgroups regardless of usage. We need a memory.oom_score_adj > to adjust the per-cgroup usage. The decisionmaking in your earlier > example would be under the control of C/memory.oom_score_adj and > D/memory.oom_score_adj. Problem solved. > > It also solves the problem of userspace being able to influence oom victim > selection so now they can protect important cgroups just like we can > protect important processes today. > > And since this would be hierarchical usage, you can trivially infer root > mem cgroup usage by subtraction of top-level mem cgroup usage. > > This is a powerful solution to the problem and gives userspace the control > they need so that it can work in all usecases, not a subset of usecases. You're right that per-cgroup oom_score_adj may resolve the issue with too strict semantics of oom_priorities. But I believe nobody likes the existing per-process oom_score_adj interface, and there are reasons behind. 
Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj will work is not trivial. For example, earlier in this thread I've shown an example where the decision of which of two processes should be killed depends on whether it's a global or memcg-wide oom, despite both belonging to a single cgroup! Of course, it's technically trivial to implement some analog of oom_score_adj for cgroups (and early versions of this patchset did that). But the right question is: is this an interface we want to support for the next many years? I'm not sure.
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer [not found] ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org> @ 2017-10-13 21:31 ` David Rientjes 0 siblings, 0 replies; 27+ messages in thread From: David Rientjes @ 2017-10-13 21:31 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, 13 Oct 2017, Roman Gushchin wrote: > > Think about it in a different way: we currently compare per-process usage > > and userspace has /proc/pid/oom_score_adj to adjust that usage depending > > on priorities of that process and still oom kill if there's a memory leak. > > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer > > after all. We don't need a strict memory.oom_priority that outranks all > > other sibling cgroups regardless of usage. We need a memory.oom_score_adj > > to adjust the per-cgroup usage. The decisionmaking in your earlier > > example would be under the control of C/memory.oom_score_adj and > > D/memory.oom_score_adj. Problem solved. > > > > It also solves the problem of userspace being able to influence oom victim > > selection so now they can protect important cgroups just like we can > > protect important processes today. > > > > And since this would be hierarchical usage, you can trivially infer root > > mem cgroup usage by subtraction of top-level mem cgroup usage. > > > > This is a powerful solution to the problem and gives userspace the control > > they need so that it can work in all usecases, not a subset of usecases. > > You're right that per-cgroup oom_score_adj may resolve the issue with > too strict semantics of oom_priorities. But I believe nobody likes > the existing per-process oom_score_adj interface, and there are reasons behind. 
The previous heuristic before I rewrote the oom killer used /proc/pid/oom_adj which acted as a bitshift on mm->total_vm, which was a much more difficult interface to use as I'm sure you can imagine. People ended up only using it to polarize selection: either -17 to oom disable a process, -16 to bias against it, and 15 to prefer it. Nobody used anything in between and I worked with openssh, udev, kde, and chromium to get a consensus on the oom_score_adj semantics. People do use it to protect against memory leaks and to prevent oom killing important processes when something else can be sacrificed, unless there's a leak. > Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj > will work is not trivial. I suggest defining it in the terms used for previous iterations of the patchset: do hierarchical scoring so that each level of the hierarchy has usage information for each subtree. You can get root mem cgroup usage with complete fairness by subtraction with this method. When comparing usage at each level of the hierarchy, you can propagate the eligibility of processes in that subtree much like you do today. I agree with your change to make the oom killer a no-op if selection races with the actual killing rather than falling back to the old heuristic. I'm happy to help add a Tested-by once we settle the other issues with that change. At each level, I would state that memory.oom_score_adj has the exact same semantics as /proc/pid/oom_score_adj. In this case, it would simply be defined as a proportion of the parent's limit. If the hierarchy is iterated starting at the root mem cgroup for system ooms and at the root of the oom memcg for memcg ooms, this should lead to the exact same oom killing behavior, which is desired. 
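The pre-rewrite /proc/pid/oom_adj behavior recalled above worked roughly like this; it's a simplified sketch of the old bit-shift semantics with the heuristic's other factors omitted:

```python
# Old-style oom_adj: the value acted as a bit shift on the task's
# total_vm, with -17 (OOM_DISABLE) exempting the task entirely.
# Simplified; the real badness() applied further adjustments.
OOM_DISABLE = -17

def old_badness(total_vm_pages, oom_adj):
    if oom_adj == OOM_DISABLE:
        return 0
    points = total_vm_pages
    if oom_adj > 0:
        points <<= oom_adj   # positive values prefer the task
    else:
        points >>= -oom_adj  # negative values bias against killing it
    return points
```

Because each step doubles or halves the score, a shift of 15 changes it by a factor of 32768, which is why users ended up polarizing the knob (-17, -16, or 15) rather than using intermediate values.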
This solution would address the three concerns that I had: it allows the root mem cgroup to be compared fairly with leaf mem cgroups (with the bonus of not requiring root mem cgroup accounting thanks to your heuristic using global vmstats), it allows userspace to influence the decisionmaking so that users can protect cgroups that use 50% of memory because they are important, and it completely avoids users being able to change victim selection simply by creating child mem cgroups. This would be a very powerful patchset.
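The "infer root usage by subtraction" idea that recurs in this exchange amounts to the following; the cgroup names and numbers are invented placeholders, not actual vmstat fields:

```python
# With hierarchical accounting, every top-level memcg knows its whole
# subtree's usage, so usage attributable to processes attached directly
# to the root falls out of global counters by subtraction.
# Illustrative numbers in MB.
global_usage_mb = 4096
top_level_usage_mb = {"system.slice": 1024, "user.slice": 2048}

root_only_mb = global_usage_mb - sum(top_level_usage_mb.values())
```

No separate root accounting is needed: the root's share is whatever the global counters report beyond the top-level subtrees.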
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 21:13 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2017-10-11 13:08 ` Michal Hocko 2017-10-11 20:27 ` David Rientjes 2017-10-11 16:10 ` Roman Gushchin 2 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2017-10-11 13:08 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue 10-10-17 14:13:00, David Rientjes wrote: [...] > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or unpurposefully, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes <rientjes@google.com> I consider this NACK rather dubious. Evading the heuristic as you describe requires root privileges in default configuration because normal users are not allowed to create subtrees. If you really want to delegate subtree to an untrusted entity then you do not have to opt-in for this oom strategy. We can work on an additional means which would allow to cover those as well (e.g. priority based one which is requested for other usecases). A similar argument applies to the root memcg evaluation. While the proposed behavior is not optimal it would work for general usecase described here where the root memcg doesn't really run any large number of tasks. If somebody who explicitly opts-in for the new strategy and it doesn't work well for that usecase we can enhance the behavior. That alone is not a reason to nack the whole thing. 
I find it really disturbing that you keep nacking this approach just because it doesn't suit your specific usecase while it doesn't break it. Moreover it has been stated several times already that future improvements are possible and cover what you have described already. -- Michal Hocko SUSE Labs
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-11 13:08 ` Michal Hocko @ 2017-10-11 20:27 ` David Rientjes 2017-10-12 6:33 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: David Rientjes @ 2017-10-11 20:27 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Wed, 11 Oct 2017, Michal Hocko wrote: > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes <rientjes@google.com> > > I consider this NACK rather dubious. Evading the heuristic as you > describe requires root privileges in default configuration because > normal users are not allowed to create subtrees. If you > really want to delegate subtree to an untrusted entity then you do not > have to opt-in for this oom strategy. We can work on an additional means > which would allow to cover those as well (e.g. priority based one which > is requested for other usecases). > You're missing the point that the user is trusted and it may be doing something to circumvent oom kill unknowingly. With a single unified hierarchy, the user is forced to attach its processes to subcontainers if it wants to constrain resources with other controllers. Doing so ends up completely avoiding oom kill because of this implementation detail. It has nothing to do with trust and the admin who is opting-in will not know a user has circumvented oom kill purely because it constrains its processes with controllers other than the memory controller. > A similar argument applies to the root memcg evaluation.
While the > proposed behavior is not optimal it would work for general usecase > described here where the root memcg doesn't really run any large number > of tasks. If somebody who explicitly opts-in for the new strategy and it > doesn't work well for that usecase we can enhance the behavior. That > alone is not a reason to nack the whole thing. > > I find it really disturbing that you keep nacking this approach just > because it doesn't suite your specific usecase while it doesn't break > it. Moreover it has been stated several times already that future > improvements are possible and cover what you have described already. This has nothing to do with my specific usecase.
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-11 20:27 ` David Rientjes @ 2017-10-12 6:33 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-12 6:33 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Wed 11-10-17 13:27:44, David Rientjes wrote: > On Wed, 11 Oct 2017, Michal Hocko wrote: > > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > > against that mem cgroup from oom kill in system oom conditions, the > > > ability of users to completely evade the oom killer by attaching all > > > processes to child cgroups either purposefully or unpurposefully, and the > > > inability of userspace to effectively control oom victim selection: > > > > > > Nacked-by: David Rientjes <rientjes@google.com> > > > > I consider this NACK rather dubious. Evading the heuristic as you > > describe requires root privileges in default configuration because > > normal users are not allowed to create subtrees. If you > > really want to delegate subtree to an untrusted entity then you do not > > have to opt-in for this oom strategy. We can work on an additional means > > which would allow to cover those as well (e.g. priority based one which > > is requested for other usecases). > > > > You're missing the point that the user is trusted and it may be doing > something to circumvent oom kill unknowingly. I would really like to see a practical example of something like that. I am not saying this is completely impossible but as already pointed out this _can_ be addressed _on top_ of the current implementation. We will need some way to consider hierarchies anyway. So I really fail to see why this would be a blocker. After all it is no different than skipping oom selection by splitting a process (knowingly or otherwise) into subprocesses which is possible even now. 
OOM killer selection has never been, will not be and cannot be perfect in principle. Quite the contrary, the more clever the heuristics try to be, the more corner cases they might generate, as we could see in the past. > With a single unified > hierarchy, the user is forced to attach its processes to subcontainers if > it wants to constrain resources with other controllers. Doing so ends up > completely avoiding oom kill because of this implementation detail. It > has nothing to do with trust and the admin who is opting-in will not know > a user has cirumvented oom kill purely because it constrains its processes > with controllers other than the memory controller. > > > A similar argument applies to the root memcg evaluation. While the > > proposed behavior is not optimal it would work for general usecase > > described here where the root memcg doesn't really run any large number > > of tasks. If somebody who explicitly opts-in for the new strategy and it > > doesn't work well for that usecase we can enhance the behavior. That > > alone is not a reason to nack the whole thing. > > > > I find it really disturbing that you keep nacking this approach just > > because it doesn't suite your specific usecase while it doesn't break > > it. Moreover it has been stated several times already that future > > improvements are possible and cover what you have described already. > > This has nothing to do with my specific usecase. Well, I might be really wrong but it is hard not to notice how most of your complaints push towards hierarchical level-by-level comparisons. Which has been considered and deemed unsuitable for the default cgroup aware oom selection because it imposes structural constraints on how the hierarchy is organized and thus disallows many usecases. So pushing for this just because it resembles your current inhouse implementation leaves me with a feeling that you care more about your usecase than general usability.
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
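The fairness concern being argued above can be made concrete with a small model. The sketch below is purely illustrative Python, not kernel code: it models the v11 behavior under discussion, where the root memcg's score is approximated by the largest single task while leaf memcgs are scored by their total charged memory (function names and units are invented for the example).

```python
# Toy model of the disputed comparison: root memcg scored by its
# biggest single task vs leaf memcgs scored by total charged memory.

def root_score_v11(task_rss_list):
    """v11 approximation: root memcg score ~ largest single task's rss."""
    return max(task_rss_list, default=0)

def leaf_score(task_rss_list):
    """Leaf memcg score ~ sum of all memory charged by its tasks."""
    return sum(task_rss_list)

# David's example: many 80MB tasks in the root memcg vs a leaf memcg
# holding 100 tasks of 1MB each.
root_tasks = [80] * 100   # MB each
leaf_tasks = [1] * 100    # MB each

scores = {
    "root": root_score_v11(root_tasks),  # 80
    "leaf": leaf_score(leaf_tasks),      # 100
}
victim = max(scores, key=scores.get)
# The leaf wins (100 > 80) even though the root's tasks use 8000MB in
# total -- the "apples vs oranges" comparison David objects to.
print(victim)
```

Under this model the leaf memcg is repeatedly selected no matter how much memory the root's tasks consume, which is exactly the behavior the v12 re-spin below tries to address.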
* Re: [v11 3/6] mm, oom: cgroup-aware OOM killer 2017-10-10 21:13 ` David Rientjes [not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 2017-10-11 13:08 ` Michal Hocko @ 2017-10-11 16:10 ` Roman Gushchin 2 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-11 16:10 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > This seems to unfairly bias the root mem cgroup depending on process size. > > > It isn't treated fairly as a leaf mem cgroup if they are being compared > > > based on different criteria: the root mem cgroup as (mostly) the largest > > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and > > > unreclaimable slab pages charged to it by all processes. > > > > > > I imagine a configuration where the root mem cgroup has 100 processes > > > attached each with rss of 80MB, compared to a leaf cgroup with 100 > > > processes of 1MB rss each. How does this logic prevent repeatedly oom > > > killing the processes of 1MB rss? > > > > > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't > > > quite fair, it can simply hide large processes from being selected. Users > > > who configure cgroups in a unified hierarchy for other resource > > > constraints are penalized for this choice even though the mem cgroup with > > > 100 processes of 1MB rss each may not be limited itself. > > > > > > I think for this comparison to be fair, it requires accounting for the > > > root mem cgroup itself or for a different accounting methodology for leaf > > > memory cgroups. > > > > This is basically a workaround, because we don't have necessary stats for root > > memory cgroup. 
If we'll start gathering them at some point, we can change this > > and treat root memcg exactly as other leaf cgroups. > > > > I understand why it currently cannot be an apples vs apples comparison > without, as I suggest in the last paragraph, that the same accounting is > done for the root mem cgroup, which is intuitive if it is to be considered > on the same basis as leaf mem cgroups. > > I understand for the design to work that leaf mem cgroups and the root mem > cgroup must be compared if processes can be attached to the root mem > cgroup. My point is that it is currently completely unfair as I've > stated: you can have 10000 processes attached to the root mem cgroup with > rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each > and the oom killer is going to target the leaf mem cgroup as a result of > this apples vs oranges comparison. > > In case it's not clear, the 10000 processes of 80MB rss each is the most > likely contributor to a system-wide oom kill. Unfortunately, the > heuristic introduced by this patchset is broken wrt a fair comparison of > the root mem cgroup usage. > > > Or, if someone will come with an idea of a better approximation, it can be > > implemented as a separate enhancement on top of the initial implementation. > > This is more than welcome. > > > > We don't need a better approximation, we need a fair comparison. The > heuristic that this patchset is implementing is based on the usage of > individual mem cgroups. For the root mem cgroup to be considered > eligible, we need to understand its usage. That usage is _not_ what is > implemented by this patchset, which is the largest rss of a single > attached process. This, in fact, is not an "approximation" at all. In > the example of 10000 processes attached with 80MB rss each, the usage of > the root mem cgroup is _not_ 80MB. > > I'll restate that oom killing a process is a last resort for the kernel, > but it also must be able to make a smart decision. 
Targeting dozens of > 1MB processes instead of 80MB processes because of a shortcoming in this > implementation is not the appropriate selection, it's the opposite of the > correct selection. > > > > I'll reiterate what I did on the last version of the patchset: considering > > > only leaf memory cgroups easily allows users to defeat this heuristic and > > > bias against all of their memory usage up to the largest process size > > > amongst the set of processes attached. If the user creates N child mem > > > cgroups for their N processes and attaches one process to each child, the > > > _only_ thing this achieved is to defeat your heuristic and prefer other > > > leaf cgroups simply because those other leaf cgroups did not do this. > > > > > > Effectively: > > > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > > > will radically shift the heuristic from a score of all anonymous + > > > unevictable memory for all processes to a score of the largest anonymous + > > > unevictable memory for a single process. There is no downside or > > > ramifaction for the end user in doing this. When comparing cgroups based > > > on usage, it only makes sense to compare the hierarchical usage of that > > > cgroup so that attaching processes to descendants or splitting the > > > implementation of a process into several smaller individual processes does > > > not allow this heuristic to be defeated. > > > > To all previously said words I can only add that cgroup v2 allows to limit > > the amount of cgroups in the sub-tree: > > 1a926e0bbab8 ("cgroup: implement hierarchy limits"). > > > > So the solution to > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > evading all oom kills for your mem cgroup is to limit the number of > cgroups that can be created by the user? 
With a unified cgroup hierarchy, > that doesn't work well if I wanted to actually constrain these individual > processes to different resource limits like cpu usage. In fact, the user > may not know it is effectively evading the oom killer entirely because it > has constrained the cpu of individual processes because it's a side-effect > of this heuristic. > > > You chose not to respond to my reiteration of userspace having absolutely > no control over victim selection with the new heuristic without setting > all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a > very important job that is running on a system that is really supposed to > use 80% of memory, I need to be able to specify that it should not be oom > killed based on user goals. Setting all processes to be oom disabled in > the important mem cgroup to avoid being oom killed unless absolutely > necessary in a system oom condition is not a robust solution: (1) the mem > cgroup livelocks if it reaches its own mem cgroup limit and (2) the system > panic()'s if these preferred mem cgroups are the only consumers left on > the system. With overcommit, both of these possibilities exist in the > wild and the problem is only a result of the implementation detail of this > patchset. > > For these reasons: unfair comparison of root mem cgroup usage to bias > against that mem cgroup from oom kill in system oom conditions, the > ability of users to completely evade the oom killer by attaching all > processes to child cgroups either purposefully or inadvertently, and the > inability of userspace to effectively control oom victim selection: > > Nacked-by: David Rientjes <rientjes@google.com> Hi David! Do you find the following approach (summing oom_score of tasks belonging to the root memory cgroup) acceptable? Also, I've closed the race you've pointed out. Thanks! 
-------------------------------------------------------------------------------- From 7f51d26be2d2a5b6e4840574f72beb15920c0993 Mon Sep 17 00:00:00 2001 From: Roman Gushchin <guro@fb.com> Date: Thu, 25 May 2017 14:18:45 +0100 Subject: [v12 3/6] mm, oom: cgroup-aware OOM killer Traditionally, the OOM killer operates on a process level. Under oom conditions, it finds the process with the highest oom score and kills it. This behavior doesn't suit systems with many running containers well: 1) There is no fairness between containers. A small container with a few large processes will be chosen over a large one with a huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases a much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in the case of a system-wide OOM. To address these issues, the cgroup-aware OOM killer is introduced. This patch introduces the core functionality: the ability to select a memory cgroup as an OOM victim. Under OOM conditions the OOM killer looks for the biggest leaf memory cgroup and kills the biggest task belonging to it. The following patches will extend this functionality to consider non-leaf memory cgroups as OOM victims, and also provide the ability to kill all tasks belonging to the victim cgroup. The root cgroup is treated as a leaf memory cgroup, so its score is compared with those of other leaf memory cgroups. Due to the memcg statistics implementation, a special approximation is used for estimating the oom_score of the root memory cgroup: we sum the oom_score of the belonging processes (or, to be more precise, of the tasks owning their mm structures). 
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 17 +++++ include/linux/oom.h | 12 ++- mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 72 +++++++++++++----- 4 files changed, 262 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..75b63b68846e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -936,6 +948,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 76aac4ce39bc..ca78e2d5956e 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -9,6 +9,13 @@ #include <linux/sched/coredump.h> /* MMF_* */ #include <linux/mm.h> /* VM_FAULT* */ + +/* + * Special value returned by victim selection functions to indicate + * that there are inflight OOM victims. 
+ */ +#define INFLIGHT_VICTIM ((void *)-1UL) + struct zonelist; struct notifier_block; struct mem_cgroup; @@ -39,7 +46,8 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; - struct task_struct *chosen; + struct task_struct *chosen_task; + struct mem_cgroup *chosen_memcg; unsigned long chosen_points; }; @@ -101,6 +109,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df3368734f1c..8f04e1fb9dd9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2670,6 +2670,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long memcg_oom_badness(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + long points = 0; + int nid; + pg_data_t *pgdat; + + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; + + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + + pgdat = NODE_DATA(nid); + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), + NR_SLAB_UNRECLAIMABLE); + } + + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + + return points; +} + +/* + * Checks if the given memcg is a valid OOM victim and returns a number, + * which means the following: + * -1: there are inflight OOM victim tasks belonging to the memcg + * 0: memcg is not eligible, e.g. 
all belonging tasks are protected + * by oom_score_adj set to OOM_SCORE_ADJ_MIN + * >0: memcg is eligible, and the returned value is an estimation + * of the memory footprint + */ +static long oom_evaluate_memcg(struct mem_cgroup *memcg, + const nodemask_t *nodemask, + unsigned long totalpages) +{ + struct css_task_iter it; + struct task_struct *task; + int eligible = 0; + + /* + * Root memory cgroup is a special case: + * we don't have necessary stats to evaluate it exactly as + * leaf memory cgroups, so we approximate its oom_score + * by summing oom_score of all belonging tasks, which are + * owners of their mm structs. + * + * If there are inflight OOM victim tasks inside + * the root memcg, we return -1. + */ + if (memcg == root_mem_cgroup) { + struct css_task_iter it; + struct task_struct *task; + long score = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, + &task->signal->oom_mm->flags)) { + score = -1; + break; + } + + task_lock(task); + if (!task->mm || task->mm->owner != task) { + task_unlock(task); + continue; + } + task_unlock(task); + + score += oom_badness(task, memcg, nodemask, + totalpages); + } + css_task_iter_end(&it); + + return score; + } + + /* + * Memcg is OOM eligible if there are OOM killable tasks inside. + * + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN + * as unkillable. + * + * If there are inflight OOM victim tasks inside the memcg, + * we return -1. 
+ */ + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (!eligible && + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) + eligible = 1; + + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + eligible = -1; + break; + } + } + css_task_iter_end(&it); + + if (eligible <= 0) + return eligible; + + return memcg_oom_badness(memcg, nodemask, totalpages); +} + +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) +{ + struct mem_cgroup *iter; + + oc->chosen_memcg = NULL; + oc->chosen_points = 0; + + /* + * The oom_score is calculated for leaf memory cgroups (including + * the root memcg). + */ + rcu_read_lock(); + for_each_mem_cgroup_tree(iter, root) { + long score; + + if (memcg_has_children(iter) && iter != root_mem_cgroup) + continue; + + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); + + /* + * Ignore empty and non-eligible memory cgroups. + */ + if (score == 0) + continue; + + /* + * If there are inflight OOM victims, we don't need + * to look further for new victims. + */ + if (score == -1) { + oc->chosen_memcg = INFLIGHT_VICTIM; + mem_cgroup_iter_break(root, iter); + break; + } + + if (score > oc->chosen_points) { + oc->chosen_points = score; + oc->chosen_memcg = iter; + } + } + + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM) + css_get(&oc->chosen_memcg->css); + + rcu_read_unlock(); +} + +bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + struct mem_cgroup *root; + + if (mem_cgroup_disabled()) + return false; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return false; + + if (oc->memcg) + root = oc->memcg; + else + root = root_mem_cgroup; + + select_victim_memcg(root, oc); + + return oc->chosen_memcg; +} + /* * Reclaims as many pages from the given memcg as possible. 
* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 0b9f36117989..5b670adb850c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } -static int oom_evaluate_task(struct task_struct *task, void *arg) +int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; unsigned long points; @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) goto next; /* Prefer thread group leaders for display purposes */ - if (points == oc->chosen_points && thread_group_leader(oc->chosen)) + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task)) goto next; select: - if (oc->chosen) - put_task_struct(oc->chosen); + if (oc->chosen_task) + put_task_struct(oc->chosen_task); get_task_struct(task); - oc->chosen = task; + oc->chosen_task = task; oc->chosen_points = points; next: return 0; abort: - if (oc->chosen) - put_task_struct(oc->chosen); - oc->chosen = (void *)-1UL; + if (oc->chosen_task) + put_task_struct(oc->chosen_task); + oc->chosen_task = INFLIGHT_VICTIM; return 1; } /* * Simple selection loop. We choose the process with the highest number of - * 'points'. In case scan was aborted, oc->chosen is set to -1. + * 'points'. In case scan was aborted, oc->chosen_task is set to -1. 
*/ static void select_bad_process(struct oom_control *oc) { @@ -923,7 +923,7 @@ static void __oom_kill_process(struct task_struct *victim) static void oom_kill_process(struct oom_control *oc, const char *message) { - struct task_struct *p = oc->chosen; + struct task_struct *p = oc->chosen_task; unsigned int points = oc->chosen_points; struct task_struct *victim = p; struct task_struct *child; @@ -984,6 +984,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message) __oom_kill_process(victim); } +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ + + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) + return oc->chosen_memcg; + + /* Kill a task in the chosen memcg with the biggest memory footprint */ + oc->chosen_points = 0; + oc->chosen_task = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) + goto out; + + __oom_kill_process(oc->chosen_task); + +out: + mem_cgroup_put(oc->chosen_memcg); + return oc->chosen_task; +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ @@ -1036,6 +1057,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + bool delay = false; /* if set, delay next allocation attempt */ if (oom_killer_disabled) return false; @@ -1080,27 +1102,39 @@ bool out_of_memory(struct oom_control *oc) current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { get_task_struct(current); - oc->chosen = current; + oc->chosen_task = current; oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)"); return true; } + if (mem_cgroup_select_oom_victim(oc)) { + if (oom_kill_memcg_victim(oc)) + delay = true; + + goto out; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (oc->chosen && oc->chosen != (void *)-1UL) { + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); - /* - * Give the killed process a good chance to exit before trying - * to allocate memory again. - */ - schedule_timeout_killable(1); + delay = true; } - return !!oc->chosen; + +out: + /* + * Give the killed process a good chance to exit before trying + * to allocate memory again. + */ + if (delay) + schedule_timeout_killable(1); + + return !!oc->chosen_task; } /* -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org>]
* [v11 4/6] mm, oom: introduce memory.oom_group [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-05 13:04 ` Roman Gushchin 2017-10-05 14:29 ` Michal Hocko 2017-10-05 14:31 ` Michal Hocko 0 siblings, 2 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA The cgroup-aware OOM killer treats leaf memory cgroups as memory consumption entities and performs the victim selection by comparing them based on their memory footprint. Then it kills the biggest task inside the selected memory cgroup. But there are workloads which are not tolerant of such behavior. Killing a random task may leave the workload in a broken state. To solve this problem, the memory.oom_group knob is introduced. It defines whether a memory cgroup should be treated as an indivisible memory consumer, compared by total memory consumption with other memory consumers (leaf memory cgroups and other memory cgroups with memory.oom_group set), and whether all belonging tasks should be killed if the cgroup is selected. If set on memcg A, it means that in case of a system-wide OOM or a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks belonging to the sub-tree of A will be killed. If the OOM event is scoped to a descendant cgroup (A/B, for example), only tasks in that cgroup can be affected. The OOM killer will never touch any tasks outside of the scope of the OOM event. Also, tasks with oom_score_adj set to -1000 will not be killed because this has been a long-established way to protect a particular process from seeing an unexpected SIGKILL from the OOM killer. Ignoring this user-defined configuration might lead to data corruption or other misbehavior. 
The default value is 0. Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org> Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Cc: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org> Cc: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: kernel-team-b10kYP2dOMg@public.gmane.org Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org --- include/linux/memcontrol.h | 17 +++++++++++ mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++--- mm/oom_kill.c | 49 +++++++++++++++++++++++------- 3 files changed, 127 insertions(+), 14 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 75b63b68846e..84ac10d7e67d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -200,6 +200,13 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + /* + * Treat the sub-tree as an indivisible memory consumer, + * kill all belonging tasks if the memory cgroup is selected + * as an OOM victim. 
+ */ + bool oom_group; + /* handle for "memory.events" */ struct cgroup_file events_file; @@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait); bool mem_cgroup_select_oom_victim(struct oom_control *oc); +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) +{ + return memcg->oom_group; +} + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) { return false; } + +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 191b70735f1f..d5acb278b11a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) { - struct mem_cgroup *iter; + struct mem_cgroup *iter, *group = NULL; + long group_score = 0; oc->chosen_memcg = NULL; oc->chosen_points = 0; /* + * If OOM is memcg-wide, and the memcg has the oom_group flag set, + * all tasks belonging to the memcg should be killed. + * So, we mark the memcg as a victim. + */ + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) { + oc->chosen_memcg = oc->memcg; + css_get(&oc->chosen_memcg->css); + return; + } + + /* * The oom_score is calculated for leaf memory cgroups (including * the root memcg). + * Non-leaf oom_group cgroups accumulate the score of descendant + * leaf memory cgroups. */ rcu_read_lock(); for_each_mem_cgroup_tree(iter, root) { long score; + /* + * We don't consider non-leaf non-oom_group memory cgroups + * as OOM victims. + */ + if (memcg_has_children(iter) && iter != root_mem_cgroup && + !mem_cgroup_oom_group(iter)) + continue; + + /* + * If group is not set or we've run out of the group's sub-tree, + * we should set group and reset group_score. 
+ */ + if (!group || group == root_mem_cgroup || + !mem_cgroup_is_descendant(iter, group)) { + group = iter; + group_score = 0; + } + if (memcg_has_children(iter) && iter != root_mem_cgroup) continue; @@ -2810,9 +2842,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) break; } - if (score > oc->chosen_points) { - oc->chosen_points = score; - oc->chosen_memcg = iter; + group_score += score; + + if (group_score > oc->chosen_points) { + oc->chosen_points = group_score; + oc->chosen_memcg = group; } } @@ -5437,6 +5471,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_group_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + bool oom_group = memcg->oom_group; + + seq_printf(m, "%d\n", oom_group); + + return 0; +} + +static ssize_t memory_oom_group_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_group; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_group); + if (err) + return err; + + memcg->oom_group = oom_group; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5557,6 +5618,12 @@ static struct cftype memory_files[] = { .write = memory_max_write, }, { + .name = "oom_group", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_oom_group_show, + .write = memory_oom_group_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 20e62ec32ba8..c8fbc73c4ed3 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -851,6 +851,17 @@ static void __oom_kill_process(struct task_struct *victim) struct mm_struct *mm; bool can_oom_reap = true; + /* + * __oom_kill_process() is used to kill all tasks belonging to + 
* the selected memory cgroup, so we should check that we're not + * trying to kill an unkillable task. + */ + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) || + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { + put_task_struct(victim); + return; + } + p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -987,21 +998,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message) __oom_kill_process(victim); } -static bool oom_kill_memcg_victim(struct oom_control *oc) +static int oom_kill_memcg_member(struct task_struct *task, void *unused) { + get_task_struct(task); + __oom_kill_process(task); + return 0; +} +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) return oc->chosen_memcg; - /* Kill a task in the chosen memcg with the biggest memory footprint */ - oc->chosen_points = 0; - oc->chosen_task = NULL; - mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); - - if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) - goto out; - - __oom_kill_process(oc->chosen_task); + /* + * If memory.oom_group is set, kill all tasks belonging to the sub-tree + * of the chosen memory cgroup, otherwise kill the task with the biggest + * memory footprint. + */ + if (mem_cgroup_oom_group(oc->chosen_memcg)) { + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, + NULL); + /* We have one or more terminating processes at this point. */ + oc->chosen_task = INFLIGHT_VICTIM; + } else { + oc->chosen_points = 0; + oc->chosen_task = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + + if (oc->chosen_task == NULL || + oc->chosen_task == INFLIGHT_VICTIM) + goto out; + + __oom_kill_process(oc->chosen_task); + } out: mem_cgroup_put(oc->chosen_memcg); -- 2.13.6 ^ permalink raw reply related [flat|nested] 27+ messages in thread
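The memory.oom_group semantics added by this patch can be sketched compactly: a cgroup with oom_group set competes as one indivisible consumer (its score accumulating over its leaves), and if it is selected, every task in its sub-tree is killed except those protected by oom_score_adj -1000. The Python below is a hypothetical model for illustration only; the data structures are invented and do not correspond to kernel types.

```python
# Toy model of memory.oom_group: the group is scored as a whole, and a
# winning group has all of its tasks killed except -1000-adjusted ones.

OOM_SCORE_ADJ_MIN = -1000

def group_score(group):
    """oom_group cgroups accumulate the scores of their leaf memcgs."""
    return sum(leaf["score"] for leaf in group["leaves"])

def kill_group(group):
    """Return the tasks that would receive SIGKILL; -1000 tasks are spared."""
    return [t for leaf in group["leaves"] for t in leaf["tasks"]
            if t["oom_score_adj"] != OOM_SCORE_ADJ_MIN]

workload = {
    "oom_group": True,
    "leaves": [
        {"score": 60, "tasks": [{"pid": 1001, "oom_score_adj": 0}]},
        {"score": 40, "tasks": [{"pid": 1002, "oom_score_adj": 0},
                                {"pid": 1003, "oom_score_adj": -1000}]},
    ],
}

print(group_score(workload))                      # compared as a whole: 100
print([t["pid"] for t in kill_group(workload)])   # pid 1003 is spared
```

This mirrors the two behaviors in the commit message: victim selection by total sub-tree consumption, and the oom_score_adj -1000 escape hatch enforced by the unkillable-task check added to __oom_kill_process().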
* Re: [v11 4/6] mm, oom: introduce memory.oom_group 2017-10-05 13:04 ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin @ 2017-10-05 14:29 ` Michal Hocko 2017-10-05 14:31 ` Michal Hocko 1 sibling, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-05 14:29 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Thu 05-10-17 14:04:52, Roman Gushchin wrote: > The cgroup-aware OOM killer treats leaf memory cgroups as memory > consumption entities and performs the victim selection by comparing > them based on their memory footprint. Then it kills the biggest task > inside the selected memory cgroup. > > But there are workloads, which are not tolerant to a such behavior. > Killing a random task may leave the workload in a broken state. > > To solve this problem, memory.oom_group knob is introduced. > It will define, whether a memory group should be treated as an > indivisible memory consumer, compared by total memory consumption > with other memory consumers (leaf memory cgroups and other memory > cgroups with memory.oom_group set), and whether all belonging tasks > should be killed if the cgroup is selected. > > If set on memcg A, it means that in case of system-wide OOM or > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks, > belonging to the sub-tree of A will be killed. If OOM event is > scoped to a descendant cgroup (A/B, for example), only tasks in > that cgroup can be affected. OOM killer will never touch any tasks > outside of the scope of the OOM event. > > Also, tasks with oom_score_adj set to -1000 will not be killed because > this has been a long established way to protect a particular process > from seeing an unexpected SIGKILL from the OOM killer. Ignoring this > user defined configuration might lead to data corruptions or other > misbehavior. > > The default value is 0. 
I still believe that oc->chosen_task == INFLIGHT_VICTIM check in oom_kill_memcg_victim should go away. > > Signed-off-by: Roman Gushchin <guro@fb.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Cc: David Rientjes <rientjes@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: kernel-team@fb.com > Cc: cgroups@vger.kernel.org > Cc: linux-doc@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org Acked-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/memcontrol.h | 17 +++++++++++ > mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++--- > mm/oom_kill.c | 49 +++++++++++++++++++++++------- > 3 files changed, 127 insertions(+), 14 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 75b63b68846e..84ac10d7e67d 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -200,6 +200,13 @@ struct mem_cgroup { > /* OOM-Killer disable */ > int oom_kill_disable; > > + /* > + * Treat the sub-tree as an indivisible memory consumer, > + * kill all belonging tasks if the memory cgroup selected > + * as OOM victim. 
> + */ > + bool oom_group; > + > /* handle for "memory.events" */ > struct cgroup_file events_file; > > @@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait); > > bool mem_cgroup_select_oom_victim(struct oom_control *oc); > > +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) > +{ > + return memcg->oom_group; > +} > + > #ifdef CONFIG_MEMCG_SWAP > extern int do_swap_account; > #endif > @@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) > { > return false; > } > + > +static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg) > +{ > + return false; > +} > #endif /* CONFIG_MEMCG */ > > /* idx can be of type enum memcg_stat_item or node_stat_item */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 191b70735f1f..d5acb278b11a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, > > static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > { > - struct mem_cgroup *iter; > + struct mem_cgroup *iter, *group = NULL; > + long group_score = 0; > > oc->chosen_memcg = NULL; > oc->chosen_points = 0; > > /* > + * If OOM is memcg-wide, and the memcg has the oom_group flag set, > + * all tasks belonging to the memcg should be killed. > + * So, we mark the memcg as a victim. > + */ > + if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) { > + oc->chosen_memcg = oc->memcg; > + css_get(&oc->chosen_memcg->css); > + return; > + } > + > + /* > * The oom_score is calculated for leaf memory cgroups (including > * the root memcg). > + * Non-leaf oom_group cgroups accumulating score of descendant > + * leaf memory cgroups. > */ > rcu_read_lock(); > for_each_mem_cgroup_tree(iter, root) { > long score; > > + /* > + * We don't consider non-leaf non-oom_group memory cgroups > + * as OOM victims. 
> + */ > + if (memcg_has_children(iter) && iter != root_mem_cgroup && > + !mem_cgroup_oom_group(iter)) > + continue; > + > + /* > + * If group is not set or we've ran out of the group's sub-tree, > + * we should set group and reset group_score. > + */ > + if (!group || group == root_mem_cgroup || > + !mem_cgroup_is_descendant(iter, group)) { > + group = iter; > + group_score = 0; > + } > + > if (memcg_has_children(iter) && iter != root_mem_cgroup) > continue; > > @@ -2810,9 +2842,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > break; > } > > - if (score > oc->chosen_points) { > - oc->chosen_points = score; > - oc->chosen_memcg = iter; > + group_score += score; > + > + if (group_score > oc->chosen_points) { > + oc->chosen_points = group_score; > + oc->chosen_memcg = group; > } > } > > @@ -5437,6 +5471,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > return nbytes; > } > > +static int memory_oom_group_show(struct seq_file *m, void *v) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > + bool oom_group = memcg->oom_group; > + > + seq_printf(m, "%d\n", oom_group); > + > + return 0; > +} > + > +static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > + char *buf, size_t nbytes, > + loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + int oom_group; > + int err; > + > + err = kstrtoint(strstrip(buf), 0, &oom_group); > + if (err) > + return err; > + > + memcg->oom_group = oom_group; > + > + return nbytes; > +} > + > static int memory_events_show(struct seq_file *m, void *v) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > @@ -5557,6 +5618,12 @@ static struct cftype memory_files[] = { > .write = memory_max_write, > }, > { > + .name = "oom_group", > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > + .seq_show = memory_oom_group_show, > + .write = memory_oom_group_write, > + }, > + { > .name = "events", > .flags = 
CFTYPE_NOT_ON_ROOT, > .file_offset = offsetof(struct mem_cgroup, events_file), > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 20e62ec32ba8..c8fbc73c4ed3 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -851,6 +851,17 @@ static void __oom_kill_process(struct task_struct *victim) > struct mm_struct *mm; > bool can_oom_reap = true; > > + /* > + * __oom_kill_process() is used to kill all tasks belonging to > + * the selected memory cgroup, so we should check that we're not > + * trying to kill an unkillable task. > + */ > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD) || > + victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { > + put_task_struct(victim); > + return; > + } > + > p = find_lock_task_mm(victim); > if (!p) { > put_task_struct(victim); > @@ -987,21 +998,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > __oom_kill_process(victim); > } > > -static bool oom_kill_memcg_victim(struct oom_control *oc) > +static int oom_kill_memcg_member(struct task_struct *task, void *unused) > { > + get_task_struct(task); > + __oom_kill_process(task); > + return 0; > +} > > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM) > return oc->chosen_memcg; > > - /* Kill a task in the chosen memcg with the biggest memory footprint */ > - oc->chosen_points = 0; > - oc->chosen_task = NULL; > - mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > - > - if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM) > - goto out; > - > - __oom_kill_process(oc->chosen_task); > + /* > + * If memory.oom_group is set, kill all tasks belonging to the sub-tree > + * of the chosen memory cgroup, otherwise kill the task with the biggest > + * memory footprint. 
> + */ > + if (mem_cgroup_oom_group(oc->chosen_memcg)) { > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, > + NULL); > + /* We have one or more terminating processes at this point. */ > + oc->chosen_task = INFLIGHT_VICTIM; > + } else { > + oc->chosen_points = 0; > + oc->chosen_task = NULL; > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); > + > + if (oc->chosen_task == NULL || > + oc->chosen_task == INFLIGHT_VICTIM) > + goto out; > + > + __oom_kill_process(oc->chosen_task); > + } > > out: > mem_cgroup_put(oc->chosen_memcg); > -- > 2.13.6 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [v11 4/6] mm, oom: introduce memory.oom_group 2017-10-05 13:04 ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin 2017-10-05 14:29 ` Michal Hocko @ 2017-10-05 14:31 ` Michal Hocko [not found] ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Michal Hocko @ 2017-10-05 14:31 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Btw. here is how I would do the recursive oom badness. The diff is not the nicest one because there is some code moving but the resulting code is smaller and imho easier to grasp. Only compile tested though --- diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 085056e562b1..9cdba4682198 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -122,6 +122,11 @@ void cgroup_free(struct task_struct *p); int cgroup_init_early(void); int cgroup_init(void); +static bool cgroup_has_tasks(struct cgroup *cgrp) +{ + return cgrp->nr_populated_csets; +} + /* * Iteration helpers and macros. 
*/ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8dacf73ad57e..a2dd7e3ffe23 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -319,11 +319,6 @@ static void cgroup_idr_remove(struct idr *idr, int id) spin_unlock_bh(&cgroup_idr_lock); } -static bool cgroup_has_tasks(struct cgroup *cgrp) -{ - return cgrp->nr_populated_csets; -} - bool cgroup_is_threaded(struct cgroup *cgrp) { return cgrp->dom_cgrp != cgrp; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b3848bce4c86..012b2216266f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2671,59 +2671,63 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) } static long memcg_oom_badness(struct mem_cgroup *memcg, - const nodemask_t *nodemask, - unsigned long totalpages) + const nodemask_t *nodemask) { + struct mem_cgroup *iter; + struct css_task_iter it; + struct task_struct *task; long points = 0; + int eligible = 0; int nid; pg_data_t *pgdat; - /* - * We don't have necessary stats for the root memcg, - * so we define it's oom_score as the maximum oom_score - * of the belonging tasks. - * - * As tasks in the root memcg unlikely are parts of a - * single workload, and we don't have to implement - * group killing, this approximation is reasonable. - * - * But if we will have necessary stats for the root memcg, - * we might switch to the approach which is used for all - * other memcgs. - */ - if (memcg == root_mem_cgroup) { - struct css_task_iter it; - struct task_struct *task; - long score, max_score = 0; - + for_each_mem_cgroup_tree(iter, memcg) { + /* + * Memcg is OOM eligible if there are OOM killable tasks inside. + * + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN + * as unkillable. + * + * If there are inflight OOM victim tasks inside the memcg, + * we return -1. 
+ */ css_task_iter_start(&memcg->css, 0, &it); while ((task = css_task_iter_next(&it))) { - score = oom_badness(task, memcg, nodemask, - totalpages); - if (score > max_score) - max_score = score; + if (!eligible && + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) + eligible = 1; + + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + eligible = -1; + break; + } } css_task_iter_end(&it); - return max_score; - } + if (eligible <= 0) { + mem_cgroup_iter_break(memcg, iter); + points = -1; + break; + } - for_each_node_state(nid, N_MEMORY) { - if (nodemask && !node_isset(nid, *nodemask)) - continue; + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; - points += mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); - pgdat = NODE_DATA(nid); - points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), - NR_SLAB_UNRECLAIMABLE); - } + pgdat = NODE_DATA(nid); + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg), + NR_SLAB_UNRECLAIMABLE); + } - points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / - (PAGE_SIZE / 1024); - points += memcg_page_state(memcg, MEMCG_SOCK); - points += memcg_page_state(memcg, MEMCG_SWAP); + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + } return points; } @@ -2741,43 +2745,56 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages) { - struct css_task_iter it; - struct task_struct *task; - int eligible = 0; - /* - * Memcg is OOM eligible if there are OOM killable tasks inside. + * We don't have necessary stats for the root memcg, + * so we define it's oom_score as the maximum oom_score + * of the belonging tasks. 
* - * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN - * as unkillable. + * As tasks in the root memcg unlikely are parts of a + * single workload, and we don't have to implement + * group killing, this approximation is reasonable. * - * If there are inflight OOM victim tasks inside the memcg, - * we return -1. + * But if we will have necessary stats for the root memcg, + * we might switch to the approach which is used for all + * other memcgs. */ - css_task_iter_start(&memcg->css, 0, &it); - while ((task = css_task_iter_next(&it))) { - if (!eligible && - task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) - eligible = 1; - - if (tsk_is_oom_victim(task) && - !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { - eligible = -1; - break; + if (memcg == root_mem_cgroup) { + struct css_task_iter it; + struct task_struct *task; + long score, max_score = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + max_score = -1; + break; + } + score = oom_badness(task, memcg, nodemask, + totalpages); + if (score > max_score) + max_score = score; } - } - css_task_iter_end(&it); + css_task_iter_end(&it); - if (eligible <= 0) - return eligible; + return max_score; + } - return memcg_oom_badness(memcg, nodemask, totalpages); + return memcg_oom_badness(memcg, nodemask); } +static bool memcg_is_oom_eligible(struct mem_cgroup *memcg) +{ + if (mem_cgroup_oom_group(memcg)) + return true; + if (cgroup_has_tasks(memcg->css.cgroup)) + return true; + + return false; +} static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) { - struct mem_cgroup *iter, *group = NULL; - long group_score = 0; + struct mem_cgroup *iter; oc->chosen_memcg = NULL; oc->chosen_points = 0; @@ -2803,35 +2820,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) for_each_mem_cgroup_tree(iter, root) { long score; - 
/* - * We don't consider non-leaf non-oom_group memory cgroups - * as OOM victims. - */ - if (memcg_has_children(iter) && iter != root_mem_cgroup && - !mem_cgroup_oom_group(iter)) - continue; - - /* - * If group is not set or we've ran out of the group's sub-tree, - * we should set group and reset group_score. - */ - if (!group || group == root_mem_cgroup || - !mem_cgroup_is_descendant(iter, group)) { - group = iter; - group_score = 0; - } - - if (memcg_has_children(iter) && iter != root_mem_cgroup) + if (!memcg_is_oom_eligible(iter)) continue; score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages); - /* - * Ignore empty and non-eligible memory cgroups. - */ - if (score == 0) - continue; - /* * If there are inflight OOM victims, we don't need * to look further for new victims. @@ -2842,11 +2835,9 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) break; } - group_score += score; - - if (group_score > oc->chosen_points) { - oc->chosen_points = group_score; - oc->chosen_memcg = group; + if (score > oc->chosen_points) { + oc->chosen_points = score; + oc->chosen_memcg = iter; } } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 27+ messages in thread
[parent not found: <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [v11 4/6] mm, oom: introduce memory.oom_group [not found] ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2017-10-06 12:04 ` Roman Gushchin 2017-10-06 12:17 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Roman Gushchin @ 2017-10-06 12:04 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, Oct 05, 2017 at 04:31:04PM +0200, Michal Hocko wrote: > Btw. here is how I would do the recursive oom badness. The diff is not > the nicest one because there is some code moving but the resulting code > is smaller and imho easier to grasp. Only compile tested though Thanks! I'm not against this approach, and maybe it can lead to better code, but the version you sent is just not there yet. There are some problems with it: 1) If there are nested cgroups with oom_group set, the badness will be calculated multiple times, relying on the fact that the top memcg ends up with the largest score. It can be optimized, of course, but that's additional code. 2) cgroup_has_tasks() probably requires additional locking. Maybe it's ok to read nr_populated_csets without explicit locking, but it's not obvious to me. 3) Returning -1 from memcg_oom_badness() if eligible is equal to 0 is suspicious. Right now your version has exactly the same amount of code (skipping comments). I assume this approach just requires some additional thinking/rework. Anyway, thank you for sharing this! 
> --- > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 085056e562b1..9cdba4682198 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -122,6 +122,11 @@ void cgroup_free(struct task_struct *p); > int cgroup_init_early(void); > int cgroup_init(void); > > +static bool cgroup_has_tasks(struct cgroup *cgrp) > +{ > + return cgrp->nr_populated_csets; > +} > + > /* > * Iteration helpers and macros. > */ > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 8dacf73ad57e..a2dd7e3ffe23 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -319,11 +319,6 @@ static void cgroup_idr_remove(struct idr *idr, int id) > spin_unlock_bh(&cgroup_idr_lock); > } > > -static bool cgroup_has_tasks(struct cgroup *cgrp) > -{ > - return cgrp->nr_populated_csets; > -} > - > bool cgroup_is_threaded(struct cgroup *cgrp) > { > return cgrp->dom_cgrp != cgrp; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [v11 4/6] mm, oom: introduce memory.oom_group 2017-10-06 12:04 ` Roman Gushchin @ 2017-10-06 12:17 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2017-10-06 12:17 UTC (permalink / raw) To: Roman Gushchin Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel On Fri 06-10-17 13:04:35, Roman Gushchin wrote: > On Thu, Oct 05, 2017 at 04:31:04PM +0200, Michal Hocko wrote: > > Btw. here is how I would do the recursive oom badness. The diff is not > > the nicest one because there is some code moving but the resulting code > > is smaller and imho easier to grasp. Only compile tested though > > Thanks! > > I'm not against this approach, and maybe it can lead to better code, > but the version you sent is just not there yet. > > There are some problems with it: > > 1) If there are nested cgroups with oom_group set, the badness will be > calculated multiple times, relying on the fact that the top memcg ends > up with the largest score. It can be optimized, of course, but that's > additional code. right. As I've said, we can introduce an iterator helper to skip the subtree, but I suspect it will not make much of a difference. > > 2) cgroup_has_tasks() probably requires additional locking. > Maybe it's ok to read nr_populated_csets without explicit locking, > but it's not obvious to me. I do not see why. Tasks are free to come and go, and you only know at the time you are killing. > 3) Returning -1 from memcg_oom_badness() if eligible is equal to 0 > is suspicious. I didn't spend too much time on it. I merely wanted to point out my thinking more specifically than the pseudocode posted earlier. But this should be ok, because that would mean that either all tasks are OOM_SCORE_ADJ_MIN (eligible = 0) or there is an inflight victim (eligible = -1). 
Anyway, the initialization should go inside the tree walk. > Right now your version has exactly the same amount of code > (skipping comments). I assume this approach just requires some additional > thinking/rework. Well, this is not about the amount of code but more about the clear logic implemented at the correct level. It is simply much easier to evaluate the killable entity in one place rather than open-code it. But, as I've said, this is nothing I would want to enforce. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin ` (3 preceding siblings ...) [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org> @ 2017-10-05 13:04 ` Roman Gushchin 2017-10-05 13:04 ` [v11 6/6] mm, oom, docs: describe the " Roman Gushchin 5 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware OOM killer. If not set, the OOM selection is performed in a "traditional" per-process way. The behavior can be changed dynamically by remounting the cgroupfs. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/cgroup-defs.h | 5 +++++ kernel/cgroup/cgroup.c | 10 ++++++++++ mm/memcontrol.c | 3 +++ 3 files changed, 18 insertions(+) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 3e55bbd31ad1..cae5343a8b21 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -80,6 +80,11 @@ enum { * Enable cpuset controller in v1 cgroup to use v2 behavior. */ CGRP_ROOT_CPUSET_V2_MODE = (1 << 4), + + /* + * Enable cgroup-aware OOM killer. 
+ */ + CGRP_GROUP_OOM = (1 << 5), }; /* cftype->flags */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c3421ee0d230..8d8aa46ff930 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags) if (!strcmp(token, "nsdelegate")) { *root_flags |= CGRP_ROOT_NS_DELEGATE; continue; + } else if (!strcmp(token, "groupoom")) { + *root_flags |= CGRP_GROUP_OOM; + continue; } pr_err("cgroup2: unknown option \"%s\"\n", token); @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE; else cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE; + + if (root_flags & CGRP_GROUP_OOM) + cgrp_dfl_root.flags |= CGRP_GROUP_OOM; + else + cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM; } } @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root { if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE) seq_puts(seq, ",nsdelegate"); + if (cgrp_dfl_root.flags & CGRP_GROUP_OOM) + seq_puts(seq, ",groupoom"); return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d5acb278b11a..fe6155d827c1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2866,6 +2866,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return false; + if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM)) + return false; + if (oc->memcg) root = oc->memcg; else -- 2.13.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
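For context, here is a sketch of how the new option would be exercised from a shell. The mount point is an assumption (any cgroup2 mount works), and nothing below is taken from the patch itself:

```shell
# Hypothetical usage sketch -- requires root and a cgroup2 mount.
# Enable the cgroup-aware OOM killer on a running system:
mount -o remount,groupoom -t cgroup2 none /sys/fs/cgroup

# The option is reflected in the mount flags:
grep cgroup2 /proc/mounts

# Remounting without the option restores per-process selection:
mount -o remount -t cgroup2 none /sys/fs/cgroup
```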
* [v11 6/6] mm, oom, docs: describe the cgroup-aware OOM killer 2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin ` (4 preceding siblings ...) 2017-10-05 13:04 ` [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin @ 2017-10-05 13:04 ` Roman Gushchin 5 siblings, 0 replies; 27+ messages in thread From: Roman Gushchin @ 2017-10-05 13:04 UTC (permalink / raw) To: linux-mm Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel Document the cgroup-aware OOM killer. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- Documentation/cgroup-v2.txt | 51 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index 3f8216912df0..28429e62b0ea 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines 5-2-3. Memory Ownership + 5-2-4. OOM Killer 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -1043,6 +1044,28 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom_group + + A read-write single value file which exists on non-root + cgroups. The default is "0". 
+ + If set, the OOM killer will consider the memory cgroup as an + indivisible memory consumer and compare it with other memory + consumers by its memory footprint. + If such a memory cgroup is selected as an OOM victim, all + processes belonging to it or its descendants will be killed. + + This applies to system-wide OOM conditions and to reaching + the hard memory limit of the cgroup or any of its ancestors. + If an OOM condition happens in a descendant cgroup with its own + memory limit, the memory cgroup can't be considered + as an OOM victim, and the OOM killer will not kill all of the + belonging tasks. + + Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000, + and will never kill such a task, even if memory.oom_group + is set. + memory.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified @@ -1246,6 +1269,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. +OOM Killer +~~~~~~~~~~ + +The cgroup v2 memory controller implements a cgroup-aware OOM killer, +which means it treats cgroups as first-class OOM entities. + +Under OOM conditions the memory controller tries to make the best +choice of a victim, looking for a memory cgroup with the largest +memory footprint, considering leaf cgroups and cgroups with the +memory.oom_group option set, which are considered to be indivisible +memory consumers. + +By default, the OOM killer will kill the biggest task in the selected +memory cgroup. A user can change this behavior by enabling +the per-cgroup memory.oom_group option. If set, it causes +the OOM killer to kill all processes attached to the cgroup, +except processes with oom_score_adj set to -1000. + +This affects both system- and cgroup-wide OOMs. 
For a cgroup-wide OOM +the memory controller considers only cgroups belonging to the sub-tree +of the OOM'ing cgroup. + +The root cgroup is treated as a leaf memory cgroup, so it's compared +with other leaf memory cgroups and cgroups with the oom_group option set. + +If there are no cgroups with the memory controller enabled, +the OOM killer uses the "traditional" process-based approach. + IO -- -- 2.13.6 ^ permalink raw reply related [flat|nested] 27+ messages in thread
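The documented workflow boils down to two writes; the cgroup path and process name below are hypothetical, shown only to make the interface concrete:

```shell
# Hypothetical sketch: treat one workload cgroup as an indivisible
# OOM consumer, so a cgroup-wide OOM kills all of its processes.
echo 1 > /sys/fs/cgroup/workload/memory.oom_group

# A process that must survive even a group kill keeps the
# long-standing per-process protection:
echo -1000 > /proc/$(pidof important-agent)/oom_score_adj
```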
Thread overview: 27+ messages (newest: ~2017-10-13 21:31 UTC)
2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
[not found] ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org>
2017-10-09 21:11 ` David Rientjes
2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
[not found] ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org>
2017-10-05 14:27 ` Michal Hocko
2017-10-09 21:52 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-10 8:18 ` Michal Hocko
2017-10-10 12:23 ` Roman Gushchin
2017-10-10 21:13 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-10 22:04 ` Roman Gushchin
2017-10-11 20:21 ` David Rientjes
[not found] ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-11 21:49 ` Roman Gushchin
2017-10-12 21:50 ` David Rientjes
2017-10-13 13:32 ` Roman Gushchin
[not found] ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2017-10-13 21:31 ` David Rientjes
2017-10-11 13:08 ` Michal Hocko
2017-10-11 20:27 ` David Rientjes
2017-10-12 6:33 ` Michal Hocko
2017-10-11 16:10 ` Roman Gushchin
[not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org>
2017-10-05 13:04 ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
2017-10-05 14:29 ` Michal Hocko
2017-10-05 14:31 ` Michal Hocko
[not found] ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-10-06 12:04 ` Roman Gushchin
2017-10-06 12:17 ` Michal Hocko
2017-10-05 13:04 ` [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04 ` [v11 6/6] mm, oom, docs: describe the " Roman Gushchin