[patch -mm 0/4] mm, memcg: introduce oom policies

public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed

* [patch -mm 0/4] mm, memcg: introduce oom policies
@ 2018-01-17  2:14 David Rientjes
  2018-01-17  2:15 ` [patch -mm 1/4] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
                   ` (5 more replies)
  0 siblings, 6 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-17  2:14 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:

 (1) allows users to evade the oom killer by creating subcontainers or
     using other controllers since scoring is done per cgroup and not
     hierarchically,

 (2) does not allow the user to influence the decisionmaking, such that
     important subtrees cannot be preferred or biased, and

 (3) unfairly compares the root mem cgroup using completely different
     criteria than leaf mem cgroups and allows wildly inaccurate results
     if oom_score_adj is used.

This patchset aims to fix (1) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.

It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.

It eliminates a pointless tunable, memory.oom_group, that unnecessarily
pollutes the mem cgroup v2 filesystem and is invalid when cgroup v2 is
mounted with the "groupoom" option.
---
 Applied on top of -mm.

 Documentation/cgroup-v2.txt |  87 ++++++++++++++++-----------------
 include/linux/cgroup-defs.h |   5 --
 include/linux/memcontrol.h  |  37 ++++++++++----
 kernel/cgroup/cgroup.c      |  13 +----
 mm/memcontrol.c             | 116 +++++++++++++++++++++++++-------------------
 mm/oom_kill.c               |   4 +-
 6 files changed, 139 insertions(+), 123 deletions(-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm 1/4] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-17  2:14 [patch -mm 0/4] mm, memcg: introduce oom policies David Rientjes
@ 2018-01-17  2:15 ` David Rientjes
  2018-01-17  2:15 ` [patch -mm 2/4] mm, memcg: replace cgroup aware oom killer mount option with tunable David Rientjes
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-17  2:15 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

The cgroup aware oom killer is needlessly declared for the entire system
by a mount option.  It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.

This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy.  It will be expanded in the next patch to define cgroup
aware oom killer behavior.

This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.

Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree and can
delegate the policy decision to the user to determine if the kill should
originate from a subcontainer, as indivisible memory consumers
themselves, or selection should be done per process.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroup-v2.txt |  9 +++++++++
 include/linux/memcontrol.h  | 11 +++++++++++
 mm/memcontrol.c             | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1060,6 +1060,15 @@ PAGE_SIZE multiple when read back.
 	If cgroup-aware OOM killer is not enabled, ENOTSUPP error
 	is returned on attempt to access the file.
 
+  memory.oom_policy
+
+	A read-write single string file which exists on all cgroups.  The
+	default value is "none".
+
+	If "none", the OOM killer will use the default policy to choose a
+	victim; that is, it will choose the single process with the largest
+	memory footprint.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
 	MEMCG_NR_EVENTS,
 };
 
+enum memcg_oom_policy {
+	/*
+	 * No special oom policy, process selection is determined by
+	 * oom_badness()
+	 */
+	MEMCG_OOM_POLICY_NONE,
+};
+
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
 	int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* OOM policy for this subtree */
+	enum memcg_oom_policy oom_policy;
+
 	/*
 	 * Treat the sub-tree as an indivisible memory consumer,
 	 * kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4417,6 +4417,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
+		memcg->oom_policy = parent->oom_policy;
 	}
 	if (parent && parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
@@ -5534,6 +5535,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+	switch (policy) {
+	case MEMCG_OOM_POLICY_NONE:
+	default:
+		seq_puts(m, "none\n");
+	};
+	return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+				       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	ssize_t ret = nbytes;
+
+	buf = strstrip(buf);
+	if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+	else
+		ret = -EINVAL;
+
+	return ret;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5575,6 +5604,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "oom_policy",
+		.flags = CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_policy_show,
+		.write = memory_oom_policy_write,
+	},
 	{ }	/* terminate */
 };
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm 2/4] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-17  2:14 [patch -mm 0/4] mm, memcg: introduce oom policies David Rientjes
  2018-01-17  2:15 ` [patch -mm 1/4] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
@ 2018-01-17  2:15 ` David Rientjes
       [not found] ` <alpine.DEB.2.10.1801161812550.28198-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-17  2:15 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.

Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroup-v2.txt | 43 +++++++++++++++++++++----------------------
 include/linux/cgroup-defs.h |  5 -----
 include/linux/memcontrol.h  |  5 +++++
 kernel/cgroup/cgroup.c      | 13 +------------
 mm/memcontrol.c             | 17 ++++++++---------
 5 files changed, 35 insertions(+), 48 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1069,6 +1069,10 @@ PAGE_SIZE multiple when read back.
 	victim; that is, it will choose the single process with the largest
 	memory footprint.
 
+	If "cgroup", the OOM killer will compare mem cgroups as indivisible
+	memory consumers; that is, they will compare mem cgroup usage rather
+	than process memory footprint.  See the "OOM Killer" section.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1275,37 +1279,32 @@ belonging to the affected files to ensure correct memory ownership.
 OOM Killer
 ~~~~~~~~~~
 
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
 
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
+This policy is controlled by memory.oom_policy.  When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill.  Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
 
-  # mount -o remount,groupoom $MOUNT_POINT
+There are currently two available oom policies:
 
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
+ - "none": default, choose the largest single memory hogging process to
+   oom kill, as traditionally the OOM killer has always done.
 
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+   subtree as an OOM victim and kill at least one process, depending on
+   memory.oom_group, from it.
 
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint.  A user can control this behavior by
+enabling the per-cgroup memory.oom_group option.  If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
 
 The root cgroup is treated as a leaf memory cgroup, so it's compared
 with other leaf memory cgroups and cgroups with oom_group option set.
 
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
-
 Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
-	/*
-	 * Enable cgroup-aware OOM killer.
-	 */
-	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -64,6 +64,11 @@ enum memcg_oom_policy {
 	 * oom_badness()
 	 */
 	MEMCG_OOM_POLICY_NONE,
+	/*
+	 * Local cgroup usage is used to select a target cgroup, treating each
+	 * mem cgroup as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_CGROUP,
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1732,9 +1732,6 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
-		} else if (!strcmp(token, "groupoom")) {
-			*root_flags |= CGRP_GROUP_OOM;
-			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1751,11 +1748,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
-		if (root_flags & CGRP_GROUP_OOM)
-			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
-		else
-			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1763,8 +1755,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
-	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
-		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
@@ -5921,8 +5911,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
-					"groupoom\n");
+	return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2798,14 +2798,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return false;
-
 	if (oc->memcg)
 		root = oc->memcg;
 	else
 		root = root_mem_cgroup;
 
+	if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+		return false;
+
 	select_victim_memcg(root, oc);
 
 	return oc->chosen_memcg;
@@ -5412,9 +5412,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	bool oom_group = memcg->oom_group;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	seq_printf(m, "%d\n", oom_group);
 
 	return 0;
@@ -5428,9 +5425,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	int oom_group;
 	int err;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	err = kstrtoint(strstrip(buf), 0, &oom_group);
 	if (err)
 		return err;
@@ -5541,6 +5535,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
 
 	switch (policy) {
+	case MEMCG_OOM_POLICY_CGROUP:
+		seq_puts(m, "cgroup\n");
+		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
 		seq_puts(m, "none\n");
@@ -5557,6 +5554,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 	buf = strstrip(buf);
 	if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
 	else
 		ret = -EINVAL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <alpine.DEB.2.10.1801161812550.28198-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]

* [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
       [not found] ` <alpine.DEB.2.10.1801161812550.28198-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2018-01-17  2:15   ` David Rientjes
  2018-01-17 15:41     ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-17  2:15 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

The behavior of killing an entire indivisible memory consumer, enabled
by memory.oom_group, is an oom policy itself.  It specifies that all
usage should be accounted to an ancestor and, if selected by the cgroup
aware oom killer, all processes attached to it and its descendant mem
cgroups should be oom killed.

This is replaced by writing "all" to memory.oom_policy and allows for the
same functionality as the existing memory.oom_group without (1) polluting
the mem cgroup v2 filesystem unnecessarily and (2) unnecessarily when the
"groupoom" mount option is not used (now by writing "cgroup" to the root
mem cgroup's memory.oom_policy).

The "all" oom policy cannot be enabled on the root mem cgroup.

Signed-off-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/cgroup-v2.txt | 51 ++++++++++++++---------------------------
 include/linux/memcontrol.h  | 18 +++++++--------
 mm/memcontrol.c             | 55 ++++++++++++---------------------------------
 mm/oom_kill.c               |  4 ++--
 4 files changed, 41 insertions(+), 87 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1035,31 +1035,6 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
-  memory.oom_group
-
-	A read-write single value file which exists on non-root
-	cgroups.  The default is "0".
-
-	If set, OOM killer will consider the memory cgroup as an
-	indivisible memory consumers and compare it with other memory
-	consumers by it's memory footprint.
-	If such memory cgroup is selected as an OOM victim, all
-	processes belonging to it or it's descendants will be killed.
-
-	This applies to system-wide OOM conditions and reaching
-	the hard memory limit of the cgroup and their ancestor.
-	If OOM condition happens in a descendant cgroup with it's own
-	memory limit, the memory cgroup can't be considered
-	as an OOM victim, and OOM killer will not kill all belonging
-	tasks.
-
-	Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
-	and will never kill the unkillable task, even if memory.oom_group
-	is set.
-
-	If cgroup-aware OOM killer is not enabled, ENOTSUPP error
-	is returned on attempt to access the file.
-
   memory.oom_policy
 
 	A read-write single string file which exists on all cgroups.  The
@@ -1073,6 +1048,11 @@ PAGE_SIZE multiple when read back.
 	memory consumers; that is, they will compare mem cgroup usage rather
 	than process memory footprint.  See the "OOM Killer" section.
 
+	If "all", the OOM killer will compare mem cgroups and its subtree
+	as indivisible memory consumers and kill all processes attached to
+	the mem cgroup and its subtree.  This policy cannot be set on the
+	root mem cgroup.  See the "OOM Killer" section.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1287,29 +1267,32 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
 select a process, or cgroup, to kill.  Likewise, when the system is OOM,
 the policy is dictated by the root mem cgroup.
 
-There are currently two available oom policies:
+There are currently three available oom policies:
 
  - "none": default, choose the largest single memory hogging process to
    oom kill, as traditionally the OOM killer has always done.
 
  - "cgroup": choose the cgroup with the largest memory footprint from the
-   subtree as an OOM victim and kill at least one process, depending on
-   memory.oom_group, from it.
+   subtree as an OOM victim and kill at least one process.
+
+ - "all": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill all processes attached (cannot be set on
+   the root mem cgroup).
 
 When selecting a cgroup as a victim, the OOM killer will kill the process
-with the largest memory footprint.  A user can control this behavior by
-enabling the per-cgroup memory.oom_group option.  If set, it causes the
-OOM killer to kill all processes attached to the cgroup, except processes
-with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+with the largest memory footprint, unless the policy is specified as "all".
+In that case, the OOM killer will kill all processes attached to the cgroup
+and its subtree, except processes with /proc/pid/oom_score_adj set to
+-1000 (oom disabled).
 
 The root cgroup is treated as a leaf memory cgroup, so it's compared
-with other leaf memory cgroups and cgroups with oom_group option set.
+with other leaf memory cgroups.
 
 Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
 If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
+the source and destination memory cgroups and setting a policy of "all"
 on ancestor layer.
 
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,6 +69,11 @@ enum memcg_oom_policy {
 	 * mem cgroup as an indivisible consumer
 	 */
 	MEMCG_OOM_POLICY_CGROUP,
+	/*
+	 * Same as MEMCG_OOM_POLICY_CGROUP, but all eligible processes attached
+	 * to the cgroup and subtree should be oom killed
+	 */
+	MEMCG_OOM_POLICY_ALL,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -219,13 +224,6 @@ struct mem_cgroup {
 	/* OOM policy for this subtree */
 	enum memcg_oom_policy oom_policy;
 
-	/*
-	 * Treat the sub-tree as an indivisible memory consumer,
-	 * kill all belonging tasks if the memory cgroup selected
-	 * as OOM victim.
-	 */
-	bool oom_group;
-
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -513,9 +511,9 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
-static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_oom_policy_all(struct mem_cgroup *memcg)
 {
-	return memcg->oom_group;
+	return memcg->oom_policy == MEMCG_OOM_POLICY_ALL;
 }
 
 #ifdef CONFIG_MEMCG_SWAP
@@ -1019,7 +1017,7 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	return false;
 }
 
-static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_oom_policy_all(struct mem_cgroup *memcg)
 {
 	return false;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2715,11 +2715,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	oc->chosen_points = 0;
 
 	/*
-	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
-	 * all tasks belonging to the memcg should be killed.
+	 * If OOM is memcg-wide, and the oom policy is "all", all processes
+	 * attached to the memcg and subtree should be killed.
 	 * So, we mark the memcg as a victim.
 	 */
-	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+	if (oc->memcg && mem_cgroup_oom_policy_all(oc->memcg)) {
 		oc->chosen_memcg = oc->memcg;
 		css_get(&oc->chosen_memcg->css);
 		return;
@@ -2728,7 +2728,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
-	 * Non-leaf oom_group cgroups accumulating score of descendant
+	 * Cgroups with oom policy of "all" accumulate the score of descendant
 	 * leaf memory cgroups.
 	 */
 	rcu_read_lock();
@@ -2736,11 +2736,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 		long score;
 
 		/*
-		 * We don't consider non-leaf non-oom_group memory cgroups
-		 * as OOM victims.
+		 * We don't consider non-leaf memory cgroups without the oom
+		 * policy of "all" as oom victims.
 		 */
 		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
-		    !mem_cgroup_oom_group(iter))
+		    !mem_cgroup_oom_policy_all(iter))
 			continue;
 
 		/*
@@ -2803,7 +2803,7 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	else
 		root = root_mem_cgroup;
 
-	if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+	if (root->oom_policy == MEMCG_OOM_POLICY_NONE)
 		return false;
 
 	select_victim_memcg(root, oc);
@@ -5407,33 +5407,6 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
-static int memory_oom_group_show(struct seq_file *m, void *v)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
-	bool oom_group = memcg->oom_group;
-
-	seq_printf(m, "%d\n", oom_group);
-
-	return 0;
-}
-
-static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
-				      char *buf, size_t nbytes,
-				      loff_t off)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	int oom_group;
-	int err;
-
-	err = kstrtoint(strstrip(buf), 0, &oom_group);
-	if (err)
-		return err;
-
-	memcg->oom_group = oom_group;
-
-	return nbytes;
-}
-
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5538,6 +5511,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	case MEMCG_OOM_POLICY_CGROUP:
 		seq_puts(m, "cgroup\n");
 		break;
+	case MEMCG_OOM_POLICY_ALL:
+		seq_puts(m, "all\n");
+		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
 		seq_puts(m, "none\n");
@@ -5556,6 +5532,9 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
 	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+	else if (memcg != root_mem_cgroup &&
+			!memcmp("all", buf, min(sizeof("all")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_ALL;
 	else
 		ret = -EINVAL;
 
@@ -5586,12 +5565,6 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_max_show,
 		.write = memory_max_write,
 	},
-	{
-		.name = "oom_group",
-		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
-		.seq_show = memory_oom_group_show,
-		.write = memory_oom_group_write,
-	},
 	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1000,11 +1000,11 @@ static bool oom_kill_memcg_victim(struct oom_control *oc)
 		return oc->chosen_memcg;
 
 	/*
-	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
+	 * If the oom policy is "all", kill all tasks belonging to the sub-tree
 	 * of the chosen memory cgroup, otherwise kill the task with the biggest
 	 * memory footprint.
 	 */
-	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
+	if (mem_cgroup_oom_policy_all(oc->chosen_memcg)) {
 		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
 				      NULL);
 		/* We have one or more terminating processes at this point. */

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-17  2:15   ` [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable David Rientjes
@ 2018-01-17 15:41     ` Tejun Heo
  2018-01-17 16:00       ` Michal Hocko
  2018-01-17 22:14       ` David Rientjes
  0 siblings, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2018-01-17 15:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

Hello, David.

On Tue, Jan 16, 2018 at 06:15:08PM -0800, David Rientjes wrote:
> The behavior of killing an entire indivisible memory consumer, enabled
> by memory.oom_group, is an oom policy itself.  It specifies that all

I thought we discussed this before but maybe I'm misremembering.
There are two parts to the OOM policy.  One is victim selection, the
other is the action to take thereafter.

The two are different and conflating the two don't work too well.  For
example, please consider what should be given to the delegatee when
delegating a subtree, which often is a good excercise when designing
these APIs.

When a given workload is selected for OOM kill (IOW, selected to free
some memory), whether the workload can handle individual process kills
or not is the property of the workload itself.  Some applications can
safely handle some of its processes picked off and killed.  Most
others can't and want to be handled as a single unit, which makes it a
property of the workload.

That makes sense in the hierarchy too because whether one process or
the whole workload is killed doesn't infringe upon the parent's
authority over resources which in turn implies that there's nothing to
worry about how the parent's groupoom setting should constrain the
descendants.

OOM victim selection policy is a different beast.  As you've mentioned
multiple times, especially if you're worrying about people abusing OOM
policies by creating sub-cgroups and so on, the policy, first of all,
shouldn't be delegatable and secondly should have meaningful
hierarchical restrictions so that a policy that an ancestor chose
can't be nullified by a descendant.

I'm not necessarily against adding hierarchical victim selection
policy tunables; however, I am skeptical whether static tunables on
cgroup hierarchy (including selectable policies) can be made clean and
versatile enough, especially because the resource hierarchy doesn't
necessarily, or rather in most cases, match the OOM victim selection
decision tree, but I'd be happy to be proven wrong.

Without explicit configurations, the only thing the OOM killer needs
to guarantee is that the system can make forward progress.  We've
always been tweaking victim selection with or without cgroup and
absolutely shouldn't be locked into a specific heuristics.  The
heuristics is an implementaiton detail subject to improvements.

To me, your patchset actually seems to demonstrate that these are
separate issues.  The goal of groupoom is just to kill logical units
as cgroup hierarchy can inform the kernel of how workloads are
composed in the userspace.  If you want to improve victim selection,
sure, please go ahead, but your argument that groupoom can't be merged
because of victim selection policy doesn't make sense to me.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-17 15:41     ` Tejun Heo
@ 2018-01-17 16:00       ` Michal Hocko
  2018-01-17 22:18         ` David Rientjes
  2018-01-17 22:14       ` David Rientjes
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-17 16:00 UTC (permalink / raw)
  To: David Rientjes, Tejun Heo
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Wed 17-01-18 07:41:55, Tejun Heo wrote:
> Hello, David.
> 
> On Tue, Jan 16, 2018 at 06:15:08PM -0800, David Rientjes wrote:
> > The behavior of killing an entire indivisible memory consumer, enabled
> > by memory.oom_group, is an oom policy itself.  It specifies that all
> 
> I thought we discussed this before but maybe I'm misremembering.
> There are two parts to the OOM policy.  One is victim selection, the
> other is the action to take thereafter.

Yes we have. Multiple times! The last time I've said the very same thing
was yesterday http://lkml.kernel.org/r/20180116220907.GD17351@dhcp22.suse.cz

> The two are different and conflating the two don't work too well.  For
> example, please consider what should be given to the delegatee when
> delegating a subtree, which often is a good excercise when designing
> these APIs.

Absolutely agreed! And moreover, there are not all that many ways what
to do as an action. You just kill a logical entity - be it a process or
a logical group of processes. But you have way too many policies how
to select that entity. Do you want to chose the youngest process/group
because all the older ones have been computing real stuff and you would
lose days of your cpu time? Or should those who pay more should be
protected (aka give them static priorities), or you name it...

I am sorry, I still didn't grasp the full semantic of the proposed
soluton but the mere fact it is starting by conflating selection and the
action is a no go and a wrong API. This is why I've said that what you
(David) outlined yesterday is probably going to suffer from a much
longer discussion and most likely to be not acceptable. Your patchset
proves me correct...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-17 16:00       ` Michal Hocko
@ 2018-01-17 22:18         ` David Rientjes
  2018-01-23 15:13           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-17 22:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed, 17 Jan 2018, Michal Hocko wrote:

> Absolutely agreed! And moreover, there are not all that many ways what
> to do as an action. You just kill a logical entity - be it a process or
> a logical group of processes. But you have way too many policies how
> to select that entity. Do you want to chose the youngest process/group
> because all the older ones have been computing real stuff and you would
> lose days of your cpu time? Or should those who pay more should be
> protected (aka give them static priorities), or you name it...
> 

That's an argument for making the interface extensible, yes.

> I am sorry, I still didn't grasp the full semantic of the proposed
> soluton but the mere fact it is starting by conflating selection and the
> action is a no go and a wrong API. This is why I've said that what you
> (David) outlined yesterday is probably going to suffer from a much
> longer discussion and most likely to be not acceptable. Your patchset
> proves me correct...

I'm very happy to change the API if there are better suggestions.  That 
may end up just being an memory.oom_policy file, as this implements, and 
separating out a new memory.oom_action that isn't a boolean value to 
either do a full group kill or only a single process.  Or it could be what 
I suggested in my mail to Tejun, such as "hierarchy killall" written to
memory.oom_policy, which would specify a single policy and then an 
optional mechanism.  With my proposed patchset, there would then be three 
policies: "none", "cgroup", and "tree" and one possible optional 
mechanism: "killall".

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-17 22:18         ` David Rientjes
@ 2018-01-23 15:13           ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-01-23 15:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed 17-01-18 14:18:33, David Rientjes wrote:
> On Wed, 17 Jan 2018, Michal Hocko wrote:
> 
> > Absolutely agreed! And moreover, there are not all that many ways what
> > to do as an action. You just kill a logical entity - be it a process or
> > a logical group of processes. But you have way too many policies how
> > to select that entity. Do you want to chose the youngest process/group
> > because all the older ones have been computing real stuff and you would
> > lose days of your cpu time? Or should those who pay more should be
> > protected (aka give them static priorities), or you name it...
> > 
> 
> That's an argument for making the interface extensible, yes.

And there is no interface to control the selection yet so we can develop
one on top.
 
> > I am sorry, I still didn't grasp the full semantic of the proposed
> > soluton but the mere fact it is starting by conflating selection and the
> > action is a no go and a wrong API. This is why I've said that what you
> > (David) outlined yesterday is probably going to suffer from a much
> > longer discussion and most likely to be not acceptable. Your patchset
> > proves me correct...
> 
> I'm very happy to change the API if there are better suggestions.  That 
> may end up just being an memory.oom_policy file, as this implements, and 
> separating out a new memory.oom_action that isn't a boolean value to 
> either do a full group kill or only a single process.  Or it could be what 
> I suggested in my mail to Tejun, such as "hierarchy killall" written to
> memory.oom_policy, which would specify a single policy and then an 
> optional mechanism.  With my proposed patchset, there would then be three 
> policies: "none", "cgroup", and "tree" and one possible optional 
> mechanism: "killall".

You haven't convinced me at all. This all sounds more like "what if"
than a really thought through interface. I've tried to point out that
having a real policy driven victim selection is a _hard_ thing to do
_right_.

On the other hand oom_group makes semantic sense. It controls the
killable entity and there are usecases which want to consider the full
memcg as a single killable entity. No matter what selection policy we
chose on top. It is just a natural API.

Now you keep arguing about the victim selection and different strategies
to implement it. We will not move forward as long as you keep conflating
the two things, I am afraid.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-17 15:41     ` Tejun Heo
  2018-01-17 16:00       ` Michal Hocko
@ 2018-01-17 22:14       ` David Rientjes
  2018-01-19 20:53         ` David Rientjes
  1 sibling, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-17 22:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed, 17 Jan 2018, Tejun Heo wrote:

> Hello, David.
> 

Hi Tejun!

> > The behavior of killing an entire indivisible memory consumer, enabled
> > by memory.oom_group, is an oom policy itself.  It specifies that all
> 
> I thought we discussed this before but maybe I'm misremembering.
> There are two parts to the OOM policy.  One is victim selection, the
> other is the action to take thereafter.
> 
> The two are different and conflating the two don't work too well.  For
> example, please consider what should be given to the delegatee when
> delegating a subtree, which often is a good excercise when designing
> these APIs.
> 
> When a given workload is selected for OOM kill (IOW, selected to free
> some memory), whether the workload can handle individual process kills
> or not is the property of the workload itself.  Some applications can
> safely handle some of its processes picked off and killed.  Most
> others can't and want to be handled as a single unit, which makes it a
> property of the workload.
> 

Yes, this is a valid point.  The policy of "tree" and "all" are identical 
policies and then the mechanism differs wrt to whether one process is 
killed or all eligible processes are killed, respectively.  My motivation 
for this was to avoid having two different tunables, especially because 
later we'll need a way for userspace to influence the decisionmaking to 
protect (bias against) important subtrees.  What would really be nice is 
cgroup.subtree_control-type behavior where we could effect a policy and a 
mechanism at the same time.  It's not clear how that would be handled to 
allow only one policy and one mechanism, however, in a clean way.  The 
simplest for the user would be a new file, to specify the mechanism and 
leave memory.oom_policy alone.  Would another file really be warranted?  
Not sure.

> That makes sense in the hierarchy too because whether one process or
> the whole workload is killed doesn't infringe upon the parent's
> authority over resources which in turn implies that there's nothing to
> worry about how the parent's groupoom setting should constrain the
> descendants.
> 
> OOM victim selection policy is a different beast.  As you've mentioned
> multiple times, especially if you're worrying about people abusing OOM
> policies by creating sub-cgroups and so on, the policy, first of all,
> shouldn't be delegatable and secondly should have meaningful
> hierarchical restrictions so that a policy that an ancestor chose
> can't be nullified by a descendant.
> 

The goal here was to require a policy of either "tree" or "all" that the 
user can't change.  They are allowed to have their own oom policies 
internal to their subtree, however, for oom conditions in that subtree 
alone.  However, if the common ancestor hits its limit, it is forced to 
either be "tree" or "all" and require hierarchical usage to be considered 
instead of localized usage.  Either "tree" or "all" is appropriate, and 
this may be why you brought up the point about separating them out, i.e. 
the policy can be demanded by the common ancestor but the actual mechanism 
that the oom killer uses, kill either a single process or the full cgroup, 
is left to the user depending on their workload.  That sounds reasonable 
and I can easily separate the two by introducing a new file, similar to 
memory.oom_group but in a more extensible way so that it is not simply a 
selection of either full cgroup kill or single process.

> I'm not necessarily against adding hierarchical victim selection
> policy tunables; however, I am skeptical whether static tunables on
> cgroup hierarchy (including selectable policies) can be made clean and
> versatile enough, especially because the resource hierarchy doesn't
> necessarily, or rather in most cases, match the OOM victim selection
> decision tree, but I'd be happy to be proven wrong.
> 

Right, and I think that giving users control over their subtrees is a 
powerful tool and one that can lead to very effective use of the cgroup v2 
hierarchy.  Being able to circumvent the oom selection by creating child 
cgroups is certainly something that can trivially be prevented.  The 
argument that users can currently divide their entire processes into 
several different smaller processes to circumvent today's heuristic 
doesn't mean we can't have "tree"-like comparisons between cgroups to 
address that issue itself since all processes charge to the tree itself.

I became convinced of this when I saw the real-world usecases that would 
use such a feature on cgroup v2: we want to have hierarchical usage for 
comparison when full subtrees are dedicated to individual consumers, for 
example, and local mem cgroup usage for comparison when using hierarchies 
for top-level /admins and /students cgroups for which Michal provided an 
example.  These can coexist on systems and it's clear that there needs to 
be a system-wide policy decision for the cgroup aware oom killer (the idea 
behind the current mount option, which isn't needed anymore).  So defining 
the actual policy, and mechanism as you pointed out, for subtrees is a 
very powerful tool, it's extensible, doesn't require a system to either 
fully enable or disable, and doesn't require a remount of cgroup v2 to 
change.

> Without explicit configurations, the only thing the OOM killer needs
> to guarantee is that the system can make forward progress.  We've
> always been tweaking victim selection with or without cgroup and
> absolutely shouldn't be locked into a specific heuristics.  The
> heuristics is an implementaiton detail subject to improvements.
> 
> To me, your patchset actually seems to demonstrate that these are
> separate issues.  The goal of groupoom is just to kill logical units
> as cgroup hierarchy can inform the kernel of how workloads are
> composed in the userspace.  If you want to improve victim selection,
> sure, please go ahead, but your argument that groupoom can't be merged
> because of victim selection policy doesn't make sense to me.
> 

memory.oom_group, the mechanism behind what the oom killer chooses to do 
after victim selection, is not implemented without the selection heuristic 
comparing cgroups as indivisible memory consumers.  It could be done first 
prior to introducing the new selection criteria.  We don't have patches 
for that right now, because Roman's work introduces it in the opposite 
order.  If it is acceptable to add a separate file solely for this 
purpose, it's rather trivial to do.  My other thought was some kind of 
echo "hierarchy killall" > memory.oom_policy where the policy and an 
(optional) mechanism could be specified.  Your input on the actual 
tunables would be very valuable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-17 22:14       ` David Rientjes
@ 2018-01-19 20:53         ` David Rientjes
       [not found]           ` <alpine.DEB.2.10.1801191251080.177541-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-19 20:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed, 17 Jan 2018, David Rientjes wrote:

> Yes, this is a valid point.  The policy of "tree" and "all" are identical 
> policies and then the mechanism differs wrt to whether one process is 
> killed or all eligible processes are killed, respectively.  My motivation 
> for this was to avoid having two different tunables, especially because 
> later we'll need a way for userspace to influence the decisionmaking to 
> protect (bias against) important subtrees.  What would really be nice is 
> cgroup.subtree_control-type behavior where we could effect a policy and a 
> mechanism at the same time.  It's not clear how that would be handled to 
> allow only one policy and one mechanism, however, in a clean way.  The 
> simplest for the user would be a new file, to specify the mechanism and 
> leave memory.oom_policy alone.  Would another file really be warranted?  
> Not sure.
> 

Hearing no response, I'll implement this as a separate tunable in a v2 
series assuming there are no better ideas proposed before next week.  One 
of the nice things about a separate tunable is that an admin can control 
the overall policy and they can delegate the mechanism (killall vs one 
process) to a user subtree.  I agree with your earlier point that killall 
vs one process is a property of the workload and is better defined 
separately.

I'll also look to fix the breakage wrt root mem cgroup comparison with 
leaf mem cgroup comparison that is currently in -mm.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <alpine.DEB.2.10.1801191251080.177541-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
       [not found]           ` <alpine.DEB.2.10.1801191251080.177541-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2018-01-20 12:32             ` Tejun Heo
  2018-01-22 22:34               ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-01-20 12:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hello, David.

On Fri, Jan 19, 2018 at 12:53:41PM -0800, David Rientjes wrote:
> Hearing no response, I'll implement this as a separate tunable in a v2 
> series assuming there are no better ideas proposed before next week.  One 
> of the nice things about a separate tunable is that an admin can control 
> the overall policy and they can delegate the mechanism (killall vs one 
> process) to a user subtree.  I agree with your earlier point that killall 
> vs one process is a property of the workload and is better defined 
> separately.

If I understood your arguments correctly, the reasons that you thought
your selectdion policy changes must go together with Roman's victim
action were two-fold.

1. You didn't want a separate knob for group oom behavior and wanted
   it to be combined with selection policy.  I'm glad that you now
   recognize that this would be the wrong design choice.

2. The current selection policy may be exploited by delegatee and
   strictly hierarchical seleciton should be available.  We can debate
   the pros and cons of different heuristics; however, to me, the
   followings are clear.

   * Strictly hierarchical approach can't replace the current policy.
     It doesn't work well for a lot of use cases.

   * OOM victim selection policy has always been subject to changes
     and improvements.

I don't see any blocker here.  The issue you're raising can and should
be handled separately.

In terms of interface, what makes an interface bad is when the
purposes aren't crystalized enough and different interface pieces fail
to clearnly encapsulate what's actually necessary.

Here, whether a workload can survive being killed piece-wise or not is
an inherent property of the workload and a pretty binary one at that.
I'm not necessarily against changing it to take string inputs but
don't see rationales for doing so yet.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-20 12:32             ` Tejun Heo
@ 2018-01-22 22:34               ` David Rientjes
  2018-01-23 15:53                 ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-22 22:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Sat, 20 Jan 2018, Tejun Heo wrote:

> > Hearing no response, I'll implement this as a separate tunable in a v2 
> > series assuming there are no better ideas proposed before next week.  One 
> > of the nice things about a separate tunable is that an admin can control 
> > the overall policy and they can delegate the mechanism (killall vs one 
> > process) to a user subtree.  I agree with your earlier point that killall 
> > vs one process is a property of the workload and is better defined 
> > separately.
> 
> If I understood your arguments correctly, the reasons that you thought
> your selectdion policy changes must go together with Roman's victim
> action were two-fold.
> 
> 1. You didn't want a separate knob for group oom behavior and wanted
>    it to be combined with selection policy.  I'm glad that you now
>    recognize that this would be the wrong design choice.
> 

The memory.oom_action (or mechanism) file that I've proposed is different 
than memory.oom_group: we want to provide a non-binary tunable to specify 
what action that oom killer should effect.  That could be to kill all 
processes in the subtree, similar to memory.oom_group, the local cgroup, 
or a different mechanism.  I could propose the patchset backwards, if 
necessary, because memory.oom_group is currently built upon a broken 
selection heuristic.  In other words, I could propose a memory.oom_action 
that can specify two different mechanisms that would be useful outside of 
any different selection function.  However, since the mechanism is built 
on top of the cgroup aware oom killer's policy, we can't merge it 
currently without the broken logic.

> 2. The current selection policy may be exploited by delegatee and
>    strictly hierarchical seleciton should be available.  We can debate
>    the pros and cons of different heuristics; however, to me, the
>    followings are clear.
> 
>    * Strictly hierarchical approach can't replace the current policy.
>      It doesn't work well for a lot of use cases.
> 

-ECONFUSED.  I haven't proposed any strict hierarchical approach here, 
it's configurable by the user.

>    * OOM victim selection policy has always been subject to changes
>      and improvements.
> 

That's fine, but the selection policy introduced by any cgroup aware oom 
killer is being specified now and is pretty clear cut: compare the usage 
of cgroups equally based on certain criteria and choose the largest.  I 
don't see how that could be changed or improved once the end user starts 
using it, the heuristic has become a policy.  A single cgroup aware policy 
doesn't work, if you don't have localized, per-cgroup control you have 
Michal's /admins and /students example; if you don't have hierarchical, 
subtree control you have my example of users intentionally/unintentionally 
evading the selection logic based on using cgroups.  The policy needs to 
be defined for subtrees.  There hasn't been any objection to that, so 
introducing functionality that adds a completely unnecessary and broken 
mount option and forces users to configure cgroups in a certain manner 
that no longer exists with subtree control doesn't seem helpful.

> I don't see any blocker here.  The issue you're raising can and should
> be handled separately.
> 

It can't, because the current patchset locks the system into a single 
selection criteria that is unnecessary and the mount option would become a 
no-op after the policy per subtree becomes configurable by the user as 
part of the hierarchy itself.

> Here, whether a workload can survive being killed piece-wise or not is
> an inherent property of the workload and a pretty binary one at that.
> I'm not necessarily against changing it to take string inputs but
> don't see rationales for doing so yet.
> 

We don't need the unnecessary level in the cgroup hierarchy that enables 
memory.oom_group, as proposed, and serves no other purpose.  It's 
perfectly valid for subtrees to run user executors whereas a "killall" 
mechanism is valid for some workloads and not others.  We do not need 
ancestor cgroups locking that decision into place.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-22 22:34               ` David Rientjes
@ 2018-01-23 15:53                 ` Michal Hocko
  2018-01-23 22:22                   ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-23 15:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Mon 22-01-18 14:34:39, David Rientjes wrote:
> On Sat, 20 Jan 2018, Tejun Heo wrote:
[...]
> > I don't see any blocker here.  The issue you're raising can and should
> > be handled separately.
> > 
> 
> It can't, because the current patchset locks the system into a single 
> selection criteria that is unnecessary and the mount option would become a 
> no-op after the policy per subtree becomes configurable by the user as 
> part of the hierarchy itself.

This is simply not true! OOM victim selection has changed in the
past and will be always a subject to changes in future. Current
implementation doesn't provide any externally controlable selection
policy and therefore the default can be assumed. Whatever that default
means now or in future. The only contract added here is the kill full
memcg if selected and that can be implemented on _any_ selection policy.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-23 15:53                 ` Michal Hocko
@ 2018-01-23 22:22                   ` David Rientjes
  2018-01-24  8:20                     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-23 22:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue, 23 Jan 2018, Michal Hocko wrote:

> > It can't, because the current patchset locks the system into a single 
> > selection criteria that is unnecessary and the mount option would become a 
> > no-op after the policy per subtree becomes configurable by the user as 
> > part of the hierarchy itself.
> 
> This is simply not true! OOM victim selection has changed in the
> past and will be always a subject to changes in future. Current
> implementation doesn't provide any externally controlable selection
> policy and therefore the default can be assumed. Whatever that default
> means now or in future. The only contract added here is the kill full
> memcg if selected and that can be implemented on _any_ selection policy.
> 

The current implementation of memory.oom_group is based on top of a 
selection implementation that is broken in three ways I have listed for 
months:

 - allows users to intentionally/unintentionally evade the oom killer,
   requires not locking the selection implementation for the entire
   system, requires subtree control to prevent, makes a mount option
   obsolete, and breaks existing users who would use the implementation
   based on 4.16 if this were merged,

 - unfairly compares the root mem cgroup vs leaf mem cgroup such that
   users must structure their hierarchy only for 4.16 in such a way
   that _all_ processes are under hierarchical control and have no
   power to create sub cgroups because of the point above and
   completely breaks any user of oom_score_adj in a completely
   undocumented and unspecified way, such that fixing that breakage
   would also break any existing users who would use the implementation
   based on 4.16 if this were merged, and

 - does not allow userspace to protect important cgroups, which can be
   built on top.

I'm focused on fixing the breakage in the first two points since it 
affects the API and we don't want to switch that out from the user.  I 
have brought these points up repeatedly and everybody else has actively 
disengaged from development, so I'm proposing incremental changes that 
make the cgroup aware oom killer have a sustainable API and isn't useful 
only for a highly specialized usecase where everything is containerized, 
nobody can create subcgroups, and nobody uses oom_score_adj to break the 
root mem cgroup accounting.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-23 22:22                   ` David Rientjes
@ 2018-01-24  8:20                     ` Michal Hocko
  2018-01-24 21:44                       ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-24  8:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue 23-01-18 14:22:07, David Rientjes wrote:
> On Tue, 23 Jan 2018, Michal Hocko wrote:
> 
> > > It can't, because the current patchset locks the system into a single 
> > > selection criteria that is unnecessary and the mount option would become a 
> > > no-op after the policy per subtree becomes configurable by the user as 
> > > part of the hierarchy itself.
> > 
> > This is simply not true! OOM victim selection has changed in the
> > past and will be always a subject to changes in future. Current
> > implementation doesn't provide any externally controlable selection
> > policy and therefore the default can be assumed. Whatever that default
> > means now or in future. The only contract added here is the kill full
> > memcg if selected and that can be implemented on _any_ selection policy.
> > 
> 
> The current implementation of memory.oom_group is based on top of a 
> selection implementation that is broken in three ways I have listed for 
> months:

This doesn't lead to anywhere. You are not presenting any new arguments
and you are ignoring feedback you have received so far. We have tried
really hard. Considering different _independent_ people presented more or
less consistent view on these points I think you should deeply
reconsider how you take that feedback.

>  - allows users to intentionally/unintentionally evade the oom killer,
>    requires not locking the selection implementation for the entire
>    system, requires subtree control to prevent, makes a mount option
>    obsolete, and breaks existing users who would use the implementation
>    based on 4.16 if this were merged,
> 
>  - unfairly compares the root mem cgroup vs leaf mem cgroup such that
>    users must structure their hierarchy only for 4.16 in such a way
>    that _all_ processes are under hierarchical control and have no
>    power to create sub cgroups because of the point above and
>    completely breaks any user of oom_score_adj in a completely
>    undocumented and unspecified way, such that fixing that breakage
>    would also break any existing users who would use the implementation
>    based on 4.16 if this were merged, and
> 
>  - does not allow userspace to protect important cgroups, which can be
>    built on top.

For the last time. This all can be done on top of the proposed solution
without breaking the proposed user API. I am really _convinced_ that you
underestimate how complex it is to provide a sane selection policy API
and it will take _months_ to settle on something. Existing OOM APIs are
a sad story and I definitly do not want to repeat same mistakes from the
past.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-24  8:20                     ` Michal Hocko
@ 2018-01-24 21:44                       ` David Rientjes
       [not found]                         ` <alpine.DEB.2.10.1801241340310.24330-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-24 21:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed, 24 Jan 2018, Michal Hocko wrote:

> > The current implementation of memory.oom_group is based on top of a 
> > selection implementation that is broken in three ways I have listed for 
> > months:
> 
> This doesn't lead to anywhere. You are not presenting any new arguments
> and you are ignoring feedback you have received so far. We have tried
> really hard. Considering different _independent_ people presented more or
> less consistent view on these points I think you should deeply
> reconsider how you take that feedback.
> 

I've responded to each email providing useful feedback on this patchset.  
I agreed with Tejun about not embedding the oom mechanism into 
memory.oom_policy.  I was trying to avoid having two files in the mem 
cgroup v2 filesystem for oom policy and mechanism.  I agreed that 
delegating the mechanism to the workload would be useful in some cases.  
I've solicited feedback on any other opinions on how that can be done 
better, but it appears another tunable is the most convenient way of 
allowing this behavior to be specified.

As a result, this would remove patch 3/4 from the series.  Do you have any 
other feedback regarding the remainder of this patch series before I 
rebase it?

I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
a separate patchset to fix an issue where any user of oom_score_adj on a 
system that is not fully containerized gets very unusual, unexpected, and 
undocumented results.

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <alpine.DEB.2.10.1801241340310.24330-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
       [not found]                         ` <alpine.DEB.2.10.1801241340310.24330-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2018-01-24 22:08                           ` Andrew Morton
  2018-01-24 22:18                             ` Tejun Heo
       [not found]                             ` <20180124140805.b4eb437c6fe9dadb67a32e8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2018-01-25  8:05                           ` Michal Hocko
  1 sibling, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2018-01-24 22:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Tejun Heo, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Wed, 24 Jan 2018 13:44:02 -0800 (PST) David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:

> On Wed, 24 Jan 2018, Michal Hocko wrote:
> 
> > > The current implementation of memory.oom_group is based on top of a 
> > > selection implementation that is broken in three ways I have listed for 
> > > months:
> > 
> > This doesn't lead to anywhere. You are not presenting any new arguments
> > and you are ignoring feedback you have received so far. We have tried
> > really hard. Considering different _independent_ people presented more or
> > less consistent view on these points I think you should deeply
> > reconsider how you take that feedback.
> > 

Please let's remember that people/process issues are unrelated to the
technical desirability of a proposed change.  IOW, assertions along the
lines of "person X is being unreasonable" do little to affect a merge
decision.

> I've responded to each email providing useful feedback on this patchset.  
> I agreed with Tejun about not embedding the oom mechanism into 
> memory.oom_policy.  I was trying to avoid having two files in the mem 
> cgroup v2 filesystem for oom policy and mechanism.  I agreed that 
> delegating the mechanism to the workload would be useful in some cases.  
> I've solicited feedback on any other opinions on how that can be done 
> better, but it appears another tunable is the most convenient way of 
> allowing this behavior to be specified.
> 
> As a result, this would remove patch 3/4 from the series.  Do you have any 
> other feedback regarding the remainder of this patch series before I 
> rebase it?
> 
> I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
> a separate patchset to fix an issue where any user of oom_score_adj on a 
> system that is not fully containerized gets very unusual, unexpected, and 
> undocumented results.
> 

Can we please try to narrow the scope of this issue by concentrating on
the userspace interfaces?  David believes that the mount option and
memory.oom_group will disappear again in the near future, others
disagree.

What do we do about that?  For example, would it really be a big
problem to continue to support those interfaces in a future iteration
of this feature?  Or is it possible to omit them from this version of
the feature?  Or is it possible to modify them in some fashion so they
will be better compatible with a future iteration of this feature?

I'm OK with merging a probably-partial feature, expecting it to be
enhanced in the future.  What I have a problem with is merging user
interfaces which will be removed or altered in the future.  Please
solve this problem for me.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-24 22:08                           ` Andrew Morton
@ 2018-01-24 22:18                             ` Tejun Heo
       [not found]                             ` <20180124140805.b4eb437c6fe9dadb67a32e8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  1 sibling, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2018-01-24 22:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Michal Hocko, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

Hello, Andrew.

On Wed, Jan 24, 2018 at 02:08:05PM -0800, Andrew Morton wrote:
> Can we please try to narrow the scope of this issue by concentrating on
> the userspace interfaces?  David believes that the mount option and
> memory.oom_group will disappear again in the near future, others
> disagree.

I'm confident that the interface is gonna age fine.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <20180124140805.b4eb437c6fe9dadb67a32e8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
       [not found]                             ` <20180124140805.b4eb437c6fe9dadb67a32e8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2018-01-25  8:11                               ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-01-25  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Tejun Heo, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Wed 24-01-18 14:08:05, Andrew Morton wrote:
[...]
> Can we please try to narrow the scope of this issue by concentrating on
> the userspace interfaces?  David believes that the mount option and
> memory.oom_group will disappear again in the near future, others
> disagree.

Mount option is the cgroups maintainers call. And they seemed to be OK
with it.
I've tried to explain that oom_group is something that is semantically
sane and something we want to support because there are workloads which
simply do not work properly when only a subset is torn down. As such it
is not an API hazard AFAICS.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
       [not found]                         ` <alpine.DEB.2.10.1801241340310.24330-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  2018-01-24 22:08                           ` Andrew Morton
@ 2018-01-25  8:05                           ` Michal Hocko
  2018-01-25 23:27                             ` David Rientjes
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-25  8:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Wed 24-01-18 13:44:02, David Rientjes wrote:
> On Wed, 24 Jan 2018, Michal Hocko wrote:
> 
> > > The current implementation of memory.oom_group is based on top of a 
> > > selection implementation that is broken in three ways I have listed for 
> > > months:
> > 
> > This doesn't lead to anywhere. You are not presenting any new arguments
> > and you are ignoring feedback you have received so far. We have tried
> > really hard. Considering different _independent_ people presented more or
> > less consistent view on these points I think you should deeply
> > reconsider how you take that feedback.
> > 
> 
> I've responded to each email providing useful feedback on this patchset.  
> I agreed with Tejun about not embedding the oom mechanism into 
> memory.oom_policy.  I was trying to avoid having two files in the mem 
> cgroup v2 filesystem for oom policy and mechanism.  I agreed that 
> delegating the mechanism to the workload would be useful in some cases.  
> I've solicited feedback on any other opinions on how that can be done 
> better, but it appears another tunable is the most convenient way of 
> allowing this behavior to be specified.

It is not about convenince. Those two things are simply orthogonal. And
that's what I've been saying for quite some time. Dunno, why it has been
ignored previously.

> As a result, this would remove patch 3/4 from the series.  Do you have any 
> other feedback regarding the remainder of this patch series before I 
> rebase it?

Yes, and I have provided it already. What you are proposing is
incomplete at best and needs much better consideration and much more
time to settle.

> I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
> a separate patchset to fix an issue where any user of oom_score_adj on a 
> system that is not fully containerized gets very unusual, unexpected, and 
> undocumented results.

I will not oppose but as it has been mentioned several times, this is by
no means a blocker issue. It can be added on top.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-25  8:05                           ` Michal Hocko
@ 2018-01-25 23:27                             ` David Rientjes
  2018-01-26 10:07                               ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-25 23:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu, 25 Jan 2018, Michal Hocko wrote:

> > As a result, this would remove patch 3/4 from the series.  Do you have any 
> > other feedback regarding the remainder of this patch series before I 
> > rebase it?
> 
> Yes, and I have provided it already. What you are proposing is
> incomplete at best and needs much better consideration and much more
> time to settle.
> 

Could you elaborate on why specifying the oom policy for the entire 
hierarchy as part of the root mem cgroup and also for individual subtrees 
is incomplete?  It allows admins to specify and delegate policy decisions 
to subtrees owners as appropriate.  It addresses your concern in the 
/admins and /students example.  It addresses my concern about evading the 
selection criteria simply by creating child cgroups.  It appears to be a 
win-win.  What is incomplete or are you concerned about?

> > I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
> > a separate patchset to fix an issue where any user of oom_score_adj on a 
> > system that is not fully containerized gets very unusual, unexpected, and 
> > undocumented results.
> 
> I will not oppose but as it has been mentioned several times, this is by
> no means a blocker issue. It can be added on top.
> 

The current implementation is only useful for fully containerized systems 
where no processes are attached to the root mem cgroup.  Anything in the 
root mem cgroup is judged by different criteria and if they use 
/proc/pid/oom_score_adj the entire heuristic breaks down.  That's because 
per-process usage and oom_score_adj are only relevant for the root mem 
cgroup and irrelevant when attached to a leaf.  Because of that, users are 
affected by the design decision and will organize their hierarchies as 
approrpiate to avoid it.  Users who only want to use cgroups for a subset 
of processes but still treat those processes as indivisible logical units 
when attached to cgroups find that it is simply not possible.

I'm focused solely on fixing the three main issues that this 
implementation causes.  One of them, userspace influence to protect 
important cgroups, can be added on top.  The other two, evading the 
selection criteria and unfair comparison of root vs leaf, are shortcomings 
in the design that I believe should be addressed before it's merged to 
avoid changing the API later.  I'm in no rush to ask for the cgroup aware 
oom killer to be merged if it's incomplete and must be changed for 
usecases that are not highly specialized (fully containerized and no use 
of oom_score_adj for any process).  I am actively engaged in fixing it, 
however, so that it becomes a candidate for merge.  Your feedback is 
useful with regard to those fixes, but daily emails on how we must merge 
the current implementation now are not providing value, at least to me.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-25 23:27                             ` David Rientjes
@ 2018-01-26 10:07                               ` Michal Hocko
  2018-01-26 22:33                                 ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-26 10:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu 25-01-18 15:27:29, David Rientjes wrote:
> On Thu, 25 Jan 2018, Michal Hocko wrote:
> 
> > > As a result, this would remove patch 3/4 from the series.  Do you have any 
> > > other feedback regarding the remainder of this patch series before I 
> > > rebase it?
> > 
> > Yes, and I have provided it already. What you are proposing is
> > incomplete at best and needs much better consideration and much more
> > time to settle.
> > 
> 
> Could you elaborate on why specifying the oom policy for the entire 
> hierarchy as part of the root mem cgroup and also for individual subtrees 
> is incomplete?  It allows admins to specify and delegate policy decisions 
> to subtrees owners as appropriate.  It addresses your concern in the 
> /admins and /students example.  It addresses my concern about evading the 
> selection criteria simply by creating child cgroups.  It appears to be a 
> win-win.  What is incomplete or are you concerned about?

I will get back to this later. I am really busy these days. This is not
a trivial thing at all.

> > > I will address the unfair root mem cgroup vs leaf mem cgroup comparison in 
> > > a separate patchset to fix an issue where any user of oom_score_adj on a 
> > > system that is not fully containerized gets very unusual, unexpected, and 
> > > undocumented results.
> > 
> > I will not oppose but as it has been mentioned several times, this is by
> > no means a blocker issue. It can be added on top.
> > 
> 
> The current implementation is only useful for fully containerized systems 
> where no processes are attached to the root mem cgroup.  Anything in the 
> root mem cgroup is judged by different criteria and if they use 
> /proc/pid/oom_score_adj the entire heuristic breaks down.

Most usecases I've ever seen usually use oom_score_adj only to disable
the oom killer for a particular service. In those case the current
heuristic works reasonably well.

I am not _aware_ of any usecase which actively uses oom_score_adj to
actively control the oom selection decisions and it would _require_ the
memcg aware oom killer. Maybe there are some but then we need to do much
more than to "fix" the root memcg comparison. We would need a complete
memcg aware prioritization as well. It simply doesn't make much sense
to tune oom selection only on subset of tasks ignoring the rest of the
system workload which is likely to comprise the majority of the resource
consumers.

We have already discussed that something like that will emerge sooner or
later but I am not convinced we need it _now_. It is perfectly natural
to start with a simple model without any priorities at all.

> That's because per-process usage and oom_score_adj are only relevant  
> for the root mem cgroup and irrelevant when attached to a leaf.       

This is the simplest implementation. You could go and ignore
oom_score_adj on root tasks. Would it be much better? Should you ignore
oom disabled tasks? Should you consider kernel memory footprint of those
tasks? Maybe we will realize that we simply have to account root memcg
like any other memcg.  We used to do that but it has been reverted due
to performance footprint. There are more questions to answer I believe
but the most important one is whether actually any _real_ user cares.

I can see your arguments and they are true. You can construct setups
where the current memcg oom heuristic works sub-optimally. The same has
been the case for the OOM killer in general. The OOM killer has always
been just a heuristic and there always be somebody complaining. This
doesn't mean we should just remove it because it works reasonably well
for most users.

> Because of that, users are 
> affected by the design decision and will organize their hierarchies as 
> approrpiate to avoid it.  Users who only want to use cgroups for a subset 
> of processes but still treat those processes as indivisible logical units 
> when attached to cgroups find that it is simply not possible.

Nobody enforces the memcg oom selection as presented here for those
users. They have to explicitly _opt-in_. If the new heuristic doesn't
work for them we will hear about that most likely. I am really skeptical
that oom_score_adj can be reused for memcg aware oom selection.

> I'm focused solely on fixing the three main issues that this 
> implementation causes.  One of them, userspace influence to protect 
> important cgroups, can be added on top.  The other two, evading the 
> selection criteria and unfair comparison of root vs leaf, are shortcomings 
> in the design that I believe should be addressed before it's merged to 
> avoid changing the API later.

I believe I have explained why the root memcg comparison is an
implementation detail. The subtree delegation is something that we will
have to care eventually. But I do not see it as an immediate thread.
Same as I do not see the current OOM killer flawed because there are
ways to evade from it. Moreover the delegation is much less of a problem
because creating subgroups is usually a privileged operation and it
requires quite some care already. This is much a higher bar than a
simple fork and hide games in the global case.

> I'm in no rush to ask for the cgroup aware 
> oom killer to be merged if it's incomplete and must be changed for 
> usecases that are not highly specialized (fully containerized and no use 
> of oom_score_adj for any process).

You might be not in a rush but it feels rather strange to block a
feature other people want to use.

> I am actively engaged in fixing it, 
> however, so that it becomes a candidate for merge.

I do not think anything you have proposed so far is even close to
mergeable state. I think you are simply oversimplifying this. We have
spent many months discussing different aspects of the memcg aware OOM
killer. The result is a compromise that should work reasonably well
for the targeted usecases and it doesn't bring unsustainable APIs that
will get carved into stone.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
  2018-01-26 10:07                               ` Michal Hocko
@ 2018-01-26 22:33                                 ` David Rientjes
  0 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-26 22:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, 26 Jan 2018, Michal Hocko wrote:

> > Could you elaborate on why specifying the oom policy for the entire 
> > hierarchy as part of the root mem cgroup and also for individual subtrees 
> > is incomplete?  It allows admins to specify and delegate policy decisions 
> > to subtrees owners as appropriate.  It addresses your concern in the 
> > /admins and /students example.  It addresses my concern about evading the 
> > selection criteria simply by creating child cgroups.  It appears to be a 
> > win-win.  What is incomplete or are you concerned about?
> 
> I will get back to this later. I am really busy these days. This is not
> a trivial thing at all.
> 

Please follow-up in the v2 patchset when you have time.

> Most usecases I've ever seen usually use oom_score_adj only to disable
> the oom killer for a particular service. In those case the current
> heuristic works reasonably well.
> 

I'm not familiar with the workloads you have worked with that use 
oom_score_adj.  We use it to prefer a subset of processes first and a 
subset of processes last.  I don't expect this to be a highly specialized 
usecase, it's the purpose of the tunable.

The fact remains that oom_score_adj tuning is only effective with the 
current implementation when attached to the root mem cgroup in an 
undocumented way, the preference or bias immediately changes as soon as it 
is attached to a cgroup, even if it's the only non root mem cgroup on the 
system.

> > That's because per-process usage and oom_score_adj are only relevant  
> > for the root mem cgroup and irrelevant when attached to a leaf.       
> 
> This is the simplest implementation. You could go and ignore
> oom_score_adj on root tasks. Would it be much better? Should you ignore
> oom disabled tasks? Should you consider kernel memory footprint of those
> tasks? Maybe we will realize that we simply have to account root memcg
> like any other memcg.  We used to do that but it has been reverted due
> to performance footprint. There are more questions to answer I believe
> but the most important one is whether actually any _real_ user cares.
> 

The goal is to compare the root mem cgroup and leaf mem cgroups equally.  
That is specifically listed as a goal for the cgroup aware oom killer and 
it's very obvious it's not implemented correctly particularly because of 
this bias but also because sum of oom_badness() != anon + unevictable + 
unreclaimable slab, even discounting oom_score_adj.  The amount of slab is 
only considered for leaf mem cgroups as well.

What I've proposed in the past was to use the global state of anon, 
unevictable, and unreclaimable slab to fairly account the root mem cgroup 
without bias from oom_score_adj for comparing cgroup usage.  oom_score_adj 
is valid when choosing the process from the root mem cgroup to kill, not 
when comparing against other cgroups since leaf cgroups discount it.

> I can see your arguments and they are true. You can construct setups
> where the current memcg oom heuristic works sub-optimally. The same has
> been the case for the OOM killer in general. The OOM killer has always
> been just a heuristic and there always be somebody complaining. This
> doesn't mean we should just remove it because it works reasonably well
> for most users.
> 

It's not most users, it's only for configurations that are fully 
containerized where there are no user processes attached to the root mem 
cgroup and nobody uses oom_score_adj like it is defined to be used, and 
it's undocumented so they don't even know that fact without looking at the 
kernel implementation.

> > Because of that, users are 
> > affected by the design decision and will organize their hierarchies as 
> > approrpiate to avoid it.  Users who only want to use cgroups for a subset 
> > of processes but still treat those processes as indivisible logical units 
> > when attached to cgroups find that it is simply not possible.
> 
> Nobody enforces the memcg oom selection as presented here for those
> users. They have to explicitly _opt-in_. If the new heuristic doesn't
> work for them we will hear about that most likely. I am really skeptical
> that oom_score_adj can be reused for memcg aware oom selection.
> 

oom_score_adj is value for choosing a process attached to a mem cgroup to 
kill, absent memory.oom_group being set.  It is not valid to for comparing 
cgroups, obviously.  That's why it shouldn't be used for the root mem 
cgroup either, which the current implementation does, when it is 
documented falsely to be a fair comparison.

> I do not think anything you have proposed so far is even close to
> mergeable state. I think you are simply oversimplifying this. We have
> spent many months discussing different aspects of the memcg aware OOM
> killer. The result is a compromise that should work reasonably well
> for the targeted usecases and it doesn't bring unsustainable APIs that
> will get carved into stone.

If you don't have time to review the patchset to show that it's not 
mergeable, I'm not sure that I have anything to work with.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm 4/4] mm, memcg: add hierarchical usage oom policy
  2018-01-17  2:14 [patch -mm 0/4] mm, memcg: introduce oom policies David Rientjes
                   ` (2 preceding siblings ...)
       [not found] ` <alpine.DEB.2.10.1801161812550.28198-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2018-01-17  2:15 ` David Rientjes
  2018-01-17 11:46 ` [patch -mm 0/4] mm, memcg: introduce oom policies Roman Gushchin
  2018-01-25 23:53 ` [patch -mm v2 0/3] " David Rientjes
  5 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-17  2:15 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.

In this regard, users who do not distribute their processes over a set of
subcontainers for mem cgroup control, statistics, or other controllers
are unfairly penalized.

This adds an oom policy, "tree", that accounts for hierarchical usage
when comparing cgroups and the cgroup aware oom killer is enabled by an
ancestor.  This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage.  In other words, they can longer evade the oom killer
by using other controllers or subcontainers.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroup-v2.txt | 12 ++++++++++--
 include/linux/memcontrol.h  |  9 +++++++--
 mm/memcontrol.c             | 23 +++++++++++++++--------
 3 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1048,6 +1048,11 @@ PAGE_SIZE multiple when read back.
 	memory consumers; that is, they will compare mem cgroup usage rather
 	than process memory footprint.  See the "OOM Killer" section.
 
+	If "tree", the OOM killer will compare mem cgroups and its subtree
+	as indivisible memory consumers when selecting a hierarchy.  This
+	policy cannot be set on the root mem cgroup.  See the "OOM Killer"
+	section.
+
 	If "all", the OOM killer will compare mem cgroups and its subtree
 	as indivisible memory consumers and kill all processes attached to
 	the mem cgroup and its subtree.  This policy cannot be set on the
@@ -1275,6 +1280,9 @@ There are currently three available oom policies:
  - "cgroup": choose the cgroup with the largest memory footprint from the
    subtree as an OOM victim and kill at least one process.
 
+ - "tree": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill at least one process.
+
  - "all": choose the cgroup with the largest memory footprint considering
    itself and its subtree and kill all processes attached (cannot be set on
    the root mem cgroup).
@@ -1292,8 +1300,8 @@ Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
 If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and setting a policy of "all"
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+or "all" on ancestor layer.
 
 
 IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -70,8 +70,13 @@ enum memcg_oom_policy {
 	 */
 	MEMCG_OOM_POLICY_CGROUP,
 	/*
-	 * Same as MEMCG_OOM_POLICY_CGROUP, but all eligible processes attached
-	 * to the cgroup and subtree should be oom killed
+	 * Tree cgroup usage for all descendant memcg groups, treating each mem
+	 * cgroup and its subtree as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_TREE,
+	/*
+	 * Same as MEMCG_OOM_POLICY_TREE, but all eligible processes are also
+	 * oom killed
 	 */
 	MEMCG_OOM_POLICY_ALL,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2715,11 +2715,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	oc->chosen_points = 0;
 
 	/*
-	 * If OOM is memcg-wide, and the oom policy is "all", all processes
-	 * attached to the memcg and subtree should be killed.
-	 * So, we mark the memcg as a victim.
+	 * If OOM is memcg-wide, and the oom policy is "tree" or "all", this
+	 * is the selected memcg.
 	 */
-	if (oc->memcg && mem_cgroup_oom_policy_all(oc->memcg)) {
+	if (oc->memcg && (oc->memcg->oom_policy == MEMCG_OOM_POLICY_TREE ||
+			  oc->memcg->oom_policy == MEMCG_OOM_POLICY_ALL)) {
 		oc->chosen_memcg = oc->memcg;
 		css_get(&oc->chosen_memcg->css);
 		return;
@@ -2728,8 +2728,8 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
-	 * Cgroups with oom policy of "all" accumulate the score of descendant
-	 * leaf memory cgroups.
+	 * Cgroups with oom policy of "tree" or "all" accumulate the score of
+	 * descendant leaf memory cgroups.
 	 */
 	rcu_read_lock();
 	for_each_mem_cgroup_tree(iter, root) {
@@ -2737,10 +2737,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 
 		/*
 		 * We don't consider non-leaf memory cgroups without the oom
-		 * policy of "all" as oom victims.
+		 * policy of "tree" or "all" as oom victims.
 		 */
 		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
-		    !mem_cgroup_oom_policy_all(iter))
+		    iter->oom_policy != MEMCG_OOM_POLICY_TREE &&
+		    iter->oom_policy != MEMCG_OOM_POLICY_ALL)
 			continue;
 
 		/*
@@ -5511,6 +5512,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	case MEMCG_OOM_POLICY_CGROUP:
 		seq_puts(m, "cgroup\n");
 		break;
+	case MEMCG_OOM_POLICY_TREE:
+		seq_puts(m, "tree\n");
+		break;
 	case MEMCG_OOM_POLICY_ALL:
 		seq_puts(m, "all\n");
 		break;
@@ -5532,6 +5536,9 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
 	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+	else if (memcg != root_mem_cgroup &&
+			!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
 	else if (memcg != root_mem_cgroup &&
 			!memcmp("all", buf, min(sizeof("all")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_ALL;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 0/4] mm, memcg: introduce oom policies
  2018-01-17  2:14 [patch -mm 0/4] mm, memcg: introduce oom policies David Rientjes
                   ` (3 preceding siblings ...)
  2018-01-17  2:15 ` [patch -mm 4/4] mm, memcg: add hierarchical usage oom policy David Rientjes
@ 2018-01-17 11:46 ` Roman Gushchin
  2018-01-17 22:31   ` David Rientjes
  2018-01-25 23:53 ` [patch -mm v2 0/3] " David Rientjes
  5 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-01-17 11:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue, Jan 16, 2018 at 06:14:58PM -0800, David Rientjes wrote:
> There are three significant concerns about the cgroup aware oom killer as
> it is implemented in -mm:
> 
>  (1) allows users to evade the oom killer by creating subcontainers or
>      using other controllers since scoring is done per cgroup and not
>      hierarchically,
> 
>  (2) does not allow the user to influence the decisionmaking, such that
>      important subtrees cannot be preferred or biased, and
> 
>  (3) unfairly compares the root mem cgroup using completely different
>      criteria than leaf mem cgroups and allows wildly inaccurate results
>      if oom_score_adj is used.
> 
> This patchset aims to fix (1) completely and, by doing so, introduces a
> completely extensible user interface that can be expanded in the future.
> 
> It eliminates the mount option for the cgroup aware oom killer entirely
> since it is now enabled through the root mem cgroup's oom policy.
> 
> It eliminates a pointless tunable, memory.oom_group, that unnecessarily
> pollutes the mem cgroup v2 filesystem and is invalid when cgroup v2 is
> mounted with the "groupoom" option.

You're introducing a new oom_policy knob, which has two separate sets
of possible values for the root and non-root cgroups. I don't think
it aligns with the existing cgroup v2 design.

For the root cgroup it works exactly as mount option, and both "none"
and "cgroup" values have no meaning outside of the root cgroup. We can
discuss if a knob on root cgroup is better than a mount option, or not
(I don't think so), but it has nothing to do with oom policy as you
define it for non-root cgroups.

For non-root cgroups you're introducing "all" and "tree", and the _only_
difference is that in the "all" mode all processes will be killed, rather
than the biggest in the "tree". I find these names confusing, in reality
it's more "evaluate together and kill all" and "evaluate together and
kill one".

So, it's not really the fully hierarchical approach, which I thought,
you were arguing for. You can easily do the same with adding the third
value to the memory.groupoom knob, as I've suggested earlier (say, "disable,
"kill" and "evaluate"), and will be much less confusing.

Thanks!

Roman

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm 0/4] mm, memcg: introduce oom policies
  2018-01-17 11:46 ` [patch -mm 0/4] mm, memcg: introduce oom policies Roman Gushchin
@ 2018-01-17 22:31   ` David Rientjes
  0 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-17 22:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed, 17 Jan 2018, Roman Gushchin wrote:

> You're introducing a new oom_policy knob, which has two separate sets
> of possible values for the root and non-root cgroups. I don't think
> it aligns with the existing cgroup v2 design.
> 

The root mem cgroup can use "none" or "cgroup" to either disable or 
enable, respectively, the cgroup aware oom killer.  These are available to 
non root mem cgroups as well, with the addition of hierarchical usage 
comparison to prevent the ability to completely evade the oom killer by 
the user by creating sub cgroups.  Just as we have files that are made 
available on the root cgroup and not others, I think it's fine to allow 
only certain policies on the root mem cgroup.  As I wrote to Tejun, I'm 
fine with the policy being separated from the mechanism.  That can be 
resolved in that email thread, but the overall point of this patchset is 
to allow hierarchical comparison on some subtrees and not on others.  We 
can avoid the mount option in the same manner.

> For the root cgroup it works exactly as mount option, and both "none"
> and "cgroup" values have no meaning outside of the root cgroup. We can
> discuss if a knob on root cgroup is better than a mount option, or not
> (I don't think so), but it has nothing to do with oom policy as you
> define it for non-root cgroups.
> 

It certainly does have value outside of the root cgroup: for system oom 
conditions it is possible to choose the largest process, largest single 
cgroup, or largest subtree to kill from.  For memcg oom conditions it's 
possible for a consumer in a subtree to define whether it wants the 
largest memory hogging process to be oom killed or the largest of its 
child sub cgroups.  This would be needed for a job scheduler in its own 
subtree, for example, that creates its own subtrees to limit jobs 
scheduled by individual users who have the power over their subtree but 
not their limit.  This is a real-world example.  Michal also provided his 
own example concerning top-level /admins and /students.  They want to use 
"cgroup" themselves and then /students/roman would be forced to "tree".

Keep in mind that this patchset alone makes the interface extensible and 
addresses one of my big three concerns, but the comparison of the root mem 
cgroup compared to other individual cgroups and cgroups maintaining a 
subtree needs to be fixed separately so that we don't completely evade all 
of these oom policies by using /proc/pid/oom_score_adj in the root mem 
cgroup.  That's a separate issue that needs to be addressed.  My last 
concern, about userspace influence, can probably be addressed after this 
is merged by yet another tunable since it's vital that important cgroups 
can be protected.  It does make sense for an important cgroup to be able 
to use 50% of memory without being oom killed, and that's impossible if 
cgroup v2 has been mounted with your mount option.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm v2 0/3] mm, memcg: introduce oom policies
  2018-01-17  2:14 [patch -mm 0/4] mm, memcg: introduce oom policies David Rientjes
                   ` (4 preceding siblings ...)
  2018-01-17 11:46 ` [patch -mm 0/4] mm, memcg: introduce oom policies Roman Gushchin
@ 2018-01-25 23:53 ` David Rientjes
  2018-01-25 23:53   ` [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
                     ` (2 more replies)
  5 siblings, 3 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-25 23:53 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:

 (1) allows users to evade the oom killer by creating subcontainers or
     using other controllers since scoring is done per cgroup and not
     hierarchically,

 (2) does not allow the user to influence the decisionmaking, such that
     important subtrees cannot be preferred or biased, and

 (3) unfairly compares the root mem cgroup using completely different
     criteria than leaf mem cgroups and allows wildly inaccurate results
     if oom_score_adj is used.

This patchset aims to fix (1) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.

It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.

It preserves memory.oom_group behavior.
---
 Applied on top of -mm.

 Documentation/cgroup-v2.txt | 64 ++++++++++++++++++++++++++++-----------------
 include/linux/cgroup-defs.h |  5 ----
 include/linux/memcontrol.h  | 21 +++++++++++++++
 kernel/cgroup/cgroup.c      | 13 +--------
 mm/memcontrol.c             | 64 ++++++++++++++++++++++++++++++++++++---------
 5 files changed, 114 insertions(+), 53 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-25 23:53 ` [patch -mm v2 0/3] " David Rientjes
@ 2018-01-25 23:53   ` David Rientjes
  2018-01-26 17:15     ` Michal Hocko
  2018-01-25 23:53   ` [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable David Rientjes
  2018-01-25 23:53   ` [patch -mm v2 3/3] mm, memcg: add hierarchical usage oom policy David Rientjes
  2 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-25 23:53 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

The cgroup aware oom killer is needlessly declared for the entire system
by a mount option.  It's unnecessary to force the system into a single
oom policy: either cgroup aware, or the traditional process aware.

This patch introduces a memory.oom_policy tunable for all mem cgroups.
It is currently a no-op: it can only be set to "none", which is its
default policy.  It will be expanded in the next patch to define cgroup
aware oom killer behavior.

This is an extensible interface that can be used to define cgroup aware
assessment of mem cgroup subtrees or the traditional process aware
assessment.

Another benefit of such an approach is that an admin can lock in a
certain policy for the system or for a mem cgroup subtree and can
delegate the policy decision to the user to determine if the kill should
originate from a subcontainer, as indivisible memory consumers
themselves, or selection should be done per process.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroup-v2.txt |  9 +++++++++
 include/linux/memcontrol.h  | 11 +++++++++++
 mm/memcontrol.c             | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1065,6 +1065,15 @@ PAGE_SIZE multiple when read back.
 	If cgroup-aware OOM killer is not enabled, ENOTSUPP error
 	is returned on attempt to access the file.
 
+  memory.oom_policy
+
+	A read-write single string file which exists on all cgroups.  The
+	default value is "none".
+
+	If "none", the OOM killer will use the default policy to choose a
+	victim; that is, it will choose the single process with the largest
+	memory footprint.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -58,6 +58,14 @@ enum memcg_event_item {
 	MEMCG_NR_EVENTS,
 };
 
+enum memcg_oom_policy {
+	/*
+	 * No special oom policy, process selection is determined by
+	 * oom_badness()
+	 */
+	MEMCG_OOM_POLICY_NONE,
+};
+
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
 	int priority;
@@ -203,6 +211,9 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* OOM policy for this subtree */
+	enum memcg_oom_policy oom_policy;
+
 	/*
 	 * Treat the sub-tree as an indivisible memory consumer,
 	 * kill all belonging tasks if the memory cgroup selected
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4417,6 +4417,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
+		memcg->oom_policy = parent->oom_policy;
 	}
 	if (parent && parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
@@ -5534,6 +5535,34 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_oom_policy_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
+
+	switch (policy) {
+	case MEMCG_OOM_POLICY_NONE:
+	default:
+		seq_puts(m, "none\n");
+	};
+	return 0;
+}
+
+static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
+				       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	ssize_t ret = nbytes;
+
+	buf = strstrip(buf);
+	if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+	else
+		ret = -EINVAL;
+
+	return ret;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5575,6 +5604,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "oom_policy",
+		.flags = CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_policy_show,
+		.write = memory_oom_policy_write,
+	},
 	{ }	/* terminate */
 };
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-25 23:53   ` [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
@ 2018-01-26 17:15     ` Michal Hocko
  2018-01-29 22:38       ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-26 17:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu 25-01-18 15:53:45, David Rientjes wrote:
> The cgroup aware oom killer is needlessly declared for the entire system
> by a mount option.  It's unnecessary to force the system into a single
> oom policy: either cgroup aware, or the traditional process aware.
> 
> This patch introduces a memory.oom_policy tunable for all mem cgroups.
> It is currently a no-op: it can only be set to "none", which is its
> default policy.  It will be expanded in the next patch to define cgroup
> aware oom killer behavior.
> 
> This is an extensible interface that can be used to define cgroup aware
> assessment of mem cgroup subtrees or the traditional process aware
> assessment.
> 

So what is the actual semantic and scope of this policy. Does it apply
only down the hierarchy. Also how do you compare cgroups with different
policies? Let's say you have
          root
         / |  \
        A  B   C
       / \    / \
      D   E  F   G

Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1

Now we have the global OOM killer to choose a victim. From a quick
glance over those patches, it seems that we will be comparing only
tasks because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. A, B and C
policies are ignored. Moreover If I select any of B's tasks then I will
happily kill it breaking the expectation that the whole memcg will go
away. Weird, don't you think? Or did I misunderstand?

So let's assume that root: cgroup. Then we are finally comparing
cgroups. D, E, B, C. Of those D, E and F do not have any
policy. Do they inherit their policy from the parent? If they don't then
we should be comparing their tasks separately, no? The code disagrees
because once we are in the cgroup mode, we do not care about separate
tasks.

Let's say we choose C because it has the largest cumulative consumption.
It is not oom_group so it will select a task from F, G. Again you are
breaking oom_group policy of G if you kill a single task. So you would
have to be recursive here. That sounds fixable though. Just be
recursive.

Then you say

> Another benefit of such an approach is that an admin can lock in a
> certain policy for the system or for a mem cgroup subtree and can
> delegate the policy decision to the user to determine if the kill should
> originate from a subcontainer, as indivisible memory consumers
> themselves, or selection should be done per process.

And the code indeed doesn't check oom_policy on each level of the
hierarchy, unless I am missing something. So the subgroup is simply
locked in to the oom_policy parent has chosen. That is not the case for
the tree policy.

So look how we are comparing cumulative groups without policy with
groups with policy with subtrees. Either I have grossly misunderstood
something or this is massively inconsistent and it doesn't make much
sense to me. Root memcg without cgroup policy will simply turn off the
whole thing for the global OOM case. So you really need to enable it
there but then it is not really clear how to configure lower levels.

From the above it seems that you are more interested in memcg OOMs and
want to give different hierarchies different policies but you quickly
hit the similar inconsistencies there as well.

I am not sure how extensible this is actually. How do we place
priorities on top?

> Signed-off-by: David Rientjes <rientjes@google.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-26 17:15     ` Michal Hocko
@ 2018-01-29 22:38       ` David Rientjes
       [not found]         ` <alpine.DEB.2.10.1801291418150.29670-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-29 22:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, 26 Jan 2018, Michal Hocko wrote:

> > The cgroup aware oom killer is needlessly declared for the entire system
> > by a mount option.  It's unnecessary to force the system into a single
> > oom policy: either cgroup aware, or the traditional process aware.
> > 
> > This patch introduces a memory.oom_policy tunable for all mem cgroups.
> > It is currently a no-op: it can only be set to "none", which is its
> > default policy.  It will be expanded in the next patch to define cgroup
> > aware oom killer behavior.
> > 
> > This is an extensible interface that can be used to define cgroup aware
> > assessment of mem cgroup subtrees or the traditional process aware
> > assessment.
> > 
> 
> So what is the actual semantic and scope of this policy. Does it apply
> only down the hierarchy. Also how do you compare cgroups with different
> policies? Let's say you have
>           root
>          / |  \
>         A  B   C
>        / \    / \
>       D   E  F   G
> 
> Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> 

At each level of the hierarchy, memory.oom_policy compares immediate 
children, it's the only way that an admin can lock in a specific oom 
policy like "tree" and then delegate the subtree to the user.  If you've 
configured it as above, comparing A and C should be the same based on the 
cumulative usage of their child mem cgroups.

The policy for B hasn't been specified, but since it does not have any 
children "cgroup" and "tree" should be the same.

> Now we have the global OOM killer to choose a victim. From a quick
> glance over those patches, it seems that we will be comparing only
> tasks because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. A, B and C
> policies are ignored.

Right, a policy of "none" reverts its subtree back to per-process 
comparison if you are either not using the cgroup aware oom killer or your 
subtree is not using the cgroup aware oom killer.

> Moreover If I select any of B's tasks then I will
> happily kill it breaking the expectation that the whole memcg will go
> away. Weird, don't you think? Or did I misunderstand?
> 

It's just as weird as the behavior of memory.oom_group today without using 
the mount option :)  In that case, mem_cgroup_select_oom_victim() always 
returns false and the value of memory.oom_group is ignored.  I agree that 
it's weird in -mm and there's nothing preventing us from separating 
memory.oom_group from the cgroup aware oom killer and allowing it to be 
set regardless of a selection change.  If memory.oom_group is set, and the 
kill originates from that mem cgroup, kill all processes attached to it 
and its subtree.

This is a criticism of the current implementation in -mm, however, my 
extension only respects its weirdness.

> So let's assume that root: cgroup. Then we are finally comparing
> cgroups. D, E, B, C. Of those D, E and F do not have any
> policy. Do they inherit their policy from the parent? If they don't then
> we should be comparing their tasks separately, no? The code disagrees
> because once we are in the cgroup mode, we do not care about separate
> tasks.
> 

No, perhaps I wasn't clear in the documentation: the policy at each level 
of the hierarchy is specified by memory.oom_policy and compares its 
immediate children with that policy.  So the per-cgroup usage of A, B, and 
C and compared regardless of A, B, and C's own oom policies.

> Let's say we choose C because it has the largest cumulative consumption.
> It is not oom_group so it will select a task from F, G. Again you are
> breaking oom_group policy of G if you kill a single task. So you would
> have to be recursive here. That sounds fixable though. Just be
> recursive.
> 

I fully agree, but that's (another) implementation detail of what is in 
-mm that isn't specified.  I think where you're going is the complete 
separation of mem cgroup selection from memory.oom_group.  I agree, and we 
can fix that.  memory.oom_group also shouldn't depend on any mount option, 
it can be set or unset depending on the properties of the workload.

> Then you say
> 
> > Another benefit of such an approach is that an admin can lock in a
> > certain policy for the system or for a mem cgroup subtree and can
> > delegate the policy decision to the user to determine if the kill should
> > originate from a subcontainer, as indivisible memory consumers
> > themselves, or selection should be done per process.
> 
> And the code indeed doesn't check oom_policy on each level of the
> hierarchy, unless I am missing something. So the subgroup is simply
> locked in to the oom_policy parent has chosen. That is not the case for
> the tree policy.
> 
> So look how we are comparing cumulative groups without policy with
> groups with policy with subtrees. Either I have grossly misunderstood
> something or this is massively inconsistent and it doesn't make much
> sense to me. Root memcg without cgroup policy will simply turn off the
> whole thing for the global OOM case. So you really need to enable it
> there but then it is not really clear how to configure lower levels.
> 
> From the above it seems that you are more interested in memcg OOMs and
> want to give different hierarchies different policies but you quickly
> hit the similar inconsistencies there as well.
> 
> I am not sure how extensible this is actually. How do we place
> priorities on top?
> 

If you're referring to strict priorities where one cgroup can be preferred 
or biased against regardless of usage, that would be an extension with yet 
another tunable.  Userspace influence over the selection is not addressed 
by this patchset, nor is the unfair comparison of the root mem cgroup with 
leaf mem cgroups.  My suggestion previously was a memory.oom_value 
tunable, which is configured depending on its parent's memory.oom_policy.  
If "cgroup" or "tree" it behaves as oom_score_adj-type behavior, i.e. it's 
an adjustment on the usage.  This way, a subtree's usage can have a 
certain amount of memory discounted, for example, because it is supposed 
to use more than 50% of memory.  If a new "priority" memory.oom_policy 
were implemented, which would be trivial, comparison between child cgroups 
would be as simple as comparing memory.oom_value integers.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <alpine.DEB.2.10.1801291418150.29670-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]

* Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
       [not found]         ` <alpine.DEB.2.10.1801291418150.29670-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2018-01-30  8:50           ` Michal Hocko
  2018-01-30 22:38             ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-30  8:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Mon 29-01-18 14:38:02, David Rientjes wrote:
> On Fri, 26 Jan 2018, Michal Hocko wrote:
> 
> > > The cgroup aware oom killer is needlessly declared for the entire system
> > > by a mount option.  It's unnecessary to force the system into a single
> > > oom policy: either cgroup aware, or the traditional process aware.
> > > 
> > > This patch introduces a memory.oom_policy tunable for all mem cgroups.
> > > It is currently a no-op: it can only be set to "none", which is its
> > > default policy.  It will be expanded in the next patch to define cgroup
> > > aware oom killer behavior.
> > > 
> > > This is an extensible interface that can be used to define cgroup aware
> > > assessment of mem cgroup subtrees or the traditional process aware
> > > assessment.
> > > 
> > 
> > So what is the actual semantic and scope of this policy. Does it apply
> > only down the hierarchy. Also how do you compare cgroups with different
> > policies? Let's say you have
> >           root
> >          / |  \
> >         A  B   C
> >        / \    / \
> >       D   E  F   G
> > 
> > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > 
> 
> At each level of the hierarchy, memory.oom_policy compares immediate 
> children, it's the only way that an admin can lock in a specific oom 
> policy like "tree" and then delegate the subtree to the user.  If you've 
> configured it as above, comparing A and C should be the same based on the 
> cumulative usage of their child mem cgroups.

So cgroup == tree if we are memcg aware OOM killing, right? Why do we
need both then? Just to make memcg aware OOM killing possible?

> The policy for B hasn't been specified, but since it does not have any 
> children "cgroup" and "tree" should be the same.

So now you have a killable cgroup selected by process criterion? That
just doesn't make any sense. So I guess it would at least require to
enforce (cgroup || tree) to allow oom_group.

But even then it doesn't make much sense to me because having a memcg
killable or not is an attribute of the _memcg_ rather than the OOM
context, no? In other words how much sense does it make to have B OOM
intity or not depending on whether this is a global OOM or B OOM. Either
the workload running inside B can cope with partial tear down or it
cannot. Or do you have an example when something like that would be
useful?
 
> > Now we have the global OOM killer to choose a victim. From a quick
> > glance over those patches, it seems that we will be comparing only
> > tasks because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. A, B and C
> > policies are ignored.
> 
> Right, a policy of "none" reverts its subtree back to per-process 
> comparison if you are either not using the cgroup aware oom killer or your 
> subtree is not using the cgroup aware oom killer.

So how are you going to compare none cgroups with those that consider
full memcg or hierarchy (cgroup, tree)? Are you going to consider
oom_score_adj?

> > Moreover If I select any of B's tasks then I will
> > happily kill it breaking the expectation that the whole memcg will go
> > away. Weird, don't you think? Or did I misunderstand?
> > 
> 
> It's just as weird as the behavior of memory.oom_group today without using 
> the mount option :)

Which is why oom_group returns -ENOTSUPP, so you simply cannot even set
any memcg as oom killable. And you do not have this weirdness.

> In that case, mem_cgroup_select_oom_victim() always 
> returns false and the value of memory.oom_group is ignored.  I agree that 
> it's weird in -mm and there's nothing preventing us from separating 
> memory.oom_group from the cgroup aware oom killer and allowing it to be 
> set regardless of a selection change.

it is not weird. I suspect you misunderstood the code and its intention.

> If memory.oom_group is set, and the 
> kill originates from that mem cgroup, kill all processes attached to it 
> and its subtree.
> 
> This is a criticism of the current implementation in -mm, however, my 
> extension only respects its weirdness.
> 
> > So let's assume that root: cgroup. Then we are finally comparing
> > cgroups. D, E, B, C. Of those D, E and F do not have any
> > policy. Do they inherit their policy from the parent? If they don't then
> > we should be comparing their tasks separately, no? The code disagrees
> > because once we are in the cgroup mode, we do not care about separate
> > tasks.
> > 
> 
> No, perhaps I wasn't clear in the documentation: the policy at each level 
> of the hierarchy is specified by memory.oom_policy and compares its 
> immediate children with that policy.  So the per-cgroup usage of A, B, and 
> C and compared regardless of A, B, and C's own oom policies.

You are still operating in terms of levels. And that is rather confusing
because we are operating on a _tree_ and that walk has to be independent
on the way we walk that tree - i.e. whether we do DFS or BFS ordering.

> > Let's say we choose C because it has the largest cumulative consumption.
> > It is not oom_group so it will select a task from F, G. Again you are
> > breaking oom_group policy of G if you kill a single task. So you would
> > have to be recursive here. That sounds fixable though. Just be
> > recursive.
> > 
> 
> I fully agree, but that's (another) implementation detail of what is in 
> -mm that isn't specified.  I think where you're going is the complete 
> separation of mem cgroup selection from memory.oom_group.  I agree, and we 
> can fix that.  memory.oom_group also shouldn't depend on any mount option, 
> it can be set or unset depending on the properties of the workload.

Huh? oom_group is completely orthogonal to the selection strategy. Full
stop. I do not know how many times I have to repeat that. oom_group
defines who to kill from the target. It is completely irrelevant how we
have selected the target.

> > Then you say
> > 
> > > Another benefit of such an approach is that an admin can lock in a
> > > certain policy for the system or for a mem cgroup subtree and can
> > > delegate the policy decision to the user to determine if the kill should
> > > originate from a subcontainer, as indivisible memory consumers
> > > themselves, or selection should be done per process.
> > 
> > And the code indeed doesn't check oom_policy on each level of the
> > hierarchy, unless I am missing something. So the subgroup is simply
> > locked in to the oom_policy parent has chosen. That is not the case for
> > the tree policy.
> > 
> > So look how we are comparing cumulative groups without policy with
> > groups with policy with subtrees. Either I have grossly misunderstood
> > something or this is massively inconsistent and it doesn't make much
> > sense to me. Root memcg without cgroup policy will simply turn off the
> > whole thing for the global OOM case. So you really need to enable it
> > there but then it is not really clear how to configure lower levels.
> > 
> > From the above it seems that you are more interested in memcg OOMs and
> > want to give different hierarchies different policies but you quickly
> > hit the similar inconsistencies there as well.
> > 
> > I am not sure how extensible this is actually. How do we place
> > priorities on top?
> > 
> 
> If you're referring to strict priorities where one cgroup can be preferred 
> or biased against regardless of usage, that would be an extension with yet 
> another tunable.

How does that fit into cgroup, tree, none policy model?

> Userspace influence over the selection is not addressed 
> by this patchset, nor is the unfair comparison of the root mem cgroup with 
> leaf mem cgroups.  My suggestion previously was a memory.oom_value 
> tunable, which is configured depending on its parent's memory.oom_policy.  
> If "cgroup" or "tree" it behaves as oom_score_adj-type behavior, i.e. it's 
> an adjustment on the usage.  This way, a subtree's usage can have a 
> certain amount of memory discounted, for example, because it is supposed 
> to use more than 50% of memory.  If a new "priority" memory.oom_policy 
> were implemented, which would be trivial, comparison between child cgroups 
> would be as simple as comparing memory.oom_value integers.

What if we need to implement "kill the youngest" policy? This wouldn't
really work out, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-30  8:50           ` Michal Hocko
@ 2018-01-30 22:38             ` David Rientjes
  2018-01-31  9:47               ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-30 22:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue, 30 Jan 2018, Michal Hocko wrote:

> > > So what is the actual semantic and scope of this policy. Does it apply
> > > only down the hierarchy. Also how do you compare cgroups with different
> > > policies? Let's say you have
> > >           root
> > >          / |  \
> > >         A  B   C
> > >        / \    / \
> > >       D   E  F   G
> > > 
> > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > 
> > 
> > At each level of the hierarchy, memory.oom_policy compares immediate 
> > children, it's the only way that an admin can lock in a specific oom 
> > policy like "tree" and then delegate the subtree to the user.  If you've 
> > configured it as above, comparing A and C should be the same based on the 
> > cumulative usage of their child mem cgroups.
> 
> So cgroup == tree if we are memcg aware OOM killing, right? Why do we
> need both then? Just to make memcg aware OOM killing possible?
> 

We need "tree" to account the usage of the subtree rather than simply the 
cgroup alone, but "cgroup" and "tree" are accounted with the same units.  
In your example, D and E are treated as individual memory consumers and C 
is treated as the sum of all subtree memory consumers.

If we have /students/michal and /students/david, and both of these are 
"cgroup" policy, as the current patchset in -mm implements, and you use 
1GB, but I create /students/david/{a,b,c,d} each with 512MB of usage, you 
always get oom killed.

If we both have "tree" policy, I always get oom killed because my usage is 
2GB.  /students/michal and /students/david are compared based on their 
total usage instead of each cgroup being an individual memory consumer.

This is impossible with what is in -mm.

> > The policy for B hasn't been specified, but since it does not have any 
> > children "cgroup" and "tree" should be the same.
> 
> So now you have a killable cgroup selected by process criterion? That
> just doesn't make any sense. So I guess it would at least require to
> enforce (cgroup || tree) to allow oom_group.
> 

Hmm, I'm not sure why we would limit memory.oom_group to any policy.  Even 
if we are selecting a process, even without selecting cgroups as victims, 
killing a process may still render an entire cgroup useless and it makes 
sense to kill all processes in that cgroup.  If an unlucky process is 
selected with today's heursitic of oom_badness() or with a "none" policy 
with my patchset, I don't see why we can't enable the user to kill all 
other processes in the cgroup.  It may not make sense for some trees, but 
but I think it could be useful for others.

> > Right, a policy of "none" reverts its subtree back to per-process 
> > comparison if you are either not using the cgroup aware oom killer or your 
> > subtree is not using the cgroup aware oom killer.
> 
> So how are you going to compare none cgroups with those that consider
> full memcg or hierarchy (cgroup, tree)? Are you going to consider
> oom_score_adj?
> 

No, I think it would make sense to make the restriction that to set 
"none", the ancestor mem cgroups would also need the same policy, which is 
to select the largest process while still respecting 
/proc/pid/oom_score_adj.

> > In that case, mem_cgroup_select_oom_victim() always 
> > returns false and the value of memory.oom_group is ignored.  I agree that 
> > it's weird in -mm and there's nothing preventing us from separating 
> > memory.oom_group from the cgroup aware oom killer and allowing it to be 
> > set regardless of a selection change.
> 
> it is not weird. I suspect you misunderstood the code and its intention.
> 

We agree that memory.oom_group and a selection logic are two different 
things, and that's why I find it weird that memory.oom_group cannot be set 
without locking the entire hierarchy into a selection logic.  If you have 
a subtree oom, it makes sense for you to be able to kill all processes as 
a property of the workload.  That's independent of how the target mem 
cgroup was selected.  Regardless of the selection logic, we're going 
to target a specific mem cgroup for kill.  Choosing to kill one or all 
processes is still useful.

> > No, perhaps I wasn't clear in the documentation: the policy at each level 
> > of the hierarchy is specified by memory.oom_policy and compares its 
> > immediate children with that policy.  So the per-cgroup usage of A, B, and 
> > C and compared regardless of A, B, and C's own oom policies.
> 
> You are still operating in terms of levels. And that is rather confusing
> because we are operating on a _tree_ and that walk has to be independent
> on the way we walk that tree - i.e. whether we do DFS or BFS ordering.
> 

The selection criteria for the proposed policies, which can be extended, 
is to compare individual cgroups (for "cgroups" policy) to determine the 
victim and within that subtree, to allow the selection to be delegated 
further.  If the goal is the largest cgroup, all mem cgroups down the tree 
will have "cgroup" set.  If you come to a student, in your example, it can 
be set to "tree" such that their cumulative usage, regardless of creating 
child cgroups, is compared with other students.

If you have an example of a structure that cannot work with this model, 
or the results seem confusing given how the policies are defined for a 
subtree, that would be helpful.

> > I fully agree, but that's (another) implementation detail of what is in 
> > -mm that isn't specified.  I think where you're going is the complete 
> > separation of mem cgroup selection from memory.oom_group.  I agree, and we 
> > can fix that.  memory.oom_group also shouldn't depend on any mount option, 
> > it can be set or unset depending on the properties of the workload.
> 
> Huh? oom_group is completely orthogonal to the selection strategy. Full
> stop. I do not know how many times I have to repeat that. oom_group
> defines who to kill from the target. It is completely irrelevant how we
> have selected the target.
> 

That's exactly what I said above.

The problem in -mm is that memory.oom_group is only effective when mounted 
with the "groupoom" option, so they are tied together.  I think that 
should be fixed, there's no reason for the dependency.

> > If you're referring to strict priorities where one cgroup can be preferred 
> > or biased against regardless of usage, that would be an extension with yet 
> > another tunable.
> 
> How does that fit into cgroup, tree, none policy model?
> 

The idea of memory.oom_value is that it can take on different meanings 
depending on the policy of the parent, it can be an adjustment to usage 
for "cgroup" and "tree" and a priority value for "priority".  For "none", 
it's a no-op, but can retain the value it is set with if the policy of the 
parent subsequently changes.  One entity would be setting the 
memory.oom_policy for a subtree (or root mem cgroup) and the 
memory.oom_value of the cgroups to be compared.  "Tree" is just a modified 
version of "cgroup" that hierarchical accounts usage so the units are the 
same.  For both, this would be a positive or negative adjustment on that 
usage, just as oom_score_adj is a positive or negative adjustment on rss.

> > Userspace influence over the selection is not addressed 
> > by this patchset, nor is the unfair comparison of the root mem cgroup with 
> > leaf mem cgroups.  My suggestion previously was a memory.oom_value 
> > tunable, which is configured depending on its parent's memory.oom_policy.  
> > If "cgroup" or "tree" it behaves as oom_score_adj-type behavior, i.e. it's 
> > an adjustment on the usage.  This way, a subtree's usage can have a 
> > certain amount of memory discounted, for example, because it is supposed 
> > to use more than 50% of memory.  If a new "priority" memory.oom_policy 
> > were implemented, which would be trivial, comparison between child cgroups 
> > would be as simple as comparing memory.oom_value integers.
> 
> What if we need to implement "kill the youngest" policy? This wouldn't
> really work out, right?

A memory.oom_policy of "priority" can be implemented such that tiebreaks 
between cgroups of the same priority kill the youngest.  That's how we do 
it, not to say that it's the only way of doing tiebreaks.  However, if the 
question is how a user would effect a policy of killing the youngest 
cgroup regardless of usage or any priority, that would be its own 
memory.oom_policy if someone needed it.

I very much appreciate the feedback.  In a v3, I can make the "none" 
policy only allowed if the ancestor is also "none", and this can preserve 
backwards compatibility without locking the entire cgroup v2 hierarchy 
into a single selection logic with a mount option.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-30 22:38             ` David Rientjes
@ 2018-01-31  9:47               ` Michal Hocko
  2018-02-01 10:11                 ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-31  9:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue 30-01-18 14:38:40, David Rientjes wrote:
> On Tue, 30 Jan 2018, Michal Hocko wrote:
> 
> > > > So what is the actual semantic and scope of this policy. Does it apply
> > > > only down the hierarchy. Also how do you compare cgroups with different
> > > > policies? Let's say you have
> > > >           root
> > > >          / |  \
> > > >         A  B   C
> > > >        / \    / \
> > > >       D   E  F   G
> > > > 
> > > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > > 
> > > 
> > > At each level of the hierarchy, memory.oom_policy compares immediate 
> > > children, it's the only way that an admin can lock in a specific oom 
> > > policy like "tree" and then delegate the subtree to the user.  If you've 
> > > configured it as above, comparing A and C should be the same based on the 
> > > cumulative usage of their child mem cgroups.
> > 
> > So cgroup == tree if we are memcg aware OOM killing, right? Why do we
> > need both then? Just to make memcg aware OOM killing possible?
> > 
> 
> We need "tree" to account the usage of the subtree rather than simply the 
> cgroup alone, but "cgroup" and "tree" are accounted with the same units.  
> In your example, D and E are treated as individual memory consumers and C 
> is treated as the sum of all subtree memory consumers.

It seems I am still not clear with my question. What kind of difference
does policy=cgroup vs. none on A? Also what kind of different does it
make when a leaf node has cgroup policy?

[...]

> > So now you have a killable cgroup selected by process criterion? That
> > just doesn't make any sense. So I guess it would at least require to
> > enforce (cgroup || tree) to allow oom_group.
> > 
> 
> Hmm, I'm not sure why we would limit memory.oom_group to any policy.  Even 
> if we are selecting a process, even without selecting cgroups as victims, 
> killing a process may still render an entire cgroup useless and it makes 
> sense to kill all processes in that cgroup.  If an unlucky process is 
> selected with today's heursitic of oom_badness() or with a "none" policy 
> with my patchset, I don't see why we can't enable the user to kill all 
> other processes in the cgroup.  It may not make sense for some trees, but 
> but I think it could be useful for others.

My intuition screams here. I will think about this some more but I would
be really curious about any sensible usecase when you want sacrifice the
whole gang just because of one process compared to other processes or
cgroups is too large. Do you see how you are mixing entities here?

> > > Right, a policy of "none" reverts its subtree back to per-process 
> > > comparison if you are either not using the cgroup aware oom killer or your 
> > > subtree is not using the cgroup aware oom killer.
> > 
> > So how are you going to compare none cgroups with those that consider
> > full memcg or hierarchy (cgroup, tree)? Are you going to consider
> > oom_score_adj?
> > 
> 
> No, I think it would make sense to make the restriction that to set 
> "none", the ancestor mem cgroups would also need the same policy,

I do not understand. Get back to our example. Are you saying that G
with none will enforce the none policy to C and root? If yes then this
doesn't make any sense because you are not really able to delegate the
oom policy down the tree at all. It would effectively make tree policy
pointless.

I am skipping the rest of the following text because it is picking
on details and the whole design is not clear to me. So could you start
over documenting semantic and requirements. Ideally by describing:
- how does the policy on the root of the OOM hierarchy controls the
  selection policy
- how does the per-memcg policy act during the tree walk - for both
  intermediate nodes and leafs
- how does the oom killer act based on the selected memcg
- how do you compare tasks with memcgs

[...]

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
  2018-01-31  9:47               ` Michal Hocko
@ 2018-02-01 10:11                 ` David Rientjes
  0 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-02-01 10:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Wed, 31 Jan 2018, Michal Hocko wrote:

> > > > >           root
> > > > >          / |  \
> > > > >         A  B   C
> > > > >        / \    / \
> > > > >       D   E  F   G
> > > > > 
> > > > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > > > 
> > > > 
> > > > At each level of the hierarchy, memory.oom_policy compares immediate 
> > > > children, it's the only way that an admin can lock in a specific oom 
> > > > policy like "tree" and then delegate the subtree to the user.  If you've 
> > > > configured it as above, comparing A and C should be the same based on the 
> > > > cumulative usage of their child mem cgroups.
> > > 
> It seems I am still not clear with my question. What kind of difference
> does policy=cgroup vs. none on A? Also what kind of different does it
> make when a leaf node has cgroup policy?
> 

If A has an oom policy of "cgroup" it will be comparing the local usage of 
D vs E, "tree" would be the same since neither descendants have child 
cgroups.  If A has an oom policy of "none" it would compare processes 
attached to D and E and respect /proc/pid/oom_score_adj.  It allows for 
opting in and opting out of cgroup aware selection not only for the whole 
system but also per subtree.

> > Hmm, I'm not sure why we would limit memory.oom_group to any policy.  Even 
> > if we are selecting a process, even without selecting cgroups as victims, 
> > killing a process may still render an entire cgroup useless and it makes 
> > sense to kill all processes in that cgroup.  If an unlucky process is 
> > selected with today's heursitic of oom_badness() or with a "none" policy 
> > with my patchset, I don't see why we can't enable the user to kill all 
> > other processes in the cgroup.  It may not make sense for some trees, but 
> > but I think it could be useful for others.
> 
> My intuition screams here. I will think about this some more but I would
> be really curious about any sensible usecase when you want sacrifice the
> whole gang just because of one process compared to other processes or
> cgroups is too large. Do you see how you are mixing entities here?
> 

It's a property of the workload that has nothing to do with selection.  
Regardless of how a victim is selected, we need a victim.  That victim may 
be able to tolerate the loss of the process, and not even need to be the 
largest memory hogging process based on /proc/pid/oom_score_adj (periodic 
cleanups, logging, stat collection are what I'm most familiar with).  It 
may also be vital to the workload and it's better off to kill the entire 
job, it's highly dependent on what the job is.  There's a general usecase 
for memory.oom_group behavior without any selection changes, we've had a 
killall tunable for years and is used by many customers for the same 
reason.  There's no reason for it to be coupled, it can exist independent 
of any cgroup aware selection.

> I do not understand. Get back to our example. Are you saying that G
> with none will enforce the none policy to C and root? If yes then this
> doesn't make any sense because you are not really able to delegate the
> oom policy down the tree at all. It would effectively make tree policy
> pointless.
> 

The oom policy of G is pointless, it has no children cgroups.  It can be 
"none", "cgroup", or "tree", it's not the root of a subtree.  (The oom 
policy of the root mem cgroup is also irrelevant if there are no other 
cgroups, same thing.)  If G is oom, it kills the largest process or 
everything if memory.oom_group is set, which in your example it is.

> I am skipping the rest of the following text because it is picking
> on details and the whole design is not clear to me. So could you start
> over documenting semantic and requirements. Ideally by describing:
> - how does the policy on the root of the OOM hierarchy controls the
>   selection policy

If "none", there's no difference than Linus's tree right now.  If 
"cgroup", it enables cgroup aware selection: it compares all cgroups on 
the system wrt local usage unless that cgroup has "tree" set in which case 
its usage is hierarchical.

> - how does the per-memcg policy act during the tree walk - for both
>   intermediate nodes and leafs

The oom policy is determined by the mem cgroup under oom, that is the root 
of the subtree under oom and its policy dictates how to select a victim 
mem cgroup.

> - how does the oom killer act based on the selected memcg

That's the point about memory.oom_group: once it has selected a cgroup (if 
cgroup aware behavior is enabled for the oom subtree [could be root]), a 
memory hogging process attached to that subtree is killed or everything is 
killed if memory.oom_group is enabled.

> - how do you compare tasks with memcgs
> 

You don't, I think the misunderstanding is what happens if the root of a 
subtree is "cgroup", for example, and a descendant has "none" enabled.  
The root is under oom, it is comparing cgroups :)  "None" is only 
effective if that subtree root is oom where process usage is considered.

The point is that all the functionality available in -mm is still 
available, just dictate "cgroup" everywhere and make it a decision that 
can change per subtree, if necessary, without any mount option that would 
become obsoleted.  Then, make memory.oom_group possible without any 
specific selection policy since its useful on its own.

Let me give you a concrete example based on your earlier /admins, 
/teachers, /students example.  We oversubscribe the /students subtree in 
the case where /admins and /teachers aren't using the memory.  We say 100 
students can use 1GB each, but the limit of /students is actually 200GB.  
100 students using 1GB each won't cause a system oom, we control that by 
the limit of /admins and /teachers.  But we allow using memory that isn't 
in use by /admins and /teachers if it's there, opening up overconsumers to 
the possibility of oom kill.  (This is a real world example with batch job 
scheduling, it's anything but hypothetical.)

/students/michal logs in, and he has complete control over his subtree.  
He's going to start several jobs, all in their own cgroups, with usage 
well over 1GB, but if he's oom killed he wants the right thing oom killed.  

Obviously this completely breaks down if the -mm functionality is used if 
you have 10 jobs using 512MB each, because another student using more than 
1GB who isn't using cgroups is going to be oom killed instead, although 
you are using 5GB.  We've discussed that ad nauseam, and is why I 
introduced "tree".

But now look at the API.  /students/michal is using child cgroups but 
which selection policy is in effect?  Will it kill the most memory hogging 
process in his subtree or the most memory hogging process from the most 
memory hogging cgroup?  It's an important distinction because it's 
directly based on how he constructs his hierarchy: if locked into one 
selection logic, the least important job *must* be in the highest 
consuming cgroup; otherwise, his /proc/pid/oom_score_adj is respected.  He 
*must* query the mount option.

But now let's say that memory.oom_policy is merged to give this control to 
him to do per process, per cgroup, or per subtree oom killing based on how 
he defines it.  The mount option doesn't mean anything anymore, in fact, 
it can mean the complete opposite of what actually happens.  That's the 
direct objection to the mount option.  Since I have systems with thousands 
of cgroups in hundreds of cgroups and over 100 workgroups that define, 
sometimes very creatively, how to select oom victims, I'm an advocate for 
an extensible interface that is useful for general purpose, doesn't remove 
any functionality, and doesn't have contradicting specifications.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-25 23:53 ` [patch -mm v2 0/3] " David Rientjes
  2018-01-25 23:53   ` [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
@ 2018-01-25 23:53   ` David Rientjes
  2018-01-26  0:00     ` Andrew Morton
  2018-01-25 23:53   ` [patch -mm v2 3/3] mm, memcg: add hierarchical usage oom policy David Rientjes
  2 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-25 23:53 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.

Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroup-v2.txt | 43 +++++++++++++++++++++----------------------
 include/linux/cgroup-defs.h |  5 -----
 include/linux/memcontrol.h  |  5 +++++
 kernel/cgroup/cgroup.c      | 13 +------------
 mm/memcontrol.c             | 17 ++++++++---------
 5 files changed, 35 insertions(+), 48 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1074,6 +1074,10 @@ PAGE_SIZE multiple when read back.
 	victim; that is, it will choose the single process with the largest
 	memory footprint.
 
+	If "cgroup", the OOM killer will compare mem cgroups as indivisible
+	memory consumers; that is, they will compare mem cgroup usage rather
+	than process memory footprint.  See the "OOM Killer" section.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1280,37 +1284,32 @@ belonging to the affected files to ensure correct memory ownership.
 OOM Killer
 ~~~~~~~~~~
 
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
 
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
+This policy is controlled by memory.oom_policy.  When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill.  Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
 
-  # mount -o remount,groupoom $MOUNT_POINT
+There are currently two available oom policies:
 
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
+ - "none": default, choose the largest single memory hogging process to
+   oom kill, as traditionally the OOM killer has always done.
 
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+   subtree as an OOM victim and kill at least one process, depending on
+   memory.oom_group, from it.
 
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint.  A user can control this behavior by
+enabling the per-cgroup memory.oom_group option.  If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
 
 The root cgroup is treated as a leaf memory cgroup, so it's compared
 with other leaf memory cgroups and cgroups with oom_group option set.
 
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
-
 Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
-	/*
-	 * Enable cgroup-aware OOM killer.
-	 */
-	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -64,6 +64,11 @@ enum memcg_oom_policy {
 	 * oom_badness()
 	 */
 	MEMCG_OOM_POLICY_NONE,
+	/*
+	 * Local cgroup usage is used to select a target cgroup, treating each
+	 * mem cgroup as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_CGROUP,
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1732,9 +1732,6 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
-		} else if (!strcmp(token, "groupoom")) {
-			*root_flags |= CGRP_GROUP_OOM;
-			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1751,11 +1748,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
-		if (root_flags & CGRP_GROUP_OOM)
-			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
-		else
-			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1763,8 +1755,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
-	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
-		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
@@ -5922,8 +5912,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
-					"groupoom\n");
+	return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2798,14 +2798,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return false;
-
 	if (oc->memcg)
 		root = oc->memcg;
 	else
 		root = root_mem_cgroup;
 
+	if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+		return false;
+
 	select_victim_memcg(root, oc);
 
 	return oc->chosen_memcg;
@@ -5412,9 +5412,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	bool oom_group = memcg->oom_group;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	seq_printf(m, "%d\n", oom_group);
 
 	return 0;
@@ -5428,9 +5425,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	int oom_group;
 	int err;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	err = kstrtoint(strstrip(buf), 0, &oom_group);
 	if (err)
 		return err;
@@ -5541,6 +5535,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
 
 	switch (policy) {
+	case MEMCG_OOM_POLICY_CGROUP:
+		seq_puts(m, "cgroup\n");
+		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
 		seq_puts(m, "none\n");
@@ -5557,6 +5554,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 	buf = strstrip(buf);
 	if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
 	else
 		ret = -EINVAL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-25 23:53   ` [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable David Rientjes
@ 2018-01-26  0:00     ` Andrew Morton
  2018-01-26 22:20       ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2018-01-26  0:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu, 25 Jan 2018 15:53:48 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> Now that each mem cgroup on the system has a memory.oom_policy tunable to
> specify oom kill selection behavior, remove the needless "groupoom" mount
> option that requires (1) the entire system to be forced, perhaps
> unnecessarily, perhaps unexpectedly, into a single oom policy that
> differs from the traditional per process selection, and (2) a remount to
> change.
> 
> Instead of enabling the cgroup aware oom killer with the "groupoom" mount
> option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

Can we retain the groupoom mount option and use its setting to set the
initial value of every memory.oom_policy?  That way the mount option
remains somewhat useful and we're back-compatible?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-26  0:00     ` Andrew Morton
@ 2018-01-26 22:20       ` David Rientjes
  2018-01-26 22:39         ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-26 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu, 25 Jan 2018, Andrew Morton wrote:

> > Now that each mem cgroup on the system has a memory.oom_policy tunable to
> > specify oom kill selection behavior, remove the needless "groupoom" mount
> > option that requires (1) the entire system to be forced, perhaps
> > unnecessarily, perhaps unexpectedly, into a single oom policy that
> > differs from the traditional per process selection, and (2) a remount to
> > change.
> > 
> > Instead of enabling the cgroup aware oom killer with the "groupoom" mount
> > option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
> 
> Can we retain the groupoom mount option and use its setting to set the
> initial value of every memory.oom_policy?  That way the mount option
> remains somewhat useful and we're back-compatible?
> 

-ECONFUSED.  We want to have a mount option that has the sole purpose of 
doing echo cgroup > /mnt/cgroup/memory.oom_policy?

Please note that this patchset is not only to remove a mount option, it is 
to allow oom policies to be configured per subtree such that users whom 
you delegate those subtrees to cannot evade the oom policy that is set at 
a higher level.  The goal is to prevent the user from needing to organize 
their hierarchy is a specific way to work around this constraint and use 
things like limiting the number of child cgroups that user is allowed to 
create only to work around the oom policy.  With a cgroup v2 single 
hierarchy it severely limits the amount of control the user has over their 
processes because they are locked into a very specific hierarchy 
configuration solely to not allow the user to evade oom kill.

This, and fixes to fairly compare the root mem cgroup with leaf mem 
cgroups, are essential before the feature is merged otherwise it yields 
wildly unpredictable (and unexpected, since its interaction with 
oom_score_adj isn't documented) results as I already demonstrated where 
cgroups with 1GB of usage are killed instead of 6GB workers outside of 
that subtree.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-26 22:20       ` David Rientjes
@ 2018-01-26 22:39         ` Andrew Morton
  2018-01-26 22:52           ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2018-01-26 22:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, 26 Jan 2018 14:20:24 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> On Thu, 25 Jan 2018, Andrew Morton wrote:
> 
> > > Now that each mem cgroup on the system has a memory.oom_policy tunable to
> > > specify oom kill selection behavior, remove the needless "groupoom" mount
> > > option that requires (1) the entire system to be forced, perhaps
> > > unnecessarily, perhaps unexpectedly, into a single oom policy that
> > > differs from the traditional per process selection, and (2) a remount to
> > > change.
> > > 
> > > Instead of enabling the cgroup aware oom killer with the "groupoom" mount
> > > option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
> > 
> > Can we retain the groupoom mount option and use its setting to set the
> > initial value of every memory.oom_policy?  That way the mount option
> > remains somewhat useful and we're back-compatible?
> > 
> 
> -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> doing echo cgroup > /mnt/cgroup/memory.oom_policy?

Approximately.  Let me put it another way: can we modify your patchset
so that the mount option remains, and continues to have a sufficiently
same effect?  For backward compatibility.

> This, and fixes to fairly compare the root mem cgroup with leaf mem 
> cgroups, are essential before the feature is merged otherwise it yields 
> wildly unpredictable (and unexpected, since its interaction with 
> oom_score_adj isn't documented) results as I already demonstrated where 
> cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> that subtree.

OK, so Roman's new feature is incomplete: it satisfies some use cases
but not others.  And we kinda have a plan to address the other use
cases in the future.

There's nothing wrong with that!  As long as we don't break existing
setups while evolving the feature.  How do we do that?



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-26 22:39         ` Andrew Morton
@ 2018-01-26 22:52           ` David Rientjes
  2018-01-27  0:17             ` Andrew Morton
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-01-26 22:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, 26 Jan 2018, Andrew Morton wrote:

> > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> 
> Approximately.  Let me put it another way: can we modify your patchset
> so that the mount option remains, and continues to have a sufficiently
> same effect?  For backward compatibility.
> 

The mount option would exist solely to set the oom policy of the root mem 
cgroup, it would lose its effect of mandating that policy for any subtree 
since it would become configurable by the user if delegated.

Let me put it another way: if the cgroup aware oom killer is merged for 
4.16 without this patchset, userspace can reasonably infer the oom policy 
from checking how cgroups were mounted.  If my followup patchset were 
merged for 4.17, that's invalid and it becomes dependent on kernel 
version: it could have the "groupoom" mount option but configured through 
the root mem cgroup's memory.oom_policy to not be cgroup aware at all.

That inconsistency, to me, is unfair to burden userspace with.

> > This, and fixes to fairly compare the root mem cgroup with leaf mem 
> > cgroups, are essential before the feature is merged otherwise it yields 
> > wildly unpredictable (and unexpected, since its interaction with 
> > oom_score_adj isn't documented) results as I already demonstrated where 
> > cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> > that subtree.
> 
> OK, so Roman's new feature is incomplete: it satisfies some use cases
> but not others.  And we kinda have a plan to address the other use
> cases in the future.
> 

Those use cases are also undocumented such that the user doesn't know the 
behavior they are opting into.  Nowhere in the patchset does it mention 
anything about oom_score_adj other than being oom disabled.  It doesn't 
mention that a per-process tunable now depends strictly on whether it is 
attached to root or not.  It specifies a fair comparison between the root 
mem cgroup and leaf mem cgroups, which is obviously incorrect by the 
implementation itself.  So I'm not sure the user would know which use 
cases it is valid for, which is why I've been trying to make it generally 
purposeful and documented.

> There's nothing wrong with that!  As long as we don't break existing
> setups while evolving the feature.  How do we do that?
> 

We'd break the setups that actually configure their cgroups and processes 
to abide by the current implementation since we'd need to discount 
oom_score_adj from the the root mem cgroup usage to fix it.

There hasn't been any feedback on v2 of my patchset that would suggest 
changes are needed.  I think we all recognize the shortcoming that it is 
addressing.  The only feedback on v1, the need for memory.oom_group as a 
separate tunable, has been addressed in v2.  What are we waiting for?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-26 22:52           ` David Rientjes
@ 2018-01-27  0:17             ` Andrew Morton
  2018-01-29 10:46               ` Michal Hocko
  2018-01-29 22:16               ` David Rientjes
  0 siblings, 2 replies; 52+ messages in thread
From: Andrew Morton @ 2018-01-27  0:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, 26 Jan 2018 14:52:59 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> On Fri, 26 Jan 2018, Andrew Morton wrote:
> 
> > > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> > 
> > Approximately.  Let me put it another way: can we modify your patchset
> > so that the mount option remains, and continues to have a sufficiently
> > same effect?  For backward compatibility.
> > 
> 
> The mount option would exist solely to set the oom policy of the root mem 
> cgroup, it would lose its effect of mandating that policy for any subtree 
> since it would become configurable by the user if delegated.

Why can't we propagate the mount option into the subtrees?

If the user then alters that behaviour with new added-by-David tunables
then fine, that's still backward compatible.

> Let me put it another way: if the cgroup aware oom killer is merged for 
> 4.16 without this patchset, userspace can reasonably infer the oom policy 
> from checking how cgroups were mounted.  If my followup patchset were 
> merged for 4.17, that's invalid and it becomes dependent on kernel 
> version: it could have the "groupoom" mount option but configured through 
> the root mem cgroup's memory.oom_policy to not be cgroup aware at all.

That concern seems unreasonable to me.  Is an application *really*
going to peek at the mount options to figure out what its present oom
policy is?  Well, maybe.  But that's a pretty dopey thing to do and I
wouldn't lose much sleep over breaking any such application in the very
unlikely case that such a thing was developed in that two-month window.

If that's really a concern then let's add (to Roman's patchset) a
proper interface for an application to query its own oom policy.

> That inconsistency, to me, is unfair to burden userspace with.
> 
> > > This, and fixes to fairly compare the root mem cgroup with leaf mem 
> > > cgroups, are essential before the feature is merged otherwise it yields 
> > > wildly unpredictable (and unexpected, since its interaction with 
> > > oom_score_adj isn't documented) results as I already demonstrated where 
> > > cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> > > that subtree.
> > 
> > OK, so Roman's new feature is incomplete: it satisfies some use cases
> > but not others.  And we kinda have a plan to address the other use
> > cases in the future.
> > 
> 
> Those use cases are also undocumented such that the user doesn't know the 
> behavior they are opting into.  Nowhere in the patchset does it mention 
> anything about oom_score_adj other than being oom disabled.  It doesn't 
> mention that a per-process tunable now depends strictly on whether it is 
> attached to root or not.  It specifies a fair comparison between the root 
> mem cgroup and leaf mem cgroups, which is obviously incorrect by the 
> implementation itself.  So I'm not sure the user would know which use 
> cases it is valid for, which is why I've been trying to make it generally 
> purposeful and documented.

Documentation patches are nice.  We can cc:stable them too, so no huge
hurry.

> > There's nothing wrong with that!  As long as we don't break existing
> > setups while evolving the feature.  How do we do that?
> > 
> 
> We'd break the setups that actually configure their cgroups and processes 
> to abide by the current implementation since we'd need to discount 
> oom_score_adj from the the root mem cgroup usage to fix it.

Am having trouble understanding that.  Expand, please?

Can we address this (and other such) issues in the (interim)
documentation?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-27  0:17             ` Andrew Morton
@ 2018-01-29 10:46               ` Michal Hocko
       [not found]                 ` <20180129104657.GC21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2018-01-29 22:16               ` David Rientjes
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-29 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri 26-01-18 16:17:35, Andrew Morton wrote:
> On Fri, 26 Jan 2018 14:52:59 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
[...]
> > Those use cases are also undocumented such that the user doesn't know the 
> > behavior they are opting into.  Nowhere in the patchset does it mention 
> > anything about oom_score_adj other than being oom disabled.  It doesn't 
> > mention that a per-process tunable now depends strictly on whether it is 
> > attached to root or not.  It specifies a fair comparison between the root 
> > mem cgroup and leaf mem cgroups, which is obviously incorrect by the 
> > implementation itself.  So I'm not sure the user would know which use 
> > cases it is valid for, which is why I've been trying to make it generally 
> > purposeful and documented.
> 
> Documentation patches are nice.  We can cc:stable them too, so no huge
> hurry.

What about this?

From c02d8bc1396d5ab3785d01f577e2ee74e5dd985e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 29 Jan 2018 11:42:59 +0100
Subject: [PATCH] oom, memcg: clarify root memcg oom accounting

David Rientjes has pointed out that the current way how the root memcg
is accounted for the cgroup aware OOM killer is undocumented. Unlike
regular cgroups there is no accounting going on in the root memcg
(mostly for performance reasons). Therefore we are suming up oom_badness
of its tasks. This might result in an over accounting because of the
oom_score_adj setting. Document this for now.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/cgroup-v2.txt | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2eaed1e2243d..7dff106bba57 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1292,7 +1292,11 @@ the memory controller considers only cgroups belonging to the sub-tree
 of the OOM'ing cgroup.
 
 The root cgroup is treated as a leaf memory cgroup, so it's compared
-with other leaf memory cgroups and cgroups with oom_group option set.
+with other leaf memory cgroups and cgroups with oom_group option
+set. Due to internal implementation restrictions the size of the root
+cgroup is a cumulative sum of oom_badness of all its tasks (in other
+words oom_score_adj of each task is obeyed). This might change in the
+future.
 
 If there are no cgroups with the enabled memory controller,
 the OOM killer is using the "traditional" process-based approach.
-- 
2.15.1
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

[parent not found: <20180129104657.GC21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
       [not found]                 ` <20180129104657.GC21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-01-29 19:11                   ` Tejun Heo
  2018-01-30  8:54                     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-01-29 19:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hello, Michal.

On Mon, Jan 29, 2018 at 11:46:57AM +0100, Michal Hocko wrote:
> @@ -1292,7 +1292,11 @@ the memory controller considers only cgroups belonging to the sub-tree
>  of the OOM'ing cgroup.
>  
>  The root cgroup is treated as a leaf memory cgroup, so it's compared
> -with other leaf memory cgroups and cgroups with oom_group option set.
> +with other leaf memory cgroups and cgroups with oom_group option
> +set. Due to internal implementation restrictions the size of the root
> +cgroup is a cumulative sum of oom_badness of all its tasks (in other
> +words oom_score_adj of each task is obeyed). This might change in the
> +future.

Thanks, we can definitely use more documentation.  However, it's a bit
difficult to follow.  Maybe expand it to a separate paragraph on the
current behavior with a clear warning that the default OOM heuristics
is subject to changes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-29 19:11                   ` Tejun Heo
@ 2018-01-30  8:54                     ` Michal Hocko
       [not found]                       ` <20180130085445.GQ21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-30  8:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, David Rientjes, Roman Gushchin, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Mon 29-01-18 11:11:39, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Jan 29, 2018 at 11:46:57AM +0100, Michal Hocko wrote:
> > @@ -1292,7 +1292,11 @@ the memory controller considers only cgroups belonging to the sub-tree
> >  of the OOM'ing cgroup.
> >  
> >  The root cgroup is treated as a leaf memory cgroup, so it's compared
> > -with other leaf memory cgroups and cgroups with oom_group option set.
> > +with other leaf memory cgroups and cgroups with oom_group option
> > +set. Due to internal implementation restrictions the size of the root
> > +cgroup is a cumulative sum of oom_badness of all its tasks (in other
> > +words oom_score_adj of each task is obeyed). This might change in the
> > +future.
> 
> Thanks, we can definitely use more documentation.  However, it's a bit
> difficult to follow.  Maybe expand it to a separate paragraph on the
> current behavior with a clear warning that the default OOM heuristics
> is subject to changes?

Does this sound any better?

From ea4fa9c36d3ec2cf13d1949169924a1a54b9fcd6 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 30 Jan 2018 09:54:15 +0100
Subject: [PATCH] oom, memcg: clarify root memcg oom accounting

David Rientjes has pointed out that the current way how the root memcg
is accounted for the cgroup aware OOM killer is undocumented. Unlike
regular cgroups there is no accounting going on in the root memcg
(mostly for performance reasons). Therefore we are suming up oom_badness
of its tasks. This might result in an over accounting because of the
oom_score_adj setting. Document this for now.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/cgroup-v2.txt | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2eaed1e2243d..67bdf19f8e5b 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
 the memory controller considers only cgroups belonging to the sub-tree
 of the OOM'ing cgroup.
 
-The root cgroup is treated as a leaf memory cgroup, so it's compared
-with other leaf memory cgroups and cgroups with oom_group option set.
+Leaf cgroups are compared based on their cumulative memory usage. The
+root cgroup is treated as a leaf memory cgroup as well, so it's
+compared with other leaf memory cgroups. Due to internal implementation
+restrictions the size of the root cgroup is a cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed). Relying on oom_score_adj (appart from OOM_SCORE_ADJ_MIN)
+can lead to overestimating of the root cgroup consumption and it is
+therefore discouraged. This might change in the future, though.
 
 If there are no cgroups with the enabled memory controller,
 the OOM killer is using the "traditional" process-based approach.
-- 
2.15.1
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 52+ messages in thread

[parent not found: <20180130085445.GQ21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
       [not found]                       ` <20180130085445.GQ21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-01-30 11:58                         ` Roman Gushchin
       [not found]                           ` <20180130115846.GA4720-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-01-30 11:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, David Rientjes, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue, Jan 30, 2018 at 09:54:45AM +0100, Michal Hocko wrote:
> On Mon 29-01-18 11:11:39, Tejun Heo wrote:

Hello, Michal!

> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 2eaed1e2243d..67bdf19f8e5b 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
>  the memory controller considers only cgroups belonging to the sub-tree
>  of the OOM'ing cgroup.
>  
> -The root cgroup is treated as a leaf memory cgroup, so it's compared
> -with other leaf memory cgroups and cgroups with oom_group option set.
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IMO, this statement is important. Isn't it?

> +Leaf cgroups are compared based on their cumulative memory usage. The
> +root cgroup is treated as a leaf memory cgroup as well, so it's
> +compared with other leaf memory cgroups. Due to internal implementation
> +restrictions the size of the root cgroup is a cumulative sum of
> +oom_badness of all its tasks (in other words oom_score_adj of each task
> +is obeyed). Relying on oom_score_adj (appart from OOM_SCORE_ADJ_MIN)
> +can lead to overestimating of the root cgroup consumption and it is

Hm, and underestimating too. Also OOM_SCORE_ADJ_MIN isn't any different
in this case. Say, all tasks except a small one have OOM_SCORE_ADJ set to
-999, this means the root croup has extremely low chances to be elected.

> +therefore discouraged. This might change in the future, though.

Other than that looks very good to me.

Thank you!

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <20180130115846.GA4720-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>]

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
       [not found]                           ` <20180130115846.GA4720-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2018-01-30 12:08                             ` Michal Hocko
  2018-01-30 12:13                               ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-01-30 12:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Andrew Morton, David Rientjes, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue 30-01-18 11:58:51, Roman Gushchin wrote:
> On Tue, Jan 30, 2018 at 09:54:45AM +0100, Michal Hocko wrote:
> > On Mon 29-01-18 11:11:39, Tejun Heo wrote:
> 
> Hello, Michal!
> 
> > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > index 2eaed1e2243d..67bdf19f8e5b 100644
> > --- a/Documentation/cgroup-v2.txt
> > +++ b/Documentation/cgroup-v2.txt
> > @@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> >  the memory controller considers only cgroups belonging to the sub-tree
> >  of the OOM'ing cgroup.
> >  
> > -The root cgroup is treated as a leaf memory cgroup, so it's compared
> > -with other leaf memory cgroups and cgroups with oom_group option set.
>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> IMO, this statement is important. Isn't it?
> 
> > +Leaf cgroups are compared based on their cumulative memory usage. The
> > +root cgroup is treated as a leaf memory cgroup as well, so it's
> > +compared with other leaf memory cgroups. Due to internal implementation
> > +restrictions the size of the root cgroup is a cumulative sum of
> > +oom_badness of all its tasks (in other words oom_score_adj of each task
> > +is obeyed). Relying on oom_score_adj (appart from OOM_SCORE_ADJ_MIN)
> > +can lead to overestimating of the root cgroup consumption and it is
> 
> Hm, and underestimating too. Also OOM_SCORE_ADJ_MIN isn't any different
> in this case. Say, all tasks except a small one have OOM_SCORE_ADJ set to
> -999, this means the root croup has extremely low chances to be elected.
> 
> > +therefore discouraged. This might change in the future, though.
> 
> Other than that looks very good to me.

This?

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2eaed1e2243d..34ad80ee90f2 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1291,8 +1291,15 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
 the memory controller considers only cgroups belonging to the sub-tree
 of the OOM'ing cgroup.
 
-The root cgroup is treated as a leaf memory cgroup, so it's compared
-with other leaf memory cgroups and cgroups with oom_group option set.
+Leaf cgroups and cgroups with oom_group option set are compared based
+on their cumulative memory usage. The root cgroup is treated as a
+leaf memory cgroup as well, so it's compared with other leaf memory
+cgroups. Due to internal implementation restrictions the size of
+the root cgroup is a cumulative sum of oom_badness of all its tasks
+(in other words oom_score_adj of each task is obeyed). Relying on
+oom_score_adj (appart from OOM_SCORE_ADJ_MIN) can lead to over or
+underestimating of the root cgroup consumption and it is therefore
+discouraged. This might change in the future, though.
 
 If there are no cgroups with the enabled memory controller,
 the OOM killer is using the "traditional" process-based approach.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-30 12:08                             ` Michal Hocko
@ 2018-01-30 12:13                               ` Roman Gushchin
       [not found]                                 ` <20180130121315.GA5888-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-01-30 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, David Rientjes, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue, Jan 30, 2018 at 01:08:52PM +0100, Michal Hocko wrote:
> On Tue 30-01-18 11:58:51, Roman Gushchin wrote:
> > On Tue, Jan 30, 2018 at 09:54:45AM +0100, Michal Hocko wrote:
> > > On Mon 29-01-18 11:11:39, Tejun Heo wrote:
> > 
> > Hello, Michal!
> > 
> > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > index 2eaed1e2243d..67bdf19f8e5b 100644
> > > --- a/Documentation/cgroup-v2.txt
> > > +++ b/Documentation/cgroup-v2.txt
> > > @@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> > >  the memory controller considers only cgroups belonging to the sub-tree
> > >  of the OOM'ing cgroup.
> > >  
> > > -The root cgroup is treated as a leaf memory cgroup, so it's compared
> > > -with other leaf memory cgroups and cgroups with oom_group option set.
> >                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > IMO, this statement is important. Isn't it?
> > 
> > > +Leaf cgroups are compared based on their cumulative memory usage. The
> > > +root cgroup is treated as a leaf memory cgroup as well, so it's
> > > +compared with other leaf memory cgroups. Due to internal implementation
> > > +restrictions the size of the root cgroup is a cumulative sum of
> > > +oom_badness of all its tasks (in other words oom_score_adj of each task
> > > +is obeyed). Relying on oom_score_adj (appart from OOM_SCORE_ADJ_MIN)
> > > +can lead to overestimating of the root cgroup consumption and it is
> > 
> > Hm, and underestimating too. Also OOM_SCORE_ADJ_MIN isn't any different
> > in this case. Say, all tasks except a small one have OOM_SCORE_ADJ set to
> > -999, this means the root croup has extremely low chances to be elected.
> > 
> > > +therefore discouraged. This might change in the future, though.
> > 
> > Other than that looks very good to me.
> 
> This?
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 2eaed1e2243d..34ad80ee90f2 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1291,8 +1291,15 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
>  the memory controller considers only cgroups belonging to the sub-tree
>  of the OOM'ing cgroup.
>  
> -The root cgroup is treated as a leaf memory cgroup, so it's compared
> -with other leaf memory cgroups and cgroups with oom_group option set.
> +Leaf cgroups and cgroups with oom_group option set are compared based
> +on their cumulative memory usage. The root cgroup is treated as a
> +leaf memory cgroup as well, so it's compared with other leaf memory
> +cgroups. Due to internal implementation restrictions the size of
> +the root cgroup is a cumulative sum of oom_badness of all its tasks
> +(in other words oom_score_adj of each task is obeyed). Relying on
> +oom_score_adj (appart from OOM_SCORE_ADJ_MIN) can lead to over or
> +underestimating of the root cgroup consumption and it is therefore
> +discouraged. This might change in the future, though.

Acked-by: Roman Gushchin <guro@fb.com>

Thank you!

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <20180130121315.GA5888-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>]

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
       [not found]                                 ` <20180130121315.GA5888-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2018-01-30 12:20                                   ` Michal Hocko
  2018-01-30 15:15                                     ` Tejun Heo
       [not found]                                     ` <20180130122011.GB21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 2 replies; 52+ messages in thread
From: Michal Hocko @ 2018-01-30 12:20 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Andrew Morton, David Rientjes, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue 30-01-18 12:13:22, Roman Gushchin wrote:
> On Tue, Jan 30, 2018 at 01:08:52PM +0100, Michal Hocko wrote:
> > On Tue 30-01-18 11:58:51, Roman Gushchin wrote:
> > > On Tue, Jan 30, 2018 at 09:54:45AM +0100, Michal Hocko wrote:
> > > > On Mon 29-01-18 11:11:39, Tejun Heo wrote:
> > > 
> > > Hello, Michal!
> > > 
> > > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > > index 2eaed1e2243d..67bdf19f8e5b 100644
> > > > --- a/Documentation/cgroup-v2.txt
> > > > +++ b/Documentation/cgroup-v2.txt
> > > > @@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> > > >  the memory controller considers only cgroups belonging to the sub-tree
> > > >  of the OOM'ing cgroup.
> > > >  
> > > > -The root cgroup is treated as a leaf memory cgroup, so it's compared
> > > > -with other leaf memory cgroups and cgroups with oom_group option set.
> > >                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > IMO, this statement is important. Isn't it?
> > > 
> > > > +Leaf cgroups are compared based on their cumulative memory usage. The
> > > > +root cgroup is treated as a leaf memory cgroup as well, so it's
> > > > +compared with other leaf memory cgroups. Due to internal implementation
> > > > +restrictions the size of the root cgroup is a cumulative sum of
> > > > +oom_badness of all its tasks (in other words oom_score_adj of each task
> > > > +is obeyed). Relying on oom_score_adj (appart from OOM_SCORE_ADJ_MIN)
> > > > +can lead to overestimating of the root cgroup consumption and it is
> > > 
> > > Hm, and underestimating too. Also OOM_SCORE_ADJ_MIN isn't any different
> > > in this case. Say, all tasks except a small one have OOM_SCORE_ADJ set to
> > > -999, this means the root croup has extremely low chances to be elected.
> > > 
> > > > +therefore discouraged. This might change in the future, though.
> > > 
> > > Other than that looks very good to me.
> > 
> > This?
> > 
> > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > index 2eaed1e2243d..34ad80ee90f2 100644
> > --- a/Documentation/cgroup-v2.txt
> > +++ b/Documentation/cgroup-v2.txt
> > @@ -1291,8 +1291,15 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> >  the memory controller considers only cgroups belonging to the sub-tree
> >  of the OOM'ing cgroup.
> >  
> > -The root cgroup is treated as a leaf memory cgroup, so it's compared
> > -with other leaf memory cgroups and cgroups with oom_group option set.
> > +Leaf cgroups and cgroups with oom_group option set are compared based
> > +on their cumulative memory usage. The root cgroup is treated as a
> > +leaf memory cgroup as well, so it's compared with other leaf memory
> > +cgroups. Due to internal implementation restrictions the size of
> > +the root cgroup is a cumulative sum of oom_badness of all its tasks
> > +(in other words oom_score_adj of each task is obeyed). Relying on
> > +oom_score_adj (appart from OOM_SCORE_ADJ_MIN) can lead to over or
> > +underestimating of the root cgroup consumption and it is therefore
> > +discouraged. This might change in the future, though.
> 
> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>

Andrew?

From 361275a05ad7026b8f721f8aa756a4975a2c42b1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Date: Tue, 30 Jan 2018 09:54:15 +0100
Subject: [PATCH] oom, memcg: clarify root memcg oom accounting

David Rientjes has pointed out that the current way how the root memcg
is accounted for the cgroup aware OOM killer is undocumented. Unlike
regular cgroups there is no accounting going on in the root memcg
(mostly for performance reasons). Therefore we are suming up oom_badness
of its tasks. This might result in an over accounting because of the
oom_score_adj setting. Document this for now.

Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
---
 Documentation/cgroup-v2.txt | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2eaed1e2243d..34ad80ee90f2 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1291,8 +1291,15 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
 the memory controller considers only cgroups belonging to the sub-tree
 of the OOM'ing cgroup.
 
-The root cgroup is treated as a leaf memory cgroup, so it's compared
-with other leaf memory cgroups and cgroups with oom_group option set.
+Leaf cgroups and cgroups with oom_group option set are compared based
+on their cumulative memory usage. The root cgroup is treated as a
+leaf memory cgroup as well, so it's compared with other leaf memory
+cgroups. Due to internal implementation restrictions the size of
+the root cgroup is a cumulative sum of oom_badness of all its tasks
+(in other words oom_score_adj of each task is obeyed). Relying on
+oom_score_adj (appart from OOM_SCORE_ADJ_MIN) can lead to over or
+underestimating of the root cgroup consumption and it is therefore
+discouraged. This might change in the future, though.
 
 If there are no cgroups with the enabled memory controller,
 the OOM killer is using the "traditional" process-based approach.
-- 
2.15.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-30 12:20                                   ` Michal Hocko
@ 2018-01-30 15:15                                     ` Tejun Heo
       [not found]                                     ` <20180130122011.GB21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2018-01-30 15:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Andrew Morton, David Rientjes, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue, Jan 30, 2018 at 01:20:11PM +0100, Michal Hocko wrote:
> From 361275a05ad7026b8f721f8aa756a4975a2c42b1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 30 Jan 2018 09:54:15 +0100
> Subject: [PATCH] oom, memcg: clarify root memcg oom accounting
> 
> David Rientjes has pointed out that the current way how the root memcg
> is accounted for the cgroup aware OOM killer is undocumented. Unlike
> regular cgroups there is no accounting going on in the root memcg
> (mostly for performance reasons). Therefore we are suming up oom_badness
> of its tasks. This might result in an over accounting because of the
> oom_score_adj setting. Document this for now.
> 
> Acked-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks, Michal.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

[parent not found: <20180130122011.GB21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
       [not found]                                     ` <20180130122011.GB21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-01-30 17:30                                       ` Johannes Weiner
  2018-01-30 19:39                                       ` Andrew Morton
  1 sibling, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2018-01-30 17:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tejun Heo, Andrew Morton, David Rientjes,
	Vladimir Davydov, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue, Jan 30, 2018 at 01:20:11PM +0100, Michal Hocko wrote:
> From 361275a05ad7026b8f721f8aa756a4975a2c42b1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Date: Tue, 30 Jan 2018 09:54:15 +0100
> Subject: [PATCH] oom, memcg: clarify root memcg oom accounting
> 
> David Rientjes has pointed out that the current way how the root memcg
> is accounted for the cgroup aware OOM killer is undocumented. Unlike
> regular cgroups there is no accounting going on in the root memcg
> (mostly for performance reasons). Therefore we are suming up oom_badness
> of its tasks. This might result in an over accounting because of the
> oom_score_adj setting. Document this for now.
> 
> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
       [not found]                                     ` <20180130122011.GB21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2018-01-30 17:30                                       ` Johannes Weiner
@ 2018-01-30 19:39                                       ` Andrew Morton
  1 sibling, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2018-01-30 19:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tejun Heo, David Rientjes, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Tue, 30 Jan 2018 13:20:11 +0100 Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:

> Subject: [PATCH] oom, memcg: clarify root memcg oom accounting
> 
> David Rientjes has pointed out that the current way how the root memcg
> is accounted for the cgroup aware OOM killer is undocumented. Unlike
> regular cgroups there is no accounting going on in the root memcg
> (mostly for performance reasons). Therefore we are suming up oom_badness
> of its tasks. This might result in an over accounting because of the
> oom_score_adj setting. Document this for now.

Thanks.  Some tweakage:

--- a/Documentation/cgroup-v2.txt~mm-oom-docs-describe-the-cgroup-aware-oom-killer-fix-2-fix
+++ a/Documentation/cgroup-v2.txt
@@ -1292,13 +1292,13 @@ of the OOM'ing cgroup.
 
 Leaf cgroups and cgroups with oom_group option set are compared based
 on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it's compared with other leaf memory
+leaf memory cgroup as well, so it is compared with other leaf memory
 cgroups. Due to internal implementation restrictions the size of
-the root cgroup is a cumulative sum of oom_badness of all its tasks
+the root cgroup is the cumulative sum of oom_badness of all its tasks
 (in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (appart from OOM_SCORE_ADJ_MIN) can lead to over or
-underestimating of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, though.
+oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
+underestimation of the root cgroup consumption and it is therefore
+discouraged. This might change in the future, however.
 
 If there are no cgroups with the enabled memory controller,
 the OOM killer is using the "traditional" process-based approach.
_

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable
  2018-01-27  0:17             ` Andrew Morton
  2018-01-29 10:46               ` Michal Hocko
@ 2018-01-29 22:16               ` David Rientjes
  1 sibling, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-29 22:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, 26 Jan 2018, Andrew Morton wrote:

> > > > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > > > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> > > 
> > > Approximately.  Let me put it another way: can we modify your patchset
> > > so that the mount option remains, and continues to have a sufficiently
> > > same effect?  For backward compatibility.
> > > 
> > 
> > The mount option would exist solely to set the oom policy of the root mem 
> > cgroup, it would lose its effect of mandating that policy for any subtree 
> > since it would become configurable by the user if delegated.
> 
> Why can't we propagate the mount option into the subtrees?
> 
> If the user then alters that behaviour with new added-by-David tunables
> then fine, that's still backward compatible.
> 

It's not, if you look for the "groupoom" mount option it will specify two 
different things: the entire hierarchy is locked into a single per-cgroup 
usage comparison (Roman's original patchset), and entire hierarchy had an 
initial oom policy set which could have subsequently changed (my 
extension).  With memory.oom_policy you need to query what the effective 
policy is, checking for "groupoom" is entirely irrelevant, it was only the 
initial setting.

Thus, if memory.oom_policy is going to be merged in the future, it 
necessarily obsoletes the mount option.  It would depend on the kernel 
version to determine its meaning.

I'm struggling to see the benefit of simply not reviewing patches that 
build off the original and merging a patchset early.  What are we gaining?

> > Let me put it another way: if the cgroup aware oom killer is merged for 
> > 4.16 without this patchset, userspace can reasonably infer the oom policy 
> > from checking how cgroups were mounted.  If my followup patchset were 
> > merged for 4.17, that's invalid and it becomes dependent on kernel 
> > version: it could have the "groupoom" mount option but configured through 
> > the root mem cgroup's memory.oom_policy to not be cgroup aware at all.
> 
> That concern seems unreasonable to me.  Is an application *really*
> going to peek at the mount options to figure out what its present oom
> policy is?  Well, maybe.  But that's a pretty dopey thing to do and I
> wouldn't lose much sleep over breaking any such application in the very
> unlikely case that such a thing was developed in that two-month window.
> 

It's not dopey, it's the only way that any userspace can determine what 
process is going to be oom killed!  That policy will dictate how the 
cgroup hierarchy is configured without my extension, there's no other way 
to prefer or bias processes.

How can a userspace cgroup manager possibly construct a cgroup v2 
hierarchy with expected oom kill behavior if it is not peeking at the 
mount option?

My concern is that if extended with my patchset the mount option itself 
becomes obsolete and then peeking at it is irrelevant to the runtime 
behavior!

> > > There's nothing wrong with that!  As long as we don't break existing
> > > setups while evolving the feature.  How do we do that?
> > 
> > We'd break the setups that actually configure their cgroups and processes 
> > to abide by the current implementation since we'd need to discount 
> > oom_score_adj from the the root mem cgroup usage to fix it.
> 
> Am having trouble understanding that.  Expand, please?
> 
> Can we address this (and other such) issues in the (interim)
> documentation?
> 

This point isn't a documentation issue at all, this is the fact that 
oom_score_adj is only effective for the root mem cgroup.  If the user is 
fully aware of the implementation, it does not change the fact that he or 
she will construct their cgroup hierarchy and attach processes to it to 
abide by the behavior.  That is the breakage that I am concerned about.

An example: you have a log scraper that is running with 
/proc/pid/oom_score_adj == 999.  It's best effort, it can be killed, we'll 
retry the next time if the system has memory available.  This is partly 
why oom_adj and oom_score_adj exist and is used on production systems.

If you attach that process to an unlimited mem cgroup dedicated to system 
daemons purely for the rich stats that mem cgroup provides, this breaks 
the oom_score_adj setting solely because it's attached to the cgroup.  On 
system-wide oom, it is no longer the killed process merely because it is 
attached to an unlimited child cgroup.  This is not the only such example: 
this occurs for any process attached to a cgroup.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch -mm v2 3/3] mm, memcg: add hierarchical usage oom policy
  2018-01-25 23:53 ` [patch -mm v2 0/3] " David Rientjes
  2018-01-25 23:53   ` [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
  2018-01-25 23:53   ` [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable David Rientjes
@ 2018-01-25 23:53   ` David Rientjes
  2 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-01-25 23:53 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

One of the three significant concerns brought up about the cgroup aware
oom killer is that its decisionmaking is completely evaded by creating
subcontainers and attaching processes such that the ancestor's usage does
not exceed another cgroup on the system.

In this regard, users who do not distribute their processes over a set of
subcontainers for mem cgroup control, statistics, or other controllers
are unfairly penalized.

This adds an oom policy, "tree", that accounts for hierarchical usage
when comparing cgroups and the cgroup aware oom killer is enabled by an
ancestor.  This allows administrators, for example, to require users in
their own top-level mem cgroup subtree to be accounted for with
hierarchical usage.  In other words, they can longer evade the oom killer
by using other controllers or subcontainers.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroup-v2.txt | 12 ++++++++++--
 include/linux/memcontrol.h  |  5 +++++
 mm/memcontrol.c             | 12 +++++++++---
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1078,6 +1078,11 @@ PAGE_SIZE multiple when read back.
 	memory consumers; that is, they will compare mem cgroup usage rather
 	than process memory footprint.  See the "OOM Killer" section.
 
+	If "tree", the OOM killer will compare mem cgroups and its subtree
+	as indivisible memory consumers when selecting a hierarchy.  This
+	policy cannot be set on the root mem cgroup.  See the "OOM Killer"
+	section.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1301,6 +1306,9 @@ There are currently two available oom policies:
    subtree as an OOM victim and kill at least one process, depending on
    memory.oom_group, from it.
 
+ - "tree": choose the cgroup with the largest memory footprint considering
+   itself and its subtree and kill at least one process.
+
 When selecting a cgroup as a victim, the OOM killer will kill the process
 with the largest memory footprint.  A user can control this behavior by
 enabling the per-cgroup memory.oom_group option.  If set, it causes the
@@ -1314,8 +1322,8 @@ Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
 significant memory footprint may affect OOM victim selection logic.
 If it's a case, please, consider creating a common ancestor for
-the source and destination memory cgroups and enabling oom_group
-on ancestor layer.
+the source and destination memory cgroups and setting a policy of "tree"
+and enabling oom_group on an ancestor layer.
 
 
 IO
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,6 +69,11 @@ enum memcg_oom_policy {
 	 * mem cgroup as an indivisible consumer
 	 */
 	MEMCG_OOM_POLICY_CGROUP,
+	/*
+	 * Tree cgroup usage for all descendant memcg groups, treating each mem
+	 * cgroup and its subtree as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_TREE,
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2728,7 +2728,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
-	 * Non-leaf oom_group cgroups accumulating score of descendant
+	 * Cgroups with oom policy of "tree" accumulate the score of descendant
 	 * leaf memory cgroups.
 	 */
 	rcu_read_lock();
@@ -2737,10 +2737,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 
 		/*
 		 * We don't consider non-leaf non-oom_group memory cgroups
-		 * as OOM victims.
+		 * without the oom policy of "tree" as OOM victims.
 		 */
 		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
-		    !mem_cgroup_oom_group(iter))
+		    !mem_cgroup_oom_group(iter) &&
+		    iter->oom_policy != MEMCG_OOM_POLICY_TREE)
 			continue;
 
 		/*
@@ -5538,6 +5539,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	case MEMCG_OOM_POLICY_CGROUP:
 		seq_puts(m, "cgroup\n");
 		break;
+	case MEMCG_OOM_POLICY_TREE:
+		seq_puts(m, "tree\n");
+		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
 		seq_puts(m, "none\n");
@@ -5556,6 +5560,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
 	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
+	else if (!memcmp("tree", buf, min(sizeof("tree")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_TREE;
 	else
 		ret = -EINVAL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2018-02-01 10:11 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-01-17  2:14 [patch -mm 0/4] mm, memcg: introduce oom policies David Rientjes
2018-01-17  2:15 ` [patch -mm 1/4] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
2018-01-17  2:15 ` [patch -mm 2/4] mm, memcg: replace cgroup aware oom killer mount option with tunable David Rientjes
     [not found] ` <alpine.DEB.2.10.1801161812550.28198-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2018-01-17  2:15   ` [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable David Rientjes
2018-01-17 15:41     ` Tejun Heo
2018-01-17 16:00       ` Michal Hocko
2018-01-17 22:18         ` David Rientjes
2018-01-23 15:13           ` Michal Hocko
2018-01-17 22:14       ` David Rientjes
2018-01-19 20:53         ` David Rientjes
     [not found]           ` <alpine.DEB.2.10.1801191251080.177541-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2018-01-20 12:32             ` Tejun Heo
2018-01-22 22:34               ` David Rientjes
2018-01-23 15:53                 ` Michal Hocko
2018-01-23 22:22                   ` David Rientjes
2018-01-24  8:20                     ` Michal Hocko
2018-01-24 21:44                       ` David Rientjes
     [not found]                         ` <alpine.DEB.2.10.1801241340310.24330-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2018-01-24 22:08                           ` Andrew Morton
2018-01-24 22:18                             ` Tejun Heo
     [not found]                             ` <20180124140805.b4eb437c6fe9dadb67a32e8a-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2018-01-25  8:11                               ` Michal Hocko
2018-01-25  8:05                           ` Michal Hocko
2018-01-25 23:27                             ` David Rientjes
2018-01-26 10:07                               ` Michal Hocko
2018-01-26 22:33                                 ` David Rientjes
2018-01-17  2:15 ` [patch -mm 4/4] mm, memcg: add hierarchical usage oom policy David Rientjes
2018-01-17 11:46 ` [patch -mm 0/4] mm, memcg: introduce oom policies Roman Gushchin
2018-01-17 22:31   ` David Rientjes
2018-01-25 23:53 ` [patch -mm v2 0/3] " David Rientjes
2018-01-25 23:53   ` [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable David Rientjes
2018-01-26 17:15     ` Michal Hocko
2018-01-29 22:38       ` David Rientjes
     [not found]         ` <alpine.DEB.2.10.1801291418150.29670-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2018-01-30  8:50           ` Michal Hocko
2018-01-30 22:38             ` David Rientjes
2018-01-31  9:47               ` Michal Hocko
2018-02-01 10:11                 ` David Rientjes
2018-01-25 23:53   ` [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable David Rientjes
2018-01-26  0:00     ` Andrew Morton
2018-01-26 22:20       ` David Rientjes
2018-01-26 22:39         ` Andrew Morton
2018-01-26 22:52           ` David Rientjes
2018-01-27  0:17             ` Andrew Morton
2018-01-29 10:46               ` Michal Hocko
     [not found]                 ` <20180129104657.GC21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2018-01-29 19:11                   ` Tejun Heo
2018-01-30  8:54                     ` Michal Hocko
     [not found]                       ` <20180130085445.GQ21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2018-01-30 11:58                         ` Roman Gushchin
     [not found]                           ` <20180130115846.GA4720-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2018-01-30 12:08                             ` Michal Hocko
2018-01-30 12:13                               ` Roman Gushchin
     [not found]                                 ` <20180130121315.GA5888-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2018-01-30 12:20                                   ` Michal Hocko
2018-01-30 15:15                                     ` Tejun Heo
     [not found]                                     ` <20180130122011.GB21609-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2018-01-30 17:30                                       ` Johannes Weiner
2018-01-30 19:39                                       ` Andrew Morton
2018-01-29 22:16               ` David Rientjes
2018-01-25 23:53   ` [patch -mm v2 3/3] mm, memcg: add hierarchical usage oom policy David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox