Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
To: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Vladimir Davydov
	<vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Tetsuo Handa
	<penguin-kernel-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	kernel-team-b10kYP2dOMg@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
Date: Tue, 10 Oct 2017 23:04:17 +0100	[thread overview]
Message-ID: <20171010220417.GA8667@castle> (raw)
In-Reply-To: <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>

On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats for root
> > memory cgroup. If we'll start gathering them at some point, we can change this
> > and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 10000 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 10000 processes of 80MB rss each is the most 
> likely contributor to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> the example of 10000 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.

It's hard to imagine a "healthy" setup with 10000 process in the root
memory cgroup, and even if we kill 1 process we will still have 9999
remaining process. I agree with you at some point, but it's not
a real world example.

> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other leaf cgroups did not do this.
> > > 
> > > Effectively:
> > > 
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > will radically shift the heuristic from a score of all anonymous + 
> > > unevictable memory for all processes to a score of the largest anonymous +
> > > unevictable memory for a single process.  There is no downside or 
> > > ramifaction for the end user in doing this.  When comparing cgroups based 
> > > on usage, it only makes sense to compare the hierarchical usage of that 
> > > cgroup so that attaching processes to descendants or splitting the 
> > > implementation of a process into several smaller individual processes does 
> > > not allow this heuristic to be defeated.
> > 
> > To all previously said words I can only add that cgroup v2 allows to limit
> > the amount of cgroups in the sub-tree:
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> > 
> 
> So the solution to 
> 
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> 
> evading all oom kills for your mem cgroup is to limit the number of 
> cgroups that can be created by the user?  With a unified cgroup hierarchy, 
> that doesn't work well if I wanted to actually constrain these individual 
> processes to different resource limits like cpu usage.  In fact, the user 
> may not know it is effectively evading the oom killer entirely because it 
> has constrained the cpu of individual processes because its a side-effect 
> of this heuristic.
> 
> 
> You chose not to respond to my reiteration of userspace having absolutely 
> no control over victim selection with the new heuristic without setting 
> all processes to be oom disabled via /proc/pid/oom_score_adj.  If I have a 
> very important job that is running on a system that is really supposed to 
> use 80% of memory, I need to be able to specify that it should not be oom 
> killed based on user goals.  Setting all processes to be oom disabled in 
> the important mem cgroup to avoid being oom killed unless absolutely 
> necessary in a system oom condition is not a robust solution: (1) the mem 
> cgroup livelocks if it reaches its own mem cgroup limit and (2) the system 
> panic()'s if these preferred mem cgroups are the only consumers left on 
> the system.  With overcommit, both of these possibilities exist in the 
> wild and the problem is only a result of the implementation detail of this 
> patchset.
> 
> For these reasons: unfair comparison of root mem cgroup usage to bias 
> against that mem cgroup from oom kill in system oom conditions, the 
> ability of users to completely evade the oom killer by attaching all 
> processes to child cgroups either purposefully or unpurposefully, and the 
> inability of userspace to effectively control oom victim selection:
> 
> Nacked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

So, if we'll sum the oom_score of tasks belonging to the root memory cgroup,
will it fix the problem?

It might have some drawbacks as well (especially around oom_score_adj),
but it's doable, if we'll ignore tasks which are not owners of their's mm struct.

> 
> > > This is racy because mem_cgroup_select_oom_victim() found an eligible 
> > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible 
> > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not.  It means if a 
> > > process cannot be killed because of oom_unkillable_task(), the only 
> > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the 
> > > eligible processes changed, we end up falling back to the complete 
> > > tasklist scan.  It would be better for oom_evaluate_memcg() to consider 
> > > oom_unkillable_task() and also retry in the case where 
> > > oom_kill_memcg_victim() returns NULL.
> > 
> > I agree with you here. The fallback to the existing mechanism is implemented
> > to be safe for sure, especially in a case of a global OOM. When we'll get
> > more confidence in cgroup-aware OOM killer reliability, we can change this
> > behavior. Personally, I would prefer to get rid of looking at all tasks just
> > to find a pre-existing OOM victim, but it might be quite tricky to implement.
> > 
> 
> I'm not sure what this has to do with confidence in this patchset's 
> reliability?  The race obviously exists: mem_cgroup_select_oom_victim() 
> found an eligible process in oc->chosen_memcg but it was either ineligible 
> later because of oom_unkillable_task(), it moved, or it exited.  It's a 
> race.  For users who opt-in to this new heuristic, they should not be 
> concerned with a process exiting and thus killing a completely unexpected 
> process from an unexpected memcg when it should be possible to retry and 
> select the correct victim.

Yes, I have to agree here.
Looks like we can't fallback to the original policy.

Thanks!

WARNING: multiple messages have this Message-ID (diff)

From: Roman Gushchin <guro@fb.com>
To: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
Date: Tue, 10 Oct 2017 23:04:17 +0100	[thread overview]
Message-ID: <20171010220417.GA8667@castle> (raw)
In-Reply-To: <alpine.DEB.2.10.1710101345370.28262@chino.kir.corp.google.com>

On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats for root
> > memory cgroup. If we'll start gathering them at some point, we can change this
> > and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 10000 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 10000 processes of 80MB rss each is the most 
> likely contributor to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> the example of 10000 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.

It's hard to imagine a "healthy" setup with 10000 process in the root
memory cgroup, and even if we kill 1 process we will still have 9999
remaining process. I agree with you at some point, but it's not
a real world example.

> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other leaf cgroups did not do this.
> > > 
> > > Effectively:
> > > 
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > will radically shift the heuristic from a score of all anonymous + 
> > > unevictable memory for all processes to a score of the largest anonymous +
> > > unevictable memory for a single process.  There is no downside or 
> > > ramifaction for the end user in doing this.  When comparing cgroups based 
> > > on usage, it only makes sense to compare the hierarchical usage of that 
> > > cgroup so that attaching processes to descendants or splitting the 
> > > implementation of a process into several smaller individual processes does 
> > > not allow this heuristic to be defeated.
> > 
> > To all previously said words I can only add that cgroup v2 allows to limit
> > the amount of cgroups in the sub-tree:
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> > 
> 
> So the solution to 
> 
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> 
> evading all oom kills for your mem cgroup is to limit the number of 
> cgroups that can be created by the user?  With a unified cgroup hierarchy, 
> that doesn't work well if I wanted to actually constrain these individual 
> processes to different resource limits like cpu usage.  In fact, the user 
> may not know it is effectively evading the oom killer entirely because it 
> has constrained the cpu of individual processes because its a side-effect 
> of this heuristic.
> 
> 
> You chose not to respond to my reiteration of userspace having absolutely 
> no control over victim selection with the new heuristic without setting 
> all processes to be oom disabled via /proc/pid/oom_score_adj.  If I have a 
> very important job that is running on a system that is really supposed to 
> use 80% of memory, I need to be able to specify that it should not be oom 
> killed based on user goals.  Setting all processes to be oom disabled in 
> the important mem cgroup to avoid being oom killed unless absolutely 
> necessary in a system oom condition is not a robust solution: (1) the mem 
> cgroup livelocks if it reaches its own mem cgroup limit and (2) the system 
> panic()'s if these preferred mem cgroups are the only consumers left on 
> the system.  With overcommit, both of these possibilities exist in the 
> wild and the problem is only a result of the implementation detail of this 
> patchset.
> 
> For these reasons: unfair comparison of root mem cgroup usage to bias 
> against that mem cgroup from oom kill in system oom conditions, the 
> ability of users to completely evade the oom killer by attaching all 
> processes to child cgroups either purposefully or unpurposefully, and the 
> inability of userspace to effectively control oom victim selection:
> 
> Nacked-by: David Rientjes <rientjes@google.com>

So, if we'll sum the oom_score of tasks belonging to the root memory cgroup,
will it fix the problem?

It might have some drawbacks as well (especially around oom_score_adj),
but it's doable, if we'll ignore tasks which are not owners of their's mm struct.

> 
> > > This is racy because mem_cgroup_select_oom_victim() found an eligible 
> > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible 
> > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not.  It means if a 
> > > process cannot be killed because of oom_unkillable_task(), the only 
> > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the 
> > > eligible processes changed, we end up falling back to the complete 
> > > tasklist scan.  It would be better for oom_evaluate_memcg() to consider 
> > > oom_unkillable_task() and also retry in the case where 
> > > oom_kill_memcg_victim() returns NULL.
> > 
> > I agree with you here. The fallback to the existing mechanism is implemented
> > to be safe for sure, especially in a case of a global OOM. When we'll get
> > more confidence in cgroup-aware OOM killer reliability, we can change this
> > behavior. Personally, I would prefer to get rid of looking at all tasks just
> > to find a pre-existing OOM victim, but it might be quite tricky to implement.
> > 
> 
> I'm not sure what this has to do with confidence in this patchset's 
> reliability?  The race obviously exists: mem_cgroup_select_oom_victim() 
> found an eligible process in oc->chosen_memcg but it was either ineligible 
> later because of oom_unkillable_task(), it moved, or it exited.  It's a 
> race.  For users who opt-in to this new heuristic, they should not be 
> concerned with a process exiting and thus killing a completely unexpected 
> process from an unexpected memcg when it should be possible to retry and 
> select the correct victim.

Yes, I have to agree here.
Looks like we can't fallback to the original policy.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Roman Gushchin <guro@fb.com>
To: David Rientjes <rientjes@google.com>
Cc: <linux-mm@kvack.org>, Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, <kernel-team@fb.com>,
	<cgroups@vger.kernel.org>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [v11 3/6] mm, oom: cgroup-aware OOM killer
Date: Tue, 10 Oct 2017 23:04:17 +0100	[thread overview]
Message-ID: <20171010220417.GA8667@castle> (raw)
In-Reply-To: <alpine.DEB.2.10.1710101345370.28262@chino.kir.corp.google.com>

On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats for root
> > memory cgroup. If we'll start gathering them at some point, we can change this
> > and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 10000 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 10000 processes of 80MB rss each is the most 
> likely contributor to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> the example of 10000 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.

It's hard to imagine a "healthy" setup with 10000 process in the root
memory cgroup, and even if we kill 1 process we will still have 9999
remaining process. I agree with you at some point, but it's not
a real world example.

> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other leaf cgroups did not do this.
> > > 
> > > Effectively:
> > > 
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > will radically shift the heuristic from a score of all anonymous + 
> > > unevictable memory for all processes to a score of the largest anonymous +
> > > unevictable memory for a single process.  There is no downside or 
> > > ramifaction for the end user in doing this.  When comparing cgroups based 
> > > on usage, it only makes sense to compare the hierarchical usage of that 
> > > cgroup so that attaching processes to descendants or splitting the 
> > > implementation of a process into several smaller individual processes does 
> > > not allow this heuristic to be defeated.
> > 
> > To all previously said words I can only add that cgroup v2 allows to limit
> > the amount of cgroups in the sub-tree:
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> > 
> 
> So the solution to 
> 
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> 
> evading all oom kills for your mem cgroup is to limit the number of 
> cgroups that can be created by the user?  With a unified cgroup hierarchy, 
> that doesn't work well if I wanted to actually constrain these individual 
> processes to different resource limits like cpu usage.  In fact, the user 
> may not know it is effectively evading the oom killer entirely because it 
> has constrained the cpu of individual processes because its a side-effect 
> of this heuristic.
> 
> 
> You chose not to respond to my reiteration of userspace having absolutely 
> no control over victim selection with the new heuristic without setting 
> all processes to be oom disabled via /proc/pid/oom_score_adj.  If I have a 
> very important job that is running on a system that is really supposed to 
> use 80% of memory, I need to be able to specify that it should not be oom 
> killed based on user goals.  Setting all processes to be oom disabled in 
> the important mem cgroup to avoid being oom killed unless absolutely 
> necessary in a system oom condition is not a robust solution: (1) the mem 
> cgroup livelocks if it reaches its own mem cgroup limit and (2) the system 
> panic()'s if these preferred mem cgroups are the only consumers left on 
> the system.  With overcommit, both of these possibilities exist in the 
> wild and the problem is only a result of the implementation detail of this 
> patchset.
> 
> For these reasons: unfair comparison of root mem cgroup usage to bias 
> against that mem cgroup from oom kill in system oom conditions, the 
> ability of users to completely evade the oom killer by attaching all 
> processes to child cgroups either purposefully or unpurposefully, and the 
> inability of userspace to effectively control oom victim selection:
> 
> Nacked-by: David Rientjes <rientjes@google.com>

So, if we'll sum the oom_score of tasks belonging to the root memory cgroup,
will it fix the problem?

It might have some drawbacks as well (especially around oom_score_adj),
but it's doable, if we'll ignore tasks which are not owners of their's mm struct.

> 
> > > This is racy because mem_cgroup_select_oom_victim() found an eligible 
> > > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible 
> > > process but mem_cgroup_scan_task(oc->chosen_memcg) did not.  It means if a 
> > > process cannot be killed because of oom_unkillable_task(), the only 
> > > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the 
> > > eligible processes changed, we end up falling back to the complete 
> > > tasklist scan.  It would be better for oom_evaluate_memcg() to consider 
> > > oom_unkillable_task() and also retry in the case where 
> > > oom_kill_memcg_victim() returns NULL.
> > 
> > I agree with you here. The fallback to the existing mechanism is implemented
> > to be safe for sure, especially in a case of a global OOM. When we'll get
> > more confidence in cgroup-aware OOM killer reliability, we can change this
> > behavior. Personally, I would prefer to get rid of looking at all tasks just
> > to find a pre-existing OOM victim, but it might be quite tricky to implement.
> > 
> 
> I'm not sure what this has to do with confidence in this patchset's 
> reliability?  The race obviously exists: mem_cgroup_select_oom_victim() 
> found an eligible process in oc->chosen_memcg but it was either ineligible 
> later because of oom_unkillable_task(), it moved, or it exited.  It's a 
> race.  For users who opt-in to this new heuristic, they should not be 
> concerned with a process exiting and thus killing a completely unexpected 
> process from an unexpected memcg when it should be possible to retry and 
> select the correct victim.

Yes, I have to agree here.
Looks like we can't fallback to the original policy.

Thanks!

next prev parent reply	other threads:[~2017-10-10 22:04 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-05 13:04 [v11 0/6] cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04 ` Roman Gushchin
2017-10-05 13:04 ` Roman Gushchin
2017-10-05 13:04 ` [v11 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin
2017-10-05 13:04 ` [v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin
     [not found]   ` <20171005130454.5590-3-guro-b10kYP2dOMg@public.gmane.org>
2017-10-09 21:11     ` David Rientjes
2017-10-09 21:11       ` David Rientjes
2017-10-09 21:11       ` David Rientjes
2017-10-05 13:04 ` [v11 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin
     [not found]   ` <20171005130454.5590-4-guro-b10kYP2dOMg@public.gmane.org>
2017-10-05 14:27     ` Michal Hocko
2017-10-05 14:27       ` Michal Hocko
2017-10-05 14:27       ` Michal Hocko
2017-10-09 21:52   ` David Rientjes
2017-10-09 21:52     ` David Rientjes
     [not found]     ` <alpine.DEB.2.10.1710091414260.59643-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-10  8:18       ` Michal Hocko
2017-10-10  8:18         ` Michal Hocko
2017-10-10  8:18         ` Michal Hocko
2017-10-10 12:23     ` Roman Gushchin
2017-10-10 12:23       ` Roman Gushchin
2017-10-10 21:13       ` David Rientjes
2017-10-10 21:13         ` David Rientjes
     [not found]         ` <alpine.DEB.2.10.1710101345370.28262-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-10 22:04           ` Roman Gushchin [this message]
2017-10-10 22:04             ` Roman Gushchin
2017-10-10 22:04             ` Roman Gushchin
2017-10-11 20:21             ` David Rientjes
2017-10-11 20:21               ` David Rientjes
     [not found]               ` <alpine.DEB.2.10.1710111247390.98307-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2017-10-11 21:49                 ` Roman Gushchin
2017-10-11 21:49                   ` Roman Gushchin
2017-10-11 21:49                   ` Roman Gushchin
2017-10-12 21:50                   ` David Rientjes
2017-10-12 21:50                     ` David Rientjes
2017-10-13 13:32                     ` Roman Gushchin
2017-10-13 13:32                       ` Roman Gushchin
2017-10-13 13:32                       ` Roman Gushchin
     [not found]                       ` <20171013133219.GA5363-B3w7+ongkCiLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2017-10-13 21:31                         ` David Rientjes
2017-10-13 21:31                           ` David Rientjes
2017-10-13 21:31                           ` David Rientjes
2017-10-11 13:08         ` Michal Hocko
2017-10-11 13:08           ` Michal Hocko
2017-10-11 20:27           ` David Rientjes
2017-10-11 20:27             ` David Rientjes
2017-10-12  6:33             ` Michal Hocko
2017-10-12  6:33               ` Michal Hocko
2017-10-11 16:10         ` Roman Gushchin
2017-10-11 16:10           ` Roman Gushchin
2017-10-11 16:10           ` Roman Gushchin
     [not found] ` <20171005130454.5590-1-guro-b10kYP2dOMg@public.gmane.org>
2017-10-05 13:04   ` [v11 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
2017-10-05 13:04     ` Roman Gushchin
2017-10-05 13:04     ` Roman Gushchin
2017-10-05 14:29     ` Michal Hocko
2017-10-05 14:29       ` Michal Hocko
2017-10-05 14:31     ` Michal Hocko
2017-10-05 14:31       ` Michal Hocko
     [not found]       ` <20171005143104.wo5xstpe7mhkdlbr-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-10-06 12:04         ` Roman Gushchin
2017-10-06 12:04           ` Roman Gushchin
2017-10-06 12:04           ` Roman Gushchin
2017-10-06 12:17           ` Michal Hocko
2017-10-06 12:17             ` Michal Hocko
2017-10-05 13:04 ` [v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin
2017-10-05 13:04 ` [v11 6/6] mm, oom, docs: describe the " Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin
2017-10-05 13:04   ` Roman Gushchin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171010220417.GA8667@castle \
    --to=guro-b10kyp2domg@public.gmane.org \
    --cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org \
    --cc=kernel-team-b10kYP2dOMg@public.gmane.org \
    --cc=linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
    --cc=mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=penguin-kernel-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org \
    --cc=rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.