* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Michal Hocko @ 2018-06-05 11:47 UTC
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
  David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

It seems that this is still in limbo, mostly because of David's concerns.
So let me reiterate them and provide my POV once more (and for the last
time) just to help Andrew make a decision:

1) Comparing the root memcg with tail memcgs during OOM kill selection
is not fair because we are comparing tasks with memcgs.

This is true, but I do not think this matters much for workloads which
are going to use the feature. Why? Because the main consumers of the new
feature seem to be containers, which really need some fairness when
comparing _workloads_ rather than processes. Those are unlikely to
contain any significant memory consumers in the root memcg; that would
be mostly common infrastructure.

Is this fixable? Yes, we would need to account in the root memcg.
Why are we not doing that now? Because it has some negligible
performance overhead. Are there other ways? Yes, we can approximate root
memcg memory consumption, but I would rather wait for somebody who sees
this as a real problem than add hacks now without a strong reason.

2) Evading the oom killer by attaching processes to child cgroups, which
basically means that a task can split up the workload into smaller
memcgs to hide its real memory consumption.

Again true, but not really anything new. Processes can already fork and
split up the memory consumption. Moreover, it doesn't even require any
special privileges to do so, unlike creating a sub memcg. Is this
fixable?
Yes, untrusted workloads can have group oom evaluation set up at the
delegation layer so all subgroups are considered together.

3) Userspace has zero control over oom kill selection in leaf mem
cgroups.

Again true, but this is something that needs a careful evaluation so we
do not end up in the fiasco we have seen with oom_score*. Current users
demanding this feature can live without any prioritization, so blocking
the whole feature seems unreasonable.

4) Future extensibility needs to be backward compatible.

David is wrong here IMHO. Any prioritization or oom selection policy
controls added in the future are orthogonal to the oom_group concept
added by this patchset. Allowing a memcg to be an oom entity is
something that we really want long term. Global CGRP_GROUP_OOM is the
most restrictive semantic, and softening it will be possible by adding
a new knob to tell whether a memcg/hierarchy is a workload or a set of
tasks.
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
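As a concrete sketch of the delegation-layer setup mentioned above: with
cgroup v2, treating a delegated subtree as one OOM unit comes down to a
single knob write. The paths below are illustrative, and the knob name
follows the memory.oom.group interface that eventually landed upstream
(the patchset under discussion used a memcg-level group-kill knob with
slightly different selection semantics), so treat this as a hedged
sketch rather than the exact v13 behavior:

```shell
# Illustrative paths; requires root and the cgroup v2 memory controller.
CG=/sys/fs/cgroup/untrusted-workload   # delegation boundary for the workload

# Treat the delegated subtree as an indivisible workload: if the OOM
# killer picks a victim inside it, the whole group is killed together,
# so splitting into sub-memcgs no longer hides memory consumption.
echo 1 > "$CG/memory.oom.group"

cat "$CG/memory.oom.group"   # 1 => group OOM behavior is enabled
```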
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Michal Hocko @ 2018-06-05 12:13 UTC
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
  David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

On Tue 05-06-18 13:47:29, Michal Hocko wrote:
> It seems that this is still in limbo, mostly because of David's
> concerns. So let me reiterate them and provide my POV once more (and
> for the last time) just to help Andrew make a decision:

Sorry, I forgot to add a reference to the email with David's full
reasoning. Here it is:
http://lkml.kernel.org/r/alpine.DEB.2.10.1801091556490.173445@chino.kir.corp.google.com

> 1) Comparing the root memcg with tail memcgs during OOM kill selection
> is not fair because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the
> new feature seem to be containers, which really need some fairness
> when comparing _workloads_ rather than processes. Those are unlikely
> to contain any significant memory consumers in the root memcg; that
> would be mostly common infrastructure.
>
> Is this fixable? Yes, we would need to account in the root memcg.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes, we can approximate
> root memcg memory consumption, but I would rather wait for somebody
> who sees this as a real problem than add hacks now without a strong
> reason.
>
> 2) Evading the oom killer by attaching processes to child cgroups,
> which basically means that a task can split up the workload into
> smaller memcgs to hide its real memory consumption.
>
> Again true, but not really anything new.
> Processes can already fork and split up the memory consumption.
> Moreover, it doesn't even require any special privileges to do so,
> unlike creating a sub memcg. Is this fixable? Yes, untrusted workloads
> can have group oom evaluation set up at the delegation layer so all
> subgroups are considered together.
>
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true, but this is something that needs a careful evaluation so
> we do not end up in the fiasco we have seen with oom_score*. Current
> users demanding this feature can live without any prioritization, so
> blocking the whole feature seems unreasonable.
>
> 4) Future extensibility needs to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in the future are orthogonal to the oom_group concept
> added by this patchset. Allowing a memcg to be an oom entity is
> something that we really want long term. Global CGRP_GROUP_OOM is the
> most restrictive semantic, and softening it will be possible by adding
> a new knob to tell whether a memcg/hierarchy is a workload or a set of
> tasks.
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: David Rientjes @ 2018-07-13 21:59 UTC
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
  Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

On Tue, 5 Jun 2018, Michal Hocko wrote:

> 1) Comparing the root memcg with tail memcgs during OOM kill selection
> is not fair because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the
> new feature seem to be containers, which really need some fairness
> when comparing _workloads_ rather than processes. Those are unlikely
> to contain any significant memory consumers in the root memcg; that
> would be mostly common infrastructure.
>

There are users (us) who want to use the feature, and not all processes
are attached to leaf mem cgroups. The functionality can be provided in a
generally useful way that doesn't require any specific hierarchy, and I
implemented this in my patch series at
https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal fixes
*all* of my concerns with the cgroup-aware oom killer as it sits in -mm;
in its entirety, it only extends it so it is generally useful and does
not remove any functionality. I'm not sure why we are discussing ways of
moving forward when that patchset has been waiting for review for almost
four months and, to date, I haven't seen an objection to it. I don't
know why we cannot agree on making solutions generally useful, nor why
that patchset has not been merged into -mm so that the whole feature can
be merged. It's baffling.
This is the first time I've encountered a perceived stalemate where
there is a patchset sitting, unreviewed, that fixes all of the concerns
raised about the implementation sitting in -mm.

This isn't a criticism just of comparing processes attached to root
differently than leaf mem cgroups; it's also how oom_score_adj
influences that decision. It's trivial for a very small consumer (not a
"significant" memory consumer, as you put it) to cause an oom kill from
root instead of a leaf mem cgroup. I show in
https://marc.info/?l=linux-mm&m=152175564104468&w=2 that changing the
oom_score_adj of my bash shell attached to the root mem cgroup makes it
considered equal to a 95GB leaf mem cgroup with the current
implementation.

> Is this fixable? Yes, we would need to account in the root memcg.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes, we can approximate
> root memcg memory consumption, but I would rather wait for somebody
> who sees this as a real problem than add hacks now without a strong
> reason.
>

I fixed this in https://marc.info/?t=152175564500007&r=1&w=2, and from
what I remember Roman actually liked it.

> 2) Evading the oom killer by attaching processes to child cgroups,
> which basically means that a task can split up the workload into
> smaller memcgs to hide its real memory consumption.
>
> Again true, but not really anything new. Processes can already fork
> and split up the memory consumption. Moreover, it doesn't even require
> any special privileges to do so, unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can have group oom evaluation set up
> at the delegation layer so all subgroups are considered together.
>

Processes being able to fork to split up memory consumption is also
fixed by https://marc.info/?l=linux-mm&m=152175564104467, just as
creating subcontainers to intentionally or unintentionally subvert the
oom policy is fixed. It solves both problems.
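For readers less familiar with the per-process knob being argued about,
oom_score_adj is the existing /proc interface that biases OOM victim
selection; a minimal illustration of its mechanics (the value 500 is
arbitrary, chosen only to show a bias toward being killed):

```shell
# Inspect the calling shell's OOM badness adjustment.
# Range is -1000 (never kill) to +1000 (kill first); the default is 0.
cat /proc/self/oom_score_adj

# Any process may raise its own value (make itself more killable);
# lowering it below its current value requires CAP_SYS_RESOURCE.
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj   # prints 500
```

David's complaint above is precisely that this per-process value is
weighed against whole-memcg usage in the -mm implementation, so a small
process with a raised adjustment can rival a many-gigabyte cgroup.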
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true, but this is something that needs a careful evaluation so
> we do not end up in the fiasco we have seen with oom_score*. Current
> users demanding this feature can live without any prioritization, so
> blocking the whole feature seems unreasonable.
>

One objection here is that the oom_score_adj of a process means
something or doesn't mean something depending on what cgroup it is
attached to. The cgroup-aware oom killer is cgroup aware; oom_score_adj
should play no part. I fixed this with
https://marc.info/?t=152175564500007&r=1&w=2.

The other objection is that users do have cgroups that shouldn't be oom
killed because they are important, either because they are required to
provide a service for a smaller cgroup or because of business goals. We
have cgroups that use more than half of system memory, and they are
allowed to do so because they are important. We would love to be able
to bias against such a cgroup to prefer others, or to prefer cgroups
for oom kill because they are less important. This was done for
processes with oom_score_adj; we need it for a cgroup-aware oom killer
for the same reason. But notice even in
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that I said priority
or adjustment can be added on top of the feature after it's merged. This
itself is not precluding anything from being merged.

> 4) Future extensibility needs to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in the future are orthogonal to the oom_group concept
> added by this patchset. Allowing a memcg to be an oom entity is
> something that we really want long term. Global CGRP_GROUP_OOM is the
> most restrictive semantic, and softening it will be possible by adding
> a new knob to tell whether a memcg/hierarchy is a workload or a set of
> tasks.

I've always said that the mechanism and policy in this patchset should
be separated.
I do that exact thing in
https://marc.info/?l=linux-mm&m=152175564304469&w=2. I suggest that
different subtrees will want (or the admin will require) different
behaviors with regard to the mechanism.

I've stated the problems (and there are others wrt mempolicy selection)
that the current implementation has and given a full solution at
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that has not been
reviewed. I would love feedback from anybody on this thread on that.

I'm not trying to preclude the cgroup-aware oom killer from being
merged; I'm the only person actively trying to get it merged. Thanks.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Tetsuo Handa @ 2018-07-14 1:55 UTC
To: David Rientjes, Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
  Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
  linux-kernel

On 2018/07/14 6:59, David Rientjes wrote:
> I'm not trying to preclude the cgroup-aware oom killer from being
> merged; I'm the only person actively trying to get it merged.

Before merging the cgroup-aware oom killer, can we merge the OOM lockup
fixes and my cleanup first? The gap between linux.git and linux-next.git
keeps us from using an agreed-upon baseline.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Tetsuo Handa @ 2018-07-16 21:13 UTC
To: Andrew Morton
Cc: David Rientjes, Michal Hocko, Roman Gushchin, linux-mm,
  Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

No response from Roman and David...

Andrew, will you drop Roman's cgroup-aware OOM killer and David's
patches for now? Roman's series has a bug which I mentioned and which
can be avoided by my patch. David's patch is using MMF_UNSTABLE
incorrectly such that it might start selecting the next OOM victim
without trying to reclaim any memory.

Since they are not responding to my mail, I suggest dropping them from
linux-next for now.

https://www.spinics.net/lists/linux-mm/msg153212.html
https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u

On 2018/07/14 10:55, Tetsuo Handa wrote:
> On 2018/07/14 6:59, David Rientjes wrote:
>> I'm not trying to preclude the cgroup-aware oom killer from being
>> merged; I'm the only person actively trying to get it merged.
>
> Before merging the cgroup-aware oom killer, can we merge the OOM
> lockup fixes and my cleanup first? The gap between linux.git and
> linux-next.git keeps us from using an agreed-upon baseline.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Roman Gushchin @ 2018-07-16 22:09 UTC
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
  Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> No response from Roman and David...
>
> Andrew, will you drop Roman's cgroup-aware OOM killer and David's
> patches for now? Roman's series has a bug which I mentioned and which
> can be avoided by my patch. David's patch is using MMF_UNSTABLE
> incorrectly such that it might start selecting the next OOM victim
> without trying to reclaim any memory.
>
> Since they are not responding to my mail, I suggest dropping them from
> linux-next for now.

I was in cc, and didn't think that you were expecting something from me.

I don't get why it's necessary to drop the cgroup oom killer to merge
your fix. I'm happy to help with rebasing and everything else.

Thanks,
Roman
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Tetsuo Handa @ 2018-07-17 0:55 UTC
To: Roman Gushchin
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
  Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> > No response from Roman and David...
> >
> > Andrew, will you drop Roman's cgroup-aware OOM killer and David's
> > patches for now? Roman's series has a bug which I mentioned and
> > which can be avoided by my patch. David's patch is using
> > MMF_UNSTABLE incorrectly such that it might start selecting the next
> > OOM victim without trying to reclaim any memory.
> >
> > Since they are not responding to my mail, I suggest dropping them
> > from linux-next for now.
>
> I was in cc, and didn't think that you were expecting something from
> me.

Oops. I was waiting for your response. ;-)

But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM
killer" patch in linux-next. And it seems to me that your patch contains
a bug which leads to premature memory allocation failure, explained
below.

Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer"
patch (which eliminates "delay" and "out:" from your patch) so that
people can easily backport my patch? Or do you want to apply a fix
(which eliminates "delay" and "out:" from linux-next) prior to my patch?

> I don't get why it's necessary to drop the cgroup oom killer to merge
> your fix. I'm happy to help with rebasing and everything else.

Yes, I wish you would rebase your series on top of the OOM lockup
(CVE-2016-10723) mitigation patch
( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial
change and easy to cleanly backport (if applied before your series).
Also, I expect you to check whether my cleanup patch which removes the
"abort" path ( [PATCH 1/2] at
https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps simplify
your series. I don't know the detailed behavior of your series, but I
assume that your series does not kill threads whose MMF_OOM_SKIP the
current thread should not wait for.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Tetsuo Handa @ 2018-07-31 14:14 UTC
To: Roman Gushchin, Andrew Morton
Cc: David Rientjes, Michal Hocko, linux-mm, Vladimir Davydov,
  Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
  linux-kernel

On 2018/07/17 9:55, Tetsuo Handa wrote:
>> I don't get why it's necessary to drop the cgroup oom killer to merge
>> your fix. I'm happy to help with rebasing and everything else.
>
> Yes, I wish you would rebase your series on top of the OOM lockup
> (CVE-2016-10723) mitigation patch
> ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a
> trivial change and easy to cleanly backport (if applied before your
> series).
>
> Also, I expect you to check whether my cleanup patch which removes the
> "abort" path ( [PATCH 1/2] at
> https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps simplify
> your series. I don't know the detailed behavior of your series, but I
> assume that your series does not kill threads whose MMF_OOM_SKIP the
> current thread should not wait for.

syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258

I can't tell what change is triggering this race. Maybe the removal of
oom_lock from the oom reaper made it more likely to hit. But anyway, I
suspect that

  static bool oom_kill_memcg_victim(struct oom_control *oc)
  {
          if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
                  return oc->chosen_memcg; // <= This line is still broken

because

          /* We have one or more terminating processes at this point. */
          oc->chosen_task = INFLIGHT_VICTIM;

is not called.

Also, that patch is causing confusion by reviving
schedule_timeout_killable(1) with oom_lock held.

Can we temporarily drop the cgroup-aware OOM killer from linux-next.git
and apply my cleanup patch?
Since the merge window is approaching, I really want to see what the
next -rc1 will look like...
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Roman Gushchin @ 2018-08-01 16:37 UTC
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
  Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> On 2018/07/17 9:55, Tetsuo Handa wrote:
> >> I don't get why it's necessary to drop the cgroup oom killer to
> >> merge your fix. I'm happy to help with rebasing and everything
> >> else.
> >
> > Yes, I wish you would rebase your series on top of the OOM lockup
> > (CVE-2016-10723) mitigation patch
> > ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a
> > trivial change and easy to cleanly backport (if applied before your
> > series).
> >
> > Also, I expect you to check whether my cleanup patch which removes
> > the "abort" path ( [PATCH 1/2] at
> > https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps simplify
> > your series. I don't know the detailed behavior of your series, but
> > I assume that your series does not kill threads whose MMF_OOM_SKIP
> > the current thread should not wait for.
>
> syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
> https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
>
> I can't tell what change is triggering this race. Maybe the removal of
> oom_lock from the oom reaper made it more likely to hit.
> But anyway, I suspect that
>
>   static bool oom_kill_memcg_victim(struct oom_control *oc)
>   {
>           if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
>                   return oc->chosen_memcg; // <= This line is still broken
>
> because
>
>           /* We have one or more terminating processes at this point. */
>           oc->chosen_task = INFLIGHT_VICTIM;
>
> is not called.
>
> Also, that patch is causing confusion by reviving
> schedule_timeout_killable(1) with oom_lock held.
>
> Can we temporarily drop the cgroup-aware OOM killer from
> linux-next.git and apply my cleanup patch? Since the merge window is
> approaching, I really want to see what the next -rc1 will look like...

Hi Tetsuo!

Has this cleanup patch been acked by somebody? Which problem does it
solve? Dropping patches for the sake of a cleanup (if it's a cleanup)
sounds a bit strange.

Anyway, there is a good chance that the current cgroup-aware OOM killer
implementation will be replaced by a lightweight version
(memory.oom.group). Please take a look at it; your cleanup will probably
not conflict with it at all.

Thanks!
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Tetsuo Handa @ 2018-08-01 22:01 UTC
To: Roman Gushchin, Andrew Morton
Cc: David Rientjes, Michal Hocko, Vladimir Davydov, Johannes Weiner,
  Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

On 2018/08/02 1:37, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
>> Can we temporarily drop the cgroup-aware OOM killer from
>> linux-next.git and apply my cleanup patch? Since the merge window is
>> approaching, I really want to see what the next -rc1 will look
>> like...
>
> Hi Tetsuo!
>
> Has this cleanup patch been acked by somebody?

Not yet. But since Michal considers this cleanup "a nice shortcut"
( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume
that I will get an ACK for this cleanup.

> Which problem does it solve?

It simplifies the tricky out_of_memory() return value decision, and it
also fixes a bug in your series which syzbot is pointing out.

> Dropping patches for the sake of a cleanup (if it's a cleanup) sounds
> a bit strange.

What I need is a git tree which I can use as a baseline for making this
cleanup. linux.git is not suitable because it does not include Michal's
fix, but linux-next.git is not suitable either because Michal's fix is
overwritten by your series. I want a git tree which includes Michal's
fix and does not include your series.

> Anyway, there is a good chance that the current cgroup-aware OOM
> killer implementation will be replaced by a lightweight version
> (memory.oom.group). Please take a look at it; your cleanup will
> probably not conflict with it at all.

Then, please drop the current cgroup-aware OOM killer implementation
from linux-next.git.
I want to see what the next -rc1 will look like (for testing purposes)
and want to use linux-next.git as a baseline (for making this cleanup).

> Thanks!
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Roman Gushchin @ 2018-08-01 22:55 UTC
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, Vladimir Davydov,
  Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
  linux-kernel, linux-mm

On Thu, Aug 02, 2018 at 07:01:28AM +0900, Tetsuo Handa wrote:
> On 2018/08/02 1:37, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> >> Can we temporarily drop the cgroup-aware OOM killer from
> >> linux-next.git and apply my cleanup patch? Since the merge window
> >> is approaching, I really want to see what the next -rc1 will look
> >> like...
> >
> > Hi Tetsuo!
> >
> > Has this cleanup patch been acked by somebody?
>
> Not yet. But since Michal considers this cleanup "a nice shortcut"
> ( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I
> assume that I will get an ACK for this cleanup.
>
> > Which problem does it solve?
>
> It simplifies the tricky out_of_memory() return value decision, and it
> also fixes a bug in your series which syzbot is pointing out.
>
> > Dropping patches for the sake of a cleanup (if it's a cleanup)
> > sounds a bit strange.
>
> What I need is a git tree which I can use as a baseline for making
> this cleanup. linux.git is not suitable because it does not include
> Michal's fix, but linux-next.git is not suitable either because
> Michal's fix is overwritten by your series. I want a git tree which
> includes Michal's fix and does not include your series.
>
> > Anyway, there is a good chance that the current cgroup-aware OOM
> > killer implementation will be replaced by a lightweight version
> > (memory.oom.group). Please take a look at it; your cleanup will
> > probably not conflict with it at all.
>
> Then, please drop the current cgroup-aware OOM killer implementation
> from linux-next.git.
> I want to see what the next -rc1 will look like (for testing purposes)
> and want to use linux-next.git as a baseline (for making this
> cleanup).

I'll post memory.oom.group v2 later today, and if there are no
objections, I'll ask Andrew to drop the current memcg-aware OOM killer
and replace it with the lightweight memory.oom.group. These changes will
be picked up by linux-next in a few days.

Thanks!
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
From: Michal Hocko @ 2018-07-16 9:36 UTC
To: David Rientjes
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
  Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
  linux-doc, linux-kernel

On Fri 13-07-18 14:59:59, David Rientjes wrote:
> On Tue, 5 Jun 2018, Michal Hocko wrote:
>
> > 1) Comparing the root memcg with tail memcgs during OOM kill
> > selection is not fair because we are comparing tasks with memcgs.
> >
> > This is true, but I do not think this matters much for workloads
> > which are going to use the feature. Why? Because the main consumers
> > of the new feature seem to be containers, which really need some
> > fairness when comparing _workloads_ rather than processes. Those are
> > unlikely to contain any significant memory consumers in the root
> > memcg; that would be mostly common infrastructure.
> >
>
> There are users (us) who want to use the feature, and not all
> processes are attached to leaf mem cgroups. The functionality can be
> provided in a generally useful way that doesn't require any specific
> hierarchy, and I implemented this in my patch series at
> https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal
> fixes *all* of my concerns with the cgroup-aware oom killer as it sits
> in -mm; in its entirety, it only extends it so it is generally useful
> and does not remove any functionality. I'm not sure why we are
> discussing ways of moving forward when that patchset has been waiting
> for review for almost four months and, to date, I haven't seen an
> objection to it.

Well, I didn't really get to your patches yet. The last time I checked,
I had some pretty serious concerns about the consistency of your
proposal.
Those might have been fixed in the lastest version of your patchset I haven't seen. But I still strongly suspect that you are largerly underestimating the complexity of more generic oom policies which you are heading to. Considering user API failures from the past (oom_*adj fiasco for example) suggests that we should start with smaller steps and only provide a clear and simple API. oom_group is such a simple and semantically consistent thing which is the reason I am OK with it much more than your "we can be more generic" approach. I simply do not trust we can agree on sane and consistent api in a reasonable time. And it is quite mind boggling that a simpler approach has been basically blocked for months because there are some concerns for workloads which are not really asking for the feature. Sure your usecase might need to handle root memcg differently. That is a fair point but that shouldn't really block containers users who can use the proposed solution without any further changes. If we ever decide to handle root memcg differently we are free to do so because the oom selection policy is not carved in stone by any api. [...] -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-16  9:36 ` Michal Hocko
@ 2018-07-17  3:59   ` David Rientjes
  0 siblings, 0 replies; 13+ messages in thread
From: David Rientjes @ 2018-07-17  3:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
      Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
      linux-doc, linux-kernel, linux-mm

On Mon, 16 Jul 2018, Michal Hocko wrote:

> Well, I didn't really get to your patches yet. The last time I checked,
> I had some pretty serious concerns about the consistency of your
> proposal. Those might have been fixed in the latest version of your
> patchset, which I haven't seen. But I still strongly suspect that you
> are largely underestimating the complexity of the more generic oom
> policies you are heading toward.
>

I don't believe it's underestimated, since it's in use. It's perfectly
valid to lock an entire hierarchy or individual subtrees into a single
policy if that's what is preferred. Any use of a different policy at a
subtree root is a conscious decision made by the owner of that subtree.
If they prefer to kill the largest process, the largest descendant
cgroup, or the largest subtree, it is up to them. All three have valid
usecases. The goal is not to lock the entire hierarchy into a single
policy: that introduces the ability for users to subvert the selection
policy, either intentionally or unintentionally, because they are using
a unified single hierarchy with cgroup v2 and controllers other than
mem cgroup.

> Considering user API failures from the past (the oom_*adj fiasco, for
> example), we should start with smaller steps and provide only a clear
> and simple API. oom_group is such a simple and semantically consistent
> thing, which is the reason I am much more OK with it than with your
> "we can be more generic" approach. I simply do not trust that we can
> agree on a sane and consistent API in a reasonable time.
>
> And it is quite mind-boggling that a simpler approach has been
> basically blocked for months because of concerns about workloads which
> are not really asking for the feature. Sure, your usecase might need
> to handle the root memcg differently. That is a fair point, but it
> shouldn't block container users, who can use the proposed solution
> without any further changes. If we ever decide to handle the root
> memcg differently, we are free to do so, because the oom selection
> policy is not carved in stone by any API.
>

Please respond directly to the patchset, which clearly enumerates the
problems with the current implementation in -mm.
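For concreteness, the per-subtree policy selection David describes
appeared in his posted series as a per-memcg memory.oom_policy file; the
sketch below follows that posting, but the file name and its values
should be treated as assumptions about an out-of-tree proposal, not as
-mm or mainline API:

```shell
# Illustrative sketch of David's proposed per-subtree policy knob.
# memory.oom_policy and its values follow his series as posted to
# linux-mm; none of this exists in -mm or mainline, and the cgroup
# paths are made up.
echo cgroup > /sys/fs/cgroup/userspace/memory.oom_policy  # compare child memcgs
echo tree   > /sys/fs/cgroup/jobs/memory.oom_policy       # compare whole subtrees
echo none   > /sys/fs/cgroup/legacy/memory.oom_policy     # pick the largest process
```

The point of contention in the thread is exactly this: whether such a
per-subtree choice is worth the API surface, or whether the single
oom_group flag should go in first.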
Thread overview: 13+ messages
[not found] <20171130152824.1591-1-guro@fb.com>
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
2018-06-05 12:13 ` Michal Hocko
2018-07-13 21:59 ` David Rientjes
2018-07-14 1:55 ` Tetsuo Handa
2018-07-16 21:13 ` Tetsuo Handa
2018-07-16 22:09 ` Roman Gushchin
2018-07-17 0:55 ` Tetsuo Handa
2018-07-31 14:14 ` Tetsuo Handa
2018-08-01 16:37 ` Roman Gushchin
2018-08-01 22:01 ` Tetsuo Handa
2018-08-01 22:55 ` Roman Gushchin
2018-07-16 9:36 ` Michal Hocko
2018-07-17 3:59 ` David Rientjes