From: Roman Gushchin <guro@fb.com>
To: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	Li Zefan <lizefan@huawei.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	kernel-team@fb.com,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
Date: Thu, 18 May 2017 20:20:50 +0100
Message-ID: <20170518192050.GA1648@castle>
In-Reply-To: <CAKTCnzkBNV9bsQSg4kzhxY=i=-y3x78StbbXfV9mvXLsJhGHig@mail.gmail.com>

On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer is operating on a process level.
> >> Under oom conditions, it finds a process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit well the system with many running
> >> containers. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in a case of a system-wide OOM.
> >>
> >> To address these issues, cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for a memcg with highest oom score,
> >> and kills all processes inside.
> >>
> >> Memcg oom score is calculated as a size of active and inactive
> >> anon LRU lists, unevictable LRU list and swap size.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
> 
> I agree, I think it needs more thought. I wonder if the real issue is something
> else. For example
> 
> 1. Did we overcommit a particular container too much?

Imagine you have a machine with multiple containers,
each with its own process tree, and the machine is overcommitted,
i.e. the sum of the containers' memory limits is larger than the amount
of available RAM.

In the case of a system-wide OOM, some random container will be affected.
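
As a rough illustration of the overcommit condition, here is a minimal
Python sketch. The container names, limits and RAM size are made-up
numbers chosen for the example, not measurements from a real machine.

```python
GIB = 1 << 30  # bytes in one GiB


def is_overcommitted(limits, ram_bytes):
    """Return True if the sum of per-container memory limits exceeds RAM."""
    return sum(limits.values()) > ram_bytes


# Hypothetical machine: 64 GiB of RAM shared by three containers.
limits = {"web": 32 * GIB, "db": 32 * GIB, "batch": 16 * GIB}
ram = 64 * GIB

# The limits add up to 80 GiB, so the machine is overcommitted.
print(is_overcommitted(limits, ram))
```

Each container can stay within its own limit and the machine can still
run out of memory, which is the system-wide OOM case.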

Historically, this problem was solved by a userspace daemon
which monitored OOM events and cleaned up the affected containers.
But this approach can't solve the main problem: non-optimal selection
of a victim.
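
To make the selection problem concrete, here is a toy model in Python.
It is a sketch only, not the code from the patch: the Memcg class, the
helper names and all sizes (in MiB) are hypothetical. The per-process
"score" is simplified to a plain process size, and the memcg score
follows the RFC description (active and inactive anon LRU, unevictable
LRU, and swap).

```python
from dataclasses import dataclass, field


@dataclass
class Memcg:
    name: str
    active_anon: int
    inactive_anon: int
    unevictable: int
    swap: int
    proc_sizes: list = field(default_factory=list)  # per-process size, MiB


def memcg_score(cg):
    """Score as described in the RFC: anon LRUs + unevictable LRU + swap."""
    return cg.active_anon + cg.inactive_anon + cg.unevictable + cg.swap


def pick_process_victim(cgroups):
    """Process-level selection: the single largest process wins."""
    candidates = ((cg.name, size) for cg in cgroups for size in cg.proc_sizes)
    return max(candidates, key=lambda item: item[1])


def pick_memcg_victim(cgroups):
    """Cgroup-aware selection: the memcg with the highest score wins."""
    return max(cgroups, key=memcg_score)


# A small container with two big processes ...
small = Memcg("small", active_anon=1500, inactive_anon=500,
              unevictable=0, swap=0, proc_sizes=[1000, 1000])
# ... and a large container with many small processes.
large = Memcg("large", active_anon=6000, inactive_anon=1500,
              unevictable=0, swap=500, proc_sizes=[10] * 800)

cgroups = [small, large]
print(pick_process_victim(cgroups)[0])  # container hit by process-level OOM
print(pick_memcg_victim(cgroups).name)  # container hit by cgroup-aware OOM
```

In this model the process-level choice lands in the small container,
although the large one uses four times as much memory. A daemon that
only reacts after the kill cannot change that choice.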

> 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> the problem?

I don't think it's related.

> 3. We have oom notifiers now, could those be used (assuming you are interested
> in non-memcg-related OOMs affecting a container)?

They can be used to inform a userspace daemon about an OOM that has
already happened, but they do not affect victim selection.

> 4. How do we determine limits for these containers? From a fairness
> perspective

Limits are usually set based on a high-level understanding of the nature
of the tasks running inside, but overcommitting the machine is
commonplace, I assume.

Thank you!

Roman
