From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Michal Hocko <mhocko@kernel.org>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
David Rientjes <rientjes@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm: oom: deduplicate victim selection code for memcg and global oom
Date: Sat, 23 Jul 2016 17:49:13 +0300 [thread overview]
Message-ID: <20160723144913.GA2027@esperanza> (raw)
In-Reply-To: <20160721124144.GB21806@cmpxchg.org>

On Thu, Jul 21, 2016 at 08:41:44AM -0400, Johannes Weiner wrote:
> On Mon, Jun 27, 2016 at 07:39:54PM +0300, Vladimir Davydov wrote:
> > When selecting an oom victim, we use the same heuristic for both memory
> > cgroup and global oom. The only difference is the scope of tasks to
> > select the victim from. So we could just export an iterator over all
> > memcg tasks and keep all oom-related logic in oom_kill.c, but instead we
> > duplicate pieces of it in memcontrol.c, reusing some initially private
> > functions of oom_kill.c so as not to duplicate all of it. That looks
> > ugly and error-prone, because any modification of select_bad_process
> > should also be propagated to mem_cgroup_out_of_memory.
> >
> > Let's rework this as follows: keep all oom heuristic code
> > private to oom_kill.c and make oom_kill.c use exported memcg functions
> > when it's really necessary (like when iterating over memcg tasks).
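
(For context, the exported iterator mentioned above could look roughly like
the sketch below - simplified, and the exact name and signature in the patch
may differ. It walks every task in the memcg subtree and invokes a callback,
so oom_kill.c never has to know how memcg tracks its tasks.)

	/*
	 * Sketch only: call fn for each task in the hierarchy rooted at
	 * memcg, stopping early if fn returns non-zero.  Built on the
	 * existing css_task_iter_* and for_each_mem_cgroup_tree helpers
	 * in mm/memcontrol.c.
	 */
	int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
				  int (*fn)(struct task_struct *, void *),
				  void *arg)
	{
		struct mem_cgroup *iter;
		int ret = 0;

		for_each_mem_cgroup_tree(iter, memcg) {
			struct css_task_iter it;
			struct task_struct *task;

			css_task_iter_start(&iter->css, &it);
			while (!ret && (task = css_task_iter_next(&it)))
				ret = fn(task, arg);
			css_task_iter_end(&it);

			if (ret) {
				mem_cgroup_iter_break(memcg, iter);
				break;
			}
		}
		return ret;
	}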
>
> This approach, with the control flow in the OOM code, makes a lot of
> sense to me. I think it's particularly useful in preparation for
> supporting cgroup-aware OOM killing, where not just individual tasks
> but entire cgroups are evaluated and killed as opaque memory units.
Yeah, that too. Also, this patch can be thought of as a preparation step
for unified oom locking and oom timeouts (provided we ever agree to add
them). Currently, there's some code in memcg trying to implement proper
locking that would allow running oom in parallel in different cgroups and
waiting until memory is actually freed instead of looping and retrying
reclaim. I think it could be reused for the global case, although that's
going to be tricky, as we need to keep supporting the legacy cgroup oom
control API.
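
To give an idea of what I'm referring to, the memcg oom locking currently
works roughly like this (heavily simplified from mm/memcontrol.c; the real
code also has lockdep annotations and a waitqueue that oom waiters sleep on):

	static DEFINE_SPINLOCK(memcg_oom_lock);

	/*
	 * Simplified sketch: try to mark every memcg in the subtree as
	 * oom-locked; back out if any of them is already handling an oom.
	 * This is what lets independent subtrees oom in parallel.
	 */
	static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
	{
		struct mem_cgroup *iter, *failed = NULL;

		spin_lock(&memcg_oom_lock);

		for_each_mem_cgroup_tree(iter, memcg) {
			if (iter->oom_lock) {
				/* This subtree is already handling an oom. */
				failed = iter;
				mem_cgroup_iter_break(memcg, iter);
				break;
			}
			iter->oom_lock = true;
		}

		if (failed) {
			/* Undo the partial locking done above. */
			for_each_mem_cgroup_tree(iter, memcg) {
				if (iter == failed) {
					mem_cgroup_iter_break(memcg, iter);
					break;
				}
				iter->oom_lock = false;
			}
		}

		spin_unlock(&memcg_oom_lock);
		return !failed;
	}

The hard part is making the global oom path take the same kind of lock on
the root of the hierarchy without breaking the legacy memory.oom_control
semantics.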
>
> I'm thinking about doing something like the following, which should be
> able to work regardless of what cgroup level - root, intermediate, or
> leaf node - the OOM killer is invoked, and this patch works toward it:
>
> struct oom_victim {
> 	bool is_memcg;
> 	union {
> 		struct task_struct *task;
> 		struct mem_cgroup *memcg;
> 	} entity;
> 	unsigned long badness;
> };
>
> oom_evaluate_memcg(oc, memcg, victim)
> {
> 	if (memcg == root) {
> 		for_each_memcg_process(p, memcg) {
> 			badness = oom_badness(oc, memcg, p);
> 			if (badness == some_special_value) {
> 				...
> 			} else if (badness > victim->badness) {
> 				victim->is_memcg = false;
> 				victim->entity.task = p;
> 				victim->badness = badness;
> 			}
> 		}
> 	} else {
> 		badness = 0;
> 		for_each_memcg_process(p, memcg) {
> 			b = oom_badness(oc, memcg, p);
> 			if (b == some_special_value)
> 				...
> 			else
> 				badness += b;
> 		}
> 		if (badness > victim->badness) {
> 			victim->is_memcg = true;
> 			victim->entity.memcg = memcg;
> 			victim->badness = badness;
Yeah, that makes sense. However, I don't think we should always kill the
whole cgroup, even if its badness is highest. IMO what should be killed
- a cgroup or a task - depends on the workload running inside the
container. Some workloads (e.g. those that fork often) can put up with
the youngest of their tasks getting oom-killed; others will just get
stuck if one of their workers is killed - for them we'd better kill the
whole container. I guess we could introduce a per-cgroup tunable defining
the oom behavior - whether the whole cgroup should be killed on oom, or
just one task/sub-cgroup inside it.
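
Just to illustrate, with a purely hypothetical per-cgroup flag (say
memcg->oom_kill_all, set through some memory.* knob - neither exists today),
the kill step of your sketch could branch roughly like this:

	/*
	 * Hypothetical sketch: oom_kill_all and the kill_* helpers below
	 * are made up for illustration, not existing kernel interfaces.
	 */
	static void oom_kill_victim(struct oom_control *oc,
				    struct oom_victim *victim)
	{
		if (!victim->is_memcg) {
			/* Plain task victim: kill just that task. */
			kill_one_task(oc, victim->entity.task);
			return;
		}

		if (victim->entity.memcg->oom_kill_all) {
			/*
			 * Workload can't survive losing a worker:
			 * take down the whole container.
			 */
			kill_every_task_in(oc, victim->entity.memcg);
		} else {
			/*
			 * Fork-heavy workload: killing the worst task
			 * inside is enough, the rest will recover.
			 */
			kill_worst_task_in(oc, victim->entity.memcg);
		}
	}

That way the policy stays with whoever sets up the container instead of
being hard-coded in the oom killer.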
> 		}
> 	}
> }
>
> oom()
> {
> 	struct oom_victim victim = {
> 		.badness = 0,
> 	};
>
> 	for_each_mem_cgroup_tree(memcg, oc->memcg)
> 		oom_evaluate_memcg(oc, memcg, &victim);
>
> 	if (!victim.badness && !is_sysrq_oom(oc)) {
> 		dump_header(oc, NULL);
> 		panic("Out of memory and no killable processes...\n");
> 	}
>
> 	if (victim.badness != -1) {
> 		oom_kill_victim(oc, &victim);
> 		schedule_timeout_killable(1);
> 	}
>
> 	return true;
> }
>
> But even without that, with the unification of two identical control
> flows and the privatization of a good amount of oom killer internals,
> the patch speaks for itself.
>
> > Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thanks!
Thread overview: 6+ messages
2016-06-27 16:39 [PATCH v2] mm: oom: deduplicate victim selection code for memcg and global oom Vladimir Davydov
2016-06-28 0:14 ` David Rientjes
2016-06-28 16:16 ` Vladimir Davydov
2016-07-01 11:18 ` Michal Hocko
2016-07-21 12:41 ` Johannes Weiner
2016-07-23 14:49 ` Vladimir Davydov [this message]