From: Roman Gushchin <guro@fb.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, Vladimir Davydov <vdavydov.dev@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
David Rientjes <rientjes@google.com>, Tejun Heo <tj@kernel.org>,
kernel-team@fb.com, cgroups@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
Date: Tue, 1 Aug 2017 16:25:48 +0100
Message-ID: <20170801152548.GA29502@castle.dhcp.TheFacebook.com>
In-Reply-To: <20170801145435.GN15774@dhcp22.suse.cz>
On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> On Wed 26-07-17 14:27:16, Roman Gushchin wrote:
> [...]
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask)
> > +{
> > + long points = 0;
> > + int nid;
> > +
> > + for_each_node_state(nid, N_MEMORY) {
> > + if (nodemask && !node_isset(nid, *nodemask))
> > + continue;
> > +
> > + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > + }
> > +
> > + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > + (PAGE_SIZE / 1024);
> > + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
> > + points += memcg_page_state(memcg, MEMCG_SOCK);
> > + points += memcg_page_state(memcg, MEMCG_SWAP);
> > +
> > + return points;
>
> I am wondering why are you diverging from the global oom_badness
> behavior here. Although doing per NUMA accounting sounds like a better
> idea but then you just end up mixing this with non NUMA numbers and the
> whole thing is harder to understand without great advantages.
Ok, makes sense. I can revert to the existing OOM behaviour here.
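
For readers following the units in the quoted function: every term is counted in pages except the kernel stack counter, which is in KB and is divided by the number of KB per page. Below is a minimal userspace sketch of that sum, assuming 4 KiB pages. The struct, the function names and the sample numbers are hypothetical and are not kernel API; the per-node loop is collapsed into two plain counters.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096L	/* assumption: 4 KiB pages */

/* Hypothetical snapshot of per-cgroup counters (pages unless noted). */
struct memcg_counters {
	long anon_pages;
	long unevictable_pages;
	long kernel_stack_kb;		/* the only counter kept in KB */
	long slab_unreclaimable_pages;
	long sock_pages;
	long swap_pages;
};

/* Sum the counters into a single score expressed in pages. */
static long sketch_badness(const struct memcg_counters *c)
{
	long points = 0;

	points += c->anon_pages + c->unevictable_pages;
	/* KB -> pages: divide by the number of KB in one page. */
	points += c->kernel_stack_kb / (SKETCH_PAGE_SIZE / 1024);
	points += c->slab_unreclaimable_pages;
	points += c->sock_pages;
	points += c->swap_pages;

	return points;
}

/* Usage example with made-up numbers: 64 KB of stacks is 16 pages. */
static long demo_badness(void)
{
	struct memcg_counters c = {
		.anon_pages = 100,
		.unevictable_pages = 10,
		.kernel_stack_kb = 64,
		.slab_unreclaimable_pages = 5,
		.sock_pages = 3,
		.swap_pages = 2,
	};

	return sketch_badness(&c);
}
```
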
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > + struct mem_cgroup *iter, *parent;
> > +
> > + for_each_mem_cgroup_tree(iter, root) {
> > + if (memcg_has_children(iter)) {
> > + iter->oom_score = 0;
> > + continue;
> > + }
> > +
> > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> > + if (iter->oom_score == -1) {
> > + oc->chosen_memcg = (void *)-1UL;
> > + mem_cgroup_iter_break(root, iter);
> > + return;
> > + }
> > +
> > + if (!iter->oom_score)
> > + continue;
> > +
> > + for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > + parent = parent_mem_cgroup(parent))
> > + parent->oom_score += iter->oom_score;
> > + }
> > +
> > + for (;;) {
> > + struct cgroup_subsys_state *css;
> > + struct mem_cgroup *memcg = NULL;
> > + long score = LONG_MIN;
> > +
> > + css_for_each_child(css, &root->css) {
> > + struct mem_cgroup *iter = mem_cgroup_from_css(css);
> > +
> > + if (iter->oom_score > score) {
> > + memcg = iter;
> > + score = iter->oom_score;
> > + }
> > + }
> > +
> > + if (!memcg) {
> > + if (oc->memcg && root == oc->memcg) {
> > + oc->chosen_memcg = oc->memcg;
> > + css_get(&oc->chosen_memcg->css);
> > + oc->chosen_points = oc->memcg->oom_score;
> > + }
> > + break;
> > + }
> > +
> > + if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
> > + oc->chosen_memcg = memcg;
> > + css_get(&oc->chosen_memcg->css);
> > + oc->chosen_points = score;
> > + break;
> > + }
> > +
> > + root = memcg;
> > + }
> > +}
>
> This and the rest of the victim selection code is really hairy and hard
> to follow.
Will adding more comments help here?
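
As an aid to reading the quoted function, here is a commented userspace sketch of its two passes: first score the leaves and add each leaf score to its ancestors, then descend from the root, always following the highest-scoring child. It is a simplification under stated assumptions, not the patch itself: `struct node` and all helper names are hypothetical, and the -1 abort case, the reference counting and the memcg-OOM fallback are left out.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for struct mem_cgroup; not kernel code. */
struct node {
	int id;
	struct node *parent;
	struct node *children[MAX_CHILDREN];
	int nr_children;
	int kill_all;		/* plays the role of oom_kill_all_tasks */
	long leaf_badness;	/* what evaluating a leaf cgroup would return */
	long score;		/* plays the role of oom_score */
};

static void add_child(struct node *parent, struct node *child)
{
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/*
 * Pass 1: pre-order walk. Inner nodes start at 0; each leaf adds its
 * score to every ancestor strictly below root.
 */
static void score_tree(struct node *n, struct node *root)
{
	struct node *p;
	int i;

	if (n->nr_children) {
		n->score = 0;
		for (i = 0; i < n->nr_children; i++)
			score_tree(n->children[i], root);
		return;
	}

	n->score = n->leaf_badness;
	for (p = n->parent; p && p != root; p = p->parent)
		p->score += n->score;
}

/*
 * Pass 2: pick the best child of root; stop if it is a leaf or has the
 * kill-all flag, otherwise descend into it and repeat.
 */
static struct node *select_victim(struct node *root)
{
	for (;;) {
		struct node *best = NULL;
		long score = LONG_MIN;
		int i;

		for (i = 0; i < root->nr_children; i++) {
			struct node *c = root->children[i];

			if (c->score > score) {
				best = c;
				score = c->score;
			}
		}

		if (!best)
			return NULL;	/* root has no children (fallback omitted) */
		if (best->kill_all || !best->nr_children)
			return best;

		root = best;
	}
}

/* Tree: root -> { A -> { A1 = 30, A2 = 50 }, B = 60 }. Returns the victim id. */
static int demo_select(int a_kill_all)
{
	struct node root = { .id = 0 };
	struct node a = { .id = 1, .kill_all = a_kill_all };
	struct node a1 = { .id = 2, .leaf_badness = 30 };
	struct node a2 = { .id = 3, .leaf_badness = 50 };
	struct node b = { .id = 4, .leaf_badness = 60 };
	struct node *victim;

	add_child(&root, &a);
	add_child(&root, &b);
	add_child(&a, &a1);
	add_child(&a, &a2);

	score_tree(&root, &root);
	victim = select_victim(&root);
	return victim ? victim->id : -1;
}
```

In this example A accumulates 30 + 50 = 80 and beats B (60). Without the kill-all flag on A the descent continues and ends at the leaf A2; with the flag set, A itself is selected.
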
>
> I would rip out the oom_kill_process into a separate patch.
It was a separate patch; I merged it based on Vladimir's feedback.
No problem, I can split it out again.
> > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > +static void __oom_kill_process(struct task_struct *victim)
>
> To the rest of the patch. I have to say I do not quite like how it is
> implemented. I was hoping for something much simpler which would hook
> into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> then we would update the cumulative memcg badness (more specifically the
> badness of the topmost parent with kill-all flag). Memcg will then
> compete with existing self contained tasks (oom_badness will have to
> tell whether points belong to a task or a memcg to allow the caller to
> deal with it). But it shouldn't be much more complex than that.
I'm not sure it will be any simpler. Basically I'm doing the same thing;
the difference is that you want to iterate over tasks and, for each
task, traverse the memcg tree, update the per-cgroup oom score and find
the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
traversing the cgroup tree and, for each leaf cgroup, iterating over processes.
Also, please note that even without the kill-all flag the decision is made
at the per-cgroup level (except for tasks in the root cgroup).
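
For contrast, the task-centric direction described in the quoted suggestion can be sketched as follows. This is one possible reading of that suggestion, written as a userspace toy: `struct cg`, `struct task`, `struct choice` and every function name are hypothetical, and nothing here comes from either version of the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not kernel structures. */
struct cg {
	struct cg *parent;
	int kill_all;		/* kill-all flag on this cgroup */
	long badness;		/* cumulative badness of tasks charged here */
};

struct task {
	struct cg *cg;
	long badness;
};

/* Either a single task or a whole cgroup wins; exactly one pointer is set. */
struct choice {
	const struct task *task;
	const struct cg *cg;
	long points;
};

/* Walk towards the root and remember the topmost ancestor with kill-all. */
static struct cg *topmost_kill_all(struct cg *cg)
{
	struct cg *top = NULL;

	for (; cg; cg = cg->parent)
		if (cg->kill_all)
			top = cg;
	return top;
}

/*
 * Iterate over tasks. A task under a kill-all cgroup adds its badness to
 * that cgroup, which then competes as a unit; other tasks compete alone.
 */
static struct choice evaluate_tasks(struct task *tasks, size_t n)
{
	struct choice best = { NULL, NULL, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		struct cg *top = topmost_kill_all(tasks[i].cg);

		if (top) {
			top->badness += tasks[i].badness;
			if (top->badness > best.points) {
				best.task = NULL;
				best.cg = top;
				best.points = top->badness;
			}
		} else if (tasks[i].badness > best.points) {
			best.task = &tasks[i];
			best.cg = NULL;
			best.points = tasks[i].badness;
		}
	}
	return best;
}

/* Returns the winning points if a cgroup wins, -1 if a single task wins. */
static long demo_task_walk(void)
{
	struct cg g = { NULL, 1, 0 };	/* kill-all group */
	struct cg g1 = { &g, 0, 0 };	/* child of g */
	struct cg h = { NULL, 0, 0 };	/* ordinary group */
	struct task tasks[] = {
		{ &g1, 30 },
		{ &g, 25 },
		{ &h, 40 },
	};
	struct choice best = evaluate_tasks(tasks, 3);

	return best.cg ? best.points : -1;
}
```

With these made-up numbers the two tasks under the kill-all group add up to 55 and outweigh the standalone 40-point task, so the group is chosen as a whole.
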
Thank you!
Roman