Re: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer

From: Roman Gushchin <guro@fb.com>
To: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-mm@kvack.org, Tejun Heo <tj@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Li Zefan <lizefan@huawei.com>, Michal Hocko <mhocko@kernel.org>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer
Date: Tue, 6 Jun 2017 16:59:48 +0100
Message-ID: <20170606155948.GA752@castle>
In-Reply-To: <20170604204333.GD19980@esperanza>

On Sun, Jun 04, 2017 at 11:43:33PM +0300, Vladimir Davydov wrote:
> On Thu, Jun 01, 2017 at 07:35:14PM +0100, Roman Gushchin wrote:
> ...
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f979ac7..855d335 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> >  	return ret;
> >  }
> >  
> > +static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
> > +				   const nodemask_t *nodemask)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	struct mem_cgroup *iter;
> > +
> > +	for_each_mem_cgroup_tree(iter, memcg) {
> 
> AFAIU this function is called on every iteration over the cgroup tree,
> which might be costly in case of a deep hierarchy, as it has quadratic
> complexity at worst. We could eliminate the nested loop by computing
> badness of all eligible cgroups before we start looking for a victim and
> saving the values in struct mem_cgroup. Not sure if it's worth it, as
> OOM is a pretty cold path.

I've thought about it, but it's really not obvious that we want to pay
with additional memory usage (and code complexity) to optimize
this path. So I decided to keep it simple for now and postpone
all optimizations until we've agreed on everything else.
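
For reference, the precomputation discussed above could look roughly like
the user-space sketch below. It is not kernel code: struct toy_memcg, its
fields and toy_cache_badness() are hypothetical names, and own_pages stands
in for whatever mem_cgroup_oom_badness() sums up for a single cgroup. A
post-order walk stores the subtree total in each node, so every cgroup is
visited once, at the cost of one extra long per cgroup. Recursion is used
only for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel layout. */
struct toy_memcg {
	long own_pages;            /* pages charged directly to this cgroup */
	long cached_badness;       /* filled in by toy_cache_badness() */
	struct toy_memcg *child;   /* first child, or NULL */
	struct toy_memcg *sibling; /* next sibling, or NULL */
};

/*
 * Post-order walk: each node is visited exactly once and ends up
 * holding the total for its whole subtree.
 */
static long toy_cache_badness(struct toy_memcg *memcg)
{
	struct toy_memcg *c;
	long points;

	assert(memcg != NULL);

	points = memcg->own_pages;
	for (c = memcg->child; c; c = c->sibling)
		points += toy_cache_badness(c);

	memcg->cached_badness = points;
	return points;
}
```

The victim selection loop could then read cached_badness instead of
re-walking the subtree for every candidate.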

> 
> > +		for_each_node_state(nid, N_MEMORY) {
> > +			if (nodemask && !node_isset(nid, *nodemask))
> > +				continue;
> > +
> > +			points += mem_cgroup_node_nr_lru_pages(iter, nid,
> > +					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> 
> Hmm, is there a reason why we shouldn't take into account file pages?

Because under OOM conditions we should not have much pagecache left,
and killing a process is unlikely to release any additional memory.
But maybe I'm missing something... Lazy free?

> 
> > +		}
> > +
> > +		points += mem_cgroup_get_nr_swap_pages(iter);
> 
> AFAICS mem_cgroup_get_nr_swap_pages() returns the number of pages that
> can still be charged to the cgroup. IIUC we want to account pages that
> have already been charged to the cgroup, i.e. the value of the 'swap'
> page counter or MEMCG_SWAP stat counter.

Ok, I'll check it. Thank you!
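
To illustrate the distinction raised above with a toy counter (the struct
and both helpers are hypothetical, not the kernel's page_counter API): the
remaining headroom and the amount already charged are different quantities,
and only the second one says how much the cgroup has actually swapped out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical swap counter: pages charged so far and the limit. */
struct toy_swap_counter {
	long usage;
	long max;
};

/* Pages that can still be charged (what the review says the patch adds). */
static long toy_swap_room_left(const struct toy_swap_counter *swap)
{
	assert(swap != NULL);
	return swap->max > swap->usage ? swap->max - swap->usage : 0;
}

/* Pages already charged (what a badness score would want to count). */
static long toy_swap_charged(const struct toy_swap_counter *swap)
{
	assert(swap != NULL);
	return swap->usage;
}
```

With usage = 300 and max = 1000, the first helper yields 700 and the second
300, so a cgroup that swaps less would score higher if the headroom were
used.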

> 
> > +		points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
> > +			(PAGE_SIZE / 1024);
> > +		points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
> > +		points += memcg_page_state(iter, MEMCG_SOCK);
> > +	}
> > +
> > +	return points;
> > +}
> > +
> > +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> > +{
> > +	struct cgroup_subsys_state *css = NULL;
> > +	struct mem_cgroup *iter = NULL;
> > +	struct mem_cgroup *chosen_memcg = NULL;
> > +	struct mem_cgroup *parent = root_mem_cgroup;
> > +	unsigned long totalpages = oc->totalpages;
> > +	long chosen_memcg_points = 0;
> > +	long points = 0;
> > +
> > +	oc->chosen = NULL;
> > +	oc->chosen_memcg = NULL;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +		return false;
> > +
> > +	pr_info("Choosing a victim memcg because of the %s",
> > +		oc->memcg ?
> > +		"memory limit reached of cgroup " :
> > +		"system-wide OOM\n");
> > +	if (oc->memcg) {
> > +		pr_cont_cgroup_path(oc->memcg->css.cgroup);
> > +		pr_cont("\n");
> > +
> > +		chosen_memcg = oc->memcg;
> > +		parent = oc->memcg;
> > +	}
> > +
> > +	rcu_read_lock();
> > +
> > +	for (;;) {
> > +		css = css_next_child(css, &parent->css);
> > +		if (css) {
> > +			iter = mem_cgroup_from_css(css);
> > +
> > +			points = mem_cgroup_oom_badness(iter, oc->nodemask);
> > +			points += iter->oom_score_adj * (totalpages / 1000);
> > +
> > +			pr_info("Cgroup ");
> > +			pr_cont_cgroup_path(iter->css.cgroup);
> > +			pr_cont(": %ld\n", points);
> 
> Not sure if everyone wants to see these messages in the log.

What do you suggest? Remove the debug output entirely (we probably still
want some), ratelimit it, or make it optional?
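
As a sketch of what the "ratelimit it" option means, here is a toy
interval/burst limiter in user-space C. It only illustrates the idea; the
names are hypothetical and this is not the kernel's ratelimit code. The
caller passes the current time explicitly so that the behaviour is
deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limiter state: at most 'burst' messages per 'interval'. */
struct toy_ratelimit {
	long interval; /* window length, in arbitrary time units */
	int burst;     /* messages allowed per window */
	long begin;    /* start of the current window */
	int printed;   /* messages emitted in the current window */
};

/* Returns true if a message may be printed at time 'now'. */
static bool toy_ratelimit_ok(struct toy_ratelimit *rs, long now)
{
	assert(rs != NULL);

	/* Open a new window once the old one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}

	if (rs->printed >= rs->burst)
		return false;

	rs->printed++;
	return true;
}
```

The per-cgroup score lines would then be printed only when the check
succeeds, which bounds the log volume for large hierarchies.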

> 
> > +
> > +			if (points > chosen_memcg_points) {
> > +				chosen_memcg = iter;
> > +				chosen_memcg_points = points;
> > +				oc->chosen_points = points;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
> > +			/* Go deeper in the cgroup hierarchy */
> > +			totalpages = chosen_memcg_points;
> 
> We set 'totalpages' to the target cgroup limit (or the total RAM
> size) when computing a victim score. Why do you prefer to use
> chosen_memcg_points here instead? Why not the limit of the chosen
> cgroup?

Because I'm trying to implement a hierarchical oom_score_adj, so that if
a parent cgroup has oom_score_adj set to -1000, its descendants will
(almost) never be selected.
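
A worked example of the adjustment formula from the patch,
points += oom_score_adj * (totalpages / 1000), as stand-alone C. The
function name and all numbers are made up for illustration. With
oom_score_adj = -1000 the adjustment subtracts roughly the whole of
totalpages, so the score cannot end up above zero as long as the badness
does not exceed totalpages.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the arithmetic in the patch:
 * badness plus oom_score_adj thousandths of totalpages.
 */
static long toy_adjusted_points(long badness, int oom_score_adj,
				long totalpages)
{
	assert(totalpages >= 0);
	return badness + oom_score_adj * (totalpages / 1000);
}
```

For instance, with totalpages = 1000000 a cgroup with badness 900000 and
oom_score_adj = -1000 gets 900000 - 1000 * 1000 = -100000, which never
beats the initial chosen_memcg_points of 0. One level down, with totalpages
replaced by the parent's 400000 points, a child with badness 300000 and
oom_score_adj = -1000 gets 300000 - 1000 * 400 = -100000.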

> > +			chosen_memcg_points = 0;
> > +
> > +			parent = chosen_memcg;
> > +			chosen_memcg = NULL;
> > +
> > +			continue;
> > +		}
> > +
> > +		if (!chosen_memcg && parent != root_mem_cgroup)
> > +			chosen_memcg = parent;
> > +
> > +		break;
> > +	}
> > +
> 
> > +	if (!oc->memcg) {
> > +		/*
> > +		 * We should also consider tasks in the root cgroup
> > +		 * with badness larger than oc->chosen_points
> > +		 */
> > +
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		int ret = 0;
> > +
> > +		css_task_iter_start(&root_mem_cgroup->css, &it);
> > +		while (!ret && (task = css_task_iter_next(&it)))
> > +			ret = oom_evaluate_task(task, oc);
> > +		css_task_iter_end(&it);
> > +	}
> 
> IMHO it isn't quite correct to compare tasks from the root cgroup with
> leaf cgroups, because they are at different levels. Shouldn't we compare
> their scores only with the top level cgroups?

Not sure I follow your idea...
Of course, comparing tasks and cgroups is not really precise,
but hopefully it is good enough for this purpose.
 
> As an alternative approach, maybe we could remove this branch
> altogether and ignore root tasks here (i.e. give any root task a higher
> priority a priori)? Perhaps it would be acceptable, because normally
> the root cgroup only hosts kernel processes and init (at least this is
> the default systemd setup IIRC).
> 
> > +
> > +	if (!oc->chosen && chosen_memcg) {
> > +		pr_info("Chosen cgroup ");
> > +		pr_cont_cgroup_path(chosen_memcg->css.cgroup);
> > +		pr_cont(": %ld\n", oc->chosen_points);
> > +
> > +		if (chosen_memcg->oom_kill_all_tasks) {
> > +			css_get(&chosen_memcg->css);
> > +			oc->chosen_memcg = chosen_memcg;
> > +		} else {
> > +			/*
> > +			 * If we don't need to kill all tasks in the cgroup,
> > +			 * let's select the biggest task.
> > +			 */
> > +			oc->chosen_points = 0;
> 
> > +			select_bad_process(oc, chosen_memcg);
> 
> I think we'd better use mem_cgroup_scan_task() here directly, without
> exporting select_bad_process() from oom_kill.c. IMHO it would be more
> straightforward, because select_bad_process() has a branch handling the
> global OOM, which isn't used in this case. Come to think of it, wouldn't
> it be better to return the chosen cgroup in @oc and let out_of_memory()
> select a process within it or kill it as a whole depending on the value
> of the oom_kill_all_tasks flag?
> 
> Also, if the chosen cgroup has no tasks (which is perfectly possible if
> all memory within the cgroup is consumed by shmem e.g.), shouldn't we
> retry the cgroup selection?

Good point. Should we retry the cgroup selection or just ignore
non-populated cgroups during selection?
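
The second option could be sketched as follows, in user-space C with
hypothetical names (struct toy_cgroup, nr_tasks and toy_pick_victim() are
not from the patch). Cgroups without tasks are skipped while scanning, so a
cgroup whose memory is all shmem can never be picked and no retry is
needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup summary used only for this sketch. */
struct toy_cgroup {
	const char *name;
	long points;  /* badness score */
	int nr_tasks; /* 0 means the cgroup is not populated */
};

/* Returns the populated cgroup with the highest positive score, or NULL. */
static const struct toy_cgroup *
toy_pick_victim(const struct toy_cgroup *cgs, size_t n)
{
	const struct toy_cgroup *chosen = NULL;
	long chosen_points = 0;
	size_t i;

	assert(cgs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Killing tasks cannot free memory in an empty cgroup. */
		if (cgs[i].nr_tasks == 0)
			continue;

		if (cgs[i].points > chosen_points) {
			chosen = &cgs[i];
			chosen_points = cgs[i].points;
		}
	}

	return chosen;
}
```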

> 
> > +		}
> > +	} else if (oc->chosen)
> > +		pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
> > +			oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
> > +
> > +	rcu_read_unlock();
> > +
> > +	oc->chosen_points = 0;
> > +	return !!oc->chosen || !!oc->chosen_memcg;
> > +}
> > +
> > +static int __oom_kill_task(struct task_struct *tsk, void *arg)
> > +{
> > +	if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
> > +		get_task_struct(tsk);
> > +		__oom_kill_process(tsk);
> > +	}
> > +	return 0;
> > +}
> > +
> > +bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
> 
> I think it'd be OK to define this function in oom_kill.c - we
> have everything we need for that. We wouldn't have to export
> __oom_kill_process without oom_kill_process then, which is kinda
> ugly IMHO.
> 
> > +{
> > +	if (oc->chosen_memcg) {
> > +		/*
> > +		 * Kill all tasks in the cgroup hierarchy
> > +		 */
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg,
> > +				      __oom_kill_task, NULL);
> > +
> > +		/*
> > +		 * Release oc->chosen_memcg
> > +		 */
> > +		css_put(&oc->chosen_memcg->css);
> > +		oc->chosen_memcg = NULL;
> > +	}
> > +
> > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> 
> > +		__oom_kill_process(oc->chosen);
> 
> Why don't you use oom_kill_process (without leading underscores) here?

Because oom_kill_process() has some unwanted side effects:
1) it can kill a process other than the one specified, and we don't need
   this optimization here,
2) it produces bulky debug output.

Thank you for the review!

Roman

