From: Jianguo Wu <wujianguo@huawei.com>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.cz>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Christoph Lameter <cl@linux-foundation.org>,
Pekka Enberg <penberg@kernel.org>, Tejun Heo <tj@kernel.org>,
Mel Gorman <mgorman@suse.de>, Oleg Nesterov <oleg@redhat.com>,
Rik van Riel <riel@redhat.com>, Tim Hockin <thockin@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
cgroups@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [patch 00/11] userspace out of memory handling
Date: Tue, 11 Mar 2014 20:03:15 +0800 [thread overview]
Message-ID: <531EFB83.1070404@huawei.com> (raw)
In-Reply-To: <alpine.DEB.2.02.1403051831100.30075@chino.kir.corp.google.com>
On 2014/3/6 10:52, David Rientjes wrote:
> On Wed, 5 Mar 2014, Andrew Morton wrote:
>
>>> This patchset introduces a standard interface through memcg that allows
>>> both of these conditions to be handled in the same clean way: users
>>> define memory.oom_reserve_in_bytes to define the reserve and this
>>> amount is allowed to be overcharged to the process handling the oom
>>> condition's memcg. If used with the root memcg, this amount is allowed
>>> to be allocated below the per-zone watermarks for root processes that
>>> are handling such conditions (only root may write to
>>> cgroup.event_control for the root memcg).
>>
>> If process A is trying to allocate memory, cannot do so and the
>> userspace oom-killer is invoked, there must be means via which process
>> A waits for the userspace oom-killer's action.
>
> It does so by relooping in the page allocator waiting for memory to be
> freed just like it would if the kernel oom killer were called and process
> A was waiting for the oom kill victim process B to exit, we don't have the
> ability to put it on a waitqueue because we don't touch the freeing
> hotpath. The userspace oom handler may not even necessarily kill
> anything, it may be able to free its own memory and start throttling other
> processes, for example.
>
>> And there must be
>> fallbacks which occur if the userspace oom killer fails to clear the
>> oom condition, or times out.
>>
>
> I agree completely and proposed this before as memory.oom_delay_millisecs
> at http://lwn.net/Articles/432226 which we use internally when memory
> can't be freed or a memcg's limit cannot be expanded. I guess it makes
> more sense alongside the rest of this patchset now, I can add it as an
> additional patch next time around.
>
>> Would be interested to see a description of how all this works.
>>
>
> There's an article for LWN also being developed on this topic. As
> mentioned in that article, I think it would be best to generalize a lot of
> the common functions and the eventfd handling entirely into a library.
> I've attached an example implementation that just invokes a function to
> handle the situation.
>
> For Google's usecase specifically, at the root memcg level (system oom) we
> want to do priority based memcg killing. We want to kill from within a
> memcg hierarchy that has the lowest priority relative to other memcgs.
> This cannot be implemented with /proc/pid/oom_score_adj today. Those
> priorities may also change depending on whether a memcg hierarchy is
> "overlimit", i.e. its limit has been increased temporarily because it has
> hit a memcg oom and additional memory is readily available on the system.
>
> So why not just introduce a memcg tunable that specifies a priority?
> Well, it's not that simple. Other users will want to implement different
> policies on system oom (think about things like existing panic_on_oom or
> oom_kill_allocating_task sysctls). I introduced oom_kill_allocating_task
> originally for SGI because they wanted a fast oom kill rather than
> expensive tasklist scan: the allocating task itself is rather irrelevant,
> it was just the unlucky task that was allocating at the moment that oom
> was triggered. What's guaranteed is that current in that case will always
> free memory from under oom (it's not a member of some other mempolicy or
> cpuset that would be needlessly killed). Both sysctls could trivially be
> reimplemented in userspace with this feature.
>
> I have other customers who don't run in a memcg environment at all, they
> simply reattach all processes to root and delete all other memcgs. These
> customers are only concerned about system oom conditions and want to do
> something "interesting" before a process is killed. Some want to log the
> VM statistics as an artifact to examine later, some want to examine heap
> profiles, others can start throttling and freeing memory rather than kill
> anything. All of this is impossible today because the kernel oom killer
> will simply kill something immediately and any stats we collect afterwards
> don't represent the oom condition. The heap profiles are lost, throttling
> is useless, etc.
>
> Jianguo (cc'd) may also have usecases not described here.
>
I want to log memory usage, like slabinfo, vmalloc info, page-cache info, etc. before
kill anything.
>> It is unfortunate that this feature is memcg-only. Surely it could
>> also be used by non-memcg setups. Would like to see at least a
>> detailed description of how this will all be presented and implemented.
>> We should aim to make the memcg and non-memcg userspace interfaces and
>> user-visible behaviour as similar as possible.
>>
>
> It's memcg only because it can handle both system and memcg oom conditions
> with the same clean interface, it would be possible to implement only
> system oom condition handling through procfs (a little sloppy since it
> needs to register the eventfd) but then a userspace oom handler would need
> to determine which interface to use based on whether it was running in a
> memcg or non-memcg environment. I implemented this feature with userspace
> in mind: I didn't want it to need two different implementations to do the
> same thing depending on memcg. The way it is written, a userspace oom
> handler does not know (nor need not care) whether it is constrained by the
> amount of system RAM or a memcg limit. It can simply write the reserve to
> its memcg's memory.oom_reserve_in_bytes, attach to memory.oom_control and
> be done.
>
> This does mean that memcg needs to be enabled for the support, though.
> This is already done on most distributions, the cgroup just needs to be
> mounted. Would it be better to duplicate the interface in two different
> spots depending on CONFIG_MEMCG? I didn't think so, and I think the idea
> of a userspace library that takes care of this registration (and mounting,
> perhaps) proposed on LWN would be the best of both worlds.
>
>> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
>> I'll cherrypick those, OK?
>>
>
> Ok! I'm hoping that the PF_MEMPOLICY bit that is removed in those patches
> is at least temporarily reserved for PF_OOM_HANDLER introduced here, I
> removed it purposefully :)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-03-11 12:05 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
2014-03-05 3:58 ` [patch 01/11] fork: collapse copy_flags into copy_process David Rientjes
2014-03-05 3:58 ` [patch 02/11] mm, mempolicy: rename slab_node for clarity David Rientjes
2014-03-05 3:59 ` [patch 03/11] mm, mempolicy: remove per-process flag David Rientjes
2014-03-07 17:20 ` Andi Kleen
2014-03-07 20:48 ` Andrew Morton
2014-03-05 3:59 ` [patch 04/11] mm, memcg: add tunable for oom reserves David Rientjes
2014-03-05 21:17 ` Andrew Morton
2014-03-06 2:53 ` David Rientjes
2014-03-06 21:04 ` Tejun Heo
2014-03-05 3:59 ` [patch 05/11] res_counter: remove interface for locked charging and uncharging David Rientjes
2014-03-05 3:59 ` [patch 06/11] res_counter: add interface for maximum nofail charge David Rientjes
2014-03-05 3:59 ` [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves David Rientjes
2014-03-06 21:12 ` Tejun Heo
2014-03-05 3:59 ` [patch 08/11] mm, memcg: add memcg oom reserve documentation David Rientjes
2014-03-05 3:59 ` [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves David Rientjes
2014-03-06 21:13 ` Tejun Heo
2014-03-05 3:59 ` [patch 10/11] mm, memcg: add memory.oom_control notification for system oom David Rientjes
2014-03-06 21:15 ` Tejun Heo
2014-03-05 3:59 ` [patch 11/11] mm, memcg: allow system oom killer to be disabled David Rientjes
2014-03-06 21:15 ` Tejun Heo
2014-03-05 21:17 ` [patch 00/11] userspace out of memory handling Andrew Morton
2014-03-06 2:52 ` David Rientjes
2014-03-11 12:03 ` Jianguo Wu [this message]
2014-03-06 20:49 ` Tejun Heo
2014-03-06 20:55 ` David Rientjes
2014-03-06 20:59 ` Tejun Heo
2014-03-06 21:08 ` David Rientjes
2014-03-06 21:11 ` Tejun Heo
2014-03-06 21:23 ` David Rientjes
2014-03-06 21:29 ` Tejun Heo
2014-03-06 21:33 ` Tejun Heo
2014-03-07 12:23 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=531EFB83.1070404@huawei.com \
--to=wujianguo@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=cl@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mhocko@suse.cz \
--cc=oleg@redhat.com \
--cc=penberg@kernel.org \
--cc=riel@redhat.com \
--cc=rientjes@google.com \
--cc=thockin@google.com \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).