From: Jianguo Wu <wujianguo@huawei.com>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Pekka Enberg <penberg@kernel.org>, Tejun Heo <tj@kernel.org>,
	Mel Gorman <mgorman@suse.de>, Oleg Nesterov <oleg@redhat.com>,
	Rik van Riel <riel@redhat.com>, Tim Hockin <thockin@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [patch 00/11] userspace out of memory handling
Date: Tue, 11 Mar 2014 20:03:15 +0800
Message-ID: <531EFB83.1070404@huawei.com>
In-Reply-To: <alpine.DEB.2.02.1403051831100.30075@chino.kir.corp.google.com>

On 2014/3/6 10:52, David Rientjes wrote:

> On Wed, 5 Mar 2014, Andrew Morton wrote:
> 
>>> This patchset introduces a standard interface through memcg that allows
>>> both of these conditions to be handled in the same clean way: users
>>> define memory.oom_reserve_in_bytes to define the reserve and this
>>> amount is allowed to be overcharged to the process handling the oom
>>> condition's memcg.  If used with the root memcg, this amount is allowed
>>> to be allocated below the per-zone watermarks for root processes that
>>> are handling such conditions (only root may write to
>>> cgroup.event_control for the root memcg).
>>
>> If process A is trying to allocate memory, cannot do so and the
>> userspace oom-killer is invoked, there must be means via which process
>> A waits for the userspace oom-killer's action.
> 
> It does so by relooping in the page allocator waiting for memory to be 
> freed just like it would if the kernel oom killer were called and process 
> A was waiting for the oom kill victim process B to exit, we don't have the 
> ability to put it on a waitqueue because we don't touch the freeing 
> hotpath.  The userspace oom handler may not even necessarily kill 
> anything, it may be able to free its own memory and start throttling other 
> processes, for example.
> 
>> And there must be
>> fallbacks which occur if the userspace oom killer fails to clear the
>> oom condition, or times out.
>>
> 
> I agree completely and proposed this before as memory.oom_delay_millisecs 
> at http://lwn.net/Articles/432226 which we use internally when memory 
> can't be freed or a memcg's limit cannot be expanded.  I guess it makes 
> more sense alongside the rest of this patchset now, I can add it as an 
> additional patch next time around.
> 
>> Would be interested to see a description of how all this works.
>>
> 
> There's an article for LWN also being developed on this topic.  As 
> mentioned in that article, I think it would be best to generalize a lot of 
> the common functions and the eventfd handling entirely into a library.  
> I've attached an example implementation that just invokes a function to 
> handle the situation.
> 
> For Google's usecase specifically, at the root memcg level (system oom) we 
> want to do priority based memcg killing.  We want to kill from within a 
> memcg hierarchy that has the lowest priority relative to other memcgs.  
> This cannot be implemented with /proc/pid/oom_score_adj today.  Those 
> priorities may also change depending on whether a memcg hierarchy is 
> "overlimit", i.e. its limit has been increased temporarily because it has 
> hit a memcg oom and additional memory is readily available on the system.
> 
> So why not just introduce a memcg tunable that specifies a priority?  
> Well, it's not that simple.  Other users will want to implement different 
> policies on system oom (think about things like existing panic_on_oom or 
> oom_kill_allocating_task sysctls).  I introduced oom_kill_allocating_task 
> originally for SGI because they wanted a fast oom kill rather than 
> expensive tasklist scan: the allocating task itself is rather irrelevant, 
> it was just the unlucky task that was allocating at the moment that oom 
> was triggered.  What's guaranteed is that current in that case will always 
> free memory from under oom (it's not a member of some other mempolicy or 
> cpuset that would be needlessly killed).  Both sysctls could trivially be 
> reimplemented in userspace with this feature.
> 
> I have other customers who don't run in a memcg environment at all, they 
> simply reattach all processes to root and delete all other memcgs.  These 
> customers are only concerned about system oom conditions and want to do 
> something "interesting" before a process is killed.  Some want to log the 
> VM statistics as an artifact to examine later, some want to examine heap 
> profiles, others can start throttling and freeing memory rather than kill 
> anything.  All of this is impossible today because the kernel oom killer 
> will simply kill something immediately and any stats we collect afterwards 
> don't represent the oom condition.  The heap profiles are lost, throttling 
> is useless, etc.
> 
> Jianguo (cc'd) may also have usecases not described here.
> 

I want to log memory usage, such as slabinfo, vmalloc info, and page-cache info,
before killing anything.
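
As a rough illustration of that logging step only, here is a minimal sketch of
what a userspace handler might collect once it has been notified of an oom
condition. It is not part of the patchset. The file list and the helper names
snapshot() and write_snapshot() are assumptions made for this example, and
/proc/slabinfo and /proc/vmallocinfo normally need root to read. A real
handler would also have to keep its own allocations small enough to fit in the
reserve discussed above.

```python
import os
import time

# Assumed set of files covering page-cache, slab and vmalloc usage.
# /proc/slabinfo and /proc/vmallocinfo usually require root to read.
DEFAULT_SOURCES = ("/proc/meminfo", "/proc/slabinfo", "/proc/vmallocinfo")


def snapshot(sources=DEFAULT_SOURCES):
    """Read each source once; map path -> contents, or None if unreadable."""
    result = {}
    for path in sources:
        try:
            with open(path, "r", errors="replace") as f:
                result[path] = f.read()
        except OSError:
            # Missing file or insufficient permission: record the gap
            # instead of aborting the whole snapshot.
            result[path] = None
    return result


def write_snapshot(snap, out_dir):
    """Write one snapshot to a timestamped log file and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    log_path = os.path.join(out_dir, "oom-%d.log" % time.time_ns())
    with open(log_path, "w") as out:
        for path, text in snap.items():
            out.write("==== %s ====\n" % path)
            out.write(text if text is not None else "(unreadable)\n")
            if text and not text.endswith("\n"):
                out.write("\n")
    return log_path
```

The point of taking the snapshot before any kill is the one made in the quoted
text: statistics collected after the kernel has already killed something no
longer describe the oom condition.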

>> It is unfortunate that this feature is memcg-only.  Surely it could
>> also be used by non-memcg setups.  Would like to see at least a
>> detailed description of how this will all be presented and implemented.
>> We should aim to make the memcg and non-memcg userspace interfaces and
>> user-visible behaviour as similar as possible.
>>
> 
> It's memcg only because it can handle both system and memcg oom conditions 
> with the same clean interface, it would be possible to implement only 
> system oom condition handling through procfs (a little sloppy since it 
> needs to register the eventfd) but then a userspace oom handler would need 
> to determine which interface to use based on whether it was running in a 
> memcg or non-memcg environment.  I implemented this feature with userspace 
> in mind: I didn't want it to need two different implementations to do the 
> same thing depending on memcg.  The way it is written, a userspace oom 
> handler does not know (nor need not care) whether it is constrained by the 
> amount of system RAM or a memcg limit.  It can simply write the reserve to 
> its memcg's memory.oom_reserve_in_bytes, attach to memory.oom_control and 
> be done.
> 
> This does mean that memcg needs to be enabled for the support, though.  
> This is already done on most distributions, the cgroup just needs to be 
> mounted.  Would it be better to duplicate the interface in two different 
> spots depending on CONFIG_MEMCG?  I didn't think so, and I think the idea 
> of a userspace library that takes care of this registration (and mounting, 
> perhaps) proposed on LWN would be the best of both worlds.
> 
>> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
>> I'll cherrypick those, OK?
>>
> 
> Ok!  I'm hoping that the PF_MEMPOLICY bit that is removed in those patches 
> is at least temporarily reserved for PF_OOM_HANDLER introduced here, I 
> removed it purposefully :)


