Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening

From: Sha Zhengju <handai.szj@gmail.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"kamezawa.hiroyu@jp.fujitsu.com" <kamezawa.hiroyu@jp.fujitsu.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Sha Zhengju <handai.szj@taobao.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
Date: Thu, 18 Oct 2012 21:51:57 +0800
Message-ID: <5080097D.5020501@gmail.com>
In-Reply-To: <20121018115640.GB24295@dhcp22.suse.cz>

On 10/18/2012 07:56 PM, Michal Hocko wrote:
> On Wed 17-10-12 01:14:48, Sha Zhengju wrote:
>> On Tuesday, October 16, 2012, Michal Hocko <mhocko@suse.cz> wrote:
> [...]
>>> Could you be more specific about the motivation for this patch? Is it
>>> "let's be consistent with the global oom" or do you have a real use
>>> case for this knob?
>>>
>> In our environment (rhel6), we encountered a memcg oom 'deadlock'
>> problem.  Simply put, suppose process A is selected to be killed
>> by the memcg oom killer, but A is in uninterruptible sleep on a page
>> lock. What's worse, that very page lock is held by another memcg
>> process B, which is trapped in mem_cgroup_oom_lock (this proves to
>> be a livelock).
> Hmm, this is strange. How can you get down that road with the page lock
> held? Is it possible this is related to the issue fixed by: 1d65f86d
> (mm: preallocate page before lock_page() at filemap COW)?

No, it has nothing to do with the COW page. Checking the stack of process A,
which was selected to be killed (in uninterruptible sleep), shows it stuck at:
__do_fault->filemap_fault->__lock_page_or_retry->wait_on_page_bit (D state).
Process B, which holds that exact page lock, is on the following path:
__do_fault->filemap_fault->__do_page_cache_readahead->..->mpage_readpages
->add_to_page_cache_locked ---- >(in memcg oom and cannot exit)
In mpage_readpages, B tries to read about a dozen pages in one go: for each
page it does the locking and charging, and then it sends out one big bio.
A is waiting for one of those pages and is stuck.

As I said, 37b23e05 has made page faults killable by changing the
uninterruptible sleep to a killable sleep. So A can be woken up to exit
successfully and free its memory, which in turn helps B get through the
memcg charging.
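
To make the dependency easier to see, here is a toy wait-for graph in
Python. It is only an illustration, not kernel code: the node names
("page_lock", "memcg_charge") and the helpers find_cycle() and
with_killable_sleep() are made up for this sketch, and the graph is
simplified so that every node waits on at most one other node.

```python
# Toy wait-for graph of the hang described above (illustrative only).
# An edge X -> Y means "X cannot make progress until Y does".
WAITS_FOR = {
    "A": ["page_lock"],       # A sleeps in wait_on_page_bit (D state)
    "page_lock": ["B"],       # the page lock is held by B
    "B": ["memcg_charge"],    # B is stuck charging in memcg oom
    "memcg_charge": ["A"],    # the charge succeeds only once A exits
}


def find_cycle(graph, start):
    """Follow the single outgoing edge from start; return the cycle or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        successors = graph.get(node, [])
        node = successors[0] if successors else None
    if node is None:
        return None
    return path[path.index(node):] + [node]


def with_killable_sleep(graph):
    """Model a killable sleep: a killed A no longer waits on the page lock."""
    fixed = dict(graph)
    fixed["A"] = []
    return fixed


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "A"))
    print(find_cycle(with_killable_sleep(WAITS_FOR), "B"))
```

In this model the first call reports the cycle through A, the page lock,
B and the charge, and the second call finds no cycle once A's wait can be
interrupted by the kill.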

(By the way, it seems commits 37b23e05 and 7d9fdac need to be backported
to the -stable tree for the benefit of RHEL users. ;-) )

>> Then A cannot exit successfully to free the memory, and neither of
>> them can move on.
>> Indeed, we should dig into these locks to find the solution, and
>> in fact 37b23e05 (x86, mm: make pagefault killable) and
>> 7d9fdac (Memcg: make oom_lock 0 and 1 based other than counter) have
>> already solved the problem, but if oom_kill_allocating_task is
>> memcg aware, enabling this suicide oom behavior will be a simpler
>> workaround. What's more, enabling the sysctl can avoid other potential
>> oom problems to some extent.
> As I said, I am not against this but I really want to see a valid use
> case first. So far I haven't seen any because what you mention above is
> a clear bug which should be fixed. I can imagine the huge number of
> tasks in the group could be a problem as well but I would like to see
> what are those problems first.
>

For consistency with the global oom and for the performance benefit, I
suggest we may as well enable it for memcg oom, as there's no obvious harm.
As for the bug I mentioned, the key solution is obviously the two patches
above, but considering other *potential* memcg oom bugs, the sysctl may
serve as a temporary workaround to some extent... but it's just a
workaround.
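
As a rough sketch of the behavior I am asking for, here is a toy model in
Python. It is not the kernel implementation: the Task class, the rss_pages
field (a stand-in for the real badness score) and select_victim() are all
hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    rss_pages: int  # hypothetical stand-in for the oom badness score


def select_victim(tasks, allocating, kill_allocating_task):
    """Pick the task to kill when the group hits its limit.

    With the knob set, the charging task is chosen directly and no scan
    of the group happens; otherwise every task in the group is scored.
    """
    if kill_allocating_task:
        return allocating
    return max(tasks, key=lambda t: t.rss_pages)


if __name__ == "__main__":
    a = Task("A", rss_pages=9000)   # large task sleeping on the page lock
    b = Task("B", rss_pages=100)    # task that is charging in memcg oom
    group = [a, b]
    print(select_victim(group, allocating=b, kill_allocating_task=False).name)
    print(select_victim(group, allocating=b, kill_allocating_task=True).name)
```

Without the knob the scan picks A, which cannot exit while it sleeps on
the page lock; with the knob set, B itself is chosen, and B is the task
holding that lock.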


Thanks,
Sha

>>> The primary motivation for oom_kill_allocating_task AFAIU was to reduce
>>> search over huge tasklists and reduce task_lock holding times. I am not
>>> sure whether the original concern is still valid since 6b0c81b (mm,
>>> oom: reduce dependency on tasklist_lock) as the tasklist_lock usage has
>>> been reduced considerably in favor of RCU read locks, but maybe
>>> even that can be too disruptive?
>>> David?
>>
>> On the other hand, the semantic meaning of oom_kill_allocating_task
>> implies allowing suicide-like oom, which has no obvious relationship
>> with performance problems (such as huge task lists or task_lock holding
>> time).
> I guess that suicide-like oom in fact means "kill the poor soul that
> happened to charge the last". I do not see any use case for this off
> the top of my head (apart from the performance benefits of course).
>
>> So making the sysctl consistent with the global oom would be better, or
>> else setting an individual option for memcg oom just as panic_on_oom does.
