Re: mm allocation failure and hang when running xfstests generic/269 on xfs

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Xiong Zhou <xzhou@redhat.com>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: mm allocation failure and hang when running xfstests generic/269 on xfs
Date: Thu, 2 Mar 2017 10:30:02 -0500	[thread overview]
Message-ID: <20170302153002.GG3213@bfoster.bfoster> (raw)
In-Reply-To: <20170302151411.GM1404@dhcp22.suse.cz>

On Thu, Mar 02, 2017 at 04:14:11PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 09:51:31, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I see your argument about being in sync with other kmem helpers but
> > > > > > > those are bit different because regular page/slab allocators allow never
> > > > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > > > implement their own retries but that is a different topic).
> > > > > > > 
> > > > > > 
> > > > > > ... but what I'm trying to understand here is whether this failure
> > > > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > > > functions are susceptible to the same problem. For example, suppose we
> > > > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > > > 
> > > > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > > > memory but in that case we can expect the OOM killer releasing some
> > > > > memory which would allow us to make a forward progress on the next
> > > > > retry. So essentially retrying around kmalloc is much more safe in this
> > > > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > > > space to allocate from or much more likely due to already mentioned
> > > > > patch. So vmalloc is different, really.
> > > > 
> > > > Right.. that's why I'm asking. So it's technically possible but highly
> > > > unlikely due to the different failure characteristics. That seems
> > > > reasonable to me, then. 
> > > > 
> > > > To be clear, do we understand what causes the vzalloc() failure to be
> > > > effectively permanent in this specific reproducer? I know you mention
> > > > above that we could be out of vmalloc space, but that doesn't clarify
> > > > whether there are other potential failure paths or then what this has to
> > > > do with the fact that the process was killed. Does the pending signal
> > > > cause the subsequent failures or are you saying that there is some other
> > > > root cause of the failure, this process would effectively be spinning
> > > > here anyways, and we're just noticing it because it's trying to exit?
> > > 
> > > In this particular case it is fatal_signal_pending that causes the
> > > permanent failure. This check has been added to prevent from complete
> > > memory reserves depletion on OOM when a killed task has a free ticket to
> > > reserves and vmalloc requests can be really large. In this case there
> > > was no OOM killer going on but fsstress has SIGKILL pending for other
> > > reason. Most probably as a result of the group_exit when all threads
> > > are killed (see zap_process). I could have turn fatal_signal_pending
> > > into tsk_is_oom_victim which would be less likely to hit but in
> > > principle fatal_signal_pending should be better because we do want to
> > > bail out when the process is existing as soon as possible.
> > > 
> > > What I really wanted to say is that there are other possible permanent
> > > failure paths in vmalloc AFAICS. They are much less probable but they
> > > still exist.
> > > 
> > > Does that make more sense now?
> > 
> > Yes, thanks. That explains why this crops up now where it hasn't in the
> > past. Please include that background in the commit log description.
> 
> OK, does this sound better. I am open to any suggestions to improve this
> of course
> 

Yeah..

> : xfs: allow kmem_zalloc_greedy to fail
> : 
> : Even though kmem_zalloc_greedy is documented it might fail the current
> : code doesn't really implement this properly and loops on the smallest
> : allowed size for ever. This is a problem because vzalloc might fail
> : permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> : ("vmalloc: back off when the current task is killed") when the current
> : task is killed. The later one makes the failure scenario much more
> : probable than it used to be. Fix this by bailing out if the minimum size
                               ^
	" because it makes vmalloc() failures permanent for tasks with fatal
	  signals pending."

> : request failed.
> : 
> : This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> : 
> : fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> : fsstress cpuset=/ mems_allowed=0-1
> : CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> : Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> : Call Trace:
> :  dump_stack+0x63/0x87
> :  warn_alloc+0x114/0x1c0
> :  ? alloc_pages_current+0x88/0x120
> :  __vmalloc_node_range+0x250/0x2a0
> :  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  ? free_hot_cold_page+0x21f/0x280
> :  vzalloc+0x54/0x60
> :  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  xfs_bulkstat+0x11b/0x730 [xfs]
> :  ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> :  ? selinux_capable+0x20/0x30
> :  ? security_capable+0x48/0x60
> :  xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> :  xfs_file_ioctl+0x9dd/0xad0 [xfs]
> :  ? do_filp_open+0xa5/0x100
> :  do_vfs_ioctl+0xa7/0x5e0
> :  SyS_ioctl+0x79/0x90
> :  do_syscall_64+0x67/0x180
> :  entry_SYSCALL64_slow_path+0x25/0x25
> : 
> : fsstress keeps looping inside kmem_zalloc_greedy without any way out
> : because vmalloc keeps failing due to fatal_signal_pending.
> : 
> : Reported-by: Xiong Zhou <xzhou@redhat.com>
> : Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> : Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> > Also, that kind of makes me think that a fatal_signal_pending() check is
> > still appropriate in the loop, even if we want to drop the infinite
> > retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
> > however many retries are left before we return and that's also more
> > explicit for the next person who goes to change this code in the future.
> 
> I am not objecting to adding fatal_signal_pending as well I just thought
> that from the logic POV breaking after reaching the minimum size is just
> the right thing to do. We can optimize further by checking
> fatal_signal_pending and reducing retries when we know it doesn't make
> much sense but that should be done on top as an optimization IMHO.
> 

I don't think of it as an optimization to not invoke calls (a
non-deterministic number of times) that we know are going to fail, but
ultimately I don't care too much how it's framed or if it's done in
separate patches or whatnot. As long as they are posted at the same
time. ;)

Brian

> > Otherwise, I'm fine with breaking the infinite retry loop at the same
> > time. It looks like Christoph added this function originally so this
> > should probably require his ack as well..
> 
> What do you think Christoph?
> -- 
> Michal Hocko
> SUSE Labs

WARNING: multiple messages have this Message-ID (diff)

From: Brian Foster <bfoster@redhat.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Xiong Zhou <xzhou@redhat.com>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: mm allocation failure and hang when running xfstests generic/269 on xfs
Date: Thu, 2 Mar 2017 10:30:02 -0500	[thread overview]
Message-ID: <20170302153002.GG3213@bfoster.bfoster> (raw)
In-Reply-To: <20170302151411.GM1404@dhcp22.suse.cz>

On Thu, Mar 02, 2017 at 04:14:11PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 09:51:31, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I see your argument about being in sync with other kmem helpers but
> > > > > > > those are bit different because regular page/slab allocators allow never
> > > > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > > > implement their own retries but that is a different topic).
> > > > > > > 
> > > > > > 
> > > > > > ... but what I'm trying to understand here is whether this failure
> > > > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > > > functions are susceptible to the same problem. For example, suppose we
> > > > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > > > 
> > > > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > > > memory but in that case we can expect the OOM killer releasing some
> > > > > memory which would allow us to make a forward progress on the next
> > > > > retry. So essentially retrying around kmalloc is much more safe in this
> > > > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > > > space to allocate from or much more likely due to already mentioned
> > > > > patch. So vmalloc is different, really.
> > > > 
> > > > Right.. that's why I'm asking. So it's technically possible but highly
> > > > unlikely due to the different failure characteristics. That seems
> > > > reasonable to me, then. 
> > > > 
> > > > To be clear, do we understand what causes the vzalloc() failure to be
> > > > effectively permanent in this specific reproducer? I know you mention
> > > > above that we could be out of vmalloc space, but that doesn't clarify
> > > > whether there are other potential failure paths or then what this has to
> > > > do with the fact that the process was killed. Does the pending signal
> > > > cause the subsequent failures or are you saying that there is some other
> > > > root cause of the failure, this process would effectively be spinning
> > > > here anyways, and we're just noticing it because it's trying to exit?
> > > 
> > > In this particular case it is fatal_signal_pending that causes the
> > > permanent failure. This check has been added to prevent from complete
> > > memory reserves depletion on OOM when a killed task has a free ticket to
> > > reserves and vmalloc requests can be really large. In this case there
> > > was no OOM killer going on but fsstress has SIGKILL pending for other
> > > reason. Most probably as a result of the group_exit when all threads
> > > are killed (see zap_process). I could have turn fatal_signal_pending
> > > into tsk_is_oom_victim which would be less likely to hit but in
> > > principle fatal_signal_pending should be better because we do want to
> > > bail out when the process is existing as soon as possible.
> > > 
> > > What I really wanted to say is that there are other possible permanent
> > > failure paths in vmalloc AFAICS. They are much less probable but they
> > > still exist.
> > > 
> > > Does that make more sense now?
> > 
> > Yes, thanks. That explains why this crops up now where it hasn't in the
> > past. Please include that background in the commit log description.
> 
> OK, does this sound better. I am open to any suggestions to improve this
> of course
> 

Yeah..

> : xfs: allow kmem_zalloc_greedy to fail
> : 
> : Even though kmem_zalloc_greedy is documented it might fail the current
> : code doesn't really implement this properly and loops on the smallest
> : allowed size for ever. This is a problem because vzalloc might fail
> : permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> : ("vmalloc: back off when the current task is killed") when the current
> : task is killed. The later one makes the failure scenario much more
> : probable than it used to be. Fix this by bailing out if the minimum size
                               ^
	" because it makes vmalloc() failures permanent for tasks with fatal
	  signals pending."

> : request failed.
> : 
> : This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> : 
> : fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> : fsstress cpuset=/ mems_allowed=0-1
> : CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> : Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> : Call Trace:
> :  dump_stack+0x63/0x87
> :  warn_alloc+0x114/0x1c0
> :  ? alloc_pages_current+0x88/0x120
> :  __vmalloc_node_range+0x250/0x2a0
> :  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  ? free_hot_cold_page+0x21f/0x280
> :  vzalloc+0x54/0x60
> :  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  kmem_zalloc_greedy+0x2b/0x40 [xfs]
> :  xfs_bulkstat+0x11b/0x730 [xfs]
> :  ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> :  ? selinux_capable+0x20/0x30
> :  ? security_capable+0x48/0x60
> :  xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> :  xfs_file_ioctl+0x9dd/0xad0 [xfs]
> :  ? do_filp_open+0xa5/0x100
> :  do_vfs_ioctl+0xa7/0x5e0
> :  SyS_ioctl+0x79/0x90
> :  do_syscall_64+0x67/0x180
> :  entry_SYSCALL64_slow_path+0x25/0x25
> : 
> : fsstress keeps looping inside kmem_zalloc_greedy without any way out
> : because vmalloc keeps failing due to fatal_signal_pending.
> : 
> : Reported-by: Xiong Zhou <xzhou@redhat.com>
> : Analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> : Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> > Also, that kind of makes me think that a fatal_signal_pending() check is
> > still appropriate in the loop, even if we want to drop the infinite
> > retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
> > however many retries are left before we return and that's also more
> > explicit for the next person who goes to change this code in the future.
> 
> I am not objecting to adding fatal_signal_pending as well I just thought
> that from the logic POV breaking after reaching the minimum size is just
> the right thing to do. We can optimize further by checking
> fatal_signal_pending and reducing retries when we know it doesn't make
> much sense but that should be done on top as an optimization IMHO.
> 

I don't think of it as an optimization to not invoke calls (a
non-deterministic number of times) that we know are going to fail, but
ultimately I don't care too much how it's framed or if it's done in
separate patches or whatnot. As long as they are posted at the same
time. ;)

Brian

> > Otherwise, I'm fine with breaking the infinite retry loop at the same
> > time. It looks like Christoph added this function originally so this
> > should probably require his ack as well..
> 
> What do you think Christoph?
> -- 
> Michal Hocko
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2017-03-02 15:30 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-01  4:46 mm allocation failure and hang when running xfstests generic/269 on xfs Xiong Zhou
2017-03-01  4:46 ` Xiong Zhou
2017-03-02  0:37 ` Christoph Hellwig
2017-03-02  0:37   ` Christoph Hellwig
2017-03-02  5:19   ` Xiong Zhou
2017-03-02  5:19     ` Xiong Zhou
2017-03-02  6:41     ` Bob Liu
2017-03-02  6:41       ` Bob Liu
2017-03-02  6:41       ` Bob Liu
2017-03-02  6:41       ` Bob Liu
2017-03-02  6:47     ` Anshuman Khandual
2017-03-02  6:47       ` Anshuman Khandual
2017-03-02  8:42       ` Michal Hocko
2017-03-02  8:42         ` Michal Hocko
2017-03-02  9:23         ` Xiong Zhou
2017-03-02  9:23           ` Xiong Zhou
2017-03-02 10:04     ` Tetsuo Handa
2017-03-02 10:04       ` Tetsuo Handa
2017-03-02 10:35       ` Michal Hocko
2017-03-02 10:35         ` Michal Hocko
2017-03-02 10:35         ` Michal Hocko
2017-03-02 10:53         ` mm allocation failure and hang when running xfstests generic/269on xfs Tetsuo Handa
2017-03-02 10:53           ` Tetsuo Handa
2017-03-02 12:24         ` mm allocation failure and hang when running xfstests generic/269 on xfs Brian Foster
2017-03-02 12:24           ` Brian Foster
2017-03-02 12:49           ` Michal Hocko
2017-03-02 12:49             ` Michal Hocko
2017-03-02 13:00             ` Brian Foster
2017-03-02 13:00               ` Brian Foster
2017-03-02 13:07               ` Tetsuo Handa
2017-03-02 13:07                 ` Tetsuo Handa
2017-03-02 13:27               ` Michal Hocko
2017-03-02 13:27                 ` Michal Hocko
2017-03-02 13:41                 ` Brian Foster
2017-03-02 13:41                   ` Brian Foster
2017-03-02 13:50                   ` Michal Hocko
2017-03-02 13:50                     ` Michal Hocko
2017-03-02 14:23                     ` Brian Foster
2017-03-02 14:23                       ` Brian Foster
2017-03-02 14:34                       ` Michal Hocko
2017-03-02 14:34                         ` Michal Hocko
2017-03-02 14:51                         ` Brian Foster
2017-03-02 14:51                           ` Brian Foster
2017-03-02 15:14                           ` Michal Hocko
2017-03-02 15:14                             ` Michal Hocko
2017-03-02 15:30                             ` Brian Foster [this message]
2017-03-02 15:30                               ` Brian Foster
2017-03-02 15:45                               ` [PATCH 1/2] xfs: allow kmem_zalloc_greedy to fail Michal Hocko
2017-03-02 15:45                                 ` Michal Hocko
2017-03-02 15:45                                 ` Michal Hocko
2017-03-02 15:45                                 ` Michal Hocko
2017-03-02 15:45                                 ` [PATCH 2/2] xfs: back off from kmem_zalloc_greedy if the task is killed Michal Hocko
2017-03-02 15:45                                   ` Michal Hocko
2017-03-02 15:45                                   ` Michal Hocko
2017-03-02 15:45                                   ` Michal Hocko
2017-03-02 15:49                                   ` Christoph Hellwig
2017-03-02 15:49                                     ` Christoph Hellwig
2017-03-02 15:59                                   ` Brian Foster
2017-03-02 15:59                                     ` Brian Foster
2017-03-02 15:49                                 ` [PATCH 1/2] xfs: allow kmem_zalloc_greedy to fail Christoph Hellwig
2017-03-02 15:49                                   ` Christoph Hellwig
2017-03-02 15:59                                 ` Brian Foster
2017-03-02 15:59                                   ` Brian Foster
2017-03-02 16:16                                 ` Michal Hocko
2017-03-02 16:16                                   ` Michal Hocko
2017-03-02 16:44                                   ` Darrick J. Wong
2017-03-02 16:44                                     ` Darrick J. Wong
2017-03-03 22:54                                 ` Dave Chinner
2017-03-03 22:54                                   ` Dave Chinner
2017-03-03 23:19                                   ` Darrick J. Wong
2017-03-03 23:19                                     ` Darrick J. Wong
2017-03-04  4:48                                     ` Dave Chinner
2017-03-04  4:48                                       ` Dave Chinner
2017-03-06 13:21                                   ` Michal Hocko
2017-03-06 13:21                                     ` Michal Hocko
2017-03-02 15:47                               ` mm allocation failure and hang when running xfstests generic/269 on xfs Michal Hocko
2017-03-02 15:47                                 ` Michal Hocko
2017-03-02 15:47                           ` Christoph Hellwig
2017-03-02 15:47                             ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170302153002.GG3213@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mhocko@kernel.org \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=xzhou@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.