From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: hch@lst.de, mgorman@suse.de, viro@ZenIV.linux.org.uk,
	linux-mm@kvack.org, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Date: Mon, 30 Jan 2017 09:55:46 +0100	[thread overview]
Message-ID: <20170130085546.GF8443@dhcp22.suse.cz> (raw)
In-Reply-To: <201701290027.AFB30799.FVtFLOOOJMSHQF@I-love.SAKURA.ne.jp>

On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo,
> > before we settle on the proper fix for this issue, could you give the
> > patch a try and try to reproduce the too_many_isolated() issue or
> > just see whether patch [1] has any negative effect on your oom stress
> > testing?
> > 
> > [1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz
> 
> I tested with both [1] and the patch below applied on linux-next-20170125;
> the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .
> 
> Regarding the patch below, it helped avoid complete memory depletion with
> large write() requests. I don't know whether it also helps avoid complete
> memory depletion when reading a large amount of data (in other words, I
> don't know whether this check is done for large read() requests).

It is not, AFAICS. do_generic_file_read() doesn't do the
fatal_signal_pending() check.
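
A check along the same lines in the buffered read loop would mirror the
write-side fix below; just a sketch, not a tested patch:

	/* at the top of the per-page loop in do_generic_file_read() */
	if (fatal_signal_pending(current)) {
		error = -EINTR;
		goto out;
	}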

> But
> I believe that __GFP_KILLABLE (despite the limitation that there are
> unkillable waits in the reclaim path) is a better solution than
> scattering fatal_signal_pending() checks around in the callers. The
> reason we check SIGKILL here is to avoid allocating more memory than
> needed. If we check SIGKILL at the entry point of __alloc_pages_nodemask()
> and at the retry: label in __alloc_pages_slowpath(), we waste zero pages.
> Regardless of whether the OOM killer is invoked or whether memory can be
> allocated without a direct reclaim operation, not allocating memory unless
> needed (in other words, allowing the page allocator to fail immediately if
> the caller can give up on SIGKILL and SIGKILL is pending) makes sense. It
> would reduce the possibility of an OOM livelock on CONFIG_MMU=n kernels
> where the OOM reaper is not available.

I am not really convinced this is a good idea. Putting aside the fuzzy
semantics of __GFP_KILLABLE, we would have to use this flag in all
potentially allocating places in the read/write paths, and then it is
just easier to do the explicit checks in the loops around those
allocations.
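
For reference, the proposal would boil down to something like the
following; this is a hypothetical sketch, neither the flag nor the
helper exists upstream:

	/* assumed: a spare gfp bit repurposed as __GFP_KILLABLE */
	#define __GFP_KILLABLE	((__force gfp_t)0x2000000u)

	static inline bool killable_alloc_should_fail(gfp_t gfp_mask)
	{
		/*
		 * The caller has promised it can cope with failure, so
		 * bail out instead of reclaiming on behalf of a task
		 * that is already being killed.
		 */
		return (gfp_mask & __GFP_KILLABLE) &&
			fatal_signal_pending(current);
	}

__alloc_pages_slowpath() would then call this on entry and at its retry:
label and return NULL immediately. Every allocation in the read/write
paths would still have to pass the flag, which is exactly the churn I
would like to avoid.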
 
> > On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> > [...]
> > > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> > > 
> > > Tetsuo has noticed that an OOM stress test which performs large write
> > > requests can cause complete depletion of the memory reserves. He has
> > > tracked this down to the following path:
> > > 	__alloc_pages_nodemask+0x436/0x4d0
> > > 	alloc_pages_current+0x97/0x1b0
> > > 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> > > 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> > > 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> > > 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> > > 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> > > 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> > > 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> > > 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> > > 	? iomap_write_end+0x80/0x80
> > > 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> > > 	? remove_wait_queue+0x59/0x60
> > > 	xfs_file_write_iter+0x90/0x130 [xfs]
> > > 	__vfs_write+0xe5/0x140
> > > 	vfs_write+0xc7/0x1f0
> > > 	? syscall_trace_enter+0x1d0/0x380
> > > 	SyS_write+0x58/0xc0
> > > 	do_syscall_64+0x6c/0x200
> > > 	entry_SYSCALL64_slow_path+0x25/0x25
> > > 
> > > An OOM victim has access to the full memory reserves in order to make
> > > forward progress towards exiting more easily. But iomap_file_buffered_write
> > > and other callers of iomap_apply loop to complete the full request. We
> > > need to check for fatal signals and back off with a short write instead.
> > > As iomap_apply delegates all the work down to the actors, we have to
> > > hook into those. All actors that work with the page cache call
> > > iomap_write_begin, so we check for signals there. dax_iomap_actor has
> > > to handle the situation explicitly because it copies data to userspace
> > > directly. Other callers, such as iomap_page_mkwrite, work on a single
> > > page, and iomap_fiemap_actor does not allocate memory based on the
> > > given len.
> > > 
> > > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > > Cc: stable # 4.8+
> > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  fs/dax.c   | 5 +++++
> > >  fs/iomap.c | 3 +++
> > >  2 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 413a91db9351..0e263dacf9cf 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >  		struct blk_dax_ctl dax = { 0 };
> > >  		ssize_t map_len;
> > >  
> > > +		if (fatal_signal_pending(current)) {
> > > +			ret = -EINTR;
> > > +			break;
> > > +		}
> > > +
> > >  		dax.sector = dax_iomap_sector(iomap, pos);
> > >  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
> > >  		map_len = dax_map_atomic(iomap->bdev, &dax);
> > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > index e57b90b5ff37..691eada58b06 100644
> > > --- a/fs/iomap.c
> > > +++ b/fs/iomap.c
> > > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> > >  
> > >  	BUG_ON(pos + len > iomap->offset + iomap->length);
> > >  
> > > +	if (fatal_signal_pending(current))
> > > +		return -EINTR;
> > > +
> > >  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
> > >  	if (!page)
> > >  		return -ENOMEM;
> > > -- 
> > > 2.11.0
> 
> Regarding [1], it helped avoid the too_many_isolated() issue. I can't
> tell whether it has any negative effect, but on the first trial I saw
> that all allocating threads were blocked on wait_for_completion() from
> flush_work() in drain_all_pages(), introduced by "mm, page_alloc: drain
> per-cpu pages from workqueue context". There was no warn_alloc() stall
> warning message afterwards.

That patch is buggy and there is a follow-up [1] which is not sitting in
mmotm (and thus linux-next) yet. I didn't get to review it properly, and
I cannot say I would be too happy about using a WQ from the page
allocator. I believe even the follow-up needs a WQ_MEM_RECLAIM workqueue.

[1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
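
For completeness, a workqueue which has to guarantee forward progress
under memory pressure needs a rescuer thread, i.e. it has to be created
with WQ_MEM_RECLAIM; an illustrative snippet (the wq name is made up):

	static struct workqueue_struct *drain_wq;

	/*
	 * WQ_MEM_RECLAIM provides a rescuer thread so that queued work
	 * can still run when new workers cannot be forked due to the
	 * very memory shortage we are trying to handle.
	 */
	drain_wq = alloc_workqueue("drain_wq", WQ_MEM_RECLAIM, 0);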

Thanks for your testing!
-- 
Michal Hocko
SUSE Labs
