All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	mgorman@suse.de, viro@ZenIV.linux.org.uk, linux-mm@kvack.org,
	hannes@cmpxchg.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Date: Tue, 31 Jan 2017 12:58:46 +0100	[thread overview]
Message-ID: <20170131115846.GD19082@dhcp22.suse.cz> (raw)
In-Reply-To: <20170125130014.GO32377@dhcp22.suse.cz>

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > > I think we are missing a check for fatal_signal_pending in
> > > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > > full memory reserves. What do you think about the following? I haven't
> > > > > tested this but it mimics generic_perform_write so I guess it should
> > > > > work.
> > > > 
> > > > Hi Michal,
> > > > 
> > > > this looks reasonable to me.  But we have a few more such loops,
> > > > maybe it makes sense to move the check into iomap_apply?
> > > 
> > > I wasn't sure about the expected semantic of iomap_apply but now that
> > > I've actually checked all the callers I believe all of them should be
> > > able to handle EINTR just fine. Well iomap_file_dirty, iomap_zero_range,
> > > iomap_fiemap and iomap_page_mkwriteseem do not follow the standard
> > > pattern to return the number of written pages or an error but it rather
> > > propagates the error out. From my limited understanding of those code
> > > paths that should just be ok. I was not all that sure about iomap_dio_rw
> > > that is just too convoluted for me. If that one is OK as well then
> > > the following patch should be indeed better.
> > 
> > Is "length" in
> > 
> >    written = actor(inode, pos, length, data, &iomap);
> > 
> > call guaranteed to be small enough? If not guaranteed,
> > don't we need to check SIGKILL inside "actor" functions?
> 
> You are right! Checking for signals inside iomap_apply doesn't really
> solve anything because basically all users do iov_iter_count(). Blee. So
> we have loops around iomap_apply which itself loops inside the actor.
> iomap_write_begin seems to be used by most of them which is also where we
> get the pagecache page so I guess this should be the "right" place to
> put the check in. Things like dax_iomap_actor will need an explicit check.
> This is quite unfortunate but I do not see any better solution.
> What do you think Christoph?

What do you think Christoph? I have an additional patch to handle
do_generic_file_read and a similar one to back off in
__vmalloc_area_node. I would like to post them all in one series but I
would like to know that this one is OK before I do that.

Thanks!

> ---
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> 
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
> 	__alloc_pages_nodemask+0x436/0x4d0
> 	alloc_pages_current+0x97/0x1b0
> 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> 	? iomap_write_end+0x80/0x80
> 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> 	? remove_wait_queue+0x59/0x60
> 	xfs_file_write_iter+0x90/0x130 [xfs]
> 	__vfs_write+0xe5/0x140
> 	vfs_write+0xc7/0x1f0
> 	? syscall_trace_enter+0x1d0/0x380
> 	SyS_write+0x58/0xc0
> 	do_syscall_64+0x6c/0x200
> 	entry_SYSCALL64_slow_path+0x25/0x25
> 
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
>  		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	BUG_ON(pos + len > iomap->offset + iomap->length);
>  
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
>  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> -- 
> 2.11.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	mgorman@suse.de, viro@ZenIV.linux.org.uk, linux-mm@kvack.org,
	hannes@cmpxchg.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
Date: Tue, 31 Jan 2017 12:58:46 +0100	[thread overview]
Message-ID: <20170131115846.GD19082@dhcp22.suse.cz> (raw)
In-Reply-To: <20170125130014.GO32377@dhcp22.suse.cz>

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > > I think we are missing a check for fatal_signal_pending in
> > > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > > full memory reserves. What do you think about the following? I haven't
> > > > > tested this but it mimics generic_perform_write so I guess it should
> > > > > work.
> > > > 
> > > > Hi Michal,
> > > > 
> > > > this looks reasonable to me.  But we have a few more such loops,
> > > > maybe it makes sense to move the check into iomap_apply?
> > > 
> > > I wasn't sure about the expected semantic of iomap_apply but now that
> > > I've actually checked all the callers I believe all of them should be
> > > able to handle EINTR just fine. Well iomap_file_dirty, iomap_zero_range,
> > > iomap_fiemap and iomap_page_mkwriteseem do not follow the standard
> > > pattern to return the number of written pages or an error but it rather
> > > propagates the error out. From my limited understanding of those code
> > > paths that should just be ok. I was not all that sure about iomap_dio_rw
> > > that is just too convoluted for me. If that one is OK as well then
> > > the following patch should be indeed better.
> > 
> > Is "length" in
> > 
> >    written = actor(inode, pos, length, data, &iomap);
> > 
> > call guaranteed to be small enough? If not guaranteed,
> > don't we need to check SIGKILL inside "actor" functions?
> 
> You are right! Checking for signals inside iomap_apply doesn't really
> solve anything because basically all users do iov_iter_count(). Blee. So
> we have loops around iomap_apply which itself loops inside the actor.
> iomap_write_begin seems to be used by most of them which is also where we
> get the pagecache page so I guess this should be the "right" place to
> put the check in. Things like dax_iomap_actor will need an explicit check.
> This is quite unfortunate but I do not see any better solution.
> What do you think Christoph?

What do you think Christoph? I have an additional patch to handle
do_generic_file_read and a similar one to back off in
__vmalloc_area_node. I would like to post them all in one series but I
would like to know that this one is OK before I do that.

Thanks!

> ---
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> 
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
> 	__alloc_pages_nodemask+0x436/0x4d0
> 	alloc_pages_current+0x97/0x1b0
> 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> 	? iomap_write_end+0x80/0x80
> 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> 	? remove_wait_queue+0x59/0x60
> 	xfs_file_write_iter+0x90/0x130 [xfs]
> 	__vfs_write+0xe5/0x140
> 	vfs_write+0xc7/0x1f0
> 	? syscall_trace_enter+0x1d0/0x380
> 	SyS_write+0x58/0xc0
> 	do_syscall_64+0x6c/0x200
> 	entry_SYSCALL64_slow_path+0x25/0x25
> 
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
>  		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	BUG_ON(pos + len > iomap->offset + iomap->length);
>  
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
>  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> -- 
> 2.11.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

  parent reply	other threads:[~2017-01-31 11:58 UTC|newest]

Thread overview: 110+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-18 13:44 [RFC PATCH 0/2] fix unbounded too_many_isolated Michal Hocko
2017-01-18 13:44 ` Michal Hocko
2017-01-18 13:44 ` [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone Michal Hocko
2017-01-18 13:44   ` Michal Hocko
2017-01-18 14:46   ` Mel Gorman
2017-01-18 14:46     ` Mel Gorman
2017-01-18 15:15     ` Michal Hocko
2017-01-18 15:15       ` Michal Hocko
2017-01-18 15:54       ` Mel Gorman
2017-01-18 15:54         ` Mel Gorman
2017-01-18 16:17         ` Michal Hocko
2017-01-18 16:17           ` Michal Hocko
2017-01-18 17:00           ` Mel Gorman
2017-01-18 17:00             ` Mel Gorman
2017-01-18 17:29             ` Michal Hocko
2017-01-18 17:29               ` Michal Hocko
2017-01-19 10:07               ` Mel Gorman
2017-01-19 10:07                 ` Mel Gorman
2017-01-19 11:23                 ` Michal Hocko
2017-01-19 11:23                   ` Michal Hocko
2017-01-19 13:11                   ` Mel Gorman
2017-01-19 13:11                     ` Mel Gorman
2017-01-20 13:27                     ` Tetsuo Handa
2017-01-20 13:27                       ` Tetsuo Handa
2017-01-21  7:42                       ` Tetsuo Handa
2017-01-21  7:42                         ` Tetsuo Handa
2017-01-25 10:15                         ` Michal Hocko
2017-01-25 10:15                           ` Michal Hocko
2017-01-25 10:19                           ` Christoph Hellwig
2017-01-25 10:19                             ` Christoph Hellwig
2017-01-25 10:46                             ` Michal Hocko
2017-01-25 10:46                               ` Michal Hocko
2017-01-25 11:09                               ` Tetsuo Handa
2017-01-25 11:09                                 ` Tetsuo Handa
2017-01-25 13:00                                 ` Michal Hocko
2017-01-25 13:00                                   ` Michal Hocko
2017-01-27 14:49                                   ` Michal Hocko
2017-01-27 14:49                                     ` Michal Hocko
2017-01-28 15:27                                     ` Tetsuo Handa
2017-01-28 15:27                                       ` Tetsuo Handa
2017-01-30  8:55                                       ` Michal Hocko
2017-01-30  8:55                                         ` Michal Hocko
2017-02-02 10:14                                         ` Michal Hocko
2017-02-02 10:14                                           ` Michal Hocko
2017-02-03 10:57                                           ` Tetsuo Handa
2017-02-03 10:57                                             ` Tetsuo Handa
2017-02-03 14:41                                             ` Michal Hocko
2017-02-03 14:41                                               ` Michal Hocko
2017-02-03 14:50                                             ` Michal Hocko
2017-02-03 14:50                                               ` Michal Hocko
2017-02-03 17:24                                               ` Brian Foster
2017-02-03 17:24                                                 ` Brian Foster
2017-02-06  6:29                                                 ` Tetsuo Handa
2017-02-06  6:29                                                   ` Tetsuo Handa
2017-02-06 14:35                                                   ` Brian Foster
2017-02-06 14:35                                                     ` Brian Foster
2017-02-06 14:42                                                     ` Michal Hocko
2017-02-06 14:42                                                       ` Michal Hocko
2017-02-06 15:47                                                       ` Brian Foster
2017-02-06 15:47                                                         ` Brian Foster
2017-02-07 10:30                                                     ` Tetsuo Handa
2017-02-07 10:30                                                       ` Tetsuo Handa
2017-02-07 16:54                                                       ` Brian Foster
2017-02-07 16:54                                                         ` Brian Foster
2017-02-03 14:55                                             ` Michal Hocko
2017-02-03 14:55                                               ` Michal Hocko
2017-02-05 10:43                                               ` Tetsuo Handa
2017-02-05 10:43                                                 ` Tetsuo Handa
2017-02-06 10:34                                                 ` Michal Hocko
2017-02-06 10:34                                                   ` Michal Hocko
2017-02-06 10:39                                                 ` Michal Hocko
2017-02-06 10:39                                                   ` Michal Hocko
2017-02-07 21:12                                                   ` Michal Hocko
2017-02-07 21:12                                                     ` Michal Hocko
2017-02-08  9:24                                                     ` Peter Zijlstra
2017-02-08  9:24                                                       ` Peter Zijlstra
2017-02-21  9:40                                             ` Michal Hocko
2017-02-21  9:40                                               ` Michal Hocko
2017-02-21 14:35                                               ` Tetsuo Handa
2017-02-21 14:35                                                 ` Tetsuo Handa
2017-02-21 15:53                                                 ` Michal Hocko
2017-02-21 15:53                                                   ` Michal Hocko
2017-02-22  2:02                                                   ` Tetsuo Handa
2017-02-22  2:02                                                     ` Tetsuo Handa
2017-02-22  7:54                                                     ` Michal Hocko
2017-02-22  7:54                                                       ` Michal Hocko
2017-02-26  6:30                                                       ` Tetsuo Handa
2017-02-26  6:30                                                         ` Tetsuo Handa
2017-01-31 11:58                                   ` Michal Hocko [this message]
2017-01-31 11:58                                     ` Michal Hocko
2017-01-31 12:51                                     ` Christoph Hellwig
2017-01-31 12:51                                       ` Christoph Hellwig
2017-01-31 13:21                                       ` Michal Hocko
2017-01-31 13:21                                         ` Michal Hocko
2017-01-25 10:33                           ` [RFC PATCH 1/2] mm, vmscan: account the number of isolated pagesper zone Tetsuo Handa
2017-01-25 10:33                             ` Tetsuo Handa
2017-01-25 12:34                             ` Michal Hocko
2017-01-25 12:34                               ` Michal Hocko
2017-01-25 13:13                               ` [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone Tetsuo Handa
2017-01-25 13:13                                 ` Tetsuo Handa
2017-01-25  9:53                       ` Michal Hocko
2017-01-25  9:53                         ` Michal Hocko
2017-01-20  6:42                 ` Hillf Danton
2017-01-20  6:42                   ` Hillf Danton
2017-01-20  9:25                   ` Mel Gorman
2017-01-20  9:25                     ` Mel Gorman
2017-01-18 13:44 ` [RFC PATCH 2/2] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
2017-01-18 13:44   ` Michal Hocko
2017-01-18 14:50   ` Mel Gorman
2017-01-18 14:50     ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170131115846.GD19082@dhcp22.suse.cz \
    --to=mhocko@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hch@lst.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.