Re: [PATCH] xfs: prevent stack overflows from page cache allocation

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
       [not found] <1382585110-1796-1-git-send-email-david@fromorbit.com>
@ 2013-10-24  8:48 ` Christoph Hellwig
  2013-10-24 10:37   ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2013-10-24  8:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

On Thu, Oct 24, 2013 at 02:25:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Page cache allocation doesn't always go through ->begin_write and
> hence we don't always get the opportunity to set the allocation
> context to GFP_NOFS. Failing to do this means we open up the direct
> relcaim stack to recurse into the filesystem and consume a
> significant amount of stack.
> 
> On RHEL6.4 kernels we are seeing ra_submit() and
> generic_file_splice_read() from an nfsd context recursing into the
> filesystem via the inode cache shrinker and evicting inodes. This is
> causing truncation to be run (e.g EOF block freeing) and causing
> bmap btree block merges and free space btree block splits to occur.
> These btree manipulations are occurring with the call chain already
> 30 functions deep and hence there is not enough stack space to
> complete such operations.

It seems like we really should fix this in the VFS as it could affect
all non-trivial filesystems.

> To avoid these specific overruns, we need to prevent the page cache
> allocation from recursing via direct reclaim. We can do that because
> the allocation functions take the allocation context from that which
> is stored in the mapping for the inode. We don't set that right now,
> so the default is GFP_HIGHUSER_MOVABLE, which is effectively a
> GFP_KERNEL context. We need it to be the equivalent of GFP_NOFS, so
> when we initialise an inode, set the mapping gfp mask appropriately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index c4cd6d4..27e0e54 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1168,6 +1168,7 @@ xfs_setup_inode(
>  	struct xfs_inode	*ip)
>  {
>  	struct inode		*inode = &ip->i_vnode;
> +	gfp_t			gfp_mask;
>  
>  	inode->i_ino = ip->i_ino;
>  	inode->i_state = I_NEW;
> @@ -1230,6 +1231,14 @@ xfs_setup_inode(
>  	}
>  
>  	/*
> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
> +	 * prevent direct reclaim recursion back into the filesystem and blowing
> +	 * stacks or deadlocking.
> +	 */
> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> +
> +	/*
>  	 * If there is no attribute fork no ACL can exist on this inode,
>  	 * and it can't have any file capabilities attached to it either.
>  	 */
> -- 
> 1.8.4.rc3
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
---end quoted text---

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24  8:48 ` [PATCH] xfs: prevent stack overflows from page cache allocation Christoph Hellwig
@ 2013-10-24 10:37   ` Dave Chinner
  2013-10-24 15:42     ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2013-10-24 10:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, linux-fsdevel

On Thu, Oct 24, 2013 at 01:48:03AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2013 at 02:25:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Page cache allocation doesn't always go through ->begin_write and
> > hence we don't always get the opportunity to set the allocation
> > context to GFP_NOFS. Failing to do this means we open up the direct
> > relcaim stack to recurse into the filesystem and consume a
> > significant amount of stack.
> > 
> > On RHEL6.4 kernels we are seeing ra_submit() and
> > generic_file_splice_read() from an nfsd context recursing into the
> > filesystem via the inode cache shrinker and evicting inodes. This is
> > causing truncation to be run (e.g EOF block freeing) and causing
> > bmap btree block merges and free space btree block splits to occur.
> > These btree manipulations are occurring with the call chain already
> > 30 functions deep and hence there is not enough stack space to
> > complete such operations.
> 
> It seems like we really should fix this in the VFS as it could affect
> all non-trivial filesystems.

Sure, if you want to. But doing that shouldn't prevent this fix from
being committed in the mean time, especially as other filesystems
already use this method for avoiding these problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 10:37   ` Dave Chinner
@ 2013-10-24 15:42     ` Christoph Hellwig
  2013-10-24 16:41       ` Ben Myers
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2013-10-24 15:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote:
> Sure, if you want to. But doing that shouldn't prevent this fix from
> being committed in the mean time, especially as other filesystems
> already use this method for avoiding these problems.

I'd much prefer aiming for the proper fix first.  If for some reason we
can't get it done in time the workaround can be applied.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 15:42     ` Christoph Hellwig
@ 2013-10-24 16:41       ` Ben Myers
  2013-10-24 21:24         ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Ben Myers @ 2013-10-24 16:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, xfs

Hey,

On Thu, Oct 24, 2013 at 08:42:20AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote:
> > Sure, if you want to. But doing that shouldn't prevent this fix from
> > being committed in the mean time, especially as other filesystems
> > already use this method for avoiding these problems.
> 
> I'd much prefer aiming for the proper fix first.  If for some reason we
> can't get it done in time the workaround can be applied.

Dave probably has a customer waiting on this.  If pulling this in will make a
proper fix more difficult to do I can understand keeping the patch out.
Otherwise, can't we just remove this along with the other filesystems'
equivalent code when the proper fix is committed?

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 16:41       ` Ben Myers
@ 2013-10-24 21:24         ` Dave Chinner
  2013-10-25 11:29           ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2013-10-24 21:24 UTC (permalink / raw)
  To: Ben Myers; +Cc: Christoph Hellwig, linux-fsdevel, xfs

On Thu, Oct 24, 2013 at 11:41:15AM -0500, Ben Myers wrote:
> Hey,
> 
> On Thu, Oct 24, 2013 at 08:42:20AM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote:
> > > Sure, if you want to. But doing that shouldn't prevent this fix from
> > > being committed in the mean time, especially as other filesystems
> > > already use this method for avoiding these problems.
> > 
> > I'd much prefer aiming for the proper fix first.  If for some reason we
> > can't get it done in time the workaround can be applied.
> 
> Dave probably has a customer waiting on this.

Obviously. And being a kernel where we have a fixed ABI, we can't
backport any fix that changes core code.

> If pulling this in will make a
> proper fix more difficult to do I can understand keeping the patch out.

It doesn't make a proper fix any harder - removing 2 lines of code
is trivial.

> Otherwise, can't we just remove this along with the other filesystems'
> equivalent code when the proper fix is committed?

Yes, we can.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 21:24         ` Dave Chinner
@ 2013-10-25 11:29           ` Christoph Hellwig
  2013-10-27  9:04             ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2013-10-25 11:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ben Myers, Christoph Hellwig, linux-fsdevel, xfs

On Fri, Oct 25, 2013 at 08:24:48AM +1100, Dave Chinner wrote:
> > > I'd much prefer aiming for the proper fix first.  If for some reason we
> > > can't get it done in time the workaround can be applied.
> > 
> > Dave probably has a customer waiting on this.
> 
> Obviously. And being a kernel where we have a fixed ABI, we can't
> backport any fix that changes core code.

No one is trying to tell you what to ship to your customers.  That
doesn't mean we should aim for the right fix upstream.

I don't really mind pushing patches like yours as a last resort when
dealing with unrepsonsive or disagreeing maintainers like we had to do
in the past, but trying to push the workaround without even attemping
the proper fix is a bit sad.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-25 11:29           ` Christoph Hellwig
@ 2013-10-27  9:04             ` Dave Chinner
  2013-10-28  9:49               ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2013-10-27  9:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ben Myers, linux-fsdevel, xfs

On Fri, Oct 25, 2013 at 04:29:34AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 25, 2013 at 08:24:48AM +1100, Dave Chinner wrote:
> > > > I'd much prefer aiming for the proper fix first.  If for some reason we
> > > > can't get it done in time the workaround can be applied.
> > > 
> > > Dave probably has a customer waiting on this.
> > 
> > Obviously. And being a kernel where we have a fixed ABI, we can't
> > backport any fix that changes core code.
> 
> No one is trying to tell you what to ship to your customers.  That
> doesn't mean we should aim for the right fix upstream.

Have you looked at the page cache allocation code recently? e.g,
grab_cache_page_write_begin(), which we pass AOP_FLAG_NOFS into to
prevent direct reclaim deadlocks:

        gfp_mask = mapping_gfp_mask(mapping);
        if (mapping_cap_account_dirty(mapping))
                gfp_mask |= __GFP_WRITE;
        if (flags & AOP_FLAG_NOFS)
                gfp_notmask = __GFP_FS;

See? The gfp mask that is used for allocation is directly based on
the gfp mask that is set on the mapping. With this patch I proposed,
we can remove the all the AOP_FLAG_NOFS usage because it's
redundant.  IOWs, the mapping alread has the correct allocation
context set on it for *all* mapping based allocations without having
to add and pass magic flags to various interfaces to get it to do
the right thing.

> I don't really mind pushing patches like yours as a last resort when
> dealing with unrepsonsive or disagreeing maintainers like we had to do
> in the past, but trying to push the workaround without even attemping
> the proper fix is a bit sad.

The page cache infrastructure (i.e. the generic VFS code) is already
set up to be used in this manner, and other filesystems use it in
this way where they need to as well. So, really, I'm not sure what
problem you think needs solving here.

Indeed, I'm not even sure yet how this would be "fixed" in the VFS
code.  Passing AOP_FLAG_NOFS several functions deep to get it into
ra_submit or the low level splice functions is a non-starter, so the
context needs to be carried by something else. That's exactly what
the per-mapping gfp mask does...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-27  9:04             ` Dave Chinner
@ 2013-10-28  9:49               ` Christoph Hellwig
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2013-10-28  9:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, xfs

Thanks for the explanation Dave.  I suspected a generic bug because
the two places mentioned are generic code.  Turns out only XFS and some
cluster filesystems are affected because other filesystems do not call
those under filesystem locks or i_mutex.

But it seem like AOP_FLAG_NOFS is indeed pointless and we should remove it,
in XFS as part of this patch and everywhere else later after talking
to the maintainers for the few other places using it.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-10-28  9:49 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1382585110-1796-1-git-send-email-david@fromorbit.com>
2013-10-24  8:48 ` [PATCH] xfs: prevent stack overflows from page cache allocation Christoph Hellwig
2013-10-24 10:37   ` Dave Chinner
2013-10-24 15:42     ` Christoph Hellwig
2013-10-24 16:41       ` Ben Myers
2013-10-24 21:24         ` Dave Chinner
2013-10-25 11:29           ` Christoph Hellwig
2013-10-27  9:04             ` Dave Chinner
2013-10-28  9:49               ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).