public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: prevent stack overflows from page cache allocation
@ 2013-10-24  3:25 Dave Chinner
  2013-10-24  8:48 ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-10-24  3:25 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Page cache allocation doesn't always go through ->begin_write and
hence we don't always get the opportunity to set the allocation
context to GFP_NOFS. Failing to do this means we open up the direct
relcaim stack to recurse into the filesystem and consume a
significant amount of stack.

On RHEL6.4 kernels we are seeing ra_submit() and
generic_file_splice_read() from an nfsd context recursing into the
filesystem via the inode cache shrinker and evicting inodes. This is
causing truncation to be run (e.g EOF block freeing) and causing
bmap btree block merges and free space btree block splits to occur.
These btree manipulations are occurring with the call chain already
30 functions deep and hence there is not enough stack space to
complete such operations.

To avoid these specific overruns, we need to prevent the page cache
allocation from recursing via direct reclaim. We can do that because
the allocation functions take the allocation context from that which
is stored in the mapping for the inode. We don't set that right now,
so the default is GFP_HIGHUSER_MOVABLE, which is effectively a
GFP_KERNEL context. We need it to be the equivalent of GFP_NOFS, so
when we initialise an inode, set the mapping gfp mask appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c4cd6d4..27e0e54 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1168,6 +1168,7 @@ xfs_setup_inode(
 	struct xfs_inode	*ip)
 {
 	struct inode		*inode = &ip->i_vnode;
+	gfp_t			gfp_mask;
 
 	inode->i_ino = ip->i_ino;
 	inode->i_state = I_NEW;
@@ -1230,6 +1231,14 @@ xfs_setup_inode(
 	}
 
 	/*
+	 * Ensure all page cache allocations are done from GFP_NOFS context to
+	 * prevent direct reclaim recursion back into the filesystem and blowing
+	 * stacks or deadlocking.
+	 */
+	gfp_mask = mapping_gfp_mask(inode->i_mapping);
+	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
+
+	/*
 	 * If there is no attribute fork no ACL can exist on this inode,
 	 * and it can't have any file capabilities attached to it either.
 	 */
-- 
1.8.4.rc3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24  3:25 [PATCH] xfs: prevent stack overflows from page cache allocation Dave Chinner
@ 2013-10-24  8:48 ` Christoph Hellwig
  2013-10-24 10:37   ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2013-10-24  8:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Thu, Oct 24, 2013 at 02:25:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Page cache allocation doesn't always go through ->begin_write and
> hence we don't always get the opportunity to set the allocation
> context to GFP_NOFS. Failing to do this means we open up the direct
> relcaim stack to recurse into the filesystem and consume a
> significant amount of stack.
> 
> On RHEL6.4 kernels we are seeing ra_submit() and
> generic_file_splice_read() from an nfsd context recursing into the
> filesystem via the inode cache shrinker and evicting inodes. This is
> causing truncation to be run (e.g EOF block freeing) and causing
> bmap btree block merges and free space btree block splits to occur.
> These btree manipulations are occurring with the call chain already
> 30 functions deep and hence there is not enough stack space to
> complete such operations.

It seems like we really should fix this in the VFS as it could affect
all non-trivial filesystems.

> To avoid these specific overruns, we need to prevent the page cache
> allocation from recursing via direct reclaim. We can do that because
> the allocation functions take the allocation context from that which
> is stored in the mapping for the inode. We don't set that right now,
> so the default is GFP_HIGHUSER_MOVABLE, which is effectively a
> GFP_KERNEL context. We need it to be the equivalent of GFP_NOFS, so
> when we initialise an inode, set the mapping gfp mask appropriately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index c4cd6d4..27e0e54 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1168,6 +1168,7 @@ xfs_setup_inode(
>  	struct xfs_inode	*ip)
>  {
>  	struct inode		*inode = &ip->i_vnode;
> +	gfp_t			gfp_mask;
>  
>  	inode->i_ino = ip->i_ino;
>  	inode->i_state = I_NEW;
> @@ -1230,6 +1231,14 @@ xfs_setup_inode(
>  	}
>  
>  	/*
> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
> +	 * prevent direct reclaim recursion back into the filesystem and blowing
> +	 * stacks or deadlocking.
> +	 */
> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> +
> +	/*
>  	 * If there is no attribute fork no ACL can exist on this inode,
>  	 * and it can't have any file capabilities attached to it either.
>  	 */
> -- 
> 1.8.4.rc3
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
---end quoted text---

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24  8:48 ` Christoph Hellwig
@ 2013-10-24 10:37   ` Dave Chinner
  2013-10-24 15:42     ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-10-24 10:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, xfs

On Thu, Oct 24, 2013 at 01:48:03AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2013 at 02:25:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Page cache allocation doesn't always go through ->begin_write and
> > hence we don't always get the opportunity to set the allocation
> > context to GFP_NOFS. Failing to do this means we open up the direct
> > relcaim stack to recurse into the filesystem and consume a
> > significant amount of stack.
> > 
> > On RHEL6.4 kernels we are seeing ra_submit() and
> > generic_file_splice_read() from an nfsd context recursing into the
> > filesystem via the inode cache shrinker and evicting inodes. This is
> > causing truncation to be run (e.g EOF block freeing) and causing
> > bmap btree block merges and free space btree block splits to occur.
> > These btree manipulations are occurring with the call chain already
> > 30 functions deep and hence there is not enough stack space to
> > complete such operations.
> 
> It seems like we really should fix this in the VFS as it could affect
> all non-trivial filesystems.

Sure, if you want to. But doing that shouldn't prevent this fix from
being committed in the mean time, especially as other filesystems
already use this method for avoiding these problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 10:37   ` Dave Chinner
@ 2013-10-24 15:42     ` Christoph Hellwig
  2013-10-24 16:41       ` Ben Myers
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2013-10-24 15:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote:
> Sure, if you want to. But doing that shouldn't prevent this fix from
> being committed in the mean time, especially as other filesystems
> already use this method for avoiding these problems.

I'd much prefer aiming for the proper fix first.  If for some reason we
can't get it done in time the workaround can be applied.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 15:42     ` Christoph Hellwig
@ 2013-10-24 16:41       ` Ben Myers
  2013-10-24 21:24         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Ben Myers @ 2013-10-24 16:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, xfs

Hey,

On Thu, Oct 24, 2013 at 08:42:20AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote:
> > Sure, if you want to. But doing that shouldn't prevent this fix from
> > being committed in the mean time, especially as other filesystems
> > already use this method for avoiding these problems.
> 
> I'd much prefer aiming for the proper fix first.  If for some reason we
> can't get it done in time the workaround can be applied.

Dave probably has a customer waiting on this.  If pulling this in will make a
proper fix more difficult to do I can understand keeping the patch out.
Otherwise, can't we just remove this along with the other filesystems'
equivalent code when the proper fix is committed?

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 16:41       ` Ben Myers
@ 2013-10-24 21:24         ` Dave Chinner
  2013-10-25 11:29           ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-10-24 21:24 UTC (permalink / raw)
  To: Ben Myers; +Cc: Christoph Hellwig, linux-fsdevel, xfs

On Thu, Oct 24, 2013 at 11:41:15AM -0500, Ben Myers wrote:
> Hey,
> 
> On Thu, Oct 24, 2013 at 08:42:20AM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote:
> > > Sure, if you want to. But doing that shouldn't prevent this fix from
> > > being committed in the mean time, especially as other filesystems
> > > already use this method for avoiding these problems.
> > 
> > I'd much prefer aiming for the proper fix first.  If for some reason we
> > can't get it done in time the workaround can be applied.
> 
> Dave probably has a customer waiting on this.

Obviously. And being a kernel where we have a fixed ABI, we can't
backport any fix that changes core code.

> If pulling this in will make a
> proper fix more difficult to do I can understand keeping the patch out.

It doesn't make a proper fix any harder - removing 2 lines of code
is trivial.

> Otherwise, can't we just remove this along with the other filesystems'
> equivalent code when the proper fix is committed?

Yes, we can.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-24 21:24         ` Dave Chinner
@ 2013-10-25 11:29           ` Christoph Hellwig
  2013-10-27  9:04             ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2013-10-25 11:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, Ben Myers, xfs

On Fri, Oct 25, 2013 at 08:24:48AM +1100, Dave Chinner wrote:
> > > I'd much prefer aiming for the proper fix first.  If for some reason we
> > > can't get it done in time the workaround can be applied.
> > 
> > Dave probably has a customer waiting on this.
> 
> Obviously. And being a kernel where we have a fixed ABI, we can't
> backport any fix that changes core code.

No one is trying to tell you what to ship to your customers.  That
doesn't mean we should aim for the right fix upstream.

I don't really mind pushing patches like yours as a last resort when
dealing with unrepsonsive or disagreeing maintainers like we had to do
in the past, but trying to push the workaround without even attemping
the proper fix is a bit sad.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-25 11:29           ` Christoph Hellwig
@ 2013-10-27  9:04             ` Dave Chinner
  2013-10-28  9:49               ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-10-27  9:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, Ben Myers, xfs

On Fri, Oct 25, 2013 at 04:29:34AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 25, 2013 at 08:24:48AM +1100, Dave Chinner wrote:
> > > > I'd much prefer aiming for the proper fix first.  If for some reason we
> > > > can't get it done in time the workaround can be applied.
> > > 
> > > Dave probably has a customer waiting on this.
> > 
> > Obviously. And being a kernel where we have a fixed ABI, we can't
> > backport any fix that changes core code.
> 
> No one is trying to tell you what to ship to your customers.  That
> doesn't mean we should aim for the right fix upstream.

Have you looked at the page cache allocation code recently? e.g,
grab_cache_page_write_begin(), which we pass AOP_FLAG_NOFS into to
prevent direct reclaim deadlocks:

        gfp_mask = mapping_gfp_mask(mapping);
        if (mapping_cap_account_dirty(mapping))
                gfp_mask |= __GFP_WRITE;
        if (flags & AOP_FLAG_NOFS)
                gfp_notmask = __GFP_FS;

See? The gfp mask that is used for allocation is directly based on
the gfp mask that is set on the mapping. With this patch I proposed,
we can remove the all the AOP_FLAG_NOFS usage because it's
redundant.  IOWs, the mapping alread has the correct allocation
context set on it for *all* mapping based allocations without having
to add and pass magic flags to various interfaces to get it to do
the right thing.

> I don't really mind pushing patches like yours as a last resort when
> dealing with unrepsonsive or disagreeing maintainers like we had to do
> in the past, but trying to push the workaround without even attemping
> the proper fix is a bit sad.

The page cache infrastructure (i.e. the generic VFS code) is already
set up to be used in this manner, and other filesystems use it in
this way where they need to as well. So, really, I'm not sure what
problem you think needs solving here.

Indeed, I'm not even sure yet how this would be "fixed" in the VFS
code.  Passing AOP_FLAG_NOFS several functions deep to get it into
ra_submit or the low level splice functions is a non-starter, so the
context needs to be carried by something else. That's exactly what
the per-mapping gfp mask does...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent stack overflows from page cache allocation
  2013-10-27  9:04             ` Dave Chinner
@ 2013-10-28  9:49               ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2013-10-28  9:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, xfs

Thanks for the explanation Dave.  I suspected a generic bug because
the two places mentioned are generic code.  Turns out only XFS and some
cluster filesystems are affected because other filesystems do not call
those under filesystem locks or i_mutex.

But it seem like AOP_FLAG_NOFS is indeed pointless and we should remove it,
in XFS as part of this patch and everywhere else later after talking
to the maintainers for the few other places using it.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-10-28  9:50 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-24  3:25 [PATCH] xfs: prevent stack overflows from page cache allocation Dave Chinner
2013-10-24  8:48 ` Christoph Hellwig
2013-10-24 10:37   ` Dave Chinner
2013-10-24 15:42     ` Christoph Hellwig
2013-10-24 16:41       ` Ben Myers
2013-10-24 21:24         ` Dave Chinner
2013-10-25 11:29           ` Christoph Hellwig
2013-10-27  9:04             ` Dave Chinner
2013-10-28  9:49               ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox