* Re: [PATCH] xfs: prevent stack overflows from page cache allocation [not found] <1382585110-1796-1-git-send-email-david@fromorbit.com> @ 2013-10-24 8:48 ` Christoph Hellwig 2013-10-24 10:37 ` Dave Chinner 0 siblings, 1 reply; 8+ messages in thread From: Christoph Hellwig @ 2013-10-24 8:48 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs, linux-fsdevel On Thu, Oct 24, 2013 at 02:25:10PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Page cache allocation doesn't always go through ->begin_write and > hence we don't always get the opportunity to set the allocation > context to GFP_NOFS. Failing to do this means we open up the direct > relcaim stack to recurse into the filesystem and consume a > significant amount of stack. > > On RHEL6.4 kernels we are seeing ra_submit() and > generic_file_splice_read() from an nfsd context recursing into the > filesystem via the inode cache shrinker and evicting inodes. This is > causing truncation to be run (e.g EOF block freeing) and causing > bmap btree block merges and free space btree block splits to occur. > These btree manipulations are occurring with the call chain already > 30 functions deep and hence there is not enough stack space to > complete such operations. It seems like we really should fix this in the VFS as it could affect all non-trivial filesystems. > To avoid these specific overruns, we need to prevent the page cache > allocation from recursing via direct reclaim. We can do that because > the allocation functions take the allocation context from that which > is stored in the mapping for the inode. We don't set that right now, > so the default is GFP_HIGHUSER_MOVABLE, which is effectively a > GFP_KERNEL context. We need it to be the equivalent of GFP_NOFS, so > when we initialise an inode, set the mapping gfp mask appropriately. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_iops.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index c4cd6d4..27e0e54 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -1168,6 +1168,7 @@ xfs_setup_inode( > struct xfs_inode *ip) > { > struct inode *inode = &ip->i_vnode; > + gfp_t gfp_mask; > > inode->i_ino = ip->i_ino; > inode->i_state = I_NEW; > @@ -1230,6 +1231,14 @@ xfs_setup_inode( > } > > /* > + * Ensure all page cache allocations are done from GFP_NOFS context to > + * prevent direct reclaim recursion back into the filesystem and blowing > + * stacks or deadlocking. > + */ > + gfp_mask = mapping_gfp_mask(inode->i_mapping); > + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); > + > + /* > * If there is no attribute fork no ACL can exist on this inode, > * and it can't have any file capabilities attached to it either. > */ > -- > 1.8.4.rc3 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs ---end quoted text--- ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-24 8:48 ` [PATCH] xfs: prevent stack overflows from page cache allocation Christoph Hellwig @ 2013-10-24 10:37 ` Dave Chinner 2013-10-24 15:42 ` Christoph Hellwig 0 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2013-10-24 10:37 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs, linux-fsdevel On Thu, Oct 24, 2013 at 01:48:03AM -0700, Christoph Hellwig wrote: > On Thu, Oct 24, 2013 at 02:25:10PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Page cache allocation doesn't always go through ->begin_write and > > hence we don't always get the opportunity to set the allocation > > context to GFP_NOFS. Failing to do this means we open up the direct > > relcaim stack to recurse into the filesystem and consume a > > significant amount of stack. > > > > On RHEL6.4 kernels we are seeing ra_submit() and > > generic_file_splice_read() from an nfsd context recursing into the > > filesystem via the inode cache shrinker and evicting inodes. This is > > causing truncation to be run (e.g EOF block freeing) and causing > > bmap btree block merges and free space btree block splits to occur. > > These btree manipulations are occurring with the call chain already > > 30 functions deep and hence there is not enough stack space to > > complete such operations. > > It seems like we really should fix this in the VFS as it could affect > all non-trivial filesystems. Sure, if you want to. But doing that shouldn't prevent this fix from being committed in the mean time, especially as other filesystems already use this method for avoiding these problems. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-24 10:37 ` Dave Chinner @ 2013-10-24 15:42 ` Christoph Hellwig 2013-10-24 16:41 ` Ben Myers 0 siblings, 1 reply; 8+ messages in thread From: Christoph Hellwig @ 2013-10-24 15:42 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, xfs On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote: > Sure, if you want to. But doing that shouldn't prevent this fix from > being committed in the mean time, especially as other filesystems > already use this method for avoiding these problems. I'd much prefer aiming for the proper fix first. If for some reason we can't get it done in time the workaround can be applied. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-24 15:42 ` Christoph Hellwig @ 2013-10-24 16:41 ` Ben Myers 2013-10-24 21:24 ` Dave Chinner 0 siblings, 1 reply; 8+ messages in thread From: Ben Myers @ 2013-10-24 16:41 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, xfs Hey, On Thu, Oct 24, 2013 at 08:42:20AM -0700, Christoph Hellwig wrote: > On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote: > > Sure, if you want to. But doing that shouldn't prevent this fix from > > being committed in the mean time, especially as other filesystems > > already use this method for avoiding these problems. > > I'd much prefer aiming for the proper fix first. If for some reason we > can't get it done in time the workaround can be applied. Dave probably has a customer waiting on this. If pulling this in will make a proper fix more difficult to do I can understand keeping the patch out. Otherwise, can't we just remove this along with the other filesystems' equivalent code when the proper fix is committed? Regards, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-24 16:41 ` Ben Myers @ 2013-10-24 21:24 ` Dave Chinner 2013-10-25 11:29 ` Christoph Hellwig 0 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2013-10-24 21:24 UTC (permalink / raw) To: Ben Myers; +Cc: Christoph Hellwig, linux-fsdevel, xfs On Thu, Oct 24, 2013 at 11:41:15AM -0500, Ben Myers wrote: > Hey, > > On Thu, Oct 24, 2013 at 08:42:20AM -0700, Christoph Hellwig wrote: > > On Thu, Oct 24, 2013 at 09:37:51PM +1100, Dave Chinner wrote: > > > Sure, if you want to. But doing that shouldn't prevent this fix from > > > being committed in the mean time, especially as other filesystems > > > already use this method for avoiding these problems. > > > > I'd much prefer aiming for the proper fix first. If for some reason we > > can't get it done in time the workaround can be applied. > > Dave probably has a customer waiting on this. Obviously. And being a kernel where we have a fixed ABI, we can't backport any fix that changes core code. > If pulling this in will make a > proper fix more difficult to do I can understand keeping the patch out. It doesn't make a proper fix any harder - removing 2 lines of code is trivial. > Otherwise, can't we just remove this along with the other filesystems' > equivalent code when the proper fix is committed? Yes, we can. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-24 21:24 ` Dave Chinner @ 2013-10-25 11:29 ` Christoph Hellwig 2013-10-27 9:04 ` Dave Chinner 0 siblings, 1 reply; 8+ messages in thread From: Christoph Hellwig @ 2013-10-25 11:29 UTC (permalink / raw) To: Dave Chinner; +Cc: Ben Myers, Christoph Hellwig, linux-fsdevel, xfs On Fri, Oct 25, 2013 at 08:24:48AM +1100, Dave Chinner wrote: > > > I'd much prefer aiming for the proper fix first. If for some reason we > > > can't get it done in time the workaround can be applied. > > > > Dave probably has a customer waiting on this. > > Obviously. And being a kernel where we have a fixed ABI, we can't > backport any fix that changes core code. No one is trying to tell you what to ship to your customers. That doesn't mean we should aim for the right fix upstream. I don't really mind pushing patches like yours as a last resort when dealing with unrepsonsive or disagreeing maintainers like we had to do in the past, but trying to push the workaround without even attemping the proper fix is a bit sad. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-25 11:29 ` Christoph Hellwig @ 2013-10-27 9:04 ` Dave Chinner 2013-10-28 9:49 ` Christoph Hellwig 0 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2013-10-27 9:04 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Ben Myers, linux-fsdevel, xfs On Fri, Oct 25, 2013 at 04:29:34AM -0700, Christoph Hellwig wrote: > On Fri, Oct 25, 2013 at 08:24:48AM +1100, Dave Chinner wrote: > > > > I'd much prefer aiming for the proper fix first. If for some reason we > > > > can't get it done in time the workaround can be applied. > > > > > > Dave probably has a customer waiting on this. > > > > Obviously. And being a kernel where we have a fixed ABI, we can't > > backport any fix that changes core code. > > No one is trying to tell you what to ship to your customers. That > doesn't mean we should aim for the right fix upstream. Have you looked at the page cache allocation code recently? e.g, grab_cache_page_write_begin(), which we pass AOP_FLAG_NOFS into to prevent direct reclaim deadlocks: gfp_mask = mapping_gfp_mask(mapping); if (mapping_cap_account_dirty(mapping)) gfp_mask |= __GFP_WRITE; if (flags & AOP_FLAG_NOFS) gfp_notmask = __GFP_FS; See? The gfp mask that is used for allocation is directly based on the gfp mask that is set on the mapping. With this patch I proposed, we can remove the all the AOP_FLAG_NOFS usage because it's redundant. IOWs, the mapping alread has the correct allocation context set on it for *all* mapping based allocations without having to add and pass magic flags to various interfaces to get it to do the right thing. > I don't really mind pushing patches like yours as a last resort when > dealing with unrepsonsive or disagreeing maintainers like we had to do > in the past, but trying to push the workaround without even attemping > the proper fix is a bit sad. The page cache infrastructure (i.e. the generic VFS code) is already set up to be used in this manner, and other filesystems use it in this way where they need to as well. So, really, I'm not sure what problem you think needs solving here. Indeed, I'm not even sure yet how this would be "fixed" in the VFS code. Passing AOP_FLAG_NOFS several functions deep to get it into ra_submit or the low level splice functions is a non-starter, so the context needs to be carried by something else. That's exactly what the per-mapping gfp mask does... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: prevent stack overflows from page cache allocation 2013-10-27 9:04 ` Dave Chinner @ 2013-10-28 9:49 ` Christoph Hellwig 0 siblings, 0 replies; 8+ messages in thread From: Christoph Hellwig @ 2013-10-28 9:49 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, Ben Myers, xfs Thanks for the explanation Dave. I suspected a generic bug because the two places mentioned are generic code. Turns out only XFS and some cluster filesystems are affected because other filesystems do not call those under filesystem locks or i_mutex. But it seem like AOP_FLAG_NOFS is indeed pointless and we should remove it, in XFS as part of this patch and everywhere else later after talking to the maintainers for the few other places using it. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2013-10-28 9:49 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <1382585110-1796-1-git-send-email-david@fromorbit.com> 2013-10-24 8:48 ` [PATCH] xfs: prevent stack overflows from page cache allocation Christoph Hellwig 2013-10-24 10:37 ` Dave Chinner 2013-10-24 15:42 ` Christoph Hellwig 2013-10-24 16:41 ` Ben Myers 2013-10-24 21:24 ` Dave Chinner 2013-10-25 11:29 ` Christoph Hellwig 2013-10-27 9:04 ` Dave Chinner 2013-10-28 9:49 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).