From mboxrd@z Thu Jan 1 00:00:00 1970
From: ebiederm@xmission.com (Eric W. Biederman)
Subject: Re: [RFC][PATCH] vfs: In mntput run deactivate_super on a shallow stack.
Date: Wed, 09 Apr 2014 18:36:12 -0700
Message-ID: <874n226k4z.fsf@x220.int.ebiederm.org>
References: <87ob28kqks.fsf_-_@xmission.com> <874n3n7czm.fsf_-_@xmission.com> <87wqezl5df.fsf_-_@x220.int.ebiederm.org> <20140409023027.GX18016@ZenIV.linux.org.uk> <20140409023947.GY18016@ZenIV.linux.org.uk> <87sipmbe8x.fsf@x220.int.ebiederm.org> <20140409175322.GZ18016@ZenIV.linux.org.uk> <20140409182830.GA18016@ZenIV.linux.org.uk> <87txa286fu.fsf@x220.int.ebiederm.org> <87fvlm860e.fsf_-_@x220.int.ebiederm.org> <20140409232423.GB18016@ZenIV.linux.org.uk>
Mime-Version: 1.0
Content-Type: text/plain
Cc: Linus Torvalds, "Serge E. Hallyn", Linux-Fsdevel, Kernel Mailing List, Andy Lutomirski, Rob Landley, Miklos Szeredi, Christoph Hellwig, Karel Zak, "J. Bruce Fields", Fengguang Wu
To: Al Viro
Return-path:
Received: from out02.mta.xmission.com ([166.70.13.232]:56016 "EHLO out02.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933287AbaDJBgi (ORCPT ); Wed, 9 Apr 2014 21:36:38 -0400
In-Reply-To: <20140409232423.GB18016@ZenIV.linux.org.uk> (Al Viro's message of "Thu, 10 Apr 2014 00:24:23 +0100")
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Al Viro writes:

> On Wed, Apr 09, 2014 at 03:58:25PM -0700, Eric W. Biederman wrote:
>>
>> mntput as part of pathput is called from all over the vfs, sometimes,
>> as in the case of symlink chasing, from some rather deep call chains.
>> During filesystem unmount with the right set of races those innocuous
>> little mntput calls that take very little stack space can become
>> monsters calling deactivate_super that can take up 3k or more of
>> stack space as synchronous filesystem I/O is performed, through
>> multiple levels of the I/O stack.
>>
>> Avoid deactivate_super being called from a deep stack by converting
>> mntput to use task_work_add when the mnt_count goes to 0.  The
>> filesystem is still unmounted synchronously, preserving the semantics
>> that system calls like umount require.
>
> Careful.  For one thing, you've just introduced a massive leak in knfsd
> and any other kernel thread that might do mntput().  task_work_add()
> makes no sense there - there is no userland to return to.  For another,
> in things like cleanup of failing modprobe we might end up delaying fs
> shutdown too much.  So it's not that simple, unfortunately.

Unfortunately.

> I agree that fs shutdown is better dealt with on mostly empty stack, of
> course - moreover, done right that has a potential to make mntput()
> safe in atomic contexts (there's also acct_auto_close_mnt() to deal
> with; that might take some work to get right, but I think it's not
> fatal).

I am slowly digging into this.

With this patch I was able to do an A/B comparison of the stack cost of
unmounting my minimal ext4 filesystem from d_invalidate, called in a
context with maximum symlink recursion depth, with and without a changed
mntput.

I used sysfs instead of nfs to mount my minimal ext4 filesystem on, as I
was a lazy bum and didn't have an nfs server set up handy.

With just my detach_mounts branch merged into 3.15-rc0
I saw 4880 stack bytes left before calling detach_mounts from d_invalidate
I saw 3904 stack bytes left after calling detach_mounts from d_invalidate

Which means in practice unmounting my minimal ext4 filesystem image only
used 976 additional bytes of stack.

With the same kernel plus my change to mntput
I saw 4880 stack bytes left before calling detach_mounts from d_invalidate
I saw 4880 stack bytes left after calling detach_mounts from d_invalidate

Which at least confirms that a change to mntput is enough to make deep
stacks safe.
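
For readers who want the deferral idea in concrete form, here is a rough
userspace model of it. It is not kernel code and not the patch itself:
every name ending in _model is hypothetical and invented for this
illustration, and the "depth" counter merely stands in for real stack
usage. The sketch assumes the dropper of the last reference queues the
expensive shutdown instead of running it, and that the queue is drained
only once the call chain has fully unwound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct task_model;

typedef void (*work_fn)(struct task_model *task, void *data);

struct deferred_work {
    work_fn func;
    void *data;
    struct deferred_work *next;
};

struct task_model {
    int depth;                     /* simulated call depth, 0 = shallow */
    struct deferred_work *pending; /* work waiting for the shallow exit path */
};

struct mount_model {
    int count;          /* simulated reference count */
    int shutdown_depth; /* depth at which shutdown ran, -1 if it has not run */
    struct deferred_work work;
};

/* Queue work to run later, loosely analogous to the role of task_work_add. */
void queue_work_model(struct task_model *task, struct deferred_work *work)
{
    work->next = task->pending;
    task->pending = work;
}

/* Drain the queue; only legal once the deep call chain has unwound. */
void run_pending_model(struct task_model *task)
{
    assert(task->depth == 0);
    while (task->pending != NULL) {
        struct deferred_work *work = task->pending;
        task->pending = work->next;
        work->func(task, work->data);
    }
}

/* Stand-in for the expensive shutdown; records the depth it ran at. */
void expensive_shutdown_model(struct task_model *task, void *data)
{
    struct mount_model *mnt = data;
    mnt->shutdown_depth = task->depth;
}

/* Drop a reference; on the last one, defer the shutdown instead of running it. */
void mntput_model(struct task_model *task, struct mount_model *mnt)
{
    if (--mnt->count > 0)
        return;
    mnt->work.func = expensive_shutdown_model;
    mnt->work.data = mnt;
    queue_work_model(task, &mnt->work);
}

/* Simulate a deep call chain that ends in the final reference drop. */
void nested_call_model(struct task_model *task, struct mount_model *mnt,
                       int levels)
{
    task->depth++;
    if (levels > 1)
        nested_call_model(task, mnt, levels - 1);
    else
        mntput_model(task, mnt);
    task->depth--;
}
```

Starting from a mount_model with count 1 and shutdown_depth -1, calling
nested_call_model() with 8 levels leaves shutdown_depth at -1, and a
later run_pending_model() sets it to 0. The model also shows the concern
raised above about kernel threads: if nothing ever calls
run_pending_model(), the queued shutdown simply never runs.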

With 3904 bytes of headroom from ext4 I may have to measure some of the
nastier cases just to be certain there actually is a problem here.

Eric