Subject: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
From: Peter Zijlstra
To: Nick Piggin
Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
 linux-kernel@vger.kernel.org, ReiserFS Mailing List
In-Reply-To: <20090217102443.GA26402@wotan.suse.de>
References: <18837.24581.181196.569183@edward.zelnet.ru>
 <1234530519.6519.46.camel@twins> <49957C43.7050701@gmail.com>
 <1234534150.6519.101.camel@twins>
 <18838.49922.215481.399653@edward.zelnet.ru>
 <1234645893.4695.8.camel@laptop>
 <18841.60432.329341.514726@edward.zelnet.ru>
 <1234861781.4744.21.camel@laptop> <20090217093805.GB31323@wotan.suse.de>
 <1234865116.4744.46.camel@laptop> <20090217102443.GA26402@wotan.suse.de>
Content-Type: text/plain
Date: Tue, 17 Feb 2009 11:40:00 +0100
Message-Id: <1234867200.4744.65.camel@laptop>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2009-02-17 at 11:24 +0100, Nick Piggin wrote:
> On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote:
> > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote:
> >
> > > It is a great shame that filesystems are not properly notified
> > > that a page may become dirty before the actual set_page_dirty
> > > event (which is not allowed to fail and is called after the
> > > page is already dirty).
> >
> > Not quite true, for example the set_page_dirty() done by the write fault
> > code is done _before_ the page becomes dirty.
> >
> > This before/after thing was the reason for that horrid file corruption
> > bug that dragged on for a few weeks back in .19 (IIRC).
>
> Yeah, there are actually races though. The page can become cleaned
> before set_page_dirty is reached, and there are also nasty races with
> truncate.

Hmm, so you're saying that never got properly fixed?

> > > This is a big problem I have with fsblock simply in trying to
> > > make the memory allocation robust. page_mkwrite unfortunately
> > > is racy and I've fixed problems there... the big problem though
> > > is get_user_pages. Fixing that properly seems to require fixing
> > > callers so it is not really realistic in the short term.
> >
> > Right, I'm just not sure what we can do, even with a
> > prepage_page_dirty() function, what are you going to do, fail the fault?
>
> Oh, for regular page fault functions using page_mkwrite, they
> definitely want to fail the fault with a SIGBUS, and actually XFS
> already does that (for fsblock robust memory allocations you
> would also want to fail OOM on metadata allocation failure). What
> is the other option? Silently fail the write?

OK, agreed.

> For XFS purpose (ie. -ENOSPC handling), the current code is reasonable
> although there could be some truncate races with block allocation. But
> mostly probably works. For something like fsblock it can be much more
> common to have the metadata refcount reach 0 and freed before spd is
> called. In that case the code actually goes into a bug situation so it
> is a bit more critical.
>
> But no that's the "easy" part. The hard part is get_user_pages
> because the caller can hold onto the page indefinitely simply with a
> refcount, and go along happily dirtying it at any stage (actually
> writing to the page memory) before actually calling set_page_dirty.

Should a gup user not specify .write=1 if it wants to dirty the page, at
which point the follow_page() will do the dirty-fault thingy.

Ah, but then we can clean it because we're not holding the page-lock. I
see.

> The "cleanest" way to fix this from VM point of view is probably to
> force gup callers to hold the page locked for the duration to
> prevent truncation or writeout after the filesystem notification.
> Don't know if that would be very popular, however.

Right, so you'd want to keep the page locked over gup(.write=1)
sections.

So should we extend the gup() with put_user_page()?
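
To make the race concrete, here is a rough userspace toy model of it, not kernel code. Everything in it is an assumption made for illustration: struct toy_page, the toy_* helpers and the fs_prepared flag are hypothetical, and toy_put_user_page() only guesses at what a put_user_page() might do (set the page dirty, then drop the lock and the reference). It shows that without the page lock, writeout can slip in between the write fault and the final set_page_dirty, and that holding the lock across the section closes that window.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page-cache page; all fields are hypothetical. */
struct toy_page {
    int  refcount;     /* references held by gup-style callers */
    bool locked;       /* models the page lock */
    bool dirty;        /* models the dirty flag */
    bool fs_prepared;  /* filesystem was notified and holds metadata */
};

/* Models gup(.write=1): take a reference and do the dirty fault, which
 * notifies the filesystem. Optionally keep the page locked afterwards. */
static void toy_gup_write(struct toy_page *p, bool hold_lock)
{
    p->refcount++;
    p->fs_prepared = true;
    p->dirty = true;
    p->locked = hold_lock;
}

/* Models writeout: refuses a locked page, otherwise cleans the page and
 * lets the filesystem drop its preparation. Returns true if it ran. */
static bool toy_writeout(struct toy_page *p)
{
    if (p->locked)
        return false;
    p->dirty = false;
    p->fs_prepared = false;
    return true;
}

/* Hypothetical put_user_page(): set the page dirty, then unlock and drop
 * the reference. Returns false if the filesystem was no longer prepared
 * when the page was dirtied, i.e. the race was hit. */
static bool toy_put_user_page(struct toy_page *p)
{
    bool still_prepared = p->fs_prepared;

    assert(p->refcount > 0);
    p->dirty = true;
    p->locked = false;
    p->refcount--;
    return still_prepared;
}
```

Calling toy_gup_write() with hold_lock=false, then toy_writeout(), then toy_put_user_page() makes the last call return false: the page is dirtied after the filesystem has let go of it. With hold_lock=true the intermediate toy_writeout() is refused and toy_put_user_page() returns true.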