From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from bombadil.infradead.org ([198.137.202.133]:43882 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754902AbeDMOCi (ORCPT ); Fri, 13 Apr 2018 10:02:38 -0400 Date: Fri, 13 Apr 2018 07:02:32 -0700 From: Matthew Wilcox To: Jeff Layton Cc: Dave Chinner , lsf-pc , Andres Freund , Andreas Dilger , "Theodore Y. Ts'o" , Ext4 Developers List , Linux FS Devel , "Joshua D. Drake" Subject: Re: fsync() errors is unsafe and risks data loss Message-ID: <20180413140232.GA24379@bombadil.infradead.org> References: <20180410220726.vunhvwuzxi5bm6e5@alap3.anarazel.de> <190CF56C-C03D-4504-8B35-5DB479801513@dilger.ca> <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de> <20180412030248.GA8509@bombadil.infradead.org> <1523531354.4532.21.camel@redhat.com> <20180412120122.GE23861@dastard> <1523545730.4532.82.camel@redhat.com> <20180412224404.GA5572@dastard> <1523625536.4847.21.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1523625536.4847.21.camel@redhat.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote: > On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote: > > To save you looking, XFS will trash the page contents completely on > > a filesystem level ->writepage error. It doesn't mark them "clean", > > doesn't attempt to redirty and rewrite them - it clears the uptodate > > state and may invalidate it completely. IOWs, the data written > > "sucessfully" to the cached page is now gone. It will be re-read > > from disk on the next read() call, in direct violation of the above > > POSIX requirements. > > > > This is my point: we've done that in XFS knowing that we violate > > POSIX specifications in this specific corner case - it's the lesser > > of many evils we have to chose between. Hence if we chose to encode > > that behaviour as the general writeback IO error handling algorithm, > > then it needs to done with the knowledge it is a specification > > violation. Not to mention be documented as a POSIX violation in the > > various relevant man pages and that this is how all filesystems will > > behave on async writeback error..... > > > > Got it, thanks. > > Yes, I think we ought to probably do the same thing globally. It's nice > to know that xfs has already been doing this. That makes me feel better > about making this behavior the gold standard for Linux filesystems. > > So to summarize, at this point in the discussion, I think we want to > consider doing the following: > > * better reporting from syncfs (report an error when even one inode > failed to be written back since last syncfs call). We'll probably > implement this via a per-sb errseq_t in some fashion, though there are > some implementation issues to work out. > > * invalidate or clear uptodate flag on pages that experience writeback > errors, across filesystems. Encourage this as standard behavior for > filesystems and maybe add helpers to make it easier to do this. > > Did I miss anything? Would that be enough to help the Pg usecase? > > I don't see us ever being able to reasonably support its current > expectation that writeback errors will be seen on fd's that were opened > after the error occurred. That's a really thorny problem from an object > lifetime perspective. I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!) 1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty. 2. Background writebacks should skip pages which are PageError. 3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate. I think kupdate writes are the same as for_background writes. for_reclaim is tougher. I don't want to see us getting into OOM because we're hanging onto stale data, but we don't necessarily have an open fd to report the error on. I think I'm leaning towards behaving the same for for_reclaim as for_sync, but this is probably a subject on which reasonable people can disagree. And this logic all needs to be on one place, although invoked from each filesystem.