From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from bombadil.infradead.org ([198.137.202.133]:43882 "EHLO
        bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1754902AbeDMOCi (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Fri, 13 Apr 2018 10:02:38 -0400
Date: Fri, 13 Apr 2018 07:02:32 -0700
From: Matthew Wilcox <willy@infradead.org>
To: Jeff Layton <jlayton@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
        lsf-pc <lsf-pc@lists.linuxfoundation.org>,
        Andres Freund <andres@anarazel.de>,
        Andreas Dilger <adilger@dilger.ca>,
        "Theodore Y. Ts'o" <tytso@mit.edu>,
        Ext4 Developers List <linux-ext4@vger.kernel.org>,
        Linux FS Devel <linux-fsdevel@vger.kernel.org>,
        "Joshua D. Drake" <jd@commandprompt.com>
Subject: Re: fsync() errors is unsafe and risks data loss
Message-ID: <20180413140232.GA24379@bombadil.infradead.org>
References: <20180410220726.vunhvwuzxi5bm6e5@alap3.anarazel.de>
 <190CF56C-C03D-4504-8B35-5DB479801513@dilger.ca>
 <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de>
 <20180412030248.GA8509@bombadil.infradead.org>
 <1523531354.4532.21.camel@redhat.com>
 <20180412120122.GE23861@dastard>
 <1523545730.4532.82.camel@redhat.com>
 <20180412224404.GA5572@dastard>
 <1523625536.4847.21.camel@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1523625536.4847.21.camel@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:
> On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:
> > To save you looking, XFS will trash the page contents completely on
> > a filesystem level ->writepage error. It doesn't mark them "clean",
> > doesn't attempt to redirty and rewrite them - it clears the uptodate
> > state and may invalidate it completely. IOWs, the data written
> > "successfully" to the cached page is now gone. It will be re-read
> > from disk on the next read() call, in direct violation of the above
> > POSIX requirements.
> > 
> > This is my point: we've done that in XFS knowing that we violate
> > POSIX specifications in this specific corner case - it's the lesser
> > of many evils we have to choose between. Hence if we choose to encode
> > that behaviour as the general writeback IO error handling algorithm,
> > then it needs to be done with the knowledge it is a specification
> > violation. Not to mention be documented as a POSIX violation in the
> > various relevant man pages and that this is how all filesystems will
> > behave on async writeback error.....
> > 
> 
> Got it, thanks.
> 
> Yes, I think we ought to probably do the same thing globally. It's nice
> to know that xfs has already been doing this. That makes me feel better
> about making this behavior the gold standard for Linux filesystems.
> 
> So to summarize, at this point in the discussion, I think we want to
> consider doing the following:
> 
> * better reporting from syncfs (report an error when even one inode
> failed to be written back since last syncfs call). We'll probably
> implement this via a per-sb errseq_t in some fashion, though there are
> some implementation issues to work out.
> 
> * invalidate or clear uptodate flag on pages that experience writeback
> errors, across filesystems. Encourage this as standard behavior for
> filesystems and maybe add helpers to make it easier to do this.
> 
> Did I miss anything? Would that be enough to help the Pg usecase?
> 
> I don't see us ever being able to reasonably support its current
> expectation that writeback errors will be seen on fd's that were opened
> after the error occurred. That's a really thorny problem from an object
> lifetime perspective.

I think we can do better than XFS is currently doing (but I agree that
we should have the same behaviour across all Linux filesystems!)

1. If we get an error while wbc->for_background is true, we should not clear
   uptodate on the page; instead, SetPageError and SetPageDirty.
2. Background writebacks should skip pages which are PageError.
3. for_sync writebacks should attempt one last write.  Maybe it'll
   succeed this time.  If it does, just ClearPageError.  If not, we have
   somebody to report this writeback error to, and ClearPageUptodate.

I think kupdate writes are the same as for_background writes.  for_reclaim
is tougher.  I don't want to see us getting into OOM because we're hanging
onto stale data, but we don't necessarily have an open fd to report the
error on.  I think I'm leaning towards behaving the same for for_reclaim
as for_sync, but this is probably a subject on which reasonable people
can disagree.

And this logic all needs to be in one place, although invoked from
each filesystem.
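
To make the three rules above concrete, here is a rough userspace model of
what a single shared helper might decide. It is only a sketch, not kernel
code: struct model_page, enum wb_reason, wb_should_skip() and wb_complete()
are hypothetical names invented for illustration, and the booleans merely
stand in for the PageUptodate/PageDirty/PageError flags. It assumes the
caller cleared the dirty bit before attempting the write, and it treats
kupdate like for_background and for_reclaim like for_sync, which is the
tentative leaning described above rather than a settled decision.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page flags discussed in this thread. */
struct model_page {
	bool uptodate;
	bool dirty;
	bool error;
};

/* Hypothetical stand-in for the wbc->for_* bits. */
enum wb_reason { WB_BACKGROUND, WB_KUPDATE, WB_SYNC, WB_RECLAIM };

static bool wb_is_background(enum wb_reason why)
{
	/* Assumption: kupdate writes behave like for_background writes. */
	return why == WB_BACKGROUND || why == WB_KUPDATE;
}

/* Rule 2: background writeback skips pages that are already in error. */
static bool wb_should_skip(const struct model_page *page, enum wb_reason why)
{
	assert(page);
	return page->error && wb_is_background(why);
}

/*
 * Apply the policy after one write attempt.
 * Returns true when the failure should be reported to the caller.
 */
static bool wb_complete(struct model_page *page, enum wb_reason why,
			bool io_failed)
{
	assert(page);

	if (!io_failed) {
		/* Rule 3, success case: just clear the error marker. */
		page->error = false;
		return false;
	}

	if (wb_is_background(why)) {
		/* Rule 1: keep the data, mark the error, redirty the page. */
		page->error = true;
		page->dirty = true;
		return false;
	}

	/*
	 * Rule 3, failure case (for_sync, and tentatively for_reclaim):
	 * the last attempt failed, so drop uptodate and report the error.
	 * Other flags are left alone because the proposal does not say
	 * what should happen to them.
	 */
	page->uptodate = false;
	return true;
}
```

A filesystem's writepage path would call something like wb_should_skip()
before issuing the I/O and wb_complete() from its completion handler, so
that the policy itself lives in one place.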