From: Jeff Layton <jlayton@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
viro@zeniv.linux.org.uk, konishi.ryusuke@lab.ntt.co.jp,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-nilfs@vger.kernel.org,
NeilBrown <neilb@suse.com>
Subject: Re: [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business
Date: Thu, 09 Mar 2017 07:43:12 -0500 [thread overview]
Message-ID: <1489063392.2791.8.camel@redhat.com> (raw)
In-Reply-To: <20170309110225.GF15874@quack2.suse.cz>
On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > without getting the error anywhere where it is useful. We get away with
> > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > But this is more for the discussion at LSF.
> > > >
> > > > I'm actually running into this in the last day or two because some MM
> > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > unable to kill processes because they are holding on to locks that
> > > > ext4 is holding.
> > > >
> > > > I've done some initial investigation, and while it's not that hard to
> > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > is where we had been are running into problems), a really, REALLY big
> > > > problem is if any_filesystem->writepages() returns ENOMEM, it causes
> > > > silent data loss, because the pages are marked clean, and so data
> > > > written using buffered writeback goes *poof*.
> > > >
> > > > I confirmed this by creating a test kernel with a simple patch such
> > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > ENOMEM treatment). The result was **NOT** pretty.
> > > >
> > > > What I think we should strongly consider is at the very least, special
> > > > case ENOMEM being returned by writepages() during background
> > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > stays on the dirty inode list, so we can retry the write later. This
> > > > is especially important since the process that issued the write may
> > > > have gone away, so there might not even be a userspace process to
> > > > complain to. By converting certain page allocations (most notably in
> > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > > release the i_data_sem lock and return an error. This should allow
> > > > allow the OOM killer to do its dirty deed, and hopefully we can retry
> > > > the writepages() for that inode later.
> > >
> > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > retrying the write is definitely better option. For start we can say that
> > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > there's no hope of getting data to disk and so we just discard them. It
> > > will be somewhat rough distinction but probably better than what we have
> > > now.
> > >
> > > Honza
> >
> > I'm not sure about ENOSPC there. That's a return code that is
> > specifically expected to be returned by fsync. It seems like that ought
> > not be considered a transient error?
>
> Yeah, for start we should probably keep ENOSPC as is to prevent surprises.
> Long term, we may need to make at least some ENOSPC situations behave as
> transient to make thin provisioned storage not loose data in case admin
> does not supply additional space fast enough (i.e., before ENOSPC is
> actually hit).
>
Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
transient error? Just have it hang until we get enough space when that
tunable is enabled?
> EIO is actually in a similar bucket although probably more on the "hard
> failure" side - I can imagine there can by types of storage and situations
> where the loss of connectivity to the storage is only transient. But for
> start I would not bother with this.
>
> Honza
I don't see what we can reasonably do with -EIO other than return a hard
error. If we want to deal with loss of connectivity to storage as a
transient failure, I think that we'd need to ensure that the lower
layers return more distinct error codes in those cases (ENODEV or ENXIO
maybe? Or declare a new kernel-internal code -- EDEVGONE?).
In any case, I think that the basic idea of marking certain
writepage/writepages/launder_page errors as transient might be a
reasonable approach to handling this sanely.
The problem with all of this though is that we have a pile of existing
code that will likely need to be reworked for the new error handling. I
expect that we'll have to walk all of the
writepage/writepages/launder_page implementations and fix them up one by
one once we sort out the rules for this.
--
Jeff Layton <jlayton@redhat.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-03-09 12:43 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-05 13:35 [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
2017-03-05 13:35 ` [PATCH 1/3] nilfs2: set the mapping error when calling SetPageError on writeback Jeff Layton
2017-03-07 13:46 ` Ryusuke Konishi
2017-03-05 13:35 ` [PATCH 2/3] mm: don't TestClearPageError in __filemap_fdatawait_range Jeff Layton
2017-03-05 13:35 ` [PATCH 3/3] mm: set mapping error when launder_pages fails Jeff Layton
2017-03-05 14:40 ` [PATCH 0/3] mm/fs: get PG_error out of the writeback reporting business Jeff Layton
2017-03-06 23:08 ` Ross Zwisler
2017-03-07 10:26 ` Jan Kara
2017-03-07 14:03 ` Jeff Layton
2017-03-07 15:59 ` Ross Zwisler
2017-03-07 16:17 ` Jan Kara
2017-03-09 2:57 ` Theodore Ts'o
2017-03-09 9:04 ` Jan Kara
2017-03-09 10:47 ` Jeff Layton
2017-03-09 11:02 ` Jan Kara
2017-03-09 12:43 ` Jeff Layton [this message]
[not found] ` <1489063392.2791.8.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-09 13:22 ` Brian Foster
2017-03-09 14:21 ` Theodore Ts'o
2017-03-06 3:06 ` NeilBrown
[not found] ` <871subkst8.fsf-wvvUuzkyo1HefUI2i7LXDhCRmIWqnp/j@public.gmane.org>
2017-03-06 11:43 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1489063392.2791.8.camel@redhat.com \
--to=jlayton@redhat.com \
--cc=jack@suse.cz \
--cc=konishi.ryusuke@lab.ntt.co.jp \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nilfs@vger.kernel.org \
--cc=neilb@suse.com \
--cc=ross.zwisler@linux.intel.com \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).