From: NeilBrown <neilb@suse.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>,
Jeff Layton <jlayton@redhat.com>, linux-mm <linux-mm@kvack.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Cc: lsf-pc <lsf-pc@lists.linuxfoundation.org>,
Neil Brown <neilb@suse.de>,
linux-scsi <linux-scsi@vger.kernel.org>,
linux-block@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] do we really need PG_error at all?
Date: Mon, 27 Feb 2017 11:27:05 +1100 [thread overview]
Message-ID: <874lzgqy06.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <1488151856.4157.50.camel@HansenPartnership.com>
[-- Attachment #1: Type: text/plain, Size: 5847 bytes --]
On Sun, Feb 26 2017, James Bottomley wrote:
> On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
>> On Sun, Feb 26 2017, James Bottomley wrote:
>>
>> > [added linux-scsi and linux-block because this is part of our error
>> > handling as well]
>> > On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
>> > > Proposing this as a LSF/MM TOPIC, but it may turn out to be me
>> > > just not understanding the semantics here.
>> > >
>> > > As I was looking into -ENOSPC handling in cephfs, I noticed that
>> > > PG_error is only ever tested in one place [1]
>> > > __filemap_fdatawait_range, which does this:
>> > >
>> > > if (TestClearPageError(page))
>> > > ret = -EIO;
>> > >
>> > > This error code will override any AS_* error that was set in the
>> > > mapping. Which makes me wonder...why don't we just set this error
>> > > in the mapping and not bother with a per-page flag? Could we
>> > > potentially free up a page flag by eliminating this?
>> >
>> > Note that currently the AS_* codes are only set for write errors
>> > not for reads and we have no mapping error handling at all for swap
>> > pages, but I'm sure this is fixable.
>>
>> How is a read error different from a failure to set PG_uptodate?
>> Does PG_error suppress retries?
>
> We don't do any retries in the code above the block layer (or at least
> we shouldn't).
I was wondering about what would/should happen if a read request was
re-issued for some reason. Should the error flag on the page cause an
immediate failure, or should it try again.
If read-ahead sees a read-error on some future page, is it necessary to
record the error so subsequent read-aheads don't notice the page is
missing and repeatedly try to re-load it?
When the application eventually gets to the faulty page, should a read
be tried then, or is the read-ahead failure permanent?
>
>> >
>> > From the I/O layer point of view we take great pains to try to
>> > pinpoint the error exactly to the sector. We reflect this up by
>> > setting the PG_error flag on the page where the error occurred. If
>> > we only set the error on the mapping, we lose that granularity,
>> > because the mapping is mostly at the file level (or VMA level for
>> > anon pages).
>>
>> Are you saying that the IO layer finds the page in the bi_io_vec and
>> explicitly sets PG_error,
>
> I didn't say anything about the mechanism. I think the function you're
> looking for is fs/mpage.c:mpage_end_io(). layers below block indicate
> the position in the request. Block maps the position to bio and the
> bio completion maps to page. So the actual granularity seen in the
> upper layer depends on how the page to bio mapping is done.
If the block layer is just returning the status at a per-bio level (which
makes perfect sense), then this has nothing directly to do with the
PG_error flag.
The page cache needs to do something with bi_error, but it isn't
immediately clear that it needs to set PG_error.
>
>> rather than just passing an error indication to bi_end_io ?? That
>> would seem to be wrong as the page may not be in the page cache.
>
> Usually pages in the mpage_end_io path are pinned, I think.
>
>> So I guess I misunderstand you.
>>
>> >
>> > So I think the question for filesystem people from us would be do
>> > you care about this accuracy? If it's OK just to know an error
>> > occurred somewhere in this file, then perhaps we don't need it.
>>
>> I had always assumed that a bio would either succeed or fail, and
>> that no finer granularity could be available.
>
> It does ... but a bio can be as small as a single page.
>
>> I think the question here is: Do filesystems need the pagecache to
>> record which pages have seen an IO error?
>
> It's not just filesystems. The partition code uses PageError() ... the
> metadata code might as well (those are things with no mapping). I'm
> not saying we can't remove PG_error; I am saying it's not going to be
> quite as simple as using the AS_ flags.
The partition code could use PageUptodate().
mpage_end_io() calls page_endio() on each page, and on read error that
calls:
ClearPageUptodate(page);
SetPageError(page);
are both of these necessary?
fs/buffer.c can use several bios to read a single page.
If any one returns an error, PG_error is set. When all of them have
completed, if PG_error is clear, PG_uptodate is then set.
This is an opportunistic use of PG_error, rather than an essential use.
It could be "fixed", and would need to be fixed if we were to deprecate
use of PG_error for read errors.
There are probably other usages like this.
Thanks,
NeilBrown
>
> James
>
>> I think that for write errors, there is no value in recording
>> block-oriented error status - only file-oriented status.
>> For read errors, it might if help to avoid indefinite read retries,
>> but I don't know the code well enough to be sure if this is an issue.
>>
>> NeilBrown
>>
>>
>> >
>> > James
>> >
>> > > The main argument I could see for keeping it is that removing it
>> > > might subtly change the behavior of sync_file_range if you have
>> > > tasks syncing different ranges in a file concurrently. I'm not
>> > > sure if that would break any guarantees though.
>> > >
>> > > Even if we do need it, I think we might need some cleanup here
>> > > anyway. A lot of readpage operations end up setting that flag
>> > > when they hit an error. Isn't it wrong to return an error on
>> > > fsync, just because we had a read error somewhere in the file in
>> > > a range that was never dirtied?
>> > >
>> > > --
>> > > [1]: there is another place in f2fs, but it's more or less
>> > > equivalent to the call site in __filemap_fdatawait_range.
>> > >
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
next prev parent reply other threads:[~2017-02-27 0:27 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-26 14:42 [LSF/MM TOPIC] do we really need PG_error at all? Jeff Layton
2017-02-26 17:10 ` James Bottomley
2017-02-26 21:03 ` NeilBrown
2017-02-26 22:43 ` Jeff Layton
2017-02-26 23:30 ` James Bottomley
2017-02-26 23:57 ` Jeff Layton
2017-02-27 0:27 ` NeilBrown [this message]
2017-02-27 15:07 ` Jeff Layton
2017-02-27 22:51 ` Andreas Dilger
2017-02-27 23:02 ` Jeff Layton
2017-02-27 23:32 ` NeilBrown
2017-02-28 1:11 ` [Lsf-pc] " Jeff Layton
2017-02-28 10:12 ` Boaz Harrosh
2017-02-28 11:32 ` Jeff Layton
2017-02-28 20:45 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=874lzgqy06.fsf@notabene.neil.brown.name \
--to=neilb@suse.com \
--cc=James.Bottomley@HansenPartnership.com \
--cc=jlayton@redhat.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-scsi@vger.kernel.org \
--cc=lsf-pc@lists.linuxfoundation.org \
--cc=neilb@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).