linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.com>
To: Kevin Wolf <kwolf@redhat.com>, Theodore Ts'o <tytso@mit.edu>
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Christoph Hellwig <hch@infradead.org>,
	Ric Wheeler <rwheeler@redhat.com>, Rik van Riel <riel@redhat.com>
Subject: Re: [LSF/MM TOPIC] I/O error handling and fsync()
Date: Fri, 13 Jan 2017 15:51:09 +1100	[thread overview]
Message-ID: <87y3yfftqa.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <20170111114023.GA4813@noname.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 6761 bytes --]

On Wed, Jan 11 2017, Kevin Wolf wrote:

> Am 11.01.2017 um 06:03 hat Theodore Ts'o geschrieben:
>> A couple of thoughts.
>> 
>> First of all, one of the reasons why this probably hasn't been
>> addressed for so long is because programs who really care about issues
>> like this tend to use Direct I/O, and don't use the page cache at all.
>> And perhaps this is an option open to qemu as well?
>
> For our immediate case, yes, O_DIRECT can be enabled as an option in
> qemu, and it is generally recommended to do that at least for long-lived
> VMs. For other cases it might be nice to use the cache e.g. for quicker
> startup, but those might be cases where error recovery isn't as
> important.
>
> I just see a much broader problem here than just for qemu. Essentially
> this approach would mean that every program that cares about the state
> it sees being safe on disk after a successful fsync() would have to use
> O_DIRECT. I'm not sure if that's what we want.

This is not correct.  If an application has exclusive write access to a
file (which is common, even if only enforced by convention) and if that
program checks the return of every write() and every fsync() (which, for
example, stdio does, allowing ferror() to report if there have ever been
errors), then it will know if its data if safe.

If any of these writes returned an error, then there is NOTHING IT CAN
DO about that file.  It should be considered to be toast.
If there is a separate filesystem it can use, then maybe there is a way
forward, but normally it would just report an error in whatever way is
appropriate.

My position on this is primarily that if you get a single write error,
then you cannot trust anything any more.
You suggested before that NFS problems can cause errors which can be
fixed by the sysadmin so subsequent writes succeed.  I disagreed - NFS
will block, not return an error.  Your last paragraph below indicates
that you agree.  So I ask again: can you provide a genuine example of a
case where a write might result in an error, but that sysadmin
involvement can allow a subsequent attempt to write to succeed.   I
don't think you can, but I'm open...

I note that ext4 has an option "errors=remount-ro".  I think that
actually makes a lot of sense.  I could easily see an argument for
supporting this at the file level, when it isn't enabled at the
filesystem level. If there is any write error, then all subsequent
writes should cause an error, only reads should be allowed.

Thanks,
NeilBrown


>
>> Secondly, one of the reasons why we mark the page clean is because we
>> didn't want a failing disk to memory to be trapped with no way of
>> releasing the pages.  For example, if a user plugs in a USB
>> thumbstick, writes to it, and then rudely yanks it out before all of
>> the pages have been writeback, it would be unfortunate if the dirty
>> pages can only be released by rebooting the system.
>
> Yes, I understand that and permanent failure is definitely a case to
> consider while making any changes. That's why I suggested to still allow
> releasing such pages, but at a lower priority than actually clean pages.
> And of course, after losing data, an fsync() may never succeed again on
> a file descriptor that was open when the data was thrown away.
>
>> So an approach that might work is fsync() will keep the pages dirty
>> --- but only while the file descriptor is open.  This could either be
>> the default behavior, or something that has to be specifically
>> requested via fcntl(2).  That way, as soon as the process exits (at
>> which point it will be too late for it do anything to save the
>> contents of the file) we also release the memory.  And if the process
>> gets OOM killed, again, the right thing happens.  But if the process
>> wants to take emergency measures to write the file somewhere else, it
>> knows that the pages won't get lost until the file gets closed.
>
> This sounds more or less like what I had in mind, so I agree.
>
> The fcntl() flag is an interesting thought, too, but would there be
> any situation where the userspace would have an advantage from not
> requesting the flag?
>
>> (BTW, a process could guarantee this today without any kernel changes
>> by mmap'ing the whole file and mlock'ing the pages that it had
>> modified.  That way, even if there is an I/O error and the fsync
>> causes the pages to be marked clean, the pages wouldn't go away.
>> However, this is really a hack, and it would probably be easier for
>> the process to use Direct I/O instead.  :-)
>
> That, and even if the pages would still in memory, as I understand it,
> the writeout would never be retried because they are still marked clean.
> So it wouldn't be usable for a temporary failure, but only for reading
> the data back from the cache into a different file.
>
>> Finally, if the kernel knows that an error might be one that could be
>> resolved by the simple expedient of waiting (for example, if a fibre
>> channel cable is temporarily unplugged so it can be rerouted, but the
>> user might plug it back in a minute or two later, or a dm-thin device
>> is full, but the system administrator might do something to fix it),
>> in the ideal world, the kernel should deal with it without requiring
>> any magic from userspace applications.  There might be a helper system
>> daemon that enacts policy (we've paged the sysadmin, so it's OK to
>> keep the page dirty and retry the writebacks to the dm-thin volume
>> after the helper daemon gives the all-clear), but we shouldn't require
>> all user space applications to have magic, Linux-specific retry code.
>
> Yes and no. I agree that the kernel should mostly make things just work.
> We're talking about a relatively obscure error case here, so if
> userspace applications have to do something extraordinary, chances are
> they won't be doing it.
>
> On the other hand, indefinitely blocking on fsync() isn't really what we
> want either, so while the kernel should keep trying to get the data
> written in the background, a failing fsync() would be okay, as long as a
> succeeding fsync() afterwards means that we're fully consistent again.
>
> In qemu, indefinitely blocking read/write syscalls are already a problem
> (on NFS), because instead of getting an error and then stopping the VM,
> the request hangs so long that the guest kernel sees a timeout and
> offlines the disk anyway. But that's a separate problem...
>
> Kevin
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

  reply	other threads:[~2017-01-13  4:51 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-10 16:02 [LSF/MM TOPIC] I/O error handling and fsync() Kevin Wolf
2017-01-11  0:41 ` NeilBrown
2017-01-13 11:09   ` Kevin Wolf
2017-01-13 14:21     ` Theodore Ts'o
2017-01-13 16:00       ` Kevin Wolf
2017-01-13 22:28         ` NeilBrown
2017-01-14  6:18           ` Darrick J. Wong
2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
2017-01-22 22:44             ` NeilBrown
2017-01-22 23:31               ` Jeff Layton
2017-01-23  0:21                 ` Theodore Ts'o
2017-01-23 10:09                   ` Kevin Wolf
2017-01-23 12:10                     ` Jeff Layton
2017-01-23 17:25                       ` Theodore Ts'o
2017-01-23 17:53                         ` Chuck Lever
2017-01-23 22:40                         ` Jeff Layton
2017-01-23 22:35                     ` Jeff Layton
2017-01-23 23:09                       ` Trond Myklebust
2017-01-24  0:16                         ` NeilBrown
2017-01-24  0:46                           ` Jeff Layton
2017-01-24 21:58                             ` NeilBrown
2017-01-25 13:00                               ` Jeff Layton
2017-01-30  5:30                                 ` NeilBrown
2017-01-24  3:34                           ` Trond Myklebust
2017-01-25 18:35                             ` Theodore Ts'o
2017-01-26  0:36                               ` NeilBrown
2017-01-26  9:25                                 ` Jan Kara
2017-01-26 22:19                                   ` NeilBrown
2017-01-27  3:23                                     ` Theodore Ts'o
2017-01-27  6:03                                       ` NeilBrown
2017-01-30 16:04                                       ` Jan Kara
2017-01-13 18:40     ` Al Viro
2017-01-13 19:06       ` Kevin Wolf
2017-01-11  5:03 ` Theodore Ts'o
2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
2017-01-11 15:45     ` Theodore Ts'o
2017-01-11 10:55   ` Chris Vest
2017-01-11 11:40   ` Kevin Wolf
2017-01-13  4:51     ` NeilBrown [this message]
2017-01-13 11:51       ` Kevin Wolf
2017-01-13 21:55         ` NeilBrown
2017-01-11 12:14   ` Chris Vest

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87y3yfftqa.fsf@notabene.neil.brown.name \
    --to=neilb@suse.com \
    --cc=hch@infradead.org \
    --cc=kwolf@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=riel@redhat.com \
    --cc=rwheeler@redhat.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).