From: Kevin Wolf <kwolf@redhat.com>
To: Jeff Cody <jcody@redhat.com>
Cc: Jeff Darcy <jdarcy@redhat.com>,
qemu-block@nongnu.org, Vijay Bellur <vbellur@redhat.com>,
pkarampu@redhat.com, qemu-devel@nongnu.org,
Ric Wheeler <rwheeler@redhat.com>,
kdhananj@redhat.com
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error
Date: Wed, 6 Apr 2016 15:20:11 +0200 [thread overview]
Message-ID: <20160406132011.GI5098@noname.redhat.com> (raw)
In-Reply-To: <20160406131053.GB7329@localhost.localdomain>
Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben:
> On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
> > Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
> > > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
> > > >
> > > > We had a thread discussing this not on the upstream list.
> > > >
> > > > My summary of the thread is that I don't understand why gluster
> > > > should drop cached data after a failed fsync() for any open file.
> > >
> > > It certainly shouldn't, but it does by default. :-)
> > >
> > > Have a look at commit 3fcead2d in glusterfs.git, which at least
> > > introduces an option to get usable behaviour:
> > >
> > > { .key = {"resync-failed-syncs-after-fsync"},
> > > .type = GF_OPTION_TYPE_BOOL,
> > > .default_value = "off",
> > > .description = "If sync of \"cached-writes issued before fsync\" "
> > > "(to backend) fails, this option configures whether "
> > > "to retry syncing them after fsync or forget them. "
> > > "If set to on, cached-writes are retried "
> > > "till a \"flush\" fop (or a successful sync) on sync "
> > > "failures. "
> > > "fsync itself is failed irrespective of the value of "
> > > "this option. ",
> > > },
> > >
> > > As you can see, the default is still to drop cached data, and this is
> > > with the file still opened. qemu needs to make sure that this option is
> > > set, and if Jeff's comment in the code below is right, there is no way
> > > currently to make sure that the option isn't silently ignored.
> > >
> > > Can we get some function that sets an option and fails if the option is
> > > unknown? Or one that queries the state after setting an option, so we
> > > can check whether we succeeded in switching to the mode we need?
> > >
> > > > For closed files, I think it might still happen but this is the same
> > > > as any file system (and unlikely to be the case for qemu?).
> > >
> > > Our problem is only with open images. Dropping caches for files that
> > > qemu doesn't use any more is fine as far as I'm concerned.
> > >
> > > Note that our usage can involve cases where we reopen a file with
> > > different flags, i.e. first open a second file descriptor, then close
> > > the first one. The image was never completely closed here and we would
> > > still want the cache to preserve our data in such cases.
> >
> > Hm, actually, maybe we should just call bdrv_flush() before reopening an
> > image, and if an error is returned, we abort the reopen. It's far from
> > being a hot path, so the overhead of a flush shouldn't matter, and it
> > seems we're taking an unnecessary risk without doing this.
> >
>
> [I seemed to have been dropped from the cc]
>
> Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
> reop_s->glfs)? Because otherwise, we already do this in
> bdrv_reopen_prepare() on the original fd. It happens right before the call
> to drv->bdrv_reopen_prepare():
>
>
> 2020 ret = bdrv_flush(reopen_state->bs);
> 2021 if (ret) {
> 2022 error_setg_errno(errp, -ret, "Error flushing drive");
> 2023 goto error;
> 2024 }
> 2025
> 2026 if (drv->bdrv_reopen_prepare) {
> 2027 ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
Ah, thanks. Yes, this is what I meant. I expected it somewhere close to
the bdrv_drain_all() call, so I missed the call you quoted. So that's
good news, at least this part of the problem doesn't exist then. :-)
Kevin
> >
> > > > I will note that Linux in general had (still has I think?) the
> > > > behavior that once the process closes a file (or exits), we lose
> > > > context to return an error to. From that point on, any failed IO
> > > > from the page cache to the target disk will be dropped from cache.
> > > > To hold things in the cache would lead it to fill with old data that
> > > > is not really recoverable and we have no good way to know that the
> > > > situation is repairable and how long that might take. Upstream
> > > > kernel people have debated this, the behavior might be tweaked for
> > > > certain types of errors.
> > >
> > > That's fine, we just don't want the next fsync() to signal success when
> > > in reality the cache has thrown away our data. As soon as we close the
> > > image, there is no next fsync(), so you can do whatever you like.
> > >
> > > Kevin
> > >
> > > > On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> > > > >[ Adding some CCs ]
> > > > >
> > > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
> > > > >>Upon receiving an I/O error after an fsync, by default gluster will
> > > > >>dump its cache. However, QEMU will retry the fsync, which is especially
> > > > >>useful when encountering errors such as ENOSPC when using the werror=stop
> > > > >>option. When using caching with gluster, however, the last written data
> > > > >>will be lost upon encountering ENOSPC. Using the cache xlator option of
> > > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > > > >>cached data after a failed fsync, so that ENOSPC and other transient
> > > > >>errors are recoverable.
> > > > >>
> > > > >>Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > > >>---
> > > > >> block/gluster.c | 27 +++++++++++++++++++++++++++
> > > > >> configure | 8 ++++++++
> > > > >> 2 files changed, 35 insertions(+)
> > > > >>
> > > > >>diff --git a/block/gluster.c b/block/gluster.c
> > > > >>index 30a827e..b1cf71b 100644
> > > > >>--- a/block/gluster.c
> > > > >>+++ b/block/gluster.c
> > > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,
> > > > >> goto out;
> > > > >> }
> > > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > > >>+ /* Without this, if fsync fails for a recoverable reason (for instance,
> > > > >>+ * ENOSPC), gluster will dump its cache, preventing retries. This means
> > > > >>+ * almost certain data loss. Not all gluster versions support the
> > > > >>+ * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > > >>+ * discover during runtime if it is supported (this api returns success for
> > > > >>+ * unknown key/value pairs) */
> > > > >Honestly, this sucks. There is apparently no way to operate gluster so
> > > > >we can safely recover after a failed fsync. "We hope everything is fine,
> > > > >but depending on your gluster version, we may now corrupt your image"
> > > > >isn't very good.
> > > > >
> > > > >We need to consider very carefully if this is good enough to go on after
> > > > >an error. I'm currently leaning towards "no". That is, we should only
> > > > >enable this after Gluster provides us a way to make sure that the option
> > > > >is really set.
> > > > >
> > > > >>+ ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > > >>+ "resync-failed-syncs-after-fsync",
> > > > >>+ "on");
> > > > >>+ if (ret < 0) {
> > > > >>+ error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > > >>+ ret = -errno;
> > > > >>+ goto out;
> > > > >>+ }
> > > > >>+#endif
> > > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > > > >In this case (as well as theoretically in the case that the option
> > > > >didn't take effect - if only we could know about it), a failed
> > > > >glfs_fsync_async() is fatal and we need to stop operating on the image,
> > > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > > > >The guest will see a broken disk that fails all I/O requests, but that's
> > > > >better than corrupting data.
> > > > >
> > > > >Kevin
> > > >
> > >
> >
next prev parent reply other threads:[~2016-04-06 13:20 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-06 3:29 [Qemu-devel] [PATCH for-2.6 0/2] Bug fixes for gluster Jeff Cody
2016-04-06 3:29 ` [Qemu-devel] [PATCH for-2.6 1/2] block/gluster: return correct error value Jeff Cody
2016-04-06 15:00 ` Niels de Vos
2016-04-06 3:29 ` [Qemu-devel] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error Jeff Cody
2016-04-06 11:02 ` Kevin Wolf
2016-04-06 11:19 ` Ric Wheeler
2016-04-06 11:41 ` Kevin Wolf
2016-04-06 11:51 ` [Qemu-devel] [Qemu-block] " Kevin Wolf
2016-04-06 13:10 ` Jeff Cody
2016-04-06 13:20 ` Kevin Wolf [this message]
2016-04-07 7:48 ` Pranith Kumar Karampuri
2016-04-11 4:26 ` Raghavendra Gowdappa
2016-05-09 6:38 ` Raghavendra Talur
2016-05-09 8:02 ` Kevin Wolf
2016-04-06 12:44 ` [Qemu-devel] " Jeff Cody
2016-04-06 14:47 ` Niels de Vos
2016-04-06 16:05 ` Jeff Cody
2016-04-07 16:17 ` Jeff Cody
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160406132011.GI5098@noname.redhat.com \
--to=kwolf@redhat.com \
--cc=jcody@redhat.com \
--cc=jdarcy@redhat.com \
--cc=kdhananj@redhat.com \
--cc=pkarampu@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=rwheeler@redhat.com \
--cc=vbellur@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.