From: Jeff Cody <jcody@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Jeff Darcy <jdarcy@redhat.com>,
qemu-block@nongnu.org, Vijay Bellur <vbellur@redhat.com>,
pkarampu@redhat.com, qemu-devel@nongnu.org,
Ric Wheeler <rwheeler@redhat.com>,
kdhananj@redhat.com
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error
Date: Wed, 6 Apr 2016 09:10:53 -0400
Message-ID: <20160406131053.GB7329@localhost.localdomain>
In-Reply-To: <20160406115159.GG5098@noname.redhat.com>
On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
> On 06.04.2016 at 13:41, Kevin Wolf wrote:
> > On 06.04.2016 at 13:19, Ric Wheeler wrote:
> > >
> > > We had a thread discussing this off the upstream list.
> > >
> > > My summary of the thread is that I don't understand why gluster
> > > should drop cached data after a failed fsync() for any open file.
> >
> > It certainly shouldn't, but it does by default. :-)
> >
> > Have a look at commit 3fcead2d in glusterfs.git, which at least
> > introduces an option to get usable behaviour:
> >
> > { .key = {"resync-failed-syncs-after-fsync"},
> > .type = GF_OPTION_TYPE_BOOL,
> > .default_value = "off",
> > .description = "If sync of \"cached-writes issued before fsync\" "
> > "(to backend) fails, this option configures whether "
> > "to retry syncing them after fsync or forget them. "
> > "If set to on, cached-writes are retried "
> > "till a \"flush\" fop (or a successful sync) on sync "
> > "failures. "
> > "fsync itself is failed irrespective of the value of "
> > "this option. ",
> > },
> >
> > As you can see, the default is still to drop cached data, and this is
> > with the file still opened. qemu needs to make sure that this option is
> > set, and if Jeff's comment in the code below is right, there is no way
> > currently to make sure that the option isn't silently ignored.
> >
> > Can we get some function that sets an option and fails if the option is
> > unknown? Or one that queries the state after setting an option, so we
> > can check whether we succeeded in switching to the mode we need?
> >
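For what it's worth, the sort of interface we would need is something like
the following -- purely hypothetical, nothing like this exists in libgfapi
today, and the names are only for illustration:

/* Hypothetical libgfapi additions (these do NOT currently exist). */

/* Like glfs_set_xlator_option(), but fails with -1/EINVAL when the
 * xlator or key is unknown, instead of silently succeeding. */
int glfs_set_xlator_option_strict(glfs_t *fs, const char *xlator,
                                  const char *key, const char *value);

/* Reads back the effective value of an option so the caller can verify
 * that a previous set actually took effect. */
int glfs_get_xlator_option(glfs_t *fs, const char *xlator,
                           const char *key, char *value, size_t size);

Either variant would let QEMU refuse to continue (or at least fail the
open) when it cannot guarantee that 'resync-failed-syncs-after-fsync' is
honored.
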
> > > For closed files, I think it might still happen but this is the same
> > > as any file system (and unlikely to be the case for qemu?).
> >
> > Our problem is only with open images. Dropping caches for files that
> > qemu doesn't use any more is fine as far as I'm concerned.
> >
> > Note that our usage can involve cases where we reopen a file with
> > different flags, i.e. first open a second file descriptor, then close
> > the first one. The image was never completely closed here and we would
> > still want the cache to preserve our data in such cases.
>
> Hm, actually, maybe we should just call bdrv_flush() before reopening an
> image, and if an error is returned, we abort the reopen. It's far from
> being a hot path, so the overhead of a flush shouldn't matter, and it
> seems we're taking an unnecessary risk without doing this.
>
[I seem to have been dropped from the cc]

Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
reop_s->glfs)? If not, we already do this in bdrv_reopen_prepare() in
block.c on the original fd, right before the call to
drv->bdrv_reopen_prepare():
    ret = bdrv_flush(reopen_state->bs);
    if (ret) {
        error_setg_errno(errp, -ret, "Error flushing drive");
        goto error;
    }

    if (drv->bdrv_reopen_prepare) {
        ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
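
If you do mean the new descriptor, then something along these lines in the
gluster reopen code might do it -- completely untested sketch, and it
assumes the new fd is already available as reop_s->fd at that point
(glfs_fsync() itself is a normal libgfapi call):

    /* Untested sketch: flush through the new glfs connection before the
     * reopen is committed, and fail the reopen if that flush fails.
     * Cleanup of the half-opened state is omitted here. */
    ret = glfs_fsync(reop_s->fd);
    if (ret < 0) {
        return -errno;
    }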
>
> > > I will note that Linux in general had (still has I think?) the
> > > behavior that once the process closes a file (or exits), we lose
> > > context to return an error to. From that point on, any failed IO
> > > from the page cache to the target disk will be dropped from cache.
> > > Holding things in the cache would fill it with old data that
> > > is not really recoverable, and we have no good way to know whether the
> > > situation is repairable or how long that might take. Upstream
> > > kernel people have debated this; the behavior might be tweaked for
> > > certain types of errors.
> >
> > That's fine, we just don't want the next fsync() to signal success when
> > in reality the cache has thrown away our data. As soon as we close the
> > image, there is no next fsync(), so you can do whatever you like.
> >
> > Kevin
> >
> > > On 04/06/2016 07:02 AM, Kevin Wolf wrote:
> > > >[ Adding some CCs ]
> > > >
> > > >On 06.04.2016 at 05:29, Jeff Cody wrote:
> > > >>Upon receiving an I/O error after an fsync, by default gluster will
> > > >>dump its cache. However, QEMU will retry the fsync, which is especially
> > > >>useful when encountering errors such as ENOSPC when using the werror=stop
> > > >>option. When using caching with gluster, however, the last written data
> > > >>will be lost upon encountering ENOSPC. Using the cache xlator option of
> > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the
> > > >>cached data after a failed fsync, so that ENOSPC and other transient
> > > >>errors are recoverable.
> > > >>
> > > >>Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > >>---
> > > >> block/gluster.c | 27 +++++++++++++++++++++++++++
> > > >> configure | 8 ++++++++
> > > >> 2 files changed, 35 insertions(+)
> > > >>
> > > >>diff --git a/block/gluster.c b/block/gluster.c
> > > >>index 30a827e..b1cf71b 100644
> > > >>--- a/block/gluster.c
> > > >>+++ b/block/gluster.c
> > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,
> > > >> goto out;
> > > >> }
> > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
> > > >>+ /* Without this, if fsync fails for a recoverable reason (for instance,
> > > >>+ * ENOSPC), gluster will dump its cache, preventing retries. This means
> > > >>+ * almost certain data loss. Not all gluster versions support the
> > > >>+ * 'resync-failed-syncs-after-fsync' key value, but there is no way to
> > > >>+ * discover during runtime if it is supported (this api returns success for
> > > >>+ * unknown key/value pairs) */
> > > >Honestly, this sucks. There is apparently no way to operate gluster so
> > > >we can safely recover after a failed fsync. "We hope everything is fine,
> > > >but depending on your gluster version, we may now corrupt your image"
> > > >isn't very good.
> > > >
> > > >We need to consider very carefully if this is good enough to go on after
> > > >an error. I'm currently leaning towards "no". That is, we should only
> > > >enable this after Gluster provides us a way to make sure that the option
> > > >is really set.
> > > >
> > > >>+ ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > > >>+ "resync-failed-syncs-after-fsync",
> > > >>+ "on");
> > > >>+ if (ret < 0) {
> > > >>+ error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > > >>+ ret = -errno;
> > > >>+ goto out;
> > > >>+ }
> > > >>+#endif
> > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> > > >In this case (as well as theoretically in the case that the option
> > > >didn't take effect - if only we could know about it), a failed
> > > >glfs_fsync_async() is fatal and we need to stop operating on the image,
> > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> > > >The guest will see a broken disk that fails all I/O requests, but that's
> > > >better than corrupting data.
> > > >
> > > >Kevin
> > >
> >
>
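
On the last point: I agree that without CONFIG_GLUSTERFS_XLATOR_OPT (or if
we ever learn that the option was ignored), setting bs->drv = NULL is the
safest thing we can do. Roughly, the flush path could end up looking like
this -- just a sketch, not a tested patch, and it assumes the existing
GlusterAIOCB/BDRVGlusterState types and the gluster_finish_aiocb callback
in block/gluster.c:

static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
{
    int ret;
    GlusterAIOCB acb;
    BDRVGlusterState *s = bs->opaque;

    acb.size = 0;
    acb.ret = 0;
    acb.coroutine = qemu_coroutine_self();
    acb.aio_context = bdrv_get_aio_context(bs);

    ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
    if (ret < 0) {
        ret = -errno;
        goto error;
    }

    qemu_coroutine_yield();
    ret = acb.ret;
    if (ret < 0) {
        goto error;
    }
    return ret;

error:
#ifndef CONFIG_GLUSTERFS_XLATOR_OPT
    /* Without the write-behind xlator option we cannot know whether
     * gluster kept our cached writes after the failed fsync, so stop
     * using the image rather than risk silent corruption. */
    bs->drv = NULL;
#endif
    return ret;
}

The guest then sees a disk that fails all further requests, which is ugly
but strictly better than corrupting the image.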