From: Pranith Kumar Karampuri
Date: Thu, 07 Apr 2016 13:18:39 +0530
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error
To: Kevin Wolf, Jeff Cody
Cc: Jeff Darcy, qemu-block@nongnu.org, Vijay Bellur, Raghavendra Gowdappa, qemu-devel@nongnu.org, Ric Wheeler, kdhananj@redhat.com
Message-ID: <570610D7.8000104@redhat.com>
In-Reply-To: <20160406132011.GI5098@noname.redhat.com>

+Raghavendra G, who implemented this option in write-behind, to this
upstream patch review discussion

Pranith

On 04/06/2016 06:50 PM, Kevin Wolf wrote:
> Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben:
>> On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
>>> Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
>>>> Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
>>>>> We had a thread discussing this not on the upstream list.
>>>>>
>>>>> My summary of the thread is that I don't understand why gluster
>>>>> should drop cached data after a failed fsync() for any open file.
>>>> It certainly shouldn't, but it does by default. :-)
>>>>
>>>> Have a look at commit 3fcead2d in glusterfs.git, which at least
>>>> introduces an option to get usable behaviour:
>>>>
>>>>     { .key = {"resync-failed-syncs-after-fsync"},
>>>>       .type = GF_OPTION_TYPE_BOOL,
>>>>       .default_value = "off",
>>>>       .description = "If sync of \"cached-writes issued before fsync\" "
>>>>                      "(to backend) fails, this option configures whether "
>>>>                      "to retry syncing them after fsync or forget them. "
>>>>                      "If set to on, cached-writes are retried "
>>>>                      "till a \"flush\" fop (or a successful sync) on sync "
>>>>                      "failures. "
>>>>                      "fsync itself is failed irrespective of the value of "
>>>>                      "this option. ",
>>>>     },
>>>>
>>>> As you can see, the default is still to drop cached data, and this is
>>>> with the file still opened. qemu needs to make sure that this option is
>>>> set, and if Jeff's comment in the code below is right, there is no way
>>>> currently to make sure that the option isn't silently ignored.
>>>>
>>>> Can we get some function that sets an option and fails if the option is
>>>> unknown? Or one that queries the state after setting an option, so we
>>>> can check whether we succeeded in switching to the mode we need?
>>>>
>>>>> For closed files, I think it might still happen but this is the same
>>>>> as any file system (and unlikely to be the case for qemu?).
>>>> Our problem is only with open images. Dropping caches for files that
>>>> qemu doesn't use any more is fine as far as I'm concerned.
>>>>
>>>> Note that our usage can involve cases where we reopen a file with
>>>> different flags, i.e. first open a second file descriptor, then close
>>>> the first one. The image was never completely closed here and we would
>>>> still want the cache to preserve our data in such cases.
>>> Hm, actually, maybe we should just call bdrv_flush() before reopening an
>>> image, and if an error is returned, we abort the reopen. It's far from
>>> being a hot path, so the overhead of a flush shouldn't matter, and it
>>> seems we're taking an unnecessary risk without doing this.
>>>
>> [I seem to have been dropped from the cc]
>>
>> Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
>> reop_s->glfs)? Because otherwise, we already do this in
>> bdrv_reopen_prepare() on the original fd. It happens right before the call
>> to drv->bdrv_reopen_prepare():
>>
>> 2020     ret = bdrv_flush(reopen_state->bs);
>> 2021     if (ret) {
>> 2022         error_setg_errno(errp, -ret, "Error flushing drive");
>> 2023         goto error;
>> 2024     }
>> 2025
>> 2026     if (drv->bdrv_reopen_prepare) {
>> 2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
> Ah, thanks. Yes, this is what I meant. I expected it somewhere close to
> the bdrv_drain_all() call, so I missed the call you quoted. So that's
> good news, at least this part of the problem doesn't exist then. :-)
>
> Kevin
>
>>>>> I will note that Linux in general had (still has, I think?) the
>>>>> behavior that once the process closes a file (or exits), we lose
>>>>> context to return an error to. From that point on, any failed IO
>>>>> from the page cache to the target disk will be dropped from cache.
>>>>> To hold things in the cache would lead it to fill with old data that
>>>>> is not really recoverable, and we have no good way to know whether the
>>>>> situation is repairable and how long that might take. Upstream
>>>>> kernel people have debated this; the behavior might be tweaked for
>>>>> certain types of errors.
>>>> That's fine, we just don't want the next fsync() to signal success when
>>>> in reality the cache has thrown away our data. As soon as we close the
>>>> image, there is no next fsync(), so you can do whatever you like.
>>>>
>>>> Kevin
>>>>
>>>>> On 04/06/2016 07:02 AM, Kevin Wolf wrote:
>>>>>> [ Adding some CCs ]
>>>>>>
>>>>>> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
>>>>>>> Upon receiving an I/O error after an fsync, by default gluster will
>>>>>>> dump its cache. However, QEMU will retry the fsync, which is especially
>>>>>>> useful when encountering errors such as ENOSPC when using the werror=stop
>>>>>>> option. When using caching with gluster, however, the last written data
>>>>>>> will be lost upon encountering ENOSPC. Using the cache xlator option of
>>>>>>> 'resync-failed-syncs-after-fsync' should cause gluster to retain the
>>>>>>> cached data after a failed fsync, so that ENOSPC and other transient
>>>>>>> errors are recoverable.
>>>>>>>
>>>>>>> Signed-off-by: Jeff Cody
>>>>>>> ---
>>>>>>>  block/gluster.c | 27 +++++++++++++++++++++++++++
>>>>>>>  configure       |  8 ++++++++
>>>>>>>  2 files changed, 35 insertions(+)
>>>>>>>
>>>>>>> diff --git a/block/gluster.c b/block/gluster.c
>>>>>>> index 30a827e..b1cf71b 100644
>>>>>>> --- a/block/gluster.c
>>>>>>> +++ b/block/gluster.c
>>>>>>> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,
>>>>>>>          goto out;
>>>>>>>      }
>>>>>>> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
>>>>>>> +    /* Without this, if fsync fails for a recoverable reason (for instance,
>>>>>>> +     * ENOSPC), gluster will dump its cache, preventing retries. This means
>>>>>>> +     * almost certain data loss. Not all gluster versions support the
>>>>>>> +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
>>>>>>> +     * discover during runtime if it is supported (this api returns success for
>>>>>>> +     * unknown key/value pairs) */
>>>>>> Honestly, this sucks. There is apparently no way to operate gluster so
>>>>>> we can safely recover after a failed fsync. "We hope everything is fine,
>>>>>> but depending on your gluster version, we may now corrupt your image"
>>>>>> isn't very good.
>>>>>>
>>>>>> We need to consider very carefully if this is good enough to go on after
>>>>>> an error. I'm currently leaning towards "no". That is, we should only
>>>>>> enable this after Gluster provides us a way to make sure that the option
>>>>>> is really set.
>>>>>>
>>>>>>> +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
>>>>>>> +                                           "resync-failed-syncs-after-fsync",
>>>>>>> +                                           "on");
>>>>>>> +    if (ret < 0) {
>>>>>>> +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
>>>>>>> +        ret = -errno;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +#endif
>>>>>> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
>>>>>> In this case (as well as theoretically in the case that the option
>>>>>> didn't take effect - if only we could know about it), a failed
>>>>>> glfs_fsync_async() is fatal and we need to stop operating on the image,
>>>>>> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
>>>>>> The guest will see a broken disk that fails all I/O requests, but that's
>>>>>> better than corrupting data.
>>>>>>
>>>>>> Kevin
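
For anyone who wants to try the write-behind option outside of qemu, below is
a minimal stand-alone libgfapi sketch of the glfs_set_xlator_option() call
from the patch above. The volume name ("testvol"), server host
("gluster.example.com") and port are placeholders, and, as discussed in this
thread, a return value of 0 only means the key/value pair was accepted by the
API, not that the write-behind xlator actually recognized it.

/*
 * Stand-alone sketch, not part of the patch: request the
 * resync-failed-syncs-after-fsync option on a gluster volume.
 *
 * Build (assuming the glusterfs-api development files are installed):
 *     gcc -o wb-option wb-option.c $(pkg-config --cflags --libs glusterfs-api)
 */
#include <stdio.h>
#include <stdlib.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    int ret;
    glfs_t *fs = glfs_new("testvol");       /* placeholder volume name */

    if (!fs) {
        perror("glfs_new");
        return EXIT_FAILURE;
    }

    ret = glfs_set_volfile_server(fs, "tcp", "gluster.example.com", 24007);
    if (ret < 0) {
        perror("glfs_set_volfile_server");
        goto out;
    }

    /*
     * Ask the write-behind xlator to keep cached writes after a failed
     * sync.  A 0 return does not prove the key is supported by the
     * installed gluster version, which is exactly the gap discussed above.
     */
    ret = glfs_set_xlator_option(fs, "*-write-behind",
                                 "resync-failed-syncs-after-fsync", "on");
    if (ret < 0) {
        perror("glfs_set_xlator_option");
        goto out;
    }

    ret = glfs_init(fs);
    if (ret < 0) {
        perror("glfs_init");
        goto out;
    }

    printf("volume initialized, option requested (but not verifiable)\n");

out:
    glfs_fini(fs);
    return ret < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}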