From: Pranith Kumar Karampuri
Date: Thu, 07 Apr 2016 13:18:39 +0530
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error
To: Kevin Wolf, Jeff Cody
Cc: Jeff Darcy, qemu-block@nongnu.org, Vijay Bellur, Raghavendra Gowdappa, qemu-devel@nongnu.org, Ric Wheeler, kdhananj@redhat.com
Message-ID: <570610D7.8000104@redhat.com>
In-Reply-To: <20160406132011.GI5098@noname.redhat.com>

+Raghavendra G, who implemented this option in write-behind, to this
upstream patch review discussion

Pranith

On 04/06/2016 06:50 PM, Kevin Wolf wrote:
> Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben:
>> On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote:
>>> Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben:
>>>> Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben:
>>>>> We had a thread discussing this not on the upstream list.
>>>>>
>>>>> My summary of the thread is that I don't understand why gluster
>>>>> should drop cached data after a failed fsync() for any open file.
>>>> It certainly shouldn't, but it does by default. :-)
>>>>
>>>> Have a look at commit 3fcead2d in glusterfs.git, which at least
>>>> introduces an option to get usable behaviour:
>>>>
>>>>     { .key = {"resync-failed-syncs-after-fsync"},
>>>>       .type = GF_OPTION_TYPE_BOOL,
>>>>       .default_value = "off",
>>>>       .description = "If sync of \"cached-writes issued before fsync\" "
>>>>                      "(to backend) fails, this option configures whether "
>>>>                      "to retry syncing them after fsync or forget them. "
>>>>                      "If set to on, cached-writes are retried "
>>>>                      "till a \"flush\" fop (or a successful sync) on sync "
>>>>                      "failures. "
>>>>                      "fsync itself is failed irrespective of the value of "
>>>>                      "this option. ",
>>>>     },
>>>>
>>>> As you can see, the default is still to drop cached data, and this is
>>>> with the file still opened. qemu needs to make sure that this option is
>>>> set, and if Jeff's comment in the code below is right, there is no way
>>>> currently to make sure that the option isn't silently ignored.
>>>>
>>>> Can we get some function that sets an option and fails if the option is
>>>> unknown? Or one that queries the state after setting an option, so we
>>>> can check whether we succeeded in switching to the mode we need?
>>>>
>>>>> For closed files, I think it might still happen but this is the same
>>>>> as any file system (and unlikely to be the case for qemu?).
>>>> Our problem is only with open images. Dropping caches for files that
>>>> qemu doesn't use any more is fine as far as I'm concerned.
>>>>
>>>> Note that our usage can involve cases where we reopen a file with
>>>> different flags, i.e. first open a second file descriptor, then close
>>>> the first one. The image was never completely closed here and we would
>>>> still want the cache to preserve our data in such cases.
>>> Hm, actually, maybe we should just call bdrv_flush() before reopening an
>>> image, and if an error is returned, we abort the reopen. It's far from
>>> being a hot path, so the overhead of a flush shouldn't matter, and it
>>> seems we're taking an unnecessary risk without doing this.
>>>
>> [I seem to have been dropped from the cc]
>>
>> Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
>> reop_s->glfs)? Because otherwise, we already do this in
>> bdrv_reopen_prepare() on the original fd. It happens right before the call
>> to drv->bdrv_reopen_prepare():
>>
>> 2020     ret = bdrv_flush(reopen_state->bs);
>> 2021     if (ret) {
>> 2022         error_setg_errno(errp, -ret, "Error flushing drive");
>> 2023         goto error;
>> 2024     }
>> 2025
>> 2026     if (drv->bdrv_reopen_prepare) {
>> 2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
> Ah, thanks. Yes, this is what I meant. I expected it somewhere close to
> the bdrv_drain_all() call, so I missed the call you quoted. So that's
> good news, at least this part of the problem doesn't exist then. :-)
>
> Kevin
>
>>>>> I will note that Linux in general had (still has, I think?) the
>>>>> behavior that once the process closes a file (or exits), we lose
>>>>> context to return an error to. From that point on, any failed IO
>>>>> from the page cache to the target disk will be dropped from cache.
>>>>> To hold things in the cache would lead it to fill with old data that
>>>>> is not really recoverable, and we have no good way to know whether the
>>>>> situation is repairable and how long that might take. Upstream
>>>>> kernel people have debated this; the behavior might be tweaked for
>>>>> certain types of errors.
>>>> That's fine, we just don't want the next fsync() to signal success when
>>>> in reality the cache has thrown away our data. As soon as we close the
>>>> image, there is no next fsync(), so you can do whatever you like.
>>>>
>>>> Kevin
>>>>
>>>>> On 04/06/2016 07:02 AM, Kevin Wolf wrote:
>>>>>> [ Adding some CCs ]
>>>>>>
>>>>>> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben:
>>>>>>> Upon receiving an I/O error after an fsync, by default gluster will
>>>>>>> dump its cache. However, QEMU will retry the fsync, which is especially
>>>>>>> useful when encountering errors such as ENOSPC when using the werror=stop
>>>>>>> option. When using caching with gluster, however, the last written data
>>>>>>> will be lost upon encountering ENOSPC. Using the cache xlator option of
>>>>>>> 'resync-failed-syncs-after-fsync' should cause gluster to retain the
>>>>>>> cached data after a failed fsync, so that ENOSPC and other transient
>>>>>>> errors are recoverable.
>>>>>>>
>>>>>>> Signed-off-by: Jeff Cody
>>>>>>> ---
>>>>>>>  block/gluster.c | 27 +++++++++++++++++++++++++++
>>>>>>>  configure       |  8 ++++++++
>>>>>>>  2 files changed, 35 insertions(+)
>>>>>>>
>>>>>>> diff --git a/block/gluster.c b/block/gluster.c
>>>>>>> index 30a827e..b1cf71b 100644
>>>>>>> --- a/block/gluster.c
>>>>>>> +++ b/block/gluster.c
>>>>>>> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,
>>>>>>>          goto out;
>>>>>>>      }
>>>>>>> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
>>>>>>> +    /* Without this, if fsync fails for a recoverable reason (for instance,
>>>>>>> +     * ENOSPC), gluster will dump its cache, preventing retries. This means
>>>>>>> +     * almost certain data loss. Not all gluster versions support the
>>>>>>> +     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
>>>>>>> +     * discover during runtime if it is supported (this api returns success for
>>>>>>> +     * unknown key/value pairs) */
>>>>>> Honestly, this sucks. There is apparently no way to operate gluster so
>>>>>> we can safely recover after a failed fsync. "We hope everything is fine,
>>>>>> but depending on your gluster version, we may now corrupt your image"
>>>>>> isn't very good.
>>>>>>
>>>>>> We need to consider very carefully if this is good enough to go on after
>>>>>> an error. I'm currently leaning towards "no". That is, we should only
>>>>>> enable this after Gluster provides us a way to make sure that the option
>>>>>> is really set.
>>>>>>
>>>>>>> +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
>>>>>>> +                                           "resync-failed-syncs-after-fsync",
>>>>>>> +                                           "on");
>>>>>>> +    if (ret < 0) {
>>>>>>> +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
>>>>>>> +        ret = -errno;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +#endif
>>>>>> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
>>>>>> In this case (as well as theoretically in the case that the option
>>>>>> didn't take effect - if only we could know about it), a failed
>>>>>> glfs_fsync_async() is fatal and we need to stop operating on the image,
>>>>>> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
>>>>>> The guest will see a broken disk that fails all I/O requests, but that's
>>>>>> better than corrupting data.
>>>>>>
>>>>>> Kevin
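
For anyone who wants to try the write-behind option outside of qemu, below is
a minimal stand-alone libgfapi sketch of the glfs_set_xlator_option() call
from the patch above. The volume name ("testvol"), server host
("gluster.example.com") and port are placeholders, and, as discussed in this
thread, a return value of 0 only means the key/value pair was accepted by the
API, not that the write-behind xlator actually recognized it.

/*
 * Stand-alone sketch, not part of the patch: request the
 * resync-failed-syncs-after-fsync option on a gluster volume.
 *
 * Build (assuming the glusterfs-api development files are installed):
 *     gcc -o wb-option wb-option.c $(pkg-config --cflags --libs glusterfs-api)
 */
#include <stdio.h>
#include <stdlib.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    int ret;
    glfs_t *fs = glfs_new("testvol");       /* placeholder volume name */

    if (!fs) {
        perror("glfs_new");
        return EXIT_FAILURE;
    }

    ret = glfs_set_volfile_server(fs, "tcp", "gluster.example.com", 24007);
    if (ret < 0) {
        perror("glfs_set_volfile_server");
        goto out;
    }

    /*
     * Ask the write-behind xlator to keep cached writes after a failed
     * sync.  A 0 return does not prove the key is supported by the
     * installed gluster version, which is exactly the gap discussed above.
     */
    ret = glfs_set_xlator_option(fs, "*-write-behind",
                                 "resync-failed-syncs-after-fsync", "on");
    if (ret < 0) {
        perror("glfs_set_xlator_option");
        goto out;
    }

    ret = glfs_init(fs);
    if (ret < 0) {
        perror("glfs_init");
        goto out;
    }

    printf("volume initialized, option requested (but not verifiable)\n");

out:
    glfs_fini(fs);
    return ret < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}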