From: Ric Wheeler <rwheeler@redhat.com>
To: Jeff Cody <jcody@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, kwolf@redhat.com,
	pkarampu@redhat.com, rgowdapp@redhat.com, ndevos@redhat.com,
	Rik van Riel <riel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for-2.6 v2 0/3] Bug fixes for gluster
Date: Tue, 19 Apr 2016 21:56:27 -0400
Message-ID: <5716E1CB.4020300@redhat.com>
In-Reply-To: <20160419140917.GC4841@localhost.localdomain>

On 04/19/2016 10:09 AM, Jeff Cody wrote:
> On Tue, Apr 19, 2016 at 08:18:39AM -0400, Ric Wheeler wrote:
>> On 04/19/2016 08:07 AM, Jeff Cody wrote:
>>> Bug fixes for gluster; the third patch prevents
>>> potential data loss when trying to recover from
>>> a recoverable error (such as ENOSPC).
>> Hi Jeff,
>>
>> Just a note: I have been talking to some of the disk drive people
>> here at LSF (the kernel summit for file and storage people) and got
>> a non-public confirmation that individual storage devices (SATA
>> or SCSI drives) can also dump cache state when a synchronize cache
>> command fails.  I also followed up with Rik van Riel - in the page
>> cache in general, when we fail to write back dirty pages, they are
>> simply marked "clean" (which effectively means that they get
>> dropped).
>>
>> Long-winded way of saying that I think this scenario is not
>> unique to gluster - any failed fsync() to a file (or block device)
>> might be an indication of permanent data loss.
>>
> Ric,
>
> Thanks.
>
> I think you are right; we likely need to address how QEMU handles fsync
> failures across the board at some point (2.7?).  Another point to
> consider is that QEMU is cross-platform - not only do we have different
> protocols and filesystems, but also different underlying host OSes.
> It is likely, as you said, that there are other non-gluster scenarios where
> we have non-recoverable data loss on fsync failure.
>
> With Gluster specifically, if we look at just ENOSPC, does this mean that
> even if Gluster retains its cache after fsync failure, we still won't know
> whether there was permanent data loss?  If we hit ENOSPC during an fsync, I
> presume that means Gluster itself may have encountered ENOSPC from an fsync to
> the underlying storage.  In that case, does Gluster just pass the error up
> the stack?
>
> Jeff

I still worry that in many non-gluster situations we will have permanent data
loss here. Specifically, because of the way the page cache works, if we fail to
write back cached data *at any time*, a future fsync() will return a failure.

That failure could be because of a thinly provisioned backing store, but in the
interim the page cache is free to drop the pages that failed to write back. In
effect, we end up with data loss in part or in whole, with no way to detect
which bits got dropped.

Note that this is not a gluster issue; it applies to any file system on top of
thinly provisioned storage (e.g., we would see the same thing with XFS or ext4
on thin storage). In effect, if gluster has written the data back to XFS and
that sits on top of a thinly provisioned target, the kernel might drop that
data before you can try an fsync again. Even if you retry the fsync(), the
pages are marked clean, so they will not be pushed back to storage on that
second fsync().
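To make that sequence concrete, here is a minimal C sketch - illustrative only,
it will not reproduce the failure on a healthy filesystem, and the path is a
hypothetical thin volume - of why a simple "retry fsync() until it succeeds"
loop can report success after the data is already gone:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* hypothetical path on a thinly provisioned volume */
    const char *path = argc > 1 ? argv[1] : "/mnt/thin/testfile";
    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The write lands in the page cache and succeeds even if the
     * backing store is already out of space. */
    if (write(fd, buf, sizeof(buf)) < 0) {
        perror("write");
    }

    /* First fsync(): writeback hits ENOSPC/EIO, the error is reported
     * here, and the kernel marks the failed pages clean. */
    if (fsync(fd) < 0) {
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));

        /* Retry: the pages are already "clean", so there is nothing
         * left to write back and this typically returns 0 - success is
         * reported even though the data never reached storage. */
        if (fsync(fd) == 0) {
            fprintf(stderr, "retry reported success; data may be lost\n");
        }
    }

    close(fd);
    return 0;
}

On a healthy filesystem this program simply succeeds; the comments describe
what happens when writeback fails underneath, which is exactly the window this
patch series is worried about.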

The same issue exists with link loss - if we lose the connection to a storage
target, it is likely to take time to detect that and more time to reconnect. In
the interim, any page cache data is very likely to get dropped under memory
pressure.

In both of these cases, an fsync() failure is effectively a signal that data
has very likely already been lost. A retry will not save the day.

At LSF/MM today, we discussed an option that would allow the page cache to hang
on to data - for retryable errors only, for example - so that this would not
happen. The impact of this is also potentially huge (page cache/physical memory
could be exhausted while waiting for an admin to fix the issue), so it would
have to be a non-default option.

I think that we will need some discussions with the kernel memory management 
team (and some storage kernel people) to see what seems reasonable here.

Regards,

Ric

>
>>> The final patch closes the gluster fd and sets the
>>> protocol drv to NULL on fsync failure in gluster;
>>> we have no way of knowing which gluster versions
>>> support retaining the fsync cache on error, so until
>>> we do, the safest thing is to invalidate the
>>> drive.
>>>
>>> Jeff Cody (3):
>>>    block/gluster: return correct error value
>>>    block/gluster: code movement of qemu_gluster_close()
>>>    block/gluster: prevent data loss after i/o error
>>>
>>>   block/gluster.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++-----------
>>>   configure       |  8 +++++++
>>>   2 files changed, 62 insertions(+), 12 deletions(-)
>>>
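
For anyone who wants the shape of that fix outside the QEMU tree: below is a
stand-alone POSIX sketch of the policy the cover letter describes. The struct
and function names here are hypothetical; the real patch operates on the QEMU
block layer state in block/gluster.c (the glfs fd plus bs->drv), not a raw file
descriptor. The idea is the same: once fsync() fails, invalidate the handle so
later requests fail fast instead of pretending the data is safe.

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the driver state; the real code keeps a
 * glfs_fd_t and clears bs->drv instead of a plain fd/flag pair. */
struct backend {
    int fd;          /* -1 once the backend has been invalidated */
    bool invalid;
};

int backend_flush(struct backend *b)
{
    if (b->invalid) {
        return -EIO;
    }
    if (fsync(b->fd) < 0) {
        int ret = -errno;
        /* We cannot tell whether the cache below us survived the failed
         * flush, so stop issuing I/O rather than risk silent data loss. */
        close(b->fd);
        b->fd = -1;
        b->invalid = true;
        return ret;
    }
    return 0;
}

ssize_t backend_write(struct backend *b, const void *buf, size_t len)
{
    if (b->invalid) {
        return -EIO;  /* surface the earlier failure instead of buffering more */
    }
    return write(b->fd, buf, len);
}

This keeps callers from piling more writes into a cache whose state we no
longer trust, which is the data-loss window discussed above.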

