Date: Mon, 11 Apr 2016 00:26:50 -0400 (EDT)
From: Raghavendra Gowdappa
To: Pranith Kumar Karampuri
Cc: Kevin Wolf, Jeff Cody, Ric Wheeler, Jeff Darcy, qemu-block@nongnu.org, Vijay Bellur, qemu-devel@nongnu.org, kdhananj@redhat.com, Raghavendra Talur, Poornima Gurusiddaiah
Message-ID: <977766574.2310591.1460348810883.JavaMail.zimbra@redhat.com>
In-Reply-To: <570610D7.8000104@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error

> +Raghavendra G, who implemented this option in write-behind, to this
> upstream patch review discussion

Thanks Pranith. Kritika did inform us about the discussion. We were working
on solutions to the problems raised (and it was a long festive weekend in
Bangalore).

Sorry for top-posting. I am trying to address two issues raised in this mail
thread:

1. There is no way to identify whether setting an option succeeded in gfapi.
2. Why is retaining the cache after an fsync failure not the default
   behaviour?

1. No way to identify whether setting an option succeeded in gfapi:

I've added Poornima and Raghavendra Talur, who work on gfapi, to assist with
this.
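For context, here is a rough sketch (not the actual qemu code; the volume
name, server and error handling are placeholders, and the option name is the
write-behind option this patch series is about, assumed here to be
resync-failed-syncs-after-fsync) of how a gfapi consumer such as
block/gluster requests this behaviour. The gap being discussed is that
glfs_set_xlator_option() can only fail for locally invalid arguments; if the
installed write-behind does not recognise the key, the option is silently
ignored and the caller cannot tell:

#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    /* Placeholder volume/server; a real consumer gets these from its
     * configuration (qemu parses them from the gluster:// URI). */
    glfs_t *fs = glfs_new("testvol");
    if (!fs)
        return 1;

    glfs_set_volfile_server(fs, "tcp", "gluster-server.example.com", 24007);

    /* Ask write-behind to keep failed writes cached and resync them after
     * a failed fsync. A negative return only means the request itself was
     * malformed; it does NOT tell us whether the write-behind xlator of
     * this gluster version actually understands the key. */
    if (glfs_set_xlator_option(fs, "*-write-behind",
                               "resync-failed-syncs-after-fsync", "on") < 0) {
        fprintf(stderr, "could not queue write-behind option\n");
    }

    if (glfs_init(fs) != 0) {
        glfs_fini(fs);
        return 1;
    }

    /* ... open files with glfs_open()/glfs_creat() and do I/O ... */

    glfs_fini(fs);
    return 0;
}

Whether gfapi can report back that the loaded graph actually accepted the
option is the question for the gfapi folks added above.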
2. Why is retaining the cache after an fsync failure not the default
   behaviour?

Mostly so as not to break synchronized applications that rely on fsync
failures to retry; details of that discussion can be found below. The other
reason was that we could not figure out POSIX's take on the state of earlier
writes after an fsync failure (other filesystems such as xfs and ext4 behave
non-uniformly here). More specifically, the question was: "is it correct for
a write issued before a failed fsync to succeed on the backend storage
(i.e. persisting happens after the fsync failure)?" I've also added Vijay
Bellur, who was involved in the earlier discussion, to the cc list.

Following is the discussion we had earlier with Kevin, Jeff Cody and others
on the same topic. I am quoting it verbatim below.

> > > > > I would actually argue that this setting would be right for everyone,
> > > > > not just qemu. Can you think of a case where keeping the data cached
> > > > > after a failed fsync would actively hurt any application? I can only
> > > > > think of cases where data is unnecessarily lost if data is dropped.
> > > > >
> > > >
> > > > I worry about use cases with concurrent writers to the same file and
> > > > how different applications would handle fsync() failures with our new
> > > > behavior.
> > >
> > > Any specific scenario you're worried about?
> > >
> > > > Keeping the known old behavior as the default will ensure that we do
> > > > not break anything once this is out. qemu/virt users with gluster are
> > > > accustomed to setting the virt group and hence no additional knobs
> > > > would need to be altered by them.
> > >
> > > Not changing anything is a safe way to avoid regressions. But it's also
> > > a safe way to leave bugs unfixed. I would say that the current behaviour
> > > is at least borderline buggy and very hard for applications to handle
> > > correctly.
> >
> > I tend to agree with Kevin on this. If we have an error handling strategy
> > that is POSIX-compliant, I don't think there is a need to add one more
> > option. Most of the time, options tend to be left at their default values,
> > which is equivalent to not providing an option at all. However, before
> > doing that it's better we think through whether it can adversely affect
> > any existing deployments (even ones that are not POSIX-compliant).
> >
>
> One pattern that I can think of -
>
> Applications that operate on the same file from different clients through
> some locking or other co-ordination would have this pattern:
>
> lock (file), write (file), fsync (file), unlock (file);
>
> Now if the first fsync() fails, based on application logic the offset used
> for the failed write + fsync could be re-utilized by a co-ordinating
> application on another node to write out legitimate data. When control
> returns to the application that received the failure, the subsequent
> write + fsync can cause data to be overwritten at the old offset along with
> new data being written at the new offset.

Yeah, I agree. Co-ordinated applications on different mounts will have
issues if they are working on the assumption that no older writes will hit
the backend after fsync. Given that there seems to be a fair bit of confusion
about the status of retrying older writes after an fsync failure, we can
expect some regressions. (A sketch of this co-ordination pattern is appended
at the end of this mail.)

So, to summarize:

1. Keep the behaviour in patchset 11 of [1] as the default. Both fsync and
   flush act as barriers and will make sure that, after a failed fsync/flush,
   either
   a. older writes are synced to the backend, or
   b. older writes are failed and never retried.
2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a
   failed fsync and retry them until a flush. After a flush, no retries of
   failed syncs will be attempted. This behaviour will be introduced as an
   option.
3. Transient and non-transient errors will be treated alike, and failed
   syncs will be retried in the same way for both.

Does everyone agree on the above points? If yes, I'll modify [1] accordingly.

[1] http://review.gluster.org/#/c/12594/
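For reference, a minimal sketch of the co-ordination pattern described above
(plain POSIX C; the file name, offset and record contents are hypothetical).
The comment marks the point where silently retained write-behind data would
bite: after a failed fsync the application assumes nothing of its own will be
persisted in that region, so a peer on another node may legitimately reuse
the offset, and a later background retry of the cached write would overwrite
the peer's data.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write one record into a shared region, following the
 * lock (file), write (file), fsync (file), unlock (file) pattern. */
static int write_record(int fd, off_t off, const char *buf, size_t len)
{
    struct flock lk = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = off,
        .l_len    = (off_t)len,
    };

    if (fcntl(fd, F_SETLKW, &lk) < 0)                /* lock (file)   */
        return -1;

    ssize_t n = pwrite(fd, buf, len, off);           /* write (file)  */
    int rc = (n == (ssize_t)len) ? fsync(fd) : -1;   /* fsync (file)  */

    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);                         /* unlock (file) */

    /* On failure the application assumes nothing of its own is (or will
     * be) persisted at 'off', so a co-ordinating peer on another node may
     * reuse the offset for legitimate data.  If the client quietly keeps
     * the failed write cached and retries it later, that stale data can
     * land on top of the peer's -- the regression the default behaviour
     * (fsync/flush as barriers) is meant to avoid. */
    return rc;
}

int main(void)
{
    int fd = open("/mnt/glustervol/shared.dat", O_RDWR);   /* hypothetical */
    if (fd < 0)
        return 1;

    const char rec[] = "node-1 record";
    if (write_record(fd, 4096, rec, sizeof(rec) - 1) != 0)
        fprintf(stderr, "record not durable; offset may be reused by a peer\n");

    close(fd);
    return 0;
}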