From mboxrd@z Thu Jan  1 00:00:00 1970
From: Rusty Russell <rusty@rustcorp.com.au>
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Thu, 6 May 2010 15:35:19 +0930
Message-ID: <201005061535.20619.rusty@rustcorp.com.au>
References: <20100218222220.GA14847@redhat.com> <201005051428.41735.rusty@rustcorp.com.au> <20100505160343.264fd015@notabene.brown>
Mime-Version: 1.0
Content-Type: Text/Plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Jamie Lokier <jamie@shareable.org>, Jens Axboe <qemu@kernel.dk>,
	tytso@mit.edu, kvm@vger.kernel.org,
	"Michael S. Tsirkin" <mst@redhat.com>, qemu-devel@nongnu.org,
	virtualization@lists.linux-foundation.org, hch@lst.de
To: Neil Brown <neilb@suse.de>
Return-path: <kvm-owner@vger.kernel.org>
Received: from ozlabs.org ([203.10.76.45]:44097 "EHLO ozlabs.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750880Ab0EFGF0 convert rfc822-to-8bit (ORCPT
	<rfc822;kvm@vger.kernel.org>); Thu, 6 May 2010 02:05:26 -0400
In-Reply-To: <20100505160343.264fd015@notabene.brown>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
> > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > > > usual I/O suspects...
> > > >
> > > > It would be nice to have a fuller API for this, but the reality is
> > > > that only the flush approach is really workable. Even just strict
> > > > ordering of requests could only be supported on SCSI, and even there the
> > > > kernel still lacks proper guarantees on error handling to prevent
> > > > reordering there.
> > >
> > > There's a few I/O scheduling differences that might be useful:
> > >
> > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > >    before a BARRIER.  That might be useful for time-critical WRITEs,
> > >    and those issued by high I/O priority.
> >
> > This is only because no one actually wants flushes or barriers, though
> > I/O people seem to only offer that.  We really want "<these writes> must
> > occur before <this write>".  That offers maximum choice to the I/O subsystem
> > and potentially to smart (virtual?) disks.
> >
> > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > >    only for data belonging to a particular file (e.g. fdatasync with
> > >    no file size change, even on btrfs if O_DIRECT was used for the
> > >    writes being committed).  That would entail tagging FLUSHes and
> > >    WRITEs with a fs-specific identifier (such as inode number), opaque
> > >    to the scheduler which only checks equality.
> >
> > This is closer.  In userspace I'd be happy with a "all prior writes to this
> > struct file before all future writes".  Even if the original guarantees were
> > stronger (ie. inode basis).  We currently implement transactions using 4
> > fsync/msync pairs.
> >
> > 	write_recovery_data(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	write_recovery_header(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	overwrite_with_new_data(fd);
> > 	fsync(fd);
> > 	msync(mmap);
> > 	remove_recovery_header(fd);
> > 	fsync(fd);
> > 	msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you would
> not need the first fsync, and as long as you have two places to write recovery
> data, you don't need the 3rd and 4th syncs.
> Just:
>   write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
>   fsync / msync
>   overwrite_with_new_data()
>
> To recover, you choose the most recent log_space and replay the content.
> That may be a redundant operation, but that is no loss.

I think you missed a checksum for the new data?  Otherwise we can't tell if
the new data is completely written.  But yes, I will steal this scheme for
TDB2, thanks!
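
For concreteness, here is a rough sketch of how I read that scheme, with the
record checksum covering the new data as well as the header.  Everything named
here (struct log_slot, log_slot_fill(), log_slot_valid(), log_pick(), the
256-byte payload limit) is made up for illustration and is not TDB's format,
and FNV-1a merely stands in for a genuinely strong checksum.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAYLOAD_MAX 256	/* arbitrary size, for illustration only */

/* Hypothetical log slot: header plus the new data it protects. */
struct log_slot {
	uint64_t seq;	/* transaction number; higher means more recent */
	uint32_t len;	/* bytes of payload in use */
	uint32_t csum;	/* covers seq, len and payload[0..len) */
	unsigned char payload[LOG_PAYLOAD_MAX];
};

/* FNV-1a, a stand-in only: a real log wants a much stronger checksum. */
uint32_t fnv1a(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

uint32_t log_slot_csum(const struct log_slot *s)
{
	uint32_t h = 2166136261u;

	h = fnv1a(h, &s->seq, sizeof(s->seq));
	h = fnv1a(h, &s->len, sizeof(s->len));
	return fnv1a(h, s->payload, s->len);
}

/* Build a self-checking record; returns 0, or -1 if data does not fit. */
int log_slot_fill(struct log_slot *s, uint64_t seq,
		  const void *data, uint32_t len)
{
	if (len > LOG_PAYLOAD_MAX)
		return -1;
	memset(s, 0, sizeof(*s));
	s->seq = seq;
	s->len = len;
	memcpy(s->payload, data, len);
	s->csum = log_slot_csum(s);
	return 0;
}

/* A torn or all-zero (never written) slot fails this check. */
int log_slot_valid(const struct log_slot *s)
{
	return s->len <= LOG_PAYLOAD_MAX && s->csum == log_slot_csum(s);
}

/* Recovery: replay the most recent slot that verifies, if any. */
const struct log_slot *log_pick(const struct log_slot *a,
				const struct log_slot *b)
{
	int va = log_slot_valid(a), vb = log_slot_valid(b);

	if (va && vb)
		return a->seq >= b->seq ? a : b;
	return va ? a : vb ? b : NULL;
}
```

The writer would fill the unused slot, write it out, sync once, and only then
overwrite the live data; on recovery, log_pick() either finds a complete
record to replay or nothing at all.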

In practice, it's the first sync which is glacial, the rest are pretty cheap.

> Also cannot see the point of msync if you have already performed an fsync,
> and if there is a point, I would expect you to call msync before
> fsync... Maybe there is some subtlety there that I am not aware of.

I assume it's this from the msync man page:

       msync()  flushes  changes  made  to the in-core copy of a file that was
       mapped into memory using mmap(2) back to disk.   Without  use  of  this
       call  there  is  no guarantee that changes are written back before
       munmap(2) is called.
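
If that reading is right, the order Neil expects (msync, then fsync) would
look something like the sketch below.  sync_mapping() is a made-up helper
name, and whether the trailing fsync() is actually needed after
msync(MS_SYNC) is exactly the subtlety in question, so treat it as belt and
braces rather than a statement of what is required.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: push dirty pages of a shared mapping back to the
 * file, then flush the file itself.  Returns 0 on success, or -1 with
 * errno set by whichever call failed.
 */
int sync_mapping(void *map, size_t len, int fd)
{
	/* MS_SYNC blocks until the mapped changes have been written back. */
	if (msync(map, len, MS_SYNC) != 0)
		return -1;
	/* Also flushes anything written with write(2) on the same file. */
	return fsync(fd);
}
```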

> > It's an implementation detail; barrier has less flexibility because it has
> > less information about what is required. I'm saying I want to give you as
> > much information as I can, even if you don't use it yet.
>
> Only we know that approach doesn't work.
> People will learn that they don't need to give the extra information to still
> achieve the same result - just like they did with ext3 and fsync.
> Then when we improve the implementation to only provide the guarantees that
> you asked for, people will complain that they are getting empty files that
> they didn't expect.

I think that's an oversimplification: IIUC that occurred to people *not*
using fsync().  They weren't using it because it was too slow.  Providing
a primitive which is as fast or faster and more specific doesn't have the
same magnitude of social issues.

And we can't write userspace interfaces for idiots only.

> The abstraction I would like to see is a simple 'barrier' that contains no
> data and has a filesystem-wide effect.

I think you lack ambition ;)

Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
suboptimal for md?  You have to hand your barrier to every device, whereas
a file-wide primitive may theoretically only go to some.

Cheers,
Rusty.
