From: Rusty Russell <rusty@rustcorp.com.au>
To: Neil Brown <neilb@suse.de>
Cc: Jamie Lokier <jamie@shareable.org>, Jens Axboe <qemu@kernel.dk>,
tytso@mit.edu, kvm@vger.kernel.org,
"Michael S. Tsirkin" <mst@redhat.com>,
qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Thu, 6 May 2010 15:35:19 +0930 [thread overview]
Message-ID: <201005061535.20619.rusty@rustcorp.com.au> (raw)
In-Reply-To: <20100505160343.264fd015@notabene.brown>
On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
> > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > > usual I/O suspects...
> > > >
> > > > It would be nice to have a more fuller API for this, but the reality is
> > > > that only the flush approach is really workable. Even just strict
> > > > ordering of requests could only be supported on SCSI, and even there the
> > > > kernel still lacks proper guarantees on error handling to prevent
> > > > reordering there.
> > >
> > > There's a few I/O scheduling differences that might be useful:
> > >
> > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > > before a BARRIER. That might be useful for time-critical WRITEs,
> > > and those issued by high I/O priority.
> >
> > This is only because noone actually wants flushes or barriers, though
> > I/O people seem to only offer that. We really want "<these writes> must
> > occur before <this write>". That offers maximum choice to the I/O subsystem
> > and potentially to smart (virtual?) disks.
> >
> > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > > only for data belonging to a particular file (e.g. fdatasync with
> > > no file size change, even on btrfs if O_DIRECT was used for the
> > > writes being committed). That would entail tagging FLUSHes and
> > > WRITEs with a fs-specific identifier (such as inode number), opaque
> > > to the scheduler which only checks equality.
> >
> > This is closer. In userspace I'd be happy with a "all prior writes to this
> > struct file before all future writes". Even if the original guarantees were
> > stronger (ie. inode basis). We currently implement transactions using 4 fsync
> > /msync pairs.
> >
> > write_recovery_data(fd);
> > fsync(fd);
> > msync(mmap);
> > write_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
> > overwrite_with_new_data(fd);
> > fsync(fd);
> > msync(mmap);
> > remove_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you would
> not need the first fsync, and as long as you have two places to write recovery
> data, you don't need the 3rd and 4th syncs.
> Just:
> write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> fsync / msync
> overwrite_with_new_data()
>
> To recovery you choose the most recent log_space and replay the content.
> That may be a redundant operation, but that is no loss.
I think you missed a checksum for the new data? Otherwise we can't tell if
the new data is completely written. But yes, I will steal this scheme for
TDB2, thanks!
In practice, it's the first sync which is glacial, the rest are pretty cheap.
> Also cannot see the point of msync if you have already performed an fsync,
> and if there is a point, I would expect you to call msync before
> fsync... Maybe there is some subtlety there that I am not aware of.
I assume it's this from the msync man page:
msync() flushes changes made to the in-core copy of a file that was
mapped into memory using mmap(2) back to disk. Without use of this
call there is no guarantee that changes are written back before mun‐
map(2) is called.
> > It's an implementation detail; barrier has less flexibility because it has
> > less information about what is required. I'm saying I want to give you as
> > much information as I can, even if you don't use it yet.
>
> Only we know that approach doesn't work.
> People will learn that they don't need to give the extra information to still
> achieve the same result - just like they did with ext3 and fsync.
> Then when we improve the implementation to only provide the guarantees that
> you asked for, people will complain that they are getting empty files that
> they didn't expect.
I think that's an oversimplification: IIUC that occurred to people *not*
using fsync(). They weren't using it because it was too slow. Providing
a primitive which is as fast or faster and more specific doesn't have the
same magnitude of social issues.
And we can't write userspace interfaces for idiots only.
> The abstraction I would like to see is a simple 'barrier' that contains no
> data and has a filesystem-wide effect.
I think you lack ambition ;)
Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
suboptimal for md? Since you have to hand your barrier to every device
whereas a file-wide primitive may theoretically only go to some.
Cheers,
Rusty.
WARNING: multiple messages have this Message-ID (diff)
From: Rusty Russell <rusty@rustcorp.com.au>
To: Neil Brown <neilb@suse.de>
Cc: tytso@mit.edu, kvm@vger.kernel.org,
"Michael S. Tsirkin" <mst@redhat.com>,
qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
Jens Axboe <qemu@kernel.dk>,
hch@lst.de
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Thu, 6 May 2010 15:35:19 +0930 [thread overview]
Message-ID: <201005061535.20619.rusty@rustcorp.com.au> (raw)
In-Reply-To: <20100505160343.264fd015@notabene.brown>
On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
> > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > > usual I/O suspects...
> > > >
> > > > It would be nice to have a more fuller API for this, but the reality is
> > > > that only the flush approach is really workable. Even just strict
> > > > ordering of requests could only be supported on SCSI, and even there the
> > > > kernel still lacks proper guarantees on error handling to prevent
> > > > reordering there.
> > >
> > > There's a few I/O scheduling differences that might be useful:
> > >
> > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > > before a BARRIER. That might be useful for time-critical WRITEs,
> > > and those issued by high I/O priority.
> >
> > This is only because noone actually wants flushes or barriers, though
> > I/O people seem to only offer that. We really want "<these writes> must
> > occur before <this write>". That offers maximum choice to the I/O subsystem
> > and potentially to smart (virtual?) disks.
> >
> > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > > only for data belonging to a particular file (e.g. fdatasync with
> > > no file size change, even on btrfs if O_DIRECT was used for the
> > > writes being committed). That would entail tagging FLUSHes and
> > > WRITEs with a fs-specific identifier (such as inode number), opaque
> > > to the scheduler which only checks equality.
> >
> > This is closer. In userspace I'd be happy with a "all prior writes to this
> > struct file before all future writes". Even if the original guarantees were
> > stronger (ie. inode basis). We currently implement transactions using 4 fsync
> > /msync pairs.
> >
> > write_recovery_data(fd);
> > fsync(fd);
> > msync(mmap);
> > write_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
> > overwrite_with_new_data(fd);
> > fsync(fd);
> > msync(mmap);
> > remove_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you would
> not need the first fsync, and as long as you have two places to write recovery
> data, you don't need the 3rd and 4th syncs.
> Just:
> write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> fsync / msync
> overwrite_with_new_data()
>
> To recovery you choose the most recent log_space and replay the content.
> That may be a redundant operation, but that is no loss.
I think you missed a checksum for the new data? Otherwise we can't tell if
the new data is completely written. But yes, I will steal this scheme for
TDB2, thanks!
In practice, it's the first sync which is glacial, the rest are pretty cheap.
> Also cannot see the point of msync if you have already performed an fsync,
> and if there is a point, I would expect you to call msync before
> fsync... Maybe there is some subtlety there that I am not aware of.
I assume it's this from the msync man page:
msync() flushes changes made to the in-core copy of a file that was
mapped into memory using mmap(2) back to disk. Without use of this
call there is no guarantee that changes are written back before mun‐
map(2) is called.
> > It's an implementation detail; barrier has less flexibility because it has
> > less information about what is required. I'm saying I want to give you as
> > much information as I can, even if you don't use it yet.
>
> Only we know that approach doesn't work.
> People will learn that they don't need to give the extra information to still
> achieve the same result - just like they did with ext3 and fsync.
> Then when we improve the implementation to only provide the guarantees that
> you asked for, people will complain that they are getting empty files that
> they didn't expect.
I think that's an oversimplification: IIUC that occurred to people *not*
using fsync(). They weren't using it because it was too slow. Providing
a primitive which is as fast or faster and more specific doesn't have the
same magnitude of social issues.
And we can't write userspace interfaces for idiots only.
> The abstraction I would like to see is a simple 'barrier' that contains no
> data and has a filesystem-wide effect.
I think you lack ambition ;)
Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
suboptimal for md? Since you have to hand your barrier to every device
whereas a file-wide primitive may theoretically only go to some.
Cheers,
Rusty.
next prev parent reply other threads:[~2010-05-06 6:05 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-02-18 22:22 [PATCH] virtio-spec: document block CMD and FLUSH Michael S. Tsirkin
2010-02-18 22:22 ` [Qemu-devel] " Michael S. Tsirkin
2010-04-19 21:26 ` Michael S. Tsirkin
2010-04-19 21:26 ` Michael S. Tsirkin
2010-04-19 21:26 ` [Qemu-devel] " Michael S. Tsirkin
2010-04-28 15:52 ` Michael S. Tsirkin
2010-04-28 15:52 ` Michael S. Tsirkin
2010-04-28 15:52 ` [Qemu-devel] " Michael S. Tsirkin
2010-04-20 1:46 ` [Qemu-devel] " Jamie Lokier
2010-04-20 1:46 ` Jamie Lokier
2010-04-20 13:22 ` Paul Brook
2010-04-20 13:22 ` Paul Brook
2010-04-20 13:22 ` Paul Brook
2010-04-21 10:39 ` Michael S. Tsirkin
2010-04-21 10:39 ` Michael S. Tsirkin
2010-04-21 10:39 ` Michael S. Tsirkin
2010-05-04 18:56 ` Christoph Hellwig
2010-05-04 18:56 ` Christoph Hellwig
2010-05-04 19:01 ` Michael S. Tsirkin
2010-05-04 19:01 ` Michael S. Tsirkin
2010-05-04 19:01 ` Michael S. Tsirkin
2010-04-20 1:46 ` Jamie Lokier
2010-05-04 4:38 ` Rusty Russell
2010-05-04 4:38 ` Rusty Russell
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
2010-05-04 6:56 ` Stefan Hajnoczi
2010-05-04 6:56 ` Stefan Hajnoczi
2010-05-04 8:34 ` Avi Kivity
2010-05-04 8:34 ` Avi Kivity
2010-05-04 8:34 ` [Qemu-devel] " Avi Kivity
2010-05-04 8:41 ` Jens Axboe
2010-05-04 8:41 ` Jens Axboe
2010-05-04 8:41 ` [Qemu-devel] " Jens Axboe
2010-05-04 20:17 ` Jamie Lokier
2010-05-04 20:17 ` Jamie Lokier
2010-05-05 4:58 ` Rusty Russell
2010-05-05 4:58 ` Rusty Russell
2010-05-05 4:58 ` Rusty Russell
2010-05-05 6:03 ` Neil Brown
2010-05-05 6:03 ` Neil Brown
2010-05-06 6:05 ` Rusty Russell [this message]
2010-05-06 6:05 ` Rusty Russell
2010-05-06 14:57 ` Jamie Lokier
2010-05-06 14:57 ` Jamie Lokier
2010-05-06 14:57 ` Jamie Lokier
2010-05-06 6:05 ` Rusty Russell
2010-05-05 6:03 ` Neil Brown
2010-05-06 15:25 ` Jamie Lokier
2010-05-06 15:25 ` Jamie Lokier
2010-05-06 15:25 ` Jamie Lokier
2010-05-04 20:17 ` Jamie Lokier
2010-05-04 10:05 ` Christoph Hellwig
2010-05-04 10:05 ` Christoph Hellwig
2010-05-04 10:05 ` [Qemu-devel] " Christoph Hellwig
2010-05-04 20:32 ` Jamie Lokier
2010-05-04 20:32 ` Jamie Lokier
2010-05-04 20:32 ` Jamie Lokier
2010-05-04 18:54 ` Christoph Hellwig
2010-05-04 18:54 ` [Qemu-devel] " Christoph Hellwig
2010-05-04 18:56 ` Michael S. Tsirkin
2010-05-04 18:56 ` [Qemu-devel] " Michael S. Tsirkin
2010-05-04 18:58 ` Michael S. Tsirkin
2010-05-04 18:58 ` [Qemu-devel] " Michael S. Tsirkin
2010-05-04 18:58 ` Michael S. Tsirkin
2010-05-04 18:56 ` Michael S. Tsirkin
2010-05-05 5:00 ` Rusty Russell
2010-05-05 5:00 ` [Qemu-devel] " Rusty Russell
2010-05-05 5:00 ` Rusty Russell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=201005061535.20619.rusty@rustcorp.com.au \
--to=rusty@rustcorp.com.au \
--cc=hch@lst.de \
--cc=jamie@shareable.org \
--cc=kvm@vger.kernel.org \
--cc=mst@redhat.com \
--cc=neilb@suse.de \
--cc=qemu-devel@nongnu.org \
--cc=qemu@kernel.dk \
--cc=tytso@mit.edu \
--cc=virtualization@lists.linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.