* [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
@ 2010-05-04 4:38 ` Rusty Russell
0 siblings, 0 replies; 68+ messages in thread
From: Rusty Russell @ 2010-05-04 4:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: tytso, kvm, Neil Brown, qemu-devel, virtualization, Jens Axboe,
hch
On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> I took a stab at documenting CMD and FLUSH request types in virtio
> block. Christoph, could you look over this please?
>
> I note that the interface seems full of warts to me,
> this might be a first step to cleaning them.
ISTR Christoph had withdrawn some patches in this area, and was waiting
for him to resubmit?
I've given up on figuring out the block device. What seem to me to be sane
semantics along the lines of memory barriers are foreign to disk people: they
want (and depend on) flushing everywhere.
For example, tdb transactions do not require a flush, they only require what
I would call a barrier: that prior data be written out before any future data.
Surely that would be more efficient in general than a flush! In fact, TDB
wants only writes to *that file* (and metadata) written out first; it has no
ordering issues with other I/O on the same device.
A generic I/O interface would allow you to specify "this request depends on these
outstanding requests" and leave it at that. It might have some sync flush
command for dumb applications and OSes. The userspace API might not be as
precise and only allow such a barrier against all prior writes on this fd.
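A minimal sketch of the "this request depends on these outstanding requests" idea, assuming a hypothetical `struct io_req` and `io_req_ready()` helper. Neither exists in virtio or the Linux block layer; the names and the fixed-size dependency array are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical request descriptor: not a virtio or Linux structure. */
struct io_req {
    int id;
    bool done;                            /* set once the write is stable */
    size_t ndeps;                         /* number of entries used in deps[] */
    const struct io_req *deps[MAX_DEPS];  /* requests that must complete first */
};

/* A request may be issued only when everything it depends on is done. */
static bool io_req_ready(const struct io_req *r)
{
    assert(r->ndeps <= MAX_DEPS);
    for (size_t i = 0; i < r->ndeps; i++)
        if (!r->deps[i]->done)
            return false;
    return true;
}
```

Requests with no dependencies are always ready, so anything that does not care about ordering stays freely reorderable; a sync flush would be the special case of depending on everything outstanding.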
ISTR someone mentioning a desire for such an API years ago, so CC'ing the
usual I/O suspects...
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
@ 2010-05-04 6:56 ` Stefan Hajnoczi
-1 siblings, 0 replies; 68+ messages in thread
From: Stefan Hajnoczi @ 2010-05-04 6:56 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Jens Axboe, hch
A userspace barrier API would be very useful instead of doing fsync when
only ordering is required. I'd like to follow that discussion too.
Stefan
On 4 May 2010 05:39, "Rusty Russell" <rusty@rustcorp.com.au> wrote:
On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> I took a stab at documenting CMD and FLU...
ISTR Christoph had withdrawn some patches in this area, and was waiting
for him to resubmit?
I've given up on figuring out the block device. What seem to me to be sane
semantics along the lines of memory barriers are foreign to disk people: they
want (and depend on) flushing everywhere.
For example, tdb transactions do not require a flush, they only require what
I would call a barrier: that prior data be written out before any future data.
Surely that would be more efficient in general than a flush! In fact, TDB
wants only writes to *that file* (and metadata) written out first; it has no
ordering issues with other I/O on the same device.
A generic I/O interface would allow you to specify "this request depends on these
outstanding requests" and leave it at that. It might have some sync flush
command for dumb applications and OSes. The userspace API might not be as
precise and only allow such a barrier against all prior writes on this fd.
ISTR someone mentioning a desire for such an API years ago, so CC'ing the
usual I/O suspects...
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
(?)
(?)
@ 2010-05-04 8:34 ` Avi Kivity
-1 siblings, 0 replies; 68+ messages in thread
From: Avi Kivity @ 2010-05-04 8:34 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Anthony Liguori, Jens Axboe, hch
On 05/04/2010 07:38 AM, Rusty Russell wrote:
> On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
>
>> I took a stab at documenting CMD and FLUSH request types in virtio
>> block. Christoph, could you look over this please?
>>
>> I note that the interface seems full of warts to me,
>> this might be a first step to cleaning them.
>>
> ISTR Christoph had withdrawn some patches in this area, and was waiting
> for him to resubmit?
>
> I've given up on figuring out the block device. What seem to me to be sane
> semantics along the lines of memory barriers are foreign to disk people: they
> want (and depend on) flushing everywhere.
>
> For example, tdb transactions do not require a flush, they only require what
> I would call a barrier: that prior data be written out before any future data.
> Surely that would be more efficient in general than a flush! In fact, TDB
> wants only writes to *that file* (and metadata) written out first; it has no
> ordering issues with other I/O on the same device.
>
I think that's SCSI ordered tags.
> A generic I/O interface would allow you to specify "this request depends on these
> outstanding requests" and leave it at that. It might have some sync flush
> command for dumb applications and OSes. The userspace API might not be as
> precise and only allow such a barrier against all prior writes on this fd.
>
Depends on all previous requests, and will commit before all following
requests, i.e. a full barrier.
> ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> usual I/O suspects...
>
I'd love to see TCQ exposed to user space.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
` (3 preceding siblings ...)
(?)
@ 2010-05-04 8:41 ` Jens Axboe
-1 siblings, 0 replies; 68+ messages in thread
From: Jens Axboe @ 2010-05-04 8:41 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Anthony Liguori, hch
On Tue, May 04 2010, Rusty Russell wrote:
> On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> > I took a stab at documenting CMD and FLUSH request types in virtio
> > block. Christoph, could you look over this please?
> >
> > I note that the interface seems full of warts to me,
> > this might be a first step to cleaning them.
>
> ISTR Christoph had withdrawn some patches in this area, and was waiting
> for him to resubmit?
>
> I've given up on figuring out the block device. What seem to me to be sane
> semantics along the lines of memory barriers are foreign to disk people: they
> want (and depend on) flushing everywhere.
>
> For example, tdb transactions do not require a flush, they only require what
> I would call a barrier: that prior data be written out before any future data.
> Surely that would be more efficient in general than a flush! In fact, TDB
> wants only writes to *that file* (and metadata) written out first; it has no
> ordering issues with other I/O on the same device.
>
> A generic I/O interface would allow you to specify "this request depends on these
> outstanding requests" and leave it at that. It might have some sync flush
> command for dumb applications and OSes. The userspace API might not be as
> precise and only allow such a barrier against all prior writes on this fd.
>
> ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> usual I/O suspects...
It would be nice to have a fuller API for this, but the reality is
that only the flush approach is really workable. Even just strict
ordering of requests could only be supported on SCSI, and even there the
kernel still lacks proper guarantees on error handling to prevent
reordering there.
--
Jens Axboe
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 8:41 ` [Qemu-devel] " Jens Axboe
@ 2010-05-04 20:17 ` Jamie Lokier
-1 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2010-05-04 20:17 UTC (permalink / raw)
To: Jens Axboe
Cc: Rusty Russell, tytso, kvm, Michael S. Tsirkin, Neil Brown,
qemu-devel, virtualization, hch
Jens Axboe wrote:
> On Tue, May 04 2010, Rusty Russell wrote:
> > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > usual I/O suspects...
>
> It would be nice to have a fuller API for this, but the reality is
> that only the flush approach is really workable. Even just strict
> ordering of requests could only be supported on SCSI, and even there the
> kernel still lacks proper guarantees on error handling to prevent
> reordering there.
There's a few I/O scheduling differences that might be useful:
1. The I/O scheduler could freely move WRITEs before a FLUSH but not
before a BARRIER. That might be useful for time-critical WRITEs,
and those issued by high I/O priority.
2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
only for data belonging to a particular file (e.g. fdatasync with
no file size change, even on btrfs if O_DIRECT was used for the
writes being committed). That would entail tagging FLUSHes and
WRITEs with a fs-specific identifier (such as inode number), opaque
to the scheduler which only checks equality.
3. By delaying FLUSHes through reordering as above, the I/O scheduler
could merge multiple FLUSHes into a single command.
4. On MD/RAID, BARRIER requires every backing device to quiesce before
sending the low-level cache-flush, and all of those to finish
before resuming each backing device. FLUSH doesn't require as much
synchronising. (With per-file FLUSH; see 2; it could even avoid
FLUSH altogether to some backing devices for small files).
In other words, FLUSH can be more relaxed than BARRIER inside the
kernel. It's ironic that we think of fsync as stronger than
fbarrier outside the kernel :-)
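Points 1 and 2 above can be sketched as two reordering predicates. This is only an illustration under assumptions: `struct sched_req`, the `tag` field (with 0 meaning "whole device") and both function names are hypothetical, not part of any existing I/O scheduler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum op { OP_WRITE, OP_FLUSH, OP_BARRIER };

/* Hypothetical scheduler-side view of a request. */
struct sched_req {
    enum op op;
    uint64_t tag;   /* fs-specific id (e.g. inode number); 0 = whole device */
};

/* Point 1: a WRITE queued after the fence may be hoisted in front of a
 * FLUSH, but never in front of a BARRIER. */
static bool later_write_may_move_before(const struct sched_req *w,
                                        const struct sched_req *fence)
{
    assert(w->op == OP_WRITE);
    return fence->op == OP_FLUSH;
}

/* Point 2: a WRITE queued before the fence may be delayed past a per-file
 * FLUSH if it belongs to a different file. The scheduler only compares
 * tags for equality and never interprets them. */
static bool earlier_write_may_move_after(const struct sched_req *w,
                                         const struct sched_req *fence)
{
    assert(w->op == OP_WRITE);
    return fence->op == OP_FLUSH && fence->tag != 0 && w->tag != fence->tag;
}
```

With both predicates in place, FLUSHes that have drifted next to each other could be merged into one command (point 3); that step is left out of the sketch.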
-- Jamie
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 20:17 ` Jamie Lokier
@ 2010-05-05 4:58 ` Rusty Russell
-1 siblings, 0 replies; 68+ messages in thread
From: Rusty Russell @ 2010-05-05 4:58 UTC (permalink / raw)
To: Jamie Lokier
Cc: Jens Axboe, tytso, kvm, Michael S. Tsirkin, Neil Brown,
qemu-devel, virtualization, hch
On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > usual I/O suspects...
> >
> > It would be nice to have a fuller API for this, but the reality is
> > that only the flush approach is really workable. Even just strict
> > ordering of requests could only be supported on SCSI, and even there the
> > kernel still lacks proper guarantees on error handling to prevent
> > reordering there.
>
> There's a few I/O scheduling differences that might be useful:
>
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> before a BARRIER. That might be useful for time-critical WRITEs,
> and those issued by high I/O priority.
This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that. We really want "<these writes> must
occur before <this write>". That offers maximum choice to the I/O subsystem
and potentially to smart (virtual?) disks.
> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> only for data belonging to a particular file (e.g. fdatasync with
> no file size change, even on btrfs if O_DIRECT was used for the
> writes being committed). That would entail tagging FLUSHes and
> WRITEs with a fs-specific identifier (such as inode number), opaque
> to the scheduler which only checks equality.
This is closer. In userspace I'd be happy with "all prior writes to this
struct file before all future writes", even if the original guarantees were
stronger (i.e. on an inode basis). We currently implement transactions using
4 fsync/msync pairs:
    write_recovery_data(fd);
    fsync(fd);
    msync(mmap);
    write_recovery_header(fd);
    fsync(fd);
    msync(mmap);
    overwrite_with_new_data(fd);
    fsync(fd);
    msync(mmap);
    remove_recovery_header(fd);
    fsync(fd);
    msync(mmap);
Yet we really only need ordering, not guarantees about it actually hitting
disk before returning.
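For concreteness, here is a rough, self-contained version of that four-step sequence using only pwrite() and fsync(). The file layout (16 bytes of data, a one-byte recovery header, then the recovery copy) and the names `txn_overwrite` and `write_and_sync` are assumptions made up for this sketch, not TDB's actual format or API; msync() is left out because nothing here is mmapped.

```c
#define _XOPEN_SOURCE 700   /* for pread()/pwrite() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical layout: data, then a 1-byte header, then the recovery copy. */
enum { DATA_LEN = 16, DATA_OFF = 0, HDR_OFF = 16, REC_OFF = 17 };

/* Write len bytes at off, then force them to stable storage. */
static int write_and_sync(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fsync(fd);
}

/* Replace the first DATA_LEN bytes of path; returns 0 on success. */
static int txn_overwrite(const char *path, const char *new_data)
{
    char old[DATA_LEN];
    const char valid = 1, invalid = 0;
    int rc = -1;
    int fd;

    assert(path != NULL && new_data != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (pread(fd, old, DATA_LEN, DATA_OFF) != DATA_LEN)
        goto out;
    if (write_and_sync(fd, old, DATA_LEN, REC_OFF) != 0)        /* 1: recovery data */
        goto out;
    if (write_and_sync(fd, &valid, 1, HDR_OFF) != 0)            /* 2: recovery header */
        goto out;
    if (write_and_sync(fd, new_data, DATA_LEN, DATA_OFF) != 0)  /* 3: new data */
        goto out;
    rc = write_and_sync(fd, &invalid, 1, HDR_OFF);              /* 4: retire header */
out:
    close(fd);
    return rc;
}
```

Each fsync() here blocks until the data is durable, although all the transaction needs is that step N reaches the disk before step N+1.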
> In other words, FLUSH can be more relaxed than BARRIER inside the
> kernel. It's ironic that we think of fsync as stronger than
> fbarrier outside the kernel :-)
It's an implementation detail; barrier has less flexibility because it has
less information about what is required. I'm saying I want to give you as
much information as I can, even if you don't use it yet.
Thanks,
Rusty.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-05 4:58 ` Rusty Russell
(?)
@ 2010-05-05 6:03 ` Neil Brown
-1 siblings, 0 replies; 68+ messages in thread
From: Neil Brown @ 2010-05-05 6:03 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Jamie Lokier, qemu-devel,
virtualization, Jens Axboe, hch
On Wed, 5 May 2010 14:28:41 +0930
Rusty Russell <rusty@rustcorp.com.au> wrote:
> On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > Jens Axboe wrote:
> > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > usual I/O suspects...
> > >
> > > It would be nice to have a fuller API for this, but the reality is
> > > that only the flush approach is really workable. Even just strict
> > > ordering of requests could only be supported on SCSI, and even there the
> > > kernel still lacks proper guarantees on error handling to prevent
> > > reordering there.
> >
> > There's a few I/O scheduling differences that might be useful:
> >
> > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > before a BARRIER. That might be useful for time-critical WRITEs,
> > and those issued by high I/O priority.
>
> This is only because no one actually wants flushes or barriers, though
> I/O people seem to only offer that. We really want "<these writes> must
> occur before <this write>". That offers maximum choice to the I/O subsystem
> and potentially to smart (virtual?) disks.
>
> > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > only for data belonging to a particular file (e.g. fdatasync with
> > no file size change, even on btrfs if O_DIRECT was used for the
> > writes being committed). That would entail tagging FLUSHes and
> > WRITEs with a fs-specific identifier (such as inode number), opaque
> > to the scheduler which only checks equality.
>
> This is closer. In userspace I'd be happy with a "all prior writes to this
> struct file before all future writes". Even if the original guarantees were
> stronger (ie. inode basis). We currently implement transactions using 4 fsync
> /msync pairs.
>
> write_recovery_data(fd);
> fsync(fd);
> msync(mmap);
> write_recovery_header(fd);
> fsync(fd);
> msync(mmap);
> overwrite_with_new_data(fd);
> fsync(fd);
> msync(mmap);
> remove_recovery_header(fd);
> fsync(fd);
> msync(mmap);
Seems over-zealous.
If the recovery_header held a strong checksum of the recovery_data you would
not need the first fsync, and as long as you have two places to write recovery
data, you don't need the 3rd and 4th syncs.
Just:
    write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
    fsync / msync
    overwrite_with_new_data()
To recover, you choose the most recent log_space and replay the content.
That may be a redundant operation, but that is no loss.
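The recovery side of that scheme might look roughly like the following. Everything here is an assumption for illustration: the `struct log_slot` layout, the sequence number used to decide which slot is "most recent", and FNV-1a standing in for the strong checksum (a real log would want something stronger, such as a CRC or a cryptographic hash).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log slot; two of these alternate as "unused log space". */
struct log_slot {
    uint64_t seq;               /* increases with every transaction */
    uint32_t len;               /* payload bytes in use */
    uint32_t csum;              /* checksum over seq, len and payload */
    unsigned char payload[64];
};

/* FNV-1a, a stand-in for a genuinely strong checksum. */
static uint32_t fnv1a(uint32_t h, const void *buf, size_t n)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

static uint32_t slot_checksum(const struct log_slot *s)
{
    uint32_t h = 2166136261u;
    h = fnv1a(h, &s->seq, sizeof s->seq);
    h = fnv1a(h, &s->len, sizeof s->len);
    return fnv1a(h, s->payload, s->len);
}

/* Called by the writer once seq, len and payload are filled in. */
static void slot_seal(struct log_slot *s)
{
    assert(s->len <= sizeof s->payload);
    s->csum = slot_checksum(s);
}

static int slot_valid(const struct log_slot *s)
{
    return s->len <= sizeof s->payload && s->csum == slot_checksum(s);
}

/* Returns the newest intact slot to replay, or NULL if neither is intact. */
static const struct log_slot *pick_replay_slot(const struct log_slot *a,
                                               const struct log_slot *b)
{
    int va = slot_valid(a), vb = slot_valid(b);
    if (va && vb)
        return a->seq >= b->seq ? a : b;
    if (va)
        return a;
    return vb ? b : NULL;
}
```

A slot that was only partly written before a crash fails its checksum and is ignored, which is what removes the need for a separate sync between the recovery data and its header.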
Also, I cannot see the point of msync if you have already performed an fsync,
and if there is a point, I would expect you to call msync before
fsync... Maybe there is some subtlety there that I am not aware of.
>
> Yet we really only need ordering, not guarantees about it actually hitting
> disk before returning.
>
> > In other words, FLUSH can be more relaxed than BARRIER inside the
> > kernel. It's ironic that we think of fsync as stronger than
> > fbarrier outside the kernel :-)
>
> It's an implementation detail; barrier has less flexibility because it has
> less information about what is required. I'm saying I want to give you as
> much information as I can, even if you don't use it yet.

Only we know that approach doesn't work.
People will learn that they don't need to give the extra information to still
achieve the same result - just like they did with ext3 and fsync.
Then when we improve the implementation to only provide the guarantees that
you asked for, people will complain that they are getting empty files that
they didn't expect.

The abstraction I would like to see is a simple 'barrier' that contains no
data and has a filesystem-wide effect.

If a filesystem wanted a 'full' barrier such as the current BIO_RW_BARRIER,
it would send an empty barrier, then the data, then another empty barrier.
(However I suspect most filesystems don't really need barriers on both sides.)
A low level driver might merge these together if the underlying hardware
supported that combined operation (which I believe some do).
I think this merging would be less complex than the current need to split a
BIO_RW_BARRIER into the three separate operations when only a flush is
possible (I know it would make md code a lot nicer :-).

I would probably expose this to user-space as extra flags to sync_file_range:
   SYNC_FILE_RANGE_BARRIER_BEFORE
   SYNC_FILE_RANGE_BARRIER_AFTER

This would make it clear that a barrier does *not* imply a sync; it only
applies to data for which a sync has already been requested. So data that has
already been 'synced' is stored strictly before data which has not yet been
submitted with write() (or by changing a mmapped area).

The barrier would still be filesystem wide in that if you
SYNC_FILE_RANGE_WRITE one file, then SYNC_FILE_RANGE_BARRIER_BEFORE another
file on the same filesystem, the pages scheduled in the first file would be
affected by the barrier request on the second file.

Implementing this would probably require a new address_space_operation so
that the filesystem would have a chance to ensure all necessary writes were
queued before issuing the barrier.
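
To make the proposal concrete, here is a rough C sketch of how an application might call it. SYNC_FILE_RANGE_BARRIER_BEFORE exists only as a suggestion in this message, so the first branch is hypothetical and is compiled out; its argument order simply follows the existing sync_file_range() call. The function name `order_prior_writes` is likewise made up. The fallback branch shows what is available today: an optional SYNC_FILE_RANGE_WRITE hint followed by fdatasync(), which gives the ordering but also waits for the data, the very cost a barrier would avoid.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() in <fcntl.h> */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask that everything written to fd so far reach storage before anything
 * written afterwards. Returns 0 on success, -1 on error.
 */
int order_prior_writes(int fd)
{
#if defined(SYNC_FILE_RANGE_BARRIER_BEFORE)
    /* Hypothetical branch: the flag is only a proposal from this thread. */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
        return -1;
    return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_BARRIER_BEFORE);
#else
#ifdef SYNC_FILE_RANGE_WRITE
    /* Best-effort hint: start writeback of the whole file without waiting. */
    (void)sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#endif
    /* Stronger than needed: this waits until the data is on disk. */
    return fdatasync(fd);
#endif
}
```

A caller would write its recovery data, call `order_prior_writes(fd)`, and only then start overwriting live data.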
NeilBrown
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-05 6:03 ` Neil Brown
(?)
@ 2010-05-06 6:05 ` Rusty Russell
-1 siblings, 0 replies; 68+ messages in thread
From: Rusty Russell @ 2010-05-06 6:05 UTC (permalink / raw)
To: Neil Brown
Cc: tytso, kvm, Michael S. Tsirkin, Jamie Lokier, qemu-devel,
virtualization, Jens Axboe, hch
On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930
> Rusty Russell <rusty@rustcorp.com.au> wrote:
>
> > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > > usual I/O suspects...
> > > >
> > > > It would be nice to have a more fuller API for this, but the reality is
> > > > that only the flush approach is really workable. Even just strict
> > > > ordering of requests could only be supported on SCSI, and even there the
> > > > kernel still lacks proper guarantees on error handling to prevent
> > > > reordering there.
> > >
> > > There's a few I/O scheduling differences that might be useful:
> > >
> > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > > before a BARRIER. That might be useful for time-critical WRITEs,
> > > and those issued by high I/O priority.
> >
> > This is only because noone actually wants flushes or barriers, though
> > I/O people seem to only offer that. We really want "<these writes> must
> > occur before <this write>". That offers maximum choice to the I/O subsystem
> > and potentially to smart (virtual?) disks.
> >
> > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > > only for data belonging to a particular file (e.g. fdatasync with
> > > no file size change, even on btrfs if O_DIRECT was used for the
> > > writes being committed). That would entail tagging FLUSHes and
> > > WRITEs with a fs-specific identifier (such as inode number), opaque
> > > to the scheduler which only checks equality.
> >
> > This is closer. In userspace I'd be happy with a "all prior writes to this
> > struct file before all future writes". Even if the original guarantees were
> > stronger (ie. inode basis). We currently implement transactions using 4 fsync
> > /msync pairs.
> >
> > write_recovery_data(fd);
> > fsync(fd);
> > msync(mmap);
> > write_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
> > overwrite_with_new_data(fd);
> > fsync(fd);
> > msync(mmap);
> > remove_recovery_header(fd);
> > fsync(fd);
> > msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you would
> not need the first fsync, and as long as you have two places to write recovery
> data, you don't need the 3rd and 4th syncs.
> Just:
> write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> fsync / msync
> overwrite_with_new_data()
>
> To recover, you choose the most recent log_space and replay the content.
> That may be a redundant operation, but that is no loss.

I think you missed a checksum for the new data? Otherwise we can't tell if
the new data is completely written. But yes, I will steal this scheme for
TDB2, thanks!

In practice, it's the first sync which is glacial, the rest are pretty cheap.

> Also cannot see the point of msync if you have already performed an fsync,
> and if there is a point, I would expect you to call msync before
> fsync... Maybe there is some subtlety there that I am not aware of.

I assume it's this from the msync man page:

    msync() flushes changes made to the in-core copy of a file that was
    mapped into memory using mmap(2) back to disk. Without use of this
    call there is no guarantee that changes are written back before
    munmap(2) is called.

> > It's an implementation detail; barrier has less flexibility because it has
> > less information about what is required. I'm saying I want to give you as
> > much information as I can, even if you don't use it yet.
>
> Only we know that approach doesn't work.
> People will learn that they don't need to give the extra information to still
> achieve the same result - just like they did with ext3 and fsync.
> Then when we improve the implementation to only provide the guarantees that
> you asked for, people will complain that they are getting empty files that
> they didn't expect.

I think that's an oversimplification: IIUC that occurred to people *not*
using fsync(). They weren't using it because it was too slow. Providing
a primitive which is as fast or faster and more specific doesn't have the
same magnitude of social issues.

And we can't write userspace interfaces for idiots only.

> The abstraction I would like to see is a simple 'barrier' that contains no
> data and has a filesystem-wide effect.

I think you lack ambition ;)

Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
suboptimal for md? Since you have to hand your barrier to every device
whereas a file-wide primitive may theoretically only go to some.

Cheers,
Rusty.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-06 6:05 ` Rusty Russell
@ 2010-05-06 14:57 ` Jamie Lokier
-1 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2010-05-06 14:57 UTC (permalink / raw)
To: Rusty Russell
Cc: Neil Brown, Jens Axboe, tytso, kvm, Michael S. Tsirkin,
qemu-devel, virtualization, hch
Rusty Russell wrote:
> > Seems over-zealous.
> > If the recovery_header held a strong checksum of the recovery_data you would
> > not need the first fsync, and as long as you have two places to write recovery
> > data, you don't need the 3rd and 4th syncs.
> > Just:
> > write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> > fsync / msync
> > overwrite_with_new_data()
> >
> > To recover, you choose the most recent log_space and replay the content.
> > That may be a redundant operation, but that is no loss.
>
> I think you missed a checksum for the new data? Otherwise we can't tell if
> the new data is completely written.

The data checksum can go in the recovery-data block. If there's
enough slack in the log, by the time that recovery-data block is
overwritten, you can be sure that an fsync has been done for that
data (by a later commit).

> But yes, I will steal this scheme for TDB2, thanks!

Take a look at the filesystems. I think ext4 did some optimisations
in this area, and that checksums had to be added anyway due to a
subtle replay-corruption problem that happens when the log is
partially corrupted, and followed by non-corrupt blocks.

Also, you can remove even more fsyncs by adding a bit of slack to the
data space and writing into unused/fresh areas some of the time -
i.e. a bit like btrfs/zfs or anything log-structured, but you don't
have to go all the way with that.

> In practice, it's the first sync which is glacial, the rest are pretty cheap.

The 3rd and 4th fsyncs imply a disk seek each, just because the
preceding writes are to different areas of the disk. Seeks are quite
slow - but not as slow as ext3 fsyncs :-) What do you mean by cheap?
That it's only a couple of seeks, or that you don't see even that?
>
> > Also cannot see the point of msync if you have already performed an fsync,
> > and if there is a point, I would expect you to call msync before
> > fsync... Maybe there is some subtlety there that I am not aware of.
>
> I assume it's this from the msync man page:
>
>     msync() flushes changes made to the in-core copy of a file that was
>     mapped into memory using mmap(2) back to disk. Without use of this
>     call there is no guarantee that changes are written back before
>     munmap(2) is called.

Historically, that means msync() ensures dirty mapping data is written
to the file as if with write(), and that mapping pages are removed or
refreshed to get the effect of read() (possibly a lazy one). It's
more obvious in the early mmap implementations where mappings don't
share pages with the filesystem cache, so msync() has explicit
behaviour.

Like with write(), after calling msync() you would then call fsync()
to ensure the data is flushed to disk.
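
The ordering described above (msync() first, then fsync()) can be sketched as follows. This is an illustrative C fragment, not code from TDB or Samba; the function name `update_via_mmap` and its parameters are assumptions, and the file is expected to be at least `map_len` bytes long already.

```c
#define _XOPEN_SOURCE 700       /* mmap(), msync(), fsync() */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Change n bytes at offset off through a shared mapping of the first
 * map_len bytes of the file, then make the change durable.
 * Returns 0 on success, -1 on error.
 */
int update_via_mmap(const char *path, size_t map_len, size_t off,
                    const void *src, size_t n)
{
    unsigned char *map;
    int fd, rc = -1;

    if (off > map_len || n > map_len - off)
        return -1;              /* update would fall outside the mapping */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        memcpy(map + off, src, n);
        /* Step 1: hand the dirty mapping to the file, as write() would. */
        /* Step 2: ask for the file itself to be flushed to disk. */
        if (msync(map, map_len, MS_SYNC) == 0 && fsync(fd) == 0)
            rc = 0;
        munmap(map, map_len);
    }
    close(fd);
    return rc;
}
```

Calling fsync() before msync(), as in the four-step sequence quoted earlier in the thread, would leave the mapped changes outside what the fsync() was asked to flush on any system where the two are not unified.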

If you've been calling fsync then msync, I guess that's another fine
example of how these functions are so hard to test, that they aren't.

Historically on Linux, msync has been iffy on some architectures, and
I'm still not sure it has the same semantics as other unixes. fsync
as we know has also been iffy, and even now that fsync is tidier it
does not always issue a hardware-level cache commit.

But then historically writable mmap has been iffy on a boatload of
unixes.

> > > It's an implementation detail; barrier has less flexibility because it has
> > > less information about what is required. I'm saying I want to give you as
> > > much information as I can, even if you don't use it yet.
> >
> > Only we know that approach doesn't work.
> > People will learn that they don't need to give the extra information to still
> > achieve the same result - just like they did with ext3 and fsync.
> > Then when we improve the implementation to only provide the guarantees that
> > you asked for, people will complain that they are getting empty files that
> > they didn't expect.
>
> I think that's an oversimplification: IIUC that occurred to people *not*
> using fsync(). They weren't using it because it was too slow. Providing
> a primitive which is as fast or faster and more specific doesn't have the
> same magnitude of social issues.
I agree with Rusty. Let's make it perform well so there is no reason
to deliberately avoid using it, and let's make it say what apps
actually want to request without being way too strong.

And please, if anyone has ideas on how we could make correct use of
these functions *testable* by app authors, I'm all ears. Right now it
is quite difficult - pulling power on hard disks mid-transaction is
not a convenient method :)
> > The abstraction I would like to see is a simple 'barrier' that contains no
> > data and has a filesystem-wide effect.
>
> I think you lack ambition ;)
>
> Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
> suboptimal for md? Since you have to hand your barrier to every device
> whereas a file-wide primitive may theoretically only go to some.
Yes.

Note that database-like programs still need fsync-like behaviour
*sometimes*: The "D" in ACID depends on it, and the "C" in ACID also
depends on it where multiple files are involved which must contain
consistent data with each other after crash/recovery (Perhaps Samba
depends on this?)

Single-file sync is valuable just like single-file barrier, and so is
the combination.

Since you mentioned ambition, think about multi-file updates. They're
analogous in userspace to MD's barrier/sync requirements in
kernelspace.

One API that supports multi-file update barriers is "long aio-fsync":
Something which returns when the data in earlier writes (to one file)
is committed, but does not force the commit to happen more quickly
than normal. Both single-file barriers (like you want for TDB) and
multi-file barriers can be implemented on top of that, but it's much
more difficult to use than an fbarrier() syscall, which is only
suitable for single-file. But I wonder if there would be many users
of fbarrier() who aren't perfectly capable of using something else if
needed.
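
For illustration only: fbarrier() is a name from this discussion, not an
existing syscall. The sketch below shows how an application might use such
a primitive today, assuming the only portable stand-in is fdatasync(),
which provides the ordering but also waits for durability. The names
fbarrier_fallback() and fbarrier_demo() are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical barrier: all earlier writes on fd before all later ones.
 * Emulated with fdatasync(), which is stronger (and slower) than needed. */
static int fbarrier_fallback(int fd)
{
    assert(fd >= 0);
    return fdatasync(fd);
}

/* Ordered pair of writes on a temporary file; returns 0 on success. */
static int fbarrier_demo(void)
{
    char path[] = "/tmp/fbarrier_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    int rc = -1;
    if (pwrite(fd, "payload", 7, 0) == 7 &&   /* must land first  */
        fbarrier_fallback(fd) == 0 &&         /* ordering point   */
        pwrite(fd, "C", 1, 7) == 1)           /* commit mark last */
        rc = 0;
    close(fd);
    unlink(path);
    return rc;
}
```

A real fbarrier() could return immediately and leave the ordering to the
kernel; the fallback blocks until the payload is on disk.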
-- Jamie
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-05 4:58 ` Rusty Russell
` (2 preceding siblings ...)
(?)
@ 2010-05-06 15:25 ` Jamie Lokier
-1 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2010-05-06 15:25 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Jens Axboe, hch
Rusty Russell wrote:
> On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > Jens Axboe wrote:
> > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > usual I/O suspects...
> > >
> > > It would be nice to have a fuller API for this, but the reality is
> > > that only the flush approach is really workable. Even just strict
> > > ordering of requests could only be supported on SCSI, and even there the
> > > kernel still lacks proper guarantees on error handling to prevent
> > > reordering there.
> >
> > There's a few I/O scheduling differences that might be useful:
> >
> > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > before a BARRIER. That might be useful for time-critical WRITEs,
> > and those issued by high I/O priority.
>
> This is only because no one actually wants flushes or barriers, though
> I/O people seem to only offer that. We really want "<these writes> must
> occur before <this write>". That offers maximum choice to the I/O subsystem
> and potentially to smart (virtual?) disks.
We do want flushes for the "D" in ACID: for example, after receiving
a mail or a blog update into a database file (could be TDB) and
confirming that to the sender, you want high confidence that the
update won't disappear on system crash or power failure.

Less obviously, it's also needed for the "C" in ACID when more than
one file is involved. "C" is about differently updated things staying
consistent with each other.

For example, imagine you have a TDB file mapping Samba usernames to
passwords, and another mapping Samba usernames to local usernames. (I
don't know if you do this; it's just an illustration).

To rename a Samba user involves updating both. Let's ignore transient
transactional issues :-) and just think about what happens with
per-file barriers and no sync, when a crash happens long after the
updates, and before the system has written out all data and issued
low-level cache flushes.

After restarting, due to lack of sync, the Samba username could be
present in one file and not the other.
> > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > only for data belonging to a particular file (e.g. fdatasync with
> > no file size change, even on btrfs if O_DIRECT was used for the
> > writes being committed). That would entail tagging FLUSHes and
> > WRITEs with a fs-specific identifier (such as inode number), opaque
> > to the scheduler which only checks equality.
>
> This is closer. In userspace I'd be happy with a "all prior writes to this
> struct file before all future writes". Even if the original guarantees were
> stronger (ie. inode basis). We currently implement transactions using 4 fsync
> /msync pairs.
>
>     write_recovery_data(fd);
>     fsync(fd);
>     msync(mmap);
>     write_recovery_header(fd);
>     fsync(fd);
>     msync(mmap);
>     overwrite_with_new_data(fd);
>     fsync(fd);
>     msync(mmap);
>     remove_recovery_header(fd);
>     fsync(fd);
>     msync(mmap);
>
> Yet we really only need ordering, not guarantees about it actually hitting
> disk before returning.
>
> > In other words, FLUSH can be more relaxed than BARRIER inside the
> > kernel. It's ironic that we think of fsync as stronger than
> > fbarrier outside the kernel :-)
>
> It's an implementation detail; barrier has less flexibility because it has
> less information about what is required. I'm saying I want to give you as
> much information as I can, even if you don't use it yet.
I agree, and I've started a few threads about it over the last couple of years.

An fsync_range() system call would be very easy to use and
(most importantly) easy to understand.

With optional flags to weaken it (into fdatasync, barrier without sync,
sync without barrier, one-sided barrier, no low-level cache-flush, don't rush,
etc.), it would be very versatile, and still easy to understand.

With an AIO version, and another flag meaning don't rush, just return
when satisfied, I suspect it would be useful for the most
demanding I/O apps.
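
To make the shape of such a call concrete, here is a rough sketch. It is
an assumption throughout: the flag names, the signature and the helper
fsr_demo() are invented for this illustration, and the body can only fall
back to whole-file fsync()/fdatasync(), so the range and the "don't rush"
hint are accepted but ignored:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical weakening flags; none of these exist in any libc. */
enum fsr_flags {
    FSR_DATA_ONLY = 1 << 0,  /* behave like fdatasync()            */
    FSR_DONT_RUSH = 1 << 1   /* hint only; ignored by the fallback */
};

/* Sketch of an fsync_range()-style call built on existing primitives. */
static int fsync_range_sketch(int fd, unsigned flags, off_t offset, off_t length)
{
    assert(offset >= 0 && length >= 0);
    (void)offset;            /* the fallback cannot restrict the range */
    (void)length;
    if (flags & FSR_DATA_ONLY)
        return fdatasync(fd);
    return fsync(fd);
}

/* Exercise both paths on a temporary file; returns 0 on success. */
static int fsr_demo(void)
{
    char path[] = "/tmp/fsr_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    int rc = -1;
    if (pwrite(fd, "abcdef", 6, 0) == 6 &&
        fsync_range_sketch(fd, FSR_DATA_ONLY, 0, 3) == 0 &&
        fsync_range_sketch(fd, FSR_DONT_RUSH, 3, 3) == 0)
        rc = 0;
    close(fd);
    unlink(path);
    return rc;
}
```

A kernel implementation could honour the range and the weaker flags; the
point here is only what the caller would have to write.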
-- Jamie
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 20:17 ` Jamie Lokier
(?)
(?)
@ 2010-05-05 4:58 ` Rusty Russell
-1 siblings, 0 replies; 68+ messages in thread
From: Rusty Russell @ 2010-05-05 4:58 UTC (permalink / raw)
To: Jamie Lokier
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Jens Axboe, hch
On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 04 2010, Rusty Russell wrote:
> > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > usual I/O suspects...
> >
> > It would be nice to have a fuller API for this, but the reality is
> > that only the flush approach is really workable. Even just strict
> > ordering of requests could only be supported on SCSI, and even there the
> > kernel still lacks proper guarantees on error handling to prevent
> > reordering there.
>
> There's a few I/O scheduling differences that might be useful:
>
> 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> before a BARRIER. That might be useful for time-critical WRITEs,
> and those issued by high I/O priority.
This is only because no one actually wants flushes or barriers, though
I/O people seem to only offer that. We really want "<these writes> must
occur before <this write>". That offers maximum choice to the I/O subsystem
and potentially to smart (virtual?) disks.
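
As a toy model of "<these writes> must occur before <this write>", a
request could simply carry pointers to the requests it depends on. This is
a sketch under assumptions, not an existing kernel or virtio interface;
struct io_req, io_add_dep() and io_can_issue() are names made up here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

/* One write request plus the earlier requests it must not overtake. */
struct io_req {
    bool done;                           /* has this request completed? */
    size_t ndeps;
    const struct io_req *deps[MAX_DEPS];
};

/* Record that `before` must complete before `req` may be issued. */
static bool io_add_dep(struct io_req *req, const struct io_req *before)
{
    assert(req != NULL && before != NULL);
    if (req->ndeps == MAX_DEPS)
        return false;
    req->deps[req->ndeps++] = before;
    return true;
}

/* A request may be issued once every dependency has completed. */
static bool io_can_issue(const struct io_req *req)
{
    for (size_t i = 0; i < req->ndeps; i++)
        if (!req->deps[i]->done)
            return false;
    return true;
}
```

Requests with no dependency between them stay freely reorderable, which is
the flexibility a device-wide barrier takes away.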
> 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> only for data belonging to a particular file (e.g. fdatasync with
> no file size change, even on btrfs if O_DIRECT was used for the
> writes being committed). That would entail tagging FLUSHes and
> WRITEs with a fs-specific identifier (such as inode number), opaque
> to the scheduler which only checks equality.
This is closer. In userspace I'd be happy with an "all prior writes to this
struct file before all future writes" guarantee, even if the original
guarantees were stronger (i.e. on an inode basis). We currently implement
transactions using 4 fsync/msync pairs.
    write_recovery_data(fd);
    fsync(fd);
    msync(mmap);
    write_recovery_header(fd);
    fsync(fd);
    msync(mmap);
    overwrite_with_new_data(fd);
    fsync(fd);
    msync(mmap);
    remove_recovery_header(fd);
    fsync(fd);
    msync(mmap);
Yet we really only need ordering, not guarantees about it actually hitting
disk before returning.
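
For readers who want to try the sequence, a self-contained sketch follows.
It is not TDB's code: the file layout (HDR_OFF, LOG_OFF, DATA_OFF), the
record size and all function names are assumptions, and it uses pwrite()
instead of a mapping, so the msync() calls drop out:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout: 1-byte header flag, old-record log, live record. */
enum { HDR_OFF = 0, LOG_OFF = 64, DATA_OFF = 128, REC_LEN = 16 };

/* Write exactly len bytes at off, then wait for them to reach the disk. */
static int put_sync(int fd, const void *buf, size_t len, off_t off)
{
    assert(len > 0);
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fsync(fd);
}

/* Replace the record at DATA_OFF using the four-sync sequence. */
static int txn_commit(int fd, const char new_rec[REC_LEN])
{
    char old[REC_LEN] = {0};
    const char valid = 1, invalid = 0;

    if (pread(fd, old, REC_LEN, DATA_OFF) < 0)
        return -1;
    if (put_sync(fd, old, REC_LEN, LOG_OFF))       /* 1: recovery data   */
        return -1;
    if (put_sync(fd, &valid, 1, HDR_OFF))          /* 2: recovery header */
        return -1;
    if (put_sync(fd, new_rec, REC_LEN, DATA_OFF))  /* 3: new data        */
        return -1;
    return put_sync(fd, &invalid, 1, HDR_OFF);     /* 4: drop header     */
}

/* Run one transaction on a temporary file; returns 0 on success. */
static int txn_demo(void)
{
    char path[] = "/tmp/txn_demo_XXXXXX";
    const char rec[REC_LEN] = "new-record-0001";
    char back[REC_LEN], hdr = 1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    int rc = txn_commit(fd, rec);
    if (rc == 0 &&
        (pread(fd, back, REC_LEN, DATA_OFF) != REC_LEN ||
         memcmp(back, rec, REC_LEN) != 0 ||
         pread(fd, &hdr, 1, HDR_OFF) != 1 || hdr != 0))
        rc = -1;
    close(fd);
    unlink(path);
    return rc;
}
```

Each put_sync() blocks until the data is durable, although the first three
only need to be ordered relative to the step that follows them.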
> In other words, FLUSH can be more relaxed than BARRIER inside the
> kernel. It's ironic that we think of fsync as stronger than
> fbarrier outside the kernel :-)
It's an implementation detail; barrier has less flexibility because it has
less information about what is required. I'm saying I want to give you as
much information as I can, even if you don't use it yet.
Thanks,
Rusty.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 8:41 ` [Qemu-devel] " Jens Axboe
(?)
(?)
@ 2010-05-04 20:17 ` Jamie Lokier
-1 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2010-05-04 20:17 UTC (permalink / raw)
To: Jens Axboe
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, hch
Jens Axboe wrote:
> On Tue, May 04 2010, Rusty Russell wrote:
> > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > usual I/O suspects...
>
> It would be nice to have a more fuller API for this, but the reality is
> that only the flush approach is really workable. Even just strict
> ordering of requests could only be supported on SCSI, and even there the
> kernel still lacks proper guarantees on error handling to prevent
> reordering there.
There are a few I/O scheduling differences that might be useful:
1. The I/O scheduler could freely move WRITEs before a FLUSH but not
before a BARRIER. That might be useful for time-critical WRITEs,
and those issued by high I/O priority.
2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
only for data belonging to a particular file (e.g. fdatasync with
no file size change, even on btrfs if O_DIRECT was used for the
writes being committed). That would entail tagging FLUSHes and
WRITEs with a fs-specific identifier (such as inode number), opaque
to the scheduler which only checks equality.
3. By delaying FLUSHes through reordering as above, the I/O scheduler
could merge multiple FLUSHes into a single command.
4. On MD/RAID, BARRIER requires every backing device to quiesce before
sending the low-level cache-flush, and all of those to finish
before resuming each backing device. FLUSH doesn't require as much
synchronising. (With per-file FLUSH, see 2, it could even avoid
FLUSH altogether to some backing devices for small files.)
In other words, FLUSH can be more relaxed than BARRIER inside the
kernel. It's ironic that we think of fsync as stronger than
fbarrier outside the kernel :-)
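As a rough illustration of points 1 and 2, the reordering rules could be
written as two predicates. This is a sketch under assumptions, not how any
existing I/O scheduler is written: struct io_req, its fields and both
function names are invented here, and the tag stands for the opaque
fs-specific identifier from point 2, compared only for equality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum req_kind { REQ_WRITE, REQ_FLUSH, REQ_BARRIER };

struct io_req {
    enum req_kind kind;
    bool tagged;    /* false: the request is not tied to one file */
    uint64_t tag;   /* opaque identifier (e.g. inode number), equality only */
};

/* Point 1: may WRITE w, queued after fence f, be dispatched ahead of f? */
bool write_may_move_before(const struct io_req *w, const struct io_req *f)
{
    assert(w->kind == REQ_WRITE);
    /* Allowed across a FLUSH, never across a BARRIER. */
    return f->kind == REQ_FLUSH;
}

/* Point 2: may WRITE w, queued before fence f, be dispatched after f? */
bool write_may_move_after(const struct io_req *w, const struct io_req *f)
{
    assert(w->kind == REQ_WRITE);
    if (f->kind != REQ_FLUSH)
        return false;           /* a BARRIER orders everything */
    if (!f->tagged || !w->tagged)
        return false;           /* unknown scope: stay conservative */
    return w->tag != f->tag;    /* the FLUSH only covers its own file */
}
```

An untagged FLUSH is treated as covering the whole device, so only a WRITE
that provably belongs to a different file is allowed to slip past it.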
-- Jamie
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
` (5 preceding siblings ...)
(?)
@ 2010-05-04 10:05 ` Christoph Hellwig
-1 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2010-05-04 10:05 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Anthony Liguori, Jens Axboe, hch
On Tue, May 04, 2010 at 02:08:24PM +0930, Rusty Russell wrote:
> On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> > I took a stub at documenting CMD and FLUSH request types in virtio
> > block. Christoph, could you look over this please?
> >
> > I note that the interface seems full of warts to me,
> > this might be a first step to cleaning them.
>
> ISTR Christoph had withdrawn some patches in this area, and was waiting
> for him to resubmit?
Any patches I've withdrawn in this area are withdrawn for good. But
what I really need to do is to review Michael's spec updates, sorry.
I'll get back to it today.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
2010-05-04 4:38 ` [Qemu-devel] " Rusty Russell
` (7 preceding siblings ...)
(?)
@ 2010-05-04 20:32 ` Jamie Lokier
-1 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2010-05-04 20:32 UTC (permalink / raw)
To: Rusty Russell
Cc: tytso, kvm, Michael S. Tsirkin, Neil Brown, qemu-devel,
virtualization, Jens Axboe, hch
Rusty Russell wrote:
> On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> > I took a stub at documenting CMD and FLUSH request types in virtio
> > block. Christoph, could you look over this please?
> >
> > I note that the interface seems full of warts to me,
> > this might be a first step to cleaning them.
>
> ISTR Christoph had withdrawn some patches in this area, and was waiting
> for him to resubmit?
>
> I've given up on figuring out the block device. What seem to me to be sane
> semantics along the lines of memory barriers are foreign to disk people: they
> want (and depend on) flushing everywhere.
>
> For example, tdb transactions do not require a flush, they only require what
> I would call a barrier: that prior data be written out before any future data.
> Surely that would be more efficient in general than a flush! In fact, TDB
> wants only writes to *that file* (and metadata) written out first; it has no
> ordering issues with other I/O on the same device.
I've just posted elsewhere on this thread that an I/O level flush can
be more efficient than an I/O level barrier (implemented using a
cache-flush really), because the barrier has stricter ordering
requirements at the I/O scheduling level.
By the time you work up to tdb, another way to think of it is
distinguishing "eager fsync" from "fsync but I'm not in a hurry -
delay as long as is convenient". The latter makes much more sense
with AIO.
> A generic I/O interface would allow you to specify "this request
> depends on these outstanding requests" and leave it at that. It
> might have some sync flush command for dumb applications and OSes.
For filesystems, it would probably be easy to label in-place
overwrites and fdatasync data flushes when there's no file extension
with an opaque per-file identifier for certain operations. Typically
over-writing in place and fdatasync would match up and wouldn't need
ordering against anything else. Other operations would tend to get
labelled as ordered against everything including these.
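The "this request depends on these outstanding requests" interface quoted
above could be modelled along these lines. It is purely a sketch: struct
dep_req, MAX_DEPS and both functions are hypothetical names, not an existing
kernel or virtio interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4  /* arbitrary limit chosen for the sketch */

struct dep_req {
    bool done;                              /* set once the request completes */
    size_t ndeps;                           /* number of entries used in deps */
    const struct dep_req *deps[MAX_DEPS];   /* requests that must finish first */
};

/* Record that r must not be issued until `on` has completed. */
bool dep_req_add(struct dep_req *r, const struct dep_req *on)
{
    assert(r != NULL && on != NULL);
    if (r->ndeps == MAX_DEPS)
        return false;
    r->deps[r->ndeps++] = on;
    return true;
}

/* A request may be issued once everything it depends on is done. */
bool dep_req_ready(const struct dep_req *r)
{
    for (size_t i = 0; i < r->ndeps; i++) {
        if (!r->deps[i]->done)
            return false;
    }
    return true;
}
```

With this shape, a request that names no dependencies is free to be reordered
against everything else, and "ordered against everything" is just the case
where a request lists every outstanding request.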
-- Jamie
^ permalink raw reply [flat|nested] 68+ messages in thread