* [Qemu-devel] Caching modes
From: Anthony Liguori @ 2010-09-20 16:53 UTC
To: qemu-devel, Kevin Wolf, Christoph Hellwig
Moving to a separate thread since this has come up a few times and I
think we need to discuss the assumptions a bit more.
This is how I understand the caching modes should behave and what
guarantees a guest gets.
cache=none
All read and write requests SHOULD avoid any type of caching in the
host. Any write request MUST complete after the next level of storage
reports that the write request has completed. A flush from the guest
MUST complete after all pending I/O requests for the guest have been
completed.
As an implementation detail, with the raw format, these guarantees are
only in place for preallocated images. Sparse images do not provide as
strong a guarantee.
cache=writethrough
All read and write requests MAY be cached by the host. Read requests
MAY be satisfied by cached data in the host. Any write request MUST
complete after the next level of storage reports that the write request
has completed. A flush from the guest MUST complete after all pending
I/O requests for the guest have been completed.
As an implementation detail, with the raw format, these guarantees also
apply for sparse images. In the future, we could relax this such that
sparse images did not provide as strong a guarantee.
cache=writeback
All read and write requests MAY be cached by the host. Read and write
requests MAY be completed entirely within the cache. A write request
MAY complete before the next level of storage reports that the write
request has completed. A flush from the guest MUST complete after all
pending I/O requests for the guest have been completed and acknowledged
by the next level of the storage hierarchy.
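[A minimal sketch of what the flush guarantee above means for a POSIX host
backend; this is illustrative only, not QEMU's implementation, and the
function name is made up.]

#include <unistd.h>

/* Illustrative only: completing a guest FLUSH under cache=writeback.
 * The flush may only be acknowledged once all pending writes have
 * completed and the written data has reached stable storage. */
static int handle_guest_flush(int image_fd)
{
    /* 1. (elided) wait for all in-flight guest write requests */

    /* 2. push the host page cache -- and, via the kernel, the disk's
     *    volatile write cache -- out to stable storage */
    if (fdatasync(image_fd) < 0)
        return -1;

    /* 3. only now report completion of the flush to the guest */
    return 0;
}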
Guest disk cache.
For all devices that support it, the exposed cache attribute should be
independent of the host caching mode. Here are the correct usages of the
host caching mode for each exposed disk cache setting:
Writethrough disk cache: cache=none|writethrough if the disk cache is
set to writethrough or the disk is considered "enterprise class" and has
a battery backup; cache=writeback IFF the host is backed by an UPS.
Writeback disk cache: cache=none|writethrough if the disk cache is set
to writeback and the disk is not enterprise class; cache=writeback if
the host is not backed by an UPS.
Regards,
Anthony Liguori
* Re: [Qemu-devel] Caching modes
From: Blue Swirl @ 2010-09-20 18:37 UTC
To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Christoph Hellwig
On Mon, Sep 20, 2010 at 4:53 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> Moving to a separate thread since this has come up a few times and I think
> we need to discuss the assumptions a bit more.
>
> This is how I understand the caching modes should behave and what guarantees
> a guest gets.
>
> cache=none
>
> All read and write requests SHOULD avoid any type of caching in the host.
> Any write request MUST complete after the next level of storage reports
> that the write request has completed. A flush from the guest MUST complete
> after all pending I/O requests for the guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees are only
> in place for preallocated images. Sparse images do not provide as strong of
> a guarantee.
>
> cache=writethrough
>
> All read and write requests MAY be cached by the host. Read requests MAY be
> satisfied by cached data in the host. Any write request MUST complete after
> the next level of storage reports that the write request has completed. A
> flush from the guest MUST complete after all pending I/O requests for the
> guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees also
> apply for sparse images. In the future, we could relax this such that
> sparse images did not provide as strong of a guarantee.
>
> cache=writeback
>
> All read and writes requests MAY be cached by the host. Read and write
> requests may be completed entirely within the cache. A write request MAY
> complete before the next level of storage reports that the write request has
> completed. A flush from the guest MUST complete after all pending I/O
> requests for the guest have been completed and acknowledged by the next
> level of the storage hierarchy.
It would be nice to have an additional mode, like cache=always, where
even flushes MAY be ignored. This would max out performance.
>
> Guest disk cache.
>
> For all devices that support it, the exposed cache attribute should be
> independent of the host caching mode. Here are correct usages of disk
> caching mode:
>
> Writethrough disk cache; cache=none|writethrough if the disk cache is set to
> writethrough or the disk is considered "enterprise class" and has a battery
> backup. cache=writeback IFF the host is backed by an UPS.
The "enterprise class" disks, battery backups and UPS devices are not
consumer equipment. Wouldn't this mean that any private QEMU user
would need to use cache=none?
As an example, what is the correct usage for a laptop user, considering
that there is a battery, but it can also drain, and the drain rate
depends on flush frequency?
* Re: [Qemu-devel] Caching modes
From: Anthony Liguori @ 2010-09-20 18:51 UTC
To: Blue Swirl; +Cc: Kevin Wolf, qemu-devel, Christoph Hellwig
On 09/20/2010 01:37 PM, Blue Swirl wrote:
>
> It would be nice to have additional mode, like cache=always, where
> even flushes MAY be ignored. This would max out the performance.
>
That's cache=unsafe and we have it. I ignored it for the purposes of
this discussion.
>> Guest disk cache.
>>
>> For all devices that support it, the exposed cache attribute should be
>> independent of the host caching mode. Here are correct usages of disk
>> caching mode:
>>
>> Writethrough disk cache; cache=none|writethrough if the disk cache is set to
>> writethrough or the disk is considered "enterprise class" and has a battery
>> backup. cache=writeback IFF the host is backed by an UPS.
>>
> The "enterprise class" disks, battery backups and UPS devices are not
> consumer equipment. Wouldn't this mean that any private QEMU user
> would need to use cache=none?
>
No, cache=writethrough and cache=none should be equivalent from a data
integrity/data loss perspective. Using cache=writeback without
enterprise storage is risky but practically speaking, most consumer
storage is not battery backed and uses writeback caching anyway so there
is already risk.
> As an example, what is the correct usage for laptop user, considering
> that there is a battery, but it can also drain and the drainage is
> dependent on flush frequency?
>
Minus cache=unsafe, you'll never get data corruption. The only
consideration is how much data loss can occur since the last flush.
Well-behaved applications always flush important data to avoid losing
anything important but practically speaking, the world isn't full of
well-behaved applications.
The only difference between cache=writeback and a normal disk's
writeback cache is that cache=writeback can be a very, very large cache
that isn't frequently flushed. So the amount of data loss can be much
higher than expected.
For most laptop users, cache=none or cache=writethrough is appropriate.
For a developer, cache=writeback probably is reasonable.
Regards,
Anthony Liguori
* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-20 19:34 UTC
To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Christoph Hellwig
On Mon, Sep 20, 2010 at 11:53:02AM -0500, Anthony Liguori wrote:
> cache=none
>
> All read and write requests SHOULD avoid any type of caching in the
> host. Any write request MUST complete after the next level of storage
> reports that the write request has completed. A flush from the guest
> MUST complete after all pending I/O requests for the guest have been
> completed.
>
> As an implementation detail, with the raw format, these guarantees are
> only in place for preallocated images. Sparse images do not provide as
> strong of a guarantee.
That's not how cache=none ever worked nor works currently.
But discussing the current cache modes is rather moot, as they try to
map multi-dimensional behaviour differences onto a single option. I have
some patches that I need to finish up a bit more that will give you
your no-caching mode, but I don't think mapping cache=none to it
will do anyone a favour.
With the split between the guest visible write-cache-enable (WCE) flag, and
the host-specific "use O_DIRECT" and "ignore cache flushes" flags we'll
get the following modes:
                      | WC enable | WC disable
----------------------+-----------+-----------
direct                |           |
buffer                |           |
buffer + ignore flush |           |
currently we only have:
cache=none direct + WC enable
cache=writeback buffer + WC enable
cache=writethrough buffer + WC disable
cache=unsafe buffer + ignore flush + WC enable
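[For reference, the same mapping written down as a data structure; this is
an illustrative sketch, not QEMU's option-parsing code. The O_DSYNC entry
for the WC-disable case anticipates Christoph's answer further down in the
thread.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>

/* Illustrative: today's cache= options expressed as the three
 * independent knobs described above. */
struct cache_mode {
    const char *name;
    int         extra_open_flags;  /* OR'd into the host open(2) flags */
    bool        write_cache;       /* WC enable bit the guest sees */
    bool        ignore_flush;      /* drop guest cache flush requests */
};

static const struct cache_mode current_modes[] = {
    { "none",         O_DIRECT, true,  false },
    { "writeback",    0,        true,  false },
    { "writethrough", O_DSYNC,  false, false },
    { "unsafe",       0,        true,  true  },
};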
Splitting these up is important because we want to migrate between
hosts that can support direct I/O or not without requiring guest-visible
state changes, and also because we want to use direct I/O with guests
that were installed using cache=unsafe without stopping the guest.
It also allows the guest to change the WC enable/disable flag, which
it can do for real IDE/SCSI hardware. And it allows Anthony's beloved
no-caching-at-all mode, which actually is useful for guests that cannot
deal with volatile write caches.
* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-20 20:11 UTC
To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel
On 09/20/2010 02:34 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 11:53:02AM -0500, Anthony Liguori wrote:
>
>> cache=none
>>
>> All read and write requests SHOULD avoid any type of caching in the
>> host. Any write request MUST complete after the next level of storage
>> reports that the write request has completed. A flush from the guest
>> MUST complete after all pending I/O requests for the guest have been
>> completed.
>>
>> As an implementation detail, with the raw format, these guarantees are
>> only in place for preallocated images. Sparse images do not provide as
>> strong of a guarantee.
>>
> That's not how cache=none ever worked nor works currently.
>
How does it work today compared to what I wrote above?
> But discussion the current cache modes is rather mood as they try to
> map multi-dimension behaviour difference into a single options. I have
> some patches that I need to finish up a bit more that will give you
> your no caching enabled mode, but I don't think mapping cache=none to it
> will do anyone a favour.
>
> With the split between the guest visible write-cache-enable (WCE) flag, and
> the host-specific "use O_DIRECT" and "ignore cache flushes" flags we'll
> get the following modes:
>
>
> | WC enable | WC disable
> -----------------------------------------------
> direct | |
> buffer | |
> buffer + ignore flush | |
>
> currently we only have:
>
> cache=none direct + WC enable
> cache=writeback buffer + WC enable
> cache=writethrough buffer + WC disable
> cache=unsafe buffer + ignore flush + WC enable
>
Where does O_DSYNC fit into this chart?
Do all modern filesystems implement O_DSYNC without generating
additional barriers per request?
Having a barrier per write request is ultimately not the right semantic
for any of the modes. However, without the use of O_DSYNC (or
sync_file_range(), which I know you dislike), I don't see how we can
have reasonable semantics without always implementing write-back caching
in the host.
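[Since sync_file_range() comes up here, a short illustrative example of
what it does; the file name is made up. The caveat in the comment is one
reason it is a poor fit: it only affects the host page cache.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative: write one block and push that range out of the page
 * cache.  Caveat: sync_file_range() only writes back dirty pages for
 * the range -- it does NOT flush the disk's volatile write cache and
 * does NOT commit filesystem metadata, so it is not a data integrity
 * operation on its own. */
int main(void)
{
    int fd = open("disk.img", O_RDWR);   /* path is illustrative */
    char buf[4096] = { 0 };

    if (fd < 0 || pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
        return 1;

    if (sync_file_range(fd, 0, sizeof(buf),
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) < 0)
        perror("sync_file_range");

    close(fd);
    return 0;
}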
> splitting these up is important because we want to migrate between
> hosts that can support direct I/O or not without requiring guest visible
> state changes, and also because we want to use direct I/O with guest
> that were installed using cache=unsafe without stopping the guest.
>
> It also allows the guest to change the WC enable/disable flag, which
> they can do for real IDE/SCSI hardware. And it allows Anthony's belowed
> no caching at all mode, which actually is useful for guest that can not
> deal with volatile write caches.
>
I'm certainly happy to break up the caching option. However, I still
don't know how we get a reasonable equivalent to cache=writethrough
without assuming that ext4 is mounted with barriers disabled.
Regards,
Anthony Liguori
* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-20 23:17 UTC
To: Anthony Liguori; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel
On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
> >>All read and write requests SHOULD avoid any type of caching in the
> >>host. Any write request MUST complete after the next level of storage
> >>reports that the write request has completed. A flush from the guest
> >>MUST complete after all pending I/O requests for the guest have been
> >>completed.
> >>
> >>As an implementation detail, with the raw format, these guarantees are
> >>only in place for preallocated images. Sparse images do not provide as
> >>strong of a guarantee.
> >>
> >That's not how cache=none ever worked nor works currently.
> >
>
> How does it work today compared to what I wrote above?
From the guest point of view it works exactly as you describe
cache=writeback. There are no ordering or cache flushing guarantees. By
using O_DIRECT we do bypass the host file cache, but we don't even try
on the others (disk cache, committing metadata transactions that are
required to actually see the committed data for sparse, preallocated or
growing images).
What you describe above is the equivalent of O_DSYNC|O_DIRECT which
doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
guarantees the semantics for sparse images. Sparse images really aren't
special in any way - preallocation using posix_fallocate or COW
filesystems like btrfs, nilfs2 or zfs have exactly the same issues.
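[An illustration of the O_DSYNC|O_DIRECT combination described above; the
helper name is a sketch, not current qemu code.]

#define _GNU_SOURCE
#include <fcntl.h>

/* Illustrative: bypass the host page cache (O_DIRECT) and also make
 * every write -- including any metadata needed to read that data back
 * later -- stable before the write completes (O_DSYNC).  O_DIRECT on
 * its own gives only the first property. */
static int open_image_durable(const char *path)
{
    return open(path, O_RDWR | O_DIRECT | O_DSYNC);
}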
> > | WC enable | WC disable
> >-----------------------------------------------
> >direct | |
> >buffer | |
> >buffer + ignore flush | |
> >
> >currently we only have:
> >
> > cache=none direct + WC enable
> > cache=writeback buffer + WC enable
> > cache=writethrough buffer + WC disable
> > cache=unsafe buffer + ignore flush + WC enable
> >
>
> Where does O_DSYNC fit into this chart?
O_DSYNC is used for all WC disable modes.
> Do all modern filesystems implement O_DSYNC without generating
> additional barriers per request?
>
> Having a barrier per-write request is ultimately not the right semantic
> for any of the modes. However, without the use of O_DSYNC (or
> sync_file_range(), which I know you dislike), I don't see how we can
> have reasonable semantics without always implementing write back caching
> in the host.
Barriers are a Linux-specific implementation detail that is in the
process of going away, probably in Linux 2.6.37. But if you want
O_DSYNC semantics with a volatile disk write cache there is no way
around using a cache flush or the FUA bit on all I/O caused by it. We
currently use the cache flush, and although I plan to experiment a bit
more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
surprised if they actually are any faster.
> I'm certainly happy to break up the caching option. However, I still
> don't know how we get a reasonable equivalent to cache=writethrough
> without assuming that ext4 is mounted without barriers enabled.
There are two problems here - one is a Linux-wide problem, and that's
the barrier primitive, which is currently the only way to flush a
volatile disk cache. We've sorted this out for 2.6.37. The other is that
ext3 and ext4 have really bad fsync implementations. Just use a better
filesystem or bug one of its developers if you want that fixed. But
except for disabling the disk cache there is no way to get data integrity
without cache flushes (the FUA bit is nothing but an implicit flush).
* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-21 0:18 UTC
To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel
On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>
>>>> All read and write requests SHOULD avoid any type of caching in the
>>>> host. Any write request MUST complete after the next level of storage
>>>> reports that the write request has completed. A flush from the guest
>>>> MUST complete after all pending I/O requests for the guest have been
>>>> completed.
>>>>
>>>> As an implementation detail, with the raw format, these guarantees are
>>>> only in place for preallocated images. Sparse images do not provide as
>>>> strong of a guarantee.
>>>>
>>>>
>>> That's not how cache=none ever worked nor works currently.
>>>
>>>
>> How does it work today compared to what I wrote above?
>>
> For the guest point of view it works exactly as you describe
> cache=writeback. There is no ordering or cache flushing guarantees. By
> using O_DIRECT we do bypass the host file cache, but we don't even try
> on the others (disk cache, commiting metadata transaction that are
> required to actually see the commited data for sparse, preallocated or
> growing images).
>
O_DIRECT alone to a pre-allocated file on a normal file system should
result in the data being visible without any additional metadata
transactions.
The only time when that isn't true is when dealing with CoW or other
special filesystem features.
> What you describe above is the equivalent of O_DSYNC|O_DIRECT which
> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
> guarantees the semantics for sparse images. Sparse images really aren't
> special in any way - preallocaiton using posix_fallocate or COW
> filesystems like btrfs,nilfs2 or zfs have exactly the same issues.
>
>
>>> | WC enable | WC disable
>>> -----------------------------------------------
>>> direct | |
>>> buffer | |
>>> buffer + ignore flush | |
>>>
>>> currently we only have:
>>>
>>> cache=none direct + WC enable
>>> cache=writeback buffer + WC enable
>>> cache=writethrough buffer + WC disable
>>> cache=unsafe buffer + ignore flush + WC enable
>>>
>>>
>> Where does O_DSYNC fit into this chart?
>>
> O_DSYNC is used for all WC disable modes.
>
>
>> Do all modern filesystems implement O_DSYNC without generating
>> additional barriers per request?
>>
>> Having a barrier per-write request is ultimately not the right semantic
>> for any of the modes. However, without the use of O_DSYNC (or
>> sync_file_range(), which I know you dislike), I don't see how we can
>> have reasonable semantics without always implementing write back caching
>> in the host.
>>
> Barriers are a Linux-specific implementation details that is in the
> process of going away, probably in Linux 2.6.37. But if you want
> O_DSYNC semantics with a volatile disk write cache there is no way
> around using a cache flush or the FUA bit on all I/O caused by it.
If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.
> We
> currently use the cache flush, and although I plan to experiment a bit
> more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> surprised if they actually are any faster.
>
The thing I struggle with understanding is that if the guest is sending
us a write request, why are we sending the underlying disk a write +
flush request? That doesn't seem logical at all to me.
Even if we advertise WC disable, it should be up to the guest to decide
when to issue flushes.
>> I'm certainly happy to break up the caching option. However, I still
>> don't know how we get a reasonable equivalent to cache=writethrough
>> without assuming that ext4 is mounted without barriers enabled.
>>
> There's two problems here - one is a Linux-wide problem and that's the
> barrier primitive which is currenly the only way to flush a volatile
> disk cache. We've sorted this out for the 2.6.37. The other is that
> ext3 and ext4 have really bad fsync implementations. Just use a better
> filesystem or bug one of it's developers if you want that fixed. But
> except for disabling the disk cache there is no way to get data integrity
> without cache flushes (the FUA bit is nothing but an implicit flush).
>
But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated storage
or physical devices)?
Regards,
Anthony Liguori
* [Qemu-devel] Re: Caching modes
From: Kevin Wolf @ 2010-09-21 8:15 UTC
To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel
On 21.09.2010 02:18, Anthony Liguori wrote:
> On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
>> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>>
>>>>> All read and write requests SHOULD avoid any type of caching in the
>>>>> host. Any write request MUST complete after the next level of storage
>>>>> reports that the write request has completed. A flush from the guest
>>>>> MUST complete after all pending I/O requests for the guest have been
>>>>> completed.
>>>>>
>>>>> As an implementation detail, with the raw format, these guarantees are
>>>>> only in place for preallocated images. Sparse images do not provide as
>>>>> strong of a guarantee.
>>>>>
>>>>>
>>>> That's not how cache=none ever worked nor works currently.
>>>>
>>>>
>>> How does it work today compared to what I wrote above?
>>>
>> For the guest point of view it works exactly as you describe
>> cache=writeback. There is no ordering or cache flushing guarantees. By
>> using O_DIRECT we do bypass the host file cache, but we don't even try
>> on the others (disk cache, commiting metadata transaction that are
>> required to actually see the commited data for sparse, preallocated or
>> growing images).
>>
>
> O_DIRECT alone to a pre-allocated file on a normal file system should
> result in the data being visible without any additional metadata
> transactions.
>
> The only time when that isn't true is when dealing with CoW or other
> special filesystem features.
I think preallocated files are the exception; usually people use sparse
files. And even with preallocation, the disk cache still remains.
>> What you describe above is the equivalent of O_DSYNC|O_DIRECT which
>> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
>> guarantees the semantics for sparse images. Sparse images really aren't
>> special in any way - preallocaiton using posix_fallocate or COW
>> filesystems like btrfs,nilfs2 or zfs have exactly the same issues.
>>
>>
>>>> | WC enable | WC disable
>>>> -----------------------------------------------
>>>> direct | |
>>>> buffer | |
>>>> buffer + ignore flush | |
>>>>
>>>> currently we only have:
>>>>
>>>> cache=none direct + WC enable
>>>> cache=writeback buffer + WC enable
>>>> cache=writethrough buffer + WC disable
>>>> cache=unsafe buffer + ignore flush + WC enable
>>>>
>>>>
>>> Where does O_DSYNC fit into this chart?
>>>
>> O_DSYNC is used for all WC disable modes.
>>
>>
>>> Do all modern filesystems implement O_DSYNC without generating
>>> additional barriers per request?
>>>
>>> Having a barrier per-write request is ultimately not the right semantic
>>> for any of the modes. However, without the use of O_DSYNC (or
>>> sync_file_range(), which I know you dislike), I don't see how we can
>>> have reasonable semantics without always implementing write back caching
>>> in the host.
>>>
>> Barriers are a Linux-specific implementation details that is in the
>> process of going away, probably in Linux 2.6.37. But if you want
>> O_DSYNC semantics with a volatile disk write cache there is no way
>> around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC
> semantics.
What do the semantics of a qemu option have to do with the host disk
write cache? We always need to provide the same semantics. If anything,
we can take advantage of a host providing write-through/no caches so
that we don't have to issue the flushes ourselves.
>> We
>> currently use the cache flush, and although I plan to experiment a bit
>> more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
>> surprised if they actually are any faster.
>>
>
> The thing I struggle with understanding is that if the guest is sending
> us a write request, why are we sending the underlying disk a write +
> flush request? That doesn't seem logical at all to me.
>
> Even if we advertise WC disable, it should be up to the guest to decide
> when to issue flushes.
Why should a guest ever flush a cache when it's told that this cache
doesn't exist?
Kevin
* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-21 14:26 UTC
To: Anthony Liguori; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel
On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should
> result in the data being visible without any additional metadata
> transactions.
Anthony, for the third time: no. O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file.
In general this will degrade performance, but it is useful in
special situations, such as when applications do their own
caching. File I/O is done directly to/from user space buffers.
The O_DIRECT flag on its own makes an effort to transfer data
synchronously, but does not give the guarantees of the O_SYNC flag
that data and necessary metadata are transferred. To guarantee
synchronous I/O, O_SYNC must be used in addition to O_DIRECT.
See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block
devices is described in raw(8).
O_DIRECT does not have any meaning for data integrity, it just tells the
filesystem it *should* not use the pagecache. Even then, various
filesystems have fallbacks to buffered I/O for corner cases.
It does *not* mean the actual disk cache gets flushed, and it *does*
not guarantee anything about metadata, which is very important.
Metadata updates happen when filling a sparse file, when extending the
file size, when using a COW filesystem, and when converting preallocated
to fully allocated extents in practice, and could happen in many more
cases depending on the filesystem implementation.
> >Barriers are a Linux-specific implementation details that is in the
> >process of going away, probably in Linux 2.6.37. But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC
> semantics.
If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache. But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct. O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.
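[A sketch of the equivalence stated above: for the data just written, the
two variants below give the same guarantee. User space has no range-limited
fdatasync, so the second variant flushes the whole file; the descriptors
are assumed to be open already, and the names are illustrative.]

#include <unistd.h>
#include <sys/types.h>

/* Variant 1: fd_dsync was opened with O_DSYNC -- the kernel makes the
 * data (and the metadata needed to read it back) stable before pwrite
 * returns. */
ssize_t write_sync_flag(int fd_dsync, const void *buf, size_t len, off_t off)
{
    return pwrite(fd_dsync, buf, len, off);
}

/* Variant 2: plain fd -- request the same guarantee explicitly with an
 * fdatasync after the write. */
ssize_t write_then_flush(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);

    if (n >= 0 && fdatasync(fd) < 0)
        return -1;
    return n;
}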
> > We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
> >
>
> The thing I struggle with understanding is that if the guest is sending
> us a write request, why are we sending the underlying disk a write +
> flush request? That doesn't seem logical at all to me.
We only send a cache flush request *iff* we present the guest with a
device without a volatile write cache (so that it can assume all writes
are stable) and we sit on a device that does have a volatile write
cache.
> Even if we advertise WC disable, it should be up to the guest to decide
> when to issue flushes.
No. If we don't claim to have a volatile cache, no guest will ever flush
the cache. Which is only logical, given that we just told it that we
don't have a cache that needs flushing.
> >ext3 and ext4 have really bad fsync implementations. Just use a better
> >filesystem or bug one of it's developers if you want that fixed. But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
> >
>
> But why are we issuing more flushes than the guest is issuing if we
> don't have to worry about filesystem metadata (i.e. preallocated storage
> or physical devices)?
Who is "we" and what is the workload/filesystem/kernel combination?
Specific details and numbers please.
* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-21 15:13 UTC
To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel
On 09/21/2010 09:26 AM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
>
>> O_DIRECT alone to a pre-allocated file on a normal file system should
>> result in the data being visible without any additional metadata
>> transactions.
>>
> Anthony, for the third time: no. O_DIRECT is a non-portable extension
> in Linux (taken from IRIX) and is defined as:
>
>
> O_DIRECT (Since Linux 2.4.10)
> Try to minimize cache effects of the I/O to and from this file.
> In general this will degrade performance, but it is useful in
> special situations, such as when applications do their own
> caching. File I/O is done directly to/from user space buffers.
> The O_DIRECT flag on its own makes at an effort to transfer data
> synchronously, but does not give the guarantees of the O_SYNC
> that data and necessary metadata are transferred. To guarantee
> synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
> See NOTES below for further discussion.
>
> A semantically similar (but deprecated) interface for block
> devices is described in raw(8).
>
> O_DIRECT does not have any meaning for data integrity, it just tells the
> filesystem it *should* not use the pagecache. Even if it should not
> various filesystem have fallbacks to buffered I/O for corner cases.
> It does *not* mean the actual disk cache gets flushed, and it *does*
> not guarantee anything about metadata which is very important.
>
Yes, I understand all of this but I was trying to avoid accepting it.
But after the call today, I'm convinced that this is fundamentally a
filesystem problem.
I think what we need to do is:
1) make the virtual WC guest controllable. If a guest enables WC, &=
~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user
specify the virtual WC mode but it has to be changeable during live
migration. (See the sketch after this list.)
2) only let the user choose between using and not using the host page
cache. IOW, direct=on|off. cache=XXX is deprecated.
3) make O_DIRECT | O_DSYNC not suck so badly on ext4.
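[A sketch of item 1, assuming the image is a plain file and no I/O is in
flight during the switch. On Linux fcntl(F_SETFL) cannot toggle O_DSYNC,
so honouring a guest WCE change means reopening the image with the flag
added or removed; this is illustrative only, not the actual patches.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Illustrative: reopen the image when the guest toggles its write cache.
 * WCE on  -> drop O_DSYNC (host may complete writes from cache; the
 *            guest is expected to send flushes).
 * WCE off -> add O_DSYNC (every write is stable on completion). */
static int set_guest_write_cache(int old_fd, const char *path,
                                 int base_flags, int wce_enabled)
{
    int flags = base_flags | (wce_enabled ? 0 : O_DSYNC);
    int new_fd = open(path, flags);

    if (new_fd < 0)
        return -1;      /* keep using old_fd on failure */
    close(old_fd);
    return new_fd;
}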
>>> Barriers are a Linux-specific implementation details that is in the
>>> process of going away, probably in Linux 2.6.37. But if you want
>>> O_DSYNC semantics with a volatile disk write cache there is no way
>>> around using a cache flush or the FUA bit on all I/O caused by it.
>>>
>> If you have a volatile disk write cache, then we don't need O_DSYNC
>> semantics.
>>
> If you present a volatile write cache to the guest you do indeed not
> need O_DSYNC and can rely on the guest sending fdatasync calls when it
> wants to flush the cache. But for the statement above you can replace
> O_DSYC with fdatasync and it will still be correct. O_DSYNC in current
> Linux kernels is nothing but an implicit range fdatasync after each
> write.
>
Yes. I was stuck on O_DSYNC being independent of the virtual WC but
it's clear to me now that it cannot be.
>>> ext3 and ext4 have really bad fsync implementations. Just use a better
>>> filesystem or bug one of it's developers if you want that fixed. But
>>> except for disabling the disk cache there is no way to get data integrity
>>> without cache flushes (the FUA bit is nothing but an implicit flush).
>>>
>>>
>> But why are we issuing more flushes than the guest is issuing if we
>> don't have to worry about filesystem metadata (i.e. preallocated storage
>> or physical devices)?
>>
> Who is "we" and what is workload/filesystem/kernel combination?
> Specific details and numbers please.
>
My concern is ext4. With a preallocated file and cache=none as
implemented today, performance is good even when barrier=1. If we
enable O_DSYNC, performance will plummet. Ultimately, this is an ext4
problem, not a QEMU problem.
Perhaps we can issue a warning if the WC is disabled and we do an fsstat
and see that it's ext4 with barriers enabled.
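[A sketch of such a warning using /proc/self/mounts; the check is only a
heuristic (ext4 defaults to barriers on and often does not list the option
explicitly), and mapping the image path to its mount point is left to the
caller. Not QEMU code; the message text is made up.]

#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Illustrative heuristic: warn if the mount holding the image is ext4
 * without an explicit barrier=0/nobarrier option, since O_DSYNC writes
 * (guest write cache disabled) will then be very slow. */
static void warn_ext4_barriers(const char *mount_point)
{
    FILE *f = setmntent("/proc/self/mounts", "r");
    struct mntent *m;

    if (!f)
        return;
    while ((m = getmntent(f)) != NULL) {
        if (strcmp(m->mnt_dir, mount_point) == 0 &&
            strcmp(m->mnt_type, "ext4") == 0 &&
            !hasmntopt(m, "nobarrier") && !hasmntopt(m, "barrier=0")) {
            fprintf(stderr, "warning: image is on ext4 with barriers; "
                            "disabling the guest write cache may be "
                            "very slow\n");
        }
    }
    endmntent(f);
}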
I think it's more common for a user to want to disable a virtual WC
because they have less faith in the hypervisor than they have in the
underlying storage.
The scenarios I am concerned about:
1) User has enterprise storage, but has an image on ext4 with
barrier=1. User explicitly disables WC in guest because they have
enterprise storage but not an UPS for the hypervisor.
2) User does not have enterprise storage, but has an image on ext4 with
barrier=1. User explicitly disables WC in guest because they don't know
what they're doing.
In the case of (1), the answer may be "ext4 sucks, remount with
barrier=0" but I think we need to at least warn the user of this.
For (2), again it's probably the user doing the wrong thing because if
they don't have enterprise storage, then they shouldn't care about a
virtual WC. Practically though, I've seen a lot of this with users.
Regards,
Anthony Liguori
* [Qemu-devel] Re: Caching modes
From: Christoph Hellwig @ 2010-09-21 20:57 UTC
To: Anthony Liguori; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel
On Tue, Sep 21, 2010 at 10:13:01AM -0500, Anthony Liguori wrote:
> 1) make virtual WC guest controllable. If a guest enables WC, &=
> ~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user
> specify the virtual WC mode but it has to be changable during live
> migration.
I have patches for that which are almost ready to submit.
>
> 2) only let the user choose between using and not using the host page
> cache. IOW, direct=on|off. cache=XXX is deprecated.
Also done by that patch series. That's exactly what I described two mail
roundtrips ago.
> My concern is ext4. With a preallocated file and cache=none as
> implemented today, performance is good even when barrier=1. If we
> enable O_DSYNC, performance will plummet. Ultimately, this is an ext4
> problem, not a QEMU problem.
For Linux or Windows guests WCE=0 is not a particularly good default
given that they can deal with write caches; an enabled write cache
mirrors the situation with consumer SATA disks. But for older Unix
guests you'll need to be able to persistently disable the write cache.
To make things more confusing the default ATA/SATA way to tune the
volatile write cache setting is not persistent - e.g. if you disable it
using hdparm it will come up enabled again.
> 2) User does not have enterprise storage, but has an image on ext4 with
> barrier=1. User explicitly disables WC in guest because they don't know
> what they're doing.
>
> For (2), again it's probably the user doing the wrong thing because if
> they don't have enterprise storage, then they shouldn't care about a
> virtual WC. Practically though, I've seen a lot of this with users.
This setting is just fine, especially if using O_DIRECT. The guest
sends cache flush requests often enough to not make it a problem. If
you do not use O_DIRECT in that scenario a lot more data will be cached
in theory - but any filesystem aware of cache flushes will flush
frequently enough to not make it a problem. It is a real problem,
however, when using ext3 in its default setting in the guest, which
doesn't use barriers. But that's a bug in ext3 and nothing but
petitioning its maintainer to fix it will help you there.
* [Qemu-devel] Re: Caching modes
From: Anthony Liguori @ 2010-09-21 21:27 UTC
To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel
On 09/21/2010 03:57 PM, Christoph Hellwig wrote:
> On Tue, Sep 21, 2010 at 10:13:01AM -0500, Anthony Liguori wrote:
>
>> 1) make virtual WC guest controllable. If a guest enables WC,&=
>> ~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user
>> specify the virtual WC mode but it has to be changable during live
>> migration.
>>
> I have patches for that are almost ready to submit.
>
>
>> 2) only let the user choose between using and not using the host page
>> cache. IOW, direct=on|off. cache=XXX is deprecated.
>>
> Also done by that patch series. That's exactly what I described to mail
> roundtrips ago..
>
Yes.
>> My concern is ext4. With a preallocated file and cache=none as
>> implemented today, performance is good even when barrier=1. If we
>> enable O_DSYNC, performance will plummet. Ultimately, this is an ext4
>> problem, not a QEMU problem.
>>
> For Linux or Windows guests WCE=0 is not a particularly good default
> given that they can deal with the write caches, and mirrors the
> situation with consumer SATA disk. For for older Unix guests you'll
> need to be able to persistently disable the write cache.
>
> To make things more confusing the default ATA/SATA way to tune the
> volatile write cache setting is not persistent - e.g. if you disable it
> using hdparm it will come up enabled again.
>
Yes, potentially, we could save this in a config file (and really, I
mean libvirt could save it).
>> 2) User does not have enterprise storage, but has an image on ext4 with
>> barrier=1. User explicitly disables WC in guest because they don't know
>> what they're doing.
>>
>> For (2), again it's probably the user doing the wrong thing because if
>> they don't have enterprise storage, then they shouldn't care about a
>> virtual WC. Practically though, I've seen a lot of this with users.
>>
> This setting is just fine, especially if using O_DIRECT. The guest
> sends cache flush requests often enough to not make it a problem. If
> you do not use O_DIRECT in that scenario which will cache a lot more
> data in theory - but any filesystem aware of cache flushes will flush
> them frequent enough to not make it a problem. It is a real problem
> however when using ext3 in it's default setting in the guest which
> doesn't use barrier. But that's a bug in ext3 and nothing but
> petitioning it's maintainer to fix it will help you there.
>
It's not just ext3, it's ext4 with barrier=0 which is what certain
applications are being told to do in the face of poor performance.
So direct=on,wc=on + ext4 barrier=0 in the guest is less safe than ext4
barrier=0 on bare metal.
Very specifically, if we do cache=none as we do today, and within the
guest we have ext4 barrier=0 and run DB2, DB2's guarantees are weaker
than they are on bare metal because metadata is not getting flushed.
To resolve this, we need to do direct=on,wc=off + ext4 barrier=0 on the
host. This is safe and should perform reasonably well but there's far
too much complexity for a user to get to this point.
Regards,
Anthony Liguori