Subject: [Qemu-devel] Caching modes
From: Anthony Liguori
Date: 2010-09-20 16:53 UTC
To: qemu-devel, Kevin Wolf, Christoph Hellwig

Moving to a separate thread since this has come up a few times and I think we need to discuss the assumptions a bit more.

This is how I understand the caching modes should behave and what guarantees a guest gets.

cache=none

All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.

As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.

cache=writethrough

All read and write requests MAY be cached by the host. Read requests MAY be satisfied by cached data in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.

As an implementation detail, with the raw format, these guarantees also apply to sparse images. In the future, we could relax this such that sparse images did not provide as strong a guarantee.

cache=writeback

All read and write requests MAY be cached by the host. Read and write requests may be completed entirely within the cache. A write request MAY complete before the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed and acknowledged by the next level of the storage hierarchy.

Guest disk cache.

For all devices that support it, the exposed cache attribute should be independent of the host caching mode. Here are correct usages of the disk caching mode:

Writethrough disk cache: cache=none|writethrough if the disk cache is set to writethrough, or if the disk is considered "enterprise class" and has a battery backup; cache=writeback IFF the host is backed by a UPS.

Writeback disk cache: cache=none|writethrough if the disk cache is set to writeback and the disk is not enterprise class; cache=writeback if the host is not backed by a UPS.

Regards,

Anthony Liguori
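For reference, the modes above are what -drive's cache= option selects today; a minimal illustrative invocation, where the image path and device model are placeholders:

    qemu-system-x86_64 -drive file=disk.img,if=virtio,cache=none          # bypass the host page cache (O_DIRECT)
    qemu-system-x86_64 -drive file=disk.img,if=virtio,cache=writethrough  # host page cache, writes completed only when reported stable
    qemu-system-x86_64 -drive file=disk.img,if=virtio,cache=writeback     # host page cache used as a writeback cache; guest flushes honoured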
Subject: Re: [Qemu-devel] Caching modes
From: Blue Swirl
Date: 2010-09-20 18:37 UTC
To: Anthony Liguori
Cc: Kevin Wolf, qemu-devel, Christoph Hellwig

On Mon, Sep 20, 2010 at 4:53 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> Moving to a separate thread since this has come up a few times and I think we need to discuss the assumptions a bit more.
>
> This is how I understand the caching modes should behave and what guarantees a guest gets.
>
> cache=none
>
> All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.
>
> cache=writethrough
>
> All read and write requests MAY be cached by the host. Read requests MAY be satisfied by cached data in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees also apply to sparse images. In the future, we could relax this such that sparse images did not provide as strong a guarantee.
>
> cache=writeback
>
> All read and write requests MAY be cached by the host. Read and write requests may be completed entirely within the cache. A write request MAY complete before the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed and acknowledged by the next level of the storage hierarchy.

It would be nice to have an additional mode, like cache=always, where even flushes MAY be ignored. This would max out the performance.

> Guest disk cache.
>
> For all devices that support it, the exposed cache attribute should be independent of the host caching mode. Here are correct usages of the disk caching mode:
>
> Writethrough disk cache: cache=none|writethrough if the disk cache is set to writethrough, or if the disk is considered "enterprise class" and has a battery backup; cache=writeback IFF the host is backed by a UPS.

The "enterprise class" disks, battery backups and UPS devices are not consumer equipment. Wouldn't this mean that any private QEMU user would need to use cache=none?

As an example, what is the correct usage for a laptop user, considering that there is a battery, but it can also drain, and how quickly it drains depends on flush frequency?
Subject: Re: [Qemu-devel] Caching modes
From: Anthony Liguori
Date: 2010-09-20 18:51 UTC
To: Blue Swirl
Cc: Kevin Wolf, qemu-devel, Christoph Hellwig

On 09/20/2010 01:37 PM, Blue Swirl wrote:
> It would be nice to have an additional mode, like cache=always, where even flushes MAY be ignored. This would max out the performance.

That's cache=unsafe and we have it. I ignored it for the purposes of this discussion.

>> Guest disk cache.
>>
>> For all devices that support it, the exposed cache attribute should be independent of the host caching mode. Here are correct usages of the disk caching mode:
>>
>> Writethrough disk cache: cache=none|writethrough if the disk cache is set to writethrough, or if the disk is considered "enterprise class" and has a battery backup; cache=writeback IFF the host is backed by a UPS.
>
> The "enterprise class" disks, battery backups and UPS devices are not consumer equipment. Wouldn't this mean that any private QEMU user would need to use cache=none?

No, cache=writethrough and cache=none should be equivalent from a data integrity/data loss perspective. Using cache=writeback without enterprise storage is risky, but practically speaking, most consumer storage is not battery backed and uses writeback caching anyway, so there is already risk.

> As an example, what is the correct usage for a laptop user, considering that there is a battery, but it can also drain, and how quickly it drains depends on flush frequency?

Minus cache=unsafe, you'll never get data corruption. The only consideration is how much data loss can occur since the last time there was a flush. Well-behaved applications always flush important data to avoid losing anything important, but practically speaking, the world isn't full of well-behaved applications.

The only difference between cache=writeback and a normal disk's writeback cache is that cache=writeback can be a very, very large cache that isn't frequently flushed. So the amount of data loss can be much higher than expected.

For most laptop users, cache=none or cache=writethrough is appropriate. For a developer, cache=writeback is probably reasonable.

Regards,

Anthony Liguori
Subject: [Qemu-devel] Re: Caching modes
From: Christoph Hellwig
Date: 2010-09-20 19:34 UTC
To: Anthony Liguori
Cc: Kevin Wolf, qemu-devel, Christoph Hellwig

On Mon, Sep 20, 2010 at 11:53:02AM -0500, Anthony Liguori wrote:
> cache=none
>
> All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>
> As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.

That's not how cache=none ever worked, nor how it works currently.

But discussing the current cache modes is rather moot, as they try to map multi-dimensional behaviour differences onto a single option. I have some patches that I need to finish up a bit more that will give you your no-caching-enabled mode, but I don't think mapping cache=none to it will do anyone a favour.

With the split between the guest-visible write-cache-enable (WCE) flag and the host-specific "use O_DIRECT" and "ignore cache flushes" flags, we'll get the following modes:

                          | WC enable | WC disable
    -----------------------------------------------
    direct                |           |
    buffer                |           |
    buffer + ignore flush |           |

Currently we only have:

    cache=none            direct + WC enable
    cache=writeback       buffer + WC enable
    cache=writethrough    buffer + WC disable
    cache=unsafe          buffer + ignore flush + WC enable

Splitting these up is important because we want to migrate between hosts that can or cannot support direct I/O without requiring guest-visible state changes, and also because we want to use direct I/O with guests that were installed using cache=unsafe without stopping the guest.

It also allows the guest to change the WC enable/disable flag, which it can do for real IDE/SCSI hardware. And it allows Anthony's beloved no-caching-at-all mode, which actually is useful for guests that cannot deal with volatile write caches.
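As a sketch of what the split could look like in code (this is not Christoph's actual patches; the structure and function names are made up for illustration), the three knobs translate roughly into open(2) flags plus a flush policy:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>
    #include <unistd.h>

    /* Hypothetical per-drive options mirroring the table above. */
    struct drive_cache_opts {
        bool direct;        /* bypass the host page cache (O_DIRECT) */
        bool ignore_flush;  /* drop guest cache flushes (unsafe)     */
        bool wce;           /* guest-visible write cache enable      */
    };

    /* WC disable means a completed write must already be stable, which
     * O_DSYNC provides; "direct" only controls the host page cache. */
    static int host_open_flags(const struct drive_cache_opts *o)
    {
        int flags = O_RDWR;

        if (o->direct)
            flags |= O_DIRECT;
        if (!o->wce)
            flags |= O_DSYNC;
        return flags;
    }

    /* A guest flush request is honoured with fdatasync() unless flushes
     * are deliberately ignored. */
    static int handle_guest_flush(int fd, const struct drive_cache_opts *o)
    {
        return o->ignore_flush ? 0 : fdatasync(fd);
    }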
Subject: [Qemu-devel] Re: Caching modes
From: Anthony Liguori
Date: 2010-09-20 20:11 UTC
To: Christoph Hellwig
Cc: Kevin Wolf, qemu-devel

On 09/20/2010 02:34 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 11:53:02AM -0500, Anthony Liguori wrote:
>> cache=none
>>
>> All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>>
>> As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.
>
> That's not how cache=none ever worked, nor how it works currently.

How does it work today compared to what I wrote above?

> But discussing the current cache modes is rather moot, as they try to map multi-dimensional behaviour differences onto a single option. I have some patches that I need to finish up a bit more that will give you your no-caching-enabled mode, but I don't think mapping cache=none to it will do anyone a favour.
>
> With the split between the guest-visible write-cache-enable (WCE) flag and the host-specific "use O_DIRECT" and "ignore cache flushes" flags, we'll get the following modes:
>
>                           | WC enable | WC disable
>     -----------------------------------------------
>     direct                |           |
>     buffer                |           |
>     buffer + ignore flush |           |
>
> Currently we only have:
>
>     cache=none            direct + WC enable
>     cache=writeback       buffer + WC enable
>     cache=writethrough    buffer + WC disable
>     cache=unsafe          buffer + ignore flush + WC enable

Where does O_DSYNC fit into this chart?

Do all modern filesystems implement O_DSYNC without generating additional barriers per request?

Having a barrier per write request is ultimately not the right semantic for any of the modes. However, without the use of O_DSYNC (or sync_file_range(), which I know you dislike), I don't see how we can have reasonable semantics without always implementing writeback caching in the host.

> Splitting these up is important because we want to migrate between hosts that can or cannot support direct I/O without requiring guest-visible state changes, and also because we want to use direct I/O with guests that were installed using cache=unsafe without stopping the guest.
>
> It also allows the guest to change the WC enable/disable flag, which it can do for real IDE/SCSI hardware. And it allows Anthony's beloved no-caching-at-all mode, which actually is useful for guests that cannot deal with volatile write caches.

I'm certainly happy to break up the caching option. However, I still don't know how we get a reasonable equivalent to cache=writethrough without assuming that ext4 is mounted without barriers enabled.

Regards,

Anthony Liguori
Subject: [Qemu-devel] Re: Caching modes
From: Christoph Hellwig
Date: 2010-09-20 23:17 UTC
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, qemu-devel

On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>>> All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>>>
>>> As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.
>>
>> That's not how cache=none ever worked, nor how it works currently.
>
> How does it work today compared to what I wrote above?

From the guest's point of view it works exactly as you describe cache=writeback. There are no ordering or cache flushing guarantees. By using O_DIRECT we do bypass the host file cache, but we don't even try on the others (the disk cache, or committing the metadata transactions that are required to actually see the committed data for sparse, preallocated or growing images).

What you describe above is the equivalent of O_DSYNC|O_DIRECT, which doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also guarantees the semantics for sparse images. Sparse images really aren't special in any way - preallocation using posix_fallocate or COW filesystems like btrfs, nilfs2 or zfs have exactly the same issues.

>>                           | WC enable | WC disable
>>     -----------------------------------------------
>>     direct                |           |
>>     buffer                |           |
>>     buffer + ignore flush |           |
>>
>> Currently we only have:
>>
>>     cache=none            direct + WC enable
>>     cache=writeback       buffer + WC enable
>>     cache=writethrough    buffer + WC disable
>>     cache=unsafe          buffer + ignore flush + WC enable
>
> Where does O_DSYNC fit into this chart?

O_DSYNC is used for all WC disable modes.

> Do all modern filesystems implement O_DSYNC without generating additional barriers per request?
>
> Having a barrier per write request is ultimately not the right semantic for any of the modes. However, without the use of O_DSYNC (or sync_file_range(), which I know you dislike), I don't see how we can have reasonable semantics without always implementing writeback caching in the host.

Barriers are a Linux-specific implementation detail that is in the process of going away, probably in Linux 2.6.37. But if you want O_DSYNC semantics with a volatile disk write cache there is no way around using a cache flush or the FUA bit on all I/O caused by it. We currently use the cache flush, and although I plan to experiment a bit more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be very surprised if they actually are any faster.

> I'm certainly happy to break up the caching option. However, I still don't know how we get a reasonable equivalent to cache=writethrough without assuming that ext4 is mounted without barriers enabled.

There are two problems here. One is a Linux-wide problem, and that's the barrier primitive, which is currently the only way to flush a volatile disk cache; we've sorted this out for 2.6.37. The other is that ext3 and ext4 have really bad fsync implementations. Just use a better filesystem or bug one of its developers if you want that fixed. But except for disabling the disk cache there is no way to get data integrity without cache flushes (the FUA bit is nothing but an implicit flush).
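To make the "equivalent of O_DSYNC|O_DIRECT" concrete, a minimal sketch of such a no-caching open and write path; this is purely illustrative (open_nocache and write_sector are made-up names, and the 512-byte alignment is an assumption, since the real O_DIRECT requirement depends on the device and filesystem):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Bypass the host page cache and make completed writes stable. */
    static int open_nocache(const char *path)
    {
        return open(path, O_RDWR | O_DIRECT | O_DSYNC);
    }

    /* O_DIRECT requires the buffer, offset and length to be aligned. */
    static int write_sector(int fd, off_t sector, const void *data)
    {
        void *buf;
        ssize_t ret;

        if (posix_memalign(&buf, 512, 512))
            return -1;
        memcpy(buf, data, 512);
        ret = pwrite(fd, buf, 512, sector * 512);
        free(buf);
        return ret == 512 ? 0 : -1;
    }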
Subject: [Qemu-devel] Re: Caching modes
From: Anthony Liguori
Date: 2010-09-21 0:18 UTC
To: Christoph Hellwig
Cc: Kevin Wolf, qemu-devel

On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>>>> All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>>>>
>>>> As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.
>>>
>>> That's not how cache=none ever worked, nor how it works currently.
>>
>> How does it work today compared to what I wrote above?
>
> From the guest's point of view it works exactly as you describe cache=writeback. There are no ordering or cache flushing guarantees. By using O_DIRECT we do bypass the host file cache, but we don't even try on the others (the disk cache, or committing the metadata transactions that are required to actually see the committed data for sparse, preallocated or growing images).

O_DIRECT alone to a preallocated file on a normal file system should result in the data being visible without any additional metadata transactions.

The only time when that isn't true is when dealing with CoW or other special filesystem features.

> What you describe above is the equivalent of O_DSYNC|O_DIRECT, which doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also guarantees the semantics for sparse images. Sparse images really aren't special in any way - preallocation using posix_fallocate or COW filesystems like btrfs, nilfs2 or zfs have exactly the same issues.
>
>>>                           | WC enable | WC disable
>>>     -----------------------------------------------
>>>     direct                |           |
>>>     buffer                |           |
>>>     buffer + ignore flush |           |
>>>
>>> Currently we only have:
>>>
>>>     cache=none            direct + WC enable
>>>     cache=writeback       buffer + WC enable
>>>     cache=writethrough    buffer + WC disable
>>>     cache=unsafe          buffer + ignore flush + WC enable
>>
>> Where does O_DSYNC fit into this chart?
>
> O_DSYNC is used for all WC disable modes.
>
>> Do all modern filesystems implement O_DSYNC without generating additional barriers per request?
>>
>> Having a barrier per write request is ultimately not the right semantic for any of the modes. However, without the use of O_DSYNC (or sync_file_range(), which I know you dislike), I don't see how we can have reasonable semantics without always implementing writeback caching in the host.
>
> Barriers are a Linux-specific implementation detail that is in the process of going away, probably in Linux 2.6.37. But if you want O_DSYNC semantics with a volatile disk write cache there is no way around using a cache flush or the FUA bit on all I/O caused by it.

If you have a volatile disk write cache, then we don't need O_DSYNC semantics.

> We currently use the cache flush, and although I plan to experiment a bit more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be very surprised if they actually are any faster.

The thing I struggle with understanding is that if the guest is sending us a write request, why are we sending the underlying disk a write + flush request? That doesn't seem logical at all to me.

Even if we advertise WC disable, it should be up to the guest to decide when to issue flushes.

>> I'm certainly happy to break up the caching option. However, I still don't know how we get a reasonable equivalent to cache=writethrough without assuming that ext4 is mounted without barriers enabled.
>
> There are two problems here. One is a Linux-wide problem, and that's the barrier primitive, which is currently the only way to flush a volatile disk cache; we've sorted this out for 2.6.37. The other is that ext3 and ext4 have really bad fsync implementations. Just use a better filesystem or bug one of its developers if you want that fixed. But except for disabling the disk cache there is no way to get data integrity without cache flushes (the FUA bit is nothing but an implicit flush).

But why are we issuing more flushes than the guest is issuing if we don't have to worry about filesystem metadata (i.e. preallocated storage or physical devices)?

Regards,

Anthony Liguori
Subject: [Qemu-devel] Re: Caching modes
From: Kevin Wolf
Date: 2010-09-21 8:15 UTC
To: Anthony Liguori
Cc: Christoph Hellwig, qemu-devel

Am 21.09.2010 02:18, schrieb Anthony Liguori:
> On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
>> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>>>>> All read and write requests SHOULD avoid any type of caching in the host. Any write request MUST complete after the next level of storage reports that the write request has completed. A flush from the guest MUST complete after all pending I/O requests for the guest have been completed.
>>>>>
>>>>> As an implementation detail, with the raw format, these guarantees are only in place for preallocated images. Sparse images do not provide as strong a guarantee.
>>>>
>>>> That's not how cache=none ever worked, nor how it works currently.
>>>
>>> How does it work today compared to what I wrote above?
>>
>> From the guest's point of view it works exactly as you describe cache=writeback. There are no ordering or cache flushing guarantees. By using O_DIRECT we do bypass the host file cache, but we don't even try on the others (the disk cache, or committing the metadata transactions that are required to actually see the committed data for sparse, preallocated or growing images).
>
> O_DIRECT alone to a preallocated file on a normal file system should result in the data being visible without any additional metadata transactions.
>
> The only time when that isn't true is when dealing with CoW or other special filesystem features.

I think preallocated files are the exception; usually people use sparse files. And even with preallocation, the disk cache is still left.

>> What you describe above is the equivalent of O_DSYNC|O_DIRECT, which doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also guarantees the semantics for sparse images. Sparse images really aren't special in any way - preallocation using posix_fallocate or COW filesystems like btrfs, nilfs2 or zfs have exactly the same issues.
>>
>>>>                           | WC enable | WC disable
>>>>     -----------------------------------------------
>>>>     direct                |           |
>>>>     buffer                |           |
>>>>     buffer + ignore flush |           |
>>>>
>>>> Currently we only have:
>>>>
>>>>     cache=none            direct + WC enable
>>>>     cache=writeback       buffer + WC enable
>>>>     cache=writethrough    buffer + WC disable
>>>>     cache=unsafe          buffer + ignore flush + WC enable
>>>
>>> Where does O_DSYNC fit into this chart?
>>
>> O_DSYNC is used for all WC disable modes.
>>
>>> Do all modern filesystems implement O_DSYNC without generating additional barriers per request?
>>>
>>> Having a barrier per write request is ultimately not the right semantic for any of the modes. However, without the use of O_DSYNC (or sync_file_range(), which I know you dislike), I don't see how we can have reasonable semantics without always implementing writeback caching in the host.
>>
>> Barriers are a Linux-specific implementation detail that is in the process of going away, probably in Linux 2.6.37. But if you want O_DSYNC semantics with a volatile disk write cache there is no way around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC semantics.

What do the semantics of a qemu option have to do with the host disk write cache? We always need to provide the same semantics. If anything, we can take advantage of a host providing write-through/no caches so that we don't have to issue the flushes ourselves.

>> We currently use the cache flush, and although I plan to experiment a bit more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be very surprised if they actually are any faster.
>
> The thing I struggle with understanding is that if the guest is sending us a write request, why are we sending the underlying disk a write + flush request? That doesn't seem logical at all to me.
>
> Even if we advertise WC disable, it should be up to the guest to decide when to issue flushes.

Why should a guest ever flush a cache when it's told that this cache doesn't exist?

Kevin
Subject: [Qemu-devel] Re: Caching modes
From: Christoph Hellwig
Date: 2010-09-21 14:26 UTC
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, qemu-devel

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a preallocated file on a normal file system should result in the data being visible without any additional metadata transactions.

Anthony, for the third time: no. O_DIRECT is a non-portable extension in Linux (taken from IRIX) and is defined as:

    O_DIRECT (Since Linux 2.4.10)
        Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

        A semantically similar (but deprecated) interface for block devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity, it just tells the filesystem it *should* not use the pagecache. Even then, various filesystems have fallbacks to buffered I/O for corner cases. It does *not* mean the actual disk cache gets flushed, and it *does* not guarantee anything about metadata, which is very important. Metadata updates happen when filling a sparse file, when extending the file size, when using a COW filesystem, and, in practice, when converting preallocated to fully allocated extents - and they could happen in many more cases depending on the filesystem implementation.

>> Barriers are a Linux-specific implementation detail that is in the process of going away, probably in Linux 2.6.37. But if you want O_DSYNC semantics with a volatile disk write cache there is no way around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC semantics.

If you present a volatile write cache to the guest you do indeed not need O_DSYNC and can rely on the guest sending fdatasync calls when it wants to flush the cache. But for the statement above you can replace O_DSYNC with fdatasync and it will still be correct. O_DSYNC in current Linux kernels is nothing but an implicit range fdatasync after each write.

>> We currently use the cache flush, and although I plan to experiment a bit more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be very surprised if they actually are any faster.
>
> The thing I struggle with understanding is that if the guest is sending us a write request, why are we sending the underlying disk a write + flush request? That doesn't seem logical at all to me.

We only send a cache flush request *iff* we present the guest a device without a volatile write cache (so that it can assume all writes are stable) and we sit on a device that does have a volatile write cache.

> Even if we advertise WC disable, it should be up to the guest to decide when to issue flushes.

No. If we don't claim to have a volatile cache, no guest will ever flush the cache. Which is only logical, given that we just told it that we don't have a cache that needs flushing.

>> ext3 and ext4 have really bad fsync implementations. Just use a better filesystem or bug one of its developers if you want that fixed. But except for disabling the disk cache there is no way to get data integrity without cache flushes (the FUA bit is nothing but an implicit flush).
>
> But why are we issuing more flushes than the guest is issuing if we don't have to worry about filesystem metadata (i.e. preallocated storage or physical devices)?

Who is "we", and what is the workload/filesystem/kernel combination? Specific details and numbers please.
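A small sketch of the behaviour described here - emulating WC-disable semantics on top of a backing store that does have a volatile cache by flushing after each write, which is effectively what O_DSYNC does per write, and fdatasync() is also what a guest-issued flush maps to. Illustrative only; stable_pwrite is a made-up helper name:

    #include <sys/types.h>
    #include <unistd.h>

    /* With the write cache reported as disabled to the guest, a write may
     * only be completed once it is stable.  On a backing device with a
     * volatile cache that means following every write with a flush. */
    static ssize_t stable_pwrite(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t ret = pwrite(fd, buf, len, off);

        if (ret < 0)
            return ret;
        if (fdatasync(fd) < 0)
            return -1;
        return ret;
    }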
Subject: [Qemu-devel] Re: Caching modes
From: Anthony Liguori
Date: 2010-09-21 15:13 UTC
To: Christoph Hellwig
Cc: Kevin Wolf, qemu-devel

On 09/21/2010 09:26 AM, Christoph Hellwig wrote:
> On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
>> O_DIRECT alone to a preallocated file on a normal file system should result in the data being visible without any additional metadata transactions.
>
> Anthony, for the third time: no. O_DIRECT is a non-portable extension in Linux (taken from IRIX) and is defined as:
>
>     O_DIRECT (Since Linux 2.4.10)
>         Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
>
>         A semantically similar (but deprecated) interface for block devices is described in raw(8).
>
> O_DIRECT does not have any meaning for data integrity, it just tells the filesystem it *should* not use the pagecache. Even then, various filesystems have fallbacks to buffered I/O for corner cases. It does *not* mean the actual disk cache gets flushed, and it *does* not guarantee anything about metadata, which is very important.

Yes, I understand all of this, but I was trying to avoid accepting it. But after the call today, I'm convinced that this is fundamentally a filesystem problem.

I think what we need to do is:

1) Make the virtual WC guest controllable. If a guest enables WC, &= ~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user specify the virtual WC mode, but it has to be changeable during live migration.

2) Only let the user choose between using and not using the host page cache. IOW, direct=on|off. cache=XXX is deprecated.

3) Make O_DIRECT | O_DSYNC not suck so badly on ext4.

>>> Barriers are a Linux-specific implementation detail that is in the process of going away, probably in Linux 2.6.37. But if you want O_DSYNC semantics with a volatile disk write cache there is no way around using a cache flush or the FUA bit on all I/O caused by it.
>>
>> If you have a volatile disk write cache, then we don't need O_DSYNC semantics.
>
> If you present a volatile write cache to the guest you do indeed not need O_DSYNC and can rely on the guest sending fdatasync calls when it wants to flush the cache. But for the statement above you can replace O_DSYNC with fdatasync and it will still be correct. O_DSYNC in current Linux kernels is nothing but an implicit range fdatasync after each write.

Yes. I was stuck on O_DSYNC being independent of the virtual WC, but it's clear to me now that it cannot be.

>>> ext3 and ext4 have really bad fsync implementations. Just use a better filesystem or bug one of its developers if you want that fixed. But except for disabling the disk cache there is no way to get data integrity without cache flushes (the FUA bit is nothing but an implicit flush).
>>
>> But why are we issuing more flushes than the guest is issuing if we don't have to worry about filesystem metadata (i.e. preallocated storage or physical devices)?
>
> Who is "we", and what is the workload/filesystem/kernel combination? Specific details and numbers please.

My concern is ext4. With a preallocated file and cache=none as implemented today, performance is good even when barrier=1. If we enable O_DSYNC, performance will plummet. Ultimately, this is an ext4 problem, not a QEMU problem.

Perhaps we can issue a warning if the WC is disabled and we do an fsstat and see that it's ext4 with barriers enabled.

I think it's more common for a user to want to disable a virtual WC because they have less faith in the hypervisor than they have in the underlying storage. The scenarios I am concerned about:

1) User has enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in the guest because they have enterprise storage but not a UPS for the hypervisor.

2) User does not have enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in the guest because they don't know what they're doing.

In the case of (1), the answer may be "ext4 sucks, remount with barrier=0", but I think we need to at least warn the user of this.

For (2), again it's probably the user doing the wrong thing, because if they don't have enterprise storage, then they shouldn't care about a virtual WC. Practically though, I've seen a lot of this with users.

Regards,

Anthony Liguori
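As a sketch of point 1 above (not the actual patches; set_guest_wce and the reopen policy are illustrative only): since O_DSYNC generally cannot be changed via fcntl(F_SETFL) on Linux, one way to react to the guest flipping its write cache setting is to reopen the image with recomputed flags:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>
    #include <unistd.h>

    /* WC on drops O_DSYNC, WC off adds it; "direct" remains a host-side
     * choice.  Reopen and dup2() so callers keep the same fd number. */
    static int set_guest_wce(int fd, const char *path, bool direct, bool wce)
    {
        int flags = O_RDWR | (direct ? O_DIRECT : 0) | (wce ? 0 : O_DSYNC);
        int newfd = open(path, flags);

        if (newfd < 0)
            return -1;
        if (dup2(newfd, fd) < 0) {
            close(newfd);
            return -1;
        }
        close(newfd);
        return 0;
    }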
Subject: [Qemu-devel] Re: Caching modes
From: Christoph Hellwig
Date: 2010-09-21 20:57 UTC
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, qemu-devel

On Tue, Sep 21, 2010 at 10:13:01AM -0500, Anthony Liguori wrote:
> 1) Make the virtual WC guest controllable. If a guest enables WC, &= ~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user specify the virtual WC mode, but it has to be changeable during live migration.

I have patches for that which are almost ready to submit.

> 2) Only let the user choose between using and not using the host page cache. IOW, direct=on|off. cache=XXX is deprecated.

Also done by that patch series. That's exactly what I described two mail roundtrips ago.

> My concern is ext4. With a preallocated file and cache=none as implemented today, performance is good even when barrier=1. If we enable O_DSYNC, performance will plummet. Ultimately, this is an ext4 problem, not a QEMU problem.

For Linux or Windows guests WCE=0 is not a particularly good default, given that they can deal with write caches, and it mirrors the situation with consumer SATA disks. For older Unix guests you'll need to be able to persistently disable the write cache.

To make things more confusing, the default ATA/SATA way to tune the volatile write cache setting is not persistent - e.g. if you disable it using hdparm it will come up enabled again.

> 2) User does not have enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in the guest because they don't know what they're doing.
>
> For (2), again it's probably the user doing the wrong thing, because if they don't have enterprise storage, then they shouldn't care about a virtual WC. Practically though, I've seen a lot of this with users.

This setting is just fine, especially if using O_DIRECT. The guest sends cache flush requests often enough that it is not a problem. If you do not use O_DIRECT in that scenario, a lot more data will be cached in theory - but any filesystem aware of cache flushes will flush frequently enough that it is not a problem. It is a real problem, however, when using ext3 in its default setting in the guest, which doesn't use barriers. But that's a bug in ext3, and nothing but petitioning its maintainer to fix it will help you there.
Subject: [Qemu-devel] Re: Caching modes
From: Anthony Liguori
Date: 2010-09-21 21:27 UTC
To: Christoph Hellwig
Cc: Kevin Wolf, qemu-devel

On 09/21/2010 03:57 PM, Christoph Hellwig wrote:
> On Tue, Sep 21, 2010 at 10:13:01AM -0500, Anthony Liguori wrote:
>> 1) Make the virtual WC guest controllable. If a guest enables WC, &= ~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user specify the virtual WC mode, but it has to be changeable during live migration.
>
> I have patches for that which are almost ready to submit.
>
>> 2) Only let the user choose between using and not using the host page cache. IOW, direct=on|off. cache=XXX is deprecated.
>
> Also done by that patch series. That's exactly what I described two mail roundtrips ago.

Yes.

>> My concern is ext4. With a preallocated file and cache=none as implemented today, performance is good even when barrier=1. If we enable O_DSYNC, performance will plummet. Ultimately, this is an ext4 problem, not a QEMU problem.
>
> For Linux or Windows guests WCE=0 is not a particularly good default, given that they can deal with write caches, and it mirrors the situation with consumer SATA disks. For older Unix guests you'll need to be able to persistently disable the write cache.
>
> To make things more confusing, the default ATA/SATA way to tune the volatile write cache setting is not persistent - e.g. if you disable it using hdparm it will come up enabled again.

Yes, potentially, we could save this in a config file (and really, I mean libvirt could save it).

>> 2) User does not have enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in the guest because they don't know what they're doing.
>>
>> For (2), again it's probably the user doing the wrong thing, because if they don't have enterprise storage, then they shouldn't care about a virtual WC. Practically though, I've seen a lot of this with users.
>
> This setting is just fine, especially if using O_DIRECT. The guest sends cache flush requests often enough that it is not a problem. If you do not use O_DIRECT in that scenario, a lot more data will be cached in theory - but any filesystem aware of cache flushes will flush frequently enough that it is not a problem. It is a real problem, however, when using ext3 in its default setting in the guest, which doesn't use barriers. But that's a bug in ext3, and nothing but petitioning its maintainer to fix it will help you there.

It's not just ext3, it's ext4 with barrier=0, which is what certain applications are being told to do in the face of poor performance. So direct=on,wc=on + ext4 barrier=0 in the guest is less safe than ext4 barrier=0 on bare metal.

Very specifically, if we do cache=none as we do today, and within the guest we have ext4 barrier=0 and run DB2, DB2's guarantees are weaker than they are on bare metal because metadata is not getting flushed.

To resolve this, we need to do direct=on,wc=off + ext4 barrier=0 on the host. This is safe and should perform reasonably well, but there's far too much complexity for a user to get to this point.

Regards,

Anthony Liguori