* [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-09 17:00 UTC
To: qemu-devel@nongnu.org
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper

Hi,

There's been a lot of discussion recently, mostly in other places, about disk integrity and performance in QEMU. I must admit, my own thinking has changed pretty recently in this space. I wanted to try and focus the conversation on qemu-devel so that we could get everyone involved and come up with a plan for the future.

Right now, QEMU can open a file in two ways. It can open it without any special caching flags (the default) or it can open it O_DIRECT. O_DIRECT implies that the IO does not go through the host page cache. This is controlled with cache=on and cache=off respectively.

When cache=on, read requests may not actually go to the disk. If a previous read request (by some application on the system) has read the same data, then it becomes a simple memcpy(). Also, the host IO scheduler may do read-ahead, which means that the data may be available from that. In general, the host knows the most about the underlying disk system and the total IO load on the system, so it is far better suited to optimize these sorts of things than the guest.

Write requests end up being simple memcpy()s too, as the data is just copied into the page cache and the page is scheduled to be eventually written to disk. Since we don't know when the data is actually written to disk, we tell the guest the data is written before it actually is. If you assume that the host is stable, then there isn't an integrity issue. This assumes that you have backup power and that the host OS has no bugs. It's not a totally unreasonable assumption, but for a large number of users, it's not a good assumption.

A side effect of cache=off is that data integrity only depends on the integrity of your storage system (which isn't always safe, btw), which is probably closer to what most users expect. There are many other side effects, though.

An alternative to cache=off that addresses the data integrity problem directly is to open all disk images with O_DSYNC. This will still use the host page cache (and therefore get all the benefits of it) but will only signal write completion when the data is actually written to disk. The effect of this is to make the integrity of the VM equal the integrity of the storage system (no longer relying on the host). By still going through the page cache, you still get the benefits of the host's IO scheduler and read-ahead.

The only place performance is affected is writes (reads are equivalent). If you run a write benchmark in a guest today, you'll see a number that is higher than native. The implication here is that data integrity is not being maintained if you don't trust the host. O_DSYNC takes care of this.

Read performance should be unaffected by using O_DSYNC. O_DIRECT will significantly reduce read performance. I think we should use O_DSYNC by default and I have sent out a patch that contains that. We will follow up with benchmarks to demonstrate this.

There are certain benefits to using O_DIRECT. One argument for using O_DIRECT is that going through the host page cache requires allocating host memory to perform IO. If you are not sharing data between guests, and the guest has a relatively large amount of memory compared to the host, and you have a simple disk in the host, going through the host page cache wastes some memory that could be used to cache other IO operations on the system. I don't really think this is the typical case, so I don't think this is an argument for having it on by default. However, it can be enabled if you know this is going to be the case.

The biggest benefit to using O_DIRECT is that you can potentially avoid ever bringing data into the CPU's cache. Once data is cached, copying it is relatively cheap. If you're never going to touch the data (think disk DMA => nic DMA via sendfile()), then avoiding the CPU cache can be a big win. Again, I don't think this is the common case, but the option is there in case it's suitable. An important point is that today we always copy data internally in QEMU, which means that, practically speaking, you'll never see this benefit.

So to summarize, I think we should enable O_DSYNC by default to ensure that guest data integrity is not dependent on the host OS, and that practically speaking, cache=off is only useful for very specialized circumstances. Part of the patch I'll follow up with includes changes to the man page to document all of this for users.

Thoughts?

Regards,

Anthony Liguori

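For reference, a minimal sketch of how the three policies discussed above map onto open(2) flags on Linux. The flag names are real; the enum and helper function are purely illustrative and are not QEMU's block-layer code.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>

enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_NONE };

int open_disk_image(const char *path, enum cache_mode mode)
{
    int flags = O_RDWR;

    switch (mode) {
    case CACHE_WRITETHROUGH:
        flags |= O_DSYNC;   /* completion reported only once the host
                               storage subsystem acknowledges the write */
        break;
    case CACHE_NONE:
        flags |= O_DIRECT;  /* bypass the host page cache entirely */
        break;
    case CACHE_WRITEBACK:
    default:
        break;              /* default: write-back via the host page cache */
    }
    return open(path, flags);
}
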
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Gerd Hoffmann @ 2008-10-10  7:54 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

  Hi,

> Read performance should be unaffected by using O_DSYNC. O_DIRECT will
> significantly reduce read performance. I think we should use O_DSYNC by
> default and I have sent out a patch that contains that. We will follow
> up with benchmarks to demonstrate this.

So O_SYNC on/off is pretty much equivalent to disk write caching being on/off, right? So we could make that guest-controlled, i.e. toggling write caching in the guest (using hdparm) toggles O_SYNC in qemu? This together with disk-flush command support (mapping to fsync on the host) should allow guests to go into barrier mode for better write performance without losing data integrity.

cheers,
  Gerd

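To make that mapping concrete, here is a rough sketch in plain C of what Gerd's proposal could look like on the host side. The function names are made up for illustration and this is not qemu's actual IDE emulation code.

#include <fcntl.h>
#include <unistd.h>

/* Called when the guest toggles the ATA write-cache-enable bit
 * (hdparm -W0 / -W1).  Reopen the backing image with or without
 * O_DSYNC and return the new descriptor. */
int ide_set_write_cache(const char *image_path, int old_fd, int wce)
{
    int flags = O_RDWR | (wce ? 0 : O_DSYNC);
    int new_fd = open(image_path, flags);

    if (new_fd < 0)
        return old_fd;      /* keep the old descriptor on failure */
    close(old_fd);
    return new_fd;
}

/* Called when the guest issues ATA FLUSH CACHE (e.g. for a barrier). */
int ide_flush_cache(int image_fd)
{
    return fsync(image_fd);
}
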
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Mark McLoughlin @ 2008-10-10  8:12 UTC
To: Gerd Hoffmann
Cc: Chris Wright, kvm-devel, Ryan Harper, qemu-devel, Laurent Vivier

On Fri, 2008-10-10 at 09:54 +0200, Gerd Hoffmann wrote:
> Hi,
>
> > Read performance should be unaffected by using O_DSYNC. O_DIRECT will
> > significantly reduce read performance. I think we should use O_DSYNC by
> > default and I have sent out a patch that contains that. We will follow
> > up with benchmarks to demonstrate this.
>
> So O_SYNC on/off is pretty much equivalent to disk write caching being
> on/off, right? So we could make that guest-controlled, i.e. toggling
> write caching in the guest (using hdparm) toggles O_SYNC in qemu?

I don't think it's correct to equate disk write caching to completing guest writes when the data has been copied to the host's page cache. The host's page cache will cache much more data for much longer than a typical disk, right?

If so, then this form of write caching is much more likely to result in fs corruption if the host crashes. In that case, all qemu users would really need to disable write caching in the guest using hdparm, which they don't need to do on bare-metal.

Cheers,
Mark.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Jamie Lokier @ 2008-10-12 23:10 UTC
To: Mark McLoughlin, qemu-devel
Cc: Chris Wright, kvm-devel, Ryan Harper, Gerd Hoffmann, Laurent Vivier

Mark McLoughlin wrote:
> > So O_SYNC on/off is pretty much equivalent to disk write caching being
> > on/off, right? So we could make that guest-controlled, i.e. toggling
> > write caching in the guest (using hdparm) toggles O_SYNC in qemu?
>
> I don't think it's correct to equate disk write caching to completing
> guest writes when the data has been copied to the host's page cache. The
> host's page cache will cache much more data for much longer than a
> typical disk, right?
>
> If so, then this form of write caching is much more likely to result in
> fs corruption if the host crashes. In that case, all qemu users would
> really need to disable write caching in the guest using hdparm, which
> they don't need to do on bare-metal.

However, should the effect of the guest turning off the IDE disk write cache perhaps be identical to the guest issuing IDE cache flush commands following every IDE write?

This could mean the host calling fdatasync, or fsync, or using O_DSYNC, or O_DIRECT - whatever the host does for IDE flush cache.

What this means _exactly_ for data integrity is outside of qemu's control and is a user & host configuration issue. But qemu could provide consistency at least.

-- Jamie

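A sketch of Jamie's per-write variant, again with hypothetical function names rather than qemu's real block API: when the emulated drive's write cache is disabled, behave as if every completed write were followed by a flush.

#define _XOPEN_SOURCE 500       /* for pwrite()/fdatasync() */
#include <sys/types.h>
#include <unistd.h>

/* wce: current state of the emulated drive's write-cache-enable bit */
ssize_t ide_write_sectors(int image_fd, const void *buf, size_t len,
                          off_t offset, int wce)
{
    ssize_t ret = pwrite(image_fd, buf, len, offset);

    if (ret >= 0 && !wce)
        fdatasync(image_fd);    /* cache off: don't report completion
                                   until the host has flushed the data */
    return ret;
}
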
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-14 17:15 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Gerd Hoffmann

Jamie Lokier wrote:
> However, should the effect of the guest turning off the IDE disk write
> cache perhaps be identical to the guest issuing IDE cache flush commands
> following every IDE write?
>
> This could mean the host calling fdatasync, or fsync, or using
> O_DSYNC, or O_DIRECT - whatever the host does for IDE flush cache.
>
> What this means _exactly_ for data integrity is outside of qemu's
> control and is a user & host configuration issue. But qemu could
> provide consistency at least.

We should completely ignore the guest IDE write cache. It was brought into life by the deficiencies of IDE which presented the user with an impossible tradeoff -- you can choose between data loss and horrible performance.

Since modern hardware doesn't require this tradeoff, there is no reason to force the user to make these choices.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10  9:32 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Gerd Hoffmann wrote:
> Hi,
>
> > Read performance should be unaffected by using O_DSYNC. O_DIRECT will
> > significantly reduce read performance. I think we should use O_DSYNC by
> > default and I have sent out a patch that contains that. We will follow
> > up with benchmarks to demonstrate this.
>
> So O_SYNC on/off is pretty much equivalent to disk write caching being
> on/off, right? So we could make that guest-controlled, i.e. toggling
> write caching in the guest (using hdparm) toggles O_SYNC in qemu? This
> together with disk-flush command support (mapping to fsync on the host)
> should allow guests to go into barrier mode for better write performance
> without losing data integrity.

IDE write caching is very different from host write caching.

The IDE write cache is not susceptible to software failures (well, it is susceptible to firmware failures, but let's ignore that). It is likely to survive reset and perhaps even powerdown. The risk window is a few megabytes and tens of milliseconds long.

The host pagecache will not survive software failures, resets, or powerdown. The risk window is hundreds of megabytes and thousands of milliseconds long.

It's perfectly normal to leave a production system on IDE (though perhaps not a mission-critical database), but totally mad to do so with host caching.

I don't think we should tie data integrity to an IDE misfeature that doesn't even exist anymore (with the advent of SATA NCQ).

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Jamie Lokier @ 2008-10-12 23:00 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Avi Kivity wrote:
> The IDE write cache is not susceptible to software failures (well it is
> susceptible to firmware failures, but let's ignore that). It is likely
> to survive reset and perhaps even powerdown. The risk window is a few
> megabytes and tens of milliseconds long.

Nonetheless, from yanking the power relatively often while using ext3 (this is on a host only, no qemu involved) I've seen a number of corruption cases, and these all went away when the IDE write cache was disabled, or when IDE write barriers were used.

This is a failure case which happens in real life, but not often if you don't often yank the power during writes. Just so you know.

-- Jamie

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Aurelien Jarno @ 2008-10-10  8:11 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

On Thu, Oct 09, 2008 at 12:00:41PM -0500, Anthony Liguori wrote:

[snip]

> So to summarize, I think we should enable O_DSYNC by default to ensure
> that guest data integrity is not dependent on the host OS, and that
> practically speaking, cache=off is only useful for very specialized
> circumstances. Part of the patch I'll follow up with includes changes
> to the man page to document all of this for users.
>
> Thoughts?

While I agree O_DSYNC should be the default, I wonder if we should keep the current behaviour available for those who want it. We can imagine the following options:

  cache=off    O_DIRECT
  cache=read   O_DSYNC (default)
  cache=on     0

-- 
  .''`.  Aurelien Jarno              | GPG: 1024D/F1BCDB73
 : :' :  Debian developer            | Electrical Engineer
 `. `'   aurel32@debian.org          | aurelien@aurel32.net
   `-    people.debian.org/~aurel32  | www.aurel32.net

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-10 12:26 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Aurelien Jarno wrote:
> On Thu, Oct 09, 2008 at 12:00:41PM -0500, Anthony Liguori wrote:
>
> [snip]
>
>> So to summarize, I think we should enable O_DSYNC by default to ensure
>> that guest data integrity is not dependent on the host OS, and that
>> practically speaking, cache=off is only useful for very specialized
>> circumstances. Part of the patch I'll follow up with includes changes
>> to the man page to document all of this for users.
>>
>> Thoughts?
>
> While I agree O_DSYNC should be the default, I wonder if we should keep
> the current behaviour available for those who want it. We can imagine
> the following options:
>   cache=off    O_DIRECT
>   cache=read   O_DSYNC (default)
>   cache=on     0

Or maybe cache=off, cache=on, cache=wb. So that the default would be cache=on, which is write-through, or the user can choose write-back caching.

But that said, I'm concerned that this is far too confusing for users. I don't think anyone is relying on disk write performance when in write-back mode simply because the guest already has a page cache, so writes are already being completed instantaneously from the application's perspective.

Regards,

Anthony Liguori

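For illustration, the option-to-flag mapping being discussed could look roughly like this. The option names (cache=off/on/wb vs. cache=off/read/on) were still being debated at this point in the thread, so treat both the names and the helper as placeholders rather than the final command-line interface.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <string.h>

int cache_option_to_open_flags(const char *value)
{
    if (strcmp(value, "off") == 0)
        return O_DIRECT;        /* no host page cache at all */
    if (strcmp(value, "wb") == 0)
        return 0;               /* write-back through the host page cache */
    /* "on" (the proposed default): write-through via the host page cache */
    return O_DSYNC;
}
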
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Paul Brook @ 2008-10-10 12:53 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper

> But that said, I'm concerned that this is far too confusing for users.
> I don't think anyone is relying on disk write performance when in
> write-back mode simply because the guest already has a page cache so
> writes are already being completed instantaneously from the
> application's perspective.

This isn't entirely true. With IDE devices you don't have command queueing, so it's easy for a large write to stall subsequent reads for a relatively long time. I'm not sure how much this affects qemu, but I've definitely seen it happening on real hardware.

Paul

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-10 13:55 UTC
To: Paul Brook
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper

Paul Brook wrote:
>> But that said, I'm concerned that this is far too confusing for users.
>> I don't think anyone is relying on disk write performance when in
>> write-back mode simply because the guest already has a page cache so
>> writes are already being completed instantaneously from the
>> application's perspective.
>
> This isn't entirely true. With IDE devices you don't have command queueing, so
> it's easy for a large write to stall subsequent reads for a relatively long
> time.
> I'm not sure how much this affects qemu, but I've definitely seen it happening
> on real hardware.

I think that suggests we should have a cache=wb option and if people report slowdowns with IDE, we can observe if cache=wb helps. My suspicion is that it's not going to have a practical impact because as long as the operations are asynchronous (via DMA), then you're getting native-like performance.

My bigger concern is synchronous IO operations because then a guest VCPU is getting far less time to run and that may have a cascading effect on performance.

Anyway, I'll work up a new patch with cache=wb and repost.

Regards,

Anthony Liguori

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Paul Brook @ 2008-10-10 14:05 UTC
To: Anthony Liguori
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper

On Friday 10 October 2008, Anthony Liguori wrote:
> Paul Brook wrote:
> >> But that said, I'm concerned that this is far too confusing for users.
> >> I don't think anyone is relying on disk write performance when in
> >> write-back mode simply because the guest already has a page cache so
> >> writes are already being completed instantaneously from the
> >> application's perspective.
> >
> > This isn't entirely true. With IDE devices you don't have command
> > queueing, so it's easy for a large write to stall subsequent reads for a
> > relatively long time.
> > I'm not sure how much this affects qemu, but I've definitely seen it
> > happening on real hardware.
>
> I think that suggests we should have a cache=wb option and if people
> report slowdowns with IDE, we can observe if cache=wb helps. My
> suspicion is that it's not going to have a practical impact because as
> long as the operations are asynchronous (via DMA), then you're getting
> native-like performance.

Sounds reasonable to me.

Paul

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 14:19 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Paul Brook

Anthony Liguori wrote:
>> This isn't entirely true. With IDE devices you don't have command
>> queueing, so it's easy for a large write to stall subsequent reads
>> for a relatively long time.
>> I'm not sure how much this affects qemu, but I've definitely seen it
>> happening on real hardware.
>
> I think that suggests we should have a cache=wb option and if people
> report slowdowns with IDE, we can observe if cache=wb helps. My
> suspicion is that it's not going to have a practical impact because as
> long as the operations are asynchronous (via DMA), then you're getting
> native-like performance.
>
> My bigger concern is synchronous IO operations because then a guest
> VCPU is getting far less time to run and that may have a cascading
> effect on performance.

IDE is limited to 256 sectors per transaction, or 128KB. If a sync transaction takes 5 ms, then your write rate is limited to 25 MB/sec. It's much worse if you're allocating qcow2 data, so each transaction is several sync writes.

Fabrice's point also holds: if the guest is issuing many write transactions for some reason, you don't want them hammering the disk and killing your desktop performance if you're just developing, say, a new filesystem.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

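A quick check of the arithmetic behind those figures:

  256 sectors x 512 bytes = 128 KiB per transaction
  128 KiB / 5 ms = roughly 25.6 MB/s, hence the "25 MB/sec" ceiling above

(The LBA48 limit raised in the follow-up below is 65536 sectors x 512 bytes = 32 MiB per transaction.)
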
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Jens Axboe @ 2008-10-17 13:14 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Paul Brook

On Fri, Oct 10 2008, Avi Kivity wrote:
> Anthony Liguori wrote:
> >> This isn't entirely true. With IDE devices you don't have command
> >> queueing, so it's easy for a large write to stall subsequent reads
> >> for a relatively long time.
> >> I'm not sure how much this affects qemu, but I've definitely seen it
> >> happening on real hardware.
> >
> > I think that suggests we should have a cache=wb option and if people
> > report slowdowns with IDE, we can observe if cache=wb helps. My
> > suspicion is that it's not going to have a practical impact because as
> > long as the operations are asynchronous (via DMA), then you're getting
> > native-like performance.
> >
> > My bigger concern is synchronous IO operations because then a guest
> > VCPU is getting far less time to run and that may have a cascading
> > effect on performance.
>
> IDE is limited to 256 sectors per transaction, or 128KB. If a sync
> transaction takes 5 ms, then your write rate is limited to 25 MB/sec.
> It's much worse if you're allocating qcow2 data, so each transaction is
> several sync writes.

No it isn't, even most IDE drives support lba48 which raises that limit to 64K sectors, or 32MB.

-- 
Jens Axboe

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-19  9:13 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Paul Brook

Jens Axboe wrote:
>> IDE is limited to 256 sectors per transaction, or 128KB. If a sync
>> transaction takes 5 ms, then your write rate is limited to 25 MB/sec.
>> It's much worse if you're allocating qcow2 data, so each transaction is
>> several sync writes.
>
> No it isn't, even most IDE drives support lba48 which raises that limit
> to 64K sectors, or 32MB.

Right, and qemu even supports this. Thanks for the correction.

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Aurelien Jarno @ 2008-10-10 15:48 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

On Fri, Oct 10, 2008 at 07:26:00AM -0500, Anthony Liguori wrote:
> Aurelien Jarno wrote:
>> On Thu, Oct 09, 2008 at 12:00:41PM -0500, Anthony Liguori wrote:
>>
>> [snip]
>>
>>> So to summarize, I think we should enable O_DSYNC by default to
>>> ensure that guest data integrity is not dependent on the host OS,
>>> and that practically speaking, cache=off is only useful for very
>>> specialized circumstances. Part of the patch I'll follow up with
>>> includes changes to the man page to document all of this for users.
>>>
>>> Thoughts?
>>
>> While I agree O_DSYNC should be the default, I wonder if we should keep
>> the current behaviour available for those who want it. We can imagine
>> the following options:
>>   cache=off    O_DIRECT
>>   cache=read   O_DSYNC (default)
>>   cache=on     0
>
> Or maybe cache=off, cache=on, cache=wb. So that the default would be
> cache=on, which is write-through, or the user can choose write-back
> caching.
>
> But that said, I'm concerned that this is far too confusing for users.
> I don't think anyone is relying on disk write performance when in
> write-back mode simply because the guest already has a page cache so
> writes are already being completed instantaneously from the
> application's perspective.

Some of my setups rely on host cache. I am using a swap partition for some guests in order to increase the available "memory" (some platforms in qemu are limited to 256MB of RAM), and in that case I don't care about data integrity.

-- 
  .''`.  Aurelien Jarno              | GPG: 1024D/F1BCDB73
 : :' :  Debian developer            | Electrical Engineer
 `. `'   aurel32@debian.org          | aurelien@aurel32.net
   `-    people.debian.org/~aurel32  | www.aurel32.net

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10  9:16 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Anthony Liguori wrote:

[O_DSYNC, O_DIRECT, and 0]

> Thoughts?

There are (at least) three usage models for qemu:

- OS development tool
- casual or client-side virtualization
- server partitioning

The last two uses are almost always in conjunction with a hypervisor.

When using qemu as an OS development tool, data integrity is not very important. On the other hand, performance and caching are, especially as the guest is likely to be restarted multiple times, so the guest page cache is of limited value. For this use model the current default (write-back cache) is fine.

The 'casual virtualization' use is when the user has a full native desktop, and is also running another operating system. In this case, the host page cache is likely to be larger than the guest page cache. Data integrity is important, so write-back is out of the picture. I guess for this use case O_DSYNC is preferred, though O_DIRECT might not be significantly slower for long-running guests. This is because reads are unlikely to be cached and writes will not benefit much from the host pagecache.

For server partitioning, data integrity and performance are critical. The host page cache is significantly smaller than the guest page cache; if you have spare memory, give it to your guests. O_DIRECT is practically mandated here; the host page cache does nothing except to impose an additional copy.

Given the rather small difference between O_DSYNC and O_DIRECT, I favor not adding O_DSYNC as it will add only marginal value.

Regarding choosing the default value, I think we should change the default to be safe, that is O_DIRECT. If that is regarded as too radical, the default should be O_DSYNC with options to change it to O_DIRECT or writeback. Note that some disk formats will need updating, like qcow2, if they are not to have abysmal performance.

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Daniel P. Berrange @ 2008-10-10  9:58 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

On Fri, Oct 10, 2008 at 11:16:05AM +0200, Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
> > Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important. On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest page
> cache is of limited value. For this use model the current default
> (write-back cache) is fine.

It is a myth that developers don't care about data consistency / crash safety. I've lost countless guest VMs to corruption when my host OS crashed & it's just a waste of my time. Given the choice between likely-to-corrupt and not-likely-to-corrupt, even developers will want the latter.

> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system. In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture. I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests. This is because reads
> are unlikely to be cached and writes will not benefit much from the host
> pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page cache;
> if you have spare memory, give it to your guests. O_DIRECT is
> practically mandated here; the host page cache does nothing except to
> impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I favor
> not adding O_DSYNC as it will add only marginal value.
>
> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT. If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback. Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.

Absolutely agree that the default should be safe. I don't have enough knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies we should choose the best setting by default, because we can't expect users to know the tradeoffs either.

Daniel

-- 
|: Red Hat, Engineering, London  -o-  http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org  -o-  http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 10:26 UTC
To: Daniel P. Berrange, qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Daniel P. Berrange wrote:
>> There are (at least) three usage models for qemu:
>>
>> - OS development tool
>> - casual or client-side virtualization
>> - server partitioning
>>
>> The last two uses are almost always in conjunction with a hypervisor.
>>
>> When using qemu as an OS development tool, data integrity is not very
>> important. On the other hand, performance and caching are, especially
>> as the guest is likely to be restarted multiple times so the guest page
>> cache is of limited value. For this use model the current default
>> (write-back cache) is fine.
>
> It is a myth that developers don't care about data consistency / crash
> safety. I've lost countless guest VMs to corruption when my host OS
> crashed & it's just a waste of my time. Given the choice between
> likely-to-corrupt and not-likely-to-corrupt, even developers will
> want the latter.

There are other data integrity solutions for developers, like backups (unlikely, I know) or -snapshot.

> Absolutely agree that the default should be safe. I don't have enough
> knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies
> we should choose the best setting by default, because we can't expect
> users to know the tradeoffs either.

The fact that there are different use models for qemu implies that the default must be chosen at some higher level than qemu code itself. It might be done using /etc/qemu or ~/.qemu, or at the management interface, but there is no best setting for qemu itself.

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Paul Brook @ 2008-10-10 12:59 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Avi Kivity

>> - OS development tool
>> - casual or client-side virtualization
>> - server partitioning

> > Absolutely agree that the default should be safe. I don't have enough
> > knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies
> > we should choose the best setting by default, because we can't expect
> > users to know the tradeoffs either.
>
> The fact that there are different use models for qemu implies that the
> default must be chosen at some higher level than qemu code itself. It
> might be done using /etc/qemu or ~/.qemu, or at the management
> interface, but there is no best setting for qemu itself.

This suggests that the most appropriate defaults are for the users that are least likely to be using a management tool. I'd guess that the server partitioning folks are most likely to be using a management tool, so qemu defaults should be set up for casual/development use. I don't have hard data to back this up though.

Paul

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 13:20 UTC
To: Paul Brook
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper

Paul Brook wrote:
>>> Absolutely agree that the default should be safe. I don't have enough
>>> knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies
>>> we should choose the best setting by default, because we can't expect
>>> users to know the tradeoffs either.
>>
>> The fact that there are different use models for qemu implies that the
>> default must be chosen at some higher level than qemu code itself. It
>> might be done using /etc/qemu or ~/.qemu, or at the management
>> interface, but there is no best setting for qemu itself.
>
> This suggests that the most appropriate defaults are for the users that are
> least likely to be using a management tool. I'd guess that the server
> partitioning folks are most likely to be using a management tool, so qemu
> defaults should be set up for casual/development use. I don't have hard data
> to back this up though.

I agree (as my own uses are of the development kind). That rules out O_DIRECT as the qemu-level default. However I'm not sure writeback is a good default, it's too risky (though I've never been bitten; and I've had my share of host crashes).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-10 12:34 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
>> Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important. On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest
> page cache is of limited value. For this use model the current
> default (write-back cache) is fine.
>
> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system. In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture. I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests. This is because
> reads are unlikely to be cached and writes will not benefit much from
> the host pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page
> cache; if you have spare memory, give it to your guests.

I don't think this wisdom is bullet-proof. In the case of server partitioning, if you're designing for the future then you can assume some form of host data deduplication, either through qcow deduplication, a proper content addressable storage mechanism, or file system level deduplication. It's becoming more common to see large amounts of homogeneous consolidation, either because of cloud computing, virtual appliances, or just because most x86 virtualization involves Windows consolidation and there aren't that many versions of Windows.

In this case, there is an awful lot of opportunity for increasing overall system throughput by caching common data access across virtual machines.

> O_DIRECT is practically mandated here; the host page cache does
> nothing except to impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I
> favor not adding O_DSYNC as it will add only marginal value.

The difference isn't small. Our fio runs are defeating the host page cache on write so we're adjusting the working set size. But the difference in read performance between dsync and direct is many factors when the data can be cached.

> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT. If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback. Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.

I think qcow2 will be okay because the only issue is image expansion and that is a relatively uncommon case that is amortized throughout the lifetime of the VM.

So far, while there is objection to using O_DIRECT by default, I haven't seen any objection to O_DSYNC by default, so as long as no one objects in the next few days, I think that's what we'll end up doing.

Regards,

Anthony Liguori

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 12:56 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Anthony Liguori wrote:
>> For server partitioning, data integrity and performance are
>> critical. The host page cache is significantly smaller than the
>> guest page cache; if you have spare memory, give it to your guests.
>
> I don't think this wisdom is bullet-proof. In the case of server
> partitioning, if you're designing for the future then you can assume
> some form of host data deduplication, either through qcow
> deduplication, a proper content addressable storage mechanism, or
> file system level deduplication. It's becoming more common to see
> large amounts of homogeneous consolidation, either because of cloud
> computing, virtual appliances, or just because most x86 virtualization
> involves Windows consolidation and there aren't that many versions of
> Windows.
>
> In this case, there is an awful lot of opportunity for increasing
> overall system throughput by caching common data access across virtual
> machines.

That's true. But is the OS image a significant source of I/O in a running system? My guess is that it is not.

In any case, deduplication is far enough into the future to not attempt to solve it now. The solution may be part of the deduplication solution itself, for example it may choose to cache shared data (since they are read-only anyway) even with O_DIRECT.

>> O_DIRECT is practically mandated here; the host page cache does
>> nothing except to impose an additional copy.
>>
>> Given the rather small difference between O_DSYNC and O_DIRECT, I
>> favor not adding O_DSYNC as it will add only marginal value.
>
> The difference isn't small. Our fio runs are defeating the host page
> cache on write so we're adjusting the working set size. But the
> difference in read performance between dsync and direct is many
> factors when the data can be cached.

That's because you're leaving host memory idle. That's not a realistic scenario. What happens if you assign free host memory to the guest?

>> Regarding choosing the default value, I think we should change the
>> default to be safe, that is O_DIRECT. If that is regarded as too
>> radical, the default should be O_DSYNC with options to change it to
>> O_DIRECT or writeback. Note that some disk formats will need
>> updating like qcow2 if they are not to have abysmal performance.
>
> I think qcow2 will be okay because the only issue is image expansion
> and that is a relatively uncommon case that is amortized throughout
> the lifetime of the VM. So far, while there is objection to using
> O_DIRECT by default, I haven't seen any objection to O_DSYNC by
> default, so as long as no one objects in the next few days, I think
> that's what we'll end up doing.

I don't mind that as long as there is a way to request O_DIRECT (which I think is cache=off under your proposal).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: andrzej zaborowski @ 2008-10-11  9:07 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

2008/10/10 Anthony Liguori <anthony@codemonkey.ws>:
> I think qcow2 will be okay because the only issue is image expansion and
> that is a relatively uncommon case that is amortized throughout the life
> time of the VM.

It's debatable how common this is and whether you can count on the amortization. I'd say that for most users creating new short-lived VMs is the bigger slice of their time using qemu. For example, think about trying out different distros like with free.oszoo.org; most images there are qcow2. Similarly, trying to install an OS and booting its kernel with different options in sequence is where waiting is most annoying. Also -snapshot uses qcow2.

In any case let's have benchmarks before deciding anything about changing the default behavior. Since about 0.9.0 qemu has been going through a lot of (necessary) changes that were in great part slowdowns, and they really accumulated.

Regards

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Mark Wagner @ 2008-10-11 17:54 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
>> Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important. On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest page
> cache is of limited value. For this use model the current default
> (write-back cache) is fine.
>
> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system. In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture. I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests. This is because reads
> are unlikely to be cached and writes will not benefit much from the host
> pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page cache;
> if you have spare memory, give it to your guests. O_DIRECT is
> practically mandated here; the host page cache does nothing except to
> impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I favor
> not adding O_DSYNC as it will add only marginal value.
>
> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT. If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback. Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.

I think one of the main things to be considered is the integrity of the actual system call. The Linux manpage for open() states the following about the use of the O_DIRECT flag:

  O_DIRECT (Since Linux 2.6.10)
      Try to minimize cache effects of the I/O to and from this file. In
      general this will degrade performance, but it is useful in special
      situations, such as when applications do their own caching. File
      I/O is done directly to/from user space buffers. The I/O is
      synchronous, that is, at the completion of a read(2) or write(2),
      data is guaranteed to have been transferred. Under Linux 2.4
      transfer sizes, and the alignment of user buffer and file offset
      must all be multiples of the logical block size of the file system.
      Under Linux 2.6 alignment to 512-byte boundaries suffices.

If I focus on the sentence "The I/O is synchronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred.", I think there is a bug here. If I open a file with the O_DIRECT flag and the host reports back to me that the transfer has completed when in fact it's still in the host cache, it's a bug as it violates the open()/write() call and there is no guarantee that the data will actually be written.

So I guess the real issue isn't what the default should be (although the performance team at Red Hat would vote for cache=off), the real issue is that we need to honor the system call from the guest. If the file is opened with O_DIRECT on the guest, then the host needs to honor that and do the same.

-mark

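To make the alignment language in the quoted manpage concrete, here is a small self-contained example (not taken from qemu) of a well-formed O_DIRECT write on Linux 2.6: the buffer, the transfer length and the file offset all have to be 512-byte aligned, and posix_memalign() is the usual way to get such a buffer. It assumes the target file already exists.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int write_odirect(const char *path, const void *data, size_t len)
{
    void *buf;
    int fd, ret = -1;

    if (len % 512 || posix_memalign(&buf, 512, len))
        return -1;                  /* length and buffer must be 512-aligned */

    memcpy(buf, data, len);
    fd = open(path, O_WRONLY | O_DIRECT);
    if (fd >= 0) {
        if (pwrite(fd, buf, len, 0) == (ssize_t)len)
            ret = 0;                /* "completed" != data on the platter */
        close(fd);
    }
    free(buf);
    return ret;
}
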
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-11 20:35 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Mark Wagner wrote:
> Avi Kivity wrote:
>
> I think one of the main things to be considered is the integrity of the
> actual system call. The Linux manpage for open() states the following
> about the use of the O_DIRECT flag:
>
>   O_DIRECT (Since Linux 2.6.10)
>       Try to minimize cache effects of the I/O to and from this file. In
>       general this will degrade performance, but it is useful in special
>       situations, such as when applications do their own caching. File
>       I/O is done directly to/from user space buffers. The I/O is
>       synchronous, that is, at the completion of a read(2) or write(2),
>       data is guaranteed to have been transferred. Under Linux 2.4
>       transfer sizes, and the alignment of user buffer and file offset
>       must all be multiples of the logical block size of the file system.
>       Under Linux 2.6 alignment to 512-byte boundaries suffices.
>
> If I focus on the sentence "The I/O is synchronous, that is, at
> the completion of a read(2) or write(2), data is guaranteed to have
> been transferred.",

It's extremely important to understand what the guarantee is. The guarantee is that upon completion of write(), the data will have been reported as written by the underlying storage subsystem. This does *not* mean that the data is on disk.

If you have a normal laptop, your disk has a cache. That cache does not have a battery backup. Under normal operations, the cache is acting in write-back mode and when you do a write, the disk will report the write as completed even though it is not actually on disk. If you really care about the data being on disk, you have to either use a disk with a battery-backed cache (much more expensive) or enable write-through caching (will significantly reduce performance).

In the case of KVM, even using write-back caching with the host page cache, we are still honoring the guarantee of O_DIRECT. We just have another level of caching that happens to be write-back.

> I think there is a bug here. If I open a
> file with the O_DIRECT flag and the host reports back to me that
> the transfer has completed when in fact it's still in the host cache,
> it's a bug as it violates the open()/write() call and there is no
> guarantee that the data will actually be written.

This is very important: O_DIRECT does *not* guarantee that data actually resides on disk. There are many possible places that it can be cached (in the storage controller, in the disks themselves, in a RAID controller).

> So I guess the real issue isn't what the default should be (although
> the performance team at Red Hat would vote for cache=off), the real
> issue is that we need to honor the system call from the guest. If
> the file is opened with O_DIRECT on the guest, then the host needs
> to honor that and do the same.

The consensus so far has been that we want to still use the host page cache but use it in write-through mode. This would mean that the guest would only see data completion when the host's storage subsystem reports the write as having completed. This is not the same as cache=off but I think gives the real effect that is desired.

Do you have another argument for using cache=off?

Regards,

Anthony Liguori

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-11 20:35 ` Anthony Liguori @ 2008-10-12 0:43 ` Mark Wagner 2008-10-12 1:50 ` Chris Wright 2008-10-12 17:54 ` Anthony Liguori 2008-10-12 0:44 ` Chris Wright 2008-10-12 10:12 ` Avi Kivity 2 siblings, 2 replies; 101+ messages in thread From: Mark Wagner @ 2008-10-12 0:43 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: Note I think that are two distinct arguments going on here. My main concern is that I don't think that this a simple "what do we make the default cache policy be" issue. I think that regardless of the cache policy, if something in the guest requests O_DIRECT, the host must honor that and not cache the data. So in the following discussion below, the question of what the default cache flag should be and the question of the host needing to honor O_DIRECT in a guest are somewhat intermingled... > Mark Wagner wrote: >> Avi Kivity wrote: >> >> I think one of the main things to be considered is the integrity of the >> actual system call. The Linux manpage for open() states the following >> about the use of the O_DIRECT flag: >> >> O_DIRECT (Since Linux 2.6.10) >> Try to minimize cache effects of the I/O to and from this file. In >> general this will degrade performance, but it is useful in special >> situations, such as when applications do their own caching. File >> I/O is done directly to/from user space buffers. The I/O is >> synchronous, that is, at the completion of a read(2) or write(2), >> data is guaranteed to have been transferred. Under Linux 2.4 >> transfer sizes, and the alignment of user buffer and file offset >> must all be multiples of the logical block size of the file system. >> Under Linux 2.6 alignment to 512-byte boundaries suffices. >> >> >> If I focus on the sentence "The I/O is synchronous, that is, at >> the completion of a read(2) or write(2), data is guaranteed to have >> been transferred. ", > > It's extremely important to understand what the guarantee is. The > guarantee is that upon completion on write(), the data will have been > reported as written by the underlying storage subsystem. This does > *not* mean that the data is on disk. I apologize if I worded it poorly, I assume that the guarantee is that the data has been sent to the storage controller and said controller sent an indication that the write has completed. This could mean multiple things likes its in the controllers cache, on the disk, etc. I do not believe that this means that the data is still sitting in the host cache. I realize it may not yet be on a disk, but, at a minimum, I would expect that is has been sent to the storage controller. Do you consider the hosts cache to be part of the storage subsystem ? > > If you have a normal laptop, your disk has a cache. That cache does not > have a battery backup. Under normal operations, the cache is acting in > write-back mode and when you do a write, the disk will report the write > as completed even though it is not actually on disk. If you really care > about the data being on disk, you have to either use a disk with a > battery backed cache (much more expensive) or enable write-through > caching (will significantly reduce performance). > We are testing things on the big side. Systems with 32 GB of mem, 2 TB of enterprise storage (MSA, EVA, etc). There is a write cache with battery backup on the storage controllers. 
We understand the trade offs between the life-time of the battery and the potential data loss because they are well documented and we can make informed decisions because we know they are there. I think that people are too quickly assuming that because an IDE drive will cache your writes *if you let it*, then its clearly OK for the host to lie to the guests when they request O_DIRECT and cache whatever the developers feel like. I think the leap to get from the write cache on an IDE drive to "its OK to cache what ever we want on the host" is huge, and deadly. Keep in mind, the disk on a laptop is not caching GB worth of data like the host can. The impact is that while there is a chance of data loss with my laptop if I leave the disk cache on, the amount of data is much smaller and the time it takes to flush the disks cache is also much smaller than a multi-GB cache on my host. > In the case of KVM, even using write-back caching with the host page > cache, we are still honoring the guarantee of O_DIRECT. We just have > another level of caching that happens to be write-back. I still don't get it. If I have something running on the host that I open with O_DIRECT, do you still consider it not to be a violation of the system call if that data ends up in the host cache instead of being sent to the storage controller? If you do think it violates the terms of the call, then what is the difference between the host and a guest in this situation? QEMU is clearly not a battery backed storage controller. > >> I think there a bug here. If I open a >> file with the O_DIRECT flag and the host reports back to me that >> the transfer has completed when in fact its still in the host cache, >> its a bug as it violates the open()/write() call and there is no >> guarantee that the data will actually be written. > > This is very important, O_DIRECT does *not* guarantee that data actually > resides on disk. There are many possibly places that it can be cached > (in the storage controller, in the disks themselves, in a RAID controller). > I don't believe I said was on the disk, just that the host indicated to the guest that the write had completed. Everything you mentioned could be considered external to the OS. You didn't mention the host page cache, is it allowed there or not? >> So I guess the real issue isn't what the default should be (although >> the performance team at Red Hat would vote for cache=off), > > The consensus so far has been that we want to still use the host page > cache but use it in write-through mode. This would mean that the guest > would only see data completion when the host's storage subsystem reports > the write as having completed. This is not the same as cache=off but I > think gives the real effect that is desired. > > Do you have another argument for using cache=off? Thats not the argument I'm trying to make. Well I guess I still didn't make my point clearly. cache=off seems to be a band-aid to the fact that the host is not honoring the O_DIRECT flag. I can easily see a malicious use of the cache=on flag to inject something into the data stream or highjack said stream from a guest app that requested O_DIRECT. While this is also possible in may other ways, in this particular case it is enabled via the config option in QEMU. I can easily see something as simple as setting a large page cache, config the guests to use cache=on and then every second messing with the caches in order to cause data corruption. (wonder if "echo 1 > /proc/sys/vm/drop_caches will do the trick ?)". 
From the guest's perspective, it has been guaranteed that its data is secure, but it really isn't. We are testing with Oracle right now. Oracle assumes it has control of the storage and does lots of things assuming direct IO. However, I can configure cache=on for the storage presented to the guest and Oracle really won't have direct control because there is a host cache in the way. If I run the same Oracle config on bare metal, it does have direct control because the OS knows that the host cache must be bypassed. The end result is that the final behavior of the guest OS is drastically different from that of the same OS running on bare metal, because I can configure QEMU to hijack the data underneath the actual call and, at a minimum, delay it from going to the external storage subsystem where the application expects it to be. The impact of this decision is that it makes QEMU unreliable for any type of use that requires data integrity and unsuitable for any type of enterprise deployment. -mark > Regards, > > Anthony Liguori > >> the real >> issue is that we need to honor the system call from the guest. If >> the file is opened with O_DIRECT on the guest, then the host needs >> to honor that and do the same. >> >> -mark >> >> >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
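To make the open(2) behaviour being argued about above concrete, here is a minimal host-side sketch, assuming Linux and a raw image file ("disk.img" is just a placeholder, and error handling is abbreviated). O_DIRECT bypasses the host page cache and needs aligned buffers; O_DSYNC keeps the cache for reads but does not report a write complete until the host's storage subsystem has acknowledged it.

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* cache=off style: bypass the host page cache entirely. */
        int fd_direct = open("disk.img", O_RDWR | O_DIRECT);

        /* Proposed default: keep the page cache for reads, but do not
         * report a write complete until the host storage subsystem has
         * acknowledged it (write-through). */
        int fd_dsync = open("disk.img", O_RDWR | O_DSYNC);

        if (fd_direct < 0 || fd_dsync < 0) {
            perror("open");
            return 1;
        }

        /* O_DIRECT requires the buffer, offset and length to be aligned;
         * 512-byte alignment suffices on Linux 2.6. */
        void *buf;
        if (posix_memalign(&buf, 512, 4096))
            return 1;
        memset(buf, 0, 4096);

        if (pwrite(fd_direct, buf, 4096, 0) != 4096)   /* no host caching */
            perror("pwrite O_DIRECT");
        if (pwrite(fd_dsync, buf, 4096, 0) != 4096)    /* cached, write-through */
            perror("pwrite O_DSYNC");

        free(buf);
        close(fd_direct);
        close(fd_dsync);
        return 0;
    }

Neither flag, on its own, says anything about the write cache inside the disk itself, which is the point Jamie Lokier raises further down the thread.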
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 0:43 ` Mark Wagner @ 2008-10-12 1:50 ` Chris Wright 2008-10-12 16:22 ` Jamie Lokier 2008-10-12 17:54 ` Anthony Liguori 1 sibling, 1 reply; 101+ messages in thread From: Chris Wright @ 2008-10-12 1:50 UTC (permalink / raw) To: Mark Wagner Cc: Chris Wright, Mark McLoughlin, kvm, Laurent Vivier, qemu-devel, Ryan Harper * Mark Wagner (mwagner@redhat.com) wrote: > I think that are two distinct arguments going on here. My main concern is > that I don't think that this a simple "what do we make the default cache policy > be" issue. I think that regardless of the cache policy, if something in the > guest requests O_DIRECT, the host must honor that and not cache the data. OK, O_DIRECT in the guest is just one example of the guest requesting data to be synchronously written to disk. It bypasses guest page cache, but even page cached writes need to be written at some point. Any time the disk driver issues an io where it expects the data to be on disk (possible low-level storage subystem caching) is the area of concern. * Mark Wagner (mwagner@redhat.com) wrote: > Anthony Liguori wrote: >> It's extremely important to understand what the guarantee is. The >> guarantee is that upon completion on write(), the data will have been >> reported as written by the underlying storage subsystem. This does >> *not* mean that the data is on disk. > > I apologize if I worded it poorly, I assume that the guarantee is that > the data has been sent to the storage controller and said controller > sent an indication that the write has completed. This could mean > multiple things likes its in the controllers cache, on the disk, etc. > > I do not believe that this means that the data is still sitting in the > host cache. I realize it may not yet be on a disk, but, at a minimum, > I would expect that is has been sent to the storage controller. Do you > consider the hosts cache to be part of the storage subsystem ? Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get it through to host's storage subsytem, and I think that's been the core of the discussion (plus defaults, etc). >> In the case of KVM, even using write-back caching with the host page >> cache, we are still honoring the guarantee of O_DIRECT. We just have >> another level of caching that happens to be write-back. > > I still don't get it. If I have something running on the host that I > open with O_DIRECT, do you still consider it not to be a violation of > the system call if that data ends up in the host cache instead of being > sent to the storage controller? I suppose an argument could be made for host caching and write-back to be considered part of the storage subsystem from the guest pov, but then we also need to bring in the requirement for proper cache flushing. Given a popular linux guest fs can be a little fast and loose, wb and flushing isn't really optimal choice for the integrity case. thanks, -chris ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 1:50 ` Chris Wright @ 2008-10-12 16:22 ` Jamie Lokier 0 siblings, 0 replies; 101+ messages in thread From: Jamie Lokier @ 2008-10-12 16:22 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm, Laurent Vivier, Ryan Harper, Mark Wagner Chris Wright wrote: > Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get > it through to host's storage subsytem, and I think that's been the core > of the discussion (plus defaults, etc). Just want to point out that the storage commitment from O_DIRECT can be _weaker_ than O_DSYNC. On Linux, O_DIRECT never uses storage-device barriers or transactions, but O_DSYNC sometimes does, and fsync is even more likely to than O_DSYNC. I'm not certain, but I think the same applies to other host OSes too - including Windows, which has its own equivalents to O_DSYNC and O_DIRECT, and extra documented semantics when they are used together. Although this is a host implementation detail, unfortunately it means that O_DIRECT=no-cache and O_DSYNC=write-through-cache is not an accurate characterisation. Some might be misled into assuming that "cache=off" commits their data to hard storage as strongly as "cache=wb" would. I think you can assume this only when the underlying storage devices' write caches are disabled. You cannot assume this if the host filesystem uses barriers instead of disabling the storage devices' write cache. Unfortunately there's not a lot qemu can do about these various quirks, but at least it should be documented, so that someone requiring storage commitment (e.g. for a critical guest database) is advised to investigate whether O_DIRECT and/or O_DSYNC give them what they require with their combination of host kernel, filesystem, filesystem options and storage device(s). -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
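One practical reading of Jamie's caveat, as a sketch rather than a recipe (Linux assumed; whether the flush actually reaches the platter still depends on the filesystem, its barrier support and the drive's write-cache setting): if the stronger commitment is wanted, an explicit fdatasync() has to be issued even on a descriptor opened O_DIRECT.

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write one block through O_DIRECT and then explicitly flush.
     * O_DIRECT only bypasses the host page cache; it does not by itself
     * issue a barrier or cache-flush to the storage device, so an
     * fdatasync() is still needed for the stronger commitment. */
    static int write_committed(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        return fdatasync(fd);
    }

    int main(void)
    {
        int fd = open("disk.img", O_RDWR | O_DIRECT);
        void *sector;

        if (fd < 0 || posix_memalign(&sector, 512, 512))
            return 1;
        memset(sector, 0, 512);
        write_committed(fd, sector, 512, 0);
        free(sector);
        close(fd);
        return 0;
    }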
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 0:43 ` Mark Wagner 2008-10-12 1:50 ` Chris Wright @ 2008-10-12 17:54 ` Anthony Liguori 2008-10-12 18:14 ` nuitari-qemu 2008-10-13 0:27 ` Mark Wagner 1 sibling, 2 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 17:54 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > > I do not believe that this means that the data is still sitting in the > host cache. I realize it may not yet be on a disk, but, at a minimum, > I would expect that is has been sent to the storage controller. Do you > consider the hosts cache to be part of the storage subsystem ? Yes. And the storage subsystem is often complicated like this. Consider if you had a hardware iSCSI initiator. The host just sees a SCSI disk and when the writes are issued as completed, that simply means the writes have gone to the iSCSI server. The iSCSI server may have its own cache or some deep storage multi-level cached storage subsystem. The fact that the virtualization layer has a cache is really not that unusual. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 17:54 ` Anthony Liguori @ 2008-10-12 18:14 ` nuitari-qemu 2008-10-13 0:27 ` Mark Wagner 1 sibling, 0 replies; 101+ messages in thread From: nuitari-qemu @ 2008-10-12 18:14 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier >> I do not believe that this means that the data is still sitting in the >> host cache. I realize it may not yet be on a disk, but, at a minimum, >> I would expect that is has been sent to the storage controller. Do you >> consider the hosts cache to be part of the storage subsystem ? > > The fact that the virtualization layer has a cache is really not that > unusual. Wouldn't it be better to have cache=on/off control whether or not qemu/kvm does any caching of its own, and have a different configuration option for O_DIRECT / O_DSYNC on the disk files? ^ permalink raw reply [flat|nested] 101+ messages in thread
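nuitari's suggestion roughly matches the split the thread converges on: one knob that picks between write-through, write-back and no host caching. A sketch of how such a knob could map onto host open(2) flags (illustrative only, not QEMU's actual option handling):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>

    enum cache_mode {
        CACHE_WRITETHROUGH,   /* host page cache, writes acknowledged by storage */
        CACHE_WRITEBACK,      /* host page cache, writes acknowledged from cache */
        CACHE_NONE            /* bypass the host page cache entirely */
    };

    /* Translate a user-visible cache mode into host open(2) flags. */
    static int image_open_flags(enum cache_mode mode)
    {
        int flags = O_RDWR;

        switch (mode) {
        case CACHE_WRITETHROUGH:
            flags |= O_DSYNC;     /* cached reads, write-through writes */
            break;
        case CACHE_NONE:
            flags |= O_DIRECT;    /* no host caching of reads or writes */
            break;
        case CACHE_WRITEBACK:
            break;                /* integrity then depends on the host */
        }
        return flags;
    }

    int main(void)
    {
        int fd = open("disk.img", image_open_flags(CACHE_WRITETHROUGH));
        return fd < 0;
    }

With that mapping, write-through keeps the read-side benefits of the host page cache while making write acknowledgements depend on the storage subsystem, which is the O_DSYNC default being proposed.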
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 17:54 ` Anthony Liguori 2008-10-12 18:14 ` nuitari-qemu @ 2008-10-13 0:27 ` Mark Wagner 2008-10-13 1:21 ` Anthony Liguori 1 sibling, 1 reply; 101+ messages in thread From: Mark Wagner @ 2008-10-13 0:27 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: > Mark Wagner wrote: >> >> I do not believe that this means that the data is still sitting in the >> host cache. I realize it may not yet be on a disk, but, at a minimum, >> I would expect that is has been sent to the storage controller. Do you >> consider the hosts cache to be part of the storage subsystem ? > > Yes. And the storage subsystem is often complicated like this. > Consider if you had a hardware iSCSI initiator. The host just sees a > SCSI disk and when the writes are issued as completed, that simply means > the writes have gone to the iSCSI server. The iSCSI server may have its > own cache or some deep storage multi-level cached storage subsystem. > If you stopped and listened to yourself, you'd see that you are making my point... AFAIK, QEMU is neither designed nor intended to be an Enterprise Storage Array, I thought this group is designing a virtualization layer. However, the persistent argument is that since Enterprise Storage products will often acknowledge a write before the data is actually on the disk, its OK for QEMU to do the same. If QEMU had a similar design to Enterprise Storage with redundancy, battery backup, etc, I'd be fine with it, but you don't. QEMU is a layer that I've also thought was suppose to be small, lightweight and unobtrusive that is silently putting everyones data at risk. The low-end iSCSI server from EqualLogic claims: "it combines intelligence and automation with fault tolerance" "Dual, redundant controllers with a total of 4 GB battery-backed memory" AFAIK QEMU provides neither of these characteristics. -mark > The fact that the virtualization layer has a cache is really not that > unusual. Do other virtualization layers lie to the guest and indicate that the data has successfully been ACK'd by the storage subsystem when the data is actually still in the host cache? -mark > > Regards, > > Anthony Liguori > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 0:27 ` Mark Wagner @ 2008-10-13 1:21 ` Anthony Liguori 2008-10-13 2:09 ` Mark Wagner 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 1:21 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > If you stopped and listened to yourself, you'd see that you are making > my point... > > AFAIK, QEMU is neither designed nor intended to be an Enterprise > Storage Array, > I thought this group is designing a virtualization layer. However, > the persistent > argument is that since Enterprise Storage products will often > acknowledge a write > before the data is actually on the disk, its OK for QEMU to do the same. I think you're a little lost in this thread. We're going to have QEMU only acknowledge writes when they complete. I've already sent out a patch. Just waiting a couple days to let everyone give their input. > If QEMU > had a similar design to Enterprise Storage with redundancy, battery > backup, etc, I'd > be fine with it, but you don't. QEMU is a layer that I've also thought > was suppose > to be small, lightweight and unobtrusive that is silently putting > everyones data > at risk. > > The low-end iSCSI server from EqualLogic claims: > "it combines intelligence and automation with fault tolerance" > "Dual, redundant controllers with a total of 4 GB battery-backed > memory" > > AFAIK QEMU provides neither of these characteristics. So if this is your only concern, we're in violent agreement. You were previously arguing that we should use O_DIRECT in the host if we're not "lying" about write completions anymore. That's what I'm opposing because the details of whether we use O_DIRECT or not have absolutely nothing to do with data integrity as long as we're using O_DSYNC. Regards, Anthony Liguori > > -mark > >> The fact that the virtualization layer has a cache is really not that >> unusual. > Do other virtualization layers lie to the guest and indicate that the > data > has successfully been ACK'd by the storage subsystem when the data is > actually > still in the host cache? > > > -mark >> >> Regards, >> >> Anthony Liguori >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 1:21 ` Anthony Liguori @ 2008-10-13 2:09 ` Mark Wagner 2008-10-13 3:16 ` Anthony Liguori ` (2 more replies) 0 siblings, 3 replies; 101+ messages in thread From: Mark Wagner @ 2008-10-13 2:09 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: > Mark Wagner wrote: >> If you stopped and listened to yourself, you'd see that you are making >> my point... >> >> AFAIK, QEMU is neither designed nor intended to be an Enterprise >> Storage Array, >> I thought this group is designing a virtualization layer. However, >> the persistent >> argument is that since Enterprise Storage products will often >> acknowledge a write >> before the data is actually on the disk, its OK for QEMU to do the same. > > I think you're a little lost in this thread. We're going to have QEMU > only acknowledge writes when they complete. I've already sent out a > patch. Just waiting a couple days to let everyone give their input. > Actually, I'm just don't being clear enough in trying to point out that I don't think just setting a default value for "cache" goes far enough. My argument has nothing to do with the default value. It has to do with what the right thing to do is in specific situations regardless of the value of the cache setting. My point is that if a file is opened in the guest with the O_DIRECT (or O_DSYNC) then QEMU *must* honor that regardless of whatever value the current value of "cache" is. So, if the system admin for the host decides to set cache=on and something in the guest opens a file with O_DIRECT, I feel that it is a violation of the system call for the host to cache the write in its local cache w/o sending it immediately to the storage subsystem. It must get an ACK from the storage subsystem before it can return to the guest in order to preserve the guarantee. So, if your proposed default value for the cache is in effect, then O_DSYNC should provide the write-thru required by the guests use of O_DIRECT on the writes. However, if the default cache value is not used and its set to cache=on, and if the guest is using O_DIRECT or O_DSYNC, I feel there are issues that need to be addressed. -mark >> If QEMU >> had a similar design to Enterprise Storage with redundancy, battery >> backup, etc, I'd >> be fine with it, but you don't. QEMU is a layer that I've also thought >> was suppose >> to be small, lightweight and unobtrusive that is silently putting >> everyones data >> at risk. >> >> The low-end iSCSI server from EqualLogic claims: >> "it combines intelligence and automation with fault tolerance" >> "Dual, redundant controllers with a total of 4 GB battery-backed >> memory" >> >> AFAIK QEMU provides neither of these characteristics. > > So if this is your only concern, we're in violent agreement. You were > previously arguing that we should use O_DIRECT in the host if we're not > "lying" about write completions anymore. That's what I'm opposing > because the details of whether we use O_DIRECT or not have absolutely > nothing to do with data integrity as long as we're using O_DSYNC. > > Regards, > > Anthony Liguori > >> >> -mark >> >>> The fact that the virtualization layer has a cache is really not that >>> unusual. >> Do other virtualization layers lie to the guest and indicate that the >> data >> has successfully been ACK'd by the storage subsystem when the data is >> actually >> still in the host cache? 
>> >> >> -mark >>> >>> Regards, >>> >>> Anthony Liguori >>> >>> >> >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
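The ordering Mark is asking for can be stated as a small sketch (the structure and helper names here are hypothetical, not QEMU code): if the image is opened O_DSYNC or O_DIRECT, the emulated controller completes a guest write only after the host write call has returned, i.e. only after the host's storage subsystem has acknowledged the data.

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical request structure for a guest write. */
    struct io_req {
        int    fd;        /* image file opened with O_DSYNC or O_DIRECT */
        void  *buf;
        size_t len;
        off_t  offset;
    };

    /* Stand-in for raising the virtual controller's completion interrupt. */
    static void complete_guest_request(struct io_req *req, int error)
    {
        (void)req;
        if (error)
            fprintf(stderr, "guest write failed\n");
    }

    static void handle_guest_write(struct io_req *req)
    {
        /* Because the image was opened O_DSYNC (or O_DIRECT), pwrite()
         * does not return until the host's storage subsystem has
         * acknowledged the data.  Completing the guest request only
         * after that point means the guest is never told a write is
         * done while it is still sitting in the host page cache. */
        ssize_t n = pwrite(req->fd, req->buf, req->len, req->offset);
        complete_guest_request(req, n == (ssize_t)req->len ? 0 : -1);
    }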
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 2:09 ` Mark Wagner @ 2008-10-13 3:16 ` Anthony Liguori 2008-10-13 6:42 ` Aurelien Jarno 2008-10-13 14:38 ` Steve Ofsthun 2 siblings, 0 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 3:16 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > So, if your proposed default value for the cache is in effect, then > O_DSYNC > should provide the write-thru required by the guests use of O_DIRECT > on the > writes. However, if the default cache value is not used and its set to > cache=on, and if the guest is using O_DIRECT or O_DSYNC, I feel there are The option would be cache=writeback and the man pages have a pretty clear warning in it that it could lead to data loss. It's used for -snapshot and it's totally safe for that (and improves write performance for that case). It's also there because a number of people expressed a concern that they did not care about data integrity and wished to be able to get the performance boost. I don't see a harm in that since I think we'll now have adequate documentation. Regards, Anthony Liguori > > issues that need to be addressed. > > -mark > >>> If QEMU >>> had a similar design to Enterprise Storage with redundancy, battery >>> backup, etc, I'd >>> be fine with it, but you don't. QEMU is a layer that I've also >>> thought was suppose >>> to be small, lightweight and unobtrusive that is silently putting >>> everyones data >>> at risk. >>> >>> The low-end iSCSI server from EqualLogic claims: >>> "it combines intelligence and automation with fault tolerance" >>> "Dual, redundant controllers with a total of 4 GB battery-backed >>> memory" >>> >>> AFAIK QEMU provides neither of these characteristics. >> >> So if this is your only concern, we're in violent agreement. You >> were previously arguing that we should use O_DIRECT in the host if >> we're not "lying" about write completions anymore. That's what I'm >> opposing because the details of whether we use O_DIRECT or not have >> absolutely nothing to do with data integrity as long as we're using >> O_DSYNC. >> >> Regards, >> >> Anthony Liguori >> >>> >>> -mark >>> >>>> The fact that the virtualization layer has a cache is really not >>>> that unusual. >>> Do other virtualization layers lie to the guest and indicate that >>> the data >>> has successfully been ACK'd by the storage subsystem when the data >>> is actually >>> still in the host cache? >>> >>> >>> -mark >>>> >>>> Regards, >>>> >>>> Anthony Liguori >>>> >>>> >>> >>> >>> >> >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 2:09 ` Mark Wagner 2008-10-13 3:16 ` Anthony Liguori @ 2008-10-13 6:42 ` Aurelien Jarno 2008-10-13 14:38 ` Steve Ofsthun 2 siblings, 0 replies; 101+ messages in thread From: Aurelien Jarno @ 2008-10-13 6:42 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner a écrit : > Anthony Liguori wrote: >> Mark Wagner wrote: >>> If you stopped and listened to yourself, you'd see that you are making >>> my point... >>> >>> AFAIK, QEMU is neither designed nor intended to be an Enterprise >>> Storage Array, >>> I thought this group is designing a virtualization layer. However, >>> the persistent >>> argument is that since Enterprise Storage products will often >>> acknowledge a write >>> before the data is actually on the disk, its OK for QEMU to do the same. >> I think you're a little lost in this thread. We're going to have QEMU >> only acknowledge writes when they complete. I've already sent out a >> patch. Just waiting a couple days to let everyone give their input. >> > Actually, I'm just don't being clear enough in trying to point out that I > don't think just setting a default value for "cache" goes far enough. My > argument has nothing to do with the default value. It has to do with what the > right thing to do is in specific situations regardless of the value of the > cache setting. > > My point is that if a file is opened in the guest with the O_DIRECT (or O_DSYNC) > then QEMU *must* honor that regardless of whatever value the current value of > "cache" is. > > So, if the system admin for the host decides to set cache=on and something > in the guest opens a file with O_DIRECT, I feel that it is a violation > of the system call for the host to cache the write in its local cache w/o > sending it immediately to the storage subsystem. It must get an ACK from > the storage subsystem before it can return to the guest in order to preserve > the guarantee. > > So, if your proposed default value for the cache is in effect, then O_DSYNC > should provide the write-thru required by the guests use of O_DIRECT on the > writes. However, if the default cache value is not used and its set to > cache=on, and if the guest is using O_DIRECT or O_DSYNC, I feel there are > issues that need to be addressed. > Everybody agrees that we should support data integrity *by default*. But please admit that some persons have different needs than yours, and actually *want* to lie to the guest. We should propose such and option, with a *big warning*. -- .''`. Aurelien Jarno | GPG: 1024D/F1BCDB73 : :' : Debian developer | Electrical Engineer `. `' aurel32@debian.org | aurelien@aurel32.net `- people.debian.org/~aurel32 | www.aurel32.net ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 2:09 ` Mark Wagner 2008-10-13 3:16 ` Anthony Liguori 2008-10-13 6:42 ` Aurelien Jarno @ 2008-10-13 14:38 ` Steve Ofsthun 2 siblings, 0 replies; 101+ messages in thread From: Steve Ofsthun @ 2008-10-13 14:38 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > Anthony Liguori wrote: >> Mark Wagner wrote: >>> If you stopped and listened to yourself, you'd see that you are >>> making my point... >>> >>> AFAIK, QEMU is neither designed nor intended to be an Enterprise >>> Storage Array, >>> I thought this group is designing a virtualization layer. However, >>> the persistent >>> argument is that since Enterprise Storage products will often >>> acknowledge a write >>> before the data is actually on the disk, its OK for QEMU to do the same. >> >> I think you're a little lost in this thread. We're going to have QEMU >> only acknowledge writes when they complete. I've already sent out a >> patch. Just waiting a couple days to let everyone give their input. >> > Actually, I'm just don't being clear enough in trying to point out that I > don't think just setting a default value for "cache" goes far enough. My > argument has nothing to do with the default value. It has to do with > what the > right thing to do is in specific situations regardless of the value of the > cache setting. > > My point is that if a file is opened in the guest with the O_DIRECT (or > O_DSYNC) > then QEMU *must* honor that regardless of whatever value the current > value of > "cache" is. I disagree here. QEMU's contract is not with any particular guest OS interface. QEMU's contract is with the faithfulness of the hardware emulation. The guest OS must perform appropriate actions that would guarantee the behavior advertised to any particular application. So your discussion should focus on what should QEMU do when asked to flush an I/O stream on a virtual device. While the specific actions QEMU might perform may be different based on caching mode, the end result should be host caching flushed to the underlying storage hierarchy. Note that this still doesn't guarantee the I/O is on the disk unless the storage is configured properly. QEMU shouldn't attempt to provide stronger guarantees than the host OS provides. Looking at a parallel in the real world. Most disk drives today ship with write caching enabled. Most OSes will accept this and allow delayed writes to the actual media. Is this completely safe? No. Is this accepted? Yes. Now, to become safe an application will perform extraordinary actions (various sync modes, etc) to guarantee the data is on the media. Yet even this can be circumvented by specific performance modes in the storage hierarchy. However, there are best practices to follow to avoid unexpected vulnerabilities. For certain application environments is to mandatory to disable writeback caching on the drives. Yet we wouldn't want to impose this constraint on all application environments. There are always tradeoffs. Now given that there are data safety issues to deal with, it is important to prevent a default behavior that recklessly endangers guest data. A customer will expect a single virtual machine to exhibit the same data safety as a single physical machine. However, running a group of virtual machines on a single host, the guest user will expect the same reliability as a group of physical machines. 
Note that the virtualization layer adds vulnerabilities (a host OS crash, for example) that reduce the reliability of the virtual machines relative to the physical machines they replace. So the default behavior of a virtualization stack may need to be more conservative than the corresponding physical stack it replaces. On the flip side though, the virtualization layer can exploit new opportunities for optimization. Imagine a single macro operation running within a virtual machine (backup, OS installation). Data integrity of the entire operation is important, not the individual I/Os. So by disabling all individual I/O synchronization semantics, I can get a backup or installation to run in half the time. This can be a key advantage for virtual deployments. We don't want to prevent this situation just to guarantee the integrity of half a backup, or half an install. Steve ^ permalink raw reply [flat|nested] 101+ messages in thread
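Steve's framing, that QEMU's contract is with the emulated hardware, suggests the flush half of that contract looks something like the sketch below (hypothetical hook name, not the actual IDE/SCSI emulation): when the guest driver issues a cache-flush command, the device model forwards it to the host as fdatasync() on the backing file, whatever cache= mode is in effect.

    #include <unistd.h>

    /* Hypothetical device-model hook: called when the guest driver issues
     * a FLUSH CACHE (or equivalent) command for the virtual disk backed
     * by image_fd. */
    static int handle_guest_flush(int image_fd)
    {
        /* Push anything the host may still be caching for this image down
         * to its storage subsystem before acknowledging the command.  With
         * write-back caching this is what makes guest-issued flushes and
         * barriers meaningful; with O_DSYNC or O_DIRECT writes it is close
         * to a no-op for data, though it may still flush metadata. */
        return fdatasync(image_fd);
    }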
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-11 20:35 ` Anthony Liguori 2008-10-12 0:43 ` Mark Wagner @ 2008-10-12 0:44 ` Chris Wright 2008-10-12 10:21 ` Avi Kivity 2008-10-12 10:12 ` Avi Kivity 2 siblings, 1 reply; 101+ messages in thread From: Chris Wright @ 2008-10-12 0:44 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper * Anthony Liguori (anthony@codemonkey.ws) wrote: > Mark Wagner wrote: >> So I guess the real issue isn't what the default should be (although >> the performance team at Red Hat would vote for cache=off), > > The consensus so far has been that we want to still use the host page > cache but use it in write-through mode. This would mean that the guest > would only see data completion when the host's storage subsystem reports > the write as having completed. This is not the same as cache=off but I > think gives the real effect that is desired. I think it's safe to say the perf folks are concerned w/ data integrity first, stable/reproducible results second, and raw performance third. So seeing data cached in host was simply not what they expected. I think write through is sufficient. However I think that uncached vs. wt will show up on the radar under reproducible results (need to tune based on cache size). And in most overcommit scenarios memory is typically more precious than cpu, it's unclear to me if the extra buffering is anything other than memory overhead. As long as it's configurable then it's comparable and benchmarking and best practices can dictate best choice. thanks, -chris ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 0:44 ` Chris Wright @ 2008-10-12 10:21 ` Avi Kivity 2008-10-12 14:37 ` Dor Laor 2008-10-12 17:59 ` Anthony Liguori 0 siblings, 2 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-12 10:21 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper Chris Wright wrote: > I think it's safe to say the perf folks are concerned w/ data integrity > first, stable/reproducible results second, and raw performance third. > > So seeing data cached in host was simply not what they expected. I think > write through is sufficient. However I think that uncached vs. wt will > show up on the radar under reproducible results (need to tune based on > cache size). And in most overcommit scenarios memory is typically more > precious than cpu, it's unclear to me if the extra buffering is anything > other than memory overhead. As long as it's configurable then it's > comparable and benchmarking and best practices can dictate best choice. > Getting good performance because we have a huge amount of free memory in the host is not a good benchmark. Under most circumstances, the free memory will be used either for more guests, or will be given to the existing guests, which can utilize it more efficiently than the host. I can see two cases where this is not true: - using older, 32-bit guests which cannot utilize all of the cache. I think Windows XP is limited to 512MB of cache, and usually doesn't utilize even that. So if you have an application running on 32-bit Windows (or on 32-bit Linux with pae disabled), and a huge host, you will see a significant boost from cache=writethrough. This is a case where performance can exceed native, simply because native cannot exploit all the resources of the host. - if cache requirements vary in time across the different guests, and if some smart ballooning is not in place, having free memory on the host means we utilize it for whichever guest has the greatest need, so overall performance improves. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 10:21 ` Avi Kivity @ 2008-10-12 14:37 ` Dor Laor 2008-10-12 15:35 ` Jamie Lokier 2008-10-12 18:02 ` Anthony Liguori 2008-10-12 17:59 ` Anthony Liguori 1 sibling, 2 replies; 101+ messages in thread From: Dor Laor @ 2008-10-12 14:37 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Avi Kivity wrote: > Chris Wright wrote: >> I think it's safe to say the perf folks are concerned w/ data integrity >> first, stable/reproducible results second, and raw performance third. >> >> So seeing data cached in host was simply not what they expected. I >> think >> write through is sufficient. However I think that uncached vs. wt will >> show up on the radar under reproducible results (need to tune based on >> cache size). And in most overcommit scenarios memory is typically more >> precious than cpu, it's unclear to me if the extra buffering is anything >> other than memory overhead. As long as it's configurable then it's >> comparable and benchmarking and best practices can dictate best choice. >> > > Getting good performance because we have a huge amount of free memory > in the host is not a good benchmark. Under most circumstances, the > free memory will be used either for more guests, or will be given to > the existing guests, which can utilize it more efficiently than the host. > > I can see two cases where this is not true: > > - using older, 32-bit guests which cannot utilize all of the cache. I > think Windows XP is limited to 512MB of cache, and usually doesn't > utilize even that. So if you have an application running on 32-bit > Windows (or on 32-bit Linux with pae disabled), and a huge host, you > will see a significant boost from cache=writethrough. This is a case > where performance can exceed native, simply because native cannot > exploit all the resources of the host. > > - if cache requirements vary in time across the different guests, and > if some smart ballooning is not in place, having free memory on the > host means we utilize it for whichever guest has the greatest need, so > overall performance improves. > > > Another justification for ODIRECT is that many production system will use the base images for their VMs. It's mainly true for desktop virtualization but probably for some server virtualization deployments. In these type of scenarios, we can have all of the base image chain opened as default with caching for read-only while the leaf images are open with cache=off. Since there is ongoing effort (both by IT and developers) to keep the base images as big as possible, it guarantees that this data is best suited for caching in the host while the private leaf images will be uncached. This way we provide good performance and caching for the shared parent images while also promising correctness. Actually this is what happens on mainline qemu with cache=off. Cheers, Dor ^ permalink raw reply [flat|nested] 101+ messages in thread
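Dor's split policy can be sketched as follows (illustrative only; in practice the decision would live in the backing-file handling and, as Andrea suggests later, possibly in the qcow2 metadata): shared, read-only base images stay cacheable on the host, while the private, writable leaf is opened O_DIRECT.

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdbool.h>

    /* Open one link of a backing-file chain.  Shared, read-only base
     * images are left cacheable (many guests read the same blocks), while
     * the private, writable leaf bypasses the host page cache so its
     * writes are never acknowledged from host memory. */
    static int open_image(const char *path, bool is_leaf)
    {
        if (is_leaf)
            return open(path, O_RDWR | O_DIRECT);
        return open(path, O_RDONLY);   /* cached, read-only parent */
    }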
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 14:37 ` Dor Laor @ 2008-10-12 15:35 ` Jamie Lokier 2008-10-12 18:00 ` Anthony Liguori 2008-10-12 18:02 ` Anthony Liguori 1 sibling, 1 reply; 101+ messages in thread From: Jamie Lokier @ 2008-10-12 15:35 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Dor Laor wrote: > Actually this is what happens on mainline qemu with cache=off. Have I understood right that cache=off on a qcow2 image only uses O_DIRECT for the leaf image, and the chain of base images don't use O_DIRECT? Sometimes on a memory constrained host, where the (collective) guest memory is nearly as big as the host memory, I'm not sure this is what I want. -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 15:35 ` Jamie Lokier @ 2008-10-12 18:00 ` Anthony Liguori 0 siblings, 0 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 18:00 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Jamie Lokier wrote: > Dor Laor wrote: > >> Actually this is what happens on mainline qemu with cache=off. >> > > Have I understood right that cache=off on a qcow2 image only uses > O_DIRECT for the leaf image, and the chain of base images don't use > O_DIRECT? > Yeah, that's a bug IMHO and in my patch to add O_DSYNC, I fix that. I think an argument for O_DIRECT in the leaf and write-back in the base is seriously flawed... Regards, Anthony Liguori > Sometimes on a memory constrained host, where the (collective) guest > memory is nearly as big as the host memory, I'm not sure this is what > I want. > > -- Jamie > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 14:37 ` Dor Laor 2008-10-12 15:35 ` Jamie Lokier @ 2008-10-12 18:02 ` Anthony Liguori 2008-10-15 10:17 ` Andrea Arcangeli 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 18:02 UTC (permalink / raw) To: Dor Laor Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Dor Laor wrote: > Avi Kivity wrote: > > Since there is ongoing effort (both by IT and developers) to keep the > base images as big as possible, it guarantees that > this data is best suited for caching in the host while the private > leaf images will be uncached. A proper CAS solution is really such a better approach. qcow2 deduplification is an interesting concept, but such a hack :-) > This way we provide good performance and caching for the shared parent > images while also promising correctness. You get correctness by using O_DSYNC. cache=off should disable the use of the page cache everywhere. Regards, Anthony Liguori > Actually this is what happens on mainline qemu with cache=off. > > Cheers, > Dor > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 18:02 ` Anthony Liguori @ 2008-10-15 10:17 ` Andrea Arcangeli 0 siblings, 0 replies; 101+ messages in thread From: Andrea Arcangeli @ 2008-10-15 10:17 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper [-- Attachment #1: Type: text/plain, Size: 2164 bytes --] On Sun, Oct 12, 2008 at 01:02:57PM -0500, Anthony Liguori wrote: > You get correctness by using O_DSYNC. cache=off should disable the use of > the page cache everywhere. The parent shared image is generally readonly (assuming no cluster fs or shared database storage). So O_DSYNC on the parent will be a noop but it's ok if you like it as a default. By default having cache enabled on the parent makes sense to me (O_DSYNC doesn't disable the cache like O_DIRECT does, reads are cached). Because the qemu command line is qcow2 internals agnostic (you can't specify which parent/child image to use, that's left to qemu-img to set on the qcow2 metadata) I guess the O_DIRECT/O_DSYNC behavior on the parent image should also be left to qemu-img. Assuming there's any reserved bitflag left in the qcow2 metadata to use to specify those bits. I also attached the results of my o_direct measurements. O_DIRECT seems very optimal already after the fixes to qcow2 to avoid submitting aio_read/write only large as a qcow2 cluster size. I was initially fooled because I didn't reduce the ram on the host to the guest size + less than the min filesize of iozone, after that O_DIRECT wins. All tests were run with the emulated ide driver, which is the one that soldice is using right now with non-linux guest. The aio-thread patch can't make any difference with ide as verified here. I also tried to enlarge the max dma in the ide driver to 512k (it's limited to 128k) but I couldn't measure any benefit. 128k large DMA on host seems enough to reach platter speed. I also tried with dma disabled on the guest ide driver, and that destroys the O_DIRECT performance because then the commands are too small to reach platter speed. The host IDE driver needs something >=64k to reach platter speed. In short I think except for the boot-time O_DIRECT is a must and things like this are why MAP_SHARED isn't nearly as good as O_DIRECT for certain cases, as it won't waste any cpu in the VM pagetable manglings and msyncing. So the parent image is the only one where it makes sense to allow caching to speed up the boot time and application startup on the shared executables. [-- Attachment #2: iozone-cleo-trunk-dma.ods --] [-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 37205 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 10:21 ` Avi Kivity 2008-10-12 14:37 ` Dor Laor @ 2008-10-12 17:59 ` Anthony Liguori 2008-10-12 18:34 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 17:59 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > Chris Wright wrote: >> I think it's safe to say the perf folks are concerned w/ data integrity >> first, stable/reproducible results second, and raw performance third. >> >> So seeing data cached in host was simply not what they expected. I >> think >> write through is sufficient. However I think that uncached vs. wt will >> show up on the radar under reproducible results (need to tune based on >> cache size). And in most overcommit scenarios memory is typically more >> precious than cpu, it's unclear to me if the extra buffering is anything >> other than memory overhead. As long as it's configurable then it's >> comparable and benchmarking and best practices can dictate best choice. >> > > Getting good performance because we have a huge amount of free memory > in the host is not a good benchmark. Under most circumstances, the > free memory will be used either for more guests, or will be given to > the existing guests, which can utilize it more efficiently than the host. There's two arguments for O_DIRECT. The first is that you can avoid bringing in data into CPU cache. This requires zero-copy in QEMU but ignoring that, the use of the page cache doesn't necessarily prevent us from achieving this. In the future, most systems will have a DMA offload engine. This is a pretty obvious thing to attempt to accelerate with such an engine which would prevent cache pollution. Another possibility is to directly map the host's page cache into the guest's memory space. The later is a bit tricky but is so much more interesting especially if you have a strong storage backend that is capable of deduplification (you get memory compaction for free). I also have my doubts that the amount of memory saved by using O_DIRECT will have a noticable impact on performance considering that guest memory and page cache memory are entirely reclaimable. An LRU should make the best decisions about whether memory is more valuable for the guests or for the host page cache. Regards, Anthony Liguori > I can see two cases where this is not true: > > - using older, 32-bit guests which cannot utilize all of the cache. I > think Windows XP is limited to 512MB of cache, and usually doesn't > utilize even that. So if you have an application running on 32-bit > Windows (or on 32-bit Linux with pae disabled), and a huge host, you > will see a significant boost from cache=writethrough. This is a case > where performance can exceed native, simply because native cannot > exploit all the resources of the host. > > - if cache requirements vary in time across the different guests, and > if some smart ballooning is not in place, having free memory on the > host means we utilize it for whichever guest has the greatest need, so > overall performance improves. > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 17:59 ` Anthony Liguori @ 2008-10-12 18:34 ` Avi Kivity 2008-10-12 19:33 ` Izik Eidus 2008-10-12 19:59 ` Anthony Liguori 0 siblings, 2 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-12 18:34 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: >> >> Getting good performance because we have a huge amount of free memory >> in the host is not a good benchmark. Under most circumstances, the >> free memory will be used either for more guests, or will be given to >> the existing guests, which can utilize it more efficiently than the >> host. > > There's two arguments for O_DIRECT. The first is that you can avoid > bringing in data into CPU cache. This requires zero-copy in QEMU but > ignoring that, the use of the page cache doesn't necessarily prevent > us from achieving this. > > In the future, most systems will have a DMA offload engine. This is a > pretty obvious thing to attempt to accelerate with such an engine > which would prevent cache pollution. But would increase latency, memory bus utilization, and cpu overhead. In the cases where the page cache buys us something (host page cache significantly larger than guest size), that's understandable. But for the other cases, why bother? Especially when many systems don't have this today. Let me phrase this another way: is there an argument against O_DIRECT? In a significant fraction of deployments it will be both simpler and faster. > Another possibility is to directly map the host's page cache into the > guest's memory space. > Doesn't work with large pages. > The later is a bit tricky but is so much more interesting especially > if you have a strong storage backend that is capable of > deduplification (you get memory compaction for free). > It's not free at all. Replacing a guest memory page involves IPIs and TLB flushes. It only works on small pages, and if the host page cache and guest page cache are aligned with each other. And with current Linux memory management, I don't see a way to do it that doesn't involve creating a vma for every page, which is prohibitively expensive. > I also have my doubts that the amount of memory saved by using > O_DIRECT will have a noticable impact on performance considering that > guest memory and page cache memory are entirely reclaimable. O_DIRECT is not about saving memory, it is about saving cpu utilization, cache utilization, and memory bandwidth. > An LRU should make the best decisions about whether memory is more > valuable for the guests or for the host page cache. > LRU typically makes fairly bad decisions since it throws most of the information it has away. I recommend looking up LRU-K and similar algorithms, just to get a feel for this; it is basically the simplest possible algorithm short of random selection. Note that Linux doesn't even have an LRU; it has to approximate since it can't sample all of the pages all of the time. With a hypervisor that uses Intel's EPT, it's even worse since we don't have an accessed bit. On silly benchmarks that just exercise the disk and touch no memory, and if you tune the host very aggresively, LRU will win on long running guests since it will eventually page out all unused guest memory (with Linux guests, it will never even page guest memory in). On real life applications I don't think there is much chance. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 18:34 ` Avi Kivity @ 2008-10-12 19:33 ` Izik Eidus 2008-10-14 17:08 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Izik Eidus @ 2008-10-12 19:33 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper Avi Kivity wrote: > > LRU typically makes fairly bad decisions since it throws most of the > information it has away. I recommend looking up LRU-K and similar > algorithms, just to get a feel for this; it is basically the simplest > possible algorithm short of random selection. > > Note that Linux doesn't even have an LRU; it has to approximate since it > can't sample all of the pages all of the time. With a hypervisor that > uses Intel's EPT, it's even worse since we don't have an accessed bit. > On silly benchmarks that just exercise the disk and touch no memory, and > if you tune the host very aggresively, LRU will win on long running > guests since it will eventually page out all unused guest memory (with > Linux guests, it will never even page guest memory in). On real life > applications I don't think there is much chance. > > But when using O_DIRECT you actually make the pages not swappable at all... or am I wrong? Maybe some kind of combination with the mm shrinker could be good; do_try_to_free_pages is a good point of reference. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 19:33 ` Izik Eidus @ 2008-10-14 17:08 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 17:08 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Izik Eidus wrote: > But when using O_DIRECT you actuality make the pages not swappable at > all... > or am i wrong? Only for the duration of the I/O operation, which is typically very short. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 18:34 ` Avi Kivity 2008-10-12 19:33 ` Izik Eidus @ 2008-10-12 19:59 ` Anthony Liguori 2008-10-12 20:43 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 19:59 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > But would increase latency, memory bus utilization, and cpu overhead. > > In the cases where the page cache buys us something (host page cache > significantly larger than guest size), that's understandable. But for > the other cases, why bother? Especially when many systems don't have > this today. > > Let me phrase this another way: is there an argument against O_DIRECT? > It slows down any user who frequently restarts virtual machines. It reduces total system throughput when there are multiple virtual machines sharing a single disk. This latter point is my primary concern because in the future, I expect disk sharing to be common in some form (either via common QCOW base images or via CAS). I'd like to see a benchmark demonstrating that O_DIRECT improves overall system throughput in any scenario today. I just don't buy that the cost of the extra copy today is going to be significant since the CPU cache is already polluted. I think the burden of proof is on O_DIRECT because it's quite simple to demonstrate where it hurts performance (just the time it takes to do two boots of the same image). > In a significant fraction of deployments it will be both simpler and faster. > I think this is speculative. Is there any performance data to back this up? Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 19:59 ` Anthony Liguori @ 2008-10-12 20:43 ` Avi Kivity 2008-10-12 21:11 ` Anthony Liguori 0 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-12 20:43 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: >> >> Let me phrase this another way: is there an argument against O_DIRECT? > > It slows down any user who frequently restarts virtual machines. This is an important use case (us developers), but not the majority of deployments. > It slows down total system throughput when there are multiple virtual > machines sharing a single disk. This later point is my primary > concern because in the future, I expect disk sharing to be common in > some form (either via common QCOW base images or via CAS). Sharing via qcow base images is also an important use case, but for desktop workloads. Server workloads will be able to share a lot less, and in any case will not keep reloading their text pages as desktops do. Regarding CAS, the Linux page cache indexes pages by inode number and offset, so it cannot share page cache contents without significant rework. Perhaps ksm could be adapted to do this, but it can't right now. And again, server consolidation scenarios which are mostly unrelated workloads jammed on a single host won't benefit much from this. > > I'd like to see a benchmark demonstrating that O_DIRECT improves > overall system throughput in any scenario today. I just don't buy the > cost of the extra copy today is going to be significant since the CPU > cache is already polluted. I think the burden of proof is on O_DIRECT > because it's quite simple to demonstrate where it hurts performance > (just the time it takes to do two boots of the same image). > >> In a significant fraction of deployments it will be both simpler and >> faster. >> > > I think this is speculative. Is there any performance data to back > this up? Given that we don't have a zero-copy implementation yet, it is impossible to generate real performance data. However it is backed up by experience; all major databases use direct I/O and their own caching; and since the data patterns of filesystems are similar to that of databases (perhaps less random), there's a case for not caching them. I'll repeat my arguments: - cache size In many deployments we will maximize the number of guests, so host memory will be low. If your L3 cache is smaller than your L2 cache, your cache hit rate will be low. Guests will write out data they are not expecting to need soon (the tails of their LRU, or their journals) so caching it is pointless. Conversely, they _will_ cache data they have just read. - cpu cache utilization When a guest writes out its page cache, this is likely to be some time after the cpu moved the data there. So it's out of the page cache. Now we're bringing it back to the cache, twice (once reading guest memory, second time writing to host page cache). Similarly, when reading from the host page cache into the guest, we have no idea whether the guest will actually touch the memory in question. It may be doing a readahead, or reading a metadata page of which it will only access a small part. So again we're wasting two pages worth of cache per page we're reading. Note also that we have no idea which vcpu will use the page, so even if the guest will touch the data, there is a high likelihood (for large guests) that it will be in the wrong cache. 
- conflicting readahead heuristics The host may attempt to perform readahead on the disk. However the guest is also doing readahead, so the host is extending the readahead further than is likely to be a good idea. Also, the guest does logical (file-based) readahead while the host does physical (disk order based) readahead, or qcow-level readahead which is basically reading random blocks. Now I don't have data that demonstrates how bad these effects are, but I think there is sufficient arguments here to justify adding O_DIRECT. I intend to recommend O_DIRECT unless I see performance data that favours O_DSYNC on real world scenarios that take into account bandwidth, cpu utilization, and memory utilization (i.e. a 1G guest on a 32G host running fio but not top doesn't count). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
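On the readahead point specifically: when the host page cache is kept (the write-through default), the double-readahead effect can at least be damped with posix_fadvise(), assuming the host filesystem honours the hint; a sketch:

    #include <fcntl.h>

    /* Ask the host not to second-guess the guest's own readahead on this
     * image: through qcow2 especially, the access pattern the host sees
     * is close to random, so host-side readahead mostly pollutes the
     * cache. */
    static int tune_host_readahead(int image_fd)
    {
        return posix_fadvise(image_fd, 0, 0, POSIX_FADV_RANDOM);
    }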
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 20:43 ` Avi Kivity @ 2008-10-12 21:11 ` Anthony Liguori 2008-10-14 15:21 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 21:11 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > Given that we don't have a zero-copy implementation yet, it is > impossible to generate real performance data. Which means that it's premature to suggest switching the default to O_DIRECT since it's not going to help right now. It can be revisited once we can do zero copy but again, I think it should be driven by actual performance data. My main point is that switching to O_DIRECT right now is only going to hurt performance for some users, and most likely help no one. > Now I don't have data that demonstrates how bad these effects are, but I > think there is sufficient arguments here to justify adding O_DIRECT. I > intend to recommend O_DIRECT unless I see performance data that favours > O_DSYNC on real world scenarios that take into account bandwidth, cpu > utilization, and memory utilization (i.e. a 1G guest on a 32G host > running fio but not top doesn't count). > So you intend on recommending something that you don't think is going to improve performance today and you know in certain scenarios is going to decrease performance? That doesn't seem right :-) I'm certainly open to changing the default once we get to a point where there's a demonstrable performance improvement from O_DIRECT but since I don't think it's a given that there will be, switching now seems like a premature optimization which has the side effect of reducing the performance of certain users. That seems like a Bad Thing to me. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 21:11 ` Anthony Liguori @ 2008-10-14 15:21 ` Avi Kivity 2008-10-14 15:32 ` Anthony Liguori 2008-10-14 19:25 ` Laurent Vivier 0 siblings, 2 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 15:21 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: > Avi Kivity wrote: >> Given that we don't have a zero-copy implementation yet, it is >> impossible to generate real performance data. > > Which means that it's premature to suggest switching the default to > O_DIRECT since it's not going to help right now. It can be revisited > once we can do zero copy but again, I think it should be driven by > actual performance data. My main point is that switching to O_DIRECT > right now is only going to hurt performance for some users, and most > likely help no one. I am assuming that we will provide true O_DIRECT support soon. I don't think O_DIRECT should be qemu's default, since anyone using qemu directly is likely a "causal virtualization" user. Management systems like ovirt should definitely default to O_DIRECT (really, they shouldn't even offer caching). >> Now I don't have data that demonstrates how bad these effects are, but I >> think there is sufficient arguments here to justify adding O_DIRECT. I >> intend to recommend O_DIRECT unless I see performance data that favours >> O_DSYNC on real world scenarios that take into account bandwidth, cpu >> utilization, and memory utilization (i.e. a 1G guest on a 32G host >> running fio but not top doesn't count). >> > > So you intend on recommending something that you don't think is going > to improve performance today and you know in certain scenarios is > going to decrease performance? That doesn't seem right :-) > In the near term O_DIRECT will increase performance over the alternative. > I'm certainly open to changing the default once we get to a point > where there's a demonstrable performance improvement from O_DIRECT but > since I don't think it's a given that there will be, switching now > seems like a premature optimization which has the side effect of > reducing the performance of certain users. That seems like a Bad > Thing to me. I take the opposite view. O_DIRECT is the, well, direct path to the hardware. Caching introduces an additional layer of code and thus needs to proven effective. I/O and memory intensive applications use O_DIRECT; Xen uses O_DIRECT (or equivalent); I don't see why we need to deviate from industry practice. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 15:21 ` Avi Kivity @ 2008-10-14 15:32 ` Anthony Liguori 2008-10-14 15:43 ` Avi Kivity 2008-10-14 19:25 ` Laurent Vivier 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-14 15:32 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > I don't think O_DIRECT should be qemu's default, since anyone using qemu > directly is likely a "casual virtualization" user. Management systems > like ovirt should definitely default to O_DIRECT (really, they shouldn't > even offer caching). > ovirt isn't a good example because the default storage model is iSCSI. Since you aren't preserving zero-copy, I doubt that you'll see any advantage to using O_DIRECT (I suspect the code paths aren't even different). Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 15:32 ` Anthony Liguori @ 2008-10-14 15:43 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 15:43 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: > Avi Kivity wrote: >> I don't think O_DIRECT should be qemu's default, since anyone using qemu >> directly is likely a "casual virtualization" user. Management systems >> like ovirt should definitely default to O_DIRECT (really, they shouldn't >> even offer caching). >> > > ovirt isn't a good example because the default storage model is > iSCSI. Since you aren't preserving zero-copy, I doubt that you'll see > any advantage to using O_DIRECT (I suspect the code paths aren't even > different). If you have a hardware iSCSI initiator then O_DIRECT pays off. Even for a software initiator, the write path could be made zero copy. The read path doesn't look good though. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 15:21 ` Avi Kivity 2008-10-14 15:32 ` Anthony Liguori @ 2008-10-14 19:25 ` Laurent Vivier 2008-10-16 9:47 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-14 19:25 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, qemu-devel, Ryan Harper Le mardi 14 octobre 2008 à 17:21 +0200, Avi Kivity a écrit : > Anthony Liguori wrote: > > Avi Kivity wrote: > >> Given that we don't have a zero-copy implementation yet, it is > >> impossible to generate real performance data. > > > > Which means that it's premature to suggest switching the default to > > O_DIRECT since it's not going to help right now. It can be revisited > > once we can do zero copy but again, I think it should be driven by > > actual performance data. My main point is that switching to O_DIRECT > > right now is only going to hurt performance for some users, and most > > likely help no one. > > I am assuming that we will provide true O_DIRECT support soon. If you remember, I tried to introduce zero copy when I wrote the "cache=off" patch: http://thread.gmane.org/gmane.comp.emulators.qemu/22148/focus=22149 but it was not correct (see Fabrice comment). Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 19:25 ` Laurent Vivier @ 2008-10-16 9:47 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-16 9:47 UTC (permalink / raw) To: Laurent Vivier Cc: Chris Wright, Mark McLoughlin, kvm-devel, qemu-devel, Ryan Harper Laurent Vivier wrote: >> I am assuming that we will provide true O_DIRECT support soon. >> > > If you remember, I tried to introduce zero copy when I wrote the > "cache=off" patch: > > http://thread.gmane.org/gmane.comp.emulators.qemu/22148/focus=22149 > > but it was not correct (see Fabrice comment). > Yes, this is not trivial, especially if we want to provide good support for all qemu targets. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-11 20:35 ` Anthony Liguori 2008-10-12 0:43 ` Mark Wagner 2008-10-12 0:44 ` Chris Wright @ 2008-10-12 10:12 ` Avi Kivity 2008-10-17 13:20 ` Jens Axboe 2 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-12 10:12 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: >> >> If I focus on the sentence "The I/O is synchronous, that is, at >> the completion of a read(2) or write(2), data is guaranteed to have >> been transferred. ", > > It's extremely important to understand what the guarantee is. The > guarantee is that upon completion on write(), the data will have been > reported as written by the underlying storage subsystem. This does > *not* mean that the data is on disk. > It means that as far as the block-io layer of the kernel is concerned, the guarantee is met. If the writes go to to a ramdisk, or to an IDE drive with write-back cache enabled, or to disk with write-back cache disabled but without redundancy, or to a high-end storage array with double-parity protection but without a continuous data protection offsite solution, things may still go wrong. It is up to qemu to provide a strong link in the data reliability chain, not to ensure that the entire chain is perfect. That's up to the administrator or builder of the system. > If you have a normal laptop, your disk has a cache. That cache does > not have a battery backup. Under normal operations, the cache is > acting in write-back mode and when you do a write, the disk will > report the write as completed even though it is not actually on disk. > If you really care about the data being on disk, you have to either > use a disk with a battery backed cache (much more expensive) or enable > write-through caching (will significantly reduce performance). > I think that with SATA NCQ, this is no longer true. The drive will report the write complete when it is on disk, and utilize multiple outstanding requests to get coalescing and reordering. Not sure about this, though -- some drives may still be lying. > In the case of KVM, even using write-back caching with the host page > cache, we are still honoring the guarantee of O_DIRECT. We just have > another level of caching that happens to be write-back. No, we are lying. That's fine if the user tells us to lie, but not otherwise. >> I think there a bug here. If I open a >> file with the O_DIRECT flag and the host reports back to me that >> the transfer has completed when in fact its still in the host cache, >> its a bug as it violates the open()/write() call and there is no >> guarantee that the data will actually be written. > > This is very important, O_DIRECT does *not* guarantee that data > actually resides on disk. There are many possibly places that it can > be cached (in the storage controller, in the disks themselves, in a > RAID controller). O_DIRECT guarantees that the kernel is not the weak link in the reliability chain. > >> So I guess the real issue isn't what the default should be (although >> the performance team at Red Hat would vote for cache=off), > > The consensus so far has been that we want to still use the host page > cache but use it in write-through mode. This would mean that the > guest would only see data completion when the host's storage subsystem > reports the write as having completed. This is not the same as > cache=off but I think gives the real effect that is desired. 
I am fine with write-through as default, but cache=off should be a supported option. > > Do you have another argument for using cache=off? Performance. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
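To make the distinction above concrete, here is a minimal userspace sketch of an O_DIRECT write on Linux: the host page cache is bypassed (hence the aligned buffer), but completion still only means the device accepted the write, so an explicit fdatasync() is the portable way to ask the layers below to flush (how effective that is depends on the filesystem and on barrier support, as discussed later in the thread). The 4096-byte alignment and the file path are assumptions for illustration.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT wants buffer, offset and length aligned to the logical
       block size; 4096 is assumed here for illustration. */
    if (posix_memalign(&buf, 4096, 4096)) return 1;
    memset(buf, 0xab, 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

    /* Returning from pwrite() does not mean data is on the platter if the
       drive cache is write-back; fdatasync() asks the kernel to flush. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}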
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 10:12 ` Avi Kivity @ 2008-10-17 13:20 ` Jens Axboe 2008-10-19 9:01 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Jens Axboe @ 2008-10-17 13:20 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 12 2008, Avi Kivity wrote: > >If you have a normal laptop, your disk has a cache. That cache does > >not have a battery backup. Under normal operations, the cache is > >acting in write-back mode and when you do a write, the disk will > >report the write as completed even though it is not actually on disk. > >If you really care about the data being on disk, you have to either > >use a disk with a battery backed cache (much more expensive) or enable > >write-through caching (will significantly reduce performance). > > > > I think that with SATA NCQ, this is no longer true. The drive will > report the write complete when it is on disk, and utilize multiple > outstanding requests to get coalescing and reordering. Not sure about It is still very true. Go buy any consumer drive on the market and check the write cache settings - hint, it's definitely shipped with write back caching. So while the drive may have NCQ and Linux will use it, the write cache is still using write back unless you explicitly change it. > this, though -- some drives may still be lying. I think this is largely an urban myth, at least I've never come across any drives that lie. It's easy enough to test, modulo firmware bugs. Just switch to write through and compare the random write iops rate. Or enable write barriers in Linux and do the same workload, compare iops rate again with write back caching. -- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-17 13:20 ` Jens Axboe @ 2008-10-19 9:01 ` Avi Kivity 2008-10-19 18:10 ` Jens Axboe 0 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 9:01 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: > On Sun, Oct 12 2008, Avi Kivity wrote: > >>> If you have a normal laptop, your disk has a cache. That cache does >>> not have a battery backup. Under normal operations, the cache is >>> acting in write-back mode and when you do a write, the disk will >>> report the write as completed even though it is not actually on disk. >>> If you really care about the data being on disk, you have to either >>> use a disk with a battery backed cache (much more expensive) or enable >>> write-through caching (will significantly reduce performance). >>> >>> >> I think that with SATA NCQ, this is no longer true. The drive will >> report the write complete when it is on disk, and utilize multiple >> outstanding requests to get coalescing and reordering. Not sure about >> > > It is still very true. Go buy any consumer drive on the market and check > the write cache settings - hint, it's definitely shipped with write back > caching. So while the drive may have NCQ and Linux will use it, the > write cache is still using write back unless you explicitly change it. > > Sounds like a bug. Shouldn't Linux disable the write cache unless the user explicitly enables it, if NCQ is available? NCQ should provide acceptable throughput even without the write cache. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 9:01 ` Avi Kivity @ 2008-10-19 18:10 ` Jens Axboe 2008-10-19 18:23 ` Avi Kivity 2008-10-19 18:24 ` Avi Kivity 0 siblings, 2 replies; 101+ messages in thread From: Jens Axboe @ 2008-10-19 18:10 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 19 2008, Avi Kivity wrote: > Jens Axboe wrote: > >On Sun, Oct 12 2008, Avi Kivity wrote: > > > >>>If you have a normal laptop, your disk has a cache. That cache does > >>>not have a battery backup. Under normal operations, the cache is > >>>acting in write-back mode and when you do a write, the disk will > >>>report the write as completed even though it is not actually on disk. > >>>If you really care about the data being on disk, you have to either > >>>use a disk with a battery backed cache (much more expensive) or enable > >>>write-through caching (will significantly reduce performance). > >>> > >>> > >>I think that with SATA NCQ, this is no longer true. The drive will > >>report the write complete when it is on disk, and utilize multiple > >>outstanding requests to get coalescing and reordering. Not sure about > >> > > > >It is still very true. Go buy any consumer drive on the market and check > >the write cache settings - hint, it's definitely shipped with write back > >caching. So while the drive may have NCQ and Linux will use it, the > >write cache is still using write back unless you explicitly change it. > > > > > > Sounds like a bug. Shouldn't Linux disable the write cache unless the > user explicitly enables it, if NCQ is available? NCQ should provide > acceptable throughput even without the write cache. How can it be a bug? Changing the cache policy of a drive would be a policy decision in the kernel, that is never the right thing to do. There's no such thing as 'acceptable throughput', manufacturers and customers usually just want the go faster stripes and data consistency is second. Additionally, write back caching is perfectly safe, if used with a barrier enabled file system in Linux. Also note that most users will not have deep queuing for most things. To get good random write performance with write through caching and NCQ, you naturally need to be able to fill the drive queue most of the time. Most desktop workloads don't come close to that, so the user will definitely see it as slower. -- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
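As a concrete illustration of "a barrier enabled file system": ext3 accepts a barrier=1 mount option so that journal commits order around and flush the drive's write-back cache. A minimal sketch using mount(2) follows; the device, mount point and option string are examples only, and other filesystems spell the option differently.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Example device and mount point; adjust for the local system. */
    if (mount("/dev/sda2", "/mnt/data", "ext3", 0, "data=ordered,barrier=1") < 0) {
        perror("mount");
        return 1;
    }
    printf("mounted with journal write barriers enabled\n");
    return 0;
}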
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:10 ` Jens Axboe @ 2008-10-19 18:23 ` Avi Kivity 2008-10-19 19:17 ` M. Warner Losh 2008-10-19 18:24 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 18:23 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: >> Sounds like a bug. Shouldn't Linux disable the write cache unless the >> user explicitly enables it, if NCQ is available? NCQ should provide >> acceptable throughput even without the write cache. >> > > How can it be a bug? If it puts my data at risk, it's a bug. I can understand it for IDE, but not for SATA with NCQ. > Changing the cache policy of a drive would be a > policy decision in the kernel, If you don't want this in the kernel, then the system as a whole should default to being safe. Though in this case I think it is worthwhile to do this in the kernel. > that is never the right thing to do. > There's no such thing as 'acceptable throughput', I meant that performance is not completely destroyed. How can you even compare data safety to some percent of performance? > manufacturers and > customers usually just want the go faster stripes and data consistency > is second. What is the performance impact of disabling the write cache, given enough queue depth? > Additionally, write back caching is perfectly safe, if used > with a barrier enabled file system in Linux. > Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers don't help with O_DIRECT (right?). I shouldn't need a disk array to run a database. > Also note that most users will not have deep queuing for most things. To > get good random write performance with write through caching and NCQ, > you naturally need to be able to fill the drive queue most of the time. > Most desktop workloads don't come close to that, so the user will > definitely see it as slower. > Most desktop workloads use writeback cache, so write performance is not critical. However I'd hate to see my data destroyed by a power failure, and today's large caches can hold a bunch of data. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:23 ` Avi Kivity @ 2008-10-19 19:17 ` M. Warner Losh 2008-10-19 19:31 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: M. Warner Losh @ 2008-10-19 19:17 UTC (permalink / raw) To: qemu-devel, avi; +Cc: chrisw, markmc, kvm-devel, Laurent.Vivier, ryanh In message: <48FB7B26.2090903@redhat.com> Avi Kivity <avi@redhat.com> writes: : >> Sounds like a bug. Shouldn't Linux disable the write cache unless the : >> user explicitly enables it, if NCQ is available? NCQ should provide : >> acceptable throughput even without the write cache. : >> : > : > How can it be a bug? : : If it puts my data at risk, it's a bug. I can understand it for IDE, : but not for SATA with NCQ. So wouldn't async mounts by default be a bug too? Warner ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:17 ` M. Warner Losh @ 2008-10-19 19:31 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-19 19:31 UTC (permalink / raw) To: M. Warner Losh Cc: chrisw, markmc, kvm-devel, Laurent.Vivier, qemu-devel, ryanh M. Warner Losh wrote: > In message: <48FB7B26.2090903@redhat.com> > Avi Kivity <avi@redhat.com> writes: > : >> Sounds like a bug. Shouldn't Linux disable the write cache unless the > : >> user explicitly enables it, if NCQ is available? NCQ should provide > : >> acceptable throughput even without the write cache. > : >> > : > > : > How can it be a bug? > : > : If it puts my data at risk, it's a bug. I can understand it for IDE, > : but not for SATA with NCQ. > > So wouldn't async mounts by default be a bug too? > No. Applications which are worried about data integrity use fsync() or backups to protect the user. I'm not worried about losing a few minutes of openoffice.org work. I'm worried about mail systems, filesystem metadata, etc. which can easily lose a large amount of data which is hard to recover. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 101+ messages in thread
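A sketch of the idiom Avi is alluding to: an application that cares about a piece of data writes it to a temporary file, fsync()s it, rename()s it into place, and then fsync()s the directory so the rename itself is durable. The file names here are purely illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int commit_file(const char *dir, const char *tmp, const char *final,
                       const void *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }   /* data + inode on stable storage */
    close(fd);

    if (rename(tmp, final) < 0) return -1;

    int dfd = open(dir, O_RDONLY);                 /* make the rename durable */
    if (dfd < 0) return -1;
    int ret = fsync(dfd);
    close(dfd);
    return ret;
}

int main(void)
{
    const char msg[] = "queued mail message\n";
    return commit_file(".", "msg.tmp", "msg", msg, strlen(msg)) ? 1 : 0;
}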
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:10 ` Jens Axboe 2008-10-19 18:23 ` Avi Kivity @ 2008-10-19 18:24 ` Avi Kivity 2008-10-19 18:36 ` Jens Axboe 1 sibling, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 18:24 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: >> Sounds like a bug. Shouldn't Linux disable the write cache unless the >> user explicitly enables it, if NCQ is available? NCQ should provide >> acceptable throughput even without the write cache. >> > > How can it be a bug? If it puts my data at risk, it's a bug. I can understand it for IDE, but not for SATA with NCQ. > Changing the cache policy of a drive would be a > policy decision in the kernel, If you don't want this in the kernel, then the system as a whole should default to being safe. Though in this case I think it is worthwhile to do this in the kernel. > that is never the right thing to do. > There's no such thing as 'acceptable throughput', I meant that performance is not completely destroyed. How can you even compare data safety to some percent of performance? > manufacturers and > customers usually just want the go faster stripes and data consistency > is second. What is the performance impact of disabling the write cache, given enough queue depth? > Additionally, write back caching is perfectly safe, if used > with a barrier enabled file system in Linux. > Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers don't help with O_DIRECT (right?). I shouldn't need a disk array to run a database. > Also note that most users will not have deep queuing for most things. To > get good random write performance with write through caching and NCQ, > you naturally need to be able to fill the drive queue most of the time. > Most desktop workloads don't come close to that, so the user will > definitely see it as slower. > Most desktop workloads use writeback cache, so write performance is not critical. However I'd hate to see my data destroyed by a power failure, and today's large caches can hold a bunch of data. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:24 ` Avi Kivity @ 2008-10-19 18:36 ` Jens Axboe 2008-10-19 19:11 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Jens Axboe @ 2008-10-19 18:36 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 19 2008, Avi Kivity wrote: > Jens Axboe wrote: > > > > >> Sounds like a bug. Shouldn't Linux disable the write cache unless the > >> user explicitly enables it, if NCQ is available? NCQ should provide > >> acceptable throughput even without the write cache. > >> > > > > How can it be a bug? > > If it puts my data at risk, it's a bug. I can understand it for IDE, > but not for SATA with NCQ. Then YOU turn it off. Other people would consider the lousy performance to be the bigger problem. See policy :-) > > Changing the cache policy of a drive would be a > > policy decision in the kernel, > > If you don't want this in the kernel, then the system as a whole should > default to being safe. Though in this case I think it is worthwhile to > do this in the kernel. Doesn't matter how you turn this, it's still a policy decision. Leave it to the user. It's not exactly a new turn of events, commodity drives have shipped with write caching on forever. What if the drive has a battery backing? What if the user has an UPS? > > that is never the right thing to do. > > There's no such thing as 'acceptable throughput', > > I meant that performance is not completely destroyed. How can you even How do you know it's not destroyed? Depending on your workload, it may very well be dropping your throughput by orders of magnitude. > compare data safety to some percent of performance? I'm not, what I'm saying is that different people will have different opponions on what is most important. Do note that the window of corruption is really small and requires powerloss to trigger. So for most desktop users, the tradeoff is actually sane. > > manufacturers and > > customers usually just want the go faster stripes and data consistency > > is second. > > What is the performance impact of disabling the write cache, given > enough queue depth? Depends on the drive. On commodity drives, manufacturers don't really optimize much for the write through caching, since it's not really what anybody uses. So you'd have to benchmark it to see. > > Additionally, write back caching is perfectly safe, if used > > with a barrier enabled file system in Linux. > > > > Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers > don't help with O_DIRECT (right?). O_DIRECT should just use FUA writes, there are safe with write back caching. I'm actually testing such a change just to gauge the performance impact. > I shouldn't need a disk array to run a database. You are free to turn off write back caching! > > Also note that most users will not have deep queuing for most things. To > > get good random write performance with write through caching and NCQ, > > you naturally need to be able to fill the drive queue most of the time. > > Most desktop workloads don't come close to that, so the user will > > definitely see it as slower. > > > > Most desktop workloads use writeback cache, so write performance is not > critical. Ehm, how do you reach that conclusion based on that statement? > However I'd hate to see my data destroyed by a power failure, and > today's large caches can hold a bunch of data. Then you use barriers or turn write back caching off, simple as that. 
-- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:36 ` Jens Axboe @ 2008-10-19 19:11 ` Avi Kivity 2008-10-19 19:30 ` Jens Axboe 0 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 19:11 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: > On Sun, Oct 19 2008, Avi Kivity wrote: > >> Jens Axboe wrote: >> >> >> >> >>>> Sounds like a bug. Shouldn't Linux disable the write cache unless the >>>> user explicitly enables it, if NCQ is available? NCQ should provide >>>> acceptable throughput even without the write cache. >>>> >>>> >>> How can it be a bug? >>> >> If it puts my data at risk, it's a bug. I can understand it for IDE, >> but not for SATA with NCQ. >> > > Then YOU turn it off. Other people would consider the lousy performance > to be the bigger problem. See policy :-) > > If I get lousy performance, I can turn on the write cache and ignore the risk of data loss. If I lose my data, I can't turn off the write cache and get my data back. (it seems I can't turn off the write cache even without losing my data: [avi@firebolt ~]$ sudo sdparm --set=WCE=0 /dev/sd[ab] /dev/sda: ATA WDC WD3200YS-01P 21.0 change_mode_page: failed setting page: Caching (SBC) /dev/sdb: ATA WDC WD3200YS-01P 21.0 change_mode_page: failed setting page: Caching (SBC) ) >>> Changing the cache policy of a drive would be a >>> policy decision in the kernel, >>> >> If you don't want this in the kernel, then the system as a whole should >> default to being safe. Though in this case I think it is worthwhile to >> do this in the kernel. >> > > Doesn't matter how you turn this, it's still a policy decision. Leave it > to the user. It's not exactly a new turn of events, commodity drives > have shipped with write caching on forever. What if the drive has a > battery backing? If the drive has a batter backup, I'd argue it should report it as a write-through cache. I'm not a drive manufacturer though. > What if the user has an UPS? > > They should enable the write-back cache if they trust the UPS. Or maybe the system should do that automatically if it's aware of the UPS. "Policy" doesn't mean you shouldn't choose good defaults. >>> that is never the right thing to do. >>> There's no such thing as 'acceptable throughput', >>> >> I meant that performance is not completely destroyed. How can you even >> > > How do you know it's not destroyed? Depending on your workload, it may > very well be dropping your throughput by orders of magnitude. > > I guess this is the crux. According to my understanding, you shouldn't see such a horrible drop, unless the application does synchronous writes explicitly, in which case it is probably worried about data safety. >> compare data safety to some percent of performance? >> > > I'm not, what I'm saying is that different people will have different > opponions on what is most important. Do note that the window of > corruption is really small and requires powerloss to trigger. So for > most desktop users, the tradeoff is actually sane. > > I agree that the window is very small, and that by eliminating software failures we get rid of the major source of data loss. What I don't know is what the performance tradeoff looks like (and I can't measure since my drives won't let me turn off the cache for some reason). >>> Additionally, write back caching is perfectly safe, if used >>> with a barrier enabled file system in Linux. 
>>> >>> >> Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers >> don't help with O_DIRECT (right?). >> > > O_DIRECT should just use FUA writes, there are safe with write back > caching. I'm actually testing such a change just to gauge the > performance impact. > You mean, this is not in mainline yet? So, with this, plus barrier support for metadata and O_SYNC writes, the write-back cache should be safe? Some googling shows that Windows XP introduced FUA for O_DIRECT and metadata writes as well. > >> I shouldn't need a disk array to run a database. >> > > You are free to turn off write back caching! > > What about the users who aren't on qemu-devel? However, with your FUA change, they should be safe. >> >> Most desktop workloads use writeback cache, so write performance is not >> critical. >> > > Ehm, how do you reach that conclusion based on that statement? > > Any write latency is buffered by the kernel. Write speed is main memory speed. Disk speed only bubbles up when memory is tight. >> However I'd hate to see my data destroyed by a power failure, and >> today's large caches can hold a bunch of data. >> > > Then you use barriers or turn write back caching off, simple as that. > I will (if I figure out how) but there may be one or two users who haven't read the scsi spec yet. Or more correctly, I am revising my opinion of the write back cache since even when it is enabled, it is completely optional. Instead of disabling the write back cache we should use FUA and barriers, and since you are to be working on FUA, it looks like this will be resolved soon without performance/correctness compromises. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:11 ` Avi Kivity @ 2008-10-19 19:30 ` Jens Axboe 2008-10-19 20:16 ` Avi Kivity 2008-10-20 14:14 ` Avi Kivity 0 siblings, 2 replies; 101+ messages in thread From: Jens Axboe @ 2008-10-19 19:30 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 19 2008, Avi Kivity wrote: > Jens Axboe wrote: > > On Sun, Oct 19 2008, Avi Kivity wrote: > > > >> Jens Axboe wrote: > >> > >> > >> > >> > >>>> Sounds like a bug. Shouldn't Linux disable the write cache unless the > >>>> user explicitly enables it, if NCQ is available? NCQ should provide > >>>> acceptable throughput even without the write cache. > >>>> > >>>> > >>> How can it be a bug? > >>> > >> If it puts my data at risk, it's a bug. I can understand it for IDE, > >> but not for SATA with NCQ. > >> > > > > Then YOU turn it off. Other people would consider the lousy performance > > to be the bigger problem. See policy :-) > > > > > > If I get lousy performance, I can turn on the write cache and ignore the > risk of data loss. If I lose my data, I can't turn off the write cache > and get my data back. > > (it seems I can't turn off the write cache even without losing my data: > > [avi@firebolt ~]$ sudo sdparm --set=WCE=0 /dev/sd[ab] > /dev/sda: ATA WDC WD3200YS-01P 21.0 > change_mode_page: failed setting page: Caching (SBC) > /dev/sdb: ATA WDC WD3200YS-01P 21.0 > change_mode_page: failed setting page: Caching (SBC) Use hdparm, it's an ATA drive even if Linux currently uses the scsi layer for it. Or use sysfs, there's a "cache_type" attribute in the scsi disk sysfs directory. > >>> Changing the cache policy of a drive would be a > >>> policy decision in the kernel, > >>> > >> If you don't want this in the kernel, then the system as a whole should > >> default to being safe. Though in this case I think it is worthwhile to > >> do this in the kernel. > >> > > > > Doesn't matter how you turn this, it's still a policy decision. Leave it > > to the user. It's not exactly a new turn of events, commodity drives > > have shipped with write caching on forever. What if the drive has a > > battery backing? > > If the drive has a batter backup, I'd argue it should report it as a > write-through cache. I'm not a drive manufacturer though. You could argue that, but that could influence other decision making. FWIW, we've discussed this very issue for YEARS, reiterating the debate here isn't likely to change much... > > What if the user has an UPS? > > > > > > They should enable the write-back cache if they trust the UPS. Or maybe > the system should do that automatically if it's aware of the UPS. > > "Policy" doesn't mean you shouldn't choose good defaults. Changing the hardware settings for this kind of behaviour IS most certainly policy. > >>> that is never the right thing to do. > >>> There's no such thing as 'acceptable throughput', > >>> > >> I meant that performance is not completely destroyed. How can you even > >> > > > > How do you know it's not destroyed? Depending on your workload, it may > > very well be dropping your throughput by orders of magnitude. > > > > > > I guess this is the crux. According to my understanding, you shouldn't > see such a horrible drop, unless the application does synchronous writes > explicitly, in which case it is probably worried about data safety. Then you need to adjust your understanding, because you definitely will see a big drop in performance. 
> >> compare data safety to some percent of performance? > >> > > > > I'm not, what I'm saying is that different people will have different > > opponions on what is most important. Do note that the window of > > corruption is really small and requires powerloss to trigger. So for > > most desktop users, the tradeoff is actually sane. > > > > > > I agree that the window is very small, and that by eliminating software > failures we get rid of the major source of data loss. What I don't know > is what the performance tradeoff looks like (and I can't measure since > my drives won't let me turn off the cache for some reason). > > >>> Additionally, write back caching is perfectly safe, if used > >>> with a barrier enabled file system in Linux. > >>> > >>> > >> Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers > >> don't help with O_DIRECT (right?). > >> > > > > O_DIRECT should just use FUA writes, there are safe with write back > > caching. I'm actually testing such a change just to gauge the > > performance impact. > > > > You mean, this is not in mainline yet? It isn't. > So, with this, plus barrier support for metadata and O_SYNC writes, the > write-back cache should be safe? Yes, and fsync() as well provided the fs does a flush there too. > Some googling shows that Windows XP introduced FUA for O_DIRECT and > metadata writes as well. There's a lot of other background information to understand to gauge the impact of using eg FUA for O_DIRECT in Linux as well. MS basically wrote the FUA for ATA proposal, and the original usage pattern (as far as I remember) was indeed meta data. Hence it also imposes a priority boost in most (all?) drive firmwares, since it's deemed important. So just using FUA vs non-FUA is likely to impact performance of other workloads in fairly unknown ways. FUA on non-queuing drives will also likely suck for performance, since you're basically going to be blowing a drive rev for each IO. And that hurts. > >> I shouldn't need a disk array to run a database. > >> > > > > You are free to turn off write back caching! > > > > > > What about the users who aren't on qemu-devel? It may be news to you, but it has been debated on lkml in the past as well. Not even that long ago, and I'd be surprised of lwn didn't run some article on it as well. But I agree it's important information, but realize that until just recently most people didn't really consider it a likely scenario in practice... I wrote and committed the original barrier implementation in Linux in 2001, and just this year XFS made it a default mount option. After the recent debacle on this on lkml, ext4 made it the default as well. So let me turn it around a bit - if this issue really did hit lots of people out there in real life, don't you think there would have been more noise about this and we would have made this the default years ago? So while we both agree it's a risk, it's not a huuuge risk... > However, with your FUA change, they should be safe. Yes, that would make O_DIRECT safe always. Except when it falls back to buffered IO, woops... > >> Most desktop workloads use writeback cache, so write performance is not > >> critical. > >> > > > > Ehm, how do you reach that conclusion based on that statement? > > > > > > Any write latency is buffered by the kernel. Write speed is main memory > speed. Disk speed only bubbles up when memory is tight. That's a nice theory, in practice that is completely wrong. You end up waiting on writes for LOTS of other reasons! 
> >> However I'd hate to see my data destroyed by a power failure, and > >> today's large caches can hold a bunch of data. > >> > > > > Then you use barriers or turn write back caching off, simple as that. > > > > I will (if I figure out how) but there may be one or two users who > haven't read the scsi spec yet. A newish hdparm should work, or the sysfs attribute. hdparm will pass-through the real ata command to do this, the sysfs approach (and sdparm) requires MODE_SENSE and MODE_SELECT transformation of that page. > Or more correctly, I am revising my opinion of the write back cache > since even when it is enabled, it is completely optional. Instead of > disabling the write back cache we should use FUA and barriers, and since > you are to be working on FUA, it looks like this will be resolved soon > without performance/correctness compromises. Lets see how the testing goes :-) Possibly just enabled FUA O_DIRECT with barriers, that'll likely be a good default. -- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
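For reference, a sketch of driving the sysfs "cache_type" attribute Jens mentions from a small C program. The /sys/class/scsi_disk/<h:c:t:l>/cache_type path layout is an assumption about the local system, and changing the policy requires root and a driver that accepts a MODE SELECT on the caching page.

#include <stdio.h>

int main(int argc, char **argv)
{
    /* e.g. argv[1] = "/sys/class/scsi_disk/0:0:0:0/cache_type",
       argv[2] = "write through" or "write back" (optional) */
    if (argc < 2) {
        fprintf(stderr, "usage: %s <cache_type sysfs path> [new policy]\n", argv[0]);
        return 1;
    }

    char cur[64] = "";
    FILE *f = fopen(argv[1], "r");
    if (!f || !fgets(cur, sizeof(cur), f)) {
        perror("read cache_type");
        return 1;
    }
    fclose(f);
    printf("current policy: %s", cur);   /* sysfs value already ends in a newline */

    if (argc > 2) {
        FILE *w = fopen(argv[1], "w");   /* needs root */
        if (!w || fprintf(w, "%s\n", argv[2]) < 0) {
            perror("write cache_type");
            return 1;
        }
        fclose(w);
    }
    return 0;
}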
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:30 ` Jens Axboe @ 2008-10-19 20:16 ` Avi Kivity 2008-10-20 14:14 ` Avi Kivity 1 sibling, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-19 20:16 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: >> (it seems I can't turn off the write cache even without losing my data: >> > Use hdparm, it's an ATA drive even if Linux currently uses the scsi > layer for it. Or use sysfs, there's a "cache_type" attribute in the scsi > disk sysfs directory. > Ok. It's moot anyway. >> "Policy" doesn't mean you shouldn't choose good defaults. >> > > Changing the hardware settings for this kind of behaviour IS most > certainly policy. > Leaving bad hardware settings is also policy. But in light of FUA, the SCSI write cache is not a bad thing, so we should definitely leave it on. >> I guess this is the crux. According to my understanding, you shouldn't >> see such a horrible drop, unless the application does synchronous writes >> explicitly, in which case it is probably worried about data safety. >> > > Then you need to adjust your understanding, because you definitely will > see a big drop in performance. > > Can you explain why? This is interesting. >>> O_DIRECT should just use FUA writes, there are safe with write back >>> caching. I'm actually testing such a change just to gauge the >>> performance impact. >>> >>> >> You mean, this is not in mainline yet? >> > > It isn't. > What is the time frame for this? 2.6.29? >> Some googling shows that Windows XP introduced FUA for O_DIRECT and >> metadata writes as well. >> > > There's a lot of other background information to understand to gauge the > impact of using eg FUA for O_DIRECT in Linux as well. MS basically wrote > the FUA for ATA proposal, and the original usage pattern (as far as I > remember) was indeed meta data. Hence it also imposes a priority boost > in most (all?) drive firmwares, since it's deemed important. So just > using FUA vs non-FUA is likely to impact performance of other workloads > in fairly unknown ways. FUA on non-queuing drives will also likely suck > for performance, since you're basically going to be blowing a drive rev > for each IO. And that hurts. > Let's assume queueing drives, since these are fairly common these days. So qemu issuing O_DIRECT which turns into FUA writes is safe but suboptimal. Has there been talk about exposing the difference between FUA writes and cached writes to userspace? What about barriers? With a rich enough userspace interface, qemu can communicate the intentions of the guest and not force the kernel to make a performance/correctness tradeoff. >> >> What about the users who aren't on qemu-devel? >> > > It may be news to you, but it has been debated on lkml in the past as > well. Not even that long ago, and I'd be surprised of lwn didn't run > some article on it as well. Let's postulate the existence of a user that doesn't read lkml or even lwn. > But I agree it's important information, but > realize that until just recently most people didn't really consider it a > likely scenario in practice... > > I wrote and committed the original barrier implementation in Linux in > 2001, and just this year XFS made it a default mount option. After the > recent debacle on this on lkml, ext4 made it the default as well. 
> > So let me turn it around a bit - if this issue really did hit lots of > people out there in real life, don't you think there would have been > more noise about this and we would have made this the default years ago? > So while we both agree it's a risk, it's not a huuuge risk... > I agree, not a huge risk. I guess compared to the rest of the suckiness involved (took a long while just to get journalling), this is really a minor issue. It's interesting though that Windows supported this in 2001, seven years ago, so at least they considered it important. I guess I'm sensitive to this because in my filesystemy past QA would jerk out data and power cables while running various tests and act surprised whenever data was lost. So I'm allergic to data loss. With qemu (at least when used with a hypervisor) we have to be extra safe since we have no idea what workload is running and how critical data safety is. Well, we have hints (whether FUA is set or not) when using SCSI, but right now we don't have a way of communicating these hints to the kernel. One important takeaway is to find out whether virtio-blk supports FUA, and if not, add it. >> However, with your FUA change, they should be safe. >> > > Yes, that would make O_DIRECT safe always. Except when it falls back to > buffered IO, woops... > > Woops. >> Any write latency is buffered by the kernel. Write speed is main memory >> speed. Disk speed only bubbles up when memory is tight. >> > > That's a nice theory, in practice that is completely wrong. You end up > waiting on writes for LOTS of other reasons! > > Journal commits? Can you elaborate? In the filesystem I worked on, one would never wait on a write to disk unless memory was full. Even synchronous writes were serviced immediately, since the system had a battery-backed replicated cache. I guess the situation with Linux filesystems is different. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:30 ` Jens Axboe 2008-10-19 20:16 ` Avi Kivity @ 2008-10-20 14:14 ` Avi Kivity 1 sibling, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-20 14:14 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: > Use hdparm, it's an ATA drive even if Linux currently uses the scsi > layer for it. Or use sysfs, there's a "cache_type" attribute in the scsi > disk sysfs directory. > With this, I was able to benchmark the write cache for 4k and 128k random access loads. Numbers in iops, hope it doesn't get mangled:

                 4k blocks              128k blocks
  pattern   cache off   cache on   cache off   cache on
  read         103         101        74          71
  write         86         149        72          91
  rw            87          89        63          65

Test was run on a 90G logical volume of a 250G laptop disk; using O_DIRECT and libaio. Pure write workloads see a tremendous benefit, likely because the heads can do a linear scan of the disk. An 8MB cache translates to 2000 objects, likely around 1000 per pass. Increasing the block size reduces the performance boost, as expected. read/write workloads do not benefit at all (or maybe a bit); presumably the head movement is governed by reads alone. Of course, this tests only the disk subsystem; in particular, if some workload is sensitive to write latencies, the write cache can reduce those in a mixed read/write load, as long as the cache is not flooded (so loads with a lower percentage of writes would benefit more). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 101+ messages in thread
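For anyone wanting to reproduce numbers in this spirit, a minimal sketch of a random-write IOPS test using the same ingredients (O_DIRECT plus libaio, link with -laio), with a queue depth of one for simplicity; the block size, request count and device span are arbitrary choices, and the program overwrites whatever scratch device it is pointed at.

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK 4096
#define NREQS 2000
#define SPAN  (1024LL * 1024 * 1024)   /* test over the first 1G of the device */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <scratch device>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;

    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NREQS; i++) {
        long long off = (rand() % (SPAN / BLOCK)) * (long long)BLOCK;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        io_prep_pwrite(&cb, fd, buf, BLOCK, off);            /* queue one write */
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) {       /* wait for it */
            fprintf(stderr, "io_getevents failed\n"); return 1;
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f random %d-byte writes/sec\n", NREQS / secs, BLOCK);

    io_destroy(ctx);
    close(fd);
    return 0;
}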
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (2 preceding siblings ...) 2008-10-10 9:16 ` Avi Kivity @ 2008-10-10 10:03 ` Fabrice Bellard 2008-10-13 16:11 ` Laurent Vivier ` (3 subsequent siblings) 7 siblings, 0 replies; 101+ messages in thread From: Fabrice Bellard @ 2008-10-10 10:03 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori wrote: > [...] > So to summarize, I think we should enable O_DSYNC by default to ensure > that guest data integrity is not dependent on the host OS, and that > practically speaking, cache=off is only useful for very specialized > circumstances. Part of the patch I'll follow up with includes changes > to the man page to document all of this for users. > > Thoughts? QEMU is also used for debugging and arbitrary machine simulation. In this case, using uncached accesses is bad because you want maximum isolation between the guest and the host. For example, if the guest is a development OS not using a disk cache, I still want to use the host disk cache. So the "normal" caching scheme must be left as it is now. However, I agree that the default behavior could be modified. Regards, Fabrice. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (3 preceding siblings ...) 2008-10-10 10:03 ` Fabrice Bellard @ 2008-10-13 16:11 ` Laurent Vivier 2008-10-13 16:58 ` Anthony Liguori 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper ` (2 subsequent siblings) 7 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-13 16:11 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel Le jeudi 09 octobre 2008 à 12:00 -0500, Anthony Liguori a écrit : [...] > So to summarize, I think we should enable O_DSYNC by default to > ensure > that guest data integrity is not dependent on the host OS, and that > practically speaking, cache=off is only useful for very specialized > circumstances. Part of the patch I'll follow up with includes > changes > to the man page to document all of this for users. perhaps I'm wrong but I think O_DSYNC (in fact O_SYNC for linux) will impact host filesystem performance, at least with ext3, because the synchronicity is done through the commit of the journal of the whole filesystem: see fs/ext3/file.c:ext3_file_write() (I've removed the comments here):

	...
	if (file->f_flags & O_SYNC) {
		if (!ext3_should_journal_data(inode))
			return ret;
		goto force_commit;
	}

	if (!IS_SYNC(inode))
		return ret;

force_commit:
	err = ext3_force_commit(inode->i_sb);
	if (err)
		return err;
	return ret;
}

Moreover, the real behavior depends on the type of the journaling system you use... Regards, Laurent -- ----------------- Laurent.Vivier@bull.net ------------------ "La perfection est atteinte non quand il ne reste rien à ajouter mais quand il ne reste rien à enlever." Saint Exupéry ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 16:11 ` Laurent Vivier @ 2008-10-13 16:58 ` Anthony Liguori 2008-10-13 17:36 ` Jamie Lokier 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 16:58 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel Laurent Vivier wrote: > Le jeudi 09 octobre 2008 à 12:00 -0500, Anthony Liguori a écrit : > [...] > >> So to summarize, I think we should enable O_DSYNC by default to >> ensure >> that guest data integrity is not dependent on the host OS, and that >> practically speaking, cache=off is only useful for very specialized >> circumstances. Part of the patch I'll follow up with includes >> changes >> to the man page to document all of this for users. >> > > perhaps I'm wrong but I think O_DSYNC (in fact O_SYNC for linux) will > impact host filesystem performance, at least with ext3, because the > synchronicity is done through the commit of the journal of the whole > filesystem: > Yes, but this is important because if the journal isn't committed, then it's possible that while the data would be on disk, the file system metadata is out of sync on disk which could result in the changes to the file being lost. I think that you are in fact correct that the journal write is probably unnecessary overhead in a lot of scenarios but Ryan actually has some performance data that he should be posting soon that shows that in most circumstances, O_DSYNC does pretty well compared to O_DIRECT for write so I don't think this is a practical concern. Regards, Anthony Liguori > see fs/ext3/file.c:ext3_file_write() (I've removed the comments here):
>
>	...
>	if (file->f_flags & O_SYNC) {
>		if (!ext3_should_journal_data(inode))
>			return ret;
>		goto force_commit;
>	}
>
>	if (!IS_SYNC(inode))
>		return ret;
>
> force_commit:
>	err = ext3_force_commit(inode->i_sb);
>	if (err)
>		return err;
>	return ret;
> }
>
> Moreover, the real behavior depends on the type of the journaling system > you use... > > Regards, > Laurent > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 16:58 ` Anthony Liguori @ 2008-10-13 17:36 ` Jamie Lokier 0 siblings, 0 replies; 101+ messages in thread From: Jamie Lokier @ 2008-10-13 17:36 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel Anthony Liguori wrote: > >perhaps I'm wrong but I think O_DSYNC (in fact O_SYNC for linux) will > >impact host filesystem performance, at least with ext3, because the > >synchronicity is done through the commit of the journal of the whole > >filesystem: > > > > Yes, but this is important because if the journal isn't committed, then > it's possible that while the data would be on disk, the file system > metadata is out of sync on disk which could result in the changes to the > file being lost. > > I think that you are in fact correct that the journal write is probably > unnecessary overhead in a lot of scenarios but Ryan actually has some > performance data that he should be posting soon that shows that in most > circumstances, O_DSYNC does pretty well compared to O_DIRECT for write > so I don't this is a practical concern. fsync on ext3 is whacky anyway. I haven't checked what the _real_ semantics of O_DSYNC are for ext3, but I would be surprised if it's less whacky than fsync. Sometimes ext3 fsync takes a very long time, because it's waiting for lots of dirty data from other processes to be written. (Firefox 3 was bitten by this - it made Firefox stall repeatedly for up to half a minute for some users.) Sometimes ext3 fsync doesn't write all the dirty pages of a file - there are some recent kernel patches exploring ways to fix this. Sometimes ext3 fsync doesn't flush the disk's write cache after writing data, despite barriers being requested, if only dirty data blocks are written and there is no inode change. -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
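The stalls Jamie describes are easy to observe with a sketch like the following, which appends a small record and times each fsync(); on ext3 in ordered mode the call can block behind unrelated dirty data being written out by the journal commit. The file path and iteration count are arbitrary.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "small appended record\n";
    struct timespec t0, t1;

    for (int i = 0; i < 10; i++) {
        if (write(fd, rec, sizeof(rec) - 1) < 0) perror("write");

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) < 0) perror("fsync");     /* may stall behind other writers */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("fsync %d took %.1f ms\n", i, ms);
        sleep(1);
    }

    close(fd);
    return 0;
}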
* [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (4 preceding siblings ...) 2008-10-13 16:11 ` Laurent Vivier @ 2008-10-13 17:06 ` Ryan Harper 2008-10-13 18:43 ` Anthony Liguori ` (2 more replies) 2008-10-13 17:58 ` [Qemu-devel] " Rik van Riel 2008-10-28 17:34 ` Ian Jackson 7 siblings, 3 replies; 101+ messages in thread From: Ryan Harper @ 2008-10-13 17:06 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel@nongnu.org, Ryan Harper * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > Read performance should be unaffected by using O_DSYNC. O_DIRECT will > significantly reduce read performance. I think we should use O_DSYNC by > default and I have sent out a patch that contains that. We will follow > up with benchmarks to demonstrate this. baremetal baseline (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | type, block size, iface | MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | 9.48 | write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | 27.15 | write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | 416.91 | ---------------------------+-------+-------+--------------+------------+ read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | 7.10 | read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | 24.31 | read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | 382.67 | ---------------------------+-------+-------+--------------+------------+ kvm write (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ 16k,virtio,off,none | 135.0 | 94 | 9.1 | 8.71 | 16k,virtio,on ,none | 184.0 | 100 | 63.69 | 63.48 | 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | 8.31 | ---------------------------+-------+-------+--------------+------------+ 64k,virtio,off,none | 169.0 | 51 | 17.10 | 28.00 | 64k,virtio,on ,none | 189.0 | 60 | 69.42 | 24.92 | 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | 27.72 | ---------------------------+-------+-------+--------------+------------+ 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | 523.00 | 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | 392.35 | 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | 471.20 | ---------------------------+-------+-------+--------------+------------+ kvm read (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ 16k,virtio,off,none | 175.0 | 40 | 22.42 | 6.71 | 16k,virtio,on ,none | 211.0 | 147 | 59.49 | 5.54 | 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | 5.47 | ---------------------------+-------+-------+--------------+------------+ 64k,virtio,off,none | 190.0 | 64 | 16.31 | 24.92 | 64k,virtio,on ,none | 546.0 | 161 | 111.06 | 8.54 | 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | 8.97 | ---------------------------+-------+-------+--------------+------------+ 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | 407.21 | 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | 96.42 | 1M 
,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | 94.25 | ---------------------------+-------+-------+--------------+------------+ -------------------------------------------------------------------------- exporting file in ext3 filesystem as block device (1g) -------------------------------------------------------------------------- kvm write (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ 16k,virtio,off,none | 12.1 | 15 | 9.1 | 8.71 | 16k,virtio,on ,none | 192.0 | 52 | 62.52 | 6.17 | 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | 8.29 | ---------------------------+-------+-------+--------------+------------+ 64k,virtio,off,none | 15.5 | 8 | 21.10 | 311.00 | 64k,virtio,on ,none | 454.0 | 130 | 113.25 | 10.65 | 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | 30.75 | ---------------------------+-------+-------+--------------+------------+ 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | 3020.08 | 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | 149.81 | 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | 453.84 | ---------------------------+-------+-------+--------------+------------+ -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper @ 2008-10-13 18:43 ` Anthony Liguori 2008-10-14 16:42 ` Avi Kivity 2008-10-13 18:51 ` Laurent Vivier 2008-10-13 19:00 ` Mark Wagner 2 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 18:43 UTC (permalink / raw) To: Ryan Harper Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel@nongnu.org Ryan Harper wrote: > * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > >> Read performance should be unaffected by using O_DSYNC. O_DIRECT will >> significantly reduce read performance. I think we should use O_DSYNC by >> default and I have sent out a patch that contains that. We will follow >> up with benchmarks to demonstrate this. >> >> With 16k writes I think we hit a pathological case with the particular storage backend we're using since it has many disks and the volume is striped. Also, the results are a bit different when going through a file system versus an LVM partition (the latter being the first data set). Presumably, this is because even with no flags, writes happen synchronously to an LVM partition. Also, cache=off seems to do pretty terribly when operating on an ext3 file. I suspect this has to do with how ext3 implements O_DIRECT. However, the data demonstrates pretty nicely that O_DSYNC gives you native write speed plus accelerated read speed, which I think we agree is the desirable behavior. cache=off never seems to outperform cache=wt, which is another good argument for cache=wt being the default over cache=off. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 18:43 ` Anthony Liguori @ 2008-10-14 16:42 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 16:42 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori wrote: > > With 16k writes I think we hit a pathological case with the particular > storage backend we're using since it has many disks and the volume is > striped. Also the results a bit different when going through a file > system verses a LVM partition (the later being the first data set). > Presumably, this is because even with no flags, writes happen > synchronously to a LVM partition. > With no flags, writes should hit the buffer cache (which is the page cache's name when used to cache block devices). > Also, cache=off seems to do pretty terribly when operating on an ext3 > file. I suspect this has to do with how ext3 implements O_DIRECT. Is the file horribly fragmented? Otherwise ext3 O_DIRECT should be quite good. Maybe the mapping is not in the host cache and has to be brought in. > > However, the data demonstrates pretty nicely that O_DSYNC gives you > native write speed, but accelerated read speed which I think we agree > is the desirable behavior. cache=off never seems to outperform > cache=wt which is another good argument for it being the default over > cache=off. Without copyless block I/O, there's no reason to expect cache=none to outperform cache=writethrough. I expect the read performance to evaporate with a random access pattern over a large disk (or even sequential access, given enough running time). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
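A quick way to see Avi's point about where unflagged writes land is to watch the host's dirty-page counters while the guest write test runs; this is only a rough sanity check, not a substitute for the benchmark numbers:
    # on the host, while the guest runs its write workload
    watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
With cache=on and no sync flag, Dirty should climb toward the size of the dataset; with O_DSYNC or O_DIRECT it should stay near zero.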
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper 2008-10-13 18:43 ` Anthony Liguori @ 2008-10-13 18:51 ` Laurent Vivier 2008-10-13 19:43 ` Ryan Harper 2008-10-13 19:00 ` Mark Wagner 2 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-13 18:51 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper [-- Attachment #1: Type: text/plain, Size: 6855 bytes --] Le 13 oct. 08 à 19:06, Ryan Harper a écrit : > * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: >> Read performance should be unaffected by using O_DSYNC. O_DIRECT >> will >> significantly reduce read performance. I think we should use >> O_DSYNC by >> default and I have sent out a patch that contains that. We will >> follow >> up with benchmarks to demonstrate this. > Hi Ryan, as "cache=on" implies a factor (memory) shared by the whole system, you must take into account the size of the host memory and run some applications (several guests ?) to pollute the host cache, for instance you can run 4 guest and run bench in each of them concurrently, and you could reasonably limits the size of the host memory to 5 x the size of the guest memory. (for instance 4 guests with 128 MB on a host with 768 MB). as O_DSYNC implies journal commit, you should run a bench on the ext3 host file system concurrently to the bench on a guest to see the impact of the commit on each bench. > > baremetal baseline (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > type, block size, iface | MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | > 9.48 | > write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | > 27.15 | > write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | > 416.91 | > ---------------------------+-------+-------+-------------- > +------------+ > read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | > 7.10 | > read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | > 24.31 | > read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | > 382.67 | > ---------------------------+-------+-------+-------------- > +------------+ > Could you recall which benchmark you use ? > kvm write (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > 16k,virtio,off,none | 135.0 | 94 | 9.1 | > 8.71 | > 16k,virtio,on ,none | 184.0 | 100 | 63.69 | > 63.48 | > 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | > 8.31 | > ---------------------------+-------+-------+-------------- > +------------+ > 64k,virtio,off,none | 169.0 | 51 | 17.10 | > 28.00 | > 64k,virtio,on ,none | 189.0 | 60 | 69.42 | > 24.92 | > 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | > 27.72 | > ---------------------------+-------+-------+-------------- > +------------+ > 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | > 523.00 | > 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | > 392.35 | > 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | > 471.20 | > ---------------------------+-------+-------+-------------- > +------------+ According to the semantic, I don't understand how O_DSYNC can be better than cache=off in this case... 
> > kvm read (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > 16k,virtio,off,none | 175.0 | 40 | 22.42 | > 6.71 | > 16k,virtio,on ,none | 211.0 | 147 | 59.49 | > 5.54 | > 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | > 5.47 | > ---------------------------+-------+-------+-------------- > +------------+ > 64k,virtio,off,none | 190.0 | 64 | 16.31 | > 24.92 | > 64k,virtio,on ,none | 546.0 | 161 | 111.06 | > 8.54 | > 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | > 8.97 | > ---------------------------+-------+-------+-------------- > +------------+ > 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | > 407.21 | > 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | > 96.42 | > 1M ,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | > 94.25 | > ---------------------------+-------+-------+-------------- > +------------+ OK, but in this case the size of the cache for "cache=off" is the size of the guest cache whereas in the other cases the size of the cache is the size of the guest cache + the size of the host cache, this is not fair... > > -------------------------------------------------------------------------- > exporting file in ext3 filesystem as block device (1g) > -------------------------------------------------------------------------- > > kvm write (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > 16k,virtio,off,none | 12.1 | 15 | 9.1 | > 8.71 | > 16k,virtio,on ,none | 192.0 | 52 | 62.52 | > 6.17 | > 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | > 8.29 | > ---------------------------+-------+-------+-------------- > +------------+ > 64k,virtio,off,none | 15.5 | 8 | 21.10 | > 311.00 | > 64k,virtio,on ,none | 454.0 | 130 | 113.25 | > 10.65 | > 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | > 30.75 | > ---------------------------+-------+-------+-------------- > +------------+ > 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | > 3020.08 | > 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | > 149.81 | > 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | > 453.84 | > ---------------------------+-------+-------+-------------- > +------------+ What file type do you use (qcow2, raw ?). Regards, Laurent ----------------------- Laurent Vivier ---------------------- "The best way to predict the future is to invent it." - Alan Kay [-- Attachment #2: Type: text/html, Size: 13285 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 18:51 ` Laurent Vivier @ 2008-10-13 19:43 ` Ryan Harper 2008-10-13 20:21 ` Laurent Vivier ` (2 more replies) 0 siblings, 3 replies; 101+ messages in thread From: Ryan Harper @ 2008-10-13 19:43 UTC (permalink / raw) To: Laurent Vivier Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: > > Le 13 oct. 08 à 19:06, Ryan Harper a écrit : > > >* Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > >>Read performance should be unaffected by using O_DSYNC. O_DIRECT > >>will > >>significantly reduce read performance. I think we should use > >>O_DSYNC by > >>default and I have sent out a patch that contains that. We will > >>follow > >>up with benchmarks to demonstrate this. > > > > Hi Ryan, > > as "cache=on" implies a factor (memory) shared by the whole system, > you must take into account the size of the host memory and run some > applications (several guests ?) to pollute the host cache, for > instance you can run 4 guest and run bench in each of them > concurrently, and you could reasonably limits the size of the host > memory to 5 x the size of the guest memory. > (for instance 4 guests with 128 MB on a host with 768 MB). I'm not following you here, the only assumption I see is that we have 1g of host mem free for caching the write. > > as O_DSYNC implies journal commit, you should run a bench on the ext3 > host file system concurrently to the bench on a guest to see the > impact of the commit on each bench. I understand the goal here, but what sort of host ext3 journaling load is appropriate. Additionally, when we're exporting block devices, I don't believe the ext3 journal is an issue. > > > > >baremetal baseline (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >type, block size, iface | MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | > >9.48 | > >write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | > >27.15 | > >write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | > >416.91 | > >---------------------------+-------+-------+-------------- > >+------------+ > >read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | > >7.10 | > >read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | > >24.31 | > >read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | > >382.67 | > >---------------------------+-------+-------+-------------- > >+------------+ > > > > Could you recall which benchmark you use ? 
yeah: fio --name=guestrun --filename=/dev/vda --rw=write --bs=${SIZE} --ioengine=libaio --direct=1 --norandommap --numjobs=1 --group_reporting --thread --size=1g --write_lat_log --write_bw_log --iodepth=74 > > >kvm write (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >block size,iface,cache,sync| MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >16k,virtio,off,none | 135.0 | 94 | 9.1 | > >8.71 | > >16k,virtio,on ,none | 184.0 | 100 | 63.69 | > >63.48 | > >16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | > >8.31 | > >---------------------------+-------+-------+-------------- > >+------------+ > >64k,virtio,off,none | 169.0 | 51 | 17.10 | > >28.00 | > >64k,virtio,on ,none | 189.0 | 60 | 69.42 | > >24.92 | > >64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | > >27.72 | > >---------------------------+-------+-------+-------------- > >+------------+ > >1M ,virtio,off,none | 142.0 | 30 | 7176.00 | > >523.00 | > >1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | > >392.35 | > >1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | > >471.20 | > >---------------------------+-------+-------+-------------- > >+------------+ > > According to the semantic, I don't understand how O_DSYNC can be > better than cache=off in this case... I don't have a good answer either, but O_DIRECT and O_DSYNC are different paths through the kernel. This deserves a better reply, but I don't have one off the top of my head. > > > > >kvm read (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >block size,iface,cache,sync| MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >16k,virtio,off,none | 175.0 | 40 | 22.42 | > >6.71 | > >16k,virtio,on ,none | 211.0 | 147 | 59.49 | > >5.54 | > >16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | > >5.47 | > >---------------------------+-------+-------+-------------- > >+------------+ > >64k,virtio,off,none | 190.0 | 64 | 16.31 | > >24.92 | > >64k,virtio,on ,none | 546.0 | 161 | 111.06 | > >8.54 | > >64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | > >8.97 | > >---------------------------+-------+-------+-------------- > >+------------+ > >1M ,virtio,off,none | 182.0 | 32 | 5573.44 | > >407.21 | > >1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | > >96.42 | > >1M ,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | > >94.25 | > >---------------------------+-------+-------+-------------- > >+------------+ > > OK, but in this case the size of the cache for "cache=off" is the size > of the guest cache whereas in the other cases the size of the cache is > the size of the guest cache + the size of the host cache, this is not > fair... it isn't supposed to be fair, cache=off is O_DIRECT, we're reading from the device, we *want* to be able to lean on the host cache to read the data, pay once and benefit in other guests if possible. 
> > > > >-------------------------------------------------------------------------- > >exporting file in ext3 filesystem as block device (1g) > >-------------------------------------------------------------------------- > > > >kvm write (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >block size,iface,cache,sync| MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >16k,virtio,off,none | 12.1 | 15 | 9.1 | > >8.71 | > >16k,virtio,on ,none | 192.0 | 52 | 62.52 | > >6.17 | > >16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | > >8.29 | > >---------------------------+-------+-------+-------------- > >+------------+ > >64k,virtio,off,none | 15.5 | 8 | 21.10 | > >311.00 | > >64k,virtio,on ,none | 454.0 | 130 | 113.25 | > >10.65 | > >64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | > >30.75 | > >---------------------------+-------+-------+-------------- > >+------------+ > >1M ,virtio,off,none | 24.7 | 5 | 41736.22 | > >3020.08 | > >1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | > >149.81 | > >1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | > >453.84 | > >---------------------------+-------+-------+-------------- > >+------------+ > > What file type do you use (qcow2, raw ?). Raw. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
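One way to compare the two kernel paths Ryan mentions without KVM in the picture at all is to drive the same volume from the host with dd, which can open its output with either flag. A sketch under the assumption that /dev/vg0/scratch is a disposable logical volume:
    # O_DIRECT write path (the cache=off analogue), 1g of 64k writes
    dd if=/dev/zero of=/dev/vg0/scratch bs=64k count=16384 oflag=direct
    # O_DSYNC write path (page cache used, each write completes on media)
    dd if=/dev/zero of=/dev/vg0/scratch bs=64k count=16384 oflag=dsync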
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:43 ` Ryan Harper @ 2008-10-13 20:21 ` Laurent Vivier 2008-10-13 21:05 ` Ryan Harper 2008-10-14 10:05 ` Kevin Wolf 2008-10-14 16:37 ` Avi Kivity 2 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-13 20:21 UTC (permalink / raw) To: Ryan Harper; +Cc: Chris Wright, Mark McLoughlin, qemu-devel, Laurent Vivier Le 13 oct. 08 à 21:43, Ryan Harper a écrit : > * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: >> >> Le 13 oct. 08 à 19:06, Ryan Harper a écrit : >> >>> * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: >>>> Read performance should be unaffected by using O_DSYNC. O_DIRECT >>>> will >>>> significantly reduce read performance. I think we should use >>>> O_DSYNC by >>>> default and I have sent out a patch that contains that. We will >>>> follow >>>> up with benchmarks to demonstrate this. >>> >> >> Hi Ryan, >> >> as "cache=on" implies a factor (memory) shared by the whole system, >> you must take into account the size of the host memory and run some >> applications (several guests ?) to pollute the host cache, for >> instance you can run 4 guest and run bench in each of them >> concurrently, and you could reasonably limits the size of the host >> memory to 5 x the size of the guest memory. >> (for instance 4 guests with 128 MB on a host with 768 MB). > > I'm not following you here, the only assumption I see is that we > have 1g > of host mem free for caching the write. Is this a realistic use case ? > >> >> as O_DSYNC implies journal commit, you should run a bench on the ext3 >> host file system concurrently to the bench on a guest to see the >> impact of the commit on each bench. > > I understand the goal here, but what sort of host ext3 journaling load > is appropriate. Additionally, when we're exporting block devices, I > don't believe the ext3 journal is an issue. Yes, it's a comment for the last test case. I think you can run the same benchmark as you do in the guest. > > >> >>> >>> baremetal baseline (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> type, block size, iface | MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | >>> 9.48 | >>> write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | >>> 27.15 | >>> write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | >>> 416.91 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | >>> 7.10 | >>> read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | >>> 24.31 | >>> read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | >>> 382.67 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> >> >> Could you recall which benchmark you use ? > > yeah: > > fio --name=guestrun --filename=/dev/vda --rw=write --bs=${SIZE} > --ioengine=libaio --direct=1 --norandommap --numjobs=1 -- > group_reporting > --thread --size=1g --write_lat_log --write_bw_log --iodepth=74 > Thank you... 
>> >>> kvm write (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> block size,iface,cache,sync| MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 16k,virtio,off,none | 135.0 | 94 | 9.1 | >>> 8.71 | >>> 16k,virtio,on ,none | 184.0 | 100 | 63.69 | >>> 63.48 | >>> 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | >>> 8.31 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 64k,virtio,off,none | 169.0 | 51 | 17.10 | >>> 28.00 | >>> 64k,virtio,on ,none | 189.0 | 60 | 69.42 | >>> 24.92 | >>> 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | >>> 27.72 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | >>> 523.00 | >>> 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | >>> 392.35 | >>> 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | >>> 471.20 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >> >> According to the semantic, I don't understand how O_DSYNC can be >> better than cache=off in this case... > > I don't have a good answer either, but O_DIRECT and O_DSYNC are > different paths through the kernel. This deserves a better reply, but > I don't have one off the top of my head. The O_DIRECT kernel path should be more "direct" than the O_DSYNC one. Perhaps a oprofile could help to understand ? What it is strange also is the CPU usage with cache=off. It should be lower than others, perhaps an alignment issue ? due to the LVM ? > > >> >>> >>> kvm read (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> block size,iface,cache,sync| MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 16k,virtio,off,none | 175.0 | 40 | 22.42 | >>> 6.71 | >>> 16k,virtio,on ,none | 211.0 | 147 | 59.49 | >>> 5.54 | >>> 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | >>> 5.47 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 64k,virtio,off,none | 190.0 | 64 | 16.31 | >>> 24.92 | >>> 64k,virtio,on ,none | 546.0 | 161 | 111.06 | >>> 8.54 | >>> 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | >>> 8.97 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | >>> 407.21 | >>> 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | >>> 96.42 | >>> 1M ,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | >>> 94.25 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >> >> OK, but in this case the size of the cache for "cache=off" is the >> size >> of the guest cache whereas in the other cases the size of the cache >> is >> the size of the guest cache + the size of the host cache, this is not >> fair... > > it isn't supposed to be fair, cache=off is O_DIRECT, we're reading > from > the device, we *want* to be able to lean on the host cache to read the > data, pay once and benefit in other guests if possible. OK, but if you want to follow this way I think you must run several guests concurrently to see how the host cache help each of them. If you want I can try this tomorrow ? The O_DSYNC patch is the one posted to the mailing-list ? 
And moreover, you should run an endurance test to see how the cache evolves. > >> >>> >>> -------------------------------------------------------------------------- >>> exporting file in ext3 filesystem as block device (1g) >>> -------------------------------------------------------------------------- >>> >>> kvm write (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> block size,iface,cache,sync| MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 16k,virtio,off,none | 12.1 | 15 | 9.1 | >>> 8.71 | >>> 16k,virtio,on ,none | 192.0 | 52 | 62.52 | >>> 6.17 | >>> 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | >>> 8.29 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 64k,virtio,off,none | 15.5 | 8 | 21.10 | >>> 311.00 | >>> 64k,virtio,on ,none | 454.0 | 130 | 113.25 | >>> 10.65 | >>> 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | >>> 30.75 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | >>> 3020.08 | >>> 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | >>> 149.81 | >>> 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | >>> 453.84 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >> >> What file type do you use (qcow2, raw ?). > > Raw. No comment Laurent ----------------------- Laurent Vivier ---------------------- "The best way to predict the future is to invent it." - Alan Kay ^ permalink raw reply [flat|nested] 101+ messages in thread
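If someone does collect the profile Laurent is asking for, the classic opcontrol workflow would presumably look something like the following; the vmlinux path and binary location are illustrative and depend on the distro:
    opcontrol --vmlinux=/boot/vmlinux-$(uname -r) --start
    # ... run the fio job in the guest with cache=off ...
    opcontrol --stop && opcontrol --dump
    opreport -l $(which qemu-system-x86_64) | head -30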
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 20:21 ` Laurent Vivier @ 2008-10-13 21:05 ` Ryan Harper 2008-10-15 13:10 ` Laurent Vivier 0 siblings, 1 reply; 101+ messages in thread From: Ryan Harper @ 2008-10-13 21:05 UTC (permalink / raw) To: Laurent Vivier Cc: Chris Wright, Mark McLoughlin, Laurent Vivier, qemu-devel, Ryan Harper * Laurent Vivier <laurent@lvivier.info> [2008-10-13 15:39]: > >> > >>as "cache=on" implies a factor (memory) shared by the whole system, > >>you must take into account the size of the host memory and run some > >>applications (several guests ?) to pollute the host cache, for > >>instance you can run 4 guest and run bench in each of them > >>concurrently, and you could reasonably limits the size of the host > >>memory to 5 x the size of the guest memory. > >>(for instance 4 guests with 128 MB on a host with 768 MB). > > > >I'm not following you here, the only assumption I see is that we > >have 1g > >of host mem free for caching the write. > > Is this a realistic use case ? Optimistic? I don't think it is unrealistic. It is hard to know what hardware and use-case any end user may have at their disposal. > >> > >>as O_DSYNC implies journal commit, you should run a bench on the ext3 > >>host file system concurrently to the bench on a guest to see the > >>impact of the commit on each bench. > > > >I understand the goal here, but what sort of host ext3 journaling load > >is appropriate. Additionally, when we're exporting block devices, I > >don't believe the ext3 journal is an issue. > > Yes, it's a comment for the last test case. > I think you can run the same benchmark as you do in the guest. I'm not sure where to go with this. If it turns out that scaling out on to of ext3 stinks, then the deployment needs to change to deal with that limitation in ext3. Use a proper block device, something like lvm. > >>According to the semantic, I don't understand how O_DSYNC can be > >>better than cache=off in this case... > > > >I don't have a good answer either, but O_DIRECT and O_DSYNC are > >different paths through the kernel. This deserves a better reply, but > >I don't have one off the top of my head. > > The O_DIRECT kernel path should be more "direct" than the O_DSYNC one. > Perhaps a oprofile could help to understand ? > What it is strange also is the CPU usage with cache=off. It should be > lower than others, perhaps an alignment issue ? > due to the LVM ? All possible, I don't have an oprofile of it. > >> > >>OK, but in this case the size of the cache for "cache=off" is the > >>size > >>of the guest cache whereas in the other cases the size of the cache > >>is > >>the size of the guest cache + the size of the host cache, this is not > >>fair... > > > >it isn't supposed to be fair, cache=off is O_DIRECT, we're reading > >from > >the device, we *want* to be able to lean on the host cache to read the > >data, pay once and benefit in other guests if possible. > > OK, but if you want to follow this way I think you must run several > guests concurrently to see how the host cache help each of them. > If you want I can try this tomorrow ? The O_DSYNC patch is the one > posted to the mailing-list ? The patch used is the same as what is on the list, feel free to try. > > And moreover, you should run an endurance test to see how the cache > evolves. I'm not sure how interesting this is, either it was in the cache or not, depending on what work you do you can either devolve to a case where nothing is in cache or where everything is in cache. 
The point being that by using the cache where we can, we get the benefit. If you use cache=off you'll never be able to get that boost when it would otherwise have been available. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 21:05 ` Ryan Harper @ 2008-10-15 13:10 ` Laurent Vivier 2008-10-16 10:24 ` Laurent Vivier 0 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-15 13:10 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Hi, I made some tests on my system. Intel Xeon, 2 GB RAM Disk SATA 80 GB, using 4 GB Partitions my guests are: qemu/x86_64-softmmu/qemu-system-x86_64 -hda ../victory.qcow2 -drive file=/dev/sdc1,if=virtio,cache=on -net nic,model=virtio,macaddress=52:54:00:12:34:71 -net tap -serial stdio -m 512 -nographic qemu/x86_64-softmmu/qemu-system-x86_64 -hda ../valkyrie.qcow2 -drive file=/dev/sdc2,if=virtio,cache=on -net nic,model=virtio,macaddress=52:54:00:12:34:72 -net tap -serial stdio -m 512 -nographic I use the fio command given by Ryan with a 5 GB dataset (bigger than host RAM). Results follow. baremetal | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 59.86 | 4.29 | 20.25 | write,64k | 59.87 | 7.65 | 80.99 | write,1M | 59.87 | 14935.89 | 1280.71 | ---------------+-------+----------+----------+ read,16k | 59.87 | 3.98 | 20,24 | read,64k | 59.88 | 8.19 | 80.98 | read,1M | 59.85 | 14959.63 | 1280.55 | ---------------+-------+----------+----------+ one guest, cache=on | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 59.35 | 44.64 | 20.38 | write,64k | 53.40 | 70.87 | 90.72 | write,1M | 54.81 | 18963.69 | 1395.37 | ---------------+-------+----------+----------+ read,16k | 35.62 | 7.84 | 34.02 | read,64k | 34.27 | 11.86 | 141.48 | read,1M | 17.50 | 59689.95 | 4344.10 | ---------------+-------+----------+----------+ one guest, cache=off | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 59.31 | 4.44 | 20.43 | write,64k | 14.90 | 11.54 | 325.49 | write,1M | 23.37 | 44683.35 | 3255.03 | ---------------+-------+----------+----------+ read,16k | 59.00 | 4.41 | 20.54 | read,64k | 13.04 | 11.84 | 371.80 | read,1M | 17.79 | 58712.11 | 4277.20 | ---------------+-------+----------+----------+ one guest, cache=on, O_DSYNC | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 54.44 | 13.07 | 22.25 | write,64k | 54.19 | 13.10 | 89.48 | write,1M | 58.77 | 17763.85 | 1295.22 | ---------------+-------+----------+----------+ read,16k | 35.27 | 7.83 | 34.36 | read,64k | 33.59 | 11.74 | 144.36 | read,1M | 17.44 | 59856.18 | 4357.69 | ---------------+-------+----------+----------+ two guests, cache=on | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 19.20 | 36.83 | 63.11 | | 18.90 | 35.06 | 64.10 | write,64k | 18.22 | 62.46 | 266.09 | | 17.68 | 61.64 | 274.89 | write,1M | 17.18 | 60442.52 | 4454.48 | | 17.11 | 61137.82 | 4424.15 | ---------------+-------+----------+----------+ read,16k | 16.32 | 8.19 | 74.25 | | 20.62 | 7.17 | 58.77 | read,64k | 13.02 | 14.05 | 372.35 | | 13.47 | 14.60 | 359.95 | read,1M | 7.68 |135632.60 | 9909.40 | | 7.62 |137367.63 | 9985.99 | ---------------+-------+----------+----------+ two guests, cache=off | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 26.39 | 7.08 | 45.58 | | 26.40 | 8.33 | 45.90 | write,64k | 8.08 | 12.77 | 599.79 | | 8.09 | 12.87 | 599.59 | write,1M | 10.27 |101694.60 | 
7410.92 | | 10.28 |101513.20 | 7405.89 | ---------------+-------+----------+----------+ read,16k | 42.36 | 4.60 | 28.60 | | 27.96 | 14.56 | 43.31 | read,64k | 5.84 | 13.31 | 830.94 | | 5.83 | 22.27 | 830.62 | read,1M | 7.82 |133631.63 | 9730.10 | | 7.82 |133351.59 | 9725.79 | ---------------+-------+----------+----------+ two guests, cache=on, O_DSYNC | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 19.77 | 17.36 | 61.29 | | 19.73 | 6.36 | 61.43 | write,64k | 23.10 | 14.00 | 209.94 | | 36.25 | 14.51 | 25.22 | write,1M | 23.94 | 43704.88 | 3146.77 | | 36.68 | 28456.63 | 2073.53 | ---------------+-------+----------+----------+ read,16k | 16.38 | 8.04 | 73.99 | | 20.08 | 6.88 | 60.38 | read,64k | 11.39 | 15.22 | 425.61 | | 11.50 | 14.97 | 421.55 | read,1M | 7.68 |135693.24 | 9914.71 | | 7.61 |137409.27 | 9984.48 | ---------------+-------+----------+----------+ -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
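One methodological aside on runs like these: the cache=on numbers depend heavily on what is already resident on the host, so it may be worth flushing the host cache between runs, or booting the host with a mem= limit as Laurent does elsewhere in the thread. For example (host-side, as root):
    # drop the host page cache, dentries and inodes between runs
    sync && echo 3 > /proc/sys/vm/drop_caches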
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-15 13:10 ` Laurent Vivier @ 2008-10-16 10:24 ` Laurent Vivier 2008-10-16 13:43 ` Anthony Liguori 0 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-16 10:24 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Hi, I've made a benchmark using a database: mysql and sysbench in OLTP mode. cache=off seems to be the best choice in this case... mysql database http://sysbench.sourceforge.net sysbench --test=oltp 200,000 requests on 2,000,000 rows table. | total time | per-request stat (ms) | | (seconds) | min | avg | max | -----------------+------------+-------+-------+-------+ baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | -----------------+------------+-------+-------+-------+ cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | -----------------+------------+-------+-------+-------+ cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | -----------------+------------+-------+-------+-------+ cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | -----------------+------------+-------+-------+-------+ Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
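For anyone trying to reproduce this, the sysbench side presumably amounts to something like the following; the option names are from sysbench 0.4, and the database name and credentials are placeholders:
    sysbench --test=oltp --mysql-db=sbtest --mysql-user=root \
             --oltp-table-size=2000000 prepare
    sysbench --test=oltp --mysql-db=sbtest --mysql-user=root \
             --oltp-table-size=2000000 --max-requests=200000 run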
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-16 10:24 ` Laurent Vivier @ 2008-10-16 13:43 ` Anthony Liguori 2008-10-16 16:08 ` Laurent Vivier 2008-10-17 12:48 ` Avi Kivity 0 siblings, 2 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-16 13:43 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Laurent Vivier wrote: > Hi, > > I've made a benchmark using a database: > mysql and sysbench in OLTP mode. > > cache=off seems to be the best choice in this case... > It would be interesting for you to run the same workload under KVM. > mysql database > http://sysbench.sourceforge.net > > sysbench --test=oltp > > 200,000 requests on 2,000,000 rows table. > > | total time | per-request stat (ms) | > | (seconds) | min | avg | max | > -----------------+------------+-------+-------+-------+ > baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | > -----------------+------------+-------+-------+-------+ > cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | > -----------------+------------+-------+-------+-------+ > cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | > -----------------+------------+-------+-------+-------+ > cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | > -----------------+------------+-------+-------+-------+ > Because you're talking about roughly 1/3 of native performance. This means that you may be dominated by things like CPU overhead versus actual IO throughput. Regards, Anthony Liguori > Laurent > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-16 13:43 ` Anthony Liguori @ 2008-10-16 16:08 ` Laurent Vivier 2008-10-17 12:48 ` Avi Kivity 1 sibling, 0 replies; 101+ messages in thread From: Laurent Vivier @ 2008-10-16 16:08 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Le jeudi 16 octobre 2008 à 08:43 -0500, Anthony Liguori a écrit : > Laurent Vivier wrote: > > Hi, > > > > I've made a benchmark using a database: > > mysql and sysbench in OLTP mode. > > > > cache=off seems to be the best choice in this case... > > > > It would be interesting for you to run the same workload under KVM. It is done under KVM... and I've just double checked these values. > > mysql database > > http://sysbench.sourceforge.net > > > > sysbench --test=oltp > > > > 200,000 requests on 2,000,000 rows table. > > > > | total time | per-request stat (ms) | > > | (seconds) | min | avg | max | > > -----------------+------------+-------+-------+-------+ > > baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | > > -----------------+------------+-------+-------+-------+ > > cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | > > -----------------+------------+-------+-------+-------+ > > cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | > > -----------------+------------+-------+-------+-------+ > > cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | > > -----------------+------------+-------+-------+-------+ > > > > Because you're talking about 1/3% of native performance. This means > that you may be dominated by things like CPU overhead verses actual IO > throughput. Yes, but as it is KVM I have no explanation... I've another interesting result with scsi-generic : -----------------+------------+-------+-------+-------+ scsi-generic | 634.1303 | 2.8 | 50.7 | 308.6 | -----------------+------------+-------+-------+-------+ Regards, Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-16 13:43 ` Anthony Liguori 2008-10-16 16:08 ` Laurent Vivier @ 2008-10-17 12:48 ` Avi Kivity 2008-10-17 13:17 ` Laurent Vivier 1 sibling, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-17 12:48 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Anthony Liguori wrote: >> >> | total time | per-request stat (ms) | >> | (seconds) | min | avg | max | >> -----------------+------------+-------+-------+-------+ >> baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | >> -----------------+------------+-------+-------+-------+ >> cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | >> -----------------+------------+-------+-------+-------+ >> cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | >> -----------------+------------+-------+-------+-------+ >> cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | >> -----------------+------------+-------+-------+-------+ >> > > Because you're talking about 1/3% of native performance. This means > that you may be dominated by things like CPU overhead verses actual IO > throughput. I don't know mysql well, but perhaps it sizes its internal cache to system memory size, so baremetal has 4x the amount of cache. If mysql uses mmap to access its data files, then it automatically scales with system memory. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-17 12:48 ` Avi Kivity @ 2008-10-17 13:17 ` Laurent Vivier 0 siblings, 0 replies; 101+ messages in thread From: Laurent Vivier @ 2008-10-17 13:17 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Le vendredi 17 octobre 2008 à 14:48 +0200, Avi Kivity a écrit : > Anthony Liguori wrote: > >> > >> | total time | per-request stat (ms) | > >> | (seconds) | min | avg | max | > >> -----------------+------------+-------+-------+-------+ > >> baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | > >> -----------------+------------+-------+-------+-------+ > >> cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | > >> -----------------+------------+-------+-------+-------+ > >> cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | > >> -----------------+------------+-------+-------+-------+ > >> cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | > >> -----------------+------------+-------+-------+-------+ > >> > > > > Because you're talking about 1/3% of native performance. This means > > that you may be dominated by things like CPU overhead verses actual IO > > throughput. > > I don't know mysql well, but perhaps it sizes its internal cache to > system memory size, so baremetal has 4x the amount of cache. > > If mysql uses mmap to access its data files, then it automatically > scales with system memory. It is what I thought but no: I've approximately the same results with "mem=512M". Regards, Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:43 ` Ryan Harper 2008-10-13 20:21 ` Laurent Vivier @ 2008-10-14 10:05 ` Kevin Wolf 2008-10-14 14:32 ` Ryan Harper 2008-10-14 16:37 ` Avi Kivity 2 siblings, 1 reply; 101+ messages in thread From: Kevin Wolf @ 2008-10-14 10:05 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Laurent Vivier Ryan Harper schrieb: > * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: >> What file type do you use (qcow2, raw ?). > > Raw. I guess the image is preallocated? What about sparse files (or qcow2, anything that grows), do you have numbers on those? In the past, I experienced O_DIRECT to be horribly slow on them. Well, looking at your numbers, they _are_ quite bad, so maybe it actually was sparse. Then the preallocated case would be interesting. Kevin ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-14 10:05 ` Kevin Wolf @ 2008-10-14 14:32 ` Ryan Harper 0 siblings, 0 replies; 101+ messages in thread From: Ryan Harper @ 2008-10-14 14:32 UTC (permalink / raw) To: Kevin Wolf Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper, Laurent Vivier * Kevin Wolf <kwolf@suse.de> [2008-10-14 05:10]: > Ryan Harper schrieb: > > * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: > >> What file type do you use (qcow2, raw ?). > > > > Raw. > > I guess the image is preallocated? What about sparse files (or qcow2, > anything that grows), do you have numbers on those? In the past, I > experienced O_DIRECT to be horribly slow on them. > > Well, looking at your numbers, they _are_ quite bad, so maybe it > actually was sparse. Then the preallocated case would be interesting. It was pre-allocated. I'm inclined to think there is an alignment issue or some sort of bug/edge-case in the write path to the file on top of the lvm volume, considering I don't see such horrible performance against the file on the host via O_DIRECT. I imagine until I figure out the issue, sparse or preallocated will perform the same. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
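For anyone wanting to compare the sparse and preallocated cases Kevin asks about, a minimal way to create both on the host (file names are illustrative, 1G to match the dataset):
    # sparse raw image: blocks are allocated only on first write
    qemu-img create -f raw disk-sparse.raw 1G
    # fully preallocated raw image
    dd if=/dev/zero of=disk-prealloc.raw bs=1M count=1024
    # compare apparent size against blocks actually allocated
    ls -ls disk-sparse.raw disk-prealloc.raw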
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:43 ` Ryan Harper 2008-10-13 20:21 ` Laurent Vivier 2008-10-14 10:05 ` Kevin Wolf @ 2008-10-14 16:37 ` Avi Kivity 2 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 16:37 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Laurent Vivier Ryan Harper wrote: > fio --name=guestrun --filename=/dev/vda --rw=write --bs=${SIZE} > --ioengine=libaio --direct=1 --norandommap --numjobs=1 --group_reporting > --thread --size=1g --write_lat_log --write_bw_log --iodepth=74 > > How large is /dev/vda? Also, I think you're doing sequential access, which means sequential runs will improve as data is brought into cache. I suggest random access, with a very large /dev/vda. >> OK, but in this case the size of the cache for "cache=off" is the size >> of the guest cache whereas in the other cases the size of the cache is >> the size of the guest cache + the size of the host cache, this is not >> fair... >> > > it isn't supposed to be fair, cache=off is O_DIRECT, we're reading from > the device, we *want* to be able to lean on the host cache to read the > data, pay once and benefit in other guests if possible. > > My assumption is that the memory would be better utilized in the guest (which makes better eviction choices, and which is a lot closer to the application). We'd need to run fio in non-direct mode to show this. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
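Building on the fio line quoted above, the variants Avi suggests would presumably look roughly like this; the 40g size is a stand-in for "much larger than host RAM" and assumes /dev/vda is at least that big:
    # random reads over a large device, still O_DIRECT in the guest
    fio --name=randread --filename=/dev/vda --rw=randread --bs=16k \
        --ioengine=libaio --direct=1 --norandommap --numjobs=1 \
        --group_reporting --thread --size=40g --iodepth=74
    # the same job through the guest page cache (non-direct)
    fio --name=cachedread --filename=/dev/vda --rw=randread --bs=16k \
        --ioengine=libaio --direct=0 --norandommap --numjobs=1 \
        --group_reporting --thread --size=40g --iodepth=74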
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper 2008-10-13 18:43 ` Anthony Liguori 2008-10-13 18:51 ` Laurent Vivier @ 2008-10-13 19:00 ` Mark Wagner 2008-10-13 19:15 ` Ryan Harper 2 siblings, 1 reply; 101+ messages in thread From: Mark Wagner @ 2008-10-13 19:00 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper Ryan Harper wrote: > * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > >> Read performance should be unaffected by using O_DSYNC. O_DIRECT will >> significantly reduce read performance. I think we should use O_DSYNC by >> default and I have sent out a patch that contains that. We will follow >> up with benchmarks to demonstrate this. >> > > > baremetal baseline (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > type, block size, iface | MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | 9.48 | > write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | 27.15 | > write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | 416.91 | > ---------------------------+-------+-------+--------------+------------+ > read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | 7.10 | > read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | 24.31 | > read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | 382.67 | > ---------------------------+-------+-------+--------------+------------+ > > kvm write (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > 16k,virtio,off,none | 135.0 | 94 | 9.1 | 8.71 | > 16k,virtio,on ,none | 184.0 | 100 | 63.69 | 63.48 | > 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | 8.31 | > ---------------------------+-------+-------+--------------+------------+ > 64k,virtio,off,none | 169.0 | 51 | 17.10 | 28.00 | > 64k,virtio,on ,none | 189.0 | 60 | 69.42 | 24.92 | > 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | 27.72 | > ---------------------------+-------+-------+--------------+------------+ > 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | 523.00 | > 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | 392.35 | > 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | 471.20 | > ---------------------------+-------+-------+--------------+------------+ > > kvm read (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > 16k,virtio,off,none | 175.0 | 40 | 22.42 | 6.71 | > 16k,virtio,on ,none | 211.0 | 147 | 59.49 | 5.54 | > 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | 5.47 | > ---------------------------+-------+-------+--------------+------------+ > 64k,virtio,off,none | 190.0 | 64 | 16.31 | 24.92 | > 64k,virtio,on ,none | 546.0 | 161 | 111.06 | 8.54 | > 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | 8.97 | > ---------------------------+-------+-------+--------------+------------+ > 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | 407.21 | > 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | 96.42 | > 1M ,virtio,on ,O_DSYNC | 
768.0 | 123 | 1289.05 | 94.25 | > ---------------------------+-------+-------+--------------+------------+ > > -------------------------------------------------------------------------- > exporting file in ext3 filesystem as block device (1g) > -------------------------------------------------------------------------- > > kvm write (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > 16k,virtio,off,none | 12.1 | 15 | 9.1 | 8.71 | > 16k,virtio,on ,none | 192.0 | 52 | 62.52 | 6.17 | > 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | 8.29 | > ---------------------------+-------+-------+--------------+------------+ > 64k,virtio,off,none | 15.5 | 8 | 21.10 | 311.00 | > 64k,virtio,on ,none | 454.0 | 130 | 113.25 | 10.65 | > 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | 30.75 | > ---------------------------+-------+-------+--------------+------------+ > 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | 3020.08 | > 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | 149.81 | > 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | 453.84 | > ---------------------------+-------+-------+--------------+------------+ > > > -- > Ryan Harper > Software Engineer; Linux Technology Center > IBM Corp., Austin, Tx > (512) 838-9253 T/L: 678-9253 > ryanh@us.ibm.com > > > Ryan Can you please post the details of the guest and host configurations. From seeing kvm write data that is greater than that of bare metal, I would think that your test dataset is too small and not exceeding that of the host cache size. Our previous testing has shown that once you exceed the host cache and cause the cache to flush, performance will drop to a point lower than if you didn't use the cache in the first place. Can you repeat the tests using a data set that is 2X the size of your hosts memory and post the results for the community to see? -mark ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:00 ` Mark Wagner @ 2008-10-13 19:15 ` Ryan Harper 2008-10-14 16:49 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Ryan Harper @ 2008-10-13 19:15 UTC (permalink / raw) To: Mark Wagner Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper * Mark Wagner <mwagner@redhat.com> [2008-10-13 14:06]: > Ryan Harper wrote: > > Can you please post the details of the guest and host configurations. http://lists.gnu.org/archive/html/qemu-devel/2008-09/msg01115.html > From seeing kvm write data that is greater than that of bare metal, > I would think that your test dataset is too small and not > exceeding that of the host cache size. The size was chosen so it would fit in the host cache, to demonstrate the crazy #'s seen on cached writes without O_DSYNC. > > Our previous testing has shown that once you exceed the host cache > and cause the cache to flush, performance will drop to a point lower > than if you didn't use the cache in the first place. > > Can you repeat the tests using a data set that is 2X the size of your > hosts memory and post the results for the community to see? Yeah, I can generate those numbers as well. Seeing your note about tons of ESA and storage, feel free to generate your own #'s and post them for the community as well; the more the merrier. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:15 ` Ryan Harper @ 2008-10-14 16:49 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 16:49 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Mark Wagner Ryan Harper wrote: >> From seeing kvm write data that is greater than that of bare metal, >> I would think that your test dataset is too small and not >> exceeding that of the host cache size. >> > > The size was chosen so it would fit in to demonstrate the crazy #'s seen > on cached writes without O_DSYNC. > > A disk that is smaller than host memory is hardly interesting. Give the memory to the guest and performance will jump to memory speed rather than disk speed. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (5 preceding siblings ...) 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper @ 2008-10-13 17:58 ` Rik van Riel 2008-10-13 18:22 ` Jamie Lokier 2008-10-28 17:34 ` Ian Jackson 7 siblings, 1 reply; 101+ messages in thread From: Rik van Riel @ 2008-10-13 17:58 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori wrote: > When cache=on, read requests may not actually go to the disk. If a > previous read request (by some application on the system) has read the > same data, then it becomes a simple memcpy(). Also, the host IO > scheduler may do read ahead which means that the data may be available > from that. This can be as much of a data integrity problem as asynchronous writes, if various qemu/kvm guests are accessing the same disk image with a cluster filesystem like GFS. -- All rights reversed. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 17:58 ` [Qemu-devel] " Rik van Riel @ 2008-10-13 18:22 ` Jamie Lokier 2008-10-13 18:34 ` Rik van Riel 0 siblings, 1 reply; 101+ messages in thread From: Jamie Lokier @ 2008-10-13 18:22 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Rik van Riel wrote: > >When cache=on, read requests may not actually go to the disk. If a > >previous read request (by some application on the system) has read the > >same data, then it becomes a simple memcpy(). Also, the host IO > >scheduler may do read ahead which means that the data may be available > >from that. > > This can be as much of a data integrity problem as > asynchronous writes, if various qemu/kvm guests are > accessing the same disk image with a cluster filesystem > like GFS. If there are multiple qemu/kvm guests accessing the same disk image in a cluster, provided the host cluster filesystem uses a fully coherent protocol, ordinary cached reads should be fine. (E.g. not NFS). The behaviour should be equivalent to a "virtual SAN". (Btw, some other OSes have an O_RSYNC flag to force reads to hit the media, much as O_DSYNC forces writes to. That might be relevant to accessing a disk image file on non-coherent cluster filesystems, but I wouldn't recommend that.) -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 18:22 ` Jamie Lokier @ 2008-10-13 18:34 ` Rik van Riel 2008-10-14 1:56 ` Jamie Lokier 0 siblings, 1 reply; 101+ messages in thread From: Rik van Riel @ 2008-10-13 18:34 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Jamie Lokier wrote: > Rik van Riel wrote: >>> When cache=on, read requests may not actually go to the disk. If a >>> previous read request (by some application on the system) has read the >>> same data, then it becomes a simple memcpy(). Also, the host IO >>> scheduler may do read ahead which means that the data may be available >> >from that. >> >> This can be as much of a data integrity problem as >> asynchronous writes, if various qemu/kvm guests are >> accessing the same disk image with a cluster filesystem >> like GFS. > > If there are multiple qemu/kvm guests accessing the same disk image in > a cluster, provided the host cluster filesystem uses a fully coherent > protocol, ordinary cached reads should be fine. (E.g. not NFS). The problem is when the synchronization only happens in the guests, which is a legitimate and common configuration. Ie. the hosts just pass through the IO and the guests run a GFS cluster. Caching either reads or writes at the host level causes problems. -- All rights reversed. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 18:34 ` Rik van Riel @ 2008-10-14 1:56 ` Jamie Lokier 2008-10-14 2:28 ` nuitari-qemu 0 siblings, 1 reply; 101+ messages in thread From: Jamie Lokier @ 2008-10-14 1:56 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Rik van Riel wrote: > >If there are multiple qemu/kvm guests accessing the same disk image in > >a cluster, provided the host cluster filesystem uses a fully coherent > >protocol, ordinary cached reads should be fine. (E.g. not NFS). > > The problem is when the synchronization only happens in the guests, > which is a legitimate and common configuration. > > Ie. the hosts just pass through the IO and the guests run a GFS > cluster. Ok, if you are using multiple hosts with a non-coherent host filesystem for the virtual disk, or a non-coherent host block device for the virtual disk, it won't work. But why would you do that? What is the legitimate and common configuration where you'd share a virtual disk among multiple _hosts_ with a non-coherent host file/device sharing protocol and expect it to work? Do you envisage qemu/kvm using O_DIRECT over NFS or SMB on the host, or something like that? > Caching either reads or writes at the host level causes problems. But only if the hosts are using a non-coherent protocol. Not having a visible effect (except timing) is pretty much the definition of coherent caching. Is there a reason why you wouldn't use, say, GFS on the host (because it claims to be coherent)? Does performance suck relative to O_DIRECT over NFS? -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 1:56 ` Jamie Lokier @ 2008-10-14 2:28 ` nuitari-qemu 0 siblings, 0 replies; 101+ messages in thread From: nuitari-qemu @ 2008-10-14 2:28 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel > Is there a reason why you wouldn't use, say, GFS on the host (because > it claims to be coherent)? Does performance suck relative to O_DIRECT > over NFS? Complexity? To set up GFS2 you have to build a full cluster: get it working, make sure that locking works, that quorum is achieved, and that failover and fencing work properly. Plus you then have to maintain all of that. Then you find out that GFS2 is not ready for production (deadlocks) and that GFS is too old to be supported by a recent kernel. OCFS isn't easier either. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (6 preceding siblings ...) 2008-10-13 17:58 ` [Qemu-devel] " Rik van Riel @ 2008-10-28 17:34 ` Ian Jackson 2008-10-28 17:45 ` Anthony Liguori 7 siblings, 1 reply; 101+ messages in thread From: Ian Jackson @ 2008-10-28 17:34 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori writes ("[Qemu-devel] [RFC] Disk integrity in QEMU"): > So to summarize, I think we should enable O_DSYNC by default to ensure > that guest data integrity is not dependent on the host OS, and that > practically speaking, cache=off is only useful for very specialized > circumstances. Part of the patch I'll follow up with includes changes > to the man page to document all of this for users. I have a patch which does this and allows the host to control the buffering with the IDE cache control facility. I'll be resubmitting it shortly (if I manage to get round to it before going away for three weeks on Thursday lunchtime ...) Ian. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-28 17:34 ` Ian Jackson @ 2008-10-28 17:45 ` Anthony Liguori 2008-10-28 17:50 ` Ian Jackson 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-28 17:45 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Ian Jackson wrote: > Anthony Liguori writes ("[Qemu-devel] [RFC] Disk integrity in QEMU"): > >> So to summarize, I think we should enable O_DSYNC by default to ensure >> that guest data integrity is not dependent on the host OS, and that >> practically speaking, cache=off is only useful for very specialized >> circumstances. Part of the patch I'll follow up with includes changes >> to the man page to document all of this for users. >> > > I have a patch which does this and allows the host to control the > buffering with the IDE cache control facility. > Do you mean that the guest can control host disk cachability? We've switched to always use O_DSYNC by default. There was a very long thread about it including benchmarks. With the right posix-aio tuning, we can use O_DSYNC without hurting performance*. * Write performance drops but only because write performance was greater than native before. It now is at native performance. Regards, Anthony Liguori > I'll be resubmitting it shortly (if I manage to get round to it before > going away for three weeks on Thursday lunchtime ...) > > Ian. > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
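A rough sketch of the idea with standard POSIX AIO (write_sector_sync_completion is a made-up helper; QEMU's actual backend is more involved): because the descriptor is opened with O_DSYNC, the asynchronous write only completes once the data is on stable storage, so completion can be signalled to the guest without trusting the host to stay up. Link with -lrt on most Linux systems.

  #include <aio.h>
  #include <string.h>
  #include <stdio.h>
  #include <sys/types.h>

  /* fd is assumed to have been opened with O_RDWR | O_DSYNC. */
  static int write_sector_sync_completion(int fd, const void *buf,
                                          size_t len, off_t offset)
  {
      struct aiocb cb;
      memset(&cb, 0, sizeof(cb));
      cb.aio_fildes = fd;
      cb.aio_buf    = (void *)buf;
      cb.aio_nbytes = len;
      cb.aio_offset = offset;

      if (aio_write(&cb) < 0) {
          perror("aio_write");
          return -1;
      }

      /* A real emulator would use completion notification; for this
       * sketch we simply wait for the request to finish. */
      const struct aiocb *list[1] = { &cb };
      aio_suspend(list, 1, NULL);
      return aio_error(&cb) == 0 ? (int)aio_return(&cb) : -1;
  }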
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-28 17:45 ` Anthony Liguori @ 2008-10-28 17:50 ` Ian Jackson 2008-10-28 18:19 ` Jamie Lokier 0 siblings, 1 reply; 101+ messages in thread From: Ian Jackson @ 2008-10-28 17:50 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori writes ("Re: [Qemu-devel] [RFC] Disk integrity in QEMU"): > Do you mean that the guest can control host disk cachability? Yes. > We've switched to always use O_DSYNC by default. There was a very > long thread about it including benchmarks. With the right posix-aio > tuning, we can use O_DSYNC without hurting performance*. Right. With the change in my tree, the guest can turn on the use of the host's buffer cache for writes (ie, turn off the use of O_DSYNC), using the appropriate cache control features in the IDE controller (and have write barriers with the FLUSH CACHE command). But this patch will need to be reworked into a coherent state for resubmission because of the upstream changes you mention. Ian. ^ permalink raw reply [flat|nested] 101+ messages in thread
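A hypothetical sketch of how such guest control could be modelled (struct ide_disk and ide_set_features_wcache are invented for illustration; this is not Ian's actual patch): the guest's ATA SET FEATURES write-cache subcommands toggle the image between write-back and write-through behaviour, draining any buffered data when the cache is turned off.

  #include <stdbool.h>
  #include <unistd.h>

  /* ATA SET FEATURES subcommands for the drive write cache
   * (0x02 = enable, 0x82 = disable). */
  #define ATA_WCACHE_ENABLE   0x02
  #define ATA_WCACHE_DISABLE  0x82

  struct ide_disk {
      int  image_fd;
      bool host_writeback;   /* false: treat every write as write-through */
  };

  static void ide_set_features_wcache(struct ide_disk *d, unsigned char sub)
  {
      switch (sub) {
      case ATA_WCACHE_ENABLE:
          d->host_writeback = true;     /* guest asked for a write cache */
          break;
      case ATA_WCACHE_DISABLE:
          fdatasync(d->image_fd);       /* drain, then behave write-through */
          d->host_writeback = false;
          break;
      default:
          break;                        /* other subfeatures not modelled here */
      }
  }

The write path would then call fdatasync() after each completed write while host_writeback is false, which approximates O_DSYNC semantics without having to reopen the image file.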
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-28 17:50 ` Ian Jackson @ 2008-10-28 18:19 ` Jamie Lokier 0 siblings, 0 replies; 101+ messages in thread From: Jamie Lokier @ 2008-10-28 18:19 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Ian Jackson wrote: > > We've switched to always use O_DSYNC by default. There was a very > > long thread about it including benchmarks. With the right posix-aio > > tuning, we can use O_DSYNC without hurting performance*. > > Right. > > With the change in my tree, the guest can turn on the use of the > host's buffer cache for writes (ie, turn off the use of O_DSYNC), > using the appropriate cache control features in the IDE controller > (and have write barriers with the FLUSH CACHE command). I think this is a good idea in principle, but it needs to be overridable by command line and monitor controls. There are a number of guests and usages where you'd want to override it. These come to mind: - Enable host write caching even though the guest turns off IDE caching, because you're testing something and speed is more important than what the guest requests, and you don't want to or can't change the guest. - Disable host write caching even though the guest turns on IDE caching, because you know the guest enables the IDE cache for speed and does not flush the IDE cache for integrity (e.g. some old Linux or Windows?), and you don't want to or can't change the guest. - Disable host read and write caching with O_DIRECT, even though the guest turns on IDE caching, because you want to emulate (roughly) a real disk's performance characteristics. - Disable host read and write caching with O_DIRECT because you don't have spare RAM after the guests have used it. Note that O_DIRECT is not strictly "less caching" than O_DSYNC. Guest IDE FLUSH CACHE commands become host fsync/fdatasync calls. On some Linux hosts, O_DSYNC + fsync will result in a _host_ IDE FLUSH CACHE, when O_DIRECT + fsync will not. -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
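To make the distinctions above concrete, a small sketch with invented names (cache_mode_flags and guest_flush_cache are not QEMU's real code): each caching policy maps to different open(2) flags, and a guest IDE FLUSH CACHE command maps to fdatasync() on the image file.

  #define _GNU_SOURCE             /* for O_DIRECT on Linux */
  #include <fcntl.h>
  #include <unistd.h>

  enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_NONE };

  static int cache_mode_flags(enum cache_mode mode)
  {
      switch (mode) {
      case CACHE_WRITETHROUGH:
          return O_RDWR | O_DSYNC;    /* host page cache, but synchronous writes */
      case CACHE_NONE:
          return O_RDWR | O_DIRECT;   /* bypass the host page cache entirely;
                                         buffers must also be aligned (not shown) */
      case CACHE_WRITEBACK:
      default:
          return O_RDWR;              /* rely on the host staying up */
      }
  }

  /* Guest IDE FLUSH CACHE: push host-buffered data for this image to
   * stable storage before acknowledging the command. */
  static int guest_flush_cache(int image_fd)
  {
      return fdatasync(image_fd);
  }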