* [Qemu-devel] [PATCH 0/4] data integrity fixes
From: Christoph Hellwig @ 2009-08-31 20:16 UTC
To: qemu-devel

This series contains various patches to make sure we actually get our data to
disk when the guest believes it is.
* [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 20:16 [Qemu-devel] [PATCH 0/4] data integrity fixes Christoph Hellwig @ 2009-08-31 20:16 ` Christoph Hellwig 2009-08-31 22:09 ` Jamie Lokier 2009-08-31 20:17 ` [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync Christoph Hellwig ` (2 subsequent siblings) 3 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2009-08-31 20:16 UTC (permalink / raw) To: qemu-devel Add a enable_write_cache flag in the block driver state, and use it to decide if we claim to have a volatile write cache that needs controlled flushing from the guest. Currently we only claim to have it when cache=none is specified. While that might seem wrong it actually is the case as we still have outstanding block allocations and host drive caches to flush. We do not need to claim a write cache when we use cache=writethrough because O_SYNC writes are guaranteed to have data on stable storage. We would have to claim one for data=writeback to be safe, but for I will follow Avi's opinion that it is a useless mode and should be our dedicated unsafe mode. If anyone disagrees please start the flame thrower now and I will change it. Otherwise a documentation patch will follow to explicitly document data=writeback as unsafe. Both scsi-disk and ide now use the new flage, changing from their defaults of always off (ide) or always on (scsi-disk). Signed-off-by: Christoph Hellwig <hch@lst.de> Index: qemu-kvm/hw/scsi-disk.c =================================================================== --- qemu-kvm.orig/hw/scsi-disk.c +++ qemu-kvm/hw/scsi-disk.c @@ -710,7 +710,9 @@ static int32_t scsi_send_command(SCSIDev memset(p,0,20); p[0] = 8; p[1] = 0x12; - p[2] = 4; /* WCE */ + if (bdrv_enable_write_cache(s->bdrv)) { + p[2] = 4; /* WCE */ + } p += 20; } if ((page == 0x3f || page == 0x2a) Index: qemu-kvm/block.c =================================================================== --- qemu-kvm.orig/block.c +++ qemu-kvm/block.c @@ -408,6 +408,16 @@ int bdrv_open2(BlockDriverState *bs, con } bs->drv = drv; bs->opaque = qemu_mallocz(drv->instance_size); + + /* + * Yes, BDRV_O_NOCACHE aka O_DIRECT means we have to present a + * write cache to the guest. We do need the fdatasync to flush + * out transactions for block allocations, and we maybe have a + * volatile write cache in our backing device to deal with. + */ + if (flags & BDRV_O_NOCACHE) + bs->enable_write_cache = 1; + /* Note: for compatibility, we open disk image files as RDWR, and RDONLY as fallback */ if (!(flags & BDRV_O_FILE)) @@ -918,6 +928,11 @@ int bdrv_is_sg(BlockDriverState *bs) return bs->sg; } +int bdrv_enable_write_cache(BlockDriverState *bs) +{ + return bs->enable_write_cache; +} + /* XXX: no longer used */ void bdrv_set_change_cb(BlockDriverState *bs, void (*change_cb)(void *opaque), void *opaque) Index: qemu-kvm/block_int.h =================================================================== --- qemu-kvm.orig/block_int.h +++ qemu-kvm/block_int.h @@ -152,6 +152,9 @@ struct BlockDriverState { /* the memory alignment required for the buffers handled by this driver */ int buffer_alignment; + /* do we need to tell the quest if we have a volatile write cache? */ + int enable_write_cache; + /* NOTE: the following infos are only hints for real hardware drivers. 
They are not used by the block driver */ int cyls, heads, secs, translation; Index: qemu-kvm/block.h =================================================================== --- qemu-kvm.orig/block.h +++ qemu-kvm/block.h @@ -120,6 +120,7 @@ int bdrv_get_translation_hint(BlockDrive int bdrv_is_removable(BlockDriverState *bs); int bdrv_is_read_only(BlockDriverState *bs); int bdrv_is_sg(BlockDriverState *bs); +int bdrv_enable_write_cache(BlockDriverState *bs); int bdrv_is_inserted(BlockDriverState *bs); int bdrv_media_changed(BlockDriverState *bs); int bdrv_is_locked(BlockDriverState *bs); Index: qemu-kvm/hw/ide/core.c =================================================================== --- qemu-kvm.orig/hw/ide/core.c +++ qemu-kvm/hw/ide/core.c @@ -148,8 +148,11 @@ static void ide_identify(IDEState *s) put_le16(p + 83, (1 << 14) | (1 << 13) | (1 <<12) | (1 << 10)); /* 14=set to 1, 1=SMART self test, 0=SMART error logging */ put_le16(p + 84, (1 << 14) | 0); - /* 14 = NOP supported, 0=SMART feature set enabled */ - put_le16(p + 85, (1 << 14) | 1); + /* 14 = NOP supported, 5=WCACHE enabled, 0=SMART feature set enabled */ + if (bdrv_enable_write_cache(s->bs)) + put_le16(p + 85, (1 << 14) | (1 << 5) | 1); + else + put_le16(p + 85, (1 << 14) | 1); /* 13=flush_cache_ext,12=flush_cache,10=lba48 */ put_le16(p + 86, (1 << 14) | (1 << 13) | (1 <<12) | (1 << 10)); /* 14=set to 1, 1=smart self test, 0=smart error logging */ ^ permalink raw reply [flat|nested] 33+ messages in thread
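The guest-visible effect of the ide_identify() change above is that bit 5 of IDENTIFY word 85 now reports whether a write cache is enabled. A minimal sketch of how a guest or test harness could decode that bit from a 256-word IDENTIFY DEVICE buffer follows; it is illustrative only and not part of the patch.

#include <stdint.h>

/* word 85, bit 5: volatile write cache currently enabled, matching the
 * put_le16(p + 85, (1 << 14) | (1 << 5) | 1) in ide_identify() above */
static int ide_write_cache_enabled(const uint16_t *id /* 256 words */)
{
    return (id[85] >> 5) & 1;
}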
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Jamie Lokier @ 2009-08-31 22:09 UTC
To: Christoph Hellwig; +Cc: qemu-devel

Christoph Hellwig wrote:
> We would have to claim one for data=writeback to be safe, but for I will
> follow Avi's opinion that it is a useless mode and should be our
> dedicated unsafe mode.  If anyone disagrees please start the flame
> thrower now and I will change it.  Otherwise a documentation patch will
> follow to explicitly document data=writeback as unsafe.

*Opening flame thrower!*

Unsafe, useless?  It's the most useful mode when you're starting and
stopping VMs regularly, or if you can't use O_DIRECT.  It's safe if
fdatasync is called - in other words not advertising a write cache is
silly.

I haven't measured but I'd expect it to be much faster than O_SYNC on
some host hardware, for the same reason that barriers + volatile write
cache are much faster on some host hardware than disabling the write
cache.

Right now, on a Linux host O_SYNC is unsafe with hardware that has a
volatile write cache.  That might not be changed, but if it is then
performance with cache=writethrough will plummet (due to issuing a
CACHE FLUSH to the hardware after every write), while performance with
cache=writeback will be reasonable.

If an unsafe mode is desired (I think it is, for those throwaway
testing VMs, or during OS installs), I suggest adding cache=volatile:

    cache=none
        O_DIRECT, fdatasync, advertise volatile write cache

    cache=writethrough
        O_SYNC, do not advertise

    cache=writeback
        fdatasync, advertise volatile write cache

    cache=volatile
        nothing (perhaps fdatasync on QEMU blockdev close)

When using guest OSes which issue CACHE FLUSH commands (that's a guest
config issue), why would you ever use cache=writethrough?
cache=writeback should be faster and equally safe - provided you do
actually advertise the write cache!  So please do!  Not doing so is
silly.

-- Jamie
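For illustration, the proposed table could translate into something like the following inside the block layer. This is only a sketch of the mapping, not the actual qemu option parsing code; parse_cache_mode() and struct cache_policy are invented names.

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <string.h>

struct cache_policy {
    int open_flags;             /* extra flags for open(2) on the image  */
    int flush_backs_wcache;     /* fdatasync() when the guest flushes    */
    int advertise_wcache;       /* tell the guest there is a write cache */
};

static int parse_cache_mode(const char *mode, struct cache_policy *p)
{
    if (!strcmp(mode, "none")) {
        p->open_flags = O_DIRECT;
        p->flush_backs_wcache = 1;
        p->advertise_wcache = 1;
    } else if (!strcmp(mode, "writethrough")) {
        p->open_flags = O_SYNC;
        p->flush_backs_wcache = 0;
        p->advertise_wcache = 0;
    } else if (!strcmp(mode, "writeback")) {
        p->open_flags = 0;
        p->flush_backs_wcache = 1;
        p->advertise_wcache = 1;
    } else if (!strcmp(mode, "volatile")) {
        p->open_flags = 0;          /* no O_DIRECT/O_SYNC ...           */
        p->flush_backs_wcache = 0;  /* ... and guest flushes are no-ops */
        p->advertise_wcache = 0;
    } else {
        return -1;
    }
    return 0;
}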
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Christoph Hellwig @ 2009-08-31 22:16 UTC
To: Jamie Lokier; +Cc: Christoph Hellwig, qemu-devel

On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> volatile write cache.  That might not be changed, but if it is then
> performance with cache=writethrough will plummet (due to issuing a
> CACHE FLUSH to the hardware after every write), while performance with
> cache=writeback will be reasonable.

Currently all modes are more or less unsafe with volatile write caches,
at least when using ext3 or raw block device accesses.  XFS is safe,
two thirds due to doing the right thing and one third due to sheer luck.

> If an unsafe mode is desired (I think it is, for those throwaway
> testing VMs, or during OS installs), I suggest adding cache=volatile:
>
>     cache=none
>         O_DIRECT, fdatasync, advertise volatile write cache
>
>     cache=writethrough
>         O_SYNC, do not advertise
>
>     cache=writeback
>         fdatasync, advertise volatile write cache
>
>     cache=volatile
>         nothing (perhaps fdatasync on QEMU blockdev close)

Fine with me, let the flame war begin :)

> When using guest OSes which issue CACHE FLUSH commands (that's a
> guest config issue), why would you ever use cache=writethrough?
> cache=writeback should be faster and equally safe - provided you do
> actually advertise the write cache!

And provided the guest OS actually issues cache flushes when it should,
something that at least Linux historically utterly failed at, and some
other operating systems haven't even tried.

cache=writethrough is the equivalent of turning off the volatile write
cache of the real disk.  It might be slower (which isn't even always
the case for real disks), but it is much safer.  E.g. if you want to
move your old SCO Unix box into a VM it's the only safe option.
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Jamie Lokier @ 2009-08-31 22:46 UTC
To: Christoph Hellwig; +Cc: qemu-devel

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> > Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> > volatile write cache.  That might not be changed, but if it is then
> > performance with cache=writethrough will plummet (due to issuing a
> > CACHE FLUSH to the hardware after every write), while performance with
> > cache=writeback will be reasonable.
>
> Currently all modes are more or less unsafe with volatile write caches,
> at least when using ext3 or raw block device accesses.  XFS is safe,
> two thirds due to doing the right thing and one third due to sheer luck.

Right, but now you've made it worse.  By not calling fdatasync at all,
you've reduced the integrity.  Previously it would reach the drive's
cache, and take whatever (short) time it took to reach the platter.
Now you're leaving data in the host cache which can stay for much
longer, and is vulnerable to host kernel crashes.

Oh, and QEMU could call whatever "hdparm -F" does when using raw block
devices ;-)

> > nothing (perhaps fdatasync on QEMU blockdev close)
>
> Fine with me, let the flame war begin :)

Well I'd like to start by pointing out your patch introduces a
regression in the combination cache=writeback with emulated SCSI,
because it effectively removes the fdatasync calls in that case :-)
So please amend the patch before it gets applied lest such silliness
be propagated.  Thanks :-)

I've actually been using cache=writeback with emulated IDE on deployed
server VMs, assuming that worked with KVM.  It's been an eye opener to
find that it was broken all along because the driver failed to set the
"has write cache" bit.  Thank you for the detective work.

It goes to show no matter how hard we try, data integrity is a
slippery thing where getting it wrong does not show up under normal
circumstances, only during catastrophic system failures.

Ironically, with emulated SCSI, I used cache=writethrough, thinking
guests would not issue CACHE FLUSH commands over SCSI because
historically performance has been reached by having overlapping writes
instead.

> > When using guest OSes which issue CACHE FLUSH commands (that's a
> > guest config issue), why would you ever use cache=writethrough?
> > cache=writeback should be faster and equally safe - provided you do
> > actually advertise the write cache!
>
> And provided the guest OS actually issues cache flushes when it should,
> something that at least Linux historically utterly failed at, and some
> other operating systems haven't even tried.

For the hosts, yes - fsync/fdatasync/O_SYNC/O_DIRECT all utterly fail.
(Afaict, Windows hosts do it in some combinations.)  But I'd like to
think we're about to fix Linux hosts soon, thanks to your good work on
that elsewhere.

For guests, Linux has been good at issuing the necessary flushes for
ordinary journalling (in ext3's case, provided barrier=1 is given in
mount options), which is quite important.  It failed with fsync, which
is also important to applications, but filesystem integrity is the
most important thing and it's been good at that for many years.

> cache=writethrough is the equivalent of turning off the volatile
> write cache of the real disk.  It might be slower (which isn't even
> always the case for real disks), but it is much safer.

When O_SYNC is made to flush the hardware cache on Linux hosts, it will
be excruciatingly slow: it'll have to seek twice for every write.  Once
for the data, once for the inode update.  That's another reason O_DSYNC
is important.

> E.g. if you want to move your old SCO Unix box into a VM it's the
> only safe option.

I agree, and for that reason, cache=writethrough or cache=none are the
only reasonable defaults.

By the way, all this has led me to another idea...  We may find that
O_SYNC is slower than batching several writes followed by one fdatasync
whose completion allows several writes to report as completed, when
emulating SCSI or virtio-blk (anything which allows overlapping write
commands from the guest).

-- Jamie
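The batching idea in the last paragraph could look roughly like this: guest writes are acknowledged only after the next fdatasync() returns, so one flush retires a whole batch of overlapping requests. All names here are invented for illustration; this is not code from the series.

#include <errno.h>
#include <unistd.h>

struct pending_write {
    void (*complete)(void *opaque, int ret);
    void *opaque;
    struct pending_write *next;
};

static struct pending_write *batch;     /* writes waiting for the next flush */

static void queue_for_flush(struct pending_write *w)
{
    w->next = batch;
    batch = w;
}

static void flush_batch(int fd)
{
    int ret = fdatasync(fd) ? -errno : 0;

    while (batch) {
        struct pending_write *w = batch;

        batch = w->next;
        w->complete(w->opaque, ret);    /* data is on stable storage now */
    }
}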
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Christoph Hellwig @ 2009-08-31 23:06 UTC
To: Jamie Lokier; +Cc: Christoph Hellwig, qemu-devel

On Mon, Aug 31, 2009 at 11:46:45PM +0100, Jamie Lokier wrote:
> > On Mon, Aug 31, 2009 at 11:09:50PM +0100, Jamie Lokier wrote:
> > > Right now, on a Linux host O_SYNC is unsafe with hardware that has a
> > > volatile write cache.  That might not be changed, but if it is then
> > > performance with cache=writethrough will plummet (due to issuing a
> > > CACHE FLUSH to the hardware after every write), while performance with
> > > cache=writeback will be reasonable.
> >
> > Currently all modes are more or less unsafe with volatile write caches,
> > at least when using ext3 or raw block device accesses.  XFS is safe,
> > two thirds due to doing the right thing and one third due to sheer luck.
>
> Right, but now you've made it worse.  By not calling fdatasync at all,
> you've reduced the integrity.  Previously it would reach the drive's
> cache, and take whatever (short) time it took to reach the platter.
> Now you're leaving data in the host cache which can stay for much
> longer, and is vulnerable to host kernel crashes.

Your last comment is for data=writeback, which in Avi's proposal that I
implemented would indeed lose any guarantees and be for all practical
matters unsafe.  It's not true for any of the other options.

> Oh, and QEMU could call whatever "hdparm -F" does when using raw block
> devices ;-)

Actually for ide/scsi implementing cache control is on my todo list.
Not sure about virtio yet.

> Well I'd like to start by pointing out your patch introduces a
> regression in the combination cache=writeback with emulated SCSI,
> because it effectively removes the fdatasync calls in that case :-)

Yes, you already pointed this out above.

> It goes to show no matter how hard we try, data integrity is a
> slippery thing where getting it wrong does not show up under normal
> circumstances, only during catastrophic system failures.

Honestly, it should not.  Digging through all this was a bit of work,
but I was extremely surprised how carelessly most people that touched
it before were.  It's not rocket science and can be tested quite easily
using various tools - qemu being the easiest nowadays, but scsi_debug
or an instrumented iscsi target would do the same thing.

> It failed with fsync, which
> is also important to applications, but filesystem integrity is the
> most important thing and it's been good at that for many years.

Users might disagree with that.  With my user hat on I couldn't care
less what state the internal metadata is in as long as I get back my
data, which the OS has guaranteed me has reached the disk after a
successful fsync/fdatasync/O_SYNC write.

> > E.g. if you want to move your old SCO Unix box into a VM it's the
> > only safe option.
>
> I agree, and for that reason, cache=writethrough or cache=none are the
> only reasonable defaults.

Despite the extremely misleading name, cache=none is _NOT_ an
alternative, unless we make it open the image using O_DIRECT|O_SYNC.
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 23:06 ` Christoph Hellwig @ 2009-09-01 10:38 ` Jamie Lokier 0 siblings, 0 replies; 33+ messages in thread From: Jamie Lokier @ 2009-09-01 10:38 UTC (permalink / raw) To: Christoph Hellwig; +Cc: qemu-devel Christoph Hellwig wrote: > > Oh, and QEMU could call whatever "hdparm -F" does when using raw block > > devices ;-) > > Actually for ide/scsi implementing cache control is on my todo list. > Not sure about virtio yet. I think hdparm -f -F does for some block devices what fdatasync ideally does for files. What I was getting at was until we have perfect fdatasync on block devices for Linux, QEMU could use the blockdev ioctls to accomplish the same thing on older kernels. > > It goes to show no matter how hard we try, data integrity is a > > slippery thing where getting it wrong does not show up under normal > > circumstances, only during catastrophic system failures. > > Honestly, it should not. Digging through all this was a bit of work, > but I was extremly how carelessly most people that touched it before > were. It's not rocket sciense and can be tested quite easily using > various tools - qemu beeing the easiest nowdays but scsi_debug or > an instrumented iscsi target would do the same thing. Oh I agree - we have increasingly good debugging tools. What's missing is a dirty script^H^H^H^H^H^H a good validation test which stresses the various combinations of ways to sync data on block devices and various filesystems, and various types of emulated hardware with/without caches enabled, and various mount options, and checks the I/O does what is desired in every case. > > It failed with fsync, which is also important to applications, but > > filesystem integrity is the most important thing and it's been > > good at that for many years. > > Users might disagree with that. With my user hat on I couldn't care > less on what state the internal metadata is as long as I get back at > my data which the OS has guaranteed me to reach the disk after a > successfull fsync/fdatasync/O_SYNC write. I guess it depends what you're doing. I've observed more instances of filesystem corruption due to lack of barriers, resulting in an inability to find files, than I've ever noticed missing data inside files, but then I hardly ever keep large amounts of data in databases. And I get so much mail I wouldn't notice if a few got lost ;-) > > > E.g. if you want to move your old SCO Unix box into a VM it's the > > > only safe option. > > > > I agree, and for that reason, cache=writethrough or cache=none are the > > only reasonable defaults. > > despite the extremly misleading name cache=none is _NOT_ an alternative, > unless we make it open the image using O_DIRECT|O_SYNC. Good point about the misleading name, and good point about O_DIRECT being insufficient too. For a safe emulation default with reasonable performance, I wonder if it would work to emulate drive cache _off_ at the beginning, but with the capability for the guest to enable it? The theory is that old guests don't know about drive caches and will leave it off and be safe (getting O_DSYNC or O_DIRECT|O_DSYNC)[*], and newer guests will turn it on if they also implement barriers (getting nothing or O_DIRECT, and fdatasync when they issue barriers). Do you think that would work with typical guests we know about? 
[*] - O_DSYNC as opposed to O_SYNC strikes me as important once proper cache flushes are implemented, as it may behave very similar to real hardware when doing data overwrites, whereas O_SYNC should seek back and forth between the data and inode areas for every write, if it's updating it's nanosecond timestamps correctly. -- Jamie ^ permalink raw reply [flat|nested] 33+ messages in thread
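For reference, "hdparm -F" boils down to issuing an ATA FLUSH CACHE command through the HDIO_DRIVE_CMD ioctl, which is the blockdev fallback hinted at above for hosts without a working fdatasync on block devices. The sketch below assumes a Linux host and a raw IDE/SATA device node; it is not part of the patch series.

#include <sys/ioctl.h>
#include <linux/hdreg.h>

static int flush_drive_cache(int fd)
{
    /* args[0] is the ATA command; the remaining registers stay zero */
    unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };

    return ioctl(fd, HDIO_DRIVE_CMD, args);
}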
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 22:16 ` Christoph Hellwig 2009-08-31 22:46 ` Jamie Lokier @ 2009-08-31 22:53 ` Anthony Liguori 2009-08-31 22:55 ` Jamie Lokier ` (3 more replies) 1 sibling, 4 replies; 33+ messages in thread From: Anthony Liguori @ 2009-08-31 22:53 UTC (permalink / raw) To: Christoph Hellwig; +Cc: qemu-devel Christoph Hellwig wrote: >> If an unsafe mode is desired (I think it is, for those throwaway >> testing VMs, or during OS installs), I suggest adding cache=volatile: >> >> cache=none >> O_DIRECT, fdatasync, advertise volatile write cache >> >> cache=writethrough >> O_SYNC, do not advertise >> >> cache=writeback >> fdatasync, advertise volatile write cache >> >> cache=volatile >> nothing (perhaps fdatasync on QEMU blockdev close) >> > > Fine withe me, let the flame war begin :) > I think we should pity our poor users and avoid adding yet another obscure option that is likely to be misunderstood. Can someone do some benchmarking with cache=writeback and fdatasync first and quantify what the real performance impact is? I think the two reasonable options are 1) make cache=writeback safe, avoid a massive perf decrease in the process 2) keep cache=writeback as a no-guarantees option. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 22:53 ` Anthony Liguori @ 2009-08-31 22:55 ` Jamie Lokier 2009-08-31 22:58 ` Christoph Hellwig ` (2 subsequent siblings) 3 siblings, 0 replies; 33+ messages in thread From: Jamie Lokier @ 2009-08-31 22:55 UTC (permalink / raw) To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel Anthony Liguori wrote: > Christoph Hellwig wrote: > >>If an unsafe mode is desired (I think it is, for those throwaway > >>testing VMs, or during OS installs), I suggest adding cache=volatile: > >> > >> cache=none > >> O_DIRECT, fdatasync, advertise volatile write cache > >> > >> cache=writethrough > >> O_SYNC, do not advertise > >> > >> cache=writeback > >> fdatasync, advertise volatile write cache > >> > >> cache=volatile > >> nothing (perhaps fdatasync on QEMU blockdev close) > >> > > > >Fine withe me, let the flame war begin :) > > > > I think we should pity our poor users and avoid adding yet another > obscure option that is likely to be misunderstood. > > Can someone do some benchmarking with cache=writeback and fdatasync > first and quantify what the real performance impact is? > > I think the two reasonable options are 1) make cache=writeback safe, > avoid a massive perf decrease in the process 2) keep cache=writeback as > a no-guarantees option. Right now, cache=writeback does set the bit for SCSI emulation, which makes it safe for guests which understand that. Removing that is a regression in safety, not merely a lack of change. -- Jamie ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 22:53 ` Anthony Liguori 2009-08-31 22:55 ` Jamie Lokier @ 2009-08-31 22:58 ` Christoph Hellwig 2009-08-31 22:59 ` Jamie Lokier 2009-09-02 3:53 ` Christoph Hellwig 3 siblings, 0 replies; 33+ messages in thread From: Christoph Hellwig @ 2009-08-31 22:58 UTC (permalink / raw) To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote: > I think we should pity our poor users and avoid adding yet another > obscure option that is likely to be misunderstood. > > Can someone do some benchmarking with cache=writeback and fdatasync > first and quantify what the real performance impact is? I can run benchmarks. Any workloads that you are particularly looking for? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 22:53 ` Anthony Liguori 2009-08-31 22:55 ` Jamie Lokier 2009-08-31 22:58 ` Christoph Hellwig @ 2009-08-31 22:59 ` Jamie Lokier 2009-08-31 23:06 ` Christoph Hellwig 2009-09-02 3:53 ` Christoph Hellwig 3 siblings, 1 reply; 33+ messages in thread From: Jamie Lokier @ 2009-08-31 22:59 UTC (permalink / raw) To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel Anthony Liguori wrote: > Can someone do some benchmarking with cache=writeback and fdatasync > first and quantify what the real performance impact is? Unfortunately we can't yet quantify the impact on the hardware I care about (ordinary consumer PCs with non-NCQ SATA disks), because Linux hosts don't *yet* implement O_SYNC or fdatasync properly. I would expect the performance difference to be much more significant after those are implemented on the host. -- Jamie ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 22:59 ` Jamie Lokier @ 2009-08-31 23:06 ` Christoph Hellwig 2009-08-31 23:09 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2009-08-31 23:06 UTC (permalink / raw) To: Jamie Lokier; +Cc: Christoph Hellwig, qemu-devel On Mon, Aug 31, 2009 at 11:59:25PM +0100, Jamie Lokier wrote: > Anthony Liguori wrote: > > Can someone do some benchmarking with cache=writeback and fdatasync > > first and quantify what the real performance impact is? > > Unfortunately we can't yet quantify the impact on the hardware I care > about (ordinary consumer PCs with non-NCQ SATA disks), because Linux > hosts don't *yet* implement O_SYNC or fdatasync properly. They do if you use XFS. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag 2009-08-31 23:06 ` Christoph Hellwig @ 2009-08-31 23:09 ` Christoph Hellwig 0 siblings, 0 replies; 33+ messages in thread From: Christoph Hellwig @ 2009-08-31 23:09 UTC (permalink / raw) To: Jamie Lokier; +Cc: Christoph Hellwig, qemu-devel On Tue, Sep 01, 2009 at 01:06:46AM +0200, Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 11:59:25PM +0100, Jamie Lokier wrote: > > Anthony Liguori wrote: > > > Can someone do some benchmarking with cache=writeback and fdatasync > > > first and quantify what the real performance impact is? > > > > Unfortunately we can't yet quantify the impact on the hardware I care > > about (ordinary consumer PCs with non-NCQ SATA disks), because Linux > > hosts don't *yet* implement O_SYNC or fdatasync properly. > > They do if you use XFS. And data-writeback + fdatasync also when using reiserfs. For ext3/4 you need the patches I sent out today, for O_SYNC on everythig but XFS also Jan Kara's patch series. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Christoph Hellwig @ 2009-09-02 3:53 UTC
To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel

On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote:
> I think we should pity our poor users and avoid adding yet another
> obscure option that is likely to be misunderstood.
>
> Can someone do some benchmarking with cache=writeback and fdatasync
> first and quantify what the real performance impact is?

Some preliminary numbers, because they are very interesting.  Note that
this is on a raid controller, not cheap ide disks.  To make up for that
I used an image file on ext3, which due to its horrible fsync
performance should be kind of a worst case.  All these runs are with
Linux 2.6.31-rc8 + my various barrier fixes on guest and host, using
ext3 with barrier=1 on both.

A kernel defconfig compile takes between 9m40s and 9m42s with
data=writeback and barriers disabled, and with fdatasync barriers
enabled it actually is minimally faster, between 9m38s and 9m39s
(given that I've only done three runs each this might fall under
the boundary for measurement tolerances).

For comparison, the raw block device node with cache=none (just one
run) is 9m36.759s, which is not far apart.  A completely native run is
7m39.326s, btw - and I fear much of the slowdown in KVM isn't I/O
related.
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Anthony Liguori @ 2009-09-02 13:13 UTC
To: Christoph Hellwig; +Cc: qemu-devel

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 05:53:23PM -0500, Anthony Liguori wrote:
>> I think we should pity our poor users and avoid adding yet another
>> obscure option that is likely to be misunderstood.
>>
>> Can someone do some benchmarking with cache=writeback and fdatasync
>> first and quantify what the real performance impact is?
>
> Some preliminary numbers, because they are very interesting.  Note that
> this is on a raid controller, not cheap ide disks.  To make up for that
> I used an image file on ext3, which due to its horrible fsync
> performance should be kind of a worst case.  All these runs are with
> Linux 2.6.31-rc8 + my various barrier fixes on guest and host, using
> ext3 with barrier=1 on both.

Does barrier=0 make a performance difference?  IOW, would the typical
default ext3 deployment show worse behavior?

> A kernel defconfig compile takes between 9m40s and 9m42s with
> data=writeback and barriers disabled, and with fdatasync barriers
> enabled it actually is minimally faster,

Is fdatasync different from fsync on ext3?  Does it result in a full
metadata commit?

If we think these numbers make sense, then I'd vote for enabling
fdatasync in master and we'll see if there are any corner cases.

> between 9m38s and 9m39s
> (given that I've only done three runs each this might fall under
> the boundary for measurement tolerances).
>
> For comparison, the raw block device node with cache=none (just one
> run) is 9m36.759s, which is not far apart.  A completely native run is
> 7m39.326s, btw - and I fear much of the slowdown in KVM isn't I/O
> related.

If you're on pre-NHM or BCN then the slowdown from shadow paging would
be expected.

Regards,

Anthony Liguori
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Christoph Hellwig @ 2009-09-02 14:14 UTC
To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel

On Wed, Sep 02, 2009 at 08:13:52AM -0500, Anthony Liguori wrote:
> Does barrier=0 make a performance difference?  IOW, would the typical
> default ext3 deployment show worse behavior?

I'll give it a spin.

> > A kernel defconfig compile takes between 9m40s and 9m42s with
> > data=writeback and barriers disabled, and with fdatasync barriers
> > enabled it actually is minimally faster,
>
> Is fdatasync different from fsync on ext3?  Does it result in a full
> metadata commit?

ext3 honors the fdatasync flag and only does its horrible job if the
metadata we care about is dirty, that is only if some non-timestamp
metadata is dirty.  Which means for a non-sparse image file it does
well, while for a sparse image file needing allocations it will cause
trouble.

And now that you mention it, I've only tested the non-sparse case,
which is now the default for the management tools at least on Fedora.
* Re: [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag
From: Christoph Hellwig @ 2009-09-02 19:49 UTC
To: Anthony Liguori; +Cc: Christoph Hellwig, qemu-devel

On Wed, Sep 02, 2009 at 08:13:52AM -0500, Anthony Liguori wrote:
> > performance should be kind of a worst case.  All these runs are with
> > Linux 2.6.31-rc8 + my various barrier fixes on guest and host, using
> > ext3 with barrier=1 on both.
>
> Does barrier=0 make a performance difference?  IOW, would the typical
> default ext3 deployment show worse behavior?

Note for this typical ext3 deployment the barrier patches are kinda
useless, because we still don't have any data integrity guarantees at
all.  Anyway, here are the numbers with barrier=0 on host and guest:

data=writeback, no write cache advertised:
    9m37.890s, 9m38.303s, 9m38.423s, 9m38.861s, 9m39.599s

data=writeback, write cache advertised (and backed by fdatasync):
    9m39.649s, 9m39.772s, 9m40.149s, 9m41.737s, 9m41.996s
* [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-08-31 20:16 [Qemu-devel] [PATCH 0/4] data integrity fixes Christoph Hellwig 2009-08-31 20:16 ` [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag Christoph Hellwig @ 2009-08-31 20:17 ` Christoph Hellwig 2009-08-31 21:51 ` Jamie Lokier 2009-09-01 15:59 ` Blue Swirl 2009-08-31 20:17 ` [Qemu-devel] [PATCH 3/4] block: add bdrv_aio_flush operation Christoph Hellwig 2009-08-31 20:18 ` [Qemu-devel] [PATCH 4/4] virtio-blk: add volatile writecache feature Christoph Hellwig 3 siblings, 2 replies; 33+ messages in thread From: Christoph Hellwig @ 2009-08-31 20:17 UTC (permalink / raw) To: qemu-devel If we are flushing the caches for our image files we only care about the data (including the metadata required for accessing it) but not things like timestamp updates. So use fdatasync instead of fsync to implement the flush operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Index: qemu/block/cow.c =================================================================== --- qemu.orig/block/cow.c 2009-06-07 16:18:21.942938914 -0300 +++ qemu/block/cow.c 2009-08-31 16:49:55.509043084 -0300 @@ -258,7 +258,7 @@ static int cow_create(const char *filena static void cow_flush(BlockDriverState *bs) { BDRVCowState *s = bs->opaque; - fsync(s->fd); + fdatasync(s->fd); } static QEMUOptionParameter cow_create_options[] = { Index: qemu/block/raw-posix.c =================================================================== --- qemu.orig/block/raw-posix.c 2009-08-29 14:17:23.039370199 -0300 +++ qemu/block/raw-posix.c 2009-08-31 16:49:55.513071598 -0300 @@ -723,7 +723,7 @@ static int raw_create(const char *filena static void raw_flush(BlockDriverState *bs) { BDRVRawState *s = bs->opaque; - fsync(s->fd); + fdatasync(s->fd); } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-08-31 20:17 ` [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync Christoph Hellwig @ 2009-08-31 21:51 ` Jamie Lokier 2009-08-31 21:55 ` Christoph Hellwig 2009-09-01 15:59 ` Blue Swirl 1 sibling, 1 reply; 33+ messages in thread From: Jamie Lokier @ 2009-08-31 21:51 UTC (permalink / raw) To: Christoph Hellwig; +Cc: qemu-devel Christoph Hellwig wrote: > > If we are flushing the caches for our image files we only care about the > data (including the metadata required for accessing it) but not things > like timestamp updates. So use fdatasync instead of fsync to implement > the flush operations. > - fsync(s->fd); > + fdatasync(s->fd); I believe fsync was used because of uncertainty about whether fdatasync reliably flushes the necessary metadata to access the data on all hosts, after things like writing to holes and extending a file. I'm still not sure if fdatasync provides that guarantee on Linux. -- Jamie ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync
From: Christoph Hellwig @ 2009-08-31 21:55 UTC
To: Jamie Lokier; +Cc: qemu-devel

On Mon, Aug 31, 2009 at 10:51:38PM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> >
> > If we are flushing the caches for our image files we only care about the
> > data (including the metadata required for accessing it) but not things
> > like timestamp updates.  So use fdatasync instead of fsync to implement
> > the flush operations.
>
> > -    fsync(s->fd);
> > +    fdatasync(s->fd);
>
> I believe fsync was used because of uncertainty about whether
> fdatasync reliably flushes the necessary metadata to access the data
> on all hosts, after things like writing to holes and extending a file.
>
> I'm still not sure if fdatasync provides that guarantee on Linux.

fdatasync is defined to provide that guarantee, and modulo the whole
clusterfuck around volatile write caches it does the right thing on
common Linux filesystems.
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-08-31 21:55 ` Christoph Hellwig @ 2009-08-31 22:48 ` Jamie Lokier 2009-08-31 22:57 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jamie Lokier @ 2009-08-31 22:48 UTC (permalink / raw) To: Christoph Hellwig; +Cc: qemu-devel Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 10:51:38PM +0100, Jamie Lokier wrote: > > Christoph Hellwig wrote: > > > > > > If we are flushing the caches for our image files we only care about the > > > data (including the metadata required for accessing it) but not things > > > like timestamp updates. So use fdatasync instead of fsync to implement > > > the flush operations. > > > > > - fsync(s->fd); > > > + fdatasync(s->fd); > > > > I believe fsync was used because of uncertainty about whether > > fdatasync reliably flushes the necessary metadata to access the data > > on all hosts, after things like writing to holes and extending a file. > > > > I'm still not sure if fdatasync provides that guarantee on Linux. > > fdatasync is defined to provide that guarantee, and modulo the whole > clusterfuck around volatile write caches it does the right thing > thing on common Linux filesystems. That's good to know. I'm not sure about other hosts though. It's specified to update the metadata, but I think I've encountered people saying it doesn't - specifically when extending a file, that it doesn't guarantee the file size is extended on some OSes. But that might be an urban myth. -- Jamie ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync
From: Christoph Hellwig @ 2009-08-31 22:57 UTC
To: Jamie Lokier; +Cc: Christoph Hellwig, qemu-devel

On Mon, Aug 31, 2009 at 11:48:19PM +0100, Jamie Lokier wrote:
> > fdatasync is defined to provide that guarantee, and modulo the whole
> > clusterfuck around volatile write caches it does the right thing on
> > common Linux filesystems.
>
> That's good to know.  I'm not sure about other hosts though.  It's
> specified to update the metadata, but I think I've encountered people
> saying it doesn't - specifically when extending a file, that it
> doesn't guarantee the file size is extended on some OSes.  But that
> might be an urban myth.

Those would be typical bugs for really naive implementations.  I'm
pretty sure those existed for a short time in a lot of places, but I
would be extremely surprised if they continued to exist for a longer
time in any major operating system.
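The guarantee under discussion is easy to state as a test: after appending past the old end of file, fdatasync() must also commit the new file size, otherwise the appended block is unreachable after a crash. A minimal, purely illustrative sketch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char block[4096];
    int fd = open("testimage", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(block, 0xab, sizeof(block));
    if (write(fd, block, sizeof(block)) != sizeof(block) ||
        fdatasync(fd) < 0) {    /* must persist the data AND the new size */
        perror("append");
        return 1;
    }
    return close(fd) ? 1 : 0;
}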
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-08-31 20:17 ` [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync Christoph Hellwig 2009-08-31 21:51 ` Jamie Lokier @ 2009-09-01 15:59 ` Blue Swirl 2009-09-01 16:04 ` Christoph Hellwig 1 sibling, 1 reply; 33+ messages in thread From: Blue Swirl @ 2009-09-01 15:59 UTC (permalink / raw) To: Christoph Hellwig; +Cc: qemu-devel On Mon, Aug 31, 2009 at 11:17 PM, Christoph Hellwig<hch@lst.de> wrote: > > If we are flushing the caches for our image files we only care about the > data (including the metadata required for accessing it) but not things > like timestamp updates. So use fdatasync instead of fsync to implement > the flush operations. > - fsync(s->fd); > + fdatasync(s->fd); There is no fdatasync outside of Linux, not in OpenBSD and mingw32 at least. But you could add a probe for it to configure. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync
From: Christoph Hellwig @ 2009-09-01 16:04 UTC
To: Blue Swirl; +Cc: Christoph Hellwig, qemu-devel

On Tue, Sep 01, 2009 at 06:59:46PM +0300, Blue Swirl wrote:
> On Mon, Aug 31, 2009 at 11:17 PM, Christoph Hellwig<hch@lst.de> wrote:
> >
> > If we are flushing the caches for our image files we only care about the
> > data (including the metadata required for accessing it) but not things
> > like timestamp updates.  So use fdatasync instead of fsync to implement
> > the flush operations.
>
> > -    fsync(s->fd);
> > +    fdatasync(s->fd);
>
> There is no fdatasync outside of Linux, not in OpenBSD and mingw32 at
> least.  But you could add a probe for it to configure.

Oh, okay.  Kinda sad that people still don't implement useful
interfaces that have been in POSIX for a long time.
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync
From: Jamie Lokier @ 2009-09-02 0:34 UTC
To: Christoph Hellwig; +Cc: Blue Swirl, qemu-devel

Christoph Hellwig wrote:
> On Tue, Sep 01, 2009 at 06:59:46PM +0300, Blue Swirl wrote:
> > On Mon, Aug 31, 2009 at 11:17 PM, Christoph Hellwig<hch@lst.de> wrote:
> > >
> > > If we are flushing the caches for our image files we only care about the
> > > data (including the metadata required for accessing it) but not things
> > > like timestamp updates.  So use fdatasync instead of fsync to implement
> > > the flush operations.
> >
> > > -    fsync(s->fd);
> > > +    fdatasync(s->fd);
> >
> > There is no fdatasync outside of Linux, not in OpenBSD and mingw32 at
> > least.  But you could add a probe for it to configure.
>
> Oh, okay.  Kinda sad that people still don't implement useful
> interfaces that have been in POSIX for a long time.

fdatasync is common among the big commercial unixes which were still
being actively developed a few years ago, but fsync is universal -
it's very old.

Neither OpenBSD nor FreeBSD have fdatasync.

mingw32 is a thin wrapper around Windows.  Windows as far as I can
tell doesn't have an equivalent of fdatasync, although it does have an
equivalent of O_DIRECT|O_SYNC (but documentation contradicts itself
regarding whether the combination flushes metadata on each write).

-- Jamie
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-09-02 0:34 ` Jamie Lokier @ 2009-09-02 0:37 ` Christoph Hellwig 2009-09-02 1:18 ` Jamie Lokier 2009-09-02 14:02 ` Blue Swirl 0 siblings, 2 replies; 33+ messages in thread From: Christoph Hellwig @ 2009-09-02 0:37 UTC (permalink / raw) To: Jamie Lokier; +Cc: Blue Swirl, Christoph Hellwig, qemu-devel On Wed, Sep 02, 2009 at 01:34:54AM +0100, Jamie Lokier wrote: > fdatasync is common among the big commercial unixes which were still > being actively developed a few years ago, but fsync is universal - > it's very old. > > Neither OpenBSD nor FreeBSD have fdatasync. > > mingw32 is a thin wrapper around Windows. Windows as far as I can > tell doesn't have an equivalent of fdatasync, although it does have an > equivalent of O_DIRECT|O_SYNC (but documentation contradicts itself > regarding whether the combination flushes metadata on each write). Yeah, I'll just add a #ifndef _POSIX_SYNCHRONIZED_IO # define fdatasync(fd) fsync(fd) #endif Now I'm pretty sure we'll find various fuckups where headers define fdatasync but not _POSIX_SYNCHRONIZED_IO or vice versa, but I can outsource adding workarounds for that to people with those broken setups. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-09-02 0:37 ` Christoph Hellwig @ 2009-09-02 1:18 ` Jamie Lokier 2009-09-02 14:02 ` Blue Swirl 1 sibling, 0 replies; 33+ messages in thread From: Jamie Lokier @ 2009-09-02 1:18 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Blue Swirl, qemu-devel Christoph Hellwig wrote: > On Wed, Sep 02, 2009 at 01:34:54AM +0100, Jamie Lokier wrote: > > fdatasync is common among the big commercial unixes which were still > > being actively developed a few years ago, but fsync is universal - > > it's very old. > > > > Neither OpenBSD nor FreeBSD have fdatasync. > > > > mingw32 is a thin wrapper around Windows. Windows as far as I can > > tell doesn't have an equivalent of fdatasync, although it does have an > > equivalent of O_DIRECT|O_SYNC (but documentation contradicts itself > > regarding whether the combination flushes metadata on each write). > > Yeah, I'll just add a > > #ifndef _POSIX_SYNCHRONIZED_IO > # define fdatasync(fd) fsync(fd) > #endif > > Now I'm pretty sure we'll find various fuckups where headers define > fdatasync but not _POSIX_SYNCHRONIZED_IO or vice versa, but I can > outsource adding workarounds for that to people with those broken > setups. I've now learned that Windows has equivalents of O_DSYNC and fsync, but not O_SYNC and fdatasync, and that the O_DSYNC equivalent doesn't flush metadata needed to access newly allocated data; it only commits overwrites, and that the fsync equivalent does try to flush the drive's volatile write cache after writing metadata, but only on some versions with some settings and some drivers and possibly some types of drive. Not that you could tell any of that from the documentation. Why do I have the feeling some unixes are the same. No wonder the SQLite folks say, paraphrasing, "we call fsync or something else; we can't help you beyond that point and many OSes have rumoured bugs in this area or don't fulfil what these functions promise". -- Jamie ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-09-02 0:37 ` Christoph Hellwig 2009-09-02 1:18 ` Jamie Lokier @ 2009-09-02 14:02 ` Blue Swirl 2009-09-02 14:15 ` Christoph Hellwig 1 sibling, 1 reply; 33+ messages in thread From: Blue Swirl @ 2009-09-02 14:02 UTC (permalink / raw) To: Christoph Hellwig; +Cc: qemu-devel On Wed, Sep 2, 2009 at 3:37 AM, Christoph Hellwig<hch@lst.de> wrote: > On Wed, Sep 02, 2009 at 01:34:54AM +0100, Jamie Lokier wrote: >> fdatasync is common among the big commercial unixes which were still >> being actively developed a few years ago, but fsync is universal - >> it's very old. >> >> Neither OpenBSD nor FreeBSD have fdatasync. >> >> mingw32 is a thin wrapper around Windows. Windows as far as I can >> tell doesn't have an equivalent of fdatasync, although it does have an >> equivalent of O_DIRECT|O_SYNC (but documentation contradicts itself >> regarding whether the combination flushes metadata on each write). > > Yeah, I'll just add a > > #ifndef _POSIX_SYNCHRONIZED_IO > # define fdatasync(fd) fsync(fd) > #endif > > Now I'm pretty sure we'll find various fuckups where headers define > fdatasync but not _POSIX_SYNCHRONIZED_IO or vice versa, but I can > outsource adding workarounds for that to people with those broken > setups. I'd use a real function, like qemu_fdatasync(). Especially if the Windows equivalent will be complex. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync 2009-09-02 14:02 ` Blue Swirl @ 2009-09-02 14:15 ` Christoph Hellwig 0 siblings, 0 replies; 33+ messages in thread From: Christoph Hellwig @ 2009-09-02 14:15 UTC (permalink / raw) To: Blue Swirl; +Cc: Christoph Hellwig, qemu-devel On Wed, Sep 02, 2009 at 05:02:32PM +0300, Blue Swirl wrote: > > Now I'm pretty sure we'll find various fuckups where headers define > > fdatasync but not _POSIX_SYNCHRONIZED_IO or vice versa, but I can > > outsource adding workarounds for that to people with those broken > > setups. > > I'd use a real function, like qemu_fdatasync(). Especially if the > Windows equivalent will be complex. Ok. ^ permalink raw reply [flat|nested] 33+ messages in thread
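One possible shape for the qemu_fdatasync() helper agreed on here, assuming a configure-time probe that defines something like CONFIG_FDATASYNC (the macro name is a guess, not the final implementation):

#include <unistd.h>

int qemu_fdatasync(int fd)
{
#ifdef CONFIG_FDATASYNC
    return fdatasync(fd);
#else
    return fsync(fd);       /* fallback for hosts without fdatasync */
#endif
}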
* [Qemu-devel] [PATCH 3/4] block: add bdrv_aio_flush operation 2009-08-31 20:16 [Qemu-devel] [PATCH 0/4] data integrity fixes Christoph Hellwig 2009-08-31 20:16 ` [Qemu-devel] [PATCH 1/4] block: add enable_write_cache flag Christoph Hellwig 2009-08-31 20:17 ` [Qemu-devel] [PATCH 2/4] block: use fdatasync instead of fsync Christoph Hellwig @ 2009-08-31 20:17 ` Christoph Hellwig 2009-09-01 10:24 ` Avi Kivity 2009-08-31 20:18 ` [Qemu-devel] [PATCH 4/4] virtio-blk: add volatile writecache feature Christoph Hellwig 3 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2009-08-31 20:17 UTC (permalink / raw) To: qemu-devel Instead stalling the VCPU while serving a cache flush try to do it asynchronously. Use our good old helper thread pool to issue an asynchronous fdatasync for raw-posix. Note that while Linux AIO implements a fdatasync operation it is not useful for us because it isn't actually implement in asynchronous fashion. For now only use it in IDE because virtio-blk doesn't implement cache flusing yet (will be fixed in patch 4/4) and the interface between the HBA emulation and scsi-disk will need some changes to accomodate it for scsi (will be a separate patch series). Signed-off-by: Christoph Hellwig <hch@lst.de> Index: qemu/block.c =================================================================== --- qemu.orig/block.c 2009-08-31 16:49:54.508542113 -0300 +++ qemu/block.c 2009-08-31 16:49:59.593042021 -0300 @@ -54,6 +54,8 @@ static BlockDriverAIOCB *bdrv_aio_readv_ static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, BlockDriverCompletionFunc *cb, void *opaque); +static BlockDriverAIOCB *bdrv_aio_flush_em(BlockDriverState *bs, + BlockDriverCompletionFunc *cb, void *opaque); static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors); static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num, @@ -138,6 +140,10 @@ void bdrv_register(BlockDriver *bdrv) bdrv->bdrv_read = bdrv_read_em; bdrv->bdrv_write = bdrv_write_em; } + + if (!bdrv->bdrv_aio_flush) + bdrv->bdrv_aio_flush = bdrv_aio_flush_em; + bdrv->next = first_drv; first_drv = bdrv; } @@ -1369,6 +1375,21 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockD return ret; } +BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs, + BlockDriverCompletionFunc *cb, void *opaque) +{ + BlockDriver *drv = bs->drv; + + if (!drv) + return NULL; + + /* + * Note that unlike bdrv_flush the driver is reponsible for flushing a + * backing image if it exists. 
+ */
+    return drv->bdrv_aio_flush(bs, cb, opaque);
+}
+
 void bdrv_aio_cancel(BlockDriverAIOCB *acb)
 {
     acb->pool->cancel(acb);
@@ -1459,6 +1480,25 @@ static BlockDriverAIOCB *bdrv_aio_writev
     return bdrv_aio_rw_vector(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
 }
 
+static BlockDriverAIOCB *bdrv_aio_flush_em(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriverAIOCBSync *acb;
+
+    acb = qemu_aio_get(&bdrv_em_aio_pool, bs, cb, opaque);
+    acb->is_write = 1; /* don't bounce in the completion handler */
+    acb->qiov = NULL;
+    acb->bounce = NULL;
+    acb->ret = 0;
+
+    if (!acb->bh)
+        acb->bh = qemu_bh_new(bdrv_aio_bh_cb, acb);
+
+    bdrv_flush(bs);
+    qemu_bh_schedule(acb->bh);
+    return &acb->common;
+}
+
 /**************************************************************/
 /* sync block device emulation */

Index: qemu/block.h
===================================================================
--- qemu.orig/block.h	2009-08-31 16:49:54.516577491 -0300
+++ qemu/block.h	2009-08-31 16:49:59.593042021 -0300
@@ -85,6 +85,8 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDr
 BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                   QEMUIOVector *iov, int nb_sectors,
                                   BlockDriverCompletionFunc *cb, void *opaque);
+BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
+                                 BlockDriverCompletionFunc *cb, void *opaque);
 void bdrv_aio_cancel(BlockDriverAIOCB *acb);
 
 /* sg packet commands */
Index: qemu/block_int.h
===================================================================
--- qemu.orig/block_int.h	2009-08-31 16:49:54.512583129 -0300
+++ qemu/block_int.h	2009-08-31 16:49:59.597095469 -0300
@@ -69,6 +69,8 @@ struct BlockDriver {
     BlockDriverAIOCB *(*bdrv_aio_writev)(BlockDriverState *bs,
         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque);
+    BlockDriverAIOCB *(*bdrv_aio_flush)(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque);
 
     const char *protocol_name;
     int (*bdrv_truncate)(BlockDriverState *bs, int64_t offset);
Index: qemu/hw/ide/core.c
===================================================================
--- qemu.orig/hw/ide/core.c	2009-08-31 16:49:54.516577491 -0300
+++ qemu/hw/ide/core.c	2009-08-31 16:49:59.601041920 -0300
@@ -771,6 +771,16 @@ static void ide_atapi_cmd_check_status(I
     ide_set_irq(s);
 }
 
+static void ide_flush_cb(void *opaque, int ret)
+{
+    IDEState *s = opaque;
+
+    /* XXX: how do we signal I/O errors here? */
+
+    s->status = READY_STAT | SEEK_STAT;
+    ide_set_irq(s);
+}
+
 static inline void cpu_to_ube16(uint8_t *buf, int val)
 {
     buf[0] = val >> 8;
@@ -1969,9 +1979,9 @@ void ide_ioport_write(void *opaque, uint
         case WIN_FLUSH_CACHE:
         case WIN_FLUSH_CACHE_EXT:
             if (s->bs)
-                bdrv_flush(s->bs);
-            s->status = READY_STAT | SEEK_STAT;
-            ide_set_irq(s);
+                bdrv_aio_flush(s->bs, ide_flush_cb, s);
+            else
+                ide_flush_cb(s, 0);
             break;
         case WIN_STANDBY:
         case WIN_STANDBY2:
Index: qemu/block/raw-posix-aio.h
===================================================================
--- qemu.orig/block/raw-posix-aio.h	2009-08-27 23:50:52.510770924 -0300
+++ qemu/block/raw-posix-aio.h	2009-08-31 16:49:59.605095368 -0300
@@ -17,8 +17,9 @@
 #define QEMU_AIO_READ         0x0001
 #define QEMU_AIO_WRITE        0x0002
 #define QEMU_AIO_IOCTL        0x0004
+#define QEMU_AIO_FLUSH        0x0008
 #define QEMU_AIO_TYPE_MASK \
-        (QEMU_AIO_READ|QEMU_AIO_WRITE|QEMU_AIO_IOCTL)
+        (QEMU_AIO_READ|QEMU_AIO_WRITE|QEMU_AIO_IOCTL|QEMU_AIO_FLUSH)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
Index: qemu/block/raw-posix.c
===================================================================
--- qemu.orig/block/raw-posix.c	2009-08-31 16:49:55.513071598 -0300
+++ qemu/block/raw-posix.c	2009-08-31 16:49:59.613070264 -0300
@@ -574,6 +574,18 @@ static BlockDriverAIOCB *raw_aio_writev(
         cb, opaque, QEMU_AIO_WRITE);
 }
 
+static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (fd_open(bs) < 0)
+        return NULL;
+
+    return paio_submit(bs, s->aio_ctx, s->fd, 0, NULL, 0,
+                       cb, opaque, QEMU_AIO_FLUSH);
+}
+
 static void raw_close(BlockDriverState *bs)
 {
     BDRVRawState *s = bs->opaque;
@@ -749,6 +761,7 @@ static BlockDriver bdrv_raw = {
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_aio_flush = raw_aio_flush,
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
@@ -1002,6 +1015,7 @@ static BlockDriver bdrv_host_device = {
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_aio_flush = raw_aio_flush,
 
     .bdrv_read = raw_read,
     .bdrv_write = raw_write,
@@ -1096,6 +1110,7 @@ static BlockDriver bdrv_host_floppy = {
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_aio_flush = raw_aio_flush,
 
     .bdrv_read = raw_read,
     .bdrv_write = raw_write,
@@ -1176,6 +1191,7 @@ static BlockDriver bdrv_host_cdrom = {
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_aio_flush = raw_aio_flush,
 
     .bdrv_read = raw_read,
     .bdrv_write = raw_write,
@@ -1295,6 +1311,7 @@ static BlockDriver bdrv_host_cdrom = {
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_aio_flush = raw_aio_flush,
 
     .bdrv_read = raw_read,
     .bdrv_write = raw_write,
Index: qemu/posix-aio-compat.c
===================================================================
--- qemu.orig/posix-aio-compat.c	2009-08-27 23:50:52.654237211 -0300
+++ qemu/posix-aio-compat.c	2009-08-31 16:49:59.621095866 -0300
@@ -134,6 +134,16 @@ static size_t handle_aiocb_ioctl(struct
     return aiocb->aio_nbytes;
 }
 
+static size_t handle_aiocb_flush(struct qemu_paiocb *aiocb)
+{
+    int ret;
+
+    ret = fdatasync(aiocb->aio_fildes);
+    if (ret == -1)
+        return -errno;
+    return 0;
+}
+
 #ifdef CONFIG_PREADV
 
 static ssize_t
@@ -330,6 +340,9 @@ static void *aio_thread(void *unused)
         case QEMU_AIO_WRITE:
            ret = handle_aiocb_rw(aiocb);
            break;
+        case QEMU_AIO_FLUSH:
+           ret = handle_aiocb_flush(aiocb);
+           break;
         case QEMU_AIO_IOCTL:
            ret = handle_aiocb_ioctl(aiocb);
            break;
@@ -530,8 +543,10 @@ BlockDriverAIOCB *paio_submit(BlockDrive
     acb->aio_type = type;
     acb->aio_fildes = fd;
     acb->ev_signo = SIGUSR2;
-    acb->aio_iov = qiov->iov;
-    acb->aio_niov = qiov->niov;
+    if (qiov) {
+        acb->aio_iov = qiov->iov;
+        acb->aio_niov = qiov->niov;
+    }
     acb->aio_nbytes = nb_sectors * 512;
     acb->aio_offset = sector_num * 512;

^ permalink raw reply	[flat|nested] 33+ messages in thread
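The core of the raw-posix change is that the flush, like reads and writes in the compat path, is handed to a helper thread that calls fdatasync() and reports completion back to the submitter, so the VCPU thread never blocks on the disk. Below is a minimal, self-contained sketch of that pattern using plain pthreads; it is not QEMU code and the names (flush_req, flush_worker, flush_done) are invented for illustration — QEMU additionally bounces the completion back to the main loop via a signal/bottom half rather than calling it from the worker.

/* Minimal illustration: hand a flush off to a worker thread so the
 * submitting thread is not blocked on fdatasync().  Plain pthreads;
 * not QEMU code, names are made up for the example. */
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct flush_req {
    int fd;                              /* file descriptor to flush */
    void (*cb)(void *opaque, int ret);   /* completion callback */
    void *opaque;
};

static void *flush_worker(void *arg)
{
    struct flush_req *req = arg;
    int ret = 0;

    if (fdatasync(req->fd) == -1)        /* flush data + required metadata */
        ret = -errno;

    /* report completion; a real AIO layer would signal the main loop
     * here instead of invoking the callback from the worker thread */
    req->cb(req->opaque, ret);
    free(req);
    return NULL;
}

static void flush_done(void *opaque, int ret)
{
    printf("flush of %s completed: %d\n", (const char *)opaque, ret);
}

int main(void)
{
    const char *path = "testfile.img";
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flush_req *req = malloc(sizeof(*req));
    req->fd = fd;
    req->cb = flush_done;
    req->opaque = (void *)path;

    pthread_t tid;
    pthread_create(&tid, NULL, flush_worker, req);

    /* the submitter is free to do other work here */
    pthread_join(tid, NULL);
    close(fd);
    return 0;
}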
* Re: [Qemu-devel] [PATCH 3/4] block: add bdrv_aio_flush operation
  2009-08-31 20:17 ` [Qemu-devel] [PATCH 3/4] block: add bdrv_aio_flush operation Christoph Hellwig
@ 2009-09-01 10:24   ` Avi Kivity
  2009-09-01 14:25     ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Avi Kivity @ 2009-09-01 10:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

On 08/31/2009 11:17 PM, Christoph Hellwig wrote:
> Instead of stalling the VCPU while serving a cache flush, try to do it
> asynchronously.  Use our good old helper thread pool to issue an
> asynchronous fdatasync for raw-posix.  Note that while Linux AIO
> implements an fdatasync operation, it is not useful for us because
> it isn't actually implemented in an asynchronous fashion.
>

Doesn't aio_flush() need to wait for all previously issued requests to
complete?

We can require callers to do that, but it seems quite a burden.

> For now only use it in IDE because virtio-blk doesn't implement
> cache flushing yet (will be fixed in patch 4/4) and the interface between
> the HBA emulation and scsi-disk will need some changes to accommodate
> it for scsi (will be a separate patch series).
>

IDE integration should be in a separate patch.

Patch has some tabs.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 33+ messages in thread
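The ordering concern can be made concrete: the flush only gives the guest the guarantee it expects if every write completed before the flush command was issued has already gone through the block layer. If callers were not trusted to ensure that, a device model could track it itself. The following is a hypothetical sketch, not part of the posted series — FlushTracker and its helpers are invented names, it assumes the bdrv_aio_flush() interface added by this patch (declared in block.h), handles only one pending flush at a time, and omits error handling of a NULL acb for brevity.

/* Hypothetical helper, not part of the posted patches: defer a flush
 * until all previously submitted writes have completed. */
#include "block.h"

typedef struct FlushTracker {
    int inflight_writes;                 /* writes submitted, not yet done */
    int flush_pending;                   /* a flush is waiting for the drain */
    BlockDriverState *bs;
    BlockDriverCompletionFunc *flush_cb;
    void *flush_opaque;
} FlushTracker;

/* call this from every write completion callback */
static void tracker_write_done(FlushTracker *t)
{
    if (--t->inflight_writes == 0 && t->flush_pending) {
        t->flush_pending = 0;
        bdrv_aio_flush(t->bs, t->flush_cb, t->flush_opaque);
    }
}

static void tracker_flush(FlushTracker *t,
                          BlockDriverCompletionFunc *cb, void *opaque)
{
    if (t->inflight_writes == 0) {
        bdrv_aio_flush(t->bs, cb, opaque);   /* nothing to wait for */
    } else {
        t->flush_pending = 1;                /* issued from the last write
                                                completion instead */
        t->flush_cb = cb;
        t->flush_opaque = opaque;
    }
}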
* Re: [Qemu-devel] [PATCH 3/4] block: add bdrv_aio_flush operation
  2009-09-01 10:24 ` Avi Kivity
@ 2009-09-01 14:25   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2009-09-01 14:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, qemu-devel

On Tue, Sep 01, 2009 at 01:24:02PM +0300, Avi Kivity wrote:
> Doesn't aio_flush() need to wait for all previously issued requests to
> complete?
>
> We can require callers to do that, but it seems quite a burden.

The guest does it for us when using ide/scsi/virtio, and I don't think
we want to grow other users.  It's worth documenting, though.

> Patch has some tabs.

sorry, of course ;-)  Writing code in the weird qemu style just takes me
three times as long as readable code, sorry.  I'll fix it.

^ permalink raw reply	[flat|nested] 33+ messages in thread
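A possible doc comment over the new entry point, capturing the contract agreed on above; this is suggested wording only and is not part of the applied patch.

/*
 * Flush the host's volatile write cache for this device.
 *
 * Note: bdrv_aio_flush() does not wait for in-flight writes.  The caller
 * (in practice the guest's IDE/SCSI/virtio-blk driver, which already
 * orders its cache-flush commands behind the writes it wants flushed)
 * must ensure that all writes it cares about have completed before
 * issuing the flush.
 */
BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
                                 BlockDriverCompletionFunc *cb, void *opaque);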
* [Qemu-devel] [PATCH 4/4] virtio-blk: add volatile writecache feature
  2009-08-31 20:16 [Qemu-devel] [PATCH 0/4] data integrity fixes Christoph Hellwig
                   ` (2 preceding siblings ...)
  2009-08-31 20:17 ` [Qemu-devel] [PATCH 3/4] block: add bdrv_aio_flush operation Christoph Hellwig
@ 2009-08-31 20:18 ` Christoph Hellwig
  3 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2009-08-31 20:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: rusty

Add a new VIRTIO_BLK_F_WCACHE feature to virtio-blk to indicate that we
have a volatile write cache that needs controlled flushing.  Implement
a VIRTIO_BLK_T_FLUSH operation to flush it.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: qemu-kvm/hw/virtio-blk.c
===================================================================
--- qemu-kvm.orig/hw/virtio-blk.c
+++ qemu-kvm/hw/virtio-blk.c
@@ -129,6 +129,13 @@ static void virtio_blk_rw_complete(void
     virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
 }
 
+static void virtio_blk_flush_complete(void *opaque, int ret)
+{
+    VirtIOBlockReq *req = opaque;
+
+    virtio_blk_req_complete(req, ret ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK);
+}
+
 static VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
 {
     VirtIOBlockReq *req = qemu_mallocz(sizeof(*req));
@@ -252,6 +259,16 @@ static void virtio_blk_handle_scsi(VirtI
 }
 #endif /* __linux__ */
 
+static void virtio_blk_handle_flush(VirtIOBlockReq *req)
+{
+    BlockDriverAIOCB *acb;
+
+    acb = bdrv_aio_flush(req->dev->bs, virtio_blk_flush_complete, req);
+    if (!acb) {
+        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+    }
+}
+
 static void virtio_blk_handle_write(VirtIOBlockReq *req)
 {
     BlockDriverAIOCB *acb;
@@ -294,7 +311,9 @@ static void virtio_blk_handle_output(Vir
     req->out = (void *)req->elem.out_sg[0].iov_base;
     req->in = (void *)req->elem.in_sg[req->elem.in_num - 1].iov_base;
 
-    if (req->out->type & VIRTIO_BLK_T_SCSI_CMD) {
+    if (req->out->type & VIRTIO_BLK_T_FLUSH) {
+        virtio_blk_handle_flush(req);
+    } else if (req->out->type & VIRTIO_BLK_T_SCSI_CMD) {
         virtio_blk_handle_scsi(req);
     } else if (req->out->type & VIRTIO_BLK_T_OUT) {
         qemu_iovec_init_external(&req->qiov, &req->elem.out_sg[1],
@@ -382,6 +401,9 @@ static uint32_t virtio_blk_get_features(
 
     features |= (1 << VIRTIO_BLK_F_SEG_MAX);
     features |= (1 << VIRTIO_BLK_F_GEOMETRY);
+
+    if (bdrv_enable_write_cache(s->bs))
+        features |= (1 << VIRTIO_BLK_F_WCACHE);
 #ifdef __linux__
     features |= (1 << VIRTIO_BLK_F_SCSI);
 #endif
Index: qemu-kvm/hw/virtio-blk.h
===================================================================
--- qemu-kvm.orig/hw/virtio-blk.h
+++ qemu-kvm/hw/virtio-blk.h
@@ -31,6 +31,7 @@
 #define VIRTIO_BLK_F_BLK_SIZE   6       /* Block size of disk is available*/
 #define VIRTIO_BLK_F_SCSI       7       /* Supports scsi command passthru */
 #define VIRTIO_BLK_F_IDENTIFY   8       /* ATA IDENTIFY supported */
+#define VIRTIO_BLK_F_WCACHE     9       /* write cache enabled */
 
 #define VIRTIO_BLK_ID_LEN       256     /* length of identify u16 array */
 #define VIRTIO_BLK_ID_SN        10      /* start of char * serial# */
@@ -55,6 +56,9 @@ struct virtio_blk_config
 /* This bit says it's a scsi command, not an actual read or write. */
 #define VIRTIO_BLK_T_SCSI_CMD   2
 
+/* Flush the volatile write cache */
+#define VIRTIO_BLK_T_FLUSH      4
+
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER    0x80000000

^ permalink raw reply	[flat|nested] 33+ messages in thread
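For context, this is roughly what the matching flush looks like from the guest side once VIRTIO_BLK_F_WCACHE has been negotiated: a request whose out-header type is VIRTIO_BLK_T_FLUSH, no data descriptors, and a one-byte status written back by the device. The sketch below is illustrative only — the virtio_blk_outhdr layout matches what the device code above reads via req->out, but queue_request() is a stand-in for the guest driver's real virtqueue submission path, not an actual API; here it simply pretends the device completed successfully.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same layout the device dereferences as req->out */
struct virtio_blk_outhdr {
    uint32_t type;      /* VIRTIO_BLK_T_* */
    uint32_t ioprio;
    uint64_t sector;    /* unused for a flush */
};

#define VIRTIO_BLK_T_FLUSH 4
#define VIRTIO_BLK_S_OK    0

/* Stand-in for the guest driver's virtqueue path: a real driver would add
 * the header as a driver->device buffer and the status byte as a
 * device->driver buffer, kick the device and wait for the used
 * notification.  For the example we just report success. */
static void queue_request(const void *out, size_t out_len, uint8_t *status_in)
{
    (void)out;
    (void)out_len;
    *status_in = VIRTIO_BLK_S_OK;
}

static int issue_flush(void)
{
    struct virtio_blk_outhdr hdr;
    uint8_t status = 0xff;

    memset(&hdr, 0, sizeof(hdr));
    hdr.type = VIRTIO_BLK_T_FLUSH;   /* no data: header out, status byte in */

    queue_request(&hdr, sizeof(hdr), &status);

    /* after the device marks the request used, status holds
     * VIRTIO_BLK_S_OK or VIRTIO_BLK_S_IOERR */
    return status == VIRTIO_BLK_S_OK ? 0 : -1;
}

int main(void)
{
    printf("flush %s\n", issue_flush() == 0 ? "ok" : "failed");
    return 0;
}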