* REQ_FLUSH, REQ_FUA and open/close of block devices
@ 2011-05-19 15:06 Alex Bligh
  2011-05-20 12:20 ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread

From: Alex Bligh @ 2011-05-19 15:06 UTC (permalink / raw)
To: linux-kernel; +Cc: Alex Bligh

I am doing some work on making REQ_FLUSH and REQ_FUA work with block
devices, and have some patches that make them perform as expected with
nbd if an nbd device is mounted (e.g. -t ext3 -o data=journal,barrier=1),
and I see the relevant REQ_FLUSH and REQ_FUA commands appearing much as
expected.

However, if I do a straight dd to the device (which generates an open()
and a close()), I see no barrier activity at all (i.e. no REQ_FLUSH and
no REQ_FUA). It is surprising to me that a close() on a raw device does
not generate a REQ_FLUSH. I cannot imagine it is a performance overhead.
I would have thought this would be useful anyway (if I've written to a
raw device I'd rather expect the data to hit the disk when I do the
close()), but my specific application is ensuring cache coherency on
live migration of virtual servers: if migrating from node A to node B,
then when the hypervisor closes the block device on node A, I want to be
sure that any locally cached write data is written to the remote disk
before it unfreezes node B.

Should a close() of a dirty block device result in a REQ_FLUSH?

-- 
Alex Bligh

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Christoph Hellwig @ 2011-05-20 12:20 UTC (permalink / raw)
To: Alex Bligh; +Cc: linux-kernel

On Thu, May 19, 2011 at 04:06:27PM +0100, Alex Bligh wrote:
> Should a close() of a dirty block device result in a REQ_FLUSH?

No, why would it? That's what fsync is for.
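[Editorial note: Christoph's point, that durability on close is userspace's
job, can be sketched as follows. This is a minimal Python illustration
using a temporary regular file as a stand-in for a block device node; the
pattern, not the file name, is what matters.]

```python
import os
import tempfile

def durable_write(path, data):
    # close() alone does not imply a cache flush, even on a block device
    # node; call fsync() explicitly before closing.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # this is what issues the flush; close() would not
    finally:
        os.close(fd)

# Demonstrated on a temporary regular file; a raw device node behaves
# the same way with respect to close().
path = tempfile.NamedTemporaryFile(delete=False).name
durable_write(path, b"payload")
```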
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Alex Bligh @ 2011-05-21 8:42 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-kernel, Alex Bligh

--On 20 May 2011 08:20:10 -0400 Christoph Hellwig <hch@infradead.org>
wrote:

> On Thu, May 19, 2011 at 04:06:27PM +0100, Alex Bligh wrote:
>> Should a close() of a dirty block device result in a REQ_FLUSH?
>
> No, why would it? That's what fsync is for.

I had thought fsync() was meant to be implicit in a close of a raw
device, though perhaps that's my faulty memory. I think you are saying
it's up to userspace to fix this; fair enough.

However, I'm also seeing writes to the device after the last flush when
the device is unmounted. Specifically, a sequence ending

  mount -t ext3 -o data=journal,barrier=1 /dev/nbd0 /mnt
  (cd /mnt ; tar cvzf /dev/null . ; sync) >/dev/null 2>&1
  dbench -D /mnt 1 &
  sleep 10
  killall dbench
  sleep 2
  killall -KILL dbench
  sync
  umount /mnt

produces these commands (at the end):

  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_FLUSH [NONE] (0x00000003)
  Sending command: NBD_CMD_WRITE [FUA]  (0x00010001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
  Sending command: NBD_CMD_WRITE [NONE] (0x00000001)

(I'm testing this out by adding flush and FUA support to nbd; see
git.alex.org.uk if this is interesting.)

What I am concerned about is that relatively normal actions (e.g.
unmounting a filesystem) do not appear to flush all data, even though I
did "sync" then "umount". I suspect the sync is generating the FLUSH
here, and nothing is flushing the umount writes. How can I know, as a
block device, that I have to write out a (long-lasting) writeback cache
if I don't receive anything beyond the last WRITE?

-- 
Alex Bligh
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Christoph Hellwig @ 2011-05-22 10:44 UTC (permalink / raw)
To: Alex Bligh; +Cc: linux-kernel

On Sat, May 21, 2011 at 09:42:45AM +0100, Alex Bligh wrote:
> What I am concerned about is that relatively normal actions (e.g.
> unmounting a filesystem) do not appear to flush all data, even though
> I did "sync" then "umount". I suspect the sync is generating the FLUSH
> here, and nothing is flushing the umount writes. How can I know, as a
> block device, that I have to write out a (long-lasting) writeback
> cache if I don't receive anything beyond the last WRITE?

In your case it seems like ext3 is doing something wrong. If you run the
same on XFS, you should not only see the last real write having FUA and
FLUSH as it's a transaction commit, but also an explicit cache flush
when devices are closed from the filesystem, to work around issues like
that. But the raw block device node really doesn't behave differently
from a file and shouldn't cause any fsync on close.

Btw, using sync_file_range is a really bad idea. It will not actually
flush the disk cache on the server, nor make sure metadata is committed
in the case of a sparse or preallocated file, and thus does not
implement the FLUSH or FUA semantics correctly.

And btw, I'd like to know what makes sync_file_range so tempting, even
after I added documentation to the man page explaining why it's almost
always wrong to use it.
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Alex Bligh @ 2011-05-22 11:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-kernel, Alex Bligh

Christoph,

--On 22 May 2011 06:44:49 -0400 Christoph Hellwig <hch@infradead.org>
wrote:

> In your case it seems like ext3 is doing something wrong. If you run
> the same on XFS, you should not only see the last real write having
> FUA and FLUSH as it's a transaction commit, but also an explicit cache
> flush when devices are closed from the filesystem, to work around
> issues like that.

OK. Sounds like an ext3 bug then. I will test with XFS, ext4 and btrfs,
see if they exhibit the same symptoms, and come back with a more
appropriate subject line.

> But the raw block device node really doesn't behave differently from a
> file and shouldn't cause any fsync on close.

Fair enough. I will check whether the hypervisor concerned is doing an
fsync() or equivalent in the right place.

> Btw, using sync_file_range is a really bad idea. It will not actually
> flush the disk cache on the server, nor make sure metadata is
> committed in the case of a sparse or preallocated file, and thus does
> not implement the FLUSH or FUA semantics correctly.
>
> And btw, I'd like to know what makes sync_file_range so tempting, even
> after I added documentation to the man page explaining why it's almost
> always wrong to use it.

I think you are referring to this (which in my defence wasn't in my
local copy of the manpage):

> This system call is extremely dangerous and should not be used in
> portable programs. None of these operations writes out the file's
> metadata. Therefore, unless the application is strictly performing
> overwrites of already-instantiated disk blocks, there are no
> guarantees that the data will be available after a crash. There is no
> user interface to know if a write is purely an overwrite. On file
> systems using copy-on-write semantics (e.g., btrfs) an overwrite of
> existing allocated blocks is impossible. When writing into
> preallocated space, many file systems also require calls into the
> block allocator, which this system call does not sync out to disk.
> This system call does not flush disk write caches and thus does not
> provide any data integrity on systems with volatile disk write caches.

So, the file in question is not mmap'd (it's an nbd disk). fsync() /
fdatasync() is too expensive as it will sync everything. As far as I can
tell, this is no more dangerous re metadata than fdatasync(), which also
does not sync metadata. I had read the last sentence as "this system
call does not *necessarily* flush disk write caches" (meaning "if you
haven't mounted e.g. ext3 with barrier=1, then you can't ensure write
caches write through"), as opposed to "will not ever flush disk write
caches", and given that mounting ext3 without barrier=1 produces no FUA
or FLUSH commands in normal operation anyway (as far as light debugging
can see), that's not much of a loss.

But rather than trying to justify myself: what is the best way to
emulate FUA, i.e. ensure a specific portion of a file is synced before
returning, without ensuring the whole lot is synced (which is far too
slow)? The only other option I can see is to open the file with a
second fd, mmap the chunk of the file (it may be larger than the
available virtual address space), msync it with MS_SYNC, then fsync,
then munmap and close, and hope the fsync doesn't spit anything else
out. This seems a little excessive, and I don't even know whether it
would work.

I guess given that NBD currently does nothing at all to support
barriers, I thought this was an improvement!

-- 
Alex Bligh
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Christoph Hellwig @ 2011-05-22 11:26 UTC (permalink / raw)
To: Alex Bligh; +Cc: linux-kernel

> So, the file in question is not mmap'd (it's an nbd disk). fsync() /
> fdatasync() is too expensive as it will sync everything. As far as I
> can tell, this is no more dangerous re metadata than fdatasync(),
> which also does not sync metadata. I had read the last sentence as
> "this system call does not *necessarily* flush disk write caches"
> (meaning "if you haven't mounted e.g. ext3 with barrier=1, then you
> can't ensure write caches write through"), as opposed to "will not
> ever flush disk write caches", and given that mounting ext3 without
> barrier=1 produces no FUA or FLUSH commands in normal operation anyway
> (as far as light debugging can see), that's not much of a loss.

ext3 without barriers does not guarantee any data integrity and will
lose your data in the blink of an eye if you have a large enough cache.

fdatasync is equivalent to fsync except that it does not flush
non-essential metadata (basically just timestamps in practice), but it
does flush the metadata required to find the data again, e.g. allocation
information and extent maps. sync_file_range does nothing but flush out
pagecache content - it means you basically won't get your data back in
case of a crash if you:

 a) have a volatile write cache in your disk (e.g. any normal SATA disk)
 b) are using a sparse file on a filesystem
 c) are using a fallocate-preallocated file on a filesystem
 d) use any file on a COW filesystem like btrfs

I.e. it only does anything useful for you if you do not have a volatile
write cache, and either use a raw block device node, or just overwrite
an already fully allocated (and not preallocated) file on a non-COW
filesystem.
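[Editorial note: the distinction Christoph draws can be sketched minimally
as below (Python, on a throwaway temporary file). fdatasync() is the call
that gives durability here; sync_file_range() is deliberately absent,
since as described above it only pushes pagecache and flushes no disk
cache.]

```python
import os
import tempfile

path = tempfile.NamedTemporaryFile(delete=False).name
fd = os.open(path, os.O_WRONLY)
os.write(fd, b"block of data")
# fdatasync flushes the data plus the metadata needed to find it again
# (allocation information, extent maps) and issues a disk cache flush;
# it skips only non-essential metadata such as timestamps.
os.fdatasync(fd)
os.close(fd)
```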
>> But rather than trying to justify myself: what is the best way to
>> emulate FUA, i.e. ensure a specific portion of a file is synced
>> before returning, without ensuring the whole lot is synced (which is
>> far too slow)? The only other option I can see is to open the file
>> with a second fd, mmap the chunk of the file (it may be larger than
>> the available virtual address space), msync it with MS_SYNC, then
>> fsync, then munmap and close, and hope the fsync doesn't spit
>> anything else out. This seems a little excessive, and I don't even
>> know whether it would work.

You can have a second FD open with O_DSYNC and write to that. But for
NBD and a Linux guest that won't make any difference yet. While REQ_FUA
is a separate flag, so far it's only used in combination with REQ_FLUSH,
so the only pattern you'll see REQ_FUA used in is:

  REQ_FLUSH
  REQ_FUA

which means there's no data but the one just written in the cache.
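[Editorial note: Christoph's second-descriptor suggestion might look like
this in outline - a Python sketch on a regular file, with descriptor names
of my own choosing.]

```python
import os
import tempfile

path = tempfile.NamedTemporaryFile(delete=False).name
# Two descriptors on the same file: ordinary writes go through fd_plain
# and may sit in the page cache; FUA-style writes go through fd_dsync,
# whose O_DSYNC flag makes each write durable before it returns.
fd_plain = os.open(path, os.O_WRONLY)
fd_dsync = os.open(path, os.O_WRONLY | os.O_DSYNC)

os.write(fd_plain, b"ordinary write")
os.lseek(fd_dsync, 0, os.SEEK_SET)
os.write(fd_dsync, b"FUA-style write!")  # stable on return

os.close(fd_plain)
os.close(fd_dsync)
```

Only the writes issued on fd_dsync pay the synchronous-write cost, which
is exactly the selective durability FUA emulation needs.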
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Alex Bligh @ 2011-05-22 12:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-kernel, Alex Bligh

Christoph,

> ext3 without barriers does not guarantee any data integrity and will
> lose your data in the blink of an eye if you have a large enough
> cache.

This doesn't appear to stop people using it :-)

> fdatasync is equivalent to fsync except that it does not flush
> non-essential metadata (basically just timestamps in practice), but it
> does flush the metadata required to find the data again, e.g.
> allocation information and extent maps. sync_file_range does nothing
> but flush out pagecache content - it means you basically won't get
> your data back in case of a crash if you:
>
>  a) have a volatile write cache in your disk (e.g. any normal SATA
>     disk)
>  b) are using a sparse file on a filesystem
>  c) are using a fallocate-preallocated file on a filesystem
>  d) use any file on a COW filesystem like btrfs
>
> I.e. it only does anything useful for you if you do not have a
> volatile write cache, and either use a raw block device node, or just
> overwrite an already fully allocated (and not preallocated) file on a
> non-COW filesystem.

Thanks, that's really useful.

> You can have a second FD open with O_DSYNC and write to that.

Fantastic - I shall do that in the long term.

> But for NBD and a Linux guest that won't make any difference yet.

As far as I know, nbd only has Linux clients. It certainly only has
Linux clients that transmit flush and FUA, because I only added that to
the protocol last week :-)

> While REQ_FUA is a separate flag, so far it's only used in combination
> with REQ_FLUSH, so the only pattern you'll see REQ_FUA used in is:
>
>   REQ_FLUSH
>   REQ_FUA
>
> which means there's no data but the one just written in the cache.

I think what you are saying is that when the request with REQ_FUA
arrives, it will have been immediately preceded by a REQ_FLUSH.
Therefore, I will only have the data attached to the request with
REQ_FUA to flush anyway, so an fdatasync() does no harm
performance-wise. That's what I'm currently doing if sync_file_range()
is not supported. It sounds like that's what I should be doing all the
time. If you don't mind, I shall borrow your text above and put it in
the source.

-- 
Alex Bligh
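[Editorial note: the policy Alex describes - treat NBD_CMD_FLUSH as an
fdatasync(), and fdatasync() after any FUA write, relying on the
FLUSH-then-FUA pattern to keep that cheap - might be sketched like this.
The handler and the FUA flag value (inferred from the 0x00010001 trace
line earlier in the thread) are illustrative, not the real nbd-server
code.]

```python
import os
import tempfile

# Command codes as they appear in the NBD log earlier in the thread; the
# FUA flag value is inferred from the 0x00010001 line and is illustrative.
NBD_CMD_WRITE = 1
NBD_CMD_FLUSH = 3
NBD_FLAG_FUA = 0x10000

def handle(fd, cmd, flags=0, offset=0, data=b""):
    if cmd == NBD_CMD_WRITE:
        os.pwrite(fd, data, offset)
        if flags & NBD_FLAG_FUA:
            # A FUA write normally follows a FLUSH, so only this write
            # is dirty and a full fdatasync() costs little extra.
            os.fdatasync(fd)
    elif cmd == NBD_CMD_FLUSH:
        os.fdatasync(fd)

path = tempfile.NamedTemporaryFile(delete=False).name
fd = os.open(path, os.O_RDWR)
handle(fd, NBD_CMD_WRITE, data=b"journal")              # plain write
handle(fd, NBD_CMD_FLUSH)                               # REQ_FLUSH
handle(fd, NBD_CMD_WRITE, NBD_FLAG_FUA, 0, b"commit!")  # REQ_FUA write
os.close(fd)
```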
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Christoph Hellwig @ 2011-05-22 12:04 UTC (permalink / raw)
To: Alex Bligh; +Cc: linux-kernel

On Sun, May 22, 2011 at 01:00:41PM +0100, Alex Bligh wrote:
> I think what you are saying is that when the request with REQ_FUA
> arrives, it will have been immediately preceded by a REQ_FLUSH.
> Therefore, I will only have the data attached to the request with
> REQ_FUA to flush anyway, so an fdatasync() does no harm
> performance-wise. That's what I'm currently doing if
> sync_file_range() is not supported. It sounds like that's what I
> should be doing all the time. If you don't mind, I shall borrow your
> text above and put it in the source.

Sure, feel free to borrow it. Note that I have a mid-term plan to
actually use REQ_FUA without a preceding REQ_FLUSH in XFS, but even in
that case the write cache probably won't be too full.

Long term, someone who cares enough should simply submit patches for
range fsync/fdatasync syscalls. We already have all the infrastructure
for it in the kernel, as it's used by the O_SYNC/O_DSYNC implementation
and nfsd, so it's just the actual syscall entry points that need to be
added.
* Re: REQ_FLUSH, REQ_FUA and open/close of block devices

From: Jeff Garzik @ 2011-05-22 16:56 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Alex Bligh, linux-kernel

On 05/22/2011 06:44 AM, Christoph Hellwig wrote:
> And btw, I'd like to know what makes sync_file_range so tempting, even
> after I added documentation explaining why it's almost always wrong to
> use it to the man page.

Because Linus used it to describe how to stream data to disk, such as
what MythTV does. Example:

  http://lkml.indiana.edu/hypermail/linux/kernel/0904.0/01076.html