* BLKZEROOUT + pread should return zeroes, right? @ 2014-10-14 3:01 Darrick J. Wong 2014-10-14 4:27 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2014-10-14 3:01 UTC (permalink / raw) To: Jens Axboe, Martin K. Petersen Cc: linux-fsdevel, Theodore Ts'o, linux-ext4 Hi everyone, What's the intended behavior if I issue BLKZEROOUT against a range of disk sectors and immediately re-read the sectors into a buffer? I've been trying to modify e2fsprogs to use BLKZEROOUT, and I noticed today that if I run mke2fs and e2fsck -fn enough times in a tight loop, that eventually e2fsck complains about corruption in blocks that ought to contain zeroes. If I dd the block in question after the failure, I get zeroes as I'd expect. This feels incorrect -- if I pwrite a block, then blkzeroout the block, then re-read it, I ought to see zeroes, right? Or is BLKZEROOUT some sort of hint that isn't perfectly reliable, a la BLKDISCARD? Or maybe I'm just doing it incorrectly? I looked at block/blk-num.c, this seems like it ought to be ok. I boiled the whole thing down into the attached test program, which can reproduce the symptoms in a few loop iterations. If I insert "sleep(1);" before the pread64, I pread zeroes every time; otherwise, I only pread zeroes part of the time. If I call "ioctl(fd, BLKFLSBUF);" before the BLKZEROOUT, the chances of preading zeroes increases dramatically, but is still not 100%. So, uh, is this a bug? Or is that just how BLKZEROOUT works? Or did I fubar the ioctl call? $ gcc -Wall -g -o test test.c $ sudo ./test /dev/sda 6: ERR 0 (0xffffffff) --D /* silly test program */ #define _XOPEN_SOURCE 600 #define _DARWIN_C_SOURCE #define _FILE_OFFSET_BITS 64 #define _LARGEFILE_SOURCE #define _LARGEFILE64_SOURCE #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <sys/types.h> #include <string.h> #include <sys/ioctl.h> #include <stdio.h> #include <stdint.h> #include <sys/stat.h> #include <fcntl.h> #include <linux/fs.h> #include <unistd.h> #include <assert.h> #define BUFSZ 4096 static int run(int iteration, const char *fname) { char buf[BUFSZ]; ssize_t sz; uint64_t range[2]; int fd, ret, i; printf("%d\r", iteration); fflush(stdout); fd = open(fname, O_RDWR); if (fd < 0) return 1; memset(buf, 0xFF, BUFSZ); sz = pwrite64(fd, buf, BUFSZ, 0); if (sz != BUFSZ) return 2; range[0] = 0; range[1] = 4096; ret = ioctl(fd, BLKZEROOUT, range); if (ret) return 5; sz = pread64(fd, buf, BUFSZ, 0); if (sz != BUFSZ) return 7; for (i = 0; i < BUFSZ; i++) { if (buf[i]) { printf("%d: ERR %d (0x%x)\n", iteration, i, buf[i]); return 8; } } close(fd); return 0; } int main(int argc, char *argv[]) { int iter = 0; int ret; if (argc != 2) { printf("Usage: %s blkdev\n", argv[0]); return 0; } do { ret = run(iter++, argv[1]); } while (!ret); return ret; } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 3:01 BLKZEROOUT + pread should return zeroes, right? Darrick J. Wong @ 2014-10-14 4:27 ` Dave Chinner 2014-10-14 6:02 ` Darrick J. Wong 2014-10-14 9:21 ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig 0 siblings, 2 replies; 15+ messages in thread From: Dave Chinner @ 2014-10-14 4:27 UTC (permalink / raw) To: Darrick J. Wong Cc: Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 On Mon, Oct 13, 2014 at 08:01:32PM -0700, Darrick J. Wong wrote: > Hi everyone, > > What's the intended behavior if I issue BLKZEROOUT against a range of disk > sectors and immediately re-read the sectors into a buffer? Should return zeros. [...] > I boiled the whole thing down into the attached test program, which can > reproduce the symptoms in a few loop iterations. If I insert "sleep(1);" > before the pread64, I pread zeroes every time; otherwise, I only pread zeroes > part of the time. If I call "ioctl(fd, BLKFLSBUF);" before the BLKZEROOUT, the > chances of preading zeroes increases dramatically, but is still not 100%. Hint #1: buffered IO == data in page cache. Hint #2: BLKZEROOUT operates at the bio level. > So, uh, is this a bug? Or is that just how BLKZEROOUT works? Or did I fubar > the ioctl call? Broken usage, IMO. If you are going to use the block layer ioctls to manipulate data int eh block device, you should be using direct Io for all your data IO to the block device. Otherwise, coherency problems occur.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 4:27 ` Dave Chinner @ 2014-10-14 6:02 ` Darrick J. Wong 2014-10-14 6:32 ` Theodore Ts'o 2014-10-14 9:21 ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig 1 sibling, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2014-10-14 6:02 UTC (permalink / raw) To: Dave Chinner Cc: Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote: > On Mon, Oct 13, 2014 at 08:01:32PM -0700, Darrick J. Wong wrote: > > Hi everyone, > > > > What's the intended behavior if I issue BLKZEROOUT against a range of disk > > sectors and immediately re-read the sectors into a buffer? > > Should return zeros. > > [...] > > > I boiled the whole thing down into the attached test program, which can > > reproduce the symptoms in a few loop iterations. If I insert "sleep(1);" > > before the pread64, I pread zeroes every time; otherwise, I only pread zeroes > > part of the time. If I call "ioctl(fd, BLKFLSBUF);" before the BLKZEROOUT, the > > chances of preading zeroes increases dramatically, but is still not 100%. > > Hint #1: buffered IO == data in page cache. > Hint #2: BLKZEROOUT operates at the bio level. Yeah, I forgot about that little quirk where the page cache is left in the dark. Thank you for the sanity check, Dave. > > So, uh, is this a bug? Or is that just how BLKZEROOUT works? Or did I fubar > > the ioctl call? > > Broken usage, IMO. If you are going to use the block layer ioctls to > manipulate data int eh block device, you should be using direct Io > for all your data IO to the block device. Otherwise, coherency > problems occur.... So... if these ioctls require direct IO read and write for any sane use model, why doesn't the kernel fail the request if the fd isn't in O_DIRECT mode? Or, if we do want to allow the ioctls to run on an fd that's opened in buffered IO mode, can we simply invalidate that part of the page cache after calling ZEROOUT? Something idiotic like fsync_bdev() -> blkdev_issue_zeroout -> invalidate_bdev -> invalidate_inode_pages2 seems to smooth things over, but that's a big dumb hammer. Tired of this for now, going to bed. --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 6:02 ` Darrick J. Wong @ 2014-10-14 6:32 ` Theodore Ts'o 2014-10-15 1:25 ` Darrick J. Wong 0 siblings, 1 reply; 15+ messages in thread From: Theodore Ts'o @ 2014-10-14 6:32 UTC (permalink / raw) To: Darrick J. Wong Cc: Dave Chinner, Jens Axboe, Martin K. Petersen, linux-fsdevel, linux-ext4 The bottom line is for most of the use cases we are talking about, we're only zero'ing one or two 4k blocks at a time, so I've never been convinced that it's worth it to use BLKZEROOUT. We could add page cache coherency features to BLKZEROOUT, but I'm not entirely sure it's worth the effort. No user space program would be able to take advantage of adding coherency for several years, or adding feature tests, etc., and is it worth the upside of being able to use WRITE SAME for a few 4k or 8k writes? (Which the vast majority of storage devices don't support anyway....) Cheers, - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 6:32 ` Theodore Ts'o @ 2014-10-15 1:25 ` Darrick J. Wong 2014-10-15 1:32 ` Martin K. Petersen 2014-10-15 10:02 ` Theodore Ts'o 0 siblings, 2 replies; 15+ messages in thread From: Darrick J. Wong @ 2014-10-15 1:25 UTC (permalink / raw) To: Theodore Ts'o Cc: Dave Chinner, Jens Axboe, Martin K. Petersen, linux-fsdevel, linux-ext4 On Tue, Oct 14, 2014 at 02:32:10AM -0400, Theodore Ts'o wrote: > The bottom line is for most of the use cases we are talking about, > we're only zero'ing one or two 4k blocks at a time, so I've never been > convinced that it's worth it to use BLKZEROOUT. > > We could add page cache coherency features to BLKZEROOUT, but I'm not > entirely sure it's worth the effort. No user space program would be > able to take advantage of adding coherency for several years, or Well then let's change BLKZEROOUT to require O_DIRECT instead of hiding the coherency problem, and introduce BLKZEROOUT_INV which issues the zero out and then takes care of page cache coherency. (Or at least the first part...) > adding feature tests, etc., and is it worth the upside of being able > to use WRITE SAME for a few 4k or 8k writes? (Which the vast majority > of storage devices don't support anyway....) I've converted mke2fs and e2fsck to use BLKZEROOUT to zero the journal and the inode tables when they want something to really be zero, and ext2fs_fallocate uses it to zero the fallocated range. I suspect those three will zero long runs of sectors each call. As for WRITE_SAME support, if it's there, why ignore it? The ioctl exists; someone else is bound to use it sooner or later. A further optimization to mke2fs would be to detect that we've run discard-with-zeroes and therefore can skip issuing subsequent zeroouts on the same ranges, but I'm wary that discard-zeroes-data does what it purports to do. If it /does/ work reliably, though, ext2fs_zero_blocks() could be rerouted to use discard instead. Really my reason for wanting to use zeroout is that in guaranteeing the zero-read behavior afterwards it seems like it ought to be less problematic than discard has been. --D > > Cheers, > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-15 1:25 ` Darrick J. Wong @ 2014-10-15 1:32 ` Martin K. Petersen 2014-10-16 20:04 ` Darrick J. Wong 2014-10-15 10:02 ` Theodore Ts'o 1 sibling, 1 reply; 15+ messages in thread From: Martin K. Petersen @ 2014-10-15 1:32 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, Martin K. Petersen, linux-fsdevel, linux-ext4 >>>>> "Darrick" == Darrick J Wong <darrick.wong@oracle.com> writes: Darrick> Well then let's change BLKZEROOUT to require O_DIRECT instead Darrick> of hiding the coherency problem, That would break existing users, though. Darrick> A further optimization to mke2fs would be to detect that we've Darrick> run discard-with-zeroes and therefore can skip issuing Darrick> subsequent zeroouts on the same ranges, but I'm wary that Darrick> discard-zeroes-data does what it purports to do. It's dubious. I'm working on making sure we only set discard_zeroes_data when the device guarantees it for 3.19. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-15 1:32 ` Martin K. Petersen @ 2014-10-16 20:04 ` Darrick J. Wong 0 siblings, 0 replies; 15+ messages in thread From: Darrick J. Wong @ 2014-10-16 20:04 UTC (permalink / raw) To: Martin K. Petersen Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-fsdevel, linux-ext4 On Tue, Oct 14, 2014 at 09:32:25PM -0400, Martin K. Petersen wrote: > >>>>> "Darrick" == Darrick J Wong <darrick.wong@oracle.com> writes: > > Darrick> Well then let's change BLKZEROOUT to require O_DIRECT instead > Darrick> of hiding the coherency problem, > > That would break existing users, though. <shrug> If they're using the existing BLKZEROOUT without O_DIRECT they're already broken. But I might as well just fix them quietly. --D > Darrick> A further optimization to mke2fs would be to detect that we've > Darrick> run discard-with-zeroes and therefore can skip issuing > Darrick> subsequent zeroouts on the same ranges, but I'm wary that > Darrick> discard-zeroes-data does what it purports to do. > > It's dubious. I'm working on making sure we only set discard_zeroes_data > when the device guarantees it for 3.19. > > -- > Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-15 1:25 ` Darrick J. Wong 2014-10-15 1:32 ` Martin K. Petersen @ 2014-10-15 10:02 ` Theodore Ts'o 2014-10-15 12:09 ` Martin K. Petersen 1 sibling, 1 reply; 15+ messages in thread From: Theodore Ts'o @ 2014-10-15 10:02 UTC (permalink / raw) To: Darrick J. Wong, Martin K. Petersen Cc: Dave Chinner, Jens Axboe, linux-fsdevel, linux-ext4 On Tue, Oct 14, 2014 at 06:25:34PM -0700, Darrick J. Wong wrote: > > > adding feature tests, etc., and is it worth the upside of being able > > to use WRITE SAME for a few 4k or 8k writes? (Which the vast majority > > of storage devices don't support anyway....) > > I've converted mke2fs and e2fsck to use BLKZEROOUT to zero the journal and the > inode tables when they want something to really be zero, and ext2fs_fallocate > uses it to zero the fallocated range. I suspect those three will zero long > runs of sectors each call. Sure, I agree that BLKZEROOUT might make sense for zero'ing the journal and the inode table. But the journal isn't that large, and the inode tables don't need to be zero'ed if lazy_itable_init is enabled. So the actual time saved in mke2fs isn't that great. > As for WRITE_SAME support, if it's there, why ignore it? The ioctl exists; > someone else is bound to use it sooner or later. For the special case of mke2fs, we can use BLKZEROOUT without too much difficulty; we can call fsync() on the block device, and if we're really paranoid, use posix_fadvise(..., POSIX_FADV_DONTNEED) to make sure the buffer cache is emptied --- although for the case of mke2fs, we don't really need to worry that much about cache aliasing issues since the blocks that we would be interested in zero'ing out are not blocks that would be first written, or later read, using buffered I/O. But I really don't think it's worth the effort to make WRITE_SAME work for e2fsck, where (a) the WRITE_SAME would only be for very small regions, (b) the issues of cache aliasing are much more likely, and (c) where switching over to O_DIRECT will often slow down e2fsck for many types of storage devices. (Unless you have a very, very large number of disks for which you are trying to run e2fsck in parallel across all of them at once, and comparatively very small amounts of memory such that there isn't much room for the buffer cache. In most other cases, though, I've benchmarked O_DIRECT for e2fsck, and it is not a win.) > A further optimization to mke2fs would be to detect that we've run > discard-with-zeroes and therefore can skip issuing subsequent zeroouts on the > same ranges, but I'm wary that discard-zeroes-data does what it purports to do. > If it /does/ work reliably, though, ext2fs_zero_blocks() could be rerouted to > use discard instead. Really my reason for wanting to use zeroout is that in > guaranteeing the zero-read behavior afterwards it seems like it ought to be > less problematic than discard has been. So the main issue is with devices that advertise discard_zeros_data is that technically, the SATA spec still specificies the discard as "advisory", and so there have apparently been devices where under heavy load (lots of other writebacks and GC activity happening in the background), they will decide they are too busy, and will simply drop the "advisotry" trim, and so the blocks never get zero'ed. I've talked to some engineers at some HDD vendors, and they have assured me _they_ would never allow that to happen, and this was only something done by fly-by-night / startup SSD vendors that deserved to go out of business. Nevertheless, despite this, the SATA spec currently seems to allow the interpretation that all discards can be considered advisory by the SSD, even if it advertises discard-zeroes-data. I'm not sure if how much of an issue this is with eMMC devices or PCI-attached flash, and of course, if you are a handset manufacturer or are a large systems integrator purchasing a very large number of devices, it becomes possible to negotitate guarantees beyond what is guaranteed by the spec, and if you are buying a large number of devices you can do testing to make sure such devices don't have such anti-social behaviour. Unfortunately, that's not something the average consumer who is buying the cheapest possible SSD from Amazon or buy.com can count on. Despite all of this, we are actually depending on discard-zeroes-data in mke2fs today, and this is mostly because (a) I haven't yet seen a case in practice where it has been a problem for mke2fs --- in general the write patterns for mke2fs tend to be less taxing for most SSD's, as compared to a disk under heavy random write traffic with WRITE_SAME being used to zero out short block ranges as in ext4_ext_zeroout(), and (b) if the SSD does end up failing to zeroout portions of the journal, it's not a complete disaster --- it's only problematic if we have an unclean shutdown before the journal has been cycled through completely, and if we are unlucky with the previous contents of the journal blocks. (And if journal checksums are enabled, this risk is reduced even more, to the point where we shouldn't need to zero out the journal at all.) So I've not been super concerned about this, even though I know we're being a little bit risky here. I would definitely never trust discard-zeroes-data in ext4_ext_zeroout(), though. On Tue, Oct 14, 2014 at 09:32:25PM -0400, Martin K. Petersen wrote: > > It's dubious. I'm working on making sure we only set discard_zeroes_data > when the device guarantees it for 3.19. I'm curious how you were going to determine that the device guarantees that discard_zeroes_data will be honored. Is there some new bit in some mode page that promises that discards won't be thrown away any time the device feels like it? Cheers, - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-15 10:02 ` Theodore Ts'o @ 2014-10-15 12:09 ` Martin K. Petersen 2014-10-18 0:03 ` [RFC PATCH] block: make BLKZEROOUT invalidate page cache contents Darrick J. Wong 0 siblings, 1 reply; 15+ messages in thread From: Martin K. Petersen @ 2014-10-15 12:09 UTC (permalink / raw) To: Theodore Ts'o Cc: Darrick J. Wong, Martin K. Petersen, Dave Chinner, Jens Axboe, linux-fsdevel, linux-ext4 >>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes: >> It's dubious. I'm working on making sure we only set >> discard_zeroes_data when the device guarantees it for 3.19. Ted> I'm curious how you were going to determine that the device Ted> guarantees that discard_zeroes_data will be honored. Is there some Ted> new bit in some mode page that promises that discards won't be Ted> thrown away any time the device feels like it? We discussed this a week or two ago over on linux-raid because RAID5 depends on hard guarantees in the discard_zeroes_data department. SCSI UNMAP suffers from the same lame behavior as ATA TRIM in the sense that a device can report that it supports zero after UNMAP. But there is no guarantee that all parts of an UNMAP command will be processed by the storage device. Only parts that were actually processed will return zeroes on subsequent reads. And obviously we don't know which parts the device decided to ignore. *sigh* It's incredibly frustrating that it is so hard to get the standards bodies to give any guarantees about anything. It's an absolute miracle that a READ after a WRITE is required to return the same data. Anyway. Contrary to UNMAP, WRITE SAME with the UNMAP bit set requires subsequent reads to deterministically return zeroes. If a block can't be unmapped for whatever reason it will be explicitly zeroed. So I'm working on a patch set that will: - Set discard_zeroes_data for SCSI devices that support the WRITE SAME w/ UNMAP commands. - Not set discard_zeroes_data for SCSI devices that only support UNMAP. - Set discard_zeroes_data for certain ATA SSDs that are known to behave correctly. It's a royal pain to have to maintain a whitelist. But all the hardware RAID vendors do the same thing. I'm talking to various folks to leverage any guarantees we and other vendors may have in existing product requirement documents. As you alluded to, there are devices that otherwise appear to be entirely well-behaved that can get stressed and start dropping discards. That may be acceptable for the ext4 use case but will cause corruption for RAID5. So we are compelled to being more conservative about setting discard_zeroes_data. Until my tweaks are in Neil has opted to disable discard on RAID5. Martin (I scream daily) -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC PATCH] block: make BLKZEROOUT invalidate page cache contents 2014-10-15 12:09 ` Martin K. Petersen @ 2014-10-18 0:03 ` Darrick J. Wong 0 siblings, 0 replies; 15+ messages in thread From: Darrick J. Wong @ 2014-10-18 0:03 UTC (permalink / raw) To: Martin K. Petersen Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-fsdevel, linux-ext4 All right, how's this for a first stab at invalidating the page cache? Since userspace doesn't really have a good way to find out which behavior it'll get, just define a new ioctl with the range in-parameters declared a little more explicitly. Userspace apps can either call the new BLKZEROOUT_INV ioctl by itself; failing that, they can call either BLKZEROOUT* with O_DIRECT set on the fd; or if they don't care for O_DIRECT, they can {fsync(); ioctl(BLKZEROOUT); posix_fadvise(DONTNEED);}, keeping in mind that a future kernel could ignore the DONTNEED. (I'll fix discard/secdiscard in a similar fashion if this sticks.) --D --- The BLKZEROOUT ioctl behaves similarly to O_DIRECT writes in that the writes are issued directly to disks without touching the page cache. However, the ioctl neither requires O_DIRECT to be set on the file descriptor (i.e. the fd can be in buffered mode) nor does it invalidate the appropriate parts of the page cache. Since it also guarantees that future reads return zeroes, the broken cache coherency gives the ioctl semantics that can trap unsuspecting users. Therefore, try to invalidate the page cache entries for the zeroed range, and set the user's length parameter to zero on success to show that the kernel took care of the invalidation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- block/ioctl.c | 29 +++++++++++++++++++++++------ include/uapi/linux/fs.h | 1 + 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/block/ioctl.c b/block/ioctl.c index d6cda81..d3688c0 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -188,17 +188,33 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start, static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start, uint64_t len) { + int ret; + struct address_space *mapping; + uint64_t end = start + len - 1; + if (start & 511) return -EINVAL; if (len & 511) return -EINVAL; - start >>= 9; - len >>= 9; - - if (start + len > (i_size_read(bdev->bd_inode) >> 9)) + if (end >= i_size_read(bdev->bd_inode)) return -EINVAL; - return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL); + mapping = bdev->bd_inode->i_mapping; + ret = filemap_fdatawrite_range(mapping, start, end); + if (ret) + goto out; + ret = filemap_fdatawait_range(mapping, start, end); + if (ret) + goto out; + + ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL); + if (ret) + goto out; + + ret = invalidate_inode_pages2_range(mapping, start >> PAGE_CACHE_SHIFT, + end >> PAGE_CACHE_SHIFT); +out: + return ret; } static int put_ushort(unsigned long arg, unsigned short val) @@ -317,7 +333,8 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd, return blk_ioctl_discard(bdev, range[0], range[1], cmd == BLKSECDISCARD); } - case BLKZEROOUT: { + case BLKZEROOUT: + case BLKZEROOUT_INV: { uint64_t range[2]; if (!(mode & FMODE_WRITE)) diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index ca1a11b..370b719 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -149,6 +149,7 @@ struct inodes_stat_t { #define BLKSECDISCARD _IO(0x12,125) #define BLKROTATIONAL _IO(0x12,126) #define BLKZEROOUT _IO(0x12,127) +#define BLKZEROOUT_INV _IOR(0x12, 127, uint64_t[2]) #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP _IO(0x00,1) /* bmap access */ ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 4:27 ` Dave Chinner 2014-10-14 6:02 ` Darrick J. Wong @ 2014-10-14 9:21 ` Christoph Hellwig 2014-10-14 13:44 ` Martin K. Petersen 2014-10-14 18:57 ` Zach Brown 1 sibling, 2 replies; 15+ messages in thread From: Christoph Hellwig @ 2014-10-14 9:21 UTC (permalink / raw) To: Dave Chinner Cc: Darrick J. Wong, Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote: > Broken usage, IMO. If you are going to use the block layer ioctls to > manipulate data int eh block device, you should be using direct Io > for all your data IO to the block device. Otherwise, coherency > problems occur.... I'd say BLKZEROOUT semantics are broken. Having an ioctl exposed that maniulates on-disk data without cache coherence is a nightmare that people trip over easily. Even experienced people like Darrick. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 9:21 ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig @ 2014-10-14 13:44 ` Martin K. Petersen 2014-10-14 18:57 ` Zach Brown 1 sibling, 0 replies; 15+ messages in thread From: Martin K. Petersen @ 2014-10-14 13:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Darrick J. Wong, Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 >>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes: >> Broken usage, IMO. If you are going to use the block layer ioctls to >> manipulate data int eh block device, you should be using direct Io >> for all your data IO to the block device. Otherwise, coherency >> problems occur.... Christoph> I'd say BLKZEROOUT semantics are broken. Having an ioctl Christoph> exposed that maniulates on-disk data without cache coherence Christoph> is a nightmare that people trip over easily. I agree it's suboptimal but it's not different from how the other block ioctls are working (discard, etc.). The main issue is that the blkdev_issue_foobar() calls were designed for in-kernel use and the calling filesystems were supposed to know what they were doing with those pages. And for convenience we just wired up the same calls as ioctls. I'm undecided as to which hammer would be best. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 9:21 ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig 2014-10-14 13:44 ` Martin K. Petersen @ 2014-10-14 18:57 ` Zach Brown 2014-10-14 20:21 ` Dave Chinner 1 sibling, 1 reply; 15+ messages in thread From: Zach Brown @ 2014-10-14 18:57 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Darrick J. Wong, Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 On Tue, Oct 14, 2014 at 02:21:28AM -0700, Christoph Hellwig wrote: > On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote: > > Broken usage, IMO. If you are going to use the block layer ioctls to > > manipulate data int eh block device, you should be using direct Io > > for all your data IO to the block device. Otherwise, coherency > > problems occur.... > > I'd say BLKZEROOUT semantics are broken. Having an ioctl exposed > that maniulates on-disk data without cache coherence is a nightmare > that people trip over easily. Even experienced people like Darrick. I'm definitely with Christoph on this one. Coherent writes should be the norm. - z ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 18:57 ` Zach Brown @ 2014-10-14 20:21 ` Dave Chinner 2014-10-15 1:02 ` Martin K. Petersen 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2014-10-14 20:21 UTC (permalink / raw) To: Zach Brown Cc: Christoph Hellwig, Darrick J. Wong, Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 On Tue, Oct 14, 2014 at 11:57:53AM -0700, Zach Brown wrote: > On Tue, Oct 14, 2014 at 02:21:28AM -0700, Christoph Hellwig wrote: > > On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote: > > > Broken usage, IMO. If you are going to use the block layer ioctls to > > > manipulate data int eh block device, you should be using direct Io > > > for all your data IO to the block device. Otherwise, coherency > > > problems occur.... > > > > I'd say BLKZEROOUT semantics are broken. Having an ioctl exposed > > that maniulates on-disk data without cache coherence is a nightmare > > that people trip over easily. Even experienced people like Darrick. > > I'm definitely with Christoph on this one. Coherent writes should be > the norm. I don't care what happens to the block layer ioctls w.r.t. cache coherency. If people want to make them fully cache coherent for buffered IO, go right ahead. But that doesn't change the fact that if you want your application to work sanely on existing kernels, then using direct IO is the only option... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: BLKZEROOUT + pread should return zeroes, right? 2014-10-14 20:21 ` Dave Chinner @ 2014-10-15 1:02 ` Martin K. Petersen 0 siblings, 0 replies; 15+ messages in thread From: Martin K. Petersen @ 2014-10-15 1:02 UTC (permalink / raw) To: Dave Chinner Cc: Zach Brown, Christoph Hellwig, Darrick J. Wong, Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4 >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes: Dave> I don't care what happens to the block layer ioctls w.r.t. cache Dave> coherency. If people want to make them fully cache coherent for Dave> buffered IO, go right ahead. Dave> But that doesn't change the fact that if you want your application Dave> to work sanely on existing kernels, then using direct IO is the Dave> only option... Yep! -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2014-10-18 0:03 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-10-14 3:01 BLKZEROOUT + pread should return zeroes, right? Darrick J. Wong 2014-10-14 4:27 ` Dave Chinner 2014-10-14 6:02 ` Darrick J. Wong 2014-10-14 6:32 ` Theodore Ts'o 2014-10-15 1:25 ` Darrick J. Wong 2014-10-15 1:32 ` Martin K. Petersen 2014-10-16 20:04 ` Darrick J. Wong 2014-10-15 10:02 ` Theodore Ts'o 2014-10-15 12:09 ` Martin K. Petersen 2014-10-18 0:03 ` [RFC PATCH] block: make BLKZEROOUT invalidate page cache contents Darrick J. Wong 2014-10-14 9:21 ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig 2014-10-14 13:44 ` Martin K. Petersen 2014-10-14 18:57 ` Zach Brown 2014-10-14 20:21 ` Dave Chinner 2014-10-15 1:02 ` Martin K. Petersen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).