BLKZEROOUT + pread should return zeroes, right?

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* BLKZEROOUT + pread should return zeroes, right?
@ 2014-10-14  3:01 Darrick J. Wong
  2014-10-14  4:27 ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2014-10-14  3:01 UTC (permalink / raw)
  To: Jens Axboe, Martin K. Petersen
  Cc: linux-fsdevel, Theodore Ts'o, linux-ext4

Hi everyone,

What's the intended behavior if I issue BLKZEROOUT against a range of disk
sectors and immediately re-read the sectors into a buffer?

I've been trying to modify e2fsprogs to use BLKZEROOUT, and I noticed today
that if I run mke2fs and e2fsck -fn enough times in a tight loop, that
eventually e2fsck complains about corruption in blocks that ought to contain
zeroes.  If I dd the block in question after the failure, I get zeroes as I'd
expect.

This feels incorrect -- if I pwrite a block, then blkzeroout the block, then
re-read it, I ought to see zeroes, right?  Or is BLKZEROOUT some sort of
hint that isn't perfectly reliable, a la BLKDISCARD?  Or maybe I'm just doing
it incorrectly?  I looked at block/blk-num.c, this seems like it ought to be
ok.

I boiled the whole thing down into the attached test program, which can
reproduce the symptoms in a few loop iterations.  If I insert "sleep(1);"
before the pread64, I pread zeroes every time; otherwise, I only pread zeroes
part of the time.  If I call "ioctl(fd, BLKFLSBUF);" before the BLKZEROOUT, the
chances of preading zeroes increases dramatically, but is still not 100%.

So, uh, is this a bug?  Or is that just how BLKZEROOUT works?  Or did I fubar
the ioctl call?

$ gcc -Wall -g -o test test.c
$ sudo ./test /dev/sda
6: ERR 0 (0xffffffff)

--D

/* silly test program */
#define _XOPEN_SOURCE 600
#define _DARWIN_C_SOURCE
#define _FILE_OFFSET_BITS 64
#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <sys/types.h>
#include <string.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <unistd.h>
#include <assert.h>

#define BUFSZ 4096

static int run(int iteration, const char *fname)
{
	char buf[BUFSZ];
	ssize_t sz;
	uint64_t range[2];
	int fd, ret, i;

	printf("%d\r", iteration);
	fflush(stdout);

	fd = open(fname, O_RDWR);
	if (fd < 0)
		return 1;

	memset(buf, 0xFF, BUFSZ);

	sz = pwrite64(fd, buf, BUFSZ, 0);
	if (sz != BUFSZ)
		return 2;

	range[0] = 0;
	range[1] = 4096;
	ret = ioctl(fd, BLKZEROOUT, range);
	if (ret)
		return 5;

	sz = pread64(fd, buf, BUFSZ, 0);
	if (sz != BUFSZ)
		return 7;

	for (i = 0; i < BUFSZ; i++) {
		if (buf[i]) {
			printf("%d: ERR %d (0x%x)\n", iteration, i, buf[i]);
			return 8;
		}
	}

	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	int iter = 0;
	int ret;

	if (argc != 2) {
		printf("Usage: %s blkdev\n", argv[0]);
		return 0;
	}

	do {
		ret = run(iter++, argv[1]);
	} while (!ret);

	return ret;
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  3:01 BLKZEROOUT + pread should return zeroes, right? Darrick J. Wong
@ 2014-10-14  4:27 ` Dave Chinner
  2014-10-14  6:02   ` Darrick J. Wong
  2014-10-14  9:21   ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2014-10-14  4:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o,
	linux-ext4

On Mon, Oct 13, 2014 at 08:01:32PM -0700, Darrick J. Wong wrote:
> Hi everyone,
> 
> What's the intended behavior if I issue BLKZEROOUT against a range of disk
> sectors and immediately re-read the sectors into a buffer?

Should return zeros.

[...]

> I boiled the whole thing down into the attached test program, which can
> reproduce the symptoms in a few loop iterations.  If I insert "sleep(1);"
> before the pread64, I pread zeroes every time; otherwise, I only pread zeroes
> part of the time.  If I call "ioctl(fd, BLKFLSBUF);" before the BLKZEROOUT, the
> chances of preading zeroes increases dramatically, but is still not 100%.

Hint #1: buffered IO == data in page cache.
Hint #2: BLKZEROOUT operates at the bio level.

> So, uh, is this a bug?  Or is that just how BLKZEROOUT works?  Or did I fubar
> the ioctl call?

Broken usage, IMO. If you are going to use the block layer ioctls to
manipulate data int eh block device, you should be using direct Io
for all your data IO to the block device. Otherwise, coherency
problems occur....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  4:27 ` Dave Chinner
@ 2014-10-14  6:02   ` Darrick J. Wong
  2014-10-14  6:32     ` Theodore Ts'o
  2014-10-14  9:21   ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2014-10-14  6:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Martin K. Petersen, linux-fsdevel, Theodore Ts'o,
	linux-ext4

On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote:
> On Mon, Oct 13, 2014 at 08:01:32PM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> > 
> > What's the intended behavior if I issue BLKZEROOUT against a range of disk
> > sectors and immediately re-read the sectors into a buffer?
> 
> Should return zeros.
> 
> [...]
> 
> > I boiled the whole thing down into the attached test program, which can
> > reproduce the symptoms in a few loop iterations.  If I insert "sleep(1);"
> > before the pread64, I pread zeroes every time; otherwise, I only pread zeroes
> > part of the time.  If I call "ioctl(fd, BLKFLSBUF);" before the BLKZEROOUT, the
> > chances of preading zeroes increases dramatically, but is still not 100%.
> 
> Hint #1: buffered IO == data in page cache.
> Hint #2: BLKZEROOUT operates at the bio level.

Yeah, I forgot about that little quirk where the page cache is left in the
dark.  Thank you for the sanity check, Dave.

> > So, uh, is this a bug?  Or is that just how BLKZEROOUT works?  Or did I fubar
> > the ioctl call?
> 
> Broken usage, IMO. If you are going to use the block layer ioctls to
> manipulate data int eh block device, you should be using direct Io
> for all your data IO to the block device. Otherwise, coherency
> problems occur....

So... if these ioctls require direct IO read and write for any sane use model,
why doesn't the kernel fail the request if the fd isn't in O_DIRECT mode?  Or,
if we do want to allow the ioctls to run on an fd that's opened in buffered IO
mode, can we simply invalidate that part of the page cache after calling
ZEROOUT?

Something idiotic like fsync_bdev() -> blkdev_issue_zeroout -> invalidate_bdev
-> invalidate_inode_pages2 seems to smooth things over, but that's a big dumb
hammer.

Tired of this for now, going to bed.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  6:02   ` Darrick J. Wong
@ 2014-10-14  6:32     ` Theodore Ts'o
  2014-10-15  1:25       ` Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2014-10-14  6:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Jens Axboe, Martin K. Petersen, linux-fsdevel,
	linux-ext4

The bottom line is for most of the use cases we are talking about,
we're only zero'ing one or two 4k blocks at a time, so I've never been
convinced that it's worth it to use BLKZEROOUT.

We could add page cache coherency features to BLKZEROOUT, but I'm not
entirely sure it's worth the effort.  No user space program would be
able to take advantage of adding coherency for several years, or
adding feature tests, etc., and is it worth the upside of being able
to use WRITE SAME for a few 4k or 8k writes?  (Which the vast majority
of storage devices don't support anyway....)

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  6:32     ` Theodore Ts'o
@ 2014-10-15  1:25       ` Darrick J. Wong
  2014-10-15  1:32         ` Martin K. Petersen
  2014-10-15 10:02         ` Theodore Ts'o
  0 siblings, 2 replies; 15+ messages in thread
From: Darrick J. Wong @ 2014-10-15  1:25 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Jens Axboe, Martin K. Petersen, linux-fsdevel,
	linux-ext4

On Tue, Oct 14, 2014 at 02:32:10AM -0400, Theodore Ts'o wrote:
> The bottom line is for most of the use cases we are talking about,
> we're only zero'ing one or two 4k blocks at a time, so I've never been
> convinced that it's worth it to use BLKZEROOUT.
> 
> We could add page cache coherency features to BLKZEROOUT, but I'm not
> entirely sure it's worth the effort.  No user space program would be
> able to take advantage of adding coherency for several years, or

Well then let's change BLKZEROOUT to require O_DIRECT instead of hiding the
coherency problem, and introduce BLKZEROOUT_INV which issues the zero out and
then takes care of page cache coherency.

(Or at least the first part...)

> adding feature tests, etc., and is it worth the upside of being able
> to use WRITE SAME for a few 4k or 8k writes?  (Which the vast majority
> of storage devices don't support anyway....)

I've converted mke2fs and e2fsck to use BLKZEROOUT to zero the journal and the
inode tables when they want something to really be zero, and ext2fs_fallocate
uses it to zero the fallocated range.  I suspect those three will zero long
runs of sectors each call.

As for WRITE_SAME support, if it's there, why ignore it?  The ioctl exists;
someone else is bound to use it sooner or later.

A further optimization to mke2fs would be to detect that we've run
discard-with-zeroes and therefore can skip issuing subsequent zeroouts on the
same ranges, but I'm wary that discard-zeroes-data does what it purports to do.
If it /does/ work reliably, though, ext2fs_zero_blocks() could be rerouted to
use discard instead.  Really my reason for wanting to use zeroout is that in
guaranteeing the zero-read behavior afterwards it seems like it ought to be
less problematic than discard has been.

--D
> 
> Cheers,
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-15  1:25       ` Darrick J. Wong
@ 2014-10-15  1:32         ` Martin K. Petersen
  2014-10-16 20:04           ` Darrick J. Wong
  2014-10-15 10:02         ` Theodore Ts'o
  1 sibling, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2014-10-15  1:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, Martin K. Petersen,
	linux-fsdevel, linux-ext4

>>>>> "Darrick" == Darrick J Wong <darrick.wong@oracle.com> writes:

Darrick> Well then let's change BLKZEROOUT to require O_DIRECT instead
Darrick> of hiding the coherency problem, 

That would break existing users, though.

Darrick> A further optimization to mke2fs would be to detect that we've
Darrick> run discard-with-zeroes and therefore can skip issuing
Darrick> subsequent zeroouts on the same ranges, but I'm wary that
Darrick> discard-zeroes-data does what it purports to do.

It's dubious. I'm working on making sure we only set discard_zeroes_data
when the device guarantees it for 3.19.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-15  1:32         ` Martin K. Petersen
@ 2014-10-16 20:04           ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2014-10-16 20:04 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-fsdevel,
	linux-ext4

On Tue, Oct 14, 2014 at 09:32:25PM -0400, Martin K. Petersen wrote:
> >>>>> "Darrick" == Darrick J Wong <darrick.wong@oracle.com> writes:
> 
> Darrick> Well then let's change BLKZEROOUT to require O_DIRECT instead
> Darrick> of hiding the coherency problem, 
> 
> That would break existing users, though.

<shrug> If they're using the existing BLKZEROOUT without O_DIRECT they're
already broken.

But I might as well just fix them quietly.

--D

> Darrick> A further optimization to mke2fs would be to detect that we've
> Darrick> run discard-with-zeroes and therefore can skip issuing
> Darrick> subsequent zeroouts on the same ranges, but I'm wary that
> Darrick> discard-zeroes-data does what it purports to do.
> 
> It's dubious. I'm working on making sure we only set discard_zeroes_data
> when the device guarantees it for 3.19.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-15  1:25       ` Darrick J. Wong
  2014-10-15  1:32         ` Martin K. Petersen
@ 2014-10-15 10:02         ` Theodore Ts'o
  2014-10-15 12:09           ` Martin K. Petersen
  1 sibling, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2014-10-15 10:02 UTC (permalink / raw)
  To: Darrick J. Wong, Martin K. Petersen
  Cc: Dave Chinner, Jens Axboe, linux-fsdevel, linux-ext4

On Tue, Oct 14, 2014 at 06:25:34PM -0700, Darrick J. Wong wrote:
> 
> > adding feature tests, etc., and is it worth the upside of being able
> > to use WRITE SAME for a few 4k or 8k writes?  (Which the vast majority
> > of storage devices don't support anyway....)
> 
> I've converted mke2fs and e2fsck to use BLKZEROOUT to zero the journal and the
> inode tables when they want something to really be zero, and ext2fs_fallocate
> uses it to zero the fallocated range.  I suspect those three will zero long
> runs of sectors each call.

Sure, I agree that BLKZEROOUT might make sense for zero'ing the
journal and the inode table.  But the journal isn't that large, and
the inode tables don't need to be zero'ed if lazy_itable_init is
enabled.  So the actual time saved in mke2fs isn't that great.

> As for WRITE_SAME support, if it's there, why ignore it?  The ioctl exists;
> someone else is bound to use it sooner or later.

For the special case of mke2fs, we can use BLKZEROOUT without too much
difficulty; we can call fsync() on the block device, and if we're
really paranoid, use posix_fadvise(..., POSIX_FADV_DONTNEED) to make
sure the buffer cache is emptied --- although for the case of mke2fs,
we don't really need to worry that much about cache aliasing issues
since the blocks that we would be interested in zero'ing out are not
blocks that would be first written, or later read, using buffered I/O.

But I really don't think it's worth the effort to make WRITE_SAME work
for e2fsck, where (a) the WRITE_SAME would only be for very small
regions, (b) the issues of cache aliasing are much more likely, and
(c) where switching over to O_DIRECT will often slow down e2fsck for
many types of storage devices.  (Unless you have a very, very large
number of disks for which you are trying to run e2fsck in parallel
across all of them at once, and comparatively very small amounts of
memory such that there isn't much room for the buffer cache.  In most
other cases, though, I've benchmarked O_DIRECT for e2fsck, and it is
not a win.)

> A further optimization to mke2fs would be to detect that we've run
> discard-with-zeroes and therefore can skip issuing subsequent zeroouts on the
> same ranges, but I'm wary that discard-zeroes-data does what it purports to do.
> If it /does/ work reliably, though, ext2fs_zero_blocks() could be rerouted to
> use discard instead.  Really my reason for wanting to use zeroout is that in
> guaranteeing the zero-read behavior afterwards it seems like it ought to be
> less problematic than discard has been.

So the main issue is with devices that advertise discard_zeros_data is
that technically, the SATA spec still specificies the discard as
"advisory", and so there have apparently been devices where under
heavy load (lots of other writebacks and GC activity happening in the
background), they will decide they are too busy, and will simply drop
the "advisotry" trim, and so the blocks never get zero'ed.

I've talked to some engineers at some HDD vendors, and they have
assured me _they_ would never allow that to happen, and this was only
something done by fly-by-night / startup SSD vendors that deserved to
go out of business.  Nevertheless, despite this, the SATA spec
currently seems to allow the interpretation that all discards can be
considered advisory by the SSD, even if it advertises
discard-zeroes-data.

I'm not sure if how much of an issue this is with eMMC devices or
PCI-attached flash, and of course, if you are a handset manufacturer
or are a large systems integrator purchasing a very large number of
devices, it becomes possible to negotitate guarantees beyond what is
guaranteed by the spec, and if you are buying a large number of
devices you can do testing to make sure such devices don't have such
anti-social behaviour.  Unfortunately, that's not something the
average consumer who is buying the cheapest possible SSD from Amazon
or buy.com can count on.

Despite all of this, we are actually depending on discard-zeroes-data
in mke2fs today, and this is mostly because (a) I haven't yet seen a
case in practice where it has been a problem for mke2fs --- in general
the write patterns for mke2fs tend to be less taxing for most SSD's,
as compared to a disk under heavy random write traffic with WRITE_SAME
being used to zero out short block ranges as in ext4_ext_zeroout(),
and (b) if the SSD does end up failing to zeroout portions of the
journal, it's not a complete disaster --- it's only problematic if we
have an unclean shutdown before the journal has been cycled through
completely, and if we are unlucky with the previous contents of the
journal blocks.

(And if journal checksums are enabled, this risk is reduced even more,
to the point where we shouldn't need to zero out the journal at all.)

So I've not been super concerned about this, even though I know we're
being a little bit risky here.  I would definitely never trust
discard-zeroes-data in ext4_ext_zeroout(), though.

On Tue, Oct 14, 2014 at 09:32:25PM -0400, Martin K. Petersen wrote:
> 
> It's dubious. I'm working on making sure we only set discard_zeroes_data
> when the device guarantees it for 3.19.

I'm curious how you were going to determine that the device guarantees
that discard_zeroes_data will be honored.  Is there some new bit in
some mode page that promises that discards won't be thrown away any
time the device feels like it?

Cheers,

							- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-15 10:02         ` Theodore Ts'o
@ 2014-10-15 12:09           ` Martin K. Petersen
  2014-10-18  0:03             ` [RFC PATCH] block: make BLKZEROOUT invalidate page cache contents Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Martin K. Petersen @ 2014-10-15 12:09 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J. Wong, Martin K. Petersen, Dave Chinner, Jens Axboe,
	linux-fsdevel, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

>> It's dubious. I'm working on making sure we only set
>> discard_zeroes_data when the device guarantees it for 3.19.

Ted> I'm curious how you were going to determine that the device
Ted> guarantees that discard_zeroes_data will be honored.  Is there some
Ted> new bit in some mode page that promises that discards won't be
Ted> thrown away any time the device feels like it?

We discussed this a week or two ago over on linux-raid because RAID5
depends on hard guarantees in the discard_zeroes_data department.

SCSI UNMAP suffers from the same lame behavior as ATA TRIM in the sense
that a device can report that it supports zero after UNMAP. But there is
no guarantee that all parts of an UNMAP command will be processed by the
storage device. Only parts that were actually processed will return
zeroes on subsequent reads. And obviously we don't know which parts the
device decided to ignore. *sigh*

It's incredibly frustrating that it is so hard to get the standards
bodies to give any guarantees about anything. It's an absolute miracle
that a READ after a WRITE is required to return the same data.

Anyway. Contrary to UNMAP, WRITE SAME with the UNMAP bit set requires
subsequent reads to deterministically return zeroes. If a block can't be
unmapped for whatever reason it will be explicitly zeroed. So I'm
working on a patch set that will:

 - Set discard_zeroes_data for SCSI devices that support the WRITE SAME
   w/ UNMAP commands.

 - Not set discard_zeroes_data for SCSI devices that only support UNMAP.

 - Set discard_zeroes_data for certain ATA SSDs that are known to behave
   correctly.

It's a royal pain to have to maintain a whitelist. But all the hardware
RAID vendors do the same thing. I'm talking to various folks to leverage
any guarantees we and other vendors may have in existing product
requirement documents.

As you alluded to, there are devices that otherwise appear to be
entirely well-behaved that can get stressed and start dropping
discards. That may be acceptable for the ext4 use case but will cause
corruption for RAID5. So we are compelled to being more conservative
about setting discard_zeroes_data. Until my tweaks are in Neil has opted
to disable discard on RAID5.

Martin (I scream daily)

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH] block: make BLKZEROOUT invalidate page cache contents
  2014-10-15 12:09           ` Martin K. Petersen
@ 2014-10-18  0:03             ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2014-10-18  0:03 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Theodore Ts'o, Dave Chinner, Jens Axboe, linux-fsdevel,
	linux-ext4

All right, how's this for a first stab at invalidating the page cache?  Since
userspace doesn't really have a good way to find out which behavior it'll get,
just define a new ioctl with the range in-parameters declared a little more
explicitly.

Userspace apps can either call the new BLKZEROOUT_INV ioctl by itself; failing
that, they can call either BLKZEROOUT* with O_DIRECT set on the fd; or if they
don't care for O_DIRECT, they can {fsync(); ioctl(BLKZEROOUT);
posix_fadvise(DONTNEED);}, keeping in mind that a future kernel could ignore
the DONTNEED.

(I'll fix discard/secdiscard in a similar fashion if this sticks.)

--D
---
The BLKZEROOUT ioctl behaves similarly to O_DIRECT writes in that the
writes are issued directly to disks without touching the page cache.
However, the ioctl neither requires O_DIRECT to be set on the file
descriptor (i.e. the fd can be in buffered mode) nor does it
invalidate the appropriate parts of the page cache.  Since it also
guarantees that future reads return zeroes, the broken cache coherency
gives the ioctl semantics that can trap unsuspecting users.

Therefore, try to invalidate the page cache entries for the zeroed
range, and set the user's length parameter to zero on success to show
that the kernel took care of the invalidation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 block/ioctl.c           |   29 +++++++++++++++++++++++------
 include/uapi/linux/fs.h |    1 +
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index d6cda81..d3688c0 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -188,17 +188,33 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
 			     uint64_t len)
 {
+	int ret;
+	struct address_space *mapping;
+	uint64_t end = start + len - 1;
+
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
 		return -EINVAL;
-	start >>= 9;
-	len >>= 9;
-
-	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+	if (end >= i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
+	mapping = bdev->bd_inode->i_mapping;
+	ret = filemap_fdatawrite_range(mapping, start, end);
+	if (ret)
+		goto out;
+	ret = filemap_fdatawait_range(mapping, start, end);
+	if (ret)
+		goto out;
+
+	ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	ret = invalidate_inode_pages2_range(mapping, start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+out:
+	return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -317,7 +333,8 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		return blk_ioctl_discard(bdev, range[0], range[1],
 					 cmd == BLKSECDISCARD);
 	}
-	case BLKZEROOUT: {
+	case BLKZEROOUT:
+	case BLKZEROOUT_INV: {
 		uint64_t range[2];
 
 		if (!(mode & FMODE_WRITE))
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ca1a11b..370b719 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -149,6 +149,7 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKZEROOUT_INV _IOR(0x12, 127, uint64_t[2])
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  4:27 ` Dave Chinner
  2014-10-14  6:02   ` Darrick J. Wong
@ 2014-10-14  9:21   ` Christoph Hellwig
  2014-10-14 13:44     ` Martin K. Petersen
  2014-10-14 18:57     ` Zach Brown
  1 sibling, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2014-10-14  9:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Jens Axboe, Martin K. Petersen, linux-fsdevel,
	Theodore Ts'o, linux-ext4

On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote:
> Broken usage, IMO. If you are going to use the block layer ioctls to
> manipulate data int eh block device, you should be using direct Io
> for all your data IO to the block device. Otherwise, coherency
> problems occur....

I'd say BLKZEROOUT semantics are broken.  Having an ioctl exposed
that maniulates on-disk data without cache coherence is a nightmare
that people trip over easily.  Even experienced people like Darrick.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  9:21   ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig
@ 2014-10-14 13:44     ` Martin K. Petersen
  2014-10-14 18:57     ` Zach Brown
  1 sibling, 0 replies; 15+ messages in thread
From: Martin K. Petersen @ 2014-10-14 13:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Darrick J. Wong, Jens Axboe, Martin K. Petersen,
	linux-fsdevel, Theodore Ts'o, linux-ext4

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

>> Broken usage, IMO. If you are going to use the block layer ioctls to
>> manipulate data int eh block device, you should be using direct Io
>> for all your data IO to the block device. Otherwise, coherency
>> problems occur....

Christoph> I'd say BLKZEROOUT semantics are broken.  Having an ioctl
Christoph> exposed that maniulates on-disk data without cache coherence
Christoph> is a nightmare that people trip over easily.

I agree it's suboptimal but it's not different from how the other block
ioctls are working (discard, etc.).

The main issue is that the blkdev_issue_foobar() calls were designed for
in-kernel use and the calling filesystems were supposed to know what
they were doing with those pages. And for convenience we just wired up
the same calls as ioctls.

I'm undecided as to which hammer would be best.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14  9:21   ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig
  2014-10-14 13:44     ` Martin K. Petersen
@ 2014-10-14 18:57     ` Zach Brown
  2014-10-14 20:21       ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Zach Brown @ 2014-10-14 18:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Darrick J. Wong, Jens Axboe, Martin K. Petersen,
	linux-fsdevel, Theodore Ts'o, linux-ext4

On Tue, Oct 14, 2014 at 02:21:28AM -0700, Christoph Hellwig wrote:
> On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote:
> > Broken usage, IMO. If you are going to use the block layer ioctls to
> > manipulate data int eh block device, you should be using direct Io
> > for all your data IO to the block device. Otherwise, coherency
> > problems occur....
> 
> I'd say BLKZEROOUT semantics are broken.  Having an ioctl exposed
> that maniulates on-disk data without cache coherence is a nightmare
> that people trip over easily.  Even experienced people like Darrick.

I'm definitely with Christoph on this one.  Coherent writes should be
the norm.

- z

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14 18:57     ` Zach Brown
@ 2014-10-14 20:21       ` Dave Chinner
  2014-10-15  1:02         ` Martin K. Petersen
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-10-14 20:21 UTC (permalink / raw)
  To: Zach Brown
  Cc: Christoph Hellwig, Darrick J. Wong, Jens Axboe,
	Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4

On Tue, Oct 14, 2014 at 11:57:53AM -0700, Zach Brown wrote:
> On Tue, Oct 14, 2014 at 02:21:28AM -0700, Christoph Hellwig wrote:
> > On Tue, Oct 14, 2014 at 03:27:11PM +1100, Dave Chinner wrote:
> > > Broken usage, IMO. If you are going to use the block layer ioctls to
> > > manipulate data int eh block device, you should be using direct Io
> > > for all your data IO to the block device. Otherwise, coherency
> > > problems occur....
> > 
> > I'd say BLKZEROOUT semantics are broken.  Having an ioctl exposed
> > that maniulates on-disk data without cache coherence is a nightmare
> > that people trip over easily.  Even experienced people like Darrick.
> 
> I'm definitely with Christoph on this one.  Coherent writes should be
> the norm.

I don't care what happens to the block layer ioctls w.r.t. cache
coherency. If people want to make them fully cache coherent for
buffered IO, go right ahead.

But that doesn't change the fact that if you want your application
to work sanely on existing kernels, then using direct IO is the only
option...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: BLKZEROOUT + pread should return zeroes, right?
  2014-10-14 20:21       ` Dave Chinner
@ 2014-10-15  1:02         ` Martin K. Petersen
  0 siblings, 0 replies; 15+ messages in thread
From: Martin K. Petersen @ 2014-10-15  1:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Zach Brown, Christoph Hellwig, Darrick J. Wong, Jens Axboe,
	Martin K. Petersen, linux-fsdevel, Theodore Ts'o, linux-ext4

>>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:

Dave> I don't care what happens to the block layer ioctls w.r.t. cache
Dave> coherency. If people want to make them fully cache coherent for
Dave> buffered IO, go right ahead.

Dave> But that doesn't change the fact that if you want your application
Dave> to work sanely on existing kernels, then using direct IO is the
Dave> only option...

Yep!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-10-18  0:03 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-14  3:01 BLKZEROOUT + pread should return zeroes, right? Darrick J. Wong
2014-10-14  4:27 ` Dave Chinner
2014-10-14  6:02   ` Darrick J. Wong
2014-10-14  6:32     ` Theodore Ts'o
2014-10-15  1:25       ` Darrick J. Wong
2014-10-15  1:32         ` Martin K. Petersen
2014-10-16 20:04           ` Darrick J. Wong
2014-10-15 10:02         ` Theodore Ts'o
2014-10-15 12:09           ` Martin K. Petersen
2014-10-18  0:03             ` [RFC PATCH] block: make BLKZEROOUT invalidate page cache contents Darrick J. Wong
2014-10-14  9:21   ` BLKZEROOUT + pread should return zeroes, right? Christoph Hellwig
2014-10-14 13:44     ` Martin K. Petersen
2014-10-14 18:57     ` Zach Brown
2014-10-14 20:21       ` Dave Chinner
2014-10-15  1:02         ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).