Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH 0/7] block: T10/DIF Fixes and cleanups v2
From: Martin K. Petersen @ 2017-04-03 21:12 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

Dmitry Monakhov <dmonakhov@openvz.org> writes:

Dmitry,

> This patch set fix various problems spotted during T10/DIF integrity
> machinery testing.
>
> TOC:
> ## Fix various bugs in T10/DIF/DIX infrastructure
> 0001-bio-integrity-Do-not-allocate-integrity-context-for-fsync
> 0002-bio-integrity-save-original-iterator-for-verify-stage
> 0003-bio-integrity-bio_trim-should-truncate-integrity-vec
> 0004-bio-integrity-fix-interface-for-bio_integrity_trim
> ## Cleanup T10/DIF/DIX infrastructure
> 0005-bio-integrity-add-bio_integrity_setup-helper
> 0006-T10-Move-opencoded-contants-to-common-header
> ## General bulletproof protection for block layer
> 0007-Guard-bvec-iteration-logic-v2

Looks like a nice cleanup of some of the things that have rotted a bit
as a result of the immutable bvec efforts. No major objections from
here. I'll try your series on my qual setup tomorrow to make sure
everything is working correctly.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* [PATCH] blk-mq: Remove blk_mq_queue_data.list
From: Bart Van Assche @ 2017-04-03 20:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block@vger.kernel.org

The block layer core sets blk_mq_queue_data.list but no block=0A=
drivers read that member. Hence remove it and also the code that=0A=
is used to set this member.=0A=
=0A=
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>=0A=
---=0A=
 block/blk-mq.c         | 17 -----------------=0A=
 include/linux/blk-mq.h |  1 -=0A=
 2 files changed, 18 deletions(-)=0A=
=0A=
diff --git a/block/blk-mq.c b/block/blk-mq.c=0A=
index 8fb983e6e2e4..b5580b09b4a5 100644=0A=
--- a/block/blk-mq.c=0A=
+++ b/block/blk-mq.c=0A=
@@ -984,17 +984,9 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hct=
x, struct list_head *list)=0A=
 {=0A=
 	struct request_queue *q =3D hctx->queue;=0A=
 	struct request *rq;=0A=
-	LIST_HEAD(driver_list);=0A=
-	struct list_head *dptr;=0A=
 	int errors, queued, ret =3D BLK_MQ_RQ_QUEUE_OK;=0A=
 =0A=
 	/*=0A=
-	 * Start off with dptr being NULL, so we start the first request=0A=
-	 * immediately, even if we have more pending.=0A=
-	 */=0A=
-	dptr =3D NULL;=0A=
-=0A=
-	/*=0A=
 	 * Now process all the entries, sending them to the driver.=0A=
 	 */=0A=
 	errors =3D queued =3D 0;=0A=
@@ -1026,7 +1018,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hc=
tx, struct list_head *list)=0A=
 		list_del_init(&rq->queuelist);=0A=
 =0A=
 		bd.rq =3D rq;=0A=
-		bd.list =3D dptr;=0A=
 =0A=
 		/*=0A=
 		 * Flag last if we have no more requests, or if we have more=0A=
@@ -1062,13 +1053,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *h=
ctx, struct list_head *list)=0A=
 =0A=
 		if (ret =3D=3D BLK_MQ_RQ_QUEUE_BUSY)=0A=
 			break;=0A=
-=0A=
-		/*=0A=
-		 * We've done the first request. If we have more than 1=0A=
-		 * left in the list, set dptr to defer issue.=0A=
-		 */=0A=
-		if (!dptr && list->next !=3D list->prev)=0A=
-			dptr =3D &driver_list;=0A=
 	}=0A=
 =0A=
 	hctx->dispatched[queued_to_index(queued)]++;=0A=
@@ -1451,7 +1435,6 @@ static void __blk_mq_try_issue_directly(struct reques=
t *rq, blk_qc_t *cookie,=0A=
 	struct request_queue *q =3D rq->q;=0A=
 	struct blk_mq_queue_data bd =3D {=0A=
 		.rq =3D rq,=0A=
-		.list =3D NULL,=0A=
 		.last =3D 1=0A=
 	};=0A=
 	struct blk_mq_hw_ctx *hctx;=0A=
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h=0A=
index ebeea36ff9cd..90af6a31458d 100644=0A=
--- a/include/linux/blk-mq.h=0A=
+++ b/include/linux/blk-mq.h=0A=
@@ -81,7 +81,6 @@ struct blk_mq_tag_set {=0A=
 =0A=
 struct blk_mq_queue_data {=0A=
 	struct request *rq;=0A=
-	struct list_head *list;=0A=
 	bool last;=0A=
 };=0A=
 =0A=
-- =0A=
2.12.0=0A=
=0A=

^ permalink raw reply related

* Re: [PATCH] block: allow us to change max_segments on the fly
From: Jens Axboe @ 2017-04-03 19:15 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-block
In-Reply-To: <1491246798-1774-1-git-send-email-jbacik@fb.com>

On 04/03/2017 01:13 PM, Josef Bacik wrote:
> For things like NBD we want to be able to experiment with different
> sized requests going over the wire.  Half of our limit is controled by
> max_sectors_kb, but the other half is how many bvec's we can cram into a
> request, which is controlled by max_segments.  Changing this sysfs knob
> to be writeable allows us to be able to control our io limits more
> precisely.

This is a hw setting for most drivers, we can't make it user settable.
If you want it user settable, you'd have to split it in a hw and user
setting.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH] block: allow us to change max_segments on the fly
From: Josef Bacik @ 2017-04-03 19:13 UTC (permalink / raw)
  To: kernel-team, axboe, linux-block

For things like NBD we want to be able to experiment with different
sized requests going over the wire.  Half of our limit is controled by
max_sectors_kb, but the other half is how many bvec's we can cram into a
request, which is controlled by max_segments.  Changing this sysfs knob
to be writeable allows us to be able to control our io limits more
precisely.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 block/blk-sysfs.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4585426..e7db3f8 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -116,6 +116,19 @@ static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
 	return queue_var_show(max_sectors_kb, (page));
 }
 
+static ssize_t queue_max_segments_store(struct request_queue *q,
+					const char *page, size_t count)
+{
+	unsigned long max_segments;
+	ssize_t ret = queue_var_store(&max_segments, page, count);
+
+	if (ret < 0)
+		return ret;
+
+	q->limits.max_segments = (short)max_segments;
+	return ret;
+}
+
 static ssize_t queue_max_segments_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(queue_max_segments(q), (page));
@@ -527,8 +540,9 @@ static struct queue_sysfs_entry queue_max_hw_sectors_entry = {
 };
 
 static struct queue_sysfs_entry queue_max_segments_entry = {
-	.attr = {.name = "max_segments", .mode = S_IRUGO },
+	.attr = {.name = "max_segments", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_max_segments_show,
+	.store = queue_max_segments_store,
 };
 
 static struct queue_sysfs_entry queue_max_discard_segments_entry = {
-- 
2.7.4

^ permalink raw reply related

* [PATCH 8/8] nowait aio: btrfs
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Return EAGAIN if any of the following checks fail
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated
---
 fs/btrfs/file.c  | 25 ++++++++++++++++++++-----
 fs/btrfs/inode.c |  3 +++
 fs/btrfs/super.c |  2 +-
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb72..a870e5d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	ssize_t num_written = 0;
 	bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
 	ssize_t err;
-	loff_t pos;
-	size_t count;
+	loff_t pos = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
 	loff_t oldsize;
 	int clean_page = 0;
 
-	inode_lock(inode);
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+			(iocb->ki_flags & IOCB_DIRECT)) {
+		/* Don't sleep on inode rwsem */
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+		/*
+		 * We will allocate space in case nodatacow is not set,
+		 * so bail
+		 */
+		if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+					      BTRFS_INODE_PREALLOC)) ||
+		    check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+			inode_unlock(inode);
+			return -EAGAIN;
+		}
+	} else
+		inode_lock(inode);
+
 	err = generic_write_checks(iocb, from);
 	if (err <= 0) {
 		inode_unlock(inode);
@@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	 */
 	update_time_for_write(inode);
 
-	pos = iocb->ki_pos;
-	count = iov_iter_count(from);
 	start_pos = round_down(pos, fs_info->sectorsize);
 	oldsize = i_size_read(inode);
 	if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a18510b..d91b21a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8627,6 +8627,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			dio_data.overwrite = 1;
 			inode_unlock(inode);
 			relock = true;
+		} else if (iocb->ki_flags & IOCB_NOWAIT) {
+			ret = -EAGAIN;
+			goto out;
 		}
 		ret = btrfs_delalloc_reserve_space(inode, offset, count);
 		if (ret)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index da687dc..5e60659 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2175,7 +2175,7 @@ static struct file_system_type btrfs_fs_type = {
 	.name		= "btrfs",
 	.mount		= btrfs_mount,
 	.kill_sb	= btrfs_kill_super,
-	.fs_flags	= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+	.fs_flags	= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_NOWAIT,
 };
 MODULE_ALIAS_FS("btrfs");
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH 7/8] nowait aio: xfs
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation either due to file extension, writing to a hole,
or COW or waiting for other DIOs to finish.
---
 fs/xfs/xfs_file.c  | 15 +++++++++++----
 fs/xfs/xfs_iomap.c | 13 +++++++++++++
 fs/xfs/xfs_super.c |  2 +-
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a8..08a5eef 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
 		iolock = XFS_IOLOCK_SHARED;
 	}
 
-	xfs_ilock(ip, iolock);
-
+	if (!xfs_ilock_nowait(ip, iolock)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, iolock);
+	}
 	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
 	if (ret)
 		goto out;
@@ -553,9 +556,13 @@ xfs_file_dio_aio_write(
 	 * otherwise demote the lock if we had to take the exclusive lock
 	 * for other reasons in xfs_file_aio_write_checks.
 	 */
-	if (unaligned_io)
+	if (unaligned_io) {
+		/* If we are going to wait for other DIO to finish, bail */
+		if ((iocb->ki_flags & IOCB_NOWAIT) &&
+		     atomic_read(&inode->i_dio_count))
+			return -EAGAIN;
 		inode_dio_wait(inode);
-	else if (iolock == XFS_IOLOCK_EXCL) {
+	} else if (iolock == XFS_IOLOCK_EXCL) {
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
 		iolock = XFS_IOLOCK_SHARED;
 	}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b..6843725 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1015,6 +1015,11 @@ xfs_file_iomap_begin(
 
 	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
 		if (flags & IOMAP_DIRECT) {
+			/* A reflinked inode will result in CoW alloc */
+			if (flags & IOMAP_NOWAIT) {
+				error = -EAGAIN;
+				goto out_unlock;
+			}
 			/* may drop and re-acquire the ilock */
 			error = xfs_reflink_allocate_cow(ip, &imap, &shared,
 					&lockmode);
@@ -1032,6 +1037,14 @@ xfs_file_iomap_begin(
 
 	if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
 		/*
+		 * If nowait is set bail since we are going to make
+		 * allocations.
+		 */
+		if (flags & IOMAP_NOWAIT) {
+			error = -EAGAIN;
+			goto out_unlock;
+		}
+		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
 		 * with the work writeback does. This is a completely arbitrary
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 685c042..070a30e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1749,7 +1749,7 @@ static struct file_system_type xfs_fs_type = {
 	.name			= "xfs",
 	.mount			= xfs_fs_mount,
 	.kill_sb		= kill_block_super,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_NOWAIT,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH 6/8] nowait aio: ext4
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Return EAGAIN if any of the following checks fail for direct I/O:
 + i_rwsem is lockable
 + Writing beyond end of file (will trigger allocation)
 + Blocks are not allocated at the write location
---
 fs/ext4/file.c  | 48 +++++++++++++++++++++++++++++++-----------------
 fs/ext4/super.c |  6 +++---
 2 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8210c1f..e223b9f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -127,27 +127,22 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
 	return 0;
 }
 
-/* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+/* Are IO blocks allocated */
+static bool ext4_blocks_mapped(struct inode *inode, loff_t pos, loff_t len,
+				struct ext4_map_blocks *map)
 {
-	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
 	int err, blklen;
 
 	if (pos + len > i_size_read(inode))
 		return false;
 
-	map.m_lblk = pos >> blkbits;
-	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-	blklen = map.m_len;
+	map->m_lblk = pos >> blkbits;
+	map->m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
+	blklen = map->m_len;
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. To exclude
-	 * unwritten extents, we need to check m_flags.
-	 */
-	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
+	err = ext4_map_blocks(NULL, inode, map, 0);
+	return err == blklen;
 }
 
 static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
@@ -204,6 +199,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	int o_direct = iocb->ki_flags & IOCB_DIRECT;
+	int nowait = iocb->ki_flags & IOCB_NOWAIT;
 	int unaligned_aio = 0;
 	int overwrite = 0;
 	ssize_t ret;
@@ -216,7 +212,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		return ext4_dax_write_iter(iocb, from);
 #endif
 
-	inode_lock(inode);
+	if (o_direct && nowait) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
@@ -235,9 +237,21 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	iocb->private = &overwrite;
 	/* Check whether we do a DIO overwrite or not */
-	if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-	    ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-		overwrite = 1;
+	if (o_direct && !unaligned_aio) {
+		struct ext4_map_blocks map;
+		if (ext4_blocks_mapped(inode, iocb->ki_pos,
+				      iov_iter_count(from), &map)) {
+	 		/* To exclude unwritten extents, we need to check
+			 * m_flags.
+			 */
+			if (ext4_should_dioread_nolock(inode) &&
+			    (map.m_flags & EXT4_MAP_MAPPED))
+				overwrite = 1;
+		} else if (iocb->ki_flags & IOCB_NOWAIT) {
+			ret = -EAGAIN;
+			goto out;
+		}
+	}
 
 	ret = __generic_file_write_iter(iocb, from);
 	inode_unlock(inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a9448db..e1d424a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -117,7 +117,7 @@ static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_NOWAIT,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -132,7 +132,7 @@ static struct file_system_type ext3_fs_type = {
 	.name		= "ext3",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_NOWAIT,
 };
 MODULE_ALIAS_FS("ext3");
 MODULE_ALIAS("ext3");
@@ -5623,7 +5623,7 @@ static struct file_system_type ext4_fs_type = {
 	.name		= "ext4",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_NOWAIT,
 };
 MODULE_ALIAS_FS("ext4");
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH 5/8] nowait aio: return on congested block device
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A new flag BIO_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

To facilitate this, QUEUE_FLAG_NOWAIT is set to devices
which support this. While currently this is set to
virtio and sd only. Support to more devices will be added soon.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 block/blk-core.c           | 24 ++++++++++++++++++++++--
 block/blk-mq-sched.c       |  3 +++
 block/blk-mq.c             |  4 ++++
 drivers/block/virtio_blk.c |  3 +++
 drivers/scsi/sd.c          |  3 +++
 fs/direct-io.c             | 11 +++++++++--
 include/linux/bio.h        |  6 ++++++
 include/linux/blk_types.h  |  1 +
 include/linux/blkdev.h     |  2 ++
 9 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c22..95a9b18 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	if (!IS_ERR(rq))
 		return rq;
 
+	if (bio && bio_flagged(bio, BIO_NOWAIT)) {
+		blk_put_rl(rl);
+		return ERR_PTR(-EAGAIN);
+	}
+
 	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
@@ -1870,6 +1875,18 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	if (bio_flagged(bio, BIO_NOWAIT)) {
+		if (!blk_queue_nowait(q)) {
+			err = -EOPNOTSUPP;
+			goto end_io;
+		}
+		if (!(bio->bi_opf & (REQ_SYNC | REQ_IDLE))) {
+			err = -EINVAL;
+			goto end_io;
+		}
+	}
+
+
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_iter.bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2038,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, false) == 0)) {
+		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
 			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
@@ -2046,7 +2063,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack[0], &same);
 			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
-			bio_io_error(bio);
+			if (unlikely(!blk_queue_dying(q) && bio_flagged(bio, BIO_NOWAIT)))
+				bio_wouldblock_error(bio);
+			else
+				bio_io_error(bio);
 		}
 		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff..40e78b5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+	if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+		data->flags |= BLK_MQ_REQ_NOWAIT;
+
 	if (e) {
 		data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6b6e7bc..2d90b12 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1511,6 +1511,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && bio_flagged(bio, BIO_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
@@ -1635,6 +1637,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		if (bio && bio_flagged(bio, BIO_NOWAIT))
+			bio_wouldblock_error(bio);
 		return BLK_QC_T_NONE;
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1d4c9f8..7481124 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -731,6 +731,9 @@ static int virtblk_probe(struct virtio_device *vdev)
 	/* No real sector limit. */
 	blk_queue_max_hw_sectors(q, -1U);
 
+	/* Request queue supports BIO_NOWAIT */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
+
 	/* Host can optionally specify maximum segment size and number of
 	 * segments. */
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc..9df85ee 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3177,6 +3177,9 @@ static int sd_probe(struct device *dev)
 					     SD_MOD_TIMEOUT);
 	}
 
+	/* Support BIO_NOWAIT */
+	queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, sdp->request_queue);
+
 	device_initialize(&sdkp->dev);
 	sdkp->dev.parent = dev;
 	sdkp->dev.class = &sd_disk_class;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea..f6835d3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio_set_flag(bio, BIO_NOWAIT);
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 	unsigned i;
 	int err;
 
-	if (bio->bi_error)
-		dio->io_error = -EIO;
+	if (bio->bi_error) {
+		if (bio_flagged(bio, BIO_NOWAIT))
+			dio->io_error = -EAGAIN;
+		else
+			dio->io_error = -EIO;
+	}
 
 	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
 		err = bio->bi_error;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..1a92707 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -425,6 +425,12 @@ static inline void bio_io_error(struct bio *bio)
 	bio_endio(bio);
 }
 
+static inline void bio_wouldblock_error(struct bio *bio)
+{
+	bio->bi_error = -EAGAIN;
+	bio_endio(bio);
+}
+
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb..514c08e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,7 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
+#define BIO_NOWAIT	10	/* don't block over blk device congestion */
 
 /*
  * Flags starting here get preserved by bio_reset() - this includes
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da60..ae38ab6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -611,6 +611,7 @@ struct request_queue {
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
 #define QUEUE_FLAG_RESTART     28	/* queue needs restart at completion */
+#define QUEUE_FLAG_NOWAIT      29	/* queue supports BIO_NOWAIT */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
@@ -701,6 +702,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_secure_erase(q) \
 	(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
+#define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.10.2

^ permalink raw reply related

* [PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.
---
 fs/iomap.c            | 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd..d1c8175 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	} else {
 		dio->flags |= IOMAP_DIO_WRITE;
 		flags |= IOMAP_WRITE;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			flags |= IOMAP_NOWAIT;
 	}
 
 	if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810..53f6af8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT		(1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
 	/*
-- 
2.10.2

^ permalink raw reply related

* [PATCH 3/8] nowait aio: return if direct write will trigger writeback
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

This introduces a new function filemap_range_has_page() which
returns true if the file's mapping has a page within the range
mentioned.

Return -EINVAL for buffered AIO: there are multiple causes of
delay such as page locks, dirty throttling logic, page loading
from disk etc. which cannot be taken care of.
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 802cfe2..4721136 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2515,6 +2515,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index e08f3b9..c020e23 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:		address space structure to wait for
+ * @start_byte:		offset in bytes where the range starts
+ * @end_byte:		offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+				loff_t start_byte, loff_t end_byte)
+{
+	pgoff_t index = start_byte >> PAGE_SHIFT;
+	pgoff_t end = end_byte >> PAGE_SHIFT;
+	struct pagevec pvec;
+	int ret;
+
+	if (end_byte < start_byte)
+		return 0;
+
+	if (mapping->nrpages == 0)
+		return 0;
+
+	pagevec_init(&pvec, 0);
+	ret = pagevec_lookup(&pvec, mapping, index, 1);
+	if (!ret)
+		return 0;
+	ret = (pvec.pages[0]->index <= end);
+	pagevec_release(&pvec);
+	return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 				     loff_t start_byte, loff_t end_byte)
 {
@@ -2640,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 
 	pos = iocb->ki_pos;
 
+	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+		return -EINVAL;
+
 	if (limit != RLIM_INFINITY) {
 		if (iocb->ki_pos >= limit) {
 			send_sig(SIGXFSZ, current, 0);
@@ -2709,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	write_len = iov_iter_count(from);
 	end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-	if (written)
-		goto out;
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/* If there are pages to writeback, return */
+		if (filemap_range_has_page(inode->i_mapping, pos,
+					   pos + iov_iter_count(from)))
+			return -EAGAIN;
+	} else {
+		written = filemap_write_and_wait_range(mapping, pos,
+							pos + write_len - 1);
+		if (written)
+			goto out;
+	}
 
 	/*
 	 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.10.2

^ permalink raw reply related

* [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

A failure to lock i_rwsem would mean there is I/O being performed
by another thread. So, let's bail.

Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 mm/filemap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1694623..e08f3b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2982,7 +2982,12 @@ ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
-	inode_lock(inode);
+	if (!inode_trylock(inode)) {
+		/* Don't sleep on inode rwsem */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock(inode);
+	}
 	ret = generic_write_checks(iocb, from);
 	if (ret > 0)
 		ret = __generic_file_write_iter(iocb, from);
-- 
2.10.2

^ permalink raw reply related

* [PATCH 1/8] nowait aio: Introduce IOCB_RW_FLAG_NOWAIT
From: Goldwyn Rodrigues @ 2017-04-03 18:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-1-rgoldwyn@suse.de>

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This flag informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

Unfortunately, aio_flags is not checked for validity, which would
break existing applications which have it set to anything besides zero
or IOCB_FLAG_RESFD. So, we are using aio_reserved1 and renaming it
to aio_rw_flags.

IOCB_RW_FLAG_NOWAIT is translated to IOCB_NOWAIT for
iocb->ki_flags.

Added FS_NOWAIT to make sure VFS knows that the filesystem is capable
of performing direct-AIO with IOCB_RW_FLAG_NOWAIT.
---
 fs/aio.c                     | 15 ++++++++++++++-
 include/linux/fs.h           |  2 ++
 include/uapi/linux/aio_abi.h |  9 ++++++++-
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925..25ae59b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,11 +1541,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+	if (unlikely(iocb->aio_reserved2)) {
 		pr_debug("EINVAL: reserve field set\n");
 		return -EINVAL;
 	}
 
+	if (unlikely(iocb->aio_rw_flags & ~IOCB_RW_FLAG_NOWAIT)) {
+		pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
+		return -EINVAL;
+	}
+
 	/* prevent overflows */
 	if (unlikely(
 	    (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
@@ -1586,6 +1591,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		req->common.ki_flags |= IOCB_EVENTFD;
 	}
 
+	if (iocb->aio_rw_flags & IOCB_RW_FLAG_NOWAIT) {
+		if (!(file->f_inode->i_sb->s_type->fs_flags & FS_NOWAIT)) {
+			ret = -EOPNOTSUPP;
+			goto out_put_req;
+		}
+		req->common.ki_flags |= IOCB_NOWAIT;
+	}
+
 	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
 	if (unlikely(ret)) {
 		pr_debug("EFAULT: aio_key\n");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7b..802cfe2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,6 +270,7 @@ struct writeback_control;
 #define IOCB_DSYNC		(1 << 4)
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
+#define IOCB_NOWAIT		(1 << 7)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -2020,6 +2021,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
+#define FS_NOWAIT		65536	/* FS supports nowait direct AIO */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
 	void (*kill_sb) (struct super_block *);
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..6d98cbe 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -54,6 +54,13 @@ enum {
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
 
+/*
+ * Flags for aio_rw_flags member of "struct iocb".
+ * IOCB_RW_FLAG_NOWAIT - Set if the user wants the iocb to fail if it
+ *			would block for operations such as disk allocation.
+ */
+#define IOCB_RW_FLAG_NOWAIT	(1 << 1)
+
 /* read() from /dev/aio returns these structures. */
 struct io_event {
 	__u64		data;		/* the data field from the iocb */
@@ -79,7 +86,7 @@ struct io_event {
 struct iocb {
 	/* these are internal to the kernel/libc. */
 	__u64	aio_data;	/* data to be returned in event's data */
-	__u32	PADDED(aio_key, aio_reserved1);
+	__u32	PADDED(aio_key, aio_rw_flags);
 				/* the kernel sets aio_key to the req # */
 
 	/* common fields */
-- 
2.10.2

^ permalink raw reply related

* [PATCH 0/8 v4] No wait AIO
From: Goldwyn Rodrigues @ 2017-04-03 18:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
	avi, axboe, linux-api, willy, tom.leiming

Formerly known as non-blocking AIO.

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reason:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, BIO_NOWAIT for bio and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity
in the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs while I intend to add more filesystems.
Same with QUEUE_FLAG_NOWAIT, which is currently set for sd and virtio devices.
This is primarily to block md/dm devices which may wait in places such as
recovery/sync/suspend. In the future, I intend to add support to
these devices as well. Applications will have to check supportability
by sending a async direct write and any other error besides -EAGAIN
would mean it is not supported.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved to closer to (just before) calling filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
	- included reflink 
	- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
	- Translate the flag through IOMAP_NOWAIT (iomap) to check for
	  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
    for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
    This is added (rather not set) to block devices such as dm/md currently.

-- 
Goldwyn

^ permalink raw reply

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Jens Axboe @ 2017-04-03 14:34 UTC (permalink / raw)
  To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-8-git-send-email-dmonakhov@openvz.org>

On 04/03/2017 01:23 AM, Dmitry Monakhov wrote:
> @@ -66,12 +67,15 @@ struct bvec_iter {
>  	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
>  })
>  
> -static inline void bvec_iter_advance(const struct bio_vec *bv,
> +static inline int bvec_iter_advance(const struct bio_vec *bv,
>  				     struct bvec_iter *iter,
>  				     unsigned bytes)
>  {
> -	WARN_ONCE(bytes > iter->bi_size,
> -		  "Attempted to advance past end of bvec iter\n");
> +	if(unlikely(bytes > iter->bi_size)) {
> +		WARN(1, "Attempted to advance past end of bvec iter\n");
> +		iter->bi_size = 0;
> +		return -EINVAL;
> +	}

	if (WARN_ONCE(bytes > iter->bi_size,
			"Attempted to advance past end of bvec iter\n")) {
		...

would be cleaner.

-- 
Jens Axboe

^ permalink raw reply

* [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-03 12:05 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, NeilBrown, Jinpu Wang


blk_attempt_plug_merge() try to merge bio into request and chain them
by 'bi_next', while after the bio is done inside request, we forgot to
reset the 'bi_next'.

This lead into BUG while removing all the underlying devices from md-raid1,
the bio once go through:

  md_do_sync()
    sync_request()
      generic_make_request()
        blk_queue_bio()
          blk_attempt_plug_merge()
            CHAINED HERE

will keep chained and reused by:

  raid1d()
    sync_request_write()
      generic_make_request()
        BUG_ON(bio->bi_next)

After reset the 'bi_next' this can no longer happen.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 block/blk-core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43b7d06..91223b2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
                struct bio *bio = req->bio;
                unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);

-               if (bio_bytes == bio->bi_iter.bi_size)
+               if (bio_bytes == bio->bi_iter.bi_size) {
                        req->bio = bio->bi_next;
+                       bio->bi_next = NULL;
+               }

                req_bio_endio(req, bio, bio_bytes, error);

-- 
2.5.0

^ permalink raw reply related

* [PATCH 2/2] block: Improve error handling verbosity
From: Dmitry Monakhov @ 2017-04-03 12:03 UTC (permalink / raw)
  To: linux-block; +Cc: linux-scsi, Dmitry Monakhov
In-Reply-To: <1491221029-1609-1-git-send-email-dmonakhov@openvz.org>

EILSEQ is returned due to internal csum error on disk/fabric,
let's add special message to distinguish it from others. Also dump
original numerical error code.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/blk-core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 071a998..8eab846 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2576,13 +2576,16 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 		case -ENODATA:
 			error_type = "critical medium";
 			break;
+		case -EILSEQ:
+			error_type = "bad data";
+			break;
 		case -EIO:
 		default:
 			error_type = "I/O";
 			break;
 		}
-		printk_ratelimited(KERN_ERR "%s: %s error, dev %s, sector %llu\n",
-				   __func__, error_type, req->rq_disk ?
+		printk_ratelimited(KERN_ERR "%s: %s error (%d), dev %s, sector %llu\n",
+				   __func__, error_type, error, req->rq_disk ?
 				   req->rq_disk->disk_name : "?",
 				   (unsigned long long)blk_rq_pos(req));
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH 1/2] scsi: convert unrecovered read error to -EILSEQ
From: Dmitry Monakhov @ 2017-04-03 12:03 UTC (permalink / raw)
  To: linux-block; +Cc: linux-scsi, Dmitry Monakhov

It is quite easily to get URE after power failure and get scary message.
URE is happens due to internal drive crc mismatch due to partial sector
update. Most people interpret such message as "My drive is dying", which
isreasonable assumption if your dmesg is full of complain from disks and
read(2) return EIO. In fact this error is not fatal. One can fix it easily
by rewriting affected sector.

So we have to handle URE like follows:
- Return EILSEQ to signall caller that this is bad data related problem
- Do not retry command, because this is useless.



### Test case
#Test uses two HDD: disks sdb sdc
#Write_phase
# let fio work ~100sec and then cut the power
fio --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=16 \
--time_based=1 --runtime=600 --filesize=1G --size=1T \
--name /dev/sdb --name /dev/sdc

# Check_phase after system goes back
fio --ioengine=libaio --direct=1 --group_reporting --rw=read --bs=1M \
--iodepth=16 --size=1G --filesize=1G
--name=/dev/sdb --name /dev/sdc

More info about URE probability here:
https://plus.google.com/101761226576930717211/posts/Pctq7kk1dLL

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 drivers/scsi/scsi_lib.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 19125d7..59d64ad 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -961,6 +961,19 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			/* See SSC3rXX or current. */
 			action = ACTION_FAIL;
 			break;
+		case MEDIUM_ERROR:
+			if (sshdr.asc == 0x11) {
+				/* Handle unrecovered read error */
+				switch (sshdr.ascq) {
+				case 0x00: /* URE */
+				case 0x04: /* URE auto reallocate failed */
+				case 0x0B: /* URE recommend reassignment*/
+				case 0x0C: /* URE recommend rewrite the data */
+					action = ACTION_FAIL;
+					error = -EILSEQ;
+					break;
+				}
+			}
 		default:
 			action = ACTION_FAIL;
 			break;
-- 
2.9.3

^ permalink raw reply related

* [PATCH 7/7] Guard bvec iteration logic v2
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

Currently if some one try to advance bvec beyond it's size we simply
dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
This simply means that we endup dereferencing/corrupting random memory
region.

Sane reaction would be to propagate error back to calling context
But bvec_iter_advance's calling context is not always good for error
handling. For safity reason let truncate iterator size to zero which
will break external iteration loop which prevent us from unpredictable
memory range corruption. And even it caller ignores an error, it will
corrupt it's own bvecs, not others.

This patch does:
- Return error back to caller with hope that it will react on this
- Truncate iterator size

Code was added long time ago here 4550dd6c, luckily no one hit it
in real life :)

changes since V1:
 - Replace  BUG_ON with error logic.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 drivers/nvdimm/blk.c |  4 +++-
 drivers/nvdimm/btt.c |  4 +++-
 include/linux/bio.h  |  8 ++++++--
 include/linux/bvec.h | 11 ++++++++---
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 1edb3f3..04c3075 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
 
 		len -= cur_len;
 		dev_offset += cur_len;
-		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+		err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+		if (err)
+			return err;
 	}
 
 	return err;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 03ded8d..3f3aa7b 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
 
 		len -= cur_len;
 		meta_nsoff += cur_len;
-		bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+		ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+		if (ret)
+			return ret;
 	}
 
 	return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0c1c95c..8bf1564 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 
 	if (bio_no_advance_iter(bio))
 		iter->bi_size -= bytes;
-	else
-		bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+	else {
+		int err;
+		err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+		if (unlikely(err))
+			bio->bi_error = err;
+	}
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)			\
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..c117f1a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,6 +22,7 @@
 
 #include <linux/kernel.h>
 #include <linux/bug.h>
+#include <linux/errno.h>
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -66,12 +67,15 @@ struct bvec_iter {
 	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
 })
 
-static inline void bvec_iter_advance(const struct bio_vec *bv,
+static inline int bvec_iter_advance(const struct bio_vec *bv,
 				     struct bvec_iter *iter,
 				     unsigned bytes)
 {
-	WARN_ONCE(bytes > iter->bi_size,
-		  "Attempted to advance past end of bvec iter\n");
+	if(unlikely(bytes > iter->bi_size)) {
+		WARN(1, "Attempted to advance past end of bvec iter\n");
+		iter->bi_size = 0;
+		return -EINVAL;
+	}
 
 	while (bytes) {
 		unsigned iter_len = bvec_iter_len(bv, *iter);
@@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
 			iter->bi_idx++;
 		}
 	}
+	return 0;
 }
 
 #define for_each_bvec(bvl, bio_vec, iter, start)			\
-- 
2.9.3

^ permalink raw reply related

* [PATCH 6/7] T10: Move opencoded contants to common header
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/t10-pi.c                   | 9 +++------
 drivers/scsi/lpfc/lpfc_scsi.c    | 5 +++--
 drivers/scsi/qla2xxx/qla_isr.c   | 8 ++++----
 drivers/target/target_core_sbc.c | 2 +-
 include/linux/t10-pi.h           | 3 +++
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2c97912..485cecd 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
 
 typedef __be16 (csum_fn) (void *, unsigned int);
 
-static const __be16 APP_ESCAPE = (__force __be16) 0xffff;
-static const __be32 REF_ESCAPE = (__force __be32) 0xffffffff;
-
 static __be16 t10_pi_crc_fn(void *data, unsigned int len)
 {
 	return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, csum_fn *fn,
 		switch (type) {
 		case 1:
 		case 2:
-			if (pi->app_tag == APP_ESCAPE)
+			if (pi->app_tag == T10_APP_ESCAPE)
 				goto next;
 
 			if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, csum_fn *fn,
 			}
 			break;
 		case 3:
-			if (pi->app_tag == APP_ESCAPE &&
-			    pi->ref_tag == REF_ESCAPE)
+			if (pi->app_tag == T10_APP_ESCAPE &&
+			    pi->ref_tag == T10_REF_ESCAPE)
 				goto next;
 			break;
 		}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..6f6b40e 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -26,6 +26,7 @@
 #include <linux/export.h>
 #include <linux/delay.h>
 #include <asm/unaligned.h>
+#include <linux/t10-pi.h>
 #include <linux/crc-t10dif.h>
 #include <net/checksum.h>
 
@@ -2934,8 +2935,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct lpfc_scsi_buf *lpfc_cmd)
 				 * First check to see if a protection data
 				 * check is valid
 				 */
-				if ((src->ref_tag == 0xffffffff) ||
-				    (src->app_tag == 0xffff)) {
+				if ((src->ref_tag == T10_REF_ESCAPE) ||
+				    (src->app_tag == T10_APP_ESCAPE)) {
 					start_ref_tag++;
 					goto skipit;
 				}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..ed4b302 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx *sts24)
 	 * For type     3: ref & app tag is all 'f's
 	 * For type 0,1,2: app tag is all 'f's
 	 */
-	if ((a_app_tag == 0xffff) &&
+	if ((a_app_tag == T10_APP_ESCAPE) &&
 	    ((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
-	     (a_ref_tag == 0xffffffff))) {
+	     (a_ref_tag == T10_REF_ESCAPE))) {
 		uint32_t blocks_done, resid;
 		sector_t lba_s = scsi_get_lba(cmd);
 
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx *sts24)
 			spt = page_address(sg_page(sg)) + sg->offset;
 			spt += j;
 
-			spt->app_tag = 0xffff;
+			spt->app_tag = T10_APP_ESCAPE;
 			if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
-				spt->ref_tag = 0xffffffff;
+				spt->ref_tag = T10_REF_ESCAPE;
 		}
 
 		return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index c194063..927ef44 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1446,7 +1446,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, unsigned int sectors,
 				 (unsigned long long)sector, sdt->guard_tag,
 				 sdt->app_tag, be32_to_cpu(sdt->ref_tag));
 
-			if (sdt->app_tag == cpu_to_be16(0xffff)) {
+			if (sdt->app_tag == T10_APP_ESCAPE) {
 				dsg_off += block_size;
 				goto next;
 			}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9fba9dd..c96845c 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -24,6 +24,9 @@ enum t10_dif_type {
 	T10_PI_TYPE3_PROTECTION = 0x3,
 };
 
+static const __be16 T10_APP_ESCAPE = (__force __be16) 0xffff;
+static const __be32 T10_REF_ESCAPE = (__force __be32) 0xffffffff;
+
 /*
  * T10 Protection Information tuple.
  */
-- 
2.9.3

^ permalink raw reply related

* [PATCH 5/7] bio-integrity: add bio_integrity_setup helper
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly. For example retry in
case of ENOMEM.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/blk-core.c     |  5 +----
 block/blk-mq.c       |  8 ++------
 drivers/nvdimm/blk.c | 13 ++-----------
 drivers/nvdimm/btt.c | 13 ++-----------
 include/linux/bio.h  | 25 +++++++++++++++++++++++++
 5 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c22..071a998 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,11 +1637,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 
 	blk_queue_split(q, &bio, q->bio_split);
 
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio->bi_error = -EIO;
-		bio_endio(bio);
+	if (bio_integrity_setup(bio))
 		return BLK_QC_T_NONE;
-	}
 
 	if (op_is_flush(bio->bi_opf)) {
 		spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a49c6..a9931ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1489,10 +1489,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio_io_error(bio);
+	if (bio_integrity_setup(bio))
 		return BLK_QC_T_NONE;
-	}
 
 	blk_queue_split(q, &bio, q->bio_split);
 
@@ -1611,10 +1609,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio_io_error(bio);
+	if (bio_integrity_setup(bio))
 		return BLK_QC_T_NONE;
-	}
 
 	blk_queue_split(q, &bio, q->bio_split);
 
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9faaa96..1edb3f3 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -179,16 +179,8 @@ static blk_qc_t nd_blk_make_request(struct request_queue *q, struct bio *bio)
 	int err = 0, rw;
 	bool do_acct;
 
-	/*
-	 * bio_integrity_enabled also checks if the bio already has an
-	 * integrity payload attached. If it does, we *don't* do a
-	 * bio_integrity_prep here - the payload has been generated by
-	 * another kernel subsystem, and we just pass it through.
-	 */
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio->bi_error = -EIO;
-		goto out;
-	}
+	if (bio_integrity_setup(bio))
+		return BLK_QC_T_NONE;
 
 	bip = bio_integrity(bio);
 	nsblk = q->queuedata;
@@ -212,7 +204,6 @@ static blk_qc_t nd_blk_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
- out:
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
 }
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 368795a..03ded8d 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1158,16 +1158,8 @@ static blk_qc_t btt_make_request(struct request_queue *q, struct bio *bio)
 	int err = 0;
 	bool do_acct;
 
-	/*
-	 * bio_integrity_enabled also checks if the bio already has an
-	 * integrity payload attached. If it does, we *don't* do a
-	 * bio_integrity_prep here - the payload has been generated by
-	 * another kernel subsystem, and we just pass it through.
-	 */
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio->bi_error = -EIO;
-		goto out;
-	}
+	if (bio_integrity_setup(bio))
+		return BLK_QC_T_NONE;
 
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
@@ -1194,7 +1186,6 @@ static blk_qc_t btt_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-out:
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 350f71d..0c1c95c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -738,6 +738,26 @@ extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
 extern void bio_integrity_init(void);
 
+static inline int bio_integrity_setup(struct bio *bio)
+{
+	int err = 0;
+
+	/*
+	 * bio_integrity_enabled also checks if the bio already has an
+	 * integrity payload attached. If it does, we *don't* do a
+	 * bio_integrity_prep here - the payload has been generated by
+	 * another kernel subsystem, and we just pass it through.
+	 */
+	if (bio_integrity_enabled(bio)) {
+		err = bio_integrity_prep(bio);
+		if (err) {
+			bio->bi_error = err;
+			bio_endio(bio);
+		}
+	}
+	return err;
+}
+
 #else /* CONFIG_BLK_DEV_INTEGRITY */
 
 static inline void *bio_integrity(struct bio *bio)
@@ -765,6 +785,11 @@ static inline int bio_integrity_prep(struct bio *bio)
 	return 0;
 }
 
+static inline int bio_integrity_setup(struct bio *bio)
+{
+	return 0;
+}
+
 static inline void bio_integrity_free(struct bio *bio)
 {
 	return;
-- 
2.9.3

^ permalink raw reply related

* [PATCH 4/7] bio-integrity: fix interface for bio_integrity_trim
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

bio_integrity_trim inherent it's interface from bio_trim and accept
offset and size, but this API is error prone because data offset
must always be insync with bio's data offset. That is why we have
integrity update hook in bio_advance()

So only meaningful offset is 0. Let's just remove it completely.

TODO: add xfstests testcase here
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/bio-integrity.c | 8 +-------
 block/bio.c           | 4 ++--
 drivers/md/dm.c       | 2 +-
 include/linux/bio.h   | 5 ++---
 4 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 43a4476..43895a0 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -433,21 +433,15 @@ EXPORT_SYMBOL(bio_integrity_advance);
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:	bio whose integrity vector to update
- * @offset:	offset to first data sector
  * @sectors:	number of data sectors
  *
  * Description: Used to trim the integrity vector in a cloned bio.
- * The ivec will be advanced corresponding to 'offset' data sectors
- * and the length will be truncated corresponding to 'len' data
- * sectors.
  */
-void bio_integrity_trim(struct bio *bio, unsigned int offset,
-			unsigned int sectors)
+void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
 	struct bio_integrity_payload *bip = bio_integrity(bio);
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 
-	bio_integrity_advance(bio, offset << 9);
 	bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
 }
 EXPORT_SYMBOL(bio_integrity_trim);
diff --git a/block/bio.c b/block/bio.c
index fa84323..6895986 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1878,7 +1878,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	split->bi_iter.bi_size = sectors << 9;
 
 	if (bio_integrity(split))
-		bio_integrity_trim(split, 0, sectors);
+		bio_integrity_trim(split, sectors);
 
 	bio_advance(bio, split->bi_iter.bi_size);
 
@@ -1909,7 +1909,7 @@ void bio_trim(struct bio *bio, int offset, int size)
 	bio->bi_iter.bi_size = size;
 
 	if (bio_integrity(bio))
-		bio_integrity_trim(bio, 0, size);
+		bio_integrity_trim(bio, size);
 
 }
 EXPORT_SYMBOL_GPL(bio_trim);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb7597..e54ecdd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1102,7 +1102,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
 	clone->bi_iter.bi_size = to_bytes(len);
 
 	if (bio_integrity(bio))
-		bio_integrity_trim(clone, 0, len);
+		bio_integrity_trim(clone, len);
 
 	return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 00b086a..350f71d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -732,7 +732,7 @@ extern bool bio_integrity_enabled(struct bio *bio);
 extern int bio_integrity_prep(struct bio *);
 extern void bio_integrity_endio(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
-extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_trim(struct bio *, unsigned int);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -782,8 +782,7 @@ static inline void bio_integrity_advance(struct bio *bio,
 	return;
 }
 
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
-				      unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio, unsigned int sectors)
 {
 	return;
 }
-- 
2.9.3

^ permalink raw reply related

* [PATCH 3/7] bio-integrity: bio_trim should truncate integrity vector accordingly
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/bio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index e75878f..fa84323 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1907,6 +1907,10 @@ void bio_trim(struct bio *bio, int offset, int size)
 	bio_advance(bio, offset << 9);
 
 	bio->bi_iter.bi_size = size;
+
+	if (bio_integrity(bio))
+		bio_integrity_trim(bio, 0, size);
+
 }
 EXPORT_SYMBOL_GPL(bio_trim);
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH 2/7] bio-integrity: save original iterator for verify stage
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

In order to perform verification we need to know original data vector
But, after bio traverse io-stack it may be advanced, splited and relocated
many times so it is hard to guess original data vector.

In fact currently ->verify_fn not woks at all because at the moment
it it called bio->bi_iter.bi_size == 0

The simplest way to fix that is to save original data vector and treat is
as immutable.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/bio-integrity.c | 6 ++++--
 include/linux/bio.h   | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..43a4476 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -238,10 +238,10 @@ static int bio_integrity_process(struct bio *bio,
 
 	iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	iter.interval = 1 << bi->interval_exp;
-	iter.seed = bip_get_seed(bip);
+	iter.seed = bip->bip_verify_iter.bi_sector;
 	iter.prot_buf = prot_buf;
 
-	bio_for_each_segment(bv, bio, bviter) {
+	__bio_for_each_segment(bv, bio, bviter, bip->bip_verify_iter) {
 		void *kaddr = kmap_atomic(bv.bv_page);
 
 		iter.data_buf = kaddr + bv.bv_offset;
@@ -310,6 +310,7 @@ int bio_integrity_prep(struct bio *bio)
 	bip->bip_flags |= BIP_BLOCK_INTEGRITY;
 	bip->bip_iter.bi_size = len;
 	bip_set_seed(bip, bio->bi_iter.bi_sector);
+	bip->bip_verify_iter = bio->bi_iter;
 
 	if (bi->flags & BLK_INTEGRITY_IP_CHECKSUM)
 		bip->bip_flags |= BIP_IP_CHECKSUM;
@@ -476,6 +477,7 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 
 	bip->bip_vcnt = bip_src->bip_vcnt;
 	bip->bip_iter = bip_src->bip_iter;
+	bip->bip_verify_iter = bip_src->bip_verify_iter;
 
 	return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..00b086a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -308,6 +308,7 @@ struct bio_integrity_payload {
 	struct bio		*bip_bio;	/* parent bio */
 
 	struct bvec_iter	bip_iter;
+	struct bvec_iter	bip_verify_iter;/* saved orig data iterator */
 
 	bio_end_io_t		*bip_end_io;	/* saved I/O completion fn */
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH 1/7] bio-integrity: Do not allocate integrity context for bio w/o data
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

If bio has no data, such as ones from blkdev_issue_flush(),
then we have nothing to protect.

This patch prevent bugon like follows:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode: 0000 [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: G        W       4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: ffff880137786440 task.stack: ffffc90000ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
FS:  00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/bio-integrity.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
 	if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
 		return false;
 
+	if (!bio_sectors(bio))
+		return false;
+
 	/* Already protected? */
 	if (bio_integrity(bio))
 		return false;
-- 
2.9.3

^ permalink raw reply related

* [PATCH 0/7] block: T10/DIF Fixes and cleanups v2
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC (permalink / raw)
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov

This patch set fix various problems spotted during T10/DIF integrity machinery testing.

TOC:
## Fix various bugs in T10/DIF/DIX infrastructure
0001-bio-integrity-Do-not-allocate-integrity-context-for-fsync
0002-bio-integrity-save-original-iterator-for-verify-stage
0003-bio-integrity-bio_trim-should-truncate-integrity-vec
0004-bio-integrity-fix-interface-for-bio_integrity_trim
## Cleanup T10/DIF/DIX infrastructure
0005-bio-integrity-add-bio_integrity_setup-helper
0006-T10-Move-opencoded-contants-to-common-header
## General bulletproof protection for block layer
0007-Guard-bvec-iteration-logic-v2

Changes since V1
 - fix issues potted by kbuild bot
 - Replace BUG_ON with error logic for 7'th patch

Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox