* [PATCH 0/9] Nowait feature for stacked block devices
@ 2017-07-26 23:57 Goldwyn Rodrigues
2017-07-26 23:57 ` [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait Goldwyn Rodrigues
` (8 more replies)
0 siblings, 9 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:57 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel
This is a continuation of the nowait support incorporated
a while back. We introduced REQ_NOWAIT, which makes a call return
immediately if it would block at the block layer. Request-based
devices do not wait. However, bio-based devices (the ones which
exclusively implement a make_request_fn) need to be taught to handle
REQ_NOWAIT.
This effort covers the devices under MD and DM which could block
for any reason. If there are more devices or situations
which need to be covered, please let me know.
--
Goldwyn
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
@ 2017-07-26 23:57 ` Goldwyn Rodrigues
2017-08-08 20:32 ` Shaohua Li
2017-07-26 23:57 ` [PATCH 2/9] md: Add nowait support to md Goldwyn Rodrigues
` (7 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:57 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Nowait is a feature of direct AIO, where users can request that a
call return immediately if the I/O is going to block. This translates
to REQ_NOWAIT in the bio->bi_opf flags. While request-based devices
don't wait, stacked devices such as md/dm will.
In order to explicitly mark a stacked device as supported, we
set QUEUE_FLAG_NOWAIT in its queue_flags and return -EAGAIN
whenever the device would block.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
block/blk-core.c | 3 ++-
include/linux/blkdev.h | 2 ++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 970b9c9638c5..1c9a981d88e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2025,7 +2025,8 @@ generic_make_request_checks(struct bio *bio)
* if queue is not a request based queue.
*/
- if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
+ if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q) &&
+ !blk_queue_supports_nowait(q))
goto not_supported;
part = bio->bi_bdev->bd_part;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..fae021ebec1b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -633,6 +633,7 @@ struct request_queue {
#define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
#define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
+#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
@@ -732,6 +733,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q) \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_supports_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 2/9] md: Add nowait support to md
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
2017-07-26 23:57 ` [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait Goldwyn Rodrigues
@ 2017-07-26 23:57 ` Goldwyn Rodrigues
2017-08-08 20:34 ` Shaohua Li
2017-07-26 23:58 ` [PATCH 3/9] md: raid1 nowait support Goldwyn Rodrigues
` (6 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:57 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Set QUEUE_FLAG_NOWAIT in the queue flags to indicate that REQ_NOWAIT
will be handled.
If an I/O on the md would be delayed, bail by calling
bio_wouldblock_error(). The conditions under which this can happen are:
+ MD is suspended
+ There is a change pending on the superblock, and the current I/O
would block until that completes.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/md.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8cdca0296749..d96c27d16841 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -285,6 +285,13 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
bio_endio(bio);
return BLK_QC_T_NONE;
}
+
+ if (mddev->suspended && (bio->bi_opf & REQ_NOWAIT)) {
+ bio_wouldblock_error(bio);
+ rcu_read_unlock();
+ return BLK_QC_T_NONE;
+ }
+
check_suspended:
rcu_read_lock();
if (mddev->suspended) {
@@ -5274,6 +5281,10 @@ static int md_alloc(dev_t dev, char *name)
mddev->queue = NULL;
goto abort;
}
+
+ /* Set the NOWAIT flag to show support */
+ queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, mddev->queue);
+
disk->major = MAJOR(mddev->unit);
disk->first_minor = unit << shift;
if (name)
@@ -8010,8 +8021,20 @@ bool md_write_start(struct mddev *mddev, struct bio *bi)
rcu_read_unlock();
if (did_change)
sysfs_notify_dirent_safe(mddev->sysfs_state);
- wait_event(mddev->sb_wait,
- !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
+
+ /* Don't wait for sb writes if marked with REQ_NOWAIT */
+ if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
+ mddev->suspended) {
+ if (bi->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bi);
+ percpu_ref_put(&mddev->writes_pending);
+ return false;
+ }
+
+ wait_event(mddev->sb_wait,
+ !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
+ }
+
if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
percpu_ref_put(&mddev->writes_pending);
return false;
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 3/9] md: raid1 nowait support
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
2017-07-26 23:57 ` [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait Goldwyn Rodrigues
2017-07-26 23:57 ` [PATCH 2/9] md: Add nowait support to md Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
2017-08-08 20:39 ` Shaohua Li
2017-07-26 23:58 ` [PATCH 4/9] md: raid5 " Goldwyn Rodrigues
` (5 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
The RAID1 driver will bail with EAGAIN in case:
+ the I/O has to wait for a barrier
+ the array is frozen
+ the area is suspended
+ there are too many pending I/Os, so the request would be queued
To propagate the error from barrier waits, wait_barrier() now
returns bool: true if the wait completed (or no wait was
required), false if a wait was required but not performed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/raid1.c | 74 +++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 57 insertions(+), 17 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3febfc8391fb..66ca4288e3e8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -903,8 +903,9 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
wake_up(&conf->wait_barrier);
}
-static void _wait_barrier(struct r1conf *conf, int idx)
+static bool _wait_barrier(struct r1conf *conf, int idx, bool nowait)
{
+ bool ret = true;
/*
* We need to increase conf->nr_pending[idx] very early here,
* then raise_barrier() can be blocked when it waits for
@@ -935,7 +936,7 @@ static void _wait_barrier(struct r1conf *conf, int idx)
*/
if (!READ_ONCE(conf->array_frozen) &&
!atomic_read(&conf->barrier[idx]))
- return;
+ return ret;
/*
* After holding conf->resync_lock, conf->nr_pending[idx]
@@ -953,18 +954,26 @@ static void _wait_barrier(struct r1conf *conf, int idx)
*/
wake_up(&conf->wait_barrier);
/* Wait for the barrier in same barrier unit bucket to drop. */
- wait_event_lock_irq(conf->wait_barrier,
- !conf->array_frozen &&
- !atomic_read(&conf->barrier[idx]),
- conf->resync_lock);
+ if (conf->array_frozen || atomic_read(&conf->barrier[idx])) {
+ if (nowait)
+ ret = false;
+ else
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->array_frozen &&
+ !atomic_read(&conf->barrier[idx]),
+ conf->resync_lock);
+ }
atomic_inc(&conf->nr_pending[idx]);
atomic_dec(&conf->nr_waiting[idx]);
spin_unlock_irq(&conf->resync_lock);
+ return ret;
}
-static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
+static bool wait_read_barrier(struct r1conf *conf, sector_t sector_nr,
+ bool nowait)
{
int idx = sector_to_idx(sector_nr);
+ bool ret = true;
/*
* Very similar to _wait_barrier(). The difference is, for read
@@ -976,7 +985,7 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
atomic_inc(&conf->nr_pending[idx]);
if (!READ_ONCE(conf->array_frozen))
- return;
+ return ret;
spin_lock_irq(&conf->resync_lock);
atomic_inc(&conf->nr_waiting[idx]);
@@ -987,19 +996,28 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
*/
wake_up(&conf->wait_barrier);
/* Wait for array to be unfrozen */
- wait_event_lock_irq(conf->wait_barrier,
- !conf->array_frozen,
- conf->resync_lock);
+ if (conf->array_frozen) {
+ /* If nowait flag is set, return false to
+ * show we did not wait
+ */
+ if (nowait)
+ ret = false;
+ else
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->array_frozen,
+ conf->resync_lock);
+ }
atomic_inc(&conf->nr_pending[idx]);
atomic_dec(&conf->nr_waiting[idx]);
spin_unlock_irq(&conf->resync_lock);
+ return ret;
}
-static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+static bool wait_barrier(struct r1conf *conf, sector_t sector_nr, bool nowait)
{
int idx = sector_to_idx(sector_nr);
- _wait_barrier(conf, idx);
+ return _wait_barrier(conf, idx, nowait);
}
static void wait_all_barriers(struct r1conf *conf)
@@ -1007,7 +1025,7 @@ static void wait_all_barriers(struct r1conf *conf)
int idx;
for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
- _wait_barrier(conf, idx);
+ _wait_barrier(conf, idx, false);
}
static void _allow_barrier(struct r1conf *conf, int idx)
@@ -1223,7 +1241,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
* Still need barrier for READ in case that whole
* array is frozen.
*/
- wait_read_barrier(conf, bio->bi_iter.bi_sector);
+ if (!wait_read_barrier(conf, bio->bi_iter.bi_sector,
+ bio->bi_opf & REQ_NOWAIT)) {
+ bio_wouldblock_error(bio);
+ return;
+ }
if (!r1_bio)
r1_bio = alloc_r1bio(mddev, bio);
@@ -1333,6 +1355,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
* an interruptible wait.
*/
DEFINE_WAIT(w);
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
+
for (;;) {
sigset_t full, old;
prepare_to_wait(&conf->wait_barrier,
@@ -1351,7 +1378,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
}
finish_wait(&conf->wait_barrier, &w);
}
- wait_barrier(conf, bio->bi_iter.bi_sector);
+ if (!wait_barrier(conf, bio->bi_iter.bi_sector,
+ bio->bi_opf & REQ_NOWAIT)) {
+ bio_wouldblock_error(bio);
+ return;
+ }
r1_bio = alloc_r1bio(mddev, bio);
r1_bio->sectors = max_write_sectors;
@@ -1359,6 +1390,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
if (conf->pending_count >= max_queued_requests) {
md_wakeup_thread(mddev->thread);
raid1_log(mddev, "wait queued");
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
wait_event(conf->wait_barrier,
conf->pending_count < max_queued_requests);
}
@@ -1442,6 +1477,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
/* Wait for this device to become unblocked */
int j;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
+
for (j = 0; j < i; j++)
if (r1_bio->bios[j])
rdev_dec_pending(conf->mirrors[j].rdev, mddev);
@@ -1449,7 +1489,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
allow_barrier(conf, bio->bi_iter.bi_sector);
raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
md_wait_for_blocked_rdev(blocked_rdev, mddev);
- wait_barrier(conf, bio->bi_iter.bi_sector);
+ wait_barrier(conf, bio->bi_iter.bi_sector, false);
goto retry_write;
}
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 4/9] md: raid5 nowait support
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
` (2 preceding siblings ...)
2017-07-26 23:58 ` [PATCH 3/9] md: raid1 nowait support Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
2017-08-08 20:43 ` Shaohua Li
2017-07-26 23:58 ` [PATCH 5/9] md: raid10 " Goldwyn Rodrigues
` (4 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Return EAGAIN if RAID5 would block while waiting for:
+ reshaping
+ suspension
+ stripe expansion
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/raid5.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index aeeb8d6854e2..d1b3bcf26d29 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5635,6 +5635,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
? logical_sector < conf->reshape_safe
: logical_sector >= conf->reshape_safe) {
spin_unlock_irq(&conf->device_lock);
+ if (bi->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bi);
+ finish_wait(&conf->wait_for_overlap, &w);
+ return true;
+ }
schedule();
do_prepare = true;
goto retry;
@@ -5672,6 +5677,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
spin_unlock_irq(&conf->device_lock);
if (must_retry) {
raid5_release_stripe(sh);
+ if (bi->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bi);
+ finish_wait(&conf->wait_for_overlap, &w);
+ return true;
+ }
schedule();
do_prepare = true;
goto retry;
@@ -5700,6 +5710,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
sigset_t full, old;
sigfillset(&full);
sigprocmask(SIG_BLOCK, &full, &old);
+ if (bi->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bi);
+ finish_wait(&conf->wait_for_overlap, &w);
+ return true;
+ }
schedule();
sigprocmask(SIG_SETMASK, &old, NULL);
do_prepare = true;
@@ -5715,6 +5730,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
*/
md_wakeup_thread(mddev->thread);
raid5_release_stripe(sh);
+ if (bi->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bi);
+ finish_wait(&conf->wait_for_overlap, &w);
+ return true;
+ }
schedule();
do_prepare = true;
goto retry;
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 5/9] md: raid10 nowait support
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
` (3 preceding siblings ...)
2017-07-26 23:58 ` [PATCH 4/9] md: raid5 " Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
2017-08-08 20:40 ` Shaohua Li
2017-07-26 23:58 ` [PATCH 6/9] dm: add " Goldwyn Rodrigues
` (3 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Bail with EAGAIN if raid10 is going to wait for:
+ barriers
+ a reshape operation
+ the queue to drain (too many queued requests)
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/raid10.c | 62 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 47 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5026e7ad51d3..6d80438c5040 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -978,8 +978,9 @@ static void lower_barrier(struct r10conf *conf)
wake_up(&conf->wait_barrier);
}
-static void wait_barrier(struct r10conf *conf)
+static bool wait_barrier(struct r10conf *conf, bool nowait)
{
+ bool ret = true;
spin_lock_irq(&conf->resync_lock);
if (conf->barrier) {
conf->nr_waiting++;
@@ -993,19 +994,23 @@ static void wait_barrier(struct r10conf *conf)
* count down.
*/
raid10_log(conf->mddev, "wait barrier");
- wait_event_lock_irq(conf->wait_barrier,
- !conf->barrier ||
- (atomic_read(&conf->nr_pending) &&
- current->bio_list &&
- (!bio_list_empty(&current->bio_list[0]) ||
- !bio_list_empty(&current->bio_list[1]))),
- conf->resync_lock);
+ if (!nowait)
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->barrier ||
+ (atomic_read(&conf->nr_pending) &&
+ current->bio_list &&
+ (!bio_list_empty(&current->bio_list[0]) ||
+ !bio_list_empty(&current->bio_list[1]))),
+ conf->resync_lock);
+ else
+ ret = false;
conf->nr_waiting--;
if (!conf->nr_waiting)
wake_up(&conf->wait_barrier);
}
atomic_inc(&conf->nr_pending);
spin_unlock_irq(&conf->resync_lock);
+ return ret;
}
static void allow_barrier(struct r10conf *conf)
@@ -1158,7 +1163,10 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
*/
- wait_barrier(conf);
+ if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
+ bio_wouldblock_error(bio);
+ return;
+ }
sectors = r10_bio->sectors;
while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
@@ -1169,12 +1177,16 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
* pass
*/
raid10_log(conf->mddev, "wait reshape");
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
allow_barrier(conf);
wait_event(conf->wait_barrier,
conf->reshape_progress <= bio->bi_iter.bi_sector ||
conf->reshape_progress >= bio->bi_iter.bi_sector +
sectors);
- wait_barrier(conf);
+ wait_barrier(conf, false);
}
rdev = read_balance(conf, r10_bio, &max_sectors);
@@ -1308,7 +1320,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
*/
- wait_barrier(conf);
+ if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
+ bio_wouldblock_error(bio);
+ return;
+ }
sectors = r10_bio->sectors;
while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
@@ -1319,12 +1334,16 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
* pass
*/
raid10_log(conf->mddev, "wait reshape");
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
allow_barrier(conf);
wait_event(conf->wait_barrier,
conf->reshape_progress <= bio->bi_iter.bi_sector ||
conf->reshape_progress >= bio->bi_iter.bi_sector +
sectors);
- wait_barrier(conf);
+ wait_barrier(conf, false);
}
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
@@ -1339,6 +1358,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
md_wakeup_thread(mddev->thread);
raid10_log(conf->mddev, "wait reshape metadata");
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
wait_event(mddev->sb_wait,
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
@@ -1348,6 +1371,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
if (conf->pending_count >= max_queued_requests) {
md_wakeup_thread(mddev->thread);
raid10_log(mddev, "wait queued");
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
wait_event(conf->wait_barrier,
conf->pending_count < max_queued_requests);
}
@@ -1454,6 +1481,11 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
int j;
int d;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return;
+ }
+
for (j = 0; j < i; j++) {
if (r10_bio->devs[j].bio) {
d = r10_bio->devs[j].devnum;
@@ -1474,7 +1506,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
allow_barrier(conf);
raid10_log(conf->mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
md_wait_for_blocked_rdev(blocked_rdev, mddev);
- wait_barrier(conf);
+ wait_barrier(conf, false);
goto retry_write;
}
@@ -1703,7 +1735,7 @@ static void print_conf(struct r10conf *conf)
static void close_sync(struct r10conf *conf)
{
- wait_barrier(conf);
+ wait_barrier(conf, false);
allow_barrier(conf);
mempool_destroy(conf->r10buf_pool);
@@ -4347,7 +4379,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
if (need_flush ||
time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
/* Need to update reshape_position in metadata */
- wait_barrier(conf);
+ wait_barrier(conf, false);
mddev->reshape_position = conf->reshape_progress;
if (mddev->reshape_backwards)
mddev->curr_resync_completed = raid10_size(mddev, 0, 0)
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 6/9] dm: add nowait support
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
` (4 preceding siblings ...)
2017-07-26 23:58 ` [PATCH 5/9] md: raid10 " Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 7/9] dm: Add nowait support to raid1 Goldwyn Rodrigues
` (2 subsequent siblings)
8 siblings, 0 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Add support for bio-based dm devices, which exclusively set a
make_request_fn(). Request-based devices are supported
by default.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/dm.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2edbcc2d7d3f..aa9c1a5f2966 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1550,6 +1550,8 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
if (!(bio->bi_opf & REQ_RAHEAD))
queue_io(md, bio);
+ else if (bio->bi_opf & REQ_NOWAIT)
+ bio_wouldblock_error(bio);
else
bio_io_error(bio);
return BLK_QC_T_NONE;
@@ -2066,6 +2068,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
case DM_TYPE_DAX_BIO_BASED:
dm_init_normal_md_queue(md);
blk_queue_make_request(md->queue, dm_make_request);
+ queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, md->queue);
/*
* DM handles splitting bios as needed. Free the bio_split bioset
* since it won't be used (saves 1 process per bio-based DM device).
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 7/9] dm: Add nowait support to raid1
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
` (5 preceding siblings ...)
2017-07-26 23:58 ` [PATCH 6/9] dm: add " Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 8/9] dm: Add nowait support to dm-delay Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 9/9] dm-mpath: Add nowait support Goldwyn Rodrigues
8 siblings, 0 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
If the I/O would block because the devices are syncing, bail.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/dm-raid1.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index a4fbd911d566..446ac581627f 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1219,6 +1219,11 @@ static int mirror_map(struct dm_target *ti, struct bio *bio)
if (bio->bi_opf & REQ_RAHEAD)
return DM_MAPIO_KILL;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return DM_MAPIO_SUBMITTED;
+ }
+
queue_bio(ms, bio, rw);
return DM_MAPIO_SUBMITTED;
}
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 8/9] dm: Add nowait support to dm-delay
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
` (6 preceding siblings ...)
2017-07-26 23:58 ` [PATCH 7/9] dm: Add nowait support to raid1 Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 9/9] dm-mpath: Add nowait support Goldwyn Rodrigues
8 siblings, 0 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
I/O should bail out if any value for delay is set.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/dm-delay.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
index ae3158795d26..97da97c3c039 100644
--- a/drivers/md/dm-delay.c
+++ b/drivers/md/dm-delay.c
@@ -240,6 +240,10 @@ static int delay_bio(struct delay_c *dc, int delay, struct bio *bio)
if (!delay || !atomic_read(&dc->may_delay))
return DM_MAPIO_REMAPPED;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return DM_MAPIO_SUBMITTED;
+ }
delayed = dm_per_bio_data(bio, sizeof(struct dm_delay_info));
delayed->context = dc;
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 9/9] dm-mpath: Add nowait support
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
` (7 preceding siblings ...)
2017-07-26 23:58 ` [PATCH 8/9] dm: Add nowait support to dm-delay Goldwyn Rodrigues
@ 2017-07-26 23:58 ` Goldwyn Rodrigues
8 siblings, 0 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-07-26 23:58 UTC (permalink / raw)
To: linux-block; +Cc: hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
If the I/O would have to be queued because no path is usable,
bail if REQ_NOWAIT is set instead of queueing it up.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/dm-mpath.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 0e8ab5bb3575..c6572a9967dc 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -543,6 +543,11 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
if ((pgpath && queue_io) ||
(!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
+ /* Bail if nowait is set */
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio_wouldblock_error(bio);
+ return DM_MAPIO_SUBMITTED;
+ }
/* Queue for the daemon to resubmit */
spin_lock_irqsave(&m->lock, flags);
bio_list_add(&m->queued_bios, bio);
--
2.12.3
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-07-26 23:57 ` [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait Goldwyn Rodrigues
@ 2017-08-08 20:32 ` Shaohua Li
2017-08-08 20:36 ` Jens Axboe
2017-08-09 11:44 ` Goldwyn Rodrigues
0 siblings, 2 replies; 37+ messages in thread
From: Shaohua Li @ 2017-08-08 20:32 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Nowait is a feature of direct AIO, where users can request
> to return immediately if the I/O is going to block. This translates
> to REQ_NOWAIT in bio.bi_opf flags. While request based devices
> don't wait, stacked devices such as md/dm will.
>
> In order to explicitly mark stacked devices as supported, we
> set the QUEUE_FLAG_NOWAIT in the queue_flags and return -EAGAIN
> whenever the device would block.
probably you should route this patch to Jens first, DM/MD are different trees.
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> block/blk-core.c | 3 ++-
> include/linux/blkdev.h | 2 ++
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 970b9c9638c5..1c9a981d88e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2025,7 +2025,8 @@ generic_make_request_checks(struct bio *bio)
> * if queue is not a request based queue.
> */
>
> - if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
> + if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q) &&
> + !blk_queue_supports_nowait(q))
> goto not_supported;
>
> part = bio->bi_bdev->bd_part;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 25f6a0cb27d3..fae021ebec1b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -633,6 +633,7 @@ struct request_queue {
> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>
> #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
> (1 << QUEUE_FLAG_STACKABLE) | \
> @@ -732,6 +733,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
> #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
> #define blk_queue_scsi_passthrough(q) \
> test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
> +#define blk_queue_supports_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
Should this bit consider under layer disks? For example, one raid array disk
doesn't support NOWAIT, shouldn't we disable NOWAIT for the array?
I have another generic question. If a bio is splitted into 2 bios, one bio
doesn't need to wait but the other need to wait. We will return -EAGAIN for the
second bio, so the whole bio will return -EAGAIN, but the first bio is already
dispatched to disk. Is this correct behavior?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 2/9] md: Add nowait support to md
2017-07-26 23:57 ` [PATCH 2/9] md: Add nowait support to md Goldwyn Rodrigues
@ 2017-08-08 20:34 ` Shaohua Li
0 siblings, 0 replies; 37+ messages in thread
From: Shaohua Li @ 2017-08-08 20:34 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Jul 26, 2017 at 06:57:59PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Set queue flags to QUEUE_FLAG_NOWAIT to indicate REQ_NOWAIT
> will be handled.
>
> If an I/O on the md will be delayed, it would bail by calling
> bio_wouldblock_error(). The conditions when this could happen are:
>
> + MD is suspended
> + There is a change pending on the SB, and current I/O would
> block until that is complete.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> drivers/md/md.c | 27 +++++++++++++++++++++++++--
> 1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8cdca0296749..d96c27d16841 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -285,6 +285,13 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
> bio_endio(bio);
> return BLK_QC_T_NONE;
> }
> +
> + if (mddev->suspended && (bio->bi_opf & REQ_NOWAIT)) {
> + bio_wouldblock_error(bio);
> + rcu_read_unlock();
this unlock is not required.
> + return BLK_QC_T_NONE;
> + }
> +
> check_suspended:
> rcu_read_lock();
> if (mddev->suspended) {
> @@ -5274,6 +5281,10 @@ static int md_alloc(dev_t dev, char *name)
> mddev->queue = NULL;
> goto abort;
> }
> +
> + /* Set the NOWAIT flags to show support */
> + queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, mddev->queue);
> +
> disk->major = MAJOR(mddev->unit);
> disk->first_minor = unit << shift;
> if (name)
> @@ -8010,8 +8021,20 @@ bool md_write_start(struct mddev *mddev, struct bio *bi)
> rcu_read_unlock();
> if (did_change)
> sysfs_notify_dirent_safe(mddev->sysfs_state);
> - wait_event(mddev->sb_wait,
> - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
> +
> + /* Don't wait for sb writes if marked with REQ_NOWAIT */
> + if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
> + mddev->suspended) {
> + if (bi->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bi);
> + percpu_ref_put(&mddev->writes_pending);
> + return false;
> + }
> +
> + wait_event(mddev->sb_wait,
> + !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
> + }
> +
> if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
> percpu_ref_put(&mddev->writes_pending);
> return false;
> --
> 2.12.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-08 20:32 ` Shaohua Li
@ 2017-08-08 20:36 ` Jens Axboe
2017-08-10 2:18 ` Jens Axboe
2017-08-09 11:44 ` Goldwyn Rodrigues
1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-08-08 20:36 UTC (permalink / raw)
To: Shaohua Li, Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/08/2017 02:32 PM, Shaohua Li wrote:
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 25f6a0cb27d3..fae021ebec1b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -633,6 +633,7 @@ struct request_queue {
>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
Does this work on 32-bit, where unsigned long is only 32 bits wide?
--
Jens Axboe
* Re: [PATCH 3/9] md: raid1 nowait support
2017-07-26 23:58 ` [PATCH 3/9] md: raid1 nowait support Goldwyn Rodrigues
@ 2017-08-08 20:39 ` Shaohua Li
2017-08-09 11:45 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Shaohua Li @ 2017-08-08 20:39 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Jul 26, 2017 at 06:58:00PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> The RAID1 driver would bail with -EAGAIN in case:
> + I/O has to wait for a barrier
> + the array is frozen
> + the area is suspended
> + there are too many pending I/Os and the request would be queued.
>
> To facilitate error returns from wait barriers, wait_barrier() now
> returns bool: true if the wait completed (or was not required),
> false if a wait was required but was not performed.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> drivers/md/raid1.c | 74 +++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 57 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 3febfc8391fb..66ca4288e3e8 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -903,8 +903,9 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
> wake_up(&conf->wait_barrier);
> }
>
> -static void _wait_barrier(struct r1conf *conf, int idx)
> +static bool _wait_barrier(struct r1conf *conf, int idx, bool nowait)
> {
> + bool ret = true;
> /*
> * We need to increase conf->nr_pending[idx] very early here,
> * then raise_barrier() can be blocked when it waits for
> @@ -935,7 +936,7 @@ static void _wait_barrier(struct r1conf *conf, int idx)
> */
> if (!READ_ONCE(conf->array_frozen) &&
> !atomic_read(&conf->barrier[idx]))
> - return;
> + return ret;
>
> /*
> * After holding conf->resync_lock, conf->nr_pending[idx]
> @@ -953,18 +954,26 @@ static void _wait_barrier(struct r1conf *conf, int idx)
> */
> wake_up(&conf->wait_barrier);
> /* Wait for the barrier in same barrier unit bucket to drop. */
> - wait_event_lock_irq(conf->wait_barrier,
> - !conf->array_frozen &&
> - !atomic_read(&conf->barrier[idx]),
> - conf->resync_lock);
> + if (conf->array_frozen || atomic_read(&conf->barrier[idx])) {
> + if (nowait)
> + ret = false;
In this case, nr_pending shouldn't be increased.
> + else
> + wait_event_lock_irq(conf->wait_barrier,
> + !conf->array_frozen &&
> + !atomic_read(&conf->barrier[idx]),
> + conf->resync_lock);
> + }
> atomic_inc(&conf->nr_pending[idx]);
> atomic_dec(&conf->nr_waiting[idx]);
> spin_unlock_irq(&conf->resync_lock);
> + return ret;
> }
>
> -static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
> +static bool wait_read_barrier(struct r1conf *conf, sector_t sector_nr,
> + bool nowait)
> {
> int idx = sector_to_idx(sector_nr);
> + bool ret = true;
>
> /*
> * Very similar to _wait_barrier(). The difference is, for read
> @@ -976,7 +985,7 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
> atomic_inc(&conf->nr_pending[idx]);
>
> if (!READ_ONCE(conf->array_frozen))
> - return;
> + return ret;
>
> spin_lock_irq(&conf->resync_lock);
> atomic_inc(&conf->nr_waiting[idx]);
> @@ -987,19 +996,28 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
> */
> wake_up(&conf->wait_barrier);
> /* Wait for array to be unfrozen */
> - wait_event_lock_irq(conf->wait_barrier,
> - !conf->array_frozen,
> - conf->resync_lock);
> + if (conf->array_frozen) {
> + /* If nowait flag is set, return false to
> + * show we did not wait
> + */
> + if (nowait)
> + ret = false;
ditto
> + else
> + wait_event_lock_irq(conf->wait_barrier,
> + !conf->array_frozen,
> + conf->resync_lock);
> + }
> atomic_inc(&conf->nr_pending[idx]);
> atomic_dec(&conf->nr_waiting[idx]);
> spin_unlock_irq(&conf->resync_lock);
> + return ret;
> }
>
> -static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
> +static bool wait_barrier(struct r1conf *conf, sector_t sector_nr, bool nowait)
> {
> int idx = sector_to_idx(sector_nr);
>
> - _wait_barrier(conf, idx);
> + return _wait_barrier(conf, idx, nowait);
> }
>
> static void wait_all_barriers(struct r1conf *conf)
> @@ -1007,7 +1025,7 @@ static void wait_all_barriers(struct r1conf *conf)
> int idx;
>
> for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> - _wait_barrier(conf, idx);
> + _wait_barrier(conf, idx, false);
> }
>
> static void _allow_barrier(struct r1conf *conf, int idx)
> @@ -1223,7 +1241,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
> * Still need barrier for READ in case that whole
> * array is frozen.
> */
> - wait_read_barrier(conf, bio->bi_iter.bi_sector);
> + if (!wait_read_barrier(conf, bio->bi_iter.bi_sector,
> + bio->bi_opf & REQ_NOWAIT)) {
> + bio_wouldblock_error(bio);
> + return;
> + }
>
> if (!r1_bio)
> r1_bio = alloc_r1bio(mddev, bio);
> @@ -1333,6 +1355,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> * an interruptible wait.
> */
> DEFINE_WAIT(w);
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> +
> for (;;) {
> sigset_t full, old;
> prepare_to_wait(&conf->wait_barrier,
> @@ -1351,7 +1378,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> }
> finish_wait(&conf->wait_barrier, &w);
> }
> - wait_barrier(conf, bio->bi_iter.bi_sector);
> + if (!wait_barrier(conf, bio->bi_iter.bi_sector,
> + bio->bi_opf & REQ_NOWAIT)) {
> + bio_wouldblock_error(bio);
> + return;
> + }
>
> r1_bio = alloc_r1bio(mddev, bio);
> r1_bio->sectors = max_write_sectors;
> @@ -1359,6 +1390,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> if (conf->pending_count >= max_queued_requests) {
> md_wakeup_thread(mddev->thread);
> raid1_log(mddev, "wait queued");
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> wait_event(conf->wait_barrier,
> conf->pending_count < max_queued_requests);
> }
> @@ -1442,6 +1477,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> /* Wait for this device to become unblocked */
> int j;
>
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> +
> for (j = 0; j < i; j++)
> if (r1_bio->bios[j])
> rdev_dec_pending(conf->mirrors[j].rdev, mddev);
> @@ -1449,7 +1489,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> allow_barrier(conf, bio->bi_iter.bi_sector);
> raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
> md_wait_for_blocked_rdev(blocked_rdev, mddev);
> - wait_barrier(conf, bio->bi_iter.bi_sector);
> + wait_barrier(conf, bio->bi_iter.bi_sector, false);
There are other cases where we could block, for example md_wait_for_blocked_rdev
here. Is the goal just to avoid blocking in the normal cases?
> goto retry_write;
> }
>
> --
> 2.12.3
>
* Re: [PATCH 5/9] md: raid10 nowait support
2017-07-26 23:58 ` [PATCH 5/9] md: raid10 " Goldwyn Rodrigues
@ 2017-08-08 20:40 ` Shaohua Li
0 siblings, 0 replies; 37+ messages in thread
From: Shaohua Li @ 2017-08-08 20:40 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Jul 26, 2017 at 06:58:02PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Bail with -EAGAIN status if raid10 is going to wait for:
> + barriers
> + reshape operation
> + Too many queued requests
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> drivers/md/raid10.c | 62 ++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 5026e7ad51d3..6d80438c5040 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -978,8 +978,9 @@ static void lower_barrier(struct r10conf *conf)
> wake_up(&conf->wait_barrier);
> }
>
> -static void wait_barrier(struct r10conf *conf)
> +static bool wait_barrier(struct r10conf *conf, bool nowait)
> {
> + bool ret = true;
> spin_lock_irq(&conf->resync_lock);
> if (conf->barrier) {
> conf->nr_waiting++;
> @@ -993,19 +994,23 @@ static void wait_barrier(struct r10conf *conf)
> * count down.
> */
> raid10_log(conf->mddev, "wait barrier");
> - wait_event_lock_irq(conf->wait_barrier,
> - !conf->barrier ||
> - (atomic_read(&conf->nr_pending) &&
> - current->bio_list &&
> - (!bio_list_empty(¤t->bio_list[0]) ||
> - !bio_list_empty(¤t->bio_list[1]))),
> - conf->resync_lock);
> + if (!nowait)
> + wait_event_lock_irq(conf->wait_barrier,
> + !conf->barrier ||
> + (atomic_read(&conf->nr_pending) &&
> + current->bio_list &&
> + (!bio_list_empty(¤t->bio_list[0]) ||
> + !bio_list_empty(¤t->bio_list[1]))),
> + conf->resync_lock);
> + else
> + ret = false;
same here, nr_pending shouldn't be increased
> conf->nr_waiting--;
> if (!conf->nr_waiting)
> wake_up(&conf->wait_barrier);
> }
> atomic_inc(&conf->nr_pending);
> spin_unlock_irq(&conf->resync_lock);
> + return ret;
> }
>
> static void allow_barrier(struct r10conf *conf)
> @@ -1158,7 +1163,10 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
> * thread has put up a bar for new requests.
> * Continue immediately if no resync is active currently.
> */
> - wait_barrier(conf);
> + if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
> + bio_wouldblock_error(bio);
> + return;
> + }
>
> sectors = r10_bio->sectors;
> while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> @@ -1169,12 +1177,16 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
> * pass
> */
> raid10_log(conf->mddev, "wait reshape");
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> allow_barrier(conf);
> wait_event(conf->wait_barrier,
> conf->reshape_progress <= bio->bi_iter.bi_sector ||
> conf->reshape_progress >= bio->bi_iter.bi_sector +
> sectors);
> - wait_barrier(conf);
> + wait_barrier(conf, false);
> }
>
> rdev = read_balance(conf, r10_bio, &max_sectors);
> @@ -1308,7 +1320,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> * thread has put up a bar for new requests.
> * Continue immediately if no resync is active currently.
> */
> - wait_barrier(conf);
> + if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
> + bio_wouldblock_error(bio);
> + return;
> + }
>
> sectors = r10_bio->sectors;
> while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> @@ -1319,12 +1334,16 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> * pass
> */
> raid10_log(conf->mddev, "wait reshape");
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> allow_barrier(conf);
> wait_event(conf->wait_barrier,
> conf->reshape_progress <= bio->bi_iter.bi_sector ||
> conf->reshape_progress >= bio->bi_iter.bi_sector +
> sectors);
> - wait_barrier(conf);
> + wait_barrier(conf, false);
> }
>
> if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> @@ -1339,6 +1358,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> md_wakeup_thread(mddev->thread);
> raid10_log(conf->mddev, "wait reshape metadata");
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> wait_event(mddev->sb_wait,
> !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
>
> @@ -1348,6 +1371,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> if (conf->pending_count >= max_queued_requests) {
> md_wakeup_thread(mddev->thread);
> raid10_log(mddev, "wait queued");
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> wait_event(conf->wait_barrier,
> conf->pending_count < max_queued_requests);
> }
> @@ -1454,6 +1481,11 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> int j;
> int d;
>
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bio);
> + return;
> + }
> +
> for (j = 0; j < i; j++) {
> if (r10_bio->devs[j].bio) {
> d = r10_bio->devs[j].devnum;
> @@ -1474,7 +1506,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> allow_barrier(conf);
> raid10_log(conf->mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
> md_wait_for_blocked_rdev(blocked_rdev, mddev);
> - wait_barrier(conf);
> + wait_barrier(conf, false);
> goto retry_write;
> }
>
> @@ -1703,7 +1735,7 @@ static void print_conf(struct r10conf *conf)
>
> static void close_sync(struct r10conf *conf)
> {
> - wait_barrier(conf);
> + wait_barrier(conf, false);
> allow_barrier(conf);
>
> mempool_destroy(conf->r10buf_pool);
> @@ -4347,7 +4379,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
> if (need_flush ||
> time_after(jiffies, conf->reshape_checkpoint + 10*HZ)) {
> /* Need to update reshape_position in metadata */
> - wait_barrier(conf);
> + wait_barrier(conf, false);
> mddev->reshape_position = conf->reshape_progress;
> if (mddev->reshape_backwards)
> mddev->curr_resync_completed = raid10_size(mddev, 0, 0)
> --
> 2.12.3
>
* Re: [PATCH 4/9] md: raid5 nowait support
2017-07-26 23:58 ` [PATCH 4/9] md: raid5 " Goldwyn Rodrigues
@ 2017-08-08 20:43 ` Shaohua Li
2017-08-09 11:45 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Shaohua Li @ 2017-08-08 20:43 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Jul 26, 2017 at 06:58:01PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Return -EAGAIN in case RAID5 would block waiting for:
> + Reshaping
> + Suspension
> + Stripe Expansion
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> drivers/md/raid5.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index aeeb8d6854e2..d1b3bcf26d29 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5635,6 +5635,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> ? logical_sector < conf->reshape_safe
> : logical_sector >= conf->reshape_safe) {
> spin_unlock_irq(&conf->device_lock);
> + if (bi->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bi);
> + finish_wait(&conf->wait_for_overlap, &w);
> + return true;
> + }
A bio could use several stripes. If one stripe blocks, simply returning the bio
here doesn't really finish the whole bio.
> schedule();
> do_prepare = true;
> goto retry;
> @@ -5672,6 +5677,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> spin_unlock_irq(&conf->device_lock);
> if (must_retry) {
> raid5_release_stripe(sh);
> + if (bi->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bi);
> + finish_wait(&conf->wait_for_overlap, &w);
> + return true;
> + }
> schedule();
> do_prepare = true;
> goto retry;
> @@ -5700,6 +5710,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> sigset_t full, old;
> sigfillset(&full);
> sigprocmask(SIG_BLOCK, &full, &old);
> + if (bi->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bi);
> + finish_wait(&conf->wait_for_overlap, &w);
> + return true;
> + }
> schedule();
> sigprocmask(SIG_SETMASK, &old, NULL);
> do_prepare = true;
> @@ -5715,6 +5730,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> */
> md_wakeup_thread(mddev->thread);
> raid5_release_stripe(sh);
> + if (bi->bi_opf & REQ_NOWAIT) {
> + bio_wouldblock_error(bi);
> + finish_wait(&conf->wait_for_overlap, &w);
> + return true;
> + }
> schedule();
> do_prepare = true;
> goto retry;
> --
> 2.12.3
>
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-08 20:32 ` Shaohua Li
2017-08-08 20:36 ` Jens Axboe
@ 2017-08-09 11:44 ` Goldwyn Rodrigues
2017-08-09 15:02 ` Shaohua Li
1 sibling, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-09 11:44 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/08/2017 03:32 PM, Shaohua Li wrote:
> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> Nowait is a feature of direct AIO, where users can request
>> to return immediately if the I/O is going to block. This translates
>> to REQ_NOWAIT in bio.bi_opf flags. While request based devices
>> don't wait, stacked devices such as md/dm will.
>>
>> In order to explicitly mark stacked devices as supported, we
>> set the QUEUE_FLAG_NOWAIT in the queue_flags and return -EAGAIN
>> whenever the device would block.
>
> probably you should route this patch to Jens first, DM/MD are different trees.
Yes, I have sent it to linux-block as well, and Jens has already commented.
>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> ---
>> block/blk-core.c | 3 ++-
>> include/linux/blkdev.h | 2 ++
>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 970b9c9638c5..1c9a981d88e5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2025,7 +2025,8 @@ generic_make_request_checks(struct bio *bio)
>> * if queue is not a request based queue.
>> */
>>
>> - if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
>> + if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q) &&
>> + !blk_queue_supports_nowait(q))
>> goto not_supported;
>>
>> part = bio->bi_bdev->bd_part;
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 25f6a0cb27d3..fae021ebec1b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -633,6 +633,7 @@ struct request_queue {
>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>>
>> #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
>> (1 << QUEUE_FLAG_STACKABLE) | \
>> @@ -732,6 +733,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
>> #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
>> #define blk_queue_scsi_passthrough(q) \
>> test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
>> +#define blk_queue_supports_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
>
> Should this flag take the underlying disks into account? For example, if one
> disk in a RAID array doesn't support NOWAIT, shouldn't we disable NOWAIT for the array?
Yes, it should. I will add a check before setting the flag. Thanks.
Request-based devices don't wait, so they would not have this flag set.
It is only the bio-based devices, with the make_request_fn hook, which need this.
>
> I have another generic question. If a bio is split into 2 bios, and one bio
> doesn't need to wait but the other needs to wait, we will return -EAGAIN for the
> second bio, so the whole bio will return -EAGAIN, but the first bio has already
> been dispatched to disk. Is this the correct behavior?
>
No, from a multi-device point of view this is inconsistent. I have
tried to make the original bio return -EAGAIN before the split, but I shall
check again. Where do you see this happening?
--
Goldwyn
* Re: [PATCH 3/9] md: raid1 nowait support
2017-08-08 20:39 ` Shaohua Li
@ 2017-08-09 11:45 ` Goldwyn Rodrigues
0 siblings, 0 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-09 11:45 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/08/2017 03:39 PM, Shaohua Li wrote:
> On Wed, Jul 26, 2017 at 06:58:00PM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> The RAID1 driver would bail with -EAGAIN in case:
>> + I/O has to wait for a barrier
>> + the array is frozen
>> + the area is suspended
>> + there are too many pending I/Os and the request would be queued.
>>
>> To facilitate error returns from wait barriers, wait_barrier() now
>> returns bool: true if the wait completed (or was not required),
>> false if a wait was required but was not performed.
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> ---
>> drivers/md/raid1.c | 74 +++++++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 57 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 3febfc8391fb..66ca4288e3e8 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -903,8 +903,9 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>> wake_up(&conf->wait_barrier);
>> }
>>
>> -static void _wait_barrier(struct r1conf *conf, int idx)
>> +static bool _wait_barrier(struct r1conf *conf, int idx, bool nowait)
>> {
>> + bool ret = true;
>> /*
>> * We need to increase conf->nr_pending[idx] very early here,
>> * then raise_barrier() can be blocked when it waits for
>> @@ -935,7 +936,7 @@ static void _wait_barrier(struct r1conf *conf, int idx)
>> */
>> if (!READ_ONCE(conf->array_frozen) &&
>> !atomic_read(&conf->barrier[idx]))
>> - return;
>> + return ret;
>>
>> /*
>> * After holding conf->resync_lock, conf->nr_pending[idx]
>> @@ -953,18 +954,26 @@ static void _wait_barrier(struct r1conf *conf, int idx)
>> */
>> wake_up(&conf->wait_barrier);
>> /* Wait for the barrier in same barrier unit bucket to drop. */
>> - wait_event_lock_irq(conf->wait_barrier,
>> - !conf->array_frozen &&
>> - !atomic_read(&conf->barrier[idx]),
>> - conf->resync_lock);
>> + if (conf->array_frozen || atomic_read(&conf->barrier[idx])) {
>> + if (nowait)
>> + ret = false;
>
> In this case, nr_pending shouldn't be increased.
Ok, will fix this.
>
>> + else
>> + wait_event_lock_irq(conf->wait_barrier,
>> + !conf->array_frozen &&
>> + !atomic_read(&conf->barrier[idx]),
>> + conf->resync_lock);
>> + }
>> atomic_inc(&conf->nr_pending[idx]);
>> atomic_dec(&conf->nr_waiting[idx]);
>> spin_unlock_irq(&conf->resync_lock);
>> + return ret;
>> }
>>
>> -static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>> +static bool wait_read_barrier(struct r1conf *conf, sector_t sector_nr,
>> + bool nowait)
>> {
>> int idx = sector_to_idx(sector_nr);
>> + bool ret = true;
>>
>> /*
>> * Very similar to _wait_barrier(). The difference is, for read
>> @@ -976,7 +985,7 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>> atomic_inc(&conf->nr_pending[idx]);
>>
>> if (!READ_ONCE(conf->array_frozen))
>> - return;
>> + return ret;
>>
>> spin_lock_irq(&conf->resync_lock);
>> atomic_inc(&conf->nr_waiting[idx]);
>> @@ -987,19 +996,28 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>> */
>> wake_up(&conf->wait_barrier);
>> /* Wait for array to be unfrozen */
>> - wait_event_lock_irq(conf->wait_barrier,
>> - !conf->array_frozen,
>> - conf->resync_lock);
>> + if (conf->array_frozen) {
>> + /* If nowait flag is set, return false to
>> + * show we did not wait
>> + */
>> + if (nowait)
>> + ret = false;
>
> ditto
>> + else
>> + wait_event_lock_irq(conf->wait_barrier,
>> + !conf->array_frozen,
>> + conf->resync_lock);
>> + }
>> atomic_inc(&conf->nr_pending[idx]);
>> atomic_dec(&conf->nr_waiting[idx]);
>> spin_unlock_irq(&conf->resync_lock);
>> + return ret;
>> }
>>
>> -static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>> +static bool wait_barrier(struct r1conf *conf, sector_t sector_nr, bool nowait)
>> {
>> int idx = sector_to_idx(sector_nr);
>>
>> - _wait_barrier(conf, idx);
>> + return _wait_barrier(conf, idx, nowait);
>> }
>>
>> static void wait_all_barriers(struct r1conf *conf)
>> @@ -1007,7 +1025,7 @@ static void wait_all_barriers(struct r1conf *conf)
>> int idx;
>>
>> for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>> - _wait_barrier(conf, idx);
>> + _wait_barrier(conf, idx, false);
>> }
>>
>> static void _allow_barrier(struct r1conf *conf, int idx)
>> @@ -1223,7 +1241,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> * Still need barrier for READ in case that whole
>> * array is frozen.
>> */
>> - wait_read_barrier(conf, bio->bi_iter.bi_sector);
>> + if (!wait_read_barrier(conf, bio->bi_iter.bi_sector,
>> + bio->bi_opf & REQ_NOWAIT)) {
>> + bio_wouldblock_error(bio);
>> + return;
>> + }
>>
>> if (!r1_bio)
>> r1_bio = alloc_r1bio(mddev, bio);
>> @@ -1333,6 +1355,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> * an interruptible wait.
>> */
>> DEFINE_WAIT(w);
>> + if (bio->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bio);
>> + return;
>> + }
>> +
>> for (;;) {
>> sigset_t full, old;
>> prepare_to_wait(&conf->wait_barrier,
>> @@ -1351,7 +1378,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> }
>> finish_wait(&conf->wait_barrier, &w);
>> }
>> - wait_barrier(conf, bio->bi_iter.bi_sector);
>> + if (!wait_barrier(conf, bio->bi_iter.bi_sector,
>> + bio->bi_opf & REQ_NOWAIT)) {
>> + bio_wouldblock_error(bio);
>> + return;
>> + }
>>
>> r1_bio = alloc_r1bio(mddev, bio);
>> r1_bio->sectors = max_write_sectors;
>> @@ -1359,6 +1390,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> if (conf->pending_count >= max_queued_requests) {
>> md_wakeup_thread(mddev->thread);
>> raid1_log(mddev, "wait queued");
>> + if (bio->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bio);
>> + return;
>> + }
>> wait_event(conf->wait_barrier,
>> conf->pending_count < max_queued_requests);
>> }
>> @@ -1442,6 +1477,11 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> /* Wait for this device to become unblocked */
>> int j;
>>
>> + if (bio->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bio);
>> + return;
>> + }
>> +
>> for (j = 0; j < i; j++)
>> if (r1_bio->bios[j])
>> rdev_dec_pending(conf->mirrors[j].rdev, mddev);
>> @@ -1449,7 +1489,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> allow_barrier(conf, bio->bi_iter.bi_sector);
>> raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
>> md_wait_for_blocked_rdev(blocked_rdev, mddev);
>> - wait_barrier(conf, bio->bi_iter.bi_sector);
>> + wait_barrier(conf, bio->bi_iter.bi_sector, false);
>
> There are other cases where we could block, for example md_wait_for_blocked_rdev
> here. Is the goal just to avoid blocking in the normal cases?
Isn't this covered by the if condition of the codeblock (blocked_rdev !=
NULL)?
>
>> goto retry_write;
>> }
>>
>> --
>> 2.12.3
>>
--
Goldwyn
* Re: [PATCH 4/9] md: raid5 nowait support
2017-08-08 20:43 ` Shaohua Li
@ 2017-08-09 11:45 ` Goldwyn Rodrigues
0 siblings, 0 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-09 11:45 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/08/2017 03:43 PM, Shaohua Li wrote:
> On Wed, Jul 26, 2017 at 06:58:01PM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> Return -EAGAIN in case RAID5 would block waiting for:
>> + Reshaping
>> + Suspension
>> + Stripe Expansion
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> ---
>> drivers/md/raid5.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index aeeb8d6854e2..d1b3bcf26d29 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5635,6 +5635,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>> ? logical_sector < conf->reshape_safe
>> : logical_sector >= conf->reshape_safe) {
>> spin_unlock_irq(&conf->device_lock);
>> + if (bi->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bi);
>> + finish_wait(&conf->wait_for_overlap, &w);
>> + return true;
>> + }
>
> A bio could use several stripes. If one stripe blocks, simply returning the bio
> here doesn't really finish the whole bio.
My understanding is a little weak here.
How would you terminate the entire bio when one of its stripes
is blocking?
>
>> schedule();
>> do_prepare = true;
>> goto retry;
>> @@ -5672,6 +5677,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>> spin_unlock_irq(&conf->device_lock);
>> if (must_retry) {
>> raid5_release_stripe(sh);
>> + if (bi->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bi);
>> + finish_wait(&conf->wait_for_overlap, &w);
>> + return true;
>> + }
>> schedule();
>> do_prepare = true;
>> goto retry;
>> @@ -5700,6 +5710,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>> sigset_t full, old;
>> sigfillset(&full);
>> sigprocmask(SIG_BLOCK, &full, &old);
>> + if (bi->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bi);
>> + finish_wait(&conf->wait_for_overlap, &w);
>> + return true;
>> + }
>> schedule();
>> sigprocmask(SIG_SETMASK, &old, NULL);
>> do_prepare = true;
>> @@ -5715,6 +5730,11 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>> */
>> md_wakeup_thread(mddev->thread);
>> raid5_release_stripe(sh);
>> + if (bi->bi_opf & REQ_NOWAIT) {
>> + bio_wouldblock_error(bi);
>> + finish_wait(&conf->wait_for_overlap, &w);
>> + return true;
>> + }
>> schedule();
>> do_prepare = true;
>> goto retry;
>> --
>> 2.12.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Goldwyn
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-09 11:44 ` Goldwyn Rodrigues
@ 2017-08-09 15:02 ` Shaohua Li
2017-08-09 15:35 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Shaohua Li @ 2017-08-09 15:02 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
>
>
> On 08/08/2017 03:32 PM, Shaohua Li wrote:
> > On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >>
> >> Nowait is a feature of direct AIO, where users can request
> >> to return immediately if the I/O is going to block. This translates
> >> to REQ_NOWAIT in bio.bi_opf flags. While request based devices
> >> don't wait, stacked devices such as md/dm will.
> >>
> >> In order to explicitly mark stacked devices as supported, we
> >> set the QUEUE_FLAG_NOWAIT in the queue_flags and return -EAGAIN
> >> whenever the device would block.
> >
> > probably you should route this patch to Jens first, DM/MD are different trees.
>
> Yes, I have sent it to linux-block as well, and he has commented on it.
>
>
> >
> >> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >> ---
> >> block/blk-core.c | 3 ++-
> >> include/linux/blkdev.h | 2 ++
> >> 2 files changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 970b9c9638c5..1c9a981d88e5 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -2025,7 +2025,8 @@ generic_make_request_checks(struct bio *bio)
> >> * if queue is not a request based queue.
> >> */
> >>
> >> - if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
> >> + if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q) &&
> >> + !blk_queue_supports_nowait(q))
> >> goto not_supported;
> >>
> >> part = bio->bi_bdev->bd_part;
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> index 25f6a0cb27d3..fae021ebec1b 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -633,6 +633,7 @@ struct request_queue {
> >> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
> >> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
> >> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
> >> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
> >>
> >> #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
> >> (1 << QUEUE_FLAG_STACKABLE) | \
> >> @@ -732,6 +733,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
> >> #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
> >> #define blk_queue_scsi_passthrough(q) \
> >> test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
> >> +#define blk_queue_supports_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
> >
> > Should this bit consider the underlying disks? For example, if one raid
> > array disk doesn't support NOWAIT, shouldn't we disable NOWAIT for the array?
>
> Yes, it should. I will add a check before setting the flag. Thanks.
> Request-based devices don't wait, so they would not have this flag set.
> It is only the bio-based devices, with the make_request_fn hook, that need this.
>
> >
> > I have another generic question. If a bio is split into 2 bios, and one bio
> > doesn't need to wait but the other does, we will return -EAGAIN for the
> > second bio, so the whole bio will return -EAGAIN, but the first bio is
> > already dispatched to disk. Is this the correct behavior?
> >
>
> No, from a multi-device point of view, this is inconsistent. I have
> tried making the bio return -EAGAIN before the split, but I shall
> check again. Where do you see this happening?
No, this isn't multi-device specific, any driver can do it. Please see blk_queue_split.
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-09 15:02 ` Shaohua Li
@ 2017-08-09 15:35 ` Goldwyn Rodrigues
2017-08-09 20:21 ` Shaohua Li
0 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-09 15:35 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/09/2017 10:02 AM, Shaohua Li wrote:
> On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
>>
>>
>> On 08/08/2017 03:32 PM, Shaohua Li wrote:
>>> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
>>>> [...]
>
> No, this isn't multi-device specific, any driver can do it. Please see blk_queue_split.
>
In that case, the bio end_io functions are chained, and the split bio will
propagate the error to the parent (if one is not already set).
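The chaining just described can be modeled in a userspace sketch (a toy model of what the kernel does with bio_chain() and bio_endio(); the names below are invented): each half of a split keeps a pointer to its parent, and the first error seen by a half is recorded in the parent when the last completion arrives.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model of chained bio completion (names invented; the kernel
 * implements this with bio_chain() and bio_endio()). */
struct toy_bio {
    int error;              /* 0, or the first negative errno seen */
    int remaining;          /* completions still outstanding */
    struct toy_bio *parent; /* non-NULL on the halves of a split */
};

void toy_endio(struct toy_bio *bio, int error)
{
    if (error && !bio->error)
        bio->error = error;                  /* first error wins */
    if (--bio->remaining == 0 && bio->parent)
        toy_endio(bio->parent, bio->error);  /* bubble up the chain */
}
```

With a parent expecting two completions, finishing one half successfully and failing the other with -EAGAIN leaves the parent completed with -EAGAIN even though the first half already reached the disk, which is exactly the partially-dispatched case under discussion.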
--
Goldwyn
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-09 15:35 ` Goldwyn Rodrigues
@ 2017-08-09 20:21 ` Shaohua Li
2017-08-09 22:16 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Shaohua Li @ 2017-08-09 20:21 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On Wed, Aug 09, 2017 at 10:35:39AM -0500, Goldwyn Rodrigues wrote:
>
>
> On 08/09/2017 10:02 AM, Shaohua Li wrote:
> > On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
> >>
> >>
> >> On 08/08/2017 03:32 PM, Shaohua Li wrote:
> >>> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
> >>>> [...]
> >
> > No, this isn't multi-device specific, any driver can do it. Please see blk_queue_split.
> >
>
> In that case, the bio end_io functions are chained, and the split bio will
> propagate the error to the parent (if one is not already set).
This doesn't answer my question. If a bio returns -EAGAIN, part of the
bio has probably already been dispatched to disk (if the bio is split
into 2 bios, one returns -EAGAIN while the other doesn't block and is
dispatched to disk). What is the application going to do? I think this
is different from other I/O errors. For other I/O errors, the
application will handle the error, while here we ask the application to
retry the whole bio, and the application doesn't know that part of the
bio has already been written to disk.
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-09 20:21 ` Shaohua Li
@ 2017-08-09 22:16 ` Goldwyn Rodrigues
2017-08-10 1:17 ` Shaohua Li
0 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-09 22:16 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-block, hch, jack, linux-raid, dm-devel
On 08/09/2017 03:21 PM, Shaohua Li wrote:
> On Wed, Aug 09, 2017 at 10:35:39AM -0500, Goldwyn Rodrigues wrote:
>>
>>
>> On 08/09/2017 10:02 AM, Shaohua Li wrote:
>>> On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
>>>>
>>>>
>>>> On 08/08/2017 03:32 PM, Shaohua Li wrote:
>>>>> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
>>>>>> [...]
>
> This doesn't answer my question. If a bio returns -EAGAIN, part of the
> bio has probably already been dispatched to disk (if the bio is split
> into 2 bios, one returns -EAGAIN while the other doesn't block and is
> dispatched to disk). What is the application going to do? I think this
> is different from other I/O errors. For other I/O errors, the
> application will handle the error, while here we ask the application to
> retry the whole bio, and the application doesn't know that part of the
> bio has already been written to disk.
It is the same as for other I/O errors, such as EIO. You do not
know which of all the submitted bios returned the error. The
application would and should consider the whole I/O as failed.

The user application does not know about bios, or how the I/O is
split in the underlying layers; it knows the system call level. In
this case, the EAGAIN will be returned to the user for the whole I/O,
not for a part of it. It is up to the application to try the I/O again
with or without RWF_NOWAIT set. In direct I/O, the error is bubbled
out using dio->io_error. You can read about it in the patch header of
the initial patchset at [1].

Use case: it is for applications with two threads, a compute thread
and an I/O thread. The application pushes as much AIO as possible in
the compute thread using RWF_NOWAIT and, if that fails, passes the
I/O on to the I/O thread, which performs it without RWF_NOWAIT. The
end result, done right, is that you save on context switches and all
the synchronization/messaging machinery otherwise needed to perform I/O.
[1] http://marc.info/?l=linux-block&m=149789003305876&w=2
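The two-thread pattern above can be sketched at the system-call level. pwritev2() and RWF_NOWAIT are the real interfaces under discussion, but the helper below is a hypothetical example: it falls back to a blocking pwrite() on -EAGAIN, or on kernels/files that reject the flag, where a real application would instead hand the operation to its I/O thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008   /* value from linux/fs.h */
#endif

/* Try the fast path first: a write that fails with EAGAIN instead of
 * blocking.  On EAGAIN (or when the kernel/file rejects RWF_NOWAIT)
 * fall back to a normal blocking write -- the part a real application
 * would queue to its dedicated I/O thread. */
ssize_t write_nowait_or_fallback(int fd, const void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

    if (ret >= 0)
        return ret;                      /* fast path: did not block */
    if (errno != EAGAIN && errno != EOPNOTSUPP &&
        errno != EINVAL && errno != ENOSYS)
        return -1;                       /* a real I/O error */
    return pwrite(fd, buf, len, off);    /* slow path: may block */
}
```

In the scheme described above, the -EAGAIN branch would not retry inline but post the request to the I/O thread, keeping the compute thread free of blocking submissions.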
--
Goldwyn
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-09 22:16 ` Goldwyn Rodrigues
@ 2017-08-10 1:17 ` Shaohua Li
2017-08-10 2:07 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Shaohua Li @ 2017-08-10 1:17 UTC (permalink / raw)
To: Goldwyn Rodrigues; +Cc: linux-block, hch, jack, linux-raid, dm-devel
On Wed, Aug 09, 2017 at 05:16:23PM -0500, Goldwyn Rodrigues wrote:
>
>
> On 08/09/2017 03:21 PM, Shaohua Li wrote:
> > On Wed, Aug 09, 2017 at 10:35:39AM -0500, Goldwyn Rodrigues wrote:
> >>
> >>
> >> On 08/09/2017 10:02 AM, Shaohua Li wrote:
> >>> On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
> >>>>
> >>>>
> >>>> On 08/08/2017 03:32 PM, Shaohua Li wrote:
> >>>>> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
> >>>>>> [...]
>
> It is the same as for other I/O errors, such as EIO. You do not
> know which of all the submitted bios returned the error. The
> application would and should consider the whole I/O as failed.
>
> The user application does not know about bios, or how the I/O is
> split in the underlying layers; it knows the system call level. In
> this case, the EAGAIN will be returned to the user for the whole I/O,
> not for a part of it. It is up to the application to try the I/O again
> with or without RWF_NOWAIT set. In direct I/O, the error is bubbled
> out using dio->io_error. You can read about it in the patch header of
> the initial patchset at [1].
>
> Use case: it is for applications with two threads, a compute thread
> and an I/O thread. The application pushes as much AIO as possible in
> the compute thread using RWF_NOWAIT and, if that fails, passes the
> I/O on to the I/O thread, which performs it without RWF_NOWAIT. The
> end result, done right, is that you save on context switches and all
> the synchronization/messaging machinery otherwise needed to perform I/O.
>
> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
Yes, I knew the concept, but I didn't see the previous patches mention
that -EAGAIN should actually be taken as a real I/O error. This means a
lot to applications and makes the API hard to use. I'm wondering if we
should disable bio splitting for NOWAIT bios, which would make -EAGAIN
mean only 'try again'.
Thanks,
Shaohua
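Shaohua's no-split suggestion can be sketched as a pre-check (hypothetical logic, not the actual blk_queue_split() code; the function name is invented): if a REQ_NOWAIT bio would have to be split, fail the whole bio with -EAGAIN up front, so nothing is partially dispatched and -EAGAIN keeps its pure "try again" meaning.

```c
#include <errno.h>

/* Hypothetical pre-check modeling "disable bio split for NOWAIT bios":
 * a bio larger than the queue limit would normally be split; a nowait
 * bio is instead failed whole, before any piece is dispatched. */
int submit_checked(unsigned int bio_sectors, unsigned int max_sectors,
                   int nowait)
{
    if (bio_sectors > max_sectors) {
        if (nowait)
            return -EAGAIN;  /* would need a split: fail before any I/O */
        /* otherwise: split into max_sectors-sized pieces and submit each */
    }
    return 0;                /* submitted (or submittable) as one piece */
}
```

Under this scheme the caller either gets the whole bio dispatched or a clean -EAGAIN with no partial writes, at the cost of rejecting large nowait bios that the queue could otherwise service piecewise.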
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 1:17 ` Shaohua Li
@ 2017-08-10 2:07 ` Goldwyn Rodrigues
2017-08-10 2:17 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-10 2:07 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-block, hch, jack, linux-raid, dm-devel
On 08/09/2017 08:17 PM, Shaohua Li wrote:
> On Wed, Aug 09, 2017 at 05:16:23PM -0500, Goldwyn Rodrigues wrote:
>>
>>
>> On 08/09/2017 03:21 PM, Shaohua Li wrote:
>>> On Wed, Aug 09, 2017 at 10:35:39AM -0500, Goldwyn Rodrigues wrote:
>>>>
>>>>
>>>> On 08/09/2017 10:02 AM, Shaohua Li wrote:
>>>>> On Wed, Aug 09, 2017 at 06:44:55AM -0500, Goldwyn Rodrigues wrote:
>>>>>>
>>>>>>
>>>>>> On 08/08/2017 03:32 PM, Shaohua Li wrote:
>>>>>>> On Wed, Jul 26, 2017 at 06:57:58PM -0500, Goldwyn Rodrigues wrote:
>>>>>>>> [...]
>>
>> It is the same as for other I/O errors, such as EIO. You do not
>> know which of all the submitted bios returned the error. The
>> application would and should consider the whole I/O as failed.
>>
>> The user application does not know about bios, or how the I/O is
>> split in the underlying layers; it knows the system call level. In
>> this case, the EAGAIN will be returned to the user for the whole I/O,
>> not for a part of it. It is up to the application to try the I/O again
>> with or without RWF_NOWAIT set. In direct I/O, the error is bubbled
>> out using dio->io_error. You can read about it in the patch header of
>> the initial patchset at [1].
>>
>> Use case: it is for applications with two threads, a compute thread
>> and an I/O thread. The application pushes as much AIO as possible in
>> the compute thread using RWF_NOWAIT and, if that fails, passes the
>> I/O on to the I/O thread, which performs it without RWF_NOWAIT. The
>> end result, done right, is that you save on context switches and all
>> the synchronization/messaging machinery otherwise needed to perform I/O.
>>
>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>
> Yes, I knew the concept, but I didn't see previous patches mentioned the
> -EAGAIN actually should be taken as a real IO error. This means a lot to
> applications and make the API hard to use. I'm wondering if we should disable
> bio split for NOWAIT bio, which will make the -EAGAIN only mean 'try again'.
>
Don't read it as EAGAIN; read it as EWOULDBLOCK. Why do you say the
API is hard to use? Do you have a case to back that up?
No, not splitting the bio does not make sense here. I do not see any
advantage in it, unless you can present a case otherwise.
--
Goldwyn
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 2:07 ` Goldwyn Rodrigues
@ 2017-08-10 2:17 ` Jens Axboe
2017-08-10 11:49 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 2:17 UTC (permalink / raw)
To: Goldwyn Rodrigues, Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel
On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>> I shall check again. Where do you see this happening?
>>>>>>
>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>> Please see blk_queue_split.
>>>>>>
>>>>>
>>>>> In that case, the bio end_io function is chained and the bio of
>>>>> the split will replicate the error to the parent (if not already
>>>>> set).
>>>>
>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>> of the bio probably already dispatched to disk (if the bio is
>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>> block and dispatch to disk), what will application be going to do?
>>>> I think this is different to other IO errors. FOr other IO errors,
>>>> application will handle the error, while we ask app to retry the
>>>> whole bio here and app doesn't know part of bio is already written
>>>> to disk.
>>>
>>> It is the same as for other I/O errors as well, such as EIO. You do
>>> not know which bio of all submitted bio's returned the error EIO.
>>> The application would and should consider the whole I/O as failed.
>>>
>>> The user application does not know of bios, or how it is going to be
>>> split in the underlying layers. It knows at the system call level.
>>> In this case, the EAGAIN will be returned to the user for the whole
>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>> out using dio->io_error. You can read about it at the patch header
>>> for the initial patchset at [1].
>>>
>>> Use case: It is for applications having two threads, a compute
>>> thread and an I/O thread. It would try to push AIO as much as
>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>> would pass it on to I/O thread which would perform without
>>> RWF_NOWAIT. End result if done right is you save on context switches
>>> and all the synchronization/messaging machinery to perform I/O.
>>>
>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>
>> Yes, I knew the concept, but I didn't see previous patches mentioned
>> the -EAGAIN actually should be taken as a real IO error. This means a
>> lot to applications and make the API hard to use. I'm wondering if we
>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
>> only mean 'try again'.
>
> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
> the API is hard to use? Do you have a case to back it up?
Because it is hard to use, and potentially suboptimal. Let's say you're
doing a 1MB write and we hit EWOULDBLOCK for the last split. Do we return
a short write, or do we return EWOULDBLOCK? If the latter, then that
really sucks from an API point of view.
> No, not splitting the bio does not make sense here. I do not see any
> advantage in it, unless you can present a case otherwise.
It ties back into the "hard to use" point, which I do agree with IFF we
don't return the short write. It's hard for an application to use this
efficiently if, after writing 1MB-128K and getting EWOULDBLOCK, it has
to re-write the full 1MB from a different context.
--
Jens Axboe
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-08 20:36 ` Jens Axboe
@ 2017-08-10 2:18 ` Jens Axboe
2017-08-10 11:38 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 2:18 UTC (permalink / raw)
To: Shaohua Li, Goldwyn Rodrigues
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/08/2017 02:36 PM, Jens Axboe wrote:
> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -633,6 +633,7 @@ struct request_queue {
>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>
> Does this work on 32-bit, where sizeof(unsigned long) == 32?
I didn't get an answer to this one.
--
Jens Axboe
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 2:18 ` Jens Axboe
@ 2017-08-10 11:38 ` Goldwyn Rodrigues
2017-08-10 14:14 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-10 11:38 UTC (permalink / raw)
To: Jens Axboe, Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/09/2017 09:18 PM, Jens Axboe wrote:
> On 08/08/2017 02:36 PM, Jens Axboe wrote:
>> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -633,6 +633,7 @@ struct request_queue {
>>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>>
>> Does this work on 32-bit, where sizeof(unsigned long) == 32?
>
> I didn't get an answer to this one.
>
Oh, I assumed the question was rhetorical.
No, it will not work on 32-bit. I was planning to change the
queue_flags field to u64. Is that okay?
--
Goldwyn
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 2:17 ` Jens Axboe
@ 2017-08-10 11:49 ` Goldwyn Rodrigues
2017-08-10 14:23 ` Jens Axboe
2017-08-10 14:25 ` Jan Kara
0 siblings, 2 replies; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-10 11:49 UTC (permalink / raw)
To: Jens Axboe, Shaohua Li; +Cc: linux-block, hch, jack, linux-raid, dm-devel
On 08/09/2017 09:17 PM, Jens Axboe wrote:
> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>
>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>> Please see blk_queue_split.
>>>>>>>
>>>>>>
>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>> the split will replicate the error to the parent (if not already
>>>>>> set).
>>>>>
>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>> of the bio probably already dispatched to disk (if the bio is
>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>>> block and dispatch to disk), what will application be going to do?
>>>>> I think this is different to other IO errors. FOr other IO errors,
>>>>> application will handle the error, while we ask app to retry the
>>>>> whole bio here and app doesn't know part of bio is already written
>>>>> to disk.
>>>>
>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>> not know which bio of all submitted bio's returned the error EIO.
>>>> The application would and should consider the whole I/O as failed.
>>>>
>>>> The user application does not know of bios, or how it is going to be
>>>> split in the underlying layers. It knows at the system call level.
>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>> out using dio->io_error. You can read about it at the patch header
>>>> for the initial patchset at [1].
>>>>
>>>> Use case: It is for applications having two threads, a compute
>>>> thread and an I/O thread. It would try to push AIO as much as
>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>> would pass it on to I/O thread which would perform without
>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>
>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>
>>> Yes, I knew the concept, but I didn't see previous patches mentioned
>>> the -EAGAIN actually should be taken as a real IO error. This means a
>>> lot to applications and make the API hard to use. I'm wondering if we
>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
>>> only mean 'try again'.
>>
>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>> the API is hard to use? Do you have a case to back it up?
>
> Because it is hard to use, and potentially suboptimal. Let's say you're
> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
> short write, or do we return EWOULDBLOCK? If the latter, then that
> really sucks from an API point of view.
>
>> No, not splitting the bio does not make sense here. I do not see any
>> advantage in it, unless you can present a case otherwise.
>
> It ties back into the "hard to use" that I do agree with IFF we don't
> return the short write. It's hard for an application to use that
> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the
> full 1MB from a different context.
>
It returns only the error code, not a short read/write count. But isn't
that true of all system calls in case of error?
For aio, there are two result fields in io_event, one of which could be
used for the error while the other carries the number of bytes
read/written. However, only one is used. This will not work with
pread()/pwrite() calls, though, because they have a single return value.
Finally, what if EWOULDBLOCK is returned for an earlier bio (say, at
offset 128k) of a 1MB pwrite(), while the remaining seven 128k chunks
succeed? What short count should the system call return?
--
Goldwyn
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 11:38 ` Goldwyn Rodrigues
@ 2017-08-10 14:14 ` Jens Axboe
2017-08-10 17:15 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 14:14 UTC (permalink / raw)
To: Goldwyn Rodrigues, Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/10/2017 05:38 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/09/2017 09:18 PM, Jens Axboe wrote:
>> On 08/08/2017 02:36 PM, Jens Axboe wrote:
>>> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>>>> --- a/include/linux/blkdev.h
>>>>> +++ b/include/linux/blkdev.h
>>>>> @@ -633,6 +633,7 @@ struct request_queue {
>>>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>>>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>>>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>>>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>>>
>>> Does this work on 32-bit, where sizeof(unsigned long) == 32?
>>
>> I didn't get an answer to this one.
>>
>
> Oh, I assumed the question is rhetorical.
> No, it will not work on 32-bit. I was planning to change the field
> queue_flags to u64. Is that okay?
No. Besides, that would not work with set/test_bit() and friends. Grab
a free bit instead.
--
Jens Axboe
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 11:49 ` Goldwyn Rodrigues
@ 2017-08-10 14:23 ` Jens Axboe
2017-08-10 14:25 ` Jan Kara
1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 14:23 UTC (permalink / raw)
To: Goldwyn Rodrigues, Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel
On 08/10/2017 05:49 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>
>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>> Please see blk_queue_split.
>>>>>>>>
>>>>>>>
>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>> set).
>>>>>>
>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>> of the bio probably already dispatched to disk (if the bio is
>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>>>> block and dispatch to disk), what will application be going to do?
>>>>>> I think this is different to other IO errors. FOr other IO errors,
>>>>>> application will handle the error, while we ask app to retry the
>>>>>> whole bio here and app doesn't know part of bio is already written
>>>>>> to disk.
>>>>>
>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>> not know which bio of all submitted bio's returned the error EIO.
>>>>> The application would and should consider the whole I/O as failed.
>>>>>
>>>>> The user application does not know of bios, or how it is going to be
>>>>> split in the underlying layers. It knows at the system call level.
>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>>> out using dio->io_error. You can read about it at the patch header
>>>>> for the initial patchset at [1].
>>>>>
>>>>> Use case: It is for applications having two threads, a compute
>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>> would pass it on to I/O thread which would perform without
>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>
>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>>
>>>> Yes, I knew the concept, but I didn't see previous patches mentioned
>>>> the -EAGAIN actually should be taken as a real IO error. This means a
>>>> lot to applications and make the API hard to use. I'm wondering if we
>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
>>>> only mean 'try again'.
>>>
>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>> the API is hard to use? Do you have a case to back it up?
>>
>> Because it is hard to use, and potentially suboptimal. Let's say you're
>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>> short write, or do we return EWOULDBLOCK? If the latter, then that
>> really sucks from an API point of view.
>>
>>> No, not splitting the bio does not make sense here. I do not see any
>>> advantage in it, unless you can present a case otherwise.
>>
>> It ties back into the "hard to use" that I do agree with IFF we don't
>> return the short write. It's hard for an application to use that
>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the
>> full 1MB from a different context.
>>
>
> It returns the error code only and not short reads/writes. But isn't
> that true for all system calls in case of error?
It's not a hard error. If you wrote 896K in the example above, I'd
really expect the return value to be 896*1024. The API is hard to use
efficiently if that's not the case.
> For aio, there are two result fields in io_event out of which one could
> be used for error while the other be used for amount of writes/reads
> performed. However, only one is used. This will not work with
> pread()/pwrite() calls though because of the limitation of return values.
Don't invent something new for this; the mechanism for returning a short
read or write already exists. That's how all of them have worked for
decades.
> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
> successful. What short return value should the system call return?
It should return 128*1024, since that's how much was successfully done
from the start offset. But yes, this is exactly the point that I brought
up, and why Shaohua's suggestion to perhaps treat splits differently
should not be discarded so quickly.
--
Jens Axboe
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 11:49 ` Goldwyn Rodrigues
2017-08-10 14:23 ` Jens Axboe
@ 2017-08-10 14:25 ` Jan Kara
2017-08-10 14:28 ` Jens Axboe
1 sibling, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-08-10 14:25 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: Jens Axboe, Shaohua Li, linux-block, hch, jack, linux-raid,
dm-devel
On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
> On 08/09/2017 09:17 PM, Jens Axboe wrote:
> > On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
> >>>>>>>> No, from a multi-device point of view, this is inconsistent. I
> >>>>>>>> have tried the request bio returns -EAGAIN before the split, but
> >>>>>>>> I shall check again. Where do you see this happening?
> >>>>>>>
> >>>>>>> No, this isn't multi-device specific, any driver can do it.
> >>>>>>> Please see blk_queue_split.
> >>>>>>>
> >>>>>>
> >>>>>> In that case, the bio end_io function is chained and the bio of
> >>>>>> the split will replicate the error to the parent (if not already
> >>>>>> set).
> >>>>>
> >>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
> >>>>> of the bio probably already dispatched to disk (if the bio is
> >>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
> >>>>> block and dispatch to disk), what will application be going to do?
> >>>>> I think this is different to other IO errors. FOr other IO errors,
> >>>>> application will handle the error, while we ask app to retry the
> >>>>> whole bio here and app doesn't know part of bio is already written
> >>>>> to disk.
> >>>>
> >>>> It is the same as for other I/O errors as well, such as EIO. You do
> >>>> not know which bio of all submitted bio's returned the error EIO.
> >>>> The application would and should consider the whole I/O as failed.
> >>>>
> >>>> The user application does not know of bios, or how it is going to be
> >>>> split in the underlying layers. It knows at the system call level.
> >>>> In this case, the EAGAIN will be returned to the user for the whole
> >>>> I/O not as a part of the I/O. It is up to application to try the I/O
> >>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
> >>>> out using dio->io_error. You can read about it at the patch header
> >>>> for the initial patchset at [1].
> >>>>
> >>>> Use case: It is for applications having two threads, a compute
> >>>> thread and an I/O thread. It would try to push AIO as much as
> >>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
> >>>> would pass it on to I/O thread which would perform without
> >>>> RWF_NOWAIT. End result if done right is you save on context switches
> >>>> and all the synchronization/messaging machinery to perform I/O.
> >>>>
> >>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
> >>>
> >>> Yes, I knew the concept, but I didn't see previous patches mentioned
> >>> the -EAGAIN actually should be taken as a real IO error. This means a
> >>> lot to applications and make the API hard to use. I'm wondering if we
> >>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
> >>> only mean 'try again'.
> >>
> >> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
> >> the API is hard to use? Do you have a case to back it up?
> >
> > Because it is hard to use, and potentially suboptimal. Let's say you're
> > doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
> > short write, or do we return EWOULDBLOCK? If the latter, then that
> > really sucks from an API point of view.
> >
> >> No, not splitting the bio does not make sense here. I do not see any
> >> advantage in it, unless you can present a case otherwise.
> >
> > It ties back into the "hard to use" that I do agree with IFF we don't
> > return the short write. It's hard for an application to use that
> > efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the
> > full 1MB from a different context.
> >
>
> It returns the error code only and not short reads/writes. But isn't
> that true for all system calls in case of error?
>
> For aio, there are two result fields in io_event out of which one could
> be used for error while the other be used for amount of writes/reads
> performed. However, only one is used. This will not work with
> pread()/pwrite() calls though because of the limitation of return values.
>
> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
> successful. What short return value should the system call return?
This is indeed tricky. If an application submits a 1MB write, I don't
think we can afford to just write an arbitrary subset of it. That, IMHO,
violates too much how writes have traditionally behaved. Even short
writes trigger bugs in various applications, but I'm willing to require
that applications using NOWAIT IO can handle those. However, writing an
arbitrary subset looks like a nasty catch. IMHO, when splitting a larger
bio we should not submit further bios until we are sure the current one
does not return EWOULDBLOCK...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 14:25 ` Jan Kara
@ 2017-08-10 14:28 ` Jens Axboe
2017-08-10 17:15 ` Goldwyn Rodrigues
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 14:28 UTC (permalink / raw)
To: Jan Kara, Goldwyn Rodrigues
Cc: Shaohua Li, linux-block, hch, linux-raid, dm-devel
On 08/10/2017 08:25 AM, Jan Kara wrote:
> On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
>> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>>
>>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>>> Please see blk_queue_split.
>>>>>>>>>
>>>>>>>>
>>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>>> set).
>>>>>>>
>>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>>> of the bio probably already dispatched to disk (if the bio is
>>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>>>>> block and dispatch to disk), what will application be going to do?
>>>>>>> I think this is different to other IO errors. FOr other IO errors,
>>>>>>> application will handle the error, while we ask app to retry the
>>>>>>> whole bio here and app doesn't know part of bio is already written
>>>>>>> to disk.
>>>>>>
>>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>>> not know which bio of all submitted bio's returned the error EIO.
>>>>>> The application would and should consider the whole I/O as failed.
>>>>>>
>>>>>> The user application does not know of bios, or how it is going to be
>>>>>> split in the underlying layers. It knows at the system call level.
>>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>>>> out using dio->io_error. You can read about it at the patch header
>>>>>> for the initial patchset at [1].
>>>>>>
>>>>>> Use case: It is for applications having two threads, a compute
>>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>>> would pass it on to I/O thread which would perform without
>>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>>
>>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>>>
>>>>> Yes, I knew the concept, but I didn't see previous patches mentioned
>>>>> the -EAGAIN actually should be taken as a real IO error. This means a
>>>>> lot to applications and make the API hard to use. I'm wondering if we
>>>>> should disable bio split for NOWAIT bio, which will make the -EAGAIN
>>>>> only mean 'try again'.
>>>>
>>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>>> the API is hard to use? Do you have a case to back it up?
>>>
>>> Because it is hard to use, and potentially suboptimal. Let's say you're
>>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>>> short write, or do we return EWOULDBLOCK? If the latter, then that
>>> really sucks from an API point of view.
>>>
>>>> No, not splitting the bio does not make sense here. I do not see any
>>>> advantage in it, unless you can present a case otherwise.
>>>
>>> It ties back into the "hard to use" that I do agree with IFF we don't
>>> return the short write. It's hard for an application to use that
>>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, the re-write the
>>> full 1MB from a different context.
>>>
>>
>> It returns the error code only and not short reads/writes. But isn't
>> that true for all system calls in case of error?
>>
>> For aio, there are two result fields in io_event out of which one could
>> be used for error while the other be used for amount of writes/reads
>> performed. However, only one is used. This will not work with
>> pread()/pwrite() calls though because of the limitation of return values.
>>
>> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say
>> offset 128k) for a 1MB pwrite(), while the rest of the 7 128K are
>> successful. What short return value should the system call return?
>
> This is indeed tricky. If an application submits 1MB write, I don't think
> we can afford to just write arbitrary subset of it. That just IMHO too much
> violates how writes traditionally behaved. Even short writes trigger bugs
> in various applications but I'm willing to require that applications using
> NOWAIT IO can handle these. However writing arbitrary subset looks like a
> nasty catch. IMHO we should not submit further bios until we are sure
> current one does not return EWOULDBLOCK when splitting a larger one...
Exactly, that's the point that both Shaohua and I were getting at. Short
writes should be fine, especially if NOWAIT is set. Discontig writes
should also be OK, but they're horrible and inefficient. If we do that,
then using this feature is a net loss, not a win by any stretch.
--
Jens Axboe
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 14:28 ` Jens Axboe
@ 2017-08-10 17:15 ` Goldwyn Rodrigues
2017-08-10 17:20 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-10 17:15 UTC (permalink / raw)
To: Jens Axboe, Jan Kara; +Cc: Shaohua Li, linux-block, hch, linux-raid, dm-devel
On 08/10/2017 09:28 AM, Jens Axboe wrote:
> On 08/10/2017 08:25 AM, Jan Kara wrote:
>> On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
>>> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>>>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>>>
>>>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>>>> Please see blk_queue_split.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>>>> set).
>>>>>>>>
>>>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>>>> of the bio probably already dispatched to disk (if the bio is
>>>>>>>> splitted to 2 bios, one returns -EAGAIN, the other one doesn't
>>>>>>>> block and dispatch to disk), what will application be going to do?
>>>>>>>> I think this is different to other IO errors. FOr other IO errors,
>>>>>>>> application will handle the error, while we ask app to retry the
>>>>>>>> whole bio here and app doesn't know part of bio is already written
>>>>>>>> to disk.
>>>>>>>
>>>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>>>> not know which bio of all submitted bio's returned the error EIO.
>>>>>>> The application would and should consider the whole I/O as failed.
>>>>>>>
>>>>>>> The user application does not know of bios, or how it is going to be
>>>>>>> split in the underlying layers. It knows at the system call level.
>>>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>>>> I/O not as a part of the I/O. It is up to application to try the I/O
>>>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>>>>> out using dio->io_error. You can read about it at the patch header
>>>>>>> for the initial patchset at [1].
>>>>>>>
>>>>>>> Use case: It is for applications having two threads, a compute
>>>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>>>> would pass it on to I/O thread which would perform without
>>>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>>>
>>>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
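
That two-thread pattern looks roughly like this from userspace. This is a
sketch, not part of the patchset: it assumes pwritev2() with RWF_NOWAIT (the
syscall-level form of the flag), and the helper name is made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008   /* value from <linux/fs.h>; older libcs lack it */
#endif

/* Compute-thread attempt: write without blocking.  If the kernel says it
 * would block (EAGAIN) -- or cannot honour the flag at all (EOPNOTSUPP /
 * EINVAL on older kernels) -- fall back to the blocking path that the
 * dedicated I/O thread would normally take. */
static ssize_t write_nowait_or_fallback(int fd, const void *buf,
                                        size_t len, off_t off)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

    if (ret < 0 && (errno == EAGAIN || errno == EOPNOTSUPP || errno == EINVAL))
        ret = pwritev2(fd, &iov, 1, off, 0);    /* blocking fallback */
    return ret;
}
```

In a real application the failed attempt would be handed off to the dedicated
I/O thread instead of being retried inline; it is folded into one helper here
only to show the errno handling.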
>>>>>>
>>>>>> Yes, I knew the concept, but I didn't see the previous patches mention
>>>>>> that -EAGAIN should actually be taken as a real IO error. This means a
>>>>>> lot to applications and makes the API hard to use. I'm wondering if we
>>>>>> should disable bio split for NOWAIT bios, which would make the -EAGAIN
>>>>>> only mean 'try again'.
>>>>>
>>>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>>>> the API is hard to use? Do you have a case to back it up?
>>>>
>>>> Because it is hard to use, and potentially suboptimal. Let's say you're
>>>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>>>> short write, or do we return EWOULDBLOCK? If the latter, then that
>>>> really sucks from an API point of view.
>>>>
>>>>> No, not splitting the bio does not make sense here. I do not see any
>>>>> advantage in it, unless you can present a case otherwise.
>>>>
>>>> It ties back into the "hard to use" that I do agree with IFF we don't
>>>> return the short write. It's hard for an application to use that
>>>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, then re-write the
>>>> full 1MB from a different context.
>>>>
>>>
>>> It returns the error code only and not short reads/writes. But isn't
>>> that true for all system calls in case of error?
>>>
>>> For aio, there are two result fields in io_event, out of which one could
>>> be used for the error while the other for the amount read or written.
>>> However, only one is used. This will not work with pread()/pwrite()
>>> calls though, because of the limitation of return values.
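
For reference, a sketch of those two result fields, mirroring struct io_event
from <linux/aio_abi.h>. The names here are local to the sketch; the point is
the convention that res carries either the byte count or a negative errno:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the layout of struct io_event in <linux/aio_abi.h>. */
struct io_event_sketch {
    uint64_t data;   /* user data copied from the submitted iocb */
    uint64_t obj;    /* pointer to the iocb itself */
    int64_t  res;    /* bytes transferred, or a negative errno */
    int64_t  res2;   /* second result field, left at 0 today */
};

/* Decode one completion the way current kernels fill it in: res is
 * either a byte count or -errno; res2 is not used for a short count. */
static int64_t aio_result(const struct io_event_sketch *ev, int *err)
{
    if (ev->res < 0) {
        *err = (int)-ev->res;   /* e.g. EAGAIN for a NOWAIT rejection */
        return -1;
    }
    *err = 0;
    return ev->res;
}
```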
>>>
>>> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say at
>>> offset 128K) of a 1MB pwrite(), while the other seven 128K bios are
>>> successful. What short return value should the system call return?
>>
>> This is indeed tricky. If an application submits a 1MB write, I don't think
>> we can afford to just write an arbitrary subset of it. That IMHO violates
>> too much how writes have traditionally behaved. Even short writes trigger
>> bugs in various applications, but I'm willing to require that applications
>> using NOWAIT IO can handle those. However, writing an arbitrary subset looks
>> like a nasty catch. IMHO we should not submit further bios until we are sure
>> the current one does not return EWOULDBLOCK when splitting a larger one...
>
> Exactly, that's the point that both Shaohua and I were getting at. Short
> writes should be fine, especially if NOWAIT is set. Discontig writes
> should also be OK, but it's horrible and inefficient. If we do that,
> then using this feature is a net-loss, not a win by any stretch.
>
To make sure I understand this: we disable bio splits for NOWAIT bios, so
we return EWOULDBLOCK for the entire I/O.
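
A toy model of that "all or nothing" rule, with made-up names (this is not
the block-layer code, just the semantics: either every chunk of the request
can be queued without blocking, or nothing is submitted and -EAGAIN is
returned):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model: a large request is carved into per-queue-limit chunks, but a
 * NOWAIT submission first checks that *every* chunk can get a slot without
 * blocking.  Either the whole request is queued or nothing is, so -EAGAIN
 * can never mean "partially written". */
struct toy_queue {
    size_t max_chunk;   /* split boundary, e.g. 128 KiB */
    int    free_slots;  /* requests that can be queued without blocking */
};

static long toy_submit_nowait(struct toy_queue *q, size_t bytes)
{
    size_t chunks = (bytes + q->max_chunk - 1) / q->max_chunk;

    if (chunks > (size_t)q->free_slots)
        return -EAGAIN;             /* would block: submit nothing */
    q->free_slots -= (int)chunks;   /* reserve, then queue all chunks */
    return (long)bytes;
}
```

With a 128K split boundary, a 1MB request either reserves all eight chunks up
front or fails as a whole.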
--
Goldwyn
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 14:14 ` Jens Axboe
@ 2017-08-10 17:15 ` Goldwyn Rodrigues
2017-08-10 17:17 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Goldwyn Rodrigues @ 2017-08-10 17:15 UTC (permalink / raw)
To: Jens Axboe, Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/10/2017 09:14 AM, Jens Axboe wrote:
> On 08/10/2017 05:38 AM, Goldwyn Rodrigues wrote:
>>
>>
>> On 08/09/2017 09:18 PM, Jens Axboe wrote:
>>> On 08/08/2017 02:36 PM, Jens Axboe wrote:
>>>> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>>>>> --- a/include/linux/blkdev.h
>>>>>> +++ b/include/linux/blkdev.h
>>>>>> @@ -633,6 +633,7 @@ struct request_queue {
>>>>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>>>>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>>>>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>>>>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>>>>
>>>> Does this work on 32-bit, where sizeof(unsigned long) == 32?
>>>
>>> I didn't get an answer to this one.
>>>
>>
>> Oh, I assumed the question is rhetorical.
>> No, it will not work on 32-bit. I was planning to change the field
>> queue_flags to u64. Is that okay?
>
> No, besides that would not work with set/test_bit() and friends. Grab
> a free bit instead.
>
Which bit is free? I don't see any gaps in QUEUE_FLAG_*, and I am not
sure whether any of them is unused.
--
Goldwyn
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 17:15 ` Goldwyn Rodrigues
@ 2017-08-10 17:17 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 17:17 UTC (permalink / raw)
To: Goldwyn Rodrigues, Shaohua Li
Cc: linux-block, hch, jack, linux-raid, dm-devel, Goldwyn Rodrigues
On 08/10/2017 11:15 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/10/2017 09:14 AM, Jens Axboe wrote:
>> On 08/10/2017 05:38 AM, Goldwyn Rodrigues wrote:
>>>
>>>
>>> On 08/09/2017 09:18 PM, Jens Axboe wrote:
>>>> On 08/08/2017 02:36 PM, Jens Axboe wrote:
>>>>> On 08/08/2017 02:32 PM, Shaohua Li wrote:
>>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>>> index 25f6a0cb27d3..fae021ebec1b 100644
>>>>>>> --- a/include/linux/blkdev.h
>>>>>>> +++ b/include/linux/blkdev.h
>>>>>>> @@ -633,6 +633,7 @@ struct request_queue {
>>>>>>> #define QUEUE_FLAG_REGISTERED 29 /* queue has been registered to a disk */
>>>>>>> #define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
>>>>>>> #define QUEUE_FLAG_QUIESCED 31 /* queue has been quiesced */
>>>>>>> +#define QUEUE_FLAG_NOWAIT 32 /* stack device driver supports REQ_NOWAIT */
>>>>>
>>>>> Does this work on 32-bit, where sizeof(unsigned long) == 32?
>>>>
>>>> I didn't get an answer to this one.
>>>>
>>>
>>> Oh, I assumed the question is rhetorical.
>>> No, it will not work on 32-bit. I was planning to change the field
>>> queue_flags to u64. Is that okay?
>>
>> No, besides that would not work with set/test_bit() and friends. Grab
>> a free bit instead.
>>
>
> Which bit is free? I don't see any gaps in QUEUE_FLAG_*, and I am not
> sure whether any of them is unused.
Bit 0 is free in mainline, and I just looked, and two other bits are
gone as well:
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.14/block&id=e743eb1ecd5564b5ae0a4a76c1566f748a358839
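
The underlying constraint is that queue_flags lives in a single unsigned long
and is manipulated with set_bit()/test_bit(), so flag numbers must stay below
BITS_PER_LONG. A userspace sketch of that invariant (toy names, not the
kernel helpers):

```c
#include <assert.h>

/* queue_flags in struct request_queue is one unsigned long used as a
 * bitmap, so on a 32-bit kernel only flag numbers 0..31 exist.  A new
 * flag has to reuse a freed bit rather than grow past the word. */
#define TOY_QUEUE_FLAG_QUIESCED 31
#define TOY_QUEUE_FLAG_NOWAIT    0   /* bit 0 is free in for-4.14/block */

_Static_assert(TOY_QUEUE_FLAG_QUIESCED < 8 * sizeof(unsigned long),
               "flag number must fit in the queue_flags word");

static unsigned long toy_flag_set(unsigned long flags, unsigned int bit)
{
    return flags | (1UL << bit);
}

static int toy_flag_test(unsigned long flags, unsigned int bit)
{
    return (int)((flags >> bit) & 1UL);
}
```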
--
Jens Axboe
* Re: [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait
2017-08-10 17:15 ` Goldwyn Rodrigues
@ 2017-08-10 17:20 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-08-10 17:20 UTC (permalink / raw)
To: Goldwyn Rodrigues, Jan Kara
Cc: Shaohua Li, linux-block, hch, linux-raid, dm-devel
On 08/10/2017 11:15 AM, Goldwyn Rodrigues wrote:
>
>
> On 08/10/2017 09:28 AM, Jens Axboe wrote:
>> On 08/10/2017 08:25 AM, Jan Kara wrote:
>>> On Thu 10-08-17 06:49:53, Goldwyn Rodrigues wrote:
>>>> On 08/09/2017 09:17 PM, Jens Axboe wrote:
>>>>> On 08/09/2017 08:07 PM, Goldwyn Rodrigues wrote:
>>>>>>>>>>>> No, from a multi-device point of view, this is inconsistent. I
>>>>>>>>>>>> have tried the request bio returns -EAGAIN before the split, but
>>>>>>>>>>>> I shall check again. Where do you see this happening?
>>>>>>>>>>>
>>>>>>>>>>> No, this isn't multi-device specific, any driver can do it.
>>>>>>>>>>> Please see blk_queue_split.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In that case, the bio end_io function is chained and the bio of
>>>>>>>>>> the split will replicate the error to the parent (if not already
>>>>>>>>>> set).
>>>>>>>>>
>>>>>>>>> this doesn't answer my question. So if a bio returns -EAGAIN, part
>>>>>>>>> of the bio has probably already been dispatched to disk (if the bio
>>>>>>>>> is split into 2 bios, one returns -EAGAIN, the other doesn't block
>>>>>>>>> and is dispatched to disk), what is the application going to do?
>>>>>>>>> I think this is different to other IO errors. For other IO errors,
>>>>>>>>> the application will handle the error, while we ask the app to retry
>>>>>>>>> the whole bio here and the app doesn't know part of the bio is
>>>>>>>>> already written to disk.
>>>>>>>>
>>>>>>>> It is the same as for other I/O errors as well, such as EIO. You do
>>>>>>>> not know which of all the submitted bios returned the error EIO.
>>>>>>>> The application would and should consider the whole I/O as failed.
>>>>>>>>
>>>>>>>> The user application does not know of bios, or how it is going to be
>>>>>>>> split in the underlying layers. It knows at the system call level.
>>>>>>>> In this case, the EAGAIN will be returned to the user for the whole
>>>>>>>> I/O, not as a part of the I/O. It is up to the application to try the I/O
>>>>>>>> again with or without RWF_NOWAIT set. In direct I/O, it is bubbled
>>>>>>>> out using dio->io_error. You can read about it at the patch header
>>>>>>>> for the initial patchset at [1].
>>>>>>>>
>>>>>>>> Use case: It is for applications having two threads, a compute
>>>>>>>> thread and an I/O thread. It would try to push AIO as much as
>>>>>>>> possible in the compute thread using RWF_NOWAIT, and if it fails,
>>>>>>>> would pass it on to I/O thread which would perform without
>>>>>>>> RWF_NOWAIT. End result if done right is you save on context switches
>>>>>>>> and all the synchronization/messaging machinery to perform I/O.
>>>>>>>>
>>>>>>>> [1] http://marc.info/?l=linux-block&m=149789003305876&w=2
>>>>>>>
>>>>>>> Yes, I knew the concept, but I didn't see the previous patches mention
>>>>>>> that -EAGAIN should actually be taken as a real IO error. This means a
>>>>>>> lot to applications and makes the API hard to use. I'm wondering if we
>>>>>>> should disable bio split for NOWAIT bios, which would make the -EAGAIN
>>>>>>> only mean 'try again'.
>>>>>>
>>>>>> Don't take it as EAGAIN, but read it as EWOULDBLOCK. Why do you say
>>>>>> the API is hard to use? Do you have a case to back it up?
>>>>>
>>>>> Because it is hard to use, and potentially suboptimal. Let's say you're
>>>>> doing a 1MB write, we hit EWOULDBLOCK for the last split. Do we return a
>>>>> short write, or do we return EWOULDBLOCK? If the latter, then that
>>>>> really sucks from an API point of view.
>>>>>
>>>>>> No, not splitting the bio does not make sense here. I do not see any
>>>>>> advantage in it, unless you can present a case otherwise.
>>>>>
>>>>> It ties back into the "hard to use" that I do agree with IFF we don't
>>>>> return the short write. It's hard for an application to use that
>>>>> efficiently, if we write 1MB-128K but get EWOULDBLOCK, then re-write the
>>>>> full 1MB from a different context.
>>>>>
>>>>
>>>> It returns the error code only and not short reads/writes. But isn't
>>>> that true for all system calls in case of error?
>>>>
>>>> For aio, there are two result fields in io_event, out of which one could
>>>> be used for the error while the other for the amount read or written.
>>>> However, only one is used. This will not work with pread()/pwrite()
>>>> calls though, because of the limitation of return values.
>>>>
>>>> Finally, what if the EWOULDBLOCK is returned for an earlier bio (say at
>>>> offset 128K) of a 1MB pwrite(), while the other seven 128K bios are
>>>> successful. What short return value should the system call return?
>>>
>>> This is indeed tricky. If an application submits a 1MB write, I don't think
>>> we can afford to just write an arbitrary subset of it. That IMHO violates
>>> too much how writes have traditionally behaved. Even short writes trigger
>>> bugs in various applications, but I'm willing to require that applications
>>> using NOWAIT IO can handle those. However, writing an arbitrary subset looks
>>> like a nasty catch. IMHO we should not submit further bios until we are sure
>>> the current one does not return EWOULDBLOCK when splitting a larger one...
>>
>> Exactly, that's the point that both Shaohua and I were getting at. Short
>> writes should be fine, especially if NOWAIT is set. Discontig writes
>> should also be OK, but it's horrible and inefficient. If we do that,
>> then using this feature is a net-loss, not a win by any stretch.
>>
>
> To make sure I understand this: we disable bio splits for NOWAIT bios, so
> we return EWOULDBLOCK for the entire I/O.
That's also not great, since splitting is a common operation, and the majority
of splits can proceed without hitting out-of-resources. So ideally we'd
handle that case, but in a saner fashion than the laissez-faire approach
that the current patchset takes.
--
Jens Axboe
Thread overview: 37+ messages
2017-07-26 23:57 [PATCH 0/9] Nowait feature for stacked block devices Goldwyn Rodrigues
2017-07-26 23:57 ` [PATCH 1/9] QUEUE_FLAG_NOWAIT to indicate device supports nowait Goldwyn Rodrigues
2017-08-08 20:32 ` Shaohua Li
2017-08-08 20:36 ` Jens Axboe
2017-08-10 2:18 ` Jens Axboe
2017-08-10 11:38 ` Goldwyn Rodrigues
2017-08-10 14:14 ` Jens Axboe
2017-08-10 17:15 ` Goldwyn Rodrigues
2017-08-10 17:17 ` Jens Axboe
2017-08-09 11:44 ` Goldwyn Rodrigues
2017-08-09 15:02 ` Shaohua Li
2017-08-09 15:35 ` Goldwyn Rodrigues
2017-08-09 20:21 ` Shaohua Li
2017-08-09 22:16 ` Goldwyn Rodrigues
2017-08-10 1:17 ` Shaohua Li
2017-08-10 2:07 ` Goldwyn Rodrigues
2017-08-10 2:17 ` Jens Axboe
2017-08-10 11:49 ` Goldwyn Rodrigues
2017-08-10 14:23 ` Jens Axboe
2017-08-10 14:25 ` Jan Kara
2017-08-10 14:28 ` Jens Axboe
2017-08-10 17:15 ` Goldwyn Rodrigues
2017-08-10 17:20 ` Jens Axboe
2017-07-26 23:57 ` [PATCH 2/9] md: Add nowait support to md Goldwyn Rodrigues
2017-08-08 20:34 ` Shaohua Li
2017-07-26 23:58 ` [PATCH 3/9] md: raid1 nowait support Goldwyn Rodrigues
2017-08-08 20:39 ` Shaohua Li
2017-08-09 11:45 ` Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 4/9] md: raid5 " Goldwyn Rodrigues
2017-08-08 20:43 ` Shaohua Li
2017-08-09 11:45 ` Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 5/9] md: raid10 " Goldwyn Rodrigues
2017-08-08 20:40 ` Shaohua Li
2017-07-26 23:58 ` [PATCH 6/9] dm: add " Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 7/9] dm: Add nowait support to raid1 Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 8/9] dm: Add nowait support to dm-delay Goldwyn Rodrigues
2017-07-26 23:58 ` [PATCH 9/9] dm-mpath: Add nowait support Goldwyn Rodrigues