Linux block layer
 help / color / mirror / Atom feed
* [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT
@ 2025-03-14  2:11 Ming Lei
  2025-03-14  2:11 ` [PATCH V2 1/5] loop: simplify do_req_filebacked() Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon, Ming Lei

Hello Jens,

This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio
command to workqueue context, meantime refactor lo_rw_aio() a bit.

In my test VM, loop disk perf becomes very close to perf of the backing block
device(nvme/mq virtio-scsi).

And Mikulas verified that this way can improve 12jobs sequential rw io by
~5X, and basically solve the reported problem together with loop MQ change.

https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/

The loop MQ change will be posted as standalone patch, because it needs
losetup change.


Thanks,
Ming

V2:
	- patch style fix & cleanup (Christoph)
	- fix randwrite perf regression on sparse backing file
	- drop MQ change

Ming Lei (5):
  loop: simplify do_req_filebacked()
  loop: cleanup lo_rw_aio()
  loop: move command blkcg/memcg initialization into loop_queue_work
  loop: try to handle loop aio command via NOWAIT IO first
  loop: add hint for handling aio via IOCB_NOWAIT

 drivers/block/loop.c | 232 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 186 insertions(+), 46 deletions(-)

-- 
2.47.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V2 1/5] loop: simplify do_req_filebacked()
  2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
@ 2025-03-14  2:11 ` Ming Lei
  2025-03-20  7:10   ` Christoph Hellwig
  2025-03-14  2:11 ` [PATCH V2 2/5] loop: cleanup lo_rw_aio() Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon, Ming Lei

lo_rw_aio() is only called for READ/WRITE operation, which can be
figured out from request directly, so remove 'rw' parameter from
lo_rw_aio(), meantime rename the local variable as 'dir' which makes
the check more readable in lo_rw_aio().

Meantime add lo_rw_simple() so that do_req_filebacked() can be
simplified in the following way:

```
	if (cmd->use_aio)
		return lo_rw_aio(lo, cmd, pos);
	else
		return lo_rw_simple(lo, rq, pos);
```

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 657bf53decf3..554f6cefe8f6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -285,6 +285,13 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
 	return 0;
 }
 
+static int lo_rw_simple(struct loop_device *lo, struct request *rq, loff_t pos)
+{
+	if (req_op(rq) == REQ_OP_READ)
+		return lo_read_simple(lo, rq, pos);
+	return lo_write_simple(lo, rq, pos);
+}
+
 static void loop_clear_limits(struct loop_device *lo, int mode)
 {
 	struct queue_limits lim = queue_limits_start_update(lo->lo_queue);
@@ -400,13 +407,13 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
 	lo_rw_aio_do_completion(cmd);
 }
 
-static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
-		     loff_t pos, int rw)
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
 {
 	struct iov_iter iter;
 	struct req_iterator rq_iter;
 	struct bio_vec *bvec;
 	struct request *rq = blk_mq_rq_from_pdu(cmd);
+	int dir = (req_op(rq) == REQ_OP_READ) ? ITER_DEST : ITER_SOURCE;
 	struct bio *bio = rq->bio;
 	struct file *file = lo->lo_backing_file;
 	struct bio_vec tmp;
@@ -448,7 +455,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 	}
 	atomic_set(&cmd->ref, 2);
 
-	iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
+	iov_iter_bvec(&iter, dir, bvec, nr_bvec, blk_rq_bytes(rq));
 	iter.iov_offset = offset;
 
 	cmd->iocb.ki_pos = pos;
@@ -457,7 +464,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 	cmd->iocb.ki_flags = IOCB_DIRECT;
 	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
 
-	if (rw == ITER_SOURCE)
+	if (dir == ITER_SOURCE)
 		ret = file->f_op->write_iter(&cmd->iocb, &iter);
 	else
 		ret = file->f_op->read_iter(&cmd->iocb, &iter);
@@ -498,15 +505,11 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	case REQ_OP_DISCARD:
 		return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
 	case REQ_OP_WRITE:
-		if (cmd->use_aio)
-			return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
-		else
-			return lo_write_simple(lo, rq, pos);
 	case REQ_OP_READ:
 		if (cmd->use_aio)
-			return lo_rw_aio(lo, cmd, pos, ITER_DEST);
+			return lo_rw_aio(lo, cmd, pos);
 		else
-			return lo_read_simple(lo, rq, pos);
+			return lo_rw_simple(lo, rq, pos);
 	default:
 		WARN_ON_ONCE(1);
 		return -EIO;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH V2 2/5] loop: cleanup lo_rw_aio()
  2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
  2025-03-14  2:11 ` [PATCH V2 1/5] loop: simplify do_req_filebacked() Ming Lei
@ 2025-03-14  2:11 ` Ming Lei
  2025-03-20  7:11   ` Christoph Hellwig
  2025-03-14  2:11 ` [PATCH V2 3/5] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon, Ming Lei

Cleanup lo_rw_aio() a bit by refactoring it into three parts:

- lo_cmd_nr_bvec(), for calculating how many bvecs in this request

- lo_rw_aio_prep(), for preparing loop command, which need to be called
once

- lo_submit_rw_aio(), for submitting this lo command, which can be
called multiple times

Prepare for trying to handle loop command by NOWAIT read/write IO
first.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 81 ++++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 25 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 554f6cefe8f6..3f921d9d6de6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -394,7 +394,6 @@ static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
 	if (!atomic_dec_and_test(&cmd->ref))
 		return;
 	kfree(cmd->bvec);
-	cmd->bvec = NULL;
 	if (likely(!blk_should_fake_timeout(rq->q)))
 		blk_mq_complete_request(rq);
 }
@@ -407,24 +406,29 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
 	lo_rw_aio_do_completion(cmd);
 }
 
-static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
+static inline unsigned lo_cmd_nr_bvec(struct loop_cmd *cmd)
 {
-	struct iov_iter iter;
 	struct req_iterator rq_iter;
-	struct bio_vec *bvec;
 	struct request *rq = blk_mq_rq_from_pdu(cmd);
-	int dir = (req_op(rq) == REQ_OP_READ) ? ITER_DEST : ITER_SOURCE;
-	struct bio *bio = rq->bio;
-	struct file *file = lo->lo_backing_file;
 	struct bio_vec tmp;
-	unsigned int offset;
 	int nr_bvec = 0;
-	int ret;
 
 	rq_for_each_bvec(tmp, rq, rq_iter)
 		nr_bvec++;
 
+	return nr_bvec;
+}
+
+static int lo_rw_aio_prep(struct loop_device *lo, struct loop_cmd *cmd,
+			  unsigned nr_bvec)
+{
+	struct request *rq = blk_mq_rq_from_pdu(cmd);
+	loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
+
 	if (rq->bio != rq->biotail) {
+		struct req_iterator rq_iter;
+		struct bio_vec *bvec;
+		struct bio_vec tmp;
 
 		bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 				     GFP_NOIO);
@@ -442,35 +446,62 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
 			*bvec = tmp;
 			bvec++;
 		}
-		bvec = cmd->bvec;
-		offset = 0;
 	} else {
+		cmd->bvec = NULL;
+	}
+	cmd->iocb.ki_pos = pos;
+	cmd->iocb.ki_filp = lo->lo_backing_file;
+	cmd->iocb.ki_complete = lo_rw_aio_complete;
+	cmd->iocb.ki_flags = IOCB_DIRECT;
+	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
+
+	return 0;
+}
+
+static int lo_submit_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+			    int nr_bvec)
+{
+	struct request *rq = blk_mq_rq_from_pdu(cmd);
+	int dir = (req_op(rq) == REQ_OP_READ) ? ITER_DEST : ITER_SOURCE;
+	struct file *file = lo->lo_backing_file;
+	struct iov_iter iter;
+	int ret;
+
+	if (cmd->bvec) {
+		iov_iter_bvec(&iter, dir, cmd->bvec, nr_bvec, blk_rq_bytes(rq));
+		iter.iov_offset = 0;
+	} else {
+		struct bio *bio = rq->bio;
+		struct bio_vec *bvec = __bvec_iter_bvec(bio->bi_io_vec,
+				bio->bi_iter);
+
 		/*
 		 * Same here, this bio may be started from the middle of the
 		 * 'bvec' because of bio splitting, so offset from the bvec
 		 * must be passed to iov iterator
 		 */
-		offset = bio->bi_iter.bi_bvec_done;
-		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+		iov_iter_bvec(&iter, dir, bvec, nr_bvec, blk_rq_bytes(rq));
+		iter.iov_offset = bio->bi_iter.bi_bvec_done;
 	}
-	atomic_set(&cmd->ref, 2);
-
-	iov_iter_bvec(&iter, dir, bvec, nr_bvec, blk_rq_bytes(rq));
-	iter.iov_offset = offset;
-
-	cmd->iocb.ki_pos = pos;
-	cmd->iocb.ki_filp = file;
-	cmd->iocb.ki_complete = lo_rw_aio_complete;
-	cmd->iocb.ki_flags = IOCB_DIRECT;
-	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
 
+	atomic_set(&cmd->ref, 2);
 	if (dir == ITER_SOURCE)
 		ret = file->f_op->write_iter(&cmd->iocb, &iter);
 	else
 		ret = file->f_op->read_iter(&cmd->iocb, &iter);
-
 	lo_rw_aio_do_completion(cmd);
 
+	return ret;
+}
+
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd)
+{
+	unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
+	int ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+
+	if (ret >= 0)
+		ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
 	if (ret != -EIOCBQUEUED)
 		lo_rw_aio_complete(&cmd->iocb, ret);
 	return 0;
@@ -507,7 +538,7 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	case REQ_OP_WRITE:
 	case REQ_OP_READ:
 		if (cmd->use_aio)
-			return lo_rw_aio(lo, cmd, pos);
+			return lo_rw_aio(lo, cmd);
 		else
 			return lo_rw_simple(lo, rq, pos);
 	default:
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH V2 3/5] loop: move command blkcg/memcg initialization into loop_queue_work
  2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
  2025-03-14  2:11 ` [PATCH V2 1/5] loop: simplify do_req_filebacked() Ming Lei
  2025-03-14  2:11 ` [PATCH V2 2/5] loop: cleanup lo_rw_aio() Ming Lei
@ 2025-03-14  2:11 ` Ming Lei
  2025-03-20  7:14   ` Christoph Hellwig
  2025-03-14  2:11 ` [PATCH V2 4/5] loop: try to handle loop aio command via NOWAIT IO first Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon, Ming Lei

Move loop command blkcg/memcg initialization into loop_queue_work,
and prepare for supporting to handle loop io command by IOCB_NOWAIT.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3f921d9d6de6..81f6ba9bc911 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -878,11 +878,28 @@ static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
 
 static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
+	struct request __maybe_unused *rq = blk_mq_rq_from_pdu(cmd);
 	struct rb_node **node, *parent = NULL;
 	struct loop_worker *cur_worker, *worker = NULL;
 	struct work_struct *work;
 	struct list_head *cmd_list;
 
+	/* always use the first bio's css */
+	cmd->blkcg_css = NULL;
+	cmd->memcg_css = NULL;
+#ifdef CONFIG_BLK_CGROUP
+	if (rq->bio) {
+		cmd->blkcg_css = bio_blkcg_css(rq->bio);
+#ifdef CONFIG_MEMCG
+		if (cmd->blkcg_css) {
+			cmd->memcg_css =
+				cgroup_get_e_css(cmd->blkcg_css->cgroup,
+						&memory_cgrp_subsys);
+		}
+#endif
+	}
+#endif
+
 	spin_lock_irq(&lo->lo_work_lock);
 
 	if (queue_on_root_worker(cmd->blkcg_css))
@@ -1920,21 +1937,6 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		break;
 	}
 
-	/* always use the first bio's css */
-	cmd->blkcg_css = NULL;
-	cmd->memcg_css = NULL;
-#ifdef CONFIG_BLK_CGROUP
-	if (rq->bio) {
-		cmd->blkcg_css = bio_blkcg_css(rq->bio);
-#ifdef CONFIG_MEMCG
-		if (cmd->blkcg_css) {
-			cmd->memcg_css =
-				cgroup_get_e_css(cmd->blkcg_css->cgroup,
-						&memory_cgrp_subsys);
-		}
-#endif
-	}
-#endif
 	loop_queue_work(lo, cmd);
 
 	return BLK_STS_OK;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH V2 4/5] loop: try to handle loop aio command via NOWAIT IO first
  2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
                   ` (2 preceding siblings ...)
  2025-03-14  2:11 ` [PATCH V2 3/5] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
@ 2025-03-14  2:11 ` Ming Lei
  2025-03-14  2:11 ` [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
  2025-03-14  2:16 ` [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
  5 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon, Ming Lei

Try to handle loop aio command via NOWAIT IO first, then we can avoid to
queue the aio command into workqueue. This is usually one big win in
case that FS block mapping is stable, Mikulas verified [1] that this way
improves IO perf by close to 5X in 12jobs sequential read/write test.

Fallback to workqueue in case of -EAGAIN. This way may bring a little
cost from the 1st retry, but when running the following write test over
loop/sparse_file, the actual effect on randwrite is obvious:

```
truncate -s 4G 1.img    #1.img is created on XFS/virtio-scsi
losetup -f 1.img --direct-io=on
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
	--iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW
```

- RW=randwrite: obvious IOPS drop observed
- RW=write: a little drop(%5 - 10%)

This perf drop on randwrite over sparse file will be addressed in the
following patch.

BLK_MQ_F_BLOCKING has to be set for calling into .read_iter() or .write_iter()
which might sleep even though it is NOWAIT, and the only effect is that rcu read
lock is replaced with srcu read lock.

Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 51 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 81f6ba9bc911..542b1fe938a7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -92,6 +92,8 @@ struct loop_cmd {
 #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
 #define LOOP_DEFAULT_HW_Q_DEPTH 128
 
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd);
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 static DEFINE_MUTEX(loop_validate_mutex);
@@ -393,6 +395,15 @@ static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
 
 	if (!atomic_dec_and_test(&cmd->ref))
 		return;
+
+	/* -EAGAIN could be returned from bdev's ->ki_complete */
+	if (cmd->ret == -EAGAIN) {
+		struct loop_device *lo = rq->q->queuedata;
+
+		loop_queue_work(lo, cmd);
+		return;
+	}
+
 	kfree(cmd->bvec);
 	if (likely(!blk_should_fake_timeout(rq->q)))
 		blk_mq_complete_request(rq);
@@ -498,15 +509,38 @@ static int lo_submit_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd)
 {
 	unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
-	int ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+	int ret;
 
-	if (ret >= 0)
-		ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
+	/*
+	 * This command is prepared, and we have tried IOCB_NOWAIT, but got
+	 * -EAGAIN, so clear it now
+	 */
+	cmd->iocb.ki_flags &= ~IOCB_NOWAIT;
+	ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
 	if (ret != -EIOCBQUEUED)
 		lo_rw_aio_complete(&cmd->iocb, ret);
 	return 0;
 }
 
+static blk_status_t lo_rw_aio_nowait(struct loop_device *lo,
+		struct loop_cmd *cmd)
+{
+	unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
+	int ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+
+	if (unlikely(ret < 0))
+		return BLK_STS_IOERR;
+
+	cmd->iocb.ki_flags |= IOCB_NOWAIT;
+	ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
+	if (ret == -EAGAIN)
+		return BLK_STS_AGAIN;
+
+	if (ret != -EIOCBQUEUED)
+		lo_rw_aio_complete(&cmd->iocb, ret);
+	return BLK_STS_OK;
+}
+
 static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
@@ -1937,6 +1971,14 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		break;
 	}
 
+	if (cmd->use_aio) {
+		blk_status_t res = lo_rw_aio_nowait(lo, cmd);
+
+		if (res != BLK_STS_AGAIN)
+			return res;
+		/* fallback to workqueue for handling aio */
+	}
+
 	loop_queue_work(lo, cmd);
 
 	return BLK_STS_OK;
@@ -2087,7 +2129,8 @@ static int loop_add(int i)
 	lo->tag_set.queue_depth = hw_queue_depth;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT;
+	lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT |
+		BLK_MQ_F_BLOCKING;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT
  2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
                   ` (3 preceding siblings ...)
  2025-03-14  2:11 ` [PATCH V2 4/5] loop: try to handle loop aio command via NOWAIT IO first Ming Lei
@ 2025-03-14  2:11 ` Ming Lei
  2025-03-20  7:22   ` Christoph Hellwig
  2025-03-14  2:16 ` [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
  5 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon, Ming Lei

Add hint for using IOCB_NOWAIT to handle loop aio command for avoiding
to cause write(especially randwrite) perf regression on sparse file.

Try IOCB_NOWAIT in the following situations:

- backing file is block device

- READ aio command

- there isn't queued aio non-NOWAIT WRITE, since retry of NOWAIT won't
cause contention on WRITE and non-NOWAIT WRITE often implies exclusive
lock.

With this simple policy, perf regression of randwrte/write on sparse
backing file is fixed. Meantime this way addresses perf problem[1] in
case of stable FS block mapping via NOWAIT.

Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 542b1fe938a7..5bf14549cf8e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,6 +70,7 @@ struct loop_device {
 	struct rb_root          worker_tree;
 	struct timer_list       timer;
 	bool			sysfs_inited;
+	unsigned 		queued_wait_write;
 
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
@@ -522,6 +523,30 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd)
 	return 0;
 }
 
+static inline bool lo_aio_need_try_nowait(struct loop_device *lo,
+		struct loop_cmd *cmd)
+{
+	struct file *file = lo->lo_backing_file;
+	struct inode *inode = file->f_mapping->host;
+	struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+	/* NOWAIT works fine for backing block device */
+	if (S_ISBLK(inode->i_mode))
+		return true;
+
+	/* NOWAIT is supposed to be fine for READ */
+	if (req_op(rq) == REQ_OP_READ)
+		return true;
+
+	/*
+	 * If there is any queued non-NOWAIT async WRITE , don't try new
+	 * NOWAIT WRITE for avoiding contention
+	 *
+	 * Here we focus on handling stable FS block mapping via NOWAIT
+	 */
+	return READ_ONCE(lo->queued_wait_write) == 0;
+}
+
 static blk_status_t lo_rw_aio_nowait(struct loop_device *lo,
 		struct loop_cmd *cmd)
 {
@@ -531,6 +556,9 @@ static blk_status_t lo_rw_aio_nowait(struct loop_device *lo,
 	if (unlikely(ret < 0))
 		return BLK_STS_IOERR;
 
+	if (!lo_aio_need_try_nowait(lo, cmd))
+		return BLK_STS_AGAIN;
+
 	cmd->iocb.ki_flags |= IOCB_NOWAIT;
 	ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
 	if (ret == -EAGAIN)
@@ -821,12 +849,18 @@ static ssize_t loop_attr_dio_show(struct loop_device *lo, char *buf)
 	return sysfs_emit(buf, "%s\n", dio ? "1" : "0");
 }
 
+static ssize_t loop_attr_nr_wait_write_show(struct loop_device *lo, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", lo->queued_wait_write);
+}
+
 LOOP_ATTR_RO(backing_file);
 LOOP_ATTR_RO(offset);
 LOOP_ATTR_RO(sizelimit);
 LOOP_ATTR_RO(autoclear);
 LOOP_ATTR_RO(partscan);
 LOOP_ATTR_RO(dio);
+LOOP_ATTR_RO(nr_wait_write);
 
 static struct attribute *loop_attrs[] = {
 	&loop_attr_backing_file.attr,
@@ -835,6 +869,7 @@ static struct attribute *loop_attrs[] = {
 	&loop_attr_autoclear.attr,
 	&loop_attr_partscan.attr,
 	&loop_attr_dio.attr,
+	&loop_attr_nr_wait_write.attr,
 	NULL,
 };
 
@@ -910,6 +945,30 @@ static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
 }
 #endif
 
+static inline void loop_inc_wait_write(struct loop_device *lo, struct loop_cmd *cmd)
+{
+	lockdep_assert_held(&lo->lo_mutex);
+
+	if (cmd->use_aio){
+		struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+		if (req_op(rq) == REQ_OP_WRITE)
+			lo->queued_wait_write += 1;
+	}
+}
+
+static inline void loop_dec_wait_write(struct loop_device *lo, struct loop_cmd *cmd)
+{
+	lockdep_assert_held(&lo->lo_mutex);
+
+	if (cmd->use_aio){
+		struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+		if (req_op(rq) == REQ_OP_WRITE)
+			lo->queued_wait_write -= 1;
+	}
+}
+
 static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
 	struct request __maybe_unused *rq = blk_mq_rq_from_pdu(cmd);
@@ -992,6 +1051,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 		work = &lo->rootcg_work;
 		cmd_list = &lo->rootcg_cmd_list;
 	}
+	loop_inc_wait_write(lo, cmd);
 	list_add_tail(&cmd->list_entry, cmd_list);
 	queue_work(lo->workqueue, work);
 	spin_unlock_irq(&lo->lo_work_lock);
@@ -2051,6 +2111,7 @@ static void loop_process_work(struct loop_worker *worker,
 		cond_resched();
 
 		spin_lock_irq(&lo->lo_work_lock);
+		loop_dec_wait_write(lo, cmd);
 	}
 
 	/*
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT
  2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
                   ` (4 preceding siblings ...)
  2025-03-14  2:11 ` [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
@ 2025-03-14  2:16 ` Ming Lei
  5 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2025-03-14  2:16 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Mikulas Patocka
  Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
	Alasdair Kergon

On Fri, Mar 14, 2025 at 10:12 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hello Jens,
>
> This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio
> command to workqueue context, meantime refactor lo_rw_aio() a bit.
>
> In my test VM, loop disk perf becomes very close to perf of the backing block
> device(nvme/mq virtio-scsi).
>
> And Mikulas verified that this way can improve 12jobs sequential rw io by
> ~5X, and basically solve the reported problem together with loop MQ change.
>
> https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/
>
> The loop MQ change will be posted as standalone patch, because it needs
> losetup change.
>
>
> Thanks,
> Ming
>
> V2:
>         - patch style fix & cleanup (Christoph)
>         - fix randwrite perf regression on sparse backing file
>         - drop MQ change

Hi Mikulas,

Just realized that you are missing in Cc list and loop you in, sorry for that...

Thanks,


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 1/5] loop: simplify do_req_filebacked()
  2025-03-14  2:11 ` [PATCH V2 1/5] loop: simplify do_req_filebacked() Ming Lei
@ 2025-03-20  7:10   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2025-03-20  7:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jooyung Han,
	Mike Snitzer, zkabelac, dm-devel, Alasdair Kergon

On Fri, Mar 14, 2025 at 10:11:41AM +0800, Ming Lei wrote:
> lo_rw_aio() is only called for READ/WRITE operation, which can be
> figured out from request directly, so remove 'rw' parameter from
> lo_rw_aio(), meantime rename the local variable as 'dir' which makes
> the check more readable in lo_rw_aio().
> 
> Meantime add lo_rw_simple() so that do_req_filebacked() can be
> simplified in the following way:

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 2/5] loop: cleanup lo_rw_aio()
  2025-03-14  2:11 ` [PATCH V2 2/5] loop: cleanup lo_rw_aio() Ming Lei
@ 2025-03-20  7:11   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2025-03-20  7:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jooyung Han,
	Mike Snitzer, zkabelac, dm-devel, Alasdair Kergon

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 3/5] loop: move command blkcg/memcg initialization into loop_queue_work
  2025-03-14  2:11 ` [PATCH V2 3/5] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
@ 2025-03-20  7:14   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2025-03-20  7:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jooyung Han,
	Mike Snitzer, zkabelac, dm-devel, Alasdair Kergon, Dan Schatzberg

On Fri, Mar 14, 2025 at 10:11:43AM +0800, Ming Lei wrote:
> Move loop command blkcg/memcg initialization into loop_queue_work,
> and prepare for supporting to handle loop io command by IOCB_NOWAIT.

As a cosmetic change this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

But based on my digging into this code from the current discussion
I really thing we need to kill the blkcg loop_worker code with
prejudice ASAP in favor of a per-command work_struct.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT
  2025-03-14  2:11 ` [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
@ 2025-03-20  7:22   ` Christoph Hellwig
  2025-03-20  7:38     ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-03-20  7:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jooyung Han,
	Mike Snitzer, zkabelac, dm-devel, Alasdair Kergon

On Fri, Mar 14, 2025 at 10:11:45AM +0800, Ming Lei wrote:
> Add hint for using IOCB_NOWAIT to handle loop aio command for avoiding
> to cause write(especially randwrite) perf regression on sparse file.
> 
> Try IOCB_NOWAIT in the following situations:
> 
> - backing file is block device

Why limit yourself to block devices?

> - READ aio command
> - there isn't queued aio non-NOWAIT WRITE, since retry of NOWAIT won't
> cause contention on WRITE and non-NOWAIT WRITE often implies exclusive
> lock.

This reads really odd because to me the list implies that you only
support reads, but the code also supports writes.  Maybe try to
explain this more clearly.

> With this simple policy, perf regression of randwrte/write on sparse
> backing file is fixed. Meantime this way addresses perf problem[1] in
> case of stable FS block mapping via NOWAIT.

This needs to go in with the patch implementing the logic.

> @@ -70,6 +70,7 @@ struct loop_device {
>  	struct rb_root          worker_tree;
>  	struct timer_list       timer;
>  	bool			sysfs_inited;
> +	unsigned 		queued_wait_write;

lo_nr_blocking_writes?

What serializes access to this variable?

> +static inline bool lo_aio_need_try_nowait(struct loop_device *lo,
> +		struct loop_cmd *cmd)

Drop the need_ in the name, it not only is superfluous, but also
makes it really hard to read the function name.

Also the inline looks spurious.

> +LOOP_ATTR_RO(nr_wait_write);

nr_blocking_writes?

> +static inline void loop_inc_wait_write(struct loop_device *lo, struct loop_cmd *cmd)

Overly long line.

> +	if (cmd->use_aio){

missing whitespace.

> +		struct request *rq = blk_mq_rq_from_pdu(cmd);
> +
> +		if (req_op(rq) == REQ_OP_WRITE)
> +			lo->queued_wait_write += 1;


	if (cmd->use_aio && req_op(blk_mq_rq_from_pdu(cmd)) == REQ_OP_WRITE)
			lo->queued_wait_write++;

> +	}
> +}
> +
> +static inline void loop_dec_wait_write(struct loop_device *lo, struct loop_cmd *cmd)
> +{
> +	lockdep_assert_held(&lo->lo_mutex);
> +
> +	if (cmd->use_aio){
> +		struct request *rq = blk_mq_rq_from_pdu(cmd);
> +
> +		if (req_op(rq) == REQ_OP_WRITE)
> +			lo->queued_wait_write -= 1;
> +	}
> +}

All the things above apply here as well.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT
  2025-03-20  7:22   ` Christoph Hellwig
@ 2025-03-20  7:38     ` Ming Lei
  0 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2025-03-20  7:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jooyung Han, Mike Snitzer, zkabelac,
	dm-devel, Alasdair Kergon

On Thu, Mar 20, 2025 at 08:22:47AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 14, 2025 at 10:11:45AM +0800, Ming Lei wrote:
> > Add hint for using IOCB_NOWAIT to handle loop aio command for avoiding
> > to cause write(especially randwrite) perf regression on sparse file.
> > 
> > Try IOCB_NOWAIT in the following situations:
> > 
> > - backing file is block device
> 
> Why limit yourself to block devices?

It doesn't limit to block device, just submit NOWAIT unconditionally.

I should have added 'OR' among the three lines.

> 
> > - READ aio command
> > - there isn't queued aio non-NOWAIT WRITE, since retry of NOWAIT won't
> > cause contention on WRITE and non-NOWAIT WRITE often implies exclusive
> > lock.
> 
> This reads really odd because to me the list implies that you only
> support reads, but the code also supports writes.  Maybe try to
> explain this more clearly.

Will improve the comment log.

> 
> > With this simple policy, perf regression of randwrte/write on sparse
> > backing file is fixed. Meantime this way addresses perf problem[1] in
> > case of stable FS block mapping via NOWAIT.
> 
> This needs to go in with the patch implementing the logic.

OK.

> 
> > @@ -70,6 +70,7 @@ struct loop_device {
> >  	struct rb_root          worker_tree;
> >  	struct timer_list       timer;
> >  	bool			sysfs_inited;
> > +	unsigned 		queued_wait_write;
> 
> lo_nr_blocking_writes?
> 
> What serializes access to this variable?

The write is serialized by the loop spin lock, and the read is done
via READ_ONCE(), since it is just a hint.

> 
> > +static inline bool lo_aio_need_try_nowait(struct loop_device *lo,
> > +		struct loop_cmd *cmd)
> 
> Drop the need_ in the name, it not only is superfluous, but also
> makes it really hard to read the function name.

OK.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-03-20  7:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-14  2:11 [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
2025-03-14  2:11 ` [PATCH V2 1/5] loop: simplify do_req_filebacked() Ming Lei
2025-03-20  7:10   ` Christoph Hellwig
2025-03-14  2:11 ` [PATCH V2 2/5] loop: cleanup lo_rw_aio() Ming Lei
2025-03-20  7:11   ` Christoph Hellwig
2025-03-14  2:11 ` [PATCH V2 3/5] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
2025-03-20  7:14   ` Christoph Hellwig
2025-03-14  2:11 ` [PATCH V2 4/5] loop: try to handle loop aio command via NOWAIT IO first Ming Lei
2025-03-14  2:11 ` [PATCH V2 5/5] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
2025-03-20  7:22   ` Christoph Hellwig
2025-03-20  7:38     ` Ming Lei
2025-03-14  2:16 ` [PATCH V2 0/5] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox