* [PATCH V3 0/5] loop: improve loop aio perf by IOCB_NOWAIT
From: Ming Lei @ 2025-03-22 1:26 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka, Ming Lei
Hello Jens,
This patchset improves loop aio perf by using IOCB_NOWAIT to avoid queueing
aio commands to workqueue context, and refactors lo_rw_aio() a bit along the way.
In my test VM, loop disk perf becomes very close to that of the backing block
device (nvme/mq virtio-scsi).
Mikulas verified that this approach improves 12-job sequential rw IO by
~5X, and basically solves the reported problem together with the loop MQ change:
https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/
The loop MQ change will be posted as a standalone patch, because it needs
a losetup change.
Thanks,
Ming
V3:
- add reviewed-by tag
- rename variable & improve commit log & comment on 5/5 (Christoph)
V2:
- patch style fix & cleanup (Christoph)
- fix randwrite perf regression on sparse backing file
- drop MQ change
Ming Lei (5):
loop: simplify do_req_filebacked()
loop: cleanup lo_rw_aio()
loop: move command blkcg/memcg initialization into loop_queue_work
loop: try to handle loop aio command via NOWAIT IO first
loop: add hint for handling aio via IOCB_NOWAIT
drivers/block/loop.c | 227 ++++++++++++++++++++++++++++++++++---------
1 file changed, 181 insertions(+), 46 deletions(-)
--
2.47.0
* [PATCH V3 1/5] loop: simplify do_req_filebacked()
From: Ming Lei @ 2025-03-22 1:26 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka, Ming Lei
lo_rw_aio() is only called for READ/WRITE operations, and the direction
can be derived from the request directly, so remove the 'rw' parameter
from lo_rw_aio(). Also rename the local variable to 'dir', which makes
the check in lo_rw_aio() more readable.
Also add lo_rw_simple() so that do_req_filebacked() can be
simplified as follows:
```
	if (cmd->use_aio)
		return lo_rw_aio(lo, cmd, pos);
	else
		return lo_rw_simple(lo, rq, pos);
```
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 674527d770dc..339b19671450 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -277,6 +277,13 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
return 0;
}
+static int lo_rw_simple(struct loop_device *lo, struct request *rq, loff_t pos)
+{
+ if (req_op(rq) == REQ_OP_READ)
+ return lo_read_simple(lo, rq, pos);
+ return lo_write_simple(lo, rq, pos);
+}
+
static void loop_clear_limits(struct loop_device *lo, int mode)
{
struct queue_limits lim = queue_limits_start_update(lo->lo_queue);
@@ -392,13 +399,13 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
lo_rw_aio_do_completion(cmd);
}
-static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
- loff_t pos, int rw)
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
{
struct iov_iter iter;
struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
+ int dir = (req_op(rq) == REQ_OP_READ) ? ITER_DEST : ITER_SOURCE;
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
struct bio_vec tmp;
@@ -440,7 +447,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
}
atomic_set(&cmd->ref, 2);
- iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
+ iov_iter_bvec(&iter, dir, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
cmd->iocb.ki_pos = pos;
@@ -449,7 +456,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (rw == ITER_SOURCE)
+ if (dir == ITER_SOURCE)
ret = file->f_op->write_iter(&cmd->iocb, &iter);
else
ret = file->f_op->read_iter(&cmd->iocb, &iter);
@@ -490,15 +497,11 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
case REQ_OP_DISCARD:
return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
case REQ_OP_WRITE:
- if (cmd->use_aio)
- return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
- else
- return lo_write_simple(lo, rq, pos);
case REQ_OP_READ:
if (cmd->use_aio)
- return lo_rw_aio(lo, cmd, pos, ITER_DEST);
+ return lo_rw_aio(lo, cmd, pos);
else
- return lo_read_simple(lo, rq, pos);
+ return lo_rw_simple(lo, rq, pos);
default:
WARN_ON_ONCE(1);
return -EIO;
--
2.47.0
* [PATCH V3 2/5] loop: cleanup lo_rw_aio()
From: Ming Lei @ 2025-03-22 1:26 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka, Ming Lei
Clean up lo_rw_aio() a bit by refactoring it into three parts:
- lo_cmd_nr_bvec(), for calculating how many bvecs are in this request
- lo_rw_aio_prep(), for preparing the loop command, which needs to be
called only once
- lo_submit_rw_aio(), for submitting this loop command, which can be
called multiple times
This prepares for trying to handle loop commands via NOWAIT read/write IO
first.
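As a rough illustration of why the split matters, here is a minimal userspace sketch, not the kernel code: every sketch_* name and the 'fail_with_eagain' knob are hypothetical, made up for this example. It only shows that state set up once at prep time can be reused by a second submit after the first one fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a loop command; not the kernel's struct loop_cmd. */
struct sketch_cmd {
	const size_t *seg_len;		/* per-segment lengths of the request */
	size_t nr_segs;			/* counted once, reused by every submit */
	long long pos;			/* file offset fixed at prep time */
	unsigned int prep_calls;
	unsigned int submit_calls;
};

/* Part 1: count the segments of a zero-terminated length array. */
static size_t sketch_nr_segs(const size_t *seg_len)
{
	size_t n = 0;

	while (seg_len[n] != 0)
		n++;
	return n;
}

/* Part 2: one-time setup; nothing stored here is consumed by a submit. */
static int sketch_prep(struct sketch_cmd *cmd, const size_t *seg_len,
		       size_t nr_segs, long long pos)
{
	assert(cmd->prep_calls == 0);	/* must run only once per command */
	cmd->seg_len = seg_len;
	cmd->nr_segs = nr_segs;
	cmd->pos = pos;
	cmd->prep_calls++;
	return 0;
}

/*
 * Part 3: walk the prepared segments from the start on every call, so a
 * retry sees the whole request again. 'fail_with_eagain' is a test knob
 * standing in for a backend that cannot proceed without blocking.
 */
static long sketch_submit(struct sketch_cmd *cmd, bool fail_with_eagain)
{
	size_t total = 0;

	cmd->submit_calls++;
	if (fail_with_eagain)
		return -EAGAIN;
	for (size_t i = 0; i < cmd->nr_segs; i++)
		total += cmd->seg_len[i];
	return (long)total;
}
```

A caller would run sketch_nr_segs() and sketch_prep() once, then call sketch_submit() as many times as needed; only the submit step is repeated.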
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 81 ++++++++++++++++++++++++++++++--------------
1 file changed, 56 insertions(+), 25 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 339b19671450..419ca675342a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -386,7 +386,6 @@ static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
if (!atomic_dec_and_test(&cmd->ref))
return;
kfree(cmd->bvec);
- cmd->bvec = NULL;
if (likely(!blk_should_fake_timeout(rq->q)))
blk_mq_complete_request(rq);
}
@@ -399,24 +398,29 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
lo_rw_aio_do_completion(cmd);
}
-static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
+static inline unsigned lo_cmd_nr_bvec(struct loop_cmd *cmd)
{
- struct iov_iter iter;
struct req_iterator rq_iter;
- struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
- int dir = (req_op(rq) == REQ_OP_READ) ? ITER_DEST : ITER_SOURCE;
- struct bio *bio = rq->bio;
- struct file *file = lo->lo_backing_file;
struct bio_vec tmp;
- unsigned int offset;
int nr_bvec = 0;
- int ret;
rq_for_each_bvec(tmp, rq, rq_iter)
nr_bvec++;
+ return nr_bvec;
+}
+
+static int lo_rw_aio_prep(struct loop_device *lo, struct loop_cmd *cmd,
+ unsigned nr_bvec)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
+
if (rq->bio != rq->biotail) {
+ struct req_iterator rq_iter;
+ struct bio_vec *bvec;
+ struct bio_vec tmp;
bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
GFP_NOIO);
@@ -434,35 +438,62 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, loff_t pos)
*bvec = tmp;
bvec++;
}
- bvec = cmd->bvec;
- offset = 0;
} else {
+ cmd->bvec = NULL;
+ }
+ cmd->iocb.ki_pos = pos;
+ cmd->iocb.ki_filp = lo->lo_backing_file;
+ cmd->iocb.ki_complete = lo_rw_aio_complete;
+ cmd->iocb.ki_flags = IOCB_DIRECT;
+ cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
+
+ return 0;
+}
+
+static int lo_submit_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+ int nr_bvec)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ int dir = (req_op(rq) == REQ_OP_READ) ? ITER_DEST : ITER_SOURCE;
+ struct file *file = lo->lo_backing_file;
+ struct iov_iter iter;
+ int ret;
+
+ if (cmd->bvec) {
+ iov_iter_bvec(&iter, dir, cmd->bvec, nr_bvec, blk_rq_bytes(rq));
+ iter.iov_offset = 0;
+ } else {
+ struct bio *bio = rq->bio;
+ struct bio_vec *bvec = __bvec_iter_bvec(bio->bi_io_vec,
+ bio->bi_iter);
+
/*
* Same here, this bio may be started from the middle of the
* 'bvec' because of bio splitting, so offset from the bvec
* must be passed to iov iterator
*/
- offset = bio->bi_iter.bi_bvec_done;
- bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+ iov_iter_bvec(&iter, dir, bvec, nr_bvec, blk_rq_bytes(rq));
+ iter.iov_offset = bio->bi_iter.bi_bvec_done;
}
- atomic_set(&cmd->ref, 2);
-
- iov_iter_bvec(&iter, dir, bvec, nr_bvec, blk_rq_bytes(rq));
- iter.iov_offset = offset;
-
- cmd->iocb.ki_pos = pos;
- cmd->iocb.ki_filp = file;
- cmd->iocb.ki_complete = lo_rw_aio_complete;
- cmd->iocb.ki_flags = IOCB_DIRECT;
- cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
+ atomic_set(&cmd->ref, 2);
if (dir == ITER_SOURCE)
ret = file->f_op->write_iter(&cmd->iocb, &iter);
else
ret = file->f_op->read_iter(&cmd->iocb, &iter);
-
lo_rw_aio_do_completion(cmd);
+ return ret;
+}
+
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd)
+{
+ unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
+ int ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+
+ if (ret >= 0)
+ ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
if (ret != -EIOCBQUEUED)
lo_rw_aio_complete(&cmd->iocb, ret);
return 0;
@@ -499,7 +530,7 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
case REQ_OP_WRITE:
case REQ_OP_READ:
if (cmd->use_aio)
- return lo_rw_aio(lo, cmd, pos);
+ return lo_rw_aio(lo, cmd);
else
return lo_rw_simple(lo, rq, pos);
default:
--
2.47.0
* [PATCH V3 3/5] loop: move command blkcg/memcg initialization into loop_queue_work
From: Ming Lei @ 2025-03-22 1:26 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka, Ming Lei
Move loop command blkcg/memcg initialization into loop_queue_work(),
to prepare for handling loop io commands via IOCB_NOWAIT.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 419ca675342a..c14da87efb07 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -877,11 +877,28 @@ static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
+ struct request __maybe_unused *rq = blk_mq_rq_from_pdu(cmd);
struct rb_node **node, *parent = NULL;
struct loop_worker *cur_worker, *worker = NULL;
struct work_struct *work;
struct list_head *cmd_list;
+ /* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
+#ifdef CONFIG_BLK_CGROUP
+ if (rq->bio) {
+ cmd->blkcg_css = bio_blkcg_css(rq->bio);
+#ifdef CONFIG_MEMCG
+ if (cmd->blkcg_css) {
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+ }
+#endif
+ }
+#endif
+
spin_lock_irq(&lo->lo_work_lock);
if (queue_on_root_worker(cmd->blkcg_css))
@@ -1926,21 +1943,6 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
- /* always use the first bio's css */
- cmd->blkcg_css = NULL;
- cmd->memcg_css = NULL;
-#ifdef CONFIG_BLK_CGROUP
- if (rq->bio) {
- cmd->blkcg_css = bio_blkcg_css(rq->bio);
-#ifdef CONFIG_MEMCG
- if (cmd->blkcg_css) {
- cmd->memcg_css =
- cgroup_get_e_css(cmd->blkcg_css->cgroup,
- &memory_cgrp_subsys);
- }
-#endif
- }
-#endif
loop_queue_work(lo, cmd);
return BLK_STS_OK;
--
2.47.0
* [PATCH V3 4/5] loop: try to handle loop aio command via NOWAIT IO first
From: Ming Lei @ 2025-03-22 1:26 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka, Ming Lei
Try to handle loop aio commands via NOWAIT IO first, so that we can avoid
queueing the aio command into the workqueue. This is usually a big win
when the FS block mapping is stable. Mikulas verified [1] that this approach
improves IO perf by close to 5X in a 12-job sequential read/write test,
in which the FS block mapping is stable.
Fall back to the workqueue in case of -EAGAIN. This adds a small
cost for the first retry, but when running the following write test over
loop/sparse_file, the actual effect on randwrite is obvious:
```
truncate -s 4G 1.img #1.img is created on XFS/virtio-scsi
losetup -f 1.img --direct-io=on
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
--iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW
```
- RW=randwrite: obvious IOPS drop observed
- RW=write: a small drop (5% - 10%)
This perf drop on randwrite over a sparse file will be addressed in the
following patch.
BLK_MQ_F_BLOCKING has to be set for calling into .read_iter() or .write_iter(),
which might sleep even with NOWAIT. The only effect is that the rcu read
lock is replaced with the srcu read lock.
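The NOWAIT-first, workqueue-on-EAGAIN flow described above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct sketch_backend, sketch_io(), sketch_dispatch() and sketch_worker() are hypothetical names, and the 'would_block' field simply models a backing file whose IO cannot complete without sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_path { SKETCH_INLINE, SKETCH_WORKQUEUE };

/* Hypothetical backend model; not a kernel structure. */
struct sketch_backend {
	bool would_block;		/* IO cannot finish without sleeping */
	unsigned int nowait_attempts;	/* non-blocking tries from dispatch */
	unsigned int queued;		/* commands handed to the worker */
};

/* Issue the IO; with 'nowait' set, refuse with -EAGAIN instead of blocking. */
static int sketch_io(struct sketch_backend *be, bool nowait)
{
	assert(be != NULL);
	if (nowait) {
		be->nowait_attempts++;
		if (be->would_block)
			return -EAGAIN;
	}
	return 0;
}

/* Dispatch side: try NOWAIT first, queue to the worker only on -EAGAIN. */
static enum sketch_path sketch_dispatch(struct sketch_backend *be)
{
	if (sketch_io(be, true) != -EAGAIN)
		return SKETCH_INLINE;

	be->queued++;
	return SKETCH_WORKQUEUE;
}

/* Worker side: blocking is allowed here, so retry with NOWAIT cleared. */
static int sketch_worker(struct sketch_backend *be)
{
	return sketch_io(be, false);
}
```

The cost mentioned above shows up in this model as the one extra sketch_io() call on the would-block path before the command reaches the worker.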
Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 51 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 47 insertions(+), 4 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c14da87efb07..3baabf150488 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -90,6 +90,8 @@ struct loop_cmd {
#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
#define LOOP_DEFAULT_HW_Q_DEPTH 128
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd);
+
static DEFINE_IDR(loop_index_idr);
static DEFINE_MUTEX(loop_ctl_mutex);
static DEFINE_MUTEX(loop_validate_mutex);
@@ -385,6 +387,15 @@ static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
if (!atomic_dec_and_test(&cmd->ref))
return;
+
+ /* -EAGAIN could be returned from bdev's ->ki_complete */
+ if (cmd->ret == -EAGAIN) {
+ struct loop_device *lo = rq->q->queuedata;
+
+ loop_queue_work(lo, cmd);
+ return;
+ }
+
kfree(cmd->bvec);
if (likely(!blk_should_fake_timeout(rq->q)))
blk_mq_complete_request(rq);
@@ -490,15 +501,38 @@ static int lo_submit_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd)
{
unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
- int ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+ int ret;
- if (ret >= 0)
- ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
+ /*
+ * This command is prepared, and we have tried IOCB_NOWAIT, but got
+ * -EAGAIN, so clear it now
+ */
+ cmd->iocb.ki_flags &= ~IOCB_NOWAIT;
+ ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
if (ret != -EIOCBQUEUED)
lo_rw_aio_complete(&cmd->iocb, ret);
return 0;
}
+static blk_status_t lo_rw_aio_nowait(struct loop_device *lo,
+ struct loop_cmd *cmd)
+{
+ unsigned int nr_bvec = lo_cmd_nr_bvec(cmd);
+ int ret = lo_rw_aio_prep(lo, cmd, nr_bvec);
+
+ if (unlikely(ret < 0))
+ return BLK_STS_IOERR;
+
+ cmd->iocb.ki_flags |= IOCB_NOWAIT;
+ ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
+ if (ret == -EAGAIN)
+ return BLK_STS_AGAIN;
+
+ if (ret != -EIOCBQUEUED)
+ lo_rw_aio_complete(&cmd->iocb, ret);
+ return BLK_STS_OK;
+}
+
static int do_req_filebacked(struct loop_device *lo, struct request *rq)
{
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
@@ -1943,6 +1977,14 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
+ if (cmd->use_aio) {
+ blk_status_t res = lo_rw_aio_nowait(lo, cmd);
+
+ if (res != BLK_STS_AGAIN)
+ return res;
+ /* fallback to workqueue for handling aio */
+ }
+
loop_queue_work(lo, cmd);
return BLK_STS_OK;
@@ -2093,7 +2135,8 @@ static int loop_add(int i)
lo->tag_set.queue_depth = hw_queue_depth;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
- lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT;
+ lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT |
+ BLK_MQ_F_BLOCKING;
lo->tag_set.driver_data = lo;
err = blk_mq_alloc_tag_set(&lo->tag_set);
--
2.47.0
* [PATCH V3 5/5] loop: add hint for handling aio via IOCB_NOWAIT
From: Ming Lei @ 2025-03-22 1:26 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka, Ming Lei
Add a hint for using IOCB_NOWAIT to handle loop aio commands, to avoid
causing a write (especially randwrite) perf regression on sparse backing files.
Try IOCB_NOWAIT in the following situations:
- backing file is a block device
OR
- READ aio command
OR
- there aren't any queued blocking async WRITEs, because NOWAIT won't cause
contention with a blocking WRITE, which often implies an exclusive lock
With this simple policy, the perf regression of randwrite/write on sparse
backing files is fixed.
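The three-way policy above reduces to a small predicate. The following is a standalone sketch of that decision only; struct sketch_dev, enum sketch_op and sketch_try_nowait() are hypothetical names for illustration, and the counter is a plain field here rather than anything protected by the driver's locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_op { SKETCH_READ, SKETCH_WRITE };

/* Hypothetical device model; not the kernel's struct loop_device. */
struct sketch_dev {
	bool backing_is_blkdev;		/* backing file is a block device */
	unsigned int nr_blocking_writes;	/* queued non-NOWAIT WRITEs */
};

/* Return true when a NOWAIT attempt is expected to be worthwhile. */
static bool sketch_try_nowait(const struct sketch_dev *dev, enum sketch_op op)
{
	assert(dev != NULL);

	/* Rule 1: a backing block device is always a candidate. */
	if (dev->backing_is_blkdev)
		return true;

	/* Rule 2: READs are always a candidate. */
	if (op == SKETCH_READ)
		return true;

	/* Rule 3: WRITEs only when no blocking WRITE is already queued. */
	return dev->nr_blocking_writes == 0;
}
```

Only a WRITE to a file-backed device with blocking WRITEs already queued skips the NOWAIT attempt and goes straight to the workqueue.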
Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 56 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3baabf150488..e1b01285da2a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -68,6 +68,7 @@ struct loop_device {
struct rb_root worker_tree;
struct timer_list timer;
bool sysfs_inited;
+ unsigned lo_nr_blocking_writes;
struct request_queue *lo_queue;
struct blk_mq_tag_set tag_set;
@@ -514,6 +515,33 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd)
return 0;
}
+static inline bool lo_aio_try_nowait(struct loop_device *lo,
+ struct loop_cmd *cmd)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *inode = file->f_mapping->host;
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+ /* NOWAIT works fine for backing block device */
+ if (S_ISBLK(inode->i_mode))
+ return true;
+
+ /*
+ * NOWAIT is supposed to be fine for READ without contending with
+ * blocking WRITE
+ */
+ if (req_op(rq) == REQ_OP_READ)
+ return true;
+
+ /*
+ * If there is any queued non-NOWAIT async WRITE , don't try new
+ * NOWAIT WRITE for avoiding contention
+ *
+ * Here we focus on handling stable FS block mapping via NOWAIT
+ */
+ return READ_ONCE(lo->lo_nr_blocking_writes) == 0;
+}
+
static blk_status_t lo_rw_aio_nowait(struct loop_device *lo,
struct loop_cmd *cmd)
{
@@ -523,6 +551,9 @@ static blk_status_t lo_rw_aio_nowait(struct loop_device *lo,
if (unlikely(ret < 0))
return BLK_STS_IOERR;
+ if (!lo_aio_try_nowait(lo, cmd))
+ return BLK_STS_AGAIN;
+
cmd->iocb.ki_flags |= IOCB_NOWAIT;
ret = lo_submit_rw_aio(lo, cmd, nr_bvec);
if (ret == -EAGAIN)
@@ -820,12 +851,19 @@ static ssize_t loop_attr_dio_show(struct loop_device *lo, char *buf)
return sysfs_emit(buf, "%s\n", dio ? "1" : "0");
}
+static ssize_t loop_attr_nr_blocking_writes_show(struct loop_device *lo,
+ char *buf)
+{
+ return sysfs_emit(buf, "%u\n", lo->lo_nr_blocking_writes);
+}
+
LOOP_ATTR_RO(backing_file);
LOOP_ATTR_RO(offset);
LOOP_ATTR_RO(sizelimit);
LOOP_ATTR_RO(autoclear);
LOOP_ATTR_RO(partscan);
LOOP_ATTR_RO(dio);
+LOOP_ATTR_RO(nr_blocking_writes);
static struct attribute *loop_attrs[] = {
&loop_attr_backing_file.attr,
@@ -834,6 +872,7 @@ static struct attribute *loop_attrs[] = {
&loop_attr_autoclear.attr,
&loop_attr_partscan.attr,
&loop_attr_dio.attr,
+ &loop_attr_nr_blocking_writes.attr,
NULL,
};
@@ -909,6 +948,19 @@ static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
}
#endif
+static inline void loop_update_blocking_writes(struct loop_device *lo,
+ struct loop_cmd *cmd, bool inc)
+{
+ lockdep_assert_held(&lo->lo_mutex);
+
+ if (req_op(blk_mq_rq_from_pdu(cmd)) == REQ_OP_WRITE) {
+ if (inc)
+ lo->lo_nr_blocking_writes += 1;
+ else
+ lo->lo_nr_blocking_writes -= 1;
+ }
+}
+
static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
struct request __maybe_unused *rq = blk_mq_rq_from_pdu(cmd);
@@ -991,6 +1043,8 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
work = &lo->rootcg_work;
cmd_list = &lo->rootcg_cmd_list;
}
+ if (cmd->use_aio)
+ loop_update_blocking_writes(lo, cmd, true);
list_add_tail(&cmd->list_entry, cmd_list);
queue_work(lo->workqueue, work);
spin_unlock_irq(&lo->lo_work_lock);
@@ -2057,6 +2111,8 @@ static void loop_process_work(struct loop_worker *worker,
cond_resched();
spin_lock_irq(&lo->lo_work_lock);
+ if (cmd->use_aio)
+ loop_update_blocking_writes(lo, cmd, false);
}
/*
--
2.47.0
* Re: [PATCH V3 0/5] loop: improve loop aio perf by IOCB_NOWAIT
From: Jens Axboe @ 2025-03-22 17:40 UTC (permalink / raw)
To: linux-block, Ming Lei
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka
On Sat, 22 Mar 2025 09:26:09 +0800, Ming Lei wrote:
> This patchset improves loop aio perf by using IOCB_NOWAIT to avoid queueing
> aio commands to workqueue context, and refactors lo_rw_aio() a bit along the way.
>
> In my test VM, loop disk perf becomes very close to that of the backing block
> device (nvme/mq virtio-scsi).
>
> Mikulas verified that this approach improves 12-job sequential rw IO by
> ~5X, and basically solves the reported problem together with the loop MQ change.
>
> [...]
Applied, thanks!
[1/5] loop: simplify do_req_filebacked()
commit: 04dcb8a909b5b68464ec5ccb123e9614f3ac333d
[2/5] loop: cleanup lo_rw_aio()
commit: 832c9fec8e2314170c5451023565b94f05477aa7
[3/5] loop: move command blkcg/memcg initialization into loop_queue_work
commit: a23d34a31758000b2b158288226bf24f96d8864d
[4/5] loop: try to handle loop aio command via NOWAIT IO first
commit: dfc77a934a3acdb13dadf237b7417c6a31b19da8
[5/5] loop: add hint for handling aio via IOCB_NOWAIT
commit: 4c3f4bad7a6e9022489a9f8392f7147ed3ce74b1
Best regards,
--
Jens Axboe
* Re: [PATCH V3 0/5] loop: improve loop aio perf by IOCB_NOWAIT
From: Jens Axboe @ 2025-03-24 14:50 UTC (permalink / raw)
To: linux-block, Ming Lei
Cc: Christoph Hellwig, Jooyung Han, Mike Snitzer, zkabelac, dm-devel,
Alasdair Kergon, Mikulas Patocka
On 3/22/25 11:40 AM, Jens Axboe wrote:
> Applied, thanks!
>
> [1/5] loop: simplify do_req_filebacked()
> commit: 04dcb8a909b5b68464ec5ccb123e9614f3ac333d
> [2/5] loop: cleanup lo_rw_aio()
> commit: 832c9fec8e2314170c5451023565b94f05477aa7
> [3/5] loop: move command blkcg/memcg initialization into loop_queue_work
> commit: a23d34a31758000b2b158288226bf24f96d8864d
> [4/5] loop: try to handle loop aio command via NOWAIT IO first
> commit: dfc77a934a3acdb13dadf237b7417c6a31b19da8
> [5/5] loop: add hint for handling aio via IOCB_NOWAIT
> commit: 4c3f4bad7a6e9022489a9f8392f7147ed3ce74b1
Just a heads-up that I had applied this for testing, not necessarily to
get included. To clear up that confusion, I have retained patches 1-3
for now, and then we can queue up 4-5/5 later when everybody is happy
with them.
--
Jens Axboe
* Re: [PATCH V3 0/5] loop: improve loop aio perf by IOCB_NOWAIT
From: Ming Lei @ 2025-03-25 1:59 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jooyung Han, Mike Snitzer,
zkabelac, dm-devel, Alasdair Kergon, Mikulas Patocka
On Mon, Mar 24, 2025 at 08:50:14AM -0600, Jens Axboe wrote:
> Just a heads-up that I had applied this for testing, not necessarily to
> get included. To clear up that confusion, I have retained patches 1-3
> for now, and then we can queue up 4-5/5 later when everybody is happy
> with them.
Fine.
I'd like to see the reason if there is one; I haven't seen it anywhere :-)
And it should have been posted on the mailing list.
Christoph suggested a per-cmd struct, which does cause a regression for
the usual sequential IO workload in both throughput and cpu utilization.
This was already observed 10 years ago when enabling loop dio/aio:
https://lore.kernel.org/lkml/1439778711-9621-4-git-send-email-ming.lei@canonical.com/
My recent test shows the same result:
https://lore.kernel.org/linux-block/Z9I2lm31KOQ784nb@fedora/
Mikulas's test shows that the per-cmd struct works much worse than this patchset:
https://lore.kernel.org/linux-block/7b8b8a24-f36b-d213-cca1-d8857b6aca02@redhat.com/
Anything else?
Thanks,
Ming
* Re: [PATCH V3 0/5] loop: improve loop aio perf by IOCB_NOWAIT
From: Jens Axboe @ 2025-03-25 12:07 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, Christoph Hellwig, Jooyung Han, Mike Snitzer,
zkabelac, dm-devel, Alasdair Kergon, Mikulas Patocka
On 3/24/25 7:59 PM, Ming Lei wrote:
> On Mon, Mar 24, 2025 at 08:50:14AM -0600, Jens Axboe wrote:
>> Just a heads-up that I had applied this for testing, not necessarily to
>> get included. To clear up that confusion, I have retained patches 1-3
>> for now, and then we can queue up 4-5/5 later when everybody is happy
>> with them.
>
> Fine.
>
> I'd like to see the reason if there is one; I haven't seen it anywhere :-)
>
> And it should have been posted on the mailing list.
There's no reason, it's what I emailed above. It's just that 4-5/5
aren't fully reviewed yet. We can still make 6.15 if folks are happy
with it, just wanted to ensure it had enough time on the list to ensure
that that is the case.
--
Jens Axboe