* [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
@ 2025-10-15 11:07 Ming Lei
2025-10-15 11:07 ` [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec() Ming Lei
` (8 more replies)
0 siblings, 9 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Hello Jens,
This patchset improves loop aio performance by using IOCB_NOWAIT to avoid
queueing the aio command to the workqueue context, and meanwhile refactors
lo_rw_aio() a bit.
In my test VM, loop disk performance becomes very close to that of the
backing block device (nvme/mq virtio-scsi).
Mikulas verified that this approach improves 12-job sequential read/write IO
by ~5X, and basically solves the reported problem together with the loop MQ
change.
https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/
Zhaoyang Huang also mentioned it may fix their performance issue in the
Android use case.
The loop MQ change will be posted as a standalone patch, because it needs a
UAPI change.
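In short, the submission flow becomes roughly the following (illustrative
sketch only, reusing the helper names introduced by this series; iocb
preparation and error handling are omitted, see the individual patches for
the real code):

```c
/* Rough sketch of the intended flow, not the literal driver code */
static blk_status_t loop_try_nowait_first(struct loop_device *lo,
					  struct loop_cmd *cmd, int rw)
{
	if (cmd->use_aio && (lo->lo_backing_file->f_mode & FMODE_NOWAIT)) {
		int ret;

		/* first try the backing file with IOCB_NOWAIT set */
		cmd->iocb.ki_flags |= IOCB_NOWAIT;
		ret = lo_submit_rw_aio(lo, cmd, lo_cmd_nr_bvec(cmd), rw);
		if (ret != -EAGAIN && ret != -EOPNOTSUPP)
			return BLK_STS_OK;	/* completed or queued as aio */
	}

	/* would block: fall back to the per-cgroup workqueue as before */
	loop_queue_work(lo, cmd);
	return BLK_STS_OK;
}
```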
V5:
- only try nowait in case the backing file supports it (Yu Kuai)
- fix one lockdep assert (syzbot)
- improve comment log (Christoph)
V4:
- rebase
- re-organize and make it more readable
V3:
- add reviewed-by tag
- rename variable & improve commit log & comment on 5/5 (Christoph)
V2:
- patch style fix & cleanup (Christoph)
- fix randwrite perf regression on sparse backing file
- drop MQ change
Ming Lei (6):
loop: add helper lo_cmd_nr_bvec()
loop: add helper lo_rw_aio_prep()
loop: add lo_submit_rw_aio()
loop: move command blkcg/memcg initialization into loop_queue_work
loop: try to handle loop aio command via NOWAIT IO first
loop: add hint for handling aio via IOCB_NOWAIT
drivers/block/loop.c | 233 +++++++++++++++++++++++++++++++++++--------
1 file changed, 194 insertions(+), 39 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec()
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
@ 2025-10-15 11:07 ` Ming Lei
2025-10-15 15:49 ` Bart Van Assche
2025-10-15 11:07 ` [PATCH V5 2/6] loop: add helper lo_rw_aio_prep() Ming Lei
` (7 subsequent siblings)
8 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Add lo_cmd_nr_bvec() and prepare for refactoring lo_rw_aio().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 053a086d547e..af443651dff5 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -337,6 +337,19 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
lo_rw_aio_do_completion(cmd);
}
+static inline unsigned lo_cmd_nr_bvec(struct loop_cmd *cmd)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct req_iterator rq_iter;
+ struct bio_vec tmp;
+ int nr_bvec = 0;
+
+ rq_for_each_bvec(tmp, rq, rq_iter)
+ nr_bvec++;
+
+ return nr_bvec;
+}
+
static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
loff_t pos, int rw)
{
@@ -348,12 +361,9 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
struct file *file = lo->lo_backing_file;
struct bio_vec tmp;
unsigned int offset;
- int nr_bvec = 0;
+ int nr_bvec = lo_cmd_nr_bvec(cmd);
int ret;
- rq_for_each_bvec(tmp, rq, rq_iter)
- nr_bvec++;
-
if (rq->bio != rq->biotail) {
bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
--
2.47.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH V5 2/6] loop: add helper lo_rw_aio_prep()
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
2025-10-15 11:07 ` [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec() Ming Lei
@ 2025-10-15 11:07 ` Ming Lei
2025-10-15 11:07 ` [PATCH V5 3/6] loop: add lo_submit_rw_aio() Ming Lei
` (6 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Add helper lo_rw_aio_prep() to separate the preparation phase (setting up bio
vectors and initializing the iocb structure) from the actual I/O execution
in the loop block driver.
Prepare for using NOWAIT to improve loop performance.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 63 ++++++++++++++++++++++++++++----------------
1 file changed, 40 insertions(+), 23 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index af443651dff5..b065892106a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -350,21 +350,15 @@ static inline unsigned lo_cmd_nr_bvec(struct loop_cmd *cmd)
return nr_bvec;
}
-static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
- loff_t pos, int rw)
+static int lo_rw_aio_prep(struct loop_device *lo, struct loop_cmd *cmd,
+ unsigned nr_bvec, loff_t pos)
{
- struct iov_iter iter;
- struct req_iterator rq_iter;
- struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
- struct bio *bio = rq->bio;
- struct file *file = lo->lo_backing_file;
- struct bio_vec tmp;
- unsigned int offset;
- int nr_bvec = lo_cmd_nr_bvec(cmd);
- int ret;
if (rq->bio != rq->biotail) {
+ struct req_iterator rq_iter;
+ struct bio_vec *bvec;
+ struct bio_vec tmp;
bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
GFP_NOIO);
@@ -382,8 +376,42 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
*bvec = tmp;
bvec++;
}
- bvec = cmd->bvec;
+ } else {
+ cmd->bvec = NULL;
+ }
+
+ cmd->iocb.ki_pos = pos;
+ cmd->iocb.ki_filp = lo->lo_backing_file;
+ cmd->iocb.ki_ioprio = req_get_ioprio(rq);
+ if (cmd->use_aio) {
+ cmd->iocb.ki_complete = lo_rw_aio_complete;
+ cmd->iocb.ki_flags = IOCB_DIRECT;
+ } else {
+ cmd->iocb.ki_complete = NULL;
+ cmd->iocb.ki_flags = 0;
+ }
+ return 0;
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+ loff_t pos, int rw)
+{
+ struct iov_iter iter;
+ struct bio_vec *bvec;
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ struct bio *bio = rq->bio;
+ struct file *file = lo->lo_backing_file;
+ unsigned int offset;
+ int nr_bvec = lo_cmd_nr_bvec(cmd);
+ int ret;
+
+ ret = lo_rw_aio_prep(lo, cmd, nr_bvec, pos);
+ if (unlikely(ret))
+ return ret;
+
+ if (cmd->bvec) {
offset = 0;
+ bvec = cmd->bvec;
} else {
/*
* Same here, this bio may be started from the middle of the
@@ -398,17 +426,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
- cmd->iocb.ki_pos = pos;
- cmd->iocb.ki_filp = file;
- cmd->iocb.ki_ioprio = req_get_ioprio(rq);
- if (cmd->use_aio) {
- cmd->iocb.ki_complete = lo_rw_aio_complete;
- cmd->iocb.ki_flags = IOCB_DIRECT;
- } else {
- cmd->iocb.ki_complete = NULL;
- cmd->iocb.ki_flags = 0;
- }
-
if (rw == ITER_SOURCE) {
kiocb_start_write(&cmd->iocb);
ret = file->f_op->write_iter(&cmd->iocb, &iter);
--
2.47.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH V5 3/6] loop: add lo_submit_rw_aio()
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
2025-10-15 11:07 ` [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec() Ming Lei
2025-10-15 11:07 ` [PATCH V5 2/6] loop: add helper lo_rw_aio_prep() Ming Lei
@ 2025-10-15 11:07 ` Ming Lei
2025-10-15 11:07 ` [PATCH V5 4/6] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
` (5 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Refactor lo_rw_aio() by extracting the I/O submission logic into a new
helper function lo_submit_rw_aio(). This further improves code organization
by separating the I/O preparation, submission, and completion handling into
distinct phases.
Prepare for using NOWAIT to improve loop performance.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 41 ++++++++++++++++++++++++-----------------
1 file changed, 24 insertions(+), 17 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b065892106a6..3ab910572bd9 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -393,38 +393,32 @@ static int lo_rw_aio_prep(struct loop_device *lo, struct loop_cmd *cmd,
return 0;
}
-static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
- loff_t pos, int rw)
+static int lo_submit_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+ int nr_bvec, int rw)
{
- struct iov_iter iter;
- struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
- struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
- unsigned int offset;
- int nr_bvec = lo_cmd_nr_bvec(cmd);
+ struct iov_iter iter;
int ret;
- ret = lo_rw_aio_prep(lo, cmd, nr_bvec, pos);
- if (unlikely(ret))
- return ret;
-
if (cmd->bvec) {
- offset = 0;
- bvec = cmd->bvec;
+ iov_iter_bvec(&iter, rw, cmd->bvec, nr_bvec, blk_rq_bytes(rq));
+ iter.iov_offset = 0;
} else {
+ struct bio *bio = rq->bio;
+ struct bio_vec *bvec = __bvec_iter_bvec(bio->bi_io_vec,
+ bio->bi_iter);
+
/*
* Same here, this bio may be started from the middle of the
* 'bvec' because of bio splitting, so offset from the bvec
* must be passed to iov iterator
*/
- offset = bio->bi_iter.bi_bvec_done;
- bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+ iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
+ iter.iov_offset = bio->bi_iter.bi_bvec_done;
}
atomic_set(&cmd->ref, 2);
- iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
- iter.iov_offset = offset;
if (rw == ITER_SOURCE) {
kiocb_start_write(&cmd->iocb);
@@ -433,7 +427,20 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = file->f_op->read_iter(&cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
+ return ret;
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+ loff_t pos, int rw)
+{
+ int nr_bvec = lo_cmd_nr_bvec(cmd);
+ int ret;
+
+ ret = lo_rw_aio_prep(lo, cmd, nr_bvec, pos);
+ if (unlikely(ret))
+ return ret;
+ ret = lo_submit_rw_aio(lo, cmd, nr_bvec, rw);
if (ret != -EIOCBQUEUED)
lo_rw_aio_complete(&cmd->iocb, ret);
return -EIOCBQUEUED;
--
2.47.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH V5 4/6] loop: move command blkcg/memcg initialization into loop_queue_work
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (2 preceding siblings ...)
2025-10-15 11:07 ` [PATCH V5 3/6] loop: add lo_submit_rw_aio() Ming Lei
@ 2025-10-15 11:07 ` Ming Lei
2025-10-15 11:07 ` [PATCH V5 5/6] loop: try to handle loop aio command via NOWAIT IO first Ming Lei
` (4 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Move the loop command blkcg/memcg initialization into loop_queue_work,
and prepare for handling loop IO commands via IOCB_NOWAIT.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 3ab910572bd9..99eec0a25dbc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -829,11 +829,28 @@ static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
+ struct request __maybe_unused *rq = blk_mq_rq_from_pdu(cmd);
struct rb_node **node, *parent = NULL;
struct loop_worker *cur_worker, *worker = NULL;
struct work_struct *work;
struct list_head *cmd_list;
+ /* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
+#ifdef CONFIG_BLK_CGROUP
+ if (rq->bio) {
+ cmd->blkcg_css = bio_blkcg_css(rq->bio);
+#ifdef CONFIG_MEMCG
+ if (cmd->blkcg_css) {
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+ }
+#endif
+ }
+#endif
+
spin_lock_irq(&lo->lo_work_lock);
if (queue_on_root_worker(cmd->blkcg_css))
@@ -1903,21 +1920,6 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
- /* always use the first bio's css */
- cmd->blkcg_css = NULL;
- cmd->memcg_css = NULL;
-#ifdef CONFIG_BLK_CGROUP
- if (rq->bio) {
- cmd->blkcg_css = bio_blkcg_css(rq->bio);
-#ifdef CONFIG_MEMCG
- if (cmd->blkcg_css) {
- cmd->memcg_css =
- cgroup_get_e_css(cmd->blkcg_css->cgroup,
- &memory_cgrp_subsys);
- }
-#endif
- }
-#endif
loop_queue_work(lo, cmd);
return BLK_STS_OK;
--
2.47.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH V5 5/6] loop: try to handle loop aio command via NOWAIT IO first
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (3 preceding siblings ...)
2025-10-15 11:07 ` [PATCH V5 4/6] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
@ 2025-10-15 11:07 ` Ming Lei
2025-10-15 11:07 ` [PATCH V5 6/6] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
` (3 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Try to handle the loop aio command via NOWAIT IO first, so that we can avoid
queueing the aio command to the workqueue. This is usually a big win when the
FS block mapping is stable; Mikulas verified [1] that this approach improves
IO performance by close to 5X in a 12-job sequential read/write test, in
which the FS block mapping is indeed stable.
Fall back to the workqueue in case of -EAGAIN. This may add a little cost for
the first retry, but when running the following write test over a loop device
backed by a sparse file, the actual effect on randwrite is obvious:
```
truncate -s 4G 1.img #1.img is created on XFS/virtio-scsi
losetup -f 1.img --direct-io=on
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
--iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW
```
- RW=randwrite: obvious IOPS drop observed
- RW=write: a little drop (5% - 10%)
This performance drop on randwrite over a sparse file will be addressed in
the following patch.
BLK_MQ_F_BLOCKING has to be set because calling into .read_iter() or
.write_iter() might sleep even with NOWAIT; the only effect is that the RCU
read lock is replaced with an SRCU read lock.
Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 68 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 63 insertions(+), 5 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 99eec0a25dbc..f752942d0889 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -90,6 +90,8 @@ struct loop_cmd {
#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
#define LOOP_DEFAULT_HW_Q_DEPTH 128
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd);
+
static DEFINE_IDR(loop_index_idr);
static DEFINE_MUTEX(loop_ctl_mutex);
static DEFINE_MUTEX(loop_validate_mutex);
@@ -321,6 +323,15 @@ static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
if (!atomic_dec_and_test(&cmd->ref))
return;
+
+ /* -EAGAIN could be returned from bdev's ->ki_complete */
+ if (cmd->ret == -EAGAIN) {
+ struct loop_device *lo = rq->q->queuedata;
+
+ loop_queue_work(lo, cmd);
+ return;
+ }
+
kfree(cmd->bvec);
cmd->bvec = NULL;
if (req_op(rq) == REQ_OP_WRITE)
@@ -430,22 +441,51 @@ static int lo_submit_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
return ret;
}
+static bool lo_backfile_support_nowait(const struct loop_device *lo)
+{
+ return lo->lo_backing_file->f_mode & FMODE_NOWAIT;
+}
+
static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
loff_t pos, int rw)
{
int nr_bvec = lo_cmd_nr_bvec(cmd);
int ret;
- ret = lo_rw_aio_prep(lo, cmd, nr_bvec, pos);
- if (unlikely(ret))
- return ret;
+ /* prepared already if we have tried nowait */
+ if (!cmd->use_aio || !lo_backfile_support_nowait(lo)) {
+ ret = lo_rw_aio_prep(lo, cmd, nr_bvec, pos);
+ if (unlikely(ret))
+ goto fail;
+ }
+ cmd->iocb.ki_flags &= ~IOCB_NOWAIT;
ret = lo_submit_rw_aio(lo, cmd, nr_bvec, rw);
+fail:
if (ret != -EIOCBQUEUED)
lo_rw_aio_complete(&cmd->iocb, ret);
return -EIOCBQUEUED;
}
+static int lo_rw_aio_nowait(struct loop_device *lo, struct loop_cmd *cmd,
+ int rw)
+{
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+ loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
+ int nr_bvec = lo_cmd_nr_bvec(cmd);
+ int ret = lo_rw_aio_prep(lo, cmd, nr_bvec, pos);
+
+ if (unlikely(ret))
+ goto fail;
+
+ cmd->iocb.ki_flags |= IOCB_NOWAIT;
+ ret = lo_submit_rw_aio(lo, cmd, nr_bvec, rw);
+fail:
+ if (ret != -EIOCBQUEUED && ret != -EAGAIN)
+ lo_rw_aio_complete(&cmd->iocb, ret);
+ return ret;
+}
+
static int do_req_filebacked(struct loop_device *lo, struct request *rq)
{
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
@@ -1903,6 +1943,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
struct request *rq = bd->rq;
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
struct loop_device *lo = rq->q->queuedata;
+ int rw = 0;
blk_mq_start_request(rq);
@@ -1915,9 +1956,25 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
case REQ_OP_WRITE_ZEROES:
cmd->use_aio = false;
break;
- default:
+ case REQ_OP_READ:
+ rw = ITER_DEST;
cmd->use_aio = lo->lo_flags & LO_FLAGS_DIRECT_IO;
break;
+ case REQ_OP_WRITE:
+ rw = ITER_SOURCE;
+ cmd->use_aio = lo->lo_flags & LO_FLAGS_DIRECT_IO;
+ break;
+ default:
+ return BLK_STS_IOERR;
+ }
+
+ /* try NOWAIT if the backing file supports the mode */
+ if (cmd->use_aio && lo_backfile_support_nowait(lo)) {
+ int res = lo_rw_aio_nowait(lo, cmd, rw);
+
+ if (res != -EAGAIN && res != -EOPNOTSUPP)
+ return BLK_STS_OK;
+ /* fallback to workqueue for handling aio */
}
loop_queue_work(lo, cmd);
@@ -2069,7 +2126,8 @@ static int loop_add(int i)
lo->tag_set.queue_depth = hw_queue_depth;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
- lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT;
+ lo->tag_set.flags = BLK_MQ_F_STACKING | BLK_MQ_F_NO_SCHED_BY_DEFAULT |
+ BLK_MQ_F_BLOCKING;
lo->tag_set.driver_data = lo;
err = blk_mq_alloc_tag_set(&lo->tag_set);
--
2.47.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH V5 6/6] loop: add hint for handling aio via IOCB_NOWAIT
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (4 preceding siblings ...)
2025-10-15 11:07 ` [PATCH V5 5/6] loop: try to handle loop aio command via NOWAIT IO first Ming Lei
@ 2025-10-15 11:07 ` Ming Lei
2025-11-18 12:55 ` [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (2 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-15 11:07 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig, Ming Lei
Add a hint for using IOCB_NOWAIT to handle loop aio commands, to avoid
causing a write (especially randwrite) performance regression on a sparse
backing file.
Try IOCB_NOWAIT in the following situations:
- the backing file is a block device
OR
- READ aio command
OR
- there isn't any queued blocking async WRITE, because then NOWAIT won't
contend with a blocking WRITE, which often implies an exclusive lock
With this simple policy, the performance regression of randwrite/write on a
sparse backing file is fixed.
Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/loop.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f752942d0889..e42bdfc73c20 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -68,6 +68,7 @@ struct loop_device {
struct rb_root worker_tree;
struct timer_list timer;
bool sysfs_inited;
+ unsigned lo_nr_blocking_writes;
struct request_queue *lo_queue;
struct blk_mq_tag_set tag_set;
@@ -467,6 +468,33 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
return -EIOCBQUEUED;
}
+static inline bool lo_aio_try_nowait(struct loop_device *lo,
+ struct loop_cmd *cmd)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *inode = file->f_mapping->host;
+ struct request *rq = blk_mq_rq_from_pdu(cmd);
+
+ /* NOWAIT works fine for backing block device */
+ if (S_ISBLK(inode->i_mode))
+ return true;
+
+ /*
+ * NOWAIT is supposed to be fine for READ without contending with
+ * blocking WRITE
+ */
+ if (req_op(rq) == REQ_OP_READ)
+ return true;
+
+ /*
+ * If there is any queued non-NOWAIT async WRITE , don't try new
+ * NOWAIT WRITE for avoiding contention
+ *
+ * Here we focus on handling stable FS block mapping via NOWAIT
+ */
+ return READ_ONCE(lo->lo_nr_blocking_writes) == 0;
+}
+
static int lo_rw_aio_nowait(struct loop_device *lo, struct loop_cmd *cmd,
int rw)
{
@@ -478,6 +506,9 @@ static int lo_rw_aio_nowait(struct loop_device *lo, struct loop_cmd *cmd,
if (unlikely(ret))
goto fail;
+ if (!lo_aio_try_nowait(lo, cmd))
+ return -EAGAIN;
+
cmd->iocb.ki_flags |= IOCB_NOWAIT;
ret = lo_submit_rw_aio(lo, cmd, nr_bvec, rw);
fail:
@@ -778,12 +809,19 @@ static ssize_t loop_attr_dio_show(struct loop_device *lo, char *buf)
return sysfs_emit(buf, "%s\n", dio ? "1" : "0");
}
+static ssize_t loop_attr_nr_blocking_writes_show(struct loop_device *lo,
+ char *buf)
+{
+ return sysfs_emit(buf, "%u\n", lo->lo_nr_blocking_writes);
+}
+
LOOP_ATTR_RO(backing_file);
LOOP_ATTR_RO(offset);
LOOP_ATTR_RO(sizelimit);
LOOP_ATTR_RO(autoclear);
LOOP_ATTR_RO(partscan);
LOOP_ATTR_RO(dio);
+LOOP_ATTR_RO(nr_blocking_writes);
static struct attribute *loop_attrs[] = {
&loop_attr_backing_file.attr,
@@ -792,6 +830,7 @@ static struct attribute *loop_attrs[] = {
&loop_attr_autoclear.attr,
&loop_attr_partscan.attr,
&loop_attr_dio.attr,
+ &loop_attr_nr_blocking_writes.attr,
NULL,
};
@@ -867,6 +906,24 @@ static inline int queue_on_root_worker(struct cgroup_subsys_state *css)
}
#endif
+static inline void loop_inc_blocking_writes(struct loop_device *lo,
+ struct loop_cmd *cmd)
+{
+ lockdep_assert_held(&lo->lo_work_lock);
+
+ if (req_op(blk_mq_rq_from_pdu(cmd)) == REQ_OP_WRITE)
+ lo->lo_nr_blocking_writes += 1;
+}
+
+static inline void loop_dec_blocking_writes(struct loop_device *lo,
+ struct loop_cmd *cmd)
+{
+ lockdep_assert_held(&lo->lo_work_lock);
+
+ if (req_op(blk_mq_rq_from_pdu(cmd)) == REQ_OP_WRITE)
+ lo->lo_nr_blocking_writes -= 1;
+}
+
static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
struct request __maybe_unused *rq = blk_mq_rq_from_pdu(cmd);
@@ -949,6 +1006,8 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
work = &lo->rootcg_work;
cmd_list = &lo->rootcg_cmd_list;
}
+ if (cmd->use_aio)
+ loop_inc_blocking_writes(lo, cmd);
list_add_tail(&cmd->list_entry, cmd_list);
queue_work(lo->workqueue, work);
spin_unlock_irq(&lo->lo_work_lock);
@@ -2048,6 +2107,8 @@ static void loop_process_work(struct loop_worker *worker,
cond_resched();
spin_lock_irq(&lo->lo_work_lock);
+ if (cmd->use_aio)
+ loop_dec_blocking_writes(lo, cmd);
}
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec()
2025-10-15 11:07 ` [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec() Ming Lei
@ 2025-10-15 15:49 ` Bart Van Assche
2025-10-16 2:19 ` Ming Lei
0 siblings, 1 reply; 27+ messages in thread
From: Bart Van Assche @ 2025-10-15 15:49 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig
On 10/15/25 4:07 AM, Ming Lei wrote:
> +static inline unsigned lo_cmd_nr_bvec(struct loop_cmd *cmd)
> +{
> + struct request *rq = blk_mq_rq_from_pdu(cmd);
> + struct req_iterator rq_iter;
> + struct bio_vec tmp;
> + int nr_bvec = 0;
> +
> + rq_for_each_bvec(tmp, rq, rq_iter)
> + nr_bvec++;
> +
> + return nr_bvec;
> +}
'cmd' is not used in this function other than in the conversion to a
struct request. Has it been considered to change the argument type of
this function from 'struct loop_cmd *' to 'struct request *'? That
would allow leaving out the blk_mq_rq_from_pdu() call from this function.
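For illustration, the suggested variant would look something like this
(sketch only):

```c
static inline unsigned lo_cmd_nr_bvec(struct request *rq)
{
	struct req_iterator rq_iter;
	struct bio_vec tmp;
	unsigned int nr_bvec = 0;

	/* count the bvecs of the request without touching loop_cmd */
	rq_for_each_bvec(tmp, rq, rq_iter)
		nr_bvec++;

	return nr_bvec;
}
```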
Otherwise this patch looks good to me.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec()
2025-10-15 15:49 ` Bart Van Assche
@ 2025-10-16 2:19 ` Ming Lei
0 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-10-16 2:19 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Christoph Hellwig
On Wed, Oct 15, 2025 at 08:49:45AM -0700, Bart Van Assche wrote:
> On 10/15/25 4:07 AM, Ming Lei wrote:
> > +static inline unsigned lo_cmd_nr_bvec(struct loop_cmd *cmd)
> > +{
> > + struct request *rq = blk_mq_rq_from_pdu(cmd);
> > + struct req_iterator rq_iter;
> > + struct bio_vec tmp;
> > + int nr_bvec = 0;
> > +
> > + rq_for_each_bvec(tmp, rq, rq_iter)
> > + nr_bvec++;
> > +
> > + return nr_bvec;
> > +}
>
> 'cmd' is not used in this function other than in the conversion to a
> struct request. Has it been considered to change the argument type of
> this function from 'struct loop_cmd *' into 'struct request *'? That
> will allow to leave out the blk_mq_rq_from_pdu() from this function.
> Otherwise this patch looks good to me.
It isn't a big deal; I can switch to `struct request *` if a respin is
needed, otherwise I'd leave it as it is.
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (5 preceding siblings ...)
2025-10-15 11:07 ` [PATCH V5 6/6] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
@ 2025-11-18 12:55 ` Ming Lei
2025-11-18 15:38 ` Jens Axboe
2025-11-24 6:12 ` calling into file systems directly from ->queue_rq, was " Christoph Hellwig
8 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-11-18 12:55 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig
On Wed, Oct 15, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hello Jens,
>
> This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio
> command to workqueue context, meantime refactor lo_rw_aio() a bit.
>
> In my test VM, loop disk perf becomes very close to perf of the backing block
> device(nvme/mq virtio-scsi).
>
> And Mikulas verified that this way can improve 12jobs sequential readwrite io by
> ~5X, and basically solve the reported problem together with loop MQ change.
>
> https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/
>
> Zhaoyang Huang also mentioned it may fix their performance issue on Android
> use case.
>
> The loop MQ change will be posted as standalone patch, because it needs
> UAPI change.
>
> V5:
> - only try nowait in case that backing file supports it (Yu Kuai)
> - fix one lockdep assert (syzbot)
> - improve comment log (Christoph)
Hi Jens,
Ping...
thanks,
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (6 preceding siblings ...)
2025-11-18 12:55 ` [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
@ 2025-11-18 15:38 ` Jens Axboe
2025-11-24 6:12 ` calling into file systems directly from ->queue_rq, was " Christoph Hellwig
8 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2025-11-18 15:38 UTC (permalink / raw)
To: linux-block, Ming Lei
Cc: Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel,
Christoph Hellwig
On Wed, 15 Oct 2025 19:07:25 +0800, Ming Lei wrote:
> This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio
> command to workqueue context, meantime refactor lo_rw_aio() a bit.
>
> In my test VM, loop disk perf becomes very close to perf of the backing block
> device(nvme/mq virtio-scsi).
>
> And Mikulas verified that this way can improve 12jobs sequential readwrite io by
> ~5X, and basically solve the reported problem together with loop MQ change.
>
> [...]
Applied, thanks!
[1/6] loop: add helper lo_cmd_nr_bvec()
commit: c3e6c11147f6f05c15e9c2d74f5d234a6661013c
[2/6] loop: add helper lo_rw_aio_prep()
commit: fd858d1ca9694c88703a8a936d5c7596c86ada74
[3/6] loop: add lo_submit_rw_aio()
commit: c66e9708f92760147a1ea7f66c7b60ec801f85e3
[4/6] loop: move command blkcg/memcg initialization into loop_queue_work
commit: f4788ae9d7bc01735cb6ada333b038c2e3fff260
[5/6] loop: try to handle loop aio command via NOWAIT IO first
commit: 0ba93a906dda7ede9e7669adefe005ee18f3ff42
[6/6] loop: add hint for handling aio via IOCB_NOWAIT
commit: 837ed303964673cf0c7e6a4624cd68d8cf254827
Best regards,
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
` (7 preceding siblings ...)
2025-11-18 15:38 ` Jens Axboe
@ 2025-11-24 6:12 ` Christoph Hellwig
2025-11-24 9:02 ` Ming Lei
8 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-24 6:12 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel
FYI, with this series I'm seeing somewhat frequent stack overflows when
using loop on top of XFS on top of stacked block devices.
This seems to be because this can now issue I/O directly from ->queue_rq
instead of breaking the stack chain, i.e. we can build much deeper call
stacks now.
Also this now means a file system using current->journal_info can call
into another file system trying to use it, making things blow up even worse.
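To illustrate the journal_info issue (a simplified sketch with a made-up
upper_fs_writeback() helper, not code from any real file system):

```c
/*
 * Simplified illustration of the hazard: the upper file system parks a
 * transaction in current->journal_info while it writes back data.  With
 * direct issue from loop's ->queue_rq, the lower file system's write path
 * can now run in the same task and see (or clobber) that pointer.
 */
static void upper_fs_writeback(struct bio *bio, void *trans)
{
	current->journal_info = trans;	/* upper fs transaction */

	/*
	 * May reach loop's ->queue_rq synchronously, which with this series
	 * can call ->write_iter() of the lower file system right here.
	 */
	submit_bio(bio);

	current->journal_info = NULL;
}
```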
In other words: I don't think issuing file system I/O from the
submission thread in loop can work, and we should drop this again.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-24 6:12 ` calling into file systems directly from ->queue_rq, was " Christoph Hellwig
@ 2025-11-24 9:02 ` Ming Lei
2025-11-24 9:05 ` Christoph Hellwig
0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-11-24 9:02 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel
On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> FYI, with this series I'm seeing somewhat frequent stack overflows when
> using loop on top of XFS on top of stacked block devices.
Can you share your setup?
BTW, there is one followup fix:
https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
I just ran 'xfstests -q quick' with loop on top of XFS on top of dm-stripe,
and did not see a stack overflow with the above fix against -next.
>
> This seems to be because this can now issue I/O directly from ->queue_rq
> instead of breaking the stack chain, i.e. we can build much deeper call
> stacks now.
>
> Also this now means a file systems using current->journal_info can call
> into another file system trying to use, making things blow up even worse.
>
> In other words: I don't think issuing file system I/O from the
> submission thread in loop can work, and we should drop this again.
I don't object to dropping it one more time.
However, can we confirm whether it is really a stack overflow caused by
calling into the FS from ->queue_rq()?
If yes, it could be a dead end to improve loop in this way, and then I can
give up.
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-24 9:02 ` Ming Lei
@ 2025-11-24 9:05 ` Christoph Hellwig
2025-11-25 3:00 ` Ming Lei
0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-24 9:05 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, linux-block, Mikulas Patocka,
Zhaoyang Huang, Dave Chinner, linux-fsdevel
On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > using loop on top of XFS on top of stacked block devices.
>
> Can you share your setting?
>
> BTW, there are one followup fix:
>
> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>
> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> not see stack overflow with the above fix against -next.
This was with a development tree with lots of local code, so the
messages aren't applicable (and probably a hint that I need to reduce my
stack usage). The observation is that we now stack through from block
submission context into the file system write path, which is bad for a
lot of reasons, journal_info being the most obvious one.
> > In other words: I don't think issuing file system I/O from the
> > submission thread in loop can work, and we should drop this again.
>
> I don't object to drop it one more time.
>
> However, can we confirm if it is really a stack overflow because of
> calling into FS from ->queue_rq()?
Yes.
> If yes, it could be dead end to improve loop in this way, then I can give up.
I think calling directly into the lower file system without a context
switch is very problematic, so IMHO yes, it is a dead end.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-24 9:05 ` Christoph Hellwig
@ 2025-11-25 3:00 ` Ming Lei
2025-11-25 3:56 ` Jens Axboe
2025-11-25 7:26 ` Gao Xiang
0 siblings, 2 replies; 27+ messages in thread
From: Ming Lei @ 2025-11-25 3:00 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Jens Axboe, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel
On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> > On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > > using loop on top of XFS on top of stacked block devices.
> >
> > Can you share your setting?
> >
> > BTW, there are one followup fix:
> >
> > https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
> >
> > I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> > not see stack overflow with the above fix against -next.
>
> This was with a development tree with lots of local code. So the
> messages aren't applicable (and probably a hint I need to reduce my
> stack usage). The observations is that we now stack through from block
> submission context into the file system write path, which is bad for a
> lot of reasons. journal_info being the most obvious one.
>
> > > In other words: I don't think issuing file system I/O from the
> > > submission thread in loop can work, and we should drop this again.
> >
> > I don't object to drop it one more time.
> >
> > However, can we confirm if it is really a stack overflow because of
> > calling into FS from ->queue_rq()?
>
> Yes.
>
> > If yes, it could be dead end to improve loop in this way, then I can give up.
>
> I think calling directly into the lower file system without a context
> switch is very problematic, so IMHO yes, it is a dead end.
Hi Jens,
Can you drop or revert the patchset of "loop: improve loop aio perf by IOCB_NOWAIT"
from for-6.19/block?
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 3:00 ` Ming Lei
@ 2025-11-25 3:56 ` Jens Axboe
2025-11-25 7:26 ` Gao Xiang
1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2025-11-25 3:56 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig
Cc: linux-block, Mikulas Patocka, Zhaoyang Huang, Dave Chinner,
linux-fsdevel
On 11/24/25 8:00 PM, Ming Lei wrote:
> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>> using loop on top of XFS on top of stacked block devices.
>>>
>>> Can you share your setting?
>>>
>>> BTW, there are one followup fix:
>>>
>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>
>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>> not see stack overflow with the above fix against -next.
>>
>> This was with a development tree with lots of local code. So the
>> messages aren't applicable (and probably a hint I need to reduce my
>> stack usage). The observations is that we now stack through from block
>> submission context into the file system write path, which is bad for a
>> lot of reasons. journal_info being the most obvious one.
>>
>>>> In other words: I don't think issuing file system I/O from the
>>>> submission thread in loop can work, and we should drop this again.
>>>
>>> I don't object to drop it one more time.
>>>
>>> However, can we confirm if it is really a stack overflow because of
>>> calling into FS from ->queue_rq()?
>>
>> Yes.
>>
>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>
>> I think calling directly into the lower file system without a context
>> switch is very problematic, so IMHO yes, it is a dead end.
>
> Hi Jens,
>
> Can you drop or revert the patchset of "loop: improve loop aio perf by IOCB_NOWAIT"
> from for-6.19/block?
Done
--
Jens Axboe
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 3:00 ` Ming Lei
2025-11-25 3:56 ` Jens Axboe
@ 2025-11-25 7:26 ` Gao Xiang
2025-11-25 9:19 ` Ming Lei
1 sibling, 1 reply; 27+ messages in thread
From: Gao Xiang @ 2025-11-25 7:26 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig
Cc: linux-block, Mikulas Patocka, Zhaoyang Huang, Dave Chinner,
linux-fsdevel, Jens Axboe
Hi Ming and Christoph,
On 2025/11/25 11:00, Ming Lei wrote:
> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>> using loop on top of XFS on top of stacked block devices.
>>>
>>> Can you share your setting?
>>>
>>> BTW, there are one followup fix:
>>>
>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>
>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>> not see stack overflow with the above fix against -next.
>>
>> This was with a development tree with lots of local code. So the
>> messages aren't applicable (and probably a hint I need to reduce my
>> stack usage). The observations is that we now stack through from block
>> submission context into the file system write path, which is bad for a
>> lot of reasons. journal_info being the most obvious one.
>>
>>>> In other words: I don't think issuing file system I/O from the
>>>> submission thread in loop can work, and we should drop this again.
>>>
>>> I don't object to drop it one more time.
>>>
>>> However, can we confirm if it is really a stack overflow because of
>>> calling into FS from ->queue_rq()?
>>
>> Yes.
>>
>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>
>> I think calling directly into the lower file system without a context
>> switch is very problematic, so IMHO yes, it is a dead end.
I've already explained the details in
https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
to the zram folks why it is very risky for block devices to act like this:
in brief, a virtual block device has no way (unlike the inner fs itself) to
know whether the inner fs has already done something without saving the
context (a.k.a. a side effect), so a new task context is absolutely
necessary for a virtual block device to access a backing fs in stacked
usage.
So whether a nested fs can succeed is intrinsic to the specific fs: either
it guarantees no complex journal_info access, or it saves all affected
contexts before transiting to the block layer. But that is not something a
bdev can do, since it has to work with any block fs.
But they didn't seem to get the point, so I gave up.
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 7:26 ` Gao Xiang
@ 2025-11-25 9:19 ` Ming Lei
2025-11-25 9:39 ` Gao Xiang
0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-11-25 9:19 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
> Hi Ming and Christoph,
>
> On 2025/11/25 11:00, Ming Lei wrote:
> > On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
> > > On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> > > > On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > > > > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > > > > using loop on top of XFS on top of stacked block devices.
> > > >
> > > > Can you share your setting?
> > > >
> > > > BTW, there are one followup fix:
> > > >
> > > > https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
> > > >
> > > > I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> > > > not see stack overflow with the above fix against -next.
> > >
> > > This was with a development tree with lots of local code. So the
> > > messages aren't applicable (and probably a hint I need to reduce my
> > > stack usage). The observations is that we now stack through from block
> > > submission context into the file system write path, which is bad for a
> > > lot of reasons. journal_info being the most obvious one.
> > >
> > > > > In other words: I don't think issuing file system I/O from the
> > > > > submission thread in loop can work, and we should drop this again.
> > > >
> > > > I don't object to drop it one more time.
> > > >
> > > > However, can we confirm if it is really a stack overflow because of
> > > > calling into FS from ->queue_rq()?
> > >
> > > Yes.
> > >
> > > > If yes, it could be dead end to improve loop in this way, then I can give up.
> > >
> > > I think calling directly into the lower file system without a context
> > > switch is very problematic, so IMHO yes, it is a dead end.
> I've already explained the details in
> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>
> to zram folks why block devices act like this is very
> risky (in brief, because virtual block devices don't
> have any way (unlike the inner fs itself) to know enough
> about whether the inner fs already did something without
> context save (a.k.a side effect) so a new task context
> is absolutely necessary for virtual block devices to
> access backing fses for stacked usage.
>
> So whether a nested fs can success is intrinsic to
> specific fses (because either they assure no complex
> journal_info access or save all effected contexts before
> transiting to the block layer. But that is not bdev can
> do since they need to do any block fs.
IMO, task stack overflow could be the biggest trouble.
The block layer has current->blk_plug/current->bio_list, which are
dealt with in the following patches:
https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
I am curious why the FS task context can't be saved/restored inside the
block layer when calling into new FS IO, given it is just per-task info.
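For example, something roughly like this (purely hypothetical sketch; the
struct and helper names are made up, only the task_struct fields are real):

```c
/* Hypothetical: park per-task IO/FS context around a nested FS call */
struct saved_task_io_ctx {
	void			*journal_info;
	struct blk_plug		*plug;
	struct bio_list		*bio_list;
};

static void save_task_io_ctx(struct saved_task_io_ctx *s)
{
	s->journal_info = current->journal_info;
	s->plug = current->plug;
	s->bio_list = current->bio_list;

	/* give the nested FS submission a clean per-task context */
	current->journal_info = NULL;
	current->plug = NULL;
	current->bio_list = NULL;
}

static void restore_task_io_ctx(const struct saved_task_io_ctx *s)
{
	current->journal_info = s->journal_info;
	current->plug = s->plug;
	current->bio_list = s->bio_list;
}
```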
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 9:19 ` Ming Lei
@ 2025-11-25 9:39 ` Gao Xiang
2025-11-25 10:13 ` Gao Xiang
2025-11-25 10:41 ` Ming Lei
0 siblings, 2 replies; 27+ messages in thread
From: Gao Xiang @ 2025-11-25 9:39 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
Hi Ming,
On 2025/11/25 17:19, Ming Lei wrote:
> On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
>> Hi Ming and Christoph,
>>
>> On 2025/11/25 11:00, Ming Lei wrote:
>>> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>>>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>>>> using loop on top of XFS on top of stacked block devices.
>>>>>
>>>>> Can you share your setting?
>>>>>
>>>>> BTW, there are one followup fix:
>>>>>
>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>>>
>>>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>>>> not see stack overflow with the above fix against -next.
>>>>
>>>> This was with a development tree with lots of local code. So the
>>>> messages aren't applicable (and probably a hint I need to reduce my
>>>> stack usage). The observations is that we now stack through from block
>>>> submission context into the file system write path, which is bad for a
>>>> lot of reasons. journal_info being the most obvious one.
>>>>
>>>>>> In other words: I don't think issuing file system I/O from the
>>>>>> submission thread in loop can work, and we should drop this again.
>>>>>
>>>>> I don't object to drop it one more time.
>>>>>
>>>>> However, can we confirm if it is really a stack overflow because of
>>>>> calling into FS from ->queue_rq()?
>>>>
>>>> Yes.
>>>>
>>>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>>>
>>>> I think calling directly into the lower file system without a context
>>>> switch is very problematic, so IMHO yes, it is a dead end.
>> I've already explained the details in
>> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>>
>> to zram folks why block devices act like this is very
>> risky (in brief, because virtual block devices don't
>> have any way (unlike the inner fs itself) to know enough
>> about whether the inner fs already did something without
>> context save (a.k.a side effect) so a new task context
>> is absolutely necessary for virtual block devices to
>> access backing fses for stacked usage.
>>
>> So whether a nested fs can success is intrinsic to
>> specific fses (because either they assure no complex
>> journal_info access or save all effected contexts before
>> transiting to the block layer. But that is not bdev can
>> do since they need to do any block fs.
>
> IMO, task stack overflow could be the biggest trouble.
>
> block layer has current->blk_plug/current->bio_list, which are
> dealt with in the following patches:
>
> https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
> https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
I think that is the simplest case, because the context of
"current->blk_plug/current->bio_list" is _owned_ by the block layer, so of
course the block layer knows how to (and should) save and restore them.
>
> I am curious why FS task context can't be saved/restored inside block
> layer when calling into new FS IO? Given it is just per-task info.
The problem is that a block driver doesn't know what the upper FS
(sorry about the terminology) did before calling into the block layer
(the task_struct and journal_info side effects are just the obvious
ones), because FSes (mainly in the write path) don't assume the current
context will be transited into another FS context, and they may
introduce any fs-specific context before calling into the block layer.
So it's the fs's business to save/restore contexts, since the fs is what
changes the context, and it's none of the block layer's business to save
and restore, because the block device knows nothing about the specific
fs behavior; it has to deal with all block FSes.
Let's put it another way and think about a generic calling
convention [1], which includes caller-saved and callee-saved contexts.
I think the problem here is overall similar: a loop device knows nothing
about how either the lower or the upper FS behaves (because it doesn't
directly know either FS's contexts), so it should either expect the
upper fs to save all the contexts, or use a new kthread context (to
emulate userspace requests to the FS) for the lower FS.
[1] https://en.wikipedia.org/wiki/Calling_convention
Thanks,
Gao Xiang
>
>
> Thanks,
> Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 9:39 ` Gao Xiang
@ 2025-11-25 10:13 ` Gao Xiang
2025-11-25 10:41 ` Ming Lei
1 sibling, 0 replies; 27+ messages in thread
From: Gao Xiang @ 2025-11-25 10:13 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On 2025/11/25 17:39, Gao Xiang wrote:
> Hi Ming,
>
> On 2025/11/25 17:19, Ming Lei wrote:
>> On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
>>> Hi Ming and Christoph,
>>>
>>> On 2025/11/25 11:00, Ming Lei wrote:
>>>> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>>>>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>>>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>>>>> using loop on top of XFS on top of stacked block devices.
>>>>>>
>>>>>> Can you share your setting?
>>>>>>
>>>>>> BTW, there are one followup fix:
>>>>>>
>>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>>>>
>>>>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>>>>> not see stack overflow with the above fix against -next.
>>>>>
>>>>> This was with a development tree with lots of local code. So the
>>>>> messages aren't applicable (and probably a hint I need to reduce my
>>>>> stack usage). The observations is that we now stack through from block
>>>>> submission context into the file system write path, which is bad for a
>>>>> lot of reasons. journal_info being the most obvious one.
>>>>>
>>>>>>> In other words: I don't think issuing file system I/O from the
>>>>>>> submission thread in loop can work, and we should drop this again.
>>>>>>
>>>>>> I don't object to drop it one more time.
>>>>>>
>>>>>> However, can we confirm if it is really a stack overflow because of
>>>>>> calling into FS from ->queue_rq()?
>>>>>
>>>>> Yes.
>>>>>
>>>>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>>>>
>>>>> I think calling directly into the lower file system without a context
>>>>> switch is very problematic, so IMHO yes, it is a dead end.
>>> I've already explained the details in
>>> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>>>
>>> to zram folks why block devices act like this is very
>>> risky (in brief, because virtual block devices don't
>>> have any way (unlike the inner fs itself) to know enough
>>> about whether the inner fs already did something without
>>> context save (a.k.a side effect) so a new task context
>>> is absolutely necessary for virtual block devices to
>>> access backing fses for stacked usage.
>>>
>>> So whether a nested fs can success is intrinsic to
>>> specific fses (because either they assure no complex
>>> journal_info access or save all effected contexts before
>>> transiting to the block layer. But that is not bdev can
>>> do since they need to do any block fs.
>>
>> IMO, task stack overflow could be the biggest trouble.
>>
>> block layer has current->blk_plug/current->bio_list, which are
>> dealt with in the following patches:
>>
>> https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
>> https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
>
> I think it's the simplist thing for this because the
> context of "current->blk_plug/current->bio_list" is
> _owned_ by the block layer, so of course the block
> layer knows how to (and should) save and restore
> them.
>
>>
>> I am curious why FS task context can't be saved/restored inside block
>> layer when calling into new FS IO? Given it is just per-task info.
>
> The problem is a block driver don't know what the upper FS
> (sorry about the terminology) did before calling into block
> layer (the task_struct and journal_info side effect is just
> the obvious one), because all FSes (mainly the write path)
> doesn't assume the current context will be transited into
> another FS context, and it could introduce any fs-specific
> context before calling into the block layer.
>
> So it's the fs's business to save / restore contexts since
> they change the context and it's none of the block layer
> business to save and restore because the block device knows
> nothing about the specific fs behavior, it should deal with
> all block FSes.
>
> Let's put it into another way, thinking about generic
> calling convention[1], which includes caller-saved contexts
> and callee-saved contexts. I think the problem is here
> overally similiar, for loop devices, you know none of lower
> or upper FS behaves (because it doesn't directly know either
> upper or lower FS contexts), so it should either expect the
> upper fs to save all the contexts, or to use a new kthread
> context (to emulate userspace requests to FS) for lower FS.
Either expect a) the upper fs to save all the changed contexts (because
only the upper fs knows what it did) so that the current context can be
reused, or b) use a new kthread context (much like forking a new
process, to emulate an entirely clean userspace request to the FS) for
the lower FS, since that is how modern multi-task OSes isolate all
possible contexts using the task concept.
I'm not saying a) is impossible, but it needs a formal contract (a
strict convention [e.g. save all changed contexts before submit_bio()],
or new save/restore hooks to call on demand) to let all filesystems know
they now need to deal with nested fses, rather than just crossing the
line and saving non-block contexts in the block driver itself, because
that cannot be exhaustive.
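For instance, such a contract could look roughly like this (entirely
hypothetical, none of these names exist today; it only makes the idea
concrete):

```c
/*
 * Hypothetical hooks a filesystem could register so that a stacked block
 * driver can park fs-private per-task state before re-entering another
 * (or the same) filesystem, and put it back afterwards.
 */
struct fs_task_ctx_ops {
	void *(*save)(void);		/* detach fs-private per-task state */
	void (*restore)(void *state);	/* re-attach the saved state */
};
```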
Just my opinion on this for reference.
Thanks,
Gao Xiang
>
> [1] https://en.wikipedia.org/wiki/Calling_convention
>
>
> Thanks,
> Gao Xiang
>
>>
>>
>> Thanks,
>> Ming
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 9:39 ` Gao Xiang
2025-11-25 10:13 ` Gao Xiang
@ 2025-11-25 10:41 ` Ming Lei
2025-11-25 10:57 ` Gao Xiang
1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-11-25 10:41 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
> Hi Ming,
>
> On 2025/11/25 17:19, Ming Lei wrote:
> > On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
> > > Hi Ming and Christoph,
> > >
> > > On 2025/11/25 11:00, Ming Lei wrote:
> > > > On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> > > > > > On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > > > > > > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > > > > > > using loop on top of XFS on top of stacked block devices.
> > > > > >
> > > > > > Can you share your setting?
> > > > > >
> > > > > > BTW, there are one followup fix:
> > > > > >
> > > > > > https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
> > > > > >
> > > > > > I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> > > > > > not see stack overflow with the above fix against -next.
> > > > >
> > > > > This was with a development tree with lots of local code. So the
> > > > > messages aren't applicable (and probably a hint I need to reduce my
> > > > > stack usage). The observations is that we now stack through from block
> > > > > submission context into the file system write path, which is bad for a
> > > > > lot of reasons. journal_info being the most obvious one.
> > > > >
> > > > > > > In other words: I don't think issuing file system I/O from the
> > > > > > > submission thread in loop can work, and we should drop this again.
> > > > > >
> > > > > > I don't object to drop it one more time.
> > > > > >
> > > > > > However, can we confirm if it is really a stack overflow because of
> > > > > > calling into FS from ->queue_rq()?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > If yes, it could be dead end to improve loop in this way, then I can give up.
> > > > >
> > > > > I think calling directly into the lower file system without a context
> > > > > switch is very problematic, so IMHO yes, it is a dead end.
> > > I've already explained the details in
> > > https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
> > >
> > > to zram folks why block devices act like this is very
> > > risky (in brief, because virtual block devices don't
> > > have any way (unlike the inner fs itself) to know enough
> > > about whether the inner fs already did something without
> > > context save (a.k.a side effect) so a new task context
> > > is absolutely necessary for virtual block devices to
> > > access backing fses for stacked usage.
> > >
> > > So whether a nested fs can success is intrinsic to
> > > specific fses (because either they assure no complex
> > > journal_info access or save all effected contexts before
> > > transiting to the block layer. But that is not bdev can
> > > do since they need to do any block fs.
> >
> > IMO, task stack overflow could be the biggest trouble.
> >
> > block layer has current->blk_plug/current->bio_list, which are
> > dealt with in the following patches:
> >
> > https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
> > https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
>
> I think it's the simplist thing for this because the
> context of "current->blk_plug/current->bio_list" is
> _owned_ by the block layer, so of course the block
> layer knows how to (and should) save and restore
> them.
Strictly speaking, all per-task context data is owned by the task rather
than by any subsystem; otherwise it wouldn't need to be stored in
`task_struct`, except for cases that merely want per-task storage.
Take current->blk_plug for example: it is used by many subsystems (io_uring,
FS, mm, block layer, md/dm, drivers, ...).
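Roughly the shape of what those two patches do (a simplified sketch, not
the actual patch code; the real task_struct fields are ->plug and
->bio_list, and a real implementation would have to flush or hand off a
non-empty plug rather than just parking the pointer):

	/*
	 * Sketch: save the submitter's block-layer per-task context,
	 * run the nested submission with a clean context, then restore
	 * the original pointers.
	 */
	static void issue_with_clean_blk_task_ctx(void (*issue)(void *), void *arg)
	{
		struct bio_list *saved_bio_list = current->bio_list;
		struct blk_plug *saved_plug = current->plug;

		current->bio_list = NULL;
		current->plug = NULL;

		issue(arg);		/* e.g. call into the backing FS */

		current->plug = saved_plug;
		current->bio_list = saved_bio_list;
	}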
>
> >
> > I am curious why FS task context can't be saved/restored inside block
> > layer when calling into new FS IO? Given it is just per-task info.
>
> The problem is a block driver don't know what the upper FS
> (sorry about the terminology) did before calling into block
> layer (the task_struct and journal_info side effect is just
> the obvious one), because all FSes (mainly the write path)
> doesn't assume the current context will be transited into
> another FS context, and it could introduce any fs-specific
> context before calling into the block layer.
>
> So it's the fs's business to save / restore contexts since
> they change the context and it's none of the block layer
> business to save and restore because the block device knows
> nothing about the specific fs behavior, it should deal with
> all block FSes.
>
> Let's put it into another way, thinking about generic
> calling convention[1], which includes caller-saved contexts
> and callee-saved contexts. I think the problem is here
> overally similiar, for loop devices, you know none of lower
> or upper FS behaves (because it doesn't directly know either
loop just needs to know which data to save/restore.
> upper or lower FS contexts), so it should either expect the
> upper fs to save all the contexts, or to use a new kthread
> context (to emulate userspace requests to FS) for lower FS.
>
> [1] https://en.wikipedia.org/wiki/Calling_convention
Taking lo_rw_aio_nowait() as an example, I am wondering why the following
save/restore wouldn't work if current->journal_info is the only FS
context data:
curr_journal = current->journal_info;
current->journal_info = NULL; /* like handling the IO by schedule wq */
ret = lo_rw_aio_nowait(); /* call into FS write/read_iter() from .queue_rq() */
current->journal_info = curr_journal;
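Spelled out a little more (still only a sketch: the helper name, the
argument list of lo_rw_aio_nowait() and the error handling are assumed
here, not taken from the series):

	/*
	 * Sketch: stash the submitter's journal_info, issue the NOWAIT
	 * FS I/O with a clean value (the same thing a workqueue worker
	 * would see), then put the original back before returning to
	 * the block layer.
	 */
	static int lo_try_nowait_clean_journal(struct loop_device *lo,
					       struct loop_cmd *cmd)
	{
		void *saved = current->journal_info;
		int ret;

		current->journal_info = NULL;	/* clean FS context */
		ret = lo_rw_aio_nowait(lo, cmd);/* ->read_iter()/->write_iter() */
		current->journal_info = saved;

		return ret;	/* caller could fall back to the workqueue on failure */
	}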
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 10:41 ` Ming Lei
@ 2025-11-25 10:57 ` Gao Xiang
2025-11-25 11:10 ` Christoph Hellwig
2025-11-25 11:48 ` Ming Lei
0 siblings, 2 replies; 27+ messages in thread
From: Gao Xiang @ 2025-11-25 10:57 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On 2025/11/25 18:41, Ming Lei wrote:
> On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
>> Hi Ming,
>>
>> On 2025/11/25 17:19, Ming Lei wrote:
>>> On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
>>>> Hi Ming and Christoph,
>>>>
>>>> On 2025/11/25 11:00, Ming Lei wrote:
>>>>> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>>>>>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>>>>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>>>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>>>>>> using loop on top of XFS on top of stacked block devices.
>>>>>>>
>>>>>>> Can you share your setting?
>>>>>>>
>>>>>>> BTW, there are one followup fix:
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>>>>>
>>>>>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>>>>>> not see stack overflow with the above fix against -next.
>>>>>>
>>>>>> This was with a development tree with lots of local code. So the
>>>>>> messages aren't applicable (and probably a hint I need to reduce my
>>>>>> stack usage). The observations is that we now stack through from block
>>>>>> submission context into the file system write path, which is bad for a
>>>>>> lot of reasons. journal_info being the most obvious one.
>>>>>>
>>>>>>>> In other words: I don't think issuing file system I/O from the
>>>>>>>> submission thread in loop can work, and we should drop this again.
>>>>>>>
>>>>>>> I don't object to drop it one more time.
>>>>>>>
>>>>>>> However, can we confirm if it is really a stack overflow because of
>>>>>>> calling into FS from ->queue_rq()?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>>>>>
>>>>>> I think calling directly into the lower file system without a context
>>>>>> switch is very problematic, so IMHO yes, it is a dead end.
>>>> I've already explained the details in
>>>> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>>>>
>>>> to zram folks why block devices act like this is very
>>>> risky (in brief, because virtual block devices don't
>>>> have any way (unlike the inner fs itself) to know enough
>>>> about whether the inner fs already did something without
>>>> context save (a.k.a side effect) so a new task context
>>>> is absolutely necessary for virtual block devices to
>>>> access backing fses for stacked usage.
>>>>
>>>> So whether a nested fs can success is intrinsic to
>>>> specific fses (because either they assure no complex
>>>> journal_info access or save all effected contexts before
>>>> transiting to the block layer. But that is not bdev can
>>>> do since they need to do any block fs.
>>>
>>> IMO, task stack overflow could be the biggest trouble.
>>>
>>> block layer has current->blk_plug/current->bio_list, which are
>>> dealt with in the following patches:
>>>
>>> https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
>>> https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
>>
>> I think it's the simplist thing for this because the
>> context of "current->blk_plug/current->bio_list" is
>> _owned_ by the block layer, so of course the block
>> layer knows how to (and should) save and restore
>> them.
>
> Strictly speaking, all per-task context data is owned by task, instead
> of subsystems, otherwise, it needn't to be stored in `task_struct` except
> for some case just wants per-task storage.
>
> For example of current->blk_plug, it is used by many subsystems(io_uring, FS,
> mm, block layer, md/dm, drivers, ...).
>
>>
>>>
>>> I am curious why FS task context can't be saved/restored inside block
>>> layer when calling into new FS IO? Given it is just per-task info.
>>
>> The problem is a block driver don't know what the upper FS
>> (sorry about the terminology) did before calling into block
>> layer (the task_struct and journal_info side effect is just
>> the obvious one), because all FSes (mainly the write path)
>> doesn't assume the current context will be transited into
>> another FS context, and it could introduce any fs-specific
>> context before calling into the block layer.
>>
>> So it's the fs's business to save / restore contexts since
>> they change the context and it's none of the block layer
>> business to save and restore because the block device knows
>> nothing about the specific fs behavior, it should deal with
>> all block FSes.
>>
>> Let's put it into another way, thinking about generic
>> calling convention[1], which includes caller-saved contexts
>> and callee-saved contexts. I think the problem is here
>> overally similiar, for loop devices, you know none of lower
>> or upper FS behaves (because it doesn't directly know either
>
> loop just need to know which data to save/restore.
I've said there is no clear list of which data needs to be
saved/restored.
FSes can do _anything_. Maybe something in `current` needs
to be saved, but anything that uses `current`/PID as
a mapping key also needs to be saved, e.g., arbitrary
`hash_table[current]` or `context_table[current->pid]`.
Again, that is because not all filesystems allow nesting by design:
the Linux kernel doesn't require block filesystems to be nestable.
>
>> upper or lower FS contexts), so it should either expect the
>> upper fs to save all the contexts, or to use a new kthread
>> context (to emulate userspace requests to FS) for lower FS.
>>
>> [1] https://en.wikipedia.org/wiki/Calling_convention
>
> For example of lo_rw_aio_nowait(), I am wondering why the following
> save/restore doesn't work if current->journal_info is the only FS
> context data?
>
> curr_journal = current->journal_info;
> current->journal_info = NULL; /* like handling the IO by schedule wq */
> ret = lo_rw_aio_nowait(); /* call into FS write/read_iter() from .queue_rq() */
> current->journal_info = curr_journal;
- Why does journal_info need to be saved? For example, XFS
never uses journal_info: why save it in loop if the upper
FS is xfs?
- journal_info is resolved; what's next? It's just digging
from one rabbit hole into another, and it's not the way to
implement proper context save and restore.
I think I've already expressed my idea clearly, and I don't
want to get involved in either zram or loop; it's none of
my business.
If playing whack-a-mole is so much fun (it may take quite a
long time to find each piece of state that needs saving,
some easy to spot and some hard), good luck with it.
Thanks,
Gao Xiang
>
>
> Thanks,
> Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 10:57 ` Gao Xiang
@ 2025-11-25 11:10 ` Christoph Hellwig
2025-11-25 11:48 ` Ming Lei
1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-11-25 11:10 UTC (permalink / raw)
To: Gao Xiang
Cc: Ming Lei, Christoph Hellwig, linux-block, Mikulas Patocka,
Zhaoyang Huang, Dave Chinner, linux-fsdevel, Jens Axboe
On Tue, Nov 25, 2025 at 06:57:15PM +0800, Gao Xiang wrote:
> I've said there is no clear list of which data needs to be
> saved/restored.
>
> FSes can do _anything_. Maybe something in `current` needs
> to be saved, but anything that uses `current`/PID as
> a mapping key also needs to be saved, e.g., arbitrary
>
> `hash_table[current]` or `context_table[current->pid]`.
>
> Again, because not all filesystems allow nesting by design:
> Linux kernel doesn't need block filesystem to be nested.
Yes. Various other PF_ flags also come to mind, or Kent's
magic fault-disabling flag (although with bcachefs gone we can
probably remove that, come to think of it).
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 10:57 ` Gao Xiang
2025-11-25 11:10 ` Christoph Hellwig
@ 2025-11-25 11:48 ` Ming Lei
2025-11-25 11:58 ` Gao Xiang
1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-11-25 11:48 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On Tue, Nov 25, 2025 at 06:57:15PM +0800, Gao Xiang wrote:
>
>
> On 2025/11/25 18:41, Ming Lei wrote:
> > On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
> > > Hi Ming,
> > >
> > > On 2025/11/25 17:19, Ming Lei wrote:
> > > > On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
> > > > > Hi Ming and Christoph,
> > > > >
> > > > > On 2025/11/25 11:00, Ming Lei wrote:
> > > > > > On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
> > > > > > > On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> > > > > > > > On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > > > > > > > > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > > > > > > > > using loop on top of XFS on top of stacked block devices.
> > > > > > > >
> > > > > > > > Can you share your setting?
> > > > > > > >
> > > > > > > > BTW, there are one followup fix:
> > > > > > > >
> > > > > > > > https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
> > > > > > > >
> > > > > > > > I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> > > > > > > > not see stack overflow with the above fix against -next.
> > > > > > >
> > > > > > > This was with a development tree with lots of local code. So the
> > > > > > > messages aren't applicable (and probably a hint I need to reduce my
> > > > > > > stack usage). The observations is that we now stack through from block
> > > > > > > submission context into the file system write path, which is bad for a
> > > > > > > lot of reasons. journal_info being the most obvious one.
> > > > > > >
> > > > > > > > > In other words: I don't think issuing file system I/O from the
> > > > > > > > > submission thread in loop can work, and we should drop this again.
> > > > > > > >
> > > > > > > > I don't object to drop it one more time.
> > > > > > > >
> > > > > > > > However, can we confirm if it is really a stack overflow because of
> > > > > > > > calling into FS from ->queue_rq()?
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > > If yes, it could be dead end to improve loop in this way, then I can give up.
> > > > > > >
> > > > > > > I think calling directly into the lower file system without a context
> > > > > > > switch is very problematic, so IMHO yes, it is a dead end.
> > > > > I've already explained the details in
> > > > > https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
> > > > >
> > > > > to zram folks why block devices act like this is very
> > > > > risky (in brief, because virtual block devices don't
> > > > > have any way (unlike the inner fs itself) to know enough
> > > > > about whether the inner fs already did something without
> > > > > context save (a.k.a side effect) so a new task context
> > > > > is absolutely necessary for virtual block devices to
> > > > > access backing fses for stacked usage.
> > > > >
> > > > > So whether a nested fs can success is intrinsic to
> > > > > specific fses (because either they assure no complex
> > > > > journal_info access or save all effected contexts before
> > > > > transiting to the block layer. But that is not bdev can
> > > > > do since they need to do any block fs.
> > > >
> > > > IMO, task stack overflow could be the biggest trouble.
> > > >
> > > > block layer has current->blk_plug/current->bio_list, which are
> > > > dealt with in the following patches:
> > > >
> > > > https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
> > > > https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
> > >
> > > I think it's the simplist thing for this because the
> > > context of "current->blk_plug/current->bio_list" is
> > > _owned_ by the block layer, so of course the block
> > > layer knows how to (and should) save and restore
> > > them.
> >
> > Strictly speaking, all per-task context data is owned by task, instead
> > of subsystems, otherwise, it needn't to be stored in `task_struct` except
> > for some case just wants per-task storage.
> >
> > For example of current->blk_plug, it is used by many subsystems(io_uring, FS,
> > mm, block layer, md/dm, drivers, ...).
> >
> > >
> > > >
> > > > I am curious why FS task context can't be saved/restored inside block
> > > > layer when calling into new FS IO? Given it is just per-task info.
> > >
> > > The problem is a block driver don't know what the upper FS
> > > (sorry about the terminology) did before calling into block
> > > layer (the task_struct and journal_info side effect is just
> > > the obvious one), because all FSes (mainly the write path)
> > > doesn't assume the current context will be transited into
> > > another FS context, and it could introduce any fs-specific
> > > context before calling into the block layer.
> > >
> > > So it's the fs's business to save / restore contexts since
> > > they change the context and it's none of the block layer
> > > business to save and restore because the block device knows
> > > nothing about the specific fs behavior, it should deal with
> > > all block FSes.
> > >
> > > Let's put it into another way, thinking about generic
> > > calling convention[1], which includes caller-saved contexts
> > > and callee-saved contexts. I think the problem is here
> > > overally similiar, for loop devices, you know none of lower
> > > or upper FS behaves (because it doesn't directly know either
> >
> > loop just need to know which data to save/restore.
>
> I've said there is no clear list of which data needs to be
> saved/restored.
>
> FSes can do _anything_. Maybe something in `current` needs
> to be saved, but anything that uses `current`/PID as
> a mapping key also needs to be saved, e.g., arbitrary
>
> `hash_table[current]` or `context_table[current->pid]`.
>
> Again, because not all filesystems allow nesting by design:
> Linux kernel doesn't need block filesystem to be nested.
OK, got it, thanks for sharing.
BTW, the block layer actually uses current->bio_list to avoid nested bio
submission.
A similar trick could be played on FS ->read_iter()/->write_iter() over
`kiocb` to avoid nested FS IO too, but I'm not sure there is a really
big use case.
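For reference, the bio-side trick is roughly the following shape
(simplified from the submit_bio path); a kiocb variant would need an
analogous per-task list plus a drain loop in the outermost caller --
nothing like that exists today, it is only an idea:

	/*
	 * Simplified: if a submission is already active on this task,
	 * queue the new bio on the per-task list instead of recursing;
	 * the outermost submitter drains the list iteratively.
	 */
	if (current->bio_list) {
		bio_list_add(&current->bio_list[0], bio);
		return;
	}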
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 11:48 ` Ming Lei
@ 2025-11-25 11:58 ` Gao Xiang
2025-11-25 12:18 ` Ming Lei
0 siblings, 1 reply; 27+ messages in thread
From: Gao Xiang @ 2025-11-25 11:58 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On 2025/11/25 19:48, Ming Lei wrote:
> On Tue, Nov 25, 2025 at 06:57:15PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/11/25 18:41, Ming Lei wrote:
>>> On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
>>>> Hi Ming,
>>>>
>>>> On 2025/11/25 17:19, Ming Lei wrote:
>>>>> On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
>>>>>> Hi Ming and Christoph,
>>>>>>
>>>>>> On 2025/11/25 11:00, Ming Lei wrote:
>>>>>>> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>>>>>>>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>>>>>>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>>>>>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>>>>>>>> using loop on top of XFS on top of stacked block devices.
>>>>>>>>>
>>>>>>>>> Can you share your setting?
>>>>>>>>>
>>>>>>>>> BTW, there are one followup fix:
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>>>>>>>
>>>>>>>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>>>>>>>> not see stack overflow with the above fix against -next.
>>>>>>>>
>>>>>>>> This was with a development tree with lots of local code. So the
>>>>>>>> messages aren't applicable (and probably a hint I need to reduce my
>>>>>>>> stack usage). The observations is that we now stack through from block
>>>>>>>> submission context into the file system write path, which is bad for a
>>>>>>>> lot of reasons. journal_info being the most obvious one.
>>>>>>>>
>>>>>>>>>> In other words: I don't think issuing file system I/O from the
>>>>>>>>>> submission thread in loop can work, and we should drop this again.
>>>>>>>>>
>>>>>>>>> I don't object to drop it one more time.
>>>>>>>>>
>>>>>>>>> However, can we confirm if it is really a stack overflow because of
>>>>>>>>> calling into FS from ->queue_rq()?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>>>>>>>
>>>>>>>> I think calling directly into the lower file system without a context
>>>>>>>> switch is very problematic, so IMHO yes, it is a dead end.
>>>>>> I've already explained the details in
>>>>>> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>>>>>>
>>>>>> to zram folks why block devices act like this is very
>>>>>> risky (in brief, because virtual block devices don't
>>>>>> have any way (unlike the inner fs itself) to know enough
>>>>>> about whether the inner fs already did something without
>>>>>> context save (a.k.a side effect) so a new task context
>>>>>> is absolutely necessary for virtual block devices to
>>>>>> access backing fses for stacked usage.
>>>>>>
>>>>>> So whether a nested fs can success is intrinsic to
>>>>>> specific fses (because either they assure no complex
>>>>>> journal_info access or save all effected contexts before
>>>>>> transiting to the block layer. But that is not bdev can
>>>>>> do since they need to do any block fs.
>>>>>
>>>>> IMO, task stack overflow could be the biggest trouble.
>>>>>
>>>>> block layer has current->blk_plug/current->bio_list, which are
>>>>> dealt with in the following patches:
>>>>>
>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
>>>>
>>>> I think it's the simplist thing for this because the
>>>> context of "current->blk_plug/current->bio_list" is
>>>> _owned_ by the block layer, so of course the block
>>>> layer knows how to (and should) save and restore
>>>> them.
>>>
>>> Strictly speaking, all per-task context data is owned by task, instead
>>> of subsystems, otherwise, it needn't to be stored in `task_struct` except
>>> for some case just wants per-task storage.
>>>
>>> For example of current->blk_plug, it is used by many subsystems(io_uring, FS,
>>> mm, block layer, md/dm, drivers, ...).
>>>
>>>>
>>>>>
>>>>> I am curious why FS task context can't be saved/restored inside block
>>>>> layer when calling into new FS IO? Given it is just per-task info.
>>>>
>>>> The problem is a block driver don't know what the upper FS
>>>> (sorry about the terminology) did before calling into block
>>>> layer (the task_struct and journal_info side effect is just
>>>> the obvious one), because all FSes (mainly the write path)
>>>> doesn't assume the current context will be transited into
>>>> another FS context, and it could introduce any fs-specific
>>>> context before calling into the block layer.
>>>>
>>>> So it's the fs's business to save / restore contexts since
>>>> they change the context and it's none of the block layer
>>>> business to save and restore because the block device knows
>>>> nothing about the specific fs behavior, it should deal with
>>>> all block FSes.
>>>>
>>>> Let's put it into another way, thinking about generic
>>>> calling convention[1], which includes caller-saved contexts
>>>> and callee-saved contexts. I think the problem is here
>>>> overally similiar, for loop devices, you know none of lower
>>>> or upper FS behaves (because it doesn't directly know either
>>>
>>> loop just need to know which data to save/restore.
>>
>> I've said there is no clear list of which data needs to be
>> saved/restored.
>>
>> FSes can do _anything_. Maybe something in `current` needs
>> to be saved, but anything that uses `current`/PID as
>> a mapping key also needs to be saved, e.g., arbitrary
>>
>> `hash_table[current]` or `context_table[current->pid]`.
>>
>> Again, because not all filesystems allow nesting by design:
>> Linux kernel doesn't need block filesystem to be nested.
>
> OK, got it, thanks for the sharing.
>
> BTW, block layer actually uses current->bio_list to avoid nested bio
> submission.
>
> The similar trick could be played on FS ->read_iter/->write_iter() over
> `kiocb` for avoiding nested FS IO too, but not sure if there is real
> big use case.
I don't think it's that similar: `current->bio_list` just deals
with the single BIO concept, but what nested fses need to deal
with is much more complicated.
Just a premature conclusion: I don't think it's feasible for
filesystems to work like this.
Thanks,
Gao Xiang
>
>
> Thanks,
> Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 11:58 ` Gao Xiang
@ 2025-11-25 12:18 ` Ming Lei
2025-11-25 15:16 ` Gao Xiang
0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-11-25 12:18 UTC (permalink / raw)
To: Gao Xiang
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On Tue, Nov 25, 2025 at 07:58:09PM +0800, Gao Xiang wrote:
>
>
> On 2025/11/25 19:48, Ming Lei wrote:
> > On Tue, Nov 25, 2025 at 06:57:15PM +0800, Gao Xiang wrote:
> > >
> > >
> > > On 2025/11/25 18:41, Ming Lei wrote:
> > > > On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
> > > > > Hi Ming,
> > > > >
> > > > > On 2025/11/25 17:19, Ming Lei wrote:
> > > > > > On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
> > > > > > > Hi Ming and Christoph,
> > > > > > >
> > > > > > > On 2025/11/25 11:00, Ming Lei wrote:
> > > > > > > > On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
> > > > > > > > > On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> > > > > > > > > > On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > > > > > > > > > > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > > > > > > > > > > using loop on top of XFS on top of stacked block devices.
> > > > > > > > > >
> > > > > > > > > > Can you share your setting?
> > > > > > > > > >
> > > > > > > > > > BTW, there are one followup fix:
> > > > > > > > > >
> > > > > > > > > > https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
> > > > > > > > > >
> > > > > > > > > > I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> > > > > > > > > > not see stack overflow with the above fix against -next.
> > > > > > > > >
> > > > > > > > > This was with a development tree with lots of local code. So the
> > > > > > > > > messages aren't applicable (and probably a hint I need to reduce my
> > > > > > > > > stack usage). The observations is that we now stack through from block
> > > > > > > > > submission context into the file system write path, which is bad for a
> > > > > > > > > lot of reasons. journal_info being the most obvious one.
> > > > > > > > >
> > > > > > > > > > > In other words: I don't think issuing file system I/O from the
> > > > > > > > > > > submission thread in loop can work, and we should drop this again.
> > > > > > > > > >
> > > > > > > > > > I don't object to drop it one more time.
> > > > > > > > > >
> > > > > > > > > > However, can we confirm if it is really a stack overflow because of
> > > > > > > > > > calling into FS from ->queue_rq()?
> > > > > > > > >
> > > > > > > > > Yes.
> > > > > > > > >
> > > > > > > > > > If yes, it could be dead end to improve loop in this way, then I can give up.
> > > > > > > > >
> > > > > > > > > I think calling directly into the lower file system without a context
> > > > > > > > > switch is very problematic, so IMHO yes, it is a dead end.
> > > > > > > I've already explained the details in
> > > > > > > https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
> > > > > > >
> > > > > > > to zram folks why block devices act like this is very
> > > > > > > risky (in brief, because virtual block devices don't
> > > > > > > have any way (unlike the inner fs itself) to know enough
> > > > > > > about whether the inner fs already did something without
> > > > > > > context save (a.k.a side effect) so a new task context
> > > > > > > is absolutely necessary for virtual block devices to
> > > > > > > access backing fses for stacked usage.
> > > > > > >
> > > > > > > So whether a nested fs can success is intrinsic to
> > > > > > > specific fses (because either they assure no complex
> > > > > > > journal_info access or save all effected contexts before
> > > > > > > transiting to the block layer. But that is not bdev can
> > > > > > > do since they need to do any block fs.
> > > > > >
> > > > > > IMO, task stack overflow could be the biggest trouble.
> > > > > >
> > > > > > block layer has current->blk_plug/current->bio_list, which are
> > > > > > dealt with in the following patches:
> > > > > >
> > > > > > https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
> > > > > > https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
> > > > >
> > > > > I think it's the simplist thing for this because the
> > > > > context of "current->blk_plug/current->bio_list" is
> > > > > _owned_ by the block layer, so of course the block
> > > > > layer knows how to (and should) save and restore
> > > > > them.
> > > >
> > > > Strictly speaking, all per-task context data is owned by task, instead
> > > > of subsystems, otherwise, it needn't to be stored in `task_struct` except
> > > > for some case just wants per-task storage.
> > > >
> > > > For example of current->blk_plug, it is used by many subsystems(io_uring, FS,
> > > > mm, block layer, md/dm, drivers, ...).
> > > >
> > > > >
> > > > > >
> > > > > > I am curious why FS task context can't be saved/restored inside block
> > > > > > layer when calling into new FS IO? Given it is just per-task info.
> > > > >
> > > > > The problem is a block driver don't know what the upper FS
> > > > > (sorry about the terminology) did before calling into block
> > > > > layer (the task_struct and journal_info side effect is just
> > > > > the obvious one), because all FSes (mainly the write path)
> > > > > doesn't assume the current context will be transited into
> > > > > another FS context, and it could introduce any fs-specific
> > > > > context before calling into the block layer.
> > > > >
> > > > > So it's the fs's business to save / restore contexts since
> > > > > they change the context and it's none of the block layer
> > > > > business to save and restore because the block device knows
> > > > > nothing about the specific fs behavior, it should deal with
> > > > > all block FSes.
> > > > >
> > > > > Let's put it into another way, thinking about generic
> > > > > calling convention[1], which includes caller-saved contexts
> > > > > and callee-saved contexts. I think the problem is here
> > > > > overally similiar, for loop devices, you know none of lower
> > > > > or upper FS behaves (because it doesn't directly know either
> > > >
> > > > loop just need to know which data to save/restore.
> > >
> > > I've said there is no clear list of which data needs to be
> > > saved/restored.
> > >
> > > FSes can do _anything_. Maybe something in `current` needs
> > > to be saved, but anything that uses `current`/PID as
> > > a mapping key also needs to be saved, e.g., arbitrary
> > >
> > > `hash_table[current]` or `context_table[current->pid]`.
> > >
> > > Again, because not all filesystems allow nesting by design:
> > > Linux kernel doesn't need block filesystem to be nested.
> >
> > OK, got it, thanks for the sharing.
> >
> > BTW, block layer actually uses current->bio_list to avoid nested bio
> > submission.
> >
> > The similar trick could be played on FS ->read_iter/->write_iter() over
> > `kiocb` for avoiding nested FS IO too, but not sure if there is real
> > big use case.
>
> I don't think it's much similar, `current->bio_list` just deals
> with the single BIO concept, but what nested fses need to deal
No, it is not: it can be a whole tree of BIOs in the case of dm/md.
> with is much complicated.
Care to share why/what the complication is?
Anyway, it is just a raw idea, and the devil is always in the details.
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
2025-11-25 12:18 ` Ming Lei
@ 2025-11-25 15:16 ` Gao Xiang
0 siblings, 0 replies; 27+ messages in thread
From: Gao Xiang @ 2025-11-25 15:16 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, linux-block, Mikulas Patocka, Zhaoyang Huang,
Dave Chinner, linux-fsdevel, Jens Axboe
On 2025/11/25 20:18, Ming Lei wrote:
> On Tue, Nov 25, 2025 at 07:58:09PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/11/25 19:48, Ming Lei wrote:
>>> On Tue, Nov 25, 2025 at 06:57:15PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/11/25 18:41, Ming Lei wrote:
>>>>> On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
>>>>>> Hi Ming,
>>>>>>
>>>>>> On 2025/11/25 17:19, Ming Lei wrote:
>>>>>>> On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
>>>>>>>> Hi Ming and Christoph,
>>>>>>>>
>>>>>>>> On 2025/11/25 11:00, Ming Lei wrote:
>>>>>>>>> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>>>>>>>>>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>>>>>>>>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>>>>>>>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>>>>>>>>>> using loop on top of XFS on top of stacked block devices.
>>>>>>>>>>>
>>>>>>>>>>> Can you share your setting?
>>>>>>>>>>>
>>>>>>>>>>> BTW, there are one followup fix:
>>>>>>>>>>>
>>>>>>>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>>>>>>>>>
>>>>>>>>>>> I just run 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>>>>>>>>>> not see stack overflow with the above fix against -next.
>>>>>>>>>>
>>>>>>>>>> This was with a development tree with lots of local code. So the
>>>>>>>>>> messages aren't applicable (and probably a hint I need to reduce my
>>>>>>>>>> stack usage). The observations is that we now stack through from block
>>>>>>>>>> submission context into the file system write path, which is bad for a
>>>>>>>>>> lot of reasons. journal_info being the most obvious one.
>>>>>>>>>>
>>>>>>>>>>>> In other words: I don't think issuing file system I/O from the
>>>>>>>>>>>> submission thread in loop can work, and we should drop this again.
>>>>>>>>>>>
>>>>>>>>>>> I don't object to drop it one more time.
>>>>>>>>>>>
>>>>>>>>>>> However, can we confirm if it is really a stack overflow because of
>>>>>>>>>>> calling into FS from ->queue_rq()?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> If yes, it could be dead end to improve loop in this way, then I can give up.
>>>>>>>>>>
>>>>>>>>>> I think calling directly into the lower file system without a context
>>>>>>>>>> switch is very problematic, so IMHO yes, it is a dead end.
>>>>>>>> I've already explained the details in
>>>>>>>> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>>>>>>>>
>>>>>>>> to zram folks why block devices act like this is very
>>>>>>>> risky (in brief, because virtual block devices don't
>>>>>>>> have any way (unlike the inner fs itself) to know enough
>>>>>>>> about whether the inner fs already did something without
>>>>>>>> context save (a.k.a side effect) so a new task context
>>>>>>>> is absolutely necessary for virtual block devices to
>>>>>>>> access backing fses for stacked usage.
>>>>>>>>
>>>>>>>> So whether a nested fs can success is intrinsic to
>>>>>>>> specific fses (because either they assure no complex
>>>>>>>> journal_info access or save all effected contexts before
>>>>>>>> transiting to the block layer. But that is not bdev can
>>>>>>>> do since they need to do any block fs.
>>>>>>>
>>>>>>> IMO, task stack overflow could be the biggest trouble.
>>>>>>>
>>>>>>> block layer has current->blk_plug/current->bio_list, which are
>>>>>>> dealt with in the following patches:
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
>>>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
>>>>>>
>>>>>> I think it's the simplist thing for this because the
>>>>>> context of "current->blk_plug/current->bio_list" is
>>>>>> _owned_ by the block layer, so of course the block
>>>>>> layer knows how to (and should) save and restore
>>>>>> them.
>>>>>
>>>>> Strictly speaking, all per-task context data is owned by task, instead
>>>>> of subsystems, otherwise, it needn't to be stored in `task_struct` except
>>>>> for some case just wants per-task storage.
>>>>>
>>>>> For example of current->blk_plug, it is used by many subsystems(io_uring, FS,
>>>>> mm, block layer, md/dm, drivers, ...).
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I am curious why FS task context can't be saved/restored inside block
>>>>>>> layer when calling into new FS IO? Given it is just per-task info.
>>>>>>
>>>>>> The problem is a block driver don't know what the upper FS
>>>>>> (sorry about the terminology) did before calling into block
>>>>>> layer (the task_struct and journal_info side effect is just
>>>>>> the obvious one), because all FSes (mainly the write path)
>>>>>> doesn't assume the current context will be transited into
>>>>>> another FS context, and it could introduce any fs-specific
>>>>>> context before calling into the block layer.
>>>>>>
>>>>>> So it's the fs's business to save / restore contexts since
>>>>>> they change the context and it's none of the block layer
>>>>>> business to save and restore because the block device knows
>>>>>> nothing about the specific fs behavior, it should deal with
>>>>>> all block FSes.
>>>>>>
>>>>>> Let's put it into another way, thinking about generic
>>>>>> calling convention[1], which includes caller-saved contexts
>>>>>> and callee-saved contexts. I think the problem is here
>>>>>> overally similiar, for loop devices, you know none of lower
>>>>>> or upper FS behaves (because it doesn't directly know either
>>>>>
>>>>> loop just need to know which data to save/restore.
>>>>
>>>> I've said there is no clear list of which data needs to be
>>>> saved/restored.
>>>>
>>>> FSes can do _anything_. Maybe something in `current` needs
>>>> to be saved, but anything that uses `current`/PID as
>>>> a mapping key also needs to be saved, e.g., arbitrary
>>>>
>>>> `hash_table[current]` or `context_table[current->pid]`.
>>>>
>>>> Again, because not all filesystems allow nesting by design:
>>>> Linux kernel doesn't need block filesystem to be nested.
>>>
>>> OK, got it, thanks for the sharing.
>>>
>>> BTW, block layer actually uses current->bio_list to avoid nested bio
>>> submission.
>>>
>>> The similar trick could be played on FS ->read_iter/->write_iter() over
>>> `kiocb` for avoiding nested FS IO too, but not sure if there is real
>>> big use case.
>>
>> I don't think it's much similar, `current->bio_list` just deals
>> with the single BIO concept, but what nested fses need to deal
>
> No, it is not, it can be one tree of BIOs in case of dm/md.
>
>> with is much complicated.
>
> Care for sharing why/what the complicated is?
>
> Anyway it is just one raw idea, and the devil is always in the details.
FS I/Os are more complicated: for example, you need to do some metadata
I/Os in advance to learn where the data lives, and only then send the data
I/Os; the data I/Os may also need to be {de,en}crypted or
(de)compressed and then passed to the upper/lower fses.
All those behaviors can nest in arbitrary ways, and a much
simplified coroutine-like current->bio_list is not enough
for such cases anyway. Also, the programming model is
just different; I don't think it can be a practical approach.
Thanks,
Gao Xiang
>
>
> Thanks,
> Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread
Thread overview: 27+ messages
2025-10-15 11:07 [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
2025-10-15 11:07 ` [PATCH V5 1/6] loop: add helper lo_cmd_nr_bvec() Ming Lei
2025-10-15 15:49 ` Bart Van Assche
2025-10-16 2:19 ` Ming Lei
2025-10-15 11:07 ` [PATCH V5 2/6] loop: add helper lo_rw_aio_prep() Ming Lei
2025-10-15 11:07 ` [PATCH V5 3/6] loop: add lo_submit_rw_aio() Ming Lei
2025-10-15 11:07 ` [PATCH V5 4/6] loop: move command blkcg/memcg initialization into loop_queue_work Ming Lei
2025-10-15 11:07 ` [PATCH V5 5/6] loop: try to handle loop aio command via NOWAIT IO first Ming Lei
2025-10-15 11:07 ` [PATCH V5 6/6] loop: add hint for handling aio via IOCB_NOWAIT Ming Lei
2025-11-18 12:55 ` [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT Ming Lei
2025-11-18 15:38 ` Jens Axboe
2025-11-24 6:12 ` calling into file systems directly from ->queue_rq, was " Christoph Hellwig
2025-11-24 9:02 ` Ming Lei
2025-11-24 9:05 ` Christoph Hellwig
2025-11-25 3:00 ` Ming Lei
2025-11-25 3:56 ` Jens Axboe
2025-11-25 7:26 ` Gao Xiang
2025-11-25 9:19 ` Ming Lei
2025-11-25 9:39 ` Gao Xiang
2025-11-25 10:13 ` Gao Xiang
2025-11-25 10:41 ` Ming Lei
2025-11-25 10:57 ` Gao Xiang
2025-11-25 11:10 ` Christoph Hellwig
2025-11-25 11:48 ` Ming Lei
2025-11-25 11:58 ` Gao Xiang
2025-11-25 12:18 ` Ming Lei
2025-11-25 15:16 ` Gao Xiang