From: Ming Lei <ming.lei@redhat.com>
To: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Christoph Hellwig <hch@infradead.org>,
linux-block@vger.kernel.org,
Mikulas Patocka <mpatocka@redhat.com>,
Zhaoyang Huang <zhaoyang.huang@unisoc.com>,
Dave Chinner <dchinner@redhat.com>,
linux-fsdevel@vger.kernel.org, Jens Axboe <axboe@kernel.dk>
Subject: Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
Date: Tue, 25 Nov 2025 18:41:11 +0800
Message-ID: <aSWHx3ynP9Z_6DeY@fedora>
In-Reply-To: <00bc891e-4137-4d93-83a5-e4030903ffab@linux.alibaba.com>

On Tue, Nov 25, 2025 at 05:39:17PM +0800, Gao Xiang wrote:
> Hi Ming,
>
> On 2025/11/25 17:19, Ming Lei wrote:
> > On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
> > > Hi Ming and Christoph,
> > >
> > > On 2025/11/25 11:00, Ming Lei wrote:
> > > > On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
> > > > > > On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
> > > > > > > FYI, with this series I'm seeing somewhat frequent stack overflows when
> > > > > > > using loop on top of XFS on top of stacked block devices.
> > > > > >
> > > > > > Can you share your setting?
> > > > > >
> > > > > > > BTW, there is one follow-up fix:
> > > > > >
> > > > > > https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
> > > > > >
> > > > > > > I just ran 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
> > > > > > > and did not see a stack overflow with the above fix against -next.
> > > > >
> > > > > This was with a development tree with lots of local code. So the
> > > > > messages aren't applicable (and probably a hint I need to reduce my
> > > > > > stack usage). The observation is that we now stack through from block
> > > > > submission context into the file system write path, which is bad for a
> > > > > lot of reasons. journal_info being the most obvious one.
> > > > >
> > > > > > > In other words: I don't think issuing file system I/O from the
> > > > > > > submission thread in loop can work, and we should drop this again.
> > > > > >
> > > > > > > I don't object to dropping it one more time.
> > > > > >
> > > > > > However, can we confirm if it is really a stack overflow because of
> > > > > > calling into FS from ->queue_rq()?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > > If yes, it could be a dead end to improve loop in this way, then I can give up.
> > > > >
> > > > > I think calling directly into the lower file system without a context
> > > > > switch is very problematic, so IMHO yes, it is a dead end.
> > > I've already explained the details in
> > > https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
> > >
> > > to zram folks why block devices acting like this is very
> > > risky. In brief, virtual block devices (unlike the inner
> > > fs itself) have no way to know whether the inner fs has
> > > already done something without saving its context
> > > (a.k.a. a side effect), so a new task context is
> > > absolutely necessary for virtual block devices to
> > > access backing fses for stacked usage.
> > >
> > > So whether a nested fs can succeed is intrinsic to the
> > > specific fses (because they either guarantee no complex
> > > journal_info access or save all affected contexts before
> > > transitioning to the block layer). But that is not something
> > > a bdev can do, since it needs to work with any block fs.
> >
> > IMO, task stack overflow could be the biggest trouble.
> >
> > block layer has current->blk_plug/current->bio_list, which are
> > dealt with in the following patches:
> >
> > https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
> > https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/
>
> I think it's the simplest thing for this because the
> context of "current->blk_plug/current->bio_list" is
> _owned_ by the block layer, so of course the block
> layer knows how to (and should) save and restore
> them.

Strictly speaking, all per-task context data is owned by the task rather
than by any subsystem; otherwise it wouldn't need to be stored in
`task_struct` at all, except for the cases that merely want per-task storage.

Take current->blk_plug as an example: it is used by many subsystems
(io_uring, FS, mm, block layer, md/dm, drivers, ...).

>
> >
> > I am curious why FS task context can't be saved/restored inside block
> > layer when calling into new FS IO? Given it is just per-task info.
>
> The problem is that a block driver doesn't know what the upper FS
> (sorry about the terminology) did before calling into the block
> layer (the task_struct and journal_info side effect is just
> the obvious one), because no FS (mainly in the write path)
> assumes the current context will transition into
> another FS context, and an FS could introduce any fs-specific
> context before calling into the block layer.
>
> So it's the fs's business to save / restore contexts since
> it changes the context, and it's none of the block layer's
> business to save and restore because the block device knows
> nothing about the specific fs behavior; it has to deal with
> all block FSes.
>
> Let's put it another way and think about a generic
> calling convention[1], which includes caller-saved contexts
> and callee-saved contexts. I think the problem here is
> broadly similar: for loop devices, you know nothing about how
> the lower or upper FS behaves (because it doesn't directly know either

Loop just needs to know which data to save/restore.

> upper or lower FS contexts), so it should either expect the
> upper fs to save all the contexts, or to use a new kthread
> context (to emulate userspace requests to FS) for lower FS.
>
> [1] https://en.wikipedia.org/wiki/Calling_convention

Taking lo_rw_aio_nowait() as an example, I am wondering why the following
save/restore wouldn't work if current->journal_info is the only FS
context data:

	curr_journal = current->journal_info;
	current->journal_info = NULL;	/* as if the IO were handled from a scheduled wq */
	ret = lo_rw_aio_nowait();	/* call into FS write/read_iter() from .queue_rq() */
	current->journal_info = curr_journal;

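
To make the question concrete, here is a minimal userspace sketch of that save/restore pattern. It is not kernel code: `struct task`, `current_task`, `lower_fs_io()`, `upper_fs_submit()` and the two `queue_rq_*()` helpers are hypothetical stand-ins invented for illustration. It only models the case where journal_info really is the only per-task FS context, and it says nothing about stack depth or any other fs-specific state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only the field discussed here. */
struct task {
	void *journal_info;	/* per-task FS context, e.g. a transaction handle */
};

static struct task this_task;
static struct task *current_task = &this_task;	/* models the kernel's `current` */

/*
 * Models the lower FS read/write_iter entry point: it expects a clean
 * context and installs its own handle while it runs.
 * Returns 0 on success, -1 if it inherits a stale journal_info.
 */
static int lower_fs_io(void)
{
	static int lower_handle;

	if (current_task->journal_info != NULL)
		return -1;	/* would be misread as a nested transaction */

	current_task->journal_info = &lower_handle;
	/* ... lower FS work would happen here ... */
	current_task->journal_info = NULL;
	return 0;
}

/* Models ->queue_rq() calling into the lower FS with save/restore. */
static int queue_rq_with_save_restore(void)
{
	void *saved = current_task->journal_info;
	int ret;

	current_task->journal_info = NULL;	/* present a clean context */
	ret = lower_fs_io();
	current_task->journal_info = saved;	/* give the upper FS its context back */
	return ret;
}

/* Same call without save/restore, for comparison. */
static int queue_rq_direct(void)
{
	return lower_fs_io();
}

/* Models the upper FS holding a transaction while it submits block I/O. */
static int upper_fs_submit(int (*submit)(void))
{
	static int upper_handle;
	int ret;

	current_task->journal_info = &upper_handle;
	ret = submit();
	/* the upper FS must still see its own handle afterwards */
	if (ret == 0 && current_task->journal_info != &upper_handle)
		ret = -2;
	current_task->journal_info = NULL;
	return ret;
}
```

In this model, upper_fs_submit(queue_rq_with_save_restore) succeeds, while upper_fs_submit(queue_rq_direct) fails because the lower FS inherits the upper FS's handle. Whether real file systems keep all of their per-task state in a single field like this is exactly the open question in this thread.
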
Thanks,
Ming