From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <00bc891e-4137-4d93-83a5-e4030903ffab@linux.alibaba.com>
Date: Tue, 25 Nov 2025 17:39:17 +0800
Subject: Re: calling into file systems directly from ->queue_rq, was Re: [PATCH V5 0/6] loop: improve loop aio perf by IOCB_NOWAIT
From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Ming Lei
Cc: Christoph Hellwig, linux-block@vger.kernel.org, Mikulas Patocka, Zhaoyang Huang, Dave Chinner, linux-fsdevel@vger.kernel.org, Jens Axboe
References: <20251015110735.1361261-1-ming.lei@redhat.com>
X-Mailing-List: linux-block@vger.kernel.org
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi Ming,

On 2025/11/25 17:19, Ming Lei wrote:
> On Tue, Nov 25, 2025 at 03:26:39PM +0800, Gao Xiang wrote:
>> Hi Ming and Christoph,
>>
>> On 2025/11/25 11:00, Ming Lei wrote:
>>> On Mon, Nov 24, 2025 at 01:05:46AM -0800, Christoph Hellwig wrote:
>>>> On Mon, Nov 24, 2025 at 05:02:03PM +0800, Ming Lei wrote:
>>>>> On Sun, Nov 23, 2025 at 10:12:24PM -0800, Christoph Hellwig wrote:
>>>>>> FYI, with this series I'm seeing somewhat frequent stack overflows when
>>>>>> using loop on top of XFS on top of stacked block devices.
>>>>>
>>>>> Can you share your setting?
>>>>>
>>>>> BTW, there is one followup fix:
>>>>>
>>>>> https://lore.kernel.org/linux-block/20251120160722.3623884-1-ming.lei@redhat.com/
>>>>>
>>>>> I just ran 'xfstests -q quick' on loop on top of XFS on top of dm-stripe,
>>>>> and did not see a stack overflow with the above fix against -next.
>>>>
>>>> This was with a development tree with lots of local code.
>>>> So the messages aren't applicable (and probably a hint I need to
>>>> reduce my stack usage). The observation is that we now stack through
>>>> from block submission context into the file system write path, which
>>>> is bad for a lot of reasons, journal_info being the most obvious one.
>>>>
>>>>>> In other words: I don't think issuing file system I/O from the
>>>>>> submission thread in loop can work, and we should drop this again.
>>>>>
>>>>> I don't object to dropping it one more time.
>>>>>
>>>>> However, can we confirm if it is really a stack overflow because of
>>>>> calling into FS from ->queue_rq()?
>>>>
>>>> Yes.
>>>>
>>>>> If yes, it could be a dead end to improve loop in this way, and then
>>>>> I can give up.
>>>>
>>>> I think calling directly into the lower file system without a context
>>>> switch is very problematic, so IMHO yes, it is a dead end.
>>
>> I've already explained the details in
>> https://lore.kernel.org/r/8c596737-95c1-4274-9834-1fe06558b431@linux.alibaba.com
>> to the zram folks: having block devices act like this is very risky.
>> In brief, virtual block devices (unlike the inner fs itself) have no
>> way to know whether the inner fs already did something without saving
>> the context (a.k.a. a side effect), so a new task context is
>> absolutely necessary for virtual block devices to access backing fses
>> in stacked usage.
>>
>> So whether a nested fs can succeed is intrinsic to the specific fses,
>> because they either ensure no complex journal_info access or save all
>> affected contexts before transiting into the block layer. But that is
>> not something a bdev can do, since it needs to work with any block fs.
>
> IMO, task stack overflow could be the biggest trouble.
>
> The block layer has current->blk_plug/current->bio_list, which are
> dealt with in the following patches:
>
> https://lore.kernel.org/linux-block/20251120160722.3623884-4-ming.lei@redhat.com/
> https://lore.kernel.org/linux-block/20251120160722.3623884-5-ming.lei@redhat.com/

I think that is the simplest part of this, because the
"current->blk_plug/current->bio_list" context is _owned_ by the block
layer, so of course the block layer knows how to (and should) save and
restore it.

> I am curious why FS task context can't be saved/restored inside block
> layer when calling into new FS IO? Given it is just per-task info.

The problem is that a block driver doesn't know what the upper FS
(sorry about the terminology) did before calling into the block layer
(the journal_info side effect in task_struct is just the obvious one).
No FS (mainly in the write path) assumes the current context will be
transited into another FS context, and an FS may set up arbitrary
fs-specific context before calling into the block layer.

So it's the fs's business to save/restore contexts, since the fs is
what changes the context; it's none of the block layer's business to
save and restore, because the block device knows nothing about
specific fs behavior and it has to work with all block FSes.

To put it another way, think of a generic calling convention [1],
which distinguishes caller-saved from callee-saved contexts. The
problem here is broadly similar: a loop device knows how neither the
lower nor the upper FS behaves (since it directly knows neither the
upper nor the lower FS contexts), so it should either expect the upper
fs to save all the contexts, or use a new kthread context (to emulate
userspace requests to the FS) for the lower FS.

[1] https://en.wikipedia.org/wiki/Calling_convention

Thanks,
Gao Xiang

>
> Thanks,
> Ming