Date: Wed, 29 Apr 2026 21:16:39 -0700
From: "Darrick J. Wong"
To: Joanne Koong
Cc: Bernd Schubert, "bernd@bsbernd.com", Miklos Szeredi,
	"linux-fsdevel@vger.kernel.org", Luis Henriques, Gang He
Subject: Re: [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues
Message-ID: <20260430041639.GF3778109@frogsfrogsfrogs>
References: <20260413-reduced-nr-ring-queues_3-v4-0-982b6414b723@bsbernd.com>
 <20260413-reduced-nr-ring-queues_3-v4-5-982b6414b723@bsbernd.com>
 <0a56969c-7fe6-428a-8eb5-6df5e61ff03f@ddn.com>

On Wed, Apr 29, 2026 at 05:32:27PM +0100, Joanne Koong wrote:
> On Wed, Apr 29, 2026 at 5:24 PM Bernd Schubert wrote:
> >
> >
> >
> > On 4/29/26 18:10, Joanne Koong wrote:
> > > On Fri, Apr 24, 2026 at 11:01 PM Bernd Schubert wrote:
> > >>
> > >> On 4/24/26 20:28, Joanne Koong wrote:
> > >>> On Mon, Apr 13, 2026 at 2:41 AM Bernd Schubert via B4 Relay
> > >>> wrote:
> > >>>>
> > >>>> From: Bernd Schubert
> > >>>>
> > >>>> Queue selection (fuse_uring_get_queue) can handle a reduced number
> > >>>> of queues - using io-uring is now possible even with a single
> > >>>> queue and entry.
> > >>>>
> > >>>> The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server that
> > >>>> reduced queues are possible, i.e. if the flag is set, the fuse server
> > >>>> is free to reduce the number of queues.
> > >>>>
> > >>>> Signed-off-by: Bernd Schubert
> > >>>> ---
> > >>>>  fs/fuse/dev_uring.c       | 160 ++++++++++++++++++++++++----------------------
> > >>>>  fs/fuse/inode.c           |   2 +-
> > >>>>  include/uapi/linux/fuse.h |   3 +
> > >>>>  3 files changed, 88 insertions(+), 77 deletions(-)
> > >>>>
> > >>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > >>>> index 9dcbc39531f0e019e5abf58a29cdf6c75fafdca1..e68089babaf89fb81741e4a5e605c6e36a137f9e 100644
> > >>>> --- a/fs/fuse/dev_uring.c
> > >>>> +++ b/fs/fuse/dev_uring.c
> > >>>>
> > >>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> > >>>> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> > >>>>  {
> > >>>>         unsigned int qid;
> > >>>> -       struct fuse_ring_queue *queue;
> > >>>> +       int node;
> > >>>> +       unsigned int nr_queues;
> > >>>> +       unsigned int cpu = task_cpu(current);
> > >>>>
> > >>>> -       qid = task_cpu(current);
> > >>>> +       cpu = cpu % ring->max_nr_queues;
> > >>>>
> > >>>> -       if (WARN_ONCE(qid >= ring->max_nr_queues,
> > >>>> -                     "Core number (%u) exceeds nr queues (%zu)\n", qid,
> > >>>> -                     ring->max_nr_queues))
> > >>>> -               qid = 0;
> > >>>> +       /* numa local registered queue bitmap */
> > >>>> +       node = cpu_to_node(cpu);
> > >>>> +       if (WARN_ONCE(node >= ring->nr_numa_nodes,
> > >>>> +                     "Node number (%d) exceeds nr nodes (%d)\n",
> > >>>> +                     node, ring->nr_numa_nodes)) {
> > >>>> +               node = 0;
> > >>>> +       }
> > >>>>
> > >>>> -       queue = ring->queues[qid];
> > >>>> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> > >>>> +       nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
> > >>>> +       if (nr_queues) {
> > >>>> +               qid = ring->numa_q_map[node].cpu_to_qid[cpu];
> > >>>> +               if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> > >>>> +                       return NULL;
> > >>>> +               return READ_ONCE(ring->queues[qid]);
> > >>>> +       }
> > >>>
> > >>> Hi Bernd,
> > >>>
> > >>> Thanks for making the changes on this - I really like how much simpler
> > >>> the logic is now.
> > >>>
> > >>> I'm looking through how the block multiqueue code works
> > >>> (block/blk-mq.c and block/blk-mq-cpumap.c) because I think they
> > >>> basically have to do the same thing with figuring out which queue to
> > >>> dispatch a request to based on the cpu.
> > >>>
> > >>> It looks like what they do is use group_cpus_evenly(), which as I
> > >>> understand it, will partition CPUs taking into account numa nodes (as
> > >>> well as clustering and SMT siblings). I think if we use this for fuse
> > >>> io-uring, it will make things a lot simpler and we could get rid of
> > >>> the per-numa state tracking (eg numa_q_map, registered_q_mask,
> > >>> nr_numa_nodes) and simplify queue selection so that it can just
> > >>> be a cpu to qid lookup instead of a two-level
> > >>> numa-then-global-fallback lookup.
> > >>>
> > >>> Do you think something like this makes sense?
> > >>
> > >> Maybe, I need to check that code. However, does this really need to be
> > >> done right now? Can't this be updated later? To me it looks a bit like
> > >> we are going to replace one piece of code with another, without a clear
> > >> advantage. I can look into group_cpus_evenly(), but I cannot promise you
> > >> when that will happen.
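For reference, here is a minimal sketch of the flat cpu-to-qid mapping described above, modeled on what blk_mq_map_queues() does in block/blk-mq-cpumap.c. The fuse_ring_build_cpu_map() helper and the ring->cpu_to_qid array are illustrative only (they do not exist in this series), and the one-argument group_cpus_evenly() prototype is assumed from <linux/group_cpus.h>; its signature has varied between kernel releases.

```c
/*
 * Illustrative only -- not part of the patch.  Builds a flat cpu -> qid
 * table the way blk_mq_map_queues() does.  fuse_ring_build_cpu_map() and
 * ring->cpu_to_qid are hypothetical names; the one-argument
 * group_cpus_evenly() prototype is assumed from <linux/group_cpus.h>.
 */
#include <linux/cpumask.h>
#include <linux/group_cpus.h>
#include <linux/slab.h>

static int fuse_ring_build_cpu_map(struct fuse_ring *ring,
				   unsigned int nr_queues)
{
	struct cpumask *masks;
	unsigned int qid, cpu;

	/*
	 * Split all possible CPUs into nr_queues groups; NUMA nodes,
	 * clusters and SMT siblings are kept together where possible.
	 * The caller owns (and frees) the returned mask array.
	 */
	masks = group_cpus_evenly(nr_queues);
	if (!masks)
		return -ENOMEM;

	for (qid = 0; qid < nr_queues; qid++)
		for_each_cpu(cpu, &masks[qid])
			ring->cpu_to_qid[cpu] = qid;	/* flat lookup table */

	kfree(masks);
	return 0;
}
```

With a table like this, queue selection could reduce to roughly ring->queues[ring->cpu_to_qid[task_cpu(current)]], plus a NULL check for queues that are not registered yet.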
> > >> My personal preference would be to work on real issues, like getting rid
> > >> of the two locks (queue->lock and bg->lock) and distributing max_bg across
> > >> queues. And that probably requires the distribution across queues, which
> > >> you didn't like in the previous series. Anyway, already finding the time
> > >> for that is hard.
> > >>
> > >> My personal opinion is that queue selection needs to return the qid, so
> > >> that the function can be overridden with eBPF. I didn't have time yet to
> > >> try that out.
> > >>
> > >>>
> > >>> Additionally, as I understand it, in this series, the ring->q_map
> > >>> mapping has to get rebuilt every time a new queue gets created. What
> > >>> do you think about just having the server declare the total queue
> > >>> count upfront and then the mapping can just get established at ring
> > >>> creation time? group_cpus_evenly() would only need to be called once,
> > >>> the cpu_to_qid map would only have to be built once, and we could
> > >>> avoid the rebuild-on-each-queue-creation complexity entirely. Do you
> > >>> think something like this makes sense?
> > >>
> > >> That is why I said in another mail that a config SQE would make sense
> > >> to some extent. However, the part where I disagree is that we could make
> > >> it all entirely dynamic with the current approach.
> > >> Only the logic for that in libfuse is missing. I.e. it _could_ start
> > >> with a single queue, or one queue per numa node, and one ring entry.
> > >> Basically no memory usage then.
> > >> And then libfuse could add logic: many small requests - set up ring
> > >> entries with a smaller payload size (or smaller pBuf); many large requests
> > >> - add more requests with a larger payload size. And with the current
> > >> approach queues can be added dynamically.
> > >
> > > Bernd, looking through this series some more, I still think it would
> > > be preferable if userspace passed in the number of queues upfront at
> > > registration time and requests are gated until all those queues have
> > > completed setup. I think this makes races a lot simpler. Even without
> > > configurable queues, there are already tricky races to reason about in
> > > the dispatch and abort/teardown paths, and with configurable queues
> > > that can now handle/submit requests while other queues are not yet set
> > > up, there are now races against both request submission and
> > > potentially concurrent queue registration, as well as races with
> > > mappings that can reference queues in any state. I think it'd be
> > > preferable to try to keep things as simple as possible, and have
> > > dynamic queue addition added/supported later through a new uring cmd
> > > if needed.
> > >
> > > What are your thoughts on this?
> >
> > Can we defer this to v6? I.e. v5 goes out on Friday with minimal fix
> > changes and then we discuss it with Miklos next week? In the end I
> > had all these things initially entirely static and had an io-uring
> > config ioctl. In the meantime I see a good reason to have it dynamic,
> > mainly to keep memory usage low, but I also see the possible races, of
> > course (although I hope that I didn't introduce new ones).
> >
> > In principle we would need at least a monthly meeting to synchronize and
> > agree on design choices. If I understand Darrick right, ext4 has that.

ext4 and xfs each have a weekly community conference call.

> That sounds great! I agree it'll be a lot quicker to discuss this in
> person :D Looking forward to seeing you next week.

Me likewise!
--D

> Thanks,
> Joanne
> >
> > Thanks,
> > Bernd
>
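As a purely illustrative aside on the upfront-registration idea discussed in the thread: none of the names below exist in the current fuse UAPI, and the thread does not settle on any interface; this only sketches the shape such a one-shot configuration could take.

```c
/*
 * Hypothetical sketch only: struct fuse_uring_cfg does not exist in
 * include/uapi/linux/fuse.h.  The idea is that the server declares the
 * total queue count once, so the cpu-to-qid map can be built a single
 * time and request dispatch can be gated until all nr_queues queues
 * have been registered.
 */
struct fuse_uring_cfg {
	uint32_t	flags;		/* e.g. a "reduced queues" capability bit */
	uint32_t	nr_queues;	/* total queues the server will register */
	uint32_t	max_payload_sz;	/* per ring-entry payload buffer size */
	uint32_t	padding;
};
```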