From: "Darrick J. Wong" <djwong@kernel.org>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: Bernd Schubert <bschubert@ddn.com>,
"bernd@bsbernd.com" <bernd@bsbernd.com>,
Miklos Szeredi <miklos@szeredi.hu>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Luis Henriques <luis@igalia.com>, Gang He <dchg2000@gmail.com>
Subject: Re: [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues
Date: Wed, 29 Apr 2026 21:16:39 -0700
Message-ID: <20260430041639.GF3778109@frogsfrogsfrogs>
In-Reply-To: <CAJnrk1ZFPr5AKULqMwz2ym3i9W9cJCQU7pFzF=FwOb-TtA8fxA@mail.gmail.com>
On Wed, Apr 29, 2026 at 05:32:27PM +0100, Joanne Koong wrote:
> On Wed, Apr 29, 2026 at 5:24 PM Bernd Schubert <bschubert@ddn.com> wrote:
> >
> >
> >
> > On 4/29/26 18:10, Joanne Koong wrote:
> > > On Fri, Apr 24, 2026 at 11:01 PM Bernd Schubert <bschubert@ddn.com> wrote:
> > >>
> > >> On 4/24/26 20:28, Joanne Koong wrote:
> > >>> On Mon, Apr 13, 2026 at 2:41 AM Bernd Schubert via B4 Relay
> > >>> <devnull+bernd.bsbernd.com@kernel.org> wrote:
> > >>>>
> > >>>> From: Bernd Schubert <bschubert@ddn.com>
> > >>>>
> > >>>> Queue selection (fuse_uring_get_queue) can handle a reduced number
> > >>>> of queues - using io-uring is now possible even with a single
> > >>>> queue and entry.
> > >>>>
> > >>>> The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server
> > >>>> that reduced queues are possible, i.e. if the flag is set, the fuse
> > >>>> server is free to reduce the number of queues.
> > >>>>
> > >>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> > >>>> ---
> > >>>> fs/fuse/dev_uring.c | 160 ++++++++++++++++++++++++----------------------
> > >>>> fs/fuse/inode.c | 2 +-
> > >>>> include/uapi/linux/fuse.h | 3 +
> > >>>> 3 files changed, 88 insertions(+), 77 deletions(-)
> > >>>>
> > >>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > >>>> index 9dcbc39531f0e019e5abf58a29cdf6c75fafdca1..e68089babaf89fb81741e4a5e605c6e36a137f9e 100644
> > >>>> --- a/fs/fuse/dev_uring.c
> > >>>> +++ b/fs/fuse/dev_uring.c
> > >>>>
> > >>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> > >>>> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> > >>>> {
> > >>>> unsigned int qid;
> > >>>> - struct fuse_ring_queue *queue;
> > >>>> + int node;
> > >>>> + unsigned int nr_queues;
> > >>>> + unsigned int cpu = task_cpu(current);
> > >>>>
> > >>>> - qid = task_cpu(current);
> > >>>> + cpu = cpu % ring->max_nr_queues;
> > >>>>
> > >>>> - if (WARN_ONCE(qid >= ring->max_nr_queues,
> > >>>> - "Core number (%u) exceeds nr queues (%zu)\n", qid,
> > >>>> - ring->max_nr_queues))
> > >>>> - qid = 0;
> > >>>> + /* numa local registered queue bitmap */
> > >>>> + node = cpu_to_node(cpu);
> > >>>> + if (WARN_ONCE(node >= ring->nr_numa_nodes,
> > >>>> + "Node number (%d) exceeds nr nodes (%d)\n",
> > >>>> + node, ring->nr_numa_nodes)) {
> > >>>> + node = 0;
> > >>>> + }
> > >>>>
> > >>>> - queue = ring->queues[qid];
> > >>>> - WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> > >>>> + nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
> > >>>> + if (nr_queues) {
> > >>>> + qid = ring->numa_q_map[node].cpu_to_qid[cpu];
> > >>>> + if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> > >>>> + return NULL;
> > >>>> + return READ_ONCE(ring->queues[qid]);
> > >>>> + }
> > >>>
> > >>> Hi Bernd,
> > >>>
> > >>> Thanks for making the changes on this - I really like how much simpler
> > >>> the logic is now.
> > >>>
> > >>> I'm looking through how the block multiqueue code works
> > >>> (block/blk-mq.c and block/blk-mq-cpumap.c) because I think they
> > >>> basically have to do the same thing with figuring out which cpu to
> > >>> dispatch a request to.
> > >>>
> > >>> It looks like what they do is use group_cpus_evenly(), which as I
> > >>> understand it, will partition CPUs taking into account numa nodes (as
> > >>> well as clustering and SMT siblings). I think if we use this for fuse
> > >>> io-uring, it will make things a lot simpler and we could get rid of
> > >>> the per-numa state tracking (eg numa_q_map, registered_q_mask,
> > >>> nr_numa_nodes) and simplify queue selection where now that can just
> > >>> be a cpu to qid lookup instead of a two-level
> > >>> numa-then-global-fallback lookup.
> > >>>
> > >>> Do you think something like this makes sense?
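> > >>>
> > >>> To illustrate, here is an untested sketch only, loosely modeled on
> > >>> blk_mq_map_queues() in block/blk-mq-cpumap.c; fuse_uring_build_cpu_map()
> > >>> and ring->cpu_to_qid are made-up names:
> > >>>
> > >>>   static int fuse_uring_build_cpu_map(struct fuse_ring *ring,
> > >>>                                       unsigned int nr_queues)
> > >>>   {
> > >>>           const struct cpumask *masks;
> > >>>           unsigned int qid, cpu;
> > >>>
> > >>>           /* NUMA/cluster/SMT-aware split of all CPUs into nr_queues groups */
> > >>>           masks = group_cpus_evenly(nr_queues);
> > >>>           if (!masks)
> > >>>                   return -ENOMEM;
> > >>>
> > >>>           for (qid = 0; qid < nr_queues; qid++)
> > >>>                   for_each_cpu(cpu, &masks[qid])
> > >>>                           ring->cpu_to_qid[cpu] = qid;
> > >>>
> > >>>           kfree(masks);
> > >>>           return 0;
> > >>>   }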
> > >>
> > >> Maybe, I need to check that code. However, does this really need to be
> > >> done right now? Can't this be updated later? To me it looks a bit like
> > >> we would be replacing one piece of code with another, without a clear
> > >> advantage. I can look into group_cpus_evenly(), but I cannot promise
> > >> you when that will happen.
> > >> My personal preference would be to work on real issues, like getting rid
> > >> of the two locks (queue->lock and bg->lock) and distributing max_bg
> > >> across queues. And that probably requires the distribution across
> > >> queues, which you didn't like in the previous series. Anyway, even
> > >> finding the time for that is hard.
> > >>
> > >> My personal opinion is that queue selection needs to return the qid, so
> > >> that the function can be overridden with eBPF. I didn't have time yet to
> > >> try that out.
> > >>
> > >>>
> > >>> Additionally, as I understand it, in this series, the ring->q_map
> > >>> mapping has to get rebuilt every time a new queue gets created. What
> > >>> do you think about just having the server declare the total queue
> > >>> count upfront and then the mapping can just get established at ring
> > >>> creation time? group_cpus_evenly() would only need to be called once,
> > >>> the cpu_to_qid map would only have to be built once, and we could
> > >>> avoid the rebuild-on-each-queue-creation complexity entirely. Do you
> > >>> think something like this makes sense?
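> > >>>
> > >>> With the count known up front, the lookup side could then be as simple
> > >>> as the following (again just a sketch, reusing the hypothetical
> > >>> ring->cpu_to_qid map from above, built once at ring creation):
> > >>>
> > >>>   static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> > >>>   {
> > >>>           /* direct cpu -> qid lookup, no per-numa fallback needed */
> > >>>           unsigned int qid = ring->cpu_to_qid[task_cpu(current)];
> > >>>
> > >>>           return READ_ONCE(ring->queues[qid]);
> > >>>   }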
> > >>
> > >> That is why I said in another mail that a config SQE would make sense
> > >> to some extent. However, the part where I disagree is that we could
> > >> make it all entirely dynamic with the current approach.
> > >> Only the logic for that in libfuse is missing. I.e. it _could_ start
> > >> with a single queue or one queue per numa node and one ring entry.
> > >> Basically no memory usage then.
> > >> And then libfuse could add logic - many small requests - set up ring
> > >> entries with a smaller payload size (or smaller pBuf). Many large
> > >> requests - add more ring entries with a larger payload size. And with
> > >> the current approach queues can be added dynamically.
> > >
> > > Bernd, looking through this series some more, I still think it would
> > > be preferable if userspace passed in the number of queues upfront at
> > > registration time and requests are gated until all those queues have
> > > been fully set up. I think this makes the races a lot simpler. Even without
> > > configurable queues, there are already tricky races to reason about in
> > > the dispatch and abort/teardown paths, and with configurable queues
> > > that can now handle/submit requests while other queues are not yet set
> > > up, there are now races against both request submission and
> > > potentially concurrent queue registration, as well as races with
> > > mappings that can reference queues in any state. I think it'd be
> > > preferable to try to keep things as simple as possible, and have
> > > dynamic queue addition added/supported later through a new uring cmd
> > > if needed.
> > >
> > > What are your thoughts on this?
> >
> > Can we defer this to v6? I.e. v5 goes out on Friday with minimal fix-up
> > changes and then we discuss it with Miklos next week? In the end I
> > initially had all these things entirely static and had an io-uring
> > config ioctl. In the meantime I see a good reason to have it dynamic,
> > mainly to keep memory usage low, but I also see the possible races, of
> > course (although I hope that I didn't introduce new ones).
> >
> > In principle we would need at least a monthly meeting to synchronize and
> > agree on design choices. If I understand Darrick right, ext4 has that.
ext4 and xfs each have a weekly community conference call.
> That sounds great! I agree it'll be a lot quicker to discuss this in
> person :D Looking forward to seeing you next week.
Me likewise!
--D
> Thanks,
> Joanne
> >
> > Thanks,
> > Bernd
>