From: Bernd Schubert <bschubert@ddn.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Luis Henriques <luis@igalia.com>, Gang He <dchg2000@gmail.com>
Subject: Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
Date: Fri, 24 Oct 2025 17:52:15 +0000
Message-ID: <5240c7db-9a0a-446e-a4a7-a8fcc65cf2e0@ddn.com>
In-Reply-To: <CAJnrk1bNbdpsNDkYRG5BQAQMBp5Yb64WszOQ01Dffp9qu-p60A@mail.gmail.com>
On 10/24/25 19:05, Joanne Koong wrote:
> On Mon, Oct 20, 2025 at 3:59 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>>
>>> On 10/18/25 02:12, Joanne Koong wrote:
>>>> On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>>
>>>>> So far queue selection was only for the queue corresponding
>>>>> to the current core.
>>>>> With bitmaps about registered queues and counting of queued
>>>>> requests per queue, distributing the load is possible now.
>>>>>
>>>>> This is intentionally lockless and not perfectly accurate, under the
>>>>> assumption that a lock shared between queues might become the limiting
>>>>> factor. Approximate selection based on queue->nr_reqs should be good
>>>>> enough. If queues get slightly more requests than suggested by that
>>>>> counter it should not be too bad, as the number of kernel/userspace
>>>>> transitions is reduced with larger queue sizes.
>>>>>
>>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>>> ---
>>>>> fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>> 1 file changed, 84 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>>>> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
>>>>> --- a/fs/fuse/dev_uring.c
>>>>> +++ b/fs/fuse/dev_uring.c
>>>>> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>>>>>
>>>>> #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>>>>>
>>>>> +/* Number of queued fuse requests until a queue is considered full */
>>>>> +#define FURING_Q_LOCAL_THRESHOLD 2
>>>>> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>>> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>>>>
>>>>> bool fuse_uring_enabled(void)
>>>>> {
>>>>> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>>>> fuse_uring_send(ent, cmd, err, issue_flags);
>>>>> }
>>>>>
>>>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>>>>> +/*
>>>>> + * Pick best queue from mask. Follows the algorithm described in
>>>>> + * "The Power of Two Choices in Randomized Load Balancing"
>>>>> + * (Michael David Mitzenmacher, 1996)
>>>>> + */
>>>>> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>>> + struct fuse_ring *ring)
>>>>> +{
>>>>> + unsigned int qid1, qid2;
>>>>> + struct fuse_ring_queue *queue1, *queue2;
>>>>> + int weight = cpumask_weight(mask);
>>>>> +
>>>>> + if (weight == 0)
>>>>> + return NULL;
>>>>> +
>>>>> + if (weight == 1) {
>>>>> + qid1 = cpumask_first(mask);
>>>>> + return READ_ONCE(ring->queues[qid1]);
>>>>> + }
>>>>> +
>>>>> + /* Get two different queues using optimized bounded random */
>>>>> + qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>> + queue1 = READ_ONCE(ring->queues[qid1]);
>>>>> +
>>>>> + qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>> +
>>>>> + /* Avoid retries and take this queue for code simplicity */
>>>>> + if (qid1 == qid2)
>>>>> + return queue1;
>>>>> +
>>>>> + queue2 = READ_ONCE(ring->queues[qid2]);
>>>>> +
>>>>> + if (WARN_ON_ONCE(!queue1 || !queue2))
>>>>> + return NULL;
>>>>> +
>>>>> + return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
>>>>> + queue1 : queue2;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Get the best queue for the current CPU
>>>>> + */
>>>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>>> {
>>>>> unsigned int qid;
>>>>> - struct fuse_ring_queue *queue;
>>>>> + struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>>> + int local_node;
>>>>> + const struct cpumask *numa_mask, *global_mask;
>>>>>
>>>>> qid = task_cpu(current);
>>>>> -
>>>>> if (WARN_ONCE(qid >= ring->max_nr_queues,
>>>>> "Core number (%u) exceeds nr queues (%zu)\n", qid,
>>>>> ring->max_nr_queues))
>>>>> qid = 0;
>>>>>
>>>>> - queue = ring->queues[qid];
>>>>> - WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>>>>> + local_queue = READ_ONCE(ring->queues[qid]);
>>>>> + local_node = cpu_to_node(qid);
>>>>> + if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>>>> + local_node = 0;
>>>>>
>>>>> - return queue;
>>>>> + /* Fast path: if local queue exists and is not overloaded, use it */
>>>>> + if (local_queue &&
>>>>> + READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>>>> + return local_queue;
>>>>> +
>>>>> + /* Find best NUMA-local queue */
>>>>> + numa_mask = ring->numa_registered_q_mask[local_node];
>>>>> + best_numa = fuse_uring_best_queue(numa_mask, ring);
>>>>> +
>>>>> + /* If NUMA queue is under threshold, use it */
>>>>> + if (best_numa &&
>>>>> + READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>>> + return best_numa;
>>>>> +
>>>>> + /* NUMA queues above threshold, try global queues */
>>>>> + global_mask = ring->registered_q_mask;
>>>>> + best_global = fuse_uring_best_queue(global_mask, ring);
>>>>> +
>>>>> + /* Might happen during tear down */
>>>>> + if (!best_global)
>>>>> + return NULL;
>>>>> +
>>>>> + /* If global queue is under double threshold, use it */
>>>>> + if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
>>>>> + return best_global;
>>>>> +
>>>>> + /* There is no ideal queue, stay numa_local if possible */
>>>>> + return best_numa ? best_numa : best_global;
>>>>> }
>>>>
>>>> Hi Bernd,
>>>>
>>>> I started looking a bit at the block layer blk-mq.c code because, as I
>>>> understand it, they have to address this same problem of allocating
>>>> requests to queues while taking into account NUMA locality.
>>>>
>>>> I haven't looked at the code deeply yet but I think what it does is
>>>> maintain a static mapping (that considers numa topology) of cpus to
>>>> queues which then makes queue selection very simple with minimal
>>>> overhead. For distributing load, I think it relies on the CPU
>>>> scheduler to distribute application tasks fairly across CPUs rather
>>>> than doing load balancing itself (which would also then have to break
>>>> numa locality if the request gets moved to a different queue).
>>>> Regarding load balancing, my read of this patch is that it uses the
>>>> number of current requests on queues as the metric of load but I'm not
>>>> sure that's accurate - for example, some requests may be more
>>>> intensive (eg fetching a read over a network) where even if there's
>>>> only a few requests on that queue, that queue could still be more
>>>> loaded with higher latency than other queues.
>>>>
>>>> I'm curious to hear your thoughts on whether you think a simple
>>>> mapping solution like what the block layer does would suffice or not
>>>> for fuse uring queue selection.
>>>
>> Hi Bernd,
>>
>> Thanks for your reply and for sharing your thoughts on this.
>>
>>>
>>> Hi Joanne,
>>>
>>> Thanks for looking at the patch. I think we primarily have a static
>>> mapping? For completeness, please also look at patch 6/6, which
>>> updates the queue selection. Basically, with patch 6/6 we have a
>>> static mapping to the local queue, with neighbor queues as retries.
>>> I had already answered Luis' question - I can show that retrying
>>> on the neighbor QIDs improves performance, at least for fio's
>>> '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
>>>
>>> So that leaves the fallback to random QIDs - I don't have a strong
>>> opinion about that, but I don't think the CPU scheduler can handle it.
>>> Let's say you are doing write-back to a single file and let's say
>>> fuse is tuned to allow lots of dirty pages. How should the scheduler
>>> be able to distribute a single-threaded dirty page flush? Especially
>> For writeback, I believe the writeback workqueue is not bound to any
>> cpu (I'm seeing bdi_wq allocated with WQ_UNBOUND in
>> default_bdi_init()). As I understand it, the worker thread can be
>> migrated by the scheduler, which will distribute writing back dirty
>> data across multiple cpus as it sees fit.
>>
>>> also see in patch 6/6 that we really want to have a different CPU
>>> handle async data - the cpu scheduler will not even try to move
>>> the application or migration thread to a different cpu, because
>>> there is no conflict. And for cpu cache, C-states and frequency,
>>> we actually also want the scheduler to limit migration to an
>>> absolute minimum.
>>>
>>> Another choice instead of the random fallback would be to distribute
>>> requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
>>> Maybe that would even give better performance, as random queues so
>>> far didn't have a positive effect in my testing.
>>>
>>> The ideal kind of queue selection for async requests seems to be
>>> to fill a queue and then move to the next qid, within a numa
>>> domain. I just haven't found a way to do that locklessly yet.
>>>
>>> Regarding the use of the number of requests as the metric - I guess
>>> there will always be workloads where the algorithm isn't perfect -
>>> see the scheduler wake discussion. Maybe we can find a way in the
>>> future to map queued requests in the fuse daemon and then use that
>>> plus the number of unhandled io-uring CQEs to know whether a queue
>>> is busy. Any chance we can do it step by step?
>>>
>>> I don't have a big problem with removing
>>> the random queue selection fallback, but it would also be good
>>> to have Miklos' opinion - Miklos had some concerns in September
>>
>> I agree, it'd be good to have Miklos's opinion on this and go with that.
>>
>> My opinion looking at this is that the fuse uring problem of
>> distributing requests to queues is very similar to what the block
>> layer has to do with assigning bio submissions to hardware queues. The
>> block layer's solution to me seems more elegantly simple and flows
>> organically with the cpu scheduler's internal load balancing. I think
>> we should try to keep things as simple as possible, as I don't see how
>> the optimizations with the custom load balancing we're doing here can
>> be accurate enough to warrant the extra complexity and overhead.
>>
>> But I defer to whatever approach you and Miklos think is best and
>> would rather go with.
>
> Miklos, could you share your thoughts on how we should approach this?
I have already updated the series; v4 without the extended balancing
will come tomorrow - I won't manage to run benchmarks today. Still
left in, as a new commit, is the cpu+1 retry balancing, because it
gives better performance - I really don't want to remove that.
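
To make it concrete, the retry logic has roughly the following shape
(only a simplified sketch using the fuse_ring/fuse_ring_queue fields
from this series; the FURING_NEXT_QUEUE_RETRIES value and the helper
name are placeholders, not necessarily what v4 will contain):

/* Placeholder retry count, the value used in v4 may differ */
#define FURING_NEXT_QUEUE_RETRIES 3

static struct fuse_ring_queue *
fuse_uring_cpu_plus_one_queue(struct fuse_ring *ring)
{
        unsigned int qid = task_cpu(current);
        struct fuse_ring_queue *queue, *least = NULL;
        int i;

        /* Try the local queue first, then its next neighbors */
        for (i = 0; i <= FURING_NEXT_QUEUE_RETRIES; i++) {
                unsigned int try_qid = (qid + i) % ring->max_nr_queues;

                queue = READ_ONCE(ring->queues[try_qid]);
                if (!queue)
                        continue;

                /* Lightly loaded queue found, take it right away */
                if (READ_ONCE(queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
                        return queue;

                /* Otherwise remember the least loaded candidate */
                if (!least ||
                    READ_ONCE(queue->nr_reqs) < READ_ONCE(least->nr_reqs))
                        least = queue;
        }

        return least;
}

The actual commit may differ in details, e.g. whether the retries stay
within the NUMA node; this is just to illustrate the idea.
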
Thanks,
Bernd