From: Bernd Schubert <bschubert@ddn.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Luis Henriques <luis@igalia.com>, Gang He <dchg2000@gmail.com>
Subject: Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
Date: Fri, 24 Oct 2025 17:52:15 +0000
Message-ID: <5240c7db-9a0a-446e-a4a7-a8fcc65cf2e0@ddn.com>
In-Reply-To: <CAJnrk1bNbdpsNDkYRG5BQAQMBp5Yb64WszOQ01Dffp9qu-p60A@mail.gmail.com>

On 10/24/25 19:05, Joanne Koong wrote:
> On Mon, Oct 20, 2025 at 3:59 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>>
>>> On 10/18/25 02:12, Joanne Koong wrote:
>>>> On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>>
>>>>> So far queue selection only used the queue corresponding to the
>>>>> current core.
>>>>> With bitmaps of registered queues and per-queue counters of queued
>>>>> requests, distributing the load is now possible.
>>>>>
>>>>> This is on purpose lockless and not accurate, under the assumption
>>>>> that a lock shared between queues might become the limiting factor.
>>>>> Approximate selection based on queue->nr_reqs should be good enough.
>>>>> If queues get slightly more requests than indicated by that counter,
>>>>> it should not be too bad, as the number of kernel/userspace
>>>>> transitions is reduced with larger queue sizes.
>>>>>
>>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>>> ---
>>>>>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>>  1 file changed, 84 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>>>> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
>>>>> --- a/fs/fuse/dev_uring.c
>>>>> +++ b/fs/fuse/dev_uring.c
>>>>> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>>>>>
>>>>>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>>>>>
>>>>> +/* Number of queued fuse requests until a queue is considered full */
>>>>> +#define FURING_Q_LOCAL_THRESHOLD 2
>>>>> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>>> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>>>>
>>>>>  bool fuse_uring_enabled(void)
>>>>>  {
>>>>> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>>>>         fuse_uring_send(ent, cmd, err, issue_flags);
>>>>>  }
>>>>>
>>>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>>>>> +/*
>>>>> + * Pick best queue from mask. Follows the algorithm described in
>>>>> + * "The Power of Two Choices in Randomized Load Balancing"
>>>>> + * (Michael David Mitzenmacher, 1996)
>>>>> + */
>>>>> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>>> +                                                    struct fuse_ring *ring)
>>>>> +{
>>>>> +       unsigned int qid1, qid2;
>>>>> +       struct fuse_ring_queue *queue1, *queue2;
>>>>> +       int weight = cpumask_weight(mask);
>>>>> +
>>>>> +       if (weight == 0)
>>>>> +               return NULL;
>>>>> +
>>>>> +       if (weight == 1) {
>>>>> +               qid1 = cpumask_first(mask);
>>>>> +               return READ_ONCE(ring->queues[qid1]);
>>>>> +       }
>>>>> +
>>>>> +       /* Get two different queues using optimized bounded random */
>>>>> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>> +       queue1 = READ_ONCE(ring->queues[qid1]);
>>>>> +
>>>>> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>> +
>>>>> +       /* Avoid retries and take this queue for code simplicity */
>>>>> +       if (qid1 == qid2)
>>>>> +               return queue1;
>>>>> +
>>>>> +       queue2 = READ_ONCE(ring->queues[qid2]);
>>>>> +
>>>>> +       if (WARN_ON_ONCE(!queue1 || !queue2))
>>>>> +               return NULL;
>>>>> +
>>>>> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
>>>>> +               queue1 : queue2;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Get the best queue for the current CPU
>>>>> + */
>>>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>>>  {
>>>>>         unsigned int qid;
>>>>> -       struct fuse_ring_queue *queue;
>>>>> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>>> +       int local_node;
>>>>> +       const struct cpumask *numa_mask, *global_mask;
>>>>>
>>>>>         qid = task_cpu(current);
>>>>> -
>>>>>         if (WARN_ONCE(qid >= ring->max_nr_queues,
>>>>>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
>>>>>                       ring->max_nr_queues))
>>>>>                 qid = 0;
>>>>>
>>>>> -       queue = ring->queues[qid];
>>>>> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>>>>> +       local_queue = READ_ONCE(ring->queues[qid]);
>>>>> +       local_node = cpu_to_node(qid);
>>>>> +       if (WARN_ON_ONCE(local_node >= ring->nr_numa_nodes))
>>>>> +               local_node = 0;
>>>>>
>>>>> -       return queue;
>>>>> +       /* Fast path: if local queue exists and is not overloaded, use it */
>>>>> +       if (local_queue &&
>>>>> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>>>> +               return local_queue;
>>>>> +
>>>>> +       /* Find best NUMA-local queue */
>>>>> +       numa_mask = ring->numa_registered_q_mask[local_node];
>>>>> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
>>>>> +
>>>>> +       /* If NUMA queue is under threshold, use it */
>>>>> +       if (best_numa &&
>>>>> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>>> +               return best_numa;
>>>>> +
>>>>> +       /* NUMA queues above threshold, try global queues */
>>>>> +       global_mask = ring->registered_q_mask;
>>>>> +       best_global = fuse_uring_best_queue(global_mask, ring);
>>>>> +
>>>>> +       /* Might happen during tear down */
>>>>> +       if (!best_global)
>>>>> +               return NULL;
>>>>> +
>>>>> +       /* If global queue is under double threshold, use it */
>>>>> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
>>>>> +               return best_global;
>>>>> +
>>>>> +       /* There is no ideal queue, stay numa_local if possible */
>>>>> +       return best_numa ? best_numa : best_global;
>>>>>  }
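[Worked example of the thresholds above, as I read the code: with
FURING_Q_LOCAL_THRESHOLD = 2, a request stays on the local queue as long
as that queue has at most 2 queued requests. Otherwise two random
NUMA-local queues are sampled and the less loaded one is used if it has
at most 3 queued requests; only then are two queues from the global mask
sampled, accepted with up to 4 queued requests, and if even that fails
the request falls back to the best NUMA-local queue (or, if there is
none, the best global one).]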
>>>>
>>>> Hi Bernd,
>>>>
>>>> I started looking a bit at the block layer blk-mq.c code because, as I
>>>> understand it, they have to address this same problem of allocating
>>>> requests to queues while taking into account NUMA locality.
>>>>
>>>> I haven't looked at the code deeply yet but I think what it does is
>>>> maintain a static mapping (that considers numa topology) of cpus to
>>>> queues which then makes queue selection very simple with minimal
>>>> overhead. For distributing load, I think it relies on the CPU
>>>> scheduler to distribute application tasks fairly across CPUs rather
>>>> than doing load balancing itself (which would also then have to break
>>>> numa locality if the request gets moved to a different queue).
>>>> Regarding load balancing, my read of this patch is that it uses the
>>>> number of requests currently on a queue as the metric of load, but
>>>> I'm not sure that's accurate - for example, some requests may be more
>>>> expensive (e.g. a read served over the network), so even a queue with
>>>> only a few requests could still be more loaded, with higher latency,
>>>> than other queues.
>>>>
>>>> I'm curious to hear your thoughts on whether you think a simple
>>>> mapping solution like what the block layer does would suffice or not
>>>> for fuse uring queue selection.
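For illustration only, a rough sketch of the kind of static, NUMA-aware
cpu -> qid map being described here, in the spirit of what
blk_mq_map_queues() does; 'cpu_to_qid' is a hypothetical field and the
code is untested:

static void fuse_uring_build_cpu_map(struct fuse_ring *ring)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		int node = cpu_to_node(cpu);
		/* Prefer queues registered on the CPU's own NUMA node */
		const struct cpumask *qmask =
			ring->numa_registered_q_mask[node];
		unsigned int nr = cpumask_weight(qmask);

		if (!nr) {
			/* No NUMA-local queue, fall back to any registered queue */
			qmask = ring->registered_q_mask;
			nr = cpumask_weight(qmask);
		}
		if (WARN_ON_ONCE(!nr))
			continue;

		/* Spread the CPUs of a node evenly over that node's queues */
		ring->cpu_to_qid[cpu] = cpumask_nth(cpu % nr, qmask);
	}
}

Submission would then reduce to a single lookup, e.g.
ring->queues[ring->cpu_to_qid[task_cpu(current)]], with the map rebuilt
whenever a queue is registered or unregistered.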
>>>
>> Hi Bernd,
>>
>> Thanks for your reply and for sharing your thoughts on this.
>>
>>>
>>> Hi Joanne,
>>>
>>> thanks for looking at the patch. I think we primarily have a static
>>> mapping? For completeness, please also look at patch 6/6, which
>>> updates queue selection. Basically, with patch 6/6 we have a static
>>> mapping to the local queue, with neighbor queues as retries. I
>>> had already answered Luis' question - I can show that retries
>>> to the neighbor QIDs improve performance, at least for fio's
>>> '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
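To make the neighbor-retry idea concrete, a rough sketch (simplified,
not the exact code from patch 6/6; FURING_NEXT_QUEUE_RETRIES is the
constant introduced there):

static struct fuse_ring_queue *fuse_uring_neighbor_queue(struct fuse_ring *ring)
{
	unsigned int qid = task_cpu(current);
	unsigned int i;

	for (i = 0; i <= FURING_NEXT_QUEUE_RETRIES; i++) {
		unsigned int try_qid = (qid + i) % ring->max_nr_queues;
		struct fuse_ring_queue *queue =
			READ_ONCE(ring->queues[try_qid]);

		/* Take the first neighbor queue that is not overloaded */
		if (queue &&
		    READ_ONCE(queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
			return queue;
	}

	/* All neighbors busy, stay on the local queue */
	return READ_ONCE(ring->queues[qid]);
}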
>>>
>>> So that leaves the fallback to random QIDs - I don't have a strong
>>> opinion about that, but I don't think the CPU scheduler can handle it.
>>> Let's say you are doing write-back to a single file and let's say
>>> fuse is tuned to allow lots of dirty pages. How should the scheduler
>>> be able to distribute a single-threaded dirty page flush? Especially
>>
>> For writeback, I believe the writeback workqueue is not bound to any
>> cpu (I'm seeing bdi_wq allocated with WQ_UNBOUND in
>> default_bdi_init()). As I understand it, the worker thread can be
>> migrated by the scheduler, which will distribute writing back dirty
>> data across multiple cpus as it sees fit.
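(For reference on the unbound part - a generic example of an unbound
workqueue, not the actual bdi_wq allocation; work items queued there run
on whichever CPU the scheduler picks:

	struct workqueue_struct *wq =
		alloc_workqueue("example_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
)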
>>
>>> also see in patch 6/6 that we really want to have a different CPU
>>> handle async data - the cpu scheduler will not even try to move
>>> the application or migration thread to a different cpu, because
>>> there is no conflict. And for cpu cache, C-states and frequency,
>>> we actually also want the scheduler to limit migration to an
>>> absolute minimum.
>>>
>>> Another choice instead of the random fallback would be to distribute
>>> requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
>>> Maybe that would even give better performance, as random queues so
>>> far didn't have a positive effect in my testing.
>>>
>>> The ideal queue selection for async requests seems to be to fill a
>>> queue and then to move on to the next qid, within a numa domain.
>>> I just haven't found a way to do that locklessly yet.
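One lockless direction, only a sketch and not part of this series: a
per-node cursor that is advanced with a cmpxchg once the current queue
is full ('numa_cursor' would be a new, hypothetical per-node atomic_t in
struct fuse_ring):

static struct fuse_ring_queue *
fuse_uring_fill_then_spill(struct fuse_ring *ring, int node)
{
	const struct cpumask *mask = ring->numa_registered_q_mask[node];
	atomic_t *cursor = &ring->numa_cursor[node]; /* hypothetical field */
	unsigned int qid = atomic_read(cursor);
	struct fuse_ring_queue *queue;
	unsigned int next;

	if (cpumask_empty(mask))
		return NULL;
	if (!cpumask_test_cpu(qid, mask))
		qid = cpumask_first(mask);

	queue = READ_ONCE(ring->queues[qid]);
	if (queue && READ_ONCE(queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
		return queue;

	/* Queue at the cursor is full, try to advance the cursor once */
	next = cpumask_next(qid, mask);
	if (next >= nr_cpu_ids)
		next = cpumask_first(mask);
	atomic_cmpxchg(cursor, qid, next);

	return READ_ONCE(ring->queues[atomic_read(cursor)]);
}

Losing the cmpxchg race just means another submitter already advanced
the cursor, so at worst a queue gets a few extra requests - the same
kind of inaccuracy this patch already accepts.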
>>>
>>> Regarding the use of the number of requests - I guess there will
>>> always be workloads where the algorithm isn't perfect - see the
>>> scheduler wake discussion. Maybe we can find a way in the future
>>> to map queued requests in the fuse daemon and then use that plus the
>>> number of unhandled io-uring CQEs to know whether a queue is busy.
>>> Any chance we can do it step by step?
>>>
>>> I don't have a big problem with removing
>>> the random queue selection fallback, but it would also be good
>>> to have Miklos' opinion - Miklos had some concerns in September
>>
>> I agree, it'd be good to have Miklos's opinion on this and go with that.
>>
>> My opinion looking at this is that the fuse uring problem of
>> distributing requests to queues is very similar to what the block
>> layer has to do with assigning bio submissions to hardware queues. The
>> block layer's solution to me seems more elegantly simple and flows
>> organically with the cpu scheduler's internal load balancing. I think
>> we should try to keep things as simple as possible, as I don't see how
>> the optimizations with the custom load balancing we're doing here can
>> be accurate enough to warrant the extra complexity and overhead.
>>
>> But I defer to whatever approach you and Miklos think is best and
>> would rather go with.
> 
> Miklos, could you share your thoughts on how we should approach this?

I have already updated the series; v4 without the extended balancing
will come tomorrow - I won't manage to run benchmarks today. Still left
in, as a new commit, is the cpu+1 retry balancing, for the reason that
it gives better performance - I really don't want to remove that.


Thanks,
Bernd


Thread overview: 26+ messages
2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
2025-10-15  9:19   ` Luis Henriques
2025-10-13 17:09 ` [PATCH v3 2/6] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
2025-10-13 17:09 ` [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
2025-10-15 23:49   ` Joanne Koong
2025-10-16 11:33     ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues Bernd Schubert
2025-10-18  0:12   ` Joanne Koong
2025-10-20 19:00     ` Bernd Schubert
2025-10-20 22:59       ` Joanne Koong
2025-10-20 23:28         ` Bernd Schubert
2025-10-24 17:05         ` Joanne Koong
2025-10-24 17:52           ` Bernd Schubert [this message]
2025-10-24 17:58             ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
2025-10-15  9:25   ` Luis Henriques
2025-10-15  9:31     ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
2025-10-15  9:50   ` Luis Henriques
2025-10-15 10:27     ` Bernd Schubert
2025-10-15 11:05       ` Luis Henriques
2025-10-20  7:15   ` Dan Carpenter
2025-10-14  8:43 ` [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Gang He
2025-10-14  9:14   ` Bernd Schubert
2025-10-16  6:15     ` Gang He
