From: Nilay Shroff <nilay@linux.ibm.com>
To: Hannes Reinecke <hare@suse.de>, linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	dwagner@suse.de, gjoyce@ibm.com
Subject: Re: [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
Date: Tue, 23 Sep 2025 16:26:59 +0530	[thread overview]
Message-ID: <dc123d3e-2d33-4bb1-9f30-cb60aa2e0c98@linux.ibm.com> (raw)
In-Reply-To: <25a6ed74-70bf-4230-993b-de4fbaea16a7@suse.de>



On 9/23/25 12:33 PM, Hannes Reinecke wrote:
> On 9/23/25 05:43, Nilay Shroff wrote:
>>
>>
>> On 9/22/25 1:00 PM, Hannes Reinecke wrote:
>>> On 9/21/25 13:12, Nilay Shroff wrote:
> [ .. ]
>>>> +        srcu_idx = srcu_read_lock(&head->srcu);
>>>> +        list_for_each_entry_srcu(cur_ns, &head->list, siblings,
>>>> +                srcu_read_lock_held(&head->srcu)) {
>>>
>>> And this is even more awkward as we need to iterate over all paths
>>> (during completion!).
>>>
>> Hmm yes, but we only iterate once every ~15 seconds per CPU, so the overhead is minimal.
>> Typically we don’t have a large number of paths to deal with: enterprise SSDs usually
>> expose at most two controllers, and even in fabrics setups the path count is usually
>> limited to around 4–6. So the loop should run quite fast.
> 
> Hmm. Not from my experience. There is at least one implementation from a
> rather substantial array vendor exposing up to low hundreds of queues.
> 
Sorry, but I think we're discussing two different things here. I'm referring
to the nvme paths (i.e. nvmeXcYnZ), whereas you're probably referring to the
DMA queues or hw I/O queues. If that's the case then yes, some enterprise SSDs
may expose low hundreds of queues; for instance, on my SSD I see 128 queues
being supported:
# nvme get-feature -f 0x7 /dev/nvme0 -H 
get-feature:0x07 (Number of Queues), Current value:0x007f007f
	Number of IO Completion Queues Allocated (NCQA): 128
	Number of IO Submission Queues Allocated (NSQA): 128

But when it comes to the nvme paths, the count should be limited (mostly single
digit). So my point was that iterating those paths (which is what the above
code implements) shouldn't be expensive. But please correct me if my
assumption is wrong.
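
Just to make the distinction concrete, here is a small hypothetical helper
(not part of the patch) that counts what the adaptive-policy loop actually
walks: the sibling paths hanging off an nvme_ns_head, i.e. one entry per
nvmeXcYnZ device, as opposed to the controller's I/O queues
(ctrl->queue_count):

	static int nvme_head_path_count(struct nvme_ns_head *head)
	{
		struct nvme_ns *ns;
		int n = 0, srcu_idx;

		srcu_idx = srcu_read_lock(&head->srcu);
		list_for_each_entry_srcu(ns, &head->list, siblings,
					 srcu_read_lock_held(&head->srcu))
			n++;
		srcu_read_unlock(&head->srcu, srcu_idx);

		return n;
	}

On typical setups this returns 2 for dual-ported PCIe SSDs and around 4-6 for
fabrics, regardless of how many hardware queues each controller exposes.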

>> Also, looping in itself isn’t unusual — for example, the queue-depth I/O policy already
>> iterates over all paths in the submission path to check queue depth before dispatching each
>> I/O. That said, if looping in the completion path is still a concern, we could consider
>> moving this into a dedicated worker thread instead. What do you think?
>>
> 
> Not sure if that's a good idea; either the worker thread runs
> asynchronous to the completion and then we have to deal with reliably
> adding up numbers, or we're running synchronous and lose performance.
> Still think that _not_ iterating and just adding up single-cpu latencies
> might be worthwhile.
> 
Yes, the worker thread would run asynchronously with respect to completions,
and we can deal with that in one of two ways (a rough sketch of the second
option follows below):
- If accuracy is critical, we could use atomic counters (e.g. atomic_cmpxchg())
  to update the batch counts and calculate path weights asynchronously in the worker.

- If a small skew is acceptable, snapshot-and-reset is sufficient. Since EWMA
  smoothing absorbs micro-errors, missing 1–2 samples in a 15s window at real
  I/O rates is negligible. Many kernel subsystems (block stats, net flow stats,
  etc.) already rely on snapshot-and-reset without atomics, for exactly this reason.

IMO, for the adaptive policy, we don't need atomic_xchg() unless we're chasing
perfect stats. Snapshot-and-reset of the per-CPU accumulators is good enough,
because EWMA smoothing and weight recalculation will easily absorb micro-errors.
Which of the two would you prefer, in case we go with a worker thread?
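
To make the snapshot-and-reset option a bit more concrete, below is a rough
sketch of what the worker-side update could look like. All names here
(struct adaptive_stat, ns->ad_stat, ADAPTIVE_EWMA_SHIFT) are made up for
illustration and are not taken from the actual patch; it just assumes each
path carries a percpu accumulator:

	/* hypothetical percpu state hanging off struct nvme_ns (ns->ad_stat) */
	struct adaptive_stat {
		u64 batch_lat_ns;	/* latency sum accumulated this epoch */
		u64 batch_cnt;		/* samples accumulated this epoch */
		u64 ewma_lat_ns;	/* smoothed per-CPU latency */
	};

	#define ADAPTIVE_EWMA_SHIFT	3	/* weight of a new sample: 1/8 */

	/* called from the worker every ~15s for each path */
	static void adaptive_update_ewma(struct nvme_ns *ns)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct adaptive_stat *s = per_cpu_ptr(ns->ad_stat, cpu);
			u64 lat, cnt, avg;

			/*
			 * Snapshot-and-reset without atomics: a completion
			 * racing with these four statements may be lost;
			 * the EWMA absorbs that error over the next epochs.
			 */
			lat = s->batch_lat_ns;
			cnt = s->batch_cnt;
			s->batch_lat_ns = 0;
			s->batch_cnt = 0;

			if (!cnt)
				continue;

			avg = div64_u64(lat, cnt);
			/* ewma = (7 * ewma + avg) / 8 */
			s->ewma_lat_ns = ((s->ewma_lat_ns << ADAPTIVE_EWMA_SHIFT) -
					  s->ewma_lat_ns + avg) >> ADAPTIVE_EWMA_SHIFT;
		}
	}

The worker would run this for every path under the head's SRCU read lock and
then derive the per-CPU weights from the resulting EWMAs.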

>>> Do we really need to do this?
>>> What would happen if we just measure the latency on the local CPU
>>> and do away with this loop?
>>> We would have less samples, true, but we would even be able to
>>> not only differentiate between distinct path latency but also between
>>> different CPU latencies; I would think this being a bonus for
>>> multi-socket machines.
>>>
>> The idea is to keep per-cpu view consistent for each path. As we know,
>> in NVMe/fabrics multipath, submission and completion CPUs don’t necessarily
>> match (depends on the host’s irq/core mapping). And so if we were to measure
>> the latency/EWMA locally per-cpu then the per-CPU accumulator might be biased
>> towards the completion CPU, not the submission CPU. For instance, if submission
>> is on CPU A but completion lands on CPU B, then CPU A’s weights never reflect
>> it's I/O experience — they’ll be skewed by how interrupts get steered.
>>
> True. Problem is that for the #CPUs > #queues we're setting up a cpu
> affinity group, and interrupts are directed to one of the CPU in that
> group. I had hoped that the blk-mq code would raise a softirq in that
> case and call .end_request on the cpu registered in the request itself.
> Probably need to be evaluated.
> 
I already evaluated this. If completion ends up on a CPU different from the
submission CPU, then yes, the block layer may raise a softirq to steer the
completion back to the submitting CPU, but this is not always guaranteed.
For instance, if poll queues are used for submission, then submission and
completion on the same CPU is not guaranteed. Similarly, if threaded
interrupts are configured, it is not guaranteed either. In fact, blk-mq
also supports a flag named QUEUE_FLAG_SAME_FORCE which forces completion
on the same CPU as the submission CPU, but again this flag doesn't always
take effect when poll queues or threaded interrupts are in use.
Please refer to blk_mq_complete_request_remote().

>> So on a multi socket/NUMA systems, depending on topology, calculating local
>> per-cpu ewma/latency may or may not line up. For example:
>>
>> - If we have #cpu <= #vectors supported by NVMe disk then typically
>>    we have 1:1 mapping between submission and completion queues and hence all completions for
>>    a queue are steered to the same CPU that also submits, then per-CPU stats are accurate.
>>
>> - But when #CPUs > #vectors, completions may be centralized or spread differently. In that
>>    case, the per-CPU latency view can be distorted — e.g., CPU A may submit, but CPU B takes
>>    completions, so CPU A’s weights never reflect its own I/O behavior.
>>
> See above. We might check if blk-mq doesn't cover for this case already.
> Thing is, I actually _do_ want to measure per-CPU latency.
> On a multi-socket system it really does matter whether an I/O is run
> from a CPU on the socket attached to the PCI device, or from an
> off-socket CPU. If we are calculating just the per-path latency
> we completely miss that (as blk-mq will spread out across _all_
> cpus), but if we are measuring a per-cpu latency we will end up
> with a differential matrix where cpus with the lowest latency
> will be preferred.
> So if we have a system with two sockets and two PCI HBAs, each
> connected to a different socket, using per-path latency will be
> spreading out I/Os across all cpus. Using per-cpu latency will
> direct I/Os to the cpus with the lowest latency, preferring
> the local cpus.
> 
Yes, agreed, and that's why this proposed patch measures per-CPU latency and
_NOT_ per-path latency. Measuring per-CPU latency and using it for
forwarding I/O is ideal for multi-socket/NUMA systems.
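
Just to illustrate why the per-CPU view helps on such systems, here is a very
simplified selection sketch reusing the hypothetical adaptive_stat/ad_stat
from the earlier snippet. This is not the patch's actual selection logic; it
just picks the path with the lowest per-CPU EWMA to show how a CPU close to
one HBA would naturally end up preferring the local path:

	/* caller holds the head->srcu read lock, as in the patch */
	static struct nvme_ns *adaptive_select_path(struct nvme_ns_head *head)
	{
		struct nvme_ns *ns, *best = NULL;
		u64 lat, best_lat = U64_MAX;

		list_for_each_entry_srcu(ns, &head->list, siblings,
					 srcu_read_lock_held(&head->srcu)) {
			if (nvme_path_is_disabled(ns))
				continue;
			/* latency of this path as seen by the submitting CPU;
			 * an unsampled path (ewma == 0) wins here, which real
			 * code would want to handle explicitly */
			lat = this_cpu_ptr(ns->ad_stat)->ewma_lat_ns;
			if (lat < best_lat) {
				best_lat = lat;
				best = ns;
			}
		}
		return best;
	}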

>>> _And_ we wouldn't need to worry about path failures, which is bound
>>> to expose some race conditions if we need to iterate paths at the
>>> same time than path failures are being handled.
>>>
>> Yes agreed we may have some race here and so the path score/weight may be
>> skewed when that happens, but then that'd be auto-corrected in the next epoch
>> (after ~15 sec) when we re-calculate the path weight/score again, isn't it?
>>
> Let's see. I still would want to check if we can't do per-cpu
> statistics, as that would automatically avoid any races :-)
>
As I mentioned above, there are certain use cases (poll queues, threaded
interrupts, etc.) where submission and completion can happen on different CPUs,
so we have to account for such cases and cross the CPU boundary while
accumulating the batch latency.
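
To show what crossing the CPU boundary would look like in practice, here is a
rough completion-side sketch, again with the hypothetical ad_stat from the
earlier snippet: the sample is credited to the CPU that submitted the request
(blk_mq_rq_cpu() returns rq->mq_ctx->cpu), no matter which CPU the completion
actually ran on:

	static void adaptive_account_completion(struct nvme_ns *ns,
						struct request *rq, u64 lat_ns)
	{
		/* attribute the sample to the submission CPU, not the
		 * (possibly different) completion CPU */
		struct adaptive_stat *s = per_cpu_ptr(ns->ad_stat,
						      blk_mq_rq_cpu(rq));

		s->batch_lat_ns += lat_ns;
		s->batch_cnt++;
	}

When completion does run remotely, these updates to another CPU's counters are
exactly the non-atomic writes discussed above; atomics would close that window
if we decide the extra accuracy is worth it.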

Thanks,
--Nilay
 


