public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
@ 2024-07-31 13:46 MANISH PANDEY
  2024-08-01  9:25 ` MANISH PANDEY
  2024-08-02  9:03 ` Christian Loehle
  0 siblings, 2 replies; 29+ messages in thread
From: MANISH PANDEY @ 2024-07-31 13:46 UTC (permalink / raw)
  To: qyousef
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Bart Van Assche,
	Christoph Hellwig, kailash, tkjos, dhavale, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall

Hi Qais Yousef,
Recently we observed that the patch below has been merged:
https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io

This patch causes a performance degradation of ~20% in random IO, along
with a significant drop in sequential IO performance. We would therefore
like to revert this patch, as it heavily impacts MCQ UFS devices;
non-MCQ devices are also affected.

We have several concerns with the patch:
1. It takes away device drivers' ability to affine completions to the
   best possible CPUs and limits drivers to the same group of CPUs.

2. Why can't the device driver use IRQ affinity to pick the desired CPUs
   to complete the IO request, instead of the block layer forcing it?

3. CPUs are already grouped based on LLC; why is a new categorization
   required?

> big performance impact if the IO request
> was done from a CPU with higher capacity but the interrupt is serviced
> on a lower capacity CPU.

This patch doesn't consider the issue of contention in the submission
and completion paths. Also, what if we want to complete a request
submitted from a smaller-capacity CPU on a higher-capacity CPU?
Shouldn't the device driver take care of this and allow vendors to use
the best possible combination they want?
Does it consider MCQ devices and their different SQ<->CQ mappings?

> Without the patch I see the BLOCK softirq always running on little cores
> (where the hardirq is serviced). With it I can see it running on all
> cores.

Why can't we use 'echo 2 > rq_affinity' to force completion on the same
group of CPUs from which the request was initiated?
Also, why force vendors to always use SOFTIRQ for completion?
We should be flexible to complete the IO request via IPI, HARDIRQ or
SOFTIRQ.


An SoC can have many possible CPU configurations, and this patch forces
a restriction on the completion path. The problem is worse on MCQ
devices, as we can have different SQ<->CQ mappings.

So we would like to revert the patch. Please let us know if there are
any concerns.

Regards
Manish Pandey

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-07-31 13:46 Regarding patch "block/blk-mq: Don't complete locally if capacities are different" MANISH PANDEY
@ 2024-08-01  9:25 ` MANISH PANDEY
  2024-08-01 16:05   ` Qais Yousef
  2024-08-02  9:03 ` Christian Loehle
  1 sibling, 1 reply; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-01  9:25 UTC (permalink / raw)
  To: qyousef
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Bart Van Assche,
	Christoph Hellwig, kailash, tkjos, dhavale, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall,
	linux-kernel

++ adding linux-kernel group

On 7/31/2024 7:16 PM, MANISH PANDEY wrote:
> Hi Qais Yousef,
> Recently we observed below patch has been merged
> https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io
> 
> This patch is causing performance degradation ~20% in Random IO along 
> with significant drop in Sequential IO performance. So we would like to 
> revert this patch as it impacts MCQ UFS devices heavily. Though Non MCQ 
> devices are also getting impacted due to this.
> 
> We have several concerns with the patch
> 1. This patch takes away the luxury of affining best possible cpus from 
>    device drivers and limits driver to fall in same group of CPUs.
> 
> 2. Why can't device driver use irq affinity to use desired CPUs to 
> complete the IO request, instead of forcing it from block layer.
> 
> 3. Already CPUs are grouped based on LLC, then if a new categorization 
> is required ?
> 
>> big performance impact if the IO request
>> was done from a CPU with higher capacity but the interrupt is serviced
>> on a lower capacity CPU.
> 
> This patch doesn't considers the issue of contention in submission path 
> and completion path. Also what if we want to complete the request of 
> smaller capacity CPU to Higher capacity CPU?
> Shouldn't a device driver take care of this and allow the vendors to use 
> the best possible combination they want to use?
> Does it considers MCQ devices and different SQ<->CQ mappings?
> 
>> Without the patch I see the BLOCK softirq always running on little cores
>> (where the hardirq is serviced). With it I can see it running on all
>> cores.
> 
> why we can't use echo 2 > rq_affinity to force complete on the same
> group of CPUs from where request was initiated?
> Also why to force vendors to always use SOFTIRQ for completion?
> We should be flexible to either complete the IO request via IPI, HARDIRQ 
> or SOFTIRQ.
> 
> 
> An SoC can have different CPU configuration possible and this patch 
> forces a restriction on the completion path. This problem is more worse 
> in MCQ devices as we can have different SQ<->CQ mapping.
> 
> So we would like to revert the patch. Please let us know if any concerns?
> 
> Regards
> Manish Pandey

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-01  9:25 ` MANISH PANDEY
@ 2024-08-01 16:05   ` Qais Yousef
  0 siblings, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2024-08-01 16:05 UTC (permalink / raw)
  To: MANISH PANDEY
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Bart Van Assche,
	Christoph Hellwig, kailash, tkjos, dhavale, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall,
	linux-kernel

On 08/01/24 14:55, MANISH PANDEY wrote:
> ++ adding linux-kernel group
> 
> On 7/31/2024 7:16 PM, MANISH PANDEY wrote:
> > Hi Qais Yousef,
> > Recently we observed below patch has been merged
> > https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io
> > 
> > This patch is causing performance degradation ~20% in Random IO along
> > with significant drop in Sequential IO performance. So we would like to
> > revert this patch as it impacts MCQ UFS devices heavily. Though Non MCQ
> > devices are also getting impacted due to this.

Could you provide more info about your systems' topology and irq setup please?

> > 
> > We have several concerns with the patch
> > 1. This patch takes away the luxury of affining best possible cpus from
> >   device drivers and limits driver to fall in same group of CPUs.

I don't think it does. If rq_affinity is set to 1, the completion is set
up to match the performance of the requester.

> > 
> > 2. Why can't device driver use irq affinity to use desired CPUs to
> > complete the IO request, instead of forcing it from block layer.

If you set the sysfs rq_affinity to 0, you should be able to distribute
things without the block layer trying to match anything?

> > 
> > 3. Already CPUs are grouped based on LLC, then if a new categorization
> > is required ?

rq_affinity = 1 asks to match the performance of the
requester/originator. On an HMP system, this means looking at grouping
based on capacity, not just LLC.

> > 
> > > big performance impact if the IO request
> > > was done from a CPU with higher capacity but the interrupt is serviced
> > > on a lower capacity CPU.
> > 
> > This patch doesn't considers the issue of contention in submission path

What is the issue of contention? Could you please explain it in more detail?

> > and completion path. Also what if we want to complete the request of
> > smaller capacity CPU to Higher capacity CPU?

I assume you want to use rq_affinity = 1 to match based on LLC but not
capacity; I'm curious why.

Is this on a 6.1 kernel or beyond? Arm systems should have a shared LLC
based on the DSU, which is enforced in the topology since 6.1, so I
can't see why you would want to match based on LLC but not capacity.
Could you provide more info please?

If you don't want block layer to do any affinity, isn't it better to set
rq_affinity = 0?

> > Shouldn't a device driver take care of this and allow the vendors to use
> > the best possible combination they want to use?
> > Does it considers MCQ devices and different SQ<->CQ mappings?

Why this consideration matters when matching the perf based on capacity but it
doesn't matter when matching it based on LLC?

> > 
> > > Without the patch I see the BLOCK softirq always running on little cores
> > > (where the hardirq is serviced). With it I can see it running on all
> > > cores.
> > 
> > why we can't use echo 2 > rq_affinity to force complete on the same
> > group of CPUs from where request was initiated?

They are not the same. rq_affinity = 1 means match: if both CPUs are
already in the same LLC or capacity group, there is no need to force
anything. rq_affinity = 2 means always match. It's not the same thing,
is it?


Thanks

--
Qais Yousef

> > Also why to force vendors to always use SOFTIRQ for completion?
> > We should be flexible to either complete the IO request via IPI, HARDIRQ
> > or SOFTIRQ.
> > 
> > 
> > An SoC can have different CPU configuration possible and this patch
> > forces a restriction on the completion path. This problem is more worse
> > in MCQ devices as we can have different SQ<->CQ mapping.
> > 
> > So we would like to revert the patch. Please let us know if any concerns?
> > 
> > Regards
> > Manish Pandey

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-07-31 13:46 Regarding patch "block/blk-mq: Don't complete locally if capacities are different" MANISH PANDEY
  2024-08-01  9:25 ` MANISH PANDEY
@ 2024-08-02  9:03 ` Christian Loehle
  2024-08-05  2:07   ` Qais Yousef
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2024-08-02  9:03 UTC (permalink / raw)
  To: MANISH PANDEY, qyousef
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Bart Van Assche,
	Christoph Hellwig, kailash, tkjos, dhavale, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall

On 7/31/24 14:46, MANISH PANDEY wrote:
> Hi Qais Yousef,

Qais already asked the important questions; still, a few from my end.

> Recently we observed below patch has been merged
> https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io
> 
> This patch is causing performance degradation ~20% in Random IO along with significant drop in Sequential IO performance. So we would like to revert this patch as it impacts MCQ UFS devices heavily. Though Non MCQ devices are also getting impacted due to this.

I'm curious about the sequential IO part in particular: what are the
blocksize and throughput? If the blocksize is large enough, the
completion and submission parts are hopefully not as critical.

> 
> We have several concerns with the patch
> 1. This patch takes away the luxury of affining best possible cpus from   device drivers and limits driver to fall in same group of CPUs.
> 
> 2. Why can't device driver use irq affinity to use desired CPUs to complete the IO request, instead of forcing it from block layer.
> 
> 3. Already CPUs are grouped based on LLC, then if a new categorization is required ?

As Qais hinted at, because of systems that share LLC on all CPUs but are HMP.

> 
>> big performance impact if the IO request
>> was done from a CPU with higher capacity but the interrupt is serviced
>> on a lower capacity CPU.
> 
> This patch doesn't considers the issue of contention in submission path and completion path. Also what if we want to complete the request of smaller capacity CPU to Higher capacity CPU?
> Shouldn't a device driver take care of this and allow the vendors to use the best possible combination they want to use?
> Does it considers MCQ devices and different SQ<->CQ mappings?

So I'm assuming you're seeing something like the following:
some CPU(s) (call them S) are submitting IO, and the hardirq triggers
on S. Before the patch the completion softirq could run on a !S CPU;
now it runs on S. Am I then correct in assuming your workload is
CPU-bound on S? Would you share some details about the workload, too?

What's the capacity of CPU(s) S then?
IOW does this help?

-->8--

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b55..a4a2500c4ef6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
        if (cpu == rq->mq_ctx->cpu ||
            (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
             cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
-            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
+            arch_scale_cpu_capacity(cpu) >= arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
                return false;
 
        /* don't try to IPI to an offline CPU */

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-02  9:03 ` Christian Loehle
@ 2024-08-05  2:07   ` Qais Yousef
  2024-08-05 10:18     ` Christian Loehle
  2024-08-05 17:17     ` Bart Van Assche
  0 siblings, 2 replies; 29+ messages in thread
From: Qais Yousef @ 2024-08-05  2:07 UTC (permalink / raw)
  To: Christian Loehle
  Cc: MANISH PANDEY, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Bart Van Assche, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall

On 08/02/24 10:03, Christian Loehle wrote:
> On 7/31/24 14:46, MANISH PANDEY wrote:
> > Hi Qais Yousef,
> 
> Qais already asked the important question, still some from my end.
> 
> > Recently we observed below patch has been merged
> > https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io
> > 
> > This patch is causing performance degradation ~20% in Random IO along with significant drop in Sequential IO performance. So we would like to revert this patch as it impacts MCQ UFS devices heavily. Though Non MCQ devices are also getting impacted due to this.
> 
> I'm curious about the sequential IO part in particular, what's the blocksize and throughput?
> If blocksize is large enough the completion and submission parts are hopefully not as critical.
> 
> > 
> > We have several concerns with the patch
> > 1. This patch takes away the luxury of affining best possible cpus from   device drivers and limits driver to fall in same group of CPUs.
> > 
> > 2. Why can't device driver use irq affinity to use desired CPUs to complete the IO request, instead of forcing it from block layer.
> > 
> > 3. Already CPUs are grouped based on LLC, then if a new categorization is required ?
> 
> As Qais hinted at, because of systems that share LLC on all CPUs but are HMP.
> 
> > 
> >> big performance impact if the IO request
> >> was done from a CPU with higher capacity but the interrupt is serviced
> >> on a lower capacity CPU.
> > 
> > This patch doesn't considers the issue of contention in submission path and completion path. Also what if we want to complete the request of smaller capacity CPU to Higher capacity CPU?
> > Shouldn't a device driver take care of this and allow the vendors to use the best possible combination they want to use?
> > Does it considers MCQ devices and different SQ<->CQ mappings?
> 
> So I'm assuming you're seeing something like the following:
> Some CPU(s) (call them S) are submitting IO, hardirq triggers on
> S.
> Before the patch the completion softirq could run on a !S CPU,
> now it runs on S. Am I then correct in assuming your workload
> is CPU-bound on S? Would you share some details about the
> workload, too?
> 
> What's the capacity of CPU(s) S then?
> IOW does this help?
> 
> -->8--
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b55..a4a2500c4ef6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
>         if (cpu == rq->mq_ctx->cpu ||
>             (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>              cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
> +            arch_scale_cpu_capacity(cpu) >= arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>                 return false;
>  
>         /* don't try to IPI to an offline CPU */

FWIW, that's what I had in the first version of the patch, but I moved
away from it. I think this would constitute a policy.

Keep in mind that a driver setting affinities, as in Manish's case, is
not representative of in-kernel drivers; I don't anticipate an in-kernel
driver hardcoding affinities, otherwise it wouldn't be portable.
irqbalancers usually move the interrupts around, and I'm not sure we can
make assumptions about the reason an interrupt is triggering on a
different-capacity CPU.

My understanding of rq_affinity=1 is to match the perf of the requester.
Given that a characteristic of HMP systems is that power is of equal
importance to perf (I think this has now become true for all systems, by
the way), saying that the match in one direction is better than the
other amounts to forcing a perf-first policy, which I don't think is a
good thing to enforce. We don't have enough info to decide at this
level, and our users care about both.

If no matching is required, it makes sense to set rq_affinity to 0. When
matching is enabled, we need to rely on per-task iowait boost to help
the requester run on a bigger CPU, and the completion will naturally
follow when rq_affinity=1. If the requester doesn't need big-core perf
but the irq triggered on a bigger core, I struggle to understand why it
is good for the completion to run on the bigger core without the
requester also being on a similarly big core to truly maximize perf.

By the way, if we assume the LLC wasn't shared, still on an HMP system,
then with my patch reverted the behavior would be to move the completion
from the bigger core to a little core.

So two things to observe:

1. The patch keeps the past behavior for systems where the LLC truly is
   not shared.
2. The LLC in this case is most likely L2, and the usual trend is that
   the bigger the core, the bigger the L2. So the LLC characteristics
   differ and could have impacted performance, yet no one seems to have
   cared in the past. I think capacity now captures this notion
   implicitly.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05  2:07   ` Qais Yousef
@ 2024-08-05 10:18     ` Christian Loehle
  2024-08-05 17:24       ` MANISH PANDEY
  2024-08-09  0:23       ` Qais Yousef
  2024-08-05 17:17     ` Bart Van Assche
  1 sibling, 2 replies; 29+ messages in thread
From: Christian Loehle @ 2024-08-05 10:18 UTC (permalink / raw)
  To: Qais Yousef
  Cc: MANISH PANDEY, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Bart Van Assche, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

Sorry, I didn't include linux-kernel, so my and Qais' reply that is missing
is quoted here in full. You can also find it here:
https://lore.kernel.org/all/d2009fca-57db-49e6-a874-e8291c3e27f5@quicinc.com/T/#m2f5195487d2e3ad0c0b6fc68f9967704c7aa4a70

On 8/5/24 03:07, Qais Yousef wrote:
> On 08/02/24 10:03, Christian Loehle wrote:
>> On 7/31/24 14:46, MANISH PANDEY wrote:
>>> Hi Qais Yousef,
>>
>> Qais already asked the important question, still some from my end.
>>
>>> Recently we observed below patch has been merged
>>> https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io
>>>
>>> This patch is causing performance degradation ~20% in Random IO along with significant drop in Sequential IO performance. So we would like to revert this patch as it impacts MCQ UFS devices heavily. Though Non MCQ devices are also getting impacted due to this.
>>
>> I'm curious about the sequential IO part in particular, what's the blocksize and throughput?
>> If blocksize is large enough the completion and submission parts are hopefully not as critical.
>>
>>>
>>> We have several concerns with the patch
>>> 1. This patch takes away the luxury of affining best possible cpus from   device drivers and limits driver to fall in same group of CPUs.
>>>
>>> 2. Why can't device driver use irq affinity to use desired CPUs to complete the IO request, instead of forcing it from block layer.
>>>
>>> 3. Already CPUs are grouped based on LLC, then if a new categorization is required ?
>>
>> As Qais hinted at, because of systems that share LLC on all CPUs but are HMP.
>>
>>>
>>>> big performance impact if the IO request
>>>> was done from a CPU with higher capacity but the interrupt is serviced
>>>> on a lower capacity CPU.
>>>
>>> This patch doesn't considers the issue of contention in submission path and completion path. Also what if we want to complete the request of smaller capacity CPU to Higher capacity CPU?
>>> Shouldn't a device driver take care of this and allow the vendors to use the best possible combination they want to use?
>>> Does it considers MCQ devices and different SQ<->CQ mappings?
>>
>> So I'm assuming you're seeing something like the following:
>> Some CPU(s) (call them S) are submitting IO, hardirq triggers on
>> S.
>> Before the patch the completion softirq could run on a !S CPU,
>> now it runs on S. Am I then correct in assuming your workload
>> is CPU-bound on S? Would you share some details about the
>> workload, too?
>>
>> What's the capacity of CPU(s) S then?
>> IOW does this help?
>>
>> -->8--
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e3c3c0c21b55..a4a2500c4ef6 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
>>         if (cpu == rq->mq_ctx->cpu ||
>>             (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>>              cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
>> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
>> +            arch_scale_cpu_capacity(cpu) >= arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>>                 return false;
>>  
>>         /* don't try to IPI to an offline CPU */
> 
> FWIW, that's what I had in the first version of the patch, but moved away from
> it. I think this will constitute a policy.
> 
> Keep in mind that driver setting affinity like Manish case is not something
> represent a kernel driver as I don't anticipate in-kernel driver to hardcode
> affinities otherwise they won't be portable. irqbalancers usually move the
> interrupts, and I'm not sure we can make an assumption about the reason an
> interrupt is triggering on different capacity CPU.

FWIW, that patch was mostly to narrow down the problem Manish is
experiencing; I wasn't proposing it as a proper patch.

> My understanding of rq_affinity=1 is to match the perf of requester. Given that
> the characteristic of HMP system is that power has an equal importance to perf
> (I think this now has become true for all systems by the way), saying that the
> match in one direction is better than the other is sort of forcing a policy of
> perf first which I don't think is a good thing to enforce. We don't have enough
> info to decide at this level. And our users care about both.

I would argue rq_affinity=1 matches the perf, so that flag should
already bias slightly toward perf over power?
Although the actual effect on power probably isn't that significant:
the (e.g. big) CPU has submitted the IO and is woken up soon anyway, so
you can almost ignore a potential idle wakeup, and the actual CPU time
spent in the block completion is of course pretty short.

> If no matching is required, it makes sense to set rq_affinity to 0. When
> matching is enabled, we need to rely on per-task iowait boost to help the
> requester to run at a bigger CPU, and naturally the completion will follow when
> rq_affinity=1. If the requester doesn't need the big perf, but the irq
> triggered on a bigger core, I struggle to understand why it is good for
> completion to run on bigger core without the requester also being on a similar
> bigger core to truly maximize perf.

So first of all, per-task iowait boosting has nothing to do with it,
IMO; plenty of IO workloads build up utilization perfectly fine.
I wouldn't necessarily consider the setup "requester on a little CPU,
irq+completion on a big CPU" invalid: it does decrease IO latency for
the application. Consider the IO being page faults (maybe even from
various applications running on the littles).

> 
> By the way, if we assume LLC wasn't the same, then assuming HMP system too, and
> reverting my patch, then the behavior was to move the completion from bigger
> core to little core.
> 
> So two things to observe:
> 
> 1. The patch keeps the behavior when LLC truly is not shared on such systems,
>    which was in the past.
> 2. LLC in this case is most likely L2, and the usual trend is that the bigger
>    the core the bigger L2. So the LLC characteristic is different and could
>    have impacted performance. No one seem to have cared in the past. I think
>    capacity gives this notion now implicitly.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05  2:07   ` Qais Yousef
  2024-08-05 10:18     ` Christian Loehle
@ 2024-08-05 17:17     ` Bart Van Assche
  2024-08-05 17:35       ` MANISH PANDEY
  2024-08-09  0:28       ` Qais Yousef
  1 sibling, 2 replies; 29+ messages in thread
From: Bart Van Assche @ 2024-08-05 17:17 UTC (permalink / raw)
  To: Qais Yousef, Christian Loehle
  Cc: MANISH PANDEY, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Christoph Hellwig, kailash, tkjos, dhavale, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall

On 8/4/24 7:07 PM, Qais Yousef wrote:
> irqbalancers usually move the interrupts, and I'm not sure we can
> make an assumption about the reason an interrupt is triggering on
> different capacity CPU.
User space software can't modify the affinity of managed interrupts.
From include/linux/irq.h:

  * IRQD_AFFINITY_MANAGED - Affinity is auto-managed by the kernel

That flag is tested by the procfs code that implements the smp_affinity
procfs attribute:

static ssize_t write_irq_affinity(int type, struct file *file,
		const char __user *buffer, size_t count, loff_t *pos)
{
	[ ... ]
	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
		return -EIO;
	[ ... ]
}

I'm not sure whether or not the interrupts on Manish's test setup are
managed. Manish, can you please provide the output of the following
commands?

adb shell 'grep -i ufshcd /proc/interrupts'
adb shell 'grep -i ufshcd /proc/interrupts | while read a b; do ls -ld /proc/irq/${a%:}/smp_affinity; done'
adb shell 'grep -i ufshcd /proc/interrupts | while read a b; do grep -aH . /proc/irq/${a%:}/smp_affinity; done'

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 10:18     ` Christian Loehle
@ 2024-08-05 17:24       ` MANISH PANDEY
  2024-08-09  0:47         ` Qais Yousef
  2024-08-09  0:23       ` Qais Yousef
  1 sibling, 1 reply; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-05 17:24 UTC (permalink / raw)
  To: Christian Loehle, Qais Yousef
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Bart Van Assche,
	Christoph Hellwig, kailash, tkjos, dhavale, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org



On 8/5/2024 3:48 PM, Christian Loehle wrote:
> Sorry, I didn't include linux-kernel, so my and Qais' reply that is missing
> is quoted here in full. You can also find it here:
> https://lore.kernel.org/all/d2009fca-57db-49e6-a874-e8291c3e27f5@quicinc.com/T/#m2f5195487d2e3ad0c0b6fc68f9967704c7aa4a70
> 
> On 8/5/24 03:07, Qais Yousef wrote:
>> On 08/02/24 10:03, Christian Loehle wrote:
>>> On 7/31/24 14:46, MANISH PANDEY wrote:
>>>> Hi Qais Yousef,
>>>
>>> Qais already asked the important question, still some from my end.
>>>
>>>> Recently we observed below patch has been merged
>>>> https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io
>>>>
>>>> This patch is causing performance degradation ~20% in Random IO along with significant drop in Sequential IO performance. So we would like to revert this patch as it impacts MCQ UFS devices heavily. Though Non MCQ devices are also getting impacted due to this.
>>>
>>> I'm curious about the sequential IO part in particular, what's the blocksize and throughput?
>>> If blocksize is large enough the completion and submission parts are hopefully not as critical.
>>>
>>>>
>>>> We have several concerns with the patch
>>>> 1. This patch takes away the luxury of affining best possible cpus from   device drivers and limits driver to fall in same group of CPUs.
>>>>
>>>> 2. Why can't device driver use irq affinity to use desired CPUs to complete the IO request, instead of forcing it from block layer.
>>>>
>>>> 3. Already CPUs are grouped based on LLC, then if a new categorization is required ?
>>>
>>> As Qais hinted at, because of systems that share LLC on all CPUs but are HMP.
>>>
>>>>
>>>>> big performance impact if the IO request
>>>>> was done from a CPU with higher capacity but the interrupt is serviced
>>>>> on a lower capacity CPU.
>>>>
>>>> This patch doesn't considers the issue of contention in submission path and completion path. Also what if we want to complete the request of smaller capacity CPU to Higher capacity CPU?
>>>> Shouldn't a device driver take care of this and allow the vendors to use the best possible combination they want to use?
>>>> Does it considers MCQ devices and different SQ<->CQ mappings?
>>>
>>> So I'm assuming you're seeing something like the following:
>>> Some CPU(s) (call them S) are submitting IO, hardirq triggers on
>>> S.
>>> Before the patch the completion softirq could run on a !S CPU,
>>> now it runs on S. Am I then correct in assuming your workload
>>> is CPU-bound on S? Would you share some details about the
>>> workload, too?
>>>
I would like to share details with an example:
Say an SoC has three clusters, S (small), M (medium) and L (large), and
all three clusters share L3 as the LLC.
On UFS with MCQ, most submissions would be done from the M cluster, so
the SQs' CPUs would mostly be bound to the M cluster. Now, if we assign
the ESI IRQ for a CQ to the same M-cluster CPUs, those CPUs get
overloaded with both SQ work and the IRQ bound to that CQ (IRQn <-> CQn).
So we intentionally want to affine IRQn <-> CQn to the large cluster;
as the LLC is shared, we don't need an IPI.
Earlier, with the code below, blk_mq_complete_need_ipi() returned false
and hence blk_mq_complete_request_remote() returned false. Thus we
complete the CQs of the M cluster on an L-cluster CPU in HardIRQ context
and get better perf numbers.

blk_mq_complete_need_ipi()
	/* same CPU or cache domain? Complete locally */
	if (cpu == rq->mq_ctx->cpu ||
	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
	     cpus_share_cache(cpu, rq->mq_ctx->cpu)))
		return false;

void blk_mq_complete_request(struct request *rq)
{
	if (!blk_mq_complete_request_remote(rq))  /* --> true */
		rq->q->mq_ops->complete(rq);      /* --> executed here */
}

But now, with the addition of cpus_equal_capacity(cpu, rq->mq_ctx->cpu),
blk_mq_complete_need_ipi() opts for an IPI. Consequently
blk_mq_complete_request_remote() returns true, and
__blk_mq_complete_request_remote() completes the request in SOFTIRQ
context, which costs performance due to the extra context switching.


>>> What's the capacity of CPU(s) S then?
>>> IOW does this help?
>>>
>>> -->8--
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index e3c3c0c21b55..a4a2500c4ef6 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
>>>          if (cpu == rq->mq_ctx->cpu ||
>>>              (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>>>               cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
>>> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
>>> +            arch_scale_cpu_capacity(cpu) >= arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>>>                  return false;
>>>   
>>>          /* don't try to IPI to an offline CPU */
>>
>> FWIW, that's what I had in the first version of the patch, but moved away from
>> it. I think this will constitute a policy.
>>
>> Keep in mind that a driver setting affinities like in Manish's case is not
>> representative of in-kernel drivers in general; I don't anticipate an
>> in-kernel driver hardcoding affinities, otherwise it wouldn't be portable.
>> irqbalancers usually move the interrupts, and I'm not sure we can make an
>> assumption about the reason an interrupt is triggering on a
>> different-capacity CPU.
>
> FWIW that patch was mostly to narrow down the problem Manish is experiencing, I
> wasn't proposing it as a proper patch.
> 
>> My understanding of rq_affinity=1 is to match the perf of requester. Given that
>> the characteristic of HMP system is that power has an equal importance to perf
>> (I think this now has become true for all systems by the way), saying that the
>> match in one direction is better than the other is sort of forcing a policy of
>> perf first which I don't think is a good thing to enforce. We don't have enough
>> info to decide at this level. And our users care about both.
> 
> I would argue rq_affinity=1 matches the perf, so that flag should already bias
> perf in favor of power slightly?
> Although the actual effect on power probably isn't that significant, given
> that the (e.g. big) CPU has submitted the IO, is woken up soon, so you could
> almost ignore a potential idle wakeup and the actual CPU time spent in the block
> completion is pretty short of course.
> 
>> If no matching is required, it makes sense to set rq_affinity to 0. When
>> matching is enabled, we need to rely on per-task iowait boost to help the
>> requester to run at a bigger CPU, and naturally the completion will follow when
>> rq_affinity=1. If the requester doesn't need the big perf, but the irq
>> triggered on a bigger core, I struggle to understand why it is good for
>> completion to run on bigger core without the requester also being on a similar
>> bigger core to truly maximize perf.
> 
Not all SoCs implement L3 as a shared LLC. There are SoCs where L2 is the
LLC and is not shared among all CPU clusters. So in this case, if we use
rq_affinity=0, a request may be completed on a CPU that doesn't share the
L2 cache with the submitter.
Say in a system CPUs [0-5] share one L2 as LLC and CPUs [6-7] share
another; then any request from CPU [0-5] whose IRQ is served on CPU [6-7]
(or vice versa) would be completed on a CPU that doesn't actually share
the cache, which would result in low performance.


> So first of all, per-task iowait boosting has nothing to do with it IMO.
> Plenty of IO workloads build up utilization perfectly fine.
> I wouldn't consider the setup: requester little perf, irq+completion big perf
> invalid necessarily, it does decrease IO latency for the application.
> Consider the IO being page faults (maybe even of various applications running
> on little).
> 
>>
>> By the way, if we assume LLC wasn't the same, then assuming HMP system too, and
>> reverting my patch, then the behavior was to move the completion from bigger
>> core to little core.
>>
>> So two things to observe:
>>
>> 1. The patch keeps the behavior when LLC truly is not shared on such systems,
>>     which was in the past.
>> 2. LLC in this case is most likely L2, and the usual trend is that the bigger
>>     the core the bigger L2. So the LLC characteristic is different and could
>>     have impacted performance. No one seem to have cared in the past. I think
>>     capacity gives this notion now implicitly.
> 

So basically the change forces IO completions into SOFTIRQ context that
could have been completed in hardirq context itself, and this affects MCQ
devices the most.

For non-MCQ devices there is also a case where we want to use low-capacity
CPUs for completing requests (handling IRQs on S CPUs) to save power and
to avoid waking up all CPUs of the same capacity just to submit and
complete the IO requests.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 17:17     ` Bart Van Assche
@ 2024-08-05 17:35       ` MANISH PANDEY
  2024-08-05 17:52         ` Bart Van Assche
  2024-08-09  0:28       ` Qais Yousef
  1 sibling, 1 reply; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-05 17:35 UTC (permalink / raw)
  To: Bart Van Assche, Qais Yousef, Christian Loehle
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Christoph Hellwig,
	kailash, tkjos, dhavale, bvanassche, quic_nitirawa, quic_cang,
	quic_rampraka, quic_narepall



On 8/5/2024 10:47 PM, Bart Van Assche wrote:
> On 8/4/24 7:07 PM, Qais Yousef wrote:
>> irqbalancers usually move the interrupts, and I'm not sure we can
>> make an assumption about the reason an interrupt is triggering on
>> different capacity CPU.
> User space software can't modify the affinity of managed interrupts.
>  From include/linux/irq.h:
> 
>   * IRQD_AFFINITY_MANAGED - Affinity is auto-managed by the kernel
> 
> That flag is tested by the procfs code that implements the smp_affinity
> procfs attribute:
> 
> static ssize_t write_irq_affinity(int type, struct file *file,
>          const char __user *buffer, size_t count, loff_t *pos)
> {
>      [ ... ]
>      if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
>          return -EIO;
>      [ ... ]
> }
> 
> I'm not sure whether or not the interrupts on Manish test setup are
> managed. Manish, can you please provide the output of the following
> commands?
> 
> adb shell 'grep -i ufshcd /proc/interrupts'
> adb shell 'grep -i ufshcd /proc/interrupts | while read a b; do ls -ld 
> /proc/irq/${a%:}/smp_affinity; done'
> adb shell 'grep -i ufshcd /proc/interrupts | while read a b; do grep -aH 
> . /proc/irq/${a%:}/smp_affinity; done'
> 
In our SoCs we manage the power/perf balance by dynamically changing the
IRQ affinity based on load. Say if we have more load, we affine the UFS
IRQs to large-cluster CPUs, and if we have less load, we affine the IRQs
to small-cluster CPUs.

Also, since on some SoCs IRQs are delivered mainly to small-cluster CPUs
by default, we manage to complete the requests there by using
QUEUE_FLAG_SAME_FORCE.

This issue affects UFS MCQ devices the most, which use ESI/MSI IRQs and
have the ESI IRQs distributed across the CQs.
Mostly we use large-cluster CPUs for binding IRQ and CQ, and hence more
completions happen on the large cluster, which won't be a same-capacity
CPU since the request may come from the S/M clusters.

> Thanks,
> 
> Bart.


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 17:35       ` MANISH PANDEY
@ 2024-08-05 17:52         ` Bart Van Assche
  2024-08-08  6:05           ` MANISH PANDEY
  0 siblings, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2024-08-05 17:52 UTC (permalink / raw)
  To: MANISH PANDEY, Qais Yousef, Christian Loehle
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Christoph Hellwig,
	kailash, tkjos, dhavale, bvanassche, quic_nitirawa, quic_cang,
	quic_rampraka, quic_narepall

On 8/5/24 10:35 AM, MANISH PANDEY wrote:
> In our SoC's we manage Power and Perf balancing by dynamically changing 
> the IRQs based on the load. Say if we have more load, we assign UFS IRQs 
> on Large cluster CPUs and if we have less load, we affine the IRQs on 
> Small cluster CPUs.

I don't think that this is compatible with the command completion code
in the block layer core. The blk-mq code is based on the assumption that
the association of a completion interrupt with a CPU core does not
change. See also the blk_mq_map_queues() function and its callers.

Is this mechanism even useful? If completion interrupts are always sent 
to the CPU core that submitted the I/O, no interrupts will be sent to
the large cluster if no code that submits I/O is running on that
cluster. Sending e.g. all completion interrupts to the large cluster can
be achieved by migrating all processes and threads to the large cluster.

> This issue affects UFS MCQ devices the most, which use ESI/MSI IRQs
> and have the ESI IRQs distributed across the CQs.
> Mostly we use Large cluster CPUs for binding IRQ and CQ and hence 
> completing more completions on Large cluster which won't be from same 
> capacity CPU as request may be from S/M clusters.

Please use an approach that is supported by the block layer. I don't
think that dynamically changing the IRQ affinity is compatible with the
block layer.

Thanks,

Bart.



* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 17:52         ` Bart Van Assche
@ 2024-08-08  6:05           ` MANISH PANDEY
  2024-08-09  0:36             ` Qais Yousef
  2024-08-11 17:41             ` Dietmar Eggemann
  0 siblings, 2 replies; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-08  6:05 UTC (permalink / raw)
  To: Bart Van Assche, Qais Yousef, Christian Loehle
  Cc: axboe, mingo, peterz, vincent.guittot, dietmar.eggemann,
	linux-block, sudeep.holla, Jaegeuk Kim, Christoph Hellwig,
	kailash, tkjos, dhavale, bvanassche, quic_nitirawa, quic_cang,
	quic_rampraka, quic_narepall



On 8/5/2024 11:22 PM, Bart Van Assche wrote:
> On 8/5/24 10:35 AM, MANISH PANDEY wrote:
>> In our SoC's we manage Power and Perf balancing by dynamically 
>> changing the IRQs based on the load. Say if we have more load, we 
>> assign UFS IRQs on Large cluster CPUs and if we have less load, we 
>> affine the IRQs on Small cluster CPUs.
> 
> I don't think that this is compatible with the command completion code
> in the block layer core. The blk-mq code is based on the assumption that
> the association of a completion interrupt with a CPU core does not
> change. See also the blk_mq_map_queues() function and its callers.
> 
The IRQ <-> CPU binding is established before the start of the operation,
which ensures the completion-interrupt CPU doesn't change.

> Is this mechanism even useful? If completion interrupts are always sent 
> to the CPU core that submitted the I/O, no interrupts will be sent to
> the large cluster if no code that submits I/O is running on that
> cluster. Sending e.g. all completion interrupts to the large cluster can
> be achieved by migrating all processes and threads to the large cluster.
> 
 >> migrating all completion interrupts to the large cluster can
 >> be achieved by migrating all processes and threads to the large
 >> cluster.

Agreed, this can be achieved, but then all the processes and threads would
have to be migrated to the large cluster, and that has power impacts.
Hence, to balance power and perf, it is not the preferred way for vendors.

>> This issue affects UFS MCQ devices the most, which use ESI/MSI IRQs
>> and have the ESI IRQs distributed across the CQs.
>> Mostly we use Large cluster CPUs for binding IRQ and CQ and hence 
>> completing more completions on Large cluster which won't be from same 
>> capacity CPU as request may be from S/M clusters.
> 
> Please use an approach that is supported by the block layer. I don't
> think that dynamically changing the IRQ affinity is compatible with the
> block layer.

For UFS with MCQ, the ESI IRQs are bound at initialization time.
So basically I would like to use high-performance-cluster CPUs to migrate
a few completions away from the mid clusters and take advantage of the
high-capacity CPUs. The new change takes this opportunity away from the
driver. So basically we should be able to use high-performance CPUs like
below:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b55..a4a2500c4ef6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
         if (cpu == rq->mq_ctx->cpu ||
             (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
              cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
-            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
+            arch_scale_cpu_capacity(cpu) >=
+            arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
                 return false;

This way the driver can use the best possible CPUs for its use case.
> 
> Thanks,
> 
> Bart.
> 


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 10:18     ` Christian Loehle
  2024-08-05 17:24       ` MANISH PANDEY
@ 2024-08-09  0:23       ` Qais Yousef
  2024-08-13 16:20         ` Christian Loehle
  1 sibling, 1 reply; 29+ messages in thread
From: Qais Yousef @ 2024-08-09  0:23 UTC (permalink / raw)
  To: Christian Loehle
  Cc: MANISH PANDEY, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Bart Van Assche, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 08/05/24 11:18, Christian Loehle wrote:

> > My understanding of rq_affinity=1 is to match the perf of requester. Given that
> > the characteristic of HMP system is that power has an equal importance to perf
> > (I think this now has become true for all systems by the way), saying that the
> > match in one direction is better than the other is sort of forcing a policy of
> > perf first which I don't think is a good thing to enforce. We don't have enough
> > info to decide at this level. And our users care about both.
> 
> I would argue rq_affinity=1 matches the perf, so that flag should already bias
> perf in favor of power slightly?

Not on this type of system. If perf were the only important thing, you'd
just use equally big CPUs. Balancing perf and power is important on those
systems, and I don't think we have enough info to decide which choice is
best when the capacities are not the same. Matching the perf level of the
requesting CPU makes sense when rq_affinity=1.

> Although the actual effect on power probably isn't that significant, given
> that the (e.g. big) CPU has submitted the IO, is woken up soon, so you could
> almost ignore a potential idle wakeup and the actual CPU time spent in the block
> completion is pretty short of course.
> 
> > If no matching is required, it makes sense to set rq_affinity to 0. When
> > matching is enabled, we need to rely on per-task iowait boost to help the
> > requester to run at a bigger CPU, and naturally the completion will follow when
> > rq_affinity=1. If the requester doesn't need the big perf, but the irq
> > triggered on a bigger core, I struggle to understand why it is good for
> > completion to run on bigger core without the requester also being on a similar
> > bigger core to truly maximize perf.
> 
> So first of all, per-task iowait boosting has nothing to do with it IMO.

It has. If the perf is not good because the requester is running on a
little core, the requester needs to move up to ensure the overall IO perf
is better.

> Plenty of IO workloads build up utilization perfectly fine.

These ones have no problems, no? They should migrate to big core and the
completion will follow them when they move.

> I wouldn't consider the setup: requester little perf, irq+completion big perf
> invalid necessarily, it does decrease IO latency for the application.

I didn't say invalid. But it is not something we can guess automatically
when rq_affinity=1. We don't have enough info to judge. The only info we
have is that the requester that originated the request is running at a
different perf level (whether higher or lower), so we follow it.

> Consider the IO being page faults (maybe even of various applications running
> on little).
> 
> > 
> > By the way, if we assume LLC wasn't the same, then assuming HMP system too, and
> > reverting my patch, then the behavior was to move the completion from bigger
> > core to little core.
> > 
> > So two things to observe:
> > 
> > 1. The patch keeps the behavior when LLC truly is not shared on such systems,
> >    which was in the past.
> > 2. LLC in this case is most likely L2, and the usual trend is that the bigger
> >    the core the bigger L2. So the LLC characteristic is different and could
> >    have impacted performance. No one seem to have cared in the past. I think
> >    capacity gives this notion now implicitly.
> 


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 17:17     ` Bart Van Assche
  2024-08-05 17:35       ` MANISH PANDEY
@ 2024-08-09  0:28       ` Qais Yousef
  1 sibling, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2024-08-09  0:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christian Loehle, MANISH PANDEY, axboe, mingo, peterz,
	vincent.guittot, dietmar.eggemann, linux-block, sudeep.holla,
	Jaegeuk Kim, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall

On 08/05/24 10:17, Bart Van Assche wrote:
> On 8/4/24 7:07 PM, Qais Yousef wrote:
> > irqbalancers usually move the interrupts, and I'm not sure we can
> > make an assumption about the reason an interrupt is triggering on
> > different capacity CPU.
> User space software can't modify the affinity of managed interrupts.
> From include/linux/irq.h:

True. But this is special case and was introduced for isolated CPUs. I don't
think drivers can request this themselves.

> 
>  * IRQD_AFFINITY_MANAGED - Affinity is auto-managed by the kernel
> 
> That flag is tested by the procfs code that implements the smp_affinity
> procfs attribute:
> 
> static ssize_t write_irq_affinity(int type, struct file *file,
> 		const char __user *buffer, size_t count, loff_t *pos)
> {
> 	[ ... ]
> 	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
> 		return -EIO;
> 	[ ... ]
> }
> 
> I'm not sure whether or not the interrupts on Manish test setup are
> managed. Manish, can you please provide the output of the following
> commands?
> 
> adb shell 'grep -i ufshcd /proc/interrupts'
> adb shell 'grep -i ufshcd /proc/interrupts | while read a b; do ls -ld
> /proc/irq/${a%:}/smp_affinity; done'
> adb shell 'grep -i ufshcd /proc/interrupts | while read a b; do grep -aH .
> /proc/irq/${a%:}/smp_affinity; done'
> 
> Thanks,
> 
> Bart.


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-08  6:05           ` MANISH PANDEY
@ 2024-08-09  0:36             ` Qais Yousef
  2024-08-11 17:41             ` Dietmar Eggemann
  1 sibling, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2024-08-09  0:36 UTC (permalink / raw)
  To: MANISH PANDEY
  Cc: Bart Van Assche, Christian Loehle, axboe, mingo, peterz,
	vincent.guittot, dietmar.eggemann, linux-block, sudeep.holla,
	Jaegeuk Kim, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall

On 08/08/24 11:35, MANISH PANDEY wrote:
> 
> 
> On 8/5/2024 11:22 PM, Bart Van Assche wrote:
> > On 8/5/24 10:35 AM, MANISH PANDEY wrote:
> > > In our SoC's we manage Power and Perf balancing by dynamically
> > > changing the IRQs based on the load. Say if we have more load, we
> > > assign UFS IRQs on Large cluster CPUs and if we have less load, we
> > > affine the IRQs on Small cluster CPUs.
> > 
> > I don't think that this is compatible with the command completion code
> > in the block layer core. The blk-mq code is based on the assumption that
> > the association of a completion interrupt with a CPU core does not
> > change. See also the blk_mq_map_queues() function and its callers.
> > 
> IRQ <-> CPU bonded before the start of the operation and it makes sure that
> completion interrupt CPU doesn't change.
> 
> > Is this mechanism even useful? If completion interrupts are always sent
> > to the CPU core that submitted the I/O, no interrupts will be sent to
> > the large cluster if no code that submits I/O is running on that
> > cluster. Sending e.g. all completion interrupts to the large cluster can
> > be achieved by migrating all processes and threads to the large cluster.
> > 
> >> migrating all completion interrupts to the large cluster can
> >> be achieved by migrating all processes and threads to the large
> >> cluster.
> 
> Agree, this can be achieved, but then for this all the process and threads
> have to be migrated to large cluster and this will have power impacts. Hence
> to balance power and perf, it is not preferred way for vendors.

I don't get why rq_affinity=1 is compatible with this case. Isn't this
custom setup a system fully managed by you, meaning you want
rq_affinity=0? What do you lose if you move to rq_affinity=0?

> 
> > > This issue affects UFS MCQ devices the most, which use ESI/MSI IRQs
> > > and have the ESI IRQs distributed across the CQs.
> > > Mostly we use Large cluster CPUs for binding IRQ and CQ and hence
> > > completing more completions on Large cluster which won't be from
> > > same capacity CPU as request may be from S/M clusters.
> > 
> > Please use an approach that is supported by the block layer. I don't
> > think that dynamically changing the IRQ affinity is compatible with the
> > block layer.
> 
> For UFS with MCQ, ESI IRQs are bounded at the time of initialization.
> so basically i would like to use High Performance cluster CPUs to migrate
> few completions from Mid clusters and take the advantage of high capacity
> CPUs. The new change takes away this opportunity from driver.

It doesn't. From what I read, you want to fully customize where your
completions run without any interference from the block layer. So disable
rq_affinity and do what you want? Your description says you don't want the
block layer to interfere with your affinity setup.

> So basically we should be able to use High Performance CPUs like below
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b55..a4a2500c4ef6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
>         if (cpu == rq->mq_ctx->cpu ||
>             (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>              cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
> +            arch_scale_cpu_capacity(cpu) >=
> +            arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>                 return false;
> 
> This way driver can use best possible CPUs for it's use case.
> > 
> > Thanks,
> > 
> > Bart.
> > 


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-05 17:24       ` MANISH PANDEY
@ 2024-08-09  0:47         ` Qais Yousef
  0 siblings, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2024-08-09  0:47 UTC (permalink / raw)
  To: MANISH PANDEY
  Cc: Christian Loehle, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Bart Van Assche, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 08/05/24 22:54, MANISH PANDEY wrote:

> > > If no matching is required, it makes sense to set rq_affinity to 0. When
> > > matching is enabled, we need to rely on per-task iowait boost to help the
> > > requester to run at a bigger CPU, and naturally the completion will follow when
> > > rq_affinity=1. If the requester doesn't need the big perf, but the irq
> > > triggered on a bigger core, I struggle to understand why it is good for
> > > completion to run on bigger core without the requester also being on a similar
> > > bigger core to truly maximize perf.
> > 
> Not all SoCs implement L3 as a shared LLC. There are SoCs where L2 is the LLC
> and is not shared among all CPU clusters. So in this case, if we use
> rq_affinity=0, a request may be completed on a CPU that doesn't share the L2
> cache with the submitter.
> Say in a system CPUs [0-5] share one L2 as LLC and CPUs [6-7] share another;
> then any request from CPU [0-5] whose IRQ is served on CPU [6-7] (or vice
> versa) would be completed on a CPU that doesn't share the cache, which would
> result in low performance.

For these systems rq_affinity=1 is what you want? rq_affinity is not
supposed to be one-size-fits-all; we wouldn't have different rq_affinity
values if one were supposed to work on all systems.


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-08  6:05           ` MANISH PANDEY
  2024-08-09  0:36             ` Qais Yousef
@ 2024-08-11 17:41             ` Dietmar Eggemann
  2024-08-12 18:15               ` Sandeep Dhavale
  1 sibling, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2024-08-11 17:41 UTC (permalink / raw)
  To: MANISH PANDEY, Bart Van Assche, Qais Yousef, Christian Loehle
  Cc: axboe, mingo, peterz, vincent.guittot, linux-block, sudeep.holla,
	Jaegeuk Kim, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

+ linux-kernel@vger.kernel.org

On 08/08/2024 08:05, MANISH PANDEY wrote:
>  
> On 8/5/2024 11:22 PM, Bart Van Assche wrote:
>> On 8/5/24 10:35 AM, MANISH PANDEY wrote:

[...]

>> Please use an approach that is supported by the block layer. I don't
>> think that dynamically changing the IRQ affinity is compatible with the
>> block layer.
> 
> For UFS with MCQ, ESI IRQs are bounded at the time of initialization.
> so basically i would like to use High Performance cluster CPUs to
> migrate few completions from Mid clusters and take the advantage of high
> capacity CPUs. The new change takes away this opportunity from driver.
> So basically we should be able to use High Performance CPUs like below
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b55..a4a2500c4ef6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1164,7 +1164,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
>         if (cpu == rq->mq_ctx->cpu ||
>             (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>              cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
> +            arch_scale_cpu_capacity(cpu) >=
> +            arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>                 return false;
> 
> This way driver can use best possible CPUs for it's use case.

So the issue for you with commit af550e4c9682 seems to be that those
completions don't happen on big CPUs (cpu_capacity = 1024) anymore,
since the condition in  blk_mq_complete_need_ipi() (1):

 if (!QUEUE_FLAG_SAME_FORCE && cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
     cpus_equal_capacity(cpu, rq->mq_ctx->cpu))

is no longer true if 'rq->mq_ctx->cpu != big CPU' so (1) returns true
and blk_mq_complete_request_remote() sends an ipi to 'rq->mq_ctx->cpu'.


I tried to simulate this with a 6 CPUs aarch64 QEMU tri-gear (3
different cpu_capacity values) system:

cat /sys/devices/system/cpu/online
0-5

# cat /sys/devices/system/cpu/cpu*/cpu_capacity
446
446
871
871
1024
1024

# grep -i virtio /proc/interrupts | while read a b; do grep -aH .
/proc/irq/${a%:}/smp_affinity; done
/proc/irq/15/smp_affinity:3f /* block device */
/proc/irq/16/smp_affinity:3f /* network device */

So you set the block device irq affine to the big CPUs (0x30).

# echo 30 > /proc/irq/15/smp_affinity

And with the patch, you send ipi's in blk_mq_complete_request_remote()
in case 'rq->mq_ctx->cpu=[0-3]' whereas w/o the patch or the change to:

 arch_scale_cpu_capacity(cpu) >=
                            arch_scale_cpu_capacity(rq->mq_ctx->cpu) (2)

you would complete the request locally (i.e. on CPU4/5):

gic_handle_irq() -> ... -> handle_irq_event() -> ... -> vm_interrupt()
-> ... -> virtblk_done() (callback) -> blk_mq_complete_request() ->
blk_mq_complete_request_remote(), rq->q->mq_ops->complete(rq)

The patch IMHO was introduced to avoid running locally when 'local =
little CPU'. Since you use system knowledge and set the IRQ affinity
explicitly to big CPUs to run locally on them, maybe (2) is the way to
allow both?


* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-11 17:41             ` Dietmar Eggemann
@ 2024-08-12 18:15               ` Sandeep Dhavale
  2024-08-21 12:29                 ` MANISH PANDEY
  0 siblings, 1 reply; 29+ messages in thread
From: Sandeep Dhavale @ 2024-08-12 18:15 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: MANISH PANDEY, Bart Van Assche, Qais Yousef, Christian Loehle,
	axboe, mingo, peterz, vincent.guittot, linux-block, sudeep.holla,
	Jaegeuk Kim, Christoph Hellwig, kailash, tkjos, bvanassche,
	quic_nitirawa, quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org

Hi Dietmar,
[..]
>
> So the issue for you with commit af550e4c9682 seems to be that those
> completions don't happen on big CPUs (cpu_capacity = 1024) anymore,
> since the condition in  blk_mq_complete_need_ipi() (1):
>
>  if (!QUEUE_FLAG_SAME_FORCE && cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
>      cpus_equal_capacity(cpu, rq->mq_ctx->cpu))
>
> is no longer true if 'rq->mq_ctx->cpu != big CPU' so (1) returns true
> and blk_mq_complete_request_remote() sends an ipi to 'rq->mq_ctx->cpu'.
>
>
> I tried to simulate this with a 6 CPUs aarch64 QEMU tri-gear (3
> different cpu_capacity values) system:
>
> cat /sys/devices/system/cpu/online
> 0-5
>
> # cat /sys/devices/system/cpu/cpu*/cpu_capacity
> 446
> 446
> 871
> 871
> 1024
> 1024
>
> # grep -i virtio /proc/interrupts | while read a b; do grep -aH .
> /proc/irq/${a%:}/smp_affinity; done
> /proc/irq/15/smp_affinity:3f /* block device */
> /proc/irq/16/smp_affinity:3f /* network device */
>
> So you set the block device irq affine to the big CPUs (0x30).
>
> # echo 30 > /proc/irq/15/smp_affinity
>
> And with the patch, you send ipi's in blk_mq_complete_request_remote()
> in case 'rq->mq_ctx->cpu=[0-3]' whereas w/o the patch or the change to:
>
>  arch_scale_cpu_capacity(cpu) >=
>                             arch_scale_cpu_capacity(rq->mq_ctx->cpu) (2)
>
> you would complete the request locally (i.e. on CPU4/5):
>
> gic_handle_irq() -> ... -> handle_irq_event() -> ... -> vm_interrupt()
> -> ... -> virtblk_done() (callback) -> blk_mq_complete_request() ->
> blk_mq_complete_request_remote(), rq->q->mq_ops->complete(rq)
>
> The patch IMHO was introduced to avoid running local when 'local =
> little CPU'. Since you use system knowledge and set IRQ affinity
> explicitly to big CPU's to run local on them, maybe (2) is the way to
> allow both?

Thank you for doing the experiment.
I agree that replacing the cpus_equal_capacity() check with a
greater-than-or-equal check (which is what Qais had in his v1 patch [1])
would allow both.

[1] https://lore.kernel.org/all/20240122224220.1206234-1-qyousef@layalina.io/

Thanks,
Sandeep

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-09  0:23       ` Qais Yousef
@ 2024-08-13 16:20         ` Christian Loehle
  2024-09-01 17:25           ` Qais Yousef
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2024-08-13 16:20 UTC (permalink / raw)
  To: Qais Yousef
  Cc: MANISH PANDEY, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Bart Van Assche, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 8/9/24 01:23, Qais Yousef wrote:
> On 08/05/24 11:18, Christian Loehle wrote:
> 
>>> My understanding of rq_affinity=1 is to match the perf of requester. Given that
>>> the characteristic of HMP system is that power has an equal importance to perf
>>> (I think this now has become true for all systems by the way), saying that the
>>> match in one direction is better than the other is sort of forcing a policy of
>>> perf first which I don't think is a good thing to enforce. We don't have enough
>>> info to decide at this level. And our users care about both.
>>
>> I would argue rq_affinity=1 matches the perf, so that flag should already bias
>> perf in favor of power slightly?
> 
> Not on this type of systems. If perf was the only thing important, just use
> equally big cpus. Balancing perf and power is important on those systems, and
> I don't think we have enough info to decide which decision is best when
> capacities are not the same. Matching the perf level the requesting on makes
> sense when irq_affinity=1.

Well, you could still want an "IO performance always beats power
considerations" policy and still go HMP, because for non-IO work you
sometimes prefer power. But I agree that we don't have enough
information about what the user wants from the system/kernel.

> 
>> Although the actual effect on power probably isn't that significant, given
>> that the (e.g. big) CPU has submitted the IO, is woken up soon, so you could
>> almost ignore a potential idle wakeup and the actual CPU time spent in the block
>> completion is pretty short of course.
>>
>>> If no matching is required, it makes sense to set rq_affinity to 0. When
>>> matching is enabled, we need to rely on per-task iowait boost to help the
>>> requester to run at a bigger CPU, and naturally the completion will follow when
>>> rq_affinity=1. If the requester doesn't need the big perf, but the irq
>>> triggered on a bigger core, I struggle to understand why it is good for
>>> completion to run on bigger core without the requester also being on a similar
>>> bigger core to truly maximize perf.
>>
>> So first of all, per-task iowait boosting has nothing to do with it IMO.
> 
> It has. If the perf is not good because the requester is running on little
> core, the requester need to move up to ensure the overall IO perf is better.

See below, but also: "the requester need to move up to ensure the
overall IO perf is better" is just not true. With asynchronous IO
submission done right, the submission runtime isn't critical to the IO
throughput, so submission should run in the most power-efficient way.
This can be observed e.g. with any io_uring fio workload with significant
iodepth (and possibly multi-threading).
Completion may be a different story, depending on the device stack: if
we're dealing with !MCQ, then the completion path (irq + block layer
completion) is absolutely critical.
For any mmc / ufs<4.0 system the performance difference between
fio --name=little --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$LITTLE_CPUS
and
fio --name=big --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$BIG_CPUS
is (usually) due only to the completion path, and setting the irq
affinity of /dev/sda to $BIG_CPUS makes the difference disappear
(assuming rq_affinity=0 and a shared LLC).
Running the submission on the little CPUs is then usually the most
power-efficient option.

> 
>> Plenty of IO workloads build up utilization perfectly fine.
> 
> These ones have no problems, no? They should migrate to big core and the
> completion will follow them when they move.

So if I understood Manish correctly, the only reason they want the
completion to run on a bigger CPU than the submission is that the
submission is already saturating its CPU, so submission-side utilization
is no issue whatsoever. They don't want to run the submission on big,
though, because of power considerations.

> 
>> I wouldn't consider the setup: requester little perf, irq+completion big perf
>> invalid necessarily, it does decrease IO latency for the application.
> 
> I didn't say invalid. But it is not something we can guess automatically when
> irq_affinity=1. We don't have enough info to judge. The only info we have is
> that the requester that originated the request is running at a different perf
> level (whether higher or lower), so we follow it.
>
Anyway, Manish's problem should be solved by rq_affinity=0 in that case
(with the irq affinity set to the big CPUs, the completion will then run
on the irq CPU), and "rq_affinity=1 <=> equal-capacity CPU" is the
correct interpretation. Is that more or less agreed upon now?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-12 18:15               ` Sandeep Dhavale
@ 2024-08-21 12:29                 ` MANISH PANDEY
  2024-08-21 17:22                   ` Bart Van Assche
  2024-09-01 17:13                   ` Qais Yousef
  0 siblings, 2 replies; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-21 12:29 UTC (permalink / raw)
  To: Sandeep Dhavale, Dietmar Eggemann
  Cc: Bart Van Assche, Qais Yousef, Christian Loehle, axboe, mingo,
	peterz, vincent.guittot, linux-block, sudeep.holla, Jaegeuk Kim,
	Christoph Hellwig, kailash, tkjos, bvanassche, quic_nitirawa,
	quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org

Hi all,

We agree with the point that the right rq_affinity setting can vary from
target to target.

But please also consider the use cases below:

1. Storage devices with a single hardware queue (like UFS 2.x/3.x):
in an HMP system with a shared LLC, not all CPUs will be active and
online. Hence, for a large amount (say ~1 GB) of random IO transactions,
if the requester CPU is from the smaller CPU group, then due to
capacity-based grouping the large-cluster CPUs stay mostly unused/idle,
since the completion also happens on a same-capacity CPU. As a result,
only the small/mid-capacity CPUs end up both submitting and completing
the requests. We could instead have completed the requests on the
large-capacity CPUs and made better use of them.

But to move IO requests from the S/M cluster CPUs to the L cluster, the
scheduler would need to wait until a utilization threshold is hit, and by
that time a performance-sensitive application has already spent part of
its runtime and reports lower overall performance.


On 08/02/24 10:03, Christian Loehle wrote:
 > So I'm assuming you're seeing something like the following:
 > Some CPU(s) (call them S) are submitting IO, hardirq triggers on
 > S.
 > Before the patch the completion softirq could run on a !S CPU,
 > now it runs on S. Am I then correct in assuming your workload
 > is CPU-bound on S? Would you share some details about the
 > workload, too?
 >
 > What's the capacity of CPU(s) S then?


Yes, for a few SoCs we follow the same approach.
Take an SoC with 3 clusters (S/M/L): if the M CPU(s) are submitting IO
requests (a large data transfer), then, since the LLC is shared, we can
let the L CPU(s) complete the requests via softirq/hardirq. This avoids
contention between the submission and completion paths.


Again, consider an SoC with 2 clusters (4 CPUs of small capacity / 2
CPUs of mid capacity).
In such a case, if the M CPUs are used for both IO submission and
completion, those CPUs face a heavy workload and performance suffers.
Without the new patch, we could have moved some IO completions to the
S cluster CPUs.

 > If no matching is required, it makes sense to set rq_affinity to 0.

 > I don't get why irq_affinity=1 is compatible with this case? Isn't
 > this custom setup a fully managed system by you, meaning you want
 > rq_affinity=0? What do you lose if you move to rq_affinity=0?

Actually, rq_affinity=0 would help for a few SoCs with MCQ, like UFS
4.x, but even that isn't a generic solution for us, since it doesn't
account for SoCs that don't share an LLC and would therefore have a
significant performance issue. Also, since the change has been picked up
in all the kernel branches, it would be very difficult to experiment on
older SoCs and find the best solution for each target.

It won't work for many SoCs like the 2-cluster example above either,
because rq_affinity=0 completes the request in hardirq context, which
may not suit every SoC.
Also, not all SoCs are arm-based or share an LLC.



2. Storage devices with MCQ (like UFS 4.x / NVMe) use ESI/MSI
interrupts, so we have the opportunity to bind the ESI IRQ associated
with a CQ.


On 08/05/24 10:17, Bart Van Assche wrote:
 > On 8/4/24 7:07 PM, Qais Yousef wrote:
 > > irqbalancers usually move the interrupts, and I'm not sure we can
 > > make an assumption about the reason an interrupt is triggering on
 > > different capacity CPU.
 > User space software can't modify the affinity of managed interrupts.
 > From include/linux/irq.h:

 > > >True. But this is special case and was introduced for isolated
 > > > CPUs. I don't think drivers can request this themselves

There are drivers that can manage the CPU affinity of their IRQs
via irq_set_affinity_hint(), based on the use case.
In these SoCs the LLC is shared between the M and L clusters, so if IO
is submitted from M-capacity CPU(s) and the request is completed on
L-capacity CPU(s), the completion can still be done on the L cluster
without an IPI even though the capacities differ. With the new change,
we lose that advantage of the shared LLC.

Proposed solution:
We can have a backward-compatible solution: a new control flag could
gate the new capacity check so that the existing behavior is preserved.

How about introducing a new rq_affinity value (maybe rq_affinity=3)
that controls the cpus_equal_capacity() check via a new flag
QUEUE_FLAG_CPU_CAPACITY:


if (cpu == rq->mq_ctx->cpu ||
	(!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
	  cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
+	  (test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags) ||
	   cpus_equal_capacity(cpu, rq->mq_ctx->cpu))))
		return false;



Could you please consider raising similar change, if this seems fine for 
all.


Regards
Manish

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-21 12:29                 ` MANISH PANDEY
@ 2024-08-21 17:22                   ` Bart Van Assche
  2024-08-22 10:46                     ` MANISH PANDEY
  2024-09-01 17:13                   ` Qais Yousef
  1 sibling, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2024-08-21 17:22 UTC (permalink / raw)
  To: MANISH PANDEY, Sandeep Dhavale, Dietmar Eggemann
  Cc: Qais Yousef, Christian Loehle, axboe, mingo, peterz,
	vincent.guittot, linux-block, sudeep.holla, Jaegeuk Kim,
	Christoph Hellwig, kailash, tkjos, bvanassche, quic_nitirawa,
	quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org

On 8/21/24 5:29 AM, MANISH PANDEY wrote:
> How about introducing a new rq_affinity ( may be rq_affinity = 3) for 
> using cpus_equal_capacity() using new flag QUEUE_FLAG_SAME_CAPACITY.
> 
> if (cpu == rq->mq_ctx->cpu ||
>      (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>        cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> +      (test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags))
>         && cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
>          return false;
> 
> Could you please consider raising similar change, if this seems fine for 
> all.

I'm not sure that a change like the above would be acceptable.

What is the performance impact of the above change? Redirecting
completion interrupts from a slow core to a fast core causes additional
cache misses if the I/O was submitted from a slow core. Are there
perhaps use cases for which the above change slows down I/O?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-21 17:22                   ` Bart Van Assche
@ 2024-08-22 10:46                     ` MANISH PANDEY
  2024-08-22 14:24                       ` Bart Van Assche
  0 siblings, 1 reply; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-22 10:46 UTC (permalink / raw)
  To: Bart Van Assche, Sandeep Dhavale, Dietmar Eggemann
  Cc: Qais Yousef, Christian Loehle, axboe, mingo, peterz,
	vincent.guittot, linux-block, sudeep.holla, Jaegeuk Kim,
	Christoph Hellwig, kailash, tkjos, bvanassche, quic_nitirawa,
	quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org



On 8/21/2024 10:52 PM, Bart Van Assche wrote:
> On 8/21/24 5:29 AM, MANISH PANDEY wrote:
>> How about introducing a new rq_affinity ( may be rq_affinity = 3) for 
>> using cpus_equal_capacity() using new flag QUEUE_FLAG_SAME_CAPACITY.
>>
>> if (cpu == rq->mq_ctx->cpu ||
>>      (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>>        cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
>> +      (test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags))
>>         && cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
>>          return false;
>>
>> Could you please consider raising similar change, if this seems fine 
>> for all.
> 
> I'm not sure that a change like the above would be acceptable.
> 
> What is the performance impact of the above change? Redirecting
> completion interrupts from a slow core to a fast core causes additional
> cache misses if the I/O was submitted from a slow core. Are there
> perhaps use cases for which the above change slows down I/O?
> 
> Thanks,
> 
> Bart.

Hi Bart,


 > What is the performance impact of the above change?
No impact at all: we are not changing the logic, we are just proposing
an on/off switch that gives users the flexibility to choose what's best
for their system.

The intention behind proposing a new flag is that we shouldn't break
backward compatibility, as the change is also included in stable
release branches.

	/* same CPU or cache domain and capacity?  Complete locally */
	if (cpu == rq->mq_ctx->cpu ||
	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
	     cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
+	     (test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags) ||
	      cpus_equal_capacity(cpu, rq->mq_ctx->cpu))))
		return false;

So QUEUE_FLAG_CPU_CAPACITY would basically act as the on/off switch:
  - with rq_affinity=1 it will be clear
  - with rq_affinity=3 it will be set


Regards
Manish


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-22 10:46                     ` MANISH PANDEY
@ 2024-08-22 14:24                       ` Bart Van Assche
  2024-08-23  7:57                         ` MANISH PANDEY
  0 siblings, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2024-08-22 14:24 UTC (permalink / raw)
  To: MANISH PANDEY, Sandeep Dhavale, Dietmar Eggemann
  Cc: Qais Yousef, Christian Loehle, axboe, mingo, peterz,
	vincent.guittot, linux-block, sudeep.holla, Jaegeuk Kim,
	Christoph Hellwig, kailash, tkjos, bvanassche, quic_nitirawa,
	quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org

On 8/22/24 3:46 AM, MANISH PANDEY wrote:
> On 8/21/2024 10:52 PM, Bart Van Assche wrote:
> > What is the performance impact of the above change?
 >
> No impact at all
Is this a good summary of this email thread?
* The first email in this thread reports an important performance
   regression.
* In your previous email there is a candidate fix for the performance
   regression.
* Above I read that the proposed fix has no performance impact at all
   on any setup.

Is this a good summary of this email thread? If so, do you agree that
this must be confusing everyone who is following this email thread?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-22 14:24                       ` Bart Van Assche
@ 2024-08-23  7:57                         ` MANISH PANDEY
  2024-08-23 12:03                           ` Christian Loehle
  0 siblings, 1 reply; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-23  7:57 UTC (permalink / raw)
  To: Bart Van Assche, Sandeep Dhavale, Dietmar Eggemann
  Cc: Qais Yousef, Christian Loehle, axboe, mingo, peterz,
	vincent.guittot, linux-block, sudeep.holla, Jaegeuk Kim,
	Christoph Hellwig, kailash, tkjos, bvanassche, quic_nitirawa,
	quic_cang, quic_rampraka, quic_narepall,
	linux-kernel@vger.kernel.org



On 8/22/2024 7:54 PM, Bart Van Assche wrote:
> On 8/22/24 3:46 AM, MANISH PANDEY wrote:
>> On 8/21/2024 10:52 PM, Bart Van Assche wrote:
>> > What is the performance impact of the above change?
>  >
>> No impact at all
> Is this a good summary of this email thread?
> * The first email in this thread reports an important performance
>    regression.
> * In your previous email there is a candidate fix for the performance
>    regression.
> * Above I read that the proposed fix has no performance impact at all
>    on any setup.
> 
> Is this a good summary of this email thread? If so, do you agree that
> this must be confusing everyone who is following this email thread?
> 
> Thanks,
> 
> Bart.

Hi Bart,

The performance impact of adding the CPU capacity check
(https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io/)
...[1]
is already mentioned in the first email.

But let me summarize it again:

We can no longer take advantage of affining the IRQ to CPU(s)/clusters
of a different capacity and completing the request on higher-cluster
CPU(s), even though the LLC is shared between these clusters, because
the block completion now happens in softirq context whenever the
requester and completion clusters differ.

Below is the performance impact with the current patch [1]

1. For MCQ capable UFS host (paired with UFS 4.x), we are observing ~20% 
random R/W performance drop.

2. For single doorbell ufs hosts (paired with UFS 2.x/ UFS 3.x), we are 
observing ~7-10% random R/W performance drop.


Also in previous emails on this thread, below were few suggestions to 
add check for equal or greater capacity cpus by @Christian Loehle
https://lore.kernel.org/all/3feb5226-7872-432b-9781-29903979d34a@arm.com/

 > From: Christian Loehle @ 2024-08-02  9:03 UTC (permalink / raw)
 > [......]
 > diff --git a/block/blk-mq.c b/block/blk-mq.c
 > index e3c3c0c21b55..a4a2500c4ef6 100644
 > --- a/block/blk-mq.c
 > +++ b/block/blk-mq.c
 > @@ -1164,7 +1164,7 @@ static inline bool
 > blk_mq_complete_need_ipi(struct request *rq)
 >        if (cpu == rq->mq_ctx->cpu ||
 >            (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
 >             cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
 > -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
 > +            arch_scale_cpu_capacity(cpu) >=
 >              arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
 >                return false;
 >
 >       /* don't try to IPI to an offline CPU */


There can be SoCs with different CPU cluster configurations, and for
optimal IO load balancing, or to avoid contention between the submission
and completion paths, we may need to complete IO requests of
large-capacity CPU(s) on small-cluster CPUs. So the proposed solution
above may not suffice for all use cases.

Hence, with the proposal below we introduce a new queue flag,
QUEUE_FLAG_CPU_CAPACITY. It lets users who benefit from the CPU capacity
check [1] keep the fix as-is, while a user (including us) who wants to
bypass the capacity check [1] can set rq_affinity to 3 and recover the
performance drop mentioned in the first email. This gives the user the
flexibility to choose what's best for their system.

if (cpu == rq->mq_ctx->cpu ||
	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
	     cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
+	     (test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags) ||
	      cpus_equal_capacity(cpu, rq->mq_ctx->cpu))))
		return false;


Regards
Manish



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-23  7:57                         ` MANISH PANDEY
@ 2024-08-23 12:03                           ` Christian Loehle
  2024-08-23 13:49                             ` MANISH PANDEY
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Loehle @ 2024-08-23 12:03 UTC (permalink / raw)
  To: MANISH PANDEY, Bart Van Assche, Sandeep Dhavale, Dietmar Eggemann
  Cc: Qais Yousef, axboe, mingo, peterz, vincent.guittot, linux-block,
	sudeep.holla, Jaegeuk Kim, Christoph Hellwig, kailash, tkjos,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 8/23/24 08:57, MANISH PANDEY wrote:
> 
> 
> On 8/22/2024 7:54 PM, Bart Van Assche wrote:
>> On 8/22/24 3:46 AM, MANISH PANDEY wrote:
>>> On 8/21/2024 10:52 PM, Bart Van Assche wrote:
>>> > What is the performance impact of the above change?
>>  >
>>> No impact at all
>> Is this a good summary of this email thread?
>> * The first email in this thread reports an important performance
>>    regression.
>> * In your previous email there is a candidate fix for the performance
>>    regression.
>> * Above I read that the proposed fix has no performance impact at all
>>    on any setup.
>>
>> Is this a good summary of this email thread? If so, do you agree that
>> this must be confusing everyone who is following this email thread?
>>
>> Thanks,
>>
>> Bart.
> 
> Hi Bart,
> 
> Performance impact due to addition of cpu capacity check (https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io/) ...[1]
> is already mentioned in the first email.
> 
> But let me summarize it again:
> 
> We are not able to get advantage of affining the IRQ in different capacity CPU(s)/clusters and complete the request in higher cluster cpu(s), even though the LLC is shared between these clusters as it is causing the block completion to happen on SOFTIRQ context, if requester and completion clusters are different.
> 
> Below is the performance impact with the current patch [1]
> 
> 1. For MCQ capable UFS host (paired with UFS 4.x), we are observing ~20% random R/W performance drop.
> 
> 2. For single doorbell ufs hosts (paired with UFS 2.x/ UFS 3.x), we are observing ~7-10% random R/W performance drop.
> 

If you do decide to write your proposal up as a patch, a description of the
topology would be helpful as well.

> 
> Also in previous emails on this thread, below were few suggestions to add check for equal or greater capacity cpus by @Christian Loehle
> https://lore.kernel.org/all/3feb5226-7872-432b-9781-29903979d34a@arm.com/
> 
>> From: Christian Loehle @ 2024-08-02  9:03 UTC (permalink / raw)
>> [......]
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e3c3c0c21b55..a4a2500c4ef6 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1164,7 +1164,7 @@ static inline bool
>> blk_mq_complete_need_ipi(struct request *rq)
>>        if (cpu == rq->mq_ctx->cpu ||
>>            (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>>             cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
>> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
>> +            arch_scale_cpu_capacity(cpu) >=
>>              arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>>                return false;
>>
>>       /* don't try to IPI to an offline CPU */
> 
> 
> There can be SoCs with different CPU cluster configurations and to have optimal IO load balancing or to avoid contention b/w submission path and completion path, we may need to complete IO request of large capacity CPU(s) on small cluster cpus. So the above propose solution may not be suffice to all the use cases.
> 
> Hence with below proposed solution, we are trying to propose a new rq flag QUEUE_FLAG_CPU_CAPACITY. The proposed solution will provide us a way such that users who are benefited with CPU capacity check [1] would be able to use the fix as it is, and if a user (including us) want to bypass cpu capacity fix [1], they can set rq_affinity to 3 and would be able to retain performance drop as mentioned in first email. This would give flexibility to user to choose what's the best for their system.
>

FWIW I'd agree with introducing a new queue_flag that behaves like
QUEUE_FLAG_SAME_COMP before commit af550e4c9682 ("block/blk-mq: Don't complete locally if capacities are different").
Equal capacity makes sense as the default behavior for
QUEUE_FLAG_SAME_COMP, but is limiting, there might be just one
CPU of that capacity and that might be fully utilized by submission
(and related work), so completing locally makes sense.

So QUEUE_FLAG_SAME_COMP I'd leave as-is and introduce
QUEUE_FLAG_SAME_LLC that actually just checks LLC.

Regards,
Christian


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-23 12:03                           ` Christian Loehle
@ 2024-08-23 13:49                             ` MANISH PANDEY
  2024-08-23 14:12                               ` Bart Van Assche
  0 siblings, 1 reply; 29+ messages in thread
From: MANISH PANDEY @ 2024-08-23 13:49 UTC (permalink / raw)
  To: Christian Loehle, Bart Van Assche, Sandeep Dhavale,
	Dietmar Eggemann
  Cc: Qais Yousef, axboe, mingo, peterz, vincent.guittot, linux-block,
	sudeep.holla, Jaegeuk Kim, Christoph Hellwig, kailash, tkjos,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org



On 8/23/2024 5:33 PM, Christian Loehle wrote:
> On 8/23/24 08:57, MANISH PANDEY wrote:
>>
>>
>> On 8/22/2024 7:54 PM, Bart Van Assche wrote:
>>> On 8/22/24 3:46 AM, MANISH PANDEY wrote:
>>>> On 8/21/2024 10:52 PM, Bart Van Assche wrote:
>>>>> What is the performance impact of the above change?
>>>   >
>>>> No impact at all
>>> Is this a good summary of this email thread?
>>> * The first email in this thread reports an important performance
>>>     regression.
>>> * In your previous email there is a candidate fix for the performance
>>>     regression.
>>> * Above I read that the proposed fix has no performance impact at all
>>>     on any setup.
>>>
>>> Is this a good summary of this email thread? If so, do you agree that
>>> this must be confusing everyone who is following this email thread?
>>>
>>> Thanks,
>>>
>>> Bart.
>>
>> Hi Bart,
>>
>> Performance impact due to addition of cpu capacity check (https://lore.kernel.org/all/20240223155749.2958009-3-qyousef@layalina.io/) ...[1]
>> is already mentioned in the first email.
>>
>> But let me summarize it again:
>>
>> We are not able to get advantage of affining the IRQ in different capacity CPU(s)/clusters and complete the request in higher cluster cpu(s), even though the LLC is shared between these clusters as it is causing the block completion to happen on SOFTIRQ context, if requester and completion clusters are different.
>>
>> Below is the performance impact with the current patch [1]
>>
>> 1. For MCQ capable UFS host (paired with UFS 4.x), we are observing ~20% random R/W performance drop.
>>
>> 2. For single doorbell ufs hosts (paired with UFS 2.x/ UFS 3.x), we are observing ~7-10% random R/W performance drop.
>>
> 
> If you do decide to write your proposal up as a patch, a description of the
> topology would be helpful as well.
> 
Thanks, Christian, for the meaningful discussion and suggestions.
We will include a description of the topology in the commit message of
the patch.

>>
>> Also in previous emails on this thread, below were few suggestions to add check for equal or greater capacity cpus by @Christian Loehle
>> https://lore.kernel.org/all/3feb5226-7872-432b-9781-29903979d34a@arm.com/
>>
>>> From: Christian Loehle @ 2024-08-02  9:03 UTC (permalink / raw)
>>> [......]
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index e3c3c0c21b55..a4a2500c4ef6 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1164,7 +1164,7 @@ static inline bool
>>> blk_mq_complete_need_ipi(struct request *rq)
>>>          if (cpu == rq->mq_ctx->cpu ||
>>>              (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
>>>               cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
>>> -            cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
>>> +            arch_scale_cpu_capacity(cpu) >=
>>>                arch_scale_cpu_capacity(rq->mq_ctx->cpu)))
>>>                  return false;
>>>
>>>         /* don't try to IPI to an offline CPU */
>>
>>
>> There can be SoCs with different CPU cluster configurations and to have optimal IO load balancing or to avoid contention b/w submission path and completion path, we may need to complete IO request of large capacity CPU(s) on small cluster cpus. So the above propose solution may not be suffice to all the use cases.
>>
>> Hence with below proposed solution, we are trying to propose a new rq flag QUEUE_FLAG_CPU_CAPACITY. The proposed solution will provide us a way such that users who are benefited with CPU capacity check [1] would be able to use the fix as it is, and if a user (including us) want to bypass cpu capacity fix [1], they can set rq_affinity to 3 and would be able to retain performance drop as mentioned in first email. This would give flexibility to user to choose what's the best for their system.
>>
> 
> FWIW I'd agree with introducing a new queue_flag that behaves like
> QUEUE_FLAG_SAME_COMP before commit af550e4c9682 ("block/blk-mq: Don't complete locally if capacities are different").
> Equal capacity makes sense as the default behavior for
> QUEUE_FLAG_SAME_COMP, but is limiting, there might be just one
> CPU of that capacity and that might be fully utilized by submission
> (and related work), so completing locally makes sense.
> 
> So QUEUE_FLAG_SAME_COMP I'd leave as-is and introduce
> QUEUE_FLAG_SAME_LLC that actually just checks LLC.
> 
> Regards,
> Christian
> 

Hi Christian,
While making the patch, I noticed that queue_flags is of type unsigned
long and we have already reached 32 flags.

So we need to find some other way.
Please let us know if you have suggestions for a solution that can
provide the flexibility of running with and without the CPU capacity
check.


Regards
Manish


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-23 13:49                             ` MANISH PANDEY
@ 2024-08-23 14:12                               ` Bart Van Assche
  2024-08-26 17:32                                 ` Christian Loehle
  0 siblings, 1 reply; 29+ messages in thread
From: Bart Van Assche @ 2024-08-23 14:12 UTC (permalink / raw)
  To: MANISH PANDEY, Christian Loehle, Sandeep Dhavale,
	Dietmar Eggemann
  Cc: Qais Yousef, axboe, mingo, peterz, vincent.guittot, linux-block,
	sudeep.holla, Jaegeuk Kim, Christoph Hellwig, kailash, tkjos,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 8/23/24 6:49 AM, MANISH PANDEY wrote:
> while making the patch, i figured out that queue_flags is unsigned long 
> type and we already reached up to 32 flags as of now.

Really? In Linus' master branch I see 13 queue flags. This is less than
half of 32. From include/linux/blkdev.h:

enum {
	QUEUE_FLAG_DYING,		/* queue being torn down */
	QUEUE_FLAG_NOMERGES,		/* disable merge attempts */
	QUEUE_FLAG_SAME_COMP,		/* complete on same CPU-group */
	QUEUE_FLAG_FAIL_IO,		/* fake timeout */
	QUEUE_FLAG_NOXMERGES,		/* No extended merges */
	QUEUE_FLAG_SAME_FORCE,		/* force complete on same CPU */
	QUEUE_FLAG_INIT_DONE,		/* queue is initialized */
	QUEUE_FLAG_STATS,		/* track IO start and completion times */
	QUEUE_FLAG_REGISTERED,		/* queue has been registered to a disk */
	QUEUE_FLAG_QUIESCED,		/* queue has been quiesced */
	QUEUE_FLAG_RQ_ALLOC_TIME,	/* record rq->alloc_time_ns */
	QUEUE_FLAG_HCTX_ACTIVE,		/* at least one blk-mq hctx is active */
	QUEUE_FLAG_SQ_SCHED,		/* single queue style io dispatch */
	QUEUE_FLAG_MAX
};

Bart.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-23 14:12                               ` Bart Van Assche
@ 2024-08-26 17:32                                 ` Christian Loehle
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Loehle @ 2024-08-26 17:32 UTC (permalink / raw)
  To: Bart Van Assche, MANISH PANDEY, Sandeep Dhavale, Dietmar Eggemann
  Cc: Qais Yousef, axboe, mingo, peterz, vincent.guittot, linux-block,
	sudeep.holla, Jaegeuk Kim, Christoph Hellwig, kailash, tkjos,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 8/23/24 15:12, Bart Van Assche wrote:
> On 8/23/24 6:49 AM, MANISH PANDEY wrote:
>> while making the patch, i figured out that queue_flags is unsigned long type and we already reached up to 32 flags as of now.
> 
> Really? In Linus' master branch I see 13 queue flags. This is less than
> half of 32. From include/linux/blkdev.h:

@Manish:
See recent (6.11) cleanup
https://lore.kernel.org/linux-block/20240719112912.3830443-1-john.g.garry@oracle.com/

> 
> enum {
>     QUEUE_FLAG_DYING,        /* queue being torn down */
>     QUEUE_FLAG_NOMERGES,        /* disable merge attempts */
>     QUEUE_FLAG_SAME_COMP,        /* complete on same CPU-group */
>     QUEUE_FLAG_FAIL_IO,        /* fake timeout */
>     QUEUE_FLAG_NOXMERGES,        /* No extended merges */
>     QUEUE_FLAG_SAME_FORCE,        /* force complete on same CPU */
>     QUEUE_FLAG_INIT_DONE,        /* queue is initialized */
>     QUEUE_FLAG_STATS,        /* track IO start and completion times */
>     QUEUE_FLAG_REGISTERED,        /* queue has been registered to a disk */
>     QUEUE_FLAG_QUIESCED,        /* queue has been quiesced */
>     QUEUE_FLAG_RQ_ALLOC_TIME,    /* record rq->alloc_time_ns */
>     QUEUE_FLAG_HCTX_ACTIVE,        /* at least one blk-mq hctx is active */
>     QUEUE_FLAG_SQ_SCHED,        /* single queue style io dispatch */
>     QUEUE_FLAG_MAX
> };
> 
> Bart.
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-21 12:29                 ` MANISH PANDEY
  2024-08-21 17:22                   ` Bart Van Assche
@ 2024-09-01 17:13                   ` Qais Yousef
  1 sibling, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2024-09-01 17:13 UTC (permalink / raw)
  To: MANISH PANDEY
  Cc: Sandeep Dhavale, Dietmar Eggemann, Bart Van Assche,
	Christian Loehle, axboe, mingo, peterz, vincent.guittot,
	linux-block, sudeep.holla, Jaegeuk Kim, Christoph Hellwig,
	kailash, tkjos, bvanassche, quic_nitirawa, quic_cang,
	quic_rampraka, quic_narepall, linux-kernel@vger.kernel.org

On 08/21/24 17:59, MANISH PANDEY wrote:
> Hi all,
> 
> We agree with the points like rq_affinity can vary from target to target.
> 
> But also please consider below use cases
> 
> 1. Storage devices with a single hardware queue (like UFS 2.x/3.x):
> in an HMP system with a shared LLC, not all the CPUs will be active and online.

Is this mainline behavior? Hotplug is not considered a normal operation, and
a CPU not being active is not something I am aware of in mainline behavior
aside from hotplug.

I think you're referring to a platform specific out-of-tree feature that is not
part of mainline linux.

> Hence for a large amount (say ~1GB) of random IO transactions, if the
> requester CPU is from the smaller CPU group, then due to capacity-based
> grouping the large-cluster CPUs would be mostly unused/idle, as the
> completion would also happen on a same-capacity CPU. Because of this, only
> the small/mid capacity CPUs would have to submit and complete the requests.
> We could instead have completed the requests on the large-capacity CPUs and
> better utilized their power.

Sorry if I missed your reply elsewhere. But if you don't want the
rq_affinity=1 behavior and do your own custom routing, why isn't rq_affinity=0
the right solution for custom routing behavior?

> 
> But to move IO requests from the S/M cluster CPUs to the L cluster, the
> scheduler would need to wait until a threshold of IO hits, and by that time
> a performance application would have spent some of its runtime and would
> report lower overall performance.

This is not an IO-specific problem. Tasks that need more performance will have
to wait for utilization to grow. I don't see this as a problem related
to the rq_affinity=1 behavior, but a generic scheduler behavior issue that is
outside of the block layer.

> 
> 
> On 08/02/24 10:03, Christian Loehle wrote:
> > So I'm assuming you're seeing something like the following:
> > Some CPU(s) (call them S) are submitting IO, hardirq triggers on
> > S.
> > Before the patch the completion softirq could run on a !S CPU,
> > now it runs on S. Am I then correct in assuming your workload
> > is CPU-bound on S? Would you share some details about the
> > workload, too?
> >
> > What's the capacity of CPU(s) S then?
> 
> 
> Yes, for a few SoCs we follow the same approach.
> Say an SoC with 3 clusters (S/M/L): if the M CPU(s) are submitting IO requests
> (a large data transfer), then in this case, since the LLC is shared, we can
> allow the L CPU(s) to complete the requests via softirq/hardirq. This makes
> sure to avoid contention between the submission and completion paths.

So in this case you want rq_affinity=0?

> 
> 
> Again consider an SoC with 2 clusters (4 CPUs of small capacity / 2 CPUs
> of mid capacity).
> In such a case, if the M CPUs are used for IO submission and completion, then
> these CPUs would face a heavy workload and performance would be impacted.
> If the new patch were not there, we could have moved a few IO completions
> to the S-cluster CPUs.

And on this different system you want rq_affinity=1?

I still don't see why you must use rq_affinity=1 for your custom routing,
which AFAICT is outside of the supported realm. We have no clue whether this is
contention or not. It seems you have larger logic that does custom software in
many layers, and I don't think what you're doing is part of what's supported.

Could you help contribute patches to detect contention problems and to
differentiate between the two scenarios, instead of hacking rq_affinity=1?

> 
> > If no matching is required, it makes sense to set rq_affinity to 0.
> 
> > I don't get why irq_affinity=1 is compatible with this case? Isn't
> > this custom setup a fully managed system by you, meaning you want
> > rq_affinity=0? What do you lose if you move to rq_affinity=0?
> 
> Actually rq_affinity=0 would help for a few SoCs having MCQ, like UFS 4.x,
> but even this won't be a generic solution for us, as it doesn't consider an
> SoC that doesn't share the LLC and would thus have a significant performance
> issue. Also, since the change has been picked up in all the kernel branches,
> it would be very difficult to experiment on older SoCs and get the best
> solution for each target.

This is a userspace knob. I don't think the expectation is to ship one
configuration for all SoCs...

> 
> This won't work for many SoCs, like the 2-cluster example above, as
> rq_affinity=0 would then complete the request in hardirq context, which may
> not suit all SoCs. Also, not all SoCs are ARM-based and share the LLC.

Why can't you modify the value based on the SoC? The only thing I am gathering
from your arguments is that you can't modify rq_affinity to suit the setup the
SoC requires, but I don't get why.

> 
> 
> 
> 2. Storage devices with MCQ (like UFS 4.x / NVMe) use ESI/MSI interrupts,
> hence we would have the opportunity to bind the ESI IRQ associated with
> a CQ.
> 
> 
> On 08/05/24 10:17, Bart Van Assche wrote:
> > On 8/4/24 7:07 PM, Qais Yousef wrote:
> > > irqbalancers usually move the interrupts, and I'm not sure we can
> > > make an assumption about the reason an interrupt is triggering on
> > > different capacity CPU.
> > User space software can't modify the affinity of managed interrupts.
> > From include/linux/irq.h:
> 
> > > > True. But this is a special case and was introduced for isolated
> > > > CPUs. I don't think drivers can request this themselves.
> 
> There are drivers which can manage the CPU affinity for IRQs
> using irq_set_affinity_hint() based on the use case.

Are these in-tree drivers? I'd love to learn how the use case is detected
generically if this is in-tree.

> Since in these SoCs the LLC is shared (M and L clusters), IO can be
> submitted from M-capacity CPU(s) and the request completed on L-capacity
> CPU(s). Though the capacities of M and L are different, since the LLC is
> shared, completion can be done on the L cluster without the need for an
> IPI. With the new change, we may not get the advantage of the shared LLC.

How come? The LLC is shared on mid and little. It wouldn't make a difference
which CPU handles it to take advantage of LLC. What did I miss?

> 
> Proposed solution:
> We can have a solution which is backward compatible: a new control flag
> (QUEUE_FLAG_SAME_CAPACITY_FORCE) could be provided for this new change to
> maintain backward compatibility.
> 
> How about introducing a new rq_affinity value (maybe rq_affinity = 3) that
> gates cpus_equal_capacity() behind a new flag QUEUE_FLAG_SAME_CAPACITY.
> 
> 
> if (cpu == rq->mq_ctx->cpu ||
> 	(!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
> 	 cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
> +	 test_bit(QUEUE_FLAG_CPU_CAPACITY, &rq->q->queue_flags) &&
> 	 cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
> 		return false;
> 
> 
> 
> Could you please consider raising a similar change, if this seems fine
> to everyone.

This is a hack to work around the fact that, for some reason I am yet unable
to understand, you do not want to (or cannot) change the rq_affinity setup to
suit the needs of the platform.

AFAICT we don't have any knowledge about contentions to automatically decide
the best way to route handling completion. It'd be a welcome addition if you
have a solution that can teach the system to handle this automatically. But in
your case it seems you have a lot of custom setup that is not part of usual
upstream handling and would like just to keep backward compatibility for
reasons I am unable to make sense of yet :(

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
  2024-08-13 16:20         ` Christian Loehle
@ 2024-09-01 17:25           ` Qais Yousef
  0 siblings, 0 replies; 29+ messages in thread
From: Qais Yousef @ 2024-09-01 17:25 UTC (permalink / raw)
  To: Christian Loehle
  Cc: MANISH PANDEY, axboe, mingo, peterz, vincent.guittot,
	dietmar.eggemann, linux-block, sudeep.holla, Jaegeuk Kim,
	Bart Van Assche, Christoph Hellwig, kailash, tkjos, dhavale,
	bvanassche, quic_nitirawa, quic_cang, quic_rampraka,
	quic_narepall, linux-kernel@vger.kernel.org

On 08/13/24 17:20, Christian Loehle wrote:
> On 8/9/24 01:23, Qais Yousef wrote:
> > On 08/05/24 11:18, Christian Loehle wrote:
> > 
> >>> My understanding of rq_affinity=1 is to match the perf of requester. Given that
> >>> the characteristic of HMP system is that power has an equal importance to perf
> >>> (I think this now has become true for all systems by the way), saying that the
> >>> match in one direction is better than the other is sort of forcing a policy of
> >>> perf first which I don't think is a good thing to enforce. We don't have enough
> >>> info to decide at this level. And our users care about both.
> >>
> >> I would argue rq_affinity=1 matches the perf, so that flag should already bias
> >> perf in favor of power slightly?
> > 
> > Not on this type of systems. If perf was the only thing important, just use
> > equally big cpus. Balancing perf and power is important on those systems, and
> > I don't think we have enough info to decide which decision is best when
> > capacities are not the same. Matching the perf level of the requester makes
> > sense when irq_affinity=1.
> 
> Well, you could still want an
> "IO performance always beats power considerations" policy and still go HMP,
> because sometimes for non-IO you prefer power; but I agree that we don't have enough
> information about what the user wants from the system/kernel.

The flag was to keep the requester and completion on the same LLC. Note the
keyword "same". Assume L3 doesn't exist and L2 is the LLC, which means the
capacity levels and the LLC mapping to clusters are the same.

Then note that the capability of the LLC is different since the big clusters
tend to have larger L2. I think modern SoCs for servers can end up with complex
LLC not all of which have the same performance.

Keeping the requester/completion on the same capacity keeps the old behavior.
To teach it to do more, we need sensible extensions based on sensible use
cases, and not to hack rq_affinity=1 to do complex things.

> 
> > 
> >> Although the actual effect on power probably isn't that significant, given
> >> that the (e.g. big) CPU has submitted the IO, is woken up soon, so you could
> >> almost ignore a potential idle wakeup and the actual CPU time spent in the block
> >> completion is pretty short of course.
> >>
> >>> If no matching is required, it makes sense to set rq_affinity to 0. When
> >>> matching is enabled, we need to rely on per-task iowait boost to help the
> >>> requester to run at a bigger CPU, and naturally the completion will follow when
> >>> rq_affinity=1. If the requester doesn't need the big perf, but the irq
> >>> triggered on a bigger core, I struggle to understand why it is good for
> >>> completion to run on bigger core without the requester also being on a similar
> >>> bigger core to truly maximize perf.
> >>
> >> So first of all, per-task iowait boosting has nothing to do with it IMO.
> > 
> > It has. If the perf is not good because the requester is running on little
> > core, the requester need to move up to ensure the overall IO perf is better.
> 
> See below but also
> "the requester need to move up to ensure the overall IO perf is better" is
> just not true, with asynchronous IO submission done right, the submission

I think it is true for rq_affinity=1, which is the context of the discussion
and the patch.

> runtime isn't critical to the IO throughput, therefore it should run the
> most power-efficient way.
> This can be observed e.g. with any io_uring fio workload with significant
> iodepth (and possibly multi-threading).
> Completion may be a different story, depending on the device stack, if we're
> dealing with !MCQ then the completion path (irq + block layer completion)
> is absolutely critical.
> For any mmc / ufs<4.0 system the performance difference between
> fio --name=little --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$LITTLE_CPUS
> and
> fio --name=big --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$BIG_CPUS
> is (usually) only because of the completion path and setting irq affinity of
> /dev/sda to $BIG_CPUS will make the difference disappear (rq_affinity=0 and
> implying LLC is the same).
> Running the submission on little CPUs will usually be the most power-efficient
> way then.
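(For reference, the knobs discussed above are plain sysfs/procfs writes; the
device name and IRQ number below are examples, not values from this thread:)

```shell
# rq_affinity: 0 = complete wherever the IRQ lands, 1 = complete on a CPU
# sharing the submitter's LLC (and, since the patch under discussion, its
# capacity), 2 = force completion on the exact submitting CPU.
echo 0 > /sys/block/sda/queue/rq_affinity

# Steer the device interrupt to the big cluster, e.g. CPUs 4-7 (mask f0).
# Find the IRQ number in /proc/interrupts; managed IRQs cannot be changed
# this way.
echo f0 > /proc/irq/123/smp_affinity
```

With rq_affinity=0 and the IRQ pinned as above, completions run on the big
cluster regardless of the submitting CPU's capacity, which is the setup
described here.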

From the rq_affinity=1 context, how is it supposed to discern all of that? If
there's a good answer to this, then this is the direction that should be taken
to handle it transparently. AFAICT we don't have this info.

> 
> > 
> >> Plenty of IO workloads build up utilization perfectly fine.
> > 
> > These ones have no problems, no? They should migrate to big core and the
> > completion will follow them when they move.
> 
> So if I understood Manish correctly the only reason they want the completion
> to run on a bigger CPU than the submission is because the submission is already
> saturating the CPU, therefore utilization of submission is no issue whatsoever.
> They don't want to run (submission) on big though because of power
> considerations.

Is this what rq_affinity=1 means? This use case is rq_affinity=0. IOW, custom
affinity management.

> 
> > 
> >> I wouldn't consider the setup: requester little perf, irq+completion big perf
> >> invalid necessarily, it does decrease IO latency for the application.
> > 
> > I didn't say invalid. But it is not something we can guess automatically when
> > irq_affinity=1. We don't have enough info to judge. The only info we have is
> > that the requester that originated the request is running at a different perf
> > level (whether higher or lower), so we follow it.
> >
> Anyway, Manish's problem should be solved by rq_affinity=0 in that case (with
> the irq affinity set to a big CPU, the completion will then run on the irq CPU),
> and "rq_affinity=1 <=> equal capacity CPU" is the correct interpretation, is that
> more or less agreed upon now?
> 

I think this is the sensible route. The sensible extension I foresee is
teaching it how to discern different cases rather than adding a hacky flag
that confuses what rq_affinity=1 means.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-09-01 17:25 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-31 13:46 Regarding patch "block/blk-mq: Don't complete locally if capacities are different" MANISH PANDEY
2024-08-01  9:25 ` MANISH PANDEY
2024-08-01 16:05   ` Qais Yousef
2024-08-02  9:03 ` Christian Loehle
2024-08-05  2:07   ` Qais Yousef
2024-08-05 10:18     ` Christian Loehle
2024-08-05 17:24       ` MANISH PANDEY
2024-08-09  0:47         ` Qais Yousef
2024-08-09  0:23       ` Qais Yousef
2024-08-13 16:20         ` Christian Loehle
2024-09-01 17:25           ` Qais Yousef
2024-08-05 17:17     ` Bart Van Assche
2024-08-05 17:35       ` MANISH PANDEY
2024-08-05 17:52         ` Bart Van Assche
2024-08-08  6:05           ` MANISH PANDEY
2024-08-09  0:36             ` Qais Yousef
2024-08-11 17:41             ` Dietmar Eggemann
2024-08-12 18:15               ` Sandeep Dhavale
2024-08-21 12:29                 ` MANISH PANDEY
2024-08-21 17:22                   ` Bart Van Assche
2024-08-22 10:46                     ` MANISH PANDEY
2024-08-22 14:24                       ` Bart Van Assche
2024-08-23  7:57                         ` MANISH PANDEY
2024-08-23 12:03                           ` Christian Loehle
2024-08-23 13:49                             ` MANISH PANDEY
2024-08-23 14:12                               ` Bart Van Assche
2024-08-26 17:32                                 ` Christian Loehle
2024-09-01 17:13                   ` Qais Yousef
2024-08-09  0:28       ` Qais Yousef

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox