Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

Linux MultiMedia Card development

* Re: [RFC PATCH 0/2] Introduce per-task io utilization boost
       [not found]         ` <20240321123935.zqscwi2aom7lfhts@airbuntu>
@ 2024-03-21 17:57           ` Christian Loehle
  2024-03-21 19:52             ` Bart Van Assche
  2024-03-25  2:53             ` Qais Yousef
  0 siblings, 2 replies; 5+ messages in thread
From: Christian Loehle @ 2024-03-21 17:57 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Bart Van Assche, linux-kernel, peterz, juri.lelli, mingo, rafael,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, andres, asml.silence, linux-pm,
	linux-block, io-uring, linux-mmc

On 21/03/2024 12:39, Qais Yousef wrote:
[snip]
>> On 05/03/2024 18:36, Bart Van Assche wrote:
>>> On 3/5/24 01:13, Christian Loehle wrote:
>>>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>>>> - Higher cap is not always beneficial, we might place the task away
>>>>>> from the CPU where the interrupt handler is running, making it run
>>>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>>>> then be reverted again, but a ping-pong every interval is possible).
>>>>>
>>>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>>>> controller in the test setup only supports one completion interrupt for
>>>>> all completion queues instead of one completion interrupt per completion
>>>>> queue? There are already Android phones and developer boards available
>>>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>>>
>>>> No, both NVMe test setups have one completion interrupt per completion queue,
>>>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>>>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>>>> interrupt (on CPU0 on my setup).
>>>
>>> I think that measurements should be provided in the cover letter for the
>>> two types of storage controllers: one series of measurements for a
>>> storage controller with a single completion interrupt and a second
>>> series of measurements for storage controllers with one completion
>>> interrupt per CPU.
>>
>> Of the same type of storage controller? Or what is missing for you in
>> the cover letter exactly (ufs/emmc: single completion interrupt,
>> nvme: one completion interrupt per CPU).
>>
>>>
>>>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>>>> interrupt to a big CPU, too. Similarly for the mmc.
>>>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>>>> interrupt to the same performance domain as the task, which is often optimal both in
>>>> terms of throughput and in terms of power.
>>>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>>>> patch will of course be greatly increased.
>>>
>>> I'm not sure whether making the completion interrupt follow the workload
>>> is a good solution. I'm concerned that this would increase energy
>>> consumption by keeping the big cores active longer than necessary. I
>>> like this solution better (improves storage performance on at least
>>> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
>>> Handle HMP systems when completing IO"
>>> (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).
>>
>> That patch is good, don't get me wrong, but you still lose out by running everything
>> up to blk_mq_complete_request() on (potentially) a LITTLE (that might be run on a low OPP),
>> while having a big CPU available at a high OPP anyway ("for free").
>> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
>> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
>> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):
> 
> So you want the hardirq to move to the big core? Unlike softirq, there will be
> a single hardirq for the controller (to my limited knowledge), so if there are
> multiple requests I'm not sure we can easily match which one relates to which
> before it triggers. So we can end up waking up the wrong core.

It would be beneficial to move the hardirq to a big core if the IO task
is using it anyway.
I'm not sure I actually want to. There are quite a few pitfalls (like you
mentioned) that the scheduler really shouldn't be concerned about.
Moving the hardirq, if implemented in the kernel, would have to be done by the
host controller driver anyway, which would explode this series.
(host controller drivers are quite fragmented e.g. on mmc)

The fact that having a higher capacity CPU available ("running faster") for an
IO task doesn't (always) imply higher throughput because of the hardirq staying
on some LITTLE CPU is bothering (for this series), though.

> 
> Generally this should be a userspace policy. If there's a scenario where the
> throughput is that important they can easily move the hardirq to the big core
> unconditionally and move it back again once this high throughput scenario is no
> longer important.

It also feels wrong to let this be a userspace policy, as the hardirq must be
migrated to the perf domain of the task, which userspace isn't aware of.
Unless you expect userspace to do
CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
but then you could just as well ask them to set performance governor for
big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
any iowait boosting.
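
To make the userspace variant concrete, here is a minimal sketch of the
hardirq half, assuming the standard /proc/irq/<irq>/smp_affinity interface
(a hexadecimal CPU bitmask, root required). The helper names, the IRQ number
and the assumption that CPUs 4-7 form the big perf domain are hypothetical;
the task half would use sched_setaffinity(2) with the same CPU set.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build the hex bitmask string expected by /proc/irq/<irq>/smp_affinity.
 * Hypothetical helper; limited to CPUs 0..31 to keep the mask in one word.
 * Returns 0 on success, -1 on a bad CPU number or a too-small buffer.
 */
int format_affinity_mask(const int *cpus, size_t ncpus,
			 char *buf, size_t buflen)
{
	uint32_t mask = 0;

	for (size_t i = 0; i < ncpus; i++) {
		if (cpus[i] < 0 || cpus[i] > 31)
			return -1;
		mask |= (uint32_t)1 << cpus[i];
	}

	int n = snprintf(buf, buflen, "%x", (unsigned int)mask);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Write the mask for one IRQ. Returns 0 on success, -1 on any error. */
int set_irq_affinity(int irq, const char *hexmask)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

	FILE *f = fopen(path, "w");
	if (!f)
		return -1;

	int rc = (fputs(hexmask, f) < 0) ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}
```

Userspace would still have to find out which IRQ belongs to the storage
controller and which CPUs make up the perf domain the task runs on, which
is exactly the knowledge it usually lacks.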

Furthermore you can't generally expect userspace to know if their IO will lead
to any interrupt at all, much less which one. They ideally don't even know if
the file IO they are doing is backed by any physical storage in the first place.
(Or even further, that they are doing file IO at all, they might just be
e.g. page-faulting.)

> 
> Or were you describing a different problem?

That is the problem I mentioned in the series and Bart and I were discussing.
It's a problem of the series as in "the numbers aren't that impressive".
Current iowait boosting on embedded/mobile systems will perform quite well by
chance, as the (low util) task will often be on the same perf domain the hardirq
will be run on. As can be seen in the cover letter the benefit of running the
task on a (2xLITTLE capacity) big CPU is therefore practically non-existent;
for tri-gear systems where big CPU is more like 10xLITTLE capacity the benefit
will be much greater.
I just wanted to point this out. We might just acknowledge the problem and say
"don't care" about the potential performance benefits of those scenarios that
would require hardirq moving.
In the long-term it looks like for UFS the problem will disappear as we are
expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
is already the case.

I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
is available (and it doesn't look like this is changing anytime soon).
I would also love to hear what Bart or other UFS folks think about it.
Furthermore if I forgot any storage subsystem with the same behavior in that
regard, do tell me.

Lastly, you could consider the IO workload:
IO task being in iowait very frequently [1] with just a single IO inflight [2]
and only very little time being spent on the CPU in-between iowaits[3],
therefore the interrupt handler being on the critical path for IO throughput
to a non-negligible degree, to be niche, but it's precisely the use-case where
iowait boosting shows its biggest benefit.

Sorry for the abomination of a sentence, see footnotes for the reasons.

[1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
won't apply any significant boost currently.
[2] If the storage device has enough in-flight requests to serve, iowait
boosting is unnecessary/wasteful, see cover letter.
[3] If the task actually uses the CPU in-between iowaits, it will build up
utilization, iowait boosting benefit diminishes.
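
To illustrate footnote [1], here is a deliberately simplified model of a
doubling iowait boost. It is a sketch under stated assumptions, not the
schedutil code: the struct, the function name and the single-step update
are hypothetical, and the constants (full capacity 1024, minimum boost of
one eighth of that) are assumed for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAPACITY_SCALE 1024u                 /* assumed full CPU capacity */
#define BOOST_MIN      (CAPACITY_SCALE / 8)  /* assumed smallest boost */

/* Hypothetical per-CPU state of the simplified model. */
struct boost_model {
	unsigned int boost;      /* current boost, 0..CAPACITY_SCALE */
	uint64_t last_update_ns; /* time of the previous update */
};

/* One update of the model; returns the boost after the update. */
unsigned int boost_model_step(struct boost_model *m, uint64_t now_ns,
			      uint64_t tick_ns, bool iowait_wakeup)
{
	/* Forget the boost if the previous update is more than a tick old. */
	if (m->boost && now_ns - m->last_update_ns > tick_ns)
		m->boost = 0;
	m->last_update_ns = now_ns;

	if (iowait_wakeup) {
		/* Start at the minimum, then double up to full capacity. */
		if (!m->boost)
			m->boost = BOOST_MIN;
		else if (m->boost < CAPACITY_SCALE)
			m->boost *= 2;
	} else {
		/* No iowait wakeup: decay, dropping to zero below the minimum. */
		m->boost /= 2;
		if (m->boost < BOOST_MIN)
			m->boost = 0;
	}
	return m->boost;
}
```

In this model, with a 4 ms tick, iowait wakeups arriving every 1 ms reach the
full boost after four wakeups, while wakeups arriving every 5 ms are reset
each time and never get past the minimum boost.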

> 
> Glad to see your series by the way :-) I'll get a chance to review it over the
> weekend hopefully.

Thank you!
Apologies for not CCing you in the first place, I am curious about your opinion
on the concept!

FWIW I did mess up a last-minute change that was supposed to be cosmetic and
only received a quick smoke test, so 1/2 needs the following:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aaf64023b03..2b6f521be658 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
        } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
                /* Reduce boost */
                if (p->io_boost_level > 1)
-                       io_boost_scale_interval(p, true);
+                       io_boost_scale_interval(p, false);
                else
                        p->io_boost_level = 0;
        } else if (p->io_boost_level == IO_BOOST_LEVELS) {

I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
changes are mostly cosmetic and addressing Bart's comments about the benchmark
numbers in the cover letter.

Kind Regards,
Christian


* Re: [RFC PATCH 0/2] Introduce per-task io utilization boost
  2024-03-21 17:57           ` [RFC PATCH 0/2] Introduce per-task io utilization boost Christian Loehle
@ 2024-03-21 19:52             ` Bart Van Assche
  2024-03-25 12:06               ` Christian Loehle
  2024-03-25  2:53             ` Qais Yousef
  1 sibling, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2024-03-21 19:52 UTC (permalink / raw)
  To: Christian Loehle, Qais Yousef
  Cc: linux-kernel, peterz, juri.lelli, mingo, rafael, dietmar.eggemann,
	vschneid, vincent.guittot, Johannes.Thumshirn, adrian.hunter,
	ulf.hansson, andres, asml.silence, linux-pm, linux-block,
	io-uring, linux-mmc

On 3/21/24 10:57, Christian Loehle wrote:
> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.

Why the focus on storage controllers with a single completion interrupt?
It probably won't take long (one year?) until all new high-end
smartphones may have support for multiple completion interrupts.

Thanks,

Bart.



* Re: [RFC PATCH 0/2] Introduce per-task io utilization boost
  2024-03-21 17:57           ` [RFC PATCH 0/2] Introduce per-task io utilization boost Christian Loehle
  2024-03-21 19:52             ` Bart Van Assche
@ 2024-03-25  2:53             ` Qais Yousef
  1 sibling, 0 replies; 5+ messages in thread
From: Qais Yousef @ 2024-03-25  2:53 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Bart Van Assche, linux-kernel, peterz, juri.lelli, mingo, rafael,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, andres, asml.silence, linux-pm,
	linux-block, io-uring, linux-mmc

On 03/21/24 17:57, Christian Loehle wrote:

> > So you want the hardirq to move to the big core? Unlike softirq, there will be
> > a single hardirq for the controller (to my limited knowledge), so if there are
> > multiple requests I'm not sure we can easily match which one relates to which
> > before it triggers. So we can end up waking up the wrong core.
> 
> It would be beneficial to move the hardirq to a big core if the IO task
> is using it anyway.
> I'm not sure I actually want to. There are quite a few pitfalls (like you

I'm actually against it. I think it's too much complexity for not necessarily
a big gain. FWIW, one of the design requests for per-task iowait boost is so
that we can *disable* it. It wastes power when only a handful of tasks actually
care about perf.

Caring where the hardirq runs for perf is unlikely to be a problem in practice.
Softirq should follow the requester already when it matters.

> mentioned) that the scheduler really shouldn't be concerned about.
> Moving the hardirq, if implemented in the kernel, would have to be done by the
> host controller driver anyway, which would explode this series.
> (host controller drivers are quite fragmented e.g. on mmc)
> 
> The fact that having a higher capacity CPU available ("running faster") for an
> IO task doesn't (always) imply higher throughput because of the hardirq staying
> on some LITTLE CPU is bothering (for this series), though.
> 
> > 
> > Generally this should be a userspace policy. If there's a scenario where the
> > throughput is that important they can easily move the hardirq to the big core
> > unconditionally and move it back again once this high throughput scenario is no
> > longer important.
> 
> It also feels wrong to let this be a userspace policy, as the hardirq must be
> migrated to the perf domain of the task, which userspace isn't aware of.
> Unless you expect userspace to do

irq balancer is a userspace policy. For the kernel to make an automatic decision
there are a lot of ifs that must be present. Again, I don't see maximizing
throughput being a concern on such a system. And userspace can fix the problem
simply - they know after all when the throughput really matters to the point
that where the hardirq runs is a bottleneck. In practice, I don't think it is a
bottleneck.
But this is my handwavy judgement. The experts know better. And note, I mean
use cases that are not benchmarks ;-)

> CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
> but then you could just as well ask them to set performance governor for
> big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
> any iowait boosting.
> 
> Furthermore you can't generally expect userspace to know if their IO will lead
> to any interrupt at all, much less which one. They ideally don't even know if
> the file IO they are doing is backed by any physical storage in the first place.
> (Or even further, that they are doing file IO at all, they might just be
> e.g. page-faulting.)

The way I see it, it's like gigabit networking. The hardirq will matter once
you reach such high throughput scenarios. Which are corner cases and not the
norm?

> 
> > 
> > Or were you describing a different problem?
> 
> That is the problem I mentioned in the series and Bart and I were discussing.
> It's a problem of the series as in "the numbers aren't that impressive".
> Current iowait boosting on embedded/mobile systems will perform quite well by
> chance, as the (low util) task will often be on the same perf domain the hardirq
> will be run on. As can be seen in the cover letter the benefit of running the
> task on a (2xLITTLE capacity) big CPU is therefore practically non-existent;
> for tri-gear systems where big CPU is more like 10xLITTLE capacity the benefit
> will be much greater.
> I just wanted to point this out. We might just acknowledge the problem and say
> "don't care" about the potential performance benefits of those scenarios that
> would require hardirq moving.

I thought the softirq does the bulk of the work. hardirq being such
a bottleneck is (naively maybe) a red flag for me that it's doing more than
simple interrupt servicing.

You don't boost when the task is sleeping, right? I think this is likely
a cause of the problem where softirq is not running as fast - whereas before the
series the CPU would be iowait boosted regardless of whether the task is blocked
or not.

> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.
> 
> I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
> 'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
> is available (and it doesn't look like this is changing anytime soon).
> I would also love to hear what Bart or other UFS folks think about it.
> Furthermore if I forgot any storage subsystem with the same behavior in that
> regard, do tell me.
> 
> Lastly, you could consider the IO workload:
> IO task being in iowait very frequently [1] with just a single IO inflight [2]
> and only very little time being spent on the CPU in-between iowaits[3],
> therefore the interrupt handler being on the critical path for IO throughput
> to a non-negligible degree, to be niche, but it's precisely the use-case where
> iowait boosting shows its biggest benefit.
> 
> Sorry for the abomination of a sentence, see footnotes for the reasons.
> 
> [1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
> won't apply any significant boost currently.

I CCed you on a patch where I fix this. I've been sleeping on it for too long.
Maybe I should have split this fix out of the consolidation patch.

> [2] If the storage device has enough in-flight requests to serve, iowait
> boosting is unnecessary/wasteful, see cover letter.
> [3] If the task actually uses the CPU in-between iowaits, it will build up
> utilization, iowait boosting benefit diminishes.

The current mechanism is very aggressive. It needs to evolve for sure.

> 
> > 
> > Glad to see your series by the way :-) I'll get a chance to review it over the
> > weekend hopefully.
> 
> Thank you!
> Apologies for not CCing you in the first place, I am curious about your opinion
> on the concept!

I actually had a patch that implements iowait boost per-task (on top of my
remove uclamp max aggregation series) where I did actually take the extra step
to remove iowait from intel_pstate. Can share the patches if you think you'll
find them useful.

Just want to note that this mechanism can end up wasting power and this is an
important direction to consider. It's not only about perf (which matters too).

> 
> FWIW I did mess up a last-minute change that was supposed to be cosmetic and
> only received a quick smoke test, so 1/2 needs the following:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4aaf64023b03..2b6f521be658 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
>         } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
>                 /* Reduce boost */
>                 if (p->io_boost_level > 1)
> -                       io_boost_scale_interval(p, true);
> +                       io_boost_scale_interval(p, false);
>                 else
>                         p->io_boost_level = 0;
>         } else if (p->io_boost_level == IO_BOOST_LEVELS) {
> 
> 
> I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
> changes are mostly cosmetic and addressing Bart's comments about the benchmark
> numbers in the cover letter.

I didn't spend a lot of time on the series, but I can see a number of problems.
Let us discuss them first and plan a future direction. No need to v2 if it's
just for this fix IMO.


* Re: [RFC PATCH 0/2] Introduce per-task io utilization boost
  2024-03-21 19:52             ` Bart Van Assche
@ 2024-03-25 12:06               ` Christian Loehle
  2024-03-25 17:23                 ` Bart Van Assche
  0 siblings, 1 reply; 5+ messages in thread
From: Christian Loehle @ 2024-03-25 12:06 UTC (permalink / raw)
  To: Bart Van Assche, Qais Yousef
  Cc: linux-kernel, peterz, juri.lelli, mingo, rafael, dietmar.eggemann,
	vschneid, vincent.guittot, Johannes.Thumshirn, adrian.hunter,
	ulf.hansson, andres, asml.silence, linux-pm, linux-block,
	io-uring, linux-mmc

On 21/03/2024 19:52, Bart Van Assche wrote:
> On 3/21/24 10:57, Christian Loehle wrote:
>> In the long-term it looks like for UFS the problem will disappear as we are
>> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
>> is already the case.
> 
> Why the focus on storage controllers with a single completion interrupt?
> It probably won't take long (one year?) until all new high-end
> smartphones may have support for multiple completion interrupts.
> 
> Thanks,
> 
> Bart.
> 

Apart from going from "This patch shows significant performance improvements on
hardware that runs mainline today" to "This patch will have significant
performance improvements on devices running mainline in a couple years"
nothing in particular.
I'm fine with leaving it with having acknowledged the problem.
Maybe I would just gate the task placement on the task having been in
UFS (with multiple completion interrupts) or NVMe submission recently to
avoid regressions to current behavior in future versions. I did have that
already at some point, although it was a bit hacky.
Anyway, thank you for your input on that, it is what I wanted to hear!
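
For illustration only, the gating idea could look roughly like the sketch
below. It assumes a per-task record of the last submission to a device with
multiple completion interrupts and a caller-chosen time window; every name
here is hypothetical and this is not the earlier, hackier version mentioned
above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task hint about recent IO submissions. */
struct io_task_hint {
	uint64_t last_mq_submit_ns; /* last submit to a multi-irq device */
	bool seen_mq_submit;        /* false until the first such submit */
};

/* Record a submission; only devices with per-queue interrupts count. */
void io_hint_note_submit(struct io_task_hint *h, uint64_t now_ns,
			 bool multi_irq_device)
{
	if (multi_irq_device) {
		h->last_mq_submit_ns = now_ns;
		h->seen_mq_submit = true;
	}
}

/* Allow boosted placement only within window_ns of such a submission. */
bool io_hint_placement_allowed(const struct io_task_hint *h,
			       uint64_t now_ns, uint64_t window_ns)
{
	return h->seen_mq_submit &&
	       now_ns - h->last_mq_submit_ns <= window_ns;
}
```

A task that only ever submits to a single-interrupt controller never passes
the gate in this sketch, so its placement would stay as it is today.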

Kind Regards,
Christian


* Re: [RFC PATCH 0/2] Introduce per-task io utilization boost
  2024-03-25 12:06               ` Christian Loehle
@ 2024-03-25 17:23                 ` Bart Van Assche
  0 siblings, 0 replies; 5+ messages in thread
From: Bart Van Assche @ 2024-03-25 17:23 UTC (permalink / raw)
  To: Christian Loehle, Qais Yousef
  Cc: linux-kernel, peterz, juri.lelli, mingo, rafael, dietmar.eggemann,
	vschneid, vincent.guittot, Johannes.Thumshirn, adrian.hunter,
	ulf.hansson, andres, asml.silence, linux-pm, linux-block,
	io-uring, linux-mmc

On 3/25/24 05:06, Christian Loehle wrote:
> On 21/03/2024 19:52, Bart Van Assche wrote:
>> On 3/21/24 10:57, Christian Loehle wrote:
>>> In the long-term it looks like for UFS the problem will disappear as we are
>>> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
>>> is already the case.
>>
>> Why the focus on storage controllers with a single completion interrupt?
>> It probably won't take long (one year?) until all new high-end
>> smartphones may have support for multiple completion interrupts.
> 
> Apart from going from "This patch shows significant performance improvements on
> hardware that runs mainline today" to "This patch will have significant
> performance improvements on devices running mainline in a couple years"
> nothing in particular.

That doesn't make sense to me. Smartphones with UFSHCI 4.0 controllers
are available from multiple vendors. See also 
https://en.wikipedia.org/wiki/Universal_Flash_Storage. See also
https://www.gsmarena.com/samsung_galaxy_s24-12773.php.

Bart.


Thread overview: 5+ messages
     [not found] <20240304201625.100619-1-christian.loehle@arm.com>
     [not found] ` <86f0af00-8765-4481-9245-1819fb2c6379@acm.org>
     [not found]   ` <0dc6a839-2922-40ac-8854-2884196da9b9@arm.com>
     [not found]     ` <c5b7fc1f-f233-4d25-952b-539607c2a0cc@acm.org>
     [not found]       ` <2784c093-eea1-4b73-87da-1a45f14013c8@arm.com>
     [not found]         ` <20240321123935.zqscwi2aom7lfhts@airbuntu>
2024-03-21 17:57           ` [RFC PATCH 0/2] Introduce per-task io utilization boost Christian Loehle
2024-03-21 19:52             ` Bart Van Assche
2024-03-25 12:06               ` Christian Loehle
2024-03-25 17:23                 ` Bart Van Assche
2024-03-25  2:53             ` Qais Yousef
