public inbox for linux-block@vger.kernel.org
* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
       [not found] <33882f284ac6e6d1ec766ca4bb2f3b88@intel.com>
@ 2025-03-03 22:24 ` Christian Loehle
  2025-03-17 10:03   ` King, Colin
  0 siblings, 1 reply; 12+ messages in thread
From: Christian Loehle @ 2025-03-03 22:24 UTC (permalink / raw)
  To: Colin Ian King, Jens Axboe, Rafael J. Wysocki, Daniel Lezcano,
	linux-block, linux-pm
  Cc: linux-kernel

On 3/3/25 16:43, Colin Ian King wrote:
> Modern processors can drop into deep sleep states relatively quickly
> to save power. However, coming out of deep sleep states takes a small
> amount of time and this is detrimental to performance for I/O devices
> such as fast PCIe NVME drives when servicing completed I/O
> transactions.
> 
> Testing with fio with read/write RAID0 PCIe NVME devices on various
> modern SMP based systems (such as 96 thread Granite Rapids Xeon 6741P)
> has shown that 85-90% of read/write transactions issued on a CPU
> are completed by the same CPU, so it makes some sense to prevent the
> CPU from dropping into a deep sleep state to help reduce I/O handling
> latency.

For the platform you tested on that may be true, but even if we constrain
ourselves to pci-nvme there's a variety of queue/irq mappings where
this doesn't hold I'm afraid.

> 
> This commit introduces a simple, lightweight and fast power sleep
> demotion mechanism that provides the block layer a way to inform the
> menu governor to prevent a CPU from going into a deep sleep when an
> I/O operation is requested. While it is true that some I/Os may not

s/requested/completed is the full truth, isn't it?

> be serviced on the same CPU that issued the I/O request, and hence
> the mechanism is not 100% perfect, it works well for the vast majority
> of I/O operations and the sleep demotion prevention adds very little
> overhead.
> 
> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME md
> array using fio 3.35 performing random read and read-write test on a
> 512GB file with 8 concurrent I/O jobs. Tested with the NHM_C1_AUTO_DEMOTE
> bit set in MSR_PKG_CST_CONFIG_CONTROL set in the BIOS.
> 
> Test case: random reads, results based on geometric mean of results from
> 5 test runs:
>            Bandwidth         IO-ops   Latency   Bandwidth
>            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> Baseline:  21365755610       20377     390105   1.86%
> Patched:   25950107558       24748     322905   0.16%

What is the baseline?
Do you mind trying with Rafael's recently posted series?
Given the IOPS I'd expect good results from that alone already.
https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/

(Happy to see teo as comparison too, which you don't modify).

> 
> Read rate improvement of ~21%.
> 
> Test case: random read+writes, results based on geometric mean of results
> from 5 test runs:
> 
>            Bandwidth         IO-ops   Latency   Bandwidth
>            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> Baseline:   9937848224        9477     550094   1.04%
> Patched:   10502592508       10016     509315   1.85%
> 
> Read rate improvement of ~5.7%
> 
>            Bandwidth         IO-ops   Latency   Bandwidth
>            write (bytes/sec) per sec    (ns)    % Std.Deviation
> Baseline:   9945197656        9484     288933   1.02%
> Patched:   10517268400       10030     287026   1.85%
> 
> Write rate improvement of ~5.7%
> 
> For kernel builds, where all CPUs are fully loaded, no performance
> improvement or regressions were observed based on the results of
> 5 kernel build test runs.
> 
> By default, CPU power sleep demotion blocking is set to run
> for 3 ms on I/O requests, but this can be modified using the
> new sysfs interface:
> 
>   /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms

rounding up a jiffie sure is a heavy price to pay then.

> 
> Setting this to zero will disable the mechanism.
> 
> Signed-off-by: Colin Ian King <colin.king@intel.com>
> ---
>  block/blk-mq.c                   |   2 +
>  drivers/cpuidle/Kconfig          |  10 +++
>  drivers/cpuidle/Makefile         |   1 +
>  drivers/cpuidle/governors/menu.c |   4 +
>  drivers/cpuidle/psd.c            | 123 +++++++++++++++++++++++++++++++
>  include/linux/cpuidle_psd.h      |  32 ++++++++
>  6 files changed, 172 insertions(+)
>  create mode 100644 drivers/cpuidle/psd.c
>  create mode 100644 include/linux/cpuidle_psd.h
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-03 22:24 ` [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices Christian Loehle
@ 2025-03-17 10:03   ` King, Colin
  2025-03-23  9:18     ` Christian Loehle
  2025-03-23 12:35     ` Bart Van Assche
  0 siblings, 2 replies; 12+ messages in thread
From: King, Colin @ 2025-03-17 10:03 UTC (permalink / raw)
  To: Christian Loehle, Jens Axboe, Rafael J. Wysocki, Daniel Lezcano,
	linux-block@vger.kernel.org, linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

Hi Christian, 

Follow-up below:

> -----Original Message-----
> From: Christian Loehle <christian.loehle@arm.com>
> Sent: 03 March 2025 22:25
> To: King, Colin <colin.king@intel.com>; Jens Axboe <axboe@kernel.dk>; Rafael
> J. Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
> linux-block@vger.kernel.org; linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
> 
> On 3/3/25 16:43, Colin Ian King wrote:
> > Modern processors can drop into deep sleep states relatively quickly
> > to save power. However, coming out of deep sleep states takes a small
> > amount of time and this is detrimental to performance for I/O devices
> > such as fast PCIe NVME drives when servicing completed I/O
> > transactions.
> >
> > Testing with fio with read/write RAID0 PCIe NVME devices on various
> > modern SMP based systems (such as 96 thread Granite Rapids Xeon 6741P)
> > has shown that 85-90% of read/write transactions issued on a CPU
> > are completed by the same CPU, so it makes some sense to prevent the
> > CPU from dropping into a deep sleep state to help reduce I/O handling
> > latency.
> 
> For the platform you tested on that may be true, but even if we constrain
> ourselves to pci-nvme there's a variety of queue/irq mappings where this
> doesn't hold I'm afraid.

This code is optional: one can enable or disable it via the config option. Also,
even when it is built in, one can disable it by writing 0 to the sysfs file
  /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms

> 
> >
> > This commit introduces a simple, lightweight and fast power sleep
> > demotion mechanism that provides the block layer a way to inform the
> > menu governor to prevent a CPU from going into a deep sleep when an
> > I/O operation is requested. While it is true that some I/Os may not
> 
> s/requested/completed is the full truth, isn't it?
> 
> > be serviced on the same CPU that issued the I/O request, and hence
> > the mechanism is not 100% perfect, it works well for the vast majority
> > of I/O operations and the sleep demotion prevention adds very little
> > overhead.
> >
> > Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME md
> > array using fio 3.35 performing random read and read-write test on a
> > 512GB file with 8 concurrent I/O jobs. Tested with the
> > NHM_C1_AUTO_DEMOTE bit set in MSR_PKG_CST_CONFIG_CONTROL set in
> the BIOS.
> >
> > Test case: random reads, results based on geometric mean of results
> > from
> > 5 test runs:
> >            Bandwidth         IO-ops   Latency   Bandwidth
> >            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> > Baseline:  21365755610       20377     390105   1.86%
> > Patched:   25950107558       24748     322905   0.16%
> 
> What is the baseline?
> Do you mind trying with Rafael's recently posted series?
> Given the IOPS I'd expect good results from that alone already.
> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/
> 
> (Happy to see teo as comparison too, which you don't modify).

OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is 6.14-rc5 without
Rafael's patch. I also tested the Baseline with C6 and C6P disabled as this stops deeper
C-state sleeps and in theory should provide "best performance". I also benchmarked with
Rafael's patch and just my patch, and finally Rafael's patch AND my patch:

Reads
                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                25689182477     0.15            326177          0.15
C6, C6P disabled        25839580014     0.19            324349          0.19
Rafael's patch:         25695523747     0.06            326150          0.06
My patch:               25782011833     0.07            324999          0.07
Rafael + my patch:      25792551514     0.10            324924          0.10

Writes
                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                15220468898     3.33            550290          3.36
C6, C6P disabled        13405624707     0.66            625424          0.66
Rafael's patch:         14017625200     1.15            598049          1.16
My patch:               15444417488     3.73            467818          29.10
Rafael + my patch:      14037711032     1.13            597143          1.13

Combined Read+Writes, Reads

                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                10132229433     0.41            484919          0.25
C6, C6P disabled        10567536346     0.60            515260          1.16
Rafael's patch:         10171044817     0.37            486937          0.20
My patch:               10468953527     0.07            504797          0.07
Rafael + my patch:      10174707546     1.26            488263          1.13

Combined Read+Writes, Writes

                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                10139393169     0.44            342132          1.23
C6, C6P disabled        10583264662     0.63            277052          3.87
Rafael's patch:         10178275035     0.39            336989          0.94
My patch:               10482766569     1.28            294803          6.87
Rafael + my patch:      10183837235     0.38            330657          3.39
> 
> >
> > Read rate improvement of ~21%.
> >
> > Test case: random read+writes, results based on geometric mean of
> > results from 5 test runs:
> >
> >            Bandwidth         IO-ops   Latency   Bandwidth
> >            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> > Baseline:   9937848224        9477     550094   1.04%
> > Patched:   10502592508       10016     509315   1.85%
> >
> > Read rate improvement of ~5.7%
> >
> >            Bandwidth         IO-ops   Latency   Bandwidth
> >            write (bytes/sec) per sec    (ns)    % Std.Deviation
> > Baseline:   9945197656        9484     288933   1.02%
> > Patched:   10517268400       10030     287026   1.85%
> >
> > Write rate improvement of ~5.7%
> >
> > For kernel builds, where all CPUs are fully loaded, no performance
> > improvement or regressions were observed based on the results of
> > 5 kernel build test runs.
> >
> > By default, CPU power sleep demotion blocking is set to run for 3 ms
> > on I/O requests, but this can be modified using the new sysfs
> > interface:
> >
> >   /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> 
> rounding up a jiffie sure is a heavy price to pay then.

Jiffies were used because the code is designed to be as light-weight as possible. Rounding up isn't
problematic; using higher precision timing is more CPU expensive and really doesn't gain us any extra
win at this kind of time resolution. We need the code to be light weight and "good-enough" on the I/O
path rather than overly expensive and "100% perfect".

> 
> >
> > Setting this to zero will disable the mechanism.
> >
> > Signed-off-by: Colin Ian King <colin.king@intel.com>
> > ---
> >  block/blk-mq.c                   |   2 +
> >  drivers/cpuidle/Kconfig          |  10 +++
> >  drivers/cpuidle/Makefile         |   1 +
> >  drivers/cpuidle/governors/menu.c |   4 +
> >  drivers/cpuidle/psd.c            | 123 +++++++++++++++++++++++++++++++
> >  include/linux/cpuidle_psd.h      |  32 ++++++++
> >  6 files changed, 172 insertions(+)
> >  create mode 100644 drivers/cpuidle/psd.c  create mode 100644
> > include/linux/cpuidle_psd.h
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-17 10:03   ` King, Colin
@ 2025-03-23  9:18     ` Christian Loehle
  2025-03-26 15:14       ` King, Colin
  2025-03-23 12:35     ` Bart Van Assche
  1 sibling, 1 reply; 12+ messages in thread
From: Christian Loehle @ 2025-03-23  9:18 UTC (permalink / raw)
  To: King, Colin, Jens Axboe, Rafael J. Wysocki, Daniel Lezcano,
	linux-block@vger.kernel.org, linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

On 3/17/25 10:03, King, Colin wrote:
> Hi Christian, 
> 
> Follow-up below:
> 
>> -----Original Message-----
>> From: Christian Loehle <christian.loehle@arm.com>
>> Sent: 03 March 2025 22:25
>> To: King, Colin <colin.king@intel.com>; Jens Axboe <axboe@kernel.dk>; Rafael
>> J. Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
>> linux-block@vger.kernel.org; linux-pm@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
>> fast I/O devices
>>
>> On 3/3/25 16:43, Colin Ian King wrote:
>>> Modern processors can drop into deep sleep states relatively quickly
>>> to save power. However, coming out of deep sleep states takes a small
>>> amount of time and this is detrimental to performance for I/O devices
>>> such as fast PCIe NVME drives when servicing completed I/O
>>> transactions.
>>>
>>> Testing with fio with read/write RAID0 PCIe NVME devices on various
>>> modern SMP based systems (such as 96 thread Granite Rapids Xeon 6741P)
>>> has shown that 85-90% of read/write transactions issued on a CPU
>>> are completed by the same CPU, so it makes some sense to prevent the
>>> CPU from dropping into a deep sleep state to help reduce I/O handling
>>> latency.
>>
>> For the platform you tested on that may be true, but even if we constrain
>> ourselves to pci-nvme there's a variety of queue/irq mappings where this
>> doesn't hold I'm afraid.
> 
> This code is optional: one can enable or disable it via the config option. Also,
> even when it is built in, one can disable it by writing 0 to the sysfs file
>   /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> 
>>
>>>
>>> This commit introduces a simple, lightweight and fast power sleep
>>> demotion mechanism that provides the block layer a way to inform the
>>> menu governor to prevent a CPU from going into a deep sleep when an
>>> I/O operation is requested. While it is true that some I/Os may not
>>
>> s/requested/completed is the full truth, isn't it?
>>
>>> be serviced on the same CPU that issued the I/O request, and hence
>>> the mechanism is not 100% perfect, it works well for the vast majority
>>> of I/O operations and the sleep demotion prevention adds very little
>>> overhead.
>>>
>>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME md
>>> array using fio 3.35 performing random read and read-write test on a
>>> 512GB file with 8 concurrent I/O jobs. Tested with the
>>> NHM_C1_AUTO_DEMOTE bit set in MSR_PKG_CST_CONFIG_CONTROL set in
>> the BIOS.
>>>
>>> Test case: random reads, results based on geometric mean of results
>>> from
>>> 5 test runs:
>>>            Bandwidth         IO-ops   Latency   Bandwidth
>>>            read (bytes/sec)  per sec    (ns)    % Std.Deviation
>>> Baseline:  21365755610       20377     390105   1.86%
>>> Patched:   25950107558       24748     322905   0.16%
>>
>> What is the baseline?
>> Do you mind trying with Rafael's recently posted series?
>> Given the IOPS I'd expect good results from that alone already.
>> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/
>>
>> (Happy to see teo as comparison too, which you don't modify).
> 
> OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is 6.14-rc5 without
> Rafael's patch. I also tested the Baseline with C6 and C6P disabled as this stops deeper
> C-state sleeps and in theory should provide "best performance". I also benchmarked with
> Rafael's patch and just my patch, and finally Rafael's patch AND my patch:
> 
> Reads
>                         Bandwidth       Bandwidth       latency         latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                25689182477     0.15            326177          0.15
> C6, C6P disabled        25839580014     0.19            324349          0.19
> Rafael's patch:         25695523747     0.06            326150          0.06
> My patch:               25782011833     0.07            324999          0.07
> Rafael + my patch:      25792551514     0.10            324924          0.10

So these are mostly equal, right?

> 
> Writes
>                         Bandwidth       Bandwidth       latency         latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                15220468898     3.33            550290          3.36
> C6, C6P disabled        13405624707     0.66            625424          0.66
> Rafael's patch:         14017625200     1.15            598049          1.16
> My patch:               15444417488     3.73            467818          29.10
> Rafael + my patch:      14037711032     1.13            597143          1.13

These don't make sense to me. Why would C6/C6P be this bad?
Why would Rafael's patch make it worse?
Are these just noise?

> 
> Combined Read+Writes, Reads
> 
>                         Bandwidth       Bandwidth       latency         latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                10132229433     0.41            484919          0.25
> C6, C6P disabled        10567536346     0.60            515260          1.16
> Rafael's patch:         10171044817     0.37            486937          0.20
> My patch:               10468953527     0.07            504797          0.07
> Rafael + my patch:      10174707546     1.26            488263          1.13
> 
> Combined Read+Writes, Writes
> 
>                         Bandwidth       Bandwidth       latency         latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                10139393169     0.44            342132          1.23
> C6, C6P disabled        10583264662     0.63            277052          3.87
> Rafael's patch:         10178275035     0.39            336989          0.94
> My patch:               10482766569     1.28            294803          6.87
> Rafael + my patch:      10183837235     0.38            330657          3.39

The above two indicate a +3% from (now mainline) Rafael's patch to yours.
There's no reason why Rafael + your patch should be worse than just your patch.
If this is statistically significant we would need to look into why.

FWIW I think the energy cost of essentially remaining in polling even for quite
sporadic IO can't be overstated IMO.
A comparison with teo, which might do slightly better than menu while not outright
disabling idle, would still be interesting.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-17 10:03   ` King, Colin
  2025-03-23  9:18     ` Christian Loehle
@ 2025-03-23 12:35     ` Bart Van Assche
  2025-03-26 15:04       ` King, Colin
  1 sibling, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2025-03-23 12:35 UTC (permalink / raw)
  To: King, Colin, Christian Loehle, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

On 3/17/25 3:03 AM, King, Colin wrote:
> This code is optional: one can enable or disable it via the config option. Also,
> even when it is built in, one can disable it by writing 0 to the sysfs file
>    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms

I'm not sure we need even more configuration knobs in sysfs. How are
users expected to find this configuration option? How should they
decide whether to enable or to disable it?

Please take a look at this proposal and let me know whether this would
solve the issue that you are looking into: "[LSF/MM/BPF Topic]
Energy-Efficient I/O"
(https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-b845-c869287d3cf3@acm.org/).
The only disadvantage of this approach compared to the cpuidle patch is
that it requires RPM (runtime power management) to be enabled. Maybe I
should look into modifying the approach such that it does not rely on RPM.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-23 12:35     ` Bart Van Assche
@ 2025-03-26 15:04       ` King, Colin
  2025-03-26 15:14         ` Rafael J. Wysocki
  2025-03-26 16:26         ` Christian Loehle
  0 siblings, 2 replies; 12+ messages in thread
From: King, Colin @ 2025-03-26 15:04 UTC (permalink / raw)
  To: Bart Van Assche, Christian Loehle, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

Hi,

> -----Original Message-----
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: 23 March 2025 12:36
> To: King, Colin <colin.king@intel.com>; Christian Loehle
> <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
> Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
> linux-block@vger.kernel.org; linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
> 
> On 3/17/25 3:03 AM, King, Colin wrote:
> > This code is optional: one can enable or disable it via the config
> > option. Also, even when it is built in, one can disable it by writing 0 to the
> sysfs file
> >    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> 
> I'm not sure we need even more configuration knobs in sysfs. 

It's useful for enabling / disabling the functionality, as well as some form of tuning for slower I/O devices, so I think it is justifiable.

> How are users
> expected to find this configuration option? How should they decide whether
> to enable or to disable it?

I can send a V2 with some documentation if that's required.

> 
> Please take a look at this proposal and let me know whether this would solve
> the issue that you are looking into: "[LSF/MM/BPF Topic] Energy-Efficient I/O"
> (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-b845-c869287d3cf3@acm.org/).
> The only disadvantage of this approach
> compared to the cpuidle patch is that it requires RPM (runtime power
> management) to be enabled. Maybe I should look into modifying the
> approach such that it does not rely on RPM.

I've had a look; the scope of my patch is a bit wider. If my patch gets accepted I'm
going to look at putting the psd call into other devices (such as network devices) to
also stop deep states while those devices are busy. Since the code is very lightweight I
am hoping this will be relatively easy to use in various devices in the future.

Colin

> 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-23  9:18     ` Christian Loehle
@ 2025-03-26 15:14       ` King, Colin
  0 siblings, 0 replies; 12+ messages in thread
From: King, Colin @ 2025-03-26 15:14 UTC (permalink / raw)
  To: Christian Loehle, Jens Axboe, Rafael J. Wysocki, Daniel Lezcano,
	linux-block@vger.kernel.org, linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

Hi,

> -----Original Message-----
> From: Christian Loehle <christian.loehle@arm.com>
> Sent: 23 March 2025 09:19
> To: King, Colin <colin.king@intel.com>; Jens Axboe <axboe@kernel.dk>; Rafael
> J. Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
> linux-block@vger.kernel.org; linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
> 
> On 3/17/25 10:03, King, Colin wrote:
> > Hi Christian,
> >
> > Follow-up below:
> >
> >> -----Original Message-----
> >> From: Christian Loehle <christian.loehle@arm.com>
> >> Sent: 03 March 2025 22:25
> >> To: King, Colin <colin.king@intel.com>; Jens Axboe <axboe@kernel.dk>;
> >> Rafael J. Wysocki <rafael@kernel.org>; Daniel Lezcano
> >> <daniel.lezcano@linaro.org>; linux-block@vger.kernel.org;
> >> linux-pm@vger.kernel.org
> >> Cc: linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
> >> prevention for fast I/O devices
> >>
> >> On 3/3/25 16:43, Colin Ian King wrote:
> >>> Modern processors can drop into deep sleep states relatively quickly
> >>> to save power. However, coming out of deep sleep states takes a
> >>> small amount of time and this is detrimental to performance for I/O
> >>> devices such as fast PCIe NVME drives when servicing completed I/O
> >>> transactions.
> >>>
> >>> Testing with fio with read/write RAID0 PCIe NVME devices on various
> >>> modern SMP based systems (such as 96 thread Granite Rapids Xeon
> >>> 6741P) has shown that 85-90% of read/write transactions issued on
> >>> a CPU are completed by the same CPU, so it makes some sense to
> >>> prevent the CPU from dropping into a deep sleep state to help reduce
> >>> I/O handling latency.
> >>
> >> For the platform you tested on that may be true, but even if we
> >> constrain ourselves to pci-nvme there's a variety of queue/irq
> >> mappings where this doesn't hold I'm afraid.
> >
> > This code is optional: one can enable or disable it via the config
> > option. Also, even when it is built in, one can disable it by writing 0 to the
> sysfs file
> >   /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >
> >>
> >>>
> >>> This commit introduces a simple, lightweight and fast power sleep
> >>> demotion mechanism that provides the block layer a way to inform the
> >>> menu governor to prevent a CPU from going into a deep sleep when an
> >>> I/O operation is requested. While it is true that some I/Os may not
> >>
> >> s/requested/completed is the full truth, isn't it?
> >>
> >>> be serviced on the same CPU that issued the I/O request, and hence
> >>> the mechanism is not 100% perfect, it works well for the vast majority
> >>> of I/O operations and the sleep demotion prevention adds very little
> >>> overhead.
> >>>
> >>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME
> >>> md array using fio 3.35 performing random read and read-write test
> >>> on a 512GB file with 8 concurrent I/O jobs. Tested with the
> >>> NHM_C1_AUTO_DEMOTE bit set in MSR_PKG_CST_CONFIG_CONTROL set
> in
> >> the BIOS.
> >>>
> >>> Test case: random reads, results based on geometric mean of results
> >>> from
> >>> 5 test runs:
> >>>            Bandwidth         IO-ops   Latency   Bandwidth
> >>>            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> >>> Baseline:  21365755610       20377     390105   1.86%
> >>> Patched:   25950107558       24748     322905   0.16%
> >>
> >> What is the baseline?
> >> Do you mind trying with Rafael's recently posted series?
> >> Given the IOPS I'd expect good results from that alone already.
> >> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/
> >>
> >> (Happy to see teo as comparison too, which you don't modify).
> >
> > OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is
> > 6.14-rc5 without Rafael's patch. I also tested the Baseline with C6
> > and C6P disabled as this stops deeper C-state sleeps and in theory
> > should provide "best performance". I also benchmarked with Rafael's patch
> > and just my patch, and finally Rafael's patch AND my patch:
> >
> > Reads
> >                         Bandwidth       Bandwidth       latency         latency
> >                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> > Baseline                25689182477     0.15            326177          0.15
> > C6, C6P disabled        25839580014     0.19            324349          0.19
> > Rafael's patch:         25695523747     0.06            326150          0.06
> > My patch:               25782011833     0.07            324999          0.07
> > Rafael + my patch:      25792551514     0.10            324924          0.10
> 
> So these are mostly equal, right?

Not really. In this case we have a system under a lot of random read I/O. Disabling C6 increases
the read rate since we remove the latency from the sleep states. With my patch we approach the C6-disabled
rate. The results are an average of 5 runs and the metrics have fairly low jitter, so I believe the read
rates are reliable. We're getting a better read rate with my patch than the baseline, without the
power penalty of C6 being permanently disabled.

> 
> >
> > Writes
> >                         Bandwidth       Bandwidth       latency         latency
> >                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> > Baseline                15220468898     3.33            550290          3.36
> > C6, C6P disabled        13405624707     0.66            625424          0.66
> > Rafael's patch:         14017625200     1.15            598049          1.16
> > My patch:               15444417488     3.73            467818          29.10
> > Rafael + my patch:      14037711032     1.13            597143          1.13
> 
> These don't make sense to me. Why would C6/C6P be this bad?

I double checked; it's because I got the row labels mixed up when pasting the results
into the email. Doh! It should be:

Baseline                13405624707     0.66            625424          0.66
C6, C6P disabled        15220468898     3.33            550290          3.36

> Why would Rafael's patch make it worse?
> Are these just noise?



> 
> >
> > Combined Read+Writes, Reads
> >
> >                         Bandwidth       Bandwidth       latency         latency
> >                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> > Baseline                10132229433     0.41            484919          0.25
> > C6, C6P disabled        10567536346     0.60            515260          1.16
> > Rafael's patch:         10171044817     0.37            486937          0.20
> > My patch:               10468953527     0.07            504797          0.07
> > Rafael + my patch:      10174707546     1.26            488263          1.13
> >
> > Combined Read+Writes, Writes
> >
> >                         Bandwidth       Bandwidth       latency         latency
> >                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> > Baseline                10139393169     0.44            342132          1.23
> > C6, C6P disabled        10583264662     0.63            277052          3.87
> > Rafael's Patch:         10178275035     0.39            336989          0.94
> > My patch:               10482766569     1.28            294803          6.87
> > Rafael + My patch:      10183837235     0.38            330657          3.39
> 
> The above two indicate a +3% from (now mainline) Rafael's patch to yours.
> There's no reason why Rafael + your patch should be worse than just your
> patch.
> If this is statistically significant we would need to look into why.
> 
> FWIW I think the energy cost of essentially remaining in polling even for quite
> sporadic IO can't be overstated.
> A comparison with teo, which might do slightly better than menu while not outright
> disabling idle, would still be interesting.
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-26 15:04       ` King, Colin
@ 2025-03-26 15:14         ` Rafael J. Wysocki
  2025-03-26 16:26         ` Christian Loehle
  1 sibling, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2025-03-26 15:14 UTC (permalink / raw)
  To: King, Colin
  Cc: Bart Van Assche, Christian Loehle, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, Mar 26, 2025 at 4:04 PM King, Colin <colin.king@intel.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Bart Van Assche <bvanassche@acm.org>
> > Sent: 23 March 2025 12:36
> > To: King, Colin <colin.king@intel.com>; Christian Loehle
> > <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
> > Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
> > linux-block@vger.kernel.org; linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> > fast I/O devices
> >
> > On 3/17/25 3:03 AM, King, Colin wrote:
> > > This code is optional, one can enable it or disable it via the config
> > > option. Also, even when it is built-in one can disable it by writing 0 to the
> > sysfs file
> > >    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >
> > I'm not sure we need even more configuration knobs in sysfs.
>
> It's useful for enabling / disabling the functionality, as well as some form of tuning for slower I/O devices, so I think it is justifiable.
>
> > How are users
> > expected to find this configuration option? How should they decide whether
> > to enable or to disable it?
>
> I can send a V2 with some documentation if that's required.

It would be useful because the original patch didn't get to linux-pm
(among other things).

> >
> > Please take a look at this proposal and let me know whether this would solve
> > the issue that you are looking into: "[LSF/MM/BPF Topic] Energy- Efficient I/O"
> > (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-
> > b845-c869287d3cf3@acm.org/). The only disadvantage of this approach
> > compared to the cpuidle patch is that it requires RPM (runtime power
> > management) to be enabled. Maybe I should look into modifying the
> > approach such that it does not rely on RPM.
>
> I've had a look, the scope of my patch is a bit wider.  If my patch gets accepted I'm
> going to also look at putting the psd call into other devices (such as network devices) to
> also stop deep states while these devices are busy.  Since the code is very lightweight I
> was hoping this was going to be relatively easy and simple to use in various devices in the future.

But I'm generally wondering why this is using a completely new
mechanism instead of CPU latency QoS which is there and why is menu
the only governor targeted by it?


* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-26 15:04       ` King, Colin
  2025-03-26 15:14         ` Rafael J. Wysocki
@ 2025-03-26 16:26         ` Christian Loehle
  2025-03-26 17:46           ` Rafael J. Wysocki
  2025-04-01 15:03           ` King, Colin
  1 sibling, 2 replies; 12+ messages in thread
From: Christian Loehle @ 2025-03-26 16:26 UTC (permalink / raw)
  To: King, Colin, Bart Van Assche, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

On 3/26/25 15:04, King, Colin wrote:
> Hi,
> 
>> -----Original Message-----
>> From: Bart Van Assche <bvanassche@acm.org>
>> Sent: 23 March 2025 12:36
>> To: King, Colin <colin.king@intel.com>; Christian Loehle
>> <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
>> Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
>> linux-block@vger.kernel.org; linux-pm@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
>> fast I/O devices
>>
>> On 3/17/25 3:03 AM, King, Colin wrote:
>>> This code is optional, one can enable it or disable it via the config
>>> option. Also, even when it is built-in one can disable it by writing 0 to the
>> sysfs file
>>>    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
>>
>> I'm not sure we need even more configuration knobs in sysfs. 
> 
> It's useful for enabling / disabling the functionality, as well as some form of tuning for slower I/O devices, so I think it is justifiable.
> 
>> How are users
>> expected to find this configuration option? How should they decide whether
>> to enable or to disable it?
> 
> I can send a V2 with some documentation if that's required.
> 
>>
>> Please take a look at this proposal and let me know whether this would solve
>> the issue that you are looking into: "[LSF/MM/BPF Topic] Energy- Efficient I/O"
>> (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-
>> b845-c869287d3cf3@acm.org/). The only disadvantage of this approach
>> compared to the cpuidle patch is that it requires RPM (runtime power
>> management) to be enabled. Maybe I should look into modifying the
>> approach such that it does not rely on RPM.
> 
> I've had a look, the scope of my patch is a bit wider.  If my patch gets accepted I'm 
> going to also look at putting the psd call into other devices (such as network devices) to 
> also stop deep states while these devices are busy.  Since the code is very lightweight I
> was hoping this was going to be relatively easy and simple to use in various devices in the future.

IMO this needs to be a lot more fine-grained then, both in terms of which devices or even
IO is affected (Surely some IO is fine with at least *some* latency) but also how aggressive
we are in blocking.
Just looking at some common latency/residency of idle states out there I don't think
it's reasonable to force polling for a 3-10ms (rounding up with the jiffie) period.
Playing devil's advocate if the system is under some thermal/power pressure we might
actually reduce throughput by burning so much power on this.
This seems like the stuff that is easily convincing because it improves throughput and
then taking care of power afterwards is really hard. :/




* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-26 16:26         ` Christian Loehle
@ 2025-03-26 17:46           ` Rafael J. Wysocki
  2025-04-01 15:03           ` King, Colin
  1 sibling, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2025-03-26 17:46 UTC (permalink / raw)
  To: Christian Loehle
  Cc: King, Colin, Bart Van Assche, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, Mar 26, 2025 at 5:26 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 3/26/25 15:04, King, Colin wrote:
> > Hi,
> >
> >> -----Original Message-----
> >> From: Bart Van Assche <bvanassche@acm.org>
> >> Sent: 23 March 2025 12:36
> >> To: King, Colin <colin.king@intel.com>; Christian Loehle
> >> <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
> >> Wysocki <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>;
> >> linux-block@vger.kernel.org; linux-pm@vger.kernel.org
> >> Cc: linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> >> fast I/O devices
> >>
> >> On 3/17/25 3:03 AM, King, Colin wrote:
> >>> This code is optional, one can enable it or disable it via the config
> >>> option. Also, even when it is built-in one can disable it by writing 0 to the
> >> sysfs file
> >>>    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >>
> >> I'm not sure we need even more configuration knobs in sysfs.
> >
> > It's useful for enabling / disabling the functionality, as well as some form of tuning for slower I/O devices, so I think it is justifiable.
> >
> >> How are users
> >> expected to find this configuration option? How should they decide whether
> >> to enable or to disable it?
> >
> > I can send a V2 with some documentation if that's required.
> >
> >>
> >> Please take a look at this proposal and let me know whether this would solve
> >> the issue that you are looking into: "[LSF/MM/BPF Topic] Energy- Efficient I/O"
> >> (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-
> >> b845-c869287d3cf3@acm.org/). The only disadvantage of this approach
> >> compared to the cpuidle patch is that it requires RPM (runtime power
> >> management) to be enabled. Maybe I should look into modifying the
> >> approach such that it does not rely on RPM.
> >
> > I've had a look, the scope of my patch is a bit wider.  If my patch gets accepted I'm
> > going to also look at putting the psd call into other devices (such as network devices) to
> > also stop deep states while these devices are busy.  Since the code is very lightweight I
> > was hoping this was going to be relatively easy and simple to use in various devices in the future.
>
> IMO this needs to be a lot more fine-grained then, both in terms of which devices or even
> IO is affected (Surely some IO is fine with at least *some* latency) but also how aggressive
> we are in blocking.
> Just looking at some common latency/residency of idle states out there I don't think
> it's reasonable to force polling for a 3-10ms (rounding up with the jiffie) period.
> Playing devil's advocate if the system is under some thermal/power pressure we might
> actually reduce throughput by burning so much power on this.
> This seems like the stuff that is easily convincing because it improves throughput and
> then taking care of power afterwards is really hard. :/

I agree and recall the iowait thing that you've recently eliminated
from the menu governor.  Its purpose was ostensibly very similar and
it had similar issues.

Besides, any piece of kernel code today can add a CPU latency QoS
request, either per CPU or globally, and manage it as desired.  The
hard part is to know when to use it and what limit to set through it.
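
For reference, holding such a request from kernel code with the existing API is
roughly the following (an illustrative sketch, not code from any posted patch;
the io_qos_* helper names are made up, the cpu_latency_qos_* calls are the
real API from linux/pm_qos.h):

```c
/* Hedged sketch: holding and releasing a global CPU wakeup-latency
 * constraint from kernel code via the existing CPU latency QoS API. */
#include <linux/pm_qos.h>

static struct pm_qos_request io_qos_req;

static void io_qos_init(void)
{
	/* Register with "no constraint" so it costs nothing until used. */
	cpu_latency_qos_add_request(&io_qos_req, PM_QOS_DEFAULT_VALUE);
}

static void io_qos_busy(void)
{
	/* Cap CPU exit latency at ~20us while I/O is in flight, keeping
	 * the governor in shallow states without forcing polling. */
	cpu_latency_qos_update_request(&io_qos_req, 20);
}

static void io_qos_idle(void)
{
	/* Drop back to "don't care" once the I/O burst is over. */
	cpu_latency_qos_update_request(&io_qos_req, PM_QOS_DEFAULT_VALUE);
}
```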


* RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-03-26 16:26         ` Christian Loehle
  2025-03-26 17:46           ` Rafael J. Wysocki
@ 2025-04-01 15:03           ` King, Colin
  2025-04-01 15:15             ` Christian Loehle
  1 sibling, 1 reply; 12+ messages in thread
From: King, Colin @ 2025-04-01 15:03 UTC (permalink / raw)
  To: Christian Loehle, Bart Van Assche, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

Hi,

Reply at end..

> -----Original Message-----
> From: Christian Loehle <christian.loehle@arm.com>
> Sent: 26 March 2025 16:27
> To: King, Colin <colin.king@intel.com>; Bart Van Assche
> <bvanassche@acm.org>; Jens Axboe <axboe@kernel.dk>; Rafael J. Wysocki
> <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>; linux-
> block@vger.kernel.org; linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
> 
> On 3/26/25 15:04, King, Colin wrote:
> > Hi,
> >
> >> -----Original Message-----
> >> From: Bart Van Assche <bvanassche@acm.org>
> >> Sent: 23 March 2025 12:36
> >> To: King, Colin <colin.king@intel.com>; Christian Loehle
> >> <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
> >> Wysocki <rafael@kernel.org>; Daniel Lezcano
> >> <daniel.lezcano@linaro.org>; linux-block@vger.kernel.org;
> >> linux-pm@vger.kernel.org
> >> Cc: linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
> >> prevention for fast I/O devices
> >>
> >> On 3/17/25 3:03 AM, King, Colin wrote:
> >>> This code is optional, one can enable it or disable it via the
> >>> config option. Also, even when it is built-in one can disable it by
> >>> writing 0 to the
> >> sysfs file
> >>>    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >>
> >> I'm not sure we need even more configuration knobs in sysfs.
> >
> > It's useful for enabling / disabling the functionality, as well as some form of
> tuning for slower I/O devices, so I think it is justifiable.
> >
> >> How are users
> >> expected to find this configuration option? How should they decide
> >> whether to enable or to disable it?
> >
> > I can send a V2 with some documentation if that's required.
> >
> >>
> >> Please take a look at this proposal and let me know whether this
> >> would solve the issue that you are looking into: "[LSF/MM/BPF Topic]
> Energy- Efficient I/O"
> >> (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-
> >> b845-c869287d3cf3@acm.org/). The only disadvantage of this approach
> >> compared to the cpuidle patch is that it requires RPM (runtime power
> >> management) to be enabled. Maybe I should look into modifying the
> >> approach such that it does not rely on RPM.
> >
> > I've had a look, the scope of my patch is a bit wider.  If my patch
> > gets accepted I'm going to also look at putting the psd call into
> > other devices (such as network devices) to also stop deep states while
> > these devices are busy.  Since the code is very lightweight I was hoping this
> was going to be relatively easy and simple to use in various devices in the
> future.
> 
> IMO this needs to be a lot more fine-grained then, both in terms of which
> devices or even IO is affected (Surely some IO is fine with at least *some*
> latency) but also how aggressive we are in blocking.
> Just looking at some common latency/residency of idle states out there I don't
> think it's reasonable to force polling for a 3-10ms (rounding up with the jiffie)
> period.

The current solution by a customer is that they are resorting to disabling C6/C6P and hence
all the CPUs are essentially in a non-low power state all the time.  The opt-in solution 
provided in the patch provides nearly the same performance and will re-enable deeper
C-states once the I/O is completed.

As I mentioned earlier, the jiffies are used because it's low-touch and very fast with negligible
impact on the I/O paths. Using finer grained timing is far more an expensive operation and
is a huge overhead on very fast I/O devices.

Also, this is a user config and tune-able choice. Users can opt-in to using this if they want
to pay for the extra CPU overhead for a bit more I/O performance. If they don't want it, they
don't need to enable it.

> Playing devil's advocate if the system is under some thermal/power pressure
> we might actually reduce throughput by burning so much power on this.
> This seems like the stuff that is easily convincing because it improves
> throughput and then taking care of power afterwards is really hard. :/
> 

The current workaround applies when the user is trying to get maximum bandwidth: with C6/C6P 
disabled they are already keeping the system busy. This solution at least will save power when I/O is idling.

Colin 


* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-04-01 15:03           ` King, Colin
@ 2025-04-01 15:15             ` Christian Loehle
  2025-04-01 16:41               ` Rafael J. Wysocki
  0 siblings, 1 reply; 12+ messages in thread
From: Christian Loehle @ 2025-04-01 15:15 UTC (permalink / raw)
  To: King, Colin, Bart Van Assche, Jens Axboe, Rafael J. Wysocki,
	Daniel Lezcano, linux-block@vger.kernel.org,
	linux-pm@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

On 4/1/25 16:03, King, Colin wrote:
> Hi,
> 
> Reply at end..
> 
>> -----Original Message-----
>> From: Christian Loehle <christian.loehle@arm.com>
>> Sent: 26 March 2025 16:27
>> To: King, Colin <colin.king@intel.com>; Bart Van Assche
>> <bvanassche@acm.org>; Jens Axboe <axboe@kernel.dk>; Rafael J. Wysocki
>> <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>; linux-
>> block@vger.kernel.org; linux-pm@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
>> fast I/O devices
>>
>> On 3/26/25 15:04, King, Colin wrote:
>>> Hi,
>>>
>>>> -----Original Message-----
>>>> From: Bart Van Assche <bvanassche@acm.org>
>>>> Sent: 23 March 2025 12:36
>>>> To: King, Colin <colin.king@intel.com>; Christian Loehle
>>>> <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
>>>> Wysocki <rafael@kernel.org>; Daniel Lezcano
>>>> <daniel.lezcano@linaro.org>; linux-block@vger.kernel.org;
>>>> linux-pm@vger.kernel.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
>>>> prevention for fast I/O devices
>>>>
>>>> On 3/17/25 3:03 AM, King, Colin wrote:
>>>>> This code is optional, one can enable it or disable it via the
>>>>> config option. Also, even when it is built-in one can disable it by
>>>>> writing 0 to the
>>>> sysfs file
>>>>>    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
>>>>
>>>> I'm not sure we need even more configuration knobs in sysfs.
>>>
>>> It's useful for enabling / disabling the functionality, as well as some form of
>> tuning for slower I/O devices, so I think it is justifiable.
>>>
>>>> How are users
>>>> expected to find this configuration option? How should they decide
>>>> whether to enable or to disable it?
>>>
>>> I can send a V2 with some documentation if that's required.
>>>
>>>>
>>>> Please take a look at this proposal and let me know whether this
>>>> would solve the issue that you are looking into: "[LSF/MM/BPF Topic]
>> Energy- Efficient I/O"
>>>> (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-
>>>> b845-c869287d3cf3@acm.org/). The only disadvantage of this approach
>>>> compared to the cpuidle patch is that it requires RPM (runtime power
>>>> management) to be enabled. Maybe I should look into modifying the
>>>> approach such that it does not rely on RPM.
>>>
>>> I've had a look, the scope of my patch is a bit wider.  If my patch
>>> gets accepted I'm going to also look at putting the psd call into
>>> other devices (such as network devices) to also stop deep states while
>>> these devices are busy.  Since the code is very lightweight I was hoping this
>> was going to be relatively easy and simple to use in various devices in the
>> future.
>>
>> IMO this needs to be a lot more fine-grained then, both in terms of which
>> devices or even IO is affected (Surely some IO is fine with at least *some*
>> latency) but also how aggressive we are in blocking.
>> Just looking at some common latency/residency of idle states out there I don't
>> think it's reasonable to force polling for a 3-10ms (rounding up with the jiffie)
>> period.
> 
> The current solution by a customer is that they are resorting to disabling C6/C6P and hence
> all the CPUs are essentially in a non-low power state all the time.  The opt-in solution 
> provided in the patch provides nearly the same performance and will re-enable deeper
> C-states once the I/O is completed.
> 
> As I mentioned earlier, the jiffies are used because it's low-touch and very fast with negligible
> impact on the I/O paths. Using finer grained timing is far more an expensive operation and
> is a huge overhead on very fast I/O devices.
> 
> Also, this is a user config and tune-able choice. Users can opt-in to using this if they want
> to pay for the extra CPU overhead for a bit more I/O performance. If they don't want it, they
> don't need to enable it.
> 
>> Playing devil's advocate if the system is under some thermal/power pressure
>> we might actually reduce throughput by burning so much power on this.
>> This seems like the stuff that is easily convincing because it improves
>> throughput and then taking care of power afterwards is really hard. :/
>>
> 
> The current workaround applies when the user is trying to get maximum bandwidth: with C6/C6P 
> disabled they are already keeping the system busy. This solution at least will save power when I/O is idling.
> 

No. They can set the pm qos latency constraint when they care about 'maximum bandwidth'
and remove the constraint when they don't.
If they just disable the idle states at boot and never re-enable them, at least they have no
grounds to complain to kernel people; they should know what they're doing is detrimental
to power.

Furthermore we might be better off disabling C6/C6P than staying in a polling state (whenever we've
completed an IO in the last ~5 to 20ms, depending on the HZ setting).
Again, the wastefulness of a polling state can hardly be overestimated, especially given
that it doesn't seem to be necessary at all here?


* Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices
  2025-04-01 15:15             ` Christian Loehle
@ 2025-04-01 16:41               ` Rafael J. Wysocki
  0 siblings, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2025-04-01 16:41 UTC (permalink / raw)
  To: Christian Loehle, King, Colin
  Cc: Bart Van Assche, Jens Axboe, Rafael J. Wysocki, Daniel Lezcano,
	linux-block@vger.kernel.org, linux-pm@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Tue, Apr 1, 2025 at 5:15 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 4/1/25 16:03, King, Colin wrote:
> > Hi,
> >
> > Reply at end..
> >
> >> -----Original Message-----
> >> From: Christian Loehle <christian.loehle@arm.com>
> >> Sent: 26 March 2025 16:27
> >> To: King, Colin <colin.king@intel.com>; Bart Van Assche
> >> <bvanassche@acm.org>; Jens Axboe <axboe@kernel.dk>; Rafael J. Wysocki
> >> <rafael@kernel.org>; Daniel Lezcano <daniel.lezcano@linaro.org>; linux-
> >> block@vger.kernel.org; linux-pm@vger.kernel.org
> >> Cc: linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> >> fast I/O devices
> >>
> >> On 3/26/25 15:04, King, Colin wrote:
> >>> Hi,
> >>>
> >>>> -----Original Message-----
> >>>> From: Bart Van Assche <bvanassche@acm.org>
> >>>> Sent: 23 March 2025 12:36
> >>>> To: King, Colin <colin.king@intel.com>; Christian Loehle
> >>>> <christian.loehle@arm.com>; Jens Axboe <axboe@kernel.dk>; Rafael J.
> >>>> Wysocki <rafael@kernel.org>; Daniel Lezcano
> >>>> <daniel.lezcano@linaro.org>; linux-block@vger.kernel.org;
> >>>> linux-pm@vger.kernel.org
> >>>> Cc: linux-kernel@vger.kernel.org
> >>>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
> >>>> prevention for fast I/O devices
> >>>>
> >>>> On 3/17/25 3:03 AM, King, Colin wrote:
> >>>>> This code is optional, one can enable it or disable it via the
> >>>>> config option. Also, even when it is built-in one can disable it by
> >>>>> writing 0 to the
> >>>> sysfs file
> >>>>>    /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >>>>
> >>>> I'm not sure we need even more configuration knobs in sysfs.
> >>>
> >>> It's useful for enabling / disabling the functionality, as well as some form of
> >> tuning for slower I/O devices, so I think it is justifiable.
> >>>
> >>>> How are users
> >>>> expected to find this configuration option? How should they decide
> >>>> whether to enable or to disable it?
> >>>
> >>> I can send a V2 with some documentation if that's required.
> >>>
> >>>>
> >>>> Please take a look at this proposal and let me know whether this
> >>>> would solve the issue that you are looking into: "[LSF/MM/BPF Topic]
> >> Energy- Efficient I/O"
> >>>> (https://lore.kernel.org/linux-block/ad1018b6-7c0b-4d70-
> >>>> b845-c869287d3cf3@acm.org/). The only disadvantage of this approach
> >>>> compared to the cpuidle patch is that it requires RPM (runtime power
> >>>> management) to be enabled. Maybe I should look into modifying the
> >>>> approach such that it does not rely on RPM.
> >>>
> >>> I've had a look, the scope of my patch is a bit wider.  If my patch
> >>> gets accepted I'm going to also look at putting the psd call into
> >>> other devices (such as network devices) to also stop deep states while
> >>> these devices are busy.  Since the code is very lightweight I was hoping this
> >> was going to be relatively easy and simple to use in various devices in the
> >> future.
> >>
> >> IMO this needs to be a lot more fine-grained then, both in terms of which
> >> devices or even IO is affected (Surely some IO is fine with at least *some*
> >> latency) but also how aggressive we are in blocking.
> >> Just looking at some common latency/residency of idle states out there I don't
> >> think it's reasonable to force polling for a 3-10ms (rounding up with the jiffie)
> >> period.
> >
> > The current solution by a customer is that they are resorting to disabling C6/C6P and hence
> > all the CPUs are essentially in a non-low power state all the time.  The opt-in solution
> > provided in the patch provides nearly the same performance and will re-enable deeper
> > C-states once the I/O is completed.
> >
> > As I mentioned earlier, the jiffies are used because it's low-touch and very fast with negligible
> > impact on the I/O paths. Using finer grained timing is far more an expensive operation and
> > is a huge overhead on very fast I/O devices.
> >
> > Also, this is a user config and tune-able choice. Users can opt-in to using this if they want
> > to pay for the extra CPU overhead for a bit more I/O performance. If they don't want it, they
> > don't need to enable it.
> >
> >> Playing devil's advocate if the system is under some thermal/power pressure
> >> we might actually reduce throughput by burning so much power on this.
> >> This seems like the stuff that is easily convincing because it improves
> >> throughput and then taking care of power afterwards is really hard. :/
> >>
> >
> > The current workaround applies when the user is trying to get maximum bandwidth: with C6/C6P
> > disabled they are already keeping the system busy. This solution at least will save power when I/O is idling.
> >
>
> No. They can set the pm qos latency constraint when they care about 'maximum bandwidth'
> and remove the constraint when they don't.
> If they just disable the idle states at boot and never re-enable them, at least they have no
> grounds to complain to kernel people; they should know what they're doing is detrimental
> to power.

I think that they know about it, but they are complaining that there's
no alternative.

Well, this is not entirely true because of the PM QoS latency
constraints that can be used, but this requires admin access which
isn't popular.

Hence (I think) the idea to add an opt-in that will effectively
disable C6/C6P temporarily when I/O is pending (in progress, queued up
or whatever), but this may very well use the existing PM QoS latency
constraints framework under the hood instead of adding something new.
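
For reference, the userspace side of that framework is just the
/dev/cpu_dma_latency misc device: write a 32-bit microsecond value and keep the
file descriptor open to hold the constraint, which is the root-only step
mentioned above. A hedged sketch (the helper name is made up, and the device
path is a parameter purely so it can be exercised against an ordinary file
without root):

```c
/* Hedged sketch: hold a global CPU wakeup-latency constraint from
 * userspace.  The constraint lasts as long as the returned fd stays
 * open; closing it drops back to "don't care". */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int hold_cpu_latency(const char *dev, int32_t usec)
{
	int fd = open(dev, O_WRONLY);

	if (fd < 0)
		return -1;
	/* The kernel expects a raw 32-bit microsecond value. */
	if (write(fd, &usec, sizeof(usec)) != (ssize_t)sizeof(usec)) {
		close(fd);
		return -1;
	}
	return fd;	/* caller keeps this open to hold the constraint */
}
```

e.g. hold_cpu_latency("/dev/cpu_dma_latency", 20) for the duration of a
benchmark run, then close the fd to release the constraint.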

> Furthermore we might be better off disabling C6/C6P than staying in a polling state (whenever we've
> completed an IO in the last ~5 to 20ms, depending on the HZ setting).
> Again, the wastefulness of a polling state can hardly be overestimated, especially given
> that it doesn't seem to be necessary at all here?

I don't think it is necessary or at least the need for it would need
to be demonstrated.  In fact, on x86, every time the poll state is
selected, it turns out to be overly shallow, at least in the results
I've seen so far.

Most likely, it would be sufficient to have a CPU latency PM QoS
request registered per I/O entity and set the limit for it slightly
above the C1 (or C1E) exit latency when there's I/O to process.  After
completing the I/O, the request can be updated to the "don't care"
value.


end of thread, other threads:[~2025-04-01 16:41 UTC | newest]

Thread overview: 12+ messages
     [not found] <33882f284ac6e6d1ec766ca4bb2f3b88@intel.com>
2025-03-03 22:24 ` [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices Christian Loehle
2025-03-17 10:03   ` King, Colin
2025-03-23  9:18     ` Christian Loehle
2025-03-26 15:14       ` King, Colin
2025-03-23 12:35     ` Bart Van Assche
2025-03-26 15:04       ` King, Colin
2025-03-26 15:14         ` Rafael J. Wysocki
2025-03-26 16:26         ` Christian Loehle
2025-03-26 17:46           ` Rafael J. Wysocki
2025-04-01 15:03           ` King, Colin
2025-04-01 15:15             ` Christian Loehle
2025-04-01 16:41               ` Rafael J. Wysocki
