linux-block.vger.kernel.org archive mirror
* [RFC PATCH 0/2] block: CPU latency PM QoS tuning
From: Tero Kristo @ 2024-08-29  7:18 UTC
  To: axboe; +Cc: linux-block, linux-kernel

Hello,

These patches introduce a mechanism for limiting deep CPU idle states
during block IO. With certain workloads, a CPU can enter a deep idle
state while waiting for IO completion, which adds a large latency to
handling the completion interrupt. See the example below, where I used
an Intel Icelake Xeon system to run a simple 'fio' test with random
reads, with the CPU C6 state enabled and disabled (results from
2 * 2min runs each):

C6 enabled:
    slat (nsec): min=1769, max=73247, avg=6960.96, stdev=2115.90
    clat (nsec): min=442, max=242706, avg=23767.06, stdev=13348.74
     lat (usec): min=12, max=250, avg=30.73, stdev=13.96

    slat (nsec): min=1849, max=58824, avg=6970.61, stdev=2134.38
    clat (nsec): min=1684, max=241880, avg=23545.68, stdev=13448.87
     lat (usec): min=12, max=249, avg=30.52, stdev=14.03

C6 disabled:
    slat (nsec): min=2110, max=57871, avg=6867.86, stdev=1711.55
    clat (nsec): min=486, max=98292, avg=22185.50, stdev=10473.34
     lat (usec): min=13, max=105, avg=29.05, stdev=10.99

    slat (nsec): min=2128, max=67730, avg=6913.52, stdev=1714.89
    clat (nsec): min=552, max=93409, avg=22582.50, stdev=10407.53
     lat (usec): min=13, max=108, avg=29.50, stdev=10.93

The maximum completion latency (clat) with C6 enabled is about 2.5x
that seen with C6 disabled.
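
As a quick check of that ratio, the sketch below extracts the max= field
from the clat lines quoted above and compares the per-run maxima averaged
over the two runs in each configuration. The helper names (max_of,
worst_case_ratio) are made up for illustration and are not part of fio or
of the patches.

```python
import re

# fio "clat" summary lines copied from the runs quoted above (values in nsec).
C6_ENABLED = [
    "clat (nsec): min=442, max=242706, avg=23767.06, stdev=13348.74",
    "clat (nsec): min=1684, max=241880, avg=23545.68, stdev=13448.87",
]
C6_DISABLED = [
    "clat (nsec): min=486, max=98292, avg=22185.50, stdev=10473.34",
    "clat (nsec): min=552, max=93409, avg=22582.50, stdev=10407.53",
]


def max_of(line):
    """Return the integer max= value from one fio latency summary line."""
    match = re.search(r"max=(\d+)", line)
    if match is None:
        raise ValueError(f"no max= field in: {line!r}")
    return int(match.group(1))


def worst_case_ratio(enabled, disabled):
    """Ratio of the mean per-run maximum latency, enabled over disabled."""
    mean_enabled = sum(max_of(l) for l in enabled) / len(enabled)
    mean_disabled = sum(max_of(l) for l in disabled) / len(disabled)
    return mean_enabled / mean_disabled


# Hypothetical helper names; only the input numbers come from the runs above.
print(f"max clat ratio, C6 enabled vs disabled: "
      f"{worst_case_ratio(C6_ENABLED, C6_DISABLED):.2f}x")
```

Averaging the two runs is only one reasonable way to summarize them;
comparing the runs pairwise gives a similar figure.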

Now, the patches provided here introduce a mechanism for the block
layer to limit the maximum CPU latencies, with user-configurable
sysfs knobs per block device. I used the following configuration on
my test system:

  /sys/block/nvme0n1/cpu_lat_limit_us = 10
  /sys/block/nvme0n1/cpu_lat_timeout_ms = 3

This limits the maximum CPU latency for the active CPUs doing block IO
to 10us, and the limit is removed if there is no block IO for 3ms.
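
For scripting the configuration, a minimal sketch follows. It assumes a
kernel with these RFC patches applied, since the cpu_lat_limit_us and
cpu_lat_timeout_ms files do not exist otherwise, and it assumes the knobs
accept plain decimal integers. The function name set_block_cpu_latency and
its sysfs_root parameter are hypothetical and only there so the code can be
exercised against a scratch directory.

```python
from pathlib import Path


def set_block_cpu_latency(dev, limit_us, timeout_ms, sysfs_root="/sys/block"):
    """Write the two per-device knobs described above.

    Assumption: the knobs take decimal integers. On a kernel without the
    RFC patches the writes fail because sysfs will not create new files.
    """
    if limit_us < 0 or timeout_ms < 0:
        raise ValueError("limit_us and timeout_ms must be non-negative")

    dev_dir = Path(sysfs_root) / dev
    if not dev_dir.is_dir():
        raise FileNotFoundError(f"no such block device directory: {dev_dir}")

    # Maximum CPU latency allowed on CPUs actively doing block IO (usec).
    (dev_dir / "cpu_lat_limit_us").write_text(f"{limit_us}\n")
    # Idle time without block IO after which the limit is dropped (msec).
    (dev_dir / "cpu_lat_timeout_ms").write_text(f"{timeout_ms}\n")
```

Called as set_block_cpu_latency("nvme0n1", 10, 3) with root privileges,
this would apply the same values as the configuration shown above.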

Running the same fio test used above with C6 enabled, I get:

    slat (nsec): min=1887, max=71037, avg=7239.68, stdev=1850.67
    clat (nsec): min=438, max=103628, avg=22488.75, stdev=10457.86
     lat (usec): min=12, max=133, avg=29.73, stdev=11.04

    slat (nsec): min=1942, max=69159, avg=7194.01, stdev=1788.63
    clat (nsec): min=418, max=115739, avg=22239.51, stdev=10448.37
     lat (usec): min=12, max=123, avg=29.43, stdev=10.96

... so the maximum latencies are cut by approximately 100us and are quite
close to the levels seen with C6 completely disabled system-wide.

Any thoughts about the patches and the approach taken?

-Tero

Thread overview: 10+ messages
2024-08-29  7:18 [RFC PATCH 0/2] block: CPU latency PM QoS tuning Tero Kristo
2024-08-29  7:18 ` [RFC PATCH 1/2] bdev: add support for " Tero Kristo
2024-08-29 11:37   ` Jens Axboe
2024-08-30 11:55     ` Tero Kristo
2024-08-30 14:26       ` Ming Lei
2024-09-04 11:37         ` Tero Kristo
2024-08-29  7:18 ` [RFC PATCH 2/2] block/genhd: add sysfs knobs for the CPU latency PM QoS settings Tero Kristo
2024-08-29 11:04 ` [RFC PATCH 0/2] block: CPU latency PM QoS tuning Bart Van Assche
2024-08-30 12:01   ` Tero Kristo
2024-09-04 11:35   ` Tero Kristo