* [RFC PATCH 0/2] block: CPU latency PM QoS tuning
From: Tero Kristo @ 2024-08-29  7:18 UTC
  To: axboe; +Cc: linux-block, linux-kernel

Hello,

These patches introduce a mechanism for limiting deep CPU idle states
during block IO. With certain workloads, the CPU can enter a deep idle
state while waiting for IO completion, adding large latency to the
completion interrupt. See the example below, where I used an Intel
Icelake Xeon system to run a simple 'fio' test with random reads, with
the CPU C6 state disabled and enabled (results from 2 * 2min runs):

C6 enabled:
    slat (nsec): min=1769, max=73247, avg=6960.96, stdev=2115.90
    clat (nsec): min=442, max=242706, avg=23767.06, stdev=13348.74
     lat (usec): min=12, max=250, avg=30.73, stdev=13.96

    slat (nsec): min=1849, max=58824, avg=6970.61, stdev=2134.38
    clat (nsec): min=1684, max=241880, avg=23545.68, stdev=13448.87
     lat (usec): min=12, max=249, avg=30.52, stdev=14.03

C6 disabled:
    slat (nsec): min=2110, max=57871, avg=6867.86, stdev=1711.55
    clat (nsec): min=486, max=98292, avg=22185.50, stdev=10473.34
     lat (usec): min=13, max=105, avg=29.05, stdev=10.99

    slat (nsec): min=2128, max=67730, avg=6913.52, stdev=1714.89
    clat (nsec): min=552, max=93409, avg=22582.50, stdev=10407.53
     lat (usec): min=13, max=108, avg=29.50, stdev=10.93

The maximum latency with C6 enabled (250 us) is about 2.4x that seen
with C6 disabled (~105 us).

Now, the patches provided here introduce a mechanism for the block
layer to limit the maximum CPU wakeup latency, with user-configurable
sysfs knobs per block device. Applying the following configuration on
my test system:

  echo 10 > /sys/block/nvme0n1/cpu_lat_limit_us
  echo 3 > /sys/block/nvme0n1/cpu_lat_timeout_ms

limits the maximum CPU latency for the active CPUs doing block IO to
10us; the limit is removed again once there has been no block IO for 3ms.

Running the same fio test used above with C6 enabled, I get:

    slat (nsec): min=1887, max=71037, avg=7239.68, stdev=1850.67
    clat (nsec): min=438, max=103628, avg=22488.75, stdev=10457.86
     lat (usec): min=12, max=133, avg=29.73, stdev=11.04

    slat (nsec): min=1942, max=69159, avg=7194.01, stdev=1788.63
    clat (nsec): min=418, max=115739, avg=22239.51, stdev=10448.37
     lat (usec): min=12, max=123, avg=29.43, stdev=10.96

... so the maximum latencies are cut by roughly 120us and come quite
close to the levels seen with C6 disabled system-wide.
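
For background, these patches build on the per-device PM QoS framework.
A minimal standalone sketch of the underlying pattern (illustrative
names only, not the patch code itself):

#include <linux/cpu.h>
#include <linux/pm_qos.h>

/*
 * Illustrative only: cap the wakeup latency of one CPU. A
 * DEV_PM_QOS_RESUME_LATENCY request makes cpuidle skip any idle state
 * whose exit latency exceeds the limit, so a completion interrupt is
 * not delayed by a deep C-state exit.
 */
static struct dev_pm_qos_request example_req;

static int example_limit_cpu_latency(int cpu, s32 limit_us)
{
	return dev_pm_qos_add_request(get_cpu_device(cpu), &example_req,
				      DEV_PM_QOS_RESUME_LATENCY, limit_us);
}

static void example_release_cpu_latency(void)
{
	if (dev_pm_qos_request_active(&example_req))
		dev_pm_qos_remove_request(&example_req);
}

The patches apply this per-CPU, adding the request when IO starts and
removing it from delayed work once the timeout expires.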

Any thoughts about the patches and the approach taken?

-Tero



* [RFC PATCH 1/2] bdev: add support for CPU latency PM QoS tuning
From: Tero Kristo @ 2024-08-29  7:18 UTC
  To: axboe; +Cc: linux-block, linux-kernel

Add support for limiting CPU wakeup latency while block IO is running.
When block IO is started, a user-configurable CPU latency limit is put
in place (if one is set). The limit is removed by a configurable
timeout mechanism once the IO stops.

Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
---
 block/bdev.c              | 51 +++++++++++++++++++++++++++++++++++++++
 block/bio.c               |  2 ++
 block/blk.h               |  2 ++
 include/linux/blk_types.h | 12 +++++++++
 4 files changed, 67 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index 353677ac49b3..8f20681a4ea6 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -405,10 +405,18 @@ void __init bdev_cache_init(void)
 	blockdev_superblock = blockdev_mnt->mnt_sb;   /* For writeback */
 }
 
+static void bdev_pm_qos_work(struct work_struct *work)
+{
+	struct bdev_cpu_latency_qos *qos =
+		container_of(work, struct bdev_cpu_latency_qos, work.work);
+	dev_pm_qos_remove_request(&qos->req);
+}
+
 struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 {
 	struct block_device *bdev;
 	struct inode *inode;
+	int cpu;
 
 	inode = new_inode(blockdev_superblock);
 	if (!inode)
@@ -433,6 +441,16 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		return NULL;
 	}
 	bdev->bd_disk = disk;
+	bdev->bd_pm_qos = alloc_percpu(struct bdev_cpu_latency_qos);
+	if (!bdev->bd_pm_qos) {
+		free_percpu(bdev->bd_stats);
+		iput(inode);
+		return NULL;
+	}
+	for_each_possible_cpu(cpu)
+		INIT_DELAYED_WORK(per_cpu_ptr(&bdev->bd_pm_qos->work, cpu),
+				  bdev_pm_qos_work);
+	bdev->cpu_lat_limit = -1;
 	return bdev;
 }
 
@@ -462,6 +480,19 @@ void bdev_unhash(struct block_device *bdev)
 
 void bdev_drop(struct block_device *bdev)
 {
+	int cpu;
+	struct bdev_cpu_latency_qos *qos;
+
+	for_each_possible_cpu(cpu) {
+		qos = per_cpu_ptr(bdev->bd_pm_qos, cpu);
+		if (dev_pm_qos_request_active(&qos->req)) {
+			cancel_delayed_work(&qos->work);
+			dev_pm_qos_remove_request(&qos->req);
+		}
+	}
+
+	free_percpu(bdev->bd_pm_qos);
+
 	iput(BD_INODE(bdev));
 }
 
@@ -1281,6 +1312,26 @@ void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
 	blkdev_put_no_open(bdev);
 }
 
+void bdev_update_cpu_latency_pm_qos(struct block_device *bdev)
+{
+	int cpu;
+	struct bdev_cpu_latency_qos *qos;
+
+	if (!bdev || bdev->cpu_lat_limit < 0)
+		return;
+
+	cpu = raw_smp_processor_id();
+	qos = per_cpu_ptr(bdev->bd_pm_qos, cpu);
+
+	if (!dev_pm_qos_request_active(&qos->req))
+		dev_pm_qos_add_request(get_cpu_device(cpu), &qos->req,
+				       DEV_PM_QOS_RESUME_LATENCY,
+				       bdev->cpu_lat_limit);
+
+	mod_delayed_work(system_wq, &qos->work,
+			 msecs_to_jiffies(bdev->cpu_lat_timeout));
+}
+
 bool disk_live(struct gendisk *disk)
 {
 	return !inode_unhashed(BD_INODE(disk->part0));
diff --git a/block/bio.c b/block/bio.c
index e9e809a63c59..6c46d75345d7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -282,6 +282,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_max_vecs = max_vecs;
 	bio->bi_io_vec = table;
 	bio->bi_pool = NULL;
+
+	bdev_update_cpu_latency_pm_qos(bio->bi_bdev);
 }
 EXPORT_SYMBOL(bio_init);
 
diff --git a/block/blk.h b/block/blk.h
index 189bc25beb50..dda2a188984b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -516,6 +516,8 @@ void drop_partition(struct block_device *part);
 
 void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors);
 
+void bdev_update_cpu_latency_pm_qos(struct block_device *bdev);
+
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 		struct lock_class_key *lkclass);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 781c4500491b..0ed29603eaa9 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -11,6 +11,7 @@
 #include <linux/device.h>
 #include <linux/ktime.h>
 #include <linux/rw_hint.h>
+#include <linux/pm_qos.h>
 
 struct bio_set;
 struct bio;
@@ -38,6 +39,11 @@ struct bio_crypt_ctx;
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
 #define SECTOR_MASK		(PAGE_SECTORS - 1)
 
+struct bdev_cpu_latency_qos {
+	struct dev_pm_qos_request	req;
+	struct delayed_work		work;
+};
+
 struct block_device {
 	sector_t		bd_start_sect;
 	sector_t		bd_nr_sectors;
@@ -71,6 +77,12 @@ struct block_device {
 
 	struct partition_meta_info *bd_meta_info;
 	int			bd_writers;
+
+	/* For preventing deep idle during block I/O */
+	struct bdev_cpu_latency_qos __percpu	*bd_pm_qos;
+	int cpu_lat_timeout;
+	int cpu_lat_limit;
+
 	/*
 	 * keep this out-of-line as it's both big and not needed in the fast
 	 * path
-- 
2.43.1



* [RFC PATCH 2/2] block/genhd: add sysfs knobs for the CPU latency PM QoS settings
From: Tero Kristo @ 2024-08-29  7:18 UTC
  To: axboe; +Cc: linux-block, linux-kernel

Add sysfs knobs for the following parameters:

  cpu_lat_limit_us: limits the CPU latency to the given value while
		    block IO is running
  cpu_lat_timeout_ms: timeout after which the CPU latency limit is
		      removed once block IO has completed

These knobs can be used to prevent the CPU from entering deep idle
states while block IO is running and waiting for an interrupt, which
would otherwise add large latency to the operation.

Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
---
 block/genhd.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 8f1f3c6b4d67..2fbd967a3e36 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1046,6 +1046,48 @@ static ssize_t partscan_show(struct device *dev,
 	return sprintf(buf, "%u\n", disk_has_partscan(dev_to_disk(dev)));
 }
 
+static ssize_t cpu_lat_limit_us_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+
+	return sprintf(buf, "%d\n", bdev->cpu_lat_limit);
+}
+
+static ssize_t cpu_lat_limit_us_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	int i;
+
+	if (count > 0 && !kstrtoint(buf, 10, &i))
+		bdev->cpu_lat_limit = i;
+
+	return count;
+}
+
+static ssize_t cpu_lat_timeout_ms_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+
+	return sprintf(buf, "%d\n", bdev->cpu_lat_timeout);
+}
+
+static ssize_t cpu_lat_timeout_ms_store(struct device *dev,
+					struct device_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	int i;
+
+	if (count > 0 && !kstrtoint(buf, 10, &i))
+		bdev->cpu_lat_timeout = i;
+
+	return count;
+}
+
 static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
 static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
 static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -1060,6 +1102,8 @@ static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
 static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
 static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
 static DEVICE_ATTR(partscan, 0444, partscan_show, NULL);
+static DEVICE_ATTR_RW(cpu_lat_limit_us);
+static DEVICE_ATTR_RW(cpu_lat_timeout_ms);
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 ssize_t part_fail_show(struct device *dev,
@@ -1111,6 +1155,8 @@ static struct attribute *disk_attrs[] = {
 	&dev_attr_events_poll_msecs.attr,
 	&dev_attr_diskseq.attr,
 	&dev_attr_partscan.attr,
+	&dev_attr_cpu_lat_limit_us.attr,
+	&dev_attr_cpu_lat_timeout_ms.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
-- 
2.43.1



* Re: [RFC PATCH 0/2] block: CPU latency PM QoS tuning
From: Bart Van Assche @ 2024-08-29 11:04 UTC
  To: Tero Kristo, axboe; +Cc: linux-block, linux-kernel

On 8/29/24 3:18 AM, Tero Kristo wrote:
> Any thoughts about the patches and the approach taken?

The optimal value for the PM QoS latency depends on the request size
and on the storage device characteristics. I think it would be better
if the latency value were chosen automatically rather than introducing
yet another set of tunable sysfs parameters.
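
To make the idea concrete, a hypothetical sketch (illustrative only,
not a proposal for the actual policy): derive the limit from observed
completion latency instead of a knob, e.g.

/*
 * Hypothetical: pick a resume-latency limit from an average of recent
 * completion latencies, allowing the wakeup to cost at most ~25% of
 * the expected IO time. All names and constants are made up.
 */
static s32 example_auto_lat_limit_us(u64 avg_completion_ns)
{
	s32 limit = div_u64(avg_completion_ns, 4 * NSEC_PER_USEC);

	return clamp_t(s32, limit, 1, 100);
}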

Thanks,

Bart.



* Re: [RFC PATCH 1/2] bdev: add support for CPU latency PM QoS tuning
From: Jens Axboe @ 2024-08-29 11:37 UTC
  To: Tero Kristo; +Cc: linux-block, linux-kernel

On 8/29/24 1:18 AM, Tero Kristo wrote:
> diff --git a/block/bio.c b/block/bio.c
> index e9e809a63c59..6c46d75345d7 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -282,6 +282,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
>  	bio->bi_max_vecs = max_vecs;
>  	bio->bi_io_vec = table;
>  	bio->bi_pool = NULL;
> +
> +	bdev_update_cpu_latency_pm_qos(bio->bi_bdev);
>  }
>  EXPORT_SYMBOL(bio_init);

This is entirely the wrong place to do this, presumably it should be
done at IO dispatch time, not when something initializes a bio.

And also feels like entirely the wrong way to go about this, adding
overhead to potentially each IO dispatch, of which there can be millions
per second.

-- 
Jens Axboe



* Re: [RFC PATCH 1/2] bdev: add support for CPU latency PM QoS tuning
From: Tero Kristo @ 2024-08-30 11:55 UTC
  To: Jens Axboe; +Cc: linux-block, linux-kernel

On Thu, 2024-08-29 at 05:37 -0600, Jens Axboe wrote:
> On 8/29/24 1:18 AM, Tero Kristo wrote:
> > diff --git a/block/bio.c b/block/bio.c
> > index e9e809a63c59..6c46d75345d7 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -282,6 +282,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
> >  	bio->bi_max_vecs = max_vecs;
> >  	bio->bi_io_vec = table;
> >  	bio->bi_pool = NULL;
> > +
> > +	bdev_update_cpu_latency_pm_qos(bio->bi_bdev);
> >  }
> >  EXPORT_SYMBOL(bio_init);
> 
> This is entirely the wrong place to do this, presumably it should be
> done at IO dispatch time, not when something initializes a bio.
> 
> And also feels like entirely the wrong way to go about this, adding
> overhead to potentially each IO dispatch, of which there can be
> millions
> per second.

Any thoughts on where it could/should be added?

I moved the bdev_* callback from bio_init to the location below, and it
seems to work as well:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b4df8e5ac9e..d97a3a4252de 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2706,6 +2706,7 @@ static void __blk_mq_flush_plug_list(struct request_queue *q,
 {
        if (blk_queue_quiesced(q))
                return;
+       bdev_update_cpu_latency_pm_qos(q->disk->part0);
        q->mq_ops->queue_rqs(&plug->mq_list);
 }
 



* Re: [RFC PATCH 0/2] block: CPU latency PM QoS tuning
From: Tero Kristo @ 2024-08-30 12:01 UTC
  To: Bart Van Assche, axboe; +Cc: linux-block, linux-kernel

On Thu, 2024-08-29 at 07:04 -0400, Bart Van Assche wrote:
> On 8/29/24 3:18 AM, Tero Kristo wrote:
> > Any thoughts about the patches and the approach taken?
> 
> The optimal value for the PM QoS latency depends on the request size
> and on the storage device characteristics. I think it would be better
> if the latency value would be chosen automatically rather than
> introducing yet another set of tunable sysfs parameters.

Are these device parameters stored somewhere in the kernel? I did try
looking for this kind of data but could not find anything useful; that's
the main reason I implemented the sysfs tunables.

-Tero

> 
> Thanks,
> 
> Bart.
> 



* Re: [RFC PATCH 1/2] bdev: add support for CPU latency PM QoS tuning
From: Ming Lei @ 2024-08-30 14:26 UTC
  To: Tero Kristo; +Cc: Jens Axboe, linux-block, linux-kernel, ming.lei

On Fri, Aug 30, 2024 at 02:55:56PM +0300, Tero Kristo wrote:
> On Thu, 2024-08-29 at 05:37 -0600, Jens Axboe wrote:
> > On 8/29/24 1:18 AM, Tero Kristo wrote:
> > > diff --git a/block/bio.c b/block/bio.c
> > > index e9e809a63c59..6c46d75345d7 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -282,6 +282,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
> > >  	bio->bi_max_vecs = max_vecs;
> > >  	bio->bi_io_vec = table;
> > >  	bio->bi_pool = NULL;
> > > +
> > > +	bdev_update_cpu_latency_pm_qos(bio->bi_bdev);
> > >  }
> > >  EXPORT_SYMBOL(bio_init);
> > 
> > This is entirely the wrong place to do this, presumably it should be
> > done at IO dispatch time, not when something initializes a bio.
> > 
> > And also feels like entirely the wrong way to go about this, adding
> > overhead to potentially each IO dispatch, of which there can be
> > millions
> > per second.
> 
> Any thoughts where it could/should be added?
> 
> I moved the bdev_* callback from bio_init to the below location and it
> seems to work also:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3b4df8e5ac9e..d97a3a4252de 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2706,6 +2706,7 @@ static void __blk_mq_flush_plug_list(struct request_queue *q,
>  {
>         if (blk_queue_quiesced(q))
>                 return;
> +       bdev_update_cpu_latency_pm_qos(q->disk->part0);
>         q->mq_ops->queue_rqs(&plug->mq_list);

The IO submission CPU may not be the same as the completion CPU, so
this approach looks wrong.

What you are trying to do is to prevent the IO completion CPU from
entering deep idle while there are inflight block IOs.

Only fast devices care about this CPU latency; maybe you just need to
call some generic helper in the driver (NVMe), and you may have to
figure out the exact IO completion CPU for each hardware queue with
inflight IOs.
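
Something along these lines, perhaps — a hypothetical sketch only
(names are made up), keying the request off the hardware queue's CPU
mask rather than the submitting CPU:

#include <linux/blk-mq.h>
#include <linux/cpu.h>
#include <linux/pm_qos.h>

static DEFINE_PER_CPU(struct dev_pm_qos_request, example_hctx_qos);

/*
 * Hypothetical helper: limit wakeup latency on the CPUs mapped to this
 * hardware queue (hctx->cpumask), i.e. the CPUs that will service its
 * completion interrupt.
 */
static void example_qos_for_hctx(struct blk_mq_hw_ctx *hctx, s32 limit_us)
{
	int cpu;

	for_each_cpu(cpu, hctx->cpumask) {
		struct dev_pm_qos_request *req =
			per_cpu_ptr(&example_hctx_qos, cpu);

		if (!dev_pm_qos_request_active(req))
			dev_pm_qos_add_request(get_cpu_device(cpu), req,
					       DEV_PM_QOS_RESUME_LATENCY,
					       limit_us);
	}
}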

Thanks,
Ming



* Re: [RFC PATCH 0/2] block: CPU latency PM QoS tuning
From: Tero Kristo @ 2024-09-04 11:35 UTC
  To: Bart Van Assche, axboe; +Cc: linux-block, linux-kernel

On Thu, 2024-08-29 at 07:04 -0400, Bart Van Assche wrote:
> On 8/29/24 3:18 AM, Tero Kristo wrote:
> > Any thoughts about the patches and the approach taken?
> 
> The optimal value for the PM QoS latency depends on the request size
> and on the storage device characteristics. I think it would be better
> if the latency value would be chosen automatically rather than
> introducing yet another set of tunable sysfs parameters.
> 
> Thanks,
> 
> Bart.
> 

Hi all,

Based on the feedback received, I've updated my patch to work at the
NVMe driver level instead of the block layer. I'll send that to the
corresponding list as a separate RFC, but for now these two patches can
be ignored.

-Tero


* Re: [RFC PATCH 1/2] bdev: add support for CPU latency PM QoS tuning
From: Tero Kristo @ 2024-09-04 11:37 UTC
  To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel

On Fri, 2024-08-30 at 22:26 +0800, Ming Lei wrote:
> On Fri, Aug 30, 2024 at 02:55:56PM +0300, Tero Kristo wrote:
> > On Thu, 2024-08-29 at 05:37 -0600, Jens Axboe wrote:
> > > On 8/29/24 1:18 AM, Tero Kristo wrote:
> > > > diff --git a/block/bio.c b/block/bio.c
> > > > index e9e809a63c59..6c46d75345d7 100644
> > > > --- a/block/bio.c
> > > > +++ b/block/bio.c
> > > > @@ -282,6 +282,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
> > > >  	bio->bi_max_vecs = max_vecs;
> > > >  	bio->bi_io_vec = table;
> > > >  	bio->bi_pool = NULL;
> > > > +
> > > > +	bdev_update_cpu_latency_pm_qos(bio->bi_bdev);
> > > >  }
> > > >  EXPORT_SYMBOL(bio_init);
> > > 
> > > This is entirely the wrong place to do this, presumably it should
> > > be
> > > done at IO dispatch time, not when something initializes a bio.
> > > 
> > > And also feels like entirely the wrong way to go about this,
> > > adding
> > > overhead to potentially each IO dispatch, of which there can be
> > > millions
> > > per second.
> > 
> > Any thoughts where it could/should be added?
> > 
> > I moved the bdev_* callback from bio_init to the below location and
> > it
> > seems to work also:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 3b4df8e5ac9e..d97a3a4252de 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2706,6 +2706,7 @@ static void __blk_mq_flush_plug_list(struct request_queue *q,
> >  {
> >         if (blk_queue_quiesced(q))
> >                 return;
> > +       bdev_update_cpu_latency_pm_qos(q->disk->part0);
> >         q->mq_ops->queue_rqs(&plug->mq_list);
> 
> IO submission CPU may not be same with the completion CPU, so this
> approach looks wrong.
> 
> What you are trying to do is to avoid IO completion CPU to enter
> deep idle in case of inflight block IOs.
> 
> Only fast device cares this CPU latency, maybe you just need to call
> some generic helper in driver(NVMe), and you may have to figure out
> the exact IO completion CPU for hardware queue with inflight IOs.
> 
> Thanks,
> Ming
> 

Thanks for the feedback. I've updated my patch to work in the NVMe
driver instead, taking the queue CPU affinity into account. I will send
a separate RFC of that out soonish to the corresponding list.

-Tero


