* [PATCH v3 00/15] honor isolcpus configuration
@ 2024-08-06 12:06 Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping Daniel Wagner
` (15 more replies)
0 siblings, 16 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
After the discussion with Ming on managed_irq, I decided to bring back
the io_queue option because managed_irq is clearly targeting a different
use case and Ming asked me not to drop support for his use case.
In an offline discussion with Frederic, I learned what the plans are for
isolcpus. The future is in cpusets, but reconfiguration happens only on
offline CPUs. I think this approach is heading in that direction.
I've dug up Ming's attempt to replace blk_mq_[pci|virtio]_map_queues
with a more generic blk_mq_dev_map_queues function which takes a
callback to ask the driver for an affinity mask for a hardware queue.
With this central function in place, it's also simple to override
affinities in the core, and the drivers don't have to be made aware of
isolcpus configurations.
The original attempt was to update the nvme-pci driver only. Hannes asked
me to look into the other multiqueue drivers/devices. Unfortunately, I
don't have all the hardware to test against, so this is only tested with
nvme-pci, smartpqi, qla2xxx and megaraid. The testing also involved CPU
hotplug events and I was not able to observe any stalls, e.g. with hctxs
which have both online and offline CPUs. The only stall I was able to
trigger reliably was with qemu's PCI emulation. It looks like when a CPU
is offlined, the PCI affinity is reprogrammed but qemu still routes IRQs
to the offline CPU instead of the newly programmed destination CPU. All
worked fine on real hardware.
Finally, I also added a new CPU-hctx mapping function for the isolcpus
case. Initially the blk_mq_map_queues function was used, but it turns out
this maps all isolated CPUs to the first CPU. The new function first maps
the housekeeping CPUs to the right hctx using the existing mapping logic.
The isolated CPUs are then assigned evenly to the hctxs. I suppose this
could be done a bit smarter, e.g. by also considering NUMA aspects, once
we agree on this approach.
This series is based on linux-block/for-6.12/block. If this is the wrong
branch please let me know which one is better suited. Thanks.
Initial cover letter:
The nvme-pci driver is ignoring the isolcpus configuration. There were
several attempts to fix this in the past [1][2]. This is another attempt
but this time trying to address the feedback and solve it in the core
code.
The first patch introduces a new option for isolcpus, 'io_queue', but I'm
not really sure if this is needed or if we could just use the managed_irq
option instead. I guess it depends on whether there is a use case which
depends on having queues on the isolated CPUs.
The second patch introduces a new block layer helper which returns the
number of possible queues. I suspect it would also make sense to make
this helper a bit smarter and consider the number of queues the
hardware supports.
And the last patch updates the group_cpus_evenly function so that it uses
only the housekeeping CPUs when they are defined.
Note this series is not addressing the affinity setting of the admin
queue (queue 0). I'd like to address this after we have agreed on how to
solve the general problem. Currently, the admin queue affinity can be
controlled by the irqaffinity command line option, so there is at least
a workaround for it.
Baseline:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1536 MB
node 0 free: 1227 MB
node 1 cpus: 4 5 6 7
node 1 size: 1729 MB
node 1 free: 1422 MB
node distances:
node 0 1
0: 10 20
1: 20 10
options nvme write_queues=4 poll_queues=4
55: 0 41 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
63: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4-5
64: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 6-7
65: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0-1
66: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 2-3
67: 0 0 0 0 24 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
68: 0 0 0 0 0 1 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
69: 0 0 0 0 0 0 41 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 6
70: 0 0 0 0 0 0 0 3 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 7
71: 1 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 9-edge nvme0q9 affinity: 0
72: 0 18 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 10-edge nvme0q10 affinity: 1
73: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 11-edge nvme0q11 affinity: 2
74: 0 0 0 3 0 0 0 0 PCI-MSIX-0000:00:05.0 12-edge nvme0q12 affinity: 3
queue mapping for /dev/nvme0n1
hctx0: default 4 5
hctx1: default 6 7
hctx2: default 0 1
hctx3: default 2 3
hctx4: read 4
hctx5: read 5
hctx6: read 6
hctx7: read 7
hctx8: read 0
hctx9: read 1
hctx10: read 2
hctx11: read 3
hctx12: poll 4 5
hctx13: poll 6 7
hctx14: poll 0 1
hctx15: poll 2 3
PCI name is 00:05.0: nvme0n1
irq 55, cpu list 0-3, effective list 1
irq 63, cpu list 4-5, effective list 5
irq 64, cpu list 6-7, effective list 7
irq 65, cpu list 0-1, effective list 1
irq 66, cpu list 2-3, effective list 3
irq 67, cpu list 4, effective list 4
irq 68, cpu list 5, effective list 5
irq 69, cpu list 6, effective list 6
irq 70, cpu list 7, effective list 7
irq 71, cpu list 0, effective list 0
irq 72, cpu list 1, effective list 1
irq 73, cpu list 2, effective list 2
irq 74, cpu list 3, effective list 3
* patched:
48: 0 0 33 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
58: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4
59: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 5
60: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0
61: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 1
62: 0 0 0 0 45 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
63: 0 0 0 0 0 12 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
64: 2 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 0
65: 0 35 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 1
queue mapping for /dev/nvme0n1
hctx0: default 2 3 4 6 7
hctx1: default 5
hctx2: default 0
hctx3: default 1
hctx4: read 4
hctx5: read 5
hctx6: read 0
hctx7: read 1
hctx8: poll 4
hctx9: poll 5
hctx10: poll 0
hctx11: poll 1
PCI name is 00:05.0: nvme0n1
irq 48, cpu list 0-3, effective list 2
irq 58, cpu list 4, effective list 4
irq 59, cpu list 5, effective list 5
irq 60, cpu list 0, effective list 0
irq 61, cpu list 1, effective list 1
irq 62, cpu list 4, effective list 4
irq 63, cpu list 5, effective list 5
irq 64, cpu list 0, effective list 0
irq 65, cpu list 1, effective list 1
[1] https://lore.kernel.org/lkml/20220423054331.GA17823@lst.de/T/#m9939195a465accbf83187caf346167c4242e798d
[2] https://lore.kernel.org/linux-nvme/87fruci5nj.ffs@tglx/
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
Changes in v3:
- lifted a couple of patches from
https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
"virito: add APIs for retrieving vq affinity"
"blk-mq: introduce blk_mq_dev_map_queues"
- replaces all users of blk_mq_[pci|virtio]_map_queues with
blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- add isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de
Changes in v2:
- updated documentation
- splitted blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de
---
Daniel Wagner (13):
scsi: pm8001: do not overwrite PCI queue mapping
scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
nvme: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
virtio: blk/scsi: replace blk_mq_virtio_map_queues with blk_mq_dev_map_queues
blk-mq: remove unused queue mapping helpers
sched/isolation: Add io_queue housekeeping option
docs: add io_queue as isolcpus options
blk-mq: add number of queue calc helper
nvme-pci: use block layer helpers to calculate num of queues
scsi: use block layer helpers to calculate num of queues
virtio: blk/scsi: use block layer helpers to calculate num of queues
lib/group_cpus.c: honor housekeeping config when grouping CPUs
blk-mq: use hk cpus only when isolcpus=io_queue is enabled
Ming Lei (2):
virtio: add APIs for retrieving vq affinity
blk-mq: introduce blk_mq_dev_map_queues
Documentation/admin-guide/kernel-parameters.txt | 9 ++
block/blk-mq-cpumap.c | 136 ++++++++++++++++++++++++
block/blk-mq-pci.c | 41 ++-----
block/blk-mq-virtio.c | 46 +++-----
drivers/block/virtio_blk.c | 8 +-
drivers/nvme/host/pci.c | 8 +-
drivers/scsi/fnic/fnic_main.c | 3 +-
drivers/scsi/hisi_sas/hisi_sas.h | 1 -
drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 20 ++--
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 5 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 18 ++--
drivers/scsi/mpi3mr/mpi3mr_os.c | 3 +-
drivers/scsi/mpt3sas/mpt3sas_scsih.c | 3 +-
drivers/scsi/pm8001/pm8001_init.c | 9 +-
drivers/scsi/qla2xxx/qla_isr.c | 10 +-
drivers/scsi/qla2xxx/qla_nvme.c | 3 +-
drivers/scsi/qla2xxx/qla_os.c | 3 +-
drivers/scsi/smartpqi/smartpqi_init.c | 12 +--
drivers/scsi/virtio_scsi.c | 4 +-
drivers/virtio/virtio.c | 10 ++
include/linux/blk-mq-pci.h | 7 +-
include/linux/blk-mq-virtio.h | 8 +-
include/linux/blk-mq.h | 7 ++
include/linux/sched/isolation.h | 15 +++
include/linux/virtio.h | 2 +
kernel/sched/isolation.c | 7 ++
lib/group_cpus.c | 75 ++++++++++++-
27 files changed, 350 insertions(+), 123 deletions(-)
---
base-commit: f48ada402d2f1e46fa241bcc6725bdde70725e15
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b
Best regards,
--
Daniel Wagner <dwagner@suse.de>
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 13:24 ` Christoph Hellwig
2024-08-06 15:03 ` John Garry
2024-08-06 12:06 ` [PATCH v3 02/15] virtio: add APIs for retrieving vq affinity Daniel Wagner
` (14 subsequent siblings)
15 siblings, 2 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
blk_mq_pci_map_queues maps all queues, but right after that we
overwrite these mappings by calling blk_mq_map_queues. Use only one of
the two helpers, not both.
Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/scsi/pm8001/pm8001_init.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index 1e63cb6cd8e3..33e1eba62ca1 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -100,10 +100,12 @@ static void pm8001_map_queues(struct Scsi_Host *shost)
struct pm8001_hba_info *pm8001_ha = sha->lldd_ha;
struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
- if (pm8001_ha->number_of_intr > 1)
+ if (pm8001_ha->number_of_intr > 1) {
blk_mq_pci_map_queues(qmap, pm8001_ha->pdev, 1);
+ return;
+ }
- return blk_mq_map_queues(qmap);
+ blk_mq_map_queues(qmap);
}
/*
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 02/15] virtio: add APIs for retrieving vq affinity
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 13:25 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues Daniel Wagner
` (13 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
From: Ming Lei <ming.lei@redhat.com>
virtio-blk and virtio-scsi need this API to retrieve a vq's affinity.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/virtio/virtio.c | 10 ++++++++++
include/linux/virtio.h | 2 ++
2 files changed, 12 insertions(+)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index a9b93e99c23a..c59a193ef337 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -592,6 +592,16 @@ int virtio_device_restore(struct virtio_device *dev)
EXPORT_SYMBOL_GPL(virtio_device_restore);
#endif
+const struct cpumask *virtio_get_vq_affinity(struct virtio_device *dev,
+ int index)
+{
+ if (!dev->config->get_vq_affinity)
+ return NULL;
+
+ return dev->config->get_vq_affinity(dev, index);
+}
+EXPORT_SYMBOL_GPL(virtio_get_vq_affinity);
+
static int virtio_init(void)
{
if (bus_register(&virtio_bus) != 0)
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index ecc5cb7b8c91..bab3858a4c8a 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -170,6 +170,8 @@ int virtio_device_restore(struct virtio_device *dev);
void virtio_reset_device(struct virtio_device *dev);
size_t virtio_max_dma_size(const struct virtio_device *vdev);
+const struct cpumask *virtio_get_vq_affinity(struct virtio_device *dev,
+ int index);
#define virtio_device_for_each_vq(vdev, vq) \
list_for_each_entry(vq, &vdev->vqs, list)
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 02/15] virtio: add APIs for retrieving vq affinity Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 13:26 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues Daniel Wagner
` (12 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
From: Ming Lei <ming.lei@redhat.com>
blk_mq_pci_map_queues and blk_mq_virtio_map_queues create a CPU to
hardware queue mapping based on affinity information. These two
functions share code which differs only in how the affinity information
is retrieved. Also, hisi_sas open-codes the same loop.
Thus introduce a new helper function for creating these mappings which
takes a callback function for fetching the affinity mask. Also
introduce common helper functions for PCI and virtio devices to
retrieve affinity masks.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[dwagner: - removed fallback mapping
- added affinity helpers
- updated commit message]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
block/blk-mq-cpumap.c | 35 +++++++++++++++++++++++++++++++++++
block/blk-mq-pci.c | 18 ++++++++++++++++++
block/blk-mq-virtio.c | 19 +++++++++++++++++++
include/linux/blk-mq-pci.h | 2 ++
include/linux/blk-mq-virtio.h | 3 +++
include/linux/blk-mq.h | 5 +++++
6 files changed, 82 insertions(+)
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9638b25fd521..7037a2dc485f 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -54,3 +54,38 @@ int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int index)
return NUMA_NO_NODE;
}
+
+/**
+ * blk_mq_dev_map_queues - Create CPU to hardware queue mapping
+ * @qmap: CPU to hardware queue map.
+ * @dev_off: Offset to use for the device.
+ * @dev_data: Device data passed to get_queue_affinity().
+ * @get_queue_affinity: Callback to retrieve queue affinity.
+ *
+ * Create a CPU to hardware queue mapping in @qmap. For each queue
+ * @get_queue_affinity will be called to retrieve the affinity for given
+ * queue.
+ */
+void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
+ void *dev_data, int dev_off,
+ get_queue_affinty_fn *get_queue_affinity)
+{
+ const struct cpumask *mask;
+ unsigned int queue, cpu;
+
+ for (queue = 0; queue < qmap->nr_queues; queue++) {
+ mask = get_queue_affinity(dev_data, dev_off, queue);
+ if (!mask)
+ goto fallback;
+
+ for_each_cpu(cpu, mask)
+ qmap->mq_map[cpu] = qmap->queue_offset + queue;
+ }
+
+ return;
+
+fallback:
+ WARN_ON_ONCE(qmap->nr_queues > 1);
+ blk_mq_clear_mq_map(qmap);
+}
+EXPORT_SYMBOL_GPL(blk_mq_dev_map_queues);
diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index d47b5c73c9eb..71a73238aeb2 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -44,3 +44,21 @@ void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
blk_mq_clear_mq_map(qmap);
}
EXPORT_SYMBOL_GPL(blk_mq_pci_map_queues);
+
+/**
+ * blk_mq_pci_get_queue_affinity - get affinity mask queue mapping for PCI device
+ * @dev_data: Pointer to struct pci_dev.
+ * @offset: Offset to use for the pci irq vector
+ * @queue: Queue index
+ *
+ * This function returns for a queue the affinity mask for a PCI device.
+ * It is usually used as callback for blk_mq_dev_map_queues().
+ */
+const struct cpumask *blk_mq_pci_get_queue_affinity(void *dev_data, int offset,
+ int queue)
+{
+ struct pci_dev *pdev = dev_data;
+
+ return pci_irq_get_affinity(pdev, offset + queue);
+}
+EXPORT_SYMBOL_GPL(blk_mq_pci_get_queue_affinity);
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index 68d0945c0b08..d3d33f8d69ce 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -44,3 +44,22 @@ void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
blk_mq_map_queues(qmap);
}
EXPORT_SYMBOL_GPL(blk_mq_virtio_map_queues);
+
+/**
+ * blk_mq_virtio_get_queue_affinity - get affinity mask queue mapping for virtio device
+ * @dev_data: Pointer to struct virtio_device.
+ * @offset: Offset to use for the virtio irq vector
+ * @queue: Queue index
+ *
+ * This function returns for a queue the affinity mask for a virtio device.
+ * It is usually used as callback for blk_mq_dev_map_queues().
+ */
+const struct cpumask *blk_mq_virtio_get_queue_affinity(void *dev_data,
+ int offset,
+ int queue)
+{
+ struct virtio_device *vdev = dev_data;
+
+ return virtio_get_vq_affinity(vdev, offset + queue);
+}
+EXPORT_SYMBOL_GPL(blk_mq_virtio_get_queue_affinity);
diff --git a/include/linux/blk-mq-pci.h b/include/linux/blk-mq-pci.h
index ca544e1d3508..2e701f6f6341 100644
--- a/include/linux/blk-mq-pci.h
+++ b/include/linux/blk-mq-pci.h
@@ -7,5 +7,7 @@ struct pci_dev;
void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
int offset);
+const struct cpumask *blk_mq_pci_get_queue_affinity(void *dev_data, int offset,
+ int queue);
#endif /* _LINUX_BLK_MQ_PCI_H */
diff --git a/include/linux/blk-mq-virtio.h b/include/linux/blk-mq-virtio.h
index 13226e9b22dd..9d3273ba4abf 100644
--- a/include/linux/blk-mq-virtio.h
+++ b/include/linux/blk-mq-virtio.h
@@ -7,5 +7,8 @@ struct virtio_device;
void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
struct virtio_device *vdev, int first_vec);
+const struct cpumask *blk_mq_virtio_get_queue_affinity(void *dev_data,
+ int offset,
+ int queue);
#endif /* _LINUX_BLK_MQ_VIRTIO_H */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8d304b1d16b1..cfc96d6a3136 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -929,7 +929,12 @@ void blk_mq_freeze_queue_wait(struct request_queue *q);
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
unsigned long timeout);
+typedef const struct cpumask *(get_queue_affinty_fn)(void *dev_data,
+ int dev_off, int queue_idx);
void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
+void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
+ void *dev_data, int dev_off,
+ get_queue_affinty_fn *get_queue_affinity);
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
void blk_mq_quiesce_queue_nowait(struct request_queue *q);
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (2 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:06 ` Christoph Hellwig
2024-08-12 15:31 ` John Garry
2024-08-06 12:06 ` [PATCH v3 05/15] nvme: " Daniel Wagner
` (11 subsequent siblings)
15 siblings, 2 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Replace all users of blk_mq_pci_map_queues with the more generic
blk_mq_dev_map_queues. This is in preparation for retiring
blk_mq_pci_map_queues.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/scsi/fnic/fnic_main.c | 3 ++-
drivers/scsi/hisi_sas/hisi_sas.h | 1 -
drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 20 ++++++++++----------
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 5 +++--
drivers/scsi/megaraid/megaraid_sas_base.c | 3 ++-
drivers/scsi/mpi3mr/mpi3mr_os.c | 3 ++-
drivers/scsi/mpt3sas/mpt3sas_scsih.c | 3 ++-
drivers/scsi/pm8001/pm8001_init.c | 3 ++-
drivers/scsi/qla2xxx/qla_nvme.c | 3 ++-
drivers/scsi/qla2xxx/qla_os.c | 3 ++-
drivers/scsi/smartpqi/smartpqi_init.c | 7 ++++---
11 files changed, 31 insertions(+), 23 deletions(-)
diff --git a/drivers/scsi/fnic/fnic_main.c b/drivers/scsi/fnic/fnic_main.c
index 29eead383eb9..dee7f241c38a 100644
--- a/drivers/scsi/fnic/fnic_main.c
+++ b/drivers/scsi/fnic/fnic_main.c
@@ -601,7 +601,8 @@ void fnic_mq_map_queues_cpus(struct Scsi_Host *host)
return;
}
- blk_mq_pci_map_queues(qmap, l_pdev, FNIC_PCI_OFFSET);
+ blk_mq_dev_map_queues(qmap, l_pdev, FNIC_PCI_OFFSET,
+ blk_mq_pci_get_queue_affinity);
}
static int fnic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index d223f482488f..010479a354ee 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -9,7 +9,6 @@
#include <linux/acpi.h>
#include <linux/blk-mq.h>
-#include <linux/blk-mq-pci.h>
#include <linux/clk.h>
#include <linux/debugfs.h>
#include <linux/dmapool.h>
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index 342d75f12051..91daf57f328c 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3549,21 +3549,21 @@ static const struct attribute_group *sdev_groups_v2_hw[] = {
NULL
};
+static const struct cpumask *hisi_hba_get_queue_affinity(void *dev_data,
+ int offset, int idx)
+{
+ struct hisi_hba *hba = dev_data;
+
+ return irq_get_affinity_mask(hba->irq_map[offset + idx]);
+}
+
static void map_queues_v2_hw(struct Scsi_Host *shost)
{
struct hisi_hba *hisi_hba = shost_priv(shost);
struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
- const struct cpumask *mask;
- unsigned int queue, cpu;
- for (queue = 0; queue < qmap->nr_queues; queue++) {
- mask = irq_get_affinity_mask(hisi_hba->irq_map[96 + queue]);
- if (!mask)
- continue;
-
- for_each_cpu(cpu, mask)
- qmap->mq_map[cpu] = qmap->queue_offset + queue;
- }
+ return blk_mq_dev_map_queues(qmap, hisi_hba, 96,
+ hisi_hba_get_queue_affinity);
}
static const struct scsi_host_template sht_v2_hw = {
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index feda9b54b443..0b3c7b49813a 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -3322,8 +3322,9 @@ static void hisi_sas_map_queues(struct Scsi_Host *shost)
if (i == HCTX_TYPE_POLL)
blk_mq_map_queues(qmap);
else
- blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
- BASE_VECTORS_V3_HW);
+ blk_mq_dev_map_queues(qmap, hisi_hba->pci_dev,
+ BASE_VECTORS_V3_HW,
+ blk_mq_pci_get_queue_affinity);
qoff += qmap->nr_queues;
}
}
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 6c79c350a4d5..d026b7513c8d 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -3193,7 +3193,8 @@ static void megasas_map_queues(struct Scsi_Host *shost)
map = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
map->nr_queues = instance->msix_vectors - offset;
map->queue_offset = 0;
- blk_mq_pci_map_queues(map, instance->pdev, offset);
+ blk_mq_dev_map_queues(map, instance->pdev, offset,
+ blk_mq_pci_get_queue_affinity);
qoff += map->nr_queues;
offset += map->nr_queues;
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index 69b14918de59..3256a71390a4 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -4031,7 +4031,8 @@ static void mpi3mr_map_queues(struct Scsi_Host *shost)
*/
map->queue_offset = qoff;
if (i != HCTX_TYPE_POLL)
- blk_mq_pci_map_queues(map, mrioc->pdev, offset);
+ blk_mq_dev_map_queues(map, mrioc->pdev, offset,
+ blk_mq_pci_get_queue_affinity);
else
blk_mq_map_queues(map);
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 97c2472cd434..8f7667d8bfdc 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -11890,7 +11890,8 @@ static void scsih_map_queues(struct Scsi_Host *shost)
*/
map->queue_offset = qoff;
if (i != HCTX_TYPE_POLL)
- blk_mq_pci_map_queues(map, ioc->pdev, offset);
+ blk_mq_dev_map_queues(map, ioc->pdev, offset,
+ blk_mq_pci_get_queue_affinity);
else
blk_mq_map_queues(map);
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index 33e1eba62ca1..b70d17905d02 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -101,7 +101,8 @@ static void pm8001_map_queues(struct Scsi_Host *shost)
struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
if (pm8001_ha->number_of_intr > 1) {
- blk_mq_pci_map_queues(qmap, pm8001_ha->pdev, 1);
+ blk_mq_dev_map_queues(qmap, pm8001_ha->pdev, 1,
+ blk_mq_pci_get_queue_affinity);
return;
}
diff --git a/drivers/scsi/qla2xxx/qla_nvme.c b/drivers/scsi/qla2xxx/qla_nvme.c
index 8f4cc136a9c9..30c4581e782b 100644
--- a/drivers/scsi/qla2xxx/qla_nvme.c
+++ b/drivers/scsi/qla2xxx/qla_nvme.c
@@ -841,7 +841,8 @@ static void qla_nvme_map_queues(struct nvme_fc_local_port *lport,
{
struct scsi_qla_host *vha = lport->private;
- blk_mq_pci_map_queues(map, vha->hw->pdev, vha->irq_offset);
+ blk_mq_dev_map_queues(map, vha->hw->pdev, vha->irq_offset,
+ blk_mq_pci_get_queue_affinity);
}
static void qla_nvme_localport_delete(struct nvme_fc_local_port *lport)
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index bc3b2aea3f8b..5a6ceeb3ad2a 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -8068,7 +8068,8 @@ static void qla2xxx_map_queues(struct Scsi_Host *shost)
if (USER_CTRL_IRQ(vha->hw) || !vha->hw->mqiobase)
blk_mq_map_queues(qmap);
else
- blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
+ blk_mq_dev_map_queues(qmap, vha->hw->pdev, vha->irq_offset,
+ blk_mq_pci_get_queue_affinity);
}
struct scsi_host_template qla2xxx_driver_template = {
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 24c7cb285dca..1fb13d4e0448 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -6533,10 +6533,11 @@ static void pqi_map_queues(struct Scsi_Host *shost)
struct pqi_ctrl_info *ctrl_info = shost_to_hba(shost);
if (!ctrl_info->disable_managed_interrupts)
- return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
- ctrl_info->pci_dev, 0);
+ blk_mq_dev_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+ ctrl_info->pci_dev, 0,
+ blk_mq_pci_get_queue_affinity);
else
- return blk_mq_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT]);
+ blk_mq_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT]);
}
static inline bool pqi_is_tape_changer_device(struct pqi_scsi_dev *device)
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 05/15] nvme: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (3 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:06 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 06/15] virtio: blk/scsi: replace blk_mq_virtio_map_queues " Daniel Wagner
` (10 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Replace all users of blk_mq_pci_map_queues with the more generic
blk_mq_dev_map_queues. This is in preparation for retiring
blk_mq_pci_map_queues.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec..5d509405a3e4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -457,7 +457,8 @@ static void nvme_pci_map_queues(struct blk_mq_tag_set *set)
*/
map->queue_offset = qoff;
if (i != HCTX_TYPE_POLL && offset)
- blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+ blk_mq_dev_map_queues(map, to_pci_dev(dev->dev), offset,
+ blk_mq_pci_get_queue_affinity);
else
blk_mq_map_queues(map);
qoff += map->nr_queues;
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 06/15] virtio: blk/scs: replace blk_mq_virtio_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (4 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 05/15] nvme: " Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:07 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers Daniel Wagner
` (9 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Replace all users of blk_mq_virtio_map_queues with the more generic
blk_mq_dev_map_queues. This is in preparation for retiring
blk_mq_virtio_map_queues.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/block/virtio_blk.c | 3 ++-
drivers/scsi/virtio_scsi.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 194417abc105..8f68037da00b 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -1186,7 +1186,8 @@ static void virtblk_map_queues(struct blk_mq_tag_set *set)
if (i == HCTX_TYPE_POLL)
blk_mq_map_queues(&set->map[i]);
else
- blk_mq_virtio_map_queues(&set->map[i], vblk->vdev, 0);
+ blk_mq_dev_map_queues(&set->map[i], vblk->vdev, 0,
+ blk_mq_virtio_get_queue_affinity);
}
}
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 8471f38b730e..8013a082598b 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -746,7 +746,8 @@ static void virtscsi_map_queues(struct Scsi_Host *shost)
if (i == HCTX_TYPE_POLL)
blk_mq_map_queues(map);
else
- blk_mq_virtio_map_queues(map, vscsi->vdev, 2);
+ blk_mq_dev_map_queues(map, vscsi->vdev, 2,
+ blk_mq_virtio_get_queue_affinity);
}
}
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (5 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 06/15] virtio: blk/scs: replace blk_mq_virtio_map_queues " Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:08 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 08/15] sched/isolation: Add io_queue housekeeping option Daniel Wagner
` (8 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
There are no users left of the pci and virtio queue mapping helpers.
Thus remove them.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
block/blk-mq-pci.c | 39 ---------------------------------------
block/blk-mq-virtio.c | 41 +----------------------------------------
include/linux/blk-mq-pci.h | 5 -----
include/linux/blk-mq-virtio.h | 5 -----
4 files changed, 1 insertion(+), 89 deletions(-)
diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 71a73238aeb2..b69dd52d8719 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -2,49 +2,10 @@
/*
* Copyright (c) 2016 Christoph Hellwig.
*/
-#include <linux/kobject.h>
-#include <linux/blkdev.h>
#include <linux/blk-mq-pci.h>
#include <linux/pci.h>
#include <linux/module.h>
-#include "blk-mq.h"
-
-/**
- * blk_mq_pci_map_queues - provide a default queue mapping for PCI device
- * @qmap: CPU to hardware queue map.
- * @pdev: PCI device associated with @set.
- * @offset: Offset to use for the pci irq vector
- *
- * This function assumes the PCI device @pdev has at least as many available
- * interrupt vectors as @set has queues. It will then query the vector
- * corresponding to each queue for it's affinity mask and built queue mapping
- * that maps a queue to the CPUs that have irq affinity for the corresponding
- * vector.
- */
-void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
- int offset)
-{
- const struct cpumask *mask;
- unsigned int queue, cpu;
-
- for (queue = 0; queue < qmap->nr_queues; queue++) {
- mask = pci_irq_get_affinity(pdev, queue + offset);
- if (!mask)
- goto fallback;
-
- for_each_cpu(cpu, mask)
- qmap->mq_map[cpu] = qmap->queue_offset + queue;
- }
-
- return;
-
-fallback:
- WARN_ON_ONCE(qmap->nr_queues > 1);
- blk_mq_clear_mq_map(qmap);
-}
-EXPORT_SYMBOL_GPL(blk_mq_pci_map_queues);
-
/**
* blk_mq_pci_get_queue_affinity - get affinity mask queue mapping for PCI device
* @dev_data: Pointer to struct pci_dev.
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index d3d33f8d69ce..0fd9f17a2f44 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -2,48 +2,9 @@
/*
* Copyright (c) 2016 Christoph Hellwig.
*/
-#include <linux/device.h>
#include <linux/blk-mq-virtio.h>
-#include <linux/virtio_config.h>
+#include <linux/virtio.h>
#include <linux/module.h>
-#include "blk-mq.h"
-
-/**
- * blk_mq_virtio_map_queues - provide a default queue mapping for virtio device
- * @qmap: CPU to hardware queue map.
- * @vdev: virtio device to provide a mapping for.
- * @first_vec: first interrupt vectors to use for queues (usually 0)
- *
- * This function assumes the virtio device @vdev has at least as many available
- * interrupt vectors as @set has queues. It will then query the vector
- * corresponding to each queue for it's affinity mask and built queue mapping
- * that maps a queue to the CPUs that have irq affinity for the corresponding
- * vector.
- */
-void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
- struct virtio_device *vdev, int first_vec)
-{
- const struct cpumask *mask;
- unsigned int queue, cpu;
-
- if (!vdev->config->get_vq_affinity)
- goto fallback;
-
- for (queue = 0; queue < qmap->nr_queues; queue++) {
- mask = vdev->config->get_vq_affinity(vdev, first_vec + queue);
- if (!mask)
- goto fallback;
-
- for_each_cpu(cpu, mask)
- qmap->mq_map[cpu] = qmap->queue_offset + queue;
- }
-
- return;
-
-fallback:
- blk_mq_map_queues(qmap);
-}
-EXPORT_SYMBOL_GPL(blk_mq_virtio_map_queues);
/**
* blk_mq_virtio_get_queue_affinity - get affinity mask queue mapping for virtio device
diff --git a/include/linux/blk-mq-pci.h b/include/linux/blk-mq-pci.h
index 2e701f6f6341..e6de88da8eea 100644
--- a/include/linux/blk-mq-pci.h
+++ b/include/linux/blk-mq-pci.h
@@ -2,11 +2,6 @@
#ifndef _LINUX_BLK_MQ_PCI_H
#define _LINUX_BLK_MQ_PCI_H
-struct blk_mq_queue_map;
-struct pci_dev;
-
-void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
- int offset);
const struct cpumask *blk_mq_pci_get_queue_affinity(void *dev_data, int offset,
int queue);
diff --git a/include/linux/blk-mq-virtio.h b/include/linux/blk-mq-virtio.h
index 9d3273ba4abf..de01dfb36c43 100644
--- a/include/linux/blk-mq-virtio.h
+++ b/include/linux/blk-mq-virtio.h
@@ -2,11 +2,6 @@
#ifndef _LINUX_BLK_MQ_VIRTIO_H
#define _LINUX_BLK_MQ_VIRTIO_H
-struct blk_mq_queue_map;
-struct virtio_device;
-
-void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
- struct virtio_device *vdev, int first_vec);
const struct cpumask *blk_mq_virtio_get_queue_affinity(void *dev_data,
int offset,
int queue);
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 08/15] sched/isolation: Add io_queue housekeeping option
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (6 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 09/15] docs: add io_queue as isolcpus options Daniel Wagner
` (7 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Multiqueue drivers such as nvme-pci spread IO queues across all CPUs
for optimal performance. isolcpus users are usually more concerned about
noise on the isolated CPUs. Introduce a new isolcpus mask which allows the
user to define on which CPUs IO queues should be placed.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
include/linux/sched/isolation.h | 15 +++++++++++++++
kernel/sched/isolation.c | 7 +++++++
2 files changed, 22 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 2b461129d1fa..0101d0fc8c00 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -6,6 +6,20 @@
#include <linux/init.h>
#include <linux/tick.h>
+/**
+ * enum hk_type - housekeeping cpu mask types
+ * @HK_TYPE_TIMER: housekeeping cpu mask for timers
+ * @HK_TYPE_RCU: housekeeping cpu mask for RCU
+ * @HK_TYPE_MISC: housekeeping cpu mask for miscellaneous resources
+ * @HK_TYPE_SCHED: housekeeping cpu mask for scheduling
+ * @HK_TYPE_TICK: housekeeping cpu mask for timer tick
+ * @HK_TYPE_DOMAIN: housekeeping cpu mask for general SMP balancing
+ * and scheduling algorithms
+ * @HK_TYPE_WQ: housekeeping cpu mask for workqueues
+ * @HK_TYPE_MANAGED_IRQ: housekeeping cpu mask for managed IRQs
+ * @HK_TYPE_KTHREAD: housekeeping cpu mask for kthreads
+ * @HK_TYPE_IO_QUEUE: housekeeping cpu mask for I/O queues
+ */
enum hk_type {
HK_TYPE_TIMER,
HK_TYPE_RCU,
@@ -16,6 +30,7 @@ enum hk_type {
HK_TYPE_WQ,
HK_TYPE_MANAGED_IRQ,
HK_TYPE_KTHREAD,
+ HK_TYPE_IO_QUEUE,
HK_TYPE_MAX
};
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5891e715f00d..91d7a434330c 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -18,6 +18,7 @@ enum hk_flags {
HK_FLAG_WQ = BIT(HK_TYPE_WQ),
HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
HK_FLAG_KTHREAD = BIT(HK_TYPE_KTHREAD),
+ HK_FLAG_IO_QUEUE = BIT(HK_TYPE_IO_QUEUE),
};
DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -228,6 +229,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
continue;
}
+ if (!strncmp(str, "io_queue,", 9)) {
+ str += 9;
+ flags |= HK_FLAG_IO_QUEUE;
+ continue;
+ }
+
/*
* Skip unknown sub-parameter and validate that it is not
* containing an invalid character.
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 09/15] docs: add io_queue as isolcpus options
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (7 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 08/15] sched/isolation: Add io_queue housekeeping option Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 10/15] blk-mq: add number of queue calc helper Daniel Wagner
` (6 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
isolcpus learned a new io_queue option. Explain what it does.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f1384c7b59c9..a416cc969e97 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2446,6 +2446,15 @@
housekeeping CPUs has no influence on those
queues.
+ io_queue
+ Isolate CPUs from IO queue placement by
+ multiqueue drivers. Multiqueue drivers
+ allocate and distribute IO queues across all
+ CPUs. When this option is set, the drivers
+ allocate IO queues only on the housekeeping
+ CPUs. This option can't be used together with
+ managed_irq; use one or the other.
+
The format of <cpu-list> is described above.
iucv= [HW,NET]
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
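For reference, a boot-line configuration using the new sub-parameter might look as follows (the machine size and CPU list are illustrative, not taken from the patches):

```
# Hypothetical 8-CPU machine: place IO queues only on CPUs 0-1 and
# keep CPUs 2-7 free of IO queue placement. Per the documentation
# above, io_queue must not be combined with managed_irq.
isolcpus=io_queue,2-7
```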
* [PATCH v3 10/15] blk-mq: add number of queue calc helper
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (8 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 09/15] docs: add io_queue as isolcpus options Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:03 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 11/15] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
` (5 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=io_queue is set. This prevents the isolated CPUs from
being disturbed by OS workload.
Add two helper variants which calculate the correct number of queues
to use. Two variants are necessary because some drivers calculate
their maximum number of queues based on the possible CPU mask, others
based on the online CPU mask.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
block/blk-mq-cpumap.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/blk-mq.h | 2 ++
2 files changed, 47 insertions(+)
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 7037a2dc485f..c1277763aeeb 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -11,10 +11,55 @@
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
#include "blk.h"
#include "blk-mq.h"
+static unsigned int blk_mq_num_queues(const struct cpumask *mask,
+ unsigned int max_queues)
+{
+ unsigned int num;
+
+ if (housekeeping_enabled(HK_TYPE_IO_QUEUE))
+ mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+
+ num = cpumask_weight(mask);
+ return min_not_zero(num, max_queues);
+}
+
+/**
+ * blk_mq_num_possible_queues - Calc nr of queues for multiqueue devices
+ * @max_queues: The maximal number of queues the hardware/driver
+ * supports. If max_queues is 0, the argument is
+ * ignored.
+ *
+ * Calculate the number of queues which should be used for a multiqueue
+ * device based on the number of possible CPUs. The helper takes the
+ * isolcpus settings into account.
+ */
+unsigned int blk_mq_num_possible_queues(unsigned int max_queues)
+{
+ return blk_mq_num_queues(cpu_possible_mask, max_queues);
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
+
+/**
+ * blk_mq_num_online_queues - Calc nr of queues for multiqueue devices
+ * @max_queues: The maximal number of queues the hardware/driver
+ * supports. If max_queues is 0, the argument is
+ * ignored.
+ *
+ * Calculate the number of queues which should be used for a multiqueue
+ * device based on the number of online CPUs. The helper takes the
+ * isolcpus settings into account.
+ */
+unsigned int blk_mq_num_online_queues(unsigned int max_queues)
+{
+ return blk_mq_num_queues(cpu_online_mask, max_queues);
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
+
void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
{
const struct cpumask *masks;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cfc96d6a3136..bcd5ef712af7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -931,6 +931,8 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
typedef const struct cpumask *(get_queue_affinty_fn)(void *dev_data,
int dev_off, int queue_idx);
+unsigned int blk_mq_num_possible_queues(unsigned int max_queues);
+unsigned int blk_mq_num_online_queues(unsigned int max_queues);
void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
void *dev_data, int dev_off,
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 11/15] nvme-pci: use block layer helpers to calculate num of queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (9 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 10/15] blk-mq: add number of queue calc helper Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:04 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 12/15] scsi: " Daniel Wagner
` (4 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=io_queue is set. This prevents the isolated CPUs from
being disturbed by OS workload.
Use helpers which calculate the correct number of queues to use when
isolcpus is in effect.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/nvme/host/pci.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5d509405a3e4..50ff2cef3d4f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -81,7 +81,7 @@ static int io_queue_count_set(const char *val, const struct kernel_param *kp)
int ret;
ret = kstrtouint(val, 10, &n);
- if (ret != 0 || n > num_possible_cpus())
+ if (ret != 0 || n > blk_mq_num_possible_queues(0))
return -EINVAL;
return param_set_uint(val, kp);
}
@@ -2300,7 +2300,8 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
*/
if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
return 1;
- return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+ return blk_mq_num_possible_queues(0) + dev->nr_write_queues +
+ dev->nr_poll_queues;
}
static int nvme_setup_io_queues(struct nvme_dev *dev)
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 12/15] scsi: use block layer helpers to calculate num of queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (10 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 11/15] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-12 9:09 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 13/15] virtio: blk/scsi: " Daniel Wagner
` (3 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=io_queue is set. This prevents the isolated CPUs from
being disturbed by OS workload.
Use helpers which calculate the correct number of queues to use when
isolcpus is in effect.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 15 +++++++++------
drivers/scsi/qla2xxx/qla_isr.c | 10 +++++-----
drivers/scsi/smartpqi/smartpqi_init.c | 5 ++---
3 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d026b7513c8d..ddddcdbf71b2 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5964,7 +5964,8 @@ megasas_alloc_irq_vectors(struct megasas_instance *instance)
else
instance->iopoll_q_count = 0;
- num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+ num_msix_req = blk_mq_num_online_queues(0) +
+ instance->low_latency_index_start;
instance->msix_vectors = min(num_msix_req,
instance->msix_vectors);
@@ -5980,7 +5981,8 @@ megasas_alloc_irq_vectors(struct megasas_instance *instance)
/* Disable Balanced IOPS mode and try realloc vectors */
instance->perf_mode = MR_LATENCY_PERF_MODE;
instance->low_latency_index_start = 1;
- num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+ num_msix_req = blk_mq_num_online_queues(0) +
+ instance->low_latency_index_start;
instance->msix_vectors = min(num_msix_req,
instance->msix_vectors);
@@ -6236,7 +6238,7 @@ static int megasas_init_fw(struct megasas_instance *instance)
intr_coalescing = (scratch_pad_1 & MR_INTR_COALESCING_SUPPORT_OFFSET) ?
true : false;
if (intr_coalescing &&
- (num_online_cpus() >= MR_HIGH_IOPS_QUEUE_COUNT) &&
+ (blk_mq_num_online_queues(0) >= MR_HIGH_IOPS_QUEUE_COUNT) &&
(instance->msix_vectors == MEGASAS_MAX_MSIX_QUEUES))
instance->perf_mode = MR_BALANCED_PERF_MODE;
else
@@ -6280,7 +6282,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
else
instance->low_latency_index_start = 1;
- num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+ num_msix_req = blk_mq_num_online_queues(0) +
+ instance->low_latency_index_start;
instance->msix_vectors = min(num_msix_req,
instance->msix_vectors);
@@ -6312,8 +6315,8 @@ static int megasas_init_fw(struct megasas_instance *instance)
megasas_setup_reply_map(instance);
dev_info(&instance->pdev->dev,
- "current msix/online cpus\t: (%d/%d)\n",
- instance->msix_vectors, (unsigned int)num_online_cpus());
+ "current msix/max num queues\t: (%d/%u)\n",
+ instance->msix_vectors, blk_mq_num_online_queues(0));
dev_info(&instance->pdev->dev,
"RDPQ mode\t: (%s)\n", instance->is_rdpq ? "enabled" : "disabled");
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index fe98c76e9be3..c4c6b5c6658c 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -4533,13 +4533,13 @@ qla24xx_enable_msix(struct qla_hw_data *ha, struct rsp_que *rsp)
if (USER_CTRL_IRQ(ha) || !ha->mqiobase) {
/* user wants to control IRQ setting for target mode */
ret = pci_alloc_irq_vectors(ha->pdev, min_vecs,
- min((u16)ha->msix_count, (u16)(num_online_cpus() + min_vecs)),
- PCI_IRQ_MSIX);
+ blk_mq_num_online_queues(ha->msix_count) + min_vecs,
+ PCI_IRQ_MSIX);
} else
ret = pci_alloc_irq_vectors_affinity(ha->pdev, min_vecs,
- min((u16)ha->msix_count, (u16)(num_online_cpus() + min_vecs)),
- PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
- &desc);
+ blk_mq_num_online_queues(ha->msix_count) + min_vecs,
+ PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+ &desc);
if (ret < 0) {
ql_log(ql_log_fatal, vha, 0x00c7,
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 1fb13d4e0448..da361602ad7d 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -5264,15 +5264,14 @@ static void pqi_calculate_queue_resources(struct pqi_ctrl_info *ctrl_info)
if (reset_devices) {
num_queue_groups = 1;
} else {
- int num_cpus;
int max_queue_groups;
max_queue_groups = min(ctrl_info->max_inbound_queues / 2,
ctrl_info->max_outbound_queues - 1);
max_queue_groups = min(max_queue_groups, PQI_MAX_QUEUE_GROUPS);
- num_cpus = num_online_cpus();
- num_queue_groups = min(num_cpus, ctrl_info->max_msix_vectors);
+ num_queue_groups =
+ blk_mq_num_online_queues(ctrl_info->max_msix_vectors);
num_queue_groups = min(num_queue_groups, max_queue_groups);
}
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 13/15] virtio: blk/scsi: use block layer helpers to calculate num of queues
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (11 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 12/15] scsi: " Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
` (2 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
Multiqueue devices should only allocate queues for the housekeeping CPUs
when isolcpus=io_queue is set. This prevents the isolated CPUs from
being disturbed by OS workload.
Use helpers which calculate the correct number of queues to use when
isolcpus is in effect.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/block/virtio_blk.c | 5 ++---
drivers/scsi/virtio_scsi.c | 1 +
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 8f68037da00b..3584a1e00791 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -982,9 +982,8 @@ static int init_vq(struct virtio_blk *vblk)
return -EINVAL;
}
- num_vqs = min_t(unsigned int,
- min_not_zero(num_request_queues, nr_cpu_ids),
- num_vqs);
+ num_vqs = blk_mq_num_possible_queues(
+ min_not_zero(num_request_queues, num_vqs));
num_poll_vqs = min_t(unsigned int, poll_queues, num_vqs - 1);
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 8013a082598b..1c1feec59781 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -921,6 +921,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
/* We need to know how many queues before we allocate. */
num_queues = virtscsi_config_get(vdev, num_queues) ? : 1;
num_queues = min_t(unsigned int, nr_cpu_ids, num_queues);
+ num_queues = blk_mq_num_possible_queues(num_queues);
num_targets = virtscsi_config_get(vdev, max_target) + 1;
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (12 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 13/15] virtio: blk/scsi: " Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 14:47 ` Ming Lei
2024-08-12 9:09 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
2024-08-06 13:09 ` [PATCH v3 00/15] honor isolcpus configuration Stefan Hajnoczi
15 siblings, 2 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
group_cpus_evenly distributes all present CPUs into groups. This ignores
the isolcpus configuration and assigns isolated CPUs to the groups as
well.
Make group_cpus_evenly aware of the isolcpus configuration and use the
housekeeping CPU mask as the base for distributing the available CPUs
into groups.
Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
lib/group_cpus.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 73 insertions(+), 2 deletions(-)
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..713c9fdd774a 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
#include <linux/cpu.h>
#include <linux/sort.h>
#include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
#ifdef CONFIG_SMP
@@ -330,7 +331,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
}
/**
- * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * group_possible_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
* @numgrps: number of groups
*
* Return: cpumask array if successful, NULL otherwise. And each element
@@ -344,7 +345,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
* We guarantee in the resulted grouping that all CPUs are covered, and
* no same CPU is assigned to multiple groups
*/
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+static struct cpumask *group_possible_cpus_evenly(unsigned int numgrps)
{
unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
cpumask_var_t *node_to_cpumask;
@@ -423,6 +424,76 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
}
return masks;
}
+
+/**
+ * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ * @cpu_mask: CPUs to consider for the grouping
+ *
+ * Return: cpumask array if successful, NULL otherwise. And each element
+ * includes CPUs assigned to this group.
+ *
+ * Try to put close CPUs from viewpoint of CPU and NUMA locality into
+ * same group. Allocate present CPUs on these groups evenly.
+ */
+static struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+ const struct cpumask *cpu_mask)
+{
+ cpumask_var_t *node_to_cpumask;
+ cpumask_var_t nmsk;
+ int ret = -ENOMEM;
+ struct cpumask *masks = NULL;
+
+ if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+ return NULL;
+
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
+ goto fail_nmsk;
+
+ masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ goto fail_node_to_cpumask;
+
+ build_node_to_cpumask(node_to_cpumask);
+
+ ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
+ masks);
+
+fail_node_to_cpumask:
+ free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+ free_cpumask_var(nmsk);
+ if (ret < 0) {
+ kfree(masks);
+ return NULL;
+ }
+ return masks;
+}
+
+/**
+ * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ *
+ * Return: cpumask array if successful, NULL otherwise.
+ *
+ * group_possible_cpus_evenly() is used for distributing the CPUs on all
+ * possible CPUs in the absence of the isolcpus command line argument.
+ * group_mask_cpus_evenly() is used when the isolcpus command line
+ * argument is used with the managed_irq option. In this case only the
+ * housekeeping CPUs are considered.
+ */
+struct cpumask *group_cpus_evenly(unsigned int numgrps)
+{
+ const struct cpumask *hk_mask;
+
+ hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+ if (!cpumask_empty(hk_mask))
+ return group_mask_cpus_evenly(numgrps, hk_mask);
+
+ return group_possible_cpus_evenly(numgrps);
+}
#else /* CONFIG_SMP */
struct cpumask *group_cpus_evenly(unsigned int numgrps)
{
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (13 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
@ 2024-08-06 12:06 ` Daniel Wagner
2024-08-06 14:55 ` Ming Lei
` (2 more replies)
2024-08-06 13:09 ` [PATCH v3 00/15] honor isolcpus configuration Stefan Hajnoczi
15 siblings, 3 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-06 12:06 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, Daniel Wagner
When isolcpus=io_queue is enabled all hardware queues should run on the
housekeeping CPUs only. Thus ignore the affinity mask provided by the
driver. Also we can't use blk_mq_map_queues because it will map all CPUs
to the first hctx unless the CPU matches the one the hctx has its
affinity set to. For example, with 8 CPUs and isolcpus=io_queue,2-3,6-7:
queue mapping for /dev/nvme0n1
hctx0: default 2 3 4 6 7
hctx1: default 5
hctx2: default 0
hctx3: default 1
PCI name is 00:05.0: nvme0n1
irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
whereas with blk_mq_hk_map_queues we get
queue mapping for /dev/nvme0n1
hctx0: default 2 4
hctx1: default 3 5
hctx2: default 0 6
hctx3: default 1 7
PCI name is 00:05.0: nvme0n1
irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
block/blk-mq-cpumap.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index c1277763aeeb..7e026c2ffa02 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -60,11 +60,64 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
}
EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
+static bool blk_mq_hk_map_queues(struct blk_mq_queue_map *qmap)
+{
+ struct cpumask *hk_masks;
+ cpumask_var_t isol_mask;
+
+ unsigned int queue, cpu;
+
+ if (!housekeeping_enabled(HK_TYPE_IO_QUEUE))
+ return false;
+
+ /* map housekeeping cpus to matching hardware context */
+ hk_masks = group_cpus_evenly(qmap->nr_queues);
+ if (!hk_masks)
+ goto fallback;
+
+ for (queue = 0; queue < qmap->nr_queues; queue++) {
+ for_each_cpu(cpu, &hk_masks[queue])
+ qmap->mq_map[cpu] = qmap->queue_offset + queue;
+ }
+
+ kfree(hk_masks);
+
+ /* map isolcpus to hardware context */
+ if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
+ goto fallback;
+
+ queue = 0;
+ cpumask_andnot(isol_mask,
+ cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_IO_QUEUE));
+
+ for_each_cpu(cpu, isol_mask) {
+ qmap->mq_map[cpu] = qmap->queue_offset + queue;
+ queue = (queue + 1) % qmap->nr_queues;
+ }
+
+ free_cpumask_var(isol_mask);
+
+ return true;
+
+fallback:
+ /* map all cpus to hardware context ignoring any affinity */
+ queue = 0;
+ for_each_possible_cpu(cpu) {
+ qmap->mq_map[cpu] = qmap->queue_offset + queue;
+ queue = (queue + 1) % qmap->nr_queues;
+ }
+ return true;
+}
+
void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
{
const struct cpumask *masks;
unsigned int queue, cpu;
+ if (blk_mq_hk_map_queues(qmap))
+ return;
+
masks = group_cpus_evenly(qmap->nr_queues);
if (!masks) {
for_each_possible_cpu(cpu)
@@ -118,6 +171,9 @@ void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
const struct cpumask *mask;
unsigned int queue, cpu;
+ if (blk_mq_hk_map_queues(qmap))
+ return;
+
for (queue = 0; queue < qmap->nr_queues; queue++) {
mask = get_queue_affinity(dev_data, dev_off, queue);
if (!mask)
--
2.46.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH v3 00/15] honor isolcpus configuration
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
` (14 preceding siblings ...)
2024-08-06 12:06 ` [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
@ 2024-08-06 13:09 ` Stefan Hajnoczi
2024-08-07 12:25 ` Daniel Wagner
15 siblings, 1 reply; 46+ messages in thread
From: Stefan Hajnoczi @ 2024-08-06 13:09 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Tue, 6 Aug 2024 at 08:10, Daniel Wagner <dwagner@suse.de> wrote:
> The only stall I was able to trigger
> reliable was with qemu's PCI emulation. It looks like when a CPU is
> offlined, the PCI affinity is reprogrammed but qemu still routes IRQs to
> an offline CPU instead to newly programmed destination CPU. All worked
> fine on real hardware.
Hi Daniel,
Please file a QEMU bug report here (or just reply to this email with
details on how to reproduce the issue and I'll file the issue on your
behalf):
https://gitlab.com/qemu-project/qemu/-/issues
We can also wait until your Linux patches have landed if that makes it
easier to reproduce the bug.
Thanks!
Stefan
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping
2024-08-06 12:06 ` [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping Daniel Wagner
@ 2024-08-06 13:24 ` Christoph Hellwig
2024-08-06 15:03 ` John Garry
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-06 13:24 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:33PM +0200, Daniel Wagner wrote:
> blk_mq_pci_map_queues maps all queues but right after this, we
> overwrite these mappings by calling blk_mq_map_queues. Just use one
> helper but not both.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 02/15] virito: add APIs for retrieving vq affinity
2024-08-06 12:06 ` [PATCH v3 02/15] virito: add APIs for retrieving vq affinity Daniel Wagner
@ 2024-08-06 13:25 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-06 13:25 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
The title should spell virtio I think, and the commit log could explain
a bit more why this is needed.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues
2024-08-06 12:06 ` [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues Daniel Wagner
@ 2024-08-06 13:26 ` Christoph Hellwig
2024-08-07 12:49 ` Daniel Wagner
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-06 13:26 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:35PM +0200, Daniel Wagner wrote:
> From: Ming Lei <ming.lei@redhat.com>
>
> blk_mq_pci_map_queues and blk_mq_virtio_map_queues will create a CPU to
> hardware queue mapping based on affinity information. These two
> functions share code which only differs in how the affinity information
> is retrieved. Also there is hisi_sas, which open-codes the same loop.
>
> Thus introduce a new helper function for creating these mappings which
> takes a callback function for fetching the affinity mask. Also
> introduce common helper functions for PCI and virtio devices to retrieve
> affinity masks.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [dwagner: - removed fallback mapping
> - added affintity helpers
> - updated commit message]
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
> block/blk-mq-cpumap.c | 35 +++++++++++++++++++++++++++++++++++
> block/blk-mq-pci.c | 18 ++++++++++++++++++
> block/blk-mq-virtio.c | 19 +++++++++++++++++++
> include/linux/blk-mq-pci.h | 2 ++
> include/linux/blk-mq-virtio.h | 3 +++
> include/linux/blk-mq.h | 5 +++++
> 6 files changed, 82 insertions(+)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9638b25fd521..7037a2dc485f 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -54,3 +54,38 @@ int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int index)
>
> return NUMA_NO_NODE;
> }
> +
> +/**
> + * blk_mq_dev_map_queues - Create CPU to hardware queue mapping
> + * @qmap: CPU to hardware queue map.
> + * @dev_off: Offset to use for the device.
> + * @dev_data: Device data passed to get_queue_affinity().
> + * @get_queue_affinity: Callback to retrieve queue affinity.
> + *
> + * Create a CPU to hardware queue mapping in @qmap. For each queue
> + * @get_queue_affinity will be called to retrieve the affinity for given
> + * queue.
> + */
> +void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
> + void *dev_data, int dev_off,
> + get_queue_affinty_fn *get_queue_affinity)
> +{
> + const struct cpumask *mask;
> + unsigned int queue, cpu;
> +
> + for (queue = 0; queue < qmap->nr_queues; queue++) {
> + mask = get_queue_affinity(dev_data, dev_off, queue);
> + if (!mask)
> + goto fallback;
> +
> + for_each_cpu(cpu, mask)
> + qmap->mq_map[cpu] = qmap->queue_offset + queue;
> + }
> +
> + return;
> +
> +fallback:
> + WARN_ON_ONCE(qmap->nr_queues > 1);
> + blk_mq_clear_mq_map(qmap);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_dev_map_queues);
> diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
> index d47b5c73c9eb..71a73238aeb2 100644
> --- a/block/blk-mq-pci.c
> +++ b/block/blk-mq-pci.c
> @@ -44,3 +44,21 @@ void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
> blk_mq_clear_mq_map(qmap);
> }
> EXPORT_SYMBOL_GPL(blk_mq_pci_map_queues);
> +
> +/**
> + * blk_mq_pci_get_queue_affinity - get affinity mask queue mapping for PCI device
> + * @dev_data: Pointer to struct pci_dev.
> + * @offset: Offset to use for the pci irq vector
> + * @queue: Queue index
> + *
> + * This function returns for a queue the affinity mask for a PCI device.
> + * It is usually used as callback for blk_mq_dev_map_queues().
> + */
> +const struct cpumask *blk_mq_pci_get_queue_affinity(void *dev_data, int offset,
> + int queue)
> +{
> + struct pci_dev *pdev = dev_data;
> +
> + return pci_irq_get_affinity(pdev, offset + queue);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_pci_get_queue_affinity);
> diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
> index 68d0945c0b08..d3d33f8d69ce 100644
> --- a/block/blk-mq-virtio.c
> +++ b/block/blk-mq-virtio.c
> @@ -44,3 +44,22 @@ void blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
> blk_mq_map_queues(qmap);
> }
> EXPORT_SYMBOL_GPL(blk_mq_virtio_map_queues);
> +
> +/**
> + * blk_mq_virtio_get_queue_affinity - get affinity mask queue mapping for virtio device
Please avoid the overly long line here.
> +const struct cpumask *blk_mq_virtio_get_queue_affinity(void *dev_data,
> + int offset,
> + int queue)
> +{
And maybe use sane formatting here:
const struct cpumask *blk_mq_virtio_get_queue_affinity(void *dev_data,
int offset, int queue)
/me also wonders why the parameters aren't unsigned, but that's history..
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs
2024-08-06 12:06 ` [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
@ 2024-08-06 14:47 ` Ming Lei
2024-08-12 9:09 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Ming Lei @ 2024-08-06 14:47 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:46PM +0200, Daniel Wagner wrote:
> group_cpus_evenly distributes all present CPUs into groups. This ignores
> the isolcpus configuration and assigns isolated CPUs into the groups.
>
> Make group_cpus_evenly aware of isolcpus configuration and use the
> housekeeping CPU mask as base for distributing the available CPUs into
> groups.
>
> Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
This patch fixes nothing on commit 11ea68f553e2, please remove the above
Fixes tag.
Thanks,
Ming
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-06 12:06 ` [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
@ 2024-08-06 14:55 ` Ming Lei
2024-08-07 12:40 ` Daniel Wagner
2024-08-09 15:23 ` Ming Lei
2024-08-13 12:56 ` Ming Lei
2 siblings, 1 reply; 46+ messages in thread
From: Ming Lei @ 2024-08-06 14:55 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled all hardware queues should run on the
> housekeeping CPUs only. Thus ignore the affinity mask provided by the
> driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> to first hctx unless, the CPU is the same as the hctx has the affinity
> set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
What is the expected behavior if someone still tries to submit IO on isolated
CPUs?
BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in this
patchset. That means a random hctx (or even NULL) may be used for submitting
IO from isolated CPUs; then there is an IO hang risk during CPU hotplug, or
a kernel panic when submitting a bio.
Thanks,
Ming
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping
2024-08-06 12:06 ` [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping Daniel Wagner
2024-08-06 13:24 ` Christoph Hellwig
@ 2024-08-06 15:03 ` John Garry
1 sibling, 0 replies; 46+ messages in thread
From: John Garry @ 2024-08-06 15:03 UTC (permalink / raw)
To: Daniel Wagner, Jens Axboe, Keith Busch, Sagi Grimberg,
Thomas Gleixner, Christoph Hellwig, Martin K. Petersen,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On 06/08/2024 13:06, Daniel Wagner wrote:
> blk_mq_pci_map_queues maps all queues but right after this, we
> overwrite these mappings by calling blk_mq_map_queues. Just use one
> helper but not both.
>
> Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
> Signed-off-by: Daniel Wagner<dwagner@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 00/15] honor isolcpus configuration
2024-08-06 13:09 ` [PATCH v3 00/15] honor isolcpus configuration Stefan Hajnoczi
@ 2024-08-07 12:25 ` Daniel Wagner
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-07 12:25 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Tue, Aug 06, 2024 at 09:09:50AM GMT, Stefan Hajnoczi wrote:
> On Tue, 6 Aug 2024 at 08:10, Daniel Wagner <dwagner@suse.de> wrote:
> > The only stall I was able to trigger
> > reliable was with qemu's PCI emulation. It looks like when a CPU is
> > offlined, the PCI affinity is reprogrammed but qemu still routes IRQs to
> > an offline CPU instead to newly programmed destination CPU. All worked
> > fine on real hardware.
>
> Hi Daniel,
> Please file a QEMU bug report here (or just reply to this emails with
> details on how to reproduce the issue and I'll file the issue on your
> behalf):
> https://gitlab.com/qemu-project/qemu/-/issues
>
> We can also wait until your Linux patches have landed if that makes it
> easier to reproduce the bug.
Thanks for the offer. I tried to simplify the setup and came up with a
reproducer using qemu directly instead of libvirt. And now it works just
fine. I'll try to figure out what the magic argument is...
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-06 14:55 ` Ming Lei
@ 2024-08-07 12:40 ` Daniel Wagner
2024-08-08 5:26 ` Ming Lei
0 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-07 12:40 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Tue, Aug 06, 2024 at 10:55:09PM GMT, Ming Lei wrote:
> On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> > When isolcpus=io_queue is enabled all hardware queues should run on the
> > housekeeping CPUs only. Thus ignore the affinity mask provided by the
> > driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> > to first hctx unless, the CPU is the same as the hctx has the affinity
> > set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
>
> What is the expected behavior if someone still tries to submit IO on isolated
> CPUs?
If a user thread is issuing an IO, the IO is handled by a housekeeping
CPU, which will cause some noise on the submitting CPU. As far as I was
told this is acceptable. Our customers really don't want any IO that is
not from their application ever hitting the isolcpus; IO issued by the
application itself is fine.
> BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in this
> patchset,
I was trying to figure out what you tried to explain last time with
hangs, but didn't really understand what the conditions are for this
problem to occur.
> that means one random hctx(or even NULL) may be used for submitting
> IO from isolated CPUs,
> then there can be io hang risk during cpu hotplug, or
> kernel panic when submitting bio.
Can you elaborate a bit more? I must be missing something important here.
Anyway, my understanding is that when the last CPU of a hctx goes
offline the affinity is broken and reassigned to an online HK CPU. And we
ensure all in-flight IO has finished and also ensure we don't submit any
new IO to a CPU which goes offline.
FWIW, I tried really hard to get an IO hang with CPU hotplug.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues
2024-08-06 13:26 ` Christoph Hellwig
@ 2024-08-07 12:49 ` Daniel Wagner
2024-08-12 9:05 ` Christoph Hellwig
0 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-07 12:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Martin K. Petersen, John Garry, Michael S. Tsirkin, Jason Wang,
Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
Sathya Prakash Veerichetty, Suganath Prabu Subramani,
Nilesh Javali, GR-QLogic-Storage-Upstream, Jonathan Corbet,
Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Tue, Aug 06, 2024 at 03:26:45PM GMT, Christoph Hellwig wrote:
> > +
> > +/**
> > + * blk_mq_virtio_get_queue_affinity - get affinity mask queue mapping for virtio device
>
> Please avoid the overly long line here.
I thought for some reason that the brief description needed to be on one
line, but it can span multiple lines:
https://www.kernel.org/doc/html/v6.10/doc-guide/kernel-doc.html#structure-union-and-enumeration-documentation
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-07 12:40 ` Daniel Wagner
@ 2024-08-08 5:26 ` Ming Lei
2024-08-09 7:22 ` Daniel Wagner
0 siblings, 1 reply; 46+ messages in thread
From: Ming Lei @ 2024-08-08 5:26 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, ming.lei
On Wed, Aug 07, 2024 at 02:40:11PM +0200, Daniel Wagner wrote:
> On Tue, Aug 06, 2024 at 10:55:09PM GMT, Ming Lei wrote:
> > On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> > > When isolcpus=io_queue is enabled all hardware queues should run on the
> > > housekeeping CPUs only. Thus ignore the affinity mask provided by the
> > > driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> > > to first hctx unless, the CPU is the same as the hctx has the affinity
> > > set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
> >
> > What is the expected behavior if someone still tries to submit IO on isolated
> > CPUs?
>
> If a user thread is issuing an IO the IO is handled by the housekeeping
> CPU, which will cause some noise on the submitting CPU. As far I was
> told this is acceptable. Our customers really don't want to have any
> IO not from their application ever hitting the isolcpus. When their
> application is issuing an IO.
>
> > BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in this
> > patchset,
>
> I was trying to figure out what you tried to explain last time with
> hangs, but didn't really understand what the conditions are for this
> problem to occur.
Isolated CPUs are removed from the queue mapping in this patchset. When
someone submits IOs from an isolated CPU, which hctx is used for handling
these IOs?
In the current implementation, it depends on the implied zero-filled
tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
During CPU offline, in blk_mq_hctx_notify_offline(),
blk_mq_hctx_has_online_cpu() returns true even though the last CPU in
hctx 0 is offline, because the isolated CPUs joined hctx 0 unexpectedly, so
IOs in hctx 0 won't be drained.
However, the managed irq core code still shuts down the hw queue's irq
because all CPUs in this hctx are offline now. Then an IO hang is triggered,
isn't it?
The current blk-mq uses a static & global queue/CPU mapping in which all
CPUs are covered. This patchset removes isolated CPUs from the mapping,
which is a big change from the viewpoint of blk-mq queue mapping.
>
> > that means one random hctx(or even NULL) may be used for submitting
> > IO from isolated CPUs,
> > then there can be io hang risk during cpu hotplug, or
> > kernel panic when submitting bio.
>
> Can you elaborate a bit more? I must miss something important here.
>
> Anyway, my understanding is that when the last CPU of a hctx goes
> offline the affinity is broken and assigned to an online HK CPU. And we
> ensure all flight IO have finished and also ensure we don't submit any
> new IO to a CPU which goes offline.
>
> FWIW, I tried really hard to get an IO hang with cpu hotplug.
Please see above.
thanks,
Ming
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-08 5:26 ` Ming Lei
@ 2024-08-09 7:22 ` Daniel Wagner
2024-08-09 14:53 ` Ming Lei
0 siblings, 1 reply; 46+ messages in thread
From: Daniel Wagner @ 2024-08-09 7:22 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> Isolated CPUs are removed from queue mapping in this patchset, when someone
> submit IOs from the isolated CPU, what is the correct hctx used for handling
> these IOs?
No, every possible CPU gets a mapping. What this patch series does is
limit/align the number of hardware contexts to the number of
housekeeping CPUs. There is still a complete ctx-hctx mapping. So
whenever a user thread on an isolated CPU is issuing an IO, a
housekeeping CPU will also be involved (with the additional overhead,
which seems to be okay for these users).
Not having hardware queues on the isolated CPUs ensures we really never
get any unexpected IO on those CPUs unless userspace issues it on its own.
It's a safety net.
Just to illustrate it, here is the non-isolcpus (default) mapping
for an 8 CPU setup:
queue mapping for /dev/vda
hctx0: default 0
hctx1: default 1
hctx2: default 2
hctx3: default 3
hctx4: default 4
hctx5: default 5
hctx6: default 6
hctx7: default 7
and with isolcpus=io_queue,2-3,6-7
queue mapping for /dev/vda
hctx0: default 0 2
hctx1: default 1 3
hctx2: default 4 6
hctx3: default 5 7
> From current implementation, it depends on implied zero filled
> tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
>
> During CPU offline, in blk_mq_hctx_notify_offline(),
> blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
> hctx 0 won't be drained.
>
> However managed irq core code still shutdowns the hw queue's irq because all
> CPUs in this hctx are offline now. Then IO hang is triggered, isn't
> it?
Thanks for the explanation. I was able to reproduce this scenario, that
is, a hardware context with two CPUs which go offline. Initially, I used
fio to create the workload but this never hit the hang. Instead,
some background workload from systemd-journald is pretty reliable at
triggering the hang you describe.
Example:
hctx2: default 4 6
CPU 0 stays online, CPU 1-5 are offline. CPU 6 is offlined:
smpboot: CPU 5 is now offline
blk_mq_hctx_has_online_cpu:3537 hctx3 offline
blk_mq_hctx_has_online_cpu:3537 hctx2 offline
and there is no forward progress anymore, the cpuhotplug state machine
is blocked and an IO is hanging:
# grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
/sys/kernel/debug/block/vda/hctx2/tags:busy=61
and blk_mq_hctx_notify_offline busy loops forever:
task:cpuhp/6 state:D stack:0 pid:439 tgid:439 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x79d/0x15c0
? lockdep_hardirqs_on_prepare+0x152/0x210
? kvm_sched_clock_read+0xd/0x20
? local_clock_noinstr+0x28/0xb0
? local_clock+0x11/0x30
? lock_release+0x122/0x4a0
schedule+0x3d/0xb0
schedule_timeout+0x88/0xf0
? __pfx_process_timeout+0x10/0x10
msleep+0x28/0x40
blk_mq_hctx_notify_offline+0x1b5/0x200
? cpuhp_thread_fun+0x41/0x1f0
cpuhp_invoke_callback+0x27e/0x780
? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
? cpuhp_thread_fun+0x42/0x1f0
cpuhp_thread_fun+0x178/0x1f0
smpboot_thread_fn+0x12e/0x1c0
? __pfx_smpboot_thread_fn+0x10/0x10
kthread+0xe8/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x33/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
I don't think this is a new problem this code introduces. This problem
exists for any hardware context which has more than one CPU. As far as I
understand it, the problem is that there is no forward progress possible
for the IO itself (I assume the corresponding resources for the CPU
going offline have already been shut down, thus no progress?) and
blk_mq_hctx_notify_offline isn't doing anything in this scenario.
Couldn't we do something like:
+static bool blk_mq_hctx_timeout_rq(struct request *rq, void *data)
+{
+ blk_mq_rq_timed_out(rq);
+ return true;
+}
+
+static void blk_mq_hctx_timeout_rqs(struct blk_mq_hw_ctx *hctx)
+{
+ struct blk_mq_tags *tags = hctx->sched_tags ?
+ hctx->sched_tags : hctx->tags;
+ blk_mq_all_tag_iter(tags, blk_mq_hctx_timeout_rq, NULL);
+}
+
+
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
struct blk_mq_hw_ctx, cpuhp_online);
+ int i;
if (blk_mq_hctx_has_online_cpu(hctx, cpu))
return 0;
@@ -3551,9 +3589,16 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
* requests. If we could not grab a reference the queue has been
* frozen and there are no requests.
*/
+ i = 0;
if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
- while (blk_mq_hctx_has_requests(hctx))
+ while (blk_mq_hctx_has_requests(hctx) && i++ < 10)
msleep(5);
+ if (blk_mq_hctx_has_requests(hctx)) {
+ pr_info("%s:%d hctx %d force timeout request\n",
+ __func__, __LINE__, hctx->queue_num);
+ blk_mq_hctx_timeout_rqs(hctx);
+ }
+
This guarantees forward progress and it worked in my test scenario, got
the corresponding log entries
blk_mq_hctx_notify_offline:3598 hctx 2 force timeout request
and the hotplug state machine continued. Didn't see an IO error either,
but I haven't looked closely, this is just a POC.
BTW, when looking at the tag allocator, I didn't see any hctx state
checks for the batched allocation path. Don't we need to check if the
corresponding hardware context is active there too?
@@ -486,6 +487,15 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
if (data->nr_tags > 1) {
rq = __blk_mq_alloc_requests_batch(data);
if (rq) {
+ if (unlikely(test_bit(BLK_MQ_S_INACTIVE,
+ &data->hctx->state))) {
+ blk_mq_put_tag(blk_mq_tags_from_data(data),
+ rq->mq_ctx, rq->tag);
+ msleep(3);
+ goto retry;
+ }
blk_mq_rq_time_init(rq, alloc_time_ns);
return rq;
}
But given that this is the hotpath and the hotplug path is very unlikely
to be used at all, at least for the majority of users, I would suggest
trying to get blk_mq_hctx_notify_offline to guarantee forward progress.
This would spare the hotpath an additional 'if'.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-09 7:22 ` Daniel Wagner
@ 2024-08-09 14:53 ` Ming Lei
2024-08-13 12:17 ` Daniel Wagner
0 siblings, 1 reply; 46+ messages in thread
From: Ming Lei @ 2024-08-09 14:53 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > Isolated CPUs are removed from queue mapping in this patchset, when someone
> > submit IOs from the isolated CPU, what is the correct hctx used for handling
> > these IOs?
>
> No, every possible CPU gets a mapping. What this patch series does is
> limit/align the number of hardware contexts to the number of
> housekeeping CPUs. There is still a complete ctx-hctx mapping. So
OK, then I guess patches 1~7 aren't supposed to belong to this series,
because you just want to reduce nr_hw_queues and meanwhile spread
housekeeping CPUs first to avoid queues with an all-isolated cpu mask.
> whenever a user thread on an isolated CPU issues an IO, a
> housekeeping CPU will also be involved (with the additional overhead,
> which seems to be okay for these users).
>
> Having no hardware queue on the isolated CPUs ensures we really never
> get any unexpected IO on those CPUs unless userspace does it on its own.
> It's a safety net.
>
> Just to illustrate it, here is the non-isolcpus (default) mapping
> for an 8 CPU setup:
>
> queue mapping for /dev/vda
> hctx0: default 0
> hctx1: default 1
> hctx2: default 2
> hctx3: default 3
> hctx4: default 4
> hctx5: default 5
> hctx6: default 6
> hctx7: default 7
>
> and with isolcpus=io_queue,2-3,6-7
>
> queue mapping for /dev/vda
> hctx0: default 0 2
> hctx1: default 1 3
> hctx2: default 4 6
> hctx3: default 5 7
OK, looks like I missed the point in patch 15 in which you added isolated
cpus to the mapping manually. Just wondering why not take the current
two-stage policy to cover both housekeeping and isolated CPUs in
group_cpus_evenly()? Such as spreading housekeeping CPUs first, then
isolated CPUs, just like what we did for present & non-present cpus.
Then the whole patchset could be simplified a lot.
>
> > From current implementation, it depends on implied zero filled
> > tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
> >
> > During CPU offline, in blk_mq_hctx_notify_offline(),
> > blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> > hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
> > hctx 0 won't be drained.
> >
> > However managed irq core code still shutdowns the hw queue's irq because all
> > CPUs in this hctx are offline now. Then IO hang is triggered, isn't
> > it?
>
> Thanks for the explanation. I was able to reproduce this scenario, that
> is, a hardware context with two CPUs which go offline. Initially, I used
> fio to create the workload but this never hit the hang. Instead,
> some background workload from systemd-journald is pretty reliable at
> triggering the hang you describe.
>
> Example:
>
> hctx2: default 4 6
>
> CPU 0 stays online, CPU 1-5 are offline. CPU 6 is offlined:
>
> smpboot: CPU 5 is now offline
> blk_mq_hctx_has_online_cpu:3537 hctx3 offline
> blk_mq_hctx_has_online_cpu:3537 hctx2 offline
>
> and there is no forward progress anymore, the cpuhotplug state machine
> is blocked and an IO is hanging:
>
> # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
> /sys/kernel/debug/block/vda/hctx2/tags:busy=61
>
> and blk_mq_hctx_notify_offline busy loops forever:
>
> task:cpuhp/6 state:D stack:0 pid:439 tgid:439 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x79d/0x15c0
> ? lockdep_hardirqs_on_prepare+0x152/0x210
> ? kvm_sched_clock_read+0xd/0x20
> ? local_clock_noinstr+0x28/0xb0
> ? local_clock+0x11/0x30
> ? lock_release+0x122/0x4a0
> schedule+0x3d/0xb0
> schedule_timeout+0x88/0xf0
> ? __pfx_process_timeout+0x10/0x10
> msleep+0x28/0x40
> blk_mq_hctx_notify_offline+0x1b5/0x200
> ? cpuhp_thread_fun+0x41/0x1f0
> cpuhp_invoke_callback+0x27e/0x780
> ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
> ? cpuhp_thread_fun+0x42/0x1f0
> cpuhp_thread_fun+0x178/0x1f0
> smpboot_thread_fn+0x12e/0x1c0
> ? __pfx_smpboot_thread_fn+0x10/0x10
> kthread+0xe8/0x110
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x33/0x40
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> I don't think this is a new problem this code introduces. This problem
> exists for any hardware context which has more than one CPU. As far as I
> understand it, the problem is that there is no forward progress possible
> for the IO itself (I assume the corresponding resources for the CPU
When blk_mq_hctx_notify_offline() is running, the current CPU isn't
offline yet, and the hctx is active, same with the managed irq, so it is
fine to wait there until all in-flight IOs originating from this hctx
have completed. The question is why these requests can't be completed.
The forward progress is provided by blk-mq, and these requests were very
likely allocated & submitted from CPU 6.
Can you figure out what the effective mask for the irq of hctx2 is? It
is supposed to be cpu6. And the block debugfs for vda should provide a
helpful hint.
> going offline have already been shut down, thus no progress?) and
> blk_mq_hctx_notify_offline isn't doing anything in this scenario.
RH has an internal cpu hotplug stress test, but we haven't seen such a
report so far. I will try to set up this kind of configuration and see
if it can be reproduced.
>
> Couldn't we do something like:
I usually won't think about any solution until the root cause is figured
out, :-)
Thanks,
Ming
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-06 12:06 ` [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
2024-08-06 14:55 ` Ming Lei
@ 2024-08-09 15:23 ` Ming Lei
2024-08-13 12:53 ` Daniel Wagner
2024-08-13 12:56 ` Ming Lei
2 siblings, 1 reply; 46+ messages in thread
From: Ming Lei @ 2024-08-09 15:23 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled, all hardware queues should run on
> the housekeeping CPUs only. Thus ignore the affinity mask provided by
> the driver. Also we can't use blk_mq_map_queues because it will map all
> CPUs to the first hctx unless the CPU is the same one the hctx has its
> affinity set to, e.g. 8 CPUs with an isolcpus=io_queue,2-3,6-7 config
>
> queue mapping for /dev/nvme0n1
> hctx0: default 2 3 4 6 7
> hctx1: default 5
> hctx2: default 0
> hctx3: default 1
>
> PCI name is 00:05.0: nvme0n1
> irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
> irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
> irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
> irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
> irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
>
> whereas with blk_mq_hk_map_queues we get
>
> queue mapping for /dev/nvme0n1
> hctx0: default 2 4
> hctx1: default 3 5
> hctx2: default 0 6
> hctx3: default 1 7
>
> PCI name is 00:05.0: nvme0n1
> irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
> irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
> irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
> irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
> irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
> block/blk-mq-cpumap.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index c1277763aeeb..7e026c2ffa02 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -60,11 +60,64 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
> }
> EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
>
> +static bool blk_mq_hk_map_queues(struct blk_mq_queue_map *qmap)
> +{
> + struct cpumask *hk_masks;
> + cpumask_var_t isol_mask;
> +
> + unsigned int queue, cpu;
> +
> + if (!housekeeping_enabled(HK_TYPE_IO_QUEUE))
> + return false;
> +
> + /* map housekeeping cpus to matching hardware context */
> + hk_masks = group_cpus_evenly(qmap->nr_queues);
> + if (!hk_masks)
> + goto fallback;
> +
> + for (queue = 0; queue < qmap->nr_queues; queue++) {
> + for_each_cpu(cpu, &hk_masks[queue])
> + qmap->mq_map[cpu] = qmap->queue_offset + queue;
> + }
> +
> + kfree(hk_masks);
> +
> + /* map isolcpus to hardware context */
> + if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
> + goto fallback;
> +
> + queue = 0;
> + cpumask_andnot(isol_mask,
> + cpu_possible_mask,
> + housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> +
> + for_each_cpu(cpu, isol_mask) {
> + qmap->mq_map[cpu] = qmap->queue_offset + queue;
> + queue = (queue + 1) % qmap->nr_queues;
> + }
> +
> + free_cpumask_var(isol_mask);
> +
> + return true;
> +
> +fallback:
> + /* map all cpus to hardware context ignoring any affinity */
> + queue = 0;
> + for_each_possible_cpu(cpu) {
> + qmap->mq_map[cpu] = qmap->queue_offset + queue;
> + queue = (queue + 1) % qmap->nr_queues;
> + }
> + return true;
> +}
> +
> void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
> {
> const struct cpumask *masks;
> unsigned int queue, cpu;
>
> + if (blk_mq_hk_map_queues(qmap))
> + return;
> +
> masks = group_cpus_evenly(qmap->nr_queues);
> if (!masks) {
> for_each_possible_cpu(cpu)
> @@ -118,6 +171,9 @@ void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
> const struct cpumask *mask;
> unsigned int queue, cpu;
>
> + if (blk_mq_hk_map_queues(qmap))
> + return;
> +
> for (queue = 0; queue < qmap->nr_queues; queue++) {
> mask = get_queue_affinity(dev_data, dev_off, queue);
> if (!mask)
From the above implementation, "isolcpus=io_queue" is actually just an
optimization on top of "isolcpus=managed_irq", and there isn't an
essential difference between the two.
And I'd suggest optimizing 'isolcpus=managed_irq' directly, such as:
- reduce nr_queues or numgrps for group_cpus_evenly() according to the
housekeeping cpu mask
- spread the housekeeping & isolated cpu masks evenly over each queue;
you can use the existing two-stage spread for doing that
thanks,
Ming
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 10/15] blk-mq: add number of queue calc helper
2024-08-06 12:06 ` [PATCH v3 10/15] blk-mq: add number of queue calc helper Daniel Wagner
@ 2024-08-12 9:03 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:03 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 11/15] nvme-pci: use block layer helpers to calculate num of queues
2024-08-06 12:06 ` [PATCH v3 11/15] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
@ 2024-08-12 9:04 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:04 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues
2024-08-07 12:49 ` Daniel Wagner
@ 2024-08-12 9:05 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:05 UTC (permalink / raw)
To: Daniel Wagner
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Thomas Gleixner, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Wed, Aug 07, 2024 at 02:49:58PM +0200, Daniel Wagner wrote:
> On Tue, Aug 06, 2024 at 03:26:45PM GMT, Christoph Hellwig wrote:
> > > +
> > > +/**
> > > + * blk_mq_virtio_get_queue_affinity - get affinity mask queue mapping for virtio device
> >
> > Please avoid the overly long line here.
>
> I thought for some reason the brief description needs to be on one
> line. It can be multiple lines:
>
> https://www.kernel.org/doc/html/v6.10/doc-guide/kernel-doc.html#structure-union-and-enumeration-documentation
I'm pretty sure I've seen line wraps there. But if it wraps, that's
probably an argument that it is too long, as it isn't a single line at
that point for any reasonable definition of a line.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 ` [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues Daniel Wagner
@ 2024-08-12 9:06 ` Christoph Hellwig
2024-08-12 15:31 ` John Garry
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:06 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:36PM +0200, Daniel Wagner wrote:
> Replace all users of blk_mq_pci_map_queues with the more generic
> blk_mq_dev_map_queues. This is in preparation for retiring
> blk_mq_pci_map_queues.
The hisi_sas one doesn't look like a trivial scripted conversion.
Can you split that one out and better document what is done?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 05/15] nvme: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 ` [PATCH v3 05/15] nvme: " Daniel Wagner
@ 2024-08-12 9:06 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:06 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 06/15] virtio: blk/scs: replace blk_mq_virtio_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 ` [PATCH v3 06/15] virtio: blk/scs: replace blk_mq_virtio_map_queues " Daniel Wagner
@ 2024-08-12 9:07 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:07 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
What is blk/scs supposed to mean?
The code changes themselves look fine to me.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers
2024-08-06 12:06 ` [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers Daniel Wagner
@ 2024-08-12 9:08 ` Christoph Hellwig
2024-08-13 9:41 ` Daniel Wagner
0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:08 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Tue, Aug 06, 2024 at 02:06:39PM +0200, Daniel Wagner wrote:
> * Copyright (c) 2016 Christoph Hellwig.
None of my code left at this point :)
> /**
> * blk_mq_pci_get_queue_affinity - get affinity mask queue mapping for PCI device
> * @dev_data: Pointer to struct pci_dev.
But on to something more substantial: these get_queue_affinity
wrappers have nothing specific in them. So maybe move them into
the PCI and virtio subsystems instead and kill off these files
entirely.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 12/15] scsi: use block layer helpers to calculate num of queues
2024-08-06 12:06 ` [PATCH v3 12/15] scsi: " Daniel Wagner
@ 2024-08-12 9:09 ` Christoph Hellwig
0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:09 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
Can you avoid the overly long lines? Otherwise this looks fine.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs
2024-08-06 12:06 ` [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
2024-08-06 14:47 ` Ming Lei
@ 2024-08-12 9:09 ` Christoph Hellwig
1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2024-08-12 9:09 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
2024-08-06 12:06 ` [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues Daniel Wagner
2024-08-12 9:06 ` Christoph Hellwig
@ 2024-08-12 15:31 ` John Garry
2024-08-13 9:39 ` Daniel Wagner
1 sibling, 1 reply; 46+ messages in thread
From: John Garry @ 2024-08-12 15:31 UTC (permalink / raw)
To: Daniel Wagner, Jens Axboe, Keith Busch, Sagi Grimberg,
Thomas Gleixner, Christoph Hellwig, Martin K. Petersen,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet
Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On 06/08/2024 13:06, Daniel Wagner wrote:
> Replace all users of blk_mq_pci_map_queues with the more generic
> blk_mq_dev_map_queues. This is in preparation for retiring
> blk_mq_pci_map_queues.
nit: About blk_mq_dev_map_queues(): the name gives the
impression that we deal in struct device, which is not the case.
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
> drivers/scsi/fnic/fnic_main.c | 3 ++-
> drivers/scsi/hisi_sas/hisi_sas.h | 1 -
> drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 20 ++++++++++----------
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 5 +++--
> drivers/scsi/megaraid/megaraid_sas_base.c | 3 ++-
> drivers/scsi/mpi3mr/mpi3mr_os.c | 3 ++-
> drivers/scsi/mpt3sas/mpt3sas_scsih.c | 3 ++-
> drivers/scsi/pm8001/pm8001_init.c | 3 ++-
> drivers/scsi/qla2xxx/qla_nvme.c | 3 ++-
> drivers/scsi/qla2xxx/qla_os.c | 3 ++-
> drivers/scsi/smartpqi/smartpqi_init.c | 7 ++++---
> 11 files changed, 31 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/scsi/fnic/fnic_main.c b/drivers/scsi/fnic/fnic_main.c
> index 29eead383eb9..dee7f241c38a 100644
> --- a/drivers/scsi/fnic/fnic_main.c
> +++ b/drivers/scsi/fnic/fnic_main.c
> @@ -601,7 +601,8 @@ void fnic_mq_map_queues_cpus(struct Scsi_Host *host)
> return;
> }
>
> - blk_mq_pci_map_queues(qmap, l_pdev, FNIC_PCI_OFFSET);
> + blk_mq_dev_map_queues(qmap, l_pdev, FNIC_PCI_OFFSET,
> + blk_mq_pci_get_queue_affinity);
> }
>
> static int fnic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
> index d223f482488f..010479a354ee 100644
> --- a/drivers/scsi/hisi_sas/hisi_sas.h
> +++ b/drivers/scsi/hisi_sas/hisi_sas.h
> @@ -9,7 +9,6 @@
>
> #include <linux/acpi.h>
> #include <linux/blk-mq.h>
> -#include <linux/blk-mq-pci.h>
> #include <linux/clk.h>
> #include <linux/debugfs.h>
> #include <linux/dmapool.h>
> diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> index 342d75f12051..91daf57f328c 100644
> --- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> @@ -3549,21 +3549,21 @@ static const struct attribute_group *sdev_groups_v2_hw[] = {
> NULL
> };
>
> +static const struct cpumask *hisi_hba_get_queue_affinity(void *dev_data,
> + int offset, int idx)
personally I think that name "queue" would be better than "idx"
> +{
> + struct hisi_hba *hba = dev_data;
> +
> + return irq_get_affinity_mask(hba->irq_map[offset + idx]);
> +}
> +
> static void map_queues_v2_hw(struct Scsi_Host *shost)
> {
> struct hisi_hba *hisi_hba = shost_priv(shost);
> struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
> - const struct cpumask *mask;
> - unsigned int queue, cpu;
>
> - for (queue = 0; queue < qmap->nr_queues; queue++) {
> - mask = irq_get_affinity_mask(hisi_hba->irq_map[96 + queue]);
> - if (!mask)
> - continue;
> -
> - for_each_cpu(cpu, mask)
> - qmap->mq_map[cpu] = qmap->queue_offset + queue;
> - }
> + return blk_mq_dev_map_queues(qmap, hisi_hba, 96,
blk_mq_dev_map_queues() returns void, and so we should not return the
value (which is void).
And I know that the current code is like this, but using CQ0_IRQ_INDEX
instead of 96 would be nicer.
> + hisi_hba_get_queue_affinity);
> }
>
> static const struct scsi_host_template sht_v2_hw = {
> diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> index feda9b54b443..0b3c7b49813a 100644
> --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> @@ -3322,8 +3322,9 @@ static void hisi_sas_map_queues(struct Scsi_Host *shost)
> if (i == HCTX_TYPE_POLL)
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
2024-08-12 15:31 ` John Garry
@ 2024-08-13 9:39 ` Daniel Wagner
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-13 9:39 UTC (permalink / raw)
To: John Garry
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, Michael S. Tsirkin,
Jason Wang, Kashyap Desai, Sumit Saxena, Shivasharan S,
Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
Ming Lei, linux-kernel, linux-block, linux-nvme, linux-scsi,
virtualization, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl,
MPT-FusionLinux.pdl, storagedev, linux-doc
On Mon, Aug 12, 2024 at 04:31:32PM GMT, John Garry wrote:
> On 06/08/2024 13:06, Daniel Wagner wrote:
> > Replace all users of blk_mq_pci_map_queues with the more generic
> > blk_mq_dev_map_queues. This in preparation to retire
> > blk_mq_pci_map_queues.
>
> nit: About blk_mq_dev_map_queues(), from the name it gives the impression
> that we deal in struct device, which is not the case.
What about blk_mq_hctx_map_queues?
> > +static const struct cpumask *hisi_hba_get_queue_affinity(void *dev_data,
> > + int offset, int idx)
>
> personally I think that name "queue" would be better than "idx"
Yes, makes sense and would be more consistent with the rest of the code.
> > + return blk_mq_dev_map_queues(qmap, hisi_hba, 96,
>
> blk_mq_dev_map_queues() returns void, and so we should not return the value
> (which is void).
>
> And I know that the current code is like this, but using CQ0_IRQ_INDEX
> instead of 96 would be nicer.
Sure, will do.
* Re: [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers
2024-08-12 9:08 ` Christoph Hellwig
@ 2024-08-13 9:41 ` Daniel Wagner
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-13 9:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Martin K. Petersen, John Garry, Michael S. Tsirkin, Jason Wang,
Kashyap Desai, Sumit Saxena, Shivasharan S, Chandrakanth patil,
Sathya Prakash Veerichetty, Suganath Prabu Subramani,
Nilesh Javali, GR-QLogic-Storage-Upstream, Jonathan Corbet,
Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Mon, Aug 12, 2024 at 11:08:48AM GMT, Christoph Hellwig wrote:
> On Tue, Aug 06, 2024 at 02:06:39PM +0200, Daniel Wagner wrote:
> > * Copyright (c) 2016 Christoph Hellwig.
>
> None of my code left at this point :)
I do my best to reduce code :)
> > /**
> > * blk_mq_pci_get_queue_affinity - get affinity mask queue mapping for PCI device
> > * @dev_data: Pointer to struct pci_dev.
>
> But on to something more substancial: thes get_queue_affinity
> wrappers have nothing specific in them. So maybe move them into
> the PCI and virtio subsystems instead and kill off these files
> entirely.
Makes sense!
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-09 14:53 ` Ming Lei
@ 2024-08-13 12:17 ` Daniel Wagner
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-13 12:17 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev
On Fri, Aug 09, 2024 at 10:53:16PM GMT, Ming Lei wrote:
> On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> > On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > > Isolated CPUs are removed from queue mapping in this patchset, when someone
> > > submit IOs from the isolated CPU, what is the correct hctx used for handling
> > > these IOs?
> >
> > No, every possible CPU gets a mapping. What this patch series does, is
> > to limit/aligns the number of hardware context to the number of
> > housekeeping CPUs. There is still a complete ctx-hctc mapping. So
>
> OK, then I guess patch 1~7 aren't supposed to belong to this series,
> cause you just want to reduce nr_hw_queues, meantime spread
> house-keeping CPUs first for avoiding queues with all isolated cpu
> mask.
I tried to explain the reason for these patches in the cover letter. The
idea here is that they make the later changes simpler, because we only
have to touch one place. Furthermore, the caller just needs to provide
an affinity mask; the rest of the code is generic. This allows us to
replace the open-coded mapping code in hisi_sas, for example. Overall I
think the resulting code is nicer and cleaner.
> OK, Looks I missed the point in patch 15 in which you added isolated cpu
> into mapping manually, just wondering why not take the current two-stage
> policy to cover both house-keeping and isolated CPUs in
> group_cpus_evenly()?
Patch #15 explains why this approach didn't work in the current form.
blk_mq_map_queues will map all isolated CPUs to the first hctx.
> Such as spread house-keeping CPUs first, then isolated CPUs, just like
> what we did for present & non-present cpus.
I've experimented with this approach and it didn't work (see above).
> When blk_mq_hctx_notify_offline() is running, the current CPU isn't
> offline yet, and the hctx is active, same with the managed irq, so it is fine
> to wait until all in-flight IOs originated from this hctx completed
> there.
But if for some reason these never complete (as in my case), this
blocks forever. Wouldn't it make sense to abort the wait after a
while?
> The reason is why these requests can't be completed? And the forward
> progress is provided by blk-mq. And these requests are very likely
> allocated & submitted from CPU6.
Yes, I can confirm that the in-flight requests have been allocated and
submitted by the CPU which is being offlined.
Here is a log snippet from a different debug session. CPUs 1 and 2 are
already offline, and CPU 3 is being offlined. The CPU mapping for hctx1 is
hctx1: default 1 3
I've added a printk to my hack timeout handler:
blk_mq_hctx_notify_offline:3600 hctx 1 force timeout request
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
That means these requests were allocated on CPU 3 and are still
marked as in flight. As a next step, I am trying to figure out why they
are not completed.
> Can you figure out what is effective mask for irq of hctx2? It is
> supposed to be cpu6. And block debugfs for vda should provide helpful
> hint.
The effective mask for the above debug output is
queue mapping for /dev/vda
hctx0: default 0 2
hctx1: default 1 3
hctx2: default 4 6
hctx3: default 5 7
PCI name is 00:02.0: vda
irq 27 affinity 0-1 effective 0 virtio0-config
irq 28 affinity 0 effective 0 virtio0-req.0
irq 29 affinity 1 effective 1 virtio0-req.1
irq 30 affinity 4 effective 4 virtio0-req.2
irq 31 affinity 5 effective 5 virtio0-req.3
Maybe there is still something off with qemu and the IRQ routing and the
interrupts have been delivered to the wrong CPU.
> > going offline have already been shutdown, thus no progress?) and
> > blk_mq_hctx_notifiy_offline isn't doing anything in this scenario.
>
> RH has internal cpu hotplug stress test, but not see such report so
> far.
Is this stress test running on real hardware? If so, it adds to my
theory that the interrupt might be lost in certain situations when
running qemu.
> Couldn't we do something like:
>
> I usually won't thinking about any solution until root-cause is figured
> out, :-)
I agree, though sometimes it is also okay to have some defensive
programming in place, such as an upper limit before giving up the wait.
But yeah, let's focus on figuring out what's wrong.
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-09 15:23 ` Ming Lei
@ 2024-08-13 12:53 ` Daniel Wagner
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-13 12:53 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Fri, Aug 09, 2024 at 11:23:58PM GMT, Ming Lei wrote:
> From above implementation, "isolcpus=io_queue" is actually just one
> optimization on "isolcpus=managed_irq", and there isn't essential
> difference between the two.
Indeed, the two versions do not differ much. I understood that you
really want to keep managed_irq as it currently is, which is why I
thought we need io_queue.
> And I'd suggest to optimize 'isolcpus=managed_irq' directly, such as:
>
> - reduce nr_queues or numgrps for group_cpus_evenly() according to
> house-keeping cpu mask
Okay.
> - spread house-keeping & isolate cpu mask evenly on each queue, and
> you can use the existed two-stage spread for doing that
Sure, if we can get the spreading sorted out so that not all isolated
CPUs are mapped to the first hctx.
Thanks,
Daniel
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-06 12:06 ` [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
2024-08-06 14:55 ` Ming Lei
2024-08-09 15:23 ` Ming Lei
@ 2024-08-13 12:56 ` Ming Lei
2024-08-13 13:11 ` Daniel Wagner
2 siblings, 1 reply; 46+ messages in thread
From: Ming Lei @ 2024-08-13 12:56 UTC (permalink / raw)
To: Daniel Wagner
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc, ming.lei
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled all hardware queues should run on the
> housekeeping CPUs only. Thus ignore the affinity mask provided by the
> driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> to first hctx unless, the CPU is the same as the hctx has the affinity
> set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
>
> queue mapping for /dev/nvme0n1
> hctx0: default 2 3 4 6 7
> hctx1: default 5
> hctx2: default 0
> hctx3: default 1
>
> PCI name is 00:05.0: nvme0n1
> irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
> irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
> irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
> irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
> irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
>
> where as with blk_mq_hk_map_queues we get
>
> queue mapping for /dev/nvme0n1
> hctx0: default 2 4
> hctx1: default 3 5
> hctx2: default 0 6
> hctx3: default 1 7
>
> PCI name is 00:05.0: nvme0n1
> irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
> irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
> irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
> irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
> irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
> block/blk-mq-cpumap.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index c1277763aeeb..7e026c2ffa02 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -60,11 +60,64 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
> }
> EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
>
> +static bool blk_mq_hk_map_queues(struct blk_mq_queue_map *qmap)
> +{
> + struct cpumask *hk_masks;
> + cpumask_var_t isol_mask;
> +
> + unsigned int queue, cpu;
> +
> + if (!housekeeping_enabled(HK_TYPE_IO_QUEUE))
> + return false;
> +
> + /* map housekeeping cpus to matching hardware context */
> + hk_masks = group_cpus_evenly(qmap->nr_queues);
> + if (!hk_masks)
> + goto fallback;
> +
> + for (queue = 0; queue < qmap->nr_queues; queue++) {
> + for_each_cpu(cpu, &hk_masks[queue])
> + qmap->mq_map[cpu] = qmap->queue_offset + queue;
> + }
> +
> + kfree(hk_masks);
> +
> + /* map isolcpus to hardware context */
> + if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
> + goto fallback;
> +
> + queue = 0;
> + cpumask_andnot(isol_mask,
> + cpu_possible_mask,
> + housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> +
> + for_each_cpu(cpu, isol_mask) {
> + qmap->mq_map[cpu] = qmap->queue_offset + queue;
> + queue = (queue + 1) % qmap->nr_queues;
> + }
> +
With patch 14 and the above change, the managed irq's affinity no
longer matches the blk-mq mapping.
If the last CPU in a managed irq's affinity goes offline, the blk-mq
mapping may still contain other isolated CPUs, so IOs in this hctx won't
be drained by blk_mq_hctx_notify_offline() on CPU offline,
but genirq still shuts down this managed irq.
This introduces an IO hang risk, which is likely the reason for your
hang observation.
Thanks,
Ming
* Re: [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled
2024-08-13 12:56 ` Ming Lei
@ 2024-08-13 13:11 ` Daniel Wagner
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Wagner @ 2024-08-13 13:11 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
Christoph Hellwig, Martin K. Petersen, John Garry,
Michael S. Tsirkin, Jason Wang, Kashyap Desai, Sumit Saxena,
Shivasharan S, Chandrakanth patil, Sathya Prakash Veerichetty,
Suganath Prabu Subramani, Nilesh Javali,
GR-QLogic-Storage-Upstream, Jonathan Corbet, Frederic Weisbecker,
Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
linux-kernel, linux-block, linux-nvme, linux-scsi, virtualization,
megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl,
storagedev, linux-doc
On Tue, Aug 13, 2024 at 08:56:02PM GMT, Ming Lei wrote:
> With patch 14 and the above change, managed irq's affinity becomes not
> matched with blk-mq mapping any more.
Ah, got it. The problem here is that I also need to update the irq
affinity mask for the hctx when offlining a CPU.
Thread overview: 46+ messages
2024-08-06 12:06 [PATCH v3 00/15] honor isolcpus configuration Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 01/15] scsi: pm8001: do not overwrite PCI queue mapping Daniel Wagner
2024-08-06 13:24 ` Christoph Hellwig
2024-08-06 15:03 ` John Garry
2024-08-06 12:06 ` [PATCH v3 02/15] virito: add APIs for retrieving vq affinity Daniel Wagner
2024-08-06 13:25 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 03/15] blk-mq: introduce blk_mq_dev_map_queues Daniel Wagner
2024-08-06 13:26 ` Christoph Hellwig
2024-08-07 12:49 ` Daniel Wagner
2024-08-12 9:05 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 04/15] scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues Daniel Wagner
2024-08-12 9:06 ` Christoph Hellwig
2024-08-12 15:31 ` John Garry
2024-08-13 9:39 ` Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 05/15] nvme: " Daniel Wagner
2024-08-12 9:06 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 06/15] virtio: blk/scs: replace blk_mq_virtio_map_queues " Daniel Wagner
2024-08-12 9:07 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 07/15] blk-mq: remove unused queue mapping helpers Daniel Wagner
2024-08-12 9:08 ` Christoph Hellwig
2024-08-13 9:41 ` Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 08/15] sched/isolation: Add io_queue housekeeping option Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 09/15] docs: add io_queue as isolcpus options Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 10/15] blk-mq: add number of queue calc helper Daniel Wagner
2024-08-12 9:03 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 11/15] nvme-pci: use block layer helpers to calculate num of queues Daniel Wagner
2024-08-12 9:04 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 12/15] scsi: " Daniel Wagner
2024-08-12 9:09 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 13/15] virtio: blk/scsi: " Daniel Wagner
2024-08-06 12:06 ` [PATCH v3 14/15] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
2024-08-06 14:47 ` Ming Lei
2024-08-12 9:09 ` Christoph Hellwig
2024-08-06 12:06 ` [PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Daniel Wagner
2024-08-06 14:55 ` Ming Lei
2024-08-07 12:40 ` Daniel Wagner
2024-08-08 5:26 ` Ming Lei
2024-08-09 7:22 ` Daniel Wagner
2024-08-09 14:53 ` Ming Lei
2024-08-13 12:17 ` Daniel Wagner
2024-08-09 15:23 ` Ming Lei
2024-08-13 12:53 ` Daniel Wagner
2024-08-13 12:56 ` Ming Lei
2024-08-13 13:11 ` Daniel Wagner
2024-08-06 13:09 ` [PATCH v3 00/15] honor isolcpus configuration Stefan Hajnoczi
2024-08-07 12:25 ` Daniel Wagner