* [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel
@ 2023-07-26  9:40 Ming Lei
  2023-07-26  9:40 ` [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues() Ming Lei
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei

Hi,

On arm and ppc64, 'maxcpus=1' is required on the kdump kernel command
line, see `Documentation/admin-guide/kdump/kdump.rst`, so
num_possible_cpus() still returns all CPUs, because 'maxcpus=1' only
brings up a single CPU core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

The issue is only triggered with managed irqs in case of multiple hw
queues. Drivers that take online CPUs into account for nr_hw_queues,
such as nvme rdma/tcp, don't have this problem.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.
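
For clarity, the per-driver changes in this series all boil down to the
same pattern: clamp the driver's idea of the queue/vector count with the
new helper. Roughly (an illustrative sketch only, with made-up foo_*
names; see the individual patches for the real call sites):

	/* illustrative: never set up more hw queues than blk-mq will map */
	static unsigned int foo_nr_io_queues(struct foo_dev *foo)
	{
		return min_t(unsigned int, foo->nr_hw_queues_supported,
			     blk_mq_max_nr_hw_queues());
	}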

V2:
	- add a scsi_max_nr_hw_queues() helper to avoid potential build
	failures, since SCSI drivers often don't deal with blk-mq directly
	- use scsi_max_nr_hw_queues() in all the SCSI changes
	- move the lpfc change into the managed-irq code path


Thanks,
Ming

Ming Lei (9):
  blk-mq: add blk_mq_max_nr_hw_queues()
  nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues
  scsi: core: add helper of scsi_max_nr_hw_queues()
  scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors
  scsi: hisi: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: mpi3mr: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: megaraid: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: mpt3sas: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors
  scsi: pm8001: take blk_mq_max_nr_hw_queues() into account for
    calculating io vectors

 block/blk-mq.c                            | 16 ++++++++++++++++
 drivers/nvme/host/pci.c                   |  2 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    |  3 +++
 drivers/scsi/lpfc/lpfc_init.c             |  2 ++
 drivers/scsi/megaraid/megaraid_sas_base.c |  6 +++++-
 drivers/scsi/mpi3mr/mpi3mr_fw.c           |  3 +++
 drivers/scsi/mpt3sas/mpt3sas_base.c       |  4 ++--
 drivers/scsi/pm8001/pm8001_init.c         |  4 +++-
 include/linux/blk-mq.h                    |  1 +
 include/scsi/scsi_host.h                  |  5 +++++
 10 files changed, 41 insertions(+), 5 deletions(-)

-- 
2.40.1




* [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues()
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26 16:36   ` John Garry
  2023-07-26  9:40 ` [PATCH V2 2/9] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues Ming Lei
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei

blk_mq_alloc_tag_set() may override set->nr_hw_queues to 1 for a kdump
kernel. This causes trouble for the driver, because blk-mq and the driver
then see different queue mappings. In particular, the only online CPU may
not be CPU 0 in a kdump kernel, where 'maxcpus=1' is passed on the kernel
command line, so the driver may map hctx0 to an inactive real hw queue
whose CPU affinity is CPU 0 (offline).

The issue exists in all drivers that use managed irqs and support
multiple hw queues.

Prepare for fixing this kind of issue by adding a helper, so that drivers
can take blk-mq's max nr_hw_queues into account when calculating io
queues.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 16 ++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b04ff6f56926..617d6f849a7b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -140,6 +140,22 @@ void blk_mq_freeze_queue_wait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
+/*
+ * Return the max supported nr_hw_queues for each hw queue type
+ *
+ * blk_mq_alloc_tag_set() may change nr_hw_queues for kdump kernel, so
+ * driver has to take blk-mq max supported nr_hw_queues into account
+ * when figuring out nr_hw_queues from hardware info, for avoiding
+ * inconsistency between driver and blk-mq.
+ */
+unsigned int blk_mq_max_nr_hw_queues(void)
+{
+	if (is_kdump_kernel())
+		return 1;
+	return nr_cpu_ids;
+}
+EXPORT_SYMBOL_GPL(blk_mq_max_nr_hw_queues);
+
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 495ca198775f..4c0cfd1f9e52 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -711,6 +711,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
 		const struct blk_mq_ops *ops, unsigned int queue_depth,
 		unsigned int set_flags);
 void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
+unsigned int blk_mq_max_nr_hw_queues(void);
 
 void blk_mq_free_request(struct request *rq);
 int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
-- 
2.40.1




* [PATCH V2 2/9] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
  2023-07-26  9:40 ` [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues() Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26  9:40 ` [PATCH V2 3/9] scsi: core: add helper of scsi_max_nr_hw_queues() Ming Lei
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Reported-by: Wen Xiong <wenxiong@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index baf69af7ea78..a1227ae7eb39 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2251,7 +2251,7 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return blk_mq_max_nr_hw_queues() + dev->nr_write_queues + dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)
-- 
2.40.1




* [PATCH V2 3/9] scsi: core: add helper of scsi_max_nr_hw_queues()
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
  2023-07-26  9:40 ` [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues() Ming Lei
  2023-07-26  9:40 ` [PATCH V2 2/9] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26  9:40 ` [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors Ming Lei
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei

blk_mq_alloc_tag_set() may change nr_hw_queues for a kdump kernel, so
drivers have to take blk-mq's max supported nr_hw_queues into account
when figuring out nr_hw_queues from hardware info, to avoid
inconsistency between the SCSI driver and blk-mq.

Add a scsi_max_nr_hw_queues() helper to avoid such nr_hw_queues
inconsistency between SCSI drivers and blk-mq.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/scsi/scsi_host.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 70b7475dcf56..2273f855940f 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -803,6 +803,11 @@ extern int scsi_host_unblock(struct Scsi_Host *shost, int new_state);
 void scsi_host_busy_iter(struct Scsi_Host *,
 			 bool (*fn)(struct scsi_cmnd *, void *), void *priv);
 
+static inline unsigned int scsi_max_nr_hw_queues(void)
+{
+	return blk_mq_max_nr_hw_queues();
+}
+
 struct class_container;
 
 /*
-- 
2.40.1




* [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (2 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 3/9] scsi: core: add helper of scsi_max_nr_hw_queues() Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26 22:12   ` Justin Tee
  2023-07-26  9:40 ` [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating " Ming Lei
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei, Justin Tee,
	James Smart

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Cc: Justin Tee <justintee8345@gmail.com>
Cc: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/lpfc/lpfc_init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 3221a934066b..c546e5275108 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -13022,6 +13022,8 @@ lpfc_sli4_enable_msix(struct lpfc_hba *phba)
 		cpu = cpumask_first(aff_mask);
 		cpu_select = lpfc_next_online_cpu(aff_mask, cpu);
 	} else {
+		vectors = min_t(unsigned int, vectors,
+				scsi_max_nr_hw_queues());
 		flags |= PCI_IRQ_AFFINITY;
 	}
 
-- 
2.40.1




* [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (3 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26 15:42   ` John Garry
  2023-07-26  9:40 ` [PATCH V2 6/9] scsi: mpi3mr: " Ming Lei
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei, Xiang Chen

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Cc: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 20e1607c6282..60d2301e7f9d 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 
 
 	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
+	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
+		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
+
 	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;
 
 	return devm_add_action(&pdev->dev, hisi_sas_v3_free_vectors, pdev);
-- 
2.40.1




* [PATCH V2 6/9] scsi: mpi3mr: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (4 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating " Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26  9:40 ` [PATCH V2 7/9] scsi: megaraid: " Ming Lei
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei, Sreekanth Reddy,
	Sathya Prakash Veerichetty, Kashyap Desai, Sumit Saxena

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Cc: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Cc: Sathya Prakash Veerichetty <sathya.prakash@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/mpi3mr/mpi3mr_fw.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/mpi3mr/mpi3mr_fw.c b/drivers/scsi/mpi3mr/mpi3mr_fw.c
index 5fa07d6ee5b8..8fe14e728af6 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_fw.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_fw.c
@@ -815,6 +815,9 @@ static int mpi3mr_setup_isr(struct mpi3mr_ioc *mrioc, u8 setup_one)
 
 		desc.post_vectors = mrioc->requested_poll_qcount;
 		min_vec = desc.pre_vectors + desc.post_vectors;
+		if (max_vectors - min_vec > scsi_max_nr_hw_queues())
+			max_vectors = min_vec + scsi_max_nr_hw_queues();
+
 		irq_flags |= PCI_IRQ_AFFINITY | PCI_IRQ_ALL_TYPES;
 
 		retval = pci_alloc_irq_vectors_affinity(mrioc->pdev,
-- 
2.40.1




* [PATCH V2 7/9] scsi: megaraid: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (5 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 6/9] scsi: mpi3mr: " Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26  9:40 ` [PATCH V2 8/9] scsi: mpt3sas: " Ming Lei
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei, Sreekanth Reddy,
	Sathya Prakash Veerichetty, Kashyap Desai, Sumit Saxena

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Cc: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Cc: Sathya Prakash Veerichetty <sathya.prakash@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 050eed8e2684..587ccbc1ec92 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -5921,6 +5921,10 @@ __megasas_alloc_irq_vectors(struct megasas_instance *instance)
 	int i, irq_flags;
 	struct irq_affinity desc = { .pre_vectors = instance->low_latency_index_start };
 	struct irq_affinity *descp = &desc;
+	unsigned max_vecs = instance->msix_vectors - instance->iopoll_q_count;
+
+	if (max_vecs > scsi_max_nr_hw_queues())
+		max_vecs = scsi_max_nr_hw_queues();
 
 	irq_flags = PCI_IRQ_MSIX;
 
@@ -5934,7 +5938,7 @@ __megasas_alloc_irq_vectors(struct megasas_instance *instance)
 	 */
 	i = pci_alloc_irq_vectors_affinity(instance->pdev,
 		instance->low_latency_index_start,
-		instance->msix_vectors - instance->iopoll_q_count, irq_flags, descp);
+		max_vecs, irq_flags, descp);
 
 	return i;
 }
-- 
2.40.1




* [PATCH V2 8/9] scsi: mpt3sas: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (6 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 7/9] scsi: megaraid: " Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-26  9:40 ` [PATCH V2 9/9] scsi: pm8001: " Ming Lei
  2023-07-31  7:14 ` [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Christoph Hellwig
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei, Sathya Prakash,
	Sreekanth Reddy, Suganath Prabu Subramani

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 53f5492579cb..d238e0275edd 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3332,8 +3332,8 @@ _base_alloc_irq_vectors(struct MPT3SAS_ADAPTER *ioc)
 	 * Don't allocate msix vectors for poll_queues.
 	 * msix_vectors is always within a range of FW supported reply queue.
 	 */
-	int nr_msix_vectors = ioc->iopoll_q_start_index;
-
+	int nr_msix_vectors = min_t(unsigned int, ioc->iopoll_q_start_index,
+			scsi_max_nr_hw_queues());
 
 	if (ioc->smp_affinity_enable)
 		irq_flags |= PCI_IRQ_AFFINITY | PCI_IRQ_ALL_TYPES;
-- 
2.40.1




* [PATCH V2 9/9] scsi: pm8001: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (7 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 8/9] scsi: mpt3sas: " Ming Lei
@ 2023-07-26  9:40 ` Ming Lei
  2023-07-31  7:14 ` [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Christoph Hellwig
  9 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-26  9:40 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Ming Lei, Jack Wang

Take blk-mq's knowledge into account when calculating io queues.

Fix wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues. This inconsistency causes the
driver to apply a wrong queue mapping for handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/pm8001/pm8001_init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index 2e886c1d867d..3e769f4fe26d 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -965,6 +965,8 @@ static u32 pm8001_setup_msix(struct pm8001_hba_info *pm8001_ha)
 		rc = pci_alloc_irq_vectors(pm8001_ha->pdev, 1, 1,
 					   PCI_IRQ_MSIX);
 	} else {
+		unsigned int max_vecs = min_t(unsigned int, PM8001_MAX_MSIX_VEC,
+				scsi_max_nr_hw_queues() + 1);
 		/*
 		 * Queue index #0 is used always for housekeeping, so don't
 		 * include in the affinity spreading.
@@ -973,7 +975,7 @@ static u32 pm8001_setup_msix(struct pm8001_hba_info *pm8001_ha)
 			.pre_vectors = 1,
 		};
 		rc = pci_alloc_irq_vectors_affinity(
-				pm8001_ha->pdev, 2, PM8001_MAX_MSIX_VEC,
+				pm8001_ha->pdev, 2, max_vecs,
 				PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);
 	}
 
-- 
2.40.1




* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26  9:40 ` [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating " Ming Lei
@ 2023-07-26 15:42   ` John Garry
  2023-07-27  1:15     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: John Garry @ 2023-07-26 15:42 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-nvme,
	Martin K . Petersen, linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch, Xiang Chen

On 26/07/2023 10:40, Ming Lei wrote:
> Take blk-mq's knowledge into account for calculating io queues.
> 
> Fix wrong queue mapping in case of kdump kernel.
> 
> On arm and ppc64, 'maxcpus=1' is passed to kdump kernel command line,
> see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> still returns all CPUs because 'maxcpus=1' just bring up one single
> cpu core during booting.
> 
> blk-mq sees single queue in kdump kernel, and in driver's viewpoint
> there are still multiple queues, this inconsistency causes driver to apply
> wrong queue mapping for handling IO, and IO timeout is triggered.
> 
> Meantime, single queue makes much less resource utilization, and reduce
> risk of kernel failure.
> 
> Cc: Xiang Chen <chenxiang66@hisilicon.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> index 20e1607c6282..60d2301e7f9d 100644
> --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> @@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
>   
>   
>   	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
> +	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
> +		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
> +
>   	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;

For other drivers you limit the max MSI vectors which we try to allocate 
according to scsi_max_nr_hw_queues(), but here you continue to alloc the 
same max vectors but then limit the driver's completion queue count. Why 
not limit the max MSI vectors also here?

Thanks,
John

>   
>   	return devm_add_action(&pdev->dev, hisi_sas_v3_free_vectors, pdev);




* Re: [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues()
  2023-07-26  9:40 ` [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues() Ming Lei
@ 2023-07-26 16:36   ` John Garry
  2023-07-27  1:06     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: John Garry @ 2023-07-26 16:36 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-nvme,
	Martin K . Petersen, linux-scsi
  Cc: linux-block, Wen Xiong, Keith Busch

On 26/07/2023 10:40, Ming Lei wrote:

Hi Ming,

> blk_mq_alloc_tag_set() may override set->nr_hw_queues as 1 in case of kdump
> kernel. This way causes trouble for driver, because blk-mq and driver see
> different queue mapping. Especially the only online CPU may not be 1 for
> kdump kernel, in which 'maxcpus=1' is passed from kernel command line,

"the only online CPU may not be 1 for kdump kernel, in which 'maxcpus=1' 
..." - this seems inconsistent with the cover letter, where we have 
"'maxcpus=1' just bring up one single cpu core during booting."

> then driver may map hctx0 into one inactive real hw queue which cpu
> affinity is 0(offline).
> 
> The issue exists on all drivers which use managed irq and support
> multiple hw queue.
> 
> Prepare for fixing this kind of issue by applying the added helper, so
> driver can take blk-mq max nr_hw_queues knowledge into account when
> calculating io queues.

Could you alternatively solve this in blk_mq_pci_map_queues() by fixing
the mappings in case of kdump to have all per-CPU mappings point at the
HW queue associated with cpu0 (which I assume would be active)? It ain't
pretty ...
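
Something roughly like the following (a rough, untested sketch just to
show the idea; the real function would keep its existing affinity-based
mapping for the normal case, and I'm assuming hw queue 0 is the one
whose irq affinity covers cpu0):

	void blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap,
				   struct pci_dev *pdev, int offset)
	{
		unsigned int cpu;

		if (is_kdump_kernel()) {
			/* point every possible CPU at the first hw queue */
			for_each_possible_cpu(cpu)
				qmap->mq_map[cpu] = qmap->queue_offset;
			return;
		}

		/* ... existing per-vector affinity mapping ... */
	}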

> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         | 16 ++++++++++++++++
>   include/linux/blk-mq.h |  1 +
>   2 files changed, 17 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b04ff6f56926..617d6f849a7b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -140,6 +140,22 @@ void blk_mq_freeze_queue_wait(struct request_queue *q)
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
>   
> +/*
> + * Return the max supported nr_hw_queues for each hw queue type
> + *
> + * blk_mq_alloc_tag_set() may change nr_hw_queues for kdump kernel, so
> + * driver has to take blk-mq max supported nr_hw_queues into account
> + * when figuring out nr_hw_queues from hardware info, for avoiding
> + * inconsistency between driver and blk-mq.
> + */
> +unsigned int blk_mq_max_nr_hw_queues(void)
> +{
> +	if (is_kdump_kernel())
> +		return 1;
> +	return nr_cpu_ids;
> +}

We rely on the driver having set->nr_hw_queues == 1 for kdump, right?

If so, how about enforcing it then, like:

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4426,7 +4426,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
         * 64 tags to prevent using too much memory.
         */
        if (is_kdump_kernel()) {
-               set->nr_hw_queues = 1;
+               if (set->nr_hw_queues != 1)
+                       return -EINVAL;
                set->nr_maps = 1;
                set->queue_depth = min(64U, set->queue_depth);
        }


Thanks,
John



* Re: [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors
  2023-07-26  9:40 ` [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors Ming Lei
@ 2023-07-26 22:12   ` Justin Tee
  2023-07-27  1:19     ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Justin Tee @ 2023-07-26 22:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, James Smart,
	Justin Tee

Hi Ming,

From version 1 of the patchset, I thought we had planned to put the
min comparison right above pci_alloc_irq_vectors instead?

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 3221a934066b..20410789e8b8 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -13025,6 +13025,8 @@ lpfc_sli4_enable_msix(struct lpfc_hba *phba)
                flags |= PCI_IRQ_AFFINITY;
        }

+       vectors = min_t(unsigned int, vectors, scsi_max_nr_hw_queues());
+
        rc = pci_alloc_irq_vectors(phba->pcidev, 1, vectors, flags);
        if (rc < 0) {
                lpfc_printf_log(phba, KERN_INFO, LOG_INIT,

Thanks,
Justin

On Wed, Jul 26, 2023 at 2:40 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Take blk-mq's knowledge into account for calculating io queues.
>
> Fix wrong queue mapping in case of kdump kernel.
>
> On arm and ppc64, 'maxcpus=1' is passed to kdump kernel command line,
> see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> still returns all CPUs because 'maxcpus=1' just bring up one single
> cpu core during booting.
>
> blk-mq sees single queue in kdump kernel, and in driver's viewpoint
> there are still multiple queues, this inconsistency causes driver to apply
> wrong queue mapping for handling IO, and IO timeout is triggered.
>
> Meantime, single queue makes much less resource utilization, and reduce
> risk of kernel failure.
>
> Cc: Justin Tee <justintee8345@gmail.com>
> Cc: James Smart <james.smart@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/lpfc/lpfc_init.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index 3221a934066b..c546e5275108 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -13022,6 +13022,8 @@ lpfc_sli4_enable_msix(struct lpfc_hba *phba)
>                 cpu = cpumask_first(aff_mask);
>                 cpu_select = lpfc_next_online_cpu(aff_mask, cpu);
>         } else {
> +               vectors = min_t(unsigned int, vectors,
> +                               scsi_max_nr_hw_queues());
>                 flags |= PCI_IRQ_AFFINITY;
>         }
>
> --
> 2.40.1
>



* Re: [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues()
  2023-07-26 16:36   ` John Garry
@ 2023-07-27  1:06     ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2023-07-27  1:06 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, ming.lei

On Wed, Jul 26, 2023 at 05:36:34PM +0100, John Garry wrote:
> On 26/07/2023 10:40, Ming Lei wrote:
> 
> Hi Ming,
> 
> > blk_mq_alloc_tag_set() may override set->nr_hw_queues as 1 in case of kdump
> > kernel. This way causes trouble for driver, because blk-mq and driver see
> > different queue mapping. Especially the only online CPU may not be 1 for
> > kdump kernel, in which 'maxcpus=1' is passed from kernel command line,
> 
> "the only online CPU may not be 1 for kdump kernel, in which 'maxcpus=1'
> ..." - this seems inconsistent with the cover letter, where we have
> "'maxcpus=1' just bring up one single cpu core during booting."

OK, it looks like I should have mentioned "the only online CPU may not be
CPU 0 with maxcpus=1" in the cover letter.

> > then driver may map hctx0 into one inactive real hw queue which cpu
> > affinity is 0(offline).
> > 
> > The issue exists on all drivers which use managed irq and support
> > multiple hw queue.
> > 
> > Prepare for fixing this kind of issue by applying the added helper, so
> > driver can take blk-mq max nr_hw_queues knowledge into account when
> > calculating io queues.
> 
> Could you alternatively solve in blk_mq_pci_map_queues() by fixing the
> mappings in case of kdump to have all per-CPU mappings point at the HW queue
> associated with cpu0 (which I assume would be active)? It ain't pretty ...

A cpu0 mapping isn't correct. As I mentioned, 'maxcpus=1' doesn't mean
that the only online cpu is cpu0; on ppc64, for example, the only online
cpu could be selected at random during boot.

> 
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c         | 16 ++++++++++++++++
> >   include/linux/blk-mq.h |  1 +
> >   2 files changed, 17 insertions(+)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index b04ff6f56926..617d6f849a7b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -140,6 +140,22 @@ void blk_mq_freeze_queue_wait(struct request_queue *q)
> >   }
> >   EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
> > +/*
> > + * Return the max supported nr_hw_queues for each hw queue type
> > + *
> > + * blk_mq_alloc_tag_set() may change nr_hw_queues for kdump kernel, so
> > + * driver has to take blk-mq max supported nr_hw_queues into account
> > + * when figuring out nr_hw_queues from hardware info, for avoiding
> > + * inconsistency between driver and blk-mq.
> > + */
> > +unsigned int blk_mq_max_nr_hw_queues(void)
> > +{
> > +	if (is_kdump_kernel())
> > +		return 1;
> > +	return nr_cpu_ids;
> > +}
> 
> We rely on the driver having set->nr_hw_queues == 1 for kdump, right?
> 
> If so, how about enforcing it then, like:
> 
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4426,7 +4426,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>         * 64 tags to prevent using too much memory.
>         */
>        if (is_kdump_kernel()) {
> -               set->nr_hw_queues = 1;
> +               if (set->nr_hw_queues != 1)
> +                       return -EINVAL;
>                set->nr_maps = 1;
>                set->queue_depth = min(64U, set->queue_depth);
>        }

In theory, this would be the ideal approach, but it requires all MQ
drivers to be converted to blk_mq_max_nr_hw_queues() first.

However, it shouldn't be an issue for non-managed irqs, given that a
non-managed irq can be migrated to the only online cpu.


Thanks,
Ming




* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-26 15:42   ` John Garry
@ 2023-07-27  1:15     ` Ming Lei
  2023-07-27  7:35       ` John Garry
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-27  1:15 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen,
	ming.lei

On Wed, Jul 26, 2023 at 04:42:42PM +0100, John Garry wrote:
> On 26/07/2023 10:40, Ming Lei wrote:
> > Take blk-mq's knowledge into account for calculating io queues.
> > 
> > Fix wrong queue mapping in case of kdump kernel.
> > 
> > On arm and ppc64, 'maxcpus=1' is passed to kdump kernel command line,
> > see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> > still returns all CPUs because 'maxcpus=1' just bring up one single
> > cpu core during booting.
> > 
> > blk-mq sees single queue in kdump kernel, and in driver's viewpoint
> > there are still multiple queues, this inconsistency causes driver to apply
> > wrong queue mapping for handling IO, and IO timeout is triggered.
> > 
> > Meantime, single queue makes much less resource utilization, and reduce
> > risk of kernel failure.
> > 
> > Cc: Xiang Chen <chenxiang66@hisilicon.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> > index 20e1607c6282..60d2301e7f9d 100644
> > --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> > +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> > @@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
> >   	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
> > +	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
> > +		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
> > +
> >   	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;
> 
> For other drivers you limit the max MSI vectors which we try to allocate
> according to scsi_max_nr_hw_queues(), but here you continue to alloc the
> same max vectors but then limit the driver's completion queue count. Why not
> limit the max MSI vectors also here?

Good catch!

I guess it is because of the following line:

   	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;

I am a bit confused why hisi_sas_v3 takes ->iopoll_q_cnt into account
for the allocated msi vectors. ->iopoll_q_cnt is not supposed to consume
msi vectors, so I think we need the following fix first:

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 20e1607c6282..032c13ce8373 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2549,7 +2549,7 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
                return -ENOENT;


-       hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
+       hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
        shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;

        return devm_add_action(&pdev->dev, hisi_sas_v3_free_vectors, pdev);



Thanks,
Ming




* Re: [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors
  2023-07-26 22:12   ` Justin Tee
@ 2023-07-27  1:19     ` Ming Lei
  2023-07-27 16:56       ` Justin Tee
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-27  1:19 UTC (permalink / raw)
  To: Justin Tee
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, James Smart,
	Justin Tee

On Wed, Jul 26, 2023 at 03:12:16PM -0700, Justin Tee wrote:
> Hi Ming,
> 
> From version 1 of the patchset, I thought we had planned to put the
> min comparison right above pci_alloc_irq_vectors instead?
> 
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index 3221a934066b..20410789e8b8 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -13025,6 +13025,8 @@ lpfc_sli4_enable_msix(struct lpfc_hba *phba)
>                 flags |= PCI_IRQ_AFFINITY;
>         }
> 
> +       vectors = min_t(unsigned int, vectors, scsi_max_nr_hw_queues());

Strictly speaking, the above change is better, but non-managed irq
doesn't have such issue, that is why I just apply the change on managed
irq branch.


Thanks, 
Ming




* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27  1:15     ` Ming Lei
@ 2023-07-27  7:35       ` John Garry
  2023-07-27  9:42         ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: John Garry @ 2023-07-27  7:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen

On 27/07/2023 02:15, Ming Lei wrote:
>>> hisi_sas_v3_hw.c
>>> +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
>>> @@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
>>>    	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
>>> +	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
>>> +		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
>>> +
>>>    	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;
>> For other drivers you limit the max MSI vectors which we try to allocate
>> according to scsi_max_nr_hw_queues(), but here you continue to alloc the
>> same max vectors but then limit the driver's completion queue count. Why not
>> limit the max MSI vectors also here?

Ah, checking again, I think that this driver always allocates the maximum
possible number of MSI vectors due to an arm interrupt controller driver
bug - see the comment at the top of interrupt_preinit_v3_hw(). IIRC,
there was a problem where, if we removed and re-inserted the device
driver, the GIC ITS failed to allocate MSIs unless all MSIs had
previously been allocated.

Xiang Chen can confirm.

> Good catch!
> 
> I guess it is because of the following line:
> 
>     	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
> 
> I am a bit confused why hisi_sas_v3 takes ->iopoll_q_cnt into account
> allocated msi vectors?  ->iopoll_q_cnt supposes to not consume msi
> vectors, so I think we need the following fix first:
> 
> diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> index 20e1607c6282..032c13ce8373 100644
> --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> @@ -2549,7 +2549,7 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
>                  return -ENOENT;
> 
> 
> -       hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
> +       hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
>          shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;

Thanks,
John



* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27  7:35       ` John Garry
@ 2023-07-27  9:42         ` Ming Lei
  2023-07-27 10:28           ` John Garry
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-27  9:42 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen,
	ming.lei

On Thu, Jul 27, 2023 at 08:35:06AM +0100, John Garry wrote:
> On 27/07/2023 02:15, Ming Lei wrote:
> > > > hisi_sas_v3_hw.c
> > > > +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> > > > @@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
> > > >    	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
> > > > +	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
> > > > +		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
> > > > +
> > > >    	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;
> > > For other drivers you limit the max MSI vectors which we try to allocate
> > > according to scsi_max_nr_hw_queues(), but here you continue to alloc the
> > > same max vectors but then limit the driver's completion queue count. Why not
> > > limit the max MSI vectors also here?
> 
> Ah, checking again, I think that this driver always allocates maximum
> possible MSI due to arm interrupt controller driver bug - see comment at top
> of function interrupt_preinit_v3_hw(). IIRC, there was a problem if we
> remove and re-insert the device driver that the GIC ITS fails to allocate
> MSI unless all MSI were previously allocated.

My question is actually why hisi_hba->iopoll_q_cnt is subtracted from the
allocated vectors, since io poll queues do _not_ use msi vectors.

So it isn't related to the driver's msi vector allocation bug, is it?



Thanks,
Ming




* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27  9:42         ` Ming Lei
@ 2023-07-27 10:28           ` John Garry
  2023-07-27 10:56             ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: John Garry @ 2023-07-27 10:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen

On 27/07/2023 10:42, Ming Lei wrote:
>>>>> hisi_sas_v3_hw.c
>>>>> +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
>>>>> @@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
>>>>>     	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
>>>>> +	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
>>>>> +		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
>>>>> +
>>>>>     	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;
>>>> For other drivers you limit the max MSI vectors which we try to allocate
>>>> according to scsi_max_nr_hw_queues(), but here you continue to alloc the
>>>> same max vectors but then limit the driver's completion queue count. Why not
>>>> limit the max MSI vectors also here?
>> Ah, checking again, I think that this driver always allocates maximum
>> possible MSI due to arm interrupt controller driver bug - see comment at top
>> of function interrupt_preinit_v3_hw(). IIRC, there was a problem if we
>> remove and re-insert the device driver that the GIC ITS fails to allocate
>> MSI unless all MSI were previously allocated.
> My question is actually why hisi_hba->iopoll_q_cnt is subtracted from
> allocated vectors since io poll queues does_not_  use msi vector.
> 

It is subtracted because superfluous vectors were allocated.

As I mentioned, I think that the driver is forced to allocate all 32 MSI
vectors, even though it really needs at most 32 - iopoll_q_cnt, i.e. we
don't need an MSI vector for a HW queue assigned to polling. Then, since
it gets 16x MSI for cq vectors, it subtracts iopoll_q_cnt to give the
desired count in cq_nvecs.

> So it isn't related with driver's msi vector allocation bug, is it?

My deduction is that this is how it currently "works" for non-zero iopoll
queues:
- allocate the max MSI of 32, which gives 32 vectors including 16 cq
vectors. That then gives:
    - cq_nvecs = 16 - iopoll_q_cnt
    - shost->nr_hw_queues = 16
    - 16x MSI cq vectors were spread over all CPUs

- in hisi_sas_map_queues()
    - HCTX_TYPE_DEFAULT qmap->nr_queues = 16 - iopoll_q_cnt, and for
blk_mq_pci_map_queues() we set up affinity for 16 - iopoll_q_cnt hw
queues. This looks broken, as we originally spread 16x vectors over all
CPUs, but now only set up mappings for (16 - iopoll_q_cnt) vectors, whose
affinity would cover only a subset of CPUs. And then qmap->mq_map[] for
the other CPUs is not set at all.
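
To put concrete numbers on that (if my deduction above is right, and
taking iopoll_q_cnt = 4 as an example): 32 MSI are allocated and the 16
cq vectors (16-31) have their affinity spread over all CPUs, but
cq_nvecs = 16 - 4 = 12, so HCTX_TYPE_DEFAULT only maps hw queues 0-11;
any CPU whose cq vector is one of the last 4 would be left without a
HCTX_TYPE_DEFAULT mapping.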

Thanks,
John



* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27 10:28           ` John Garry
@ 2023-07-27 10:56             ` Ming Lei
  2023-07-27 11:30               ` John Garry
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-27 10:56 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen,
	ming.lei

On Thu, Jul 27, 2023 at 11:28:31AM +0100, John Garry wrote:
> On 27/07/2023 10:42, Ming Lei wrote:
> > > > > > hisi_sas_v3_hw.c
> > > > > > +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
> > > > > > @@ -2550,6 +2550,9 @@ static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
> > > > > >     	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW - hisi_hba->iopoll_q_cnt;
> > > > > > +	if (hisi_hba->cq_nvecs > scsi_max_nr_hw_queues())
> > > > > > +		hisi_hba->cq_nvecs = scsi_max_nr_hw_queues();
> > > > > > +
> > > > > >     	shost->nr_hw_queues = hisi_hba->cq_nvecs + hisi_hba->iopoll_q_cnt;
> > > > > For other drivers you limit the max MSI vectors which we try to allocate
> > > > > according to scsi_max_nr_hw_queues(), but here you continue to alloc the
> > > > > same max vectors but then limit the driver's completion queue count. Why not
> > > > > limit the max MSI vectors also here?
> > > Ah, checking again, I think that this driver always allocates maximum
> > > possible MSI due to arm interrupt controller driver bug - see comment at top
> > > of function interrupt_preinit_v3_hw(). IIRC, there was a problem if we
> > > remove and re-insert the device driver that the GIC ITS fails to allocate
> > > MSI unless all MSI were previously allocated.
> > My question is actually why hisi_hba->iopoll_q_cnt is subtracted from
> > allocated vectors since io poll queues does_not_  use msi vector.
> > 
> 
> It is subtracted as superfluous vectors were allocated.
> 
> As I mentioned, I think that the driver is forced to allocate all 32 MSI
> vectors, even though it really needs max of 32 - iopoll_q_cnt, i.e. we don't
> need an MSI vector for a HW queue assigned to polling. Then, since it gets
> 16x MSI for cq vectors, it subtracts iopoll_q_cnt to give the desired count
> in cq_nvecs.

It looks like the current driver ties nr_irq_queues to nr_poll_queues,
which is logically wrong.

Here:

1) hisi_hba->queue_count means the max supported queue count (irq queues
+ poll queues)

2) max vectors (32) means the max supported irq queues, but the actual
nr_irq_queues can be less than the allocated vectors because of the irq
allocation bug

3) so the max supported poll queues should be (hisi_hba->queue_count -
nr_irq_queues).

What I meant is that nr_poll_queues shouldn't be related to the number
of allocated msi vectors.


> > So it isn't related with driver's msi vector allocation bug, is it?
> 
> My deduction is this is how this currently "works" for non-zero iopoll
> queues:
> - allocate max MSI of 32, which gives 32 vectors including 16 cq vectors.
> That then gives:
>    - cq_nvecs = 16 - iopoll_q_cnt
>    - shost->nr_hw_queues = 16
>    - 16x MSI cq vectors were spread over all CPUs

It should be that the cq_nvecs vectors are spread over all CPUs, and the
iopoll_q_cnt poll queues are spread over all CPUs too.

For each queue type, the nr_queues of that type are spread over all
CPUs.
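
The usual pattern for a driver with both irq and poll queues looks
roughly like below (a simplified sketch, not the hisi_sas code;
nr_irq_queues/nr_poll_queues/pdev/vector_offset stand for the driver's
own bookkeeping):

	static void foo_map_queues(struct Scsi_Host *shost)
	{
		struct blk_mq_queue_map *qmap;

		/* irq queues: mapping follows the managed-irq affinity */
		qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
		qmap->nr_queues = nr_irq_queues;
		blk_mq_pci_map_queues(qmap, pdev, vector_offset);

		/* poll queues: no irq, blk-mq spreads them over all CPUs */
		qmap = &shost->tag_set.map[HCTX_TYPE_POLL];
		qmap->nr_queues = nr_poll_queues;
		qmap->queue_offset = nr_irq_queues;
		blk_mq_map_queues(qmap);
	}

so each queue type ends up with its own mapping that covers every CPU.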

> 
> - in hisi_sas_map_queues()
>    - HCTX_TYPE_DEFAULT qmap->nr_queues = 16 - iopoll_q_cnt, and for
> blk_mq_pci_map_queues() we setup affinity for 16 - iopoll_q_cnt hw queues.
> This looks broken, as we originally spread 16x vectors over all CPUs, but
> now only setup mappings for (16 - iopoll_q_cnt) vectors, whose affinity
> would spread a subset of CPUs. And then qmap->mq_map[] for other CPUs is not
> set at all.

That isn't true, please see my above comment.


Thanks,
Ming




* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27 10:56             ` Ming Lei
@ 2023-07-27 11:30               ` John Garry
  2023-07-27 12:01                 ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: John Garry @ 2023-07-27 11:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen

On 27/07/2023 11:56, Ming Lei wrote:
>> As I mentioned, I think that the driver is forced to allocate all 32 MSI
>> vectors, even though it really needs max of 32 - iopoll_q_cnt, i.e. we don't
>> need an MSI vector for a HW queue assigned to polling. Then, since it gets
>> 16x MSI for cq vectors, it subtracts iopoll_q_cnt to give the desired count
>> in cq_nvecs.
> Looks the current driver wires nr_irq_queues with nr_poll_queues, which is
> wrong logically.
> 
> Here:
> 
> 1) hisi_hba->queue_count means the max supported queue count(irq queues
> + poll queues)
> 

JFYI, from 
https://github.com/tianocore/edk2-platforms/blob/master/Silicon/Hisilicon/Hi1620/Hi1620AcpiTables/Dsdt/Hi1620Pci.asl#L587C25-L587C41, 
hisi_hba->queue_count is fixed at 16.

> 2) max vectors(32) means the max supported irq queues,
I'll just mention some details of this HW, in case they were missed.

For this HW, max vectors is 32, but not all are for completion queue 
interrupts.

interrupts 0-15 are for housekeeping, like phy up notification
interrupts 16-31 are for completion queue interrupts

That is a fixed assignment. irq #16 is for HW queue #0 interrupt, irq 
#17 is for HW queue #1 interrupt, and so on.

> but actual
> nr_irq_queues can be less than allocated vectors because of the irq
> allocation bug
> 
> 3) so the max supported poll queues should be (hisi_hba->queue_count -
> nr_irq_queues).
> 
> What I meant is that nr_poll_queues shouldn't be related with allocated
> msi vectors.

Sure, I agree with that. nr_poll_queues is set in that driver as a 
module param.

I am just saying that we have a fixed number of HW queues (16), each of
which may be used for interrupt or polling mode. And since we always
allocate the max number of MSI vectors, the number of interrupt queues
available will be 16 - nr_poll_queues.

> 
> 
>>> So it isn't related with driver's msi vector allocation bug, is it?
>> My deduction is this is how this currently "works" for non-zero iopoll
>> queues:
>> - allocate max MSI of 32, which gives 32 vectors including 16 cq vectors.
>> That then gives:
>>     - cq_nvecs = 16 - iopoll_q_cnt
>>     - shost->nr_hw_queues = 16
>>     - 16x MSI cq vectors were spread over all CPUs
> It should be that cq_nvecs vectors spread over all CPUs, and
> iopoll_q_cnt are spread over all CPUs too.

I agree, it should be, but I don't think that it is for 
HCTX_TYPE_DEFAULT, as below.

> 
> For each queue type, nr_queues of this type are spread over all
> CPUs.
>> - in hisi_sas_map_queues()
>>     - HCTX_TYPE_DEFAULT qmap->nr_queues = 16 - iopoll_q_cnt, and for
>> blk_mq_pci_map_queues() we setup affinity for 16 - iopoll_q_cnt hw queues.
>> This looks broken, as we originally spread 16x vectors over all CPUs, but
>> now only setup mappings for (16 - iopoll_q_cnt) vectors, whose affinity
>> would spread a subset of CPUs. And then qmap->mq_map[] for other CPUs is not
>> set at all.
> That isn't true, please see my above comment.

I am just basing that on what I mentioned above, so please let me know
where my inaccuracy is.

Thanks,
John






* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27 11:30               ` John Garry
@ 2023-07-27 12:01                 ` Ming Lei
  2023-07-27 12:36                   ` John Garry
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2023-07-27 12:01 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen,
	ming.lei

On Thu, Jul 27, 2023 at 12:30:34PM +0100, John Garry wrote:
> On 27/07/2023 11:56, Ming Lei wrote:
> > > As I mentioned, I think that the driver is forced to allocate all 32 MSI
> > > vectors, even though it really needs max of 32 - iopoll_q_cnt, i.e. we don't
> > > need an MSI vector for a HW queue assigned to polling. Then, since it gets
> > > 16x MSI for cq vectors, it subtracts iopoll_q_cnt to give the desired count
> > > in cq_nvecs.
> > Looks the current driver wires nr_irq_queues with nr_poll_queues, which is
> > wrong logically.
> > 
> > Here:
> > 
> > 1) hisi_hba->queue_count means the max supported queue count(irq queues
> > + poll queues)
> > 
> 
> JFYI, from https://github.com/tianocore/edk2-platforms/blob/master/Silicon/Hisilicon/Hi1620/Hi1620AcpiTables/Dsdt/Hi1620Pci.asl#L587C25-L587C41,
> hisi_hba->queue_count is fixed at 16.
> 
> > 2) max vectors(32) means the max supported irq queues,
> I'll just mention some details of this HW, in case missed.
> 
> For this HW, max vectors is 32, but not all are for completion queue
> interrupts.
> 
> interrupts 0-15 are for housekeeping, like phy up notification
> interrupts 16-31 are for completion queue interrupts
> 
> That is a fixed assignment. irq #16 is for HW queue #0 interrupt, irq #17 is
> for HW queue #1 interrupt, and so on.
> 
> > but actual
> > nr_irq_queues can be less than allocated vectors because of the irq
> > allocation bug
> > 
> > 3) so the max supported poll queues should be (hisi_hba->queue_count -
> > nr_irq_queues).
> > 
> > What I meant is that nr_poll_queues shouldn't be related with allocated
> > msi vectors.
> 
> Sure, I agree with that. nr_poll_queues is set in that driver as a module
> param.
> 
> I am just saying that we have a fixed number of HW queues (16), each of
> which may be used in interrupt or polling mode. And since we always
> allocate the max number of MSI vectors, the number of interrupt queues
> available will be 16 - nr_poll_queues.

No.

queue_count is fixed at 16, but pci_alloc_irq_vectors_affinity() may
still return fewer vectors: MSI vectors are a system-wide resource,
while queue count is a per-device resource.

So when fewer vectors are allocated, you should be able to use more
poll queues; unfortunately the current code can't support that.

Even worse, hisi_hba->cq_nvecs can become negative if fewer vectors are returned.
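
Something like the following is what I mean (just a sketch, names only
approximate hisi_sas_v3_hw.c, the exact code differs):

        struct irq_affinity desc = { .pre_vectors = 16 };
        int vectors;

        /* min_msi = 17, max_msi = 32: the core may grant anything in that range */
        vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev, 17, 32,
                                                 PCI_IRQ_MSI | PCI_IRQ_AFFINITY,
                                                 &desc);

        /* the 16 housekeeping vectors are always reserved, so: */
        hisi_hba->cq_nvecs = vectors - 16 - hisi_hba->iopoll_q_cnt;

        /* e.g. vectors == 20 with iopoll_q_cnt == 8 gives cq_nvecs == -4 */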

> 
> > 
> > 
> > > > So it isn't related with driver's msi vector allocation bug, is it?
> > > My deduction is this is how this currently "works" for non-zero iopoll
> > > queues:
> > > - allocate max MSI of 32, which gives 32 vectors including 16 cq vectors.
> > > That then gives:
> > >     - cq_nvecs = 16 - iopoll_q_cnt
> > >     - shost->nr_hw_queues = 16
> > >     - 16x MSI cq vectors were spread over all CPUs
> > It should be that cq_nvecs vectors spread over all CPUs, and
> > iopoll_q_cnt are spread over all CPUs too.
> 
> I agree, it should be, but I don't think that it is for HCTX_TYPE_DEFAULT,
> as below.
> 
> > 
> > For each queue type, nr_queues of this type are spread over all
> > CPUs. >> - in hisi_sas_map_queues()
> > >     - HCTX_TYPE_DEFAULT qmap->nr_queues = 16 - iopoll_q_cnt, and for
> > > blk_mq_pci_map_queues() we setup affinity for 16 - iopoll_q_cnt hw queues.
> > > This looks broken, as we originally spread 16x vectors over all CPUs, but
> > > now only setup mappings for (16 - iopoll_q_cnt) vectors, whose affinity
> > > would spread a subset of CPUs. And then qmap->mq_map[] for other CPUs is not
> > > set at all.
> > That isn't true, please see my above comment.
> 
> I am just basing that on what I mention above, so please let me know my
> inaccuracy there.

You said queue mapping for HCTX_TYPE_DEFAULT is broken, but it isn't.

You said 'we originally spread 16x vectors over all CPUs', which isn't
true. Again, '16 - iopoll_q_cnt' vectors are spread over all CPUs, and
the same goes for the iopoll_q_cnt vectors.

Since both blk_mq_map_queues() and blk_mq_pci_map_queues() spread
map->nr_queues over all CPUs, no mapping is confined to a subset of CPUs.

Thanks, 
Ming



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating io vectors
  2023-07-27 12:01                 ` Ming Lei
@ 2023-07-27 12:36                   ` John Garry
  0 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2023-07-27 12:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, Xiang Chen


>>
>> I am just saying that we have a fixed number of HW queues (16), each of
>> which may be used in interrupt or polling mode. And since we always
>> allocate the max number of MSI vectors, the number of interrupt queues
>> available will be 16 - nr_poll_queues.
> 
> No.
> 
> queue_count is fixed at 16, but pci_alloc_irq_vectors_affinity() may
> still return fewer vectors: MSI vectors are a system-wide resource,
> while queue count is a per-device resource.
> 
> So when fewer vectors are allocated, you should be able to use more
> poll queues; unfortunately the current code can't support that.
> 
> Even worse, hisi_hba->cq_nvecs can become negative if fewer vectors are returned.

OK, I see what you mean here. I thought that we were only considering 
the case where the number of vectors allocated equals the max requested.

Yes, I see how allocating fewer than the max can cause an issue. I am 
not sure whether increasing iopoll_q_cnt beyond the driver module param 
value is proper then, but obviously we don't want cq_nvecs to become 
negative.

> 
>>
>>>
>>>
>>>>> So it isn't related with driver's msi vector allocation bug, is it?
>>>> My deduction is this is how this currently "works" for non-zero iopoll
>>>> queues:
>>>> - allocate max MSI of 32, which gives 32 vectors including 16 cq vectors.
>>>> That then gives:
>>>>      - cq_nvecs = 16 - iopoll_q_cnt
>>>>      - shost->nr_hw_queues = 16
>>>>      - 16x MSI cq vectors were spread over all CPUs
>>> It should be that cq_nvecs vectors spread over all CPUs, and
>>> iopoll_q_cnt are spread over all CPUs too.
>>
>> I agree, it should be, but I don't think that it is for HCTX_TYPE_DEFAULT,
>> as below.
>>
>>>
>>> For each queue type, nr_queues of this type are spread over all
>>> CPUs. >> - in hisi_sas_map_queues()
>>>>      - HCTX_TYPE_DEFAULT qmap->nr_queues = 16 - iopoll_q_cnt, and for
>>>> blk_mq_pci_map_queues() we setup affinity for 16 - iopoll_q_cnt hw queues.
>>>> This looks broken, as we originally spread 16x vectors over all CPUs, but
>>>> now only setup mappings for (16 - iopoll_q_cnt) vectors, whose affinity
>>>> would spread a subset of CPUs. And then qmap->mq_map[] for other CPUs is not
>>>> set at all.
>>> That isn't true, please see my above comment.
>>
>> I am just basing that on what I mention above, so please let me know my
>> inaccuracy there.
> 
> You said queue mapping for HCTX_TYPE_DEFAULT is broken, but it isn't.
> 
> You said 'we originally spread 16x vectors over all CPUs', which isn't
> true. 

Are you talking about the case of allocating fewer than the max 
requested vectors, as above?

If we have min_msi = 17, max_msi = 32, affinity_desc = {16, 0}, and we 
allocate 32 vectors from pci_alloc_irq_vectors_affinity(), then I would 
have thought that the affinity for the 16x cq vectors is spread over all 
CPUs. Is that wrong?
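
i.e. roughly this call (just a sketch of what I have in mind, with
pci_dev being the controller's PCI device):

        struct irq_affinity desc = {
                .pre_vectors = 16,      /* vectors 0-15 kept out of the spreading */
        };
        int nvecs = pci_alloc_irq_vectors_affinity(pci_dev, 17, 32,
                                                   PCI_IRQ_MSI | PCI_IRQ_AFFINITY,
                                                   &desc);

        /* if nvecs == 32, the irq core spreads the affinity of the 16
         * remaining cq vectors across all CPUs */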

> Again, '16 - iopoll_q_cnt' vectors are spread over all CPUs, and
> the same goes for the iopoll_q_cnt vectors.
> 
> Since both blk_mq_map_queues() and blk_mq_pci_map_queues() spread
> map->nr_queues over all CPUs, no mapping is confined to a subset of CPUs.
>

Thanks,
John



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors
  2023-07-27  1:19     ` Ming Lei
@ 2023-07-27 16:56       ` Justin Tee
  0 siblings, 0 replies; 25+ messages in thread
From: Justin Tee @ 2023-07-27 16:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch, James Smart,
	Justin Tee

On Wed, Jul 26, 2023 at 6:20 PM Ming Lei <ming.lei@redhat.com> wrote:
> Strictly speaking, the above change is better, but non-managed irq
> doesn't have such issue, that is why I just apply the change on managed
> irq branch.
>
>
> Thanks,
> Ming

Sure, thanks Ming.

Reviewed-by: Justin Tee <justin.tee@broadcom.com>

Regards,
Justin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel
  2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
                   ` (8 preceding siblings ...)
  2023-07-26  9:40 ` [PATCH V2 9/9] scsi: pm8001: " Ming Lei
@ 2023-07-31  7:14 ` Christoph Hellwig
  9 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2023-07-31  7:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Martin K . Petersen,
	linux-scsi, linux-block, Wen Xiong, Keith Busch

On Wed, Jul 26, 2023 at 05:40:18PM +0800, Ming Lei wrote:
> Hi,
> 
> On arm and ppc64, 'maxcpus=1' is required for kdump kernel,
> see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
> still returns all CPUs because 'maxcpus=1' just bring up one single
> cpu core during booting.

Just as I said last time: please drop the odd is_kdump_kernel checks
in blk_mq_update_queue_map and blk_mq_alloc_tag_set instead of sprinkling
this crap even further.



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-07-31  7:14 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-26  9:40 [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Ming Lei
2023-07-26  9:40 ` [PATCH V2 1/9] blk-mq: add blk_mq_max_nr_hw_queues() Ming Lei
2023-07-26 16:36   ` John Garry
2023-07-27  1:06     ` Ming Lei
2023-07-26  9:40 ` [PATCH V2 2/9] nvme-pci: use blk_mq_max_nr_hw_queues() to calculate io queues Ming Lei
2023-07-26  9:40 ` [PATCH V2 3/9] scsi: core: add helper of scsi_max_nr_hw_queues() Ming Lei
2023-07-26  9:40 ` [PATCH V2 4/9] scsi: lpfc: use blk_mq_max_nr_hw_queues() to calculate io vectors Ming Lei
2023-07-26 22:12   ` Justin Tee
2023-07-27  1:19     ` Ming Lei
2023-07-27 16:56       ` Justin Tee
2023-07-26  9:40 ` [PATCH V2 5/9] scsi: hisi: take blk_mq_max_nr_hw_queues() into account for calculating " Ming Lei
2023-07-26 15:42   ` John Garry
2023-07-27  1:15     ` Ming Lei
2023-07-27  7:35       ` John Garry
2023-07-27  9:42         ` Ming Lei
2023-07-27 10:28           ` John Garry
2023-07-27 10:56             ` Ming Lei
2023-07-27 11:30               ` John Garry
2023-07-27 12:01                 ` Ming Lei
2023-07-27 12:36                   ` John Garry
2023-07-26  9:40 ` [PATCH V2 6/9] scsi: mpi3mr: " Ming Lei
2023-07-26  9:40 ` [PATCH V2 7/9] scsi: megaraid: " Ming Lei
2023-07-26  9:40 ` [PATCH V2 8/9] scsi: mpt3sas: " Ming Lei
2023-07-26  9:40 ` [PATCH V2 9/9] scsi: pm8001: " Ming Lei
2023-07-31  7:14 ` [PATCH V2 0/9] blk-mq: fix wrong queue mapping for kdump kernel Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).