* [PATCH v2 0/3] Allow dyn MSI-X vector allocation of MANA
@ 2025-04-25 10:53 Shradha Gupta
  2025-04-25 10:53 ` [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc Shradha Gupta
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-25 10:53 UTC (permalink / raw)
  To: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen,
	Shradha Gupta, Kevin Tian, Long Li, Thomas Gleixner,
	Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm
  Cc: Shradha Gupta

In this patchset, we enable the MANA driver to allocate MSI-X vectors
in PCI dynamically.

The first patch exports pci_msix_prepare_desc() in PCI so that
descriptors for dynamically added MSI-X vectors can be prepared
correctly.

The second patch adds support for dynamic vector allocation in the
pci-hyperv PCI controller by enabling the MSI_FLAG_PCI_MSIX_ALLOC_DYN
flag and using the pci_msix_prepare_desc() exported in the first patch.

The third patch changes the MANA driver to allocate MSI-X vectors
dynamically, if supported by the infrastructure. If the support does
not exist, it defaults to the older behavior.
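
As a rough sketch, the allocation flow in the driver then looks like
this (condensed from patch 3; error handling mostly omitted):

	/* at probe, request a single MSI-X vector for the HW channel */
	nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);

	/* ... create the HWC, query the device for max_msix ... */

	/* then add the remaining vectors one at a time */
	for (i = 1; i <= max_irqs; i++) {
		irq_map = pci_msix_alloc_irq_at(pdev, i, NULL);
		if (!irq_map.virq)
			return irq_map.index;	/* negative error code */
	}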

---
 Changes in v2
 * split the first patch into two (exporting the prepare_desc
   function, and using the function and flag in pci-hyperv)
 * replace 'pci vectors' with 'MSI-X vectors'
 * change the cover letter description to align with the changes made
---

Shradha Gupta (3):
  PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc
  PCI: hv: Allow dynamic MSI-X vector allocation
  net: mana: Allocate MSI-X vectors dynamically as required

 .../net/ethernet/microsoft/mana/gdma_main.c   | 323 ++++++++++++++----
 drivers/pci/controller/pci-hyperv.c           |   7 +-
 drivers/pci/msi/irqdomain.c                   |   5 +-
 include/linux/msi.h                           |   2 +
 include/net/mana/gdma.h                       |  11 +-
 5 files changed, 279 insertions(+), 69 deletions(-)

-- 
2.34.1



* [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc
  2025-04-25 10:53 [PATCH v2 0/3] Allow dyn MSI-X vector allocation of MANA Shradha Gupta
@ 2025-04-25 10:53 ` Shradha Gupta
  2025-04-25 16:37   ` Bjorn Helgaas
  2025-04-25 10:54 ` [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation Shradha Gupta
  2025-04-25 10:54 ` [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required Shradha Gupta
  2 siblings, 1 reply; 19+ messages in thread
From: Shradha Gupta @ 2025-04-25 10:53 UTC (permalink / raw)
  To: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen,
	Shradha Gupta, Kevin Tian, Long Li, Thomas Gleixner,
	Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm
  Cc: Shradha Gupta

For PCI controllers to support dynamic MSI-X vector allocation, enabling
the MSI_FLAG_PCI_MSIX_ALLOC_DYN flag is not enough; msix_prepare_msi_desc()
is also needed to prepare the MSI descriptor.

Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
MSI-X vector allocation.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/pci/msi/irqdomain.c | 5 +++--
 include/linux/msi.h         | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index d7ba8795d60f..43129aa6d6c7 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -222,13 +222,14 @@ static void pci_irq_unmask_msix(struct irq_data *data)
 	pci_msix_unmask(irq_data_get_msi_desc(data));
 }
 
-static void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
-				  struct msi_desc *desc)
+void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
+			   struct msi_desc *desc)
 {
 	/* Don't fiddle with preallocated MSI descriptors */
 	if (!desc->pci.mask_base)
 		msix_prepare_msi_desc(to_pci_dev(desc->dev), desc);
 }
+EXPORT_SYMBOL_GPL(pci_msix_prepare_desc);
 
 static const struct msi_domain_template pci_msix_template = {
 	.chip = {
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 86e42742fd0f..d5864d5e75c2 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -691,6 +691,8 @@ struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode,
 					     struct irq_domain *parent);
 u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
 struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
+void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
+			   struct msi_desc *desc);
 #else /* CONFIG_PCI_MSI */
 static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
 {
-- 
2.34.1



* [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation
  2025-04-25 10:53 [PATCH v2 0/3] Allow dyn MSI-X vector allocation of MANA Shradha Gupta
  2025-04-25 10:53 ` [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc Shradha Gupta
@ 2025-04-25 10:54 ` Shradha Gupta
  2025-04-28 13:47   ` Michael Kelley
  2025-04-25 10:54 ` [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required Shradha Gupta
  2 siblings, 1 reply; 19+ messages in thread
From: Shradha Gupta @ 2025-04-25 10:54 UTC (permalink / raw)
  To: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen,
	Shradha Gupta, Kevin Tian, Long Li, Thomas Gleixner,
	Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm
  Cc: Shradha Gupta

Allow dynamic MSI-X vector allocation for the pci-hyperv PCI controller
by adding support for the MSI_FLAG_PCI_MSIX_ALLOC_DYN flag and using
pci_msix_prepare_desc() to prepare the descriptors.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 Changes in v2:
 * split the patch to keep the changes in PCI and the pci-hyperv
   controller separate
 * replace the string "pci vectors" with "MSI-X vectors"
---
 drivers/pci/controller/pci-hyperv.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ac27bda5ba26..f2fa6bdb6bb8 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -598,7 +598,8 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
 	return cfg->vector;
 }
 
-#define hv_msi_prepare		pci_msi_prepare
+#define hv_msi_prepare			pci_msi_prepare
+#define hv_msix_prepare_desc		pci_msix_prepare_desc
 
 /**
  * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
@@ -727,6 +728,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
 #define FLOW_HANDLER		NULL
 #define FLOW_NAME		NULL
 #define hv_msi_prepare		NULL
+#define hv_msix_prepare_desc	NULL
 
 struct hv_pci_chip_data {
 	DECLARE_BITMAP(spi_map, HV_PCI_MSI_SPI_NR);
@@ -2063,6 +2065,7 @@ static struct irq_chip hv_msi_irq_chip = {
 static struct msi_domain_ops hv_msi_ops = {
 	.msi_prepare	= hv_msi_prepare,
 	.msi_free	= hv_msi_free,
+	.prepare_desc	= hv_msix_prepare_desc,
 };
 
 /**
@@ -2084,7 +2087,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
 	hbus->msi_info.ops = &hv_msi_ops;
 	hbus->msi_info.flags = (MSI_FLAG_USE_DEF_DOM_OPS |
 		MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_MULTI_PCI_MSI |
-		MSI_FLAG_PCI_MSIX);
+		MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN);
 	hbus->msi_info.handler = FLOW_HANDLER;
 	hbus->msi_info.handler_name = FLOW_NAME;
 	hbus->msi_info.data = hbus;
-- 
2.34.1



* [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-04-25 10:53 [PATCH v2 0/3] Allow dyn MSI-X vector allocation of MANA Shradha Gupta
  2025-04-25 10:53 ` [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc Shradha Gupta
  2025-04-25 10:54 ` [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation Shradha Gupta
@ 2025-04-25 10:54 ` Shradha Gupta
  2025-04-25 16:35   ` Jakub Kicinski
                     ` (2 more replies)
  2 siblings, 3 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-25 10:54 UTC (permalink / raw)
  To: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen,
	Shradha Gupta, Kevin Tian, Long Li, Thomas Gleixner,
	Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm
  Cc: Shradha Gupta

Currently, the MANA driver allocates MSI-X vectors statically, based on
the MANA_MAX_NUM_QUEUES and num_online_cpus() values, and in some cases
ends up allocating more vectors than it needs. This is because, at that
point, we do not yet have a HW channel and do not know how many IRQs
should be allocated.

To avoid this, allocate 1 MSI-X vector during the creation of the HWC
and, after getting the value supported by the hardware, dynamically add
the remaining MSI-X vectors.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 Changes in v2:
 * Use the string 'MSI-X vectors' instead of 'pci vectors'
 * make skip_cpu a bool instead of an int
 * rearrange the comment around the skip_cpu variable appropriately
 * update the capability bit for the driver indicating dynamic IRQ
   allocation
 * enforced a max line length of 80
 * enforced the RCT convention
 * initialized gic to NULL for cases where gic may not be populated
   correctly
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 323 ++++++++++++++----
 include/net/mana/gdma.h                       |  11 +-
 2 files changed, 269 insertions(+), 65 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 4ffaf7588885..753b0208e574 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -6,6 +6,9 @@
 #include <linux/pci.h>
 #include <linux/utsname.h>
 #include <linux/version.h>
+#include <linux/msi.h>
+#include <linux/irqdomain.h>
+#include <linux/list.h>
 
 #include <net/mana/mana.h>
 
@@ -80,8 +83,15 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 		return err ? err : -EPROTO;
 	}
 
-	if (gc->num_msix_usable > resp.max_msix)
-		gc->num_msix_usable = resp.max_msix;
+	if (!pci_msix_can_alloc_dyn(pdev)) {
+		if (gc->num_msix_usable > resp.max_msix)
+			gc->num_msix_usable = resp.max_msix;
+	} else {
+		/* If dynamic allocation is enabled, we have already
+		 * allocated the MSI-X vector for the HWC
+		 */
+		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
+	}
 
 	if (gc->num_msix_usable <= 1)
 		return -ENOSPC;
@@ -462,12 +472,13 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
 				const struct gdma_queue_spec *spec)
 {
 	struct gdma_dev *gd = queue->gdma_dev;
-	struct gdma_irq_context *gic;
+	struct gdma_irq_context *gic = NULL;
+	unsigned long flags, flag_irq;
 	struct gdma_context *gc;
 	unsigned int msi_index;
-	unsigned long flags;
+	struct list_head *pos;
 	struct device *dev;
-	int err = 0;
+	int err = 0, count;
 
 	gc = gd->gdma_context;
 	dev = gc->dev;
@@ -482,7 +493,23 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
 	}
 
 	queue->eq.msix_index = msi_index;
-	gic = &gc->irq_contexts[msi_index];
+
+	/* find the gic corresponding to this msi_index in the list */
+	count = 0;
+	spin_lock_irqsave(&gc->irq_ctxs_lock, flag_irq);
+	list_for_each(pos, &gc->irq_contexts) {
+		if (count == msi_index) {
+			gic = list_entry(pos, struct gdma_irq_context,
+					 gic_list);
+			break;
+		}
+
+		count++;
+	}
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flag_irq);
+
+	if (!gic)
+		return -1;
 
 	spin_lock_irqsave(&gic->lock, flags);
 	list_add_rcu(&queue->entry, &gic->eq_list);
@@ -494,11 +521,13 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
 static void mana_gd_deregiser_irq(struct gdma_queue *queue)
 {
 	struct gdma_dev *gd = queue->gdma_dev;
-	struct gdma_irq_context *gic;
+	struct gdma_irq_context *gic = NULL;
+	unsigned long flags, flag_irq;
 	struct gdma_context *gc;
 	unsigned int msix_index;
-	unsigned long flags;
+	struct list_head *pos;
 	struct gdma_queue *eq;
+	int count;
 
 	gc = gd->gdma_context;
 
@@ -507,7 +536,23 @@ static void mana_gd_deregiser_irq(struct gdma_queue *queue)
 	if (WARN_ON(msix_index >= gc->num_msix_usable))
 		return;
 
-	gic = &gc->irq_contexts[msix_index];
+	/* find the gic corresponding to this msix_index in the list */
+	count = 0;
+	spin_lock_irqsave(&gc->irq_ctxs_lock, flag_irq);
+	list_for_each(pos, &gc->irq_contexts) {
+		if (count == msix_index) {
+			gic = list_entry(pos, struct gdma_irq_context,
+					 gic_list);
+			break;
+		}
+
+		count++;
+	}
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flag_irq);
+
+	if (!gic)
+		return;
+
 	spin_lock_irqsave(&gic->lock, flags);
 	list_for_each_entry_rcu(eq, &gic->eq_list, entry) {
 		if (queue == eq) {
@@ -1288,7 +1333,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
 	r->size = 0;
 }
 
-static int irq_setup(unsigned int *irqs, unsigned int len, int node)
+static int irq_setup(unsigned int *irqs, unsigned int len, int node,
+		     bool skip_first_cpu)
 {
 	const struct cpumask *next, *prev = cpu_none_mask;
 	cpumask_var_t cpus __free(free_cpumask_var);
@@ -1303,9 +1349,20 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
 		while (weight > 0) {
 			cpumask_andnot(cpus, next, prev);
 			for_each_cpu(cpu, cpus) {
+				/*
+				 * if the CPU sibling set is to be skipped, we
+				 * just move on to the next CPUs without len--
+				 */
+				if (unlikely(skip_first_cpu)) {
+					skip_first_cpu = false;
+					goto next_cpumask;
+				}
+
 				if (len-- == 0)
 					goto done;
+
 				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
+next_cpumask:
 				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
 				--weight;
 			}
@@ -1317,29 +1374,99 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
 	return 0;
 }
 
-static int mana_gd_setup_irqs(struct pci_dev *pdev)
+static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
-	unsigned int max_queues_per_port;
 	struct gdma_irq_context *gic;
-	unsigned int max_irqs, cpu;
-	int start_irq_index = 1;
-	int nvec, *irqs, irq;
+	bool skip_first_cpu = false;
+	unsigned long flags;
 	int err, i = 0, j;
+	int *irqs, irq;
 
 	cpus_read_lock();
-	max_queues_per_port = num_online_cpus();
-	if (max_queues_per_port > MANA_MAX_NUM_QUEUES)
-		max_queues_per_port = MANA_MAX_NUM_QUEUES;
+	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
+	irqs = kmalloc_array(nvec, sizeof(int), GFP_KERNEL);
+	if (!irqs) {
+		err = -ENOMEM;
+		goto free_irq_vector;
+	}
 
-	/* Need 1 interrupt for the Hardware communication Channel (HWC) */
-	max_irqs = max_queues_per_port + 1;
+	for (i = 0; i < nvec; i++) {
+		gic = kcalloc(1, sizeof(struct gdma_irq_context), GFP_KERNEL);
+		if (!gic) {
+			err = -ENOMEM;
+			goto free_irq;
+		}
+		gic->handler = mana_gd_process_eq_events;
+		INIT_LIST_HEAD(&gic->eq_list);
+		spin_lock_init(&gic->lock);
 
-	nvec = pci_alloc_irq_vectors(pdev, 2, max_irqs, PCI_IRQ_MSIX);
-	if (nvec < 0) {
-		cpus_read_unlock();
-		return nvec;
+		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
+			 i, pci_name(pdev));
+
+		/* one MSI-X vector is already allocated for the HWC */
+		irqs[i] = pci_irq_vector(pdev, i + 1);
+		if (irqs[i] < 0) {
+			err = irqs[i];
+			goto free_current_gic;
+		}
+
+		err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
+		if (err)
+			goto free_current_gic;
+
+		list_add_tail(&gic->gic_list, &gc->irq_contexts);
 	}
+
+	/*
+	 * When calling irq_setup() for dynamically added IRQs, if the number
+	 * of IRQs is greater than or equal to the number of allocated MSI-X
+	 * vectors, we need to skip the first CPU sibling group since it is
+	 * already affinitized to the HWC IRQ
+	 */
+	if (gc->num_msix_usable <= num_online_cpus())
+		skip_first_cpu = true;
+
+	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
+	if (err)
+		goto free_irq;
+
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
+	cpus_read_unlock();
+	kfree(irqs);
+	return 0;
+
+free_current_gic:
+	kfree(gic);
+free_irq:
+	for (j = i - 1; j >= 0; j--) {
+		irq = pci_irq_vector(pdev, j + 1);
+		gic = list_last_entry(&gc->irq_contexts,
+				      struct gdma_irq_context, gic_list);
+		irq_update_affinity_hint(irq, NULL);
+		free_irq(irq, gic);
+		list_del(&gic->gic_list);
+		kfree(gic);
+	}
+	kfree(irqs);
+free_irq_vector:
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
+	cpus_read_unlock();
+	return err;
+}
+
+static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
+{
+	struct gdma_context *gc = pci_get_drvdata(pdev);
+	struct gdma_irq_context *gic;
+	int start_irq_index = 1;
+	unsigned long flags;
+	unsigned int cpu;
+	int *irqs, irq;
+	int err, i = 0, j;
+
+	cpus_read_lock();
+	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
+
 	if (nvec <= num_online_cpus())
 		start_irq_index = 0;
 
@@ -1349,15 +1476,12 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
 		goto free_irq_vector;
 	}
 
-	gc->irq_contexts = kcalloc(nvec, sizeof(struct gdma_irq_context),
-				   GFP_KERNEL);
-	if (!gc->irq_contexts) {
-		err = -ENOMEM;
-		goto free_irq_array;
-	}
-
 	for (i = 0; i < nvec; i++) {
-		gic = &gc->irq_contexts[i];
+		gic = kcalloc(1, sizeof(struct gdma_irq_context), GFP_KERNEL);
+		if (!gic) {
+			err = -ENOMEM;
+			goto free_irq;
+		}
 		gic->handler = mana_gd_process_eq_events;
 		INIT_LIST_HEAD(&gic->eq_list);
 		spin_lock_init(&gic->lock);
@@ -1372,22 +1496,14 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
 		irq = pci_irq_vector(pdev, i);
 		if (irq < 0) {
 			err = irq;
-			goto free_irq;
+			goto free_current_gic;
 		}
 
 		if (!i) {
 			err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
 			if (err)
-				goto free_irq;
-
-			/* If number of IRQ is one extra than number of online CPUs,
-			 * then we need to assign IRQ0 (hwc irq) and IRQ1 to
-			 * same CPU.
-			 * Else we will use different CPUs for IRQ0 and IRQ1.
-			 * Also we are using cpumask_local_spread instead of
-			 * cpumask_first for the node, because the node can be
-			 * mem only.
-			 */
+				goto free_current_gic;
+
 			if (start_irq_index) {
 				cpu = cpumask_local_spread(i, gc->numa_node);
 				irq_set_affinity_and_hint(irq, cpumask_of(cpu));
@@ -1396,39 +1512,108 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
 			}
 		} else {
 			irqs[i - start_irq_index] = irq;
-			err = request_irq(irqs[i - start_irq_index], mana_gd_intr, 0,
-					  gic->name, gic);
+			err = request_irq(irqs[i - start_irq_index],
+					  mana_gd_intr, 0, gic->name, gic);
 			if (err)
-				goto free_irq;
+				goto free_current_gic;
 		}
+
+		list_add_tail(&gic->gic_list, &gc->irq_contexts);
 	}
 
-	err = irq_setup(irqs, (nvec - start_irq_index), gc->numa_node);
+	err = irq_setup(irqs, nvec - start_irq_index, gc->numa_node, false);
 	if (err)
 		goto free_irq;
 
-	gc->max_num_msix = nvec;
-	gc->num_msix_usable = nvec;
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
 	cpus_read_unlock();
 	kfree(irqs);
 	return 0;
 
+free_current_gic:
+	kfree(gic);
 free_irq:
 	for (j = i - 1; j >= 0; j--) {
 		irq = pci_irq_vector(pdev, j);
-		gic = &gc->irq_contexts[j];
-
+		gic = list_last_entry(&gc->irq_contexts,
+				      struct gdma_irq_context, gic_list);
 		irq_update_affinity_hint(irq, NULL);
 		free_irq(irq, gic);
+		list_del(&gic->gic_list);
+		kfree(gic);
 	}
-
-	kfree(gc->irq_contexts);
-	gc->irq_contexts = NULL;
-free_irq_array:
 	kfree(irqs);
 free_irq_vector:
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
 	cpus_read_unlock();
-	pci_free_irq_vectors(pdev);
+	return err;
+}
+
+static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
+{
+	struct gdma_context *gc = pci_get_drvdata(pdev);
+	unsigned int max_irqs, min_irqs;
+	int max_queues_per_port;
+	int nvec, err;
+
+	if (pci_msix_can_alloc_dyn(pdev)) {
+		max_irqs = 1;
+		min_irqs = 1;
+	} else {
+		max_queues_per_port = num_online_cpus();
+		if (max_queues_per_port > MANA_MAX_NUM_QUEUES)
+			max_queues_per_port = MANA_MAX_NUM_QUEUES;
+		/* Need 1 interrupt for HWC */
+		max_irqs = max_queues_per_port + 1;
+		min_irqs = 2;
+	}
+
+	nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
+	if (nvec < 0)
+		return nvec;
+
+	err = mana_gd_setup_irqs(pdev, nvec);
+	if (err) {
+		pci_free_irq_vectors(pdev);
+		return err;
+	}
+
+	gc->num_msix_usable = nvec;
+	gc->max_num_msix = nvec;
+
+	return err;
+}
+
+static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
+{
+	struct gdma_context *gc = pci_get_drvdata(pdev);
+	int max_irqs, i, err = 0;
+	struct msi_map irq_map;
+
+	if (!pci_msix_can_alloc_dyn(pdev))
+		/* remaining IRQs were already allocated with the HWC IRQ */
+		return 0;
+
+	/* allocate only the remaining IRQs */
+	max_irqs = gc->num_msix_usable - 1;
+
+	for (i = 1; i <= max_irqs; i++) {
+		irq_map = pci_msix_alloc_irq_at(pdev, i, NULL);
+		if (!irq_map.virq) {
+			err = irq_map.index;
+			/* the caller will handle cleaning up all
+			 * allocated IRQs after the HWC is destroyed
+			 */
+			return err;
+		}
+	}
+
+	err = mana_gd_setup_dyn_irqs(pdev, max_irqs);
+	if (err)
+		return err;
+
+	gc->max_num_msix = gc->max_num_msix + max_irqs;
+
 	return err;
 }
 
@@ -1436,29 +1621,34 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	int irq, i;
+	struct list_head *pos, *n;
+	unsigned long flags;
+	int irq, i = 0;
 
 	if (gc->max_num_msix < 1)
 		return;
 
-	for (i = 0; i < gc->max_num_msix; i++) {
+	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
+	list_for_each_safe(pos, n, &gc->irq_contexts) {
 		irq = pci_irq_vector(pdev, i);
 		if (irq < 0)
 			continue;
 
-		gic = &gc->irq_contexts[i];
+		gic = list_entry(pos, struct gdma_irq_context, gic_list);
 
 		/* Need to clear the hint before free_irq */
 		irq_update_affinity_hint(irq, NULL);
 		free_irq(irq, gic);
+		list_del(pos);
+		kfree(gic);
+		i++;
 	}
+	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
 
 	pci_free_irq_vectors(pdev);
 
 	gc->max_num_msix = 0;
 	gc->num_msix_usable = 0;
-	kfree(gc->irq_contexts);
-	gc->irq_contexts = NULL;
 }
 
 static int mana_gd_setup(struct pci_dev *pdev)
@@ -1469,9 +1659,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	mana_gd_init_registers(pdev);
 	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
 
-	err = mana_gd_setup_irqs(pdev);
+	err = mana_gd_setup_hwc_irqs(pdev);
 	if (err) {
-		dev_err(gc->dev, "Failed to setup IRQs: %d\n", err);
+		dev_err(gc->dev, "Failed to setup IRQs for HWC creation: %d\n",
+			err);
 		return err;
 	}
 
@@ -1487,6 +1678,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	if (err)
 		goto destroy_hwc;
 
+	err = mana_gd_setup_remaining_irqs(pdev);
+	if (err)
+		goto destroy_hwc;
+
 	err = mana_gd_detect_devices(pdev);
 	if (err)
 		goto destroy_hwc;
@@ -1563,6 +1758,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->is_pf = mana_is_pf(pdev->device);
 	gc->bar0_va = bar0_va;
 	gc->dev = &pdev->dev;
+	INIT_LIST_HEAD(&gc->irq_contexts);
+	spin_lock_init(&gc->irq_ctxs_lock);
 
 	if (gc->is_pf)
 		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 228603bf03f2..6ef4785c63b4 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -363,6 +363,7 @@ struct gdma_irq_context {
 	spinlock_t lock;
 	struct list_head eq_list;
 	char name[MANA_IRQ_NAME_SZ];
+	struct list_head gic_list;
 };
 
 struct gdma_context {
@@ -373,7 +374,9 @@ struct gdma_context {
 	unsigned int		max_num_queues;
 	unsigned int		max_num_msix;
 	unsigned int		num_msix_usable;
-	struct gdma_irq_context	*irq_contexts;
+	struct list_head	irq_contexts;
+	/* Protect the irq_contexts list */
+	spinlock_t		irq_ctxs_lock;
 
 	/* L2 MTU */
 	u16 adapter_mtu;
@@ -558,12 +561,16 @@ enum {
 /* Driver can handle holes (zeros) in the device list */
 #define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
 
+/* Driver supports dynamic MSI-X vector allocation */
+#define GDMA_DRV_CAP_FLAG_1_DYNAMIC_IRQ_ALLOC_SUPPORT BIT(13)
+
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
 	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG | \
 	 GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT | \
-	 GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP)
+	 GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP | \
+	 GDMA_DRV_CAP_FLAG_1_DYNAMIC_IRQ_ALLOC_SUPPORT)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
-- 
2.34.1



* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-04-25 10:54 ` [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required Shradha Gupta
@ 2025-04-25 16:35   ` Jakub Kicinski
  2025-04-28  4:01     ` Shradha Gupta
  2025-04-28 13:38   ` Yury Norov
  2025-05-01  5:27   ` Michael Kelley
  2 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2025-04-25 16:35 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Lorenzo Pieralisi, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Konstantin Taranov, Simon Horman, Leon Romanovsky,
	Maxim Levitsky, Erni Sri Satya Vennela, Peter Zijlstra, netdev,
	linux-rdma, Paul Rosswurm, Shradha Gupta

On Fri, 25 Apr 2025 03:54:38 -0700 Shradha Gupta wrote:
> +	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
> +	irqs = kmalloc_array(nvec, sizeof(int), GFP_KERNEL);

Could you please test your submissions on a kernel with debug options
enabled? Thank you!
-- 
pw-bot: cr
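
With CONFIG_DEBUG_ATOMIC_SLEEP enabled, the quoted code triggers a
"sleeping function called from invalid context" splat: a GFP_KERNEL
allocation can sleep, but spin_lock_irqsave() has already disabled
interrupts. A minimal sketch of one way to reorder this (illustrative
only, not the submitted fix):

	/* do any sleeping allocations before taking the spinlock */
	irqs = kmalloc_array(nvec, sizeof(int), GFP_KERNEL);
	if (!irqs)
		return -ENOMEM;

	cpus_read_lock();
	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
	/* ... only non-sleeping work under the lock ... */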


* Re: [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc
  2025-04-25 10:53 ` [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc Shradha Gupta
@ 2025-04-25 16:37   ` Bjorn Helgaas
  2025-04-28  5:22     ` Shradha Gupta
  2025-04-28 12:22     ` Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Bjorn Helgaas @ 2025-04-25 16:37 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Lorenzo Pieralisi, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Leon Romanovsky, Maxim Levitsky, Erni Sri Satya Vennela,
	Peter Zijlstra, netdev, linux-rdma, Paul Rosswurm, Shradha Gupta

On Fri, Apr 25, 2025 at 03:53:57AM -0700, Shradha Gupta wrote:
> For supporting dynamic MSI-X vector allocation by PCI controllers, enabling
> the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN is not enough, msix_prepare_msi_desc()
> to prepare the desc is also needed.
> 
> Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
> MSI-X vector allocation.
> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

Thanks for the update and for splitting this from the hv driver
update.  Will watch for Thomas's ack here.

For future postings, you might consider limiting the "To:" line to
people you expect to actually act on the patch, and moving the rest to
"Cc:".

> ---
>  drivers/pci/msi/irqdomain.c | 5 +++--
>  include/linux/msi.h         | 2 ++
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
> index d7ba8795d60f..43129aa6d6c7 100644
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -222,13 +222,14 @@ static void pci_irq_unmask_msix(struct irq_data *data)
>  	pci_msix_unmask(irq_data_get_msi_desc(data));
>  }
>  
> -static void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> -				  struct msi_desc *desc)
> +void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> +			   struct msi_desc *desc)
>  {
>  	/* Don't fiddle with preallocated MSI descriptors */
>  	if (!desc->pci.mask_base)
>  		msix_prepare_msi_desc(to_pci_dev(desc->dev), desc);
>  }
> +EXPORT_SYMBOL_GPL(pci_msix_prepare_desc);
>  
>  static const struct msi_domain_template pci_msix_template = {
>  	.chip = {
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 86e42742fd0f..d5864d5e75c2 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -691,6 +691,8 @@ struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode,
>  					     struct irq_domain *parent);
>  u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
>  struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
> +void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> +			   struct msi_desc *desc);
>  #else /* CONFIG_PCI_MSI */
>  static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
>  {
> -- 
> 2.34.1
> 


* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-04-25 16:35   ` Jakub Kicinski
@ 2025-04-28  4:01     ` Shradha Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-28  4:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm, Shradha Gupta

On Fri, Apr 25, 2025 at 09:35:56AM -0700, Jakub Kicinski wrote:
> On Fri, 25 Apr 2025 03:54:38 -0700 Shradha Gupta wrote:
> > +	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
> > +	irqs = kmalloc_array(nvec, sizeof(int), GFP_KERNEL);
> 
> Could you please test your submissions on a kernel with debug options
> enabled? Thank you!
> -- 
> pw-bot: cr

Thank you, Jakub. I will send out a newer version with this change.
I will also make sure to build with debug options enabled for subsequent
submissions.

Regards,
Shradha.


* Re: [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc
  2025-04-25 16:37   ` Bjorn Helgaas
@ 2025-04-28  5:22     ` Shradha Gupta
  2025-04-28 12:22     ` Thomas Gleixner
  1 sibling, 0 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-28  5:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm, Shradha Gupta

On Fri, Apr 25, 2025 at 11:37:48AM -0500, Bjorn Helgaas wrote:
> On Fri, Apr 25, 2025 at 03:53:57AM -0700, Shradha Gupta wrote:
> > For supporting dynamic MSI-X vector allocation by PCI controllers, enabling
> > the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN is not enough, msix_prepare_msi_desc()
> > to prepare the desc is also needed.
> > 
> > Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
> > MSI-X vector allocation.
> > 
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Thanks for the update and for splitting this from the hv driver
> update.  Will watch for Thomas's ack here.
> 
> For future postings, you might consider limiting the "To:" line to
> people you expect to actually act on the patch, and moving the rest to
> "Cc:".

Thanks Bjorn, Noted.
> 
> > ---
> >  drivers/pci/msi/irqdomain.c | 5 +++--
> >  include/linux/msi.h         | 2 ++
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
> > index d7ba8795d60f..43129aa6d6c7 100644
> > --- a/drivers/pci/msi/irqdomain.c
> > +++ b/drivers/pci/msi/irqdomain.c
> > @@ -222,13 +222,14 @@ static void pci_irq_unmask_msix(struct irq_data *data)
> >  	pci_msix_unmask(irq_data_get_msi_desc(data));
> >  }
> >  
> > -static void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> > -				  struct msi_desc *desc)
> > +void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> > +			   struct msi_desc *desc)
> >  {
> >  	/* Don't fiddle with preallocated MSI descriptors */
> >  	if (!desc->pci.mask_base)
> >  		msix_prepare_msi_desc(to_pci_dev(desc->dev), desc);
> >  }
> > +EXPORT_SYMBOL_GPL(pci_msix_prepare_desc);
> >  
> >  static const struct msi_domain_template pci_msix_template = {
> >  	.chip = {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 86e42742fd0f..d5864d5e75c2 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -691,6 +691,8 @@ struct irq_domain *pci_msi_create_irq_domain(struct fwnode_handle *fwnode,
> >  					     struct irq_domain *parent);
> >  u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
> >  struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
> > +void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> > +			   struct msi_desc *desc);
> >  #else /* CONFIG_PCI_MSI */
> >  static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
> >  {
> > -- 
> > 2.34.1
> > 


* Re: [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc
  2025-04-25 16:37   ` Bjorn Helgaas
  2025-04-28  5:22     ` Shradha Gupta
@ 2025-04-28 12:22     ` Thomas Gleixner
  2025-04-29  8:43       ` Shradha Gupta
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2025-04-28 12:22 UTC (permalink / raw)
  To: Bjorn Helgaas, Shradha Gupta
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm, Shradha Gupta

On Fri, Apr 25 2025 at 11:37, Bjorn Helgaas wrote:

Subject prefix wants to be PCI/MSI

  git log --format=oneline path/to/file

gives you a pretty decent hint



> On Fri, Apr 25, 2025 at 03:53:57AM -0700, Shradha Gupta wrote:
>> For supporting dynamic MSI-X vector allocation by PCI controllers, enabling
>> the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN is not enough, msix_prepare_msi_desc()
>> to prepare the desc is also needed.

Please write things out: ... to prepare the MSI descriptor ....

This is not twitter.

>> Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
>> MSI-X vector allocation.
>> 
>> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
>> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>

> Thanks for the update and for splitting this from the hv driver
> update.  Will watch for Thomas's ack here.

Other than that:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-04-25 10:54 ` [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required Shradha Gupta
  2025-04-25 16:35   ` Jakub Kicinski
@ 2025-04-28 13:38   ` Yury Norov
  2025-04-29  8:51     ` Shradha Gupta
  2025-05-01  5:27   ` Michael Kelley
  2 siblings, 1 reply; 19+ messages in thread
From: Yury Norov @ 2025-04-28 13:38 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński,
	Lorenzo Pieralisi, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Leon Romanovsky, Maxim Levitsky, Erni Sri Satya Vennela,
	Peter Zijlstra, netdev, linux-rdma, Paul Rosswurm, Shradha Gupta

On Fri, Apr 25, 2025 at 03:54:38AM -0700, Shradha Gupta wrote:
> Currently, the MANA driver allocates MSI-X vectors statically based on
> MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> up allocating more vectors than it needs. This is because, by this time
> we do not have a HW channel and do not know how many IRQs should be
> allocated.
> 
> To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> after getting the value supported by hardware, dynamically add the
> remaining MSI-X vectors.
> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  Changes in v2:
>  * Use string 'MSI-X vectors' instead of 'pci vectors'
>  * make skip-cpu a bool instead of int
>  * rearrange the comment arout skip_cpu variable appropriately
>  * update the capability bit for driver indicating dynamic IRQ allocation
>  * enforced max line length to 80
>  * enforced RCT convention
>  * initialized gic to NULL, for when there is a possibility of gic
>    not being populated correctly
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 323 ++++++++++++++----
>  include/net/mana/gdma.h                       |  11 +-
>  2 files changed, 269 insertions(+), 65 deletions(-)

To me, this patch looks too big, and it doesn't look like it does
exactly one thing.

Can you split it to a few small more reviewable chunks? For example,
I authored irq_setup() helper. If you split its rework and make it
a small preparation patch, I'll be able to add my review tag.

Thanks,
Yury


* RE: [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation
  2025-04-25 10:54 ` [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation Shradha Gupta
@ 2025-04-28 13:47   ` Michael Kelley
  2025-04-29  8:46     ` Shradha Gupta
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-04-28 13:47 UTC (permalink / raw)
  To: Shradha Gupta, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Nipun Gupta, Yury Norov, Jason Gunthorpe, Jonathan Cameron,
	Anna-Maria Behnsen, Kevin Tian, Long Li, Thomas Gleixner,
	Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm
  Cc: Shradha Gupta

From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:54 AM
> 
> Allow dynamic MSI-X vector allocation for pci_hyperv PCI controller
> by adding support for the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN and using
> pci_msix_prepare_desc() to prepare the descriptors.

I'm unclear from the code below whether the intent is to support dynamic
allocation only on x86, or on both x86 and arm64. On arm64, you've defined
hv_msix_prepare_desc as NULL, but hv_msi_prepare is also defined as NULL
on arm64, so I'm not sure what to conclude. MSI_FLAG_PCI_MSIX_ALLOC_DYN
is being set for both architectures.

In any case, please be explicit about your intent in the commit message.
If the intent is to not support arm64, why is that?

Michael

> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  Changes in v2:
>  * split the patch to keep changes in PCI and pci_hyperv controller
>    seperate
>  * replace strings "pci vectors" by "MSI-X vectors"
> ---
>  drivers/pci/controller/pci-hyperv.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ac27bda5ba26..f2fa6bdb6bb8 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -598,7 +598,8 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>  	return cfg->vector;
>  }
> 
> -#define hv_msi_prepare		pci_msi_prepare
> +#define hv_msi_prepare			pci_msi_prepare
> +#define hv_msix_prepare_desc		pci_msix_prepare_desc
> 
>  /**
>   * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
> @@ -727,6 +728,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
>  #define FLOW_HANDLER		NULL
>  #define FLOW_NAME		NULL
>  #define hv_msi_prepare		NULL
> +#define hv_msix_prepare_desc	NULL
> 
>  struct hv_pci_chip_data {
>  	DECLARE_BITMAP(spi_map, HV_PCI_MSI_SPI_NR);
> @@ -2063,6 +2065,7 @@ static struct irq_chip hv_msi_irq_chip = {
>  static struct msi_domain_ops hv_msi_ops = {
>  	.msi_prepare	= hv_msi_prepare,
>  	.msi_free	= hv_msi_free,
> +	.prepare_desc	= hv_msix_prepare_desc,
>  };
> 
>  /**
> @@ -2084,7 +2087,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
>  	hbus->msi_info.ops = &hv_msi_ops;
>  	hbus->msi_info.flags = (MSI_FLAG_USE_DEF_DOM_OPS |
>  		MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_MULTI_PCI_MSI |
> -		MSI_FLAG_PCI_MSIX);
> +		MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN);
>  	hbus->msi_info.handler = FLOW_HANDLER;
>  	hbus->msi_info.handler_name = FLOW_NAME;
>  	hbus->msi_info.data = hbus;
> --
> 2.34.1
> 



* Re: [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc
  2025-04-28 12:22     ` Thomas Gleixner
@ 2025-04-29  8:43       ` Shradha Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-29  8:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bjorn Helgaas, linux-hyperv, linux-pci, linux-kernel, Nipun Gupta,
	Yury Norov, Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen,
	Kevin Tian, Long Li, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm, Shradha Gupta

On Mon, Apr 28, 2025 at 02:22:57PM +0200, Thomas Gleixner wrote:
> On Fri, Apr 25 2025 at 11:37, Bjorn Helgaas wrote:
> 
> Subject prefix wants to be PCI/MSI
> 
>   git log --format=oneline path/to/file
> 
> gives you a pretty decent hint
> 
> 
> 
> > On Fri, Apr 25, 2025 at 03:53:57AM -0700, Shradha Gupta wrote:
> >> For supporting dynamic MSI-X vector allocation by PCI controllers, enabling
> >> the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN is not enough, msix_prepare_msi_desc()
> >> to prepare the desc is also needed.
> 
> Please write things out: ... to prepare the MSI descriptor ....
> 
> This is not twitter.
> 
> >> Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
> >> MSI-X vector allocation.
> >> 
> >> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> >> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> 
> > Thanks for the update and for splitting this from the hv driver
> > update.  Will watch for Thomas's ack here.
> 
> Other than that:
> 
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Thank you for all the comments, Thomas.
I'll be mindful of these going forward.

regards,
Shradha.


* Re: [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation
  2025-04-28 13:47   ` Michael Kelley
@ 2025-04-29  8:46     ` Shradha Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-29  8:46 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm, Shradha Gupta

On Mon, Apr 28, 2025 at 01:47:22PM +0000, Michael Kelley wrote:
> From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:54 AM
> > 
> > Allow dynamic MSI-X vector allocation for pci_hyperv PCI controller
> > by adding support for the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN and using
> > pci_msix_prepare_desc() to prepare the descriptors.
> 
> I'm unclear from the code below whether the intent is to support dynamic
> allocation only x86, or on both x86 and arm64. On arm64, you've defined
> hv_msi_prepare_desc as NULL, but hv_msi_prepare is also defined as NULL
> on arm64, so I'm not sure what to conclude. MSI_FLAG_PCI_MSIX_ALLOC_DYN
> is being set for both architectures.
> 
> In any case, please be explicit about your intent in the commit message.
> If the intent is to not support arm64, why is that?
> 
> Michael

Thank you for the comments, Michael. In my recent testing on arm64, it
turns out this function is indeed needed for complete support.
I will test this more thoroughly on arm64 and include the arm64 support
in the next version.

Regards,
Shradha.
> 
> > 
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  Changes in v2:
> >  * split the patch to keep changes in PCI and pci_hyperv controller
> >    seperate
> >  * replace strings "pci vectors" by "MSI-X vectors"
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index ac27bda5ba26..f2fa6bdb6bb8 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -598,7 +598,8 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> >  	return cfg->vector;
> >  }
> > 
> > -#define hv_msi_prepare		pci_msi_prepare
> > +#define hv_msi_prepare			pci_msi_prepare
> > +#define hv_msix_prepare_desc		pci_msix_prepare_desc
> > 
> >  /**
> >   * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
> > @@ -727,6 +728,7 @@ static void hv_arch_irq_unmask(struct irq_data *data)
> >  #define FLOW_HANDLER		NULL
> >  #define FLOW_NAME		NULL
> >  #define hv_msi_prepare		NULL
> > +#define hv_msix_prepare_desc	NULL
> > 
> >  struct hv_pci_chip_data {
> >  	DECLARE_BITMAP(spi_map, HV_PCI_MSI_SPI_NR);
> > @@ -2063,6 +2065,7 @@ static struct irq_chip hv_msi_irq_chip = {
> >  static struct msi_domain_ops hv_msi_ops = {
> >  	.msi_prepare	= hv_msi_prepare,
> >  	.msi_free	= hv_msi_free,
> > +	.prepare_desc	= hv_msix_prepare_desc,
> >  };
> > 
> >  /**
> > @@ -2084,7 +2087,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
> >  	hbus->msi_info.ops = &hv_msi_ops;
> >  	hbus->msi_info.flags = (MSI_FLAG_USE_DEF_DOM_OPS |
> >  		MSI_FLAG_USE_DEF_CHIP_OPS | MSI_FLAG_MULTI_PCI_MSI |
> > -		MSI_FLAG_PCI_MSIX);
> > +		MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN);
> >  	hbus->msi_info.handler = FLOW_HANDLER;
> >  	hbus->msi_info.handler_name = FLOW_NAME;
> >  	hbus->msi_info.data = hbus;
> > --
> > 2.34.1
> > 


* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-04-28 13:38   ` Yury Norov
@ 2025-04-29  8:51     ` Shradha Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-04-29  8:51 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-hyperv, linux-pci, linux-kernel, Nipun Gupta,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev, linux-rdma,
	Paul Rosswurm, Shradha Gupta

On Mon, Apr 28, 2025 at 09:38:06AM -0400, Yury Norov wrote:
> On Fri, Apr 25, 2025 at 03:54:38AM -0700, Shradha Gupta wrote:
> > Currently, the MANA driver allocates MSI-X vectors statically based on
> > MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> > up allocating more vectors than it needs. This is because, by this time
> > we do not have a HW channel and do not know how many IRQs should be
> > allocated.
> > 
> > To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> > after getting the value supported by hardware, dynamically add the
> > remaining MSI-X vectors.
> > 
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> >  Changes in v2:
> >  * Use string 'MSI-X vectors' instead of 'pci vectors'
> >  * make skip-cpu a bool instead of int
> >  * rearrange the comment arout skip_cpu variable appropriately
> >  * update the capability bit for driver indicating dynamic IRQ allocation
> >  * enforced max line length to 80
> >  * enforced RCT convention
> >  * initialized gic to NULL, for when there is a possibility of gic
> >    not being populated correctly
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 323 ++++++++++++++----
> >  include/net/mana/gdma.h                       |  11 +-
> >  2 files changed, 269 insertions(+), 65 deletions(-)
> 
> To me, this patch looks too big, and it doesn't look like it does
> exactly one thing.
> 
> Can you split it to a few small more reviewable chunks? For example,
> I authored irq_setup() helper. If you split its rework and make it
> a small preparation patch, I'll be able to add my review tag.
> 
> Thanks,
> Yury

Thank you for the comments, Yury. I think I can split this patch into
the irq_setup() preparation patch and the other functionalities. I will
also investigate whether those other functionalities can be split
further before sending out the next version.

Regards,
Shradha.


* RE: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-04-25 10:54 ` [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required Shradha Gupta
  2025-04-25 16:35   ` Jakub Kicinski
  2025-04-28 13:38   ` Yury Norov
@ 2025-05-01  5:27   ` Michael Kelley
  2025-05-01 14:23     ` Shradha Gupta
  2 siblings, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-05-01  5:27 UTC (permalink / raw)
  To: Shradha Gupta, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Nipun Gupta, Yury Norov, Jason Gunthorpe, Jonathan Cameron,
	Anna-Maria Behnsen, Kevin Tian, Long Li, Thomas Gleixner,
	Bjorn Helgaas, Rob Herring, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Lorenzo Pieralisi, Dexuan Cui, Wei Liu,
	Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Konstantin Taranov,
	Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm
  Cc: Shradha Gupta

From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:55 AM
> 
> Currently, the MANA driver allocates MSI-X vectors statically based on
> MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> up allocating more vectors than it needs. This is because, by this time
> we do not have a HW channel and do not know how many IRQs should be
> allocated.
> 
> To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> after getting the value supported by hardware, dynamically add the
> remaining MSI-X vectors.

I have a top-level thought about the data structures used to manage a
dynamic number of MSI-X vectors. The current code allocates a fixed size
array of struct gdma_irq_context, with one entry in the array for each
MSI-X vector. To find the entry for a particular msi_index, the code can
just index into the array, which is nice and simple.

The new code uses a linked list of struct gdma_irq_context entries, with
one entry in the list for each MSI-X vector.  In the dynamic case, you can
start with one entry in the list, and then add to the list however many
additional entries the hardware will support.

But this additional linked list adds significant complexity to the code
because it must be linearly searched to find the entry for a particular
msi_index, and there's the messiness of putting entries onto the list
and taking them off.  A spin lock is required.  Etc., etc.

Here's an intermediate approach that would be simpler. Allocate a fixed
size array of pointers to struct gdma_irq_context. The fixed size is the
maximum number of possible MSI-X vectors for the device, which I
think is MANA_MAX_NUM_QUEUES, or 64 (correct me if I'm wrong
about that). Allocate a new struct gdma_irq_context when needed,
but store the address in the array rather than adding it onto a list.
Code can then directly index into the array to access the entry.

Some entries in the array will be unused (and "wasted") if the device
uses fewer MSI-X vectors, but each unused entry is only 8 bytes. The
max space unused is fewer than 512 bytes (assuming 64 entries in
the array), which is negligible in the grand scheme of things. With the
simpler code, and not having the additional list entry embedded in
each struct gdma_irq_context, you'll get some of that space back
anyway.

Maybe there's a reason for the list that I missed in my initial
review of the code. But if not, it sure seems like the code could
be simpler, and having some unused 8-byte entries in the array
is worth the tradeoff for the simplicity.

Michael

> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  Changes in v2:
>  * Use string 'MSI-X vectors' instead of 'pci vectors'
>  * make skip-cpu a bool instead of int
>  * rearrange the comment around the skip_cpu variable appropriately
>  * update the capability bit for driver indicating dynamic IRQ allocation
>  * enforced max line length to 80
>  * enforced RCT convention
>  * initialized gic to NULL, for the case where gic might not be
>    populated correctly
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 323 ++++++++++++++----
>  include/net/mana/gdma.h                       |  11 +-
>  2 files changed, 269 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 4ffaf7588885..753b0208e574 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -6,6 +6,9 @@
>  #include <linux/pci.h>
>  #include <linux/utsname.h>
>  #include <linux/version.h>
> +#include <linux/msi.h>
> +#include <linux/irqdomain.h>
> +#include <linux/list.h>
> 
>  #include <net/mana/mana.h>
> 
> @@ -80,8 +83,15 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  		return err ? err : -EPROTO;
>  	}
> 
> -	if (gc->num_msix_usable > resp.max_msix)
> -		gc->num_msix_usable = resp.max_msix;
> +	if (!pci_msix_can_alloc_dyn(pdev)) {
> +		if (gc->num_msix_usable > resp.max_msix)
> +			gc->num_msix_usable = resp.max_msix;
> +	} else {
> +		/* If dynamic allocation is enabled we have already allocated
> +		 * the HWC MSI-X vector
> +		 */
> +		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> +	}
> 
>  	if (gc->num_msix_usable <= 1)
>  		return -ENOSPC;
> @@ -462,12 +472,13 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
>  				const struct gdma_queue_spec *spec)
>  {
>  	struct gdma_dev *gd = queue->gdma_dev;
> -	struct gdma_irq_context *gic;
> +	struct gdma_irq_context *gic = NULL;
> +	unsigned long flags, flag_irq;
>  	struct gdma_context *gc;
>  	unsigned int msi_index;
> -	unsigned long flags;
> +	struct list_head *pos;
>  	struct device *dev;
> -	int err = 0;
> +	int err = 0, count;
> 
>  	gc = gd->gdma_context;
>  	dev = gc->dev;
> @@ -482,7 +493,23 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
>  	}
> 
>  	queue->eq.msix_index = msi_index;
> -	gic = &gc->irq_contexts[msi_index];
> +
> +	/* find the gic entry for this msi_index in the list */
> +	count = 0;
> +	spin_lock_irqsave(&gc->irq_ctxs_lock, flag_irq);
> +	list_for_each(pos, &gc->irq_contexts) {
> +		if (count == msi_index) {
> +			gic = list_entry(pos, struct gdma_irq_context,
> +					 gic_list);
> +			break;
> +		}
> +
> +		count++;
> +	}
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flag_irq);
> +
> +	if (!gic)
> +		return -1;
> 
>  	spin_lock_irqsave(&gic->lock, flags);
>  	list_add_rcu(&queue->entry, &gic->eq_list);
> @@ -494,11 +521,13 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
>  static void mana_gd_deregiser_irq(struct gdma_queue *queue)
>  {
>  	struct gdma_dev *gd = queue->gdma_dev;
> -	struct gdma_irq_context *gic;
> +	struct gdma_irq_context *gic = NULL;
> +	unsigned long flags, flag_irq;
>  	struct gdma_context *gc;
>  	unsigned int msix_index;
> -	unsigned long flags;
> +	struct list_head *pos;
>  	struct gdma_queue *eq;
> +	int count;
> 
>  	gc = gd->gdma_context;
> 
> @@ -507,7 +536,23 @@ static void mana_gd_deregiser_irq(struct gdma_queue *queue)
>  	if (WARN_ON(msix_index >= gc->num_msix_usable))
>  		return;
> 
> -	gic = &gc->irq_contexts[msix_index];
> +	/* find the gic entry for this msix_index in the list */
> +	count = 0;
> +	spin_lock_irqsave(&gc->irq_ctxs_lock, flag_irq);
> +	list_for_each(pos, &gc->irq_contexts) {
> +		if (count == msix_index) {
> +			gic = list_entry(pos, struct gdma_irq_context,
> +					 gic_list);
> +			break;
> +		}
> +
> +		count++;
> +	}
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flag_irq);
> +
> +	if (!gic)
> +		return;
> +
>  	spin_lock_irqsave(&gic->lock, flags);
>  	list_for_each_entry_rcu(eq, &gic->eq_list, entry) {
>  		if (queue == eq) {
> @@ -1288,7 +1333,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
>  	r->size = 0;
>  }
> 
> -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> +		     bool skip_first_cpu)
>  {
>  	const struct cpumask *next, *prev = cpu_none_mask;
>  	cpumask_var_t cpus __free(free_cpumask_var);
> @@ -1303,9 +1349,20 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
>  		while (weight > 0) {
>  			cpumask_andnot(cpus, next, prev);
>  			for_each_cpu(cpu, cpus) {
> +				/*
> +				 * if the CPU sibling set is to be skipped we
> +				 * just move on to the next CPUs without len--
> +				 */
> +				if (unlikely(skip_first_cpu)) {
> +					skip_first_cpu = false;
> +					goto next_cpumask;
> +				}
> +
>  				if (len-- == 0)
>  					goto done;
> +
>  				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> +next_cpumask:
>  				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
>  				--weight;
>  			}
> @@ -1317,29 +1374,99 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
>  	return 0;
>  }
> 
> -static int mana_gd_setup_irqs(struct pci_dev *pdev)
> +static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> -	unsigned int max_queues_per_port;
>  	struct gdma_irq_context *gic;
> -	unsigned int max_irqs, cpu;
> -	int start_irq_index = 1;
> -	int nvec, *irqs, irq;
> +	bool skip_first_cpu = false;
> +	unsigned long flags;
>  	int err, i = 0, j;
> +	int *irqs, irq;
> 
>  	cpus_read_lock();
> -	max_queues_per_port = num_online_cpus();
> -	if (max_queues_per_port > MANA_MAX_NUM_QUEUES)
> -		max_queues_per_port = MANA_MAX_NUM_QUEUES;
> +	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
> +	irqs = kmalloc_array(nvec, sizeof(int), GFP_KERNEL);
> +	if (!irqs) {
> +		err = -ENOMEM;
> +		goto free_irq_vector;
> +	}
> 
> -	/* Need 1 interrupt for the Hardware communication Channel (HWC) */
> -	max_irqs = max_queues_per_port + 1;
> +	for (i = 0; i < nvec; i++) {
> +		gic = kcalloc(1, sizeof(struct gdma_irq_context), GFP_KERNEL);
> +		if (!gic) {
> +			err = -ENOMEM;
> +			goto free_irq;
> +		}
> +		gic->handler = mana_gd_process_eq_events;
> +		INIT_LIST_HEAD(&gic->eq_list);
> +		spin_lock_init(&gic->lock);
> 
> -	nvec = pci_alloc_irq_vectors(pdev, 2, max_irqs, PCI_IRQ_MSIX);
> -	if (nvec < 0) {
> -		cpus_read_unlock();
> -		return nvec;
> +		snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
> +			 i, pci_name(pdev));
> +
> +		/* one MSI-X vector is already allocated for the HWC */
> +		irqs[i] = pci_irq_vector(pdev, i + 1);
> +		if (irqs[i] < 0) {
> +			err = irqs[i];
> +			goto free_current_gic;
> +		}
> +
> +		err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
> +		if (err)
> +			goto free_current_gic;
> +
> +		list_add_tail(&gic->gic_list, &gc->irq_contexts);
>  	}
> +
> +	/*
> +	 * When calling irq_setup() for dynamically added IRQs, if number of
> +	 * IRQs is more than or equal to allocated MSI-X, we need to skip the
> +	 * first CPU sibling group since they are already affinitized to HWC IRQ
> +	 */
> +	if (gc->num_msix_usable <= num_online_cpus())
> +		skip_first_cpu = true;
> +
> +	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> +	if (err)
> +		goto free_irq;
> +
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
> +	cpus_read_unlock();
> +	kfree(irqs);
> +	return 0;
> +
> +free_current_gic:
> +	kfree(gic);
> +free_irq:
> +	for (j = i - 1; j >= 0; j--) {
> +		irq = pci_irq_vector(pdev, j + 1);
> +		gic = list_last_entry(&gc->irq_contexts,
> +				      struct gdma_irq_context, gic_list);
> +		irq_update_affinity_hint(irq, NULL);
> +		free_irq(irq, gic);
> +		list_del(&gic->gic_list);
> +		kfree(gic);
> +	}
> +	kfree(irqs);
> +free_irq_vector:
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
> +	cpus_read_unlock();
> +	return err;
> +}
> +
> +static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
> +{
> +	struct gdma_context *gc = pci_get_drvdata(pdev);
> +	struct gdma_irq_context *gic;
> +	int start_irq_index = 1;
> +	unsigned long flags;
> +	unsigned int cpu;
> +	int *irqs, irq;
> +	int err, i = 0, j;
> +
> +	cpus_read_lock();
> +	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
> +
>  	if (nvec <= num_online_cpus())
>  		start_irq_index = 0;
> 
> @@ -1349,15 +1476,12 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  		goto free_irq_vector;
>  	}
> 
> -	gc->irq_contexts = kcalloc(nvec, sizeof(struct gdma_irq_context),
> -				   GFP_KERNEL);
> -	if (!gc->irq_contexts) {
> -		err = -ENOMEM;
> -		goto free_irq_array;
> -	}
> -
>  	for (i = 0; i < nvec; i++) {
> -		gic = &gc->irq_contexts[i];
> +		gic = kcalloc(1, sizeof(struct gdma_irq_context), GFP_KERNEL);
> +		if (!gic) {
> +			err = -ENOMEM;
> +			goto free_irq;
> +		}
>  		gic->handler = mana_gd_process_eq_events;
>  		INIT_LIST_HEAD(&gic->eq_list);
>  		spin_lock_init(&gic->lock);
> @@ -1372,22 +1496,14 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  		irq = pci_irq_vector(pdev, i);
>  		if (irq < 0) {
>  			err = irq;
> -			goto free_irq;
> +			goto free_current_gic;
>  		}
> 
>  		if (!i) {
>  			err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
>  			if (err)
> -				goto free_irq;
> -
> -			/* If number of IRQ is one extra than number of online CPUs,
> -			 * then we need to assign IRQ0 (hwc irq) and IRQ1 to
> -			 * same CPU.
> -			 * Else we will use different CPUs for IRQ0 and IRQ1.
> -			 * Also we are using cpumask_local_spread instead of
> -			 * cpumask_first for the node, because the node can be
> -			 * mem only.
> -			 */
> +				goto free_current_gic;
> +
>  			if (start_irq_index) {
>  				cpu = cpumask_local_spread(i, gc->numa_node);
>  				irq_set_affinity_and_hint(irq, cpumask_of(cpu));
> @@ -1396,39 +1512,108 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  			}
>  		} else {
>  			irqs[i - start_irq_index] = irq;
> -			err = request_irq(irqs[i - start_irq_index], mana_gd_intr, 0,
> -					  gic->name, gic);
> +			err = request_irq(irqs[i - start_irq_index],
> +					  mana_gd_intr, 0, gic->name, gic);
>  			if (err)
> -				goto free_irq;
> +				goto free_current_gic;
>  		}
> +
> +		list_add_tail(&gic->gic_list, &gc->irq_contexts);
>  	}
> 
> -	err = irq_setup(irqs, (nvec - start_irq_index), gc->numa_node);
> +	err = irq_setup(irqs, nvec - start_irq_index, gc->numa_node, false);
>  	if (err)
>  		goto free_irq;
> 
> -	gc->max_num_msix = nvec;
> -	gc->num_msix_usable = nvec;
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
>  	cpus_read_unlock();
>  	kfree(irqs);
>  	return 0;
> 
> +free_current_gic:
> +	kfree(gic);
>  free_irq:
>  	for (j = i - 1; j >= 0; j--) {
>  		irq = pci_irq_vector(pdev, j);
> -		gic = &gc->irq_contexts[j];
> -
> +		gic = list_last_entry(&gc->irq_contexts,
> +				      struct gdma_irq_context, gic_list);
>  		irq_update_affinity_hint(irq, NULL);
>  		free_irq(irq, gic);
> +		list_del(&gic->gic_list);
> +		kfree(gic);
>  	}
> -
> -	kfree(gc->irq_contexts);
> -	gc->irq_contexts = NULL;
> -free_irq_array:
>  	kfree(irqs);
>  free_irq_vector:
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
>  	cpus_read_unlock();
> -	pci_free_irq_vectors(pdev);
> +	return err;
> +}
> +
> +static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
> +{
> +	struct gdma_context *gc = pci_get_drvdata(pdev);
> +	unsigned int max_irqs, min_irqs;
> +	int max_queues_per_port;
> +	int nvec, err;
> +
> +	if (pci_msix_can_alloc_dyn(pdev)) {
> +		max_irqs = 1;
> +		min_irqs = 1;
> +	} else {
> +		max_queues_per_port = num_online_cpus();
> +		if (max_queues_per_port > MANA_MAX_NUM_QUEUES)
> +			max_queues_per_port = MANA_MAX_NUM_QUEUES;
> +		/* Need 1 interrupt for HWC */
> +		max_irqs = max_queues_per_port + 1;
> +		min_irqs = 2;
> +	}
> +
> +	nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
> +	if (nvec < 0)
> +		return nvec;
> +
> +	err = mana_gd_setup_irqs(pdev, nvec);
> +	if (err) {
> +		pci_free_irq_vectors(pdev);
> +		return err;
> +	}
> +
> +	gc->num_msix_usable = nvec;
> +	gc->max_num_msix = nvec;
> +
> +	return err;
> +}
> +
> +static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
> +{
> +	struct gdma_context *gc = pci_get_drvdata(pdev);
> +	int max_irqs, i, err = 0;
> +	struct msi_map irq_map;
> +
> +	if (!pci_msix_can_alloc_dyn(pdev))
> +		/* remaining IRQs are already allocated with the HWC IRQ */
> +		return 0;
> +
> +	/* allocate only the remaining IRQs */
> +	max_irqs = gc->num_msix_usable - 1;
> +
> +	for (i = 1; i <= max_irqs; i++) {
> +		irq_map = pci_msix_alloc_irq_at(pdev, i, NULL);
> +		if (!irq_map.virq) {
> +			err = irq_map.index;
> +			/* caller will handle cleaning up all allocated
> +			 * irqs, after HWC is destroyed
> +			 */
> +			return err;
> +		}
> +	}
> +
> +	err = mana_gd_setup_dyn_irqs(pdev, max_irqs);
> +	if (err)
> +		return err;
> +
> +	gc->max_num_msix = gc->max_num_msix + max_irqs;
> +
>  	return err;
>  }
> 
> @@ -1436,29 +1621,34 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_irq_context *gic;
> -	int irq, i;
> +	struct list_head *pos, *n;
> +	unsigned long flags;
> +	int irq, i = 0;
> 
>  	if (gc->max_num_msix < 1)
>  		return;
> 
> -	for (i = 0; i < gc->max_num_msix; i++) {
> +	spin_lock_irqsave(&gc->irq_ctxs_lock, flags);
> +	list_for_each_safe(pos, n, &gc->irq_contexts) {
>  		irq = pci_irq_vector(pdev, i);
>  		if (irq < 0)
>  			continue;
> 
> -		gic = &gc->irq_contexts[i];
> +		gic = list_entry(pos, struct gdma_irq_context, gic_list);
> 
>  		/* Need to clear the hint before free_irq */
>  		irq_update_affinity_hint(irq, NULL);
>  		free_irq(irq, gic);
> +		list_del(pos);
> +		kfree(gic);
> +		i++;
>  	}
> +	spin_unlock_irqrestore(&gc->irq_ctxs_lock, flags);
> 
>  	pci_free_irq_vectors(pdev);
> 
>  	gc->max_num_msix = 0;
>  	gc->num_msix_usable = 0;
> -	kfree(gc->irq_contexts);
> -	gc->irq_contexts = NULL;
>  }
> 
>  static int mana_gd_setup(struct pci_dev *pdev)
> @@ -1469,9 +1659,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
>  	mana_gd_init_registers(pdev);
>  	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
> 
> -	err = mana_gd_setup_irqs(pdev);
> +	err = mana_gd_setup_hwc_irqs(pdev);
>  	if (err) {
> -		dev_err(gc->dev, "Failed to setup IRQs: %d\n", err);
> +		dev_err(gc->dev, "Failed to setup IRQs for HWC creation: %d\n",
> +			err);
>  		return err;
>  	}
> 
> @@ -1487,6 +1678,10 @@ static int mana_gd_setup(struct pci_dev *pdev)
>  	if (err)
>  		goto destroy_hwc;
> 
> +	err = mana_gd_setup_remaining_irqs(pdev);
> +	if (err)
> +		goto destroy_hwc;
> +
>  	err = mana_gd_detect_devices(pdev);
>  	if (err)
>  		goto destroy_hwc;
> @@ -1563,6 +1758,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	gc->is_pf = mana_is_pf(pdev->device);
>  	gc->bar0_va = bar0_va;
>  	gc->dev = &pdev->dev;
> +	INIT_LIST_HEAD(&gc->irq_contexts);
> +	spin_lock_init(&gc->irq_ctxs_lock);
> 
>  	if (gc->is_pf)
>  		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
> diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> index 228603bf03f2..6ef4785c63b4 100644
> --- a/include/net/mana/gdma.h
> +++ b/include/net/mana/gdma.h
> @@ -363,6 +363,7 @@ struct gdma_irq_context {
>  	spinlock_t lock;
>  	struct list_head eq_list;
>  	char name[MANA_IRQ_NAME_SZ];
> +	struct list_head gic_list;
>  };
> 
>  struct gdma_context {
> @@ -373,7 +374,9 @@ struct gdma_context {
>  	unsigned int		max_num_queues;
>  	unsigned int		max_num_msix;
>  	unsigned int		num_msix_usable;
> -	struct gdma_irq_context	*irq_contexts;
> +	struct list_head	irq_contexts;
> +	/* Protect the irq_contexts list */
> +	spinlock_t		irq_ctxs_lock;
> 
>  	/* L2 MTU */
>  	u16 adapter_mtu;
> @@ -558,12 +561,16 @@ enum {
>  /* Driver can handle holes (zeros) in the device list */
>  #define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
> 
> +/* Driver supports dynamic MSI-X vector allocation */
> +#define GDMA_DRV_CAP_FLAG_1_DYNAMIC_IRQ_ALLOC_SUPPORT BIT(13)
> +
>  #define GDMA_DRV_CAP_FLAGS1 \
>  	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
>  	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
>  	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG | \
>  	 GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT | \
> -	 GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP)
> +	 GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP | \
> +	 GDMA_DRV_CAP_FLAG_1_DYNAMIC_IRQ_ALLOC_SUPPORT)
> 
>  #define GDMA_DRV_CAP_FLAGS2 0
> 
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-05-01  5:27   ` Michael Kelley
@ 2025-05-01 14:23     ` Shradha Gupta
  2025-05-01 15:56       ` Michael Kelley
  0 siblings, 1 reply; 19+ messages in thread
From: Shradha Gupta @ 2025-05-01 14:23 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm, Shradha Gupta

On Thu, May 01, 2025 at 05:27:49AM +0000, Michael Kelley wrote:
> From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:55 AM
> > 
> > Currently, the MANA driver allocates MSI-X vectors statically based on
> > MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> > up allocating more vectors than it needs. This is because, by this time
> > we do not have a HW channel and do not know how many IRQs should be
> > allocated.
> > 
> > To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> > after getting the value supported by hardware, dynamically add the
> > remaining MSI-X vectors.
> 
> I have a top-level thought about the data structures used to manage a
> dynamic number of MSI-X vectors. The current code allocates a fixed size
> array of struct gdma_irq_context, with one entry in the array for each
> MSI-X vector. To find the entry for a particular msi_index, the code can
> just index into the array, which is nice and simple.
> 
> The new code uses a linked list of struct gdma_irq_context entries, with
> one entry in the list for each MSI-X vector.  In the dynamic case, you can
> start with one entry in the list, and then add to the list however many
> additional entries the hardware will support.
> 
> But this additional linked list adds significant complexity to the code
> because it must be linearly searched to find the entry for a particular
> msi_index, and there's the messiness of putting entries onto the list
> and taking them off.  A spin lock is required.  Etc., etc.
> 
> Here's an intermediate approach that would be simpler. Allocate a fixed
> size array of pointers to struct gdma_irq_context. The fixed size is the
> maximum number of possible MSI-X vectors for the device, which I
> think is MANA_MAX_NUM_QUEUES, or 64 (correct me if I'm wrong
> about that). Allocate a new struct gdma_irq_context when needed,
> but store the address in the array rather than adding it onto a list.
> Code can then directly index into the array to access the entry.
> 
> Some entries in the array will be unused (and "wasted") if the device
> uses fewer MSI-X vectors, but each unused entry is only 8 bytes. The
> max space unused is less than 512 bytes (assuming 64 entries in
> the array), which is negligible in the grand scheme of things. With the
> simpler code, and not having the additional list entry embedded in
> each struct gdma_irq_context, you'll get some of that space back
> anyway.
> 
> Maybe there's a reason for the list that I missed in my initial
> review of the code. But if not, it sure seems like the code could
> be simpler, and having some unused 8-byte entries in the array
> is worth the tradeoff for the simplicity.
> 
> Michael

Hey Michael,

Thanks for your inputs. We did consider this approach; in fact, that
is how this patch was implemented (with a fixed-size array) in v1 of
our internal reviews.

However, it came up in those reviews that we want to move away
from 64 (MANA_MAX_NUM_QUEUES) as a hard limit for some new
requirements, at least for the dynamic IRQ allocation path. The new
limit, for all hardening purposes, would be num_online_cpus().

Using this limit with the fixed-size array approach creates problems,
especially on machines with a high number of vCPUs: it would lead to
quite a bit of memory/resource wastage.
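
(Rough numbers to make that concrete, with assumed sizes: each struct
gdma_irq_context carries a handler pointer, a spin lock, list heads,
and an IRQ-name buffer, so roughly 100 bytes. On a machine with, say,
512 online CPUs, a num_online_cpus()-sized array of them would
pre-allocate on the order of 50 KB per device even when the hardware
grants only a few vectors.)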

Hence, we decided to go ahead with this design.

Regards,
Shradha.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-05-01 14:23     ` Shradha Gupta
@ 2025-05-01 15:56       ` Michael Kelley
  2025-05-02  6:08         ` Shradha Gupta
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-05-01 15:56 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm, Shradha Gupta

From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Thursday, May 1, 2025 7:24 AM
> 
> On Thu, May 01, 2025 at 05:27:49AM +0000, Michael Kelley wrote:
> > From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:55 AM
> > >
> > > Currently, the MANA driver allocates MSI-X vectors statically based on
> > > MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> > > up allocating more vectors than it needs. This is because, by this time
> > > we do not have a HW channel and do not know how many IRQs should be
> > > allocated.
> > >
> > > To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> > > after getting the value supported by hardware, dynamically add the
> > > remaining MSI-X vectors.
> >
> > I have a top-level thought about the data structures used to manage a
> > dynamic number of MSI-X vectors. The current code allocates a fixed size
> > array of struct gdma_irq_context, with one entry in the array for each
> > MSI-X vector. To find the entry for a particular msi_index, the code can
> > just index into the array, which is nice and simple.
> >
> > The new code uses a linked list of struct gdma_irq_context entries, with
> > one entry in the list for each MSI-X vector.  In the dynamic case, you can
> > start with one entry in the list, and then add to the list however many
> > additional entries the hardware will support.
> >
> > But this additional linked list adds significant complexity to the code
> > because it must be linearly searched to find the entry for a particular
> > msi_index, and there's the messiness of putting entries onto the list
> > and taking them off.  A spin lock is required.  Etc., etc.
> >
> > Here's an intermediate approach that would be simpler. Allocate a fixed
> > size array of pointers to struct gdma_irq_context. The fixed size is the
> > maximum number of possible MSI-X vectors for the device, which I
> > think is MANA_MAX_NUM_QUEUES, or 64 (correct me if I'm wrong
> > about that). Allocate a new struct gdma_irq_context when needed,
> > but store the address in the array rather than adding it onto a list.
> > Code can then directly index into the array to access the entry.
> >
> > Some entries in the array will be unused (and "wasted") if the device
> > uses fewer MSI-X vectors, but each unused entry is only 8 bytes. The
> > max space unused is less than 512 bytes (assuming 64 entries in
> > the array), which is negligible in the grand scheme of things. With the
> > simpler code, and not having the additional list entry embedded in
> > each struct gdma_irq_context, you'll get some of that space back
> > anyway.
> >
> > Maybe there's a reason for the list that I missed in my initial
> > review of the code. But if not, it sure seems like the code could
> > be simpler, and having some unused 8-byte entries in the array
> > is worth the tradeoff for the simplicity.
> >
> > Michael
> 
> Hey Michael,
> 
> Thanks for your inputs. We did consider this approach; in fact, that
> is how this patch was implemented (with a fixed-size array) in v1 of
> our internal reviews.
> 
> However, it came up in those reviews that we want to move away
> from 64 (MANA_MAX_NUM_QUEUES) as a hard limit for some new
> requirements, at least for the dynamic IRQ allocation path. The new
> limit, for all hardening purposes, would be num_online_cpus().
> 
> Using this limit with the fixed-size array approach creates problems,
> especially on machines with a high number of vCPUs: it would lead to
> quite a bit of memory/resource wastage.
> 
> Hence, we decided to go ahead with this design.
> 
> Regards,
> Shradha.

One other thought:  Did you look at using an xarray? See
https://www.kernel.org/doc/html/latest/core-api/xarray.html.
It has most or all of the properties you need to deal with
a variable number of entries, while handling all the locking
automatically. Entries can be accessed with just a simple
index value.

I don't have first-hand experience writing code using xarrays,
so I can't be sure that it would simplify things for MANA IRQ
allocation, but it seems to be a very appropriate abstraction
for this use case.
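
For illustration, a minimal sketch of how it might slot in (xa_init(),
xa_store(), and xa_load() are the real xarray API; where exactly they
would land in the MANA code is my guess):

	#include <linux/xarray.h>

	/* in struct gdma_context, replacing the list and its spin lock: */
	struct xarray irq_contexts;	/* msi_index -> gdma_irq_context * */

	/* at probe time: */
	xa_init(&gc->irq_contexts);

	/* when a vector is added: */
	gic = kzalloc(sizeof(*gic), GFP_KERNEL);
	if (!gic)
		return -ENOMEM;
	err = xa_err(xa_store(&gc->irq_contexts, msi_index, gic, GFP_KERNEL));
	if (err) {
		kfree(gic);
		return err;
	}

	/* lookup by msi_index: no linear walk, no driver-side locking */
	gic = xa_load(&gc->irq_contexts, msi_index);
	if (!gic)
		return -EINVAL;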

Michael



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-05-01 15:56       ` Michael Kelley
@ 2025-05-02  6:08         ` Shradha Gupta
  2025-05-06  8:49           ` Shradha Gupta
  0 siblings, 1 reply; 19+ messages in thread
From: Shradha Gupta @ 2025-05-02  6:08 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm, Shradha Gupta

On Thu, May 01, 2025 at 03:56:48PM +0000, Michael Kelley wrote:
> From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Thursday, May 1, 2025 7:24 AM
> > 
> > On Thu, May 01, 2025 at 05:27:49AM +0000, Michael Kelley wrote:
> > > From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:55 AM
> > > >
> > > > Currently, the MANA driver allocates MSI-X vectors statically based on
> > > > MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> > > > up allocating more vectors than it needs. This is because, by this time
> > > > we do not have a HW channel and do not know how many IRQs should be
> > > > allocated.
> > > >
> > > > To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> > > > after getting the value supported by hardware, dynamically add the
> > > > remaining MSI-X vectors.
> > >
> > > I have a top-level thought about the data structures used to manage a
> > > dynamic number of MSI-X vectors. The current code allocates a fixed size
> > > array of struct gdma_irq_context, with one entry in the array for each
> > > MSI-X vector. To find the entry for a particular msi_index, the code can
> > > just index into the array, which is nice and simple.
> > >
> > > The new code uses a linked list of struct gdma_irq_context entries, with
> > > one entry in the list for each MSI-X vector.  In the dynamic case, you can
> > > start with one entry in the list, and then add to the list however many
> > > additional entries the hardware will support.
> > >
> > > But this additional linked list adds significant complexity to the code
> > > because it must be linearly searched to find the entry for a particular
> > > msi_index, and there's the messiness of putting entries onto the list
> > > and taking them off.  A spin lock is required.  Etc., etc.
> > >
> > > Here's an intermediate approach that would be simpler. Allocate a fixed
> > > size array of pointers to struct gdma_irq_context. The fixed size is the
> > > maximum number of possible MSI-X vectors for the device, which I
> > > think is MANA_MAX_NUM_QUEUES, or 64 (correct me if I'm wrong
> > > about that). Allocate a new struct gdma_irq_context when needed,
> > > but store the address in the array rather than adding it onto a list.
> > > Code can then directly index into the array to access the entry.
> > >
> > > Some entries in the array will be unused (and "wasted") if the device
> > > uses fewer MSI-X vectors, but each unused entry is only 8 bytes. The
> > > max space unused is less than 512 bytes (assuming 64 entries in
> > > the array), which is negligible in the grand scheme of things. With the
> > > simpler code, and not having the additional list entry embedded in
> > > each struct gdma_irq_context, you'll get some of that space back
> > > anyway.
> > >
> > > Maybe there's a reason for the list that I missed in my initial
> > > review of the code. But if not, it sure seems like the code could
> > > be simpler, and having some unused 8-byte entries in the array
> > > is worth the tradeoff for the simplicity.
> > >
> > > Michael
> > 
> > Hey Michael,
> > 
> > Thanks for your inputs. We did consider this approach; in fact, that
> > is how this patch was implemented (with a fixed-size array) in v1 of
> > our internal reviews.
> > 
> > However, it came up in those reviews that we want to move away
> > from 64 (MANA_MAX_NUM_QUEUES) as a hard limit for some new
> > requirements, at least for the dynamic IRQ allocation path. The new
> > limit, for all hardening purposes, would be num_online_cpus().
> > 
> > Using this limit with the fixed-size array approach creates problems,
> > especially on machines with a high number of vCPUs: it would lead to
> > quite a bit of memory/resource wastage.
> > 
> > Hence, we decided to go ahead with this design.
> > 
> > Regards,
> > Shradha.
> 
> One other thought:  Did you look at using an xarray? See
> https://www.kernel.org/doc/html/latest/core-api/xarray.html.
> It has most or all of the properties you need to deal with
> a variable number of entries, while handling all the locking
> automatically. Entries can be accessed with just a simple
> index value.
> 
> I don't have first-hand experience writing code using xarrays,
> so I can't be sure that it would simplify things for MANA IRQ
> allocation, but it seems to be a very appropriate abstraction
> for this use case.
> 
> Michael
>
Thanks Michael,

This does look promising for our use case. I will try it with this patch,
update the thread, and then send out the next version as required.

Regards,
Shradha.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required
  2025-05-02  6:08         ` Shradha Gupta
@ 2025-05-06  8:49           ` Shradha Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Shradha Gupta @ 2025-05-06  8:49 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Nipun Gupta, Yury Norov,
	Jason Gunthorpe, Jonathan Cameron, Anna-Maria Behnsen, Kevin Tian,
	Long Li, Thomas Gleixner, Bjorn Helgaas, Rob Herring,
	Manivannan Sadhasivam, Krzysztof Wilczyński, Lorenzo Pieralisi,
	Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Leon Romanovsky, Maxim Levitsky,
	Erni Sri Satya Vennela, Peter Zijlstra, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm, Shradha Gupta

On Thu, May 01, 2025 at 11:08:09PM -0700, Shradha Gupta wrote:
> On Thu, May 01, 2025 at 03:56:48PM +0000, Michael Kelley wrote:
> > From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Thursday, May 1, 2025 7:24 AM
> > > 
> > > On Thu, May 01, 2025 at 05:27:49AM +0000, Michael Kelley wrote:
> > > > From: Shradha Gupta <shradhagupta@linux.microsoft.com> Sent: Friday, April 25, 2025 3:55 AM
> > > > >
> > > > > Currently, the MANA driver allocates MSI-X vectors statically based on
> > > > > MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
> > > > > up allocating more vectors than it needs. This is because, by this time
> > > > > we do not have a HW channel and do not know how many IRQs should be
> > > > > allocated.
> > > > >
> > > > > To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
> > > > > after getting the value supported by hardware, dynamically add the
> > > > > remaining MSI-X vectors.
> > > >
> > > > I have a top-level thought about the data structures used to manage a
> > > > dynamic number of MSI-X vectors. The current code allocates a fixed size
> > > > array of struct gdma_irq_context, with one entry in the array for each
> > > > MSI-X vector. To find the entry for a particular msi_index, the code can
> > > > just index into the array, which is nice and simple.
> > > >
> > > > The new code uses a linked list of struct gdma_irq_context entries, with
> > > > one entry in the list for each MSI-X vector.  In the dynamic case, you can
> > > > start with one entry in the list, and then add to the list however many
> > > > additional entries the hardware will support.
> > > >
> > > > But this additional linked list adds significant complexity to the code
> > > > because it must be linearly searched to find the entry for a particular
> > > > msi_index, and there's the messiness of putting entries onto the list
> > > > and taking them off.  A spin lock is required.  Etc., etc.
> > > >
> > > > Here's an intermediate approach that would be simpler. Allocate a fixed
> > > > size array of pointers to struct gdma_irq_context. The fixed size is the
> > > > maximum number of possible MSI-X vectors for the device, which I
> > > > think is MANA_MAX_NUM_QUEUES, or 64 (correct me if I'm wrong
> > > > about that). Allocate a new struct gdma_irq_context when needed,
> > > > but store the address in the array rather than adding it onto a list.
> > > > Code can then directly index into the array to access the entry.
> > > >
> > > > Some entries in the array will be unused (and "wasted") if the device
> > > > uses fewer MSI-X vectors, but each unused entry is only 8 bytes. The
> > > > max space unused is less than 512 bytes (assuming 64 entries in
> > > > the array), which is negligible in the grand scheme of things. With the
> > > > simpler code, and not having the additional list entry embedded in
> > > > each struct gdma_irq_context, you'll get some of that space back
> > > > anyway.
> > > >
> > > > Maybe there's a reason for the list that I missed in my initial
> > > > review of the code. But if not, it sure seems like the code could
> > > > be simpler, and having some unused 8-byte entries in the array
> > > > is worth the tradeoff for the simplicity.
> > > >
> > > > Michael
> > > 
> > > Hey Michael,
> > > 
> > > Thanks for your inputs. We did consider this approach; in fact, that
> > > is how this patch was implemented (with a fixed-size array) in v1 of
> > > our internal reviews.
> > > 
> > > However, it came up in those reviews that we want to move away
> > > from 64 (MANA_MAX_NUM_QUEUES) as a hard limit for some new
> > > requirements, at least for the dynamic IRQ allocation path. The new
> > > limit, for all hardening purposes, would be num_online_cpus().
> > > 
> > > Using this limit with the fixed-size array approach creates problems,
> > > especially on machines with a high number of vCPUs: it would lead to
> > > quite a bit of memory/resource wastage.
> > > 
> > > Hence, we decided to go ahead with this design.
> > > 
> > > Regards,
> > > Shradha.
> > 
> > One other thought:  Did you look at using an xarray? See
> > https://www.kernel.org/doc/html/latest/core-api/xarray.html.
> > It has most or all of the properties you need to deal with
> > a variable number of entries, while handling all the locking
> > automatically. Entries can be accessed with just a simple
> > index value.
> > 
> > I don't have first-hand experience writing code using xarrays,
> > so I can't be sure that it would simplify things for MANA IRQ
> > allocation, but it seems to be a very appropriate abstraction
> > for this use case.
> > 
> > Michael
> >
> Thanks Michael,
> 
> This does look promising for our use case. I will try it with this patch,
> update the thread, and then send out the next version as required.
> 
> Regards,
> Shradha.

Hi Michael,

Going ahead with the xarray implementation of the irq_contexts structure
for the next version.
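
Roughly what I have in mind for the teardown side (a sketch only, using
names from the v2 patch; the final v3 code may differ):

	unsigned long index;
	struct gdma_irq_context *gic;
	int irq;

	xa_for_each(&gc->irq_contexts, index, gic) {
		irq = pci_irq_vector(pdev, index);
		if (irq >= 0) {
			/* clear the hint before free_irq, as today */
			irq_update_affinity_hint(irq, NULL);
			free_irq(irq, gic);
		}
		xa_erase(&gc->irq_contexts, index);
		kfree(gic);
	}
	xa_destroy(&gc->irq_contexts);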

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-05-06  8:49 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-25 10:53 [PATCH v2 0/3] Allow dyn MSI-X vector allocation of MANA Shradha Gupta
2025-04-25 10:53 ` [PATCH v2 1/3] PCI: Export pci_msix_prepare_desc() for dynamic MSI-X alloc Shradha Gupta
2025-04-25 16:37   ` Bjorn Helgaas
2025-04-28  5:22     ` Shradha Gupta
2025-04-28 12:22     ` Thomas Gleixner
2025-04-29  8:43       ` Shradha Gupta
2025-04-25 10:54 ` [PATCH v2 2/3] PCI: hv: Allow dynamic MSI-X vector allocation Shradha Gupta
2025-04-28 13:47   ` Michael Kelley
2025-04-29  8:46     ` Shradha Gupta
2025-04-25 10:54 ` [PATCH v2 3/3] net: mana: Allocate MSI-X vectors dynamically as required Shradha Gupta
2025-04-25 16:35   ` Jakub Kicinski
2025-04-28  4:01     ` Shradha Gupta
2025-04-28 13:38   ` Yury Norov
2025-04-29  8:51     ` Shradha Gupta
2025-05-01  5:27   ` Michael Kelley
2025-05-01 14:23     ` Shradha Gupta
2025-05-01 15:56       ` Michael Kelley
2025-05-02  6:08         ` Shradha Gupta
2025-05-06  8:49           ` Shradha Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).