[PATCH 0/4] Nested virtualization fixes for root partition

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] Nested virtualization fixes for root partition
@ 2025-06-10 23:52 Nuno Das Neves
  2025-06-10 23:52 ` [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal Nuno Das Neves
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-10 23:52 UTC (permalink / raw)
  To: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86, Nuno Das Neves

Fixes for running as nested root partition on the Microsoft Hypervisor.

The first patch prevents the vmbus driver being registered on baremetal, since
there's no vmbus in this scenario.

The second patch changes vmbus to make hypercalls to the L0 hypervisor instead
of the L1. This is needed because L0 hypervisor, not the L1, is the one hosting
the Windows root partition with the VMM that provides vmbus.

The 3rd and 4th patches fix interrupt unmasking on nested. In this scenario,
the L1 (nested) hypervisor does the interrupt mapping to root partition cores.
The vectors just need to be mapped with MAP_DEVICE_INTERRUPT instead of
affinitized with RETARGET_INTERRUPT.

Mukesh Rathor (1):
  PCI: hv: Do not do vmbus initialization on baremetal

Nuno Das Neves (1):
  Drivers: hv: Use nested hypercall for post message and signal event

Stanislav Kinsburskii (2):
  x86: hyperv: Expose hv_map_msi_interrupt function
  PCI: hv: Use the correct hypercall for unmasking interrupts on nested

 arch/arm64/include/asm/mshyperv.h   | 10 ++++++
 arch/x86/hyperv/irqdomain.c         | 47 +++++++++++++++++++++--------
 arch/x86/include/asm/mshyperv.h     |  2 ++
 drivers/hv/connection.c             |  3 ++
 drivers/hv/hv.c                     |  3 ++
 drivers/pci/controller/pci-hyperv.c | 21 +++++++++++--
 6 files changed, 71 insertions(+), 15 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal
  2025-06-10 23:52 [PATCH 0/4] Nested virtualization fixes for root partition Nuno Das Neves
@ 2025-06-10 23:52 ` Nuno Das Neves
  2025-06-11 23:06   ` Michael Kelley
  2025-06-10 23:52 ` [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event Nuno Das Neves
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-10 23:52 UTC (permalink / raw)
  To: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86, Nuno Das Neves

From: Mukesh Rathor <mrathor@linux.microsoft.com>

init_hv_pci_drv() is not relevant for root partition on baremetal as there
is no vmbus. On nested (with a Windows L1 root), vmbus is present.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index b4f29ee75848..4d25754dfe2f 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -4150,6 +4150,9 @@ static int __init init_hv_pci_drv(void)
 	if (!hv_is_hyperv_initialized())
 		return -ENODEV;
 
+	if (hv_root_partition() && !hv_nested)
+		return -ENODEV;
+
 	ret = hv_pci_irqchip_init();
 	if (ret)
 		return ret;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event
  2025-06-10 23:52 [PATCH 0/4] Nested virtualization fixes for root partition Nuno Das Neves
  2025-06-10 23:52 ` [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal Nuno Das Neves
@ 2025-06-10 23:52 ` Nuno Das Neves
  2025-06-11 23:06   ` Michael Kelley
  2025-06-10 23:52 ` [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function Nuno Das Neves
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-10 23:52 UTC (permalink / raw)
  To: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86, Nuno Das Neves

When running nested, these hypercalls must be sent to the L0 hypervisor
or vmbus will fail.

Add ARM64 stubs for the nested hypercall helpers to not break
compilation (nested is still only supported in x86).

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/arm64/include/asm/mshyperv.h | 10 ++++++++++
 drivers/hv/connection.c           |  3 +++
 drivers/hv/hv.c                   |  3 +++
 3 files changed, 16 insertions(+)

diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index b721d3134ab6..893d6a2e8dab 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -53,6 +53,16 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 	return hv_get_msr(reg);
 }
 
+static inline u64 hv_do_nested_hypercall(u64 control, void *input, void *output)
+{
+	return U64_MAX;
+}
+
+static inline u64 hv_do_fast_nested_hypercall8(u64 control, u64 input1)
+{
+	return U64_MAX;
+}
+
 /* SMCCC hypercall parameters */
 #define HV_SMCCC_FUNC_NUMBER	1
 #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index be490c598785..992022bc770c 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -518,6 +518,9 @@ void vmbus_set_event(struct vmbus_channel *channel)
 					 channel->sig_event, 0);
 		else
 			WARN_ON_ONCE(1);
+	} else if (hv_nested) {
+		hv_do_fast_nested_hypercall8(HVCALL_SIGNAL_EVENT,
+					     channel->sig_event);
 	} else {
 		hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
 	}
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 308c8f279df8..99b73e779bf0 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -84,6 +84,9 @@ int hv_post_message(union hv_connection_id connection_id,
 						   sizeof(*aligned_msg));
 		else
 			status = HV_STATUS_INVALID_PARAMETER;
+	} else if (hv_nested) {
+		status = hv_do_nested_hypercall(HVCALL_POST_MESSAGE,
+						aligned_msg, NULL);
 	} else {
 		status = hv_do_hypercall(HVCALL_POST_MESSAGE,
 					 aligned_msg, NULL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function
  2025-06-10 23:52 [PATCH 0/4] Nested virtualization fixes for root partition Nuno Das Neves
  2025-06-10 23:52 ` [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal Nuno Das Neves
  2025-06-10 23:52 ` [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event Nuno Das Neves
@ 2025-06-10 23:52 ` Nuno Das Neves
  2025-06-11 23:07   ` Michael Kelley
  2025-06-10 23:52 ` [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested Nuno Das Neves
  2025-06-11 17:55 ` [PATCH 0/4] Nested virtualization fixes for root partition Roman Kisel
  4 siblings, 1 reply; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-10 23:52 UTC (permalink / raw)
  To: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86, Nuno Das Neves

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

This patch moves a part of currently internal logic into the
hv_map_msi_interrupt function and makes it globally available helper
function, which will be used to map PCI interrupts in case of root
partition.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/x86/hyperv/irqdomain.c     | 47 ++++++++++++++++++++++++---------
 arch/x86/include/asm/mshyperv.h |  2 ++
 2 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 31f0d29cbc5e..82f3bafb93d6 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -169,13 +169,40 @@ static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
 	return dev_id;
 }
 
-static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
-				struct hv_interrupt_entry *entry)
+/**
+ * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+ * @data:      Describes the IRQ
+ * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
+ *
+ * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
+ */
+int hv_map_msi_interrupt(struct irq_data *data,
+			 struct hv_interrupt_entry *out_entry)
 {
-	union hv_device_id device_id = hv_build_pci_dev_id(dev);
+	struct msi_desc *msidesc;
+	struct pci_dev *dev;
+	union hv_device_id device_id;
+	struct hv_interrupt_entry dummy;
+	struct irq_cfg *cfg = irqd_cfg(data);
+	const cpumask_t *affinity;
+	int cpu;
+	u64 res;
 
-	return hv_map_interrupt(device_id, false, cpu, vector, entry);
+	msidesc = irq_data_get_msi_desc(data);
+	dev = msi_desc_to_pci_dev(msidesc);
+	device_id = hv_build_pci_dev_id(dev);
+	affinity = irq_data_get_effective_affinity_mask(data);
+	cpu = cpumask_first_and(affinity, cpu_online_mask);
+
+	res = hv_map_interrupt(device_id, false, cpu, cfg->vector,
+			       out_entry ? out_entry : &dummy);
+	if (!hv_result_success(res))
+		pr_err("%s: failed to map interrupt: %s",
+		       __func__, hv_result_to_string(res));
+
+	return hv_result_to_errno(res);
 }
+EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
 
 static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
 {
@@ -190,10 +217,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 {
 	struct msi_desc *msidesc;
 	struct pci_dev *dev;
-	struct hv_interrupt_entry out_entry, *stored_entry;
+	struct hv_interrupt_entry *stored_entry;
 	struct irq_cfg *cfg = irqd_cfg(data);
-	const cpumask_t *affinity;
-	int cpu;
 	u64 status;
 
 	msidesc = irq_data_get_msi_desc(data);
@@ -204,9 +229,6 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		return;
 	}
 
-	affinity = irq_data_get_effective_affinity_mask(data);
-	cpu = cpumask_first_and(affinity, cpu_online_mask);
-
 	if (data->chip_data) {
 		/*
 		 * This interrupt is already mapped. Let's unmap first.
@@ -235,15 +257,14 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		return;
 	}
 
-	status = hv_map_msi_interrupt(dev, cpu, cfg->vector, &out_entry);
+	status = hv_map_msi_interrupt(data, stored_entry);
 	if (status != HV_STATUS_SUCCESS) {
 		kfree(stored_entry);
 		return;
 	}
 
-	*stored_entry = out_entry;
 	data->chip_data = stored_entry;
-	entry_to_msi_msg(&out_entry, msg);
+	entry_to_msi_msg(data->chip_data, msg);
 
 	return;
 }
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5ec92e3e2e37..843121465ddd 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -261,6 +261,8 @@ static inline void hv_apic_init(void) {}
 
 struct irq_domain *hv_create_pci_msi_domain(void);
 
+int hv_map_msi_interrupt(struct irq_data *data,
+			 struct hv_interrupt_entry *out_entry);
 int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
 		struct hv_interrupt_entry *entry);
 int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested
  2025-06-10 23:52 [PATCH 0/4] Nested virtualization fixes for root partition Nuno Das Neves
                   ` (2 preceding siblings ...)
  2025-06-10 23:52 ` [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function Nuno Das Neves
@ 2025-06-10 23:52 ` Nuno Das Neves
  2025-06-11 23:07   ` Michael Kelley
  2025-06-13 19:36   ` Bjorn Helgaas
  2025-06-11 17:55 ` [PATCH 0/4] Nested virtualization fixes for root partition Roman Kisel
  4 siblings, 2 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-10 23:52 UTC (permalink / raw)
  To: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86, Nuno Das Neves

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Running as nested root on MSHV imposes a different requirement
for the pci-hyperv controller.

In this setup, the interrupt will first come to the L1 (nested) hypervisor,
which will deliver it to the appropriate root CPU. Instead of issuing the
RETARGET hypercall, we should issue the MAP_DEVICE_INTERRUPT
hypercall to L1 to complete the setup.

Rename hv_arch_irq_unmask() to hv_irq_retarget_interrupt().

Co-developed-by: Jinank Jain <jinankjain@linux.microsoft.com>
Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 4d25754dfe2f..0f491c802fb9 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -600,7 +600,7 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
 #define hv_msi_prepare		pci_msi_prepare
 
 /**
- * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
+ * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
  * affinity.
  * @data:	Describes the IRQ
  *
@@ -609,7 +609,7 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
  * is built out of this PCI bus's instance GUID and the function
  * number of the device.
  */
-static void hv_arch_irq_unmask(struct irq_data *data)
+static void hv_irq_retarget_interrupt(struct irq_data *data)
 {
 	struct msi_desc *msi_desc = irq_data_get_msi_desc(data);
 	struct hv_retarget_device_interrupt *params;
@@ -714,6 +714,20 @@ static void hv_arch_irq_unmask(struct irq_data *data)
 		dev_err(&hbus->hdev->device,
 			"%s() failed: %#llx", __func__, res);
 }
+
+static void hv_arch_irq_unmask(struct irq_data *data)
+{
+	if (hv_nested && hv_root_partition())
+		/*
+		 * In case of the nested root partition, the nested hypervisor
+		 * is taking care of interrupt remapping and thus the
+		 * MAP_DEVICE_INTERRUPT hypercall is required instead of
+		 * RETARGET_INTERRUPT.
+		 */
+		(void)hv_map_msi_interrupt(data, NULL);
+	else
+		hv_irq_retarget_interrupt(data);
+}
 #elif defined(CONFIG_ARM64)
 /*
  * SPI vectors to use for vPCI; arch SPIs range is [32, 1019], but leaving a bit
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] Nested virtualization fixes for root partition
  2025-06-10 23:52 [PATCH 0/4] Nested virtualization fixes for root partition Nuno Das Neves
                   ` (3 preceding siblings ...)
  2025-06-10 23:52 ` [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested Nuno Das Neves
@ 2025-06-11 17:55 ` Roman Kisel
  4 siblings, 0 replies; 19+ messages in thread
From: Roman Kisel @ 2025-06-11 17:55 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-pci



On 6/10/2025 4:52 PM, Nuno Das Neves wrote:
> Fixes for running as nested root partition on the Microsoft Hypervisor.
> 
> The first patch prevents the vmbus driver being registered on baremetal, since
> there's no vmbus in this scenario.
> 
> The second patch changes vmbus to make hypercalls to the L0 hypervisor instead
> of the L1. This is needed because L0 hypervisor, not the L1, is the one hosting
> the Windows root partition with the VMM that provides vmbus.
> 
> The 3rd and 4th patches fix interrupt unmasking on nested. In this scenario,
> the L1 (nested) hypervisor does the interrupt mapping to root partition cores.
> The vectors just need to be mapped with MAP_DEVICE_INTERRUPT instead of
> affinitized with RETARGET_INTERRUPT.
> 
> Mukesh Rathor (1):
>    PCI: hv: Do not do vmbus initialization on baremetal
> 
> Nuno Das Neves (1):
>    Drivers: hv: Use nested hypercall for post message and signal event
> 
> Stanislav Kinsburskii (2):
>    x86: hyperv: Expose hv_map_msi_interrupt function
>    PCI: hv: Use the correct hypercall for unmasking interrupts on nested
> 
>   arch/arm64/include/asm/mshyperv.h   | 10 ++++++
>   arch/x86/hyperv/irqdomain.c         | 47 +++++++++++++++++++++--------
>   arch/x86/include/asm/mshyperv.h     |  2 ++
>   drivers/hv/connection.c             |  3 ++
>   drivers/hv/hv.c                     |  3 ++
>   drivers/pci/controller/pci-hyperv.c | 21 +++++++++++--
>   6 files changed, 71 insertions(+), 15 deletions(-)
> 

Very cool stuff :) LGTM.

For the patch series:
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

-- 
Thank you,
Roman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal
  2025-06-10 23:52 ` [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal Nuno Das Neves
@ 2025-06-11 23:06   ` Michael Kelley
  2025-06-16 20:06     ` Nuno Das Neves
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-06-11 23:06 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, manivannan.sadhasivam@linaro.org, robh@kernel.org,
	bhelgaas@google.com, jinankjain@linux.microsoft.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	x86@kernel.org

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
> 
> From: Mukesh Rathor <mrathor@linux.microsoft.com>

The patch Subject line is confusing to me. Perhaps this better captures
the intent:

 "PCI: hv: Don't load the driver for the root partition on a bare-metal hypervisor"

> 
> init_hv_pci_drv() is not relevant for root partition on baremetal as there
> is no vmbus. On nested (with a Windows L1 root), vmbus is present.

This needs more precision. init_hv_pci_drv() isn't what is
"not relevant". It's the entire driver that doesn't need to be loaded
because the root partition on a bare-metal hypervisor doesn't use
VMBus. It's only when the root partition is running on a nested
hypervisor that it uses VMBus, and hence may need the Hyper-V
PCI driver to be loaded.

> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b4f29ee75848..4d25754dfe2f 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -4150,6 +4150,9 @@ static int __init init_hv_pci_drv(void)
>  	if (!hv_is_hyperv_initialized())
>  		return -ENODEV;
> 
> +	if (hv_root_partition() && !hv_nested)
> +		return -ENODEV;
> +
>  	ret = hv_pci_irqchip_init();
>  	if (ret)
>  		return ret;
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event
  2025-06-10 23:52 ` [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event Nuno Das Neves
@ 2025-06-11 23:06   ` Michael Kelley
  2025-06-16 20:18     ` Nuno Das Neves
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-06-11 23:06 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, manivannan.sadhasivam@linaro.org, robh@kernel.org,
	bhelgaas@google.com, jinankjain@linux.microsoft.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	x86@kernel.org

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
> 
> When running nested, these hypercalls must be sent to the L0 hypervisor
> or vmbus will fail.

s/vmbus/VMBus/

> 
> Add ARM64 stubs for the nested hypercall helpers to not break
> compilation (nested is still only supported in x86).
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/arm64/include/asm/mshyperv.h | 10 ++++++++++
>  drivers/hv/connection.c           |  3 +++
>  drivers/hv/hv.c                   |  3 +++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/mshyperv.h
> b/arch/arm64/include/asm/mshyperv.h
> index b721d3134ab6..893d6a2e8dab 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -53,6 +53,16 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>  	return hv_get_msr(reg);
>  }
> 
> +static inline u64 hv_do_nested_hypercall(u64 control, void *input, void *output)
> +{
> +	return U64_MAX;
> +}
> +
> +static inline u64 hv_do_fast_nested_hypercall8(u64 control, u64 input1)
> +{
> +	return U64_MAX;
> +}

I think the definitions of hv_do_nested_hypercall() and
hv_do_fast_nested_hypercall8() are architecture independent. All
they do is add the HV_HYPERCALL_NESTED flag, which when
implemented for ARM64, will presumably be the same flag as
currently defined for x86.  As such, couldn't the definitions of
hv_do_nested_hypercall() and hv_do_fast_nested_hypercall8()
be moved to asm-generic/mshyperv.h? Then stubs would not
be needed for ARM64. These two functions would never be
called on ARM64 because hv_nested is never true on ARM64
(at least for now), but the code would compile. And if either
function was erroneously called on ARM64, presumably
Hyper-V would return an error because HV_HYPERCALL_NESTED
is set.

> +
>  /* SMCCC hypercall parameters */
>  #define HV_SMCCC_FUNC_NUMBER	1
>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index be490c598785..992022bc770c 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -518,6 +518,9 @@ void vmbus_set_event(struct vmbus_channel *channel)
>  					 channel->sig_event, 0);
>  		else
>  			WARN_ON_ONCE(1);
> +	} else if (hv_nested) {
> +		hv_do_fast_nested_hypercall8(HVCALL_SIGNAL_EVENT,
> +					     channel->sig_event);
>  	} else {
>  		hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
>  	}
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 308c8f279df8..99b73e779bf0 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -84,6 +84,9 @@ int hv_post_message(union hv_connection_id connection_id,
>  						   sizeof(*aligned_msg));
>  		else
>  			status = HV_STATUS_INVALID_PARAMETER;
> +	} else if (hv_nested) {
> +		status = hv_do_nested_hypercall(HVCALL_POST_MESSAGE,
> +						aligned_msg, NULL);
>  	} else {
>  		status = hv_do_hypercall(HVCALL_POST_MESSAGE,
>  					 aligned_msg, NULL);

Are HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE the only two
hypercalls that are ever expected to need a "nested" version? I'm
wondering if the function hv_do_nested_hypercall() and
hv_do_fast_nested_hypercall8() could be dropped entirely, and just
pass the first argument to hv_do_hypercall() or hv_do_fast_hypercall8()
as <hypercall_name> | HV_HYPERCALL_NESTED. For only two cases, a
little bit of open coding might be preferable to the overhead of defining
functions just to wrap the or'ing of HV_HYPERCALL_NESTED. 

The code above could then look like:

	} else {
		u64 control = HVCALL_POST_MESSAGE;

		control |= hv_nested ? HV_HYPERCALL_NESTED : 0;
		status = hv_do_hypercall(control, aligned_msg, NULL);
	}

Again, ARM64 is implicitly handled because hv_nested is never set.

This is just a suggestion. It's motivated by the fact that we already have
three flavors of hypercall for HVCALL_SIGNAL_EVENT and
HVCALL_POST_MESSAGE, and I was looking for a way to avoid adding
a fourth flavor. But it's a marginal win, and if you prefer to keep the
inline functions, I'm OK with that.

Michael

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function
  2025-06-10 23:52 ` [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function Nuno Das Neves
@ 2025-06-11 23:07   ` Michael Kelley
  2025-06-18 21:08     ` Nuno Das Neves
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-06-11 23:07 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
> 
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

The preferred patch Subject prefix is "x86/hyperv:"

> 
> This patch moves a part of currently internal logic into the
> hv_map_msi_interrupt function and makes it globally available helper
> function, which will be used to map PCI interrupts in case of root
> partition.

Avoid "this patch" in commit messages.  Suggest:

Create a helper function hv_map_msi_interrupt() that contains some
logic that is currently internal to irqdomain.c. Make the helper function
globally available so it can be used to map PCI interrupts when running
in the root partition.

> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c     | 47 ++++++++++++++++++++++++---------
>  arch/x86/include/asm/mshyperv.h |  2 ++
>  2 files changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 31f0d29cbc5e..82f3bafb93d6 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -169,13 +169,40 @@ static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
>  	return dev_id;
>  }
> 
> -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> -				struct hv_interrupt_entry *entry)
> +/**
> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> + * @data:      Describes the IRQ
> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> + *
> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> + */
> +int hv_map_msi_interrupt(struct irq_data *data,
> +			 struct hv_interrupt_entry *out_entry)
>  {
> -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
> +	struct msi_desc *msidesc;
> +	struct pci_dev *dev;
> +	union hv_device_id device_id;
> +	struct hv_interrupt_entry dummy;
> +	struct irq_cfg *cfg = irqd_cfg(data);
> +	const cpumask_t *affinity;
> +	int cpu;
> +	u64 res;
> 
> -	return hv_map_interrupt(device_id, false, cpu, vector, entry);
> +	msidesc = irq_data_get_msi_desc(data);
> +	dev = msi_desc_to_pci_dev(msidesc);
> +	device_id = hv_build_pci_dev_id(dev);
> +	affinity = irq_data_get_effective_affinity_mask(data);
> +	cpu = cpumask_first_and(affinity, cpu_online_mask);

Is the cpus_read_lock held at this point? I'm not sure what the
overall calling sequence looks like. If it is not held, the CPU that
is selected could go offline before hv_map_interrupt() is called.
This computation of the target CPU is the same as in the code
before this patch, but that existing code looks like it has the
same problem.

> +
> +	res = hv_map_interrupt(device_id, false, cpu, cfg->vector,
> +			       out_entry ? out_entry : &dummy);
> +	if (!hv_result_success(res))
> +		pr_err("%s: failed to map interrupt: %s",
> +		       __func__, hv_result_to_string(res));

hv_map_interrupt() already outputs a message if the hypercall
fails. Is another message needed here?

> +
> +	return hv_result_to_errno(res);

The error handling is rather messed up. First hv_map_interrupt()
sometimes returns a Linux errno (not negated), and sometimes a
hypercall result. The errno is EINVAL, which has value "22", which is
the same as hypercall result HV_STATUS_ACKNOWLEDGED. And
the hypercall result returned from hv_map_interrupt() is just
the result code, not the full 64-bit status, as hv_map_interrupt()
has already done hv_result(status). Hence the "res" input arg to
hv_result_to_errno() isn't really the correct input. For example,
if the hypercall returns U64_MAX, that won't be caught by
hv_result_to_errno() since the value has been truncated to
32 bits. Fixing all this will require some unscrambling.

>  }
> +EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
> 
>  static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
>  {
> @@ -190,10 +217,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  {
>  	struct msi_desc *msidesc;
>  	struct pci_dev *dev;
> -	struct hv_interrupt_entry out_entry, *stored_entry;
> +	struct hv_interrupt_entry *stored_entry;
>  	struct irq_cfg *cfg = irqd_cfg(data);
> -	const cpumask_t *affinity;
> -	int cpu;
>  	u64 status;
> 
>  	msidesc = irq_data_get_msi_desc(data);
> @@ -204,9 +229,6 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  		return;
>  	}
> 
> -	affinity = irq_data_get_effective_affinity_mask(data);
> -	cpu = cpumask_first_and(affinity, cpu_online_mask);
> -
>  	if (data->chip_data) {
>  		/*
>  		 * This interrupt is already mapped. Let's unmap first.
> @@ -235,15 +257,14 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  		return;
>  	}
> 
> -	status = hv_map_msi_interrupt(dev, cpu, cfg->vector, &out_entry);
> +	status = hv_map_msi_interrupt(data, stored_entry);
>  	if (status != HV_STATUS_SUCCESS) {

hv_map_msi_interrupt() returns an errno, so testing for HV_STATUS_SUCCESS
is bogus.

>  		kfree(stored_entry);
>  		return;
>  	}
> 
> -	*stored_entry = out_entry;
>  	data->chip_data = stored_entry;
> -	entry_to_msi_msg(&out_entry, msg);
> +	entry_to_msi_msg(data->chip_data, msg);
> 
>  	return;
>  }
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 5ec92e3e2e37..843121465ddd 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -261,6 +261,8 @@ static inline void hv_apic_init(void) {}
> 
>  struct irq_domain *hv_create_pci_msi_domain(void);
> 
> +int hv_map_msi_interrupt(struct irq_data *data,
> +			 struct hv_interrupt_entry *out_entry);
>  int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
>  		struct hv_interrupt_entry *entry);
>  int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested
  2025-06-10 23:52 ` [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested Nuno Das Neves
@ 2025-06-11 23:07   ` Michael Kelley
  2025-06-18 21:24     ` Nuno Das Neves
  2025-06-13 19:36   ` Bjorn Helgaas
  1 sibling, 1 reply; 19+ messages in thread
From: Michael Kelley @ 2025-06-11 23:07 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
> 
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> Running as nested root on MSHV imposes a different requirement
> for the pci-hyperv controller.
> 
> In this setup, the interrupt will first come to the L1 (nested) hypervisor,
> which will deliver it to the appropriate root CPU. Instead of issuing the
> RETARGET hypercall, we should issue the MAP_DEVICE_INTERRUPT
> hypercall to L1 to complete the setup.
> 
> Rename hv_arch_irq_unmask() to hv_irq_retarget_interrupt().
> 
> Co-developed-by: Jinank Jain <jinankjain@linux.microsoft.com>
> Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 4d25754dfe2f..0f491c802fb9 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -600,7 +600,7 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>  #define hv_msi_prepare		pci_msi_prepare
> 
>  /**
> - * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
> + * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
>   * affinity.
>   * @data:	Describes the IRQ
>   *
> @@ -609,7 +609,7 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>   * is built out of this PCI bus's instance GUID and the function
>   * number of the device.
>   */
> -static void hv_arch_irq_unmask(struct irq_data *data)
> +static void hv_irq_retarget_interrupt(struct irq_data *data)
>  {
>  	struct msi_desc *msi_desc = irq_data_get_msi_desc(data);
>  	struct hv_retarget_device_interrupt *params;
> @@ -714,6 +714,20 @@ static void hv_arch_irq_unmask(struct irq_data *data)
>  		dev_err(&hbus->hdev->device,
>  			"%s() failed: %#llx", __func__, res);
>  }
> +
> +static void hv_arch_irq_unmask(struct irq_data *data)
> +{
> +	if (hv_nested && hv_root_partition())

Based on Patch 1 of this series, this driver is not loaded for the root
partition in the non-nested case. So testing hv_nested is redundant.

> +		/*
> +		 * In case of the nested root partition, the nested hypervisor
> +		 * is taking care of interrupt remapping and thus the
> +		 * MAP_DEVICE_INTERRUPT hypercall is required instead of
> +		 * RETARGET_INTERRUPT.
> +		 */
> +		(void)hv_map_msi_interrupt(data, NULL);
> +	else
> +		hv_irq_retarget_interrupt(data);
> +}
>  #elif defined(CONFIG_ARM64)
>  /*
>   * SPI vectors to use for vPCI; arch SPIs range is [32, 1019], but leaving a bit
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested
  2025-06-10 23:52 ` [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested Nuno Das Neves
  2025-06-11 23:07   ` Michael Kelley
@ 2025-06-13 19:36   ` Bjorn Helgaas
  2025-06-18 22:45     ` Nuno Das Neves
  1 sibling, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2025-06-13 19:36 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci, kys,
	haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will, tglx,
	mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86

On Tue, Jun 10, 2025 at 04:52:06PM -0700, Nuno Das Neves wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> Running as nested root on MSHV imposes a different requirement
> for the pci-hyperv controller.
> 
> In this setup, the interrupt will first come to the L1 (nested) hypervisor,
> which will deliver it to the appropriate root CPU. Instead of issuing the
> RETARGET hypercall, we should issue the MAP_DEVICE_INTERRUPT
> hypercall to L1 to complete the setup.

Maybe strengthen this to say that this issues MAP_DEVICE_INTERRUPT
instead of RETARGET in this case?  (Not just that we "should".) 

> Rename hv_arch_irq_unmask() to hv_irq_retarget_interrupt().

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal
  2025-06-11 23:06   ` Michael Kelley
@ 2025-06-16 20:06     ` Nuno Das Neves
  0 siblings, 0 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-16 20:06 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, manivannan.sadhasivam@linaro.org, robh@kernel.org,
	bhelgaas@google.com, jinankjain@linux.microsoft.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	x86@kernel.org

On 6/11/2025 4:06 PM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
>>
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> 
> The patch Subject line is confusing to me. Perhaps this better captures
> the intent:
> 
>  "PCI: hv: Don't load the driver for the root partition on a bare-metal hypervisor"
> 

Thanks, that does make more sense.

>>
>> init_hv_pci_drv() is not relevant for root partition on baremetal as there
>> is no vmbus. On nested (with a Windows L1 root), vmbus is present.
> 
> This needs more precision. init_hv_pci_drv() isn't what is
> "not relevant". It's the entire driver that doesn't need to be loaded
> because the root partition on a bare-metal hypervisor doesn't use
> VMBus. It's only when the root partition is running on a nested
> hypervisor that it uses VMBus, and hence may need the Hyper-V
> PCI driver to be loaded.
> 

I'll update it so it is clearer, thanks.

>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  drivers/pci/controller/pci-hyperv.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>> index b4f29ee75848..4d25754dfe2f 100644
>> --- a/drivers/pci/controller/pci-hyperv.c
>> +++ b/drivers/pci/controller/pci-hyperv.c
>> @@ -4150,6 +4150,9 @@ static int __init init_hv_pci_drv(void)
>>  	if (!hv_is_hyperv_initialized())
>>  		return -ENODEV;
>>
>> +	if (hv_root_partition() && !hv_nested)
>> +		return -ENODEV;
>> +
>>  	ret = hv_pci_irqchip_init();
>>  	if (ret)
>>  		return ret;
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event
  2025-06-11 23:06   ` Michael Kelley
@ 2025-06-16 20:18     ` Nuno Das Neves
  0 siblings, 0 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-16 20:18 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, manivannan.sadhasivam@linaro.org, robh@kernel.org,
	bhelgaas@google.com, jinankjain@linux.microsoft.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	x86@kernel.org

On 6/11/2025 4:06 PM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
>>
>> When running nested, these hypercalls must be sent to the L0 hypervisor
>> or vmbus will fail.
> 
> s/vmbus/VMBus/
> 

Ack

>>
>> Add ARM64 stubs for the nested hypercall helpers to not break
>> compilation (nested is still only supported in x86).
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  arch/arm64/include/asm/mshyperv.h | 10 ++++++++++
>>  drivers/hv/connection.c           |  3 +++
>>  drivers/hv/hv.c                   |  3 +++
>>  3 files changed, 16 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index b721d3134ab6..893d6a2e8dab 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -53,6 +53,16 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>  	return hv_get_msr(reg);
>>  }
>>
>> +static inline u64 hv_do_nested_hypercall(u64 control, void *input, void *output)
>> +{
>> +	return U64_MAX;
>> +}
>> +
>> +static inline u64 hv_do_fast_nested_hypercall8(u64 control, u64 input1)
>> +{
>> +	return U64_MAX;
>> +}
> 
> I think the definitions of hv_do_nested_hypercall() and
> hv_do_fast_nested_hypercall8() are architecture independent. All
> they do is add the HV_HYPERCALL_NESTED flag, which when
> implemented for ARM64, will presumably be the same flag as
> currently defined for x86.  As such, couldn't the definitions of
> hv_do_nested_hypercall() and hv_do_fast_nested_hypercall8()
> be moved to asm-generic/mshyperv.h? Then stubs would not
> be needed for ARM64. These two functions would never be
> called on ARM64 because hv_nested is never true on ARM64
> (at least for now), but the code would compile. And if either
> function was erroneously called on ARM64, presumably
> Hyper-V would return an error because HV_HYPERCALL_NESTED
> is set.
> 

Good point - letting the hypervisor return the error is fine.

>> +
>>  /* SMCCC hypercall parameters */
>>  #define HV_SMCCC_FUNC_NUMBER	1
>>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
>> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
>> index be490c598785..992022bc770c 100644
>> --- a/drivers/hv/connection.c
>> +++ b/drivers/hv/connection.c
>> @@ -518,6 +518,9 @@ void vmbus_set_event(struct vmbus_channel *channel)
>>  					 channel->sig_event, 0);
>>  		else
>>  			WARN_ON_ONCE(1);
>> +	} else if (hv_nested) {
>> +		hv_do_fast_nested_hypercall8(HVCALL_SIGNAL_EVENT,
>> +					     channel->sig_event);
>>  	} else {
>>  		hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
>>  	}
>> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
>> index 308c8f279df8..99b73e779bf0 100644
>> --- a/drivers/hv/hv.c
>> +++ b/drivers/hv/hv.c
>> @@ -84,6 +84,9 @@ int hv_post_message(union hv_connection_id connection_id,
>>  						   sizeof(*aligned_msg));
>>  		else
>>  			status = HV_STATUS_INVALID_PARAMETER;
>> +	} else if (hv_nested) {
>> +		status = hv_do_nested_hypercall(HVCALL_POST_MESSAGE,
>> +						aligned_msg, NULL);
>>  	} else {
>>  		status = hv_do_hypercall(HVCALL_POST_MESSAGE,
>>  					 aligned_msg, NULL);
> 
> Are HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE the only two
> hypercalls that are ever expected to need a "nested" version? I'm
> wondering if the function hv_do_nested_hypercall() and
> hv_do_fast_nested_hypercall8() could be dropped entirely, and just
> pass the first argument to hv_do_hypercall() or hv_do_fast_hypercall8()
> as <hypercall_name> | HV_HYPERCALL_NESTED. For only two cases, a
> little bit of open coding might be preferable to the overhead of defining
> functions just to wrap the or'ing of HV_HYPERCALL_NESTED. 
> 
> The code above could then look like:
> 
> 	} else {
> 		u64 control = HVCALL_POST_MESSAGE;
> 
> 		control |= hv_nested ? HV_HYPERCALL_NESTED : 0;
> 		status = hv_do_hypercall(control, aligned_msg, NULL);
> 	}
> 
> Again, ARM64 is implicitly handled because hv_nested is never set.
> 
> This is just a suggestion. It's motivated by the fact that we already have
> three flavors of hypercall for HVCALL_SIGNAL_EVENT and
> HVCALL_POST_MESSAGE, and I was looking for a way to avoid adding
> a fourth flavor. But it's a marginal win, and if you prefer to keep the
> inline functions, I'm OK with that.
> 

I like the suggestion to open-code these cases.

There are several places we need to get/set nested MSRs, but as for hypercalls
it is just these 2, so far. When I consider it from that angle, it feels cleaner
to just open-code them, and remove the existing '_nested' versions of the
hypercall helpers for now.

Thanks
Nuno

> Michael


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function
  2025-06-11 23:07   ` Michael Kelley
@ 2025-06-18 21:08     ` Nuno Das Neves
  2025-06-19 22:02       ` Michael Kelley
  2025-06-20 16:19       ` Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-18 21:08 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

On 6/11/2025 4:07 PM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
>>
>> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> The preferred patch Subject prefix is "x86/hyperv:"
> 

Thank you for clarifying - I thought I saw some precedent for x86: hyperv:
but must have been mistaken.

>>
>> This patch moves a part of currently internal logic into the
>> hv_map_msi_interrupt function and makes it globally available helper
>> function, which will be used to map PCI interrupts in case of root
>> partition.
> 
> Avoid "this patch" in commit messages.  Suggest:
> 
> Create a helper function hv_map_msi_interrupt() that contains some
> logic that is currently internal to irqdomain.c. Make the helper function
> globally available so it can be used to map PCI interrupts when running
> in the root partition.
> 

Thanks, I'll rephrase.

>>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/irqdomain.c     | 47 ++++++++++++++++++++++++---------
>>  arch/x86/include/asm/mshyperv.h |  2 ++
>>  2 files changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
>> index 31f0d29cbc5e..82f3bafb93d6 100644
>> --- a/arch/x86/hyperv/irqdomain.c
>> +++ b/arch/x86/hyperv/irqdomain.c
>> @@ -169,13 +169,40 @@ static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
>>  	return dev_id;
>>  }
>>
>> -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
>> -				struct hv_interrupt_entry *entry)
>> +/**
>> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
>> + * @data:      Describes the IRQ
>> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
>> + *
>> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
>> + */
>> +int hv_map_msi_interrupt(struct irq_data *data,
>> +			 struct hv_interrupt_entry *out_entry)
>>  {
>> -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
>> +	struct msi_desc *msidesc;
>> +	struct pci_dev *dev;
>> +	union hv_device_id device_id;
>> +	struct hv_interrupt_entry dummy;
>> +	struct irq_cfg *cfg = irqd_cfg(data);
>> +	const cpumask_t *affinity;
>> +	int cpu;
>> +	u64 res;
>>
>> -	return hv_map_interrupt(device_id, false, cpu, vector, entry);
>> +	msidesc = irq_data_get_msi_desc(data);
>> +	dev = msi_desc_to_pci_dev(msidesc);
>> +	device_id = hv_build_pci_dev_id(dev);
>> +	affinity = irq_data_get_effective_affinity_mask(data);
>> +	cpu = cpumask_first_and(affinity, cpu_online_mask);
> 
> Is the cpus_read_lock held at this point? I'm not sure what the
> overall calling sequence looks like. If it is not held, the CPU that
> is selected could go offline before hv_map_interrupt() is called.
> This computation of the target CPU is the same as in the code
> before this patch, but that existing code looks like it has the
> same problem.
> 

Thanks for pointing it out - It *looks* like the read lock is not held
everywhere this could be called, so it could indeed be a problem.

I've been thinking about different ways around this but I lack the
knowledge to have an informed opinion about it:

- We could take the cpu read lock in this function, would that work?

- I'm not actually sure why the code is getting the first cpu off the effective
  affinity mask in the first place. It is possible to get the apic id (and hence
  the cpu) already associated with the irq, as per e.g. x86_vector_msi_compose_msg()
  Maybe we could get the cpu that way, assuming that doesn't have a similar issue.

- We could just let this race happen, maybe the outcome isn't too catastrophic?

What do you think?

>> +
>> +	res = hv_map_interrupt(device_id, false, cpu, cfg->vector,
>> +			       out_entry ? out_entry : &dummy);
>> +	if (!hv_result_success(res))
>> +		pr_err("%s: failed to map interrupt: %s",
>> +		       __func__, hv_result_to_string(res));
> 
> hv_map_interrupt() already outputs a message if the hypercall
> fails. Is another message needed here?
> 

It does print the function name, which gives additional context.
Probably it can just be removed however.

>> +
>> +	return hv_result_to_errno(res);
> 
> The error handling is rather messed up. First hv_map_interrupt()
> sometimes returns a Linux errno (not negated), and sometimes a
> hypercall result. The errno is EINVAL, which has value "22", which is
> the same as hypercall result HV_STATUS_ACKNOWLEDGED. And
> the hypercall result returned from hv_map_interrupt() is just
> the result code, not the full 64-bit status, as hv_map_interrupt()
> has already done hv_result(status). Hence the "res" input arg to
> hv_result_to_errno() isn't really the correct input. For example,
> if the hypercall returns U64_MAX, that won't be caught by
> hv_result_to_errno() since the value has been truncated to
> 32 bits. Fixing all this will require some unscrambling.
> 

Good point, it's pretty messed up! I think in v2 I'll add a patch to
clean this up first.

>>  }
>> +EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
>>
>>  static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
>>  {
>> @@ -190,10 +217,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>>  {
>>  	struct msi_desc *msidesc;
>>  	struct pci_dev *dev;
>> -	struct hv_interrupt_entry out_entry, *stored_entry;
>> +	struct hv_interrupt_entry *stored_entry;
>>  	struct irq_cfg *cfg = irqd_cfg(data);
>> -	const cpumask_t *affinity;
>> -	int cpu;
>>  	u64 status;
>>
>>  	msidesc = irq_data_get_msi_desc(data);
>> @@ -204,9 +229,6 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>>  		return;
>>  	}
>>
>> -	affinity = irq_data_get_effective_affinity_mask(data);
>> -	cpu = cpumask_first_and(affinity, cpu_online_mask);
>> -
>>  	if (data->chip_data) {
>>  		/*
>>  		 * This interrupt is already mapped. Let's unmap first.
>> @@ -235,15 +257,14 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>>  		return;
>>  	}
>>
>> -	status = hv_map_msi_interrupt(dev, cpu, cfg->vector, &out_entry);
>> +	status = hv_map_msi_interrupt(data, stored_entry);
>>  	if (status != HV_STATUS_SUCCESS) {
> 
> hv_map_msi_interrupt() returns an errno, so testing for HV_STATUS_SUCCESS
> is bogus.
> 

Thanks, noted.

>>  		kfree(stored_entry);
>>  		return;
>>  	}
>>
>> -	*stored_entry = out_entry;
>>  	data->chip_data = stored_entry;
>> -	entry_to_msi_msg(&out_entry, msg);
>> +	entry_to_msi_msg(data->chip_data, msg);
>>
>>  	return;
>>  }
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 5ec92e3e2e37..843121465ddd 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -261,6 +261,8 @@ static inline void hv_apic_init(void) {}
>>
>>  struct irq_domain *hv_create_pci_msi_domain(void);
>>
>> +int hv_map_msi_interrupt(struct irq_data *data,
>> +			 struct hv_interrupt_entry *out_entry);
>>  int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
>>  		struct hv_interrupt_entry *entry);
>>  int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested
  2025-06-11 23:07   ` Michael Kelley
@ 2025-06-18 21:24     ` Nuno Das Neves
  0 siblings, 0 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-18 21:24 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

On 6/11/2025 4:07 PM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
>>
>> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>
>> Running as nested root on MSHV imposes a different requirement
>> for the pci-hyperv controller.
>>
>> In this setup, the interrupt will first come to the L1 (nested) hypervisor,
>> which will deliver it to the appropriate root CPU. Instead of issuing the
>> RETARGET hypercall, we should issue the MAP_DEVICE_INTERRUPT
>> hypercall to L1 to complete the setup.
>>
>> Rename hv_arch_irq_unmask() to hv_irq_retarget_interrupt().
>>
>> Co-developed-by: Jinank Jain <jinankjain@linux.microsoft.com>
>> Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  drivers/pci/controller/pci-hyperv.c | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>> index 4d25754dfe2f..0f491c802fb9 100644
>> --- a/drivers/pci/controller/pci-hyperv.c
>> +++ b/drivers/pci/controller/pci-hyperv.c
>> @@ -600,7 +600,7 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>  #define hv_msi_prepare		pci_msi_prepare
>>
>>  /**
>> - * hv_arch_irq_unmask() - "Unmask" the IRQ by setting its current
>> + * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
>>   * affinity.
>>   * @data:	Describes the IRQ
>>   *
>> @@ -609,7 +609,7 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>   * is built out of this PCI bus's instance GUID and the function
>>   * number of the device.
>>   */
>> -static void hv_arch_irq_unmask(struct irq_data *data)
>> +static void hv_irq_retarget_interrupt(struct irq_data *data)
>>  {
>>  	struct msi_desc *msi_desc = irq_data_get_msi_desc(data);
>>  	struct hv_retarget_device_interrupt *params;
>> @@ -714,6 +714,20 @@ static void hv_arch_irq_unmask(struct irq_data *data)
>>  		dev_err(&hbus->hdev->device,
>>  			"%s() failed: %#llx", __func__, res);
>>  }
>> +
>> +static void hv_arch_irq_unmask(struct irq_data *data)
>> +{
>> +	if (hv_nested && hv_root_partition())
> 
> Based on Patch 1 of this series, this driver is not loaded for the root
> partition in the non-nested case. So testing hv_nested is redundant.
> 

Good point, I'll change it to just check for hv_root_partition() here.

>> +		/*
>> +		 * In case of the nested root partition, the nested hypervisor
>> +		 * is taking care of interrupt remapping and thus the
>> +		 * MAP_DEVICE_INTERRUPT hypercall is required instead of
>> +		 * RETARGET_INTERRUPT.
>> +		 */
>> +		(void)hv_map_msi_interrupt(data, NULL);
>> +	else
>> +		hv_irq_retarget_interrupt(data);
>> +}
>>  #elif defined(CONFIG_ARM64)
>>  /*
>>   * SPI vectors to use for vPCI; arch SPIs range is [32, 1019], but leaving a bit
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested
  2025-06-13 19:36   ` Bjorn Helgaas
@ 2025-06-18 22:45     ` Nuno Das Neves
  0 siblings, 0 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-18 22:45 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-hyperv, linux-arm-kernel, linux-kernel, linux-pci, kys,
	haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will, tglx,
	mingo, bp, dave.hansen, hpa, lpieralisi, kw,
	manivannan.sadhasivam, robh, bhelgaas, jinankjain, skinsburskii,
	mrathor, x86

On 6/13/2025 12:36 PM, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2025 at 04:52:06PM -0700, Nuno Das Neves wrote:
>> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>
>> Running as nested root on MSHV imposes a different requirement
>> for the pci-hyperv controller.
>>
>> In this setup, the interrupt will first come to the L1 (nested) hypervisor,
>> which will deliver it to the appropriate root CPU. Instead of issuing the
>> RETARGET hypercall, we should issue the MAP_DEVICE_INTERRUPT
>> hypercall to L1 to complete the setup.
> 
> Maybe strengthen this to say that this issues MAP_DEVICE_INTERRUPT
> instead of RETARGET in this case?  (Not just that we "should".) 
> 

Good suggestion, thanks.

>> Rename hv_arch_irq_unmask() to hv_irq_retarget_interrupt().


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function
  2025-06-18 21:08     ` Nuno Das Neves
@ 2025-06-19 22:02       ` Michael Kelley
  2025-06-20 16:19       ` Thomas Gleixner
  1 sibling, 0 replies; 19+ messages in thread
From: Michael Kelley @ 2025-06-19 22:02 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, lpieralisi@kernel.org,
	kw@linux.com, robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, June 18, 2025 2:08 PM
> 
> On 6/11/2025 4:07 PM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June
> 10, 2025 4:52 PM
> >>
> >> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >
> > The preferred patch Subject prefix is "x86/hyperv:"
> >
> 
> Thank you for clarifying - I thought I saw some precedent for x86: hyperv:
> but must have been mistaken.
> 
> >>
> >> This patch moves a part of currently internal logic into the
> >> hv_map_msi_interrupt function and makes it globally available helper
> >> function, which will be used to map PCI interrupts in case of root
> >> partition.
> >
> > Avoid "this patch" in commit messages.  Suggest:
> >
> > Create a helper function hv_map_msi_interrupt() that contains some
> > logic that is currently internal to irqdomain.c. Make the helper function
> > globally available so it can be used to map PCI interrupts when running
> > in the root partition.
> >
> 
> Thanks, I'll rephrase.
> 
> >>
> >> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> ---
> >>  arch/x86/hyperv/irqdomain.c     | 47 ++++++++++++++++++++++++---------
> >>  arch/x86/include/asm/mshyperv.h |  2 ++
> >>  2 files changed, 36 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> >> index 31f0d29cbc5e..82f3bafb93d6 100644
> >> --- a/arch/x86/hyperv/irqdomain.c
> >> +++ b/arch/x86/hyperv/irqdomain.c
> >> @@ -169,13 +169,40 @@ static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
> >>  	return dev_id;
> >>  }
> >>
> >> -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> >> -				struct hv_interrupt_entry *entry)
> >> +/**
> >> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> >> + * @data:      Describes the IRQ
> >> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> >> + *
> >> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> >> + */
> >> +int hv_map_msi_interrupt(struct irq_data *data,
> >> +			 struct hv_interrupt_entry *out_entry)
> >>  {
> >> -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
> >> +	struct msi_desc *msidesc;
> >> +	struct pci_dev *dev;
> >> +	union hv_device_id device_id;
> >> +	struct hv_interrupt_entry dummy;
> >> +	struct irq_cfg *cfg = irqd_cfg(data);
> >> +	const cpumask_t *affinity;
> >> +	int cpu;
> >> +	u64 res;
> >>
> >> -	return hv_map_interrupt(device_id, false, cpu, vector, entry);
> >> +	msidesc = irq_data_get_msi_desc(data);
> >> +	dev = msi_desc_to_pci_dev(msidesc);
> >> +	device_id = hv_build_pci_dev_id(dev);
> >> +	affinity = irq_data_get_effective_affinity_mask(data);
> >> +	cpu = cpumask_first_and(affinity, cpu_online_mask);
> >
> > Is the cpus_read_lock held at this point? I'm not sure what the
> > overall calling sequence looks like. If it is not held, the CPU that
> > is selected could go offline before hv_map_interrupt() is called.
> > This computation of the target CPU is the same as in the code
> > before this patch, but that existing code looks like it has the
> > same problem.
> >
> 
> Thanks for pointing it out - It *looks* like the read lock is not held
> everywhere this could be called, so it could indeed be a problem.
> 
> I've been thinking about different ways around this but I lack the
> knowledge to have an informed opinion about it:
> 
> - We could take the cpu read lock in this function, would that work?
> 
> - I'm not actually sure why the code is getting the first cpu off the effective
>   affinity mask in the first place. It is possible to get the apic id (and hence
>   the cpu) already associated with the irq, as per e.g. x86_vector_msi_compose_msg()
>   Maybe we could get the cpu that way, assuming that doesn't have a similar issue.
> 
> - We could just let this race happen, maybe the outcome isn't too catastrophic?
> 
> What do you think?

I would have to study further to provide good answers to your questions as
I don't have deep knowledge of this area off the top of my head. The code
looked suspicious because AND'ing the affinity with the cpu_online_mask in
the first place is presumably to prevent assigning the interrupt to a CPU
that is offline. That's a valid intent, since such assigning would indeed be
problematic.

But as written the code is inherently racy unless the cpus_read_lock() is
held. I'm on vacation all next week, and probably won't be able to look at
this again until early July. So the best I can do for now is flag the issue.

Michael

> 
> >> +
> >> +	res = hv_map_interrupt(device_id, false, cpu, cfg->vector,
> >> +			       out_entry ? out_entry : &dummy);
> >> +	if (!hv_result_success(res))
> >> +		pr_err("%s: failed to map interrupt: %s",
> >> +		       __func__, hv_result_to_string(res));
> >
> > hv_map_interrupt() already outputs a message if the hypercall
> > fails. Is another message needed here?
> >
> 
> It does print the function name, which gives additional context.
> Probably it can just be removed however.
> 
> >> +
> >> +	return hv_result_to_errno(res);
> >
> > The error handling is rather messed up. First hv_map_interrupt()
> > sometimes returns a Linux errno (not negated), and sometimes a
> > hypercall result. The errno is EINVAL, which has value "22", which is
> > the same as hypercall result HV_STATUS_ACKNOWLEDGED. And
> > the hypercall result returned from hv_map_interrupt() is just
> > the result code, not the full 64-bit status, as hv_map_interrupt()
> > has already done hv_result(status). Hence the "res" input arg to
> > hv_result_to_errno() isn't really the correct input. For example,
> > if the hypercall returns U64_MAX, that won't be caught by
> > hv_result_to_errno() since the value has been truncated to
> > 32 bits. Fixing all this will require some unscrambling.
> >
> 
> Good point, it's pretty messed up! I think in v2 I'll add a patch to
> clean this up first.
> 
> >>  }
> >> +EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
> >>
> >>  static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct
> msi_msg *msg)
> >>  {
> >> @@ -190,10 +217,8 @@ static void hv_irq_compose_msi_msg(struct irq_data
> *data, struct msi_msg *msg)
> >>  {
> >>  	struct msi_desc *msidesc;
> >>  	struct pci_dev *dev;
> >> -	struct hv_interrupt_entry out_entry, *stored_entry;
> >> +	struct hv_interrupt_entry *stored_entry;
> >>  	struct irq_cfg *cfg = irqd_cfg(data);
> >> -	const cpumask_t *affinity;
> >> -	int cpu;
> >>  	u64 status;
> >>
> >>  	msidesc = irq_data_get_msi_desc(data);
> >> @@ -204,9 +229,6 @@ static void hv_irq_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
> >>  		return;
> >>  	}
> >>
> >> -	affinity = irq_data_get_effective_affinity_mask(data);
> >> -	cpu = cpumask_first_and(affinity, cpu_online_mask);
> >> -
> >>  	if (data->chip_data) {
> >>  		/*
> >>  		 * This interrupt is already mapped. Let's unmap first.
> >> @@ -235,15 +257,14 @@ static void hv_irq_compose_msi_msg(struct irq_data
> *data, struct msi_msg *msg)
> >>  		return;
> >>  	}
> >>
> >> -	status = hv_map_msi_interrupt(dev, cpu, cfg->vector, &out_entry);
> >> +	status = hv_map_msi_interrupt(data, stored_entry);
> >>  	if (status != HV_STATUS_SUCCESS) {
> >
> > hv_map_msi_interrupt() returns an errno, so testing for HV_STATUS_SUCCESS
> > is bogus.
> >
> 
> Thanks, noted.
> 
> >>  		kfree(stored_entry);
> >>  		return;
> >>  	}
> >>
> >> -	*stored_entry = out_entry;
> >>  	data->chip_data = stored_entry;
> >> -	entry_to_msi_msg(&out_entry, msg);
> >> +	entry_to_msi_msg(data->chip_data, msg);
> >>
> >>  	return;
> >>  }
> >> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> >> index 5ec92e3e2e37..843121465ddd 100644
> >> --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -261,6 +261,8 @@ static inline void hv_apic_init(void) {}
> >>
> >>  struct irq_domain *hv_create_pci_msi_domain(void);
> >>
> >> +int hv_map_msi_interrupt(struct irq_data *data,
> >> +			 struct hv_interrupt_entry *out_entry);
> >>  int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
> >>  		struct hv_interrupt_entry *entry);
> >>  int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
> >> --
> >> 2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function
  2025-06-18 21:08     ` Nuno Das Neves
  2025-06-19 22:02       ` Michael Kelley
@ 2025-06-20 16:19       ` Thomas Gleixner
  2025-06-23 22:13         ` Nuno Das Neves
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2025-06-20 16:19 UTC (permalink / raw)
  To: Nuno Das Neves, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

On Wed, Jun 18 2025 at 14:08, Nuno Das Neves wrote:
> On 6/11/2025 4:07 PM, Michael Kelley wrote:
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
>>> +/**
>>> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
>>> + * @data:      Describes the IRQ
>>> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
>>> + *
>>> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
>>> + */
>>> +int hv_map_msi_interrupt(struct irq_data *data,
>>> +			 struct hv_interrupt_entry *out_entry)
>>>  {
>>> -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
>>> +	struct msi_desc *msidesc;
>>> +	struct pci_dev *dev;
>>> +	union hv_device_id device_id;
>>> +	struct hv_interrupt_entry dummy;
>>> +	struct irq_cfg *cfg = irqd_cfg(data);
>>> +	const cpumask_t *affinity;
>>> +	int cpu;
>>> +	u64 res;

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations

>>>
>>> -	return hv_map_interrupt(device_id, false, cpu, vector, entry);
>>> +	msidesc = irq_data_get_msi_desc(data);
>>> +	dev = msi_desc_to_pci_dev(msidesc);
>>> +	device_id = hv_build_pci_dev_id(dev);
>>> +	affinity = irq_data_get_effective_affinity_mask(data);
>>> +	cpu = cpumask_first_and(affinity, cpu_online_mask);
>> 
>> Is the cpus_read_lock held at this point? I'm not sure what the
>> overall calling sequence looks like. If it is not held, the CPU that
>> is selected could go offline before hv_map_interrupt() is called.
>> This computation of the target CPU is the same as in the code
>> before this patch, but that existing code looks like it has the
>> same problem.
>> 
>
> Thanks for pointing it out - It *looks* like the read lock is not held
> everywhere this could be called, so it could indeed be a problem.
>
> I've been thinking about different ways around this but I lack the
> knowledge to have an informed opinion about it:
>
> - We could take the cpu read lock in this function, would that work?

Obviously not.

> - I'm not actually sure why the code is getting the first cpu off the effective
>   affinity mask in the first place. It is possible to get the apic id (and hence
>   the cpu) already associated with the irq, as per e.g. x86_vector_msi_compose_msg()
>   Maybe we could get the cpu that way, assuming that doesn't have a
>   similar issue.

There is no reason to fiddle in the underlying low level data. The
effective affinity mask is there for a reason.

> - We could just let this race happen, maybe the outcome isn't too catastrophic?

Let's terminate guesswork mode and look at the facts.

The point is that hv_map_msi_interrupt() is called from:

    1) hv_irq_compose_msi_msg()

    2) hv_arch_irq_unmask() (in patch 4/4)

Both functions are interrupt chip callbacks and invoked with the
interrupt descriptor lock held.

At the point where they are called, the effective affinity mask is valid
and immutable. Nothing can modify it as any modification requires the
interrupt descriptor lock to be held. This applies to the CPU hotplug
machinery too. So this AND cpu_online_mask is a complete pointless
voodoo exercise.

Just use:

     cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));

and be done with it.

Please fix that first with a seperate patch before moving this code
around.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function
  2025-06-20 16:19       ` Thomas Gleixner
@ 2025-06-23 22:13         ` Nuno Das Neves
  0 siblings, 0 replies; 19+ messages in thread
From: Nuno Das Neves @ 2025-06-23 22:13 UTC (permalink / raw)
  To: Thomas Gleixner, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	jinankjain@linux.microsoft.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, x86@kernel.org

On 6/20/2025 9:19 AM, Thomas Gleixner wrote:
> On Wed, Jun 18 2025 at 14:08, Nuno Das Neves wrote:
>> On 6/11/2025 4:07 PM, Michael Kelley wrote:
>>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, June 10, 2025 4:52 PM
>>>> +/**
>>>> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
>>>> + * @data:      Describes the IRQ
>>>> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
>>>> + *
>>>> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
>>>> + */
>>>> +int hv_map_msi_interrupt(struct irq_data *data,
>>>> +			 struct hv_interrupt_entry *out_entry)
>>>>  {
>>>> -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
>>>> +	struct msi_desc *msidesc;
>>>> +	struct pci_dev *dev;
>>>> +	union hv_device_id device_id;
>>>> +	struct hv_interrupt_entry dummy;
>>>> +	struct irq_cfg *cfg = irqd_cfg(data);
>>>> +	const cpumask_t *affinity;
>>>> +	int cpu;
>>>> +	u64 res;
> 
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations
> 
>>>>
>>>> -	return hv_map_interrupt(device_id, false, cpu, vector, entry);
>>>> +	msidesc = irq_data_get_msi_desc(data);
>>>> +	dev = msi_desc_to_pci_dev(msidesc);
>>>> +	device_id = hv_build_pci_dev_id(dev);
>>>> +	affinity = irq_data_get_effective_affinity_mask(data);
>>>> +	cpu = cpumask_first_and(affinity, cpu_online_mask);
>>>
>>> Is the cpus_read_lock held at this point? I'm not sure what the
>>> overall calling sequence looks like. If it is not held, the CPU that
>>> is selected could go offline before hv_map_interrupt() is called.
>>> This computation of the target CPU is the same as in the code
>>> before this patch, but that existing code looks like it has the
>>> same problem.
>>>
>>
>> Thanks for pointing it out - It *looks* like the read lock is not held
>> everywhere this could be called, so it could indeed be a problem.
>>
>> I've been thinking about different ways around this but I lack the
>> knowledge to have an informed opinion about it:
>>
>> - We could take the cpu read lock in this function, would that work?
> 
> Obviously not.
> 
>> - I'm not actually sure why the code is getting the first cpu off the effective
>>   affinity mask in the first place. It is possible to get the apic id (and hence
>>   the cpu) already associated with the irq, as per e.g. x86_vector_msi_compose_msg()
>>   Maybe we could get the cpu that way, assuming that doesn't have a
>>   similar issue.
> 
> There is no reason to fiddle in the underlying low level data. The
> effective affinity mask is there for a reason.
> 
>> - We could just let this race happen, maybe the outcome isn't too catastrophic?
> 
> Let's terminate guesswork mode and look at the facts.
> 
> The point is that hv_map_msi_interrupt() is called from:
> 
>     1) hv_irq_compose_msi_msg()
> 
>     2) hv_arch_irq_unmask() (in patch 4/4)
> 
> Both functions are interrupt chip callbacks and invoked with the
> interrupt descriptor lock held.
> 
> At the point where they are called, the effective affinity mask is valid
> and immutable. Nothing can modify it as any modification requires the
> interrupt descriptor lock to be held. This applies to the CPU hotplug
> machinery too. So this AND cpu_online_mask is a complete pointless
> voodoo exercise.
> 

Thanks for explaining.

> Just use:
> 
>      cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
> 
> and be done with it.
> 
> Please fix that first with a seperate patch before moving this code
> around.

Will do!

Nuno

> 
> Thanks,
> 
>         tglx


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-06-23 22:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-10 23:52 [PATCH 0/4] Nested virtualization fixes for root partition Nuno Das Neves
2025-06-10 23:52 ` [PATCH 1/4] PCI: hv: Do not do vmbus initialization on baremetal Nuno Das Neves
2025-06-11 23:06   ` Michael Kelley
2025-06-16 20:06     ` Nuno Das Neves
2025-06-10 23:52 ` [PATCH 2/4] Drivers: hv: Use nested hypercall for post message and signal event Nuno Das Neves
2025-06-11 23:06   ` Michael Kelley
2025-06-16 20:18     ` Nuno Das Neves
2025-06-10 23:52 ` [PATCH 3/4] x86: hyperv: Expose hv_map_msi_interrupt function Nuno Das Neves
2025-06-11 23:07   ` Michael Kelley
2025-06-18 21:08     ` Nuno Das Neves
2025-06-19 22:02       ` Michael Kelley
2025-06-20 16:19       ` Thomas Gleixner
2025-06-23 22:13         ` Nuno Das Neves
2025-06-10 23:52 ` [PATCH 4/4] PCI: hv: Use the correct hypercall for unmasking interrupts on nested Nuno Das Neves
2025-06-11 23:07   ` Michael Kelley
2025-06-18 21:24     ` Nuno Das Neves
2025-06-13 19:36   ` Bjorn Helgaas
2025-06-18 22:45     ` Nuno Das Neves
2025-06-11 17:55 ` [PATCH 0/4] Nested virtualization fixes for root partition Roman Kisel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).