[PATCH v5 00/10] Introduce /dev/mshv root partition driver


* [PATCH v5 00/10] Introduce /dev/mshv root partition driver
@ 2025-02-26 23:07 Nuno Das Neves
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
                   ` (9 more replies)
  0 siblings, 10 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:07 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

This series introduces support for creating and running guest virtual
machines while running on the Microsoft Hypervisor[0] as root partition.
This is done via an IOCTL interface accessed through /dev/mshv, similar to
/dev/kvm. Another series introducing this support was previously posted in
2021[1], and v4 of this series was last posted in 2023[2].

Patches 1-4 are small refactors and additions to Hyper-V code.
Patches 5-6 just export some definitions needed by /dev/mshv.
Patches 7-9 introduce some functionality and definitions in common code that
are needed by the driver.
Patch 10 contains the driver code.

-----------------
[0] "Hyper-V" is better known, but it really refers to the whole stack
    including the hypervisor and other components that run in Windows
    kernel and userspace.
[1] Previous /dev/mshv patch series (2021) and discussion:
https://lore.kernel.org/linux-hyperv/1632853875-20261-1-git-send-email-nunodasneves@linux.microsoft.com/
[2] v4 (2023):
https://lore.kernel.org/linux-hyperv/1696010501-24584-1-git-send-email-nunodasneves@linux.microsoft.com/

-----------------
Changes since v4:
* Slim down the IOCTL interface significantly, via several means:
  1. Use generic "passthrough" call MSHV_ROOT_HVCALL to replace many ioctls.
  2. Use MSHV_* versions of some of the HV_* definitions.
  3. Move hv headers out of uapi altogether, into include/hyperv/, see:
https://lore.kernel.org/linux-hyperv/1732577084-2122-1-git-send-email-nunodasneves@linux.microsoft.com/
* Remove the mshv_vtl module altogether; it will be posted in a follow-up series
  * Also remove the parent "mshv" module which didn't serve much purpose
* Update and refactor parts of the driver code for clarity, extensibility

Changes since v3 (summarized):
* Clean up the error and debug logging:
  1. Add a set of macros vp_*() and partition_*() which call the equivalent
     dev_*(), passing the device from the partition struct
     * The new macros also print the partition and vp ids to aid debugging
       and reduce repeated code
  2. Use dev_*() (mostly via the new macros) instead of pr_*() *almost*
     everywhere - in interrupt context we can't always get the device struct
  3. Remove pr_*() logging from hv_call.c and mshv_root_hv_call.c

Changes since v2 (summarized):
* Fix many checkpatch.pl --strict style issues
* Initialize status in get/set registers hypercall helpers
* Add missing return on error in get_vp_signaled_count

Changes since v1 (summarized):
* Clean up formatting, commit messages

Nuno Das Neves (9):
  hyperv: Convert Hyper-V status codes to strings
  arm64/hyperv: Add some missing functions to arm64
  hyperv: Introduce hv_recommend_using_aeoi()
  acpi: numa: Export node_to_pxm()
  Drivers/hv: Export some functions for use by root partition module
  Drivers: hv: Introduce per-cpu event ring tail
  x86: hyperv: Add mshv_handler irq handler and setup function
  hyperv: Add definitions for root partition driver to hv headers
  Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs

Stanislav Kinsburskii (1):
  x86/mshyperv: Add support for extended Hyper-V features

 .../userspace-api/ioctl/ioctl-number.rst      |    2 +
 arch/arm64/hyperv/hv_core.c                   |   17 +
 arch/arm64/hyperv/mshyperv.c                  |    1 +
 arch/arm64/include/asm/mshyperv.h             |   12 +
 arch/x86/kernel/cpu/mshyperv.c                |   16 +-
 drivers/acpi/numa/srat.c                      |    1 +
 drivers/hv/Makefile                           |    5 +-
 drivers/hv/hv.c                               |   12 +-
 drivers/hv/hv_common.c                        |  105 +-
 drivers/hv/hv_proc.c                          |   16 +-
 drivers/hv/mshv.h                             |   30 +
 drivers/hv/mshv_common.c                      |  161 ++
 drivers/hv/mshv_eventfd.c                     |  833 ++++++
 drivers/hv/mshv_eventfd.h                     |   71 +
 drivers/hv/mshv_irq.c                         |  128 +
 drivers/hv/mshv_portid_table.c                |   84 +
 drivers/hv/mshv_root.h                        |  321 +++
 drivers/hv/mshv_root_hv_call.c                |  876 +++++++
 drivers/hv/mshv_root_main.c                   | 2329 +++++++++++++++++
 drivers/hv/mshv_synic.c                       |  665 +++++
 include/asm-generic/mshyperv.h                |   18 +
 include/hyperv/hvgdk_mini.h                   |   64 +-
 include/hyperv/hvhdk.h                        |  132 +-
 include/hyperv/hvhdk_mini.h                   |   91 +
 include/uapi/linux/mshv.h                     |  287 ++
 25 files changed, 6248 insertions(+), 29 deletions(-)
 create mode 100644 drivers/hv/mshv.h
 create mode 100644 drivers/hv/mshv_common.c
 create mode 100644 drivers/hv/mshv_eventfd.c
 create mode 100644 drivers/hv/mshv_eventfd.h
 create mode 100644 drivers/hv/mshv_irq.c
 create mode 100644 drivers/hv/mshv_portid_table.c
 create mode 100644 drivers/hv/mshv_root.h
 create mode 100644 drivers/hv/mshv_root_hv_call.c
 create mode 100644 drivers/hv/mshv_root_main.c
 create mode 100644 drivers/hv/mshv_synic.c
 create mode 100644 include/uapi/linux/mshv.h

-- 
2.34.1



* [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
@ 2025-02-26 23:07 ` Nuno Das Neves
  2025-02-26 23:26   ` Stanislav Kinsburskii
                     ` (3 more replies)
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
                   ` (8 subsequent siblings)
  9 siblings, 4 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:07 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

Introduce hv_result_to_string() for this purpose. This allows
hypercall failures to be debugged more easily with dmesg.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
 drivers/hv/hv_proc.c           | 13 ++++---
 include/asm-generic/mshyperv.h |  1 +
 3 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 9804adb4cc56..ce20818688fe 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
 			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
 	}
 }
+
+const char *hv_result_to_string(u64 hv_status)
+{
+	switch (hv_result(hv_status)) {
+	case HV_STATUS_SUCCESS:
+		return "HV_STATUS_SUCCESS";
+	case HV_STATUS_INVALID_HYPERCALL_CODE:
+		return "HV_STATUS_INVALID_HYPERCALL_CODE";
+	case HV_STATUS_INVALID_HYPERCALL_INPUT:
+		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
+	case HV_STATUS_INVALID_ALIGNMENT:
+		return "HV_STATUS_INVALID_ALIGNMENT";
+	case HV_STATUS_INVALID_PARAMETER:
+		return "HV_STATUS_INVALID_PARAMETER";
+	case HV_STATUS_ACCESS_DENIED:
+		return "HV_STATUS_ACCESS_DENIED";
+	case HV_STATUS_INVALID_PARTITION_STATE:
+		return "HV_STATUS_INVALID_PARTITION_STATE";
+	case HV_STATUS_OPERATION_DENIED:
+		return "HV_STATUS_OPERATION_DENIED";
+	case HV_STATUS_UNKNOWN_PROPERTY:
+		return "HV_STATUS_UNKNOWN_PROPERTY";
+	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
+		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
+	case HV_STATUS_INSUFFICIENT_MEMORY:
+		return "HV_STATUS_INSUFFICIENT_MEMORY";
+	case HV_STATUS_INVALID_PARTITION_ID:
+		return "HV_STATUS_INVALID_PARTITION_ID";
+	case HV_STATUS_INVALID_VP_INDEX:
+		return "HV_STATUS_INVALID_VP_INDEX";
+	case HV_STATUS_NOT_FOUND:
+		return "HV_STATUS_NOT_FOUND";
+	case HV_STATUS_INVALID_PORT_ID:
+		return "HV_STATUS_INVALID_PORT_ID";
+	case HV_STATUS_INVALID_CONNECTION_ID:
+		return "HV_STATUS_INVALID_CONNECTION_ID";
+	case HV_STATUS_INSUFFICIENT_BUFFERS:
+		return "HV_STATUS_INSUFFICIENT_BUFFERS";
+	case HV_STATUS_NOT_ACKNOWLEDGED:
+		return "HV_STATUS_NOT_ACKNOWLEDGED";
+	case HV_STATUS_INVALID_VP_STATE:
+		return "HV_STATUS_INVALID_VP_STATE";
+	case HV_STATUS_NO_RESOURCES:
+		return "HV_STATUS_NO_RESOURCES";
+	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
+		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
+	case HV_STATUS_INVALID_LP_INDEX:
+		return "HV_STATUS_INVALID_LP_INDEX";
+	case HV_STATUS_INVALID_REGISTER_VALUE:
+		return "HV_STATUS_INVALID_REGISTER_VALUE";
+	case HV_STATUS_OPERATION_FAILED:
+		return "HV_STATUS_OPERATION_FAILED";
+	case HV_STATUS_TIME_OUT:
+		return "HV_STATUS_TIME_OUT";
+	case HV_STATUS_CALL_PENDING:
+		return "HV_STATUS_CALL_PENDING";
+	case HV_STATUS_VTL_ALREADY_ENABLED:
+		return "HV_STATUS_VTL_ALREADY_ENABLED";
+	default:
+		return "Unknown";
+	};
+	return "Unknown";
+}
+EXPORT_SYMBOL_GPL(hv_result_to_string);
+
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 2fae18e4f7d2..8fc30f509fa7 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 				     page_count, 0, input_page, NULL);
 	local_irq_restore(flags);
 	if (!hv_result_success(status)) {
-		pr_err("Failed to deposit pages: %lld\n", status);
+		pr_err("%s: Failed to deposit pages: %s\n", __func__,
+		       hv_result_to_string(status));
 		ret = hv_result_to_errno(status);
 		goto err_free_allocations;
 	}
@@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 
 		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (!hv_result_success(status)) {
-				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
-				       lp_index, apic_id, status);
+				pr_err("%s: cpu %u apic ID %u, %s\n",
+				       __func__, lp_index, apic_id,
+				       hv_result_to_string(status));
 				ret = hv_result_to_errno(status);
 			}
 			break;
@@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 
 		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (!hv_result_success(status)) {
-				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
-				       vp_index, flags, status);
+				pr_err("%s: vcpu %u, lp %u, %s\n",
+				       __func__, vp_index, flags,
+				       hv_result_to_string(status));
 				ret = hv_result_to_errno(status);
 			}
 			break;
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index b13b0cda4ac8..dc4729dba9ef 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
 	return __cpumask_to_vpset(vpset, cpus, func);
 }
 
+const char *hv_result_to_string(u64 hv_status);
 int hv_result_to_errno(u64 status);
 void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
 bool hv_is_hyperv_initialized(void);
-- 
2.34.1
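
For readers following along outside a kernel tree, the lookup added by this
patch can be modelled in plain C roughly as follows. This is a minimal sketch,
not the patch's code: the demo_ names are hypothetical, only three status codes
are listed, their numeric values are assumed here for illustration (the
authoritative definitions live in include/hyperv/hvgdk_mini.h), and the 16-bit
result mask is likewise an assumption about how hv_result() extracts the status.

```c
#include <stdint.h>

/* Assumed for illustration: the result code lives in the low 16 bits. */
#define DEMO_HV_RESULT_MASK 0xFFFFULL

/* Illustrative subset; the values are assumptions, see hvgdk_mini.h. */
#define DEMO_HV_STATUS_SUCCESS             0x0
#define DEMO_HV_STATUS_INVALID_PARAMETER   0x5
#define DEMO_HV_STATUS_INSUFFICIENT_MEMORY 0xB

/* Strip everything except the result code from a raw hypercall status. */
static inline uint64_t demo_hv_result(uint64_t status)
{
	return status & DEMO_HV_RESULT_MASK;
}

/* Map a raw hypercall status to a printable name. */
const char *demo_hv_result_to_string(uint64_t hv_status)
{
	switch (demo_hv_result(hv_status)) {
	case DEMO_HV_STATUS_SUCCESS:
		return "HV_STATUS_SUCCESS";
	case DEMO_HV_STATUS_INVALID_PARAMETER:
		return "HV_STATUS_INVALID_PARAMETER";
	case DEMO_HV_STATUS_INSUFFICIENT_MEMORY:
		return "HV_STATUS_INSUFFICIENT_MEMORY";
	default:
		return "Unknown";
	}
}
```

Because the status is masked first, any bits above the result field in the raw
return value do not affect which string is chosen.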



* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
@ 2025-02-26 23:26   ` Stanislav Kinsburskii
  2025-02-27  4:22   ` Easwar Hariharan
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:26 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:07:55PM -0800, Nuno Das Neves wrote:
> Introduce hv_result_to_string() for this purpose. This allows
> hypercall failures to be debugged more easily with dmesg.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
>  drivers/hv/hv_proc.c           | 13 ++++---
>  include/asm-generic/mshyperv.h |  1 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
> 

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>


* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
  2025-02-26 23:26   ` Stanislav Kinsburskii
@ 2025-02-27  4:22   ` Easwar Hariharan
  2025-02-27 23:48     ` Nuno Das Neves
  2025-02-27 17:02   ` Roman Kisel
  2025-03-06 17:57   ` Michael Kelley
  3 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27  4:22 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> Introduce hv_result_to_string() for this purpose. This allows
> hypercall failures to be debugged more easily with dmesg.
> 

Let the commit message stand on its own, i.e. state that hv_result_to_string()
is introduced to convert hyper-v status codes to string.

> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
>  drivers/hv/hv_proc.c           | 13 ++++---
>  include/asm-generic/mshyperv.h |  1 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 9804adb4cc56..ce20818688fe 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
>  			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
>  	}
>  }
> +
> +const char *hv_result_to_string(u64 hv_status)
> +{
> +	switch (hv_result(hv_status)) {
> +	case HV_STATUS_SUCCESS:
> +		return "HV_STATUS_SUCCESS";
> +	case HV_STATUS_INVALID_HYPERCALL_CODE:
> +		return "HV_STATUS_INVALID_HYPERCALL_CODE";
> +	case HV_STATUS_INVALID_HYPERCALL_INPUT:
> +		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
> +	case HV_STATUS_INVALID_ALIGNMENT:
> +		return "HV_STATUS_INVALID_ALIGNMENT";
> +	case HV_STATUS_INVALID_PARAMETER:
> +		return "HV_STATUS_INVALID_PARAMETER";
> +	case HV_STATUS_ACCESS_DENIED:
> +		return "HV_STATUS_ACCESS_DENIED";
> +	case HV_STATUS_INVALID_PARTITION_STATE:
> +		return "HV_STATUS_INVALID_PARTITION_STATE";
> +	case HV_STATUS_OPERATION_DENIED:
> +		return "HV_STATUS_OPERATION_DENIED";
> +	case HV_STATUS_UNKNOWN_PROPERTY:
> +		return "HV_STATUS_UNKNOWN_PROPERTY";
> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
> +		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
> +	case HV_STATUS_INSUFFICIENT_MEMORY:
> +		return "HV_STATUS_INSUFFICIENT_MEMORY";
> +	case HV_STATUS_INVALID_PARTITION_ID:
> +		return "HV_STATUS_INVALID_PARTITION_ID";
> +	case HV_STATUS_INVALID_VP_INDEX:
> +		return "HV_STATUS_INVALID_VP_INDEX";
> +	case HV_STATUS_NOT_FOUND:
> +		return "HV_STATUS_NOT_FOUND";
> +	case HV_STATUS_INVALID_PORT_ID:
> +		return "HV_STATUS_INVALID_PORT_ID";
> +	case HV_STATUS_INVALID_CONNECTION_ID:
> +		return "HV_STATUS_INVALID_CONNECTION_ID";
> +	case HV_STATUS_INSUFFICIENT_BUFFERS:
> +		return "HV_STATUS_INSUFFICIENT_BUFFERS";
> +	case HV_STATUS_NOT_ACKNOWLEDGED:
> +		return "HV_STATUS_NOT_ACKNOWLEDGED";
> +	case HV_STATUS_INVALID_VP_STATE:
> +		return "HV_STATUS_INVALID_VP_STATE";
> +	case HV_STATUS_NO_RESOURCES:
> +		return "HV_STATUS_NO_RESOURCES";
> +	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
> +		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
> +	case HV_STATUS_INVALID_LP_INDEX:
> +		return "HV_STATUS_INVALID_LP_INDEX";
> +	case HV_STATUS_INVALID_REGISTER_VALUE:
> +		return "HV_STATUS_INVALID_REGISTER_VALUE";
> +	case HV_STATUS_OPERATION_FAILED:
> +		return "HV_STATUS_OPERATION_FAILED";
> +	case HV_STATUS_TIME_OUT:
> +		return "HV_STATUS_TIME_OUT";
> +	case HV_STATUS_CALL_PENDING:
> +		return "HV_STATUS_CALL_PENDING";
> +	case HV_STATUS_VTL_ALREADY_ENABLED:
> +		return "HV_STATUS_VTL_ALREADY_ENABLED";
> +	default:
> +		return "Unknown";
> +	};
> +	return "Unknown";

Unnecessary extra return since the default case already returns "Unknown"

> +}
> +EXPORT_SYMBOL_GPL(hv_result_to_string);
> +

Extra line here ^

> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 2fae18e4f7d2..8fc30f509fa7 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>  				     page_count, 0, input_page, NULL);
>  	local_irq_restore(flags);
>  	if (!hv_result_success(status)) {
> -		pr_err("Failed to deposit pages: %lld\n", status);
> +		pr_err("%s: Failed to deposit pages: %s\n", __func__,
> +		       hv_result_to_string(status));
>  		ret = hv_result_to_errno(status);
>  		goto err_free_allocations;
>  	}
> @@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>  
>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>  			if (!hv_result_success(status)) {
> -				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
> -				       lp_index, apic_id, status);
> +				pr_err("%s: cpu %u apic ID %u, %s\n",
> +				       __func__, lp_index, apic_id,
> +				       hv_result_to_string(status));
>  				ret = hv_result_to_errno(status);
>  			}
>  			break;
> @@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>  
>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>  			if (!hv_result_success(status)) {
> -				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
> -				       vp_index, flags, status);
> +				pr_err("%s: vcpu %u, lp %u, %s\n",
> +				       __func__, vp_index, flags,
> +				       hv_result_to_string(status));
>  				ret = hv_result_to_errno(status);
>  			}
>  			break;

There are more convertible instances in arch/x86/hyperv/irqdomain.c and drivers/iommu/hyperv-iommu.c

> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index b13b0cda4ac8..dc4729dba9ef 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
>  	return __cpumask_to_vpset(vpset, cpus, func);
>  }
>  
> +const char *hv_result_to_string(u64 hv_status);
>  int hv_result_to_errno(u64 status);
>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>  bool hv_is_hyperv_initialized(void);



* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27  4:22   ` Easwar Hariharan
@ 2025-02-27 23:48     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-27 23:48 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 8:22 PM, Easwar Hariharan wrote:
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>> Introduce hv_result_to_string() for this purpose. This allows
>> hypercall failures to be debugged more easily with dmesg.
>>
> 
> Let the commit message stand on its own, i.e. state that hv_result_to_string()
> is introduced to convert hyper-v status codes to string.
> 
I thought since the subject line is part of the commit message, this kind of
phrasing is ok. However I see that in my email client it is a little odd because
the subject line is a bit far removed from the rest of the message.

I'll change it :)

>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
>>  drivers/hv/hv_proc.c           | 13 ++++---
>>  include/asm-generic/mshyperv.h |  1 +
>>  3 files changed, 74 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 9804adb4cc56..ce20818688fe 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
>>  			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
>>  	}
>>  }
>> +
>> +const char *hv_result_to_string(u64 hv_status)
>> +{
>> +	switch (hv_result(hv_status)) {
>> +	case HV_STATUS_SUCCESS:
>> +		return "HV_STATUS_SUCCESS";
>> +	case HV_STATUS_INVALID_HYPERCALL_CODE:
>> +		return "HV_STATUS_INVALID_HYPERCALL_CODE";
>> +	case HV_STATUS_INVALID_HYPERCALL_INPUT:
>> +		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
>> +	case HV_STATUS_INVALID_ALIGNMENT:
>> +		return "HV_STATUS_INVALID_ALIGNMENT";
>> +	case HV_STATUS_INVALID_PARAMETER:
>> +		return "HV_STATUS_INVALID_PARAMETER";
>> +	case HV_STATUS_ACCESS_DENIED:
>> +		return "HV_STATUS_ACCESS_DENIED";
>> +	case HV_STATUS_INVALID_PARTITION_STATE:
>> +		return "HV_STATUS_INVALID_PARTITION_STATE";
>> +	case HV_STATUS_OPERATION_DENIED:
>> +		return "HV_STATUS_OPERATION_DENIED";
>> +	case HV_STATUS_UNKNOWN_PROPERTY:
>> +		return "HV_STATUS_UNKNOWN_PROPERTY";
>> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
>> +		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
>> +	case HV_STATUS_INSUFFICIENT_MEMORY:
>> +		return "HV_STATUS_INSUFFICIENT_MEMORY";
>> +	case HV_STATUS_INVALID_PARTITION_ID:
>> +		return "HV_STATUS_INVALID_PARTITION_ID";
>> +	case HV_STATUS_INVALID_VP_INDEX:
>> +		return "HV_STATUS_INVALID_VP_INDEX";
>> +	case HV_STATUS_NOT_FOUND:
>> +		return "HV_STATUS_NOT_FOUND";
>> +	case HV_STATUS_INVALID_PORT_ID:
>> +		return "HV_STATUS_INVALID_PORT_ID";
>> +	case HV_STATUS_INVALID_CONNECTION_ID:
>> +		return "HV_STATUS_INVALID_CONNECTION_ID";
>> +	case HV_STATUS_INSUFFICIENT_BUFFERS:
>> +		return "HV_STATUS_INSUFFICIENT_BUFFERS";
>> +	case HV_STATUS_NOT_ACKNOWLEDGED:
>> +		return "HV_STATUS_NOT_ACKNOWLEDGED";
>> +	case HV_STATUS_INVALID_VP_STATE:
>> +		return "HV_STATUS_INVALID_VP_STATE";
>> +	case HV_STATUS_NO_RESOURCES:
>> +		return "HV_STATUS_NO_RESOURCES";
>> +	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
>> +		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
>> +	case HV_STATUS_INVALID_LP_INDEX:
>> +		return "HV_STATUS_INVALID_LP_INDEX";
>> +	case HV_STATUS_INVALID_REGISTER_VALUE:
>> +		return "HV_STATUS_INVALID_REGISTER_VALUE";
>> +	case HV_STATUS_OPERATION_FAILED:
>> +		return "HV_STATUS_OPERATION_FAILED";
>> +	case HV_STATUS_TIME_OUT:
>> +		return "HV_STATUS_TIME_OUT";
>> +	case HV_STATUS_CALL_PENDING:
>> +		return "HV_STATUS_CALL_PENDING";
>> +	case HV_STATUS_VTL_ALREADY_ENABLED:
>> +		return "HV_STATUS_VTL_ALREADY_ENABLED";
>> +	default:
>> +		return "Unknown";
>> +	};
>> +	return "Unknown";
> 
> Unnecessary extra return since the default case already returns "Unknown"
> 
Good point, I think I'd prefer to remove the first return and leave the
default case empty.

>> +}
>> +EXPORT_SYMBOL_GPL(hv_result_to_string);
>> +
> 
> Extra line here ^
> 
Thanks

>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>> index 2fae18e4f7d2..8fc30f509fa7 100644
>> --- a/drivers/hv/hv_proc.c
>> +++ b/drivers/hv/hv_proc.c
>> @@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>  				     page_count, 0, input_page, NULL);
>>  	local_irq_restore(flags);
>>  	if (!hv_result_success(status)) {
>> -		pr_err("Failed to deposit pages: %lld\n", status);
>> +		pr_err("%s: Failed to deposit pages: %s\n", __func__,
>> +		       hv_result_to_string(status));
>>  		ret = hv_result_to_errno(status);
>>  		goto err_free_allocations;
>>  	}
>> @@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>  
>>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>>  			if (!hv_result_success(status)) {
>> -				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
>> -				       lp_index, apic_id, status);
>> +				pr_err("%s: cpu %u apic ID %u, %s\n",
>> +				       __func__, lp_index, apic_id,
>> +				       hv_result_to_string(status));
>>  				ret = hv_result_to_errno(status);
>>  			}
>>  			break;
>> @@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>  
>>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>>  			if (!hv_result_success(status)) {
>> -				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
>> -				       vp_index, flags, status);
>> +				pr_err("%s: vcpu %u, lp %u, %s\n",
>> +				       __func__, vp_index, flags,
>> +				       hv_result_to_string(status));
>>  				ret = hv_result_to_errno(status);
>>  			}
>>  			break;
> 
> There are more convertible instances in arch/x86/hyperv/irqdomain.c and drivers/iommu/hyperv-iommu.c
> 
Ah, thank you, happy to add those!

>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index b13b0cda4ac8..dc4729dba9ef 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
>>  	return __cpumask_to_vpset(vpset, cpus, func);
>>  }
>>  
>> +const char *hv_result_to_string(u64 hv_status);
>>  int hv_result_to_errno(u64 status);
>>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>>  bool hv_is_hyperv_initialized(void);
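
The shape described in the reply above (drop the return from the default case
and keep the single return after the switch) might look roughly like this. It
is a minimal sketch with hypothetical demo_ names and only two assumed status
values, not the code from a later revision of the patch:

```c
#include <stdint.h>

#define DEMO_HV_STATUS_SUCCESS       0x0 /* assumed value */
#define DEMO_HV_STATUS_ACCESS_DENIED 0x6 /* assumed value */

const char *demo_result_name(uint64_t hv_status)
{
	/* Assumption: the result code is the low 16 bits of the status. */
	switch (hv_status & 0xFFFFULL) {
	case DEMO_HV_STATUS_SUCCESS:
		return "HV_STATUS_SUCCESS";
	case DEMO_HV_STATUS_ACCESS_DENIED:
		return "HV_STATUS_ACCESS_DENIED";
	default:
		break;
	}

	/* Single fall-through return for anything not listed above. */
	return "Unknown";
}
```

With an empty default, the function has exactly one place that produces
"Unknown", which is the redundancy the review comment pointed out.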



* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
  2025-02-26 23:26   ` Stanislav Kinsburskii
  2025-02-27  4:22   ` Easwar Hariharan
@ 2025-02-27 17:02   ` Roman Kisel
  2025-02-27 22:54     ` Easwar Hariharan
  2025-02-28  0:15     ` Nuno Das Neves
  2025-03-06 17:57   ` Michael Kelley
  3 siblings, 2 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 17:02 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 2/26/2025 3:07 PM, Nuno Das Neves wrote:

[...]

> +
> +const char *hv_result_to_string(u64 hv_status)
> +{
> +	switch (hv_result(hv_status)) {

[...]

> +		return "HV_STATUS_VTL_ALREADY_ENABLED";
> +	default:
> +		return "Unknown";
> +	};
> +	return "Unknown";
> +}
> +EXPORT_SYMBOL_GPL(hv_result_to_string);

Should we remove this and output the hexadecimal error code in ~3 places
this function is used?

The "Unknown" part would make debugging harder actually when something
fails. I presume that the mainstream scenarios all work, and it is the
edge cases that might fail, and these are likelier to produce "Unknown".

Folks who actually debug failed hypercalls rarely have issues with
looking up the error code, and printing "Unknown" to the log is worse
than a hexadecimal. Like even the people who wrote the code got nothing
to say about what is going on.

-- 
Thank you,
Roman
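
One way to picture the concern raised here is a helper that never discards the
numeric code: it prints the symbolic name when one is known and the raw
hexadecimal value otherwise. This is only an illustrative sketch with
hypothetical demo_ names and assumed status values; it is neither the posted
patch nor something the thread agreed on.

```c
#include <stdint.h>
#include <stdio.h>

#define DEMO_HV_STATUS_SUCCESS           0x0 /* assumed value */
#define DEMO_HV_STATUS_INVALID_PARAMETER 0x5 /* assumed value */

/* Returns the symbolic name, or NULL when the code is not in the table. */
static const char *demo_known_name(uint64_t code)
{
	switch (code) {
	case DEMO_HV_STATUS_SUCCESS:
		return "HV_STATUS_SUCCESS";
	case DEMO_HV_STATUS_INVALID_PARAMETER:
		return "HV_STATUS_INVALID_PARAMETER";
	default:
		return NULL;
	}
}

/*
 * Writes a description of hv_status into buf (at most len bytes,
 * NUL-terminated when len > 0) and returns buf.
 */
char *demo_describe_status(uint64_t hv_status, char *buf, size_t len)
{
	/* Assumption: the result code is the low 16 bits of the status. */
	uint64_t code = hv_status & 0xFFFFULL;
	const char *name = demo_known_name(code);

	if (name)
		snprintf(buf, len, "%s (0x%llx)", name,
			 (unsigned long long)code);
	else
		snprintf(buf, len, "0x%llx", (unsigned long long)code);
	return buf;
}
```

A caller would pass a small stack buffer, for example
demo_describe_status(status, buf, sizeof(buf)), and log the result; an
unlisted code then still shows up as a number that can be looked up.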


* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27 17:02   ` Roman Kisel
@ 2025-02-27 22:54     ` Easwar Hariharan
  2025-02-27 23:08       ` Roman Kisel
  2025-02-27 23:21       ` Roman Kisel
  2025-02-28  0:15     ` Nuno Das Neves
  1 sibling, 2 replies; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27 22:54 UTC (permalink / raw)
  To: Roman Kisel
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, eahariha, kys, haiyangz, wei.liu,
	mhklinux, decui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, hpa, daniel.lezcano, joro, robin.murphy, arnd,
	jinankjain, muminulrussell, skinsburskii, mrathor, ssengar, apais,
	Tianyu.Lan, stanislav.kinsburskiy, gregkh, vkuznets, prapal,
	muislam, anrayabh, rafael, lenb, corbet

On 2/27/2025 9:02 AM, Roman Kisel wrote:
> 
> 
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> 
> [...]
> 
>> +
>> +const char *hv_result_to_string(u64 hv_status)
>> +{
>> +    switch (hv_result(hv_status)) {
> 
> [...]
> 
>> +        return "HV_STATUS_VTL_ALREADY_ENABLED";
>> +    default:
>> +        return "Unknown";
>> +    };
>> +    return "Unknown";
>> +}
>> +EXPORT_SYMBOL_GPL(hv_result_to_string);
> 
> Should we remove this and output the hexadecimal error code in ~3 places
> this function is used?
> 
> The "Unknown" part would make debugging harder actually when something
> fails. I presume that the mainstream scenarios all work, and it is the
> edge cases that might fail, and these are likelier to produce "Unknown".
> 
> Folks who actually debug failed hypercalls rarely have issues with
> looking up the error code, and printing "Unknown" to the log is worse
> than a hexadecimal. Like even the people who wrote the code got nothing
> to say about what is going on.
> 

Sorry, I have to disagree with this, a recent commit of mine[1] closed a WSL
issue that was open for over 2 years for, partly, the utter uselessness of
the hex return code of the hypercall.

[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2138eab8cde61e0e6f62d0713e45202e8457d6d

Thanks,
Easwar (he/him)


* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27 22:54     ` Easwar Hariharan
@ 2025-02-27 23:08       ` Roman Kisel
  2025-02-27 23:25         ` Easwar Hariharan
  2025-02-27 23:21       ` Roman Kisel
  1 sibling, 1 reply; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 23:08 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet



On 2/27/2025 2:54 PM, Easwar Hariharan wrote:
[...]

> 
> Sorry, I have to disagree with this, a recent commit of mine[1] closed a WSL
> issue that was open for over 2 years for, partly, the utter uselessness of
> the hex return code of the hypercall.

Thanks for your efforts, and sorry to hear you had a frustrating
debugging experience (sounds like it).

Would be great to learn the details to understand how this function is
going to improve the situation:

1. How come the hex error code was useless, was it not matching
    anything in the Linux headers?
2. How having "Unknown" in the log can possibly be better?
3. Given that the select hv status codes and the proposed strings have
    1:1 correspondence, and there is the 1:N catch-all case for the
    "Unknown", how's that better?

> 
> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2138eab8cde61e0e6f62d0713e45202e8457d6d
> 
> Thanks,
> Easwar (he/him)

-- 
Thank you,
Roman


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27 23:08       ` Roman Kisel
@ 2025-02-27 23:25         ` Easwar Hariharan
  2025-02-28 17:20           ` Roman Kisel
  0 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27 23:25 UTC (permalink / raw)
  To: Roman Kisel
  Cc: eahariha, Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel,
	linux-kernel, linux-arch, linux-acpi, kys, haiyangz, wei.liu,
	mhklinux, decui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, hpa, daniel.lezcano, joro, robin.murphy, arnd,
	jinankjain, muminulrussell, skinsburskii, mrathor, ssengar, apais,
	Tianyu.Lan, stanislav.kinsburskiy, gregkh, vkuznets, prapal,
	muislam, anrayabh, rafael, lenb, corbet

On 2/27/2025 3:08 PM, Roman Kisel wrote:
> 
> 
> On 2/27/2025 2:54 PM, Easwar Hariharan wrote:
> [...]
> 
>>
>> Sorry, I have to disagree with this, a recent commit of mine[1] closed a WSL
>> issue that was open for over 2 years for, partly, the utter uselessness of
>> the hex return code of the hypercall.
> 
> Thanks for your efforts, and sorry to hear you had a frustrating
> debugging experience (sounds like it).

TBF, I didn't personally struggle with it for 2 years, IMHO, it was the opaqueness
of what the value meant that contributed to user pain.

> 
> Would be great to learn the details to understand how this function is
> going to improve the situation:
> 
> 1. How come the hex error code was useless, what is not matching
>    anything in the Linux headers?

It doesn't match anything in the Linux headers, but it's an NTSTATUS, not HVSTATUS.

Coming from the PoV of a user, it would be a much more useful message to see:

[  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv STATUS_UNSUCCESSFUL

than 

[  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

> 2. How having "Unknown" in the log can possibly be better?

IMHO, seeing "Unknown" in an error report means that there's a new return value
that needs to be mapped to errno in hv_status_to_errno() and updated here as well.

> 3. Given that the select hv status codes and the proposed strings have
>    1:1 correspondence, and there is the 1:N catch-all case for the
>    "Unknown", how's that better?
> 

I didn't really follow this question, but I suppose the answer to Q2 answers this as
well. If not, please expand and I'll try to answer.

Thanks,
Easwar (he/him)

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27 23:25         ` Easwar Hariharan
@ 2025-02-28 17:20           ` Roman Kisel
  2025-02-28 20:22             ` Easwar Hariharan
  0 siblings, 1 reply; 108+ messages in thread
From: Roman Kisel @ 2025-02-28 17:20 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/27/2025 3:25 PM, Easwar Hariharan wrote:
> On 2/27/2025 3:08 PM, Roman Kisel wrote:

[...]

>> Would be great to learn the details to understand how this function is
>> going to improve the situation:
>>
>> 1. How come the hex error code was useless, what is not matching
>>     anything in the Linux headers?
> 
> It doesn't match anything in the Linux headers, but it's an NTSTATUS, not HVSTATUS.
> 

That is what it looks like from the code, I posted the details in the
parallel thread.

Here is a fix:
https://lore.kernel.org/linux-hyperv/20250227233110.36596-1-romank@linux.microsoft.com/

Also I think the commit description in your patch

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2138eab8cde61e0e6f62d0713e45202e8457d6d

conflates the hypervisor (ours runs bare-metal, Type 1) and the VMMs
(Virtual Machine Monitors)+VSPs (Virtual Service Providers, e.g StorVSP
that implements SCSI) running in the host/root/dom0 partition.

> Coming from the PoV of a user, it would be a much more useful message to see:
> 
> [  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv STATUS_UNSUCCESSFUL
> 
> than
> 
> [  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
> 

It is likely that the PoV of a user that you've mentioned is actually
a PoV of a (kernel) developer. It is hard to imagine that folks running
web sites, DB servers, LoBs, LLMs, etc. in Hyper-V VMs care about the
lowest software level of the virt stack in the form of the symbolic
name or the hex code. They need their VMs to be reliable, or the log to
suggest what they may try if a configuration error is suspected.

To make the error log message useful to the user, the message should
mention ways of remediation or at least hint what might've gotten
wedged. Without that, that's only useful for the people who work with
the kernel code proper or the kernel interface to the user land.

So I'd think that the hex error codes from the hypervisor give the user
exactly as much as the error symbolic names do to get the system to the
desired state: nothing. Even less when the error reported "Unknown" :)

>> 2. How having "Unknown" in the log can possibly be better?
> 
> IMHO, seeing "Unknown" in an error report means that there's a new return value
> that needs to be mapped to errno in hv_status_to_errno() and updated here as well.
> 

It means that to the developer. To the user, it means the developers
messed something up and to make matters even worse they didn't leave any
breadcrumbs (e.g. the hex code) to see what's wrong to help the user and
themselves: there is just that "Unknown" thing in the log.

>> 3. Given that the select hv status codes and the proposed strings have
>>     1:1 correspondence, and there is the 1:N catch-all case for the
>>     "Unknown", how's that better?
>>
> 
> I didn't really follow this question, but I suppose the answer to Q2 answers this as
> well. If not, please expand and I'll try to answer.
>

Sorry about that chunk, hit "Send" without looking the e-mail over
another time. Appreciate the discussion very much!

> Thanks,
> Easwar (he/him)

-- 
Thank you,
Roman

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-28 17:20           ` Roman Kisel
@ 2025-02-28 20:22             ` Easwar Hariharan
  2025-02-28 22:26               ` Roman Kisel
  0 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-28 20:22 UTC (permalink / raw)
  To: Roman Kisel
  Cc: eahariha, Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel,
	linux-kernel, linux-arch, linux-acpi, kys, haiyangz, wei.liu,
	mhklinux, decui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, hpa, daniel.lezcano, joro, robin.murphy, arnd,
	jinankjain, muminulrussell, skinsburskii, mrathor, ssengar, apais,
	Tianyu.Lan, stanislav.kinsburskiy, gregkh, vkuznets, prapal,
	muislam, anrayabh, rafael, lenb, corbet

On 2/28/2025 9:20 AM, Roman Kisel wrote:
> 
> 
> On 2/27/2025 3:25 PM, Easwar Hariharan wrote:
>> On 2/27/2025 3:08 PM, Roman Kisel wrote:
> 
> [...]
> 
>>> Would be great to learn the details to understand how this function is
>>> going to improve the situation:
>>>
>>> 1. How come the hex error code was useless, what is not matching
>>>     anything in the Linux headers?
>>
>> It doesn't match anything in the Linux headers, but it's an NTSTATUS, not HVSTATUS.
>>
> 
> That is what it looks like from the code, I posted the details in the
> parallel thread.
> 
> Here is a fix:
> https://lore.kernel.org/linux-hyperv/20250227233110.36596-1-romank@linux.microsoft.com/
> 
> Also I think the commit description in your patch
> 
> https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2138eab8cde61e0e6f62d0713e45202e8457d6d
> 
> conflates the hypervisor (ours runs bare-metal, Type 1) and the VMMs
> (Virtual Machine Monitors)+VSPs (Virtual Service Providers, e.g StorVSP
> that implements SCSI) running in the host/root/dom0 partition.

Agreed, that was what I was led to believe, your patch would help with that
miscommunication, though not in its current form. See my review comment in that
thread.

> 
>> Coming from the PoV of a user, it would be a much more useful message to see:
>>
>> [  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv STATUS_UNSUCCESSFUL
>>
>> than
>>
>> [  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
>>
> 
> It is likely that the PoV of a user that you've mentioned is actually
> a PoV of a (kernel) developer.

Actually, no, it's the PoV of the WSL users that are having the discussion in
the linked github issue. FWIW, that issue also occurred in Azure with multiple
incidents coming into our queue because of the unusable flood of error messages.

> It is hard to imagine that folks running
> web sites, DB servers, LoBs, LLMs, etc. in Hyper-V VMs care about the
> lowest software level of the virt stack in the form of the symbolic
> name or the hex code. They need their VMs to be reliable or suggest
> what the user may try if a configuration error is suspected.
> 
> To make the error log message useful to the user, the message should
> mention ways of remediation or at least hint what might've gotten
> wedged. Without that, that's only useful for the people who work with
> the kernel code proper or the kernel interface to the user land.

There's a step between seeing the issue and fixing it that you're missing,
i.e. the reporting.

An issue that says "flood of hv_storvsc errors reporting status
unsuccessful" is better than the same without that status information:
https://github.com/microsoft/WSL/issues/9173

> 
> So I'd think that the hex error codes from the hypervisor give the user
> exactly as much as the error symbolic names do to get the system to the
> desired state: nothing. 
I continue to disagree, seeing HV_STATUS_NO_RESOURCES is better than 0x1D,
because the user may think to look at `top` or `free -h` or similar to see
what could be killed to improve the situation.

> Even less when the error reported "Unknown" :)

I agree on the uselessness of "Unknown" to the user, except as already mentioned
below, as a prompt for the code to be updated.

> 
>>> 2. How having "Unknown" in the log can possibly be better?
>>
>> IMHO, seeing "Unknown" in an error report means that there's a new return value
>> that needs to be mapped to errno in hv_status_to_errno() and updated here as well.
>>
> 
> It means that to the developer. To the user, it means the developers
> messed something up and to make matters even worse they didn't leave any
> breadcrumbs (e.g. the hex code) to see what's wrong to help the user and
> themselves: there is just that "Unknown" thing in the log.

I think Nuno's compromise addresses this very well, to also print the hex code.

> 
>>> 3. Given that the select hv status codes and the proposed strings have
>>>     1:1 correspondence, and there is the 1:N catch-all case for the
>>>     "Unknown", how's that better?
>>>
>>
>> I didn't really follow this question, but I suppose the answer to Q2 answers this as
>> well. If not, please expand and I'll try to answer.
>>
> 
> Sorry about that chunk, hit "Send" without looking the e-mail over
> another time. Appreciate the discussion very much!
> 
> 
>> Thanks,
>> Easwar (he/him)
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-28 20:22             ` Easwar Hariharan
@ 2025-02-28 22:26               ` Roman Kisel
  0 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-28 22:26 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/28/2025 12:22 PM, Easwar Hariharan wrote:
> On 2/28/2025 9:20 AM, Roman Kisel wrote:
>>

[...]

>>
>> So I'd think that the hex error codes from the hypervisor give the user
>> exactly as much as the error symbolic names do to get the system to the
>> desired state: nothing.
> I continue to disagree, seeing HV_STATUS_NO_RESOURCES is better than 0x1D,
> because the user may think to look at `top` or `free -h` or similar to see
> what could be killed to improve the situation.
> 

I agree that the symbolic name might save the step of looking up the
error code in the headers. Now, the next step depends on how much the
user is into virt technologies (if at all). That is
to illustrate the point that a hint in the logs (or in the
Documentation) about what to do next is crucial.

The symbolic name might mislead; a hex code maybe with an addition of
"please look up what may fix this at <URL> or report the problem here
<URL>" would look better to _my imaginary_ customer :) That would be
as friendly as possible, if the kernel needs to print any of that
at all. Likely the VMM in the user land if it gets that code as-is.

Thank you for the fair critique and the time!

[...]

>>> Thanks,
>>> Easwar (he/him)
>>
> 

-- 
Thank you,
Roman

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27 22:54     ` Easwar Hariharan
  2025-02-27 23:08       ` Roman Kisel
@ 2025-02-27 23:21       ` Roman Kisel
  1 sibling, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 23:21 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet



On 2/27/2025 2:54 PM, Easwar Hariharan wrote:
> On 2/27/2025 9:02 AM, Roman Kisel wrote:

[...]

> 
> Sorry, I have to disagree with this, a recent commit of mine[1] closed a WSL
> issue that was open for over 2 years for, partly, the utter uselessness of
> the hex return code of the hypercall.

What hypercall was that? I see

		storvsc_log_ratelimited(device, loglevel,
			"tag#%d cmd 0x%x status: scsi 0x%x srb 0x%x hv 0x%x\n",
			scsi_cmd_to_rq(request->cmd)->tag,
			stor_pkt->vm_srb.cdb[0],
			vstor_packet->vm_srb.scsi_status,
			vstor_packet->vm_srb.srb_status,
			vstor_packet->status);

in your patch where `vstor_packet->status` is claimed to be a hypercall
status? I'd be surprised if the hypervisor concerned itself with
the details of virtualized SCSI storage. The VMM on the host might and
should.

I'll look through the code to gain more confidence in my suspicion that
calling the SCSI virt storage packet status a hv status caused the
frustration with debugging, and if no counterexamples are found, will send
a patch to fix that log statement above.

> 
> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2138eab8cde61e0e6f62d0713e45202e8457d6d
> 
> Thanks,
> Easwar (he/him)

-- 
Thank you,
Roman


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-27 17:02   ` Roman Kisel
  2025-02-27 22:54     ` Easwar Hariharan
@ 2025-02-28  0:15     ` Nuno Das Neves
  2025-02-28 16:40       ` Roman Kisel
  1 sibling, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-28  0:15 UTC (permalink / raw)
  To: Roman Kisel, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 2/27/2025 9:02 AM, Roman Kisel wrote:
> 
> 
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> 
> [...]
> 
>> +
>> +const char *hv_result_to_string(u64 hv_status)
>> +{
>> +    switch (hv_result(hv_status)) {
> 
> [...]
> 
>> +        return "HV_STATUS_VTL_ALREADY_ENABLED";
>> +    default:
>> +        return "Unknown";
>> +    };
>> +    return "Unknown";
>> +}
>> +EXPORT_SYMBOL_GPL(hv_result_to_string);
> 
> Should we remove this and output the hexadecimal error code in ~3 places
> this function is used?
> 
I guess you're implying it's not worth adding such a function for only a
few places in the code? That is a good point, and a bit of an oversight
on my part while editing this series. Originally all the hypercall helper
functions in the driver code (10+ places) used this function as well, but
I removed those printk()s as a temporary solution to limit the use of
printk in the driver code (as opposed to dev_printk() which is preferred).

I didn't think to remove *this* patch as a result of that change!
I do want to figure out a good way to add that logging back to the hypercall
helpers, so I do want to try and get some form of this patch in to aid
debugging hypercalls - it has been very very useful over time.

> The "Unknown" part would make debugging harder actually when something
> fails. I presume that the mainstream scenarios all work, and it is the
> edge cases that might fail, and these are likelier to produce "Unknown".
> 
That is a very good point. Ideally, we could log "Unknown" along with
the hex code instead of replacing it.

What do you think about keeping this function, but instead of using it
directly, introduce a "standard" way for logging hypercall errors which
can hopefully be used everywhere in the kernel?
e.g. a simple macro:
#define hv_hvcall_err(control, status) \
do { \
	u64 ___status = (status); \
	pr_err("Hypercall: %#x err: %#x : %s\n", (control) & 0xFFFF, hv_result(___status), hv_result_to_string(___status)); \
} while (0)

I feel like this is the best of both worlds, and actually makes it even
easier to do this logging everywhere it is wanted (for me, that includes
all the /dev/mshv-related hypercalls).
We could add strings for the HVCALL_ values too, and/or include __func__
in the macro to aid in finding the context it was used in.
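
To make the intended output concrete, here is a rough user-space sketch of
the same idea. It is not part of the patch: it formats into a buffer instead
of calling pr_err() so it can be compiled outside the kernel, and
demo_result(), demo_result_to_string() and demo_format_hvcall_err() are
hypothetical stand-ins. The status values are an illustrative subset (0x1D
for HV_STATUS_NO_RESOURCES is the value quoted elsewhere in this thread), and
treating the low 16 bits as the result code is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative subset; the real values come from the Hyper-V headers. */
#define DEMO_HV_STATUS_SUCCESS		0x0
#define DEMO_HV_STATUS_NO_RESOURCES	0x1D

/* Assumption: the result code is the low 16 bits of the status. */
static inline uint16_t demo_result(uint64_t status)
{
	return (uint16_t)(status & 0xFFFF);
}

/* Hypothetical stand-in for hv_result_to_string(). */
static const char *demo_result_to_string(uint64_t status)
{
	switch (demo_result(status)) {
	case DEMO_HV_STATUS_SUCCESS:
		return "HV_STATUS_SUCCESS";
	case DEMO_HV_STATUS_NO_RESOURCES:
		return "HV_STATUS_NO_RESOURCES";
	default:
		return "Unknown";
	}
}

/*
 * Always emit the hex code, then the symbolic name, so an unrecognized
 * status still leaves the raw value in the log.
 */
static int demo_format_hvcall_err(char *buf, size_t len,
				  uint64_t control, uint64_t status)
{
	return snprintf(buf, len, "Hypercall: %#x err: %#x : %s",
			(unsigned int)(control & 0xFFFF),
			(unsigned int)demo_result(status),
			demo_result_to_string(status));
}
```

With this shape, an unrecognized status still carries its hex value next to
"Unknown", which is the point of combining the two.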

> Folks who actually debug failed hypercalls rarely have issues with
> looking up the error code, and printing "Unknown" to the log is worse
> than a hexadecimal. Like even the people who wrote the code got nothing
> to say about what is going on.
> 
Yep, totally agree having the hex code available can be valuable in
unexpected situations.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-28  0:15     ` Nuno Das Neves
@ 2025-02-28 16:40       ` Roman Kisel
  0 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-28 16:40 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/27/2025 4:15 PM, Nuno Das Neves wrote:
> On 2/27/2025 9:02 AM, Roman Kisel wrote:

[...]

> I guess you're implying it's not worth adding such a function for only a
> few places in the code? That is a good point, and a bit of an oversight
> on my part while editing this series. Originally all the hypercall helper
> functions in the driver code (10+ places) used this function as well, but
> I removed those printks_()s as a temporary solution to limit the use of
> printk in the driver code (as opposed to dev_printk() which is preferred).
> 
> I didn't think to remove *this* patch as a result of that change!
> I do want to figure out a good way to add that logging back to the hypercall
> helpers, so I do want to try and get some form of this patch in to aid
> debugging hypercalls - it has been very very useful over time.
> 

Right, I thought that the function looked more like a bring-up aid rather
than a full-fledged solution to some problem.

>> The "Unknown" part would make debugging harder actually when something
>> fails. I presume that the mainstream scenarios all work, and it is the
>> edge cases that might fail, and these are likelier to produce "Unknown".
>>
> That is a very good point. Ideally, we could log "Unknown" along with
> the hex code instead of replacing it.
> 
> What do you think about keeping this function, but instead of using it
> directly, introduce a "standard" way for logging hypercall errors which
> can hopefully be used everywhere in the kernel?
> e.g. a simple macro:
> #define hv_hvcall_err(control, status)
> do {
> 	u64 ___status = (status);
> 	pr_err("Hypercall: %#x err: %#x : %s", (control) & 0xFFFF, hv_result(___status), hv_result_to_string(___status));
> } while (0)
> 
> I feel like this is the best of both worlds, and actually makes it even
> easier to do this logging everywhere it is wanted (for me, that includes
> all the /dev/mshv-related hypercalls).
> We could add strings for the HVCALL_ values too, and/or include __func__
> in the macro to aid in finding the context it was used in.
> 

That doesn’t seem to be common in the kernel from what I’ve seen in
dmesg, although there is certainly a lot of appeal in that approach.
However, we will have to remember to update the function each time
another status code is added, so as not to leave things half-cooked.

Also it is a bit surprising the *kernel* should report that rather than 
the VMM from the user mode. E.g. the kernel does not report all errors 
on file open, file seek, etc. As I understand, the hv status codes are
later mapped to errno in a lossy manner, and errno is what the user mode
receives?

As long as the hex code is logged, I am fine with the change.

>> Folks who actually debug failed hypercalls rarely have issues with
>> looking up the error code, and printing "Unknown" to the log is worse
>> than a hexadecimal. Like even the people who wrote the code got nothing
>> to say about what is going on.
>>
> Yep, totally agree having the hex code available can be valuable in
> unexpected situations.
> 

I appreciate you giving my concerns thorough consideration!

-- 
Thank you,
Roman


^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-02-27 17:02   ` Roman Kisel
@ 2025-03-06 17:57   ` Michael Kelley
  2025-03-06 18:09     ` Michael Kelley
  2025-03-07 19:38     ` Nuno Das Neves
  3 siblings, 2 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 17:57 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 
> Introduce hv_result_to_string() for this purpose. This allows
> hypercall failures to be debugged more easily with dmesg.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
>  drivers/hv/hv_proc.c           | 13 ++++---
>  include/asm-generic/mshyperv.h |  1 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 9804adb4cc56..ce20818688fe 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
>  			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
>  	}
>  }
> +
> +const char *hv_result_to_string(u64 hv_status)
> +{
> +	switch (hv_result(hv_status)) {
> +	case HV_STATUS_SUCCESS:
> +		return "HV_STATUS_SUCCESS";
> +	case HV_STATUS_INVALID_HYPERCALL_CODE:
> +		return "HV_STATUS_INVALID_HYPERCALL_CODE";
> +	case HV_STATUS_INVALID_HYPERCALL_INPUT:
> +		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
> +	case HV_STATUS_INVALID_ALIGNMENT:
> +		return "HV_STATUS_INVALID_ALIGNMENT";
> +	case HV_STATUS_INVALID_PARAMETER:
> +		return "HV_STATUS_INVALID_PARAMETER";
> +	case HV_STATUS_ACCESS_DENIED:
> +		return "HV_STATUS_ACCESS_DENIED";
> +	case HV_STATUS_INVALID_PARTITION_STATE:
> +		return "HV_STATUS_INVALID_PARTITION_STATE";
> +	case HV_STATUS_OPERATION_DENIED:
> +		return "HV_STATUS_OPERATION_DENIED";
> +	case HV_STATUS_UNKNOWN_PROPERTY:
> +		return "HV_STATUS_UNKNOWN_PROPERTY";
> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
> +		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
> +	case HV_STATUS_INSUFFICIENT_MEMORY:
> +		return "HV_STATUS_INSUFFICIENT_MEMORY";
> +	case HV_STATUS_INVALID_PARTITION_ID:
> +		return "HV_STATUS_INVALID_PARTITION_ID";
> +	case HV_STATUS_INVALID_VP_INDEX:
> +		return "HV_STATUS_INVALID_VP_INDEX";
> +	case HV_STATUS_NOT_FOUND:
> +		return "HV_STATUS_NOT_FOUND";
> +	case HV_STATUS_INVALID_PORT_ID:
> +		return "HV_STATUS_INVALID_PORT_ID";
> +	case HV_STATUS_INVALID_CONNECTION_ID:
> +		return "HV_STATUS_INVALID_CONNECTION_ID";
> +	case HV_STATUS_INSUFFICIENT_BUFFERS:
> +		return "HV_STATUS_INSUFFICIENT_BUFFERS";
> +	case HV_STATUS_NOT_ACKNOWLEDGED:
> +		return "HV_STATUS_NOT_ACKNOWLEDGED";
> +	case HV_STATUS_INVALID_VP_STATE:
> +		return "HV_STATUS_INVALID_VP_STATE";
> +	case HV_STATUS_NO_RESOURCES:
> +		return "HV_STATUS_NO_RESOURCES";
> +	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
> +		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
> +	case HV_STATUS_INVALID_LP_INDEX:
> +		return "HV_STATUS_INVALID_LP_INDEX";
> +	case HV_STATUS_INVALID_REGISTER_VALUE:
> +		return "HV_STATUS_INVALID_REGISTER_VALUE";
> +	case HV_STATUS_OPERATION_FAILED:
> +		return "HV_STATUS_OPERATION_FAILED";
> +	case HV_STATUS_TIME_OUT:
> +		return "HV_STATUS_TIME_OUT";
> +	case HV_STATUS_CALL_PENDING:
> +		return "HV_STATUS_CALL_PENDING";
> +	case HV_STATUS_VTL_ALREADY_ENABLED:
> +		return "HV_STATUS_VTL_ALREADY_ENABLED";
> +	default:
> +		return "Unknown";
> +	};
> +	return "Unknown";
> +}
> +EXPORT_SYMBOL_GPL(hv_result_to_string);
> +
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 2fae18e4f7d2..8fc30f509fa7 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32
> num_pages)
>  				     page_count, 0, input_page, NULL);
>  	local_irq_restore(flags);
>  	if (!hv_result_success(status)) {
> -		pr_err("Failed to deposit pages: %lld\n", status);
> +		pr_err("%s: Failed to deposit pages: %s\n", __func__,
> +		       hv_result_to_string(status));
>  		ret = hv_result_to_errno(status);
>  		goto err_free_allocations;
>  	}
> @@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
> 
>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>  			if (!hv_result_success(status)) {
> -				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
> -				       lp_index, apic_id, status);
> +				pr_err("%s: cpu %u apic ID %u, %s\n",
> +				       __func__, lp_index, apic_id,
> +				       hv_result_to_string(status));
>  				ret = hv_result_to_errno(status);
>  			}
>  			break;
> @@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index,
> u32 flags)
> 
>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>  			if (!hv_result_success(status)) {
> -				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
> -				       vp_index, flags, status);
> +				pr_err("%s: vcpu %u, lp %u, %s\n",
> +				       __func__, vp_index, flags,
> +				       hv_result_to_string(status));
>  				ret = hv_result_to_errno(status);
>  			}
>  			break;
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index b13b0cda4ac8..dc4729dba9ef 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
>  	return __cpumask_to_vpset(vpset, cpus, func);
>  }
> 
> +const char *hv_result_to_string(u64 hv_status);
>  int hv_result_to_errno(u64 status);
>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>  bool hv_is_hyperv_initialized(void);
> --
> 2.34.1

I've read through the other comments on this patch. I definitely vote
for outputting both the hex code and a string translation, which
could be empty if the hex code is unrecognized by the translation code.

I can see providing something like hv_hvcall_err() as Nuno proposed, since
that standardizes the text output. But I wonder if it would be too limiting.
For example, in the changes above, both hv_call_add_logical_proc() and
hv_call_create_vp() output additional debugging values, which we probably
don't want to give up.
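
One possible way to keep those caller-specific values, sketched in user-space
C purely for illustration: let the helper take the caller's own format string
and arguments, then append the hex code and name in a fixed style. The names
demo_hvcall_err() and demo_result_to_string() are hypothetical, the status
values are an illustrative subset, and a kernel version would presumably wrap
pr_err() rather than write into a buffer.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative subset; the real values come from the Hyper-V headers. */
#define DEMO_HV_STATUS_SUCCESS			0x0
#define DEMO_HV_STATUS_INVALID_PARAMETER	0x5

/* Hypothetical stand-in for hv_result_to_string(). */
static const char *demo_result_to_string(uint64_t status)
{
	switch (status & 0xFFFF) {
	case DEMO_HV_STATUS_SUCCESS:
		return "HV_STATUS_SUCCESS";
	case DEMO_HV_STATUS_INVALID_PARAMETER:
		return "HV_STATUS_INVALID_PARAMETER";
	default:
		return "Unknown";
	}
}

/*
 * Write the caller's context first, then append ": <hex> (<name>)" so
 * every message ends the same way regardless of the call site.
 */
static int demo_hvcall_err(char *buf, size_t len, uint64_t status,
			   const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = vsnprintf(buf, len, fmt, args);
	va_end(args);
	if (n < 0 || (size_t)n >= len)
		return n;	/* error or truncated: nothing more fits */

	return n + snprintf(buf + n, len - n, ": %#x (%s)",
			    (unsigned int)(status & 0xFFFF),
			    demo_result_to_string(status));
}
```

A call site such as hv_call_add_logical_proc() could then pass its own
"%s: cpu %u apic ID %u" context and still get the standardized suffix.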

Lastly, from an implementation standpoint, rather than using a big
switch statement, build a static array of entries that each have the
hex code and string equivalent. Then hv_result_to_string() loops through
the array looking for a match. This won't be any slower than the big switch
statement. I've seen other places in the kernel where string names are
output, and looking up the strings in a static array is the typical approach.
You'll have to work through the details and see if it avoids being too clumsy,
but I think it will be OK.
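
A minimal userspace sketch of the table-driven lookup described above.
The struct and array names, the HV_STATUS_ENTRY() helper, and the small
subset of codes are assumptions made for illustration, not code from the
patch. The numeric values are believed to match the TLFS but should be
checked against the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset; values assumed to follow the TLFS definitions. */
#define HV_STATUS_SUCCESS			0x0
#define HV_STATUS_INVALID_HYPERCALL_CODE	0x2
#define HV_STATUS_INVALID_PARAMETER		0x5
#define HV_STATUS_INSUFFICIENT_MEMORY		0xB

/* Hypothetical table entry: the 16-bit code and its printable name. */
struct hv_status_info {
	uint16_t code;
	const char *name;
};

/* Stringify the constant so the name and the value cannot drift apart. */
#define HV_STATUS_ENTRY(c) { (c), #c }

static const struct hv_status_info hv_status_infos[] = {
	HV_STATUS_ENTRY(HV_STATUS_SUCCESS),
	HV_STATUS_ENTRY(HV_STATUS_INVALID_HYPERCALL_CODE),
	HV_STATUS_ENTRY(HV_STATUS_INVALID_PARAMETER),
	HV_STATUS_ENTRY(HV_STATUS_INSUFFICIENT_MEMORY),
};

/* Linear scan over the table; only the low 16 bits are compared. */
const char *hv_result_to_string(uint64_t hv_status)
{
	uint16_t code = (uint16_t)(hv_status & 0xFFFF);
	size_t i;

	for (i = 0; i < sizeof(hv_status_infos) / sizeof(hv_status_infos[0]); i++) {
		if (hv_status_infos[i].code == code)
			return hv_status_infos[i].name;
	}
	return "Unknown";
}
```

With this layout, supporting a new status code is a one-line addition to
the array, and the fallback string for unrecognized codes lives in one place.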

Michael


* RE: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-03-06 17:57   ` Michael Kelley
@ 2025-03-06 18:09     ` Michael Kelley
  2025-03-06 18:40       ` Nuno Das Neves
  2025-03-07 19:38     ` Nuno Das Neves
  1 sibling, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 18:09 UTC (permalink / raw)
  To: Michael Kelley, Nuno Das Neves, linux-hyperv@vger.kernel.org,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, March 6, 2025 9:58 AM

> 
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> >
> > Introduce hv_result_to_string() for this purpose. This allows
> > hypercall failures to be debugged more easily with dmesg.
> >
> > Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> > ---
> >  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
> >  drivers/hv/hv_proc.c           | 13 ++++---
> >  include/asm-generic/mshyperv.h |  1 +
> >  3 files changed, 74 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> > index 9804adb4cc56..ce20818688fe 100644
> > --- a/drivers/hv/hv_common.c
> > +++ b/drivers/hv/hv_common.c
> > @@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
> >  			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
> >  	}
> >  }
> > +
> > +const char *hv_result_to_string(u64 hv_status)
> > +{
> > +	switch (hv_result(hv_status)) {
> > +	case HV_STATUS_SUCCESS:
> > +		return "HV_STATUS_SUCCESS";
> > +	case HV_STATUS_INVALID_HYPERCALL_CODE:
> > +		return "HV_STATUS_INVALID_HYPERCALL_CODE";
> > +	case HV_STATUS_INVALID_HYPERCALL_INPUT:
> > +		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
> > +	case HV_STATUS_INVALID_ALIGNMENT:
> > +		return "HV_STATUS_INVALID_ALIGNMENT";
> > +	case HV_STATUS_INVALID_PARAMETER:
> > +		return "HV_STATUS_INVALID_PARAMETER";
> > +	case HV_STATUS_ACCESS_DENIED:
> > +		return "HV_STATUS_ACCESS_DENIED";
> > +	case HV_STATUS_INVALID_PARTITION_STATE:
> > +		return "HV_STATUS_INVALID_PARTITION_STATE";
> > +	case HV_STATUS_OPERATION_DENIED:
> > +		return "HV_STATUS_OPERATION_DENIED";
> > +	case HV_STATUS_UNKNOWN_PROPERTY:
> > +		return "HV_STATUS_UNKNOWN_PROPERTY";
> > +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
> > +		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
> > +	case HV_STATUS_INSUFFICIENT_MEMORY:
> > +		return "HV_STATUS_INSUFFICIENT_MEMORY";
> > +	case HV_STATUS_INVALID_PARTITION_ID:
> > +		return "HV_STATUS_INVALID_PARTITION_ID";
> > +	case HV_STATUS_INVALID_VP_INDEX:
> > +		return "HV_STATUS_INVALID_VP_INDEX";
> > +	case HV_STATUS_NOT_FOUND:
> > +		return "HV_STATUS_NOT_FOUND";
> > +	case HV_STATUS_INVALID_PORT_ID:
> > +		return "HV_STATUS_INVALID_PORT_ID";
> > +	case HV_STATUS_INVALID_CONNECTION_ID:
> > +		return "HV_STATUS_INVALID_CONNECTION_ID";
> > +	case HV_STATUS_INSUFFICIENT_BUFFERS:
> > +		return "HV_STATUS_INSUFFICIENT_BUFFERS";
> > +	case HV_STATUS_NOT_ACKNOWLEDGED:
> > +		return "HV_STATUS_NOT_ACKNOWLEDGED";
> > +	case HV_STATUS_INVALID_VP_STATE:
> > +		return "HV_STATUS_INVALID_VP_STATE";
> > +	case HV_STATUS_NO_RESOURCES:
> > +		return "HV_STATUS_NO_RESOURCES";
> > +	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
> > +		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
> > +	case HV_STATUS_INVALID_LP_INDEX:
> > +		return "HV_STATUS_INVALID_LP_INDEX";
> > +	case HV_STATUS_INVALID_REGISTER_VALUE:
> > +		return "HV_STATUS_INVALID_REGISTER_VALUE";
> > +	case HV_STATUS_OPERATION_FAILED:
> > +		return "HV_STATUS_OPERATION_FAILED";
> > +	case HV_STATUS_TIME_OUT:
> > +		return "HV_STATUS_TIME_OUT";
> > +	case HV_STATUS_CALL_PENDING:
> > +		return "HV_STATUS_CALL_PENDING";
> > +	case HV_STATUS_VTL_ALREADY_ENABLED:
> > +		return "HV_STATUS_VTL_ALREADY_ENABLED";
> > +	default:
> > +		return "Unknown";
> > +	};
> > +	return "Unknown";
> > +}
> > +EXPORT_SYMBOL_GPL(hv_result_to_string);
> > +
> > diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> > index 2fae18e4f7d2..8fc30f509fa7 100644
> > --- a/drivers/hv/hv_proc.c
> > +++ b/drivers/hv/hv_proc.c
> > @@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
> >  				     page_count, 0, input_page, NULL);
> >  	local_irq_restore(flags);
> >  	if (!hv_result_success(status)) {
> > -		pr_err("Failed to deposit pages: %lld\n", status);
> > +		pr_err("%s: Failed to deposit pages: %s\n", __func__,
> > +		       hv_result_to_string(status));
> >  		ret = hv_result_to_errno(status);
> >  		goto err_free_allocations;
> >  	}
> > @@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
> >
> >  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> >  			if (!hv_result_success(status)) {
> > -				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
> > -				       lp_index, apic_id, status);
> > +				pr_err("%s: cpu %u apic ID %u, %s\n",
> > +				       __func__, lp_index, apic_id,
> > +				       hv_result_to_string(status));
> >  				ret = hv_result_to_errno(status);
> >  			}
> >  			break;
> > @@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> >
> >  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> >  			if (!hv_result_success(status)) {
> > -				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
> > -				       vp_index, flags, status);
> > +				pr_err("%s: vcpu %u, lp %u, %s\n",
> > +				       __func__, vp_index, flags,
> > +				       hv_result_to_string(status));
> >  				ret = hv_result_to_errno(status);
> >  			}
> >  			break;
> > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > index b13b0cda4ac8..dc4729dba9ef 100644
> > --- a/include/asm-generic/mshyperv.h
> > +++ b/include/asm-generic/mshyperv.h
> > @@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
> >  	return __cpumask_to_vpset(vpset, cpus, func);
> >  }
> >
> > +const char *hv_result_to_string(u64 hv_status);
> >  int hv_result_to_errno(u64 status);
> >  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
> >  bool hv_is_hyperv_initialized(void);
> > --
> > 2.34.1
> 
> I've read through the other comments on this patch. I definitely vote
> for outputting both the hex code along with a string translation, which
> could be empty if the hex code is unrecognized by the translation code.
> 
> I can see providing something like hv_hvcall_err() as Nuno proposed, since
> that standardizes the text output. But I wonder if it would be too limiting.
> For example, in the changes above, both hv_call_add_logical_proc() and
> hv_call_create_vp() output additional debugging values, which we probably
> don't want to give up.
> 
> Lastly, from an implementation standpoint, rather than using a big
> switch statement, build a static array of entries that each have the
> hex code and string equivalent. Then hv_result_to_string() loops through
> the array looking for a match. This won't be any slower than the big switch
> statement. I've seen other places in the kernel where string names are
> output, and looking up the strings in a static array is the typical approach.
> You'll have to work through the details and see if it avoids being too clumsy,
> but I think it will be OK.
> 

Better yet, also include the translated errno in each static array entry.
Then hv_result_to_errno() can do the same kind of lookup instead of
having its own switch statement. I did a quick look to see if the two
functions might be combined to do only a single lookup, but that looks
somewhat clumsy unless someone else spots a better way to handle it.
The cost of doing two lookups doesn't really matter in an error case.
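
A rough userspace sketch of that idea, with the errno stored next to the
name and both functions sharing one lookup helper. The struct layout, the
helper name find_hv_status_info(), and the specific errno choices are
assumptions for illustration; the real mapping should be taken from the
existing hv_result_to_errno().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset; values assumed to follow the TLFS definitions. */
#define HV_STATUS_SUCCESS		0x0
#define HV_STATUS_INVALID_PARAMETER	0x5
#define HV_STATUS_ACCESS_DENIED		0x6
#define HV_STATUS_INSUFFICIENT_MEMORY	0xB

/* Hypothetical entry: code, translated errno, and printable name. */
struct hv_status_info {
	uint16_t code;
	int errno_val;
	const char *name;
};

#define HV_STATUS_ENTRY(c, e) { (c), (e), #c }

static const struct hv_status_info hv_status_infos[] = {
	HV_STATUS_ENTRY(HV_STATUS_SUCCESS,		0),
	HV_STATUS_ENTRY(HV_STATUS_INVALID_PARAMETER,	-EINVAL),
	HV_STATUS_ENTRY(HV_STATUS_ACCESS_DENIED,	-EACCES),
	HV_STATUS_ENTRY(HV_STATUS_INSUFFICIENT_MEMORY,	-ENOMEM),
};

/* Shared lookup: returns the matching entry, or NULL if the code is unknown. */
static const struct hv_status_info *find_hv_status_info(uint64_t hv_status)
{
	uint16_t code = (uint16_t)(hv_status & 0xFFFF);
	size_t i;

	for (i = 0; i < sizeof(hv_status_infos) / sizeof(hv_status_infos[0]); i++) {
		if (hv_status_infos[i].code == code)
			return &hv_status_infos[i];
	}
	return NULL;
}

const char *hv_result_to_string(uint64_t hv_status)
{
	const struct hv_status_info *info = find_hv_status_info(hv_status);

	return info ? info->name : "Unknown";
}

/* Unknown codes fall back to -EIO (an assumed default for this sketch). */
int hv_result_to_errno(uint64_t hv_status)
{
	const struct hv_status_info *info = find_hv_status_info(hv_status);

	return info ? info->errno_val : -EIO;
}
```

Each caller still performs its own scan, so an error path that wants both
the string and the errno does two lookups, which is cheap for a table
this small.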

FWIW, hv_result_to_errno() and the new hv_result_to_string() are both
slightly misnamed. The input argument is a full 64-bit hv_status, not the
smaller 16-bit result field. hv_status_to_errno() and hv_status_to_string()
would be more precise.

Michael


* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-03-06 18:09     ` Michael Kelley
@ 2025-03-06 18:40       ` Nuno Das Neves
  2025-03-06 18:57         ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-06 18:40 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/6/2025 10:09 AM, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, March 6, 2025 9:58 AM
> 
>>
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>>
>>> Introduce hv_result_to_string() for this purpose. This allows
>>> hypercall failures to be debugged more easily with dmesg.
>>>
>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>> ---
>>>  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
>>>  drivers/hv/hv_proc.c           | 13 ++++---
>>>  include/asm-generic/mshyperv.h |  1 +
>>>  3 files changed, 74 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>>> index 9804adb4cc56..ce20818688fe 100644
>>> --- a/drivers/hv/hv_common.c
>>> +++ b/drivers/hv/hv_common.c
>>> @@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
>>>  			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
>>>  	}
>>>  }
>>> +
>>> +const char *hv_result_to_string(u64 hv_status)
>>> +{
>>> +	switch (hv_result(hv_status)) {
>>> +	case HV_STATUS_SUCCESS:
>>> +		return "HV_STATUS_SUCCESS";
>>> +	case HV_STATUS_INVALID_HYPERCALL_CODE:
>>> +		return "HV_STATUS_INVALID_HYPERCALL_CODE";
>>> +	case HV_STATUS_INVALID_HYPERCALL_INPUT:
>>> +		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
>>> +	case HV_STATUS_INVALID_ALIGNMENT:
>>> +		return "HV_STATUS_INVALID_ALIGNMENT";
>>> +	case HV_STATUS_INVALID_PARAMETER:
>>> +		return "HV_STATUS_INVALID_PARAMETER";
>>> +	case HV_STATUS_ACCESS_DENIED:
>>> +		return "HV_STATUS_ACCESS_DENIED";
>>> +	case HV_STATUS_INVALID_PARTITION_STATE:
>>> +		return "HV_STATUS_INVALID_PARTITION_STATE";
>>> +	case HV_STATUS_OPERATION_DENIED:
>>> +		return "HV_STATUS_OPERATION_DENIED";
>>> +	case HV_STATUS_UNKNOWN_PROPERTY:
>>> +		return "HV_STATUS_UNKNOWN_PROPERTY";
>>> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
>>> +		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
>>> +	case HV_STATUS_INSUFFICIENT_MEMORY:
>>> +		return "HV_STATUS_INSUFFICIENT_MEMORY";
>>> +	case HV_STATUS_INVALID_PARTITION_ID:
>>> +		return "HV_STATUS_INVALID_PARTITION_ID";
>>> +	case HV_STATUS_INVALID_VP_INDEX:
>>> +		return "HV_STATUS_INVALID_VP_INDEX";
>>> +	case HV_STATUS_NOT_FOUND:
>>> +		return "HV_STATUS_NOT_FOUND";
>>> +	case HV_STATUS_INVALID_PORT_ID:
>>> +		return "HV_STATUS_INVALID_PORT_ID";
>>> +	case HV_STATUS_INVALID_CONNECTION_ID:
>>> +		return "HV_STATUS_INVALID_CONNECTION_ID";
>>> +	case HV_STATUS_INSUFFICIENT_BUFFERS:
>>> +		return "HV_STATUS_INSUFFICIENT_BUFFERS";
>>> +	case HV_STATUS_NOT_ACKNOWLEDGED:
>>> +		return "HV_STATUS_NOT_ACKNOWLEDGED";
>>> +	case HV_STATUS_INVALID_VP_STATE:
>>> +		return "HV_STATUS_INVALID_VP_STATE";
>>> +	case HV_STATUS_NO_RESOURCES:
>>> +		return "HV_STATUS_NO_RESOURCES";
>>> +	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
>>> +		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
>>> +	case HV_STATUS_INVALID_LP_INDEX:
>>> +		return "HV_STATUS_INVALID_LP_INDEX";
>>> +	case HV_STATUS_INVALID_REGISTER_VALUE:
>>> +		return "HV_STATUS_INVALID_REGISTER_VALUE";
>>> +	case HV_STATUS_OPERATION_FAILED:
>>> +		return "HV_STATUS_OPERATION_FAILED";
>>> +	case HV_STATUS_TIME_OUT:
>>> +		return "HV_STATUS_TIME_OUT";
>>> +	case HV_STATUS_CALL_PENDING:
>>> +		return "HV_STATUS_CALL_PENDING";
>>> +	case HV_STATUS_VTL_ALREADY_ENABLED:
>>> +		return "HV_STATUS_VTL_ALREADY_ENABLED";
>>> +	default:
>>> +		return "Unknown";
>>> +	};
>>> +	return "Unknown";
>>> +}
>>> +EXPORT_SYMBOL_GPL(hv_result_to_string);
>>> +
>>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>>> index 2fae18e4f7d2..8fc30f509fa7 100644
>>> --- a/drivers/hv/hv_proc.c
>>> +++ b/drivers/hv/hv_proc.c
>>> @@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>  				     page_count, 0, input_page, NULL);
>>>  	local_irq_restore(flags);
>>>  	if (!hv_result_success(status)) {
>>> -		pr_err("Failed to deposit pages: %lld\n", status);
>>> +		pr_err("%s: Failed to deposit pages: %s\n", __func__,
>>> +		       hv_result_to_string(status));
>>>  		ret = hv_result_to_errno(status);
>>>  		goto err_free_allocations;
>>>  	}
>>> @@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>>
>>>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>>>  			if (!hv_result_success(status)) {
>>> -				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
>>> -				       lp_index, apic_id, status);
>>> +				pr_err("%s: cpu %u apic ID %u, %s\n",
>>> +				       __func__, lp_index, apic_id,
>>> +				       hv_result_to_string(status));
>>>  				ret = hv_result_to_errno(status);
>>>  			}
>>>  			break;
>>> @@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>>
>>>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>>>  			if (!hv_result_success(status)) {
>>> -				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
>>> -				       vp_index, flags, status);
>>> +				pr_err("%s: vcpu %u, lp %u, %s\n",
>>> +				       __func__, vp_index, flags,
>>> +				       hv_result_to_string(status));
>>>  				ret = hv_result_to_errno(status);
>>>  			}
>>>  			break;
>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>> index b13b0cda4ac8..dc4729dba9ef 100644
>>> --- a/include/asm-generic/mshyperv.h
>>> +++ b/include/asm-generic/mshyperv.h
>>> @@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
>>>  	return __cpumask_to_vpset(vpset, cpus, func);
>>>  }
>>>
>>> +const char *hv_result_to_string(u64 hv_status);
>>>  int hv_result_to_errno(u64 status);
>>>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>>>  bool hv_is_hyperv_initialized(void);
>>> --
>>> 2.34.1
>>
>> I've read through the other comments on this patch. I definitely vote
>> for outputting both the hex code along with a string translation, which
>> could be empty if the hex code is unrecognized by the translation code.
>>
>> I can see providing something like hv_hvcall_err() as Nuno proposed, since
>> that standardizes the text output. But I wonder if it would be too limiting.
>> For example, in the changes above, both hv_call_add_logical_proc() and
>> hv_call_create_vp() output additional debugging values, which we probably
>> don't want to give up.
>>
>> Lastly, from an implementation standpoint, rather than using a big
>> switch statement, build a static array of entries that each have the
>> hex code and string equivalent. Then hv_result_to_string() loops through
>> the array looking for a match. This won't be any slower than the big switch
>> statement. I've seen other places in the kernel where string names are
>> output, and looking up the strings in a static array is the typical approach.
>> You'll have to work through the details and see if it avoids being too clumsy,
>> but I think it will be OK.
>>
> 
> Better yet, also include the translated errno in each static array entry.
> Then hv_result_to_errno() can do the same kind of lookup instead of
> having its own switch statement. I did a quick look to see if the two
> functions might be combined to do only a single lookup, but that looks
> somewhat clumsy unless someone else spots a better way to handle it.
> The cost of doing two lookups doesn't really matter in an error case.
> 
> FWIW, hv_result_to_errno() and the new hv_result_to_string() are both
> slightly misnamed. The input argument is a full 64-bit hv_status, not the
> smaller 16-bit result field. hv_status_to_errno() and hv_status_to_string()
> would be more precise.
> 
Hmm, well I'll admit I was and still am rather confused on this point.

In the TLFS (section 3.8) the entire 64-bit return value is called the
"hypercall result value".
The 16-bit HV_STATUS part is *also* called the "result" in this section.
Later, in section 3.12, the 16-bit field is referred to as a "status value
field".
Furthermore, the name of the 16-bit value, itself, is HV_STATUS.
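
For reference, a small userspace sketch of how the 16-bit field sits
inside the 64-bit return value. The masks are written out by hand here
and are an assumption based on the TLFS layout (bits 15:0 for the
HV_STATUS code, bits 43:32 for the completed rep count); the helper names
mirror the kernel's hv_result(), hv_result_success() and hv_repcomp(),
but these bodies are a re-creation, not the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the 64-bit hypercall return value. */
#define HV_HYPERCALL_RESULT_MASK	0xFFFFULL
#define HV_HYPERCALL_REP_COMP_OFFSET	32
#define HV_HYPERCALL_REP_COMP_MASK	(0xFFFULL << HV_HYPERCALL_REP_COMP_OFFSET)

/* Extract the 16-bit HV_STATUS code from the 64-bit value. */
static inline uint16_t hv_result(uint64_t status)
{
	return (uint16_t)(status & HV_HYPERCALL_RESULT_MASK);
}

/* A zero HV_STATUS code means the hypercall succeeded. */
static inline bool hv_result_success(uint64_t status)
{
	return hv_result(status) == 0;
}

/* Extract the number of reps completed by a rep hypercall. */
static inline unsigned int hv_repcomp(uint64_t status)
{
	return (unsigned int)((status & HV_HYPERCALL_REP_COMP_MASK) >>
			      HV_HYPERCALL_REP_COMP_OFFSET);
}
```

Whichever name is chosen, the helpers take the full 64-bit value as input
and hand back one field of it, which is where the naming ambiguity comes from.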

Despite the inconsistency, in my mind it makes the most sense that the
16-bit HV_STATUS part is the "status" and the entire 64-bit return value the
"result". I am aware that elsewhere (and in the driver patches in this
series), the name "status" is used to refer to the entire 64-bit return
value.

These functions were actually called hv_status_to_errno() and hv_status_to_string()
in the past, but I changed them to use "result" by following my own logic, and I
thought this also matched the naming of hv_result() and hv_result_success().
However I now realize that the "result" in these names refers to the *output* of
these functions... they take a u64 status as a parameter after all..

So in the end I'm rather bothered by this whole situation. I can change these
names back to "status" (although hv_result_to_errno() is already merged, I
could send a fixup), or I could keep "result", which I think is a more
logical name for the 64-bit value, even though it somewhat contradicts how
the term is already used in the kernel.

Given it doesn't seem to be well-defined in the first place, I'm not really
sure of the best route.

Nuno

> Michael



* RE: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-03-06 18:40       ` Nuno Das Neves
@ 2025-03-06 18:57         ` Michael Kelley
  0 siblings, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 18:57 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, March 6, 2025 10:41 AM
> 
> On 3/6/2025 10:09 AM, Michael Kelley wrote:
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Thursday, March 6, 2025 9:58 AM
> >

[snip]

> >> I've read through the other comments on this patch. I definitely vote
> >> for outputting both the hex code along with a string translation, which
> >> could be empty if the hex code is unrecognized by the translation code.
> >>
> >> I can see providing something like hv_hvcall_err() as Nuno proposed, since
> >> that standardizes the text output. But I wonder if it would be too limiting.
> >> For example, in the changes above, both hv_call_add_logical_proc() and
> >> hv_call_create_vp() output additional debugging values, which we probably
> >> don't want to give up.
> >>
> >> Lastly, from an implementation standpoint, rather than using a big
> >> switch statement, build a static array of entries that each have the
> >> hex code and string equivalent. Then hv_result_to_string() loops through
> >> the array looking for a match. This won't be any slower than the big switch
> >> statement. I've seen other places in the kernel where string names are
> >> output, and looking up the strings in a static array is the typical approach.
> >> You'll have to work through the details and see if it avoids being too clumsy,
> >> but I think it will be OK.
> >>
> >
> > Better yet, also include the translated errno in each static array entry.
> > Then hv_result_to_errno() can do the same kind of lookup instead of
> > having its own switch statement. I did a quick look to see if the two
> > functions might be combined to do only a single lookup, but that looks
> > somewhat clumsy unless someone else spots a better way to handle it.
> > The cost of doing two lookups doesn't really matter in an error case.
> >
> > FWIW, hv_result_to_errno() and the new hv_result_to_string() are both
> > slightly misnamed. The input argument is a full 64-bit hv_status, not the
> > smaller 16-bit result field. hv_status_to_errno() and hv_status_to_string()
> > would be more precise.
> >
> Hmm, well I'll admit I was and still am rather confused on this point.
> 
> In the TLFS (section 3.8) the entire 64-bit return value is called the  "hypercall result value".
> The 16-bit HV_STATUS part is *also* called the "result" in this section.
> Later, in section 3.12, the 16-bit field is referred to as a "status value field".
> Furthermore, the name of the 16-bit value, itself, is HV_STATUS.
> 
> Despite the inconsistency, in my mind it makes the most sense that the
> 16-bit HV_STATUS part is the "status" and the entire 64-bit return value the
> "result". I am aware that elsewhere (and in the driver patches in this
> series), the name "status" is used to refer to the entire 64-bit return
> value.
> 
> These functions were actually called hv_status_to_errno() and hv_status_to_string()
> in the past, but I changed them to use "result" by following my own logic, and I
> thought this also matched the naming of hv_result() and hv_result_success().
> However I now realize that the "result" in these names refers to the *output* of
> these functions... they take a u64 status as a parameter after all..
> 
> So in the end I'm rather bothered by this whole situation. I can change these
> names back to "status" (although hv_result_to_errno() is already merged, I
> could send a fixup), or I could keep "result", which I think is a more
> logical name for the 64-bit value, even though it somewhat contradicts how
> the term is already used in the kernel.
> 
> Given it doesn't seem to be well-defined in the first place, I'm not really
> sure of the best route.
> 

Hmmm. You are right. I had in my mind that "status" is the full 64-bit
value, and "result" is the 16-bit error code. But that's certainly not
always the case. And as you point out, it doesn't comport with the TLFS,
and the TLFS itself is not consistent in the terminology.

Ignore my comment. The difference doesn't have any real impact. Leave
the sorting out for some other time. :-)

Michael


* Re: [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings
  2025-03-06 17:57   ` Michael Kelley
  2025-03-06 18:09     ` Michael Kelley
@ 2025-03-07 19:38     ` Nuno Das Neves
  1 sibling, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 19:38 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/6/2025 9:57 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>
>> Introduce hv_result_to_string() for this purpose. This allows
>> hypercall failures to be debugged more easily with dmesg.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  drivers/hv/hv_common.c         | 65 ++++++++++++++++++++++++++++++++++
>>  drivers/hv/hv_proc.c           | 13 ++++---
>>  include/asm-generic/mshyperv.h |  1 +
>>  3 files changed, 74 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 9804adb4cc56..ce20818688fe 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -740,3 +740,68 @@ void hv_identify_partition_type(void)
>>  			pr_crit("Hyper-V: CONFIG_MSHV_ROOT not enabled!\n");
>>  	}
>>  }
>> +
>> +const char *hv_result_to_string(u64 hv_status)
>> +{
>> +	switch (hv_result(hv_status)) {
>> +	case HV_STATUS_SUCCESS:
>> +		return "HV_STATUS_SUCCESS";
>> +	case HV_STATUS_INVALID_HYPERCALL_CODE:
>> +		return "HV_STATUS_INVALID_HYPERCALL_CODE";
>> +	case HV_STATUS_INVALID_HYPERCALL_INPUT:
>> +		return "HV_STATUS_INVALID_HYPERCALL_INPUT";
>> +	case HV_STATUS_INVALID_ALIGNMENT:
>> +		return "HV_STATUS_INVALID_ALIGNMENT";
>> +	case HV_STATUS_INVALID_PARAMETER:
>> +		return "HV_STATUS_INVALID_PARAMETER";
>> +	case HV_STATUS_ACCESS_DENIED:
>> +		return "HV_STATUS_ACCESS_DENIED";
>> +	case HV_STATUS_INVALID_PARTITION_STATE:
>> +		return "HV_STATUS_INVALID_PARTITION_STATE";
>> +	case HV_STATUS_OPERATION_DENIED:
>> +		return "HV_STATUS_OPERATION_DENIED";
>> +	case HV_STATUS_UNKNOWN_PROPERTY:
>> +		return "HV_STATUS_UNKNOWN_PROPERTY";
>> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
>> +		return "HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE";
>> +	case HV_STATUS_INSUFFICIENT_MEMORY:
>> +		return "HV_STATUS_INSUFFICIENT_MEMORY";
>> +	case HV_STATUS_INVALID_PARTITION_ID:
>> +		return "HV_STATUS_INVALID_PARTITION_ID";
>> +	case HV_STATUS_INVALID_VP_INDEX:
>> +		return "HV_STATUS_INVALID_VP_INDEX";
>> +	case HV_STATUS_NOT_FOUND:
>> +		return "HV_STATUS_NOT_FOUND";
>> +	case HV_STATUS_INVALID_PORT_ID:
>> +		return "HV_STATUS_INVALID_PORT_ID";
>> +	case HV_STATUS_INVALID_CONNECTION_ID:
>> +		return "HV_STATUS_INVALID_CONNECTION_ID";
>> +	case HV_STATUS_INSUFFICIENT_BUFFERS:
>> +		return "HV_STATUS_INSUFFICIENT_BUFFERS";
>> +	case HV_STATUS_NOT_ACKNOWLEDGED:
>> +		return "HV_STATUS_NOT_ACKNOWLEDGED";
>> +	case HV_STATUS_INVALID_VP_STATE:
>> +		return "HV_STATUS_INVALID_VP_STATE";
>> +	case HV_STATUS_NO_RESOURCES:
>> +		return "HV_STATUS_NO_RESOURCES";
>> +	case HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED:
>> +		return "HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED";
>> +	case HV_STATUS_INVALID_LP_INDEX:
>> +		return "HV_STATUS_INVALID_LP_INDEX";
>> +	case HV_STATUS_INVALID_REGISTER_VALUE:
>> +		return "HV_STATUS_INVALID_REGISTER_VALUE";
>> +	case HV_STATUS_OPERATION_FAILED:
>> +		return "HV_STATUS_OPERATION_FAILED";
>> +	case HV_STATUS_TIME_OUT:
>> +		return "HV_STATUS_TIME_OUT";
>> +	case HV_STATUS_CALL_PENDING:
>> +		return "HV_STATUS_CALL_PENDING";
>> +	case HV_STATUS_VTL_ALREADY_ENABLED:
>> +		return "HV_STATUS_VTL_ALREADY_ENABLED";
>> +	default:
>> +		return "Unknown";
>> +	};
>> +	return "Unknown";
>> +}
>> +EXPORT_SYMBOL_GPL(hv_result_to_string);
>> +
>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>> index 2fae18e4f7d2..8fc30f509fa7 100644
>> --- a/drivers/hv/hv_proc.c
>> +++ b/drivers/hv/hv_proc.c
>> @@ -87,7 +87,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>  				     page_count, 0, input_page, NULL);
>>  	local_irq_restore(flags);
>>  	if (!hv_result_success(status)) {
>> -		pr_err("Failed to deposit pages: %lld\n", status);
>> +		pr_err("%s: Failed to deposit pages: %s\n", __func__,
>> +		       hv_result_to_string(status));
>>  		ret = hv_result_to_errno(status);
>>  		goto err_free_allocations;
>>  	}
>> @@ -137,8 +138,9 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>
>>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>>  			if (!hv_result_success(status)) {
>> -				pr_err("%s: cpu %u apic ID %u, %lld\n", __func__,
>> -				       lp_index, apic_id, status);
>> +				pr_err("%s: cpu %u apic ID %u, %s\n",
>> +				       __func__, lp_index, apic_id,
>> +				       hv_result_to_string(status));
>>  				ret = hv_result_to_errno(status);
>>  			}
>>  			break;
>> @@ -179,8 +181,9 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>
>>  		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
>>  			if (!hv_result_success(status)) {
>> -				pr_err("%s: vcpu %u, lp %u, %lld\n", __func__,
>> -				       vp_index, flags, status);
>> +				pr_err("%s: vcpu %u, lp %u, %s\n",
>> +				       __func__, vp_index, flags,
>> +				       hv_result_to_string(status));
>>  				ret = hv_result_to_errno(status);
>>  			}
>>  			break;
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index b13b0cda4ac8..dc4729dba9ef 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -298,6 +298,7 @@ static inline int cpumask_to_vpset_skip(struct hv_vpset *vpset,
>>  	return __cpumask_to_vpset(vpset, cpus, func);
>>  }
>>
>> +const char *hv_result_to_string(u64 hv_status);
>>  int hv_result_to_errno(u64 status);
>>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>>  bool hv_is_hyperv_initialized(void);
>> --
>> 2.34.1
> 
> I've read through the other comments on this patch. I definitely vote
> for outputting both the hex code along with a string translation, which
> could be empty if the hex code is unrecognized by the translation code.
> 
> I can see providing something like hv_hvcall_err() as Nuno proposed, since
> that standardizes the text output. But I wonder if it would be too limiting.
> For example, in the changes above, both hv_call_add_logical_proc() and
> hv_call_create_vp() output additional debugging values, which we probably
> don't want to give up.
> 

Good point. That is easy to handle, though: I'll add __VA_ARGS__ to the macro
so that any custom message can be printed alongside the other info.
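
For readers following along, here is a minimal userspace sketch of what such
a variadic wrapper could look like. It is not the macro from the series: the
names demo_hvcall_err(), demo_result_to_string() and demo_add_logical_proc()
are hypothetical, the status value is an assumption, and snprintf() stands in
for pr_err() so the result can be inspected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for hv_result_to_string(); the code value is an assumption. */
static const char *demo_result_to_string(uint64_t status)
{
	return (status & 0xffff) == 0x000b ? "HV_STATUS_INSUFFICIENT_MEMORY" : "";
}

/*
 * Hypothetical helper macro, not the one from the series. It writes
 * "<function>: <caller's message>: <hex status> <name>" into buf.
 * ##__VA_ARGS__ (GNU C) drops the comma when no extra arguments are given.
 */
#define demo_hvcall_err(buf, len, status, fmt, ...)			\
	snprintf((buf), (len), "%s: " fmt ": 0x%llx %s", __func__,	\
		 ##__VA_ARGS__, (unsigned long long)(status),		\
		 demo_result_to_string(status))

/* Example caller that keeps its own debugging values in the message. */
static int demo_add_logical_proc(char *buf, size_t len, unsigned int lp_index,
				 unsigned int apic_id, uint64_t status)
{
	assert(buf != NULL && len > 0);
	return demo_hvcall_err(buf, len, status, "cpu %u apic ID %u",
			       lp_index, apic_id);
}
```

The caller supplies only its own format string and values; the function name,
hex status and translated string are appended by the macro, so the common part
of the output stays uniform.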

> Lastly, from an implementation standpoint, rather than using a big
> switch statement, build a static array of entries that each have the
> hex code and string equivalent. Then hv_result_to_string() loops through
> the array looking for a match. This won't be any slower than the big switch
> statement. I've seen other places in the kernel where string names are
> output, and looking up the strings in a static array is the typical approach.
> You'll have to work through the details and see if it avoids being too clumsy,
> but I think it will be OK.

I'll try it out. I agree the performance difference probably won't be
significant, or even matter much.
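
As an illustration of the table-driven approach suggested above, a minimal
userspace sketch follows. The struct, the table contents and the demo_*
names are hypothetical, and the numeric status values are assumptions chosen
for the example; an unrecognized code yields an empty string, as proposed
earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct demo_status_name {
	uint16_t code;
	const char *name;
};

/* Illustrative subset only; the numeric values are assumptions. */
static const struct demo_status_name demo_status_names[] = {
	{ 0x0000, "HV_STATUS_SUCCESS" },
	{ 0x0002, "HV_STATUS_INVALID_HYPERCALL_CODE" },
	{ 0x000b, "HV_STATUS_INSUFFICIENT_MEMORY" },
};

/* Linear scan of the table; returns "" when the code is not listed. */
static const char *demo_result_to_string(uint64_t status)
{
	uint16_t code = (uint16_t)(status & 0xffff); /* result is the low 16 bits */
	size_t i;

	for (i = 0; i < sizeof(demo_status_names) / sizeof(demo_status_names[0]); i++) {
		if (demo_status_names[i].code == code)
			return demo_status_names[i].name;
	}
	return "";
}

/* Emit the hex code followed by its name, so unknown codes stay visible. */
static int demo_format_status(char *buf, size_t len, uint64_t status)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "0x%llx %s",
			(unsigned long long)(status & 0xffff),
			demo_result_to_string(status));
}
```

Adding a status code then means adding one table row rather than a new case
label, and the hex value is always printed even when no name is found.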

Thanks
Nuno

> 
> Michael



* [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
@ 2025-02-26 23:07 ` Nuno Das Neves
  2025-02-26 23:27   ` Stanislav Kinsburskii
                     ` (4 more replies)
  2025-02-26 23:07 ` [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64 Nuno Das Neves
                   ` (7 subsequent siblings)
  9 siblings, 5 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:07 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Extend the "ms_hyperv_info" structure to include a new field,
"ext_features", for capturing extended Hyper-V features.
Update the "ms_hyperv_init_platform" function to retrieve these features
using the cpuid instruction and include them in the informational output.
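
The register-to-field mapping described above can be sketched in portable
userspace C. This is only an illustration: struct demo_cpuid_regs, struct
demo_hyperv_info and the demo_* functions are hypothetical, the register
values are passed in by the caller rather than read with cpuid, and
snprintf() stands in for pr_info().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Raw register values returned by one CPUID leaf (hypothetical container). */
struct demo_cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Mirrors the fields of struct ms_hyperv_info touched by this patch. */
struct demo_hyperv_info {
	uint32_t features;	/* features leaf, EAX */
	uint32_t priv_high;	/* features leaf, EBX */
	uint32_t ext_features;	/* features leaf, ECX (the new field) */
	uint32_t misc_features;	/* features leaf, EDX */
	uint32_t hints;		/* enlightenment info leaf, EAX */
};

static void demo_fill_info(struct demo_hyperv_info *info,
			   const struct demo_cpuid_regs *features_leaf,
			   uint32_t hints_eax)
{
	assert(info != NULL && features_leaf != NULL);
	info->features = features_leaf->eax;
	info->priv_high = features_leaf->ebx;
	info->ext_features = features_leaf->ecx;
	info->misc_features = features_leaf->edx;
	info->hints = hints_eax;
}

/* Same field order as the updated pr_info() line: low, high, ext, hints, misc. */
static int demo_format_info(char *buf, size_t len,
			    const struct demo_hyperv_info *info)
{
	return snprintf(buf, len,
			"Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x",
			info->features, info->priv_high, info->ext_features,
			info->hints, info->misc_features);
}
```

Note that the printed order (hints before misc) differs from the register
order, which is why the format string and argument list must be changed
together.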

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 6 ++++--
 include/asm-generic/mshyperv.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 4f01f424ea5b..2c29dfd6de19 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -434,13 +434,15 @@ static void __init ms_hyperv_init_platform(void)
 	 */
 	ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
 	ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
+	ms_hyperv.ext_features = cpuid_ecx(HYPERV_CPUID_FEATURES);
 	ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
 	ms_hyperv.hints    = cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
 
 	hv_max_functions_eax = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
 
-	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 0x%x\n",
-		ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
+	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x\n",
+		ms_hyperv.features, ms_hyperv.priv_high,
+		ms_hyperv.ext_features, ms_hyperv.hints,
 		ms_hyperv.misc_features);
 
 	ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index dc4729dba9ef..c020d5d0ec2a 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -36,6 +36,7 @@ enum hv_partition_type {
 struct ms_hyperv_info {
 	u32 features;
 	u32 priv_high;
+	u32 ext_features;
 	u32 misc_features;
 	u32 hints;
 	u32 nested_features;
-- 
2.34.1



* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
@ 2025-02-26 23:27   ` Stanislav Kinsburskii
  2025-02-27 17:59   ` Roman Kisel
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:27 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:07:56PM -0800, Nuno Das Neves wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> Extend the "ms_hyperv_info" structure to include a new field,
> "ext_features", for capturing extended Hyper-V features.
> Update the "ms_hyperv_init_platform" function to retrieve these features
> using the cpuid instruction and include them in the informational output.
> 

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>


* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
  2025-02-26 23:27   ` Stanislav Kinsburskii
@ 2025-02-27 17:59   ` Roman Kisel
  2025-02-28  0:17     ` Nuno Das Neves
  2025-02-27 18:17   ` Easwar Hariharan
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 17:59 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
[...]
>   
> -	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 0x%x\n",
> -		ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
> +	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x\n",
> +		ms_hyperv.features, ms_hyperv.priv_high,
> +		ms_hyperv.ext_features, ms_hyperv.hints,
>   		ms_hyperv.misc_features);

Would using %#x instead of 0x%x be better in your opinion?
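
For context, the two spellings can be compared with a short userspace sketch.
It uses the standard C library's snprintf(); the demo_* function names are
hypothetical. The kernel's printk formatter is a separate implementation, so
its handling of a zero value with "%#x" should be checked there rather than
assumed from this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Explicit prefix: "0x" is literal text, so it is always emitted. */
static int demo_fmt_explicit(char *buf, size_t len, unsigned int value)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "0x%x", value);
}

/* Alternate form: in ISO C the prefix is added only for nonzero values. */
static int demo_fmt_alternate(char *buf, size_t len, unsigned int value)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%#x", value);
}
```

With the standard library both forms give "0x2e7" for 0x2e7, but for zero the
explicit form gives "0x0" while the alternate form gives "0".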

[..]
-- 
Thank you,
Roman



* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-27 17:59   ` Roman Kisel
@ 2025-02-28  0:17     ` Nuno Das Neves
  2025-02-28 16:42       ` Roman Kisel
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-28  0:17 UTC (permalink / raw)
  To: Roman Kisel, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 2/27/2025 9:59 AM, Roman Kisel wrote:
> 
> 
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> [...]
>>   -    pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 0x%x\n",
>> -        ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
>> +    pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x\n",
>> +        ms_hyperv.features, ms_hyperv.priv_high,
>> +        ms_hyperv.ext_features, ms_hyperv.hints,
>>           ms_hyperv.misc_features);
> 
> Would using %#x instead of 0x%x be better in your opinion?
> 
It's a reasonable suggestion. I'm not sure if it's worth another
version, if this patch seems good enough to merge as-is.
However if I'm doing another version of this series that still
includes this patch, then I can certainly make the change.

Thanks!

> [..]



* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-28  0:17     ` Nuno Das Neves
@ 2025-02-28 16:42       ` Roman Kisel
  0 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-28 16:42 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/27/2025 4:17 PM, Nuno Das Neves wrote:
> On 2/27/2025 9:59 AM, Roman Kisel wrote:
>>
>>
>> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>>> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> [...]
>>>    -    pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 0x%x\n",
>>> -        ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
>>> +    pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x\n",
>>> +        ms_hyperv.features, ms_hyperv.priv_high,
>>> +        ms_hyperv.ext_features, ms_hyperv.hints,
>>>            ms_hyperv.misc_features);
>>
>> Would using %#x instead of 0x%x be better in your opinion?
>>
> It's a reasonable suggestion. I'm not sure if it's worth another
> version, if this patch seems good enough to merge as-is.
> However if I'm doing another version of this series that still
> includes this patch, then I can certainly make the change.
> 

You're right, a suggestion like that shouldn't warrant another version,
agreed! Whether you implement that tweak or not, this looks good to me.

Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

> Thanks!
> 
>> [..]
> 

-- 
Thank you,
Roman



* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
  2025-02-26 23:27   ` Stanislav Kinsburskii
  2025-02-27 17:59   ` Roman Kisel
@ 2025-02-27 18:17   ` Easwar Hariharan
  2025-03-06 18:30   ` Michael Kelley
  2025-03-10 13:17   ` Tianyu Lan
  4 siblings, 0 replies; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27 18:17 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> Extend the "ms_hyperv_info" structure to include a new field,
> "ext_features", for capturing extended Hyper-V features.
> Update the "ms_hyperv_init_platform" function to retrieve these features
> using the cpuid instruction and include them in the informational output.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 6 ++++--
>  include/asm-generic/mshyperv.h | 1 +
>  2 files changed, 5 insertions(+), 2 deletions(-)

Looks good to me.

Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>


* RE: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-02-27 18:17   ` Easwar Hariharan
@ 2025-03-06 18:30   ` Michael Kelley
  2025-03-12 18:04     ` Nuno Das Neves
  2025-03-10 13:17   ` Tianyu Lan
  4 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 18:30 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> Extend the "ms_hyperv_info" structure to include a new field,
> "ext_features", for capturing extended Hyper-V features.
> Update the "ms_hyperv_init_platform" function to retrieve these features
> using the cpuid instruction and include them in the informational output.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 6 ++++--
>  include/asm-generic/mshyperv.h | 1 +
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 4f01f424ea5b..2c29dfd6de19 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -434,13 +434,15 @@ static void __init ms_hyperv_init_platform(void)
>  	 */
>  	ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
>  	ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
> +	ms_hyperv.ext_features = cpuid_ecx(HYPERV_CPUID_FEATURES);
>  	ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
>  	ms_hyperv.hints    = cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
> 
>  	hv_max_functions_eax = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
> 
> -	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 0x%x\n",
> -		ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
> +	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x\n",
> +		ms_hyperv.features, ms_hyperv.priv_high,
> +		ms_hyperv.ext_features, ms_hyperv.hints,
>  		ms_hyperv.misc_features);
> 
>  	ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index dc4729dba9ef..c020d5d0ec2a 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -36,6 +36,7 @@ enum hv_partition_type {
>  struct ms_hyperv_info {
>  	u32 features;
>  	u32 priv_high;
> +	u32 ext_features;
>  	u32 misc_features;
>  	u32 hints;
>  	u32 nested_features;
> --
> 2.34.1

Are any of the extended features available on arm64? This code is obviously x86 specific,
so ms_hyperv.ext_features will be zero on arm64. From what I can see, ext_features is
referenced only in Patch 10 of this series, and in code that is under #ifdef CONFIG_X86_64,
so that should be OK.

The pr_info() line will now be slightly different on x86 and arm64 since arm64 won't have
the "ext" field, but I think that's OK too.

Reviewed-by: Michael Kelley <mhklinux@outlook.com>


* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-03-06 18:30   ` Michael Kelley
@ 2025-03-12 18:04     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-12 18:04 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/6/2025 10:30 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>
>> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>
>> Extend the "ms_hyperv_info" structure to include a new field,
>> "ext_features", for capturing extended Hyper-V features.
>> Update the "ms_hyperv_init_platform" function to retrieve these features
>> using the cpuid instruction and include them in the informational output.
>>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  arch/x86/kernel/cpu/mshyperv.c | 6 ++++--
>>  include/asm-generic/mshyperv.h | 1 +
>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 4f01f424ea5b..2c29dfd6de19 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -434,13 +434,15 @@ static void __init ms_hyperv_init_platform(void)
>>  	 */
>>  	ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
>>  	ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
>> +	ms_hyperv.ext_features = cpuid_ecx(HYPERV_CPUID_FEATURES);
>>  	ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
>>  	ms_hyperv.hints    = cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
>>
>>  	hv_max_functions_eax = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
>>
>> -	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, hints 0x%x, misc 0x%x\n",
>> -		ms_hyperv.features, ms_hyperv.priv_high, ms_hyperv.hints,
>> +	pr_info("Hyper-V: privilege flags low 0x%x, high 0x%x, ext 0x%x, hints 0x%x, misc 0x%x\n",
>> +		ms_hyperv.features, ms_hyperv.priv_high,
>> +		ms_hyperv.ext_features, ms_hyperv.hints,
>>  		ms_hyperv.misc_features);
>>
>>  	ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index dc4729dba9ef..c020d5d0ec2a 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -36,6 +36,7 @@ enum hv_partition_type {
>>  struct ms_hyperv_info {
>>  	u32 features;
>>  	u32 priv_high;
>> +	u32 ext_features;
>>  	u32 misc_features;
>>  	u32 hints;
>>  	u32 nested_features;
>> --
>> 2.34.1
> 
> Are any of the extended features available on arm64? This code is obviously x86 specific,
> so ms_hyperv.ext_features will be zero on arm64. From what I can see, ext_features is
> referenced only in Patch 10 of this series, and in code that is under #ifdef CONFIG_X86_64,
> so that should be OK.

Just checked - yes ARM64 has features in ECX, but they are different to the x86_64 ones.
We can add the ARM64 ones when needed.

Thanks
Nuno

> 
> The pr_info() line will now be slightly different on x86 and arm64 since arm64 won't have
> the "ext" field, but I think that's OK too.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>



* Re: [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
                     ` (3 preceding siblings ...)
  2025-03-06 18:30   ` Michael Kelley
@ 2025-03-10 13:17   ` Tianyu Lan
  4 siblings, 0 replies; 108+ messages in thread
From: Tianyu Lan @ 2025-03-10 13:17 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>
> Extend the "ms_hyperv_info" structure to include a new field,
> "ext_features", for capturing extended Hyper-V features.
> Update the "ms_hyperv_init_platform" function to retrieve these features
> using the cpuid instruction and include them in the informational output.
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---

Reviewed-by: Tianyu Lan <tiala@microsoft.com>

-- 
Thanks
Tianyu Lan


* [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
  2025-02-26 23:07 ` [PATCH v5 01/10] hyperv: Convert Hyper-V status codes to strings Nuno Das Neves
  2025-02-26 23:07 ` [PATCH v5 02/10] x86/mshyperv: Add support for extended Hyper-V features Nuno Das Neves
@ 2025-02-26 23:07 ` Nuno Das Neves
  2025-02-26 23:27   ` Stanislav Kinsburskii
                     ` (2 more replies)
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
                   ` (6 subsequent siblings)
  9 siblings, 3 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:07 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

These non-nested msr and fast hypercall functions are present in x86,
but they must be available in both architetures for the root partition
driver code.
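
As a rough illustration of what "fast" means here, the sketch below builds
the control word the same way the patch does: the 16-bit call code with the
fast flag OR-ed in. It is a userspace sketch only. DEMO_HYPERCALL_FAST_BIT is
a hypothetical stand-in for HV_HYPERCALL_FAST_BIT, its position (bit 16) is
an assumption, and no hypercall is actually issued.

```c
#include <stdint.h>

/* Assumption: stands in for the kernel's HV_HYPERCALL_FAST_BIT at bit 16. */
#define DEMO_HYPERCALL_FAST_BIT	(UINT64_C(1) << 16)

/* Build the control word for a fast (register-based) hypercall. */
static uint64_t demo_fast_control(uint16_t code)
{
	return (uint64_t)code | DEMO_HYPERCALL_FAST_BIT;
}

/* Recover the call code from the low 16 bits of a control word. */
static uint16_t demo_control_code(uint64_t control)
{
	return (uint16_t)(control & 0xffff);
}

/* Report whether the control word requests the register-based convention. */
static int demo_control_is_fast(uint64_t control)
{
	return (control & DEMO_HYPERCALL_FAST_BIT) != 0;
}
```

Because the code occupies only the low 16 bits, setting the flag never
disturbs it; the two 64-bit inputs then travel in registers alongside the
control word instead of through a memory page.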

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
 arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
 include/asm-generic/mshyperv.h    |  2 ++
 3 files changed, 31 insertions(+)

diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
index 69004f619c57..e33a9e3c366a 100644
--- a/arch/arm64/hyperv/hv_core.c
+++ b/arch/arm64/hyperv/hv_core.c
@@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
 }
 EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
 
+/*
+ * hv_do_fast_hypercall16 -- Invoke the specified hypercall
+ * with arguments in registers instead of physical memory.
+ * Avoids the overhead of virt_to_phys for simple hypercalls.
+ */
+u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
+{
+	struct arm_smccc_res	res;
+	u64			control;
+
+	control = (u64)code | HV_HYPERCALL_FAST_BIT;
+
+	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
+	return res.a0;
+}
+EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
+
 /*
  * Set a single VP register to a 64-bit value.
  */
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index 2e2f83bafcfb..2a900ba00622 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -40,6 +40,18 @@ static inline u64 hv_get_msr(unsigned int reg)
 	return hv_get_vpreg(reg);
 }
 
+/*
+ * Nested is not supported on arm64
+ */
+static inline void hv_set_non_nested_msr(unsigned int reg, u64 value)
+{
+	hv_set_msr(reg, value);
+}
+static inline u64 hv_get_non_nested_msr(unsigned int reg)
+{
+	return hv_get_msr(reg);
+}
+
 /* SMCCC hypercall parameters */
 #define HV_SMCCC_FUNC_NUMBER	1
 #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index c020d5d0ec2a..258034dfd829 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -72,6 +72,8 @@ extern void * __percpu *hyperv_pcpu_output_arg;
 
 extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
 extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
+extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
+
 bool hv_isolation_type_snp(void);
 bool hv_isolation_type_tdx(void);
 
-- 
2.34.1



* Re: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-02-26 23:07 ` [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64 Nuno Das Neves
@ 2025-02-26 23:27   ` Stanislav Kinsburskii
  2025-02-27  5:56   ` Easwar Hariharan
  2025-02-27 18:09   ` Roman Kisel
  2 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:27 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:07:57PM -0800, Nuno Das Neves wrote:
> These non-nested msr and fast hypercall functions are present in x86,
> but they must be available in both architetures for the root partition
> driver code.
> 

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>


* Re: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-02-26 23:07 ` [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64 Nuno Das Neves
  2025-02-26 23:27   ` Stanislav Kinsburskii
@ 2025-02-27  5:56   ` Easwar Hariharan
  2025-02-28  0:21     ` Nuno Das Neves
  2025-02-27 18:09   ` Roman Kisel
  2 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27  5:56 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> These non-nested msr and fast hypercall functions are present in x86,
> but they must be available in both architetures for the root partition

nit: *architectures*


> driver code.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
>  arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
>  include/asm-generic/mshyperv.h    |  2 ++
>  3 files changed, 31 insertions(+)
> 
> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> index 69004f619c57..e33a9e3c366a 100644
> --- a/arch/arm64/hyperv/hv_core.c
> +++ b/arch/arm64/hyperv/hv_core.c
> @@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
>  }
>  EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
>  
> +/*
> + * hv_do_fast_hypercall16 -- Invoke the specified hypercall
> + * with arguments in registers instead of physical memory.
> + * Avoids the overhead of virt_to_phys for simple hypercalls.
> + */
> +u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
> +{
> +	struct arm_smccc_res	res;
> +	u64			control;
> +
> +	control = (u64)code | HV_HYPERCALL_FAST_BIT;
> +
> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
> +	return res.a0;
> +}
> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
> +

I'd like this to have been in arch/arm64/include/asm/mshyperv.h like its x86
counterpart, but that's just my personal liking of symmetry. I see why it's here
with its slow and 8-byte brethren.

>  /*
>   * Set a single VP register to a 64-bit value.
>   */
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index 2e2f83bafcfb..2a900ba00622 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -40,6 +40,18 @@ static inline u64 hv_get_msr(unsigned int reg)
>  	return hv_get_vpreg(reg);
>  }
>  
> +/*
> + * Nested is not supported on arm64
> + */
> +static inline void hv_set_non_nested_msr(unsigned int reg, u64 value)
> +{
> +	hv_set_msr(reg, value);
> +}

empty line preferred here, also reported by checkpatch

> +static inline u64 hv_get_non_nested_msr(unsigned int reg)
> +{
> +	return hv_get_msr(reg);
> +}
> +
>  /* SMCCC hypercall parameters */
>  #define HV_SMCCC_FUNC_NUMBER	1
>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index c020d5d0ec2a..258034dfd829 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -72,6 +72,8 @@ extern void * __percpu *hyperv_pcpu_output_arg;
>  
>  extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>  extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
> +extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
> +

checkpatch warns against putting externs in header files, and FWIW, if hv_do_fast_hypercall16()
for arm64 were in arch/arm64/include/asm/mshyperv.h like its x86 counterpart, you probably
wouldn't need this?
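
Some background on the checkpatch warning, as a small sketch: in C a function
prototype has external linkage by default, so "extern" on it is redundant and
both declarations below name the same function. The function
demo_fast_hypercall16() and its body are hypothetical and exist only so the
example compiles and links; it does not issue a hypercall.

```c
#include <stdint.h>

/* With "extern": accepted by the compiler, flagged by checkpatch in headers. */
extern uint64_t demo_fast_hypercall16(uint16_t control, uint64_t input1,
				      uint64_t input2);

/* Without "extern": an identical declaration of the same function. */
uint64_t demo_fast_hypercall16(uint16_t control, uint64_t input1,
			       uint64_t input2);

/* Hypothetical body; it only combines its arguments for demonstration. */
uint64_t demo_fast_hypercall16(uint16_t control, uint64_t input1,
			       uint64_t input2)
{
	return (uint64_t)control + input1 + input2;
}
```

If the function were instead defined as a static inline in the arch header,
no separate prototype would be needed in the generic header at all.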

>  bool hv_isolation_type_snp(void);
>  bool hv_isolation_type_tdx(void);
>  



* Re: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-02-27  5:56   ` Easwar Hariharan
@ 2025-02-28  0:21     ` Nuno Das Neves
  2025-03-06 19:05       ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-28  0:21 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 9:56 PM, Easwar Hariharan wrote:
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>> These non-nested msr and fast hypercall functions are present in x86,
>> but they must be available in both architetures for the root partition
> 
> nit: *architectures*
> 
> 
Thanks!

>> driver code.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
>>  arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
>>  include/asm-generic/mshyperv.h    |  2 ++
>>  3 files changed, 31 insertions(+)
>>
>> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
>> index 69004f619c57..e33a9e3c366a 100644
>> --- a/arch/arm64/hyperv/hv_core.c
>> +++ b/arch/arm64/hyperv/hv_core.c
>> @@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
>>  }
>>  EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
>>  
>> +/*
>> + * hv_do_fast_hypercall16 -- Invoke the specified hypercall
>> + * with arguments in registers instead of physical memory.
>> + * Avoids the overhead of virt_to_phys for simple hypercalls.
>> + */
>> +u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>> +{
>> +	struct arm_smccc_res	res;
>> +	u64			control;
>> +
>> +	control = (u64)code | HV_HYPERCALL_FAST_BIT;
>> +
>> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
>> +	return res.a0;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
>> +
> 
> I'd like this to have been in arch/arm64/include/asm/mshyperv.h like its x86
> counterpart, but that's just my personal liking of symmetry. I see why it's here
> with its slow and 8-byte brethren.
> 
Good point, I don't see a good reason this can't be in the header.

>>  /*
>>   * Set a single VP register to a 64-bit value.
>>   */
>> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
>> index 2e2f83bafcfb..2a900ba00622 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -40,6 +40,18 @@ static inline u64 hv_get_msr(unsigned int reg)
>>  	return hv_get_vpreg(reg);
>>  }
>>  
>> +/*
>> + * Nested is not supported on arm64
>> + */
>> +static inline void hv_set_non_nested_msr(unsigned int reg, u64 value)
>> +{
>> +	hv_set_msr(reg, value);
>> +}
> 
> empty line preferred here, also reported by checkpatch
> 
Good point, missed that one...

>> +static inline u64 hv_get_non_nested_msr(unsigned int reg)
>> +{
>> +	return hv_get_msr(reg);
>> +}
>> +
>>  /* SMCCC hypercall parameters */
>>  #define HV_SMCCC_FUNC_NUMBER	1
>>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index c020d5d0ec2a..258034dfd829 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -72,6 +72,8 @@ extern void * __percpu *hyperv_pcpu_output_arg;
>>  
>>  extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>>  extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
>> +extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
>> +
> 
> checkpatch warns against putting externs in header files, and FWIW, if hv_do_fast_hypercall16()
> for arm64 were in arch/arm64/include/asm/mshyperv.h like its x86 counterpart, you probably
> wouldn't need this?
> 
Yes I wondered about that warning. That's true, if I just put it in the arm64 header
then this won't be needed at all, so I might just do that!

>>  bool hv_isolation_type_snp(void);
>>  bool hv_isolation_type_tdx(void);
>>  



* RE: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-02-28  0:21     ` Nuno Das Neves
@ 2025-03-06 19:05       ` Michael Kelley
  2025-03-07 21:36         ` Nuno Das Neves
  0 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 19:05 UTC (permalink / raw)
  To: Nuno Das Neves, Easwar Hariharan
  Cc: linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	ssengar@linux.microsoft.com, apais@linux.microsoft.com,
	Tianyu.Lan@microsoft.com, stanislav.kinsburskiy@gmail.com,
	gregkh@linuxfoundation.org, vkuznets@redhat.com,
	prapal@linux.microsoft.com, muislam@microsoft.com,
	anrayabh@linux.microsoft.com, rafael@kernel.org, lenb@kernel.org,
	corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, February 27, 2025 4:21 PM
> 
> On 2/26/2025 9:56 PM, Easwar Hariharan wrote:
> > On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> >> These non-nested msr and fast hypercall functions are present in x86,
> >> but they must be available in both architetures for the root partition
> >
> > nit: *architectures*
> >
> >
> Thanks!
> 
> >> driver code.
> >>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> ---
> >>  arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
> >>  arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
> >>  include/asm-generic/mshyperv.h    |  2 ++
> >>  3 files changed, 31 insertions(+)
> >>
> >> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> >> index 69004f619c57..e33a9e3c366a 100644
> >> --- a/arch/arm64/hyperv/hv_core.c
> >> +++ b/arch/arm64/hyperv/hv_core.c
> >> @@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
> >>  }
> >>  EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
> >>
> >> +/*
> >> + * hv_do_fast_hypercall16 -- Invoke the specified hypercall
> >> + * with arguments in registers instead of physical memory.
> >> + * Avoids the overhead of virt_to_phys for simple hypercalls.
> >> + */
> >> +u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
> >> +{
> >> +	struct arm_smccc_res	res;
> >> +	u64			control;
> >> +
> >> +	control = (u64)code | HV_HYPERCALL_FAST_BIT;
> >> +
> >> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
> >> +	return res.a0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
> >> +
> >
> > I'd like this to have been in arch/arm64/include/asm/mshyperv.h like its x86
> > counterpart, but that's just my personal liking of symmetry. I see why it's here
> > with its slow and 8-byte brethren.
> >
> Good point, I don't see a good reason this can't be in the header.

I was trying to remember if there was some reason I originally put
hv_do_hypercall() and hv_do_fast_hypercall8() in the .c file instead of
the header like on x86. But I don't remember a reason. During
development, the code changed several times, and there might have
been a reason that didn't persist in the version that was finally
accepted upstream.

My only comment is that hv_do_hypercall() and the 8 and 16 "fast"
versions should probably stay together in one place on the arm64 side,
even if it doesn't match x86.

> 
> >>  /*
> >>   * Set a single VP register to a 64-bit value.
> >>   */
> >> diff --git a/arch/arm64/include/asm/mshyperv.h
> b/arch/arm64/include/asm/mshyperv.h
> >> index 2e2f83bafcfb..2a900ba00622 100644
> >> --- a/arch/arm64/include/asm/mshyperv.h
> >> +++ b/arch/arm64/include/asm/mshyperv.h
> >> @@ -40,6 +40,18 @@ static inline u64 hv_get_msr(unsigned int reg)
> >>  	return hv_get_vpreg(reg);
> >>  }
> >>
> >> +/*
> >> + * Nested is not supported on arm64
> >> + */
> >> +static inline void hv_set_non_nested_msr(unsigned int reg, u64 value)
> >> +{
> >> +	hv_set_msr(reg, value);
> >> +}
> >
> > empty line preferred here, also reported by checkpatch
> >
> Good point, missed that one...
> 
> >> +static inline u64 hv_get_non_nested_msr(unsigned int reg)
> >> +{
> >> +	return hv_get_msr(reg);
> >> +}
> >> +
> >>  /* SMCCC hypercall parameters */
> >>  #define HV_SMCCC_FUNC_NUMBER	1
> >>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
> >> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> >> index c020d5d0ec2a..258034dfd829 100644
> >> --- a/include/asm-generic/mshyperv.h
> >> +++ b/include/asm-generic/mshyperv.h
> >> @@ -72,6 +72,8 @@ extern void * __percpu *hyperv_pcpu_output_arg;
> >>
> >>  extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
> >>  extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
> >> +extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
> >> +
> >
> > checkpatch warns against putting externs in header files, and FWIW, if
> hv_do_fast_hypercall16()
> > for arm64 were in arch/arm64/include/asm/mshyperv.h like its x86 counterpart, you
> probably
> > wouldn't need this?
> >
> Yes I wondered about that warning. That's true, if I just put it in the arm64 header
> then this won't be needed at all, so I might just do that!

I always thought the checkpatch warning was simply that "extern" on a function
declaration is superfluous. You can omit "extern" and nothing changes. Of
course, the same is not true for data items.
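
A minimal sketch of that distinction, in standalone C rather than kernel code. The names demo_hypercall and demo_counter are hypothetical and exist only for illustration:

```c
#include <assert.h>

/*
 * Hypothetical names, not kernel symbols.
 *
 * Functions have external linkage by default, so these two
 * prototypes declare exactly the same thing.
 */
extern int demo_hypercall(int code);
int demo_hypercall(int code);

/*
 * For a data item the keyword matters: with "extern" this is only
 * a declaration; without it, it would be a tentative definition.
 */
extern int demo_counter;

/* The single real definition of the data item. */
int demo_counter = 41;

int demo_hypercall(int code)
{
	return code + demo_counter;
}
```

Dropping "extern" from the function prototype leaves the program unchanged. Dropping it from the object declaration would turn that line into a tentative definition, which is why the keyword is still needed for data items declared in headers.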

Michael

> 
> >>  bool hv_isolation_type_snp(void);
> >>  bool hv_isolation_type_tdx(void);
> >>



* Re: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-03-06 19:05       ` Michael Kelley
@ 2025-03-07 21:36         ` Nuno Das Neves
  2025-03-07 21:55           ` Easwar Hariharan
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 21:36 UTC (permalink / raw)
  To: Michael Kelley, Easwar Hariharan
  Cc: linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	ssengar@linux.microsoft.com, apais@linux.microsoft.com,
	Tianyu.Lan@microsoft.com, stanislav.kinsburskiy@gmail.com,
	gregkh@linuxfoundation.org, vkuznets@redhat.com,
	prapal@linux.microsoft.com, muislam@microsoft.com,
	anrayabh@linux.microsoft.com, rafael@kernel.org, lenb@kernel.org,
	corbet@lwn.net

On 3/6/2025 11:05 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, February 27, 2025 4:21 PM
>>
>> On 2/26/2025 9:56 PM, Easwar Hariharan wrote:
>>> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>>>> These non-nested msr and fast hypercall functions are present in x86,
>>>> but they must be available in both architetures for the root partition
>>>
>>> nit: *architectures*
>>>
>>>
>> Thanks!
>>
>>>> driver code.
>>>>
>>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>> ---
>>>>  arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
>>>>  arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
>>>>  include/asm-generic/mshyperv.h    |  2 ++
>>>>  3 files changed, 31 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
>>>> index 69004f619c57..e33a9e3c366a 100644
>>>> --- a/arch/arm64/hyperv/hv_core.c
>>>> +++ b/arch/arm64/hyperv/hv_core.c
>>>> @@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
>>>>
>>>> +/*
>>>> + * hv_do_fast_hypercall16 -- Invoke the specified hypercall
>>>> + * with arguments in registers instead of physical memory.
>>>> + * Avoids the overhead of virt_to_phys for simple hypercalls.
>>>> + */
>>>> +u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>>>> +{
>>>> +	struct arm_smccc_res	res;
>>>> +	u64			control;
>>>> +
>>>> +	control = (u64)code | HV_HYPERCALL_FAST_BIT;
>>>> +
>>>> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
>>>> +	return res.a0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
>>>> +
>>>
>>> I'd like this to have been in arch/arm64/include/asm/mshyperv.h like its x86
>>> counterpart, but that's just my personal liking of symmetry. I see why it's here
>>> with its slow and 8-byte brethren.
>>>
>> Good point, I don't see a good reason this can't be in the header.
> 
> I was trying to remember if there was some reason I originally put
> hv_do_hypercall() and hv_do_fast_hypercall8() in the .c file instead of
> the header like on x86. But I don't remember a reason. During
> development, the code changed several times, and there might have
> been a reason that didn't persist in the version that was finally
> accepted upstream.
> 
> My only comment is that hv_do_hypercall() and the 8 and 16 "fast"
> versions should probably stay together in one place on the arm64 side,
> even if it doesn't match x86.
> 

I think I'll just keep them together here for now then. They
could be moved to the header in future if it seems worth doing.

>>
>>>>  /*
>>>>   * Set a single VP register to a 64-bit value.
>>>>   */
>>>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>>>> index 2e2f83bafcfb..2a900ba00622 100644
>>>> --- a/arch/arm64/include/asm/mshyperv.h
>>>> +++ b/arch/arm64/include/asm/mshyperv.h
>>>> @@ -40,6 +40,18 @@ static inline u64 hv_get_msr(unsigned int reg)
>>>>  	return hv_get_vpreg(reg);
>>>>  }
>>>>
>>>> +/*
>>>> + * Nested is not supported on arm64
>>>> + */
>>>> +static inline void hv_set_non_nested_msr(unsigned int reg, u64 value)
>>>> +{
>>>> +	hv_set_msr(reg, value);
>>>> +}
>>>
>>> empty line preferred here, also reported by checkpatch
>>>
>> Good point, missed that one...
>>
>>>> +static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>>> +{
>>>> +	return hv_get_msr(reg);
>>>> +}
>>>> +
>>>>  /* SMCCC hypercall parameters */
>>>>  #define HV_SMCCC_FUNC_NUMBER	1
>>>>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
>>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>>> index c020d5d0ec2a..258034dfd829 100644
>>>> --- a/include/asm-generic/mshyperv.h
>>>> +++ b/include/asm-generic/mshyperv.h
>>>> @@ -72,6 +72,8 @@ extern void * __percpu *hyperv_pcpu_output_arg;
>>>>
>>>>  extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>>>>  extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
>>>> +extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
>>>> +
>>>
>>> checkpatch warns against putting externs in header files, and FWIW, if
>> hv_do_fast_hypercall16()
>>> for arm64 were in arch/arm64/include/asm/mshyperv.h like its x86 counterpart, you
>> probably
>>> wouldn't need this?
>>>
>> Yes I wondered about that warning. That's true, if I just put it in the arm64 header
>> then this won't be needed at all, so I might just do that!
> 
> I always thought the checkpatch warning was simply that "extern" on a function
> declaration is superfluous. You can omit "extern" and nothing changes. Of
> course, the same is not true for data items.
> 
Good point, I think I'll clean up these "extern"s in the next version.

Nuno

> Michael
> 
>>
>>>>  bool hv_isolation_type_snp(void);
>>>>  bool hv_isolation_type_tdx(void);
>>>>
> 



* Re: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-03-07 21:36         ` Nuno Das Neves
@ 2025-03-07 21:55           ` Easwar Hariharan
  0 siblings, 0 replies; 108+ messages in thread
From: Easwar Hariharan @ 2025-03-07 21:55 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: Michael Kelley, eahariha, linux-hyperv@vger.kernel.org,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	ssengar@linux.microsoft.com, apais@linux.microsoft.com,
	Tianyu.Lan@microsoft.com, stanislav.kinsburskiy@gmail.com,
	gregkh@linuxfoundation.org, vkuznets@redhat.com,
	prapal@linux.microsoft.com, muislam@microsoft.com,
	anrayabh@linux.microsoft.com, rafael@kernel.org, lenb@kernel.org,
	corbet@lwn.net

On 3/7/2025 1:36 PM, Nuno Das Neves wrote:
> On 3/6/2025 11:05 AM, Michael Kelley wrote:
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, February 27, 2025 4:21 PM
>>>
>>> On 2/26/2025 9:56 PM, Easwar Hariharan wrote:
>>>> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>>>>> These non-nested msr and fast hypercall functions are present in x86,
>>>>> but they must be available in both architetures for the root partition
>>>>
>>>> nit: *architectures*
>>>>
>>>>
>>> Thanks!
>>>
>>>>> driver code.
>>>>>
>>>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>>> ---
>>>>>  arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
>>>>>  arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
>>>>>  include/asm-generic/mshyperv.h    |  2 ++
>>>>>  3 files changed, 31 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
>>>>> index 69004f619c57..e33a9e3c366a 100644
>>>>> --- a/arch/arm64/hyperv/hv_core.c
>>>>> +++ b/arch/arm64/hyperv/hv_core.c
>>>>> @@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
>>>>>
>>>>> +/*
>>>>> + * hv_do_fast_hypercall16 -- Invoke the specified hypercall
>>>>> + * with arguments in registers instead of physical memory.
>>>>> + * Avoids the overhead of virt_to_phys for simple hypercalls.
>>>>> + */
>>>>> +u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>>>>> +{
>>>>> +	struct arm_smccc_res	res;
>>>>> +	u64			control;
>>>>> +
>>>>> +	control = (u64)code | HV_HYPERCALL_FAST_BIT;
>>>>> +
>>>>> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
>>>>> +	return res.a0;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
>>>>> +
>>>>
>>>> I'd like this to have been in arch/arm64/include/asm/mshyperv.h like its x86
>>>> counterpart, but that's just my personal liking of symmetry. I see why it's here
>>>> with its slow and 8-byte brethren.
>>>>
>>> Good point, I don't see a good reason this can't be in the header.
>>
>> I was trying to remember if there was some reason I originally put
>> hv_do_hypercall() and hv_do_fast_hypercall8() in the .c file instead of
>> the header like on x86. But I don't remember a reason. During
>> development, the code changed several times, and there might have
>> been a reason that didn't persist in the version that was finally
>> accepted upstream.
>>
>> My only comment is that hv_do_hypercall() and the 8 and 16 "fast"
>> versions should probably stay together in one place on the arm64 side,
>> even if it doesn't match x86.
>>
> 
> I think I'll just keep them together here for now then. They
> could be moved to the header in future if it seems worth doing.
> 

I was really hoping the answer here would be to move all of them together to the header,
but oh well.

<snip>


* Re: [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64
  2025-02-26 23:07 ` [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64 Nuno Das Neves
  2025-02-26 23:27   ` Stanislav Kinsburskii
  2025-02-27  5:56   ` Easwar Hariharan
@ 2025-02-27 18:09   ` Roman Kisel
  2 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 18:09 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> These non-nested msr and fast hypercall functions are present in x86,
> but they must be available in both architetures for the root partition
> driver code.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>   arch/arm64/hyperv/hv_core.c       | 17 +++++++++++++++++
>   arch/arm64/include/asm/mshyperv.h | 12 ++++++++++++
>   include/asm-generic/mshyperv.h    |  2 ++
>   3 files changed, 31 insertions(+)
> 
> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> index 69004f619c57..e33a9e3c366a 100644
> --- a/arch/arm64/hyperv/hv_core.c
> +++ b/arch/arm64/hyperv/hv_core.c
> @@ -53,6 +53,23 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
>   }
>   EXPORT_SYMBOL_GPL(hv_do_fast_hypercall8);
>   
> +/*
> + * hv_do_fast_hypercall16 -- Invoke the specified hypercall
> + * with arguments in registers instead of physical memory.
> + * Avoids the overhead of virt_to_phys for simple hypercalls.
> + */
> +u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
> +{
> +	struct arm_smccc_res	res;
> +	u64			control;
> +
> +	control = (u64)code | HV_HYPERCALL_FAST_BIT;
> +
> +	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
> +	return res.a0;
> +}
> +EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
> +
>   /*
>    * Set a single VP register to a 64-bit value.
>    */
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index 2e2f83bafcfb..2a900ba00622 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -40,6 +40,18 @@ static inline u64 hv_get_msr(unsigned int reg)
>   	return hv_get_vpreg(reg);
>   }
>   
> +/*
> + * Nested is not supported on arm64
> + */
> +static inline void hv_set_non_nested_msr(unsigned int reg, u64 value)
> +{
> +	hv_set_msr(reg, value);
> +}
> +static inline u64 hv_get_non_nested_msr(unsigned int reg)
> +{
> +	return hv_get_msr(reg);
> +}
> +
>   /* SMCCC hypercall parameters */
>   #define HV_SMCCC_FUNC_NUMBER	1
>   #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index c020d5d0ec2a..258034dfd829 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -72,6 +72,8 @@ extern void * __percpu *hyperv_pcpu_output_arg;
>   
>   extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>   extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
> +extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
> +
>   bool hv_isolation_type_snp(void);
>   bool hv_isolation_type_tdx(void);
>   

Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

-- 
Thank you,
Roman



* [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (2 preceding siblings ...)
  2025-02-26 23:07 ` [PATCH v5 03/10] arm64/hyperv: Add some missing functions to arm64 Nuno Das Neves
@ 2025-02-26 23:07 ` Nuno Das Neves
  2025-02-26 23:28   ` Stanislav Kinsburskii
                     ` (4 more replies)
  2025-02-26 23:07 ` [PATCH v5 05/10] acpi: numa: Export node_to_pxm() Nuno Das Neves
                   ` (5 subsequent siblings)
  9 siblings, 5 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:07 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

Factor out the check for enabling auto eoi, to be reused in root
partition code.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/hv.c                | 12 +-----------
 include/asm-generic/mshyperv.h | 13 +++++++++++++
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index a38f84548bc2..308c8f279df8 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -313,17 +313,7 @@ void hv_synic_enable_regs(unsigned int cpu)
 
 	shared_sint.vector = vmbus_interrupt;
 	shared_sint.masked = false;
-
-	/*
-	 * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
-	 * it doesn't provide a recommendation flag and AEOI must be disabled.
-	 */
-#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
-	shared_sint.auto_eoi =
-			!(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
-#else
-	shared_sint.auto_eoi = 0;
-#endif
+	shared_sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_msr(HV_MSR_SINT0 + VMBUS_MESSAGE_SINT, shared_sint.as_uint64);
 
 	/* Enable the global synic bit */
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 258034dfd829..1f46d19a16aa 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -77,6 +77,19 @@ extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
 bool hv_isolation_type_snp(void);
 bool hv_isolation_type_tdx(void);
 
+/*
+ * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
+ * it doesn't provide a recommendation flag and AEOI must be disabled.
+ */
+static inline bool hv_recommend_using_aeoi(void)
+{
+#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
+	return !(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
+#else
+	return false;
+#endif
+}
+
 static inline struct hv_proximity_domain_info hv_numa_node_to_pxm_info(int node)
 {
 	struct hv_proximity_domain_info pxm_info = {};
-- 
2.34.1



* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
@ 2025-02-26 23:28   ` Stanislav Kinsburskii
  2025-02-27 18:04   ` Roman Kisel
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:28 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:07:58PM -0800, Nuno Das Neves wrote:
> Factor out the check for enabling auto eoi, to be reused in root
> partition code.
> 

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>


* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
  2025-02-26 23:28   ` Stanislav Kinsburskii
@ 2025-02-27 18:04   ` Roman Kisel
  2025-02-28  0:21     ` Nuno Das Neves
  2025-02-27 23:03   ` Easwar Hariharan
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 18:04 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> Factor out the check for enabling auto eoi, to be reused in root
> partition code.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---

I think adding "No functional changes" would bring some benefit:
that's an additional invariant to check against when reviewing.

-- 
Thank you,
Roman



* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-27 18:04   ` Roman Kisel
@ 2025-02-28  0:21     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-28  0:21 UTC (permalink / raw)
  To: Roman Kisel, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 2/27/2025 10:04 AM, Roman Kisel wrote:
> 
> 
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>> Factor out the check for enabling auto eoi, to be reused in root
>> partition code.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
> 
> I think adding "No functional changes" would bring some benefit:
> that's an additional invariant to check against when reviewing.
> 
Thanks, I can add it for next version :)


* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
  2025-02-26 23:28   ` Stanislav Kinsburskii
  2025-02-27 18:04   ` Roman Kisel
@ 2025-02-27 23:03   ` Easwar Hariharan
  2025-02-28  0:33     ` Nuno Das Neves
  2025-03-06 19:12   ` Michael Kelley
  2025-03-10 12:51   ` Tianyu Lan
  4 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27 23:03 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> Factor out the check for enabling auto eoi, to be reused in root
> partition code.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/hv.c                | 12 +-----------
>  include/asm-generic/mshyperv.h | 13 +++++++++++++
>  2 files changed, 14 insertions(+), 11 deletions(-)
> 

<snip>

> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 258034dfd829..1f46d19a16aa 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -77,6 +77,19 @@ extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
>  bool hv_isolation_type_snp(void);
>  bool hv_isolation_type_tdx(void);
>  
> +/*
> + * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
> + * it doesn't provide a recommendation flag and AEOI must be disabled.
> + */
> +static inline bool hv_recommend_using_aeoi(void)
> +{
> +#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
> +	return !(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
> +#else
> +	return false;
> +#endif
> +}
> +

I must be missing something very basic here, and if so, I apologize, and please enlighten me.

HV_DEPRECATING_AEOI_RECOMMENDED is defined as BIT(9) in include/hyperv/hvgdk_mini.h, and
asm-generic/mshyperv.h includes that via include/hyperv/hvhdk.h.

If this is the case, when would HV_DEPRECATING_AEOI_RECOMMENDED ever be not defined?
If it's always defined, do we need the #ifdef?

Thanks,
Easwar (he/him)

* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-27 23:03   ` Easwar Hariharan
@ 2025-02-28  0:33     ` Nuno Das Neves
  2025-02-28  0:49       ` Easwar Hariharan
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-28  0:33 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/27/2025 3:03 PM, Easwar Hariharan wrote:
> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>> Factor out the check for enabling auto eoi, to be reused in root
>> partition code.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  drivers/hv/hv.c                | 12 +-----------
>>  include/asm-generic/mshyperv.h | 13 +++++++++++++
>>  2 files changed, 14 insertions(+), 11 deletions(-)
>>
> 
> <snip>
> 
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index 258034dfd829..1f46d19a16aa 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -77,6 +77,19 @@ extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
>>  bool hv_isolation_type_snp(void);
>>  bool hv_isolation_type_tdx(void);
>>  
>> +/*
>> + * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
>> + * it doesn't provide a recommendation flag and AEOI must be disabled.
>> + */
>> +static inline bool hv_recommend_using_aeoi(void)
>> +{
>> +#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
>> +	return !(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
>> +#else
>> +	return false;
>> +#endif
>> +}
>> +
> 
> I must be missing something very basic here, and if so, I apologize, and please enlighten me.
> 
> HV_DEPRECATING_AEOI_RECOMMENDED is defined as BIT(9) in include/hyperv/hvgdk_mini.h, and
> asm-generic/mshyperv.h includes that via include/hyperv/hvhdk.h.
> 
> If this is the case, when would HV_DEPRECATING_AEOI_RECOMMENDED ever be not defined?
> If it's always defined, do we need the #ifdef?
> 
HV_DEPRECATING_AEOI_RECOMMENDED is only defined on x86 (it used to live in x86 hyperv-tlfs.h).
It lives inside a #if defined(CONFIG_X86) block in hvgdk_mini.h. It is a bit confusing since
it is surrounded by other x86-only definitions which are prefixed with HV_X64_.
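
For illustration only, the pattern can be sketched in userspace as below. This
is not the actual header contents: DEMO_ARCH_X86 is a hypothetical switch
standing in for CONFIG_X86, and demo_hints is a hypothetical stand-in for
ms_hyperv.hints. Without -DDEMO_ARCH_X86 the flag is never defined, so the
#else branch is the one that gets compiled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in for the x86-only block in hvgdk_mini.h. DEMO_ARCH_X86 is a
 * hypothetical switch playing the role of CONFIG_X86.
 */
#ifdef DEMO_ARCH_X86
#define HV_DEPRECATING_AEOI_RECOMMENDED (1u << 9)	/* BIT(9) */
#endif

/* Hypothetical stand-in for ms_hyperv.hints. */
static uint32_t demo_hints;

static inline bool demo_recommend_using_aeoi(void)
{
#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
	/* Flag exists: use AEOI unless the hypervisor recommends against it. */
	return !(demo_hints & HV_DEPRECATING_AEOI_RECOMMENDED);
#else
	/* Flag not defined on this architecture: AEOI stays disabled. */
	return false;
#endif
}

/* Small self-check covering whichever branch was compiled in. */
static void demo_self_check(void)
{
	demo_hints = 0;
#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
	assert(demo_recommend_using_aeoi());	/* no deprecation hint set */
	demo_hints = HV_DEPRECATING_AEOI_RECOMMENDED;
#endif
	assert(!demo_recommend_using_aeoi());	/* hint set, or flag undefined */
}
```

Building the same file with and without -DDEMO_ARCH_X86 exercises both
branches, which is the situation the #ifdef in the patch is guarding.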

Thanks
Nuno

> Thanks,
> Easwar (he/him)


* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-28  0:33     ` Nuno Das Neves
@ 2025-02-28  0:49       ` Easwar Hariharan
  0 siblings, 0 replies; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-28  0:49 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: eahariha, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/27/2025 4:33 PM, Nuno Das Neves wrote:
> On 2/27/2025 3:03 PM, Easwar Hariharan wrote:
>> On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
>>> Factor out the check for enabling auto eoi, to be reused in root
>>> partition code.
>>>
>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>> ---
>>>  drivers/hv/hv.c                | 12 +-----------
>>>  include/asm-generic/mshyperv.h | 13 +++++++++++++
>>>  2 files changed, 14 insertions(+), 11 deletions(-)
>>>
>>
>> <snip>
>>
>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>> index 258034dfd829..1f46d19a16aa 100644
>>> --- a/include/asm-generic/mshyperv.h
>>> +++ b/include/asm-generic/mshyperv.h
>>> @@ -77,6 +77,19 @@ extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
>>>  bool hv_isolation_type_snp(void);
>>>  bool hv_isolation_type_tdx(void);
>>>  
>>> +/*
>>> + * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
>>> + * it doesn't provide a recommendation flag and AEOI must be disabled.
>>> + */
>>> +static inline bool hv_recommend_using_aeoi(void)
>>> +{
>>> +#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
>>> +	return !(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
>>> +#else
>>> +	return false;
>>> +#endif
>>> +}
>>> +
>>
>> I must be missing something very basic here, and if so, I apologize, and please enlighten me.
>>
>> HV_DEPRECATING_AEOI_RECOMMENDED is defined as BIT(9) in include/hyperv/hvgdk_mini.h, and
>> asm-generic/mshyperv.h includes that via include/hyperv/hvhdk.h.
>>
>> If this is the case, when would HV_DEPRECATING_AEOI_RECOMMENDED ever be not defined?
>> If it's always defined, do we need the #ifdef?
>>
> HV_DEPRECATING_AEOI_RECOMMENDED is only defined on x86 (it used to live in x86 hyperv-tlfs.h).
> It lives inside a #if defined(CONFIG_X86) block in hvgdk_mini.h. It is a bit confusing since
> it is surrounded by other x86-only definitions which are prefixed with HV_X64_.
> 

Ah, thank you. I knew it must be something glaringly obvious in hindsight. With that resolved,

Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>

* RE: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-02-27 23:03   ` Easwar Hariharan
@ 2025-03-06 19:12   ` Michael Kelley
  2025-03-10 12:51   ` Tianyu Lan
  4 siblings, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 19:12 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> 
> Factor out the check for enabling auto eoi, to be reused in root
> partition code.

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/hv.c                | 12 +-----------
>  include/asm-generic/mshyperv.h | 13 +++++++++++++
>  2 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index a38f84548bc2..308c8f279df8 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -313,17 +313,7 @@ void hv_synic_enable_regs(unsigned int cpu)
> 
>  	shared_sint.vector = vmbus_interrupt;
>  	shared_sint.masked = false;
> -
> -	/*
> -	 * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
> -	 * it doesn't provide a recommendation flag and AEOI must be disabled.
> -	 */
> -#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
> -	shared_sint.auto_eoi =
> -			!(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
> -#else
> -	shared_sint.auto_eoi = 0;
> -#endif
> +	shared_sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_msr(HV_MSR_SINT0 + VMBUS_MESSAGE_SINT, shared_sint.as_uint64);
> 
>  	/* Enable the global synic bit */
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 258034dfd829..1f46d19a16aa 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -77,6 +77,19 @@ extern u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
>  bool hv_isolation_type_snp(void);
>  bool hv_isolation_type_tdx(void);
> 
> +/*
> + * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
> + * it doesn't provide a recommendation flag and AEOI must be disabled.
> + */
> +static inline bool hv_recommend_using_aeoi(void)
> +{
> +#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
> +	return !(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
> +#else
> +	return false;
> +#endif
> +}
> +
>  static inline struct hv_proximity_domain_info hv_numa_node_to_pxm_info(int node)
>  {
>  	struct hv_proximity_domain_info pxm_info = {};
> --
> 2.34.1


* Re: [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi()
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
                     ` (3 preceding siblings ...)
  2025-03-06 19:12   ` Michael Kelley
@ 2025-03-10 12:51   ` Tianyu Lan
  4 siblings, 0 replies; 108+ messages in thread
From: Tianyu Lan @ 2025-03-10 12:51 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> Factor out the check for enabling auto eoi, to be reused in root
> partition code.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---

Reviewed-by: Tianyu Lan <tiala@microsoft.com>

-- 
Thanks
Tianyu Lan

* [PATCH v5 05/10] acpi: numa: Export node_to_pxm()
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (3 preceding siblings ...)
  2025-02-26 23:07 ` [PATCH v5 04/10] hyperv: Introduce hv_recommend_using_aeoi() Nuno Das Neves
@ 2025-02-26 23:07 ` Nuno Das Neves
  2025-02-26 23:31   ` Stanislav Kinsburskii
                     ` (3 more replies)
  2025-02-26 23:08 ` [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module Nuno Das Neves
                   ` (4 subsequent siblings)
  9 siblings, 4 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:07 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

node_to_pxm() is used by hv_numa_node_to_pxm_info().
That helper will be used by Hyper-V root partition module code
when CONFIG_MSHV_ROOT=m.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/acpi/numa/srat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 00ac0d7bb8c9..ce815d7cb8f6 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -51,6 +51,7 @@ int node_to_pxm(int node)
 		return PXM_INVAL;
 	return node_to_pxm_map[node];
 }
+EXPORT_SYMBOL_GPL(node_to_pxm);
 
 static void __acpi_map_pxm_to_node(int pxm, int node)
 {
-- 
2.34.1
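
A rough userspace sketch of why the export is needed, under stated
assumptions: every demo_* name below is hypothetical, the table contents are
made up, and the bounds check is simplified. The point it illustrates is that
a static inline helper in a header is compiled into each caller, so a module
calling the helper ends up calling the out-of-line function directly and needs
that symbol to be exported.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NO_NODE	(-1)	/* stand-in for NUMA_NO_NODE */
#define DEMO_PXM_INVAL	(-1)	/* stand-in for PXM_INVAL */
#define DEMO_MAX_NODES	4

/* Stand-in for the mapping table private to srat.c (values made up). */
static const int demo_node_to_pxm_map[DEMO_MAX_NODES] = { 10, 11, 12, 13 };

/* Out-of-line function: in the kernel, a module can only call it if exported. */
int demo_node_to_pxm(int node)
{
	if (node < 0 || node >= DEMO_MAX_NODES)
		return DEMO_PXM_INVAL;
	return demo_node_to_pxm_map[node];
}

/* Simplified stand-in for struct hv_proximity_domain_info. */
struct demo_pxm_info {
	uint32_t domain_id;
	int valid;
};

/* Header-style static inline helper: its body is compiled into every caller. */
static inline struct demo_pxm_info demo_node_to_pxm_info(int node)
{
	struct demo_pxm_info info = { 0 };

	if (node != DEMO_NO_NODE) {
		info.domain_id = (uint32_t)demo_node_to_pxm(node);
		info.valid = 1;
	}
	return info;
}

/* Small self-check of the helper and the lookup it wraps. */
static void demo_pxm_self_check(void)
{
	assert(demo_node_to_pxm_info(2).domain_id == 12);
	assert(demo_node_to_pxm_info(DEMO_NO_NODE).valid == 0);
	assert(demo_node_to_pxm(7) == DEMO_PXM_INVAL);
}
```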


* Re: [PATCH v5 05/10] acpi: numa: Export node_to_pxm()
  2025-02-26 23:07 ` [PATCH v5 05/10] acpi: numa: Export node_to_pxm() Nuno Das Neves
@ 2025-02-26 23:31   ` Stanislav Kinsburskii
  2025-02-27 23:05   ` Easwar Hariharan
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:31 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:07:59PM -0800, Nuno Das Neves wrote:
> node_to_pxm() is used by hv_numa_node_to_pxm_info().
> That helper will be used by Hyper-V root partition module code
> when CONFIG_MSHV_ROOT=m.
> 

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

* Re: [PATCH v5 05/10] acpi: numa: Export node_to_pxm()
  2025-02-26 23:07 ` [PATCH v5 05/10] acpi: numa: Export node_to_pxm() Nuno Das Neves
  2025-02-26 23:31   ` Stanislav Kinsburskii
@ 2025-02-27 23:05   ` Easwar Hariharan
  2025-03-06 19:16   ` Michael Kelley
  2025-03-10 12:50   ` Tianyu Lan
  3 siblings, 0 replies; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27 23:05 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:07 PM, Nuno Das Neves wrote:
> node_to_pxm() is used by hv_numa_node_to_pxm_info().
> That helper will be used by Hyper-V root partition module code
> when CONFIG_MSHV_ROOT=m.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/acpi/numa/srat.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 00ac0d7bb8c9..ce815d7cb8f6 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -51,6 +51,7 @@ int node_to_pxm(int node)
>  		return PXM_INVAL;
>  	return node_to_pxm_map[node];
>  }
> +EXPORT_SYMBOL_GPL(node_to_pxm);
>  
>  static void __acpi_map_pxm_to_node(int pxm, int node)
>  {

FWIW,

Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>

* RE: [PATCH v5 05/10] acpi: numa: Export node_to_pxm()
  2025-02-26 23:07 ` [PATCH v5 05/10] acpi: numa: Export node_to_pxm() Nuno Das Neves
  2025-02-26 23:31   ` Stanislav Kinsburskii
  2025-02-27 23:05   ` Easwar Hariharan
@ 2025-03-06 19:16   ` Michael Kelley
  2025-03-10 12:50   ` Tianyu Lan
  3 siblings, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 19:16 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> 
> node_to_pxm() is used by hv_numa_node_to_pxm_info().
> That helper will be used by Hyper-V root partition module code
> when CONFIG_MSHV_ROOT=m.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> ---
>  drivers/acpi/numa/srat.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 00ac0d7bb8c9..ce815d7cb8f6 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -51,6 +51,7 @@ int node_to_pxm(int node)
>  		return PXM_INVAL;
>  	return node_to_pxm_map[node];
>  }
> +EXPORT_SYMBOL_GPL(node_to_pxm);
> 
>  static void __acpi_map_pxm_to_node(int pxm, int node)
>  {
> --
> 2.34.1


* Re: [PATCH v5 05/10] acpi: numa: Export node_to_pxm()
  2025-02-26 23:07 ` [PATCH v5 05/10] acpi: numa: Export node_to_pxm() Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-03-06 19:16   ` Michael Kelley
@ 2025-03-10 12:50   ` Tianyu Lan
  3 siblings, 0 replies; 108+ messages in thread
From: Tianyu Lan @ 2025-03-10 12:50 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Feb 27, 2025 at 7:10 AM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> node_to_pxm() is used by hv_numa_node_to_pxm_info().
> That helper will be used by Hyper-V root partition module code
> when CONFIG_MSHV_ROOT=m.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---

Reviewed-by: Tianyu Lan <tiala@microsoft.com>


-- 
Thanks
Tianyu Lan

* [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (4 preceding siblings ...)
  2025-02-26 23:07 ` [PATCH v5 05/10] acpi: numa: Export node_to_pxm() Nuno Das Neves
@ 2025-02-26 23:08 ` Nuno Das Neves
  2025-02-26 23:32   ` Stanislav Kinsburskii
                     ` (3 more replies)
  2025-02-26 23:08 ` [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail Nuno Das Neves
                   ` (3 subsequent siblings)
  9 siblings, 4 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:08 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

hv_get_hypervisor_version, hv_result_to_errno, hv_call_deposit_pages,
and hv_call_create_vp are all needed by the root partition module
when CONFIG_MSHV_ROOT=m.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c   | 1 +
 arch/x86/kernel/cpu/mshyperv.c | 1 +
 drivers/hv/hv_common.c         | 1 +
 drivers/hv/hv_proc.c           | 3 ++-
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 2265ea5ce5ad..4e27cc29c79e 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -26,6 +26,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
 
 static int __init hyperv_init(void)
 {
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 2c29dfd6de19..0116d0e96ef9 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -420,6 +420,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
 
 static void __init ms_hyperv_init_platform(void)
 {
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index ce20818688fe..252fd66ad4db 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -717,6 +717,7 @@ int hv_result_to_errno(u64 status)
 	}
 	return -EIO;
 }
+EXPORT_SYMBOL_GPL(hv_result_to_errno);
 
 void hv_identify_partition_type(void)
 {
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 8fc30f509fa7..20c8cee81e2b 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -108,6 +108,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 	kfree(counts);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
 
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 {
@@ -194,4 +195,4 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 
 	return ret;
 }
-
+EXPORT_SYMBOL_GPL(hv_call_create_vp);
-- 
2.34.1


* Re: [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module
  2025-02-26 23:08 ` [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module Nuno Das Neves
@ 2025-02-26 23:32   ` Stanislav Kinsburskii
  2025-02-27 18:11   ` Roman Kisel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:32 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:08:00PM -0800, Nuno Das Neves wrote:
> hv_get_hypervisor_version, hv_result_to_errno, hv_call_deposit_pages,
> and hv_call_create_vp are all needed by the root partition module
> when CONFIG_MSHV_ROOT=m.
> 

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

* Re: [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module
  2025-02-26 23:08 ` [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module Nuno Das Neves
  2025-02-26 23:32   ` Stanislav Kinsburskii
@ 2025-02-27 18:11   ` Roman Kisel
  2025-02-28  0:51   ` Easwar Hariharan
  2025-03-06 19:23   ` Michael Kelley
  3 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 18:11 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> hv_get_hypervisor_version, hv_result_to_errno, hv_call_deposit_pages,
> and hv_call_create_vp are all needed by the root partition module
> when CONFIG_MSHV_ROOT=m.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>   arch/arm64/hyperv/mshyperv.c   | 1 +
>   arch/x86/kernel/cpu/mshyperv.c | 1 +
>   drivers/hv/hv_common.c         | 1 +
>   drivers/hv/hv_proc.c           | 3 ++-
>   4 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 2265ea5ce5ad..4e27cc29c79e 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -26,6 +26,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>   
>   	return 0;
>   }
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>   
>   static int __init hyperv_init(void)
>   {
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 2c29dfd6de19..0116d0e96ef9 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -420,6 +420,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>   
>   	return 0;
>   }
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>   
>   static void __init ms_hyperv_init_platform(void)
>   {
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index ce20818688fe..252fd66ad4db 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -717,6 +717,7 @@ int hv_result_to_errno(u64 status)
>   	}
>   	return -EIO;
>   }
> +EXPORT_SYMBOL_GPL(hv_result_to_errno);
>   
>   void hv_identify_partition_type(void)
>   {
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 8fc30f509fa7..20c8cee81e2b 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -108,6 +108,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>   	kfree(counts);
>   	return ret;
>   }
> +EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>   
>   int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>   {
> @@ -194,4 +195,4 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>   
>   	return ret;
>   }
> -
> +EXPORT_SYMBOL_GPL(hv_call_create_vp);

Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

-- 
Thank you,
Roman


* Re: [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module
  2025-02-26 23:08 ` [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module Nuno Das Neves
  2025-02-26 23:32   ` Stanislav Kinsburskii
  2025-02-27 18:11   ` Roman Kisel
@ 2025-02-28  0:51   ` Easwar Hariharan
  2025-03-06 19:23   ` Michael Kelley
  3 siblings, 0 replies; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-28  0:51 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> hv_get_hypervisor_version, hv_result_to_errno, hv_call_deposit_pages,
> and hv_call_create_vp are all needed by the root partition module
> when CONFIG_MSHV_ROOT=m.
> 

Nit: It's generally good practice to use parentheses when mentioning functions, i.e.
hv_get_hypervisor_version(), hv_call_deposit_pages() etc

Otherwise, looks good to me.

Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>

> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/arm64/hyperv/mshyperv.c   | 1 +
>  arch/x86/kernel/cpu/mshyperv.c | 1 +
>  drivers/hv/hv_common.c         | 1 +
>  drivers/hv/hv_proc.c           | 3 ++-
>  4 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 2265ea5ce5ad..4e27cc29c79e 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -26,6 +26,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>  
>  static int __init hyperv_init(void)
>  {
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 2c29dfd6de19..0116d0e96ef9 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -420,6 +420,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>  
>  static void __init ms_hyperv_init_platform(void)
>  {
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index ce20818688fe..252fd66ad4db 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -717,6 +717,7 @@ int hv_result_to_errno(u64 status)
>  	}
>  	return -EIO;
>  }
> +EXPORT_SYMBOL_GPL(hv_result_to_errno);
>  
>  void hv_identify_partition_type(void)
>  {
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 8fc30f509fa7..20c8cee81e2b 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -108,6 +108,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>  	kfree(counts);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>  
>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>  {
> @@ -194,4 +195,4 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>  
>  	return ret;
>  }
> -
> +EXPORT_SYMBOL_GPL(hv_call_create_vp);


* RE: [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module
  2025-02-26 23:08 ` [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-02-28  0:51   ` Easwar Hariharan
@ 2025-03-06 19:23   ` Michael Kelley
  2025-03-07 21:38     ` Nuno Das Neves
  3 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-06 19:23 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 

Nit: For the patch Subject line, use the prefix "Drivers: hv:" rather than the
slash form "Drivers/hv:". That's what we usually use and what you have used for
the other patches in this series.

> hv_get_hypervisor_version, hv_result_to_errno, hv_call_deposit_pages, and
> hv_call_create_vp are all needed by the module when CONFIG_MSHV_ROOT=m.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>

Modulo the nit:

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> ---
>  arch/arm64/hyperv/mshyperv.c   | 1 +
>  arch/x86/kernel/cpu/mshyperv.c | 1 +
>  drivers/hv/hv_common.c         | 1 +
>  drivers/hv/hv_proc.c           | 3 ++-
>  4 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 2265ea5ce5ad..4e27cc29c79e 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c
> @@ -26,6 +26,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> 
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
> 
>  static int __init hyperv_init(void)
>  {
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 2c29dfd6de19..0116d0e96ef9 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -420,6 +420,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
> 
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
> 
>  static void __init ms_hyperv_init_platform(void)
>  {
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index ce20818688fe..252fd66ad4db 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -717,6 +717,7 @@ int hv_result_to_errno(u64 status)
>  	}
>  	return -EIO;
>  }
> +EXPORT_SYMBOL_GPL(hv_result_to_errno);
> 
>  void hv_identify_partition_type(void)
>  {
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 8fc30f509fa7..20c8cee81e2b 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -108,6 +108,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>  	kfree(counts);
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
> 
>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>  {
> @@ -194,4 +195,4 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> 
>  	return ret;
>  }
> -
> +EXPORT_SYMBOL_GPL(hv_call_create_vp);
> --
> 2.34.1



* Re: [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module
  2025-03-06 19:23   ` Michael Kelley
@ 2025-03-07 21:38     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 21:38 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/6/2025 11:23 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>
> 
> Nit: For the patch Subject line, use the prefix "Drivers: hv:" rather than the
> slash form "Drivers/hv:". That's what we usually use and what you have used for
> the other patches in this series.
> 
Thanks, I thought I checked these but I guess I missed this! I'll update for v6.

>> hv_get_hypervisor_version, hv_result_to_errno, hv_call_deposit_pages, and
>> hv_call_create_vp are all needed by the module when CONFIG_MSHV_ROOT=m.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> 
> Modulo the nit:
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> 
>> ---
>>  arch/arm64/hyperv/mshyperv.c   | 1 +
>>  arch/x86/kernel/cpu/mshyperv.c | 1 +
>>  drivers/hv/hv_common.c         | 1 +
>>  drivers/hv/hv_proc.c           | 3 ++-
>>  4 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
>> index 2265ea5ce5ad..4e27cc29c79e 100644
>> --- a/arch/arm64/hyperv/mshyperv.c
>> +++ b/arch/arm64/hyperv/mshyperv.c
>> @@ -26,6 +26,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>>
>>  	return 0;
>>  }
>> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>>
>>  static int __init hyperv_init(void)
>>  {
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 2c29dfd6de19..0116d0e96ef9 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -420,6 +420,7 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>>
>>  	return 0;
>>  }
>> +EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>>
>>  static void __init ms_hyperv_init_platform(void)
>>  {
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index ce20818688fe..252fd66ad4db 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -717,6 +717,7 @@ int hv_result_to_errno(u64 status)
>>  	}
>>  	return -EIO;
>>  }
>> +EXPORT_SYMBOL_GPL(hv_result_to_errno);
>>
>>  void hv_identify_partition_type(void)
>>  {
>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>> index 8fc30f509fa7..20c8cee81e2b 100644
>> --- a/drivers/hv/hv_proc.c
>> +++ b/drivers/hv/hv_proc.c
>> @@ -108,6 +108,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>  	kfree(counts);
>>  	return ret;
>>  }
>> +EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>>
>>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>  {
>> @@ -194,4 +195,4 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>
>>  	return ret;
>>  }
>> -
>> +EXPORT_SYMBOL_GPL(hv_call_create_vp);
>> --
>> 2.34.1



* [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (5 preceding siblings ...)
  2025-02-26 23:08 ` [PATCH v5 06/10] Drivers/hv: Export some functions for use by root partition module Nuno Das Neves
@ 2025-02-26 23:08 ` Nuno Das Neves
  2025-02-26 23:39   ` Stanislav Kinsburskii
                     ` (2 more replies)
  2025-02-26 23:08 ` [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function Nuno Das Neves
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:08 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

Add a pointer hv_synic_eventring_tail to track the tail pointer for the
SynIC event ring buffer for each SINT.

This will be used by the mshv driver, but must be tracked independently
since the driver module could be removed and re-inserted.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
---
 drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 252fd66ad4db..2763cb6d3678 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
 
 static struct ctl_table_header *hv_ctl_table_hdr;
 
+/*
+ * Per-cpu array holding the tail pointer for the SynIC event ring buffer
+ * for each SINT.
+ *
+ * We cannot maintain this in mshv driver because the tail pointer should
+ * persist even if the mshv driver is unloaded.
+ */
+u8 __percpu **hv_synic_eventring_tail;
+EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
+
 /*
  * Hyper-V specific initialization and shutdown code that is
  * common across all architectures.  Called from architecture
@@ -90,6 +100,9 @@ void __init hv_common_free(void)
 
 	free_percpu(hyperv_pcpu_input_arg);
 	hyperv_pcpu_input_arg = NULL;
+
+	free_percpu(hv_synic_eventring_tail);
+	hv_synic_eventring_tail = NULL;
 }
 
 /*
@@ -372,6 +385,11 @@ int __init hv_common_init(void)
 		BUG_ON(!hyperv_pcpu_output_arg);
 	}
 
+	if (hv_root_partition()) {
+		hv_synic_eventring_tail = alloc_percpu(u8 *);
+		BUG_ON(hv_synic_eventring_tail == NULL);
+	}
+
 	hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
 				    GFP_KERNEL);
 	if (!hv_vp_index) {
@@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
 int hv_common_cpu_init(unsigned int cpu)
 {
 	void **inputarg, **outputarg;
+	u8 **synic_eventring_tail;
 	u64 msr_vp_index;
 	gfp_t flags;
 	const int pgcount = hv_output_page_exists() ? 2 : 1;
@@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
 	inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
 
 	/*
-	 * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
-	 * allocated if this CPU was previously online and then taken offline
+	 * The per-cpu memory is already allocated if this CPU was previously
+	 * online and then taken offline
 	 */
 	if (!*inputarg) {
 		mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
@@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
 			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
 		}
 
+		if (hv_root_partition()) {
+			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
+			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
+							sizeof(u8), flags);
+
+			if (unlikely(!*synic_eventring_tail)) {
+				kfree(mem);
+				return -ENOMEM;
+			}
+		}
+
 		if (!ms_hyperv.paravisor_present &&
 		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
 			ret = set_memory_decrypted((unsigned long)mem, pgcount);
-- 
2.34.1



* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-02-26 23:08 ` [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail Nuno Das Neves
@ 2025-02-26 23:39   ` Stanislav Kinsburskii
  2025-03-07 17:02   ` Michael Kelley
  2025-03-10 13:01   ` Tianyu Lan
  2 siblings, 0 replies; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:39 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:08:01PM -0800, Nuno Das Neves wrote:
> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
> SynIC event ring buffer for each SINT.
> 
> This will be used by the mshv driver, but must be tracked independently
> since the driver module could be removed and re-inserted.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> ---
>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 252fd66ad4db..2763cb6d3678 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
>  int hv_common_cpu_init(unsigned int cpu)
>  {
>  	void **inputarg, **outputarg;
> +	u8 **synic_eventring_tail;
>  	u64 msr_vp_index;
>  	gfp_t flags;
>  	const int pgcount = hv_output_page_exists() ? 2 : 1;
> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
>  	inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
>  
>  	/*
> -	 * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
> -	 * allocated if this CPU was previously online and then taken offline
> +	 * The per-cpu memory is already allocated if this CPU was previously
> +	 * online and then taken offline
>  	 */
>  	if (!*inputarg) {
>  		mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
>  			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
>  		}
>  
> +		if (hv_root_partition()) {
> +			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> +			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
> +							sizeof(u8), flags);
> +

Redundant empty line ^^^
> +			if (unlikely(!*synic_eventring_tail)) {
> +				kfree(mem);
> +				return -ENOMEM;
> +			}
> +		}
> +
>  		if (!ms_hyperv.paravisor_present &&
>  		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
>  			ret = set_memory_decrypted((unsigned long)mem, pgcount);

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

> -- 
> 2.34.1
> 


* RE: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-02-26 23:08 ` [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail Nuno Das Neves
  2025-02-26 23:39   ` Stanislav Kinsburskii
@ 2025-03-07 17:02   ` Michael Kelley
  2025-03-07 22:06     ` Nuno Das Neves
  2025-03-10 13:01   ` Tianyu Lan
  2 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 17:02 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 
> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
> SynIC event ring buffer for each SINT.
> 
> This will be used by the mshv driver, but must be tracked independently
> since the driver module could be removed and re-inserted.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> ---
>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 252fd66ad4db..2763cb6d3678 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
> 
>  static struct ctl_table_header *hv_ctl_table_hdr;
> 
> +/*
> + * Per-cpu array holding the tail pointer for the SynIC event ring buffer
> + * for each SINT.
> + *
> + * We cannot maintain this in mshv driver because the tail pointer should
> + * persist even if the mshv driver is unloaded.
> + */
> +u8 __percpu **hv_synic_eventring_tail;

I think the "__percpu" is in the wrong place here. This placement
is likely to cause errors from the "sparse" tool.  It should be

u8 * __percpu *hv_synic_eventring_tail;

See the way hyperv_pcpu_input_arg, for example, is defined.  And
see commit db3c65bc3a13 where I fixed hyperv_pcpu_input_arg.
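
To illustrate the placement point, here is a small userspace sketch, not
kernel code. It assumes __percpu can simply be defined away (in the kernel it
is a sparse address-space annotation), and the variable names are made up for
the example. The const declarations at the end follow the same binding rule,
but with a qualifier that an ordinary compiler checks.

```c
#include <stddef.h>

/*
 * Assumption: outside the kernel there is no __percpu, so it is defined
 * away here. In the kernel it expands to a sparse address-space attribute.
 */
#ifndef __percpu
#define __percpu
#endif

typedef unsigned char u8;

/*
 * Placement as posted in v5: the annotation qualifies the innermost u8,
 * i.e. "pointer to pointer to a per-cpu u8".
 */
u8 __percpu **tail_as_posted;

/*
 * Suggested placement: the annotation qualifies the u8 * object,
 * i.e. "pointer to a per-cpu (u8 *)", which is the kind of pointer
 * that alloc_percpu(u8 *) hands back.
 */
u8 * __percpu *tail_as_suggested;

/*
 * Same binding rule with const: in the first declaration the bytes
 * are const, in the second the intermediate pointer is const.
 */
const u8 **const_innermost;
u8 * const *const_middle;
```

Both spellings have the same size and representation, which is why only a
checker such as sparse notices the difference.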

> +EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);

The "extern" declaration for this variable is in Patch 10 of the series
in drivers/hv/mshv_root.h. I guess that's OK, but I would normally
expect to find such a declaration in the header file associated with
where the variable is defined, which in this case is mshyperv.h.
Perhaps you are trying to restrict its usage to just mshv?

> +
>  /*
>   * Hyper-V specific initialization and shutdown code that is
>   * common across all architectures.  Called from architecture
> @@ -90,6 +100,9 @@ void __init hv_common_free(void)
> 
>  	free_percpu(hyperv_pcpu_input_arg);
>  	hyperv_pcpu_input_arg = NULL;
> +
> +	free_percpu(hv_synic_eventring_tail);
> +	hv_synic_eventring_tail = NULL;
>  }
> 
>  /*
> @@ -372,6 +385,11 @@ int __init hv_common_init(void)
>  		BUG_ON(!hyperv_pcpu_output_arg);
>  	}
> 
> +	if (hv_root_partition()) {
> +		hv_synic_eventring_tail = alloc_percpu(u8 *);
> +		BUG_ON(hv_synic_eventring_tail == NULL);
> +	}
> +
>  	hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
>  				    GFP_KERNEL);
>  	if (!hv_vp_index) {
> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
>  int hv_common_cpu_init(unsigned int cpu)
>  {
>  	void **inputarg, **outputarg;
> +	u8 **synic_eventring_tail;
>  	u64 msr_vp_index;
>  	gfp_t flags;
>  	const int pgcount = hv_output_page_exists() ? 2 : 1;
> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
>  	inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> 
>  	/*
> -	 * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
> -	 * allocated if this CPU was previously online and then taken offline
> +	 * The per-cpu memory is already allocated if this CPU was previously
> +	 * online and then taken offline
>  	 */
>  	if (!*inputarg) {
>  		mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
>  			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
>  		}
> 
> +		if (hv_root_partition()) {
> +			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> +			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
> +							sizeof(u8), flags);
> +
> +			if (unlikely(!*synic_eventring_tail)) {
> +				kfree(mem);
> +				return -ENOMEM;
> +			}
> +		}
> +

Adding this code under the "if(!*inputarg)" implicitly ties the lifecycle of
synic_eventring_tail to the lifecycle of hyperv_pcpu_input_arg and
hyperv_pcpu_output_arg. Is there some logical relationship between the
two that warrants tying the lifecycles together (other than just both being
per-cpu)?  hyperv_pcpu_input_arg and hyperv_pcpu_output_arg have an
unusual lifecycle management in that they aren't freed when a CPU goes
offline, as described in the comment in hv_common_cpu_die(). Does
synic_eventring_tail also need that same unusual lifecycle?

Assuming there's no logical relationship, I'm thinking synic_eventring_tail
should be managed independent of the other two. If it does need the
unusual lifecycle, make sure to add a comment in hv_common_cpu_die()
explaining why. If it doesn't need the unusual lifecycle, maybe just do
the normal thing of allocating it in hv_common_cpu_init() and freeing
it in hv_common_cpu_die().

The code as written in your patch isn't wrong and would work OK. But
the structure implies a relationship with hyperv_pcpu_*_arg that I
suspect doesn't exist.
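
For illustration only, the "normal" lifecycle could look roughly like the
userspace model below. It is a sketch under assumptions, not a proposed
patch: calloc()/free() stand in for kcalloc()/kfree(), a plain array stands
in for the per-cpu slots, and every sketch_* name and SKETCH_* constant is
hypothetical.

```c
#include <stdlib.h>
#include <errno.h>

#define SKETCH_NR_CPUS    4   /* hypothetical CPU count for the model */
#define SKETCH_SINT_COUNT 16  /* stands in for HV_SYNIC_SINT_COUNT */

typedef unsigned char u8;

/* Models the per-cpu slots behind hv_synic_eventring_tail. */
u8 *sketch_eventring_tail[SKETCH_NR_CPUS];

/* Models the hv_common_cpu_init() side: allocate zeroed tails for this CPU. */
int sketch_cpu_init(unsigned int cpu)
{
	if (cpu >= SKETCH_NR_CPUS)
		return -EINVAL;

	/* Only allocate if the slot is empty; cpu_die leaves it NULL. */
	if (!sketch_eventring_tail[cpu]) {
		sketch_eventring_tail[cpu] = calloc(SKETCH_SINT_COUNT, sizeof(u8));
		if (!sketch_eventring_tail[cpu])
			return -ENOMEM;
	}
	return 0;
}

/* Models the hv_common_cpu_die() side: free the array and clear the slot. */
int sketch_cpu_die(unsigned int cpu)
{
	if (cpu >= SKETCH_NR_CPUS)
		return -EINVAL;

	free(sketch_eventring_tail[cpu]);
	sketch_eventring_tail[cpu] = NULL;
	return 0;
}
```

In this shape the allocation no longer sits under the "if (!*inputarg)"
check, so nothing in the structure suggests a tie to hyperv_pcpu_*_arg.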

Michael

>  		if (!ms_hyperv.paravisor_present &&
>  		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
>  			ret = set_memory_decrypted((unsigned long)mem, pgcount);
> --
> 2.34.1



* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-07 17:02   ` Michael Kelley
@ 2025-03-07 22:06     ` Nuno Das Neves
  2025-03-07 23:21       ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 22:06 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/7/2025 9:02 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>
>> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
>> SynIC event ring buffer for each SINT.
>>
>> This will be used by the mshv driver, but must be tracked independently
>> since the driver module could be removed and re-inserted.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
>> ---
>>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 252fd66ad4db..2763cb6d3678 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
>>
>>  static struct ctl_table_header *hv_ctl_table_hdr;
>>
>> +/*
>> + * Per-cpu array holding the tail pointer for the SynIC event ring buffer
>> + * for each SINT.
>> + *
>> + * We cannot maintain this in mshv driver because the tail pointer should
>> + * persist even if the mshv driver is unloaded.
>> + */
>> +u8 __percpu **hv_synic_eventring_tail;
> 
> I think the "__percpu" is in the wrong place here. This placement
> is likely to cause errors from the "sparse" tool.  It should be
> 
> u8 * __percpu *hv_synic_eventring_tail;
> 
> See the way hyperv_pcpu_input_arg, for example, is defined.  And
> see commit db3c65bc3a13 where I fixed hyperv_pcpu_input_arg.
> 
Thanks. I'll fix it.

>> +EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
> 
> The "extern" declaration for this variable is in Patch 10 of the series
> in drivers/hv/mshv_root.h. I guess that's OK, but I would normally
> expect to find such a declaration in the header file associated with
> where the variable is defined, which in this case is mshyperv.h.
> Perhaps you are trying to restrict its usage to just mshv?
> 
Yes, that's the idea - it should only be used by the driver.

>> +
>>  /*
>>   * Hyper-V specific initialization and shutdown code that is
>>   * common across all architectures.  Called from architecture
>> @@ -90,6 +100,9 @@ void __init hv_common_free(void)
>>
>>  	free_percpu(hyperv_pcpu_input_arg);
>>  	hyperv_pcpu_input_arg = NULL;
>> +
>> +	free_percpu(hv_synic_eventring_tail);
>> +	hv_synic_eventring_tail = NULL;
>>  }
>>
>>  /*
>> @@ -372,6 +385,11 @@ int __init hv_common_init(void)
>>  		BUG_ON(!hyperv_pcpu_output_arg);
>>  	}
>>
>> +	if (hv_root_partition()) {
>> +		hv_synic_eventring_tail = alloc_percpu(u8 *);
>> +		BUG_ON(hv_synic_eventring_tail == NULL);
>> +	}
>> +
>>  	hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
>>  				    GFP_KERNEL);
>>  	if (!hv_vp_index) {
>> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
>>  int hv_common_cpu_init(unsigned int cpu)
>>  {
>>  	void **inputarg, **outputarg;
>> +	u8 **synic_eventring_tail;
>>  	u64 msr_vp_index;
>>  	gfp_t flags;
>>  	const int pgcount = hv_output_page_exists() ? 2 : 1;
>> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
>>  	inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
>>
>>  	/*
>> -	 * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
>> -	 * allocated if this CPU was previously online and then taken offline
>> +	 * The per-cpu memory is already allocated if this CPU was previously
>> +	 * online and then taken offline
>>  	 */
>>  	if (!*inputarg) {
>>  		mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
>> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
>>  			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
>>  		}
>>
>> +		if (hv_root_partition()) {
>> +			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
>> +			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
>> +							sizeof(u8), flags);
>> +
>> +			if (unlikely(!*synic_eventring_tail)) {
>> +				kfree(mem);
>> +				return -ENOMEM;
>> +			}
>> +		}
>> +
> 
> Adding this code under the "if(!*inputarg)" implicitly ties the lifecycle of
> synic_eventring_tail to the lifecycle of hyperv_pcpu_input_arg and
> hyperv_pcpu_output_arg. Is there some logical relationship between the
> two that warrants tying the lifecycles together (other than just both being
> per-cpu)?  hyperv_pcpu_input_arg and hyperv_pcpu_output_arg have an
> unusual lifecycle management in that they aren't freed when a CPU goes
> offline, as described in the comment in hv_common_cpu_die(). Does
> synic_eventring_tail also need that same unusual lifecycle?
> 
I thought about it, and no I don't think it shares the same exact lifecycle.
It's only used by the mshv_root driver - it just needs to remain present
whenever there's a chance the module could be re-inserted and expect it to
be there.
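
One way to picture the requirement is the toy model below. It is only a
sketch under assumptions: the ring layout, the "0 means empty" convention,
the ring size and all model_* names are hypothetical and are not taken from
the hypervisor interface or from the driver in patch 10.

```c
#include <stdint.h>

#define MODEL_RING_SLOTS 8  /* hypothetical ring size, not the real one */

/* Hypothetical ring: the producer writes nonzero values, 0 means empty. */
struct model_ring {
	uint32_t data[MODEL_RING_SLOTS];
	uint8_t head; /* producer position, owned by the "hypervisor" side */
};

/* Producer side of the model: store a nonzero value and advance. */
void model_produce(struct model_ring *ring, uint32_t value)
{
	ring->data[ring->head] = value;
	ring->head = (uint8_t)((ring->head + 1) % MODEL_RING_SLOTS);
}

/*
 * Consumer side: read the entry at *tail, clear it and advance *tail.
 * Returns 0 if the slot at *tail is empty.
 */
uint32_t model_consume(struct model_ring *ring, uint8_t *tail)
{
	uint32_t value = ring->data[*tail];

	if (!value)
		return 0;

	ring->data[*tail] = 0;
	*tail = (uint8_t)((*tail + 1) % MODEL_RING_SLOTS);
	return value;
}
```

In this model, once three entries have been consumed the tail is 3. A
consumer that restarted from 0 after a reload would look at an empty slot and
miss the next entry, while the producer keeps writing from where it left off.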

> Assuming there's no logical relationship, I'm thinking synic_eventring_tail
> should be managed independent of the other two. If it does need the
> unusual lifecycle, make sure to add a comment in hv_common_cpu_die()
> explaining why. If it doesn't need the unusual lifecycle, maybe just do
> the normal thing of allocating it in hv_common_cpu_init() and freeing
> it in hv_common_cpu_die().
> 
Yep, I suppose it should just be freed normally then, assuming
hv_common_cpu_die() is only called when the hypervisor is going to reset
(or remove) the synic pages for this partition. Is that the case here?

Otherwise we'd want to retain it, in case mshv_root ever needs it again for
that CPU in the lifetime of this partition.

Nuno

> The code as written in your patch isn't wrong and would work OK. But
> the structure implies a relationship with hyperv_pcpu_*_arg that I
> suspect doesn't exist.
> 
> Michael
> 
>>  		if (!ms_hyperv.paravisor_present &&
>>  		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
>>  			ret = set_memory_decrypted((unsigned long)mem, pgcount);
>> --
>> 2.34.1



* RE: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-07 22:06     ` Nuno Das Neves
@ 2025-03-07 23:21       ` Michael Kelley
  2025-03-07 23:31         ` Nuno Das Neves
  2025-03-07 23:37         ` Michael Kelley
  0 siblings, 2 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 23:21 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, March 7, 2025 2:07 PM
> 
> On 3/7/2025 9:02 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> >>
> >> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
> >> SynIC event ring buffer for each SINT.
> >>
> >> This will be used by the mshv driver, but must be tracked independently
> >> since the driver module could be removed and re-inserted.
> >>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> >> ---
> >>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
> >>  1 file changed, 32 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> >> index 252fd66ad4db..2763cb6d3678 100644
> >> --- a/drivers/hv/hv_common.c
> >> +++ b/drivers/hv/hv_common.c
> >> @@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
> >>
> >>  static struct ctl_table_header *hv_ctl_table_hdr;
> >>
> >> +/*
> >> + * Per-cpu array holding the tail pointer for the SynIC event ring buffer
> >> + * for each SINT.
> >> + *
> >> + * We cannot maintain this in mshv driver because the tail pointer should
> >> + * persist even if the mshv driver is unloaded.
> >> + */
> >> +u8 __percpu **hv_synic_eventring_tail;
> >
> > I think the "__percpu" is in the wrong place here. This placement
> > is likely to cause errors from the "sparse" tool.  It should be
> >
> > u8 * __percpu *hv_synic_eventring_tail;
> >
> > See the way hyperv_pcpu_input_arg, for example, is defined.  And
> > see commit db3c65bc3a13 where I fixed hyperv_pcpu_input_arg.
> >
> Thanks. I'll fix it.
> 
> >> +EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
> >
> > The "extern" declaration for this variable is in Patch 10 of the series
> > in drivers/hv/mshv_root.h. I guess that's OK, but I would normally
> > expect to find such a declaration in the header file associated with
> > where the variable is defined, which in this case is mshyperv.h.
> > Perhaps you are trying to restrict its usage to just mshv?
> >
> Yes, that's the idea - it should only be used by the driver.
> 
> >> +
> >>  /*
> >>   * Hyper-V specific initialization and shutdown code that is
> >>   * common across all architectures.  Called from architecture
> >> @@ -90,6 +100,9 @@ void __init hv_common_free(void)
> >>
> >>  	free_percpu(hyperv_pcpu_input_arg);
> >>  	hyperv_pcpu_input_arg = NULL;
> >> +
> >> +	free_percpu(hv_synic_eventring_tail);
> >> +	hv_synic_eventring_tail = NULL;
> >>  }
> >>
> >>  /*
> >> @@ -372,6 +385,11 @@ int __init hv_common_init(void)
> >>  		BUG_ON(!hyperv_pcpu_output_arg);
> >>  	}
> >>
> >> +	if (hv_root_partition()) {
> >> +		hv_synic_eventring_tail = alloc_percpu(u8 *);
> >> +		BUG_ON(hv_synic_eventring_tail == NULL);
> >> +	}
> >> +
> >>  	hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
> >>  				    GFP_KERNEL);
> >>  	if (!hv_vp_index) {
> >> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
> >>  int hv_common_cpu_init(unsigned int cpu)
> >>  {
> >>  	void **inputarg, **outputarg;
> >> +	u8 **synic_eventring_tail;
> >>  	u64 msr_vp_index;
> >>  	gfp_t flags;
> >>  	const int pgcount = hv_output_page_exists() ? 2 : 1;
> >> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
> >>  	inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
> >>
> >>  	/*
> >> -	 * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
> >> -	 * allocated if this CPU was previously online and then taken offline
> >> +	 * The per-cpu memory is already allocated if this CPU was previously
> >> +	 * online and then taken offline
> >>  	 */
> >>  	if (!*inputarg) {
> >>  		mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
> >> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
> >>  			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
> >>  		}
> >>
> >> +		if (hv_root_partition()) {
> >> +			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> >> +			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
> >> +							sizeof(u8), flags);
> >> +
> >> +			if (unlikely(!*synic_eventring_tail)) {
> >> +				kfree(mem);
> >> +				return -ENOMEM;
> >> +			}
> >> +		}
> >> +
> >
> > Adding this code under the "if(!*inputarg)" implicitly ties the lifecycle of
> > synic_eventring_tail to the lifecycle of hyperv_pcpu_input_arg and
> > hyperv_pcpu_output_arg. Is there some logical relationship between the
> > two that warrants tying the lifecycles together (other than just both being
> > per-cpu)?  hyperv_pcpu_input_arg and hyperv_pcpu_output_arg have an
> > unusual lifecycle management in that they aren't freed when a CPU goes
> > offline, as described in the comment in hv_common_cpu_die(). Does
> > synic_eventring_tail also need that same unusual lifecycle?
> >
> I thought about it, and no I don't think it shares the same exact lifecycle.
> It's only used by the mshv_root driver - it just needs to remain present
> whenever there's a chance the module could be re-inserted and expect it to
> be there.
> 
> > Assuming there's no logical relationship, I'm thinking synic_eventring_tail
> > should be managed independent of the other two. If it does need the
> > unusual lifecycle, make sure to add a comment in hv_common_cpu_die()
> > explaining why. If it doesn't need the unusual lifecycle, maybe just do
> > the normal thing of allocating it in hv_common_cpu_init() and freeing
> > it in hv_common_cpu_die().
> >
> Yep, I suppose it should just be freed normally then, assuming
> hv_common_cpu_die() is only called when the hypervisor is going to reset
> (or remove) the synic pages for this partition. Is that the case here?
> 

Yes, it is the case here. A particular vCPU can be taken offline
independent of other vCPUs in the VM (such as by writing "0"
to /sys/devices/system/cpu/cpu<nn>/online). When that happens
the vCPU going offline runs hv_synic_cleanup() first, and then it
runs hv_cpu_die(), which calls hv_common_cpu_die(). So by the
time hv_common_cpu_die() runs, the synic_message_page and
synic_event_page will have been unmapped and the pointers set
to NULL.

On arm64, there is no hv_cpu_init()/die(), and the "common"
versions are called directly. Perhaps at some point in the future there
will be arm64 specific things to be done, and hv_cpu_init()/die()
will need to be added. But the ordering is the same and
hv_synic_cleanup() runs first.

So, yes, since synic_eventring_tail is tied to the synic, it sounds like
the normal lifecycle could be used, like with the VP assist page that
is handled in hv_cpu_init()/die() on x86.
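
To illustrate the "normal lifecycle" described above, here is a minimal
userspace sketch, not kernel code. It assumes a small fixed CPU count and
uses calloc()/free() in place of kcalloc()/kfree(). The names life_cpu_init(),
life_cpu_die(), LIFE_NR_CPUS and LIFE_SINT_COUNT are hypothetical stand-ins,
not identifiers from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define LIFE_NR_CPUS    4   /* assumption: small fixed CPU count for the model */
#define LIFE_SINT_COUNT 16  /* assumption: stand-in for HV_SYNIC_SINT_COUNT */

/* Stand-in for the per-cpu "u8 *" slots; one tail array pointer per CPU. */
static uint8_t *life_eventring_tail[LIFE_NR_CPUS];

/* Models the allocation half: runs when a CPU comes online. */
static int life_cpu_init(unsigned int cpu)
{
	assert(cpu < LIFE_NR_CPUS);

	/* Defensive in this model only: do not leak if init runs twice. */
	if (life_eventring_tail[cpu])
		return 0;

	/* Zeroed array, one tail index per SINT. */
	life_eventring_tail[cpu] = calloc(LIFE_SINT_COUNT, sizeof(uint8_t));
	if (!life_eventring_tail[cpu])
		return -ENOMEM;

	return 0;
}

/* Models the freeing half: runs when a CPU goes offline. */
static void life_cpu_die(unsigned int cpu)
{
	assert(cpu < LIFE_NR_CPUS);

	free(life_eventring_tail[cpu]);
	life_eventring_tail[cpu] = NULL;
}
```

In this model the allocation is independent of any other per-cpu buffer, so
taking a CPU offline and online again simply frees and re-creates a zeroed
tail array.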

> Otherwise we'd want to retain it, in case mshv_root ever needs it again for
> that CPU in the lifetime of this partition.
> 
> Nuno
> 
> > The code as written in your patch isn't wrong and would work OK. But
> > the structure implies a relationship with hyperv_pcpu_*_arg that I
> > suspect doesn't exist.
> >
> > Michael
> >
> >>  		if (!ms_hyperv.paravisor_present &&
> >>  		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
> >>  			ret = set_memory_decrypted((unsigned long)mem, pgcount);
> >> --
> >> 2.34.1



* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-07 23:21       ` Michael Kelley
@ 2025-03-07 23:31         ` Nuno Das Neves
  2025-03-07 23:37         ` Michael Kelley
  1 sibling, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 23:31 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/7/2025 3:21 PM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, March 7, 2025 2:07 PM
>>
>> On 3/7/2025 9:02 AM, Michael Kelley wrote:
>>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>>>
>>>> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
>>>> SynIC event ring buffer for each SINT.
>>>>
>>>> This will be used by the mshv driver, but must be tracked independently
>>>> since the driver module could be removed and re-inserted.
>>>>
>>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
>>>> ---
>>>>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>>>> index 252fd66ad4db..2763cb6d3678 100644
>>>> --- a/drivers/hv/hv_common.c
>>>> +++ b/drivers/hv/hv_common.c
>>>> @@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
>>>>
>>>>  static struct ctl_table_header *hv_ctl_table_hdr;
>>>>
>>>> +/*
>>>> + * Per-cpu array holding the tail pointer for the SynIC event ring buffer
>>>> + * for each SINT.
>>>> + *
>>>> + * We cannot maintain this in mshv driver because the tail pointer should
>>>> + * persist even if the mshv driver is unloaded.
>>>> + */
>>>> +u8 __percpu **hv_synic_eventring_tail;
>>>
>>> I think the "__percpu" is in the wrong place here. This placement
>>> is likely to cause errors from the "sparse" tool.  It should be
>>>
>>> u8 * __percpu *hv_synic_eventring_tail;
>>>
>>> See the way hyperv_pcpu_input_arg, for example, is defined.  And
>>> see commit db3c65bc3a13 where I fixed hyperv_pcpu_input_arg.
>>>
>> Thanks. I'll fix it.
>>
>>>> +EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
>>>
>>> The "extern" declaration for this variable is in Patch 10 of the series
>>> in drivers/hv/mshv_root.h. I guess that's OK, but I would normally
>>> expect to find such a declaration in the header file associated with
>>> where the variable is defined, which in this case is mshyperv.h.
>>> Perhaps you are trying to restrict its usage to just mshv?
>>>
>> Yes, that's the idea - it should only be used by the driver.
>>
>>>> +
>>>>  /*
>>>>   * Hyper-V specific initialization and shutdown code that is
>>>>   * common across all architectures.  Called from architecture
>>>> @@ -90,6 +100,9 @@ void __init hv_common_free(void)
>>>>
>>>>  	free_percpu(hyperv_pcpu_input_arg);
>>>>  	hyperv_pcpu_input_arg = NULL;
>>>> +
>>>> +	free_percpu(hv_synic_eventring_tail);
>>>> +	hv_synic_eventring_tail = NULL;
>>>>  }
>>>>
>>>>  /*
>>>> @@ -372,6 +385,11 @@ int __init hv_common_init(void)
>>>>  		BUG_ON(!hyperv_pcpu_output_arg);
>>>>  	}
>>>>
>>>> +	if (hv_root_partition()) {
>>>> +		hv_synic_eventring_tail = alloc_percpu(u8 *);
>>>> +		BUG_ON(hv_synic_eventring_tail == NULL);
>>>> +	}
>>>> +
>>>>  	hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
>>>>  				    GFP_KERNEL);
>>>>  	if (!hv_vp_index) {
>>>> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
>>>>  int hv_common_cpu_init(unsigned int cpu)
>>>>  {
>>>>  	void **inputarg, **outputarg;
>>>> +	u8 **synic_eventring_tail;
>>>>  	u64 msr_vp_index;
>>>>  	gfp_t flags;
>>>>  	const int pgcount = hv_output_page_exists() ? 2 : 1;
>>>> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
>>>>  	inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
>>>>
>>>>  	/*
>>>> -	 * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
>>>> -	 * allocated if this CPU was previously online and then taken offline
>>>> +	 * The per-cpu memory is already allocated if this CPU was previously
>>>> +	 * online and then taken offline
>>>>  	 */
>>>>  	if (!*inputarg) {
>>>>  		mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
>>>> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
>>>>  			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
>>>>  		}
>>>>
>>>> +		if (hv_root_partition()) {
>>>> +			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
>>>> +			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
>>>> +							sizeof(u8), flags);
>>>> +
>>>> +			if (unlikely(!*synic_eventring_tail)) {
>>>> +				kfree(mem);
>>>> +				return -ENOMEM;
>>>> +			}
>>>> +		}
>>>> +
>>>
>>> Adding this code under the "if(!*inputarg)" implicitly ties the lifecycle of
>>> synic_eventring_tail to the lifecycle of hyperv_pcpu_input_arg and
>>> hyperv_pcpu_output_arg. Is there some logical relationship between the
>>> two that warrants tying the lifecycles together (other than just both being
>>> per-cpu)?  hyperv_pcpu_input_arg and hyperv_pcpu_output_arg have an
>>> unusual lifecycle management in that they aren't freed when a CPU goes
>>> offline, as described in the comment in hv_common_cpu_die(). Does
>>> synic_eventring_tail also need that same unusual lifecycle?
>>>
>> I thought about it, and no I don't think it shares the same exact lifecycle.
>> It's only used by the mshv_root driver - it just needs to remain present
>> whenever there's a chance the module could be re-inserted and expect it to
>> be there.
>>
>>> Assuming there's no logical relationship, I'm thinking synic_eventring_tail
>>> should be managed independent of the other two. If it does need the
>>> unusual lifecycle, make sure to add a comment in hv_common_cpu_die()
>>> explaining why. If it doesn't need the unusual lifecycle, maybe just do
>>> the normal thing of allocating it in hv_common_cpu_init() and freeing
>>> it in hv_common_cpu_die().
>>>
>> Yep, I suppose it should just be freed normally then, assuming
>> hv_common_cpu_die() is only called when the hypervisor is going to reset
>> (or remove) the synic pages for this partition. Is that the case here?
>>
> 
> Yes, it is the case here. A particular vCPU can be taken offline
> independent of other vCPUs in the VM (such as by writing "0"
> to /sys/devices/system/cpu/cpu<nn>/online). When that happens
> the vCPU going offline runs hv_synic_cleanup() first, and then it
> runs hv_cpu_die(), which calls hv_common_cpu_die(). So by the
> time hv_common_cpu_die() runs, the synic_message_page and
> synic_event_page will have been unmapped and the pointers set
> to NULL.
> 
> On arm64, there is no hv_cpu_init()/die(), and the "common"
> versions are called directly. Perhaps at some point in the future there
> will be arm64 specific things to be done, and hv_cpu_init()/die()
> will need to be added. But the ordering is the same and
> hv_synic_cleanup() runs first.
> 
> So, yes, since synic_eventring_tail is tied to the synic, it sounds like
> the normal lifecycle could be used, like with the VP assist page that
> is handled in hv_cpu_init()/die() on x86.
> 
Great, thanks for the clarification! I'll fix it for v6.

Nuno

>> Otherwise we'd want to retain it, in case mshv_root ever needs it again for
>> that CPU in the lifetime of this partition.
>>
>> Nuno
>>
>>> The code as written in your patch isn't wrong and would work OK. But
>>> the structure implies a relationship with hyperv_pcpu_*_arg that I
>>> suspect doesn't exist.
>>>
>>> Michael
>>>
>>>>  		if (!ms_hyperv.paravisor_present &&
>>>>  		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
>>>>  			ret = set_memory_decrypted((unsigned long)mem, pgcount);
>>>> --
>>>> 2.34.1
> 



* RE: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-07 23:21       ` Michael Kelley
  2025-03-07 23:31         ` Nuno Das Neves
@ 2025-03-07 23:37         ` Michael Kelley
  1 sibling, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 23:37 UTC (permalink / raw)
  To: Michael Kelley, Nuno Das Neves, linux-hyperv@vger.kernel.org,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Michael Kelley <mhklinux@outlook.com> Sent: Friday, March 7, 2025 3:21 PM
> 
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, March 7, 2025 2:07 PM
> >

[snip]

> > >> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
> > >>  			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
> > >>  		}
> > >>
> > >> +		if (hv_root_partition()) {
> > >> +			synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> > >> +			*synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
> > >> +							sizeof(u8), flags);
> > >> +
> > >> +			if (unlikely(!*synic_eventring_tail)) {
> > >> +				kfree(mem);
> > >> +				return -ENOMEM;
> > >> +			}
> > >> +		}
> > >> +
> > >
> > > Adding this code under the "if(!*inputarg)" implicitly ties the lifecycle of
> > > synic_eventring_tail to the lifecycle of hyperv_pcpu_input_arg and
> > > hyperv_pcpu_output_arg. Is there some logical relationship between the
> > > two that warrants tying the lifecycles together (other than just both being
> > > per-cpu)?  hyperv_pcpu_input_arg and hyperv_pcpu_output_arg have an
> > > unusual lifecycle management in that they aren't freed when a CPU goes
> > > offline, as described in the comment in hv_common_cpu_die(). Does
> > > synic_eventring_tail also need that same unusual lifecycle?
> > >
> > I thought about it, and no I don't think it shares the same exact lifecycle.
> > It's only used by the mshv_root driver - it just needs to remain present
> > whenever there's a chance the module could be re-inserted and expect it to
> > be there.
> >
> > > Assuming there's no logical relationship, I'm thinking synic_eventring_tail
> > > should be managed independent of the other two. If it does need the
> > > unusual lifecycle, make sure to add a comment in hv_common_cpu_die()
> > > explaining why. If it doesn't need the unusual lifecycle, maybe just do
> > > the normal thing of allocating it in hv_common_cpu_init() and freeing
> > > it in hv_common_cpu_die().
> > >
> > Yep, I suppose it should just be freed normally then, assuming
> > hv_common_cpu_die() is only called when the hypervisor is going to reset
> > (or remove) the synic pages for this partition. Is that the case here?
> >
> 
> Yes, it is the case here. A particular vCPU can be taken offline
> independent of other vCPUs in the VM (such as by writing "0"
> to /sys/devices/system/cpu/cpu<nn>/online). When that happens
> the vCPU going offline runs hv_synic_cleanup() first, and then it
> runs hv_cpu_die(), which calls hv_common_cpu_die(). So by the
> time hv_common_cpu_die() runs, the synic_message_page and
> synic_event_page will have been unmapped and the pointers set
> to NULL.
> 
> On arm64, there is no hv_cpu_init()/die(), and the "common"
> versions are called directly. Perhaps at some point in the future there
> will be arm64 specific things to be done, and hv_cpu_init()/die()
> will need to be added. But the ordering is the same and
> hv_synic_cleanup() runs first.
> 
> So, yes, since synic_eventring_tail is tied to the synic, it sounds like
> the normal lifecycle could be used, like with the VP assist page that
> is handled in hv_cpu_init()/die() on x86.
> 

One more thought:

Perhaps there's more affinity with synic code than with generic
per-cpu memory, and it would be even better to allocate and
free the synic_eventring_tail memory for each vCPU in
hv_synic_init()/cleanup(), or hv_synic_enable/disable_regs().
There's potentially some interaction with hibernate suspend/resume,
which I assume isn't a valid scenario for the root partition. But I
haven't thought through all the details.

Michael


* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-02-26 23:08 ` [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail Nuno Das Neves
  2025-02-26 23:39   ` Stanislav Kinsburskii
  2025-03-07 17:02   ` Michael Kelley
@ 2025-03-10 13:01   ` Tianyu Lan
  2025-03-12 19:44     ` Nuno Das Neves
  2 siblings, 1 reply; 108+ messages in thread
From: Tianyu Lan @ 2025-03-10 13:01 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
> SynIC event ring buffer for each SINT.
>
> This will be used by the mshv driver, but must be tracked independently
> since the driver module could be removed and re-inserted.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Reviewed-by: Wei Liu <wei.liu@kernel.org>

It's better to expose a function to check the tail instead of exposing
hv_synic_eventring_tail directly.

BTW, how does mshv driver use hv_synic_eventring_tail? Which patch
uses it in this series?

Thanks.


> ---
>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 252fd66ad4db..2763cb6d3678 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
>
>  static struct ctl_table_header *hv_ctl_table_hdr;
>
> +/*
> + * Per-cpu array holding the tail pointer for the SynIC event ring buffer
> + * for each SINT.
> + *
> + * We cannot maintain this in mshv driver because the tail pointer should
> + * persist even if the mshv driver is unloaded.
> + */
> +u8 __percpu **hv_synic_eventring_tail;
> +EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
> +
>  /*
>   * Hyper-V specific initialization and shutdown code that is
>   * common across all architectures.  Called from architecture
> @@ -90,6 +100,9 @@ void __init hv_common_free(void)
>
>         free_percpu(hyperv_pcpu_input_arg);
>         hyperv_pcpu_input_arg = NULL;
> +
> +       free_percpu(hv_synic_eventring_tail);
> +       hv_synic_eventring_tail = NULL;
>  }
>
>  /*
> @@ -372,6 +385,11 @@ int __init hv_common_init(void)
>                 BUG_ON(!hyperv_pcpu_output_arg);
>         }
>
> +       if (hv_root_partition()) {
> +               hv_synic_eventring_tail = alloc_percpu(u8 *);
> +               BUG_ON(hv_synic_eventring_tail == NULL);
> +       }
> +
>         hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
>                                     GFP_KERNEL);
>         if (!hv_vp_index) {
> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
>  int hv_common_cpu_init(unsigned int cpu)
>  {
>         void **inputarg, **outputarg;
> +       u8 **synic_eventring_tail;
>         u64 msr_vp_index;
>         gfp_t flags;
>         const int pgcount = hv_output_page_exists() ? 2 : 1;
> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
>         inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
>
>         /*
> -        * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
> -        * allocated if this CPU was previously online and then taken offline
> +        * The per-cpu memory is already allocated if this CPU was previously
> +        * online and then taken offline
>          */
>         if (!*inputarg) {
>                 mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
>                         *outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
>                 }
>
> +               if (hv_root_partition()) {
> +                       synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> +                       *synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
> +                                                       sizeof(u8), flags);
> +
> +                       if (unlikely(!*synic_eventring_tail)) {
> +                               kfree(mem);
> +                               return -ENOMEM;
> +                       }
> +               }
> +
>                 if (!ms_hyperv.paravisor_present &&
>                     (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
>                         ret = set_memory_decrypted((unsigned long)mem, pgcount);
> --
> 2.34.1
>
>


-- 
Thanks
Tianyu Lan


* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-10 13:01   ` Tianyu Lan
@ 2025-03-12 19:44     ` Nuno Das Neves
  2025-03-13  7:34       ` Tianyu Lan
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-12 19:44 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 3/10/2025 6:01 AM, Tianyu Lan wrote:
> On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
> <nunodasneves@linux.microsoft.com> wrote:
>>
>> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
>> SynIC event ring buffer for each SINT.
>>
>> This will be used by the mshv driver, but must be tracked independently
>> since the driver module could be removed and re-inserted.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> 
> It's better to expose a function to check the tail instead of exposing
> hv_synic_eventring_tail directly.
> 
What is the advantage of using a function for this? We need to both set
and get the tail.

> BTW, how does mshv driver use hv_synic_eventring_tail? Which patch
> uses it in this series?
>
This variable stores indices into the synic eventring page (one for each
SINT, and per-cpu). Each SINT has a ringbuffer of u32 messages. The tail
index points to the latest one.

This is only used for doorbell messages today. The message in this case is
a port number which is used to lookup and invoke a callback, which signals
ioeventfd(s), to notify the VMM of a guest MMIO write.

It is used in patch 10.
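
As a rough illustration of the description above, here is a userspace sketch
of consuming one entry from a per-SINT ring using a stored tail index. It
models a single CPU only. The ring size, the "0 means empty" convention, the
reading of the tail as the next slot to consume, and all ring_* names are
assumptions made for the example; this is not the code from patch 10.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SINT_COUNT 16  /* assumption: stand-in for HV_SYNIC_SINT_COUNT */
#define RING_SLOTS      8   /* assumption: illustrative ring size only */

/* One ring of u32 messages per SINT (models one CPU's event ring page). */
struct ring_model {
	uint32_t data[RING_SLOTS];
};

static struct ring_model ring_page[RING_SINT_COUNT];

/* One tail index per SINT (models one CPU's tail array). */
static uint8_t ring_tail[RING_SINT_COUNT];

/*
 * Return the next queued port number for this SINT, or 0 if the slot at
 * the tail is empty. On success the slot is cleared and the tail advances,
 * wrapping at the end of the ring.
 */
static uint32_t ring_get_queued_port(unsigned int sint)
{
	uint8_t tail;
	uint32_t port;

	assert(sint < RING_SINT_COUNT);

	tail = ring_tail[sint];
	port = ring_page[sint].data[tail];
	if (!port)
		return 0;

	ring_page[sint].data[tail] = 0;
	ring_tail[sint] = (uint8_t)((tail + 1) % RING_SLOTS);

	return port;
}
```

Because the tail lives outside the ring itself, it has to survive for as long
as the ring contents do, which is the motivation given in the commit message
for keeping it out of the module.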

Thanks
Nuno
 
> Thanks.
> 
> 
>> ---
>>  drivers/hv/hv_common.c | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 252fd66ad4db..2763cb6d3678 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -68,6 +68,16 @@ static void hv_kmsg_dump_unregister(void);
>>
>>  static struct ctl_table_header *hv_ctl_table_hdr;
>>
>> +/*
>> + * Per-cpu array holding the tail pointer for the SynIC event ring buffer
>> + * for each SINT.
>> + *
>> + * We cannot maintain this in mshv driver because the tail pointer should
>> + * persist even if the mshv driver is unloaded.
>> + */
>> +u8 __percpu **hv_synic_eventring_tail;
>> +EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
>> +
>>  /*
>>   * Hyper-V specific initialization and shutdown code that is
>>   * common across all architectures.  Called from architecture
>> @@ -90,6 +100,9 @@ void __init hv_common_free(void)
>>
>>         free_percpu(hyperv_pcpu_input_arg);
>>         hyperv_pcpu_input_arg = NULL;
>> +
>> +       free_percpu(hv_synic_eventring_tail);
>> +       hv_synic_eventring_tail = NULL;
>>  }
>>
>>  /*
>> @@ -372,6 +385,11 @@ int __init hv_common_init(void)
>>                 BUG_ON(!hyperv_pcpu_output_arg);
>>         }
>>
>> +       if (hv_root_partition()) {
>> +               hv_synic_eventring_tail = alloc_percpu(u8 *);
>> +               BUG_ON(hv_synic_eventring_tail == NULL);
>> +       }
>> +
>>         hv_vp_index = kmalloc_array(nr_cpu_ids, sizeof(*hv_vp_index),
>>                                     GFP_KERNEL);
>>         if (!hv_vp_index) {
>> @@ -460,6 +478,7 @@ void __init ms_hyperv_late_init(void)
>>  int hv_common_cpu_init(unsigned int cpu)
>>  {
>>         void **inputarg, **outputarg;
>> +       u8 **synic_eventring_tail;
>>         u64 msr_vp_index;
>>         gfp_t flags;
>>         const int pgcount = hv_output_page_exists() ? 2 : 1;
>> @@ -472,8 +491,8 @@ int hv_common_cpu_init(unsigned int cpu)
>>         inputarg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
>>
>>         /*
>> -        * hyperv_pcpu_input_arg and hyperv_pcpu_output_arg memory is already
>> -        * allocated if this CPU was previously online and then taken offline
>> +        * The per-cpu memory is already allocated if this CPU was previously
>> +        * online and then taken offline
>>          */
>>         if (!*inputarg) {
>>                 mem = kmalloc(pgcount * HV_HYP_PAGE_SIZE, flags);
>> @@ -485,6 +504,17 @@ int hv_common_cpu_init(unsigned int cpu)
>>                         *outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
>>                 }
>>
>> +               if (hv_root_partition()) {
>> +                       synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
>> +                       *synic_eventring_tail = kcalloc(HV_SYNIC_SINT_COUNT,
>> +                                                       sizeof(u8), flags);
>> +
>> +                       if (unlikely(!*synic_eventring_tail)) {
>> +                               kfree(mem);
>> +                               return -ENOMEM;
>> +                       }
>> +               }
>> +
>>                 if (!ms_hyperv.paravisor_present &&
>>                     (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
>>                         ret = set_memory_decrypted((unsigned long)mem, pgcount);
>> --
>> 2.34.1
>>
>>
> 
> 



* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-12 19:44     ` Nuno Das Neves
@ 2025-03-13  7:34       ` Tianyu Lan
  2025-03-13 15:56         ` Nuno Das Neves
  0 siblings, 1 reply; 108+ messages in thread
From: Tianyu Lan @ 2025-03-13  7:34 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Mar 13, 2025 at 3:45 AM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> On 3/10/2025 6:01 AM, Tianyu Lan wrote:
> > On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
> > <nunodasneves@linux.microsoft.com> wrote:
> >>
> >> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
> >> SynIC event ring buffer for each SINT.
> >>
> >> This will be used by the mshv driver, but must be tracked independently
> >> since the driver module could be removed and re-inserted.
> >>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> >
> > It's better to expose a function to check the tail instead of exposing
> > hv_synic_eventring_tail directly.
> >
> What is the advantage of using a function for this? We need to both set
> and get the tail.

We may need to add a lock or a check to avoid race conditions, depending on
the use case. This is why I want to see how the mshv driver uses it.

>
> > BTW, how does mshv driver use hv_synic_eventring_tail? Which patch
> > uses it in this series?
> >
> This variable stores indices into the synic eventring page (one for each
> SINT, and per-cpu). Each SINT has a ringbuffer of u32 messages. The tail
> index points to the latest one.
>
> This is only used for doorbell messages today. The message in this case is
> a port number which is used to lookup and invoke a callback, which signals
> ioeventfd(s), to notify the VMM of a guest MMIO write.
>
> It is used in patch 10.

I found "extern u8 __percpu **hv_synic_eventring_tail;" in
drivers/hv/mshv_root.h in patch 10.
I seem to have missed the code that uses it.

+int hv_call_unmap_stat_page(enum hv_stats_object_type type,
+                           const union hv_stats_object_identity *identity);
+int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
+                                  u64 page_struct_count, u32 host_access,
+                                  u32 flags, u8 acquire);
+
+extern struct mshv_root mshv_root;
+extern enum hv_scheduler_type hv_scheduler_type;
+extern u8 __percpu **hv_synic_eventring_tail;
+
+#endif /* _MSHV_ROOT_H_ */

-- 
Thanks
Tianyu Lan


* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-13  7:34       ` Tianyu Lan
@ 2025-03-13 15:56         ` Nuno Das Neves
  2025-03-13 16:00           ` Tianyu Lan
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-13 15:56 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 3/13/2025 12:34 AM, Tianyu Lan wrote:
> On Thu, Mar 13, 2025 at 3:45 AM Nuno Das Neves
> <nunodasneves@linux.microsoft.com> wrote:
>>
>> On 3/10/2025 6:01 AM, Tianyu Lan wrote:
>>> On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
>>> <nunodasneves@linux.microsoft.com> wrote:
>>>>
>>>> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
>>>> SynIC event ring buffer for each SINT.
>>>>
>>>> This will be used by the mshv driver, but must be tracked independently
>>>> since the driver module could be removed and re-inserted.
>>>>
>>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
>>>
>>> It's better to expose a function to check the tail instead of exposing
>>> hv_synic_eventring_tail directly.
>>>
>> What is the advantage of using a function for this? We need to both set
>> and get the tail.
> 
> We may add lock or check to avoid race conditions and this depends on the
> user case. This is why I want to see how mshv driver uses it.
> 
>>
>>> BTW, how does mshv driver use hv_synic_eventring_tail? Which patch
>>> uses it in this series?
>>>
>> This variable stores indices into the synic eventring page (one for each
>> SINT, and per-cpu). Each SINT has a ringbuffer of u32 messages. The tail
>> index points to the latest one.
>>
>> This is only used for doorbell messages today. The message in this case is
>> a port number which is used to lookup and invoke a callback, which signals
>> ioeventfd(s), to notify the VMM of a guest MMIO write.
>>
>> It is used in patch 10.
> 
> I found "extern u8 __percpu **hv_synic_eventring_tail;" in the
> drivers/hv/mshv_root.h of patch 10.
> I seem to miss the code to use it.
> 
> +int hv_call_unmap_stat_page(enum hv_stats_object_type type,
> +                           const union hv_stats_object_identity *identity);
> +int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> +                                  u64 page_struct_count, u32 host_access,
> +                                  u32 flags, u8 acquire);
> +
> +extern struct mshv_root mshv_root;
> +extern enum hv_scheduler_type hv_scheduler_type;
> +extern u8 __percpu **hv_synic_eventring_tail;
> +
> +#endif /* _MSHV_ROOT_H_ */
> 

It is used in mshv_synic.c in synic_event_ring_get_queued_port():

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
new file mode 100644
index 000000000000..e7782f92e339
--- /dev/null
+++ b/drivers/hv/mshv_synic.c
@@ -0,0 +1,665 @@
<snip>
+static u32 synic_event_ring_get_queued_port(u32 sint_index)
+{
+	struct hv_synic_event_ring_page **event_ring_page;
+	volatile struct hv_synic_event_ring *ring;
+	struct hv_synic_pages *spages;
+	u8 **synic_eventring_tail;
+	u32 message;
+	u8 tail;
+
+	spages = this_cpu_ptr(mshv_root.synic_pages);
+	event_ring_page = &spages->synic_event_ring_page;
+	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
+	tail = (*synic_eventring_tail)[sint_index];
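
To make the role of the tail index concrete, here is a rough userspace
model of the consume step. It is only a sketch of the idea described
above (one ring of u32 messages per SINT, with the tail stored outside
the ring), not the driver code. NUM_SINTS, RING_MSG_COUNT, struct
cpu_state, ring_post() and the head[] array are made up for
illustration, and the sketch assumes that a zero slot means "no message".

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SINTS      16	/* hypothetical number of SINTs in the model */
#define RING_MSG_COUNT 8	/* hypothetical ring size, not the real constant */

/* One ring of u32 messages per SINT; a zero slot means "no message". */
struct event_ring {
	uint32_t data[RING_MSG_COUNT];
};

/* Model of one CPU: the rings, plus tail indices kept outside the rings. */
struct cpu_state {
	struct event_ring ring[NUM_SINTS];
	uint8_t tail[NUM_SINTS];	/* consumer index, one per SINT */
	uint8_t head[NUM_SINTS];	/* model-only producer index */
};

/* Producer stand-in (the hypervisor in the real system). */
static int ring_post(struct cpu_state *cpu, uint32_t sint, uint32_t port)
{
	uint8_t head;

	assert(sint < NUM_SINTS && port != 0);
	head = cpu->head[sint];
	if (cpu->ring[sint].data[head] != 0)
		return -1;	/* slot not consumed yet: ring is full */
	cpu->ring[sint].data[head] = port;
	cpu->head[sint] = (uint8_t)((head + 1) % RING_MSG_COUNT);
	return 0;
}

/* Consumer: read the slot at the tail, clear it, advance with wraparound. */
static uint32_t ring_get_queued_port(struct cpu_state *cpu, uint32_t sint)
{
	uint8_t tail;
	uint32_t message;

	assert(sint < NUM_SINTS);
	tail = cpu->tail[sint];
	message = cpu->ring[sint].data[tail];
	if (message == 0)
		return 0;	/* nothing queued for this SINT */

	cpu->ring[sint].data[tail] = 0;
	cpu->tail[sint] = (uint8_t)((tail + 1) % RING_MSG_COUNT);
	return message;
}
```

In this model the tail array lives next to the rings rather than inside
them, which mirrors the reason given in the commit message for tracking
the tail independently of the driver module.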


* Re: [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail
  2025-03-13 15:56         ` Nuno Das Neves
@ 2025-03-13 16:00           ` Tianyu Lan
  0 siblings, 0 replies; 108+ messages in thread
From: Tianyu Lan @ 2025-03-13 16:00 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Mar 13, 2025 at 11:56 PM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> On 3/13/2025 12:34 AM, Tianyu Lan wrote:
> > On Thu, Mar 13, 2025 at 3:45 AM Nuno Das Neves
> > <nunodasneves@linux.microsoft.com> wrote:
> >>
> >> On 3/10/2025 6:01 AM, Tianyu Lan wrote:
> >>> On Thu, Feb 27, 2025 at 7:09 AM Nuno Das Neves
> >>> <nunodasneves@linux.microsoft.com> wrote:
> >>>>
> >>>> Add a pointer hv_synic_eventring_tail to track the tail pointer for the
> >>>> SynIC event ring buffer for each SINT.
> >>>>
> >>>> This will be used by the mshv driver, but must be tracked independently
> >>>> since the driver module could be removed and re-inserted.
> >>>>
> >>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >>>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> >>>
> >>> It's better to expose a function to check the tail instead of exposing
> >>> hv_synic_eventring_tail directly.
> >>>
> >> What is the advantage of using a function for this? We need to both set
> >> and get the tail.
> >
> > We may add lock or check to avoid race conditions and this depends on the
> > user case. This is why I want to see how mshv driver uses it.
> >
> >>
> >>> BTW, how does mshv driver use hv_synic_eventring_tail? Which patch
> >>> uses it in this series?
> >>>
> >> This variable stores indices into the synic eventring page (one for each
> >> SINT, and per-cpu). Each SINT has a ringbuffer of u32 messages. The tail
> >> index points to the latest one.
> >>
> >> This is only used for doorbell messages today. The message in this case is
> >> a port number which is used to lookup and invoke a callback, which signals
> >> ioeventfd(s), to notify the VMM of a guest MMIO write.
> >>
> >> It is used in patch 10.
> >
> > I found "extern u8 __percpu **hv_synic_eventring_tail;" in the
> > drivers/hv/mshv_root.h of patch 10.
> > I seem to miss the code to use it.
> >
> > +int hv_call_unmap_stat_page(enum hv_stats_object_type type,
> > +                           const union hv_stats_object_identity *identity);
> > +int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> > +                                  u64 page_struct_count, u32 host_access,
> > +                                  u32 flags, u8 acquire);
> > +
> > +extern struct mshv_root mshv_root;
> > +extern enum hv_scheduler_type hv_scheduler_type;
> > +extern u8 __percpu **hv_synic_eventring_tail;
> > +
> > +#endif /* _MSHV_ROOT_H_ */
> >
>
> It is used in mshv_synic.c in synic_event_ring_get_queued_port():
>
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> new file mode 100644
> index 000000000000..e7782f92e339
> --- /dev/null
> +++ b/drivers/hv/mshv_synic.c
> @@ -0,0 +1,665 @@
> <snip>
> +static u32 synic_event_ring_get_queued_port(u32 sint_index)
> +{
> +       struct hv_synic_event_ring_page **event_ring_page;
> +       volatile struct hv_synic_event_ring *ring;
> +       struct hv_synic_pages *spages;
> +       u8 **synic_eventring_tail;
> +       u32 message;
> +       u8 tail;
> +
> +       spages = this_cpu_ptr(mshv_root.synic_pages);
> +       event_ring_page = &spages->synic_event_ring_page;
> +       synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> +       tail = (*synic_eventring_tail)[sint_index];

OK. I got it. Thanks.

-- 
Thanks
Tianyu Lan


* [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (6 preceding siblings ...)
  2025-02-26 23:08 ` [PATCH v5 07/10] Drivers: hv: Introduce per-cpu event ring tail Nuno Das Neves
@ 2025-02-26 23:08 ` Nuno Das Neves
  2025-02-26 23:43   ` Stanislav Kinsburskii
  2025-03-07 17:44   ` Michael Kelley
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
  9 siblings, 2 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:08 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

This will handle SYNIC interrupts such as intercepts, doorbells, and
scheduling messages intended for the mshv driver.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
 drivers/hv/hv_common.c         | 5 +++++
 include/asm-generic/mshyperv.h | 1 +
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 0116d0e96ef9..616e9a5d77b4 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
 }
 EXPORT_SYMBOL_GPL(hv_set_msr);
 
+static void (*mshv_handler)(void);
 static void (*vmbus_handler)(void);
 static void (*hv_stimer0_handler)(void);
 static void (*hv_kexec_handler)(void);
@@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	inc_irq_stat(irq_hv_callback_count);
+	if (mshv_handler)
+		mshv_handler();
+
 	if (vmbus_handler)
 		vmbus_handler();
 
@@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 	set_irq_regs(old_regs);
 }
 
+void hv_setup_mshv_handler(void (*handler)(void))
+{
+	mshv_handler = handler;
+}
+
 void hv_setup_vmbus_handler(void (*handler)(void))
 {
 	vmbus_handler = handler;
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 2763cb6d3678..f5a07fd9a03b 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
 }
 EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
 
+void __weak hv_setup_mshv_handler(void (*handler)(void))
+{
+}
+EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
+
 void __weak hv_setup_kexec_handler(void (*handler)(void))
 {
 }
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 1f46d19a16aa..a05f12e63ccd 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
 void hv_remove_kexec_handler(void);
 void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
 void hv_remove_crash_handler(void);
+void hv_setup_mshv_handler(void (*handler)(void));
 
 extern int vmbus_interrupt;
 extern int vmbus_irq;
-- 
2.34.1



* Re: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-02-26 23:08 ` [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function Nuno Das Neves
@ 2025-02-26 23:43   ` Stanislav Kinsburskii
  2025-03-01  0:38     ` Nuno Das Neves
  2025-03-07 17:44   ` Michael Kelley
  1 sibling, 1 reply; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:43 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:08:02PM -0800, Nuno Das Neves wrote:
> This will handle SYNIC interrupts such as intercepts, doorbells, and
> scheduling messages intended for the mshv driver.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
>  drivers/hv/hv_common.c         | 5 +++++
>  include/asm-generic/mshyperv.h | 1 +
>  3 files changed, 15 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 0116d0e96ef9..616e9a5d77b4 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
>  }
>  EXPORT_SYMBOL_GPL(hv_set_msr);
>  
> +static void (*mshv_handler)(void);
>  static void (*vmbus_handler)(void);
>  static void (*hv_stimer0_handler)(void);
>  static void (*hv_kexec_handler)(void);
> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>  	struct pt_regs *old_regs = set_irq_regs(regs);
>  
>  	inc_irq_stat(irq_hv_callback_count);
> +	if (mshv_handler)
> +		mshv_handler();

Can mshv_handler be defined as a weak symbol that does nothing, instead
of as a null function pointer?
That should simplify this code and get rid of
hv_setup_mshv_handler, which looks redundant.
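
For illustration, the shape of that suggestion could look roughly like
the userspace sketch below. It uses the GCC/Clang weak attribute as a
stand-in for the kernel's __weak, and hyperv_callback_model() and
irq_count are hypothetical names that only model the interrupt path.

```c
#include <assert.h>

void mshv_handler(void);

static int irq_count;

/* Weak default: an empty handler, used unless a strong definition is linked in. */
__attribute__((weak)) void mshv_handler(void)
{
}

/* Stand-in for the callback: no NULL check, since a definition always exists. */
static void hyperv_callback_model(void)
{
	irq_count++;
	mshv_handler();
}
```

A strong definition of mshv_handler() in another object file linked into
the same image would replace the empty one when the image is linked.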

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

> +
>  	if (vmbus_handler)
>  		vmbus_handler();
>  
> @@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>  	set_irq_regs(old_regs);
>  }
>  
> +void hv_setup_mshv_handler(void (*handler)(void))
> +{
> +	mshv_handler = handler;
> +}
> +
>  void hv_setup_vmbus_handler(void (*handler)(void))
>  {
>  	vmbus_handler = handler;
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 2763cb6d3678..f5a07fd9a03b 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
>  }
>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
>  
> +void __weak hv_setup_mshv_handler(void (*handler)(void))
> +{
> +}
> +EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
> +
>  void __weak hv_setup_kexec_handler(void (*handler)(void))
>  {
>  }
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 1f46d19a16aa..a05f12e63ccd 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
>  void hv_remove_kexec_handler(void);
>  void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
>  void hv_remove_crash_handler(void);
> +void hv_setup_mshv_handler(void (*handler)(void));
>  
>  extern int vmbus_interrupt;
>  extern int vmbus_irq;
> -- 
> 2.34.1
> 


* Re: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-02-26 23:43   ` Stanislav Kinsburskii
@ 2025-03-01  0:38     ` Nuno Das Neves
  2025-03-07 17:38       ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-01  0:38 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:43 PM, Stanislav Kinsburskii wrote:
> On Wed, Feb 26, 2025 at 03:08:02PM -0800, Nuno Das Neves wrote:
>> This will handle SYNIC interrupts such as intercepts, doorbells, and
>> scheduling messages intended for the mshv driver.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
>> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
>> ---
>>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
>>  drivers/hv/hv_common.c         | 5 +++++
>>  include/asm-generic/mshyperv.h | 1 +
>>  3 files changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 0116d0e96ef9..616e9a5d77b4 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
>>  }
>>  EXPORT_SYMBOL_GPL(hv_set_msr);
>>  
>> +static void (*mshv_handler)(void);
>>  static void (*vmbus_handler)(void);
>>  static void (*hv_stimer0_handler)(void);
>>  static void (*hv_kexec_handler)(void);
>> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>>  	struct pt_regs *old_regs = set_irq_regs(regs);
>>  
>>  	inc_irq_stat(irq_hv_callback_count);
>> +	if (mshv_handler)
>> +		mshv_handler();
> 
> Can mshv_handler be defined as a weak symbol doing nothing instead
> of defining it a null pointer?
> This should allow to simplify this code and get rid of
> hv_setup_mshv_handler, which looks redundant.
> 
Interesting, I tested this and it does seem to work! It seems like
a good change, thanks.

> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
>> +
>>  	if (vmbus_handler)
>>  		vmbus_handler();
>>  
>> @@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>>  	set_irq_regs(old_regs);
>>  }
>>  
>> +void hv_setup_mshv_handler(void (*handler)(void))
>> +{
>> +	mshv_handler = handler;
>> +}
>> +
>>  void hv_setup_vmbus_handler(void (*handler)(void))
>>  {
>>  	vmbus_handler = handler;
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 2763cb6d3678..f5a07fd9a03b 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
>>  }
>>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
>>  
>> +void __weak hv_setup_mshv_handler(void (*handler)(void))
>> +{
>> +}
>> +EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
>> +
>>  void __weak hv_setup_kexec_handler(void (*handler)(void))
>>  {
>>  }
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index 1f46d19a16aa..a05f12e63ccd 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
>>  void hv_remove_kexec_handler(void);
>>  void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
>>  void hv_remove_crash_handler(void);
>> +void hv_setup_mshv_handler(void (*handler)(void));
>>  
>>  extern int vmbus_interrupt;
>>  extern int vmbus_irq;
>> -- 
>> 2.34.1
>>



* RE: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-03-01  0:38     ` Nuno Das Neves
@ 2025-03-07 17:38       ` Michael Kelley
  2025-03-10 21:46         ` Nuno Das Neves
  0 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 17:38 UTC (permalink / raw)
  To: Nuno Das Neves, Stanislav Kinsburskii
  Cc: linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, February 28, 2025 4:38 PM
> 
> On 2/26/2025 3:43 PM, Stanislav Kinsburskii wrote:
> > On Wed, Feb 26, 2025 at 03:08:02PM -0800, Nuno Das Neves wrote:
> >> This will handle SYNIC interrupts such as intercepts, doorbells, and
> >> scheduling messages intended for the mshv driver.
> >>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> >> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
> >> ---
> >>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
> >>  drivers/hv/hv_common.c         | 5 +++++
> >>  include/asm-generic/mshyperv.h | 1 +
> >>  3 files changed, 15 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> >> index 0116d0e96ef9..616e9a5d77b4 100644
> >> --- a/arch/x86/kernel/cpu/mshyperv.c
> >> +++ b/arch/x86/kernel/cpu/mshyperv.c
> >> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
> >>  }
> >>  EXPORT_SYMBOL_GPL(hv_set_msr);
> >>
> >> +static void (*mshv_handler)(void);
> >>  static void (*vmbus_handler)(void);
> >>  static void (*hv_stimer0_handler)(void);
> >>  static void (*hv_kexec_handler)(void);
> >> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
> >>  	struct pt_regs *old_regs = set_irq_regs(regs);
> >>
> >>  	inc_irq_stat(irq_hv_callback_count);
> >> +	if (mshv_handler)
> >> +		mshv_handler();
> >
> > Can mshv_handler be defined as a weak symbol doing nothing instead
> > of defining it a null pointer?
> > This should allow to simplify this code and get rid of
> > hv_setup_mshv_handler, which looks redundant.
> >
> > Interesting, I tested this and it does seem to work! It seems like
> a good change, thanks.

Just be a bit careful. When CONFIG_HYPERV=n, mshyperv.c still gets
built even though none of the other Hyper-V related files do.  There
are #ifdef CONFIG_HYPERV in mshyperv.c to eliminate references to
Hyper-V files that wouldn't be built. I'd suggest doing a test build with
that configuration to make sure it's all clean.

Michael

> 
> > Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >
> >> +
> >>  	if (vmbus_handler)
> >>  		vmbus_handler();
> >>
> >> @@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
> >>  	set_irq_regs(old_regs);
> >>  }
> >>
> >> +void hv_setup_mshv_handler(void (*handler)(void))
> >> +{
> >> +	mshv_handler = handler;
> >> +}
> >> +
> >>  void hv_setup_vmbus_handler(void (*handler)(void))
> >>  {
> >>  	vmbus_handler = handler;
> >> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> >> index 2763cb6d3678..f5a07fd9a03b 100644
> >> --- a/drivers/hv/hv_common.c
> >> +++ b/drivers/hv/hv_common.c
> >> @@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
> >>  }
> >>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
> >>
> >> +void __weak hv_setup_mshv_handler(void (*handler)(void))
> >> +{
> >> +}
> >> +EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
> >> +
> >>  void __weak hv_setup_kexec_handler(void (*handler)(void))
> >>  {
> >>  }
> >> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> >> index 1f46d19a16aa..a05f12e63ccd 100644
> >> --- a/include/asm-generic/mshyperv.h
> >> +++ b/include/asm-generic/mshyperv.h
> >> @@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
> >>  void hv_remove_kexec_handler(void);
> >>  void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
> >>  void hv_remove_crash_handler(void);
> >> +void hv_setup_mshv_handler(void (*handler)(void));
> >>
> >>  extern int vmbus_interrupt;
> >>  extern int vmbus_irq;
> >> --
> >> 2.34.1
> >>



* Re: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-03-07 17:38       ` Michael Kelley
@ 2025-03-10 21:46         ` Nuno Das Neves
  2025-03-10 22:23           ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-10 21:46 UTC (permalink / raw)
  To: Michael Kelley, Stanislav Kinsburskii
  Cc: linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/7/2025 9:38 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, February 28, 2025 4:38 PM
>>
>> On 2/26/2025 3:43 PM, Stanislav Kinsburskii wrote:
>>> On Wed, Feb 26, 2025 at 03:08:02PM -0800, Nuno Das Neves wrote:
>>>> This will handle SYNIC interrupts such as intercepts, doorbells, and
>>>> scheduling messages intended for the mshv driver.
>>>>
>>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
>>>> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
>>>> ---
>>>>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
>>>>  drivers/hv/hv_common.c         | 5 +++++
>>>>  include/asm-generic/mshyperv.h | 1 +
>>>>  3 files changed, 15 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>>>> index 0116d0e96ef9..616e9a5d77b4 100644
>>>> --- a/arch/x86/kernel/cpu/mshyperv.c
>>>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>>>> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(hv_set_msr);
>>>>
>>>> +static void (*mshv_handler)(void);
>>>>  static void (*vmbus_handler)(void);
>>>>  static void (*hv_stimer0_handler)(void);
>>>>  static void (*hv_kexec_handler)(void);
>>>> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>>>>  	struct pt_regs *old_regs = set_irq_regs(regs);
>>>>
>>>>  	inc_irq_stat(irq_hv_callback_count);
>>>> +	if (mshv_handler)
>>>> +		mshv_handler();
>>>
>>> Can mshv_handler be defined as a weak symbol doing nothing instead
>>> of defining it a null pointer?
>>> This should allow to simplify this code and get rid of
>>> hv_setup_mshv_handler, which looks redundant.
>>>
>> Interesting, I tested this and it does seem to work! It seems like
>> a good change, thanks.
> 
> Just be a bit careful. When CONFIG_HYPERV=n, mshyperv.c still gets
> built even though none of the other Hyper-V related files do.  There
> are #ifdef CONFIG_HYPERV in mshyperv.c to eliminate references to
> Hyper-V files that wouldn't be built. I'd suggest doing a test build with
> that configuration to make sure it's all clean.
> 
Thanks Michael - I don't think it would be an issue since the __weak version
would be defined in mshyperv.c itself, replacing the function pointer.

However, I went and tested this __weak version again with CONFIG_MSHV_ROOT=m
and it does not actually work. Everything seems ok at first (it compiles,
can insert the module), but upon starting a guest, the interrupts don't get
delivered to the root (or rather, they don't get handled by mshv_handler()).

This seems to match with what the ld docs say - There's an option
LD_DYNAMIC_LINK to allow __weak symbols to be overridden by the dynamic
linker, but this is not enabled in the kernel.

So I will stick with the current implementation.
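
As a rough userspace model of why the function pointer works here: the
pointer is an ordinary variable, so it can be filled in after the image
is linked, for example from a module's init function. The names
hyperv_callback_model(), demo_module_isr(), callback_count and handled
are hypothetical, and clearing the pointer with NULL on unload is an
assumption of this sketch (the patch only shows the setup function).

```c
#include <assert.h>
#include <stddef.h>

/* NULL until a handler is registered, as in the patch. */
static void (*mshv_handler)(void);

static int callback_count;	/* how often the callback model ran */
static int handled;		/* how often the registered handler ran */

/* Registration hook, modeled on hv_setup_mshv_handler() from the patch. */
static void hv_setup_mshv_handler(void (*handler)(void))
{
	mshv_handler = handler;
}

/* Stand-in for the interrupt callback: call the handler only if set. */
static void hyperv_callback_model(void)
{
	callback_count++;
	if (mshv_handler)
		mshv_handler();
}

/* Hypothetical handler that a loadable module would register at init. */
static void demo_module_isr(void)
{
	handled++;
}
```

Before registration the callback skips the handler; after
hv_setup_mshv_handler(demo_module_isr) every callback reaches it.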

Nuno

> Michael
> 
>>
>>> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>
>>>> +
>>>>  	if (vmbus_handler)
>>>>  		vmbus_handler();
>>>>
>>>> @@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>>>>  	set_irq_regs(old_regs);
>>>>  }
>>>>
>>>> +void hv_setup_mshv_handler(void (*handler)(void))
>>>> +{
>>>> +	mshv_handler = handler;
>>>> +}
>>>> +
>>>>  void hv_setup_vmbus_handler(void (*handler)(void))
>>>>  {
>>>>  	vmbus_handler = handler;
>>>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>>>> index 2763cb6d3678..f5a07fd9a03b 100644
>>>> --- a/drivers/hv/hv_common.c
>>>> +++ b/drivers/hv/hv_common.c
>>>> @@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
>>>>
>>>> +void __weak hv_setup_mshv_handler(void (*handler)(void))
>>>> +{
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
>>>> +
>>>>  void __weak hv_setup_kexec_handler(void (*handler)(void))
>>>>  {
>>>>  }
>>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>>> index 1f46d19a16aa..a05f12e63ccd 100644
>>>> --- a/include/asm-generic/mshyperv.h
>>>> +++ b/include/asm-generic/mshyperv.h
>>>> @@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
>>>>  void hv_remove_kexec_handler(void);
>>>>  void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
>>>>  void hv_remove_crash_handler(void);
>>>> +void hv_setup_mshv_handler(void (*handler)(void));
>>>>
>>>>  extern int vmbus_interrupt;
>>>>  extern int vmbus_irq;
>>>> --
>>>> 2.34.1
>>>>
> 



* RE: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-03-10 21:46         ` Nuno Das Neves
@ 2025-03-10 22:23           ` Michael Kelley
  0 siblings, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-10 22:23 UTC (permalink / raw)
  To: Nuno Das Neves, Stanislav Kinsburskii
  Cc: linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, March 10, 2025 2:47 PM
> 
> On 3/7/2025 9:38 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, February 28, 2025 4:38 PM
> >>
> >> On 2/26/2025 3:43 PM, Stanislav Kinsburskii wrote:
> >>> On Wed, Feb 26, 2025 at 03:08:02PM -0800, Nuno Das Neves wrote:
> >>>> This will handle SYNIC interrupts such as intercepts, doorbells, and
> >>>> scheduling messages intended for the mshv driver.
> >>>>
> >>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >>>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> >>>> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
> >>>> ---
> >>>>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
> >>>>  drivers/hv/hv_common.c         | 5 +++++
> >>>>  include/asm-generic/mshyperv.h | 1 +
> >>>>  3 files changed, 15 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> >>>> index 0116d0e96ef9..616e9a5d77b4 100644
> >>>> --- a/arch/x86/kernel/cpu/mshyperv.c
> >>>> +++ b/arch/x86/kernel/cpu/mshyperv.c
> >>>> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(hv_set_msr);
> >>>>
> >>>> +static void (*mshv_handler)(void);
> >>>>  static void (*vmbus_handler)(void);
> >>>>  static void (*hv_stimer0_handler)(void);
> >>>>  static void (*hv_kexec_handler)(void);
> >>>> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
> >>>>  	struct pt_regs *old_regs = set_irq_regs(regs);
> >>>>
> >>>>  	inc_irq_stat(irq_hv_callback_count);
> >>>> +	if (mshv_handler)
> >>>> +		mshv_handler();
> >>>
> >>> Can mshv_handler be defined as a weak symbol doing nothing instead
> >>> of defining it a null pointer?
> >>> This should allow to simplify this code and get rid of
> >>> hv_setup_mshv_handler, which looks redundant.
> >>>
> > >> Interesting, I tested this and it does seem to work! It seems like
> >> a good change, thanks.
> >
> > Just be a bit careful. When CONFIG_HYPERV=n, mshyperv.c still gets
> > built even though none of the other Hyper-V related files do.  There
> > are #ifdef CONFIG_HYPERV in mshyperv.c to eliminate references to
> > Hyper-V files that wouldn't be built. I'd suggest doing a test build with
> > that configuration to make sure it's all clean.
> >
> Thanks Michael - I don't think it would be an issue since the __weak version
> would be defined in mshyperv.c itself, replacing the function pointer.

Yes, sounds right to me.

> 
> However, I went and tested this __weak version again with CONFIG_MSHV_ROOT=m
> and it does not actually work. Everything seems ok at first (it compiles,
> can insert the module), but upon starting a guest, the interrupts don't get
> delivered to the root (or rather, they don't get handled by mshv_handler()).
> 
> This seems to match with what the ld docs say - There's an option
> LD_DYNAMIC_LINK to allow __weak symbols to be overridden by the dynamic
> linker, but this is not enabled in the kernel.
> 

Yeah, I recall learning the hard way that a symbol defined in a module doesn't
override a __weak symbol in the kernel image. At the time, I gave up and took
a different path, and didn't get as far as looking at 'ld' options like you did. :-)
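
For illustration, a minimal user-space sketch of the function-pointer
registration that the patch uses instead of a __weak symbol. The kernel
image is linked once, so a __weak definition inside it is already bound
when a module loads; a pointer filled in at module init time avoids that.
The names demo_mshv_isr, hyperv_callback and handled_count are
hypothetical stand-ins, not part of the series.

```c
#include <stddef.h>

/* NULL until a driver registers; mirrors the static pointer in mshyperv.c. */
static void (*mshv_handler)(void);

/* Hypothetical counter so the effect of the handler is observable. */
int handled_count;

/* Same shape as hv_setup_mshv_handler() in the patch. */
void hv_setup_mshv_handler(void (*handler)(void))
{
	mshv_handler = handler;
}

/* Stand-in for the body of sysvec_hyperv_callback(). */
void hyperv_callback(void)
{
	if (mshv_handler)
		mshv_handler();
}

/* Hypothetical handler that a module would pass in at init time. */
void demo_mshv_isr(void)
{
	handled_count++;
}
```

Passing NULL to hv_setup_mshv_handler() on module exit returns the
callback to its do-nothing state.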

Michael


* RE: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-02-26 23:08 ` [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function Nuno Das Neves
  2025-02-26 23:43   ` Stanislav Kinsburskii
@ 2025-03-07 17:44   ` Michael Kelley
  2025-03-07 23:29     ` Nuno Das Neves
  1 sibling, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 17:44 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM

> 
> This will handle SYNIC interrupts such as intercepts, doorbells, and
> scheduling messages intended for the mshv driver.

Could you provide a bit more detailed commit message? How does
the mshv_handler() relate to the vmbus_handler()? From the code
mshv_handler() goes first, and I'm assuming it processes what it
knows about (intercepts, doorbells, scheduling messages?) and
then hands off control to the vmbus_handler() to handle the usual
VMbus-related message and channel interrupts. But it would be
nice to have the commit message or code comments describe the
overall intent and any obscure aspects of the relationship.

And avoid references to "This" or "This patch". :-)

> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Reviewed-by: Wei Liu <wei.liu@kernel.org>
> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
>  drivers/hv/hv_common.c         | 5 +++++
>  include/asm-generic/mshyperv.h | 1 +
>  3 files changed, 15 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 0116d0e96ef9..616e9a5d77b4 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
>  }
>  EXPORT_SYMBOL_GPL(hv_set_msr);
> 
> +static void (*mshv_handler)(void);
>  static void (*vmbus_handler)(void);
>  static void (*hv_stimer0_handler)(void);
>  static void (*hv_kexec_handler)(void);
> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>  	struct pt_regs *old_regs = set_irq_regs(regs);
> 
>  	inc_irq_stat(irq_hv_callback_count);
> +	if (mshv_handler)
> +		mshv_handler();
> +
>  	if (vmbus_handler)
>  		vmbus_handler();
> 
> @@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>  	set_irq_regs(old_regs);
>  }
> 
> +void hv_setup_mshv_handler(void (*handler)(void))
> +{
> +	mshv_handler = handler;
> +}
> +
>  void hv_setup_vmbus_handler(void (*handler)(void))
>  {
>  	vmbus_handler = handler;
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 2763cb6d3678..f5a07fd9a03b 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
>  }
>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
> 
> +void __weak hv_setup_mshv_handler(void (*handler)(void))
> +{
> +}
> +EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
> +
>  void __weak hv_setup_kexec_handler(void (*handler)(void))
>  {
>  }
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 1f46d19a16aa..a05f12e63ccd 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
>  void hv_remove_kexec_handler(void);
>  void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
>  void hv_remove_crash_handler(void);
> +void hv_setup_mshv_handler(void (*handler)(void));
> 
>  extern int vmbus_interrupt;
>  extern int vmbus_irq;
> --
> 2.34.1



* Re: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-03-07 17:44   ` Michael Kelley
@ 2025-03-07 23:29     ` Nuno Das Neves
  2025-03-07 23:45       ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 23:29 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/7/2025 9:44 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 
>>
>> This will handle SYNIC interrupts such as intercepts, doorbells, and
>> scheduling messages intended for the mshv driver.
> 
> Could you provide a bit more detailed commit message? How does
> the mshv_handler() relate to the vmbus_handler()? From the code
> mshv_handler() goes first, and I'm assuming it processes what it
> knows about (intercepts, doorbells, scheduling messages?) and
> then hands off control to the vmbus_handler() to handle the usual
> VMbus-related message and channel interrupts. But it would be
> nice to have the commit message or code comments describe the
> overall intent and any obscure aspects of the relationship.
> 
> And avoid references to "This" or "This patch". :-)
> 

You've described it pretty well, I don't know if there's too much
more to say given this patch doesn't introduce the handler code.

I can try to improve it, something like:
"
mshv_handler() will be used to process messages related to managing
guest partitions such as intercepts, doorbells, and scheduling messages.

In a (non-nested) root partition, the same interrupt vector is shared
between the vmbus and mshv_root drivers.

Introduce a stub for mshv_handler() and call it in
sysvec_hyperv_callback alongside vmbus_handler().

Even though both handlers will be called for every Hyper-V interrupt,
the messages for each driver are delivered to different offsets within
the SYNIC message page, so they won't step on each other.
"

Does that work?
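
To make the last paragraph of the proposed message concrete, here is a
rough user-space sketch of two handlers sharing one interrupt while
reading disjoint slots of a message page. The slot count, the slot
indexes and all demo_* names are assumptions made for the example; the
real layout is defined by the hypervisor and the driver code.

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_SINT_COUNT  16  /* assumed number of message slots per page */
#define DEMO_MSHV_SINT   0   /* hypothetical slot read by the mshv handler */
#define DEMO_VMBUS_SINT  2   /* hypothetical slot read by the vmbus handler */

struct demo_message {
	uint32_t message_type;  /* 0 means the slot is empty */
	uint32_t payload;
};

struct demo_message demo_message_page[DEMO_SINT_COUNT];
int demo_mshv_seen, demo_vmbus_seen;

/* Consume the message in one slot, if any; return 1 when one was handled. */
static int demo_consume(int sint)
{
	struct demo_message *msg = &demo_message_page[sint];

	if (msg->message_type == 0)
		return 0;
	msg->message_type = 0;  /* mark the slot free again */
	return 1;
}

void demo_mshv_handler(void)
{
	demo_mshv_seen += demo_consume(DEMO_MSHV_SINT);
}

void demo_vmbus_handler(void)
{
	demo_vmbus_seen += demo_consume(DEMO_VMBUS_SINT);
}

/* Both handlers run on every interrupt, as in sysvec_hyperv_callback(). */
void demo_callback(void)
{
	demo_mshv_handler();
	demo_vmbus_handler();
}
```

Each handler only looks at its own slot, so a call with nothing pending
for that driver is a cheap no-op.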

Nuno
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Reviewed-by: Wei Liu <wei.liu@kernel.org>
>> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
>> ---
>>  arch/x86/kernel/cpu/mshyperv.c | 9 +++++++++
>>  drivers/hv/hv_common.c         | 5 +++++
>>  include/asm-generic/mshyperv.h | 1 +
>>  3 files changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 0116d0e96ef9..616e9a5d77b4 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -107,6 +107,7 @@ void hv_set_msr(unsigned int reg, u64 value)
>>  }
>>  EXPORT_SYMBOL_GPL(hv_set_msr);
>>
>> +static void (*mshv_handler)(void);
>>  static void (*vmbus_handler)(void);
>>  static void (*hv_stimer0_handler)(void);
>>  static void (*hv_kexec_handler)(void);
>> @@ -117,6 +118,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>>  	struct pt_regs *old_regs = set_irq_regs(regs);
>>
>>  	inc_irq_stat(irq_hv_callback_count);
>> +	if (mshv_handler)
>> +		mshv_handler();
>> +
>>  	if (vmbus_handler)
>>  		vmbus_handler();
>>
>> @@ -126,6 +130,11 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
>>  	set_irq_regs(old_regs);
>>  }
>>
>> +void hv_setup_mshv_handler(void (*handler)(void))
>> +{
>> +	mshv_handler = handler;
>> +}
>> +
>>  void hv_setup_vmbus_handler(void (*handler)(void))
>>  {
>>  	vmbus_handler = handler;
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 2763cb6d3678..f5a07fd9a03b 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -677,6 +677,11 @@ void __weak hv_remove_vmbus_handler(void)
>>  }
>>  EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
>>
>> +void __weak hv_setup_mshv_handler(void (*handler)(void))
>> +{
>> +}
>> +EXPORT_SYMBOL_GPL(hv_setup_mshv_handler);
>> +
>>  void __weak hv_setup_kexec_handler(void (*handler)(void))
>>  {
>>  }
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index 1f46d19a16aa..a05f12e63ccd 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -208,6 +208,7 @@ void hv_setup_kexec_handler(void (*handler)(void));
>>  void hv_remove_kexec_handler(void);
>>  void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
>>  void hv_remove_crash_handler(void);
>> +void hv_setup_mshv_handler(void (*handler)(void));
>>
>>  extern int vmbus_interrupt;
>>  extern int vmbus_irq;
>> --
>> 2.34.1



* RE: [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function
  2025-03-07 23:29     ` Nuno Das Neves
@ 2025-03-07 23:45       ` Michael Kelley
  0 siblings, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 23:45 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, March 7, 2025 3:30 PM
> 
> On 3/7/2025 9:44 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> >
> >>
> >> This will handle SYNIC interrupts such as intercepts, doorbells, and
> >> scheduling messages intended for the mshv driver.
> >
> > Could you provide a bit more detailed commit message? How does
> > the mshv_handler() relate to the vmbus_handler()? From the code
> > mshv_handler() goes first, and I'm assuming it processes what it
> > knows about (intercepts, doorbells, scheduling messages?) and
> > then hands off control to the vmbus_handler() to handle the usual
> > VMbus-related message and channel interrupts. But it would be
> > nice to have the commit message or code comments describe the
> > overall intent and any obscure aspects of the relationship.
> >
> > And avoid references to "This" or "This patch". :-)
> >
> 
> You've described it pretty well, I don't know if there's too much
> more to say given this patch doesn't introduce the handler code.
> 
> I can try to improve it, something like:
> "
> mshv_handler() will be used to process messages related to managing
> guest partitions such as intercepts, doorbells, and scheduling messages.
> 
> In a (non-nested) root partition, the same interrupt vector is shared
> between the vmbus and mshv_root drivers.
> 
> Introduce a stub for mshv_handler() and call it in
> sysvec_hyperv_callback alongside vmbus_handler().
> 
> Even though both handlers will be called for every Hyper-V interrupt,
> the messages for each driver are delivered to different offsets within
> the SYNIC message page, so they won't step on each other.
> "
> 
> Does that work?

Yes, that works. I was going to ask how the two handlers could
both be used on the same interrupt, and you've provided the
explanation, which is perfect. ;-)

Minor tweak: Start the first sentence as an imperative:

   Add mshv_handler() to process messages related .....

Michael



* [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (7 preceding siblings ...)
  2025-02-26 23:08 ` [PATCH v5 08/10] x86: hyperv: Add mshv_handler irq handler and setup function Nuno Das Neves
@ 2025-02-26 23:08 ` Nuno Das Neves
  2025-02-26 23:51   ` Stanislav Kinsburskii
                     ` (4 more replies)
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
  9 siblings, 5 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:08 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

A few additional definitions are required for the mshv driver code
(to follow). Introduce those here and clean up a little bit while
at it.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
 include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
 include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
 3 files changed, 280 insertions(+), 7 deletions(-)

diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 58895883f636..e4a3cca0cbce 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -13,7 +13,7 @@ struct hv_u128 {
 	u64 high_part;
 } __packed;
 
-/* NOTE: when adding below, update hv_status_to_string() */
+/* NOTE: when adding below, update hv_result_to_string() */
 #define HV_STATUS_SUCCESS			    0x0
 #define HV_STATUS_INVALID_HYPERCALL_CODE	    0x2
 #define HV_STATUS_INVALID_HYPERCALL_INPUT	    0x3
@@ -51,6 +51,7 @@ struct hv_u128 {
 #define HV_HYP_PAGE_SHIFT		12
 #define HV_HYP_PAGE_SIZE		BIT(HV_HYP_PAGE_SHIFT)
 #define HV_HYP_PAGE_MASK		(~(HV_HYP_PAGE_SIZE - 1))
+#define HV_HYP_LARGE_PAGE_SHIFT		21
 
 #define HV_PARTITION_ID_INVALID		((u64)0)
 #define HV_PARTITION_ID_SELF		((u64)-1)
@@ -374,6 +375,10 @@ union hv_hypervisor_version_info {
 #define HV_SHARED_GPA_BOUNDARY_ACTIVE			BIT(5)
 #define HV_SHARED_GPA_BOUNDARY_BITS			GENMASK(11, 6)
 
+/* HYPERV_CPUID_FEATURES.ECX bits. */
+#define HV_VP_DISPATCH_INTERRUPT_INJECTION_AVAILABLE	BIT(9)
+#define HV_VP_GHCB_ROOT_MAPPING_AVAILABLE		BIT(10)
+
 enum hv_isolation_type {
 	HV_ISOLATION_TYPE_NONE	= 0,	/* HV_PARTITION_ISOLATION_TYPE_NONE */
 	HV_ISOLATION_TYPE_VBS	= 1,
@@ -437,9 +442,12 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_MAP_GPA_PAGES				0x004b
 #define HVCALL_UNMAP_GPA_PAGES				0x004c
 #define HVCALL_CREATE_VP				0x004e
+#define HVCALL_INSTALL_INTERCEPT			0x004d
 #define HVCALL_DELETE_VP				0x004f
 #define HVCALL_GET_VP_REGISTERS				0x0050
 #define HVCALL_SET_VP_REGISTERS				0x0051
+#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS		0x0052
+#define HVCALL_CLEAR_VIRTUAL_INTERRUPT			0x0056
 #define HVCALL_DELETE_PORT				0x0058
 #define HVCALL_DISCONNECT_PORT				0x005b
 #define HVCALL_POST_MESSAGE				0x005c
@@ -447,12 +455,15 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_POST_DEBUG_DATA				0x0069
 #define HVCALL_RETRIEVE_DEBUG_DATA			0x006a
 #define HVCALL_RESET_DEBUG_SESSION			0x006b
+#define HVCALL_MAP_STATS_PAGE				0x006c
+#define HVCALL_UNMAP_STATS_PAGE				0x006d
 #define HVCALL_ADD_LOGICAL_PROCESSOR			0x0076
 #define HVCALL_GET_SYSTEM_PROPERTY			0x007b
 #define HVCALL_MAP_DEVICE_INTERRUPT			0x007c
 #define HVCALL_UNMAP_DEVICE_INTERRUPT			0x007d
 #define HVCALL_RETARGET_INTERRUPT			0x007e
 #define HVCALL_NOTIFY_PORT_RING_EMPTY			0x008b
+#define HVCALL_REGISTER_INTERCEPT_RESULT		0x0091
 #define HVCALL_ASSERT_VIRTUAL_INTERRUPT			0x0094
 #define HVCALL_CREATE_PORT				0x0095
 #define HVCALL_CONNECT_PORT				0x0096
@@ -460,12 +471,18 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_VP_ID_FROM_APIC_ID			0x009a
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
+#define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
+#define HVCALL_POST_MESSAGE_DIRECT			0x00c1
 #define HVCALL_DISPATCH_VP				0x00c2
+#define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
+#define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
+#define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
 #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
 #define HVCALL_MAP_VP_STATE_PAGE			0x00e1
 #define HVCALL_UNMAP_VP_STATE_PAGE			0x00e2
 #define HVCALL_GET_VP_STATE				0x00e3
 #define HVCALL_SET_VP_STATE				0x00e4
+#define HVCALL_GET_VP_CPUID_VALUES			0x00f4
 #define HVCALL_MMIO_READ				0x0106
 #define HVCALL_MMIO_WRITE				0x0107
 
@@ -807,6 +824,8 @@ struct hv_x64_table_register {
 	u64 base;
 } __packed;
 
+#define HV_NORMAL_VTL	0
+
 union hv_input_vtl {
 	u8 as_uint8;
 	struct {
@@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
 	struct hv_device_interrupt_target int_target;
 } __packed __aligned(8);
 
+enum hv_intercept_type {
+#if defined(CONFIG_X86_64)
+	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
+	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
+	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
+#endif
+	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
+	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
+	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
+	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
+#if defined(CONFIG_X86_64)
+	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
+	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
+#endif
+	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
+#if defined(CONFIG_X86_64)
+	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
+	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
+	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
+	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
+#endif
+	HV_INTERCEPT_TYPE_MAX,
+	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
+};
+
+union hv_intercept_parameters {
+	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
+	__u64 as_uint64;
+#if defined(CONFIG_X86_64)
+	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
+	__u16 io_port;
+	/* HV_INTERCEPT_TYPE_X64_CPUID */
+	__u32 cpuid_index;
+	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
+	__u32 apic_write_mask;
+	/* HV_INTERCEPT_TYPE_EXCEPTION */
+	__u16 exception_vector;
+	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
+	__u32 msr_index;
+#endif
+	/* N.B. Other intercept types do not have any parameters. */
+};
+
 /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
 #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
 
diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 64407c2a3809..1b447155c338 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -19,11 +19,24 @@
 
 #define HV_VP_REGISTER_PAGE_VERSION_1	1u
 
+#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
+
+union hv_vp_register_page_interrupt_vectors {
+	u64 as_uint64;
+	struct {
+		u8 vector_count;
+		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
+	} __packed;
+} __packed;
+
 struct hv_vp_register_page {
 	u16 version;
 	u8 isvalid;
 	u8 rsvdz;
 	u32 dirty;
+
+#if IS_ENABLED(CONFIG_X86)
+
 	union {
 		struct {
 			/* General purpose registers
@@ -95,6 +108,22 @@ struct hv_vp_register_page {
 	union hv_x64_pending_interruption_register pending_interruption;
 	union hv_x64_interrupt_state_register interrupt_state;
 	u64 instruction_emulation_hints;
+	u64 xfem;
+
+	/*
+	 * Fields from this point are not included in the register page save chunk.
+	 * The reserved field is intended to maintain alignment for unsaved fields.
+	 */
+	u8 reserved1[0x100];
+
+	/*
+	 * Interrupts injected as part of HvCallDispatchVp.
+	 */
+	union hv_vp_register_page_interrupt_vectors interrupt_vectors;
+
+#elif IS_ENABLED(CONFIG_ARM64)
+	/* Not yet supported in ARM */
+#endif
 } __packed;
 
 #define HV_PARTITION_PROCESSOR_FEATURES_BANKS 2
@@ -299,10 +328,11 @@ union hv_partition_isolation_properties {
 #define HV_PARTITION_ISOLATION_HOST_TYPE_RESERVED   0x2
 
 /* Note: Exo partition is enabled by default */
-#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
-#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
-#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED   BIT(19)
-#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE                   BIT(22)
+#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED		BIT(4)
+#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION			BIT(8)
+#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED			BIT(13)
+#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED	BIT(19)
+#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE			BIT(22)
 
 struct hv_input_create_partition {
 	u64 flags;
@@ -349,13 +379,23 @@ struct hv_input_set_partition_property {
 enum hv_vp_state_page_type {
 	HV_VP_STATE_PAGE_REGISTERS = 0,
 	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
+	HV_VP_STATE_PAGE_GHCB,
 	HV_VP_STATE_PAGE_COUNT
 };
 
 struct hv_input_map_vp_state_page {
 	u64 partition_id;
 	u32 vp_index;
-	u32 type; /* enum hv_vp_state_page_type */
+	u16 type; /* enum hv_vp_state_page_type */
+	union hv_input_vtl input_vtl;
+	union {
+		u8 as_uint8;
+		struct {
+			u8 map_location_provided : 1;
+			u8 reserved : 7;
+		};
+	} flags;
+	u64 requested_map_location;
 } __packed;
 
 struct hv_output_map_vp_state_page {
@@ -365,7 +405,14 @@ struct hv_output_map_vp_state_page {
 struct hv_input_unmap_vp_state_page {
 	u64 partition_id;
 	u32 vp_index;
-	u32 type; /* enum hv_vp_state_page_type */
+	u16 type; /* enum hv_vp_state_page_type */
+	union hv_input_vtl input_vtl;
+	u8 reserved0;
+} __packed;
+
+struct hv_x64_apic_eoi_message {
+	__u32 vp_index;
+	__u32 interrupt_vector;
 } __packed;
 
 struct hv_opaque_intercept_message {
@@ -515,6 +562,13 @@ struct hv_synthetic_timers_state {
 	u64 reserved[5];
 } __packed;
 
+struct hv_async_completion_message_payload {
+	__u64 partition_id;
+	__u32 status;
+	__u32 completion_count;
+	__u64 sub_status;
+} __packed;
+
 union hv_input_delete_vp {
 	u64 as_uint64[2];
 	struct {
@@ -649,6 +703,57 @@ struct hv_input_set_vp_state {
 	union hv_input_set_vp_state_data data[];
 } __packed;
 
+union hv_x64_vp_execution_state {
+	__u16 as_uint16;
+	struct {
+		__u16 cpl:2;
+		__u16 cr0_pe:1;
+		__u16 cr0_am:1;
+		__u16 efer_lma:1;
+		__u16 debug_active:1;
+		__u16 interruption_pending:1;
+		__u16 vtl:4;
+		__u16 enclave_mode:1;
+		__u16 interrupt_shadow:1;
+		__u16 virtualization_fault_active:1;
+		__u16 reserved:2;
+	} __packed;
+};
+
+struct hv_x64_intercept_message_header {
+	__u32 vp_index;
+	__u8 instruction_length:4;
+	__u8 cr8:4; /* Only set for exo partitions */
+	__u8 intercept_access_type;
+	union hv_x64_vp_execution_state execution_state;
+	struct hv_x64_segment_register cs_segment;
+	__u64 rip;
+	__u64 rflags;
+} __packed;
+
+union hv_x64_memory_access_info {
+	__u8 as_uint8;
+	struct {
+		__u8 gva_valid:1;
+		__u8 gva_gpa_valid:1;
+		__u8 hypercall_output_pending:1;
+		__u8 tlb_locked_no_overlay:1;
+		__u8 reserved:4;
+	} __packed;
+};
+
+struct hv_x64_memory_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u32 cache_type; /* enum hv_cache_type */
+	__u8 instruction_byte_count;
+	union hv_x64_memory_access_info memory_access_info;
+	__u8 tpr_priority;
+	__u8 reserved1;
+	__u64 guest_virtual_address;
+	__u64 guest_physical_address;
+	__u8 instruction_bytes[16];
+} __packed;
+
 /*
  * Dispatch state for the VP communicated by the hypervisor to the
  * VP-dispatching thread in the root on return from HVCALL_DISPATCH_VP.
@@ -716,6 +821,7 @@ static_assert(sizeof(struct hv_vp_signal_pair_scheduler_message) ==
 #define HV_DISPATCH_VP_FLAG_SKIP_VP_SPEC_FLUSH		0x8
 #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_SPEC_FLUSH	0x10
 #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_USER_SPEC_FLUSH	0x20
+#define HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION	0x40
 
 struct hv_input_dispatch_vp {
 	u64 partition_id;
@@ -730,4 +836,18 @@ struct hv_output_dispatch_vp {
 	u32 dispatch_event; /* enum hv_vp_dispatch_event */
 } __packed;
 
+struct hv_input_modify_sparse_spa_page_host_access {
+	u32 host_access : 2;
+	u32 reserved : 30;
+	u32 flags;
+	u64 partition_id;
+	u64 spa_page_list[];
+} __packed;
+
+/* hv_input_modify_sparse_spa_page_host_access flags */
+#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE  0x1
+#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED     0x2
+#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE      0x4
+#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE       0x8
+
 #endif /* _HV_HVHDK_H */
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index f8a39d3e9ce6..42e7876455b5 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -36,6 +36,52 @@ enum hv_scheduler_type {
 	HV_SCHEDULER_TYPE_MAX
 };
 
+/* HV_STATS_AREA_TYPE */
+enum hv_stats_area_type {
+	HV_STATS_AREA_SELF = 0,
+	HV_STATS_AREA_PARENT = 1,
+	HV_STATS_AREA_INTERNAL = 2,
+	HV_STATS_AREA_COUNT
+};
+
+enum hv_stats_object_type {
+	HV_STATS_OBJECT_HYPERVISOR		= 0x00000001,
+	HV_STATS_OBJECT_LOGICAL_PROCESSOR	= 0x00000002,
+	HV_STATS_OBJECT_PARTITION		= 0x00010001,
+	HV_STATS_OBJECT_VP			= 0x00010002
+};
+
+union hv_stats_object_identity {
+	/* hv_stats_hypervisor */
+	struct {
+		u8 reserved[15];
+		u8 stats_area_type;
+	} __packed hv;
+
+	/* hv_stats_logical_processor */
+	struct {
+		u32 lp_index;
+		u8 reserved[11];
+		u8 stats_area_type;
+	} __packed lp;
+
+	/* hv_stats_partition */
+	struct {
+		u64 partition_id;
+		u8  reserved[7];
+		u8  stats_area_type;
+	} __packed partition;
+
+	/* hv_stats_vp */
+	struct {
+		u64 partition_id;
+		u32 vp_index;
+		u16 flags;
+		u8  reserved;
+		u8  stats_area_type;
+	} __packed vp;
+};
+
 enum hv_partition_property_code {
 	/* Privilege properties */
 	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
@@ -47,19 +93,45 @@ enum hv_partition_property_code {
 
 	/* Compatibility properties */
 	HV_PARTITION_PROPERTY_PROCESSOR_XSAVE_FEATURES		= 0x00060002,
+	HV_PARTITION_PROPERTY_XSAVE_STATES                      = 0x00060007,
 	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE		= 0x00060008,
 	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY		= 0x00060009,
 };
 
+enum hv_snp_status {
+	HV_SNP_STATUS_NONE = 0,
+	HV_SNP_STATUS_AVAILABLE = 1,
+	HV_SNP_STATUS_INCOMPATIBLE = 2,
+	HV_SNP_STATUS_PSP_UNAVAILABLE = 3,
+	HV_SNP_STATUS_PSP_INIT_FAILED = 4,
+	HV_SNP_STATUS_PSP_BAD_FW_VERSION = 5,
+	HV_SNP_STATUS_BAD_CONFIGURATION = 6,
+	HV_SNP_STATUS_PSP_FW_UPDATE_IN_PROGRESS = 7,
+	HV_SNP_STATUS_PSP_RB_INIT_FAILED = 8,
+	HV_SNP_STATUS_PSP_PLATFORM_STATUS_FAILED = 9,
+	HV_SNP_STATUS_PSP_INIT_LATE_FAILED = 10,
+};
+
 enum hv_system_property {
 	/* Add more values when needed */
 	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
+	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
+};
+
+enum hv_dynamic_processor_feature_property {
+	/* Add more values when needed */
+	HV_X64_DYNAMIC_PROCESSOR_FEATURE_MAX_ENCRYPTED_PARTITIONS = 13,
+	HV_X64_DYNAMIC_PROCESSOR_FEATURE_SNP_STATUS = 16,
 };
 
 struct hv_input_get_system_property {
 	u32 property_id; /* enum hv_system_property */
 	union {
 		u32 as_uint32;
+#if IS_ENABLED(CONFIG_X86)
+		/* enum hv_dynamic_processor_feature_property */
+		u32 hv_processor_feature;
+#endif
 		/* More fields to be filled in when needed */
 	};
 } __packed;
@@ -67,9 +139,28 @@ struct hv_input_get_system_property {
 struct hv_output_get_system_property {
 	union {
 		u32 scheduler_type; /* enum hv_scheduler_type */
+#if IS_ENABLED(CONFIG_X86)
+		u64 hv_processor_feature_value;
+#endif
 	};
 } __packed;
 
+struct hv_input_map_stats_page {
+	u32 type; /* enum hv_stats_object_type */
+	u32 padding;
+	union hv_stats_object_identity identity;
+} __packed;
+
+struct hv_output_map_stats_page {
+	u64 map_location;
+} __packed;
+
+struct hv_input_unmap_stats_page {
+	u32 type; /* enum hv_stats_object_type */
+	u32 padding;
+	union hv_stats_object_identity identity;
+} __packed;
+
 struct hv_proximity_domain_flags {
 	u32 proximity_preferred : 1;
 	u32 reserved : 30;
-- 
2.34.1



* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
@ 2025-02-26 23:51   ` Stanislav Kinsburskii
  2025-03-01  0:46     ` Nuno Das Neves
  2025-02-27 18:13   ` Roman Kisel
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 108+ messages in thread
From: Stanislav Kinsburskii @ 2025-02-26 23:51 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Wed, Feb 26, 2025 at 03:08:03PM -0800, Nuno Das Neves wrote:
> A few additional definitions are required for the mshv driver code
> (to follow). Introduce those here and clean up a little bit while
> at it.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>  include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>  include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>  3 files changed, 280 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 58895883f636..e4a3cca0cbce 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>  	struct hv_device_interrupt_target int_target;
>  } __packed __aligned(8);
>  
> +enum hv_intercept_type {
> +#if defined(CONFIG_X86_64)

Perhaps it would be nice to have per-arch headers for such structures
instead.

> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
> +#endif
> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
> +#endif
> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
> +#endif
> +	HV_INTERCEPT_TYPE_MAX,
> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
> +};
> +
> +union hv_intercept_parameters {
> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
> +	__u64 as_uint64;

Should this one be "u64" instead of "__u64" (here and below)?

> +#if defined(CONFIG_X86_64)
> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
> +	__u16 io_port;
> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
> +	__u32 cpuid_index;
> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
> +	__u32 apic_write_mask;
> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
> +	__u16 exception_vector;
> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
> +	__u32 msr_index;
> +#endif
> +	/* N.B. Other intercept types do not have any parameters. */
> +};
> +
>  /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>  #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
>  
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 64407c2a3809..1b447155c338 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -19,11 +19,24 @@
>  
>  #define HV_VP_REGISTER_PAGE_VERSION_1	1u
>  
> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
> +
> +union hv_vp_register_page_interrupt_vectors {
> +	u64 as_uint64;
> +	struct {
> +		u8 vector_count;
> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
> +	} __packed;
> +} __packed;

Packed attribute for the union looks redundant.

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>


* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:51   ` Stanislav Kinsburskii
@ 2025-03-01  0:46     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-01  0:46 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:51 PM, Stanislav Kinsburskii wrote:
> On Wed, Feb 26, 2025 at 03:08:03PM -0800, Nuno Das Neves wrote:
>> A few additional definitions are required for the mshv driver code
>> (to follow). Introduce those here and clean up a little bit while
>> at it.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>>  include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>>  include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>>  3 files changed, 280 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
>> index 58895883f636..e4a3cca0cbce 100644
>> --- a/include/hyperv/hvgdk_mini.h
>> +++ b/include/hyperv/hvgdk_mini.h
>> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>>  	struct hv_device_interrupt_target int_target;
>>  } __packed __aligned(8);
>>  
>> +enum hv_intercept_type {
>> +#if defined(CONFIG_X86_64)
> 
> Perhaps it would be nice to have per-arch headers for such structures
> instead.
> 
The goal with these files is to reflect the Hyper-V code closely, in order
to make porting the definitions to Linux as easy as possible. Splitting
these into per-arch headers is not my preferred approach because it is
counter to that goal.

>> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
>> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
>> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
>> +#endif
>> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
>> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
>> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
>> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
>> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
>> +#endif
>> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
>> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
>> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
>> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
>> +#endif
>> +	HV_INTERCEPT_TYPE_MAX,
>> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
>> +};
>> +
>> +union hv_intercept_parameters {
>> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
>> +	__u64 as_uint64;
> 
> Should this one be "u64" instead of "__u64" (here and below)?
> 
Yes, it looks like a few of the uapi types are still lingering, oops!

>> +#if defined(CONFIG_X86_64)
>> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
>> +	__u16 io_port;
>> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
>> +	__u32 cpuid_index;
>> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
>> +	__u32 apic_write_mask;
>> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
>> +	__u16 exception_vector;
>> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
>> +	__u32 msr_index;
>> +#endif
>> +	/* N.B. Other intercept types do not have any parameters. */
>> +};
>> +
>>  /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>>  #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
>>  
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 64407c2a3809..1b447155c338 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -19,11 +19,24 @@
>>  
>>  #define HV_VP_REGISTER_PAGE_VERSION_1	1u
>>  
>> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
>> +
>> +union hv_vp_register_page_interrupt_vectors {
>> +	u64 as_uint64;
>> +	struct {
>> +		u8 vector_count;
>> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
>> +	} __packed;
>> +} __packed;
> 
> Packed attribute for the union looks redundant.
> 
Good point, I can remove it in the next version.

> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>



* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
  2025-02-26 23:51   ` Stanislav Kinsburskii
@ 2025-02-27 18:13   ` Roman Kisel
  2025-02-28  1:27   ` Easwar Hariharan
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 18:13 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet



On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> A few additional definitions are required for the mshv driver code
> (to follow). Introduce those here and clean up a little bit while
> at it.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>   include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>   include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>   include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>   3 files changed, 280 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 58895883f636..e4a3cca0cbce 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -13,7 +13,7 @@ struct hv_u128 {
>   	u64 high_part;
>   } __packed;
>   
> -/* NOTE: when adding below, update hv_status_to_string() */
> +/* NOTE: when adding below, update hv_result_to_string() */
>   #define HV_STATUS_SUCCESS			    0x0
>   #define HV_STATUS_INVALID_HYPERCALL_CODE	    0x2
>   #define HV_STATUS_INVALID_HYPERCALL_INPUT	    0x3
> @@ -51,6 +51,7 @@ struct hv_u128 {
>   #define HV_HYP_PAGE_SHIFT		12
>   #define HV_HYP_PAGE_SIZE		BIT(HV_HYP_PAGE_SHIFT)
>   #define HV_HYP_PAGE_MASK		(~(HV_HYP_PAGE_SIZE - 1))
> +#define HV_HYP_LARGE_PAGE_SHIFT		21
>   
>   #define HV_PARTITION_ID_INVALID		((u64)0)
>   #define HV_PARTITION_ID_SELF		((u64)-1)
> @@ -374,6 +375,10 @@ union hv_hypervisor_version_info {
>   #define HV_SHARED_GPA_BOUNDARY_ACTIVE			BIT(5)
>   #define HV_SHARED_GPA_BOUNDARY_BITS			GENMASK(11, 6)
>   
> +/* HYPERV_CPUID_FEATURES.ECX bits. */
> +#define HV_VP_DISPATCH_INTERRUPT_INJECTION_AVAILABLE	BIT(9)
> +#define HV_VP_GHCB_ROOT_MAPPING_AVAILABLE		BIT(10)
> +
>   enum hv_isolation_type {
>   	HV_ISOLATION_TYPE_NONE	= 0,	/* HV_PARTITION_ISOLATION_TYPE_NONE */
>   	HV_ISOLATION_TYPE_VBS	= 1,
> @@ -437,9 +442,12 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>   #define HVCALL_MAP_GPA_PAGES				0x004b
>   #define HVCALL_UNMAP_GPA_PAGES				0x004c
>   #define HVCALL_CREATE_VP				0x004e
> +#define HVCALL_INSTALL_INTERCEPT			0x004d
>   #define HVCALL_DELETE_VP				0x004f
>   #define HVCALL_GET_VP_REGISTERS				0x0050
>   #define HVCALL_SET_VP_REGISTERS				0x0051
> +#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS		0x0052
> +#define HVCALL_CLEAR_VIRTUAL_INTERRUPT			0x0056
>   #define HVCALL_DELETE_PORT				0x0058
>   #define HVCALL_DISCONNECT_PORT				0x005b
>   #define HVCALL_POST_MESSAGE				0x005c
> @@ -447,12 +455,15 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>   #define HVCALL_POST_DEBUG_DATA				0x0069
>   #define HVCALL_RETRIEVE_DEBUG_DATA			0x006a
>   #define HVCALL_RESET_DEBUG_SESSION			0x006b
> +#define HVCALL_MAP_STATS_PAGE				0x006c
> +#define HVCALL_UNMAP_STATS_PAGE				0x006d
>   #define HVCALL_ADD_LOGICAL_PROCESSOR			0x0076
>   #define HVCALL_GET_SYSTEM_PROPERTY			0x007b
>   #define HVCALL_MAP_DEVICE_INTERRUPT			0x007c
>   #define HVCALL_UNMAP_DEVICE_INTERRUPT			0x007d
>   #define HVCALL_RETARGET_INTERRUPT			0x007e
>   #define HVCALL_NOTIFY_PORT_RING_EMPTY			0x008b
> +#define HVCALL_REGISTER_INTERCEPT_RESULT		0x0091
>   #define HVCALL_ASSERT_VIRTUAL_INTERRUPT			0x0094
>   #define HVCALL_CREATE_PORT				0x0095
>   #define HVCALL_CONNECT_PORT				0x0096
> @@ -460,12 +471,18 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>   #define HVCALL_GET_VP_ID_FROM_APIC_ID			0x009a
>   #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
>   #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
> +#define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
> +#define HVCALL_POST_MESSAGE_DIRECT			0x00c1
>   #define HVCALL_DISPATCH_VP				0x00c2
> +#define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
> +#define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
> +#define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
>   #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
>   #define HVCALL_MAP_VP_STATE_PAGE			0x00e1
>   #define HVCALL_UNMAP_VP_STATE_PAGE			0x00e2
>   #define HVCALL_GET_VP_STATE				0x00e3
>   #define HVCALL_SET_VP_STATE				0x00e4
> +#define HVCALL_GET_VP_CPUID_VALUES			0x00f4
>   #define HVCALL_MMIO_READ				0x0106
>   #define HVCALL_MMIO_WRITE				0x0107
>   
> @@ -807,6 +824,8 @@ struct hv_x64_table_register {
>   	u64 base;
>   } __packed;
>   
> +#define HV_NORMAL_VTL	0
> +
>   union hv_input_vtl {
>   	u8 as_uint8;
>   	struct {
> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>   	struct hv_device_interrupt_target int_target;
>   } __packed __aligned(8);
>   
> +enum hv_intercept_type {
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
> +#endif
> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
> +#endif
> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
> +#endif
> +	HV_INTERCEPT_TYPE_MAX,
> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
> +};
> +
> +union hv_intercept_parameters {
> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
> +	__u64 as_uint64;
> +#if defined(CONFIG_X86_64)
> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
> +	__u16 io_port;
> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
> +	__u32 cpuid_index;
> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
> +	__u32 apic_write_mask;
> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
> +	__u16 exception_vector;
> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
> +	__u32 msr_index;
> +#endif
> +	/* N.B. Other intercept types do not have any parameters. */
> +};
> +
>   /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>   #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
>   
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 64407c2a3809..1b447155c338 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -19,11 +19,24 @@
>   
>   #define HV_VP_REGISTER_PAGE_VERSION_1	1u
>   
> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
> +
> +union hv_vp_register_page_interrupt_vectors {
> +	u64 as_uint64;
> +	struct {
> +		u8 vector_count;
> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
> +	} __packed;
> +} __packed;
> +
>   struct hv_vp_register_page {
>   	u16 version;
>   	u8 isvalid;
>   	u8 rsvdz;
>   	u32 dirty;
> +
> +#if IS_ENABLED(CONFIG_X86)
> +
>   	union {
>   		struct {
>   			/* General purpose registers
> @@ -95,6 +108,22 @@ struct hv_vp_register_page {
>   	union hv_x64_pending_interruption_register pending_interruption;
>   	union hv_x64_interrupt_state_register interrupt_state;
>   	u64 instruction_emulation_hints;
> +	u64 xfem;
> +
> +	/*
> +	 * Fields from this point are not included in the register page save chunk.
> +	 * The reserved field is intended to maintain alignment for unsaved fields.
> +	 */
> +	u8 reserved1[0x100];
> +
> +	/*
> +	 * Interrupts injected as part of HvCallDispatchVp.
> +	 */
> +	union hv_vp_register_page_interrupt_vectors interrupt_vectors;
> +
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	/* Not yet supported in ARM */
> +#endif
>   } __packed;
>   
>   #define HV_PARTITION_PROCESSOR_FEATURES_BANKS 2
> @@ -299,10 +328,11 @@ union hv_partition_isolation_properties {
>   #define HV_PARTITION_ISOLATION_HOST_TYPE_RESERVED   0x2
>   
>   /* Note: Exo partition is enabled by default */
> -#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
> -#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> -#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED   BIT(19)
> -#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE                   BIT(22)
> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED		BIT(4)
> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION			BIT(8)
> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED			BIT(13)
> +#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED	BIT(19)
> +#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE			BIT(22)
>   
>   struct hv_input_create_partition {
>   	u64 flags;
> @@ -349,13 +379,23 @@ struct hv_input_set_partition_property {
>   enum hv_vp_state_page_type {
>   	HV_VP_STATE_PAGE_REGISTERS = 0,
>   	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
> +	HV_VP_STATE_PAGE_GHCB,
>   	HV_VP_STATE_PAGE_COUNT
>   };
>   
>   struct hv_input_map_vp_state_page {
>   	u64 partition_id;
>   	u32 vp_index;
> -	u32 type; /* enum hv_vp_state_page_type */
> +	u16 type; /* enum hv_vp_state_page_type */
> +	union hv_input_vtl input_vtl;
> +	union {
> +		u8 as_uint8;
> +		struct {
> +			u8 map_location_provided : 1;
> +			u8 reserved : 7;
> +		};
> +	} flags;
> +	u64 requested_map_location;
>   } __packed;
>   
>   struct hv_output_map_vp_state_page {
> @@ -365,7 +405,14 @@ struct hv_output_map_vp_state_page {
>   struct hv_input_unmap_vp_state_page {
>   	u64 partition_id;
>   	u32 vp_index;
> -	u32 type; /* enum hv_vp_state_page_type */
> +	u16 type; /* enum hv_vp_state_page_type */
> +	union hv_input_vtl input_vtl;
> +	u8 reserved0;
> +} __packed;
> +
> +struct hv_x64_apic_eoi_message {
> +	__u32 vp_index;
> +	__u32 interrupt_vector;
>   } __packed;
>   
>   struct hv_opaque_intercept_message {
> @@ -515,6 +562,13 @@ struct hv_synthetic_timers_state {
>   	u64 reserved[5];
>   } __packed;
>   
> +struct hv_async_completion_message_payload {
> +	__u64 partition_id;
> +	__u32 status;
> +	__u32 completion_count;
> +	__u64 sub_status;
> +} __packed;
> +
>   union hv_input_delete_vp {
>   	u64 as_uint64[2];
>   	struct {
> @@ -649,6 +703,57 @@ struct hv_input_set_vp_state {
>   	union hv_input_set_vp_state_data data[];
>   } __packed;
>   
> +union hv_x64_vp_execution_state {
> +	__u16 as_uint16;
> +	struct {
> +		__u16 cpl:2;
> +		__u16 cr0_pe:1;
> +		__u16 cr0_am:1;
> +		__u16 efer_lma:1;
> +		__u16 debug_active:1;
> +		__u16 interruption_pending:1;
> +		__u16 vtl:4;
> +		__u16 enclave_mode:1;
> +		__u16 interrupt_shadow:1;
> +		__u16 virtualization_fault_active:1;
> +		__u16 reserved:2;
> +	} __packed;
> +};
> +
> +struct hv_x64_intercept_message_header {
> +	__u32 vp_index;
> +	__u8 instruction_length:4;
> +	__u8 cr8:4; /* Only set for exo partitions */
> +	__u8 intercept_access_type;
> +	union hv_x64_vp_execution_state execution_state;
> +	struct hv_x64_segment_register cs_segment;
> +	__u64 rip;
> +	__u64 rflags;
> +} __packed;
> +
> +union hv_x64_memory_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 gva_valid:1;
> +		__u8 gva_gpa_valid:1;
> +		__u8 hypercall_output_pending:1;
> +		__u8 tlb_locked_no_overlay:1;
> +		__u8 reserved:4;
> +	} __packed;
> +};
> +
> +struct hv_x64_memory_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 cache_type; /* enum hv_cache_type */
> +	__u8 instruction_byte_count;
> +	union hv_x64_memory_access_info memory_access_info;
> +	__u8 tpr_priority;
> +	__u8 reserved1;
> +	__u64 guest_virtual_address;
> +	__u64 guest_physical_address;
> +	__u8 instruction_bytes[16];
> +} __packed;
> +
>   /*
>    * Dispatch state for the VP communicated by the hypervisor to the
>    * VP-dispatching thread in the root on return from HVCALL_DISPATCH_VP.
> @@ -716,6 +821,7 @@ static_assert(sizeof(struct hv_vp_signal_pair_scheduler_message) ==
>   #define HV_DISPATCH_VP_FLAG_SKIP_VP_SPEC_FLUSH		0x8
>   #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_SPEC_FLUSH	0x10
>   #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_USER_SPEC_FLUSH	0x20
> +#define HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION	0x40
>   
>   struct hv_input_dispatch_vp {
>   	u64 partition_id;
> @@ -730,4 +836,18 @@ struct hv_output_dispatch_vp {
>   	u32 dispatch_event; /* enum hv_vp_dispatch_event */
>   } __packed;
>   
> +struct hv_input_modify_sparse_spa_page_host_access {
> +	u32 host_access : 2;
> +	u32 reserved : 30;
> +	u32 flags;
> +	u64 partition_id;
> +	u64 spa_page_list[];
> +} __packed;
> +
> +/* hv_input_modify_sparse_spa_page_host_access flags */
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE  0x1
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED     0x2
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE      0x4
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE       0x8
> +
>   #endif /* _HV_HVHDK_H */
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index f8a39d3e9ce6..42e7876455b5 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -36,6 +36,52 @@ enum hv_scheduler_type {
>   	HV_SCHEDULER_TYPE_MAX
>   };
>   
> +/* HV_STATS_AREA_TYPE */
> +enum hv_stats_area_type {
> +	HV_STATS_AREA_SELF = 0,
> +	HV_STATS_AREA_PARENT = 1,
> +	HV_STATS_AREA_INTERNAL = 2,
> +	HV_STATS_AREA_COUNT
> +};
> +
> +enum hv_stats_object_type {
> +	HV_STATS_OBJECT_HYPERVISOR		= 0x00000001,
> +	HV_STATS_OBJECT_LOGICAL_PROCESSOR	= 0x00000002,
> +	HV_STATS_OBJECT_PARTITION		= 0x00010001,
> +	HV_STATS_OBJECT_VP			= 0x00010002
> +};
> +
> +union hv_stats_object_identity {
> +	/* hv_stats_hypervisor */
> +	struct {
> +		u8 reserved[15];
> +		u8 stats_area_type;
> +	} __packed hv;
> +
> +	/* hv_stats_logical_processor */
> +	struct {
> +		u32 lp_index;
> +		u8 reserved[11];
> +		u8 stats_area_type;
> +	} __packed lp;
> +
> +	/* hv_stats_partition */
> +	struct {
> +		u64 partition_id;
> +		u8  reserved[7];
> +		u8  stats_area_type;
> +	} __packed partition;
> +
> +	/* hv_stats_vp */
> +	struct {
> +		u64 partition_id;
> +		u32 vp_index;
> +		u16 flags;
> +		u8  reserved;
> +		u8  stats_area_type;
> +	} __packed vp;
> +};
> +
>   enum hv_partition_property_code {
>   	/* Privilege properties */
>   	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> @@ -47,19 +93,45 @@ enum hv_partition_property_code {
>   
>   	/* Compatibility properties */
>   	HV_PARTITION_PROPERTY_PROCESSOR_XSAVE_FEATURES		= 0x00060002,
> +	HV_PARTITION_PROPERTY_XSAVE_STATES                      = 0x00060007,
>   	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE		= 0x00060008,
>   	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY		= 0x00060009,
>   };
>   
> +enum hv_snp_status {
> +	HV_SNP_STATUS_NONE = 0,
> +	HV_SNP_STATUS_AVAILABLE = 1,
> +	HV_SNP_STATUS_INCOMPATIBLE = 2,
> +	HV_SNP_STATUS_PSP_UNAVAILABLE = 3,
> +	HV_SNP_STATUS_PSP_INIT_FAILED = 4,
> +	HV_SNP_STATUS_PSP_BAD_FW_VERSION = 5,
> +	HV_SNP_STATUS_BAD_CONFIGURATION = 6,
> +	HV_SNP_STATUS_PSP_FW_UPDATE_IN_PROGRESS = 7,
> +	HV_SNP_STATUS_PSP_RB_INIT_FAILED = 8,
> +	HV_SNP_STATUS_PSP_PLATFORM_STATUS_FAILED = 9,
> +	HV_SNP_STATUS_PSP_INIT_LATE_FAILED = 10,
> +};
> +
>   enum hv_system_property {
>   	/* Add more values when needed */
>   	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
> +	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
> +};
> +
> +enum hv_dynamic_processor_feature_property {
> +	/* Add more values when needed */
> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_MAX_ENCRYPTED_PARTITIONS = 13,
> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_SNP_STATUS = 16,
>   };
>   
>   struct hv_input_get_system_property {
>   	u32 property_id; /* enum hv_system_property */
>   	union {
>   		u32 as_uint32;
> +#if IS_ENABLED(CONFIG_X86)
> +		/* enum hv_dynamic_processor_feature_property */
> +		u32 hv_processor_feature;
> +#endif
>   		/* More fields to be filled in when needed */
>   	};
>   } __packed;
> @@ -67,9 +139,28 @@ struct hv_input_get_system_property {
>   struct hv_output_get_system_property {
>   	union {
>   		u32 scheduler_type; /* enum hv_scheduler_type */
> +#if IS_ENABLED(CONFIG_X86)
> +		u64 hv_processor_feature_value;
> +#endif
>   	};
>   } __packed;
>   
> +struct hv_input_map_stats_page {
> +	u32 type; /* enum hv_stats_object_type */
> +	u32 padding;
> +	union hv_stats_object_identity identity;
> +} __packed;
> +
> +struct hv_output_map_stats_page {
> +	u64 map_location;
> +} __packed;
> +
> +struct hv_input_unmap_stats_page {
> +	u32 type; /* enum hv_stats_object_type */
> +	u32 padding;
> +	union hv_stats_object_identity identity;
> +} __packed;
> +
>   struct hv_proximity_domain_flags {
>   	u32 proximity_preferred : 1;
>   	u32 reserved : 30;

Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

-- 
Thank you,
Roman



* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
  2025-02-26 23:51   ` Stanislav Kinsburskii
  2025-02-27 18:13   ` Roman Kisel
@ 2025-02-28  1:27   ` Easwar Hariharan
  2025-03-01  0:52     ` Nuno Das Neves
  2025-03-07 17:26   ` Michael Kelley
  2025-03-10 12:40   ` Tianyu Lan
  4 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-28  1:27 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> A few additional definitions are required for the mshv driver code
> (to follow). Introduce those here and clean up a little bit while
> at it.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>  include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>  include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>  3 files changed, 280 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 58895883f636..e4a3cca0cbce 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -13,7 +13,7 @@ struct hv_u128 {
>  	u64 high_part;
>  } __packed;
>  

<snip>

>  union hv_input_vtl {
>  	u8 as_uint8;
>  	struct {
> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>  	struct hv_device_interrupt_target int_target;
>  } __packed __aligned(8);
>  
> +enum hv_intercept_type {
> +#if defined(CONFIG_X86_64)

The choice of ifdefs here comes across as somewhat arbitrary. The hypervisor
code has this enabled for both 32-bit and 64-bit x86, but you've chosen x86_64
only. I thought that might be because we only intend to support the root
partition on 64-bit platforms, but then, below...

> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
> +#endif
> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
> +#endif
> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
> +#endif
> +	HV_INTERCEPT_TYPE_MAX,
> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
> +};
> +
> +union hv_intercept_parameters {
> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
> +	__u64 as_uint64;
> +#if defined(CONFIG_X86_64)
> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
> +	__u16 io_port;
> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
> +	__u32 cpuid_index;
> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
> +	__u32 apic_write_mask;
> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
> +	__u16 exception_vector;
> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
> +	__u32 msr_index;
> +#endif
> +	/* N.B. Other intercept types do not have any parameters. */
> +};
> +
>  /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>  #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
>  
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 64407c2a3809..1b447155c338 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -19,11 +19,24 @@
>  
>  #define HV_VP_REGISTER_PAGE_VERSION_1	1u
>  
> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
> +
> +union hv_vp_register_page_interrupt_vectors {
> +	u64 as_uint64;
> +	struct {
> +		u8 vector_count;
> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
> +	} __packed;
> +} __packed;
> +
>  struct hv_vp_register_page {
>  	u16 version;
>  	u8 isvalid;
>  	u8 rsvdz;
>  	u32 dirty;
> +
> +#if IS_ENABLED(CONFIG_X86)
> +

...you've chosen to include 32-bit here, where the hypervisor code supports both.

Confused

>  	union {
>  		struct {
>  			/* General purpose registers
> @@ -95,6 +108,22 @@ struct hv_vp_register_page {
>  	union hv_x64_pending_interruption_register pending_interruption;
>  	union hv_x64_interrupt_state_register interrupt_state;
>  	u64 instruction_emulation_hints;
> +	u64 xfem;
> +
> +	/*
> +	 * Fields from this point are not included in the register page save chunk.
> +	 * The reserved field is intended to maintain alignment for unsaved fields.
> +	 */
> +	u8 reserved1[0x100];
> +
> +	/*
> +	 * Interrupts injected as part of HvCallDispatchVp.
> +	 */
> +	union hv_vp_register_page_interrupt_vectors interrupt_vectors;
> +
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	/* Not yet supported in ARM */
> +#endif
>  } __packed;
>  
>  #define HV_PARTITION_PROCESSOR_FEATURES_BANKS 2
> @@ -299,10 +328,11 @@ union hv_partition_isolation_properties {
>  #define HV_PARTITION_ISOLATION_HOST_TYPE_RESERVED   0x2
>  
>  /* Note: Exo partition is enabled by default */
> -#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
> -#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> -#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED   BIT(19)
> -#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE                   BIT(22)
> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED		BIT(4)
> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION			BIT(8)
> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED			BIT(13)
> +#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED	BIT(19)
> +#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE			BIT(22)
>  
>  struct hv_input_create_partition {
>  	u64 flags;
> @@ -349,13 +379,23 @@ struct hv_input_set_partition_property {
>  enum hv_vp_state_page_type {
>  	HV_VP_STATE_PAGE_REGISTERS = 0,
>  	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
> +	HV_VP_STATE_PAGE_GHCB,
>  	HV_VP_STATE_PAGE_COUNT
>  };
>  
>  struct hv_input_map_vp_state_page {
>  	u64 partition_id;
>  	u32 vp_index;
> -	u32 type; /* enum hv_vp_state_page_type */
> +	u16 type; /* enum hv_vp_state_page_type */
> +	union hv_input_vtl input_vtl;
> +	union {
> +		u8 as_uint8;
> +		struct {
> +			u8 map_location_provided : 1;
> +			u8 reserved : 7;
> +		};
> +	} flags;
> +	u64 requested_map_location;
>  } __packed;
>  
>  struct hv_output_map_vp_state_page {
> @@ -365,7 +405,14 @@ struct hv_output_map_vp_state_page {
>  struct hv_input_unmap_vp_state_page {
>  	u64 partition_id;
>  	u32 vp_index;
> -	u32 type; /* enum hv_vp_state_page_type */
> +	u16 type; /* enum hv_vp_state_page_type */
> +	union hv_input_vtl input_vtl;
> +	u8 reserved0;
> +} __packed;
> +
> +struct hv_x64_apic_eoi_message {
> +	__u32 vp_index;
> +	__u32 interrupt_vector;

Can these be plain u32? Similar below...

>  } __packed;
>  
>  struct hv_opaque_intercept_message {
> @@ -515,6 +562,13 @@ struct hv_synthetic_timers_state {
>  	u64 reserved[5];
>  } __packed;
>  
> +struct hv_async_completion_message_payload {
> +	__u64 partition_id;
> +	__u32 status;
> +	__u32 completion_count;
> +	__u64 sub_status;
> +} __packed;
> +
>  union hv_input_delete_vp {
>  	u64 as_uint64[2];
>  	struct {
> @@ -649,6 +703,57 @@ struct hv_input_set_vp_state {
>  	union hv_input_set_vp_state_data data[];
>  } __packed;
>  
> +union hv_x64_vp_execution_state {
> +	__u16 as_uint16;
> +	struct {
> +		__u16 cpl:2;
> +		__u16 cr0_pe:1;
> +		__u16 cr0_am:1;
> +		__u16 efer_lma:1;
> +		__u16 debug_active:1;
> +		__u16 interruption_pending:1;
> +		__u16 vtl:4;
> +		__u16 enclave_mode:1;
> +		__u16 interrupt_shadow:1;
> +		__u16 virtualization_fault_active:1;
> +		__u16 reserved:2;
> +	} __packed;
> +};
> +
> +struct hv_x64_intercept_message_header {
> +	__u32 vp_index;
> +	__u8 instruction_length:4;
> +	__u8 cr8:4; /* Only set for exo partitions */
> +	__u8 intercept_access_type;
> +	union hv_x64_vp_execution_state execution_state;
> +	struct hv_x64_segment_register cs_segment;
> +	__u64 rip;
> +	__u64 rflags;
> +} __packed;
> +
> +union hv_x64_memory_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 gva_valid:1;
> +		__u8 gva_gpa_valid:1;
> +		__u8 hypercall_output_pending:1;
> +		__u8 tlb_locked_no_overlay:1;
> +		__u8 reserved:4;
> +	} __packed;
> +};
> +
> +struct hv_x64_memory_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 cache_type; /* enum hv_cache_type */
> +	__u8 instruction_byte_count;
> +	union hv_x64_memory_access_info memory_access_info;
> +	__u8 tpr_priority;
> +	__u8 reserved1;
> +	__u64 guest_virtual_address;
> +	__u64 guest_physical_address;
> +	__u8 instruction_bytes[16];
> +} __packed;
> +
>  /*
>   * Dispatch state for the VP communicated by the hypervisor to the
>   * VP-dispatching thread in the root on return from HVCALL_DISPATCH_VP.
> @@ -716,6 +821,7 @@ static_assert(sizeof(struct hv_vp_signal_pair_scheduler_message) ==
>  #define HV_DISPATCH_VP_FLAG_SKIP_VP_SPEC_FLUSH		0x8
>  #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_SPEC_FLUSH	0x10
>  #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_USER_SPEC_FLUSH	0x20
> +#define HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION	0x40
>  
>  struct hv_input_dispatch_vp {
>  	u64 partition_id;
> @@ -730,4 +836,18 @@ struct hv_output_dispatch_vp {
>  	u32 dispatch_event; /* enum hv_vp_dispatch_event */
>  } __packed;
>  
> +struct hv_input_modify_sparse_spa_page_host_access {
> +	u32 host_access : 2;
> +	u32 reserved : 30;
> +	u32 flags;
> +	u64 partition_id;
> +	u64 spa_page_list[];
> +} __packed;
> +
> +/* hv_input_modify_sparse_spa_page_host_access flags */
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE  0x1
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED     0x2
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE      0x4
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE       0x8
> +
>  #endif /* _HV_HVHDK_H */
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index f8a39d3e9ce6..42e7876455b5 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -36,6 +36,52 @@ enum hv_scheduler_type {
>  	HV_SCHEDULER_TYPE_MAX
>  };
>  
> +/* HV_STATS_AREA_TYPE */
> +enum hv_stats_area_type {
> +	HV_STATS_AREA_SELF = 0,
> +	HV_STATS_AREA_PARENT = 1,
> +	HV_STATS_AREA_INTERNAL = 2,
> +	HV_STATS_AREA_COUNT
> +};
> +
> +enum hv_stats_object_type {
> +	HV_STATS_OBJECT_HYPERVISOR		= 0x00000001,
> +	HV_STATS_OBJECT_LOGICAL_PROCESSOR	= 0x00000002,
> +	HV_STATS_OBJECT_PARTITION		= 0x00010001,
> +	HV_STATS_OBJECT_VP			= 0x00010002
> +};
> +
> +union hv_stats_object_identity {
> +	/* hv_stats_hypervisor */
> +	struct {
> +		u8 reserved[15];
> +		u8 stats_area_type;
> +	} __packed hv;
> +
> +	/* hv_stats_logical_processor */
> +	struct {
> +		u32 lp_index;
> +		u8 reserved[11];
> +		u8 stats_area_type;
> +	} __packed lp;
> +
> +	/* hv_stats_partition */
> +	struct {
> +		u64 partition_id;
> +		u8  reserved[7];
> +		u8  stats_area_type;
> +	} __packed partition;
> +
> +	/* hv_stats_vp */
> +	struct {
> +		u64 partition_id;
> +		u32 vp_index;
> +		u16 flags;
> +		u8  reserved;
> +		u8  stats_area_type;
> +	} __packed vp;
> +};
> +
>  enum hv_partition_property_code {
>  	/* Privilege properties */
>  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> @@ -47,19 +93,45 @@ enum hv_partition_property_code {
>  
>  	/* Compatibility properties */
>  	HV_PARTITION_PROPERTY_PROCESSOR_XSAVE_FEATURES		= 0x00060002,
> +	HV_PARTITION_PROPERTY_XSAVE_STATES                      = 0x00060007,
>  	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE		= 0x00060008,
>  	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY		= 0x00060009,
>  };
>  
> +enum hv_snp_status {
> +	HV_SNP_STATUS_NONE = 0,
> +	HV_SNP_STATUS_AVAILABLE = 1,
> +	HV_SNP_STATUS_INCOMPATIBLE = 2,
> +	HV_SNP_STATUS_PSP_UNAVAILABLE = 3,
> +	HV_SNP_STATUS_PSP_INIT_FAILED = 4,
> +	HV_SNP_STATUS_PSP_BAD_FW_VERSION = 5,
> +	HV_SNP_STATUS_BAD_CONFIGURATION = 6,
> +	HV_SNP_STATUS_PSP_FW_UPDATE_IN_PROGRESS = 7,
> +	HV_SNP_STATUS_PSP_RB_INIT_FAILED = 8,
> +	HV_SNP_STATUS_PSP_PLATFORM_STATUS_FAILED = 9,
> +	HV_SNP_STATUS_PSP_INIT_LATE_FAILED = 10,
> +};
> +
>  enum hv_system_property {
>  	/* Add more values when needed */
>  	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
> +	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
> +};
> +
> +enum hv_dynamic_processor_feature_property {
> +	/* Add more values when needed */
> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_MAX_ENCRYPTED_PARTITIONS = 13,
> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_SNP_STATUS = 16,
>  };
>  
>  struct hv_input_get_system_property {
>  	u32 property_id; /* enum hv_system_property */
>  	union {
>  		u32 as_uint32;
> +#if IS_ENABLED(CONFIG_X86)
> +		/* enum hv_dynamic_processor_feature_property */
> +		u32 hv_processor_feature;
> +#endif
>  		/* More fields to be filled in when needed */
>  	};
>  } __packed;
> @@ -67,9 +139,28 @@ struct hv_input_get_system_property {
>  struct hv_output_get_system_property {
>  	union {
>  		u32 scheduler_type; /* enum hv_scheduler_type */
> +#if IS_ENABLED(CONFIG_X86)
> +		u64 hv_processor_feature_value;
> +#endif
>  	};
>  } __packed;
>  
> +struct hv_input_map_stats_page {
> +	u32 type; /* enum hv_stats_object_type */
> +	u32 padding;
> +	union hv_stats_object_identity identity;
> +} __packed;
> +
> +struct hv_output_map_stats_page {
> +	u64 map_location;
> +} __packed;
> +
> +struct hv_input_unmap_stats_page {
> +	u32 type; /* enum hv_stats_object_type */
> +	u32 padding;
> +	union hv_stats_object_identity identity;
> +} __packed;
> +
>  struct hv_proximity_domain_flags {
>  	u32 proximity_preferred : 1;
>  	u32 reserved : 30;



* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-28  1:27   ` Easwar Hariharan
@ 2025-03-01  0:52     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-01  0:52 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/27/2025 5:27 PM, Easwar Hariharan wrote:
> On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
>> A few additional definitions are required for the mshv driver code
>> (to follow). Introduce those here and clean up a little bit while
>> at it.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>>  include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>>  include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>>  3 files changed, 280 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
>> index 58895883f636..e4a3cca0cbce 100644
>> --- a/include/hyperv/hvgdk_mini.h
>> +++ b/include/hyperv/hvgdk_mini.h
>> @@ -13,7 +13,7 @@ struct hv_u128 {
>>  	u64 high_part;
>>  } __packed;
>>  
> 
> <snip>
> 
>>  union hv_input_vtl {
>>  	u8 as_uint8;
>>  	struct {
>> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>>  	struct hv_device_interrupt_target int_target;
>>  } __packed __aligned(8);
>>  
>> +enum hv_intercept_type {
>> +#if defined(CONFIG_X86_64)
> 
> These chosen ifdef's come across kinda arbitrary. The hypervisor code has
> this enabled for both 32-bit and 64-bit x86, but you've chosen x86_64 only.
> I thought that may be because we only intend to support root partition for 64-bit
> platforms, but then, below...
> 
Oops! They should all be X86 instead of X86_64. It's true that the root partition is only
supported on 64-bit systems, but guests (which also use these headers) can of course
be 32-bit. It makes more sense to just use CONFIG_X86 for these ifdefs.
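
To illustrate the intended change, here is a minimal userspace sketch of the enum guarded by CONFIG_X86 rather than CONFIG_X86_64. The `#define CONFIG_X86 1` line is a stand-in for the kernel build configuration and is an assumption of this sketch, as is the use of `defined()` in place of the kernel-only IS_ENABLED() macro. HV_INTERCEPT_TYPE_INVALID is left out only to keep the sketch strictly portable C.

```c
#include <assert.h>

/*
 * Stand-in for the Kconfig symbol (assumption for this sketch); in the
 * kernel this comes from the build configuration, not from a #define.
 */
#define CONFIG_X86 1

enum hv_intercept_type {
#if defined(CONFIG_X86)
	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
#endif
	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
#if defined(CONFIG_X86)
	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
#endif
	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
#if defined(CONFIG_X86)
	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
#endif
	/* One past the last value that was compiled in. */
	HV_INTERCEPT_TYPE_MAX,
};

/* With the x86 members present, MAX follows HV_INTERCEPT_TYPE_X64_MSR_INDEX. */
static_assert(HV_INTERCEPT_TYPE_MAX == 0xD,
	      "unexpected HV_INTERCEPT_TYPE_MAX");
```

Because every member carries an explicit value, the guard only controls which names are visible; the numbering of the architecture-neutral members is the same either way.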

>> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
>> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
>> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
>> +#endif
>> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
>> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
>> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
>> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
>> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
>> +#endif
>> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
>> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
>> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
>> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
>> +#endif
>> +	HV_INTERCEPT_TYPE_MAX,
>> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
>> +};
>> +
>> +union hv_intercept_parameters {
>> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
>> +	__u64 as_uint64;
>> +#if defined(CONFIG_X86_64)
>> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
>> +	__u16 io_port;
>> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
>> +	__u32 cpuid_index;
>> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
>> +	__u32 apic_write_mask;
>> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
>> +	__u16 exception_vector;
>> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
>> +	__u32 msr_index;
>> +#endif
>> +	/* N.B. Other intercept types do not have any parameters. */
>> +};
>> +
>>  /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>>  #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
>>  
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 64407c2a3809..1b447155c338 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -19,11 +19,24 @@
>>  
>>  #define HV_VP_REGISTER_PAGE_VERSION_1	1u
>>  
>> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
>> +
>> +union hv_vp_register_page_interrupt_vectors {
>> +	u64 as_uint64;
>> +	struct {
>> +		u8 vector_count;
>> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
>> +	} __packed;
>> +} __packed;
>> +
>>  struct hv_vp_register_page {
>>  	u16 version;
>>  	u8 isvalid;
>>  	u8 rsvdz;
>>  	u32 dirty;
>> +
>> +#if IS_ENABLED(CONFIG_X86)
>> +
> 
> ...you've chosen to include 32bit here, where the hypervisor code supports both.
> 
> Confused
> 
>>  	union {
>>  		struct {
>>  			/* General purpose registers
>> @@ -95,6 +108,22 @@ struct hv_vp_register_page {
>>  	union hv_x64_pending_interruption_register pending_interruption;
>>  	union hv_x64_interrupt_state_register interrupt_state;
>>  	u64 instruction_emulation_hints;
>> +	u64 xfem;
>> +
>> +	/*
>> +	 * Fields from this point are not included in the register page save chunk.
>> +	 * The reserved field is intended to maintain alignment for unsaved fields.
>> +	 */
>> +	u8 reserved1[0x100];
>> +
>> +	/*
>> +	 * Interrupts injected as part of HvCallDispatchVp.
>> +	 */
>> +	union hv_vp_register_page_interrupt_vectors interrupt_vectors;
>> +
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> +	/* Not yet supported in ARM */
>> +#endif
>>  } __packed;
>>  
>>  #define HV_PARTITION_PROCESSOR_FEATURES_BANKS 2
>> @@ -299,10 +328,11 @@ union hv_partition_isolation_properties {
>>  #define HV_PARTITION_ISOLATION_HOST_TYPE_RESERVED   0x2
>>  
>>  /* Note: Exo partition is enabled by default */
>> -#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
>> -#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
>> -#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED   BIT(19)
>> -#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE                   BIT(22)
>> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED		BIT(4)
>> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION			BIT(8)
>> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED			BIT(13)
>> +#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED	BIT(19)
>> +#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE			BIT(22)
>>  
>>  struct hv_input_create_partition {
>>  	u64 flags;
>> @@ -349,13 +379,23 @@ struct hv_input_set_partition_property {
>>  enum hv_vp_state_page_type {
>>  	HV_VP_STATE_PAGE_REGISTERS = 0,
>>  	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
>> +	HV_VP_STATE_PAGE_GHCB,
>>  	HV_VP_STATE_PAGE_COUNT
>>  };
>>  
>>  struct hv_input_map_vp_state_page {
>>  	u64 partition_id;
>>  	u32 vp_index;
>> -	u32 type; /* enum hv_vp_state_page_type */
>> +	u16 type; /* enum hv_vp_state_page_type */
>> +	union hv_input_vtl input_vtl;
>> +	union {
>> +		u8 as_uint8;
>> +		struct {
>> +			u8 map_location_provided : 1;
>> +			u8 reserved : 7;
>> +		};
>> +	} flags;
>> +	u64 requested_map_location;
>>  } __packed;
>>  
>>  struct hv_output_map_vp_state_page {
>> @@ -365,7 +405,14 @@ struct hv_output_map_vp_state_page {
>>  struct hv_input_unmap_vp_state_page {
>>  	u64 partition_id;
>>  	u32 vp_index;
>> -	u32 type; /* enum hv_vp_state_page_type */
>> +	u16 type; /* enum hv_vp_state_page_type */
>> +	union hv_input_vtl input_vtl;
>> +	u8 reserved0;
>> +} __packed;
>> +
>> +struct hv_x64_apic_eoi_message {
>> +	__u32 vp_index;
>> +	__u32 interrupt_vector;
> 
> Can these be plain u32? Similar below...
> 
Yes, these are some uapi types I forgot to convert somehow, oops!
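
As a rough sketch of the conversion (not taken from a later revision of the series), the struct would keep the same layout with plain u32 members. The `typedef` below is a userspace stand-in for the kernel's u32 and `__attribute__((packed))` stands in for the kernel's __packed; both are assumptions made so the sketch compiles outside the kernel tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's u32 (assumption for this sketch). */
typedef uint32_t u32;

/* Same members as in the patch, with the uapi-style __u32 replaced by u32. */
struct hv_x64_apic_eoi_message {
	u32 vp_index;
	u32 interrupt_vector;
} __attribute__((packed));

/* The type change must not alter the layout shared with the hypervisor. */
static_assert(sizeof(struct hv_x64_apic_eoi_message) == 8,
	      "unexpected hv_x64_apic_eoi_message size");
static_assert(offsetof(struct hv_x64_apic_eoi_message, interrupt_vector) == 4,
	      "unexpected interrupt_vector offset");
```

The same mechanical substitution would apply to the other __u8/__u16/__u32/__u64 members flagged further down in the patch.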

Thanks
Nuno

<snip>



* RE: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-02-28  1:27   ` Easwar Hariharan
@ 2025-03-07 17:26   ` Michael Kelley
  2025-03-07 23:35     ` Nuno Das Neves
  2025-03-10 12:40   ` Tianyu Lan
  4 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-07 17:26 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 
> A few additional definitions are required for the mshv driver code
> (to follow). Introduce those here and clean up a little bit while
> at it.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>  include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>  include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>  3 files changed, 280 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 58895883f636..e4a3cca0cbce 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -13,7 +13,7 @@ struct hv_u128 {
>  	u64 high_part;
>  } __packed;
> 
> -/* NOTE: when adding below, update hv_status_to_string() */
> +/* NOTE: when adding below, update hv_result_to_string() */
>  #define HV_STATUS_SUCCESS			    0x0
>  #define HV_STATUS_INVALID_HYPERCALL_CODE	    0x2
>  #define HV_STATUS_INVALID_HYPERCALL_INPUT	    0x3
> @@ -51,6 +51,7 @@ struct hv_u128 {
>  #define HV_HYP_PAGE_SHIFT		12
>  #define HV_HYP_PAGE_SIZE		BIT(HV_HYP_PAGE_SHIFT)
>  #define HV_HYP_PAGE_MASK		(~(HV_HYP_PAGE_SIZE - 1))
> +#define HV_HYP_LARGE_PAGE_SHIFT		21
> 
>  #define HV_PARTITION_ID_INVALID		((u64)0)
>  #define HV_PARTITION_ID_SELF		((u64)-1)
> @@ -374,6 +375,10 @@ union hv_hypervisor_version_info {
>  #define HV_SHARED_GPA_BOUNDARY_ACTIVE			BIT(5)
>  #define HV_SHARED_GPA_BOUNDARY_BITS			GENMASK(11, 6)
> 
> +/* HYPERV_CPUID_FEATURES.ECX bits. */
> +#define HV_VP_DISPATCH_INTERRUPT_INJECTION_AVAILABLE	BIT(9)
> +#define HV_VP_GHCB_ROOT_MAPPING_AVAILABLE		BIT(10)
> +
>  enum hv_isolation_type {
>  	HV_ISOLATION_TYPE_NONE	= 0,	/* HV_PARTITION_ISOLATION_TYPE_NONE */
>  	HV_ISOLATION_TYPE_VBS	= 1,
> @@ -437,9 +442,12 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>  #define HVCALL_MAP_GPA_PAGES				0x004b
>  #define HVCALL_UNMAP_GPA_PAGES				0x004c
>  #define HVCALL_CREATE_VP				0x004e
> +#define HVCALL_INSTALL_INTERCEPT			0x004d

This is numerically out of order. It should come before HVCALL_CREATE_VP.
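
For reference, a small sketch of the ordering being asked for, using only the values already present in this hunk. The static_assert is an addition of this sketch to make the ordering checkable; it is not something the patch contains.

```c
#include <assert.h>

/* Hypercall codes from the hunk above, sorted by numeric value. */
#define HVCALL_MAP_GPA_PAGES			0x004b
#define HVCALL_UNMAP_GPA_PAGES			0x004c
#define HVCALL_INSTALL_INTERCEPT		0x004d
#define HVCALL_CREATE_VP			0x004e
#define HVCALL_DELETE_VP			0x004f

/* The new code sits between its numeric neighbours. */
static_assert(HVCALL_UNMAP_GPA_PAGES < HVCALL_INSTALL_INTERCEPT &&
	      HVCALL_INSTALL_INTERCEPT < HVCALL_CREATE_VP,
	      "HVCALL_INSTALL_INTERCEPT is out of numeric order");
```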

>  #define HVCALL_DELETE_VP				0x004f
>  #define HVCALL_GET_VP_REGISTERS				0x0050
>  #define HVCALL_SET_VP_REGISTERS				0x0051
> +#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS		0x0052
> +#define HVCALL_CLEAR_VIRTUAL_INTERRUPT			0x0056
>  #define HVCALL_DELETE_PORT				0x0058
>  #define HVCALL_DISCONNECT_PORT				0x005b
>  #define HVCALL_POST_MESSAGE				0x005c
> @@ -447,12 +455,15 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>  #define HVCALL_POST_DEBUG_DATA				0x0069
>  #define HVCALL_RETRIEVE_DEBUG_DATA			0x006a
>  #define HVCALL_RESET_DEBUG_SESSION			0x006b
> +#define HVCALL_MAP_STATS_PAGE				0x006c
> +#define HVCALL_UNMAP_STATS_PAGE				0x006d
>  #define HVCALL_ADD_LOGICAL_PROCESSOR			0x0076
>  #define HVCALL_GET_SYSTEM_PROPERTY			0x007b
>  #define HVCALL_MAP_DEVICE_INTERRUPT			0x007c
>  #define HVCALL_UNMAP_DEVICE_INTERRUPT			0x007d
>  #define HVCALL_RETARGET_INTERRUPT			0x007e
>  #define HVCALL_NOTIFY_PORT_RING_EMPTY			0x008b
> +#define HVCALL_REGISTER_INTERCEPT_RESULT		0x0091
>  #define HVCALL_ASSERT_VIRTUAL_INTERRUPT			0x0094
>  #define HVCALL_CREATE_PORT				0x0095
>  #define HVCALL_CONNECT_PORT				0x0096
> @@ -460,12 +471,18 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>  #define HVCALL_GET_VP_ID_FROM_APIC_ID			0x009a
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
> +#define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
> +#define HVCALL_POST_MESSAGE_DIRECT			0x00c1
>  #define HVCALL_DISPATCH_VP				0x00c2
> +#define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
> +#define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
> +#define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
>  #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
>  #define HVCALL_MAP_VP_STATE_PAGE			0x00e1
>  #define HVCALL_UNMAP_VP_STATE_PAGE			0x00e2
>  #define HVCALL_GET_VP_STATE				0x00e3
>  #define HVCALL_SET_VP_STATE				0x00e4
> +#define HVCALL_GET_VP_CPUID_VALUES			0x00f4
>  #define HVCALL_MMIO_READ				0x0106
>  #define HVCALL_MMIO_WRITE				0x0107
> 
> @@ -807,6 +824,8 @@ struct hv_x64_table_register {
>  	u64 base;
>  } __packed;
> 
> +#define HV_NORMAL_VTL	0
> +
>  union hv_input_vtl {
>  	u8 as_uint8;
>  	struct {
> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /* HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>  	struct hv_device_interrupt_target int_target;
>  } __packed __aligned(8);
> 
> +enum hv_intercept_type {
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
> +#endif
> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
> +#endif
> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
> +#if defined(CONFIG_X86_64)
> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
> +#endif
> +	HV_INTERCEPT_TYPE_MAX,
> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
> +};
> +
> +union hv_intercept_parameters {
> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
> +	__u64 as_uint64;
> +#if defined(CONFIG_X86_64)
> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
> +	__u16 io_port;
> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
> +	__u32 cpuid_index;
> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
> +	__u32 apic_write_mask;
> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
> +	__u16 exception_vector;
> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
> +	__u32 msr_index;
> +#endif
> +	/* N.B. Other intercept types do not have any parameters. */
> +};
> +
>  /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>  #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
> 
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 64407c2a3809..1b447155c338 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -19,11 +19,24 @@
> 
>  #define HV_VP_REGISTER_PAGE_VERSION_1	1u
> 
> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
> +
> +union hv_vp_register_page_interrupt_vectors {
> +	u64 as_uint64;
> +	struct {
> +		u8 vector_count;
> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
> +	} __packed;
> +} __packed;
> +
>  struct hv_vp_register_page {
>  	u16 version;
>  	u8 isvalid;
>  	u8 rsvdz;
>  	u32 dirty;
> +
> +#if IS_ENABLED(CONFIG_X86)
> +
>  	union {
>  		struct {
>  			/* General purpose registers
> @@ -95,6 +108,22 @@ struct hv_vp_register_page {
>  	union hv_x64_pending_interruption_register pending_interruption;
>  	union hv_x64_interrupt_state_register interrupt_state;
>  	u64 instruction_emulation_hints;
> +	u64 xfem;
> +
> +	/*
> +	 * Fields from this point are not included in the register page save chunk.
> +	 * The reserved field is intended to maintain alignment for unsaved fields.
> +	 */
> +	u8 reserved1[0x100];
> +
> +	/*
> +	 * Interrupts injected as part of HvCallDispatchVp.
> +	 */
> +	union hv_vp_register_page_interrupt_vectors interrupt_vectors;
> +
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	/* Not yet supported in ARM */
> +#endif
>  } __packed;
> 
>  #define HV_PARTITION_PROCESSOR_FEATURES_BANKS 2
> @@ -299,10 +328,11 @@ union hv_partition_isolation_properties {
>  #define HV_PARTITION_ISOLATION_HOST_TYPE_RESERVED   0x2
> 
>  /* Note: Exo partition is enabled by default */
> -#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
> -#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> -#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED   BIT(19)
> -#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE                   BIT(22)
> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED		BIT(4)
> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION			BIT(8)
> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED			BIT(13)
> +#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED	BIT(19)
> +#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE			BIT(22)
> 
>  struct hv_input_create_partition {
>  	u64 flags;
> @@ -349,13 +379,23 @@ struct hv_input_set_partition_property {
>  enum hv_vp_state_page_type {
>  	HV_VP_STATE_PAGE_REGISTERS = 0,
>  	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
> +	HV_VP_STATE_PAGE_GHCB,

Seems like this enum member should have an explicit value assigned
since it is part of the contract with the hypervisor.
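
Something along these lines, as a sketch. The value 2 is simply what the implicit numbering in the patch already produces; it is assumed here, not independently checked against the hypervisor headers.

```c
#include <assert.h>

enum hv_vp_state_page_type {
	HV_VP_STATE_PAGE_REGISTERS = 0,
	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
	/* Explicit value: assumed to match the implicit numbering in the patch. */
	HV_VP_STATE_PAGE_GHCB = 2,
	/* Not part of the ABI, so it can keep tracking the last member. */
	HV_VP_STATE_PAGE_COUNT
};

/* Pinning the value guards against a later insertion shifting it. */
static_assert(HV_VP_STATE_PAGE_GHCB == 2,
	      "HV_VP_STATE_PAGE_GHCB must keep its ABI value");
```

With the value written out, inserting another member above HV_VP_STATE_PAGE_GHCB later cannot silently renumber it.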

>  	HV_VP_STATE_PAGE_COUNT
>  };
> 
>  struct hv_input_map_vp_state_page {
>  	u64 partition_id;
>  	u32 vp_index;
> -	u32 type; /* enum hv_vp_state_page_type */
> +	u16 type; /* enum hv_vp_state_page_type */
> +	union hv_input_vtl input_vtl;
> +	union {
> +		u8 as_uint8;
> +		struct {
> +			u8 map_location_provided : 1;
> +			u8 reserved : 7;
> +		};
> +	} flags;
> +	u64 requested_map_location;
>  } __packed;
> 
>  struct hv_output_map_vp_state_page {
> @@ -365,7 +405,14 @@ struct hv_output_map_vp_state_page {
>  struct hv_input_unmap_vp_state_page {
>  	u64 partition_id;
>  	u32 vp_index;
> -	u32 type; /* enum hv_vp_state_page_type */
> +	u16 type; /* enum hv_vp_state_page_type */
> +	union hv_input_vtl input_vtl;
> +	u8 reserved0;
> +} __packed;
> +
> +struct hv_x64_apic_eoi_message {
> +	__u32 vp_index;
> +	__u32 interrupt_vector;
>  } __packed;
> 
>  struct hv_opaque_intercept_message {
> @@ -515,6 +562,13 @@ struct hv_synthetic_timers_state {
>  	u64 reserved[5];
>  } __packed;
> 
> +struct hv_async_completion_message_payload {
> +	__u64 partition_id;
> +	__u32 status;
> +	__u32 completion_count;
> +	__u64 sub_status;
> +} __packed;
> +
>  union hv_input_delete_vp {
>  	u64 as_uint64[2];
>  	struct {
> @@ -649,6 +703,57 @@ struct hv_input_set_vp_state {
>  	union hv_input_set_vp_state_data data[];
>  } __packed;
> 
> +union hv_x64_vp_execution_state {
> +	__u16 as_uint16;
> +	struct {
> +		__u16 cpl:2;
> +		__u16 cr0_pe:1;
> +		__u16 cr0_am:1;
> +		__u16 efer_lma:1;
> +		__u16 debug_active:1;
> +		__u16 interruption_pending:1;
> +		__u16 vtl:4;
> +		__u16 enclave_mode:1;
> +		__u16 interrupt_shadow:1;
> +		__u16 virtualization_fault_active:1;
> +		__u16 reserved:2;
> +	} __packed;
> +};
> +
> +struct hv_x64_intercept_message_header {
> +	__u32 vp_index;
> +	__u8 instruction_length:4;
> +	__u8 cr8:4; /* Only set for exo partitions */
> +	__u8 intercept_access_type;
> +	union hv_x64_vp_execution_state execution_state;
> +	struct hv_x64_segment_register cs_segment;
> +	__u64 rip;
> +	__u64 rflags;
> +} __packed;
> +
> +union hv_x64_memory_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 gva_valid:1;
> +		__u8 gva_gpa_valid:1;
> +		__u8 hypercall_output_pending:1;
> +		__u8 tlb_locked_no_overlay:1;
> +		__u8 reserved:4;
> +	} __packed;
> +};
> +
> +struct hv_x64_memory_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 cache_type; /* enum hv_cache_type */
> +	__u8 instruction_byte_count;
> +	union hv_x64_memory_access_info memory_access_info;
> +	__u8 tpr_priority;
> +	__u8 reserved1;
> +	__u64 guest_virtual_address;
> +	__u64 guest_physical_address;
> +	__u8 instruction_bytes[16];
> +} __packed;
> +
>  /*
>   * Dispatch state for the VP communicated by the hypervisor to the
>   * VP-dispatching thread in the root on return from HVCALL_DISPATCH_VP.
> @@ -716,6 +821,7 @@ static_assert(sizeof(struct hv_vp_signal_pair_scheduler_message) ==
>  #define HV_DISPATCH_VP_FLAG_SKIP_VP_SPEC_FLUSH		0x8
>  #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_SPEC_FLUSH	0x10
>  #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_USER_SPEC_FLUSH	0x20
> +#define HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION	0x40
> 
>  struct hv_input_dispatch_vp {
>  	u64 partition_id;
> @@ -730,4 +836,18 @@ struct hv_output_dispatch_vp {
>  	u32 dispatch_event; /* enum hv_vp_dispatch_event */
>  } __packed;
> 
> +struct hv_input_modify_sparse_spa_page_host_access {
> +	u32 host_access : 2;
> +	u32 reserved : 30;
> +	u32 flags;
> +	u64 partition_id;
> +	u64 spa_page_list[];
> +} __packed;
> +
> +/* hv_input_modify_sparse_spa_page_host_access flags */
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE  0x1
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED     0x2
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE      0x4
> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE       0x8
> +
>  #endif /* _HV_HVHDK_H */
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index f8a39d3e9ce6..42e7876455b5 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -36,6 +36,52 @@ enum hv_scheduler_type {
>  	HV_SCHEDULER_TYPE_MAX
>  };
> 
> +/* HV_STATS_AREA_TYPE */
> +enum hv_stats_area_type {
> +	HV_STATS_AREA_SELF = 0,
> +	HV_STATS_AREA_PARENT = 1,
> +	HV_STATS_AREA_INTERNAL = 2,
> +	HV_STATS_AREA_COUNT
> +};
> +
> +enum hv_stats_object_type {
> +	HV_STATS_OBJECT_HYPERVISOR		= 0x00000001,
> +	HV_STATS_OBJECT_LOGICAL_PROCESSOR	= 0x00000002,
> +	HV_STATS_OBJECT_PARTITION		= 0x00010001,
> +	HV_STATS_OBJECT_VP			= 0x00010002
> +};
> +
> +union hv_stats_object_identity {
> +	/* hv_stats_hypervisor */
> +	struct {
> +		u8 reserved[15];
> +		u8 stats_area_type;
> +	} __packed hv;
> +
> +	/* hv_stats_logical_processor */
> +	struct {
> +		u32 lp_index;
> +		u8 reserved[11];
> +		u8 stats_area_type;
> +	} __packed lp;
> +
> +	/* hv_stats_partition */
> +	struct {
> +		u64 partition_id;
> +		u8  reserved[7];
> +		u8  stats_area_type;
> +	} __packed partition;
> +
> +	/* hv_stats_vp */
> +	struct {
> +		u64 partition_id;
> +		u32 vp_index;
> +		u16 flags;
> +		u8  reserved;
> +		u8  stats_area_type;
> +	} __packed vp;
> +};
> +
>  enum hv_partition_property_code {
>  	/* Privilege properties */
>  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> @@ -47,19 +93,45 @@ enum hv_partition_property_code {
> 
>  	/* Compatibility properties */
>  	HV_PARTITION_PROPERTY_PROCESSOR_XSAVE_FEATURES		= 0x00060002,
> +	HV_PARTITION_PROPERTY_XSAVE_STATES                      = 0x00060007,
>  	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE		= 0x00060008,
>  	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY		= 0x00060009,
>  };
> 
> +enum hv_snp_status {
> +	HV_SNP_STATUS_NONE = 0,
> +	HV_SNP_STATUS_AVAILABLE = 1,
> +	HV_SNP_STATUS_INCOMPATIBLE = 2,
> +	HV_SNP_STATUS_PSP_UNAVAILABLE = 3,
> +	HV_SNP_STATUS_PSP_INIT_FAILED = 4,
> +	HV_SNP_STATUS_PSP_BAD_FW_VERSION = 5,
> +	HV_SNP_STATUS_BAD_CONFIGURATION = 6,
> +	HV_SNP_STATUS_PSP_FW_UPDATE_IN_PROGRESS = 7,
> +	HV_SNP_STATUS_PSP_RB_INIT_FAILED = 8,
> +	HV_SNP_STATUS_PSP_PLATFORM_STATUS_FAILED = 9,
> +	HV_SNP_STATUS_PSP_INIT_LATE_FAILED = 10,
> +};
> +
>  enum hv_system_property {
>  	/* Add more values when needed */
>  	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
> +	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
> +};
> +
> +enum hv_dynamic_processor_feature_property {
> +	/* Add more values when needed */
> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_MAX_ENCRYPTED_PARTITIONS = 13,
> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_SNP_STATUS = 16,
>  };
> 
>  struct hv_input_get_system_property {
>  	u32 property_id; /* enum hv_system_property */
>  	union {
>  		u32 as_uint32;
> +#if IS_ENABLED(CONFIG_X86)
> +		/* enum hv_dynamic_processor_feature_property */
> +		u32 hv_processor_feature;
> +#endif
>  		/* More fields to be filled in when needed */
>  	};
>  } __packed;
> @@ -67,9 +139,28 @@ struct hv_input_get_system_property {
>  struct hv_output_get_system_property {
>  	union {
>  		u32 scheduler_type; /* enum hv_scheduler_type */
> +#if IS_ENABLED(CONFIG_X86)
> +		u64 hv_processor_feature_value;
> +#endif
>  	};
>  } __packed;
> 
> +struct hv_input_map_stats_page {
> +	u32 type; /* enum hv_stats_object_type */
> +	u32 padding;
> +	union hv_stats_object_identity identity;
> +} __packed;
> +
> +struct hv_output_map_stats_page {
> +	u64 map_location;
> +} __packed;
> +
> +struct hv_input_unmap_stats_page {
> +	u32 type; /* enum hv_stats_object_type */
> +	u32 padding;
> +	union hv_stats_object_identity identity;
> +} __packed;
> +
>  struct hv_proximity_domain_flags {
>  	u32 proximity_preferred : 1;
>  	u32 reserved : 30;
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-03-07 17:26   ` Michael Kelley
@ 2025-03-07 23:35     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-07 23:35 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/7/2025 9:26 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>
>> A few additional definitions are required for the mshv driver code
>> (to follow). Introduce those here and clean up a little bit while
>> at it.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  include/hyperv/hvgdk_mini.h |  64 ++++++++++++++++-
>>  include/hyperv/hvhdk.h      | 132 ++++++++++++++++++++++++++++++++++--
>>  include/hyperv/hvhdk_mini.h |  91 +++++++++++++++++++++++++
>>  3 files changed, 280 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
>> index 58895883f636..e4a3cca0cbce 100644
>> --- a/include/hyperv/hvgdk_mini.h
>> +++ b/include/hyperv/hvgdk_mini.h
>> @@ -13,7 +13,7 @@ struct hv_u128 {
>>  	u64 high_part;
>>  } __packed;
>>
>> -/* NOTE: when adding below, update hv_status_to_string() */
>> +/* NOTE: when adding below, update hv_result_to_string() */
>>  #define HV_STATUS_SUCCESS			    0x0
>>  #define HV_STATUS_INVALID_HYPERCALL_CODE	    0x2
>>  #define HV_STATUS_INVALID_HYPERCALL_INPUT	    0x3
>> @@ -51,6 +51,7 @@ struct hv_u128 {
>>  #define HV_HYP_PAGE_SHIFT		12
>>  #define HV_HYP_PAGE_SIZE		BIT(HV_HYP_PAGE_SHIFT)
>>  #define HV_HYP_PAGE_MASK		(~(HV_HYP_PAGE_SIZE - 1))
>> +#define HV_HYP_LARGE_PAGE_SHIFT		21
>>
>>  #define HV_PARTITION_ID_INVALID		((u64)0)
>>  #define HV_PARTITION_ID_SELF		((u64)-1)
>> @@ -374,6 +375,10 @@ union hv_hypervisor_version_info {
>>  #define HV_SHARED_GPA_BOUNDARY_ACTIVE			BIT(5)
>>  #define HV_SHARED_GPA_BOUNDARY_BITS			GENMASK(11, 6)
>>
>> +/* HYPERV_CPUID_FEATURES.ECX bits. */
>> +#define HV_VP_DISPATCH_INTERRUPT_INJECTION_AVAILABLE	BIT(9)
>> +#define HV_VP_GHCB_ROOT_MAPPING_AVAILABLE		BIT(10)
>> +
>>  enum hv_isolation_type {
>>  	HV_ISOLATION_TYPE_NONE	= 0,	/*
>> HV_PARTITION_ISOLATION_TYPE_NONE */
>>  	HV_ISOLATION_TYPE_VBS	= 1,
>> @@ -437,9 +442,12 @@ union hv_vp_assist_msr_contents {	 /*
>> HV_REGISTER_VP_ASSIST_PAGE */
>>  #define HVCALL_MAP_GPA_PAGES				0x004b
>>  #define HVCALL_UNMAP_GPA_PAGES				0x004c
>>  #define HVCALL_CREATE_VP				0x004e
>> +#define HVCALL_INSTALL_INTERCEPT			0x004d
> 
> This is numerically out-of-order.  Should be before HVCALL_CREATE_VP.
> 
Oops! Thanks for spotting that.

>>  #define HVCALL_DELETE_VP				0x004f
>>  #define HVCALL_GET_VP_REGISTERS				0x0050
>>  #define HVCALL_SET_VP_REGISTERS				0x0051
>> +#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS		0x0052
>> +#define HVCALL_CLEAR_VIRTUAL_INTERRUPT			0x0056
>>  #define HVCALL_DELETE_PORT				0x0058
>>  #define HVCALL_DISCONNECT_PORT				0x005b
>>  #define HVCALL_POST_MESSAGE				0x005c
>> @@ -447,12 +455,15 @@ union hv_vp_assist_msr_contents {	 /*
>> HV_REGISTER_VP_ASSIST_PAGE */
>>  #define HVCALL_POST_DEBUG_DATA				0x0069
>>  #define HVCALL_RETRIEVE_DEBUG_DATA			0x006a
>>  #define HVCALL_RESET_DEBUG_SESSION			0x006b
>> +#define HVCALL_MAP_STATS_PAGE				0x006c
>> +#define HVCALL_UNMAP_STATS_PAGE				0x006d
>>  #define HVCALL_ADD_LOGICAL_PROCESSOR			0x0076
>>  #define HVCALL_GET_SYSTEM_PROPERTY			0x007b
>>  #define HVCALL_MAP_DEVICE_INTERRUPT			0x007c
>>  #define HVCALL_UNMAP_DEVICE_INTERRUPT			0x007d
>>  #define HVCALL_RETARGET_INTERRUPT			0x007e
>>  #define HVCALL_NOTIFY_PORT_RING_EMPTY			0x008b
>> +#define HVCALL_REGISTER_INTERCEPT_RESULT		0x0091
>>  #define HVCALL_ASSERT_VIRTUAL_INTERRUPT			0x0094
>>  #define HVCALL_CREATE_PORT				0x0095
>>  #define HVCALL_CONNECT_PORT				0x0096
>> @@ -460,12 +471,18 @@ union hv_vp_assist_msr_contents {	 /*
>> HV_REGISTER_VP_ASSIST_PAGE */
>>  #define HVCALL_GET_VP_ID_FROM_APIC_ID			0x009a
>>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
>>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
>> +#define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
>> +#define HVCALL_POST_MESSAGE_DIRECT			0x00c1
>>  #define HVCALL_DISPATCH_VP				0x00c2
>> +#define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
>> +#define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
>> +#define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
>>  #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
>>  #define HVCALL_MAP_VP_STATE_PAGE			0x00e1
>>  #define HVCALL_UNMAP_VP_STATE_PAGE			0x00e2
>>  #define HVCALL_GET_VP_STATE				0x00e3
>>  #define HVCALL_SET_VP_STATE				0x00e4
>> +#define HVCALL_GET_VP_CPUID_VALUES			0x00f4
>>  #define HVCALL_MMIO_READ				0x0106
>>  #define HVCALL_MMIO_WRITE				0x0107
>>
>> @@ -807,6 +824,8 @@ struct hv_x64_table_register {
>>  	u64 base;
>>  } __packed;
>>
>> +#define HV_NORMAL_VTL	0
>> +
>>  union hv_input_vtl {
>>  	u8 as_uint8;
>>  	struct {
>> @@ -1325,6 +1344,49 @@ struct hv_retarget_device_interrupt {	 /*
>> HV_INPUT_RETARGET_DEVICE_INTERRUPT */
>>  	struct hv_device_interrupt_target int_target;
>>  } __packed __aligned(8);
>>
>> +enum hv_intercept_type {
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0x00000000,
>> +	HV_INTERCEPT_TYPE_X64_MSR			= 0x00000001,
>> +	HV_INTERCEPT_TYPE_X64_CPUID			= 0x00000002,
>> +#endif
>> +	HV_INTERCEPT_TYPE_EXCEPTION			= 0x00000003,
>> +	/* Used to be HV_INTERCEPT_TYPE_REGISTER */
>> +	HV_INTERCEPT_TYPE_RESERVED0			= 0x00000004,
>> +	HV_INTERCEPT_TYPE_MMIO				= 0x00000005,
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0x00000006,
>> +	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0x00000007,
>> +#endif
>> +	HV_INTERCEPT_TYPE_HYPERCALL			= 0x00000008,
>> +#if defined(CONFIG_X86_64)
>> +	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0x00000009,
>> +	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0x0000000A,
>> +	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0x0000000B,
>> +	HV_INTERCEPT_TYPE_X64_MSR_INDEX			= 0x0000000C,
>> +#endif
>> +	HV_INTERCEPT_TYPE_MAX,
>> +	HV_INTERCEPT_TYPE_INVALID			= 0xFFFFFFFF,
>> +};
>> +
>> +union hv_intercept_parameters {
>> +	/*  HV_INTERCEPT_PARAMETERS is defined to be an 8-byte field. */
>> +	__u64 as_uint64;
>> +#if defined(CONFIG_X86_64)
>> +	/* HV_INTERCEPT_TYPE_X64_IO_PORT */
>> +	__u16 io_port;
>> +	/* HV_INTERCEPT_TYPE_X64_CPUID */
>> +	__u32 cpuid_index;
>> +	/* HV_INTERCEPT_TYPE_X64_APIC_WRITE */
>> +	__u32 apic_write_mask;
>> +	/* HV_INTERCEPT_TYPE_EXCEPTION */
>> +	__u16 exception_vector;
>> +	/* HV_INTERCEPT_TYPE_X64_MSR_INDEX */
>> +	__u32 msr_index;
>> +#endif
>> +	/* N.B. Other intercept types do not have any parameters. */
>> +};
>> +
>>  /* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
>>  #define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
>>
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 64407c2a3809..1b447155c338 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -19,11 +19,24 @@
>>
>>  #define HV_VP_REGISTER_PAGE_VERSION_1	1u
>>
>> +#define HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT		7
>> +
>> +union hv_vp_register_page_interrupt_vectors {
>> +	u64 as_uint64;
>> +	struct {
>> +		u8 vector_count;
>> +		u8 vector[HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT];
>> +	} __packed;
>> +} __packed;
>> +
>>  struct hv_vp_register_page {
>>  	u16 version;
>>  	u8 isvalid;
>>  	u8 rsvdz;
>>  	u32 dirty;
>> +
>> +#if IS_ENABLED(CONFIG_X86)
>> +
>>  	union {
>>  		struct {
>>  			/* General purpose registers
>> @@ -95,6 +108,22 @@ struct hv_vp_register_page {
>>  	union hv_x64_pending_interruption_register pending_interruption;
>>  	union hv_x64_interrupt_state_register interrupt_state;
>>  	u64 instruction_emulation_hints;
>> +	u64 xfem;
>> +
>> +	/*
>> +	 * Fields from this point are not included in the register page save chunk.
>> +	 * The reserved field is intended to maintain alignment for unsaved fields.
>> +	 */
>> +	u8 reserved1[0x100];
>> +
>> +	/*
>> +	 * Interrupts injected as part of HvCallDispatchVp.
>> +	 */
>> +	union hv_vp_register_page_interrupt_vectors interrupt_vectors;
>> +
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> +	/* Not yet supported in ARM */
>> +#endif
>>  } __packed;
>>
>>  #define HV_PARTITION_PROCESSOR_FEATURES_BANKS 2
>> @@ -299,10 +328,11 @@ union hv_partition_isolation_properties {
>>  #define HV_PARTITION_ISOLATION_HOST_TYPE_RESERVED   0x2
>>
>>  /* Note: Exo partition is enabled by default */
>> -#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
>> -#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
>> -#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED   BIT(19)
>> -#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE                   BIT(22)
>> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED		BIT(4)
>> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION			BIT(8)
>> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED			BIT(13)
>> +#define HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED	BIT(19)
>> +#define HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE			BIT(22)
>>
>>  struct hv_input_create_partition {
>>  	u64 flags;
>> @@ -349,13 +379,23 @@ struct hv_input_set_partition_property {
>>  enum hv_vp_state_page_type {
>>  	HV_VP_STATE_PAGE_REGISTERS = 0,
>>  	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
>> +	HV_VP_STATE_PAGE_GHCB,
> 
> Seems like this enum member should have an explicit value assigned
> since it is part of the contract with the hypervisor.
> 
Fair enough. They are just 0, 1, 2, but I agree it's better to be
explicit with the ABI values.
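
For illustration only, a sketch of what the explicit form might look like.
The values 0 and 1 are taken from the patch as posted, and 2 is what the
implicit numbering already produces for HV_VP_STATE_PAGE_GHCB. The static
assertions are an assumption added here to show how the values could be
pinned at build time; they are not part of the posted series.

```c
#include <assert.h>

/* Sketch: every value that is part of the hypervisor ABI is spelled out. */
enum hv_vp_state_page_type {
	HV_VP_STATE_PAGE_REGISTERS = 0,
	HV_VP_STATE_PAGE_INTERCEPT_MESSAGE = 1,
	HV_VP_STATE_PAGE_GHCB = 2,
	/* Not an ABI value itself; it simply follows the last entry. */
	HV_VP_STATE_PAGE_COUNT
};

/* Assumed addition: fail the build if a later edit shifts the values. */
static_assert(HV_VP_STATE_PAGE_GHCB == 2, "GHCB page type is ABI");
static_assert(HV_VP_STATE_PAGE_COUNT == 3, "count follows the last entry");
```

With explicit values, inserting a new member in the middle of the enum can
no longer silently renumber the entries that the hypervisor depends on.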

>>  	HV_VP_STATE_PAGE_COUNT
>>  };
>>
>>  struct hv_input_map_vp_state_page {
>>  	u64 partition_id;
>>  	u32 vp_index;
>> -	u32 type; /* enum hv_vp_state_page_type */
>> +	u16 type; /* enum hv_vp_state_page_type */
>> +	union hv_input_vtl input_vtl;
>> +	union {
>> +		u8 as_uint8;
>> +		struct {
>> +			u8 map_location_provided : 1;
>> +			u8 reserved : 7;
>> +		};
>> +	} flags;
>> +	u64 requested_map_location;
>>  } __packed;
>>
>>  struct hv_output_map_vp_state_page {
>> @@ -365,7 +405,14 @@ struct hv_output_map_vp_state_page {
>>  struct hv_input_unmap_vp_state_page {
>>  	u64 partition_id;
>>  	u32 vp_index;
>> -	u32 type; /* enum hv_vp_state_page_type */
>> +	u16 type; /* enum hv_vp_state_page_type */
>> +	union hv_input_vtl input_vtl;
>> +	u8 reserved0;
>> +} __packed;
>> +
>> +struct hv_x64_apic_eoi_message {
>> +	__u32 vp_index;
>> +	__u32 interrupt_vector;
>>  } __packed;
>>
>>  struct hv_opaque_intercept_message {
>> @@ -515,6 +562,13 @@ struct hv_synthetic_timers_state {
>>  	u64 reserved[5];
>>  } __packed;
>>
>> +struct hv_async_completion_message_payload {
>> +	__u64 partition_id;
>> +	__u32 status;
>> +	__u32 completion_count;
>> +	__u64 sub_status;
>> +} __packed;
>> +
>>  union hv_input_delete_vp {
>>  	u64 as_uint64[2];
>>  	struct {
>> @@ -649,6 +703,57 @@ struct hv_input_set_vp_state {
>>  	union hv_input_set_vp_state_data data[];
>>  } __packed;
>>
>> +union hv_x64_vp_execution_state {
>> +	__u16 as_uint16;
>> +	struct {
>> +		__u16 cpl:2;
>> +		__u16 cr0_pe:1;
>> +		__u16 cr0_am:1;
>> +		__u16 efer_lma:1;
>> +		__u16 debug_active:1;
>> +		__u16 interruption_pending:1;
>> +		__u16 vtl:4;
>> +		__u16 enclave_mode:1;
>> +		__u16 interrupt_shadow:1;
>> +		__u16 virtualization_fault_active:1;
>> +		__u16 reserved:2;
>> +	} __packed;
>> +};
>> +
>> +struct hv_x64_intercept_message_header {
>> +	__u32 vp_index;
>> +	__u8 instruction_length:4;
>> +	__u8 cr8:4; /* Only set for exo partitions */
>> +	__u8 intercept_access_type;
>> +	union hv_x64_vp_execution_state execution_state;
>> +	struct hv_x64_segment_register cs_segment;
>> +	__u64 rip;
>> +	__u64 rflags;
>> +} __packed;
>> +
>> +union hv_x64_memory_access_info {
>> +	__u8 as_uint8;
>> +	struct {
>> +		__u8 gva_valid:1;
>> +		__u8 gva_gpa_valid:1;
>> +		__u8 hypercall_output_pending:1;
>> +		__u8 tlb_locked_no_overlay:1;
>> +		__u8 reserved:4;
>> +	} __packed;
>> +};
>> +
>> +struct hv_x64_memory_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u32 cache_type; /* enum hv_cache_type */
>> +	__u8 instruction_byte_count;
>> +	union hv_x64_memory_access_info memory_access_info;
>> +	__u8 tpr_priority;
>> +	__u8 reserved1;
>> +	__u64 guest_virtual_address;
>> +	__u64 guest_physical_address;
>> +	__u8 instruction_bytes[16];
>> +} __packed;
>> +
>>  /*
>>   * Dispatch state for the VP communicated by the hypervisor to the
>>   * VP-dispatching thread in the root on return from HVCALL_DISPATCH_VP.
>> @@ -716,6 +821,7 @@ static_assert(sizeof(struct hv_vp_signal_pair_scheduler_message)
>> ==
>>  #define HV_DISPATCH_VP_FLAG_SKIP_VP_SPEC_FLUSH		0x8
>>  #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_SPEC_FLUSH	0x10
>>  #define HV_DISPATCH_VP_FLAG_SKIP_CALLER_USER_SPEC_FLUSH	0x20
>> +#define HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION	0x40
>>
>>  struct hv_input_dispatch_vp {
>>  	u64 partition_id;
>> @@ -730,4 +836,18 @@ struct hv_output_dispatch_vp {
>>  	u32 dispatch_event; /* enum hv_vp_dispatch_event */
>>  } __packed;
>>
>> +struct hv_input_modify_sparse_spa_page_host_access {
>> +	u32 host_access : 2;
>> +	u32 reserved : 30;
>> +	u32 flags;
>> +	u64 partition_id;
>> +	u64 spa_page_list[];
>> +} __packed;
>> +
>> +/* hv_input_modify_sparse_spa_page_host_access flags */
>> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE  0x1
>> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED     0x2
>> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE      0x4
>> +#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE       0x8
>> +
>>  #endif /* _HV_HVHDK_H */
>> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
>> index f8a39d3e9ce6..42e7876455b5 100644
>> --- a/include/hyperv/hvhdk_mini.h
>> +++ b/include/hyperv/hvhdk_mini.h
>> @@ -36,6 +36,52 @@ enum hv_scheduler_type {
>>  	HV_SCHEDULER_TYPE_MAX
>>  };
>>
>> +/* HV_STATS_AREA_TYPE */
>> +enum hv_stats_area_type {
>> +	HV_STATS_AREA_SELF = 0,
>> +	HV_STATS_AREA_PARENT = 1,
>> +	HV_STATS_AREA_INTERNAL = 2,
>> +	HV_STATS_AREA_COUNT
>> +};
>> +
>> +enum hv_stats_object_type {
>> +	HV_STATS_OBJECT_HYPERVISOR		= 0x00000001,
>> +	HV_STATS_OBJECT_LOGICAL_PROCESSOR	= 0x00000002,
>> +	HV_STATS_OBJECT_PARTITION		= 0x00010001,
>> +	HV_STATS_OBJECT_VP			= 0x00010002
>> +};
>> +
>> +union hv_stats_object_identity {
>> +	/* hv_stats_hypervisor */
>> +	struct {
>> +		u8 reserved[15];
>> +		u8 stats_area_type;
>> +	} __packed hv;
>> +
>> +	/* hv_stats_logical_processor */
>> +	struct {
>> +		u32 lp_index;
>> +		u8 reserved[11];
>> +		u8 stats_area_type;
>> +	} __packed lp;
>> +
>> +	/* hv_stats_partition */
>> +	struct {
>> +		u64 partition_id;
>> +		u8  reserved[7];
>> +		u8  stats_area_type;
>> +	} __packed partition;
>> +
>> +	/* hv_stats_vp */
>> +	struct {
>> +		u64 partition_id;
>> +		u32 vp_index;
>> +		u16 flags;
>> +		u8  reserved;
>> +		u8  stats_area_type;
>> +	} __packed vp;
>> +};
>> +
>>  enum hv_partition_property_code {
>>  	/* Privilege properties */
>>  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
>> @@ -47,19 +93,45 @@ enum hv_partition_property_code {
>>
>>  	/* Compatibility properties */
>>  	HV_PARTITION_PROPERTY_PROCESSOR_XSAVE_FEATURES		=
>> 0x00060002,
>> +	HV_PARTITION_PROPERTY_XSAVE_STATES                      = 0x00060007,
>>  	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE		= 0x00060008,
>>  	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY		=
>> 0x00060009,
>>  };
>>
>> +enum hv_snp_status {
>> +	HV_SNP_STATUS_NONE = 0,
>> +	HV_SNP_STATUS_AVAILABLE = 1,
>> +	HV_SNP_STATUS_INCOMPATIBLE = 2,
>> +	HV_SNP_STATUS_PSP_UNAVAILABLE = 3,
>> +	HV_SNP_STATUS_PSP_INIT_FAILED = 4,
>> +	HV_SNP_STATUS_PSP_BAD_FW_VERSION = 5,
>> +	HV_SNP_STATUS_BAD_CONFIGURATION = 6,
>> +	HV_SNP_STATUS_PSP_FW_UPDATE_IN_PROGRESS = 7,
>> +	HV_SNP_STATUS_PSP_RB_INIT_FAILED = 8,
>> +	HV_SNP_STATUS_PSP_PLATFORM_STATUS_FAILED = 9,
>> +	HV_SNP_STATUS_PSP_INIT_LATE_FAILED = 10,
>> +};
>> +
>>  enum hv_system_property {
>>  	/* Add more values when needed */
>>  	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
>> +	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
>> +};
>> +
>> +enum hv_dynamic_processor_feature_property {
>> +	/* Add more values when needed */
>> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_MAX_ENCRYPTED_PARTITIONS = 13,
>> +	HV_X64_DYNAMIC_PROCESSOR_FEATURE_SNP_STATUS = 16,
>>  };
>>
>>  struct hv_input_get_system_property {
>>  	u32 property_id; /* enum hv_system_property */
>>  	union {
>>  		u32 as_uint32;
>> +#if IS_ENABLED(CONFIG_X86)
>> +		/* enum hv_dynamic_processor_feature_property */
>> +		u32 hv_processor_feature;
>> +#endif
>>  		/* More fields to be filled in when needed */
>>  	};
>>  } __packed;
>> @@ -67,9 +139,28 @@ struct hv_input_get_system_property {
>>  struct hv_output_get_system_property {
>>  	union {
>>  		u32 scheduler_type; /* enum hv_scheduler_type */
>> +#if IS_ENABLED(CONFIG_X86)
>> +		u64 hv_processor_feature_value;
>> +#endif
>>  	};
>>  } __packed;
>>
>> +struct hv_input_map_stats_page {
>> +	u32 type; /* enum hv_stats_object_type */
>> +	u32 padding;
>> +	union hv_stats_object_identity identity;
>> +} __packed;
>> +
>> +struct hv_output_map_stats_page {
>> +	u64 map_location;
>> +} __packed;
>> +
>> +struct hv_input_unmap_stats_page {
>> +	u32 type; /* enum hv_stats_object_type */
>> +	u32 padding;
>> +	union hv_stats_object_identity identity;
>> +} __packed;
>> +
>>  struct hv_proximity_domain_flags {
>>  	u32 proximity_preferred : 1;
>>  	u32 reserved : 30;
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
                     ` (3 preceding siblings ...)
  2025-03-07 17:26   ` Michael Kelley
@ 2025-03-10 12:40   ` Tianyu Lan
  2025-03-12 20:17     ` Nuno Das Neves
  4 siblings, 1 reply; 108+ messages in thread
From: Tianyu Lan @ 2025-03-10 12:40 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Feb 27, 2025 at 7:11 AM Nuno Das Neves
<nunodasneves@linux.microsoft.com> wrote:
>
> A few additional definitions are required for the mshv driver code
> (to follow). Introduce those here and clean up a little bit while
> at it.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---

It may be better to use one family of data types consistently in hvhdk.h:
either u8, u16, u32, u64 or __u8, __u16, __u32, __u64.

Everything else looks good.

Reviewed-by: Tianyu Lan <tiala@microsoft.com>

-- 
Thanks
Tianyu Lan

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers
  2025-03-10 12:40   ` Tianyu Lan
@ 2025-03-12 20:17     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-12 20:17 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 3/10/2025 5:40 AM, Tianyu Lan wrote:
> On Thu, Feb 27, 2025 at 7:11 AM Nuno Das Neves
> <nunodasneves@linux.microsoft.com> wrote:
>>
>> A few additional definitions are required for the mshv driver code
>> (to follow). Introduce those here and clean up a little bit while
>> at it.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
> 
> It may be better to use one family of data types consistently in hvhdk.h:
> either u8, u16, u32, u64 or __u8, __u16, __u32, __u64.
> 
Agreed, this was an oversight. They should all be the kernel types
without the __ prefix.

Thanks
Nuno

> Everything else looks good.
> 
> Reviewed-by: Tianyu Lan <tiala@microsoft.com>
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-26 23:07 [PATCH v5 00/10] Introduce /dev/mshv root partition driver Nuno Das Neves
                   ` (8 preceding siblings ...)
  2025-02-26 23:08 ` [PATCH v5 09/10] hyperv: Add definitions for root partition driver to hv headers Nuno Das Neves
@ 2025-02-26 23:08 ` Nuno Das Neves
  2025-02-27  4:59   ` Easwar Hariharan
                     ` (4 more replies)
  9 siblings, 5 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-02-26 23:08 UTC (permalink / raw)
  To: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

Provide a set of IOCTLs for creating and managing child partitions when
running as root partition on Hyper-V. The new driver is enabled via
CONFIG_MSHV_ROOT.

A brief overview of the interface:

MSHV_CREATE_PARTITION is the entry point, returning a file descriptor
representing a child partition. IOCTLs on this fd can be used to map
memory, create VPs, etc.

Creating a VP returns another file descriptor representing that VP which
in turn has another set of corresponding IOCTLs for running the VP,
getting/setting state, etc.

MSHV_ROOT_HVCALL is a generic "passthrough" hypercall IOCTL which can be
used for a number of partition or VP hypercalls. This is for hypercalls
that do not affect any state in the kernel driver, such as getting and
setting VP registers and partition properties, translating addresses,
etc. It is "passthrough" because the binary input and output for the
hypercall is only interpreted by the VMM - the kernel driver does
nothing but insert the VP and partition id where necessary (which are
always in the same place), and execute the hypercall.
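
The "passthrough" idea can be sketched in plain userspace C as below. This
is a simplified model, not the driver code: the helper name and the two
offsets (partition id at byte 0, VP index at byte 8) are assumptions made
for illustration, standing in for the "same place" mentioned above. The
rest of the input buffer is treated as opaque bytes supplied by the VMM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed offsets of the ids inside every hypercall input. */
#define SKETCH_PT_ID_OFFSET	0
#define SKETCH_VP_INDEX_OFFSET	8

/*
 * Patch the partition id (and optionally the VP index) into an otherwise
 * opaque input buffer. All other bytes are left exactly as the caller
 * provided them. Returns 0 on success, -1 if the buffer is too small.
 */
static int sketch_fill_ids(uint8_t *input, size_t len, uint64_t pt_id,
			   int has_vp, uint32_t vp_index)
{
	size_t need = has_vp ? SKETCH_VP_INDEX_OFFSET + sizeof(vp_index)
			     : SKETCH_PT_ID_OFFSET + sizeof(pt_id);

	assert(input != NULL);
	if (len < need)
		return -1;

	memcpy(input + SKETCH_PT_ID_OFFSET, &pt_id, sizeof(pt_id));
	if (has_vp)
		memcpy(input + SKETCH_VP_INDEX_OFFSET, &vp_index,
		       sizeof(vp_index));
	return 0;
}
```

Because the ids are overwritten by the kernel rather than trusted from the
caller, a VMM holding a partition fd cannot aim a passthrough hypercall at
a partition it does not own, while the kernel still needs no knowledge of
the remaining input layout.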

Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Co-developed-by: Muminul Islam <muislam@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |    2 +
 drivers/hv/Makefile                           |    5 +-
 drivers/hv/mshv.h                             |   30 +
 drivers/hv/mshv_common.c                      |  161 ++
 drivers/hv/mshv_eventfd.c                     |  833 ++++++
 drivers/hv/mshv_eventfd.h                     |   71 +
 drivers/hv/mshv_irq.c                         |  128 +
 drivers/hv/mshv_portid_table.c                |   84 +
 drivers/hv/mshv_root.h                        |  321 +++
 drivers/hv/mshv_root_hv_call.c                |  876 +++++++
 drivers/hv/mshv_root_main.c                   | 2329 +++++++++++++++++
 drivers/hv/mshv_synic.c                       |  665 +++++
 include/uapi/linux/mshv.h                     |  287 ++
 13 files changed, 5791 insertions(+), 1 deletion(-)
 create mode 100644 drivers/hv/mshv.h
 create mode 100644 drivers/hv/mshv_common.c
 create mode 100644 drivers/hv/mshv_eventfd.c
 create mode 100644 drivers/hv/mshv_eventfd.h
 create mode 100644 drivers/hv/mshv_irq.c
 create mode 100644 drivers/hv/mshv_portid_table.c
 create mode 100644 drivers/hv/mshv_root.h
 create mode 100644 drivers/hv/mshv_root_hv_call.c
 create mode 100644 drivers/hv/mshv_root_main.c
 create mode 100644 drivers/hv/mshv_synic.c
 create mode 100644 include/uapi/linux/mshv.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 6d1465315df3..66dcfaae698b 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -370,6 +370,8 @@ Code  Seq#    Include File                                           Comments
 0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-remoteproc@vger.kernel.org>
 0xB7  all    uapi/linux/nsfs.h                                       <mailto:Andrei Vagin <avagin@openvz.org>>
 0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
+0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver
+                                                                     <mailto:linux-hyperv@vger.kernel.org>
 0xC0  00-0F  linux/usb/iowarrior.h
 0xCA  00-0F  uapi/misc/cxl.h
 0xCA  10-2F  uapi/misc/ocxl.h
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 2b8dc954b350..976189c725dc 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
 obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
 obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
+obj-$(CONFIG_MSHV_ROOT)		+= mshv_root.o
 
 CFLAGS_hv_trace.o = -I$(src)
 CFLAGS_hv_balloon.o = -I$(src)
@@ -11,7 +12,9 @@ hv_vmbus-y := vmbus_drv.o \
 		 channel_mgmt.o ring_buffer.o hv_trace.o
 hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
+mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
+	       mshv_root_hv_call.o mshv_portid_table.o
 
 # Code that must be built-in
 obj-$(subst m,y,$(CONFIG_HYPERV)) += hv_common.o
-obj-$(subst m,y,$(CONFIG_MSHV_ROOT)) += hv_proc.o
+obj-$(subst m,y,$(CONFIG_MSHV_ROOT)) += hv_proc.o mshv_common.o
diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
new file mode 100644
index 000000000000..0340a67acd0a
--- /dev/null
+++ b/drivers/hv/mshv.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2023, Microsoft Corporation.
+ */
+
+#ifndef _MSHV_H_
+#define _MSHV_H_
+
+#include <linux/stddef.h>
+#include <linux/string.h>
+#include <hyperv/hvhdk.h>
+
+#define mshv_field_nonzero(STRUCT, MEMBER) \
+	memchr_inv(&((STRUCT).MEMBER), \
+		   0, sizeof_field(typeof(STRUCT), MEMBER))
+
+int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+			     union hv_input_vtl input_vtl,
+			     struct hv_register_assoc *registers);
+
+int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+			     union hv_input_vtl input_vtl,
+			     struct hv_register_assoc *registers);
+
+int hv_call_get_partition_property(u64 partition_id, u64 property_code,
+				   u64 *property_value);
+
+int mshv_do_pre_guest_mode_work(ulong th_flags);
+
+#endif /* _MSHV_H_ */
diff --git a/drivers/hv/mshv_common.c b/drivers/hv/mshv_common.c
new file mode 100644
index 000000000000..d97631dcbee1
--- /dev/null
+++ b/drivers/hv/mshv_common.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, Microsoft Corporation.
+ *
+ * This file contains functions that are called from one or more modules: ROOT,
+ * DIAG, or VTL. If any of these modules are configured to build, this file is
+ * built and just statically linked in.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <asm/mshyperv.h>
+#include <linux/resume_user_mode.h>
+
+#include "mshv.h"
+
+#define HV_GET_REGISTER_BATCH_SIZE	\
+	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
+#define HV_SET_REGISTER_BATCH_SIZE	\
+	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
+		/ sizeof(struct hv_register_assoc))
+
+int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+			     union hv_input_vtl input_vtl,
+			     struct hv_register_assoc *registers)
+{
+	struct hv_input_get_vp_registers *input_page;
+	union hv_register_value *output_page;
+	u16 completed = 0;
+	unsigned long remaining = count;
+	int rep_count, i;
+	u64 status = HV_STATUS_SUCCESS;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	input_page->partition_id = partition_id;
+	input_page->vp_index = vp_index;
+	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
+	input_page->rsvd_z8 = 0;
+	input_page->rsvd_z16 = 0;
+
+	while (remaining) {
+		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
+		for (i = 0; i < rep_count; ++i)
+			input_page->names[i] = registers[i].name;
+
+		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
+					     0, input_page, output_page);
+		if (!hv_result_success(status))
+			break;
+
+		completed = hv_repcomp(status);
+		for (i = 0; i < completed; ++i)
+			registers[i].value = output_page[i];
+
+		registers += completed;
+		remaining -= completed;
+	}
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+EXPORT_SYMBOL_GPL(hv_call_get_vp_registers);
+
+int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+			     union hv_input_vtl input_vtl,
+			     struct hv_register_assoc *registers)
+{
+	struct hv_input_set_vp_registers *input_page;
+	u16 completed = 0;
+	unsigned long remaining = count;
+	int rep_count;
+	u64 status = HV_STATUS_SUCCESS;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+	input_page->partition_id = partition_id;
+	input_page->vp_index = vp_index;
+	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
+	input_page->rsvd_z8 = 0;
+	input_page->rsvd_z16 = 0;
+
+	while (remaining) {
+		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
+		memcpy(input_page->elements, registers,
+		       sizeof(struct hv_register_assoc) * rep_count);
+
+		status = hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
+					     0, input_page, NULL);
+		if (!hv_result_success(status))
+			break;
+
+		completed = hv_repcomp(status);
+		registers += completed;
+		remaining -= completed;
+	}
+
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+EXPORT_SYMBOL_GPL(hv_call_set_vp_registers);
+
+int hv_call_get_partition_property(u64 partition_id,
+				   u64 property_code,
+				   u64 *property_value)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_get_partition_property *input;
+	struct hv_output_get_partition_property *output;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition_id;
+	input->property_code = property_code;
+	status = hv_do_hypercall(HVCALL_GET_PARTITION_PROPERTY, input, output);
+
+	if (!hv_result_success(status)) {
+		local_irq_restore(flags);
+		return hv_result_to_errno(status);
+	}
+	*property_value = output->property_value;
+
+	local_irq_restore(flags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(hv_call_get_partition_property);
+
+/*
+ * Handle any pre-processing before going into the guest mode on this cpu, most
+ * notably call schedule(). Must be invoked with both preemption and
+ * interrupts enabled.
+ *
+ * Returns: 0 on success, -errno on error.
+ */
+int mshv_do_pre_guest_mode_work(ulong th_flags)
+{
+	if (th_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
+		return -EINTR;
+
+	if (th_flags & _TIF_NEED_RESCHED)
+		schedule();
+
+	if (th_flags & _TIF_NOTIFY_RESUME)
+		resume_user_mode_work(NULL);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mshv_do_pre_guest_mode_work);
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
new file mode 100644
index 000000000000..8dd22be2ca0b
--- /dev/null
+++ b/drivers/hv/mshv_eventfd.c
@@ -0,0 +1,833 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * eventfd support for mshv
+ *
+ * Heavily inspired by the KVM implementation of irqfd/ioeventfd. The basic
+ * framework code is taken from the KVM implementation.
+ *
+ * All credits to kvm developers.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/eventfd.h>
+
+#if IS_ENABLED(CONFIG_X86_64)
+#include <asm/apic.h>
+#endif
+#include <asm/mshyperv.h>
+
+#include "mshv_eventfd.h"
+#include "mshv.h"
+#include "mshv_root.h"
+
+static struct workqueue_struct *irqfd_cleanup_wq;
+
+void mshv_register_irq_ack_notifier(struct mshv_partition *partition,
+				    struct mshv_irq_ack_notifier *mian)
+{
+	mutex_lock(&partition->pt_irq_lock);
+	hlist_add_head_rcu(&mian->link, &partition->irq_ack_notifier_list);
+	mutex_unlock(&partition->pt_irq_lock);
+}
+
+void mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
+				      struct mshv_irq_ack_notifier *mian)
+{
+	mutex_lock(&partition->pt_irq_lock);
+	hlist_del_init_rcu(&mian->link);
+	mutex_unlock(&partition->pt_irq_lock);
+	synchronize_rcu();
+}
+
+bool mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi)
+{
+	struct mshv_irq_ack_notifier *mian;
+	bool acked = false;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(mian, &partition->irq_ack_notifier_list,
+				 link) {
+		if (mian->irq_ack_gsi == gsi) {
+			mian->irq_acked(mian);
+			acked = true;
+		}
+	}
+	rcu_read_unlock();
+
+	return acked;
+}
+
+#if IS_ENABLED(CONFIG_ARM64)
+static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
+{
+	return false;
+}
+#elif IS_ENABLED(CONFIG_X86_64)
+static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
+{
+	return type == HV_X64_INTERRUPT_TYPE_EXTINT;
+}
+#endif
+
+static void mshv_irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
+{
+	struct mshv_irqfd_resampler *resampler;
+	struct mshv_partition *partition;
+	struct mshv_irqfd *irqfd;
+	int idx;
+
+	resampler = container_of(mian, struct mshv_irqfd_resampler,
+				 rsmplr_notifier);
+	partition = resampler->rsmplr_partn;
+
+	idx = srcu_read_lock(&partition->pt_irq_srcu);
+
+	hlist_for_each_entry_rcu(irqfd, &resampler->rsmplr_irqfd_list,
+				 irqfd_resampler_hnode) {
+		if (hv_should_clear_interrupt(irqfd->irqfd_lapic_irq.lapic_control.interrupt_type))
+			hv_call_clear_virtual_interrupt(partition->pt_id);
+
+		eventfd_signal(irqfd->irqfd_resamplefd);
+	}
+
+	srcu_read_unlock(&partition->pt_irq_srcu, idx);
+}
+
+#if IS_ENABLED(CONFIG_X86_64)
+static bool
+mshv_vp_irq_vector_injected(union hv_vp_register_page_interrupt_vectors iv,
+			    u32 vector)
+{
+	int i;
+
+	for (i = 0; i < iv.vector_count; i++) {
+		if (iv.vector[i] == vector)
+			return true;
+	}
+
+	return false;
+}
+
+static int mshv_vp_irq_try_set_vector(struct mshv_vp *vp, u32 vector)
+{
+	union hv_vp_register_page_interrupt_vectors iv, new_iv;
+
+	iv = vp->vp_register_page->interrupt_vectors;
+	new_iv = iv;
+
+	if (mshv_vp_irq_vector_injected(iv, vector))
+		return 0;
+
+	if (iv.vector_count >= HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT)
+		return -ENOSPC;
+
+	new_iv.vector[new_iv.vector_count++] = vector;
+
+	if (cmpxchg(&vp->vp_register_page->interrupt_vectors.as_uint64,
+		    iv.as_uint64, new_iv.as_uint64) != iv.as_uint64)
+		return -EAGAIN;
+
+	return 0;
+}
+
+static int mshv_vp_irq_set_vector(struct mshv_vp *vp, u32 vector)
+{
+	int ret;
+
+	do {
+		ret = mshv_vp_irq_try_set_vector(vp, vector);
+	} while (ret == -EAGAIN && !need_resched());
+
+	return ret;
+}
+
+/*
+ * Try to raise an irq for the guest via the shared vector array. The
+ * hypervisor performs the actual injection of the interrupt.
+ */
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+{
+	struct mshv_partition *partition = irqfd->irqfd_partn;
+	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
+	struct mshv_vp *vp;
+
+	if (!(ms_hyperv.ext_features &
+	      HV_VP_DISPATCH_INTERRUPT_INJECTION_AVAILABLE))
+		return -EOPNOTSUPP;
+
+	if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT)
+		return -EOPNOTSUPP;
+
+	if (irq->lapic_control.logical_dest_mode)
+		return -EOPNOTSUPP;
+
+	vp = partition->pt_vp_array[irq->lapic_apic_id];
+
+	if (!vp->vp_register_page)
+		return -EOPNOTSUPP;
+
+	if (mshv_vp_irq_set_vector(vp, irq->lapic_vector))
+		return -EINVAL;
+
+	if (vp->run.flags.root_sched_dispatched &&
+	    vp->vp_register_page->interrupt_vectors.as_uint64)
+		return -EBUSY;
+
+	wake_up(&vp->run.vp_suspend_queue);
+
+	return 0;
+}
+#else /* CONFIG_X86_64 */
+static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
+{
+	struct mshv_partition *partition = irqfd->irqfd_partn;
+	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
+	unsigned int seq;
+	int idx;
+
+	WARN_ON(irqfd->irqfd_resampler &&
+		!irq->lapic_control.level_triggered);
+
+	idx = srcu_read_lock(&partition->pt_irq_srcu);
+	if (irqfd->irqfd_girq_ent.guest_irq_num) {
+		if (!irqfd->irqfd_girq_ent.girq_entry_valid) {
+			srcu_read_unlock(&partition->pt_irq_srcu, idx);
+			return;
+		}
+
+		do {
+			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+	}
+
+	hv_call_assert_virtual_interrupt(irqfd->irqfd_partn->pt_id,
+					 irq->lapic_vector, irq->lapic_apic_id,
+					 irq->lapic_control);
+	srcu_read_unlock(&partition->pt_irq_srcu, idx);
+}
+
+static void mshv_irqfd_resampler_shutdown(struct mshv_irqfd *irqfd)
+{
+	struct mshv_irqfd_resampler *rp = irqfd->irqfd_resampler;
+	struct mshv_partition *pt = rp->rsmplr_partn;
+
+	mutex_lock(&pt->irqfds_resampler_lock);
+
+	hlist_del_rcu(&irqfd->irqfd_resampler_hnode);
+	synchronize_srcu(&pt->pt_irq_srcu);
+
+	if (hlist_empty(&rp->rsmplr_irqfd_list)) {
+		hlist_del(&rp->rsmplr_hnode);
+		mshv_unregister_irq_ack_notifier(pt, &rp->rsmplr_notifier);
+		kfree(rp);
+	}
+
+	mutex_unlock(&pt->irqfds_resampler_lock);
+}
+
+/*
+ * Race-free decouple logic (ordering is critical)
+ */
+static void mshv_irqfd_shutdown(struct work_struct *work)
+{
+	struct mshv_irqfd *irqfd =
+			container_of(work, struct mshv_irqfd, irqfd_shutdown);
+
+	/*
+	 * Synchronize with the wait-queue and unhook ourselves to prevent
+	 * further events.
+	 */
+	remove_wait_queue(irqfd->irqfd_wqh, &irqfd->irqfd_wait);
+
+	if (irqfd->irqfd_resampler) {
+		mshv_irqfd_resampler_shutdown(irqfd);
+		eventfd_ctx_put(irqfd->irqfd_resamplefd);
+	}
+
+	/*
+	 * It is now safe to release the object's resources
+	 */
+	eventfd_ctx_put(irqfd->irqfd_eventfd_ctx);
+	kfree(irqfd);
+}
+
+/* assumes partition->pt_irqfds_lock is held */
+static bool mshv_irqfd_is_active(struct mshv_irqfd *irqfd)
+{
+	return !hlist_unhashed(&irqfd->irqfd_hnode);
+}
+
+/*
+ * Mark the irqfd as inactive and schedule it for removal
+ *
+ * assumes partition->pt_irqfds_lock is held
+ */
+static void mshv_irqfd_deactivate(struct mshv_irqfd *irqfd)
+{
+	if (!mshv_irqfd_is_active(irqfd))
+		return;
+
+	hlist_del(&irqfd->irqfd_hnode);
+
+	queue_work(irqfd_cleanup_wq, &irqfd->irqfd_shutdown);
+}
+
+/*
+ * Called with wqh->lock held and interrupts disabled
+ */
+static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+			     int sync, void *key)
+{
+	struct mshv_irqfd *irqfd = container_of(wait, struct mshv_irqfd,
+						irqfd_wait);
+	unsigned long flags = (unsigned long)key;
+	int idx;
+	unsigned int seq;
+	struct mshv_partition *pt = irqfd->irqfd_partn;
+	int ret = 0;
+
+	if (flags & POLLIN) {
+		u64 cnt;
+
+		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
+		idx = srcu_read_lock(&pt->pt_irq_srcu);
+		do {
+			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
+		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
+
+		/* An event has been signaled, raise an interrupt */
+		ret = mshv_try_assert_irq_fast(irqfd);
+		if (ret)
+			mshv_assert_irq_slow(irqfd);
+
+		srcu_read_unlock(&pt->pt_irq_srcu, idx);
+
+		ret = 1;
+	}
+
+	if (flags & POLLHUP) {
+		/* The eventfd is closing, detach from the partition */
+		unsigned long flags;
+
+		spin_lock_irqsave(&pt->pt_irqfds_lock, flags);
+
+		/*
+		 * We must check if someone deactivated the irqfd before
+		 * we could acquire the pt_irqfds_lock since the item is
+		 * deactivated from the mshv side before it is unhooked from
+		 * the wait-queue.  If it is already deactivated, we can
+		 * simply return knowing the other side will clean up for us.
+		 * We cannot race against the irqfd going away since the
+		 * other side is required to acquire wqh->lock, which we hold.
+		 */
+		if (mshv_irqfd_is_active(irqfd))
+			mshv_irqfd_deactivate(irqfd);
+
+		spin_unlock_irqrestore(&pt->pt_irqfds_lock, flags);
+	}
+
+	return ret;
+}
+
+/* Must be called under pt_irqfds_lock */
+static void mshv_irqfd_update(struct mshv_partition *pt,
+			      struct mshv_irqfd *irqfd)
+{
+	write_seqcount_begin(&irqfd->irqfd_irqe_sc);
+	irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt,
+						    irqfd->irqfd_irqnum);
+	mshv_copy_girq_info(&irqfd->irqfd_girq_ent, &irqfd->irqfd_lapic_irq);
+	write_seqcount_end(&irqfd->irqfd_irqe_sc);
+}
+
+void mshv_irqfd_routing_update(struct mshv_partition *pt)
+{
+	struct mshv_irqfd *irqfd;
+
+	spin_lock_irq(&pt->pt_irqfds_lock);
+	hlist_for_each_entry(irqfd, &pt->pt_irqfds_list, irqfd_hnode)
+		mshv_irqfd_update(pt, irqfd);
+	spin_unlock_irq(&pt->pt_irqfds_lock);
+}
+
+static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
+				  poll_table *polltbl)
+{
+	struct mshv_irqfd *irqfd =
+			container_of(polltbl, struct mshv_irqfd, irqfd_polltbl);
+
+	irqfd->irqfd_wqh = wqh;
+	add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
+}
+
+static int mshv_irqfd_assign(struct mshv_partition *pt,
+			     struct mshv_user_irqfd *args)
+{
+	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
+	struct mshv_irqfd *irqfd, *tmp;
+	unsigned int events;
+	struct fd f;
+	int ret;
+	int idx;
+
+	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
+	if (!irqfd)
+		return -ENOMEM;
+
+	irqfd->irqfd_partn = pt;
+	irqfd->irqfd_irqnum = args->gsi;
+	INIT_WORK(&irqfd->irqfd_shutdown, mshv_irqfd_shutdown);
+	seqcount_spinlock_init(&irqfd->irqfd_irqe_sc, &pt->pt_irqfds_lock);
+
+	f = fdget(args->fd);
+	if (!fd_file(f)) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	eventfd = eventfd_ctx_fileget(fd_file(f));
+	if (IS_ERR(eventfd)) {
+		ret = PTR_ERR(eventfd);
+		goto fail;
+	}
+
+	irqfd->irqfd_eventfd_ctx = eventfd;
+
+	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE)) {
+		struct mshv_irqfd_resampler *rp;
+
+		resamplefd = eventfd_ctx_fdget(args->resamplefd);
+		if (IS_ERR(resamplefd)) {
+			ret = PTR_ERR(resamplefd);
+			goto fail;
+		}
+
+		irqfd->irqfd_resamplefd = resamplefd;
+
+		mutex_lock(&pt->irqfds_resampler_lock);
+
+		hlist_for_each_entry(rp, &pt->irqfds_resampler_list,
+				     rsmplr_hnode) {
+			if (rp->rsmplr_notifier.irq_ack_gsi ==
+							 irqfd->irqfd_irqnum) {
+				irqfd->irqfd_resampler = rp;
+				break;
+			}
+		}
+
+		if (!irqfd->irqfd_resampler) {
+			rp = kzalloc(sizeof(*rp), GFP_KERNEL_ACCOUNT);
+			if (!rp) {
+				ret = -ENOMEM;
+				mutex_unlock(&pt->irqfds_resampler_lock);
+				goto fail;
+			}
+
+			rp->rsmplr_partn = pt;
+			INIT_HLIST_HEAD(&rp->rsmplr_irqfd_list);
+			rp->rsmplr_notifier.irq_ack_gsi = irqfd->irqfd_irqnum;
+			rp->rsmplr_notifier.irq_acked =
+						      mshv_irqfd_resampler_ack;
+
+			hlist_add_head(&rp->rsmplr_hnode,
+				       &pt->irqfds_resampler_list);
+			mshv_register_irq_ack_notifier(pt,
+						       &rp->rsmplr_notifier);
+			irqfd->irqfd_resampler = rp;
+		}
+
+		hlist_add_head_rcu(&irqfd->irqfd_resampler_hnode,
+				   &irqfd->irqfd_resampler->rsmplr_irqfd_list);
+
+		mutex_unlock(&pt->irqfds_resampler_lock);
+	}
+
+	/*
+	 * Install our own custom wake-up handling so we are notified via
+	 * a callback whenever someone signals the underlying eventfd
+	 */
+	init_waitqueue_func_entry(&irqfd->irqfd_wait, mshv_irqfd_wakeup);
+	init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc);
+
+	spin_lock_irq(&pt->pt_irqfds_lock);
+	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
+	    !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
+		/*
+		 * Resample Fd must be for level triggered interrupt
+		 * Otherwise return with failure
+		 */
+		spin_unlock_irq(&pt->pt_irqfds_lock);
+		ret = -EINVAL;
+		goto fail;
+	}
+	ret = 0;
+	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
+		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
+			continue;
+		/* This fd is used for another irq already. */
+		ret = -EBUSY;
+		spin_unlock_irq(&pt->pt_irqfds_lock);
+		goto fail;
+	}
+
+	idx = srcu_read_lock(&pt->pt_irq_srcu);
+	mshv_irqfd_update(pt, irqfd);
+	hlist_add_head(&irqfd->irqfd_hnode, &pt->pt_irqfds_list);
+	spin_unlock_irq(&pt->pt_irqfds_lock);
+
+	/*
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered, and trigger it as if we didn't miss it.
+	 */
+	events = vfs_poll(fd_file(f), &irqfd->irqfd_polltbl);
+
+	if (events & POLLIN)
+		mshv_assert_irq_slow(irqfd);
+
+	srcu_read_unlock(&pt->pt_irq_srcu, idx);
+	/*
+	 * do not drop the file until the irqfd is fully initialized, otherwise
+	 * we might race against the POLLHUP
+	 */
+	fdput(f);
+
+	return 0;
+
+fail:
+	if (irqfd->irqfd_resampler)
+		mshv_irqfd_resampler_shutdown(irqfd);
+
+	if (resamplefd && !IS_ERR(resamplefd))
+		eventfd_ctx_put(resamplefd);
+
+	if (eventfd && !IS_ERR(eventfd))
+		eventfd_ctx_put(eventfd);
+
+	fdput(f);
+
+out:
+	kfree(irqfd);
+	return ret;
+}
+
+/*
+ * Shut down any irqfds that match fd+gsi
+ */
+static int mshv_irqfd_deassign(struct mshv_partition *pt,
+			       struct mshv_user_irqfd *args)
+{
+	struct mshv_irqfd *irqfd;
+	struct hlist_node *n;
+	struct eventfd_ctx *eventfd;
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd))
+		return PTR_ERR(eventfd);
+
+	spin_lock_irq(&pt->pt_irqfds_lock);
+	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list, irqfd_hnode) {
+		if (irqfd->irqfd_eventfd_ctx == eventfd &&
+		    irqfd->irqfd_irqnum == args->gsi)
+			mshv_irqfd_deactivate(irqfd);
+	}
+	spin_unlock_irq(&pt->pt_irqfds_lock);
+
+	eventfd_ctx_put(eventfd);
+
+	/*
+	 * Block until we know all outstanding shutdown jobs have completed
+	 * so that we guarantee there will not be any more interrupts on this
+	 * gsi once this deassign function returns.
+	 */
+	flush_workqueue(irqfd_cleanup_wq);
+
+	return 0;
+}
+
+int mshv_set_unset_irqfd(struct mshv_partition *pt,
+			 struct mshv_user_irqfd *args)
+{
+	if (args->flags & ~MSHV_IRQFD_FLAGS_MASK)
+		return -EINVAL;
+
+	if (args->flags & BIT(MSHV_IRQFD_BIT_DEASSIGN))
+		return mshv_irqfd_deassign(pt, args);
+
+	return mshv_irqfd_assign(pt, args);
+}
+
+/*
+ * This function is called as the mshv VM fd is being released.
+ * Shut down all irqfds that still remain open.
+ */
+static void mshv_irqfd_release(struct mshv_partition *pt)
+{
+	struct mshv_irqfd *irqfd;
+	struct hlist_node *n;
+
+	spin_lock_irq(&pt->pt_irqfds_lock);
+
+	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list, irqfd_hnode)
+		mshv_irqfd_deactivate(irqfd);
+
+	spin_unlock_irq(&pt->pt_irqfds_lock);
+
+	/*
+	 * Block until we know all outstanding shutdown jobs have completed
+	 * since we do not take a mshv_partition* reference.
+	 */
+	flush_workqueue(irqfd_cleanup_wq);
+}
+
+int mshv_irqfd_wq_init(void)
+{
+	irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0);
+	if (!irqfd_cleanup_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void mshv_irqfd_wq_cleanup(void)
+{
+	destroy_workqueue(irqfd_cleanup_wq);
+}
+
+/*
+ * --------------------------------------------------------------------
+ * ioeventfd: translate an MMIO memory write to an eventfd signal.
+ *
+ * userspace can register an MMIO address with an eventfd for receiving
+ * notification when the memory has been touched.
+ * --------------------------------------------------------------------
+ */
+
+static void ioeventfd_release(struct mshv_ioeventfd *p, u64 partition_id)
+{
+	if (p->iovntfd_doorbell_id > 0)
+		mshv_unregister_doorbell(partition_id, p->iovntfd_doorbell_id);
+	eventfd_ctx_put(p->iovntfd_eventfd);
+	kfree(p);
+}
+
+/* MMIO writes trigger an event if the addr/val match */
+static void ioeventfd_mmio_write(int doorbell_id, void *data)
+{
+	struct mshv_partition *partition = (struct mshv_partition *)data;
+	struct mshv_ioeventfd *p;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(p, &partition->ioeventfds_list, iovntfd_hnode)
+		if (p->iovntfd_doorbell_id == doorbell_id) {
+			eventfd_signal(p->iovntfd_eventfd);
+			break;
+		}
+
+	rcu_read_unlock();
+}
+
+static bool ioeventfd_check_collision(struct mshv_partition *pt,
+				      struct mshv_ioeventfd *p)
+	__must_hold(&pt->pt_mutex)
+{
+	struct mshv_ioeventfd *_p;
+
+	hlist_for_each_entry(_p, &pt->ioeventfds_list, iovntfd_hnode)
+		if (_p->iovntfd_addr == p->iovntfd_addr &&
+		    _p->iovntfd_length == p->iovntfd_length &&
+		    (_p->iovntfd_wildcard || p->iovntfd_wildcard ||
+		     _p->iovntfd_datamatch == p->iovntfd_datamatch))
+			return true;
+
+	return false;
+}
+
+static int mshv_assign_ioeventfd(struct mshv_partition *pt,
+				 struct mshv_user_ioeventfd *args)
+	__must_hold(&pt->pt_mutex)
+{
+	struct mshv_ioeventfd *p;
+	struct eventfd_ctx *eventfd;
+	u64 doorbell_flags = 0;
+	int ret;
+
+	/* This mutex currently protects the ioeventfds_list */
+	WARN_ON_ONCE(!mutex_is_locked(&pt->pt_mutex));
+
+	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_PIO))
+		return -EOPNOTSUPP;
+
+	/* must be natural-word sized, or 0 to ignore length */
+	switch (args->len) {
+	case 0:
+		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
+		break;
+	case 1:
+		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
+		break;
+	case 2:
+		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
+		break;
+	case 4:
+		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
+		break;
+	case 8:
+		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* check for range overflow */
+	if (args->addr + args->len < args->addr)
+		return -EINVAL;
+
+	/* check for extra flags that we don't understand */
+	if (args->flags & ~MSHV_IOEVENTFD_FLAGS_MASK)
+		return -EINVAL;
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd))
+		return PTR_ERR(eventfd);
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	if (!p) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	p->iovntfd_addr = args->addr;
+	p->iovntfd_length  = args->len;
+	p->iovntfd_eventfd = eventfd;
+
+	/* The datamatch feature is optional, otherwise this is a wildcard */
+	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_DATAMATCH)) {
+		p->iovntfd_datamatch = args->datamatch;
+	} else {
+		p->iovntfd_wildcard = true;
+		doorbell_flags |= HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
+	}
+
+	if (ioeventfd_check_collision(pt, p)) {
+		ret = -EEXIST;
+		goto unlock_fail;
+	}
+
+	ret = mshv_register_doorbell(pt->pt_id, ioeventfd_mmio_write,
+				     (void *)pt, p->iovntfd_addr,
+				     p->iovntfd_datamatch, doorbell_flags);
+	if (ret < 0)
+		goto unlock_fail;
+
+	p->iovntfd_doorbell_id = ret;
+
+	hlist_add_head_rcu(&p->iovntfd_hnode, &pt->ioeventfds_list);
+
+	return 0;
+
+unlock_fail:
+	kfree(p);
+
+fail:
+	eventfd_ctx_put(eventfd);
+
+	return ret;
+}
+
+static int mshv_deassign_ioeventfd(struct mshv_partition *pt,
+				   struct mshv_user_ioeventfd *args)
+	__must_hold(&pt->pt_mutex)
+{
+	struct mshv_ioeventfd *p;
+	struct eventfd_ctx *eventfd;
+	struct hlist_node *n;
+	int ret = -ENOENT;
+
+	/* This mutex currently protects the ioeventfds_list */
+	WARN_ON_ONCE(!mutex_is_locked(&pt->pt_mutex));
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd))
+		return PTR_ERR(eventfd);
+
+	hlist_for_each_entry_safe(p, n, &pt->ioeventfds_list, iovntfd_hnode) {
+		bool wildcard = !(args->flags & BIT(MSHV_IOEVENTFD_BIT_DATAMATCH));
+
+		if (p->iovntfd_eventfd != eventfd  ||
+		    p->iovntfd_addr != args->addr  ||
+		    p->iovntfd_length != args->len ||
+		    p->iovntfd_wildcard != wildcard)
+			continue;
+
+		if (!p->iovntfd_wildcard &&
+		    p->iovntfd_datamatch != args->datamatch)
+			continue;
+
+		hlist_del_rcu(&p->iovntfd_hnode);
+		synchronize_rcu();
+		ioeventfd_release(p, pt->pt_id);
+		ret = 0;
+		break;
+	}
+
+	eventfd_ctx_put(eventfd);
+
+	return ret;
+}
+
+int mshv_set_unset_ioeventfd(struct mshv_partition *pt,
+			     struct mshv_user_ioeventfd *args)
+	__must_hold(&pt->pt_mutex)
+{
+	if ((args->flags & ~MSHV_IOEVENTFD_FLAGS_MASK) ||
+	    mshv_field_nonzero(*args, rsvd))
+		return -EINVAL;
+
+	/* PIO not yet implemented */
+	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_PIO))
+		return -EOPNOTSUPP;
+
+	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_DEASSIGN))
+		return mshv_deassign_ioeventfd(pt, args);
+
+	return mshv_assign_ioeventfd(pt, args);
+}
+
+void mshv_eventfd_init(struct mshv_partition *pt)
+{
+	spin_lock_init(&pt->pt_irqfds_lock);
+	INIT_HLIST_HEAD(&pt->pt_irqfds_list);
+
+	INIT_HLIST_HEAD(&pt->irqfds_resampler_list);
+	mutex_init(&pt->irqfds_resampler_lock);
+
+	INIT_HLIST_HEAD(&pt->ioeventfds_list);
+}
+
+void mshv_eventfd_release(struct mshv_partition *pt)
+{
+	struct hlist_head items;
+	struct hlist_node *n;
+	struct mshv_ioeventfd *p;
+
+	hlist_move_list(&pt->ioeventfds_list, &items);
+	synchronize_rcu();
+
+	hlist_for_each_entry_safe(p, n, &items, iovntfd_hnode) {
+		hlist_del(&p->iovntfd_hnode);
+		ioeventfd_release(p, pt->pt_id);
+	}
+
+	mshv_irqfd_release(pt);
+}
diff --git a/drivers/hv/mshv_eventfd.h b/drivers/hv/mshv_eventfd.h
new file mode 100644
index 000000000000..332e7670a344
--- /dev/null
+++ b/drivers/hv/mshv_eventfd.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * irqfd: Allows an fd to be used to inject an interrupt into the guest.
+ * ioeventfd: Allows an fd to be used to receive a signal from the guest.
+ * All credit goes to kvm developers.
+ */
+
+#ifndef __LINUX_MSHV_EVENTFD_H
+#define __LINUX_MSHV_EVENTFD_H
+
+#include <linux/poll.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+/* struct to contain list of irqfds sharing an irq. Updates are protected by
+ * partition->irqfds_resampler_lock
+ */
+struct mshv_irqfd_resampler {
+	struct mshv_partition	    *rsmplr_partn;
+	struct hlist_head	     rsmplr_irqfd_list;
+	struct mshv_irq_ack_notifier rsmplr_notifier;
+	struct hlist_node	     rsmplr_hnode;
+};
+
+struct mshv_irqfd {
+	struct mshv_partition		    *irqfd_partn;
+	struct eventfd_ctx		    *irqfd_eventfd_ctx;
+	struct mshv_guest_irq_ent	     irqfd_girq_ent;
+	seqcount_spinlock_t		     irqfd_irqe_sc;
+	u32				     irqfd_irqnum;
+	struct mshv_lapic_irq		     irqfd_lapic_irq;
+	struct hlist_node		     irqfd_hnode;
+	poll_table			     irqfd_polltbl;
+	wait_queue_head_t		    *irqfd_wqh;
+	wait_queue_entry_t		     irqfd_wait;
+	struct work_struct		     irqfd_shutdown;
+	struct mshv_irqfd_resampler	    *irqfd_resampler;
+	struct eventfd_ctx		    *irqfd_resamplefd;
+	struct hlist_node		     irqfd_resampler_hnode;
+};
+
+void mshv_eventfd_init(struct mshv_partition *partition);
+void mshv_eventfd_release(struct mshv_partition *partition);
+
+void mshv_register_irq_ack_notifier(struct mshv_partition *partition,
+				    struct mshv_irq_ack_notifier *mian);
+void mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
+				      struct mshv_irq_ack_notifier *mian);
+bool mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi);
+
+int mshv_set_unset_irqfd(struct mshv_partition *partition,
+			 struct mshv_user_irqfd *args);
+
+int mshv_irqfd_wq_init(void);
+void mshv_irqfd_wq_cleanup(void);
+
+struct mshv_ioeventfd {
+	struct hlist_node    iovntfd_hnode;
+	u64		     iovntfd_addr;
+	int		     iovntfd_length;
+	struct eventfd_ctx  *iovntfd_eventfd;
+	u64		     iovntfd_datamatch;
+	int		     iovntfd_doorbell_id;
+	bool		     iovntfd_wildcard;
+};
+
+int mshv_set_unset_ioeventfd(struct mshv_partition *pt,
+			     struct mshv_user_ioeventfd *args);
+
+#endif /* __LINUX_MSHV_EVENTFD_H */
diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
new file mode 100644
index 000000000000..f956e125afb4
--- /dev/null
+++ b/drivers/hv/mshv_irq.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, Microsoft Corporation.
+ *
+ * Authors:
+ *   Vineeth Remanan Pillai <viremana@linux.microsoft.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <asm/mshyperv.h>
+
+#include "mshv_eventfd.h"
+#include "mshv.h"
+#include "mshv_root.h"
+
+MODULE_AUTHOR("Microsoft");
+MODULE_LICENSE("GPL");
+
+/* called from the ioctl code, user wants to update the guest irq table */
+int mshv_update_routing_table(struct mshv_partition *partition,
+			      const struct mshv_user_irq_entry *ue,
+			      unsigned int numents)
+{
+	struct mshv_girq_routing_table *new = NULL, *old;
+	u32 i, nr_rt_entries = 0;
+	int r = 0;
+
+	if (numents == 0)
+		goto swap_routes;
+
+	for (i = 0; i < numents; i++) {
+		if (ue[i].gsi >= MSHV_MAX_GUEST_IRQS)
+			return -EINVAL;
+
+		if (ue[i].address_hi)
+			return -EINVAL;
+
+		nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
+	}
+	nr_rt_entries += 1;
+
+	new = kzalloc(struct_size(new, mshv_girq_info_tbl, nr_rt_entries),
+		      GFP_KERNEL_ACCOUNT);
+	if (!new)
+		return -ENOMEM;
+
+	new->num_rt_entries = nr_rt_entries;
+	for (i = 0; i < numents; i++) {
+		struct mshv_guest_irq_ent *girq;
+
+		girq = &new->mshv_girq_info_tbl[ue[i].gsi];
+
+		/*
+		 * Allow only one to one mapping between GSI and MSI routing.
+		 */
+		if (girq->guest_irq_num != 0) {
+			r = -EINVAL;
+			goto out;
+		}
+
+		girq->guest_irq_num = ue[i].gsi;
+		girq->girq_addr_lo = ue[i].address_lo;
+		girq->girq_addr_hi = ue[i].address_hi;
+		girq->girq_irq_data = ue[i].data;
+		girq->girq_entry_valid = true;
+	}
+
+swap_routes:
+	mutex_lock(&partition->pt_irq_lock);
+	old = rcu_dereference_protected(partition->pt_girq_tbl, 1);
+	rcu_assign_pointer(partition->pt_girq_tbl, new);
+	mshv_irqfd_routing_update(partition);
+	mutex_unlock(&partition->pt_irq_lock);
+
+	synchronize_srcu_expedited(&partition->pt_irq_srcu);
+	new = old;
+
+out:
+	kfree(new);
+
+	return r;
+}
+
+/* vm is going away, kfree the irq routing table */
+void mshv_free_routing_table(struct mshv_partition *partition)
+{
+	struct mshv_girq_routing_table *rt =
+				   rcu_access_pointer(partition->pt_girq_tbl);
+
+	kfree(rt);
+}
+
+struct mshv_guest_irq_ent
+mshv_ret_girq_entry(struct mshv_partition *partition, u32 irqnum)
+{
+	struct mshv_guest_irq_ent entry = { 0 };
+	struct mshv_girq_routing_table *girq_tbl;
+
+	girq_tbl = srcu_dereference_check(partition->pt_girq_tbl,
+					  &partition->pt_irq_srcu,
+					  lockdep_is_held(&partition->pt_irq_lock));
+	if (!girq_tbl || irqnum >= girq_tbl->num_rt_entries) {
+		/*
+		 * Premature register_irqfd; girq_entry_valid stays 0, so
+		 * this entry is ignored anyway.
+		 */
+		entry.guest_irq_num = irqnum;
+		return entry;
+	}
+
+	return girq_tbl->mshv_girq_info_tbl[irqnum];
+}
+
+void mshv_copy_girq_info(struct mshv_guest_irq_ent *ent,
+			 struct mshv_lapic_irq *lirq)
+{
+	memset(lirq, 0, sizeof(*lirq));
+	if (!ent || !ent->girq_entry_valid)
+		return;
+
+	lirq->lapic_vector = ent->girq_irq_data & 0xFF;
+	lirq->lapic_apic_id = (ent->girq_addr_lo >> 12) & 0xFF;
+	lirq->lapic_control.interrupt_type = (ent->girq_irq_data & 0x700) >> 8;
+	lirq->lapic_control.level_triggered = (ent->girq_irq_data >> 15) & 0x1;
+	lirq->lapic_control.logical_dest_mode = (ent->girq_addr_lo >> 2) & 0x1;
+}
diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
new file mode 100644
index 000000000000..a40abde6fd15
--- /dev/null
+++ b/drivers/hv/mshv_portid_table.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/version.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <asm/mshyperv.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+/*
+ * Ports and connections are hypervisor structures used for inter-partition
+ * communication. A port represents the source and a connection represents
+ * the destination. Partitions are responsible for managing the port and
+ * connection ids.
+ *
+ */
+
+#define PORTID_MIN	1
+#define PORTID_MAX	INT_MAX
+
+static DEFINE_IDR(port_table_idr);
+
+void
+mshv_port_table_fini(void)
+{
+	struct port_table_info *port_info;
+	unsigned long i, tmp;
+
+	idr_lock(&port_table_idr);
+	if (!idr_is_empty(&port_table_idr)) {
+		idr_for_each_entry_ul(&port_table_idr, port_info, tmp, i) {
+			port_info = idr_remove(&port_table_idr, i);
+			kfree_rcu(port_info, portbl_rcu);
+		}
+	}
+	idr_unlock(&port_table_idr);
+}
+
+int
+mshv_portid_alloc(struct port_table_info *info)
+{
+	int ret = 0;
+
+	idr_lock(&port_table_idr);
+	ret = idr_alloc(&port_table_idr, info, PORTID_MIN,
+			PORTID_MAX, GFP_ATOMIC);
+	idr_unlock(&port_table_idr);
+
+	return ret;
+}
+
+void
+mshv_portid_free(int port_id)
+{
+	struct port_table_info *info;
+
+	idr_lock(&port_table_idr);
+	info = idr_remove(&port_table_idr, port_id);
+	WARN_ON(!info);
+	idr_unlock(&port_table_idr);
+
+	synchronize_rcu();
+	kfree(info);
+}
+
+int
+mshv_portid_lookup(int port_id, struct port_table_info *info)
+{
+	struct port_table_info *_info;
+	int ret = -ENOENT;
+
+	rcu_read_lock();
+	_info = idr_find(&port_table_idr, port_id);
+	/* Copy inside the read-side section so a concurrent free cannot race */
+	if (_info) {
+		*info = *_info;
+		ret = 0;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
new file mode 100644
index 000000000000..f8d85db14db1
--- /dev/null
+++ b/drivers/hv/mshv_root.h
@@ -0,0 +1,321 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2023, Microsoft Corporation.
+ */
+
+#ifndef _MSHV_ROOT_H_
+#define _MSHV_ROOT_H_
+
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/semaphore.h>
+#include <linux/sched.h>
+#include <linux/srcu.h>
+#include <linux/wait.h>
+#include <linux/hashtable.h>
+#include <linux/dev_printk.h>
+#include <uapi/linux/mshv.h>
+
+/*
+ * Hypervisor must be between these version numbers (inclusive)
+ * to guarantee compatibility
+ */
+#define MSHV_HV_MIN_VERSION		(27744)
+#define MSHV_HV_MAX_VERSION		(27751)
+
+#define MSHV_MAX_VPS			256
+
+#define MSHV_PARTITIONS_HASH_BITS	9
+
+#define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
+
+struct mshv_vp {
+	u32 vp_index;
+	struct mshv_partition *vp_partition;
+	struct mutex vp_mutex;
+	struct hv_vp_register_page *vp_register_page;
+	struct hv_message *vp_intercept_msg_page;
+	void *vp_ghcb_page;
+	struct hv_stats_page *vp_stats_pages[2];
+	struct {
+		atomic64_t vp_signaled_count;
+		struct {
+			u64 intercept_suspend: 1;
+			u64 root_sched_blocked: 1; /* root scheduler only */
+			u64 root_sched_dispatched: 1; /* root scheduler only */
+			u64 reserved: 61;
+		} flags;
+		unsigned int kicked_by_hv;
+		wait_queue_head_t vp_suspend_queue;
+	} run;
+};
+
+#define vp_fmt(fmt) "p%lluvp%u: " fmt
+#define vp_dev(v) ((v)->vp_partition->pt_module_dev)
+#define vp_emerg(v, fmt, ...) \
+	dev_emerg(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		  (v)->vp_index, ##__VA_ARGS__)
+#define vp_crit(v, fmt, ...) \
+	dev_crit(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		 (v)->vp_index, ##__VA_ARGS__)
+#define vp_alert(v, fmt, ...) \
+	dev_alert(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		  (v)->vp_index, ##__VA_ARGS__)
+#define vp_err(v, fmt, ...) \
+	dev_err(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		(v)->vp_index, ##__VA_ARGS__)
+#define vp_warn(v, fmt, ...) \
+	dev_warn(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		 (v)->vp_index, ##__VA_ARGS__)
+#define vp_notice(v, fmt, ...) \
+	dev_notice(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		   (v)->vp_index, ##__VA_ARGS__)
+#define vp_info(v, fmt, ...) \
+	dev_info(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		 (v)->vp_index, ##__VA_ARGS__)
+#define vp_dbg(v, fmt, ...) \
+	dev_dbg(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
+		(v)->vp_index, ##__VA_ARGS__)
+
+struct mshv_mem_region {
+	struct hlist_node hnode;
+	u64 nr_pages;
+	u64 start_gfn;
+	u64 start_uaddr;
+	u32 hv_map_flags;
+	struct {
+		u64 large_pages:  1; /* 2MiB */
+		u64 range_pinned: 1;
+		u64 reserved:	 62;
+	} flags;
+	struct mshv_partition *partition;
+	struct page *pages[];
+};
+
+struct mshv_irq_ack_notifier {
+	struct hlist_node link;
+	unsigned int irq_ack_gsi;
+	void (*irq_acked)(struct mshv_irq_ack_notifier *mian);
+};
+
+struct mshv_partition {
+	struct device *pt_module_dev;
+
+	struct hlist_node pt_hnode;
+	u64 pt_id;
+	refcount_t pt_ref_count;
+	struct mutex pt_mutex;
+	struct hlist_head pt_mem_regions; /* not ordered */
+
+	u32 pt_vp_count;
+	struct mshv_vp *pt_vp_array[MSHV_MAX_VPS];
+
+	struct mutex pt_irq_lock;
+	struct srcu_struct pt_irq_srcu;
+	struct hlist_head irq_ack_notifier_list;
+
+	struct hlist_head pt_devices;
+
+	/*
+	 * MSHV does not support more than one async hypercall in flight
+	 * for a single partition, so it is okay to define per-partition
+	 * async hypercall status.
+	 */
+	struct completion async_hypercall;
+	u64 async_hypercall_status;
+
+	spinlock_t	  pt_irqfds_lock;
+	struct hlist_head pt_irqfds_list;
+	struct mutex	  irqfds_resampler_lock;
+	struct hlist_head irqfds_resampler_list;
+
+	struct hlist_head ioeventfds_list;
+
+	struct mshv_girq_routing_table __rcu *pt_girq_tbl;
+	u64 isolation_type;
+	bool import_completed;
+	bool pt_initialized;
+};
+
+#define pt_fmt(fmt) "p%llu: " fmt
+#define pt_dev(p) ((p)->pt_module_dev)
+#define pt_emerg(p, fmt, ...) \
+	dev_emerg(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_crit(p, fmt, ...) \
+	dev_crit(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_alert(p, fmt, ...) \
+	dev_alert(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_err(p, fmt, ...) \
+	dev_err(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_warn(p, fmt, ...) \
+	dev_warn(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_notice(p, fmt, ...) \
+	dev_notice(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_info(p, fmt, ...) \
+	dev_info(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+#define pt_dbg(p, fmt, ...) \
+	dev_dbg(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
+
+struct mshv_lapic_irq {
+	u32 lapic_vector;
+	u64 lapic_apic_id;
+	union hv_interrupt_control lapic_control;
+};
+
+#define MSHV_MAX_GUEST_IRQS		4096
+
+/* representation of one guest irq entry, either msi or legacy */
+struct mshv_guest_irq_ent {
+	u32 girq_entry_valid;	/* vfio looks at this */
+	u32 guest_irq_num;	/* a unique number for each irq */
+	u32 girq_addr_lo;	/* guest irq msi address info */
+	u32 girq_addr_hi;
+	u32 girq_irq_data;	/* idt vector in some cases */
+};
+
+struct mshv_girq_routing_table {
+	u32 num_rt_entries;
+	struct mshv_guest_irq_ent mshv_girq_info_tbl[];
+};
+
+struct hv_synic_pages {
+	struct hv_message_page *synic_message_page;
+	struct hv_synic_event_flags_page *synic_event_flags_page;
+	struct hv_synic_event_ring_page *synic_event_ring_page;
+};
+
+struct mshv_root {
+	struct hv_synic_pages __percpu *synic_pages;
+	spinlock_t pt_ht_lock;
+	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
+};
+
+/*
+ * Callback for doorbell events.
+ * NOTE: This is called in interrupt context. Callback
+ * should defer slow and sleeping logic to later.
+ */
+typedef void (*doorbell_cb_t) (int doorbell_id, void *);
+
+/*
+ * port table information
+ */
+struct port_table_info {
+	struct rcu_head portbl_rcu;
+	enum hv_port_type hv_port_type;
+	union {
+		struct {
+			u64 reserved[2];
+		} hv_port_message;
+		struct {
+			u64 reserved[2];
+		} hv_port_event;
+		struct {
+			u64 reserved[2];
+		} hv_port_monitor;
+		struct {
+			doorbell_cb_t doorbell_cb;
+			void *data;
+		} hv_port_doorbell;
+	};
+};
+
+int mshv_update_routing_table(struct mshv_partition *partition,
+			      const struct mshv_user_irq_entry *entries,
+			      unsigned int numents);
+void mshv_free_routing_table(struct mshv_partition *partition);
+
+struct mshv_guest_irq_ent mshv_ret_girq_entry(struct mshv_partition *partition,
+					      u32 irq_num);
+
+void mshv_copy_girq_info(struct mshv_guest_irq_ent *src_irq,
+			 struct mshv_lapic_irq *dest_irq);
+
+void mshv_irqfd_routing_update(struct mshv_partition *partition);
+
+void mshv_port_table_fini(void);
+int mshv_portid_alloc(struct port_table_info *info);
+int mshv_portid_lookup(int port_id, struct port_table_info *info);
+void mshv_portid_free(int port_id);
+
+int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
+			   void *data, u64 gpa, u64 val, u64 flags);
+void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
+
+void mshv_isr(void);
+int mshv_synic_init(unsigned int cpu);
+int mshv_synic_cleanup(unsigned int cpu);
+
+static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
+{
+	return partition->isolation_type == HV_PARTITION_ISOLATION_TYPE_SNP;
+}
+
+struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
+void mshv_partition_put(struct mshv_partition *partition);
+struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
+
+/* hypercalls */
+
+int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
+int hv_call_create_partition(u64 flags,
+			     struct hv_partition_creation_properties creation_properties,
+			     union hv_partition_isolation_properties isolation_properties,
+			     u64 *partition_id);
+int hv_call_initialize_partition(u64 partition_id);
+int hv_call_finalize_partition(u64 partition_id);
+int hv_call_delete_partition(u64 partition_id);
+int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
+int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
+			  u32 flags, struct page **pages);
+int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
+			    u32 flags);
+int hv_call_delete_vp(u64 partition_id, u32 vp_index);
+int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
+				     u64 dest_addr,
+				     union hv_interrupt_control control);
+int hv_call_clear_virtual_interrupt(u64 partition_id);
+int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
+				  union hv_gpa_page_access_state_flags state_flags,
+				  int *written_total,
+				  union hv_gpa_page_access_state *states);
+int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
+			 struct hv_vp_state_data state_data,
+			 /* Choose between pages and ret_output */
+			 u64 page_count, struct page **pages,
+			 union hv_output_get_vp_state *ret_output);
+int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
+			 /* Choose between pages and bytes */
+			 struct hv_vp_state_data state_data, u64 page_count,
+			 struct page **pages, u32 num_bytes, u8 *bytes);
+int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+			      union hv_input_vtl input_vtl,
+			      struct page **state_page);
+int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+				union hv_input_vtl input_vtl);
+int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
+			u64 connection_partition_id, struct hv_port_info *port_info,
+			u8 port_vtl, u8 min_connection_vtl, int node);
+int hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id);
+int hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
+			 u64 connection_partition_id,
+			 union hv_connection_id connection_id,
+			 struct hv_connection_info *connection_info,
+			 u8 connection_vtl, int node);
+int hv_call_disconnect_port(u64 connection_partition_id,
+			    union hv_connection_id connection_id);
+int hv_call_notify_port_ring_empty(u32 sint_index);
+int hv_call_map_stat_page(enum hv_stats_object_type type,
+			  const union hv_stats_object_identity *identity,
+			  void **addr);
+int hv_call_unmap_stat_page(enum hv_stats_object_type type,
+			    const union hv_stats_object_identity *identity);
+int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
+				   u64 page_struct_count, u32 host_access,
+				   u32 flags, u8 acquire);
+
+extern struct mshv_root mshv_root;
+extern enum hv_scheduler_type hv_scheduler_type;
+extern u8 __percpu **hv_synic_eventring_tail;
+
+#endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
new file mode 100644
index 000000000000..b3e5c652015d
--- /dev/null
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -0,0 +1,876 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, Microsoft Corporation.
+ *
+ * Hypercall helper functions used by the mshv_root module.
+ *
+ * Authors:
+ *   Nuno Das Neves <nunodasneves@linux.microsoft.com>
+ *   Wei Liu <wei.liu@kernel.org>
+ *   Jinank Jain <jinankjain@microsoft.com>
+ *   Vineeth Remanan Pillai <viremana@linux.microsoft.com>
+ *   Asher Kariv <askariv@microsoft.com>
+ *   Muminul Islam <Muminul.Islam@microsoft.com>
+ *   Anatol Belski <anbelski@linux.microsoft.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <asm/mshyperv.h>
+
+#include "mshv_root.h"
+
+/* Determined empirically */
+#define HV_INIT_PARTITION_DEPOSIT_PAGES 208
+#define HV_MAP_GPA_DEPOSIT_PAGES	256
+
+#define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
+
+#define HV_WITHDRAW_BATCH_SIZE	(HV_HYP_PAGE_SIZE / sizeof(u64))
+#define HV_MAP_GPA_BATCH_SIZE	\
+	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_map_gpa_pages)) \
+		/ sizeof(u64))
+#define HV_GET_VP_STATE_BATCH_SIZE	\
+	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_get_vp_state)) \
+		/ sizeof(u64))
+#define HV_SET_VP_STATE_BATCH_SIZE	\
+	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_state)) \
+		/ sizeof(u64))
+#define HV_GET_GPA_ACCESS_STATES_BATCH_SIZE	\
+	((HV_HYP_PAGE_SIZE - sizeof(union hv_gpa_page_access_state)) \
+		/ sizeof(union hv_gpa_page_access_state))
+#define HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT		       \
+	((HV_HYP_PAGE_SIZE -						       \
+	  sizeof(struct hv_input_modify_sparse_spa_page_host_access)) /        \
+	 sizeof(u64))
+
+int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
+{
+	struct hv_input_withdraw_memory *input_page;
+	struct hv_output_withdraw_memory *output_page;
+	struct page *page;
+	u16 completed;
+	unsigned long remaining = count;
+	u64 status = HV_STATUS_SUCCESS;
+	int i;
+	unsigned long flags;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+	output_page = page_address(page);
+
+	while (remaining) {
+		local_irq_save(flags);
+
+		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+		memset(input_page, 0, sizeof(*input_page));
+		input_page->partition_id = partition_id;
+		status = hv_do_rep_hypercall(HVCALL_WITHDRAW_MEMORY,
+					     min(remaining, HV_WITHDRAW_BATCH_SIZE),
+					     0, input_page, output_page);
+
+		local_irq_restore(flags);
+
+		completed = hv_repcomp(status);
+
+		for (i = 0; i < completed; i++)
+			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
+
+		if (!hv_result_success(status)) {
+			if (hv_result(status) == HV_STATUS_NO_RESOURCES)
+				status = HV_STATUS_SUCCESS;
+			break;
+		}
+
+		remaining -= completed;
+	}
+	free_page((unsigned long)output_page);
+
+	return hv_result_to_errno(status);
+}
+
+int hv_call_create_partition(u64 flags,
+			     struct hv_partition_creation_properties creation_properties,
+			     union hv_partition_isolation_properties isolation_properties,
+			     u64 *partition_id)
+{
+	struct hv_input_create_partition *input;
+	struct hv_output_create_partition *output;
+	u64 status;
+	int ret;
+	unsigned long irq_flags;
+
+	do {
+		local_irq_save(irq_flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+		memset(input, 0, sizeof(*input));
+		input->flags = flags;
+		input->compatibility_version = HV_COMPATIBILITY_21_H2;
+
+		memcpy(&input->partition_creation_properties, &creation_properties,
+		       sizeof(creation_properties));
+
+		memcpy(&input->isolation_properties, &isolation_properties,
+		       sizeof(isolation_properties));
+
+		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
+					 input, output);
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (hv_result_success(status))
+				*partition_id = output->partition_id;
+			local_irq_restore(irq_flags);
+			ret = hv_result_to_errno(status);
+			break;
+		}
+		local_irq_restore(irq_flags);
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    hv_current_partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+int hv_call_initialize_partition(u64 partition_id)
+{
+	struct hv_input_initialize_partition input;
+	u64 status;
+	int ret;
+
+	input.partition_id = partition_id;
+
+	ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
+				    HV_INIT_PARTITION_DEPOSIT_PAGES);
+	if (ret)
+		return ret;
+
+	do {
+		status = hv_do_fast_hypercall8(HVCALL_INITIALIZE_PARTITION,
+					       *(u64 *)&input);
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			ret = hv_result_to_errno(status);
+			break;
+		}
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+int hv_call_finalize_partition(u64 partition_id)
+{
+	struct hv_input_finalize_partition input;
+	u64 status;
+
+	input.partition_id = partition_id;
+	status = hv_do_fast_hypercall8(HVCALL_FINALIZE_PARTITION,
+				       *(u64 *)&input);
+
+	return hv_result_to_errno(status);
+}
+
+int hv_call_delete_partition(u64 partition_id)
+{
+	struct hv_input_delete_partition input;
+	u64 status;
+
+	input.partition_id = partition_id;
+	status = hv_do_fast_hypercall8(HVCALL_DELETE_PARTITION, *(u64 *)&input);
+
+	return hv_result_to_errno(status);
+}
+
+/* Ask the hypervisor to map guest ram pages or the guest mmio space */
+static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
+			       u32 flags, struct page **pages, u64 mmio_spa)
+{
+	struct hv_input_map_gpa_pages *input_page;
+	u64 status, *pfnlist;
+	unsigned long irq_flags, large_shift = 0;
+	int ret = 0, done = 0;
+	u64 page_count = page_struct_count;
+
+	if (page_count == 0 || (pages && mmio_spa))
+		return -EINVAL;
+
+	if (flags & HV_MAP_GPA_LARGE_PAGE) {
+		if (mmio_spa)
+			return -EINVAL;
+
+		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
+			return -EINVAL;
+
+		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
+		page_count >>= large_shift;
+	}
+
+	while (done < page_count) {
+		ulong i, completed, remain = page_count - done;
+		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
+
+		local_irq_save(irq_flags);
+		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+		input_page->target_partition_id = partition_id;
+		input_page->target_gpa_base = gfn + (done << large_shift);
+		input_page->map_flags = flags;
+		pfnlist = input_page->source_gpa_page_list;
+
+		for (i = 0; i < rep_count; i++)
+			if (flags & HV_MAP_GPA_NO_ACCESS) {
+				pfnlist[i] = 0;
+			} else if (pages) {
+				u64 index = (done + i) << large_shift;
+
+				if (index >= page_struct_count) {
+					ret = -EINVAL;
+					break;
+				}
+				pfnlist[i] = page_to_pfn(pages[index]);
+			} else {
+				pfnlist[i] = mmio_spa + done + i;
+			}
+		if (ret) {
+			local_irq_restore(irq_flags);
+			break;
+		}
+		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
+					     input_page, NULL);
+		local_irq_restore(irq_flags);
+		completed = hv_repcomp(status);
+
+		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
+						    HV_MAP_GPA_DEPOSIT_PAGES);
+			if (ret)
+				break;
+
+		} else if (!hv_result_success(status)) {
+			ret = hv_result_to_errno(status);
+			break;
+		}
+
+		done += completed;
+	}
+
+	if (ret && done) {
+		u32 unmap_flags = 0;
+
+		if (flags & HV_MAP_GPA_LARGE_PAGE)
+			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+		hv_call_unmap_gpa_pages(partition_id, gfn, done << large_shift, unmap_flags);
+	}
+
+	return ret;
+}
+
+/* Ask the hypervisor to map guest ram pages */
+int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
+			  u32 flags, struct page **pages)
+{
+	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
+				   flags, pages, 0);
+}
+
+/* Ask the hypervisor to map guest mmio space */
+int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
+{
+	int i;
+	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
+		    HV_MAP_GPA_NOT_CACHED;
+
+	for (i = 0; i < numpgs; i++)
+		if (page_is_ram(mmio_spa + i))
+			return -EINVAL;
+
+	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
+				   mmio_spa);
+}
+
+int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
+			    u32 flags)
+{
+	struct hv_input_unmap_gpa_pages *input_page;
+	u64 status, page_count = page_count_4k;
+	unsigned long irq_flags, large_shift = 0;
+	int ret = 0, done = 0;
+
+	if (page_count == 0)
+		return -EINVAL;
+
+	if (flags & HV_UNMAP_GPA_LARGE_PAGE) {
+		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
+			return -EINVAL;
+
+		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
+		page_count >>= large_shift;
+	}
+
+	while (done < page_count) {
+		ulong completed, remain = page_count - done;
+		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
+
+		local_irq_save(irq_flags);
+		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+		input_page->target_partition_id = partition_id;
+		input_page->target_gpa_base = gfn + (done << large_shift);
+		input_page->unmap_flags = flags;
+		status = hv_do_rep_hypercall(HVCALL_UNMAP_GPA_PAGES, rep_count,
+					     0, input_page, NULL);
+		local_irq_restore(irq_flags);
+
+		completed = hv_repcomp(status);
+		if (!hv_result_success(status)) {
+			ret = hv_result_to_errno(status);
+			break;
+		}
+
+		done += completed;
+	}
+
+	return ret;
+}
+
+int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
+				  union hv_gpa_page_access_state_flags state_flags,
+				  int *written_total,
+				  union hv_gpa_page_access_state *states)
+{
+	struct hv_input_get_gpa_pages_access_state *input_page;
+	union hv_gpa_page_access_state *output_page;
+	int completed = 0;
+	unsigned long remaining = count;
+	int rep_count, i;
+	u64 status = HV_STATUS_SUCCESS;
+	unsigned long flags;
+
+	*written_total = 0;
+	while (remaining) {
+		local_irq_save(flags);
+		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+		input_page->partition_id = partition_id;
+		input_page->hv_gpa_page_number = gpa_base_pfn + *written_total;
+		input_page->flags = state_flags;
+		rep_count = min(remaining, HV_GET_GPA_ACCESS_STATES_BATCH_SIZE);
+
+		status = hv_do_rep_hypercall(HVCALL_GET_GPA_PAGES_ACCESS_STATES, rep_count,
+					     0, input_page, output_page);
+		if (!hv_result_success(status)) {
+			local_irq_restore(flags);
+			break;
+		}
+		completed = hv_repcomp(status);
+		for (i = 0; i < completed; ++i)
+			states[i].as_uint8 = output_page[i].as_uint8;
+
+		states += completed;
+		*written_total += completed;
+		remaining -= completed;
+		local_irq_restore(flags);
+	}
+
+	return hv_result_to_errno(status);
+}
+
+int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
+				     u64 dest_addr,
+				     union hv_interrupt_control control)
+{
+	struct hv_input_assert_virtual_interrupt *input;
+	unsigned long flags;
+	u64 status;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition_id;
+	input->vector = vector;
+	input->dest_addr = dest_addr;
+	input->control = control;
+	status = hv_do_hypercall(HVCALL_ASSERT_VIRTUAL_INTERRUPT, input, NULL);
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int hv_call_delete_vp(u64 partition_id, u32 vp_index)
+{
+	union hv_input_delete_vp input = {};
+	u64 status;
+
+	input.partition_id = partition_id;
+	input.vp_index = vp_index;
+
+	status = hv_do_fast_hypercall16(HVCALL_DELETE_VP,
+					input.as_uint64[0], input.as_uint64[1]);
+
+	return hv_result_to_errno(status);
+}
+EXPORT_SYMBOL_GPL(hv_call_delete_vp);
+
+int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
+			 struct hv_vp_state_data state_data,
+			 /* Choose between pages and ret_output */
+			 u64 page_count, struct page **pages,
+			 union hv_output_get_vp_state *ret_output)
+{
+	struct hv_input_get_vp_state *input;
+	union hv_output_get_vp_state *output;
+	u64 status;
+	int i;
+	u64 control;
+	unsigned long flags;
+	int ret = 0;
+
+	if (page_count > HV_GET_VP_STATE_BATCH_SIZE)
+		return -EINVAL;
+
+	if (!page_count && !ret_output)
+		return -EINVAL;
+
+	do {
+		local_irq_save(flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+		memset(input, 0, sizeof(*input));
+		memset(output, 0, sizeof(*output));
+
+		input->partition_id = partition_id;
+		input->vp_index = vp_index;
+		input->state_data = state_data;
+		for (i = 0; i < page_count; i++)
+			input->output_data_pfns[i] = page_to_pfn(pages[i]);
+
+		control = (HVCALL_GET_VP_STATE) |
+			  (page_count << HV_HYPERCALL_VARHEAD_OFFSET);
+
+		status = hv_do_hypercall(control, input, output);
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (hv_result_success(status) && ret_output)
+				memcpy(ret_output, output, sizeof(*output));
+
+			local_irq_restore(flags);
+			ret = hv_result_to_errno(status);
+			break;
+		}
+		local_irq_restore(flags);
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
+			 /* Choose between pages and bytes */
+			 struct hv_vp_state_data state_data, u64 page_count,
+			 struct page **pages, u32 num_bytes, u8 *bytes)
+{
+	struct hv_input_set_vp_state *input;
+	u64 status;
+	int i;
+	u64 control;
+	unsigned long flags;
+	int ret = 0;
+	u16 varhead_sz;
+
+	if (page_count > HV_SET_VP_STATE_BATCH_SIZE)
+		return -EINVAL;
+	if (sizeof(*input) + num_bytes > HV_HYP_PAGE_SIZE)
+		return -EINVAL;
+
+	if (num_bytes)
+		/* round up to 8 and divide by 8 */
+		varhead_sz = (num_bytes + 7) >> 3;
+	else if (page_count)
+		varhead_sz = page_count;
+	else
+		return -EINVAL;
+
+	do {
+		local_irq_save(flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		memset(input, 0, sizeof(*input));
+
+		input->partition_id = partition_id;
+		input->vp_index = vp_index;
+		input->state_data = state_data;
+		if (num_bytes) {
+			memcpy((u8 *)input->data, bytes, num_bytes);
+		} else {
+			for (i = 0; i < page_count; i++)
+				input->data[i].pfns = page_to_pfn(pages[i]);
+		}
+
+		control = (HVCALL_SET_VP_STATE) |
+			  (varhead_sz << HV_HYPERCALL_VARHEAD_OFFSET);
+
+		status = hv_do_hypercall(control, input, NULL);
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			local_irq_restore(flags);
+			ret = hv_result_to_errno(status);
+			break;
+		}
+		local_irq_restore(flags);
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+			      union hv_input_vtl input_vtl,
+			      struct page **state_page)
+{
+	struct hv_input_map_vp_state_page *input;
+	struct hv_output_map_vp_state_page *output;
+	u64 status;
+	int ret;
+	unsigned long flags;
+
+	do {
+		local_irq_save(flags);
+
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+		input->partition_id = partition_id;
+		input->vp_index = vp_index;
+		input->type = type;
+		input->input_vtl = input_vtl;
+
+		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE, input, output);
+
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (hv_result_success(status))
+				*state_page = pfn_to_page(output->map_location);
+			local_irq_restore(flags);
+			ret = hv_result_to_errno(status);
+			break;
+		}
+
+		local_irq_restore(flags);
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
+				union hv_input_vtl input_vtl)
+{
+	unsigned long flags;
+	u64 status;
+	struct hv_input_unmap_vp_state_page *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+	memset(input, 0, sizeof(*input));
+
+	input->partition_id = partition_id;
+	input->vp_index = vp_index;
+	input->type = type;
+	input->input_vtl = input_vtl;
+
+	status = hv_do_hypercall(HVCALL_UNMAP_VP_STATE_PAGE, input, NULL);
+
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int
+hv_call_clear_virtual_interrupt(u64 partition_id)
+{
+	unsigned long flags;
+	int status;
+
+	local_irq_save(flags);
+	status = hv_do_fast_hypercall8(HVCALL_CLEAR_VIRTUAL_INTERRUPT,
+				       partition_id) &
+			HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int
+hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
+		    u64 connection_partition_id,
+		    struct hv_port_info *port_info,
+		    u8 port_vtl, u8 min_connection_vtl, int node)
+{
+	struct hv_input_create_port *input;
+	unsigned long flags;
+	int ret = 0;
+	int status;
+
+	do {
+		local_irq_save(flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		memset(input, 0, sizeof(*input));
+
+		input->port_partition_id = port_partition_id;
+		input->port_id = port_id;
+		input->connection_partition_id = connection_partition_id;
+		input->port_info = *port_info;
+		input->port_vtl = port_vtl;
+		input->min_connection_vtl = min_connection_vtl;
+		input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
+		status = hv_do_hypercall(HVCALL_CREATE_PORT, input, NULL) &
+			 HV_HYPERCALL_RESULT_MASK;
+		local_irq_restore(flags);
+		if (status == HV_STATUS_SUCCESS)
+			break;
+
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			ret = hv_result_to_errno(status);
+			break;
+		}
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
+
+	} while (!ret);
+
+	return ret;
+}
+
+int
+hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id)
+{
+	union hv_input_delete_port input = { 0 };
+	unsigned long flags;
+	int status;
+
+	local_irq_save(flags);
+	input.port_partition_id = port_partition_id;
+	input.port_id = port_id;
+	status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT,
+					input.as_uint64[0],
+					input.as_uint64[1]) &
+			HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int
+hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
+		     u64 connection_partition_id,
+		     union hv_connection_id connection_id,
+		     struct hv_connection_info *connection_info,
+		     u8 connection_vtl, int node)
+{
+	struct hv_input_connect_port *input;
+	unsigned long flags;
+	int ret = 0, status;
+
+	do {
+		local_irq_save(flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		memset(input, 0, sizeof(*input));
+		input->port_partition_id = port_partition_id;
+		input->port_id = port_id;
+		input->connection_partition_id = connection_partition_id;
+		input->connection_id = connection_id;
+		input->connection_info = *connection_info;
+		input->connection_vtl = connection_vtl;
+		input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
+		status = hv_do_hypercall(HVCALL_CONNECT_PORT, input, NULL) &
+			 HV_HYPERCALL_RESULT_MASK;
+
+		local_irq_restore(flags);
+		if (status == HV_STATUS_SUCCESS)
+			break;
+
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			ret = hv_result_to_errno(status);
+			break;
+		}
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    connection_partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+int
+hv_call_disconnect_port(u64 connection_partition_id,
+			union hv_connection_id connection_id)
+{
+	union hv_input_disconnect_port input = { 0 };
+	unsigned long flags;
+	int status;
+
+	local_irq_save(flags);
+	input.connection_partition_id = connection_partition_id;
+	input.connection_id = connection_id;
+	input.is_doorbell = 1;
+	status = hv_do_fast_hypercall16(HVCALL_DISCONNECT_PORT,
+					input.as_uint64[0],
+					input.as_uint64[1]) &
+			HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int
+hv_call_notify_port_ring_empty(u32 sint_index)
+{
+	union hv_input_notify_port_ring_empty input = { 0 };
+	unsigned long flags;
+	int status;
+
+	local_irq_save(flags);
+	input.sint_index = sint_index;
+	status = hv_do_fast_hypercall8(HVCALL_NOTIFY_PORT_RING_EMPTY,
+				       input.as_uint64) &
+		 HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int hv_call_map_stat_page(enum hv_stats_object_type type,
+			  const union hv_stats_object_identity *identity,
+			  void **addr)
+{
+	unsigned long flags;
+	struct hv_input_map_stats_page *input;
+	struct hv_output_map_stats_page *output;
+	u64 status, pfn;
+	int ret = 0;
+
+	do {
+		local_irq_save(flags);
+		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+		memset(input, 0, sizeof(*input));
+		input->type = type;
+		input->identity = *identity;
+
+		status = hv_do_hypercall(HVCALL_MAP_STATS_PAGE, input, output);
+		pfn = output->map_location;
+
+		local_irq_restore(flags);
+		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (hv_result_success(status))
+				break;
+			return hv_result_to_errno(status);
+		}
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    hv_current_partition_id, 1);
+		if (ret)
+			return ret;
+	} while (!ret);
+
+	*addr = page_address(pfn_to_page(pfn));
+
+	return ret;
+}
+
+int hv_call_unmap_stat_page(enum hv_stats_object_type type,
+			    const union hv_stats_object_identity *identity)
+{
+	unsigned long flags;
+	struct hv_input_unmap_stats_page *input;
+	u64 status;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+	memset(input, 0, sizeof(*input));
+	input->type = type;
+	input->identity = *identity;
+
+	status = hv_do_hypercall(HVCALL_UNMAP_STATS_PAGE, input, NULL);
+	local_irq_restore(flags);
+
+	return hv_result_to_errno(status);
+}
+
+int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
+				   u64 page_struct_count, u32 host_access,
+				   u32 flags, u8 acquire)
+{
+	struct hv_input_modify_sparse_spa_page_host_access *input_page;
+	u64 status;
+	int done = 0;
+	unsigned long irq_flags, large_shift = 0;
+	u64 page_count = page_struct_count;
+	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
+			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
+
+	if (page_count == 0)
+		return -EINVAL;
+
+	if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE) {
+		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
+			return -EINVAL;
+		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
+		page_count >>= large_shift;
+	}
+
+	while (done < page_count) {
+		ulong i, completed, remain = page_count - done;
+		int rep_count = min(remain,
+				    HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
+
+		local_irq_save(irq_flags);
+		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+		/*
+		 * This is required to make sure that reserved field is set to
+		 * zero, because MSHV has a check to make sure reserved bits are
+		 * set to zero.
+		 */
+		memset(input_page, 0, sizeof(*input_page));
+		/* Only set the partition id if you are making the pages
+		 * exclusive
+		 */
+		if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE)
+			input_page->partition_id = partition_id;
+		input_page->flags = flags;
+		input_page->host_access = host_access;
+
+		for (i = 0; i < rep_count; i++) {
+			u64 index = (done + i) << large_shift;
+
+			if (index >= page_struct_count) {
+				local_irq_restore(irq_flags);
+				return -EINVAL;
+			}
+
+			input_page->spa_page_list[i] =
+						page_to_pfn(pages[index]);
+		}
+
+		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
+					     NULL);
+		local_irq_restore(irq_flags);
+
+		completed = hv_repcomp(status);
+
+		if (!hv_result_success(status))
+			return hv_result_to_errno(status);
+
+		done += completed;
+	}
+
+	return 0;
+}
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
new file mode 100644
index 000000000000..fed19aa80049
--- /dev/null
+++ b/drivers/hv/mshv_root_main.c
@@ -0,0 +1,2329 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, Microsoft Corporation.
+ *
+ * The main part of the mshv_root module, providing APIs to create
+ * and manage guest partitions.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/cpuhotplug.h>
+#include <linux/random.h>
+#include <asm/mshyperv.h>
+#include <linux/hyperv.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
+#include <linux/kexec.h>
+#include <linux/page-flags.h>
+#include <linux/crash_dump.h>
+#include <linux/panic_notifier.h>
+#include <linux/vmalloc.h>
+
+#include "mshv_eventfd.h"
+#include "mshv.h"
+#include "mshv_root.h"
+
+/* TODO move this to mshyperv.h when needed outside driver */
+static inline bool hv_parent_partition(void)
+{
+	return hv_root_partition();
+}
+
+/* TODO move this to another file when debugfs code is added */
+enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */
+#if defined(CONFIG_X86)
+	VpRootDispatchThreadBlocked			= 201,
+#elif defined(CONFIG_ARM64)
+	VpRootDispatchThreadBlocked			= 94,
+#endif
+	VpStatsMaxCounter
+};
+
+struct hv_stats_page {
+	union {
+		u64 vp_cntrs[VpStatsMaxCounter];		/* VP counters */
+		u8 data[HV_HYP_PAGE_SIZE];
+	};
+} __packed;
+
+struct mshv_root mshv_root = {};
+
+enum hv_scheduler_type hv_scheduler_type;
+
+/* Once we implement the fast extended hypercall ABI they can go away. */
+static void __percpu **root_scheduler_input;
+static void __percpu **root_scheduler_output;
+
+static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+static int mshv_dev_open(struct inode *inode, struct file *filp);
+static int mshv_dev_release(struct inode *inode, struct file *filp);
+static int mshv_vp_release(struct inode *inode, struct file *filp);
+static long mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+static int mshv_partition_release(struct inode *inode, struct file *filp);
+static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
+static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
+static int mshv_init_async_handler(struct mshv_partition *partition);
+static void mshv_async_hvcall_handler(void *data, u64 *status);
+
+static const struct vm_operations_struct mshv_vp_vm_ops = {
+	.fault = mshv_vp_fault,
+};
+
+static const struct file_operations mshv_vp_fops = {
+	.owner = THIS_MODULE,
+	.release = mshv_vp_release,
+	.unlocked_ioctl = mshv_vp_ioctl,
+	.llseek = noop_llseek,
+	.mmap = mshv_vp_mmap,
+};
+
+static const struct file_operations mshv_partition_fops = {
+	.owner = THIS_MODULE,
+	.release = mshv_partition_release,
+	.unlocked_ioctl = mshv_partition_ioctl,
+	.llseek = noop_llseek,
+};
+
+static const struct file_operations mshv_dev_fops = {
+	.owner = THIS_MODULE,
+	.open = mshv_dev_open,
+	.release = mshv_dev_release,
+	.unlocked_ioctl = mshv_dev_ioctl,
+	.llseek = noop_llseek,
+};
+
+static struct miscdevice mshv_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mshv",
+	.fops = &mshv_dev_fops,
+	.mode = 0600,
+};
+
+/*
+ * Only allow hypercalls that have a u64 partition id as the first member of
+ * the input structure.
+ * These are sorted by value.
+ */
+static u16 mshv_passthru_hvcalls[] = {
+	HVCALL_GET_PARTITION_PROPERTY,
+	HVCALL_SET_PARTITION_PROPERTY,
+	HVCALL_INSTALL_INTERCEPT,
+	HVCALL_GET_VP_REGISTERS,
+	HVCALL_SET_VP_REGISTERS,
+	HVCALL_TRANSLATE_VIRTUAL_ADDRESS,
+	HVCALL_CLEAR_VIRTUAL_INTERRUPT,
+	HVCALL_REGISTER_INTERCEPT_RESULT,
+	HVCALL_ASSERT_VIRTUAL_INTERRUPT,
+	HVCALL_GET_GPA_PAGES_ACCESS_STATES,
+	HVCALL_SIGNAL_EVENT_DIRECT,
+	HVCALL_POST_MESSAGE_DIRECT,
+	HVCALL_GET_VP_CPUID_VALUES,
+};
+
+static bool mshv_hvcall_is_async(u16 code)
+{
+	switch (code) {
+	case HVCALL_SET_PARTITION_PROPERTY:
+		return true;
+	default:
+		break;
+	}
+	return false;
+}
+
+static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
+				      bool partition_locked,
+				      void __user *user_args)
+{
+	u64 status;
+	int ret = 0, i;
+	bool is_async;
+	struct mshv_root_hvcall args;
+	struct page *page;
+	unsigned int pages_order;
+	void *input_pg = NULL;
+	void *output_pg = NULL;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	if (args.status || !args.in_ptr || args.in_sz < sizeof(u64) ||
+	    mshv_field_nonzero(args, rsvd) || args.in_sz > HV_HYP_PAGE_SIZE)
+		return -EINVAL;
+
+	if (args.out_ptr && (!args.out_sz || args.out_sz > HV_HYP_PAGE_SIZE))
+		return -EINVAL;
+
+	for (i = 0; i < ARRAY_SIZE(mshv_passthru_hvcalls); ++i)
+		if (args.code == mshv_passthru_hvcalls[i])
+			break;
+
+	if (i >= ARRAY_SIZE(mshv_passthru_hvcalls))
+		return -EINVAL;
+
+	is_async = mshv_hvcall_is_async(args.code);
+	if (is_async) {
+		/* async hypercalls can only be called from partition fd */
+		if (!partition_locked)
+			return -EINVAL;
+		ret = mshv_init_async_handler(partition);
+		if (ret)
+			return ret;
+	}
+
+	pages_order = args.out_ptr ? 1 : 0;
+	page = alloc_pages(GFP_KERNEL, pages_order);
+	if (!page)
+		return -ENOMEM;
+	input_pg = page_address(page);
+
+	if (args.out_ptr)
+		output_pg = (char *)input_pg + PAGE_SIZE;
+	else
+		output_pg = NULL;
+
+	if (copy_from_user(input_pg, (void __user *)args.in_ptr,
+			   args.in_sz)) {
+		ret = -EFAULT;
+		goto free_pages_out;
+	}
+
+	/*
+	 * NOTE: This only works because all the allowed hypercalls' input
+	 * structs begin with a u64 partition_id field.
+	 */
+	*(u64 *)input_pg = partition->pt_id;
+
+	if (args.reps)
+		status = hv_do_rep_hypercall(args.code, args.reps, 0,
+					     input_pg, output_pg);
+	else
+		status = hv_do_hypercall(args.code, input_pg, output_pg);
+
+	if (hv_result(status) == HV_STATUS_CALL_PENDING) {
+		if (is_async) {
+			mshv_async_hvcall_handler(partition, &status);
+		} else { /* Paranoia check. This shouldn't happen! */
+			ret = -EBADFD;
+			goto free_pages_out;
+		}
+	}
+
+	if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition->pt_id, 1);
+		if (!ret)
+			ret = -EAGAIN;
+	} else if (!hv_result_success(status)) {
+		ret = hv_result_to_errno(status);
+	}
+
+	/*
+	 * Always return the status and output data regardless of result.
+	 * The VMM may need it to determine how to proceed. E.g. the status may
+	 * contain the number of reps completed if a rep hypercall partially
+	 * succeeded.
+	 */
+	args.status = hv_result(status);
+	args.reps = args.reps ? hv_repcomp(status) : 0;
+	if (copy_to_user(user_args, &args, sizeof(args)))
+		ret = -EFAULT;
+
+	if (output_pg &&
+	    copy_to_user((void __user *)args.out_ptr, output_pg, args.out_sz))
+		ret = -EFAULT;
+
+free_pages_out:
+	free_pages((unsigned long)input_pg, pages_order);
+
+	return ret;
+}
+
+static inline bool is_ghcb_mapping_available(void)
+{
+#if IS_ENABLED(CONFIG_X86_64)
+	return ms_hyperv.ext_features & HV_VP_GHCB_ROOT_MAPPING_AVAILABLE;
+#else
+	return false;
+#endif
+}
+
+static int mshv_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+				 struct hv_register_assoc *registers)
+{
+	union hv_input_vtl input_vtl;
+
+	input_vtl.as_uint8 = 0;
+	return hv_call_get_vp_registers(vp_index, partition_id,
+					count, input_vtl, registers);
+}
+
+static int mshv_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+				 struct hv_register_assoc *registers)
+{
+	union hv_input_vtl input_vtl;
+
+	input_vtl.as_uint8 = 0;
+	return hv_call_set_vp_registers(vp_index, partition_id,
+					count, input_vtl, registers);
+}
+
+/*
+ * Explicit guest vCPU suspend is asynchronous by nature (as it is requested by
+ * dom0 vCPU for guest vCPU) and thus it can race with "intercept" suspend,
+ * done by the hypervisor.
+ * "Intercept" suspend leads to asynchronous message delivery to dom0 which
+ * should be awaited to keep the VP loop consistent (i.e. no message pending
+ * upon VP resume).
+ * VP intercept suspend can't be done when the VP is explicitly suspended
+ * already, and thus there can be only two possible race scenarios:
+ *   1. implicit suspend bit set -> explicit suspend bit set -> message sent
+ *   2. implicit suspend bit set -> message sent -> explicit suspend bit set
+ * Checking for implicit suspend bit set after explicit suspend request has
+ * succeeded in either case allows us to reliably identify whether there is a
+ * message to receive and deliver to VMM.
+ */
+static long
+mshv_suspend_vp(const struct mshv_vp *vp, bool *message_in_flight)
+{
+	struct hv_register_assoc explicit_suspend = {
+		.name = HV_REGISTER_EXPLICIT_SUSPEND
+	};
+	struct hv_register_assoc intercept_suspend = {
+		.name = HV_REGISTER_INTERCEPT_SUSPEND
+	};
+	union hv_explicit_suspend_register *es =
+		&explicit_suspend.value.explicit_suspend;
+	union hv_intercept_suspend_register *is =
+		&intercept_suspend.value.intercept_suspend;
+	int ret;
+
+	es->suspended = 1;
+
+	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
+				    1, &explicit_suspend);
+	if (ret) {
+		vp_err(vp, "Failed to explicitly suspend vCPU\n");
+		return ret;
+	}
+
+	ret = mshv_get_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
+				    1, &intercept_suspend);
+	if (ret) {
+		vp_err(vp, "Failed to get intercept suspend state\n");
+		return ret;
+	}
+
+	*message_in_flight = is->suspended;
+
+	return 0;
+}
+
+/*
+ * This function is used when VPs are scheduled by the hypervisor's
+ * scheduler.
+ *
+ * Caller has to make sure the registers contain cleared
+ * HV_REGISTER_INTERCEPT_SUSPEND and HV_REGISTER_EXPLICIT_SUSPEND registers
+ * exactly in this order (the hypervisor clears them sequentially) to avoid
+ * potential invalid clearing of a newly arrived HV_REGISTER_INTERCEPT_SUSPEND
+ * after VP is released from HV_REGISTER_EXPLICIT_SUSPEND in case of the
+ * opposite order.
+ */
+static long mshv_run_vp_with_hyp_scheduler(struct mshv_vp *vp)
+{
+	long ret;
+	struct hv_register_assoc suspend_regs[2] = {
+			{ .name = HV_REGISTER_INTERCEPT_SUSPEND },
+			{ .name = HV_REGISTER_EXPLICIT_SUSPEND }
+	};
+	size_t count = ARRAY_SIZE(suspend_regs);
+
+	/* Resume VP execution */
+	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
+				    count, suspend_regs);
+	if (ret) {
+		vp_err(vp, "Failed to resume vp execution. %ld\n", ret);
+		return ret;
+	}
+
+	ret = wait_event_interruptible(vp->run.vp_suspend_queue,
+				       vp->run.kicked_by_hv == 1);
+	if (ret) {
+		bool message_in_flight;
+
+		/*
+		 * The wait was interrupted by a signal: suspend the vCPU
+		 * explicitly and copy the message in flight (if any).
+		 */
+		ret = mshv_suspend_vp(vp, &message_in_flight);
+		if (ret)
+			return ret;
+
+		/* Return if no message in flight */
+		if (!message_in_flight)
+			return -EINTR;
+
+		/* Wait for the message in flight. */
+		wait_event(vp->run.vp_suspend_queue, vp->run.kicked_by_hv == 1);
+	}
+
+	/*
+	 * Reset the flag to make the wait_event call above work
+	 * next time.
+	 */
+	vp->run.kicked_by_hv = 0;
+
+	return 0;
+}
+
+static int
+mshv_vp_dispatch(struct mshv_vp *vp, u32 flags,
+		 struct hv_output_dispatch_vp *res)
+{
+	struct hv_input_dispatch_vp *input;
+	struct hv_output_dispatch_vp *output;
+	u64 status;
+
+	preempt_disable();
+	input = *this_cpu_ptr(root_scheduler_input);
+	output = *this_cpu_ptr(root_scheduler_output);
+
+	memset(input, 0, sizeof(*input));
+	memset(output, 0, sizeof(*output));
+
+	input->partition_id = vp->vp_partition->pt_id;
+	input->vp_index = vp->vp_index;
+	input->time_slice = 0; /* Run forever until something happens */
+	input->spec_ctrl = 0; /* TODO: set sensible flags */
+	input->flags = flags;
+
+	vp->run.flags.root_sched_dispatched = 1;
+	status = hv_do_hypercall(HVCALL_DISPATCH_VP, input, output);
+	vp->run.flags.root_sched_dispatched = 0;
+
+	*res = *output;
+	preempt_enable();
+
+	if (!hv_result_success(status))
+		vp_err(vp, "%s: status %s\n", __func__,
+		       hv_result_to_string(status));
+
+	return hv_result_to_errno(status);
+}
+
+static int
+mshv_vp_clear_explicit_suspend(struct mshv_vp *vp)
+{
+	struct hv_register_assoc explicit_suspend = {
+		.name = HV_REGISTER_EXPLICIT_SUSPEND,
+		.value.explicit_suspend.suspended = 0,
+	};
+	int ret;
+
+	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
+				    1, &explicit_suspend);
+
+	if (ret)
+		vp_err(vp, "Failed to unsuspend\n");
+
+	return ret;
+}
+
+#if IS_ENABLED(CONFIG_X86_64)
+static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
+{
+	if (!vp->vp_register_page)
+		return 0;
+	return vp->vp_register_page->interrupt_vectors.as_uint64;
+}
+#else
+static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
+{
+	return 0;
+}
+#endif
+
+static bool mshv_vp_dispatch_thread_blocked(struct mshv_vp *vp)
+{
+	struct hv_stats_page **stats = vp->vp_stats_pages;
+	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->vp_cntrs;
+	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->vp_cntrs;
+
+	if (self_vp_cntrs[VpRootDispatchThreadBlocked])
+		return self_vp_cntrs[VpRootDispatchThreadBlocked];
+	return parent_vp_cntrs[VpRootDispatchThreadBlocked];
+}
+
+static int
+mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
+{
+	int ret;
+
+	ret = wait_event_interruptible(vp->run.vp_suspend_queue,
+				       (vp->run.kicked_by_hv == 1 &&
+					!mshv_vp_dispatch_thread_blocked(vp)) ||
+				       mshv_vp_interrupt_pending(vp));
+	if (ret)
+		return -EINTR;
+
+	vp->run.flags.root_sched_blocked = 0;
+	vp->run.kicked_by_hv = 0;
+
+	return 0;
+}
+
+static int mshv_pre_guest_mode_work(struct mshv_vp *vp)
+{
+	const ulong work_flags = _TIF_NOTIFY_SIGNAL | _TIF_SIGPENDING |
+				 _TIF_NEED_RESCHED  | _TIF_NOTIFY_RESUME;
+	ulong th_flags;
+
+	th_flags = read_thread_flags();
+	while (th_flags & work_flags) {
+		int ret;
+
+		/* nb: following will call schedule */
+		ret = mshv_do_pre_guest_mode_work(th_flags);
+
+		if (ret)
+			return ret;
+
+		th_flags = read_thread_flags();
+	}
+
+	return 0;
+}
+
+/* Must be called with interrupts enabled */
+static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
+{
+	long ret;
+
+	if (vp->run.flags.root_sched_blocked) {
+		/*
+		 * Dispatch state of this VP is blocked. Need to wait
+		 * for the hypervisor to clear the blocked state before
+		 * dispatching it.
+		 */
+		ret = mshv_vp_wait_for_hv_kick(vp);
+		if (ret)
+			return ret;
+	}
+
+	do {
+		u32 flags = 0;
+		struct hv_output_dispatch_vp output;
+
+		ret = mshv_pre_guest_mode_work(vp);
+		if (ret)
+			break;
+
+		if (vp->run.flags.intercept_suspend)
+			flags |= HV_DISPATCH_VP_FLAG_CLEAR_INTERCEPT_SUSPEND;
+
+		if (mshv_vp_interrupt_pending(vp))
+			flags |= HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION;
+
+		ret = mshv_vp_dispatch(vp, flags, &output);
+		if (ret)
+			break;
+
+		vp->run.flags.intercept_suspend = 0;
+
+		if (output.dispatch_state == HV_VP_DISPATCH_STATE_BLOCKED) {
+			if (output.dispatch_event ==
+						HV_VP_DISPATCH_EVENT_SUSPEND) {
+				/*
+				 * TODO: remove the warning once VP canceling
+				 *	 is supported
+				 */
+				WARN_ONCE(atomic64_read(&vp->run.vp_signaled_count),
+					  "%s: vp#%d: unexpected explicit suspend\n",
+					  __func__, vp->vp_index);
+				/*
+				 * Need to clear explicit suspend before
+				 * dispatching.
+				 * Explicit suspend is either:
+				 * - set right after the first VP dispatch or
+				 * - set explicitly via hypercall
+				 * Since the latter case is not yet supported,
+				 * simply clear it here.
+				 */
+				ret = mshv_vp_clear_explicit_suspend(vp);
+				if (ret)
+					break;
+
+				ret = mshv_vp_wait_for_hv_kick(vp);
+				if (ret)
+					break;
+			} else {
+				vp->run.flags.root_sched_blocked = 1;
+				ret = mshv_vp_wait_for_hv_kick(vp);
+				if (ret)
+					break;
+			}
+		} else {
+			/* HV_VP_DISPATCH_STATE_READY */
+			if (output.dispatch_event ==
+						HV_VP_DISPATCH_EVENT_INTERCEPT)
+				vp->run.flags.intercept_suspend = 1;
+		}
+	} while (!vp->run.flags.intercept_suspend);
+
+	return ret;
+}
+
+static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
+	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
+
+static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
+{
+	long rc;
+	char *schednm;
+
+	schednm = hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT ? "root" : "hv";
+
+	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+		rc = mshv_run_vp_with_root_scheduler(vp);
+	else
+		rc = mshv_run_vp_with_hyp_scheduler(vp);
+
+	if (rc)
+		return rc;
+
+	if (copy_to_user(ret_msg, vp->vp_intercept_msg_page,
+			 sizeof(struct hv_message)))
+		rc = -EFAULT;
+
+	return rc;
+}
+
+static int
+mshv_vp_ioctl_get_set_state_pfn(struct mshv_vp *vp,
+				struct hv_vp_state_data state_data,
+				unsigned long user_pfn, size_t page_count,
+				bool is_set)
+{
+	int completed, ret = 0;
+	unsigned long check;
+	struct page **pages;
+
+	if (page_count > INT_MAX)
+		return -EINVAL;
+	/*
+	 * Check the arithmetic for wraparound/overflow.
+	 * The last page address in the buffer is:
+	 * (user_pfn + (page_count - 1)) * PAGE_SIZE
+	 */
+	if (check_add_overflow(user_pfn, (page_count - 1), &check))
+		return -EOVERFLOW;
+	if (check_mul_overflow(check, PAGE_SIZE, &check))
+		return -EOVERFLOW;
+
+	/* Pin user pages so hypervisor can copy directly to them */
+	pages = kcalloc(page_count, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	for (completed = 0; completed < page_count; completed += ret) {
+		unsigned long user_addr = (user_pfn + completed) * PAGE_SIZE;
+		int remaining = page_count - completed;
+
+		ret = pin_user_pages_fast(user_addr, remaining, FOLL_WRITE,
+					  &pages[completed]);
+		if (ret < 0) {
+			vp_err(vp, "%s: Failed to pin user pages error %i\n",
+			       __func__, ret);
+			goto unpin_pages;
+		}
+	}
+
+	if (is_set)
+		ret = hv_call_set_vp_state(vp->vp_index,
+					   vp->vp_partition->pt_id,
+					   state_data, page_count, pages,
+					   0, NULL);
+	else
+		ret = hv_call_get_vp_state(vp->vp_index,
+					   vp->vp_partition->pt_id,
+					   state_data, page_count, pages,
+					   NULL);
+
+unpin_pages:
+	unpin_user_pages(pages, completed);
+	kfree(pages);
+	return ret;
+}
+
+static long
+mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
+			    struct mshv_get_set_vp_state __user *user_args,
+			    bool is_set)
+{
+	struct mshv_get_set_vp_state args;
+	long ret = 0;
+	union hv_output_get_vp_state vp_state;
+	u32 data_sz;
+	struct hv_vp_state_data state_data = {};
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	if (args.type >= MSHV_VP_STATE_COUNT || mshv_field_nonzero(args, rsvd) ||
+	    !args.buf_sz || !PAGE_ALIGNED(args.buf_sz) ||
+	    !PAGE_ALIGNED(args.buf_ptr))
+		return -EINVAL;
+
+	if (!access_ok((void __user *)args.buf_ptr, args.buf_sz))
+		return -EFAULT;
+
+	switch (args.type) {
+	case MSHV_VP_STATE_LAPIC:
+		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
+		data_sz = HV_HYP_PAGE_SIZE;
+		break;
+	case MSHV_VP_STATE_XSAVE:
+	{
+		u64 data_sz_64;
+
+		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
+						     HV_PARTITION_PROPERTY_XSAVE_STATES,
+						     &state_data.xsave.states.as_uint64);
+		if (ret)
+			return ret;
+
+		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
+						     HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE,
+						     &data_sz_64);
+		if (ret)
+			return ret;
+
+		data_sz = (u32)data_sz_64;
+		state_data.xsave.flags = 0;
+		/* Always request legacy states */
+		state_data.xsave.states.legacy_x87 = 1;
+		state_data.xsave.states.legacy_sse = 1;
+		state_data.type = HV_GET_SET_VP_STATE_XSAVE;
+		break;
+	}
+	case MSHV_VP_STATE_SIMP:
+		state_data.type = HV_GET_SET_VP_STATE_SIM_PAGE;
+		data_sz = HV_HYP_PAGE_SIZE;
+		break;
+	case MSHV_VP_STATE_SIEFP:
+		state_data.type = HV_GET_SET_VP_STATE_SIEF_PAGE;
+		data_sz = HV_HYP_PAGE_SIZE;
+		break;
+	case MSHV_VP_STATE_SYNTHETIC_TIMERS:
+		state_data.type = HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS;
+		data_sz = sizeof(vp_state.synthetic_timers_state);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (copy_to_user(&user_args->buf_sz, &data_sz, sizeof(user_args->buf_sz)))
+		return -EFAULT;
+
+	if (data_sz > args.buf_sz)
+		return -EINVAL;
+
+	/* If the data is transmitted via pfns, delegate to helper */
+	if (state_data.type & HV_GET_SET_VP_STATE_TYPE_PFN) {
+		unsigned long user_pfn = PFN_DOWN(args.buf_ptr);
+		size_t page_count = PFN_DOWN(args.buf_sz);
+
+		return mshv_vp_ioctl_get_set_state_pfn(vp, state_data, user_pfn,
+						       page_count, is_set);
+	}
+
+	/* Paranoia check - this shouldn't happen! */
+	if (data_sz > sizeof(vp_state)) {
+		vp_err(vp, "Invalid vp state data size!\n");
+		return -EINVAL;
+	}
+
+	if (is_set) {
+		if (copy_from_user(&vp_state, (__user void *)args.buf_ptr, data_sz))
+			return -EFAULT;
+
+		return hv_call_set_vp_state(vp->vp_index,
+					    vp->vp_partition->pt_id,
+					    state_data, 0, NULL,
+					    sizeof(vp_state), (u8 *)&vp_state);
+	}
+
+	ret = hv_call_get_vp_state(vp->vp_index, vp->vp_partition->pt_id,
+				   state_data, 0, NULL, &vp_state);
+	if (ret)
+		return ret;
+
+	if (copy_to_user((void __user *)args.buf_ptr, &vp_state, data_sz))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long
+mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+{
+	struct mshv_vp *vp = filp->private_data;
+	long r = -ENOTTY;
+
+	if (mutex_lock_killable(&vp->vp_mutex))
+		return -EINTR;
+
+	switch (ioctl) {
+	case MSHV_RUN_VP:
+		r = mshv_vp_ioctl_run_vp(vp, (void __user *)arg);
+		break;
+	case MSHV_GET_VP_STATE:
+		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, false);
+		break;
+	case MSHV_SET_VP_STATE:
+		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
+		break;
+	case MSHV_ROOT_HVCALL:
+		r = mshv_ioctl_passthru_hvcall(vp->vp_partition, false,
+					       (void __user *)arg);
+		break;
+	default:
+		vp_warn(vp, "Invalid ioctl: %#x\n", ioctl);
+		break;
+	}
+	mutex_unlock(&vp->vp_mutex);
+
+	return r;
+}
+
+static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
+{
+	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
+
+	switch (vmf->vma->vm_pgoff) {
+	case MSHV_VP_MMAP_OFFSET_REGISTERS:
+		vmf->page = virt_to_page(vp->vp_register_page);
+		break;
+	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
+		vmf->page = virt_to_page(vp->vp_intercept_msg_page);
+		break;
+	case MSHV_VP_MMAP_OFFSET_GHCB:
+		if (is_ghcb_mapping_available())
+			vmf->page = virt_to_page(vp->vp_ghcb_page);
+		break;
+	default:
+		return VM_FAULT_SIGBUS;
+	}
+
+	get_page(vmf->page);
+
+	return 0;
+}
+
+static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct mshv_vp *vp = file->private_data;
+
+	switch (vma->vm_pgoff) {
+	case MSHV_VP_MMAP_OFFSET_REGISTERS:
+		if (!vp->vp_register_page)
+			return -ENODEV;
+		break;
+	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
+		if (!vp->vp_intercept_msg_page)
+			return -ENODEV;
+		break;
+	case MSHV_VP_MMAP_OFFSET_GHCB:
+		if (!vp->vp_ghcb_page)
+			return -ENODEV;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	vma->vm_ops = &mshv_vp_vm_ops;
+	return 0;
+}
+
+static int
+mshv_vp_release(struct inode *inode, struct file *filp)
+{
+	struct mshv_vp *vp = filp->private_data;
+
+	/* Rest of VP cleanup happens in destroy_partition() */
+	mshv_partition_put(vp->vp_partition);
+	return 0;
+}
+
+static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index)
+{
+	union hv_stats_object_identity identity = {
+		.vp.partition_id = partition_id,
+		.vp.vp_index = vp_index,
+	};
+
+	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
+	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
+
+	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
+	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
+}
+
+static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
+			     void *stats_pages[])
+{
+	union hv_stats_object_identity identity = {
+		.vp.partition_id = partition_id,
+		.vp.vp_index = vp_index,
+	};
+	int err;
+
+	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
+	err = hv_call_map_stat_page(HV_STATS_OBJECT_VP, &identity,
+				    &stats_pages[HV_STATS_AREA_SELF]);
+	if (err)
+		return err;
+
+	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
+	err = hv_call_map_stat_page(HV_STATS_OBJECT_VP, &identity,
+				    &stats_pages[HV_STATS_AREA_PARENT]);
+	if (err)
+		goto unmap_self;
+
+	return 0;
+
+unmap_self:
+	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
+	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
+	return err;
+}
+
+static long
+mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
+			       void __user *arg)
+{
+	struct mshv_create_vp args;
+	struct mshv_vp *vp;
+	struct page *intercept_message_page, *register_page, *ghcb_page;
+	void *stats_pages[2];
+	long ret;
+	union hv_input_vtl input_vtl;
+
+	if (copy_from_user(&args, arg, sizeof(args)))
+		return -EFAULT;
+
+	if (args.vp_index >= MSHV_MAX_VPS)
+		return -EINVAL;
+
+	if (partition->pt_vp_array[args.vp_index])
+		return -EEXIST;
+
+	ret = hv_call_create_vp(NUMA_NO_NODE, partition->pt_id, args.vp_index,
+				0 /* Only valid for root partition VPs */);
+	if (ret)
+		return ret;
+
+	input_vtl.as_uint8 = 0;
+	ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
+					HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
+					input_vtl,
+					&intercept_message_page);
+	if (ret)
+		goto destroy_vp;
+
+	if (!mshv_partition_encrypted(partition)) {
+		input_vtl.as_uint8 = 0;
+		ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
+						HV_VP_STATE_PAGE_REGISTERS,
+						input_vtl,
+						&register_page);
+		if (ret)
+			goto unmap_intercept_message_page;
+	}
+
+	if (mshv_partition_encrypted(partition) &&
+	    is_ghcb_mapping_available()) {
+		input_vtl.as_uint8 = 0;
+		input_vtl.use_target_vtl = 1;
+		input_vtl.target_vtl = HV_NORMAL_VTL;
+		ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
+						HV_VP_STATE_PAGE_GHCB,
+						input_vtl,
+						&ghcb_page);
+		if (ret)
+			goto unmap_register_page;
+	}
+
+	if (hv_parent_partition()) {
+		ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
+					stats_pages);
+		if (ret)
+			goto unmap_ghcb_page;
+	}
+
+	vp = kzalloc(sizeof(*vp), GFP_KERNEL);
+	if (!vp) {
+		ret = -ENOMEM;
+		goto unmap_stats_pages;
+	}
+
+	vp->vp_partition = mshv_partition_get(partition);
+	if (!vp->vp_partition) {
+		ret = -EBADF;
+		goto free_vp;
+	}
+
+	mutex_init(&vp->vp_mutex);
+	init_waitqueue_head(&vp->run.vp_suspend_queue);
+	atomic64_set(&vp->run.vp_signaled_count, 0);
+
+	vp->vp_index = args.vp_index;
+	vp->vp_intercept_msg_page = page_to_virt(intercept_message_page);
+	if (!mshv_partition_encrypted(partition))
+		vp->vp_register_page = page_to_virt(register_page);
+
+	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
+		vp->vp_ghcb_page = page_to_virt(ghcb_page);
+
+	if (hv_parent_partition())
+		memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
+
+	/*
+	 * Keep anon_inode_getfd last: it installs fd in the file struct and
+	 * thus makes the state accessible in user space.
+	 */
+	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
+			       O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto put_partition;
+
+	/* already exclusive with the partition mutex for all ioctls */
+	partition->pt_vp_count++;
+	partition->pt_vp_array[args.vp_index] = vp;
+
+	return ret;
+
+put_partition:
+	mshv_partition_put(partition);
+free_vp:
+	kfree(vp);
+unmap_stats_pages:
+	if (hv_parent_partition())
+		mshv_vp_stats_unmap(partition->pt_id, args.vp_index);
+unmap_ghcb_page:
+	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available()) {
+		input_vtl.as_uint8 = 0;
+		input_vtl.use_target_vtl = 1;
+		input_vtl.target_vtl = HV_NORMAL_VTL;
+
+		hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index,
+					    HV_VP_STATE_PAGE_GHCB, input_vtl);
+	}
+unmap_register_page:
+	if (!mshv_partition_encrypted(partition)) {
+		input_vtl.as_uint8 = 0;
+
+		hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index,
+					    HV_VP_STATE_PAGE_REGISTERS,
+					    input_vtl);
+	}
+unmap_intercept_message_page:
+	input_vtl.as_uint8 = 0;
+	hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index,
+				    HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
+				    input_vtl);
+destroy_vp:
+	hv_call_delete_vp(partition->pt_id, args.vp_index);
+	return ret;
+}
+
+static int mshv_init_async_handler(struct mshv_partition *partition)
+{
+	if (completion_done(&partition->async_hypercall)) {
+		pt_err(partition,
+		       "Cannot issue an async hypercall while another one is in progress!\n");
+		return -EPERM;
+	}
+
+	reinit_completion(&partition->async_hypercall);
+	return 0;
+}
+
+static void mshv_async_hvcall_handler(void *data, u64 *status)
+{
+	struct mshv_partition *partition = data;
+
+	wait_for_completion(&partition->async_hypercall);
+	pt_dbg(partition, "Async hypercall completed!\n");
+
+	*status = partition->async_hypercall_status;
+}
+
+static int
+mshv_partition_region_share(struct mshv_mem_region *region)
+{
+	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
+
+	if (region->flags.large_pages)
+		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
+
+	return hv_call_modify_spa_host_access(region->partition->pt_id,
+			region->pages, region->nr_pages,
+			HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
+			flags, true);
+}
+
+static int
+mshv_partition_region_unshare(struct mshv_mem_region *region)
+{
+	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
+
+	if (region->flags.large_pages)
+		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
+
+	return hv_call_modify_spa_host_access(region->partition->pt_id,
+			region->pages, region->nr_pages,
+			0,
+			flags, false);
+}
+
+static int
+mshv_region_remap_pages(struct mshv_mem_region *region, u32 map_flags,
+			u64 page_offset, u64 page_count)
+{
+	if (page_offset + page_count > region->nr_pages)
+		return -EINVAL;
+
+	if (region->flags.large_pages)
+		map_flags |= HV_MAP_GPA_LARGE_PAGE;
+
+	/* ask the hypervisor to map guest ram */
+	return hv_call_map_gpa_pages(region->partition->pt_id,
+				     region->start_gfn + page_offset,
+				     page_count, map_flags,
+				     region->pages + page_offset);
+}
+
+static int
+mshv_region_map(struct mshv_mem_region *region)
+{
+	u32 map_flags = region->hv_map_flags;
+
+	return mshv_region_remap_pages(region, map_flags,
+				       0, region->nr_pages);
+}
+
+static void
+mshv_region_evict_pages(struct mshv_mem_region *region,
+			u64 page_offset, u64 page_count)
+{
+	if (region->flags.range_pinned)
+		unpin_user_pages(region->pages + page_offset, page_count);
+
+	memset(region->pages + page_offset, 0,
+	       page_count * sizeof(struct page *));
+}
+
+static void
+mshv_region_evict(struct mshv_mem_region *region)
+{
+	mshv_region_evict_pages(region, 0, region->nr_pages);
+}
+
+static int
+mshv_region_populate_pages(struct mshv_mem_region *region,
+			   u64 page_offset, u64 page_count)
+{
+	u64 done_count, nr_pages;
+	struct page **pages;
+	__u64 userspace_addr;
+	int ret;
+
+	if (page_offset + page_count > region->nr_pages)
+		return -EINVAL;
+
+	for (done_count = 0; done_count < page_count; done_count += ret) {
+		pages = region->pages + page_offset + done_count;
+		userspace_addr = region->start_uaddr +
+				(page_offset + done_count) *
+				HV_HYP_PAGE_SIZE;
+		nr_pages = min(page_count - done_count,
+			       MSHV_PIN_PAGES_BATCH_SIZE);
+
+		/*
+		 * Pinning assuming 4k pages works for large pages too.
+		 * All page structs within the large page are returned.
+		 *
+		 * Pin requests are batched because pin_user_pages_fast
+		 * with the FOLL_LONGTERM flag does a large temporary
+		 * allocation of contiguous memory.
+		 */
+		if (region->flags.range_pinned)
+			ret = pin_user_pages_fast(userspace_addr,
+						  nr_pages,
+						  FOLL_WRITE | FOLL_LONGTERM,
+						  pages);
+		else
+			ret = -EOPNOTSUPP;
+
+		if (ret < 0)
+			goto release_pages;
+	}
+
+	if (PageHuge(region->pages[page_offset]))
+		region->flags.large_pages = true;
+
+	return 0;
+
+release_pages:
+	mshv_region_evict_pages(region, page_offset, done_count);
+	return ret;
+}
+
+static int
+mshv_region_populate(struct mshv_mem_region *region)
+{
+	return mshv_region_populate_pages(region, 0, region->nr_pages);
+}
+
+static struct mshv_mem_region *
+mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
+{
+	struct mshv_mem_region *region;
+
+	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
+		if (gfn >= region->start_gfn &&
+		    gfn < region->start_gfn + region->nr_pages)
+			return region;
+	}
+
+	return NULL;
+}
+
+static struct mshv_mem_region *
+mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr)
+{
+	struct mshv_mem_region *region;
+
+	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
+		if (uaddr >= region->start_uaddr &&
+		    uaddr < region->start_uaddr +
+			    (region->nr_pages << HV_HYP_PAGE_SHIFT))
+			return region;
+	}
+
+	return NULL;
+}
+
+/*
+ * NB: caller checks and makes sure mem->size is page aligned
+ * Returns: 0 with regionpp updated on success, or -errno
+ */
+static int mshv_partition_create_region(struct mshv_partition *partition,
+					struct mshv_user_mem_region *mem,
+					struct mshv_mem_region **regionpp,
+					bool is_mmio)
+{
+	struct mshv_mem_region *region;
+	u64 nr_pages = HVPFN_DOWN(mem->size);
+
+	/* Reject overlapping regions */
+	if (mshv_partition_region_by_gfn(partition, mem->guest_pfn) ||
+	    mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1) ||
+	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr) ||
+	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr + mem->size - 1))
+		return -EEXIST;
+
+	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
+	if (!region)
+		return -ENOMEM;
+
+	region->nr_pages = nr_pages;
+	region->start_gfn = mem->guest_pfn;
+	region->start_uaddr = mem->userspace_addr;
+	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
+	if (mem->flags & BIT(MSHV_SET_MEM_BIT_WRITABLE))
+		region->hv_map_flags |= HV_MAP_GPA_WRITABLE;
+	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
+		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
+
+	/* Note: large_pages flag populated when we pin the pages */
+	if (!is_mmio)
+		region->flags.range_pinned = true;
+
+	region->partition = partition;
+
+	*regionpp = region;
+
+	return 0;
+}
+
+/*
+ * Map guest RAM. For SNP, make sure host access to it is released first.
+ * Side Effects: In case of failure, pages are unpinned when feasible.
+ */
+static int
+mshv_partition_mem_region_map(struct mshv_mem_region *region)
+{
+	struct mshv_partition *partition = region->partition;
+	int ret;
+
+	ret = mshv_region_populate(region);
+	if (ret) {
+		pt_err(partition, "Failed to populate memory region: %d\n",
+		       ret);
+		goto err_out;
+	}
+
+	/*
+	 * For an SNP partition it is a requirement that for every memory region
+	 * that we are going to map for this partition we should make sure that
+	 * host access to that region is released. This is ensured by doing an
+	 * additional hypercall which will update the SLAT to release host
+	 * access to guest memory regions.
+	 */
+	if (mshv_partition_encrypted(partition)) {
+		ret = mshv_partition_region_unshare(region);
+		if (ret) {
+			pt_err(partition,
+			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
+			       region->start_gfn, ret);
+			goto evict_region;
+		}
+	}
+
+	ret = mshv_region_map(region);
+	if (ret && mshv_partition_encrypted(partition)) {
+		int shrc;
+
+		shrc = mshv_partition_region_share(region);
+		if (!shrc)
+			goto evict_region;
+
+		pt_err(partition,
+		       "Failed to share memory region (guest_pfn: %llu): %d\n",
+		       region->start_gfn, shrc);
+		/*
+		 * Don't unpin if marking shared failed because the pages are
+		 * no longer mapped in the host (i.e. root).
+		 */
+		goto err_out;
+	}
+
+	return 0;
+
+evict_region:
+	mshv_region_evict(region);
+err_out:
+	return ret;
+}
+
+/*
+ * This maps two things: guest RAM and, for PCI passthru, MMIO space.
+ *
+ * mmio:
+ *  - vfio overloads vm_pgoff to store the mmio start pfn/spa.
+ *  - Two things need to happen for mapping mmio range:
+ *	1. mapped in the uaddr so VMM can access it.
+ *	2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
+ *
+ *   This function takes care of the second. The first one is managed by vfio,
+ *   and hence is taken care of via vfio_pci_mmap_fault().
+ */
+static long
+mshv_map_user_memory(struct mshv_partition *partition,
+		     struct mshv_user_mem_region mem)
+{
+	struct mshv_mem_region *region;
+	struct vm_area_struct *vma;
+	bool is_mmio;
+	ulong mmio_pfn;
+	long ret;
+
+	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
+	    !access_ok((const void *)mem.userspace_addr, mem.size))
+		return -EINVAL;
+
+	mmap_read_lock(current->mm);
+	vma = vma_lookup(current->mm, mem.userspace_addr);
+	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
+	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
+	mmap_read_unlock(current->mm);
+
+	if (!vma)
+		return -EINVAL;
+
+	ret = mshv_partition_create_region(partition, &mem, &region,
+					   is_mmio);
+	if (ret)
+		return ret;
+
+	if (is_mmio)
+		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
+					     mmio_pfn, HVPFN_DOWN(mem.size));
+	else
+		ret = mshv_partition_mem_region_map(region);
+
+	if (ret)
+		goto errout;
+
+	/* Install the new region */
+	hlist_add_head(&region->hnode, &partition->pt_mem_regions);
+
+	return 0;
+
+errout:
+	vfree(region);
+	return ret;
+}
+
+/* Called for unmapping both the guest ram and the mmio space */
+static long
+mshv_unmap_user_memory(struct mshv_partition *partition,
+		       struct mshv_user_mem_region mem)
+{
+	struct mshv_mem_region *region;
+	u32 unmap_flags = 0;
+
+	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
+		return -EINVAL;
+
+	if (hlist_empty(&partition->pt_mem_regions))
+		return -EINVAL;
+
+	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
+	if (!region)
+		return -EINVAL;
+
+	/* Paranoia check */
+	if (region->start_uaddr != mem.userspace_addr ||
+	    region->start_gfn != mem.guest_pfn ||
+	    region->nr_pages != HVPFN_DOWN(mem.size))
+		return -EINVAL;
+
+	hlist_del(&region->hnode);
+
+	if (region->flags.large_pages)
+		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
+
+	/* ignore unmap failures and continue as process may be exiting */
+	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
+				region->nr_pages, unmap_flags);
+
+	mshv_region_evict(region);
+
+	vfree(region);
+	return 0;
+}
+
+static long
+mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
+				struct mshv_user_mem_region __user *user_mem)
+{
+	struct mshv_user_mem_region mem;
+
+	if (copy_from_user(&mem, user_mem, sizeof(mem)))
+		return -EFAULT;
+
+	if (!mem.size ||
+	    !PAGE_ALIGNED(mem.size) ||
+	    !PAGE_ALIGNED(mem.userspace_addr) ||
+	    (mem.flags & ~MSHV_SET_MEM_FLAGS_MASK) ||
+	    mshv_field_nonzero(mem, rsvd))
+		return -EINVAL;
+
+	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
+		return mshv_unmap_user_memory(partition, mem);
+
+	return mshv_map_user_memory(partition, mem);
+}
+
+static long
+mshv_partition_ioctl_ioeventfd(struct mshv_partition *partition,
+			       void __user *user_args)
+{
+	struct mshv_user_ioeventfd args;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	return mshv_set_unset_ioeventfd(partition, &args);
+}
+
+static long
+mshv_partition_ioctl_irqfd(struct mshv_partition *partition,
+			   void __user *user_args)
+{
+	struct mshv_user_irqfd args;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	return mshv_set_unset_irqfd(partition, &args);
+}
+
+static long
+mshv_partition_ioctl_get_gpap_access_bitmap(struct mshv_partition *partition,
+					    void __user *user_args)
+{
+	struct mshv_gpap_access_bitmap args;
+	union hv_gpa_page_access_state *states;
+	long ret, i;
+	union hv_gpa_page_access_state_flags hv_flags = {};
+	u8 hv_type_mask;
+	ulong bitmap_buf_sz, states_buf_sz;
+	int written = 0;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	if (args.access_type >= MSHV_GPAP_ACCESS_TYPE_COUNT ||
+	    args.access_op >= MSHV_GPAP_ACCESS_OP_COUNT ||
+	    mshv_field_nonzero(args, rsvd) || !args.page_count ||
+	    !args.bitmap_ptr)
+		return -EINVAL;
+
+	if (check_mul_overflow(args.page_count, sizeof(*states), &states_buf_sz))
+		return -E2BIG;
+
+	/* Num bytes needed to store bitmap; one bit per page rounded up */
+	bitmap_buf_sz = DIV_ROUND_UP(args.page_count, 8);
+
+	/* Sanity check */
+	if (bitmap_buf_sz > states_buf_sz)
+		return -EBADFD;
+
+	switch (args.access_type) {
+	case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
+		hv_type_mask = 1;
+		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
+			hv_flags.clear_accessed = 1;
+			/* not accessed implies not dirty */
+			hv_flags.clear_dirty = 1;
+		} else { /* MSHV_GPAP_ACCESS_OP_SET */
+			hv_flags.set_accessed = 1;
+		}
+		break;
+	case MSHV_GPAP_ACCESS_TYPE_DIRTY:
+		hv_type_mask = 2;
+		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
+			hv_flags.clear_dirty = 1;
+		} else { /* MSHV_GPAP_ACCESS_OP_SET */
+			hv_flags.set_dirty = 1;
+			/* dirty implies accessed */
+			hv_flags.set_accessed = 1;
+		}
+		break;
+	}
+
+	states = vzalloc(states_buf_sz);
+	if (!states)
+		return -ENOMEM;
+
+	ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
+					    args.gpap_base, hv_flags, &written,
+					    states);
+	if (ret)
+		goto free_return;
+
+	/*
+	 * Overwrite states buffer with bitmap - the bits in hv_type_mask
+	 * correspond to bitfields in hv_gpa_page_access_state
+	 */
+	for (i = 0; i < written; ++i)
+		assign_bit(i, (ulong *)states,
+			   states[i].as_uint8 & hv_type_mask);
+
+	args.page_count = written;
+
+	if (copy_to_user(user_args, &args, sizeof(args))) {
+		ret = -EFAULT;
+		goto free_return;
+	}
+	if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
+		ret = -EFAULT;
+
+free_return:
+	vfree(states);
+	return ret;
+}
+
+static long
+mshv_partition_ioctl_set_msi_routing(struct mshv_partition *partition,
+				     void __user *user_args)
+{
+	struct mshv_user_irq_entry *entries = NULL;
+	struct mshv_user_irq_table args;
+	long ret;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	if (args.nr > MSHV_MAX_GUEST_IRQS ||
+	    mshv_field_nonzero(args, rsvd))
+		return -EINVAL;
+
+	if (args.nr) {
+		struct mshv_user_irq_table __user *urouting = user_args;
+
+		entries = vmemdup_user(urouting->entries,
+				       array_size(sizeof(*entries),
+						  args.nr));
+		if (IS_ERR(entries))
+			return PTR_ERR(entries);
+	}
+	ret = mshv_update_routing_table(partition, entries, args.nr);
+	kvfree(entries);
+
+	return ret;
+}
+
+static long
+mshv_partition_ioctl_initialize(struct mshv_partition *partition)
+{
+	long ret;
+
+	if (partition->pt_initialized)
+		return 0;
+
+	ret = hv_call_initialize_partition(partition->pt_id);
+	if (ret)
+		goto withdraw_mem;
+
+	partition->pt_initialized = true;
+
+	return 0;
+
+withdraw_mem:
+	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
+
+	return ret;
+}
+
+static long
+mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+{
+	struct mshv_partition *partition = filp->private_data;
+	long ret;
+	void __user *uarg = (void __user *)arg;
+
+	if (mutex_lock_killable(&partition->pt_mutex))
+		return -EINTR;
+
+	switch (ioctl) {
+	case MSHV_INITIALIZE_PARTITION:
+		ret = mshv_partition_ioctl_initialize(partition);
+		break;
+	case MSHV_SET_GUEST_MEMORY:
+		ret = mshv_partition_ioctl_set_memory(partition, uarg);
+		break;
+	case MSHV_CREATE_VP:
+		ret = mshv_partition_ioctl_create_vp(partition, uarg);
+		break;
+	case MSHV_IRQFD:
+		ret = mshv_partition_ioctl_irqfd(partition, uarg);
+		break;
+	case MSHV_IOEVENTFD:
+		ret = mshv_partition_ioctl_ioeventfd(partition, uarg);
+		break;
+	case MSHV_SET_MSI_ROUTING:
+		ret = mshv_partition_ioctl_set_msi_routing(partition, uarg);
+		break;
+	case MSHV_GET_GPAP_ACCESS_BITMAP:
+		ret = mshv_partition_ioctl_get_gpap_access_bitmap(partition,
+								  uarg);
+		break;
+	case MSHV_ROOT_HVCALL:
+		ret = mshv_ioctl_passthru_hvcall(partition, true, uarg);
+		break;
+	default:
+		ret = -ENOTTY;
+	}
+
+	mutex_unlock(&partition->pt_mutex);
+	return ret;
+}
+
+static int
+disable_vp_dispatch(struct mshv_vp *vp)
+{
+	int ret;
+	struct hv_register_assoc dispatch_suspend = {
+		.name = HV_REGISTER_DISPATCH_SUSPEND,
+		.value.dispatch_suspend.suspended = 1,
+	};
+
+	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
+				    1, &dispatch_suspend);
+	if (ret)
+		vp_err(vp, "failed to suspend\n");
+
+	return ret;
+}
+
+static int
+get_vp_signaled_count(struct mshv_vp *vp, u64 *count)
+{
+	int ret;
+	struct hv_register_assoc root_signal_count = {
+		.name = HV_REGISTER_VP_ROOT_SIGNAL_COUNT,
+	};
+
+	ret = mshv_get_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
+				    1, &root_signal_count);
+
+	if (ret) {
+		vp_err(vp, "Failed to get root signal count\n");
+		*count = 0;
+		return ret;
+	}
+
+	*count = root_signal_count.value.reg64;
+
+	return ret;
+}
+
+static void
+drain_vp_signals(struct mshv_vp *vp)
+{
+	u64 hv_signal_count;
+	u64 vp_signal_count;
+
+	get_vp_signaled_count(vp, &hv_signal_count);
+
+	vp_signal_count = atomic64_read(&vp->run.vp_signaled_count);
+
+	/*
+	 * There should be at most 1 outstanding notification, but be extra
+	 * careful anyway.
+	 */
+	while (hv_signal_count != vp_signal_count) {
+		WARN_ON(hv_signal_count - vp_signal_count != 1);
+
+		if (wait_event_interruptible(vp->run.vp_suspend_queue,
+					     vp->run.kicked_by_hv == 1))
+			break;
+		vp->run.kicked_by_hv = 0;
+		vp_signal_count = atomic64_read(&vp->run.vp_signaled_count);
+	}
+}
+
+static void drain_all_vps(const struct mshv_partition *partition)
+{
+	int i;
+	struct mshv_vp *vp;
+
+	/*
+	 * VPs are reachable from ISR. It is safe to not take the partition
+	 * lock because nobody else can enter this function and drop the
+	 * partition from the list.
+	 */
+	for (i = 0; i < MSHV_MAX_VPS; i++) {
+		vp = partition->pt_vp_array[i];
+		if (!vp)
+			continue;
+		/*
+		 * Disable dispatching of the VP in the hypervisor. After this
+		 * the hypervisor guarantees it won't generate any signals for
+		 * the VP and the hypervisor's VP signal count won't change.
+		 */
+		disable_vp_dispatch(vp);
+		drain_vp_signals(vp);
+	}
+}
+
+static void
+remove_partition(struct mshv_partition *partition)
+{
+	spin_lock(&mshv_root.pt_ht_lock);
+	hlist_del_rcu(&partition->pt_hnode);
+	spin_unlock(&mshv_root.pt_ht_lock);
+
+	synchronize_rcu();
+}
+
+/*
+ * Tear down a partition and remove it from the list.
+ * Partition's refcount must be 0
+ */
+static void destroy_partition(struct mshv_partition *partition)
+{
+	struct mshv_vp *vp;
+	struct mshv_mem_region *region;
+	int i, ret;
+	struct hlist_node *n;
+	union hv_input_vtl input_vtl;
+
+	if (refcount_read(&partition->pt_ref_count)) {
+		pt_err(partition,
+		       "Attempt to destroy partition but refcount > 0\n");
+		return;
+	}
+
+	if (partition->pt_initialized) {
+		/*
+		 * We only need to drain signals for the root scheduler. This should
+		 * be done before removing the partition from the partition list.
+		 */
+		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
+			drain_all_vps(partition);
+
+		/* Remove vps */
+		for (i = 0; i < MSHV_MAX_VPS; ++i) {
+			vp = partition->pt_vp_array[i];
+			if (!vp)
+				continue;
+
+			if (hv_parent_partition())
+				mshv_vp_stats_unmap(partition->pt_id, vp->vp_index);
+
+			if (vp->vp_register_page) {
+				input_vtl.as_uint8 = 0;
+				(void)hv_call_unmap_vp_state_page(partition->pt_id,
+								  vp->vp_index,
+								  HV_VP_STATE_PAGE_REGISTERS,
+								  input_vtl);
+				vp->vp_register_page = NULL;
+			}
+
+			input_vtl.as_uint8 = 0;
+			(void)hv_call_unmap_vp_state_page(partition->pt_id,
+							  vp->vp_index,
+							  HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
+							  input_vtl);
+			vp->vp_intercept_msg_page = NULL;
+
+			if (vp->vp_ghcb_page) {
+				input_vtl.use_target_vtl = 1;
+				input_vtl.target_vtl = HV_NORMAL_VTL;
+				(void)hv_call_unmap_vp_state_page(partition->pt_id,
+								  vp->vp_index,
+								  HV_VP_STATE_PAGE_GHCB,
+								  input_vtl);
+				vp->vp_ghcb_page = NULL;
+			}
+
+			kfree(vp);
+
+			partition->pt_vp_array[i] = NULL;
+		}
+
+		/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
+		hv_call_finalize_partition(partition->pt_id);
+
+		partition->pt_initialized = false;
+	}
+
+	remove_partition(partition);
+
+	/* Remove regions, regain access to the memory and unpin the pages */
+	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
+				  hnode) {
+		hlist_del(&region->hnode);
+
+		if (mshv_partition_encrypted(partition)) {
+			ret = mshv_partition_region_share(region);
+			if (ret) {
+				pt_err(partition,
+				       "Failed to regain access to memory; unpinning user pages would fail and crash the host (error: %d)\n",
+				       ret);
+				return;
+			}
+		}
+
+		mshv_region_evict(region);
+
+		vfree(region);
+	}
+
+	/* Withdraw and free all pages we deposited */
+	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
+	hv_call_delete_partition(partition->pt_id);
+
+	mshv_free_routing_table(partition);
+	kfree(partition);
+}
+
+struct
+mshv_partition *mshv_partition_get(struct mshv_partition *partition)
+{
+	if (refcount_inc_not_zero(&partition->pt_ref_count))
+		return partition;
+	return NULL;
+}
+
+struct
+mshv_partition *mshv_partition_find(u64 partition_id)
+	__must_hold(RCU)
+{
+	struct mshv_partition *p;
+
+	hash_for_each_possible_rcu(mshv_root.pt_htable, p, pt_hnode,
+				   partition_id)
+		if (p->pt_id == partition_id)
+			return p;
+
+	return NULL;
+}
+
+void
+mshv_partition_put(struct mshv_partition *partition)
+{
+	if (refcount_dec_and_test(&partition->pt_ref_count))
+		destroy_partition(partition);
+}
+
+static int
+mshv_partition_release(struct inode *inode, struct file *filp)
+{
+	struct mshv_partition *partition = filp->private_data;
+
+	mshv_eventfd_release(partition);
+
+	cleanup_srcu_struct(&partition->pt_irq_srcu);
+
+	mshv_partition_put(partition);
+
+	return 0;
+}
+
+static int
+add_partition(struct mshv_partition *partition)
+{
+	spin_lock(&mshv_root.pt_ht_lock);
+
+	hash_add_rcu(mshv_root.pt_htable, &partition->pt_hnode,
+		     partition->pt_id);
+
+	spin_unlock(&mshv_root.pt_ht_lock);
+
+	return 0;
+}
+
+static long
+mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
+{
+	struct mshv_create_partition args;
+	u64 creation_flags;
+	struct hv_partition_creation_properties creation_properties = {};
+	union hv_partition_isolation_properties isolation_properties = {};
+	struct mshv_partition *partition;
+	struct file *file;
+	int fd;
+	long ret;
+
+	if (copy_from_user(&args, user_arg, sizeof(args)))
+		return -EFAULT;
+
+	if ((args.pt_flags & ~MSHV_PT_FLAGS_MASK) ||
+	    args.pt_isolation >= MSHV_PT_ISOLATION_COUNT)
+		return -EINVAL;
+
+	/* Only support EXO partitions */
+	creation_flags = HV_PARTITION_CREATION_FLAG_EXO_PARTITION |
+			 HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED;
+
+	if (args.pt_flags & BIT(MSHV_PT_BIT_LAPIC))
+		creation_flags |= HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED;
+	if (args.pt_flags & BIT(MSHV_PT_BIT_X2APIC))
+		creation_flags |= HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE;
+	if (args.pt_flags & BIT(MSHV_PT_BIT_GPA_SUPER_PAGES))
+		creation_flags |= HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED;
+
+	switch (args.pt_isolation) {
+	case MSHV_PT_ISOLATION_NONE:
+		isolation_properties.isolation_type =
+			HV_PARTITION_ISOLATION_TYPE_NONE;
+		break;
+	}
+
+	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
+	if (!partition)
+		return -ENOMEM;
+
+	partition->pt_module_dev = module_dev;
+	partition->isolation_type = isolation_properties.isolation_type;
+
+	refcount_set(&partition->pt_ref_count, 1);
+
+	mutex_init(&partition->pt_mutex);
+
+	mutex_init(&partition->pt_irq_lock);
+
+	init_completion(&partition->async_hypercall);
+
+	INIT_HLIST_HEAD(&partition->irq_ack_notifier_list);
+
+	INIT_HLIST_HEAD(&partition->pt_devices);
+
+	INIT_HLIST_HEAD(&partition->pt_mem_regions);
+
+	mshv_eventfd_init(partition);
+
+	ret = init_srcu_struct(&partition->pt_irq_srcu);
+	if (ret)
+		goto free_partition;
+
+	ret = hv_call_create_partition(creation_flags,
+				       creation_properties,
+				       isolation_properties,
+				       &partition->pt_id);
+	if (ret)
+		goto cleanup_irq_srcu;
+
+	ret = add_partition(partition);
+	if (ret)
+		goto delete_partition;
+
+	ret = mshv_init_async_handler(partition);
+	if (ret)
+		goto remove_partition;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto remove_partition;
+	}
+
+	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
+				  partition, O_RDWR);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto put_fd;
+	}
+
+	fd_install(fd, file);
+
+	return fd;
+
+put_fd:
+	put_unused_fd(fd);
+remove_partition:
+	remove_partition(partition);
+delete_partition:
+	hv_call_delete_partition(partition->pt_id);
+cleanup_irq_srcu:
+	cleanup_srcu_struct(&partition->pt_irq_srcu);
+free_partition:
+	kfree(partition);
+
+	return ret;
+}
+
+static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	struct miscdevice *misc = filp->private_data;
+
+	switch (ioctl) {
+	case MSHV_CREATE_PARTITION:
+		return mshv_ioctl_create_partition((void __user *)arg,
+						misc->this_device);
+	}
+
+	return -ENOTTY;
+}
+
+static int
+mshv_dev_open(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static int
+mshv_dev_release(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static int mshv_cpuhp_online;
+static int mshv_root_sched_online;
+
+static const char *scheduler_type_to_string(enum hv_scheduler_type type)
+{
+	switch (type) {
+	case HV_SCHEDULER_TYPE_LP:
+		return "classic scheduler without SMT";
+	case HV_SCHEDULER_TYPE_LP_SMT:
+		return "classic scheduler with SMT";
+	case HV_SCHEDULER_TYPE_CORE_SMT:
+		return "core scheduler";
+	case HV_SCHEDULER_TYPE_ROOT:
+		return "root scheduler";
+	default:
+		return "unknown scheduler";
+	}
+}
+
+/* TODO move this to hv_common.c when needed outside */
+static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
+{
+	struct hv_input_get_system_property *input;
+	struct hv_output_get_system_property *output;
+	unsigned long flags;
+	u64 status;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	memset(input, 0, sizeof(*input));
+	memset(output, 0, sizeof(*output));
+	input->property_id = HV_SYSTEM_PROPERTY_SCHEDULER_TYPE;
+
+	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
+	if (!hv_result_success(status)) {
+		local_irq_restore(flags);
+		pr_err("%s: %s\n", __func__, hv_result_to_string(status));
+		return hv_result_to_errno(status);
+	}
+
+	*out = output->scheduler_type;
+	local_irq_restore(flags);
+
+	return 0;
+}
+
+/* Retrieve and stash the supported scheduler type */
+static int __init mshv_retrieve_scheduler_type(struct device *dev)
+{
+	int ret;
+
+	ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
+	if (ret)
+		return ret;
+
+	dev_info(dev, "Hypervisor using %s\n",
+		 scheduler_type_to_string(hv_scheduler_type));
+
+	switch (hv_scheduler_type) {
+	case HV_SCHEDULER_TYPE_CORE_SMT:
+	case HV_SCHEDULER_TYPE_LP_SMT:
+	case HV_SCHEDULER_TYPE_ROOT:
+	case HV_SCHEDULER_TYPE_LP:
+		/* Supported scheduler, nothing to do */
+		break;
+	default:
+		dev_err(dev, "unsupported scheduler 0x%x, bailing.\n",
+			hv_scheduler_type);
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int mshv_root_scheduler_init(unsigned int cpu)
+{
+	void **inputarg, **outputarg, *p;
+
+	inputarg = (void **)this_cpu_ptr(root_scheduler_input);
+	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
+
+	/* Allocate two consecutive pages. One for input, one for output. */
+	p = kmalloc(2 * HV_HYP_PAGE_SIZE, GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	*inputarg = p;
+	*outputarg = (char *)p + HV_HYP_PAGE_SIZE;
+
+	return 0;
+}
+
+static int mshv_root_scheduler_cleanup(unsigned int cpu)
+{
+	void *p, **inputarg, **outputarg;
+
+	inputarg = (void **)this_cpu_ptr(root_scheduler_input);
+	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
+
+	p = *inputarg;
+
+	*inputarg = NULL;
+	*outputarg = NULL;
+
+	kfree(p);
+
+	return 0;
+}
+
+/* Must be called after retrieving the scheduler type */
+static int
+root_scheduler_init(struct device *dev)
+{
+	int ret;
+
+	if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT)
+		return 0;
+
+	root_scheduler_input = alloc_percpu(void *);
+	root_scheduler_output = alloc_percpu(void *);
+
+	if (!root_scheduler_input || !root_scheduler_output) {
+		dev_err(dev, "Failed to allocate root scheduler buffers\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_root_sched",
+				mshv_root_scheduler_init,
+				mshv_root_scheduler_cleanup);
+
+	if (ret < 0) {
+		dev_err(dev, "Failed to setup root scheduler state: %i\n", ret);
+		goto out;
+	}
+
+	mshv_root_sched_online = ret;
+
+	return 0;
+
+out:
+	free_percpu(root_scheduler_input);
+	free_percpu(root_scheduler_output);
+	return ret;
+}
+
+static void
+root_scheduler_deinit(void)
+{
+	if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT)
+		return;
+
+	cpuhp_remove_state(mshv_root_sched_online);
+	free_percpu(root_scheduler_input);
+	free_percpu(root_scheduler_output);
+}
+
+static int mshv_reboot_notify(struct notifier_block *nb,
+			      unsigned long code, void *unused)
+{
+	cpuhp_remove_state(mshv_cpuhp_online);
+	return 0;
+}
+
+struct notifier_block mshv_reboot_nb = {
+	.notifier_call = mshv_reboot_notify,
+};
+
+static void mshv_root_partition_exit(void)
+{
+	unregister_reboot_notifier(&mshv_reboot_nb);
+	root_scheduler_deinit();
+}
+
+static int __init mshv_root_partition_init(struct device *dev)
+{
+	int err;
+
+	if (mshv_retrieve_scheduler_type(dev))
+		return -ENODEV;
+
+	err = root_scheduler_init(dev);
+	if (err)
+		return err;
+
+	err = register_reboot_notifier(&mshv_reboot_nb);
+	if (err)
+		goto root_sched_deinit;
+
+	return 0;
+
+root_sched_deinit:
+	root_scheduler_deinit();
+	return err;
+}
+
+static int __init mshv_parent_partition_init(void)
+{
+	int ret;
+	struct device *dev;
+	union hv_hypervisor_version_info version_info;
+
+	if (!hv_root_partition() || is_kdump_kernel())
+		return -ENODEV;
+
+	if (hv_get_hypervisor_version(&version_info))
+		return -ENODEV;
+
+	ret = misc_register(&mshv_dev);
+	if (ret)
+		return ret;
+
+	dev = mshv_dev.this_device;
+
+	if (version_info.build_number < MSHV_HV_MIN_VERSION ||
+	    version_info.build_number > MSHV_HV_MAX_VERSION) {
+		dev_err(dev, "Running on unvalidated Hyper-V version\n");
+		dev_err(dev, "Versions: current: %u  min: %u  max: %u\n",
+			version_info.build_number, MSHV_HV_MIN_VERSION,
+			MSHV_HV_MAX_VERSION);
+	}
+
+	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
+	if (!mshv_root.synic_pages) {
+		dev_err(dev, "Failed to allocate percpu synic page\n");
+		ret = -ENOMEM;
+		goto device_deregister;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
+				mshv_synic_init,
+				mshv_synic_cleanup);
+	if (ret < 0) {
+		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
+		goto free_synic_pages;
+	}
+
+	mshv_cpuhp_online = ret;
+
+	ret = mshv_root_partition_init(dev);
+	if (ret)
+		goto remove_cpu_state;
+
+	ret = mshv_irqfd_wq_init();
+	if (ret)
+		goto exit_partition;
+
+	spin_lock_init(&mshv_root.pt_ht_lock);
+	hash_init(mshv_root.pt_htable);
+
+	hv_setup_mshv_handler(mshv_isr);
+
+	return 0;
+
+exit_partition:
+	if (hv_root_partition())
+		mshv_root_partition_exit();
+remove_cpu_state:
+	cpuhp_remove_state(mshv_cpuhp_online);
+free_synic_pages:
+	free_percpu(mshv_root.synic_pages);
+device_deregister:
+	misc_deregister(&mshv_dev);
+	return ret;
+}
+
+static void __exit mshv_parent_partition_exit(void)
+{
+	hv_setup_mshv_handler(NULL);
+	mshv_port_table_fini();
+	misc_deregister(&mshv_dev);
+	mshv_irqfd_wq_cleanup();
+	if (hv_root_partition())
+		mshv_root_partition_exit();
+	cpuhp_remove_state(mshv_cpuhp_online);
+	free_percpu(mshv_root.synic_pages);
+}
+
+module_init(mshv_parent_partition_init);
+module_exit(mshv_parent_partition_exit);
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
new file mode 100644
index 000000000000..e7782f92e339
--- /dev/null
+++ b/drivers/hv/mshv_synic.c
@@ -0,0 +1,665 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, Microsoft Corporation.
+ *
+ * mshv_root module's main interrupt handler and associated functionality.
+ *
+ * Authors:
+ *   Nuno Das Neves <nunodasneves@linux.microsoft.com>
+ *   Lillian Grassin-Drake <ligrassi@microsoft.com>
+ *   Vineeth Remanan Pillai <viremana@linux.microsoft.com>
+ *   Wei Liu <wei.liu@kernel.org>
+ *   Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/random.h>
+#include <asm/mshyperv.h>
+
+#include "mshv_eventfd.h"
+#include "mshv.h"
+
+static u32 synic_event_ring_get_queued_port(u32 sint_index)
+{
+	struct hv_synic_event_ring_page **event_ring_page;
+	volatile struct hv_synic_event_ring *ring;
+	struct hv_synic_pages *spages;
+	u8 **synic_eventring_tail;
+	u32 message;
+	u8 tail;
+
+	spages = this_cpu_ptr(mshv_root.synic_pages);
+	event_ring_page = &spages->synic_event_ring_page;
+	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
+	tail = (*synic_eventring_tail)[sint_index];
+
+	if (unlikely(!(*event_ring_page))) {
+		pr_debug("Missing synic event ring page!\n");
+		return 0;
+	}
+
+	ring = &(*event_ring_page)->sint_event_ring[sint_index];
+
+	/*
+	 * Get the message.
+	 */
+	message = ring->data[tail];
+
+	if (!message) {
+		if (ring->ring_full) {
+			/*
+			 * Ring is marked full, but we would have consumed all
+			 * the messages. Notify the hypervisor that ring is now
+			 * empty and check again.
+			 */
+			ring->ring_full = 0;
+			hv_call_notify_port_ring_empty(sint_index);
+			message = ring->data[tail];
+		}
+
+		if (!message) {
+			ring->signal_masked = 0;
+			/*
+			 * Unmask the signal and sync with hypervisor
+			 * before one last check for any message.
+			 */
+			mb();
+			message = ring->data[tail];
+
+			/*
+			 * OK, let's bail out.
+			 */
+			if (!message)
+				return 0;
+		}
+
+		ring->signal_masked = 1;
+	}
+
+	/*
+	 * Clear the message in the ring buffer.
+	 */
+	ring->data[tail] = 0;
+
+	if (++tail == HV_SYNIC_EVENT_RING_MESSAGE_COUNT)
+		tail = 0;
+
+	(*synic_eventring_tail)[sint_index] = tail;
+
+	return message;
+}
+
+static bool
+mshv_doorbell_isr(struct hv_message *msg)
+{
+	struct hv_notification_message_payload *notification;
+	u32 port;
+
+	if (msg->header.message_type != HVMSG_SYNIC_SINT_INTERCEPT)
+		return false;
+
+	notification = (struct hv_notification_message_payload *)msg->u.payload;
+	if (notification->sint_index != HV_SYNIC_DOORBELL_SINT_INDEX)
+		return false;
+
+	while ((port = synic_event_ring_get_queued_port(HV_SYNIC_DOORBELL_SINT_INDEX))) {
+		struct port_table_info ptinfo = { 0 };
+
+		if (mshv_portid_lookup(port, &ptinfo)) {
+			pr_debug("Failed to get port info from port_table!\n");
+			continue;
+		}
+
+		if (ptinfo.hv_port_type != HV_PORT_TYPE_DOORBELL) {
+			pr_debug("Not a doorbell port!, port: %d, port_type: %d\n",
+				 port, ptinfo.hv_port_type);
+			continue;
+		}
+
+		/* Invoke the callback */
+		ptinfo.hv_port_doorbell.doorbell_cb(port,
+						 ptinfo.hv_port_doorbell.data);
+	}
+
+	return true;
+}
+
+static bool mshv_async_call_completion_isr(struct hv_message *msg)
+{
+	bool handled = false;
+	struct hv_async_completion_message_payload *async_msg;
+	struct mshv_partition *partition;
+	u64 partition_id;
+
+	if (msg->header.message_type != HVMSG_ASYNC_CALL_COMPLETION)
+		goto out;
+
+	async_msg =
+		(struct hv_async_completion_message_payload *)msg->u.payload;
+
+	partition_id = async_msg->partition_id;
+
+	/*
+	 * Hold this lock for the rest of the isr, because the partition could
+	 * be released anytime.
+	 * e.g. the MSHV_RUN_VP thread could wake on another cpu; it could
+	 * release the partition unless we hold this!
+	 */
+	rcu_read_lock();
+
+	partition = mshv_partition_find(partition_id);
+
+	if (unlikely(!partition)) {
+		pr_debug("failed to find partition %llu\n", partition_id);
+		goto unlock_out;
+	}
+
+	partition->async_hypercall_status = async_msg->status;
+	complete(&partition->async_hypercall);
+
+	handled = true;
+
+unlock_out:
+	rcu_read_unlock();
+out:
+	return handled;
+}
+
+static void kick_vp(struct mshv_vp *vp)
+{
+	atomic64_inc(&vp->run.vp_signaled_count);
+	vp->run.kicked_by_hv = 1;
+	wake_up(&vp->run.vp_suspend_queue);
+}
+
+static void
+handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
+{
+	int bank_idx, vps_signaled = 0, bank_mask_size;
+	struct mshv_partition *partition;
+	const struct hv_vpset *vpset;
+	const u64 *bank_contents;
+	u64 partition_id = msg->partition_id;
+
+	if (msg->vp_bitset.bitset.format != HV_GENERIC_SET_SPARSE_4K) {
+		pr_debug("scheduler message format is not HV_GENERIC_SET_SPARSE_4K\n");
+		return;
+	}
+
+	if (msg->vp_count == 0) {
+		pr_debug("scheduler message with no VP specified\n");
+		return;
+	}
+
+	rcu_read_lock();
+
+	partition = mshv_partition_find(partition_id);
+	if (unlikely(!partition)) {
+		pr_debug("failed to find partition %llu\n", partition_id);
+		goto unlock_out;
+	}
+
+	vpset = &msg->vp_bitset.bitset;
+
+	bank_idx = -1;
+	bank_contents = vpset->bank_contents;
+	bank_mask_size = sizeof(vpset->valid_bank_mask) * BITS_PER_BYTE;
+
+	while (true) {
+		int vp_bank_idx = -1;
+		int vp_bank_size = sizeof(*bank_contents) * BITS_PER_BYTE;
+		int vp_index;
+
+		bank_idx = find_next_bit((unsigned long *)&vpset->valid_bank_mask,
+					 bank_mask_size, bank_idx + 1);
+		if (bank_idx == bank_mask_size)
+			break;
+
+		while (true) {
+			struct mshv_vp *vp;
+
+			vp_bank_idx = find_next_bit((unsigned long *)bank_contents,
+						    vp_bank_size, vp_bank_idx + 1);
+			if (vp_bank_idx == vp_bank_size)
+				break;
+
+			vp_index = (bank_idx << HV_GENERIC_SET_SHIFT) + vp_bank_idx;
+
+			/* This shouldn't happen, but just in case. */
+			if (unlikely(vp_index >= MSHV_MAX_VPS)) {
+				pr_debug("VP index %u out of bounds\n",
+					 vp_index);
+				goto unlock_out;
+			}
+
+			vp = partition->pt_vp_array[vp_index];
+			if (unlikely(!vp)) {
+				pr_debug("failed to find VP %u\n", vp_index);
+				goto unlock_out;
+			}
+
+			kick_vp(vp);
+			vps_signaled++;
+		}
+
+		bank_contents++;
+	}
+
+unlock_out:
+	rcu_read_unlock();
+
+	if (vps_signaled != msg->vp_count)
+		pr_debug("asked to signal %u VPs but only did %u\n",
+			 msg->vp_count, vps_signaled);
+}
+
+static void
+handle_pair_message(const struct hv_vp_signal_pair_scheduler_message *msg)
+{
+	struct mshv_partition *partition = NULL;
+	struct mshv_vp *vp;
+	int idx;
+
+	rcu_read_lock();
+
+	for (idx = 0; idx < msg->vp_count; idx++) {
+		u64 partition_id = msg->partition_ids[idx];
+		u32 vp_index = msg->vp_indexes[idx];
+
+		if (idx == 0 || partition->pt_id != partition_id) {
+			partition = mshv_partition_find(partition_id);
+			if (unlikely(!partition)) {
+				pr_debug("failed to find partition %llu\n",
+					 partition_id);
+				break;
+			}
+		}
+
+		/* This shouldn't happen, but just in case. */
+		if (unlikely(vp_index >= MSHV_MAX_VPS)) {
+			pr_debug("VP index %u out of bounds\n", vp_index);
+			break;
+		}
+
+		vp = partition->pt_vp_array[vp_index];
+		if (!vp) {
+			pr_debug("failed to find VP %u\n", vp_index);
+			break;
+		}
+
+		kick_vp(vp);
+	}
+
+	rcu_read_unlock();
+}
+
+static bool
+mshv_scheduler_isr(struct hv_message *msg)
+{
+	if (msg->header.message_type != HVMSG_SCHEDULER_VP_SIGNAL_BITSET &&
+	    msg->header.message_type != HVMSG_SCHEDULER_VP_SIGNAL_PAIR)
+		return false;
+
+	if (msg->header.message_type == HVMSG_SCHEDULER_VP_SIGNAL_BITSET)
+		handle_bitset_message((struct hv_vp_signal_bitset_scheduler_message *)
+				      msg->u.payload);
+	else
+		handle_pair_message((struct hv_vp_signal_pair_scheduler_message *)
+				    msg->u.payload);
+
+	return true;
+}
+
+static bool
+mshv_intercept_isr(struct hv_message *msg)
+{
+	struct mshv_partition *partition;
+	bool handled = false;
+	struct mshv_vp *vp;
+	u64 partition_id;
+	u32 vp_index;
+
+	partition_id = msg->header.sender;
+
+	rcu_read_lock();
+
+	partition = mshv_partition_find(partition_id);
+	if (unlikely(!partition)) {
+		pr_debug("failed to find partition %llu\n",
+			 partition_id);
+		goto unlock_out;
+	}
+
+	if (msg->header.message_type == HVMSG_X64_APIC_EOI) {
+		/*
+		 * Check if this gsi is registered in the
+		 * ack_notifier list and invoke the callback
+		 * if registered.
+		 */
+
+		/*
+		 * If there is a notifier, the ack callback is supposed
+		 * to handle the VMEXIT. So we need not pass this message
+		 * to vcpu thread.
+		 */
+		struct hv_x64_apic_eoi_message *eoi_msg =
+			(struct hv_x64_apic_eoi_message *)&msg->u.payload[0];
+
+		if (mshv_notify_acked_gsi(partition, eoi_msg->interrupt_vector)) {
+			handled = true;
+			goto unlock_out;
+		}
+	}
+
+	/*
+	 * We should get an opaque intercept message here for all intercept
+	 * messages, since we're using the mapped VP intercept message page.
+	 *
+	 * The intercept message will have been placed in intercept message
+	 * page at this point.
+	 *
+	 * Make sure the message type matches our expectation.
+	 */
+	if (msg->header.message_type != HVMSG_OPAQUE_INTERCEPT) {
+		pr_debug("wrong message type %d\n", msg->header.message_type);
+		goto unlock_out;
+	}
+
+	/*
+	 * Since we directly index the vp, and it has to exist for us to be here
+	 * (because the vp is only deleted when the partition is), no additional
+	 * locking is needed here
+	 */
+	vp_index =
+	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
+	vp = partition->pt_vp_array[vp_index];
+	if (unlikely(!vp)) {
+		pr_debug("failed to find VP %u\n", vp_index);
+		goto unlock_out;
+	}
+
+	kick_vp(vp);
+
+	handled = true;
+
+unlock_out:
+	rcu_read_unlock();
+
+	return handled;
+}
+
+void mshv_isr(void)
+{
+	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_message_page **msg_page = &spages->synic_message_page;
+	struct hv_message *msg;
+	bool handled;
+
+	if (unlikely(!(*msg_page))) {
+		pr_debug("Missing synic page!\n");
+		return;
+	}
+
+	msg = &((*msg_page)->sint_message[HV_SYNIC_INTERCEPTION_SINT_INDEX]);
+
+	/*
+	 * If the type isn't set, there isn't really a message;
+	 * it may be some other hyperv interrupt
+	 */
+	if (msg->header.message_type == HVMSG_NONE)
+		return;
+
+	handled = mshv_doorbell_isr(msg);
+
+	if (!handled)
+		handled = mshv_scheduler_isr(msg);
+
+	if (!handled)
+		handled = mshv_async_call_completion_isr(msg);
+
+	if (!handled)
+		handled = mshv_intercept_isr(msg);
+
+	if (handled) {
+		/*
+		 * Acknowledge message with hypervisor if another message is
+		 * pending.
+		 */
+		msg->header.message_type = HVMSG_NONE;
+		/*
+		 * Ensure the write is complete so the hypervisor will deliver
+		 * the next message if available.
+		 */
+		mb();
+		if (msg->header.message_flags.msg_pending)
+			hv_set_non_nested_msr(HV_MSR_EOM, 0);
+
+#ifdef HYPERVISOR_CALLBACK_VECTOR
+		add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
+#endif
+	} else {
+		pr_warn_once("%s: unknown message type 0x%x\n", __func__,
+			     msg->header.message_type);
+	}
+}
+
+int mshv_synic_init(unsigned int cpu)
+{
+	union hv_synic_simp simp;
+	union hv_synic_siefp siefp;
+	union hv_synic_sirbp sirbp;
+#ifdef HYPERVISOR_CALLBACK_VECTOR
+	union hv_synic_sint sint;
+#endif
+	union hv_synic_scontrol sctrl;
+	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_message_page **msg_page = &spages->synic_message_page;
+	struct hv_synic_event_flags_page **event_flags_page =
+			&spages->synic_event_flags_page;
+	struct hv_synic_event_ring_page **event_ring_page =
+			&spages->synic_event_ring_page;
+
+	/* Setup the Synic's message page */
+	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+	simp.simp_enabled = true;
+	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
+			     HV_HYP_PAGE_SIZE,
+			     MEMREMAP_WB);
+
+	if (!(*msg_page))
+		return -EFAULT;
+
+	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+
+	/* Setup the Synic's event flags page */
+	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+	siefp.siefp_enabled = true;
+	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
+				     PAGE_SIZE, MEMREMAP_WB);
+
+	if (!(*event_flags_page))
+		goto cleanup;
+
+	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+
+	/* Setup the Synic's event ring page */
+	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
+	sirbp.sirbp_enabled = true;
+	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
+				    PAGE_SIZE, MEMREMAP_WB);
+
+	if (!(*event_ring_page))
+		goto cleanup;
+
+	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+
+#ifdef HYPERVISOR_CALLBACK_VECTOR
+	/* Enable intercepts */
+	sint.as_uint64 = 0;
+	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.masked = false;
+	sint.auto_eoi = hv_recommend_using_aeoi();
+	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
+			      sint.as_uint64);
+
+	/* Doorbell SINT */
+	sint.as_uint64 = 0;
+	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.masked = false;
+	sint.as_intercept = 1;
+	sint.auto_eoi = hv_recommend_using_aeoi();
+	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
+			      sint.as_uint64);
+#endif
+
+	/* Enable global synic bit */
+	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+	sctrl.enable = 1;
+	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+
+	return 0;
+
+cleanup:
+	if (*event_ring_page) {
+		sirbp.sirbp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		memunmap(*event_ring_page);
+	}
+	if (*event_flags_page) {
+		siefp.siefp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+		memunmap(*event_flags_page);
+	}
+	if (*msg_page) {
+		simp.simp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+		memunmap(*msg_page);
+	}
+
+	return -EFAULT;
+}
+
+int mshv_synic_cleanup(unsigned int cpu)
+{
+	union hv_synic_sint sint;
+	union hv_synic_simp simp;
+	union hv_synic_siefp siefp;
+	union hv_synic_sirbp sirbp;
+	union hv_synic_scontrol sctrl;
+	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_message_page **msg_page = &spages->synic_message_page;
+	struct hv_synic_event_flags_page **event_flags_page =
+		&spages->synic_event_flags_page;
+	struct hv_synic_event_ring_page **event_ring_page =
+		&spages->synic_event_ring_page;
+
+	/* Disable the interrupt */
+	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
+	sint.masked = true;
+	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
+			      sint.as_uint64);
+
+	/* Disable Doorbell SINT */
+	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX);
+	sint.masked = true;
+	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
+			      sint.as_uint64);
+
+	/* Disable Synic's event ring page */
+	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
+	sirbp.sirbp_enabled = false;
+	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+	memunmap(*event_ring_page);
+
+	/* Disable Synic's event flags page */
+	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+	siefp.siefp_enabled = false;
+	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	memunmap(*event_flags_page);
+
+	/* Disable Synic's message page */
+	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+	simp.simp_enabled = false;
+	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	memunmap(*msg_page);
+
+	/* Disable global synic bit */
+	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+	sctrl.enable = 0;
+	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+
+	return 0;
+}
+
+int
+mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb, void *data,
+		       u64 gpa, u64 val, u64 flags)
+{
+	struct hv_connection_info connection_info = { 0 };
+	union hv_connection_id connection_id = { 0 };
+	struct port_table_info *port_table_info;
+	struct hv_port_info port_info = { 0 };
+	union hv_port_id port_id = { 0 };
+	int ret;
+
+	port_table_info = kmalloc(sizeof(*port_table_info), GFP_KERNEL);
+	if (!port_table_info)
+		return -ENOMEM;
+
+	port_table_info->hv_port_type = HV_PORT_TYPE_DOORBELL;
+	port_table_info->hv_port_doorbell.doorbell_cb = doorbell_cb;
+	port_table_info->hv_port_doorbell.data = data;
+	ret = mshv_portid_alloc(port_table_info);
+	if (ret < 0) {
+		kfree(port_table_info);
+		return ret;
+	}
+
+	port_id.u.id = ret;
+	port_info.port_type = HV_PORT_TYPE_DOORBELL;
+	port_info.doorbell_port_info.target_sint = HV_SYNIC_DOORBELL_SINT_INDEX;
+	port_info.doorbell_port_info.target_vp = HV_ANY_VP;
+	ret = hv_call_create_port(hv_current_partition_id, port_id, partition_id,
+				  &port_info,
+				  0, 0, NUMA_NO_NODE);
+
+	if (ret < 0) {
+		mshv_portid_free(port_id.u.id);
+		return ret;
+	}
+
+	connection_id.u.id = port_id.u.id;
+	connection_info.port_type = HV_PORT_TYPE_DOORBELL;
+	connection_info.doorbell_connection_info.gpa = gpa;
+	connection_info.doorbell_connection_info.trigger_value = val;
+	connection_info.doorbell_connection_info.flags = flags;
+
+	ret = hv_call_connect_port(hv_current_partition_id, port_id, partition_id,
+				   connection_id, &connection_info, 0, NUMA_NO_NODE);
+	if (ret < 0) {
+		hv_call_delete_port(hv_current_partition_id, port_id);
+		mshv_portid_free(port_id.u.id);
+		return ret;
+	}
+
+	/* Use the port_id as the doorbell_id */
+	return port_id.u.id;
+}
+
+void
+mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
+{
+	union hv_port_id port_id = { 0 };
+	union hv_connection_id connection_id = { 0 };
+
+	connection_id.u.id = doorbell_portid;
+	hv_call_disconnect_port(partition_id, connection_id);
+
+	port_id.u.id = doorbell_portid;
+	hv_call_delete_port(hv_current_partition_id, port_id);
+
+	mshv_portid_free(doorbell_portid);
+}
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
new file mode 100644
index 000000000000..9468f66c5658
--- /dev/null
+++ b/include/uapi/linux/mshv.h
@@ -0,0 +1,287 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Userspace interfaces for /dev/mshv* devices and derived fds
+ *
+ * This file is divided into sections containing data structures and IOCTLs for
+ * a particular set of related devices or derived file descriptors.
+ *
+ * The IOCTL definitions are at the end of each section. They are grouped by
+ * device/fd, so that new IOCTLs can easily be added with a monotonically
+ * increasing number.
+ */
+#ifndef _UAPI_LINUX_MSHV_H
+#define _UAPI_LINUX_MSHV_H
+
+#include <linux/types.h>
+
+#define MSHV_IOCTL	0xB8
+
+/*
+ *******************************************
+ * Entry point to main VMM APIs: /dev/mshv *
+ *******************************************
+ */
+
+enum {
+	MSHV_PT_BIT_LAPIC,
+	MSHV_PT_BIT_X2APIC,
+	MSHV_PT_BIT_GPA_SUPER_PAGES,
+	MSHV_PT_BIT_COUNT,
+};
+
+#define MSHV_PT_FLAGS_MASK ((1 << MSHV_PT_BIT_COUNT) - 1)
+
+enum {
+	MSHV_PT_ISOLATION_NONE,
+	MSHV_PT_ISOLATION_COUNT,
+};
+
+/**
+ * struct mshv_create_partition - arguments for MSHV_CREATE_PARTITION
+ * @pt_flags: Bitmask of 1 << MSHV_PT_BIT_*
+ * @pt_isolation: MSHV_PT_ISOLATION_*
+ *
+ * Returns a file descriptor to act as a handle to a guest partition.
+ * At this point the partition is not yet initialized in the hypervisor.
+ * Some operations must be done with the partition in this state, e.g. setting
+ * so-called "early" partition properties. The partition can then be
+ * initialized with MSHV_INITIALIZE_PARTITION.
+ */
+struct mshv_create_partition {
+	__u64 pt_flags;
+	__u64 pt_isolation;
+};
+
+/* /dev/mshv */
+#define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x00, struct mshv_create_partition)
+
+/*
+ ************************
+ * Child partition APIs *
+ ************************
+ */
+
+struct mshv_create_vp {
+	__u32 vp_index;
+};
+
+enum {
+	MSHV_SET_MEM_BIT_WRITABLE,
+	MSHV_SET_MEM_BIT_EXECUTABLE,
+	MSHV_SET_MEM_BIT_UNMAP,
+	MSHV_SET_MEM_BIT_COUNT
+};
+
+#define MSHV_SET_MEM_FLAGS_MASK ((1 << MSHV_SET_MEM_BIT_COUNT) - 1)
+
+/**
+ * struct mshv_user_mem_region - arguments for MSHV_SET_GUEST_MEMORY
+ * @size: Size of the memory region (bytes). Must be aligned to PAGE_SIZE
+ * @guest_pfn: Base guest page number to map
+ * @userspace_addr: Base address of userspace memory. Must be aligned to
+ *                  PAGE_SIZE
+ * @flags: Bitmask of 1 << MSHV_SET_MEM_BIT_*. If (1 << MSHV_SET_MEM_BIT_UNMAP)
+ *         is set, ignore other bits.
+ * @rsvd: MBZ
+ *
+ * Map or unmap a region of userspace memory to Guest Physical Addresses (GPA).
+ * Mappings can't overlap in GPA space or userspace.
+ * To unmap, these fields must match an existing mapping.
+ */
+struct mshv_user_mem_region {
+	__u64 size;
+	__u64 guest_pfn;
+	__u64 userspace_addr;
+	__u8 flags;
+	__u8 rsvd[7];
+};
+
+enum {
+	MSHV_IRQFD_BIT_DEASSIGN,
+	MSHV_IRQFD_BIT_RESAMPLE,
+	MSHV_IRQFD_BIT_COUNT,
+};
+
+#define MSHV_IRQFD_FLAGS_MASK	((1 << MSHV_IRQFD_BIT_COUNT) - 1)
+
+struct mshv_user_irqfd {
+	__s32 fd;
+	__s32 resamplefd;
+	__u32 gsi;
+	__u32 flags;
+};
+
+enum {
+	MSHV_IOEVENTFD_BIT_DATAMATCH,
+	MSHV_IOEVENTFD_BIT_PIO,
+	MSHV_IOEVENTFD_BIT_DEASSIGN,
+	MSHV_IOEVENTFD_BIT_COUNT,
+};
+
+#define MSHV_IOEVENTFD_FLAGS_MASK	((1 << MSHV_IOEVENTFD_BIT_COUNT) - 1)
+
+struct mshv_user_ioeventfd {
+	__u64 datamatch;
+	__u64 addr;	   /* legal pio/mmio address */
+	__u32 len;	   /* 1, 2, 4, or 8 bytes    */
+	__s32 fd;
+	__u32 flags;
+	__u8  rsvd[4];
+};
+
+struct mshv_user_irq_entry {
+	__u32 gsi;
+	__u32 address_lo;
+	__u32 address_hi;
+	__u32 data;
+};
+
+struct mshv_user_irq_table {
+	__u32 nr;
+	__u32 rsvd; /* MBZ */
+	struct mshv_user_irq_entry entries[];
+};
+
+enum {
+	MSHV_GPAP_ACCESS_TYPE_ACCESSED = 0,
+	MSHV_GPAP_ACCESS_TYPE_DIRTY,
+	MSHV_GPAP_ACCESS_TYPE_COUNT		/* Count of enum members */
+};
+
+enum {
+	MSHV_GPAP_ACCESS_OP_NOOP = 0,
+	MSHV_GPAP_ACCESS_OP_CLEAR,
+	MSHV_GPAP_ACCESS_OP_SET,
+	MSHV_GPAP_ACCESS_OP_COUNT		/* Count of enum members */
+};
+
+/**
+ * struct mshv_gpap_access_bitmap - arguments for MSHV_GET_GPAP_ACCESS_BITMAP
+ * @access_type: MSHV_GPAP_ACCESS_TYPE_* - The type of access to record in the
+ *               bitmap
+ * @access_op: MSHV_GPAP_ACCESS_OP_* - Allows an optional clear or set of all
+ *             the access states in the range, after retrieving the current
+ *             states.
+ * @rsvd: MBZ
+ * @page_count: in: number of pages
+ *              out: on error, number of states successfully written to bitmap
+ * @gpap_base: Base gpa page number
+ * @bitmap_ptr: Output buffer for bitmap, at least (page_count + 7) / 8 bytes
+ *
+ * Retrieve a bitmap of either ACCESSED or DIRTY bits for a given range of guest
+ * memory, and optionally clear or set the bits.
+ */
+struct mshv_gpap_access_bitmap {
+	__u8 access_type;
+	__u8 access_op;
+	__u8 rsvd[6];
+	__u64 page_count;
+	__u64 gpap_base;
+	__u64 bitmap_ptr;
+};
+
+/**
+ * struct mshv_root_hvcall - arguments for MSHV_ROOT_HVCALL
+ * @code: Hypercall code (HVCALL_*)
+ * @reps: in: Rep count ('repcount')
+ *	  out: Reps completed ('repcomp'). MBZ unless rep hvcall
+ * @in_sz: Size of input incl rep data. <= HV_HYP_PAGE_SIZE
+ * @out_sz: Size of output buffer. <= HV_HYP_PAGE_SIZE. MBZ if out_ptr is 0
+ * @status: in: MBZ
+ *	    out: HV_STATUS_* from hypercall
+ * @rsvd: MBZ
+ * @in_ptr: Input data buffer (struct hv_input_*). If used with partition or
+ *	    vp fd, partition id field is populated by kernel.
+ * @out_ptr: Output data buffer (optional)
+ */
+struct mshv_root_hvcall {
+	__u16 code;
+	__u16 reps;
+	__u16 in_sz;
+	__u16 out_sz;
+	__u16 status;
+	__u8 rsvd[6];
+	__u64 in_ptr;
+	__u64 out_ptr;
+};
+
+/* Partition fds created with MSHV_CREATE_PARTITION */
+#define MSHV_INITIALIZE_PARTITION	_IO(MSHV_IOCTL, 0x00)
+#define MSHV_CREATE_VP			_IOW(MSHV_IOCTL, 0x01, struct mshv_create_vp)
+#define MSHV_SET_GUEST_MEMORY		_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
+#define MSHV_IRQFD			_IOW(MSHV_IOCTL, 0x03, struct mshv_user_irqfd)
+#define MSHV_IOEVENTFD			_IOW(MSHV_IOCTL, 0x04, struct mshv_user_ioeventfd)
+#define MSHV_SET_MSI_ROUTING		_IOW(MSHV_IOCTL, 0x05, struct mshv_user_irq_table)
+#define MSHV_GET_GPAP_ACCESS_BITMAP	_IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
+/* Generic hypercall */
+#define MSHV_ROOT_HVCALL		_IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
+
+/*
+ ********************************
+ * VP APIs for child partitions *
+ ********************************
+ */
+
+#define MSHV_RUN_VP_BUF_SZ 256
+
+/*
+ * Map various VP state pages to userspace.
+ * Multiply the offset by PAGE_SIZE before being passed as the 'offset'
+ * argument to mmap().
+ * e.g.
+ * void *reg_page = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
+ *                       MAP_SHARED, vp_fd,
+ *                       MSHV_VP_MMAP_OFFSET_REGISTERS * PAGE_SIZE);
+ */
+enum {
+	MSHV_VP_MMAP_OFFSET_REGISTERS,
+	MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE,
+	MSHV_VP_MMAP_OFFSET_GHCB,
+	MSHV_VP_MMAP_OFFSET_COUNT
+};
+
+/**
+ * struct mshv_run_vp - argument for MSHV_RUN_VP
+ * @msg_buf: On success, the intercept message is copied here. It can be
+ *           interpreted using the relevant hypervisor definitions.
+ */
+struct mshv_run_vp {
+	__u8 msg_buf[MSHV_RUN_VP_BUF_SZ];
+};
+
+enum {
+	MSHV_VP_STATE_LAPIC,		/* Local interrupt controller state (either arch) */
+	MSHV_VP_STATE_XSAVE,		/* XSAVE data in compacted form (x86_64) */
+	MSHV_VP_STATE_SIMP,
+	MSHV_VP_STATE_SIEFP,
+	MSHV_VP_STATE_SYNTHETIC_TIMERS,
+	MSHV_VP_STATE_COUNT,
+};
+
+/**
+ * struct mshv_get_set_vp_state - arguments for MSHV_[GET,SET]_VP_STATE
+ * @type: MSHV_VP_STATE_*
+ * @rsvd: MBZ
+ * @buf_sz: in: 4k page-aligned size of buffer
+ *          out: Actual size of data (on EINVAL, check this to see if buffer
+ *               was too small)
+ * @buf_ptr: 4k page-aligned data buffer
+ */
+struct mshv_get_set_vp_state {
+	__u8 type;
+	__u8 rsvd[3];
+	__u32 buf_sz;
+	__u64 buf_ptr;
+};
+
+/* VP fds created with MSHV_CREATE_VP */
+#define MSHV_RUN_VP			_IOR(MSHV_IOCTL, 0x00, struct mshv_run_vp)
+#define MSHV_GET_VP_STATE		_IOWR(MSHV_IOCTL, 0x01, struct mshv_get_set_vp_state)
+#define MSHV_SET_VP_STATE		_IOWR(MSHV_IOCTL, 0x02, struct mshv_get_set_vp_state)
+/*
+ * Generic hypercall
+ * Defined above in partition IOCTLs, avoid redefining it here
+ * #define MSHV_ROOT_HVCALL			_IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
+ */
+
+#endif
-- 
2.34.1
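
The uapi header in this patch reuses ioctl number 0x00 on each fd type
(/dev/mshv, partition fds, VP fds). The following is a minimal user-space
sketch, not part of the patch, showing how the encoded request values still
differ. It copies three argument structs from the header, swaps the
__u64/__u32/__u8 types for <stdint.h> equivalents, and assumes a Linux
toolchain where <linux/ioctl.h> provides _IO/_IOW/_IOR and the _IOC_*
decoding macros. The helper names mshv_same_nr() and mshv_same_request()
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define MSHV_IOCTL		0xB8
#define MSHV_RUN_VP_BUF_SZ	256

/* Argument structs copied from the header, with <stdint.h> types. */
struct mshv_create_partition {
	uint64_t pt_flags;
	uint64_t pt_isolation;
};

struct mshv_create_vp {
	uint32_t vp_index;
};

struct mshv_run_vp {
	uint8_t msg_buf[MSHV_RUN_VP_BUF_SZ];
};

/* One request per fd type, plus a second partition-fd request. */
#define MSHV_CREATE_PARTITION	  _IOW(MSHV_IOCTL, 0x00, struct mshv_create_partition)
#define MSHV_INITIALIZE_PARTITION _IO(MSHV_IOCTL, 0x00)
#define MSHV_CREATE_VP		  _IOW(MSHV_IOCTL, 0x01, struct mshv_create_vp)
#define MSHV_RUN_VP		  _IOR(MSHV_IOCTL, 0x00, struct mshv_run_vp)

/* Hypothetical helper: nonzero if two requests share type and number. */
static inline int mshv_same_nr(unsigned long a, unsigned long b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}

/* Hypothetical helper: nonzero if the fully encoded values collide. */
static inline int mshv_same_request(unsigned long a, unsigned long b)
{
	return a == b;
}
```

MSHV_CREATE_PARTITION, MSHV_INITIALIZE_PARTITION and MSHV_RUN_VP all carry
number 0x00, but the direction and size bits differ, so the encoded values
do not collide. The header comment states that the IOCTLs are grouped by
device/fd; the distinct encodings are an additional property of these
particular definitions.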

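The bank/bit walk in handle_bitset_message() may be easier to follow as a
standalone model. The sketch below is a user-space illustration under stated
assumptions, not the driver code: each bank is assumed to hold 64 VPs (the
patch shifts by HV_GENERIC_SET_SHIFT, taken to be 6 here), bank_contents[]
is assumed to be packed with one word per set bit in valid_bank_mask, and
the names BANK_SHIFT and sparse_set_decode() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BANK_SHIFT 6 /* assumed value of HV_GENERIC_SET_SHIFT: 64 VPs per bank */

/*
 * Decode a sparse set into a list of VP indexes.
 * valid_bank_mask: bit N set means bank N is present.
 * bank_contents:   one 64-bit word per present bank, in ascending bank order.
 * Returns the number of indexes written to out (at most max_out).
 */
static inline size_t sparse_set_decode(uint64_t valid_bank_mask,
				       const uint64_t *bank_contents,
				       uint32_t *out, size_t max_out)
{
	size_t n = 0;

	for (unsigned int bank = 0; bank < 64; bank++) {
		uint64_t bits;

		if (!(valid_bank_mask & (UINT64_C(1) << bank)))
			continue;

		/* Consume the next packed word for this present bank. */
		bits = *bank_contents++;

		for (unsigned int bit = 0; bit < 64; bit++) {
			if (!(bits & (UINT64_C(1) << bit)))
				continue;
			if (n == max_out)
				return n;
			/* Same arithmetic as the driver: bank base plus bit offset. */
			out[n++] = ((uint32_t)bank << BANK_SHIFT) + bit;
		}
	}

	return n;
}
```

For example, with valid_bank_mask = 0x5 (banks 0 and 2 present) and
bank_contents = { 0x5, 1ULL << 63 }, this yields VP indexes 0, 2 and 191.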


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
@ 2025-02-27  4:59   ` Easwar Hariharan
  2025-03-01  1:29     ` Nuno Das Neves
  2025-02-27 18:50   ` Roman Kisel
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 108+ messages in thread
From: Easwar Hariharan @ 2025-02-27  4:59 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, eahariha, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> Provide a set of IOCTLs for creating and managing child partitions when
> running as root partition on Hyper-V. The new driver is enabled via
> CONFIG_MSHV_ROOT.
> 
> A brief overview of the interface:
> 
> MSHV_CREATE_PARTITION is the entry point, returning a file descriptor
> representing a child partition. IOCTLs on this fd can be used to map
> memory, create VPs, etc.
> 
> Creating a VP returns another file descriptor representing that VP which
> in turn has another set of corresponding IOCTLs for running the VP,
> getting/setting state, etc.
> 
> MSHV_ROOT_HVCALL is a generic "passthrough" hypercall IOCTL which can be
> used for a number of partition or VP hypercalls. This is for hypercalls
> that do not affect any state in the kernel driver, such as getting and
> setting VP registers and partition properties, translating addresses,
> etc. It is "passthrough" because the binary input and output for the
> hypercall is only interpreted by the VMM - the kernel driver does
> nothing but insert the VP and partition id where necessary (which are
> always in the same place), and execute the hypercall.
> 
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
> Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
> Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
> Co-developed-by: Muminul Islam <muislam@microsoft.com>
> Signed-off-by: Muminul Islam <muislam@microsoft.com>
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---

I see some issues reported by checkpatch, both vanilla and --strict.
<snip>

Thanks,
Easwar (he/him)


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-27  4:59   ` Easwar Hariharan
@ 2025-03-01  1:29     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-01  1:29 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: linux-hyperv, x86, linux-arm-kernel, linux-kernel, linux-arch,
	linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On 2/26/2025 8:59 PM, Easwar Hariharan wrote:
> On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
>> Provide a set of IOCTLs for creating and managing child partitions when
>> running as root partition on Hyper-V. The new driver is enabled via
>> CONFIG_MSHV_ROOT.
>>
>> A brief overview of the interface:
>>
>> MSHV_CREATE_PARTITION is the entry point, returning a file descriptor
>> representing a child partition. IOCTLs on this fd can be used to map
>> memory, create VPs, etc.
>>
>> Creating a VP returns another file descriptor representing that VP which
>> in turn has another set of corresponding IOCTLs for running the VP,
>> getting/setting state, etc.
>>
>> MSHV_ROOT_HVCALL is a generic "passthrough" hypercall IOCTL which can be
>> used for a number of partition or VP hypercalls. This is for hypercalls
>> that do not affect any state in the kernel driver, such as getting and
>> setting VP registers and partition properties, translating addresses,
>> etc. It is "passthrough" because the binary input and output for the
>> hypercall is only interpreted by the VMM - the kernel driver does
>> nothing but insert the VP and partition id where necessary (which are
>> always in the same place), and execute the hypercall.
>>
>> Co-developed-by: Wei Liu <wei.liu@kernel.org>
>> Signed-off-by: Wei Liu <wei.liu@kernel.org>
>> Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
>> Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
>> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
>> Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
>> Co-developed-by: Muminul Islam <muislam@microsoft.com>
>> Signed-off-by: Muminul Islam <muislam@microsoft.com>
>> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
>> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
> 
> I see some issues reported by checkpatch, both vanilla and --strict.
> <snip>

Yes, most of them are from --strict.

The macro argument reuse ones are a non-issue I think. I suppose this
could be cleaned up for the vp_ and pt_ macros, I might do that.

"struct mutex/spinlock_t definition without comment" - I'm not sure
if that's really needed. The code that uses these primitives
demonstrates their purpose better than a comment, I think.

"Avoid CamelCase" - Some Hyper-V definitions that use the original
CamelCase definitions are introduced in this patch. These are
stats-related - partition and vp statistics that can be gathered
from the hypervisor. In a future patch these will be converted to
strings and displayed in debugfs, and... hmm, to be honest I'm not
sure why they need to remain in CamelCase when we convert everything
else to Linux style... For now there are only 2 of these definitions
and they're only defined in mshv_root_main.c so I think it's ok.
I'll consider what to do when the rest of the stats code is proposed,
which includes a big chunk of these CamelCase definitions.

"Use of volatile is usually wrong" - I admit I'm not an expert in
this area. We use it for a pointer to hv_synic_event_ring, similar
to how it is used to access hv_synic_event_flags_page in vmbus_drv.c.
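
For reference, the warning reflects the kernel's general preference for
marking each individual access (READ_ONCE()/WRITE_ONCE() in the kernel)
rather than declaring the data itself volatile. Below is a minimal
userspace sketch of that pattern. The demo_* names and the ring layout
are hypothetical and are not the hv_synic_event_ring definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a single volatile load at the point of use. */
static inline uint32_t demo_read_once_u32(const uint32_t *p)
{
	return *(const volatile uint32_t *)p;
}

/* Hypothetical helper: a single volatile store at the point of use. */
static inline void demo_write_once_u32(uint32_t *p, uint32_t v)
{
	*(volatile uint32_t *)p = v;
}

/* Hypothetical ring whose slots may be written by another agent. */
struct demo_ring {
	uint32_t data[8];
};

/* Read one slot; the data itself is not declared volatile. */
static uint32_t demo_ring_peek(const struct demo_ring *ring, unsigned int idx)
{
	assert(idx < 8);
	return demo_read_once_u32(&ring->data[idx]);
}
```

The structure stays a plain type, and only the accesses that must not
be cached or merged by the compiler are marked.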

"Added, moved or deleted file(s), does MAINTAINERS need updating?" -
drivers/hv is already listed in MAINTAINERS.

Thanks
Nuno

> 
> Thanks,
> Easwar (he/him)


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
  2025-02-27  4:59   ` Easwar Hariharan
@ 2025-02-27 18:50   ` Roman Kisel
  2025-03-01  1:38     ` Nuno Das Neves
  2025-03-06 17:32     ` Wei Liu
  2025-03-11 18:01   ` Jeff Johnson
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 108+ messages in thread
From: Roman Kisel @ 2025-02-27 18:50 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet




On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> Provide a set of IOCTLs for creating and managing child partitions when
> running as root partition on Hyper-V. The new driver is enabled via
> CONFIG_MSHV_ROOT.
> 

[...]


As I understood, the changes fall into these buckets:

1. Partition management (VPs and memory). Built on top of fds, which
    looks like the right approach. There is ref counting etc.
2. Scheduling. Here, there is the mature KVM and Xen code to find
    inspiration in. Xen, being a Type 1 hypervisor, should likely be
    closer to MSHV in my understanding.
3. IOCTL code allocation. Not sure how this is allocated, but given that
    the patch series has been through a multi-year review, that must be
    settled by now.
4. IOCTLs themselves. The majority just marshals data to the
    hypervisor.

Despite the rather large size of the patch, I spot-checked the places
where I have the chance to make an informed decision, and could not find
anything that'd stand out as suspicious to me. Going to extrapolate that
the patch itself should be good enough. Given that this code has been in
development and validation for a few years, I'd vote to merge it. That
will also enable upstreaming the rest of the VTL mode code that powers
Azure Boost (https://github.com/microsoft/OHCL-Linux-Kernel)

Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

-- 
Thank you,
Roman



* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-27 18:50   ` Roman Kisel
@ 2025-03-01  1:38     ` Nuno Das Neves
  2025-03-06 17:32     ` Wei Liu
  1 sibling, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-01  1:38 UTC (permalink / raw)
  To: Roman Kisel, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 2/27/2025 10:50 AM, Roman Kisel wrote:
> 
> 
> 
> On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
>> Provide a set of IOCTLs for creating and managing child partitions when
>> running as root partition on Hyper-V. The new driver is enabled via
>> CONFIG_MSHV_ROOT.
>>
> 
> [...]
> 
> 
> As I understood, the changes fall into these buckets:
> 
> 1. Partition management (VPs and memory). Built on top of fds, which
>    looks like the right approach. There is ref counting etc.
> 2. Scheduling. Here, there is the mature KVM and Xen code to find
>    inspiration in. Xen, being a Type 1 hypervisor, should likely be
>    closer to MSHV in my understanding.
> 3. IOCTL code allocation. Not sure how this is allocated, but given that
>    the patch series has been through a multi-year review, that must be
>    settled by now.
> 4. IOCTLs themselves. The majority just marshals data to the
>    hypervisor.
> 
This is a good summary, thanks.

> Despite the rather large size of the patch, I spot-checked the places
> where I have the chance to make an informed decision, and could not find
> anything that'd stand out as suspicious to me. Going to extrapolate that
> the patch itself should be good enough. Given that this code has been in
> development and validation for a few years, I'd vote to merge it. That
> will also enable upstreaming the rest of the VTL mode code that powers
> Azure Boost (https://github.com/microsoft/OHCL-Linux-Kernel)
> 
> Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
> 



* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-27 18:50   ` Roman Kisel
  2025-03-01  1:38     ` Nuno Das Neves
@ 2025-03-06 17:32     ` Wei Liu
  2025-03-07 18:06       ` Roman Kisel
  1 sibling, 1 reply; 108+ messages in thread
From: Wei Liu @ 2025-03-06 17:32 UTC (permalink / raw)
  To: Roman Kisel
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, wei.liu, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet

On Thu, Feb 27, 2025 at 10:50:30AM -0800, Roman Kisel wrote:
> 
> 
> 
> On 2/26/2025 3:08 PM, Nuno Das Neves wrote:
> > Provide a set of IOCTLs for creating and managing child partitions when
> > running as root partition on Hyper-V. The new driver is enabled via
> > CONFIG_MSHV_ROOT.
> > 
> 
> [...]
> 
> 
> As I understood, the changes fall into these buckets:
> 
> 1. Partition management (VPs and memory). Built on top of fds, which
>    looks like the right approach. There is ref counting etc.
> 2. Scheduling. Here, there is the mature KVM and Xen code to find
>    inspiration in. Xen, being a Type 1 hypervisor, should likely be
>    closer to MSHV in my understanding.

Yes and no.

When a hypervisor-based scheduler (either classic or core) is used, the
scheduling model is the same as Xen. In this model, the hypervisor makes
the scheduling decisions.

There is a second scheduler model. In that model, the hypervisor
delegates scheduling to the Linux kernel. The Linux scheduler makes the
scheduling decisions. It is similar to KVM.

We support both. Which model to use largely depends on the workload and
the desired behaviors of the system.

This is purely informational in case people wonder why the run vp
function branches off to two different code paths.
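
As a rough illustration of that branch, and not the driver's actual
code, the sketch below picks a run path from the scheduler model in
effect. Every demo_* name is a hypothetical stand-in. The patch has its
own hv_scheduler_type variable and HV_SCHEDULER_TYPE_ROOT constant,
which this sketch does not use.

```c
#include <assert.h>

/* Hypothetical stand-ins for the scheduler models described above. */
enum demo_sched_model {
	DEMO_SCHED_CLASSIC,	/* hypervisor makes the scheduling decisions */
	DEMO_SCHED_CORE,	/* hypervisor makes the scheduling decisions */
	DEMO_SCHED_ROOT,	/* scheduling is delegated to the Linux kernel */
};

/* Which of the two code paths was taken. */
enum demo_run_path {
	DEMO_RUN_PATH_HYP,	/* Xen-like: hypervisor decides when the VP runs */
	DEMO_RUN_PATH_ROOT,	/* KVM-like: Linux scheduler decides */
};

static enum demo_run_path demo_run_vp_hyp_sched(void)
{
	return DEMO_RUN_PATH_HYP;
}

static enum demo_run_path demo_run_vp_root_sched(void)
{
	return DEMO_RUN_PATH_ROOT;
}

/* Branch once on the scheduler model, then stay on that path. */
static enum demo_run_path demo_run_vp(enum demo_sched_model model)
{
	assert(model <= DEMO_SCHED_ROOT);

	if (model == DEMO_SCHED_ROOT)
		return demo_run_vp_root_sched();

	return demo_run_vp_hyp_sched();
}
```

Both hypervisor-based schedulers share one path here because, as
described above, the hypervisor makes the decisions in either case.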

> 3. IOCTL code allocation. Not sure how this is allocated, but given that
>    the patch series has been through a multi-year review, that must be
>    settled by now.
> 4. IOCTLs themselves. The majority just marshals data to the
>    hypervisor.
> 
> Despite the rather large size of the patch, I spot-checked the places
> where I have the chance to make an informed decision, and could not find
> anything that'd stand out as suspicious to me. Going to extrapolate that
> the patch itself should be good enough. Given that this code has been in
> development and validation for a few years, I'd vote to merge it. That
> will also enable upstreaming the rest of the VTL mode code that powers
> Azure Boost (https://github.com/microsoft/OHCL-Linux-Kernel)
> 
> Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
> 

Thank you for the review.

Thanks,
Wei.

> -- 
> Thank you,
> Roman
> 


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-06 17:32     ` Wei Liu
@ 2025-03-07 18:06       ` Roman Kisel
  0 siblings, 0 replies; 108+ messages in thread
From: Roman Kisel @ 2025-03-07 18:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi, kys, haiyangz, mhklinux, decui,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	daniel.lezcano, joro, robin.murphy, arnd, jinankjain,
	muminulrussell, skinsburskii, mrathor, ssengar, apais, Tianyu.Lan,
	stanislav.kinsburskiy, gregkh, vkuznets, prapal, muislam,
	anrayabh, rafael, lenb, corbet



On 3/6/2025 9:32 AM, Wei Liu wrote:
> On Thu, Feb 27, 2025 at 10:50:30AM -0800, Roman Kisel wrote:

[...]

>> 2. Scheduling. Here, there is the mature KVM and Xen code to find
>>     inspiration in. Xen, being a Type 1 hypervisor, should likely be
>>     closer to MSHV in my understanding.
> 
> Yes and no.
> 
> When a hypervisor-based scheduler (either classic or core) is used, the
> scheduling model is the same as Xen. In this model, the hypervisor makes
> the scheduling decisions.
> 
> There is a second scheduler model. In that model, the hypervisor
> delegates scheduling to the Linux kernel. The Linux scheduler makes the
> scheduling decisions. It is similar to KVM.
> 
> We support both. Which model to use largely depends on the workload and
> the desired behaviors of the system.
> 
> This is purely informational in case people wonder why the run vp
> function branches off to two different code paths.
> 

Thanks, now I understand that better :)

[...]

>> -- 
>> Thank you,
>> Roman
>>

-- 
Thank you,
Roman



* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
  2025-02-27  4:59   ` Easwar Hariharan
  2025-02-27 18:50   ` Roman Kisel
@ 2025-03-11 18:01   ` Jeff Johnson
  2025-03-14 19:25     ` Nuno Das Neves
  2025-03-13 16:43   ` Michael Kelley
  2025-03-17 23:51   ` Michael Kelley
  4 siblings, 1 reply; 108+ messages in thread
From: Jeff Johnson @ 2025-03-11 18:01 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 2/26/25 15:08, Nuno Das Neves wrote:
...
> +
> +MODULE_AUTHOR("Microsoft");
> +MODULE_LICENSE("GPL");
> +

Since commit 1fffe7a34c89 ("script: modpost: emit a warning when the
description is missing"), a module without a MODULE_DESCRIPTION() will
result in a warning with make W=1. Please add a MODULE_DESCRIPTION()
to avoid this warning.

This is a canned review based upon finding a MODULE_LICENSE without a
MODULE_DESCRIPTION.

/jeff


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-11 18:01   ` Jeff Johnson
@ 2025-03-14 19:25     ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-14 19:25 UTC (permalink / raw)
  To: Jeff Johnson, linux-hyperv, x86, linux-arm-kernel, linux-kernel,
	linux-arch, linux-acpi
  Cc: kys, haiyangz, wei.liu, mhklinux, decui, catalin.marinas, will,
	tglx, mingo, bp, dave.hansen, hpa, daniel.lezcano, joro,
	robin.murphy, arnd, jinankjain, muminulrussell, skinsburskii,
	mrathor, ssengar, apais, Tianyu.Lan, stanislav.kinsburskiy,
	gregkh, vkuznets, prapal, muislam, anrayabh, rafael, lenb, corbet

On 3/11/2025 11:01 AM, Jeff Johnson wrote:
> On 2/26/25 15:08, Nuno Das Neves wrote:
> ...
>> +
>> +MODULE_AUTHOR("Microsoft");
>> +MODULE_LICENSE("GPL");
>> +
> 
> Since commit 1fffe7a34c89 ("script: modpost: emit a warning when the
> description is missing"), a module without a MODULE_DESCRIPTION() will
> result in a warning with make W=1. Please add a MODULE_DESCRIPTION()
> to avoid this warning.
> 
> This is a canned review based upon finding a MODULE_LICENSE without a
> MODULE_DESCRIPTION.
> 
> /jeff

Thanks Jeff. Fixed in v6.

Nuno


* RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
                     ` (2 preceding siblings ...)
  2025-03-11 18:01   ` Jeff Johnson
@ 2025-03-13 16:43   ` Michael Kelley
  2025-03-14  2:15     ` Nuno Das Neves
  2025-03-17 23:51   ` Michael Kelley
  4 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-13 16:43 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 

I've done a partial review of the code in this patch.  See comments inline
as usual.

I'd like to still review most of the code in mshv_root_main.c, and maybe
some of mshv_synic.c and include/uapi/linux/mshv.h. I'll send a separate
email with those comments when I complete them. The patch is huge, so
I'm breaking my review comments into two parts.

I've glanced through mshv_eventfd.c, mshv_eventfd.h, and mshv_irq.c,
but I don't have enough knowledge/expertise in these areas to add any
useful comments, so I'm not planning to review them further.

> Provide a set of IOCTLs for creating and managing child partitions when
> running as root partition on Hyper-V. The new driver is enabled via
> CONFIG_MSHV_ROOT.
> 
> A brief overview of the interface:
> 
> MSHV_CREATE_PARTITION is the entry point, returning a file descriptor
> representing a child partition. IOCTLs on this fd can be used to map
> memory, create VPs, etc.
> 
> Creating a VP returns another file descriptor representing that VP which
> in turn has another set of corresponding IOCTLs for running the VP,
> getting/setting state, etc.
> 
> MSHV_ROOT_HVCALL is a generic "passthrough" hypercall IOCTL which can be
> used for a number of partition or VP hypercalls. This is for hypercalls
> that do not affect any state in the kernel driver, such as getting and
> setting VP registers and partition properties, translating addresses,
> etc. It is "passthrough" because the binary input and output for the
> hypercall is only interpreted by the VMM - the kernel driver does
> nothing but insert the VP and partition id where necessary (which are
> always in the same place), and execute the hypercall.
> 
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
> Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
> Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
> Co-developed-by: Muminul Islam <muislam@microsoft.com>
> Signed-off-by: Muminul Islam <muislam@microsoft.com>
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |    2 +
>  drivers/hv/Makefile                           |    5 +-
>  drivers/hv/mshv.h                             |   30 +
>  drivers/hv/mshv_common.c                      |  161 ++
>  drivers/hv/mshv_eventfd.c                     |  833 ++++++
>  drivers/hv/mshv_eventfd.h                     |   71 +
>  drivers/hv/mshv_irq.c                         |  128 +
>  drivers/hv/mshv_portid_table.c                |   84 +
>  drivers/hv/mshv_root.h                        |  321 +++
>  drivers/hv/mshv_root_hv_call.c                |  876 +++++++
>  drivers/hv/mshv_root_main.c                   | 2329 +++++++++++++++++
>  drivers/hv/mshv_synic.c                       |  665 +++++
>  include/uapi/linux/mshv.h                     |  287 ++
>  13 files changed, 5791 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/hv/mshv.h
>  create mode 100644 drivers/hv/mshv_common.c
>  create mode 100644 drivers/hv/mshv_eventfd.c
>  create mode 100644 drivers/hv/mshv_eventfd.h
>  create mode 100644 drivers/hv/mshv_irq.c
>  create mode 100644 drivers/hv/mshv_portid_table.c
>  create mode 100644 drivers/hv/mshv_root.h
>  create mode 100644 drivers/hv/mshv_root_hv_call.c
>  create mode 100644 drivers/hv/mshv_root_main.c
>  create mode 100644 drivers/hv/mshv_synic.c
>  create mode 100644 include/uapi/linux/mshv.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 6d1465315df3..66dcfaae698b 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -370,6 +370,8 @@ Code  Seq#    Include File                                           Comments
>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-remoteproc@vger.kernel.org>
>  0xB7  all    uapi/linux/nsfs.h                                       <mailto:Andrei Vagin <avagin@openvz.org>>
>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver

Hmmm. Doesn't this mean that the mshv ioctls overlap with the Marvell
CN10K DPI ioctls? Is that intentional? I thought the goal of the central
registry in ioctl-number.rst is to avoid overlap.
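
To make the concern concrete, here is a small userspace sketch of what
"overlap" means in terms of the ioctl encoding. The structs and the
DEMO_* commands are hypothetical and are not the mshv or Marvell
definitions. Two commands that share a code and sequence number occupy
the same registry slot, even though the full command values can still
differ because direction and argument size are encoded too.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Hypothetical argument structs of different sizes, illustration only. */
struct demo_arg_a {
	unsigned int x;
};

struct demo_arg_b {
	unsigned long long x, y;
};

static_assert(sizeof(struct demo_arg_a) != sizeof(struct demo_arg_b),
	      "the demo relies on different argument sizes");

/* Two hypothetical commands sharing code 0xB8 and sequence number 0x01. */
#define DEMO_CMD_A	_IOW(0xB8, 0x01, struct demo_arg_a)
#define DEMO_CMD_B	_IOW(0xB8, 0x01, struct demo_arg_b)

/* Nonzero if both commands use the same (code, seq#) registry slot. */
static int demo_same_registry_slot(unsigned long a, unsigned long b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}
```

So a driver that mixes up the two file descriptors would not
necessarily see identical command values, but the registry exists so
that nobody has to rely on that.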

> +                                                                     <mailto:linux-hyperv@vger.kernel.org>
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index 2b8dc954b350..976189c725dc 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -2,6 +2,7 @@
>  obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
>  obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
>  obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
> +obj-$(CONFIG_MSHV_ROOT)		+= mshv_root.o
> 
>  CFLAGS_hv_trace.o = -I$(src)
>  CFLAGS_hv_balloon.o = -I$(src)
> @@ -11,7 +12,9 @@ hv_vmbus-y := vmbus_drv.o \
>  		 channel_mgmt.o ring_buffer.o hv_trace.o
>  hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
>  hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
> +mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
> +	       mshv_root_hv_call.o mshv_portid_table.o
> 
>  # Code that must be built-in
>  obj-$(subst m,y,$(CONFIG_HYPERV)) += hv_common.o
> -obj-$(subst m,y,$(CONFIG_MSHV_ROOT)) += hv_proc.o
> +obj-$(subst m,y,$(CONFIG_MSHV_ROOT)) += hv_proc.o mshv_common.o
> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> new file mode 100644
> index 000000000000..0340a67acd0a
> --- /dev/null
> +++ b/drivers/hv/mshv.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + */
> +
> +#ifndef _MSHV_H_
> +#define _MSHV_H_
> +
> +#include <linux/stddef.h>
> +#include <linux/string.h>
> +#include <hyperv/hvhdk.h>
> +
> +#define mshv_field_nonzero(STRUCT, MEMBER) \
> +	memchr_inv(&((STRUCT).MEMBER), \
> +		   0, sizeof_field(typeof(STRUCT), MEMBER))
> +
> +int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +			     union hv_input_vtl input_vtl,
> +			     struct hv_register_assoc *registers);
> +
> +int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +			     union hv_input_vtl input_vtl,
> +			     struct hv_register_assoc *registers);
> +
> +int hv_call_get_partition_property(u64 partition_id, u64 property_code,
> +				   u64 *property_value);
> +
> +int mshv_do_pre_guest_mode_work(ulong th_flags);
> +
> +#endif /* _MSHV_H_ */
> diff --git a/drivers/hv/mshv_common.c b/drivers/hv/mshv_common.c
> new file mode 100644
> index 000000000000..d97631dcbee1
> --- /dev/null
> +++ b/drivers/hv/mshv_common.c
> @@ -0,0 +1,161 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024, Microsoft Corporation.
> + *
> + * This file contains functions that are called from one or more modules: ROOT,
> + * DIAG, or VTL. If any of these modules are configured to build, this file is

What are the DIAG and VTL modules?  I see only a root module in the Makefile.

> + * built and just statically linked in.
> + *
> + * Authors: Microsoft Linux virtualization team
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <asm/mshyperv.h>
> +#include <linux/resume_user_mode.h>
> +
> +#include "mshv.h"
> +
> +#define HV_GET_REGISTER_BATCH_SIZE	\
> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> +#define HV_SET_REGISTER_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> +		/ sizeof(struct hv_register_assoc))
> +
> +int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +			     union hv_input_vtl input_vtl,
> +			     struct hv_register_assoc *registers)
> +{
> +	struct hv_input_get_vp_registers *input_page;
> +	union hv_register_value *output_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count, i;
> +	u64 status = HV_STATUS_SUCCESS;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> +		for (i = 0; i < rep_count; ++i)
> +			input_page->names[i] = registers[i].name;
> +
> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
> +					     0, input_page, output_page);
> +		if (!hv_result_success(status))
> +			break;
> +
> +		completed = hv_repcomp(status);
> +		for (i = 0; i < completed; ++i)
> +			registers[i].value = output_page[i];
> +
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +	local_irq_restore(flags);
> +
> +	return hv_result_to_errno(status);
> +}
> +EXPORT_SYMBOL_GPL(hv_call_get_vp_registers);
> +
> +int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +			     union hv_input_vtl input_vtl,
> +			     struct hv_register_assoc *registers)
> +{
> +	struct hv_input_set_vp_registers *input_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	u64 status = HV_STATUS_SUCCESS;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->elements, registers,
> +		       sizeof(struct hv_register_assoc) * rep_count);
> +
> +		status = hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
> +					     0, input_page, NULL);
> +		if (!hv_result_success(status))
> +			break;
> +
> +		completed = hv_repcomp(status);
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return hv_result_to_errno(status);
> +}
> +EXPORT_SYMBOL_GPL(hv_call_set_vp_registers);
> +
> +int hv_call_get_partition_property(u64 partition_id,
> +				   u64 property_code,
> +				   u64 *property_value)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_get_partition_property *input;
> +	struct hv_output_get_partition_property *output;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = partition_id;
> +	input->property_code = property_code;
> +	status = hv_do_hypercall(HVCALL_GET_PARTITION_PROPERTY, input, output);
> +
> +	if (!hv_result_success(status)) {
> +		local_irq_restore(flags);
> +		return hv_result_to_errno(status);
> +	}
> +	*property_value = output->property_value;
> +
> +	local_irq_restore(flags);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_call_get_partition_property);
> +
> +/*
> + * Handle any pre-processing before going into the guest mode on this cpu, most
> + * notably call schedule(). Must be invoked with both preemption and
> + * interrupts enabled.
> + *
> + * Returns: 0 on success, -errno on error.
> + */
> +int mshv_do_pre_guest_mode_work(ulong th_flags)
> +{
> +	if (th_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> +		return -EINTR;
> +
> +	if (th_flags & _TIF_NEED_RESCHED)
> +		schedule();
> +
> +	if (th_flags & _TIF_NOTIFY_RESUME)
> +		resume_user_mode_work(NULL);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(mshv_do_pre_guest_mode_work);
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> new file mode 100644
> index 000000000000..8dd22be2ca0b
> --- /dev/null
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -0,0 +1,833 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * eventfd support for mshv
> + *
> + * Heavily inspired from KVM implementation of irqfd/ioeventfd. The basic
> + * framework code is taken from the kvm implementation.
> + *
> + * All credits to kvm developers.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/eventfd.h>
> +
> +#if IS_ENABLED(CONFIG_X86_64)
> +#include <asm/apic.h>
> +#endif
> +#include <asm/mshyperv.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +static struct workqueue_struct *irqfd_cleanup_wq;
> +
> +void mshv_register_irq_ack_notifier(struct mshv_partition *partition,
> +				    struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->pt_irq_lock);
> +	hlist_add_head_rcu(&mian->link, &partition->irq_ack_notifier_list);
> +	mutex_unlock(&partition->pt_irq_lock);
> +}
> +
> +void mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
> +				      struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->pt_irq_lock);
> +	hlist_del_init_rcu(&mian->link);
> +	mutex_unlock(&partition->pt_irq_lock);
> +	synchronize_rcu();
> +}
> +
> +bool mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi)
> +{
> +	struct mshv_irq_ack_notifier *mian;
> +	bool acked = false;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mian, &partition->irq_ack_notifier_list,
> +				 link) {
> +		if (mian->irq_ack_gsi == gsi) {
> +			mian->irq_acked(mian);
> +			acked = true;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return acked;
> +}
> +
> +#if IS_ENABLED(CONFIG_ARM64)
> +static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
> +{
> +	return false;
> +}
> +#elif IS_ENABLED(CONFIG_X86_64)
> +static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
> +{
> +	return type == HV_X64_INTERRUPT_TYPE_EXTINT;
> +}
> +#endif
> +
> +static void mshv_irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
> +{
> +	struct mshv_irqfd_resampler *resampler;
> +	struct mshv_partition *partition;
> +	struct mshv_irqfd *irqfd;
> +	int idx;
> +
> +	resampler = container_of(mian, struct mshv_irqfd_resampler,
> +				 rsmplr_notifier);
> +	partition = resampler->rsmplr_partn;
> +
> +	idx = srcu_read_lock(&partition->pt_irq_srcu);
> +
> +	hlist_for_each_entry_rcu(irqfd, &resampler->rsmplr_irqfd_list,
> +				 irqfd_resampler_hnode) {
> +		if (hv_should_clear_interrupt(irqfd->irqfd_lapic_irq.lapic_control.interrupt_type))
> +			hv_call_clear_virtual_interrupt(partition->pt_id);
> +
> +		eventfd_signal(irqfd->irqfd_resamplefd);
> +	}
> +
> +	srcu_read_unlock(&partition->pt_irq_srcu, idx);
> +}
> +
> +#if IS_ENABLED(CONFIG_X86_64)
> +static bool
> +mshv_vp_irq_vector_injected(union hv_vp_register_page_interrupt_vectors iv,
> +			    u32 vector)
> +{
> +	int i;
> +
> +	for (i = 0; i < iv.vector_count; i++) {
> +		if (iv.vector[i] == vector)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static int mshv_vp_irq_try_set_vector(struct mshv_vp *vp, u32 vector)
> +{
> +	union hv_vp_register_page_interrupt_vectors iv, new_iv;
> +
> +	iv = vp->vp_register_page->interrupt_vectors;
> +	new_iv = iv;
> +
> +	if (mshv_vp_irq_vector_injected(iv, vector))
> +		return 0;
> +
> +	if (iv.vector_count >= HV_VP_REGISTER_PAGE_MAX_VECTOR_COUNT)
> +		return -ENOSPC;
> +
> +	new_iv.vector[new_iv.vector_count++] = vector;
> +
> +	if (cmpxchg(&vp->vp_register_page->interrupt_vectors.as_uint64,
> +		    iv.as_uint64, new_iv.as_uint64) != iv.as_uint64)
> +		return -EAGAIN;
> +
> +	return 0;
> +}
> +
> +static int mshv_vp_irq_set_vector(struct mshv_vp *vp, u32 vector)
> +{
> +	int ret;
> +
> +	do {
> +		ret = mshv_vp_irq_try_set_vector(vp, vector);
> +	} while (ret == -EAGAIN && !need_resched());
> +
> +	return ret;
> +}
> +
> +/*
> + * Try to raise irq for guest via shared vector array. hyp does the actual
> + * inject of the interrupt.
> + */
> +static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
> +{
> +	struct mshv_partition *partition = irqfd->irqfd_partn;
> +	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
> +	struct mshv_vp *vp;
> +
> +	if (!(ms_hyperv.ext_features &
> +	      HV_VP_DISPATCH_INTERRUPT_INJECTION_AVAILABLE))
> +		return -EOPNOTSUPP;
> +
> +	if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT)
> +		return -EOPNOTSUPP;
> +
> +	if (irq->lapic_control.logical_dest_mode)
> +		return -EOPNOTSUPP;
> +
> +	vp = partition->pt_vp_array[irq->lapic_apic_id];
> +
> +	if (!vp->vp_register_page)
> +		return -EOPNOTSUPP;
> +
> +	if (mshv_vp_irq_set_vector(vp, irq->lapic_vector))
> +		return -EINVAL;
> +
> +	if (vp->run.flags.root_sched_dispatched &&
> +	    vp->vp_register_page->interrupt_vectors.as_uint64)
> +		return -EBUSY;
> +
> +	wake_up(&vp->run.vp_suspend_queue);
> +
> +	return 0;
> +}
> +#else /* CONFIG_X86_64 */
> +static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif
> +
> +static void mshv_assert_irq_slow(struct mshv_irqfd *irqfd)
> +{
> +	struct mshv_partition *partition = irqfd->irqfd_partn;
> +	struct mshv_lapic_irq *irq = &irqfd->irqfd_lapic_irq;
> +	unsigned int seq;
> +	int idx;
> +
> +	WARN_ON(irqfd->irqfd_resampler &&
> +		!irq->lapic_control.level_triggered);
> +
> +	idx = srcu_read_lock(&partition->pt_irq_srcu);
> +	if (irqfd->irqfd_girq_ent.guest_irq_num) {
> +		if (!irqfd->irqfd_girq_ent.girq_entry_valid) {
> +			srcu_read_unlock(&partition->pt_irq_srcu, idx);
> +			return;
> +		}
> +
> +		do {
> +			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
> +		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
> +	}
> +
> +	hv_call_assert_virtual_interrupt(irqfd->irqfd_partn->pt_id,
> +					 irq->lapic_vector, irq->lapic_apic_id,
> +					 irq->lapic_control);
> +	srcu_read_unlock(&partition->pt_irq_srcu, idx);
> +}
> +
> +static void mshv_irqfd_resampler_shutdown(struct mshv_irqfd *irqfd)
> +{
> +	struct mshv_irqfd_resampler *rp = irqfd->irqfd_resampler;
> +	struct mshv_partition *pt = rp->rsmplr_partn;
> +
> +	mutex_lock(&pt->irqfds_resampler_lock);
> +
> +	hlist_del_rcu(&irqfd->irqfd_resampler_hnode);
> +	synchronize_srcu(&pt->pt_irq_srcu);
> +
> +	if (hlist_empty(&rp->rsmplr_irqfd_list)) {
> +		hlist_del(&rp->rsmplr_hnode);
> +		mshv_unregister_irq_ack_notifier(pt, &rp->rsmplr_notifier);
> +		kfree(rp);
> +	}
> +
> +	mutex_unlock(&pt->irqfds_resampler_lock);
> +}
> +
> +/*
> + * Race-free decouple logic (ordering is critical)
> + */
> +static void mshv_irqfd_shutdown(struct work_struct *work)
> +{
> +	struct mshv_irqfd *irqfd =
> +			container_of(work, struct mshv_irqfd, irqfd_shutdown);
> +
> +	/*
> +	 * Synchronize with the wait-queue and unhook ourselves to prevent
> +	 * further events.
> +	 */
> +	remove_wait_queue(irqfd->irqfd_wqh, &irqfd->irqfd_wait);
> +
> +	if (irqfd->irqfd_resampler) {
> +		mshv_irqfd_resampler_shutdown(irqfd);
> +		eventfd_ctx_put(irqfd->irqfd_resamplefd);
> +	}
> +
> +	/*
> +	 * It is now safe to release the object's resources
> +	 */
> +	eventfd_ctx_put(irqfd->irqfd_eventfd_ctx);
> +	kfree(irqfd);
> +}
> +
> +/* assumes partition->pt_irqfds_lock is held */
> +static bool mshv_irqfd_is_active(struct mshv_irqfd *irqfd)
> +{
> +	return !hlist_unhashed(&irqfd->irqfd_hnode);
> +}
> +
> +/*
> + * Mark the irqfd as inactive and schedule it for removal
> + *
> + * assumes partition->pt_irqfds_lock is held
> + */
> +static void mshv_irqfd_deactivate(struct mshv_irqfd *irqfd)
> +{
> +	if (!mshv_irqfd_is_active(irqfd))
> +		return;
> +
> +	hlist_del(&irqfd->irqfd_hnode);
> +
> +	queue_work(irqfd_cleanup_wq, &irqfd->irqfd_shutdown);
> +}
> +
> +/*
> + * Called with wqh->lock held and interrupts disabled
> + */
> +static int mshv_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> +			     int sync, void *key)
> +{
> +	struct mshv_irqfd *irqfd = container_of(wait, struct mshv_irqfd,
> +						irqfd_wait);
> +	unsigned long flags = (unsigned long)key;
> +	int idx;
> +	unsigned int seq;
> +	struct mshv_partition *pt = irqfd->irqfd_partn;
> +	int ret = 0;
> +
> +	if (flags & POLLIN) {
> +		u64 cnt;
> +
> +		eventfd_ctx_do_read(irqfd->irqfd_eventfd_ctx, &cnt);
> +		idx = srcu_read_lock(&pt->pt_irq_srcu);
> +		do {
> +			seq = read_seqcount_begin(&irqfd->irqfd_irqe_sc);
> +		} while (read_seqcount_retry(&irqfd->irqfd_irqe_sc, seq));
> +
> +		/* An event has been signaled, raise an interrupt */
> +		ret = mshv_try_assert_irq_fast(irqfd);
> +		if (ret)
> +			mshv_assert_irq_slow(irqfd);
> +
> +		srcu_read_unlock(&pt->pt_irq_srcu, idx);
> +
> +		ret = 1;
> +	}
> +
> +	if (flags & POLLHUP) {
> +		/* The eventfd is closing, detach from the partition */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&pt->pt_irqfds_lock, flags);
> +
> +		/*
> +		 * We must check if someone deactivated the irqfd before
> +		 * we could acquire the pt_irqfds_lock since the item is
> +		 * deactivated from the mshv side before it is unhooked from
> +		 * the wait-queue.  If it is already deactivated, we can
> +		 * simply return knowing the other side will cleanup for us.
> +		 * We cannot race against the irqfd going away since the
> +		 * other side is required to acquire wqh->lock, which we hold
> +		 */
> +		if (mshv_irqfd_is_active(irqfd))
> +			mshv_irqfd_deactivate(irqfd);
> +
> +		spin_unlock_irqrestore(&pt->pt_irqfds_lock, flags);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Must be called under pt_irqfds_lock */
> +static void mshv_irqfd_update(struct mshv_partition *pt,
> +			      struct mshv_irqfd *irqfd)
> +{
> +	write_seqcount_begin(&irqfd->irqfd_irqe_sc);
> +	irqfd->irqfd_girq_ent = mshv_ret_girq_entry(pt,
> +						    irqfd->irqfd_irqnum);
> +	mshv_copy_girq_info(&irqfd->irqfd_girq_ent, &irqfd->irqfd_lapic_irq);
> +	write_seqcount_end(&irqfd->irqfd_irqe_sc);
> +}
> +
> +void mshv_irqfd_routing_update(struct mshv_partition *pt)
> +{
> +	struct mshv_irqfd *irqfd;
> +
> +	spin_lock_irq(&pt->pt_irqfds_lock);
> +	hlist_for_each_entry(irqfd, &pt->pt_irqfds_list, irqfd_hnode)
> +		mshv_irqfd_update(pt, irqfd);
> +	spin_unlock_irq(&pt->pt_irqfds_lock);
> +}
> +
> +static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
> +				  poll_table *polltbl)
> +{
> +	struct mshv_irqfd *irqfd =
> +			container_of(polltbl, struct mshv_irqfd, irqfd_polltbl);
> +
> +	irqfd->irqfd_wqh = wqh;
> +	add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
> +}
> +
> +static int mshv_irqfd_assign(struct mshv_partition *pt,
> +			     struct mshv_user_irqfd *args)
> +{
> +	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
> +	struct mshv_irqfd *irqfd, *tmp;
> +	unsigned int events;
> +	struct fd f;
> +	int ret;
> +	int idx;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->irqfd_partn = pt;
> +	irqfd->irqfd_irqnum = args->gsi;
> +	INIT_WORK(&irqfd->irqfd_shutdown, mshv_irqfd_shutdown);
> +	seqcount_spinlock_init(&irqfd->irqfd_irqe_sc, &pt->pt_irqfds_lock);
> +
> +	f = fdget(args->fd);
> +	if (!fd_file(f)) {
> +		ret = -EBADF;
> +		goto out;
> +	}
> +
> +	eventfd = eventfd_ctx_fileget(fd_file(f));
> +	if (IS_ERR(eventfd)) {
> +		ret = PTR_ERR(eventfd);
> +		goto fail;
> +	}
> +
> +	irqfd->irqfd_eventfd_ctx = eventfd;
> +
> +	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE)) {
> +		struct mshv_irqfd_resampler *rp;
> +
> +		resamplefd = eventfd_ctx_fdget(args->resamplefd);
> +		if (IS_ERR(resamplefd)) {
> +			ret = PTR_ERR(resamplefd);
> +			goto fail;
> +		}
> +
> +		irqfd->irqfd_resamplefd = resamplefd;
> +
> +		mutex_lock(&pt->irqfds_resampler_lock);
> +
> +		hlist_for_each_entry(rp, &pt->irqfds_resampler_list,
> +				     rsmplr_hnode) {
> +			if (rp->rsmplr_notifier.irq_ack_gsi ==
> +							 irqfd->irqfd_irqnum) {
> +				irqfd->irqfd_resampler = rp;
> +				break;
> +			}
> +		}
> +
> +		if (!irqfd->irqfd_resampler) {
> +			rp = kzalloc(sizeof(*rp), GFP_KERNEL_ACCOUNT);
> +			if (!rp) {
> +				ret = -ENOMEM;
> +				mutex_unlock(&pt->irqfds_resampler_lock);
> +				goto fail;
> +			}
> +
> +			rp->rsmplr_partn = pt;
> +			INIT_HLIST_HEAD(&rp->rsmplr_irqfd_list);
> +			rp->rsmplr_notifier.irq_ack_gsi = irqfd->irqfd_irqnum;
> +			rp->rsmplr_notifier.irq_acked =
> +						      mshv_irqfd_resampler_ack;
> +
> +			hlist_add_head(&rp->rsmplr_hnode,
> +				       &pt->irqfds_resampler_list);
> +			mshv_register_irq_ack_notifier(pt,
> +						       &rp->rsmplr_notifier);
> +			irqfd->irqfd_resampler = rp;
> +		}
> +
> +		hlist_add_head_rcu(&irqfd->irqfd_resampler_hnode,
> +				   &irqfd->irqfd_resampler->rsmplr_irqfd_list);
> +
> +		mutex_unlock(&pt->irqfds_resampler_lock);
> +	}
> +
> +	/*
> +	 * Install our own custom wake-up handling so we are notified via
> +	 * a callback whenever someone signals the underlying eventfd
> +	 */
> +	init_waitqueue_func_entry(&irqfd->irqfd_wait, mshv_irqfd_wakeup);
> +	init_poll_funcptr(&irqfd->irqfd_polltbl, mshv_irqfd_queue_proc);
> +
> +	spin_lock_irq(&pt->pt_irqfds_lock);
> +	if (args->flags & BIT(MSHV_IRQFD_BIT_RESAMPLE) &&
> +	    !irqfd->irqfd_lapic_irq.lapic_control.level_triggered) {
> +		/*
> +		 * Resample Fd must be for level triggered interrupt
> +		 * Otherwise return with failure
> +		 */
> +		spin_unlock_irq(&pt->pt_irqfds_lock);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	ret = 0;
> +	hlist_for_each_entry(tmp, &pt->pt_irqfds_list, irqfd_hnode) {
> +		if (irqfd->irqfd_eventfd_ctx != tmp->irqfd_eventfd_ctx)
> +			continue;
> +		/* This fd is used for another irq already. */
> +		ret = -EBUSY;
> +		spin_unlock_irq(&pt->pt_irqfds_lock);
> +		goto fail;
> +	}
> +
> +	idx = srcu_read_lock(&pt->pt_irq_srcu);
> +	mshv_irqfd_update(pt, irqfd);
> +	hlist_add_head(&irqfd->irqfd_hnode, &pt->pt_irqfds_list);
> +	spin_unlock_irq(&pt->pt_irqfds_lock);
> +
> +	/*
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered, and trigger it as if we didn't miss it.
> +	 */
> +	events = vfs_poll(fd_file(f), &irqfd->irqfd_polltbl);
> +
> +	if (events & POLLIN)
> +		mshv_assert_irq_slow(irqfd);
> +
> +	srcu_read_unlock(&pt->pt_irq_srcu, idx);
> +	/*
> +	 * do not drop the file until the irqfd is fully initialized, otherwise
> +	 * we might race against the POLLHUP
> +	 */
> +	fdput(f);
> +
> +	return 0;
> +
> +fail:
> +	if (irqfd->irqfd_resampler)
> +		mshv_irqfd_resampler_shutdown(irqfd);
> +
> +	if (resamplefd && !IS_ERR(resamplefd))
> +		eventfd_ctx_put(resamplefd);
> +
> +	if (eventfd && !IS_ERR(eventfd))
> +		eventfd_ctx_put(eventfd);
> +
> +	fdput(f);
> +
> +out:
> +	kfree(irqfd);
> +	return ret;
> +}
> +
> +/*
> + * shutdown any irqfd's that match fd+gsi
> + */
> +static int mshv_irqfd_deassign(struct mshv_partition *pt,
> +			       struct mshv_user_irqfd *args)
> +{
> +	struct mshv_irqfd *irqfd;
> +	struct hlist_node *n;
> +	struct eventfd_ctx *eventfd;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list,
> +				  irqfd_hnode) {
> +		if (irqfd->irqfd_eventfd_ctx == eventfd &&
> +		    irqfd->irqfd_irqnum == args->gsi)
> +
> +			mshv_irqfd_deactivate(irqfd);
> +	}
> +
> +	eventfd_ctx_put(eventfd);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * so that we guarantee there will not be any more interrupts on this
> +	 * gsi once this deassign function returns.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +	return 0;
> +}
> +
> +int mshv_set_unset_irqfd(struct mshv_partition *pt,
> +			 struct mshv_user_irqfd *args)
> +{
> +	if (args->flags & ~MSHV_IRQFD_FLAGS_MASK)
> +		return -EINVAL;
> +
> +	if (args->flags & BIT(MSHV_IRQFD_BIT_DEASSIGN))
> +		return mshv_irqfd_deassign(pt, args);
> +
> +	return mshv_irqfd_assign(pt, args);
> +}
> +
> +/*
> + * This function is called as the mshv VM fd is being released.
> + * Shutdown all irqfds that still remain open
> + */
> +static void mshv_irqfd_release(struct mshv_partition *pt)
> +{
> +	struct mshv_irqfd *irqfd;
> +	struct hlist_node *n;
> +
> +	spin_lock_irq(&pt->pt_irqfds_lock);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &pt->pt_irqfds_list, irqfd_hnode)
> +		mshv_irqfd_deactivate(irqfd);
> +
> +	spin_unlock_irq(&pt->pt_irqfds_lock);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * since we do not take a mshv_partition* reference.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +}
> +
> +int mshv_irqfd_wq_init(void)
> +{
> +	irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0);
> +	if (!irqfd_cleanup_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void mshv_irqfd_wq_cleanup(void)
> +{
> +	destroy_workqueue(irqfd_cleanup_wq);
> +}
> +
> +/*
> + * --------------------------------------------------------------------
> + * ioeventfd: translate a MMIO memory write to an eventfd signal.
> + *
> + * userspace can register a MMIO address with an eventfd for receiving
> + * notification when the memory has been touched.
> + * --------------------------------------------------------------------
> + */
> +
> +static void ioeventfd_release(struct mshv_ioeventfd *p, u64 partition_id)
> +{
> +	if (p->iovntfd_doorbell_id > 0)
> +		mshv_unregister_doorbell(partition_id, p->iovntfd_doorbell_id);
> +	eventfd_ctx_put(p->iovntfd_eventfd);
> +	kfree(p);
> +}
> +
> +/* MMIO writes trigger an event if the addr/val match */
> +static void ioeventfd_mmio_write(int doorbell_id, void *data)
> +{
> +	struct mshv_partition *partition = (struct mshv_partition *)data;
> +	struct mshv_ioeventfd *p;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(p, &partition->ioeventfds_list, iovntfd_hnode)
> +		if (p->iovntfd_doorbell_id == doorbell_id) {
> +			eventfd_signal(p->iovntfd_eventfd);
> +			break;
> +		}
> +
> +	rcu_read_unlock();
> +}
> +
> +static bool ioeventfd_check_collision(struct mshv_partition *pt,
> +				      struct mshv_ioeventfd *p)
> +	__must_hold(&pt->mutex)
> +{
> +	struct mshv_ioeventfd *_p;
> +
> +	hlist_for_each_entry(_p, &pt->ioeventfds_list, iovntfd_hnode)
> +		if (_p->iovntfd_addr == p->iovntfd_addr &&
> +		    _p->iovntfd_length == p->iovntfd_length &&
> +		    (_p->iovntfd_wildcard || p->iovntfd_wildcard ||
> +		     _p->iovntfd_datamatch == p->iovntfd_datamatch))
> +			return true;
> +
> +	return false;
> +}
> +
> +static int mshv_assign_ioeventfd(struct mshv_partition *pt,
> +				 struct mshv_user_ioeventfd *args)
> +	__must_hold(&pt->mutex)
> +{
> +	struct mshv_ioeventfd *p;
> +	struct eventfd_ctx *eventfd;
> +	u64 doorbell_flags = 0;
> +	int ret;
> +
> +	/* This mutex is currently protecting ioeventfd.items list */
> +	WARN_ON_ONCE(!mutex_is_locked(&pt->pt_mutex));
> +
> +	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_PIO))
> +		return -EOPNOTSUPP;
> +
> +	/* must be natural-word sized */
> +	switch (args->len) {
> +	case 0:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
> +		break;
> +	case 1:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
> +		break;
> +	case 2:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
> +		break;
> +	case 4:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
> +		break;
> +	case 8:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	/* check for range overflow */
> +	if (args->addr + args->len < args->addr)
> +		return -EINVAL;
> +
> +	/* check for extra flags that we don't understand */
> +	if (args->flags & ~MSHV_IOEVENTFD_FLAGS_MASK)
> +		return -EINVAL;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	p = kzalloc(sizeof(*p), GFP_KERNEL);
> +	if (!p) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	p->iovntfd_addr = args->addr;
> +	p->iovntfd_length  = args->len;
> +	p->iovntfd_eventfd = eventfd;
> +
> +	/* The datamatch feature is optional, otherwise this is a wildcard */
> +	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_DATAMATCH)) {
> +		p->iovntfd_datamatch = args->datamatch;
> +	} else {
> +		p->iovntfd_wildcard = true;
> +		doorbell_flags |= HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
> +	}
> +
> +	if (ioeventfd_check_collision(pt, p)) {
> +		ret = -EEXIST;
> +		goto unlock_fail;
> +	}
> +
> +	ret = mshv_register_doorbell(pt->pt_id, ioeventfd_mmio_write,
> +				     (void *)pt, p->iovntfd_addr,
> +				     p->iovntfd_datamatch, doorbell_flags);
> +	if (ret < 0)
> +		goto unlock_fail;
> +
> +	p->iovntfd_doorbell_id = ret;
> +
> +	hlist_add_head_rcu(&p->iovntfd_hnode, &pt->ioeventfds_list);
> +
> +	return 0;
> +
> +unlock_fail:
> +	kfree(p);
> +
> +fail:
> +	eventfd_ctx_put(eventfd);
> +
> +	return ret;
> +}
> +
> +static int mshv_deassign_ioeventfd(struct mshv_partition *pt,
> +				   struct mshv_user_ioeventfd *args)
> +	__must_hold(&pt->mutex)
> +{
> +	struct mshv_ioeventfd *p;
> +	struct eventfd_ctx *eventfd;
> +	struct hlist_node *n;
> +	int ret = -ENOENT;
> +
> +	/* This mutex is currently protecting ioeventfd.items list */
> +	WARN_ON_ONCE(!mutex_is_locked(&pt->pt_mutex));
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	hlist_for_each_entry_safe(p, n, &pt->ioeventfds_list, iovntfd_hnode) {
> +		bool wildcard = !(args->flags & BIT(MSHV_IOEVENTFD_BIT_DATAMATCH));
> +
> +		if (p->iovntfd_eventfd != eventfd  ||
> +		    p->iovntfd_addr != args->addr  ||
> +		    p->iovntfd_length != args->len ||
> +		    p->iovntfd_wildcard != wildcard)
> +			continue;
> +
> +		if (!p->iovntfd_wildcard &&
> +		    p->iovntfd_datamatch != args->datamatch)
> +			continue;
> +
> +		hlist_del_rcu(&p->iovntfd_hnode);
> +		synchronize_rcu();
> +		ioeventfd_release(p, pt->pt_id);
> +		ret = 0;
> +		break;
> +	}
> +
> +	eventfd_ctx_put(eventfd);
> +
> +	return ret;
> +}
> +
> +int mshv_set_unset_ioeventfd(struct mshv_partition *pt,
> +			     struct mshv_user_ioeventfd *args)
> +	__must_hold(&pt->mutex)
> +{
> +	if ((args->flags & ~MSHV_IOEVENTFD_FLAGS_MASK) ||
> +	    mshv_field_nonzero(*args, rsvd))
> +		return -EINVAL;
> +
> +	/* PIO not yet implemented */
> +	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_PIO))
> +		return -EOPNOTSUPP;
> +
> +	if (args->flags & BIT(MSHV_IOEVENTFD_BIT_DEASSIGN))
> +		return mshv_deassign_ioeventfd(pt, args);
> +
> +	return mshv_assign_ioeventfd(pt, args);
> +}
> +
> +void mshv_eventfd_init(struct mshv_partition *pt)
> +{
> +	spin_lock_init(&pt->pt_irqfds_lock);
> +	INIT_HLIST_HEAD(&pt->pt_irqfds_list);
> +
> +	INIT_HLIST_HEAD(&pt->irqfds_resampler_list);
> +	mutex_init(&pt->irqfds_resampler_lock);
> +
> +	INIT_HLIST_HEAD(&pt->ioeventfds_list);
> +}
> +
> +void mshv_eventfd_release(struct mshv_partition *pt)
> +{
> +	struct hlist_head items;
> +	struct hlist_node *n;
> +	struct mshv_ioeventfd *p;
> +
> +	hlist_move_list(&pt->ioeventfds_list, &items);
> +	synchronize_rcu();
> +
> +	hlist_for_each_entry_safe(p, n, &items, iovntfd_hnode) {
> +		hlist_del(&p->iovntfd_hnode);
> +		ioeventfd_release(p, pt->pt_id);
> +	}
> +
> +	mshv_irqfd_release(pt);
> +}
> diff --git a/drivers/hv/mshv_eventfd.h b/drivers/hv/mshv_eventfd.h
> new file mode 100644
> index 000000000000..332e7670a344
> --- /dev/null
> +++ b/drivers/hv/mshv_eventfd.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * irqfd: Allows an fd to be used to inject an interrupt to the guest.
> + * ioeventfd: Allow an fd to be used to receive a signal from the guest.
> + * All credit goes to kvm developers.
> + */
> +
> +#ifndef __LINUX_MSHV_EVENTFD_H
> +#define __LINUX_MSHV_EVENTFD_H
> +
> +#include <linux/poll.h>
> +
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +/* struct to contain list of irqfds sharing an irq. Updates are protected by
> + * partition.irqfds.resampler_lock
> + */
> +struct mshv_irqfd_resampler {
> +	struct mshv_partition	    *rsmplr_partn;
> +	struct hlist_head	     rsmplr_irqfd_list;
> +	struct mshv_irq_ack_notifier rsmplr_notifier;
> +	struct hlist_node	     rsmplr_hnode;
> +};
> +
> +struct mshv_irqfd {
> +	struct mshv_partition		    *irqfd_partn;
> +	struct eventfd_ctx		    *irqfd_eventfd_ctx;
> +	struct mshv_guest_irq_ent	     irqfd_girq_ent;
> +	seqcount_spinlock_t		     irqfd_irqe_sc;
> +	u32				     irqfd_irqnum;
> +	struct mshv_lapic_irq		     irqfd_lapic_irq;
> +	struct hlist_node		     irqfd_hnode;
> +	poll_table			     irqfd_polltbl;
> +	wait_queue_head_t		    *irqfd_wqh;
> +	wait_queue_entry_t		     irqfd_wait;
> +	struct work_struct		     irqfd_shutdown;
> +	struct mshv_irqfd_resampler	    *irqfd_resampler;
> +	struct eventfd_ctx		    *irqfd_resamplefd;
> +	struct hlist_node		     irqfd_resampler_hnode;
> +};
> +
> +void mshv_eventfd_init(struct mshv_partition *partition);
> +void mshv_eventfd_release(struct mshv_partition *partition);
> +
> +void mshv_register_irq_ack_notifier(struct mshv_partition *partition,
> +				    struct mshv_irq_ack_notifier *mian);
> +void mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
> +				      struct mshv_irq_ack_notifier *mian);
> +bool mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi);
> +
> +int mshv_set_unset_irqfd(struct mshv_partition *partition,
> +			 struct mshv_user_irqfd *args);
> +
> +int mshv_irqfd_wq_init(void);
> +void mshv_irqfd_wq_cleanup(void);
> +
> +struct mshv_ioeventfd {
> +	struct hlist_node    iovntfd_hnode;
> +	u64		     iovntfd_addr;
> +	int		     iovntfd_length;
> +	struct eventfd_ctx  *iovntfd_eventfd;
> +	u64		     iovntfd_datamatch;
> +	int		     iovntfd_doorbell_id;
> +	bool		     iovntfd_wildcard;
> +};
> +
> +int mshv_set_unset_ioeventfd(struct mshv_partition *pt,
> +			     struct mshv_user_ioeventfd *args);
> +
> +#endif /* __LINUX_MSHV_EVENTFD_H */
> diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
> new file mode 100644
> index 000000000000..f956e125afb4
> --- /dev/null
> +++ b/drivers/hv/mshv_irq.c
> @@ -0,0 +1,128 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Authors:
> + *   Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <asm/mshyperv.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +MODULE_AUTHOR("Microsoft");
> +MODULE_LICENSE("GPL");
> +
> +/* called from the ioctl code, user wants to update the guest irq table */
> +int mshv_update_routing_table(struct mshv_partition *partition,
> +			      const struct mshv_user_irq_entry *ue,
> +			      unsigned int numents)
> +{
> +	struct mshv_girq_routing_table *new = NULL, *old;
> +	u32 i, nr_rt_entries = 0;
> +	int r = 0;
> +
> +	if (numents == 0)
> +		goto swap_routes;
> +
> +	for (i = 0; i < numents; i++) {
> +		if (ue[i].gsi >= MSHV_MAX_GUEST_IRQS)
> +			return -EINVAL;
> +
> +		if (ue[i].address_hi)
> +			return -EINVAL;
> +
> +		nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
> +	}
> +	nr_rt_entries += 1;
> +
> +	new = kzalloc(struct_size(new, mshv_girq_info_tbl, nr_rt_entries),
> +		      GFP_KERNEL_ACCOUNT);
> +	if (!new)
> +		return -ENOMEM;
> +
> +	new->num_rt_entries = nr_rt_entries;
> +	for (i = 0; i < numents; i++) {
> +		struct mshv_guest_irq_ent *girq;
> +
> +		girq = &new->mshv_girq_info_tbl[ue[i].gsi];
> +
> +		/*
> +		 * Allow only one to one mapping between GSI and MSI routing.
> +		 */
> +		if (girq->guest_irq_num != 0) {
> +			r = -EINVAL;
> +			goto out;
> +		}
> +
> +		girq->guest_irq_num = ue[i].gsi;
> +		girq->girq_addr_lo = ue[i].address_lo;
> +		girq->girq_addr_hi = ue[i].address_hi;
> +		girq->girq_irq_data = ue[i].data;
> +		girq->girq_entry_valid = true;
> +	}
> +
> +swap_routes:
> +	mutex_lock(&partition->pt_irq_lock);
> +	old = rcu_dereference_protected(partition->pt_girq_tbl, 1);
> +	rcu_assign_pointer(partition->pt_girq_tbl, new);
> +	mshv_irqfd_routing_update(partition);
> +	mutex_unlock(&partition->pt_irq_lock);
> +
> +	synchronize_srcu_expedited(&partition->pt_irq_srcu);
> +	new = old;
> +
> +out:
> +	kfree(new);
> +
> +	return r;
> +}
> +
> +/* vm is going away, kfree the irq routing table */
> +void mshv_free_routing_table(struct mshv_partition *partition)
> +{
> +	struct mshv_girq_routing_table *rt =
> +				   rcu_access_pointer(partition->pt_girq_tbl);
> +
> +	kfree(rt);
> +}
> +
> +struct mshv_guest_irq_ent
> +mshv_ret_girq_entry(struct mshv_partition *partition, u32 irqnum)
> +{
> +	struct mshv_guest_irq_ent entry = { 0 };
> +	struct mshv_girq_routing_table *girq_tbl;
> +
> +	girq_tbl = srcu_dereference_check(partition->pt_girq_tbl,
> +					  &partition->pt_irq_srcu,
> +					  lockdep_is_held(&partition->pt_irq_lock));
> +	if (!girq_tbl || irqnum >= girq_tbl->num_rt_entries) {
> +		/*
> +		 * Premature register_irqfd, setting valid_entry = 0
> +		 * would ignore this entry anyway
> +		 */
> +		entry.guest_irq_num = irqnum;
> +		return entry;
> +	}
> +
> +	return girq_tbl->mshv_girq_info_tbl[irqnum];
> +}
> +
> +void mshv_copy_girq_info(struct mshv_guest_irq_ent *ent,
> +			 struct mshv_lapic_irq *lirq)
> +{
> +	memset(lirq, 0, sizeof(*lirq));
> +	if (!ent || !ent->girq_entry_valid)
> +		return;
> +
> +	lirq->lapic_vector = ent->girq_irq_data & 0xFF;
> +	lirq->lapic_apic_id = (ent->girq_addr_lo >> 12) & 0xFF;
> +	lirq->lapic_control.interrupt_type = (ent->girq_irq_data & 0x700) >> 8;
> +	lirq->lapic_control.level_triggered = (ent->girq_irq_data >> 15) & 0x1;
> +	lirq->lapic_control.logical_dest_mode = (ent->girq_addr_lo >> 2) & 0x1;
> +}
> diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
> new file mode 100644
> index 000000000000..a40abde6fd15
> --- /dev/null
> +++ b/drivers/hv/mshv_portid_table.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/version.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <asm/mshyperv.h>
> +
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +/*
> + * Ports and connections are hypervisor struct used for inter-partition
> + * communication. Port represents the source and connection represents
> + * the destination. Partitions are responsible for managing the port and
> + * connection ids.
> + *
> + */
> +
> +#define PORTID_MIN	1
> +#define PORTID_MAX	INT_MAX
> +
> +static DEFINE_IDR(port_table_idr);
> +
> +void
> +mshv_port_table_fini(void)
> +{
> +	struct port_table_info *port_info;
> +	unsigned long i, tmp;
> +
> +	idr_lock(&port_table_idr);
> +	if (!idr_is_empty(&port_table_idr)) {
> +		idr_for_each_entry_ul(&port_table_idr, port_info, tmp, i) {
> +			port_info = idr_remove(&port_table_idr, i);
> +			kfree_rcu(port_info, portbl_rcu);
> +		}
> +	}
> +	idr_unlock(&port_table_idr);
> +}
> +
> +int
> +mshv_portid_alloc(struct port_table_info *info)
> +{
> +	int ret = 0;
> +
> +	idr_lock(&port_table_idr);
> +	ret = idr_alloc(&port_table_idr, info, PORTID_MIN,
> +			PORTID_MAX, GFP_KERNEL);
> +	idr_unlock(&port_table_idr);
> +
> +	return ret;
> +}
> +
> +void
> +mshv_portid_free(int port_id)
> +{
> +	struct port_table_info *info;
> +
> +	idr_lock(&port_table_idr);
> +	info = idr_remove(&port_table_idr, port_id);
> +	WARN_ON(!info);
> +	idr_unlock(&port_table_idr);
> +
> +	synchronize_rcu();
> +	kfree(info);
> +}
> +
> +int
> +mshv_portid_lookup(int port_id, struct port_table_info *info)
> +{
> +	struct port_table_info *_info;
> +	int ret = -ENOENT;
> +
> +	rcu_read_lock();
> +	_info = idr_find(&port_table_idr, port_id);
> +	rcu_read_unlock();
> +
> +	if (_info) {
> +		*info = *_info;
> +		ret = 0;
> +	}
> +
> +	return ret;
> +}
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> new file mode 100644
> index 000000000000..f8d85db14db1
> --- /dev/null
> +++ b/drivers/hv/mshv_root.h
> @@ -0,0 +1,321 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + */
> +
> +#ifndef _MSHV_ROOT_H_
> +#define _MSHV_ROOT_H_
> +
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/semaphore.h>
> +#include <linux/sched.h>
> +#include <linux/srcu.h>
> +#include <linux/wait.h>
> +#include <linux/hashtable.h>
> +#include <linux/dev_printk.h>
> +#include <uapi/linux/mshv.h>
> +
> +/*
> + * Hypervisor must be between these version numbers (inclusive)
> + * to guarantee compatibility
> + */
> +#define MSHV_HV_MIN_VERSION		(27744)
> +#define MSHV_HV_MAX_VERSION		(27751)
> +
> +#define MSHV_MAX_VPS			256
> +
> +#define MSHV_PARTITIONS_HASH_BITS	9
> +
> +#define MSHV_PIN_PAGES_BATCH_SIZE	(0x10000000ULL / HV_HYP_PAGE_SIZE)
> +
> +struct mshv_vp {
> +	u32 vp_index;
> +	struct mshv_partition *vp_partition;
> +	struct mutex vp_mutex;
> +	struct hv_vp_register_page *vp_register_page;
> +	struct hv_message *vp_intercept_msg_page;
> +	void *vp_ghcb_page;
> +	struct hv_stats_page *vp_stats_pages[2];
> +	struct {
> +		atomic64_t vp_signaled_count;
> +		struct {
> +			u64 intercept_suspend: 1;
> +			u64 root_sched_blocked: 1; /* root scheduler only */
> +			u64 root_sched_dispatched: 1; /* root scheduler only */
> +			u64 reserved: 62;

Hmmm.  This looks like 65 bits allocated in a u64.
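
The three one-bit flags plus a 62-bit reserved field add up to 65 bits, so the
bitfield spills into a second u64. A minimal sketch of one possible fix,
assuming the intent is for the flags to occupy exactly one u64: shrink the
reserved field to 61 bits and check the size at build time. The struct name
vp_run_flags is a hypothetical stand-in for the anonymous struct in the patch,
and the check uses C11 static_assert from userspace headers; in the kernel it
would presumably be the kernel's own static_assert() or BUILD_BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the anonymous "flags" struct inside
 * struct mshv_vp; only the width of the reserved field is changed.
 */
struct vp_run_flags {
	uint64_t intercept_suspend: 1;
	uint64_t root_sched_blocked: 1;		/* root scheduler only */
	uint64_t root_sched_dispatched: 1;	/* root scheduler only */
	uint64_t reserved: 61;			/* 3 + 61 = 64 bits */
};

/* Fails to compile if the fields no longer fit in a single 64-bit word. */
static_assert(sizeof(struct vp_run_flags) == sizeof(uint64_t),
	      "run flags must fit in one u64");
```

The same kind of check would catch a similar miscount in any of the other
bitfield structs in this header.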

> +		} flags;
> +		unsigned int kicked_by_hv;
> +		wait_queue_head_t vp_suspend_queue;
> +	} run;
> +};
> +
> +#define vp_fmt(fmt) "p%lluvp%u: " fmt
> +#define vp_dev(v) ((v)->vp_partition->pt_module_dev)
> +#define vp_emerg(v, fmt, ...) \
> +	dev_emerg(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		  (v)->vp_index, ##__VA_ARGS__)
> +#define vp_crit(v, fmt, ...) \
> +	dev_crit(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		 (v)->vp_index, ##__VA_ARGS__)
> +#define vp_alert(v, fmt, ...) \
> +	dev_alert(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		  (v)->vp_index, ##__VA_ARGS__)
> +#define vp_err(v, fmt, ...) \
> +	dev_err(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		(v)->vp_index, ##__VA_ARGS__)
> +#define vp_warn(v, fmt, ...) \
> +	dev_warn(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		 (v)->vp_index, ##__VA_ARGS__)
> +#define vp_notice(v, fmt, ...) \
> +	dev_notice(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		   (v)->vp_index, ##__VA_ARGS__)
> +#define vp_info(v, fmt, ...) \
> +	dev_info(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		 (v)->vp_index, ##__VA_ARGS__)
> +#define vp_dbg(v, fmt, ...) \
> +	dev_dbg(vp_dev(v), vp_fmt(fmt), (v)->vp_partition->pt_id, \
> +		(v)->vp_index, ##__VA_ARGS__)
> +
> +struct mshv_mem_region {
> +	struct hlist_node hnode;
> +	u64 nr_pages;
> +	u64 start_gfn;
> +	u64 start_uaddr;
> +	u32 hv_map_flags;
> +	struct {
> +		u64 large_pages:  1; /* 2MiB */
> +		u64 range_pinned: 1;
> +		u64 reserved:	 62;
> +	} flags;
> +	struct mshv_partition *partition;
> +	struct page *pages[];
> +};
> +
> +struct mshv_irq_ack_notifier {
> +	struct hlist_node link;
> +	unsigned int irq_ack_gsi;
> +	void (*irq_acked)(struct mshv_irq_ack_notifier *mian);
> +};
> +
> +struct mshv_partition {
> +	struct device *pt_module_dev;
> +
> +	struct hlist_node pt_hnode;
> +	u64 pt_id;
> +	refcount_t pt_ref_count;
> +	struct mutex pt_mutex;
> +	struct hlist_head pt_mem_regions; // not ordered
> +
> +	u32 pt_vp_count;
> +	struct mshv_vp *pt_vp_array[MSHV_MAX_VPS];
> +
> +	struct mutex pt_irq_lock;
> +	struct srcu_struct pt_irq_srcu;
> +	struct hlist_head irq_ack_notifier_list;
> +
> +	struct hlist_head pt_devices;
> +
> +	/*
> +	 * Since MSHV does not support more than one async hypercall in flight

Wording is a bit messed up.  Drop the "Since"?

> +	 * for a single partition. Thus, it is okay to define per partition
> +	 * async hypercall status.
> +	 */
> +	struct completion async_hypercall;
> +	u64 async_hypercall_status;
> +
> +	spinlock_t	  pt_irqfds_lock;
> +	struct hlist_head pt_irqfds_list;
> +	struct mutex	  irqfds_resampler_lock;
> +	struct hlist_head irqfds_resampler_list;
> +
> +	struct hlist_head ioeventfds_list;
> +
> +	struct mshv_girq_routing_table __rcu *pt_girq_tbl;
> +	u64 isolation_type;
> +	bool import_completed;
> +	bool pt_initialized;
> +};
> +
> +#define pt_fmt(fmt) "p%llu: " fmt
> +#define pt_dev(p) ((p)->pt_module_dev)
> +#define pt_emerg(p, fmt, ...) \
> +	dev_emerg(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_crit(p, fmt, ...) \
> +	dev_crit(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_alert(p, fmt, ...) \
> +	dev_alert(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_err(p, fmt, ...) \
> +	dev_err(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_warn(p, fmt, ...) \
> +	dev_warn(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_notice(p, fmt, ...) \
> +	dev_notice(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_info(p, fmt, ...) \
> +	dev_info(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +#define pt_dbg(p, fmt, ...) \
> +	dev_dbg(pt_dev(p), pt_fmt(fmt), (p)->pt_id, ##__VA_ARGS__)
> +
> +struct mshv_lapic_irq {
> +	u32 lapic_vector;
> +	u64 lapic_apic_id;
> +	union hv_interrupt_control lapic_control;
> +};
> +
> +#define MSHV_MAX_GUEST_IRQS		4096
> +
> +/* representation of one guest irq entry, either msi or legacy */
> +struct mshv_guest_irq_ent {
> +	u32 girq_entry_valid;	/* vfio looks at this */
> +	u32 guest_irq_num;	/* a unique number for each irq */
> +	u32 girq_addr_lo;	/* guest irq msi address info */
> +	u32 girq_addr_hi;
> +	u32 girq_irq_data;	/* idt vector in some cases */
> +};
> +
> +struct mshv_girq_routing_table {
> +	u32 num_rt_entries;
> +	struct mshv_guest_irq_ent mshv_girq_info_tbl[];
> +};
> +
> +struct hv_synic_pages {
> +	struct hv_message_page *synic_message_page;
> +	struct hv_synic_event_flags_page *synic_event_flags_page;
> +	struct hv_synic_event_ring_page *synic_event_ring_page;
> +};
> +
> +struct mshv_root {
> +	struct hv_synic_pages __percpu *synic_pages;
> +	spinlock_t pt_ht_lock;
> +	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
> +};
> +
> +/*
> + * Callback for doorbell events.
> + * NOTE: This is called in interrupt context. Callback
> + * should defer slow and sleeping logic to later.
> + */
> +typedef void (*doorbell_cb_t) (int doorbell_id, void *);
> +
> +/*
> + * port table information
> + */
> +struct port_table_info {
> +	struct rcu_head portbl_rcu;
> +	enum hv_port_type hv_port_type;
> +	union {
> +		struct {
> +			u64 reserved[2];
> +		} hv_port_message;
> +		struct {
> +			u64 reserved[2];
> +		} hv_port_event;
> +		struct {
> +			u64 reserved[2];
> +		} hv_port_monitor;
> +		struct {
> +			doorbell_cb_t doorbell_cb;
> +			void *data;
> +		} hv_port_doorbell;
> +	};
> +};
> +
> +int mshv_update_routing_table(struct mshv_partition *partition,
> +			      const struct mshv_user_irq_entry *entries,
> +			      unsigned int numents);
> +void mshv_free_routing_table(struct mshv_partition *partition);
> +
> +struct mshv_guest_irq_ent mshv_ret_girq_entry(struct mshv_partition *partition,
> +					      u32 irq_num);
> +
> +void mshv_copy_girq_info(struct mshv_guest_irq_ent *src_irq,
> +			 struct mshv_lapic_irq *dest_irq);
> +
> +void mshv_irqfd_routing_update(struct mshv_partition *partition);
> +
> +void mshv_port_table_fini(void);
> +int mshv_portid_alloc(struct port_table_info *info);
> +int mshv_portid_lookup(int port_id, struct port_table_info *info);
> +void mshv_portid_free(int port_id);
> +
> +int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
> +			   void *data, u64 gpa, u64 val, u64 flags);
> +void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
> +
> +void mshv_isr(void);
> +int mshv_synic_init(unsigned int cpu);
> +int mshv_synic_cleanup(unsigned int cpu);
> +
> +static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
> +{
> +	return partition->isolation_type == HV_PARTITION_ISOLATION_TYPE_SNP;
> +}
> +
> +struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
> +void mshv_partition_put(struct mshv_partition *partition);
> +struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
> +
> +/* hypercalls */
> +
> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> +int hv_call_create_partition(u64 flags,
> +			     struct hv_partition_creation_properties creation_properties,
> +			     union hv_partition_isolation_properties isolation_properties,
> +			     u64 *partition_id);
> +int hv_call_initialize_partition(u64 partition_id);
> +int hv_call_finalize_partition(u64 partition_id);
> +int hv_call_delete_partition(u64 partition_id);
> +int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
> +int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> +			  u32 flags, struct page **pages);
> +int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> +			    u32 flags);
> +int hv_call_delete_vp(u64 partition_id, u32 vp_index);
> +int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
> +				     u64 dest_addr,
> +				     union hv_interrupt_control control);
> +int hv_call_clear_virtual_interrupt(u64 partition_id);
> +int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
> +				  union hv_gpa_page_access_state_flags state_flags,
> +				  int *written_total,
> +				  union hv_gpa_page_access_state *states);
> +int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
> +			 struct hv_vp_state_data state_data,
> +			 /* Choose between pages and ret_output */
> +			 u64 page_count, struct page **pages,
> +			 union hv_output_get_vp_state *ret_output);
> +int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
> +			 /* Choose between pages and bytes */
> +			 struct hv_vp_state_data state_data, u64 page_count,
> +			 struct page **pages, u32 num_bytes, u8 *bytes);
> +int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +			      union hv_input_vtl input_vtl,
> +			      struct page **state_page);
> +int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +				union hv_input_vtl input_vtl);
> +int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> +			u64 connection_partition_id, struct hv_port_info *port_info,
> +			u8 port_vtl, u8 min_connection_vtl, int node);
> +int hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id);
> +int hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> +			 u64 connection_partition_id,
> +			 union hv_connection_id connection_id,
> +			 struct hv_connection_info *connection_info,
> +			 u8 connection_vtl, int node);
> +int hv_call_disconnect_port(u64 connection_partition_id,
> +			    union hv_connection_id connection_id);
> +int hv_call_notify_port_ring_empty(u32 sint_index);
> +int hv_call_map_stat_page(enum hv_stats_object_type type,
> +			  const union hv_stats_object_identity *identity,
> +			  void **addr);
> +int hv_call_unmap_stat_page(enum hv_stats_object_type type,
> +			    const union hv_stats_object_identity *identity);
> +int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> +				   u64 page_struct_count, u32 host_access,
> +				   u32 flags, u8 acquire);
> +
> +extern struct mshv_root mshv_root;
> +extern enum hv_scheduler_type hv_scheduler_type;
> +extern u8 __percpu **hv_synic_eventring_tail;

Per comments on an earlier patch, the __percpu is in the wrong place.

> +
> +#endif /* _MSHV_ROOT_H_ */
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> new file mode 100644
> index 000000000000..b3e5c652015d
> --- /dev/null
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -0,0 +1,876 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Hypercall helper functions used by the mshv_root module.
> + *
> + * Authors:
> + *   Nuno Das Neves <nunodasneves@linux.microsoft.com>
> + *   Wei Liu <wei.liu@kernel.org>
> + *   Jinank Jain <jinankjain@microsoft.com>
> + *   Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> + *   Asher Kariv <askariv@microsoft.com>
> + *   Muminul Islam <Muminul.Islam@microsoft.com>
> + *   Anatol Belski <anbelski@linux.microsoft.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <asm/mshyperv.h>
> +
> +#include "mshv_root.h"
> +
> +/* Determined empirically */
> +#define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> +#define HV_MAP_GPA_DEPOSIT_PAGES	256
> +
> +#define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
> +
> +#define HV_WITHDRAW_BATCH_SIZE	(HV_HYP_PAGE_SIZE / sizeof(u64))
> +#define HV_MAP_GPA_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_map_gpa_pages)) \
> +		/ sizeof(u64))
> +#define HV_GET_VP_STATE_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_get_vp_state)) \
> +		/ sizeof(u64))
> +#define HV_SET_VP_STATE_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_state)) \
> +		/ sizeof(u64))
> +#define HV_GET_GPA_ACCESS_STATES_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(union hv_gpa_page_access_state)) \
> +		/ sizeof(union hv_gpa_page_access_state))
> +#define HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT	       \
> +	((HV_HYP_PAGE_SIZE -						       \
> +	  sizeof(struct hv_input_modify_sparse_spa_page_host_access)) /        \
> +	 sizeof(u64))
> +
> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> +{
> +	struct hv_input_withdraw_memory *input_page;
> +	struct hv_output_withdraw_memory *output_page;
> +	struct page *page;
> +	u16 completed;
> +	unsigned long remaining = count;
> +	u64 status;
> +	int i;
> +	unsigned long flags;
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return -ENOMEM;
> +	output_page = page_address(page);
> +
> +	while (remaining) {
> +		local_irq_save(flags);
> +
> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +		memset(input_page, 0, sizeof(*input_page));
> +		input_page->partition_id = partition_id;
> +		status = hv_do_rep_hypercall(HVCALL_WITHDRAW_MEMORY,
> +					     min(remaining, HV_WITHDRAW_BATCH_SIZE),
> +					     0, input_page, output_page);
> +
> +		local_irq_restore(flags);
> +
> +		completed = hv_repcomp(status);
> +
> +		for (i = 0; i < completed; i++)
> +			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
> +
> +		if (!hv_result_success(status)) {
> +			if (hv_result(status) == HV_STATUS_NO_RESOURCES)
> +				status = HV_STATUS_SUCCESS;
> +			break;
> +		}
> +
> +		remaining -= completed;
> +	}
> +	free_page((unsigned long)output_page);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int hv_call_create_partition(u64 flags,
> +			     struct hv_partition_creation_properties creation_properties,
> +			     union hv_partition_isolation_properties isolation_properties,
> +			     u64 *partition_id)
> +{
> +	struct hv_input_create_partition *input;
> +	struct hv_output_create_partition *output;
> +	u64 status;
> +	int ret;
> +	unsigned long irq_flags;
> +
> +	do {
> +		local_irq_save(irq_flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +		memset(input, 0, sizeof(*input));
> +		input->flags = flags;
> +		input->compatibility_version = HV_COMPATIBILITY_21_H2;
> +
> +		memcpy(&input->partition_creation_properties, &creation_properties,
> +		       sizeof(creation_properties));

This is an example of a generic question/concern that occurs in several places. By
doing a memcpy into the hypercall input, the assumption is that the creation
properties supplied by the caller have zeros in all the reserved or unused fields.
Is that a valid assumption?
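
To make the concern concrete, here is a minimal userspace sketch of the
alternative: check the reserved fields before copying instead of trusting
the caller. The struct and function names are made up for illustration and
are not the real hv_partition_creation_properties layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in; NOT the real hv_partition_creation_properties. */
struct demo_creation_properties {
	uint64_t enabled_features;
	uint32_t flags;
	uint32_t reserved;	/* assumed to be must-be-zero in the ABI */
};

/* A blind memcpy() also copies padding, so make sure there is none. */
static_assert(sizeof(struct demo_creation_properties) == 16,
	      "no implicit padding expected");

/* Return true only if every reserved field is zero. */
static bool demo_props_reserved_clear(const struct demo_creation_properties *p)
{
	return p->reserved == 0;
}

/*
 * Copy caller-supplied properties into the hypercall input only after
 * checking the reserved fields.
 * Returns 0 on success, -1 if a reserved field is non-zero.
 */
static int demo_fill_input(struct demo_creation_properties *input,
			   const struct demo_creation_properties *props)
{
	if (!demo_props_reserved_clear(props))
		return -1;
	memcpy(input, props, sizeof(*input));
	return 0;
}
```

Whether to reject or silently clear the reserved fields is a policy choice;
either way the assumption would then be enforced in one place rather than
left to every caller.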

> +
> +		memcpy(&input->isolation_properties, &isolation_properties,
> +		       sizeof(isolation_properties));
> +
> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> +					 input, output);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (hv_result_success(status))
> +				*partition_id = output->partition_id;
> +			local_irq_restore(irq_flags);
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(irq_flags);
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    hv_current_partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int hv_call_initialize_partition(u64 partition_id)
> +{
> +	struct hv_input_initialize_partition input;
> +	u64 status;
> +	int ret;
> +
> +	input.partition_id = partition_id;
> +
> +	ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> +				    HV_INIT_PARTITION_DEPOSIT_PAGES);
> +	if (ret)
> +		return ret;
> +
> +	do {
> +		status = hv_do_fast_hypercall8(HVCALL_INITIALIZE_PARTITION,
> +					       *(u64 *)&input);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int hv_call_finalize_partition(u64 partition_id)
> +{
> +	struct hv_input_finalize_partition input;
> +	u64 status;
> +
> +	input.partition_id = partition_id;
> +	status = hv_do_fast_hypercall8(HVCALL_FINALIZE_PARTITION,
> +				       *(u64 *)&input);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int hv_call_delete_partition(u64 partition_id)
> +{
> +	struct hv_input_delete_partition input;
> +	u64 status;
> +
> +	input.partition_id = partition_id;
> +	status = hv_do_fast_hypercall8(HVCALL_DELETE_PARTITION, *(u64 *)&input);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> +static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> +			       u32 flags, struct page **pages, u64 mmio_spa)
> +{
> +	struct hv_input_map_gpa_pages *input_page;
> +	u64 status, *pfnlist;
> +	unsigned long irq_flags, large_shift = 0;
> +	int ret = 0, done = 0;
> +	u64 page_count = page_struct_count;
> +
> +	if (page_count == 0 || (pages && mmio_spa))
> +		return -EINVAL;
> +
> +	if (flags & HV_MAP_GPA_LARGE_PAGE) {
> +		if (mmio_spa)
> +			return -EINVAL;
> +
> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> +			return -EINVAL;
> +
> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> +		page_count >>= large_shift;
> +	}
> +
> +	while (done < page_count) {
> +		ulong i, completed, remain = page_count - done;
> +		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> +
> +		local_irq_save(irq_flags);
> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +		input_page->target_partition_id = partition_id;
> +		input_page->target_gpa_base = gfn + (done << large_shift);
> +		input_page->map_flags = flags;
> +		pfnlist = input_page->source_gpa_page_list;
> +
> +		for (i = 0; i < rep_count; i++)
> +			if (flags & HV_MAP_GPA_NO_ACCESS) {
> +				pfnlist[i] = 0;
> +			} else if (pages) {
> +				u64 index = (done + i) << large_shift;
> +
> +				if (index >= page_struct_count) {

Can this test ever be true?  It looks like the pages array must
have space for each 4 KiB page even when mapping at 2 MiB granularity.
But only every 512th entry in the pages array is looked at
(which seems a little weird). Based on how rep_count is set up,
I don't see how the algorithm could go past the end of the pages
array.
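
Here is the arithmetic as I read it, as a small userspace sketch (names are
mine, not from the patch). Since i < rep_count <= page_count - done, the
largest value of done + i is page_count - 1, so the largest index is
(page_count - 1) << large_shift, which is 512 entries short of
page_struct_count in the large page case.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LARGE_SHIFT 9	/* 2 MiB / 4 KiB = 512 = 1 << 9 */

/*
 * Largest index into pages[] that the loop can compute for a 2 MiB-aligned
 * count of 4 KiB page structs. Mirrors index = (done + i) << large_shift
 * with done + i <= page_count - 1.
 */
static uint64_t demo_max_pages_index(uint64_t page_struct_count)
{
	uint64_t page_count;

	/* Same preconditions the function checks before entering the loop. */
	assert(page_struct_count != 0);
	assert((page_struct_count & ((1ULL << DEMO_LARGE_SHIFT) - 1)) == 0);

	page_count = page_struct_count >> DEMO_LARGE_SHIFT;
	return (page_count - 1) << DEMO_LARGE_SHIFT;
}
```

In the non-large case the shift is 0 and the largest index is simply
page_count - 1, which is also in bounds.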

> +					ret = -EINVAL;
> +					break;
> +				}
> +				pfnlist[i] = page_to_pfn(pages[index]);
> +			} else {
> +				pfnlist[i] = mmio_spa + done + i;
> +			}
> +		if (ret)
> +			break;

This test could also go away if the ret = -EINVAL error above can't
happen.

> +
> +		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> +					     input_page, NULL);
> +		local_irq_restore(irq_flags);
> +
> +		completed = hv_repcomp(status);
> +
> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> +						    HV_MAP_GPA_DEPOSIT_PAGES);
> +			if (ret)
> +				break;
> +
> +		} else if (!hv_result_success(status)) {
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +
> +		done += completed;
> +	}
> +
> +	if (ret && done) {
> +		u32 unmap_flags = 0;
> +
> +		if (flags & HV_MAP_GPA_LARGE_PAGE)
> +			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> +		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Ask the hypervisor to map guest ram pages */
> +int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> +			  u32 flags, struct page **pages)
> +{
> +	return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
> +				   flags, pages, 0);
> +}
> +
> +/* Ask the hypervisor to map guest mmio space */
> +int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> +{
> +	int i;
> +	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
> +		    HV_MAP_GPA_NOT_CACHED;
> +
> +	for (i = 0; i < numpgs; i++)
> +		if (page_is_ram(mmio_spa + i))

FWIW, doing this check one-page-at-a-time is somewhat expensive if numpgs
is large.  The underlying data structures should support doing a single range
check, but I haven't looked at whether functions exist to do such a range check.
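
To illustrate what I mean by a range check, here is a userspace model only.
It does not use any kernel API, and the struct and function names are
hypothetical. The cost depends on the number of RAM ranges rather than on
numpgs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one RAM range in page frame numbers, [start, end). */
struct demo_ram_range {
	uint64_t start_pfn;
	uint64_t end_pfn;
};

/*
 * Return true if [pfn, pfn + count) overlaps any RAM range. One pass over
 * the ranges, independent of count, instead of one lookup per page.
 */
static bool demo_range_touches_ram(const struct demo_ram_range *ram, size_t n,
				   uint64_t pfn, uint64_t count)
{
	size_t i;

	/* The sketch does not handle wraparound of pfn + count. */
	assert(pfn + count >= pfn);

	for (i = 0; i < n; i++)
		if (pfn < ram[i].end_pfn && ram[i].start_pfn < pfn + count)
			return true;
	return false;
}
```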

> +			return -EINVAL;
> +
> +	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> +				   mmio_spa);
> +}
> +
> +int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> +			    u32 flags)
> +{
> +	struct hv_input_unmap_gpa_pages *input_page;
> +	u64 status, page_count = page_count_4k;
> +	unsigned long irq_flags, large_shift = 0;
> +	int ret = 0, done = 0;
> +
> +	if (page_count == 0)
> +		return -EINVAL;
> +
> +	if (flags & HV_UNMAP_GPA_LARGE_PAGE) {
> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> +			return -EINVAL;
> +
> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> +		page_count >>= large_shift;
> +	}
> +
> +	while (done < page_count) {
> +		ulong completed, remain = page_count - done;
> +		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);

Using HV_MAP_GPA_BATCH_SIZE seems a little weird here since there's
no input array and hence no constraint based on keeping input args to
just one page. Is it being used as an arbitrary limit so the rep_count
passed to the hypercall isn't "too large" for some definition of "too large"?
If that's the case, perhaps a separate #define and a comment would
make sense. I kept trying to figure out how the batch size for unmap was
related to the map hypercall, and I don't think there is any relationship.
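
Something along these lines is what I have in mind. The name and the value
are placeholders I made up; the sketch is userspace-only and just shows the
define carrying its own explanation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical name and value. Unmap has no per-page input array, so this
 * is not derived from the size of the hypercall input page. It only bounds
 * how much work a single rep hypercall is asked to do.
 */
#define DEMO_UNMAP_GPA_BATCH_SIZE 512

/* Assumption: the rep count field in the hypercall control word is 12 bits. */
static_assert(DEMO_UNMAP_GPA_BATCH_SIZE <= 0xfff,
	      "batch size must fit in the rep count field");

/* Number of reps to request for the next unmap hypercall. */
static uint32_t demo_unmap_rep_count(uint64_t remain)
{
	return remain < DEMO_UNMAP_GPA_BATCH_SIZE ?
		(uint32_t)remain : DEMO_UNMAP_GPA_BATCH_SIZE;
}
```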

> +
> +		local_irq_save(irq_flags);
> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +		input_page->target_partition_id = partition_id;
> +		input_page->target_gpa_base = gfn + (done << large_shift);
> +		input_page->unmap_flags = flags;
> +		status = hv_do_rep_hypercall(HVCALL_UNMAP_GPA_PAGES, rep_count,
> +					     0, input_page, NULL);
> +		local_irq_restore(irq_flags);
> +
> +		completed = hv_repcomp(status);
> +		if (!hv_result_success(status)) {
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +
> +		done += completed;
> +	}
> +
> +	return ret;
> +}
> +
> +int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
> +				  union hv_gpa_page_access_state_flags state_flags,
> +				  int *written_total,
> +				  union hv_gpa_page_access_state *states)
> +{
> +	struct hv_input_get_gpa_pages_access_state *input_page;
> +	union hv_gpa_page_access_state *output_page;
> +	int completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count, i;
> +	u64 status;
> +	unsigned long flags;
> +
> +	*written_total = 0;
> +	while (remaining) {
> +		local_irq_save(flags);
> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +		input_page->partition_id = partition_id;
> +		input_page->hv_gpa_page_number = gpa_base_pfn + *written_total;
> +		input_page->flags = state_flags;
> +		rep_count = min(remaining, HV_GET_GPA_ACCESS_STATES_BATCH_SIZE);
> +
> +		status = hv_do_rep_hypercall(HVCALL_GET_GPA_PAGES_ACCESS_STATES, rep_count,
> +					     0, input_page, output_page);
> +		if (!hv_result_success(status)) {
> +			local_irq_restore(flags);
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		for (i = 0; i < completed; ++i)
> +			states[i].as_uint8 = output_page[i].as_uint8;
> +
> +		states += completed;
> +		*written_total += completed;
> +		remaining -= completed;
> +		local_irq_restore(flags);

FWIW, this local_irq_restore() could move up three lines to before the progress
accounting is done.

> +	}
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
> +				     u64 dest_addr,
> +				     union hv_interrupt_control control)
> +{
> +	struct hv_input_assert_virtual_interrupt *input;
> +	unsigned long flags;
> +	u64 status;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = partition_id;
> +	input->vector = vector;
> +	input->dest_addr = dest_addr;
> +	input->control = control;
> +	status = hv_do_hypercall(HVCALL_ASSERT_VIRTUAL_INTERRUPT, input, NULL);
> +	local_irq_restore(flags);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int hv_call_delete_vp(u64 partition_id, u32 vp_index)
> +{
> +	union hv_input_delete_vp input = {};
> +	u64 status;
> +
> +	input.partition_id = partition_id;
> +	input.vp_index = vp_index;
> +
> +	status = hv_do_fast_hypercall16(HVCALL_DELETE_VP,
> +					input.as_uint64[0], input.as_uint64[1]);
> +
> +	return hv_result_to_errno(status);
> +}
> +EXPORT_SYMBOL_GPL(hv_call_delete_vp);
> +
> +int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
> +			 struct hv_vp_state_data state_data,
> +			 /* Choose between pages and ret_output */
> +			 u64 page_count, struct page **pages,
> +			 union hv_output_get_vp_state *ret_output)
> +{
> +	struct hv_input_get_vp_state *input;
> +	union hv_output_get_vp_state *output;
> +	u64 status;
> +	int i;
> +	u64 control;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	if (page_count > HV_GET_VP_STATE_BATCH_SIZE)
> +		return -EINVAL;
> +
> +	if (!page_count && !ret_output)
> +		return -EINVAL;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +		memset(input, 0, sizeof(*input));
> +		memset(output, 0, sizeof(*output));

Why is the output set to zero?  I would think Hyper-V is responsible for
ensuring that the output is properly populated, with unused fields/areas
set to zero.

> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->state_data = state_data;
> +		for (i = 0; i < page_count; i++)
> +			input->output_data_pfns[i] = page_to_pfn(pages[i]);
> +
> +		control = (HVCALL_GET_VP_STATE) |
> +			  (page_count << HV_HYPERCALL_VARHEAD_OFFSET);
> +
> +		status = hv_do_hypercall(control, input, output);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (hv_result_success(status) && ret_output)
> +				memcpy(ret_output, output, sizeof(*output));
> +
> +			local_irq_restore(flags);
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
> +			 /* Choose between pages and bytes */
> +			 struct hv_vp_state_data state_data, u64 page_count,

The size of "struct hv_vp_state_data" looks to be 24 bytes (3 64-bit words).
Is there a reason to pass this by value instead of as a pointer? I guess it works
like this, but it seems atypical.
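
For comparison, a pointer-based variant could look roughly like this. This
is a userspace sketch with stand-in types; the real struct layouts are not
reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 24-byte stand-in for struct hv_vp_state_data. */
struct demo_vp_state_data {
	uint64_t words[3];
};

static_assert(sizeof(struct demo_vp_state_data) == 24, "three 64-bit words");

/* Hypothetical stand-in for the fixed part of the hypercall input. */
struct demo_set_vp_state_input {
	uint64_t partition_id;
	uint32_t vp_index;
	uint32_t pad;
	struct demo_vp_state_data state_data;
};

/* Take the struct by const pointer and copy it once, at the point of use. */
static void demo_fill_set_vp_state(struct demo_set_vp_state_input *input,
				   uint64_t partition_id, uint32_t vp_index,
				   const struct demo_vp_state_data *state_data)
{
	input->partition_id = partition_id;
	input->vp_index = vp_index;
	input->pad = 0;
	input->state_data = *state_data;
}
```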

> +			 struct page **pages, u32 num_bytes, u8 *bytes)
> +{
> +	struct hv_input_set_vp_state *input;
> +	u64 status;
> +	int i;
> +	u64 control;
> +	unsigned long flags;
> +	int ret = 0;
> +	u16 varhead_sz;
> +
> +	if (page_count > HV_SET_VP_STATE_BATCH_SIZE)
> +		return -EINVAL;
> +	if (sizeof(*input) + num_bytes > HV_HYP_PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (num_bytes)
> +		/* round up to 8 and divide by 8 */
> +		varhead_sz = (num_bytes + 7) >> 3;
> +	else if (page_count)
> +		varhead_sz = page_count;
> +	else
> +		return -EINVAL;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		memset(input, 0, sizeof(*input));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->state_data = state_data;
> +		if (num_bytes) {
> +			memcpy((u8 *)input->data, bytes, num_bytes);
> +		} else {
> +			for (i = 0; i < page_count; i++)
> +				input->data[i].pfns = page_to_pfn(pages[i]);
> +		}
> +
> +		control = (HVCALL_SET_VP_STATE) |
> +			  (varhead_sz << HV_HYPERCALL_VARHEAD_OFFSET);
> +
> +		status = hv_do_hypercall(control, input, NULL);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			local_irq_restore(flags);
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +			      union hv_input_vtl input_vtl,
> +			      struct page **state_page)
> +{
> +	struct hv_input_map_vp_state_page *input;
> +	struct hv_output_map_vp_state_page *output;
> +	u64 status;
> +	int ret;
> +	unsigned long flags;
> +
> +	do {
> +		local_irq_save(flags);
> +
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->type = type;
> +		input->input_vtl = input_vtl;
> +
> +		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE, input, output);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (hv_result_success(status))
> +				*state_page = pfn_to_page(output->map_location);
> +			local_irq_restore(flags);
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +				union hv_input_vtl input_vtl)
> +{
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_unmap_vp_state_page *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +
> +	input->partition_id = partition_id;
> +	input->vp_index = vp_index;
> +	input->type = type;
> +	input->input_vtl = input_vtl;
> +
> +	status = hv_do_hypercall(HVCALL_UNMAP_VP_STATE_PAGE, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int
> +hv_call_clear_virtual_interrupt(u64 partition_id)
> +{
> +	unsigned long flags;
> +	int status;
> +
> +	local_irq_save(flags);
> +	status = hv_do_fast_hypercall8(HVCALL_CLEAR_VIRTUAL_INTERRUPT,
> +				       partition_id) &
> +			HV_HYPERCALL_RESULT_MASK;

This "anding" with HV_HYPERCALL_RESULT_MASK should be removed.

> +	local_irq_restore(flags);

The irq save/restore isn't needed here since this is a fast hypercall and
per-cpu arg memory is not used.
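
With both changes the function reduces to roughly the following. This is a
userspace sketch with stubbed helpers: the demo_* names, the hypercall code,
and the errno mapping are placeholders, and it assumes hv_result_to_errno()
extracts the result field itself, as the other callers in this file already
rely on.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins modeled on the kernel helpers; values are assumptions. */
#define DEMO_HV_HYPERCALL_RESULT_MASK	0xffffULL
#define DEMO_HV_STATUS_SUCCESS		0

/* Extract the 16-bit result code from the raw 64-bit hypercall status. */
static inline uint16_t demo_hv_result(uint64_t status)
{
	return status & DEMO_HV_HYPERCALL_RESULT_MASK;
}

static inline int demo_hv_result_success(uint64_t status)
{
	return demo_hv_result(status) == DEMO_HV_STATUS_SUCCESS;
}

/* Stub for the fast hypercall: pretends the hypervisor returned success. */
static uint64_t demo_fast_hypercall8(uint16_t code, uint64_t input)
{
	(void)code;
	(void)input;
	return DEMO_HV_STATUS_SUCCESS;
}

/* Stub errno mapping: 0 on success, -5 otherwise. */
static int demo_result_to_errno(uint64_t status)
{
	return demo_hv_result_success(status) ? 0 : -5;
}

/* No irq save/restore, no explicit masking, and status stays 64-bit. */
static int demo_clear_virtual_interrupt(uint64_t partition_id)
{
	uint64_t status = demo_fast_hypercall8(0 /* placeholder code */,
					       partition_id);

	return demo_result_to_errno(status);
}
```

Note that status is declared as int in the patch, which also truncates the
64-bit value returned by the hypercall.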

> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int
> +hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> +		    u64 connection_partition_id,
> +		    struct hv_port_info *port_info,
> +		    u8 port_vtl, u8 min_connection_vtl, int node)
> +{
> +	struct hv_input_create_port *input;
> +	unsigned long flags;
> +	int ret = 0;
> +	int status;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		memset(input, 0, sizeof(*input));
> +
> +		input->port_partition_id = port_partition_id;
> +		input->port_id = port_id;
> +		input->connection_partition_id = connection_partition_id;
> +		input->port_info = *port_info;
> +		input->port_vtl = port_vtl;
> +		input->min_connection_vtl = min_connection_vtl;
> +		input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
> +		status = hv_do_hypercall(HVCALL_CREATE_PORT, input, NULL) &
> +			 HV_HYPERCALL_RESULT_MASK;

Use the hv_status checking macros instead of and'ing with
HV_HYPERCALL_RESULT_MASK.
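
The retry loop would then have the same shape as the other deposit loops in
this file. A userspace sketch with stubs follows; the demo_* helpers and the
numeric status values are stand-ins and should not be taken as the real
definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HV_STATUS_SUCCESS			0
#define DEMO_HV_STATUS_INSUFFICIENT_MEMORY	11	/* assumed value */

static int demo_deposits;	/* pages deposited so far */

/* Model of hv_result(): the low 16 bits of the raw status. */
static uint16_t demo_hv_result(uint64_t status)
{
	return status & 0xffff;
}

/* Stub hypercall: fails with INSUFFICIENT_MEMORY until two pages exist. */
static uint64_t demo_create_port_hypercall(void)
{
	return demo_deposits < 2 ? DEMO_HV_STATUS_INSUFFICIENT_MEMORY :
				   DEMO_HV_STATUS_SUCCESS;
}

/* Stub for depositing one page; always succeeds. */
static int demo_deposit_page(void)
{
	demo_deposits++;
	return 0;
}

static int demo_create_port(void)
{
	uint64_t status;	/* 64-bit, so the raw status is not truncated */
	int ret;

	do {
		status = demo_create_port_hypercall();
		if (demo_hv_result(status) != DEMO_HV_STATUS_INSUFFICIENT_MEMORY) {
			/* Stub errno mapping: 0 on success, -5 otherwise. */
			ret = demo_hv_result(status) == DEMO_HV_STATUS_SUCCESS ?
				0 : -5;
			break;
		}
		ret = demo_deposit_page();
	} while (!ret);

	return ret;
}
```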

> +		local_irq_restore(flags);
> +		if (status == HV_STATUS_SUCCESS)
> +			break;
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
> +
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int
> +hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id)
> +{
> +	union hv_input_delete_port input = { 0 };
> +	unsigned long flags;
> +	int status;
> +
> +	local_irq_save(flags);
> +	input.port_partition_id = port_partition_id;
> +	input.port_id = port_id;
> +	status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT,
> +					input.as_uint64[0],
> +					input.as_uint64[1]) &
> +			HV_HYPERCALL_RESULT_MASK;
> +	local_irq_restore(flags);

Same as the previous comment about and'ing.  And the irq save/restore
isn't needed.

> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int
> +hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> +		     u64 connection_partition_id,
> +		     union hv_connection_id connection_id,
> +		     struct hv_connection_info *connection_info,
> +		     u8 connection_vtl, int node)
> +{
> +	struct hv_input_connect_port *input;
> +	unsigned long flags;
> +	int ret = 0, status;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		memset(input, 0, sizeof(*input));
> +		input->port_partition_id = port_partition_id;
> +		input->port_id = port_id;
> +		input->connection_partition_id = connection_partition_id;
> +		input->connection_id = connection_id;
> +		input->connection_info = *connection_info;
> +		input->connection_vtl = connection_vtl;
> +		input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
> +		status = hv_do_hypercall(HVCALL_CONNECT_PORT, input, NULL) &
> +			 HV_HYPERCALL_RESULT_MASK;

Same here.  Use hv_* macros.

> +
> +		local_irq_restore(flags);
> +		if (status == HV_STATUS_SUCCESS)
> +			break;
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_result_to_errno(status);
> +			break;
> +		}
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    connection_partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +int
> +hv_call_disconnect_port(u64 connection_partition_id,
> +			union hv_connection_id connection_id)
> +{
> +	union hv_input_disconnect_port input = { 0 };
> +	unsigned long flags;
> +	int status;
> +
> +	local_irq_save(flags);
> +	input.connection_partition_id = connection_partition_id;
> +	input.connection_id = connection_id;
> +	input.is_doorbell = 1;
> +	status = hv_do_fast_hypercall16(HVCALL_DISCONNECT_PORT,
> +					input.as_uint64[0],
> +					input.as_uint64[1]) &
> +			HV_HYPERCALL_RESULT_MASK;
> +	local_irq_restore(flags);

Same as above.

> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int
> +hv_call_notify_port_ring_empty(u32 sint_index)
> +{
> +	union hv_input_notify_port_ring_empty input = { 0 };
> +	unsigned long flags;
> +	int status;
> +
> +	local_irq_save(flags);
> +	input.sint_index = sint_index;
> +	status = hv_do_fast_hypercall8(HVCALL_NOTIFY_PORT_RING_EMPTY,
> +				       input.as_uint64) &
> +		 HV_HYPERCALL_RESULT_MASK;
> +	local_irq_restore(flags);

Same as above.

> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int hv_call_map_stat_page(enum hv_stats_object_type type,
> +			  const union hv_stats_object_identity *identity,
> +			  void **addr)
> +{
> +	unsigned long flags;
> +	struct hv_input_map_stats_page *input;
> +	struct hv_output_map_stats_page *output;
> +	u64 status, pfn;
> +	int ret;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +		memset(input, 0, sizeof(*input));
> +		input->type = type;
> +		input->identity = *identity;
> +
> +		status = hv_do_hypercall(HVCALL_MAP_STATS_PAGE, input, output);
> +		pfn = output->map_location;
> +
> +		local_irq_restore(flags);
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (hv_result_success(status))
> +				break;
> +			return hv_result_to_errno(status);
> +		}
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    hv_current_partition_id, 1);
> +		if (ret)
> +			return ret;
> +	} while (!ret);
> +
> +	*addr = page_address(pfn_to_page(pfn));
> +
> +	return ret;
> +}
> +
> +int hv_call_unmap_stat_page(enum hv_stats_object_type type,
> +			    const union hv_stats_object_identity *identity)
> +{
> +	unsigned long flags;
> +	struct hv_input_unmap_stats_page *input;
> +	u64 status;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	input->type = type;
> +	input->identity = *identity;
> +
> +	status = hv_do_hypercall(HVCALL_UNMAP_STATS_PAGE, input, NULL);
> +	local_irq_restore(flags);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> +				   u64 page_struct_count, u32 host_access,
> +				   u32 flags, u8 acquire)
> +{
> +	struct hv_input_modify_sparse_spa_page_host_access *input_page;
> +	u64 status;
> +	int done = 0;
> +	unsigned long irq_flags, large_shift = 0;
> +	u64 page_count = page_struct_count;
> +	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
> +			     HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
> +
> +	if (page_count == 0)
> +		return -EINVAL;
> +
> +	if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE) {
> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> +			return -EINVAL;
> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> +		page_count >>= large_shift;
> +	}
> +
> +	while (done < page_count) {
> +		ulong i, completed, remain = page_count - done;
> +		int rep_count = min(remain,
> +				HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
> +
> +		local_irq_save(irq_flags);
> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		/*
> +		 * This is required to make sure that reserved field is set to
> +		 * zero, because MSHV has a check to make sure reserved bits are
> +		 * set to zero.
> +		 */

Is this comment about checking reserved bits unique to this hypercall? If not, it
seems a little odd to see this comment here, but not other places where the input
is zero'ed.

> +		memset(input_page, 0, sizeof(*input_page));
> +		/* Only set the partition id if you are making the pages
> +		 * exclusive
> +		 */
> +		if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE)
> +			input_page->partition_id = partition_id;
> +		input_page->flags = flags;
> +		input_page->host_access = host_access;
> +
> +		for (i = 0; i < rep_count; i++) {
> +			u64 index = (done + i) << large_shift;
> +
> +			if (index >= page_struct_count)
> +				return -EINVAL;

Can this test ever be true?

> +
> +			input_page->spa_page_list[i] =
> +						page_to_pfn(pages[index]);

When large_shift is non-zero, it seems weird to be skipping over most
of the entries in the "pages" array.  But maybe there's a reason for that.

> +		}
> +
> +		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> +					     NULL);
> +		local_irq_restore(irq_flags);
> +
> +		completed = hv_repcomp(status);
> +
> +		if (!hv_result_success(status))
> +			return hv_result_to_errno(status);
> +
> +		done += completed;
> +	}
> +
> +	return 0;
> +}

[snip the rest of the patch that I haven't reviewed yet]

Michael

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-13 16:43   ` Michael Kelley
@ 2025-03-14  2:15     ` Nuno Das Neves
  2025-03-14  3:27       ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-14  2:15 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/13/2025 9:43 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
>>
> 
> I've done a partial review of the code in this patch.  See comments inline
> as usual.
> 
> I'd like to still review most of the code in mshv_root_main.c, and maybe
> some of mshv_synic.c and include/uapi/linux/mshv.c. I'll send a separate
> email with those comments when I complete them. The patch is huge, so
> I'm breaking my review comments into two parts.
> 
> I've glanced through mshv_eventfd.c, mshv_eventfd.h, and mshv_irq.c,
> but I don't have enough knowledge/expertise in these areas to add any
> useful comments, so I'm not planning to review them further.
> 
Thanks for taking a look. Just so you know, I was getting ready to post v6 of
this patchset when I saw this email. So not all the comments will be addressed
in the next version, but I've noted them and I will keep an eye out for the
second part if you send it after v6 is posted.

<snip>
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 6d1465315df3..66dcfaae698b 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -370,6 +370,8 @@ Code  Seq#    Include File                                           Comments
>>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-
>> remoteproc@vger.kernel.org>
>>  0xB7  all    uapi/linux/nsfs.h                                       <mailto:Andrei Vagin
>> <avagin@openvz.org>>
>>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
>> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver
> 
> Hmmm. Doesn't this mean that the mshv ioctls overlap with the Marvell
> CN10K DPI ioctls? Is that intentional? I thought the goal of the central
> registry in ioctl-number.rst is to avoid overlap.
> 
Yes, they overlap. In practice it really doesn't matter IMO - IOCTL numbers
are only interpreted by the driver of the device that the ioctl() syscall
is made on.

I believe the whole scheme to generate unique IOCTL numbers and try not to
overlap them was for some case I'm not familiar with - something like
multiple drivers handling IOCTLs on the same device FD? And maybe it's handy
in debugging if you see an IOCTL number in isolation and want to know where
it comes from?
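
For illustration, a rough userspace sketch of why an overlap is visible when
a number is seen in isolation. It assumes the asm-generic ioctl encoding
(direction in bits 31:30, size in bits 29:16, type in bits 15:8, number in
bits 7:0). The sk_* helpers are made up for this example and are not kernel
macros:

```c
#include <stdint.h>

/* Assumed asm-generic layout; some architectures use different widths. */
#define SK_IOC_NRSHIFT   0
#define SK_IOC_TYPESHIFT 8
#define SK_IOC_SIZESHIFT 16
#define SK_IOC_DIRSHIFT  30

/* Build a command number from its four fields. */
static inline uint32_t sk_ioc(uint32_t dir, uint32_t type, uint32_t nr,
			      uint32_t size)
{
	return (dir << SK_IOC_DIRSHIFT) | (size << SK_IOC_SIZESHIFT) |
	       (type << SK_IOC_TYPESHIFT) | (nr << SK_IOC_NRSHIFT);
}

/* Recover the "Code" column of ioctl-number.rst from a command number. */
static inline uint32_t sk_ioc_type(uint32_t cmd)
{
	return (cmd >> SK_IOC_TYPESHIFT) & 0xff;
}

/* Recover the sequence number from a command number. */
static inline uint32_t sk_ioc_nr(uint32_t cmd)
{
	return (cmd >> SK_IOC_NRSHIFT) & 0xff;
}
```

Two drivers that both use type 0xB8 and happen to choose the same direction,
size and sequence number produce the same value, so only the device FD tells
them apart.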

On a practical note, we have been using this IOCTL range for some time
in other upstream code like our userspace rust library which interfaces with
this driver (https://github.com/rust-vmm/mshv). So it would also be nice to
keep that all working as much as possible with the kernel code that is on
this mailing list.

<snip>
>> +#endif /* _MSHV_H */
>> diff --git a/drivers/hv/mshv_common.c b/drivers/hv/mshv_common.c
>> new file mode 100644
>> index 000000000000..d97631dcbee1
>> --- /dev/null
>> +++ b/drivers/hv/mshv_common.c
>> @@ -0,0 +1,161 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (c) 2024, Microsoft Corporation.
>> + *
>> + * This file contains functions that are called from one or more modules: ROOT,
>> + * DIAG, or VTL. If any of these modules are configured to build, this file is
> 
> What are the DIAG and VTL modules?  I see only a root module in the Makefile.
> 
Ah, yep, they are not in this patchset but will follow. I can remove the
references to them here, and make this comment future tense: "functions that
WILL be called from one or more modules".

<snip>
>> +
>> +struct mshv_vp {
>> +	u32 vp_index;
>> +	struct mshv_partition *vp_partition;
>> +	struct mutex vp_mutex;
>> +	struct hv_vp_register_page *vp_register_page;
>> +	struct hv_message *vp_intercept_msg_page;
>> +	void *vp_ghcb_page;
>> +	struct hv_stats_page *vp_stats_pages[2];
>> +	struct {
>> +		atomic64_t vp_signaled_count;
>> +		struct {
>> +			u64 intercept_suspend: 1;
>> +			u64 root_sched_blocked: 1; /* root scheduler only */
>> +			u64 root_sched_dispatched: 1; /* root scheduler only */
>> +			u64 reserved: 62;
> 
> Hmmm.  This looks like 65 bits allocated in a u64.
> 
Indeed it is, good catch
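
For illustration only (a standalone userspace sketch, not code from the
patch; the struct names are made up and uint64_t stands in for u64): with
GCC or Clang, a 62-bit field does not fit in the 61 bits left in the first
64-bit unit, so it moves to a second unit and the struct quietly doubles in
size. Shrinking the reserved field to 61 bits and adding a compile-time size
check would catch this kind of slip:

```c
#include <assert.h>
#include <stdint.h>

/* Layout as posted in v5: 1 + 1 + 1 + 62 = 65 bits. */
struct sk_vp_flags_v5 {
	uint64_t intercept_suspend : 1;
	uint64_t root_sched_blocked : 1;
	uint64_t root_sched_dispatched : 1;
	uint64_t reserved : 62;
};

/* Possible fix: 1 + 1 + 1 + 61 = 64 bits. */
struct sk_vp_flags_fixed {
	uint64_t intercept_suspend : 1;
	uint64_t root_sched_blocked : 1;
	uint64_t root_sched_dispatched : 1;
	uint64_t reserved : 61;
};

/* Fails to compile if the flags ever outgrow a single 64-bit word. */
static_assert(sizeof(struct sk_vp_flags_fixed) == sizeof(uint64_t),
	      "vp flags must fit in one u64");
```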

>> +
>> +	/*
>> +	 * Since MSHV does not support more than one async hypercall in flight
> 
> Wording is a bit messed up.  Drop the "Since"?
> 
Yep, thanks

>> +	 * for a single partition. Thus, it is okay to define per partition
>> +	 * async hypercall status.
>> +	 */
<snip>
>> +
>> +extern struct mshv_root mshv_root;
>> +extern enum hv_scheduler_type hv_scheduler_type;
>> +extern u8 __percpu **hv_synic_eventring_tail;
> 
> Per comments on an earlier patch, the __percpu is in the wrong place.
> 
Thanks, will fix here too.
<snip>
>> +int hv_call_create_partition(u64 flags,
>> +			     struct hv_partition_creation_properties creation_properties,
>> +			     union hv_partition_isolation_properties isolation_properties,
>> +			     u64 *partition_id)
>> +{
>> +	struct hv_input_create_partition *input;
>> +	struct hv_output_create_partition *output;
>> +	u64 status;
>> +	int ret;
>> +	unsigned long irq_flags;
>> +
>> +	do {
>> +		local_irq_save(irq_flags);
>> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +		memset(input, 0, sizeof(*input));
>> +		input->flags = flags;
>> +		input->compatibility_version = HV_COMPATIBILITY_21_H2;
>> +
>> +		memcpy(&input->partition_creation_properties, &creation_properties,
>> +		       sizeof(creation_properties));
> 
> This is an example of a generic question/concern that occurs in several places. By
> doing a memcpy into the hypercall input, the assumption is that the creation
> properties supplied by the caller have zeros in all the reserved or unused fields.
> Is that a valid assumption?
> 
When the entire struct is provided as a function parameter, I think it's a valid
assumption that that struct is initialized correctly by the caller.

The alternative (taking it to an extreme, in my opinion) is that we go through
each field in the parameters and assign them all individually, which could be quite
a lot of fields. E.g. going through all the bits in these structs with 60+ bitfields
and re-setting them here to be sure the reserved bits are 0.
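
A minimal sketch of the caller-side contract being assumed here, using
made-up stand-in types (sk_props and sk_input are hypothetical, not the real
Hyper-V structs). If the caller builds the struct with a designated
initializer, every member it does not name, including the reserved ones, is
zero-initialized, and the memcpy() then carries those zeros into the
hypercall input:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a properties struct with reserved fields. */
struct sk_props {
	uint32_t feature_flags;
	uint32_t reserved0;
	uint64_t reserved1[2];
};

/* Hypothetical stand-in for a hypercall input that embeds it. */
struct sk_input {
	uint64_t flags;
	struct sk_props props;
};

/* Caller side: members not named in the initializer are zeroed. */
static struct sk_props sk_make_props(uint32_t feature_flags)
{
	struct sk_props props = { .feature_flags = feature_flags };

	return props;
}

/* Callee side: copies whatever the caller put in the reserved fields. */
static void sk_fill_input(struct sk_input *input, uint64_t flags,
			  struct sk_props props)
{
	memset(input, 0, sizeof(*input));
	input->flags = flags;
	memcpy(&input->props, &props, sizeof(props));
}
```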

>> +
>> +		memcpy(&input->isolation_properties, &isolation_properties,
>> +		       sizeof(isolation_properties));
>> +
>> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
>> +					 input, output);
<snip>
>> +/* Ask the hypervisor to map guest ram pages or the guest mmio space */
>> +static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>> +			       u32 flags, struct page **pages, u64 mmio_spa)
>> +{
>> +	struct hv_input_map_gpa_pages *input_page;
>> +	u64 status, *pfnlist;
>> +	unsigned long irq_flags, large_shift = 0;
>> +	int ret = 0, done = 0;
>> +	u64 page_count = page_struct_count;
>> +
>> +	if (page_count == 0 || (pages && mmio_spa))
>> +		return -EINVAL;
>> +
>> +	if (flags & HV_MAP_GPA_LARGE_PAGE) {
>> +		if (mmio_spa)
>> +			return -EINVAL;
>> +
>> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
>> +			return -EINVAL;
>> +
>> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
>> +		page_count >>= large_shift;
>> +	}
>> +
>> +	while (done < page_count) {
>> +		ulong i, completed, remain = page_count - done;
>> +		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
>> +
>> +		local_irq_save(irq_flags);
>> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +
>> +		input_page->target_partition_id = partition_id;
>> +		input_page->target_gpa_base = gfn + (done << large_shift);
>> +		input_page->map_flags = flags;
>> +		pfnlist = input_page->source_gpa_page_list;
>> +
>> +		for (i = 0; i < rep_count; i++)
>> +			if (flags & HV_MAP_GPA_NO_ACCESS) {
>> +				pfnlist[i] = 0;
>> +			} else if (pages) {
>> +				u64 index = (done + i) << large_shift;
>> +
>> +				if (index >= page_struct_count) {
> 
> Can this test ever be true?  It looks like the pages array must
> have space for each 4K page even if mapping in 2Meg granularity.
> But only every 512th entry in the pages array is looked at
> (which seems a little weird). But based on how rep_count is set up,
> I don't see how the algorithm could go past the end of the pages
> array.
> 
I don't think the test can actually be true - IIRC I wrote it as a kind
of "is my math correct?" sanity check, and there was a pr_err() or a
WARN() here in a previous iteration of the code.

The large page list is a bit weird - When we allocate the large pages in
the kernel, we get all the (4K) page structs for that range back from the
kernel, and we hang onto them. When mapping the large pages into the
hypervisor we just have to map the PFN of the first page of each 2M page,
hence the skipping.

Now I'm thinking about it again, maybe we can discard most of the 4K page
structs the kernel gives back and keep it as a packed array of the "head"
pages which are all we really need (and then also simplify this mapping
code and save some memory).

The current code was just the simplest way to add the large page
functionality on top of what we already had, but looks like it could
probably be improved.
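
As a rough sketch of the packed "head" array idea (standalone userspace C
with plain PFN values instead of struct page pointers; sk_pack_head_pfns()
and SK_LARGE_SHIFT are made-up names, and the shift of 9 assumes 2M large
pages over 4K base pages):

```c
#include <stddef.h>
#include <stdint.h>

/* 2M / 4K = 512 base-page entries per large page (assumed sizes). */
#define SK_LARGE_SHIFT 9

/*
 * Copy the PFN of the first 4K page of each 2M page into a packed array.
 * Returns the number of head entries written, or 0 if count_4k is zero or
 * not a multiple of 512.
 */
static size_t sk_pack_head_pfns(const uint64_t *pfns_4k, size_t count_4k,
				uint64_t *heads)
{
	size_t i, nheads;

	if (count_4k == 0 || (count_4k & (((size_t)1 << SK_LARGE_SHIFT) - 1)))
		return 0;

	nheads = count_4k >> SK_LARGE_SHIFT;
	for (i = 0; i < nheads; i++) {
		/*
		 * Largest index is (nheads - 1) * 512 = count_4k - 512, so
		 * the read can never go past the end of pfns_4k.
		 */
		heads[i] = pfns_4k[i << SK_LARGE_SHIFT];
	}

	return nheads;
}
```

The bound in the loop comment is the same reason the index check in the
mapping loop cannot trigger once the count has been verified as 2M-aligned.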

>> +					ret = -EINVAL;
>> +					break;
>> +				}
>> +				pfnlist[i] = page_to_pfn(pages[index]);
>> +			} else {
>> +				pfnlist[i] = mmio_spa + done + i;
>> +			}
>> +		if (ret)
>> +			break;
> 
> This test could also go away if the ret = -EINVAL error above can't
> happen.
> 
Ack
<snip>
>> +
>> +/* Ask the hypervisor to map guest mmio space */
>> +int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
>> +{
>> +	int i;
>> +	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
>> +		    HV_MAP_GPA_NOT_CACHED;
>> +
>> +	for (i = 0; i < numpgs; i++)
>> +		if (page_is_ram(mmio_spa + i))
> 
> FWIW, doing this check one-page-at-a-time is somewhat expensive if numpgs
> is large.  The underlying data structures should support doing a single range
> check, but I haven't looked at whether functions exist to do such a range check.
> 
Indeed - I'll make a note to investigate, thanks.

>> +			return -EINVAL;
>> +
>> +	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
>> +				   mmio_spa);
>> +}
>> +
>> +int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
>> +			    u32 flags)
>> +{
>> +	struct hv_input_unmap_gpa_pages *input_page;
>> +	u64 status, page_count = page_count_4k;
>> +	unsigned long irq_flags, large_shift = 0;
>> +	int ret = 0, done = 0;
>> +
>> +	if (page_count == 0)
>> +		return -EINVAL;
>> +
>> +	if (flags & HV_UNMAP_GPA_LARGE_PAGE) {
>> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
>> +			return -EINVAL;
>> +
>> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
>> +		page_count >>= large_shift;
>> +	}
>> +
>> +	while (done < page_count) {
>> +		ulong completed, remain = page_count - done;
>> +		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> 
> Using HV_MAP_GPA_BATCH_SIZE seems a little weird here since there's
> no input array and hence no constraint based on keeping input args to
> just one page. Is it being used as an arbitrary limit so the rep_count
> passed to the hypercall isn't "too large" for some definition of "too large"?
> If that's the case, perhaps a separate #define and a comment would
> make sense. I kept trying to figure out how the batch size for unmap was
> related to the map hypercall, and I don't think there is any relationship.
> 
I think batching this was intentional so that we can be sure to re-enable
interrupts periodically when unmapping an entire VM's worth of memory. That
said, as you know the hypercall will return if it takes longer than a certain
amount of time so I guess that is "built-in" in some sense.

I think keeping the batching, but #defining a specific value for unmap as you
suggest is a good idea.

I'd be inclined to use a similar number (something like 512).
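
A minimal sketch of what that could look like, as standalone userspace C.
SK_UNMAP_GPA_BATCH_SIZE is a hypothetical name and 512 is only the value
floated above; the callback stands in for the rep hypercall and returns the
number of reps it completed:

```c
#include <stdint.h>

/*
 * Hypothetical unmap-specific limit. It is not tied to the size of the
 * input page; it only bounds how long one batch can run, so that in the
 * driver interrupts could be re-enabled between batches.
 */
#define SK_UNMAP_GPA_BATCH_SIZE 512UL

/* Stand-in for the rep hypercall: returns how many reps completed. */
typedef unsigned long (*sk_unmap_batch_fn)(uint64_t gfn, unsigned long reps);

static int sk_unmap_in_batches(uint64_t gfn, unsigned long page_count,
			       sk_unmap_batch_fn unmap_batch)
{
	unsigned long done = 0;

	while (done < page_count) {
		unsigned long remain = page_count - done;
		unsigned long reps = remain < SK_UNMAP_GPA_BATCH_SIZE ?
				     remain : SK_UNMAP_GPA_BATCH_SIZE;
		unsigned long completed = unmap_batch(gfn + done, reps);

		/* Treat no progress or a bogus count as a failure. */
		if (completed == 0 || completed > reps)
			return -1;

		done += completed;
	}

	return 0;
}

/* Example callback that always completes the whole batch. */
static unsigned long sk_fake_calls;

static unsigned long sk_fake_unmap(uint64_t gfn, unsigned long reps)
{
	(void)gfn;
	sk_fake_calls++;
	return reps;
}
```

With the example callback, unmapping 1300 pages takes three batches (512,
512 and 276).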

>> +
>> +		local_irq_save(irq_flags);
>> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +
>> +		input_page->target_partition_id = partition_id;
>> +		input_page->target_gpa_base = gfn + (done << large_shift);
>> +		input_page->unmap_flags = flags;
>> +		status = hv_do_rep_hypercall(HVCALL_UNMAP_GPA_PAGES, rep_count,
>> +					     0, input_page, NULL);
>> +		local_irq_restore(irq_flags);
>> +
>> +		completed = hv_repcomp(status);
>> +		if (!hv_result_success(status)) {
>> +			ret = hv_result_to_errno(status);
>> +			break;
>> +		}
>> +
>> +		done += completed;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
>> +				  union hv_gpa_page_access_state_flags state_flags,
>> +				  int *written_total,
>> +				  union hv_gpa_page_access_state *states)
>> +{
>> +	struct hv_input_get_gpa_pages_access_state *input_page;
>> +	union hv_gpa_page_access_state *output_page;
>> +	int completed = 0;
>> +	unsigned long remaining = count;
>> +	int rep_count, i;
>> +	u64 status;
>> +	unsigned long flags;
>> +
>> +	*written_total = 0;
>> +	while (remaining) {
>> +		local_irq_save(flags);
>> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +		output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +		input_page->partition_id = partition_id;
>> +		input_page->hv_gpa_page_number = gpa_base_pfn + *written_total;
>> +		input_page->flags = state_flags;
>> +		rep_count = min(remaining, HV_GET_GPA_ACCESS_STATES_BATCH_SIZE);
>> +
>> +		status = hv_do_rep_hypercall(HVCALL_GET_GPA_PAGES_ACCESS_STATES, rep_count,
>> +					     0, input_page, output_page);
>> +		if (!hv_result_success(status)) {
>> +			local_irq_restore(flags);
>> +			break;
>> +		}
>> +		completed = hv_repcomp(status);
>> +		for (i = 0; i < completed; ++i)
>> +			states[i].as_uint8 = output_page[i].as_uint8;
>> +
>> +		states += completed;
>> +		*written_total += completed;
>> +		remaining -= completed;
>> +		local_irq_restore(flags);
> 
> FWIW, this local_irq_restore() could move up three lines to before the progress
> accounting is done.
> 
Good point, thanks.
<snip>
>> +		memset(input, 0, sizeof(*input));
>> +		memset(output, 0, sizeof(*output));
> 
> Why is the output set to zero?  I would think Hyper-V is responsible for
> ensuring that the output is properly populated, with unused fields/areas
> set to zero.
> 
Overabundance of caution, I think! It doesn't need to be zeroed AFAIK.

I recently did some cleanup (in our internal tree) to make sure we are
memset()ing the input and *not* memset()ing the output everywhere, but
it didn't make it into this series. There are a few more places like this.

<snip>
>> +
>> +int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
>> +			 /* Choose between pages and bytes */
>> +			 struct hv_vp_state_data state_data, u64 page_count,
> 
> The size of "struct hv_vp_state_data" looks to be 24 bytes (3 64-bit words).
> Is there a reason to pass this by value instead of as a pointer? I guess it works
> like this, but it seems atypical.
> 
No particular reason. I'm guessing the compiler will pass this by copying it to this
function's stack frame - 24 bytes is still rather small so I don't think it's an issue.

I'm also under the impression the compiler may optimize this to a pointer since it is
not modified?

I usually only pass a pointer (for read-only values) when it's something really
large that I *definitely* don't want to be copied on the stack (like, 100 bytes?).
In that case I probably only have a pointer to vmalloc()'d/kmalloc()'d memory anyway.

<snip>
>> +	local_irq_save(flags);
>> +	status = hv_do_fast_hypercall8(HVCALL_CLEAR_VIRTUAL_INTERRUPT,
>> +				       partition_id) &
>> +			HV_HYPERCALL_RESULT_MASK;
> 
> This "anding" with HV_HYPERCALL_RESULT_MASK should be removed.
> 
Yep, thanks.

>> +	local_irq_restore(flags);
> 
> The irq save/restore isn't needed here since this is a fast hypercall and
> per-cpu arg memory is not used.
> 
Agreed, will remove these for the fast hypercall sites.

<snip>
>> +		input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
>> +		status = hv_do_hypercall(HVCALL_CREATE_PORT, input, NULL) &
>> +			 HV_HYPERCALL_RESULT_MASK;
> 
> Use the hv_status checking macros instead of and'ing with
> HV_HYPERCALL_RESULT_MASK.
> 
Thanks, these need a bit of cleanup.
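
For reference, a userspace sketch of what the status helpers do, assuming
the result code sits in bits 15:0 of the 64-bit status and the completed rep
count in bits 43:32. The sk_* names are stand-ins for hv_result(),
hv_result_success() and hv_repcomp(), not the kernel definitions:

```c
#include <stdint.h>

/* Assumed layout of the 64-bit hypercall status. */
#define SK_RESULT_MASK    0xffffULL
#define SK_REP_COMP_SHIFT 32
#define SK_REP_COMP_MASK  (0xfffULL << SK_REP_COMP_SHIFT)
#define SK_STATUS_SUCCESS 0

/* Extract the result code without discarding the rest of the status. */
static inline uint16_t sk_result(uint64_t status)
{
	return (uint16_t)(status & SK_RESULT_MASK);
}

static inline int sk_result_success(uint64_t status)
{
	return sk_result(status) == SK_STATUS_SUCCESS;
}

/* Number of reps the hypervisor reports as completed. */
static inline unsigned int sk_repcomp(uint64_t status)
{
	return (unsigned int)((status & SK_REP_COMP_MASK) >> SK_REP_COMP_SHIFT);
}
```

Keeping the full u64 status and going through helpers like these means the
call site never needs the open-coded mask.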

<snip>
>> +	status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT,
>> +					input.as_uint64[0],
>> +					input.as_uint64[1]) &
>> +			HV_HYPERCALL_RESULT_MASK;
>> +	local_irq_restore(flags);
> 
> Same as previous comment about and'ing.  And irq save/restore
> isn't needed.
> 
ack

<snip>
>> +		status = hv_do_hypercall(HVCALL_CONNECT_PORT, input, NULL) &
>> +			 HV_HYPERCALL_RESULT_MASK;
> 
> Same here.  Use hv_* macros.
> 
ack

<snip>
>> +	status = hv_do_fast_hypercall16(HVCALL_DISCONNECT_PORT,
>> +					input.as_uint64[0],
>> +					input.as_uint64[1]) &
>> +			HV_HYPERCALL_RESULT_MASK;
>> +	local_irq_restore(flags);
> 
> Same as above.
> 
ack

<snip>
>> +	local_irq_save(flags);
>> +	input.sint_index = sint_index;
>> +	status = hv_do_fast_hypercall8(HVCALL_NOTIFY_PORT_RING_EMPTY,
>> +				       input.as_uint64) &
>> +		 HV_HYPERCALL_RESULT_MASK;
>> +	local_irq_restore(flags);
> 
> Same as above.
> 
ack, and I'll double check we don't have other fast hypercall sites doing this

<snip>
>> +		/*
>> +		 * This is required to make sure that reserved field is set to
>> +		 * zero, because MSHV has a check to make sure reserved bits are
>> +		 * set to zero.
>> +		 */
> 
> Is this comment about checking reserved bits unique to this hypercall? If not, it
> seems a little odd to see this comment here, but not other places where the input
> is zero'ed.
> 
I agree the comment isn't necessary - memset()ing the input to zero should be the
default policy. I'll remove it.

>> +		memset(input_page, 0, sizeof(*input_page));
>> +		/* Only set the partition id if you are making the pages
>> +		 * exclusive
>> +		 */
>> +		if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE)
>> +			input_page->partition_id = partition_id;
>> +		input_page->flags = flags;
>> +		input_page->host_access = host_access;
>> +
>> +		for (i = 0; i < rep_count; i++) {
>> +			u64 index = (done + i) << large_shift;
>> +
>> +			if (index >= page_struct_count)
>> +				return -EINVAL;
> 
> Can this test ever be true?
> 
See above in hv_do_map_gpa_hcall(), it's more of a sanity check or assert.

>> +
>> +			input_page->spa_page_list[i] =
>> +						page_to_pfn(pages[index]);
> 
> When large_shift is non-zero, it seems weird to be skipping over most
> of the entries in the "pages" array.  But maybe there's a reason for that.
> 
See above where we do the same thing in hv_do_map_gpa_hcall(). The hypervisor
only needs to see the "head" pages - the GPAs of the 2MB pages.

>> +		}
>> +
>> +		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
>> +					     NULL);
>> +		local_irq_restore(irq_flags);
>> +
>> +		completed = hv_repcomp(status);
>> +
>> +		if (!hv_result_success(status))
>> +			return hv_result_to_errno(status);
>> +
>> +		done += completed;
>> +	}
>> +
>> +	return 0;
>> +}
> 
> [snip the rest of the patch that I haven't reviewed yet]
> 
> Michael


^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-14  2:15     ` Nuno Das Neves
@ 2025-03-14  3:27       ` Michael Kelley
  0 siblings, 0 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-14  3:27 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, March 13, 2025 7:16 PM
> 
> On 3/13/2025 9:43 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday,
> February 26, 2025 3:08 PM
> >>
> >
> > I've done a partial review of the code in this patch.  See comments inline
> > as usual.
> >
> > I'd like to still review most of the code in mshv_root_main.c, and maybe
> > some of mshv_synic.c and include/uapi/linux/mshv.c. I'll send a separate
> > email with those comments when I complete them. The patch is huge, so
> > I'm breaking my review comments into two parts.
> >
> > I've glanced through mshv_eventfd.c, mshv_eventfd.h, and mshv_irq.c,
> > but I don't have enough knowledge/expertise in these areas to add any
> > useful comments, so I'm not planning to review them further.
> >
> Thanks for taking a look. Just so you know, I was getting ready to post v6 of
> this patchset when I saw this email. So not all the comments will be addressed
> in the next version, but I've noted them and I will keep an eye out for the
> second part if you send it after v6 is posted.

That's fine.

> 
> <snip>
> >> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> index 6d1465315df3..66dcfaae698b 100644
> >> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >> @@ -370,6 +370,8 @@ Code  Seq#    Include File                                           Comments
> >>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-
> >> remoteproc@vger.kernel.org>
> >>  0xB7  all    uapi/linux/nsfs.h                                       <mailto:Andrei Vagin
> >> <avagin@openvz.org>>
> >>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
> >> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver
> >
> > Hmmm. Doesn't this mean that the mshv ioctls overlap with the Marvell
> > CN10K DPI ioctls? Is that intentional? I thought the goal of the central
> > registry in ioctl-number.rst is to avoid overlap.
> >
> Yes, they overlap. In practice it really doesn't matter IMO - IOCTL numbers
> are only interpreted by the driver of the device that the ioctl() syscall
> is made on.
> 
> I believe the whole scheme to generate unique IOCTL numbers and try not to
> overlap them was for some case I'm not familiar with - something like
> multiple drivers handling IOCTLs on the same device FD? And maybe it's handy
> in debugging if you see an IOCTL number in isolation and want to know where
> it comes from?

Yes, I think the debugging aspect is one that is mentioned in the text in
ioctl-number.rst. For example, maybe you want to filter the ioctl() system call
based on a particular ioctl value.

> 
> On a practical note, we have been using this IOCTL range for some time
> in other upstream code like our userspace rust library which interfaces with
> this driver (https://github.com/rust-vmm/mshv). So it would also be nice to
> keep that all working as much as possible with the kernel code that is on
> this mailing list.

I can see that having to change the ioctl values could be a pain. And apparently
there are already some historical overlaps. Also I saw that later in Patch 10 where
the mshv ioctls are defined, you have some overlaps just within the mshv space.
I don't really like the idea of having overlaps, either within the mshv space, or
with other drivers.  Doing a transition to non-overlapping values would probably
mean the driver having to accept both the "old" and "new" values for a while
until the rust library can be updated and deployed copies using the old values
go away.  It could be done in a relatively seamless fashion, but I can't really
make a strong argument that it would be worth the hassle.

> 
> <snip>
> >> +#endif /* _MSHV_H */
> >> diff --git a/drivers/hv/mshv_common.c b/drivers/hv/mshv_common.c
> >> new file mode 100644
> >> index 000000000000..d97631dcbee1
> >> --- /dev/null
> >> +++ b/drivers/hv/mshv_common.c
> >> @@ -0,0 +1,161 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Copyright (c) 2024, Microsoft Corporation.
> >> + *
> >> + * This file contains functions that are called from one or more modules: ROOT,
> >> + * DIAG, or VTL. If any of these modules are configured to build, this file is
> >
> > What are the DIAG and VTL modules?  I see only a root module in the Makefile.
> >
> Ah, yep, they are not in this patchset but will follow. I can remove the
> references to them here, and make this comment future tense: "functions that
> WILL be called from one or more modules".
> 
> <snip>
> >> +
> >> +struct mshv_vp {
> >> +	u32 vp_index;
> >> +	struct mshv_partition *vp_partition;
> >> +	struct mutex vp_mutex;
> >> +	struct hv_vp_register_page *vp_register_page;
> >> +	struct hv_message *vp_intercept_msg_page;
> >> +	void *vp_ghcb_page;
> >> +	struct hv_stats_page *vp_stats_pages[2];
> >> +	struct {
> >> +		atomic64_t vp_signaled_count;
> >> +		struct {
> >> +			u64 intercept_suspend: 1;
> >> +			u64 root_sched_blocked: 1; /* root scheduler only */
> >> +			u64 root_sched_dispatched: 1; /* root scheduler only */
> >> +			u64 reserved: 62;
> >
> > Hmmm.  This looks like 65 bits allocated in a u64.
> >
> Indeed it is, good catch
> 
> >> +
> >> +	/*
> >> +	 * Since MSHV does not support more than one async hypercall in flight
> >
> > Wording is a bit messed up.  Drop the "Since"?
> >
> Yep, thanks
> 
> >> +	 * for a single partition. Thus, it is okay to define per partition
> >> +	 * async hypercall status.
> >> +	 */
> <snip>
> >> +
> >> +extern struct mshv_root mshv_root;
> >> +extern enum hv_scheduler_type hv_scheduler_type;
> >> +extern u8 __percpu **hv_synic_eventring_tail;
> >
> > Per comments on an earlier patch, the __percpu is in the wrong place.
> >
> Thanks, will fix here too.
> <snip>
> >> +int hv_call_create_partition(u64 flags,
> >> +			     struct hv_partition_creation_properties creation_properties,
> >> +			     union hv_partition_isolation_properties isolation_properties,
> >> +			     u64 *partition_id)
> >> +{
> >> +	struct hv_input_create_partition *input;
> >> +	struct hv_output_create_partition *output;
> >> +	u64 status;
> >> +	int ret;
> >> +	unsigned long irq_flags;
> >> +
> >> +	do {
> >> +		local_irq_save(irq_flags);
> >> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +		output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >> +
> >> +		memset(input, 0, sizeof(*input));
> >> +		input->flags = flags;
> >> +		input->compatibility_version = HV_COMPATIBILITY_21_H2;
> >> +
> >> +		memcpy(&input->partition_creation_properties, &creation_properties,
> >> +		       sizeof(creation_properties));
> >
> > This is an example of a generic question/concern that occurs in several places. By
> > doing a memcpy into the hypercall input, the assumption is that the creation
> > properties supplied by the caller have zeros in all the reserved or unused fields.
> > Is that a valid assumption?
> >
> When the entire struct is provided as a function parameter, I think it's a valid
> assumption that that struct is initialized correctly by the caller.
> 
> The alternative (taking it to an extreme, in my opinion) is that we go through
> each field in the parameters and assign them all individually, which could be quite
> a lot of fields. E.g. going through all the bits in these structs with 60+ bitfields
> and re-setting them here to be sure the reserved bits are 0.

Agreed -- I would not advocate the extreme alternative.  But perhaps the
requirement on the caller to correctly initialize all the memory should
be made more explicit in the cases where that's true.
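
To illustrate what I mean by making the requirement explicit, here is a minimal userspace sketch. All struct and function names are hypothetical stand-ins, not the real hypercall definitions. The contract is stated in a comment on the function, and the caller uses a designated initializer so that every unnamed member starts out zero:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a property struct with reserved space. */
struct creation_props {
	uint64_t enabled_features;
	uint64_t reserved[2];	/* must be zero when passed to the hypervisor */
};

/* Hypothetical stand-in for the hypercall input layout. */
struct create_input {
	uint64_t flags;
	struct creation_props props;
};

/*
 * Caller contract: *props must be fully initialized, including reserved
 * fields, because it is copied verbatim into the hypercall input.
 */
static void fill_create_input(struct create_input *in, uint64_t flags,
			      const struct creation_props *props)
{
	memset(in, 0, sizeof(*in));
	in->flags = flags;
	memcpy(&in->props, props, sizeof(*props));
}

/* A designated initializer zeroes every member that is not named. */
static struct creation_props make_props(uint64_t features)
{
	struct creation_props p = { .enabled_features = features };

	return p;
}

/* Check that the copied-in reserved fields are still zero. */
static bool reserved_is_zero(const struct create_input *in)
{
	return in->props.reserved[0] == 0 && in->props.reserved[1] == 0;
}
```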

> 
> >> +
> >> +		memcpy(&input->isolation_properties, &isolation_properties,
> >> +		       sizeof(isolation_properties));
> >> +
> >> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> >> +					 input, output);

<snip>

> >> +/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> >> +static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >> +			       u32 flags, struct page **pages, u64 mmio_spa)
> >> +{
> >> +	struct hv_input_map_gpa_pages *input_page;
> >> +	u64 status, *pfnlist;
> >> +	unsigned long irq_flags, large_shift = 0;
> >> +	int ret = 0, done = 0;
> >> +	u64 page_count = page_struct_count;
> >> +
> >> +	if (page_count == 0 || (pages && mmio_spa))
> >> +		return -EINVAL;
> >> +
> >> +	if (flags & HV_MAP_GPA_LARGE_PAGE) {
> >> +		if (mmio_spa)
> >> +			return -EINVAL;
> >> +
> >> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> >> +			return -EINVAL;
> >> +
> >> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> >> +		page_count >>= large_shift;
> >> +	}
> >> +
> >> +	while (done < page_count) {
> >> +		ulong i, completed, remain = page_count - done;
> >> +		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> >> +
> >> +		local_irq_save(irq_flags);
> >> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +
> >> +		input_page->target_partition_id = partition_id;
> >> +		input_page->target_gpa_base = gfn + (done << large_shift);
> >> +		input_page->map_flags = flags;
> >> +		pfnlist = input_page->source_gpa_page_list;
> >> +
> >> +		for (i = 0; i < rep_count; i++)
> >> +			if (flags & HV_MAP_GPA_NO_ACCESS) {
> >> +				pfnlist[i] = 0;
> >> +			} else if (pages) {
> >> +				u64 index = (done + i) << large_shift;
> >> +
> >> +				if (index >= page_struct_count) {
> >
> > Can this test ever be true?  It looks like the pages array must
> > have space for each 4K page even if mapping in 2Meg granularity.
> > But only every 512th entry in the pages array is looked at
> > (which seems a little weird). But based on how rep_count is set up,
> > I don't see how the algorithm could go past the end of the pages
> > array.
> >
> I don't think the test can actually be true - IIRC I wrote it as a kind
> of "is my math correct?" sanity check, and there was a pr_err() or a
> WARN() here in a previous iteration of the code.
> 
> The large page list is a bit weird - When we allocate the large pages in
> the kernel, we get all the (4K) page structs for that range back from the
> kernel, and we hang onto them. When mapping the large pages into the
> hypervisor we just have to map the PFN of the first page of each 2M page,
> hence the skipping.
> 
> Now I'm thinking about it again, maybe we can discard most of the 4K page
> structs the kernel gives back and keep it as a packed array of the "head"
> pages which are all we really need (and then also simplify this mapping
> code and save some memory).
> 
> The current code was just the simplest way to add the large page
> functionality on top of what we already had, but looks like it could
> probably be improved.
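
As a sanity check of the index math, here is a small userspace sketch. The helper is hypothetical, and the shift is assumed to be 9 (2M page shift of 21 minus 4K page shift of 12). It computes the largest index into the 4K pages[] array that the loop would touch when mapping at 2M granularity:

```c
#include <stdint.h>

#define LARGE_SHIFT 9	/* assumed: 512 4K pages per 2M page */

/*
 * Largest pages[] index touched when page_struct_count 4K entries are
 * mapped as 2M pages. page_struct_count must be a non-zero multiple of 512.
 */
static uint64_t max_head_index(uint64_t page_struct_count)
{
	uint64_t large_count = page_struct_count >> LARGE_SHIFT;

	/* The loop visits (done + i) << LARGE_SHIFT for done + i < large_count. */
	return (large_count - 1) << LARGE_SHIFT;
}
```

For any aligned count the result is exactly 512 below page_struct_count, which is why the "index >= page_struct_count" test looks unreachable.
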
> 
> >> +					ret = -EINVAL;
> >> +					break;
> >> +				}
> >> +				pfnlist[i] = page_to_pfn(pages[index]);
> >> +			} else {
> >> +				pfnlist[i] = mmio_spa + done + i;
> >> +			}
> >> +		if (ret)
> >> +			break;
> >
> > This test could also go away if the ret = -EINVAL error above can't
> > happen.
> >
> Ack
> <snip>
> >> +
> >> +/* Ask the hypervisor to map guest mmio space */
> >> +int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> >> +{
> >> +	int i;
> >> +	u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
> >> +		    HV_MAP_GPA_NOT_CACHED;
> >> +
> >> +	for (i = 0; i < numpgs; i++)
> >> +		if (page_is_ram(mmio_spa + i))
> >
> > FWIW, doing this check one-page-at-a-time is somewhat expensive if numpgs
> > is large. The underlying data structures should support doing a single range
> > check, but I haven't looked at whether functions exist to do such a range check.
> >
> Indeed - I'll make a note to investigate, thanks.
> 
> >> +			return -EINVAL;
> >> +
> >> +	return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> >> +				   mmio_spa);
> >> +}
> >> +
> >> +int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> >> +			    u32 flags)
> >> +{
> >> +	struct hv_input_unmap_gpa_pages *input_page;
> >> +	u64 status, page_count = page_count_4k;
> >> +	unsigned long irq_flags, large_shift = 0;
> >> +	int ret = 0, done = 0;
> >> +
> >> +	if (page_count == 0)
> >> +		return -EINVAL;
> >> +
> >> +	if (flags & HV_UNMAP_GPA_LARGE_PAGE) {
> >> +		if (!HV_PAGE_COUNT_2M_ALIGNED(page_count))
> >> +			return -EINVAL;
> >> +
> >> +		large_shift = HV_HYP_LARGE_PAGE_SHIFT - HV_HYP_PAGE_SHIFT;
> >> +		page_count >>= large_shift;
> >> +	}
> >> +
> >> +	while (done < page_count) {
> >> +		ulong completed, remain = page_count - done;
> >> +		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> >
> > Using HV_MAP_GPA_BATCH_SIZE seems a little weird here since there's
> > no input array and hence no constraint based on keeping input args to
> > just one page. Is it being used as an arbitrary limit so the rep_count
> > passed to the hypercall isn't "too large" for some definition of "too large"?
> > If that's the case, perhaps a separate #define and a comment would
> > make sense. I kept trying to figure out how the batch size for unmap was
> > related to the map hypercall, and I don't think there is any relationship.
> >
> I think batching this was intentional so that we can be sure to re-enable
> interrupts periodically when unmapping an entire VM's worth of memory. That
> said, as you know the hypercall will return if it takes longer than a certain
> amount of time so I guess that is "built-in" in some sense.
> 
> I think keeping the batching, but #defining a specific value for unmap as you
> suggest is a good idea.
> 
> I'd be inclined to use a similar number (something like 512).

Yes, I agree the batching should be kept because interrupts are disabled.
If the rep hypercall is taking a "long time", it will by itself come up for
air to allow taking interrupts. If interrupts were not disabled, that would
solve the problem. But with interrupts disabled, you do need some
batching.

Using a number like 512 makes sense to me. Just add a comment that
the value is somewhat arbitrary and only to allow for interrupts. It's
not based on memory space for input/output arguments like all the
other batch sizes.
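
Something along these lines is what I have in mind. This is a userspace sketch only: the #define name is hypothetical, and the loop just counts the irq-off windows instead of making hypercalls, assuming each call completes its whole batch:

```c
#include <stdint.h>

/*
 * Hypothetical name. The value is somewhat arbitrary: it only bounds how
 * long interrupts stay disabled per hypercall. It is not derived from the
 * size of the input/output argument pages.
 */
#define UNMAP_GPA_BATCH_SIZE 512

/* Number of hypercalls (irq-off windows) needed to unmap page_count pages. */
static uint64_t unmap_batches(uint64_t page_count)
{
	uint64_t done = 0, calls = 0;

	while (done < page_count) {
		uint64_t remain = page_count - done;
		uint64_t rep_count = remain < UNMAP_GPA_BATCH_SIZE ?
				     remain : UNMAP_GPA_BATCH_SIZE;

		/* In the driver: irq save, rep hypercall, irq restore. */
		done += rep_count;
		calls++;
	}

	return calls;
}
```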

> 
> >> +
> >> +		local_irq_save(irq_flags);
> >> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +
> >> +		input_page->target_partition_id = partition_id;
> >> +		input_page->target_gpa_base = gfn + (done << large_shift);
> >> +		input_page->unmap_flags = flags;
> >> +		status = hv_do_rep_hypercall(HVCALL_UNMAP_GPA_PAGES, rep_count,
> >> +					     0, input_page, NULL);
> >> +		local_irq_restore(irq_flags);
> >> +
> >> +		completed = hv_repcomp(status);
> >> +		if (!hv_result_success(status)) {
> >> +			ret = hv_result_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		done += completed;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
> >> +				  union hv_gpa_page_access_state_flags state_flags,
> >> +				  int *written_total,
> >> +				  union hv_gpa_page_access_state *states)
> >> +{
> >> +	struct hv_input_get_gpa_pages_access_state *input_page;
> >> +	union hv_gpa_page_access_state *output_page;
> >> +	int completed = 0;
> >> +	unsigned long remaining = count;
> >> +	int rep_count, i;
> >> +	u64 status;
> >> +	unsigned long flags;
> >> +
> >> +	*written_total = 0;
> >> +	while (remaining) {
> >> +		local_irq_save(flags);
> >> +		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +		output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >> +
> >> +		input_page->partition_id = partition_id;
> >> +		input_page->hv_gpa_page_number = gpa_base_pfn + *written_total;
> >> +		input_page->flags = state_flags;
> >> +		rep_count = min(remaining, HV_GET_GPA_ACCESS_STATES_BATCH_SIZE);
> >> +
> >> +		status = hv_do_rep_hypercall(HVCALL_GET_GPA_PAGES_ACCESS_STATES, rep_count,
> >> +					     0, input_page, output_page);
> >> +		if (!hv_result_success(status)) {
> >> +			local_irq_restore(flags);
> >> +			break;
> >> +		}
> >> +		completed = hv_repcomp(status);
> >> +		for (i = 0; i < completed; ++i)
> >> +			states[i].as_uint8 = output_page[i].as_uint8;
> >> +
> >> +		states += completed;
> >> +		*written_total += completed;
> >> +		remaining -= completed;
> >> +		local_irq_restore(flags);
> >
> > FWIW, this local_irq_restore() could move up three lines to before the progress
> > accounting is done.
> >
> Good point, thanks.
> <snip>
> >> +		memset(input, 0, sizeof(*input));
> >> +		memset(output, 0, sizeof(*output));
> >
> > Why is the output set to zero?  I would think Hyper-V is responsible for
> > ensuring that the output is properly populated, with unused fields/areas
> > set to zero.
> >
> Overabundance of caution, I think! It doesn't need to be zeroed AFAIK.
> 
> I recently did some cleanup (in our internal tree) to make sure we are
> memset()ing the input and *not* memset()ing the output everywhere, but
> it didn't make it into this series. There are a few more places like this.
> 
> <snip>
> >> +
> >> +int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
> >> +			 /* Choose between pages and bytes */
> >> +			 struct hv_vp_state_data state_data, u64 page_count,
> >
> > The size of "struct hv_vp_state_data" looks to be 24 bytes (3 64-bit words).
> > Is there a reason to pass this by value instead of as a pointer? I guess it works
> > like this, but it seems atypical.
> >
> No particular reason. I'm guessing the compiler will pass this by copying it to this
> function's stack frame - 24 bytes is still rather small so I don't think it's an issue.

Right, I think the code works as written. It's just atypical. When I see stuff that
doesn't fit the usual pattern, I always wonder "why?"  And if there's no good
reason "why", reverting to the usual pattern avoids somebody else wondering
"why" in the future. :-)  But you can leave it as is.

> 
> I'm also under the impression the compiler may optimize this to a pointer since it is
> not modified?

It might. I'm not sure.

> 
> I usually only pass a pointer (for read-only values) when it's something really
> large that I *definitely* don't want to be copied on the stack (like, 100 bytes?).
> In that case I probably only have a pointer to vmalloc'd/kmalloc()'d memory anyway.
> 
> <snip>
> >> +	local_irq_save(flags);
> >> +	status = hv_do_fast_hypercall8(HVCALL_CLEAR_VIRTUAL_INTERRUPT,
> >> +				       partition_id) &
> >> +			HV_HYPERCALL_RESULT_MASK;
> >
> > This "anding" with HV_HYPERCALL_RESULT_MASK should be removed.
> >
> Yep, thanks.
> 
> >> +	local_irq_restore(flags);
> >
> > The irq save/restore isn't needed here since this is a fast hypercall and
> > per-cpu arg memory is not used.
> >
> Agreed, will remove these for the fast hypercall sites.
> 
> <snip>
> >> +		input->proximity_domain_info = hv_numa_node_to_pxm_info(node);
> >> +		status = hv_do_hypercall(HVCALL_CREATE_PORT, input, NULL) &
> >> +			 HV_HYPERCALL_RESULT_MASK;
> >
> > Use the hv_status checking macros instead of and'ing with
> > HV_HYPERCALL_RESULT_MASK.
> >
> Thanks, these need a bit of cleanup.
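
For anyone following along, here is a userspace sketch of what the status helpers do, as I understand the layout of the hypercall result value (result code in bits 15:0, reps completed in bits 43:32). The definitions below only mirror the kernel helpers for illustration, with uint64_t standing in for u64:

```c
#include <stdbool.h>
#include <stdint.h>

#define HV_STATUS_SUCCESS		0
#define HV_HYPERCALL_RESULT_MASK	0xffffULL		/* bits 15:0 */
#define HV_HYPERCALL_REP_COMP_OFFSET	32
#define HV_HYPERCALL_REP_COMP_MASK	(0xfffULL << 32)	/* bits 43:32 */

/* Extract the result code from a raw hypercall status. */
static inline int hv_result(uint64_t status)
{
	return status & HV_HYPERCALL_RESULT_MASK;
}

/* True if the result code is HV_STATUS_SUCCESS. */
static inline bool hv_result_success(uint64_t status)
{
	return hv_result(status) == HV_STATUS_SUCCESS;
}

/* Number of reps completed by a rep hypercall. */
static inline unsigned int hv_repcomp(uint64_t status)
{
	return (status & HV_HYPERCALL_REP_COMP_MASK) >>
	       HV_HYPERCALL_REP_COMP_OFFSET;
}
```

Keeping the raw status and using the helpers preserves the rep count, which is lost when the status is masked at the call site.
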
> 
> <snip>
> >> +	status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT,
> >> +					input.as_uint64[0],
> >> +					input.as_uint64[1]) &
> >> +			HV_HYPERCALL_RESULT_MASK;
> >> +	local_irq_restore(flags);
> >
> > Same as previous comment about and'ing.  And irq save/restore
> > isn't needed.
> >
> ack
> 
> <snip>
> >> +		status = hv_do_hypercall(HVCALL_CONNECT_PORT, input, NULL) &
> >> +			 HV_HYPERCALL_RESULT_MASK;
> >
> > Same here.  Use hv_* macros.
> >
> ack
> 
> <snip>
> >> +	status = hv_do_fast_hypercall16(HVCALL_DISCONNECT_PORT,
> >> +					input.as_uint64[0],
> >> +					input.as_uint64[1]) &
> >> +			HV_HYPERCALL_RESULT_MASK;
> >> +	local_irq_restore(flags);
> >
> > Same as above.
> >
> ack
> 
> <snip>
> >> +	local_irq_save(flags);
> >> +	input.sint_index = sint_index;
> >> +	status = hv_do_fast_hypercall8(HVCALL_NOTIFY_PORT_RING_EMPTY,
> >> +				       input.as_uint64) &
> >> +		 HV_HYPERCALL_RESULT_MASK;
> >> +	local_irq_restore(flags);
> >
> > Same as above.
> >
> ack, and I'll double check we don't have other fast hypercall sites doing this
> 
> <snip>
> >> +		/*
> >> +		 * This is required to make sure that reserved field is set to
> >> +		 * zero, because MSHV has a check to make sure reserved bits are
> >> +		 * set to zero.
> >> +		 */
> >
> > Is this comment about checking reserved bits unique to this hypercall? If not, it
> > seems a little odd to see this comment here, but not other places where the input
> > is zero'ed.
> >
> I agree the comment isn't necessary - memset()ing the input to zero should be the
> default policy. I'll remove it.

This is another case where something doesn't fit the usual pattern, and I wonder
"why". :-)

> 
> >> +		memset(input_page, 0, sizeof(*input_page));
> >> +		/* Only set the partition id if you are making the pages
> >> +		 * exclusive
> >> +		 */
> >> +		if (flags & HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE)
> >> +			input_page->partition_id = partition_id;
> >> +		input_page->flags = flags;
> >> +		input_page->host_access = host_access;
> >> +
> >> +		for (i = 0; i < rep_count; i++) {
> >> +			u64 index = (done + i) << large_shift;
> >> +
> >> +			if (index >= page_struct_count)
> >> +				return -EINVAL;
> >
> > Can this test ever be true?
> >
> See above in hv_do_map_gpa_hcall(), it's more of a sanity check or assert.
> 
> >> +
> >> +			input_page->spa_page_list[i] =
> >> +						page_to_pfn(pages[index]);
> >
> > When large_shift is non-zero, it seems weird to be skipping over most
> > of the entries in the "pages" array.  But maybe there's a reason for that.
> >
> See above where we do the same thing in hv_do_map_gpa_hcall(). The hypervisor
> only needs to see the "head" pages - the GPAs of the 2MB pages.
> 
> >> +		}
> >> +
> >> +		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> >> +					     NULL);
> >> +		local_irq_restore(irq_flags);
> >> +
> >> +		completed = hv_repcomp(status);
> >> +
> >> +		if (!hv_result_success(status))
> >> +			return hv_result_to_errno(status);
> >> +
> >> +		done += completed;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> > [snip the rest of the patch that I haven't reviewed yet]
> >
> > Michael


^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-02-26 23:08 ` [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs Nuno Das Neves
                     ` (3 preceding siblings ...)
  2025-03-13 16:43   ` Michael Kelley
@ 2025-03-17 23:51   ` Michael Kelley
  2025-03-18 17:24     ` Wei Liu
  2025-03-19  0:34     ` Nuno Das Neves
  4 siblings, 2 replies; 108+ messages in thread
From: Michael Kelley @ 2025-03-17 23:51 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 

This is part 2 of my review of this large patch.

[snipping what I already reviewed or decided to skip in part 1 of my review]

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> new file mode 100644
> index 000000000000..fed19aa80049
> --- /dev/null
> +++ b/drivers/hv/mshv_root_main.c
> @@ -0,0 +1,2329 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024, Microsoft Corporation.
> + *
> + * The main part of the mshv_root module, providing APIs to create
> + * and manage guest partitions.
> + *
> + * Authors: Microsoft Linux virtualization team
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/fs.h>
> +#include <linux/miscdevice.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/mm.h>
> +#include <linux/io.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/random.h>
> +#include <asm/mshyperv.h>
> +#include <linux/hyperv.h>
> +#include <linux/notifier.h>
> +#include <linux/reboot.h>
> +#include <linux/kexec.h>
> +#include <linux/page-flags.h>
> +#include <linux/crash_dump.h>
> +#include <linux/panic_notifier.h>
> +#include <linux/vmalloc.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +/* TODO move this to mshyperv.h when needed outside driver */
> +static inline bool hv_parent_partition(void)
> +{
> +	return hv_root_partition();
> +}
> +
> +/* TODO move this to another file when debugfs code is added */
> +enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */
> +#if defined(CONFIG_X86)
> +	VpRootDispatchThreadBlocked			= 201,
> +#elif defined(CONFIG_ARM64)
> +	VpRootDispatchThreadBlocked			= 94,
> +#endif
> +	VpStatsMaxCounter
> +};

Where do these "magic" numbers come from?  Are they matching something
in the Hyper-V host?

> +
> +struct hv_stats_page {
> +	union {
> +		u64 vp_cntrs[VpStatsMaxCounter];		/* VP counters */
> +		u8 data[HV_HYP_PAGE_SIZE];
> +	};
> +} __packed;
> +
> +struct mshv_root mshv_root = {};

Initializer is unnecessary for global variables. They are already set to zero.
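
A tiny userspace illustration, with a hypothetical struct standing in for struct mshv_root. Objects with static storage duration are zero-initialized by the C language, so the "= {}" adds nothing:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mshv_root. */
struct root_state {
	uint64_t count;
	void *ptr;
};

/* No initializer: static storage duration means it starts out zeroed. */
struct root_state root_state_example;

/* Returns non-zero if every member still holds its zero/NULL value. */
static int root_state_is_zero(void)
{
	return root_state_example.count == 0 &&
	       root_state_example.ptr == NULL;
}
```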

> +
> +enum hv_scheduler_type hv_scheduler_type;
> +
> +/* Once we implement the fast extended hypercall ABI they can go away. */
> +static void __percpu **root_scheduler_input;
> +static void __percpu **root_scheduler_output;

The __percpu is probably in the wrong place, as mentioned in earlier
patches in this series.

> +
> +static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +static int mshv_dev_open(struct inode *inode, struct file *filp);
> +static int mshv_dev_release(struct inode *inode, struct file *filp);
> +static int mshv_vp_release(struct inode *inode, struct file *filp);
> +static long mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +static int mshv_partition_release(struct inode *inode, struct file *filp);
> +static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
> +static int mshv_init_async_handler(struct mshv_partition *partition);
> +static void mshv_async_hvcall_handler(void *data, u64 *status);
> +
> +static const struct vm_operations_struct mshv_vp_vm_ops = {
> +	.fault = mshv_vp_fault,
> +};
> +
> +static const struct file_operations mshv_vp_fops = {
> +	.owner = THIS_MODULE,
> +	.release = mshv_vp_release,
> +	.unlocked_ioctl = mshv_vp_ioctl,
> +	.llseek = noop_llseek,
> +	.mmap = mshv_vp_mmap,
> +};
> +
> +static const struct file_operations mshv_partition_fops = {
> +	.owner = THIS_MODULE,
> +	.release = mshv_partition_release,
> +	.unlocked_ioctl = mshv_partition_ioctl,
> +	.llseek = noop_llseek,
> +};
> +
> +static const struct file_operations mshv_dev_fops = {
> +	.owner = THIS_MODULE,
> +	.open = mshv_dev_open,
> +	.release = mshv_dev_release,
> +	.unlocked_ioctl = mshv_dev_ioctl,
> +	.llseek = noop_llseek,
> +};
> +
> +static struct miscdevice mshv_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "mshv",
> +	.fops = &mshv_dev_fops,
> +	.mode = 0600,
> +};
> +
> +/*
> + * Only allow hypercalls that have a u64 partition id as the first member of
> + * the input structure.
> + * These are sorted by value.
> + */
> +static u16 mshv_passthru_hvcalls[] = {
> +	HVCALL_GET_PARTITION_PROPERTY,
> +	HVCALL_SET_PARTITION_PROPERTY,
> +	HVCALL_INSTALL_INTERCEPT,
> +	HVCALL_GET_VP_REGISTERS,
> +	HVCALL_SET_VP_REGISTERS,
> +	HVCALL_TRANSLATE_VIRTUAL_ADDRESS,
> +	HVCALL_CLEAR_VIRTUAL_INTERRUPT,
> +	HVCALL_REGISTER_INTERCEPT_RESULT,
> +	HVCALL_ASSERT_VIRTUAL_INTERRUPT,
> +	HVCALL_GET_GPA_PAGES_ACCESS_STATES,
> +	HVCALL_SIGNAL_EVENT_DIRECT,
> +	HVCALL_POST_MESSAGE_DIRECT,
> +	HVCALL_GET_VP_CPUID_VALUES,
> +};
> +
> +static bool mshv_hvcall_is_async(u16 code)
> +{
> +	switch (code) {
> +	case HVCALL_SET_PARTITION_PROPERTY:
> +		return true;
> +	default:
> +		break;
> +	}
> +	return false;
> +}
> +
> +static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
> +				      bool partition_locked,
> +				      void __user *user_args)
> +{
> +	u64 status;
> +	int ret, i;

'ret' should be initialized to 0. There's a path through this function that
never sets 'ret' and the return value would be stack garbage.
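
A simplified userspace sketch of the shape of the problem. The function and its parameters are hypothetical and only model the control flow: 'ret' is assigned on the failure branches, so the plain success path depends entirely on the initial value:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Models the tail of the passthrough path: a non-async hypercall that
 * succeeds and copies out cleanly never assigns ret.
 */
static int passthru_sketch(bool hypercall_ok, bool copy_ok)
{
	int ret = 0;	/* without "= 0" the success path returns garbage */

	if (!hypercall_ok)
		ret = -EINVAL;	/* stands in for hv_result_to_errno() */

	if (!copy_ok)
		ret = -EFAULT;	/* stands in for a failed copy_to_user() */

	return ret;
}
```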

> +	bool is_async;
> +	struct mshv_root_hvcall args;
> +	struct page *page;
> +	unsigned int pages_order;
> +	void *input_pg = NULL;
> +	void *output_pg = NULL;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.status || !args.in_ptr || args.in_sz < sizeof(u64) ||
> +	    mshv_field_nonzero(args, rsvd) || args.in_sz > HV_HYP_PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (args.out_ptr && (!args.out_sz || args.out_sz > HV_HYP_PAGE_SIZE))
> +		return -EINVAL;
> +
> +	for (i = 0; i < ARRAY_SIZE(mshv_passthru_hvcalls); ++i)
> +		if (args.code == mshv_passthru_hvcalls[i])
> +			break;
> +
> +	if (i >= ARRAY_SIZE(mshv_passthru_hvcalls))
> +		return -EINVAL;
> +
> +	is_async = mshv_hvcall_is_async(args.code);
> +	if (is_async) {
> +		/* async hypercalls can only be called from partition fd */
> +		if (!partition_locked)
> +			return -EINVAL;
> +		ret = mshv_init_async_handler(partition);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	pages_order = args.out_ptr ? 1 : 0;
> +	page = alloc_pages(GFP_KERNEL, pages_order);
> +	if (!page)
> +		return -ENOMEM;
> +	input_pg = page_address(page);
> +
> +	if (args.out_ptr)
> +		output_pg = (char *)input_pg + PAGE_SIZE;
> +	else
> +		output_pg = NULL;
> +
> +	if (copy_from_user(input_pg, (void __user *)args.in_ptr,
> +			   args.in_sz)) {
> +		ret = -EFAULT;
> +		goto free_pages_out;
> +	}
> +
> +	/*
> +	 * NOTE: This only works because all the allowed hypercalls' input
> +	 * structs begin with a u64 partition_id field.
> +	 */
> +	*(u64 *)input_pg = partition->pt_id;
> +
> +	if (args.reps)
> +		status = hv_do_rep_hypercall(args.code, args.reps, 0,
> +					     input_pg, output_pg);
> +	else
> +		status = hv_do_hypercall(args.code, input_pg, output_pg);
> +
> +	if (hv_result(status) == HV_STATUS_CALL_PENDING) {
> +		if (is_async) {
> +			mshv_async_hvcall_handler(partition, &status);
> +		} else { /* Paranoia check. This shouldn't happen! */
> +			ret = -EBADFD;
> +			goto free_pages_out;
> +		}
> +	}
> +
> +	if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition->pt_id, 1);
> +		if (!ret)
> +			ret = -EAGAIN;
> +	} else if (!hv_result_success(status)) {
> +		ret = hv_result_to_errno(status);
> +	}
> +
> +	/*
> +	 * Always return the status and output data regardless of result.
> +	 * The VMM may need it to determine how to proceed. E.g. the status may
> +	 * contain the number of reps completed if a rep hypercall partially
> +	 * succeeded.
> +	 */
> +	args.status = hv_result(status);
> +	args.reps = args.reps ? hv_repcomp(status) : 0;
> +	if (copy_to_user(user_args, &args, sizeof(args)))
> +		ret = -EFAULT;
> +
> +	if (output_pg &&
> +	    copy_to_user((void __user *)args.out_ptr, output_pg, args.out_sz))
> +		ret = -EFAULT;
> +
> +free_pages_out:
> +	free_pages((unsigned long)input_pg, pages_order);
> +
> +	return ret;
> +}
> +
> +static inline bool is_ghcb_mapping_available(void)
> +{
> +#if IS_ENABLED(CONFIG_X86_64)
> +	return ms_hyperv.ext_features & HV_VP_GHCB_ROOT_MAPPING_AVAILABLE;
> +#else
> +	return 0;
> +#endif
> +}
> +
> +static int mshv_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +				 struct hv_register_assoc *registers)
> +{
> +	union hv_input_vtl input_vtl;
> +
> +	input_vtl.as_uint8 = 0;
> +	return hv_call_get_vp_registers(vp_index, partition_id,
> +					count, input_vtl, registers);
> +}
> +
> +static int mshv_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
> +				 struct hv_register_assoc *registers)
> +{
> +	union hv_input_vtl input_vtl;
> +
> +	input_vtl.as_uint8 = 0;
> +	return hv_call_set_vp_registers(vp_index, partition_id,
> +					count, input_vtl, registers);
> +}
> +
> +/*
> + * Explicit guest vCPU suspend is asynchronous by nature (as it is requested by
> + * dom0 vCPU for guest vCPU) and thus it can race with "intercept" suspend,
> + * done by the hypervisor.
> + * "Intercept" suspend leads to asynchronous message delivery to dom0 which
> + * should be awaited to keep the VP loop consistent (i.e. no message pending
> + * upon VP resume).
> + * VP intercept suspend can't be done when the VP is explicitly suspended
> + * already, and thus can be only two possible race scenarios:
> + *   1. implicit suspend bit set -> explicit suspend bit set -> message sent
> + *   2. implicit suspend bit set -> message sent -> explicit suspend bit set
> + * Checking for implicit suspend bit set after explicit suspend request has
> + * succeeded in either case allows us to reliably identify, if there is a
> + * message to receive and deliver to VMM.
> + */
> +static long

For this function, why is the return type "long" instead of "int"?  Same
question for several other functions below.  "long" works, but it's another
case of being gratuitously atypical -- unless there's a reason.

> +mshv_suspend_vp(const struct mshv_vp *vp, bool *message_in_flight)
> +{
> +	struct hv_register_assoc explicit_suspend = {
> +		.name = HV_REGISTER_EXPLICIT_SUSPEND
> +	};
> +	struct hv_register_assoc intercept_suspend = {
> +		.name = HV_REGISTER_INTERCEPT_SUSPEND
> +	};
> +	union hv_explicit_suspend_register *es =
> +		&explicit_suspend.value.explicit_suspend;
> +	union hv_intercept_suspend_register *is =
> +		&intercept_suspend.value.intercept_suspend;
> +	int ret;
> +
> +	es->suspended = 1;
> +
> +	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
> +				    1, &explicit_suspend);
> +	if (ret) {
> +		vp_err(vp, "Failed to explicitly suspend vCPU\n");
> +		return ret;
> +	}
> +
> +	ret = mshv_get_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
> +				    1, &intercept_suspend);
> +	if (ret) {
> +		vp_err(vp, "Failed to get intercept suspend state\n");
> +		return ret;
> +	}
> +
> +	*message_in_flight = is->suspended;
> +
> +	return 0;
> +}
> +
> +/*
> + * This function is used when VPs are scheduled by the hypervisor's
> + * scheduler.
> + *
> + * Caller has to make sure the registers contain cleared
> + * HV_REGISTER_INTERCEPT_SUSPEND and HV_REGISTER_EXPLICIT_SUSPEND registers
> + * exactly in this order (the hypervisor clears them sequentially) to avoid
> + * potential invalid clearing a newly arrived HV_REGISTER_INTERCEPT_SUSPEND
> + * after VP is released from HV_REGISTER_EXPLICIT_SUSPEND in case of the
> + * opposite order.
> + */
> +static long mshv_run_vp_with_hyp_scheduler(struct mshv_vp *vp)
> +{
> +	long ret;
> +	struct hv_register_assoc suspend_regs[2] = {
> +			{ .name = HV_REGISTER_INTERCEPT_SUSPEND },
> +			{ .name = HV_REGISTER_EXPLICIT_SUSPEND }
> +	};
> +	size_t count = ARRAY_SIZE(suspend_regs);
> +
> +	/* Resume VP execution */
> +	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
> +				    count, suspend_regs);
> +	if (ret) {
> +		vp_err(vp, "Failed to resume vp execution. %lx\n", ret);
> +		return ret;
> +	}
> +
> +	ret = wait_event_interruptible(vp->run.vp_suspend_queue,
> +				       vp->run.kicked_by_hv == 1);
> +	if (ret) {
> +		bool message_in_flight;
> +
> +		/*
> +		 * Otherwise the waiting was interrupted by a signal: suspend
> +		 * the vCPU explicitly and copy message in flight (if any).
> +		 */
> +		ret = mshv_suspend_vp(vp, &message_in_flight);
> +		if (ret)
> +			return ret;
> +
> +		/* Return if no message in flight */
> +		if (!message_in_flight)
> +			return -EINTR;
> +
> +		/* Wait for the message in flight. */
> +		wait_event(vp->run.vp_suspend_queue, vp->run.kicked_by_hv == 1);
> +	}
> +
> +	/*
> +	 * Reset the flag to make the wait_event call above work
> +	 * next time.
> +	 */
> +	vp->run.kicked_by_hv = 0;
> +
> +	return 0;
> +}
> +
> +static int
> +mshv_vp_dispatch(struct mshv_vp *vp, u32 flags,
> +		 struct hv_output_dispatch_vp *res)
> +{
> +	struct hv_input_dispatch_vp *input;
> +	struct hv_output_dispatch_vp *output;
> +	u64 status;
> +
> +	preempt_disable();
> +	input = *this_cpu_ptr(root_scheduler_input);
> +	output = *this_cpu_ptr(root_scheduler_output);
> +
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));
> +
> +	input->partition_id = vp->vp_partition->pt_id;
> +	input->vp_index = vp->vp_index;
> +	input->time_slice = 0; /* Run forever until something happens */
> +	input->spec_ctrl = 0; /* TODO: set sensible flags */
> +	input->flags = flags;
> +
> +	vp->run.flags.root_sched_dispatched = 1;
> +	status = hv_do_hypercall(HVCALL_DISPATCH_VP, input, output);
> +	vp->run.flags.root_sched_dispatched = 0;
> +
> +	*res = *output;
> +	preempt_enable();
> +
> +	if (!hv_result_success(status))
> +		vp_err(vp, "%s: status %s\n", __func__,
> +		       hv_result_to_string(status));
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int
> +mshv_vp_clear_explicit_suspend(struct mshv_vp *vp)
> +{
> +	struct hv_register_assoc explicit_suspend = {
> +		.name = HV_REGISTER_EXPLICIT_SUSPEND,
> +		.value.explicit_suspend.suspended = 0,
> +	};
> +	int ret;
> +
> +	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
> +				    1, &explicit_suspend);
> +
> +	if (ret)
> +		vp_err(vp, "Failed to unsuspend\n");
> +
> +	return ret;
> +}
> +
> +#if IS_ENABLED(CONFIG_X86_64)
> +static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
> +{
> +	if (!vp->vp_register_page)
> +		return 0;
> +	return vp->vp_register_page->interrupt_vectors.as_uint64;
> +}
> +#else
> +static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
> +{
> +	return 0;
> +}
> +#endif
> +
> +static bool mshv_vp_dispatch_thread_blocked(struct mshv_vp *vp)
> +{
> +	struct hv_stats_page **stats = vp->vp_stats_pages;
> +	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->vp_cntrs;
> +	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->vp_cntrs;
> +
> +	if (self_vp_cntrs[VpRootDispatchThreadBlocked])
> +		return self_vp_cntrs[VpRootDispatchThreadBlocked];
> +	return parent_vp_cntrs[VpRootDispatchThreadBlocked];
> +}
> +
> +static int
> +mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
> +{
> +	int ret;
> +
> +	ret = wait_event_interruptible(vp->run.vp_suspend_queue,
> +				       (vp->run.kicked_by_hv == 1 &&
> +					!mshv_vp_dispatch_thread_blocked(vp)) ||
> +				       mshv_vp_interrupt_pending(vp));
> +	if (ret)
> +		return -EINTR;
> +
> +	vp->run.flags.root_sched_blocked = 0;
> +	vp->run.kicked_by_hv = 0;
> +
> +	return 0;
> +}
> +
> +static int mshv_pre_guest_mode_work(struct mshv_vp *vp)
> +{
> +	const ulong work_flags = _TIF_NOTIFY_SIGNAL | _TIF_SIGPENDING |
> +				 _TIF_NEED_RESCHED  | _TIF_NOTIFY_RESUME;
> +	ulong th_flags;
> +
> +	th_flags = read_thread_flags();
> +	while (th_flags & work_flags) {
> +		int ret;
> +
> +		/* nb: following will call schedule */
> +		ret = mshv_do_pre_guest_mode_work(th_flags);
> +
> +		if (ret)
> +			return ret;
> +
> +		th_flags = read_thread_flags();
> +	}
> +
> +	return 0;
> +}
> +
> +/* Must be called with interrupts enabled */
> +static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
> +{
> +	long ret;
> +
> +	if (vp->run.flags.root_sched_blocked) {
> +		/*
> +		 * Dispatch state of this VP is blocked. Need to wait
> +		 * for the hypervisor to clear the blocked state before
> +		 * dispatching it.
> +		 */
> +		ret = mshv_vp_wait_for_hv_kick(vp);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	do {
> +		u32 flags = 0;
> +		struct hv_output_dispatch_vp output;
> +
> +		ret = mshv_pre_guest_mode_work(vp);
> +		if (ret)
> +			break;
> +
> +		if (vp->run.flags.intercept_suspend)
> +			flags |= HV_DISPATCH_VP_FLAG_CLEAR_INTERCEPT_SUSPEND;
> +
> +		if (mshv_vp_interrupt_pending(vp))
> +			flags |= HV_DISPATCH_VP_FLAG_SCAN_INTERRUPT_INJECTION;
> +
> +		ret = mshv_vp_dispatch(vp, flags, &output);
> +		if (ret)
> +			break;
> +
> +		vp->run.flags.intercept_suspend = 0;
> +
> +		if (output.dispatch_state == HV_VP_DISPATCH_STATE_BLOCKED) {
> +			if (output.dispatch_event ==
> +						HV_VP_DISPATCH_EVENT_SUSPEND) {
> +				/*
> +				 * TODO: remove the warning once VP canceling
> +				 *	 is supported
> +				 */
> +				WARN_ONCE(atomic64_read(&vp->run.vp_signaled_count),
> +					  "%s: vp#%d: unexpected explicit suspend\n",
> +					  __func__, vp->vp_index);
> +				/*
> +				 * Need to clear explicit suspend before
> +				 * dispatching.
> +				 * Explicit suspend is either:
> +				 * - set right after the first VP dispatch or
> +				 * - set explicitly via hypercall
> +				 * Since the latter case is not yet supported,
> +				 * simply clear it here.
> +				 */
> +				ret = mshv_vp_clear_explicit_suspend(vp);
> +				if (ret)
> +					break;
> +
> +				ret = mshv_vp_wait_for_hv_kick(vp);
> +				if (ret)
> +					break;
> +			} else {
> +				vp->run.flags.root_sched_blocked = 1;
> +				ret = mshv_vp_wait_for_hv_kick(vp);
> +				if (ret)
> +					break;
> +			}
> +		} else {
> +			/* HV_VP_DISPATCH_STATE_READY */
> +			if (output.dispatch_event ==
> +						HV_VP_DISPATCH_EVENT_INTERCEPT)
> +				vp->run.flags.intercept_suspend = 1;
> +		}
> +	} while (!vp->run.flags.intercept_suspend);
> +
> +	return ret;
> +}
> +
> +static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
> +	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
> +
> +static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
> +{
> +	long rc;
> +	char *schednm;
> +
> +	schednm = hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT ? "root" : "hv";
> +
> +	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> +		rc = mshv_run_vp_with_root_scheduler(vp);
> +	else
> +		rc = mshv_run_vp_with_hyp_scheduler(vp);
> +
> +	if (rc)
> +		return rc;
> +
> +	if (copy_to_user(ret_msg, vp->vp_intercept_msg_page,
> +			 sizeof(struct hv_message)))
> +		rc = -EFAULT;
> +
> +	return rc;
> +}
> +
> +static int
> +mshv_vp_ioctl_get_set_state_pfn(struct mshv_vp *vp,
> +				struct hv_vp_state_data state_data,
> +				unsigned long user_pfn, size_t page_count,
> +				bool is_set)
> +{
> +	int completed, ret = 0;
> +	unsigned long check;
> +	struct page **pages;
> +
> +	if (page_count > INT_MAX)
> +		return -EINVAL;
> +	/*
> +	 * Check the arithmetic for wraparound/overflow.
> +	 * The last page address in the buffer is:
> +	 * (user_pfn + (page_count - 1)) * PAGE_SIZE
> +	 */
> +	if (check_add_overflow(user_pfn, (page_count - 1), &check))
> +		return -EOVERFLOW;
> +	if (check_mul_overflow(check, PAGE_SIZE, &check))
> +		return -EOVERFLOW;
> +
> +	/* Pin user pages so hypervisor can copy directly to them */
> +	pages = kcalloc(page_count, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	for (completed = 0; completed < page_count; completed += ret) {
> +		unsigned long user_addr = (user_pfn + completed) * PAGE_SIZE;
> +		int remaining = page_count - completed;
> +
> +		ret = pin_user_pages_fast(user_addr, remaining, FOLL_WRITE,
> +					  &pages[completed]);
> +		if (ret < 0) {
> +			vp_err(vp, "%s: Failed to pin user pages error %i\n",
> +			       __func__, ret);
> +			goto unpin_pages;
> +		}
> +	}
> +
> +	if (is_set)
> +		ret = hv_call_set_vp_state(vp->vp_index,
> +					   vp->vp_partition->pt_id,
> +					   state_data, page_count, pages,
> +					   0, NULL);
> +	else
> +		ret = hv_call_get_vp_state(vp->vp_index,
> +					   vp->vp_partition->pt_id,
> +					   state_data, page_count, pages,
> +					   NULL);
> +
> +unpin_pages:
> +	unpin_user_pages(pages, completed);
> +	kfree(pages);
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
> +			    struct mshv_get_set_vp_state __user *user_args,
> +			    bool is_set)
> +{
> +	struct mshv_get_set_vp_state args;
> +	long ret = 0;
> +	union hv_output_get_vp_state vp_state;
> +	u32 data_sz;
> +	struct hv_vp_state_data state_data = {};
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.type >= MSHV_VP_STATE_COUNT || mshv_field_nonzero(args, rsvd) ||
> +	    !args.buf_sz || !PAGE_ALIGNED(args.buf_sz) ||
> +	    !PAGE_ALIGNED(args.buf_ptr))
> +		return -EINVAL;
> +
> +	if (!access_ok((void __user *)args.buf_ptr, args.buf_sz))
> +		return -EFAULT;
> +
> +	switch (args.type) {
> +	case MSHV_VP_STATE_LAPIC:
> +		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
> +		data_sz = HV_HYP_PAGE_SIZE;
> +		break;
> +	case MSHV_VP_STATE_XSAVE:

Just FYI, you can put a semicolon after the colon on the above line, which
adds a null statement, and then the C compiler will accept the definition
of local variable data_sz_64 without needing the odd-looking braces. 

See https://stackoverflow.com/questions/92396/why-cant-variables-be-declared-in-a-switch-statement/19830820

I learn something new every day! :-)
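
For illustration, here is a minimal user-space sketch of the null-statement
trick. The enum, the helper query_xsave_size(), and the sizes are made up
for the example and are not from the patch; whether this form is acceptable
kernel style is a separate question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state types, for illustration only. */
enum state_type { STATE_LAPIC, STATE_XSAVE };

/* Hypothetical stand-in for a property query that yields a 64-bit size. */
static uint64_t query_xsave_size(void)
{
	return 0x1000;
}

static uint32_t state_size(enum state_type type)
{
	uint32_t data_sz = 0;

	switch (type) {
	case STATE_LAPIC:
		data_sz = 4096;
		break;
	case STATE_XSAVE:;	/* null statement: the label now precedes a statement */
		uint64_t data_sz_64 = query_xsave_size();

		assert(data_sz_64 <= UINT32_MAX);
		data_sz = (uint32_t)data_sz_64;
		break;
	}

	return data_sz;
}
```

The declaration of data_sz_64 then sits directly in the switch body, with no
extra braces around the case.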

> +	{
> +		u64 data_sz_64;
> +
> +		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
> +						     HV_PARTITION_PROPERTY_XSAVE_STATES,
> +						     &state_data.xsave.states.as_uint64);
> +		if (ret)
> +			return ret;
> +
> +		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
> +						     HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE,
> +						     &data_sz_64);
> +		if (ret)
> +			return ret;
> +
> +		data_sz = (u32)data_sz_64;
> +		state_data.xsave.flags = 0;
> +		/* Always request legacy states */
> +		state_data.xsave.states.legacy_x87 = 1;
> +		state_data.xsave.states.legacy_sse = 1;
> +		state_data.type = HV_GET_SET_VP_STATE_XSAVE;
> +		break;
> +	}
> +	case MSHV_VP_STATE_SIMP:
> +		state_data.type = HV_GET_SET_VP_STATE_SIM_PAGE;
> +		data_sz = HV_HYP_PAGE_SIZE;
> +		break;
> +	case MSHV_VP_STATE_SIEFP:
> +		state_data.type = HV_GET_SET_VP_STATE_SIEF_PAGE;
> +		data_sz = HV_HYP_PAGE_SIZE;
> +		break;
> +	case MSHV_VP_STATE_SYNTHETIC_TIMERS:
> +		state_data.type = HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS;
> +		data_sz = sizeof(vp_state.synthetic_timers_state);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (copy_to_user(&user_args->buf_sz, &data_sz, sizeof(user_args->buf_sz)))
> +		return -EFAULT;
> +
> +	if (data_sz > args.buf_sz)
> +		return -EINVAL;
> +
> +	/* If the data is transmitted via pfns, delegate to helper */
> +	if (state_data.type & HV_GET_SET_VP_STATE_TYPE_PFN) {
> +		unsigned long user_pfn = PFN_DOWN(args.buf_ptr);
> +		size_t page_count = PFN_DOWN(args.buf_sz);
> +
> +		return mshv_vp_ioctl_get_set_state_pfn(vp, state_data, user_pfn,
> +						       page_count, is_set);
> +	}
> +
> +	/* Paranoia check - this shouldn't happen! */
> +	if (data_sz > sizeof(vp_state)) {
> +		vp_err(vp, "Invalid vp state data size!\n");
> +		return -EINVAL;
> +	}

I don't understand the above check.  sizeof(vp_state) is relatively small since
it is effectively sizeof(hv_synthetic_timers_state), which is 200 bytes if I've
done the arithmetic correctly. But data_sz could be a full page (4096 bytes)
for the LAPIC, SIMP, and SIEFP cases, and the check would cause an error to
be returned.

> +
> +	if (is_set) {
> +		if (copy_from_user(&vp_state, (__user void *)args.buf_ptr, data_sz))
> +			return -EFAULT;
> +
> +		return hv_call_set_vp_state(vp->vp_index,
> +					    vp->vp_partition->pt_id,
> +					    state_data, 0, NULL,
> +					    sizeof(vp_state), (u8 *)&vp_state);

This is one of the cases where data from user space gets passed directly to
the hypercall. So user space is responsible for ensuring that reserved fields
are zeroed and for otherwise ensuring a proper hypercall input. I just
wonder if user space really does this correctly.

> +	}
> +
> +	ret = hv_call_get_vp_state(vp->vp_index, vp->vp_partition->pt_id,
> +				   state_data, 0, NULL, &vp_state);
> +	if (ret)
> +		return ret;
> +
> +	if (copy_to_user((void __user *)args.buf_ptr, &vp_state, data_sz))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static long
> +mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> +{
> +	struct mshv_vp *vp = filp->private_data;
> +	long r = -ENOTTY;
> +
> +	if (mutex_lock_killable(&vp->vp_mutex))
> +		return -EINTR;
> +
> +	switch (ioctl) {
> +	case MSHV_RUN_VP:
> +		r = mshv_vp_ioctl_run_vp(vp, (void __user *)arg);
> +		break;
> +	case MSHV_GET_VP_STATE:
> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, false);
> +		break;
> +	case MSHV_SET_VP_STATE:
> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
> +		break;
> +	case MSHV_ROOT_HVCALL:
> +		r = mshv_ioctl_passthru_hvcall(vp->vp_partition, false,
> +					       (void __user *)arg);
> +		break;
> +	default:
> +		vp_warn(vp, "Invalid ioctl: %#x\n", ioctl);
> +		break;
> +	}
> +	mutex_unlock(&vp->vp_mutex);
> +
> +	return r;
> +}
> +
> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
> +{
> +	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
> +
> +	switch (vmf->vma->vm_pgoff) {
> +	case MSHV_VP_MMAP_OFFSET_REGISTERS:
> +		vmf->page = virt_to_page(vp->vp_register_page);
> +		break;
> +	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
> +		vmf->page = virt_to_page(vp->vp_intercept_msg_page);
> +		break;
> +	case MSHV_VP_MMAP_OFFSET_GHCB:
> +		if (is_ghcb_mapping_available())
> +			vmf->page = virt_to_page(vp->vp_ghcb_page);
> +		break;

If there's no GHCB mapping available, execution just continues with
vmf->page not set. Won't the later get_page() call fail? Perhaps this
should fail if there's no GHCB mapping available. Or maybe there's
more about how this works that I'm ignorant of. :-)

> +	default:
> +		return -EINVAL;
> +	}
> +
> +	get_page(vmf->page);
> +
> +	return 0;
> +}
> +
> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct mshv_vp *vp = file->private_data;
> +
> +	switch (vma->vm_pgoff) {
> +	case MSHV_VP_MMAP_OFFSET_REGISTERS:
> +		if (!vp->vp_register_page)
> +			return -ENODEV;
> +		break;
> +	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
> +		if (!vp->vp_intercept_msg_page)
> +			return -ENODEV;
> +		break;
> +	case MSHV_VP_MMAP_OFFSET_GHCB:
> +		if (is_ghcb_mapping_available() && !vp->vp_ghcb_page)
> +			return -ENODEV;
> +		break;

Again, if no GHCB mapping is available, should this return success?

> +	default:
> +		return -EINVAL;
> +	}
> +
> +	vma->vm_ops = &mshv_vp_vm_ops;
> +	return 0;
> +}
> +
> +static int
> +mshv_vp_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_vp *vp = filp->private_data;
> +
> +	/* Rest of VP cleanup happens in destroy_partition() */
> +	mshv_partition_put(vp->vp_partition);
> +	return 0;
> +}
> +
> +static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index)
> +{
> +	union hv_stats_object_identity identity = {
> +		.vp.partition_id = partition_id,
> +		.vp.vp_index = vp_index,
> +	};
> +
> +	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
> +	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
> +
> +	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
> +	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
> +}
> +
> +static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
> +			     void *stats_pages[])
> +{
> +	union hv_stats_object_identity identity = {
> +		.vp.partition_id = partition_id,
> +		.vp.vp_index = vp_index,
> +	};
> +	int err;
> +
> +	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
> +	err = hv_call_map_stat_page(HV_STATS_OBJECT_VP, &identity,
> +				    &stats_pages[HV_STATS_AREA_SELF]);
> +	if (err)
> +		return err;
> +
> +	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
> +	err = hv_call_map_stat_page(HV_STATS_OBJECT_VP, &identity,
> +				    &stats_pages[HV_STATS_AREA_PARENT]);
> +	if (err)
> +		goto unmap_self;
> +
> +	return 0;
> +
> +unmap_self:
> +	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
> +	hv_call_unmap_stat_page(HV_STATS_OBJECT_VP, &identity);
> +	return err;
> +}
> +
> +static long
> +mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
> +			       void __user *arg)
> +{
> +	struct mshv_create_vp args;
> +	struct mshv_vp *vp;
> +	struct page *intercept_message_page, *register_page, *ghcb_page;
> +	void *stats_pages[2];
> +	long ret;
> +	union hv_input_vtl input_vtl;
> +
> +	if (copy_from_user(&args, arg, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.vp_index >= MSHV_MAX_VPS)
> +		return -EINVAL;
> +
> +	if (partition->pt_vp_array[args.vp_index])
> +		return -EEXIST;
> +
> +	ret = hv_call_create_vp(NUMA_NO_NODE, partition->pt_id, args.vp_index,
> +				0 /* Only valid for root partition VPs */);
> +	if (ret)
> +		return ret;
> +
> +	input_vtl.as_uint8 = 0;

I see eight occurrences in this source code file where the above statement
occurs and there is no further modification. Perhaps declare a static
variable that is initialized properly, and use it as the input parameter to the
various functions.  A second static variable could have the use_target_vtl = 1
setting that is needed in three places.
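
Something along these lines is what I have in mind. This is only a sketch:
the union layout below is my assumption of what union hv_input_vtl looks
like, and the names input_vtl_zero, input_vtl_normal, NORMAL_VTL, and
vtl_raw() are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration; check the real union hv_input_vtl. */
union input_vtl {
	uint8_t as_uint8;
	struct {
		uint8_t target_vtl : 4;
		uint8_t use_target_vtl : 1;
		uint8_t reserved_z : 3;
	};
};

static_assert(sizeof(union input_vtl) == 1, "input VTL must stay one byte");

#define NORMAL_VTL 0	/* hypothetical stand-in for HV_NORMAL_VTL */

/* Initialized once; every caller that needs "no target VTL" passes this. */
static const union input_vtl input_vtl_zero;

/* Initialized once; used wherever use_target_vtl = 1 is required. */
static const union input_vtl input_vtl_normal = {
	.use_target_vtl = 1,
	.target_vtl = NORMAL_VTL,
};

/* Hypothetical callee taking the VTL by value, like the calls in the patch. */
static uint8_t vtl_raw(union input_vtl vtl)
{
	return vtl.as_uint8;
}
```

Callers would then pass input_vtl_zero or input_vtl_normal directly, and the
repeated assignments to a local variable go away.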

> +	ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
> +					HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
> +					input_vtl,
> +					&intercept_message_page);
> +	if (ret)
> +		goto destroy_vp;
> +
> +	if (!mshv_partition_encrypted(partition)) {
> +		input_vtl.as_uint8 = 0;
> +		ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
> +						HV_VP_STATE_PAGE_REGISTERS,
> +						input_vtl,
> +						&register_page);
> +		if (ret)
> +			goto unmap_intercept_message_page;
> +	}
> +
> +	if (mshv_partition_encrypted(partition) &&
> +	    is_ghcb_mapping_available()) {
> +		input_vtl.as_uint8 = 0;
> +		input_vtl.use_target_vtl = 1;
> +		input_vtl.target_vtl = HV_NORMAL_VTL;
> +		ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
> +						HV_VP_STATE_PAGE_GHCB,
> +						input_vtl,
> +						&ghcb_page);
> +		if (ret)
> +			goto unmap_register_page;
> +	}
> +
> +	if (hv_parent_partition()) {
> +		ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
> +					stats_pages);
> +		if (ret)
> +			goto unmap_ghcb_page;
> +	}
> +
> +	vp = kzalloc(sizeof(*vp), GFP_KERNEL);
> +	if (!vp)
> +		goto unmap_stats_pages;
> +
> +	vp->vp_partition = mshv_partition_get(partition);
> +	if (!vp->vp_partition) {
> +		ret = -EBADF;
> +		goto free_vp;
> +	}
> +
> +	mutex_init(&vp->vp_mutex);
> +	init_waitqueue_head(&vp->run.vp_suspend_queue);
> +	atomic64_set(&vp->run.vp_signaled_count, 0);
> +
> +	vp->vp_index = args.vp_index;
> +	vp->vp_intercept_msg_page = page_to_virt(intercept_message_page);
> +	if (!mshv_partition_encrypted(partition))
> +		vp->vp_register_page = page_to_virt(register_page);
> +
> +	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
> +		vp->vp_ghcb_page = page_to_virt(ghcb_page);
> +
> +	if (hv_parent_partition())
> +		memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
> +
> +	/*
> +	 * Keep anon_inode_getfd last: it installs fd in the file struct and
> +	 * thus makes the state accessible in user space.
> +	 */
> +	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
> +			       O_RDWR | O_CLOEXEC);
> +	if (ret < 0)
> +		goto put_partition;
> +
> +	/* already exclusive with the partition mutex for all ioctls */
> +	partition->pt_vp_count++;
> +	partition->pt_vp_array[args.vp_index] = vp;
> +
> +	return ret;
> +
> +put_partition:
> +	mshv_partition_put(partition);
> +free_vp:
> +	kfree(vp);
> +unmap_stats_pages:
> +	if (hv_parent_partition())
> +		mshv_vp_stats_unmap(partition->pt_id, args.vp_index);
> +unmap_ghcb_page:
> +	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available()) {
> +		input_vtl.as_uint8 = 0;
> +		input_vtl.use_target_vtl = 1;
> +		input_vtl.target_vtl = HV_NORMAL_VTL;
> +
> +		hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index,
> +					    HV_VP_STATE_PAGE_GHCB, input_vtl);
> +	}
> +unmap_register_page:
> +	if (!mshv_partition_encrypted(partition)) {
> +		input_vtl.as_uint8 = 0;
> +
> +		hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index,
> +					    HV_VP_STATE_PAGE_REGISTERS,
> +					    input_vtl);
> +	}
> +unmap_intercept_message_page:
> +	input_vtl.as_uint8 = 0;
> +	hv_call_unmap_vp_state_page(partition->pt_id, args.vp_index,
> +				    HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
> +				    input_vtl);
> +destroy_vp:
> +	hv_call_delete_vp(partition->pt_id, args.vp_index);
> +	return ret;
> +}
> +
> +static int mshv_init_async_handler(struct mshv_partition *partition)
> +{
> +	if (completion_done(&partition->async_hypercall)) {
> +		pt_err(partition,
> +		       "Cannot issue another async hypercall, while another one in progress!\n");

Using the word "another" twice in the error message is redundant.  Perhaps

	"Cannot issue async hypercall while another one is in progress!"

> +		return -EPERM;
> +	}
> +
> +	reinit_completion(&partition->async_hypercall);
> +	return 0;
> +}
> +
> +static void mshv_async_hvcall_handler(void *data, u64 *status)
> +{
> +	struct mshv_partition *partition = data;
> +
> +	wait_for_completion(&partition->async_hypercall);
> +	pt_dbg(partition, "Async hypercall completed!\n");
> +
> +	*status = partition->async_hypercall_status;
> +}
> +
> +static int
> +mshv_partition_region_share(struct mshv_mem_region *region)
> +{
> +	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
> +
> +	if (region->flags.large_pages)
> +		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> +
> +	return hv_call_modify_spa_host_access(region->partition->pt_id,
> +			region->pages, region->nr_pages,
> +			HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
> +			flags, true);
> +}
> +
> +static int
> +mshv_partition_region_unshare(struct mshv_mem_region *region)
> +{
> +	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
> +
> +	if (region->flags.large_pages)
> +		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> +
> +	return hv_call_modify_spa_host_access(region->partition->pt_id,
> +			region->pages, region->nr_pages,
> +			0,
> +			flags, false);
> +}
> +
> +static int
> +mshv_region_remap_pages(struct mshv_mem_region *region, u32 map_flags,
> +			u64 page_offset, u64 page_count)
> +{
> +	if (page_offset + page_count > region->nr_pages)
> +		return -EINVAL;
> +
> +	if (region->flags.large_pages)
> +		map_flags |= HV_MAP_GPA_LARGE_PAGE;
> +
> +	/* ask the hypervisor to map guest ram */
> +	return hv_call_map_gpa_pages(region->partition->pt_id,
> +				     region->start_gfn + page_offset,
> +				     page_count, map_flags,
> +				     region->pages + page_offset);
> +}
> +
> +static int
> +mshv_region_map(struct mshv_mem_region *region)
> +{
> +	u32 map_flags = region->hv_map_flags;
> +
> +	return mshv_region_remap_pages(region, map_flags,
> +				       0, region->nr_pages);
> +}
> +
> +static void
> +mshv_region_evict_pages(struct mshv_mem_region *region,
> +			u64 page_offset, u64 page_count)
> +{
> +	if (region->flags.range_pinned)
> +		unpin_user_pages(region->pages + page_offset, page_count);
> +
> +	memset(region->pages + page_offset, 0,
> +	       page_count * sizeof(struct page *));
> +}
> +
> +static void
> +mshv_region_evict(struct mshv_mem_region *region)
> +{
> +	mshv_region_evict_pages(region, 0, region->nr_pages);
> +}
> +
> +static int
> +mshv_region_populate_pages(struct mshv_mem_region *region,
> +			   u64 page_offset, u64 page_count)
> +{
> +	u64 done_count, nr_pages;
> +	struct page **pages;
> +	__u64 userspace_addr;
> +	int ret;
> +
> +	if (page_offset + page_count > region->nr_pages)
> +		return -EINVAL;
> +
> +	for (done_count = 0; done_count < page_count; done_count += ret) {
> +		pages = region->pages + page_offset + done_count;
> +		userspace_addr = region->start_uaddr +
> +				(page_offset + done_count) *
> +				HV_HYP_PAGE_SIZE;
> +		nr_pages = min(page_count - done_count,
> +			       MSHV_PIN_PAGES_BATCH_SIZE);
> +
> +		/*
> +		 * Pinning assuming 4k pages works for large pages too.
> +		 * All page structs within the large page are returned.
> +		 *
> +		 * Pin requests are batched because pin_user_pages_fast
> +		 * with the FOLL_LONGTERM flag does a large temporary
> +		 * allocation of contiguous memory.
> +		 */
> +		if (region->flags.range_pinned)
> +			ret = pin_user_pages_fast(userspace_addr,
> +						  nr_pages,
> +						  FOLL_WRITE | FOLL_LONGTERM,
> +						  pages);
> +		else
> +			ret = -EOPNOTSUPP;
> +
> +		if (ret < 0)
> +			goto release_pages;
> +	}
> +
> +	if (PageHuge(region->pages[page_offset]))
> +		region->flags.large_pages = true;
> +
> +	return 0;
> +
> +release_pages:
> +	mshv_region_evict_pages(region, page_offset, done_count);
> +	return ret;
> +}
> +
> +static int
> +mshv_region_populate(struct mshv_mem_region *region)
> +{
> +	return mshv_region_populate_pages(region, 0, region->nr_pages);
> +}
> +
> +static struct mshv_mem_region *
> +mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> +{
> +	struct mshv_mem_region *region;
> +
> +	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
> +		if (gfn >= region->start_gfn &&
> +		    gfn < region->start_gfn + region->nr_pages)
> +			return region;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct mshv_mem_region *
> +mshv_partition_region_by_uaddr(struct mshv_partition *partition, u64 uaddr)
> +{
> +	struct mshv_mem_region *region;
> +
> +	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
> +		if (uaddr >= region->start_uaddr &&
> +		    uaddr < region->start_uaddr +
> +			    (region->nr_pages << HV_HYP_PAGE_SHIFT))
> +			return region;
> +	}
> +
> +	return NULL;
> +}
> +
> +/*
> + * NB: caller checks and makes sure mem->size is page aligned
> + * Returns: 0 with regionpp updated on success, or -errno
> + */
> +static int mshv_partition_create_region(struct mshv_partition *partition,
> +					struct mshv_user_mem_region *mem,
> +					struct mshv_mem_region **regionpp,
> +					bool is_mmio)
> +{
> +	struct mshv_mem_region *region;
> +	u64 nr_pages = HVPFN_DOWN(mem->size);
> +
> +	/* Reject overlapping regions */
> +	if (mshv_partition_region_by_gfn(partition, mem->guest_pfn) ||
> +	    mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1) ||
> +	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr) ||
> +	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr + mem->size - 1))
> +		return -EEXIST;

Having to fully walk the partition region list four times for the above checks
isn't the most efficient approach, but I'm guessing that creating a region isn't
really a hot path so it doesn't matter. And I don't know how long the region list
typically is.
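
If it ever did matter, a single walk could test both ranges at once. A rough
user-space sketch follows, with a simplified region struct and an array in
place of the hlist; all names here are invented for the example, and
arithmetic overflow is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified region descriptor; not the driver's struct. */
struct region {
	uint64_t start_gfn;
	uint64_t start_uaddr;
	uint64_t nr_pages;
};

#define PAGE_SHIFT_4K 12

/* True if half-open ranges [a, a + alen) and [b, b + blen) intersect. */
static bool ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
	return a < b + blen && b < a + alen;
}

/* One walk over the regions, testing both the gfn and the uaddr ranges. */
static bool region_conflicts(const struct region *regions, size_t count,
			     const struct region *new_region)
{
	for (size_t i = 0; i < count; i++) {
		const struct region *r = &regions[i];

		if (ranges_overlap(r->start_gfn, r->nr_pages,
				   new_region->start_gfn,
				   new_region->nr_pages) ||
		    ranges_overlap(r->start_uaddr,
				   r->nr_pages << PAGE_SHIFT_4K,
				   new_region->start_uaddr,
				   new_region->nr_pages << PAGE_SHIFT_4K))
			return true;
	}

	return false;
}
```

A range test like this also flags a new region that fully encloses an
existing one, which lookups of only the first and last addresses would not.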

> +
> +	region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
> +	if (!region)
> +		return -ENOMEM;
> +
> +	region->nr_pages = nr_pages;
> +	region->start_gfn = mem->guest_pfn;
> +	region->start_uaddr = mem->userspace_addr;
> +	region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
> +	if (mem->flags & BIT(MSHV_SET_MEM_BIT_WRITABLE))
> +		region->hv_map_flags |= HV_MAP_GPA_WRITABLE;
> +	if (mem->flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
> +		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> +
> +	/* Note: large_pages flag populated when we pin the pages */
> +	if (!is_mmio)
> +		region->flags.range_pinned = true;
> +
> +	region->partition = partition;
> +
> +	*regionpp = region;
> +
> +	return 0;
> +}
> +
> +/*
> + * Map guest ram. if snp, make sure to release that from the host first
> + * Side Effects: In case of failure, pages are unpinned when feasible.
> + */
> +static int
> +mshv_partition_mem_region_map(struct mshv_mem_region *region)
> +{
> +	struct mshv_partition *partition = region->partition;
> +	int ret;
> +
> +	ret = mshv_region_populate(region);
> +	if (ret) {
> +		pt_err(partition, "Failed to populate memory region: %d\n",
> +		       ret);
> +		goto err_out;
> +	}
> +
> +	/*
> +	 * For an SNP partition it is a requirement that for every memory region
> +	 * that we are going to map for this partition we should make sure that
> +	 * host access to that region is released. This is ensured by doing an
> +	 * additional hypercall which will update the SLAT to release host
> +	 * access to guest memory regions.
> +	 */
> +	if (mshv_partition_encrypted(partition)) {
> +		ret = mshv_partition_region_unshare(region);
> +		if (ret) {
> +			pt_err(partition,
> +			       "Failed to unshare memory region (guest_pfn: %llu): %d\n",
> +			       region->start_gfn, ret);
> +			goto evict_region;
> +		}
> +	}
> +
> +	ret = mshv_region_map(region);
> +	if (ret && mshv_partition_encrypted(partition)) {
> +		int shrc;
> +
> +		shrc = mshv_partition_region_share(region);
> +		if (!shrc)
> +			goto evict_region;
> +
> +		pt_err(partition,
> +		       "Failed to share memory region (guest_pfn: %llu): %d\n",
> +		       region->start_gfn, shrc);
> +		/*
> +		 * Don't unpin if marking shared failed because pages are no
> +		 * longer mapped in the host, ie root, anymore.
> +		 */
> +		goto err_out;
> +	}
> +
> +	return 0;
> +
> +evict_region:
> +	mshv_region_evict(region);
> +err_out:
> +	return ret;
> +}
> +
> +/*
> + * This maps two things: guest RAM and for pci passthru mmio space.
> + *
> + * mmio:
> + *  - vfio overloads vm_pgoff to store the mmio start pfn/spa.
> + *  - Two things need to happen for mapping mmio range:
> + *	1. mapped in the uaddr so VMM can access it.
> + *	2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
> + *
> + *   This function takes care of the second. The first one is managed by vfio,
> + *   and hence is taken care of via vfio_pci_mmap_fault().
> + */
> +static long
> +mshv_map_user_memory(struct mshv_partition *partition,
> +		     struct mshv_user_mem_region mem)
> +{
> +	struct mshv_mem_region *region;
> +	struct vm_area_struct *vma;
> +	bool is_mmio;
> +	ulong mmio_pfn;
> +	long ret;
> +
> +	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
> +	    !access_ok((const void *)mem.userspace_addr, mem.size))
> +		return -EINVAL;
> +
> +	mmap_read_lock(current->mm);
> +	vma = vma_lookup(current->mm, mem.userspace_addr);
> +	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> +	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
> +	mmap_read_unlock(current->mm);
> +
> +	if (!vma)
> +		return -EINVAL;
> +
> +	ret = mshv_partition_create_region(partition, &mem, &region,
> +					   is_mmio);
> +	if (ret)
> +		return ret;
> +
> +	if (is_mmio)
> +		ret = hv_call_map_mmio_pages(partition->pt_id, mem.guest_pfn,
> +					     mmio_pfn, HVPFN_DOWN(mem.size));
> +	else
> +		ret = mshv_partition_mem_region_map(region);
> +
> +	if (ret)
> +		goto errout;
> +
> +	/* Install the new region */
> +	hlist_add_head(&region->hnode, &partition->pt_mem_regions);
> +
> +	return 0;
> +
> +errout:
> +	vfree(region);
> +	return ret;
> +}
> +
> +/* Called for unmapping both the guest ram and the mmio space */
> +static long
> +mshv_unmap_user_memory(struct mshv_partition *partition,
> +		       struct mshv_user_mem_region mem)
> +{
> +	struct mshv_mem_region *region;
> +	u32 unmap_flags = 0;
> +
> +	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
> +		return -EINVAL;
> +
> +	if (hlist_empty(&partition->pt_mem_regions))
> +		return -EINVAL;

Isn't the above check redundant, given the lookup by gfn that is
done immediately below?

> +
> +	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
> +	if (!region)
> +		return -EINVAL;
> +
> +	/* Paranoia check */
> +	if (region->start_uaddr != mem.userspace_addr ||
> +	    region->start_gfn != mem.guest_pfn ||
> +	    region->nr_pages != HVPFN_DOWN(mem.size))
> +		return -EINVAL;
> +
> +	hlist_del(&region->hnode);
> +
> +	if (region->flags.large_pages)
> +		unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> +
> +	/* ignore unmap failures and continue as process may be exiting */
> +	hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> +				region->nr_pages, unmap_flags);
> +
> +	mshv_region_evict(region);
> +
> +	vfree(region);
> +	return 0;
> +}
> +
> +static long
> +mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
> +				struct mshv_user_mem_region __user *user_mem)
> +{
> +	struct mshv_user_mem_region mem;
> +
> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> +		return -EFAULT;
> +
> +	if (!mem.size ||
> +	    !PAGE_ALIGNED(mem.size) ||
> +	    !PAGE_ALIGNED(mem.userspace_addr) ||
> +	    (mem.flags & ~MSHV_SET_MEM_FLAGS_MASK) ||
> +	    mshv_field_nonzero(mem, rsvd))
> +		return -EINVAL;
> +
> +	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
> +		return mshv_unmap_user_memory(partition, mem);
> +
> +	return mshv_map_user_memory(partition, mem);
> +}
> +
> +static long
> +mshv_partition_ioctl_ioeventfd(struct mshv_partition *partition,
> +			       void __user *user_args)
> +{
> +	struct mshv_user_ioeventfd args;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	return mshv_set_unset_ioeventfd(partition, &args);
> +}
> +
> +static long
> +mshv_partition_ioctl_irqfd(struct mshv_partition *partition,
> +			   void __user *user_args)
> +{
> +	struct mshv_user_irqfd args;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	return mshv_set_unset_irqfd(partition, &args);
> +}
> +
> +static long
> +mshv_partition_ioctl_get_gpap_access_bitmap(struct mshv_partition *partition,
> +					    void __user *user_args)
> +{
> +	struct mshv_gpap_access_bitmap args;
> +	union hv_gpa_page_access_state *states;
> +	long ret, i;
> +	union hv_gpa_page_access_state_flags hv_flags = {};
> +	u8 hv_type_mask;
> +	ulong bitmap_buf_sz, states_buf_sz;
> +	int written = 0;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.access_type >= MSHV_GPAP_ACCESS_TYPE_COUNT ||
> +	    args.access_op >= MSHV_GPAP_ACCESS_OP_COUNT ||
> +	    mshv_field_nonzero(args, rsvd) || !args.page_count ||
> +	    !args.bitmap_ptr)
> +		return -EINVAL;
> +
> +	if (check_mul_overflow(args.page_count, sizeof(*states), &states_buf_sz))
> +		return -E2BIG;
> +
> +	/* Num bytes needed to store bitmap; one bit per page rounded up */
> +	bitmap_buf_sz = DIV_ROUND_UP(args.page_count, 8);
> +
> +	/* Sanity check */
> +	if (bitmap_buf_sz > states_buf_sz)
> +		return -EBADFD;
> +
> +	switch (args.access_type) {
> +	case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
> +		hv_type_mask = 1;
> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
> +			hv_flags.clear_accessed = 1;
> +			/* not accessed implies not dirty */
> +			hv_flags.clear_dirty = 1;
> +		} else { // MSHV_GPAP_ACCESS_OP_SET

Avoid C++ style comments.

> +			hv_flags.set_accessed = 1;
> +		}
> +		break;
> +	case MSHV_GPAP_ACCESS_TYPE_DIRTY:
> +		hv_type_mask = 2;
> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
> +			hv_flags.clear_dirty = 1;
> +		} else { // MSHV_GPAP_ACCESS_OP_SET

Same here.

> +			hv_flags.set_dirty = 1;
> +			/* dirty implies accessed */
> +			hv_flags.set_accessed = 1;
> +		}
> +		break;
> +	}
> +
> +	states = vzalloc(states_buf_sz);
> +	if (!states)
> +		return -ENOMEM;
> +
> +	ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
> +					    args.gpap_base, hv_flags, &written,
> +					    states);
> +	if (ret)
> +		goto free_return;
> +
> +	/*
> +	 * Overwrite states buffer with bitmap - the bits in hv_type_mask
> +	 * correspond to bitfields in hv_gpa_page_access_state
> +	 */
> +	for (i = 0; i < written; ++i)
> +		assign_bit(i, (ulong *)states,

Why the cast to ulong *?  I think this argument to assign_bit() is void *, in
which case the cast wouldn't be needed.

Also, assign_bit() does atomic bit operations. Doing that in a loop like
this one will really hammer the memory bus with atomic read-modify-write
cycles. Use __assign_bit() instead, which is non-atomic. You don't need
atomicity here because no other threads are modifying the bit array.
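
To illustrate what "non-atomic" means here, below is a minimal userspace
sketch. sketch_assign_bit() and SKETCH_BITS_PER_LONG are hypothetical names
made up for this mail, not kernel APIs; the helper only mimics the plain
read-modify-write I'd expect from __assign_bit(), with no locked instruction
or barrier involved.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constant for this sketch: number of bits in one bitmap word. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical userspace stand-in for a non-atomic bit assignment.
 * It is a plain load, modify, store on the word that holds bit nr,
 * so it is only safe when no other thread touches the same bitmap.
 */
static void sketch_assign_bit(size_t nr, unsigned long *addr, bool value)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long *word;

	assert(addr != NULL);
	word = addr + nr / SKETCH_BITS_PER_LONG;

	if (value)
		*word |= mask;	/* set the bit */
	else
		*word &= ~mask;	/* clear the bit */
}
```

Since the "states" buffer is private to this ioctl call until it is copied
out, a helper of this shape should be all the loop needs.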

> +			   states[i].as_uint8 & hv_type_mask);

OK, so the starting contents of "states" is an array of bytes. The ending
contents is an array of bits. This works because every bit in the ending
bit array is set to either 0 or 1. Overlap occurs on the first iteration
where the code reads the 0th byte, and writes the 0th bit, which is part of
the 0th byte. The second iteration reads the 1st byte, and writes the 1st bit,
which doesn't overlap, and there's no overlap from then on.

Suppose "written" is not a multiple of 8. The last byte of "states" as an
array of bits will have some bits that have not been set to either 0 or 1 and
might be leftover garbage from when "states" was an array of bytes. That
garbage will get copied to user space. Is that OK? Even if user space knows
enough to ignore those bits, it seems a little dubious to be copying even
a few bits of garbage to user space.

Some comments might help here.
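
For what it's worth, here is a rough userspace sketch of the in-place
conversion with the tail bits cleared. It is only an illustration under my
own assumptions: pack_states_in_place() is a hypothetical name, and it
addresses bits through bytes rather than unsigned longs, which matches the
layout described above on a little-endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack one flag per byte into one flag per bit,
 * reusing the same buffer. Byte i is always read before bit i is written,
 * and bit i lives in byte i / 8, which is never ahead of byte i, so no
 * unread input is overwritten.
 * Returns the number of bytes the resulting bitmap occupies.
 */
static size_t pack_states_in_place(uint8_t *buf, size_t count, uint8_t mask)
{
	size_t i, nbytes = (count + 7) / 8;

	assert(buf != NULL);

	for (i = 0; i < count; i++) {
		uint8_t bit = (uint8_t)(1u << (i % 8));
		int set = (buf[i] & mask) != 0;	/* read byte i first */

		if (set)
			buf[i / 8] |= bit;
		else
			buf[i / 8] &= (uint8_t)~bit;
	}

	/*
	 * If count is not a multiple of 8, the high bits of the last bitmap
	 * byte still hold leftover state data. Clear them so nothing stale
	 * is exposed when the bitmap is copied out.
	 */
	if (count % 8)
		buf[nbytes - 1] &= (uint8_t)((1u << (count % 8)) - 1);

	return nbytes;
}
```

The final masking step is the part that is missing from the patch as far as
I can tell; whether to do it this way or to build the bitmap in a separate
zeroed buffer is a judgment call.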

> +
> +	args.page_count = written;
> +
> +	if (copy_to_user(user_args, &args, sizeof(args))) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +	if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
> +		ret = -EFAULT;
> +
> +free_return:
> +	vfree(states);
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl_set_msi_routing(struct mshv_partition *partition,
> +				     void __user *user_args)
> +{
> +	struct mshv_user_irq_entry *entries = NULL;
> +	struct mshv_user_irq_table args;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.nr > MSHV_MAX_GUEST_IRQS ||
> +	    mshv_field_nonzero(args, rsvd))
> +		return -EINVAL;
> +
> +	if (args.nr) {
> +		struct mshv_user_irq_table __user *urouting = user_args;
> +
> +		entries = vmemdup_user(urouting->entries,
> +				       array_size(sizeof(*entries),
> +						  args.nr));
> +		if (IS_ERR(entries))
> +			return PTR_ERR(entries);
> +	}
> +	ret = mshv_update_routing_table(partition, entries, args.nr);
> +	kvfree(entries);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl_initialize(struct mshv_partition *partition)
> +{
> +	long ret;
> +
> +	if (partition->pt_initialized)
> +		return 0;
> +
> +	ret = hv_call_initialize_partition(partition->pt_id);
> +	if (ret)
> +		goto withdraw_mem;
> +
> +	partition->pt_initialized = true;
> +
> +	return 0;
> +
> +withdraw_mem:
> +	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> +{
> +	struct mshv_partition *partition = filp->private_data;
> +	long ret;
> +	void __user *uarg = (void __user *)arg;
> +
> +	if (mutex_lock_killable(&partition->pt_mutex))
> +		return -EINTR;
> +
> +	switch (ioctl) {
> +	case MSHV_INITIALIZE_PARTITION:
> +		ret = mshv_partition_ioctl_initialize(partition);
> +		break;
> +	case MSHV_SET_GUEST_MEMORY:
> +		ret = mshv_partition_ioctl_set_memory(partition, uarg);
> +		break;
> +	case MSHV_CREATE_VP:
> +		ret = mshv_partition_ioctl_create_vp(partition, uarg);
> +		break;
> +	case MSHV_IRQFD:
> +		ret = mshv_partition_ioctl_irqfd(partition, uarg);
> +		break;
> +	case MSHV_IOEVENTFD:
> +		ret = mshv_partition_ioctl_ioeventfd(partition, uarg);
> +		break;
> +	case MSHV_SET_MSI_ROUTING:
> +		ret = mshv_partition_ioctl_set_msi_routing(partition, uarg);
> +		break;
> +	case MSHV_GET_GPAP_ACCESS_BITMAP:
> +		ret = mshv_partition_ioctl_get_gpap_access_bitmap(partition,
> +								  uarg);
> +		break;
> +	case MSHV_ROOT_HVCALL:
> +		ret = mshv_ioctl_passthru_hvcall(partition, true, uarg);
> +		break;
> +	default:
> +		ret = -ENOTTY;
> +	}
> +
> +	mutex_unlock(&partition->pt_mutex);
> +	return ret;
> +}
> +
> +static int
> +disable_vp_dispatch(struct mshv_vp *vp)
> +{
> +	int ret;
> +	struct hv_register_assoc dispatch_suspend = {
> +		.name = HV_REGISTER_DISPATCH_SUSPEND,
> +		.value.dispatch_suspend.suspended = 1,
> +	};
> +
> +	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
> +				    1, &dispatch_suspend);
> +	if (ret)
> +		vp_err(vp, "failed to suspend\n");
> +
> +	return ret;
> +}
> +
> +static int
> +get_vp_signaled_count(struct mshv_vp *vp, u64 *count)
> +{
> +	int ret;
> +	struct hv_register_assoc root_signal_count = {
> +		.name = HV_REGISTER_VP_ROOT_SIGNAL_COUNT,
> +	};
> +
> +	ret = mshv_get_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
> +				    1, &root_signal_count);
> +
> +	if (ret) {
> +		vp_err(vp, "Failed to get root signal count");
> +		*count = 0;
> +		return ret;
> +	}
> +
> +	*count = root_signal_count.value.reg64;
> +
> +	return ret;
> +}
> +
> +static void
> +drain_vp_signals(struct mshv_vp *vp)
> +{
> +	u64 hv_signal_count;
> +	u64 vp_signal_count;
> +
> +	get_vp_signaled_count(vp, &hv_signal_count);
> +
> +	vp_signal_count = atomic64_read(&vp->run.vp_signaled_count);
> +
> +	/*
> +	 * There should be at most 1 outstanding notification, but be extra
> +	 * careful anyway.
> +	 */
> +	while (hv_signal_count != vp_signal_count) {
> +		WARN_ON(hv_signal_count - vp_signal_count != 1);
> +
> +		if (wait_event_interruptible(vp->run.vp_suspend_queue,
> +					     vp->run.kicked_by_hv == 1))
> +			break;
> +		vp->run.kicked_by_hv = 0;
> +		vp_signal_count = atomic64_read(&vp->run.vp_signaled_count);
> +	}
> +}
> +
> +static void drain_all_vps(const struct mshv_partition *partition)
> +{
> +	int i;
> +	struct mshv_vp *vp;
> +
> +	/*
> +	 * VPs are reachable from ISR. It is safe to not take the partition
> +	 * lock because nobody else can enter this function and drop the
> +	 * partition from the list.
> +	 */
> +	for (i = 0; i < MSHV_MAX_VPS; i++) {
> +		vp = partition->pt_vp_array[i];
> +		if (!vp)
> +			continue;
> +		/*
> +		 * Disable dispatching of the VP in the hypervisor. After this
> +		 * the hypervisor guarantees it won't generate any signals for
> +		 * the VP and the hypervisor's VP signal count won't change.
> +		 */
> +		disable_vp_dispatch(vp);
> +		drain_vp_signals(vp);
> +	}
> +}
> +
> +static void
> +remove_partition(struct mshv_partition *partition)
> +{
> +	spin_lock(&mshv_root.pt_ht_lock);
> +	hlist_del_rcu(&partition->pt_hnode);
> +	spin_unlock(&mshv_root.pt_ht_lock);
> +
> +	synchronize_rcu();
> +}
> +
> +/*
> + * Tear down a partition and remove it from the list.
> + * Partition's refcount must be 0
> + */
> +static void destroy_partition(struct mshv_partition *partition)
> +{
> +	struct mshv_vp *vp;
> +	struct mshv_mem_region *region;
> +	int i, ret;
> +	struct hlist_node *n;
> +	union hv_input_vtl input_vtl;
> +
> +	if (refcount_read(&partition->pt_ref_count)) {
> +		pt_err(partition,
> +		       "Attempt to destroy partition but refcount > 0\n");
> +		return;
> +	}
> +
> +	if (partition->pt_initialized) {
> +		/*
> +		 * We only need to drain signals for root scheduler. This should be
> +		 * done before removing the partition from the partition list.
> +		 */
> +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> +			drain_all_vps(partition);
> +
> +		/* Remove vps */
> +		for (i = 0; i < MSHV_MAX_VPS; ++i) {
> +			vp = partition->pt_vp_array[i];
> +			if (!vp)
> +				continue;
> +
> +			if (hv_parent_partition())
> +				mshv_vp_stats_unmap(partition->pt_id, vp->vp_index);
> +
> +			if (vp->vp_register_page) {
> +				input_vtl.as_uint8 = 0;
> +				(void)hv_call_unmap_vp_state_page(partition->pt_id,
> +								  vp->vp_index,
> +								  HV_VP_STATE_PAGE_REGISTERS,
> +								  input_vtl);
> +				vp->vp_register_page = NULL;
> +			}
> +
> +			input_vtl.as_uint8 = 0;
> +			(void)hv_call_unmap_vp_state_page(partition->pt_id,
> +							  vp->vp_index,
> +							  HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
> +							  input_vtl);
> +			vp->vp_intercept_msg_page = NULL;
> +
> +			if (vp->vp_ghcb_page) {
> +				input_vtl.use_target_vtl = 1;
> +				input_vtl.target_vtl = HV_NORMAL_VTL;
> +				(void)hv_call_unmap_vp_state_page(partition->pt_id,
> +								  vp->vp_index,
> +								  HV_VP_STATE_PAGE_GHCB,
> +								  input_vtl);
> +				vp->vp_ghcb_page = NULL;
> +			}
> +
> +			kfree(vp);
> +
> +			partition->pt_vp_array[i] = NULL;
> +		}
> +
> +		/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
> +		hv_call_finalize_partition(partition->pt_id);
> +
> +		partition->pt_initialized = false;
> +	}
> +
> +	remove_partition(partition);
> +
> +	/* Remove regions, regain access to the memory and unpin the pages */
> +	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
> +				  hnode) {
> +		hlist_del(&region->hnode);
> +
> +		if (mshv_partition_encrypted(partition)) {
> +			ret = mshv_partition_region_share(region);
> +			if (ret) {
> +				pt_err(partition,
> +				       "Failed to regain access to memory, unpinning user pages will fail and crash the host error: %d\n",
> +				      ret);
> +				return;
> +			}
> +		}
> +
> +		mshv_region_evict(region);
> +
> +		vfree(region);
> +	}
> +
> +	/* Withdraw and free all pages we deposited */
> +	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> +	hv_call_delete_partition(partition->pt_id);
> +
> +	mshv_free_routing_table(partition);
> +	kfree(partition);
> +}
> +
> +struct
> +mshv_partition *mshv_partition_get(struct mshv_partition *partition)
> +{
> +	if (refcount_inc_not_zero(&partition->pt_ref_count))
> +		return partition;
> +	return NULL;
> +}
> +
> +struct
> +mshv_partition *mshv_partition_find(u64 partition_id)
> +	__must_hold(RCU)
> +{
> +	struct mshv_partition *p;
> +
> +	hash_for_each_possible_rcu(mshv_root.pt_htable, p, pt_hnode,
> +				   partition_id)
> +		if (p->pt_id == partition_id)
> +			return p;
> +
> +	return NULL;
> +}
> +
> +void
> +mshv_partition_put(struct mshv_partition *partition)
> +{
> +	if (refcount_dec_and_test(&partition->pt_ref_count))
> +		destroy_partition(partition);
> +}
> +
> +static int
> +mshv_partition_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_partition *partition = filp->private_data;
> +
> +	mshv_eventfd_release(partition);
> +
> +	cleanup_srcu_struct(&partition->pt_irq_srcu);
> +
> +	mshv_partition_put(partition);
> +
> +	return 0;
> +}
> +
> +static int
> +add_partition(struct mshv_partition *partition)
> +{
> +	spin_lock(&mshv_root.pt_ht_lock);
> +
> +	hash_add_rcu(mshv_root.pt_htable, &partition->pt_hnode,
> +		     partition->pt_id);
> +
> +	spin_unlock(&mshv_root.pt_ht_lock);
> +
> +	return 0;
> +}
> +
> +static long
> +mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
> +{
> +	struct mshv_create_partition args;
> +	u64 creation_flags;
> +	struct hv_partition_creation_properties creation_properties = {};
> +	union hv_partition_isolation_properties isolation_properties = {};
> +	struct mshv_partition *partition;
> +	struct file *file;
> +	int fd;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_arg, sizeof(args)))
> +		return -EFAULT;
> +
> +	if ((args.pt_flags & ~MSHV_PT_FLAGS_MASK) ||
> +	    args.pt_isolation >= MSHV_PT_ISOLATION_COUNT)
> +		return -EINVAL;
> +
> +	/* Only support EXO partitions */
> +	creation_flags = HV_PARTITION_CREATION_FLAG_EXO_PARTITION |
> +			HV_PARTITION_CREATION_FLAG_INTERCEPT_MESSAGE_PAGE_ENABLED;
> +
> +	if (args.pt_flags & BIT(MSHV_PT_BIT_LAPIC))
> +		creation_flags |= HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED;
> +	if (args.pt_flags & BIT(MSHV_PT_BIT_X2APIC))
> +		creation_flags |= HV_PARTITION_CREATION_FLAG_X2APIC_CAPABLE;
> +	if (args.pt_flags & BIT(MSHV_PT_BIT_GPA_SUPER_PAGES))
> +		creation_flags |= HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED;
> +
> +	switch (args.pt_isolation) {
> +	case MSHV_PT_ISOLATION_NONE:
> +		isolation_properties.isolation_type =
> +			HV_PARTITION_ISOLATION_TYPE_NONE;
> +		break;
> +	}
> +
> +	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
> +	if (!partition)
> +		return -ENOMEM;
> +
> +	partition->pt_module_dev = module_dev;
> +	partition->isolation_type = isolation_properties.isolation_type;
> +
> +	refcount_set(&partition->pt_ref_count, 1);
> +
> +	mutex_init(&partition->pt_mutex);
> +
> +	mutex_init(&partition->pt_irq_lock);
> +
> +	init_completion(&partition->async_hypercall);
> +
> +	INIT_HLIST_HEAD(&partition->irq_ack_notifier_list);
> +
> +	INIT_HLIST_HEAD(&partition->pt_devices);
> +
> +	INIT_HLIST_HEAD(&partition->pt_mem_regions);
> +
> +	mshv_eventfd_init(partition);
> +
> +	ret = init_srcu_struct(&partition->pt_irq_srcu);
> +	if (ret)
> +		goto free_partition;
> +
> +	ret = hv_call_create_partition(creation_flags,
> +				       creation_properties,
> +				       isolation_properties,
> +				       &partition->pt_id);
> +	if (ret)
> +		goto cleanup_irq_srcu;
> +
> +	ret = add_partition(partition);
> +	if (ret)
> +		goto delete_partition;
> +
> +	ret = mshv_init_async_handler(partition);
> +	if (ret)
> +		goto remove_partition;
> +
> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		ret = fd;
> +		goto remove_partition;
> +	}
> +
> +	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
> +				  partition, O_RDWR);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto put_fd;
> +	}
> +
> +	fd_install(fd, file);
> +
> +	return fd;
> +
> +put_fd:
> +	put_unused_fd(fd);
> +remove_partition:
> +	remove_partition(partition);
> +delete_partition:
> +	hv_call_delete_partition(partition->pt_id);
> +cleanup_irq_srcu:
> +	cleanup_srcu_struct(&partition->pt_irq_srcu);
> +free_partition:
> +	kfree(partition);
> +
> +	return ret;
> +}
> +
> +static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl,
> +			   unsigned long arg)
> +{
> +	struct miscdevice *misc = filp->private_data;
> +
> +	switch (ioctl) {
> +	case MSHV_CREATE_PARTITION:
> +		return mshv_ioctl_create_partition((void __user *)arg,
> +						misc->this_device);
> +	}
> +
> +	return -ENOTTY;
> +}
> +
> +static int
> +mshv_dev_open(struct inode *inode, struct file *filp)
> +{
> +	return 0;
> +}
> +
> +static int
> +mshv_dev_release(struct inode *inode, struct file *filp)
> +{
> +	return 0;
> +}
> +
> +static int mshv_cpuhp_online;
> +static int mshv_root_sched_online;
> +
> +static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> +{
> +	switch (type) {
> +	case HV_SCHEDULER_TYPE_LP:
> +		return "classic scheduler without SMT";
> +	case HV_SCHEDULER_TYPE_LP_SMT:
> +		return "classic scheduler with SMT";
> +	case HV_SCHEDULER_TYPE_CORE_SMT:
> +		return "core scheduler";
> +	case HV_SCHEDULER_TYPE_ROOT:
> +		return "root scheduler";
> +	default:
> +		return "unknown scheduler";
> +	};
> +}
> +
> +/* TODO move this to hv_common.c when needed outside */
> +static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> +{
> +	struct hv_input_get_system_property *input;
> +	struct hv_output_get_system_property *output;
> +	unsigned long flags;
> +	u64 status;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));
> +	input->property_id = HV_SYSTEM_PROPERTY_SCHEDULER_TYPE;
> +
> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
> +	if (!hv_result_success(status)) {
> +		local_irq_restore(flags);
> +		pr_err("%s: %s\n", __func__, hv_result_to_string(status));
> +		return hv_result_to_errno(status);
> +	}
> +
> +	*out = output->scheduler_type;
> +	local_irq_restore(flags);
> +
> +	return 0;
> +}
> +
> +/* Retrieve and stash the supported scheduler type */
> +static int __init mshv_retrieve_scheduler_type(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
> +	if (ret)
> +		return ret;
> +
> +	dev_info(dev, "Hypervisor using %s\n",
> +		 scheduler_type_to_string(hv_scheduler_type));
> +
> +	switch (hv_scheduler_type) {
> +	case HV_SCHEDULER_TYPE_CORE_SMT:
> +	case HV_SCHEDULER_TYPE_LP_SMT:
> +	case HV_SCHEDULER_TYPE_ROOT:
> +	case HV_SCHEDULER_TYPE_LP:
> +		/* Supported scheduler, nothing to do */
> +		break;
> +	default:
> +		dev_err(dev, "unsupported scheduler 0x%x, bailing.\n",
> +			hv_scheduler_type);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> +
> +static int mshv_root_scheduler_init(unsigned int cpu)
> +{
> +	void **inputarg, **outputarg, *p;
> +
> +	inputarg = (void **)this_cpu_ptr(root_scheduler_input);
> +	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
> +
> +	/* Allocate two consecutive pages. One for input, one for output. */
> +	p = kmalloc(2 * HV_HYP_PAGE_SIZE, GFP_KERNEL);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	*inputarg = p;
> +	*outputarg = (char *)p + HV_HYP_PAGE_SIZE;
> +
> +	return 0;
> +}
> +
> +static int mshv_root_scheduler_cleanup(unsigned int cpu)
> +{
> +	void *p, **inputarg, **outputarg;
> +
> +	inputarg = (void **)this_cpu_ptr(root_scheduler_input);
> +	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
> +
> +	p = *inputarg;
> +
> +	*inputarg = NULL;
> +	*outputarg = NULL;
> +
> +	kfree(p);
> +
> +	return 0;
> +}
> +
> +/* Must be called after retrieving the scheduler type */
> +static int
> +root_scheduler_init(struct device *dev)
> +{
> +	int ret;
> +
> +	if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT)
> +		return 0;
> +
> +	root_scheduler_input = alloc_percpu(void *);
> +	root_scheduler_output = alloc_percpu(void *);
> +
> +	if (!root_scheduler_input || !root_scheduler_output) {
> +		dev_err(dev, "Failed to allocate root scheduler buffers\n");
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_root_sched",
> +				mshv_root_scheduler_init,
> +				mshv_root_scheduler_cleanup);
> +
> +	if (ret < 0) {
> +		dev_err(dev, "Failed to setup root scheduler state: %i\n", ret);
> +		goto out;
> +	}
> +
> +	mshv_root_sched_online = ret;
> +
> +	return 0;
> +
> +out:
> +	free_percpu(root_scheduler_input);
> +	free_percpu(root_scheduler_output);
> +	return ret;
> +}
> +
> +static void
> +root_scheduler_deinit(void)
> +{
> +	if (hv_scheduler_type != HV_SCHEDULER_TYPE_ROOT)
> +		return;
> +
> +	cpuhp_remove_state(mshv_root_sched_online);
> +	free_percpu(root_scheduler_input);
> +	free_percpu(root_scheduler_output);
> +}
> +
> +static int mshv_reboot_notify(struct notifier_block *nb,
> +			      unsigned long code, void *unused)
> +{
> +	cpuhp_remove_state(mshv_cpuhp_online);
> +	return 0;
> +}
> +
> +struct notifier_block mshv_reboot_nb = {
> +	.notifier_call = mshv_reboot_notify,
> +};
> +
> +static void mshv_root_partition_exit(void)
> +{
> +	unregister_reboot_notifier(&mshv_reboot_nb);
> +	root_scheduler_deinit();
> +}
> +
> +static int __init mshv_root_partition_init(struct device *dev)
> +{
> +	int err;
> +
> +	if (mshv_retrieve_scheduler_type(dev))
> +		return -ENODEV;
> +
> +	err = root_scheduler_init(dev);
> +	if (err)
> +		return err;
> +
> +	err = register_reboot_notifier(&mshv_reboot_nb);
> +	if (err)
> +		goto root_sched_deinit;
> +
> +	return 0;
> +
> +root_sched_deinit:
> +	root_scheduler_deinit();
> +	return err;
> +}
> +
> +static int __init mshv_parent_partition_init(void)
> +{
> +	int ret;
> +	struct device *dev;
> +	union hv_hypervisor_version_info version_info;
> +
> +	if (!hv_root_partition() || is_kdump_kernel())
> +		return -ENODEV;
> +
> +	if (hv_get_hypervisor_version(&version_info))
> +		return -ENODEV;
> +
> +	ret = misc_register(&mshv_dev);
> +	if (ret)
> +		return ret;
> +
> +	dev = mshv_dev.this_device;
> +
> +	if (version_info.build_number < MSHV_HV_MIN_VERSION ||
> +	    version_info.build_number > MSHV_HV_MAX_VERSION) {
> +		dev_err(dev, "Running on unvalidated Hyper-V version\n");
> +		dev_err(dev, "Versions: current: %u  min: %u  max: %u\n",
> +			version_info.build_number, MSHV_HV_MIN_VERSION,
> +			MSHV_HV_MAX_VERSION);
> +	}
> +
> +	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> +	if (!mshv_root.synic_pages) {
> +		dev_err(dev, "Failed to allocate percpu synic page\n");
> +		ret = -ENOMEM;
> +		goto device_deregister;
> +	}
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> +				mshv_synic_init,
> +				mshv_synic_cleanup);
> +	if (ret < 0) {
> +		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> +		goto free_synic_pages;
> +	}
> +
> +	mshv_cpuhp_online = ret;
> +
> +	ret = mshv_root_partition_init(dev);
> +	if (ret)
> +		goto remove_cpu_state;
> +
> +	ret = mshv_irqfd_wq_init();
> +	if (ret)
> +		goto exit_partition;
> +
> +	spin_lock_init(&mshv_root.pt_ht_lock);
> +	hash_init(mshv_root.pt_htable);
> +
> +	hv_setup_mshv_handler(mshv_isr);
> +
> +	return 0;
> +
> +exit_partition:
> +	if (hv_root_partition())
> +		mshv_root_partition_exit();
> +remove_cpu_state:
> +	cpuhp_remove_state(mshv_cpuhp_online);
> +free_synic_pages:
> +	free_percpu(mshv_root.synic_pages);
> +device_deregister:
> +	misc_deregister(&mshv_dev);
> +	return ret;
> +}
> +
> +static void __exit mshv_parent_partition_exit(void)
> +{
> +	hv_setup_mshv_handler(NULL);
> +	mshv_port_table_fini();
> +	misc_deregister(&mshv_dev);
> +	mshv_irqfd_wq_cleanup();
> +	if (hv_root_partition())
> +		mshv_root_partition_exit();
> +	cpuhp_remove_state(mshv_cpuhp_online);
> +	free_percpu(mshv_root.synic_pages);
> +}
> +
> +module_init(mshv_parent_partition_init);
> +module_exit(mshv_parent_partition_exit);
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> new file mode 100644
> index 000000000000..e7782f92e339
> --- /dev/null
> +++ b/drivers/hv/mshv_synic.c
> @@ -0,0 +1,665 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * mshv_root module's main interrupt handler and associated functionality.
> + *
> + * Authors:
> + *   Nuno Das Neves <nunodasneves@linux.microsoft.com>
> + *   Lillian Grassin-Drake <ligrassi@microsoft.com>
> + *   Vineeth Remanan Pillai <viremana@linux.microsoft.com>
> + *   Wei Liu <wei.liu@kernel.org>
> + *   Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/mm.h>
> +#include <linux/io.h>
> +#include <linux/random.h>
> +#include <asm/mshyperv.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +
> +static u32 synic_event_ring_get_queued_port(u32 sint_index)
> +{
> +	struct hv_synic_event_ring_page **event_ring_page;
> +	volatile struct hv_synic_event_ring *ring;
> +	struct hv_synic_pages *spages;
> +	u8 **synic_eventring_tail;
> +	u32 message;
> +	u8 tail;
> +
> +	spages = this_cpu_ptr(mshv_root.synic_pages);
> +	event_ring_page = &spages->synic_event_ring_page;
> +	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> +	tail = (*synic_eventring_tail)[sint_index];
> +
> +	if (unlikely(!(*event_ring_page))) {
> +		pr_debug("Missing synic event ring page!\n");
> +		return 0;
> +	}
> +
> +	ring = &(*event_ring_page)->sint_event_ring[sint_index];
> +
> +	/*
> +	 * Get the message.
> +	 */
> +	message = ring->data[tail];
> +
> +	if (!message) {
> +		if (ring->ring_full) {
> +			/*
> +			 * Ring is marked full, but we would have consumed all
> +			 * the messages. Notify the hypervisor that ring is now
> +			 * empty and check again.
> +			 */
> +			ring->ring_full = 0;
> +			hv_call_notify_port_ring_empty(sint_index);
> +			message = ring->data[tail];
> +		}
> +
> +		if (!message) {
> +			ring->signal_masked = 0;
> +			/*
> +			 * Unmask the signal and sync with hypervisor
> +			 * before one last check for any message.
> +			 */
> +			mb();
> +			message = ring->data[tail];
> +
> +			/*
> +			 * Ok, lets bail out.
> +			 */
> +			if (!message)
> +				return 0;
> +		}
> +
> +		ring->signal_masked = 1;
> +	}
> +
> +	/*
> +	 * Clear the message in the ring buffer.
> +	 */
> +	ring->data[tail] = 0;
> +
> +	if (++tail == HV_SYNIC_EVENT_RING_MESSAGE_COUNT)
> +		tail = 0;
> +
> +	(*synic_eventring_tail)[sint_index] = tail;
> +
> +	return message;
> +}
> +
> +static bool
> +mshv_doorbell_isr(struct hv_message *msg)
> +{
> +	struct hv_notification_message_payload *notification;
> +	u32 port;
> +
> +	if (msg->header.message_type != HVMSG_SYNIC_SINT_INTERCEPT)
> +		return false;
> +
> +	notification = (struct hv_notification_message_payload *)msg->u.payload;
> +	if (notification->sint_index != HV_SYNIC_DOORBELL_SINT_INDEX)
> +		return false;
> +
> +	while ((port = synic_event_ring_get_queued_port(HV_SYNIC_DOORBELL_SINT_INDEX))) {
> +		struct port_table_info ptinfo = { 0 };
> +
> +		if (mshv_portid_lookup(port, &ptinfo)) {
> +			pr_debug("Failed to get port info from port_table!\n");
> +			continue;
> +		}
> +
> +		if (ptinfo.hv_port_type != HV_PORT_TYPE_DOORBELL) {
> +			pr_debug("Not a doorbell port!, port: %d, port_type: %d\n",
> +				 port, ptinfo.hv_port_type);
> +			continue;
> +		}
> +
> +		/* Invoke the callback */
> +		ptinfo.hv_port_doorbell.doorbell_cb(port,
> +						 ptinfo.hv_port_doorbell.data);
> +	}
> +
> +	return true;
> +}
> +
> +static bool mshv_async_call_completion_isr(struct hv_message *msg)
> +{
> +	bool handled = false;
> +	struct hv_async_completion_message_payload *async_msg;
> +	struct mshv_partition *partition;
> +	u64 partition_id;
> +
> +	if (msg->header.message_type != HVMSG_ASYNC_CALL_COMPLETION)
> +		goto out;
> +
> +	async_msg =
> +		(struct hv_async_completion_message_payload *)msg->u.payload;
> +
> +	partition_id = async_msg->partition_id;
> +
> +	/*
> +	 * Hold this lock for the rest of the isr, because the partition could
> +	 * be released anytime.
> +	 * e.g. the MSHV_RUN_VP thread could wake on another cpu; it could
> +	 * release the partition unless we hold this!
> +	 */
> +	rcu_read_lock();
> +
> +	partition = mshv_partition_find(partition_id);
> +	partition->async_hypercall_status = async_msg->status;
> +
> +	if (unlikely(!partition)) {
> +		pr_debug("failed to find partition %llu\n", partition_id);
> +		goto unlock_out;
> +	}
> +
> +	complete(&partition->async_hypercall);
> +
> +	handled = true;
> +
> +unlock_out:
> +	rcu_read_unlock();
> +out:
> +	return handled;
> +}
> +
> +static void kick_vp(struct mshv_vp *vp)
> +{
> +	atomic64_inc(&vp->run.vp_signaled_count);
> +	vp->run.kicked_by_hv = 1;
> +	wake_up(&vp->run.vp_suspend_queue);
> +}
> +
> +static void
> +handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> +{
> +	int bank_idx, vps_signaled = 0, bank_mask_size;
> +	struct mshv_partition *partition;
> +	const struct hv_vpset *vpset;
> +	const u64 *bank_contents;
> +	u64 partition_id = msg->partition_id;
> +
> +	if (msg->vp_bitset.bitset.format != HV_GENERIC_SET_SPARSE_4K) {
> +		pr_debug("scheduler message format is not HV_GENERIC_SET_SPARSE_4K");
> +		return;
> +	}
> +
> +	if (msg->vp_count == 0) {
> +		pr_debug("scheduler message with no VP specified");
> +		return;
> +	}
> +
> +	rcu_read_lock();
> +
> +	partition = mshv_partition_find(partition_id);
> +	if (unlikely(!partition)) {
> +		pr_debug("failed to find partition %llu\n", partition_id);
> +		goto unlock_out;
> +	}
> +
> +	vpset = &msg->vp_bitset.bitset;
> +
> +	bank_idx = -1;
> +	bank_contents = vpset->bank_contents;
> +	bank_mask_size = sizeof(vpset->valid_bank_mask) * BITS_PER_BYTE;
> +
> +	while (true) {
> +		int vp_bank_idx = -1;
> +		int vp_bank_size = sizeof(*bank_contents) * BITS_PER_BYTE;
> +		int vp_index;
> +
> +		bank_idx = find_next_bit((unsigned long *)&vpset->valid_bank_mask,
> +					 bank_mask_size, bank_idx + 1);
> +		if (bank_idx == bank_mask_size)
> +			break;
> +
> +		while (true) {
> +			struct mshv_vp *vp;
> +
> +			vp_bank_idx = find_next_bit((unsigned long *)bank_contents,
> +						    vp_bank_size, vp_bank_idx + 1);
> +			if (vp_bank_idx == vp_bank_size)
> +				break;
> +
> +			vp_index = (bank_idx << HV_GENERIC_SET_SHIFT) + vp_bank_idx;

This would be clearer if it just multiplied by bank_mask_size instead of
shifting. Since the compiler knows the constant value of bank_mask_size, it
should generate the same code as the shift.

> +
> +			/* This shouldn't happen, but just in case. */
> +			if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> +				pr_debug("VP index %u out of bounds\n",
> +					 vp_index);
> +				goto unlock_out;
> +			}
> +
> +			vp = partition->pt_vp_array[vp_index];
> +			if (unlikely(!vp)) {
> +				pr_debug("failed to find VP %u\n", vp_index);
> +				goto unlock_out;
> +			}
> +
> +			kick_vp(vp);
> +			vps_signaled++;
> +		}
> +
> +		bank_contents++;
> +	}
> +
> +unlock_out:
> +	rcu_read_unlock();
> +
> +	if (vps_signaled != msg->vp_count)
> +		pr_debug("asked to signal %u VPs but only did %u\n",
> +			 msg->vp_count, vps_signaled);
> +}
> +
> +static void
> +handle_pair_message(const struct hv_vp_signal_pair_scheduler_message *msg)
> +{
> +	struct mshv_partition *partition = NULL;
> +	struct mshv_vp *vp;
> +	int idx;
> +
> +	rcu_read_lock();
> +
> +	for (idx = 0; idx < msg->vp_count; idx++) {
> +		u64 partition_id = msg->partition_ids[idx];
> +		u32 vp_index = msg->vp_indexes[idx];
> +
> +		if (idx == 0 || partition->pt_id != partition_id) {
> +			partition = mshv_partition_find(partition_id);
> +			if (unlikely(!partition)) {
> +				pr_debug("failed to find partition %llu\n",
> +					 partition_id);
> +				break;
> +			}
> +		}
> +
> +		/* This shouldn't happen, but just in case. */
> +		if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> +			pr_debug("VP index %u out of bounds\n", vp_index);
> +			break;
> +		}
> +
> +		vp = partition->pt_vp_array[vp_index];
> +		if (!vp) {
> +			pr_debug("failed to find VP %u\n", vp_index);
> +			break;
> +		}
> +
> +		kick_vp(vp);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
> +static bool
> +mshv_scheduler_isr(struct hv_message *msg)
> +{
> +	if (msg->header.message_type != HVMSG_SCHEDULER_VP_SIGNAL_BITSET &&
> +	    msg->header.message_type != HVMSG_SCHEDULER_VP_SIGNAL_PAIR)
> +		return false;
> +
> +	if (msg->header.message_type == HVMSG_SCHEDULER_VP_SIGNAL_BITSET)
> +		handle_bitset_message((struct hv_vp_signal_bitset_scheduler_message *)
> +				      msg->u.payload);
> +	else
> +		handle_pair_message((struct hv_vp_signal_pair_scheduler_message *)
> +				    msg->u.payload);
> +
> +	return true;
> +}
> +
> +static bool
> +mshv_intercept_isr(struct hv_message *msg)
> +{
> +	struct mshv_partition *partition;
> +	bool handled = false;
> +	struct mshv_vp *vp;
> +	u64 partition_id;
> +	u32 vp_index;
> +
> +	partition_id = msg->header.sender;
> +
> +	rcu_read_lock();
> +
> +	partition = mshv_partition_find(partition_id);
> +	if (unlikely(!partition)) {
> +		pr_debug("failed to find partition %llu\n",
> +			 partition_id);
> +		goto unlock_out;
> +	}
> +
> +	if (msg->header.message_type == HVMSG_X64_APIC_EOI) {
> +		/*
> +		 * Check if this gsi is registered in the
> +		 * ack_notifier list and invoke the callback
> +		 * if registered.
> +		 */
> +
> +		/*
> +		 * If there is a notifier, the ack callback is supposed
> +		 * to handle the VMEXIT. So we need not pass this message
> +		 * to vcpu thread.
> +		 */
> +		struct hv_x64_apic_eoi_message *eoi_msg =
> +			(struct hv_x64_apic_eoi_message *)&msg->u.payload[0];
> +
> +		if (mshv_notify_acked_gsi(partition, eoi_msg->interrupt_vector)) {
> +			handled = true;
> +			goto unlock_out;
> +		}
> +	}
> +
> +	/*
> +	 * We should get an opaque intercept message here for all intercept
> +	 * messages, since we're using the mapped VP intercept message page.
> +	 *
> +	 * The intercept message will have been placed in intercept message
> +	 * page at this point.
> +	 *
> +	 * Make sure the message type matches our expectation.
> +	 */
> +	if (msg->header.message_type != HVMSG_OPAQUE_INTERCEPT) {
> +		pr_debug("wrong message type %d", msg->header.message_type);
> +		goto unlock_out;
> +	}
> +
> +	/*
> +	 * Since we directly index the vp, and it has to exist for us to be here
> +	 * (because the vp is only deleted when the partition is), no additional
> +	 * locking is needed here
> +	 */
> +	vp_index =
> +	       ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> +	vp = partition->pt_vp_array[vp_index];
> +	if (unlikely(!vp)) {
> +		pr_debug("failed to find VP %u\n", vp_index);
> +		goto unlock_out;
> +	}
> +
> +	kick_vp(vp);
> +
> +	handled = true;
> +
> +unlock_out:
> +	rcu_read_unlock();
> +
> +	return handled;
> +}
> +
> +void mshv_isr(void)
> +{
> +	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_message_page **msg_page = &spages->synic_message_page;
> +	struct hv_message *msg;
> +	bool handled;
> +
> +	if (unlikely(!(*msg_page))) {
> +		pr_debug("Missing synic page!\n");
> +		return;
> +	}
> +
> +	msg = &((*msg_page)->sint_message[HV_SYNIC_INTERCEPTION_SINT_INDEX]);
> +
> +	/*
> +	 * If the type isn't set, there isn't really a message;
> +	 * it may be some other hyperv interrupt
> +	 */
> +	if (msg->header.message_type == HVMSG_NONE)
> +		return;
> +
> +	handled = mshv_doorbell_isr(msg);
> +
> +	if (!handled)
> +		handled = mshv_scheduler_isr(msg);
> +
> +	if (!handled)
> +		handled = mshv_async_call_completion_isr(msg);
> +
> +	if (!handled)
> +		handled = mshv_intercept_isr(msg);
> +
> +	if (handled) {
> +		/*
> +		 * Acknowledge message with hypervisor if another message is
> +		 * pending.
> +		 */
> +		msg->header.message_type = HVMSG_NONE;
> +		/*
> +		 * Ensure the write is complete so the hypervisor will deliver
> +		 * the next message if available.
> +		 */
> +		mb();
> +		if (msg->header.message_flags.msg_pending)
> +			hv_set_non_nested_msr(HV_MSR_EOM, 0);
> +
> +#ifdef HYPERVISOR_CALLBACK_VECTOR
> +		add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
> +#endif
> +	} else {
> +		pr_warn_once("%s: unknown message type 0x%x\n", __func__,
> +			     msg->header.message_type);
> +	}
> +}
> +
> +int mshv_synic_init(unsigned int cpu)
> +{
> +	union hv_synic_simp simp;
> +	union hv_synic_siefp siefp;
> +	union hv_synic_sirbp sirbp;
> +#ifdef HYPERVISOR_CALLBACK_VECTOR
> +	union hv_synic_sint sint;
> +#endif
> +	union hv_synic_scontrol sctrl;
> +	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_message_page **msg_page = &spages->synic_message_page;
> +	struct hv_synic_event_flags_page **event_flags_page =
> +			&spages->synic_event_flags_page;
> +	struct hv_synic_event_ring_page **event_ring_page =
> +			&spages->synic_event_ring_page;
> +
> +	/* Setup the Synic's message page */
> +	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> +	simp.simp_enabled = true;
> +	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
> +			     HV_HYP_PAGE_SIZE,
> +			     MEMREMAP_WB);
> +
> +	if (!(*msg_page))
> +		return -EFAULT;
> +
> +	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +
> +	/* Setup the Synic's event flags page */
> +	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> +	siefp.siefp_enabled = true;
> +	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
> +				     PAGE_SIZE, MEMREMAP_WB);
> +
> +	if (!(*event_flags_page))
> +		goto cleanup;
> +
> +	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +
> +	/* Setup the Synic's event ring page */
> +	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> +	sirbp.sirbp_enabled = true;
> +	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> +				    PAGE_SIZE, MEMREMAP_WB);
> +
> +	if (!(*event_ring_page))
> +		goto cleanup;
> +
> +	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +
> +#ifdef HYPERVISOR_CALLBACK_VECTOR
> +	/* Enable intercepts */
> +	sint.as_uint64 = 0;
> +	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.masked = false;
> +	sint.auto_eoi = hv_recommend_using_aeoi();
> +	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> +			      sint.as_uint64);
> +
> +	/* Doorbell SINT */
> +	sint.as_uint64 = 0;
> +	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.masked = false;
> +	sint.as_intercept = 1;
> +	sint.auto_eoi = hv_recommend_using_aeoi();
> +	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> +			      sint.as_uint64);
> +#endif
> +
> +	/* Enable global synic bit */
> +	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +	sctrl.enable = 1;
> +	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +
> +	return 0;
> +
> +cleanup:
> +	if (*event_ring_page) {
> +		sirbp.sirbp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		memunmap(*event_ring_page);
> +	}
> +	if (*event_flags_page) {
> +		siefp.siefp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +		memunmap(*event_flags_page);
> +	}
> +	if (*msg_page) {
> +		simp.simp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +		memunmap(*msg_page);
> +	}
> +
> +	return -EFAULT;
> +}
> +
> +int mshv_synic_cleanup(unsigned int cpu)
> +{
> +	union hv_synic_sint sint;
> +	union hv_synic_simp simp;
> +	union hv_synic_siefp siefp;
> +	union hv_synic_sirbp sirbp;
> +	union hv_synic_scontrol sctrl;
> +	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_message_page **msg_page = &spages->synic_message_page;
> +	struct hv_synic_event_flags_page **event_flags_page =
> +		&spages->synic_event_flags_page;
> +	struct hv_synic_event_ring_page **event_ring_page =
> +		&spages->synic_event_ring_page;
> +
> +	/* Disable the interrupt */
> +	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
> +	sint.masked = true;
> +	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> +			      sint.as_uint64);
> +
> +	/* Disable Doorbell SINT */
> +	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX);
> +	sint.masked = true;
> +	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> +			      sint.as_uint64);
> +
> +	/* Disable Synic's event ring page */
> +	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> +	sirbp.sirbp_enabled = false;
> +	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +	memunmap(*event_ring_page);
> +
> +	/* Disable Synic's event flags page */
> +	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> +	siefp.siefp_enabled = false;
> +	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	memunmap(*event_flags_page);
> +
> +	/* Disable Synic's message page */
> +	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> +	simp.simp_enabled = false;
> +	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	memunmap(*msg_page);
> +
> +	/* Disable global synic bit */
> +	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +	sctrl.enable = 0;
> +	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +
> +	return 0;
> +}
> +
> +int
> +mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb, void *data,
> +		       u64 gpa, u64 val, u64 flags)
> +{
> +	struct hv_connection_info connection_info = { 0 };
> +	union hv_connection_id connection_id = { 0 };
> +	struct port_table_info *port_table_info;
> +	struct hv_port_info port_info = { 0 };
> +	union hv_port_id port_id = { 0 };
> +	int ret;
> +
> +	port_table_info = kmalloc(sizeof(*port_table_info), GFP_KERNEL);
> +	if (!port_table_info)
> +		return -ENOMEM;
> +
> +	port_table_info->hv_port_type = HV_PORT_TYPE_DOORBELL;
> +	port_table_info->hv_port_doorbell.doorbell_cb = doorbell_cb;
> +	port_table_info->hv_port_doorbell.data = data;
> +	ret = mshv_portid_alloc(port_table_info);
> +	if (ret < 0) {
> +		kfree(port_table_info);
> +		return ret;
> +	}
> +
> +	port_id.u.id = ret;
> +	port_info.port_type = HV_PORT_TYPE_DOORBELL;
> +	port_info.doorbell_port_info.target_sint = HV_SYNIC_DOORBELL_SINT_INDEX;
> +	port_info.doorbell_port_info.target_vp = HV_ANY_VP;
> +	ret = hv_call_create_port(hv_current_partition_id, port_id, partition_id,
> +				  &port_info,
> +				  0, 0, NUMA_NO_NODE);
> +
> +	if (ret < 0) {
> +		mshv_portid_free(port_id.u.id);
> +		return ret;
> +	}
> +
> +	connection_id.u.id = port_id.u.id;
> +	connection_info.port_type = HV_PORT_TYPE_DOORBELL;
> +	connection_info.doorbell_connection_info.gpa = gpa;
> +	connection_info.doorbell_connection_info.trigger_value = val;
> +	connection_info.doorbell_connection_info.flags = flags;
> +
> +	ret = hv_call_connect_port(hv_current_partition_id, port_id, partition_id,
> +				   connection_id, &connection_info, 0, NUMA_NO_NODE);
> +	if (ret < 0) {
> +		hv_call_delete_port(hv_current_partition_id, port_id);
> +		mshv_portid_free(port_id.u.id);
> +		return ret;
> +	}
> +
> +	// lets use the port_id as the doorbell_id
> +	return port_id.u.id;
> +}
> +
> +void
> +mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
> +{
> +	union hv_port_id port_id = { 0 };
> +	union hv_connection_id connection_id = { 0 };
> +
> +	connection_id.u.id = doorbell_portid;
> +	hv_call_disconnect_port(partition_id, connection_id);
> +
> +	port_id.u.id = doorbell_portid;
> +	hv_call_delete_port(hv_current_partition_id, port_id);
> +
> +	mshv_portid_free(doorbell_portid);
> +}
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> new file mode 100644
> index 000000000000..9468f66c5658
> --- /dev/null
> +++ b/include/uapi/linux/mshv.h
> @@ -0,0 +1,287 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Userspace interfaces for /dev/mshv* devices and derived fds
> + *
> + * This file is divided into sections containing data structures and IOCTLs for
> + * a particular set of related devices or derived file descriptors.
> + *
> + * The IOCTL definitions are at the end of each section. They are grouped by
> + * device/fd, so that new IOCTLs can easily be added with a monotonically
> + * increasing number.
> + */
> +#ifndef _UAPI_LINUX_MSHV_H
> +#define _UAPI_LINUX_MSHV_H
> +
> +#include <linux/types.h>
> +
> +#define MSHV_IOCTL	0xB8
> +
> +/*
> + *******************************************
> + * Entry point to main VMM APIs: /dev/mshv *
> + *******************************************
> + */
> +
> +enum {
> +	MSHV_PT_BIT_LAPIC,
> +	MSHV_PT_BIT_X2APIC,
> +	MSHV_PT_BIT_GPA_SUPER_PAGES,
> +	MSHV_PT_BIT_COUNT,
> +};
> +
> +#define MSHV_PT_FLAGS_MASK ((1 << MSHV_PT_BIT_COUNT) - 1)
> +
> +enum {
> +	MSHV_PT_ISOLATION_NONE,
> +	MSHV_PT_ISOLATION_COUNT,
> +};
> +
> +/**
> + * struct mshv_create_partition - arguments for MSHV_CREATE_PARTITION
> + * @pt_flags: Bitmask of 1 << MSHV_PT_BIT_*
> + * @pt_isolation: MSHV_PT_ISOLATION_*
> + *
> + * Returns a file descriptor to act as a handle to a guest partition.
> + * At this point the partition is not yet initialized in the hypervisor.
> + * Some operations must be done with the partition in this state, e.g. setting
> + * so-called "early" partition properties. The partition can then be
> + * initialized with MSHV_INITIALIZE_PARTITION.
> + */
> +struct mshv_create_partition {
> +	__u64 pt_flags;
> +	__u64 pt_isolation;
> +};
> +
> +/* /dev/mshv */
> +#define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x00, struct mshv_create_partition)
> +
> +/*
> + ************************
> + * Child partition APIs *
> + ************************
> + */
> +
> +struct mshv_create_vp {
> +	__u32 vp_index;
> +};
> +
> +enum {
> +	MSHV_SET_MEM_BIT_WRITABLE,
> +	MSHV_SET_MEM_BIT_EXECUTABLE,
> +	MSHV_SET_MEM_BIT_UNMAP,
> +	MSHV_SET_MEM_BIT_COUNT
> +};
> +
> +#define MSHV_SET_MEM_FLAGS_MASK ((1 << MSHV_SET_MEM_BIT_COUNT) - 1)
> +
> +/**
> + * struct mshv_user_mem_region - arguments for MSHV_SET_GUEST_MEMORY
> + * @size: Size of the memory region (bytes). Must be aligned to PAGE_SIZE
> + * @guest_pfn: Base guest page number to map
> + * @userspace_addr: Base address of userspace memory. Must be aligned to
> + *                  PAGE_SIZE
> + * @flags: Bitmask of 1 << MSHV_SET_MEM_BIT_*. If (1 << MSHV_SET_MEM_BIT_UNMAP)
> + *         is set, ignore other bits.
> + * @rsvd: MBZ
> + *
> + * Map or unmap a region of userspace memory to Guest Physical Addresses (GPA).
> + * Mappings can't overlap in GPA space or userspace.
> + * To unmap, these fields must match an existing mapping.
> + */
> +struct mshv_user_mem_region {
> +	__u64 size;
> +	__u64 guest_pfn;
> +	__u64 userspace_addr;
> +	__u8 flags;
> +	__u8 rsvd[7];
> +};
> +
> +enum {
> +	MSHV_IRQFD_BIT_DEASSIGN,
> +	MSHV_IRQFD_BIT_RESAMPLE,
> +	MSHV_IRQFD_BIT_COUNT,
> +};
> +
> +#define MSHV_IRQFD_FLAGS_MASK	((1 << MSHV_IRQFD_BIT_COUNT) - 1)
> +
> +struct mshv_user_irqfd {
> +	__s32 fd;
> +	__s32 resamplefd;
> +	__u32 gsi;
> +	__u32 flags;
> +};
> +
> +enum {
> +	MSHV_IOEVENTFD_BIT_DATAMATCH,
> +	MSHV_IOEVENTFD_BIT_PIO,
> +	MSHV_IOEVENTFD_BIT_DEASSIGN,
> +	MSHV_IOEVENTFD_BIT_COUNT,
> +};
> +
> +#define MSHV_IOEVENTFD_FLAGS_MASK	((1 << MSHV_IOEVENTFD_BIT_COUNT) - 1)
> +
> +struct mshv_user_ioeventfd {
> +	__u64 datamatch;
> +	__u64 addr;	   /* legal pio/mmio address */
> +	__u32 len;	   /* 1, 2, 4, or 8 bytes    */
> +	__s32 fd;
> +	__u32 flags;
> +	__u8  rsvd[4];
> +};
> +
> +struct mshv_user_irq_entry {
> +	__u32 gsi;
> +	__u32 address_lo;
> +	__u32 address_hi;
> +	__u32 data;
> +};
> +
> +struct mshv_user_irq_table {
> +	__u32 nr;
> +	__u32 rsvd; /* MBZ */
> +	struct mshv_user_irq_entry entries[];
> +};
> +
> +enum {
> +	MSHV_GPAP_ACCESS_TYPE_ACCESSED = 0,
> +	MSHV_GPAP_ACCESS_TYPE_DIRTY,
> +	MSHV_GPAP_ACCESS_TYPE_COUNT		/* Count of enum members */
> +};
> +
> +enum {
> +	MSHV_GPAP_ACCESS_OP_NOOP = 0,
> +	MSHV_GPAP_ACCESS_OP_CLEAR,
> +	MSHV_GPAP_ACCESS_OP_SET,
> +	MSHV_GPAP_ACCESS_OP_COUNT		/* Count of enum members */
> +};

Any reason these two enums explicitly set the first value to 0, while
earlier enums do not?  This is another case of there being a difference,
and me wondering if it's just gratuitous or if there's a specific reason.
Consistency is a good thing!
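
For reference, the two spellings are equivalent as far as the values go,
since C numbers the first enumerator 0 when no initializer is given. A
minimal sketch, using hypothetical enum names that are not from the patch:

```c
/* Hypothetical names for illustration only; not part of the patch. */
enum implicit_first { IMPLICIT_NOOP, IMPLICIT_CLEAR, IMPLICIT_SET, IMPLICIT_COUNT };
enum explicit_first { EXPLICIT_NOOP = 0, EXPLICIT_CLEAR, EXPLICIT_SET, EXPLICIT_COUNT };

/* Both forms start at 0 and count upward, so the trailing COUNT members agree. */
_Static_assert((int)IMPLICIT_NOOP == (int)EXPLICIT_NOOP,
	       "first enumerator is 0 either way");
_Static_assert((int)IMPLICIT_COUNT == (int)EXPLICIT_COUNT,
	       "member counts agree");
```

So the choice is purely stylistic, which is why picking one form throughout
the header seems preferable.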

> +
> +/**
> + * struct mshv_gpap_access_bitmap - arguments for MSHV_GET_GPAP_ACCESS_BITMAP
> + * @access_type: MSHV_GPAP_ACCESS_TYPE_* - The type of access to record in the
> + *               bitmap
> + * @access_op: MSHV_GPAP_ACCESS_OP_* - Allows an optional clear or set of all
> + *             the access states in the range, after retrieving the current
> + *             states.
> + * @rsvd: MBZ
> + * @page_count: in: number of pages
> + *              out: on error, number of states successfully written to bitmap
> + * @gpap_base: Base gpa page number
> + * @bitmap_ptr: Output buffer for bitmap, at least (page_count + 7) / 8 bytes
> + *
> + * Retrieve a bitmap of either ACCESSED or DIRTY bits for a given range of guest
> + * memory, and optionally clear or set the bits.
> + */
> +struct mshv_gpap_access_bitmap {
> +	__u8 access_type;
> +	__u8 access_op;
> +	__u8 rsvd[6];
> +	__u64 page_count;
> +	__u64 gpap_base;
> +	__u64 bitmap_ptr;
> +};
> +
> +/**
> + * struct mshv_root_hvcall - arguments for MSHV_ROOT_HVCALL
> + * @code: Hypercall code (HVCALL_*)
> + * @reps: in: Rep count ('repcount')
> + *	  out: Reps completed ('repcomp'). MBZ unless rep hvcall
> + * @in_sz: Size of input incl rep data. <= HV_HYP_PAGE_SIZE
> + * @out_sz: Size of output buffer. <= HV_HYP_PAGE_SIZE. MBZ if out_ptr is 0
> + * @status: in: MBZ
> + *	    out: HV_STATUS_* from hypercall
> + * @rsvd: MBZ
> + * @in_ptr: Input data buffer (struct hv_input_*). If used with partition or
> + *	    vp fd, partition id field is populated by kernel.
> + * @out_ptr: Output data buffer (optional)
> + */
> +struct mshv_root_hvcall {
> +	__u16 code;
> +	__u16 reps;
> +	__u16 in_sz;
> +	__u16 out_sz;
> +	__u16 status;
> +	__u8 rsvd[6];
> +	__u64 in_ptr;
> +	__u64 out_ptr;
> +};
> +
> +/* Partition fds created with MSHV_CREATE_PARTITION */
> +#define MSHV_INITIALIZE_PARTITION	_IO(MSHV_IOCTL, 0x00)
> +#define MSHV_CREATE_VP			_IOW(MSHV_IOCTL, 0x01, struct mshv_create_vp)
> +#define MSHV_SET_GUEST_MEMORY		_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
> +#define MSHV_IRQFD			_IOW(MSHV_IOCTL, 0x03, struct mshv_user_irqfd)
> +#define MSHV_IOEVENTFD			_IOW(MSHV_IOCTL, 0x04, struct mshv_user_ioeventfd)
> +#define MSHV_SET_MSI_ROUTING		_IOW(MSHV_IOCTL, 0x05, struct mshv_user_irq_table)
> +#define MSHV_GET_GPAP_ACCESS_BITMAP	_IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
> +/* Generic hypercall */
> +#define MSHV_ROOT_HVCALL		_IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)

I really don't like having the ioctl numbers here overlap with the /dev/mshv ioctls.
There's just no need to overlap. But I realize changing it now is a big hassle.
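
To make the overlap concrete: the sequence number restarts at 0x00 for each
fd type, so commands meant for different fds share the same MSHV_IOCTL type
and the same nr. In the pair below the fully encoded values still differ,
but only because the direction and size fields happen to differ. This is a
small sketch using the standard _IOC_* decoding macros; the helper name is
hypothetical and not part of the patch:

```c
#include <linux/ioctl.h>
#include <linux/types.h>

#define MSHV_IOCTL	0xB8

struct mshv_create_partition {
	__u64 pt_flags;
	__u64 pt_isolation;
};

/* Copied from the patch: the first is issued on /dev/mshv, the second on a partition fd. */
#define MSHV_CREATE_PARTITION		_IOW(MSHV_IOCTL, 0x00, struct mshv_create_partition)
#define MSHV_INITIALIZE_PARTITION	_IO(MSHV_IOCTL, 0x00)

/* Hypothetical helper: true when two commands share the type and nr fields. */
static inline int mshv_same_type_and_nr(unsigned int a, unsigned int b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}
```

Giving each fd type its own nr range would make a command issued on the wrong
fd fail on the number alone, rather than relying on the size and direction
bits to tell the commands apart.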

> +
> +/*
> + ********************************
> + * VP APIs for child partitions *
> + ********************************
> + */
> +
> +#define MSHV_RUN_VP_BUF_SZ 256
> +
> +/*
> + * Map various VP state pages to userspace.
> + * Multiply the offset by PAGE_SIZE before being passed as the 'offset'
> + * argument to mmap().
> + * e.g.
> + * void *reg_page = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
> + *                       MAP_SHARED, vp_fd,
> + *                       MSHV_VP_MMAP_OFFSET_REGISTERS * PAGE_SIZE);
> + */

This is interesting.  I would not have thought PAGE_SIZE is available
in the UAPI.  You must use something like the getpagesize() call. I know
the root partition can only run with a 4K page size, but the symbol
"PAGE_SIZE" is probably kernel code only.
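
A minimal userspace sketch of what the comment could show instead, assuming
the page size is queried at run time with sysconf(_SC_PAGESIZE). The enum is
copied from the patch; the two helper names are hypothetical:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

enum {
	MSHV_VP_MMAP_OFFSET_REGISTERS,
	MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE,
	MSHV_VP_MMAP_OFFSET_GHCB,
	MSHV_VP_MMAP_OFFSET_COUNT
};

/* Hypothetical helper: byte offset for mmap(), or (off_t)-1 on a bad index. */
static inline off_t mshv_vp_mmap_offset(unsigned int index)
{
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || index >= MSHV_VP_MMAP_OFFSET_COUNT)
		return (off_t)-1;

	return (off_t)index * page_size;
}

/* Hypothetical helper: map one VP state page; returns MAP_FAILED on error. */
static inline void *mshv_map_vp_page(int vp_fd, unsigned int index)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset = mshv_vp_mmap_offset(index);

	if (page_size <= 0 || offset == (off_t)-1)
		return MAP_FAILED;

	return mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, vp_fd, offset);
}
```

With that, the example in the comment would read something like
mshv_map_vp_page(vp_fd, MSHV_VP_MMAP_OFFSET_REGISTERS), with no reference to
a kernel-only symbol.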

> +enum {
> +	MSHV_VP_MMAP_OFFSET_REGISTERS,
> +	MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE,
> +	MSHV_VP_MMAP_OFFSET_GHCB,
> +	MSHV_VP_MMAP_OFFSET_COUNT
> +};
> +
> +/**
> + * struct mshv_run_vp - argument for MSHV_RUN_VP
> + * @msg_buf: On success, the intercept message is copied here. It can be
> + *           interpreted using the relevant hypervisor definitions.
> + */
> +struct mshv_run_vp {
> +	__u8 msg_buf[MSHV_RUN_VP_BUF_SZ];
> +};
> +
> +enum {
> +	MSHV_VP_STATE_LAPIC,		/* Local interrupt controller state (either arch) */
> +	MSHV_VP_STATE_XSAVE,		/* XSAVE data in compacted form (x86_64) */
> +	MSHV_VP_STATE_SIMP,
> +	MSHV_VP_STATE_SIEFP,
> +	MSHV_VP_STATE_SYNTHETIC_TIMERS,
> +	MSHV_VP_STATE_COUNT,
> +};
> +
> +/**
> + * struct mshv_get_set_vp_hvcall - arguments for MSHV_[GET,SET]_VP_STATE

s/hvcall/state/

> + * @type: MSHV_VP_STATE_*
> + * @rsvd: MBZ
> + * @buf_sz: in: 4k page-aligned size of buffer
> + *          out: Actual size of data (on EINVAL, check this to see if buffer
> + *               was too small)
> + * @buf_ptr: 4k page-aligned data buffer
> + */
> +struct mshv_get_set_vp_state {
> +	__u8 type;
> +	__u8 rsvd[3];
> +	__u32 buf_sz;
> +	__u64 buf_ptr;
> +};
> +
> +/* VP fds created with MSHV_CREATE_VP */
> +#define MSHV_RUN_VP			_IOR(MSHV_IOCTL, 0x00, struct mshv_run_vp)
> +#define MSHV_GET_VP_STATE		_IOWR(MSHV_IOCTL, 0x01, struct mshv_get_set_vp_state)
> +#define MSHV_SET_VP_STATE		_IOWR(MSHV_IOCTL, 0x02, struct mshv_get_set_vp_state)
> +/*
> + * Generic hypercall
> + * Defined above in partition IOCTLs, avoid redefining it here
> + * #define MSHV_ROOT_HVCALL			_IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
> + */
> +
> +#endif
> --
> 2.34.1



* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-17 23:51   ` Michael Kelley
@ 2025-03-18 17:24     ` Wei Liu
  2025-03-18 17:45       ` Michael Kelley
  2025-03-19  0:34     ` Nuno Das Neves
  1 sibling, 1 reply; 108+ messages in thread
From: Wei Liu @ 2025-03-18 17:24 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	ssengar@linux.microsoft.com, apais@linux.microsoft.com,
	Tianyu.Lan@microsoft.com, stanislav.kinsburskiy@gmail.com,
	gregkh@linuxfoundation.org, vkuznets@redhat.com,
	prapal@linux.microsoft.com, muislam@microsoft.com,
	anrayabh@linux.microsoft.com, rafael@kernel.org, lenb@kernel.org,
	corbet@lwn.net

On Mon, Mar 17, 2025 at 11:51:52PM +0000, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
[...]
> > +static long
> > +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
> > +			    struct mshv_get_set_vp_state __user *user_args,
> > +			    bool is_set)
> > +{
> > +	struct mshv_get_set_vp_state args;
> > +	long ret = 0;
> > +	union hv_output_get_vp_state vp_state;
> > +	u32 data_sz;
> > +	struct hv_vp_state_data state_data = {};
> > +
> > +	if (copy_from_user(&args, user_args, sizeof(args)))
> > +		return -EFAULT;
> > +
> > +	if (args.type >= MSHV_VP_STATE_COUNT || mshv_field_nonzero(args, rsvd) ||
> > +	    !args.buf_sz || !PAGE_ALIGNED(args.buf_sz) ||
> > +	    !PAGE_ALIGNED(args.buf_ptr))
> > +		return -EINVAL;
> > +
> > +	if (!access_ok((void __user *)args.buf_ptr, args.buf_sz))
> > +		return -EFAULT;
> > +
> > +	switch (args.type) {
> > +	case MSHV_VP_STATE_LAPIC:
> > +		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
> > +		data_sz = HV_HYP_PAGE_SIZE;
> > +		break;
> > +	case MSHV_VP_STATE_XSAVE:
> 
> Just FYI, you can put a semicolon after the colon on the above line, which
> adds a null statement, and then the C compiler will accept the definition
> of local variable data_sz_64 without needing the odd-looking braces. 
> 
> See https://stackoverflow.com/questions/92396/why-cant-variables-be-declared-in-a-switch-statement/19830820
> 

This is a rarely seen pattern in the kernel, so I would prefer to keep
the braces for clarity.

 $ git grep -A5 -P 'case\s+\w+:;$'

This shows a few places are using this pattern. But they are not
declaring variables afterwards.

> I learn something new every day! :-)
> 

Yep, me too.

Thanks for reviewing the code. Nuno will address the comments. I can fix
them up.

Thanks,
Wei.


* RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-18 17:24     ` Wei Liu
@ 2025-03-18 17:45       ` Michael Kelley
  2025-03-18 20:07         ` Wei Liu
  0 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-18 17:45 UTC (permalink / raw)
  To: Wei Liu
  Cc: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	ssengar@linux.microsoft.com, apais@linux.microsoft.com,
	Tianyu.Lan@microsoft.com, stanislav.kinsburskiy@gmail.com,
	gregkh@linuxfoundation.org, vkuznets@redhat.com,
	prapal@linux.microsoft.com, muislam@microsoft.com,
	anrayabh@linux.microsoft.com, rafael@kernel.org, lenb@kernel.org,
	corbet@lwn.net

From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 18, 2025 10:25 AM
> 
> On Mon, Mar 17, 2025 at 11:51:52PM +0000, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> [...]
> > > +static long
> > > +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
> > > +			    struct mshv_get_set_vp_state __user *user_args,
> > > +			    bool is_set)
> > > +{
> > > +	struct mshv_get_set_vp_state args;
> > > +	long ret = 0;
> > > +	union hv_output_get_vp_state vp_state;
> > > +	u32 data_sz;
> > > +	struct hv_vp_state_data state_data = {};
> > > +
> > > +	if (copy_from_user(&args, user_args, sizeof(args)))
> > > +		return -EFAULT;
> > > +
> > > +	if (args.type >= MSHV_VP_STATE_COUNT || mshv_field_nonzero(args, rsvd) ||
> > > +	    !args.buf_sz || !PAGE_ALIGNED(args.buf_sz) ||
> > > +	    !PAGE_ALIGNED(args.buf_ptr))
> > > +		return -EINVAL;
> > > +
> > > +	if (!access_ok((void __user *)args.buf_ptr, args.buf_sz))
> > > +		return -EFAULT;
> > > +
> > > +	switch (args.type) {
> > > +	case MSHV_VP_STATE_LAPIC:
> > > +		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
> > > +		data_sz = HV_HYP_PAGE_SIZE;
> > > +		break;
> > > +	case MSHV_VP_STATE_XSAVE:
> >
> > Just FYI, you can put a semicolon after the colon on the above line, which
> > adds a null statement, and then the C compiler will accept the definition
> > of local variable data_sz_64 without needing the odd-looking braces.
> >
> > See https://stackoverflow.com/questions/92396/why-cant-variables-be-declared-in-a-switch-statement/19830820
> >
> 
> This is a rarely seen pattern in the kernel, so I would prefer to keep
> the braces for clarity.
> 
>  $ git grep -A5 -P 'case\s+\w+:;$'
> 
> This shows a few places are using this pattern. But they are not
> declaring variables afterwards.
> 

The braces just looked a little odd, particularly the way they are indented. Another
alternative is to move the variable to function scope, and avoid the issue altogether.
But I'm fine regardless of which approach you take, including keeping it like it is.
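
For comparison, the three spellings discussed in this subthread can be
sketched side by side. This is an illustration only: the enum values and
query_xsave_size() are hypothetical stand-ins, and the body of the patch's
actual XSAVE case is not reproduced here.

```c
#include <stdint.h>

enum { STATE_LAPIC, STATE_XSAVE };	/* hypothetical stand-ins for MSHV_VP_STATE_* */

/* Hypothetical helper standing in for however the 64-bit size is obtained. */
static inline uint64_t query_xsave_size(void)
{
	return 8192;
}

/* Variant 1: braces open a block, so the declaration is legal after the label. */
static inline uint32_t size_with_braces(int type)
{
	uint32_t data_sz = 0;

	switch (type) {
	case STATE_LAPIC:
		data_sz = 4096;
		break;
	case STATE_XSAVE: {
		uint64_t data_sz_64 = query_xsave_size();

		data_sz = (uint32_t)data_sz_64;
		break;
	}
	}
	return data_sz;
}

/* Variant 2: a null statement after the label, then a plain declaration. */
static inline uint32_t size_with_null_stmt(int type)
{
	uint32_t data_sz = 0;

	switch (type) {
	case STATE_LAPIC:
		data_sz = 4096;
		break;
	case STATE_XSAVE:;
		uint64_t data_sz_64 = query_xsave_size();

		data_sz = (uint32_t)data_sz_64;
		break;
	}
	return data_sz;
}

/* Variant 3: the variable lives at function scope, so no label issue arises. */
static inline uint32_t size_with_function_scope(int type)
{
	uint32_t data_sz = 0;
	uint64_t data_sz_64;

	switch (type) {
	case STATE_LAPIC:
		data_sz = 4096;
		break;
	case STATE_XSAVE:
		data_sz_64 = query_xsave_size();
		data_sz = (uint32_t)data_sz_64;
		break;
	}
	return data_sz;
}
```

All three behave the same; the difference is only in how the declaration is
made acceptable to compilers that require a statement after a case label.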

> > I learn something new every day! :-)
> >
> 
> Yep, me too.
> 
> Thanks for reviewing the code. Nuno will address the comments. I can fix
> them up.

FYI, I may submit a few more comments on the v6 version of the patches.
If there are changes you want to make based on my comments, I don't care
if you fix up the existing patches, or take them later as follow up patches.
And of course, you may choose to not make changes.

Michael


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-18 17:45       ` Michael Kelley
@ 2025-03-18 20:07         ` Wei Liu
  0 siblings, 0 replies; 108+ messages in thread
From: Wei Liu @ 2025-03-18 20:07 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Wei Liu, Nuno Das Neves, linux-hyperv@vger.kernel.org,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, daniel.lezcano@linaro.org, joro@8bytes.org,
	robin.murphy@arm.com, arnd@arndb.de,
	jinankjain@linux.microsoft.com, muminulrussell@gmail.com,
	skinsburskii@linux.microsoft.com, mrathor@linux.microsoft.com,
	ssengar@linux.microsoft.com, apais@linux.microsoft.com,
	Tianyu.Lan@microsoft.com, stanislav.kinsburskiy@gmail.com,
	gregkh@linuxfoundation.org, vkuznets@redhat.com,
	prapal@linux.microsoft.com, muislam@microsoft.com,
	anrayabh@linux.microsoft.com, rafael@kernel.org, lenb@kernel.org,
	corbet@lwn.net

On Tue, Mar 18, 2025 at 05:45:46PM +0000, Michael Kelley wrote:
> From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 18, 2025 10:25 AM
> > 
> > On Mon, Mar 17, 2025 at 11:51:52PM +0000, Michael Kelley wrote:
> > > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> > [...]
> > > > +static long
> > > > +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
> > > > +			    struct mshv_get_set_vp_state __user *user_args,
> > > > +			    bool is_set)
> > > > +{
> > > > +	struct mshv_get_set_vp_state args;
> > > > +	long ret = 0;
> > > > +	union hv_output_get_vp_state vp_state;
> > > > +	u32 data_sz;
> > > > +	struct hv_vp_state_data state_data = {};
> > > > +
> > > > +	if (copy_from_user(&args, user_args, sizeof(args)))
> > > > +		return -EFAULT;
> > > > +
> > > > +	if (args.type >= MSHV_VP_STATE_COUNT || mshv_field_nonzero(args, rsvd) ||
> > > > +	    !args.buf_sz || !PAGE_ALIGNED(args.buf_sz) ||
> > > > +	    !PAGE_ALIGNED(args.buf_ptr))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!access_ok((void __user *)args.buf_ptr, args.buf_sz))
> > > > +		return -EFAULT;
> > > > +
> > > > +	switch (args.type) {
> > > > +	case MSHV_VP_STATE_LAPIC:
> > > > +		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
> > > > +		data_sz = HV_HYP_PAGE_SIZE;
> > > > +		break;
> > > > +	case MSHV_VP_STATE_XSAVE:
> > >
> > > Just FYI, you can put a semicolon after the colon on the above line, which
> > > adds a null statement, and then the C compiler will accept the definition
> > > of local variable data_sz_64 without needing the odd-looking braces.
> > >
> > > See https://stackoverflow.com/questions/92396/why-cant-variables-be-declared-in-a-switch-statement/19830820
> > >
> > 
> > This is a rarely seen pattern in the kernel, so I would prefer to keep
> > the braces for clarity.
> > 
> >  $ git grep -A5 -P 'case\s+\w+:;$'
> > 
> > This shows a few places are using this pattern. But they are not
> > declaring variables afterwards.
> > 
> 
> The braces just looked a little odd, particularly the way they are indented. Another
> alternative is to move the variable to function scope, and avoid the issue altogether.
> But I'm fine regardless of which approach you take, including keeping it like it is.
> 
> > > I learn something new every day! :-)
> > >
> > 
> > Yep, me too.
> > 
> > Thanks for reviewing the code. Nuno will address the comments. I can fix
> > them up.
> 
> FYI, I may submit a few more comments on the v6 version of the patches.
> If there are changes you want to make based on my comments, I don't care
> if you fix up the existing patches, or take them later as follow up patches.
> And of course, you may choose to not make changes.

We will aim to address as many comments as possible in the first
submission to reduce the number of follow-up patches. This driver will
be backported to replace the one running internally in Microsoft, so the
fewer patches we need to backport the better.

Wei.

> 
> Michael


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-17 23:51   ` Michael Kelley
  2025-03-18 17:24     ` Wei Liu
@ 2025-03-19  0:34     ` Nuno Das Neves
  2025-03-19  2:10       ` Michael Kelley
  1 sibling, 1 reply; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-19  0:34 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/17/2025 4:51 PM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
<snip>
>> +
>> +/* TODO move this to another file when debugfs code is added */
>> +enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */
>> +#if defined(CONFIG_X86)
>> +	VpRootDispatchThreadBlocked			= 201,
>> +#elif defined(CONFIG_ARM64)
>> +	VpRootDispatchThreadBlocked			= 94,
>> +#endif
>> +	VpStatsMaxCounter
>> +};
> 
> Where do these "magic" numbers come from?  Are they matching something
> in the Hyper-V host?
> 
They are part of the hypervisor ABI and really belong in hvhdk.h. These enums
have many members which we use in another part of the code but which are omitted
here.

For this patchset I put them here to avoid putting PascalCase definitions in the
global namespace. I was undecided if we want to keep them like this (maybe keeping
them out of hvhdk.h), or change them to linux style in a followup.

>> +
>> +struct hv_stats_page {
>> +	union {
>> +		u64 vp_cntrs[VpStatsMaxCounter];		/* VP counters */
>> +		u8 data[HV_HYP_PAGE_SIZE];
>> +	};
>> +} __packed;
>> +
>> +struct mshv_root mshv_root = {};
> 
> Initializer is unnecessary for global variables. They are already set to zero.
> 
Yep, I'll remove it.

>> +
>> +enum hv_scheduler_type hv_scheduler_type;
>> +
>> +/* Once we implement the fast extended hypercall ABI they can go away. */
>> +static void __percpu **root_scheduler_input;
>> +static void __percpu **root_scheduler_output;
> 
> The __percpu is probably in the wrong place like mentioned in earlier
> patches in this series.
> 
Ack, will fix.

<snip>
>> +
>> +static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
>> +				      bool partition_locked,
>> +				      void __user *user_args)
>> +{
>> +	u64 status;
>> +	int ret, i;
> 
> 'ret' should be initialized to 0. There's a path through this function that
> never sets 'ret' and the return value would be stack garbage.
> 
Thanks, will fix.

<snip>
>> +
>> +/*
>> + * Explicit guest vCPU suspend is asynchronous by nature (as it is requested by
>> + * dom0 vCPU for guest vCPU) and thus it can race with "intercept" suspend,
>> + * done by the hypervisor.
>> + * "Intercept" suspend leads to asynchronous message delivery to dom0 which
>> + * should be awaited to keep the VP loop consistent (i.e. no message pending
>> + * upon VP resume).
>> + * VP intercept suspend can't be done when the VP is explicitly suspended
>> + * already, and thus can be only two possible race scenarios:
>> + *   1. implicit suspend bit set -> explicit suspend bit set -> message sent
>> + *   2. implicit suspend bit set -> message sent -> explicit suspend bit set
>> + * Checking for implicit suspend bit set after explicit suspend request has
>> + * succeeded in either case allows us to reliably identify, if there is a
>> + * message to receive and deliver to VMM.
>> + */
>> +static long
> 
> For this function, why is the return type "long" instead of "int"?  Same
> question for several other functions below.  "long" works, but it's another
> case of being gratuitously atypical -- unless there's a reason.
> 
No good reason. It should just be an int.

<snip>
>> +
>> +	switch (args.type) {
>> +	case MSHV_VP_STATE_LAPIC:
>> +		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
>> +		data_sz = HV_HYP_PAGE_SIZE;
>> +		break;
>> +	case MSHV_VP_STATE_XSAVE:
> 
> Just FYI, you can put a semicolon after the colon on the above line, which
> adds a null statement, and then the C compiler will accept the definition
> of local variable data_sz_64 without needing the odd-looking braces. 
> 
> See https://stackoverflow.com/questions/92396/why-cant-variables-be-declared-in-a-switch-statement/19830820
> 
> I learn something new every day! :-)
> 
I didn't know that! But actually I prefer the braces because they clearly
denote a new block scope for that case.
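
For reference, a minimal userspace sketch of the two spellings being
compared. The enum and function names are made up for illustration and are
not taken from the patch; only the shape of the switch matters.

```c
/* Hypothetical state types, standing in for the MSHV_VP_STATE_* values. */
enum sketch_state_type { SKETCH_STATE_SMALL, SKETCH_STATE_XSAVE_LIKE };

/* Variant 1: braces open a block, so the declaration is scoped to the case. */
static unsigned int size_with_braces(enum sketch_state_type t)
{
	unsigned int sz;

	switch (t) {
	case SKETCH_STATE_XSAVE_LIKE:
	{
		unsigned long long sz64 = 4096;

		sz = (unsigned int)sz64;
		break;
	}
	default:
		sz = 200;
		break;
	}
	return sz;
}

/*
 * Variant 2: the ';' after the label is a null statement, so the label is
 * followed by a statement and the declaration after it is accepted.
 * Note that sz64 stays in scope until the end of the switch body.
 */
static unsigned int size_with_null_stmt(enum sketch_state_type t)
{
	unsigned int sz;

	switch (t) {
	case SKETCH_STATE_XSAVE_LIKE:;
		unsigned long long sz64 = 4096;

		sz = (unsigned int)sz64;
		break;
	default:
		sz = 200;
		break;
	}
	return sz;
}
```

Both variants behave the same; the difference is only where the scope of the
local variable ends.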

>> +	{
>> +		u64 data_sz_64;
>> +
>> +		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
>> +						     HV_PARTITION_PROPERTY_XSAVE_STATES,
>> +						     &state_data.xsave.states.as_uint64);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
>> +						     HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE,
>> +						     &data_sz_64);
>> +		if (ret)
>> +			return ret;
>> +
>> +		data_sz = (u32)data_sz_64;
>> +		state_data.xsave.flags = 0;
>> +		/* Always request legacy states */
>> +		state_data.xsave.states.legacy_x87 = 1;
>> +		state_data.xsave.states.legacy_sse = 1;
>> +		state_data.type = HV_GET_SET_VP_STATE_XSAVE;
>> +		break;
>> +	}
>> +	case MSHV_VP_STATE_SIMP:
>> +		state_data.type = HV_GET_SET_VP_STATE_SIM_PAGE;
>> +		data_sz = HV_HYP_PAGE_SIZE;
>> +		break;
>> +	case MSHV_VP_STATE_SIEFP:
>> +		state_data.type = HV_GET_SET_VP_STATE_SIEF_PAGE;
>> +		data_sz = HV_HYP_PAGE_SIZE;
>> +		break;
>> +	case MSHV_VP_STATE_SYNTHETIC_TIMERS:
>> +		state_data.type = HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS;
>> +		data_sz = sizeof(vp_state.synthetic_timers_state);
>> +		break;
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (copy_to_user(&user_args->buf_sz, &data_sz, sizeof(user_args->buf_sz)))
>> +		return -EFAULT;
>> +
>> +	if (data_sz > args.buf_sz)
>> +		return -EINVAL;
>> +
>> +	/* If the data is transmitted via pfns, delegate to helper */
>> +	if (state_data.type & HV_GET_SET_VP_STATE_TYPE_PFN) {
>> +		unsigned long user_pfn = PFN_DOWN(args.buf_ptr);
>> +		size_t page_count = PFN_DOWN(args.buf_sz);
>> +
>> +		return mshv_vp_ioctl_get_set_state_pfn(vp, state_data, user_pfn,
>> +						       page_count, is_set);
>> +	}
>> +
>> +	/* Paranoia check - this shouldn't happen! */
>> +	if (data_sz > sizeof(vp_state)) {
>> +		vp_err(vp, "Invalid vp state data size!\n");
>> +		return -EINVAL;
>> +	}
> 
> I don't understand the above check.  sizeof(vp_state) is relatively small since
> it is effectively sizeof(hv_synthetic_timers_state), which is 200 bytes if I've
> done the arithmetic correctly. But data_sz could be a full page (4096 bytes)
> for the LAPIC, SIMP, and SIEFP cases, and the check would cause an error to
> be returned.
> 
data_sz > sizeof(vp_state) is true if and only if the HV_GET_SET_VP_STATE_TYPE_PFN
bit is set in state_data.type. This check ensures that invariant holds.

See just above where we delegate to mshv_vp_ioctl_get_set_state_pfn() in that case.

>> +
>> +	if (is_set) {
>> +		if (copy_from_user(&vp_state, (__user void *)args.buf_ptr, data_sz))
>> +			return -EFAULT;
>> +
>> +		return hv_call_set_vp_state(vp->vp_index,
>> +					    vp->vp_partition->pt_id,
>> +					    state_data, 0, NULL,
>> +					    sizeof(vp_state), (u8 *)&vp_state);
> 
> This is one of the cases where data from user space gets passed directly to
> the hypercall. So user space is responsible for ensuring that reserved fields
> are zero'ed and for otherwise ensuring a proper hypercall input. I just
> wonder if user space really does this correctly.
> 
The interfaces that are 'passthrough' like this remove quite a bit of
complexity from the kernel code and delegate it to userspace and the hypervisor.

It is on userspace to ensure the parameters are valid, and it's on the
hypervisor to check the fields and error if they are used improperly.

Note the hypervisor still needs to check everything regardless of whether it
comes from the kernel or directly from userspace.

>> +
>> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
>> +{
>> +	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
>> +
>> +	switch (vmf->vma->vm_pgoff) {
>> +	case MSHV_VP_MMAP_OFFSET_REGISTERS:
>> +		vmf->page = virt_to_page(vp->vp_register_page);
>> +		break;
>> +	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
>> +		vmf->page = virt_to_page(vp->vp_intercept_msg_page);
>> +		break;
>> +	case MSHV_VP_MMAP_OFFSET_GHCB:
>> +		if (is_ghcb_mapping_available())
>> +			vmf->page = virt_to_page(vp->vp_ghcb_page);
>> +		break;
> 
> If there's no GHCB mapping available, execution just continues with
> vmf->page not set. Won't the later get_page() call fail? Perhaps this
> should fail if there's no GHCB mapping available. Or maybe there's
> more about how this works that I'm ignorant of. :-)
> 
Hmm, maybe this check should just be removed. If we got here it means
the vmf->vma->vm_pgoff was already set in mmap(), so the page should be
valid in that case.

>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	get_page(vmf->page);
>> +
>> +	return 0;
>> +}
>> +
>> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	struct mshv_vp *vp = file->private_data;
>> +
>> +	switch (vma->vm_pgoff) {
>> +	case MSHV_VP_MMAP_OFFSET_REGISTERS:
>> +		if (!vp->vp_register_page)
>> +			return -ENODEV;
>> +		break;
>> +	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
>> +		if (!vp->vp_intercept_msg_page)
>> +			return -ENODEV;
>> +		break;
>> +	case MSHV_VP_MMAP_OFFSET_GHCB:
>> +		if (is_ghcb_mapping_available() && !vp->vp_ghcb_page)
>> +			return -ENODEV;
>> +		break;
> 
> Again, if no GHCB mapping is available, should this return success?
> 
I think this should just check that vp->vp_ghcb_page is not NULL, like
the other cases. is_ghcb_mapping_available() is already checked to
decide whether to map the page in the first place. I'll change it.

>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	vma->vm_ops = &mshv_vp_vm_ops;
>> +	return 0;
>> +}
<snip>
>> +
>> +	input_vtl.as_uint8 = 0;
> 
> I see eight occurrences in this source code file where the above statement
> occurs and there is no further modification. Perhaps declare a static
> variable that is initialized properly, and use it as the input parameter to the
> various functions.  A second static variable could have the use_target_vtl = 1
> setting that is needed in three places.
> 
I was a bit doubtful, but I tried this and it removes quite a few lines without
much tradeoff in readability. Thanks!
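
Roughly what I tried, as a standalone sketch. The union below is a local
stand-in whose layout I am assuming matches the hypervisor definition, and
the variable and helper names are placeholders, not necessarily what will
end up in the next version.

```c
#include <stdint.h>

/* Local stand-in for the input VTL union; the field layout is an assumption. */
union sketch_input_vtl {
	uint8_t as_uint8;
	struct {
		uint8_t target_vtl : 4;
		uint8_t use_target_vtl : 1;
		uint8_t reserved_z : 3;
	};
};

/* Replaces the repeated "input_vtl.as_uint8 = 0;" statements. */
static const union sketch_input_vtl input_vtl_zero = { .as_uint8 = 0 };

/* Replaces the call sites that also set use_target_vtl = 1. */
static const union sketch_input_vtl input_vtl_normal = {
	.use_target_vtl = 1,
};

/* Hypothetical consumer: takes the union by value, like the hypercall helpers. */
static uint8_t sketch_encode_vtl(union sketch_input_vtl vtl)
{
	return vtl.as_uint8;
}
```

Callers then pass input_vtl_zero or input_vtl_normal directly instead of
declaring and filling a local variable each time.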

>> +	ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
>> +					HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
>> +					input_vtl,
>> +					&intercept_message_page);
<snip>
>> +static int mshv_init_async_handler(struct mshv_partition *partition)
>> +{
>> +	if (completion_done(&partition->async_hypercall)) {
>> +		pt_err(partition,
>> +		       "Cannot issue another async hypercall, while another one in progress!\n");
> 
> Two uses of word "another" in the error message is redundant.  Perhaps
> 
> 	"Cannot issue async hypercall while another one is in progress!"
> 
Thanks, I'll change it.

<snip>
>> +
>> +	/* Reject overlapping regions */
>> +	if (mshv_partition_region_by_gfn(partition, mem->guest_pfn) ||
>> +	    mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1) ||
>> +	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr) ||
>> +	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr + mem->size - 1))
>> +		return -EEXIST;
> 
> Having to fully walk the partition region list four times for the above checks
> isn't the most efficient approach, but I'm guessing that creating a region isn't
> really a hot path so it doesn't matter. And I don't know how long the region list
> typically is.
> 
Indeed, it seems wasteful at first but the list is usually only a few entries long,
and regions are rarely added or removed (usually just at boot).

<snip>
>> +/* Called for unmapping both the guest ram and the mmio space */
>> +static long
>> +mshv_unmap_user_memory(struct mshv_partition *partition,
>> +		       struct mshv_user_mem_region mem)
>> +{
>> +	struct mshv_mem_region *region;
>> +	u32 unmap_flags = 0;
>> +
>> +	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
>> +		return -EINVAL;
>> +
>> +	if (hlist_empty(&partition->pt_mem_regions))
>> +		return -EINVAL;
> 
> Isn't the above check redundant, given the lookup by gfn that is
> done immediately below?
> 
Yes, I'll remove it.

>> +
>> +	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
>> +	if (!region)
>> +		return -EINVAL;
<snip>
>> +	case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
>> +		hv_type_mask = 1;
>> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
>> +			hv_flags.clear_accessed = 1;
>> +			/* not accessed implies not dirty */
>> +			hv_flags.clear_dirty = 1;
>> +		} else { // MSHV_GPAP_ACCESS_OP_SET
> 
> Avoid C++ style comments.
> 
Ack

>> +			hv_flags.set_accessed = 1;
>> +		}
>> +		break;
>> +	case MSHV_GPAP_ACCESS_TYPE_DIRTY:
>> +		hv_type_mask = 2;
>> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
>> +			hv_flags.clear_dirty = 1;
>> +		} else { // MSHV_GPAP_ACCESS_OP_SET
> 
> Same here.
> 
Ack

>> +			hv_flags.set_dirty = 1;
>> +			/* dirty implies accessed */
>> +			hv_flags.set_accessed = 1;
>> +		}
>> +		break;
>> +	}
>> +
>> +	states = vzalloc(states_buf_sz);
>> +	if (!states)
>> +		return -ENOMEM;
>> +
>> +	ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
>> +					    args.gpap_base, hv_flags, &written,
>> +					    states);
>> +	if (ret)
>> +		goto free_return;
>> +
>> +	/*
>> +	 * Overwrite states buffer with bitmap - the bits in hv_type_mask
>> +	 * correspond to bitfields in hv_gpa_page_access_state
>> +	 */
>> +	for (i = 0; i < written; ++i)
>> +		assign_bit(i, (ulong *)states,
> 
> Why the cast to ulong *?  I think this argument to assign_bit() is void *, in
> which case the cast wouldn't be needed.
> 
It looks like assign_bit() and friends resolve to a set of functions which do
take an unsigned long pointer, e.g.:

__set_bit() -> generic___set_bit(unsigned long nr, volatile unsigned long *addr)
set_bit() -> arch_set_bit(unsigned int nr, volatile unsigned long *p)
etc...

So a cast is necessary.

> Also, assign_bit() does atomic bit operations. Doing such in a loop like
> here will really hammer the hardware memory bus with atomic 
> read-modify-write cycles. Use __assign_bit() instead, which does
> non-atomic operations. You don't need atomic here as no other
> threads are modifying the bit array.
> 
I didn't realize it was atomic. I'll change it to __assign_bit().

>> +			   states[i].as_uint8 & hv_type_mask);
> 
> OK, so the starting contents of "states" is an array of bytes. The ending
> contents is an array of bits. This works because every bit in the ending
> bit array is set to either 0 or 1. Overlap occurs on the first iteration
> where the code reads the 0th byte, and writes the 0th bit, which is part of
> the 0th byte. The second iteration reads the 1st byte, and writes the 1st bit,
> which doesn't overlap, and there's no overlap from then on.
> 
> Suppose "written" is not a multiple of 8. The last byte of "states" as an
> array of bits will have some bits that have not been set to either 0 or 1 and
> might be leftover garbage from when "states" was an array of bytes. That
> garbage will get copied to user space. Is that OK? Even if user space knows
> enough to ignore those bits, it seems a little dubious to be copying even
> a few bits of garbage to user space.
> 
> Some comments might help here.
> 
This is a good point. The expectation is indeed that userspace knows which
bits are valid from the returned "written" value, but I agree it's a bit
odd to have some garbage bits in the last byte. How does this look (to be
inserted here directly after the loop):

+       /* zero the unused bits in the last byte of the returned bitmap */
+       if (written > 0) {
+               u8 last_bits_mask;
+               int last_byte_idx;
+               int bits_rem = written % 8;
+
+               /* bits_rem == 0 when all bits in the last byte were assigned */
+               if (bits_rem > 0) {
+                       /* written > 0 ensures last_byte_idx >= 0 */
+                       last_byte_idx = ((written + 7) / 8) - 1;
+                       /* bits_rem > 0 ensures this masks 1 to 7 bits */
+                       last_bits_mask = (1 << bits_rem) - 1;
+                       states[last_byte_idx].as_uint8 &= last_bits_mask;
+               }
+       }

The remaining bytes could be memset() to zero but I think it's fine to leave
them.
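
To make the byte-to-bit conversion concrete, here is a userspace sketch of
the same idea. It is not the driver code: it uses plain byte operations
instead of __assign_bit(), and the helper names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-atomic "assign bit nr" on a byte array (stand-in for __assign_bit()). */
static void sketch_assign_bit(size_t nr, uint8_t *map, int value)
{
	uint8_t mask = (uint8_t)(1u << (nr % 8));

	if (value)
		map[nr / 8] |= mask;
	else
		map[nr / 8] &= (uint8_t)~mask;
}

/*
 * Convert 'written' per-page state bytes into a bitmap, in place.
 * Byte i is always read before bit i is written, and bit i lives in
 * byte i / 8 <= i, so no unread state byte is ever clobbered.
 * Returns the number of bitmap bytes that are valid afterwards.
 */
static size_t sketch_states_to_bitmap(uint8_t *states, size_t written,
				      uint8_t type_mask)
{
	size_t bits_rem = written % 8;
	size_t i;

	for (i = 0; i < written; i++)
		sketch_assign_bit(i, states, states[i] & type_mask);

	/* Zero the leftover state bits in a partially filled last byte. */
	if (bits_rem)
		states[written / 8] &= (uint8_t)((1u << bits_rem) - 1);

	return (written + 7) / 8;
}
```

With 9 states and type_mask == 1, the second byte starts out holding the old
state of page 1, and the final mask clears everything in it except bit 0.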

>> +
>> +	args.page_count = written;
>> +
>> +	if (copy_to_user(user_args, &args, sizeof(args))) {
>> +		ret = -EFAULT;
>> +		goto free_return;
>> +	}
>> +	if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
>> +		ret = -EFAULT;
>> +
>> +free_return:
>> +	vfree(states);
>> +	return ret;
>> +}
<snip>
>> +static void
>> +handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>> +{
>> +	int bank_idx, vps_signaled = 0, bank_mask_size;
>> +	struct mshv_partition *partition;
>> +	const struct hv_vpset *vpset;
>> +	const u64 *bank_contents;
>> +	u64 partition_id = msg->partition_id;
>> +
>> +	if (msg->vp_bitset.bitset.format != HV_GENERIC_SET_SPARSE_4K) {
>> +		pr_debug("scheduler message format is not HV_GENERIC_SET_SPARSE_4K");
>> +		return;
>> +	}
>> +
>> +	if (msg->vp_count == 0) {
>> +		pr_debug("scheduler message with no VP specified");
>> +		return;
>> +	}
>> +
>> +	rcu_read_lock();
>> +
>> +	partition = mshv_partition_find(partition_id);
>> +	if (unlikely(!partition)) {
>> +		pr_debug("failed to find partition %llu\n", partition_id);
>> +		goto unlock_out;
>> +	}
>> +
>> +	vpset = &msg->vp_bitset.bitset;
>> +
>> +	bank_idx = -1;
>> +	bank_contents = vpset->bank_contents;
>> +	bank_mask_size = sizeof(vpset->valid_bank_mask) * BITS_PER_BYTE;
>> +
>> +	while (true) {
>> +		int vp_bank_idx = -1;
>> +		int vp_bank_size = sizeof(*bank_contents) * BITS_PER_BYTE;
>> +		int vp_index;
>> +
>> +		bank_idx = find_next_bit((unsigned long *)&vpset->valid_bank_mask,
>> +					 bank_mask_size, bank_idx + 1);
>> +		if (bank_idx == bank_mask_size)
>> +			break;
>> +
>> +		while (true) {
>> +			struct mshv_vp *vp;
>> +
>> +			vp_bank_idx = find_next_bit((unsigned long *)bank_contents,
>> +						    vp_bank_size, vp_bank_idx + 1);
>> +			if (vp_bank_idx == vp_bank_size)
>> +				break;
>> +
>> +			vp_index = (bank_idx << HV_GENERIC_SET_SHIFT) + vp_bank_idx;
> 
> This would be clearer if just multiplied by bank_mask_size instead of shifting.
> Since the compiler knows the constant value of bank_mask_size, it should generate
> the same code as the shift.
> 
I agree, but it should be multiplied by vp_bank_size as that's the size of a bank
in bits, as opposed to bank_mask_size which is the size of the valid banks mask in bits.

(They're both the same value though, 64).
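
As a standalone illustration of the index arithmetic, here is a sketch that
walks a sparse set and collects VP indices. It assumes bank_contents holds
one entry per bit set in valid_bank_mask, packed in order; the function and
macro names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BANK_MASK_BITS	64	/* bits in valid_bank_mask */
#define SKETCH_VP_BANK_BITS	64	/* bits in one bank_contents entry */

/*
 * Collect up to out_cap VP indices from a sparse set.
 * Assumption: bank_contents is packed, one u64 per valid bank.
 * Returns the number of indices stored in 'out'.
 */
static size_t sketch_collect_vp_indices(uint64_t valid_bank_mask,
					const uint64_t *bank_contents,
					uint32_t *out, size_t out_cap)
{
	unsigned int bank_idx, vp_bank_idx;
	size_t n = 0;

	for (bank_idx = 0; bank_idx < SKETCH_BANK_MASK_BITS; bank_idx++) {
		uint64_t bank;

		if (!(valid_bank_mask & (UINT64_C(1) << bank_idx)))
			continue;

		bank = *bank_contents++;
		for (vp_bank_idx = 0; vp_bank_idx < SKETCH_VP_BANK_BITS; vp_bank_idx++) {
			if (!(bank & (UINT64_C(1) << vp_bank_idx)))
				continue;
			if (n == out_cap)
				return n;
			/* Multiply by the bank size in bits, not the mask size. */
			out[n++] = bank_idx * SKETCH_VP_BANK_BITS + vp_bank_idx;
		}
	}
	return n;
}
```

For example, with banks 0 and 2 valid, bit 63 of the second stored bank maps
to VP index 2 * 64 + 63 = 191.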

<snip>
>> +
>> +enum {
>> +	MSHV_GPAP_ACCESS_TYPE_ACCESSED = 0,
>> +	MSHV_GPAP_ACCESS_TYPE_DIRTY,
>> +	MSHV_GPAP_ACCESS_TYPE_COUNT		/* Count of enum members */
>> +};
>> +
>> +enum {
>> +	MSHV_GPAP_ACCESS_OP_NOOP = 0,
>> +	MSHV_GPAP_ACCESS_OP_CLEAR,
>> +	MSHV_GPAP_ACCESS_OP_SET,
>> +	MSHV_GPAP_ACCESS_OP_COUNT		/* Count of enum members */
>> +};
> 
> Any reason these two enums explicitly set the first value to 0, while
> earlier enums do not?  This is another case of there being a difference,
> and me wondering if it's just gratuitous or if there's a specific reason.
> Consistency is a good thing!
> 
No reason, I'll remove these assignments.

<snip>
>> +/* Partition fds created with MSHV_CREATE_PARTITION */
>> +#define MSHV_INITIALIZE_PARTITION	_IO(MSHV_IOCTL, 0x00)
>> +#define MSHV_CREATE_VP			_IOW(MSHV_IOCTL, 0x01, struct mshv_create_vp)
>> +#define MSHV_SET_GUEST_MEMORY		_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
>> +#define MSHV_IRQFD			_IOW(MSHV_IOCTL, 0x03, struct mshv_user_irqfd)
>> +#define MSHV_IOEVENTFD			_IOW(MSHV_IOCTL, 0x04, struct mshv_user_ioeventfd)
>> +#define MSHV_SET_MSI_ROUTING		_IOW(MSHV_IOCTL, 0x05, struct mshv_user_irq_table)
>> +#define MSHV_GET_GPAP_ACCESS_BITMAP	_IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
>> +/* Generic hypercall */
>> +#define MSHV_ROOT_HVCALL		_IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
> 
> I really don't like having the ioctl numbers here overlap with the /dev/mshv ioctls.
> There's just no need to overlap. But I realize changing it now is a big hassle.
> 
Fair enough, there isn't a real need for overlap between the different device IOCTLs.
But, yes, I am going to leave them alone unless there's a really good reason.

>> +
>> +/*
>> + ********************************
>> + * VP APIs for child partitions *
>> + ********************************
>> + */
>> +
>> +#define MSHV_RUN_VP_BUF_SZ 256
>> +
>> +/*
>> + * Map various VP state pages to userspace.
>> + * Multiply the offset by PAGE_SIZE before being passed as the 'offset'
>> + * argument to mmap().
>> + * e.g.
>> + * void *reg_page = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
>> + *                       MAP_SHARED, vp_fd,
>> + *                       MSHV_VP_MMAP_OFFSET_REGISTERS * PAGE_SIZE);
>> + */
> 
> This is interesting.  I would not have thought PAGE_SIZE is available
> in the UAPI.  You must use something like the getpagesize() call. I know
> the root partition can only run with a 4K page size, but the symbol
> "PAGE_SIZE" is probably kernel code only.
> 
PAGE_SIZE here is meant to imply using whatever the system page size is,
but I think it's probably better to be explicit in the example. I will
change it to sysconf(_SC_PAGE_SIZE) as that seems to be the recommended way.

While at it I realized there were some more references to PAGE_SIZE and
HV_HYP_PAGE_SIZE in this file, but neither is defined in uapi.
I'm going to add a new #define MSHV_HV_PAGE_SIZE which matches the
hypervisor native page size of 0x1000 for these cases.

This mmap() call is the only time where the system page size is needed
instead of the Hyper-V page size.
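
Something like the following is what the updated comment example would boil
down to in a VMM. This is only a sketch: the helper names are invented, and
the page index is passed in by the caller rather than taken from the uapi
header, which this snippet does not include.

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * mmap() offset for a VP state page: the page index (one of the
 * MSHV_VP_MMAP_OFFSET_* values) times the system page size.
 */
static off_t sketch_vp_mmap_offset(unsigned int page_index)
{
	return (off_t)page_index * (off_t)sysconf(_SC_PAGE_SIZE);
}

/* Map one VP state page from a VP fd; returns MAP_FAILED on error. */
static void *sketch_map_vp_state_page(int vp_fd, unsigned int page_index)
{
	size_t len = (size_t)sysconf(_SC_PAGE_SIZE);

	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		    vp_fd, sketch_vp_mmap_offset(page_index));
}
```

A caller would check the result against MAP_FAILED and munmap() the page
with the same length when done.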

>> +enum {
>> +	MSHV_VP_MMAP_OFFSET_REGISTERS,
>> +	MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE,
>> +	MSHV_VP_MMAP_OFFSET_GHCB,
>> +	MSHV_VP_MMAP_OFFSET_COUNT
>> +};
>> +
>> +/**
>> + * struct mshv_run_vp - argument for MSHV_RUN_VP
>> + * @msg_buf: On success, the intercept message is copied here. It can be
>> + *           interpreted using the relevant hypervisor definitions.
>> + */
>> +struct mshv_run_vp {
>> +	__u8 msg_buf[MSHV_RUN_VP_BUF_SZ];
>> +};
>> +
>> +enum {
>> +	MSHV_VP_STATE_LAPIC,		/* Local interrupt controller state (either arch) */
>> +	MSHV_VP_STATE_XSAVE,		/* XSAVE data in compacted form (x86_64) */
>> +	MSHV_VP_STATE_SIMP,
>> +	MSHV_VP_STATE_SIEFP,
>> +	MSHV_VP_STATE_SYNTHETIC_TIMERS,
>> +	MSHV_VP_STATE_COUNT,
>> +};
>> +
>> +/**
>> + * struct mshv_get_set_vp_hvcall - arguments for MSHV_[GET,SET]_VP_STATE
> 
> s/hvcall/state/
> 
Ack

<snip>
Thanks for the comments
Nuno

^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-19  0:34     ` Nuno Das Neves
@ 2025-03-19  2:10       ` Michael Kelley
  2025-03-19 15:26         ` Michael Kelley
  0 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-19  2:10 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, March 18, 2025 5:34 PM
> 
> On 3/17/2025 4:51 PM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> <snip>
> >> +
> >> +/* TODO move this to another file when debugfs code is added */
> >> +enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */
> >> +#if defined(CONFIG_X86)
> >> +	VpRootDispatchThreadBlocked			= 201,
> >> +#elif defined(CONFIG_ARM64)
> >> +	VpRootDispatchThreadBlocked			= 94,
> >> +#endif
> >> +	VpStatsMaxCounter
> >> +};
> >
> > Where do these "magic" numbers come from?  Are they matching something
> > in the Hyper-V host?
> >
> They are part of the hypervisor ABI and really belong in hvhdk.h. These enums
> have many members which we use in another part of the code but which are omitted
> here.
> 
> For this patchset I put them here to avoid putting PascalCase definitions in the
> global namespace. I was undecided if we want to keep them like this (maybe keeping
> them out of hvhdk.h), or change them to linux style in a followup.

OK.  I don't object to them staying like this pending a future decision.

> <snip>
> >> +
> >> +	switch (args.type) {
> >> +	case MSHV_VP_STATE_LAPIC:
> >> +		state_data.type = HV_GET_SET_VP_STATE_LAPIC_STATE;
> >> +		data_sz = HV_HYP_PAGE_SIZE;
> >> +		break;
> >> +	case MSHV_VP_STATE_XSAVE:
> >
> > Just FYI, you can put a semicolon after the colon on the above line, which
> > adds a null statement, and then the C compiler will accept the definition
> > of local variable data_sz_64 without needing the odd-looking braces.
> >
> > I learn something new every day! :-)
> >
> I didn't know that! But actually I prefer the braces because they clearly
> denote a new block scope for that case.

That's fine.

> >> +	{
> >> +		u64 data_sz_64;
> >> +
> >> +		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
> >> +						     HV_PARTITION_PROPERTY_XSAVE_STATES,
> >> +						     &state_data.xsave.states.as_uint64);
> >> +		if (ret)
> >> +			return ret;
> >> +
> >> +		ret = hv_call_get_partition_property(vp->vp_partition->pt_id,
> >> +						     HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE,
> >> +						     &data_sz_64);
> >> +		if (ret)
> >> +			return ret;
> >> +
> >> +		data_sz = (u32)data_sz_64;
> >> +		state_data.xsave.flags = 0;
> >> +		/* Always request legacy states */
> >> +		state_data.xsave.states.legacy_x87 = 1;
> >> +		state_data.xsave.states.legacy_sse = 1;
> >> +		state_data.type = HV_GET_SET_VP_STATE_XSAVE;
> >> +		break;
> >> +	}
> >> +	case MSHV_VP_STATE_SIMP:
> >> +		state_data.type = HV_GET_SET_VP_STATE_SIM_PAGE;
> >> +		data_sz = HV_HYP_PAGE_SIZE;
> >> +		break;
> >> +	case MSHV_VP_STATE_SIEFP:
> >> +		state_data.type = HV_GET_SET_VP_STATE_SIEF_PAGE;
> >> +		data_sz = HV_HYP_PAGE_SIZE;
> >> +		break;
> >> +	case MSHV_VP_STATE_SYNTHETIC_TIMERS:
> >> +		state_data.type = HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS;
> >> +		data_sz = sizeof(vp_state.synthetic_timers_state);
> >> +		break;
> >> +	default:
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	if (copy_to_user(&user_args->buf_sz, &data_sz, sizeof(user_args->buf_sz)))
> >> +		return -EFAULT;
> >> +
> >> +	if (data_sz > args.buf_sz)
> >> +		return -EINVAL;
> >> +
> >> +	/* If the data is transmitted via pfns, delegate to helper */
> >> +	if (state_data.type & HV_GET_SET_VP_STATE_TYPE_PFN) {
> >> +		unsigned long user_pfn = PFN_DOWN(args.buf_ptr);
> >> +		size_t page_count = PFN_DOWN(args.buf_sz);
> >> +
> >> +		return mshv_vp_ioctl_get_set_state_pfn(vp, state_data, user_pfn,
> >> +						       page_count, is_set);
> >> +	}
> >> +
> >> +	/* Paranoia check - this shouldn't happen! */
> >> +	if (data_sz > sizeof(vp_state)) {
> >> +		vp_err(vp, "Invalid vp state data size!\n");
> >> +		return -EINVAL;
> >> +	}
> >
> > I don't understand the above check.  sizeof(vp_state) is relatively small since
> > it is effectively sizeof(hv_synthetic_timers_state), which is 200 bytes if I've
> > done the arithmetic correctly. But data_sz could be a full page (4096 bytes)
> > for the LAPIC, SIMP, and SIEFP cases, and the check would cause an error to
> > be returned.
> >
> data_sz > sizeof(vp_state) is true if and only if the HV_GET_SET_VP_STATE_TYPE_PFN
> bit is set in state_data.type. This check ensures that invariant holds.
> 
> See just above where we delegate to mshv_vp_ioctl_get_set_state_pfn() in that case.

OK. Got it now.  

> 
> >> +
> >> +	if (is_set) {
> >> +		if (copy_from_user(&vp_state, (__user void *)args.buf_ptr, data_sz))
> >> +			return -EFAULT;
> >> +
> >> +		return hv_call_set_vp_state(vp->vp_index,
> >> +					    vp->vp_partition->pt_id,
> >> +					    state_data, 0, NULL,
> >> +					    sizeof(vp_state), (u8 *)&vp_state);
> >
> > This is one of the cases where data from user space gets passed directly to
> > the hypercall. So user space is responsible for ensuring that reserved fields
> > are zero'ed and for otherwise ensuring a proper hypercall input. I just
> > wonder if user space really does this correctly.
> >
> The interfaces that are 'passthrough' like this remove quite a bit of
> complexity from the kernel code and delegate it to userspace and the hypervisor.
> 
> It is on userspace to ensure the parameters are valid, and it's on the
> hypervisor to check the fields and error if they are used improperly.
> 
> Note the hypervisor still needs to check everything regardless of if it comes
> from the kernel or directly from userspace.
> 
> >> +
> >> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
> >> +{
> >> +	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
> >> +
> >> +	switch (vmf->vma->vm_pgoff) {
> >> +	case MSHV_VP_MMAP_OFFSET_REGISTERS:
> >> +		vmf->page = virt_to_page(vp->vp_register_page);
> >> +		break;
> >> +	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
> >> +		vmf->page = virt_to_page(vp->vp_intercept_msg_page);
> >> +		break;
> >> +	case MSHV_VP_MMAP_OFFSET_GHCB:
> >> +		if (is_ghcb_mapping_available())
> >> +			vmf->page = virt_to_page(vp->vp_ghcb_page);
> >> +		break;
> >
> > If there's no GHCB mapping available, execution just continues with
> > vmf->page not set. Won't the later get_page() call fail? Perhaps this
> > should fail if there's no GHCB mapping available. Or maybe there's
> > more about how this works that I'm ignorant of. :-)
> >
> Hmm, maybe this check should just be removed. If we got here it means
> the vmf->vma->vm_pgoff was already set in mmap(), so the page should be
> valid in that case.
> 
> >> +	default:
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	get_page(vmf->page);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
> >> +{
> >> +	struct mshv_vp *vp = file->private_data;
> >> +
> >> +	switch (vma->vm_pgoff) {
> >> +	case MSHV_VP_MMAP_OFFSET_REGISTERS:
> >> +		if (!vp->vp_register_page)
> >> +			return -ENODEV;
> >> +		break;
> >> +	case MSHV_VP_MMAP_OFFSET_INTERCEPT_MESSAGE:
> >> +		if (!vp->vp_intercept_msg_page)
> >> +			return -ENODEV;
> >> +		break;
> >> +	case MSHV_VP_MMAP_OFFSET_GHCB:
> >> +		if (is_ghcb_mapping_available() && !vp->vp_ghcb_page)
> >> +			return -ENODEV;
> >> +		break;
> >
> > Again, if no GHCB mapping is available, should this return success?
> >
> I think this should just check the vp->vp_ghcb_page is not NULL, like
> the other cases. is_ghcb_mapping_available() is already checked to
> decide whether to map the page in the first place. I'll change it.
> 
> >> +	default:
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	vma->vm_ops = &mshv_vp_vm_ops;
> >> +	return 0;
> >> +}
> <snip>
> >> +
> >> +	input_vtl.as_uint8 = 0;
> >
> > I see eight occurrences in this source code file where the above statement
> > occurs and there is no further modification. Perhaps declare a static
> > variable that is initialized properly, and use it as the input parameter to the
> > various functions.  A second static variable could have the use_target_vtl = 1
> > setting that is needed in three places.
> >
> I was a bit doubtful, but I tried this and it removes quite a few lines without
> much tradeoff in readability. Thanks!
> 
> >> +	ret = hv_call_map_vp_state_page(partition->pt_id, args.vp_index,
> >> +					HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
> >> +					input_vtl,
> >> +					&intercept_message_page);
> <snip>
> >> +static int mshv_init_async_handler(struct mshv_partition *partition)
> >> +{
> >> +	if (completion_done(&partition->async_hypercall)) {
> >> +		pt_err(partition,
> >> +		       "Cannot issue another async hypercall, while another one in progress!\n");
> >
> > Two uses of word "another" in the error message is redundant.  Perhaps
> >
> > 	"Cannot issue async hypercall while another one is in progress!"
> >
> Thanks, I'll change it.
> 
> <snip>
> >> +
> >> +	/* Reject overlapping regions */
> >> +	if (mshv_partition_region_by_gfn(partition, mem->guest_pfn) ||
> >> +	    mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1) ||
> >> +	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr) ||
> >> +	    mshv_partition_region_by_uaddr(partition, mem->userspace_addr + mem->size - 1))
> >> +		return -EEXIST;
> >
> > Having to fully walk the partition region list four times for the above checks
> > isn't the most efficient approach, but I'm guessing that creating a region isn't
> > really a hot path so it doesn't matter. And I don't know how long the region list
> > typically is.
> >
> Indeed, it seems wasteful at first but the list is usually only a few entries long,
> and regions are rarely added or removed (usually just at boot).

OK, not a problem then.
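
For reference, a minimal userspace sketch of how the same rejection could be
done in a single walk, assuming each region is described by a gfn range and a
userspace address range. struct region, ranges_overlap() and
region_conflicts() are hypothetical names for illustration only, not the
driver's API. An interval test like this also reports a new region that
entirely encloses an existing one; whether the endpoint lookups in the patch
cover that case is not established here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's region descriptor. */
struct region {
	uint64_t start_gfn;	/* first guest frame number */
	uint64_t nr_pages;	/* length of the gfn range, in pages */
	uint64_t start_uaddr;	/* first userspace address */
	uint64_t size;		/* length of the uaddr range, in bytes */
};

/*
 * True if [a, a + a_len) and [b, b + b_len) share at least one element.
 * Assumes neither range wraps around the end of the 64-bit space.
 */
static bool ranges_overlap(uint64_t a, uint64_t a_len,
			   uint64_t b, uint64_t b_len)
{
	return a < b + b_len && b < a + a_len;
}

/* One walk over the regions, testing both the gfn and the uaddr range. */
static bool region_conflicts(const struct region *regions, size_t count,
			     const struct region *new_region)
{
	for (size_t i = 0; i < count; i++) {
		const struct region *r = &regions[i];

		if (ranges_overlap(r->start_gfn, r->nr_pages,
				   new_region->start_gfn,
				   new_region->nr_pages) ||
		    ranges_overlap(r->start_uaddr, r->size,
				   new_region->start_uaddr,
				   new_region->size))
			return true;
	}
	return false;
}
```

Given that the list is short and regions are rarely added, this is a
readability option rather than a performance fix.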

> 
> <snip>
> >> +/* Called for unmapping both the guest ram and the mmio space */
> >> +static long
> >> +mshv_unmap_user_memory(struct mshv_partition *partition,
> >> +		       struct mshv_user_mem_region mem)
> >> +{
> >> +	struct mshv_mem_region *region;
> >> +	u32 unmap_flags = 0;
> >> +
> >> +	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
> >> +		return -EINVAL;
> >> +
> >> +	if (hlist_empty(&partition->pt_mem_regions))
> >> +		return -EINVAL;
> >
> > Isn't the above check redundant, given the lookup by gfn that is
> > done immediately below?
> >
> Yes, I'll remove it.
> 
> >> +
> >> +	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
> >> +	if (!region)
> >> +		return -EINVAL;
> <snip>
> >> +	case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
> >> +		hv_type_mask = 1;
> >> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
> >> +			hv_flags.clear_accessed = 1;
> >> +			/* not accessed implies not dirty */
> >> +			hv_flags.clear_dirty = 1;
> >> +		} else { // MSHV_GPAP_ACCESS_OP_SET
> >
> > Avoid C++ style comments.
> >
> Ack
> 
> >> +			hv_flags.set_accessed = 1;
> >> +		}
> >> +		break;
> >> +	case MSHV_GPAP_ACCESS_TYPE_DIRTY:
> >> +		hv_type_mask = 2;
> >> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
> >> +			hv_flags.clear_dirty = 1;
> >> +		} else { // MSHV_GPAP_ACCESS_OP_SET
> >
> > Same here.
> >
> Ack
> 
> >> +			hv_flags.set_dirty = 1;
> >> +			/* dirty implies accessed */
> >> +			hv_flags.set_accessed = 1;
> >> +		}
> >> +		break;
> >> +	}
> >> +
> >> +	states = vzalloc(states_buf_sz);
> >> +	if (!states)
> >> +		return -ENOMEM;
> >> +
> >> +	ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
> >> +					    args.gpap_base, hv_flags, &written,
> >> +					    states);
> >> +	if (ret)
> >> +		goto free_return;
> >> +
> >> +	/*
> >> +	 * Overwrite states buffer with bitmap - the bits in hv_type_mask
> >> +	 * correspond to bitfields in hv_gpa_page_access_state
> >> +	 */
> >> +	for (i = 0; i < written; ++i)
> >> +		assign_bit(i, (ulong *)states,
> >
> > Why the cast to ulong *?  I think this argument to assign_bit() is void *, in
> > which case the cast wouldn't be needed.
> >
> It looks like assign_bit() and friends resolve to a set of functions which do
> take an unsigned long pointer, e.g.:
> 
> __set_bit() -> generic___set_bit(unsigned long nr, volatile unsigned long *addr)
> set_bit() -> arch_set_bit(unsigned int nr, volatile unsigned long *p)
> etc...
> 
> So a cast is necessary.

Indeed, you are right.  Seems like set_bit() and friends should take a void *.
But that's a different kettle of fish.

> 
> > Also, assign_bit() does atomic bit operations. Doing such in a loop like
> > here will really hammer the hardware memory bus with atomic
> > read-modify-write cycles. Use __assign_bit() instead, which does
> > non-atomic operations. You don't need atomic here as no other
> > threads are modifying the bit array.
> >
> I didn't realize it was atomic. I'll change it to __assign_bit().
> 
> >> +			   states[i].as_uint8 & hv_type_mask);
> >
> > OK, so the starting contents of "states" is an array of bytes. The ending
> > contents is an array of bits. This works because every bit in the ending
> > bit array is set to either 0 or 1. Overlap occurs on the first iteration
> > where the code reads the 0th byte, and writes the 0th bit, which is part of
> > the 0th byte. The second iteration reads the 1st byte, and writes the 1st bit,
> > which doesn't overlap, and there's no overlap from then on.
> >
> > Suppose "written" is not a multiple of 8. The last byte of "states" as an
> > array of bits will have some bits that have not been set to either 0 or 1 and
> > might be leftover garbage from when "states" was an array of bytes. That
> > garbage will get copied to user space. Is that OK? Even if user space knows
> > enough to ignore those bits, it seems a little dubious to be copying even
> > a few bits of garbage to user space.
> >
> > Some comments might help here.
> >
> This is a good point. The expectation is indeed that userspace knows which
> bits are valid from the returned "written" value, but I agree it's a bit
> odd to have some garbage bits in the last byte. How does this look (to be
> inserted here directly after the loop):
> 
> +       /* zero the unused bits in the last byte of the returned bitmap */
> +       if (written > 0) {
> +               u8 last_bits_mask;
> +               int last_byte_idx;
> +               int bits_rem = written % 8;
> +
> +               /* bits_rem == 0 when all bits in the last byte were assigned */
> +               if (bits_rem > 0) {
> +                       /* written > 0 ensures last_byte_idx >= 0 */
> +                       last_byte_idx = ((written + 7) / 8) - 1;
> +                       /* bits_rem > 0 ensures this masks 1 to 7 bits */
> +                       last_bits_mask = (1 << bits_rem) - 1;
> +                       states[last_byte_idx].as_uint8 &= last_bits_mask;
> +               }
> +       }

A simpler approach is to "continue" the previous loop.  And if "written"
is zero, this additional loop won't do anything either: 

	for (i = written; i < ALIGN(written, 8); ++i)
		__clear_bit(i, (ulong *)states);

> 
> The remaining bytes could be memset() to zero but I think it's fine to leave
> them.

I agree.  The remaining bytes aren't written back to user space anyway
since the copy_to_user() uses bitmap_buf_sz.

> 
> >> +
> >> +	args.page_count = written;
> >> +
> >> +	if (copy_to_user(user_args, &args, sizeof(args))) {
> >> +		ret = -EFAULT;
> >> +		goto free_return;
> >> +	}
> >> +	if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
> >> +		ret = -EFAULT;
> >> +
> >> +free_return:
> >> +	vfree(states);
> >> +	return ret;
> >> +}
> <snip>
> >> +static void
> >> +handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
> >> +{
> >> +	int bank_idx, vps_signaled = 0, bank_mask_size;
> >> +	struct mshv_partition *partition;
> >> +	const struct hv_vpset *vpset;
> >> +	const u64 *bank_contents;
> >> +	u64 partition_id = msg->partition_id;
> >> +
> >> +	if (msg->vp_bitset.bitset.format != HV_GENERIC_SET_SPARSE_4K) {
> >> +		pr_debug("scheduler message format is not HV_GENERIC_SET_SPARSE_4K");
> >> +		return;
> >> +	}
> >> +
> >> +	if (msg->vp_count == 0) {
> >> +		pr_debug("scheduler message with no VP specified");
> >> +		return;
> >> +	}
> >> +
> >> +	rcu_read_lock();
> >> +
> >> +	partition = mshv_partition_find(partition_id);
> >> +	if (unlikely(!partition)) {
> >> +		pr_debug("failed to find partition %llu\n", partition_id);
> >> +		goto unlock_out;
> >> +	}
> >> +
> >> +	vpset = &msg->vp_bitset.bitset;
> >> +
> >> +	bank_idx = -1;
> >> +	bank_contents = vpset->bank_contents;
> >> +	bank_mask_size = sizeof(vpset->valid_bank_mask) * BITS_PER_BYTE;
> >> +
> >> +	while (true) {
> >> +		int vp_bank_idx = -1;
> >> +		int vp_bank_size = sizeof(*bank_contents) * BITS_PER_BYTE;
> >> +		int vp_index;
> >> +
> >> +		bank_idx = find_next_bit((unsigned long *)&vpset->valid_bank_mask,
> >> +					 bank_mask_size, bank_idx + 1);
> >> +		if (bank_idx == bank_mask_size)
> >> +			break;
> >> +
> >> +		while (true) {
> >> +			struct mshv_vp *vp;
> >> +
> >> +			vp_bank_idx = find_next_bit((unsigned long *)bank_contents,
> >> +						    vp_bank_size, vp_bank_idx + 1);
> >> +			if (vp_bank_idx == vp_bank_size)
> >> +				break;
> >> +
> >> +			vp_index = (bank_idx << HV_GENERIC_SET_SHIFT) + vp_bank_idx;
> >
> > This would be clearer if just multiplied by bank_mask_size instead of shifting.
> > Since the compiler knows the constant value of bank_mask_size, it should generate
> > the same code as the shift.
> >
> I agree, but it should be multiplied by vp_bank_size as that's the size of a bank
> in bits, as opposed to bank_mask_size which is the size of the valid banks mask in bits.

Yep, you are right.
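
A small userspace sketch of the index arithmetic, assuming 64-bit banks as
stated in this thread. BANK_BITS and bank_to_vp_indices() are hypothetical
names used only for illustration; the driver itself uses find_next_bit().

```c
#include <stddef.h>
#include <stdint.h>

/* Bits per bank word; the thread notes both sizes involved are 64. */
#define BANK_BITS 64

/*
 * Collect the VP indices for the bits set in one bank word.
 * Returns the number of indices stored in out (at most BANK_BITS).
 */
static size_t bank_to_vp_indices(unsigned int bank_idx, uint64_t bank,
				 uint32_t out[BANK_BITS])
{
	size_t n = 0;

	for (unsigned int bit = 0; bit < BANK_BITS; bit++) {
		if (bank & (UINT64_C(1) << bit))
			/* Same value as (bank_idx << 6) + bit. */
			out[n++] = bank_idx * BANK_BITS + bit;
	}
	return n;
}
```

Multiplying by the bank size in bits states the intent directly, and with a
constant power of two the result is identical to the shift.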

> 
> (They're both the same value though, 64).
> 
> <snip>
> >> +
> >> +/*
> >> + * Map various VP state pages to userspace.
> >> + * Multiply the offset by PAGE_SIZE before being passed as the 'offset'
> >> + * argument to mmap().
> >> + * e.g.
> >> + * void *reg_page = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
> >> + *                       MAP_SHARED, vp_fd,
> >> + *                       MSHV_VP_MMAP_OFFSET_REGISTERS * PAGE_SIZE);
> >> + */
> >
> > This is interesting.  I would not have thought PAGE_SIZE is available
> > in the UAPI.  You must use something like the getpagesize() call. I know
> > the root partition can only run with a 4K page size, but the symbol
> > "PAGE_SIZE" is probably kernel code only.
> >
> PAGE_SIZE here is meant to imply using whatever the system page size is,
> but I think it's probably better to be explicit in the example. I will
> change it to sysconf(_SC_PAGE_SIZE) as that seems to be the recommended way.
> 
> While at it I realized there were some more references to PAGE_SIZE and
> HV_HYP_PAGE_SIZE in this file, but neither are defined in uapi.
> I'm going to add a new #define MSHV_HV_PAGE_SIZE which matches the
> hypervisor native page size of 0x1000 for these cases.

OK.

> 
> This mmap() call is the only time where the system page size is needed
> instead of the Hyper-V page size.
> 
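
A minimal userspace sketch of computing the mmap() offset from the system
page size, along the lines proposed above. EXAMPLE_VP_MMAP_OFFSET is a
placeholder value and vp_mmap_offset() is a hypothetical helper; the real
MSHV_VP_MMAP_OFFSET_* constants come from the mshv UAPI header.

```c
#include <sys/types.h>
#include <unistd.h>

/* Placeholder only: stands in for one of the MSHV_VP_MMAP_OFFSET_* values. */
#define EXAMPLE_VP_MMAP_OFFSET 1

/*
 * Convert a VP state page index into the byte offset expected by mmap().
 * Returns (off_t)-1 if the system page size cannot be determined.
 */
static off_t vp_mmap_offset(unsigned int page_index)
{
	long page_size = sysconf(_SC_PAGE_SIZE);

	if (page_size <= 0)
		return (off_t)-1;
	return (off_t)page_index * (off_t)page_size;
}
```

The result would be passed as the last argument to mmap() on the VP file
descriptor, with a length of one system page.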

Michael


* RE: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-19  2:10       ` Michael Kelley
@ 2025-03-19 15:26         ` Michael Kelley
  2025-03-19 18:04           ` Nuno Das Neves
  0 siblings, 1 reply; 108+ messages in thread
From: Michael Kelley @ 2025-03-19 15:26 UTC (permalink / raw)
  To: Michael Kelley, Nuno Das Neves, linux-hyperv@vger.kernel.org,
	x86@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

From: Michael Kelley <mhklinux@outlook.com> Sent: Tuesday, March 18, 2025 7:10 PM
> 
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, March
> 18, 2025 5:34 PM
> >
> > On 3/17/2025 4:51 PM, Michael Kelley wrote:
> > > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM

[snip]

> > >> +
> > >> +	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
> > >> +	if (!region)
> > >> +		return -EINVAL;
> > <snip>
> > >> +	case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
> > >> +		hv_type_mask = 1;
> > >> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
> > >> +			hv_flags.clear_accessed = 1;
> > >> +			/* not accessed implies not dirty */
> > >> +			hv_flags.clear_dirty = 1;
> > >> +		} else { // MSHV_GPAP_ACCESS_OP_SET
> > >
> > > Avoid C++ style comments.
> > >
> > Ack
> >
> > >> +			hv_flags.set_accessed = 1;
> > >> +		}
> > >> +		break;
> > >> +	case MSHV_GPAP_ACCESS_TYPE_DIRTY:
> > >> +		hv_type_mask = 2;
> > >> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
> > >> +			hv_flags.clear_dirty = 1;
> > >> +		} else { // MSHV_GPAP_ACCESS_OP_SET
> > >
> > > Same here.
> > >
> > Ack
> >
> > >> +			hv_flags.set_dirty = 1;
> > >> +			/* dirty implies accessed */
> > >> +			hv_flags.set_accessed = 1;
> > >> +		}
> > >> +		break;
> > >> +	}
> > >> +
> > >> +	states = vzalloc(states_buf_sz);
> > >> +	if (!states)
> > >> +		return -ENOMEM;
> > >> +
> > >> +	ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
> > >> +					    args.gpap_base, hv_flags, &written,
> > >> +					    states);
> > >> +	if (ret)
> > >> +		goto free_return;
> > >> +
> > >> +	/*
> > >> +	 * Overwrite states buffer with bitmap - the bits in hv_type_mask
> > >> +	 * correspond to bitfields in hv_gpa_page_access_state
> > >> +	 */
> > >> +	for (i = 0; i < written; ++i)
> > >> +		assign_bit(i, (ulong *)states,
> > >
> > > Why the cast to ulong *?  I think this argument to assign_bit() is void *, in
> > > which case the cast wouldn't be needed.
> > >
> > It looks like assign_bit() and friends resolve to a set of functions which do
> > take an unsigned long pointer, e.g.:
> >
> > __set_bit() -> generic___set_bit(unsigned long nr, volatile unsigned long *addr)
> > set_bit() -> arch_set_bit(unsigned int nr, volatile unsigned long *p)
> > etc...
> >
> > So a cast is necessary.
> 
> Indeed, you are right.  Seems like set_bit() and friends should take a void *.
> But that's a different kettle of fish.
> 
> >
> > > Also, assign_bit() does atomic bit operations. Doing such in a loop like
> > > here will really hammer the hardware memory bus with atomic
> > > read-modify-write cycles. Use __assign_bit() instead, which does
> > > non-atomic operations. You don't need atomic here as no other
> > > threads are modifying the bit array.
> > >
> > I didn't realize it was atomic. I'll change it to __assign_bit().
> >
> > >> +			   states[i].as_uint8 & hv_type_mask);
> > >
> > > OK, so the starting contents of "states" is an array of bytes. The ending
> > > contents is an array of bits. This works because every bit in the ending
> > > bit array is set to either 0 or 1. Overlap occurs on the first iteration
> > > where the code reads the 0th byte, and writes the 0th bit, which is part of
> > > the 0th byte. The second iteration reads the 1st byte, and writes the 1st bit,
> > > which doesn't overlap, and there's no overlap from then on.
> > >
> > > Suppose "written" is not a multiple of 8. The last byte of "states" as an
> > > array of bits will have some bits that have not been set to either 0 or 1 and
> > > might be leftover garbage from when "states" was an array of bytes. That
> > > garbage will get copied to user space. Is that OK? Even if user space knows
> > > enough to ignore those bits, it seems a little dubious to be copying even
> > > a few bits of garbage to user space.
> > >
> > > Some comments might help here.
> > >
> > This is a good point. The expectation is indeed that userspace knows which
> > bits are valid from the returned "written" value, but I agree it's a bit
> > odd to have some garbage bits in the last byte. How does this look (to be
> > inserted here directly after the loop):
> >
> > +       /* zero the unused bits in the last byte of the returned bitmap */
> > +       if (written > 0) {
> > +               u8 last_bits_mask;
> > +               int last_byte_idx;
> > +               int bits_rem = written % 8;
> > +
> > +               /* bits_rem == 0 when all bits in the last byte were assigned */
> > +               if (bits_rem > 0) {
> > +                       /* written > 0 ensures last_byte_idx >= 0 */
> > +                       last_byte_idx = ((written + 7) / 8) - 1;
> > +                       /* bits_rem > 0 ensures this masks 1 to 7 bits */
> > +                       last_bits_mask = (1 << bits_rem) - 1;
> > +                       states[last_byte_idx].as_uint8 &= last_bits_mask;
> > +               }
> > +       }
> 
> A simpler approach is to "continue" the previous loop.  And if "written"
> is zero, this additional loop won't do anything either:
> 
> 	for (i = written; i < ALIGN(written, 8); ++i)
> 		__clear_bit(i, (ulong *)states);
> 

One further thought here: Could "written" be less than
args.page_count at this point? That would require
hv_call_get_gpa_access_states() to not fail, but still return
a value for written that is less than args.page_count. If that
could happen, then the above loop should be:

	for (i = written; i < bitmap_buf_sz * 8; ++i)
		__clear_bit(i, (ulong *)states);

so that all the uninitialized bits and bytes that will be written
back to user space are cleared.

> >
> > The remaining bytes could be memset() to zero but I think it's fine to leave
> > them.
> 
> I agree.  The remaining bytes aren't written back to user space anyway
> since the copy_to_user() uses bitmap_buf_sz.

Maybe I misunderstood what you meant by "remaining bytes".  I think
all bits and bytes that are written back to user space should have
valid data or zeros so that no garbage is written back.

Michael

> 
> >
> > >> +
> > >> +	args.page_count = written;
> > >> +
> > >> +	if (copy_to_user(user_args, &args, sizeof(args))) {
> > >> +		ret = -EFAULT;
> > >> +		goto free_return;
> > >> +	}
> > >> +	if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
> > >> +		ret = -EFAULT;
> > >> +
> > >> +free_return:
> > >> +	vfree(states);
> > >> +	return ret;
> > >> +}


* Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  2025-03-19 15:26         ` Michael Kelley
@ 2025-03-19 18:04           ` Nuno Das Neves
  0 siblings, 0 replies; 108+ messages in thread
From: Nuno Das Neves @ 2025-03-19 18:04 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-acpi@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, catalin.marinas@arm.com, will@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	daniel.lezcano@linaro.org, joro@8bytes.org, robin.murphy@arm.com,
	arnd@arndb.de, jinankjain@linux.microsoft.com,
	muminulrussell@gmail.com, skinsburskii@linux.microsoft.com,
	mrathor@linux.microsoft.com, ssengar@linux.microsoft.com,
	apais@linux.microsoft.com, Tianyu.Lan@microsoft.com,
	stanislav.kinsburskiy@gmail.com, gregkh@linuxfoundation.org,
	vkuznets@redhat.com, prapal@linux.microsoft.com,
	muislam@microsoft.com, anrayabh@linux.microsoft.com,
	rafael@kernel.org, lenb@kernel.org, corbet@lwn.net

On 3/19/2025 8:26 AM, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com> Sent: Tuesday, March 18, 2025 7:10 PM
>>
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Tuesday, March
>> 18, 2025 5:34 PM
>>>
>>> On 3/17/2025 4:51 PM, Michael Kelley wrote:
>>>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Wednesday, February 26, 2025 3:08 PM
> 
> [snip]
> 
>>>>> +
>>>>> +	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
>>>>> +	if (!region)
>>>>> +		return -EINVAL;
>>> <snip>
>>>>> +	case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
>>>>> +		hv_type_mask = 1;
>>>>> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
>>>>> +			hv_flags.clear_accessed = 1;
>>>>> +			/* not accessed implies not dirty */
>>>>> +			hv_flags.clear_dirty = 1;
>>>>> +		} else { // MSHV_GPAP_ACCESS_OP_SET
>>>>
>>>> Avoid C++ style comments.
>>>>
>>> Ack
>>>
>>>>> +			hv_flags.set_accessed = 1;
>>>>> +		}
>>>>> +		break;
>>>>> +	case MSHV_GPAP_ACCESS_TYPE_DIRTY:
>>>>> +		hv_type_mask = 2;
>>>>> +		if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
>>>>> +			hv_flags.clear_dirty = 1;
>>>>> +		} else { // MSHV_GPAP_ACCESS_OP_SET
>>>>
>>>> Same here.
>>>>
>>> Ack
>>>
>>>>> +			hv_flags.set_dirty = 1;
>>>>> +			/* dirty implies accessed */
>>>>> +			hv_flags.set_accessed = 1;
>>>>> +		}
>>>>> +		break;
>>>>> +	}
>>>>> +
>>>>> +	states = vzalloc(states_buf_sz);
>>>>> +	if (!states)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
>>>>> +					    args.gpap_base, hv_flags, &written,
>>>>> +					    states);
>>>>> +	if (ret)
>>>>> +		goto free_return;
>>>>> +
>>>>> +	/*
>>>>> +	 * Overwrite states buffer with bitmap - the bits in hv_type_mask
>>>>> +	 * correspond to bitfields in hv_gpa_page_access_state
>>>>> +	 */
>>>>> +	for (i = 0; i < written; ++i)
>>>>> +		assign_bit(i, (ulong *)states,
>>>>
>>>> Why the cast to ulong *?  I think this argument to assign_bit() is void *, in
>>>> which case the cast wouldn't be needed.
>>>>
>>> It looks like assign_bit() and friends resolve to a set of functions which do
>>> take an unsigned long pointer, e.g.:
>>>
>>> __set_bit() -> generic___set_bit(unsigned long nr, volatile unsigned long *addr)
>>> set_bit() -> arch_set_bit(unsigned int nr, volatile unsigned long *p)
>>> etc...
>>>
>>> So a cast is necessary.
>>
>> Indeed, you are right.  Seems like set_bit() and friends should take a void *.
>> But that's a different kettle of fish.
>>
>>>
>>>> Also, assign_bit() does atomic bit operations. Doing such in a loop like
>>>> here will really hammer the hardware memory bus with atomic
>>>> read-modify-write cycles. Use __assign_bit() instead, which does
>>>> non-atomic operations. You don't need atomic here as no other
>>>> threads are modifying the bit array.
>>>>
>>> I didn't realize it was atomic. I'll change it to __assign_bit().
>>>
>>>>> +			   states[i].as_uint8 & hv_type_mask);
>>>>
>>>> OK, so the starting contents of "states" is an array of bytes. The ending
>>>> contents is an array of bits. This works because every bit in the ending
>>>> bit array is set to either 0 or 1. Overlap occurs on the first iteration
>>>> where the code reads the 0th byte, and writes the 0th bit, which is part of
>>>> the 0th byte. The second iteration reads the 1st byte, and writes the 1st bit,
>>>> which doesn't overlap, and there's no overlap from then on.
>>>>
>>>> Suppose "written" is not a multiple of 8. The last byte of "states" as an
>>>> array of bits will have some bits that have not been set to either 0 or 1 and
>>>> might be leftover garbage from when "states" was an array of bytes. That
>>>> garbage will get copied to user space. Is that OK? Even if user space knows
>>>> enough to ignore those bits, it seems a little dubious to be copying even
>>>> a few bits of garbage to user space.
>>>>
>>>> Some comments might help here.
>>>>
>>> This is a good point. The expectation is indeed that userspace knows which
>>> bits are valid from the returned "written" value, but I agree it's a bit
>>> odd to have some garbage bits in the last byte. How does this look (to be
>>> inserted here directly after the loop):
>>>
>>> +       /* zero the unused bits in the last byte of the returned bitmap */
>>> +       if (written > 0) {
>>> +               u8 last_bits_mask;
>>> +               int last_byte_idx;
>>> +               int bits_rem = written % 8;
>>> +
>>> +               /* bits_rem == 0 when all bits in the last byte were assigned */
>>> +               if (bits_rem > 0) {
>>> +                       /* written > 0 ensures last_byte_idx >= 0 */
>>> +                       last_byte_idx = ((written + 7) / 8) - 1;
>>> +                       /* bits_rem > 0 ensures this masks 1 to 7 bits */
>>> +                       last_bits_mask = (1 << bits_rem) - 1;
>>> +                       states[last_byte_idx].as_uint8 &= last_bits_mask;
>>> +               }
>>> +       }
>>
>> A simpler approach is to "continue" the previous loop.  And if "written"
>> is zero, this additional loop won't do anything either:
>>
>> 	for (i = written; i < ALIGN(written, 8); ++i)
>> 		__clear_bit(i, (ulong *)states);
>>
> 
> One further thought here: Could "written" be less than
> args.page_count at this point? That would require
> hv_call_get_gpa_access_states() to not fail, but still return
> a value for written that is less than args.page_count. If that
> could happen, then the above loop should be:
> 
> 	for (i = written; i < bitmap_buf_sz * 8; ++i)
> 		__clear_bit(i, (ulong *)states);
> 
> so that all the uninitialized bits and bytes that will be written
> back to user space are cleared.
> 
Hmmm...now I'm not so sure where the need for "written" came from in
the first place - in practice "written" will always be equal to
args.page_count except on error, but in that case there's a goto
free_return anyway, so the number is never copied to userspace. And
I checked the userspace code - it doesn't expect a partial result
either.

So it seems to be redundant, but I don't really want to remove it just
now.

Your suggestion with bitmap_buf_sz * 8 should be fine, and will make it
straightforward to remove "written" in a future cleanup if that ends up
looking like a good idea.
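
To make the agreed behaviour concrete, here is a userspace model of the
in-place conversion followed by clearing every remaining bit up to
bitmap_buf_sz * 8. It is only a sketch: states_to_bitmap() is a hypothetical
helper, it works on plain bytes rather than with __assign_bit() and
__clear_bit() on unsigned long, and it assumes the little-endian bit layout
where bit i lives in byte i / 8.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compact one-byte-per-page states into a bitmap in the same buffer.
 * On entry buf holds `written` state bytes. On return, bit i is set when
 * (state[i] & type_mask) was nonzero, and bits written..bitmap_bytes*8-1
 * are zero. The caller must ensure bitmap_bytes fits in the buffer.
 */
static void states_to_bitmap(uint8_t *buf, size_t written,
			     uint8_t type_mask, size_t bitmap_bytes)
{
	size_t i;

	for (i = 0; i < written; i++) {
		uint8_t bit = (uint8_t)(1u << (i % 8));
		/*
		 * Byte i is read before byte i / 8 is written. Earlier
		 * iterations only wrote bytes at index <= (i - 1) / 8, so
		 * byte i still holds its original state value here.
		 */
		int set = (buf[i] & type_mask) != 0;

		if (set)
			buf[i / 8] |= bit;
		else
			buf[i / 8] &= (uint8_t)~bit;
	}

	/* Clear everything that will be copied out but was not assigned. */
	for (i = written; i < bitmap_bytes * 8; i++)
		buf[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```

The plain read-modify-write on each byte is non-atomic, which matches the
reasoning above that no other thread touches the buffer.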

>>>
>>> The remaining bytes could be memset() to zero but I think it's fine to leave
>>> them.
>>
>> I agree.  The remaining bytes aren't written back to user space anyway
>> since the copy_to_user() uses bitmap_buf_sz.
> 
> Maybe I misunderstood what you meant by "remaining bytes".  I think
> all bits and bytes that are written back to user space should have
> valid data or zeros so that no garbage is written back.
> 
Agreed.

Nuno

> Michael
> 
>>
>>>
>>>>> +
>>>>> +	args.page_count = written;
>>>>> +
>>>>> +	if (copy_to_user(user_args, &args, sizeof(args))) {
>>>>> +		ret = -EFAULT;
>>>>> +		goto free_return;
>>>>> +	}
>>>>> +	if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
>>>>> +		ret = -EFAULT;
>>>>> +
>>>>> +free_return:
>>>>> +	vfree(states);
>>>>> +	return ret;
>>>>> +}

