* [PATCH 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-03-27 20:19 UTC
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
Michael Kelley, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>
Register the mshv reboot notifier for all parent partitions, not just
root. Previously the notifier was gated on hv_root_partition(), so on
L1VH (where hv_root_partition() is false) SINT0, SINT5, and SIRBP were
never cleaned up before kexec. The kexec'd kernel then inherited stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.
The L1VH SIRBP also needs special handling: unlike the root partition
where the hypervisor provides the SIRBP page, L1VH must allocate its
own page and program the GPA into the MSR. Add this allocation to
mshv_synic_init() and the corresponding free to mshv_synic_cleanup().
Remove the unnecessary mshv_root_partition_init/exit wrappers and
register the reboot notifier directly in mshv_parent_partition_init().
Make mshv_reboot_nb static since it no longer needs external linkage.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 21 ++++-----------------
drivers/hv/mshv_synic.c | 37 ++++++++++++++++++++++++++++++-------
2 files changed, 34 insertions(+), 24 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e6509c980763..281f530b68a9 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2256,20 +2256,10 @@ static int mshv_reboot_notify(struct notifier_block *nb,
return 0;
}
-struct notifier_block mshv_reboot_nb = {
+static struct notifier_block mshv_reboot_nb = {
.notifier_call = mshv_reboot_notify,
};
-static void mshv_root_partition_exit(void)
-{
- unregister_reboot_notifier(&mshv_reboot_nb);
-}
-
-static int __init mshv_root_partition_init(struct device *dev)
-{
- return register_reboot_notifier(&mshv_reboot_nb);
-}
-
static int __init mshv_init_vmm_caps(struct device *dev)
{
int ret;
@@ -2339,8 +2329,7 @@ static int __init mshv_parent_partition_init(void)
if (ret)
goto remove_cpu_state;
- if (hv_root_partition())
- ret = mshv_root_partition_init(dev);
+ ret = register_reboot_notifier(&mshv_reboot_nb);
if (ret)
goto remove_cpu_state;
@@ -2368,8 +2357,7 @@ static int __init mshv_parent_partition_init(void)
deinit_root_scheduler:
root_scheduler_deinit();
exit_partition:
- if (hv_root_partition())
- mshv_root_partition_exit();
+ unregister_reboot_notifier(&mshv_reboot_nb);
remove_cpu_state:
cpuhp_remove_state(mshv_cpuhp_online);
free_synic_pages:
@@ -2387,8 +2375,7 @@ static void __exit mshv_parent_partition_exit(void)
misc_deregister(&mshv_dev);
mshv_irqfd_wq_cleanup();
root_scheduler_deinit();
- if (hv_root_partition())
- mshv_root_partition_exit();
+ unregister_reboot_notifier(&mshv_reboot_nb);
cpuhp_remove_state(mshv_cpuhp_online);
free_percpu(mshv_root.synic_pages);
}
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 8a7d76a10dc3..32f91a714c97 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -495,13 +495,29 @@ int mshv_synic_init(unsigned int cpu)
/* Setup the Synic's event ring page */
sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
- sirbp.sirbp_enabled = true;
- *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
- PAGE_SIZE, MEMREMAP_WB);
- if (!(*event_ring_page))
- goto cleanup_siefp;
+ if (hv_root_partition()) {
+ *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
+ PAGE_SIZE, MEMREMAP_WB);
+
+ if (!(*event_ring_page))
+ goto cleanup_siefp;
+ } else {
+ /*
+ * On L1VH the hypervisor does not provide a SIRBP page.
+ * Allocate one and program its GPA into the MSR.
+ */
+ *event_ring_page = (struct hv_synic_event_ring_page *)
+ get_zeroed_page(GFP_KERNEL);
+
+ if (!(*event_ring_page))
+ goto cleanup_siefp;
+ sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
+ >> PAGE_SHIFT;
+ }
+
+ sirbp.sirbp_enabled = true;
hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
#ifdef HYPERVISOR_CALLBACK_VECTOR
@@ -581,8 +597,15 @@ int mshv_synic_cleanup(unsigned int cpu)
/* Disable SYNIC event ring page owned by MSHV */
sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
sirbp.sirbp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
- memunmap(*event_ring_page);
+
+ if (hv_root_partition()) {
+ hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+ memunmap(*event_ring_page);
+ } else {
+ sirbp.base_sirbp_gpa = 0;
+ hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+ free_page((unsigned long)*event_ring_page);
+ }
/*
* Release our mappings of the message and event flags pages.
--
2.43.0
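Editor's sketch for the L1VH SIRBP handling in patch 5/6 above: on init the driver programs the page frame number of its own page and sets the enable bit, and on cleanup it clears both. The user-space code below does the same with explicit shifts and masks. The register layout (bit 0 enable, bits 1-11 preserved, page frame number from bit 12 up) and the 4 KiB page size are assumptions made for this illustration, and sirbp_update() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12			/* assumed 4 KiB pages */
#define SIRBP_ENABLE_BIT	UINT64_C(1)		/* assumed: bit 0 */
#define SIRBP_PRESERVED_MASK	UINT64_C(0xffe)		/* assumed: bits 1-11 */

/*
 * Build a new register value from the old one: keep the preserved bits,
 * replace the page address and the enable bit.
 */
static uint64_t sirbp_update(uint64_t old_value, uint64_t page_phys_addr,
			     bool enabled)
{
	const uint64_t page_mask = (UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1;

	/* The register holds a page frame number, so the address must be page aligned. */
	assert((page_phys_addr & page_mask) == 0);

	return page_phys_addr | (old_value & SIRBP_PRESERVED_MASK) |
	       (enabled ? SIRBP_ENABLE_BIT : 0);
}
```

With these assumptions, sirbp_update(0, 0x12345000, true) models the init path, and sirbp_update(old, 0, false) models the L1VH cleanup path, which leaves neither a stale address nor the enable bit behind for the kexec'd kernel.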
* [PATCH 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-03-27 20:19 UTC
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
Michael Kelley, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>
The SynIC is shared between VMBus and MSHV. VMBus owns the message
page (SIMP), event flags page (SIEFP), global enable (SCONTROL), and
SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
SCONTROL that VMBus already configured, and mshv_synic_cleanup()
disables all of them. This is wrong because MSHV can be torn down
while VMBus is still active. In particular, a kexec reboot notifier
tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out from
under VMBus causes its later cleanup to write SynIC MSRs while SynIC
is disabled, which the hypervisor does not tolerate.
Restrict MSHV to managing only the resources it owns:
- SINT0, SINT5: mask on cleanup, unmask on init
- SIRBP: enable/disable as before
- SIMP, SIEFP, SCONTROL: on L1VH leave entirely to VMBus (it
already enabled them); on root partition VMBus doesn't run, so
MSHV must enable/disable them
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_synic.c | 109 ++++++++++++++++++++++++----------------
1 file changed, 67 insertions(+), 42 deletions(-)
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f8b0337cdc82..8a7d76a10dc3 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -454,7 +454,6 @@ int mshv_synic_init(unsigned int cpu)
#ifdef HYPERVISOR_CALLBACK_VECTOR
union hv_synic_sint sint;
#endif
- union hv_synic_scontrol sctrl;
struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
struct hv_synic_event_flags_page **event_flags_page =
@@ -462,28 +461,37 @@ int mshv_synic_init(unsigned int cpu)
struct hv_synic_event_ring_page **event_ring_page =
&spages->synic_event_ring_page;
- /* Setup the Synic's message page */
+ /*
+ * Map the SYNIC message page. On root partition the hypervisor
+ * pre-provisions the SIMP GPA but may not set simp_enabled;
+ * on L1VH, VMBus already fully set it up. Enable it on root.
+ */
simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
- simp.simp_enabled = true;
+ if (hv_root_partition()) {
+ simp.simp_enabled = true;
+ hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ }
*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
HV_HYP_PAGE_SIZE,
MEMREMAP_WB);
if (!(*msg_page))
- return -EFAULT;
-
- hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ goto cleanup_simp;
- /* Setup the Synic's event flags page */
+ /*
+ * Map the event flags page. Same as SIMP: enable on root,
+ * already enabled by VMBus on L1VH.
+ */
siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
- siefp.siefp_enabled = true;
+ if (hv_root_partition()) {
+ siefp.siefp_enabled = true;
+ hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+ }
*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
PAGE_SIZE, MEMREMAP_WB);
if (!(*event_flags_page))
- goto cleanup;
-
- hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+ goto cleanup_siefp;
/* Setup the Synic's event ring page */
sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
@@ -492,7 +500,7 @@ int mshv_synic_init(unsigned int cpu)
PAGE_SIZE, MEMREMAP_WB);
if (!(*event_ring_page))
- goto cleanup;
+ goto cleanup_siefp;
hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
@@ -515,28 +523,33 @@ int mshv_synic_init(unsigned int cpu)
sint.as_uint64);
#endif
- /* Enable global synic bit */
- sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
- sctrl.enable = 1;
- hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ /*
+ * On L1VH, VMBus owns SCONTROL and has already enabled it.
+ * On root partition, VMBus doesn't run so we must enable it.
+ */
+ if (hv_root_partition()) {
+ union hv_synic_scontrol sctrl;
+
+ sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+ sctrl.enable = 1;
+ hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ }
return 0;
-cleanup:
- if (*event_ring_page) {
- sirbp.sirbp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
- memunmap(*event_ring_page);
- }
- if (*event_flags_page) {
+cleanup_siefp:
+ if (*event_flags_page)
+ memunmap(*event_flags_page);
+ if (hv_root_partition()) {
siefp.siefp_enabled = false;
hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
- memunmap(*event_flags_page);
}
- if (*msg_page) {
+cleanup_simp:
+ if (*msg_page)
+ memunmap(*msg_page);
+ if (hv_root_partition()) {
simp.simp_enabled = false;
hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
- memunmap(*msg_page);
}
return -EFAULT;
@@ -545,10 +558,7 @@ int mshv_synic_init(unsigned int cpu)
int mshv_synic_cleanup(unsigned int cpu)
{
union hv_synic_sint sint;
- union hv_synic_simp simp;
- union hv_synic_siefp siefp;
union hv_synic_sirbp sirbp;
- union hv_synic_scontrol sctrl;
struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
struct hv_synic_event_flags_page **event_flags_page =
@@ -568,28 +578,43 @@ int mshv_synic_cleanup(unsigned int cpu)
hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
sint.as_uint64);
- /* Disable Synic's event ring page */
+ /* Disable SYNIC event ring page owned by MSHV */
sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
sirbp.sirbp_enabled = false;
hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
memunmap(*event_ring_page);
- /* Disable Synic's event flags page */
- siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
- siefp.siefp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+ /*
+ * Release our mappings of the message and event flags pages.
+ * On root partition, we enabled SIMP/SIEFP — disable them.
+ * On L1VH, VMBus owns the MSRs, leave them alone.
+ */
memunmap(*event_flags_page);
+ if (hv_root_partition()) {
+ union hv_synic_simp simp;
+ union hv_synic_siefp siefp;
- /* Disable Synic's message page */
- simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
- simp.simp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+ siefp.siefp_enabled = false;
+ hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+
+ simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+ simp.simp_enabled = false;
+ hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ }
memunmap(*msg_page);
- /* Disable global synic bit */
- sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
- sctrl.enable = 0;
- hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ /*
+ * On root partition, we enabled SCONTROL in init — disable it.
+ * On L1VH, VMBus owns SCONTROL, leave it alone.
+ */
+ if (hv_root_partition()) {
+ union hv_synic_scontrol sctrl;
+
+ sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+ sctrl.enable = 0;
+ hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ }
return 0;
}
--
2.43.0
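Editor's sketch for the ownership rule in patch 4/6 above: MSHV always manages SINT0, SINT5, and SIRBP, and additionally manages SIMP, SIEFP, and SCONTROL only on the root partition, where VMBus does not run. The user-space code below encodes that rule as a bitmask. The enum and both helpers are hypothetical names introduced for illustration; the driver itself branches on hv_root_partition() at each MSR access.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per SynIC resource mentioned in the commit message. */
enum synic_resource {
	RES_SINT0	= 1u << 0,
	RES_SINT5	= 1u << 1,
	RES_SIRBP	= 1u << 2,
	RES_SIMP	= 1u << 3,
	RES_SIEFP	= 1u << 4,
	RES_SCONTROL	= 1u << 5,
};

/* Resources MSHV enables on init and disables on cleanup. */
static unsigned int mshv_managed_resources(bool is_root)
{
	unsigned int mask = RES_SINT0 | RES_SINT5 | RES_SIRBP;

	/* On root, VMBus does not run, so MSHV must manage these too. */
	if (is_root)
		mask |= RES_SIMP | RES_SIEFP | RES_SCONTROL;

	return mask;
}

/* True if MSHV is allowed to write the MSR behind this resource. */
static bool mshv_may_touch(bool is_root, enum synic_resource res)
{
	assert(res != 0);
	return (mshv_managed_resources(is_root) & res) != 0;
}
```

For example, mshv_may_touch(false, RES_SCONTROL) is false: on L1VH, tearing down MSHV must not disable the SynIC out from under VMBus.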
* [PATCH 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Jork Loeser @ 2026-03-27 20:19 UTC
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
Michael Kelley, linux-kernel, linux-arch, Jork Loeser,
Anirudh Rayabharam, Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>
After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.
Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.
Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com>
Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com>
Co-developed-by: Mukesh Rathor <mukeshrathor@microsoft.com>
Signed-off-by: Mukesh Rathor <mukeshrathor@microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
arch/x86/kernel/cpu/mshyperv.c | 7 +++++
drivers/hv/hv_proc.c | 47 ++++++++++++++++++++++++++++++++++
include/asm-generic/mshyperv.h | 10 ++++++++
include/hyperv/hvgdk_mini.h | 1 +
include/hyperv/hvhdk_mini.h | 12 +++++++++
5 files changed, 77 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 235087456bdf..f653feea880b 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -429,6 +429,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
}
#ifdef CONFIG_X86_64
+ /* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
+ if (num_present_cpus() == 1 || hv_lp_exists(1))
+ return;
+
for_each_present_cpu(i) {
if (i == 0)
continue;
@@ -436,6 +440,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
BUG_ON(ret);
}
+ ret = hv_call_notify_all_processors_started();
+ WARN_ON(ret);
+
for_each_present_cpu(i) {
if (i == 0)
continue;
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 5f4fd9c3231c..63a48e5a02c5 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
return ret;
}
EXPORT_SYMBOL_GPL(hv_call_create_vp);
+
+int hv_call_notify_all_processors_started(void)
+{
+ struct hv_input_notify_partition_event *input;
+ u64 status;
+ unsigned long irq_flags;
+ int ret = 0;
+
+ local_irq_save(irq_flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+ input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
+ status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
+ input, NULL);
+ local_irq_restore(irq_flags);
+
+ if (!hv_result_success(status)) {
+ hv_status_err(status, "\n");
+ ret = hv_result_to_errno(status);
+ }
+ return ret;
+}
+
+bool hv_lp_exists(u32 lp_index)
+{
+ struct hv_input_get_logical_processor_run_time *input;
+ struct hv_output_get_logical_processor_run_time *output;
+ unsigned long flags;
+ u64 status;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+ input->lp_index = lp_index;
+ status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
+ input, output);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status) &&
+ hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
+ hv_status_err(status, "\n");
+ BUG();
+ }
+
+ return hv_result_success(status);
+}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index d37b68238c97..bf601d67cecb 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
+int hv_call_notify_all_processors_started(void);
+bool hv_lp_exists(u32 lp_index);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
#else /* CONFIG_MSHV_ROOT */
@@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
{
return -EOPNOTSUPP;
}
+static inline int hv_call_notify_all_processors_started(void)
+{
+ return -EOPNOTSUPP;
+}
+static inline bool hv_lp_exists(u32 lp_index)
+{
+ return false;
+}
static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
{
return -EOPNOTSUPP;
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 056ef7b6b360..f2598e186550 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
/* HV_CALL_CODE */
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
+#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME 0x0004
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
#define HVCALL_SEND_IPI 0x000b
#define HVCALL_ENABLE_VP_VTL 0x000f
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 091c03e26046..b4cb2fa26e9b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -362,6 +362,7 @@ union hv_partition_event_input {
enum hv_partition_event {
HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+ HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
};
struct hv_input_notify_partition_event {
@@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
union hv_partition_event_input input;
} __packed;
+struct hv_input_get_logical_processor_run_time {
+ u32 lp_index;
+} __packed;
+
+struct hv_output_get_logical_processor_run_time {
+ u64 global_time;
+ u64 local_run_time;
+ u64 rsvdz0;
+ u64 hypervisor_time;
+} __packed;
+
struct hv_lp_startup_status {
u64 hv_status;
u64 substatus1;
--
2.43.0
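Editor's sketch for the probe logic in patch 3/6 above: hv_lp_exists() treats success as "LP present", HV_STATUS_INVALID_LP_INDEX as "LP absent", and anything else as fatal, and hv_smp_prepare_cpus() skips LP/VP creation when there is only one present CPU or LP 1 already exists. The user-space code below models only that decision. The enum values are hypothetical stand-ins for the hypervisor status codes, and both function names are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the hypercall status values. */
enum probe_status {
	PROBE_SUCCESS,
	PROBE_INVALID_LP_INDEX,
	PROBE_OTHER_FAILURE,
};

enum lp_probe_result {
	LP_PRESENT,	/* created by a previous kernel */
	LP_ABSENT,	/* fresh boot, LP still has to be added */
	LP_PROBE_FATAL,	/* unexpected status; the patch calls BUG() here */
};

/* Map a probe status to one of the three outcomes the patch distinguishes. */
static enum lp_probe_result classify_probe(enum probe_status status)
{
	switch (status) {
	case PROBE_SUCCESS:
		return LP_PRESENT;
	case PROBE_INVALID_LP_INDEX:
		return LP_ABSENT;
	default:
		return LP_PROBE_FATAL;
	}
}

/* Decide whether the add-LP and create-VP loops can be skipped. */
static bool skip_lp_vp_creation(unsigned int present_cpus,
				enum lp_probe_result lp1)
{
	assert(lp1 != LP_PROBE_FATAL);
	return present_cpus == 1 || lp1 == LP_PRESENT;
}
```

One difference from the patch: there the `||` short-circuits, so the probe hypercall is not issued at all when only one CPU is present, whereas this sketch takes the probe result as an argument.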
* [PATCH 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Jork Loeser @ 2026-03-27 20:19 UTC
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
Michael Kelley, linux-kernel, linux-arch, Jork Loeser,
Anirudh Rayabharam
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>
Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
drivers/hv/vmbus_drv.c | 1 -
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 89a2eb8a0722..235087456bdf 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -235,8 +235,12 @@ void hv_remove_crash_handler(void)
#ifdef CONFIG_KEXEC_CORE
static void hv_machine_shutdown(void)
{
- if (kexec_in_progress && hv_kexec_handler)
- hv_kexec_handler();
+ if (kexec_in_progress) {
+ hv_stimer_global_cleanup();
+
+ if (hv_kexec_handler)
+ hv_kexec_handler();
+ }
/*
* Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 301273d61892..5d1449f8c6ea 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2892,7 +2892,6 @@ static struct platform_driver vmbus_platform_driver = {
static void hv_kexec_handler(void)
{
- hv_stimer_global_cleanup();
vmbus_initiate_unload(false);
/* Make sure conn_state is set as hv_synic_cleanup checks for it */
mb();
--
2.43.0
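Editor's sketch for the reordering in patch 2/6 above. Reading the diff, stimer cleanup now runs whenever a kexec is in progress, before the optional handler and even when no handler is registered. The user-space code below models that control flow with a call counter. All names are hypothetical stand-ins (fake_stimer_cleanup() for hv_stimer_global_cleanup(), kexec_hook for hv_kexec_handler), and nothing here touches real timers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int next_step;		/* incremented on every recorded call */
static int stimer_cleanup_at;	/* step at which timers were cleaned up */
static int vmbus_unload_at;	/* step at which the bus was unloaded */

static void fake_stimer_cleanup(void)
{
	stimer_cleanup_at = ++next_step;
}

static void fake_vmbus_unload(void)
{
	/* The ordering the patch enforces: timers are already quiesced. */
	assert(stimer_cleanup_at != 0);
	vmbus_unload_at = ++next_step;
}

/* Optional hook, NULL until a bus driver registers one. */
static void (*kexec_hook)(void);

static void fake_machine_shutdown(bool kexec_in_progress)
{
	if (!kexec_in_progress)
		return;

	/* Unconditional once kexec is in progress. */
	fake_stimer_cleanup();

	/* Only runs if a handler was registered. */
	if (kexec_hook)
		kexec_hook();
}
```

With kexec_hook left NULL, fake_machine_shutdown(true) still records the timer cleanup; after setting kexec_hook = fake_vmbus_unload, the cleanup is recorded one step before the unload.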
* [PATCH 1/6] Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
From: Jork Loeser @ 2026-03-27 20:19 UTC
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
Michael Kelley, linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260327201920.2100427-1-jloeser@linux.microsoft.com>
vmbus_alloc_synic_and_connect() declares a local 'int
hyperv_cpuhp_online' that shadows the file-scope global of the same
name. The cpuhp state returned by cpuhp_setup_state() is stored in
the local, leaving the global at 0 (CPUHP_OFFLINE). When
hv_kexec_handler() or hv_machine_shutdown() later call
cpuhp_remove_state(hyperv_cpuhp_online) they pass 0, which hits the
BUG_ON in __cpuhp_remove_state_cpuslocked().
Remove the local declaration so the cpuhp state is stored in the
file-scope global where hv_kexec_handler() and hv_machine_shutdown()
expect it.
Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection")
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/vmbus_drv.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 3e7a52918ce0..301273d61892 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1430,7 +1430,6 @@ static int vmbus_alloc_synic_and_connect(void)
{
int ret, cpu;
struct work_struct __percpu *works;
- int hyperv_cpuhp_online;
ret = hv_synic_alloc();
if (ret < 0)
--
2.43.0
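Editor's sketch for the shadowing bug fixed in patch 1/6 above. This is a user-space illustration, not kernel code: cpuhp_state plays the role of hyperv_cpuhp_online, and fake_setup_state() is a hypothetical stand-in for cpuhp_setup_state() that returns a made-up id of 42. It only shows why the file-scope variable stays at 0 when a local of the same name is declared.

```c
#include <assert.h>

/* File-scope state, playing the role of hyperv_cpuhp_online. */
static int cpuhp_state;		/* zero-initialized */

/* Hypothetical stand-in for cpuhp_setup_state(); returns a made-up id. */
static int fake_setup_state(void)
{
	return 42;
}

/* Buggy shape: the local declaration shadows the file-scope variable. */
static int connect_shadowed(void)
{
	int cpuhp_state;	/* hides the file-scope cpuhp_state */

	cpuhp_state = fake_setup_state();
	return cpuhp_state;	/* the id is lost once this function returns */
}

/* Fixed shape: no local declaration, so the file-scope variable is written. */
static int connect_fixed(void)
{
	cpuhp_state = fake_setup_state();
	assert(cpuhp_state > 0);
	return cpuhp_state;
}

/* What a later teardown path would read. */
static int state_seen_by_teardown(void)
{
	return cpuhp_state;
}
```

After connect_shadowed(), state_seen_by_teardown() still returns 0; after connect_fixed() it returns 42. Building with GCC or Clang and -Wshadow reports the local declaration in connect_shadowed().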
* [PATCH 0/6] Hyper-V: kexec fixes for L1VH (mshv)
From: Jork Loeser @ 2026-03-27 20:19 UTC
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Roman Kisel,
Michael Kelley, linux-kernel, linux-arch, Jork Loeser
This series fixes kexec support when Linux runs as an L1 Virtual Host
(L1VH) under Hyper-V, using the MSHV driver to manage child VMs.
1. A variable shadowing bug in vmbus that hides the cpuhp state used
for teardown.
2. Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown(). This ensures stimer cleanup happens before
the vmbus unload.
3. LP/VP re-creation: after kexec, logical processors and virtual
processors already exist in the hypervisor. Detect this and skip
re-adding them.
4-5. SynIC cleanup: the MSHV driver manages its own SynIC resources
separately from vmbus. Add proper teardown of the MSHV-owned SINTs
(SINT0, SINT5) and SIRBP on kexec, and leave SIMP, SIEFP, and
SCONTROL to vmbus on L1VH.
6. Debugfs stats pages: unmap the VP statistics overlay pages before
kexec to avoid stale mappings in the new kernel.
Jork Loeser (6):
Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
x86/hyperv: move stimer cleanup to hv_machine_shutdown()
x86/hyperv: Skip LP/VP creation on kexec
mshv: limit SynIC management to MSHV-owned resources
mshv: clean up SynIC state on kexec for L1VH
mshv: unmap debugfs stats pages on kexec
arch/x86/kernel/cpu/mshyperv.c | 15 +++-
drivers/hv/hv_proc.c | 47 +++++++++++
drivers/hv/mshv_debugfs.c | 7 +-
drivers/hv/mshv_root_main.c | 22 ++---
drivers/hv/mshv_synic.c | 144 ++++++++++++++++++++++-----------
drivers/hv/vmbus_drv.c | 2 -
include/asm-generic/mshyperv.h | 10 +++
include/hyperv/hvgdk_mini.h | 1 +
include/hyperv/hvhdk_mini.h | 12 +++
9 files changed, 190 insertions(+), 70 deletions(-)
--
2.43.0
* Re: [RFC PATCH V3] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Tianyu Lan @ 2026-03-27 9:32 UTC
To: Easwar Hariharan
Cc: kys, haiyangz, wei.liu, decui, longli, m.szyprowski, robin.murphy,
Tianyu Lan, iommu, linux-hyperv, linux-kernel, hch, vdso,
Michael Kelley
In-Reply-To: <75c6dd78-bbae-4f5a-94ef-9de299720d38@linux.microsoft.com>
On Fri, Mar 27, 2026 at 1:05 AM Easwar Hariharan
<easwar.hariharan@linux.microsoft.com> wrote:
>
> On 3/25/2026 12:56 AM, Tianyu Lan wrote:
> > Hyper-V provides Confidential VMBus to communicate between
> > device model and device guest driver via encrypted/private
> > memory in Confidential VM. The device model is in OpenHCL
> > (https://openvmm.dev/guide/user_guide/openhcl.html) that
> > plays the paravisor role.
> >
> > For a VMBus device, there are two communication methods to
> > talk with Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic
> > DMA transfer.
> >
> > The Confidential VMBus Ring buffer has been upstreamed by
> > Roman Kisel (commit 6802d8af47d1).
> >
> > The dynamic DMA transfer of a VMBus device normally goes
> > through the DMA core, which uses SWIOTLB as a bounce buffer in
> > a CoCo VM.
> >
> > The Confidential VMBus device can do DMA directly to
> > private/encrypted memory. Because the swiotlb is decrypted
> > memory, the DMA transfer must not be bounced through the
> > swiotlb, so as to preserve confidentiality. This is different
> > from the default for Linux CoCo VMs, so disable the VMBus
> > device's use of swiotlb.
> >
> > Expose swiotlb_dev_disable() from DMA Core to disable
> > bounce buffer for device.
> >
> > Suggested-by: Michael Kelley <mhklinux@outlook.com>
> > Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> > ---
> > drivers/hv/vmbus_drv.c | 6 +++++-
> > include/linux/swiotlb.h | 5 +++++
> > 2 files changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 3d1a58b667db..84e6971fc90f 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -2184,11 +2184,15 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> > child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
> > dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
> >
> > + device_initialize(&child_device_obj->device);
> > + if (child_device_obj->channel->co_external_memory)
> > + swiotlb_dev_disable(&child_device_obj->device);
> > +
> > /*
> > * Register with the LDM. This will kick off the driver/device
> > * binding...which will eventually call vmbus_match() and vmbus_probe()
> > */
> > - ret = device_register(&child_device_obj->device);
> > + ret = device_add(&child_device_obj->device);
> > if (ret) {
> > pr_err("Unable to register child device\n");
> > put_device(&child_device_obj->device);
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 3dae0f592063..7c572570d5d9 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -169,6 +169,11 @@ static inline struct io_tlb_pool *swiotlb_find_pool(struct device *dev,
> > return NULL;
> > }
> >
> > +static inline bool swiotlb_dev_disable(struct device *dev)
> > +{
> > + return dev->dma_io_tlb_mem == NULL;
>
> Is there an extra = here?
>
> - Easwar (he/him)
Hi Easwar:
Thanks for your review. Nice catch, that is a mistake. I will try
another way to disable the device's bounce buffer in the next version.
--
Thanks
Tianyu Lan
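Editor's sketch of the C semantics behind Easwar's question about the extra `=`: as posted, swiotlb_dev_disable() compares the pointer with NULL and returns the result, so it changes nothing on the device. The user-space code below contrasts that with an assignment. struct sketch_dev and both helpers are hypothetical, and this is not a proposed fix; the reply above says a different approach will be tried.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device with a pointer to its bounce-buffer pool. */
struct sketch_dev {
	void *bounce_pool;
};

/* Shape of the posted helper: `==` only tests, the device is unchanged. */
static bool disable_with_compare(struct sketch_dev *dev)
{
	assert(dev != NULL);
	return dev->bounce_pool == NULL;
}

/* With a single `=`, the pointer is actually cleared before the test. */
static bool disable_with_assign(struct sketch_dev *dev)
{
	assert(dev != NULL);
	dev->bounce_pool = NULL;
	return dev->bounce_pool == NULL;
}
```

On a device whose bounce_pool is set, disable_with_compare() returns false and leaves the pool in place, while disable_with_assign() clears it and returns true.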
* Re: [RFC PATCH V3] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Tianyu Lan @ 2026-03-27 9:28 UTC
To: Leon Romanovsky
Cc: kys, haiyangz, wei.liu, decui, longli, m.szyprowski, robin.murphy,
Tianyu Lan, iommu, linux-hyperv, linux-kernel, hch, vdso,
Michael Kelley
In-Reply-To: <20260325092200.GQ814676@unreal>
On Wed, Mar 25, 2026 at 5:22 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Mar 25, 2026 at 03:56:49AM -0400, Tianyu Lan wrote:
> > Hyper-V provides Confidential VMBus to communicate between
> > device model and device guest driver via encrypted/private
> > memory in Confidential VM. The device model is in OpenHCL
> > (https://openvmm.dev/guide/user_guide/openhcl.html) that
> > plays the paravisor role.
> >
> > For a VMBus device, there are two communication methods to
> > talk with Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic
> > DMA transfer.
> >
> > The Confidential VMBus Ring buffer has been upstreamed by
> > Roman Kisel (commit 6802d8af47d1).
> >
> > The dynamic DMA transfer of a VMBus device normally goes
> > through the DMA core, which uses SWIOTLB as a bounce buffer in
> > a CoCo VM.
> >
> > The Confidential VMBus device can do DMA directly to
> > private/encrypted memory. Because the swiotlb is decrypted
> > memory, the DMA transfer must not be bounced through the
> > swiotlb, so as to preserve confidentiality. This is different
> > from the default for Linux CoCo VMs, so disable the VMBus
> > device's use of swiotlb.
> >
> > Expose swiotlb_dev_disable() from DMA Core to disable
> > bounce buffer for device.
>
> It feels awkward and like a layering violation to let arbitrary kernel
> drivers manipulate SWIOTLB, which sits beneath the DMA core.
>
Hi Leon:
Thanks for your review. I will try another approach, since the DMA
core currently has no standard way to disable swiotlb for a device.
--
Thanks
Tianyu Lan
* RE: [EXTERNAL] Re: [PATCH net-next v2] net: mana: Set default number of queues to 16
From: Long Li @ 2026-03-27 4:00 UTC
To: Jakub Kicinski
Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260326201841.3b7e5b78@kernel.org>
> On Mon, 23 Mar 2026 12:49:25 -0700 Long Li wrote:
> > Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES
> > (16), as 16 queues can achieve optimal throughput for typical
> > workloads. The actual number of queues may be lower if it exceeds the
> > hardware reported limit. Users can increase the number of queues up to
> > max_queues via ethtool if needed.
>
> Sorry, we are a bit backlogged and I didn't spot this in time (read: I'm
> planning to revert this unless a proper explanation is provided)
>
> Could you explain why not use netif_get_num_default_rss_queues() ?
> Having local driver innovations is a major PITA for users who deal with
> heterogeneous envs.
Hi Jakub,
We considered netif_get_num_default_rss_queues() but chose a fixed
default based on our performance testing. On Azure VMs, typical
workloads plateau at around 16 queues: adding more queues beyond that
doesn't improve throughput but increases memory usage and interrupt
overhead.
netif_get_num_default_rss_queues() would return 32-64 on large VMs
(64-128 vCPUs), which wastes resources without benefit.
That said, I agree that completely ignoring the core-based heuristic
isn't ideal for consistency. One option is to use
netif_get_num_default_rss_queues() but clamp it to a maximum of
MANA_DEF_NUM_QUEUES (16), so small VMs still get enough queues and
large VMs don't over-allocate. Something like:
apc->num_queues = min(netif_get_num_default_rss_queues(), MANA_DEF_NUM_QUEUES);
apc->num_queues = min(apc->num_queues, gc->max_num_queues);
For reference, it seems mlx4 does something similar: it caps at
DEF_RX_RINGS (16) regardless of core count.
Do you want me to send a v2?
Thanks,
Long
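Editor's sketch of the two-step clamp proposed in the message above: take the core-based default, cap it at 16, then cap it again at the hardware limit. The user-space code below takes both inputs as plain parameters. pick_default_queues() and SKETCH_DEF_NUM_QUEUES are hypothetical names, and core_based_default stands in for whatever netif_get_num_default_rss_queues() would return on a given machine.

```c
#include <assert.h>

#define SKETCH_DEF_NUM_QUEUES 16u	/* mirrors MANA_DEF_NUM_QUEUES (16) */

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Clamp the core-based default to the driver cap, then to the hardware limit. */
static unsigned int pick_default_queues(unsigned int core_based_default,
					unsigned int hw_max_queues)
{
	unsigned int n;

	assert(hw_max_queues > 0);

	n = min_uint(core_based_default, SKETCH_DEF_NUM_QUEUES);
	return min_uint(n, hw_max_queues);
}
```

For example, a core-based default of 64 with a hardware limit of 32 yields 16, a default of 4 stays at 4, and a hardware limit of 8 wins over both other values.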
* Re: [PATCH net-next v2] net: mana: Set default number of queues to 16
From: Jakub Kicinski @ 2026-03-27 3:18 UTC
To: Long Li
Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
K . Y . Srinivasan, Wei Liu, Dexuan Cui, Simon Horman, netdev,
linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323194925.1766385-1-longli@microsoft.com>
On Mon, 23 Mar 2026 12:49:25 -0700 Long Li wrote:
> Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES (16),
> as 16 queues can achieve optimal throughput for typical workloads. The
> actual number of queues may be lower if it exceeds the hardware reported
> limit. Users can increase the number of queues up to max_queues via
> ethtool if needed.
Sorry we are a bit backlogged I didn't spot this in time (read: I'm
planning to revert this unless proper explanation is provided)
Could you explain why not use netif_get_num_default_rss_queues() ?
Having local driver innovations is a major PITA for users who deal
with heterogeneous envs.
^ permalink raw reply
* Re: [PATCH net,v2] net: mana: Fix RX skb truesize accounting
From: patchwork-bot+netdevbpf @ 2026-03-27 2:10 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy
In-Reply-To: <acLUhLpLum6qrD/N@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 24 Mar 2026 11:14:28 -0700 you wrote:
> MANA passes rxq->alloc_size to napi_build_skb() for all RX buffers.
> It is correct for fragment-backed RX buffers, where alloc_size matches
> the actual backing allocation used for each packet buffer. However, in
> the non-fragment RX path mana allocates a full page, or a higher-order
> page, per RX buffer. In that case alloc_size only reflects the usable
> packet area and not the actual backing memory.
>
> [...]
Here is the summary with links:
- [net,v2] net: mana: Fix RX skb truesize accounting
https://git.kernel.org/netdev/net/c/f73896b4197e
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net-next v2] net: mana: Use at least SZ_4K in doorbell ID range check
From: Simon Horman @ 2026-03-26 20:07 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, shradhagupta, kotaranov, dipayanroy,
yury.norov, kees, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260325180423.1923060-1-ernis@linux.microsoft.com>
On Wed, Mar 25, 2026 at 11:04:17AM -0700, Erni Sri Satya Vennela wrote:
> mana_gd_ring_doorbell() accesses offsets up to DOORBELL_OFFSET_EQ
> (0xFF8) + 8 bytes = 4KB within each doorbell page. A db_page_size
> smaller than SZ_4K is fundamentally incompatible with the driver:
> doorbell pages would overlap and the device cannot function correctly.
>
> Validate db_page_size at the source and fail the
> probe early if the value is below SZ_4K. This ensures the doorbell ID
> range check in mana_gd_register_device() can rely on db_page_size
> being valid.
>
> Fixes: 89fe91c65992 ("net: mana: hardening: Validate doorbell ID from GDMA_REGISTER_DEVICE response")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
> Changes in v2:
> * Remove "db_page_sz = max_t(u64, SZ_4K, gc->db_page_size)" in
> mana_gd_register_device and validate db_page_sz at the source
> mana_gd_init_pf_regs and mana_gd_init_vf_regs.
> * Update commit message.
Thanks for the update.
Reviewed-by: Simon Horman <horms@kernel.org>
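The arithmetic in the quoted commit message can be checked with a small userspace sketch. The SKETCH_ constants mirror the values quoted above (0xFF8 and 8 bytes); the helper names are hypothetical and do not appear in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_4K              0x1000u
#define SKETCH_DOORBELL_OFFSET_EQ 0xFF8u /* highest offset, per the commit message */
#define SKETCH_DOORBELL_WIDTH     8u     /* bytes written at that offset */

/* Bytes of a doorbell page that the driver may touch. */
static uint32_t doorbell_span(void)
{
	return SKETCH_DOORBELL_OFFSET_EQ + SKETCH_DOORBELL_WIDTH;
}

/* A page smaller than the span would overlap the next doorbell page. */
static bool db_page_size_ok(uint32_t db_page_size)
{
	return db_page_size >= SKETCH_SZ_4K;
}
```

Since 0xFF8 + 8 is exactly 0x1000, any db_page_size below SZ_4K cannot hold one full doorbell page.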
^ permalink raw reply
* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Bjorn Helgaas @ 2026-03-26 18:08 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP), linux-kernel,
driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Gui-Dong Han
In-Reply-To: <20260324005919.2408620-6-dakr@kernel.org>
On Tue, Mar 24, 2026 at 01:59:09AM +0100, Danilo Krummrich wrote:
> When a driver is probed through __driver_attach(), the bus' match()
> callback is called without the device lock held, thus accessing the
> driver_override field without a lock, which can cause a UAF.
>
> Fix this by using the driver-core driver_override infrastructure taking
> care of proper locking internally.
>
> Note that calling match() from __driver_attach() without the device lock
> held is intentional. [1]
>
> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
> Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> Fixes: 782a985d7af2 ("PCI: Introduce new device binding path using pci_dev.driver_override")
> Signed-off-by: Danilo Krummrich <dakr@kernel.org>
> ---
> drivers/pci/pci-driver.c | 11 +++++++----
> drivers/pci/pci-sysfs.c | 28 ----------------------------
> drivers/pci/probe.c | 1 -
> include/linux/pci.h | 6 ------
For the above:
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
"driver_override" is mentioned several places in
Documentation/ABI/testing/sysfs-bus-*. I assume this series doesn't
change the behavior documented there? Should any of this doc be
consolidated?
> drivers/vfio/pci/vfio_pci_core.c | 5 ++---
> drivers/xen/xen-pciback/pci_stub.c | 6 ++++--
> 6 files changed, 13 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index dd9075403987..d10ece0889f0 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -138,9 +138,11 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
> {
> struct pci_dynid *dynid;
> const struct pci_device_id *found_id = NULL, *ids;
> + int ret;
>
> /* When driver_override is set, only bind to the matching driver */
> - if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> + ret = device_match_driver_override(&dev->dev, &drv->driver);
> + if (ret == 0)
> return NULL;
>
> /* Look at the dynamic ids first, before the static ones */
> @@ -164,7 +166,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
> * matching.
> */
> if (found_id->override_only) {
> - if (dev->driver_override)
> + if (ret > 0)
> return found_id;
> } else {
> return found_id;
> @@ -172,7 +174,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
> }
>
> /* driver_override will always match, send a dummy id */
> - if (dev->driver_override)
> + if (ret > 0)
> return &pci_device_id_any;
> return NULL;
> }
> @@ -452,7 +454,7 @@ static int __pci_device_probe(struct pci_driver *drv, struct pci_dev *pci_dev)
> static inline bool pci_device_can_probe(struct pci_dev *pdev)
> {
> return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe ||
> - pdev->driver_override);
> + device_has_driver_override(&pdev->dev));
> }
> #else
> static inline bool pci_device_can_probe(struct pci_dev *pdev)
> @@ -1722,6 +1724,7 @@ static const struct cpumask *pci_device_irq_get_affinity(struct device *dev,
>
> const struct bus_type pci_bus_type = {
> .name = "pci",
> + .driver_override = true,
> .match = pci_bus_match,
> .uevent = pci_uevent,
> .probe = pci_device_probe,
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 16eaaf749ba9..a9006cf4e9c8 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -615,33 +615,6 @@ static ssize_t devspec_show(struct device *dev,
> static DEVICE_ATTR_RO(devspec);
> #endif
>
> -static ssize_t driver_override_store(struct device *dev,
> - struct device_attribute *attr,
> - const char *buf, size_t count)
> -{
> - struct pci_dev *pdev = to_pci_dev(dev);
> - int ret;
> -
> - ret = driver_set_override(dev, &pdev->driver_override, buf, count);
> - if (ret)
> - return ret;
> -
> - return count;
> -}
> -
> -static ssize_t driver_override_show(struct device *dev,
> - struct device_attribute *attr, char *buf)
> -{
> - struct pci_dev *pdev = to_pci_dev(dev);
> - ssize_t len;
> -
> - device_lock(dev);
> - len = sysfs_emit(buf, "%s\n", pdev->driver_override);
> - device_unlock(dev);
> - return len;
> -}
> -static DEVICE_ATTR_RW(driver_override);
> -
> static struct attribute *pci_dev_attrs[] = {
> &dev_attr_power_state.attr,
> &dev_attr_resource.attr,
> @@ -669,7 +642,6 @@ static struct attribute *pci_dev_attrs[] = {
> #ifdef CONFIG_OF
> &dev_attr_devspec.attr,
> #endif
> - &dev_attr_driver_override.attr,
> &dev_attr_ari_enabled.attr,
> NULL,
> };
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index bccc7a4bdd79..b4707640e102 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2488,7 +2488,6 @@ static void pci_release_dev(struct device *dev)
> pci_release_of_node(pci_dev);
> pcibios_release_device(pci_dev);
> pci_bus_put(pci_dev->bus);
> - kfree(pci_dev->driver_override);
> bitmap_free(pci_dev->dma_alias_mask);
> dev_dbg(dev, "device released\n");
> kfree(pci_dev);
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d43745fe4c84..460852f79f29 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
> pdev->is_virtfn && physfn == vdev->pdev) {
> pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
> pci_name(pdev));
> - pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
> - vdev->vdev.ops->name);
> - WARN_ON(!pdev->driver_override);
> + WARN_ON(device_set_driver_override(&pdev->dev,
> + vdev->vdev.ops->name));
> } else if (action == BUS_NOTIFY_BOUND_DRIVER &&
> pdev->is_virtfn && physfn == vdev->pdev) {
> struct pci_driver *drv = pci_dev_driver(pdev);
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index e4b27aecbf05..79a2b5dfd694 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -598,6 +598,8 @@ static int pcistub_seize(struct pci_dev *dev,
> return err;
> }
>
> +static struct pci_driver xen_pcibk_pci_driver;
> +
> /* Called when 'bind'. This means we must _NOT_ call pci_reset_function or
> * other functions that take the sysfs lock. */
> static int pcistub_probe(struct pci_dev *dev, const struct pci_device_id *id)
> @@ -609,8 +611,8 @@ static int pcistub_probe(struct pci_dev *dev, const struct pci_device_id *id)
>
> match = pcistub_match(dev);
>
> - if ((dev->driver_override &&
> - !strcmp(dev->driver_override, PCISTUB_DRIVER_NAME)) ||
> + if (device_match_driver_override(&dev->dev,
> + &xen_pcibk_pci_driver.driver) > 0 ||
> match) {
>
> if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 1c270f1d5123..57e9463e4347 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -575,12 +575,6 @@ struct pci_dev {
> u8 supported_speeds; /* Supported Link Speeds Vector */
> phys_addr_t rom; /* Physical address if not from BAR */
> size_t romlen; /* Length if not from BAR */
> - /*
> - * Driver name to force a match. Do not set directly, because core
> - * frees it. Use driver_set_override() to set or clear it.
> - */
> - const char *driver_override;
> -
> unsigned long priv_flags; /* Private flags for the PCI driver */
>
> /* These methods index pci_reset_fn_methods[] */
> --
> 2.53.0
>
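The three-way return value used in the quoted diff can be illustrated with a userspace sketch. The convention below is inferred from the diff only (ret == 0 rejects the driver, ret > 0 means the override matched), so the negative "no override set" case and every name here are assumptions. The locking that the real helper performs internally is deliberately left out.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the match helper:
 *   < 0  no override is set on the device
 *   == 0 an override is set but names a different driver
 *   > 0  an override is set and names this driver
 */
static int sketch_match_override(const char *override, const char *drv_name)
{
	if (!override)
		return -1;
	return strcmp(override, drv_name) == 0 ? 1 : 0;
}

/* Models the first gate in pci_match_device(): only ret == 0 rejects. */
static int sketch_driver_may_bind(const char *override, const char *drv_name)
{
	return sketch_match_override(override, drv_name) != 0;
}
```

A device without an override still goes through normal ID matching, which is why only the zero case returns early in the diff.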
^ permalink raw reply
* [PATCH net-next] net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
From: Erni Sri Satya Vennela @ 2026-03-26 17:48 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
shirazsaleem, kees, linux-hyperv, netdev, linux-kernel
As a part of MANA hardening for CVM, validate that max_num_sq and
max_num_rq returned by MANA_QUERY_VPORT_CONFIG are not zero. These
values flow into apc->num_queues, which is used as an allocation count
and loop bound. A zero value would result in zero-size allocations and
incorrect driver behavior.
Return -EPROTO if either value is zero.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b39e8b920791..a4197b4b0597 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1249,6 +1249,12 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
*max_sq = resp.max_num_sq;
*max_rq = resp.max_num_rq;
+
+ if (*max_sq == 0 || *max_rq == 0) {
+ netdev_err(apc->ndev, "Invalid max queues from vPort config\n");
+ return -EPROTO;
+ }
+
if (resp.num_indirection_ent > 0 &&
resp.num_indirection_ent <= MANA_INDIRECT_TABLE_MAX_SIZE &&
is_power_of_2(resp.num_indirection_ent)) {
--
2.34.1
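The zero-queue check in the patch above can be sketched in userspace as follows. The struct and function names are hypothetical; only the max_num_sq and max_num_rq field names and the -EPROTO return value come from the patch. Unlike the patch, this sketch validates before storing the outputs.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of the vPort config response. */
struct sketch_vport_resp {
	uint32_t max_num_sq;
	uint32_t max_num_rq;
};

/* Reject a response that would lead to zero-size queue allocations. */
static int sketch_check_vport_queues(const struct sketch_vport_resp *resp,
				     uint32_t *max_sq, uint32_t *max_rq)
{
	if (resp->max_num_sq == 0 || resp->max_num_rq == 0)
		return -EPROTO;

	*max_sq = resp->max_num_sq;
	*max_rq = resp->max_num_rq;
	return 0;
}
```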
^ permalink raw reply related
* Re: [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-03-26 17:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown, Jason Wang,
Xuan Zhuo, Eugenio Pérez, Alex Williamson, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko,
Christophe Leroy (CS GROUP), linux-kernel, driver-core,
linuxppc-dev, linux-hyperv, linux-pci, platform-driver-x86,
linux-arm-msm, linux-remoteproc, linux-s390, linux-spi,
virtualization, kvm, xen-devel, linux-arm-kernel
In-Reply-To: <20260325052919-mutt-send-email-mst@kernel.org>
On Wed Mar 25, 2026 at 10:29 AM CET, Michael S. Tsirkin wrote:
> vdpa bits:
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> I assume it'll all be merged together?
I can take it through the driver-core tree if you prefer, but you can also pick
it up yourself.
^ permalink raw reply
* [PATCH net-next] net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG
From: Erni Sri Satya Vennela @ 2026-03-26 17:30 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
shirazsaleem, kees, linux-hyperv, netdev, linux-kernel
As a part of MANA hardening for CVM, validate the adapter_mtu value
returned from the MANA_QUERY_DEV_CONFIG HWC command.
The adapter_mtu value is used to compute ndev->max_mtu via:
gc->adapter_mtu - ETH_HLEN. If hardware returns a bogus adapter_mtu
smaller than ETH_HLEN (e.g. 0), the unsigned subtraction wraps to a
huge value, silently allowing oversized MTU settings.
Add a validation check to reject adapter_mtu values below
ETH_MIN_MTU + ETH_HLEN, returning -EPROTO to fail the device
configuration early with a clear error message.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b39e8b920791..bd07d17a6017 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1207,10 +1207,16 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
*max_num_vports = resp.max_num_vports;
- if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
+ if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
+ if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
+ dev_err(dev, "Adapter MTU too small: %u\n",
+ resp.adapter_mtu);
+ return -EPROTO;
+ }
gc->adapter_mtu = resp.adapter_mtu;
- else
+ } else {
gc->adapter_mtu = ETH_FRAME_LEN;
+ }
if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V3)
*bm_hostmode = resp.bm_hostmode;
--
2.34.1
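The wraparound described in the commit message above can be reproduced in userspace. This is a sketch under stated assumptions: the SKETCH_ constants copy the usual values of ETH_HLEN (14) and ETH_MIN_MTU (68), and the function names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_ETH_HLEN    14u /* Ethernet header length */
#define SKETCH_ETH_MIN_MTU 68u /* minimum MTU */

/* What max_mtu would become without validation (unsigned arithmetic). */
static uint32_t sketch_max_mtu_unchecked(uint32_t adapter_mtu)
{
	return adapter_mtu - SKETCH_ETH_HLEN;
}

/* Mirrors the check in the patch: reject anything below 68 + 14 bytes. */
static int sketch_validate_adapter_mtu(uint32_t adapter_mtu)
{
	if (adapter_mtu < SKETCH_ETH_MIN_MTU + SKETCH_ETH_HLEN)
		return -EPROTO;
	return 0;
}
```

For adapter_mtu == 0 the unchecked subtraction yields 4294967282, which is why the patch fails the query early instead of computing max_mtu.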
^ permalink raw reply related
* Re: [EXTERNAL] Re: [PATCH net-next v5 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
From: Simon Horman @ 2026-03-26 17:19 UTC (permalink / raw)
To: Long Li
Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Leon Romanovsky,
Haiyang Zhang, KY Srinivasan, Wei Liu, Dexuan Cui,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB6683FB2D67A3BBCC74B62D45CE49A@SA1PR21MB6683.namprd21.prod.outlook.com>
On Wed, Mar 25, 2026 at 08:47:35PM +0000, Long Li wrote:
> > > On Mon, Mar 23, 2026 at 12:59:46PM -0700, Long Li wrote:
...
> > Hi Simon,
> >
> > This patch set should apply after this patch: (which is also pending net-next)
> > net: mana: Set default number of queues to 16
> >
> > Can you apply the patch set after this patch, or should I wait for the next patch
> > merge window?
> >
> > Thank you,
> > Long
>
>
> I'll send it over in the next patch merging window.
Thanks,
The way I understand things, net-next, and in particular the CI,
can only handle patches where all the dependencies are already
present in net-next.
^ permalink raw reply
* Re: [RFC PATCH V3] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Easwar Hariharan @ 2026-03-26 17:05 UTC (permalink / raw)
To: Tianyu Lan
Cc: kys, haiyangz, wei.liu, decui, longli, m.szyprowski, robin.murphy,
easwar.hariharan, Tianyu Lan, iommu, linux-hyperv, linux-kernel,
hch, vdso, Michael Kelley
In-Reply-To: <20260325075649.248241-1-tiala@microsoft.com>
On 3/25/2026 12:56 AM, Tianyu Lan wrote:
> Hyper-V provides Confidential VMBus to communicate between
> device model and device guest driver via encrypted/private
> memory in Confidential VM. The device model is in OpenHCL
> (https://openvmm.dev/guide/user_guide/openhcl.html) that
> plays the paravisor role.
>
> For a VMBus device, there are two communication methods to
> talk with Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic
> DMA transfer.
>
> The Confidential VMBus Ring buffer has been upstreamed by
> Roman Kisel (commit 6802d8af47d1).
>
> The dynamic DMA transfer of a VMBus device normally goes
> through the DMA core, which uses SWIOTLB as a bounce buffer in
> a CoCo VM.
>
> The Confidential VMBus device can do DMA directly to
> private/encrypted memory. Because the swiotlb is decrypted
> memory, the DMA transfer must not be bounced through the
> swiotlb, so as to preserve confidentiality. This is different
> from the default for Linux CoCo VMs, so disable the VMBus
> device's use of swiotlb.
>
> Expose swiotlb_dev_disable() from DMA Core to disable
> bounce buffer for device.
>
> Suggested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
> drivers/hv/vmbus_drv.c | 6 +++++-
> include/linux/swiotlb.h | 5 +++++
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 3d1a58b667db..84e6971fc90f 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2184,11 +2184,15 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
> dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
>
> + device_initialize(&child_device_obj->device);
> + if (child_device_obj->channel->co_external_memory)
> + swiotlb_dev_disable(&child_device_obj->device);
> +
> /*
> * Register with the LDM. This will kick off the driver/device
> * binding...which will eventually call vmbus_match() and vmbus_probe()
> */
> - ret = device_register(&child_device_obj->device);
> + ret = device_add(&child_device_obj->device);
> if (ret) {
> pr_err("Unable to register child device\n");
> put_device(&child_device_obj->device);
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 3dae0f592063..7c572570d5d9 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -169,6 +169,11 @@ static inline struct io_tlb_pool *swiotlb_find_pool(struct device *dev,
> return NULL;
> }
>
> +static inline bool swiotlb_dev_disable(struct device *dev)
> +{
> + return dev->dma_io_tlb_mem == NULL;
Is there an extra = here?
- Easwar (he/him)
^ permalink raw reply
* RE: [PATCH net,v2] net: mana: Fix RX skb truesize accounting
From: Haiyang Zhang @ 2026-03-26 15:04 UTC (permalink / raw)
To: Dipayaan Roy, KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, leon@kernel.org, Long Li,
Konstantin Taranov, horms@kernel.org,
shradhagupta@linux.microsoft.com, ssengar@linux.microsoft.com,
ernis@linux.microsoft.com, Shiraz Saleem,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
stephen@networkplumber.org, jacob.e.keller@intel.com,
Dipayaan Roy
In-Reply-To: <acLUhLpLum6qrD/N@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
> -----Original Message-----
> From: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> Sent: Tuesday, March 24, 2026 2:14 PM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; andrew+netdev@lunn.ch; davem@davemloft.net;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; leon@kernel.org;
> Long Li <longli@microsoft.com>; Konstantin Taranov
> <kotaranov@microsoft.com>; horms@kernel.org;
> shradhagupta@linux.microsoft.com; ssengar@linux.microsoft.com;
> ernis@linux.microsoft.com; Shiraz Saleem <shirazsaleem@microsoft.com>;
> linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-rdma@vger.kernel.org;
> stephen@networkplumber.org; jacob.e.keller@intel.com; Dipayaan Roy
> <dipayanroy@microsoft.com>
> Subject: [PATCH net,v2] net: mana: Fix RX skb truesize accounting
>
> MANA passes rxq->alloc_size to napi_build_skb() for all RX buffers.
> It is correct for fragment-backed RX buffers, where alloc_size matches
> the actual backing allocation used for each packet buffer. However, in
> the non-fragment RX path mana allocates a full page, or a higher-order
> page, per RX buffer. In that case alloc_size only reflects the usable
> packet area and not the actual backing memory.
>
> This causes napi_build_skb() to underestimate the skb backing allocation
> in the single-buffer RX path, so skb->truesize is derived from a value
> smaller than the real RX buffer allocation.
>
> Fix this by updating alloc_size in the non-fragment RX path to the
> actual backing allocation size before it is passed to napi_build_skb().
>
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers
> instead of full pages to improve memory efficiency.")
> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> Changes in v2:
> - Added maintainers missed in v1.
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index ea71de39f996..884f8e548174 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -766,6 +766,13 @@ static void mana_get_rxbuf_cfg(struct
> mana_port_context *apc,
> }
>
> *frag_count = 1;
> +
> + /* In the single-buffer path, napi_build_skb() must see the
> + * actual backing allocation size so skb->truesize reflects
> + * the full page (or higher-order page), not just the usable
> + * packet area.
> + */
> + *alloc_size = PAGE_SIZE << get_order(*alloc_size);
> return;
> }
>
> --
> 2.43.0
>
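The rounding performed by the added line, PAGE_SIZE << get_order(*alloc_size), can be sketched in userspace. This assumes a 4 KiB page; sketch_get_order() is a hypothetical reimplementation of the idea behind the kernel's get_order(), not the kernel function itself.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Smallest order such that (PAGE_SIZE << order) >= size.
 * size must be non-zero.
 */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;
	size_t pages = (size - 1) >> SKETCH_PAGE_SHIFT;

	while (pages) {
		order++;
		pages >>= 1;
	}
	return order;
}

/* Size of the page (or higher-order page) backing one RX buffer. */
static size_t sketch_backing_alloc_size(size_t alloc_size)
{
	return SKETCH_PAGE_SIZE << sketch_get_order(alloc_size);
}
```

A 2048-byte usable area is therefore accounted as a full 4096-byte page, and a 9000-byte area as an order-2 allocation of 16384 bytes.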
^ permalink raw reply
* Re: [PATCH net-next v2] net: mana: Set default number of queues to 16
From: patchwork-bot+netdevbpf @ 2026-03-26 14:10 UTC (permalink / raw)
To: Long Li
Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
leon, haiyangz, kys, wei.liu, decui, horms, netdev, linux-rdma,
linux-hyperv, linux-kernel
In-Reply-To: <20260323194925.1766385-1-longli@microsoft.com>
Hello:
This patch was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Mon, 23 Mar 2026 12:49:25 -0700 you wrote:
> Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES (16),
> as 16 queues can achieve optimal throughput for typical workloads. The
> actual number of queues may be lower if it exceeds the hardware reported
> limit. Users can increase the number of queues up to max_queues via
> ethtool if needed.
>
> Signed-off-by: Long Li <longli@microsoft.com>
>
> [...]
Here is the summary with links:
- [net-next,v2] net: mana: Set default number of queues to 16
https://git.kernel.org/netdev/net-next/c/45b2b84ac6fd
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH v4 21/21] mm: on remap assert that input range within the proposed VMA
From: Vlastimil Babka (SUSE) @ 2026-03-26 10:46 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle), Andrew Morton
Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
Alexandre Torgue, Miquel Raynal, Richard Weinberger,
Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
Jan Kara, David Hildenbrand, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-kernel, linux-doc, linux-hyperv, linux-stm32,
linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <0fc1092f4b74f3f673a58e4e3942dc83f336dd85.1774045440.git.ljs@kernel.org>
On 3/20/26 23:39, Lorenzo Stoakes (Oracle) wrote:
> Now we have range_in_vma_desc(), update remap_pfn_range_prepare() to check
> whether the input range is contained within the specified VMA, so we can
> fail at prepare time if an invalid range is specified.
>
> This covers the I/O remap mmap actions also which ultimately call into
> this function, and other mmap action types either already span the full
> VMA or check this already.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> mm/memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 53ef8ef3d04a..68cc592ff0ba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3142,6 +3142,9 @@ int remap_pfn_range_prepare(struct vm_area_desc *desc)
> const bool is_cow = vma_desc_is_cow_mapping(desc);
> int err;
>
> + if (!range_in_vma_desc(desc, start, end))
> + return -EFAULT;
> +
> err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn,
> &desc->pgoff);
> if (err)
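The containment test that the quoted hunk relies on can be sketched in userspace. The struct, the function names, and the half-open [start, end) convention are assumptions made for illustration; the actual range_in_vma_desc() helper is defined elsewhere in the series and may differ.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the start/end fields of struct vm_area_desc. */
struct sketch_vma_desc {
	unsigned long start;
	unsigned long end;
};

/* True if [start, end) lies entirely inside [desc->start, desc->end). */
static bool sketch_range_in_desc(const struct sketch_vma_desc *desc,
				 unsigned long start, unsigned long end)
{
	return desc->start <= start && start <= end && end <= desc->end;
}

/* Mirrors the early exit added to remap_pfn_range_prepare(). */
static int sketch_prepare(const struct sketch_vma_desc *desc,
			  unsigned long start, unsigned long end)
{
	if (!sketch_range_in_desc(desc, start, end))
		return -EFAULT;
	return 0;
}
```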
^ permalink raw reply
* Re: [PATCH v4 20/21] mm: add mmap_action_map_kernel_pages[_full]()
From: Vlastimil Babka (SUSE) @ 2026-03-26 10:44 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle), Andrew Morton
Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
Alexandre Torgue, Miquel Raynal, Richard Weinberger,
Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
Jan Kara, David Hildenbrand, Liam R . Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
linux-kernel, linux-doc, linux-hyperv, linux-stm32,
linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <926ac961690d856e67ec847bee2370ab3c6b9046.1774045440.git.ljs@kernel.org>
On 3/20/26 23:39, Lorenzo Stoakes (Oracle) wrote:
> A user can invoke mmap_action_map_kernel_pages() to specify that the
> mapping should map kernel pages starting from desc->start of a specified
> number of pages specified in an array.
>
> In order to implement this, adjust mmap_action_prepare() to be able to
> return an error code, as it makes sense to assert that the specified
> parameters are valid as quickly as possible as well as updating the VMA
> flags to include VMA_MIXEDMAP_BIT as necessary.
>
> This provides an mmap_prepare equivalent of vm_insert_pages(). We
> additionally update the existing vm_insert_pages() code to use
> range_in_vma() and add a new range_in_vma_desc() helper function for the
> mmap_prepare case, sharing the code between the two in range_is_subset().
>
> We add both mmap_action_map_kernel_pages() and
> mmap_action_map_kernel_pages_full() to allow for both partial and full VMA
> mappings.
>
> We update the documentation to reflect the new features.
>
> Finally, we update the VMA tests accordingly to reflect the changes.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
^ permalink raw reply
* Re: [PATCH 09/12] s390/cio: use generic driver_override infrastructure
From: Vineeth Vijayan @ 2026-03-26 9:43 UTC (permalink / raw)
To: Danilo Krummrich, Russell King, Greg Kroah-Hartman,
Rafael J. Wysocki, Ioana Ciornei, Nipun Gupta, Nikhil Agarwal,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Bjorn Helgaas, Armin Wolf, Bjorn Andersson, Mathieu Poirier,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Gui-Dong Han
In-Reply-To: <20260324005919.2408620-10-dakr@kernel.org>
On 3/24/26 01:59, Danilo Krummrich wrote:
> When a driver is probed through __driver_attach(), the bus' match()
> callback is called without the device lock held, thus accessing the
> driver_override field without a lock, which can cause a UAF.
>
> Fix this by using the driver-core driver_override infrastructure taking
> care of proper locking internally.
>
> Note that calling match() from __driver_attach() without the device lock
> held is intentional. [1]
>
> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
> Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> Fixes: ebc3d1791503 ("s390/cio: introduce driver_override on the css bus")
> Signed-off-by: Danilo Krummrich <dakr@kernel.org>
> ---
Thank you Danilo.
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
^ permalink raw reply
* Re: [PATCH] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: vdso @ 2026-03-25 22:37 UTC (permalink / raw)
To: Junrui Luo, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Mukesh Rathor, Nuno Das Neves, Roman Kisel,
Stanislav Kinsburskii
Cc: Muminul Islam, Praveen K Paladugu, linux-hyperv, linux-kernel,
Yuhao Jiang, stable
In-Reply-To: <SYBPR01MB7881689C0F58149DD986A6D1AF49A@SYBPR01MB7881.ausprd01.prod.outlook.com>
> On 03/24/2026 9:05 PM PDT Junrui Luo <moonafterrain@outlook.com> wrote:
>
Hi Junrui,
I think the overflow check as implemented can be improved.
`guest_pfn` is a guest page frame number (GPA / page size). Hyper-V uses a
page size of 4 KiB (`HV_HYP_PAGE_SIZE`). On x86_64, GPAs are limited to
52 bits, so valid GFNs lie below (1<<52)/(1<<12) = 1<<40. On ARM64, 52 bits
is also the limit for the bits used in a GPA. Checking for overflow is
therefore not the only thing needed here: _well_ before the addition can
overflow, the range reaches the (1<<40)-th GFN, and using it or going above
it exceeds the arch limit on the bits used in a GPA (the processor won't be
able to map the memory through the page tables).
So we could check against the (1<<40)-th GFN, too. That is, if we'd like to
return an error early instead of trying to do physically impossible things
and erroring out later anyway.
Perhaps something along the lines of
| if (mem->guest_pfn + nr_pages > HVPFN_DOWN(1ULL << MAX_PHYSMEM_BITS))
| return -EINVAL;
could be a meaningful improvement on top of the overflow check, which
alone doesn't take into account the specifics outlined above.
If folks like that, we could hoist the improved check into a helper
function and apply it throughout the file.
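As a rough illustration of such a helper, here is a minimal userspace
sketch that combines the wraparound check with the GPA-width check. It is
not the driver's code: the `SKETCH_*` constants and
`sketch_gfn_range_valid()` are hypothetical names, the 4 KiB page size and
52-bit GPA limit are hard-coded assumptions taken from the discussion
above, and `__builtin_add_overflow()` (GCC/Clang) stands in for the
kernel's `check_add_overflow()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for illustration only: 4 KiB hypervisor pages, 52-bit GPAs. */
#define SKETCH_HV_PAGE_SHIFT 12
#define SKETCH_MAX_GPA_BITS  52
/* Number of addressable GFNs: (1 << 52) / (1 << 12) = 1 << 40. */
#define SKETCH_GFN_LIMIT (1ULL << (SKETCH_MAX_GPA_BITS - SKETCH_HV_PAGE_SHIFT))

/*
 * Hypothetical helper: returns true and stores the exclusive end GFN in
 * *end_gfn when [start_gfn, start_gfn + nr_pages) neither wraps around
 * u64 nor extends past the GPA limit; returns false otherwise.
 */
static bool sketch_gfn_range_valid(uint64_t start_gfn, uint64_t nr_pages,
				   uint64_t *end_gfn)
{
	uint64_t end;

	/* Reject ranges whose end would wrap around. */
	if (__builtin_add_overflow(start_gfn, nr_pages, &end))
		return false;

	/* Reject ranges that reach beyond the last addressable GFN. */
	if (end > SKETCH_GFN_LIMIT)
		return false;

	*end_gfn = end;
	return true;
}
```

With a helper of this shape, a caller could return an error for any
region the helper rejects and reuse the computed end GFN in the overlap
loop instead of repeating the addition.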
--
Cheers,
Roman
>
> mshv_partition_create_region() computes mem->guest_pfn + nr_pages to
> check for overlapping regions without verifying u64 wraparound. A
> sufficiently large guest_pfn can cause the addition to overflow,
> bypassing the overlap check and allowing creation of regions that wrap
> around the address space.
>
> Fix by using check_add_overflow() to reject such regions.
>
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
> ---
> drivers/hv/mshv_root_main.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 6f42423f7faa..6ddb315fc2c2 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1174,11 +1174,16 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> {
> struct mshv_mem_region *rg;
> u64 nr_pages = HVPFN_DOWN(mem->size);
> + u64 new_region_end;
> +
> + /* Reject regions whose end address would wrap around */
> + if (check_add_overflow(mem->guest_pfn, nr_pages, &new_region_end))
> + return -EOVERFLOW;
>
> /* Reject overlapping regions */
> spin_lock(&partition->pt_mem_regions_lock);
> hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> - if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> + if (new_region_end <= rg->start_gfn ||
> rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
> continue;
> spin_unlock(&partition->pt_mem_regions_lock);
>
> ---
> base-commit: c369299895a591d96745d6492d4888259b004a9e
> change-id: 20260325-fixes-9a58895aea55
>
> Best regards,
> --
> Junrui Luo <moonafterrain@outlook.com>
* Re: [PATCH v2 14/16] RDMA/irdma: Add missing comp_mask check in alloc_ucontext
From: Jacob Moroni @ 2026-03-25 22:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler, Bryan Tan,
Cheng Xu, Gal Pressman, Junxian Huang, Kai Shen,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Michal Kalderon, Michael Margolin,
Nelson Escobar, Satish Kharat, Selvin Xavier, Yossi Leybovich,
Chengchang Tang, Tatyana Nikolova, Vishnu Dasa, Yishai Hadas,
Zhu Yanjun, Long Li, patches
In-Reply-To: <14-v2-f4ac6f418bd6+12c5-rdma_udata_req_jgg@nvidia.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
On Wed, Mar 25, 2026 at 5:27 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> irdma has a comp_mask field that was never checked for validity; check
> it.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/infiniband/hw/irdma/verbs.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
> index b2978632241900..d695130b187bdd 100644
> --- a/drivers/infiniband/hw/irdma/verbs.c
> +++ b/drivers/infiniband/hw/irdma/verbs.c
> @@ -296,7 +296,9 @@ static int irdma_alloc_ucontext(struct ib_ucontext *uctx,
> if (udata->outlen < IRDMA_ALLOC_UCTX_MIN_RESP_LEN)
> return -EINVAL;
>
> - ret = ib_copy_validate_udata_in(udata, req, rsvd8);
> + ret = ib_copy_validate_udata_in_cm(udata, req, rsvd8,
> + IRDMA_ALLOC_UCTX_USE_RAW_ATTR |
> + IRDMA_SUPPORT_WQE_FORMAT_V2);
> if (ret)
> return ret;
>
> --
> 2.43.0
>
>