Linux-HyperV List

* Re: [RFC PATCH V2] x86/VMBus: Confidential VMBus for dynamic DMA buffer transition
From: Aneesh Kumar K.V @ 2026-02-16 10:21 UTC
  To: Robin Murphy, Michael Kelley, Tianyu Lan, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	longli@microsoft.com
  Cc: Tianyu Lan, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, hch@infradead.org,
	vdso@hexbites.dev, Suzuki K Poulose
In-Reply-To: <cc4dc4a6-2d74-49c1-bbb0-cfa44802a66b@arm.com>

Robin Murphy <robin.murphy@arm.com> writes:

> On 2026-02-11 6:00 pm, Michael Kelley wrote:
>> From: Tianyu Lan <ltykernel@gmail.com> Sent: Tuesday, February 10, 2026 8:21 AM
>>>
>>> Hyper-V provides Confidential VMBus to communicate between
>>> device model and device guest driver via encrypted/private
>>> memory in Confidential VM. The device model is in OpenHCL
>>> (https://openvmm.dev/guide/user_guide/openhcl.html) that
>>> plays the paravisor role.
>>>
>>> For a VMBUS device, there are two communication methods to
>> 
>> s/VMBUS/VMBus/
>> 
>>> talk with Host/Hypervisor. 1) VMBus Ring buffer 2) dynamic
>>> DMA transition.
>> 
>> I'm not sure what "dynamic DMA transition" is. Maybe just
>> "DMA transfers"?  Also, do the same substitution further
>> down in this commit message.
>> 
>>> The Confidential VMBus Ring buffer has been
>>> upstreamed by Roman Kisel(commit 6802d8af).
>> 
>> It's customary to use 12 character commit IDs, which would be
>> 6802d8af47d1 in this case.
>> 
>>>
>>> The dynamic DMA transition of VMBus device normally goes
>>> through DMA core and it uses SWIOTLB as bounce buffer in
>>> CVM
>> 
>> "CVM" is Microsoft-speak. The Linux terminology is "a CoCo VM".
>> 
>>> to communicate with Host/Hypervisor. The Confidential
>>> VMBus device may use private/encrypted memory to do DMA
>>> and so the device swiotlb(bounce buffer) isn't necessary.
>> 
>> The phrase "isn't necessary" does not capture the real issue
>> here. Saying "isn't necessary" makes it sound like this patch
>> just avoids unnecessary work, so that it is a performance
>> improvement. But that's not the case.
>> 
>> The real issue is that swiotlb memory is decrypted. So bouncing
>> through the swiotlb exposes to the host what is supposed to be
>> confidential data passed on the Confidential VMBus. Disabling
>> the swiotlb bouncing in this case is a hard requirement to preserve
>> confidentiality.
>
> Yeah, this really isn't a Hyper-V problem. Indeed as things stand, 
> "swiotlb=force" could potentially break confidentiality for any 
> environment trying to invent a notion of private DMA, and perhaps we 
> could throw a big warning about that, but really the answer there is 
> "Don't run your confidential workload with 'swiotlb=force'. Why would 
> you even do that? Debug your drivers in a regular VM or bare-metal with 
> full debug visibility like a normal person..."
>
> The fact is we do not have a proper notion of trusted/private DMA yet, 
> and this is not the way to add it. The current assumption is very much 
> that all DMA is untrusted in the CoCo sense, because initially it was 
> only virtual devices emulated by a hypervisor, thus had to be bounced 
> through shared memory anyway. AMD SEV with a stage 1 IOMMU in the guest 
> can allow an assigned physical device to access a suitably-aligned 
> encrypted buffer directly, but that's still effectively just putting the 
> buffer into a temporarily shared state for that device, it merely skips 
> sharing it with the rest of the system. !force_dma_unencrypted() doesn't 
> mean "we trust this device's DMA", it just means "we don't have to use 
> explicitly-decrypted pages to accommodate untrusted/shared DMA here", 
> plus it also serves double-duty for host encryption which doesn't share 
> the same trust model anyway.
>
> I assumed this would follow the TDISP stuff, but if Hyper-V has an 
> alternative device-trusting mechanism already then there's no need to 
> wait. We want some common device property (likely consolidating the 
> current PCI external-facing port notion of trustedness plus whatever 
> TDISP wants), with which we can then make proper decisions in all the 
> right DMA API paths - and if it can end up replacing the horrible 
> force_dma_unencrypted() as well then all the better! I'd totally 
> forgotten about the previous discussion that Michael referred to (which 
> I had to track down[1]), but it looks like all the main points were 
> already covered there and we were approaching a consensus, so really I 
> guess someone just needs to give it a go.
>

With my device-assignment-related changes, I have made the following
update. It may be a slightly stronger requirement to enforce that a
trusted device cannot use SWIOTLB, but it simplifies the overall design.
I also have a prototype that adds two default io_tlb_mem instances, i.e.,

static struct io_tlb_mem io_tlb_default_mem;
static struct io_tlb_mem io_tlb_default_shared_mem;

Looking at that change, I would suggest we avoid doing this unless we
are certain that there is a requirement for a trusted device to use
SWIOTLB bouncing.

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b27de03f2466..07ef149bd9fc 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -292,6 +292,9 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size);
 
 static inline bool is_swiotlb_for_alloc(struct device *dev)
 {
+       if (device_cc_accepted(dev))
+               return false;
+
        return dev->dma_io_tlb_mem->for_alloc;
 }
 #else
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 34fe14b987f0..a89a7ac07499 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -159,6 +159,14 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
  */
 static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
 {
+       /*
+        * Atomic pools are marked decrypted and are used when we need to
+        * update pfn memory-encryption attributes or for DMA non-coherent
+        * device allocation. Neither applies to a trusted device.
+        */
+       if (device_cc_accepted(dev))
+               return false;
+
        return !gfpflags_allow_blocking(gfp) && !is_swiotlb_for_alloc(dev);
 }
 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a862712f4dc6..6d9f0c869c6f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1643,6 +1643,9 @@ bool is_swiotlb_active(struct device *dev)
 {
        struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 
+       if (device_cc_accepted(dev))
+               return false;
+
        return mem && mem->nslabs;
 }
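
For readers who want to see the effect of the three checks in one place,
here is a minimal userspace sketch. It is only a model under stated
assumptions: the struct layouts, the cc_accepted field, and every model_*
name are hypothetical stand-ins, not kernel definitions, and the NULL check
in model_is_swiotlb_for_alloc() is an addition that the real helper does
not have. It shows that a device treated as trusted never reports an active
swiotlb, never allocates from swiotlb, and never falls back to the
(decrypted) atomic pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_tlb_mem; not the kernel layout. */
struct io_tlb_mem_model {
	unsigned long nslabs;	/* non-zero once the pool has slabs */
	bool for_alloc;		/* pool may back coherent allocations */
};

/* Hypothetical stand-in for struct device. */
struct device_model {
	bool cc_accepted;	/* assumption: device is trusted by the guest */
	struct io_tlb_mem_model *dma_io_tlb_mem;
};

/* Models device_cc_accepted() from the diff above (semantics assumed). */
static bool model_device_cc_accepted(const struct device_model *dev)
{
	assert(dev != NULL);
	return dev->cc_accepted;
}

/* Models is_swiotlb_active(): trusted devices never bounce. */
static bool model_is_swiotlb_active(const struct device_model *dev)
{
	const struct io_tlb_mem_model *mem = dev->dma_io_tlb_mem;

	if (model_device_cc_accepted(dev))
		return false;

	return mem && mem->nslabs;
}

/* Models is_swiotlb_for_alloc(); the NULL check is added for the model. */
static bool model_is_swiotlb_for_alloc(const struct device_model *dev)
{
	if (model_device_cc_accepted(dev))
		return false;

	return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->for_alloc;
}

/* Models dma_direct_use_pool(); may_block replaces the gfp check. */
static bool model_dma_direct_use_pool(const struct device_model *dev,
				      bool may_block)
{
	if (model_device_cc_accepted(dev))
		return false;

	return !may_block && !model_is_swiotlb_for_alloc(dev);
}
```

With one shared pool and two devices that differ only in cc_accepted, the
untrusted device still sees the pool while the trusted one gets "false"
from all three helpers, which is the simplification argued for above.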

* Re: [PATCH] x86: mshyperv: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-02-16 12:24 UTC
  To: mhklkml, 'Michael Kelley', 'Florian Bezdeka',
	'K. Y. Srinivasan', 'Haiyang Zhang',
	'Wei Liu', 'Dexuan Cui', 'Long Li',
	'Thomas Gleixner', 'Ingo Molnar',
	'Borislav Petkov', 'Dave Hansen', x86
  Cc: linux-hyperv, linux-kernel, 'RT', 'Mitchell Levy',
	skinsburskii, mrathor, anirudh, schakrabarti, ssengar
In-Reply-To: <005a01dc9d30$a40515e0$ec0f41a0$@zohomail.com>

On 13.02.26 22:35, mhklkml@zohomail.com wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com> Sent: Thursday, February 12, 2026 8:06 AM
>>
>> On 09.02.26 19:25, Michael Kelley wrote:
>>> From: Florian Bezdeka <florian.bezdeka@siemens.com> Sent: Monday, February 9, 2026 2:35 AM
>>>>
>>>> On Sat, 2026-02-07 at 01:30 +0000, Michael Kelley wrote:
>>>>
>>>> [snip]
>>>>>
>>>>> I've run your suggested experiment on an arm64 VM in the Azure cloud. My
>>>>> kernel was linux-next 20260128. I set CONFIG_PREEMPT_RT=y and
>>>>> CONFIG_PROVE_LOCKING=y, but did not add either of your two patches
>>>>> (neither the storvsc driver patch nor the x86 VMBus interrupt handling patch).
>>>>> The VM comes up and runs, but with this warning during boot:
>>>>>
>>>>> [    3.075604] hv_utils: Registering HyperV Utility Driver
>>>>> [    3.075636] hv_vmbus: registering driver hv_utils
>>>>> [    3.085920] =============================
>>>>> [    3.088128] hv_vmbus: registering driver hv_netvsc
>>>>> [    3.091180] [ BUG: Invalid wait context ]
>>>>> [    3.093544] 6.19.0-rc7-next-20260128+ #3 Tainted: G            E
>>>>> [    3.097582] -----------------------------
>>>>> [    3.099899] systemd-udevd/284 is trying to lock:
>>>>> [    3.102568] ffff000100e24490 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0x128/0x3b8 [hv_vmbus]
>>>>> [    3.108208] other info that might help us debug this:
>>>>> [    3.111454] context-{2:2}
>>>>> [    3.112987] 1 lock held by systemd-udevd/284:
>>>>> [    3.115626]  #0: ffffd5cfc20bcc80 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0xcc/0x3b8 [hv_vmbus]
>>>>> [    3.121224] stack backtrace:
>>>>> [    3.122897] CPU: 0 UID: 0 PID: 284 Comm: systemd-udevd Tainted: G            E 6.19.0-rc7-next-20260128+ #3 PREEMPT_RT
>>>>> [    3.129631] Tainted: [E]=UNSIGNED_MODULE
>>>>> [    3.131946] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 06/10/2025
>>>>> [    3.138553] Call trace:
>>>>> [    3.140015]  show_stack+0x20/0x38 (C)
>>>>> [    3.142137]  dump_stack_lvl+0x9c/0x158
>>>>> [    3.144340]  dump_stack+0x18/0x28
>>>>> [    3.146290]  __lock_acquire+0x488/0x1e20
>>>>> [    3.148569]  lock_acquire+0x11c/0x388
>>>>> [    3.150703]  rt_spin_lock+0x54/0x230
>>>>> [    3.152785]  vmbus_chan_sched+0x128/0x3b8 [hv_vmbus]
>>>>> [    3.155611]  vmbus_isr+0x34/0x80 [hv_vmbus]
>>>>> [    3.158093]  vmbus_percpu_isr+0x18/0x30 [hv_vmbus]
>>>>> [    3.160848]  handle_percpu_devid_irq+0xdc/0x348
>>>>> [    3.163495]  handle_irq_desc+0x48/0x68
>>>>> [    3.165851]  generic_handle_domain_irq+0x20/0x38
>>>>> [    3.168664]  gic_handle_irq+0x1dc/0x430
>>>>> [    3.170868]  call_on_irq_stack+0x30/0x70
>>>>> [    3.173161]  do_interrupt_handler+0x88/0xa0
>>>>> [    3.175724]  el1_interrupt+0x4c/0xb0
>>>>> [    3.177855]  el1h_64_irq_handler+0x18/0x28
>>>>> [    3.180332]  el1h_64_irq+0x84/0x88
>>>>> [    3.182378]  _raw_spin_unlock_irqrestore+0x4c/0xb0 (P)
>>>>> [    3.185493]  rt_mutex_slowunlock+0x404/0x440
>>>>> [    3.187951]  rt_spin_unlock+0xb8/0x178
>>>>> [    3.190394]  kmem_cache_alloc_noprof+0xf0/0x4f8
>>>>> [    3.193100]  alloc_empty_file+0x64/0x148
>>>>> [    3.195461]  path_openat+0x58/0xaa0
>>>>> [    3.197658]  do_file_open+0xa0/0x140
>>>>> [    3.199752]  do_sys_openat2+0x190/0x278
>>>>> [    3.202124]  do_sys_open+0x60/0xb8
>>>>> [    3.204047]  __arm64_sys_openat+0x2c/0x48
>>>>> [    3.206433]  invoke_syscall+0x6c/0xf8
>>>>> [    3.208519]  el0_svc_common.constprop.0+0x48/0xf0
>>>>> [    3.211050]  do_el0_svc+0x24/0x38
>>>>> [    3.212990]  el0_svc+0x164/0x3c8
>>>>> [    3.214842]  el0t_64_sync_handler+0xd0/0xe8
>>>>> [    3.217251]  el0t_64_sync+0x1b0/0x1b8
>>>>> [    3.219450] hv_utils: Heartbeat IC version 3.0
>>>>> [    3.219471] hv_utils: Shutdown IC version 3.2
>>>>> [    3.219844] hv_utils: TimeSync IC version 4.0
>>>>
>>>> That matches with my expectation that the same problem exists on arm64.
>>>> The patch from Jan addresses that issue for x86 (only, so far) as we do
>>>> not have a working test environment for arm64 yet.
>>>
>>> OK. I had understood Jan's earlier comments to mean that the VMBus
>>> interrupt problem was implicitly solved on arm64 because of VMBus using
>>> a standard Linux IRQ on arm64. But evidently that's not the case. So my
>>> earlier comment stands: The code changes should go into the architecture
>>> independent portion of the VMBus driver, and not under arch/x86. I
>>> can probably work with you to test on arm64 if need be.
>>>
>>
>> I can move the code, sure, but I still haven't understood what
>> invalidates my assumptions (beside what you observed). vmbus_drv calls
>> request_percpu_irq, and that is - as far as I can see - not injecting
>> IRQF_NO_THREAD. Any explanations welcome.
> 
> I haven't set up detailed debugging on arm64 yet, but in prep for that
> I went looking at the places in the kernel IRQ handling where
> IRQF_NO_THREAD influences behavior. The key function appears to be
> irq_setup_forced_threading(). This function first checks force_irqthreads(),
> which will be "true" when PREEMPT_RT is set. The function then checks
> the IRQF_NO_THREAD flag and the IRQF_PERCPU flag. From what I can
> see, the IRQF_PERCPU flag is treated like the IRQF_NO_THREAD flag, and
> causes forced threading to *not* be done. So the behavior ends up being
> the same as when PREEMPT_RT is not set.
> 
> Since the VMBus interrupt is a per-cpu interrupt, forced threading is not
> done. In that case, the stack trace I reported makes sense. Take a look at
> the code and see if you agree.

Indeed, I missed the IRQF_PERCPU impact on auto-threading. I'll rework my
patch to perform the transition arch-independently.

Thanks,
Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center
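
The forced-threading decision discussed in this exchange can be modeled in
a few lines of userspace C. This is a sketch of the behavior as described
above, not the kernel's irq_setup_forced_threading(): the MODEL_* flag
values and the function name are made up for illustration, and only the
two flags named in the thread are modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the values are NOT the kernel's definitions. */
enum {
	MODEL_IRQF_PERCPU    = 1u << 0,
	MODEL_IRQF_NO_THREAD = 1u << 1,
};

/*
 * Returns true if the handler would be pushed into a thread.
 * force_irqthreads models force_irqthreads(), which the thread above
 * says is true when PREEMPT_RT is set.
 */
static bool model_forced_threading_applies(bool force_irqthreads,
					   unsigned int flags)
{
	/* Only the two modeled flags are valid input. */
	assert((flags & ~(unsigned int)(MODEL_IRQF_PERCPU |
					MODEL_IRQF_NO_THREAD)) == 0);

	if (!force_irqthreads)
		return false;

	/* Per-CPU interrupts are exempt, just like IRQF_NO_THREAD ones. */
	if (flags & (MODEL_IRQF_NO_THREAD | MODEL_IRQF_PERCPU))
		return false;

	return true;
}
```

In this model a per-CPU interrupt stays in hard-IRQ context even with
forced threading enabled, which matches the arm64 lockdep report quoted
above: the handler runs in hard-IRQ context and then takes a lock that
sleeps on PREEMPT_RT.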

* [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-02-16 16:24 UTC
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
	Michael Kelley, Saurabh Singh Sengar, Naman Jain

From: Jan Kiszka <jan.kiszka@siemens.com>

Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
with related guest support enabled:

[    1.127941] hv_vmbus: registering driver hyperv_drm

[    1.132518] =============================
[    1.132519] [ BUG: Invalid wait context ]
[    1.132521] 6.19.0-rc8+ #9 Not tainted
[    1.132524] -----------------------------
[    1.132525] swapper/0/0 is trying to lock:
[    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
[    1.132543] other info that might help us debug this:
[    1.132544] context-{2:2}
[    1.132545] 1 lock held by swapper/0/0:
[    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
[    1.132557] stack backtrace:
[    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
[    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
[    1.132567] Call Trace:
[    1.132570]  <IRQ>
[    1.132573]  dump_stack_lvl+0x6e/0xa0
[    1.132581]  __lock_acquire+0xee0/0x21b0
[    1.132592]  lock_acquire+0xd5/0x2d0
[    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132606]  ? lock_acquire+0xd5/0x2d0
[    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132619]  rt_spin_lock+0x3f/0x1f0
[    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132634]  vmbus_chan_sched+0xc4/0x2b0
[    1.132641]  vmbus_isr+0x2c/0x150
[    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
[    1.132654]  sysvec_hyperv_callback+0x88/0xb0
[    1.132658]  </IRQ>
[    1.132659]  <TASK>
[    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20

As code paths that handle vmbus IRQs use sleeping locks under PREEMPT_RT,
the vmbus_isr execution needs to be moved into thread context.
Open-coding this allows skipping the IPI that irq_work would additionally
bring and which we do not need, since this is an IRQ, never an NMI.

This affects both x86 and arm64, therefore hook into the common driver
logic.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Changes in v3:
 - move logic to generic vmbus driver, targeting arm64 as well
 - annotate non-RT path with lockdep_hardirq_threaded
 - only teardown if setup ran

Changes in v2:
 - reorder vmbus_irq_pending clearing to fix a race condition

 drivers/hv/vmbus_drv.c | 66 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 6785ad63a9cb..749a2e68af05 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -25,6 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/task_stack.h>
+#include <linux/smpboot.h>
 
 #include <linux/delay.h>
 #include <linux/panic_notifier.h>
@@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
 	}
 }
 
-void vmbus_isr(void)
+static void __vmbus_isr(void)
 {
 	struct hv_per_cpu_context *hv_cpu
 		= this_cpu_ptr(hv_context.cpu_context);
@@ -1363,6 +1364,53 @@ void vmbus_isr(void)
 
 	add_interrupt_randomness(vmbus_interrupt);
 }
+
+static DEFINE_PER_CPU(bool, vmbus_irq_pending);
+static DEFINE_PER_CPU(struct task_struct *, vmbus_irqd);
+
+static void vmbus_irqd_wake(void)
+{
+	struct task_struct *tsk = __this_cpu_read(vmbus_irqd);
+
+	__this_cpu_write(vmbus_irq_pending, true);
+	wake_up_process(tsk);
+}
+
+static void vmbus_irqd_setup(unsigned int cpu)
+{
+	sched_set_fifo(current);
+}
+
+static int vmbus_irqd_should_run(unsigned int cpu)
+{
+	return __this_cpu_read(vmbus_irq_pending);
+}
+
+static void run_vmbus_irqd(unsigned int cpu)
+{
+	__this_cpu_write(vmbus_irq_pending, false);
+	__vmbus_isr();
+}
+
+static bool vmbus_irq_initialized;
+
+static struct smp_hotplug_thread vmbus_irq_threads = {
+	.store                  = &vmbus_irqd,
+	.setup			= vmbus_irqd_setup,
+	.thread_should_run      = vmbus_irqd_should_run,
+	.thread_fn              = run_vmbus_irqd,
+	.thread_comm            = "vmbus_irq/%u",
+};
+
+void vmbus_isr(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		vmbus_irqd_wake();
+	} else {
+		lockdep_hardirq_threaded();
+		__vmbus_isr();
+	}
+}
 EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
 
 static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
@@ -1462,6 +1510,13 @@ static int vmbus_bus_init(void)
 	 * the VMbus interrupt handler.
 	 */
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
+		ret = smpboot_register_percpu_thread(&vmbus_irq_threads);
+		if (ret)
+			goto err_kthread;
+		vmbus_irq_initialized = true;
+	}
+
 	if (vmbus_irq == -1) {
 		hv_setup_vmbus_handler(vmbus_isr);
 	} else {
@@ -1507,6 +1562,11 @@ static int vmbus_bus_init(void)
 		free_percpu(vmbus_evt);
 	}
 err_setup:
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
+		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+		vmbus_irq_initialized = false;
+	}
+err_kthread:
 	bus_unregister(&hv_bus);
 	return ret;
 }
@@ -2976,6 +3036,10 @@ static void __exit vmbus_exit(void)
 		free_percpu_irq(vmbus_irq, vmbus_evt);
 		free_percpu(vmbus_evt);
 	}
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
+		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+		vmbus_irq_initialized = false;
+	}
 	for_each_online_cpu(cpu) {
 		struct hv_per_cpu_context *hv_cpu
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
-- 
2.47.3
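
The patch clears vmbus_irq_pending before calling __vmbus_isr(). The v2
changelog does not spell out the race it fixed, so the following is one
plausible reading, written as a deterministic single-threaded model rather
than real kernel code. All names (irqd_model, model_*) are hypothetical. An
"interrupt" that fires while the handler is running is simulated by a
counter; the model compares clearing the flag before the handler with
clearing it afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct irqd_model {
	bool pending;			/* models per-CPU vmbus_irq_pending */
	int events_queued;		/* work posted by the simulated host */
	int events_handled;		/* work the handler has consumed */
	int irqs_during_handler;	/* interrupts that fire mid-handler */
};

/* Models vmbus_irqd_wake(): new work arrives and the flag is set. */
static void model_irq(struct irqd_model *m)
{
	m->events_queued++;
	m->pending = true;
}

/* Models the handler: it only sees work queued when it started scanning. */
static void model_handler(struct irqd_model *m)
{
	int seen = m->events_queued;

	if (m->irqs_during_handler > 0) {
		m->irqs_during_handler--;
		model_irq(m);		/* arrives after the scan */
	}
	m->events_handled = seen;
}

/* One pass of the thread function, with either flag-clearing order. */
static void model_thread_run(struct irqd_model *m, bool clear_first)
{
	if (clear_first)
		m->pending = false;
	model_handler(m);
	if (!clear_first)
		m->pending = false;	/* wipes out a concurrent wake-up */
}

/* Runs passes while the flag is set; returns the count of lost events. */
static int model_drain(struct irqd_model *m, bool clear_first)
{
	assert(m != NULL);
	while (m->pending)
		model_thread_run(m, clear_first);
	return m->events_queued - m->events_handled;
}
```

In the model, clearing first means the late interrupt sets the flag again
and triggers another pass, so nothing is lost. Clearing last erases the
flag set by the late interrupt and leaves one event unhandled until the
next interrupt arrives.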

* Re: [PATCH rdma-next 42/50] RDMA/bnxt_re: Complete CQ resize in a single step
From: Selvin Xavier @ 2026-02-17  5:02 UTC
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Kalesh AP, Potnuri Bharat Teja, Michael Margolin,
	Gal Pressman, Yossi Leybovich, Cheng Xu, Kai Shen,
	Chengchang Tang, Junxian Huang, Abhijit Gangurde, Allen Hubbe,
	Krzysztof Czurylo, Tatyana Nikolova, Long Li, Konstantin Taranov,
	Yishai Hadas, Michal Kalderon, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Christian Benvenuti,
	Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun,
	linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260216080746.GD12989@unreal>

On Mon, Feb 16, 2026 at 1:37 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Feb 16, 2026 at 09:29:29AM +0530, Selvin Xavier wrote:
> > On Fri, Feb 13, 2026 at 4:31 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > There is no need to defer the CQ resize operation, as it is intended to
> > > be completed in one pass. The current bnxt_re_resize_cq() implementation
> > > does not handle concurrent CQ resize requests, and this will be addressed
> > > in the following patches.
> > bnxt HW requires that the previous CQ memory be available with the HW until
> > HW generates a cutoff CQE on the CQ that is being destroyed. This is the
> > reason for polling the completions in the user library after the resize_cq
> > call returns. Once the polling thread sees the expected CQE, it will
> > invoke the driver to free CQ memory.
>
> This flow is problematic. It requires the kernel to trust a user‑space
> application, which is not acceptable. There is no guarantee that the
> rdma-core implementation is correct or will invoke the interface properly.
> Users can bypass rdma-core entirely and issue ioctls directly (syzkaller,
> custom rdma-core variants, etc.), leading to umem leaks, races that overwrite
> kernel memory, and access to fields that are now being modified. All of this
> can occur silently and without any protections.
>
> > So ib_umem_release should wait. This patch doesn't guarantee that.
>
> The issue is that it was never guaranteed in the first place. It only appeared
> to work under very controlled conditions.
>
> > Do you think there is a better way to handle this requirement?
>
> You should wait for BNXT_RE_WC_TYPE_COFF in the kernel before returning
> from resize_cq.
The difficulty is that libbnxt_re in rdma-core holds the queue and the
consumer index used for completion lookup. The driver therefore has to
use copy_from_user to read the queue memory and then check for
BNXT_RE_WC_TYPE_COFF, along with the queue consumer index and the
relevant validity flags. I'll explore whether we have a way to handle this
and get back to you.
>
> Thanks
>
> >
> > >
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  drivers/infiniband/hw/bnxt_re/ib_verbs.c | 33 +++++++++-----------------------
> > >  1 file changed, 9 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > > index d652018c19b3..2aecfbbb7eaf 100644
> > > --- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > > +++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
> > > @@ -3309,20 +3309,6 @@ int bnxt_re_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
> > >         return rc;
> > >  }
> > >
> > > -static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
> > > -{
> > > -       struct bnxt_re_dev *rdev = cq->rdev;
> > > -
> > > -       bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
> > > -
> > > -       cq->qplib_cq.max_wqe = cq->resize_cqe;
> > > -       if (cq->resize_umem) {
> > > -               ib_umem_release(cq->ib_cq.umem);
> > > -               cq->ib_cq.umem = cq->resize_umem;
> > > -               cq->resize_umem = NULL;
> > > -               cq->resize_cqe = 0;
> > > -       }
> > > -}
> > >
> > >  int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
> > >                       struct ib_udata *udata)
> > > @@ -3387,7 +3373,15 @@ int bnxt_re_resize_cq(struct ib_cq *ibcq, unsigned int cqe,
> > >                 goto fail;
> > >         }
> > >
> > > -       cq->ib_cq.cqe = cq->resize_cqe;
> > > +       bnxt_qplib_resize_cq_complete(&rdev->qplib_res, &cq->qplib_cq);
> > > +
> > > +       cq->qplib_cq.max_wqe = cq->resize_cqe;
> > > +       ib_umem_release(cq->ib_cq.umem);
> > > +       cq->ib_cq.umem = cq->resize_umem;
> > > +       cq->resize_umem = NULL;
> > > +       cq->resize_cqe = 0;
> > > +
> > > +       cq->ib_cq.cqe = entries;
> > >         atomic_inc(&rdev->stats.res.resize_count);
> > >
> > >         return 0;
> > > @@ -3907,15 +3901,6 @@ int bnxt_re_poll_cq(struct ib_cq *ib_cq, int num_entries, struct ib_wc *wc)
> > >         struct bnxt_re_sqp_entries *sqp_entry = NULL;
> > >         unsigned long flags;
> > >
> > > -       /* User CQ; the only processing we do is to
> > > -        * complete any pending CQ resize operation.
> > > -        */
> > > -       if (cq->ib_cq.umem) {
> > > -               if (cq->resize_umem)
> > > -                       bnxt_re_resize_cq_complete(cq);
> > > -               return 0;
> > > -       }
> > > -
> > >         spin_lock_irqsave(&cq->cq_lock, flags);
> > >         budget = min_t(u32, num_entries, cq->max_cql);
> > >         num_entries = budget;
> > >
> > > --
> > > 2.52.0
> > >
>
>

* RE: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Michael Kelley @ 2026-02-17  6:42 UTC
  To: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Florian Bezdeka, RT, Mitchell Levy, Michael Kelley,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <289d8e52-40f8-4b22-8aa9-d0bd3bd15aae@siemens.com>

From: Jan Kiszka <jan.kiszka@siemens.com> Sent: Monday, February 16, 2026 8:25 AM
> 
> Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
> with related guest support enabled:
> 
> [    1.127941] hv_vmbus: registering driver hyperv_drm
> 
> [    1.132518] =============================
> [    1.132519] [ BUG: Invalid wait context ]
> [    1.132521] 6.19.0-rc8+ #9 Not tainted
> [    1.132524] -----------------------------
> [    1.132525] swapper/0/0 is trying to lock:
> [    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
> [    1.132543] other info that might help us debug this:
> [    1.132544] context-{2:2}
> [    1.132545] 1 lock held by swapper/0/0:
> [    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
> [    1.132557] stack backtrace:
> [    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
> [    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> [    1.132567] Call Trace:
> [    1.132570]  <IRQ>
> [    1.132573]  dump_stack_lvl+0x6e/0xa0
> [    1.132581]  __lock_acquire+0xee0/0x21b0
> [    1.132592]  lock_acquire+0xd5/0x2d0
> [    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
> [    1.132606]  ? lock_acquire+0xd5/0x2d0
> [    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
> [    1.132619]  rt_spin_lock+0x3f/0x1f0
> [    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
> [    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
> [    1.132634]  vmbus_chan_sched+0xc4/0x2b0
> [    1.132641]  vmbus_isr+0x2c/0x150
> [    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
> [    1.132654]  sysvec_hyperv_callback+0x88/0xb0
> [    1.132658]  </IRQ>
> [    1.132659]  <TASK>
> [    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20
> 
> As code paths that handle vmbus IRQs use sleeping locks under PREEMPT_RT,
> the vmbus_isr execution needs to be moved into thread context.
> Open-coding this allows skipping the IPI that irq_work would additionally
> bring and which we do not need, since this is an IRQ, never an NMI.
> 
> This affects both x86 and arm64, therefore hook into the common driver
> logic.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Tested this patch in combination with the related SCSI driver patch.
Tested three configurations with a recent linux-next kernel, either
20260128 or 20260205.

1) Normal Linux kernel
2) Normal Linux kernel plus CONFIG_PROVE_LOCKING
3) PREEMPT_RT kernel plus CONFIG_PROVE_LOCKING

Tested these three configurations in an x86/x64 VM on a local Hyper-V
and again in an ARM64 VM in the Azure public cloud. With all
combinations, ran the "stress-ng" command provided by Florian
Bezdeka for several minutes. Saw no issues related to these patches.
Presumably the normal kernel with CONFIG_PROVE_LOCKING produced
the lockdep report that Saurabh Sengar saw, and that also appears to be
fixed in this version of the patch due to adding lockdep_hardirq_threaded().

However, I noted one additional locking problem in the ARM64 Azure
VM, which has multiple PCI pass-thru devices -- one Mellanox NIC VF and
two NVMe controllers. The first PCI device to be brought online gets
this lockdep report, though Linux continues to run without problems:

[    8.128629] hv_vmbus: registering driver hv_pci
[    8.132276] hv_pci ad26ad39-fa5e-4d12-9825-fa62e9c88483: PCI VMBus probing: Using version 0x10004
[    8.142956] hv_pci ad26ad39-fa5e-4d12-9825-fa62e9c88483: PCI host bridge to bus fa5e:00
[    8.143231] pci_bus fa5e:00: root bus resource [mem 0xfc0000000-0xfc00fffff window]
[    8.143272] pci_bus fa5e:00: No busn resource found for root bus, will use [bus 00-ff]
[    8.154069] =============================
[    8.156609] [ BUG: Invalid wait context ]
[    8.159209] 6.19.0-rc7rt-next-20260128+ #9 Tainted: G            E
[    8.163582] -----------------------------
[    8.166323] systemd-udevd/575 is trying to lock:
[    8.169163] ffff00011fb62260 (&hbus->device_list_lock){+.+.}-{3:3}, at: get_pcichild_wslot+0x30/0xe0 [pci_hyperv]
[    8.175792] other info that might help us debug this:
[    8.179187] context-{5:5}
[    8.180954] 3 locks held by systemd-udevd/575:
[    8.183048]  #0: ffff000116e50100 (&dev->mutex){....}-{4:4}, at: __device_driver_lock+0x4c/0xb0
[    8.193285]  #1: ffff00011fb62118 (&hbus->state_lock){+.+.}-{4:4}, at: hv_pci_probe+0x32c/0x590 [pci_hyperv]
[    8.199565]  #2: ffffa40f7caa61e0 (pci_lock){....}-{2:2}, at: pci_bus_read_config_dword+0x64/0xf8
[    8.205112] stack backtrace:
[    8.207037] CPU: 0 UID: 0 PID: 575 Comm: systemd-udevd Tainted: G            E       6.19.0-rc7rt-next-20260128+ #9 PREEMPT_RT
[    8.209134] Tainted: [E]=UNSIGNED_MODULE
[    8.219505] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 06/10/2025
[    8.226029] Call trace:
[    8.227433]  show_stack+0x20/0x38 (C)
[    8.229541]  dump_stack_lvl+0x9c/0x158
[    8.231698]  dump_stack+0x18/0x28
[    8.233799]  __lock_acquire+0x488/0x1e20
[    8.236373]  lock_acquire+0x11c/0x388
[    8.238783]  rt_spin_lock+0x54/0x230
[    8.241138]  get_pcichild_wslot+0x30/0xe0 [pci_hyperv]
[    8.244550]  hv_pcifront_read_config+0x3c/0x98 [pci_hyperv]
[    8.248323]  pci_bus_read_config_dword+0x88/0xf8
[    8.250419]  pci_bus_generic_read_dev_vendor_id+0x3c/0x1c0
[    8.252517]  pci_bus_read_dev_vendor_id+0x54/0x80
[    8.263922]  pci_scan_single_device+0x88/0x100
[    8.266903]  pci_scan_slot+0x74/0x1e0
[    8.269208]  pci_scan_child_bus_extend+0x50/0x328
[    8.271978]  pci_scan_root_bus_bridge+0xc4/0xf8
[    8.274705]  hv_pci_probe+0x390/0x590 [pci_hyperv]
[    8.277584]  vmbus_probe+0x4c/0xb0 [hv_vmbus]
[    8.279688]  really_probe+0xd4/0x3d8
[    8.285954]  __driver_probe_device+0x90/0x1a0
[    8.288645]  driver_probe_device+0x44/0x148
[    8.291011]  __driver_attach+0x154/0x290
[    8.293201]  bus_for_each_dev+0x80/0xf0
[    8.295407]  driver_attach+0x2c/0x40
[    8.297478]  bus_add_driver+0x128/0x270
[    8.299607]  driver_register+0x68/0x138
[    8.302179]  __vmbus_driver_register+0x98/0xc0 [hv_vmbus]
[    8.305535]  init_hv_pci_drv+0x198/0xff8 [pci_hyperv]
[    8.308566]  do_one_initcall+0x70/0x400
[    8.310957]  do_init_module+0x60/0x280
[    8.313393]  load_module+0x2308/0x2680
[    8.315535]  init_module_from_file+0xe0/0x110
[    8.318432]  idempotent_init_module+0x194/0x280
[    8.321141]  __arm64_sys_finit_module+0x74/0xf8
[    8.323874]  invoke_syscall+0x6c/0xf8
[    8.326213]  el0_svc_common.constprop.0+0xe0/0xf0
[    8.329068]  do_el0_svc+0x24/0x38
[    8.331070]  el0_svc+0x164/0x3c8
[    8.333137]  el0t_64_sync_handler+0xd0/0xe8
[    8.335599]  el0t_64_sync+0x1b0/0x1b8
[    8.338598] pci fa5e:00:00.0: [1414:b111] type 00 class 0x010802 PCIe Endpoint
[    8.340646] pci fa5e:00:00.0: BAR 0 [mem 0xfc0000000-0xfc00fffff 64bit]
[    8.357759] pci_bus fa5e:00: busn_res: [bus 00-ff] end is updated to 00

The lockdep report would also be seen in an x86/x64 VM in Azure, though I
did not explicitly test that combination. I have not looked at what it would
take to fix this for PREEMPT_RT. But the fix would be a separate patch that
does not affect the validity of this patch.

So for this patch,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>

> ---
> 
> Changes in v3:
>  - move logic to generic vmbus driver, targeting arm64 as well
>  - annotate non-RT path with lockdep_hardirq_threaded
>  - only teardown if setup ran
> 
> Changes in v2:
>  - reorder vmbus_irq_pending clearing to fix a race condition
> 
>  drivers/hv/vmbus_drv.c | 66 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 6785ad63a9cb..749a2e68af05 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -25,6 +25,7 @@
>  #include <linux/cpu.h>
>  #include <linux/sched/isolation.h>
>  #include <linux/sched/task_stack.h>
> +#include <linux/smpboot.h>
> 
>  #include <linux/delay.h>
>  #include <linux/panic_notifier.h>
> @@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
>  	}
>  }
> 
> -void vmbus_isr(void)
> +static void __vmbus_isr(void)
>  {
>  	struct hv_per_cpu_context *hv_cpu
>  		= this_cpu_ptr(hv_context.cpu_context);
> @@ -1363,6 +1364,53 @@ void vmbus_isr(void)
> 
>  	add_interrupt_randomness(vmbus_interrupt);
>  }
> +
> +static DEFINE_PER_CPU(bool, vmbus_irq_pending);
> +static DEFINE_PER_CPU(struct task_struct *, vmbus_irqd);
> +
> +static void vmbus_irqd_wake(void)
> +{
> +	struct task_struct *tsk = __this_cpu_read(vmbus_irqd);
> +
> +	__this_cpu_write(vmbus_irq_pending, true);
> +	wake_up_process(tsk);
> +}
> +
> +static void vmbus_irqd_setup(unsigned int cpu)
> +{
> +	sched_set_fifo(current);
> +}
> +
> +static int vmbus_irqd_should_run(unsigned int cpu)
> +{
> +	return __this_cpu_read(vmbus_irq_pending);
> +}
> +
> +static void run_vmbus_irqd(unsigned int cpu)
> +{
> +	__this_cpu_write(vmbus_irq_pending, false);
> +	__vmbus_isr();
> +}
> +
> +static bool vmbus_irq_initialized;
> +
> +static struct smp_hotplug_thread vmbus_irq_threads = {
> +	.store                  = &vmbus_irqd,
> +	.setup			= vmbus_irqd_setup,
> +	.thread_should_run      = vmbus_irqd_should_run,
> +	.thread_fn              = run_vmbus_irqd,
> +	.thread_comm            = "vmbus_irq/%u",
> +};
> +
> +void vmbus_isr(void)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		vmbus_irqd_wake();
> +	} else {
> +		lockdep_hardirq_threaded();
> +		__vmbus_isr();
> +	}
> +}
>  EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> 
>  static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
> @@ -1462,6 +1510,13 @@ static int vmbus_bus_init(void)
>  	 * the VMbus interrupt handler.
>  	 */
> 
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
> +		ret = smpboot_register_percpu_thread(&vmbus_irq_threads);
> +		if (ret)
> +			goto err_kthread;
> +		vmbus_irq_initialized = true;
> +	}
> +
>  	if (vmbus_irq == -1) {
>  		hv_setup_vmbus_handler(vmbus_isr);
>  	} else {
> @@ -1507,6 +1562,11 @@ static int vmbus_bus_init(void)
>  		free_percpu(vmbus_evt);
>  	}
>  err_setup:
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
> +		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
> +		vmbus_irq_initialized = false;
> +	}
> +err_kthread:
>  	bus_unregister(&hv_bus);
>  	return ret;
>  }
> @@ -2976,6 +3036,10 @@ static void __exit vmbus_exit(void)
>  		free_percpu_irq(vmbus_irq, vmbus_evt);
>  		free_percpu(vmbus_evt);
>  	}
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
> +		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
> +		vmbus_irq_initialized = false;
> +	}
>  	for_each_online_cpu(cpu) {
>  		struct hv_per_cpu_context *hv_cpu
>  			= per_cpu_ptr(hv_context.cpu_context, cpu);
> --
> 2.47.3
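
For anyone following the v2 change note about reordering the clearing of
vmbus_irq_pending, here is a minimal userspace sketch (not kernel code) of
why the flag is cleared before the handler runs rather than after it. This
is only one plausible reading of the race. All names below (irq_pending,
pending_events, inject_once, isr_wake, handler, thread_fn_clear_first,
thread_fn_clear_last, lost_events) are hypothetical stand-ins, and the
interrupt that arrives mid-handler is simulated by a plain function call.

```c
#include <stdbool.h>

/* Hypothetical single-CPU model; not the kernel implementation. */
static bool irq_pending;      /* models the per-CPU vmbus_irq_pending flag */
static int  pending_events;   /* work posted by the host, not yet handled */
static bool inject_once;      /* simulate one interrupt arriving mid-handler */

/* Models vmbus_irqd_wake(): runs in hard-IRQ context, only marks work. */
static void isr_wake(void)
{
	pending_events++;
	irq_pending = true;
}

/* Models __vmbus_isr(): scans for work once, then returns. */
static void handler(void)
{
	pending_events = 0;          /* consume everything visible at scan time */
	if (inject_once) {           /* a new interrupt fires after the scan */
		inject_once = false;
		isr_wake();
	}
}

/* Order used by the patch: clear the flag, then run the handler. */
static void thread_fn_clear_first(void)
{
	irq_pending = false;
	handler();
}

/* Racy order: a wake-up that lands during handler() is overwritten. */
static void thread_fn_clear_last(void)
{
	handler();
	irq_pending = false;
}

/* Returns the number of events left unhandled with no wake-up pending. */
static int lost_events(void (*thread_fn)(void))
{
	irq_pending = false;
	pending_events = 0;
	inject_once = true;

	isr_wake();                  /* first interrupt */
	while (irq_pending)          /* models thread_should_run() */
		thread_fn();

	return pending_events;
}
```

Calling lost_events(thread_fn_clear_first) leaves nothing behind, while
lost_events(thread_fn_clear_last) leaves one event stranded until some
later, unrelated interrupt happens to arrive.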


^ permalink raw reply

* Re: [PATCH rdma-next 42/50] RDMA/bnxt_re: Complete CQ resize in a single step
From: Leon Romanovsky @ 2026-02-17  7:56 UTC (permalink / raw)
  To: Selvin Xavier
  Cc: Jason Gunthorpe, Kalesh AP, Potnuri Bharat Teja, Michael Margolin,
	Gal Pressman, Yossi Leybovich, Cheng Xu, Kai Shen,
	Chengchang Tang, Junxian Huang, Abhijit Gangurde, Allen Hubbe,
	Krzysztof Czurylo, Tatyana Nikolova, Long Li, Konstantin Taranov,
	Yishai Hadas, Michal Kalderon, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Christian Benvenuti,
	Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun,
	linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <CA+sbYW0Ba==5Z5fyqjBS1AH8HE37ese2qMiR4+hoY-i8pajzQg@mail.gmail.com>

On Tue, Feb 17, 2026 at 10:32:25AM +0530, Selvin Xavier wrote:
> On Mon, Feb 16, 2026 at 1:37 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Feb 16, 2026 at 09:29:29AM +0530, Selvin Xavier wrote:
> > > On Fri, Feb 13, 2026 at 4:31 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > There is no need to defer the CQ resize operation, as it is intended to
> > > > be completed in one pass. The current bnxt_re_resize_cq() implementation
> > > > does not handle concurrent CQ resize requests, and this will be addressed
> > > > in the following patches.
> > > bnxt HW requires that the previous CQ memory be available with the HW until
> > > HW generates a cut off cqe on the CQ that is being destroyed. This is
> > > the reason for
> > > polling the completions in the user library after returning the
> > > resize_cq call. Once the polling
> > > thread sees the expected CQE, it will invoke the driver to free CQ
> > > memory.
> >
> > This flow is problematic. It requires the kernel to trust a user‑space
> > application, which is not acceptable. There is no guarantee that the
> > rdma-core implementation is correct or will invoke the interface properly.
> > Users can bypass rdma-core entirely and issue ioctls directly (syzkaller,
> > custom rdma-core variants, etc.), leading to umem leaks, races that overwrite
> > kernel memory, and access to fields that are now being modified. All of this
> > can occur silently and without any protections.
> >
> > > So ib_umem_release should wait. This patch doesn't guarantee that.
> >
> > The issue is that it was never guaranteed in the first place. It only appeared
> > to work under very controlled conditions.
> >
> > > Do you think if there is a better way to handle this requirement?
> >
> > You should wait for BNXT_RE_WC_TYPE_COFF in the kernel before returning
> > from resize_cq.
> The difficulty is that libbnxt_re in rdma-core has the queue and the
> consumer index used for completion lookup. The driver therefore has to
> use copy_from_user to read the queue memory and then check for
> BNXT_RE_WC_TYPE_COFF, along with the queue consumer index and the
> relevant validity flags. I’ll explore if we have a way to handle this
> and get back.

The thing is that you need to ensure that, after libbnxt_re issues the
resize_cq command, the kernel won't require anything from user-space.

Can you cause your HW to stop generating CQEs before resize_cq?

Thanks

^ permalink raw reply

* Re: [PATCH rdma-next 42/50] RDMA/bnxt_re: Complete CQ resize in a single step
From: Selvin Xavier @ 2026-02-17 10:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Kalesh AP, Potnuri Bharat Teja, Michael Margolin,
	Gal Pressman, Yossi Leybovich, Cheng Xu, Kai Shen,
	Chengchang Tang, Junxian Huang, Abhijit Gangurde, Allen Hubbe,
	Krzysztof Czurylo, Tatyana Nikolova, Long Li, Konstantin Taranov,
	Yishai Hadas, Michal Kalderon, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Christian Benvenuti,
	Nelson Escobar, Dennis Dalessandro, Bernard Metzler, Zhu Yanjun,
	linux-kernel, linux-rdma, linux-hyperv
In-Reply-To: <20260217075654.GI12989@unreal>

On Tue, Feb 17, 2026 at 1:27 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Feb 17, 2026 at 10:32:25AM +0530, Selvin Xavier wrote:
> > On Mon, Feb 16, 2026 at 1:37 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Mon, Feb 16, 2026 at 09:29:29AM +0530, Selvin Xavier wrote:
> > > > On Fri, Feb 13, 2026 at 4:31 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > >
> > > > > There is no need to defer the CQ resize operation, as it is intended to
> > > > > be completed in one pass. The current bnxt_re_resize_cq() implementation
> > > > > does not handle concurrent CQ resize requests, and this will be addressed
> > > > > in the following patches.
> > > > bnxt HW requires that the previous CQ memory be available with the HW until
> > > > HW generates a cut off cqe on the CQ that is being destroyed. This is
> > > > the reason for
> > > > polling the completions in the user library after returning the
> > > > resize_cq call. Once the polling
> > > > thread sees the expected CQE, it will invoke the driver to free CQ
> > > > memory.
> > >
> > > This flow is problematic. It requires the kernel to trust a user‑space
> > > application, which is not acceptable. There is no guarantee that the
> > > rdma-core implementation is correct or will invoke the interface properly.
> > > Users can bypass rdma-core entirely and issue ioctls directly (syzkaller,
> > > custom rdma-core variants, etc.), leading to umem leaks, races that overwrite
> > > kernel memory, and access to fields that are now being modified. All of this
> > > can occur silently and without any protections.
> > >
> > > > So ib_umem_release should wait. This patch doesn't guarantee that.
> > >
> > > The issue is that it was never guaranteed in the first place. It only appeared
> > > to work under very controlled conditions.
> > >
> > > > Do you think if there is a better way to handle this requirement?
> > >
> > > You should wait for BNXT_RE_WC_TYPE_COFF in the kernel before returning
> > > from resize_cq.
> > The difficulty is that libbnxt_re in rdma-core has the queue and the
> > consumer index used for completion lookup. The driver therefore has to
> > use copy_from_user to read the queue memory and then check for
> > BNXT_RE_WC_TYPE_COFF, along with the queue consumer index and the
> > relevant validity flags. I’ll explore if we have a way to handle this
> > and get back.
>
> The thing is that you need to ensure that after libbnxt_re issued resize_cq command,
> kernel won't require anything from user-space.
>
> Can you cause to your HW to stop generate CQEs before resize_cq?
We don't have this control (especially on the Receive CQ side). For
the Tx side, maybe we can prevent posting to the Tx queue.
>
> Thanks

^ permalink raw reply

* Re: [PATCH net-next v16 01/12] vsock: add netns to vsock core
From: Stefano Garzarella @ 2026-02-17 15:08 UTC (permalink / raw)
  To: Bobby Eshleman, Paolo Abeni, Jakub Kicinski, Daan De Meyer,
	Michael S. Tsirkin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang,
	Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet, linux-kernel, virtualization, netdev, kvm,
	linux-hyperv, linux-kselftest, berrange, Sargun Dhillon,
	linux-doc, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-1-2859a7512097@meta.com>

Hi,

On Wed, Jan 21, 2026 at 02:11:41PM -0800, Bobby Eshleman wrote:
>From: Bobby Eshleman <bobbyeshleman@meta.com>
>
>Add netns logic to vsock core. Additionally, modify transport hook
>prototypes to be used by later transport-specific patches (e.g.,
>*_seqpacket_allow()).
>
>Namespaces are supported primarily by changing socket lookup functions
>(e.g., vsock_find_connected_socket()) to take into account the socket
>namespace and the namespace mode before considering a candidate socket a
>"match".
>
>This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
>report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
>for new namespaces.

Talking about this new feature with Daan (in CC), we were discussing a
possible change to `child_ns_mode`.

Currently, if two or more administrator processes in the same namespace
set `child_ns_mode`, they race with each other. Obviously, after
unshare()/clone(), the process can always read `ns_mode` to check that
everything went well and retry if necessary.

Daan suggested a more conservative approach, allowing `child_ns_mode` to 
be written only once (a bit like we did in the old version when the 
child could change the mode only once). This way, most users who want 
isolation write `local` in `child_ns_mode` at startup in the init_ns. At 
that point the user can be sure that no other process (including
administrators, e.g., container managers) can change it, so all new 
namespaces will have `local` mode.

I think we should support this option in some way, because it seems to
simplify user space in the most common case (ensuring isolation). I see
a few options for doing this:

1. Change the behavior of `child_ns_mode` to be written only once, but 
this would limit other possible use cases where `child_ns_mode` can be 
changed more than once (I don't know if Bobby had any in mind).

2. Add a new sysctl `child_ns_mode_lockin` (or something similar), which
can only be written once with a mode (local or global). A write to this
will also lock `child_ns_mode`, of course.

3. Add a new `local-locked` mode, reusing the same sysctl.

If we go for 1, maybe we can do it in 7.0, or not?

2 and 3, on the other hand, may have to wait until the next release.
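
To make option 1 a bit more concrete, here is a rough userspace sketch of
write-once semantics. It is not kernel code and not a proposal for the
actual implementation: struct vsock_ns, the child_ns_mode_locked field,
child_ns_mode_write() and the choice of -EBUSY for later writes are all
assumptions made up for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of one namespace's vsock settings; not kernel code. */
enum ns_mode { NS_MODE_GLOBAL, NS_MODE_LOCAL };

struct vsock_ns {
	enum ns_mode child_ns_mode;  /* mode handed to newly created namespaces */
	bool child_ns_mode_locked;   /* set by the first successful write */
};

/*
 * Write-once handler for option 1: the first valid write wins, and every
 * later valid write fails with -EBUSY (the error code is an assumption).
 */
int child_ns_mode_write(struct vsock_ns *ns, const char *val)
{
	enum ns_mode mode;

	if (strcmp(val, "local") == 0)
		mode = NS_MODE_LOCAL;
	else if (strcmp(val, "global") == 0)
		mode = NS_MODE_GLOBAL;
	else
		return -EINVAL;      /* invalid input does not consume the write */

	if (ns->child_ns_mode_locked)
		return -EBUSY;

	ns->child_ns_mode = mode;
	ns->child_ns_mode_locked = true;
	return 0;
}
```

The sketch ignores concurrency on purpose. A real handler would have to
make the check and the update atomic with respect to other writers, which
is exactly the competition between administrators described above.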

What do you think? Any comments?

Thanks,
Stefano

^ permalink raw reply

* RE: [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Michael Kelley @ 2026-02-17 15:47 UTC (permalink / raw)
  To: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org
  Cc: linux-scsi@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka, RT, Mitchell Levy
In-Reply-To: <0c7fb5cd-fb21-4760-8593-e04bade84744@siemens.com>

From: Jan Kiszka <jan.kiszka@siemens.com> Sent: Thursday, January 29, 2026 6:31 AM
> 
> This resolves the following splat and lock-up when running with PREEMPT_RT
> enabled on Hyper-V:
> 
> [  415.140818] BUG: scheduling while atomic: stress-ng-iomix/1048/0x00000002
> [  415.140822] INFO: lockdep is turned off.
> [  415.140823] Modules linked in: intel_rapl_msr intel_rapl_common
> intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery
> pmt_class intel_pmc_ssram_telemetry intel_vsec ghash_clmulni_intel aesni_intel rapl
> binfmt_misc nls_ascii nls_cp437 vfat fat snd_pcm hyperv_drm snd_timer drm_client_lib
> drm_shmem_helper snd sg soundcore drm_kms_helper pcspkr hv_balloon hv_utils
> evdev joydev drm configfs efi_pstore nfnetlink vsock_loopback
> vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vsock
> vmw_vmci efivarfs autofs4 ext4 crc16 mbcache jbd2 sr_mod sd_mod cdrom hv_storvsc
> serio_raw hid_generic scsi_transport_fc hid_hyperv scsi_mod hid hv_netvsc
> hyperv_keyboard scsi_common
> [  415.140846] Preemption disabled at:
> [  415.140847] [<ffffffffc0656171>] storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
> [  415.140854] CPU: 8 UID: 0 PID: 1048 Comm: stress-ng-iomix Not tainted 6.19.0-rc7 #30 PREEMPT_{RT,(full)}
> [  415.140856] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/04/2024
> [  415.140857] Call Trace:
> [  415.140861]  <TASK>
> [  415.140861]  ? storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
> [  415.140863]  dump_stack_lvl+0x91/0xb0
> [  415.140870]  __schedule_bug+0x9c/0xc0
> [  415.140875]  __schedule+0xdf6/0x1300
> [  415.140877]  ? rtlock_slowlock_locked+0x56c/0x1980
> [  415.140879]  ? rcu_is_watching+0x12/0x60
> [  415.140883]  schedule_rtlock+0x21/0x40
> [  415.140885]  rtlock_slowlock_locked+0x502/0x1980
> [  415.140891]  rt_spin_lock+0x89/0x1e0
> [  415.140893]  hv_ringbuffer_write+0x87/0x2a0
> [  415.140899]  vmbus_sendpacket_mpb_desc+0xb6/0xe0
> [  415.140900]  ? rcu_is_watching+0x12/0x60
> [  415.140902]  storvsc_queuecommand+0x669/0xbe0 [hv_storvsc]
> [  415.140904]  ? HARDIRQ_verbose+0x10/0x10
> [  415.140908]  ? __rq_qos_issue+0x28/0x40
> [  415.140911]  scsi_queue_rq+0x760/0xd80 [scsi_mod]
> [  415.140926]  __blk_mq_issue_directly+0x4a/0xc0
> [  415.140928]  blk_mq_issue_direct+0x87/0x2b0
> [  415.140931]  blk_mq_dispatch_queue_requests+0x120/0x440
> [  415.140933]  blk_mq_flush_plug_list+0x7a/0x1a0
> [  415.140935]  __blk_flush_plug+0xf4/0x150
> [  415.140940]  __submit_bio+0x2b2/0x5c0
> [  415.140944]  ? submit_bio_noacct_nocheck+0x272/0x360
> [  415.140946]  submit_bio_noacct_nocheck+0x272/0x360
> [  415.140951]  ext4_read_bh_lock+0x3e/0x60 [ext4]
> [  415.140995]  ext4_block_write_begin+0x396/0x650 [ext4]
> [  415.141018]  ? __pfx_ext4_da_get_block_prep+0x10/0x10 [ext4]
> [  415.141038]  ext4_da_write_begin+0x1c4/0x350 [ext4]
> [  415.141060]  generic_perform_write+0x14e/0x2c0
> [  415.141065]  ext4_buffered_write_iter+0x6b/0x120 [ext4]
> [  415.141083]  vfs_write+0x2ca/0x570
> [  415.141087]  ksys_write+0x76/0xf0
> [  415.141089]  do_syscall_64+0x99/0x1490
> [  415.141093]  ? rcu_is_watching+0x12/0x60
> [  415.141095]  ? finish_task_switch.isra.0+0xdf/0x3d0
> [  415.141097]  ? rcu_is_watching+0x12/0x60
> [  415.141098]  ? lock_release+0x1f0/0x2a0
> [  415.141100]  ? rcu_is_watching+0x12/0x60
> [  415.141101]  ? finish_task_switch.isra.0+0xe4/0x3d0
> [  415.141103]  ? rcu_is_watching+0x12/0x60
> [  415.141104]  ? __schedule+0xb34/0x1300
> [  415.141106]  ? hrtimer_try_to_cancel+0x1d/0x170
> [  415.141109]  ? do_nanosleep+0x8b/0x160
> [  415.141111]  ? hrtimer_nanosleep+0x89/0x100
> [  415.141114]  ? __pfx_hrtimer_wakeup+0x10/0x10
> [  415.141116]  ? xfd_validate_state+0x26/0x90
> [  415.141118]  ? rcu_is_watching+0x12/0x60
> [  415.141120]  ? do_syscall_64+0x1e0/0x1490
> [  415.141121]  ? do_syscall_64+0x1e0/0x1490
> [  415.141123]  ? rcu_is_watching+0x12/0x60
> [  415.141124]  ? do_syscall_64+0x1e0/0x1490
> [  415.141125]  ? do_syscall_64+0x1e0/0x1490
> [  415.141127]  ? irqentry_exit+0x140/0x7e0
> [  415.141129]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> get_cpu() disables preemption while the spinlock hv_ringbuffer_write is
> using is converted to an rt-mutex under PREEMPT_RT.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>

> ---
> 
> This is likely just the tip of an iceberg, see specifically [1], but if
> you never start addressing it, it will continue to crash ships, even if
> those are only on test cruises (we are fully aware that Hyper-V provides
> no RT guarantees for guests). A pragmatic alternative to that would be a
> simple
> 
> config HYPERV
>     depends on !PREEMPT_RT
> 
> Please share your thoughts if this fix is worth it, or if we should
> better stop looking at the next splats that show up after it. We are
> currently considering to thread some of the hv platform IRQs under
> PREEMPT_RT as potential next step.
> 
> TIA!
> 
> [1] https://lore.kernel.org/all/20230809-b4-rt_preempt-fix-v1-0-7283bbdc8b14@gmail.com/
> 
>  drivers/scsi/storvsc_drv.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index b43d876747b7..68c837146b9e 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1855,8 +1855,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  	cmd_request->payload_sz = payload_sz;
> 
>  	/* Invokes the vsc to start an IO */
> -	ret = storvsc_do_io(dev, cmd_request, get_cpu());
> -	put_cpu();
> +	migrate_disable();
> +	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
> +	migrate_enable();
> 
>  	if (ret)
>  		scsi_dma_unmap(scmnd);
> --
> 2.51.0


^ permalink raw reply

* [PATCH v2 1/2] Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
From: Michael Kelley @ 2026-02-17 18:23 UTC (permalink / raw)
  To: drawat.floss, maarten.lankhorst, mripard, tzimmermann, airlied,
	simona, kys, haiyangz, wei.liu, decui, longli, ryasuoka, jfalempe
  Cc: dri-devel, linux-kernel, linux-hyperv, stable

From: Michael Kelley <mhklinux@outlook.com>

Currently, VMBus code initiates a VMBus unload in the panic path so
that if a kdump kernel is loaded, it can start fresh in setting up its
own VMBus connection. However, a driver for the VMBus virtual frame
buffer may need to flush dirty portions of the frame buffer back to
the Hyper-V host so that panic information is visible in the graphics
console. To support such flushing, provide exported functions for the
frame buffer driver to specify that the VMBus unload should not be
done by the VMBus driver, and to initiate the VMBus unload itself.
Together these allow a frame buffer driver to delay the VMBus unload
until after it has completed the flush.

Ideally, the VMBus driver could use its own panic-path callback to do
the unload after all frame buffer drivers have finished. But DRM frame
buffer drivers use the kmsg dump callback, and there are no callbacks
after that in the panic path. Hence this somewhat messy approach to
properly sequencing the frame buffer flush and the VMBus unload.

Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
Changes in v2: None

 drivers/hv/channel_mgmt.c |  1 +
 drivers/hv/hyperv_vmbus.h |  1 -
 drivers/hv/vmbus_drv.c    | 25 ++++++++++++++++++-------
 include/linux/hyperv.h    |  3 +++
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 74fed2c073d4..5de83676dbad 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -944,6 +944,7 @@ void vmbus_initiate_unload(bool crash)
 	else
 		vmbus_wait_for_unload();
 }
+EXPORT_SYMBOL_GPL(vmbus_initiate_unload);
 
 static void vmbus_setup_channel_state(struct vmbus_channel *channel,
 				      struct vmbus_channel_offer_channel *offer)
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index cdbc5f5c3215..5d3944fc93ae 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -440,7 +440,6 @@ void hv_vss_deinit(void);
 int hv_vss_pre_suspend(void);
 int hv_vss_pre_resume(void);
 void hv_vss_onchannelcallback(void *context);
-void vmbus_initiate_unload(bool crash);
 
 static inline void hv_poll_channel(struct vmbus_channel *channel,
 				   void (*cb)(void *))
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 6785ad63a9cb..97dfa529d250 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -69,19 +69,29 @@ bool vmbus_is_confidential(void)
 }
 EXPORT_SYMBOL_GPL(vmbus_is_confidential);
 
+static bool skip_vmbus_unload;
+
+/*
+ * Allow a VMBus framebuffer driver to specify that in the case of a panic,
+ * it will do the VMbus unload operation once it has flushed any dirty
+ * portions of the framebuffer to the Hyper-V host.
+ */
+void vmbus_set_skip_unload(bool skip)
+{
+	skip_vmbus_unload = skip;
+}
+EXPORT_SYMBOL_GPL(vmbus_set_skip_unload);
+
 /*
  * The panic notifier below is responsible solely for unloading the
  * vmbus connection, which is necessary in a panic event.
- *
- * Notice an intrincate relation of this notifier with Hyper-V
- * framebuffer panic notifier exists - we need vmbus connection alive
- * there in order to succeed, so we need to order both with each other
- * [see hvfb_on_panic()] - this is done using notifiers' priorities.
  */
 static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
 			      void *args)
 {
-	vmbus_initiate_unload(true);
+	if (!skip_vmbus_unload)
+		vmbus_initiate_unload(true);
+
 	return NOTIFY_DONE;
 }
 static struct notifier_block hyperv_panic_vmbus_unload_block = {
@@ -2848,7 +2858,8 @@ static void hv_crash_handler(struct pt_regs *regs)
 {
 	int cpu;
 
-	vmbus_initiate_unload(true);
+	if (!skip_vmbus_unload)
+		vmbus_initiate_unload(true);
 	/*
 	 * In crash handler we can't schedule synic cleanup for all CPUs,
 	 * doing the cleanup for current CPU only. This should be sufficient
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..b0502a336eb3 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1334,6 +1334,9 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
 			bool fb_overlap_ok);
 void vmbus_free_mmio(resource_size_t start, resource_size_t size);
 
+void vmbus_initiate_unload(bool crash);
+void vmbus_set_skip_unload(bool skip);
+
 /*
  * GUID definitions of various offer types - services offered to the guest.
  */
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 2/2] drm/hyperv: During panic do VMBus unload after frame buffer is flushed
From: Michael Kelley @ 2026-02-17 18:23 UTC (permalink / raw)
  To: drawat.floss, maarten.lankhorst, mripard, tzimmermann, airlied,
	simona, kys, haiyangz, wei.liu, decui, longli, ryasuoka, jfalempe
  Cc: dri-devel, linux-kernel, linux-hyperv, stable
In-Reply-To: <20260217182335.265585-1-mhklkml@zohomail.com>

From: Michael Kelley <mhklinux@outlook.com>

In a VM, Linux panic information (reason for the panic, stack trace,
etc.) may be written to a serial console and/or a virtual frame buffer
for a graphics console. The latter may need to be flushed back to the
host hypervisor for display.

The current Hyper-V DRM driver for the frame buffer does the flushing
*after* the VMBus connection has been unloaded, such that panic messages
are not displayed on the graphics console. A user with a Hyper-V graphics
console is left with just a hung empty screen after a panic. The enhanced
control that DRM provides over the panic display in the graphics console
is similarly non-functional.

Commit 3671f3777758 ("drm/hyperv: Add support for drm_panic") added
the Hyper-V DRM driver support to flush the virtual frame buffer. It
provided necessary functionality but did not handle the sequencing
problem with VMBus unload.

Fix the full problem by using VMBus functions to suppress the VMBus
unload that is normally done by the VMBus driver in the panic path. Then
after the frame buffer has been flushed, do the VMBus unload so that a
kdump kernel can start cleanly. As expected, CONFIG_DRM_PANIC must be
selected for these changes to have effect. As a side benefit, the
enhanced features of the DRM panic path are also functional.

Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
Changes in v2: Removed test of CONFIG_PRINTK in deciding whether
   to have VMBus skip the unload. A separate patch by Jocelyn Falempe
   incorporates the CONFIG_PRINTK dependency into CONFIG_DRM_PANIC.

 drivers/gpu/drm/hyperv/hyperv_drm_drv.c     |  5 +++++
 drivers/gpu/drm/hyperv/hyperv_drm_modeset.c | 15 ++++++++-------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
index 06b5d96e6eaf..b6bf6412ae34 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
@@ -150,6 +150,10 @@ static int hyperv_vmbus_probe(struct hv_device *hdev,
 		goto err_free_mmio;
 	}

+	/* If DRM panic path is stubbed out VMBus code must do the unload */
+	if (IS_ENABLED(CONFIG_DRM_PANIC))
+		vmbus_set_skip_unload(true);
+
 	drm_client_setup(dev, NULL);

 	return 0;
@@ -169,6 +173,7 @@ static void hyperv_vmbus_remove(struct hv_device *hdev)
 	struct drm_device *dev = hv_get_drvdata(hdev);
 	struct hyperv_drm_device *hv = to_hv(dev);

+	vmbus_set_skip_unload(false);
 	drm_dev_unplug(dev);
 	drm_atomic_helper_shutdown(dev);
 	vmbus_close(hdev->channel);
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
index 7978f8c8108c..d48ca6c23b7c 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
@@ -212,15 +212,16 @@ static void hyperv_plane_panic_flush(struct drm_plane *plane)
 	struct hyperv_drm_device *hv = to_hv(plane->dev);
 	struct drm_rect rect;

-	if (!plane->state || !plane->state->fb)
-		return;
+	if (plane->state && plane->state->fb) {
+		rect.x1 = 0;
+		rect.y1 = 0;
+		rect.x2 = plane->state->fb->width;
+		rect.y2 = plane->state->fb->height;

-	rect.x1 = 0;
-	rect.y1 = 0;
-	rect.x2 = plane->state->fb->width;
-	rect.y2 = plane->state->fb->height;
+		hyperv_update_dirt(hv->hdev, &rect);
+	}

-	hyperv_update_dirt(hv->hdev, &rect);
+	vmbus_initiate_unload(true);
 }

 static const struct drm_plane_helper_funcs hyperv_plane_helper_funcs = {
-- 
2.25.1
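
The ordering that the two patches in this series establish can be modeled
in a few lines of userspace C. This is only an illustrative sketch under
simplifying assumptions, not the kernel code: the helper names
(set_skip_unload, initiate_unload, panic_notifier, panic_flush, run_panic,
is_connected) and the boolean state are hypothetical, and the model assumes
the flush callback always runs, as it would with CONFIG_DRM_PANIC enabled.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the panic-path ordering; not kernel code. */
static bool skip_unload;        /* models skip_vmbus_unload */
static bool connected = true;   /* VMBus connection is still usable */
static bool flushed_to_host;    /* panic text reached the host console */

void set_skip_unload(bool skip) { skip_unload = skip; }

bool is_connected(void) { return connected; }

static void initiate_unload(void) { connected = false; }

/* Models hv_panic_vmbus_unload(): runs early in the panic path. */
static void panic_notifier(void)
{
	if (!skip_unload)
		initiate_unload();
}

/* Models hyperv_plane_panic_flush(): runs later, from the kmsg dump. */
static void panic_flush(void)
{
	if (connected)
		flushed_to_host = true;  /* flush needs a live connection */
	initiate_unload();           /* let a kdump kernel start cleanly */
}

/* Runs the modeled panic path; returns true if the host saw the panic text. */
bool run_panic(bool skip)
{
	connected = true;
	flushed_to_host = false;
	set_skip_unload(skip);

	panic_notifier();
	panic_flush();

	return flushed_to_host;
}
```

In this model run_panic(false) reproduces the empty screen, because the
notifier has already torn down the connection when the flush runs, while
run_panic(true) delivers the panic text. In both cases the connection ends
up unloaded.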

^ permalink raw reply related

* Re: [PATCH net-next v16 01/12] vsock: add netns to vsock core
From: Jakub Kicinski @ 2026-02-17 21:46 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, Paolo Abeni, Daan De Meyer, Michael S. Tsirkin,
	David S. Miller, Eric Dumazet, Simon Horman, Stefan Hajnoczi,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet, linux-kernel, virtualization, netdev, kvm,
	linux-hyperv, linux-kselftest, berrange, Sargun Dhillon,
	linux-doc, Bobby Eshleman
In-Reply-To: <aZNNBc390y6V09qO@sgarzare-redhat>

On Tue, 17 Feb 2026 16:08:33 +0100 Stefano Garzarella wrote:
> If we go for 1, maybe we can do it in 7.0, or not?

Just my preference, but changing the behavior ASAP in 7.0 seems better.
The code must be ready very soon for that to happen, though.

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Bezdeka, Florian @ 2026-02-17 23:03 UTC (permalink / raw)
  To: kys@microsoft.com, decui@microsoft.com, bp@alien8.de,
	longli@microsoft.com, dave.hansen@linux.intel.com,
	mingo@redhat.com, wei.liu@kernel.org, tglx@kernel.org,
	Kiszka, Jan, haiyangz@microsoft.com, x86@kernel.org
  Cc: linux-rt-users@vger.kernel.org, namjain@linux.microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	levymitchell0@gmail.com, mhklinux@outlook.com,
	ssengar@linux.microsoft.com
In-Reply-To: <289d8e52-40f8-4b22-8aa9-d0bd3bd15aae@siemens.com>

On Mon, 2026-02-16 at 17:24 +0100, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
> with related guest support enabled:
> 
> [    1.127941] hv_vmbus: registering driver hyperv_drm
> 
> [    1.132518] =============================
> [    1.132519] [ BUG: Invalid wait context ]
> [    1.132521] 6.19.0-rc8+ #9 Not tainted
> [    1.132524] -----------------------------
> [    1.132525] swapper/0/0 is trying to lock:
> [    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
> [    1.132543] other info that might help us debug this:
> [    1.132544] context-{2:2}
> [    1.132545] 1 lock held by swapper/0/0:
> [    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
> [    1.132557] stack backtrace:
> [    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
> [    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> [    1.132567] Call Trace:
> [    1.132570]  <IRQ>
> [    1.132573]  dump_stack_lvl+0x6e/0xa0
> [    1.132581]  __lock_acquire+0xee0/0x21b0
> [    1.132592]  lock_acquire+0xd5/0x2d0
> [    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
> [    1.132606]  ? lock_acquire+0xd5/0x2d0
> [    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
> [    1.132619]  rt_spin_lock+0x3f/0x1f0
> [    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
> [    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
> [    1.132634]  vmbus_chan_sched+0xc4/0x2b0
> [    1.132641]  vmbus_isr+0x2c/0x150
> [    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
> [    1.132654]  sysvec_hyperv_callback+0x88/0xb0
> [    1.132658]  </IRQ>
> [    1.132659]  <TASK>
> [    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20
> 
> As code paths that handle vmbus IRQs use sleeping locks under PREEMPT_RT,
> the vmbus_isr execution needs to be moved into thread context.
> Open-coding this allows skipping the IPI that irq_work would additionally
> bring and which we do not need, since this is an IRQ, never an NMI.
> 
> This affects both x86 and arm64, therefore hook into the common driver
> logic.
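
For readers who have not seen this pattern before, the split described
above can be modeled in a few lines of plain C. This is only a
single-threaded userspace sketch, not the code from the patch: every
rtm_* name is made up for illustration, the "sleeping lock" is a dummy,
and a real implementation would wake a kernel thread where the comment
says so.

```c
#include <stdbool.h>

static bool rtm_work_pending;	/* set in "interrupt" context */
static int rtm_handled;		/* number of events processed so far */
static int rtm_lock_held;	/* dummy stand-in for a sleeping lock */

/* Stand-in for a lock that may sleep; only legal in thread context. */
static void rtm_sleeping_lock(void)   { rtm_lock_held = 1; }
static void rtm_sleeping_unlock(void) { rtm_lock_held = 0; }

/*
 * Hard interrupt entry point: it must not sleep, so it only records
 * that there is work to do. A real implementation would wake the
 * handler thread here.
 */
static void rtm_hard_irq(void)
{
	rtm_work_pending = true;
}

/*
 * One iteration of the handler thread: runs in a context where taking
 * sleeping locks is allowed. Returns true if an event was handled.
 */
static bool rtm_thread_run_once(void)
{
	if (!rtm_work_pending)
		return false;

	rtm_work_pending = false;
	rtm_sleeping_lock();
	rtm_handled++;
	rtm_sleeping_unlock();
	return true;
}
```

The point of the split is that rtm_hard_irq() never touches the lock;
all locking happens in rtm_thread_run_once().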

I tested this patch in combination with the related SCSI driver patch.
The tests were done on x86 with both VM generations provided by Hyper-V.

Lockdep was enabled and there were no splat reports within 24 hours of
massive load produced by stress-ng.

With that:

Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Tested-by: Florian Bezdeka <florian.bezdeka@siemens.com>


Side note: We did some backports down to 6.1 already, just in case
someone is interested. We recognized a massive network performance drop
in 6.1. The root cause has been identified and is not related to this
patch. It's simply another RT regression caused by a missing stable-rt
backport. Upstreaming in progress...

Best regards,
Florian

^ permalink raw reply

* [PATCH v2] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Mukesh R @ 2026-02-17 23:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel
  Cc: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, x86, hpa

From: Mukesh Rathor <mrathor@linux.microsoft.com>

The MSVC compiler, currently used to build the Microsoft Hyper-V
hypervisor, has an assert intrinsic that raises an exception through
interrupt vector 0x29. The hypervisor responds to that exception by
crashing and collecting a core dump. As a result, if Linux assigns this
vector to a device and the device raises it, the hypervisor will crash.
Two other vectors, 0x2C and 0x2D, are hard coded in the hypervisor for
debug purposes. Fortunately, all three vectors fall within the kernel
driver space, which makes it feasible to reserve them early so they are
not assigned later.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---

v1: Add ifndef CONFIG_X86_FRED (thanks hpa)
v2: replace ifndef with cpu_feature_enabled() (thanks hpa and tglx)
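
As background on the reservation idea, here is a minimal userspace
sketch: one bit per vector, where a test-and-set both marks a vector as
taken and reports whether somebody had already claimed it. The vam_*
names are invented for this illustration and the helper is not atomic;
the patch itself uses the kernel's test_and_set_bit() on system_vectors.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define VAM_NR_VECTORS		256
#define VAM_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per interrupt vector; a set bit means "reserved". */
static unsigned long vam_bitmap[VAM_NR_VECTORS / VAM_BITS_PER_WORD + 1];

/* Non-atomic model of test_and_set_bit(): set the bit, return old value. */
static bool vam_test_and_set(unsigned int vec)
{
	unsigned long mask = 1UL << (vec % VAM_BITS_PER_WORD);
	unsigned long *word = &vam_bitmap[vec / VAM_BITS_PER_WORD];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

static bool vam_is_reserved(unsigned int vec)
{
	unsigned long mask = 1UL << (vec % VAM_BITS_PER_WORD);

	return (vam_bitmap[vec / VAM_BITS_PER_WORD] & mask) != 0;
}

/*
 * Reserve the three vectors named in the commit message. Returns 0 on
 * success, -1 if any of them had already been claimed.
 */
static int vam_reserve_hv_debug_vectors(void)
{
	static const unsigned int vecs[] = { 0x29, 0x2C, 0x2D };
	int ret = 0;

	for (size_t i = 0; i < sizeof(vecs) / sizeof(vecs[0]); i++)
		if (vam_test_and_set(vecs[i]))
			ret = -1;

	return ret;
}
```

Where this sketch returns -1 on a conflict, the patch calls BUG(),
since a conflict that early in boot means some other code already owns
one of the vectors.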

 arch/x86/kernel/cpu/mshyperv.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 579fb2c64cfd..88ca127dc6d4 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -478,6 +478,28 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 }
 EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
 
+/*
+ * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
+ * will either crash or hang or attempt to break into debugger.
+ */
+static void hv_reserve_irq_vectors(void)
+{
+	#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
+	#define HYPERV_DBG_ASSERT_VECTOR	0x2C
+	#define HYPERV_DBG_SERVICE_VECTOR	0x2D
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		return;
+
+	if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
+	    test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
+	    test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
+		BUG();
+
+	pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
+		HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
+}
+
 static void __init ms_hyperv_init_platform(void)
 {
 	int hv_max_functions_eax, eax;
@@ -510,6 +532,11 @@ static void __init ms_hyperv_init_platform(void)
 
 	hv_identify_partition_type();
 
+#ifndef CONFIG_X86_FRED
+	if (hv_root_partition())
+		hv_reserve_irq_vectors();
+#endif	/* CONFIG_X86_FRED */
+
 	if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
 		ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
 
-- 
2.51.2.vfs.0.1


^ permalink raw reply related

* Re: [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Martin K. Petersen @ 2026-02-18  2:14 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	James E.J. Bottomley, Martin K. Petersen, linux-hyperv,
	linux-scsi@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka, RT, Mitchell Levy
In-Reply-To: <0c7fb5cd-fb21-4760-8593-e04bade84744@siemens.com>


Jan,

> This resolves the following splat and lock-up when running with
> PREEMPT_RT enabled on Hyper-V:

Applied to 7.0/scsi-staging, thanks!

-- 
Martin K. Petersen

^ permalink raw reply

* [PATCH] x86/hyperv: Fix error pointer deference
From: Ethan Tidmore @ 2026-02-18  2:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen
  Cc: x86, hpa, mhklinux, ssengar, linux-hyperv, linux-kernel,
	Ethan Tidmore

idle_thread_get() can return an error pointer, but its return value is
used without being checked. Add the missing error pointer check.

Detected by Smatch:
arch/x86/hyperv/hv_vtl.c:126 hv_vtl_bringup_vcpu() error:
'idle' dereferencing possible ERR_PTR()

Fixes: 2b4b90e053a29 ("x86/hyperv: Use per cpu initial stack for vtl context")
Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
---
 arch/x86/hyperv/hv_vtl.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
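
For context, the error pointer convention this fix relies on can be
modeled in userspace as follows. This is a simplified sketch, not the
kernel's <linux/err.h>: the epm_* helpers, the struct, and the failure
condition (a negative CPU number) are all assumptions made up for the
example.

```c
#include <errno.h>
#include <stdint.h>

#define EPM_MAX_ERRNO	4095

/* Encode a negative errno value in a pointer. */
static inline void *epm_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long epm_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the top EPM_MAX_ERRNO addresses. */
static inline int epm_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-(long)EPM_MAX_ERRNO);
}

struct epm_task {
	unsigned long sp;
};

static struct epm_task epm_idle = { .sp = 0x1000 };

/* Hypothetical lookup that fails for a negative CPU number. */
static struct epm_task *epm_idle_thread_get(int cpu)
{
	if (cpu < 0)
		return epm_err_ptr(-ENODEV);
	return &epm_idle;
}

/* Check the pointer before dereferencing it, as the fix does. */
static int epm_bringup(int cpu, unsigned long *rsp)
{
	struct epm_task *idle = epm_idle_thread_get(cpu);

	if (epm_is_err(idle))
		return (int)epm_ptr_err(idle);

	*rsp = idle->sp;
	return 0;
}
```

Without the epm_is_err() check, the failing case would dereference an
address near the top of the address space instead of returning an error.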

diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index c0edaed0efb3..9b6a9bc4ab76 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -110,7 +110,7 @@ static void hv_vtl_ap_entry(void)
 
 static int hv_vtl_bringup_vcpu(u32 target_vp_index, int cpu, u64 eip_ignored)
 {
-	u64 status;
+	u64 status, rsp, rip;
 	int ret = 0;
 	struct hv_enable_vp_vtl *input;
 	unsigned long irq_flags;
@@ -123,9 +123,11 @@ static int hv_vtl_bringup_vcpu(u32 target_vp_index, int cpu, u64 eip_ignored)
 	struct desc_struct *gdt;
 
 	struct task_struct *idle = idle_thread_get(cpu);
-	u64 rsp = (unsigned long)idle->thread.sp;
+	if (IS_ERR(idle))
+		return PTR_ERR(idle);
 
-	u64 rip = (u64)&hv_vtl_ap_entry;
+	rsp = (unsigned long)idle->thread.sp;
+	rip = (u64)&hv_vtl_ap_entry;
 
 	native_store_gdt(&gdt_ptr);
 	store_idt(&idt_ptr);
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH] x86/hyperv: Fix error pointer deference
From: Ethan Tidmore @ 2026-02-18  2:53 UTC (permalink / raw)
  To: Ethan Tidmore, kys, haiyangz, wei.liu, decui, longli, tglx, mingo,
	bp, dave.hansen
  Cc: x86, hpa, mhklinux, ssengar, linux-hyperv, linux-kernel
In-Reply-To: <20260218024351.594068-1-ethantidmore06@gmail.com>

On Tue Feb 17, 2026 at 8:43 PM CST, Ethan Tidmore wrote:
> idle_thread_get() can return an error pointer, but its return value is
> used without being checked. Add the missing error pointer check.
>
> Detected by Smatch:
> arch/x86/hyperv/hv_vtl.c:126 hv_vtl_bringup_vcpu() error:
> 'idle' dereferencing possible ERR_PTR()
>
> Fixes: 2b4b90e053a29 ("x86/hyperv: Use per cpu initial stack for vtl context")
> Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
> ---

Just noticed the typo "deference" (should be "dereference"); please ignore this.

Thanks,

ET

^ permalink raw reply

* RE: [PATCH v4 1/2] mshv: refactor synic init and cleanup
From: Michael Kelley @ 2026-02-18  4:17 UTC (permalink / raw)
  To: Anirudh Rayabharam, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260211170728.3056226-2-anirudh@anirudhrb.com>

From: Anirudh Rayabharam <anirudh@anirudhrb.com> Sent: Wednesday, February 11, 2026 9:07 AM
> 
> Rename mshv_synic_init() to mshv_synic_cpu_init() and
> mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
> these functions handle per-cpu synic setup and teardown.
> 
> Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
> Move all the synic related setup from mshv_parent_partition_init.
> 
> Move the reboot notifier to mshv_synic.c because it currently only
> operates on the synic cpuhp state.
> 
> Move out synic_pages from the global mshv_root since it's use is now

s/it's/its/

> completely local to mshv_synic.c.
> 
> This is in preparation for the next patch which will add more stuff to
> mshv_synic_init().
> 
> No functional change.
> 
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
>  drivers/hv/mshv_root.h      |  5 ++-
>  drivers/hv/mshv_root_main.c | 59 +++++-------------------------
>  drivers/hv/mshv_synic.c     | 71 +++++++++++++++++++++++++++++++++----
>  3 files changed, 75 insertions(+), 60 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 3c1d88b36741..26e0320c8097 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -183,7 +183,6 @@ struct hv_synic_pages {
>  };
> 
>  struct mshv_root {
> -	struct hv_synic_pages __percpu *synic_pages;
>  	spinlock_t pt_ht_lock;
>  	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
>  	struct hv_partition_property_vmm_capabilities vmm_caps;
> @@ -242,8 +241,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
>  void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
> 
>  void mshv_isr(void);
> -int mshv_synic_init(unsigned int cpu);
> -int mshv_synic_cleanup(unsigned int cpu);
> +int mshv_synic_init(struct device *dev);
> +void mshv_synic_cleanup(void);
> 
>  static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
>  {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 681b58154d5e..7c1666456e78 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2035,7 +2035,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
> 
> -static int mshv_cpuhp_online;
>  static int mshv_root_sched_online;
> 
>  static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> @@ -2198,40 +2197,14 @@ root_scheduler_deinit(void)
>  	free_percpu(root_scheduler_output);
>  }
> 
> -static int mshv_reboot_notify(struct notifier_block *nb,
> -			      unsigned long code, void *unused)
> -{
> -	cpuhp_remove_state(mshv_cpuhp_online);
> -	return 0;
> -}
> -
> -struct notifier_block mshv_reboot_nb = {
> -	.notifier_call = mshv_reboot_notify,
> -};
> -
>  static void mshv_root_partition_exit(void)
>  {
> -	unregister_reboot_notifier(&mshv_reboot_nb);
>  	root_scheduler_deinit();
>  }
> 
>  static int __init mshv_root_partition_init(struct device *dev)
>  {
> -	int err;
> -
> -	err = root_scheduler_init(dev);
> -	if (err)
> -		return err;
> -
> -	err = register_reboot_notifier(&mshv_reboot_nb);
> -	if (err)
> -		goto root_sched_deinit;
> -
> -	return 0;
> -
> -root_sched_deinit:
> -	root_scheduler_deinit();
> -	return err;
> +	return root_scheduler_init(dev);
>  }
> 
>  static void mshv_init_vmm_caps(struct device *dev)
> @@ -2276,31 +2249,18 @@ static int __init mshv_parent_partition_init(void)
>  			MSHV_HV_MAX_VERSION);
>  	}
> 
> -	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> -	if (!mshv_root.synic_pages) {
> -		dev_err(dev, "Failed to allocate percpu synic page\n");
> -		ret = -ENOMEM;
> +	ret = mshv_synic_init(dev);
> +	if (ret)
>  		goto device_deregister;
> -	}
> -
> -	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> -				mshv_synic_init,
> -				mshv_synic_cleanup);
> -	if (ret < 0) {
> -		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> -		goto free_synic_pages;
> -	}
> -
> -	mshv_cpuhp_online = ret;
> 
>  	ret = mshv_retrieve_scheduler_type(dev);
>  	if (ret)
> -		goto remove_cpu_state;
> +		goto synic_cleanup;
> 
>  	if (hv_root_partition())
>  		ret = mshv_root_partition_init(dev);
>  	if (ret)
> -		goto remove_cpu_state;
> +		goto synic_cleanup;
> 
>  	mshv_init_vmm_caps(dev);
> 
> @@ -2318,10 +2278,8 @@ static int __init mshv_parent_partition_init(void)
>  exit_partition:
>  	if (hv_root_partition())
>  		mshv_root_partition_exit();
> -remove_cpu_state:
> -	cpuhp_remove_state(mshv_cpuhp_online);
> -free_synic_pages:
> -	free_percpu(mshv_root.synic_pages);
> +synic_cleanup:
> +	mshv_synic_cleanup();
>  device_deregister:
>  	misc_deregister(&mshv_dev);
>  	return ret;
> @@ -2335,8 +2293,7 @@ static void __exit mshv_parent_partition_exit(void)
>  	mshv_irqfd_wq_cleanup();
>  	if (hv_root_partition())
>  		mshv_root_partition_exit();
> -	cpuhp_remove_state(mshv_cpuhp_online);
> -	free_percpu(mshv_root.synic_pages);
> +	mshv_synic_cleanup();
>  }
> 
>  module_init(mshv_parent_partition_init);
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index f8b0337cdc82..074e37c48876 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -12,11 +12,16 @@
>  #include <linux/mm.h>
>  #include <linux/io.h>
>  #include <linux/random.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/reboot.h>
>  #include <asm/mshyperv.h>
> 
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
> 
> +static int synic_cpuhp_online;
> +static struct hv_synic_pages __percpu *synic_pages;
> +
>  static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  {
>  	struct hv_synic_event_ring_page **event_ring_page;
> @@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  	u32 message;
>  	u8 tail;
> 
> -	spages = this_cpu_ptr(mshv_root.synic_pages);
> +	spages = this_cpu_ptr(synic_pages);
>  	event_ring_page = &spages->synic_event_ring_page;
>  	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> 
> @@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
> 
>  void mshv_isr(void)
>  {
> -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_message *msg;
>  	bool handled;
> @@ -446,7 +451,7 @@ void mshv_isr(void)
>  	}
>  }
> 
> -int mshv_synic_init(unsigned int cpu)
> +static int mshv_synic_cpu_init(unsigned int cpu)
>  {
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
> @@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
>  	union hv_synic_sint sint;
>  #endif
>  	union hv_synic_scontrol sctrl;
> -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  			&spages->synic_event_flags_page;
> @@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
>  	return -EFAULT;
>  }
> 
> -int mshv_synic_cleanup(unsigned int cpu)
> +static int mshv_synic_cpu_exit(unsigned int cpu)
>  {
>  	union hv_synic_sint sint;
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
>  	union hv_synic_scontrol sctrl;
> -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  		&spages->synic_event_flags_page;
> @@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
> 
>  	mshv_portid_free(doorbell_portid);
>  }
> +
> +static int mshv_synic_reboot_notify(struct notifier_block *nb,
> +			      unsigned long code, void *unused)
> +{
> +	if (!hv_root_partition())
> +		return 0;

I'm curious as to why the synic is cleaned up only for the root partition,
but not for L1VH parents. L1VH parents *do* clean up their synic in
mshv_parent_partition_exit(). I probably don't understand all the
vagaries of L1VH parents ....

> +
> +	cpuhp_remove_state(synic_cpuhp_online);
> +	return 0;
> +}
> +
> +static struct notifier_block mshv_synic_reboot_nb = {
> +	.notifier_call = mshv_synic_reboot_notify,
> +};
> +
> +int __init mshv_synic_init(struct device *dev)
> +{
> +	int ret = 0;
> +
> +	synic_pages = alloc_percpu(struct hv_synic_pages);
> +	if (!synic_pages) {
> +		dev_err(dev, "Failed to allocate percpu synic page\n");
> +		return -ENOMEM;
> +	}
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> +				mshv_synic_cpu_init,
> +				mshv_synic_cpu_exit);
> +	if (ret < 0) {
> +		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> +		goto free_synic_pages;
> +	}
> +
> +	synic_cpuhp_online = ret;
> +
> +	ret = register_reboot_notifier(&mshv_synic_reboot_nb);
> +	if (ret)
> +		goto remove_cpuhp_state;
> +
> +	return 0;
> +
> +remove_cpuhp_state:
> +	cpuhp_remove_state(synic_cpuhp_online);
> +free_synic_pages:
> +	free_percpu(synic_pages);
> +	return ret;
> +}
> +
> +void mshv_synic_cleanup(void)
> +{
> +	unregister_reboot_notifier(&mshv_synic_reboot_nb);
> +	cpuhp_remove_state(synic_cpuhp_online);
> +	free_percpu(synic_pages);
> +}
> --
> 2.34.1
> 


^ permalink raw reply

* RE: [PATCH v4 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Michael Kelley @ 2026-02-18  4:17 UTC (permalink / raw)
  To: Anirudh Rayabharam, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260211170728.3056226-3-anirudh@anirudhrb.com>

From: Anirudh Rayabharam <anirudh@anirudhrb.com> Sent: Wednesday, February 11, 2026 9:07 AM
> 
> On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> There is no such vector reserved for arm64.
> 
> On arm64, the hypervisor exposes a synthetic register that can be read
> to find the INTID that should be used for SINTs. This INTID is in the
> PPI range.
> 
> To better unify the code paths, introduce mshv_sint_vector_init() that
> either reads the synthetic register and obtains the INTID (arm64) or
> just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
> 
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
>  drivers/hv/mshv_synic.c     | 112 +++++++++++++++++++++++++++++++++---
>  include/hyperv/hvgdk_mini.h |   2 +
>  2 files changed, 107 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 074e37c48876..7957ad0328dd 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -10,17 +10,24 @@
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
>  #include <linux/mm.h>
> +#include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/random.h>
>  #include <linux/cpuhotplug.h>
>  #include <linux/reboot.h>
>  #include <asm/mshyperv.h>
> +#include <linux/platform_device.h>
> +#include <linux/acpi.h>
> 
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
> 
>  static int synic_cpuhp_online;
>  static struct hv_synic_pages __percpu *synic_pages;
> +static int mshv_sint_vector = -1; /* hwirq for the SynIC SINTs */

With the introduction of this variable, the call to add_interrupt_randomness()
in mshv_isr() should be updated to pass mshv_sint_vector as the argument,
and the #ifdef HYPERVISOR_CALLBACK_VECTOR can be dropped (yea!).  My
previous comment about the generic Linux IRQ handling doing the call
to add_interrupt_randomness() is true for "normal" IRQs but not for per-CPU
IRQs like these. So the call to add_interrupt_randomness() in mshv_isr() is
needed on both x86 and ARM64.

> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +static int mshv_sint_irq = -1; /* Linux IRQ for mshv_sint_vector */
> +#endif

Documentation/process/coding-style.rst says the following in Section 21:

If you have a function or variable which may potentially go unused in a
particular configuration, and the compiler would warn about its definition
going unused, mark the definition as __maybe_unused rather than wrapping it in
a preprocessor conditional.

You could tag mshv_sint_irq with "__maybe_unused" and avoid the #ifndef. But
see further comments below.

> 
>  static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  {
> @@ -456,9 +463,7 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
>  	union hv_synic_sint sint;
> -#endif
>  	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> @@ -501,10 +506,13 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> 
>  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> 
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +	enable_percpu_irq(mshv_sint_irq, 0);
> +#endif
> +

Using IS_ENABLED() would be better than the #ifndef. (See Section 21
of coding-style.rst about this as well.) You would need to drop the #ifndef
around mshv_sint_irq, which is fine.

	if (!IS_ENABLED(HYPERVISOR_CALLBACK_VECTOR))
		enable_percpu_irq(mshv_sint_irq, 0);

That said, I prefer the approach in v1 of your series where basically
the code says "if we have a sint irq, enable it". This links the enablement
most closely to what it directly depends on.

	if (mshv_sint_irq != -1)
		enable_percpu_irq(mshv_sint_irq, 0);

But I realize the approach is somewhat a matter of personal preference so either
way is acceptable.
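
To make the "if we have a sint irq, enable it" variant concrete, here is
a small userspace model. The sim_* names and the call counter are
assumptions for illustration only; the real code would call
enable_percpu_irq(). One practical point in favor of the runtime check:
IS_ENABLED() is documented for Kconfig CONFIG_ symbols, so it is worth
double-checking how it behaves for a macro such as
HYPERVISOR_CALLBACK_VECTOR before relying on it.

```c
/* -1 means "no Linux IRQ was set up for the SINT". */
static int sim_sint_irq = -1;

/* Counts calls so the behavior can be observed in a test. */
static int sim_enable_calls;

/* Stand-in for enable_percpu_irq(). */
static void sim_enable_percpu_irq(int irq)
{
	(void)irq;
	sim_enable_calls++;
}

/* Per-CPU init: enable the IRQ only if one actually exists. */
static void sim_cpu_init(void)
{
	if (sim_sint_irq != -1)
		sim_enable_percpu_irq(sim_sint_irq);
}
```

With this shape, the x86 path (where no IRQ is ever registered) and the
arm64 path share the same init code without preprocessor conditionals.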

>  	/* Enable intercepts */
>  	sint.as_uint64 = 0;
> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.vector = mshv_sint_vector;
>  	sint.masked = false;
>  	sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> @@ -512,13 +520,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> 
>  	/* Doorbell SINT */
>  	sint.as_uint64 = 0;
> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.vector = mshv_sint_vector;
>  	sint.masked = false;
>  	sint.as_intercept = 1;
>  	sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
> -#endif
> 
>  	/* Enable global synic bit */
>  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> @@ -573,6 +580,10 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
> 
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +	disable_percpu_irq(mshv_sint_irq);
> +#endif
> +

Same here.

>  	/* Disable Synic's event ring page */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
>  	sirbp.sirbp_enabled = false;
> @@ -683,14 +694,98 @@ static struct notifier_block mshv_synic_reboot_nb = {
>  	.notifier_call = mshv_synic_reboot_notify,
>  };
> 
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +#ifdef CONFIG_ACPI
> +static long __percpu *mshv_evt;
> +#endif

Same comment here about the coding-style.rst guidelines.

Furthermore, mshv_evt could be directly defined here as a per-cpu "long",
rather than a pointer to a long. Then you don't need to do a runtime
per-cpu allocation with all the attendant error checking and cleanup, which
saves about 10 lines of code. So

static DEFINE_PER_CPU(long, mshv_evt);

drivers/clocksource/hyperv_timer.c does the definition for stimer0_evt this
way. I looked through all kernel code and found several other places doing
the direct definition. I don't remember why I didn't do the direct method for
vmbus_evt, but I'm planning to submit a patch to change it, which will drop
a few lines of code.
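
The difference between the two styles can be sketched in userspace with
a plain array standing in for per-CPU storage. This is only an analogy:
real per-CPU variables are not arrays, and the pcm_* names and
PCM_NR_CPUS are assumptions made up for the example.

```c
#include <stdlib.h>

#define PCM_NR_CPUS	4

/*
 * Direct definition (analogous to DEFINE_PER_CPU): the storage always
 * exists, so there is nothing to allocate, check, or free.
 */
static long pcm_evt_static[PCM_NR_CPUS];

static long *pcm_evt_static_ptr(int cpu)
{
	return &pcm_evt_static[cpu];
}

/*
 * Runtime allocation (analogous to alloc_percpu): needs an init step
 * with an error path, plus a matching cleanup step.
 */
static long *pcm_evt_dynamic;

static int pcm_evt_dynamic_init(void)
{
	pcm_evt_dynamic = calloc(PCM_NR_CPUS, sizeof(*pcm_evt_dynamic));
	if (!pcm_evt_dynamic)
		return -1;
	return 0;
}

static void pcm_evt_dynamic_exit(void)
{
	free(pcm_evt_dynamic);
	pcm_evt_dynamic = NULL;
}
```

The static variant drops both pcm_evt_dynamic_init() and
pcm_evt_dynamic_exit(), which is where the saved lines come from.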

> +
> +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> +{
> +	mshv_isr();
> +	return IRQ_HANDLED;
> +}

This function generates a warning about being unused when !CONFIG_ACPI.
But see further comments below.

> +
> +static int __init mshv_sint_vector_init(void)
> +{
> +#ifdef CONFIG_ACPI
> +	int ret;
> +	struct hv_register_assoc reg = {
> +		.name = HV_ARM64_REGISTER_SINT_RESERVED_INTERRUPT_ID,
> +	};
> +	union hv_input_vtl input_vtl = { 0 };
> +
> +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> +				1, input_vtl, &reg);
> +	if (ret || !reg.value.reg64)
> +		return -ENODEV;
> +
> +	mshv_sint_vector = reg.value.reg64;
> +	ret  = acpi_register_gsi(NULL, mshv_sint_vector, ACPI_EDGE_SENSITIVE,
> +					ACPI_ACTIVE_HIGH);
> +	if (ret < 0)
> +		goto out_fail;
> +
> +	mshv_sint_irq = ret;
> +
> +	mshv_evt = alloc_percpu(long);
> +	if (!mshv_evt) {
> +		ret = -ENOMEM;
> +		goto out_unregister;
> +	}
> +
> +	ret = request_percpu_irq(mshv_sint_irq, mshv_percpu_isr, "MSHV",
> +		mshv_evt);
> +	if (ret)
> +		goto free_evt;
> +
> +	return 0;
> +
> +free_evt:
> +	free_percpu(mshv_evt);
> +out_unregister:
> +	acpi_unregister_gsi(mshv_sint_vector);
> +out_fail:
> +	return ret;
> +#else
> +	return -ENODEV;
> +#endif
> +}

I have several thoughts about the #ifdef CONFIG_ACPI.

The coding-style.rst guidelines in Section 21 also say:

Prefer to compile out entire functions, rather than portions of functions or
portions of expressions.  Rather than putting an ifdef in an expression, factor
out part or all of the expression into a separate helper function and apply the
conditional to that function.

But more fundamentally, it looks like the #ifdef CONFIG_ACPI is there
solely because acpi_register_gsi() exists only when CONFIG_ACPI is set.
The rest of the code doesn't depend on ACPI. In the !CONFIG_ACPI case,
your stub code returns -ENODEV, so doorbell & intercept SINTs just don't
work, and pretty much everything is non-functional.

This patch doesn't allude to any future DeviceTree case that parallels ACPI,
so I'm unsure what's expected in the future.  If such a future DT case is
murky, perhaps drivers/hv/Kconfig should give MSHV_ROOT a dependency
on ACPI. Then the #ifdef CONFIG_ACPI could be dropped, along with the
#else stub code. When/if the DT use case comes along, the dependency
can be removed and the code structured to handle both ACPI and DT.
The code to fetch the INTID via the hypervisor synthetic register, and the
request_percpu_irq() would be applicable to both. It's only the GSI
registration that would be different, and that could be pulled out into a
helper function that handles the difference in ACPI and DT. I haven't looked
to see how DT does the equivalent of GSI registration.

Another approach would be to add stubs for acpi_register_gsi() and
acpi_unregister_gsi() in include/linux/acpi.h.  A number of such stubs
have been added over the years. Saurabh got one added in 2023
(commit 1f6277bf716cc). Then the above code would compile even
with !CONFIG_ACPI.  acpi_register_gsi() would fail, and you would get
an error return. This approach produces cleaner code and is consistent
with similar use cases that depend on stubs provided by include/linux/acpi.h
rather than #ifdefs.

And either of these approaches avoids the unused mshv_percpu_isr()
function and mshv_evt variable.
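
As a rough illustration of the stub approach, the following userspace
sketch shows the general shape. STB_HAVE_ACPI stands in for CONFIG_ACPI,
and the stb_* functions are hypothetical; in particular, the error code
a real stub in include/linux/acpi.h would return is an assumption here.

```c
#include <errno.h>

/*
 * STB_HAVE_ACPI stands in for CONFIG_ACPI. It is left undefined here so
 * the stubs below are the ones that get compiled.
 */
#ifdef STB_HAVE_ACPI
int stb_register_gsi(unsigned int gsi);		/* real version elsewhere */
void stb_unregister_gsi(unsigned int gsi);
#else
/* Stubs: always fail registration, do nothing on unregistration. */
static inline int stb_register_gsi(unsigned int gsi)
{
	(void)gsi;
	return -ENODEV;
}

static inline void stb_unregister_gsi(unsigned int gsi)
{
	(void)gsi;
}
#endif

static int stb_sint_irq = -1;

/*
 * The caller contains no #ifdef at all: without ACPI the registration
 * simply fails and the error is propagated.
 */
static int stb_sint_vector_init(unsigned int intid)
{
	int ret = stb_register_gsi(intid);

	if (ret < 0)
		return ret;

	stb_sint_irq = ret;
	return 0;
}
```

Since the conditional lives in the header next to the declarations, the
caller compiles identically in both configurations, so no helper or
variable ends up unused.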

> +
> +static void mshv_sint_vector_cleanup(void)
> +{
> +#ifdef CONFIG_ACPI
> +	free_percpu_irq(mshv_sint_irq, mshv_evt);
> +	free_percpu(mshv_evt);
> +	acpi_unregister_gsi(mshv_sint_vector);
> +#endif
> +}
> +#else /* !HYPERVISOR_CALLBACK_VECTOR */
> +static int __init mshv_sint_vector_init(void)
> +{
> +	mshv_sint_vector = HYPERVISOR_CALLBACK_VECTOR;
> +	return 0;
> +}
> +
> +static void mshv_sint_vector_cleanup(void)
> +{
> +}
> +#endif /* HYPERVISOR_CALLBACK_VECTOR */
> +
>  int __init mshv_synic_init(struct device *dev)
>  {
>  	int ret = 0;
> 
> +	ret = mshv_sint_vector_init();
> +	if (ret) {
> +		dev_err(dev, "Failed to get MSHV SINT vector: %i\n", ret);
> +		return ret;
> +	}
> +
>  	synic_pages = alloc_percpu(struct hv_synic_pages);
>  	if (!synic_pages) {
>  		dev_err(dev, "Failed to allocate percpu synic page\n");
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto sint_vector_cleanup;
>  	}
> 
>  	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> @@ -713,6 +808,8 @@ int __init mshv_synic_init(struct device *dev)
>  	cpuhp_remove_state(synic_cpuhp_online);
>  free_synic_pages:
>  	free_percpu(synic_pages);
> +sint_vector_cleanup:
> +	mshv_sint_vector_cleanup();
>  	return ret;
>  }
> 
> @@ -721,4 +818,5 @@ void mshv_synic_cleanup(void)
>  	unregister_reboot_notifier(&mshv_synic_reboot_nb);
>  	cpuhp_remove_state(synic_cpuhp_online);
>  	free_percpu(synic_pages);
> +	mshv_sint_vector_cleanup();
>  }
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 30fbbde81c5c..7676f78e0766 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -1117,6 +1117,8 @@ enum hv_register_name {
>  	HV_X64_REGISTER_MSR_MTRR_FIX4KF8000	= 0x0008007A,
> 
>  	HV_X64_REGISTER_REG_PAGE	= 0x0009001C,
> +#elif defined(CONFIG_ARM64)
> +	HV_ARM64_REGISTER_SINT_RESERVED_INTERRUPT_ID	= 0x00070001,
>  #endif
>  };
> 
> --
> 2.34.1
> 

^ permalink raw reply

* Re: [PATCH v2 2/3] x86/hyperv: Use savesegment() instead of inline asm() to save segment registers
From: Wei Liu @ 2026-02-18  6:43 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Uros Bizjak, linux-hyperv@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, Wei Liu, K. Y. Srinivasan,
	Haiyang Zhang, Dexuan Cui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin
In-Reply-To: <SN6PR02MB4157A2BE3BF643B9E8CB2B0BD462A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Feb 10, 2026 at 10:40:35PM +0000, Michael Kelley wrote:
> From: Uros Bizjak <ubizjak@gmail.com> Sent: Friday, November 21, 2025 6:14 AM
> > 
> > Use standard savesegment() utility macro to save segment registers.
> 
> Patch 1 of this series was included in the tip tree. But this patch (Patch 2) and
> Patch 3 have not been picked up anywhere.
> 
> Wei Liu -- could you pick these two up in the hyperv tree?

Applied patches 2 and 3. Thank you for the reminder.

Wei

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-02-18  6:48 UTC (permalink / raw)
  To: Bezdeka, Florian (FT RPD CED OES-DE), kys@microsoft.com,
	decui@microsoft.com, bp@alien8.de, longli@microsoft.com,
	dave.hansen@linux.intel.com, mingo@redhat.com, wei.liu@kernel.org,
	tglx@kernel.org, haiyangz@microsoft.com, x86@kernel.org
  Cc: linux-rt-users@vger.kernel.org, namjain@linux.microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	levymitchell0@gmail.com, mhklinux@outlook.com,
	ssengar@linux.microsoft.com
In-Reply-To: <033ecfefc85bdb7c508d488c5004913d87057142.camel@siemens.com>

On 18.02.26 00:03, Bezdeka, Florian (FT RPD CED OES-DE) wrote:
> On Mon, 2026-02-16 at 17:24 +0100, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
>> with related guest support enabled:
>>
>> [    1.127941] hv_vmbus: registering driver hyperv_drm
>>
>> [    1.132518] =============================
>> [    1.132519] [ BUG: Invalid wait context ]
>> [    1.132521] 6.19.0-rc8+ #9 Not tainted
>> [    1.132524] -----------------------------
>> [    1.132525] swapper/0/0 is trying to lock:
>> [    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
>> [    1.132543] other info that might help us debug this:
>> [    1.132544] context-{2:2}
>> [    1.132545] 1 lock held by swapper/0/0:
>> [    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
>> [    1.132557] stack backtrace:
>> [    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
>> [    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
>> [    1.132567] Call Trace:
>> [    1.132570]  <IRQ>
>> [    1.132573]  dump_stack_lvl+0x6e/0xa0
>> [    1.132581]  __lock_acquire+0xee0/0x21b0
>> [    1.132592]  lock_acquire+0xd5/0x2d0
>> [    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
>> [    1.132606]  ? lock_acquire+0xd5/0x2d0
>> [    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
>> [    1.132619]  rt_spin_lock+0x3f/0x1f0
>> [    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
>> [    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
>> [    1.132634]  vmbus_chan_sched+0xc4/0x2b0
>> [    1.132641]  vmbus_isr+0x2c/0x150
>> [    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
>> [    1.132654]  sysvec_hyperv_callback+0x88/0xb0
>> [    1.132658]  </IRQ>
>> [    1.132659]  <TASK>
>> [    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20
>>
>> As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT,
>> the vmbus_isr execution needs to be moved into thread context. Open-
>> coding this allows to skip the IPI that irq_work would additionally
>> bring and which we do not need, being an IRQ, never an NMI.
>>
>> This affects both x86 and arm64, therefore hook into the common driver
>> logic.
> 
> I tested this patch in combination with the related SCSI driver patch.
> The tests were done on x86 with both VM generations provided by Hyper-V.
> 
> Lockdep was enabled and there were no splat reports within 24 hours of
> massive load produced by stress-ng.
> 
> With that:
> 
> Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
> Tested-by: Florian Bezdeka <florian.bezdeka@siemens.com>
> 
> 
> Side note: We did some backports down to 6.1 already, just in case
> someone is interested. We recognized a massive network performance drop
> in 6.1. The root cause has been identified and is not related to this
> patch. It's simply another RT regression caused by a missing stable-rt
> backport. Upstreaming in progress...
> 

Submitted:
https://lore.kernel.org/stable/05ae6b87-0b53-4948-a1ed-2a3235a5f82b@siemens.com/T/#u

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center


* Re: [PATCH v2] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Wei Liu @ 2026-02-18  6:49 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-hyperv, linux-kernel, kys, haiyangz, wei.liu, decui, longli,
	tglx, mingo, bp, dave.hansen, x86, hpa
In-Reply-To: <20260217231158.1184736-1-mrathor@linux.microsoft.com>

On Tue, Feb 17, 2026 at 03:11:58PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> 
> MSVC compiler, used to compile the Microsoft Hyper-V hypervisor currently,
> has an assert intrinsic that uses interrupt vector 0x29 to create an
> exception. This will cause hypervisor to then crash and collect core. As
> such, if this interrupt number is assigned to a device by Linux and the
> device generates it, hypervisor will crash. There are two other such
> vectors hard coded in the hypervisor, 0x2C and 0x2D for debug purposes.
> Fortunately, the three vectors are part of the kernel driver space and
> that makes it feasible to reserve them early so they are not assigned
> later.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>

Queued. I also did a few cosmetic changes to this patch.

Wei


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Wei Liu @ 2026-02-18  7:05 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
	Michael Kelley, Saurabh Singh Sengar, Naman Jain
In-Reply-To: <289d8e52-40f8-4b22-8aa9-d0bd3bd15aae@siemens.com>

On Mon, Feb 16, 2026 at 05:24:56PM +0100, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
> with related guest support enabled:
> 
> [    1.127941] hv_vmbus: registering driver hyperv_drm
> 
> [    1.132518] =============================
> [    1.132519] [ BUG: Invalid wait context ]
> [    1.132521] 6.19.0-rc8+ #9 Not tainted
> [    1.132524] -----------------------------
> [    1.132525] swapper/0/0 is trying to lock:
> [    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
> [    1.132543] other info that might help us debug this:
> [    1.132544] context-{2:2}
> [    1.132545] 1 lock held by swapper/0/0:
> [    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
> [    1.132557] stack backtrace:
> [    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
> [    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> [    1.132567] Call Trace:
> [    1.132570]  <IRQ>
> [    1.132573]  dump_stack_lvl+0x6e/0xa0
> [    1.132581]  __lock_acquire+0xee0/0x21b0
> [    1.132592]  lock_acquire+0xd5/0x2d0
> [    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
> [    1.132606]  ? lock_acquire+0xd5/0x2d0
> [    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
> [    1.132619]  rt_spin_lock+0x3f/0x1f0
> [    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
> [    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
> [    1.132634]  vmbus_chan_sched+0xc4/0x2b0
> [    1.132641]  vmbus_isr+0x2c/0x150
> [    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
> [    1.132654]  sysvec_hyperv_callback+0x88/0xb0
> [    1.132658]  </IRQ>
> [    1.132659]  <TASK>
> [    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20
> 
> As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT,
> the vmbus_isr execution needs to be moved into thread context. Open-
> coding this allows to skip the IPI that irq_work would additionally
> bring and which we do not need, being an IRQ, never an NMI.
> 
> This affects both x86 and arm64, therefore hook into the common driver
> logic.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Applied to hyperv-next. Thanks.

Saurabh and Naman, I want to get this submitted in this merge window. If
you find any more issues with this patch, we can address them in the RC
phase. In the worst case, we can revert this patch later.
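
For readers following the thread, here is a rough userspace analogy of
the pattern the patch description names: the interrupt side only flags
the work and wakes a dedicated thread, and the handler body runs in
that thread, where sleeping locks are allowed. This is only a sketch
and not the patch itself. POSIX threads and a semaphore stand in for a
kthread and its wakeup, and every name below (deferred_isr and its
helpers) is hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names come from the patch. */
struct deferred_isr {
	sem_t wake;		/* wakes the handler thread */
	atomic_bool pending;	/* set by the "hard IRQ" side */
	atomic_bool stop;	/* asks the thread to exit */
	atomic_int handled;	/* number of handler runs */
	pthread_t thread;
};

/* Thread context: a real handler could take sleeping locks here. */
static void *deferred_isr_thread(void *arg)
{
	struct deferred_isr *d = arg;

	for (;;) {
		sem_wait(&d->wake);
		/* Pending work is drained before the stop flag is honoured. */
		if (atomic_exchange(&d->pending, false))
			atomic_fetch_add(&d->handled, 1);
		if (atomic_load(&d->stop))
			break;
	}
	return NULL;
}

/* "Hard IRQ" side: only flags the work and wakes the thread. */
static void deferred_isr_raise(struct deferred_isr *d)
{
	atomic_store(&d->pending, true);
	sem_post(&d->wake);
}

/* Returns 0 on success, -1 if the semaphore or thread cannot be set up. */
static int deferred_isr_start(struct deferred_isr *d)
{
	atomic_init(&d->pending, false);
	atomic_init(&d->stop, false);
	atomic_init(&d->handled, 0);
	if (sem_init(&d->wake, 0, 0) != 0)
		return -1;
	if (pthread_create(&d->thread, NULL, deferred_isr_thread, d) != 0) {
		sem_destroy(&d->wake);
		return -1;
	}
	return 0;
}

/* Stops the thread and returns how many times the handler ran. */
static int deferred_isr_stop(struct deferred_isr *d)
{
	atomic_store(&d->stop, true);
	sem_post(&d->wake);
	pthread_join(d->thread, NULL);
	sem_destroy(&d->wake);
	return atomic_load(&d->handled);
}
```

Calling deferred_isr_start(), then deferred_isr_raise() once, then
deferred_isr_stop() runs the handler exactly once. Several raises that
arrive before the thread wakes may be coalesced into a single run,
which is a property of this sketch and not a claim about the patch.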

Wei


* Re: [PATCH v2] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Wei Liu @ 2026-02-18  7:17 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-hyperv, linux-kernel, kys, haiyangz, wei.liu, decui, longli,
	tglx, mingo, bp, dave.hansen, x86, hpa
In-Reply-To: <20260217231158.1184736-1-mrathor@linux.microsoft.com>

On Tue, Feb 17, 2026 at 03:11:58PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> 
> MSVC compiler, used to compile the Microsoft Hyper-V hypervisor currently,
> has an assert intrinsic that uses interrupt vector 0x29 to create an
> exception. This will cause hypervisor to then crash and collect core. As
> such, if this interrupt number is assigned to a device by Linux and the
> device generates it, hypervisor will crash. There are two other such
> vectors hard coded in the hypervisor, 0x2C and 0x2D for debug purposes.
> Fortunately, the three vectors are part of the kernel driver space and
> that makes it feasible to reserve them early so they are not assigned
> later.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> 
> v1: Add ifndef CONFIG_X86_FRED (thanks hpa)
> v2: replace ifndef with cpu_feature_enabled() (thanks hpa and tglx)
> 
>  arch/x86/kernel/cpu/mshyperv.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 579fb2c64cfd..88ca127dc6d4 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -478,6 +478,28 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>  }
>  EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>  
> +/*
> + * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
> + * will either crash or hang or attempt to break into debugger.
> + */
> +static void hv_reserve_irq_vectors(void)
> +{
> +	#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
> +	#define HYPERV_DBG_ASSERT_VECTOR	0x2C
> +	#define HYPERV_DBG_SERVICE_VECTOR	0x2D
> +
> +	if (cpu_feature_enabled(X86_FEATURE_FRED))
> +		return;
> +
> +	if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
> +	    test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
> +	    test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
> +		BUG();
> +
> +	pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
> +		HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
> +}
> +
>  static void __init ms_hyperv_init_platform(void)
>  {
>  	int hv_max_functions_eax, eax;
> @@ -510,6 +532,11 @@ static void __init ms_hyperv_init_platform(void)
>  
>  	hv_identify_partition_type();
>  
> +#ifndef CONFIG_X86_FRED
> +	if (hv_root_partition())
> +		hv_reserve_irq_vectors();
> +#endif	/* CONFIG_X86_FRED */
> +

With CONFIG_X86_FRED=y, this call is compiled out. However, such a kernel
may still run without FRED active, and then the vectors must still be
reserved.

I think the function should always be called.
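
To illustrate, here is a small userspace model (not kernel code) of the
reservation logic. The vector numbers are taken from the patch;
struct vector_map, model_test_and_set_bit() and
model_reserve_irq_vectors() are hypothetical stand-ins for
system_vectors, test_and_set_bit() and hv_reserve_irq_vectors(), and
the fred_active argument stands in for
cpu_feature_enabled(X86_FEATURE_FRED). Unlike the patch, the model
reports a clash through its return value instead of calling BUG(), and
it does not short-circuit on the first clash.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_VECTORS	256
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS	((NR_VECTORS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Vector numbers taken from the patch under review. */
#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
#define HYPERV_DBG_ASSERT_VECTOR	0x2C
#define HYPERV_DBG_SERVICE_VECTOR	0x2D

/* Hypothetical stand-in for the kernel's system_vectors bitmap. */
struct vector_map {
	unsigned long bits[BITMAP_LONGS];
};

/* Non-atomic model of test_and_set_bit(): set the bit, return its old value. */
static bool model_test_and_set_bit(unsigned int nr, struct vector_map *map)
{
	unsigned long mask, *word;
	bool was_set;

	assert(nr < NR_VECTORS);
	mask = 1UL << (nr % BITS_PER_LONG);
	word = &map->bits[nr / BITS_PER_LONG];
	was_set = (*word & mask) != 0;
	*word |= mask;
	return was_set;
}

/*
 * Returns 0 if the three vectors were reserved, 1 if the reservation was
 * skipped because FRED is active at run time, and -1 if any vector was
 * already taken.
 */
static int model_reserve_irq_vectors(struct vector_map *map, bool fred_active)
{
	bool clash = false;

	if (fred_active)
		return 1;

	clash |= model_test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, map);
	clash |= model_test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, map);
	clash |= model_test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, map);

	return clash ? -1 : 0;
}
```

Since the run-time check already lives inside the function, calling it
unconditionally covers both cases: with FRED active it returns early,
and without FRED it reserves the vectors regardless of how the kernel
was configured.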

Wei

>  	if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
>  		ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
>  
> -- 
> 2.51.2.vfs.0.1
> 
> 


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Saurabh Singh Sengar @ 2026-02-18  7:19 UTC (permalink / raw)
  To: Wei Liu
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
	Michael Kelley, Naman Jain
In-Reply-To: <20260218070557.GF2236050@liuwe-devbox-debian-v2.local>

On Wed, Feb 18, 2026 at 07:05:57AM +0000, Wei Liu wrote:
> On Mon, Feb 16, 2026 at 05:24:56PM +0100, Jan Kiszka wrote:
> > From: Jan Kiszka <jan.kiszka@siemens.com>
> > 
> > Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
> > with related guest support enabled:
> > 
> > [    1.127941] hv_vmbus: registering driver hyperv_drm
> > 
> > [    1.132518] =============================
> > [    1.132519] [ BUG: Invalid wait context ]
> > [    1.132521] 6.19.0-rc8+ #9 Not tainted
> > [    1.132524] -----------------------------
> > [    1.132525] swapper/0/0 is trying to lock:
> > [    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
> > [    1.132543] other info that might help us debug this:
> > [    1.132544] context-{2:2}
> > [    1.132545] 1 lock held by swapper/0/0:
> > [    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
> > [    1.132557] stack backtrace:
> > [    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
> > [    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > [    1.132567] Call Trace:
> > [    1.132570]  <IRQ>
> > [    1.132573]  dump_stack_lvl+0x6e/0xa0
> > [    1.132581]  __lock_acquire+0xee0/0x21b0
> > [    1.132592]  lock_acquire+0xd5/0x2d0
> > [    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
> > [    1.132606]  ? lock_acquire+0xd5/0x2d0
> > [    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
> > [    1.132619]  rt_spin_lock+0x3f/0x1f0
> > [    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
> > [    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
> > [    1.132634]  vmbus_chan_sched+0xc4/0x2b0
> > [    1.132641]  vmbus_isr+0x2c/0x150
> > [    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
> > [    1.132654]  sysvec_hyperv_callback+0x88/0xb0
> > [    1.132658]  </IRQ>
> > [    1.132659]  <TASK>
> > [    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20
> > 
> > As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT,
> > the vmbus_isr execution needs to be moved into thread context. Open-
> > coding this allows to skip the IPI that irq_work would additionally
> > bring and which we do not need, being an IRQ, never an NMI.
> > 
> > This affects both x86 and arm64, therefore hook into the common driver
> > logic.
> > 
> > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Applied to hyperv-next. Thanks.
> 
> Saurabh and Naman, I want to get this submitted in this merge window. If
> you find any more issues with this patch, we can address them in the RC
> phase. In the worst case, we can revert this patch later.
> 
> Wei

I was in the process of completing the final round of testing. However,
since the change has now been merged and will receive broader coverage,
I will rely on that.

Overall, the patch looks good to me.

- Saurabh

