Linux-HyperV List


* Re: [PATCH v2] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Mukesh R @ 2026-02-20 18:56 UTC
  To: Wei Liu, Michael Kelley
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
	longli@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com
In-Reply-To: <20260220184520.GB3119916@liuwe-devbox-debian-v2.local>

On 2/20/26 10:45, Wei Liu wrote:
> On Fri, Feb 20, 2026 at 05:14:26PM +0000, Michael Kelley wrote:
>> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, February 17, 2026 3:12 PM
>>>
>>> MSVC compiler, used to compile the Microsoft Hyper-V hypervisor currently,
>>> has an assert intrinsic that uses interrupt vector 0x29 to create an
>>> exception. This will cause hypervisor to then crash and collect core. As
>>> such, if this interrupt number is assigned to a device by Linux and the
>>> device generates it, hypervisor will crash. There are two other such
>>> vectors hard coded in the hypervisor, 0x2C and 0x2D for debug purposes.
>>> Fortunately, the three vectors are part of the kernel driver space and
>>> that makes it feasible to reserve them early so they are not assigned
>>> later.
>>>
>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>> ---
>>>
>>> v1: Add ifndef CONFIG_X86_FRED (thanks hpa)
>>> v2: replace ifndef with cpu_feature_enabled() (thanks hpa and tglx)
>>>
>>>   arch/x86/kernel/cpu/mshyperv.c | 27 +++++++++++++++++++++++++++
>>>   1 file changed, 27 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>>> index 579fb2c64cfd..88ca127dc6d4 100644
>>> --- a/arch/x86/kernel/cpu/mshyperv.c
>>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>>> @@ -478,6 +478,28 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>>>   }
>>>   EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>>>
>>> +/*
>>> + * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
>>> + * will either crash or hang or attempt to break into debugger.
>>> + */
>>> +static void hv_reserve_irq_vectors(void)
>>> +{
>>> +	#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
>>> +	#define HYPERV_DBG_ASSERT_VECTOR	0x2C
>>> +	#define HYPERV_DBG_SERVICE_VECTOR	0x2D
>>> +
>>> +	if (cpu_feature_enabled(X86_FEATURE_FRED))
>>> +		return;
>>> +
>>> +	if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
>>> +	    test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
>>> +	    test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
>>> +		BUG();
>>> +
>>> +	pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
>>> +		HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
>>
>> I'm a little late to the party here, but I've always seen Intel interrupt vectors
>> displayed as 2-digit hex numbers. This info message is displaying decimal,
>> which is atypical and will probably be confusing.
> 
> Noted. The pull request to Linus has been sent. We will change the
> format in a follow up patch.

Well, there is no 0x prefix, so it should not be confusing, but it's
no big deal.

Thanks,
-Mukesh
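
For reference, the two formats under discussion can be compared in a small
userspace sketch. The vector values come from the patch quoted above; the
hex format string is only an assumption about what a follow-up patch might
use, and the helper names are hypothetical.

```c
#include <stdio.h>

/* Vector numbers taken from the patch quoted above. */
#define HYPERV_DBG_FASTFAIL_VECTOR	0x29
#define HYPERV_DBG_ASSERT_VECTOR	0x2C
#define HYPERV_DBG_SERVICE_VECTOR	0x2D

/* Hypothetical helper: format the way the patch does, in decimal. */
static int fmt_vectors_dec(char *buf, size_t len)
{
	return snprintf(buf, len, "Hyper-V:reserve vectors: %d %d %d",
			HYPERV_DBG_ASSERT_VECTOR, HYPERV_DBG_SERVICE_VECTOR,
			HYPERV_DBG_FASTFAIL_VECTOR);
}

/* Hypothetical helper: one possible 2-digit hex format (an assumption). */
static int fmt_vectors_hex(char *buf, size_t len)
{
	return snprintf(buf, len,
			"Hyper-V: reserve vectors: 0x%02x 0x%02x 0x%02x",
			HYPERV_DBG_ASSERT_VECTOR, HYPERV_DBG_SERVICE_VECTOR,
			HYPERV_DBG_FASTFAIL_VECTOR);
}
```

The decimal form prints 44 45 41, which a reader has to convert back to
0x2C, 0x2D and 0x29 before matching it against the vector definitions.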




* Re: [PATCH] mshv: Replace fixed memory deposit with status driven helper
From: Mukesh R @ 2026-02-20 19:05 UTC
  To: Michael Kelley, Stanislav Kinsburskii, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415705AA10C44D52CFFC0D31D468A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2/20/26 09:05, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, February 19, 2026 2:10 PM
>>
>> Replace hardcoded HV_MAP_GPA_DEPOSIT_PAGES usage with
>> hv_deposit_memory() which derives the deposit size from
>> the hypercall status, and remove the now-unused constant.
>>
>> The previous code always deposited a fixed 256 pages on
>> insufficient memory, ignoring the actual demand reported
>> by the hypervisor.
> 
> Does the hypervisor report a specific page count demand? I haven't
> seen that anywhere. It seems like the deposit memory operation is
> always something of a guess.
> 
>> hv_deposit_memory() handles different
>> deposit statuses, aligning map-GPA retries with the rest
>> of the codebase.
>>
>> This approach may require more allocation and deposit
>> hypercall iterations, but avoids over-depositing large
>> fixed chunks when fewer pages would suffice. Until any
>> performance impact is measured, the more frugal and
>> consistent behavior is preferred.
>>
>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
>  From a purely functional standpoint, this change addresses the
> concern that I raised. But I don't have any intuition on the performance
> impact of having to iterate. hv_deposit_memory() adds only a single

Indeed, it is not insignificant. Discussions with the hypervisor team a
while ago resulted in suggestions to deposit larger sizes, but there are
many places where a single page suffices. This is just a lateral change.
As this bakes, the heuristics will evolve and we'll do some optimizations
around it. My 2 cents.

Thanks,
-Mukesh



> page for some of the statuses, so if there really is a large memory need,
> the new code would iterate 256 times to achieve what the existing code
> does.
> 
> Any idea where the 256 came from the first place?  Was that
> empirically determined like some of the other memory deposit counts?
> 
> In addition to a potential performance impact, I know the hypervisor tries
> to detect denial-of-service attempts that make "too many" calls to the
> hypervisor in a short period of time. In such a case, the hypervisor
> suspends scheduling the VM for a few seconds before allowing it to resume.
> Just need to make sure the hypervisor doesn't think the iterating is a
> denial-of-service attack. Or maybe that denial-of-service detection
> doesn't apply to the root partition VM.
> 
> But from a functional standpoint,
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> 
>> ---
>>   drivers/hv/mshv_root_hv_call.c |    4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
>> index 7f91096f95a8..317191462b63 100644
>> --- a/drivers/hv/mshv_root_hv_call.c
>> +++ b/drivers/hv/mshv_root_hv_call.c
>> @@ -16,7 +16,6 @@
>>
>>   /* Determined empirically */
>>   #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
>> -#define HV_MAP_GPA_DEPOSIT_PAGES	256
>>   #define HV_UMAP_GPA_PAGES		512
>>
>>   #define HV_PAGE_COUNT_2M_ALIGNED(pg_count) (!((pg_count) & (0x200 - 1)))
>> @@ -239,8 +238,7 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64
>> page_struct_count,
>>   		completed = hv_repcomp(status);
>>
>>   		if (hv_result_needs_memory(status)) {
>> -			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
>> -						    HV_MAP_GPA_DEPOSIT_PAGES);
>> +			ret = hv_deposit_memory(partition_id, status);
>>   			if (ret)
>>   				break;
>>
>>
>>
> 
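
The trade-off discussed above (more hypercall iterations versus
over-depositing) can be put in numbers with a small userspace model. This is
only a sketch: `deposit_calls()` and `over_deposit()` are hypothetical
helpers, and the model assumes every deposit call adds a fixed number of
pages, which is not how hv_deposit_memory() chooses its size.

```c
/*
 * Hypothetical model: number of deposit calls needed to cover `need`
 * pages when each call deposits `chunk` pages.
 */
static unsigned int deposit_calls(unsigned int need, unsigned int chunk)
{
	if (chunk == 0)
		return 0;	/* no progress is possible with an empty chunk */
	return (need + chunk - 1) / chunk;	/* round up */
}

/* Pages deposited beyond what was actually needed. */
static unsigned int over_deposit(unsigned int need, unsigned int chunk)
{
	return deposit_calls(need, chunk) * chunk - need;
}
```

Under this model a 256-page need takes 256 single-page deposits but one
256-page deposit, while a 3-page need met with a 256-page chunk
over-deposits 253 pages.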



* Re: [GIT PULL] Hyper-V patches for 7.0
From: pr-tracker-bot @ 2026-02-20 20:50 UTC
  To: Wei Liu
  Cc: Linus Torvalds, Wei Liu, Linux on Hyper-V List, Linux Kernel List,
	kys, haiyangz, decui, longli
In-Reply-To: <20260219074550.GA2773704@liuwe-devbox-debian-v2.local>

The pull request you sent on Thu, 19 Feb 2026 07:45:50 +0000:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-next-signed-20260218

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/d31558c077d8be422b65e97974017c030b4bd91a

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [PATCH 1/1] Drivers: hv: vmbus: Limit channel interrupt scan to relid high water mark
From: vdso @ 2026-02-21  1:47 UTC
  To: mhklinux, Michael Kelley
  Cc: linux-kernel, kys, wei.liu, haiyangz, longli, decui, linux-hyperv
In-Reply-To: <20260220164045.1670-1-mhklkml@zohomail.com>

Hi Michael,

Boots for me on an x86_64 machine. Got a typo fix and a question for you.
Tagging as reviewed and tested regardless :) 

> On 02/20/2026 8:40 AM  Michael Kelley <mhklkml@zohomail.com> wrote:
> 
>  
> From: Michael Kelley <mhklinux@outlook.com>
> 
> When checking for VMBus channel interrutps, current code always scans the

s/interrutps/interrupts/

> full SynIC receive interrupt bit array to get the relid of the
> interrupting channels. The array has HV_EVENT_FLAGS_COUNT (2048) bits.
> But VMs rarely have more than 100 channels, and the relid is typically
> a small integer that is densely assigned by the Hyper-V host. It's
> wasteful to scan 2048 bits when it is highly unlikely that anything will
> be found past bit 100. The waste is double with Confidential VMBus because
> there are two receive interrupt arrays that must be scanned: one for the
> hypervisor SynIC and one for the paravisor SynIC.
> 
> Improve the scanning by tracking the largest relid that has been offered
> by the Hyper-V host. Then when checking for VMBus channel interrupts, only
> scan up to this high water mark.
> 
> When channels are rescinded, it's not worth the complexity to recalculate
> the high water mark. Hyper-V tends to reuse the rescinded relids for any
> new channels that are subsequently added, and the performance benefit of
> exactly tracking the high water mark would be minimal.
> 
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Tested-by: Roman Kisel <vdso@mailbox.org>
Reviewed-by: Roman Kisel <vdso@mailbox.org>

> ---
>  drivers/hv/channel_mgmt.c | 16 ++++++++++++----
>  drivers/hv/hyperv_vmbus.h |  3 ++-
>  drivers/hv/vmbus_drv.c    |  7 +------
>  3 files changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 74fed2c073d4..61f7dffd0f50 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -384,8 +384,18 @@ static void free_channel(struct vmbus_channel *channel)
>  
>  void vmbus_channel_map_relid(struct vmbus_channel *channel)
>  {
> -	if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
> +	u32 new_relid = channel->offermsg.child_relid;
> +
> +	if (WARN_ON(new_relid >= MAX_CHANNEL_RELIDS))
>  		return;
> +
> +	/*
> +	 * This function is always called in the tasklet for the connect CPU.
> +	 * So updating the relid hiwater mark does not need to be atomic.
> +	 */
> +	if (new_relid > READ_ONCE(vmbus_connection.relid_hiwater))
> +		WRITE_ONCE(vmbus_connection.relid_hiwater, new_relid);
> +
>  	/*
>  	 * The mapping of the channel's relid is visible from the CPUs that
>  	 * execute vmbus_chan_sched() by the time that vmbus_chan_sched() will
> @@ -411,9 +421,7 @@ void vmbus_channel_map_relid(struct vmbus_channel *channel)
>  	 *      of the VMBus driver and vmbus_chan_sched() can not run before
>  	 *      vmbus_bus_resume() has completed execution (cf. resume_noirq).
>  	 */
> -	virt_store_mb(
> -		vmbus_connection.channels[channel->offermsg.child_relid],
> -		channel);
> +	virt_store_mb(vmbus_connection.channels[new_relid], channel);
>  }
>  
>  void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 7bd8f8486e85..2c90c81a3b0f 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -276,8 +276,9 @@ struct vmbus_connection {
>  	struct list_head chn_list;
>  	struct mutex channel_mutex;
>  
> -	/* Array of channels */
> +	/* Array of channel pointers, indexed by relid */
>  	struct vmbus_channel **channels;
> +	u32 relid_hiwater;
>  
>  	/*
>  	 * An offer message is handled first on the work_queue, and then
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 3e7a52918ce0..a96da105b593 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1258,17 +1258,12 @@ static void vmbus_chan_sched(void *event_page_addr)
>  		return;
>  	event = (union hv_synic_event_flags *)event_page_addr + VMBUS_MESSAGE_SINT;
>  
> -	maxbits = HV_EVENT_FLAGS_COUNT;
> +	maxbits = READ_ONCE(vmbus_connection.relid_hiwater) + 1;

Would it be worth checking that "maxbits <= HV_EVENT_FLAGS_COUNT" to protect
against corruption, or would that be too paranoid?

>  	recv_int_page = event->flags;
>  
>  	if (unlikely(!recv_int_page))
>  		return;
>  
> -	/*
> -	 * Suggested-by: Michael Kelley <mhklinux@outlook.com>
> -	 * One possible optimization would be to keep track of the largest relID that's in use,
> -	 * and only scan up to that relID.
> -	 */
>  	for_each_set_bit(relid, recv_int_page, maxbits) {
>  		void (*callback_fn)(void *context);
>  		struct vmbus_channel *channel;
> -- 
> 2.25.1
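
The effect of the high water mark on the scan can be sketched in userspace.
This is a simplified model under stated assumptions: it walks the bitmap one
bit at a time, whereas the kernel's for_each_set_bit() works on whole words,
and the helper names and EVENT_FLAGS_COUNT (mirroring the 2048 bits
mentioned in the commit message) are hypothetical stand-ins.

```c
#include <limits.h>
#include <stddef.h>

#define EVENT_FLAGS_COUNT 2048	/* mirrors HV_EVENT_FLAGS_COUNT (2048) */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define FLAG_WORDS (EVENT_FLAGS_COUNT / BITS_PER_WORD)

/* Set one bit in the model's receive interrupt bitmap. */
static void set_flag(unsigned long *map, unsigned int bit)
{
	map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

static int flag_is_set(const unsigned long *map, unsigned int bit)
{
	return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}

/*
 * Examine bits [0, maxbits). Returns how many bit positions were examined
 * and stores the number of set bits in *found.
 */
static unsigned int scan_flags(const unsigned long *map, unsigned int maxbits,
			       unsigned int *found)
{
	unsigned int bit, examined = 0;

	*found = 0;
	for (bit = 0; bit < maxbits; bit++) {
		examined++;
		if (flag_is_set(map, bit))
			(*found)++;
	}
	return examined;
}
```

With relids 1, 5 and 17 pending and a high water mark of 17, scanning up to
hiwater + 1 finds the same three channels as a full scan while examining 18
positions instead of 2048.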


* RE: [PATCH 1/1] Drivers: hv: vmbus: Limit channel interrupt scan to relid high water mark
From: mhklkml @ 2026-02-21  2:55 UTC
  To: vdso, mhklinux
  Cc: linux-kernel, kys, wei.liu, haiyangz, longli, decui, linux-hyperv
In-Reply-To: <1554036576.472972.1771638469213@app.mailbox.org>

From: vdso@mailbox.org <vdso@mailbox.org> Sent: Friday, February 20, 2026 5:48 PM
>
> Hi Michael,
> 
> Boots for me on an x86_64 machine. Got a typo fix and a question for you.
> Tagging as reviewed and tested regardless :)
> 
> > On 02/20/2026 8:40 AM  Michael Kelley <mhklkml@zohomail.com> wrote:
> >
> >
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > When checking for VMBus channel interrutps, current code always scans the
> 
> s/interrutps/interrupts/
> 
> > full SynIC receive interrupt bit array to get the relid of the
> > interrupting channels. The array has HV_EVENT_FLAGS_COUNT (2048) bits.
> > But VMs rarely have more than 100 channels, and the relid is typically
> > a small integer that is densely assigned by the Hyper-V host. It's
> > wasteful to scan 2048 bits when it is highly unlikely that anything will
> > be found past bit 100. The waste is double with Confidential VMBus because
> > there are two receive interrupt arrays that must be scanned: one for the
> > hypervisor SynIC and one for the paravisor SynIC.
> >
> > Improve the scanning by tracking the largest relid that has been offered
> > by the Hyper-V host. Then when checking for VMBus channel interrupts, only
> > scan up to this high water mark.
> >
> > When channels are rescinded, it's not worth the complexity to recalculate
> > the high water mark. Hyper-V tends to reuse the rescinded relids for any
> > new channels that are subsequently added, and the performance benefit of
> > exactly tracking the high water mark would be minimal.
> >
> > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> 
> Tested-by: Roman Kisel <vdso@mailbox.org>
> Reviewed-by: Roman Kisel <vdso@mailbox.org>

Thanks!

> 
> > ---
> >  drivers/hv/channel_mgmt.c | 16 ++++++++++++----
> >  drivers/hv/hyperv_vmbus.h |  3 ++-
> >  drivers/hv/vmbus_drv.c    |  7 +------
> >  3 files changed, 15 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> > index 74fed2c073d4..61f7dffd0f50 100644
> > --- a/drivers/hv/channel_mgmt.c
> > +++ b/drivers/hv/channel_mgmt.c
> > @@ -384,8 +384,18 @@ static void free_channel(struct vmbus_channel *channel)
> >
> >  void vmbus_channel_map_relid(struct vmbus_channel *channel)
> >  {
> > -	if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
> > +	u32 new_relid = channel->offermsg.child_relid;
> > +
> > +	if (WARN_ON(new_relid >= MAX_CHANNEL_RELIDS))
> >  		return;
> > +
> > +	/*
> > +	 * This function is always called in the tasklet for the connect CPU.
> > +	 * So updating the relid hiwater mark does not need to be atomic.
> > +	 */
> > +	if (new_relid > READ_ONCE(vmbus_connection.relid_hiwater))
> > +		WRITE_ONCE(vmbus_connection.relid_hiwater, new_relid);
> > +
> >  	/*
> >  	 * The mapping of the channel's relid is visible from the CPUs that
> >  	 * execute vmbus_chan_sched() by the time that vmbus_chan_sched() will
> > @@ -411,9 +421,7 @@ void vmbus_channel_map_relid(struct vmbus_channel *channel)
> >  	 *      of the VMBus driver and vmbus_chan_sched() can not run before
> >  	 *      vmbus_bus_resume() has completed execution (cf. resume_noirq).
> >  	 */
> > -	virt_store_mb(
> > -		vmbus_connection.channels[channel->offermsg.child_relid],
> > -		channel);
> > +	virt_store_mb(vmbus_connection.channels[new_relid], channel);
> >  }
> >
> >  void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
> > diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> > index 7bd8f8486e85..2c90c81a3b0f 100644
> > --- a/drivers/hv/hyperv_vmbus.h
> > +++ b/drivers/hv/hyperv_vmbus.h
> > @@ -276,8 +276,9 @@ struct vmbus_connection {
> >  	struct list_head chn_list;
> >  	struct mutex channel_mutex;
> >
> > -	/* Array of channels */
> > +	/* Array of channel pointers, indexed by relid */
> >  	struct vmbus_channel **channels;
> > +	u32 relid_hiwater;
> >
> >  	/*
> >  	 * An offer message is handled first on the work_queue, and then
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 3e7a52918ce0..a96da105b593 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1258,17 +1258,12 @@ static void vmbus_chan_sched(void *event_page_addr)
> >  		return;
> >  	event = (union hv_synic_event_flags *)event_page_addr + VMBUS_MESSAGE_SINT;
> >
> > -	maxbits = HV_EVENT_FLAGS_COUNT;
> > +	maxbits = READ_ONCE(vmbus_connection.relid_hiwater) + 1;
> 
> Would it be worth checking that "maxbits <= HV_EVENT_FLAGS_COUNT" to protect
> against corruption, or would that be too paranoid?

We definitely want to validate what Hyper-V returns to the guest as a relid,
and drop any values that are "too big", so we don't go indexing off into
bogus memory. But that validation is done in vmbus_channel_map_relid()
with a WARN_ON() before setting relid_hiwater.  So there's no way for
relid_hiwater to be bogus, and additional validation here in
vmbus_chan_sched() really isn't necessary.

Michael

> 
> >  	recv_int_page = event->flags;
> >
> >  	if (unlikely(!recv_int_page))
> >  		return;
> >
> > -	/*
> > -	 * Suggested-by: Michael Kelley <mhklinux@outlook.com>
> > -	 * One possible optimization would be to keep track of the largest relID that's in use,
> > -	 * and only scan up to that relID.
> > -	 */
> >  	for_each_set_bit(relid, recv_int_page, maxbits) {
> >  		void (*callback_fn)(void *context);
> >  		struct vmbus_channel *channel;
> > --
> > 2.25.1
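
The argument that relid_hiwater cannot be bogus can be checked against a
small userspace model. It is a sketch only: `struct conn_model`,
`map_relid()` and `scan_limit()` are hypothetical names, and the model
assumes MAX_CHANNEL_RELIDS is no larger than HV_EVENT_FLAGS_COUNT (2048),
which should be confirmed against hyperv_vmbus.h.

```c
#include <stdint.h>

/* Assumption: the relid limit equals the number of event flag bits. */
#define MAX_CHANNEL_RELIDS	2048
#define HV_EVENT_FLAGS_COUNT	2048

struct conn_model {
	uint32_t relid_hiwater;
};

/* Returns 0 if the relid was accepted, -1 if it was rejected. */
static int map_relid(struct conn_model *c, uint32_t relid)
{
	/* The patch does WARN_ON() and returns before touching the mark. */
	if (relid >= MAX_CHANNEL_RELIDS)
		return -1;

	/* The mark only ever moves up, and only to a validated relid. */
	if (relid > c->relid_hiwater)
		c->relid_hiwater = relid;
	return 0;
}

/* The scan bound computed in vmbus_chan_sched() after the patch. */
static uint32_t scan_limit(const struct conn_model *c)
{
	return c->relid_hiwater + 1;
}
```

Because every value stored in the mark has passed the range check first,
scan_limit() stays at or below HV_EVENT_FLAGS_COUNT in this model, even
after an out-of-range relid has been offered.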




* Re: [RFC PATCH V2] x86/VMBus: Confidential VMBus for dynamic DMA buffer transition
From: Tianyu Lan @ 2026-02-21 14:32 UTC
  To: Aneesh Kumar K.V
  Cc: Robin Murphy, Michael Kelley, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	longli@microsoft.com, Tianyu Lan, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, hch@infradead.org,
	vdso@hexbites.dev, Suzuki K Poulose
In-Reply-To: <yq5a5x7xq997.fsf@kernel.org>

On Mon, Feb 16, 2026 at 6:21 PM Aneesh Kumar K.V
<aneesh.kumar@kernel.org> wrote:
>
> Robin Murphy <robin.murphy@arm.com> writes:
>
> > On 2026-02-11 6:00 pm, Michael Kelley wrote:
> >> From: Tianyu Lan <ltykernel@gmail.com> Sent: Tuesday, February 10, 2026 8:21 AM
> >>>
> >>> Hyper-V provides Confidential VMBus to communicate between
> >>> device model and device guest driver via encrypted/private
> >>> memory in Confidential VM. The device model is in OpenHCL
> >>> (https://openvmm.dev/guide/user_guide/openhcl.html) that
> >>> plays the paravisor rule.
> >>>
> >>> For a VMBUS device, there are two communication methods to
> >>
> >> s/VMBUS/VMBus/
> >>
> >>> talk with Host/Hypervisor. 1) VMBus Ring buffer 2) dynamic
> >>> DMA transition.
> >>
> >> I'm not sure what "dynamic DMA transition" is. Maybe just
> >> "DMA transfers"?  Also, do the same substitution further
> >> down in this commit message.
> >>
> >>> The Confidential VMBus Ring buffer has been
> >>> upstreamed by Roman Kisel(commit 6802d8af).
> >>
> >> It's customary to use 12 character commit IDs, which would be
> >> 6802d8af47d1 in this case.
> >>
> >>>
> >>> The dynamic DMA transition of VMBus device normally goes
> >>> through DMA core and it uses SWIOTLB as bounce buffer in
> >>> CVM
> >>
> >> "CVM" is Microsoft-speak. The Linux terminology is "a CoCo VM".
> >>
> >>> to communicate with Host/Hypervisor. The Confidential
> >>> VMBus device may use private/encrypted memory to do DMA
> >>> and so the device swiotlb(bounce buffer) isn't necessary.
> >>
> >> The phrase "isn't necessary" does not capture the real issue
> >> here. Saying "isn't necessary" makes it sound like this patch is
> >> just avoids unnecessary work, so that it is a performance
> >> improvement. But that's not the case.
> >>
> >> The real issue is that swiotlb memory is decrypted. So bouncing
> >> through the swiotlb exposes to the host what is supposed to be
> >> confidential data passed on the Confidential VMBus. Disabling
> >> the swiotlb bouncing in this case is a hard requirement to preserve
> >> confidentially.
> >
> > Yeah, this really isn't a Hyper-V problem. Indeed as things stand,
> > "swiotlb=force" could potentially break confidentiality for any
> > environment trying to invent a notion of private DMA, and perhaps we
> > could throw a big warning about that, but really the answer there is
> > "Don't run your confidential workload with 'swiotlb=force'. Why would
> > you even do that? Debug your drivers in a regular VM or bare-metal with
> > full debug visibility like a normal person..."
> >
> > The fact is we do not have a proper notion of trusted/private DMA yet,
> > and this is not the way to add it. The current assumption is very much
> > that all DMA is untrusted in the CoCo sense, because initially it was
> > only virtual devices emulated by a hypervisor, thus had to be bounced
> > through shared memory anyway. AMD SEV with a stage 1 IOMMU in the guest
> > can allow an assigned physical device to access a suitably-aligned
> > encrypted buffer directly, but that's still effectively just putting the
> > buffer into a temporarily shared state for that device, it merely skips
> > sharing it with the rest of the system. !force_dma_unencrypted() doesn't
> > mean "we trust this device's DMA", it just means "we don't have to use
> > explicitly-decrypted pages to accommodate untrusted/shared DMA here",
> > plus it also serves double-duty for host encryption which doesn't share
> > the same trust model anyway.
> >
> > I assumed this would follow the TDISP stuff, but if Hyper-V has an
> > alternative device-trusting mechanism already then there's no need to
> > wait. We want some common device property (likely consolidating the
> > current PCI external-facing port notion of trustedness plus whatever
> > TDISP wants), with which we can then make proper decisions in all the
> > right DMA API paths - and if it can end up replacing the horrible
> > force_dma_unencrypted() as well then all the better! I'd totally
> > forgotten about the previous discussion that Michael referred to (which
> > I had to track down[1]), but it looks like all the main points were
> > already covered there and we were approaching a consensus, so really I
> > guess someone just needs to give it a go.
> >
>
> With my device-assignment–related changes, I have made the following
> update. It may be a slightly stronger requirement to enforce that
> trusted device cannot use SWIOTLB, but it simplifies the overall design.
> I also have a prototype, that added two default swiotlb, ie,
>
> static struct io_tlb_mem io_tlb_default_mem;
> static struct io_tlb_mem io_tlb_default_shared_mem;
>
> Looking at that change, I would suggest we avoid doing this unless we
> are certain that there is a requirement for a trusted device to use
> SWIOTLB bouncing.
>

Hi Robin & Aneesh:
     Thanks for your suggestions and the draft patch. Sorry for the late
response due to the holiday. We may combine Aneesh's change with Michael's
suggestion that the DMA core expose APIs for disabling swiotlb allocation
and for forcing use of swiotlb, so that a platform or subsystem (e.g., the
TSM module) may call them according to the use case.

> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index b27de03f2466..07ef149bd9fc 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -292,6 +292,9 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>
>  static inline bool is_swiotlb_for_alloc(struct device *dev)
>  {
> +       if (device_cc_accepted(dev))
> +               return false;
> +
>         return dev->dma_io_tlb_mem->for_alloc;
>  }
>  #else
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 34fe14b987f0..a89a7ac07499 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -159,6 +159,14 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
>   */
>  static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
>  {
> +       /*
> +        * Atomic pools are marked decrypted and are used if we require an
> +        * update of pfn mem encryption attributes or for DMA non-coherent
> +        * device allocation. Neither is true for a trusted device.
> +        */
> +       if (device_cc_accepted(dev))
> +               return false;
> +
>         return !gfpflags_allow_blocking(gfp) && !is_swiotlb_for_alloc(dev);
>  }
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index a862712f4dc6..6d9f0c869c6f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1643,6 +1643,9 @@ bool is_swiotlb_active(struct device *dev)
>  {
>         struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>
> +       if (device_cc_accepted(dev))
> +               return false;
> +
>         return mem && mem->nslabs;
>  }



--
Thanks
Tianyu Lan
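
The gating in the draft patch quoted above can be illustrated with a
userspace model. All types and helpers here are hypothetical stand-ins (the
`_model` suffix marks them as such); in particular, how device_cc_accepted()
is really implemented is not shown in this thread, so the boolean field is
purely an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct io_tlb_mem, reduced to the field the check reads. */
struct io_tlb_mem_model {
	unsigned long nslabs;
};

/* Stand-in for struct device; cc_accepted is an assumed trust flag. */
struct device_model {
	bool cc_accepted;
	struct io_tlb_mem_model *dma_io_tlb_mem;
};

static bool device_cc_accepted_model(const struct device_model *dev)
{
	return dev->cc_accepted;
}

/* Mirrors the shape of is_swiotlb_active() after the draft change. */
static bool is_swiotlb_active_model(const struct device_model *dev)
{
	const struct io_tlb_mem_model *mem = dev->dma_io_tlb_mem;

	/* A trusted device never bounces through the decrypted pool. */
	if (device_cc_accepted_model(dev))
		return false;

	return mem && mem->nslabs;
}
```

In this model a trusted device reports swiotlb as inactive even when a
bounce pool is configured, which is the stronger requirement Aneesh
describes.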


* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
From: Ard Biesheuvel @ 2026-02-21 16:43 UTC
  To: Mukesh Rathor, linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, dave.hansen, x86, H . Peter Anvin, Arnd Bergmann
In-Reply-To: <20250910001009.2651481-6-mrathor@linux.microsoft.com>

Just spotted this code in v7.0-rc

On Wed, 10 Sep 2025, at 02:10, Mukesh Rathor wrote:
...

> +static asmlinkage void __noreturn hv_crash_c_entry(void)

'asmlinkage' means that the function may be called from another compilation unit written in assembler, but it doesn't actually evaluate to anything in most cases. Combining it with 'static' makes no sense whatsoever.

> +{
> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> +
> +	/* first thing, restore kernel gdt */
> +	native_load_gdt(&ctxt->gdtr);
> +
> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> +

This code is truly very broken. You cannot enter a C function without a stack, and assign RSP half way down the function. Especially after allocating local variables and/or calling other functions - it may happen to work in most cases, but it is very fragile. (Other architectures have the concept of 'naked' functions for this purpose but x86 does not)

IOW, this whole function should be written in asm.

> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
> +
> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
> +
> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
> +
> +	native_load_idt(&ctxt->idtr);
> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
> +	native_wrmsrq(MSR_EFER, ctxt->efer);
> +
> +	/* restore the original kernel CS now via far return */
> +	asm volatile("movzwq %0, %%rax\n\t"
> +		     "pushq %%rax\n\t"
> +		     "pushq $1f\n\t"
> +		     "lretq\n\t"
> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
> +
> +	/* We are in asmlinkage without stack frame,

You just switched to __KERNEL_CS via the stack.

> hence make a C function
> +	 * call which will buy stack frame to restore the tss or clear PT 
> entry.
> +	 */

Where does one buy a stack frame?

> +	hv_crash_restore_tss();
> +	hv_crash_clear_kernpt();
> +
> +	/* we are now fully in devirtualized normal kernel mode */
> +	__crash_kexec(NULL);
> +
> +	for (;;)
> +		cpu_relax();
> +}
> +/* Tell gcc we are using lretq long jump in the above function 
> intentionally */
> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
> +



* [PATCH net-next] net: ethtool: add COALESCE_RX_CQE_FRAMES/NSECS parameters
From: Haiyang Zhang @ 2026-02-22 21:23 UTC
  To: linux-hyperv, netdev, Andrew Lunn, Jakub Kicinski, Donald Hunter,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Kory Maincent (Dent Project),
	Gal Pressman, Oleksij Rempel, Vadim Fedorenko, linux-kernel,
	linux-doc
  Cc: haiyangz, paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Add two parameters for drivers supporting Rx CQE Coalescing.

ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE.

ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Time out value in nanoseconds after the first packet arrival in a
coalesced CQE to be sent.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 Documentation/netlink/specs/ethtool.yaml       |  8 ++++++++
 Documentation/networking/ethtool-netlink.rst   | 10 ++++++++++
 include/linux/ethtool.h                        |  6 +++++-
 include/uapi/linux/ethtool_netlink_generated.h |  2 ++
 net/ethtool/coalesce.c                         | 14 +++++++++++++-
 5 files changed, 38 insertions(+), 2 deletions(-)
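
As an illustration of how the two parameters might interact, here is a
userspace sketch of one plausible reading of the descriptions above: a CQE
is sent once it holds rx_cqe_frames frames, or once rx_cqe_nsecs have passed
since its first frame arrived. The struct and function are hypothetical, and
the model only checks the timeout when a frame arrives, whereas real
hardware would presumably use a timer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue's CQE coalescing state. */
struct cqe_coalesce_model {
	uint32_t rx_cqe_frames;	/* max frames coalesced into one CQE */
	uint32_t rx_cqe_nsecs;	/* timeout after first frame arrival */
	uint32_t pending;	/* frames held in the current CQE */
	uint64_t first_ns;	/* arrival time of the first pending frame */
};

/*
 * Record a frame arriving at now_ns. Returns true if the CQE is sent as
 * a result, either because it is full or because the timeout elapsed.
 */
static bool rx_frame(struct cqe_coalesce_model *m, uint64_t now_ns)
{
	if (m->pending == 0)
		m->first_ns = now_ns;
	m->pending++;

	if (m->pending >= m->rx_cqe_frames ||
	    now_ns - m->first_ns >= m->rx_cqe_nsecs) {
		m->pending = 0;
		return true;
	}
	return false;
}
```

With rx_cqe_frames = 4 and rx_cqe_nsecs = 1000, four closely spaced frames
produce one CQE on the fourth arrival, while a frame arriving 1100 ns after
the first pending frame flushes on timeout instead.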

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 0a2d2343f79a..951d98f6bb12 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -861,6 +861,12 @@ attribute-sets:
         name: tx-profile
         type: nest
         nested-attributes: profile
+      -
+        name: rx-cqe-frames
+        type: u32
+      -
+        name: rx-cqe-nsecs
+        type: u32
 
   -
     name: pause-stat
@@ -2244,6 +2250,8 @@ operations:
             - tx-aggr-time-usecs
             - rx-profile
             - tx-profile
+            - rx-cqe-frames
+            - rx-cqe-nsecs
       dump: *coalesce-get-op
     -
       name: coalesce-set
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index af56c304cef4..a3e78b69fd07 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1072,6 +1072,8 @@ Kernel response contents:
   ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
   ``ETHTOOL_A_COALESCE_RX_PROFILE``            nested  profile of DIM, Rx
   ``ETHTOOL_A_COALESCE_TX_PROFILE``            nested  profile of DIM, Tx
+  ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES``         u32     max packets, Rx CQE
+  ``ETHTOOL_A_COALESCE_RX_CQE_NSECS``          u32     delay (ns), Rx CQE
   ===========================================  ======  =======================
 
 Attributes are only included in reply if their value is not zero or the
@@ -1105,6 +1107,12 @@ well with frequent small-sized URBs transmissions.
 to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
 <https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
 
+Rx CQE coalescing allows multiple received packets to be coalesced into a single
+Completion Queue Entry (CQE). ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` describes the
+maximum number of frames that can be coalesced into a CQE.
+``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` describes max time in nanoseconds after the
+first packet arrival in a coalesced CQE to be sent.
+
 COALESCE_SET
 ============
 
@@ -1143,6 +1151,8 @@ Request contents:
   ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
   ``ETHTOOL_A_COALESCE_RX_PROFILE``            nested  profile of DIM, Rx
   ``ETHTOOL_A_COALESCE_TX_PROFILE``            nested  profile of DIM, Tx
+  ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES``         u32     max packets, Rx CQE
+  ``ETHTOOL_A_COALESCE_RX_CQE_NSECS``          u32     delay (ns), Rx CQE
   ===========================================  ======  =======================
 
 Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 798abec67a1b..25ccd2d5d4dc 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -332,6 +332,8 @@ struct kernel_ethtool_coalesce {
 	u32 tx_aggr_max_bytes;
 	u32 tx_aggr_max_frames;
 	u32 tx_aggr_time_usecs;
+	u32 rx_cqe_frames;
+	u32 rx_cqe_nsecs;
 };
 
 /**
@@ -380,7 +382,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 #define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS	BIT(26)
 #define ETHTOOL_COALESCE_RX_PROFILE		BIT(27)
 #define ETHTOOL_COALESCE_TX_PROFILE		BIT(28)
-#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(28, 0)
+#define ETHTOOL_COALESCE_RX_CQE_FRAMES		BIT(29)
+#define ETHTOOL_COALESCE_RX_CQE_NSECS		BIT(30)
+#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(30, 0)
 
 #define ETHTOOL_COALESCE_USECS						\
 	(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index 556a0c834df5..efc6e4ade77b 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -371,6 +371,8 @@ enum {
 	ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
 	ETHTOOL_A_COALESCE_RX_PROFILE,
 	ETHTOOL_A_COALESCE_TX_PROFILE,
+	ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+	ETHTOOL_A_COALESCE_RX_CQE_NSECS,
 
 	__ETHTOOL_A_COALESCE_CNT,
 	ETHTOOL_A_COALESCE_MAX = (__ETHTOOL_A_COALESCE_CNT - 1)
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 3e18ca1ccc5e..349bb02c517a 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -118,6 +118,8 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_BYTES */
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_FRAMES */
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_TIME_USECS */
+	       nla_total_size(sizeof(u32)) +	/* _RX_CQE_FRAMES */
+	       nla_total_size(sizeof(u32)) +	/* _RX_CQE_NSECS */
 	       total_modersz * 2;		/* _{R,T}X_PROFILE */
 }
 
@@ -269,7 +271,11 @@ static int coalesce_fill_reply(struct sk_buff *skb,
 	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,
 			     kcoal->tx_aggr_max_frames, supported) ||
 	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
-			     kcoal->tx_aggr_time_usecs, supported))
+			     kcoal->tx_aggr_time_usecs, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+			     kcoal->rx_cqe_frames, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_NSECS,
+			     kcoal->rx_cqe_nsecs, supported))
 		return -EMSGSIZE;
 
 	if (!req_base->dev || !req_base->dev->irq_moder)
@@ -338,6 +344,8 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
 	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_RX_CQE_FRAMES] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_RX_CQE_NSECS] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_RX_PROFILE] =
 		NLA_POLICY_NESTED(coalesce_profile_policy),
 	[ETHTOOL_A_COALESCE_TX_PROFILE] =
@@ -570,6 +578,10 @@ __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
 			 tb[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES], &mod);
 	ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
 			 tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
+	ethnl_update_u32(&kernel_coalesce.rx_cqe_frames,
+			 tb[ETHTOOL_A_COALESCE_RX_CQE_FRAMES], &mod);
+	ethnl_update_u32(&kernel_coalesce.rx_cqe_nsecs,
+			 tb[ETHTOOL_A_COALESCE_RX_CQE_NSECS], &mod);
 
 	if (dev->irq_moder && dev->irq_moder->profile_flags & DIM_PROFILE_RX) {
 		ret = ethnl_update_profile(dev, &dev->irq_moder->rx_profile,
-- 
2.34.1


^ permalink raw reply related

* RE: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-02-22 21:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
	Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
	Shiraz Saleem, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260117144847.20676729@kernel.org>



> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, January 17, 2026 5:49 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Konstantin
> Taranov <kotaranov@microsoft.com>; Simon Horman <horms@kernel.org>; Erni
> Sri Satya Vennela <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add
> support for coalesced RX packets on CQE
> 
> On Sat, 17 Jan 2026 18:01:18 +0000 Haiyang Zhang wrote:
> > > > Since this feature is not common to other NICs, can we use an
> > > > ethtool private flag instead?
> > >
> > > It's extremely common. Descriptor writeback at the granularity of one
> > > packet would kill PCIe performance. We just don't have uAPI so NICs
> > > either don't expose the knob or "reuse" another coalescing param.
> >
> > I see. So how about adding a new param like below to "ethtool -C"?
> > ethtool -C|--coalesce devname [rx-cqe-coalesce on|off]
> 
> I don't think we need on / off, just the params.
> If someone needs on / off setting - the size to 1 is basically off.
> 
> > > > When the flag is set, the CQE coalescing will be enabled and put
> > > > up to 4 pkts in a CQE. support
> > > > Does the "size" mean the max pks per CQE (1 or 4)?
> >  [...]
> >
> > In "ethtool -c" output, add a new value like this?
> > rx-cqe-frames:      (1 or 4 frames/CQE for this NIC)
> 
> SG
> 
> > > > > The timeout value is not even exposed to driver, and subject to
> > > > > change in the future. Also the HW mechanism is proprietary... So,
> > > > > can we not "expose" the timeout value in "ethtool -c" outputs,
> > > > > because it's not available at driver level?
> > >
> > > Add it to the FW API and have FW send the current value to the driver?
> >
> > I don't know where is the timeout value in the HW / FW layers. Adding
> > new info to the HW/FW API needs other team's approval, and their work,
> > which will need a complex process and a long time.
> >
> > > You were concerned (in the commit msg) that there's a latency cost,
> > > which is fair but I think for 99% of users 2usec is absolutely
> > > not detectable (it takes longer for the CPU to wake). So I think it'd
> > > be very valuable to the user to understand the order of magnitude of
> > > latency we're talking about here.
> >
> > For now, may I document the 2us in the patch description? And add a
> > > new item to the "ethtool -c" output, like "rx-cqe-usecs", label it as
> > "n/a" for now, while we work out with other teams on the time value
> > API at HW/FW layers? So, this CQE coalescing feature support won't be
> > blocked by this "2usec" info API for a long time?
> 
> Please do it right. We are in no rush upstream. It can't be that hard
> to add a single API to the FW within a single organization..

I have sent out a patch to add two parameters for ethtool:
COALESCE_RX_CQE_FRAMES/NSECS

I will send out ethtool user cmd patch, and driver patches later, after
the new parameters are added to kernel.

Thanks,
- Haiyang


^ permalink raw reply

* [PATCH net-next v3] net: mana: Add MAC address to vPort logs and clarify error messages
From: Erni Sri Satya Vennela @ 2026-02-23  4:08 UTC (permalink / raw)
  To: Erni Sri Satya Vennela, kys, haiyangz, wei.liu, decui, longli,
	andrew+netdev, davem, edumazet, kuba, pabeni, dipayanroy, ssengar,
	shradhagupta, ernis, shirazsaleem, gargaditya, linux-hyperv,
	netdev, linux-kernel

Add MAC address to vPort configuration success message and update error
message to be more specific about HWC message errors in
mana_send_request and mana_hwc_send_request.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3:
* Remove the changes from v2 and Update commit message.
* Use "Enabled vPort ..." instead of "Configured vPort" in
  mana_cfg_vport.
* Update error logs in mana_hwc_send_request.
Changes in v2:
* Update commit message.
* Use "Enabled vPort ..." instead of "Configured vPort" in
  mana_cfg_vport.
* Add info log in mana_uncfg_vport, mana_gd_verify_vf_version,
  mana_gd_query_max_resources, mana_query_device_cfg and
  mana_query_vport_cfg.
---
 .../net/ethernet/microsoft/mana/hw_channel.c  | 19 +++++++++++--------
 drivers/net/ethernet/microsoft/mana/mana_en.c |  8 ++++----
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index aa4e2731e2ba..d4fd513dc1d6 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -853,6 +853,7 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 	struct hwc_caller_ctx *ctx;
 	u32 dest_vrcq = 0;
 	u32 dest_vrq = 0;
+	u32 command;
 	u16 msg_id;
 	int err;
 
@@ -861,8 +862,8 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 	tx_wr = &txq->msg_buf->reqs[msg_id];
 
 	if (req_len > tx_wr->buf_len) {
-		dev_err(hwc->dev, "HWC: req msg size: %d > %d\n", req_len,
-			tx_wr->buf_len);
+		dev_err(hwc->dev, "%s:%d: req msg size: %d > %d\n",
+			__func__, __LINE__, req_len, tx_wr->buf_len);
 		err = -EINVAL;
 		goto out;
 	}
@@ -878,6 +879,7 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 	req_msg->req.hwc_msg_id = msg_id;
 
 	tx_wr->msg_size = req_len;
+	command = req_msg->req.msg_type;
 
 	if (gc->is_pf) {
 		dest_vrq = hwc->pf_dest_vrq_id;
@@ -886,15 +888,16 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 
 	err = mana_hwc_post_tx_wqe(txq, tx_wr, dest_vrq, dest_vrcq, false);
 	if (err) {
-		dev_err(hwc->dev, "HWC: Failed to post send WQE: %d\n", err);
+		dev_err(hwc->dev, "%s:%d: Failed to post send WQE: %d\n",
+			__func__, __LINE__, err);
 		goto out;
 	}
 
 	if (!wait_for_completion_timeout(&ctx->comp_event,
 					 (msecs_to_jiffies(hwc->hwc_timeout)))) {
 		if (hwc->hwc_timeout != 0)
-			dev_err(hwc->dev, "HWC: Request timed out: %u ms\n",
-				hwc->hwc_timeout);
+			dev_err(hwc->dev, "%s:%d: Command 0x%x timed out: %u ms\n",
+				__func__, __LINE__, command, hwc->hwc_timeout);
 
 		/* Reduce further waiting if HWC no response */
 		if (hwc->hwc_timeout > 1)
@@ -914,9 +917,9 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 			err = -EOPNOTSUPP;
 			goto out;
 		}
-		if (req_msg->req.msg_type != MANA_QUERY_PHY_STAT)
-			dev_err(hwc->dev, "HWC: Failed hw_channel req: 0x%x\n",
-				ctx->status_code);
+		if (command != MANA_QUERY_PHY_STAT)
+			dev_err(hwc->dev, "%s:%d: Command 0x%x failed with status: 0x%x\n",
+				__func__, __LINE__, command, ctx->status_code);
 		err = -EPROTO;
 		goto out;
 	}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9b5a72ada5c4..53f24244de75 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1023,8 +1023,8 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
 
 		if (req->req.msg_type != MANA_QUERY_PHY_STAT &&
 		    mana_need_log(gc, err))
-			dev_err(dev, "Failed to send mana message: %d, 0x%x\n",
-				err, resp->status);
+			dev_err(dev, "Command 0x%x failed with status: 0x%x, err: %d\n",
+				req->req.msg_type, resp->status, err);
 		return err ? err : -EPROTO;
 	}
 
@@ -1337,8 +1337,8 @@ int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
 	apc->tx_shortform_allowed = resp.short_form_allowed;
 	apc->tx_vp_offset = resp.tx_vport_offset;
 
-	netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
-		    apc->port_handle, protection_dom_id, doorbell_pg_id);
+	netdev_info(apc->ndev, "Enabled vPort %llu PD %u DB %u MAC %pM\n",
+		    apc->port_handle, protection_dom_id, doorbell_pg_id, apc->mac_addr);
 out:
 	if (err)
 		mana_uncfg_vport(apc);
-- 
2.34.1


^ permalink raw reply related

* [PATCH, net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
From: Dipayaan Roy @ 2026-02-23  8:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, dipayanroy

The GF stats periodic query is used as a mechanism to monitor HWC
health. If this HWC command times out, it is a strong indication that
the device/SoC is in a faulty state and requires recovery.

Today, when a timeout is detected, the driver marks
hwc_timeout_occurred, clears cached stats, and stops rescheduling the
periodic work. However, the device itself is left in the same failing
state.

Extend the timeout handling path to trigger the existing MANA VF
recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
This is expected to initiate the appropriate recovery flow: attempt a
suspend/resume first and, if that fails, trigger a bus rescan.

This change is intentionally limited to HWC command timeouts and does
not trigger recovery for errors reported by the SoC as a normal command
response.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 14 +++-------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 28 ++++++++++++++++++-
 include/net/mana/gdma.h                       | 16 +++++++++--
 3 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 0055c231acf6..16c438d2aaa3 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -490,15 +490,9 @@ static void mana_serv_reset(struct pci_dev *pdev)
 		dev_info(&pdev->dev, "MANA reset cycle completed\n");
 
 out:
-	gc->in_service = false;
+	clear_bit(GC_IN_SERVICE, &gc->flags);
 }
 
-struct mana_serv_work {
-	struct work_struct serv_work;
-	struct pci_dev *pdev;
-	enum gdma_eqe_type type;
-};
-
 static void mana_do_service(enum gdma_eqe_type type, struct pci_dev *pdev)
 {
 	switch (type) {
@@ -542,7 +536,7 @@ static void mana_recovery_delayed_func(struct work_struct *w)
 	spin_unlock_irqrestore(&work->lock, flags);
 }
 
-static void mana_serv_func(struct work_struct *w)
+void mana_serv_func(struct work_struct *w)
 {
 	struct mana_serv_work *mns_wk;
 	struct pci_dev *pdev;
@@ -624,7 +618,7 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
 			break;
 		}
 
-		if (gc->in_service) {
+		if (test_bit(GC_IN_SERVICE, &gc->flags)) {
 			dev_info(gc->dev, "Already in service\n");
 			break;
 		}
@@ -641,7 +635,7 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
 		}
 
 		dev_info(gc->dev, "Start MANA service type:%d\n", type);
-		gc->in_service = true;
+		set_bit(GC_IN_SERVICE, &gc->flags);
 		mns_wk->pdev = to_pci_dev(gc->dev);
 		mns_wk->type = type;
 		pci_dev_get(mns_wk->pdev);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 91c418097284..8da574cf06f2 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -879,7 +879,7 @@ static void mana_tx_timeout(struct net_device *netdev, unsigned int txqueue)
 	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 
 	/* Already in service, hence tx queue reset is not required.*/
-	if (gc->in_service)
+	if (test_bit(GC_IN_SERVICE, &gc->flags))
 		return;
 
 	/* Note: If there are pending queue reset work for this port(apc),
@@ -3533,6 +3533,8 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
 {
 	struct mana_context *ac =
 		container_of(to_delayed_work(work), struct mana_context, gf_stats_work);
+	struct gdma_context *gc = ac->gdma_dev->gdma_context;
+	struct mana_serv_work *mns_wk;
 	int err;
 
 	err = mana_query_gf_stats(ac);
@@ -3540,6 +3542,30 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
 		/* HWC timeout detected - reset stats and stop rescheduling */
 		ac->hwc_timeout_occurred = true;
 		memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
+		dev_warn(gc->dev,
+			 "Gf stats wk handler: gf stats query timed out.\n");
+
+		/* As HWC timed out, indicating a faulty HW state and needs a
+		 * reset.
+		 */
+		if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
+			if (!try_module_get(THIS_MODULE)) {
+				dev_info(gc->dev, "Module is unloading\n");
+				return;
+			}
+
+			mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
+			if (!mns_wk) {
+				module_put(THIS_MODULE);
+				return;
+			}
+
+			mns_wk->pdev = to_pci_dev(gc->dev);
+			mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST;
+			pci_dev_get(mns_wk->pdev);
+			INIT_WORK(&mns_wk->serv_work, mana_serv_func);
+			schedule_work(&mns_wk->serv_work);
+		}
 		return;
 	}
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index a59bd4035a99..fb946389d593 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -213,6 +213,12 @@ enum gdma_page_type {
 
 #define GDMA_INVALID_DMA_REGION 0
 
+struct mana_serv_work {
+	struct work_struct serv_work;
+	struct pci_dev *pdev;
+	enum gdma_eqe_type type;
+};
+
 struct gdma_mem_info {
 	struct device *dev;
 
@@ -384,6 +390,7 @@ struct gdma_irq_context {
 
 enum gdma_context_flags {
 	GC_PROBE_SUCCEEDED	= 0,
+	GC_IN_SERVICE		= 1,
 };
 
 struct gdma_context {
@@ -409,7 +416,6 @@ struct gdma_context {
 	u32			test_event_eq_id;
 
 	bool			is_pf;
-	bool			in_service;
 
 	phys_addr_t		bar0_pa;
 	void __iomem		*bar0_va;
@@ -471,6 +477,8 @@ int mana_gd_poll_cq(struct gdma_queue *cq, struct gdma_comp *comp, int num_cqe);
 
 void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
 
+void mana_serv_func(struct work_struct *w);
+
 struct gdma_wqe {
 	u32 reserved	:24;
 	u32 last_vbytes	:8;
@@ -613,6 +621,9 @@ enum {
 /* Driver can handle hardware recovery events during probe */
 #define GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY BIT(22)
 
+/* Driver supports self recovery on Hardware Channel timeouts */
+#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
+
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
@@ -626,7 +637,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_PERIODIC_STATS_QUERY | \
 	 GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY)
+	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
+	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next] net: ethtool: add COALESCE_RX_CQE_FRAMES/NSECS parameters
From: Kory Maincent @ 2026-02-23  9:25 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, Andrew Lunn, Jakub Kicinski, Donald Hunter,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Gal Pressman, Oleksij Rempel,
	Vadim Fedorenko, linux-kernel, linux-doc, haiyangz, paulros
In-Reply-To: <20260222212328.736628-1-haiyangz@linux.microsoft.com>

On Sun, 22 Feb 2026 13:23:17 -0800
Haiyang Zhang <haiyangz@linux.microsoft.com> wrote:

> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Add two parameters for drivers supporting Rx CQE Coalescing.
> 
> ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
> Maximum number of frames that can be coalesced into a CQE.
> 
> ETHTOOL_A_COALESCE_RX_CQE_NSECS:
> Time out value in nanoseconds after the first packet arrival in a
> coalesced CQE to be sent.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

You sent this patch one day before the official reopening of net-next.
Not sure if this will be taken into account by patchwork.
Else:
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>

Thank you!
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

^ permalink raw reply

* [PATCH AUTOSEL 6.19] mshv: Ignore second stats page map result failure
From: Sasha Levin @ 2026-02-23 12:37 UTC (permalink / raw)
  To: patches, stable
  Cc: Purna Pavan Chandra Aekkaladevi, Nuno Das Neves,
	Stanislav Kinsburskii, Michael Kelley, Wei Liu, Sasha Levin, kys,
	haiyangz, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260223123738.1532940-1-sashal@kernel.org>

From: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>

[ Upstream commit 7538b80e5a4b473b73428d13b3a47ceaad9a8a7c ]

Older versions of the hypervisor do not have a concept of separate SELF
and PARENT stats areas. In this case, mapping the HV_STATS_AREA_SELF page
is sufficient - it's the only page and it contains all available stats.

Mapping HV_STATS_AREA_PARENT returns HV_STATUS_INVALID_PARAMETER which
currently causes module init to fail on older hypervisor versions.

Detect this case and gracefully fall back to populating
stats_pages[HV_STATS_AREA_PARENT] with the already-mapped SELF page.

Add comments to clarify the behavior, including a clarification of why
this isn't needed for hv_call_map_stats_page2() which always supports
PARENT and SELF areas.

Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a thorough analysis.

## Analysis

### 1. Commit Message Analysis

The commit clearly describes a backward compatibility bug: on older
Hyper-V hypervisor versions that don't support separate SELF and PARENT
stats areas, mapping `HV_STATS_AREA_PARENT` returns
`HV_STATUS_INVALID_PARAMETER`, which causes **module initialization to
fail entirely**. This is not a feature addition — it's fixing a
regression/incompatibility where the driver doesn't work on older
hypervisors.

### 2. Code Change Analysis

The fix has three parts:

**a) New helper `hv_stats_get_area_type()`** (~15 lines): Extracts the
stats area type from the identity union based on the object type. This
is needed to distinguish PARENT from SELF area mapping requests.

**b) Modified `hv_call_map_stats_page()`** (~20 lines changed): When the
hypercall returns `HV_STATUS_INVALID_PARAMETER` specifically for a
PARENT area mapping, instead of failing with an error, it returns
success but with `*addr = NULL`. This signals to the caller that PARENT
isn't supported.

**c) Modified `mshv_vp_stats_map()`** (+3 lines): After mapping PARENT,
if the address is NULL (meaning older hypervisor), it falls back to
using the already-mapped SELF page for both areas. This is safe because
on older hypervisors, the SELF page contains all available stats.
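
The three parts above can be illustrated with a minimal user-space
sketch. It is not the driver code: every `sketch_*` name, the simulated
hypercall, and the status values are hypothetical stand-ins, and only
the control flow (treat "invalid parameter" on PARENT as success with a
NULL address, then alias PARENT to SELF) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the stats area types and status codes. */
enum sketch_area { SKETCH_AREA_SELF = 0, SKETCH_AREA_PARENT = 1, SKETCH_AREA_COUNT };
enum sketch_status { SKETCH_OK = 0, SKETCH_INVALID_PARAMETER, SKETCH_OTHER_ERROR };

static char sketch_self_page[64];
static char sketch_parent_page[64];

/* Simulated hypercall: an "old" hypervisor rejects the PARENT area. */
static enum sketch_status sketch_hypercall(int hv_supports_parent,
					   enum sketch_area area, void **addr)
{
	if (area == SKETCH_AREA_SELF) {
		*addr = sketch_self_page;
		return SKETCH_OK;
	}
	if (hv_supports_parent) {
		*addr = sketch_parent_page;
		return SKETCH_OK;
	}
	return SKETCH_INVALID_PARAMETER;
}

/* Part (b): PARENT + invalid parameter becomes success with a NULL address. */
static int sketch_map_stats_page(int hv_supports_parent,
				 enum sketch_area area, void **addr)
{
	enum sketch_status status;

	assert(addr != NULL);
	status = sketch_hypercall(hv_supports_parent, area, addr);
	if (status == SKETCH_OK)
		return 0;

	if (area == SKETCH_AREA_PARENT && status == SKETCH_INVALID_PARAMETER) {
		*addr = NULL;
		return 0;
	}
	return -1;	/* any other failure is still an error */
}

/* Part (c): the caller falls back to the SELF page when PARENT is NULL. */
static int sketch_vp_stats_map(int hv_supports_parent,
			       void *pages[SKETCH_AREA_COUNT])
{
	if (sketch_map_stats_page(hv_supports_parent, SKETCH_AREA_SELF,
				  &pages[SKETCH_AREA_SELF]))
		return -1;
	if (sketch_map_stats_page(hv_supports_parent, SKETCH_AREA_PARENT,
				  &pages[SKETCH_AREA_PARENT]))
		return -1;

	if (!pages[SKETCH_AREA_PARENT])
		pages[SKETCH_AREA_PARENT] = pages[SKETCH_AREA_SELF];
	return 0;
}
```

Calling `sketch_vp_stats_map(0, pages)` models the older hypervisor:
both slots end up pointing at the same page. With a first argument of 1
the two slots stay distinct, which matches the claim that behavior on
newer hypervisors is unchanged.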

### 3. Bug Impact

- **Severity**: HIGH — the driver completely fails to create VPs
  (virtual processors), making it unusable on older hypervisor versions
- **User impact**: Anyone running the mshv_root driver on an older
  Hyper-V hypervisor version cannot use the driver at all
- **Trigger**: Deterministic — always fails on affected hypervisor
  versions, not a race or edge case

### 4. Scope and Risk

- The change is small (~40 lines including comments) and well-contained
  to the stats page mapping path
- It adds graceful degradation, not new behavior — the driver works the
  same on new hypervisors
- Multiple reviewers: Reviewed-by and Acked-by from Stanislav
  Kinsburskii, Reviewed-by from Michael Kelley
- The fallback logic (using SELF page for both) is explicitly described
  as correct by the authors since older hypervisors put all stats in the
  single SELF page

### 5. Stable Tree Applicability

- The mshv_root driver was introduced in **v6.15** (commit
  `621191d709b14`)
- The bug has been present **since the driver was first introduced** —
  the original `mshv_vp_stats_map()` function already mapped both SELF
  and PARENT areas with no fallback
- In v6.19 the function was renamed from `hv_call_map_stat_page` to
  `hv_call_map_stats_page` and wrapped by `hv_map_stats_page` (commit
  `d62313bdf5961`), so backporting to pre-6.19 trees would require minor
  adaptation
- The fix applies to all stable trees containing the mshv_root driver
  (6.15.y through 6.18.y)

### 6. Stable Criteria Assessment

| Criterion | Assessment |
|-----------|-----------|
| Obviously correct and tested | YES — well-reviewed, clear logic |
| Fixes a real bug | YES — module init failure |
| Important issue | YES — driver completely non-functional on older hypervisors |
| Small and contained | YES — ~40 lines, 2 files, single subsystem |
| No new features | YES — adds backward compatibility/graceful degradation |

### Verification

- `git show 621191d709b14 -- drivers/hv/mshv_root_main.c` confirmed
  `mshv_vp_stats_map()` with SELF+PARENT mapping was present in the
  original v6.15 driver introduction
- `git show 621191d709b14 -- drivers/hv/mshv_root_hv_call.c` confirmed
  the original `hv_call_map_stat_page()` had no fallback for PARENT area
  failures
- `git describe --contains 621191d709b14` returned `v6.15-rc1~193^2~2`,
  confirming v6.15 introduction
- `git describe --contains d62313bdf5961` returned `v6.19-rc1~56^2~30`,
  confirming function rename happened in v6.19
- `git log -p` diff confirmed the rename from `hv_call_map_stat_page` to
  `hv_call_map_stats_page` occurred in `d62313bdf5961`
- The bug mechanism is verified: older hypervisors return
  `HV_STATUS_INVALID_PARAMETER` for PARENT area mapping, which was not
  handled, causing the error to propagate up and fail VP creation
- Could NOT independently verify which exact stable trees (6.15.y,
  6.16.y, 6.17.y, 6.18.y) are still actively maintained (unverified)

### Conclusion

This is a legitimate bug fix for a complete driver initialization
failure on older Hyper-V hypervisor versions. The bug has existed since
the driver was first introduced in v6.15. The fix is small, well-
contained, well-reviewed, and poses minimal regression risk. Backporting
to pre-6.19 stable trees would require minor adaptation due to function
renames, but the logic is straightforward. It meets all stable kernel
criteria.

**YES**

 drivers/hv/mshv_root_hv_call.c | 52 +++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root_main.c    |  3 ++
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 598eaff4ff299..1f93b94d7580c 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -813,6 +813,13 @@ hv_call_notify_port_ring_empty(u32 sint_index)
 	return hv_result_to_errno(status);
 }
 
+/*
+ * Equivalent of hv_call_map_stats_page() for cases when the caller provides
+ * the map location.
+ *
+ * NOTE: This is a newer hypercall that always supports SELF and PARENT stats
+ * areas, unlike hv_call_map_stats_page().
+ */
 static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 				   const union hv_stats_object_identity *identity,
 				   u64 map_location)
@@ -855,6 +862,34 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 	return ret;
 }
 
+static int
+hv_stats_get_area_type(enum hv_stats_object_type type,
+		       const union hv_stats_object_identity *identity)
+{
+	switch (type) {
+	case HV_STATS_OBJECT_HYPERVISOR:
+		return identity->hv.stats_area_type;
+	case HV_STATS_OBJECT_LOGICAL_PROCESSOR:
+		return identity->lp.stats_area_type;
+	case HV_STATS_OBJECT_PARTITION:
+		return identity->partition.stats_area_type;
+	case HV_STATS_OBJECT_VP:
+		return identity->vp.stats_area_type;
+	}
+
+	return -EINVAL;
+}
+
+/*
+ * Map a stats page, where the page location is provided by the hypervisor.
+ *
+ * NOTE: The concept of separate SELF and PARENT stats areas does not exist on
+ * older hypervisor versions. All the available stats information can be found
+ * on the SELF page. When attempting to map the PARENT area on a hypervisor
+ * that doesn't support it, return "success" but with a NULL address. The
+ * caller should check for this case and instead fallback to the SELF area
+ * alone.
+ */
 static int hv_call_map_stats_page(enum hv_stats_object_type type,
 				  const union hv_stats_object_identity *identity,
 				  void **addr)
@@ -863,7 +898,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 	struct hv_input_map_stats_page *input;
 	struct hv_output_map_stats_page *output;
 	u64 status, pfn;
-	int ret = 0;
+	int hv_status, ret = 0;
 
 	do {
 		local_irq_save(flags);
@@ -878,11 +913,20 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 		pfn = output->map_location;
 
 		local_irq_restore(flags);
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
-			ret = hv_result_to_errno(status);
+
+		hv_status = hv_result(status);
+		if (hv_status != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (hv_result_success(status))
 				break;
-			return ret;
+
+			if (hv_stats_get_area_type(type, identity) == HV_STATS_AREA_PARENT &&
+			    hv_status == HV_STATUS_INVALID_PARAMETER) {
+				*addr = NULL;
+				return 0;
+			}
+
+			hv_status_debug(status, "\n");
+			return hv_result_to_errno(status);
 		}
 
 		ret = hv_call_deposit_pages(NUMA_NO_NODE,
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 681b58154d5ea..d3e8a66443ad6 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -993,6 +993,9 @@ static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
 	if (err)
 		goto unmap_self;
 
+	if (!stats_pages[HV_STATS_AREA_PARENT])
+		stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+
 	return 0;
 
 unmap_self:
-- 
2.51.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.19] x86/hyperv: Move hv crash init after hypercall pg setup
From: Sasha Levin @ 2026-02-23 12:37 UTC (permalink / raw)
  To: patches, stable
  Cc: Mukesh R, Wei Liu, Sasha Levin, kys, haiyangz, decui, longli,
	tglx, mingo, bp, dave.hansen, x86, linux-hyperv, linux-kernel
In-Reply-To: <20260223123738.1532940-1-sashal@kernel.org>

From: Mukesh R <mrathor@linux.microsoft.com>

[ Upstream commit c3a6ae7ea2d3f507cbddb5818ccc65b9d84d6dc7 ]

hv_root_crash_init() is not setting up the hypervisor crash collection
for baremetal cases because the hypercall page is not yet set up when
it is called.

The fix is simple: move the crash init call after the hypercall page
setup.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### What the commit does

This commit fixes an initialization ordering bug in the Hyper-V root
partition crash dump setup on x86. The function `hv_root_crash_init()`
needs to make hypercalls (specifically `HVCALL_GET_SYSTEM_PROPERTY`) to
set up crash dump collection for the root partition. However, it was
being called **before** `hv_set_hypercall_pg()`, which is the function
that installs the hypercall page into the static call trampoline.
Without `hv_set_hypercall_pg()` having run, any hypercalls made by
`hv_root_crash_init()` would fail silently (or call the `__hv_hyperfail`
stub), meaning crash collection was never properly set up on baremetal
Hyper-V root partitions.

The fix is small and surgical: move the `hv_root_crash_init()` call from
inside the `if (hv_root_partition())` block (before
`hv_set_hypercall_pg()`) to after `hv_set_hypercall_pg()`, with an
explicit `hv_root_partition()` guard.
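
The ordering dependency described above can be modelled in plain
userspace C. This is only a minimal sketch under stated assumptions: the
function pointer stands in for the static call trampoline, and every
name below (`hypercall_fail_stub`, `set_hypercall_pg`, `crash_init`,
`init_old_order`, `init_new_order`) is hypothetical and is not the
kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the stub a hypercall resolves to before
 * the real hypercall page has been installed. */
static uint64_t hypercall_fail_stub(uint64_t code)
{
	(void)code;
	return UINT64_MAX;	/* models "hypercall failed" */
}

/* Hypothetical stand-in for a working hypercall page. */
static uint64_t hypercall_real(uint64_t code)
{
	(void)code;
	return 0;		/* models "success" */
}

/* Models the trampoline target; it starts out pointing at the stub. */
static uint64_t (*hypercall_target)(uint64_t) = hypercall_fail_stub;

/* Models installing the hypercall page: retarget the trampoline. */
static void set_hypercall_pg(void)
{
	hypercall_target = hypercall_real;
}

/* Models a crash-init routine that needs one successful hypercall.
 * The code value 1 is an arbitrary placeholder.
 * Returns 1 if crash collection was armed, 0 otherwise. */
static int crash_init(void)
{
	return hypercall_target(1) == 0;
}

/* Old order: crash init runs before the hypercall page is installed. */
static int init_old_order(void)
{
	int armed;

	hypercall_target = hypercall_fail_stub;
	armed = crash_init();
	set_hypercall_pg();
	return armed;
}

/* New order: crash init runs after the hypercall page is installed. */
static int init_new_order(void)
{
	hypercall_target = hypercall_fail_stub;
	set_hypercall_pg();
	return crash_init();
}
```

In this model, `init_old_order()` leaves crash collection unarmed while
`init_new_order()` arms it, which a caller could check with
`assert(init_new_order() == 1)`.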

### Does it fix a real bug?

Yes. On baremetal Hyper-V root partitions, crash dump collection was
completely non-functional. This is a real bug that affects kernel crash
diagnostics in production Hyper-V environments.

### Size and scope

Very small: 1 file, 3 lines added, 1 line removed. The change is a
simple reordering of an existing function call.

### Dependency analysis - CRITICAL ISSUE

The prerequisite commit `77c860d2dbb72` ("x86/hyperv: Enable build of
hypervisor crashdump collection files") that **introduced**
`hv_root_crash_init()` was first merged in **v6.19-rc1**. It is NOT
present in v6.18.y or any earlier stable trees.

This means:
- The code being fixed (`hv_root_crash_init()`) does not exist in any
  stable tree prior to 6.19.y
- The bug was introduced in v6.19-rc1 and this fix targets the same
  v6.19.y tree
- For stable trees 6.18.y and older, there is nothing to fix — the buggy
  code doesn't exist there

### Risk assessment

For 6.19.y stable: Very low risk. The change is a simple reordering of
an initialization call, only affects Hyper-V root partition (baremetal)
configurations, and the commit is authored by the same developer who
introduced the feature.

### Stable kernel criteria

- Obviously correct: Yes, the ordering dependency is clear
- Fixes a real bug: Yes, crash dump collection fails on root partitions
- Small and contained: Yes, 4-line change in 1 file
- No new features: Correct, just reorders existing initialization

### Verdict

This is a valid bugfix for v6.19.y stable. It fixes code that was
introduced in v6.19-rc1 and is only relevant to the 6.19.y stable tree.
For that tree, it should be backported.

### Verification

- **git log** confirmed `77c860d2dbb72` introduced
  `hv_root_crash_init()` on 2025-10-06
- **git tag --contains** confirmed `77c860d2dbb72` is in v6.19-rc1 and
  v6.19 but NOT in v6.18.13
- **git merge-base --is-ancestor** confirmed the prerequisite is NOT in
  v6.18.y stable
- **Read of hv_init.c:63-70** confirmed `hv_set_hypercall_pg()` sets up
  the static call trampoline needed for hypercalls to work
- **Read of hv_init.c:530-589** confirmed the ordering:
  `hv_root_crash_init()` was called at line 561, before
  `hv_set_hypercall_pg()` at line 568
- The fix commit `c3a6ae7ea2d3f` changes 1 file, 3 insertions, 1
  deletion — verified via `git show --stat`
- The Explore agent confirmed `hv_root_crash_init()` makes hypercalls
  (`HVCALL_GET_SYSTEM_PROPERTY`) that require the hypercall page to be
  set up

**YES**

 arch/x86/hyperv/hv_init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 14de43f4bc6c1..7f3301bd081ec 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -558,7 +558,6 @@ void __init hyperv_init(void)
 		memunmap(src);

 		hv_remap_tsc_clocksource();
-		hv_root_crash_init();
 		hv_sleep_notifiers_register();
 	} else {
 		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
@@ -567,6 +566,9 @@ void __init hyperv_init(void)

 	hv_set_hypercall_pg(hv_hypercall_pg);

+	if (hv_root_partition())        /* after set hypercall pg */
+		hv_root_crash_init();
+
 skip_hypercall_pg_init:
 	/*
 	 * hyperv_init() is called before LAPIC is initialized: see
-- 
2.51.0

^ permalink raw reply related

* [PATCH AUTOSEL 6.19-6.18] mshv: clear eventfd counter on irqfd shutdown
From: Sasha Levin @ 2026-02-23 12:37 UTC (permalink / raw)
  To: patches, stable
  Cc: Carlos López, Wei Liu, Sasha Levin, kys, haiyangz, decui,
	longli, linux-hyperv, linux-kernel
In-Reply-To: <20260223123738.1532940-1-sashal@kernel.org>

From: Carlos López <clopez@suse.de>

[ Upstream commit 2b4246153e2184e3a3b4edc8cc35337d7a2455a6 ]

While unhooking from the irqfd waitqueue, clear the internal eventfd
counter by using eventfd_ctx_remove_wait_queue() instead of
remove_wait_queue(), preventing potential spurious interrupts. This
removes the need to store a pointer into the workqueue, as the eventfd
already keeps track of it.

This mimics what other similar subsystems do on their equivalent paths
with their irqfds (KVM, Xen, ACRN support, etc).

Signed-off-by: Carlos López <clopez@suse.de>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### 1. Commit Message Analysis

The commit replaces `remove_wait_queue()` with
`eventfd_ctx_remove_wait_queue()` in `mshv_irqfd_shutdown()`. The key
claim is that this "prevents potential spurious interrupts" by clearing
the eventfd counter atomically when unhooking from the waitqueue. The
commit also removes the now-unnecessary `irqfd_wqh` pointer from the
struct.

The phrase "potential spurious interrupts" uses the word "potential" —
suggesting this is a preventive/hardening fix rather than a response to
an observed bug.

### 2. Code Change Analysis

The change is small and well-defined:
- **`mshv_irqfd_shutdown()`**: `remove_wait_queue(irqfd->irqfd_wqh,
  &irqfd->irqfd_wait)` →
  `eventfd_ctx_remove_wait_queue(irqfd->irqfd_eventfd_ctx,
  &irqfd->irqfd_wait, &cnt)`. The new call atomically removes the waiter
  AND resets the eventfd counter to zero.
- **`mshv_irqfd_queue_proc()`**: Removes `irqfd->irqfd_wqh = wqh` since
  the field is no longer needed.
- **`struct mshv_irqfd`**: Removes the `irqfd_wqh` field.

Without clearing the counter, if an eventfd had been signaled before
shutdown completes, stale events could remain in the counter. This is a
real correctness concern, though labeled as "potential."
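
The counter behaviour can be observed from userspace with `eventfd(2)`.
The sketch below is an analogy only: `eventfd_ctx_remove_wait_queue()`
is a kernel-internal API and is not called here. It assumes a Linux
system, and the helper names `signal_and_drain` and
`counter_empty_after_drain` are hypothetical. It shows that signals
accumulate in the counter until something reads (and thereby clears) it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal a fresh eventfd n times, then drain it with a single read.
 * Returns the drained counter value, or -1 on error. */
static int64_t signal_and_drain(unsigned int n)
{
	int fd = eventfd(0, EFD_NONBLOCK);
	uint64_t one = 1, cnt = 0;
	unsigned int i;

	if (fd < 0)
		return -1;

	for (i = 0; i < n; i++) {
		if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(fd);
			return -1;
		}
	}

	/* In non-semaphore mode one read returns the accumulated count
	 * and resets the counter to zero. */
	if (read(fd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt)) {
		close(fd);
		return -1;
	}

	close(fd);
	return (int64_t)cnt;
}

/* Returns 1 if the counter is empty after a drain, 0 if not,
 * -1 on error. */
static int counter_empty_after_drain(void)
{
	int fd = eventfd(0, EFD_NONBLOCK);
	uint64_t one = 1, cnt = 0;
	int empty;

	if (fd < 0)
		return -1;

	if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one) ||
	    read(fd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt)) {
		close(fd);
		return -1;
	}

	/* A second read on an empty non-blocking eventfd fails with
	 * EAGAIN: no stale events remain. */
	empty = (read(fd, &cnt, sizeof(cnt)) == -1 && errno == EAGAIN);
	close(fd);
	return empty;
}
```

If the drain step is skipped, the earlier writes stay in the counter,
which is the userspace counterpart of the stale events discussed above.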

### 3. Pattern Match with KVM/Xen/ACRN/VFIO

All four analogous subsystems use `eventfd_ctx_remove_wait_queue()` in
their irqfd shutdown paths:
- `virt/kvm/eventfd.c:136`
- `drivers/xen/privcmd.c:906`
- `drivers/virt/acrn/irqfd.c:55`
- `drivers/vfio/virqfd.c:90`

The mshv code was the sole outlier using plain `remove_wait_queue()`.
This is a well-established pattern for correct irqfd teardown.

### 4. Driver Age and Stable Tree Applicability

The mshv driver was introduced in v6.15-rc1 (commit `621191d709b14`). It
would only exist in stable trees 6.15.y and newer (6.16.y, 6.17.y,
6.18.y). It does NOT exist in the older LTS trees (6.12.y, 6.6.y,
6.1.y, 5.15.y).

### 5. Risk Assessment

- **Size**: Very small — changes 2 files, net removal of code (removes a
  struct field and an assignment)
- **Risk**: Very low — uses a well-understood API that's already used by
  KVM, Xen, ACRN, and VFIO
- **Scope**: Confined to mshv irqfd shutdown path only

### 6. Bug Severity

Without this fix, if the eventfd counter is non-zero during shutdown,
leftover events could trigger spurious interrupt injection into the
guest VM. In a hypervisor context, spurious interrupts can cause guest
OS instability. While this is described as "potential" (no specific bug
report), the failure mode is real and the fix is straightforward.

### 7. Stable Criteria Assessment

- **Obviously correct**: YES — follows the exact pattern used by KVM,
  Xen, ACRN, VFIO
- **Fixes a real bug**: Borderline — fixes a latent correctness issue
  rather than a reported crash
- **Small and contained**: YES — minimal change
- **No new features**: YES — this is purely a correctness fix
- **Tested**: It mirrors established code in KVM and other subsystems

### Verification

- `git log --follow --diff-filter=A -- drivers/hv/mshv_eventfd.c`
  confirmed the file was introduced in commit `621191d709b14` ("Drivers:
  hv: Introduce mshv_root module")
- `git describe --tags --contains 621191d709b14` confirmed this was
  introduced in v6.15-rc1
- Grep for `eventfd_ctx_remove_wait_queue` confirmed all four analogous
  subsystems (KVM, Xen, ACRN, VFIO) use this API in their shutdown paths
- Read of `drivers/hv/mshv_eventfd.c` confirmed the pre-patch code uses
  `remove_wait_queue()` with the stored `irqfd_wqh` pointer (line 255)
- Verified stable tree tags exist for 6.15.y through 6.18.y that would
  contain this driver
- Could NOT verify any specific user-reported bugs caused by the lack of
  counter clearing (unverified — commit only says "potential")

### Conclusion

This is a small, low-risk correctness fix that aligns mshv with the
well-established pattern used by KVM, Xen, ACRN, and VFIO for irqfd
shutdown. It prevents stale eventfd events from potentially causing
spurious interrupts in guest VMs. The fix is obviously correct, tiny in
scope, and carries essentially zero regression risk. While it addresses
a "potential" rather than actively reported issue, the fix is clearly
the right thing to do for stable users running Microsoft Hypervisor
workloads on 6.15+ kernels.

**YES**

 drivers/hv/mshv_eventfd.c | 5 ++---
 drivers/hv/mshv_eventfd.h | 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 0b75ff1edb735..cb8b24b81cd5e 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -247,12 +247,13 @@ static void mshv_irqfd_shutdown(struct work_struct *work)
 {
 	struct mshv_irqfd *irqfd =
 			container_of(work, struct mshv_irqfd, irqfd_shutdown);
+	u64 cnt;

 	/*
 	 * Synchronize with the wait-queue and unhook ourselves to prevent
 	 * further events.
 	 */
-	remove_wait_queue(irqfd->irqfd_wqh, &irqfd->irqfd_wait);
+	eventfd_ctx_remove_wait_queue(irqfd->irqfd_eventfd_ctx, &irqfd->irqfd_wait, &cnt);

 	if (irqfd->irqfd_resampler) {
 		mshv_irqfd_resampler_shutdown(irqfd);
@@ -371,8 +372,6 @@ static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
 	struct mshv_irqfd *irqfd =
 			container_of(polltbl, struct mshv_irqfd, irqfd_polltbl);

-	irqfd->irqfd_wqh = wqh;
-
 	/*
 	 * TODO: Ensure there isn't already an exclusive, priority waiter, e.g.
 	 * that the irqfd isn't already bound to another partition.  Only the
diff --git a/drivers/hv/mshv_eventfd.h b/drivers/hv/mshv_eventfd.h
index 332e7670a3442..464c6b81ab336 100644
--- a/drivers/hv/mshv_eventfd.h
+++ b/drivers/hv/mshv_eventfd.h
@@ -32,7 +32,6 @@ struct mshv_irqfd {
 	struct mshv_lapic_irq		     irqfd_lapic_irq;
 	struct hlist_node		     irqfd_hnode;
 	poll_table			     irqfd_polltbl;
-	wait_queue_head_t		    *irqfd_wqh;
 	wait_queue_entry_t		     irqfd_wait;
 	struct work_struct		     irqfd_shutdown;
 	struct mshv_irqfd_resampler	    *irqfd_resampler;
-- 
2.51.0

^ permalink raw reply related

* [PATCH AUTOSEL 6.19] Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sasha Levin @ 2026-02-23 12:37 UTC (permalink / raw)
  To: patches, stable
  Cc: Jan Kiszka, Florian Bezdeka, Michael Kelley, Wei Liu, Sasha Levin,
	kys, haiyangz, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260223123738.1532940-1-sashal@kernel.org>

From: Jan Kiszka <jan.kiszka@siemens.com>

[ Upstream commit f8e6343b7a89c7c649db5a9e309ba7aa20401813 ]

Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
with related guest support enabled:

[    1.127941] hv_vmbus: registering driver hyperv_drm

[    1.132518] =============================
[    1.132519] [ BUG: Invalid wait context ]
[    1.132521] 6.19.0-rc8+ #9 Not tainted
[    1.132524] -----------------------------
[    1.132525] swapper/0/0 is trying to lock:
[    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
[    1.132543] other info that might help us debug this:
[    1.132544] context-{2:2}
[    1.132545] 1 lock held by swapper/0/0:
[    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
[    1.132557] stack backtrace:
[    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
[    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
[    1.132567] Call Trace:
[    1.132570]  <IRQ>
[    1.132573]  dump_stack_lvl+0x6e/0xa0
[    1.132581]  __lock_acquire+0xee0/0x21b0
[    1.132592]  lock_acquire+0xd5/0x2d0
[    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132606]  ? lock_acquire+0xd5/0x2d0
[    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132619]  rt_spin_lock+0x3f/0x1f0
[    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132634]  vmbus_chan_sched+0xc4/0x2b0
[    1.132641]  vmbus_isr+0x2c/0x150
[    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
[    1.132654]  sysvec_hyperv_callback+0x88/0xb0
[    1.132658]  </IRQ>
[    1.132659]  <TASK>
[    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20

As code paths that handle vmbus IRQs use sleeping locks under PREEMPT_RT,
the vmbus_isr execution needs to be moved into thread context.
Open-coding this avoids the extra IPI that irq_work would bring, which
we do not need because we are in an IRQ, never an NMI.

This affects both x86 and arm64, therefore hook into the common driver
logic.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Tested-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

The bug exists in all stable trees that have PREEMPT_RT support.
However, the patch would need significant adaptation to apply to them
due to the 6.19-specific refactoring of `vmbus_isr` and
`vmbus_chan_sched`.

## Analysis

### What the commit fixes

This commit fixes a **sleeping-in-atomic-context bug** on PREEMPT_RT
kernels running on Hyper-V. The issue is that `vmbus_isr()` runs in hard
IRQ context (called from `sysvec_hyperv_callback` on x86) and acquires
`spin_lock(&channel->sched_lock)` via `vmbus_chan_sched()`. Under
PREEMPT_RT, spinlocks are converted to `rt_spin_lock` (sleeping locks),
which cannot be acquired from hard IRQ context. This triggers a lockdep
"BUG: Invalid wait context" and represents a real correctness issue (not
just a warning).
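
The deferral pattern (the interrupt handler only marks work pending and
wakes a thread, and the thread does the work that may sleep) can be
sketched in userspace with POSIX threads. This is an illustration under
stated assumptions, not the kernel code: `struct irq_thread`,
`irq_raise`, `irq_thread_fn` and `demo_threaded_irq` are hypothetical
names, and the mutex and condition variables stand in for the per-CPU
flag and `wake_up_process()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct irq_thread {
	pthread_mutex_t lock;
	pthread_cond_t wake;	/* signalled when work becomes pending */
	pthread_cond_t done;	/* signalled after each handled event */
	bool pending;
	bool stop;
	int handled;
};

/* Models the hard-IRQ side: record the event and wake the thread. */
static void irq_raise(struct irq_thread *t)
{
	pthread_mutex_lock(&t->lock);
	t->pending = true;
	pthread_cond_signal(&t->wake);
	pthread_mutex_unlock(&t->lock);
}

/* Models the thread side: the real handler runs here, where
 * sleeping is allowed. */
static void *irq_thread_fn(void *arg)
{
	struct irq_thread *t = arg;

	pthread_mutex_lock(&t->lock);
	for (;;) {
		while (!t->pending && !t->stop)
			pthread_cond_wait(&t->wake, &t->lock);
		if (!t->pending && t->stop)
			break;
		t->pending = false;
		pthread_mutex_unlock(&t->lock);
		/* ... work that may sleep would go here ... */
		pthread_mutex_lock(&t->lock);
		t->handled++;
		pthread_cond_broadcast(&t->done);
	}
	pthread_mutex_unlock(&t->lock);
	return NULL;
}

/* Raise n events one at a time and return how many were handled,
 * or -1 if the thread could not be created. */
static int demo_threaded_irq(int n)
{
	struct irq_thread t = { .pending = false, .stop = false, .handled = 0 };
	pthread_t tid;
	int i, handled;

	pthread_mutex_init(&t.lock, NULL);
	pthread_cond_init(&t.wake, NULL);
	pthread_cond_init(&t.done, NULL);
	if (pthread_create(&tid, NULL, irq_thread_fn, &t) != 0)
		return -1;

	for (i = 0; i < n; i++) {
		irq_raise(&t);
		/* Wait so that consecutive events are not coalesced. */
		pthread_mutex_lock(&t.lock);
		while (t.handled < i + 1)
			pthread_cond_wait(&t.done, &t.lock);
		pthread_mutex_unlock(&t.lock);
	}

	pthread_mutex_lock(&t.lock);
	t.stop = true;
	pthread_cond_signal(&t.wake);
	pthread_mutex_unlock(&t.lock);
	pthread_join(tid, NULL);

	handled = t.handled;
	pthread_cond_destroy(&t.done);
	pthread_cond_destroy(&t.wake);
	pthread_mutex_destroy(&t.lock);
	return handled;
}
```

Because `pending` is a single flag, events raised back to back would be
coalesced into one run of the handler. The demo waits after each event
only to make the handled count deterministic.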

### Does it fix a real bug?

**Yes.** This is a legitimate bug that makes Hyper-V VMs with PREEMPT_RT
unusable or unstable. The lockdep trace is from real testing (6.19-rc8).
The issue affects all PREEMPT_RT Hyper-V guests.

### Stable kernel rule assessment

1. **Obviously correct and tested**: Yes - reviewed by Michael Kelley
   (Hyper-V maintainer) and Florian Bezdeka, tested by both.
2. **Fixes a real bug**: Yes - sleeping in hardirq context is a real bug
   on PREEMPT_RT.
3. **Important issue**: Moderate - affects PREEMPT_RT on Hyper-V, which
   is a meaningful but somewhat niche combination.
4. **Small and contained**: Borderline - ~80 lines in one file, but adds
   new per-CPU thread infrastructure.
5. **No new features**: The kthread is a mechanism to fix the bug, not a
   feature.

### Risk vs benefit

- **Benefit**: Fixes a real bug that makes PREEMPT_RT on Hyper-V broken.
- **Risk**: Low for non-RT kernels (everything is behind
  `IS_ENABLED(CONFIG_PREEMPT_RT)`, which is compile-time). Moderate for
  RT kernels (new kthread infrastructure, though using well-established
  `smpboot` API).

### Backport concerns

**Critical issue: Dependencies.** This patch was written against the
6.19 codebase which has undergone significant refactoring:
- `vmbus_isr()` changed from `static` to exported
  (`EXPORT_SYMBOL_FOR_MODULES`) in 6.19 via commit `cffe9f58de1eb`
- `vmbus_chan_sched()` signature changed from `vmbus_chan_sched(hv_cpu)`
  to `vmbus_chan_sched(event_page_addr)` in 6.19 via commit
  `163224c189e8b`
- The `vmbus_message_sched()` helper was factored out in 6.19

The patch **will not apply cleanly** to any existing stable tree
(6.12.y, 6.6.y, 6.1.y). A manual backport would be needed, adapting the
fix to the older `vmbus_isr` structure. While the core concept (use
kthread for RT) would work, the adaptation is non-trivial.

### Verification

- **git show v6.12, v6.6, v6.1 kernel/Kconfig.preempt**: Confirmed
  PREEMPT_RT config option exists in all these stable trees
- **git show v6.1 include/linux/smpboot.h**: Confirmed
  `smpboot_register_percpu_thread` API available since at least 6.1
- **git show v6.18 drivers/hv/vmbus_drv.c**: Confirmed `vmbus_isr` is
  `static` in v6.18, only exported in 6.19
- **git show v6.12, v6.6 drivers/hv/vmbus_drv.c**: Confirmed
  `vmbus_chan_sched()` takes `hv_cpu` (not `event_page_addr`) in older
  kernels
- **git log v6.18..v6.19 drivers/hv/vmbus_drv.c**: Identified
  prerequisite commits (163224c189e8b, cffe9f58de1eb) that refactored
  the code
- **Read arch/x86/kernel/cpu/mshyperv.c lines 153-168**: Confirmed
  `vmbus_handler()` is called from `sysvec_hyperv_callback` IDTENTRY
  (hard IRQ context)
- **Read drivers/hv/vmbus_drv.c line 1305**: Confirmed
  `spin_lock(&channel->sched_lock)` is the sleeping lock in the IRQ path
- **git show v6.12, v6.6**: Confirmed the `sched_lock` spin_lock exists
  in stable trees' vmbus_chan_sched, confirming the bug exists there too

### Decision

This is a legitimate bug fix for PREEMPT_RT on Hyper-V. The bug is real
and affects all stable trees with PREEMPT_RT. However, the patch:

1. Is moderate in size, adding new per-CPU thread infrastructure (~80
   lines)
2. Has significant dependencies on 6.19-specific refactoring and won't
   apply cleanly to any stable tree
3. Targets a somewhat niche combination (PREEMPT_RT + Hyper-V)
4. Would require careful manual adaptation for each stable tree

The fix is well-reviewed and technically sound, and it fixes a real bug.
Despite the backport complexity, the bug is severe enough (sleeping-in-
atomic-context breaks PREEMPT_RT on Hyper-V) that it warrants
backporting with appropriate adaptation. The
`IS_ENABLED(CONFIG_PREEMPT_RT)` guard ensures zero risk to non-RT users.

**YES**

 drivers/hv/vmbus_drv.c | 66 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index a53af6fe81a65..1d5cba142828e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -25,6 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/task_stack.h>
+#include <linux/smpboot.h>
 
 #include <linux/delay.h>
 #include <linux/panic_notifier.h>
@@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
 	}
 }
 
-void vmbus_isr(void)
+static void __vmbus_isr(void)
 {
 	struct hv_per_cpu_context *hv_cpu
 		= this_cpu_ptr(hv_context.cpu_context);
@@ -1363,6 +1364,53 @@ void vmbus_isr(void)
 
 	add_interrupt_randomness(vmbus_interrupt);
 }
+
+static DEFINE_PER_CPU(bool, vmbus_irq_pending);
+static DEFINE_PER_CPU(struct task_struct *, vmbus_irqd);
+
+static void vmbus_irqd_wake(void)
+{
+	struct task_struct *tsk = __this_cpu_read(vmbus_irqd);
+
+	__this_cpu_write(vmbus_irq_pending, true);
+	wake_up_process(tsk);
+}
+
+static void vmbus_irqd_setup(unsigned int cpu)
+{
+	sched_set_fifo(current);
+}
+
+static int vmbus_irqd_should_run(unsigned int cpu)
+{
+	return __this_cpu_read(vmbus_irq_pending);
+}
+
+static void run_vmbus_irqd(unsigned int cpu)
+{
+	__this_cpu_write(vmbus_irq_pending, false);
+	__vmbus_isr();
+}
+
+static bool vmbus_irq_initialized;
+
+static struct smp_hotplug_thread vmbus_irq_threads = {
+	.store                  = &vmbus_irqd,
+	.setup			= vmbus_irqd_setup,
+	.thread_should_run      = vmbus_irqd_should_run,
+	.thread_fn              = run_vmbus_irqd,
+	.thread_comm            = "vmbus_irq/%u",
+};
+
+void vmbus_isr(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		vmbus_irqd_wake();
+	} else {
+		lockdep_hardirq_threaded();
+		__vmbus_isr();
+	}
+}
 EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
 
 static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
@@ -1462,6 +1510,13 @@ static int vmbus_bus_init(void)
 	 * the VMbus interrupt handler.
 	 */
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
+		ret = smpboot_register_percpu_thread(&vmbus_irq_threads);
+		if (ret)
+			goto err_kthread;
+		vmbus_irq_initialized = true;
+	}
+
 	if (vmbus_irq == -1) {
 		hv_setup_vmbus_handler(vmbus_isr);
 	} else {
@@ -1507,6 +1562,11 @@ static int vmbus_bus_init(void)
 		free_percpu(vmbus_evt);
 	}
 err_setup:
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
+		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+		vmbus_irq_initialized = false;
+	}
+err_kthread:
 	bus_unregister(&hv_bus);
 	return ret;
 }
@@ -2976,6 +3036,10 @@ static void __exit vmbus_exit(void)
 		free_percpu_irq(vmbus_irq, vmbus_evt);
 		free_percpu(vmbus_evt);
 	}
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
+		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+		vmbus_irq_initialized = false;
+	}
 	for_each_online_cpu(cpu) {
 		struct hv_per_cpu_context *hv_cpu
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
-- 
2.51.0


^ permalink raw reply related

* Re: [PATCH v4 1/2] mshv: refactor synic init and cleanup
From: Anirudh Rayabharam @ 2026-02-23 13:24 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415781511D0B2A10FB9BB365D46AA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, Feb 18, 2026 at 04:17:23AM +0000, Michael Kelley wrote:
> From: Anirudh Rayabharam <anirudh@anirudhrb.com> Sent: Wednesday, February 11, 2026 9:07 AM
> > 
> > Rename mshv_synic_init() to mshv_synic_cpu_init() and
> > mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
> > these functions handle per-cpu synic setup and teardown.
> > 
> > Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
> > Move all the synic related setup from mshv_parent_partition_init.
> > 
> > Move the reboot notifier to mshv_synic.c because it currently only
> > operates on the synic cpuhp state.
> > 
> > Move out synic_pages from the global mshv_root since it's use is now
> 
> s/it's/its/
> 
> > completely local to mshv_synic.c.
> > 
> > This is in preparation for the next patch which will add more stuff to
> > mshv_synic_init().
> > 
> > No functional change.
> > 
> > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > ---
> >  drivers/hv/mshv_root.h      |  5 ++-
> >  drivers/hv/mshv_root_main.c | 59 +++++-------------------------
> >  drivers/hv/mshv_synic.c     | 71 +++++++++++++++++++++++++++++++++----
> >  3 files changed, 75 insertions(+), 60 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 3c1d88b36741..26e0320c8097 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -183,7 +183,6 @@ struct hv_synic_pages {
> >  };
> > 
> >  struct mshv_root {
> > -	struct hv_synic_pages __percpu *synic_pages;
> >  	spinlock_t pt_ht_lock;
> >  	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
> >  	struct hv_partition_property_vmm_capabilities vmm_caps;
> > @@ -242,8 +241,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
> >  void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
> > 
> >  void mshv_isr(void);
> > -int mshv_synic_init(unsigned int cpu);
> > -int mshv_synic_cleanup(unsigned int cpu);
> > +int mshv_synic_init(struct device *dev);
> > +void mshv_synic_cleanup(void);
> > 
> >  static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
> >  {
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 681b58154d5e..7c1666456e78 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -2035,7 +2035,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
> >  	return 0;
> >  }
> > 
> > -static int mshv_cpuhp_online;
> >  static int mshv_root_sched_online;
> > 
> >  static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> > @@ -2198,40 +2197,14 @@ root_scheduler_deinit(void)
> >  	free_percpu(root_scheduler_output);
> >  }
> > 
> > -static int mshv_reboot_notify(struct notifier_block *nb,
> > -			      unsigned long code, void *unused)
> > -{
> > -	cpuhp_remove_state(mshv_cpuhp_online);
> > -	return 0;
> > -}
> > -
> > -struct notifier_block mshv_reboot_nb = {
> > -	.notifier_call = mshv_reboot_notify,
> > -};
> > -
> >  static void mshv_root_partition_exit(void)
> >  {
> > -	unregister_reboot_notifier(&mshv_reboot_nb);
> >  	root_scheduler_deinit();
> >  }
> > 
> >  static int __init mshv_root_partition_init(struct device *dev)
> >  {
> > -	int err;
> > -
> > -	err = root_scheduler_init(dev);
> > -	if (err)
> > -		return err;
> > -
> > -	err = register_reboot_notifier(&mshv_reboot_nb);
> > -	if (err)
> > -		goto root_sched_deinit;
> > -
> > -	return 0;
> > -
> > -root_sched_deinit:
> > -	root_scheduler_deinit();
> > -	return err;
> > +	return root_scheduler_init(dev);
> >  }
> > 
> >  static void mshv_init_vmm_caps(struct device *dev)
> > @@ -2276,31 +2249,18 @@ static int __init mshv_parent_partition_init(void)
> >  			MSHV_HV_MAX_VERSION);
> >  	}
> > 
> > -	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> > -	if (!mshv_root.synic_pages) {
> > -		dev_err(dev, "Failed to allocate percpu synic page\n");
> > -		ret = -ENOMEM;
> > +	ret = mshv_synic_init(dev);
> > +	if (ret)
> >  		goto device_deregister;
> > -	}
> > -
> > -	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > -				mshv_synic_init,
> > -				mshv_synic_cleanup);
> > -	if (ret < 0) {
> > -		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> > -		goto free_synic_pages;
> > -	}
> > -
> > -	mshv_cpuhp_online = ret;
> > 
> >  	ret = mshv_retrieve_scheduler_type(dev);
> >  	if (ret)
> > -		goto remove_cpu_state;
> > +		goto synic_cleanup;
> > 
> >  	if (hv_root_partition())
> >  		ret = mshv_root_partition_init(dev);
> >  	if (ret)
> > -		goto remove_cpu_state;
> > +		goto synic_cleanup;
> > 
> >  	mshv_init_vmm_caps(dev);
> > 
> > @@ -2318,10 +2278,8 @@ static int __init mshv_parent_partition_init(void)
> >  exit_partition:
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> > -remove_cpu_state:
> > -	cpuhp_remove_state(mshv_cpuhp_online);
> > -free_synic_pages:
> > -	free_percpu(mshv_root.synic_pages);
> > +synic_cleanup:
> > +	mshv_synic_cleanup();
> >  device_deregister:
> >  	misc_deregister(&mshv_dev);
> >  	return ret;
> > @@ -2335,8 +2293,7 @@ static void __exit mshv_parent_partition_exit(void)
> >  	mshv_irqfd_wq_cleanup();
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> > -	cpuhp_remove_state(mshv_cpuhp_online);
> > -	free_percpu(mshv_root.synic_pages);
> > +	mshv_synic_cleanup();
> >  }
> > 
> >  module_init(mshv_parent_partition_init);
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index f8b0337cdc82..074e37c48876 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -12,11 +12,16 @@
> >  #include <linux/mm.h>
> >  #include <linux/io.h>
> >  #include <linux/random.h>
> > +#include <linux/cpuhotplug.h>
> > +#include <linux/reboot.h>
> >  #include <asm/mshyperv.h>
> > 
> >  #include "mshv_eventfd.h"
> >  #include "mshv.h"
> > 
> > +static int synic_cpuhp_online;
> > +static struct hv_synic_pages __percpu *synic_pages;
> > +
> >  static u32 synic_event_ring_get_queued_port(u32 sint_index)
> >  {
> >  	struct hv_synic_event_ring_page **event_ring_page;
> > @@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
> >  	u32 message;
> >  	u8 tail;
> > 
> > -	spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	spages = this_cpu_ptr(synic_pages);
> >  	event_ring_page = &spages->synic_event_ring_page;
> >  	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
> > 
> > @@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
> > 
> >  void mshv_isr(void)
> >  {
> > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> >  	struct hv_message *msg;
> >  	bool handled;
> > @@ -446,7 +451,7 @@ void mshv_isr(void)
> >  	}
> >  }
> > 
> > -int mshv_synic_init(unsigned int cpu)
> > +static int mshv_synic_cpu_init(unsigned int cpu)
> >  {
> >  	union hv_synic_simp simp;
> >  	union hv_synic_siefp siefp;
> > @@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
> >  	union hv_synic_sint sint;
> >  #endif
> >  	union hv_synic_scontrol sctrl;
> > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> >  	struct hv_synic_event_flags_page **event_flags_page =
> >  			&spages->synic_event_flags_page;
> > @@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
> >  	return -EFAULT;
> >  }
> > 
> > -int mshv_synic_cleanup(unsigned int cpu)
> > +static int mshv_synic_cpu_exit(unsigned int cpu)
> >  {
> >  	union hv_synic_sint sint;
> >  	union hv_synic_simp simp;
> >  	union hv_synic_siefp siefp;
> >  	union hv_synic_sirbp sirbp;
> >  	union hv_synic_scontrol sctrl;
> > -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> >  	struct hv_synic_event_flags_page **event_flags_page =
> >  		&spages->synic_event_flags_page;
> > @@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
> > 
> >  	mshv_portid_free(doorbell_portid);
> >  }
> > +
> > +static int mshv_synic_reboot_notify(struct notifier_block *nb,
> > +			      unsigned long code, void *unused)
> > +{
> > +	if (!hv_root_partition())
> > +		return 0;
> 
> I'm curious as to why the synic is cleaned up only for the root partition,
> but not for L1VH parents. L1VH parents *do* cleanup their synic in
> mshv_parent_partition_exit(). I probably don't understand all the
> vagaries of L1VH parents ....

I will check this. This cleanup matters mainly for kexec. I will do some
tests to see if L1VH needs it too.

If required, I will fix it in a separate patch. For this series I would
prefer to keep the "No functional change" claim intact.

Thanks,
Anirudh.

> 
> > +
> > +	cpuhp_remove_state(synic_cpuhp_online);
> > +	return 0;
> > +}
> > +
> > +static struct notifier_block mshv_synic_reboot_nb = {
> > +	.notifier_call = mshv_synic_reboot_notify,
> > +};
> > +
> > +int __init mshv_synic_init(struct device *dev)
> > +{
> > +	int ret = 0;
> > +
> > +	synic_pages = alloc_percpu(struct hv_synic_pages);
> > +	if (!synic_pages) {
> > +		dev_err(dev, "Failed to allocate percpu synic page\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> > +				mshv_synic_cpu_init,
> > +				mshv_synic_cpu_exit);
> > +	if (ret < 0) {
> > +		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> > +		goto free_synic_pages;
> > +	}
> > +
> > +	synic_cpuhp_online = ret;
> > +
> > +	ret = register_reboot_notifier(&mshv_synic_reboot_nb);
> > +	if (ret)
> > +		goto remove_cpuhp_state;
> > +
> > +	return 0;
> > +
> > +remove_cpuhp_state:
> > +	cpuhp_remove_state(synic_cpuhp_online);
> > +free_synic_pages:
> > +	free_percpu(synic_pages);
> > +	return ret;
> > +}
> > +
> > +void mshv_synic_cleanup(void)
> > +{
> > +	unregister_reboot_notifier(&mshv_synic_reboot_nb);
> > +	cpuhp_remove_state(synic_cpuhp_online);
> > +	free_percpu(synic_pages);
> > +}
> > --
> > 2.34.1
> > 
> 


* Re: [PATCH v4 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-02-23 13:41 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157B6F44266C4E813D3CF74D46AA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, Feb 18, 2026 at 04:17:29AM +0000, Michael Kelley wrote:
> From: Anirudh Rayabharam <anirudh@anirudhrb.com> Sent: Wednesday, February 11, 2026 9:07 AM
> > 
> > On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> > interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> > There is no such vector reserved for arm64.
> > 
> > On arm64, the hypervisor exposes a synthetic register that can be read
> > to find the INTID that should be used for SINTs. This INTID is in the
> > PPI range.
> > 
> > To better unify the code paths, introduce mshv_sint_vector_init() that
> > either reads the synthetic register and obtains the INTID (arm64) or
> > just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
> > 
> > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > ---
> >  drivers/hv/mshv_synic.c     | 112 +++++++++++++++++++++++++++++++++---
> >  include/hyperv/hvgdk_mini.h |   2 +
> >  2 files changed, 107 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> > index 074e37c48876..7957ad0328dd 100644
> > --- a/drivers/hv/mshv_synic.c
> > +++ b/drivers/hv/mshv_synic.c
> > @@ -10,17 +10,24 @@
> >  #include <linux/kernel.h>
> >  #include <linux/slab.h>
> >  #include <linux/mm.h>
> > +#include <linux/interrupt.h>
> >  #include <linux/io.h>
> >  #include <linux/random.h>
> >  #include <linux/cpuhotplug.h>
> >  #include <linux/reboot.h>
> >  #include <asm/mshyperv.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/acpi.h>
> > 
> >  #include "mshv_eventfd.h"
> >  #include "mshv.h"
> > 
> >  static int synic_cpuhp_online;
> >  static struct hv_synic_pages __percpu *synic_pages;
> > +static int mshv_sint_vector = -1; /* hwirq for the SynIC SINTs */
> 
> With the introduction of this variable, the call to add_interrupt_randomness()
> in mshv_isr() should be updated to pass mshv_sint_vector as the argument,
> and the #ifdef HYPERVISOR_CALLBACK_VECTOR can be dropped (yea!).  My
> previous comment about the generic Linux IRQ handling doing the call
> to add_interrupt_randomness() is true for "normal" IRQs but not for per-CPU
> IRQs like these. So the call to add_interrupt_randomness() in mshv_isr() is
> needed on both x86 and ARM64.
> 
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +static int mshv_sint_irq = -1; /* Linux IRQ for mshv_sint_vector */
> > +#endif
> 
> Documentation/process/coding-style.rst says the following in Section 21:
> 
> If you have a function or variable which may potentially go unused in a
> particular configuration, and the compiler would warn about its definition
> going unused, mark the definition as __maybe_unused rather than wrapping it in
> a preprocessor conditional.
> 
> You could tag mshv_sint_irq with "__maybe_unused" and avoid the #ifndef. But
> see further comments below.
> 
> > 
> >  static u32 synic_event_ring_get_queued_port(u32 sint_index)
> >  {
> > @@ -456,9 +463,7 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> >  	union hv_synic_simp simp;
> >  	union hv_synic_siefp siefp;
> >  	union hv_synic_sirbp sirbp;
> > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> >  	union hv_synic_sint sint;
> > -#endif
> >  	union hv_synic_scontrol sctrl;
> >  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
> >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > @@ -501,10 +506,13 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> > 
> >  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> > 
> > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +	enable_percpu_irq(mshv_sint_irq, 0);
> > +#endif
> > +
> 
> Using IS_ENABLED() would be better than the #ifndef. (See Section 21
> of coding-style.rst about this as well.) You would need to drop the #ifndef
> around mshv_sint_irq, which is fine.
> 
> 	if (!IS_ENABLED(HYPERVISOR_CALLBACK_VECTOR))
> 		enable_percpu_irq(mshv_sint_irq, 0);
> 
> That said, I prefer the approach in v1 of your series where basically
> the code says "if we have a sint irq, enable it". This links the enablement
> most closely to what it directly depends on.
> 
> 	if (mshv_sint_irq != -1)
> 		enable_percpu_irq(mshv_sint_irq, 0);
> 
> But I realize the approach is somewhat a matter of personal preference so either
> way is acceptable.
> 
> >  	/* Enable intercepts */
> >  	sint.as_uint64 = 0;
> > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > +	sint.vector = mshv_sint_vector;
> >  	sint.masked = false;
> >  	sint.auto_eoi = hv_recommend_using_aeoi();
> >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > @@ -512,13 +520,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
> > 
> >  	/* Doorbell SINT */
> >  	sint.as_uint64 = 0;
> > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > +	sint.vector = mshv_sint_vector;
> >  	sint.masked = false;
> >  	sint.as_intercept = 1;
> >  	sint.auto_eoi = hv_recommend_using_aeoi();
> >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> >  			      sint.as_uint64);
> > -#endif
> > 
> >  	/* Enable global synic bit */
> >  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > @@ -573,6 +580,10 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
> >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> >  			      sint.as_uint64);
> > 
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +	disable_percpu_irq(mshv_sint_irq);
> > +#endif
> > +
> 
> Same here.
> 
> >  	/* Disable Synic's event ring page */
> >  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> >  	sirbp.sirbp_enabled = false;
> > @@ -683,14 +694,98 @@ static struct notifier_block mshv_synic_reboot_nb = {
> >  	.notifier_call = mshv_synic_reboot_notify,
> >  };
> > 
> > +#ifndef HYPERVISOR_CALLBACK_VECTOR
> > +#ifdef CONFIG_ACPI
> > +static long __percpu *mshv_evt;
> > +#endif
> 
> Same comment here about the coding-style.rst guidelines.
> 
> Furthermore, mshv_evt could be directly defined here as a per-cpu "long",
> rather than a pointer to a long. Then you don't need to do a runtime
> per-cpu allocation with all the attendant error checking and cleanup, which
> saves about 10 lines of code. So
> 
> static DEFINE_PER_CPU(long, mshv_evt);
> 
> drivers/clocksource/hyperv_timer.c does the definition for stimer0_evt this
> way. I looked through all kernel code and found several other places doing
> the direct definition. I don't remember why I didn't do the direct method for
> vmbus_evt, but I'm planning to submit a patch to change it, which will drop
> a few lines of code.
> 
> > +
> > +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> > +{
> > +	mshv_isr();
> > +	return IRQ_HANDLED;
> > +}
> 
> This function generates a warning about being unused when !CONFIG_ACPI.
> But see further comments below.
> 
> > +
> > +static int __init mshv_sint_vector_init(void)
> > +{
> > +#ifdef CONFIG_ACPI
> > +	int ret;
> > +	struct hv_register_assoc reg = {
> > +		.name = HV_ARM64_REGISTER_SINT_RESERVED_INTERRUPT_ID,
> > +	};
> > +	union hv_input_vtl input_vtl = { 0 };
> > +
> > +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> > +				1, input_vtl, &reg);
> > +	if (ret || !reg.value.reg64)
> > +		return -ENODEV;
> > +
> > +	mshv_sint_vector = reg.value.reg64;
> > +	ret  = acpi_register_gsi(NULL, mshv_sint_vector, ACPI_EDGE_SENSITIVE,
> > +					ACPI_ACTIVE_HIGH);
> > +	if (ret < 0)
> > +		goto out_fail;
> > +
> > +	mshv_sint_irq = ret;
> > +
> > +	mshv_evt = alloc_percpu(long);
> > +	if (!mshv_evt) {
> > +		ret = -ENOMEM;
> > +		goto out_unregister;
> > +	}
> > +
> > +	ret = request_percpu_irq(mshv_sint_irq, mshv_percpu_isr, "MSHV",
> > +		mshv_evt);
> > +	if (ret)
> > +		goto free_evt;
> > +
> > +	return 0;
> > +
> > +free_evt:
> > +	free_percpu(mshv_evt);
> > +out_unregister:
> > +	acpi_unregister_gsi(mshv_sint_vector);
> > +out_fail:
> > +	return ret;
> > +#else
> > +	return -ENODEV;
> > +#endif
> > +}
> 
> I have several thoughts about the #ifdef CONFIG_ACPI.
> 
> The coding-style.rst guidelines in Section 21 also say:
> 
> Prefer to compile out entire functions, rather than portions of functions or
> portions of expressions.  Rather than putting an ifdef in an expression, factor
> out part or all of the expression into a separate helper function and apply the
> conditional to that function.
> 
> But more fundamentally, it looks like the #ifdef CONFIG_ACPI is there
> solely because acpi_register_gsi() exists only when CONFIG_ACPI is set.
> The rest of the code doesn't depend on ACPI. In the !CONFIG_ACPI case,
> your stub code returns -ENODEV, so doorbell & intercept SINTs just don't
> work, and pretty much everything is non-functional.
> 
> This patch doesn't allude to any future DeviceTree case that parallels ACPI,
> so I'm unsure what's expected in the future.  If such a future DT case is
> murky, perhaps drivers/hv/Kconfig should give MSHV_ROOT a dependency
> on ACPI. Then the #ifdef CONFIG_ACPI could be dropped, along with the
> #else stub code. When/if the DT use case comes along, the dependency
> can be removed and the code structured to handle both ACPI and DT.
> The code to fetch the INTID via the hypervisor synthetic register, and the
> request_percpu_irq() would be applicable to both. It's only the GSI
> registration that would be different, and that could be pulled out into a
> helper function that handles the difference in ACPI and DT. I haven't looked
> to see how DT does the equivalent of GSI registration.

The DT case will materialize in the future. Making MSHV_ROOT depend on
ACPI seems a bit drastic to me when all we want to do is follow the
coding style guideline that says "prefer to compile out entire
functions...".

> 
> Another approach would be to add stubs for acpi_register_gsi() and
> acpi_unregister_gsi() in include/linux/acpi.h.  A number of such stubs
> have been added over the years. Saurabh got one added in 2023
> (commit 1f6277bf716cc). Then the above code would compile even
> with !CONFIG_ACPI.  acpi_register_gsi() would fail, and you would get
> an error return. This approach produces cleaner code and is consistent
> with similar use cases that depend on stubs provided by include/linux/acpi.h
> rather than #ifdefs.

I'll send out a v5 which takes a simpler approach to conform to the
coding guidelines. I'll also address all the other comments from above.

Thanks,
Anirudh.
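
For illustration only, the stub idea discussed in this thread (give the
!CONFIG_ACPI build a function with the same signature that simply fails,
so callers need no #ifdef) can be sketched in plain userspace C. This is
a minimal sketch under stated assumptions, not the code that went into
any posted version: DEMO_HAVE_ACPI stands in for CONFIG_ACPI,
demo_register_gsi() stands in for acpi_register_gsi(), and
demo_sint_irq_setup() and the "+ 100" mapping are hypothetical names and
values invented for the example.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-ins: DEMO_HAVE_ACPI plays the role of CONFIG_ACPI and
 * demo_register_gsi() plays the role of acpi_register_gsi(). Neither name
 * exists in the kernel.
 */
#ifdef DEMO_HAVE_ACPI
static int demo_register_gsi(int hwirq)
{
	/* Pretend the firmware layer maps hwirq to IRQ number hwirq + 100. */
	return hwirq + 100;
}
#else
/* Stub: same signature, always fails, so the caller needs no #ifdef. */
static inline int demo_register_gsi(int hwirq)
{
	(void)hwirq;
	return -ENODEV;
}
#endif

static int demo_sint_irq = -1;	/* -1 means "no IRQ registered" */

/* The caller compiles unchanged in both configurations. */
static int demo_sint_irq_setup(int hwirq)
{
	int ret;

	assert(hwirq > 0);	/* a zero INTID is treated as "not provided" */

	ret = demo_register_gsi(hwirq);
	if (ret < 0)
		return ret;	/* stub case: the error propagates upward */

	demo_sint_irq = ret;
	return 0;
}
```

Built without DEMO_HAVE_ACPI, demo_sint_irq_setup() returns -ENODEV and
leaves demo_sint_irq at -1, which mirrors the "registration fails and
you get an error return" behaviour described in the review.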



* Re: [PATCH net-next] net: ethtool: add COALESCE_RX_CQE_FRAMES/NSECS parameters
From: Andrew Lunn @ 2026-02-23 14:00 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, Jakub Kicinski, Donald Hunter,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Kory Maincent (Dent Project),
	Gal Pressman, Oleksij Rempel, Vadim Fedorenko, linux-kernel,
	linux-doc, haiyangz, paulros
In-Reply-To: <20260222212328.736628-1-haiyangz@linux.microsoft.com>

On Sun, Feb 22, 2026 at 01:23:17PM -0800, Haiyang Zhang wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Add two parameters for drivers supporting Rx CQE Coalescing.
> 
> ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
> Maximum number of frames that can be coalesced into a CQE.
> 
> ETHTOOL_A_COALESCE_RX_CQE_NSECS:
> Time out value in nanoseconds after the first packet arrival in a
> coalesced CQE to be sent.

A new API needs a user. A kAPI especially needs a user. Please add
support to at least one driver.

    Andrew

---
pw-bot: cr


* [PATCH v5 0/2] ARM64 support for doorbell and intercept SINTs
From: Anirudh Rayabharam @ 2026-02-23 14:01 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh

From: "Anirudh Rayabharam (Microsoft)" <anirudh@anirudhrb.com>

On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.

On arm64, the hypervisor exposes a synthetic register that can be read
to find the INTID that should be used for SINTs. This INTID is in the
PPI range.
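
As a rough illustration of the two cases described above, the following
userspace sketch selects the SINT vector either from a build-time
constant (the x86-like case) or from a value discovered at run time (the
arm64-like case). It is a sketch under stated assumptions, not the
driver code: DEMO_CALLBACK_VECTOR, demo_read_sint_intid(),
demo_sint_vector_init() and the INTID value 18 are hypothetical; only
the 16..31 PPI range is taken from the GIC architecture.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

static int demo_sint_vector = -1;	/* vector/INTID to program into the SINTs */

#ifdef DEMO_CALLBACK_VECTOR
/* x86-like case: the vector is a build-time constant. */
static int demo_sint_vector_init(void)
{
	demo_sint_vector = DEMO_CALLBACK_VECTOR;
	return 0;
}
#else
/* Hypothetical register read; a real system would ask the hypervisor. */
static uint64_t demo_read_sint_intid(void)
{
	return 18;	/* made-up INTID inside the PPI range (16..31) */
}

/* arm64-like case: the INTID is discovered at run time. */
static int demo_sint_vector_init(void)
{
	uint64_t intid = demo_read_sint_intid();

	if (!intid)
		return -ENODEV;	/* no INTID reported */

	demo_sint_vector = (int)intid;
	assert(demo_sint_vector >= 16 && demo_sint_vector <= 31);
	return 0;
}
#endif
```

Either way the rest of the code only consults demo_sint_vector, which is
the point of unifying the two paths behind one init function.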

Changes in v5:
  - Better align with coding-style.rst guidelines.

Changes in v4:
  - Hypervisor now exposes a synthetic register to read the SINT vector
    instead of using an ACPI platform device. So make changes to accommodate that.

Changes in v3:
  - Moved the hv_root_partition() check into the reboot notifier
    to avoid doing it multiple times.

v2: https://lore.kernel.org/linux-hyperv/20260202182706.648192-1-anirudh@anirudhrb.com/
Changes in v2:
Addressed review comments:
  - Moved more stuff into mshv_synic.c
  - Code simplifications
  - Removed unnecessary debug prints

v1: https://lore.kernel.org/linux-hyperv/20260128160437.3342167-1-anirudh@anirudhrb.com/

Anirudh Rayabharam (Microsoft) (2):
  mshv: refactor synic init and cleanup
  mshv: add arm64 support for doorbell & intercept SINTs

 drivers/hv/mshv_root.h      |   5 +-
 drivers/hv/mshv_root_main.c |  59 ++---------
 drivers/hv/mshv_synic.c     | 189 +++++++++++++++++++++++++++++++++---
 include/hyperv/hvgdk_mini.h |   2 +
 4 files changed, 186 insertions(+), 69 deletions(-)

-- 
2.34.1



* [PATCH v5 1/2] mshv: refactor synic init and cleanup
From: Anirudh Rayabharam @ 2026-02-23 14:01 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh
In-Reply-To: <20260223140159.1627229-1-anirudh@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

Rename mshv_synic_init() to mshv_synic_cpu_init() and
mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
these functions handle per-cpu synic setup and teardown.

Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
Move all the synic related setup from mshv_parent_partition_init.

Move the reboot notifier to mshv_synic.c because it currently only
operates on the synic cpuhp state.

Move out synic_pages from the global mshv_root since its use is now
completely local to mshv_synic.c.

This is in preparation for the next patch which will add more stuff to
mshv_synic_init().

No functional change.

Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_root.h      |  5 ++-
 drivers/hv/mshv_root_main.c | 59 +++++-------------------------
 drivers/hv/mshv_synic.c     | 71 +++++++++++++++++++++++++++++++++----
 3 files changed, 75 insertions(+), 60 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..26e0320c8097 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -183,7 +183,6 @@ struct hv_synic_pages {
 };
 
 struct mshv_root {
-	struct hv_synic_pages __percpu *synic_pages;
 	spinlock_t pt_ht_lock;
 	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
 	struct hv_partition_property_vmm_capabilities vmm_caps;
@@ -242,8 +241,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
 void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
 
 void mshv_isr(void);
-int mshv_synic_init(unsigned int cpu);
-int mshv_synic_cleanup(unsigned int cpu);
+int mshv_synic_init(struct device *dev);
+void mshv_synic_cleanup(void);
 
 static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
 {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 681b58154d5e..7c1666456e78 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2035,7 +2035,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int mshv_cpuhp_online;
 static int mshv_root_sched_online;
 
 static const char *scheduler_type_to_string(enum hv_scheduler_type type)
@@ -2198,40 +2197,14 @@ root_scheduler_deinit(void)
 	free_percpu(root_scheduler_output);
 }
 
-static int mshv_reboot_notify(struct notifier_block *nb,
-			      unsigned long code, void *unused)
-{
-	cpuhp_remove_state(mshv_cpuhp_online);
-	return 0;
-}
-
-struct notifier_block mshv_reboot_nb = {
-	.notifier_call = mshv_reboot_notify,
-};
-
 static void mshv_root_partition_exit(void)
 {
-	unregister_reboot_notifier(&mshv_reboot_nb);
 	root_scheduler_deinit();
 }
 
 static int __init mshv_root_partition_init(struct device *dev)
 {
-	int err;
-
-	err = root_scheduler_init(dev);
-	if (err)
-		return err;
-
-	err = register_reboot_notifier(&mshv_reboot_nb);
-	if (err)
-		goto root_sched_deinit;
-
-	return 0;
-
-root_sched_deinit:
-	root_scheduler_deinit();
-	return err;
+	return root_scheduler_init(dev);
 }
 
 static void mshv_init_vmm_caps(struct device *dev)
@@ -2276,31 +2249,18 @@ static int __init mshv_parent_partition_init(void)
 			MSHV_HV_MAX_VERSION);
 	}
 
-	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
-	if (!mshv_root.synic_pages) {
-		dev_err(dev, "Failed to allocate percpu synic page\n");
-		ret = -ENOMEM;
+	ret = mshv_synic_init(dev);
+	if (ret)
 		goto device_deregister;
-	}
-
-	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
-				mshv_synic_init,
-				mshv_synic_cleanup);
-	if (ret < 0) {
-		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
-		goto free_synic_pages;
-	}
-
-	mshv_cpuhp_online = ret;
 
 	ret = mshv_retrieve_scheduler_type(dev);
 	if (ret)
-		goto remove_cpu_state;
+		goto synic_cleanup;
 
 	if (hv_root_partition())
 		ret = mshv_root_partition_init(dev);
 	if (ret)
-		goto remove_cpu_state;
+		goto synic_cleanup;
 
 	mshv_init_vmm_caps(dev);
 
@@ -2318,10 +2278,8 @@ static int __init mshv_parent_partition_init(void)
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
-remove_cpu_state:
-	cpuhp_remove_state(mshv_cpuhp_online);
-free_synic_pages:
-	free_percpu(mshv_root.synic_pages);
+synic_cleanup:
+	mshv_synic_cleanup();
 device_deregister:
 	misc_deregister(&mshv_dev);
 	return ret;
@@ -2335,8 +2293,7 @@ static void __exit mshv_parent_partition_exit(void)
 	mshv_irqfd_wq_cleanup();
 	if (hv_root_partition())
 		mshv_root_partition_exit();
-	cpuhp_remove_state(mshv_cpuhp_online);
-	free_percpu(mshv_root.synic_pages);
+	mshv_synic_cleanup();
 }
 
 module_init(mshv_parent_partition_init);
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f8b0337cdc82..074e37c48876 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -12,11 +12,16 @@
 #include <linux/mm.h>
 #include <linux/io.h>
 #include <linux/random.h>
+#include <linux/cpuhotplug.h>
+#include <linux/reboot.h>
 #include <asm/mshyperv.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
 
+static int synic_cpuhp_online;
+static struct hv_synic_pages __percpu *synic_pages;
+
 static u32 synic_event_ring_get_queued_port(u32 sint_index)
 {
 	struct hv_synic_event_ring_page **event_ring_page;
@@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
 	u32 message;
 	u8 tail;
 
-	spages = this_cpu_ptr(mshv_root.synic_pages);
+	spages = this_cpu_ptr(synic_pages);
 	event_ring_page = &spages->synic_event_ring_page;
 	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
 
@@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
 
 void mshv_isr(void)
 {
-	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_message *msg;
 	bool handled;
@@ -446,7 +451,7 @@ void mshv_isr(void)
 	}
 }
 
-int mshv_synic_init(unsigned int cpu)
+static int mshv_synic_cpu_init(unsigned int cpu)
 {
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
@@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
 	union hv_synic_sint sint;
 #endif
 	union hv_synic_scontrol sctrl;
-	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 			&spages->synic_event_flags_page;
@@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
 	return -EFAULT;
 }
 
-int mshv_synic_cleanup(unsigned int cpu)
+static int mshv_synic_cpu_exit(unsigned int cpu)
 {
 	union hv_synic_sint sint;
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
 	union hv_synic_scontrol sctrl;
-	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 		&spages->synic_event_flags_page;
@@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
 
 	mshv_portid_free(doorbell_portid);
 }
+
+static int mshv_synic_reboot_notify(struct notifier_block *nb,
+			      unsigned long code, void *unused)
+{
+	if (!hv_root_partition())
+		return 0;
+
+	cpuhp_remove_state(synic_cpuhp_online);
+	return 0;
+}
+
+static struct notifier_block mshv_synic_reboot_nb = {
+	.notifier_call = mshv_synic_reboot_notify,
+};
+
+int __init mshv_synic_init(struct device *dev)
+{
+	int ret = 0;
+
+	synic_pages = alloc_percpu(struct hv_synic_pages);
+	if (!synic_pages) {
+		dev_err(dev, "Failed to allocate percpu synic page\n");
+		return -ENOMEM;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
+				mshv_synic_cpu_init,
+				mshv_synic_cpu_exit);
+	if (ret < 0) {
+		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
+		goto free_synic_pages;
+	}
+
+	synic_cpuhp_online = ret;
+
+	ret = register_reboot_notifier(&mshv_synic_reboot_nb);
+	if (ret)
+		goto remove_cpuhp_state;
+
+	return 0;
+
+remove_cpuhp_state:
+	cpuhp_remove_state(synic_cpuhp_online);
+free_synic_pages:
+	free_percpu(synic_pages);
+	return ret;
+}
+
+void mshv_synic_cleanup(void)
+{
+	unregister_reboot_notifier(&mshv_synic_reboot_nb);
+	cpuhp_remove_state(synic_cpuhp_online);
+	free_percpu(synic_pages);
+}
-- 
2.34.1



* [PATCH v5 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-02-23 14:01 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh
In-Reply-To: <20260223140159.1627229-1-anirudh@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.

On arm64, the hypervisor exposes a synthetic register that can be read
to find the INTID that should be used for SINTs. This INTID is in the
PPI range.

To better unify the code paths, introduce mshv_sint_vector_init() that
either reads the synthetic register and obtains the INTID (arm64) or
just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).

Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_synic.c     | 120 +++++++++++++++++++++++++++++++++---
 include/hyperv/hvgdk_mini.h |   2 +
 2 files changed, 112 insertions(+), 10 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 074e37c48876..75ef2160b3e0 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -10,17 +10,22 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
+#include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/random.h>
 #include <linux/cpuhotplug.h>
 #include <linux/reboot.h>
 #include <asm/mshyperv.h>
+#include <linux/platform_device.h>
+#include <linux/acpi.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
 
 static int synic_cpuhp_online;
 static struct hv_synic_pages __percpu *synic_pages;
+static int mshv_sint_vector = -1; /* hwirq for the SynIC SINTs */
+static int mshv_sint_irq = -1; /* Linux IRQ for mshv_sint_vector */
 
 static u32 synic_event_ring_get_queued_port(u32 sint_index)
 {
@@ -442,9 +447,7 @@ void mshv_isr(void)
 		if (msg->header.message_flags.msg_pending)
 			hv_set_non_nested_msr(HV_MSR_EOM, 0);
 
-#ifdef HYPERVISOR_CALLBACK_VECTOR
-		add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
-#endif
+		add_interrupt_randomness(mshv_sint_vector);
 	} else {
 		pr_warn_once("%s: unknown message type 0x%x\n", __func__,
 			     msg->header.message_type);
@@ -456,9 +459,7 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
-#ifdef HYPERVISOR_CALLBACK_VECTOR
 	union hv_synic_sint sint;
-#endif
 	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
@@ -501,10 +502,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
-#ifdef HYPERVISOR_CALLBACK_VECTOR
+	if (mshv_sint_irq != -1)
+		enable_percpu_irq(mshv_sint_irq, 0);
+
 	/* Enable intercepts */
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = mshv_sint_vector;
 	sint.masked = false;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
@@ -512,13 +515,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 
 	/* Doorbell SINT */
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = mshv_sint_vector;
 	sint.masked = false;
 	sint.as_intercept = 1;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
-#endif
 
 	/* Enable global synic bit */
 	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
@@ -573,6 +575,9 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
 
+	if (mshv_sint_irq != -1)
+		disable_percpu_irq(mshv_sint_irq);
+
 	/* Disable Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
@@ -683,14 +688,106 @@ static struct notifier_block mshv_synic_reboot_nb = {
 	.notifier_call = mshv_synic_reboot_notify,
 };
 
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+static DEFINE_PER_CPU(long, mshv_evt);
+
+static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
+{
+	mshv_isr();
+	return IRQ_HANDLED;
+}
+
+#ifdef CONFIG_ACPI
+static int __init mshv_acpi_setup_sint_irq(void)
+{
+	return acpi_register_gsi(NULL, mshv_sint_vector, ACPI_EDGE_SENSITIVE,
+					ACPI_ACTIVE_HIGH);
+}
+
+static void mshv_acpi_cleanup_sint_irq(void)
+{
+	acpi_unregister_gsi(mshv_sint_vector);
+}
+#else
+static int __init mshv_acpi_setup_sint_irq(void)
+{
+	return -ENODEV;
+}
+
+static void mshv_acpi_cleanup_sint_irq(void)
+{
+}
+#endif
+
+static int __init mshv_sint_vector_init(void)
+{
+	int ret;
+	struct hv_register_assoc reg = {
+		.name = HV_ARM64_REGISTER_SINT_RESERVED_INTERRUPT_ID,
+	};
+	union hv_input_vtl input_vtl = { 0 };
+
+	if (acpi_disabled)
+		return -ENODEV;
+
+	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
+				1, input_vtl, &reg);
+	if (ret || !reg.value.reg64)
+		return -ENODEV;
+
+	mshv_sint_vector = reg.value.reg64;
+	ret = mshv_acpi_setup_sint_irq();
+	if (ret <= 0) {
+		pr_err("Failed to setup IRQ for MSHV SINT vector %d: %d\n",
+			mshv_sint_vector, ret);
+		goto out_fail;
+	}
+
+	mshv_sint_irq = ret;
+
+	ret = request_percpu_irq(mshv_sint_irq, mshv_percpu_isr, "MSHV",
+		&mshv_evt);
+	if (ret)
+		goto out_unregister;
+
+	return 0;
+
+out_unregister:
+	mshv_acpi_cleanup_sint_irq();
+out_fail:
+	return ret;
+}
+
+static void mshv_sint_vector_cleanup(void)
+{
+	free_percpu_irq(mshv_sint_irq, &mshv_evt);
+	mshv_acpi_cleanup_sint_irq();
+}
+#else /* !HYPERVISOR_CALLBACK_VECTOR */
+static int __init mshv_sint_vector_init(void)
+{
+	mshv_sint_vector = HYPERVISOR_CALLBACK_VECTOR;
+	return 0;
+}
+
+static void mshv_sint_vector_cleanup(void)
+{
+}
+#endif /* HYPERVISOR_CALLBACK_VECTOR */
+
 int __init mshv_synic_init(struct device *dev)
 {
 	int ret = 0;
 
+	ret = mshv_sint_vector_init();
+	if (ret)
+		return ret;
+
 	synic_pages = alloc_percpu(struct hv_synic_pages);
 	if (!synic_pages) {
 		dev_err(dev, "Failed to allocate percpu synic page\n");
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto sint_vector_cleanup;
 	}
 
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
@@ -713,6 +810,8 @@ int __init mshv_synic_init(struct device *dev)
 	cpuhp_remove_state(synic_cpuhp_online);
 free_synic_pages:
 	free_percpu(synic_pages);
+sint_vector_cleanup:
+	mshv_sint_vector_cleanup();
 	return ret;
 }
 
@@ -721,4 +820,5 @@ void mshv_synic_cleanup(void)
 	unregister_reboot_notifier(&mshv_synic_reboot_nb);
 	cpuhp_remove_state(synic_cpuhp_online);
 	free_percpu(synic_pages);
+	mshv_sint_vector_cleanup();
 }
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 30fbbde81c5c..7676f78e0766 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -1117,6 +1117,8 @@ enum hv_register_name {
 	HV_X64_REGISTER_MSR_MTRR_FIX4KF8000	= 0x0008007A,
 
 	HV_X64_REGISTER_REG_PAGE	= 0x0009001C,
+#elif defined(CONFIG_ARM64)
+	HV_ARM64_REGISTER_SINT_RESERVED_INTERRUPT_ID	= 0x00070001,
 #endif
 };
 
-- 
2.34.1



* [PATCH 6.18, 6.12] Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-02-23 15:34 UTC (permalink / raw)
  To: stable@vger.kernel.org, Greg Kroah-Hartman, Sasha Levin
  Cc: Florian Bezdeka, Michael Kelley, Wei Liu, Sasha Levin, kys,
	haiyangz, decui, longli, linux-hyperv, linux-kernel

From: Jan Kiszka <jan.kiszka@siemens.com>

[ Upstream commit f8e6343b7a89c7c649db5a9e309ba7aa20401813 ]

Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V
with related guest support enabled:

[    1.127941] hv_vmbus: registering driver hyperv_drm

[    1.132518] =============================
[    1.132519] [ BUG: Invalid wait context ]
[    1.132521] 6.19.0-rc8+ #9 Not tainted
[    1.132524] -----------------------------
[    1.132525] swapper/0/0 is trying to lock:
[    1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0
[    1.132543] other info that might help us debug this:
[    1.132544] context-{2:2}
[    1.132545] 1 lock held by swapper/0/0:
[    1.132547]  #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0
[    1.132557] stack backtrace:
[    1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)}
[    1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
[    1.132567] Call Trace:
[    1.132570]  <IRQ>
[    1.132573]  dump_stack_lvl+0x6e/0xa0
[    1.132581]  __lock_acquire+0xee0/0x21b0
[    1.132592]  lock_acquire+0xd5/0x2d0
[    1.132598]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132606]  ? lock_acquire+0xd5/0x2d0
[    1.132613]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132619]  rt_spin_lock+0x3f/0x1f0
[    1.132623]  ? vmbus_chan_sched+0xc4/0x2b0
[    1.132629]  ? vmbus_chan_sched+0x31/0x2b0
[    1.132634]  vmbus_chan_sched+0xc4/0x2b0
[    1.132641]  vmbus_isr+0x2c/0x150
[    1.132648]  __sysvec_hyperv_callback+0x5f/0xa0
[    1.132654]  sysvec_hyperv_callback+0x88/0xb0
[    1.132658]  </IRQ>
[    1.132659]  <TASK>
[    1.132660]  asm_sysvec_hyperv_callback+0x1a/0x20

Code paths that handle vmbus IRQs take sleeping locks under PREEMPT_RT,
so the vmbus_isr execution needs to be moved into thread context.
Open-coding this lets us skip the IPI that irq_work would additionally
bring and which we do not need, since vmbus_isr is always called from an
IRQ, never from an NMI.

This affects both x86 and arm64, therefore hook into the common driver
logic.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Tested-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 drivers/hv/vmbus_drv.c | 66 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)
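
Reader's note (not part of the upstream commit): the deferral described
in the commit message can be illustrated with a small userspace model.
This is only a sketch under stated assumptions. The struct and the
model_* names are hypothetical, a single CPU is modelled, and the
scheduler wakeup is reduced to a counter. In the patch itself the
corresponding pieces are vmbus_irqd_wake(), vmbus_irqd_should_run() and
run_vmbus_irqd(), driven by the smpboot per-CPU thread infrastructure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-CPU model; the patch keeps this state per CPU. */
struct irqd_model {
	bool pending;         /* plays the role of vmbus_irq_pending */
	unsigned int wakeups; /* wake requests issued from the IRQ side */
	unsigned int handled; /* times the deferred handler body ran */
};

/* IRQ side: only flag the work and request a wakeup, no locks taken. */
static void model_irq(struct irqd_model *m)
{
	assert(m != NULL);
	m->pending = true;
	m->wakeups++;
}

/* Counterpart of the thread_should_run callback. */
static bool model_should_run(const struct irqd_model *m)
{
	return m->pending;
}

/* Thread side: clear the flag first, then do the real work. */
static void model_thread_fn(struct irqd_model *m)
{
	m->pending = false;
	m->handled++; /* stands in for the call to __vmbus_isr() */
}

/* One pass of the thread loop; returns true if the handler ran. */
static bool model_thread_iteration(struct irqd_model *m)
{
	if (!model_should_run(m))
		return false;
	model_thread_fn(m);
	return true;
}
```

In this model, two calls to model_irq() before the thread gets to run
lead to a single handler pass, because the pending state is a flag
rather than a counter. Clearing the flag before running the handler
means an interrupt that arrives during the handler sets it again and
triggers another pass.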

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 69591dc7bad2..3ab62277b6be 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -25,6 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/task_stack.h>
+#include <linux/smpboot.h>
 
 #include <linux/delay.h>
 #include <linux/panic_notifier.h>
@@ -1306,7 +1307,7 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
 	}
 }
 
-static void vmbus_isr(void)
+static void __vmbus_isr(void)
 {
 	struct hv_per_cpu_context *hv_cpu
 		= this_cpu_ptr(hv_context.cpu_context);
@@ -1330,6 +1331,53 @@ static void vmbus_isr(void)
 	add_interrupt_randomness(vmbus_interrupt);
 }
 
+static DEFINE_PER_CPU(bool, vmbus_irq_pending);
+static DEFINE_PER_CPU(struct task_struct *, vmbus_irqd);
+
+static void vmbus_irqd_wake(void)
+{
+	struct task_struct *tsk = __this_cpu_read(vmbus_irqd);
+
+	__this_cpu_write(vmbus_irq_pending, true);
+	wake_up_process(tsk);
+}
+
+static void vmbus_irqd_setup(unsigned int cpu)
+{
+	sched_set_fifo(current);
+}
+
+static int vmbus_irqd_should_run(unsigned int cpu)
+{
+	return __this_cpu_read(vmbus_irq_pending);
+}
+
+static void run_vmbus_irqd(unsigned int cpu)
+{
+	__this_cpu_write(vmbus_irq_pending, false);
+	__vmbus_isr();
+}
+
+static bool vmbus_irq_initialized;
+
+static struct smp_hotplug_thread vmbus_irq_threads = {
+	.store                  = &vmbus_irqd,
+	.setup			= vmbus_irqd_setup,
+	.thread_should_run      = vmbus_irqd_should_run,
+	.thread_fn              = run_vmbus_irqd,
+	.thread_comm            = "vmbus_irq/%u",
+};
+
+static void vmbus_isr(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		vmbus_irqd_wake();
+	} else {
+		lockdep_hardirq_threaded();
+		__vmbus_isr();
+	}
+}
+
 static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
 {
 	vmbus_isr();
@@ -1375,6 +1423,13 @@ static int vmbus_bus_init(void)
 	 * the VMbus interrupt handler.
 	 */
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
+		ret = smpboot_register_percpu_thread(&vmbus_irq_threads);
+		if (ret)
+			goto err_kthread;
+		vmbus_irq_initialized = true;
+	}
+
 	if (vmbus_irq == -1) {
 		hv_setup_vmbus_handler(vmbus_isr);
 	} else {
@@ -1449,6 +1504,11 @@ static int vmbus_bus_init(void)
 		free_percpu(vmbus_evt);
 	}
 err_setup:
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
+		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+		vmbus_irq_initialized = false;
+	}
+err_kthread:
 	bus_unregister(&hv_bus);
 	return ret;
 }
@@ -2914,6 +2974,10 @@ static void __exit vmbus_exit(void)
 		free_percpu_irq(vmbus_irq, vmbus_evt);
 		free_percpu(vmbus_evt);
 	}
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
+		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
+		vmbus_irq_initialized = false;
+	}
 	for_each_online_cpu(cpu) {
 		struct hv_per_cpu_context *hv_cpu
 			= per_cpu_ptr(hv_context.cpu_context, cpu);
-- 
2.47.3


* RE: [EXTERNAL] Re: [PATCH net-next] net: ethtool: add COALESCE_RX_CQE_FRAMES/NSECS parameters
From: Haiyang Zhang @ 2026-02-23 16:07 UTC (permalink / raw)
  To: Kory Maincent, Haiyang Zhang
  Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, Andrew Lunn,
	Jakub Kicinski, Donald Hunter, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Gal Pressman, Oleksij Rempel, Vadim Fedorenko,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Paul Rosswurm
In-Reply-To: <20260223102534.0a87ed4c@kmaincent-XPS-13-7390>



> -----Original Message-----
> From: Kory Maincent <kory.maincent@bootlin.com>
> Sent: Monday, February 23, 2026 4:26 AM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; Andrew Lunn
> <andrew@lunn.ch>; Jakub Kicinski <kuba@kernel.org>; Donald Hunter
> <donald.hunter@gmail.com>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Simon
> Horman <horms@kernel.org>; Jonathan Corbet <corbet@lwn.net>; Shuah Khan
> <skhan@linuxfoundation.org>; Gal Pressman <gal@nvidia.com>; Oleksij Rempel
> <o.rempel@pengutronix.de>; Vadim Fedorenko <vadim.fedorenko@linux.dev>;
> linux-kernel@vger.kernel.org; linux-doc@vger.kernel.org; Haiyang Zhang
> <haiyangz@microsoft.com>; Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next] net: ethtool: add
> COALESCE_RX_CQE_FRAMES/NSECS parameters
> 
> On Sun, 22 Feb 2026 13:23:17 -0800
> Haiyang Zhang <haiyangz@linux.microsoft.com> wrote:
> 
> > From: Haiyang Zhang <haiyangz@microsoft.com>
> >
> > Add two parameters for drivers supporting Rx CQE Coalescing.
> >
> > ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
> > Maximum number of frames that can be coalesced into a CQE.
> >
> > ETHTOOL_A_COALESCE_RX_CQE_NSECS:
> > Time out value in nanoseconds after the first packet arrival in a
> > coalesced CQE to be sent.
> >
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> 
> You sent this patch one day before the official reopening of net-next.
> Not sure if this will be taken into account by patchwork.
> Otherwise:
> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>

Thanks for the review! I sent it a day earlier because of the winter
storm :)

- Haiyang



* RE: [EXTERNAL] Re: [PATCH net-next] net: ethtool: add COALESCE_RX_CQE_FRAMES/NSECS parameters
From: Haiyang Zhang @ 2026-02-23 16:11 UTC (permalink / raw)
  To: Andrew Lunn, Haiyang Zhang
  Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	Jakub Kicinski, Donald Hunter, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Kory Maincent (Dent Project), Gal Pressman, Oleksij Rempel,
	Vadim Fedorenko, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, Paul Rosswurm
In-Reply-To: <6bf21536-569b-49b4-9541-c22a152570fd@lunn.ch>



> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Monday, February 23, 2026 9:01 AM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; Jakub Kicinski
> <kuba@kernel.org>; Donald Hunter <donald.hunter@gmail.com>; David S.
> Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Paolo
> Abeni <pabeni@redhat.com>; Simon Horman <horms@kernel.org>; Jonathan
> Corbet <corbet@lwn.net>; Shuah Khan <skhan@linuxfoundation.org>; Kory
> Maincent (Dent Project) <kory.maincent@bootlin.com>; Gal Pressman
> <gal@nvidia.com>; Oleksij Rempel <o.rempel@pengutronix.de>; Vadim
> Fedorenko <vadim.fedorenko@linux.dev>; linux-kernel@vger.kernel.org;
> linux-doc@vger.kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Paul
> Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next] net: ethtool: add
> COALESCE_RX_CQE_FRAMES/NSECS parameters
> 
> On Sun, Feb 22, 2026 at 01:23:17PM -0800, Haiyang Zhang wrote:
> > From: Haiyang Zhang <haiyangz@microsoft.com>
> >
> > Add two parameters for drivers supporting Rx CQE Coalescing.
> >
> > ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
> > Maximum number of frames that can be coalesced into a CQE.
> >
> > ETHTOOL_A_COALESCE_RX_CQE_NSECS:
> > Time out value in nanoseconds after the first packet arrival in a
> > coalesced CQE to be sent.
> 
> A new API needs a user. A kAPI especially needs a user. Please add
> support to at least one driver.

Sure, next time I will include MANA driver patches using this kAPI
in the same series. The MANA HW/FW API is still being worked on by
other teams.

Thanks,
- Haiyang
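
Reader's note (not part of the thread): one plausible reading of the two
proposed parameters is that a coalesced CQE is sent either when it holds
the maximum number of frames or when the timeout since its first frame
has elapsed. The sketch below models that reading in software. It is an
assumption for illustration only: the struct and the cqe_* function
names are hypothetical, and it does not describe the MANA hardware,
firmware, or driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for one receive queue; names are illustrative. */
struct cqe_coalesce {
	uint32_t max_frames; /* value of ETHTOOL_A_COALESCE_RX_CQE_FRAMES */
	uint64_t timeout_ns; /* value of ETHTOOL_A_COALESCE_RX_CQE_NSECS */
	uint32_t frames;     /* frames held in the currently open CQE */
	uint64_t first_ns;   /* arrival time of the first held frame */
};

/*
 * Record one received frame. Returns true when the CQE should be sent
 * because the frame limit has been reached.
 */
static bool cqe_add_frame(struct cqe_coalesce *c, uint64_t now_ns)
{
	assert(c->max_frames > 0);
	if (c->frames == 0)
		c->first_ns = now_ns; /* the timeout counts from here */
	c->frames++;
	if (c->frames >= c->max_frames) {
		c->frames = 0;
		return true;
	}
	return false;
}

/*
 * Timer check. Returns true when a partially filled CQE has waited at
 * least timeout_ns since its first frame arrived.
 */
static bool cqe_timeout_expired(struct cqe_coalesce *c, uint64_t now_ns)
{
	if (c->frames == 0)
		return false; /* nothing is waiting */
	if (now_ns - c->first_ns >= c->timeout_ns) {
		c->frames = 0;
		return true;
	}
	return false;
}
```

With max_frames = 3 and timeout_ns = 1000, a first frame at t = 100
followed by a second at t = 600 would be flushed by the timer at
t = 1100, while three frames arriving back to back would be flushed by
the frame limit without waiting for the timer.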

