Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [EXTERNAL] Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Leon Romanovsky @ 2026-03-18 14:49 UTC (permalink / raw)
  To: Long Li
  Cc: Jason Gunthorpe, Konstantin Taranov, Jakub Kicinski,
	David S . Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Haiyang Zhang, KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66832D0A369DE7E411ACCDEDCE41A@SA1PR21MB6683.namprd21.prod.outlook.com>

On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:
> > 
> > On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > > When the MANA hardware undergoes a service reset, the ETH
> > > > > auxiliary device
> > > > > (mana.eth) used by DPDK persists across the reset cycle — it is
> > > > > not removed and re-added like RC/UD/GSI QPs. This means userspace
> > > > > RDMA consumers such as DPDK have no way of knowing that firmware
> > > > > handles for their PD, CQ, WQ, QP and MR resources have become stale.
> > > >
> > > > NAK to any of this.
> > > >
> > > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > > and recreated later.
> > >
> > > Yeah, that is our general model for any serious RAS event where the
> > > driver's view of resources becomes out of sync with the HW.
> > >
> > > You have tear down the ib_device by removing the aux and then bring
> > > back a new one.
> > >
> > > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > > tell userspace to close and re-open their uverbs FD.
> > >
> > > We don't have a model where a uverbs FD in userspace can continue to
> > > work after the device has a catasrophic RAS event.
> > >
> > > There may be room to have a model where the ib device doesn't fully
> > > unplug/replug so it retains its name and things, but that is core code
> > > not driver stuff.
> > 
> > Good luck with that model. It is going to break RDMA-CM hotplug support.
> > 
> 
>    I think we can preserve RDMA-CM behavior without requiring ib_device
>    unregister/re-register.
> 
>    On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
>    new reset event) through ib_dispatch_event(). RDMA-CM already handles
>    device events — we would add a handler that iterates all rdma_cm_ids
>    on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
>    as cma_process_remove() does today. The difference: cma_device stays
>    alive, so applications can reconnect on the same device after recovery
>    instead of waiting for a new one to appear.
> 
>    The motivation for keeping ib_device alive is that some RDMA consumers
>    — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
>    manage QP state themselves.

RDMA-CM provides an "external QP" model where the QP is managed by the
rdma-cm user.

As Jason noted, you should propose the core changes together with the
corresponding librdmacm updates. The final result must ensure that legacy
applications continue to function correctly with the new kernel.

Thanks

^ permalink raw reply

* RE: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Michael Kelley @ 2026-03-18 14:38 UTC (permalink / raw)
  To: Wei Liu, Michael Kelley
  Cc: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260318062001.GA262287@liuwe-devbox-debian-v2.local>

From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 17, 2026 11:20 PM
> 
> On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > >
> > > The current error handling has two issues:
> > >
> > > First, pin_user_pages_fast() can return a short pin count (less than
> > > requested but greater than zero) when it cannot pin all requested pages.
> > > This is treated as success, leading to partially pinned regions being
> > > used, which causes memory corruption.
> > >
> > > Second, when an error occurs mid-loop, already pinned pages from the
> > > current batch are not released before calling mshv_region_evict_pages(),
> > > causing a page reference leak.
> >
> > There's now an online LLM-based tool that is automatically reviewing
> > kernel patches.  For this patch, the results are here:
> >
> >
> https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit
> %40skinsburskii-cloud-desktop.internal.cloudapp.net
> >
> > It has flagged the commit message as incorrectly referencing the
> > function mshv_region_evict_pages(), which doesn't exist.
> >
> > FWIW, the announcement about sashiko.dev is here:
> >
> > https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> >
> > Other than the commit message reference, this looks good to me.
> >
> > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> 
> The second point is written as if the code here should release the
> already pinned pages before calling mshv_region_invalidate_pages(), but
> the code actually relies on mshv_mem_region_invalidate_pages() to
> release the pages. The change here fixes the accounting.
> 
>  Second, when an error occurs mid-loop, already pinned pages from the
>  current batch are not accounted for before calling
>  mshv_region_invalidate_pages(), causing a page reference leak.
> 
> And queued up the patch to hyperv-fixes.

One other thing I noticed:  The "Subject" of the patch is wrong. It
mentions mshv_region_populate_pages(), but the function being
modified is actually mshv_region_pin().

Michael

> 
> Wei
> 
> >
> > >
> > > Fix by treating short pins as errors and explicitly unpinning the
> > > partial batch before cleanup.
> > >
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_regions.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > > index c28aac0726de..fdffd4f002f6 100644
> > > --- a/drivers/hv/mshv_regions.c
> > > +++ b/drivers/hv/mshv_regions.c
> > > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > >  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > >  					  FOLL_WRITE | FOLL_LONGTERM,
> > >  					  pages);
> > > -		if (ret < 0)
> > > +		if (ret != nr_pages)
> > >  			goto release_pages;
> > >  	}
> > >
> > >  	return 0;
> > >
> > >  release_pages:
> > > +	if (ret > 0)
> > > +		done_count += ret;
> > >  	mshv_region_invalidate_pages(region, 0, done_count);
> > > -	return ret;
> > > +	return ret < 0 ? ret : -ENOMEM;
> > >  }
> > >
> > >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > >
> > >
> >


^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18 12:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Michael Kelley, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86@kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Florian Bezdeka, RT, Mitchell Levy, Saurabh Singh Sengar,
	Naman Jain
In-Reply-To: <20260318112113.JDm06Hr-@linutronix.de>

On 18.03.26 12:21, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 12:03:03 [+0100], Jan Kiszka wrote:
>>> Either way, that add_interrupt_randomness() should be moved to
>>> sysvec_hyperv_callback() like it has been done for
>>> sysvec_hyperv_stimer0(). It should be invoked twice now if gets there
>>> via vmbus_percpu_isr().
>>
>> No, this would degrade arm64.
> 
> Okay. So why does this needs to be done for _this_ per-CPU IRQ on ARM64
> but not for the others? What does it make so special.
> 

See the other thread:
https://lore.kernel.org/lkml/SN6PR02MB41573332BF202DAE0AF79ED1D441A@SN6PR02MB4157.namprd02.prod.outlook.com/

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18 11:21 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael Kelley, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86@kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Florian Bezdeka, RT, Mitchell Levy, Saurabh Singh Sengar,
	Naman Jain
In-Reply-To: <7f248f1f-a4ad-442d-bd85-23e57e58eeba@siemens.com>

On 2026-03-18 12:03:03 [+0100], Jan Kiszka wrote:
> > Either way, that add_interrupt_randomness() should be moved to
> > sysvec_hyperv_callback() like it has been done for
> > sysvec_hyperv_stimer0(). It should be invoked twice now if gets there
> > via vmbus_percpu_isr().
> 
> No, this would degrade arm64.

Okay. So why does this needs to be done for _this_ per-CPU IRQ on ARM64
but not for the others? What does it make so special.

> Jan

Sebastian

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18 11:03 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michael Kelley
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260318100138.GimjldpV@linutronix.de>

On 18.03.26 11:01, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 17:25:20 [+0000], Michael Kelley wrote:
>> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
>>>
>>
>> Let me try to address the range of questions here and in the follow-up
>> discussion. As background, an overview of VMBus interrupt handling is in:
>>
>> Documentation/virt/hyperv/vmbus.rst
>>
>> in the section entitled "Synthetic Interrupt Controller (synic)". The
>> relevant text is:
>>
>>    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
>>    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
>>    each CPU in the guest has a synic and may receive VMBus interrupts,
>>    they are best modeled in Linux as per-CPU interrupts. This model works
>>    well on arm64 where a single per-CPU Linux IRQ is allocated for
>>    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
>>    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
>>    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
>>    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
>>    there's no Linux IRQ, and the interrupts are visible in aggregate in
>>    /proc/interrupts on the "HYP" line.
>>
>> The use of a statically allocated sysvec pre-dates my involvement in this
>> code starting in 2017, but I believe it was modelled after what Xen does,
>> and for the same reason -- to effectively create a per-CPU interrupt on
>> x86/x64. Acorn is also using HYPERVISOR_CALLBACK_VECTOR, but I
>> don't know if that is also to create a per-CPU interrupt.
> 
> If you create a vector, it becomes per-CPU. There is simply no mapping
> from HYPERVISOR_CALLBACK_VECTOR to request_percpu_irq(). But if we had
> this…
> 
> …
>>> What clears this? This is wrongly placed. This should go to
>>> sysvec_hyperv_callback() instead with its matching canceling part. The
>>> add_interrupt_randomness() should also be there and not here.
>>> sysvec_hyperv_stimer0() managed to do so.
>>
>> I don't have any knowledge to bring regarding the use of
>> lockdep_hardirq_threaded().
> 
> It is used in IRQ core to mark the execution of an interrupt handler
> which becomes threaded in a forced-threaded scenario. The goal is to let
> lockdep know that this piece of code on !RT will be threaded on RT and
> therefore there is no need to report a possible locking problem that
> will not exist on RT.
> 
>>> Different question: What guarantees that there won't be another
>>> interrupt before this one is done? The handshake appears to be
>>> deprecated. The interrupt itself returns ACKing (or not) but the actual
>>> handler is delayed to this thread. Depending on the userland it could
>>> take some time and I don't know how impatient the host is.
>>
>> In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
>> and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
>> is usually explicitly called to ack the interrupt.
>>
>> There's no guarantee, in either the existing case or the new PREEMPT_RT
>> case, that another VMBus interrupt won't come in on the same CPU
>> before the tasklets scheduled by vmbus_message_sched() or
>> vmbus_chan_sched() have run. From a functional standpoint, the Linux
>> code and interaction with Hyper-V handles another interrupt correctly.
> 
> So there is no scenario that the host will trigger interrupts because
> the guest is leaving the ISR without doing anything/ making progress?
> 
>> From a delay standpoint, there's not a problem for the normal (i.e., not
>> PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
>> don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
>> about delays since the tasklets are scheduled from the new per-CPU thread.
>> But my understanding is that Jan's motivation for these changes is not to
>> achieve true RT behavior, since Hyper-V doesn't provide that anyway.
>> The goal is simply to make PREEMPT_RT builds functional, though Jan may
>> have further comments on the goal.
> 
> I would be worried if the host would storming interrupts to the guest
> because it makes no progress.
> 
>>>> +		__vmbus_isr();
>>> Moving on. This (trying very hard here) even schedules tasklets. Why?
>>> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
>>> You don't want that.
>>
>> Again, Jan can comment on the impact of delays due to ending up
>> in ksoftirqd.
> 
> My point is that having this with threaded interrupt support would
> eliminate the usage of tasklets.
> 
>>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>>> instead apic_eoi() + schedule_delayed_work().
>>
>> As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
>> on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
>> entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
>> path gets to vmbus_isr(), the two architectures share the same code. Same
>> thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.
> 
> This one has the "random" collecting on the right spot.
> 
>> If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
>> to looking at it.
>>
>> As I recently discovered in discussion with Jan, standard Linux IRQ handling
>> will *not* thread per-CPU interrupts. So even on arm64 with a standard
>> Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
>> request threading.
> 
> It would require a statement from the x86 & IRQ maintainers if it is
> worth on x86 to make allow pass HYPERVISOR_CALLBACK_VECTOR to
> request_percpu_irq() and have an IRQF_ that this one needs to be forced
> threaded. Otherwise we would need to remain with the workarounds.
> 
> If you say that an interrupt storm can not occur, I would prefer
> |static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
> |…
> |	lock_map_acquire_try(&vmbus_map);
> |	__vmbus_isr();
> |	lock_map_release(&vmbus_map);
> 
> while it has mostly the same effect.
> 
> Either way, that add_interrupt_randomness() should be moved to
> sysvec_hyperv_callback() like it has been done for
> sysvec_hyperv_stimer0(). It should be invoked twice now if gets there
> via vmbus_percpu_isr().

No, this would degrade arm64.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18 11:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hyperv, linux-kernel,
	Florian Bezdeka, RT, Mitchell Levy, Michael Kelley,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260318090817.zuFUjrxd@linutronix.de>

On 18.03.26 10:08, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 12:55:15 [+0100], Jan Kiszka wrote:
>> Point is that a task that was interrupted by a potentially threaded
>> interrupt keeps this flag longer that it needs it. And that is
>> apparently harmless, but fairly confusing.
> 
> correct. My only concern would be a shared handler where the second is
> not threaded.

The vmbus irqs are not shared (beyond what sysvec_hyperv_callback does).

> 
>>>> With that in mind, the new logic here is no different from the one the
>>>> kernel used before. If both are not doing what they should, we likely
>>>> want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
>>>
>>> The difference is that you expect that _everyone_ calling this driver
>>> has everything else threaded. This might not be the case. That is why
>>> this should be in core knowing what is called if threaded, use in driver
>>> after explicit killing that flag afterwards since you don't know what
>>> can follow or add a generic threaded infrastructure here. 
>>
>> This driver is different, unfortunately. I'm not sure if we can / want
>> to thread everything that the platform interrupt does on x86. So far,
>> only the last part of it - vmbus handling - is threaded. On arm64, the
>> irq is exclusive (see vmbus_percpu_isr), thus everything can be and is
>> threaded.
> 
> No, it is a percpu interrupt which are not forced-threaded.

It is threaded now due to my patch.

> 
>>>>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>>>>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>>>>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>>>>> instead apic_eoi() + schedule_delayed_work(). 
>>>>>
>>>>
>>>> Again, you are thinking x86-only. We need a portable solution.
>>>
>>> well, ARM could use a threaded interrupt, too.
>>
>> For a reason we didn't explore in details, per-CPU interrupts aren't
>> threaded. See older version of this patch
>> (https://lore.kernel.org/lkml/005a01dc9d30$a40515e0$ec0f41a0$@zohomail.com/)
>> where I thought I only had to fix x86, but arm64 was needing care as well.
> 
> Per-CPU are usually timers or other things which are not threaded and
> have their own thing for the "second" port and I only remember MCE using
> a workqueue for notification.

And the hv vmbus now provides a case where threading could be useful, at
least for arm64. For x86, we would have to check if the first half of
sysvec_hyperv_callback (mshv_handler) wants threading as well / would
support that.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: RE: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Sebastian Andrzej Siewior @ 2026-03-18 10:10 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka
In-Reply-To: <SN6PR02MB41573332BF202DAE0AF79ED1D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-03-17 17:26:39 [+0000], Michael Kelley wrote:
> > > Who is other one and does it have its add_interrupt_randomness() there
> > > already?
> > 
> > It's the arm64 path of the hv support. Regarding the vmbus IRQ, it seems
> > to be fully handled here, without an equivalent of
> > arch/x86/kernel/cpu/mshyperv.c.
> 
> The arm64 path is the call to request_percpu_irq() in vmbus_bus_init().
> That call is only made when running on arm64. See the code comment in
> vmbus_bus_init().
> 
> The specified interrupt handler is vmbus_percpu_isr(), which again runs
> only on arm64. It calls vmbus_isr(), which starts the common path for both
> x86/x64 and arm64.
> 
> Then the slight weirdness is that the standard Linux IRQ handling for
> per-CPU IRQs on arm64 with a GICv3 (which is what Hyper-V emulates) 
> does *not* call add_interrupt_randomness().  The function
> gic_irq_domain_map() sets the IRQ handler for PPI range to
> handle_percpu_devid_irq(), and that function does not do
> add_interrupt_randomness().  The other variant, handle_percpu_irq(),
> calls handle_irq_event_percpu(), which *does* do the
> add_interrupt_randomness().

So despite all the generic code on arm64 does not do it? Then don't
workaround this in your driver. Either talk to the IRQ maintainer and
suggest adding it there so everyone benefits from or don't because there
might be a reason to avoid it. Having it in driver code is wrong.

> So at this point, putting the add_interrupt_randomness() in
> vmbus_isr() is needed to catch both architectures. If the lack of
> add_interrupt_randomness() in handle_percpu_devid_irq() is a bug,
> then that would be a cleaner way to handle this. But maybe there's
> a reason behind the current behavior of handle_percpu_devid_irq()
> that I'm unaware of.
>
> Michael

Sebastian

^ permalink raw reply

* Re: RE: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18 10:01 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <SN6PR02MB415753FDA0DEEA0B4A8B9994D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-03-17 17:25:20 [+0000], Michael Kelley wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
> >
> 
> Let me try to address the range of questions here and in the follow-up
> discussion. As background, an overview of VMBus interrupt handling is in:
> 
> Documentation/virt/hyperv/vmbus.rst
> 
> in the section entitled "Synthetic Interrupt Controller (synic)". The
> relevant text is:
> 
>    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
>    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
>    each CPU in the guest has a synic and may receive VMBus interrupts,
>    they are best modeled in Linux as per-CPU interrupts. This model works
>    well on arm64 where a single per-CPU Linux IRQ is allocated for
>    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
>    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
>    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
>    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
>    there's no Linux IRQ, and the interrupts are visible in aggregate in
>    /proc/interrupts on the "HYP" line.
> 
> The use of a statically allocated sysvec pre-dates my involvement in this
> code starting in 2017, but I believe it was modelled after what Xen does,
> and for the same reason -- to effectively create a per-CPU interrupt on
> x86/x64. Acorn is also using HYPERVISOR_CALLBACK_VECTOR, but I
> don't know if that is also to create a per-CPU interrupt.

If you create a vector, it becomes per-CPU. There is simply no mapping
from HYPERVISOR_CALLBACK_VECTOR to request_percpu_irq(). But if we had
this…

…
> > What clears this? This is wrongly placed. This should go to
> > sysvec_hyperv_callback() instead with its matching canceling part. The
> > add_interrupt_randomness() should also be there and not here.
> > sysvec_hyperv_stimer0() managed to do so.
> 
> I don't have any knowledge to bring regarding the use of
> lockdep_hardirq_threaded().

It is used in IRQ core to mark the execution of an interrupt handler
which becomes threaded in a forced-threaded scenario. The goal is to let
lockdep know that this piece of code on !RT will be threaded on RT and
therefore there is no need to report a possible locking problem that
will not exist on RT.

> > Different question: What guarantees that there won't be another
> > interrupt before this one is done? The handshake appears to be
> > deprecated. The interrupt itself returns ACKing (or not) but the actual
> > handler is delayed to this thread. Depending on the userland it could
> > take some time and I don't know how impatient the host is.
> 
> In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
> and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
> is usually explicitly called to ack the interrupt.
> 
> There's no guarantee, in either the existing case or the new PREEMPT_RT
> case, that another VMBus interrupt won't come in on the same CPU
> before the tasklets scheduled by vmbus_message_sched() or
> vmbus_chan_sched() have run. From a functional standpoint, the Linux
> code and interaction with Hyper-V handles another interrupt correctly.

So there is no scenario that the host will trigger interrupts because
the guest is leaving the ISR without doing anything/ making progress?

> From a delay standpoint, there's not a problem for the normal (i.e., not
> PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
> don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
> about delays since the tasklets are scheduled from the new per-CPU thread.
> But my understanding is that Jan's motivation for these changes is not to
> achieve true RT behavior, since Hyper-V doesn't provide that anyway.
> The goal is simply to make PREEMPT_RT builds functional, though Jan may
> have further comments on the goal.

I would be worried if the host would storming interrupts to the guest
because it makes no progress.

> > > +		__vmbus_isr();
> > Moving on. This (trying very hard here) even schedules tasklets. Why?
> > You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> > You don't want that.
> 
> Again, Jan can comment on the impact of delays due to ending up
> in ksoftirqd.

My point is that having this with threaded interrupt support would
eliminate the usage of tasklets.

> > Couldn't the whole logic be integrated into the IRQ code? Then we could
> > have mask/ unmask if supported/ provided and threaded interrupts. Then
> > sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> > instead apic_eoi() + schedule_delayed_work().
> 
> As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
> on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
> entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
> path gets to vmbus_isr(), the two architectures share the same code. Same
> thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.

This one has the "random" collecting on the right spot.

> If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
> to looking at it.
> 
> As I recently discovered in discussion with Jan, standard Linux IRQ handling
> will *not* thread per-CPU interrupts. So even on arm64 with a standard
> Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
> request threading.

It would require a statement from the x86 & IRQ maintainers if it is
worth on x86 to make allow pass HYPERVISOR_CALLBACK_VECTOR to
request_percpu_irq() and have an IRQF_ that this one needs to be forced
threaded. Otherwise we would need to remain with the workarounds.

If you say that an interrupt storm can not occur, I would prefer
|static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
|…
|	lock_map_acquire_try(&vmbus_map);
|	__vmbus_isr();
|	lock_map_release(&vmbus_map);

while it has mostly the same effect.

Either way, that add_interrupt_randomness() should be moved to
sysvec_hyperv_callback() like it has been done for
sysvec_hyperv_stimer0(). It should be invoked twice now if gets there
via vmbus_percpu_isr().

> I need to refresh my memory on sysvec_hyperv_reenlightenment(). If
> I recall correctly, it's not a per-CPU interrupt, so it probably doesn't
> need to have a hardcoded vector. Overall, the Hyper-V reenlightenment
> functionality is a bit of a fossil that isn't needed on modern x86/x64
> processors that support TSC scaling. And it doesn't exist for arm64.
> It might be worth seeing if it could be dropped entirely ...
> 
> Michael

Sebastian

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18  9:08 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Peter Zijlstra, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hyperv, linux-kernel,
	Florian Bezdeka, RT, Mitchell Levy, Michael Kelley,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <1e15ac0d-9835-487c-9a16-c55203f01a3d@siemens.com>

On 2026-03-17 12:55:15 [+0100], Jan Kiszka wrote:
> Point is that a task that was interrupted by a potentially threaded
> interrupt keeps this flag longer that it needs it. And that is
> apparently harmless, but fairly confusing.

correct. My only concern would be a shared handler where the second is
not threaded.

> >> With that in mind, the new logic here is no different from the one the
> >> kernel used before. If both are not doing what they should, we likely
> >> want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
> > 
> > The difference is that you expect that _everyone_ calling this driver
> > has everything else threaded. This might not be the case. That is why
> > this should be in core knowing what is called if threaded, use in driver
> > after explicit killing that flag afterwards since you don't know what
> > can follow or add a generic threaded infrastructure here. 
> 
> This driver is different, unfortunately. I'm not sure if we can / want
> to thread everything that the platform interrupt does on x86. So far,
> only the last part of it - vmbus handling - is threaded. On arm64, the
> irq is exclusive (see vmbus_percpu_isr), thus everything can be and is
> threaded.

No, it is a percpu interrupt which are not forced-threaded.

> >>> Couldn't the whole logic be integrated into the IRQ code? Then we could
> >>> have mask/ unmask if supported/ provided and threaded interrupts. Then
> >>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> >>> instead apic_eoi() + schedule_delayed_work(). 
> >>>
> >>
> >> Again, you are thinking x86-only. We need a portable solution.
> > 
> > well, ARM could use a threaded interrupt, too.
> 
> For a reason we didn't explore in details, per-CPU interrupts aren't
> threaded. See older version of this patch
> (https://lore.kernel.org/lkml/005a01dc9d30$a40515e0$ec0f41a0$@zohomail.com/)
> where I thought I only had to fix x86, but arm64 was needing care as well.

Per-CPU are usually timers or other things which are not threaded and
have their own thing for the "second" port and I only remember MCE using
a workqueue for notification.

> Jan

Sebastian

^ permalink raw reply

* Re: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Wei Liu @ 2026-03-18  6:20 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D2316EC9E5B0BAE656C0D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > 
> > The current error handling has two issues:
> > 
> > First, pin_user_pages_fast() can return a short pin count (less than
> > requested but greater than zero) when it cannot pin all requested pages.
> > This is treated as success, leading to partially pinned regions being
> > used, which causes memory corruption.
> > 
> > Second, when an error occurs mid-loop, already pinned pages from the
> > current batch are not released before calling mshv_region_evict_pages(),
> > causing a page reference leak.
> 
> There's now an online LLM-based tool that is automatically reviewing
> kernel patches.  For this patch, the results are here:
> 
> https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
> 
> It has flagged the commit message as incorrectly referencing the
> function mshv_region_evict_pages(), which doesn't exist.
> 
> FWIW, the announcement about sashiko.dev is here:
> 
> https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> 
> Other than the commit message reference, this looks good to me.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>

The second point is written as if the code here should release the
already pinned pages before calling mshv_region_invalidate_pages(), but
the code actually relies on mshv_mem_region_invalidate_pages() to
release the pages. The change here fixes the accounting.

 Second, when an error occurs mid-loop, already pinned pages from the
 current batch are not accounted for before calling
 mshv_region_invalidate_pages(), causing a page reference leak.

And queued up the patch to hyperv-fixes.

Wei

> 
> > 
> > Fix by treating short pins as errors and explicitly unpinning the
> > partial batch before cleanup.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index c28aac0726de..fdffd4f002f6 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> >  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> >  					  FOLL_WRITE | FOLL_LONGTERM,
> >  					  pages);
> > -		if (ret < 0)
> > +		if (ret != nr_pages)
> >  			goto release_pages;
> >  	}
> > 
> >  	return 0;
> > 
> >  release_pages:
> > +	if (ret > 0)
> > +		done_count += ret;
> >  	mshv_region_invalidate_pages(region, 0, done_count);
> > -	return ret;
> > +	return ret < 0 ? ret : -ENOMEM;
> >  }
> > 
> >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > 
> > 
> 

^ permalink raw reply

* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18  5:52 UTC (permalink / raw)
  To: Michael Kelley, Sebastian Andrzej Siewior
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <SN6PR02MB415753FDA0DEEA0B4A8B9994D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 17.03.26 18:25, Michael Kelley wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
>>
> 
> Let me try to address the range of questions here and in the follow-up
> discussion. As background, an overview of VMBus interrupt handling is in:
> 
> Documentation/virt/hyperv/vmbus.rst
> 
> in the section entitled "Synthetic Interrupt Controller (synic)". The
> relevant text is:
> 
>    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
>    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
>    each CPU in the guest has a synic and may receive VMBus interrupts,
>    they are best modeled in Linux as per-CPU interrupts. This model works
>    well on arm64 where a single per-CPU Linux IRQ is allocated for
>    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
>    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
>    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
>    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
>    there's no Linux IRQ, and the interrupts are visible in aggregate in
>    /proc/interrupts on the "HYP" line.
> 
> The use of a statically allocated sysvec pre-dates my involvement in this
> code starting in 2017, but I believe it was modelled after what Xen does,
> and for the same reason -- to effectively create a per-CPU interrupt on
> x86/x64. Acorn is also using HYPERVISOR_CALLBACK_VECTOR, but I
> don't know if that is also to create a per-CPU interrupt.

Long ago, we demonstrated via Jailhouse that you do not necessarily gain
complexity on the hypervisor side by providing a minimal PCI host and
attaching all your virtual devices to that instead. Even longer ago in
the absence of proper IRQ controller virtualization on the various
archs, there was a bit of performance to gain doing "special"
interrupts. All these design decisions made sense at a certain time but
you would likely no longer repeat them today.

> 
> More below ....
> 
>> On 2026-02-16 17:24:56 [+0100], Jan Kiszka wrote:
>>> --- a/drivers/hv/vmbus_drv.c
>>> +++ b/drivers/hv/vmbus_drv.c
>>> @@ -25,6 +25,7 @@
>>>  #include <linux/cpu.h>
>>>  #include <linux/sched/isolation.h>
>>>  #include <linux/sched/task_stack.h>
>>> +#include <linux/smpboot.h>
>>>
>>>  #include <linux/delay.h>
>>>  #include <linux/panic_notifier.h>
>>> @@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
>>>  	}
>>>  }
>>>
>>> -void vmbus_isr(void)
>>> +static void __vmbus_isr(void)
>>>  {
>>>  	struct hv_per_cpu_context *hv_cpu
>>>  		= this_cpu_ptr(hv_context.cpu_context);
>>> @@ -1363,6 +1364,53 @@ void vmbus_isr(void)
>>>
>>>  	add_interrupt_randomness(vmbus_interrupt);
>>
>> This is feeding entropy and would like to see interrupt registers. But
>> since this is invoked from a thread it won't.
> 
> I'll respond to this topic on the new thread for the new patch
> where Jan has moved the call to add_interrupt_randomness().
> 
>>
>>>  }
>>> +
>> …
>>> +void vmbus_isr(void)
>>> +{
>>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>>> +		vmbus_irqd_wake();
>>> +	} else {
>>> +		lockdep_hardirq_threaded();
>>
>> What clears this? This is wrongly placed. This should go to
>> sysvec_hyperv_callback() instead with its matching canceling part. The
>> add_interrupt_randomness() should also be there and not here.
>> sysvec_hyperv_stimer0() managed to do so.
> 
> I don't have any knowledge to bring regarding the use of
> lockdep_hardirq_threaded().
> 
>>
>> Different question: What guarantees that there won't be another
>> interrupt before this one is done? The handshake appears to be
>> deprecated. The interrupt itself returns ACKing (or not) but the actual
>> handler is delayed to this thread. Depending on the userland it could
>> take some time and I don't know how impatient the host is.
> 
> In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
> and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
> is usually explicitly called to ack the interrupt.
> 
> There's no guarantee, in either the existing case or the new PREEMPT_RT
> case, that another VMBus interrupt won't come in on the same CPU
> before the tasklets scheduled by vmbus_message_sched() or
> vmbus_chan_sched() have run. From a functional standpoint, the Linux
> code and interaction with Hyper-V handles another interrupt correctly.
> 
> From a delay standpoint, there's not a problem for the normal (i.e., not
> PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
> don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
> about delays since the tasklets are scheduled from the new per-CPU thread.
> But my understanding is that Jan's motivation for these changes is not to
> achieve true RT behavior, since Hyper-V doesn't provide that anyway.
> The goal is simply to make PREEMPT_RT builds functional, though Jan may
> have further comments on the goal.
> 

That is exactly the goal: A Linux guest happening to use a PREEMPT_RT
kernel should correctly run on Hyper-V, and that without losing relevant
performance. However, we do not expect any deterministic timing behavior
from such a setup.

>>
>>> +		__vmbus_isr();
>> Moving on. This (trying very hard here) even schedules tasklets. Why?
>> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
>> You don't want that.
> 
> Again, Jan can comment on the impact of delays due to ending up
> in ksoftirqd.
> 
>>
>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>> instead apic_eoi() + schedule_delayed_work().
> 
> As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
> on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
> entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
> path gets to vmbus_isr(), the two architectures share the same code. Same
> thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.
> If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
> to looking at it.
> 
> As I recently discovered in discussion with Jan, standard Linux IRQ handling
> will *not* thread per-CPU interrupts. So even on arm64 with a standard
> Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
> request threading.
> 
> I need to refresh my memory on sysvec_hyperv_reenlightenment(). If
> I recall correctly, it's not a per-CPU interrupt, so it probably doesn't
> need to have a hardcoded vector. Overall, the Hyper-V reenlightenment
> functionality is a bit of a fossil that isn't needed on modern x86/x64
> processors that support TSC scaling. And it doesn't exist for arm64.
> It might be worth seeing if it could be dropped entirely ...
> 

I suppose that all depends on how long Linux needs to support the
underlying hypervisor versions and interfaces, no? It's a bit like
supporting old hardware...

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply

* Re: [PATCH 00/11] Drivers: hv: Add ARM64 support in mshv_vtl
From: Naman Jain @ 2026-03-18  4:23 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157DFC7B7CE94500C89664BD441A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 3/18/2026 3:33 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> The series intends to add support for ARM64 to mshv_vtl driver.
>> For this, common Hyper-V code is refactored, necessary support is added,
>> mshv_vtl_main.c is refactored and then finally support is added in
>> Kconfig.
>>
>> Based on commit 1f318b96cc84 ("Linux 7.0-rc3")
> 
> There's now an online LLM-based tool that is automatically reviewing
> kernel patches. For this patch set, the results are here:
> 
> https://sashiko.dev/#/patchset/20260316121241.910764-1-namjain%40linux.microsoft.com
> 
> It has flagged several things that are worth checking, but I haven't
> reviewed them to see if they are actually valid.
> 
> FWIW, the announcement about sashiko.dev is here:
> 
> https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> 
> Michael


Thanks for sharing Michael,
I'll check it out and do the needful.

Regards,
Naman


^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Long Li @ 2026-03-17 23:43 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Haiyang Zhang, KY Srinivasan, Wei Liu,
	Dexuan Cui, Simon Horman, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260316200843.GK61385@unreal>

> 
> On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > When the MANA hardware undergoes a service reset, the ETH
> > > > auxiliary device
> > > > (mana.eth) used by DPDK persists across the reset cycle — it is
> > > > not removed and re-added like RC/UD/GSI QPs. This means userspace
> > > > RDMA consumers such as DPDK have no way of knowing that firmware
> > > > handles for their PD, CQ, WQ, QP and MR resources have become stale.
> > >
> > > NAK to any of this.
> > >
> > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > and recreated later.
> >
> > Yeah, that is our general model for any serious RAS event where the
> > driver's view of resources becomes out of sync with the HW.
> >
> > You have tear down the ib_device by removing the aux and then bring
> > back a new one.
> >
> > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > tell userspace to close and re-open their uverbs FD.
> >
> > We don't have a model where a uverbs FD in userspace can continue to
> > work after the device has a catasrophic RAS event.
> >
> > There may be room to have a model where the ib device doesn't fully
> > unplug/replug so it retains its name and things, but that is core code
> > not driver stuff.
> 
> Good luck with that model. It is going to break RDMA-CM hotplug support.
> 

   I think we can preserve RDMA-CM behavior without requiring ib_device
   unregister/re-register.

   On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
   new reset event) through ib_dispatch_event(). RDMA-CM already handles
   device events — we would add a handler that iterates all rdma_cm_ids
   on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
   as cma_process_remove() does today. The difference: cma_device stays
   alive, so applications can reconnect on the same device after recovery
   instead of waiting for a new one to appear.

   The motivation for keeping ib_device alive is that some RDMA consumers
   — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
   manage QP state themselves. For these users, a persistent ib_device
   with IB_EVENT_PORT_ERR / IB_EVENT_PORT_ACTIVE notifications enables
   reliable in-place recovery without reopening the device.

   This matters especially for PCI DPC recovery, which is becoming
   critical for large-scale GPU/storage deployments. See this talk for
   context on the value of surviving DPC events:
   https://www.youtube.com/watch?v=TpNNeMGEsdU&t=1619s

   Today a DPC event on one NIC kills all RDMA connections and can
   crash entire training jobs. If the ib_device persists and the driver
   recreates firmware resources after recovery, raw verbs users can
   resume without full teardown, and RDMA-CM users get the same
   disconnect/reconnect behavior they have today.

Thanks,
Long

^ permalink raw reply

* RE: [PATCH 00/11] Drivers: hv: Add ARM64 support in mshv_vtl
From: Michael Kelley @ 2026-03-17 22:03 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, Michael Kelley,
	linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
> 
> The series intends to add support for ARM64 to mshv_vtl driver.
> For this, common Hyper-V code is refactored, necessary support is added,
> mshv_vtl_main.c is refactored and then finally support is added in
> Kconfig.
> 
> Based on commit 1f318b96cc84 ("Linux 7.0-rc3")

There's now an online LLM-based tool that is automatically reviewing
kernel patches. For this patch set, the results are here:

https://sashiko.dev/#/patchset/20260316121241.910764-1-namjain%40linux.microsoft.com

It has flagged several things that are worth checking, but I haven't
reviewed them to see if they are actually valid.

FWIW, the announcement about sashiko.dev is here:

https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/

Michael

> 
> Naman Jain (11):
>   arch: arm64: Export arch_smp_send_reschedule for mshv_vtl module
>   Drivers: hv: Move hv_vp_assist_page to common files
>   Drivers: hv: Add support to setup percpu vmbus handler
>   Drivers: hv: Refactor mshv_vtl for ARM64 support to be added
>   drivers: hv: Export vmbus_interrupt for mshv_vtl module
>   Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
>   arch: arm64: Add support for mshv_vtl_return_call
>   Drivers: hv: mshv_vtl: Move register page config to arch-specific
>     files
>   Drivers: hv: mshv_vtl: Let userspace do VSM configuration
>   Drivers: hv: Add support for arm64 in MSHV_VTL
>   Drivers: hv: Kconfig: Add ARM64 support for MSHV_VTL
> 
>  arch/arm64/hyperv/Makefile        |   1 +
>  arch/arm64/hyperv/hv_vtl.c        | 152 ++++++++++++++++++++++
>  arch/arm64/hyperv/mshyperv.c      |  13 ++
>  arch/arm64/include/asm/mshyperv.h |  28 ++++
>  arch/arm64/kernel/smp.c           |   1 +
>  arch/x86/hyperv/hv_init.c         |  88 +------------
>  arch/x86/hyperv/hv_vtl.c          | 130 +++++++++++++++++++
>  arch/x86/include/asm/mshyperv.h   |   8 +-
>  drivers/hv/Kconfig                |   2 +-
>  drivers/hv/hv_common.c            |  99 +++++++++++++++
>  drivers/hv/mshv.h                 |   8 --
>  drivers/hv/mshv_vtl_main.c        | 205 ++++--------------------------
>  drivers/hv/vmbus_drv.c            |   8 +-
>  include/asm-generic/mshyperv.h    |  49 +++++++
>  include/hyperv/hvgdk_mini.h       |   2 +
>  15 files changed, 505 insertions(+), 289 deletions(-)
>  create mode 100644 arch/arm64/hyperv/hv_vtl.c
> 
> 
> base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
> prerequisite-patch-id: 24022ec1fb63bc20de8114eedf03c81bb1086e0e
> prerequisite-patch-id: 801f2588d5c6db4ceb9a6705a09e4649fab411b1
> prerequisite-patch-id: 581c834aa268f0c54120c6efbc1393fbd9893f49
> prerequisite-patch-id: b0b153807bab40860502c52e4a59297258ade0db
> prerequisite-patch-id: 2bff6accea80e7976c58d80d847cd33f260a3cb9
> prerequisite-patch-id: 296ffbc4f119a5b249bc9c840f84129f5c151139
> prerequisite-patch-id: 3b54d121145e743ac5184518df33a1812280ec96
> prerequisite-patch-id: 06fc5b37b23ee3f91a2c8c9b9c126fde290834f2
> prerequisite-patch-id: 6e8afed988309b03485f5538815ea29c8fa5b0a9
> prerequisite-patch-id: 4f1fb1b7e9cfa8a3b1c02fafecdbb432b74ee367
> prerequisite-patch-id: 49944347e0b2d93e72911a153979c567ebb7e66b
> prerequisite-patch-id: 6dec75498eeae6365d15ac12b5d0a3bd32e9f91c
> --
> 2.43.0
> 


^ permalink raw reply

* RE: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Michael Kelley @ 2026-03-17 21:56 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177375989324.25621.6532741522672582851.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> 
> The current error handling has two issues:
> 
> First, pin_user_pages_fast() can return a short pin count (less than
> requested but greater than zero) when it cannot pin all requested pages.
> This is treated as success, leading to partially pinned regions being
> used, which causes memory corruption.
> 
> Second, when an error occurs mid-loop, already pinned pages from the
> current batch are not released before calling mshv_region_evict_pages(),
> causing a page reference leak.

There's now an online LLM-based tool that is automatically reviewing
kernel patches.  For this patch, the results are here:

https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net

It has flagged the commit message as incorrectly referencing the
function mshv_region_evict_pages(), which doesn't exist.

FWIW, the announcement about sashiko.dev is here:

https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/

Other than the commit message reference, this looks good to me.

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> 
> Fix by treating short pins as errors and explicitly unpinning the
> partial batch before cleanup.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index c28aac0726de..fdffd4f002f6 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
>  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
>  					  FOLL_WRITE | FOLL_LONGTERM,
>  					  pages);
> -		if (ret < 0)
> +		if (ret != nr_pages)
>  			goto release_pages;
>  	}
> 
>  	return 0;
> 
>  release_pages:
> +	if (ret > 0)
> +		done_count += ret;
>  	mshv_region_invalidate_pages(region, 0, done_count);
> -	return ret;
> +	return ret < 0 ? ret : -ENOMEM;
>  }
> 
>  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> 
> 


^ permalink raw reply

* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-17 21:32 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpFXuHg4KPY27pqMC-xV5y9ZY2W72_R8_rxO0DvrJ=_yvw@mail.gmail.com>

On Tue, Mar 17, 2026 at 2:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > The f_op->mmap interface is deprecated, so update driver to use its
> > successor, mmap_prepare.
> >
> > The driver previously used vm_iomap_memory(), so this change replaces it
> > with its mmap_prepare equivalent, mmap_action_simple_ioremap().
> >
> > Functions that wrap mmap() are also converted to wrap mmap_prepare()
> > instead.
> >
> > Also update the documentation accordingly.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  Documentation/driver-api/vme.rst    |  2 +-
> >  drivers/staging/vme_user/vme.c      | 20 +++++------
> >  drivers/staging/vme_user/vme.h      |  2 +-
> >  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
> >  4 files changed, 42 insertions(+), 33 deletions(-)
> >
> > diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> > index c0b475369de0..7111999abc14 100644
> > --- a/Documentation/driver-api/vme.rst
> > +++ b/Documentation/driver-api/vme.rst
> > @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
> >
> >  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
> >  do a read-modify-write transaction. Parts of a VME window can also be mapped
> > -into user space memory using :c:func:`vme_master_mmap`.
> > +into user space memory using :c:func:`vme_master_mmap_prepare`.
> >
> >
> >  Slave windows
> > diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> > index f10a00c05f12..7220aba7b919 100644
> > --- a/drivers/staging/vme_user/vme.c
> > +++ b/drivers/staging/vme_user/vme.c
> > @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
> >  EXPORT_SYMBOL(vme_master_rmw);
> >
> >  /**
> > - * vme_master_mmap - Mmap region of VME master window.
> > + * vme_master_mmap_prepare - Mmap region of VME master window.
> >   * @resource: Pointer to VME master resource.
> > - * @vma: Pointer to definition of user mapping.
> > + * @desc: Pointer to descriptor of user mapping.
> >   *
> >   * Memory map a region of the VME master window into user space.
> >   *
> > @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
> >   *         resource or -EFAULT if map exceeds window size. Other generic mmap
> >   *         errors may also be returned.
> >   */
> > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > +int vme_master_mmap_prepare(struct vme_resource *resource,
> > +                           struct vm_area_desc *desc)
> >  {
> > +       const unsigned long vma_size = vma_desc_size(desc);
> >         struct vme_bridge *bridge = find_bridge(resource);
> >         struct vme_master_resource *image;
> >         phys_addr_t phys_addr;
> > -       unsigned long vma_size;
> >
> >         if (resource->type != VME_MASTER) {
> >                 dev_err(bridge->parent, "Not a master resource\n");
> > @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> >         }
> >
> >         image = list_entry(resource->entry, struct vme_master_resource, list);
> > -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> > -       vma_size = vma->vm_end - vma->vm_start;
> > +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
> >
> >         if (phys_addr + vma_size > image->bus_resource.end + 1) {
> >                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
> >                 return -EFAULT;
> >         }
> >
> > -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > -
> > -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> > +       desc->page_prot = pgprot_noncached(desc->page_prot);
> > +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> > +       return 0;
> >  }
> > -EXPORT_SYMBOL(vme_master_mmap);
> > +EXPORT_SYMBOL(vme_master_mmap_prepare);
> >
> >  /**
> >   * vme_master_free - Free VME master window
> > diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> > index 797e9940fdd1..b6413605ea49 100644
> > --- a/drivers/staging/vme_user/vme.h
> > +++ b/drivers/staging/vme_user/vme.h
> > @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
> >  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
> >  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
> >                             unsigned int swap, loff_t offset);
> > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> > +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
> >  void vme_master_free(struct vme_resource *resource);
> >
> >  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> > diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> > index d95dd7d9190a..11e25c2f6b0a 100644
> > --- a/drivers/staging/vme_user/vme_user.c
> > +++ b/drivers/staging/vme_user/vme_user.c
> > @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
> >         kfree(vma_priv);
> >  }
> >
> > -static const struct vm_operations_struct vme_user_vm_ops = {
> > -       .open = vme_user_vm_open,
> > -       .close = vme_user_vm_close,
> > -};
> > -
> > -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > +                             const struct file *file, void **vm_private_data)
> >  {
> > -       int err;
> > +       const unsigned int minor = iminor(file_inode(file));
> >         struct vme_user_vma_priv *vma_priv;
> >
> >         mutex_lock(&image[minor].mutex);
> >
> > -       err = vme_master_mmap(image[minor].resource, vma);
> > -       if (err) {
> > -               mutex_unlock(&image[minor].mutex);
> > -               return err;
> > -       }
> > -
>
> Ok, this changes the set of the operations performed under image[minor].mutex.
> Before we had:
>
> mutex_lock(&image[minor].mutex);
> vme_master_mmap();
> <some final adjustments>
> mutex_unlock(&image[minor].mutex);
>
> Now we have:
>
> mutex_lock(&image[minor].mutex);
> vme_master_mmap_prepare()
> mutex_unlock(&image[minor].mutex);
> vm_iomap_memory();
> mutex_lock(&image[minor].mutex);
> vme_user_vm_mapped(); // <some final adjustments>
> mutex_unlock(&image[minor].mutex);
>
> I think as long as image[minor] does not change while we are not
> holding the mutex we should be safe, and looking at the code it seems
> to be the case. But I'm not familiar with this driver and might be
> wrong. Worth double-checking.

A side note: if we had to hold the mutex across all those operations I
think we would need to take the mutex in the vm_ops->mmap_prepare and
add a vm_ops->map_failed hook or something along that line to drop the
mutex in case mmap_action_complete() fails. Not sure if we will have
such cases though...

>
> >         vma_priv = kmalloc_obj(*vma_priv);
> >         if (!vma_priv) {
> >                 mutex_unlock(&image[minor].mutex);
> > @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> >
> >         vma_priv->minor = minor;
> >         refcount_set(&vma_priv->refcnt, 1);
> > -       vma->vm_ops = &vme_user_vm_ops;
> > -       vma->vm_private_data = vma_priv;
> > -
> > +       *vm_private_data = vma_priv;
> >         image[minor].mmap_count++;
> >
> >         mutex_unlock(&image[minor].mutex);
> > -
> >         return 0;
> >  }
> >
> > -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> > +static const struct vm_operations_struct vme_user_vm_ops = {
> > +       .mapped = vme_user_vm_mapped,
> > +       .open = vme_user_vm_open,
> > +       .close = vme_user_vm_close,
> > +};
> > +
> > +static int vme_user_master_mmap_prepare(unsigned int minor,
> > +                                       struct vm_area_desc *desc)
> > +{
> > +       int err;
> > +
> > +       mutex_lock(&image[minor].mutex);
> > +
> > +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> > +       if (!err)
> > +               desc->vm_ops = &vme_user_vm_ops;
> > +
> > +       mutex_unlock(&image[minor].mutex);
> > +       return err;
> > +}
> > +
> > +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
> >  {
> > -       unsigned int minor = iminor(file_inode(file));
> > +       const struct file *file = desc->file;
> > +       const unsigned int minor = iminor(file_inode(file));
> >
> >         if (type[minor] == MASTER_MINOR)
> > -               return vme_user_master_mmap(minor, vma);
> > +               return vme_user_master_mmap_prepare(minor, desc);
> >
> >         return -ENODEV;
> >  }
> > @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
> >         .llseek = vme_user_llseek,
> >         .unlocked_ioctl = vme_user_unlocked_ioctl,
> >         .compat_ioctl = compat_ptr_ioctl,
> > -       .mmap = vme_user_mmap,
> > +       .mmap_prepare = vme_user_mmap_prepare,
> >  };
> >
> >  static int vme_user_match(struct vme_dev *vdev)
> > --
> > 2.53.0
> >

^ permalink raw reply

* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-17 21:26 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <48c6d25e374b57dba6df4fdddd4830d3fc1105be.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> The f_op->mmap interface is deprecated, so update driver to use its
> successor, mmap_prepare.
>
> The driver previously used vm_iomap_memory(), so this change replaces it
> with its mmap_prepare equivalent, mmap_action_simple_ioremap().
>
> Functions that wrap mmap() are also converted to wrap mmap_prepare()
> instead.
>
> Also update the documentation accordingly.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
>  Documentation/driver-api/vme.rst    |  2 +-
>  drivers/staging/vme_user/vme.c      | 20 +++++------
>  drivers/staging/vme_user/vme.h      |  2 +-
>  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
>  4 files changed, 42 insertions(+), 33 deletions(-)
>
> diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> index c0b475369de0..7111999abc14 100644
> --- a/Documentation/driver-api/vme.rst
> +++ b/Documentation/driver-api/vme.rst
> @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
>
>  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
>  do a read-modify-write transaction. Parts of a VME window can also be mapped
> -into user space memory using :c:func:`vme_master_mmap`.
> +into user space memory using :c:func:`vme_master_mmap_prepare`.
>
>
>  Slave windows
> diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> index f10a00c05f12..7220aba7b919 100644
> --- a/drivers/staging/vme_user/vme.c
> +++ b/drivers/staging/vme_user/vme.c
> @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
>  EXPORT_SYMBOL(vme_master_rmw);
>
>  /**
> - * vme_master_mmap - Mmap region of VME master window.
> + * vme_master_mmap_prepare - Mmap region of VME master window.
>   * @resource: Pointer to VME master resource.
> - * @vma: Pointer to definition of user mapping.
> + * @desc: Pointer to descriptor of user mapping.
>   *
>   * Memory map a region of the VME master window into user space.
>   *
> @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
>   *         resource or -EFAULT if map exceeds window size. Other generic mmap
>   *         errors may also be returned.
>   */
> -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> +int vme_master_mmap_prepare(struct vme_resource *resource,
> +                           struct vm_area_desc *desc)
>  {
> +       const unsigned long vma_size = vma_desc_size(desc);
>         struct vme_bridge *bridge = find_bridge(resource);
>         struct vme_master_resource *image;
>         phys_addr_t phys_addr;
> -       unsigned long vma_size;
>
>         if (resource->type != VME_MASTER) {
>                 dev_err(bridge->parent, "Not a master resource\n");
> @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
>         }
>
>         image = list_entry(resource->entry, struct vme_master_resource, list);
> -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> -       vma_size = vma->vm_end - vma->vm_start;
> +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
>
>         if (phys_addr + vma_size > image->bus_resource.end + 1) {
>                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
>                 return -EFAULT;
>         }
>
> -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> -
> -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> +       desc->page_prot = pgprot_noncached(desc->page_prot);
> +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> +       return 0;
>  }
> -EXPORT_SYMBOL(vme_master_mmap);
> +EXPORT_SYMBOL(vme_master_mmap_prepare);
>
>  /**
>   * vme_master_free - Free VME master window
> diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> index 797e9940fdd1..b6413605ea49 100644
> --- a/drivers/staging/vme_user/vme.h
> +++ b/drivers/staging/vme_user/vme.h
> @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
>  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
>  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
>                             unsigned int swap, loff_t offset);
> -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
>  void vme_master_free(struct vme_resource *resource);
>
>  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> index d95dd7d9190a..11e25c2f6b0a 100644
> --- a/drivers/staging/vme_user/vme_user.c
> +++ b/drivers/staging/vme_user/vme_user.c
> @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
>         kfree(vma_priv);
>  }
>
> -static const struct vm_operations_struct vme_user_vm_ops = {
> -       .open = vme_user_vm_open,
> -       .close = vme_user_vm_close,
> -};
> -
> -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> +                             const struct file *file, void **vm_private_data)
>  {
> -       int err;
> +       const unsigned int minor = iminor(file_inode(file));
>         struct vme_user_vma_priv *vma_priv;
>
>         mutex_lock(&image[minor].mutex);
>
> -       err = vme_master_mmap(image[minor].resource, vma);
> -       if (err) {
> -               mutex_unlock(&image[minor].mutex);
> -               return err;
> -       }
> -

Ok, this changes the set of the operations performed under image[minor].mutex.
Before we had:

mutex_lock(&image[minor].mutex);
vme_master_mmap();
<some final adjustments>
mutex_unlock(&image[minor].mutex);

Now we have:

mutex_lock(&image[minor].mutex);
vme_master_mmap_prepare()
mutex_unlock(&image[minor].mutex);
vm_iomap_memory();
mutex_lock(&image[minor].mutex);
vme_user_vm_mapped(); // <some final adjustments>
mutex_unlock(&image[minor].mutex);

I think as long as image[minor] does not change while we are not
holding the mutex we should be safe, and looking at the code it seems
to be the case. But I'm not familiar with this driver and might be
wrong. Worth double-checking.

>         vma_priv = kmalloc_obj(*vma_priv);
>         if (!vma_priv) {
>                 mutex_unlock(&image[minor].mutex);
> @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
>
>         vma_priv->minor = minor;
>         refcount_set(&vma_priv->refcnt, 1);
> -       vma->vm_ops = &vme_user_vm_ops;
> -       vma->vm_private_data = vma_priv;
> -
> +       *vm_private_data = vma_priv;
>         image[minor].mmap_count++;
>
>         mutex_unlock(&image[minor].mutex);
> -
>         return 0;
>  }
>
> -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> +static const struct vm_operations_struct vme_user_vm_ops = {
> +       .mapped = vme_user_vm_mapped,
> +       .open = vme_user_vm_open,
> +       .close = vme_user_vm_close,
> +};
> +
> +static int vme_user_master_mmap_prepare(unsigned int minor,
> +                                       struct vm_area_desc *desc)
> +{
> +       int err;
> +
> +       mutex_lock(&image[minor].mutex);
> +
> +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> +       if (!err)
> +               desc->vm_ops = &vme_user_vm_ops;
> +
> +       mutex_unlock(&image[minor].mutex);
> +       return err;
> +}
> +
> +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
>  {
> -       unsigned int minor = iminor(file_inode(file));
> +       const struct file *file = desc->file;
> +       const unsigned int minor = iminor(file_inode(file));
>
>         if (type[minor] == MASTER_MINOR)
> -               return vme_user_master_mmap(minor, vma);
> +               return vme_user_master_mmap_prepare(minor, desc);
>
>         return -ENODEV;
>  }
> @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
>         .llseek = vme_user_llseek,
>         .unlocked_ioctl = vme_user_unlocked_ioctl,
>         .compat_ioctl = compat_ptr_ioctl,
> -       .mmap = vme_user_mmap,
> +       .mmap_prepare = vme_user_mmap_prepare,
>  };
>
>  static int vme_user_match(struct vme_dev *vdev)
> --
> 2.53.0
>

^ permalink raw reply

* Re: [PATCH v2 10/16] stm: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-17 20:46 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <d34056a65bd387286f4e155d52449106ddc99f78.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> The f_op->mmap interface is deprecated, so update driver to use its
> successor, mmap_prepare.
>
> The driver previously used vm_iomap_memory(), so this change replaces it
> with its mmap_prepare equivalent, mmap_action_simple_ioremap().
>
> Also, in order to correctly maintain reference counting, add a
> vm_ops->mapped callback to increment the reference count when successfully
> mapped.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  drivers/hwtracing/stm/core.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/hwtracing/stm/core.c b/drivers/hwtracing/stm/core.c
> index 37584e786bb5..f48c6a8a0654 100644
> --- a/drivers/hwtracing/stm/core.c
> +++ b/drivers/hwtracing/stm/core.c
> @@ -666,6 +666,16 @@ static ssize_t stm_char_write(struct file *file, const char __user *buf,
>         return count;
>  }
>
> +static int stm_mmap_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> +                          const struct file *file, void **vm_private_data)
> +{
> +       struct stm_file *stmf = file->private_data;
> +       struct stm_device *stm = stmf->stm;
> +
> +       pm_runtime_get_sync(&stm->dev);
> +       return 0;
> +}
> +
>  static void stm_mmap_open(struct vm_area_struct *vma)
>  {
>         struct stm_file *stmf = vma->vm_file->private_data;
> @@ -684,12 +694,14 @@ static void stm_mmap_close(struct vm_area_struct *vma)
>  }
>
>  static const struct vm_operations_struct stm_mmap_vmops = {
> +       .mapped = stm_mmap_mapped,
>         .open   = stm_mmap_open,
>         .close  = stm_mmap_close,
>  };
>
> -static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
> +static int stm_char_mmap_prepare(struct vm_area_desc *desc)
>  {
> +       struct file *file = desc->file;
>         struct stm_file *stmf = file->private_data;
>         struct stm_device *stm = stmf->stm;
>         unsigned long size, phys;
> @@ -697,10 +709,10 @@ static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
>         if (!stm->data->mmio_addr)
>                 return -EOPNOTSUPP;
>
> -       if (vma->vm_pgoff)
> +       if (desc->pgoff)
>                 return -EINVAL;
>
> -       size = vma->vm_end - vma->vm_start;
> +       size = vma_desc_size(desc);
>
>         if (stmf->output.nr_chans * stm->data->sw_mmiosz != size)
>                 return -EINVAL;
> @@ -712,13 +724,12 @@ static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
>         if (!phys)
>                 return -EINVAL;
>
> -       pm_runtime_get_sync(&stm->dev);
> -
> -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> -       vm_flags_set(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
> -       vma->vm_ops = &stm_mmap_vmops;
> -       vm_iomap_memory(vma, phys, size);
> +       desc->page_prot = pgprot_noncached(desc->page_prot);
> +       vma_desc_set_flags(desc, VMA_IO_BIT, VMA_DONTEXPAND_BIT,
> +                          VMA_DONTDUMP_BIT);
> +       desc->vm_ops = &stm_mmap_vmops;
>
> +       mmap_action_simple_ioremap(desc, phys, size);
>         return 0;
>  }
>
> @@ -836,7 +847,7 @@ static const struct file_operations stm_fops = {
>         .open           = stm_char_open,
>         .release        = stm_char_release,
>         .write          = stm_char_write,
> -       .mmap           = stm_char_mmap,
> +       .mmap_prepare   = stm_char_mmap_prepare,
>         .unlocked_ioctl = stm_char_ioctl,
>         .compat_ioctl   = compat_ptr_ioctl,
>  };
> --
> 2.53.0
>

^ permalink raw reply

* [PATCH net-next v6 3/3] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg, Shiraz Saleem,
	Kees Cook, Subbaraya Sundeep, Breno Leitao, linux-kernel,
	linux-rdma
  Cc: paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

For RX CQEs with type CQE_RX_COALESCED_4, to measure the coalescing
efficiency, add counters to count how many contains 2, 3, 4 packets
respectively.
Also, add a counter for the error case of first packet with length == 0.

Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v5:
  Combine the accounting logics as suggested by Simon Horman.

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++------
 .../ethernet/microsoft/mana/mana_ethtool.c    | 15 ++++++++++--
 include/net/mana/mana.h                       |  9 ++++---
 3 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index fa30046dcd3d..49c65cc1697c 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2147,14 +2147,8 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
 		old_buf = NULL;
 		pktlen = oob->ppi[i].pkt_len;
-		if (pktlen == 0) {
-			if (i == 0)
-				netdev_err_once(
-					ndev,
-					"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
-					rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+		if (pktlen == 0)
 			break;
-		}
 
 		curr = rxq->buf_index;
 		rxbuf_oob = &rxq->rx_oobs[curr];
@@ -2175,6 +2169,22 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		if (!coalesced)
 			break;
 	}
+
+	/* Collect coalesced CQE count based on packets processed.
+	 * Coalesced CQEs have at least 2 packets, so index is i - 2.
+	 */
+	if (i > 1) {
+		u64_stats_update_begin(&rxq->stats.syncp);
+		rxq->stats.coalesced_cqe[i - 2]++;
+		u64_stats_update_end(&rxq->stats.syncp);
+	} else if (!i && !pktlen) {
+		u64_stats_update_begin(&rxq->stats.syncp);
+		rxq->stats.pkt_len0_err++;
+		u64_stats_update_end(&rxq->stats.syncp);
+		netdev_err_once(ndev,
+				"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
+				rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+	}
 }
 
 static void mana_poll_rx_cq(struct mana_cq *cq)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 4b234b16e57a..6a4b42fe0944 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -149,7 +149,7 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
-	int i;
+	int i, j;
 
 	if (stringset != ETH_SS_STATS)
 		return;
@@ -168,6 +168,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
 		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
 		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
@@ -201,6 +204,8 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	u64 xdp_xmit;
 	u64 xdp_drop;
 	u64 xdp_tx;
+	u64 pkt_len0_err;
+	u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
 	u64 tso_packets;
 	u64 tso_bytes;
 	u64 tso_inner_packets;
@@ -209,7 +214,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	u64 short_pkt_fmt;
 	u64 csum_partial;
 	u64 mana_map_err;
-	int q, i = 0;
+	int q, i = 0, j;
 
 	if (!apc->port_is_up)
 		return;
@@ -239,6 +244,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 			xdp_drop = rx_stats->xdp_drop;
 			xdp_tx = rx_stats->xdp_tx;
 			xdp_redirect = rx_stats->xdp_redirect;
+			pkt_len0_err = rx_stats->pkt_len0_err;
+			for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+				coalesced_cqe[j] = rx_stats->coalesced_cqe[j];
 		} while (u64_stats_fetch_retry(&rx_stats->syncp, start));
 
 		data[i++] = packets;
@@ -246,6 +254,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 		data[i++] = xdp_drop;
 		data[i++] = xdp_tx;
 		data[i++] = xdp_redirect;
+		data[i++] = pkt_len0_err;
+		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+			data[i++] = coalesced_cqe[j];
 	}
 
 	for (q = 0; q < num_queues; q++) {
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a7f89e7ddc56..3336688fed5e 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -61,8 +61,11 @@ enum TRI_STATE {
 
 #define MAX_PORTS_IN_MANA_DEV 256
 
+/* Maximum number of packets per coalesced CQE */
+#define MANA_RXCOMP_OOB_NUM_PPI 4
+
 /* Update this count whenever the respective structures are changed */
-#define MANA_STATS_RX_COUNT 5
+#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
 #define MANA_STATS_TX_COUNT 11
 
 #define MANA_RX_FRAG_ALIGNMENT 64
@@ -73,6 +76,8 @@ struct mana_stats_rx {
 	u64 xdp_drop;
 	u64 xdp_tx;
 	u64 xdp_redirect;
+	u64 pkt_len0_err;
+	u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
 	struct u64_stats_sync syncp;
 };
 
@@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
 	u32 pkt_hash;
 }; /* HW DATA */
 
-#define MANA_RXCOMP_OOB_NUM_PPI 4
-
 /* Receive completion OOB */
 struct mana_rxcomp_oob {
 	struct mana_cqe_header cqe_hdr;
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next v6 2/3] net: mana: Add support for RX CQE Coalescing
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Saurabh Sengar, Dipayaan Roy, Aditya Garg,
	Shiraz Saleem, Kees Cook, Subbaraya Sundeep, Breno Leitao,
	linux-kernel, linux-rdma
  Cc: paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
check and process the type CQE_RX_COALESCED_4. The default setting is
disabled, to avoid possible regression on latency.

And, add ethtool handler to switch this feature. To turn it on, run:
  ethtool -C <nic> rx-cqe-frames 4
To turn it off:
  ethtool -C <nic> rx-cqe-frames 1

The rx-cqe-nsec is the time out value in nanoseconds after the first
packet arrival in a coalesced CQE to be sent. It's read-only for this
NIC.

Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v4:
  Fixed the old_buf issue found by AI.

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 74 ++++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    | 60 ++++++++++++++-
 include/net/mana/mana.h                       |  8 +-
 3 files changed, 113 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ea71de39f996..fa30046dcd3d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1365,6 +1365,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 			     sizeof(resp));
 
 	req->hdr.req.msg_version = GDMA_MESSAGE_V2;
+	req->hdr.resp.msg_version = GDMA_MESSAGE_V2;
 
 	req->vport = apc->port_handle;
 	req->num_indir_entries = apc->indir_table_sz;
@@ -1376,7 +1377,9 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	req->update_hashkey = update_key;
 	req->update_indir_tab = update_tab;
 	req->default_rxobj = apc->default_rxobj;
-	req->cqe_coalescing_enable = 0;
+
+	if (rx != TRI_STATE_FALSE)
+		req->cqe_coalescing_enable = apc->cqe_coalescing_enable;
 
 	if (update_key)
 		memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
@@ -1405,8 +1408,13 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 		netdev_err(ndev, "vPort RX configuration failed: 0x%x\n",
 			   resp.hdr.status);
 		err = -EPROTO;
+		goto out;
 	}
 
+	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
+		apc->cqe_coalescing_timeout_ns =
+			resp.cqe_coalescing_timeout_ns;
+
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
 out:
@@ -1915,11 +1923,12 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 }
 
 static void mana_rx_skb(void *buf_va, bool from_pool,
-			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq,
+			int i)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
-	uint pkt_len = cqe->ppi[0].pkt_len;
+	uint pkt_len = cqe->ppi[i].pkt_len;
 	u16 rxq_idx = rxq->rxq_idx;
 	struct napi_struct *napi;
 	struct xdp_buff xdp = {};
@@ -1963,7 +1972,7 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 	}
 
 	if (cqe->rx_hashtype != 0 && (ndev->features & NETIF_F_RXHASH)) {
-		hash_value = cqe->ppi[0].pkt_hash;
+		hash_value = cqe->ppi[i].pkt_hash;
 
 		if (cqe->rx_hashtype & MANA_HASH_L4)
 			skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L4);
@@ -2098,9 +2107,11 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct mana_recv_buf_oob *rxbuf_oob;
 	struct mana_port_context *apc;
 	struct device *dev = gc->dev;
+	bool coalesced = false;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
 	bool old_fp;
+	int i;
 
 	apc = netdev_priv(ndev);
 
@@ -2112,13 +2123,16 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		++ndev->stats.rx_dropped;
 		rxbuf_oob = &rxq->rx_oobs[rxq->buf_index];
 		netdev_warn_once(ndev, "Dropped a truncated packet\n");
-		goto drop;
 
-	case CQE_RX_COALESCED_4:
-		netdev_err(ndev, "RX coalescing is unsupported\n");
-		apc->eth_stats.rx_coalesced_err++;
+		mana_move_wq_tail(rxq->gdma_rq,
+				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
+		mana_post_pkt_rxq(rxq);
 		return;
 
+	case CQE_RX_COALESCED_4:
+		coalesced = true;
+		break;
+
 	case CQE_RX_OBJECT_FENCE:
 		complete(&rxq->fence_event);
 		return;
@@ -2130,30 +2144,37 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		return;
 	}
 
-	pktlen = oob->ppi[0].pkt_len;
+	for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
+		old_buf = NULL;
+		pktlen = oob->ppi[i].pkt_len;
+		if (pktlen == 0) {
+			if (i == 0)
+				netdev_err_once(
+					ndev,
+					"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
+					rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+			break;
+		}
 
-	if (pktlen == 0) {
-		/* data packets should never have packetlength of zero */
-		netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
-			   rxq->gdma_id, cq->gdma_id, rxq->rxobj);
-		return;
-	}
+		curr = rxq->buf_index;
+		rxbuf_oob = &rxq->rx_oobs[curr];
+		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-	curr = rxq->buf_index;
-	rxbuf_oob = &rxq->rx_oobs[curr];
-	WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
+		mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
 
-	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+		/* Unsuccessful refill will have old_buf == NULL.
+		 * In this case, mana_rx_skb() will drop the packet.
+		 */
+		mana_rx_skb(old_buf, old_fp, oob, rxq, i);
 
-	/* Unsuccessful refill will have old_buf == NULL.
-	 * In this case, mana_rx_skb() will drop the packet.
-	 */
-	mana_rx_skb(old_buf, old_fp, oob, rxq);
+		mana_move_wq_tail(rxq->gdma_rq,
+				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
 
-drop:
-	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
+		mana_post_pkt_rxq(rxq);
 
-	mana_post_pkt_rxq(rxq);
+		if (!coalesced)
+			break;
+	}
 }
 
 static void mana_poll_rx_cq(struct mana_cq *cq)
@@ -3332,6 +3353,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->port_handle = INVALID_MANA_HANDLE;
 	apc->pf_filter_handle = INVALID_MANA_HANDLE;
 	apc->port_idx = port_idx;
+	apc->cqe_coalescing_enable = 0;
 
 	mutex_init(&apc->vport_mutex);
 	apc->vport_use_count = 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..4b234b16e57a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] = {
 					tx_cqe_unknown_type)},
 	{"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
 				       tx_linear_pkt_cnt)},
-	{"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
-					rx_coalesced_err)},
 	{"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
 					rx_cqe_unknown_type)},
 };
@@ -390,6 +388,61 @@ static void mana_get_channels(struct net_device *ndev,
 	channel->combined_count = apc->num_queues;
 }
 
+#define MANA_RX_CQE_NSEC_DEF 2048
+static int mana_get_coalesce(struct net_device *ndev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	kernel_coal->rx_cqe_frames =
+		apc->cqe_coalescing_enable ? MANA_RXCOMP_OOB_NUM_PPI : 1;
+
+	kernel_coal->rx_cqe_nsecs = apc->cqe_coalescing_timeout_ns;
+
+	/* Return the default timeout value for old FW not providing
+	 * this value.
+	 */
+	if (apc->port_is_up && apc->cqe_coalescing_enable &&
+	    !kernel_coal->rx_cqe_nsecs)
+		kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
+
+	return 0;
+}
+
+static int mana_set_coalesce(struct net_device *ndev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u8 saved_cqe_coalescing_enable;
+	int err;
+
+	if (kernel_coal->rx_cqe_frames != 1 &&
+	    kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "rx-frames must be 1 or %u, got %u",
+				   MANA_RXCOMP_OOB_NUM_PPI,
+				   kernel_coal->rx_cqe_frames);
+		return -EINVAL;
+	}
+
+	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+	apc->cqe_coalescing_enable =
+		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
+
+	if (!apc->port_is_up)
+		return 0;
+
+	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+	if (err)
+		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+
+	return err;
+}
+
 static int mana_set_channels(struct net_device *ndev,
 			     struct ethtool_channels *channels)
 {
@@ -510,6 +563,7 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 }
 
 const struct ethtool_ops mana_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
 	.get_sset_count		= mana_get_sset_count,
 	.get_strings		= mana_get_strings,
@@ -520,6 +574,8 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_rxfh		= mana_set_rxfh,
 	.get_channels		= mana_get_channels,
 	.set_channels		= mana_set_channels,
+	.get_coalesce		= mana_get_coalesce,
+	.set_coalesce		= mana_set_coalesce,
 	.get_ringparam          = mana_get_ringparam,
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..a7f89e7ddc56 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -378,7 +378,6 @@ struct mana_ethtool_stats {
 	u64 tx_cqe_err;
 	u64 tx_cqe_unknown_type;
 	u64 tx_linear_pkt_cnt;
-	u64 rx_coalesced_err;
 	u64 rx_cqe_unknown_type;
 };
 
@@ -557,6 +556,9 @@ struct mana_port_context {
 	bool port_is_up;
 	bool port_st_save; /* Saved port state */
 
+	u8 cqe_coalescing_enable;
+	u32 cqe_coalescing_timeout_ns;
+
 	struct mana_ethtool_stats eth_stats;
 
 	struct mana_ethtool_phy_stats phy_stats;
@@ -902,6 +904,10 @@ struct mana_cfg_rx_steer_req_v2 {
 
 struct mana_cfg_rx_steer_resp {
 	struct gdma_resp_hdr hdr;
+
+	/* V2 */
+	u32 cqe_coalescing_timeout_ns;
+	u32 reserved1;
 }; /* HW DATA */
 
 /* Register HW vPort */
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next v6 1/3] net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
  To: linux-hyperv, netdev, Andrew Lunn, Jakub Kicinski,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Donald Hunter, Jonathan Corbet, Shuah Khan,
	Kory Maincent (Dent Project), Gal Pressman, Oleksij Rempel,
	Vadim Fedorenko, linux-kernel, linux-doc
  Cc: haiyangz, paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

Add two parameters for drivers supporting Rx CQE coalescing /
descriptor writeback.

ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.

ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Max time in nanoseconds after the first packet arrival in a
coalesced CQE or writeback to be sent.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v6:
  Updated comment to include "descriptor writeback", as suggested by
  Jakub Kicinski

---
 Documentation/netlink/specs/ethtool.yaml       |  8 ++++++++
 Documentation/networking/ethtool-netlink.rst   | 11 +++++++++++
 include/linux/ethtool.h                        |  6 +++++-
 include/uapi/linux/ethtool_netlink_generated.h |  2 ++
 net/ethtool/coalesce.c                         | 14 +++++++++++++-
 5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 4707063af3b4..d254e26c014c 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -861,6 +861,12 @@ attribute-sets:
         name: tx-profile
         type: nest
         nested-attributes: profile
+      -
+        name: rx-cqe-frames
+        type: u32
+      -
+        name: rx-cqe-nsecs
+        type: u32
 
   -
     name: pause-stat
@@ -2257,6 +2263,8 @@ operations:
             - tx-aggr-time-usecs
             - rx-profile
             - tx-profile
+            - rx-cqe-frames
+            - rx-cqe-nsecs
       dump: *coalesce-get-op
     -
       name: coalesce-set
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 32179168eb73..e92abf45faf5 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1076,6 +1076,8 @@ Kernel response contents:
   ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
   ``ETHTOOL_A_COALESCE_RX_PROFILE``            nested  profile of DIM, Rx
   ``ETHTOOL_A_COALESCE_TX_PROFILE``            nested  profile of DIM, Tx
+  ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES``         u32     max packets, Rx CQE
+  ``ETHTOOL_A_COALESCE_RX_CQE_NSECS``          u32     delay (ns), Rx CQE
   ===========================================  ======  =======================
 
 Attributes are only included in reply if their value is not zero or the
@@ -1109,6 +1111,13 @@ well with frequent small-sized URBs transmissions.
 to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
 <https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
 
+Rx CQE coalescing allows multiple received packets to be coalesced into a
+single Completion Queue Entry (CQE) or descriptor writeback.
+``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` describes the maximum number of
+frames that can be coalesced into a CQE or writeback.
+``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` describes max time in nanoseconds after
+the first packet arrival in a coalesced CQE or writeback to be sent.
+
 COALESCE_SET
 ============
 
@@ -1147,6 +1156,8 @@ Request contents:
   ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
   ``ETHTOOL_A_COALESCE_RX_PROFILE``            nested  profile of DIM, Rx
   ``ETHTOOL_A_COALESCE_TX_PROFILE``            nested  profile of DIM, Tx
+  ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES``         u32     max packets, Rx CQE
+  ``ETHTOOL_A_COALESCE_RX_CQE_NSECS``          u32     delay (ns), Rx CQE
   ===========================================  ======  =======================
 
 Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 83c375840835..656d465bcd06 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -332,6 +332,8 @@ struct kernel_ethtool_coalesce {
 	u32 tx_aggr_max_bytes;
 	u32 tx_aggr_max_frames;
 	u32 tx_aggr_time_usecs;
+	u32 rx_cqe_frames;
+	u32 rx_cqe_nsecs;
 };
 
 /**
@@ -380,7 +382,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 #define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS	BIT(26)
 #define ETHTOOL_COALESCE_RX_PROFILE		BIT(27)
 #define ETHTOOL_COALESCE_TX_PROFILE		BIT(28)
-#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(28, 0)
+#define ETHTOOL_COALESCE_RX_CQE_FRAMES		BIT(29)
+#define ETHTOOL_COALESCE_RX_CQE_NSECS		BIT(30)
+#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(30, 0)
 
 #define ETHTOOL_COALESCE_USECS						\
 	(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index 114b83017297..8134baf7860f 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -371,6 +371,8 @@ enum {
 	ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
 	ETHTOOL_A_COALESCE_RX_PROFILE,
 	ETHTOOL_A_COALESCE_TX_PROFILE,
+	ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+	ETHTOOL_A_COALESCE_RX_CQE_NSECS,
 
 	__ETHTOOL_A_COALESCE_CNT,
 	ETHTOOL_A_COALESCE_MAX = (__ETHTOOL_A_COALESCE_CNT - 1)
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 3e18ca1ccc5e..349bb02c517a 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -118,6 +118,8 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_BYTES */
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_MAX_FRAMES */
 	       nla_total_size(sizeof(u32)) +	/* _TX_AGGR_TIME_USECS */
+	       nla_total_size(sizeof(u32)) +	/* _RX_CQE_FRAMES */
+	       nla_total_size(sizeof(u32)) +	/* _RX_CQE_NSECS */
 	       total_modersz * 2;		/* _{R,T}X_PROFILE */
 }
 
@@ -269,7 +271,11 @@ static int coalesce_fill_reply(struct sk_buff *skb,
 	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,
 			     kcoal->tx_aggr_max_frames, supported) ||
 	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
-			     kcoal->tx_aggr_time_usecs, supported))
+			     kcoal->tx_aggr_time_usecs, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+			     kcoal->rx_cqe_frames, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_NSECS,
+			     kcoal->rx_cqe_nsecs, supported))
 		return -EMSGSIZE;
 
 	if (!req_base->dev || !req_base->dev->irq_moder)
@@ -338,6 +344,8 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
 	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_RX_CQE_FRAMES] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_RX_CQE_NSECS] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_RX_PROFILE] =
 		NLA_POLICY_NESTED(coalesce_profile_policy),
 	[ETHTOOL_A_COALESCE_TX_PROFILE] =
@@ -570,6 +578,10 @@ __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
 			 tb[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES], &mod);
 	ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
 			 tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
+	ethnl_update_u32(&kernel_coalesce.rx_cqe_frames,
+			 tb[ETHTOOL_A_COALESCE_RX_CQE_FRAMES], &mod);
+	ethnl_update_u32(&kernel_coalesce.rx_cqe_nsecs,
+			 tb[ETHTOOL_A_COALESCE_RX_CQE_NSECS], &mod);
 
 	if (dev->irq_moder && dev->irq_moder->profile_flags & DIM_PROFILE_RX) {
 		ret = ethnl_update_profile(dev, &dev->irq_moder->rx_profile,
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next v6 0/3] add ethtool COALESCE_RX_CQE_FRAMES/NSECS and use it in MANA driver
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
  To: linux-hyperv, netdev; +Cc: haiyangz, paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Add two parameters for drivers supporting Rx CQE Coalescing.

ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.

ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Max time in nanoseconds after the first packet arrival in a
coalesced CQE or writeback to be sent.

Also implement it in MANA driver with the new parameter and
counters.


Haiyang Zhang (3):
  net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
  net: mana: Add support for RX CQE Coalescing
  net: mana: Add ethtool counters for RX CQEs in coalesced type

 Documentation/netlink/specs/ethtool.yaml      |  8 ++
 Documentation/networking/ethtool-netlink.rst  | 11 +++
 drivers/net/ethernet/microsoft/mana/mana_en.c | 84 +++++++++++++------
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++++++-
 include/linux/ethtool.h                       |  6 +-
 include/net/mana/mana.h                       | 17 +++-
 .../uapi/linux/ethtool_netlink_generated.h    |  2 +
 net/ethtool/coalesce.c                        | 14 +++-
 8 files changed, 181 insertions(+), 36 deletions(-)

-- 
2.34.1


^ permalink raw reply

* RE: [EXTERNAL] [PATCH 1/2] Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
From: Long Li @ 2026-03-17 18:43 UTC (permalink / raw)
  To: mhklinux@outlook.com, drawat.floss@gmail.com,
	maarten.lankhorst@linux.intel.com, mripard@kernel.org,
	tzimmermann@suse.de, airlied@gmail.com, simona@ffwll.ch,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	ryasuoka@redhat.com, jfalempe@redhat.com
  Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260209070201.1492-1-mhklinux@outlook.com>

> 
> Currently, VMBus code initiates a VMBus unload in the panic path so that if a
> kdump kernel is loaded, it can start fresh in setting up its own VMBus connection.
> However, a driver for the VMBus virtual frame buffer may need to flush dirty
> portions of the frame buffer back to the Hyper-V host so that panic information is
> visible in the graphics console. To support such flushing, provide exported
> functions for the frame buffer driver to specify that the VMBus unload should not
> be done by the VMBus driver, and to initiate the VMBus unload itself.
> Together these allow a frame buffer driver to delay the VMBus unload until after
> it has completed the flush.
> 
> Ideally, the VMBus driver could use its own panic-path callback to do the unload
> after all frame buffer drivers have finished. But DRM frame buffer drivers use the
> kmsg dump callback, and there are no callbacks after that in the panic path.
> Hence this somewhat messy approach to properly sequencing the frame buffer
> flush and the VMBus unload.
> 
> Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Reviewed-by: Long Li <longli@microsoft.com>


> ---
>  drivers/hv/channel_mgmt.c |  1 +
>  drivers/hv/hyperv_vmbus.h |  1 -
>  drivers/hv/vmbus_drv.c    | 25 ++++++++++++++++++-------
>  include/linux/hyperv.h    |  3 +++
>  4 files changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index
> 74fed2c073d4..5de83676dbad 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -944,6 +944,7 @@ void vmbus_initiate_unload(bool crash)
>  	else
>  		vmbus_wait_for_unload();
>  }
> +EXPORT_SYMBOL_GPL(vmbus_initiate_unload);
> 
>  static void vmbus_setup_channel_state(struct vmbus_channel *channel,
>  				      struct vmbus_channel_offer_channel *offer)
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index
> cdbc5f5c3215..5d3944fc93ae 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -440,7 +440,6 @@ void hv_vss_deinit(void);  int hv_vss_pre_suspend(void);
> int hv_vss_pre_resume(void);  void hv_vss_onchannelcallback(void *context); -
> void vmbus_initiate_unload(bool crash);
> 
>  static inline void hv_poll_channel(struct vmbus_channel *channel,
>  				   void (*cb)(void *))
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index
> 6785ad63a9cb..97dfa529d250 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -69,19 +69,29 @@ bool vmbus_is_confidential(void)  }
> EXPORT_SYMBOL_GPL(vmbus_is_confidential);
> 
> +static bool skip_vmbus_unload;
> +
> +/*
> + * Allow a VMBus framebuffer driver to specify that in the case of a
> +panic,
> + * it will do the VMbus unload operation once it has flushed any dirty
> + * portions of the framebuffer to the Hyper-V host.
> + */
> +void vmbus_set_skip_unload(bool skip)
> +{
> +	skip_vmbus_unload = skip;
> +}
> +EXPORT_SYMBOL_GPL(vmbus_set_skip_unload);
> +
>  /*
>   * The panic notifier below is responsible solely for unloading the
>   * vmbus connection, which is necessary in a panic event.
> - *
> - * Notice an intrincate relation of this notifier with Hyper-V
> - * framebuffer panic notifier exists - we need vmbus connection alive
> - * there in order to succeed, so we need to order both with each other
> - * [see hvfb_on_panic()] - this is done using notifiers' priorities.
>   */
>  static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
>  			      void *args)
>  {
> -	vmbus_initiate_unload(true);
> +	if (!skip_vmbus_unload)
> +		vmbus_initiate_unload(true);
> +
>  	return NOTIFY_DONE;
>  }
>  static struct notifier_block hyperv_panic_vmbus_unload_block = { @@ -2848,7
> +2858,8 @@ static void hv_crash_handler(struct pt_regs *regs)  {
>  	int cpu;
> 
> -	vmbus_initiate_unload(true);
> +	if (!skip_vmbus_unload)
> +		vmbus_initiate_unload(true);
>  	/*
>  	 * In crash handler we can't schedule synic cleanup for all CPUs,
>  	 * doing the cleanup for current CPU only. This should be sufficient diff --
> git a/include/linux/hyperv.h b/include/linux/hyperv.h index
> dfc516c1c719..b0502a336eb3 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1334,6 +1334,9 @@ int vmbus_allocate_mmio(struct resource **new,
> struct hv_device *device_obj,
>  			bool fb_overlap_ok);
>  void vmbus_free_mmio(resource_size_t start, resource_size_t size);
> 
> +void vmbus_initiate_unload(bool crash); void vmbus_set_skip_unload(bool
> +skip);
> +
>  /*
>   * GUID definitions of various offer types - services offered to the guest.
>   */
> --
> 2.25.1


^ permalink raw reply

* RE: [EXTERNAL] [PATCH 2/2] drm/hyperv: During panic do VMBus unload after frame buffer is flushed
From: Long Li @ 2026-03-17 18:43 UTC (permalink / raw)
  To: mhklinux@outlook.com, drawat.floss@gmail.com,
	maarten.lankhorst@linux.intel.com, mripard@kernel.org,
	tzimmermann@suse.de, airlied@gmail.com, simona@ffwll.ch,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	ryasuoka@redhat.com, jfalempe@redhat.com
  Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260209070201.1492-2-mhklinux@outlook.com>

> 
> In a VM, Linux panic information (reason for the panic, stack trace,
> etc.) may be written to a serial console and/or a virtual frame buffer for a
> graphics console. The latter may need to be flushed back to the host hypervisor
> for display.
> 
> The current Hyper-V DRM driver for the frame buffer does the flushing
> *after* the VMBus connection has been unloaded, such that panic messages are
> not displayed on the graphics console. A user with a Hyper-V graphics console is
> left with just a hung empty screen after a panic. The enhanced control that DRM
> provides over the panic display in the graphics console is similarly non-functional.
> 
> Commit 3671f3777758 ("drm/hyperv: Add support for drm_panic") added the
> Hyper-V DRM driver support to flush the virtual frame buffer. It provided
> necessary functionality but did not handle the sequencing problem with VMBus
> unload.
> 
> Fix the full problem by using VMBus functions to suppress the VMBus unload that
> is normally done by the VMBus driver in the panic path. Then after the frame
> buffer has been flushed, do the VMBus unload so that a kdump kernel can start
> cleanly. As expected, CONFIG_DRM_PANIC must be selected for these changes to
> have effect. As a side benefit, the enhanced features of the DRM panic path are
> also functional.
> 
> Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Reviewed-by: Long Li <longli@microsoft.com>

> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_drv.c     |  4 ++++
>  drivers/gpu/drm/hyperv/hyperv_drm_modeset.c | 15 ++++++++-------
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index 06b5d96e6eaf..79e51643be67 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -150,6 +150,9 @@ static int hyperv_vmbus_probe(struct hv_device *hdev,
>  		goto err_free_mmio;
>  	}
> 
> +	/* If DRM panic path is stubbed out VMBus code must do the unload */
> +	if (IS_ENABLED(CONFIG_DRM_PANIC) &&
> IS_ENABLED(CONFIG_PRINTK))
> +		vmbus_set_skip_unload(true);
>  	drm_client_setup(dev, NULL);
> 
>  	return 0;
> @@ -169,6 +172,7 @@ static void hyperv_vmbus_remove(struct hv_device
> *hdev)
>  	struct drm_device *dev = hv_get_drvdata(hdev);
>  	struct hyperv_drm_device *hv = to_hv(dev);
> 
> +	vmbus_set_skip_unload(false);
>  	drm_dev_unplug(dev);
>  	drm_atomic_helper_shutdown(dev);
>  	vmbus_close(hdev->channel);
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> index 7978f8c8108c..d48ca6c23b7c 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> @@ -212,15 +212,16 @@ static void hyperv_plane_panic_flush(struct
> drm_plane *plane)
>  	struct hyperv_drm_device *hv = to_hv(plane->dev);
>  	struct drm_rect rect;
> 
> -	if (!plane->state || !plane->state->fb)
> -		return;
> +	if (plane->state && plane->state->fb) {
> +		rect.x1 = 0;
> +		rect.y1 = 0;
> +		rect.x2 = plane->state->fb->width;
> +		rect.y2 = plane->state->fb->height;
> 
> -	rect.x1 = 0;
> -	rect.y1 = 0;
> -	rect.x2 = plane->state->fb->width;
> -	rect.y2 = plane->state->fb->height;
> +		hyperv_update_dirt(hv->hdev, &rect);
> +	}
> 
> -	hyperv_update_dirt(hv->hdev, &rect);
> +	vmbus_initiate_unload(true);
>  }
> 
>  static const struct drm_plane_helper_funcs hyperv_plane_helper_funcs = {
> --
> 2.25.1


^ permalink raw reply

* RE: [EXTERNAL] [PATCH 15/16] RDMA: Remove redundant = {} for udata req structs
From: Long Li @ 2026-03-17 18:16 UTC (permalink / raw)
  To: Jason Gunthorpe, Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler, Bryan Tan,
	Cheng Xu, Gal Pressman, Junxian Huang, Kai Shen,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org,
	Michal Kalderon, Michael Margolin, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Yossi Leybovich, Chengchang Tang, Tatyana Nikolova,
	Vishnu Dasa, Yishai Hadas, Zhu Yanjun
  Cc: patches@lists.linux.dev
In-Reply-To: <15-v1-2b86f54cda42+7d-rdma_udata_req_jgg@nvidia.com>

> 
> Now that all of the udata request structs are loaded with the helpers the callers
> should not pre-zero them. The helpers all guarantee that the entire struct is filled
> with something.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Long Li <longli@microsoft.com>


> ---
>  drivers/infiniband/hw/efa/efa_verbs.c       | 4 ++--
>  drivers/infiniband/hw/hns/hns_roce_main.c   | 2 +-
>  drivers/infiniband/hw/hns/hns_roce_srq.c    | 2 +-
>  drivers/infiniband/hw/mana/cq.c             | 2 +-
>  drivers/infiniband/hw/mana/qp.c             | 2 +-
>  drivers/infiniband/hw/mana/wq.c             | 2 +-
>  drivers/infiniband/hw/mlx4/qp.c             | 4 ++--
>  drivers/infiniband/hw/mlx5/cq.c             | 2 +-
>  drivers/infiniband/hw/mlx5/main.c           | 2 +-
>  drivers/infiniband/hw/mlx5/mr.c             | 2 +-
>  drivers/infiniband/hw/mlx5/qp.c             | 4 ++--
>  drivers/infiniband/hw/mlx5/srq.c            | 2 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 4 +++-
>  drivers/infiniband/hw/qedr/verbs.c          | 8 ++++----
>  14 files changed, 22 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c
> b/drivers/infiniband/hw/efa/efa_verbs.c
> index b491bcd886ccb0..f1020921f0e742 100644
> --- a/drivers/infiniband/hw/efa/efa_verbs.c
> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> @@ -682,7 +682,7 @@ int efa_create_qp(struct ib_qp *ibqp, struct
> ib_qp_init_attr *init_attr,
>  	struct efa_com_create_qp_result create_qp_resp;
>  	struct efa_dev *dev = to_edev(ibqp->device);
>  	struct efa_ibv_create_qp_resp resp = {};
> -	struct efa_ibv_create_qp cmd = {};
> +	struct efa_ibv_create_qp cmd;
>  	struct efa_qp *qp = to_eqp(ibqp);
>  	struct efa_ucontext *ucontext;
>  	u16 supported_efa_flags = 0;
> @@ -1121,7 +1121,7 @@ int efa_create_user_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
>  	struct efa_com_create_cq_result result;
>  	struct ib_device *ibdev = ibcq->device;
>  	struct efa_dev *dev = to_edev(ibdev);
> -	struct efa_ibv_create_cq cmd = {};
> +	struct efa_ibv_create_cq cmd;
>  	struct efa_cq *cq = to_ecq(ibcq);
>  	int entries = attr->cqe;
>  	bool set_src_addr;
> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c
> b/drivers/infiniband/hw/hns/hns_roce_main.c
> index ec6fb3f1177941..0dbe99aab6ad21 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_main.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> @@ -425,7 +425,7 @@ static int hns_roce_alloc_ucontext(struct ib_ucontext
> *uctx,
>  	struct hns_roce_ucontext *context = to_hr_ucontext(uctx);
>  	struct hns_roce_dev *hr_dev = to_hr_dev(uctx->device);
>  	struct hns_roce_ib_alloc_ucontext_resp resp = {};
> -	struct hns_roce_ib_alloc_ucontext ucmd = {};
> +	struct hns_roce_ib_alloc_ucontext ucmd;
>  	int ret = -EAGAIN;
> 
>  	if (!hr_dev->active)
> diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c
> b/drivers/infiniband/hw/hns/hns_roce_srq.c
> index b37a76587aa868..601f8cdfce96a3 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_srq.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
> @@ -406,7 +406,7 @@ static int alloc_srq_db(struct hns_roce_dev *hr_dev,
> struct hns_roce_srq *srq,
>  			struct ib_udata *udata,
>  			struct hns_roce_ib_create_srq_resp *resp)  {
> -	struct hns_roce_ib_create_srq ucmd = {};
> +	struct hns_roce_ib_create_srq ucmd;
>  	struct hns_roce_ucontext *uctx;
>  	int ret;
> 
> diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
> index 3f932ef6e5fff6..f4cbe21763bf11 100644
> --- a/drivers/infiniband/hw/mana/cq.c
> +++ b/drivers/infiniband/hw/mana/cq.c
> @@ -13,7 +13,7 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
>  	struct mana_ib_create_cq_resp resp = {};
>  	struct mana_ib_ucontext *mana_ucontext;
>  	struct ib_device *ibdev = ibcq->device;
> -	struct mana_ib_create_cq ucmd = {};
> +	struct mana_ib_create_cq ucmd;
>  	struct mana_ib_dev *mdev;
>  	bool is_rnic_cq;
>  	u32 doorbell;
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 69c8d4f7a1f46b..ddc30d37d715f6 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -97,7 +97,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct
> ib_pd *pd,
>  		container_of(pd->device, struct mana_ib_dev, ib_dev);
>  	struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
>  	struct mana_ib_create_qp_rss_resp resp = {};
> -	struct mana_ib_create_qp_rss ucmd = {};
> +	struct mana_ib_create_qp_rss ucmd;
>  	mana_handle_t *mana_ind_table;
>  	struct mana_port_context *mpc;
>  	unsigned int ind_tbl_size;
> diff --git a/drivers/infiniband/hw/mana/wq.c
> b/drivers/infiniband/hw/mana/wq.c index aceeea7f17b339..5c2134a0b1a196
> 100644
> --- a/drivers/infiniband/hw/mana/wq.c
> +++ b/drivers/infiniband/hw/mana/wq.c
> @@ -11,7 +11,7 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,  {
>  	struct mana_ib_dev *mdev =
>  		container_of(pd->device, struct mana_ib_dev, ib_dev);
> -	struct mana_ib_create_wq ucmd = {};
> +	struct mana_ib_create_wq ucmd;
>  	struct mana_ib_wq *wq;
>  	int err;
> 
> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> index cfb54ffcaac22c..790be09d985a1a 100644
> --- a/drivers/infiniband/hw/mlx4/qp.c
> +++ b/drivers/infiniband/hw/mlx4/qp.c
> @@ -709,7 +709,7 @@ static int _mlx4_ib_create_qp_rss(struct ib_pd *pd,
> struct mlx4_ib_qp *qp,
>  				  struct ib_qp_init_attr *init_attr,
>  				  struct ib_udata *udata)
>  {
> -	struct mlx4_ib_create_qp_rss ucmd = {};
> +	struct mlx4_ib_create_qp_rss ucmd;
>  	int err;
> 
>  	if (!udata) {
> @@ -4230,7 +4230,7 @@ int mlx4_ib_modify_wq(struct ib_wq *ibwq, struct
> ib_wq_attr *wq_attr,
>  		      u32 wq_attr_mask, struct ib_udata *udata)  {
>  	struct mlx4_ib_qp *qp = to_mqp((struct ib_qp *)ibwq);
> -	struct mlx4_ib_modify_wq ucmd = {};
> +	struct mlx4_ib_modify_wq ucmd;
>  	enum ib_wq_state cur_state, new_state;
>  	int err;
> 
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index f5e75e51c6763f..1f94863e755cc7 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -720,7 +720,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct
> ib_udata *udata,
>  			  int *cqe_size, int *index, int *inlen,
>  			  struct uverbs_attr_bundle *attrs)
>  {
> -	struct mlx5_ib_create_cq ucmd = {};
> +	struct mlx5_ib_create_cq ucmd;
>  	unsigned long page_size;
>  	unsigned int page_offset_quantized;
>  	__be64 *pas;
> diff --git a/drivers/infiniband/hw/mlx5/main.c
> b/drivers/infiniband/hw/mlx5/main.c
> index ff2c02c85625ce..fe3de414bfcad5 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -2178,7 +2178,7 @@ static int mlx5_ib_alloc_ucontext(struct ib_ucontext
> *uctx,  {
>  	struct ib_device *ibdev = uctx->device;
>  	struct mlx5_ib_dev *dev = to_mdev(ibdev);
> -	struct mlx5_ib_alloc_ucontext_req_v2 req = {};
> +	struct mlx5_ib_alloc_ucontext_req_v2 req;
>  	struct mlx5_ib_alloc_ucontext_resp resp = {};
>  	struct mlx5_ib_ucontext *context = to_mucontext(uctx);
>  	struct mlx5_bfreg_info *bfregi;
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 49dcc39836c047..37f3d19bd374ee 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1768,7 +1768,7 @@ int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct
> ib_udata *udata)
>  	u32 *in = NULL;
>  	void *mkc;
>  	int err;
> -	struct mlx5_ib_alloc_mw req = {};
> +	struct mlx5_ib_alloc_mw req;
>  	struct {
>  		__u32	comp_mask;
>  		__u32	response_length;
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index 3b602ed0a2dafc..8f50e7342a7694 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -4692,7 +4692,7 @@ int mlx5_ib_modify_qp(struct ib_qp *ibqp, struct
> ib_qp_attr *attr,
>  	struct mlx5_ib_dev *dev = to_mdev(ibqp->device);
>  	struct mlx5_ib_modify_qp_resp resp = {};
>  	struct mlx5_ib_qp *qp = to_mqp(ibqp);
> -	struct mlx5_ib_modify_qp ucmd = {};
> +	struct mlx5_ib_modify_qp ucmd;
>  	enum ib_qp_type qp_type;
>  	enum ib_qp_state cur_state, new_state;
>  	int err = -EINVAL;
> @@ -5379,7 +5379,7 @@ static int prepare_user_rq(struct ib_pd *pd,
>  			   struct mlx5_ib_rwq *rwq)
>  {
>  	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> -	struct mlx5_ib_create_wq ucmd = {};
> +	struct mlx5_ib_create_wq ucmd;
>  	int err;
> 
>  	err = ib_copy_validate_udata_in_cm(udata, ucmd, diff --git
> a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
> index 6d89c0242cab61..852f6f502d14d0 100644
> --- a/drivers/infiniband/hw/mlx5/srq.c
> +++ b/drivers/infiniband/hw/mlx5/srq.c
> @@ -45,7 +45,7 @@ static int create_srq_user(struct ib_pd *pd, struct
> mlx5_ib_srq *srq,
>  			   struct ib_udata *udata, int buf_size)  {
>  	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> -	struct mlx5_ib_create_srq ucmd = {};
> +	struct mlx5_ib_create_srq ucmd;
>  	struct mlx5_ib_ucontext *ucontext = rdma_udata_to_drv_context(
>  		udata, struct mlx5_ib_ucontext, ibucontext);
>  	int err;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index 8b285fcc638701..eed149f7a942b8 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -1311,12 +1311,14 @@ int ocrdma_create_qp(struct ib_qp *ibqp, struct
> ib_qp_init_attr *attrs,
>  	if (status)
>  		goto gen_err;
> 
> -	memset(&ureq, 0, sizeof(ureq));
>  	if (udata) {
>  		status = ib_copy_validate_udata_in(udata, ureq, rsvd1);
>  		if (status)
>  			return status;
> +	} else {
> +		memset(&ureq, 0, sizeof(ureq));
>  	}
> +
>  	ocrdma_set_qp_init_params(qp, pd, attrs);
>  	if (udata == NULL)
>  		qp->cap_flags |= (OCRDMA_QP_MW_BIND |
> OCRDMA_QP_LKEY0 | diff --git a/drivers/infiniband/hw/qedr/verbs.c
> b/drivers/infiniband/hw/qedr/verbs.c
> index 42d20b35ff3fe0..679aa6f3a63bc5 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -264,7 +264,7 @@ int qedr_alloc_ucontext(struct ib_ucontext *uctx, struct
> ib_udata *udata)
>  	int rc;
>  	struct qedr_ucontext *ctx = get_qedr_ucontext(uctx);
>  	struct qedr_alloc_ucontext_resp uresp = {};
> -	struct qedr_alloc_ucontext_req ureq = {};
> +	struct qedr_alloc_ucontext_req ureq;
>  	struct qedr_dev *dev = get_qedr_dev(ibdev);
>  	struct qed_rdma_add_user_out_params oparams;
>  	struct qedr_user_mmap_entry *entry;
> @@ -913,7 +913,7 @@ int qedr_create_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
>  	};
>  	struct qedr_dev *dev = get_qedr_dev(ibdev);
>  	struct qed_rdma_create_cq_in_params params;
> -	struct qedr_create_cq_ureq ureq = {};
> +	struct qedr_create_cq_ureq ureq;
>  	int vector = attr->comp_vector;
>  	int entries = attr->cqe;
>  	struct qedr_cq *cq = get_qedr_cq(ibcq); @@ -1541,7 +1541,7 @@ int
> qedr_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init_attr,
>  	struct qedr_dev *dev = get_qedr_dev(ibsrq->device);
>  	struct qed_rdma_create_srq_out_params out_params;
>  	struct qedr_pd *pd = get_qedr_pd(ibsrq->pd);
> -	struct qedr_create_srq_ureq ureq = {};
> +	struct qedr_create_srq_ureq ureq;
>  	u64 pbl_base_addr, phy_prod_pair_addr;
>  	struct qedr_srq_hwq_info *hw_srq;
>  	u32 page_cnt, page_size;
> @@ -1837,7 +1837,7 @@ static int qedr_create_user_qp(struct qedr_dev *dev,
>  	struct qed_rdma_create_qp_in_params in_params;
>  	struct qed_rdma_create_qp_out_params out_params;
>  	struct qedr_create_qp_uresp uresp = {};
> -	struct qedr_create_qp_ureq ureq = {};
> +	struct qedr_create_qp_ureq ureq;
>  	int alloc_and_init = rdma_protocol_roce(&dev->ibdev, 1);
>  	struct qedr_ucontext *ctx = NULL;
>  	struct qedr_pd *pd = NULL;
> --
> 2.43.0


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox