* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-17 7:49 UTC
To: Sebastian Andrzej Siewior
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley, Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260312170715.HA08BHiO@linutronix.de>
On 12.03.26 18:07, Sebastian Andrzej Siewior wrote:
> On 2026-02-16 17:24:56 [+0100], Jan Kiszka wrote:
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -25,6 +25,7 @@
>> #include <linux/cpu.h>
>> #include <linux/sched/isolation.h>
>> #include <linux/sched/task_stack.h>
>> +#include <linux/smpboot.h>
>>
>> #include <linux/delay.h>
>> #include <linux/panic_notifier.h>
>> @@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
>> }
>> }
>>
>> -void vmbus_isr(void)
>> +static void __vmbus_isr(void)
>> {
>> struct hv_per_cpu_context *hv_cpu
>> = this_cpu_ptr(hv_context.cpu_context);
>> @@ -1363,6 +1364,53 @@ void vmbus_isr(void)
>>
>> add_interrupt_randomness(vmbus_interrupt);
>
> This is feeding entropy and would like to see interrupt registers. But
> since this is invoked from a thread it won't.
>
Good point, will move this to vmbus_isr.
>> }
>> +
> …
>> +void vmbus_isr(void)
>> +{
>> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> + vmbus_irqd_wake();
>> + } else {
>> + lockdep_hardirq_threaded();
>
> What clears this? This is wrongly placed. This should go to
> sysvec_hyperv_callback() instead with its matching canceling part. The
> add_interrupt_randomness() should also be there and not here.
> sysvec_hyperv_stimer0() managed to do so.
First of all, we need to keep all this in generic code to avoid missing
arm64.
But the question about lockdep_hardirq_threaded() is valid - and that
not only for this new code: I tried hard to understand from the code how
hardirq_threaded is managed, but I simply couldn't find the spot where
it is reset after lockdep_hardirq_threaded() but before returning from
the interrupt to the task that now has hardirq_threaded=1. I failed, and
so I started a debugger. That confirms for the existing code path
(__handle_irq_event_percpu) that we are indeed returning to the
interrupted task with hardirq_threaded set. I'm not sure whether it is
intended that only the irq_enter_rcu->lockdep_hardirq_enter of the next
IRQ over this same task resets the flag again.
With that in mind, the new logic here is no different from the one the
kernel used before. If both are not doing what they should, we likely
want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
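For illustration, a minimal userspace model of the flag lifecycle described above. It assumes the lockdep helpers behave as in the debugger observation: the flag is cleared only on the outermost hardirq entry, set by lockdep_hardirq_threaded(), and left untouched on exit. All model_* names are hypothetical and exist only for this sketch; this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the interrupted task (current). */
struct model_task {
	bool hardirq_threaded;	/* models current->hardirq_threaded */
};

/* Models the per-CPU hardirq nesting counter. */
static int model_hardirq_context;

/* Models lockdep_hardirq_enter(): clear the flag on outermost entry only. */
static void model_hardirq_enter(struct model_task *t)
{
	if (++model_hardirq_context == 1)
		t->hardirq_threaded = false;
}

/* Models lockdep_hardirq_threaded(): mark the handler as threaded on RT. */
static void model_hardirq_threaded(struct model_task *t)
{
	t->hardirq_threaded = true;
}

/* Models the exit path: the counter drops, the flag is NOT reset. */
static void model_hardirq_exit(struct model_task *t)
{
	(void)t;
	model_hardirq_context--;
}

/* One interrupt over task t; 'threaded' selects whether the handler marks itself. */
static void model_interrupt(struct model_task *t, bool threaded)
{
	model_hardirq_enter(t);
	if (threaded)
		model_hardirq_threaded(t);
	model_hardirq_exit(t);
}
```

In this model the task keeps hardirq_threaded set after the first interrupt returns, and only the entry of a later interrupt clears it, which is the behaviour in question for both the existing and the new code path.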
>
> Different question: What guarantees that there won't be another
> interrupt before this one is done? The handshake appears to be
> deprecated. The interrupt itself returns ACKing (or not) but the actual
> handler is delayed to this thread. Depending on the userland it could
> take some time and I don't know how impatient the host is.
>
Good question. I guess people familiar with the hv interface need to
comment on that.
>> + __vmbus_isr();
> Moving on. This (trying very hard here) even schedules tasklets. Why?
> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> You don't want that.
>
You are referring to the pre-existing logic now, aren't you?
> Couldn't the whole logic be integrated into the IRQ code? Then we could
> have mask/ unmask if supported/ provided and threaded interrupts. Then
> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> instead of apic_eoi() + schedule_delayed_work().
>
Again, you are thinking x86-only. We need a portable solution.
>> + }
>> +}
>> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
>>
>> static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
>
> Sebastian
Jan
--
Siemens AG, Foundational Technologies
Linux Expert Center
* [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Jan Kiszka @ 2026-03-17 8:09 UTC
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
Sebastian Andrzej Siewior, Florian Bezdeka
From: Jan Kiszka <jan.kiszka@siemens.com>
Sebastian Siewior wrote:
"This is feeding entropy and would like to see interrupt registers. But
since this is invoked from a thread it won't."
So move it back to where it is always in interrupt context.
Fixes: f8e6343b7a89 ("Drivers: hv: vmbus: Simplify allocation of vmbus_evt")
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
drivers/hv/vmbus_drv.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc4fc1951ae1..28025a264861 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1361,8 +1361,6 @@ static void __vmbus_isr(void)
vmbus_message_sched(hv_cpu, hv_cpu->hyp_synic_message_page);
vmbus_message_sched(hv_cpu, hv_cpu->para_synic_message_page);
-
- add_interrupt_randomness(vmbus_interrupt);
}
static DEFINE_PER_CPU(bool, vmbus_irq_pending);
@@ -1410,6 +1408,8 @@ void vmbus_isr(void)
lockdep_hardirq_threaded();
__vmbus_isr();
}
+
+ add_interrupt_randomness(vmbus_interrupt);
}
EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
--
2.47.3
* Re: [PATCH 04/15] mm: add vm_ops->mapped hook
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 8:42 UTC
To: Suren Baghdasaryan
Cc: Usama Arif, Andrew Morton, Clemens Ladisch, Arnd Bergmann,
Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
Alexandre Torgue, Miquel Raynal, Richard Weinberger,
Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
linux-kernel, linux-doc, linux-hyperv, linux-stm32,
linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpGFiKd-1rDdMviy8mUFiCtB9pxPj6ux-tF60eB4uVm4=A@mail.gmail.com>
On Mon, Mar 16, 2026 at 04:39:00PM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 16, 2026 at 6:39 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Sun, Mar 15, 2026 at 07:18:38PM -0700, Suren Baghdasaryan wrote:
> > > On Fri, Mar 13, 2026 at 4:58 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > >
> > > > On Fri, Mar 13, 2026 at 04:02:36AM -0700, Usama Arif wrote:
> > > > > On Thu, 12 Mar 2026 20:27:19 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > > > >
> > > > > > Previously, when a driver needed to do something like establish a reference
> > > > > > count, it could do so in the mmap hook in the knowledge that the mapping
> > > > > > would succeed.
> > > > > >
> > > > > > With the introduction of f_op->mmap_prepare this is no longer the case, as
> > > > > > it is invoked prior to actually establishing the mapping.
> > > > > >
> > > > > > To take this into account, introduce a new vm_ops->mapped callback which is
> > > > > > invoked when the VMA is first mapped (though notably - not when it is
> > > > > > merged - which is correct and mirrors existing mmap/open/close behaviour).
> > > > > >
> > > > > > We do better than vm_ops->open() here, as this callback can return an
> > > > > > error, at which point the VMA will be unmapped.
> > > > > >
> > > > > > Note that vm_ops->mapped() is invoked after any mmap action is
> > > > > > complete (such as I/O remapping).
> > > > > >
> > > > > > We intentionally do not expose the VMA at this point, exposing only the
> > > > > > fields that could be used, and an output parameter in case the operation
> > > > > > needs to update the vma->vm_private_data field.
> > > > > >
> > > > > > In order to deal with stacked filesystems which invoke inner filesystem's
> > > > > > mmap() invocations, add __compat_vma_mapped() and invoke it on
> > > > > > vfs_mmap() (via compat_vma_mmap()) to ensure that the mapped callback is
> > > > > > handled when an mmap() caller invokes a nested filesystem's mmap_prepare()
> > > > > > callback.
> > > > > >
> > > > > > We can now also remove call_action_complete() and invoke
> > > > > > mmap_action_complete() directly, as we separate out the rmap lock logic to
> > > > > > be called in __mmap_region() instead via maybe_drop_file_rmap_lock().
> > > > > >
> > > > > > We also abstract unmapping of a VMA on mmap action completion into its own
> > > > > > helper function, unmap_vma_locked().
> > > > > >
> > > > > > Additionally, update VMA userland test headers to reflect the change.
> > > > > >
> > > > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > > > > ---
> > > > > > include/linux/fs.h | 9 +++-
> > > > > > include/linux/mm.h | 17 +++++++
> > > > > > mm/internal.h | 10 ++++
> > > > > > mm/util.c | 86 ++++++++++++++++++++++++---------
> > > > > > mm/vma.c | 41 +++++++++++-----
> > > > > > tools/testing/vma/include/dup.h | 34 ++++++++++++-
> > > > > > 6 files changed, 158 insertions(+), 39 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index a2628a12bd2b..c390f5c667e3 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -2059,13 +2059,20 @@ static inline bool can_mmap_file(struct file *file)
> > > > > > }
> > > > > >
> > > > > > int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
> > > > > > +int __vma_check_mmap_hook(struct vm_area_struct *vma);
> > > > > >
> > > > > > static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > {
> > > > > > + int err;
> > > > > > +
> > > > > > if (file->f_op->mmap_prepare)
> > > > > > return compat_vma_mmap(file, vma);
> > > > > >
> > > > > > - return file->f_op->mmap(file, vma);
> > > > > > + err = file->f_op->mmap(file, vma);
> > > > > > + if (err)
> > > > > > + return err;
> > > > > > +
> > > > > > + return __vma_check_mmap_hook(vma);
> > > > > > }
> > > > > >
> > > > > > static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > > index 12a0b4c63736..7333d5db1221 100644
> > > > > > --- a/include/linux/mm.h
> > > > > > +++ b/include/linux/mm.h
> > > > > > @@ -759,6 +759,23 @@ struct vm_operations_struct {
> > > > > > * Context: User context. May sleep. Caller holds mmap_lock.
> > > > > > */
> > > > > > void (*close)(struct vm_area_struct *vma);
> > > > > > + /**
> > > > > > + * @mapped: Called when the VMA is first mapped in the MM. Not called if
> > > > > > + * the new VMA is merged with an adjacent VMA.
> > > > > > + *
> > > > > > + * The @vm_private_data field is an output field allowing the user to
> > > > > > + * modify vma->vm_private_data as necessary.
> > > > > > + *
> > > > > > + * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
> > > > > > + * set from f_op->mmap.
> > > > > > + *
> > > > > > + * Returns %0 on success, or an error otherwise. On error, the VMA will
> > > > > > + * be unmapped.
> > > > > > + *
> > > > > > + * Context: User context. May sleep. Caller holds mmap_lock.
> > > > > > + */
> > > > > > + int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > > > + const struct file *file, void **vm_private_data);
> > > > > > /* Called any time before splitting to check if it's allowed */
> > > > > > int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
> > > > > > int (*mremap)(struct vm_area_struct *vma);
> > > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > > index 7bfa85b5e78b..f0f2cf1caa36 100644
> > > > > > --- a/mm/internal.h
> > > > > > +++ b/mm/internal.h
> > > > > > @@ -158,6 +158,8 @@ static inline void *folio_raw_mapping(const struct folio *folio)
> > > > > > * mmap hook and safely handle error conditions. On error, VMA hooks will be
> > > > > > * mutated.
> > > > > > *
> > > > > > + * IMPORTANT: f_op->mmap() is deprecated, prefer f_op->mmap_prepare().
> > > > > > + *
> > >
> > > What exactly would one do to "prefer f_op->mmap_prepare()"?
> >
> > I'm saying a person should implement f_op->mmap_prepare() rather than
> > f_op->mmap(), since the latter is deprecated :)
> >
> > I think that's pretty clear no?
> >
> > > Since you are adding this comment for mmap_file(), I think you need to
> > > describe more specifically what one should call instead.
> >
> > I think it'd be a complete distraction, since if you're at the point of calling
> > mmap_file() you're already not implementing mmap_prepare except as a compatibility
> > layer.
>
> Yep, it seems like a warning that comes too late.
Yeah, it's the wrong place for it, agreed.
>
> >
> > I mean maybe I'll just drop this as it seems to be causing confusion.
>
> Maybe instead we add a comment that f_ops->mmap is deprecated in favor
> of f_ops->mmap_prepare() in here:
> https://elixir.bootlin.com/linux/v7.0-rc4/source/include/linux/fs.h#L1940
> ?
Yeah could do, I think maybe once the mmap_prepare changes are further along
actually, as I am still essentially figuring out what functionality to
provide/the shape of it as I develop it.
It's a bit chicken-and-egg, but doing it this way has evolved to a pretty nice
approach so far matching what drivers _actually do_ + finding new ways of doing
them without risk of them breaking stuff which is kinda the whole point - this
isn't a rework for rework's sake, but rather effectively completely changing how
drivers perform mmap.
>
> >
> > >
> > > > > > * @file: File which backs the mapping.
> > > > > > * @vma: VMA which we are mapping.
> > > > > > *
> > > > > > @@ -201,6 +203,14 @@ static inline void vma_close(struct vm_area_struct *vma)
> > > > > > /* unmap_vmas is in mm/memory.c */
> > > > > > void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap);
> > > > > >
> > > > > > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > > > +
> > > > > > + mmap_assert_locked(vma->vm_mm);
> > >
> > > You must hold the mmap write lock when unmapping. Would be better to
> > > assert mmap_assert_write_locked() or even vma_assert_write_locked(),
> > > which implies mmap_assert_write_locked().
> >
> > I'm not sure why we don't assert this in those paths.
> >
> > I think I assumed we could only assert readonly because one of those paths
> > downgrades the mmap write lock to a read lock.
> >
> > I don't think we can do a VMA write lock assert here, since at the point of
> > do_munmap() all callers can't possibly have the VMA write lock, since they are
> > _looking up_ the VMA at the specified address.
>
> It sounds strange to me that we are unmapping a VMA that was not
> locked beforehand. Let me look into the call chains a bit more to
> convince myself one way or the other. The fact that do_munmap() looks
> up the VMA by address and then write-locks it inside
> vms_gather_munmap_vmas() does not mean the VMA was not already locked.
> vma_start_write() is re-entrant.
Well I mean:
SYSCALL_DEFINE2(munmap, ...)
-> __vm_munmap [ takes mmap write lock ]
   -> do_vmi_munmap()

do_munmap() [ assumes (but does not assert, we should add) mmap write lock ]
-> do_vmi_munmap()
You can unmap more than one VMA from this interface, or even choose a range that
doesn't have anything mapped.
do_vmi_munmap() gets the first VMA and if none present exits early, then calls
into do_vmi_align_munmap() otherwise, which does the whole gather/complete
dance.
With respect to the mmap()'ing, actually we probably should always have VMA
write lock, because for any action to be taken, you couldn't merge since
VMA_SPECIAL_FLAGS would be specified (any kind of remap would be VMA_PFNMAP_BIT
+ friends, map kernel pages would be VMA_MIXEDMAP_BIT).
(Might be worth me adding an assert for that actually to avoid confusion.)
Not merging would mean __mmap_new_vma() would be called which naturally gets the
VMA write lock.
So you're right I think we should hold the VMA lock here, but I'm wondering if
it's much of a muchness since really we only _need_ the mmap write lock here.
>
> >
> > But I can convert this to an mmap_assert_write_locked()!
>
> Ok, let's go with that. I don't want to slow down your patchset while
> I investigate locking rules here. We can strengthen the assertion
> later.
Thanks!
>
> >
> > >
> > > > > > + do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > > > > > +}
> > > > > > +
> > > > > > #ifdef CONFIG_MMU
> > > > > >
> > > > > > static inline void get_anon_vma(struct anon_vma *anon_vma)
> > > > > > diff --git a/mm/util.c b/mm/util.c
> > > > > > index dba1191725b6..2b0ed54008d6 100644
> > > > > > --- a/mm/util.c
> > > > > > +++ b/mm/util.c
> > > > > > @@ -1163,6 +1163,55 @@ void flush_dcache_folio(struct folio *folio)
> > > > > > EXPORT_SYMBOL(flush_dcache_folio);
> > > > > > #endif
> > > > > >
> > > > > > +static int __compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + struct vm_area_desc desc = {
> > > > > > + .mm = vma->vm_mm,
> > > > > > + .file = file,
> > > > > > + .start = vma->vm_start,
> > > > > > + .end = vma->vm_end,
> > > > > > +
> > > > > > + .pgoff = vma->vm_pgoff,
> > > > > > + .vm_file = vma->vm_file,
> > > > > > + .vma_flags = vma->flags,
> > > > > > + .page_prot = vma->vm_page_prot,
> > > > > > +
> > > > > > + .action.type = MMAP_NOTHING, /* Default */
> > > > > > + };
> > > > > > + int err;
> > > > > > +
> > > > > > + err = vfs_mmap_prepare(file, &desc);
> > > > > > + if (err)
> > > > > > + return err;
> > > > > > +
> > > > > > + err = mmap_action_prepare(&desc, &desc.action);
> > > > > > + if (err)
> > > > > > + return err;
> > > > > > +
> > > > > > + set_vma_from_desc(vma, &desc);
> > > > > > + return mmap_action_complete(vma, &desc.action);
> > > > > > +}
> > > > > > +
> > > > > > +static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > > > > > + void *vm_private_data = vma->vm_private_data;
> > > > > > + int err;
> > > > > > +
> > > > > > + if (!vm_ops->mapped)
> > > > > > + return 0;
> > > > > > +
> > > > >
> > > > > Hello!
> > > > >
> > > > > Can vm_ops be NULL here? __compat_vma_mapped() is called from
> > > > > compat_vma_mmap(), which is reached when a filesystem provides
> > > > > mmap_prepare. If the mmap_prepare hook does not set desc->vm_ops,
> > > > > vma->vm_ops will be NULL and this dereferences a NULL pointer.
> > > >
> > > > I _think_ for this to ever be invoked, you would need to be dealing with a
> > > > file-backed VMA so vm_ops->fault would HAVE to be defined.
> > > >
> > > > But you're right anyway as a matter of principle we should check it! Will fix.
> > > >
> > > > >
> > > > > For e.g. drivers/char/mem.c, mmap_zero_prepare() would trigger
> > > > > a NULL pointer dereference here.
> > > > >
> > > > > Would need to do
> > > > > if (!vm_ops || !vm_ops->mapped)
> > > > > return 0;
> > > > >
> > > > > here
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > >
> > > > > > + err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, file,
> > > > > > + &vm_private_data);
> > > > > > + if (err)
> > > > > > + unmap_vma_locked(vma);
> > > > >
> > > > > when mapped() returns an error, unmap_vma_locked(vma) is called
> > > > > but execution continues into the vm_private_data update below. After
> > > > > unmap_vma_locked() the VMA may be freed (do_munmap can remove the VMA
> > > > > entirely), so accessing vma->vm_private_data after that is a
> > > > > use-after-free.
> > > >
> > > > Very good point :) will fix thanks!
> > > >
> > > > Probably:
> > > >
> > > > if (err)
> > > > unmap_vma_locked(vma);
> > > > else if (vm_private_data != vma->vm_private_data)
> > > > vma->vm_private_data = vm_private_data;
> > > >
> > > > return err;
> > > >
> > > > Would be fine.
> > > >
> > > > >
> > > > > Probably need to do:
> > > > > if (err) {
> > > > > unmap_vma_locked(vma);
> > > > > return err;
> > > > > }
> > > > >
> > > > > > + /* Update private data if changed. */
> > > > > > + if (vm_private_data != vma->vm_private_data)
> > > > > > + vma->vm_private_data = vm_private_data;
> > > > > > +
> > > > > > + return err;
> > > > > > +}
> > > > > > +
> > > > > > /**
> > > > > > * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
> > > > > > * existing VMA and execute any requested actions.
> > > > > > @@ -1191,34 +1240,26 @@ EXPORT_SYMBOL(flush_dcache_folio);
> > > > > > */
> > > > > > int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > {
> > > > > > - struct vm_area_desc desc = {
> > > > > > - .mm = vma->vm_mm,
> > > > > > - .file = file,
> > > > > > - .start = vma->vm_start,
> > > > > > - .end = vma->vm_end,
> > > > > > -
> > > > > > - .pgoff = vma->vm_pgoff,
> > > > > > - .vm_file = vma->vm_file,
> > > > > > - .vma_flags = vma->flags,
> > > > > > - .page_prot = vma->vm_page_prot,
> > > > > > -
> > > > > > - .action.type = MMAP_NOTHING, /* Default */
> > > > > > - };
> > > > > > int err;
> > > > > >
> > > > > > - err = vfs_mmap_prepare(file, &desc);
> > > > > > - if (err)
> > > > > > - return err;
> > > > > > -
> > > > > > - err = mmap_action_prepare(&desc, &desc.action);
> > > > > > + err = __compat_vma_mmap(file, vma);
> > > > > > if (err)
> > > > > > return err;
> > > > > >
> > > > > > - set_vma_from_desc(vma, &desc);
> > > > > > - return mmap_action_complete(vma, &desc.action);
> > > > > > + return __compat_vma_mapped(file, vma);
> > > > > > }
> > > > > > EXPORT_SYMBOL(compat_vma_mmap);
> > > > > >
> > > > > > +int __vma_check_mmap_hook(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + /* vm_ops->mapped is not valid if mmap() is specified. */
> > > > > > + if (WARN_ON_ONCE(vma->vm_ops->mapped))
> > > > > > + return -EINVAL;
> > > > >
> > > > > I think vma->vm_ops can be NULL here. Should be:
> > > > >
> > > > > if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
> > > > > return -EINVAL;
> > > >
> > > > I think again you'd probably only invoke this on file-backed so be ok, but again
> > > > as a matter of principle we should check it so will fix, thanks!
> > > >
> > > > >
> > > > > > +
> > > > > > + return 0;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL(__vma_check_mmap_hook);
> > >
> > > nit: Any reason __vma_check_mmap_hook() is not inlined next to its
> > > user vfs_mmap()?
> >
> > Headers fun, fs.h is a 'before mm.h' header, so vm_operations_struct is not
> > declared yet here, so we can't actually do the check there.
>
> Ack.
>
> >
> > >
> > > > > > +
> > > > > > static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
> > > > > > const struct page *page)
> > > > > > {
> > > > > > @@ -1316,10 +1357,7 @@ static int mmap_action_finish(struct vm_area_struct *vma,
> > > > > > * invoked if we do NOT merge, so we only clean up the VMA we created.
> > > > > > */
> > > > > > if (err) {
> > > > > > - const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > > > -
> > > > > > - do_munmap(current->mm, vma->vm_start, len, NULL);
> > > > > > -
> > > > > > + unmap_vma_locked(vma);
> > > > > > if (action->error_hook) {
> > > > > > /* We may want to filter the error. */
> > > > > > err = action->error_hook(err);
> > > > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > > > index 054cf1d262fb..ef9f5a5365d1 100644
> > > > > > --- a/mm/vma.c
> > > > > > +++ b/mm/vma.c
> > > > > > @@ -2705,21 +2705,35 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
> > > > > > return false;
> > > > > > }
> > > > > >
> > > > > > -static int call_action_complete(struct mmap_state *map,
> > > > > > - struct mmap_action *action,
> > > > > > - struct vm_area_struct *vma)
> > > > > > +static int call_mapped_hook(struct vm_area_struct *vma)
> > > > > > {
> > > > > > - int ret;
> > > > > > + const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > > > > > + void *vm_private_data = vma->vm_private_data;
> > > > > > + int err;
> > > > > >
> > > > > > - ret = mmap_action_complete(vma, action);
> > > > > > + if (!vm_ops || !vm_ops->mapped)
> > > > > > + return 0;
> > > > > > + err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff,
> > > > > > + vma->vm_file, &vm_private_data);
> > > > > > + if (err) {
> > > > > > + unmap_vma_locked(vma);
> > > > > > + return err;
> > > > > > + }
> > > > > > + /* Update private data if changed. */
> > > > > > + if (vm_private_data != vma->vm_private_data)
> > > > > > + vma->vm_private_data = vm_private_data;
> > > > > > + return 0;
> > > > > > +}
> > > > > >
> > > > > > - /* If we held the file rmap we need to release it. */
> > > > > > - if (map->hold_file_rmap_lock) {
> > > > > > - struct file *file = vma->vm_file;
> > > > > > +static void maybe_drop_file_rmap_lock(struct mmap_state *map,
> > > > > > + struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + struct file *file;
> > > > > >
> > > > > > - i_mmap_unlock_write(file->f_mapping);
> > > > > > - }
> > > > > > - return ret;
> > > > > > + if (!map->hold_file_rmap_lock)
> > > > > > + return;
> > > > > > + file = vma->vm_file;
> > > > > > + i_mmap_unlock_write(file->f_mapping);
> > > > > > }
> > > > > >
> > > > > > static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > > > > @@ -2773,8 +2787,11 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > > > > __mmap_complete(&map, vma);
> > > > > >
> > > > > > if (have_mmap_prepare && allocated_new) {
> > > > > > - error = call_action_complete(&map, &desc.action, vma);
> > > > > > + error = mmap_action_complete(vma, &desc.action);
> > > > > > + if (!error)
> > > > > > + error = call_mapped_hook(vma);
> > > > > >
> > > > > > + maybe_drop_file_rmap_lock(&map, vma);
> > > > > > if (error)
> > > > > > return error;
> > > > > > }
> > > > > > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > > > > > index 908beb263307..47d8db809f31 100644
> > > > > > --- a/tools/testing/vma/include/dup.h
> > > > > > +++ b/tools/testing/vma/include/dup.h
> > > > > > @@ -606,12 +606,34 @@ struct vm_area_struct {
> > > > > > } __randomize_layout;
> > > > > >
> > > > > > struct vm_operations_struct {
> > > > > > - void (*open)(struct vm_area_struct * area);
> > > > > > + /**
> > > > > > + * @open: Called when a VMA is remapped or split. Not called upon first
> > > > > > + * mapping a VMA.
> > > > > > + * Context: User context. May sleep. Caller holds mmap_lock.
> > > > > > + */
> > >
> > > This comment should have been introduced in the previous patch.
> >
> > It's the testing code, it's not really important. But if I respin I'll fix... :)
>
> Thanks!
>
> >
> > >
> > > > > > + void (*open)(struct vm_area_struct *vma);
> > > > > > /**
> > > > > > * @close: Called when the VMA is being removed from the MM.
> > > > > > * Context: User context. May sleep. Caller holds mmap_lock.
> > > > > > */
> > > > > > - void (*close)(struct vm_area_struct * area);
> > > > > > + void (*close)(struct vm_area_struct *vma);
> > > > > > + /**
> > > > > > + * @mapped: Called when the VMA is first mapped in the MM. Not called if
> > > > > > + * the new VMA is merged with an adjacent VMA.
> > > > > > + *
> > > > > > + * The @vm_private_data field is an output field allowing the user to
> > > > > > + * modify vma->vm_private_data as necessary.
> > > > > > + *
> > > > > > + * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
> > > > > > + * set from f_op->mmap.
> > > > > > + *
> > > > > > + * Returns %0 on success, or an error otherwise. On error, the VMA will
> > > > > > + * be unmapped.
> > > > > > + *
> > > > > > + * Context: User context. May sleep. Caller holds mmap_lock.
> > > > > > + */
> > > > > > + int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > > > + const struct file *file, void **vm_private_data);
> > > > > > /* Called any time before splitting to check if it's allowed */
> > > > > > int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> > > > > > int (*mremap)(struct vm_area_struct *area);
> > > > > > @@ -1345,3 +1367,11 @@ static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
> > > > > > swap(vma->vm_file, file);
> > > > > > fput(file);
> > > > > > }
> > > > > > +
> > > > > > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > > > +
> > > > > > + mmap_assert_locked(vma->vm_mm);
> > > > > > + do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > > > > > +}
> > > > > > --
> > > > > > 2.53.0
> > > > > >
> > > > > >
> > > >
> > > > Cheers, Lorenzo
Cheers, Lorenzo
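For reference, a userspace sketch of the two fixes agreed in the quoted review of __compat_vma_mapped(): check vm_ops for NULL, and return right after unmapping so the VMA is never touched again on error. All model_* names are hypothetical stand-ins, and the hook signature is reduced to the private-data output parameter; this is not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct vm_operations_struct. */
struct model_vm_ops {
	int (*mapped)(void **vm_private_data);
};

/* Hypothetical, reduced stand-in for struct vm_area_struct. */
struct model_vma {
	const struct model_vm_ops *vm_ops;
	void *vm_private_data;
	bool unmapped;	/* set once the model has "unmapped" the VMA */
};

static int model_cookie;	/* arbitrary object to point private data at */

/* A hook that succeeds and updates the private data. */
static int model_mapped_ok(void **vm_private_data)
{
	*vm_private_data = &model_cookie;
	return 0;
}

/* A hook that writes the output parameter and then fails. */
static int model_mapped_fail(void **vm_private_data)
{
	*vm_private_data = &model_cookie;
	return -EINVAL;
}

static const struct model_vm_ops model_ok_ops = { .mapped = model_mapped_ok };
static const struct model_vm_ops model_fail_ops = { .mapped = model_mapped_fail };

/* Stands in for unmap_vma_locked(); in the kernel the VMA may be freed here. */
static void model_unmap(struct model_vma *vma)
{
	vma->unmapped = true;
}

static int model_compat_vma_mapped(struct model_vma *vma)
{
	const struct model_vm_ops *vm_ops = vma->vm_ops;
	void *vm_private_data = vma->vm_private_data;
	int err;

	/* Fix 1: vm_ops may be NULL if mmap_prepare never set it. */
	if (!vm_ops || !vm_ops->mapped)
		return 0;

	err = vm_ops->mapped(&vm_private_data);
	if (err) {
		/* Fix 2: return immediately, never touch the VMA after unmap. */
		model_unmap(vma);
		return err;
	}

	/* Update private data if changed. */
	if (vm_private_data != vma->vm_private_data)
		vma->vm_private_data = vm_private_data;

	return 0;
}
```

The early return and the if/else form suggested in the quoted reply are equivalent here; both keep the vm_private_data update off the error path.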
* Re: [PATCH 05/15] fs: afs: correctly drop reference count on mapping failure
From: Lorenzo Stoakes (Oracle) @ 2026-03-17 8:58 UTC
To: Suren Baghdasaryan
Cc: Usama Arif, Andrew Morton, Clemens Ladisch, Arnd Bergmann,
Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
Alexandre Torgue, Miquel Raynal, Richard Weinberger,
Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
linux-kernel, linux-doc, linux-hyperv, linux-stm32,
linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpH2XyAJOFKCZnviVV_UbF4O0wzj3QgJieo+LD=Cvr71jA@mail.gmail.com>
On Mon, Mar 16, 2026 at 08:41:48PM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 16, 2026 at 7:29 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Sun, Mar 15, 2026 at 07:32:54PM -0700, Suren Baghdasaryan wrote:
> > > On Fri, Mar 13, 2026 at 5:00 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > >
> > > > On Fri, Mar 13, 2026 at 04:07:43AM -0700, Usama Arif wrote:
> > > > > On Thu, 12 Mar 2026 20:27:20 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > > > >
> > > > > > Commit 9d5403b1036c ("fs: convert most other generic_file_*mmap() users to
> > > > > > .mmap_prepare()") updated AFS to use the mmap_prepare callback in favour of
> > > > > > the deprecated mmap callback.
> > > > > >
> > > > > > However, it did not account for the fact that mmap_prepare can fail to map
> > > > > > due to an out of memory error, and thus should not be incrementing a
> > > > > > reference count on mmap_prepare.
> > >
> > > This is a bit confusing. I see the current implementation does
> > > afs_add_open_mmap() and then if generic_file_mmap_prepare() fails it
> > > does afs_drop_open_mmap(), therefore refcounting seems to be balanced.
> > > Is there really a problem?
> >
> > Firstly, mmap_prepare is invoked before we try to merge, so the VMA could in
> > theory get merged and then the refcounting will be wrong.
>
> I see now. Ok, makes sense.
>
> >
> > Secondly, mmap_prepare occurs at such a time where it is _possible_ that
> > allocation failures as described below could happen.
>
> Right, but in that case afs_file_mmap_prepare() would drop its
> refcount and return an error, so refcounting is still good, no?
Nope, in __mmap_region():
call_mmap_prepare()
-> __mmap_new_vma()
   vm_area_alloc() -> can fail
   vma_iter_prealloc() -> can fail
   __mmap_new_file_vma() / shmem_zero_setup() -> can fail
If any of those fail the VMA is not even set up, so no close() will be called
because there's no VMA to call close on.
This is what makes mmap_prepare very different from mmap which passes in (a
partially established) VMA.
That and of course a potential merge would mean any refcount increment would be
wrong.
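To make the imbalance concrete, here is a small userspace model, assuming a single counter in place of the AFS open-mmap refcount and a boolean in place of a failing vm_area_alloc() or vma_iter_prealloc(). All model_* names are hypothetical; this is a sketch of the reasoning above, not the AFS code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-file state holding the refcount. */
struct model_file {
	int open_mmaps;
};

/* Takes the reference; called from prepare (old) or mapped (new). */
static void model_take_ref(struct model_file *f)
{
	f->open_mmaps++;
}

/* Stands in for vm_ops->close(), which only runs for a VMA that exists. */
static void model_close(struct model_file *f)
{
	f->open_mmaps--;
}

/*
 * Models one mmap attempt followed by an unmap on success.
 * ref_in_prepare: take the reference in the prepare step (old pattern)
 *                 rather than in the mapped hook (new pattern).
 * alloc_fails:    simulate an allocation failure after the prepare step.
 */
static int model_mmap(struct model_file *f, bool ref_in_prepare, bool alloc_fails)
{
	if (ref_in_prepare)
		model_take_ref(f);

	if (alloc_fails)
		return -ENOMEM;	/* no VMA was set up, so close() never runs */

	if (!ref_in_prepare)
		model_take_ref(f);	/* mapped hook: mapping is in place */

	model_close(f);		/* later unmap of the established VMA */
	return 0;
}
```

With the reference taken in the prepare step, a failure after prepare leaves the count at 1 with nothing left to drop it; taking it in the mapped hook keeps the count at 0 on both the failure and the success path.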
>
> >
> > I'll update the commit message to reflect the merge aspect actually.
>
> Thanks!
You're welcome, and done in v2 :)
>
> >
> > >
> > > > > >
> > > > > > With the newly added vm_ops->mapped callback available, we can simply defer
> > > > > > this operation to that callback which is only invoked once the mapping is
> > > > > > successfully in place (but not yet visible to userspace as the mmap and VMA
> > > > > > write locks are held).
> > > > > >
> > > > > > Therefore add afs_mapped() to implement this callback for AFS.
> > > > > >
> > > > > > In practice the mapping allocations are 'too small to fail' so this is
> > > > > > something that realistically should never happen in practice (or would do
> > > > > > so in a case where the process is about to die anyway), but we should still
> > > > > > handle this.
> > >
> > > nit: I would drop the above paragraph. If it's impossible why are you
> > > handling it? If it's unlikely, then handling it is even more
> > > important.
> >
> > Sure I can drop it, but it's an ongoing thing with these small allocations.
> >
> > I wish we could just move to a scenario where we can simply assume allocations
> > will always succeed :)
>
> That would be really nice but unfortunately the world is not that
> perfect. I just don't want to be chasing some rarely reproducible bug
> because of the assumption that an allocation is too small to fail.
I mean I agree, we should handle all error paths.
Cheers, Lorenzo
^ permalink raw reply
* Re: [PATCH] lib: count_zeros: fix 32/64-bit inconsistency in count_trailing_zeros()
From: Leon Romanovsky @ 2026-03-17 9:14 UTC (permalink / raw)
To: Yury Norov
Cc: Jason Gunthorpe, Yury Norov, Andy Shevchenko, Rasmus Villemoes,
Eric Biggers, Jason A. Donenfeld, Ard Biesheuvel, linux-kernel,
kexec, linux-cifs, linux-spi, linux-hyperv, K. Y. Srinivasan,
Haiyang Zhang, Mark Brown, Steve French, Alexander Graf,
Mike Rapoport, Pasha Tatashin
In-Reply-To: <abRUGVW6ZuGioa4Z@yury>
On Fri, Mar 13, 2026 at 02:14:49PM -0400, Yury Norov wrote:
> On Fri, Mar 13, 2026 at 02:18:55PM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 12, 2026 at 07:08:16PM -0400, Yury Norov wrote:
> > > Based on 'sizeof(x) == 4' condition, in 32-bit case the function is wired
> > > to ffs(), while in 64-bit case to __ffs(). The difference is substantial:
> > > ffs(x) == __ffs(x) + 1. Also, ffs(0) == 0, while __ffs(0) is undefined.
> > >
> > > The 32-bit behaviour is inconsistent with the function description, so it
> > > needs to get fixed.
> > >
> > > There are 9 individual users for the function in 6 different subsystems.
> > > Some arches and drivers are 64-bit only:
> > > - arch/loongarch/kvm/intc/eiointc.c;
> > > - drivers/hv/mshv_vtl_main.c;
> > > - kernel/liveupdate/kexec_handover.c;
> > >
> > > The others are:
> > > - ib_umem_find_best_pgsz(): as per comment, __ffs() should be correct;
> >
> > So long as 32 bit works the same as 64 bit it is correct for ib
>
> This is what the patch does, except that it doesn't account for the
> word length. In your case, 'mask' is dma_addr_t, which is u32 or u64
> depending on ARCH_DMA_ADDR_T_64BIT.
>
> This config is:
>
> config ARCH_DMA_ADDR_T_64BIT
> def_bool 64BIT || PHYS_ADDR_T_64BIT
>
> And PHYS_ADDR_T_64BIT is simply def_bool 64BIT. So, at least now
> dma_addr_t simply follows unsigned long, and thus, the patch is
> correct. But IDK what's the history behind these configurations.
>
> Anyways, the patch aligns 32-bit count_trailing_zeros() with the
> 64-bit one. If you're OK with that, as you said, can you please send
> an explicit ack?
I can do that; 32-bit architectures are rarely used in the IB world.
Thanks,
Acked-by: Leon Romanovsky <leon@kernel.org>
^ permalink raw reply
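For readers following the ffs()/__ffs() distinction in the thread above, here is a small userspace sketch of the two conventions. It uses the GCC/Clang builtins __builtin_ffs() and __builtin_ctzl() as stand-ins for the kernel helpers, which is an assumption of this illustration. The wrapper names ffs_1based() and ffs_0based() are hypothetical.

```c
#include <assert.h>

/* ffs()-style: 1-based position of the lowest set bit, 0 when x == 0. */
static int ffs_1based(unsigned int x)
{
	return __builtin_ffs((int)x);
}

/* __ffs()-style: 0-based index of the lowest set bit; undefined for 0. */
static unsigned int ffs_0based(unsigned long x)
{
	assert(x != 0);	/* the zero case has no defined result */
	return (unsigned int)__builtin_ctzl(x);
}
```

For any nonzero x the two differ by exactly one, so a trailing-zero count built on the 1-based variant is off by one relative to one built on the 0-based variant.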
* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Leon Romanovsky @ 2026-03-17 9:44 UTC (permalink / raw)
To: Long Li
Cc: Erni Sri Satya Vennela, Konstantin Taranov, Jason Gunthorpe,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66832D25A93394735624F454CE40A@SA1PR21MB6683.namprd21.prod.outlook.com>
On Mon, Mar 16, 2026 at 08:50:39PM +0000, Long Li wrote:
> > On Thu, Mar 12, 2026 at 11:16:41AM -0700, Erni Sri Satya Vennela wrote:
> > > As part of MANA hardening for CVM, clamp hardware-reported adapter
> > > capability values from the MANA_IB_GET_ADAPTER_CAP response before
> > > they are used by the IB subsystem.
> > >
> > > The response fields (max_qp_count, max_cq_count, max_mr_count,
> > > max_pd_count, max_inbound_read_limit, max_outbound_read_limit,
> > > max_qp_wr, max_send_sge_count, max_recv_sge_count) are u32 but are
> > > assigned to signed int members in struct ib_device_attr. If hardware
> > > returns a value exceeding INT_MAX, the implicit u32-to-int conversion
> > > produces a negative value, which can cause incorrect behavior in the
> > > IB core and userspace applications.
> >
> > This sentence does not make sense in the context of the Linux kernel.
> > The fundamental assumption is that the underlying hardware behaves correctly,
> > and driver code should not attempt to guard against purely hypothetical
> > failures. The kernel only implements such self‑protection when there is a
> > documented hardware issue accompanied by official errata.
> >
> > Thanks
>
> The idea is that malicious hardware can't corrupt or steal other data from the kernel.
>
> The assumption is that in a public cloud environment, you can't trust the hardware 100%.
You cannot separate functionality and claim that one line of code is trusted
while another is not.
Thanks
^ permalink raw reply
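As an illustration of the conversion issue the patch description refers to, a u32 value above INT_MAX does not fit a signed int field. The following is a sketch only: clamp_u32_to_int() is a hypothetical helper, not code from the patch. In-kernel code would more likely use an existing helper such as min_t(), and whether the patch does so is not shown in this thread.

```c
#include <limits.h>
#include <stdint.h>

/* Clamp a device-reported 32-bit count so that it fits a signed int. */
static int clamp_u32_to_int(uint32_t reported)
{
	if (reported > (uint32_t)INT_MAX)
		return INT_MAX;
	return (int)reported;
}
```

Values up to INT_MAX pass through unchanged; anything larger saturates at INT_MAX instead of turning negative.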
* Re: [PATCH 00/11] Drivers: hv: Add ARM64 support in mshv_vtl
From: Naman Jain @ 2026-03-17 9:51 UTC (permalink / raw)
To: vdso, ssengar
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
Michael Kelley, linux-hyperv, linux-arm-kernel, linux-kernel,
linux-arch, linux-riscv, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
In-Reply-To: <1755043210.33472.1773718457301@app.mailbox.org>
On 3/17/2026 9:04 AM, vdso@mailbox.org wrote:
>
>> On 03/16/2026 5:12 AM Naman Jain <namjain@linux.microsoft.com> wrote:
>>
>>
>> The series intends to add support for ARM64 to mshv_vtl driver.
>> For this, common Hyper-V code is refactored, necessary support is added,
>> mshv_vtl_main.c is refactored and then finally support is added in
>> Kconfig.
>
> Hi Naman, Saurabh,
>
> So awesome to see the ARM64 support for the VSM being upstreamed!!
>
> A few of the patches carry my old Microsoft "Signed-off-by" tag,
> and I really appreciate you folks very much kindly adding it
> although the code appears to be a far more evolved and crisper
> version of what it was back then!
>
> Do feel free to drop my SOB from these few patches so the below SRB
> doesn't look weird or as a conflict of interest - that is if you see
> adding my below SRB to these few patches as a good option. It's been
> 2 years, and after 2 years who can really remember their code :D
>
> For the series,
> Reviewed-by: Roman Kisel <vdso@mailbox.org>
Thank you so much Roman for reviewing the changes. I think we can retain
both the tags from you. I'll let the maintainers decide.
Regards,
Naman
^ permalink raw reply
* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-17 11:01 UTC (permalink / raw)
To: Jan Kiszka, Peter Zijlstra
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley, Saurabh Singh Sengar, Naman Jain
In-Reply-To: <b0359046-3c58-47a6-b503-8a2b52cb1448@siemens.com>
On 2026-03-17 08:49:38 [+0100], Jan Kiszka wrote:
> >> +void vmbus_isr(void)
> >> +{
> >> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> >> + vmbus_irqd_wake();
> >> + } else {
> >> + lockdep_hardirq_threaded();
> >
> > What clears this? This is wrongly placed. This should go to
> > sysvec_hyperv_callback() instead with its matching canceling part. The
> > add_interrupt_randomness() should also be there and not here.
> > sysvec_hyperv_stimer0() managed to do so.
>
> First of all, we need to keep all this in generic code to avoid missing
> arm64.
This kind of belongs to the IRQ core code so I would prefer to see it on
IRQ entry, not in a random driver.
> But the question about lockdep_hardirq_threaded() is valid - and that
> not only for this new code: I tried hard to understand from the code how
> hardirq_threaded is managed, but I simply couldn't find the spot where
> it is reset after lockdep_hardirq_threaded() but before returning from
> the interrupt to the task that now has hardirq_threaded=1. I failed, and
> so I started a debugger. That confirms for the existing code path
> (__handle_irq_event_percpu) that we are indeed returning to the
> interrupted task with hardirq_threaded set. I'm not sure if that is
> intended that only the next irq_enter_rcu->lockdep_hardirq_enter of the
> next IRQ over this same task will reset the flag again.
Looking into it again: it assumes that once you enter an IRQ, due to
the implementation, if one handler is threaded then all of them are. So if
you switch from IRQ handling to TIMER, this does not happen "as-is"; you exit
from one and then enter another, at which point it is set to zero again.
> With that in mind, the new logic here is no different from the one the
> kernel used before. If both are not doing what they should, we likely
> want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
The difference is that you expect that _everyone_ calling this driver
has everything else threaded. This might not be the case. That is why
this should either live in the core, which knows what is called when
threaded; or be used in the driver with the flag explicitly cleared
afterwards, since you don't know what can follow; or a generic threaded
infrastructure should be added here.
A different option, which I would prefer in the driver, would be an
explicit lockdep override for the locking class without using
lockdep_hardirq_threaded().
> > Different question: What guarantees that there won't be another
> > interrupt before this one is done? The handshake appears to be
> > deprecated. The interrupt itself returns ACKing (or not) but the actual
> > handler is delayed to this thread. Depending on the userland it could
> > take some time and I don't know how impatient the host is.
> >
>
> Good question. I guess people familiar with the hv interface need to
> comment on that.
>
> >> + __vmbus_isr();
> > Moving on. This (trying very hard here) even schedules tasklets. Why?
> > You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> > You don't want that.
> >
>
> You are referring to the pre-existing logic now, aren't you?
Yes.
> > Couldn't the whole logic be integrated into the IRQ code? Then we could
> > have mask/ unmask if supported/ provided and threaded interrupts. Then
> > sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> > instead of apic_eoi() + schedule_delayed_work().
> >
>
> Again, you are thinking x86-only. We need a portable solution.
Well, ARM could use a threaded interrupt, too.
> >> + }
> >> +}
> >> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> >>
> >> static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
> >
> > Sebastian
>
> Jan
Sebastian
^ permalink raw reply
* Re: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Sebastian Andrzej Siewior @ 2026-03-17 11:05 UTC (permalink / raw)
To: Jan Kiszka
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
Florian Bezdeka
In-Reply-To: <1b53653a-98a5-402a-a224-996b26edaa97@siemens.com>
On 2026-03-17 09:09:27 [+0100], Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> Sebastian Siewior wrote:
> "This is feeding entropy and would like to see interrupt registers. But
> since this is invoked from a thread it won't."
>
> So move it back to where it is always in interrupt context.
>
> Fixes: f8e6343b7a89 ("Drivers: hv: vmbus: Simplify allocation of vmbus_evt")
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> drivers/hv/vmbus_drv.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index bc4fc1951ae1..28025a264861 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1361,8 +1361,6 @@ static void __vmbus_isr(void)
>
> vmbus_message_sched(hv_cpu, hv_cpu->hyp_synic_message_page);
> vmbus_message_sched(hv_cpu, hv_cpu->para_synic_message_page);
> -
> - add_interrupt_randomness(vmbus_interrupt);
> }
>
> static DEFINE_PER_CPU(bool, vmbus_irq_pending);
> @@ -1410,6 +1408,8 @@ void vmbus_isr(void)
> lockdep_hardirq_threaded();
> __vmbus_isr();
> }
> +
> + add_interrupt_randomness(vmbus_interrupt);
> }
> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
Why not sysvec_hyperv_callback()?
Sebastian
^ permalink raw reply
* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-17 11:55 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Peter Zijlstra
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
linux-hyperv, linux-kernel, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley, Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260317110128.k59TflVp@linutronix.de>
On 17.03.26 12:01, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 08:49:38 [+0100], Jan Kiszka wrote:
>>>> +void vmbus_isr(void)
>>>> +{
>>>> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>>>> + vmbus_irqd_wake();
>>>> + } else {
>>>> + lockdep_hardirq_threaded();
>>>
>>> What clears this? This is wrongly placed. This should go to
>>> sysvec_hyperv_callback() instead with its matching canceling part. The
>>> add_interrupt_randomness() should also be there and not here.
>>> sysvec_hyperv_stimer0() managed to do so.
>>
>> First of all, we need to keep all this in generic code to avoid missing
>> arm64.
>
> This kind of belongs to the IRQ core code so I would prefer to see it on
> IRQ entry, not in a random driver.
I have no idea why hv is so special, starting with having its own
vectors. But if you have an idea how to address those needs via core
APIs or to create new ones for it, I guess that is welcome.
>
>> But the question about lockdep_hardirq_threaded() is valid - and that
>> not only for this new code: I tried hard to understand from the code how
>> hardirq_threaded is managed, but I simply couldn't find the spot where
>> it is reset after lockdep_hardirq_threaded() but before returning from
>> the interrupt to the task that now has hardirq_threaded=1. I failed, and
>> so I started a debugger. That confirms for the existing code path
>> (__handle_irq_event_percpu) that we are indeed returning to the
>> interrupted task with hardirq_threaded set. I'm not sure if that is
>> intended that only the next irq_enter_rcu->lockdep_hardirq_enter of the
>> next IRQ over this same task will reset the flag again.
>
> Looking into it again: it assumes that once you enter an IRQ, due to
> the implementation, if one handler is threaded then all of them are. So if
> you switch from IRQ handling to TIMER, this does not happen "as-is"; you exit
> from one and then enter another, at which point it is set to zero again.
The point is that a task that was interrupted by a potentially threaded
interrupt keeps this flag longer than it needs it. And that is
apparently harmless, but fairly confusing.
>
>> With that in mind, the new logic here is no different from the one the
>> kernel used before. If both are not doing what they should, we likely
>> want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
>
> The difference is that you expect that _everyone_ calling this driver
> has everything else threaded. This might not be the case. That is why
> this should either live in the core, which knows what is called when
> threaded; or be used in the driver with the flag explicitly cleared
> afterwards, since you don't know what can follow; or a generic threaded
> infrastructure should be added here.
This driver is different, unfortunately. I'm not sure if we can / want
to thread everything that the platform interrupt does on x86. So far,
only the last part of it - vmbus handling - is threaded. On arm64, the
irq is exclusive (see vmbus_percpu_isr), thus everything can be and is
threaded.
>
> A different option, which I would prefer in the driver, would be an
> explicit lockdep override for the locking class without using
> lockdep_hardirq_threaded().
Happy to learn how to do that.
>
>>> Different question: What guarantees that there won't be another
>>> interrupt before this one is done? The handshake appears to be
>>> deprecated. The interrupt itself returns ACKing (or not) but the actual
>>> handler is delayed to this thread. Depending on the userland it could
>>> take some time and I don't know how impatient the host is.
>>>
>>
>> Good question. I guess people familiar with the hv interface need to
>> comment on that.
>>
>>>> + __vmbus_isr();
>>> Moving on. This (trying very hard here) even schedules tasklets. Why?
>>> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
>>> You don't want that.
>>>
>>
>> You are referring to the pre-existing logic now, aren't you?
>
> Yes.
>
Then someone else needs to answer this.
>>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>>> instead of apic_eoi() + schedule_delayed_work().
>>>
>>
>> Again, you are thinking x86-only. We need a portable solution.
>
> Well, ARM could use a threaded interrupt, too.
For a reason we didn't explore in detail, per-CPU interrupts aren't
threaded. See an older version of this patch
(https://lore.kernel.org/lkml/005a01dc9d30$a40515e0$ec0f41a0$@zohomail.com/)
where I thought I only had to fix x86, but arm64 needed care as well.
Jan
--
Siemens AG, Foundational Technologies
Linux Expert Center
^ permalink raw reply
* Re: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Jan Kiszka @ 2026-03-17 11:56 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
Florian Bezdeka
In-Reply-To: <20260317110535.Smn9viQ7@linutronix.de>
On 17.03.26 12:05, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 09:09:27 [+0100], Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> Sebastian Siewior wrote:
>> "This is feeding entropy and would like to see interrupt registers. But
>> since this is invoked from a thread it won't."
>>
>> So move it back to where it is always in interrupt context.
>>
>> Fixes: f8e6343b7a89 ("Drivers: hv: vmbus: Simplify allocation of vmbus_evt")
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>> drivers/hv/vmbus_drv.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>> index bc4fc1951ae1..28025a264861 100644
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -1361,8 +1361,6 @@ static void __vmbus_isr(void)
>>
>> vmbus_message_sched(hv_cpu, hv_cpu->hyp_synic_message_page);
>> vmbus_message_sched(hv_cpu, hv_cpu->para_synic_message_page);
>> -
>> - add_interrupt_randomness(vmbus_interrupt);
>> }
>>
>> static DEFINE_PER_CPU(bool, vmbus_irq_pending);
>> @@ -1410,6 +1408,8 @@ void vmbus_isr(void)
>> lockdep_hardirq_threaded();
>> __vmbus_isr();
>> }
>> +
>> + add_interrupt_randomness(vmbus_interrupt);
>> }
>> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
>
> Why not sysvec_hyperv_callback()?
Because we do not want to be x86-only.
Jan
--
Siemens AG, Foundational Technologies
Linux Expert Center
^ permalink raw reply
* Re: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Sebastian Andrzej Siewior @ 2026-03-17 13:22 UTC (permalink / raw)
To: Jan Kiszka
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
Florian Bezdeka
In-Reply-To: <f718a22c-bbf2-4206-ba7d-391243c84f60@siemens.com>
On 2026-03-17 12:56:02 [+0100], Jan Kiszka wrote:
> >> @@ -1410,6 +1408,8 @@ void vmbus_isr(void)
> >> lockdep_hardirq_threaded();
> >> __vmbus_isr();
> >> }
> >> +
> >> + add_interrupt_randomness(vmbus_interrupt);
> >> }
> >> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> >
> > Why not sysvec_hyperv_callback()?
>
> Because we do not want to be x86-only.
Who is the other one, and does it have its add_interrupt_randomness() there
already?
This is a driver, this does not belong here.
> Jan
Sebastian
^ permalink raw reply
* Re: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Jan Kiszka @ 2026-03-17 13:34 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Michael Kelley
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
Florian Bezdeka
In-Reply-To: <20260317132252.AJlwEyMh@linutronix.de>
On 17.03.26 14:22, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 12:56:02 [+0100], Jan Kiszka wrote:
>>>> @@ -1410,6 +1408,8 @@ void vmbus_isr(void)
>>>> lockdep_hardirq_threaded();
>>>> __vmbus_isr();
>>>> }
>>>> +
>>>> + add_interrupt_randomness(vmbus_interrupt);
>>>> }
>>>> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
>>>
>>> Why not sysvec_hyperv_callback()?
>>
>> Because we do not want to be x86-only.
>
> Who is the other one, and does it have its add_interrupt_randomness() there
> already?
It's the arm64 path of the hv support. Regarding the vmbus IRQ, it seems
to be fully handled here, without an equivalent of
arch/x86/kernel/cpu/mshyperv.c.
> This is a driver, this does not belong here.
Don't argue with me, I didn't put it here in the beginning. Maybe
Michael can shed more light on this (and sorry for having forgotten to
CC you on this patch).
Jan
--
Siemens AG, Foundational Technologies
Linux Expert Center
^ permalink raw reply
* [PATCH] mshv: Fix error handling in mshv_region_pin
From: Stanislav Kinsburskii @ 2026-03-17 15:04 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
The current error handling has two issues:
First, pin_user_pages_fast() can return a short pin count (less than
requested but greater than zero) when it cannot pin all requested pages.
This is treated as success, leading to partially pinned regions being
used, which causes memory corruption.
Second, when an error occurs mid-loop, already pinned pages from the
current batch are not released before calling mshv_region_invalidate_pages(),
causing a page reference leak.
Fix by treating short pins as errors and explicitly unpinning the
partial batch before cleanup.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_regions.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index c28aac0726de..fdffd4f002f6 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
ret = pin_user_pages_fast(userspace_addr, nr_pages,
FOLL_WRITE | FOLL_LONGTERM,
pages);
- if (ret < 0)
+ if (ret != nr_pages)
goto release_pages;
}
return 0;
release_pages:
+ if (ret > 0)
+ done_count += ret;
mshv_region_invalidate_pages(region, 0, done_count);
- return ret;
+ return ret < 0 ? ret : -ENOMEM;
}
static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
^ permalink raw reply related
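The two cases named in the commit message above (a short pin count, and cleanup that must cover the partial batch) can be modelled in userspace as follows. This is a sketch under stated assumptions, not the driver code: pin_fn, pin_with_budget(), pin_region() and struct pin_result are hypothetical, and the backend merely imitates an interface that returns the number of pages pinned or a negative errno.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical backend: returns pages pinned (possibly short) or -errno. */
typedef long (*pin_fn)(size_t nr_pages, void *ctx);

struct pin_result {
	int err;		/* 0 on success, negative errno on failure */
	size_t released;	/* pages handed back during error cleanup */
};

/* Demo backend: pins at most *budget pages in total, then comes up short. */
static long pin_with_budget(size_t nr_pages, void *ctx)
{
	size_t *budget = ctx;
	size_t got = nr_pages < *budget ? nr_pages : *budget;

	*budget -= got;
	return (long)got;
}

/* Pin 'total' pages in batches; any short or failed batch is an error. */
static struct pin_result pin_region(size_t total, size_t batch,
				    pin_fn pin, void *ctx)
{
	struct pin_result res = { 0, 0 };
	size_t done = 0;

	while (done < total) {
		size_t want = total - done < batch ? total - done : batch;
		long got = pin(want, ctx);

		if (got != (long)want) {
			/* Count the partial batch so cleanup covers it too. */
			if (got > 0)
				done += (size_t)got;
			res.released = done;
			res.err = got < 0 ? (int)got : -ENOMEM;
			return res;
		}
		done += want;
	}
	return res;
}
```

With a budget of 10 pages, a request for 16 pages in batches of 4 fails on the third batch, reports -ENOMEM, and releases all 10 pages pinned so far, including the 2 from the short batch.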
* RE: [EXTERNAL] Re: [PATCH net-next v5 1/3] net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
From: Haiyang Zhang @ 2026-03-17 17:06 UTC (permalink / raw)
To: Jakub Kicinski, Haiyang Zhang
Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Donald Hunter, Jonathan Corbet, Shuah Khan,
Kory Maincent (Dent Project), Gal Pressman, Oleksij Rempel,
Vadim Fedorenko, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260316200434.3a0b99ec@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Monday, March 16, 2026 11:05 PM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; Andrew Lunn
> <andrew@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric Dumazet
> <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Simon Horman
> <horms@kernel.org>; Donald Hunter <donald.hunter@gmail.com>; Jonathan
> Corbet <corbet@lwn.net>; Shuah Khan <skhan@linuxfoundation.org>; Kory
> Maincent (Dent Project) <kory.maincent@bootlin.com>; Gal Pressman
> <gal@nvidia.com>; Oleksij Rempel <o.rempel@pengutronix.de>; Vadim
> Fedorenko <vadim.fedorenko@linux.dev>; linux-kernel@vger.kernel.org;
> linux-doc@vger.kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Paul
> Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next v5 1/3] net: ethtool: add ethtool
> COALESCE_RX_CQE_FRAMES/NSECS
>
> On Thu, 12 Mar 2026 12:37:04 -0700 Haiyang Zhang wrote:
> > +Rx CQE coalescing allows multiple received packets to be coalesced into
> a single
> > +Completion Queue Entry (CQE). ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES``
> describes the
> > +maximum number of frames that can be coalesced into a CQE.
> > +``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` describes max time in nanoseconds
> after the
> > +first packet arrival in a coalesced CQE to be sent.
>
> Looks good overall, can we broaden the language a bit?
> Replace "a single Completion Queue Entry (CQE)" with "a single
> Completion Queue Entry (CQE) or descriptor write back"?
Sure.
> I'm assuming your devices don't coalesce CQE writes.
> For non-RDMA devices the notion of CQE is a bit foreign but
> descriptor write back coalescing serves similar purpose.
> In either case host can't see the frame even if it's busy
> polling.
>
> So:
>
> Rx CQE coalescing allows multiple received packets to be coalesced
> into a single Completion Queue Entry (CQE) or descriptor writeback.
> ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` describes the maximum number of
> frames that can be coalesced into a CQE or writeback.
> ``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` describes max time in nanoseconds
> after the first packet arrival in a coalesced CQE to be sent.
Will do.
Thanks,
- Haiyang
^ permalink raw reply
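One way to read the two attributes discussed above is as a pair of flush conditions on an open CQE: a frame-count limit and a time limit measured from the first coalesced frame. The sketch below models that reading only. struct cqe_coalesce, cqe_on_frame() and cqe_on_timer() are hypothetical names, and how a given device actually implements coalescing is not specified in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

struct cqe_coalesce {
	uint32_t max_frames;	/* models ETHTOOL_A_COALESCE_RX_CQE_FRAMES */
	uint64_t max_nsecs;	/* models ETHTOOL_A_COALESCE_RX_CQE_NSECS */
	uint32_t pending;	/* frames held in the open CQE */
	uint64_t first_ns;	/* arrival time of the first pending frame */
};

/* Record one received frame; returns true if the CQE is written now. */
static bool cqe_on_frame(struct cqe_coalesce *c, uint64_t now_ns)
{
	if (c->pending == 0)
		c->first_ns = now_ns;
	c->pending++;
	if (c->pending >= c->max_frames) {
		c->pending = 0;
		return true;
	}
	return false;
}

/* Timer check; returns true if the open CQE has waited max_nsecs. */
static bool cqe_on_timer(struct cqe_coalesce *c, uint64_t now_ns)
{
	if (c->pending && now_ns - c->first_ns >= c->max_nsecs) {
		c->pending = 0;
		return true;
	}
	return false;
}
```

Until one of the two conditions fires, the host sees none of the pending frames, which matches the remark above that the host can't see the frame even if it's busy polling.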
* RE: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Michael Kelley @ 2026-03-17 17:25 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Jan Kiszka
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
Michael Kelley, Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260312170715.HA08BHiO@linutronix.de>
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
>
Let me try to address the range of questions here and in the follow-up
discussion. As background, an overview of VMBus interrupt handling is in:
Documentation/virt/hyperv/vmbus.rst
in the section entitled "Synthetic Interrupt Controller (synic)". The
relevant text is:
The SINT is mapped to a single per-CPU architectural interrupt (i.e,
an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
each CPU in the guest has a synic and may receive VMBus interrupts,
they are best modeled in Linux as per-CPU interrupts. This model works
well on arm64 where a single per-CPU Linux IRQ is allocated for
VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
"Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
across all CPUs and explicitly coded to call vmbus_isr(). In this case,
there's no Linux IRQ, and the interrupts are visible in aggregate in
/proc/interrupts on the "HYP" line.
The use of a statically allocated sysvec pre-dates my involvement in this
code starting in 2017, but I believe it was modelled after what Xen does,
and for the same reason -- to effectively create a per-CPU interrupt on
x86/x64. ACRN is also using HYPERVISOR_CALLBACK_VECTOR, but I
don't know if that is also to create a per-CPU interrupt.
More below ....
> On 2026-02-16 17:24:56 [+0100], Jan Kiszka wrote:
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -25,6 +25,7 @@
> > #include <linux/cpu.h>
> > #include <linux/sched/isolation.h>
> > #include <linux/sched/task_stack.h>
> > +#include <linux/smpboot.h>
> >
> > #include <linux/delay.h>
> > #include <linux/panic_notifier.h>
> > @@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
> > }
> > }
> >
> > -void vmbus_isr(void)
> > +static void __vmbus_isr(void)
> > {
> > struct hv_per_cpu_context *hv_cpu
> > = this_cpu_ptr(hv_context.cpu_context);
> > @@ -1363,6 +1364,53 @@ void vmbus_isr(void)
> >
> > add_interrupt_randomness(vmbus_interrupt);
>
> This is feeding entropy and would like to see interrupt registers. But
> since this is invoked from a thread it won't.
I'll respond to this topic on the new thread for the new patch
where Jan has moved the call to add_interrupt_randomness().
>
> > }
> > +
> …
> > +void vmbus_isr(void)
> > +{
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > + vmbus_irqd_wake();
> > + } else {
> > + lockdep_hardirq_threaded();
>
> What clears this? This is wrongly placed. This should go to
> sysvec_hyperv_callback() instead with its matching canceling part. The
> add_interrupt_randomness() should also be there and not here.
> sysvec_hyperv_stimer0() managed to do so.
I don't have any knowledge to bring regarding the use of
lockdep_hardirq_threaded().
>
> Different question: What guarantees that there won't be another
> interrupt before this one is done? The handshake appears to be
> deprecated. The interrupt itself returns ACKing (or not) but the actual
> handler is delayed to this thread. Depending on the userland it could
> take some time and I don't know how impatient the host is.
In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
is usually explicitly called to ack the interrupt.
There's no guarantee, in either the existing case or the new PREEMPT_RT
case, that another VMBus interrupt won't come in on the same CPU
before the tasklets scheduled by vmbus_message_sched() or
vmbus_chan_sched() have run. From a functional standpoint, the Linux
code and interaction with Hyper-V handles another interrupt correctly.
From a delay standpoint, there's not a problem for the normal (i.e., not
PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
about delays since the tasklets are scheduled from the new per-CPU thread.
But my understanding is that Jan's motivation for these changes is not to
achieve true RT behavior, since Hyper-V doesn't provide that anyway.
The goal is simply to make PREEMPT_RT builds functional, though Jan may
have further comments on the goal.
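To make the "another interrupt before the thread has run" case concrete, here is a single-threaded userspace model of deferring handling from the interrupt to a per-CPU thread. It is a sketch, not the patch: struct cpu_irq_state, model_isr() and model_thread_run() are hypothetical, the pending flag only loosely mirrors the per-CPU vmbus_irq_pending visible in the diff context, and it assumes that one handler pass picks up everything signalled so far.

```c
#include <stdbool.h>

struct cpu_irq_state {
	bool pending;		/* set from "hard interrupt" context */
	unsigned int raised;	/* interrupts delivered */
	unsigned int handled;	/* handler passes run by the thread */
};

/* Models the deferring branch: only mark work pending (and wake the thread). */
static void model_isr(struct cpu_irq_state *s)
{
	s->raised++;
	s->pending = true;
}

/* Models one wakeup of the per-CPU thread; returns true if it did work. */
static bool model_thread_run(struct cpu_irq_state *s)
{
	if (!s->pending)
		return false;
	/* Clear first, so an interrupt arriving during the scan re-arms it. */
	s->pending = false;
	s->handled++;	/* one pass scans all message and event state */
	return true;
}
```

Two interrupts arriving back to back before the thread runs collapse into a single handler pass in this model. That is functionally fine only if the pass really does scan all outstanding state, which is the assumption stated above.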
>
> > + __vmbus_isr();
> Moving on. This (trying very hard here) even schedules tasklets. Why?
> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> You don't want that.
Again, Jan can comment on the impact of delays due to ending up
in ksoftirqd.
>
> Couldn't the whole logic be integrated into the IRQ code? Then we could
> have mask/ unmask if supported/ provided and threaded interrupts. Then
> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> instead apic_eoi() + schedule_delayed_work().
As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
path gets to vmbus_isr(), the two architectures share the same code. The
same thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.
If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
to looking at it.
As I recently discovered in discussion with Jan, standard Linux IRQ handling
will *not* thread per-CPU interrupts. So even on arm64, where a standard
Linux per-CPU IRQ is used for the VMBus and STIMER0 interrupts, we can't
request threading.
I need to refresh my memory on sysvec_hyperv_reenlightenment(). If
I recall correctly, it's not a per-CPU interrupt, so it probably doesn't
need to have a hardcoded vector. Overall, the Hyper-V reenlightenment
functionality is a bit of a fossil that isn't needed on modern x86/x64
processors that support TSC scaling. And it doesn't exist for arm64.
It might be worth seeing if it could be dropped entirely ...
Michael
>
> > + }
> > +}
> > EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> >
> > static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
>
> Sebastian
* RE: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Michael Kelley @ 2026-03-17 17:26 UTC (permalink / raw)
To: Jan Kiszka, Sebastian Andrzej Siewior
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
Florian Bezdeka
In-Reply-To: <5262eafa-7f94-41c8-85d7-a2b8d7f27c5a@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com> Sent: Tuesday, March 17, 2026 6:34 AM
>
> On 17.03.26 14:22, Sebastian Andrzej Siewior wrote:
> > On 2026-03-17 12:56:02 [+0100], Jan Kiszka wrote:
> >>>> @@ -1410,6 +1408,8 @@ void vmbus_isr(void)
> >>>> lockdep_hardirq_threaded();
> >>>> __vmbus_isr();
> >>>> }
> >>>> +
> >>>> + add_interrupt_randomness(vmbus_interrupt);
> >>>> }
> >>>> EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> >>>
> >>> Why not sysvec_hyperv_callback()?
> >>
> >> Because we do not want to be x86-only.
> >
> > Who is other one and does it have its add_interrupt_randomness() there
> > already?
>
> It's the arm64 path of the hv support. Regarding the vmbus IRQ, it seems
> to be fully handled here, without an equivalent of
> arch/x86/kernel/cpu/mshyperv.c.
The arm64 path is the call to request_percpu_irq() in vmbus_bus_init().
That call is only made when running on arm64. See the code comment in
vmbus_bus_init().
The specified interrupt handler is vmbus_percpu_isr(), which again runs
only on arm64. It calls vmbus_isr(), which starts the common path for both
x86/x64 and arm64.
Then the slight weirdness is that the standard Linux IRQ handling for
per-CPU IRQs on arm64 with a GICv3 (which is what Hyper-V emulates)
does *not* call add_interrupt_randomness(). The function
gic_irq_domain_map() sets the IRQ handler for the PPI range to
handle_percpu_devid_irq(), and that function does not call
add_interrupt_randomness(). The other variant, handle_percpu_irq(),
calls handle_irq_event_percpu(), which *does* call
add_interrupt_randomness().
So at this point, putting the add_interrupt_randomness() in
vmbus_isr() is needed to catch both architectures. If the lack of
add_interrupt_randomness() in handle_percpu_devid_irq() is a bug,
then that would be a cleaner way to handle this. But maybe there's
a reason behind the current behavior of handle_percpu_devid_irq()
that I'm unaware of.
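To illustrate the placement argument above, here is a minimal userspace sketch. It is not kernel code: every name in it (common_isr, x86_sysvec_model, arm64_percpu_model, the *_stub helpers) is hypothetical. It only models two arch-specific entry points that add no entropy themselves and funnel into one shared function, the way both paths reach vmbus_isr().

```c
#include <assert.h>

/* Stand-in for the kernel's add_interrupt_randomness(); it only counts calls. */
int entropy_calls;
void add_interrupt_randomness_stub(int irq)
{
    (void)irq;
    entropy_calls++;
}

/* Stand-in for the message and channel processing done by the ISR body. */
int handled;
void isr_body_stub(void)
{
    handled++;
}

/* Common entry shared by both architectures, playing the role of vmbus_isr(). */
void common_isr(int irq)
{
    isr_body_stub();
    add_interrupt_randomness_stub(irq);
}

/* x86-like path: a hardcoded vector handler calls the common entry directly. */
void x86_sysvec_model(int irq)
{
    common_isr(irq);
}

/* arm64-like path: a per-CPU flow handler that feeds no entropy on its own. */
void arm64_percpu_model(int irq)
{
    common_isr(irq);
}
```

With the entropy call in the shared function, each modeled interrupt feeds it exactly once regardless of which entry point was used. Moving it into only one arch-specific entry would leave the other path without it.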
Michael
>
> > This is a driver, this does not belong here.
>
> Don't argue with me, I didn't put it here in the beginning. Maybe
> Michael can shed more light on this (and sorry for having forgotten to
> CC you on this patch).
>
> Jan
>
> --
> Siemens AG, Foundational Technologies
> Linux Expert Center
* RE: [EXTERNAL] [PATCH 02/16] RDMA: Consolidate patterns with offsetof() to ib_copy_validate_udata_in()
From: Long Li @ 2026-03-17 18:03 UTC (permalink / raw)
To: Jason Gunthorpe, Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler, Bryan Tan,
Cheng Xu, Gal Pressman, Junxian Huang, Kai Shen,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org,
Michal Kalderon, Michael Margolin, Nelson Escobar, Satish Kharat,
Selvin Xavier, Yossi Leybovich, Chengchang Tang, Tatyana Nikolova,
Vishnu Dasa, Yishai Hadas, Zhu Yanjun
Cc: patches@lists.linux.dev
In-Reply-To: <2-v1-2b86f54cda42+7d-rdma_udata_req_jgg@nvidia.com>
>
> Similar to the prior patch, these patterns are open coding an offsetofend(). The
> use of offsetof() targets the prior field as the last field in the struct.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Long Li <longli@microsoft.com>
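As a small illustration of the equivalence the commit message relies on, the sketch below uses a hypothetical struct (demo_create_cq, with illustrative field names) and a userspace OFFSETOFEND macro written in the same spirit as the kernel's offsetofend(). It assumes there is no padding between the two fields being compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace equivalent of offsetofend(): the offset just past MEMBER. */
#define OFFSETOFEND(TYPE, MEMBER) \
    (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Hypothetical request layout; the field names are illustrative only. */
struct demo_create_cq {
    uint64_t buf_addr;  /* last mandatory field */
    uint16_t flags;     /* optional field added later */
    uint16_t reserved0;
    uint32_t reserved1;
};

/* Old style: the input must reach the start of the first optional field. */
int old_style_ok(size_t inlen)
{
    return inlen >= offsetof(struct demo_create_cq, flags);
}

/* New style: the input must cover the end of the last mandatory field. */
int new_style_ok(size_t inlen)
{
    return inlen >= OFFSETOFEND(struct demo_create_cq, buf_addr);
}
```

Both checks accept and reject the same lengths here, which is why naming the last mandatory field is enough to express the old offsetof() test.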
> ---
> drivers/infiniband/hw/mana/cq.c |  9 ++-------
> drivers/infiniband/hw/mlx5/cq.c | 10 +++-------
> 2 files changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
> index b2749f971cd0af..3f932ef6e5fff6 100644
> --- a/drivers/infiniband/hw/mana/cq.c
> +++ b/drivers/infiniband/hw/mana/cq.c
> @@ -27,14 +27,9 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
> is_rnic_cq = mana_ib_is_rnic(mdev);
>
> if (udata) {
> - if (udata->inlen < offsetof(struct mana_ib_create_cq, flags))
> - return -EINVAL;
> -
> - err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> - if (err) {
> - ibdev_dbg(ibdev, "Failed to copy from udata for create cq, %d\n", err);
> + err = ib_copy_validate_udata_in(udata, ucmd, buf_addr);
> + if (err)
> return err;
> - }
>
> if ((!is_rnic_cq && attr->cqe > mdev->adapter_caps.max_qp_wr) ||
> attr->cqe > U32_MAX / COMP_ENTRY_SIZE) {
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 43a7b5ca49dcc9..643b3b7d387834 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -723,7 +723,6 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct
> ib_udata *udata,
> struct mlx5_ib_create_cq ucmd = {};
> unsigned long page_size;
> unsigned int page_offset_quantized;
> - size_t ucmdlen;
> __be64 *pas;
> int ncont;
> void *cqc;
> @@ -731,12 +730,9 @@ static int create_cq_user(struct mlx5_ib_dev *dev,
> struct ib_udata *udata,
> struct mlx5_ib_ucontext *context = rdma_udata_to_drv_context(
> udata, struct mlx5_ib_ucontext, ibucontext);
>
> - ucmdlen = min(udata->inlen, sizeof(ucmd));
> - if (ucmdlen < offsetof(struct mlx5_ib_create_cq, flags))
> - return -EINVAL;
> -
> - if (ib_copy_from_udata(&ucmd, udata, ucmdlen))
> - return -EFAULT;
> + err = ib_copy_validate_udata_in(udata, ucmd, cqe_comp_res_format);
> + if (err)
> + return err;
>
> if ((ucmd.flags & ~(MLX5_IB_CREATE_CQ_FLAGS_CQE_128B_PAD |
> MLX5_IB_CREATE_CQ_FLAGS_UAR_PAGE_INDEX |
> --
> 2.43.0
* RE: [EXTERNAL] [PATCH 03/16] RDMA: Consolidate patterns with sizeof() to ib_copy_validate_udata_in()
From: Long Li @ 2026-03-17 18:08 UTC (permalink / raw)
To: Jason Gunthorpe, Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler, Bryan Tan,
Cheng Xu, Gal Pressman, Junxian Huang, Kai Shen,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org,
Michal Kalderon, Michael Margolin, Nelson Escobar, Satish Kharat,
Selvin Xavier, Yossi Leybovich, Chengchang Tang, Tatyana Nikolova,
Vishnu Dasa, Yishai Hadas, Zhu Yanjun
Cc: patches@lists.linux.dev
In-Reply-To: <3-v1-2b86f54cda42+7d-rdma_udata_req_jgg@nvidia.com>
>
> Similar to the prior patch, these patterns are open coding an
> offsetofend() using sizeof(), which targets the last member of the current struct.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Long Li <longli@microsoft.com>
> ---
> drivers/infiniband/hw/mana/qp.c | 27 +++++++++------------------
> drivers/infiniband/hw/mana/wq.c | 10 ++--------
> drivers/infiniband/hw/mlx4/main.c | 6 ++----
> drivers/infiniband/hw/mlx5/cq.c | 2 +-
> drivers/infiniband/sw/rxe/rxe_verbs.c | 13 ++-----------
> drivers/infiniband/sw/siw/siw_verbs.c | 6 +-----
> 6 files changed, 17 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 82f84f7ad37a90..69c8d4f7a1f46b 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -111,16 +111,12 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
> u32 port;
> int ret;
>
> - if (!udata || udata->inlen < sizeof(ucmd))
> + if (!udata)
> return -EINVAL;
>
> - ret = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> - if (ret) {
> - ibdev_dbg(&mdev->ib_dev,
> - "Failed copy from udata for create rss-qp, err %d\n",
> - ret);
> + ret = ib_copy_validate_udata_in(udata, ucmd, port);
> + if (ret)
> return ret;
> - }
>
> if (attr->cap.max_recv_wr > mdev->adapter_caps.max_qp_wr) {
> ibdev_dbg(&mdev->ib_dev,
> @@ -282,15 +278,12 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp,
> struct ib_pd *ibpd,
> u32 port;
> int err;
>
> - if (!mana_ucontext || udata->inlen < sizeof(ucmd))
> + if (!mana_ucontext)
> return -EINVAL;
>
> - err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> - if (err) {
> - ibdev_dbg(&mdev->ib_dev,
> - "Failed to copy from udata create qp-raw, %d\n", err);
> + err = ib_copy_validate_udata_in(udata, ucmd, port);
> + if (err)
> return err;
> - }
>
> if (attr->cap.max_send_wr > mdev->adapter_caps.max_qp_wr) {
> ibdev_dbg(&mdev->ib_dev,
> @@ -535,17 +528,15 @@ static int mana_ib_create_rc_qp(struct ib_qp *ibqp,
> struct ib_pd *ibpd,
> u64 flags = 0;
> u32 doorbell;
>
> - if (!udata || udata->inlen < sizeof(ucmd))
> + if (!udata)
> return -EINVAL;
>
> mana_ucontext = rdma_udata_to_drv_context(udata, struct
> mana_ib_ucontext, ibucontext);
> doorbell = mana_ucontext->doorbell;
> flags = MANA_RC_FLAG_NO_FMR;
> - err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> - if (err) {
> - ibdev_dbg(&mdev->ib_dev, "Failed to copy from udata, %d\n", err);
> + err = ib_copy_validate_udata_in(udata, ucmd, queue_size);
> + if (err)
> return err;
> - }
>
> for (i = 0, j = 0; i < MANA_RC_QUEUE_TYPE_MAX; ++i) {
> /* skip FMR for user-level RC QPs */
> diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
> index 6206244f762e42..aceeea7f17b339 100644
> --- a/drivers/infiniband/hw/mana/wq.c
> +++ b/drivers/infiniband/hw/mana/wq.c
> @@ -15,15 +15,9 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
> struct mana_ib_wq *wq;
> int err;
>
> - if (udata->inlen < sizeof(ucmd))
> - return ERR_PTR(-EINVAL);
> -
> - err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> - if (err) {
> - ibdev_dbg(&mdev->ib_dev,
> - "Failed to copy from udata for create wq, %d\n", err);
> + err = ib_copy_validate_udata_in(udata, ucmd, reserved);
> + if (err)
> return ERR_PTR(err);
> - }
>
> wq = kzalloc_obj(*wq);
> if (!wq)
> diff --git a/drivers/infiniband/hw/mlx4/main.c
> b/drivers/infiniband/hw/mlx4/main.c
> index 73e17b4339eb60..16e4cffbd7a84d 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -50,6 +50,7 @@
> #include <rdma/ib_user_verbs.h>
> #include <rdma/ib_addr.h>
> #include <rdma/ib_cache.h>
> +#include <rdma/uverbs_ioctl.h>
>
> #include <net/bonding.h>
>
> @@ -445,10 +446,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
> struct mlx4_clock_params clock_params;
>
> if (uhw->inlen) {
> - if (uhw->inlen < sizeof(cmd))
> - return -EINVAL;
> -
> - err = ib_copy_from_udata(&cmd, uhw, sizeof(cmd));
> + err = ib_copy_validate_udata_in(uhw, cmd, reserved);
> if (err)
> return err;
>
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 643b3b7d387834..f5e75e51c6763f 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -1229,7 +1229,7 @@ static int resize_user(struct mlx5_ib_dev *dev, struct
> mlx5_ib_cq *cq,
> struct ib_umem *umem;
> int err;
>
> - err = ib_copy_from_udata(&ucmd, udata, sizeof(ucmd));
> + err = ib_copy_validate_udata_in(udata, ucmd, reserved1);
> if (err)
> return err;
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c
> b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index fe41362c51444c..c9fd40bfa09eb2 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -452,18 +452,9 @@ static int rxe_modify_srq(struct ib_srq *ibsrq, struct
> ib_srq_attr *attr,
> int err;
>
> if (udata) {
> - if (udata->inlen < sizeof(cmd)) {
> - err = -EINVAL;
> - rxe_dbg_srq(srq, "malformed udata\n");
> + err = ib_copy_validate_udata_in(udata, cmd, mmap_info_addr);
> + if (err)
> goto err_out;
> - }
> -
> - err = ib_copy_from_udata(&cmd, udata, sizeof(cmd));
> - if (err) {
> - err = -EFAULT;
> - rxe_dbg_srq(srq, "unable to read udata\n");
> - goto err_out;
> - }
> }
>
> err = rxe_srq_chk_attr(rxe, srq, attr, mask);
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
> index ef504db8f2b48b..1e1d262a4ae2db 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -1373,11 +1373,7 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64
> start, u64 len,
> struct siw_uresp_reg_mr uresp = {};
> struct siw_mem *mem = mr->mem;
>
> - if (udata->inlen < sizeof(ureq)) {
> - rv = -EINVAL;
> - goto err_out;
> - }
> - rv = ib_copy_from_udata(&ureq, udata, sizeof(ureq));
> + rv = ib_copy_validate_udata_in(udata, ureq, pad);
> if (rv)
> goto err_out;
>
> --
> 2.43.0
* RE: [EXTERNAL] [PATCH 15/16] RDMA: Remove redundant = {} for udata req structs
From: Long Li @ 2026-03-17 18:16 UTC (permalink / raw)
To: Jason Gunthorpe, Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler, Bryan Tan,
Cheng Xu, Gal Pressman, Junxian Huang, Kai Shen,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org,
Michal Kalderon, Michael Margolin, Nelson Escobar, Satish Kharat,
Selvin Xavier, Yossi Leybovich, Chengchang Tang, Tatyana Nikolova,
Vishnu Dasa, Yishai Hadas, Zhu Yanjun
Cc: patches@lists.linux.dev
In-Reply-To: <15-v1-2b86f54cda42+7d-rdma_udata_req_jgg@nvidia.com>
>
> Now that all of the udata request structs are loaded with the helpers the callers
> should not pre-zero them. The helpers all guarantee that the entire struct is filled
> with something.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Long Li <longli@microsoft.com>
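The "entire struct is filled with something" guarantee can be pictured with the userspace sketch below. It is only a model under stated assumptions, not the kernel helper: demo_copy_validate_in() and struct demo_req are hypothetical, memcpy() stands in for the copy from user space, and the handling of input longer than the struct is simply truncation here, which may differ from what the real helpers do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request struct; the layout is illustrative only. */
struct demo_req {
    uint64_t addr;
    uint32_t size;
    uint32_t flags;
};

/*
 * Model of a validate-and-copy helper: reject input shorter than min_len,
 * copy what the caller supplied (capped at dst_size), and zero-fill the rest
 * so that no byte of dst is left uninitialized on success.
 */
int demo_copy_validate_in(void *dst, size_t dst_size, size_t min_len,
                          const void *src, size_t inlen)
{
    size_t n = inlen < dst_size ? inlen : dst_size;

    if (inlen < min_len)
        return -EINVAL;

    memcpy(dst, src, n);
    memset((char *)dst + n, 0, dst_size - n);
    return 0;
}
```

Because the successful path writes every byte of the destination, a caller that always goes through such a helper gains nothing from a prior "= {}". A caller with a path that skips the helper still needs its own initialization, as the ocrdma hunk in this patch shows.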
> ---
> drivers/infiniband/hw/efa/efa_verbs.c | 4 ++--
> drivers/infiniband/hw/hns/hns_roce_main.c | 2 +-
> drivers/infiniband/hw/hns/hns_roce_srq.c | 2 +-
> drivers/infiniband/hw/mana/cq.c | 2 +-
> drivers/infiniband/hw/mana/qp.c | 2 +-
> drivers/infiniband/hw/mana/wq.c | 2 +-
> drivers/infiniband/hw/mlx4/qp.c | 4 ++--
> drivers/infiniband/hw/mlx5/cq.c | 2 +-
> drivers/infiniband/hw/mlx5/main.c | 2 +-
> drivers/infiniband/hw/mlx5/mr.c | 2 +-
> drivers/infiniband/hw/mlx5/qp.c | 4 ++--
> drivers/infiniband/hw/mlx5/srq.c | 2 +-
> drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 4 +++-
> drivers/infiniband/hw/qedr/verbs.c | 8 ++++----
> 14 files changed, 22 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c
> b/drivers/infiniband/hw/efa/efa_verbs.c
> index b491bcd886ccb0..f1020921f0e742 100644
> --- a/drivers/infiniband/hw/efa/efa_verbs.c
> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> @@ -682,7 +682,7 @@ int efa_create_qp(struct ib_qp *ibqp, struct
> ib_qp_init_attr *init_attr,
> struct efa_com_create_qp_result create_qp_resp;
> struct efa_dev *dev = to_edev(ibqp->device);
> struct efa_ibv_create_qp_resp resp = {};
> - struct efa_ibv_create_qp cmd = {};
> + struct efa_ibv_create_qp cmd;
> struct efa_qp *qp = to_eqp(ibqp);
> struct efa_ucontext *ucontext;
> u16 supported_efa_flags = 0;
> @@ -1121,7 +1121,7 @@ int efa_create_user_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
> struct efa_com_create_cq_result result;
> struct ib_device *ibdev = ibcq->device;
> struct efa_dev *dev = to_edev(ibdev);
> - struct efa_ibv_create_cq cmd = {};
> + struct efa_ibv_create_cq cmd;
> struct efa_cq *cq = to_ecq(ibcq);
> int entries = attr->cqe;
> bool set_src_addr;
> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c
> b/drivers/infiniband/hw/hns/hns_roce_main.c
> index ec6fb3f1177941..0dbe99aab6ad21 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_main.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> @@ -425,7 +425,7 @@ static int hns_roce_alloc_ucontext(struct ib_ucontext
> *uctx,
> struct hns_roce_ucontext *context = to_hr_ucontext(uctx);
> struct hns_roce_dev *hr_dev = to_hr_dev(uctx->device);
> struct hns_roce_ib_alloc_ucontext_resp resp = {};
> - struct hns_roce_ib_alloc_ucontext ucmd = {};
> + struct hns_roce_ib_alloc_ucontext ucmd;
> int ret = -EAGAIN;
>
> if (!hr_dev->active)
> diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c
> b/drivers/infiniband/hw/hns/hns_roce_srq.c
> index b37a76587aa868..601f8cdfce96a3 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_srq.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
> @@ -406,7 +406,7 @@ static int alloc_srq_db(struct hns_roce_dev *hr_dev,
> struct hns_roce_srq *srq,
> struct ib_udata *udata,
> struct hns_roce_ib_create_srq_resp *resp) {
> - struct hns_roce_ib_create_srq ucmd = {};
> + struct hns_roce_ib_create_srq ucmd;
> struct hns_roce_ucontext *uctx;
> int ret;
>
> diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
> index 3f932ef6e5fff6..f4cbe21763bf11 100644
> --- a/drivers/infiniband/hw/mana/cq.c
> +++ b/drivers/infiniband/hw/mana/cq.c
> @@ -13,7 +13,7 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
> struct mana_ib_create_cq_resp resp = {};
> struct mana_ib_ucontext *mana_ucontext;
> struct ib_device *ibdev = ibcq->device;
> - struct mana_ib_create_cq ucmd = {};
> + struct mana_ib_create_cq ucmd;
> struct mana_ib_dev *mdev;
> bool is_rnic_cq;
> u32 doorbell;
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 69c8d4f7a1f46b..ddc30d37d715f6 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -97,7 +97,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct
> ib_pd *pd,
> container_of(pd->device, struct mana_ib_dev, ib_dev);
> struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
> struct mana_ib_create_qp_rss_resp resp = {};
> - struct mana_ib_create_qp_rss ucmd = {};
> + struct mana_ib_create_qp_rss ucmd;
> mana_handle_t *mana_ind_table;
> struct mana_port_context *mpc;
> unsigned int ind_tbl_size;
> diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
> index aceeea7f17b339..5c2134a0b1a196 100644
> --- a/drivers/infiniband/hw/mana/wq.c
> +++ b/drivers/infiniband/hw/mana/wq.c
> @@ -11,7 +11,7 @@ struct ib_wq *mana_ib_create_wq(struct ib_pd *pd, {
> struct mana_ib_dev *mdev =
> container_of(pd->device, struct mana_ib_dev, ib_dev);
> - struct mana_ib_create_wq ucmd = {};
> + struct mana_ib_create_wq ucmd;
> struct mana_ib_wq *wq;
> int err;
>
> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> index cfb54ffcaac22c..790be09d985a1a 100644
> --- a/drivers/infiniband/hw/mlx4/qp.c
> +++ b/drivers/infiniband/hw/mlx4/qp.c
> @@ -709,7 +709,7 @@ static int _mlx4_ib_create_qp_rss(struct ib_pd *pd,
> struct mlx4_ib_qp *qp,
> struct ib_qp_init_attr *init_attr,
> struct ib_udata *udata)
> {
> - struct mlx4_ib_create_qp_rss ucmd = {};
> + struct mlx4_ib_create_qp_rss ucmd;
> int err;
>
> if (!udata) {
> @@ -4230,7 +4230,7 @@ int mlx4_ib_modify_wq(struct ib_wq *ibwq, struct
> ib_wq_attr *wq_attr,
> u32 wq_attr_mask, struct ib_udata *udata) {
> struct mlx4_ib_qp *qp = to_mqp((struct ib_qp *)ibwq);
> - struct mlx4_ib_modify_wq ucmd = {};
> + struct mlx4_ib_modify_wq ucmd;
> enum ib_wq_state cur_state, new_state;
> int err;
>
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index f5e75e51c6763f..1f94863e755cc7 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -720,7 +720,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct
> ib_udata *udata,
> int *cqe_size, int *index, int *inlen,
> struct uverbs_attr_bundle *attrs)
> {
> - struct mlx5_ib_create_cq ucmd = {};
> + struct mlx5_ib_create_cq ucmd;
> unsigned long page_size;
> unsigned int page_offset_quantized;
> __be64 *pas;
> diff --git a/drivers/infiniband/hw/mlx5/main.c
> b/drivers/infiniband/hw/mlx5/main.c
> index ff2c02c85625ce..fe3de414bfcad5 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -2178,7 +2178,7 @@ static int mlx5_ib_alloc_ucontext(struct ib_ucontext
> *uctx, {
> struct ib_device *ibdev = uctx->device;
> struct mlx5_ib_dev *dev = to_mdev(ibdev);
> - struct mlx5_ib_alloc_ucontext_req_v2 req = {};
> + struct mlx5_ib_alloc_ucontext_req_v2 req;
> struct mlx5_ib_alloc_ucontext_resp resp = {};
> struct mlx5_ib_ucontext *context = to_mucontext(uctx);
> struct mlx5_bfreg_info *bfregi;
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 49dcc39836c047..37f3d19bd374ee 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1768,7 +1768,7 @@ int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct
> ib_udata *udata)
> u32 *in = NULL;
> void *mkc;
> int err;
> - struct mlx5_ib_alloc_mw req = {};
> + struct mlx5_ib_alloc_mw req;
> struct {
> __u32 comp_mask;
> __u32 response_length;
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index 3b602ed0a2dafc..8f50e7342a7694 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -4692,7 +4692,7 @@ int mlx5_ib_modify_qp(struct ib_qp *ibqp, struct
> ib_qp_attr *attr,
> struct mlx5_ib_dev *dev = to_mdev(ibqp->device);
> struct mlx5_ib_modify_qp_resp resp = {};
> struct mlx5_ib_qp *qp = to_mqp(ibqp);
> - struct mlx5_ib_modify_qp ucmd = {};
> + struct mlx5_ib_modify_qp ucmd;
> enum ib_qp_type qp_type;
> enum ib_qp_state cur_state, new_state;
> int err = -EINVAL;
> @@ -5379,7 +5379,7 @@ static int prepare_user_rq(struct ib_pd *pd,
> struct mlx5_ib_rwq *rwq)
> {
> struct mlx5_ib_dev *dev = to_mdev(pd->device);
> - struct mlx5_ib_create_wq ucmd = {};
> + struct mlx5_ib_create_wq ucmd;
> int err;
>
> err = ib_copy_validate_udata_in_cm(udata, ucmd,
> diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
> index 6d89c0242cab61..852f6f502d14d0 100644
> --- a/drivers/infiniband/hw/mlx5/srq.c
> +++ b/drivers/infiniband/hw/mlx5/srq.c
> @@ -45,7 +45,7 @@ static int create_srq_user(struct ib_pd *pd, struct
> mlx5_ib_srq *srq,
> struct ib_udata *udata, int buf_size) {
> struct mlx5_ib_dev *dev = to_mdev(pd->device);
> - struct mlx5_ib_create_srq ucmd = {};
> + struct mlx5_ib_create_srq ucmd;
> struct mlx5_ib_ucontext *ucontext = rdma_udata_to_drv_context(
> udata, struct mlx5_ib_ucontext, ibucontext);
> int err;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index 8b285fcc638701..eed149f7a942b8 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -1311,12 +1311,14 @@ int ocrdma_create_qp(struct ib_qp *ibqp, struct
> ib_qp_init_attr *attrs,
> if (status)
> goto gen_err;
>
> - memset(&ureq, 0, sizeof(ureq));
> if (udata) {
> status = ib_copy_validate_udata_in(udata, ureq, rsvd1);
> if (status)
> return status;
> + } else {
> + memset(&ureq, 0, sizeof(ureq));
> }
> +
> ocrdma_set_qp_init_params(qp, pd, attrs);
> if (udata == NULL)
> qp->cap_flags |= (OCRDMA_QP_MW_BIND |
> OCRDMA_QP_LKEY0 |
> diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
> index 42d20b35ff3fe0..679aa6f3a63bc5 100644
> --- a/drivers/infiniband/hw/qedr/verbs.c
> +++ b/drivers/infiniband/hw/qedr/verbs.c
> @@ -264,7 +264,7 @@ int qedr_alloc_ucontext(struct ib_ucontext *uctx, struct
> ib_udata *udata)
> int rc;
> struct qedr_ucontext *ctx = get_qedr_ucontext(uctx);
> struct qedr_alloc_ucontext_resp uresp = {};
> - struct qedr_alloc_ucontext_req ureq = {};
> + struct qedr_alloc_ucontext_req ureq;
> struct qedr_dev *dev = get_qedr_dev(ibdev);
> struct qed_rdma_add_user_out_params oparams;
> struct qedr_user_mmap_entry *entry;
> @@ -913,7 +913,7 @@ int qedr_create_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
> };
> struct qedr_dev *dev = get_qedr_dev(ibdev);
> struct qed_rdma_create_cq_in_params params;
> - struct qedr_create_cq_ureq ureq = {};
> + struct qedr_create_cq_ureq ureq;
> int vector = attr->comp_vector;
> int entries = attr->cqe;
> struct qedr_cq *cq = get_qedr_cq(ibcq);
> @@ -1541,7 +1541,7 @@ int qedr_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init_attr,
> struct qedr_dev *dev = get_qedr_dev(ibsrq->device);
> struct qed_rdma_create_srq_out_params out_params;
> struct qedr_pd *pd = get_qedr_pd(ibsrq->pd);
> - struct qedr_create_srq_ureq ureq = {};
> + struct qedr_create_srq_ureq ureq;
> u64 pbl_base_addr, phy_prod_pair_addr;
> struct qedr_srq_hwq_info *hw_srq;
> u32 page_cnt, page_size;
> @@ -1837,7 +1837,7 @@ static int qedr_create_user_qp(struct qedr_dev *dev,
> struct qed_rdma_create_qp_in_params in_params;
> struct qed_rdma_create_qp_out_params out_params;
> struct qedr_create_qp_uresp uresp = {};
> - struct qedr_create_qp_ureq ureq = {};
> + struct qedr_create_qp_ureq ureq;
> int alloc_and_init = rdma_protocol_roce(&dev->ibdev, 1);
> struct qedr_ucontext *ctx = NULL;
> struct qedr_pd *pd = NULL;
> --
> 2.43.0
* RE: [EXTERNAL] [PATCH 2/2] drm/hyperv: During panic do VMBus unload after frame buffer is flushed
From: Long Li @ 2026-03-17 18:43 UTC (permalink / raw)
To: mhklinux@outlook.com, drawat.floss@gmail.com,
maarten.lankhorst@linux.intel.com, mripard@kernel.org,
tzimmermann@suse.de, airlied@gmail.com, simona@ffwll.ch,
KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
ryasuoka@redhat.com, jfalempe@redhat.com
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260209070201.1492-2-mhklinux@outlook.com>
>
> In a VM, Linux panic information (reason for the panic, stack trace,
> etc.) may be written to a serial console and/or a virtual frame buffer for a
> graphics console. The latter may need to be flushed back to the host hypervisor
> for display.
>
> The current Hyper-V DRM driver for the frame buffer does the flushing
> *after* the VMBus connection has been unloaded, such that panic messages are
> not displayed on the graphics console. A user with a Hyper-V graphics console is
> left with just a hung empty screen after a panic. The enhanced control that DRM
> provides over the panic display in the graphics console is similarly non-functional.
>
> Commit 3671f3777758 ("drm/hyperv: Add support for drm_panic") added the
> Hyper-V DRM driver support to flush the virtual frame buffer. It provided
> necessary functionality but did not handle the sequencing problem with VMBus
> unload.
>
> Fix the full problem by using VMBus functions to suppress the VMBus unload that
> is normally done by the VMBus driver in the panic path. Then after the frame
> buffer has been flushed, do the VMBus unload so that a kdump kernel can start
> cleanly. As expected, CONFIG_DRM_PANIC must be selected for these changes to
> have effect. As a side benefit, the enhanced features of the DRM panic path are
> also functional.
>
> Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Long Li <longli@microsoft.com>
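The ordering described in the commit message can be sketched as a userspace model. All model_* names are hypothetical stand-ins: model_set_skip_unload() and model_initiate_unload() mirror the roles of vmbus_set_skip_unload() and vmbus_initiate_unload(), and the step counters exist only to record the order of events.

```c
#include <assert.h>
#include <stdbool.h>

bool skip_unload;  /* set by the frame buffer driver at probe time */
int step;          /* monotonically increasing event counter */
int flush_step;    /* step at which the frame buffer flush happened */
int unload_step;   /* step at which the VMBus unload happened */

void model_set_skip_unload(bool skip)
{
    skip_unload = skip;
}

void model_initiate_unload(void)
{
    unload_step = ++step;
}

/* Panic notifier of the bus driver: unloads unless told to skip. */
void model_panic_notifier(void)
{
    if (!skip_unload)
        model_initiate_unload();
}

/* Panic flush of the frame buffer driver: flush first, then unload. */
void model_panic_flush(bool have_fb)
{
    if (have_fb)
        flush_step = ++step;
    model_initiate_unload();
}
```

With the skip flag set, the notifier leaves the connection alone, and the unload only happens after the flush. That is the sequencing the patch is after.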
> ---
> drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 4 ++++
> drivers/gpu/drm/hyperv/hyperv_drm_modeset.c | 15 ++++++++-------
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> index 06b5d96e6eaf..79e51643be67 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
> @@ -150,6 +150,9 @@ static int hyperv_vmbus_probe(struct hv_device *hdev,
> goto err_free_mmio;
> }
>
> + /* If DRM panic path is stubbed out VMBus code must do the unload */
> + if (IS_ENABLED(CONFIG_DRM_PANIC) && IS_ENABLED(CONFIG_PRINTK))
> + vmbus_set_skip_unload(true);
> drm_client_setup(dev, NULL);
>
> return 0;
> @@ -169,6 +172,7 @@ static void hyperv_vmbus_remove(struct hv_device
> *hdev)
> struct drm_device *dev = hv_get_drvdata(hdev);
> struct hyperv_drm_device *hv = to_hv(dev);
>
> + vmbus_set_skip_unload(false);
> drm_dev_unplug(dev);
> drm_atomic_helper_shutdown(dev);
> vmbus_close(hdev->channel);
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> index 7978f8c8108c..d48ca6c23b7c 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> @@ -212,15 +212,16 @@ static void hyperv_plane_panic_flush(struct
> drm_plane *plane)
> struct hyperv_drm_device *hv = to_hv(plane->dev);
> struct drm_rect rect;
>
> - if (!plane->state || !plane->state->fb)
> - return;
> + if (plane->state && plane->state->fb) {
> + rect.x1 = 0;
> + rect.y1 = 0;
> + rect.x2 = plane->state->fb->width;
> + rect.y2 = plane->state->fb->height;
>
> - rect.x1 = 0;
> - rect.y1 = 0;
> - rect.x2 = plane->state->fb->width;
> - rect.y2 = plane->state->fb->height;
> + hyperv_update_dirt(hv->hdev, &rect);
> + }
>
> - hyperv_update_dirt(hv->hdev, &rect);
> + vmbus_initiate_unload(true);
> }
>
> static const struct drm_plane_helper_funcs hyperv_plane_helper_funcs = {
> --
> 2.25.1
* RE: [EXTERNAL] [PATCH 1/2] Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
From: Long Li @ 2026-03-17 18:43 UTC (permalink / raw)
To: mhklinux@outlook.com, drawat.floss@gmail.com,
maarten.lankhorst@linux.intel.com, mripard@kernel.org,
tzimmermann@suse.de, airlied@gmail.com, simona@ffwll.ch,
KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
ryasuoka@redhat.com, jfalempe@redhat.com
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260209070201.1492-1-mhklinux@outlook.com>
>
> Currently, VMBus code initiates a VMBus unload in the panic path so that if a
> kdump kernel is loaded, it can start fresh in setting up its own VMBus connection.
> However, a driver for the VMBus virtual frame buffer may need to flush dirty
> portions of the frame buffer back to the Hyper-V host so that panic information is
> visible in the graphics console. To support such flushing, provide exported
> functions for the frame buffer driver to specify that the VMBus unload should not
> be done by the VMBus driver, and to initiate the VMBus unload itself.
> Together these allow a frame buffer driver to delay the VMBus unload until after
> it has completed the flush.
>
> Ideally, the VMBus driver could use its own panic-path callback to do the unload
> after all frame buffer drivers have finished. But DRM frame buffer drivers use the
> kmsg dump callback, and there are no callbacks after that in the panic path.
> Hence this somewhat messy approach to properly sequencing the frame buffer
> flush and the VMBus unload.
>
> Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Long Li <longli@microsoft.com>
> ---
> drivers/hv/channel_mgmt.c | 1 +
> drivers/hv/hyperv_vmbus.h | 1 -
> drivers/hv/vmbus_drv.c | 25 ++++++++++++++++++-------
> include/linux/hyperv.h | 3 +++
> 4 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 74fed2c073d4..5de83676dbad 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -944,6 +944,7 @@ void vmbus_initiate_unload(bool crash)
> else
> vmbus_wait_for_unload();
> }
> +EXPORT_SYMBOL_GPL(vmbus_initiate_unload);
>
> static void vmbus_setup_channel_state(struct vmbus_channel *channel,
> struct vmbus_channel_offer_channel *offer)
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index cdbc5f5c3215..5d3944fc93ae 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -440,7 +440,6 @@ void hv_vss_deinit(void);
> int hv_vss_pre_suspend(void);
> int hv_vss_pre_resume(void);
> void hv_vss_onchannelcallback(void *context);
> -void vmbus_initiate_unload(bool crash);
>
> static inline void hv_poll_channel(struct vmbus_channel *channel,
> void (*cb)(void *))
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 6785ad63a9cb..97dfa529d250 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -69,19 +69,29 @@ bool vmbus_is_confidential(void)
> }
> EXPORT_SYMBOL_GPL(vmbus_is_confidential);
>
> +static bool skip_vmbus_unload;
> +
> +/*
> + * Allow a VMBus framebuffer driver to specify that in the case of a panic,
> + * it will do the VMbus unload operation once it has flushed any dirty
> + * portions of the framebuffer to the Hyper-V host.
> + */
> +void vmbus_set_skip_unload(bool skip)
> +{
> + skip_vmbus_unload = skip;
> +}
> +EXPORT_SYMBOL_GPL(vmbus_set_skip_unload);
> +
> /*
> * The panic notifier below is responsible solely for unloading the
> * vmbus connection, which is necessary in a panic event.
> - *
> - * Notice an intrincate relation of this notifier with Hyper-V
> - * framebuffer panic notifier exists - we need vmbus connection alive
> - * there in order to succeed, so we need to order both with each other
> - * [see hvfb_on_panic()] - this is done using notifiers' priorities.
> */
> static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
> void *args)
> {
> - vmbus_initiate_unload(true);
> + if (!skip_vmbus_unload)
> + vmbus_initiate_unload(true);
> +
> return NOTIFY_DONE;
> }
> static struct notifier_block hyperv_panic_vmbus_unload_block = {
> @@ -2848,7 +2858,8 @@ static void hv_crash_handler(struct pt_regs *regs)
> {
> int cpu;
>
> - vmbus_initiate_unload(true);
> + if (!skip_vmbus_unload)
> + vmbus_initiate_unload(true);
> /*
> * In crash handler we can't schedule synic cleanup for all CPUs,
> * doing the cleanup for current CPU only. This should be sufficient
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index dfc516c1c719..b0502a336eb3 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1334,6 +1334,9 @@ int vmbus_allocate_mmio(struct resource **new,
> struct hv_device *device_obj,
> bool fb_overlap_ok);
> void vmbus_free_mmio(resource_size_t start, resource_size_t size);
>
> +void vmbus_initiate_unload(bool crash);
> +void vmbus_set_skip_unload(bool skip);
> +
> /*
> * GUID definitions of various offer types - services offered to the guest.
> */
> --
> 2.25.1
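For readers following the sequencing argument in the commit message, here is a minimal userspace sketch of the intended ordering. It is not kernel code: panic_notifier(), fb_panic_flush() and simulate_panic() are hypothetical stand-ins, and only the skip flag and the two exported function names are taken from the patch. It only models "who performs the unload, and when".

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; the real functions live in drivers/hv/. */
static bool skip_vmbus_unload;	/* mirrors the flag added by the patch */
static int unload_count;	/* how many times the unload ran */
static int seq, flush_seq, unload_seq;	/* simple event ordering */

void vmbus_set_skip_unload(bool skip)
{
	skip_vmbus_unload = skip;
}

void vmbus_initiate_unload(bool crash)
{
	(void)crash;
	unload_count++;
	unload_seq = ++seq;
}

/* Stand-in for the panic notifier: unload unless a driver claimed it. */
void panic_notifier(void)
{
	if (!skip_vmbus_unload)
		vmbus_initiate_unload(true);
}

/* Hypothetical framebuffer hook run at kmsg dump time, after notifiers. */
void fb_panic_flush(void)
{
	flush_seq = ++seq;		/* flush dirty region to the host */
	vmbus_initiate_unload(true);	/* then do the deferred unload */
}

/* True if the unload ran exactly once and, if a flush ran, after it. */
bool simulate_panic(bool fb_driver_loaded)
{
	skip_vmbus_unload = false;
	unload_count = seq = flush_seq = unload_seq = 0;

	if (fb_driver_loaded)
		vmbus_set_skip_unload(true);	/* done at driver probe time */

	panic_notifier();
	if (fb_driver_loaded)
		fb_panic_flush();

	return unload_count == 1 &&
	       (!fb_driver_loaded || flush_seq < unload_seq);
}
```

In this model simulate_panic(true) and simulate_panic(false) both report a single unload; with the framebuffer driver present the unload simply moves behind the flush.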
* [PATCH net-next v6 0/3] add ethtool COALESCE_RX_CQE_FRAMES/NSECS and use it in MANA driver
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
To: linux-hyperv, netdev; +Cc: haiyangz, paulros
From: Haiyang Zhang <haiyangz@microsoft.com>
Add two parameters for drivers supporting Rx CQE Coalescing.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Maximum time in nanoseconds, after the first packet arrival, before a
coalesced CQE or writeback is sent.
Also implement the new parameters in the MANA driver, along with
counters.
Haiyang Zhang (3):
net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
net: mana: Add support for RX CQE Coalescing
net: mana: Add ethtool counters for RX CQEs in coalesced type
Documentation/netlink/specs/ethtool.yaml | 8 ++
Documentation/networking/ethtool-netlink.rst | 11 +++
drivers/net/ethernet/microsoft/mana/mana_en.c | 84 +++++++++++++------
.../ethernet/microsoft/mana/mana_ethtool.c | 75 ++++++++++++++++-
include/linux/ethtool.h | 6 +-
include/net/mana/mana.h | 17 +++-
.../uapi/linux/ethtool_netlink_generated.h | 2 +
net/ethtool/coalesce.c | 14 +++-
8 files changed, 181 insertions(+), 36 deletions(-)
--
2.34.1
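As a rough illustration of how the two parameters interact, the sketch below counts how many CQEs a device would write for a given packet arrival pattern. This is one reading of the attribute descriptions above, not the behaviour of any particular NIC: count_cqes() and its flush rules (flush when the frame limit is reached, or when the timer started by the first packet has expired) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of Rx CQE coalescing.
 * arrival_ns: packet arrival times in nanoseconds, non-decreasing.
 * Returns the number of CQEs (or writebacks) the device would emit.
 */
size_t count_cqes(const uint64_t *arrival_ns, size_t n,
		  uint32_t rx_cqe_frames, uint32_t rx_cqe_nsecs)
{
	size_t cqes = 0, pending = 0;
	uint64_t first_ns = 0;

	for (size_t i = 0; i < n; i++) {
		/* Timer expired before this packet arrived: flush open CQE. */
		if (pending && arrival_ns[i] - first_ns >= rx_cqe_nsecs) {
			cqes++;
			pending = 0;
		}
		if (!pending)
			first_ns = arrival_ns[i];	/* timer starts here */
		pending++;

		/* Frame limit reached: flush immediately. */
		if (pending >= rx_cqe_frames) {
			cqes++;
			pending = 0;
		}
	}

	/* Anything still pending goes out when its timer fires. */
	if (pending)
		cqes++;
	return cqes;
}
```

Under this model, eight back-to-back packets with rx-cqe-frames 4 produce two CQEs, while rx-cqe-frames 1 produces eight; sparse traffic degrades to one CQE per packet because the timer expires first.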
* [PATCH net-next v6 1/3] net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
To: linux-hyperv, netdev, Andrew Lunn, Jakub Kicinski,
David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Donald Hunter, Jonathan Corbet, Shuah Khan,
Kory Maincent (Dent Project), Gal Pressman, Oleksij Rempel,
Vadim Fedorenko, linux-kernel, linux-doc
Cc: haiyangz, paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
Add two parameters for drivers supporting Rx CQE coalescing /
descriptor writeback.
ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
Maximum number of frames that can be coalesced into a CQE or
writeback.
ETHTOOL_A_COALESCE_RX_CQE_NSECS:
Maximum time in nanoseconds, after the first packet arrival, before a
coalesced CQE or writeback is sent.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v6:
Updated comment to include "descriptor writeback", as suggested by
Jakub Kicinski
---
Documentation/netlink/specs/ethtool.yaml | 8 ++++++++
Documentation/networking/ethtool-netlink.rst | 11 +++++++++++
include/linux/ethtool.h | 6 +++++-
include/uapi/linux/ethtool_netlink_generated.h | 2 ++
net/ethtool/coalesce.c | 14 +++++++++++++-
5 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 4707063af3b4..d254e26c014c 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -861,6 +861,12 @@ attribute-sets:
name: tx-profile
type: nest
nested-attributes: profile
+ -
+ name: rx-cqe-frames
+ type: u32
+ -
+ name: rx-cqe-nsecs
+ type: u32
-
name: pause-stat
@@ -2257,6 +2263,8 @@ operations:
- tx-aggr-time-usecs
- rx-profile
- tx-profile
+ - rx-cqe-frames
+ - rx-cqe-nsecs
dump: *coalesce-get-op
-
name: coalesce-set
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 32179168eb73..e92abf45faf5 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1076,6 +1076,8 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
+ ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` u32 max packets, Rx CQE
+ ``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` u32 delay (ns), Rx CQE
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
@@ -1109,6 +1111,13 @@ well with frequent small-sized URBs transmissions.
to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
+Rx CQE coalescing allows multiple received packets to be coalesced into a
+single Completion Queue Entry (CQE) or descriptor writeback.
+``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` describes the maximum number of
+frames that can be coalesced into a CQE or writeback.
+``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` describes max time in nanoseconds after
+the first packet arrival in a coalesced CQE or writeback to be sent.
+
COALESCE_SET
============
@@ -1147,6 +1156,8 @@ Request contents:
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
+ ``ETHTOOL_A_COALESCE_RX_CQE_FRAMES`` u32 max packets, Rx CQE
+ ``ETHTOOL_A_COALESCE_RX_CQE_NSECS`` u32 delay (ns), Rx CQE
=========================================== ====== =======================
Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 83c375840835..656d465bcd06 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -332,6 +332,8 @@ struct kernel_ethtool_coalesce {
u32 tx_aggr_max_bytes;
u32 tx_aggr_max_frames;
u32 tx_aggr_time_usecs;
+ u32 rx_cqe_frames;
+ u32 rx_cqe_nsecs;
};
/**
@@ -380,7 +382,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
#define ETHTOOL_COALESCE_TX_AGGR_TIME_USECS BIT(26)
#define ETHTOOL_COALESCE_RX_PROFILE BIT(27)
#define ETHTOOL_COALESCE_TX_PROFILE BIT(28)
-#define ETHTOOL_COALESCE_ALL_PARAMS GENMASK(28, 0)
+#define ETHTOOL_COALESCE_RX_CQE_FRAMES BIT(29)
+#define ETHTOOL_COALESCE_RX_CQE_NSECS BIT(30)
+#define ETHTOOL_COALESCE_ALL_PARAMS GENMASK(30, 0)
#define ETHTOOL_COALESCE_USECS \
(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
diff --git a/include/uapi/linux/ethtool_netlink_generated.h b/include/uapi/linux/ethtool_netlink_generated.h
index 114b83017297..8134baf7860f 100644
--- a/include/uapi/linux/ethtool_netlink_generated.h
+++ b/include/uapi/linux/ethtool_netlink_generated.h
@@ -371,6 +371,8 @@ enum {
ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
ETHTOOL_A_COALESCE_RX_PROFILE,
ETHTOOL_A_COALESCE_TX_PROFILE,
+ ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+ ETHTOOL_A_COALESCE_RX_CQE_NSECS,
__ETHTOOL_A_COALESCE_CNT,
ETHTOOL_A_COALESCE_MAX = (__ETHTOOL_A_COALESCE_CNT - 1)
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 3e18ca1ccc5e..349bb02c517a 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -118,6 +118,8 @@ static int coalesce_reply_size(const struct ethnl_req_info *req_base,
nla_total_size(sizeof(u32)) + /* _TX_AGGR_MAX_BYTES */
nla_total_size(sizeof(u32)) + /* _TX_AGGR_MAX_FRAMES */
nla_total_size(sizeof(u32)) + /* _TX_AGGR_TIME_USECS */
+ nla_total_size(sizeof(u32)) + /* _RX_CQE_FRAMES */
+ nla_total_size(sizeof(u32)) + /* _RX_CQE_NSECS */
total_modersz * 2; /* _{R,T}X_PROFILE */
}
@@ -269,7 +271,11 @@ static int coalesce_fill_reply(struct sk_buff *skb,
coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES,
kcoal->tx_aggr_max_frames, supported) ||
coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS,
- kcoal->tx_aggr_time_usecs, supported))
+ kcoal->tx_aggr_time_usecs, supported) ||
+ coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_FRAMES,
+ kcoal->rx_cqe_frames, supported) ||
+ coalesce_put_u32(skb, ETHTOOL_A_COALESCE_RX_CQE_NSECS,
+ kcoal->rx_cqe_nsecs, supported))
return -EMSGSIZE;
if (!req_base->dev || !req_base->dev->irq_moder)
@@ -338,6 +344,8 @@ const struct nla_policy ethnl_coalesce_set_policy[] = {
[ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES] = { .type = NLA_U32 },
[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES] = { .type = NLA_U32 },
[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS] = { .type = NLA_U32 },
+ [ETHTOOL_A_COALESCE_RX_CQE_FRAMES] = { .type = NLA_U32 },
+ [ETHTOOL_A_COALESCE_RX_CQE_NSECS] = { .type = NLA_U32 },
[ETHTOOL_A_COALESCE_RX_PROFILE] =
NLA_POLICY_NESTED(coalesce_profile_policy),
[ETHTOOL_A_COALESCE_TX_PROFILE] =
@@ -570,6 +578,10 @@ __ethnl_set_coalesce(struct ethnl_req_info *req_info, struct genl_info *info,
tb[ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES], &mod);
ethnl_update_u32(&kernel_coalesce.tx_aggr_time_usecs,
tb[ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS], &mod);
+ ethnl_update_u32(&kernel_coalesce.rx_cqe_frames,
+ tb[ETHTOOL_A_COALESCE_RX_CQE_FRAMES], &mod);
+ ethnl_update_u32(&kernel_coalesce.rx_cqe_nsecs,
+ tb[ETHTOOL_A_COALESCE_RX_CQE_NSECS], &mod);
if (dev->irq_moder && dev->irq_moder->profile_flags & DIM_PROFILE_RX) {
ret = ethnl_update_profile(dev, &dev->irq_moder->rx_profile,
--
2.34.1
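The bit assignments in include/linux/ethtool.h can be checked outside the kernel with a small sketch. BIT() and GENMASK() below are 32-bit userspace re-implementations written for this example, and coalesce_request_ok() is a hypothetical helper that mimics the documented rule that a request touching parameters the driver did not declare as supported is rejected. It is not the ethtool core's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel macros (32-bit only). */
#define BIT(n)		(UINT32_C(1) << (n))
#define GENMASK(h, l)	((UINT32_MAX >> (31 - (h))) & (UINT32_MAX << (l)))

/* Values as added by the patch. */
#define ETHTOOL_COALESCE_RX_CQE_FRAMES	BIT(29)
#define ETHTOOL_COALESCE_RX_CQE_NSECS	BIT(30)
#define ETHTOOL_COALESCE_ALL_PARAMS	GENMASK(30, 0)

/*
 * Hypothetical check: a set request is acceptable only if every
 * parameter it touches is in the driver's supported mask.
 */
bool coalesce_request_ok(uint32_t supported, uint32_t requested)
{
	return (requested & ~supported) == 0;
}
```

The old mask GENMASK(28, 0) does not cover bits 29 and 30, which is why ETHTOOL_COALESCE_ALL_PARAMS has to grow to GENMASK(30, 0) in the same patch. A driver advertising only the frames bit would, in this model, accept a frames change and refuse an nsecs change.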
* [PATCH net-next v6 2/3] net: mana: Add support for RX CQE Coalescing
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Saurabh Sengar, Dipayaan Roy, Aditya Garg,
Shiraz Saleem, Kees Cook, Subbaraya Sundeep, Breno Leitao,
linux-kernel, linux-rdma
Cc: paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
Our NIC can place up to 4 RX packets on 1 CQE. To support this feature,
check and process the type CQE_RX_COALESCED_4. The feature is disabled
by default, to avoid a possible latency regression.
Also add ethtool handlers to toggle this feature. To turn it on, run:
ethtool -C <nic> rx-cqe-frames 4
To turn it off:
ethtool -C <nic> rx-cqe-frames 1
rx-cqe-nsecs is the timeout value in nanoseconds, measured from the first
packet arrival, after which a coalesced CQE is sent. It is read-only for
this NIC.
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v4:
Fixed the old_buf issue found by AI.
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 74 ++++++++++++-------
.../ethernet/microsoft/mana/mana_ethtool.c | 60 ++++++++++++++-
include/net/mana/mana.h | 8 +-
3 files changed, 113 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ea71de39f996..fa30046dcd3d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1365,6 +1365,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
sizeof(resp));
req->hdr.req.msg_version = GDMA_MESSAGE_V2;
+ req->hdr.resp.msg_version = GDMA_MESSAGE_V2;
req->vport = apc->port_handle;
req->num_indir_entries = apc->indir_table_sz;
@@ -1376,7 +1377,9 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
req->update_hashkey = update_key;
req->update_indir_tab = update_tab;
req->default_rxobj = apc->default_rxobj;
- req->cqe_coalescing_enable = 0;
+
+ if (rx != TRI_STATE_FALSE)
+ req->cqe_coalescing_enable = apc->cqe_coalescing_enable;
if (update_key)
memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
@@ -1405,8 +1408,13 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
netdev_err(ndev, "vPort RX configuration failed: 0x%x\n",
resp.hdr.status);
err = -EPROTO;
+ goto out;
}
+ if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
+ apc->cqe_coalescing_timeout_ns =
+ resp.cqe_coalescing_timeout_ns;
+
netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
apc->port_handle, apc->indir_table_sz);
out:
@@ -1915,11 +1923,12 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
}
static void mana_rx_skb(void *buf_va, bool from_pool,
- struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
+ struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq,
+ int i)
{
struct mana_stats_rx *rx_stats = &rxq->stats;
struct net_device *ndev = rxq->ndev;
- uint pkt_len = cqe->ppi[0].pkt_len;
+ uint pkt_len = cqe->ppi[i].pkt_len;
u16 rxq_idx = rxq->rxq_idx;
struct napi_struct *napi;
struct xdp_buff xdp = {};
@@ -1963,7 +1972,7 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
}
if (cqe->rx_hashtype != 0 && (ndev->features & NETIF_F_RXHASH)) {
- hash_value = cqe->ppi[0].pkt_hash;
+ hash_value = cqe->ppi[i].pkt_hash;
if (cqe->rx_hashtype & MANA_HASH_L4)
skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L4);
@@ -2098,9 +2107,11 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
struct mana_recv_buf_oob *rxbuf_oob;
struct mana_port_context *apc;
struct device *dev = gc->dev;
+ bool coalesced = false;
void *old_buf = NULL;
u32 curr, pktlen;
bool old_fp;
+ int i;
apc = netdev_priv(ndev);
@@ -2112,13 +2123,16 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
++ndev->stats.rx_dropped;
rxbuf_oob = &rxq->rx_oobs[rxq->buf_index];
netdev_warn_once(ndev, "Dropped a truncated packet\n");
- goto drop;
- case CQE_RX_COALESCED_4:
- netdev_err(ndev, "RX coalescing is unsupported\n");
- apc->eth_stats.rx_coalesced_err++;
+ mana_move_wq_tail(rxq->gdma_rq,
+ rxbuf_oob->wqe_inf.wqe_size_in_bu);
+ mana_post_pkt_rxq(rxq);
return;
+ case CQE_RX_COALESCED_4:
+ coalesced = true;
+ break;
+
case CQE_RX_OBJECT_FENCE:
complete(&rxq->fence_event);
return;
@@ -2130,30 +2144,37 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
return;
}
- pktlen = oob->ppi[0].pkt_len;
+ for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
+ old_buf = NULL;
+ pktlen = oob->ppi[i].pkt_len;
+ if (pktlen == 0) {
+ if (i == 0)
+ netdev_err_once(
+ ndev,
+ "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
+ rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+ break;
+ }
- if (pktlen == 0) {
- /* data packets should never have packetlength of zero */
- netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
- rxq->gdma_id, cq->gdma_id, rxq->rxobj);
- return;
- }
+ curr = rxq->buf_index;
+ rxbuf_oob = &rxq->rx_oobs[curr];
+ WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- curr = rxq->buf_index;
- rxbuf_oob = &rxq->rx_oobs[curr];
- WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+ /* Unsuccessful refill will have old_buf == NULL.
+ * In this case, mana_rx_skb() will drop the packet.
+ */
+ mana_rx_skb(old_buf, old_fp, oob, rxq, i);
- /* Unsuccessful refill will have old_buf == NULL.
- * In this case, mana_rx_skb() will drop the packet.
- */
- mana_rx_skb(old_buf, old_fp, oob, rxq);
+ mana_move_wq_tail(rxq->gdma_rq,
+ rxbuf_oob->wqe_inf.wqe_size_in_bu);
-drop:
- mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
+ mana_post_pkt_rxq(rxq);
- mana_post_pkt_rxq(rxq);
+ if (!coalesced)
+ break;
+ }
}
static void mana_poll_rx_cq(struct mana_cq *cq)
@@ -3332,6 +3353,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc->port_handle = INVALID_MANA_HANDLE;
apc->pf_filter_handle = INVALID_MANA_HANDLE;
apc->port_idx = port_idx;
+ apc->cqe_coalescing_enable = 0;
mutex_init(&apc->vport_mutex);
apc->vport_use_count = 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..4b234b16e57a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] = {
tx_cqe_unknown_type)},
{"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
tx_linear_pkt_cnt)},
- {"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
- rx_coalesced_err)},
{"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
rx_cqe_unknown_type)},
};
@@ -390,6 +388,61 @@ static void mana_get_channels(struct net_device *ndev,
channel->combined_count = apc->num_queues;
}
+#define MANA_RX_CQE_NSEC_DEF 2048
+static int mana_get_coalesce(struct net_device *ndev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+
+ kernel_coal->rx_cqe_frames =
+ apc->cqe_coalescing_enable ? MANA_RXCOMP_OOB_NUM_PPI : 1;
+
+ kernel_coal->rx_cqe_nsecs = apc->cqe_coalescing_timeout_ns;
+
+ /* Return the default timeout value for old FW not providing
+ * this value.
+ */
+ if (apc->port_is_up && apc->cqe_coalescing_enable &&
+ !kernel_coal->rx_cqe_nsecs)
+ kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
+
+ return 0;
+}
+
+static int mana_set_coalesce(struct net_device *ndev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+ u8 saved_cqe_coalescing_enable;
+ int err;
+
+ if (kernel_coal->rx_cqe_frames != 1 &&
+ kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "rx-frames must be 1 or %u, got %u",
+ MANA_RXCOMP_OOB_NUM_PPI,
+ kernel_coal->rx_cqe_frames);
+ return -EINVAL;
+ }
+
+ saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+ apc->cqe_coalescing_enable =
+ kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
+
+ if (!apc->port_is_up)
+ return 0;
+
+ err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+ if (err)
+ apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+
+ return err;
+}
+
static int mana_set_channels(struct net_device *ndev,
struct ethtool_channels *channels)
{
@@ -510,6 +563,7 @@ static int mana_get_link_ksettings(struct net_device *ndev,
}
const struct ethtool_ops mana_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
.get_ethtool_stats = mana_get_ethtool_stats,
.get_sset_count = mana_get_sset_count,
.get_strings = mana_get_strings,
@@ -520,6 +574,8 @@ const struct ethtool_ops mana_ethtool_ops = {
.set_rxfh = mana_set_rxfh,
.get_channels = mana_get_channels,
.set_channels = mana_set_channels,
+ .get_coalesce = mana_get_coalesce,
+ .set_coalesce = mana_set_coalesce,
.get_ringparam = mana_get_ringparam,
.set_ringparam = mana_set_ringparam,
.get_link_ksettings = mana_get_link_ksettings,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..a7f89e7ddc56 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -378,7 +378,6 @@ struct mana_ethtool_stats {
u64 tx_cqe_err;
u64 tx_cqe_unknown_type;
u64 tx_linear_pkt_cnt;
- u64 rx_coalesced_err;
u64 rx_cqe_unknown_type;
};
@@ -557,6 +556,9 @@ struct mana_port_context {
bool port_is_up;
bool port_st_save; /* Saved port state */
+ u8 cqe_coalescing_enable;
+ u32 cqe_coalescing_timeout_ns;
+
struct mana_ethtool_stats eth_stats;
struct mana_ethtool_phy_stats phy_stats;
@@ -902,6 +904,10 @@ struct mana_cfg_rx_steer_req_v2 {
struct mana_cfg_rx_steer_resp {
struct gdma_resp_hdr hdr;
+
+ /* V2 */
+ u32 cqe_coalescing_timeout_ns;
+ u32 reserved1;
}; /* HW DATA */
/* Register HW vPort */
--
2.34.1
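The control flow of the reworked mana_process_rx_cqe() loop can be modelled in isolation. The sketch below is a simplified userspace model, not the driver code: struct ppi, struct rx_cqe, NUM_PPI and count_rx_packets() are names invented here, and buffer refill, skb construction and work queue tail updates are reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PPI 4	/* stands in for MANA_RXCOMP_OOB_NUM_PPI */

struct ppi {
	uint32_t pkt_len;
};

struct rx_cqe {
	struct ppi ppi[NUM_PPI];	/* per-packet info slots */
};

/* Returns how many packets one CQE would hand up the stack. */
int count_rx_packets(const struct rx_cqe *cqe, bool coalesced)
{
	int delivered = 0;

	for (int i = 0; i < NUM_PPI; i++) {
		/* The first zero-length slot ends the CQE. */
		if (cqe->ppi[i].pkt_len == 0)
			break;

		/* Driver: refill buffer, build skb, advance the tail. */
		delivered++;

		/* An ordinary CQE carries exactly one packet. */
		if (!coalesced)
			break;
	}
	return delivered;
}
```

One consequence visible in the model: a coalesced CQE with fewer than four packets is handled by the zero-length check, so no separate packet count is consulted.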