Linux-HyperV List

* Re: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Wei Liu @ 2026-03-18 16:20 UTC
  To: Michael Kelley
  Cc: Wei Liu, Stanislav Kinsburskii, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157A6D37C19379C74C88ACCD44EA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, Mar 18, 2026 at 02:38:49PM +0000, Michael Kelley wrote:
> From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 17, 2026 11:20 PM
> > 
> > On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > > >
> > > > The current error handling has two issues:
> > > >
> > > > First, pin_user_pages_fast() can return a short pin count (less than
> > > > requested but greater than zero) when it cannot pin all requested pages.
> > > > This is treated as success, leading to partially pinned regions being
> > > > used, which causes memory corruption.
> > > >
> > > > Second, when an error occurs mid-loop, already pinned pages from the
> > > > current batch are not released before calling mshv_region_evict_pages(),
> > > > causing a page reference leak.
> > >
> > > There's now an online LLM-based tool that is automatically reviewing
> > > kernel patches.  For this patch, the results are here:
> > >
> > >
> > > https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
> > >
> > > It has flagged the commit message as incorrectly referencing the
> > > function mshv_region_evict_pages(), which doesn't exist.
> > >
> > > FWIW, the announcement about sashiko.dev is here:
> > >
> > > https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> > >
> > > Other than the commit message reference, this looks good to me.
> > >
> > > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> > 
> > The second point is written as if the code here should release the
> > already pinned pages before calling mshv_region_invalidate_pages(), but
> > the code actually relies on mshv_region_invalidate_pages() to
> > release the pages. The change here fixes the accounting.
> > 
> >  Second, when an error occurs mid-loop, already pinned pages from the
> >  current batch are not accounted for before calling
> >  mshv_region_invalidate_pages(), causing a page reference leak.
> > 
> > And queued up the patch to hyperv-fixes.
> 
> One other thing I noticed:  The "Subject" of the patch is wrong. It
> mentions mshv_region_populate_pages(), but the function being
> modified is actually mshv_region_pin().

Good catch. I have updated the subject line and pushed to hyperv-fixes.

Wei

> 
> Michael
> 
> > 
> > Wei
> > 
> > >
> > > >
> > > > Fix by treating short pins as errors and explicitly unpinning the
> > > > partial batch before cleanup.
> > > >
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >  drivers/hv/mshv_regions.c |    6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > > > index c28aac0726de..fdffd4f002f6 100644
> > > > --- a/drivers/hv/mshv_regions.c
> > > > +++ b/drivers/hv/mshv_regions.c
> > > > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > > >  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > > >  					  FOLL_WRITE | FOLL_LONGTERM,
> > > >  					  pages);
> > > > -		if (ret < 0)
> > > > +		if (ret != nr_pages)
> > > >  			goto release_pages;
> > > >  	}
> > > >
> > > >  	return 0;
> > > >
> > > >  release_pages:
> > > > +	if (ret > 0)
> > > > +		done_count += ret;
> > > >  	mshv_region_invalidate_pages(region, 0, done_count);
> > > > -	return ret;
> > > > +	return ret < 0 ? ret : -ENOMEM;
> > > >  }
> > > >
> > > >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > > >
> > > >
> > >
> 

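For readers following the thread, the fix discussed above can be modelled in userspace. This is a minimal sketch, not the driver code: pin_region(), struct fake_pinner, fake_pin() and fake_unpin() are hypothetical stand-ins for mshv_region_pin(), pin_user_pages_fast() and mshv_region_invalidate_pages(). It shows the two points from the commit message: a short pin count is treated as a failure, and the pages pinned by the failing batch are added to the running count before the cleanup call so that they are released too.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the pinning backend: at most 'limit' pages can be pinned. */
struct fake_pinner {
	size_t limit;    /* total pages the backend is able to pin */
	size_t pinned;   /* pages pinned so far */
	size_t released; /* count passed to the last cleanup call */
};

/* Stand-in for pin_user_pages_fast(): may return fewer pages than requested. */
long fake_pin(size_t first_page, size_t nr_pages, void *ctx)
{
	struct fake_pinner *p = ctx;
	size_t room = p->limit - p->pinned;
	size_t got = nr_pages < room ? nr_pages : room;

	(void)first_page;
	p->pinned += got;
	return (long)got;
}

/* Stand-in for mshv_region_invalidate_pages(region, 0, count). */
void fake_unpin(size_t count, void *ctx)
{
	struct fake_pinner *p = ctx;

	p->released = count;
}

/* Pin 'total_pages' in batches; any short or failed batch aborts the whole pin. */
int pin_region(size_t total_pages, size_t batch,
	       long (*pin)(size_t, size_t, void *),
	       void (*unpin)(size_t, void *), void *ctx)
{
	size_t done = 0;
	long ret = 0;

	if (batch == 0)
		return -EINVAL;

	while (done < total_pages) {
		size_t nr = total_pages - done;

		if (nr > batch)
			nr = batch;

		ret = pin(done, nr, ctx);
		if (ret != (long)nr)	/* error or short pin */
			goto release;
		done += nr;
	}
	return 0;

release:
	/* Account for the partially pinned batch so cleanup covers it. */
	if (ret > 0)
		done += (size_t)ret;
	unpin(done, ctx);
	return ret < 0 ? (int)ret : -ENOMEM;
}
```

With a backend that can pin only 6 of 8 requested pages, pin_region() returns -ENOMEM and the cleanup call covers all 6 pinned pages rather than only the 4 from the completed batch.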

* Re: [PATCH v2 16/16] mm: on remap assert that input range within the proposed VMA
From: Suren Baghdasaryan @ 2026-03-18 16:02 UTC
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <4e152e7b8e1a93baf0777628eef9409d031cf8f6.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> Now we have range_in_vma_desc(), update remap_pfn_range_prepare() to check
> whether the input range in contained within the specified VMA, so we can

s/in contained/is contained

> fail at prepare time if an invalid range is specified.
>
> This covers the I/O remap mmap actions also which ultimately call into this
> function, and other mmap action types either already span the full VMA or
> check this already.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/memory.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 849d5d9eeb83..de0dd17759e2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3142,6 +3142,9 @@ int remap_pfn_range_prepare(struct vm_area_desc *desc)
>         const bool is_cow = vma_desc_is_cow_mapping(desc);
>         int err;
>
> +       if (!range_in_vma_desc(desc, start, end))
> +               return -EFAULT;
> +
>         err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn,
>                               &desc->pgoff);
>         if (err)
> --
> 2.53.0
>

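The prepare-time check in this patch can be illustrated with a small userspace model. This is a sketch under assumptions: struct desc_model is a hypothetical, cut-down stand-in for struct vm_area_desc, and remap_prepare_model() is not the kernel function. The overflow guard is an addition for the model and is not part of the quoted patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down descriptor: only the bounds matter here. */
struct desc_model {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/* Mirrors the shape of range_in_vma_desc(): is [start, end) inside the descriptor? */
bool range_in_desc(const struct desc_model *desc,
		   unsigned long start, unsigned long end)
{
	if (!desc)
		return false;

	return desc->start <= start && end <= desc->end;
}

/* Fail at prepare time, before any mapping work is attempted. */
int remap_prepare_model(const struct desc_model *desc,
			unsigned long start, unsigned long size)
{
	unsigned long end = start + size;

	if (end < start)	/* wrap-around guard: an assumption of this model */
		return -EFAULT;
	if (!range_in_desc(desc, start, end))
		return -EFAULT;

	return 0;
}
```

For a descriptor spanning [0x1000, 0x3000), a request for 0x2000 bytes at 0x1000 passes, while the same size at 0x2000 is rejected with -EFAULT before anything is mapped.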

* Re: [PATCH v2 15/16] mm: add mmap_action_map_kernel_pages[_full]()
From: Suren Baghdasaryan @ 2026-03-18 16:00 UTC
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <8e28e4b63bae67bfa1a59ccbac9dc6db1442d75d.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> A user can invoke mmap_action_map_kernel_pages() to specify that the
> mapping should map kernel pages starting from desc->start of a specified
> number of pages specified in an array.
>
> In order to implement this, adjust mmap_action_prepare() to be able to
> return an error code, as it makes sense to assert that the specified
> parameters are valid as quickly as possible as well as updating the VMA
> flags to include VMA_MIXEDMAP_BIT as necessary.
>
> This provides an mmap_prepare equivalent of vm_insert_pages().
>
> We additionally update the existing vm_insert_pages() code to use
> range_in_vma() and add a new range_in_vma_desc() helper function for the
> mmap_prepare case, sharing the code between the two in range_is_subset().
>
> We add both mmap_action_map_kernel_pages() and
> mmap_action_map_kernel_pages_full() to allow for both partial and full VMA
> mappings.
>
> We also add mmap_action_map_kernel_pages_discontig() to allow for
> discontiguous mapping of kernel pages should the need arise.
>
> We update the documentation to reflect the new features.
>
> Finally, we update the VMA tests accordingly to reflect the changes.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

With one nit,
Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  Documentation/filesystems/mmap_prepare.rst |  8 ++
>  include/linux/mm.h                         | 95 +++++++++++++++++++++-
>  include/linux/mm_types.h                   |  7 ++
>  mm/memory.c                                | 42 +++++++++-
>  mm/util.c                                  |  6 ++
>  tools/testing/vma/include/dup.h            |  7 ++
>  6 files changed, 159 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> index be76ae475b9c..e810aa4134eb 100644
> --- a/Documentation/filesystems/mmap_prepare.rst
> +++ b/Documentation/filesystems/mmap_prepare.rst
> @@ -156,5 +156,13 @@ pointer. These are:
>  * mmap_action_simple_ioremap() - Sets up an I/O remap from a specified
>    physical address and over a specified length.
>
> +* mmap_action_map_kernel_pages() - Maps a specified array of `struct page`
> +  pointers in the VMA from a specific offset.
> +
> +* mmap_action_map_kernel_pages_full() - Maps a specified array of `struct
> +  page` pointers over the entire VMA. The caller must ensure there are
> +  sufficient entries in the page array to cover the entire range of the
> +  described VMA.
> +
>  **NOTE:** The ``action`` field should never normally be manipulated directly,
>  rather you ought to use one of these helpers.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index df8fa6e6402b..6f0a3edb24e1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2912,7 +2912,7 @@ static inline bool folio_maybe_mapped_shared(struct folio *folio)
>   * The caller must add any reference (e.g., from folio_try_get()) it might be
>   * holding itself to the result.
>   *
> - * Returns the expected folio refcount.
> + * Returns: the expected folio refcount.

nit: I see both "Returns:" and "Return:" being used in the codebase
but this header file uses "Return:", so for consistency you should
probably do the same. This also applies to later instances in this
patch.

>   */
>  static inline int folio_expected_ref_count(const struct folio *folio)
>  {
> @@ -4364,6 +4364,45 @@ static inline void mmap_action_simple_ioremap(struct vm_area_desc *desc,
>         action->type = MMAP_SIMPLE_IO_REMAP;
>  }
>
> +/**
> + * mmap_action_map_kernel_pages - helper for mmap_prepare hook to specify that
> + * @num kernel pages contained in the @pages array should be mapped to userland
> + * starting at virtual address @start.
> + * @desc: The VMA descriptor for the VMA requiring kernel pags to be mapped.
> + * @start: The virtual address from which to map them.
> + * @pages: An array of struct page pointers describing the memory to map.
> + * @nr_pages: The number of entries in the @pages aray.
> + */
> +static inline void mmap_action_map_kernel_pages(struct vm_area_desc *desc,
> +               unsigned long start, struct page **pages,
> +               unsigned long nr_pages)
> +{
> +       struct mmap_action *action = &desc->action;
> +
> +       action->type = MMAP_MAP_KERNEL_PAGES;
> +       action->map_kernel.start = start;
> +       action->map_kernel.pages = pages;
> +       action->map_kernel.nr_pages = nr_pages;
> +       action->map_kernel.pgoff = desc->pgoff;
> +}
> +
> +/**
> + * mmap_action_map_kernel_pages_full - helper for mmap_prepare hook to specify that
> + * kernel pages contained in the @pages array should be mapped to userland
> + * from @desc->start to @desc->end.
> + * @desc: The VMA descriptor for the VMA requiring kernel pags to be mapped.
> + * @pages: An array of struct page pointers describing the memory to map.
> + *
> + * The caller must ensure that @pages contains sufficient entries to cover the
> + * entire range described by @desc.
> + */
> +static inline void mmap_action_map_kernel_pages_full(struct vm_area_desc *desc,
> +               struct page **pages)
> +{
> +       mmap_action_map_kernel_pages(desc, desc->start, pages,
> +                                    vma_desc_pages(desc));
> +}
> +
>  int mmap_action_prepare(struct vm_area_desc *desc);
>  int mmap_action_complete(struct vm_area_struct *vma,
>                          struct mmap_action *action);
> @@ -4380,10 +4419,59 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
>         return vma;
>  }
>
> +/**
> + * range_is_subset - Is the specified inner range a subset of the outer range?
> + * @outer_start: The start of the outer range.
> + * @outer_end: The exclusive end of the outer range.
> + * @inner_start: The start of the inner range.
> + * @inner_end: The exclusive end of the inner range.
> + *
> + * Returns: %true if [inner_start, inner_end) is a subset of [outer_start,
> + * outer_end), otherwise %false.
> + */
> +static inline bool range_is_subset(unsigned long outer_start,
> +                                  unsigned long outer_end,
> +                                  unsigned long inner_start,
> +                                  unsigned long inner_end)
> +{
> +       return outer_start <= inner_start && inner_end <= outer_end;
> +}
> +
> +/**
> + * range_in_vma - is the specified [@start, @end) range a subset of the VMA?
> + * @vma: The VMA against which we want to check [@start, @end).
> + * @start: The start of the range we wish to check.
> + * @end: The exclusive end of the range we wish to check.
> + *
> + * Returns: %true if [@start, @end) is a subset of [@vma->vm_start,
> + * @vma->vm_end), %false otherwise.
> + */
>  static inline bool range_in_vma(const struct vm_area_struct *vma,
>                                 unsigned long start, unsigned long end)
>  {
> -       return (vma && vma->vm_start <= start && end <= vma->vm_end);
> +       if (!vma)
> +               return false;
> +
> +       return range_is_subset(vma->vm_start, vma->vm_end, start, end);
> +}
> +
> +/**
> + * range_in_vma_desc - is the specified [@start, @end) range a subset of the VMA
> + * described by @desc, a VMA descriptor?
> + * @desc: The VMA descriptor against which we want to check [@start, @end).
> + * @start: The start of the range we wish to check.
> + * @end: The exclusive end of the range we wish to check.
> + *
> + * Returns: %true if [@start, @end) is a subset of [@desc->start, @desc->end),
> + * %false otherwise.
> + */
> +static inline bool range_in_vma_desc(const struct vm_area_desc *desc,
> +                                    unsigned long start, unsigned long end)
> +{
> +       if (!desc)
> +               return false;
> +
> +       return range_is_subset(desc->start, desc->end, start, end);
>  }
>
>  #ifdef CONFIG_MMU
> @@ -4427,6 +4515,9 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
>  int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
>                         struct page **pages, unsigned long *num);
> +int map_kernel_pages_prepare(struct vm_area_desc *desc);
> +int map_kernel_pages_complete(struct vm_area_struct *vma,
> +                             struct mmap_action *action);
>  int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
>                                 unsigned long num);
>  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7538d64f8848..c46224020a46 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -815,6 +815,7 @@ enum mmap_action_type {
>         MMAP_REMAP_PFN,         /* Remap PFN range. */
>         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
>         MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> +       MMAP_MAP_KERNEL_PAGES,  /* Map kernel page range from array. */
>  };
>
>  /*
> @@ -833,6 +834,12 @@ struct mmap_action {
>                         phys_addr_t start_phys_addr;
>                         unsigned long size;
>                 } simple_ioremap;
> +               struct {
> +                       unsigned long start;
> +                       struct page **pages;
> +                       unsigned long nr_pages;
> +                       pgoff_t pgoff;
> +               } map_kernel;
>         };
>         enum mmap_action_type type;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f3f4046aee97..849d5d9eeb83 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2484,13 +2484,14 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
>  int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
>                         struct page **pages, unsigned long *num)
>  {
> -       const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1;
> +       const unsigned long nr_pages = *num;
> +       const unsigned long end = addr + PAGE_SIZE * nr_pages;
>
> -       if (addr < vma->vm_start || end_addr >= vma->vm_end)
> +       if (!range_in_vma(vma, addr, end))
>                 return -EFAULT;
>         if (!(vma->vm_flags & VM_MIXEDMAP)) {
> -               BUG_ON(mmap_read_trylock(vma->vm_mm));
> -               BUG_ON(vma->vm_flags & VM_PFNMAP);
> +               VM_WARN_ON_ONCE(mmap_read_trylock(vma->vm_mm));
> +               VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
>                 vm_flags_set(vma, VM_MIXEDMAP);
>         }
>         /* Defer page refcount checking till we're about to map that page. */
> @@ -2498,6 +2499,39 @@ int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_pages);
>
> +int map_kernel_pages_prepare(struct vm_area_desc *desc)
> +{
> +       const struct mmap_action *action = &desc->action;
> +       const unsigned long addr = action->map_kernel.start;
> +       unsigned long nr_pages, end;
> +
> +       if (!vma_desc_test(desc, VMA_MIXEDMAP_BIT)) {
> +               VM_WARN_ON_ONCE(mmap_read_trylock(desc->mm));
> +               VM_WARN_ON_ONCE(vma_desc_test(desc, VMA_PFNMAP_BIT));
> +               vma_desc_set_flags(desc, VMA_MIXEDMAP_BIT);
> +       }
> +
> +       nr_pages = action->map_kernel.nr_pages;
> +       end = addr + PAGE_SIZE * nr_pages;
> +       if (!range_in_vma_desc(desc, addr, end))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL(map_kernel_pages_prepare);
> +
> +int map_kernel_pages_complete(struct vm_area_struct *vma,
> +                             struct mmap_action *action)
> +{
> +       unsigned long nr_pages;
> +
> +       nr_pages = action->map_kernel.nr_pages;
> +       return insert_pages(vma, action->map_kernel.start,
> +                           action->map_kernel.pages,
> +                           &nr_pages, vma->vm_page_prot);
> +}
> +EXPORT_SYMBOL(map_kernel_pages_complete);
> +
>  /**
>   * vm_insert_page - insert single page into user vma
>   * @vma: user vma to map to
> diff --git a/mm/util.c b/mm/util.c
> index a166c48fe894..dea590e7a26c 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1441,6 +1441,8 @@ int mmap_action_prepare(struct vm_area_desc *desc)
>                 return io_remap_pfn_range_prepare(desc);
>         case MMAP_SIMPLE_IO_REMAP:
>                 return simple_ioremap_prepare(desc);
> +       case MMAP_MAP_KERNEL_PAGES:
> +               return map_kernel_pages_prepare(desc);
>         }
>
>         WARN_ON_ONCE(1);
> @@ -1472,6 +1474,9 @@ int mmap_action_complete(struct vm_area_struct *vma,
>         case MMAP_IO_REMAP_PFN:
>                 err = io_remap_pfn_range_complete(vma, action);
>                 break;
> +       case MMAP_MAP_KERNEL_PAGES:
> +               err = map_kernel_pages_complete(vma, action);
> +               break;
>         case MMAP_SIMPLE_IO_REMAP:
>                 /*
>                  * The simple I/O remap should have been delegated to an I/O
> @@ -1494,6 +1499,7 @@ int mmap_action_prepare(struct vm_area_desc *desc)
>         case MMAP_REMAP_PFN:
>         case MMAP_IO_REMAP_PFN:
>         case MMAP_SIMPLE_IO_REMAP:
> +       case MMAP_MAP_KERNEL_PAGES:
>                 WARN_ON_ONCE(1); /* nommu cannot handle these. */
>                 break;
>         }
> diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> index 6658df26698a..4407caf207ad 100644
> --- a/tools/testing/vma/include/dup.h
> +++ b/tools/testing/vma/include/dup.h
> @@ -454,6 +454,7 @@ enum mmap_action_type {
>         MMAP_REMAP_PFN,         /* Remap PFN range. */
>         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
>         MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> +       MMAP_MAP_KERNEL_PAGES,  /* Map kernel page range from an array. */
>  };
>
>  /*
> @@ -472,6 +473,12 @@ struct mmap_action {
>                         phys_addr_t start;
>                         unsigned long len;
>                 } simple_ioremap;
> +               struct {
> +                       unsigned long start;
> +                       struct page **pages;
> +                       unsigned long num;
> +                       pgoff_t pgoff;
> +               } map_kernel;
>         };
>         enum mmap_action_type type;
>
> --
> 2.53.0
>

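The vm_insert_pages() hunk above swaps an inclusive end address for a half-open [start, end) range. The sketch below is a userspace model of both forms, useful for convincing oneself that they accept the same ranges when nr_pages is at least 1 and the arithmetic does not wrap. MODEL_PAGE_SIZE, old_check_ok() and new_check_ok() are hypothetical names; range_is_subset() copies the helper from the patch.

```c
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* Same logic as the range_is_subset() helper added by the patch. */
bool range_is_subset(unsigned long outer_start, unsigned long outer_end,
		     unsigned long inner_start, unsigned long inner_end)
{
	return outer_start <= inner_start && inner_end <= outer_end;
}

/* Old form: inclusive last byte, compared with >= against the exclusive VMA end. */
bool old_check_ok(unsigned long vm_start, unsigned long vm_end,
		  unsigned long addr, unsigned long nr_pages)
{
	const unsigned long end_addr = addr + (nr_pages * MODEL_PAGE_SIZE) - 1;

	return !(addr < vm_start || end_addr >= vm_end);
}

/* New form: exclusive end, checked as a subset of [vm_start, vm_end). */
bool new_check_ok(unsigned long vm_start, unsigned long vm_end,
		  unsigned long addr, unsigned long nr_pages)
{
	const unsigned long end = addr + MODEL_PAGE_SIZE * nr_pages;

	return range_is_subset(vm_start, vm_end, addr, end);
}
```

The two checks can differ for nr_pages == 0, where the old form computes addr - 1 as its end address, so the model is only meant for non-empty requests.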

* [PATCH] net: mana: fix use-after-free in add_adev() error path
From: Guangshuo Li @ 2026-03-18 15:40 UTC
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Saurabh Sengar, Erni Sri Satya Vennela,
	Shradha Gupta, Dipayaan Roy, Aditya Garg, Shiraz Saleem,
	Leon Romanovsky, linux-hyperv, netdev, linux-kernel
  Cc: Guangshuo Li, stable

If auxiliary_device_add() fails, add_adev() calls
auxiliary_device_uninit(adev), whose release callback adev_release()
frees the containing struct mana_adev.

The current error path then falls through to init_fail and accesses
adev->id. Since adev is embedded in struct mana_adev, this may lead
to a use-after-free.

Fix it by storing the allocated auxiliary device id in a local
variable and using that saved id in the cleanup path after
auxiliary_device_uninit().

Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1ad154f9db1a..70d71594c599 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3362,6 +3362,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
 {
 	struct auxiliary_device *adev;
 	struct mana_adev *madev;
+	int id;
 	int ret;
 
 	madev = kzalloc(sizeof(*madev), GFP_KERNEL);
@@ -3372,7 +3373,8 @@ static int add_adev(struct gdma_dev *gd, const char *name)
 	ret = mana_adev_idx_alloc();
 	if (ret < 0)
 		goto idx_fail;
-	adev->id = ret;
+	id = ret;
+	adev->id = id;
 
 	adev->name = name;
 	adev->dev.parent = gd->gdma_context->dev;
@@ -3398,7 +3400,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
 	auxiliary_device_uninit(adev);
 
 init_fail:
-	mana_adev_idx_free(adev->id);
+	mana_adev_idx_free(id);
 
 idx_fail:
 	kfree(madev);
-- 
2.43.0


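The pattern this patch relies on, copying a field out of an object before a release callback may free it, can be shown with a userspace model. This is a sketch only: struct aux_model, struct container_model, model_release(), model_idx_free() and add_adev_model() are hypothetical stand-ins for the auxiliary device, struct mana_adev, adev_release(), mana_adev_idx_free() and add_adev(). The 'add_fails' flag simulates auxiliary_device_add() returning an error.

```c
#include <stdlib.h>

struct aux_model {
	int id;
	void (*release)(struct aux_model *adev);
};

/* The device is embedded as the first member, so both share one address. */
struct container_model {
	struct aux_model adev;
};

int last_freed_id = -1;	/* records the id handed back to the allocator */

void model_idx_free(int id)
{
	last_freed_id = id;
}

/* Release callback: frees the whole container, including the embedded device. */
void model_release(struct aux_model *adev)
{
	free((struct container_model *)adev);
}

int add_adev_model(int allocated_id, int add_fails,
		   struct container_model **out)
{
	struct container_model *c = calloc(1, sizeof(*c));
	int id;

	*out = NULL;
	if (!c)
		return -1;

	id = allocated_id;	/* keep a copy that outlives the container */
	c->adev.id = id;
	c->adev.release = model_release;

	if (add_fails) {
		c->adev.release(&c->adev);	/* container is freed here */
		model_idx_free(id);		/* safe: does not touch c->adev.id */
		return -1;
	}

	*out = c;
	return 0;
}
```

Reading c->adev.id after the release call would be the use-after-free described above; the local copy avoids it without changing the order of the cleanup steps.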

* Re: [PATCH v2 12/16] mm: allow handling of stacked mmap_prepare hooks in more drivers
From: Suren Baghdasaryan @ 2026-03-18 15:33 UTC
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <72750af6906fd96fb6f18e83ac3e694cf357a2c1.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> While the conversion of mmap hooks to mmap_prepare is underway, we wil

nit: s/wil/will

> encounter situations where mmap hooks need to invoke nested mmap_prepare
> hooks.
>
> The nesting of mmap hooks is termed 'stacking'.  In order to flexibly
> facilitate the conversion of custom mmap hooks in drivers which stack, we
> must split up the existing compat_vma_mapped() function into two separate
> functions:
>
> * compat_set_desc_from_vma() - This allows the setting of a vm_area_desc
>   object's fields to the relevant fields of a VMA.
>
> * __compat_vma_mmap() - Once an mmap_prepare hook has been executed upon a
>   vm_area_desc object, this function performs any mmap actions specified by
>   the mmap_prepare hook and then invokes its vm_ops->mapped() hook if any
>   were specified.
>
> In ordinary cases, where a file's f_op->mmap_prepare() hook simply needs to
> be invoked in a stacked mmap() hook, compat_vma_mmap() can be used.
>
> However some drivers define their own nested hooks, which are invoked in
> turn by another hook.
>
> A concrete example is vmbus_channel->mmap_ring_buffer(), which is invoked
> in turn by bin_attribute->mmap():
>
> vmbus_channel->mmap_ring_buffer() has a signature of:
>
> int (*mmap_ring_buffer)(struct vmbus_channel *channel,
>                         struct vm_area_struct *vma);
>
> And bin_attribute->mmap() has a signature of:
>
>         int (*mmap)(struct file *, struct kobject *,
>                     const struct bin_attribute *attr,
>                     struct vm_area_struct *vma);
>
> And so compat_vma_mmap() cannot be used here for incremental conversion of
> hooks from mmap() to mmap_prepare().
>
> There are many such instances like this, where conversion to mmap_prepare
> would otherwise cascade to a huge change set due to nesting of this kind.
>
> The changes in this patch mean we could now instead convert
> vmbus_channel->mmap_ring_buffer() to
> vmbus_channel->mmap_prepare_ring_buffer(), and implement something like:
>
>         struct vm_area_desc desc;
>         int err;
>
>         compat_set_desc_from_vma(&desc, file, vma);
>         err = channel->mmap_prepare_ring_buffer(channel, &desc);
>         if (err)
>                 return err;
>
>         return __compat_vma_mmap(&desc, vma);
>
> Allowing us to incrementally update this logic, and other logic like it.

The way I understand this and the next 2 patches is that they are
preparations for later replacement of mmap() with mmap_prepare() but
they don't yet do that completely. Is that right?
To clarify what I mean, in [1] for example, you are replacing struct
uio_info.mmap with uio_info.mmap_prepare but it's still being called
from uio_mmap(). IOW, you are not replacing uio_mmap with
uio_mmap_prepare. Is that the next step that's not yet implemented?

[1] https://lore.kernel.org/all/892a8b32e5ef64c69239ccc2d1bd364716fd7fdf.1773695307.git.ljs@kernel.org/
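For reference, the stacking flow described in the quoted commit message can be modelled in userspace as three steps: build a descriptor from the VMA, run the nested prepare hook on it, then apply the result back to the VMA. This is a sketch under assumptions: every type and function below (vma_model, desc_model, set_desc_from_vma(), apply_desc_to_vma(), outer_mmap(), sample_prepare(), failing_prepare()) is hypothetical and only mirrors the roles of compat_set_desc_from_vma(), __compat_vma_mmap() and a driver's nested hook.

```c
#include <errno.h>

struct vma_model {
	unsigned long start, end, flags;
};

enum action_model { ACTION_NOTHING, ACTION_REMAP };

struct desc_model {
	unsigned long start, end, flags;
	enum action_model action;
};

/* Step 1: copy the relevant VMA fields into a fresh descriptor. */
void set_desc_from_vma(struct desc_model *desc, const struct vma_model *vma)
{
	desc->start = vma->start;
	desc->end = vma->end;
	desc->flags = vma->flags;
	desc->action = ACTION_NOTHING;	/* default */
}

/* Step 3: write the prepared state back; a real version would run the action. */
int apply_desc_to_vma(const struct desc_model *desc, struct vma_model *vma)
{
	vma->flags = desc->flags;
	return 0;
}

/* Outer, not-yet-converted mmap hook that stacks a converted nested hook. */
int outer_mmap(struct vma_model *vma, int (*nested)(struct desc_model *))
{
	struct desc_model desc;
	int err;

	set_desc_from_vma(&desc, vma);
	err = nested(&desc);		/* step 2: nested prepare hook */
	if (err)
		return err;		/* VMA left untouched on failure */

	return apply_desc_to_vma(&desc, vma);
}

/* Example nested hook: requests an extra flag and a remap action. */
int sample_prepare(struct desc_model *desc)
{
	desc->flags |= 0x4;
	desc->action = ACTION_REMAP;
	return 0;
}

/* Example nested hook that rejects the mapping. */
int failing_prepare(struct desc_model *desc)
{
	(void)desc;
	return -EINVAL;
}
```

In this model the outer hook keeps its VMA-based signature while the nested hook only ever sees the descriptor, which is the incremental conversion the commit message describes.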

>
> Unfortunately, as part of this change, we need to be able to flexibly
> assign to the VMA descriptor, so have to remove some of the const
> declarations within the structure.
>
> Also update the VMA tests to reflect the changes.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
>  include/linux/fs.h              |   3 +
>  include/linux/mm_types.h        |   4 +-
>  mm/util.c                       | 111 +++++++++++++++++++++++---------
>  mm/vma.h                        |   2 +-
>  tools/testing/vma/include/dup.h | 111 ++++++++++++++++++++------------
>  5 files changed, 157 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c390f5c667e3..0bdccfa70b44 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2058,6 +2058,9 @@ static inline bool can_mmap_file(struct file *file)
>         return true;
>  }
>
> +void compat_set_desc_from_vma(struct vm_area_desc *desc, const struct file *file,
> +                             const struct vm_area_struct *vma);
> +int __compat_vma_mmap(struct vm_area_desc *desc, struct vm_area_struct *vma);
>  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
>  int __vma_check_mmap_hook(struct vm_area_struct *vma);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 50685cf29792..7538d64f8848 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -891,8 +891,8 @@ static __always_inline bool vma_flags_empty(vma_flags_t *flags)
>   */
>  struct vm_area_desc {
>         /* Immutable state. */
> -       const struct mm_struct *const mm;
> -       struct file *const file; /* May vary from vm_file in stacked callers. */
> +       struct mm_struct *mm;
> +       struct file *file; /* May vary from vm_file in stacked callers. */
>         unsigned long start;
>         unsigned long end;
>
> diff --git a/mm/util.c b/mm/util.c
> index aa92e471afe1..a166c48fe894 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1163,34 +1163,38 @@ void flush_dcache_folio(struct folio *folio)
>  EXPORT_SYMBOL(flush_dcache_folio);
>  #endif
>
> -static int __compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> +/**
> + * compat_set_desc_from_vma() - assigns VMA descriptor @desc fields from a VMA.
> + * @desc: A VMA descriptor whose fields need to be set.
> + * @file: The file object describing the file being mmap()'d.
> + * @vma: The VMA whose fields we wish to assign to @desc.
> + *
> + * This is a compatibility function to allow an mmap() hook to call
> + * mmap_prepare() hooks when drivers nest these. This function specifically
> + * allows the construction of a vm_area_desc value, @desc, from a VMA @vma for
> + * the purposes of doing this.
> + *
> + * Once the conversion of drivers is complete this function will no longer be
> + * required and will be removed.
> + */
> +void compat_set_desc_from_vma(struct vm_area_desc *desc,
> +                             const struct file *file,
> +                             const struct vm_area_struct *vma)
>  {
> -       struct vm_area_desc desc = {
> -               .mm = vma->vm_mm,
> -               .file = file,
> -               .start = vma->vm_start,
> -               .end = vma->vm_end,
> -
> -               .pgoff = vma->vm_pgoff,
> -               .vm_file = vma->vm_file,
> -               .vma_flags = vma->flags,
> -               .page_prot = vma->vm_page_prot,
> -
> -               .action.type = MMAP_NOTHING, /* Default */
> -       };
> -       int err;
> +       desc->mm = vma->vm_mm;
> +       desc->file = (struct file *)file;
> +       desc->start = vma->vm_start;
> +       desc->end = vma->vm_end;
>
> -       err = vfs_mmap_prepare(file, &desc);
> -       if (err)
> -               return err;
> +       desc->pgoff = vma->vm_pgoff;
> +       desc->vm_file = vma->vm_file;
> +       desc->vma_flags = vma->flags;
> +       desc->page_prot = vma->vm_page_prot;
>
> -       err = mmap_action_prepare(&desc);
> -       if (err)
> -               return err;
> -
> -       set_vma_from_desc(vma, &desc);
> -       return mmap_action_complete(vma, &desc.action);
> +       /* Default. */
> +       desc->action.type = MMAP_NOTHING;
>  }
> +EXPORT_SYMBOL(compat_set_desc_from_vma);
>
>  static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
>  {
> @@ -1211,6 +1215,49 @@ static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
>         return err;
>  }
>
> +/**
> + * __compat_vma_mmap() - Similar to compat_vma_mmap(), only it allows
> + * flexibility as to how the mmap_prepare callback is invoked, which is useful
> + * for drivers which invoke nested mmap_prepare callbacks in an mmap() hook.
> + * @desc: A VMA descriptor upon which an mmap_prepare() hook has already been
> + * executed.
> + * @vma: The VMA to which @desc should be applied.
> + *
> + * The function assumes that you have obtained a VMA descriptor @desc from
> + * compat_set_desc_from_vma(), and already executed the mmap_prepare() hook upon
> + * it.
> + *
> + * It then performs any specified mmap actions, and invokes the vm_ops->mapped()
> + * hook if one is present.
> + *
> + * See the description of compat_vma_mmap() for more details.
> + *
> + * Once the conversion of drivers is complete this function will no longer be
> + * required and will be removed.
> + *
> + * Returns: 0 on success or error.
> + */
> +int __compat_vma_mmap(struct vm_area_desc *desc,
> +                     struct vm_area_struct *vma)
> +{
> +       int err;
> +
> +       /* Perform any preparatory tasks for mmap action. */
> +       err = mmap_action_prepare(desc);
> +       if (err)
> +               return err;
> +       /* Update the VMA from the descriptor. */
> +       compat_set_vma_from_desc(vma, desc);
> +       /* Complete any specified mmap actions. */
> +       err = mmap_action_complete(vma, &desc->action);
> +       if (err)
> +               return err;
> +
> +       /* Invoke vm_ops->mapped callback. */
> +       return __compat_vma_mapped(desc->file, vma);
> +}
> +EXPORT_SYMBOL(__compat_vma_mmap);
> +
>  /**
>   * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
>   * existing VMA and execute any requested actions.
> @@ -1218,10 +1265,10 @@ static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
>   * @vma: The VMA to apply the .mmap_prepare() hook to.
>   *
>   * Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain
> - * stacked filesystems invoke a nested mmap hook of an underlying file.
> + * stacked drivers invoke a nested mmap hook of an underlying file.
>   *
> - * Until all filesystems are converted to use .mmap_prepare(), we must be
> - * conservative and continue to invoke these stacked filesystems using the
> + * Until all drivers are converted to use .mmap_prepare(), we must be
> + * conservative and continue to invoke these stacked drivers using the
>   * deprecated .mmap() hook.
>   *
>   * However we have a problem if the underlying file system possesses an
> @@ -1232,20 +1279,22 @@ static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
>   * establishes a struct vm_area_desc descriptor, passes to the underlying
>   * .mmap_prepare() hook and applies any changes performed by it.
>   *
> - * Once the conversion of filesystems is complete this function will no longer
> - * be required and will be removed.
> + * Once the conversion of drivers is complete this function will no longer be
> + * required and will be removed.
>   *
>   * Returns: 0 on success or error.
>   */
>  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +       struct vm_area_desc desc;
>         int err;
>
> -       err = __compat_vma_mmap(file, vma);
> +       compat_set_desc_from_vma(&desc, file, vma);
> +       err = vfs_mmap_prepare(file, &desc);
>         if (err)
>                 return err;
>
> -       return __compat_vma_mapped(file, vma);
> +       return __compat_vma_mmap(&desc, vma);
>  }
>  EXPORT_SYMBOL(compat_vma_mmap);
>
> diff --git a/mm/vma.h b/mm/vma.h
> index adc18f7dd9f1..a76046c39b14 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -300,7 +300,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
>   * f_op->mmap() but which might have an underlying file system which implements
>   * f_op->mmap_prepare().
>   */
> -static inline void set_vma_from_desc(struct vm_area_struct *vma,
> +static inline void compat_set_vma_from_desc(struct vm_area_struct *vma,
>                 struct vm_area_desc *desc)
>  {
>         /*
> diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> index 114daaef4f73..6658df26698a 100644
> --- a/tools/testing/vma/include/dup.h
> +++ b/tools/testing/vma/include/dup.h
> @@ -519,8 +519,8 @@ enum vma_operation {
>   */
>  struct vm_area_desc {
>         /* Immutable state. */
> -       const struct mm_struct *const mm;
> -       struct file *const file; /* May vary from vm_file in stacked callers. */
> +       struct mm_struct *mm;
> +       struct file *file; /* May vary from vm_file in stacked callers. */
>         unsigned long start;
>         unsigned long end;
>
> @@ -1272,43 +1272,92 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  }
>
>  /* Declared in vma.h. */
> -static inline void set_vma_from_desc(struct vm_area_struct *vma,
> +static inline void compat_set_vma_from_desc(struct vm_area_struct *vma,
>                 struct vm_area_desc *desc);
>
> -static inline int __compat_vma_mmap(const struct file_operations *f_op,
> -               struct file *file, struct vm_area_struct *vma)
> +static inline void compat_set_desc_from_vma(struct vm_area_desc *desc,
> +                             const struct file *file,
> +                             const struct vm_area_struct *vma)
>  {
> -       struct vm_area_desc desc = {
> -               .mm = vma->vm_mm,
> -               .file = file,
> -               .start = vma->vm_start,
> -               .end = vma->vm_end,
> +       desc->mm = vma->vm_mm;
> +       desc->file = (struct file *)file;
> +       desc->start = vma->vm_start;
> +       desc->end = vma->vm_end;
>
> -               .pgoff = vma->vm_pgoff,
> -               .vm_file = vma->vm_file,
> -               .vma_flags = vma->flags,
> -               .page_prot = vma->vm_page_prot,
> +       desc->pgoff = vma->vm_pgoff;
> +       desc->vm_file = vma->vm_file;
> +       desc->vma_flags = vma->flags;
> +       desc->page_prot = vma->vm_page_prot;
>
> -               .action.type = MMAP_NOTHING, /* Default */
> -       };
> +       /* Default. */
> +       desc->action.type = MMAP_NOTHING;
> +}
> +
> +static inline unsigned long vma_pages(const struct vm_area_struct *vma)
> +{
> +       return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +}
> +
> +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> +{
> +       const size_t len = vma_pages(vma) << PAGE_SHIFT;
> +
> +       mmap_assert_write_locked(vma->vm_mm);
> +       do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> +}
> +
> +static inline int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> +{
> +       const struct vm_operations_struct *vm_ops = vma->vm_ops;
>         int err;
>
> -       err = f_op->mmap_prepare(&desc);
> +       if (!vm_ops->mapped)
> +               return 0;
> +
> +       err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, file,
> +                            &vma->vm_private_data);
>         if (err)
> -               return err;
> +               unmap_vma_locked(vma);
> +       return err;
> +}
>
> -       err = mmap_action_prepare(&desc);
> +static inline int __compat_vma_mmap(struct vm_area_desc *desc,
> +               struct vm_area_struct *vma)
> +{
> +       int err;
> +
> +       /* Perform any preparatory tasks for mmap action. */
> +       err = mmap_action_prepare(desc);
> +       if (err)
> +               return err;
> +       /* Update the VMA from the descriptor. */
> +       compat_set_vma_from_desc(vma, desc);
> +       /* Complete any specified mmap actions. */
> +       err = mmap_action_complete(vma, &desc->action);
>         if (err)
>                 return err;
>
> -       set_vma_from_desc(vma, &desc);
> -       return mmap_action_complete(vma, &desc.action);
> +       /* Invoke vm_ops->mapped callback. */
> +       return __compat_vma_mapped(desc->file, vma);
> +}
> +
> +static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> +{
> +       return file->f_op->mmap_prepare(desc);
>  }
>
>  static inline int compat_vma_mmap(struct file *file,
>                 struct vm_area_struct *vma)
>  {
> -       return __compat_vma_mmap(file->f_op, file, vma);
> +       struct vm_area_desc desc;
> +       int err;
> +
> +       compat_set_desc_from_vma(&desc, file, vma);
> +       err = vfs_mmap_prepare(file, &desc);
> +       if (err)
> +               return err;
> +
> +       return __compat_vma_mmap(&desc, vma);
>  }
>
>
> @@ -1318,11 +1367,6 @@ static inline void vma_iter_init(struct vma_iterator *vmi,
>         mas_init(&vmi->mas, &mm->mm_mt, addr);
>  }
>
> -static inline unsigned long vma_pages(struct vm_area_struct *vma)
> -{
> -       return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> -}
> -
>  static inline void mmap_assert_locked(struct mm_struct *);
>  static inline struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
>                                                 unsigned long start_addr,
> @@ -1492,11 +1536,6 @@ static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
>         return file->f_op->mmap(file, vma);
>  }
>
> -static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> -{
> -       return file->f_op->mmap_prepare(desc);
> -}
> -
>  static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
>  {
>         /* Changing an anonymous vma with this is illegal */
> @@ -1521,11 +1560,3 @@ static inline pgprot_t vma_get_page_prot(vma_flags_t vma_flags)
>
>         return vm_get_page_prot(vm_flags);
>  }
> -
> -static inline void unmap_vma_locked(struct vm_area_struct *vma)
> -{
> -       const size_t len = vma_pages(vma) << PAGE_SHIFT;
> -
> -       mmap_assert_write_locked(vma->vm_mm);
> -       do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> -}
> --
> 2.53.0
>
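
For readers following the refactoring above, the split of compat_vma_mmap()
into "seed a descriptor from the VMA", "run the prepare hook on it" and "apply
the descriptor back to the VMA" can be modelled in user space. This is only a
sketch: every toy_* name and the reduced set of fields are hypothetical
stand-ins rather than the kernel's types, and the mmap action and
vm_ops->mapped() steps are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct toy_vma {
	unsigned long start, end, pgoff, flags;
};

struct toy_desc {
	unsigned long start, end, pgoff, flags;
};

/* Models an mmap_prepare()-style hook that only ever sees the descriptor. */
typedef int (*toy_prepare_fn)(struct toy_desc *desc);

/* Step 1: seed the descriptor from the existing VMA. */
static void toy_set_desc_from_vma(struct toy_desc *desc,
				  const struct toy_vma *vma)
{
	desc->start = vma->start;
	desc->end = vma->end;
	desc->pgoff = vma->pgoff;
	desc->flags = vma->flags;
}

/* Step 3: copy back the fields the hook is assumed to be allowed to change. */
static void toy_set_vma_from_desc(struct toy_vma *vma,
				  const struct toy_desc *desc)
{
	vma->pgoff = desc->pgoff;
	vma->flags = desc->flags;
}

/* Model of the overall flow: build descriptor, run hook, apply result. */
static int toy_compat_mmap(struct toy_vma *vma, toy_prepare_fn prepare)
{
	struct toy_desc desc;
	int err;

	assert(vma != NULL && prepare != NULL);

	toy_set_desc_from_vma(&desc, vma);

	/* Step 2: the hook may modify the descriptor or fail. */
	err = prepare(&desc);
	if (err)
		return err;	/* The VMA is left untouched on failure. */

	toy_set_vma_from_desc(vma, &desc);
	return 0;
}

/* Example hook that succeeds and sets one extra (made-up) flag bit. */
static int toy_prepare_set_flag(struct toy_desc *desc)
{
	desc->flags |= 0x4UL;
	return 0;
}

/* Example hook that scribbles on the descriptor and then fails. */
static int toy_prepare_fail(struct toy_desc *desc)
{
	desc->flags |= 0x8UL;
	return -22;	/* Stand-in for -EINVAL. */
}
```

Because the hook only touches the descriptor, a failing hook cannot leave the
VMA half-updated in this model. A caller that needs to run a nested hook
differently would perform steps 1 and 2 itself and hand the prepared
descriptor to the apply step.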


* Re: [EXTERNAL] Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Leon Romanovsky @ 2026-03-18 14:49 UTC (permalink / raw)
  To: Long Li
  Cc: Jason Gunthorpe, Konstantin Taranov, Jakub Kicinski,
	David S . Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Haiyang Zhang, KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66832D0A369DE7E411ACCDEDCE41A@SA1PR21MB6683.namprd21.prod.outlook.com>

On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:
> > 
> > On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > > When the MANA hardware undergoes a service reset, the ETH
> > > > > auxiliary device
> > > > > (mana.eth) used by DPDK persists across the reset cycle — it is
> > > > > not removed and re-added like RC/UD/GSI QPs. This means userspace
> > > > > RDMA consumers such as DPDK have no way of knowing that firmware
> > > > > handles for their PD, CQ, WQ, QP and MR resources have become stale.
> > > >
> > > > NAK to any of this.
> > > >
> > > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > > and recreated later.
> > >
> > > Yeah, that is our general model for any serious RAS event where the
> > > driver's view of resources becomes out of sync with the HW.
> > >
> > > You have to tear down the ib_device by removing the aux and then bring
> > > back a new one.
> > >
> > > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > > tell userspace to close and re-open their uverbs FD.
> > >
> > > We don't have a model where a uverbs FD in userspace can continue to
> > > work after the device has a catastrophic RAS event.
> > >
> > > There may be room to have a model where the ib device doesn't fully
> > > unplug/replug so it retains its name and things, but that is core code
> > > not driver stuff.
> > 
> > Good luck with that model. It is going to break RDMA-CM hotplug support.
> > 
> 
>    I think we can preserve RDMA-CM behavior without requiring ib_device
>    unregister/re-register.
> 
>    On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
>    new reset event) through ib_dispatch_event(). RDMA-CM already handles
>    device events — we would add a handler that iterates all rdma_cm_ids
>    on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
>    as cma_process_remove() does today. The difference: cma_device stays
>    alive, so applications can reconnect on the same device after recovery
>    instead of waiting for a new one to appear.
> 
>    The motivation for keeping ib_device alive is that some RDMA consumers
>    — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
>    manage QP state themselves.

RDMA-CM provides an "external QP" model where the QP is managed by the
rdma-cm user.

As Jason noted, you should propose the core changes together with the
corresponding librdmacm updates. The final result must ensure that legacy
applications continue to function correctly with the new kernel.

Thanks


* RE: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Michael Kelley @ 2026-03-18 14:38 UTC (permalink / raw)
  To: Wei Liu, Michael Kelley
  Cc: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260318062001.GA262287@liuwe-devbox-debian-v2.local>

From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 17, 2026 11:20 PM
> 
> On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > >
> > > The current error handling has two issues:
> > >
> > > First, pin_user_pages_fast() can return a short pin count (less than
> > > requested but greater than zero) when it cannot pin all requested pages.
> > > This is treated as success, leading to partially pinned regions being
> > > used, which causes memory corruption.
> > >
> > > Second, when an error occurs mid-loop, already pinned pages from the
> > > current batch are not released before calling mshv_region_evict_pages(),
> > > causing a page reference leak.
> >
> > There's now an online LLM-based tool that is automatically reviewing
> > kernel patches.  For this patch, the results are here:
> >
> >
> > > https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
> >
> > It has flagged the commit message as incorrectly referencing the
> > function mshv_region_evict_pages(), which doesn't exist.
> >
> > FWIW, the announcement about sashiko.dev is here:
> >
> > https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> >
> > Other than the commit message reference, this looks good to me.
> >
> > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> 
> The second point is written as if the code here should release the
> already pinned pages before calling mshv_region_invalidate_pages(), but
> the code actually relies on mshv_region_invalidate_pages() to
> release the pages. The change here fixes the accounting.
> 
>  Second, when an error occurs mid-loop, already pinned pages from the
>  current batch are not accounted for before calling
>  mshv_region_invalidate_pages(), causing a page reference leak.
> 
> And queued up the patch to hyperv-fixes.

One other thing I noticed:  The "Subject" of the patch is wrong. It
mentions mshv_region_populate_pages(), but the function being
modified is actually mshv_region_pin().

Michael

> 
> Wei
> 
> >
> > >
> > > Fix by treating short pins as errors and explicitly unpinning the
> > > partial batch before cleanup.
> > >
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_regions.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > > index c28aac0726de..fdffd4f002f6 100644
> > > --- a/drivers/hv/mshv_regions.c
> > > +++ b/drivers/hv/mshv_regions.c
> > > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > >  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > >  					  FOLL_WRITE | FOLL_LONGTERM,
> > >  					  pages);
> > > -		if (ret < 0)
> > > +		if (ret != nr_pages)
> > >  			goto release_pages;
> > >  	}
> > >
> > >  	return 0;
> > >
> > >  release_pages:
> > > +	if (ret > 0)
> > > +		done_count += ret;
> > >  	mshv_region_invalidate_pages(region, 0, done_count);
> > > -	return ret;
> > > +	return ret < 0 ? ret : -ENOMEM;
> > >  }
> > >
> > >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > >
> > >
> >



* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18 12:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Michael Kelley, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86@kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Florian Bezdeka, RT, Mitchell Levy, Saurabh Singh Sengar,
	Naman Jain
In-Reply-To: <20260318112113.JDm06Hr-@linutronix.de>

On 18.03.26 12:21, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 12:03:03 [+0100], Jan Kiszka wrote:
>>> Either way, that add_interrupt_randomness() should be moved to
>>> sysvec_hyperv_callback() like it has been done for
>>> sysvec_hyperv_stimer0(). It should be invoked twice now if it gets there
>>> via vmbus_percpu_isr().
>>
>> No, this would degrade arm64.
> 
> Okay. So why does this need to be done for _this_ per-CPU IRQ on ARM64
> but not for the others? What makes it so special?
> 

See the other thread:
https://lore.kernel.org/lkml/SN6PR02MB41573332BF202DAE0AF79ED1D441A@SN6PR02MB4157.namprd02.prod.outlook.com/

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18 11:21 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Michael Kelley, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86@kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	Florian Bezdeka, RT, Mitchell Levy, Saurabh Singh Sengar,
	Naman Jain
In-Reply-To: <7f248f1f-a4ad-442d-bd85-23e57e58eeba@siemens.com>

On 2026-03-18 12:03:03 [+0100], Jan Kiszka wrote:
> > Either way, that add_interrupt_randomness() should be moved to
> > sysvec_hyperv_callback() like it has been done for
> > sysvec_hyperv_stimer0(). It should be invoked twice now if it gets there
> > via vmbus_percpu_isr().
> 
> No, this would degrade arm64.

Okay. So why does this need to be done for _this_ per-CPU IRQ on ARM64
but not for the others? What makes it so special?

> Jan

Sebastian


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18 11:03 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michael Kelley
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260318100138.GimjldpV@linutronix.de>

On 18.03.26 11:01, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 17:25:20 [+0000], Michael Kelley wrote:
>> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
>>>
>>
>> Let me try to address the range of questions here and in the follow-up
>> discussion. As background, an overview of VMBus interrupt handling is in:
>>
>> Documentation/virt/hyperv/vmbus.rst
>>
>> in the section entitled "Synthetic Interrupt Controller (synic)". The
>> relevant text is:
>>
>>    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
>>    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
>>    each CPU in the guest has a synic and may receive VMBus interrupts,
>>    they are best modeled in Linux as per-CPU interrupts. This model works
>>    well on arm64 where a single per-CPU Linux IRQ is allocated for
>>    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
>>    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
>>    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
>>    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
>>    there's no Linux IRQ, and the interrupts are visible in aggregate in
>>    /proc/interrupts on the "HYP" line.
>>
>> The use of a statically allocated sysvec pre-dates my involvement in this
>> code starting in 2017, but I believe it was modelled after what Xen does,
>> and for the same reason -- to effectively create a per-CPU interrupt on
>> x86/x64. ACRN is also using HYPERVISOR_CALLBACK_VECTOR, but I
>> don't know if that is also to create a per-CPU interrupt.
> 
> If you create a vector, it becomes per-CPU. There is simply no mapping
> from HYPERVISOR_CALLBACK_VECTOR to request_percpu_irq(). But if we had
> this…
> 
> …
>>> What clears this? This is wrongly placed. This should go to
>>> sysvec_hyperv_callback() instead with its matching canceling part. The
>>> add_interrupt_randomness() should also be there and not here.
>>> sysvec_hyperv_stimer0() managed to do so.
>>
>> I don't have any knowledge to bring regarding the use of
>> lockdep_hardirq_threaded().
> 
> It is used in IRQ core to mark the execution of an interrupt handler
> which becomes threaded in a forced-threaded scenario. The goal is to let
> lockdep know that this piece of code on !RT will be threaded on RT and
> therefore there is no need to report a possible locking problem that
> will not exist on RT.
> 
>>> Different question: What guarantees that there won't be another
>>> interrupt before this one is done? The handshake appears to be
>>> deprecated. The interrupt itself returns ACKing (or not) but the actual
>>> handler is delayed to this thread. Depending on the userland it could
>>> take some time and I don't know how impatient the host is.
>>
>> In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
>> and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
>> is usually explicitly called to ack the interrupt.
>>
>> There's no guarantee, in either the existing case or the new PREEMPT_RT
>> case, that another VMBus interrupt won't come in on the same CPU
>> before the tasklets scheduled by vmbus_message_sched() or
>> vmbus_chan_sched() have run. From a functional standpoint, the Linux
>> code and interaction with Hyper-V handles another interrupt correctly.
> 
> So there is no scenario that the host will trigger interrupts because
> the guest is leaving the ISR without doing anything/ making progress?
> 
>> From a delay standpoint, there's not a problem for the normal (i.e., not
>> PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
>> don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
>> about delays since the tasklets are scheduled from the new per-CPU thread.
>> But my understanding is that Jan's motivation for these changes is not to
>> achieve true RT behavior, since Hyper-V doesn't provide that anyway.
>> The goal is simply to make PREEMPT_RT builds functional, though Jan may
>> have further comments on the goal.
> 
> I would be worried if the host were storming interrupts to the guest
> because it makes no progress.
> 
>>>> +		__vmbus_isr();
>>> Moving on. This (trying very hard here) even schedules tasklets. Why?
>>> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
>>> You don't want that.
>>
>> Again, Jan can comment on the impact of delays due to ending up
>> in ksoftirqd.
> 
> My point is that having this with threaded interrupt support would
> eliminate the usage of tasklets.
> 
>>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>>> instead of apic_eoi() + schedule_delayed_work().
>>
>> As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
>> on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
>> entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
>> path gets to vmbus_isr(), the two architectures share the same code. Same
>> thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.
> 
> This one has the "random" collecting on the right spot.
> 
>> If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
>> to looking at it.
>>
>> As I recently discovered in discussion with Jan, standard Linux IRQ handling
>> will *not* thread per-CPU interrupts. So even on arm64, where a standard
>> Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
>> request threading.
> 
> It would require a statement from the x86 & IRQ maintainers on whether it
> is worthwhile on x86 to allow passing HYPERVISOR_CALLBACK_VECTOR to
> request_percpu_irq() and to have an IRQF_ flag saying that this one needs
> to be forced threaded. Otherwise we would need to remain with the workarounds.
> 
> If you say that an interrupt storm can not occur, I would prefer
> |static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
> |…
> |	lock_map_acquire_try(&vmbus_map);
> |	__vmbus_isr();
> |	lock_map_release(&vmbus_map);
> 
> while it has mostly the same effect.
> 
> Either way, that add_interrupt_randomness() should be moved to
> sysvec_hyperv_callback() like it has been done for
> sysvec_hyperv_stimer0(). It should be invoked twice now if it gets there
> via vmbus_percpu_isr().

No, this would degrade arm64.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18 11:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hyperv, linux-kernel,
	Florian Bezdeka, RT, Mitchell Levy, Michael Kelley,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260318090817.zuFUjrxd@linutronix.de>

On 18.03.26 10:08, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 12:55:15 [+0100], Jan Kiszka wrote:
>> Point is that a task that was interrupted by a potentially threaded
>> interrupt keeps this flag longer than it needs it. And that is
>> apparently harmless, but fairly confusing.
> 
> correct. My only concern would be a shared handler where the second is
> not threaded.

The vmbus irqs are not shared (beyond what sysvec_hyperv_callback does).

> 
>>>> With that in mind, the new logic here is no different from the one the
>>>> kernel used before. If both are not doing what they should, we likely
>>>> want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
>>>
>>> The difference is that you expect that _everyone_ calling this driver
>>> has everything else threaded. This might not be the case. That is why
>>> this should be in core knowing what is called if threaded, use in driver
>>> after explicit killing that flag afterwards since you don't know what
>>> can follow or add a generic threaded infrastructure here. 
>>
>> This driver is different, unfortunately. I'm not sure if we can / want
>> to thread everything that the platform interrupt does on x86. So far,
>> only the last part of it - vmbus handling - is threaded. On arm64, the
>> irq is exclusive (see vmbus_percpu_isr), thus everything can be and is
>> threaded.
> 
> No, it is a percpu interrupt which are not forced-threaded.

It is threaded now due to my patch.

> 
>>>>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>>>>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>>>>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>>>>> instead of apic_eoi() + schedule_delayed_work().
>>>>>
>>>>
>>>> Again, you are thinking x86-only. We need a portable solution.
>>>
>>> well, ARM could use a threaded interrupt, too.
>>
>> For a reason we didn't explore in detail, per-CPU interrupts aren't
>> threaded. See older version of this patch
>> (https://lore.kernel.org/lkml/005a01dc9d30$a40515e0$ec0f41a0$@zohomail.com/)
>> where I thought I only had to fix x86, but arm64 was needing care as well.
> 
> Per-CPU are usually timers or other things which are not threaded and
> have their own thing for the "second" port and I only remember MCE using
> a workqueue for notification.

And the hv vmbus now provides a case where threading could be useful, at
least for arm64. For x86, we would have to check if the first half of
sysvec_hyperv_callback (mshv_handler) wants threading as well / would
support that.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center


* Re: RE: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Sebastian Andrzej Siewior @ 2026-03-18 10:10 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka
In-Reply-To: <SN6PR02MB41573332BF202DAE0AF79ED1D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-03-17 17:26:39 [+0000], Michael Kelley wrote:
> > > Who is other one and does it have its add_interrupt_randomness() there
> > > already?
> > 
> > It's the arm64 path of the hv support. Regarding the vmbus IRQ, it seems
> > to be fully handled here, without an equivalent of
> > arch/x86/kernel/cpu/mshyperv.c.
> 
> The arm64 path is the call to request_percpu_irq() in vmbus_bus_init().
> That call is only made when running on arm64. See the code comment in
> vmbus_bus_init().
> 
> The specified interrupt handler is vmbus_percpu_isr(), which again runs
> only on arm64. It calls vmbus_isr(), which starts the common path for both
> x86/x64 and arm64.
> 
> Then the slight weirdness is that the standard Linux IRQ handling for
> per-CPU IRQs on arm64 with a GICv3 (which is what Hyper-V emulates) 
> does *not* call add_interrupt_randomness().  The function
> gic_irq_domain_map() sets the IRQ handler for PPI range to
> handle_percpu_devid_irq(), and that function does not do
> add_interrupt_randomness().  The other variant, handle_percpu_irq(),
> calls handle_irq_event_percpu(), which *does* do the
> add_interrupt_randomness().

So despite all that, the generic code on arm64 does not do it? Then don't
work around this in your driver. Either talk to the IRQ maintainer and
suggest adding it there so everyone benefits from it, or don't, because there
might be a reason to avoid it. Having it in driver code is wrong.

> So at this point, putting the add_interrupt_randomness() in
> vmbus_isr() is needed to catch both architectures. If the lack of
> add_interrupt_randomness() in handle_percpu_devid_irq() is a bug,
> then that would be a cleaner way to handle this. But maybe there's
> a reason behind the current behavior of handle_percpu_devid_irq()
> that I'm unaware of.
>
> Michael

Sebastian
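
The disagreement about where add_interrupt_randomness() belongs can be made
concrete with a toy call-path model. It is a sketch only: all toy_* names are
hypothetical, and it takes as a premise Michael's description quoted above,
namely that the arm64 per-CPU path through handle_percpu_devid_irq() adds no
randomness of its own while x86 enters through sysvec_hyperv_callback().

```c
#include <assert.h>

enum toy_arch { TOY_X86, TOY_ARM64 };

/* Where the randomness call is placed in this model. */
enum toy_placement { TOY_IN_VMBUS_ISR, TOY_IN_SYSVEC_CALLBACK };

static int toy_randomness_calls;

static void toy_add_interrupt_randomness(void)
{
	toy_randomness_calls++;
}

/* Common path shared by both architectures. */
static void toy_vmbus_isr(enum toy_placement where)
{
	if (where == TOY_IN_VMBUS_ISR)
		toy_add_interrupt_randomness();
}

/* x86 entry point: the statically allocated vector. */
static void toy_sysvec_hyperv_callback(enum toy_placement where)
{
	if (where == TOY_IN_SYSVEC_CALLBACK)
		toy_add_interrupt_randomness();
	toy_vmbus_isr(where);
}

/*
 * arm64 entry point: reached through the per-CPU devid flow handler,
 * which is assumed here not to add randomness itself.
 */
static void toy_vmbus_percpu_isr(enum toy_placement where)
{
	toy_vmbus_isr(where);
}

/* Number of randomness calls made for one simulated interrupt. */
static int toy_calls_per_interrupt(enum toy_arch arch,
				   enum toy_placement where)
{
	assert(arch == TOY_X86 || arch == TOY_ARM64);

	toy_randomness_calls = 0;
	if (arch == TOY_X86)
		toy_sysvec_hyperv_callback(where);
	else
		toy_vmbus_percpu_isr(where);
	return toy_randomness_calls;
}
```

Under that premise, placing the call in the common handler yields one call
per interrupt on both architectures, while moving it into the x86 entry point
alone leaves arm64 with none. That appears to be the regression Jan refers
to, and it is why the alternative raised in this thread is to add the call in
the generic per-CPU handler instead.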


* Re: RE: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18 10:01 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <SN6PR02MB415753FDA0DEEA0B4A8B9994D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-03-17 17:25:20 [+0000], Michael Kelley wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
> >
> 
> Let me try to address the range of questions here and in the follow-up
> discussion. As background, an overview of VMBus interrupt handling is in:
> 
> Documentation/virt/hyperv/vmbus.rst
> 
> in the section entitled "Synthetic Interrupt Controller (synic)". The
> relevant text is:
> 
>    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
>    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
>    each CPU in the guest has a synic and may receive VMBus interrupts,
>    they are best modeled in Linux as per-CPU interrupts. This model works
>    well on arm64 where a single per-CPU Linux IRQ is allocated for
>    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
>    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
>    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
>    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
>    there's no Linux IRQ, and the interrupts are visible in aggregate in
>    /proc/interrupts on the "HYP" line.
> 
> The use of a statically allocated sysvec pre-dates my involvement in this
> code starting in 2017, but I believe it was modelled after what Xen does,
> and for the same reason -- to effectively create a per-CPU interrupt on
> x86/x64. Acorn is also using HYPERVISOR_CALLBACK_VECTOR, but I
> don't know if that is also to create a per-CPU interrupt.

If you create a vector, it becomes per-CPU. There is simply no mapping
from HYPERVISOR_CALLBACK_VECTOR to request_percpu_irq(). But if we had
this…

…
> > What clears this? This is wrongly placed. This should go to
> > sysvec_hyperv_callback() instead with its matching canceling part. The
> > add_interrupt_randomness() should also be there and not here.
> > sysvec_hyperv_stimer0() managed to do so.
> 
> I don't have any knowledge to bring regarding the use of
> lockdep_hardirq_threaded().

It is used in IRQ core to mark the execution of an interrupt handler
which becomes threaded in a forced-threaded scenario. The goal is to let
lockdep know that this piece of code on !RT will be threaded on RT and
therefore there is no need to report a possible locking problem that
will not exist on RT.

> > Different question: What guarantees that there won't be another
> > interrupt before this one is done? The handshake appears to be
> > deprecated. The interrupt itself returns ACKing (or not) but the actual
> > handler is delayed to this thread. Depending on the userland it could
> > take some time and I don't know how impatient the host is.
> 
> In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
> and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
> is usually explicitly called to ack the interrupt.
> 
> There's no guarantee, in either the existing case or the new PREEMPT_RT
> case, that another VMBus interrupt won't come in on the same CPU
> before the tasklets scheduled by vmbus_message_sched() or
> vmbus_chan_sched() have run. From a functional standpoint, the Linux
> code and interaction with Hyper-V handles another interrupt correctly.

So there is no scenario that the host will trigger interrupts because
the guest is leaving the ISR without doing anything/ making progress?

> From a delay standpoint, there's not a problem for the normal (i.e., not
> PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
> don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
> about delays since the tasklets are scheduled from the new per-CPU thread.
> But my understanding is that Jan's motivation for these changes is not to
> achieve true RT behavior, since Hyper-V doesn't provide that anyway.
> The goal is simply to make PREEMPT_RT builds functional, though Jan may
> have further comments on the goal.

I would be worried if the host started storming interrupts at the guest
because the guest makes no progress.

> > > +		__vmbus_isr();
> > Moving on. This (trying very hard here) even schedules tasklets. Why?
> > You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> > You don't want that.
> 
> Again, Jan can comment on the impact of delays due to ending up
> in ksoftirqd.

My point is that having this with threaded interrupt support would
eliminate the usage of tasklets.

> > Couldn't the whole logic be integrated into the IRQ code? Then we could
> > have mask/ unmask if supported/ provided and threaded interrupts. Then
> > sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> > instead apic_eoi() + schedule_delayed_work().
> 
> As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
> on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
> entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
> path gets to vmbus_isr(), the two architectures share the same code. Same
> thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.

This one has the "random" collecting on the right spot.

> If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
> to looking at it.
> 
> As I recently discovered in discussion with Jan, standard Linux IRQ handling
> will *not* thread per-CPU interrupts. So even on arm64 with a standard
> Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
> request threading.

It would require a statement from the x86 and IRQ maintainers on whether
it is worthwhile on x86 to allow passing HYPERVISOR_CALLBACK_VECTOR to
request_percpu_irq() and to have an IRQF_ flag saying that this one
needs to be force-threaded. Otherwise we would need to stay with the
workarounds.

If you say that an interrupt storm can not occur, I would prefer
|static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
|…
|	lock_map_acquire_try(&vmbus_map);
|	__vmbus_isr();
|	lock_map_release(&vmbus_map);

while it has mostly the same effect.

Either way, that add_interrupt_randomness() should be moved to
sysvec_hyperv_callback() like it has been done for
sysvec_hyperv_stimer0(). It should be invoked twice now if it gets there
via vmbus_percpu_isr().

> I need to refresh my memory on sysvec_hyperv_reenlightenment(). If
> I recall correctly, it's not a per-CPU interrupt, so it probably doesn't
> need to have a hardcoded vector. Overall, the Hyper-V reenlightenment
> functionality is a bit of a fossil that isn't needed on modern x86/x64
> processors that support TSC scaling. And it doesn't exist for arm64.
> It might be worth seeing if it could be dropped entirely ...
> 
> Michael

Sebastian


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18  9:08 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Peter Zijlstra, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hyperv, linux-kernel,
	Florian Bezdeka, RT, Mitchell Levy, Michael Kelley,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <1e15ac0d-9835-487c-9a16-c55203f01a3d@siemens.com>

On 2026-03-17 12:55:15 [+0100], Jan Kiszka wrote:
> Point is that a task that was interrupted by a potentially threaded
> interrupt keeps this flag longer that it needs it. And that is
> apparently harmless, but fairly confusing.

Correct. My only concern would be a shared handler where the second is
not threaded.

> >> With that in mind, the new logic here is no different from the one the
> >> kernel used before. If both are not doing what they should, we likely
> >> want to add a generic reset of hardirq_threaded to the IRQ exit path(s).
> > 
> > The difference is that you expect that _everyone_ calling this driver
> > has everything else threaded. This might not be the case. That is why
> > this should be in core knowing what is called if threaded, use in driver
> > after explicit killing that flag afterwards since you don't know what
> > can follow or add a generic threaded infrastructure here. 
> 
> This driver is different, unfortunately. I'm not sure if we can / want
> to thread everything that the platform interrupt does on x86. So far,
> only the last part of it - vmbus handling - is threaded. On arm64, the
> irq is exclusive (see vmbus_percpu_isr), thus everything can be and is
> threaded.

No, it is a per-CPU interrupt, and those are not forced-threaded.

> >>> Couldn't the whole logic be integrated into the IRQ code? Then we could
> >>> have mask/ unmask if supported/ provided and threaded interrupts. Then
> >>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> >>> instead apic_eoi() + schedule_delayed_work(). 
> >>>
> >>
> >> Again, you are thinking x86-only. We need a portable solution.
> > 
> > well, ARM could use a threaded interrupt, too.
> 
> For a reason we didn't explore in details, per-CPU interrupts aren't
> threaded. See older version of this patch
> (https://lore.kernel.org/lkml/005a01dc9d30$a40515e0$ec0f41a0$@zohomail.com/)
> where I thought I only had to fix x86, but arm64 was needing care as well.

Per-CPU interrupts are usually timers or other things which are not
threaded and have their own thing for the "second" port, and I only
remember MCE using a workqueue for notification.

> Jan

Sebastian


* Re: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Wei Liu @ 2026-03-18  6:20 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D2316EC9E5B0BAE656C0D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > 
> > The current error handling has two issues:
> > 
> > First, pin_user_pages_fast() can return a short pin count (less than
> > requested but greater than zero) when it cannot pin all requested pages.
> > This is treated as success, leading to partially pinned regions being
> > used, which causes memory corruption.
> > 
> > Second, when an error occurs mid-loop, already pinned pages from the
> > current batch are not released before calling mshv_region_evict_pages(),
> > causing a page reference leak.
> 
> There's now an online LLM-based tool that is automatically reviewing
> kernel patches.  For this patch, the results are here:
> 
> https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
> 
> It has flagged the commit message as incorrectly referencing the
> function mshv_region_evict_pages(), which doesn't exist.
> 
> FWIW, the announcement about sashiko.dev is here:
> 
> https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> 
> Other than the commit message reference, this looks good to me.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>

The second point is written as if the code here should release the
already pinned pages before calling mshv_region_invalidate_pages(), but
the code actually relies on mshv_region_invalidate_pages() to release
the pages. The change here fixes the accounting.

 Second, when an error occurs mid-loop, already pinned pages from the
 current batch are not accounted for before calling
 mshv_region_invalidate_pages(), causing a page reference leak.

And queued up the patch to hyperv-fixes.
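
For anyone following along, the accounting can be sketched in plain
userspace C. This is a toy model under stated assumptions, not the
driver code: model_pin() is a hypothetical stand-in for
pin_user_pages_fast(), MODEL_BATCH is an invented batch size, and the
"released" output stands in for the count passed to
mshv_region_invalidate_pages(). It also assumes the loop advances
done_count after each full batch, which the diff context does not show.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_BATCH 4	/* hypothetical batch size, not from the driver */

/*
 * Hypothetical stand-in for pin_user_pages_fast(): pins at most *avail
 * pages. Returns the number pinned (possibly short), or -EFAULT when
 * nothing can be pinned.
 */
long model_pin(long nr_pages, long *avail)
{
	long got;

	if (*avail <= 0)
		return -EFAULT;
	got = nr_pages < *avail ? nr_pages : *avail;
	*avail -= got;
	return got;
}

/*
 * Pins "total" pages in batches. On failure, *released is the number of
 * pages handed to cleanup, including a partially pinned last batch.
 */
int model_region_pin(long total, long avail, long *released)
{
	long done_count = 0;
	long ret = 0;

	*released = 0;
	while (done_count < total) {
		long nr_pages = total - done_count;

		if (nr_pages > MODEL_BATCH)
			nr_pages = MODEL_BATCH;
		ret = model_pin(nr_pages, &avail);
		if (ret != nr_pages)	/* error or short pin */
			goto release_pages;
		done_count += ret;
	}
	return 0;

release_pages:
	if (ret > 0)		/* account for the short batch */
		done_count += ret;
	*released = done_count;	/* models invalidate(region, 0, done_count) */
	return ret < 0 ? (int)ret : -ENOMEM;
}
```

Without the "ret > 0" accounting, the short batch in the second case
below would be left out of the cleanup range.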

Wei

> 
> > 
> > Fix by treating short pins as errors and explicitly unpinning the
> > partial batch before cleanup.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_regions.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index c28aac0726de..fdffd4f002f6 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> >  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> >  					  FOLL_WRITE | FOLL_LONGTERM,
> >  					  pages);
> > -		if (ret < 0)
> > +		if (ret != nr_pages)
> >  			goto release_pages;
> >  	}
> > 
> >  	return 0;
> > 
> >  release_pages:
> > +	if (ret > 0)
> > +		done_count += ret;
> >  	mshv_region_invalidate_pages(region, 0, done_count);
> > -	return ret;
> > +	return ret < 0 ? ret : -ENOMEM;
> >  }
> > 
> >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > 
> > 
> 


* Re: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Jan Kiszka @ 2026-03-18  5:52 UTC (permalink / raw)
  To: Michael Kelley, Sebastian Andrzej Siewior
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <SN6PR02MB415753FDA0DEEA0B4A8B9994D441A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 17.03.26 18:25, Michael Kelley wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
>>
> 
> Let me try to address the range of questions here and in the follow-up
> discussion. As background, an overview of VMBus interrupt handling is in:
> 
> Documentation/virt/hyperv/vmbus.rst
> 
> in the section entitled "Synthetic Interrupt Controller (synic)". The
> relevant text is:
> 
>    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
>    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
>    each CPU in the guest has a synic and may receive VMBus interrupts,
>    they are best modeled in Linux as per-CPU interrupts. This model works
>    well on arm64 where a single per-CPU Linux IRQ is allocated for
>    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
>    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
>    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
>    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
>    there's no Linux IRQ, and the interrupts are visible in aggregate in
>    /proc/interrupts on the "HYP" line.
> 
> The use of a statically allocated sysvec pre-dates my involvement in this
> code starting in 2017, but I believe it was modelled after what Xen does,
> and for the same reason -- to effectively create a per-CPU interrupt on
> x86/x64. Acorn is also using HYPERVISOR_CALLBACK_VECTOR, but I
> don't know if that is also to create a per-CPU interrupt.

Long ago, we demonstrated via Jailhouse that you do not necessarily gain
complexity on the hypervisor side by providing a minimal PCI host and
attaching all your virtual devices to that instead. Even longer ago in
the absence of proper IRQ controller virtualization on the various
archs, there was a bit of performance to gain doing "special"
interrupts. All these design decisions made sense at a certain time but
you would likely no longer repeat them today.

> 
> More below ....
> 
>> On 2026-02-16 17:24:56 [+0100], Jan Kiszka wrote:
>>> --- a/drivers/hv/vmbus_drv.c
>>> +++ b/drivers/hv/vmbus_drv.c
>>> @@ -25,6 +25,7 @@
>>>  #include <linux/cpu.h>
>>>  #include <linux/sched/isolation.h>
>>>  #include <linux/sched/task_stack.h>
>>> +#include <linux/smpboot.h>
>>>
>>>  #include <linux/delay.h>
>>>  #include <linux/panic_notifier.h>
>>> @@ -1350,7 +1351,7 @@ static void vmbus_message_sched(struct hv_per_cpu_context *hv_cpu, void *message
>>>  	}
>>>  }
>>>
>>> -void vmbus_isr(void)
>>> +static void __vmbus_isr(void)
>>>  {
>>>  	struct hv_per_cpu_context *hv_cpu
>>>  		= this_cpu_ptr(hv_context.cpu_context);
>>> @@ -1363,6 +1364,53 @@ void vmbus_isr(void)
>>>
>>>  	add_interrupt_randomness(vmbus_interrupt);
>>
>> This is feeding entropy and would like to see interrupt registers. But
>> since this is invoked from a thread it won't.
> 
> I'll respond to this topic on the new thread for the new patch
> where Jan has moved the call to add_interrupt_randomness().
> 
>>
>>>  }
>>> +
>> …
>>> +void vmbus_isr(void)
>>> +{
>>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>>> +		vmbus_irqd_wake();
>>> +	} else {
>>> +		lockdep_hardirq_threaded();
>>
>> What clears this? This is wrongly placed. This should go to
>> sysvec_hyperv_callback() instead with its matching canceling part. The
>> add_interrupt_randomness() should also be there and not here.
>> sysvec_hyperv_stimer0() managed to do so.
> 
> I don't have any knowledge to bring regarding the use of
> lockdep_hardirq_threaded().
> 
>>
>> Different question: What guarantees that there won't be another
>> interrupt before this one is done? The handshake appears to be
>> deprecated. The interrupt itself returns ACKing (or not) but the actual
>> handler is delayed to this thread. Depending on the userland it could
>> take some time and I don't know how impatient the host is.
> 
> In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
> and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
> is usually explicitly called to ack the interrupt.
> 
> There's no guarantee, in either the existing case or the new PREEMPT_RT
> case, that another VMBus interrupt won't come in on the same CPU
> before the tasklets scheduled by vmbus_message_sched() or
> vmbus_chan_sched() have run. From a functional standpoint, the Linux
> code and interaction with Hyper-V handles another interrupt correctly.
> 
> From a delay standpoint, there's not a problem for the normal (i.e., not
> PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
> don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
> about delays since the tasklets are scheduled from the new per-CPU thread.
> But my understanding is that Jan's motivation for these changes is not to
> achieve true RT behavior, since Hyper-V doesn't provide that anyway.
> The goal is simply to make PREEMPT_RT builds functional, though Jan may
> have further comments on the goal.
> 

That is exactly the goal: A Linux guest happening to use a PREEMPT_RT
kernel should correctly run on Hyper-V, and that without losing relevant
performance. However, we do not expect any deterministic timing behavior
from such a setup.

>>
>>> +		__vmbus_isr();
>> Moving on. This (trying very hard here) even schedules tasklets. Why?
>> You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
>> You don't want that.
> 
> Again, Jan can comment on the impact of delays due to ending up
> in ksoftirqd.
> 
>>
>> Couldn't the whole logic be integrated into the IRQ code? Then we could
>> have mask/ unmask if supported/ provided and threaded interrupts. Then
>> sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
>> instead apic_eoi() + schedule_delayed_work().
> 
> As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
> on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
> entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
> path gets to vmbus_isr(), the two architectures share the same code. Same
> thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.
> If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
> to looking at it.
> 
> As I recently discovered in discussion with Jan, standard Linux IRQ handling
> will *not* thread per-CPU interrupts. So even on arm64 with a standard
> Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
> request threading.
> 
> I need to refresh my memory on sysvec_hyperv_reenlightenment(). If
> I recall correctly, it's not a per-CPU interrupt, so it probably doesn't
> need to have a hardcoded vector. Overall, the Hyper-V reenlightenment
> functionality is a bit of a fossil that isn't needed on modern x86/x64
> processors that support TSC scaling. And it doesn't exist for arm64.
> It might be worth seeing if it could be dropped entirely ...
> 

I suppose that all depends on how long Linux needs to support the
underlying hypervisor versions and interfaces, no? It's a bit like
supporting old hardware...

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center


* Re: [PATCH 00/11] Drivers: hv: Add ARM64 support in mshv_vtl
From: Naman Jain @ 2026-03-18  4:23 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157DFC7B7CE94500C89664BD441A@SN6PR02MB4157.namprd02.prod.outlook.com>



On 3/18/2026 3:33 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> The series intends to add support for ARM64 to mshv_vtl driver.
>> For this, common Hyper-V code is refactored, necessary support is added,
>> mshv_vtl_main.c is refactored and then finally support is added in
>> Kconfig.
>>
>> Based on commit 1f318b96cc84 ("Linux 7.0-rc3")
> 
> There's now an online LLM-based tool that is automatically reviewing
> kernel patches. For this patch set, the results are here:
> 
> https://sashiko.dev/#/patchset/20260316121241.910764-1-namjain%40linux.microsoft.com
> 
> It has flagged several things that are worth checking, but I haven't
> reviewed them to see if they are actually valid.
> 
> FWIW, the announcement about sashiko.dev is here:
> 
> https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> 
> Michael


Thanks for sharing Michael,
I'll check it out and do the needful.

Regards,
Naman



* RE: [EXTERNAL] Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Long Li @ 2026-03-17 23:43 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Haiyang Zhang, KY Srinivasan, Wei Liu,
	Dexuan Cui, Simon Horman, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260316200843.GK61385@unreal>

> 
> On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > When the MANA hardware undergoes a service reset, the ETH
> > > > auxiliary device
> > > > (mana.eth) used by DPDK persists across the reset cycle — it is
> > > > not removed and re-added like RC/UD/GSI QPs. This means userspace
> > > > RDMA consumers such as DPDK have no way of knowing that firmware
> > > > handles for their PD, CQ, WQ, QP and MR resources have become stale.
> > >
> > > NAK to any of this.
> > >
> > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > and recreated later.
> >
> > Yeah, that is our general model for any serious RAS event where the
> > driver's view of resources becomes out of sync with the HW.
> >
> > You have to tear down the ib_device by removing the aux and then bring
> > back a new one.
> >
> > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > tell userspace to close and re-open their uverbs FD.
> >
> > We don't have a model where a uverbs FD in userspace can continue to
> > work after the device has a catastrophic RAS event.
> >
> > There may be room to have a model where the ib device doesn't fully
> > unplug/replug so it retains its name and things, but that is core code
> > not driver stuff.
> 
> Good luck with that model. It is going to break RDMA-CM hotplug support.
> 

   I think we can preserve RDMA-CM behavior without requiring ib_device
   unregister/re-register.

   On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
   new reset event) through ib_dispatch_event(). RDMA-CM already handles
   device events — we would add a handler that iterates all rdma_cm_ids
   on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
   as cma_process_remove() does today. The difference: cma_device stays
   alive, so applications can reconnect on the same device after recovery
   instead of waiting for a new one to appear.

   The motivation for keeping ib_device alive is that some RDMA consumers
   — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
   manage QP state themselves. For these users, a persistent ib_device
   with IB_EVENT_PORT_ERR / IB_EVENT_PORT_ACTIVE notifications enables
   reliable in-place recovery without reopening the device.

   This matters especially for PCI DPC recovery, which is becoming
   critical for large-scale GPU/storage deployments. See this talk for
   context on the value of surviving DPC events:
   https://www.youtube.com/watch?v=TpNNeMGEsdU&t=1619s

   Today a DPC event on one NIC kills all RDMA connections and can
   crash entire training jobs. If the ib_device persists and the driver
   recreates firmware resources after recovery, raw verbs users can
   resume without full teardown, and RDMA-CM users get the same
   disconnect/reconnect behavior they have today.
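
As a rough illustration of the flow I have in mind, here is a small
userspace model. It is a sketch only: all model_* names and types are
hypothetical and none of this is the RDMA core API. It shows the two
properties proposed above: every CM ID on the device is told about the
removal, and the device itself stays registered so a reconnect on the
same device is possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_IDS 8	/* arbitrary limit for the toy model */

enum model_cm_event {
	MODEL_EVENT_NONE,
	MODEL_EVENT_DEVICE_REMOVAL,
};

/* Hypothetical stand-in for one connection identifier. */
struct model_cm_id {
	enum model_cm_event last_event;
	bool connected;
};

/* Hypothetical stand-in for the device and the IDs bound to it. */
struct model_device {
	bool registered;
	struct model_cm_id ids[MODEL_MAX_IDS];
	size_t nr_ids;
};

/*
 * Proposed reset handling: notify every ID on the device and drop its
 * connection, but leave the device registered. Returns the number of
 * IDs notified.
 */
size_t model_handle_reset(struct model_device *dev)
{
	size_t i;

	for (i = 0; i < dev->nr_ids; i++) {
		dev->ids[i].last_event = MODEL_EVENT_DEVICE_REMOVAL;
		dev->ids[i].connected = false;
	}
	return dev->nr_ids;
}

/* Reconnecting on the same device only works while it is registered. */
bool model_reconnect(struct model_device *dev, size_t idx)
{
	if (!dev->registered || idx >= dev->nr_ids)
		return false;
	dev->ids[idx].last_event = MODEL_EVENT_NONE;
	dev->ids[idx].connected = true;
	return true;
}
```

In the unregister/re-register model, the equivalent of model_reconnect()
would fail until a new device shows up.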

Thanks,
Long


* RE: [PATCH 00/11] Drivers: hv: Add ARM64 support in mshv_vtl
From: Michael Kelley @ 2026-03-17 22:03 UTC (permalink / raw)
  To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	ssengar@linux.microsoft.com, Michael Kelley,
	linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
> 
> The series intends to add support for ARM64 to mshv_vtl driver.
> For this, common Hyper-V code is refactored, necessary support is added,
> mshv_vtl_main.c is refactored and then finally support is added in
> Kconfig.
> 
> Based on commit 1f318b96cc84 ("Linux 7.0-rc3")

There's now an online LLM-based tool that is automatically reviewing
kernel patches. For this patch set, the results are here:

https://sashiko.dev/#/patchset/20260316121241.910764-1-namjain%40linux.microsoft.com

It has flagged several things that are worth checking, but I haven't
reviewed them to see if they are actually valid.

FWIW, the announcement about sashiko.dev is here:

https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/

Michael

> 
> Naman Jain (11):
>   arch: arm64: Export arch_smp_send_reschedule for mshv_vtl module
>   Drivers: hv: Move hv_vp_assist_page to common files
>   Drivers: hv: Add support to setup percpu vmbus handler
>   Drivers: hv: Refactor mshv_vtl for ARM64 support to be added
>   drivers: hv: Export vmbus_interrupt for mshv_vtl module
>   Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
>   arch: arm64: Add support for mshv_vtl_return_call
>   Drivers: hv: mshv_vtl: Move register page config to arch-specific
>     files
>   Drivers: hv: mshv_vtl: Let userspace do VSM configuration
>   Drivers: hv: Add support for arm64 in MSHV_VTL
>   Drivers: hv: Kconfig: Add ARM64 support for MSHV_VTL
> 
>  arch/arm64/hyperv/Makefile        |   1 +
>  arch/arm64/hyperv/hv_vtl.c        | 152 ++++++++++++++++++++++
>  arch/arm64/hyperv/mshyperv.c      |  13 ++
>  arch/arm64/include/asm/mshyperv.h |  28 ++++
>  arch/arm64/kernel/smp.c           |   1 +
>  arch/x86/hyperv/hv_init.c         |  88 +------------
>  arch/x86/hyperv/hv_vtl.c          | 130 +++++++++++++++++++
>  arch/x86/include/asm/mshyperv.h   |   8 +-
>  drivers/hv/Kconfig                |   2 +-
>  drivers/hv/hv_common.c            |  99 +++++++++++++++
>  drivers/hv/mshv.h                 |   8 --
>  drivers/hv/mshv_vtl_main.c        | 205 ++++--------------------------
>  drivers/hv/vmbus_drv.c            |   8 +-
>  include/asm-generic/mshyperv.h    |  49 +++++++
>  include/hyperv/hvgdk_mini.h       |   2 +
>  15 files changed, 505 insertions(+), 289 deletions(-)
>  create mode 100644 arch/arm64/hyperv/hv_vtl.c
> 
> 
> base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
> prerequisite-patch-id: 24022ec1fb63bc20de8114eedf03c81bb1086e0e
> prerequisite-patch-id: 801f2588d5c6db4ceb9a6705a09e4649fab411b1
> prerequisite-patch-id: 581c834aa268f0c54120c6efbc1393fbd9893f49
> prerequisite-patch-id: b0b153807bab40860502c52e4a59297258ade0db
> prerequisite-patch-id: 2bff6accea80e7976c58d80d847cd33f260a3cb9
> prerequisite-patch-id: 296ffbc4f119a5b249bc9c840f84129f5c151139
> prerequisite-patch-id: 3b54d121145e743ac5184518df33a1812280ec96
> prerequisite-patch-id: 06fc5b37b23ee3f91a2c8c9b9c126fde290834f2
> prerequisite-patch-id: 6e8afed988309b03485f5538815ea29c8fa5b0a9
> prerequisite-patch-id: 4f1fb1b7e9cfa8a3b1c02fafecdbb432b74ee367
> prerequisite-patch-id: 49944347e0b2d93e72911a153979c567ebb7e66b
> prerequisite-patch-id: 6dec75498eeae6365d15ac12b5d0a3bd32e9f91c
> --
> 2.43.0
> 



* RE: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Michael Kelley @ 2026-03-17 21:56 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177375989324.25621.6532741522672582851.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> 
> The current error handling has two issues:
> 
> First, pin_user_pages_fast() can return a short pin count (less than
> requested but greater than zero) when it cannot pin all requested pages.
> This is treated as success, leading to partially pinned regions being
> used, which causes memory corruption.
> 
> Second, when an error occurs mid-loop, already pinned pages from the
> current batch are not released before calling mshv_region_evict_pages(),
> causing a page reference leak.

There's now an online LLM-based tool that is automatically reviewing
kernel patches.  For this patch, the results are here:

https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net

It has flagged the commit message as incorrectly referencing the
function mshv_region_evict_pages(), which doesn't exist.

FWIW, the announcement about sashiko.dev is here:

https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/

Other than the commit message reference, this looks good to me.

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

> 
> Fix by treating short pins as errors and explicitly unpinning the
> partial batch before cleanup.
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index c28aac0726de..fdffd4f002f6 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
>  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
>  					  FOLL_WRITE | FOLL_LONGTERM,
>  					  pages);
> -		if (ret < 0)
> +		if (ret != nr_pages)
>  			goto release_pages;
>  	}
> 
>  	return 0;
> 
>  release_pages:
> +	if (ret > 0)
> +		done_count += ret;
>  	mshv_region_invalidate_pages(region, 0, done_count);
> -	return ret;
> +	return ret < 0 ? ret : -ENOMEM;
>  }
> 
>  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> 
> 
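
For readers following the error handling described above, here is a
minimal userspace sketch of the same pattern: treat a short pin as a
failure, fold the partial batch into the count to release, and map a
short pin to -ENOMEM. It is an illustration under stated assumptions,
not the driver code. pin_region(), struct fake_mm, fake_pin() and
fake_unpin() are hypothetical names; fake_pin() only imitates the
"return a short count or a negative errno" behaviour of
pin_user_pages_fast().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical pool: at most 'available' pages can be pinned at once. */
struct fake_mm {
	size_t available;
	size_t pinned;
};

/* Imitates a pinning call: returns pages pinned (may be short) or -errno. */
static long fake_pin(struct fake_mm *mm, size_t nr)
{
	size_t room = mm->available - mm->pinned;
	size_t got = nr < room ? nr : room;

	if (got == 0)
		return -EFAULT;
	mm->pinned += got;
	return (long)got;
}

/* Releases 'nr' previously pinned pages. */
static void fake_unpin(struct fake_mm *mm, size_t nr)
{
	assert(nr <= mm->pinned);
	mm->pinned -= nr;
}

/* Pin 'total' pages in batches; on any failure release everything pinned. */
static int pin_region(struct fake_mm *mm, size_t total, size_t batch)
{
	size_t done = 0;
	long ret = 0;

	while (done < total) {
		size_t nr = total - done < batch ? total - done : batch;

		ret = fake_pin(mm, nr);
		if (ret != (long)nr)	/* error or short pin */
			goto release;
		done += nr;
	}
	return 0;

release:
	if (ret > 0)
		done += (size_t)ret;	/* count the partially pinned batch */
	if (done)
		fake_unpin(mm, done);
	return ret < 0 ? (int)ret : -ENOMEM;
}
```

With 5 pinnable pages and a request for 8 in batches of 4, the second
batch comes back short; the sketch releases all 5 pages and reports
-ENOMEM rather than the positive short count.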



* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-17 21:32 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpFXuHg4KPY27pqMC-xV5y9ZY2W72_R8_rxO0DvrJ=_yvw@mail.gmail.com>

On Tue, Mar 17, 2026 at 2:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > The f_op->mmap interface is deprecated, so update driver to use its
> > successor, mmap_prepare.
> >
> > The driver previously used vm_iomap_memory(), so this change replaces it
> > with its mmap_prepare equivalent, mmap_action_simple_ioremap().
> >
> > Functions that wrap mmap() are also converted to wrap mmap_prepare()
> > instead.
> >
> > Also update the documentation accordingly.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  Documentation/driver-api/vme.rst    |  2 +-
> >  drivers/staging/vme_user/vme.c      | 20 +++++------
> >  drivers/staging/vme_user/vme.h      |  2 +-
> >  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
> >  4 files changed, 42 insertions(+), 33 deletions(-)
> >
> > diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> > index c0b475369de0..7111999abc14 100644
> > --- a/Documentation/driver-api/vme.rst
> > +++ b/Documentation/driver-api/vme.rst
> > @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
> >
> >  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
> >  do a read-modify-write transaction. Parts of a VME window can also be mapped
> > -into user space memory using :c:func:`vme_master_mmap`.
> > +into user space memory using :c:func:`vme_master_mmap_prepare`.
> >
> >
> >  Slave windows
> > diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> > index f10a00c05f12..7220aba7b919 100644
> > --- a/drivers/staging/vme_user/vme.c
> > +++ b/drivers/staging/vme_user/vme.c
> > @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
> >  EXPORT_SYMBOL(vme_master_rmw);
> >
> >  /**
> > - * vme_master_mmap - Mmap region of VME master window.
> > + * vme_master_mmap_prepare - Mmap region of VME master window.
> >   * @resource: Pointer to VME master resource.
> > - * @vma: Pointer to definition of user mapping.
> > + * @desc: Pointer to descriptor of user mapping.
> >   *
> >   * Memory map a region of the VME master window into user space.
> >   *
> > @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
> >   *         resource or -EFAULT if map exceeds window size. Other generic mmap
> >   *         errors may also be returned.
> >   */
> > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > +int vme_master_mmap_prepare(struct vme_resource *resource,
> > +                           struct vm_area_desc *desc)
> >  {
> > +       const unsigned long vma_size = vma_desc_size(desc);
> >         struct vme_bridge *bridge = find_bridge(resource);
> >         struct vme_master_resource *image;
> >         phys_addr_t phys_addr;
> > -       unsigned long vma_size;
> >
> >         if (resource->type != VME_MASTER) {
> >                 dev_err(bridge->parent, "Not a master resource\n");
> > @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> >         }
> >
> >         image = list_entry(resource->entry, struct vme_master_resource, list);
> > -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> > -       vma_size = vma->vm_end - vma->vm_start;
> > +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
> >
> >         if (phys_addr + vma_size > image->bus_resource.end + 1) {
> >                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
> >                 return -EFAULT;
> >         }
> >
> > -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > -
> > -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> > +       desc->page_prot = pgprot_noncached(desc->page_prot);
> > +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> > +       return 0;
> >  }
> > -EXPORT_SYMBOL(vme_master_mmap);
> > +EXPORT_SYMBOL(vme_master_mmap_prepare);
> >
> >  /**
> >   * vme_master_free - Free VME master window
> > diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> > index 797e9940fdd1..b6413605ea49 100644
> > --- a/drivers/staging/vme_user/vme.h
> > +++ b/drivers/staging/vme_user/vme.h
> > @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
> >  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
> >  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
> >                             unsigned int swap, loff_t offset);
> > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> > +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
> >  void vme_master_free(struct vme_resource *resource);
> >
> >  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> > diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> > index d95dd7d9190a..11e25c2f6b0a 100644
> > --- a/drivers/staging/vme_user/vme_user.c
> > +++ b/drivers/staging/vme_user/vme_user.c
> > @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
> >         kfree(vma_priv);
> >  }
> >
> > -static const struct vm_operations_struct vme_user_vm_ops = {
> > -       .open = vme_user_vm_open,
> > -       .close = vme_user_vm_close,
> > -};
> > -
> > -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > +                             const struct file *file, void **vm_private_data)
> >  {
> > -       int err;
> > +       const unsigned int minor = iminor(file_inode(file));
> >         struct vme_user_vma_priv *vma_priv;
> >
> >         mutex_lock(&image[minor].mutex);
> >
> > -       err = vme_master_mmap(image[minor].resource, vma);
> > -       if (err) {
> > -               mutex_unlock(&image[minor].mutex);
> > -               return err;
> > -       }
> > -
>
> Ok, this changes the set of the operations performed under image[minor].mutex.
> Before we had:
>
> mutex_lock(&image[minor].mutex);
> vme_master_mmap();
> <some final adjustments>
> mutex_unlock(&image[minor].mutex);
>
> Now we have:
>
> mutex_lock(&image[minor].mutex);
> vme_master_mmap_prepare()
> mutex_unlock(&image[minor].mutex);
> vm_iomap_memory();
> mutex_lock(&image[minor].mutex);
> vme_user_vm_mapped(); // <some final adjustments>
> mutex_unlock(&image[minor].mutex);
>
> I think as long as image[minor] does not change while we are not
> holding the mutex we should be safe, and looking at the code it seems
> to be the case. But I'm not familiar with this driver and might be
> wrong. Worth double-checking.

A side note: if we had to hold the mutex across all those operations I
think we would need to take the mutex in the vm_ops->mmap_prepare and
add a vm_ops->map_failed hook or something along that line to drop the
mutex in case mmap_action_complete() fails. Not sure if we will have
such cases though...

>
> >         vma_priv = kmalloc_obj(*vma_priv);
> >         if (!vma_priv) {
> >                 mutex_unlock(&image[minor].mutex);
> > @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> >
> >         vma_priv->minor = minor;
> >         refcount_set(&vma_priv->refcnt, 1);
> > -       vma->vm_ops = &vme_user_vm_ops;
> > -       vma->vm_private_data = vma_priv;
> > -
> > +       *vm_private_data = vma_priv;
> >         image[minor].mmap_count++;
> >
> >         mutex_unlock(&image[minor].mutex);
> > -
> >         return 0;
> >  }
> >
> > -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> > +static const struct vm_operations_struct vme_user_vm_ops = {
> > +       .mapped = vme_user_vm_mapped,
> > +       .open = vme_user_vm_open,
> > +       .close = vme_user_vm_close,
> > +};
> > +
> > +static int vme_user_master_mmap_prepare(unsigned int minor,
> > +                                       struct vm_area_desc *desc)
> > +{
> > +       int err;
> > +
> > +       mutex_lock(&image[minor].mutex);
> > +
> > +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> > +       if (!err)
> > +               desc->vm_ops = &vme_user_vm_ops;
> > +
> > +       mutex_unlock(&image[minor].mutex);
> > +       return err;
> > +}
> > +
> > +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
> >  {
> > -       unsigned int minor = iminor(file_inode(file));
> > +       const struct file *file = desc->file;
> > +       const unsigned int minor = iminor(file_inode(file));
> >
> >         if (type[minor] == MASTER_MINOR)
> > -               return vme_user_master_mmap(minor, vma);
> > +               return vme_user_master_mmap_prepare(minor, desc);
> >
> >         return -ENODEV;
> >  }
> > @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
> >         .llseek = vme_user_llseek,
> >         .unlocked_ioctl = vme_user_unlocked_ioctl,
> >         .compat_ioctl = compat_ptr_ioctl,
> > -       .mmap = vme_user_mmap,
> > +       .mmap_prepare = vme_user_mmap_prepare,
> >  };
> >
> >  static int vme_user_match(struct vme_dev *vdev)
> > --
> > 2.53.0
> >


* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-17 21:26 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <48c6d25e374b57dba6df4fdddd4830d3fc1105be.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> The f_op->mmap interface is deprecated, so update driver to use its
> successor, mmap_prepare.
>
> The driver previously used vm_iomap_memory(), so this change replaces it
> with its mmap_prepare equivalent, mmap_action_simple_ioremap().
>
> Functions that wrap mmap() are also converted to wrap mmap_prepare()
> instead.
>
> Also update the documentation accordingly.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
>  Documentation/driver-api/vme.rst    |  2 +-
>  drivers/staging/vme_user/vme.c      | 20 +++++------
>  drivers/staging/vme_user/vme.h      |  2 +-
>  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
>  4 files changed, 42 insertions(+), 33 deletions(-)
>
> diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> index c0b475369de0..7111999abc14 100644
> --- a/Documentation/driver-api/vme.rst
> +++ b/Documentation/driver-api/vme.rst
> @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
>
>  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
>  do a read-modify-write transaction. Parts of a VME window can also be mapped
> -into user space memory using :c:func:`vme_master_mmap`.
> +into user space memory using :c:func:`vme_master_mmap_prepare`.
>
>
>  Slave windows
> diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> index f10a00c05f12..7220aba7b919 100644
> --- a/drivers/staging/vme_user/vme.c
> +++ b/drivers/staging/vme_user/vme.c
> @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
>  EXPORT_SYMBOL(vme_master_rmw);
>
>  /**
> - * vme_master_mmap - Mmap region of VME master window.
> + * vme_master_mmap_prepare - Mmap region of VME master window.
>   * @resource: Pointer to VME master resource.
> - * @vma: Pointer to definition of user mapping.
> + * @desc: Pointer to descriptor of user mapping.
>   *
>   * Memory map a region of the VME master window into user space.
>   *
> @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
>   *         resource or -EFAULT if map exceeds window size. Other generic mmap
>   *         errors may also be returned.
>   */
> -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> +int vme_master_mmap_prepare(struct vme_resource *resource,
> +                           struct vm_area_desc *desc)
>  {
> +       const unsigned long vma_size = vma_desc_size(desc);
>         struct vme_bridge *bridge = find_bridge(resource);
>         struct vme_master_resource *image;
>         phys_addr_t phys_addr;
> -       unsigned long vma_size;
>
>         if (resource->type != VME_MASTER) {
>                 dev_err(bridge->parent, "Not a master resource\n");
> @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
>         }
>
>         image = list_entry(resource->entry, struct vme_master_resource, list);
> -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> -       vma_size = vma->vm_end - vma->vm_start;
> +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
>
>         if (phys_addr + vma_size > image->bus_resource.end + 1) {
>                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
>                 return -EFAULT;
>         }
>
> -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> -
> -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> +       desc->page_prot = pgprot_noncached(desc->page_prot);
> +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> +       return 0;
>  }
> -EXPORT_SYMBOL(vme_master_mmap);
> +EXPORT_SYMBOL(vme_master_mmap_prepare);
>
>  /**
>   * vme_master_free - Free VME master window
> diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> index 797e9940fdd1..b6413605ea49 100644
> --- a/drivers/staging/vme_user/vme.h
> +++ b/drivers/staging/vme_user/vme.h
> @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
>  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
>  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
>                             unsigned int swap, loff_t offset);
> -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
>  void vme_master_free(struct vme_resource *resource);
>
>  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> index d95dd7d9190a..11e25c2f6b0a 100644
> --- a/drivers/staging/vme_user/vme_user.c
> +++ b/drivers/staging/vme_user/vme_user.c
> @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
>         kfree(vma_priv);
>  }
>
> -static const struct vm_operations_struct vme_user_vm_ops = {
> -       .open = vme_user_vm_open,
> -       .close = vme_user_vm_close,
> -};
> -
> -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> +                             const struct file *file, void **vm_private_data)
>  {
> -       int err;
> +       const unsigned int minor = iminor(file_inode(file));
>         struct vme_user_vma_priv *vma_priv;
>
>         mutex_lock(&image[minor].mutex);
>
> -       err = vme_master_mmap(image[minor].resource, vma);
> -       if (err) {
> -               mutex_unlock(&image[minor].mutex);
> -               return err;
> -       }
> -

Ok, this changes the set of the operations performed under image[minor].mutex.
Before we had:

mutex_lock(&image[minor].mutex);
vme_master_mmap();
<some final adjustments>
mutex_unlock(&image[minor].mutex);

Now we have:

mutex_lock(&image[minor].mutex);
vme_master_mmap_prepare()
mutex_unlock(&image[minor].mutex);
vm_iomap_memory();
mutex_lock(&image[minor].mutex);
vme_user_vm_mapped(); // <some final adjustments>
mutex_unlock(&image[minor].mutex);

I think as long as image[minor] does not change while we are not
holding the mutex we should be safe, and looking at the code it seems
to be the case. But I'm not familiar with this driver and might be
wrong. Worth double-checking.
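
To make the new sequence concrete, here is a rough userspace sketch of
the two-phase pattern, assuming the state checked in the first phase is
not modified while the lock is dropped. It uses a pthread mutex in place
of the kernel mutex, and struct image_sim, sim_prepare(), sim_mapped()
and sim_mmap() are made-up names for illustration, not driver functions.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <pthread.h>

/* Hypothetical stand-in for one image[minor] entry. */
struct image_sim {
	pthread_mutex_t mutex;
	unsigned long window_size;
	int mmap_count;
};

/* Phase 1: validate the request under the lock, then drop the lock. */
static int sim_prepare(struct image_sim *img, unsigned long offset,
		       unsigned long size)
{
	int err = 0;

	pthread_mutex_lock(&img->mutex);
	if (offset + size > img->window_size)
		err = -EFAULT;
	pthread_mutex_unlock(&img->mutex);
	return err;
}

/* Phase 2: runs after the mapping exists; retakes the lock to account it. */
static int sim_mapped(struct image_sim *img)
{
	pthread_mutex_lock(&img->mutex);
	assert(img->mmap_count < INT_MAX);
	img->mmap_count++;
	pthread_mutex_unlock(&img->mutex);
	return 0;
}

static int sim_mmap(struct image_sim *img, unsigned long offset,
		    unsigned long size)
{
	int err = sim_prepare(img, offset, size);

	if (err)
		return err;
	/*
	 * Unlocked window: in the real flow the core would perform the
	 * I/O remap here. Anything validated in phase 1 must still hold.
	 */
	return sim_mapped(img);
}
```

The point of the sketch is only the shape: the check and the accounting
are each done under the lock, but not atomically with respect to each
other.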

>         vma_priv = kmalloc_obj(*vma_priv);
>         if (!vma_priv) {
>                 mutex_unlock(&image[minor].mutex);
> @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
>
>         vma_priv->minor = minor;
>         refcount_set(&vma_priv->refcnt, 1);
> -       vma->vm_ops = &vme_user_vm_ops;
> -       vma->vm_private_data = vma_priv;
> -
> +       *vm_private_data = vma_priv;
>         image[minor].mmap_count++;
>
>         mutex_unlock(&image[minor].mutex);
> -
>         return 0;
>  }
>
> -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> +static const struct vm_operations_struct vme_user_vm_ops = {
> +       .mapped = vme_user_vm_mapped,
> +       .open = vme_user_vm_open,
> +       .close = vme_user_vm_close,
> +};
> +
> +static int vme_user_master_mmap_prepare(unsigned int minor,
> +                                       struct vm_area_desc *desc)
> +{
> +       int err;
> +
> +       mutex_lock(&image[minor].mutex);
> +
> +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> +       if (!err)
> +               desc->vm_ops = &vme_user_vm_ops;
> +
> +       mutex_unlock(&image[minor].mutex);
> +       return err;
> +}
> +
> +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
>  {
> -       unsigned int minor = iminor(file_inode(file));
> +       const struct file *file = desc->file;
> +       const unsigned int minor = iminor(file_inode(file));
>
>         if (type[minor] == MASTER_MINOR)
> -               return vme_user_master_mmap(minor, vma);
> +               return vme_user_master_mmap_prepare(minor, desc);
>
>         return -ENODEV;
>  }
> @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
>         .llseek = vme_user_llseek,
>         .unlocked_ioctl = vme_user_unlocked_ioctl,
>         .compat_ioctl = compat_ptr_ioctl,
> -       .mmap = vme_user_mmap,
> +       .mmap_prepare = vme_user_mmap_prepare,
>  };
>
>  static int vme_user_match(struct vme_dev *vdev)
> --
> 2.53.0
>


* Re: [PATCH v2 10/16] stm: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-17 20:46 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <d34056a65bd387286f4e155d52449106ddc99f78.1773695307.git.ljs@kernel.org>

On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> The f_op->mmap interface is deprecated, so update driver to use its
> successor, mmap_prepare.
>
> The driver previously used vm_iomap_memory(), so this change replaces it
> with its mmap_prepare equivalent, mmap_action_simple_ioremap().
>
> Also, in order to correctly maintain reference counting, add a
> vm_ops->mapped callback to increment the reference count when successfully
> mapped.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  drivers/hwtracing/stm/core.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/hwtracing/stm/core.c b/drivers/hwtracing/stm/core.c
> index 37584e786bb5..f48c6a8a0654 100644
> --- a/drivers/hwtracing/stm/core.c
> +++ b/drivers/hwtracing/stm/core.c
> @@ -666,6 +666,16 @@ static ssize_t stm_char_write(struct file *file, const char __user *buf,
>         return count;
>  }
>
> +static int stm_mmap_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> +                          const struct file *file, void **vm_private_data)
> +{
> +       struct stm_file *stmf = file->private_data;
> +       struct stm_device *stm = stmf->stm;
> +
> +       pm_runtime_get_sync(&stm->dev);
> +       return 0;
> +}
> +
>  static void stm_mmap_open(struct vm_area_struct *vma)
>  {
>         struct stm_file *stmf = vma->vm_file->private_data;
> @@ -684,12 +694,14 @@ static void stm_mmap_close(struct vm_area_struct *vma)
>  }
>
>  static const struct vm_operations_struct stm_mmap_vmops = {
> +       .mapped = stm_mmap_mapped,
>         .open   = stm_mmap_open,
>         .close  = stm_mmap_close,
>  };
>
> -static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
> +static int stm_char_mmap_prepare(struct vm_area_desc *desc)
>  {
> +       struct file *file = desc->file;
>         struct stm_file *stmf = file->private_data;
>         struct stm_device *stm = stmf->stm;
>         unsigned long size, phys;
> @@ -697,10 +709,10 @@ static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
>         if (!stm->data->mmio_addr)
>                 return -EOPNOTSUPP;
>
> -       if (vma->vm_pgoff)
> +       if (desc->pgoff)
>                 return -EINVAL;
>
> -       size = vma->vm_end - vma->vm_start;
> +       size = vma_desc_size(desc);
>
>         if (stmf->output.nr_chans * stm->data->sw_mmiosz != size)
>                 return -EINVAL;
> @@ -712,13 +724,12 @@ static int stm_char_mmap(struct file *file, struct vm_area_struct *vma)
>         if (!phys)
>                 return -EINVAL;
>
> -       pm_runtime_get_sync(&stm->dev);
> -
> -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> -       vm_flags_set(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
> -       vma->vm_ops = &stm_mmap_vmops;
> -       vm_iomap_memory(vma, phys, size);
> +       desc->page_prot = pgprot_noncached(desc->page_prot);
> +       vma_desc_set_flags(desc, VMA_IO_BIT, VMA_DONTEXPAND_BIT,
> +                          VMA_DONTDUMP_BIT);
> +       desc->vm_ops = &stm_mmap_vmops;
>
> +       mmap_action_simple_ioremap(desc, phys, size);
>         return 0;
>  }
>
> @@ -836,7 +847,7 @@ static const struct file_operations stm_fops = {
>         .open           = stm_char_open,
>         .release        = stm_char_release,
>         .write          = stm_char_write,
> -       .mmap           = stm_char_mmap,
> +       .mmap_prepare   = stm_char_mmap_prepare,
>         .unlocked_ioctl = stm_char_ioctl,
>         .compat_ioctl   = compat_ptr_ioctl,
>  };
> --
> 2.53.0
>


* [PATCH net-next v6 3/3] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg, Shiraz Saleem,
	Kees Cook, Subbaraya Sundeep, Breno Leitao, linux-kernel,
	linux-rdma
  Cc: paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

To measure coalescing efficiency for RX CQEs of type CQE_RX_COALESCED_4,
add counters for how many CQEs contain 2, 3, and 4 packets respectively.
Also add a counter for the error case where the first packet has
length == 0.

Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v5:
  Combine the accounting logic as suggested by Simon Horman.

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++------
 .../ethernet/microsoft/mana/mana_ethtool.c    | 15 ++++++++++--
 include/net/mana/mana.h                       |  9 ++++---
 3 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index fa30046dcd3d..49c65cc1697c 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2147,14 +2147,8 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
 		old_buf = NULL;
 		pktlen = oob->ppi[i].pkt_len;
-		if (pktlen == 0) {
-			if (i == 0)
-				netdev_err_once(
-					ndev,
-					"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
-					rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+		if (pktlen == 0)
 			break;
-		}
 
 		curr = rxq->buf_index;
 		rxbuf_oob = &rxq->rx_oobs[curr];
@@ -2175,6 +2169,22 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		if (!coalesced)
 			break;
 	}
+
+	/* Collect coalesced CQE count based on packets processed.
+	 * Coalesced CQEs have at least 2 packets, so index is i - 2.
+	 */
+	if (i > 1) {
+		u64_stats_update_begin(&rxq->stats.syncp);
+		rxq->stats.coalesced_cqe[i - 2]++;
+		u64_stats_update_end(&rxq->stats.syncp);
+	} else if (!i && !pktlen) {
+		u64_stats_update_begin(&rxq->stats.syncp);
+		rxq->stats.pkt_len0_err++;
+		u64_stats_update_end(&rxq->stats.syncp);
+		netdev_err_once(ndev,
+				"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
+				rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+	}
 }
 
 static void mana_poll_rx_cq(struct mana_cq *cq)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 4b234b16e57a..6a4b42fe0944 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -149,7 +149,7 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
-	int i;
+	int i, j;
 
 	if (stringset != ETH_SS_STATS)
 		return;
@@ -168,6 +168,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
 		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
 		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
@@ -201,6 +204,8 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	u64 xdp_xmit;
 	u64 xdp_drop;
 	u64 xdp_tx;
+	u64 pkt_len0_err;
+	u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
 	u64 tso_packets;
 	u64 tso_bytes;
 	u64 tso_inner_packets;
@@ -209,7 +214,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	u64 short_pkt_fmt;
 	u64 csum_partial;
 	u64 mana_map_err;
-	int q, i = 0;
+	int q, i = 0, j;
 
 	if (!apc->port_is_up)
 		return;
@@ -239,6 +244,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 			xdp_drop = rx_stats->xdp_drop;
 			xdp_tx = rx_stats->xdp_tx;
 			xdp_redirect = rx_stats->xdp_redirect;
+			pkt_len0_err = rx_stats->pkt_len0_err;
+			for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+				coalesced_cqe[j] = rx_stats->coalesced_cqe[j];
 		} while (u64_stats_fetch_retry(&rx_stats->syncp, start));
 
 		data[i++] = packets;
@@ -246,6 +254,9 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 		data[i++] = xdp_drop;
 		data[i++] = xdp_tx;
 		data[i++] = xdp_redirect;
+		data[i++] = pkt_len0_err;
+		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
+			data[i++] = coalesced_cqe[j];
 	}
 
 	for (q = 0; q < num_queues; q++) {
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a7f89e7ddc56..3336688fed5e 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -61,8 +61,11 @@ enum TRI_STATE {
 
 #define MAX_PORTS_IN_MANA_DEV 256
 
+/* Maximum number of packets per coalesced CQE */
+#define MANA_RXCOMP_OOB_NUM_PPI 4
+
 /* Update this count whenever the respective structures are changed */
-#define MANA_STATS_RX_COUNT 5
+#define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
 #define MANA_STATS_TX_COUNT 11
 
 #define MANA_RX_FRAG_ALIGNMENT 64
@@ -73,6 +76,8 @@ struct mana_stats_rx {
 	u64 xdp_drop;
 	u64 xdp_tx;
 	u64 xdp_redirect;
+	u64 pkt_len0_err;
+	u64 coalesced_cqe[MANA_RXCOMP_OOB_NUM_PPI - 1];
 	struct u64_stats_sync syncp;
 };
 
@@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
 	u32 pkt_hash;
 }; /* HW DATA */
 
-#define MANA_RXCOMP_OOB_NUM_PPI 4
-
 /* Receive completion OOB */
 struct mana_rxcomp_oob {
 	struct mana_cqe_header cqe_hdr;
-- 
2.34.1
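
For illustration only, the accounting rule in this patch can be modelled
in plain userspace C. This is a simplified sketch, not the driver code:
struct rx_stats_sim and account_cqe() are hypothetical names, NUM_PPI
stands in for MANA_RXCOMP_OOB_NUM_PPI, and packet processing and the
u64_stats synchronisation are left out.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PPI 4	/* stand-in for MANA_RXCOMP_OOB_NUM_PPI */

struct rx_stats_sim {
	uint64_t pkt_len0_err;
	uint64_t coalesced_cqe[NUM_PPI - 1];	/* CQEs with 2, 3, 4 packets */
};

/* Walk the per-packet lengths of one CQE and update the counters. */
static void account_cqe(struct rx_stats_sim *st,
			const uint32_t pkt_len[NUM_PPI], int coalesced)
{
	uint32_t pktlen = 0;
	int i;

	for (i = 0; i < NUM_PPI; i++) {
		pktlen = pkt_len[i];
		if (pktlen == 0)
			break;
		/* Packet i would be handed to the stack here. */
		if (!coalesced)
			break;	/* leaves i == 0 with pktlen != 0 */
	}

	if (i > 1) {
		/* i packets were seen; a coalesced CQE has at least 2. */
		assert(i - 2 < NUM_PPI - 1);
		st->coalesced_cqe[i - 2]++;
	} else if (!i && !pktlen) {
		/* The very first packet had length 0. */
		st->pkt_len0_err++;
	}
}
```

Note that a non-coalesced CQE also leaves the loop with i == 0, which is
why the error branch has to check pktlen as well.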



* [PATCH net-next v6 2/3] net: mana: Add support for RX CQE Coalescing
From: Haiyang Zhang @ 2026-03-17 19:18 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Saurabh Sengar, Dipayaan Roy, Aditya Garg,
	Shiraz Saleem, Kees Cook, Subbaraya Sundeep, Breno Leitao,
	linux-kernel, linux-rdma
  Cc: paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>

From: Haiyang Zhang <haiyangz@microsoft.com>

Our NIC can place up to 4 RX packets in a single CQE. To support this
feature, check for and process the CQE type CQE_RX_COALESCED_4. The
feature is disabled by default to avoid a possible latency regression.
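
For illustration only, the per-CQE walk can be sketched in standalone C as
below. This is a minimal sketch, not the driver code: struct demo_cqe and
demo_count_pkts() are hypothetical stand-ins, and the refill/deliver step
is reduced to a counter. It assumes, as the patch's loop does, that a
pkt_len of 0 marks the end of the packets in a CQE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PPI 4 /* mirrors MANA_RXCOMP_OOB_NUM_PPI in the patch */

/* Hypothetical, simplified stand-in for struct mana_rxcomp_oob. */
struct demo_cqe {
	bool coalesced;            /* true for a CQE_RX_COALESCED_4 completion */
	uint32_t pkt_len[NUM_PPI]; /* per-packet lengths; 0 ends the list */
};

/* Count the packets carried by one CQE, following the patch's loop shape. */
static int demo_count_pkts(const struct demo_cqe *cqe)
{
	int n = 0;

	assert(cqe != NULL);

	for (int i = 0; i < NUM_PPI; i++) {
		if (cqe->pkt_len[i] == 0)
			break;	/* no more packets in this CQE */

		n++;		/* the driver would refill and deliver here */

		if (!cqe->coalesced)
			break;	/* a normal CQE carries exactly one packet */
	}

	return n;
}
```

With this shape, a non-coalesced CQE yields one packet even if later
pkt_len slots hold stale data, and a coalesced CQE yields between one and
four packets depending on where the first zero length appears.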

Also add ethtool handlers to toggle this feature. To turn it on, run:
  ethtool -C <nic> rx-cqe-frames 4
To turn it off:
  ethtool -C <nic> rx-cqe-frames 1

rx-cqe-nsec is the timeout, in nanoseconds, measured from the arrival
of the first packet, after which a coalesced CQE is sent. It is
read-only for this NIC.
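
As a rough standalone sketch of how the ethtool values map to driver
state (the demo_* helpers are hypothetical names, not functions in the
driver): rx-cqe-frames accepts only 1 or 4, and the reported timeout
falls back to a default of 2048 ns when the port is up, coalescing is
enabled, and the firmware returned 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PPI          4    /* mirrors MANA_RXCOMP_OOB_NUM_PPI */
#define RX_CQE_NSEC_DEF  2048 /* mirrors MANA_RX_CQE_NSEC_DEF */

/* Hypothetical helper: validate rx-cqe-frames and derive the enable flag. */
static int demo_frames_to_enable(uint32_t rx_cqe_frames, uint8_t *enable)
{
	assert(enable != NULL);

	if (rx_cqe_frames != 1 && rx_cqe_frames != NUM_PPI)
		return -EINVAL;	/* only "off" (1) and "on" (4) are accepted */

	*enable = (rx_cqe_frames == NUM_PPI);
	return 0;
}

/* Hypothetical helper: the frame count the get handler would report. */
static uint32_t demo_enable_to_frames(uint8_t enable)
{
	return enable ? NUM_PPI : 1;
}

/* Hypothetical helper: the timeout the get handler would report. */
static uint32_t demo_report_nsecs(bool port_is_up, uint8_t enable,
				  uint32_t fw_timeout_ns)
{
	/* Older firmware may not return a timeout; use the default then. */
	if (port_is_up && enable && fw_timeout_ns == 0)
		return RX_CQE_NSEC_DEF;

	return fw_timeout_ns;
}
```

The set path in the patch additionally restores the previous enable value
if reconfiguring the vPort fails; that rollback is left out of this
sketch.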

Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v4:
  Fixed the old_buf issue found by AI.

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 74 ++++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    | 60 ++++++++++++++-
 include/net/mana/mana.h                       |  8 +-
 3 files changed, 113 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ea71de39f996..fa30046dcd3d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1365,6 +1365,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 			     sizeof(resp));
 
 	req->hdr.req.msg_version = GDMA_MESSAGE_V2;
+	req->hdr.resp.msg_version = GDMA_MESSAGE_V2;
 
 	req->vport = apc->port_handle;
 	req->num_indir_entries = apc->indir_table_sz;
@@ -1376,7 +1377,9 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	req->update_hashkey = update_key;
 	req->update_indir_tab = update_tab;
 	req->default_rxobj = apc->default_rxobj;
-	req->cqe_coalescing_enable = 0;
+
+	if (rx != TRI_STATE_FALSE)
+		req->cqe_coalescing_enable = apc->cqe_coalescing_enable;
 
 	if (update_key)
 		memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
@@ -1405,8 +1408,13 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 		netdev_err(ndev, "vPort RX configuration failed: 0x%x\n",
 			   resp.hdr.status);
 		err = -EPROTO;
+		goto out;
 	}
 
+	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
+		apc->cqe_coalescing_timeout_ns =
+			resp.cqe_coalescing_timeout_ns;
+
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
 out:
@@ -1915,11 +1923,12 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 }
 
 static void mana_rx_skb(void *buf_va, bool from_pool,
-			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq,
+			int i)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
-	uint pkt_len = cqe->ppi[0].pkt_len;
+	uint pkt_len = cqe->ppi[i].pkt_len;
 	u16 rxq_idx = rxq->rxq_idx;
 	struct napi_struct *napi;
 	struct xdp_buff xdp = {};
@@ -1963,7 +1972,7 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 	}
 
 	if (cqe->rx_hashtype != 0 && (ndev->features & NETIF_F_RXHASH)) {
-		hash_value = cqe->ppi[0].pkt_hash;
+		hash_value = cqe->ppi[i].pkt_hash;
 
 		if (cqe->rx_hashtype & MANA_HASH_L4)
 			skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L4);
@@ -2098,9 +2107,11 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct mana_recv_buf_oob *rxbuf_oob;
 	struct mana_port_context *apc;
 	struct device *dev = gc->dev;
+	bool coalesced = false;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
 	bool old_fp;
+	int i;
 
 	apc = netdev_priv(ndev);
 
@@ -2112,13 +2123,16 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		++ndev->stats.rx_dropped;
 		rxbuf_oob = &rxq->rx_oobs[rxq->buf_index];
 		netdev_warn_once(ndev, "Dropped a truncated packet\n");
-		goto drop;
 
-	case CQE_RX_COALESCED_4:
-		netdev_err(ndev, "RX coalescing is unsupported\n");
-		apc->eth_stats.rx_coalesced_err++;
+		mana_move_wq_tail(rxq->gdma_rq,
+				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
+		mana_post_pkt_rxq(rxq);
 		return;
 
+	case CQE_RX_COALESCED_4:
+		coalesced = true;
+		break;
+
 	case CQE_RX_OBJECT_FENCE:
 		complete(&rxq->fence_event);
 		return;
@@ -2130,30 +2144,37 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		return;
 	}
 
-	pktlen = oob->ppi[0].pkt_len;
+	for (i = 0; i < MANA_RXCOMP_OOB_NUM_PPI; i++) {
+		old_buf = NULL;
+		pktlen = oob->ppi[i].pkt_len;
+		if (pktlen == 0) {
+			if (i == 0)
+				netdev_err_once(
+					ndev,
+					"RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
+					rxq->gdma_id, cq->gdma_id, rxq->rxobj);
+			break;
+		}
 
-	if (pktlen == 0) {
-		/* data packets should never have packetlength of zero */
-		netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
-			   rxq->gdma_id, cq->gdma_id, rxq->rxobj);
-		return;
-	}
+		curr = rxq->buf_index;
+		rxbuf_oob = &rxq->rx_oobs[curr];
+		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-	curr = rxq->buf_index;
-	rxbuf_oob = &rxq->rx_oobs[curr];
-	WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
+		mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
 
-	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+		/* Unsuccessful refill will have old_buf == NULL.
+		 * In this case, mana_rx_skb() will drop the packet.
+		 */
+		mana_rx_skb(old_buf, old_fp, oob, rxq, i);
 
-	/* Unsuccessful refill will have old_buf == NULL.
-	 * In this case, mana_rx_skb() will drop the packet.
-	 */
-	mana_rx_skb(old_buf, old_fp, oob, rxq);
+		mana_move_wq_tail(rxq->gdma_rq,
+				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
 
-drop:
-	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
+		mana_post_pkt_rxq(rxq);
 
-	mana_post_pkt_rxq(rxq);
+		if (!coalesced)
+			break;
+	}
 }
 
 static void mana_poll_rx_cq(struct mana_cq *cq)
@@ -3332,6 +3353,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->port_handle = INVALID_MANA_HANDLE;
 	apc->pf_filter_handle = INVALID_MANA_HANDLE;
 	apc->port_idx = port_idx;
+	apc->cqe_coalescing_enable = 0;
 
 	mutex_init(&apc->vport_mutex);
 	apc->vport_use_count = 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..4b234b16e57a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -20,8 +20,6 @@ static const struct mana_stats_desc mana_eth_stats[] = {
 					tx_cqe_unknown_type)},
 	{"tx_linear_pkt_cnt", offsetof(struct mana_ethtool_stats,
 				       tx_linear_pkt_cnt)},
-	{"rx_coalesced_err", offsetof(struct mana_ethtool_stats,
-					rx_coalesced_err)},
 	{"rx_cqe_unknown_type", offsetof(struct mana_ethtool_stats,
 					rx_cqe_unknown_type)},
 };
@@ -390,6 +388,61 @@ static void mana_get_channels(struct net_device *ndev,
 	channel->combined_count = apc->num_queues;
 }
 
+#define MANA_RX_CQE_NSEC_DEF 2048
+static int mana_get_coalesce(struct net_device *ndev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	kernel_coal->rx_cqe_frames =
+		apc->cqe_coalescing_enable ? MANA_RXCOMP_OOB_NUM_PPI : 1;
+
+	kernel_coal->rx_cqe_nsecs = apc->cqe_coalescing_timeout_ns;
+
+	/* Return the default timeout value for old FW not providing
+	 * this value.
+	 */
+	if (apc->port_is_up && apc->cqe_coalescing_enable &&
+	    !kernel_coal->rx_cqe_nsecs)
+		kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
+
+	return 0;
+}
+
+static int mana_set_coalesce(struct net_device *ndev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u8 saved_cqe_coalescing_enable;
+	int err;
+
+	if (kernel_coal->rx_cqe_frames != 1 &&
+	    kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "rx-frames must be 1 or %u, got %u",
+				   MANA_RXCOMP_OOB_NUM_PPI,
+				   kernel_coal->rx_cqe_frames);
+		return -EINVAL;
+	}
+
+	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+	apc->cqe_coalescing_enable =
+		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
+
+	if (!apc->port_is_up)
+		return 0;
+
+	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+	if (err)
+		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+
+	return err;
+}
+
 static int mana_set_channels(struct net_device *ndev,
 			     struct ethtool_channels *channels)
 {
@@ -510,6 +563,7 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 }
 
 const struct ethtool_ops mana_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
 	.get_sset_count		= mana_get_sset_count,
 	.get_strings		= mana_get_strings,
@@ -520,6 +574,8 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_rxfh		= mana_set_rxfh,
 	.get_channels		= mana_get_channels,
 	.set_channels		= mana_set_channels,
+	.get_coalesce		= mana_get_coalesce,
+	.set_coalesce		= mana_set_coalesce,
 	.get_ringparam          = mana_get_ringparam,
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..a7f89e7ddc56 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -378,7 +378,6 @@ struct mana_ethtool_stats {
 	u64 tx_cqe_err;
 	u64 tx_cqe_unknown_type;
 	u64 tx_linear_pkt_cnt;
-	u64 rx_coalesced_err;
 	u64 rx_cqe_unknown_type;
 };
 
@@ -557,6 +556,9 @@ struct mana_port_context {
 	bool port_is_up;
 	bool port_st_save; /* Saved port state */
 
+	u8 cqe_coalescing_enable;
+	u32 cqe_coalescing_timeout_ns;
+
 	struct mana_ethtool_stats eth_stats;
 
 	struct mana_ethtool_phy_stats phy_stats;
@@ -902,6 +904,10 @@ struct mana_cfg_rx_steer_req_v2 {
 
 struct mana_cfg_rx_steer_resp {
 	struct gdma_resp_hdr hdr;
+
+	/* V2 */
+	u32 cqe_coalescing_timeout_ns;
+	u32 reserved1;
 }; /* HW DATA */
 
 /* Register HW vPort */
-- 
2.34.1

