Linux-HyperV List


* Re: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Wei Liu @ 2026-03-18 16:20 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Wei Liu, Stanislav Kinsburskii, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157A6D37C19379C74C88ACCD44EA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, Mar 18, 2026 at 02:38:49PM +0000, Michael Kelley wrote:
> From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 17, 2026 11:20 PM
> > 
> > On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > > >
> > > > The current error handling has two issues:
> > > >
> > > > First, pin_user_pages_fast() can return a short pin count (less than
> > > > requested but greater than zero) when it cannot pin all requested pages.
> > > > This is treated as success, leading to partially pinned regions being
> > > > used, which causes memory corruption.
> > > >
> > > > Second, when an error occurs mid-loop, already pinned pages from the
> > > > current batch are not released before calling mshv_region_evict_pages(),
> > > > causing a page reference leak.
> > >
> > > There's now an online LLM-based tool that is automatically reviewing
> > > kernel patches.  For this patch, the results are here:
> > >
> > >
> > > https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
> > >
> > > It has flagged the commit message as incorrectly referencing the
> > > function mshv_region_evict_pages(), which doesn't exist.
> > >
> > > FWIW, the announcement about sashiko.dev is here:
> > >
> > > https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> > >
> > > Other than the commit message reference, this looks good to me.
> > >
> > > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> > 
> > The second point is written as if the code here should release the
> > already pinned pages before calling mshv_region_invalidate_pages(), but
> > the code actually relies on mshv_region_invalidate_pages() to
> > release the pages. The change here fixes the accounting.
> > 
> >  Second, when an error occurs mid-loop, already pinned pages from the
> >  current batch are not accounted for before calling
> >  mshv_region_invalidate_pages(), causing a page reference leak.
> > 
> > And queued up the patch to hyperv-fixes.
> 
> One other thing I noticed:  The "Subject" of the patch is wrong. It
> mentions mshv_region_populate_pages(), but the function being
> modified is actually mshv_region_pin().

Good catch. I have updated the subject line and pushed to hyperv-fixes.

Wei

> 
> Michael
> 
> > 
> > Wei
> > 
> > >
> > > >
> > > > Fix by treating short pins as errors and explicitly unpinning the
> > > > partial batch before cleanup.
> > > >
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >  drivers/hv/mshv_regions.c |    6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > > > index c28aac0726de..fdffd4f002f6 100644
> > > > --- a/drivers/hv/mshv_regions.c
> > > > +++ b/drivers/hv/mshv_regions.c
> > > > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > > >  		ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > > >  					  FOLL_WRITE | FOLL_LONGTERM,
> > > >  					  pages);
> > > > -		if (ret < 0)
> > > > +		if (ret != nr_pages)
> > > >  			goto release_pages;
> > > >  	}
> > > >
> > > >  	return 0;
> > > >
> > > >  release_pages:
> > > > +	if (ret > 0)
> > > > +		done_count += ret;
> > > >  	mshv_region_invalidate_pages(region, 0, done_count);
> > > > -	return ret;
> > > > +	return ret < 0 ? ret : -ENOMEM;
> > > >  }
> > > >
> > > >  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > > >
> > > >
> > >
> 
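
The short-pin handling discussed in this thread can be sketched in userspace as follows. This is only an illustration under stated assumptions: pin_region(), struct pin_result and fake_pin() are hypothetical names, and fake_pin() merely stands in for pin_user_pages_fast() by pinning up to a fixed budget of pages. The point it shows is the accounting: a partial batch is added to the running count so that the cleanup step releases it too, and a short pin is reported as -ENOMEM.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a pinning call: returns pages pinned or -errno. */
typedef long (*pin_fn)(size_t nr_pages, void *ctx);

struct pin_result {
	int err;		/* 0 on success, negative errno on failure */
	size_t to_release;	/* pages the cleanup path must release on failure */
};

/* Fake pinner for demonstration: pins at most *budget pages in total. */
static long fake_pin(size_t nr_pages, void *ctx)
{
	size_t *budget = ctx;
	size_t n = nr_pages < *budget ? nr_pages : *budget;

	*budget -= n;
	return (long)n;
}

static struct pin_result pin_region(size_t total, size_t batch,
				    pin_fn pin, void *ctx)
{
	struct pin_result res = { 0, 0 };
	size_t done = 0;

	while (done < total) {
		size_t nr = total - done < batch ? total - done : batch;
		long ret = pin(nr, ctx);

		if (ret != (long)nr) {
			/* Account for a partial batch before cleanup. */
			if (ret > 0)
				done += (size_t)ret;
			/* A short pin is an error even though ret >= 0. */
			res.err = ret < 0 ? (int)ret : -ENOMEM;
			res.to_release = done;
			return res;
		}
		done += nr;
	}
	return res;
}
```

With a budget of 10 pages, a request for 16 pages in batches of 4 fails on the third batch, and the result asks the caller to release all 10 pinned pages rather than only the 8 from the completed batches.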


* Re: [PATCH] lib: count_zeros: fix 32/64-bit inconsistency in count_trailing_zeros()
From: Yury Norov @ 2026-03-18 16:31 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Yury Norov, Andy Shevchenko, Rasmus Villemoes,
	Eric Biggers, Jason A. Donenfeld, Ard Biesheuvel, linux-kernel,
	kexec, linux-cifs, linux-spi, linux-hyperv, K. Y. Srinivasan,
	Haiyang Zhang, Mark Brown, Steve French, Alexander Graf,
	Mike Rapoport, Pasha Tatashin
In-Reply-To: <20260317091411.GQ61385@unreal>

On Tue, Mar 17, 2026 at 11:14:11AM +0200, Leon Romanovsky wrote:
> On Fri, Mar 13, 2026 at 02:14:49PM -0400, Yury Norov wrote:
> > On Fri, Mar 13, 2026 at 02:18:55PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 12, 2026 at 07:08:16PM -0400, Yury Norov wrote:
> > > > Based on 'sizeof(x) == 4' condition, in 32-bit case the function is wired
> > > > to ffs(), while in 64-bit case to __ffs(). The difference is substantial:
> > > > ffs(x) == __ffs(x) + 1. Also, ffs(0) == 0, while __ffs(0) is undefined.
> > > > 
> > > > The 32-bit behaviour is inconsistent with the function description, so it
> > > > needs to get fixed.
> > > > 
> > > > There are 9 individual users for the function in 6 different subsystems.
> > > > Some arches and drivers are 64-bit only:
> > > >  - arch/loongarch/kvm/intc/eiointc.c;
> > > >  - drivers/hv/mshv_vtl_main.c;
> > > >  - kernel/liveupdate/kexec_handover.c;
> > > > 
> > > > The others are:
> > > >  - ib_umem_find_best_pgsz(): as per comment, __ffs() should be correct;
> > > 
> > > So long as 32 bit works the same as 64 bit it is correct for ib
> > 
> > This is what the patch does, except that it doesn't account for the
> > word length. In your case, 'mask' is dma_addr_t, which is u32 or u64
> > depending ARCH_DMA_ADDR_T_64BIT.
> > 
> > This config is:
> > 
> >         config ARCH_DMA_ADDR_T_64BIT
> >                 def_bool 64BIT || PHYS_ADDR_T_64BIT
> > 
> > And PHYS_ADDR_T_64BIT is simply def_bool 64BIT. So, at least now
> > dma_addr_t simply follows unsigned long, and thus, the patch is
> > correct. But IDK what's the history behind these configurations.
> > 
> > Anyways, the patch aligns 32-bit count_trailing_zeros() with the
> > 64-bit one. If you're OK with that, as you said, can you please send
> > an explicit ack?
> 
> I can do that, 32-bit architectures are rarely used in the IB world.
> 
> Thanks,
> Acked-by: Leon Romanovsky <leon@kernel.org>

Thanks, Leon. Seemingly no headwinds for the patch. Taking it in -next
for testing.

Thanks,
Yury
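
The off-by-one between ffs() and __ffs() described in the quoted commit message can be checked with a small userspace sketch. The helper names are hypothetical: ffs_1based() approximates the kernel's ffs() using the GCC/Clang builtin __builtin_ffs(), and ffs_0based() approximates __ffs() using __builtin_ctzl(). Returning -1 for a zero argument in trailing_zeros() is a choice made for this sketch, not something taken from the patch.

```c
#include <limits.h>

/* 1-based index of the lowest set bit; returns 0 when x == 0. */
static int ffs_1based(unsigned int x)
{
	return __builtin_ffs((int)x);
}

/* 0-based index of the lowest set bit; undefined for x == 0. */
static int ffs_0based(unsigned long x)
{
	return __builtin_ctzl(x);
}

/*
 * Trailing-zero count that gives the same answer regardless of the
 * width of unsigned long. The zero case is handled explicitly because
 * the 0-based helper is undefined for it.
 */
static int trailing_zeros(unsigned long x)
{
	return x ? ffs_0based(x) : -1;
}
```

For x = 8 the 1-based helper yields 4 while the 0-based helper yields 3, which is the inconsistency a caller would see if the 32-bit and 64-bit paths were wired to different primitives.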


* Re: [PATCH v2 06/16] mm: add mmap_action_simple_ioremap()
From: Lorenzo Stoakes (Oracle) @ 2026-03-18 20:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpGocCSRT0yDxPOLg2NZ+W_ZSTjHGPZRKBd3U90=sQtHCw@mail.gmail.com>

On Mon, Mar 16, 2026 at 09:14:28PM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 16, 2026 at 2:13 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > Currently drivers use vm_iomap_memory() as a simple helper function for
> > I/O remapping memory over a range starting at a specified physical address
> > over a specified length.
> >
> > In order to utilise this from mmap_prepare, separate out the core logic
> > into __simple_ioremap_prep(), update vm_iomap_memory() to use it, and add
> > simple_ioremap_prepare() to do the same with a VMA descriptor object.
> >
> > We also add MMAP_SIMPLE_IO_REMAP and relevant fields to the struct
> > mmap_action type to permit this operation also.
> >
> > We use mmap_action_ioremap() to set up the actual I/O remap operation once
> > we have checked and figured out the parameters, which makes
> > simple_ioremap_prepare() easy to implement.
> >
> > We then add mmap_action_simple_ioremap() to allow drivers to make use of
> > this mode.
> >
> > We update the mmap_prepare documentation to describe this mode.
> >
> > Finally, we update the VMA tests to reflect this change.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>
> A couple of nits, but otherwise LGTM.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

>
> > ---
> >  Documentation/filesystems/mmap_prepare.rst |  3 +
> >  include/linux/mm.h                         | 24 +++++-
> >  include/linux/mm_types.h                   |  6 +-
> >  mm/internal.h                              |  2 +
> >  mm/memory.c                                | 87 +++++++++++++++-------
> >  mm/util.c                                  | 12 +++
> >  tools/testing/vma/include/dup.h            |  6 +-
> >  7 files changed, 112 insertions(+), 28 deletions(-)
> >
> > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > index 20db474915da..be76ae475b9c 100644
> > --- a/Documentation/filesystems/mmap_prepare.rst
> > +++ b/Documentation/filesystems/mmap_prepare.rst
> > @@ -153,5 +153,8 @@ pointer. These are:
> >  * mmap_action_ioremap_full() - Same as mmap_action_ioremap(), only remaps
> >    the entire mapping from ``start_pfn`` onward.
> >
> > +* mmap_action_simple_ioremap() - Sets up an I/O remap from a specified
> > +  physical address and over a specified length.
> > +
> >  **NOTE:** The ``action`` field should never normally be manipulated directly,
> >  rather you ought to use one of these helpers.
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ad1b8c3c0cfd..df8fa6e6402b 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4337,11 +4337,33 @@ static inline void mmap_action_ioremap(struct vm_area_desc *desc,
> >   * @start_pfn: The first PFN in the range to remap.
> >   */
> >  static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
> > -                                         unsigned long start_pfn)
> > +                                           unsigned long start_pfn)
> >  {
> >         mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
> >  }
> >
> > +/**
> > + * mmap_action_simple_ioremap - helper for mmap_prepare hook to specify that the
> > + * physical range in [start_phys_addr, start_phys_addr + size) should be I/O
> > + * remapped.
> > + * @desc: The VMA descriptor for the VMA requiring remap.
> > + * @start_phys_addr: Start of the physical memory to be mapped.
> > + * @size: Size of the area to map.
> > + *
> > + * NOTE: Some drivers might want to tweak desc->page_prot for purposes of
> > + * write-combine or similar.
> > + */
> > +static inline void mmap_action_simple_ioremap(struct vm_area_desc *desc,
> > +                                             phys_addr_t start_phys_addr,
> > +                                             unsigned long size)
> > +{
> > +       struct mmap_action *action = &desc->action;
> > +
> > +       action->simple_ioremap.start_phys_addr = start_phys_addr;
> > +       action->simple_ioremap.size = size;
> > +       action->type = MMAP_SIMPLE_IO_REMAP;
> > +}
> > +
> >  int mmap_action_prepare(struct vm_area_desc *desc);
> >  int mmap_action_complete(struct vm_area_struct *vma,
> >                          struct mmap_action *action);
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4a229cc0a06b..50685cf29792 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -814,6 +814,7 @@ enum mmap_action_type {
> >         MMAP_NOTHING,           /* Mapping is complete, no further action. */
> >         MMAP_REMAP_PFN,         /* Remap PFN range. */
> >         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
> > +       MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> >  };
> >
> >  /*
> > @@ -822,13 +823,16 @@ enum mmap_action_type {
> >   */
> >  struct mmap_action {
> >         union {
> > -               /* Remap range. */
> >                 struct {
> >                         unsigned long start;
> >                         unsigned long start_pfn;
> >                         unsigned long size;
> >                         pgprot_t pgprot;
> >                 } remap;
> > +               struct {
> > +                       phys_addr_t start_phys_addr;
> > +                       unsigned long size;
> > +               } simple_ioremap;
> >         };
> >         enum mmap_action_type type;
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index f5774892071e..0eaca2f0eb6a 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1804,6 +1804,8 @@ int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> >  int remap_pfn_range_prepare(struct vm_area_desc *desc);
> >  int remap_pfn_range_complete(struct vm_area_struct *vma,
> >                              struct mmap_action *action);
> > +int simple_ioremap_prepare(struct vm_area_desc *desc);
> > +/* No simple_ioremap_complete, is ultimately handled by remap complete. */
> >
> >  static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc)
> >  {
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9dec67a18116..f3f4046aee97 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3170,6 +3170,59 @@ int remap_pfn_range_complete(struct vm_area_struct *vma,
> >         return do_remap_pfn_range(vma, start, pfn, size, prot);
> >  }
> >
> > +static int __simple_ioremap_prep(unsigned long vm_start, unsigned long vm_end,
>
> nit: vm_start and vm_end are used only to calculate vm_len. You could
> reduce the number of arguments by just passing vm_len.

Ack will fixup!

>
> > +                                pgoff_t vm_pgoff, phys_addr_t start_phys,
> > +                                unsigned long size, unsigned long *pfnp)
> > +{
> > +       const unsigned long vm_len = vm_end - vm_start;
> > +       unsigned long pfn, pages;
> > +
> > +       /* Check that the physical memory area passed in looks valid */
> > +       if (start_phys + size < start_phys)
> > +               return -EINVAL;
> > +       /*
> > +        * You *really* shouldn't map things that aren't page-aligned,
> > +        * but we've historically allowed it because IO memory might
> > +        * just have smaller alignment.
> > +        */
> > +       size += start_phys & ~PAGE_MASK;
> > +       pfn = start_phys >> PAGE_SHIFT;
> > +       pages = (size + ~PAGE_MASK) >> PAGE_SHIFT;
> > +       if (pfn + pages < pfn)
> > +               return -EINVAL;
> > +
> > +       /* We start the mapping 'vm_pgoff' pages into the area */
> > +       if (vm_pgoff > pages)
> > +               return -EINVAL;
> > +       pfn += vm_pgoff;
> > +       pages -= vm_pgoff;
> > +
> > +       /* Can we fit all of the mapping? */
> > +       if ((vm_len >> PAGE_SHIFT) > pages)
> > +               return -EINVAL;
> > +
> > +       *pfnp = pfn;
> > +       return 0;
> > +}
> > +
> > +int simple_ioremap_prepare(struct vm_area_desc *desc)
> > +{
> > +       struct mmap_action *action = &desc->action;
> > +       const phys_addr_t start = action->simple_ioremap.start_phys_addr;
> > +       const unsigned long size = action->simple_ioremap.size;
> > +       unsigned long pfn;
> > +       int err;
> > +
> > +       err = __simple_ioremap_prep(desc->start, desc->end, desc->pgoff,
> > +                                   start, size, &pfn);
> > +       if (err)
> > +               return err;
> > +
> > +       /* The I/O remap logic does the heavy lifting. */
> > +       mmap_action_ioremap(desc, desc->start, pfn, vma_desc_size(desc));
>
> nit: Looks like a perfect opportunity to use mmap_action_ioremap_full() here.

Yeah can do!

>
> > +       return mmap_action_prepare(desc);
>
> Ok, so IIUC this uses recursion:
> mmap_action_prepare(MMAP_SIMPLE_IO_REMAP) -> simple_ioremap_prepare()
> -> mmap_action_prepare(MMAP_IO_REMAP_PFN).

Yep, it's one level, I think that should be ok? :)

>
> > +}
> > +
> >  /**
> >   * vm_iomap_memory - remap memory to userspace
> >   * @vma: user vma to map to
> > @@ -3187,32 +3240,16 @@ int remap_pfn_range_complete(struct vm_area_struct *vma,
> >   */
> >  int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len)
> >  {
> > -       unsigned long vm_len, pfn, pages;
> > -
> > -       /* Check that the physical memory area passed in looks valid */
> > -       if (start + len < start)
> > -               return -EINVAL;
> > -       /*
> > -        * You *really* shouldn't map things that aren't page-aligned,
> > -        * but we've historically allowed it because IO memory might
> > -        * just have smaller alignment.
> > -        */
> > -       len += start & ~PAGE_MASK;
> > -       pfn = start >> PAGE_SHIFT;
> > -       pages = (len + ~PAGE_MASK) >> PAGE_SHIFT;
> > -       if (pfn + pages < pfn)
> > -               return -EINVAL;
> > -
> > -       /* We start the mapping 'vm_pgoff' pages into the area */
> > -       if (vma->vm_pgoff > pages)
> > -               return -EINVAL;
> > -       pfn += vma->vm_pgoff;
> > -       pages -= vma->vm_pgoff;
> > +       const unsigned long vm_start = vma->vm_start;
> > +       const unsigned long vm_end = vma->vm_end;
> > +       const unsigned long vm_len = vm_end - vm_start;
> > +       unsigned long pfn;
> > +       int err;
> >
> > -       /* Can we fit all of the mapping? */
> > -       vm_len = vma->vm_end - vma->vm_start;
> > -       if (vm_len >> PAGE_SHIFT > pages)
> > -               return -EINVAL;
> > +       err = __simple_ioremap_prep(vm_start, vm_end, vma->vm_pgoff, start,
> > +                                   len, &pfn);
> > +       if (err)
> > +               return err;
> >
> >         /* Ok, let it rip */
> >         return io_remap_pfn_range(vma, vma->vm_start, pfn, vm_len, vma->vm_page_prot);
> > diff --git a/mm/util.c b/mm/util.c
> > index cdfba09e50d7..aa92e471afe1 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -1390,6 +1390,8 @@ int mmap_action_prepare(struct vm_area_desc *desc)
> >                 return remap_pfn_range_prepare(desc);
> >         case MMAP_IO_REMAP_PFN:
> >                 return io_remap_pfn_range_prepare(desc);
> > +       case MMAP_SIMPLE_IO_REMAP:
> > +               return simple_ioremap_prepare(desc);
> >         }
> >
> >         WARN_ON_ONCE(1);
> > @@ -1421,6 +1423,14 @@ int mmap_action_complete(struct vm_area_struct *vma,
> >         case MMAP_IO_REMAP_PFN:
> >                 err = io_remap_pfn_range_complete(vma, action);
> >                 break;
> > +       case MMAP_SIMPLE_IO_REMAP:
> > +               /*
> > +                * The simple I/O remap should have been delegated to an I/O
> > +                * remap.
> > +                */
> > +               WARN_ON_ONCE(1);
> > +               err = -EINVAL;
> > +               break;
> >         }
> >
> >         return mmap_action_finish(vma, action, err);
> > @@ -1434,6 +1444,7 @@ int mmap_action_prepare(struct vm_area_desc *desc)
> >                 break;
> >         case MMAP_REMAP_PFN:
> >         case MMAP_IO_REMAP_PFN:
> > +       case MMAP_SIMPLE_IO_REMAP:
> >                 WARN_ON_ONCE(1); /* nommu cannot handle these. */
> >                 break;
> >         }
> > @@ -1452,6 +1463,7 @@ int mmap_action_complete(struct vm_area_struct *vma,
> >                 break;
> >         case MMAP_REMAP_PFN:
> >         case MMAP_IO_REMAP_PFN:
> > +       case MMAP_SIMPLE_IO_REMAP:
> >                 WARN_ON_ONCE(1); /* nommu cannot handle this. */
> >
> >                 err = -EINVAL;
> > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > index 4570ec77f153..114daaef4f73 100644
> > --- a/tools/testing/vma/include/dup.h
> > +++ b/tools/testing/vma/include/dup.h
> > @@ -453,6 +453,7 @@ enum mmap_action_type {
> >         MMAP_NOTHING,           /* Mapping is complete, no further action. */
> >         MMAP_REMAP_PFN,         /* Remap PFN range. */
> >         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
> > +       MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> >  };
> >
> >  /*
> > @@ -461,13 +462,16 @@ enum mmap_action_type {
> >   */
> >  struct mmap_action {
> >         union {
> > -               /* Remap range. */
> >                 struct {
> >                         unsigned long start;
> >                         unsigned long start_pfn;
> >                         unsigned long size;
> >                         pgprot_t pgprot;
> >                 } remap;
> > +               struct {
> > +                       phys_addr_t start;
> > +                       unsigned long len;
> > +               } simple_ioremap;
> >         };
> >         enum mmap_action_type type;
> >
> > --
> > 2.53.0
> >

Cheers, Lorenzo
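
For readers following the range checks in __simple_ioremap_prep(), here is a userspace sketch of the same arithmetic. It is an approximation under assumptions: the SK_* macros hard-code a 4 KiB page size, uint64_t stands in for phys_addr_t, and simple_ioremap_check() is a hypothetical name. It takes vm_len directly, along the lines of the review suggestion above.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed 4 KiB pages for this sketch only. */
#define SK_PAGE_SHIFT	12
#define SK_PAGE_SIZE	(1UL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK	(~(SK_PAGE_SIZE - 1))

static int simple_ioremap_check(unsigned long vm_len, unsigned long vm_pgoff,
				uint64_t start_phys, unsigned long size,
				unsigned long *pfnp)
{
	unsigned long pfn, pages;

	/* Reject a physical range that wraps around. */
	if (start_phys + size < start_phys)
		return -EINVAL;

	/* Extend the size to cover an unaligned start, then round up. */
	size += start_phys & ~SK_PAGE_MASK;
	pfn = start_phys >> SK_PAGE_SHIFT;
	pages = (size + ~SK_PAGE_MASK) >> SK_PAGE_SHIFT;
	if (pfn + pages < pfn)
		return -EINVAL;

	/* The mapping starts vm_pgoff pages into the area. */
	if (vm_pgoff > pages)
		return -EINVAL;
	pfn += vm_pgoff;
	pages -= vm_pgoff;

	/* The whole VMA must fit in what is left. */
	if ((vm_len >> SK_PAGE_SHIFT) > pages)
		return -EINVAL;

	*pfnp = pfn;
	return 0;
}
```

As a worked case, a 0x2000-byte region starting at physical address 0x10000800 spans three pages once the unaligned start is included; with a page offset of 1, two pages remain, so a two-page VMA is accepted and a three-page VMA is rejected.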


* Re: [PATCH v2 12/16] mm: allow handling of stacked mmap_prepare hooks in more drivers
From: Joshua Hahn @ 2026-03-18 21:08 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Clemens Ladisch, Arnd Bergmann, Greg Kroah-Hartman,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Alexander Shishkin, Maxime Coquelin, Alexandre Torgue,
	Miquel Raynal, Richard Weinberger, Vignesh Raghavendra,
	Bodo Stroesser, Martin K . Petersen, David Howells, Marc Dionne,
	Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <72750af6906fd96fb6f18e83ac3e694cf357a2c1.1773695307.git.ljs@kernel.org>

On Mon, 16 Mar 2026 21:12:08 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:

> While the conversion of mmap hooks to mmap_prepare is underway, we will
> encounter situations where mmap hooks need to invoke nested mmap_prepare
> hooks.
> 
> The nesting of mmap hooks is termed 'stacking'.  In order to flexibly
> facilitate the conversion of custom mmap hooks in drivers which stack, we
> must split up the existing compat_vma_mapped() function into two separate
> functions:
> 
> * compat_set_desc_from_vma() - This allows the setting of a vm_area_desc
>   object's fields to the relevant fields of a VMA.

Hello Lorenzo, I hope you are doing well!

Thank you for this patch. I was developing on top of mm-new today and had
an error that I think was caused by this patch. I want to preface this by
saying that I am not at all familiar with this area of the code, so please
do forgive me if I've misinterpreted the crash and mistakenly pointed
at this commit :-)

Here is the crash:

[    1.083795] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[    1.083883] BUG: unable to handle page fault for address: ffa00000048efbb8
[    1.083957] #PF: supervisor instruction fetch in kernel mode
[    1.084030] #PF: error_code(0x0011) - permissions violation
[    1.084086] PGD 100000067 P4D 10035f067 PUD 100364067 PMD 441ed9067 PTE 80000004466a3163
[    1.084162] Oops: Oops: 0011 [#1] SMP
[    1.084218] CPU: 0 UID: 0 PID: 305 Comm: mkdir Tainted: G        W   E       7.0.0-rc4-virtme-00442-ge53de5a0302f-dirty #85 PREEMPTLAZY

As you can see, it's on a QEMU instance. I don't think this makes a difference
in the crash, though.

[    1.084321] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[    1.084369] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-5.el9 11/05/2023
[    1.084450] RIP: 0010:0xffa00000048efbb8
[    1.084489] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <40> 12 0e 00 01 00 11 ff d0 fa 8e 04 00 00 a0 ff 80 33 51 02 01 00
[    1.084642] RSP: 0018:ffa00000048ef998 EFLAGS: 00010286
[    1.084692] RAX: ffa00000048efbb8 RBX: ff11000102512cc0 RCX: 000000000000000d
[    1.084766] RDX: ffffffffa06247d0 RSI: ffa00000048efa18 RDI: ff11000102512cc0
[    1.084826] RBP: ffa00000048ef9c8 R08: 0000000000000000 R09: 0000000000000007
[    1.084889] R10: ff110001047d1f08 R11: 00007effdc3d0fff R12: ff110001047d3b00
[    1.084954] R13: ff11000446cae600 R14: ff110001024efe00 R15: ff11000102510a80
[    1.085021] FS:  0000000000000000(0000) GS:ff110004aae72000(0000) knlGS:0000000000000000
[    1.085083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.085136] CR2: ffa00000048efbb8 CR3: 0000000102667001 CR4: 0000000000771ef0
[    1.085201] PKRU: 55555554
[    1.085228] Call Trace:
[    1.085248]  <TASK>
[    1.085274]  ? __compat_vma_mmap+0x8e/0x130
[    1.085318]  ? compat_vma_mmap+0x76/0x80
[    1.085354]  ? mas_alloc_nodes+0xb2/0x110
[    1.085390]  ? backing_file_mmap+0xc3/0xf0
[    1.085426]  ? ovl_mmap+0x41/0x50
[    1.085463]  ? ovl_mmap+0x50/0x50
[    1.085499]  ? __mmap_region+0x7e8/0x1100
[    1.085539]  ? do_mmap+0x49f/0x5e0
[    1.085573]  ? vm_mmap_pgoff+0xef/0x1e0
[    1.085609]  ? ksys_mmap_pgoff+0x15c/0x1f0
[    1.085647]  ? do_syscall_64+0xab/0x980
[    1.085684]  ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
[    1.085730]  </TASK>
[    1.085770] Modules linked in: virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
[    1.085838] CR2: ffa00000048efbb8
[    1.085874] ---[ end trace 0000000000000000 ]---
[    1.085875] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[    1.085918] RIP: 0010:0xffa00000048efbb8
[    1.085921] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <40> 12 0e 00 01 00 11 ff d0 fa 8e 04 00 00 a0 ff 80 33 51 02 01 00
[    1.085988] BUG: unable to handle page fault for address: ffa00000048f7bb8
[    1.086026] RSP: 0018:ffa00000048ef998 EFLAGS: 00010286
[    1.086166] #PF: supervisor instruction fetch in kernel mode
[    1.086221]
[    1.086267] #PF: error_code(0x0011) - permissions violation
[    1.086321] RAX: ffa00000048efbb8 RBX: ff11000102512cc0 RCX: 000000000000000d
[    1.086348] PGD 100000067
[    1.086394] RDX: ffffffffa06247d0 RSI: ffa00000048efa18 RDI: ff11000102512cc0
[    1.086459] P4D 10035f067
[    1.086486] RBP: ffa00000048ef9c8 R08: 0000000000000000 R09: 0000000000000007
[    1.086550] PUD 100364067
[    1.086577] R10: ff110001047d1f08 R11: 00007effdc3d0fff R12: ff110001047d3b00
[    1.086641] PMD 441ed9067
[    1.086668] R13: ff11000446cae600 R14: ff110001024efe00 R15: ff11000102510a80
[    1.086731] PTE 80000004433d3163
[    1.086764] FS:  0000000000000000(0000) GS:ff110004aae72000(0000) knlGS:0000000000000000
[    1.086829]
[    1.086868] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.086931] Oops: Oops: 0011 [#2] SMP
[    1.086958] CR2: ffa00000048efbb8 CR3: 0000000102667001 CR4: 0000000000771ef0
[    1.087015] CPU: 29 UID: 0 PID: 306 Comm: mount Tainted: G      D W   E       7.0.0-rc4-virtme-00442-ge53de5a0302f-dirty #85 PREEMPTLAZY
[    1.087050] PKRU: 55555554
[    1.087115] Tainted: [D]=DIE, [W]=WARN, [E]=UNSIGNED_MODULE
[    1.087207] Kernel panic - not syncing: Fatal exception
[    2.158392] Shutting down cpus with NMI
[    2.158629] Kernel Offset: disabled
[    2.158668] ---[ end Kernel panic - not syncing: Fatal exception ]---

It crashes at compat_vma_mmap, and here is what I think could be the 
potential crash path:

- compat_vma_mmap() creates struct vm_area_desc desc;
  - compat_set_desc_from_vma() doesn't initialize the struct, but instead
    modifies independent fields. I think this is where the behavior
    diverges, since before we would use the C initializer and uninitialized
    variables would be set to 0 (including omitted ones, like
    action.success_hook or action.error_hook). But action.type is set to
    MMAP_NOTHING.
  - desc.action.success_hook remains uninitialized in vfs_mmap_prepare
  - mmap_action_complete()
    - Here, we've set action.type to be MMAP_NOTHING, so we have err = 0
    - mmap_action_finish(action, vma, 0)
      - And here, since err == 0, we check action->success_hook (which has
        garbage, therefore it's nonzero) and call action->success_hook(vma)

And I think action->success_hook(vma) where success_hook is uninitialized
stack garbage gets me to where I am.

Again, I'm not too familiar with this area of the kernel, this is just
based on the quick digging that I did. And apologies again if I'm missing
something ;-) I do think that the uninitialized members could be a problem
though.

Thank you, I hope you have a great day Lorenzo!
Joshua
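
The suspected failure mode above (a hook pointer left holding stack garbage) and the usual remedy can be illustrated with a userspace sketch. Everything here is hypothetical and simplified: the sk_* types and functions are stand-ins, not the kernel structures. The sketch only shows that zero-initializing the descriptor before a field-by-field setter runs leaves the omitted hook pointers as NULL, so the finish step skips them.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures discussed above. */
enum sk_action_type { SK_NOTHING = 0, SK_REMAP };

struct sk_action {
	enum sk_action_type type;
	int (*success_hook)(void *vma);
	void (*error_hook)(int err);
};

struct sk_desc {
	unsigned long start, end;
	struct sk_action action;
};

/* Sets only selected fields, as a field-by-field setter would. */
static void sk_set_desc(struct sk_desc *desc, unsigned long start,
			unsigned long end)
{
	desc->start = start;
	desc->end = end;
	desc->action.type = SK_NOTHING;
}

/* Calls the success hook only if one was installed. */
static int sk_finish(struct sk_desc *desc, void *vma)
{
	if (desc->action.success_hook)
		return desc->action.success_hook(vma);
	return 0;
}

/* Demonstration hook, used only to show that an installed hook runs. */
static int sk_demo_hook(void *vma)
{
	(void)vma;
	return 7;
}

static int sk_mmap(unsigned long start, unsigned long end)
{
	/* The initializer zeroes every member, including the hook pointers. */
	struct sk_desc desc = { 0 };

	sk_set_desc(&desc, start, end);
	return sk_finish(&desc, NULL);
}
```

Without the initializer in sk_mmap(), the hook pointers would be indeterminate and the NULL check in sk_finish() would not be reliable, which matches the crash path hypothesised in the message.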


* Re: [PATCH net-next v3] net: mana: Expose hardware diagnostic info via debugfs
From: Jakub Kicinski @ 2026-03-19  2:36 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, dipayanroy,
	yury.norov, kees, ssengar, gargaditya, shirazsaleem, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <20260316112339.1208155-1-ernis@linux.microsoft.com>

On Mon, 16 Mar 2026 04:23:27 -0700 Erni Sri Satya Vennela wrote:
> Add debugfs entries to expose hardware configuration and diagnostic
> information that aids in debugging driver initialization and runtime
> operations without adding noise to dmesg.
> 
> The debugfs directory creation and removal for each PCI device is
> integrated into mana_gd_setup() and mana_gd_cleanup_device()
> respectively, so that all callers (probe, remove, suspend, resume,
> shutdown) share a single code path.

Does not apply:

Failed to apply patch:
Applying: net: mana: Expose hardware diagnostic info via debugfs
Using index info to reconstruct a base tree...
M	drivers/net/ethernet/microsoft/mana/gdma_main.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/net/ethernet/microsoft/mana/gdma_main.c
CONFLICT (content): Merge conflict in drivers/net/ethernet/microsoft/mana/gdma_main.c
Recorded preimage for 'drivers/net/ethernet/microsoft/mana/gdma_main.c'
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 net: mana: Expose hardware diagnostic info via debugfs
-- 
pw-bot: cr


* Re: [PATCH net-next v6 0/3] add ethtool COALESCE_RX_CQE_FRAMES/NSECS and use it in MANA driver
From: patchwork-bot+netdevbpf @ 2026-03-19  3:10 UTC (permalink / raw)
  To: Haiyang Zhang; +Cc: linux-hyperv, netdev, haiyangz, paulros
In-Reply-To: <20260317191826.1346111-1-haiyangz@linux.microsoft.com>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 17 Mar 2026 12:18:04 -0700 you wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Add two parameters for drivers supporting Rx CQE Coalescing.
> 
> ETHTOOL_A_COALESCE_RX_CQE_FRAMES:
> Maximum number of frames that can be coalesced into a CQE or
> writeback.
> 
> [...]

Here is the summary with links:
  - [net-next,v6,1/3] net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS
    https://git.kernel.org/netdev/net-next/c/dc3d720e12f6
  - [net-next,v6,2/3] net: mana: Add support for RX CQE Coalescing
    https://git.kernel.org/netdev/net-next/c/c2fe3ff3d66d
  - [net-next,v6,3/3] net: mana: Add ethtool counters for RX CQEs in coalesced type
    https://git.kernel.org/netdev/net-next/c/d01440e10a82

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* RE: RE: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Michael Kelley @ 2026-03-19  3:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka
In-Reply-To: <20260318101042.-QHDXjlS@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Wednesday, March 18, 2026 3:11 AM
> 
> On 2026-03-17 17:26:39 [+0000], Michael Kelley wrote:
> > > > Who is other one and does it have its add_interrupt_randomness() there
> > > > already?
> > >
> > > It's the arm64 path of the hv support. Regarding the vmbus IRQ, it seems
> > > to be fully handled here, without an equivalent of
> > > arch/x86/kernel/cpu/mshyperv.c.
> >
> > The arm64 path is the call to request_percpu_irq() in vmbus_bus_init().
> > That call is only made when running on arm64. See the code comment in
> > vmbus_bus_init().
> >
> > The specified interrupt handler is vmbus_percpu_isr(), which again runs
> > only on arm64. It calls vmbus_isr(), which starts the common path for both
> > x86/x64 and arm64.
> >
> > Then the slight weirdness is that the standard Linux IRQ handling for
> > per-CPU IRQs on arm64 with a GICv3 (which is what Hyper-V emulates)
> > does *not* call add_interrupt_randomness().  The function
> > gic_irq_domain_map() sets the IRQ handler for PPI range to
> > handle_percpu_devid_irq(), and that function does not do
> > add_interrupt_randomness().  The other variant, handle_percpu_irq(),
> > calls handle_irq_event_percpu(), which *does* do the
> > add_interrupt_randomness().
> 
> So despite all that, the generic code on arm64 does not do it? Then don't
> work around this in your driver. Either talk to the IRQ maintainer and
> suggest adding it there so everyone benefits from it, or don't because there
> might be a reason to avoid it. Having it in driver code is wrong.

FWIW, I've researched the history of handle_percpu_devid_irq(). It dates
back to 2011, which is probably when the ARM architecture first introduced
per-CPU interrupts. At that time, it did not do add_interrupt_randomness().
An RFC patch set [1] in 2017 proposed adding the interrupt randomness
as a side effect of other changes, but that patch set evidently did not
move forward. There was no mailing list discussion of the interrupt
randomness aspect of the patch set.

I'll raise the topic with ARM maintainers and IRQ subsystem maintainers
to see if there's any reason one way or the other. I would not be surprised
if adding interrupt randomness is intentionally excluded because these
per-CPU interrupts were historically used for IPIs and timers only. What's
changed is that ARM64 is now used significantly in data centers, and
support for VMs is super important. The per-CPU interrupts are now used
for more than IPIs and timers, such as in the Hyper-V case, and
handle_percpu_devid_irq() was never reconsidered in that light. I would
expect a reluctance to burden the IPI and timer interrupt paths with the
overhead of add_interrupt_randomness(). But the Hyper-V VMBus case
still needs it because that's the primary source of interrupt entropy in the
VM. There aren't necessarily other devices generating non-per-CPU interrupts
like in a physical machine. To me, it is perfectly valid for the IPI & timer
interrupt paths to want to skip interrupt randomness, while the
Hyper-V VMBus interrupt path needs it, and we will be back where we
are now.

Related, *not* doing add_interrupt_randomness() on the ARM64 Hyper-V
synthetic timer path is consistent with the ARM64 architectural timer, since
it also uses handle_percpu_devid_irq(). I did the original work to get the
Hyper-V synthetic timers working on ARM64 back in 2019 (?), but I don't
recall if that consistency with the ARM64 architectural timer was
intentional or accidental.

Again, I'll raise this with the appropriate maintainers and see what the
feedback is.

Michael

[1] https://lore.kernel.org/lkml/20170907232542.20589-3-paul.burton@imgtec.com/

> 
> > So at this point, putting the add_interrupt_randomness() in
> > vmbus_isr() is needed to catch both architectures. If the lack of
> > add_interrupt_randomness() in handle_percpu_devid_irq() is a bug,
> > then that would be a cleaner way to handle this. But maybe there's
> > a reason behind the current behavior of handle_percpu_devid_irq()
> > that I'm unaware of.
> >
> > Michael
> 
> Sebastian

^ permalink raw reply

* RE: RE: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Michael Kelley @ 2026-03-19  3:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <20260318100138.GimjldpV@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Wednesday, March 18, 2026 3:02 AM
> 
> On 2026-03-17 17:25:20 [+0000], Michael Kelley wrote:
> > From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 12, 2026 10:07 AM
> > >
> > Let me try to address the range of questions here and in the follow-up
> > discussion. As background, an overview of VMBus interrupt handling is in:
> >
> > Documentation/virt/hyperv/vmbus.rst
> >
> > in the section entitled "Synthetic Interrupt Controller (synic)". The
> > relevant text is:
> >
> >    The SINT is mapped to a single per-CPU architectural interrupt (i.e,
> >    an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
> >    each CPU in the guest has a synic and may receive VMBus interrupts,
> >    they are best modeled in Linux as per-CPU interrupts. This model works
> >    well on arm64 where a single per-CPU Linux IRQ is allocated for
> >    VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
> >    "Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
> >    interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
> >    across all CPUs and explicitly coded to call vmbus_isr(). In this case,
> >    there's no Linux IRQ, and the interrupts are visible in aggregate in
> >    /proc/interrupts on the "HYP" line.
> >
> > The use of a statically allocated sysvec pre-dates my involvement in this
> > code starting in 2017, but I believe it was modelled after what Xen does,
> > and for the same reason -- to effectively create a per-CPU interrupt on
> > x86/x64. ACRN is also using HYPERVISOR_CALLBACK_VECTOR, but I
> > don't know if that is also to create a per-CPU interrupt.
> 
> If you create a vector, it becomes per-CPU. There is simply no mapping
> from HYPERVISOR_CALLBACK_VECTOR to request_percpu_irq(). But if we had
> this…

Indeed, yes, that would remove the need for all the per-CPU interrupt hackery
on x86/x64. I don't have any objection to someone pursuing that path, but it's
not something I can do. Full disclosure:  You'll see my name on Hyper-V and
VMBus stuff in the Linux kernel, with Microsoft as my employer. But I retired
from Microsoft 2.5 years ago, and my current involvement in Linux kernel work
is purely as a very part-time volunteer. I also lack access to hardware and the
test machinery needed to make more significant changes, particularly if multiple
versions of Hyper-V must be tested.

> 
> …
> > > What clears this? This is wrongly placed. This should go to
> > > sysvec_hyperv_callback() instead with its matching canceling part. The
> > > add_interrupt_randomness() should also be there and not here.
> > > sysvec_hyperv_stimer0() managed to do so.
> >
> > I don't have any knowledge to bring regarding the use of
> > lockdep_hardirq_threaded().
> 
> It is used in IRQ core to mark the execution of an interrupt handler
> which becomes threaded in a forced-threaded scenario. The goal is to let
> lockdep know that this piece of code on !RT will be threaded on RT and
> therefore there is no need to report a possible locking problem that
> will not exist on RT.
> 
> > > Different question: What guarantees that there won't be another
> > > interrupt before this one is done? The handshake appears to be
> > > deprecated. The interrupt itself returns ACKing (or not) but the actual
> > > handler is delayed to this thread. Depending on the userland it could
> > > take some time and I don't know how impatient the host is.
> >
> > In more recent versions of Hyper-V, what's deprecated is Hyper-V implicitly
> > and automatically doing the EOI. So in sysvec_hyperv_callback(), apic_eoi()
> > is usually explicitly called to ack the interrupt.
> >
> > There's no guarantee, in either the existing case or the new PREEMPT_RT
> > case, that another VMBus interrupt won't come in on the same CPU
> > before the tasklets scheduled by vmbus_message_sched() or
> > vmbus_chan_sched() have run. From a functional standpoint, the Linux
> > code and interaction with Hyper-V handles another interrupt correctly.
> 
> So there is no scenario that the host will trigger interrupts because
> the guest is leaving the ISR without doing anything/ making progress?
> 
> > From a delay standpoint, there's not a problem for the normal (i.e., not
> > PREEMPT_RT) case because the tasklets run as the interrupt exits -- they
> > don't end up in ksoftirqd. For the PREEMPT_RT case, I can see your point
> > about delays since the tasklets are scheduled from the new per-CPU thread.
> > But my understanding is that Jan's motivation for these changes is not to
> > achieve true RT behavior, since Hyper-V doesn't provide that anyway.
> > The goal is simply to make PREEMPT_RT builds functional, though Jan may
> > have further comments on the goal.
> 
> I would be worried if the host were storming the guest with interrupts
> because it makes no progress.

No, that kind of storming won't happen. The Hyper-V host<->guest
interface is based on message queues. The host interrupts the guest
if it puts a message in the queue that transitions the queue from
"empty" to "not empty". Eventually the tasklet enabled in vmbus_isr()
and its subsidiaries gets around to emptying the queue, which effectively
re-arms the interrupt. The host may add more messages to the queue,
but it doesn't interrupt again for that queue until the queue is empty.
If the guest is delayed in doing that emptying, nothing bad happens.

There could be multiple queues that interrupt the same vCPU in the
guest, so there might be another interrupt to the same vCPU due to
a different queue, but that could happen regardless of the latency in
emptying a queue. And the number of queues assigned to a vCPU
is at most a small integer.
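
A toy user-space model of that signalling rule may make it easier to
follow. This is only a sketch: struct toy_queue, host_post(),
guest_drain() and QUEUE_CAP are made-up names for illustration and are
not Hyper-V or Linux kernel interfaces. It shows that the interrupt count
only grows on an "empty" to "not empty" transition, no matter how long the
guest takes to drain.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAP 8	/* hypothetical capacity, chosen arbitrarily */

/* Hypothetical model of one host->guest message queue. */
struct toy_queue {
	int pending;	/* messages currently sitting in the queue */
	int interrupts;	/* interrupts the modelled host has raised */
};

/*
 * Host side: add one message. An interrupt is raised only when the
 * queue goes from empty to not empty.
 */
static bool host_post(struct toy_queue *q)
{
	if (q->pending == QUEUE_CAP)
		return false;	/* queue full, nothing posted */
	if (q->pending++ == 0)
		q->interrupts++;
	return true;
}

/*
 * Guest side: empty the queue. In this model, draining is what
 * re-arms the interrupt for the next post.
 */
static int guest_drain(struct toy_queue *q)
{
	int n = q->pending;

	q->pending = 0;
	return n;
}

/* Walk through the scenario described above. */
static void toy_selfcheck(void)
{
	struct toy_queue q = {0};

	/* Three posts while the guest is slow: still one interrupt. */
	host_post(&q);
	host_post(&q);
	host_post(&q);
	assert(q.interrupts == 1);

	/* The guest finally drains; the next post interrupts again. */
	assert(guest_drain(&q) == 3);
	host_post(&q);
	assert(q.interrupts == 2);
}
```

Calling toy_selfcheck() from a small main() exercises the sequence. The
model deliberately ignores multiple queues per vCPU and any memory
ordering concerns.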

> 
> > > > +		__vmbus_isr();
> > > Moving on. This (trying very hard here) even schedules tasklets. Why?
> > > You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> > > You don't want that.
> >
> > Again, Jan can comment on the impact of delays due to ending up
> > in ksoftirqd.
> 
> My point is that having this with threaded interrupt support would
> eliminate the usage of tasklets.

Agreed, probably. For the non-RT case, the latency in getting to the
tasklet code *does* matter. I'm not familiar with how tasklets compare
to threaded interrupts on latency.

> 
> > > Couldn't the whole logic be integrated into the IRQ code? Then we could
> > > have mask/ unmask if supported/ provided and threaded interrupts. Then
> > > sysvec_hyperv_reenlightenment() could use a proper threaded interrupt
> > > instead of apic_eoi() + schedule_delayed_work().
> >
> > As I described above, Hyper-V needs a per-CPU interrupt. It's faked up
> > on x86/x64 with the hardcoded HYPERVISOR_CALLBACK_VECTOR sysvec
> > entry, but on arm64 a normal Linux per-CPU IRQ is used. Once the execution
> > path gets to vmbus_isr(), the two architectures share the same code. Same
> > thing is done with the Hyper-V STIMER0 interrupt as a per-CPU interrupt.
> 
> This one has the "random" collecting on the right spot.

Regarding the timer path, see my comment in the other email thread.

> 
> > If there's a better way to fake up a per-CPU interrupt on x86/x64, I'm open
> > to looking at it.
> >
> > As I recently discovered in discussion with Jan, standard Linux IRQ handling
> > will *not* thread per-CPU interrupts. So even on arm64, where a standard
> > Linux per-CPU IRQ is used for VMBus and STIMER0 interrupts, we can't
> > request threading.
> 
> It would require a statement from the x86 & IRQ maintainers on whether it is
> worthwhile on x86 to allow passing HYPERVISOR_CALLBACK_VECTOR to
> request_percpu_irq() and to have an IRQF_ flag saying that this one needs to
> be forced threaded. Otherwise we would need to remain with the workarounds.

Again, you or someone else is welcome to explore this topic.

> 
> If you say that an interrupt storm can not occur, I would prefer
> |static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
> |…
> |	lock_map_acquire_try(&vmbus_map);
> |	__vmbus_isr();
> |	lock_map_release(&vmbus_map);
> 
> while it has mostly the same effect.
> 
> Either way, that add_interrupt_randomness() should be moved to
> sysvec_hyperv_callback() like it has been done for
> sysvec_hyperv_stimer0(). It should be invoked twice now if it gets there
> via vmbus_percpu_isr().
> 
> > I need to refresh my memory on sysvec_hyperv_reenlightenment(). If
> > I recall correctly, it's not a per-CPU interrupt, so it probably doesn't
> > need to have a hardcoded vector. Overall, the Hyper-V reenlightenment
> > functionality is a bit of a fossil that isn't needed on modern x86/x64
> > processors that support TSC scaling. And it doesn't exist for arm64.
> > It might be worth seeing if it could be dropped entirely ...

I've refreshed my memory on the reenlightenment functionality, and
I think it has to stay. The functionality is used by KVM when it is running
in an L1 VM on an L0 Hyper-V host, and supporting its own L2 guest VMs.
I will check with Vitaly Kuznetsov, who originally added the reenlightenment
support for KVM, but I suspect it needs to stay for a few more years.

Old Hyper-V version support has been dropped in the past [1], but the
situation with reenlightenment is more than just the Hyper-V version.

Michael

[1] https://lore.kernel.org/all/1651509391-2058-2-git-send-email-mikelley@microsoft.com/

> >
> > Michael
> 
> Sebastian


^ permalink raw reply

* Re: [PATCH net-next v3] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-03-19  6:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, shradhagupta, dipayanroy,
	yury.norov, kees, ssengar, gargaditya, shirazsaleem, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <20260318193614.22328bc8@kernel.org>

On Wed, Mar 18, 2026 at 07:36:14PM -0700, Jakub Kicinski wrote:
> On Mon, 16 Mar 2026 04:23:27 -0700 Erni Sri Satya Vennela wrote:
> > Add debugfs entries to expose hardware configuration and diagnostic
> > information that aids in debugging driver initialization and runtime
> > operations without adding noise to dmesg.
> > 
> > The debugfs directory creation and removal for each PCI device is
> > integrated into mana_gd_setup() and mana_gd_cleanup_device()
> > respectively, so that all callers (probe, remove, suspend, resume,
> > shutdown) share a single code path.
> 
> Does not apply:
> 
> Failed to apply patch:
> Applying: net: mana: Expose hardware diagnostic info via debugfs
> Using index info to reconstruct a base tree...
> M	drivers/net/ethernet/microsoft/mana/gdma_main.c
> Falling back to patching base and 3-way merge...
> Auto-merging drivers/net/ethernet/microsoft/mana/gdma_main.c
> CONFLICT (content): Merge conflict in drivers/net/ethernet/microsoft/mana/gdma_main.c
> Recorded preimage for 'drivers/net/ethernet/microsoft/mana/gdma_main.c'
> error: Failed to merge in the changes.
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> hint: When you have resolved this problem, run "git am --continue".
> hint: If you prefer to skip this patch, run "git am --skip" instead.
> hint: To restore the original branch and stop patching, run "git am --abort".
> hint: Disable this message with "git config set advice.mergeConflict false"
> Patch failed at 0001 net: mana: Expose hardware diagnostic info via debugfs
> -- 
> pw-bot: cr

I will rebase and send the next version.
Thank you.

^ permalink raw reply

* [PATCH net-next v4] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-03-19  7:09 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	shirazsaleem, dipayanroy, yury.norov, kees, ernis, ssengar,
	gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma

Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.

The debugfs directory creation and removal for each PCI device is
integrated into mana_gd_setup() and mana_gd_cleanup_device()
respectively, so that all callers (probe, remove, suspend, resume,
shutdown) share a single code path.

Device-level entries (under /sys/kernel/debug/mana/<slot>/):
  - num_msix_usable, max_num_queues: Max resources from hardware
  - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
  - num_vports, bm_hostmode: Device configuration

Per-vPort entries (under /sys/kernel/debug/mana/<slot>/vportN/):
  - port_handle: Hardware vPort handle
  - max_sq, max_rq: Max queues from vPort config
  - indir_table_sz: Indirection table size
  - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
    Last applied steering configuration parameters

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
  because "ac" gets freed before debugfs_remove_recursive, to avoid
  Use-After-Free error.
* Add "goto out:" in mana_cfg_vport_steering to avoid populating apc
  values when resp.hdr.status is non-zero.
Changes in v2:
* Add debugfs_remove_recursive for gc->mana_pci_debugfs in
  mana_gd_suspend to handle multiple duplicates creation in
  mana_gd_setup and mana_gd_resume path.
* Move debugfs creation for num_vports and bm_hostmode out of
  if(!resuming) condition since we have to create it again even for
  resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 65 ++++++++++---------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 34 ++++++++++
 include/net/mana/gdma.h                       |  1 +
 include/net/mana/mana.h                       |  8 +++
 4 files changed, 78 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 2ba1fa3336f9..5028858febef 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -169,6 +169,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues > gc->num_msix_usable - 1)
 		gc->max_num_queues = gc->num_msix_usable - 1;
 
+	debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+			   &gc->num_msix_usable);
+	debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+			   &gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1239,6 +1244,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 		return err ? err : -EPROTO;
 	}
 	gc->pf_cap_flags1 = resp.pf_cap_flags1;
+	gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+	debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+			   &gc->gdma_protocol_ver);
+	debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+			   &gc->pf_cap_flags1);
+
 	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
 		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
 		if (err) {
@@ -1918,15 +1930,23 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	int err;
 
+	if (gc->is_pf)
+		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
+	else
+		gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
+							  mana_debugfs_root);
+
 	err = mana_gd_init_registers(pdev);
 	if (err)
-		return err;
+		goto remove_debugfs;
 
 	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
 
 	gc->service_wq = alloc_ordered_workqueue("gdma_service_wq", 0);
-	if (!gc->service_wq)
-		return -ENOMEM;
+	if (!gc->service_wq) {
+		err = -ENOMEM;
+		goto remove_debugfs;
+	}
 
 	err = mana_gd_setup_hwc_irqs(pdev);
 	if (err) {
@@ -1967,11 +1987,14 @@ static int mana_gd_setup(struct pci_dev *pdev)
 free_workqueue:
 	destroy_workqueue(gc->service_wq);
 	gc->service_wq = NULL;
+remove_debugfs:
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
 	dev_err(&pdev->dev, "%s failed (error %d)\n", __func__, err);
 	return err;
 }
 
-static void mana_gd_cleanup(struct pci_dev *pdev)
+static void mana_gd_cleanup_device(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 
@@ -1983,6 +2006,10 @@ static void mana_gd_cleanup(struct pci_dev *pdev)
 		destroy_workqueue(gc->service_wq);
 		gc->service_wq = NULL;
 	}
+
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
+
 	dev_dbg(&pdev->dev, "mana gdma cleanup successful\n");
 }
 
@@ -2040,12 +2067,6 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	if (gc->is_pf)
-		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
-	else
-		gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
-							  mana_debugfs_root);
-
 	err = mana_gd_setup(pdev);
 	if (err)
 		goto unmap_bar;
@@ -2074,16 +2095,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 cleanup_mana:
 	mana_remove(&gc->mana, false);
 cleanup_gd:
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 unmap_bar:
-	/*
-	 * at this point we know that the other debugfs child dir/files
-	 * are either not yet created or are already cleaned up.
-	 * The pci debugfs folder clean-up now, will only be cleaning up
-	 * adapter-MTU file and apc->mana_pci_debugfs folder.
-	 */
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-	gc->mana_pci_debugfs = NULL;
 	xa_destroy(&gc->irq_contexts);
 	pci_iounmap(pdev, bar0_va);
 free_gc:
@@ -2133,11 +2146,7 @@ static void mana_gd_remove(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, false);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	xa_destroy(&gc->irq_contexts);
 
@@ -2159,7 +2168,7 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 
 	return 0;
 }
@@ -2198,11 +2207,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..ae80e9fb66b7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1263,6 +1263,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 	apc->port_handle = resp.vport;
 	ether_addr_copy(apc->mac_addr, resp.mac_addr);
 
+	apc->vport_max_sq = *max_sq;
+	apc->vport_max_rq = *max_rq;
+
 	return 0;
 }
 
@@ -1417,6 +1420,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
+
+	apc->steer_rx = rx;
+	apc->steer_rss = apc->rss_state;
+	apc->steer_update_tab = update_tab;
+	apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
 out:
 	kfree(req);
 	return err;
@@ -3141,6 +3149,24 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+			   &apc->port_handle);
+	debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_sq);
+	debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_rq);
+	debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+			   &apc->indir_table_sz);
+	debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rx);
+	debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rss);
+	debugfs_create_u32("steer_update_tab", 0400, apc->mana_port_debugfs,
+			   &apc->steer_update_tab);
+	debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+			   &apc->steer_cqe_coalescing);
+
 	return 0;
 
 reset_apc:
@@ -3630,6 +3656,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	ac->bm_hostmode = bm_hostmode;
 
+	debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+			   &ac->num_ports);
+	debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+			  &ac->bm_hostmode);
+
 	if (!resuming) {
 		ac->num_ports = num_ports;
 
@@ -3770,6 +3801,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	mana_gd_deregister_device(gd);
 
+	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
+	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
+
 	if (suspending)
 		return;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 7fe3a1b61b2d..c4e3ce5147f7 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -442,6 +442,7 @@ struct gdma_context {
 	struct gdma_dev		mana_ib;
 
 	u64 pf_cap_flags1;
+	u64 gdma_protocol_ver;
 
 	struct workqueue_struct *service_wq;
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..1d4223a78010 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -568,6 +568,14 @@ struct mana_port_context {
 
 	/* Debugfs */
 	struct dentry *mana_port_debugfs;
+
+	/* Cached vport/steering config for debugfs */
+	u32 vport_max_sq;
+	u32 vport_max_rq;
+	u32 steer_rx;
+	u32 steer_rss;
+	u32 steer_update_tab;
+	u32 steer_cqe_coalescing;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
-- 
2.34.1


^ permalink raw reply related

* Re: RE: RE: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Sebastian Andrzej Siewior @ 2026-03-19  9:57 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka
In-Reply-To: <SN6PR02MB4157392A02838453BAFBA807D44FA@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-03-19 03:39:12 [+0000], Michael Kelley wrote:
> I'll raise the topic with ARM maintainers and IRQ subsystem maintainers
> to see if there's any reason one way or the other. I would not be surprised

Thank you.

> if adding interrupt randomness is intentionally excluded because these
> per-CPU interrupts were historically used for IPIs and timers only. What's
> changed is that ARM64 is now used significantly in data centers, and
> support for VMs is super important. The per-CPU interrupts are now used
> for more than IPIs and timers, such as in the Hyper-V case, and
> handle_percpu_devid_irq() was never reconsidered in that light. I would
> expect a reluctance to burden the IPI and timer interrupt paths with the
> overhead of add_interrupt_randomness(). But the Hyper-V VMBus case
> still needs it because that's the primary source of interrupt entropy in the
> VM. There aren't necessarily other devices generating non-per-CPU interrupts
> like in a physical machine. To me, it is perfectly valid for the IPI & timer
> interrupt paths to want to skip interrupt randomness, while the
> Hyper-V VMBus interrupt path needs it, and we will be back where we
> are now.

But if that is your concern, don't you have, or shouldn't you have, something
similar to virtio-rng where you can feed high quality random data to the
guest?

> Related, *not* doing add_interrupt_randomness() on the ARM64 Hyper-V
> synthetic timer path is consistent with the ARM64 architectural timer, since
> it also uses handle_percpu_devid_irq(). I did the original work to get the
> Hyper-V synthetic timers working on ARM64 back in 2019 (?), but I don't
> recall if that consistency with the ARM64 architectural timer was
> intentional or accidental.
> 
> Again, I'll raise this with the appropriate maintainers and see what the
> feedback is.

Again, thank you.

> Michael

Sebastian

^ permalink raw reply

* Re: RE: RE: [PATCH v3] drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-19 10:14 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86@kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Bezdeka, RT, Mitchell Levy,
	Saurabh Singh Sengar, Naman Jain
In-Reply-To: <SN6PR02MB4157EE20DCB90F955C12D2C0D44FA@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-03-19 03:43:12 [+0000], Michael Kelley wrote:
> Indeed, yes, that would remove the need for all the per-CPU interrupt hackery
> on x86/x64. I don't have any objection to someone pursuing that path, but it's
> not something I can do. Full disclosure:  You'll see my name on Hyper-V and
> VMBus stuff in the Linux kernel, with Microsoft as my employer. But I retired
> from Microsoft 2.5 years ago, and my current involvement in Linux kernel work
> is purely as a very part-time volunteer. I also lack access to hardware and the
> test machinery needed to make more significant changes, particularly if multiple
> versions of Hyper-V must be tested.

Right. Then I would only ask for better annotation instead of this current
thingy.

> > I would be worried if the host were storming the guest with interrupts
> > because it makes no progress.
> 
> No, that kind of storming won't happen. The Hyper-V host<->guest
> interface is based on message queues. The host interrupts the guest
> if it puts a message in the queue that transitions the queue from
> "empty" to "not empty". Eventually the tasklet enabled in vmbus_isr()
> and its subsidiaries gets around to emptying the queue, which effectively
> re-arms the interrupt. The host may add more messages to the queue,
> but it doesn't interrupt again for that queue until the queue is empty.
> If the guest is delayed in doing that emptying, nothing bad happens.

Okay.

> > > > Moving on. This (trying very hard here) even schedules tasklets. Why?
> > > > You need to disable BH before doing so. Otherwise it ends in ksoftirqd.
> > > > You don't want that.
> > >
> > > Again, Jan can comment on the impact of delays due to ending up
> > > in ksoftirqd.
> > 
> > My point is that having this with threaded interrupt support would
> > eliminate the usage of tasklets.
> 
> Agreed, probably. For the non-RT case, the latency in getting to the
> tasklet code *does* matter. I'm not familiar with how tasklets compare
> to threaded interrupts on latency.

There shouldn't be much difference on the level where it actually matters.

Sebastian
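
For illustration, the signalling rule Michael describes above (interrupt only
when a queue goes from "empty" to "not empty") can be modelled roughly as
below. This is a toy sketch under stated assumptions, not VMBus code: struct
msg_queue, host_enqueue() and guest_drain() are hypothetical names, and the
real interface uses shared ring buffers rather than a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one host->guest queue; not the real layout. */
struct msg_queue {
	size_t pending;		/* messages waiting for the guest */
	unsigned int irqs;	/* interrupts the "host" has raised so far */
};

/*
 * Host side: add one message. An interrupt is raised only when the queue
 * transitions from empty to non-empty. Returns true if it was raised.
 */
bool host_enqueue(struct msg_queue *q)
{
	bool was_empty;

	assert(q != NULL);
	was_empty = (q->pending == 0);
	q->pending++;
	if (was_empty)
		q->irqs++;
	return was_empty;
}

/*
 * Guest side: empty the queue, which effectively re-arms the interrupt.
 * Returns the number of messages handled.
 */
size_t guest_drain(struct msg_queue *q)
{
	size_t handled;

	assert(q != NULL);
	handled = q->pending;
	q->pending = 0;
	return handled;
}
```

In this model a slow guest only lets messages accumulate; no further
interrupts arrive for that queue until it has been drained.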

* Re: [PATCH v2 12/16] mm: allow handling of stacked mmap_prepare hooks in more drivers
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 12:52 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Clemens Ladisch, Arnd Bergmann, Greg Kroah-Hartman,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Alexander Shishkin, Maxime Coquelin, Alexandre Torgue,
	Miquel Raynal, Richard Weinberger, Vignesh Raghavendra,
	Bodo Stroesser, Martin K . Petersen, David Howells, Marc Dionne,
	Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <20260318210845.2591228-1-joshua.hahnjy@gmail.com>

On Wed, Mar 18, 2026 at 02:08:45PM -0700, Joshua Hahn wrote:
> On Mon, 16 Mar 2026 21:12:08 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
>
> > While the conversion of mmap hooks to mmap_prepare is underway, we will
> > encounter situations where mmap hooks need to invoke nested mmap_prepare
> > hooks.
> >
> > The nesting of mmap hooks is termed 'stacking'.  In order to flexibly
> > facilitate the conversion of custom mmap hooks in drivers which stack, we
> > must split up the existing compat_vma_mapped() function into two separate
> > functions:
> >
> > * compat_set_desc_from_vma() - This allows the setting of a vm_area_desc
> >   object's fields to the relevant fields of a VMA.
>
> Hello Lorenzo, I hope you are doing well!
>
> Thank you for this patch. I was developing on top of mm-new today and had
> an error that I think was caused by this patch. I want to preface this by
> saying that I am not at all familiar with this area of the code, so please
> do forgive me if I've misinterpreted the crash and mistakenly pointed
> at this commit : -)
>
> Here is the crash:
>
> [    1.083795] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [    1.083883] BUG: unable to handle page fault for address: ffa00000048efbb8
> [    1.083957] #PF: supervisor instruction fetch in kernel mode
> [    1.084030] #PF: error_code(0x0011) - permissions violation
> [    1.084086] PGD 100000067 P4D 10035f067 PUD 100364067 PMD 441ed9067 PTE 80000004466a3163
> [    1.084162] Oops: Oops: 0011 [#1] SMP
> [    1.084218] CPU: 0 UID: 0 PID: 305 Comm: mkdir Tainted: G        W   E       7.0.0-rc4-virtme-00442-ge53de5a0302f-dirty #85 PREEMPTLAZY
>
> As you can see, it's on a QEMU instance. I don't think this makes a difference
> in the crash, though.
>
> [    1.084321] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
> [    1.084369] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-5.el9 11/05/2023
> [    1.084450] RIP: 0010:0xffa00000048efbb8
> [    1.084489] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <40> 12 0e 00 01 00 11 ff d0 fa 8e 04 00 00 a0 ff 80 33 51 02 01 00
> [    1.084642] RSP: 0018:ffa00000048ef998 EFLAGS: 00010286
> [    1.084692] RAX: ffa00000048efbb8 RBX: ff11000102512cc0 RCX: 000000000000000d
> [    1.084766] RDX: ffffffffa06247d0 RSI: ffa00000048efa18 RDI: ff11000102512cc0
> [    1.084826] RBP: ffa00000048ef9c8 R08: 0000000000000000 R09: 0000000000000007
> [    1.084889] R10: ff110001047d1f08 R11: 00007effdc3d0fff R12: ff110001047d3b00
> [    1.084954] R13: ff11000446cae600 R14: ff110001024efe00 R15: ff11000102510a80
> [    1.085021] FS:  0000000000000000(0000) GS:ff110004aae72000(0000) knlGS:0000000000000000
> [    1.085083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.085136] CR2: ffa00000048efbb8 CR3: 0000000102667001 CR4: 0000000000771ef0
> [    1.085201] PKRU: 55555554
> [    1.085228] Call Trace:
> [    1.085248]  <TASK>
> [    1.085274]  ? __compat_vma_mmap+0x8e/0x130
> [    1.085318]  ? compat_vma_mmap+0x76/0x80
> [    1.085354]  ? mas_alloc_nodes+0xb2/0x110
> [    1.085390]  ? backing_file_mmap+0xc3/0xf0
> [    1.085426]  ? ovl_mmap+0x41/0x50
> [    1.085463]  ? ovl_mmap+0x50/0x50
> [    1.085499]  ? __mmap_region+0x7e8/0x1100
> [    1.085539]  ? do_mmap+0x49f/0x5e0
> [    1.085573]  ? vm_mmap_pgoff+0xef/0x1e0
> [    1.085609]  ? ksys_mmap_pgoff+0x15c/0x1f0
> [    1.085647]  ? do_syscall_64+0xab/0x980
> [    1.085684]  ? entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [    1.085730]  </TASK>
> [    1.085770] Modules linked in: virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
> [    1.085838] CR2: ffa00000048efbb8
> [    1.085874] ---[ end trace 0000000000000000 ]---
> [    1.085875] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [    1.085918] RIP: 0010:0xffa00000048efbb8
> [    1.085921] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <40> 12 0e 00 01 00 11 ff d0 fa 8e 04 00 00 a0 ff 80 33 51 02 01 00
> [    1.085988] BUG: unable to handle page fault for address: ffa00000048f7bb8
> [    1.086026] RSP: 0018:ffa00000048ef998 EFLAGS: 00010286
> [    1.086166] #PF: supervisor instruction fetch in kernel mode
> [    1.086221]
> [    1.086267] #PF: error_code(0x0011) - permissions violation
> [    1.086321] RAX: ffa00000048efbb8 RBX: ff11000102512cc0 RCX: 000000000000000d
> [    1.086348] PGD 100000067
> [    1.086394] RDX: ffffffffa06247d0 RSI: ffa00000048efa18 RDI: ff11000102512cc0
> [    1.086459] P4D 10035f067
> [    1.086486] RBP: ffa00000048ef9c8 R08: 0000000000000000 R09: 0000000000000007
> [    1.086550] PUD 100364067
> [    1.086577] R10: ff110001047d1f08 R11: 00007effdc3d0fff R12: ff110001047d3b00
> [    1.086641] PMD 441ed9067
> [    1.086668] R13: ff11000446cae600 R14: ff110001024efe00 R15: ff11000102510a80
> [    1.086731] PTE 80000004433d3163
> [    1.086764] FS:  0000000000000000(0000) GS:ff110004aae72000(0000) knlGS:0000000000000000
> [    1.086829]
> [    1.086868] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.086931] Oops: Oops: 0011 [#2] SMP
> [    1.086958] CR2: ffa00000048efbb8 CR3: 0000000102667001 CR4: 0000000000771ef0
> [    1.087015] CPU: 29 UID: 0 PID: 306 Comm: mount Tainted: G      D W   E       7.0.0-rc4-virtme-00442-ge53de5a0302f-dirty #85 PREEMPTLAZY
> [    1.087050] PKRU: 55555554
> [    1.087115] Tainted: [D]=DIE, [W]=WARN, [E]=UNSIGNED_MODULE
> [    1.087207] Kernel panic - not syncing: Fatal exception
> [    2.158392] Shutting down cpus with NMI
> [    2.158629] Kernel Offset: disabled
> [    2.158668] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> It crashes at compat_vma_mmap, and here is what I think could be the
> potential crash path:
>
> - compat_vma_mmap() creates struct vm_area_desc desc;
>   - compat_set_desc_from_vma() doesn't initialize the struct, but instead
>     modifies individual fields. I think this is where the behavior
>     diverges, since before we would use the C initializer and uninitialized

Ah yeah, you're right, I'll fix that up!

>     variables would be set to 0 (including omitted ones, like
>     action.success_hook or action.error_hook). But action.type = MMAP_NOTHING
>   - desc.action.success_hook remains uninitialized in vfs_mmap_prepare
>   - mmap_action_complete()
>     - Here, we've set action.type to be MMAP_NOTHING, so we have err = 0
>     - mmap_action_finish(action, vma, 0)
>       - And here, since err == 0, we check action->success_hook (which has
>         garbage, therefore it's nonzero) and call action->success_hook(vma)
>
> And I think action->success_hook(vma) where success_hook is uninitialized
> stack garbage gets me to where I am.
>
> Again, I'm not too familiar with this area of the kernel, this is just
> based on the quick digging that I did. And apologies again if I'm missing
> something ; -) I do think that the uninitialized members could be a problem
> though.
>
> Thank you, I hope you have a great day Lorenzo!
> Joshua

Thanks for the report and analysis, much appreciated, hope you have a great
day too :)

Cheers, Lorenzo
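
For readers following along, the failure mode Joshua describes can be
reproduced in plain C roughly as below. This is a userspace sketch under
stated assumptions, not the kernel code or the eventual fix: struct
demo_desc, struct demo_action and all function names are hypothetical
stand-ins, and stale_hook() simulates whatever happened to be on the stack so
that the example stays well-defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the descriptor and its action. */
struct demo_action {
	int type;			/* 0 plays the role of MMAP_NOTHING */
	int (*success_hook)(void *ctx);	/* optional; NULL means "no hook" */
};

struct demo_desc {
	unsigned long start, end;
	struct demo_action action;
};

/* Simulates stale stack contents left in the hook slot. */
int stale_hook(void *ctx)
{
	(void)ctx;
	return -1;
}

/* Field-by-field setup: members not assigned keep their old contents. */
void set_desc_fieldwise(struct demo_desc *desc,
			unsigned long start, unsigned long end)
{
	desc->start = start;
	desc->end = end;
	desc->action.type = 0;
}

/* Compound literal: members not named are zero-initialised. */
void set_desc_zeroed(struct demo_desc *desc,
		     unsigned long start, unsigned long end)
{
	*desc = (struct demo_desc){
		.start = start,
		.end = end,
		.action.type = 0,
	};
}

/* Mirrors the "call the success hook if it is non-NULL" step. */
int finish(const struct demo_desc *desc)
{
	assert(desc != NULL);
	if (desc->action.success_hook)
		return desc->action.success_hook(NULL);
	return 0;
}
```

With the field-by-field variant the stale hook survives and gets called; with
the compound literal the hook slot is NULL and finish() simply returns 0.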

* Re: [PATCH 12/61] quota: Prefer IS_ERR_OR_NULL over manual NULL check
From: Jan Kara @ 2026-03-19 14:13 UTC (permalink / raw)
  To: Philipp Hahn
  Cc: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
	gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
	linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
	linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
	linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
	linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
	linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
	linux-sctp, linux-security-module, linux-sh, linux-sound,
	linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
	netdev, ntfs3, samba-technical, sched-ext, target-devel,
	tipc-discussion, v9fs, Jan Kara
In-Reply-To: <20260310-b4-is_err_or_null-v1-12-bd63b656022d@avm.de>

On Tue 10-03-26 12:48:38, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
> 
> Change generated with coccinelle.
> 
> To: Jan Kara <jack@suse.com>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>

Thanks for the patch but frankly I find the original variant clearer wrt
what is going on. So I prefer to keep the code as is.

								Honza

> ---
>  fs/quota/quota.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 33bacd70758007129e0375bab44d7431195ec441..2e09fc247d0cf45b9e83a4f8a0be7ea694c8c2a1 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -965,7 +965,7 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
>  	else
>  		drop_super_exclusive(sb);
>  out:
> -	if (pathp && !IS_ERR(pathp))
> +	if (!IS_ERR_OR_NULL(pathp))
>  		path_put(pathp);
>  	return ret;
>  }
> 
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
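
As background for the two forms being compared above, here is a userspace
sketch showing that they accept and reject the same pointers. It assumes the
usual convention that the top 4095 address values encode negative errno
codes; the demo_* helpers and usable_*() functions are hypothetical
re-creations for illustration, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, in the style of ERR_PTR(). */
static inline void *demo_err_ptr(long err)
{
	assert(err < 0 && err >= -DEMO_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* True if the pointer value falls in the error range. */
static inline bool demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

/* True if the pointer is NULL or an encoded error. */
static inline bool demo_is_err_or_null(const void *ptr)
{
	return !ptr || demo_is_err(ptr);
}

/* Original form: explicit NULL check plus error check. */
static inline bool usable_manual(const void *ptr)
{
	return ptr && !demo_is_err(ptr);
}

/* Proposed form: single combined check. */
static inline bool usable_combined(const void *ptr)
{
	return !demo_is_err_or_null(ptr);
}
```

Both predicates are logically equivalent, so the choice between them is about
readability, which is the point of Jan's objection.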

* RE: RE: RE: [PATCH] Drivers: hv: vmbus: Move add_interrupt_randomness back to real interrupt
From: Michael Kelley @ 2026-03-19 14:19 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Michael Kelley
  Cc: Jan Kiszka, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, linux-hyperv@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka
In-Reply-To: <20260319095719.7cPrV_6A@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Thursday, March 19, 2026 2:57 AM
> 
> On 2026-03-19 03:39:12 [+0000], Michael Kelley wrote:
> > I'll raise the topic with ARM maintainers and IRQ subsystem maintainers
> > to see if there's any reason one way or the other. I would not be surprised
> 
> Thank you.
> 
> > if adding interrupt randomness is intentionally excluded because these
> > per-CPU interrupts were historically used for IPIs and timers only. What's
> > changed is that ARM64 is now used significantly in data centers, and
> > support for VMs is super important. The per-CPU interrupts are now used
> > for more than IPIs and timers, such as in the Hyper-V case, and
> > handle_percpu_devid_irq() was never reconsidered in that light. I would
> > expect a reluctance to burden the IPI and timer interrupt paths with the
> > overhead of add_interrupt_randomness(). But the Hyper-V VMBus case
> > still needs it because that's the primary source of interrupt entropy in the
> > VM. There aren't necessarily other devices generating non-per-CPU interrupts
> > like in a physical machine. To me, it is perfectly valid for the IPI & timer
> > interrupt paths to want to skip interrupt randomness, while the
> > Hyper-V VMBus interrupt path needs it, and we will be back where we
> > are now.
> 
> But if that is your concern, don't you have or should have something
> similar to virtio-rng where you can feed high quality random data to the
> guest?

Hyper-V provides a modest pool of entropy at VM boot time, in the form
of a vendor-specific ACPI table. It is consumed by the guest in the function
ms_hyperv_late_init() for the purpose of seeding the Linux random
number generator, and works on both x86/x64 and arm64.

But this is a one-shot operation at boot time. Hyper-V does not provide
guests with an ongoing source of entropy like virtio-rng, so the guest
must generate its own. And if the guest does a kexec(), the new kernel
doesn't even get to start with that ACPI table entropy.

Michael

> 
> > Related, *not* doing add_interrupt_randomness() on the ARM64 Hyper-V
> > synthetic timer path is consistent with the ARM64 architectural timer, since
> > it also uses handle_percpu_devid_irq(). I did the original work to get the
> > Hyper-V synthetic timers working on ARM64 back in 2019 (?), but I don't
> > recall if that consistency with the ARM64 architectural timer was
> > intentional or accidental.
> >
> > Again, I'll raise this with the appropriate maintainers and see what the
> > feedback is.
> 
> Again, thank you.
> 
> > Michael
> 
> Sebastian


* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 14:54 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpE5qZmi43EeZiRcy78pD6YvJb5n_xnoUJfwEjomowu0=A@mail.gmail.com>

On Tue, Mar 17, 2026 at 02:32:16PM -0700, Suren Baghdasaryan wrote:
> On Tue, Mar 17, 2026 at 2:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > The f_op->mmap interface is deprecated, so update driver to use its
> > > successor, mmap_prepare.
> > >
> > > The driver previously used vm_iomap_memory(), so this change replaces it
> > > with its mmap_prepare equivalent, mmap_action_simple_ioremap().
> > >
> > > Functions that wrap mmap() are also converted to wrap mmap_prepare()
> > > instead.
> > >
> > > Also update the documentation accordingly.
> > >
> > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > ---
> > >  Documentation/driver-api/vme.rst    |  2 +-
> > >  drivers/staging/vme_user/vme.c      | 20 +++++------
> > >  drivers/staging/vme_user/vme.h      |  2 +-
> > >  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
> > >  4 files changed, 42 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> > > index c0b475369de0..7111999abc14 100644
> > > --- a/Documentation/driver-api/vme.rst
> > > +++ b/Documentation/driver-api/vme.rst
> > > @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
> > >
> > >  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
> > >  do a read-modify-write transaction. Parts of a VME window can also be mapped
> > > -into user space memory using :c:func:`vme_master_mmap`.
> > > +into user space memory using :c:func:`vme_master_mmap_prepare`.
> > >
> > >
> > >  Slave windows
> > > diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> > > index f10a00c05f12..7220aba7b919 100644
> > > --- a/drivers/staging/vme_user/vme.c
> > > +++ b/drivers/staging/vme_user/vme.c
> > > @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
> > >  EXPORT_SYMBOL(vme_master_rmw);
> > >
> > >  /**
> > > - * vme_master_mmap - Mmap region of VME master window.
> > > + * vme_master_mmap_prepare - Mmap region of VME master window.
> > >   * @resource: Pointer to VME master resource.
> > > - * @vma: Pointer to definition of user mapping.
> > > + * @desc: Pointer to descriptor of user mapping.
> > >   *
> > >   * Memory map a region of the VME master window into user space.
> > >   *
> > > @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
> > >   *         resource or -EFAULT if map exceeds window size. Other generic mmap
> > >   *         errors may also be returned.
> > >   */
> > > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > > +int vme_master_mmap_prepare(struct vme_resource *resource,
> > > +                           struct vm_area_desc *desc)
> > >  {
> > > +       const unsigned long vma_size = vma_desc_size(desc);
> > >         struct vme_bridge *bridge = find_bridge(resource);
> > >         struct vme_master_resource *image;
> > >         phys_addr_t phys_addr;
> > > -       unsigned long vma_size;
> > >
> > >         if (resource->type != VME_MASTER) {
> > >                 dev_err(bridge->parent, "Not a master resource\n");
> > > @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > >         }
> > >
> > >         image = list_entry(resource->entry, struct vme_master_resource, list);
> > > -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> > > -       vma_size = vma->vm_end - vma->vm_start;
> > > +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
> > >
> > >         if (phys_addr + vma_size > image->bus_resource.end + 1) {
> > >                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
> > >                 return -EFAULT;
> > >         }
> > >
> > > -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > -
> > > -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> > > +       desc->page_prot = pgprot_noncached(desc->page_prot);
> > > +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> > > +       return 0;
> > >  }
> > > -EXPORT_SYMBOL(vme_master_mmap);
> > > +EXPORT_SYMBOL(vme_master_mmap_prepare);
> > >
> > >  /**
> > >   * vme_master_free - Free VME master window
> > > diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> > > index 797e9940fdd1..b6413605ea49 100644
> > > --- a/drivers/staging/vme_user/vme.h
> > > +++ b/drivers/staging/vme_user/vme.h
> > > @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
> > >  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
> > >  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
> > >                             unsigned int swap, loff_t offset);
> > > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> > > +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
> > >  void vme_master_free(struct vme_resource *resource);
> > >
> > >  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> > > diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> > > index d95dd7d9190a..11e25c2f6b0a 100644
> > > --- a/drivers/staging/vme_user/vme_user.c
> > > +++ b/drivers/staging/vme_user/vme_user.c
> > > @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
> > >         kfree(vma_priv);
> > >  }
> > >
> > > -static const struct vm_operations_struct vme_user_vm_ops = {
> > > -       .open = vme_user_vm_open,
> > > -       .close = vme_user_vm_close,
> > > -};
> > > -
> > > -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > > +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > +                             const struct file *file, void **vm_private_data)
> > >  {
> > > -       int err;
> > > +       const unsigned int minor = iminor(file_inode(file));
> > >         struct vme_user_vma_priv *vma_priv;
> > >
> > >         mutex_lock(&image[minor].mutex);
> > >
> > > -       err = vme_master_mmap(image[minor].resource, vma);
> > > -       if (err) {
> > > -               mutex_unlock(&image[minor].mutex);
> > > -               return err;
> > > -       }
> > > -
> >
> > Ok, this changes the set of the operations performed under image[minor].mutex.
> > Before we had:
> >
> > mutex_lock(&image[minor].mutex);
> > vme_master_mmap();
> > <some final adjustments>
> > mutex_unlock(&image[minor].mutex);
> >
> > Now we have:
> >
> > mutex_lock(&image[minor].mutex);
> > vme_master_mmap_prepare()
> > mutex_unlock(&image[minor].mutex);
> > vm_iomap_memory();
> > mutex_lock(&image[minor].mutex);
> > vme_user_vm_mapped(); // <some final adjustments>
> > mutex_unlock(&image[minor].mutex);
> >
> > I think as long as image[minor] does not change while we are not
> > holding the mutex we should be safe, and looking at the code it seems
> > to be the case. But I'm not familiar with this driver and might be
> > wrong. Worth double-checking.

The file is pinned for the duration, and the mutex is associated with the
file, so there's no sane world in which that could be problematic.

Keep in mind that we already manipulate state in vme_user_vm_close(), which
directly accesses image[minor] at an arbitrary time.

>
> A side note: if we had to hold the mutex across all those operations I
> think we would need to take the mutex in the vm_ops->mmap_prepare and
> add a vm_ops->map_failed hook or something along that line to drop the
> mutex in case mmap_action_complete() fails. Not sure if we will have
> such cases though...

No, I don't want to do this if it can be at all avoided. You should in
nearly any sane circumstance be able to defer things until the mapped hook
anyway.

Also, a merge can happen after an .mmap_prepare too, so we'd have to have
some 'success' hook, and I'm just not going there; it'll end up open to abuse
again.

(We do have success and error filtering hooks right now, sadly, but they're
really for hugetlb and I plan to find a way to get rid of them).

The mmap_prepare is meant to essentially be as stateless as possible.

Anyway I don't think it's relevant here.

>
> >
> > >         vma_priv = kmalloc_obj(*vma_priv);
> > >         if (!vma_priv) {
> > >                 mutex_unlock(&image[minor].mutex);
> > > @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > >
> > >         vma_priv->minor = minor;
> > >         refcount_set(&vma_priv->refcnt, 1);
> > > -       vma->vm_ops = &vme_user_vm_ops;
> > > -       vma->vm_private_data = vma_priv;
> > > -
> > > +       *vm_private_data = vma_priv;
> > >         image[minor].mmap_count++;
> > >
> > >         mutex_unlock(&image[minor].mutex);
> > > -
> > >         return 0;
> > >  }
> > >
> > > -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> > > +static const struct vm_operations_struct vme_user_vm_ops = {
> > > +       .mapped = vme_user_vm_mapped,
> > > +       .open = vme_user_vm_open,
> > > +       .close = vme_user_vm_close,
> > > +};
> > > +
> > > +static int vme_user_master_mmap_prepare(unsigned int minor,
> > > +                                       struct vm_area_desc *desc)
> > > +{
> > > +       int err;
> > > +
> > > +       mutex_lock(&image[minor].mutex);
> > > +
> > > +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> > > +       if (!err)
> > > +               desc->vm_ops = &vme_user_vm_ops;
> > > +
> > > +       mutex_unlock(&image[minor].mutex);
> > > +       return err;
> > > +}
> > > +
> > > +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
> > >  {
> > > -       unsigned int minor = iminor(file_inode(file));
> > > +       const struct file *file = desc->file;
> > > +       const unsigned int minor = iminor(file_inode(file));
> > >
> > >         if (type[minor] == MASTER_MINOR)
> > > -               return vme_user_master_mmap(minor, vma);
> > > +               return vme_user_master_mmap_prepare(minor, desc);
> > >
> > >         return -ENODEV;
> > >  }
> > > @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
> > >         .llseek = vme_user_llseek,
> > >         .unlocked_ioctl = vme_user_unlocked_ioctl,
> > >         .compat_ioctl = compat_ptr_ioctl,
> > > -       .mmap = vme_user_mmap,
> > > +       .mmap_prepare = vme_user_mmap_prepare,
> > >  };
> > >
> > >  static int vme_user_match(struct vme_dev *vdev)
> > > --
> > > 2.53.0
> > >

Cheers, Lorenzo

* Re: [PATCH v2 15/16] mm: add mmap_action_map_kernel_pages[_full]()
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:05 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpF6eS18HLgNvQtkLGd=7N0_L1JPmF0GzM-Z0QimRWT7AQ@mail.gmail.com>

On Wed, Mar 18, 2026 at 09:00:13AM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > A user can invoke mmap_action_map_kernel_pages() to specify that the
> > mapping should map kernel pages starting from desc->start of a specified
> > number of pages specified in an array.
> >
> > In order to implement this, adjust mmap_action_prepare() to be able to
> > return an error code, as it makes sense to assert that the specified
> > parameters are valid as quickly as possible as well as updating the VMA
> > flags to include VMA_MIXEDMAP_BIT as necessary.
> >
> > This provides an mmap_prepare equivalent of vm_insert_pages().
> >
> > We additionally update the existing vm_insert_pages() code to use
> > range_in_vma() and add a new range_in_vma_desc() helper function for the
> > mmap_prepare case, sharing the code between the two in range_is_subset().
> >
> > We add both mmap_action_map_kernel_pages() and
> > mmap_action_map_kernel_pages_full() to allow for both partial and full VMA
> > mappings.
> >
> > We also add mmap_action_map_kernel_pages_discontig() to allow for
> > discontiguous mapping of kernel pages should the need arise.
> >
> > We update the documentation to reflect the new features.
> >
> > Finally, we update the VMA tests accordingly to reflect the changes.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>
> With one nit,
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

>
> > ---
> >  Documentation/filesystems/mmap_prepare.rst |  8 ++
> >  include/linux/mm.h                         | 95 +++++++++++++++++++++-
> >  include/linux/mm_types.h                   |  7 ++
> >  mm/memory.c                                | 42 +++++++++-
> >  mm/util.c                                  |  6 ++
> >  tools/testing/vma/include/dup.h            |  7 ++
> >  6 files changed, 159 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > index be76ae475b9c..e810aa4134eb 100644
> > --- a/Documentation/filesystems/mmap_prepare.rst
> > +++ b/Documentation/filesystems/mmap_prepare.rst
> > @@ -156,5 +156,13 @@ pointer. These are:
> >  * mmap_action_simple_ioremap() - Sets up an I/O remap from a specified
> >    physical address and over a specified length.
> >
> > +* mmap_action_map_kernel_pages() - Maps a specified array of `struct page`
> > +  pointers in the VMA from a specific offset.
> > +
> > +* mmap_action_map_kernel_pages_full() - Maps a specified array of `struct
> > +  page` pointers over the entire VMA. The caller must ensure there are
> > +  sufficient entries in the page array to cover the entire range of the
> > +  described VMA.
> > +
> >  **NOTE:** The ``action`` field should never normally be manipulated directly,
> >  rather you ought to use one of these helpers.
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index df8fa6e6402b..6f0a3edb24e1 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2912,7 +2912,7 @@ static inline bool folio_maybe_mapped_shared(struct folio *folio)
> >   * The caller must add any reference (e.g., from folio_try_get()) it might be
> >   * holding itself to the result.
> >   *
> > - * Returns the expected folio refcount.
> > + * Returns: the expected folio refcount.
>
> nit: I see both "Returns:" and "Return:" being used in the codebase
> but this header file uses "Return:", so for consistency you should
> probably do the same. This also applies to later instances in this
> patch.

Well, here I'm just adding the colon while I'm here (it may have been an
update in response to feedback, actually).

And this function, which isn't part of my change, already uses 'Returns', and
I'm pretty sure that's the correct form.

So I think not a big deal to keep using that?

>
> >   */
> >  static inline int folio_expected_ref_count(const struct folio *folio)
> >  {
> > @@ -4364,6 +4364,45 @@ static inline void mmap_action_simple_ioremap(struct vm_area_desc *desc,
> >         action->type = MMAP_SIMPLE_IO_REMAP;
> >  }
> >
> > +/**
> > + * mmap_action_map_kernel_pages - helper for mmap_prepare hook to specify that
> > > + * @nr_pages kernel pages contained in the @pages array should be mapped to userland
> > > + * starting at virtual address @start.
> > > + * @desc: The VMA descriptor for the VMA requiring kernel pages to be mapped.
> > > + * @start: The virtual address from which to map them.
> > > + * @pages: An array of struct page pointers describing the memory to map.
> > > + * @nr_pages: The number of entries in the @pages array.
> > + */
> > +static inline void mmap_action_map_kernel_pages(struct vm_area_desc *desc,
> > +               unsigned long start, struct page **pages,
> > +               unsigned long nr_pages)
> > +{
> > +       struct mmap_action *action = &desc->action;
> > +
> > +       action->type = MMAP_MAP_KERNEL_PAGES;
> > +       action->map_kernel.start = start;
> > +       action->map_kernel.pages = pages;
> > +       action->map_kernel.nr_pages = nr_pages;
> > +       action->map_kernel.pgoff = desc->pgoff;
> > +}
> > +
> > +/**
> > + * mmap_action_map_kernel_pages_full - helper for mmap_prepare hook to specify that
> > + * kernel pages contained in the @pages array should be mapped to userland
> > + * from @desc->start to @desc->end.
> > > + * @desc: The VMA descriptor for the VMA requiring kernel pages to be mapped.
> > + * @pages: An array of struct page pointers describing the memory to map.
> > + *
> > + * The caller must ensure that @pages contains sufficient entries to cover the
> > + * entire range described by @desc.
> > + */
> > +static inline void mmap_action_map_kernel_pages_full(struct vm_area_desc *desc,
> > +               struct page **pages)
> > +{
> > +       mmap_action_map_kernel_pages(desc, desc->start, pages,
> > +                                    vma_desc_pages(desc));
> > +}
> > +
> >  int mmap_action_prepare(struct vm_area_desc *desc);
> >  int mmap_action_complete(struct vm_area_struct *vma,
> >                          struct mmap_action *action);
> > @@ -4380,10 +4419,59 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> >         return vma;
> >  }
> >
> > +/**
> > + * range_is_subset - Is the specified inner range a subset of the outer range?
> > + * @outer_start: The start of the outer range.
> > + * @outer_end: The exclusive end of the outer range.
> > + * @inner_start: The start of the inner range.
> > + * @inner_end: The exclusive end of the inner range.
> > + *
> > + * Returns: %true if [inner_start, inner_end) is a subset of [outer_start,
> > + * outer_end), otherwise %false.
> > + */
> > +static inline bool range_is_subset(unsigned long outer_start,
> > +                                  unsigned long outer_end,
> > +                                  unsigned long inner_start,
> > +                                  unsigned long inner_end)
> > +{
> > +       return outer_start <= inner_start && inner_end <= outer_end;
> > +}
> > +
> > +/**
> > + * range_in_vma - is the specified [@start, @end) range a subset of the VMA?
> > + * @vma: The VMA against which we want to check [@start, @end).
> > + * @start: The start of the range we wish to check.
> > + * @end: The exclusive end of the range we wish to check.
> > + *
> > + * Returns: %true if [@start, @end) is a subset of [@vma->vm_start,
> > + * @vma->vm_end), %false otherwise.
> > + */
> >  static inline bool range_in_vma(const struct vm_area_struct *vma,
> >                                 unsigned long start, unsigned long end)
> >  {
> > -       return (vma && vma->vm_start <= start && end <= vma->vm_end);
> > +       if (!vma)
> > +               return false;
> > +
> > +       return range_is_subset(vma->vm_start, vma->vm_end, start, end);
> > +}
> > +
> > +/**
> > + * range_in_vma_desc - is the specified [@start, @end) range a subset of the VMA
> > + * described by @desc, a VMA descriptor?
> > + * @desc: The VMA descriptor against which we want to check [@start, @end).
> > + * @start: The start of the range we wish to check.
> > + * @end: The exclusive end of the range we wish to check.
> > + *
> > + * Returns: %true if [@start, @end) is a subset of [@desc->start, @desc->end),
> > + * %false otherwise.
> > + */
> > +static inline bool range_in_vma_desc(const struct vm_area_desc *desc,
> > +                                    unsigned long start, unsigned long end)
> > +{
> > +       if (!desc)
> > +               return false;
> > +
> > +       return range_is_subset(desc->start, desc->end, start, end);
> >  }
> >
> >  #ifdef CONFIG_MMU
> > @@ -4427,6 +4515,9 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> >  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> >  int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> >                         struct page **pages, unsigned long *num);
> > +int map_kernel_pages_prepare(struct vm_area_desc *desc);
> > +int map_kernel_pages_complete(struct vm_area_struct *vma,
> > +                             struct mmap_action *action);
> >  int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
> >                                 unsigned long num);
> >  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 7538d64f8848..c46224020a46 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -815,6 +815,7 @@ enum mmap_action_type {
> >         MMAP_REMAP_PFN,         /* Remap PFN range. */
> >         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
> >         MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> > +       MMAP_MAP_KERNEL_PAGES,  /* Map kernel page range from array. */
> >  };
> >
> >  /*
> > @@ -833,6 +834,12 @@ struct mmap_action {
> >                         phys_addr_t start_phys_addr;
> >                         unsigned long size;
> >                 } simple_ioremap;
> > +               struct {
> > +                       unsigned long start;
> > +                       struct page **pages;
> > +                       unsigned long nr_pages;
> > +                       pgoff_t pgoff;
> > +               } map_kernel;
> >         };
> >         enum mmap_action_type type;
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f3f4046aee97..849d5d9eeb83 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2484,13 +2484,14 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
> >  int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> >                         struct page **pages, unsigned long *num)
> >  {
> > -       const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1;
> > +       const unsigned long nr_pages = *num;
> > +       const unsigned long end = addr + PAGE_SIZE * nr_pages;
> >
> > -       if (addr < vma->vm_start || end_addr >= vma->vm_end)
> > +       if (!range_in_vma(vma, addr, end))
> >                 return -EFAULT;
> >         if (!(vma->vm_flags & VM_MIXEDMAP)) {
> > -               BUG_ON(mmap_read_trylock(vma->vm_mm));
> > -               BUG_ON(vma->vm_flags & VM_PFNMAP);
> > +               VM_WARN_ON_ONCE(mmap_read_trylock(vma->vm_mm));
> > +               VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
> >                 vm_flags_set(vma, VM_MIXEDMAP);
> >         }
> >         /* Defer page refcount checking till we're about to map that page. */
> > @@ -2498,6 +2499,39 @@ int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  EXPORT_SYMBOL(vm_insert_pages);
> >
> > +int map_kernel_pages_prepare(struct vm_area_desc *desc)
> > +{
> > +       const struct mmap_action *action = &desc->action;
> > +       const unsigned long addr = action->map_kernel.start;
> > +       unsigned long nr_pages, end;
> > +
> > +       if (!vma_desc_test(desc, VMA_MIXEDMAP_BIT)) {
> > +               VM_WARN_ON_ONCE(mmap_read_trylock(desc->mm));
> > +               VM_WARN_ON_ONCE(vma_desc_test(desc, VMA_PFNMAP_BIT));
> > +               vma_desc_set_flags(desc, VMA_MIXEDMAP_BIT);
> > +       }
> > +
> > +       nr_pages = action->map_kernel.nr_pages;
> > +       end = addr + PAGE_SIZE * nr_pages;
> > +       if (!range_in_vma_desc(desc, addr, end))
> > +               return -EFAULT;
> > +
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL(map_kernel_pages_prepare);
> > +
> > +int map_kernel_pages_complete(struct vm_area_struct *vma,
> > +                             struct mmap_action *action)
> > +{
> > +       unsigned long nr_pages;
> > +
> > +       nr_pages = action->map_kernel.nr_pages;
> > +       return insert_pages(vma, action->map_kernel.start,
> > +                           action->map_kernel.pages,
> > +                           &nr_pages, vma->vm_page_prot);
> > +}
> > +EXPORT_SYMBOL(map_kernel_pages_complete);
> > +
> >  /**
> >   * vm_insert_page - insert single page into user vma
> >   * @vma: user vma to map to
> > diff --git a/mm/util.c b/mm/util.c
> > index a166c48fe894..dea590e7a26c 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -1441,6 +1441,8 @@ int mmap_action_prepare(struct vm_area_desc *desc)
> >                 return io_remap_pfn_range_prepare(desc);
> >         case MMAP_SIMPLE_IO_REMAP:
> >                 return simple_ioremap_prepare(desc);
> > +       case MMAP_MAP_KERNEL_PAGES:
> > +               return map_kernel_pages_prepare(desc);
> >         }
> >
> >         WARN_ON_ONCE(1);
> > @@ -1472,6 +1474,9 @@ int mmap_action_complete(struct vm_area_struct *vma,
> >         case MMAP_IO_REMAP_PFN:
> >                 err = io_remap_pfn_range_complete(vma, action);
> >                 break;
> > +       case MMAP_MAP_KERNEL_PAGES:
> > +               err = map_kernel_pages_complete(vma, action);
> > +               break;
> >         case MMAP_SIMPLE_IO_REMAP:
> >                 /*
> >                  * The simple I/O remap should have been delegated to an I/O
> > @@ -1494,6 +1499,7 @@ int mmap_action_prepare(struct vm_area_desc *desc)
> >         case MMAP_REMAP_PFN:
> >         case MMAP_IO_REMAP_PFN:
> >         case MMAP_SIMPLE_IO_REMAP:
> > +       case MMAP_MAP_KERNEL_PAGES:
> >                 WARN_ON_ONCE(1); /* nommu cannot handle these. */
> >                 break;
> >         }
> > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > index 6658df26698a..4407caf207ad 100644
> > --- a/tools/testing/vma/include/dup.h
> > +++ b/tools/testing/vma/include/dup.h
> > @@ -454,6 +454,7 @@ enum mmap_action_type {
> >         MMAP_REMAP_PFN,         /* Remap PFN range. */
> >         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
> >         MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> > +       MMAP_MAP_KERNEL_PAGES,  /* Map kernel page range from an array. */
> >  };
> >
> >  /*
> > @@ -472,6 +473,12 @@ struct mmap_action {
> >                         phys_addr_t start;
> >                         unsigned long len;
> >                 } simple_ioremap;
> > +               struct {
> > +                       unsigned long start;
> > +                       struct page **pages;
> > +                       unsigned long num;
> > +                       pgoff_t pgoff;
> > +               } map_kernel;
> >         };
> >         enum mmap_action_type type;
> >
> > --
> > 2.53.0
> >
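
For reference, the vm_insert_pages() hunk above swaps an inclusive end
(end_addr) for an exclusive one (end) and routes the check through
range_is_subset(). Below is a small userspace sketch, not kernel code,
suggesting the two forms agree for inputs that do not wrap. DEMO_PAGE_SIZE,
assert_no_wrap(), old_check_ok(), new_check_ok() and count_mismatches() are
names invented for this illustration; only range_is_subset() mirrors the
patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_PAGE_SIZE 4096UL /* assumed page size, for this sketch only */

/* Half-open subset test, mirroring range_is_subset() from the patch. */
static bool range_is_subset(unsigned long outer_start, unsigned long outer_end,
			    unsigned long inner_start, unsigned long inner_end)
{
	return outer_start <= inner_start && inner_end <= outer_end;
}

/* Assumption of this sketch: addr + num * DEMO_PAGE_SIZE must not wrap. */
static void assert_no_wrap(unsigned long addr, unsigned long num)
{
	assert(num <= (ULONG_MAX - addr) / DEMO_PAGE_SIZE);
}

/* Shape of the pre-patch bounds check: inclusive end address. */
static bool old_check_ok(unsigned long vm_start, unsigned long vm_end,
			 unsigned long addr, unsigned long num)
{
	unsigned long end_addr;

	assert_no_wrap(addr, num);
	end_addr = addr + (num * DEMO_PAGE_SIZE) - 1;

	return !(addr < vm_start || end_addr >= vm_end);
}

/* Shape of the post-patch bounds check: exclusive end address. */
static bool new_check_ok(unsigned long vm_start, unsigned long vm_end,
			 unsigned long addr, unsigned long num)
{
	unsigned long end;

	assert_no_wrap(addr, num);
	end = addr + DEMO_PAGE_SIZE * num;

	return range_is_subset(vm_start, vm_end, addr, end);
}

/* Sweep start addresses and page counts around a 16-page mock VMA. */
unsigned long count_mismatches(void)
{
	const unsigned long vm_start = 16 * DEMO_PAGE_SIZE;
	const unsigned long vm_end = 32 * DEMO_PAGE_SIZE;
	unsigned long addr, num, mismatches = 0;

	for (addr = 12 * DEMO_PAGE_SIZE; addr <= 36 * DEMO_PAGE_SIZE;
	     addr += DEMO_PAGE_SIZE) {
		for (num = 0; num <= 24; num++) {
			if (old_check_ok(vm_start, vm_end, addr, num) !=
			    new_check_ok(vm_start, vm_end, addr, num))
				mismatches++;
		}
	}

	return mismatches;
}
```

Calling count_mismatches() from a test harness and checking that it returns
zero is enough to exercise the sweep. The exclusive-end form also lets
range_in_vma() and range_in_vma_desc() share the same helper.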

* Re: [PATCH v2 12/16] mm: allow handling of stacked mmap_prepare hooks in more drivers
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpFr8_uU28S=v7y74Opa4L_4s9J70NgUXg1WGmraDhsxRA@mail.gmail.com>

On Wed, Mar 18, 2026 at 08:33:28AM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > While the conversion of mmap hooks to mmap_prepare is underway, we wil
>
> nit: s/wil/will

Thanks, fixed.

>
> > encounter situations where mmap hooks need to invoke nested mmap_prepare
> > hooks.
> >
> > The nesting of mmap hooks is termed 'stacking'.  In order to flexibly
> > facilitate the conversion of custom mmap hooks in drivers which stack, we
> > must split up the existing compat_vma_mapped() function into two separate
> > functions:
> >
> > * compat_set_desc_from_vma() - This allows the setting of a vm_area_desc
> >   object's fields to the relevant fields of a VMA.
> >
> > * __compat_vma_mmap() - Once an mmap_prepare hook has been executed upon a
> >   vm_area_desc object, this function performs any mmap actions specified by
> > >   the mmap_prepare hook and then invokes its vm_ops->mapped() hook if one
> > >   was specified.
> >
> > In ordinary cases, where a file's f_op->mmap_prepare() hook simply needs to
> > be invoked in a stacked mmap() hook, compat_vma_mmap() can be used.
> >
> > However some drivers define their own nested hooks, which are invoked in
> > turn by another hook.
> >
> > A concrete example is vmbus_channel->mmap_ring_buffer(), which is invoked
> > in turn by bin_attribute->mmap():
> >
> > vmbus_channel->mmap_ring_buffer() has a signature of:
> >
> > int (*mmap_ring_buffer)(struct vmbus_channel *channel,
> >                         struct vm_area_struct *vma);
> >
> > And bin_attribute->mmap() has a signature of:
> >
> >         int (*mmap)(struct file *, struct kobject *,
> >                     const struct bin_attribute *attr,
> >                     struct vm_area_struct *vma);
> >
> > And so compat_vma_mmap() cannot be used here for incremental conversion of
> > hooks from mmap() to mmap_prepare().
> >
> > > There are many instances like this, where conversion to mmap_prepare
> > would otherwise cascade to a huge change set due to nesting of this kind.
> >
> > The changes in this patch mean we could now instead convert
> > vmbus_channel->mmap_ring_buffer() to
> > vmbus_channel->mmap_prepare_ring_buffer(), and implement something like:
> >
> >         struct vm_area_desc desc;
> >         int err;
> >
> > >         compat_set_desc_from_vma(&desc, file, vma);
> >         err = channel->mmap_prepare_ring_buffer(channel, &desc);
> >         if (err)
> >                 return err;
> >
> >         return __compat_vma_mmap(&desc, vma);
> >
> > Allowing us to incrementally update this logic, and other logic like it.
>
> The way I understand this and the next 2 patches is that they are
> preparations for later replacement of mmap() with mmap_prepare() but
> they don't yet do that completely. Is that right?
> To clarify what I mean, in [1] for example, you are replacing struct
> uio_info.mmap with uio_info.mmap_prepare but it's still being called
> from uio_mmap(). IOW, you are not replacing uio_mmap with
> uio_mmap_prepare. Is that the next step that's not yet implemented?

Yeah, there were 12 more patches I didn't send :) because I feel they'd
make more sense separate and wanted to test/develop them more.

This is all laying the groundwork for having mmap_prepare DMA cache
mappings ultimately, while expanding functionality as we go.

The intent here though isn't _just_ that, it's more - in general - when we
have e.g.:

int some_special_mmap(struct some_type *blah, struct file *filp /* sometimes */,
		      struct vm_area_struct *vma)
{
	...
}

Or, say, some custom ops for something like this, where in the existing code
callers hook .mmap(), grab some specific struct (like a device pointer or a
state pointer) from the file private data and then delegate to another
helper.

In this situation, we are able to use the compatibility layer to change, say,
the ops to be .mmap_prepare while the overarching caller remains .mmap.

This allows for iterative conversion to .mmap_prepare without having to
amend 100 files at once in a multi-thousand line patch touching dozens of
drivers or some hellish notion like that.

This is vital to sensibly being able to implement these changes bit-by-bit.
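
As a rough userspace sketch of that shape (not kernel code): the structs
below are cut-down stand-ins for the real ones, the mock_*() helpers only
imitate what compat_set_desc_from_vma() and __compat_vma_mmap() are described
as doing, and struct some_type's fields, some_special_mmap_prepare() and
outer_mmap() are made up purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12 /* assumed 4 KiB pages, for this sketch only */

/* Cut-down stand-ins for the kernel structures; only the fields used here. */
struct vm_area_struct {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
	void *vm_private_data;
};

struct vm_area_desc {
	unsigned long start;
	unsigned long end;
	unsigned long pgoff;
	void *private_data;
};

/* Hypothetical driver state, normally dug out of the file's private data. */
struct some_type {
	unsigned long max_pages;
	/* Already converted: operates on a descriptor, not on a VMA. */
	int (*mmap_prepare)(struct some_type *blah, struct vm_area_desc *desc);
};

/* Mock of the "descriptor from VMA" step. */
static void mock_set_desc_from_vma(struct vm_area_desc *desc,
				   const struct vm_area_struct *vma)
{
	desc->start = vma->vm_start;
	desc->end = vma->vm_end;
	desc->pgoff = vma->vm_pgoff;
	desc->private_data = vma->vm_private_data;
}

/* Mock of the "apply descriptor back to the VMA" step. */
static int mock_compat_vma_mmap(const struct vm_area_desc *desc,
				struct vm_area_struct *vma)
{
	vma->vm_pgoff = desc->pgoff;
	vma->vm_private_data = desc->private_data;
	return 0;
}

/* The nested helper after conversion: it only ever sees the descriptor. */
int some_special_mmap_prepare(struct some_type *blah,
			      struct vm_area_desc *desc)
{
	unsigned long nr_pages = (desc->end - desc->start) >> DEMO_PAGE_SHIFT;

	if (nr_pages > blah->max_pages)
		return -EINVAL;

	desc->private_data = blah;
	return 0;
}

/* The overarching hook is still .mmap-shaped and still receives a VMA. */
int outer_mmap(struct some_type *blah, struct vm_area_struct *vma)
{
	struct vm_area_desc desc;
	int err;

	assert(blah && blah->mmap_prepare);

	mock_set_desc_from_vma(&desc, vma);
	err = blah->mmap_prepare(blah, &desc);
	if (err)
		return err; /* The VMA is left untouched on failure. */

	return mock_compat_vma_mmap(&desc, vma);
}
```

The point of the split is that only the nested helper changes signature; the
outer hook keeps its existing one until its own callers are converted.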

Cheers, Lorenzo

>
> [1] https://lore.kernel.org/all/892a8b32e5ef64c69239ccc2d1bd364716fd7fdf.1773695307.git.ljs@kernel.org/
>
> >
> > Unfortunately, as part of this change, we need to be able to flexibly
> > assign to the VMA descriptor, so have to remove some of the const
> > declarations within the structure.
> >
> > Also update the VMA tests to reflect the changes.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  include/linux/fs.h              |   3 +
> >  include/linux/mm_types.h        |   4 +-
> >  mm/util.c                       | 111 +++++++++++++++++++++++---------
> >  mm/vma.h                        |   2 +-
> >  tools/testing/vma/include/dup.h | 111 ++++++++++++++++++++------------
> >  5 files changed, 157 insertions(+), 74 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index c390f5c667e3..0bdccfa70b44 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2058,6 +2058,9 @@ static inline bool can_mmap_file(struct file *file)
> >         return true;
> >  }
> >
> > +void compat_set_desc_from_vma(struct vm_area_desc *desc, const struct file *file,
> > +                             const struct vm_area_struct *vma);
> > +int __compat_vma_mmap(struct vm_area_desc *desc, struct vm_area_struct *vma);
> >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
> >  int __vma_check_mmap_hook(struct vm_area_struct *vma);
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 50685cf29792..7538d64f8848 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -891,8 +891,8 @@ static __always_inline bool vma_flags_empty(vma_flags_t *flags)
> >   */
> >  struct vm_area_desc {
> >         /* Immutable state. */
> > -       const struct mm_struct *const mm;
> > -       struct file *const file; /* May vary from vm_file in stacked callers. */
> > +       struct mm_struct *mm;
> > +       struct file *file; /* May vary from vm_file in stacked callers. */
> >         unsigned long start;
> >         unsigned long end;
> >
> > diff --git a/mm/util.c b/mm/util.c
> > index aa92e471afe1..a166c48fe894 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -1163,34 +1163,38 @@ void flush_dcache_folio(struct folio *folio)
> >  EXPORT_SYMBOL(flush_dcache_folio);
> >  #endif
> >
> > -static int __compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > +/**
> > + * compat_set_desc_from_vma() - assigns VMA descriptor @desc fields from a VMA.
> > + * @desc: A VMA descriptor whose fields need to be set.
> > + * @file: The file object describing the file being mmap()'d.
> > + * @vma: The VMA whose fields we wish to assign to @desc.
> > + *
> > + * This is a compatibility function to allow an mmap() hook to call
> > + * mmap_prepare() hooks when drivers nest these. This function specifically
> > + * allows the construction of a vm_area_desc value, @desc, from a VMA @vma for
> > + * the purposes of doing this.
> > + *
> > + * Once the conversion of drivers is complete this function will no longer be
> > + * required and will be removed.
> > + */
> > +void compat_set_desc_from_vma(struct vm_area_desc *desc,
> > +                             const struct file *file,
> > +                             const struct vm_area_struct *vma)
> >  {
> > -       struct vm_area_desc desc = {
> > -               .mm = vma->vm_mm,
> > -               .file = file,
> > -               .start = vma->vm_start,
> > -               .end = vma->vm_end,
> > -
> > -               .pgoff = vma->vm_pgoff,
> > -               .vm_file = vma->vm_file,
> > -               .vma_flags = vma->flags,
> > -               .page_prot = vma->vm_page_prot,
> > -
> > -               .action.type = MMAP_NOTHING, /* Default */
> > -       };
> > -       int err;
> > +       desc->mm = vma->vm_mm;
> > +       desc->file = (struct file *)file;
> > +       desc->start = vma->vm_start;
> > +       desc->end = vma->vm_end;
> >
> > -       err = vfs_mmap_prepare(file, &desc);
> > -       if (err)
> > -               return err;
> > +       desc->pgoff = vma->vm_pgoff;
> > +       desc->vm_file = vma->vm_file;
> > +       desc->vma_flags = vma->flags;
> > +       desc->page_prot = vma->vm_page_prot;
> >
> > -       err = mmap_action_prepare(&desc);
> > -       if (err)
> > -               return err;
> > -
> > -       set_vma_from_desc(vma, &desc);
> > -       return mmap_action_complete(vma, &desc.action);
> > +       /* Default. */
> > +       desc->action.type = MMAP_NOTHING;
> >  }
> > +EXPORT_SYMBOL(compat_set_desc_from_vma);
> >
> >  static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> >  {
> > @@ -1211,6 +1215,49 @@ static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> >         return err;
> >  }
> >
> > +/**
> > + * __compat_vma_mmap() - Similar to compat_vma_mmap(), only it allows
> > + * flexibility as to how the mmap_prepare callback is invoked, which is useful
> > + * for drivers which invoke nested mmap_prepare callbacks in an mmap() hook.
> > + * @desc: A VMA descriptor upon which an mmap_prepare() hook has already been
> > + * executed.
> > + * @vma: The VMA to which @desc should be applied.
> > + *
> > + * The function assumes that you have obtained a VMA descriptor @desc from
> > + * compat_set_desc_from_vma(), and already executed the mmap_prepare() hook upon
> > + * it.
> > + *
> > + * It then performs any specified mmap actions, and invokes the vm_ops->mapped()
> > + * hook if one is present.
> > + *
> > + * See the description of compat_vma_mmap() for more details.
> > + *
> > + * Once the conversion of drivers is complete this function will no longer be
> > + * required and will be removed.
> > + *
> > + * Returns: 0 on success or error.
> > + */
> > +int __compat_vma_mmap(struct vm_area_desc *desc,
> > +                     struct vm_area_struct *vma)
> > +{
> > +       int err;
> > +
> > +       /* Perform any preparatory tasks for mmap action. */
> > +       err = mmap_action_prepare(desc);
> > +       if (err)
> > +               return err;
> > +       /* Update the VMA from the descriptor. */
> > +       compat_set_vma_from_desc(vma, desc);
> > +       /* Complete any specified mmap actions. */
> > +       err = mmap_action_complete(vma, &desc->action);
> > +       if (err)
> > +               return err;
> > +
> > +       /* Invoke vm_ops->mapped callback. */
> > +       return __compat_vma_mapped(desc->file, vma);
> > +}
> > +EXPORT_SYMBOL(__compat_vma_mmap);
> > +
> >  /**
> >   * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
> >   * existing VMA and execute any requested actions.
> > @@ -1218,10 +1265,10 @@ static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> >   * @vma: The VMA to apply the .mmap_prepare() hook to.
> >   *
> >   * Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain
> > - * stacked filesystems invoke a nested mmap hook of an underlying file.
> > + * stacked drivers invoke a nested mmap hook of an underlying file.
> >   *
> > - * Until all filesystems are converted to use .mmap_prepare(), we must be
> > - * conservative and continue to invoke these stacked filesystems using the
> > + * Until all drivers are converted to use .mmap_prepare(), we must be
> > + * conservative and continue to invoke these stacked drivers using the
> >   * deprecated .mmap() hook.
> >   *
> >   * However we have a problem if the underlying file system possesses an
> > @@ -1232,20 +1279,22 @@ static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> >   * establishes a struct vm_area_desc descriptor, passes to the underlying
> >   * .mmap_prepare() hook and applies any changes performed by it.
> >   *
> > - * Once the conversion of filesystems is complete this function will no longer
> > - * be required and will be removed.
> > + * Once the conversion of drivers is complete this function will no longer be
> > + * required and will be removed.
> >   *
> >   * Returns: 0 on success or error.
> >   */
> >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> >  {
> > +       struct vm_area_desc desc;
> >         int err;
> >
> > -       err = __compat_vma_mmap(file, vma);
> > +       compat_set_desc_from_vma(&desc, file, vma);
> > +       err = vfs_mmap_prepare(file, &desc);
> >         if (err)
> >                 return err;
> >
> > -       return __compat_vma_mapped(file, vma);
> > +       return __compat_vma_mmap(&desc, vma);
> >  }
> >  EXPORT_SYMBOL(compat_vma_mmap);
> >
> > diff --git a/mm/vma.h b/mm/vma.h
> > index adc18f7dd9f1..a76046c39b14 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -300,7 +300,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> >   * f_op->mmap() but which might have an underlying file system which implements
> >   * f_op->mmap_prepare().
> >   */
> > -static inline void set_vma_from_desc(struct vm_area_struct *vma,
> > +static inline void compat_set_vma_from_desc(struct vm_area_struct *vma,
> >                 struct vm_area_desc *desc)
> >  {
> >         /*
> > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > index 114daaef4f73..6658df26698a 100644
> > --- a/tools/testing/vma/include/dup.h
> > +++ b/tools/testing/vma/include/dup.h
> > @@ -519,8 +519,8 @@ enum vma_operation {
> >   */
> >  struct vm_area_desc {
> >         /* Immutable state. */
> > -       const struct mm_struct *const mm;
> > -       struct file *const file; /* May vary from vm_file in stacked callers. */
> > +       struct mm_struct *mm;
> > +       struct file *file; /* May vary from vm_file in stacked callers. */
> >         unsigned long start;
> >         unsigned long end;
> >
> > @@ -1272,43 +1272,92 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
> >  }
> >
> >  /* Declared in vma.h. */
> > -static inline void set_vma_from_desc(struct vm_area_struct *vma,
> > +static inline void compat_set_vma_from_desc(struct vm_area_struct *vma,
> >                 struct vm_area_desc *desc);
> >
> > -static inline int __compat_vma_mmap(const struct file_operations *f_op,
> > -               struct file *file, struct vm_area_struct *vma)
> > +static inline void compat_set_desc_from_vma(struct vm_area_desc *desc,
> > +                             const struct file *file,
> > +                             const struct vm_area_struct *vma)
> >  {
> > -       struct vm_area_desc desc = {
> > -               .mm = vma->vm_mm,
> > -               .file = file,
> > -               .start = vma->vm_start,
> > -               .end = vma->vm_end,
> > +       desc->mm = vma->vm_mm;
> > +       desc->file = (struct file *)file;
> > +       desc->start = vma->vm_start;
> > +       desc->end = vma->vm_end;
> >
> > -               .pgoff = vma->vm_pgoff,
> > -               .vm_file = vma->vm_file,
> > -               .vma_flags = vma->flags,
> > -               .page_prot = vma->vm_page_prot,
> > +       desc->pgoff = vma->vm_pgoff;
> > +       desc->vm_file = vma->vm_file;
> > +       desc->vma_flags = vma->flags;
> > +       desc->page_prot = vma->vm_page_prot;
> >
> > -               .action.type = MMAP_NOTHING, /* Default */
> > -       };
> > +       /* Default. */
> > +       desc->action.type = MMAP_NOTHING;
> > +}
> > +
> > +static inline unsigned long vma_pages(const struct vm_area_struct *vma)
> > +{
> > +       return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > +}
> > +
> > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > +{
> > +       const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > +
> > +       mmap_assert_write_locked(vma->vm_mm);
> > +       do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > +}
> > +
> > +static inline int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> > +{
> > +       const struct vm_operations_struct *vm_ops = vma->vm_ops;
> >         int err;
> >
> > -       err = f_op->mmap_prepare(&desc);
> > +       if (!vm_ops->mapped)
> > +               return 0;
> > +
> > +       err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, file,
> > +                            &vma->vm_private_data);
> >         if (err)
> > -               return err;
> > +               unmap_vma_locked(vma);
> > +       return err;
> > +}
> >
> > -       err = mmap_action_prepare(&desc);
> > +static inline int __compat_vma_mmap(struct vm_area_desc *desc,
> > +               struct vm_area_struct *vma)
> > +{
> > +       int err;
> > +
> > +       /* Perform any preparatory tasks for mmap action. */
> > +       err = mmap_action_prepare(desc);
> > +       if (err)
> > +               return err;
> > +       /* Update the VMA from the descriptor. */
> > +       compat_set_vma_from_desc(vma, desc);
> > +       /* Complete any specified mmap actions. */
> > +       err = mmap_action_complete(vma, &desc->action);
> >         if (err)
> >                 return err;
> >
> > -       set_vma_from_desc(vma, &desc);
> > -       return mmap_action_complete(vma, &desc.action);
> > +       /* Invoke vm_ops->mapped callback. */
> > +       return __compat_vma_mapped(desc->file, vma);
> > +}
> > +
> > +static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> > +{
> > +       return file->f_op->mmap_prepare(desc);
> >  }
> >
> >  static inline int compat_vma_mmap(struct file *file,
> >                 struct vm_area_struct *vma)
> >  {
> > -       return __compat_vma_mmap(file->f_op, file, vma);
> > +       struct vm_area_desc desc;
> > +       int err;
> > +
> > +       compat_set_desc_from_vma(&desc, file, vma);
> > +       err = vfs_mmap_prepare(file, &desc);
> > +       if (err)
> > +               return err;
> > +
> > +       return __compat_vma_mmap(&desc, vma);
> >  }
> >
> >
> > @@ -1318,11 +1367,6 @@ static inline void vma_iter_init(struct vma_iterator *vmi,
> >         mas_init(&vmi->mas, &mm->mm_mt, addr);
> >  }
> >
> > -static inline unsigned long vma_pages(struct vm_area_struct *vma)
> > -{
> > -       return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > -}
> > -
> >  static inline void mmap_assert_locked(struct mm_struct *);
> >  static inline struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
> >                                                 unsigned long start_addr,
> > @@ -1492,11 +1536,6 @@ static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
> >         return file->f_op->mmap(file, vma);
> >  }
> >
> > -static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> > -{
> > -       return file->f_op->mmap_prepare(desc);
> > -}
> > -
> >  static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
> >  {
> >         /* Changing an anonymous vma with this is illegal */
> > @@ -1521,11 +1560,3 @@ static inline pgprot_t vma_get_page_prot(vma_flags_t vma_flags)
> >
> >         return vm_get_page_prot(vm_flags);
> >  }
> > -
> > -static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > -{
> > -       const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > -
> > -       mmap_assert_write_locked(vma->vm_mm);
> > -       do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > -}
> > --
> > 2.53.0
> >

* Re: [PATCH v2 15/16] mm: add mmap_action_map_kernel_pages[_full]()
From: Suren Baghdasaryan @ 2026-03-19 15:14 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <d877ee66-1ac9-4b1b-b860-6919dc58edfe@lucifer.local>

On Thu, Mar 19, 2026 at 8:05 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Wed, Mar 18, 2026 at 09:00:13AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > > A user can invoke mmap_action_map_kernel_pages() to specify that the
> > > > mapping should map a specified number of kernel pages, supplied in an
> > > > array, starting from desc->start.
> > >
> > > In order to implement this, adjust mmap_action_prepare() to be able to
> > > return an error code, as it makes sense to assert that the specified
> > > parameters are valid as quickly as possible as well as updating the VMA
> > > flags to include VMA_MIXEDMAP_BIT as necessary.
> > >
> > > This provides an mmap_prepare equivalent of vm_insert_pages().
> > >
> > > We additionally update the existing vm_insert_pages() code to use
> > > range_in_vma() and add a new range_in_vma_desc() helper function for the
> > > mmap_prepare case, sharing the code between the two in range_is_subset().
> > >
> > > We add both mmap_action_map_kernel_pages() and
> > > mmap_action_map_kernel_pages_full() to allow for both partial and full VMA
> > > mappings.
> > >
> > > We also add mmap_action_map_kernel_pages_discontig() to allow for
> > > discontiguous mapping of kernel pages should the need arise.
> > >
> > > We update the documentation to reflect the new features.
> > >
> > > Finally, we update the VMA tests accordingly to reflect the changes.
> > >
> > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> >
> > With one nit,
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> Thanks!
>
> >
> > > ---
> > >  Documentation/filesystems/mmap_prepare.rst |  8 ++
> > >  include/linux/mm.h                         | 95 +++++++++++++++++++++-
> > >  include/linux/mm_types.h                   |  7 ++
> > >  mm/memory.c                                | 42 +++++++++-
> > >  mm/util.c                                  |  6 ++
> > >  tools/testing/vma/include/dup.h            |  7 ++
> > >  6 files changed, 159 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > > index be76ae475b9c..e810aa4134eb 100644
> > > --- a/Documentation/filesystems/mmap_prepare.rst
> > > +++ b/Documentation/filesystems/mmap_prepare.rst
> > > @@ -156,5 +156,13 @@ pointer. These are:
> > >  * mmap_action_simple_ioremap() - Sets up an I/O remap from a specified
> > >    physical address and over a specified length.
> > >
> > > +* mmap_action_map_kernel_pages() - Maps a specified array of `struct page`
> > > +  pointers in the VMA from a specific offset.
> > > +
> > > +* mmap_action_map_kernel_pages_full() - Maps a specified array of `struct
> > > +  page` pointers over the entire VMA. The caller must ensure there are
> > > +  sufficient entries in the page array to cover the entire range of the
> > > +  described VMA.
> > > +
> > >  **NOTE:** The ``action`` field should never normally be manipulated directly,
> > >  rather you ought to use one of these helpers.
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index df8fa6e6402b..6f0a3edb24e1 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2912,7 +2912,7 @@ static inline bool folio_maybe_mapped_shared(struct folio *folio)
> > >   * The caller must add any reference (e.g., from folio_try_get()) it might be
> > >   * holding itself to the result.
> > >   *
> > > - * Returns the expected folio refcount.
> > > + * Returns: the expected folio refcount.
> >
> > nit: I see both "Returns:" and "Return:" being used in the codebase
> > but this header file uses "Return:", so for consistency you should
> > probably do the same. This also applies to later instances in this
> > patch.
>
> Well, here I'm just adding the colon while I'm here (it may have been an
> update in response to feedback, actually).
>
> And this function that's not part of my change already uses 'Returns' and
> I'm pretty sure that's the correct form.
>
> So I think not a big deal to keep using that?

Correct. Anything I mark as "nit:" is not critical and can be ignored.

>
> >
> > >   */
> > >  static inline int folio_expected_ref_count(const struct folio *folio)
> > >  {
> > > @@ -4364,6 +4364,45 @@ static inline void mmap_action_simple_ioremap(struct vm_area_desc *desc,
> > >         action->type = MMAP_SIMPLE_IO_REMAP;
> > >  }
> > >
> > > +/**
> > > + * mmap_action_map_kernel_pages - helper for mmap_prepare hook to specify that
> > > + * @nr_pages kernel pages contained in the @pages array should be mapped to userland
> > > + * starting at virtual address @start.
> > > + * @desc: The VMA descriptor for the VMA requiring kernel pages to be mapped.
> > > + * @start: The virtual address from which to map them.
> > > + * @pages: An array of struct page pointers describing the memory to map.
> > > + * @nr_pages: The number of entries in the @pages array.
> > > + */
> > > +static inline void mmap_action_map_kernel_pages(struct vm_area_desc *desc,
> > > +               unsigned long start, struct page **pages,
> > > +               unsigned long nr_pages)
> > > +{
> > > +       struct mmap_action *action = &desc->action;
> > > +
> > > +       action->type = MMAP_MAP_KERNEL_PAGES;
> > > +       action->map_kernel.start = start;
> > > +       action->map_kernel.pages = pages;
> > > +       action->map_kernel.nr_pages = nr_pages;
> > > +       action->map_kernel.pgoff = desc->pgoff;
> > > +}
> > > +
> > > +/**
> > > + * mmap_action_map_kernel_pages_full - helper for mmap_prepare hook to specify that
> > > + * kernel pages contained in the @pages array should be mapped to userland
> > > + * from @desc->start to @desc->end.
> > > + * @desc: The VMA descriptor for the VMA requiring kernel pages to be mapped.
> > > + * @pages: An array of struct page pointers describing the memory to map.
> > > + *
> > > + * The caller must ensure that @pages contains sufficient entries to cover the
> > > + * entire range described by @desc.
> > > + */
> > > +static inline void mmap_action_map_kernel_pages_full(struct vm_area_desc *desc,
> > > +               struct page **pages)
> > > +{
> > > +       mmap_action_map_kernel_pages(desc, desc->start, pages,
> > > +                                    vma_desc_pages(desc));
> > > +}
> > > +
> > >  int mmap_action_prepare(struct vm_area_desc *desc);
> > >  int mmap_action_complete(struct vm_area_struct *vma,
> > >                          struct mmap_action *action);
> > > @@ -4380,10 +4419,59 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> > >         return vma;
> > >  }
> > >
> > > +/**
> > > + * range_is_subset - Is the specified inner range a subset of the outer range?
> > > + * @outer_start: The start of the outer range.
> > > + * @outer_end: The exclusive end of the outer range.
> > > + * @inner_start: The start of the inner range.
> > > + * @inner_end: The exclusive end of the inner range.
> > > + *
> > > + * Returns: %true if [inner_start, inner_end) is a subset of [outer_start,
> > > + * outer_end), otherwise %false.
> > > + */
> > > +static inline bool range_is_subset(unsigned long outer_start,
> > > +                                  unsigned long outer_end,
> > > +                                  unsigned long inner_start,
> > > +                                  unsigned long inner_end)
> > > +{
> > > +       return outer_start <= inner_start && inner_end <= outer_end;
> > > +}
> > > +
> > > +/**
> > > + * range_in_vma - is the specified [@start, @end) range a subset of the VMA?
> > > + * @vma: The VMA against which we want to check [@start, @end).
> > > + * @start: The start of the range we wish to check.
> > > + * @end: The exclusive end of the range we wish to check.
> > > + *
> > > + * Returns: %true if [@start, @end) is a subset of [@vma->vm_start,
> > > + * @vma->vm_end), %false otherwise.
> > > + */
> > >  static inline bool range_in_vma(const struct vm_area_struct *vma,
> > >                                 unsigned long start, unsigned long end)
> > >  {
> > > -       return (vma && vma->vm_start <= start && end <= vma->vm_end);
> > > +       if (!vma)
> > > +               return false;
> > > +
> > > +       return range_is_subset(vma->vm_start, vma->vm_end, start, end);
> > > +}
> > > +
> > > +/**
> > > + * range_in_vma_desc - is the specified [@start, @end) range a subset of the VMA
> > > + * described by @desc, a VMA descriptor?
> > > + * @desc: The VMA descriptor against which we want to check [@start, @end).
> > > + * @start: The start of the range we wish to check.
> > > + * @end: The exclusive end of the range we wish to check.
> > > + *
> > > + * Returns: %true if [@start, @end) is a subset of [@desc->start, @desc->end),
> > > + * %false otherwise.
> > > + */
> > > +static inline bool range_in_vma_desc(const struct vm_area_desc *desc,
> > > +                                    unsigned long start, unsigned long end)
> > > +{
> > > +       if (!desc)
> > > +               return false;
> > > +
> > > +       return range_is_subset(desc->start, desc->end, start, end);
> > >  }
> > >
> > >  #ifdef CONFIG_MMU
> > > @@ -4427,6 +4515,9 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> > >  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> > >  int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> > >                         struct page **pages, unsigned long *num);
> > > +int map_kernel_pages_prepare(struct vm_area_desc *desc);
> > > +int map_kernel_pages_complete(struct vm_area_struct *vma,
> > > +                             struct mmap_action *action);
> > >  int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
> > >                                 unsigned long num);
> > >  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 7538d64f8848..c46224020a46 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -815,6 +815,7 @@ enum mmap_action_type {
> > >         MMAP_REMAP_PFN,         /* Remap PFN range. */
> > >         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
> > >         MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> > > +       MMAP_MAP_KERNEL_PAGES,  /* Map kernel page range from array. */
> > >  };
> > >
> > >  /*
> > > @@ -833,6 +834,12 @@ struct mmap_action {
> > >                         phys_addr_t start_phys_addr;
> > >                         unsigned long size;
> > >                 } simple_ioremap;
> > > +               struct {
> > > +                       unsigned long start;
> > > +                       struct page **pages;
> > > +                       unsigned long nr_pages;
> > > +                       pgoff_t pgoff;
> > > +               } map_kernel;
> > >         };
> > >         enum mmap_action_type type;
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f3f4046aee97..849d5d9eeb83 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2484,13 +2484,14 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
> > >  int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> > >                         struct page **pages, unsigned long *num)
> > >  {
> > > -       const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1;
> > > +       const unsigned long nr_pages = *num;
> > > +       const unsigned long end = addr + PAGE_SIZE * nr_pages;
> > >
> > > -       if (addr < vma->vm_start || end_addr >= vma->vm_end)
> > > +       if (!range_in_vma(vma, addr, end))
> > >                 return -EFAULT;
> > >         if (!(vma->vm_flags & VM_MIXEDMAP)) {
> > > -               BUG_ON(mmap_read_trylock(vma->vm_mm));
> > > -               BUG_ON(vma->vm_flags & VM_PFNMAP);
> > > +               VM_WARN_ON_ONCE(mmap_read_trylock(vma->vm_mm));
> > > +               VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
> > >                 vm_flags_set(vma, VM_MIXEDMAP);
> > >         }
> > >         /* Defer page refcount checking till we're about to map that page. */
> > > @@ -2498,6 +2499,39 @@ int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
> > >  }
> > >  EXPORT_SYMBOL(vm_insert_pages);
> > >
> > > +int map_kernel_pages_prepare(struct vm_area_desc *desc)
> > > +{
> > > +       const struct mmap_action *action = &desc->action;
> > > +       const unsigned long addr = action->map_kernel.start;
> > > +       unsigned long nr_pages, end;
> > > +
> > > +       if (!vma_desc_test(desc, VMA_MIXEDMAP_BIT)) {
> > > +               VM_WARN_ON_ONCE(mmap_read_trylock(desc->mm));
> > > +               VM_WARN_ON_ONCE(vma_desc_test(desc, VMA_PFNMAP_BIT));
> > > +               vma_desc_set_flags(desc, VMA_MIXEDMAP_BIT);
> > > +       }
> > > +
> > > +       nr_pages = action->map_kernel.nr_pages;
> > > +       end = addr + PAGE_SIZE * nr_pages;
> > > +       if (!range_in_vma_desc(desc, addr, end))
> > > +               return -EFAULT;
> > > +
> > > +       return 0;
> > > +}
> > > +EXPORT_SYMBOL(map_kernel_pages_prepare);
> > > +
> > > +int map_kernel_pages_complete(struct vm_area_struct *vma,
> > > +                             struct mmap_action *action)
> > > +{
> > > +       unsigned long nr_pages;
> > > +
> > > +       nr_pages = action->map_kernel.nr_pages;
> > > +       return insert_pages(vma, action->map_kernel.start,
> > > +                           action->map_kernel.pages,
> > > +                           &nr_pages, vma->vm_page_prot);
> > > +}
> > > +EXPORT_SYMBOL(map_kernel_pages_complete);
> > > +
> > >  /**
> > >   * vm_insert_page - insert single page into user vma
> > >   * @vma: user vma to map to
> > > diff --git a/mm/util.c b/mm/util.c
> > > index a166c48fe894..dea590e7a26c 100644
> > > --- a/mm/util.c
> > > +++ b/mm/util.c
> > > @@ -1441,6 +1441,8 @@ int mmap_action_prepare(struct vm_area_desc *desc)
> > >                 return io_remap_pfn_range_prepare(desc);
> > >         case MMAP_SIMPLE_IO_REMAP:
> > >                 return simple_ioremap_prepare(desc);
> > > +       case MMAP_MAP_KERNEL_PAGES:
> > > +               return map_kernel_pages_prepare(desc);
> > >         }
> > >
> > >         WARN_ON_ONCE(1);
> > > @@ -1472,6 +1474,9 @@ int mmap_action_complete(struct vm_area_struct *vma,
> > >         case MMAP_IO_REMAP_PFN:
> > >                 err = io_remap_pfn_range_complete(vma, action);
> > >                 break;
> > > +       case MMAP_MAP_KERNEL_PAGES:
> > > +               err = map_kernel_pages_complete(vma, action);
> > > +               break;
> > >         case MMAP_SIMPLE_IO_REMAP:
> > >                 /*
> > >                  * The simple I/O remap should have been delegated to an I/O
> > > @@ -1494,6 +1499,7 @@ int mmap_action_prepare(struct vm_area_desc *desc)
> > >         case MMAP_REMAP_PFN:
> > >         case MMAP_IO_REMAP_PFN:
> > >         case MMAP_SIMPLE_IO_REMAP:
> > > +       case MMAP_MAP_KERNEL_PAGES:
> > >                 WARN_ON_ONCE(1); /* nommu cannot handle these. */
> > >                 break;
> > >         }
> > > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > > index 6658df26698a..4407caf207ad 100644
> > > --- a/tools/testing/vma/include/dup.h
> > > +++ b/tools/testing/vma/include/dup.h
> > > @@ -454,6 +454,7 @@ enum mmap_action_type {
> > >         MMAP_REMAP_PFN,         /* Remap PFN range. */
> > >         MMAP_IO_REMAP_PFN,      /* I/O remap PFN range. */
> > >         MMAP_SIMPLE_IO_REMAP,   /* I/O remap with guardrails. */
> > > +       MMAP_MAP_KERNEL_PAGES,  /* Map kernel page range from an array. */
> > >  };
> > >
> > >  /*
> > > @@ -472,6 +473,12 @@ struct mmap_action {
> > >                         phys_addr_t start;
> > >                         unsigned long len;
> > >                 } simple_ioremap;
> > > +               struct {
> > > +                       unsigned long start;
> > > +                       struct page **pages;
> > > +                       unsigned long num;
> > > +                       pgoff_t pgoff;
> > > +               } map_kernel;
> > >         };
> > >         enum mmap_action_type type;
> > >
> > > --
> > > 2.53.0
> > >

^ permalink raw reply

* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-19 15:19 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <8cdad898-b306-40fe-a367-efe7147f83b9@lucifer.local>

On Thu, Mar 19, 2026 at 7:55 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Tue, Mar 17, 2026 at 02:32:16PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Mar 17, 2026 at 2:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > >
> > > > The f_op->mmap interface is deprecated, so update driver to use its
> > > > successor, mmap_prepare.
> > > >
> > > > The driver previously used vm_iomap_memory(), so this change replaces it
> > > > with its mmap_prepare equivalent, mmap_action_simple_ioremap().
> > > >
> > > > Functions that wrap mmap() are also converted to wrap mmap_prepare()
> > > > instead.
> > > >
> > > > Also update the documentation accordingly.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > > ---
> > > >  Documentation/driver-api/vme.rst    |  2 +-
> > > >  drivers/staging/vme_user/vme.c      | 20 +++++------
> > > >  drivers/staging/vme_user/vme.h      |  2 +-
> > > >  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
> > > >  4 files changed, 42 insertions(+), 33 deletions(-)
> > > >
> > > > diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> > > > index c0b475369de0..7111999abc14 100644
> > > > --- a/Documentation/driver-api/vme.rst
> > > > +++ b/Documentation/driver-api/vme.rst
> > > > @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
> > > >
> > > >  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
> > > >  do a read-modify-write transaction. Parts of a VME window can also be mapped
> > > > -into user space memory using :c:func:`vme_master_mmap`.
> > > > +into user space memory using :c:func:`vme_master_mmap_prepare`.
> > > >
> > > >
> > > >  Slave windows
> > > > diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> > > > index f10a00c05f12..7220aba7b919 100644
> > > > --- a/drivers/staging/vme_user/vme.c
> > > > +++ b/drivers/staging/vme_user/vme.c
> > > > @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
> > > >  EXPORT_SYMBOL(vme_master_rmw);
> > > >
> > > >  /**
> > > > - * vme_master_mmap - Mmap region of VME master window.
> > > > + * vme_master_mmap_prepare - Mmap region of VME master window.
> > > >   * @resource: Pointer to VME master resource.
> > > > - * @vma: Pointer to definition of user mapping.
> > > > + * @desc: Pointer to descriptor of user mapping.
> > > >   *
> > > >   * Memory map a region of the VME master window into user space.
> > > >   *
> > > > @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
> > > >   *         resource or -EFAULT if map exceeds window size. Other generic mmap
> > > >   *         errors may also be returned.
> > > >   */
> > > > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > > > +int vme_master_mmap_prepare(struct vme_resource *resource,
> > > > +                           struct vm_area_desc *desc)
> > > >  {
> > > > +       const unsigned long vma_size = vma_desc_size(desc);
> > > >         struct vme_bridge *bridge = find_bridge(resource);
> > > >         struct vme_master_resource *image;
> > > >         phys_addr_t phys_addr;
> > > > -       unsigned long vma_size;
> > > >
> > > >         if (resource->type != VME_MASTER) {
> > > >                 dev_err(bridge->parent, "Not a master resource\n");
> > > > @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > > >         }
> > > >
> > > >         image = list_entry(resource->entry, struct vme_master_resource, list);
> > > > -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> > > > -       vma_size = vma->vm_end - vma->vm_start;
> > > > +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
> > > >
> > > >         if (phys_addr + vma_size > image->bus_resource.end + 1) {
> > > >                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
> > > >                 return -EFAULT;
> > > >         }
> > > >
> > > > -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > -
> > > > -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> > > > +       desc->page_prot = pgprot_noncached(desc->page_prot);
> > > > +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> > > > +       return 0;
> > > >  }
> > > > -EXPORT_SYMBOL(vme_master_mmap);
> > > > +EXPORT_SYMBOL(vme_master_mmap_prepare);
> > > >
> > > >  /**
> > > >   * vme_master_free - Free VME master window
> > > > diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> > > > index 797e9940fdd1..b6413605ea49 100644
> > > > --- a/drivers/staging/vme_user/vme.h
> > > > +++ b/drivers/staging/vme_user/vme.h
> > > > @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
> > > >  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
> > > >  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
> > > >                             unsigned int swap, loff_t offset);
> > > > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> > > > +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
> > > >  void vme_master_free(struct vme_resource *resource);
> > > >
> > > >  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> > > > diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> > > > index d95dd7d9190a..11e25c2f6b0a 100644
> > > > --- a/drivers/staging/vme_user/vme_user.c
> > > > +++ b/drivers/staging/vme_user/vme_user.c
> > > > @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
> > > >         kfree(vma_priv);
> > > >  }
> > > >
> > > > -static const struct vm_operations_struct vme_user_vm_ops = {
> > > > -       .open = vme_user_vm_open,
> > > > -       .close = vme_user_vm_close,
> > > > -};
> > > > -
> > > > -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > > > +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                             const struct file *file, void **vm_private_data)
> > > >  {
> > > > -       int err;
> > > > +       const unsigned int minor = iminor(file_inode(file));
> > > >         struct vme_user_vma_priv *vma_priv;
> > > >
> > > >         mutex_lock(&image[minor].mutex);
> > > >
> > > > -       err = vme_master_mmap(image[minor].resource, vma);
> > > > -       if (err) {
> > > > -               mutex_unlock(&image[minor].mutex);
> > > > -               return err;
> > > > -       }
> > > > -
> > >
> > > Ok, this changes the set of the operations performed under image[minor].mutex.
> > > Before we had:
> > >
> > > mutex_lock(&image[minor].mutex);
> > > vme_master_mmap();
> > > <some final adjustments>
> > > mutex_unlock(&image[minor].mutex);
> > >
> > > Now we have:
> > >
> > > mutex_lock(&image[minor].mutex);
> > > vme_master_mmap_prepare()
> > > mutex_unlock(&image[minor].mutex);
> > > vm_iomap_memory();
> > > mutex_lock(&image[minor].mutex);
> > > vme_user_vm_mapped(); // <some final adjustments>
> > > mutex_unlock(&image[minor].mutex);
> > >
> > > I think as long as image[minor] does not change while we are not
> > > holding the mutex we should be safe, and looking at the code it seems
> > > to be the case. But I'm not familiar with this driver and might be
> > > wrong. Worth double-checking.
>
> The file is pinned for the duration, the mutex is associated with the file,
> so there's no sane world in which that could be problematic.
>
> Keeping in mind that we manipulate stuff in vme_user_vm_close() that
> directly accesses image[minor] at an arbitrary time.

That was my understanding as well. Thanks for confirming.
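
To make the sequence discussed above concrete, here is a rough userspace
sketch of the two-phase pattern: the mutex is taken and dropped in the
prepare step, the mapping is set up with no lock held, and the mutex is
taken again in the mapped step for the bookkeeping. Everything here is a
stand-in invented for illustration (struct image_sketch,
sketch_mmap_prepare(), sketch_mapped(), the pthread mutex); it is not the
driver's code and says nothing about what the driver must do.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-minor image[] state. */
struct image_sketch {
	pthread_mutex_t mutex;
	int resource_ok;	/* stands in for "is a valid master resource" */
	int mmap_count;
};

/* Hypothetical stand-in for struct vme_user_vma_priv. */
struct vma_priv_sketch {
	int refcnt;
};

/* Phase 1: validate under the lock, record nothing that needs undoing. */
static int sketch_mmap_prepare(struct image_sketch *img)
{
	int err;

	pthread_mutex_lock(&img->mutex);
	err = img->resource_ok ? 0 : -ENODEV;
	pthread_mutex_unlock(&img->mutex);
	return err;
}

/* Phase 2: runs after the mapping exists; does the bookkeeping. */
static int sketch_mapped(struct image_sketch *img, void **private_data)
{
	struct vma_priv_sketch *priv;

	assert(private_data != NULL);

	pthread_mutex_lock(&img->mutex);
	priv = malloc(sizeof(*priv));
	if (!priv) {
		pthread_mutex_unlock(&img->mutex);
		return -ENOMEM;
	}
	priv->refcnt = 1;
	*private_data = priv;
	img->mmap_count++;
	pthread_mutex_unlock(&img->mutex);
	return 0;
}
```

Because phase 1 leaves no state behind, a failure between the two phases
has nothing to roll back, which is the property the thread relies on.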

>
> >
> > A side note: if we had to hold the mutex across all those operations I
> > think we would need to take the mutex in the vm_ops->mmap_prepare and
> > add a vm_ops->map_failed hook or something along that line to drop the
> > mutex in case mmap_action_complete() fails. Not sure if we will have
> > such cases though...
>
> No, I don't want to do this if it can be at all avoided. You should in
> nearly any sane circumstance be able to defer things until the mapped hook
> anyway.
>
> Also, a merge can happen after an .mmap_prepare too, so we'd have to have
> some 'success' hook, and I'm just not going there; it'll end up open to
> abuse again.
>
> (We do have success and error filtering hooks right now, sadly, but they're
> really for hugetlb and I plan to find a way to get rid of them).
>
> The mmap_prepare is meant to essentially be as stateless as possible.

Yes, I also hope we won't encounter cases requiring us to keep any
state information between the mmap_prepare and mapped stages.

>
> Anyway I don't think it's relevant here.
>
> >
> > >
> > > >         vma_priv = kmalloc_obj(*vma_priv);
> > > >         if (!vma_priv) {
> > > >                 mutex_unlock(&image[minor].mutex);
> > > > @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > > >
> > > >         vma_priv->minor = minor;
> > > >         refcount_set(&vma_priv->refcnt, 1);
> > > > -       vma->vm_ops = &vme_user_vm_ops;
> > > > -       vma->vm_private_data = vma_priv;
> > > > -
> > > > +       *vm_private_data = vma_priv;
> > > >         image[minor].mmap_count++;
> > > >
> > > >         mutex_unlock(&image[minor].mutex);
> > > > -
> > > >         return 0;
> > > >  }
> > > >
> > > > -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> > > > +static const struct vm_operations_struct vme_user_vm_ops = {
> > > > +       .mapped = vme_user_vm_mapped,
> > > > +       .open = vme_user_vm_open,
> > > > +       .close = vme_user_vm_close,
> > > > +};
> > > > +
> > > > +static int vme_user_master_mmap_prepare(unsigned int minor,
> > > > +                                       struct vm_area_desc *desc)
> > > > +{
> > > > +       int err;
> > > > +
> > > > +       mutex_lock(&image[minor].mutex);
> > > > +
> > > > +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> > > > +       if (!err)
> > > > +               desc->vm_ops = &vme_user_vm_ops;
> > > > +
> > > > +       mutex_unlock(&image[minor].mutex);
> > > > +       return err;
> > > > +}
> > > > +
> > > > +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
> > > >  {
> > > > -       unsigned int minor = iminor(file_inode(file));
> > > > +       const struct file *file = desc->file;
> > > > +       const unsigned int minor = iminor(file_inode(file));
> > > >
> > > >         if (type[minor] == MASTER_MINOR)
> > > > -               return vme_user_master_mmap(minor, vma);
> > > > +               return vme_user_master_mmap_prepare(minor, desc);
> > > >
> > > >         return -ENODEV;
> > > >  }
> > > > @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
> > > >         .llseek = vme_user_llseek,
> > > >         .unlocked_ioctl = vme_user_unlocked_ioctl,
> > > >         .compat_ioctl = compat_ptr_ioctl,
> > > > -       .mmap = vme_user_mmap,
> > > > +       .mmap_prepare = vme_user_mmap_prepare,
> > > >  };
> > > >
> > > >  static int vme_user_match(struct vme_dev *vdev)
> > > > --
> > > > 2.53.0
> > > >
>
> Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH v2 11/16] staging: vme_user: replace deprecated mmap hook with mmap_prepare
From: Suren Baghdasaryan @ 2026-03-19 15:19 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpHXqtxZr5s84jCcz513a2pgMeDoobsLBJH9pSON49cM+w@mail.gmail.com>

On Thu, Mar 19, 2026 at 8:19 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Mar 19, 2026 at 7:55 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Tue, Mar 17, 2026 at 02:32:16PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Mar 17, 2026 at 2:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Mon, Mar 16, 2026 at 2:14 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > > > >
> > > > > The f_op->mmap interface is deprecated, so update driver to use its
> > > > > successor, mmap_prepare.
> > > > >
> > > > > The driver previously used vm_iomap_memory(), so this change replaces it
> > > > > with its mmap_prepare equivalent, mmap_action_simple_ioremap().
> > > > >
> > > > > Functions that wrap mmap() are also converted to wrap mmap_prepare()
> > > > > instead.
> > > > >
> > > > > Also update the documentation accordingly.
> > > > >
> > > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> > > > > ---
> > > > >  Documentation/driver-api/vme.rst    |  2 +-
> > > > >  drivers/staging/vme_user/vme.c      | 20 +++++------
> > > > >  drivers/staging/vme_user/vme.h      |  2 +-
> > > > >  drivers/staging/vme_user/vme_user.c | 51 +++++++++++++++++------------
> > > > >  4 files changed, 42 insertions(+), 33 deletions(-)
> > > > >
> > > > > diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst
> > > > > index c0b475369de0..7111999abc14 100644
> > > > > --- a/Documentation/driver-api/vme.rst
> > > > > +++ b/Documentation/driver-api/vme.rst
> > > > > @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and
> > > > >
> > > > >  In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to
> > > > >  do a read-modify-write transaction. Parts of a VME window can also be mapped
> > > > > -into user space memory using :c:func:`vme_master_mmap`.
> > > > > +into user space memory using :c:func:`vme_master_mmap_prepare`.
> > > > >
> > > > >
> > > > >  Slave windows
> > > > > diff --git a/drivers/staging/vme_user/vme.c b/drivers/staging/vme_user/vme.c
> > > > > index f10a00c05f12..7220aba7b919 100644
> > > > > --- a/drivers/staging/vme_user/vme.c
> > > > > +++ b/drivers/staging/vme_user/vme.c
> > > > > @@ -735,9 +735,9 @@ unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask,
> > > > >  EXPORT_SYMBOL(vme_master_rmw);
> > > > >
> > > > >  /**
> > > > > - * vme_master_mmap - Mmap region of VME master window.
> > > > > + * vme_master_mmap_prepare - Mmap region of VME master window.
> > > > >   * @resource: Pointer to VME master resource.
> > > > > - * @vma: Pointer to definition of user mapping.
> > > > > + * @desc: Pointer to descriptor of user mapping.
> > > > >   *
> > > > >   * Memory map a region of the VME master window into user space.
> > > > >   *
> > > > > @@ -745,12 +745,13 @@ EXPORT_SYMBOL(vme_master_rmw);
> > > > >   *         resource or -EFAULT if map exceeds window size. Other generic mmap
> > > > >   *         errors may also be returned.
> > > > >   */
> > > > > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > > > > +int vme_master_mmap_prepare(struct vme_resource *resource,
> > > > > +                           struct vm_area_desc *desc)
> > > > >  {
> > > > > +       const unsigned long vma_size = vma_desc_size(desc);
> > > > >         struct vme_bridge *bridge = find_bridge(resource);
> > > > >         struct vme_master_resource *image;
> > > > >         phys_addr_t phys_addr;
> > > > > -       unsigned long vma_size;
> > > > >
> > > > >         if (resource->type != VME_MASTER) {
> > > > >                 dev_err(bridge->parent, "Not a master resource\n");
> > > > > @@ -758,19 +759,18 @@ int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma)
> > > > >         }
> > > > >
> > > > >         image = list_entry(resource->entry, struct vme_master_resource, list);
> > > > > -       phys_addr = image->bus_resource.start + (vma->vm_pgoff << PAGE_SHIFT);
> > > > > -       vma_size = vma->vm_end - vma->vm_start;
> > > > > +       phys_addr = image->bus_resource.start + (desc->pgoff << PAGE_SHIFT);
> > > > >
> > > > >         if (phys_addr + vma_size > image->bus_resource.end + 1) {
> > > > >                 dev_err(bridge->parent, "Map size cannot exceed the window size\n");
> > > > >                 return -EFAULT;
> > > > >         }
> > > > >
> > > > > -       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > > > -
> > > > > -       return vm_iomap_memory(vma, phys_addr, vma->vm_end - vma->vm_start);
> > > > > +       desc->page_prot = pgprot_noncached(desc->page_prot);
> > > > > +       mmap_action_simple_ioremap(desc, phys_addr, vma_size);
> > > > > +       return 0;
> > > > >  }
> > > > > -EXPORT_SYMBOL(vme_master_mmap);
> > > > > +EXPORT_SYMBOL(vme_master_mmap_prepare);
> > > > >
> > > > >  /**
> > > > >   * vme_master_free - Free VME master window
> > > > > diff --git a/drivers/staging/vme_user/vme.h b/drivers/staging/vme_user/vme.h
> > > > > index 797e9940fdd1..b6413605ea49 100644
> > > > > --- a/drivers/staging/vme_user/vme.h
> > > > > +++ b/drivers/staging/vme_user/vme.h
> > > > > @@ -151,7 +151,7 @@ ssize_t vme_master_read(struct vme_resource *resource, void *buf, size_t count,
> > > > >  ssize_t vme_master_write(struct vme_resource *resource, void *buf, size_t count, loff_t offset);
> > > > >  unsigned int vme_master_rmw(struct vme_resource *resource, unsigned int mask, unsigned int compare,
> > > > >                             unsigned int swap, loff_t offset);
> > > > > -int vme_master_mmap(struct vme_resource *resource, struct vm_area_struct *vma);
> > > > > +int vme_master_mmap_prepare(struct vme_resource *resource, struct vm_area_desc *desc);
> > > > >  void vme_master_free(struct vme_resource *resource);
> > > > >
> > > > >  struct vme_resource *vme_dma_request(struct vme_dev *vdev, u32 route);
> > > > > diff --git a/drivers/staging/vme_user/vme_user.c b/drivers/staging/vme_user/vme_user.c
> > > > > index d95dd7d9190a..11e25c2f6b0a 100644
> > > > > --- a/drivers/staging/vme_user/vme_user.c
> > > > > +++ b/drivers/staging/vme_user/vme_user.c
> > > > > @@ -446,24 +446,14 @@ static void vme_user_vm_close(struct vm_area_struct *vma)
> > > > >         kfree(vma_priv);
> > > > >  }
> > > > >
> > > > > -static const struct vm_operations_struct vme_user_vm_ops = {
> > > > > -       .open = vme_user_vm_open,
> > > > > -       .close = vme_user_vm_close,
> > > > > -};
> > > > > -
> > > > > -static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > > > > +static int vme_user_vm_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > > +                             const struct file *file, void **vm_private_data)
> > > > >  {
> > > > > -       int err;
> > > > > +       const unsigned int minor = iminor(file_inode(file));
> > > > >         struct vme_user_vma_priv *vma_priv;
> > > > >
> > > > >         mutex_lock(&image[minor].mutex);
> > > > >
> > > > > -       err = vme_master_mmap(image[minor].resource, vma);
> > > > > -       if (err) {
> > > > > -               mutex_unlock(&image[minor].mutex);
> > > > > -               return err;
> > > > > -       }
> > > > > -
> > > >
> > > > Ok, this changes the set of the operations performed under image[minor].mutex.
> > > > Before we had:
> > > >
> > > > mutex_lock(&image[minor].mutex);
> > > > vme_master_mmap();
> > > > <some final adjustments>
> > > > mutex_unlock(&image[minor].mutex);
> > > >
> > > > Now we have:
> > > >
> > > > mutex_lock(&image[minor].mutex);
> > > > vme_master_mmap_prepare()
> > > > mutex_unlock(&image[minor].mutex);
> > > > vm_iomap_memory();
> > > > mutex_lock(&image[minor].mutex);
> > > > vme_user_vm_mapped(); // <some final adjustments>
> > > > mutex_unlock(&image[minor].mutex);
> > > >
> > > > I think as long as image[minor] does not change while we are not
> > > > holding the mutex we should be safe, and looking at the code it seems
> > > > to be the case. But I'm not familiar with this driver and might be
> > > > wrong. Worth double-checking.
> >
> > The file is pinned for the duration, the mutex is associated with the file,
> > so there's no sane world in which that could be problematic.
> >
> > Keeping in mind that we manipulate stuff on vme_user_vm_close() that
> > directly accesses image[minor] at an arbitrary time.
>
> That was my understanding as well. Thanks for confirming.
>
> >
> > >
> > > A side note: if we had to hold the mutex across all those operations I
> > > think we would need to take the mutex in the vm_ops->mmap_prepare and
> > > add a vm_ops->map_failed hook or something along that line to drop the
> > > mutex in case mmap_action_complete() fails. Not sure if we will have
> > > such cases though...
> >
> > No, I don't want to do this if it can be at all avoided. You should in
> > nearly any sane circumstance be able to defer things until the mapped hook
> > anyway.
> >
> > Also a merge can happen after an .mmap_prepare, so we'd have to have
> > some 'success' hook, and I'm just not going there; it'll end up open to
> > abuse again.
> >
> > (We do have success and error filtering hooks right now, sadly, but they're
> > really for hugetlb and I plan to find a way to get rid of them).
> >
> > The mmap_prepare is meant to essentially be as stateless as possible.
>
> Yes, I also hope we won't encounter cases requiring us to keep any
> state information between the mmap_prepare and mapped stages.
>
> >
> > Anyway I don't think it's relevant here.
> >
> > >
> > > >
> > > > >         vma_priv = kmalloc_obj(*vma_priv);
> > > > >         if (!vma_priv) {
> > > > >                 mutex_unlock(&image[minor].mutex);
> > > > > @@ -472,22 +462,41 @@ static int vme_user_master_mmap(unsigned int minor, struct vm_area_struct *vma)
> > > > >
> > > > >         vma_priv->minor = minor;
> > > > >         refcount_set(&vma_priv->refcnt, 1);
> > > > > -       vma->vm_ops = &vme_user_vm_ops;
> > > > > -       vma->vm_private_data = vma_priv;
> > > > > -
> > > > > +       *vm_private_data = vma_priv;
> > > > >         image[minor].mmap_count++;
> > > > >
> > > > >         mutex_unlock(&image[minor].mutex);
> > > > > -
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > -static int vme_user_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > +static const struct vm_operations_struct vme_user_vm_ops = {
> > > > > +       .mapped = vme_user_vm_mapped,
> > > > > +       .open = vme_user_vm_open,
> > > > > +       .close = vme_user_vm_close,
> > > > > +};
> > > > > +
> > > > > +static int vme_user_master_mmap_prepare(unsigned int minor,
> > > > > +                                       struct vm_area_desc *desc)
> > > > > +{
> > > > > +       int err;
> > > > > +
> > > > > +       mutex_lock(&image[minor].mutex);
> > > > > +
> > > > > +       err = vme_master_mmap_prepare(image[minor].resource, desc);
> > > > > +       if (!err)
> > > > > +               desc->vm_ops = &vme_user_vm_ops;
> > > > > +
> > > > > +       mutex_unlock(&image[minor].mutex);
> > > > > +       return err;
> > > > > +}
> > > > > +
> > > > > +static int vme_user_mmap_prepare(struct vm_area_desc *desc)
> > > > >  {
> > > > > -       unsigned int minor = iminor(file_inode(file));
> > > > > +       const struct file *file = desc->file;
> > > > > +       const unsigned int minor = iminor(file_inode(file));
> > > > >
> > > > >         if (type[minor] == MASTER_MINOR)
> > > > > -               return vme_user_master_mmap(minor, vma);
> > > > > +               return vme_user_master_mmap_prepare(minor, desc);
> > > > >
> > > > >         return -ENODEV;
> > > > >  }
> > > > > @@ -498,7 +507,7 @@ static const struct file_operations vme_user_fops = {
> > > > >         .llseek = vme_user_llseek,
> > > > >         .unlocked_ioctl = vme_user_unlocked_ioctl,
> > > > >         .compat_ioctl = compat_ptr_ioctl,
> > > > > -       .mmap = vme_user_mmap,
> > > > > +       .mmap_prepare = vme_user_mmap_prepare,
> > > > >  };
> > > > >
> > > > >  static int vme_user_match(struct vme_dev *vdev)
> > > > > --
> > > > > 2.53.0
> > > > >
> >
> > Cheers, Lorenzo
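
The lock ordering discussed in the review above can be pictured with a small
userspace model. This is only a sketch under stated assumptions: struct
image_slot, the toy_mutex helpers and core_remap() are hypothetical stand-ins
invented for illustration, not the driver's or the mm core's code. It shows
the lock being taken and dropped in the prepare stage, not held during the
remap step, and taken again in the mapped stage where the bookkeeping happens.

```c
#include <assert.h>
#include <errno.h>

/* Toy single-threaded lock: it only tracks whether it is currently held. */
struct toy_mutex {
	int held;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);
	m->held = 1;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/* Hypothetical stand-in for one image[minor] slot in the driver. */
struct image_slot {
	struct toy_mutex mutex;
	unsigned long window_size;
	int mmap_count;
};

/* Stage 1: validate the request under the lock; no state is modified. */
int slot_mmap_prepare(struct image_slot *s, unsigned long off,
		      unsigned long len)
{
	int err = 0;

	toy_lock(&s->mutex);
	if (off > s->window_size || len > s->window_size - off)
		err = -EFAULT;
	toy_unlock(&s->mutex);
	return err;
}

/* Middle step: stands in for the core doing the remap, lock not held. */
static int core_remap(const struct image_slot *s)
{
	return s->mutex.held ? -EDEADLK : 0;
}

/* Stage 2: the mapping exists, so take the lock again and account for it. */
void slot_mapped(struct image_slot *s)
{
	toy_lock(&s->mutex);
	s->mmap_count++;
	toy_unlock(&s->mutex);
}

/* Teardown can run at any later time and takes the same lock. */
void slot_close(struct image_slot *s)
{
	toy_lock(&s->mutex);
	s->mmap_count--;
	toy_unlock(&s->mutex);
}

/* The whole sequence, in the order sketched in the review. */
int slot_mmap(struct image_slot *s, unsigned long off, unsigned long len)
{
	int err = slot_mmap_prepare(s, off, len);

	if (!err)
		err = core_remap(s);
	if (!err)
		slot_mapped(s);
	return err;
}
```

In this model a request that fails validation never reaches slot_mapped(), so
mmap_count is only touched for mappings that actually exist. That is the
property the two-stage split relies on, provided nothing in the slot changes
while the lock is dropped.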


* Re: [PATCH v2 06/16] mm: add mmap_action_simple_ioremap()
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 16:29 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <330f3614-7dc1-4e80-96c4-8472b25108bb@lucifer.local>

On Wed, Mar 18, 2026 at 08:39:25PM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Mon, Mar 16, 2026 at 09:14:28PM -0700, Suren Baghdasaryan wrote:
> > > +int simple_ioremap_prepare(struct vm_area_desc *desc)
> > > +{
> > > +       struct mmap_action *action = &desc->action;
> > > +       const phys_addr_t start = action->simple_ioremap.start_phys_addr;
> > > +       const unsigned long size = action->simple_ioremap.size;
> > > +       unsigned long pfn;
> > > +       int err;
> > > +
> > > +       err = __simple_ioremap_prep(desc->start, desc->end, desc->pgoff,
> > > +                                   start, size, &pfn);
> > > +       if (err)
> > > +               return err;
> > > +
> > > +       /* The I/O remap logic does the heavy lifting. */
> > > +       mmap_action_ioremap(desc, desc->start, pfn, vma_desc_size(desc));
> >
> > nit: Looks like a perfect opportunity to use mmap_action_ioremap_full() here.
>
> Yeah can do!
>
> >
> > > +       return mmap_action_prepare(desc);
> >
> > Ok, so IIUC this uses recursion:
> > mmap_action_prepare(MMAP_SIMPLE_IO_REMAP) -> simple_ioremap_prepare()
> > -> mmap_action_prepare(MMAP_IO_REMAP_PFN).
>
> Yep, it's one level, I think that should be ok? :)

On second thoughts, it's silly not to just call io_remap_pfn_range_prepare()
directly, so I will change it to do that!

Cheers, Lorenzo


* RE: [PATCH v12 0/5] shut down devices asynchronously
From: Michael Kelley @ 2026-03-19 17:13 UTC (permalink / raw)
  To: David Jeffery, linux-kernel@vger.kernel.org,
	driver-core@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich
  Cc: Tarun Sahu, Pasha Tatashin, Michał Cłapiński,
	Jordan Richards, Ewan Milne, John Meneghini, Lombardi, Maurizio,
	Stuart Hayes, Laurence Oberman, Bart Van Assche, Bjorn Helgaas,
	linux-hyperv@vger.kernel.org
In-Reply-To: <20260319141142.5781-1-djeffery@redhat.com>

From: David Jeffery <djeffery@redhat.com>
> 
> This patchset allows the kernel to shut down devices asynchronously and
> unrelated async devices to be shut down in parallel to each other.
> 
> Only devices which explicitly enable it are shut down asynchronously. The
> default is for a device to be shut down from the synchronous shutdown loop.
> 
> This can dramatically reduce system shutdown/reboot time on systems that
> have multiple devices that take many seconds to shut down (like certain
> NVMe drives). On one system tested, the shutdown time went from 11 minutes
> without this patch to 55 seconds with the patch. And on another system from
> 80 seconds to 11.
> 

(Copying the linux-hyperv mailing list as FYI.)

Tested this patch set on two different x86/x64 VMs in the Azure public cloud.
Baseline kernel is linux-next20260312. The VMs are running on Hyper-V with
synthetic SCSI devices, PCI pass-thru NVMe disks, and emulated PCI NVMe
disks, depending on the VM configuration.

First VM is an Azure L64s_v3, with 2 synthetic SCSI disks and 8 PCI NVMe
pass-thru disks. Time spent in device_shutdown() was reduced from
683 milliseconds to 150 milliseconds (averaged across 4 runs each).

Second VM is an Azure D32lds_v6, with 2 emulated NVMe disks and
4 NVMe pass-thru disks. Time spent in device_shutdown() was reduced from
1010 milliseconds to 610 milliseconds (averaged across 2 runs each).

In both cases, the results seem reasonable. None of these disks should
be particularly slow in shutting down, so the results are not as dramatic
as reported by David. But there is non-trivial improvement nonetheless.

Tested-by: Michael Kelley <mhklinux@outlook.com>
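
For reference, the measurements above work out to roughly a 78% reduction
(about 4.6x faster) on the L64s_v3 and roughly a 40% reduction (about 1.7x
faster) on the D32lds_v6. A minimal sketch of that arithmetic follows; the
struct and function names are made up for illustration, and only the
millisecond figures come from the test report.

```c
#include <stdio.h>

/* One before/after measurement of time spent in device_shutdown(). */
struct shutdown_sample {
	const char *vm;
	double before_ms;
	double after_ms;
};

/* Figures taken from the test report above. */
static const struct shutdown_sample samples[] = {
	{ "L64s_v3",   683.0,  150.0 },
	{ "D32lds_v6", 1010.0, 610.0 },
};

/* Percentage of the original time that was saved. */
double reduction_pct(double before_ms, double after_ms)
{
	return 100.0 * (before_ms - after_ms) / before_ms;
}

/* How many times faster the new path is. */
double speedup(double before_ms, double after_ms)
{
	return before_ms / after_ms;
}

/* Print one summary line per VM. */
void print_report(void)
{
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%-10s %6.1f%% less time, %.2fx faster\n",
		       samples[i].vm,
		       reduction_pct(samples[i].before_ms, samples[i].after_ms),
		       speedup(samples[i].before_ms, samples[i].after_ms));
}
```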

> 
> Stuart Hayes (2):
>   driver core: separate function to shutdown one device
>   driver core: don't always lock parent in shutdown
> 
> David Jeffery (5):
>   driver core: async device shutdown infrastructure
>   PCI: enable async shutdown support
>   scsi: enable async shutdown support
> 
>  drivers/base/base.h       |   2 +
>  drivers/base/core.c       | 176 +++++++++++++++++++++++++++++++-------
>  drivers/pci/probe.c       |   2 +
>  drivers/scsi/hosts.c      |   3 +
>  drivers/scsi/scsi_scan.c  |   1 +
>  drivers/scsi/scsi_sysfs.c |   4 +
>  include/linux/device.h    |  13 +++
>  7 files changed, 170 insertions(+), 31 deletions(-)
> 
> --
> 2.53.0
> 


* Re: [PATCH] net: mana: fix use-after-free in add_adev() error path
From: Simon Horman @ 2026-03-19 18:18 UTC (permalink / raw)
  To: Guangshuo Li
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Saurabh Sengar, Erni Sri Satya Vennela,
	Shradha Gupta, Dipayaan Roy, Aditya Garg, Shiraz Saleem,
	Leon Romanovsky, linux-hyperv, netdev, linux-kernel, stable
In-Reply-To: <20260318154041.638747-1-lgs201920130244@gmail.com>

On Wed, Mar 18, 2026 at 11:40:41PM +0800, Guangshuo Li wrote:
> If auxiliary_device_add() fails, add_adev() calls
> auxiliary_device_uninit(adev), whose release callback adev_release()
> frees the containing struct mana_adev.
> 
> The current error path then falls through to init_fail and accesses
> adev->id. Since adev is embedded in struct mana_adev, this may lead
> to a use-after-free.

It isn't clear to me how the use-after-free manifests.
Could you elaborate?

> 
> Fix it by storing the allocated auxiliary device id in a local
> variable and using that saved id in the cleanup path after
> auxiliary_device_uninit().
> 
> Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")
> Cc: stable@vger.kernel.org
> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>

As a bug fix for code present in the net tree, this patch
should be targeted at that tree like this.

Subject: [PATCH net] ...

And it should apply to that tree.

As it is, the CI tries to apply this patch to the default tree, net-next,
which fails. So no further CI is performed.

> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 1ad154f9db1a..70d71594c599 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -3362,6 +3362,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
>  {
>  	struct auxiliary_device *adev;
>  	struct mana_adev *madev;
> +	int id;
>  	int ret;

Please preserve reverse xmas tree order for local variables - longest line
to shortest.

>  
>  	madev = kzalloc(sizeof(*madev), GFP_KERNEL);

...

-- 
pw-bot: changes-requested
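
The pattern the commit message describes can be sketched in userspace. This
is a model only: struct aux_dev, struct wrapper and every function below are
hypothetical stand-ins, not the MANA driver or the auxiliary bus API, and
whether the real error path behaves this way is exactly what the thread is
asking about. The sketch shows an id stored in an object embedded in its
container, a release callback that frees the container, and the id being
copied to a local before the release so the cleanup never reads freed memory.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an auxiliary device with a release callback. */
struct aux_dev {
	int id;
	void (*release)(struct aux_dev *dev);
};

/* Hypothetical container: the device is embedded, not pointed to. */
struct wrapper {
	struct aux_dev adev;	/* first member, so both share one address */
	int payload;
};

/* Records which id the cleanup path handed back, for inspection. */
int last_freed_id = -1;

/* Models the release callback: frees the containing object. */
static void wrapper_release(struct aux_dev *dev)
{
	free((struct wrapper *)dev);
}

/* Models dropping the last reference, which invokes the release. */
static void aux_dev_uninit(struct aux_dev *dev)
{
	dev->release(dev);
}

/* Models returning the id to an allocator. */
static void id_free(int id)
{
	last_freed_id = id;
}

/* Error path of a hypothetical add routine whose "add" step always fails. */
int add_dev_fail_path(int id)
{
	struct wrapper *w = calloc(1, sizeof(*w));
	int saved_id;

	if (!w)
		return -ENOMEM;

	w->adev.id = id;
	w->adev.release = wrapper_release;

	/* Copy the id while the container is still alive. */
	saved_id = w->adev.id;

	/* Pretend the add step failed: this frees w via the callback. */
	aux_dev_uninit(&w->adev);

	/* Reading w->adev.id here would touch freed memory; use the copy. */
	id_free(saved_id);
	return -EIO;
}
```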


* [PATCH v3 00/16] mm: expand mmap_prepare functionality and usage
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 18:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts

This series expands the mmap_prepare functionality, which is intended to
replace the deprecated f_op->mmap hook which has been the source of bugs
and security issues for some time.

This series starts with some cleanup of existing mmap_prepare logic, then
adds documentation for the mmap_prepare call to make it easier for
filesystem and driver writers to understand how it works.

It then importantly adds a vm_ops->mapped hook, a key feature that was
missing from mmap_prepare previously - this is invoked when a driver which
specifies mmap_prepare has successfully been mapped but not merged with
another VMA.

mmap_prepare is invoked prior to a merge being attempted, so you cannot
manipulate state such as reference counts as if it were a new mapping.

The vm_ops->mapped hook allows a driver to perform tasks required at this
stage, and provides symmetry against subsequent vm_ops->open,close calls.

The series uses this to correct the afs implementation which wrongly
manipulated reference count at mmap_prepare time.
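
To illustrate why reference counting belongs in the mapped stage rather than
in mmap_prepare, here is a small userspace model. It is a sketch under
assumptions: the types and functions are invented for illustration and are
not kernel code. The only behaviour taken from the description above is that
prepare runs before a merge is attempted, while the mapped stage runs only
when a new VMA results, and that each VMA is closed once.

```c
/* Hypothetical object whose lifetime is tied to its mappings. */
struct obj {
	int refs;
};

/* Where the model takes its reference. */
enum ref_stage {
	REF_IN_PREPARE,
	REF_IN_MAPPED,
};

/*
 * Simulate one mmap() call. Returns the number of new VMAs created:
 * 1 for a fresh mapping, 0 if the request merged into an existing VMA.
 */
int simulate_mmap(struct obj *o, enum ref_stage stage, int merged)
{
	/* The prepare stage always runs, before any merge is attempted. */
	if (stage == REF_IN_PREPARE)
		o->refs++;

	/* A merged request creates no new VMA and no mapped callback. */
	if (merged)
		return 0;

	/* The mapped stage runs only for a genuinely new VMA. */
	if (stage == REF_IN_MAPPED)
		o->refs++;
	return 1;
}

/* Each VMA that exists is eventually closed exactly once. */
void simulate_close(struct obj *o)
{
	o->refs--;
}

/* Map twice (the second request merges), close every VMA, count leaks. */
int leaked_refs(enum ref_stage stage)
{
	struct obj o = { 0 };
	int vmas = 0;

	vmas += simulate_mmap(&o, stage, 0);
	vmas += simulate_mmap(&o, stage, 1);

	while (vmas-- > 0)
		simulate_close(&o);

	return o.refs;
}
```

In this model, taking the reference in the prepare stage leaves one
reference behind after all VMAs are closed, because the merged request took
a reference that no close ever drops. Taking it in the mapped stage balances
exactly.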

It then adds an mmap_prepare equivalent of vm_iomap_memory() -
mmap_action_simple_ioremap(), then uses this to update a number of drivers.

It then splits out the mmap_prepare compatibility layer (which allows for
invocation of mmap_prepare hooks in an mmap() hook) in such a way as to
allow for more incremental implementation of mmap_prepare hooks.

It then uses this to extend mmap_prepare usage in drivers.

Finally it adds an mmap_prepare equivalent of vm_map_pages(), which lays
the foundation for future work which will extend mmap_prepare to DMA
coherent mappings.

v3:
* Propagated tags (thanks Suren, Richard!)
* Updated 12/16 to correctly clear the vm_area_desc data structure in
  set_desc_from_vma() as per Joshua Hahn (thanks! :)
* Fixed type in 12/16 as per Suren (cheers!)
* Fixed up 6/16 to use mmap_action_ioremap_full() in simple_ioremap_prepare() as
  suggested by Suren.
* Also fixed up 6/16 to call io_remap_pfn_range_prepare() direct rather than
  mmap_action_prepare() as per Suren.
* Also fixed up 6/16 to pass vm_len rather than vm_[start, end] to
  __simple_ioremap_prep() as per Suren (thanks for all the above! :)
* Fixed issue in rmap lock being held - we were referencing a vma->vm_file after
  the VMA was unmapped, so UAF. Avoid that. Also do_munmap() relies on rmap lock
  NOT being held or may deadlock, so extend functionality to ensure we drop it
  when it is held on error paths.
* Updated 'area' -> 'vma' variable in 3/16 in VMA test dup.h.
* Fixed up reference to __compat_vma_mmap() in 12/16 commit message.
* Updated 1/16 to no longer duplicatively apply io_remap_pfn_range_pfn().
* Updated 1/16 to delegate I/O remap complete to remap complete logic.
* Fixed various typos in 12/16.
* Fixed stale comment typos in 13/16.
* Fixed commit msg and comment typos in 14/16.
* Removed accidental sneak peek to future functionality in 15/16 commit message
  :).
* Fixed up field names to be identical in VMA tests + mm_types.h in 6/16,
  15/16.

v2:
* Rebased on
  https://lore.kernel.org/all/cover.1773665966.git.ljs@kernel.org/ to make
  Andrew's life easier :)
* Folded all interim fixes into series (thanks Randy for many doc fixes!)
* As per Suren, removed a comment about allocations too small to fail.
* As per Randy, fixed up typo in documentation for vm_area_desc.
* Fixed mmap_action_prepare() not returning if invalid action->type
  specified, as updated from Andrew's interim fix (thanks!) and also
  reported by kernel test bot.
* Updated mmap_action_prepare() and specific prepare functions to only
  pass vm_area_desc parameter as per Suren.
* Fixed up whitespace as per Suren.
* Updated vm_op->open comment in vm_operations_struct to reference forking
  as per Suren.
* Added a commit to check that input range is within VMA on remap as per
  Suren (this also covers I/O remap and all other cases already asserted).
* Updated AFS to not incorrectly reference count on mmap prepare as per
  Usama.
* Also updated various static AFS functions to be consistent with each
  other.
* Updated AFS commit message to reflect mmap_prepare being before any VMA
  merging as per Suren.
* Updated __compat_vma_mapped() to check for NULL vm_ops as per Usama.
* Updated __compat_vma_mapped() to not reference an unmapped VMA's fields
  as per Usama.
* Updated __vma_check_mmap_hook() to check for NULL vm_ops as per Usama.
* Dropped comment about preferring mmap_prepare as seems overly confusing,
  as per Suren.
* Updated the mmap lock assert in unmap_vma_locked() to a write lock assert
  as per Suren.
* Copied vm_ops->open comment over to VMA tests in appropriate patch as per
  Suren.
* Updated mmap_prepare documentation to reflect the fact that no resources
  should be allocated upon mmap_prepare.
* Updated mmap_prepare documentation to reference the vm_ops->mapped
  callback.
* Fixed stray markdown '## How to use' in documentation.
* Fixed bug reported by kernel test bot re: overlooked
  vma_desc_test_flags() -> vma_desc_test() in MTD driver for nommu.
https://lore.kernel.org/linux-mm/cover.1773695307.git.ljs@kernel.org/

v1:
https://lore.kernel.org/linux-mm/cover.1773346620.git.ljs@kernel.org/

Lorenzo Stoakes (Oracle) (16):
  mm: various small mmap_prepare cleanups
  mm: add documentation for the mmap_prepare file operation callback
  mm: document vm_operations_struct->open the same as close()
  mm: add vm_ops->mapped hook
  fs: afs: correctly drop reference count on mapping failure
  mm: add mmap_action_simple_ioremap()
  misc: open-dice: replace deprecated mmap hook with mmap_prepare
  hpet: replace deprecated mmap hook with mmap_prepare
  mtdchar: replace deprecated mmap hook with mmap_prepare, clean up
  stm: replace deprecated mmap hook with mmap_prepare
  staging: vme_user: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  mm: add mmap_action_map_kernel_pages[_full]()
  mm: on remap assert that input range within the proposed VMA

 Documentation/driver-api/vme.rst           |   2 +-
 Documentation/filesystems/index.rst        |   1 +
 Documentation/filesystems/mmap_prepare.rst | 168 ++++++++++++++
 drivers/char/hpet.c                        |  12 +-
 drivers/hv/hyperv_vmbus.h                  |   4 +-
 drivers/hv/vmbus_drv.c                     |  31 ++-
 drivers/hwtracing/stm/core.c               |  31 ++-
 drivers/misc/open-dice.c                   |  19 +-
 drivers/mtd/mtdchar.c                      |  21 +-
 drivers/staging/vme_user/vme.c             |  20 +-
 drivers/staging/vme_user/vme.h             |   2 +-
 drivers/staging/vme_user/vme_user.c        |  51 +++--
 drivers/target/target_core_user.c          |  26 ++-
 drivers/uio/uio.c                          |  10 +-
 drivers/uio/uio_hv_generic.c               |  11 +-
 fs/afs/file.c                              |  36 ++-
 include/linux/fs.h                         |  14 +-
 include/linux/hyperv.h                     |   4 +-
 include/linux/mm.h                         | 159 ++++++++++++-
 include/linux/mm_types.h                   |  17 +-
 include/linux/uio_driver.h                 |   4 +-
 mm/internal.h                              |  41 ++--
 mm/memory.c                                | 175 ++++++++++----
 mm/util.c                                  | 251 ++++++++++++++-------
 mm/vma.c                                   |  53 +++--
 mm/vma.h                                   |   2 +-
 tools/testing/vma/include/dup.h            | 152 ++++++++++---
 tools/testing/vma/include/stubs.h          |   9 +-
 28 files changed, 990 insertions(+), 336 deletions(-)
 create mode 100644 Documentation/filesystems/mmap_prepare.rst

--
2.53.0

