+ fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch

* + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
@ 2025-05-07 21:55 Andrew Morton
  2025-05-08 14:16 ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2025-05-07 21:55 UTC (permalink / raw)
  To: mm-commits, wade.farnsworth, peterx, jhubbard, jgg, david,
	c.briere, artem.k, p.antoniou, akpm


The patch titled
     Subject: Fix zero copy I/O on __get_user_pages allocated pages
has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
     fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch

This patch will later appear in the mm-hotfixes-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Pantelis Antoniou <p.antoniou@partner.samsung.com>
Subject: Fix zero copy I/O on __get_user_pages allocated pages
Date: Wed, 7 May 2025 10:41:05 -0500

Recent updates to network filesystems enabled zero-copy operations, which
require pinning user-space pages.

This does not work for pages that were allocated via __get_user_pages and
then mapped to user space via remap_pfn_range.

remap_pfn_range_internal() turns on the VM_IO | VM_PFNMAP vma bits.
VM_PFNMAP in particular marks the pages as not having a struct page
associated with them, which is not the case for __get_user_pages().

This in turn makes any attempt to lock a page fail, breaking I/O from
that address range.

This patch addresses it by special-casing pages in those VMAs and not
calling vm_normal_page() for them.

Link: https://lkml.kernel.org/r/20250507154105.763088-2-p.antoniou@partner.samsung.com
Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
Cc: Artem Krupotkin <artem.k@samsung.com>
Cc: Charles Briere <c.briere@samsung.com>
Cc: Wade Farnsworth <wade.farnsworth@siemens.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

--- a/mm/gup.c~fix-zero-copy-i-o-on-__get_user_pages-allocated-pages
+++ a/mm/gup.c
@@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(
 	return !userfaultfd_pte_wp(vma, pte);
 }
 
+static struct page *gup_normal_page(struct vm_area_struct *vma,
+		unsigned long address, pte_t pte)
+{
+	unsigned long pfn;
+
+	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
+		pfn = pte_pfn(pte);
+		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn > highest_memmap_pfn)
+			return NULL;
+		return pfn_to_page(pfn);
+	}
+	return vm_normal_page(vma, address, pte);
+}
+
 static struct page *follow_page_pte(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd, unsigned int flags,
 		struct dev_pagemap **pgmap)
@@ -858,7 +872,9 @@ static struct page *follow_page_pte(stru
 	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
 		goto no_page;
 
-	page = vm_normal_page(vma, address, pte);
+	page = gup_normal_page(vma, address, pte);
+	if (page && (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)))
+		(void)follow_pfn_pte(vma, address, ptep, flags);
 
 	/*
 	 * We only care about anon pages in can_follow_write_pte() and don't
@@ -1130,7 +1146,7 @@ static int get_gate_page(struct mm_struc
 	*vma = get_gate_vma(mm);
 	if (!page)
 		goto out;
-	*page = vm_normal_page(*vma, address, entry);
+	*page = gup_normal_page(*vma, address, entry);
 	if (!*page) {
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(entry)))
 			goto unmap;
@@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
 	int foreign = (gup_flags & FOLL_REMOTE);
 	bool vma_anon = vma_is_anonymous(vma);
 
-	if (vm_flags & (VM_IO | VM_PFNMAP))
-		return -EFAULT;
 
 	if ((gup_flags & FOLL_ANON) && !vma_anon)
 		return -EFAULT;
_

Patches currently in -mm which might be from p.antoniou@partner.samsung.com are

fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
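
For readers following the thread, the decision that the new
gup_normal_page() helper makes for PTEs in VM_PFNMAP/VM_MIXEDMAP VMAs can
be modeled in userspace roughly as below.  This is only a sketch under
stated assumptions: the MODEL_* flag values, struct model_memmap and
model_gup_special_page() are hypothetical names invented for illustration,
pfn_valid() is folded into the upper-bound check, and the vm_normal_page()
path for ordinary VMAs is not modeled.

```c
#include <stdbool.h>

/* Illustrative flag values only; the real ones are defined by the kernel. */
#define MODEL_VM_PFNMAP   0x1UL
#define MODEL_VM_MIXEDMAP 0x2UL

/* Hypothetical stand-in for the kernel's memmap bookkeeping. */
struct model_memmap {
	unsigned long zero_pfn;    /* pfn of the shared zero page */
	unsigned long highest_pfn; /* plays the role of highest_memmap_pfn */
};

/*
 * Returns true and stores the pfn in *out when the patched helper would
 * return a struct page for a PTE in a PFNMAP/MIXEDMAP VMA.  Returns false
 * where the helper would return NULL, and also for ordinary VMAs, whose
 * vm_normal_page() path is outside this model.
 */
bool model_gup_special_page(const struct model_memmap *mm,
			    unsigned long vm_flags, unsigned long pfn,
			    unsigned long *out)
{
	if (!(vm_flags & (MODEL_VM_PFNMAP | MODEL_VM_MIXEDMAP)))
		return false;
	if (pfn == mm->zero_pfn || pfn > mm->highest_pfn)
		return false;
	*out = pfn;
	return true;
}
```

The point of the model is that the special-case path trusts the pfn alone;
nothing in it asks whether the driver that set up the mapping intended the
page to be pinned.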


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-07 21:55 + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch Andrew Morton
@ 2025-05-08 14:16 ` Peter Xu
  2025-05-08 14:36   ` Pantelis Antoniou
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2025-05-08 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mm-commits, wade.farnsworth, jhubbard, jgg, david, c.briere,
	artem.k, p.antoniou, David Howells

Hi, Pantelis,

[Cc David Howells]

On Wed, May 07, 2025 at 02:55:54PM -0700, Andrew Morton wrote:
> 
> The patch titled
>      Subject: Fix zero copy I/O on __get_user_pages allocated pages
> has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
>      fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
> 
> This patch will shortly appear at
>      https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
> 
> This patch will later appear in the mm-hotfixes-unstable branch at
>     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> 
> The -mm tree is included into linux-next via the mm-everything
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> and is updated there every 2-3 working days
> 
> ------------------------------------------------------
> From: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> Subject: Fix zero copy I/O on __get_user_pages allocated pages
> Date: Wed, 7 May 2025 10:41:05 -0500
> 
> Recent updates to net filesystems enabled zero copy operations, which
> require getting a user space page pinned.
> 
> This does not work for pages that were allocated via __get_user_pages and
> then mapped to user-space via remap_pfn_rage.
> 
> remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits. 
> VM_PFNMAP in particular mark the pages as not having struct_page
> associated with them, which is not the case for __get_user_pages()
> 
> This in turn makes any attempt to lock a page fail, and breaking I/O from
> that address range.
> 
> This patch address it by special casing pages in those VMAs and not
> calling vm_normal_page() for them.
> 
> Link: https://lkml.kernel.org/r/20250507154105.763088-2-p.antoniou@partner.samsung.com
> Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> Cc: Artem Krupotkin <artem.k@samsung.com>
> Cc: Charles Briere <c.briere@samsung.com>
> Cc: Wade Farnsworth <wade.farnsworth@siemens.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/gup.c |   22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> --- a/mm/gup.c~fix-zero-copy-i-o-on-__get_user_pages-allocated-pages
> +++ a/mm/gup.c
> @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(
>  	return !userfaultfd_pte_wp(vma, pte);
>  }
>  
> +static struct page *gup_normal_page(struct vm_area_struct *vma,
> +		unsigned long address, pte_t pte)
> +{
> +	unsigned long pfn;
> +
> +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> +		pfn = pte_pfn(pte);
> +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn > highest_memmap_pfn)
> +			return NULL;
> +		return pfn_to_page(pfn);
> +	}
> +	return vm_normal_page(vma, address, pte);
> +}
> +
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		unsigned long address, pmd_t *pmd, unsigned int flags,
>  		struct dev_pagemap **pgmap)
> @@ -858,7 +872,9 @@ static struct page *follow_page_pte(stru
>  	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
>  		goto no_page;
>  
> -	page = vm_normal_page(vma, address, pte);
> +	page = gup_normal_page(vma, address, pte);
> +	if (page && (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)))
> +		(void)follow_pfn_pte(vma, address, ptep, flags);
>  
>  	/*
>  	 * We only care about anon pages in can_follow_write_pte() and don't
> @@ -1130,7 +1146,7 @@ static int get_gate_page(struct mm_struc
>  	*vma = get_gate_vma(mm);
>  	if (!page)
>  		goto out;
> -	*page = vm_normal_page(*vma, address, entry);
> +	*page = gup_normal_page(*vma, address, entry);

Is this really needed?  IIUC the iter code would only use this for either
UBUF or IOVEC iterators.

>  	if (!*page) {
>  		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(entry)))
>  			goto unmap;
> @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
>  	int foreign = (gup_flags & FOLL_REMOTE);
>  	bool vma_anon = vma_is_anonymous(vma);
>  
> -	if (vm_flags & (VM_IO | VM_PFNMAP))
> -		return -EFAULT;

Is there any justification that this won't break some existing GUP users
that may rely on properly failing at pfnmaps?

IIUC netfs isn't the first one that wants to GUP on top of pfnmaps; KVM
has done it for years, and so far that has been handled in a standalone path:

hva_to_pfn:
	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
		r = hva_to_pfn_remapped(vma, kfp, &pfn);

That started with supporting real pfnmaps (with no struct page), but pfnmaps
with struct pages can also happen afaict, and kvm handles those too by
ultimately checking page==NULL, e.g. in kvm_release_faultin_page().

The other thing is that the above only handles the pte level of pfnmaps; just
to mention that pmd/pud may need attention too, because we're gradually
supporting huge mappings even for pfns.  I didn't check whether that's
possible as of now, though.  Maybe it's not an immediate concern.

In general, I'm uncertain about whether this is the right way to go so
far. To me it might be less intrusive if we follow what kvm does for now,
or maybe we also at least want to enrich the justification part in the
commit log.
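
The caller-side approach suggested above, where the user of GUP inspects the
VMA flags and takes a separate path instead of GUP itself being loosened,
can be sketched in userspace as follows.  This is a simplified model and
not KVM's actual code: the MODEL_* values, enum model_path and
model_pick_path() are hypothetical names, and the real hva_to_pfn() has
further steps around this check that are not shown.

```c
/* Illustrative flag values only; the real ones are defined by the kernel. */
#define MODEL_VM_IO     0x1UL
#define MODEL_VM_PFNMAP 0x2UL

enum model_path {
	MODEL_PATH_GUP,      /* ordinary get_user_pages()-style lookup */
	MODEL_PATH_REMAPPED, /* separate walk, in the spirit of hva_to_pfn_remapped() */
};

/* Pick the lookup path from the VMA flags, leaving GUP's own checks intact. */
enum model_path model_pick_path(unsigned long vm_flags)
{
	if (vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP))
		return MODEL_PATH_REMAPPED;
	return MODEL_PATH_GUP;
}
```

With this shape, existing GUP users that rely on -EFAULT for pfnmaps keep
their behavior, and only the caller that opts in sees the remapped path.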

>  
>  	if ((gup_flags & FOLL_ANON) && !vma_anon)
>  		return -EFAULT;
> _
> 
> Patches currently in -mm which might be from p.antoniou@partner.samsung.com are
> 
> fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 14:16 ` Peter Xu
@ 2025-05-08 14:36   ` Pantelis Antoniou
  2025-05-08 15:08     ` Peter Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 14:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, mm-commits, wade.farnsworth, jhubbard, jgg, david,
	c.briere, artem.k, David Howells

On Thu, 8 May 2025 10:16:31 -0400
Peter Xu <peterx@redhat.com> wrote:

Hi Peter,

> Hi, Pantelis,
> 
> [Cc David Howells]
> 
> On Wed, May 07, 2025 at 02:55:54PM -0700, Andrew Morton wrote:
> > 
> > The patch titled
> >      Subject: Fix zero copy I/O on __get_user_pages allocated pages
> > has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
> >      fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
> > 
> > This patch will shortly appear at
> >      https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
> > 
> > This patch will later appear in the mm-hotfixes-unstable branch at
> >     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > 
> > Before you just go and hit "reply", please:
> >    a) Consider who else should be cc'ed
> >    b) Prefer to cc a suitable mailing list as well
> >    c) Ideally: find the original patch on the mailing list and do a
> >       reply-to-all to that, adding suitable additional cc's
> > 
> > *** Remember to use Documentation/process/submit-checklist.rst when
> > testing your code ***
> > 
> > The -mm tree is included into linux-next via the mm-everything
> > branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > and is updated there every 2-3 working days
> > 
> > ------------------------------------------------------
> > From: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> > Subject: Fix zero copy I/O on __get_user_pages allocated pages
> > Date: Wed, 7 May 2025 10:41:05 -0500
> > 
> > Recent updates to net filesystems enabled zero copy operations,
> > which require getting a user space page pinned.
> > 
> > This does not work for pages that were allocated via
> > __get_user_pages and then mapped to user-space via remap_pfn_rage.
> > 
> > remap_pfn_range_internal() will turn on VM_IO | VM_PFNMAP vma bits. 
> > VM_PFNMAP in particular mark the pages as not having struct_page
> > associated with them, which is not the case for __get_user_pages()
> > 
> > This in turn makes any attempt to lock a page fail, and breaking
> > I/O from that address range.
> > 
> > This patch address it by special casing pages in those VMAs and not
> > calling vm_normal_page() for them.
> > 
> > Link: https://lkml.kernel.org/r/20250507154105.763088-2-p.antoniou@partner.samsung.com
> > Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
> > Cc: Artem Krupotkin <artem.k@samsung.com>
> > Cc: Charles Briere <c.briere@samsung.com>
> > Cc: Wade Farnsworth <wade.farnsworth@siemens.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > 
> >  mm/gup.c |   22 ++++++++++++++++++----
> >  1 file changed, 18 insertions(+), 4 deletions(-)
> > 
> > --- a/mm/gup.c~fix-zero-copy-i-o-on-__get_user_pages-allocated-pages
> > +++ a/mm/gup.c
> > @@ -833,6 +833,20 @@ static inline bool can_follow_write_pte(
> >  	return !userfaultfd_pte_wp(vma, pte);
> >  }
> >  
> > +static struct page *gup_normal_page(struct vm_area_struct *vma,
> > +		unsigned long address, pte_t pte)
> > +{
> > +	unsigned long pfn;
> > +
> > +	if (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)) {
> > +		pfn = pte_pfn(pte);
> > +		if (!pfn_valid(pfn) || is_zero_pfn(pfn) || pfn > highest_memmap_pfn)
> > +			return NULL;
> > +		return pfn_to_page(pfn);
> > +	}
> > +	return vm_normal_page(vma, address, pte);
> > +}
> > +
> >  static struct page *follow_page_pte(struct vm_area_struct *vma,
> >  		unsigned long address, pmd_t *pmd, unsigned int flags,
> >  		struct dev_pagemap **pgmap)
> > @@ -858,7 +872,9 @@ static struct page *follow_page_pte(stru
> >  	if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
> >  		goto no_page;
> >  
> > -	page = vm_normal_page(vma, address, pte);
> > +	page = gup_normal_page(vma, address, pte);
> > +	if (page && (vma->vm_flags & (VM_MIXEDMAP | VM_PFNMAP)))
> > +		(void)follow_pfn_pte(vma, address, ptep, flags);
> >  
> >  	/*
> >  	 * We only care about anon pages in can_follow_write_pte() and don't
> > @@ -1130,7 +1146,7 @@ static int get_gate_page(struct mm_struc
> >  	*vma = get_gate_vma(mm);
> >  	if (!page)
> >  		goto out;
> > -	*page = vm_normal_page(*vma, address, entry);
> > +	*page = gup_normal_page(*vma, address, entry);
> 
> Is this really needed?  IIUC the iter code would only use in either
> UBUF or IOVEC ones.
> 

I think you're right, for our platforms the gate check never passes.

However using the same gup_normal_page() method could be clearer in this
context.

> >  	if (!*page) {
> >  		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(entry)))
> >  			goto unmap;
> > @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
> >  	int foreign = (gup_flags & FOLL_REMOTE);
> >  	bool vma_anon = vma_is_anonymous(vma);
> >  
> > -	if (vm_flags & (VM_IO | VM_PFNMAP))
> > -		return -EFAULT;
> 
> Is there's any justification that this won't break some existing GUP
> users that may rely on properly failing at pfnmaps?
> 
> IIUC netfs isn't the first one that wants to GUP on top of pfnmaps,
> KVM does it for years and so far it was processed in a standalone
> path:
> 
> hva_to_pfn:
> 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
> 
> That started with supporting real pfnmaps (with no page struct), but
> pfnmap with page structs can also happen afaict, and kvm processes
> that too by checking page==NULL ultimately, e.g. in
> kvm_release_faultin_page().
> 

I see. The problem is that we're not the owners of the code in netfslib,
and it is considerably more intrusive to fix things there.

This is a hotfix for a userspace regression. I sort of agree that having
different handling for these areas in netfslib would be ideal.

Or perhaps changing semantics by having an extra VM_* bit that would
mark that VMA as actually having a backing page struct. Dunno, things
could get considerably complex fast.
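
The extra-bit idea floated above can be sketched in userspace like this.
Everything here is hypothetical: no such VM_* flag is proposed by the patch,
and MODEL_VM_PFNMAP_PAGES and model_gup_allowed() are names invented only to
illustrate the shape of such a check.

```c
#include <stdbool.h>

/* Illustrative flag values only; the real ones are defined by the kernel. */
#define MODEL_VM_IO           0x1UL
#define MODEL_VM_PFNMAP       0x2UL
/* Hypothetical bit: "this pfnmap is backed by struct pages". */
#define MODEL_VM_PFNMAP_PAGES 0x4UL

/*
 * GUP stays refused for VM_IO/VM_PFNMAP VMAs unless the mapping's creator
 * explicitly declared that struct pages back the range.
 */
bool model_gup_allowed(unsigned long vm_flags)
{
	if (!(vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP)))
		return true;
	return (vm_flags & MODEL_VM_PFNMAP_PAGES) != 0;
}
```

The design choice in this sketch is opt-in: existing pfnmap users keep
failing GUP exactly as before, and only a driver that sets the new bit
changes behavior.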

> The other thing is above only processed pte level of pfnmap, and just
> to mention pmd/pud may need attention too because we're gradually
> supporting huge mappings even for pfns.  I didn't check whether it's
> possible as of now, though.  Maybe it's not an immediate concern.
> 

You are absolutely right, it will eventually be a concern.

> In general, I'm uncertain about whether this is the right way to go so
> far. To me it might be less intrusive if we follow what kvm does for
> now, or maybe we also at least want to enrich the justification part
> in the commit log.
> 

Again, this is a hotfix. An actual fix might be something that addresses
both KVM and netfslib concerns, but that would be something much
larger than a 20 line patch.

> >  
> >  	if ((gup_flags & FOLL_ANON) && !vma_anon)
> >  		return -EFAULT;
> > _
> > 
> > Patches currently in -mm which might be from
> > p.antoniou@partner.samsung.com are
> > 
> > fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch
> > 
> 

Regards

-- Pantelis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 14:36   ` Pantelis Antoniou
@ 2025-05-08 15:08     ` Peter Xu
  2025-05-08 15:10       ` David Hildenbrand
  2025-05-08 15:17       ` Pantelis Antoniou
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Xu @ 2025-05-08 15:08 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Andrew Morton, mm-commits, wade.farnsworth, jhubbard, jgg, david,
	c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
> On Thu, 8 May 2025 10:16:31 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> Hi Peter,

Hi, Pantelis,

[...]

> > > @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
> > >  	int foreign = (gup_flags & FOLL_REMOTE);
> > >  	bool vma_anon = vma_is_anonymous(vma);
> > >  
> > > -	if (vm_flags & (VM_IO | VM_PFNMAP))
> > > -		return -EFAULT;
> > 
> > Is there's any justification that this won't break some existing GUP
> > users that may rely on properly failing at pfnmaps?
> > 
> > IIUC netfs isn't the first one that wants to GUP on top of pfnmaps,
> > KVM does it for years and so far it was processed in a standalone
> > path:
> > 
> > hva_to_pfn:
> > 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
> > 
> > That started with supporting real pfnmaps (with no page struct), but
> > pfnmap with page structs can also happen afaict, and kvm processes
> > that too by checking page==NULL ultimately, e.g. in
> > kvm_release_faultin_page().
> > 
> 
> I see. The problem is that we're not the owners of the code in netfslib,
> and it is considerably more intrusive to fix things there.
> 
> This is a hotfix for a userspace regression. I sort of agree that having
> different handling for these areas in netfslib would be ideal.

Do you mean this used to work in older kernels?  Some more info on the
regression would be more than welcome if so.  If it fixes a kernel
regression, we may want a Fixes: tag for whichever patch introduced it.

Or do you mean it's a regression caused by userspace change?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:08     ` Peter Xu
@ 2025-05-08 15:10       ` David Hildenbrand
  2025-05-08 15:27         ` Pantelis Antoniou
  2025-05-08 15:17       ` Pantelis Antoniou
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 15:10 UTC (permalink / raw)
  To: Peter Xu, Pantelis Antoniou
  Cc: Andrew Morton, mm-commits, wade.farnsworth, jhubbard, jgg,
	c.briere, artem.k, David Howells

On 08.05.25 17:08, Peter Xu wrote:
> On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
>> On Thu, 8 May 2025 10:16:31 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>>
>> Hi Peter,
> 
> Hi, Pantelis,
> 
> [...]
> 
>>>> @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
>>>>   	int foreign = (gup_flags & FOLL_REMOTE);
>>>>   	bool vma_anon = vma_is_anonymous(vma);
>>>>   
>>>> -	if (vm_flags & (VM_IO | VM_PFNMAP))
>>>> -		return -EFAULT;
>>>
>>> Is there's any justification that this won't break some existing GUP
>>> users that may rely on properly failing at pfnmaps?
>>>
>>> IIUC netfs isn't the first one that wants to GUP on top of pfnmaps,
>>> KVM does it for years and so far it was processed in a standalone
>>> path:
>>>
>>> hva_to_pfn:
>>> 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
>>> 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
>>>
>>> That started with supporting real pfnmaps (with no page struct), but
>>> pfnmap with page structs can also happen afaict, and kvm processes
>>> that too by checking page==NULL ultimately, e.g. in
>>> kvm_release_faultin_page().
>>>
>>
>> I see. The problem is that we're not the owners of the code in netfslib,
>> and it is considerably more intrusive to fix things there.
>>
>> This is a hotfix for a userspace regression. I sort of agree that having
>> different handling for these areas in netfslib would be ideal.
> 
> Do you mean this used to work in older kernels?  Some more info on the
> regression would be more than welcomed if so..  If it fixes a kernel
> regression, we may want a Fixes for whatever patch at last.

To be precise: Whoever decided to use remap_pfn_range() essentially 
decided that GUP cannot possibly work.

So is the regression introduced by a conversion to remap_pfn_range() in 
some code, or because suddenly someone relies on GUP for these things?
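
The behavior referred to here is the early rejection in check_vma_flags()
that the patch removes.  A userspace model of that check, before and after
the patch, might look like the sketch below; the MODEL_* values and
model_check_vma_flags() are hypothetical names, and the function's
remaining checks are not modeled.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag values only; the real ones are defined by the kernel. */
#define MODEL_VM_IO     0x1UL
#define MODEL_VM_PFNMAP 0x2UL

/*
 * patched == false: mirrors the hunk the patch deletes, so any VMA created
 *                   through remap_pfn_range() fails GUP with -EFAULT.
 * patched == true:  the rejection is gone and such VMAs proceed.
 */
int model_check_vma_flags(unsigned long vm_flags, bool patched)
{
	if (!patched && (vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP)))
		return -EFAULT;
	return 0; /* remaining checks not modeled */
}
```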

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:10       ` David Hildenbrand
@ 2025-05-08 15:27         ` Pantelis Antoniou
  2025-05-08 15:40           ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 15:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Andrew Morton, mm-commits, wade.farnsworth, jhubbard,
	jgg, c.briere, artem.k, David Howells

On Thu, 8 May 2025 17:10:10 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 08.05.25 17:08, Peter Xu wrote:
> > On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
> >> On Thu, 8 May 2025 10:16:31 -0400
> >> Peter Xu <peterx@redhat.com> wrote:
> >>
> >> Hi Peter,
> > 
> > Hi, Pantelis,
> > 
> > [...]
> > 
> >>>> @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
> >>>>   	int foreign = (gup_flags & FOLL_REMOTE);
> >>>>   	bool vma_anon = vma_is_anonymous(vma);
> >>>>   
> >>>> -	if (vm_flags & (VM_IO | VM_PFNMAP))
> >>>> -		return -EFAULT;
> >>>
> >>> Is there's any justification that this won't break some existing
> >>> GUP users that may rely on properly failing at pfnmaps?
> >>>
> >>> IIUC netfs isn't the first one that wants to GUP on top of
> >>> pfnmaps, KVM does it for years and so far it was processed in a
> >>> standalone path:
> >>>
> >>> hva_to_pfn:
> >>> 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> >>> 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
> >>>
> >>> That started with supporting real pfnmaps (with no page struct),
> >>> but pfnmap with page structs can also happen afaict, and kvm
> >>> processes that too by checking page==NULL ultimately, e.g. in
> >>> kvm_release_faultin_page().
> >>>
> >>
> >> I see. The problem is that we're not the owners of the code in
> >> netfslib, and it is considerably more intrusive to fix things
> >> there.
> >>
> >> This is a hotfix for a userspace regression. I sort of agree that
> >> having different handling for these areas in netfslib would be
> >> ideal.
> > 
> > Do you mean this used to work in older kernels?  Some more info on
> > the regression would be more than welcomed if so..  If it fixes a
> > kernel regression, we may want a Fixes for whatever patch at last.
> 
> To be precise: Whoever decided to use remap_pfn_range() essentially 
> decided that GUP cannot possibly work.
> 
> So is the regression introduced by a conversion to remap_pfn_range()
> in some code, or because suddenly someone relies on GUP for these
> things?
> 

I don't think there was a deliberate decision here, and there was no
conversion to remap_pfn_range(); the code (in DRM) was always there.

The regression occurred when netfslib started using GUP for I/O; once
filesystems switched to it, we hit this case.

Regards

-- Pantelis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:27         ` Pantelis Antoniou
@ 2025-05-08 15:40           ` David Hildenbrand
  2025-05-08 15:48             ` Pantelis Antoniou
                               ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 15:40 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Peter Xu, Andrew Morton, mm-commits, wade.farnsworth, jhubbard,
	jgg, c.briere, artem.k, David Howells

On 08.05.25 17:27, Pantelis Antoniou wrote:
> On Thu, 8 May 2025 17:10:10 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 08.05.25 17:08, Peter Xu wrote:
>>> On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
>>>> On Thu, 8 May 2025 10:16:31 -0400
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> Hi Peter,
>>>
>>> Hi, Pantelis,
>>>
>>> [...]
>>>
>>>>>> @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
>>>>>>    	int foreign = (gup_flags & FOLL_REMOTE);
>>>>>>    	bool vma_anon = vma_is_anonymous(vma);
>>>>>>    
>>>>>> -	if (vm_flags & (VM_IO | VM_PFNMAP))
>>>>>> -		return -EFAULT;
>>>>>
>>>>> Is there's any justification that this won't break some existing
>>>>> GUP users that may rely on properly failing at pfnmaps?
>>>>>
>>>>> IIUC netfs isn't the first one that wants to GUP on top of
>>>>> pfnmaps, KVM does it for years and so far it was processed in a
>>>>> standalone path:
>>>>>
>>>>> hva_to_pfn:
>>>>> 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
>>>>> 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
>>>>>
>>>>> That started with supporting real pfnmaps (with no page struct),
>>>>> but pfnmap with page structs can also happen afaict, and kvm
>>>>> processes that too by checking page==NULL ultimately, e.g. in
>>>>> kvm_release_faultin_page().
>>>>>
>>>>
>>>> I see. The problem is that we're not the owners of the code in
>>>> netfslib, and it is considerably more intrusive to fix things
>>>> there.
>>>>
>>>> This is a hotfix for a userspace regression. I sort of agree that
>>>> having different handling for these areas in netfslib would be
>>>> ideal.
>>>
>>> Do you mean this used to work in older kernels?  Some more info on
>>> the regression would be more than welcomed if so..  If it fixes a
>>> kernel regression, we may want a Fixes for whatever patch at last.
>>
>> To be precise: Whoever decided to use remap_pfn_range() essentially
>> decided that GUP cannot possibly work.
>>
>> So is the regression introduced by a conversion to remap_pfn_range()
>> in some code, or because suddenly someone relies on GUP for these
>> things?
>>
> 
> I don't think there was a deliberate decision here, but there was no
> conversion to remap_pfn_range(), the code (in DRM) was always there.
> 
> The regression occurred when netfslib started using GUP for I/O and
> when filesystems switched to it we hit this case.

Okay, so GUP and DRM always worked that way. They are essentially 
incompatible at this point due to VM_PFNMAP.

So netfslib requesting something that is impossible is the problem .. or 
rather filesystems switching to that and not realizing the problem.

Hmmm

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:40           ` David Hildenbrand
@ 2025-05-08 15:48             ` Pantelis Antoniou
  2025-05-08 16:25             ` Pantelis Antoniou
  2025-05-08 17:35             ` Jason Gunthorpe
  2 siblings, 0 replies; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 15:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Andrew Morton, mm-commits, wade.farnsworth, jhubbard,
	jgg, c.briere, artem.k, David Howells

On Thu, 8 May 2025 17:40:15 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 08.05.25 17:27, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 17:10:10 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> >> On 08.05.25 17:08, Peter Xu wrote:
> >>> On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
> >>>> On Thu, 8 May 2025 10:16:31 -0400
> >>>> Peter Xu <peterx@redhat.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>
> >>> Hi, Pantelis,
> >>>
> >>> [...]
> >>>
> >>>>>> @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
> >>>>>>    	int foreign = (gup_flags & FOLL_REMOTE);
> >>>>>>    	bool vma_anon = vma_is_anonymous(vma);
> >>>>>>    
> >>>>>> -	if (vm_flags & (VM_IO | VM_PFNMAP))
> >>>>>> -		return -EFAULT;
> >>>>>
> >>>>> Is there any justification that this won't break some existing
> >>>>> GUP users that may rely on properly failing at pfnmaps?
> >>>>>
> >>>>> IIUC netfs isn't the first one that wants to GUP on top of
> >>>>> pfnmaps; KVM has done it for years, and so far it was handled in a
> >>>>> standalone path:
> >>>>>
> >>>>> hva_to_pfn:
> >>>>> 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> >>>>> 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
> >>>>>
> >>>>> That started with supporting real pfnmaps (with no page struct),
> >>>>> but pfnmap with page structs can also happen afaict, and kvm
> >>>>> processes that too by checking page==NULL ultimately, e.g. in
> >>>>> kvm_release_faultin_page().
> >>>>>
> >>>>
> >>>> I see. The problem is that we're not the owners of the code in
> >>>> netfslib, and it is considerably more intrusive to fix things
> >>>> there.
> >>>>
> >>>> This is a hotfix for a userspace regression. I sort of agree that
> >>>> having different handling for these areas in netfslib would be
> >>>> ideal.
> >>>
> >>> Do you mean this used to work in older kernels?  Some more info on
> >>> the regression would be more than welcomed if so..  If it fixes a
> >>> kernel regression, we may want a Fixes for whatever patch at last.
> >>
> >> To be precise: Whoever decided to use remap_pfn_range() essentially
> >> decided that GUP cannot possibly work.
> >>
> >> So is the regression introduced by a conversion to
> >> remap_pfn_range() in some code, or because suddenly someone relies
> >> on GUP for these things?
> >>
> > 
> > I don't think there was a deliberate decision here, but there was no
> > conversion to remap_pfn_range(), the code (in DRM) was always there.
> > 
> > The regression occurred when netfslib started using GUP for I/O and
> > when filesystems switched to it we hit this case.
> 
> Okay, so GUP and DRM always worked that way. They are essentially 
> incompatible at this point due to VM_PFNMAP.
> 
> So netfslib requesting something that is impossible is the problem ..
> or rather filesystems switching to that and not realizing the problem.
> 

All of the statements above are true.

> Hmmm
> 

Indeed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:40           ` David Hildenbrand
  2025-05-08 15:48             ` Pantelis Antoniou
@ 2025-05-08 16:25             ` Pantelis Antoniou
  2025-05-08 17:35             ` Jason Gunthorpe
  2 siblings, 0 replies; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 16:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Andrew Morton, mm-commits, wade.farnsworth, jhubbard,
	jgg, c.briere, artem.k, David Howells

[-- Attachment #1: Type: text/plain, Size: 3641 bytes --]

On Thu, 8 May 2025 17:40:15 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 08.05.25 17:27, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 17:10:10 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> >> On 08.05.25 17:08, Peter Xu wrote:
> >>> On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
> >>>> On Thu, 8 May 2025 10:16:31 -0400
> >>>> Peter Xu <peterx@redhat.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>
> >>> Hi, Pantelis,
> >>>
> >>> [...]
> >>>
> >>>>>> @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
> >>>>>>    	int foreign = (gup_flags & FOLL_REMOTE);
> >>>>>>    	bool vma_anon = vma_is_anonymous(vma);
> >>>>>>    
> >>>>>> -	if (vm_flags & (VM_IO | VM_PFNMAP))
> >>>>>> -		return -EFAULT;
> >>>>>
> >>>>> Is there any justification that this won't break some existing
> >>>>> GUP users that may rely on properly failing at pfnmaps?
> >>>>>
> >>>>> IIUC netfs isn't the first one that wants to GUP on top of
> >>>>> pfnmaps; KVM has done it for years, and so far it was handled in a
> >>>>> standalone path:
> >>>>>
> >>>>> hva_to_pfn:
> >>>>> 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> >>>>> 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
> >>>>>
> >>>>> That started with supporting real pfnmaps (with no page struct),
> >>>>> but pfnmap with page structs can also happen afaict, and kvm
> >>>>> processes that too by checking page==NULL ultimately, e.g. in
> >>>>> kvm_release_faultin_page().
> >>>>>
> >>>>
> >>>> I see. The problem is that we're not the owners of the code in
> >>>> netfslib, and it is considerably more intrusive to fix things
> >>>> there.
> >>>>
> >>>> This is a hotfix for a userspace regression. I sort of agree that
> >>>> having different handling for these areas in netfslib would be
> >>>> ideal.
> >>>
> >>> Do you mean this used to work in older kernels?  Some more info on
> >>> the regression would be more than welcomed if so..  If it fixes a
> >>> kernel regression, we may want a Fixes for whatever patch at last.
> >>
> >> To be precise: Whoever decided to use remap_pfn_range() essentially
> >> decided that GUP cannot possibly work.
> >>
> >> So is the regression introduced by a conversion to
> >> remap_pfn_range() in some code, or because suddenly someone relies
> >> on GUP for these things?
> >>
> > 
> > I don't think there was a deliberate decision here, but there was no
> > conversion to remap_pfn_range(), the code (in DRM) was always there.
> > 
> > The regression occurred when netfslib started using GUP for I/O and
> > when filesystems switched to it we hit this case.
> 
> Okay, so GUP and DRM always worked that way. They are essentially 
> incompatible at this point due to VM_PFNMAP.
> 
> So netfslib requesting something that is impossible is the problem ..
> or rather filesystems switching to that and not realizing the problem.
> 
> Hmmm
> 

In the interest of getting everyone on the same page, here's a buildroot
patch that reproduces a (simplified) environment for triggering the
bug that this patch fixes.

Regards

-- Pantelis


[-- Attachment #2: 0001-Modify-aarch64-virt-x86_64-qemu-targets-to-exhibit-a.patch --]
[-- Type: text/x-patch, Size: 22146 bytes --]

From fa33c0501ec4b5e84e9359d56b0da54bfa5af728 Mon Sep 17 00:00:00 2001
From: Pantelis Antoniou <p.antoniou@partner.samsung.com>
Date: Tue, 1 Apr 2025 17:28:20 +0300
Subject: [PATCH] Modify aarch64-virt/x86_64 qemu targets to exhibit a vmbug

When 9p writes directly from a device mmap vma the write fails.

To use, check out this commit and:

$ make qemu_aarch64_virt_defconfig # arm64

or

$ make qemu_x86_64_defconfig # x86_64

$ make
$ ./output/images/start-qemu.sh

< login as root, no password >

qemu# modprobe vmbug-module
qemu# vmbug
<OK>
qemu# vmbug -w /mnt/home/vmbug.bin
<FAILS>
qemu# vmbug -b -w /mnt/home/vmbug.bin
<OK>

Signed-off-by: Pantelis Antoniou <p.antoniou@partner.samsung.com>
---
 board/qemu/aarch64-virt/linux.config |   8 +-
 board/qemu/aarch64-virt/readme.txt   |   2 +-
 board/qemu/post-build.sh             |   6 +
 board/qemu/x86_64/linux.config       |  14 +-
 board/qemu/x86_64/readme.txt         |   2 +-
 configs/qemu_aarch64_virt_defconfig  |  12 +-
 configs/qemu_x86_64_defconfig        |  13 +-
 linux/linux.hash                     |   3 +
 package/Config.in                    |   2 +
 package/vmbug-module/Config.in       |   5 +
 package/vmbug-module/Makefile        |  10 ++
 package/vmbug-module/vmbug-module.c  | 136 ++++++++++++++++++
 package/vmbug-module/vmbug-module.mk |  14 ++
 package/vmbug/Config.in              |   5 +
 package/vmbug/Makefile               |  25 ++++
 package/vmbug/vmbug.c                | 202 +++++++++++++++++++++++++++
 package/vmbug/vmbug.mk               |  21 +++
 17 files changed, 463 insertions(+), 17 deletions(-)
 create mode 100755 board/qemu/post-build.sh
 create mode 100644 package/vmbug-module/Config.in
 create mode 100644 package/vmbug-module/Makefile
 create mode 100644 package/vmbug-module/vmbug-module.c
 create mode 100644 package/vmbug-module/vmbug-module.mk
 create mode 100644 package/vmbug/Config.in
 create mode 100644 package/vmbug/Makefile
 create mode 100644 package/vmbug/vmbug.c
 create mode 100644 package/vmbug/vmbug.mk

diff --git a/board/qemu/aarch64-virt/linux.config b/board/qemu/aarch64-virt/linux.config
index 971b9fcf86..0fcb3fbaea 100644
--- a/board/qemu/aarch64-virt/linux.config
+++ b/board/qemu/aarch64-virt/linux.config
@@ -13,6 +13,7 @@ CONFIG_PROFILING=y
 CONFIG_ARCH_VEXPRESS=y
 CONFIG_COMPAT=y
 CONFIG_ACPI=y
+# CONFIG_GCC_PLUGINS is not set
 CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
 CONFIG_BLK_DEV_BSGLIB=y
@@ -21,14 +22,14 @@ CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_NET=y
 CONFIG_PACKET=y
 CONFIG_PACKET_DIAG=y
-CONFIG_UNIX=y
 CONFIG_NET_KEY=y
-CONFIG_INET=y
 CONFIG_IP_MULTICAST=y
 CONFIG_IP_ADVANCED_ROUTER=y
 CONFIG_BRIDGE=m
 CONFIG_NET_SCHED=y
 CONFIG_VSOCKETS=y
+CONFIG_NET_9P=y
+CONFIG_NET_9P_VIRTIO=y
 CONFIG_PCI=y
 CONFIG_PCI_HOST_GENERIC=y
 CONFIG_DEVTMPFS=y
@@ -74,3 +75,6 @@ CONFIG_VIRTIO_FS=y
 CONFIG_OVERLAY_FS=y
 CONFIG_TMPFS=y
 CONFIG_TMPFS_POSIX_ACL=y
+CONFIG_9P_FS=y
+CONFIG_9P_FS_POSIX_ACL=y
+CONFIG_CRYPTO_CRC32C=y
diff --git a/board/qemu/aarch64-virt/readme.txt b/board/qemu/aarch64-virt/readme.txt
index db35a3a7a8..3fac0a296c 100644
--- a/board/qemu/aarch64-virt/readme.txt
+++ b/board/qemu/aarch64-virt/readme.txt
@@ -1,5 +1,5 @@
 Run the emulation with:
 
-  qemu-system-aarch64 -M virt -cpu cortex-a53 -nographic -smp 1 -kernel output/images/Image -append "rootwait root=/dev/vda console=ttyAMA0" -netdev user,id=eth0 -device virtio-net-device,netdev=eth0 -drive file=output/images/rootfs.ext4,if=none,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 # qemu_aarch64_virt_defconfig
+  qemu-system-aarch64 -M virt -cpu cortex-a53 -nographic -smp 1 -kernel output/images/Image -append "rootwait root=/dev/vda console=ttyAMA0" -netdev user,id=eth0 -device virtio-net-device,netdev=eth0 -drive file=output/images/rootfs.ext4,if=none,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 -virtfs local,path="${HOME}",mount_tag=host0,security_model=mapped,id=host0 # qemu_aarch64_virt_defconfig
 
 The login prompt will appear in the terminal that started Qemu.
diff --git a/board/qemu/post-build.sh b/board/qemu/post-build.sh
new file mode 100755
index 0000000000..61648a1daf
--- /dev/null
+++ b/board/qemu/post-build.sh
@@ -0,0 +1,6 @@
+#!/bin/sh
+BOARD_DIR="$(dirname "$0")"
+
+mkdir -p "${TARGET_DIR}/mnt/home"
+
+echo "host0	/mnt/home	9p trans=virtio,version=9p2000.L	0	1" >>"${TARGET_DIR}/etc/fstab"
diff --git a/board/qemu/x86_64/linux.config b/board/qemu/x86_64/linux.config
index e1d2ce01b0..9d4a97a6b5 100644
--- a/board/qemu/x86_64/linux.config
+++ b/board/qemu/x86_64/linux.config
@@ -1,15 +1,16 @@
 CONFIG_SYSVIPC=y
 CONFIG_CGROUPS=y
-CONFIG_MODULES=y
-CONFIG_MODULE_UNLOAD=y
 CONFIG_SMP=y
 CONFIG_HYPERVISOR_GUEST=y
 CONFIG_PARAVIRT=y
+# CONFIG_GCC_PLUGINS is not set
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
-CONFIG_UNIX=y
-CONFIG_INET=y
 # CONFIG_WIRELESS is not set
+CONFIG_NET_9P=y
+CONFIG_NET_9P_VIRTIO=y
 CONFIG_PCI=y
 CONFIG_DEVTMPFS=y
 CONFIG_DEVTMPFS_MOUNT=y
@@ -30,8 +31,8 @@ CONFIG_VIRTIO_CONSOLE=y
 CONFIG_HW_RANDOM_VIRTIO=m
 CONFIG_DRM=y
 CONFIG_DRM_QXL=y
-CONFIG_DRM_BOCHS=y
 CONFIG_DRM_VIRTIO_GPU=y
+CONFIG_DRM_BOCHS=y
 CONFIG_SOUND=y
 CONFIG_SND=y
 CONFIG_SND_HDA_INTEL=y
@@ -47,7 +48,8 @@ CONFIG_VIRTIO_INPUT=y
 CONFIG_VIRTIO_MMIO=y
 CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y
 CONFIG_EXT4_FS=y
-CONFIG_AUTOFS4_FS=y
 CONFIG_TMPFS=y
 CONFIG_TMPFS_POSIX_ACL=y
+CONFIG_9P_FS=y
+CONFIG_9P_FS_POSIX_ACL=y
 CONFIG_UNWINDER_FRAME_POINTER=y
diff --git a/board/qemu/x86_64/readme.txt b/board/qemu/x86_64/readme.txt
index 2b2ae3be20..a0a7fb6ab7 100644
--- a/board/qemu/x86_64/readme.txt
+++ b/board/qemu/x86_64/readme.txt
@@ -1,6 +1,6 @@
 Run the emulation with:
 
-  qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append "rootwait root=/dev/vda console=tty1 console=ttyS0" -serial stdio -net nic,model=virtio -net user # qemu_x86_64_defconfig
+  qemu-system-x86_64 -M pc -kernel output/images/bzImage -drive file=output/images/rootfs.ext2,if=virtio,format=raw -append "rootwait root=/dev/vda console=tty1 console=ttyS0" -serial stdio -net nic,model=virtio -net user -virtfs local,path="${HOME}",mount_tag=host0,security_model=mapped,id=host0 # qemu_x86_64_defconfig
 
 Optionally add -smp N to emulate a SMP system with N CPUs.
 
diff --git a/configs/qemu_aarch64_virt_defconfig b/configs/qemu_aarch64_virt_defconfig
index fb9db3f0fc..cb4811eba0 100644
--- a/configs/qemu_aarch64_virt_defconfig
+++ b/configs/qemu_aarch64_virt_defconfig
@@ -1,18 +1,24 @@
 BR2_aarch64=y
-BR2_PACKAGE_HOST_LINUX_HEADERS_CUSTOM_6_12=y
+BR2_PACKAGE_HOST_LINUX_HEADERS_CUSTOM_6_13=y
 BR2_GLOBAL_PATCH_DIR="board/qemu/patches"
 BR2_DOWNLOAD_FORCE_CHECK_HASHES=y
 BR2_SYSTEM_DHCP="eth0"
+BR2_ROOTFS_POST_BUILD_SCRIPT="board/qemu/post-build.sh"
 BR2_ROOTFS_POST_IMAGE_SCRIPT="board/qemu/post-image.sh"
 BR2_ROOTFS_POST_SCRIPT_ARGS="$(BR2_DEFCONFIG)"
 BR2_LINUX_KERNEL=y
-BR2_LINUX_KERNEL_CUSTOM_VERSION=y
-BR2_LINUX_KERNEL_CUSTOM_VERSION_VALUE="6.12.9"
+BR2_LINUX_KERNEL_CUSTOM_GIT=y
+BR2_LINUX_KERNEL_CUSTOM_REPO_URL="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
+BR2_LINUX_KERNEL_CUSTOM_REPO_VERSION="e48e99b6edf41c69c5528aa7ffb2daf3c59ee105"
+BR2_LINUX_KERNEL_CUSTOM_REPO_GIT_SUBMODULES=y
 BR2_LINUX_KERNEL_USE_CUSTOM_CONFIG=y
 BR2_LINUX_KERNEL_CUSTOM_CONFIG_FILE="board/qemu/aarch64-virt/linux.config"
 BR2_LINUX_KERNEL_NEEDS_HOST_OPENSSL=y
+BR2_PACKAGE_VMBUG=y
+BR2_PACKAGE_VMBUG_MODULE=y
 BR2_TARGET_ROOTFS_EXT2=y
 BR2_TARGET_ROOTFS_EXT2_4=y
 # BR2_TARGET_ROOTFS_TAR is not set
 BR2_PACKAGE_HOST_QEMU=y
 BR2_PACKAGE_HOST_QEMU_SYSTEM_MODE=y
+BR2_PACKAGE_HOST_QEMU_VIRTFS=y
diff --git a/configs/qemu_x86_64_defconfig b/configs/qemu_x86_64_defconfig
index 7c7fc374d9..3ef05dd2a2 100644
--- a/configs/qemu_x86_64_defconfig
+++ b/configs/qemu_x86_64_defconfig
@@ -1,18 +1,23 @@
 BR2_x86_64=y
-BR2_PACKAGE_HOST_LINUX_HEADERS_CUSTOM_6_12=y
+BR2_PACKAGE_HOST_LINUX_HEADERS_CUSTOM_6_13=y
 BR2_GLOBAL_PATCH_DIR="board/qemu/patches"
 BR2_DOWNLOAD_FORCE_CHECK_HASHES=y
 BR2_SYSTEM_DHCP="eth0"
-BR2_ROOTFS_POST_BUILD_SCRIPT="board/qemu/x86_64/post-build.sh"
+BR2_ROOTFS_POST_BUILD_SCRIPT="board/qemu/x86_64/post-build.sh board/qemu/post-build.sh"
 BR2_ROOTFS_POST_IMAGE_SCRIPT="board/qemu/post-image.sh"
 BR2_ROOTFS_POST_SCRIPT_ARGS="$(BR2_DEFCONFIG)"
 BR2_LINUX_KERNEL=y
-BR2_LINUX_KERNEL_CUSTOM_VERSION=y
-BR2_LINUX_KERNEL_CUSTOM_VERSION_VALUE="6.12.9"
+BR2_LINUX_KERNEL_CUSTOM_GIT=y
+BR2_LINUX_KERNEL_CUSTOM_REPO_URL="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
+BR2_LINUX_KERNEL_CUSTOM_REPO_VERSION="e48e99b6edf41c69c5528aa7ffb2daf3c59ee105"
+BR2_LINUX_KERNEL_CUSTOM_REPO_GIT_SUBMODULES=y
 BR2_LINUX_KERNEL_USE_CUSTOM_CONFIG=y
 BR2_LINUX_KERNEL_CUSTOM_CONFIG_FILE="board/qemu/x86_64/linux.config"
 BR2_LINUX_KERNEL_NEEDS_HOST_LIBELF=y
+BR2_PACKAGE_VMBUG=y
+BR2_PACKAGE_VMBUG_MODULE=y
 BR2_TARGET_ROOTFS_EXT2=y
 # BR2_TARGET_ROOTFS_TAR is not set
 BR2_PACKAGE_HOST_QEMU=y
 BR2_PACKAGE_HOST_QEMU_SYSTEM_MODE=y
+BR2_PACKAGE_HOST_QEMU_VIRTFS=y
diff --git a/linux/linux.hash b/linux/linux.hash
index 10aaed1d3f..77e071d12d 100644
--- a/linux/linux.hash
+++ b/linux/linux.hash
@@ -15,3 +15,6 @@ sha256  b5539243f187e3d478d76d44ae13aab83952c94b885ad889df6fa9997e16a441  linux-
 sha256  fb5a425bd3b3cd6071a3a9aff9909a859e7c1158d54d32e07658398cd67eb6a0  COPYING
 sha256  f6b78c087c3ebdf0f3c13415070dd480a3f35d8fc76f3d02180a407c1c812f79  LICENSES/preferred/GPL-2.0
 sha256  8e378ab93586eb55135d3bc119cce787f7324f48394777d00c34fa3d0be3303f  LICENSES/exceptions/Linux-syscall-note
+
+# extra
+sha256	d1847aa07dfcc0674d1a9ff7f40a3f49b023ab6613daf1184b4be6c4c649f4fc  linux-e48e99b6edf41c69c5528aa7ffb2daf3c59ee105-git4.tar.gz
diff --git a/package/Config.in b/package/Config.in
index 4b7e474cac..56114cc259 100644
--- a/package/Config.in
+++ b/package/Config.in
@@ -162,6 +162,8 @@ menu "Debugging, profiling and benchmark"
 	source "package/valgrind/Config.in"
 	source "package/vmtouch/Config.in"
 	source "package/whetstone/Config.in"
+	source "package/vmbug/Config.in"
+	source "package/vmbug-module/Config.in"
 endmenu
 
 menu "Development tools"
diff --git a/package/vmbug-module/Config.in b/package/vmbug-module/Config.in
new file mode 100644
index 0000000000..2b8881e23d
--- /dev/null
+++ b/package/vmbug-module/Config.in
@@ -0,0 +1,5 @@
+config BR2_PACKAGE_VMBUG_MODULE
+	bool "vmbug module"
+	depends on BR2_LINUX_KERNEL
+	help
+	  Linux Kernel Module for VMBUG demo.
diff --git a/package/vmbug-module/Makefile b/package/vmbug-module/Makefile
new file mode 100644
index 0000000000..260b640891
--- /dev/null
+++ b/package/vmbug-module/Makefile
@@ -0,0 +1,10 @@
+obj-m += $(addsuffix .o, vmbug-module)
+ccflags-y := -DDEBUG -g -std=gnu99 -Wno-declaration-after-statement
+
+.PHONY: all clean
+
+all:
+	$(MAKE) -C '$(LINUX_DIR)' M='$(PWD)' modules
+
+clean:
+	$(MAKE) -C '$(LINUX_DIR)' M='$(PWD)' clean
diff --git a/package/vmbug-module/vmbug-module.c b/package/vmbug-module/vmbug-module.c
new file mode 100644
index 0000000000..0409275339
--- /dev/null
+++ b/package/vmbug-module/vmbug-module.c
@@ -0,0 +1,136 @@
+// vmbug kernel module
+#include <linux/io.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+
+#define DRIVER_NAME "vmbug"
+
+struct vmbug_drvdata {
+	struct miscdevice misc;
+	void *page_buffer;
+};
+
+static struct vmbug_drvdata *vmbug_global_data;
+
+static inline struct vmbug_drvdata *to_vmbug_drvdata(struct file *filp)
+{
+	return container_of(filp->private_data, struct vmbug_drvdata, misc);
+}
+
+static ssize_t vmbug_read(struct file *filp, char __user *ptr, size_t len, loff_t *off)
+{
+	struct vmbug_drvdata *vmbug = to_vmbug_drvdata(filp);
+
+	return simple_read_from_buffer(ptr, len, off, vmbug->page_buffer, PAGE_SIZE);
+}
+
+static ssize_t vmbug_write(struct file *filp, const char __user *ptr, size_t len, loff_t *off)
+{
+	struct vmbug_drvdata *vmbug = to_vmbug_drvdata(filp);
+
+	return simple_write_to_buffer(vmbug->page_buffer, PAGE_SIZE, off, ptr, len);
+}
+
+static int vmbug_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct vmbug_drvdata *vmbug = to_vmbug_drvdata(filp);
+	struct device *dev = vmbug->misc.this_device;
+	pgprot_t prot;
+	unsigned long user_addr, pfn;
+	void *kern_addr;
+	phys_addr_t phys_addr;
+	int ret;
+
+	kern_addr = vmbug->page_buffer;
+	user_addr = vma->vm_start;
+	phys_addr = virt_to_phys(kern_addr);
+	pfn = phys_addr >> PAGE_SHIFT;
+	prot = vma->vm_page_prot;
+
+	ret = remap_pfn_range(vma, user_addr, pfn, PAGE_SIZE, prot);
+	if (ret) {
+		dev_err(dev, "remap_pfn_range() failed\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static const struct file_operations vmbug_fops = {
+	.owner = THIS_MODULE,
+	.read = vmbug_read,
+	.write = vmbug_write,
+	.mmap = vmbug_mmap,
+	.llseek = generic_file_llseek,
+};
+
+static int __init vmbug_init(void)
+{
+	struct vmbug_drvdata *vmbug;
+	int ret;
+
+	vmbug = kmalloc(sizeof(*vmbug_global_data), GFP_KERNEL | __GFP_ZERO);
+	if (!vmbug) {
+		printk(KERN_ERR "failed to allocate memory for '%s'\n", DRIVER_NAME);
+		ret = -ENOMEM;
+		goto err_mem;
+	}
+
+	vmbug->page_buffer = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 0);
+	if (!vmbug->page_buffer) {
+		printk(KERN_ERR "failed to allocate page for '%s'\n", DRIVER_NAME);
+		ret = -ENOMEM;
+		goto err_page;
+	}
+
+	SetPageReserved(virt_to_page(vmbug->page_buffer));
+
+	vmbug->misc.name = DRIVER_NAME;
+	vmbug->misc.minor = MISC_DYNAMIC_MINOR;
+	vmbug->misc.fops = &vmbug_fops;
+	vmbug->misc.mode = 0600;
+
+	ret = misc_register(&vmbug->misc);
+	if (ret) {
+		printk(KERN_ERR "failed to register misc device '%s': %d\n",
+				DRIVER_NAME, ret);
+		goto err_misc;
+	}
+
+	vmbug_global_data = vmbug;
+
+	printk(KERN_INFO "%s: ready\n", DRIVER_NAME);
+
+	return 0;
+
+err_misc:
+	free_pages((unsigned long)vmbug->page_buffer, 0);
+err_page:
+	kfree(vmbug);
+err_mem:
+	return ret;
+}
+
+static void __exit vmbug_exit(void)
+{
+	struct vmbug_drvdata *vmbug;
+
+	vmbug = vmbug_global_data;
+	if (!vmbug)
+		return;
+	vmbug_global_data = NULL;
+
+	misc_deregister(&vmbug->misc);
+
+	ClearPageReserved(virt_to_page(vmbug->page_buffer));
+
+	free_pages((unsigned long)vmbug->page_buffer, 0);
+	kfree(vmbug);
+}
+
+module_init(vmbug_init)
+module_exit(vmbug_exit)
+
+MODULE_LICENSE("GPL");
diff --git a/package/vmbug-module/vmbug-module.mk b/package/vmbug-module/vmbug-module.mk
new file mode 100644
index 0000000000..22f625f3d7
--- /dev/null
+++ b/package/vmbug-module/vmbug-module.mk
@@ -0,0 +1,14 @@
+################################################################################
+#
+# vmbug-module
+#
+################################################################################
+
+VMBUG_MODULE_LICENSE = GPL-2.0+
+
+define VMBUG_MODULE_EXTRACT_CMDS
+	cp $(VMBUG_MODULE_PKGDIR)/vmbug-module.c $(VMBUG_MODULE_PKGDIR)/Makefile $(@D)
+endef
+
+$(eval $(kernel-module))
+$(eval $(generic-package))
diff --git a/package/vmbug/Config.in b/package/vmbug/Config.in
new file mode 100644
index 0000000000..5372f62591
--- /dev/null
+++ b/package/vmbug/Config.in
@@ -0,0 +1,5 @@
+config BR2_PACKAGE_VMBUG
+	bool "VMBUG userspace program"
+	depends on BR2_LINUX_KERNEL
+	help
+	  Linux userspace VMbug program.
diff --git a/package/vmbug/Makefile b/package/vmbug/Makefile
new file mode 100644
index 0000000000..05b5075448
--- /dev/null
+++ b/package/vmbug/Makefile
@@ -0,0 +1,25 @@
+.PHONY: all clean install help mrproper
+
+.SUFFIXES:
+
+CFLAGS=-Wall -O2 -g
+
+all: vmbug
+
+vmbug: vmbug.o
+	$(CC) vmbug.o -o $@ $(LDFLAGS)
+
+%.o: %.c
+	$(CC) $(CFLAGS) -c $< -o $@
+
+clean:
+	-rm -f vmbug *.o *~
+
+install:
+	cp vmbug $(DESTDIR)/bin/vmbug
+
+help:
+	@echo "Available targets : all, install, clean, mrproper"
+
+mrproper: clean
+	rm -rf vmbug
diff --git a/package/vmbug/vmbug.c b/package/vmbug/vmbug.c
new file mode 100644
index 0000000000..70fdb2a8d2
--- /dev/null
+++ b/package/vmbug/vmbug.c
@@ -0,0 +1,202 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <alloca.h>
+#include <sys/mman.h>
+#include <stdbool.h>
+
+static int safe_open(const char *pathname, int flags, mode_t mode)
+{
+	int fd;
+
+	fd = open(pathname, flags, mode);
+	if (fd == -1) {
+		perror("open failed");
+		exit(EXIT_FAILURE);
+	}
+	return fd;
+}
+
+static void safe_close(int fd)
+{
+	int ret;
+
+	ret = close(fd);
+	if (ret == -1) {
+		perror("close failed");
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void safe_write(int fd, const void *ptr, size_t size)
+{
+	ssize_t wrn;
+
+	while (size) {
+		do {
+			wrn = write(fd, ptr, size);
+		} while (wrn == -1 && errno == EAGAIN);
+		if (wrn == -1 || (size_t)wrn > size) {
+			perror("write failed");
+			exit(EXIT_FAILURE);
+		}
+		ptr += wrn;
+		size -= wrn;
+	}
+}
+
+static void safe_read(int fd, void *ptr, size_t size)
+{
+	ssize_t rdn;
+
+	while (size) {
+		do {
+			rdn = read(fd, ptr, size);
+		} while (rdn == -1 && errno == EAGAIN);
+		if (rdn == -1 || (size_t)rdn > size) {
+			perror("read failed");
+			exit(EXIT_FAILURE);
+		}
+		ptr += rdn;
+		size -= rdn;
+	}
+}
+
+static void safe_lseek(int fd, off_t off, int whence)
+{
+	off_t ret;
+
+	ret = lseek(fd, off, whence);
+	if (ret == (off_t)-1) {
+		perror("lseek failed");
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void *safe_mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
+{
+	void *ret;
+
+	ret = mmap(addr, length, prot, flags, fd, offset);
+	if (ret == MAP_FAILED) {
+		perror("mmap failed");
+		exit(EXIT_FAILURE);
+	}
+	return ret;
+}
+
+#define SIZE 4096
+
+int main(int argc, char *argv[])
+{
+	int opt, fd, wfd;
+	char *buf;
+	const char *test_pattern, *devname, *write_file;
+	size_t test_pattern_sz, pagesz;
+	void *addr;
+	FILE *help_fp;
+	bool help_ok, use_buffer;
+
+	/* allocate a single page buffer */
+	pagesz = (size_t)getpagesize();
+	buf = alloca(pagesz);
+
+	/* default test pattern */
+	test_pattern = "testing";
+	test_pattern_sz = strlen(test_pattern);
+	devname = "/dev/vmbug";
+	write_file = "vmbug.bin";
+	use_buffer = false;
+	while ((opt = getopt(argc, argv, "d:t:w:bh?")) != -1) {
+		switch (opt) {
+		case 'd':
+			devname = optarg;
+			break;
+		case 't':
+			test_pattern = optarg;
+			test_pattern_sz = strlen(test_pattern);
+			if (test_pattern_sz > (size_t)pagesz) {
+				fprintf(stderr, "test-pattern too big size %zu (max = %zu)\n", test_pattern_sz, pagesz);
+				exit(EXIT_FAILURE);
+			}
+			break;
+		case 'w':
+			write_file = optarg;
+			break;
+		case 'b':
+			use_buffer = true;
+			break;
+		case '?':
+		case 'h':
+		default: /* '?' */
+			help_ok = opt == '?' || opt == 'h';
+			help_fp = help_ok ? stdout : stderr;
+			fprintf(help_fp, "Usage: %s [-t test-pattern] [-d device] [-w write-file] [-b]\n", argv[0]);
+			fprintf(help_fp, "Trigger user vm iterator bug\n");
+			fprintf(help_fp, "\n");
+			fprintf(help_fp, " -t   Test pattern to use (default \"testing\")\n");
+			fprintf(help_fp, " -d   Device to use (default \"/dev/vmbug\")\n");
+			fprintf(help_fp, " -w   Write to file (default \"vmbug.bin\")\n");
+			fprintf(help_fp, " -b   Use user space buffer instead of direct write\n");
+			exit(help_ok ? EXIT_SUCCESS : EXIT_FAILURE);
+		}
+	}
+
+	printf("opening %s device\n", devname);
+	fd = safe_open(devname, O_RDWR, 0);
+
+	safe_write(fd, test_pattern, test_pattern_sz);
+	safe_lseek(fd, 0, SEEK_SET);
+
+	safe_read(fd, buf, test_pattern_sz);
+
+	printf("readback of write '%.*s' is '%.*s'\n",
+			(int)test_pattern_sz, test_pattern, (int)test_pattern_sz, buf);
+
+	if (memcmp(test_pattern, buf, test_pattern_sz)) {
+		errno = EINVAL;
+		perror("readback from device using read differs");
+		exit(EXIT_FAILURE);
+	}
+
+	printf("readback for read/write OK\n");
+	memset(buf, 0, pagesz);
+
+	printf("mmaping %s device (size = %zu)\n", devname, pagesz);
+	addr = safe_mmap(NULL, pagesz, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+	printf("readback of mmap '%.*s' is '%.*s'\n",
+			(int)test_pattern_sz, test_pattern, (int)test_pattern_sz, (char *)addr);
+
+	if (memcmp(test_pattern, addr, test_pattern_sz)) {
+		errno = EINVAL;
+		perror("readback from device using mmap differs");
+		exit(EXIT_FAILURE);
+	}
+
+	printf("readback for mmap OK\n");
+
+	/* open write file */
+	printf("opening file %s for write\n", write_file);
+	wfd = safe_open(write_file, O_WRONLY | O_CREAT, 0660);
+
+	if (!use_buffer) {
+		printf("direct mmap write to file\n");
+		safe_write(wfd, addr, pagesz);
+	} else {
+		printf("copy to buffer and then write to file\n");
+		memcpy(buf, addr, pagesz);
+		safe_write(wfd, buf, pagesz);
+	}
+	printf("write OK\n");
+
+	printf("closing file %s for write\n", write_file);
+	safe_close(wfd);
+
+	safe_close(fd);
+
+	return 0;
+}
diff --git a/package/vmbug/vmbug.mk b/package/vmbug/vmbug.mk
new file mode 100644
index 0000000000..4906f2dc03
--- /dev/null
+++ b/package/vmbug/vmbug.mk
@@ -0,0 +1,21 @@
+################################################################################
+#
+# vmbug
+#
+################################################################################
+
+VMBUG_LICENSE = GPL-2.0+
+
+define VMBUG_EXTRACT_CMDS
+	cp $(VMBUG_PKGDIR)/vmbug.c $(VMBUG_PKGDIR)/Makefile $(@D)
+endef
+
+define VMBUG_BUILD_CMDS
+	$(MAKE) $(TARGET_CONFIGURE_OPTS) -C $(@D) all
+endef
+
+define VMBUG_INSTALL_TARGET_CMDS
+	$(INSTALL) -D -m 0755 $(@D)/vmbug $(TARGET_DIR)/usr/bin
+endef
+
+$(eval $(generic-package))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:40           ` David Hildenbrand
  2025-05-08 15:48             ` Pantelis Antoniou
  2025-05-08 16:25             ` Pantelis Antoniou
@ 2025-05-08 17:35             ` Jason Gunthorpe
  2025-05-08 17:47               ` Pantelis Antoniou
  2 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 17:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pantelis Antoniou, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 05:40:15PM +0200, David Hildenbrand wrote:
> > I don't think there was a deliberate decision here, but there was no
> > conversion to remap_pfn_range(), the code (in DRM) was always there.
> > 
> > The regression occurred when netfslib started using GUP for I/O and
> > when filesystems switched to it we hit this case.
> 
> Okay, so GUP and DRM always worked that way. They are essentially
> incompatible at this point due to VM_PFNMAP.
> 
> So netfslib requesting something that is impossible is the problem .. or
> rather filesystems switching to that and not realizing the problem.
> 
> Hmmm

This patch definitely doesn't look very good as is.

We *certainly* should not be even trying to touch the struct page of a
VM_PFNMAP mapping *at all*. By definition that is forbidden.

It looks to me like vm_normal_page() already supports MIXEDMAP, so
probably the better hotfix is to have DRM use MIXEDMAP if it is
installing PFNs that it is willing to be used as struct page.

But who knows if DRM can do that on arches that don't have
PTE_SPECIAL..

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 17:35             ` Jason Gunthorpe
@ 2025-05-08 17:47               ` Pantelis Antoniou
  2025-05-08 18:01                 ` Jason Gunthorpe
  2025-05-08 18:02                 ` David Hildenbrand
  0 siblings, 2 replies; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 17:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, 8 May 2025 14:35:35 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

Hi Jason,

> On Thu, May 08, 2025 at 05:40:15PM +0200, David Hildenbrand wrote:
> > > I don't think there was a deliberate decision here, but there was
> > > no conversion to remap_pfn_range(), the code (in DRM) was always
> > > there.
> > > 
> > > The regression occurred when netfslib started using GUP for I/O
> > > and when filesystems switched to it we hit this case.
> > 
> > Okay, so GUP and DRM always worked that way. They are essentially
> > incompatible at this point due to VM_PFNMAP.
> > 
> > So netfslib requesting something that is impossible is the problem
> > .. or rather filesystems switching to that and not realizing the
> > problem.
> > 
> > Hmmm
> 
> This patch definitely doesn't look very good as is.
> 

No argument praising its beauty from me.
What is the right solution then?

> We *certainly* should not be even trying to touch the struct page of a
> VM_PFNMAP *at all*. By definition that is forbidden.
> 
> It looks to me like vm_normal_page() already supports MIXEDMAP, so
> probably the better hotfix is to have DRM use MIXEDMAP if it is
> installing PFNs that it is willing to be used as struct page.
> 
> But who knows if DRM can do that on arches that don't have
> PTE_SPECIAL..
> 

The question from me is why a __get_free_pages() area that is passed
to remap_pfn_range() gets the PFNMAP bit set.

Even if DRM sets the MIXEDMAP bit PFNMAP will still be set.

> Jason
> 

Regards

-- Pantelis


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 17:47               ` Pantelis Antoniou
@ 2025-05-08 18:01                 ` Jason Gunthorpe
  2025-05-08 18:02                 ` David Hildenbrand
  1 sibling, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 18:01 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: David Hildenbrand, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 08:47:11PM +0300, Pantelis Antoniou wrote:
> > But who knows if DRM can do that on arches that don't have
> > PTE_SPECIAL..
> > 
> 
> The question from me is why a __get_free_pages() area that is passed
> to remap_pfn_range() gets the PFNMAP bit set.
> 
> Even if DRM sets the MIXEDMAP bit PFNMAP will still be set.

remap_pfn_range() is the wrong way to install a PFN that has a working
struct page. It sets the special bit and it is unambiguously wrong to
try to convert a special PTE to a struct page. The special bit means,
without any doubt, that the PTE's address must never be resolved to a
struct page even if it has one. That is the very meaning of the special bit.

I see some DRM drivers using MIXEDMAP and remap_pfn_range() together,
that seems to just be creating a confusing mess. Having
VM_PFNMAP|VM_MIXEDMAP set together is a nonsensical
combination. Having struct page backed memory in a VMA marked with
MIXEDMAP and with the PTE set as special is pointless.

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread
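
A toy userspace model of the rule stated in the message above may help make it concrete. This is only a sketch under loud assumptions: every `toy_*` name and `TOY_*` constant is hypothetical, nothing here is kernel code, and the real vm_normal_page() handles more cases (for example architectures without a special PTE bit). It models just two facts from the thread: a remap_pfn_range()-style install marks the PTE special, and a special PTE never resolves to a struct page even when the PFN has one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the VMA flags discussed in the thread. */
#define TOY_VM_PFNMAP   0x1u
#define TOY_VM_MIXEDMAP 0x2u

struct toy_page { int refcount; };
struct toy_pte  { unsigned long pfn; bool special; };
struct toy_vma  { unsigned int flags; };

/* A tiny fake memmap: PFNs 0..7 have a struct page. */
#define TOY_NR_PAGES 8
static struct toy_page toy_memmap[TOY_NR_PAGES];

static bool toy_pfn_valid(unsigned long pfn)
{
	return pfn < TOY_NR_PAGES;
}

/* Models "install a raw PFN": VMA becomes PFNMAP, PTE is special. */
static struct toy_pte toy_remap_pfn(struct toy_vma *vma, unsigned long pfn)
{
	vma->flags |= TOY_VM_PFNMAP;
	return (struct toy_pte){ .pfn = pfn, .special = true };
}

/* Models "map this refcounted page": MIXEDMAP, non-special PTE, ref taken. */
static struct toy_pte toy_insert_page(struct toy_vma *vma, unsigned long pfn)
{
	assert(toy_pfn_valid(pfn));
	vma->flags |= TOY_VM_MIXEDMAP;
	toy_memmap[pfn].refcount++;
	return (struct toy_pte){ .pfn = pfn, .special = false };
}

/* Lookup rule: a special PTE yields no page, whatever the PFN is. */
static struct toy_page *toy_normal_page(struct toy_pte pte)
{
	if (pte.special)
		return NULL;
	if (!toy_pfn_valid(pte.pfn))
		return NULL;
	return &toy_memmap[pte.pfn];
}
```

In this model the same PFN gives a page through the insert path and NULL through the remap path, which is the asymmetry that makes pinning fail for the DRM mappings under discussion.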

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 17:47               ` Pantelis Antoniou
  2025-05-08 18:01                 ` Jason Gunthorpe
@ 2025-05-08 18:02                 ` David Hildenbrand
  2025-05-08 18:11                   ` Pantelis Antoniou
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 18:02 UTC (permalink / raw)
  To: Pantelis Antoniou, Jason Gunthorpe
  Cc: Peter Xu, Andrew Morton, mm-commits, wade.farnsworth, jhubbard,
	c.briere, artem.k, David Howells

On 08.05.25 19:47, Pantelis Antoniou wrote:
> On Thu, 8 May 2025 14:35:35 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> Hi Jason,
> 
>> On Thu, May 08, 2025 at 05:40:15PM +0200, David Hildenbrand wrote:
>>>> I don't think there was a deliberate decision here, but there was
>>>> no conversion to remap_pfn_range(), the code (in DRM) was always
>>>> there.
>>>>
>>>> The regression occurred when netfslib started using GUP for I/O
>>>> and when filesystems switched to it we hit this case.
>>>
>>> Okay, so GUP and DRM always worked that way. They are essentially
>>> incompatible at this point due to VM_PFNMAP.
>>>
>>> So netfslib requesting something that is impossible is the problem
>>> .. or rather filesystems switching to that and not realizing the
>>> problem.
>>>
>>> Hmmm
>>
>> This patch definitely doesn't look very good as is.
>>
> 
> No argument praising its beauty from me.
> What is the right solution then?
> 
>> We *certainly* should not be even trying to touch the struct page of a
>> VM_PFNMAP *at all*. By definition that is forbidden.
>>
>> It looks to me like vm_normal_page() already supports MIXEDMAP, so
>> probably the better hotfix is to have DRM use MIXEDMAP if it is
>> installing PFNs that it is willing to be used as struct page.
>>
>> But who knows if DRM can do that on arches that don't have
>> PTE_SPECIAL..
>>
> 
> The question from me is why a __get_free_pages() area that is passed
> to remap_pfn_range() gets the PFNMAP bit set.

remap *PFN* range is the wrong interface. You literally tell the system 
"map a PFN range and ignore any struct page" instead of "please map this 
refcounted page".

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 18:02                 ` David Hildenbrand
@ 2025-05-08 18:11                   ` Pantelis Antoniou
  2025-05-08 18:26                     ` David Hildenbrand
                                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 18:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, 8 May 2025 20:02:38 +0200
David Hildenbrand <david@redhat.com> wrote:

Hi David,

> On 08.05.25 19:47, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 14:35:35 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > Hi Jason,
> > 
> >> On Thu, May 08, 2025 at 05:40:15PM +0200, David Hildenbrand wrote:
> >>>> I don't think there was a deliberate decision here, but there was
> >>>> no conversion to remap_pfn_range(), the code (in DRM) was always
> >>>> there.
> >>>>
> >>>> The regression occurred when netfslib started using GUP for I/O
> >>>> and when filesystems switched to it we hit this case.
> >>>
> >>> Okay, so GUP and DRM always worked that way. They are essentially
> >>> incompatible at this point due to VM_PFNMAP.
> >>>
> >>> So netfslib requesting something that is impossible is the problem
> >>> .. or rather filesystems switching to that and not realizing the
> >>> problem.
> >>>
> >>> Hmmm
> >>
> >> This patch definitely doesn't look very good as is.
> >>
> > 
> > No argument praising its beauty from me.
> > What is the right solution then?
> > 
> >> We *certainly* should not be even trying to touch the struct page
> >> of a VM_PFNMAP *at all*. By definition that is forbidden.
> >>
> >> It looks to me like vm_normal_page() already supports MIXEDMAP, so
> >> probably the better hotfix is to have DRM use MIXEDMAP if it is
> >> installing PFNs that it is willing to be used as struct page.
> >>
> >> But who knows if DRM can do that on arches that don't have
> >> PTE_SPECIAL..
> >>
> > 
> > The question from me is why a __get_free_pages() area that is passed
> > to remap_pfn_range() gets the PFNMAP bit set.
> 
> remap *PFN* range is the wrong interface. You literally tell the
> system "map a PFN range and ignore any struct page" instead of
> "please map this refcounted page".
> 

I agree, but it's not my code that's doing it.
This has been going on for more than a decade at this point.

Can we get a plan on how to go around fixing these issues correctly?

1. Drivers/subsystems (DRM in this case) are doing remap_pfn_range() to
map system memory with a page attached to user space.
Up until recently this was OK, since no-one tried to pin the pages for
any reason. It doesn't seem like this is the right way to do it.
What is the right way?

2. DRM in particular has no standardized way to handle mapping system
memory Buffer Objects (BOs) to userspace. Each driver is free to do its
own thing and does so. What is the right way to handle this case?

3. While we go about fixing it, this has caused a pretty significant
userspace regression, where the address space that those BOs reside in
cannot be used for I/O when a network filesystem is involved. I think
it's a matter of time before regular filesystems start using the same
method of pinning and doing I/O instead of using the filecache on fast
memory mediums.

Regards

-- Pantelis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 18:11                   ` Pantelis Antoniou
@ 2025-05-08 18:26                     ` David Hildenbrand
  2025-05-08 18:47                     ` Peter Xu
  2025-05-08 19:11                     ` Jason Gunthorpe
  2 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 18:26 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Jason Gunthorpe, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells,
	Lorenzo Stoakes

> I agree, but it's not my code that's doing it.

I know. And it's not GUP that's broken.

Taking a ref on a page and returning a page when explicitly told not
to even look up a page (including if there is no page, or if the page
contains unrelated garbage -- e.g., from a memory hole during boot etc)
cannot possibly work. And that's what GUP is all about.

> This has been going on for more than a decade at this point.
> 
> Can we get a plan on how to go around fixing these issues correctly?
> 
> 1. Drivers/subsystems (DRM in this case) are doing remap_pfn_range() to
> map system memory with a page attached to user space.
> Up until recently this was OK, since no-one tried to pin the pages for
> any reason. It doesn't seem like this is the right way to do it.
> What is the right way?

I recently had a similar discussion with Lorenzo. (CC)

Worth looking at

commit 8e553520596bbd5ce832e26e9d721e6a0c797b8b
Author: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date:   Mon Mar 31 13:56:08 2025 +0100

     intel_th: avoid using deprecated page->mapping, index fields
     
     The struct page->mapping, index fields are deprecated and soon to be only
     available as part of a folio.
     
     It is likely the intel_th code which sets page->mapping, index is was
     implemented out of concern that some aspect of the page fault logic may
     encounter unexpected problems should they not.
     
     However, the appropriate interface for inserting kernel-allocated memory is
     vm_insert_page() in a VM_MIXEDMAP. By using the helper function
     vmf_insert_mixed() we can do this with minimal churn in the existing fault
     handler.
     
     ...

Take a look at how kernel/trace/ring_buffer.c and io_uring/memmap.c use vm_insert_pages().

It can be done lazily during page faults using vmf_insert_mixed().

> 
> 2. DRM in particular has no standardized way to handle mapping system
> memory Buffer Objects (BOs) to userspace. Each driver is free to do its
> own thing and does so. What is the right way to handle this case?

Probably this should all be properly refactored to map kernel allocations
as refcounted into user page tables.

> 
> 3. While we go about fixing it, this has caused a pretty significant
> userspace regression, where the address space that those BOs reside in
> cannot be used for I/O when a network filesystem is involved. I think
> it's a matter of time before regular filesystems start using the same
> method of pinning and doing I/O instead of using the filecache on fast
> memory mediums.

I am surprised that whoever did that change didn't realize that this simply
doesn't work and never did work earlier :(

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread
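
The "lazily during page faults" approach mentioned above can be sketched as a toy userspace model. This is an illustration under stated assumptions, not driver code: `toy_bo`, `toy_vma`, `toy_bo_fault()` and the `TOY_*` constants are all hypothetical, and a real handler would call one of the kernel insert helpers named in the message rather than bump a counter. The sketch only shows the shape of the logic: turn the faulting address into a page index within the buffer object, reject out-of-range faults, and have the mapping take its own reference on the page the first time it is mapped.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

enum toy_fault_result { TOY_FAULT_NOPAGE, TOY_FAULT_SIGBUS };

struct toy_page {
	int refcount;	/* references held on the page */
	int mapped;	/* already installed in this toy mapping? */
};

/* Hypothetical buffer object backed by ordinary allocated pages. */
struct toy_bo {
	struct toy_page *pages;
	size_t nr_pages;
};

struct toy_vma {
	unsigned long start;	/* first user address of the mapping */
	unsigned long pgoff;	/* offset into the BO, in pages */
	struct toy_bo *bo;
};

/* Fault handler model: map exactly the page that was touched. */
static enum toy_fault_result toy_bo_fault(struct toy_vma *vma,
					  unsigned long addr)
{
	struct toy_page *page;
	size_t idx;

	assert(vma != NULL && vma->bo != NULL);

	if (addr < vma->start)
		return TOY_FAULT_SIGBUS;

	idx = vma->pgoff + ((addr - vma->start) >> TOY_PAGE_SHIFT);
	if (idx >= vma->bo->nr_pages)
		return TOY_FAULT_SIGBUS;

	page = &vma->bo->pages[idx];
	if (!page->mapped) {
		page->refcount++;	/* the mapping holds its own reference */
		page->mapped = 1;
	}
	return TOY_FAULT_NOPAGE;
}
```

Because the mapping holds a reference, a later pin by GUP in this model would operate on a page whose lifetime is tracked, which is the property the remap_pfn_range() path lacks.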

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 18:11                   ` Pantelis Antoniou
  2025-05-08 18:26                     ` David Hildenbrand
@ 2025-05-08 18:47                     ` Peter Xu
  2025-05-08 19:04                       ` David Hildenbrand
  2025-05-08 19:11                     ` Jason Gunthorpe
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Xu @ 2025-05-08 18:47 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: David Hildenbrand, Jason Gunthorpe, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 09:11:36PM +0300, Pantelis Antoniou wrote:
> This has been going on for more than a decade at this point.

I wasn't aware how long, but I need to confess I am also aware of such in
virt context where drivers map these pages in PFNMAPs..  So KVM also has
similar cases happening in corner case setups.

> 
> Can we get a plan on how to go around fixing these issues correctly?
> 
> 1. Drivers/subsystems (DRM in this case) are doing remap_pfn_range() to
> map system memory with a page attached to user space.
> Up until recently this was OK, since no-one tried to pin the pages for
> any reason. It doesn't seem like this is the right way to do it.
> What is the right way?
> 
> 2. DRM in particular has no standardized way to handle mapping system
> memory Buffer Objects (BOs) to userspace. Each driver is free to do its
> own thing and does so. What is the right way to handle this case?
> 
> 3. While we go about fixing it, this has caused a pretty significant
> userspace regression, where the address space that those BOs reside in
> cannot be used for I/O when a network filesystem is involved. I think
> it's a matter of time before regular filesystems start using the same
> method of pinning and doing I/O instead of using the filecache on fast
> memory mediums.

I mentioned it elsewhere, but _if_ fixing all the drivers isn't possible in
the near future.. we could still have a chance to not mess with GUP (in which
case PFNMAP is also working like that for so many years, likely what the
drivers do on abusing pages in PFNMAPs...).  That is supporting such
special pages in iov_iter.

I attached such a patch below, not saying that we should merge it, but IMHO
it's much better than fiddling with gup here, and so far that's the best I
can think of (and only if it works for 9pfs's current zerocopy case).. I
only smoke-tested it, but I didn't verify it.

PS: I was also thinking maybe if 9pfs is the only affected so far, I wonder
if we could try to fallback to cached RW when necessary, IIUC that's still
working, right (and is 9pfs a production-level fs)?  But I think that's not
as good as below if supporting that isn't extremely hard.

Thanks,

=======8<======
From cd4aa467e4b653b5bd8496b5123d65d06e2e3263 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 8 May 2025 14:35:24 -0400
Subject: [PATCH] iov_iter: Supports special PFNMAP for user buffer when
 pfn_valid()

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  1 +
 lib/iov_iter.c     | 60 +++++++++++++++++++++++++++++++++++++++++++++-
 mm/gup.c           | 16 +++++++++++++
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38e16c984b9a..f79fd14599fa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1459,6 +1459,7 @@ static inline void put_page(struct page *page)
  */
 #define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+int pin_user_page(struct page *page);
 void unpin_user_page(struct page *page);
 void unpin_folio(struct folio *folio);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d9e19fb2dcf3..2eb24b2793f0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1809,6 +1809,64 @@ static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
 	return size;
 }
 
+/**
+ * iov_iter_pin_user_pages() - pin user pages for iov iter ops
+ *
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
+ *
+ * Almost a wrapper for pin_user_pages_fast(), but also supports PFNMAPs
+ * where in extremely rare cases there's actually struct page available
+ * (e.g. device drivers playing trick with PFNMAP by injecting allocated
+ * RAM pages).
+ */
+static inline int
+iov_iter_pin_user_pages(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	struct follow_pfnmap_args args;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int res, ret;
+
+	res = pin_user_pages_fast(start, nr_pages, gup_flags, pages);
+
+	/* Normally, GUP should really work already.. */
+	if (likely(res > 0))
+		return res;
+
+	/*
+	 * This is to take care of an extremely rare case: retry in case if
+	 * it's a PFNMAP that has struct page backed.
+	 *
+	 * So far it does the minimum we need in the failure path.  It
+	 * assumes the PFNMAP entries must exist in the pgtables already,
+	 * and it resolves one PFN at a time.
+	 */
+	mm = current->mm;
+	mmap_read_lock(mm);
+	vma = vma_lookup(current->mm, start);
+	if (!vma)
+		goto out;
+
+	args.vma = vma;
+	args.address = start;
+
+	ret = follow_pfnmap_start(&args);
+	if (ret)
+		goto out;
+	/* Did we find a special page under VM_PFNMAP? */
+	if (pfn_valid(args.pfn) && pin_user_page(pfn_to_page(args.pfn)))
+		res = 1;
+	follow_pfnmap_end(&args);
+out:
+	mmap_read_unlock(mm);
+	return res;
+}
+
 /*
  * Extract a list of contiguous pages from a user iterator and get a pin on
  * each of them.  This should only be used if the iterator is user-backed
@@ -1846,7 +1904,7 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
 	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
 	if (!maxpages)
 		return -ENOMEM;
-	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	res = iov_iter_pin_user_pages(addr, maxpages, gup_flags, *pages);
 	if (unlikely(res <= 0))
 		return res;
 	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
diff --git a/mm/gup.c b/mm/gup.c
index d3aac58862c0..eede221ac89a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -178,6 +178,22 @@ int __must_check try_grab_folio(struct folio *folio, int refs,
 	return 0;
 }
 
+/**
+ * pin_user_page() - dma-pinned a page
+ * @page:            pointer to page to be pinned
+ *
+ * NOTE!  One should normally use pin_user_pages*() API instead.  This
+ * should be only useful in extremely special cases, like struct page under
+ * VM_PFNMAP.
+ *
+ * Returns: 0 if success, negative if pin failed
+ */
+int pin_user_page(struct page *page)
+{
+	return try_grab_folio(page_folio(page), 1, FOLL_PIN);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:            pointer to page to be released
-- 
2.49.0


-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 18:47                     ` Peter Xu
@ 2025-05-08 19:04                       ` David Hildenbrand
  2025-05-08 19:06                         ` Jason Gunthorpe
  2025-05-08 19:08                         ` Peter Xu
  0 siblings, 2 replies; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 19:04 UTC (permalink / raw)
  To: Peter Xu, Pantelis Antoniou
  Cc: Jason Gunthorpe, Andrew Morton, mm-commits, wade.farnsworth,
	jhubbard, c.briere, artem.k, David Howells

> +	/* Did we find a special page under VM_PFNMAP? */
> +	if (pfn_valid(args.pfn) && pin_user_page(pfn_to_page(args.pfn)))
> +		res = 1;
>

It's doing the same wrong thing at a different place.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:04                       ` David Hildenbrand
@ 2025-05-08 19:06                         ` Jason Gunthorpe
  2025-05-08 19:08                         ` Peter Xu
  1 sibling, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 19:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Pantelis Antoniou, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
> > +	/* Did we find a special page under VM_PFNMAP? */
> > +	if (pfn_valid(args.pfn) && pin_user_page(pfn_to_page(args.pfn)))
> > +		res = 1;
> > 
> 
> It's doing the same wrong thing at a different place.

+1 this is a DRM problem, it must be fixed in DRM drivers.

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:04                       ` David Hildenbrand
  2025-05-08 19:06                         ` Jason Gunthorpe
@ 2025-05-08 19:08                         ` Peter Xu
  2025-05-08 19:12                           ` Jason Gunthorpe
  2025-05-08 19:14                           ` David Hildenbrand
  1 sibling, 2 replies; 32+ messages in thread
From: Peter Xu @ 2025-05-08 19:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pantelis Antoniou, Jason Gunthorpe, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
> It's doing the same wrong thing at a different place.

As I mentioned, I believe KVM has this wrong thing working so far..  and it
doesn't block us from going right ultimately.  It's a matter of time.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:08                         ` Peter Xu
@ 2025-05-08 19:12                           ` Jason Gunthorpe
  2025-05-08 19:16                             ` David Hildenbrand
  2025-05-08 19:39                             ` Peter Xu
  2025-05-08 19:14                           ` David Hildenbrand
  1 sibling, 2 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 19:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Pantelis Antoniou, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 03:08:19PM -0400, Peter Xu wrote:
> On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
> > It's doing the same wrong thing at a different place.
> 
> As I mentioned, I believe KVM has this wrong thing working so far..  and it
> doesn't block us from going right ultimately.  It's a matter of time.

AFAIK KVM is doing this wonky thing using mmu notifiers, it doesn't
take page references on pte special pages..

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:12                           ` Jason Gunthorpe
@ 2025-05-08 19:16                             ` David Hildenbrand
  2025-05-08 19:39                             ` Peter Xu
  1 sibling, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 19:16 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: Pantelis Antoniou, Andrew Morton, mm-commits, wade.farnsworth,
	jhubbard, c.briere, artem.k, David Howells

On 08.05.25 21:12, Jason Gunthorpe wrote:
> On Thu, May 08, 2025 at 03:08:19PM -0400, Peter Xu wrote:
>> On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
>>> It's doing the same wrong thing at a different place.
>>
>> As I mentioned, I believe KVM has this wrong thing working so far..  and it
>> doesn't block us from going right ultimately.  It's a matter of time.
> 
> AFAIK KVM is doing this wonky thing using mmu notifiers, it doesn't
> take page references on pte special pages..

Ah, and vfio also doesn't grab a ref in that case IIUC.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:12                           ` Jason Gunthorpe
  2025-05-08 19:16                             ` David Hildenbrand
@ 2025-05-08 19:39                             ` Peter Xu
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Xu @ 2025-05-08 19:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Pantelis Antoniou, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 04:12:15PM -0300, Jason Gunthorpe wrote:
> AFAIK KVM is doing this wonky thing using mmu notifiers, it doesn't
> take page references on pte special pages..

I checked the latest, I think you're right at least on the latest master
branch..  IIUC its behavior on refcounting changed only last year after
Sean's 3dd48ecfac7f ("KVM: Provide refcounted page as output field in
struct kvm_follow_pfn").

To me, taking the refcount has one tiny little "benefit" of avoiding the
UAF you mentioned in the other email.  But I agree the whole thing is still
pretty wonky.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:08                         ` Peter Xu
  2025-05-08 19:12                           ` Jason Gunthorpe
@ 2025-05-08 19:14                           ` David Hildenbrand
  2025-05-08 19:19                             ` Jason Gunthorpe
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 19:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Pantelis Antoniou, Jason Gunthorpe, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On 08.05.25 21:08, Peter Xu wrote:
> On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
>> It's doing the same wrong thing at a different place.
> 
> As I mentioned, I believe KVM has this wrong thing working so far..  and it
> doesn't block us from going right ultimately.  It's a matter of time.

Yes, KVM has it wrong and vfio probably as well. And they are usually 
not dealing with actual kernel allocations, but rather with MMIO ranges.

Sorry, no more hacks.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:14                           ` David Hildenbrand
@ 2025-05-08 19:19                             ` Jason Gunthorpe
  2025-05-08 19:34                               ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 19:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Pantelis Antoniou, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 09:14:38PM +0200, David Hildenbrand wrote:
> On 08.05.25 21:08, Peter Xu wrote:
> > On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
> > > It's doing the same wrong thing at a different place.
> > 
> > As I mentioned, I believe KVM has this wrong thing working so far..  and it
> > doesn't block us from going right ultimately.  It's a matter of time.
> 
> Yes, KVM has it wrong and vfio probably as well. And they are usually not
> dealing with actual kernel allocations, but rather with MMIO ranges.

vfio also doesn't take references on the things it pulls out of the
VMA. The vfio bug is different, it lets you take a pte special
phys_addr_t and reference it through the IOMMU without any
refcounting. So when the VMA is destroyed and the page free'd by the
GPU driver we just UAF it from VFIO through the iommu page
table. Woops.

What we are talking about here is very different from both kvm and
vfio, this is ignoring pte special and accessing the struct page
refcount anyhow. I certainly don't know of anything that is doing
that, though I didn't know about the old netfs stuff :\

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:19                             ` Jason Gunthorpe
@ 2025-05-08 19:34                               ` David Hildenbrand
  2025-05-09 16:30                                 ` Pantelis Antoniou
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2025-05-08 19:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Pantelis Antoniou, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On 08.05.25 21:19, Jason Gunthorpe wrote:
> On Thu, May 08, 2025 at 09:14:38PM +0200, David Hildenbrand wrote:
>> On 08.05.25 21:08, Peter Xu wrote:
>>> On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
>>>> It's doing the same wrong thing at a different place.
>>>
>>> As I mentioned, I believe KVM has this wrong thing working so far..  and it
>>> doesn't block us from going right ultimately.  It's a matter of time.
>>
>> Yes, KVM has it wrong and vfio probably as well. And they are usually not
>> dealing with actual kernel allocations, but rather with MMIO ranges.
> 
> vfio also doesn't take references on the things it pulls out of the
> VMA. The vfio bug is different, it lets you take a pte special
> phys_addr_t and reference it through the IOMMU without any
> refcounting. So when the VMA is destroyed and the page free'd by the
> GPU driver we just UAF it from VFIO through the iommu page
> table. Woops.

Right. What is_invalid_reserved_pfn() does is check that if it has a 
"struct page", that that one must be marked PG_reserved.

That PG_reserved check is a nasty check for MMIO pages or memory holes 
part of a present memory section.

So at least it will reject anything that is just an ordinary kernel 
allocation (!reserved).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread
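
The check described above can be modeled in a few lines of userspace C. This is a hedged sketch of the logic as the message states it, not a copy of the vfio source: `toy_memmap`, `toy_pfn_valid()`, `toy_set_reserved()` and `toy_is_invalid_reserved_pfn()` are hypothetical names. A PFN with no struct page passes, a PFN whose page is marked reserved passes, and a PFN backed by an ordinary (non-reserved) kernel allocation is rejected.

```c
#include <assert.h>
#include <stdbool.h>

/* Fake memmap: PFNs 0..3 have a struct page, anything else does not. */
#define TOY_NR_PAGES 4

struct toy_page { bool reserved; };	/* models PG_reserved */

static struct toy_page toy_memmap[TOY_NR_PAGES];

static bool toy_pfn_valid(unsigned long pfn)
{
	return pfn < TOY_NR_PAGES;
}

/* Helper for setting up the model, e.g. to mark a memory hole. */
static void toy_set_reserved(unsigned long pfn, bool reserved)
{
	assert(toy_pfn_valid(pfn));
	toy_memmap[pfn].reserved = reserved;
}

/*
 * True when the PFN may be treated as "not normal RAM": either it has
 * no struct page at all, or its struct page is marked reserved.
 */
static bool toy_is_invalid_reserved_pfn(unsigned long pfn)
{
	if (toy_pfn_valid(pfn))
		return toy_memmap[pfn].reserved;
	return true;
}
```

In the model, PFN 1 (left non-reserved) stands for an ordinary kernel allocation and fails the check, which matches the statement that such pages are rejected.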

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 19:34                               ` David Hildenbrand
@ 2025-05-09 16:30                                 ` Pantelis Antoniou
  2025-05-09 17:11                                   ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-09 16:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, 8 May 2025 21:34:28 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 08.05.25 21:19, Jason Gunthorpe wrote:
> > On Thu, May 08, 2025 at 09:14:38PM +0200, David Hildenbrand wrote:
> >> On 08.05.25 21:08, Peter Xu wrote:
> >>> On Thu, May 08, 2025 at 09:04:11PM +0200, David Hildenbrand wrote:
> >>>> It's doing the same wrong thing at a different place.
> >>>
> >>> As I mentioned, I believe KVM has this wrong thing working so
> >>> far..  and it doesn't block us from going right ultimately.  It's
> >>> a matter of time.
> >>
> >> Yes, KVM has it wrong and vfio probably as well. And they are
> >> usually not dealing with actual kernel allocations, but rather
> >> with MMIO ranges.
> > 
> > vfio also doesn't take references on the things it pulls out of the
> > VMA. The vfio bug is different, it lets you take a pte special
> > phys_addr_t and reference it through the IOMMU without any
> > refcounting. So when the VMA is destroyed and the page free'd by the
> > GPU driver we just UAF it from VFIO through the iommu page
> > table. Woops.
> 
> Right. What is_invalid_reserved_pfn() does is check that if it has a 
> "struct page", that that one must be marked PG_reserved.
> 
> That PG_reserved check is a nasty check for MMIO pages or memory
> holes part of a present memory section.
> 
> So at least it will reject anything that is just an ordinary kernel 
> allocation (!reserved).
> 

So what's the plan now?

DRM seems to be the first place to fix. However, am I wrong in thinking
that most uses of remap_pfn_range() are wrong in the context of
system-page-backed memory?

Should we start with an implementation of something like remap_range()
that does not set the PFNMAP bit? Setting it is wrong in that context, IMO.

And then move to DRM proper and replace the call to remap_pfn_range()
with it and see how far we go.

What do you think?

Regards

-- Pantelis


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-09 16:30                                 ` Pantelis Antoniou
@ 2025-05-09 17:11                                   ` John Hubbard
  2025-05-09 17:33                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2025-05-09 17:11 UTC (permalink / raw)
  To: Pantelis Antoniou, David Hildenbrand
  Cc: Jason Gunthorpe, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, c.briere, artem.k, David Howells

On 5/9/25 9:30 AM, Pantelis Antoniou wrote:
> On Thu, 8 May 2025 21:34:28 +0200
> David Hildenbrand <david@redhat.com> wrote:
...
> So what's the plan now?
> 
> DRM seems to be the first place to be fixed, however am I wrong in
> thinking that most of the uses of remap_pfn_range() are wrong in the
> context of system page backed memory?
> 
> Should we start with an implementation of something like remap_range()
> which does not set PFNMAP bit. Its use is wrong in that context IMO.
> 
> And then move to DRM proper and replace the call to remap_pfn_range()
> with it and see how far we go.
> 
That sounds like the right approach to me: the way we get into these
problems is mostly due to the lack of a clear example, so providing
something correct to call is the way out.

thanks,
-- 
John Hubbard


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-09 17:11                                   ` John Hubbard
@ 2025-05-09 17:33                                     ` Jason Gunthorpe
  2025-05-09 17:50                                       ` Pantelis Antoniou
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-09 17:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Pantelis Antoniou, David Hildenbrand, Peter Xu, Andrew Morton,
	mm-commits, wade.farnsworth, c.briere, artem.k, David Howells

On Fri, May 09, 2025 at 10:11:01AM -0700, John Hubbard wrote:
> On 5/9/25 9:30 AM, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 21:34:28 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> ...
> > So what's the plan now?
> > 
> > DRM seems to be the first place to be fixed, however am I wrong in
> > thinking that most of the uses of remap_pfn_range() are wrong in the
> > context of system page backed memory?
> > 
> > Should we start with an implementation of something like remap_range()
> > which does not set PFNMAP bit. Its use is wrong in that context IMO.
> > 
> > And then move to DRM proper and replace the call to remap_pfn_range()
> > with it and see how far we go.
> > 
> That sounds like the right approach to me. Because the way we get into
> these problems is mostly due to the lack of a clear example, and so
> providing something correct to call is the way out.

Thing is if you want to just install struct page memory you don't need
MIXEDMAP or a special call, just insert the pages in the normal way.

The issue here seems to be that the DRM caller wants to mix and match
struct page and non-struct page memory, so I think you want an
entirely different API for managing effectively a scatter of two
different memory types and computing what the proper VMA flags should
be based on what was given.

OR DRM is actually using remap_pfn specifically because it does not want
3rd parties taking the page refcount because that destroys its
lifetime model..

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-09 17:33                                     ` Jason Gunthorpe
@ 2025-05-09 17:50                                       ` Pantelis Antoniou
  2025-05-09 18:39                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-09 17:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, David Hildenbrand, Peter Xu, Andrew Morton,
	mm-commits, wade.farnsworth, c.briere, artem.k, David Howells

On Fri, 9 May 2025 14:33:14 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, May 09, 2025 at 10:11:01AM -0700, John Hubbard wrote:
> > On 5/9/25 9:30 AM, Pantelis Antoniou wrote:
> > > On Thu, 8 May 2025 21:34:28 +0200
> > > David Hildenbrand <david@redhat.com> wrote:
> > ...
> > > So what's the plan now?
> > > 
> > > DRM seems to be the first place to be fixed, however am I wrong in
> > > thinking that most of the uses of remap_pfn_range() are wrong in
> > > the context of system page backed memory?
> > > 
> > > Should we start with an implementation of something like
> > > remap_range() which does not set PFNMAP bit. Its use is wrong in
> > > that context IMO.
> > > 
> > > And then move to DRM proper and replace the call to
> > > remap_pfn_range() with it and see how far we go.
> > > 
> > That sounds like the right approach to me. Because the way we get
> > into these problems is mostly due to the lack of a clear example,
> > and so providing something correct to call is the way out.
> 
> Thing is if you want to just install struct page memory you don't need
> MIXEDMAP or a special call, just insert the pages in the normal way.
> 
> The issue here seems to be that the DRM caller wants to mix and match
> struct page and non-struct page memory, so I think you want an
> entirely different API for managing effectively a scatter of two
> different memory types and computing what the proper VMA flags should
> be based on what was given.
> 
> OR DRM is actually using remap_pfn specifically because it does not want
> 3rd parties taking the page refcount because that destroys its
> lifetime model..
> 

To be frank, our driver does not explicitly call remap_pfn_range(); I
just saw that it is used in other drivers and appears in many
tutorials for writing drivers that share memory with user-space.

However, our driver depends on the drm_gem_mmap_obj() method, where
all paths end up setting the PFNMAP bit, and remap_pfn_range() was
the easiest way to reproduce the bug in a simplified test-case.

A quick grep for vm_iomap_memory and remap_pfn_range in drivers/:

$ git grep -e 'vm_iomap_memory\|remap_pfn_range' drivers/ | wc -l 
92

I have no idea how many of those are operating on system memory pages.

From my understanding DRM takes full control of the lifecycle of the
buffer objects in question, so I don't think that the page refcount is
applicable here.

Turning that bit off could expose the pages to the swapper which I guess
would be pretty bad (maybe, not my particular area of expertise).

Do we have any DRM maintainers available to chime in about the page
lifecycle?

> Jason
> 

Regards

-- Pantelis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-09 17:50                                       ` Pantelis Antoniou
@ 2025-05-09 18:39                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-09 18:39 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: John Hubbard, David Hildenbrand, Peter Xu, Andrew Morton,
	mm-commits, wade.farnsworth, c.briere, artem.k, David Howells

On Fri, May 09, 2025 at 08:50:22PM +0300, Pantelis Antoniou wrote:

> However our driver is dependent on the drm_gem_mmap_obj() method where
> all paths end up setting the PFNMAP bit, and the remap_pfn_range() method
> is the easier way we could reproduce the bug in a simplified test-case.

Not all paths; the obj->funcs->mmap() path does not.

How do the pfns get installed in the flow you are looking at?

It seems to me that if the driver knows it is using CPU memory it
should follow:

 GEM objects can either provide a fault handler in their vm_op

And then in the fault handler use the normal vmf_insert_.* stuff and
never set any special VMA flags.

Otherwise it sounds like:

 or mmap the buffer memory synchronously after calling drm_gem_mmap_obj.

Means the driver called something like remap_pfn..

> Turning that bit off could expose the pages to the swapper which I guess
> would be pretty bad (maybe, not my particular area of expertise).

I think that's different..

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 18:11                   ` Pantelis Antoniou
  2025-05-08 18:26                     ` David Hildenbrand
  2025-05-08 18:47                     ` Peter Xu
@ 2025-05-08 19:11                     ` Jason Gunthorpe
  2 siblings, 0 replies; 32+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 19:11 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: David Hildenbrand, Peter Xu, Andrew Morton, mm-commits,
	wade.farnsworth, jhubbard, c.briere, artem.k, David Howells

On Thu, May 08, 2025 at 09:11:36PM +0300, Pantelis Antoniou wrote:

> 3. While we go about fixing it, this has caused a pretty significant
> userspace regression, where the address space that those BOs reside
> cannot be used for I/O when a network filesystem is involved. I think
> it's a matter of time when regular filesystems start using the same
> method of pinning and doing I/O instead of using the filecache on fast
> memory mediums.

Regular file systems already use GUP on their O_DIRECT paths, and this
already hasn't worked there basically forever. It seems like a kernel bug
in the net filesystems to have done something different in their O_DIRECT
for so long :\

Anyhow, the fixes must come from DRM using the mm properly, not by
hacking up the mm to ignore the well defined API rules we have.

I don't know enough about DRM to say exactly what that means, but that
is where you should be focusing your attention to fix it.

Somehow I suspect the number of places actually using O_DIRECT from a
network filesystem to a DRM buffer is going to be pretty small since
it never worked on a normal filesystem. Meaning this isn't some
general common code that is feeding generic files into GPUs, but
something custom built to only use a network that happened to stumble
onto this kernel bug and abuse it. Do you know differently?

Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
  2025-05-08 15:08     ` Peter Xu
  2025-05-08 15:10       ` David Hildenbrand
@ 2025-05-08 15:17       ` Pantelis Antoniou
  1 sibling, 0 replies; 32+ messages in thread
From: Pantelis Antoniou @ 2025-05-08 15:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, mm-commits, wade.farnsworth, jhubbard, jgg, david,
	c.briere, artem.k, David Howells

On Thu, 8 May 2025 11:08:14 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, May 08, 2025 at 05:36:12PM +0300, Pantelis Antoniou wrote:
> > On Thu, 8 May 2025 10:16:31 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > Hi Peter,
> 
> Hi, Pantelis,
> 
> [...]
> 
> > > > @@ -1271,8 +1287,6 @@ static int check_vma_flags(struct vm_are
> > > >  	int foreign = (gup_flags & FOLL_REMOTE);
> > > >  	bool vma_anon = vma_is_anonymous(vma);
> > > >  
> > > > -	if (vm_flags & (VM_IO | VM_PFNMAP))
> > > > -		return -EFAULT;
> > > 
> > > Is there's any justification that this won't break some existing
> > > GUP users that may rely on properly failing at pfnmaps?
> > > 
> > > IIUC netfs isn't the first one that wants to GUP on top of
> > > pfnmaps, KVM does it for years and so far it was processed in a
> > > standalone path:
> > > 
> > > hva_to_pfn:
> > > 	else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > > 		r = hva_to_pfn_remapped(vma, kfp, &pfn);
> > > 
> > > That started with supporting real pfnmaps (with no page struct),
> > > but pfnmap with page structs can also happen afaict, and kvm
> > > processes that too by checking page==NULL ultimately, e.g. in
> > > kvm_release_faultin_page().
> > > 
> > 
> > I see. The problem is that we're not the owners of the code in
> > netfslib, and it is considerably more intrusive to fix things there.
> > 
> > This is a hotfix for a userspace regression. I sort of agree that
> > having different handling for these areas in netfslib would be
> > ideal.
> 
> Do you mean this used to work in older kernels?  Some more info on the
> regression would be more than welcomed if so..  If it fixes a kernel
> regression, we may want a Fixes for whatever patch at last.
> 

Yes, it used to work in older kernels, before filesystems like 9p
switched to using the netfslib accessors. The problem is that there is
not a single patch that I can point to as the culprit.

It took a long time to figure out but the timeline was:

1. Before any netfslib and 9pfs changes, I/O from remap_pfn_range()
mappings works.
2. netfslib accessors are merged in mainline. Userspace still works.
3. 9pfs picks up netfslib accessors, things break.
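
To make step 3 concrete: once the I/O path pins user pages through GUP,
the check touched by the hunk quoted above applies. A minimal userspace
model of just that test follows. The flag values are assumed to mirror
include/linux/mm.h, and the struct and function name are hypothetical, so
treat it as a sketch rather than the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed to mirror include/linux/mm.h; verify against your tree. */
#define VM_PFNMAP	0x00000400UL
#define VM_IO		0x00004000UL

/* Hypothetical, cut-down stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned long vm_flags;
};

/* Models only the VM_IO/VM_PFNMAP test from check_vma_flags(). */
static int model_check_vma_flags(const struct vma_model *vma)
{
	assert(vma != NULL);
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;	/* GUP refuses to pin pages in this VMA */
	return 0;
}
```

A buffer mapped with remap_pfn_range() carries VM_PFNMAP, so in this
model any attempt to pin it for zero-copy I/O fails with -EFAULT.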

I doubt any kernel CI would have a test-case for it, because it is
quite esoteric.

We do have a relatively simple buildroot patch that we can share that
exhibits the problem; it contains a kernel, a kernel module, and a
user-space program that performs the I/O.

> Or do you mean it's a regression caused by userspace change?
>

No userspace changes.

> Thanks,
> 

Regards

-- Pantelis

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-05-09 18:39 UTC | newest]

Thread overview: 32+ messages
2025-05-07 21:55 + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch Andrew Morton
2025-05-08 14:16 ` Peter Xu
2025-05-08 14:36   ` Pantelis Antoniou
2025-05-08 15:08     ` Peter Xu
2025-05-08 15:10       ` David Hildenbrand
2025-05-08 15:27         ` Pantelis Antoniou
2025-05-08 15:40           ` David Hildenbrand
2025-05-08 15:48             ` Pantelis Antoniou
2025-05-08 16:25             ` Pantelis Antoniou
2025-05-08 17:35             ` Jason Gunthorpe
2025-05-08 17:47               ` Pantelis Antoniou
2025-05-08 18:01                 ` Jason Gunthorpe
2025-05-08 18:02                 ` David Hildenbrand
2025-05-08 18:11                   ` Pantelis Antoniou
2025-05-08 18:26                     ` David Hildenbrand
2025-05-08 18:47                     ` Peter Xu
2025-05-08 19:04                       ` David Hildenbrand
2025-05-08 19:06                         ` Jason Gunthorpe
2025-05-08 19:08                         ` Peter Xu
2025-05-08 19:12                           ` Jason Gunthorpe
2025-05-08 19:16                             ` David Hildenbrand
2025-05-08 19:39                             ` Peter Xu
2025-05-08 19:14                           ` David Hildenbrand
2025-05-08 19:19                             ` Jason Gunthorpe
2025-05-08 19:34                               ` David Hildenbrand
2025-05-09 16:30                                 ` Pantelis Antoniou
2025-05-09 17:11                                   ` John Hubbard
2025-05-09 17:33                                     ` Jason Gunthorpe
2025-05-09 17:50                                       ` Pantelis Antoniou
2025-05-09 18:39                                         ` Jason Gunthorpe
2025-05-08 19:11                     ` Jason Gunthorpe
2025-05-08 15:17       ` Pantelis Antoniou
