All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Pantelis Antoniou <p.antoniou@partner.samsung.com>
Cc: David Hildenbrand <david@redhat.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Andrew Morton <akpm@linux-foundation.org>,
	mm-commits@vger.kernel.org, wade.farnsworth@siemens.com,
	jhubbard@nvidia.com, c.briere@samsung.com, artem.k@samsung.com,
	David Howells <dhowells@redhat.com>
Subject: Re: + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch
Date: Thu, 8 May 2025 14:47:30 -0400	[thread overview]
Message-ID: <aBz8QuzT7YkPqrrF@x1.local> (raw)
In-Reply-To: <20250508211136.7a0a3865@sarc.samsung.com>

On Thu, May 08, 2025 at 09:11:36PM +0300, Pantelis Antoniou wrote:
> This has been going on for more than a decade at this point.

I wasn't aware how long, but I need to confess I am also aware of such in
virt context where drivers map these pages in PFNMAPs..  So KVM also has
similar cases happening in corner case setups.

> 
> Can we get a plan on how to go around fixing these issues correctly?
> 
> 1. Drivers/subsystems (DRM in this case) are doing remap_pfn_range() to
> map system memory with a page attached to user space.
> Up until recently this was OK, since no-one tried to pin the pages for
> any reason. It doesn't seem like this is the right way to do it.
> What is the right way?
> 
> 2. DRM in particular has no standardized way to handle mapping system
> memory Buffer Objects (BOs) to userspace. Each driver is free to do it's
> own thing and does so. What is the right way to handle this case.
> 
> 3. While we go about fixing it, this has caused a pretty significant
> userspace regression, where the address space that those BOs reside
> cannot be used for I/O when a network filesystem is involved. I think
> it's a matter of time when regular filesystem start using the same
> method of pining and doing I/O instead of using the filecache on fast
> memory mediums.

I mentioned it elsewhere, but _if_ fixing all the drivers isn't possible in
the near future.. we could still have chance to not mess with GUP (in which
case PFNMAP is also working like that for so many years, likely what the
drivers do on abusing pages in PFNMAPs...).  That is supporting such
special pages in iov_iter.

I attached such patch below, not saying that we should merge it, but IMHO
it's much better than fiddling with gup here, and so far that's the best I
can think if (and only if it works for 9pfs's current zerocopy case).. I
only smoked it, but I didn't verify it.

PS: I was also thinking maybe if 9pfs is the only affected so far, I wonder
if we could try to fallback to cached RW when necessary, IIUC that's still
working, right (and is 9pfs a production-level fs)?  But I think that's not
as good as below if supporting that isn't extremely hard.

Thanks,

=======8<======
From cd4aa467e4b653b5bd8496b5123d65d06e2e3263 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 8 May 2025 14:35:24 -0400
Subject: [PATCH] iov_iter: Supports special PFNMAP for user buffer when
 pfn_valid()

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  1 +
 lib/iov_iter.c     | 60 +++++++++++++++++++++++++++++++++++++++++++++-
 mm/gup.c           | 16 +++++++++++++
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38e16c984b9a..f79fd14599fa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1459,6 +1459,7 @@ static inline void put_page(struct page *page)
  */
 #define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+int pin_user_page(struct page *page);
 void unpin_user_page(struct page *page);
 void unpin_folio(struct folio *folio);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d9e19fb2dcf3..2eb24b2793f0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1809,6 +1809,64 @@ static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
 	return size;
 }
 
+/**
+ * iov_iter_pin_user_pages() - pin user pages for iov iter ops
+ *
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
+ *
+ * Almost a wrapper for pin_user_pages_fast(), but also supports PFNMAPs
+ * where in extremely rare cases there's actually struct page available
+ * (e.g. device drivers playing trick with PFNMAP by injecting allocated
+ * RAM pages).
+ */
+static inline int
+iov_iter_pin_user_pages(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	struct follow_pfnmap_args args;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int res, ret;
+
+	res = pin_user_pages_fast(start, nr_pages, gup_flags, pages);
+
+	/* Normally, GUP should really work already.. */
+	if (likely(res > 0))
+		return res;
+
+	/*
+	 * This is to take care of an extremely rare case: retry in case if
+	 * it's a PFNMAP that has struct page backed.
+	 *
+	 * So far it does the minimum we need in the failure path.  It
+	 * assumes the PFNMAP entries must exist in the pgtables already,
+	 * and it resolves one PFN at a time.
+	 */
+	mm = current->mm;
+	mmap_read_lock(mm);
+	vma = vma_lookup(current->mm, start);
+	if (!vma)
+		goto out;
+
+	args.vma = vma;
+	args.address = start;
+
+	ret = follow_pfnmap_start(&args);
+	if (ret)
+		goto out;
+	/* Did we find a special page under VM_PFNMAP? */
+	if (pfn_valid(args.pfn) && pin_user_page(pfn_to_page(args.pfn)))
+		res = 1;
+	follow_pfnmap_end(&args);
+out:
+	mmap_read_unlock(mm);
+	return res;
+}
+
 /*
  * Extract a list of contiguous pages from a user iterator and get a pin on
  * each of them.  This should only be used if the iterator is user-backed
@@ -1846,7 +1904,7 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
 	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
 	if (!maxpages)
 		return -ENOMEM;
-	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	res = iov_iter_pin_user_pages(addr, maxpages, gup_flags, *pages);
 	if (unlikely(res <= 0))
 		return res;
 	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
diff --git a/mm/gup.c b/mm/gup.c
index d3aac58862c0..eede221ac89a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -178,6 +178,22 @@ int __must_check try_grab_folio(struct folio *folio, int refs,
 	return 0;
 }
 
+/**
+ * pin_user_page() - dma-pinned a page
+ * @page:            pointer to page to be pinned
+ *
+ * NOTE!  One should normally use pin_user_pages*() API instead.  This
+ * should be only useful in extremely special cases, like struct page under
+ * VM_PFNMAP.
+ *
+ * Returns: 0 if success, negative if pin failed
+ */
+int pin_user_page(struct page *page)
+{
+	return try_grab_folio(page_folio(page), 1, FOLL_PIN);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:            pointer to page to be released
-- 
2.49.0


-- 
Peter Xu


  parent reply	other threads:[~2025-05-08 18:47 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 21:55 + fix-zero-copy-i-o-on-__get_user_pages-allocated-pages.patch added to mm-hotfixes-unstable branch Andrew Morton
2025-05-08 14:16 ` Peter Xu
2025-05-08 14:36   ` Pantelis Antoniou
2025-05-08 15:08     ` Peter Xu
2025-05-08 15:10       ` David Hildenbrand
2025-05-08 15:27         ` Pantelis Antoniou
2025-05-08 15:40           ` David Hildenbrand
2025-05-08 15:48             ` Pantelis Antoniou
2025-05-08 16:25             ` Pantelis Antoniou
2025-05-08 17:35             ` Jason Gunthorpe
2025-05-08 17:47               ` Pantelis Antoniou
2025-05-08 18:01                 ` Jason Gunthorpe
2025-05-08 18:02                 ` David Hildenbrand
2025-05-08 18:11                   ` Pantelis Antoniou
2025-05-08 18:26                     ` David Hildenbrand
2025-05-08 18:47                     ` Peter Xu [this message]
2025-05-08 19:04                       ` David Hildenbrand
2025-05-08 19:06                         ` Jason Gunthorpe
2025-05-08 19:08                         ` Peter Xu
2025-05-08 19:12                           ` Jason Gunthorpe
2025-05-08 19:16                             ` David Hildenbrand
2025-05-08 19:39                             ` Peter Xu
2025-05-08 19:14                           ` David Hildenbrand
2025-05-08 19:19                             ` Jason Gunthorpe
2025-05-08 19:34                               ` David Hildenbrand
2025-05-09 16:30                                 ` Pantelis Antoniou
2025-05-09 17:11                                   ` John Hubbard
2025-05-09 17:33                                     ` Jason Gunthorpe
2025-05-09 17:50                                       ` Pantelis Antoniou
2025-05-09 18:39                                         ` Jason Gunthorpe
2025-05-08 19:11                     ` Jason Gunthorpe
2025-05-08 15:17       ` Pantelis Antoniou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aBz8QuzT7YkPqrrF@x1.local \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=artem.k@samsung.com \
    --cc=c.briere@samsung.com \
    --cc=david@redhat.com \
    --cc=dhowells@redhat.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=mm-commits@vger.kernel.org \
    --cc=p.antoniou@partner.samsung.com \
    --cc=wade.farnsworth@siemens.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.