All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leon Romanovsky <leon@kernel.org>
To: Jason Gunthorpe <jgg@ziepe.ca>, Christoph Hellwig <hch@lst.de>
Cc: "Robin Murphy" <robin.murphy@arm.com>,
	"Marek Szyprowski" <m.szyprowski@samsung.com>,
	"Joerg Roedel" <joro@8bytes.org>, "Will Deacon" <will@kernel.org>,
	"Chaitanya Kulkarni" <chaitanyak@nvidia.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Jens Axboe" <axboe@kernel.dk>, "Keith Busch" <kbusch@kernel.org>,
	"Sagi Grimberg" <sagi@grimberg.me>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Shameer Kolothum" <shameerali.kolothum.thodi@huawei.com>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
	kvm@vger.kernel.org, linux-mm@kvack.org,
	"Bart Van Assche" <bvanassche@acm.org>,
	"Damien Le Moal" <damien.lemoal@opensource.wdc.com>,
	"Amir Goldstein" <amir73il@gmail.com>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"jack@suse.com" <jack@suse.com>,
	"Zhu Yanjun" <zyjzyj2000@gmail.com>
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Date: Wed, 20 Mar 2024 10:55:36 +0200	[thread overview]
Message-ID: <20240320085536.GA14887@unreal> (raw)
In-Reply-To: <20240319153620.GB66976@ziepe.ca>

On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > > The DMA API callers really need to know what is P2P or not for
> > > > various reasons.  And they should generally have that information
> > > > available, either from pin_user_pages that needs to special case
> > > > it or from the in-kernel I/O submitter that build it from P2P and
> > > > normal memory.
> > > 
> > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > > shoves the resulting page list into in a scattertable. It never checks
> > > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > > does all the work.
> > 
> > Right now it does, but that's not really a good interface.  If we have
> > a pin_user_pages variant that only pins until the next relevant P2P
> > boundary and tells you about we can significantly simplify the overall
> > interface.
> 
> Sorry for the delay, I was away..

<...>

> Can we tweak what Leon has done to keep the hmm_range_fault support
> and non-uniformity for RDMA but add a uniformity optimized flow for
> BIO?

Something like this will do the trick.

From 45e739e7073fb04bc168624f77320130bb3f9267 Mon Sep 17 00:00:00 2001
Message-ID: <45e739e7073fb04bc168624f77320130bb3f9267.1710924764.git.leonro@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Date: Mon, 18 Mar 2024 11:16:41 +0200
Subject: [PATCH] mm/gup: add strict interface to pin user pages according to
 FOLL flag

All pin_user_pages*() and get_user_pages*() callbacks allocate user
pages by partially taking into account their p2p vs. non-p2p properties.

In case, user sets FOLL_PCI_P2PDMA flag, the allocated pages will include
both p2p and "regular" pages, while if FOLL_PCI_P2PDMA flag is not provided,
only regular pages are returned.

In order to make sure that with FOLL_PCI_P2PDMA flag, only p2p pages are
returned, let's introduce new internal FOLL_STRICT flag and provide special
pin_user_pages_fast_strict() API call.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mm.h |  3 +++
 mm/gup.c           | 36 +++++++++++++++++++++++++++++++++++-
 mm/internal.h      |  4 +++-
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..910b65dde24a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,6 +2491,9 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 void folio_add_pin(struct folio *folio);
 
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+			       unsigned int gup_flags, struct page **pages);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..11b5c626a4ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -133,6 +133,10 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;
 
+	if (flags & FOLL_STRICT)
+		if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+			return NULL;
+
 	if (flags & FOLL_GET)
 		return try_get_folio(page, refs);
 
@@ -232,6 +236,10 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return -EREMOTEIO;
 
+	if (flags & FOLL_STRICT)
+		if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+			return -EREMOTEIO;
+
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
@@ -2243,6 +2251,8 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	 * - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only
 	 * - FOLL_REMOTE is internal only and used on follow_page()
 	 * - FOLL_UNLOCKABLE is internal only and used if locked is !NULL
+	 * - FOLL_STRICT is internal only and used to distinguish between p2p
+	 *   and "regular" pages.
 	 */
 	if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS))
 		return false;
@@ -3187,7 +3197,8 @@ static int internal_get_user_pages_fast(unsigned long start,
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
 				       FOLL_FAST_ONLY | FOLL_NOFAULT |
-				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
+				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT |
+				       FOLL_STRICT)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
@@ -3322,6 +3333,29 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
+/**
+ * pin_user_pages_fast_strict() - this is pin_user_pages_fast() variant, which
+ * makes sure that only pages with same properties are pinned.
+ *
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
+ *
+ * Nearly the same as pin_user_pages_fastt(), except that FOLL_STRICT is set.
+ *
+ * FOLL_STRICT means that the pages are allocated with specific FOLL_* properties.
+ */
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+			       unsigned int gup_flags, struct page **pages)
+{
+	if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN | FOLL_STRICT))
+		return -EINVAL;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_strict);
+
 /**
  * pin_user_pages_remote() - pin pages of a remote process
  *
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..7578837a0444 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1031,10 +1031,12 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* don't mix pages with different properties, e.g. p2p with "regular" ones */
+	FOLL_STRICT = 1 << 22,
 };
 
 #define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
-			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE | FOLL_STRICT)
 
 /*
  * Indicates for which pages that are write-protected in the page table,
-- 
2.44.0

  reply	other threads:[~2024-03-20  8:55 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-05 11:18 [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 01/16] mm/hmm: let users to tag specific PFNs Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 02/16] dma-mapping: provide an interface to allocate IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 03/16] dma-mapping: provide callbacks to link/unlink pages to specific IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 04/16] iommu/dma: Provide an interface to allow preallocate IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 05/16] iommu/dma: Prepare map/unmap page functions to receive IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 06/16] iommu/dma: Implement link/unlink page callbacks Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 07/16] RDMA/umem: Preallocate and cache IOVA for UMEM ODP Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 08/16] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 09/16] RDMA/core: Separate DMA mapping to caching IOVA and page linkage Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 10/16] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 11/16] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 12/16] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 13/16] vfio/mlx5: Explicitly store page list Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 14/16] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 15/16] block: add dma_link_range() based API Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL Leon Romanovsky
2024-03-05 15:51   ` Keith Busch
2024-03-05 16:08     ` Jens Axboe
2024-03-05 16:39       ` Chaitanya Kulkarni
2024-03-05 16:46         ` Chaitanya Kulkarni
2024-03-06 14:33     ` Christoph Hellwig
2024-03-06 15:05       ` Jason Gunthorpe
2024-03-06 16:14         ` Christoph Hellwig
2024-05-03 14:41   ` Zhu Yanjun
2024-05-05 13:23     ` Leon Romanovsky
2024-05-06  7:25       ` Zhu Yanjun
2024-03-05 12:05 ` [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps Robin Murphy
2024-03-05 12:29   ` Leon Romanovsky
2024-03-06 14:44     ` Christoph Hellwig
2024-03-06 15:43       ` Jason Gunthorpe
2024-03-06 16:20         ` Christoph Hellwig
2024-03-06 17:44           ` Jason Gunthorpe
2024-03-06 22:14             ` Christoph Hellwig
2024-03-07  0:00               ` Jason Gunthorpe
2024-03-07 15:05                 ` Christoph Hellwig
2024-03-07 21:01                   ` Jason Gunthorpe
2024-03-08 16:49                     ` Christoph Hellwig
2024-03-08 20:23                       ` Jason Gunthorpe
2024-03-09 16:14                         ` Christoph Hellwig
2024-03-10  9:35                           ` Leon Romanovsky
2024-03-12 21:28                             ` Christoph Hellwig
2024-03-13  7:46                               ` Leon Romanovsky
2024-03-13 21:44                                 ` Christoph Hellwig
2024-03-19 15:36                           ` Jason Gunthorpe
2024-03-20  8:55                             ` Leon Romanovsky [this message]
2024-03-21 22:40                               ` Christoph Hellwig
2024-03-22 17:46                                 ` Leon Romanovsky
2024-03-24 23:16                                   ` Christoph Hellwig
2024-03-21 22:39                             ` Christoph Hellwig
2024-03-22 18:43                               ` Jason Gunthorpe
2024-03-24 23:22                                 ` Christoph Hellwig
2024-03-27 17:14                                   ` Jason Gunthorpe
2024-03-07  6:01 ` Zhu Yanjun
2024-04-09 20:39   ` Zhu Yanjun
2024-05-02 23:32 ` Zeng, Oak
2024-05-03 11:57   ` Zhu Yanjun
2024-05-03 16:42   ` Jason Gunthorpe
2024-05-03 20:59     ` Zeng, Oak
2024-06-10 15:12     ` Zeng, Oak
2024-06-10 15:19       ` Zhu Yanjun
2024-06-10 16:18       ` Leon Romanovsky
2024-06-10 16:40         ` Zeng, Oak
2024-06-10 17:25           ` Jason Gunthorpe
2024-06-10 21:28             ` Zeng, Oak
2024-06-11  7:49               ` Zhu Yanjun
2024-06-11 15:45               ` Leon Romanovsky
2024-06-11 18:26                 ` Zeng, Oak
2024-06-11 19:11                   ` Leon Romanovsky
2024-06-11 15:39           ` Leon Romanovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240320085536.GA14887@unreal \
    --to=leon@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=amir73il@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=chaitanyak@nvidia.com \
    --cc=corbet@lwn.net \
    --cc=damien.lemoal@opensource.wdc.com \
    --cc=dan.j.williams@intel.com \
    --cc=daniel@iogearbox.net \
    --cc=hch@lst.de \
    --cc=iommu@lists.linux.dev \
    --cc=jack@suse.com \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=joro@8bytes.org \
    --cc=josef@toxicpanda.com \
    --cc=kbusch@kernel.org \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=m.szyprowski@samsung.com \
    --cc=martin.petersen@oracle.com \
    --cc=robin.murphy@arm.com \
    --cc=sagi@grimberg.me \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=will@kernel.org \
    --cc=yishaih@nvidia.com \
    --cc=zyjzyj2000@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.