* [PATCH V1 0/9] iommu_ioas_map_file
@ 2024-09-14 13:05 Steve Sistare
2024-09-14 13:05 ` [PATCH V1 1/9] mm/gup: repin_folio_unhugely Steve Sistare
` (10 more replies)
0 siblings, 11 replies; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
memory by passing a memfd plus offset and length. Implement it using
the memfd_map_folios KAPI, and the proposed repin_folio_unhugely KAPI.
See the individual patches for details.
Steve Sistare (9):
mm/gup: repin_folio_unhugely
iommufd: remove uptr from iopt_alloc_iova
iommufd: generalize iopt_pages address
iommufd: pfn reader for file mappings
iommufd: IOMMU_IOAS_MAP_FILE interface
iommufd: IOMMU_IOAS_MAP_FILE implementation
iommufd: file mappings for mdev
iommufd: replace upages_len
iommufd: optimize file mapping
drivers/iommu/iommufd/io_pagetable.c | 102 +++++++++++-----
drivers/iommu/iommufd/io_pagetable.h | 20 ++-
drivers/iommu/iommufd/ioas.c | 43 +++++++
drivers/iommu/iommufd/iommufd_private.h | 6 +
drivers/iommu/iommufd/main.c | 2 +
drivers/iommu/iommufd/pages.c | 210 ++++++++++++++++++++++++++------
include/linux/mm.h | 1 +
include/uapi/linux/iommufd.h | 22 ++++
mm/gup.c | 18 +++
9 files changed, 354 insertions(+), 70 deletions(-)
--
1.8.3.1
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-14 13:19 ` Steven Sistare
2024-09-15 20:37 ` Jason Gunthorpe
2024-09-14 13:05 ` [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova Steve Sistare
` (9 subsequent siblings)
10 siblings, 2 replies; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Export a function that repins a huge-page folio at small-page granularity.
This allows any range of small pages within the folio to be unpinned later.
For example, pages pinned via memfd_pin_folios and modified by
repin_folio_unhugely could be unpinned via unpin_user_page(s).
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/linux/mm.h | 1 +
mm/gup.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1470736..ba8344f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2514,6 +2514,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
struct folio **folios, unsigned int max_folios,
pgoff_t *offset);
+void repin_folio_unhugely(struct folio *folio, unsigned long npin);
int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index 947881ff..f8f3f2a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
return ret;
}
EXPORT_SYMBOL_GPL(memfd_pin_folios);
+
+/**
+ * repin_folio_unhugely() - repin a folio at small page granularity
+ * @folio: the folio to repin
+ * @npin: the number of pages pinned in the folio
+ *
+ * Given a huge page folio that is already pinned, and the number of small
+ * pages that are pinned in it, adjust the pincount to reflect small-page
+ * granularity. Each small page can later be unpinned individually.
+ */
+void repin_folio_unhugely(struct folio *folio, unsigned long npin)
+{
+ if (!folio_test_large(folio) || is_huge_zero_folio(folio) || npin == 1)
+ return;
+ atomic_add(npin - 1, &folio->_refcount);
+ atomic_add(npin - 1, &folio->_pincount);
+}
+EXPORT_SYMBOL_GPL(repin_folio_unhugely);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
2024-09-14 13:05 ` [PATCH V1 1/9] mm/gup: repin_folio_unhugely Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-15 20:41 ` Jason Gunthorpe
2024-09-14 13:05 ` [PATCH V1 3/9] iommufd: generalize iopt_pages address Steve Sistare
` (8 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
iopt_alloc_iova takes a uptr argument but only checks for its alignment.
Pass an offset from start instead to eliminate its dependence on a user
address.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/io_pagetable.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 05fd9d3..1c23a59 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -104,12 +104,13 @@ static bool __alloc_iova_check_used(struct interval_tree_span_iter *span,
/*
* Automatically find a block of IOVA that is not being used and not reserved.
+ * @start is the byte offset from the start of the address space.
* Does not return a 0 IOVA even if it is valid.
*/
static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
- unsigned long uptr, unsigned long length)
+ unsigned long start, unsigned long length)
{
- unsigned long page_offset = uptr % PAGE_SIZE;
+ unsigned long page_offset = start % PAGE_SIZE;
struct interval_tree_double_span_iter used_span;
struct interval_tree_span_iter allowed_span;
unsigned long iova_alignment;
@@ -121,15 +122,15 @@ static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
return -EOVERFLOW;
/*
- * Keep alignment present in the uptr when building the IOVA, this
+ * Keep the start alignment when building the IOVA, which
* increases the chance we can map a THP.
*/
- if (!uptr)
+ if (!start)
iova_alignment = roundup_pow_of_two(length);
else
iova_alignment = min_t(unsigned long,
roundup_pow_of_two(length),
- 1UL << __ffs64(uptr));
+ 1UL << __ffs64(start));
if (iova_alignment < iopt->iova_alignment)
return -EINVAL;
@@ -240,7 +241,7 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
int iommu_prot, unsigned int flags)
{
struct iopt_pages_list *elm;
- unsigned long iova;
+ unsigned long iova, start;
int rc = 0;
list_for_each_entry(elm, pages_list, next) {
@@ -259,9 +260,8 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
/* Use the first entry to guess the ideal IOVA alignment */
elm = list_first_entry(pages_list, struct iopt_pages_list,
next);
- rc = iopt_alloc_iova(
- iopt, dst_iova,
- (uintptr_t)elm->pages->uptr + elm->start_byte, length);
+ start = elm->start_byte + (uintptr_t)elm->pages->uptr;
+ rc = iopt_alloc_iova(iopt, dst_iova, start, length);
if (rc)
goto out_unlock;
if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 3/9] iommufd: generalize iopt_pages address
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
2024-09-14 13:05 ` [PATCH V1 1/9] mm/gup: repin_folio_unhugely Steve Sistare
2024-09-14 13:05 ` [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-18 14:52 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 4/9] iommufd: pfn reader for file mappings Steve Sistare
` (7 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
The starting address in iopt_pages is currently a __user *uptr. Generalize
to allow other types of addresses. Refactor iopt_alloc_pages and
iopt_map_user_pages into address-type specific and common functions.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/io_pagetable.c | 60 +++++++++++++++++++++++-------------
drivers/iommu/iommufd/io_pagetable.h | 13 ++++++--
drivers/iommu/iommufd/pages.c | 30 +++++++++++++-----
3 files changed, 71 insertions(+), 32 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 1c23a59..3f8f0de 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -376,6 +376,38 @@ int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
return rc;
}
+static int iopt_map_common(struct iommufd_ctx *ictx,
+ struct io_pagetable *iopt,
+ struct iopt_pages *pages,
+ unsigned long *iova,
+ unsigned long length,
+ unsigned long start_byte,
+ int iommu_prot,
+ unsigned int flags)
+{
+ struct iopt_pages_list elm = {};
+ LIST_HEAD(pages_list);
+ int rc;
+
+ elm.pages = pages;
+ elm.start_byte = start_byte;
+ if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
+ elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
+ elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
+ elm.length = length;
+ list_add(&elm.next, &pages_list);
+
+ rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
+ if (rc) {
+ if (elm.area)
+ iopt_abort_area(elm.area);
+ if (elm.pages)
+ iopt_put_pages(elm.pages);
+ return rc;
+ }
+ return 0;
+}
+
/**
* iopt_map_user_pages() - Map a user VA to an iova in the io page table
* @ictx: iommufd_ctx the iopt is part of
@@ -400,29 +432,15 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long length, int iommu_prot,
unsigned int flags)
{
- struct iopt_pages_list elm = {};
- LIST_HEAD(pages_list);
- int rc;
+ struct iopt_pages *pages;
- elm.pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
- if (IS_ERR(elm.pages))
- return PTR_ERR(elm.pages);
- if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
- elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
- elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
- elm.start_byte = uptr - elm.pages->uptr;
- elm.length = length;
- list_add(&elm.next, &pages_list);
+ pages = iopt_alloc_user_pages(uptr, length, iommu_prot & IOMMU_WRITE);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
- rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
- if (rc) {
- if (elm.area)
- iopt_abort_area(elm.area);
- if (elm.pages)
- iopt_put_pages(elm.pages);
- return rc;
- }
- return 0;
+ return iopt_map_common(ictx, iopt, pages, iova, length,
+ uptr - pages->uptr,
+ iommu_prot, flags);
}
struct iova_bitmap_fn_arg {
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 0ec3509..7c4a338 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -175,6 +175,10 @@ enum {
IOPT_PAGES_ACCOUNT_MM = 2,
};
+enum iopt_address_type {
+ IOPT_ADDRESS_USER = 0,
+};
+
/*
* This holds a pinned page list for multiple areas of IO address space. The
* pages always originate from a linear chunk of userspace VA. Multiple
@@ -195,7 +199,10 @@ struct iopt_pages {
struct task_struct *source_task;
struct mm_struct *source_mm;
struct user_struct *source_user;
- void __user *uptr;
+ enum iopt_address_type type;
+ union {
+ void __user *uptr; /* IOPT_ADDRESS_USER */
+ };
bool writable:1;
u8 account_mode;
@@ -206,8 +213,8 @@ struct iopt_pages {
struct rb_root_cached domains_itree;
};
-struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
- bool writable);
+struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
+ unsigned long length, bool writable);
void iopt_release_pages(struct kref *kref);
static inline void iopt_put_pages(struct iopt_pages *pages)
{
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 117f644..69822d4 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1139,11 +1139,11 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
return 0;
}
-struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
- bool writable)
+static struct iopt_pages *iopt_alloc_pages(unsigned long length,
+ unsigned long start_byte,
+ bool writable)
{
struct iopt_pages *pages;
- unsigned long end;
/*
* The iommu API uses size_t as the length, and protect the DIV_ROUND_UP
@@ -1152,9 +1152,6 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
if (length > SIZE_MAX - PAGE_SIZE || length == 0)
return ERR_PTR(-EINVAL);
- if (check_add_overflow((unsigned long)uptr, length, &end))
- return ERR_PTR(-EOVERFLOW);
-
pages = kzalloc(sizeof(*pages), GFP_KERNEL_ACCOUNT);
if (!pages)
return ERR_PTR(-ENOMEM);
@@ -1164,8 +1161,7 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
mutex_init(&pages->mutex);
pages->source_mm = current->mm;
mmgrab(pages->source_mm);
- pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
- pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+ pages->npages = DIV_ROUND_UP(length + start_byte, PAGE_SIZE);
pages->access_itree = RB_ROOT_CACHED;
pages->domains_itree = RB_ROOT_CACHED;
pages->writable = writable;
@@ -1179,6 +1175,24 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
return pages;
}
+struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
+ unsigned long length, bool writable)
+{
+ struct iopt_pages *pages;
+ unsigned long end;
+ void __user *uptr_down = (void __user *) ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+
+ if (check_add_overflow((unsigned long)uptr, length, &end))
+ return ERR_PTR(-EOVERFLOW);
+
+ pages = iopt_alloc_pages(length, uptr - uptr_down, writable);
+ if (IS_ERR(pages))
+ return pages;
+ pages->uptr = uptr_down;
+ pages->type = IOPT_ADDRESS_USER;
+ return pages;
+}
+
void iopt_release_pages(struct kref *kref)
{
struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 4/9] iommufd: pfn reader for file mappings
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (2 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 3/9] iommufd: generalize iopt_pages address Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-15 20:51 ` Jason Gunthorpe
2024-09-14 13:05 ` [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface Steve Sistare
` (6 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Extend pfn_reader_user to pin file mappings, by calling memfd_pin_folios.
Repin at small page granularity and unpack pages into upages[] to mesh
with the existing code paths.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/io_pagetable.h | 5 ++++
drivers/iommu/iommufd/pages.c | 58 ++++++++++++++++++++++++++++++++----
2 files changed, 57 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 7c4a338..3a28f46 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -177,6 +177,7 @@ enum {
enum iopt_address_type {
IOPT_ADDRESS_USER = 0,
+ IOPT_ADDRESS_FILE = 1,
};
/*
@@ -202,6 +203,10 @@ struct iopt_pages {
enum iopt_address_type type;
union {
void __user *uptr; /* IOPT_ADDRESS_USER */
+ struct { /* IOPT_ADDRESS_FILE */
+ struct file *file;
+ unsigned long start;
+ };
};
bool writable:1;
u8 account_mode;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 69822d4..065c28f 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -695,6 +695,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
struct pfn_reader_user {
struct page **upages;
size_t upages_len;
+ struct file *file;
unsigned long upages_start;
unsigned long upages_end;
unsigned int gup_flags;
@@ -712,6 +713,7 @@ static void pfn_reader_user_init(struct pfn_reader_user *user,
user->upages_start = 0;
user->upages_end = 0;
user->locked = -1;
+ user->file = (pages->type == IOPT_ADDRESS_FILE) ? pages->file : NULL;
user->gup_flags = FOLL_LONGTERM;
if (pages->writable)
@@ -733,13 +735,54 @@ static void pfn_reader_user_destroy(struct pfn_reader_user *user,
user->upages = NULL;
}
+static long pin_memfd_pages(struct pfn_reader_user *user,
+ unsigned long start,
+ unsigned long npages)
+{
+ unsigned long end, nr, i, j, k, npin, nfolios, pgoff, max_folios;
+ unsigned long npages_orig = npages;
+ struct folio *folio;
+ size_t size = npages * sizeof(folio);
+ struct folio **folios = temp_kmalloc(&size, NULL, 0);
+
+ if (!folios)
+ return -ENOMEM;
+
+ k = 0;
+ max_folios = size / sizeof(folio);
+ end = start + (npages << PAGE_SHIFT) - 1;
+
+ while (npages > 0) {
+ nfolios = memfd_pin_folios(user->file, start, end,
+ folios, max_folios, &pgoff);
+ if (nfolios <= 0)
+ return nfolios;
+
+ pgoff >>= PAGE_SHIFT;
+ for (i = 0; i < nfolios; i++) {
+ folio = folios[i];
+ nr = folio_nr_pages(folio);
+ npin = min(nr - pgoff, npages);
+ repin_folio_unhugely(folio, npin);
+ for (j = pgoff; j < pgoff + npin; j++)
+ user->upages[k++] = folio_page(folio, j);
+ npages -= npin;
+ start += npin << PAGE_SHIFT;
+ pgoff = 0;
+ }
+ }
+
+ kfree(folios);
+ return npages_orig;
+}
+
static int pfn_reader_user_pin(struct pfn_reader_user *user,
struct iopt_pages *pages,
unsigned long start_index,
unsigned long last_index)
{
bool remote_mm = pages->source_mm != current->mm;
- unsigned long npages;
+ unsigned long npages, start;
uintptr_t uptr;
long rc;
@@ -756,7 +799,7 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
return -ENOMEM;
}
- if (user->locked == -1) {
+ if (!user->file && user->locked == -1) {
/*
* The majority of usages will run the map task within the mm
* providing the pages, so we can optimize into
@@ -772,15 +815,18 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
npages = min_t(unsigned long, last_index - start_index + 1,
user->upages_len / sizeof(*user->upages));
-
if (iommufd_should_fail())
return -EFAULT;
- uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
- if (!remote_mm)
+ if (user->file) {
+ start = pages->start + (start_index * PAGE_SIZE);
+ rc = pin_memfd_pages(user, start, npages);
+ } else if (!remote_mm) {
+ uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
user->upages);
- else {
+ } else {
+ uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
if (!user->locked) {
mmap_read_lock(pages->source_mm);
user->locked = 1;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (3 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 4/9] iommufd: pfn reader for file mappings Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-15 20:52 ` Jason Gunthorpe
2024-09-14 13:05 ` [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation Steve Sistare
` (5 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
register memory by passing a memfd plus offset and length.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/ioas.c | 5 +++++
drivers/iommu/iommufd/iommufd_private.h | 1 +
drivers/iommu/iommufd/main.c | 2 ++
include/uapi/linux/iommufd.h | 22 ++++++++++++++++++++++
4 files changed, 30 insertions(+)
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 157a89b..270e8ea 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -197,6 +197,11 @@ static int conv_iommu_prot(u32 map_flags)
return iommu_prot;
}
+int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
+{
+ return -ENOTTY;
+}
+
int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
{
struct iommu_ioas_map *cmd = ucmd->cmd;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 92efe30..21ec075 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -275,6 +275,7 @@ static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ctx *ictx,
int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd);
int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 83bbd7c..5f77ce6 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -378,6 +378,8 @@ struct iommufd_ioctl_op {
struct iommu_ioas_iova_ranges, out_iova_alignment),
IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
iova),
+ IOCTL_OP(IOMMU_IOAS_MAP_FILE, iommufd_ioas_map_file,
+ struct iommu_ioas_map_file, iova),
IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
length),
IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4dde745..1e760da 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -51,6 +51,7 @@ enum {
IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP = 0x8c,
IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
+ IOMMUFD_CMD_IOAS_MAP_FILE = 0x8f,
};
/**
@@ -214,6 +215,27 @@ struct iommu_ioas_map {
#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
/**
+ * struct iommu_ioas_map_file - ioctl(IOMMU_IOAS_MAP_FILE)
+ * @size: sizeof(struct iommu_ioas_map_file)
+ * @fd: the memfd to map
+ * @start: byte offset from start of file to map from
+ * @flags, @ios_id, @length, @iova: same as for iommu_ioas_map
+ *
+ * Set an IOVA mapping from a memfd file. All other arguments and semantics
+ * match those of IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_map_file {
+ __u32 size;
+ __u32 flags;
+ __u32 ioas_id;
+ __s32 fd;
+ __aligned_u64 start;
+ __aligned_u64 length;
+ __aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP_FILE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP_FILE)
+
+/**
* struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
* @size: sizeof(struct iommu_ioas_copy)
* @flags: Combination of enum iommufd_ioas_map_flags
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (4 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-17 1:48 ` kernel test robot
2024-09-14 13:05 ` [PATCH V1 7/9] iommufd: file mappings for mdev Steve Sistare
` (4 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Implement the IOMMU_IOAS_MAP_FILE ioctl using the memfd_pin_folios KAPI.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/io_pagetable.c | 26 ++++++++++++++++++++-
drivers/iommu/iommufd/io_pagetable.h | 2 ++
drivers/iommu/iommufd/ioas.c | 40 ++++++++++++++++++++++++++++++++-
drivers/iommu/iommufd/iommufd_private.h | 5 +++++
drivers/iommu/iommufd/pages.c | 16 +++++++++++++
5 files changed, 87 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 3f8f0de..615c1a5 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -260,7 +260,10 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
/* Use the first entry to guess the ideal IOVA alignment */
elm = list_first_entry(pages_list, struct iopt_pages_list,
next);
- start = elm->start_byte + (uintptr_t)elm->pages->uptr;
+ if (elm->pages->type == IOPT_ADDRESS_FILE)
+ start = elm->start_byte + elm->pages->start;
+ else
+ start = elm->start_byte + (uintptr_t)elm->pages->uptr;
rc = iopt_alloc_iova(iopt, dst_iova, start, length);
if (rc)
goto out_unlock;
@@ -443,6 +446,27 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
iommu_prot, flags);
}
+/**
+ * iopt_map_file_pages() - Like iopt_map_user_pages, but map @file starting
+ * at byte offset @start.
+ */
+int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+ unsigned long *iova,
+ struct file *file, unsigned long start,
+ unsigned long length, int iommu_prot,
+ unsigned int flags)
+{
+ struct iopt_pages *pages;
+
+ pages = iopt_alloc_file_pages(file, start, length,
+ iommu_prot & IOMMU_WRITE);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+ return iopt_map_common(ictx, iopt, pages, iova, length,
+ start - pages->start,
+ iommu_prot, flags);
+}
+
struct iova_bitmap_fn_arg {
unsigned long flags;
struct io_pagetable *iopt;
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 3a28f46..909d396 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -220,6 +220,8 @@ struct iopt_pages {
struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
unsigned long length, bool writable);
+struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
+ unsigned long length, bool writable);
void iopt_release_pages(struct kref *kref);
static inline void iopt_put_pages(struct iopt_pages *pages)
{
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 270e8ea..2d25ba0 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -2,6 +2,7 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
*/
+#include <linux/file.h>
#include <linux/interval_tree.h>
#include <linux/iommufd.h>
#include <linux/iommu.h>
@@ -199,7 +200,44 @@ static int conv_iommu_prot(u32 map_flags)
int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
{
- return -ENOTTY;
+ struct iommu_ioas_map_file *cmd = ucmd->cmd;
+ unsigned long iova = cmd->iova;
+ struct iommufd_ioas *ioas;
+ unsigned int flags = 0;
+ int rc;
+ struct file *file;
+ unsigned long end;
+
+ if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
+ return -EOVERFLOW;
+
+ if (check_add_overflow(cmd->start, cmd->length - 1, &end))
+ return -EOVERFLOW;
+
+ ioas = iommufd_get_ioas(ucmd->ictx, cmd->ioas_id);
+ if (IS_ERR(ioas))
+ return PTR_ERR(ioas);
+
+ if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+ flags = IOPT_ALLOC_IOVA;
+
+ file = fget(cmd->fd);
+ if (!file)
+ return -EBADF;
+
+ rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova,
+ file, cmd->start,
+ cmd->length,
+ conv_iommu_prot(cmd->flags), flags);
+ if (rc)
+ goto out_put;
+
+ cmd->iova = iova;
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+ iommufd_put_object(ucmd->ictx, &ioas->obj);
+ fput(file);
+ return rc;
}
int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 21ec075..0ebb5bd 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -68,6 +68,11 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long *iova, void __user *uptr,
unsigned long length, int iommu_prot,
unsigned int flags);
+int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+ unsigned long *iova,
+ struct file *file, unsigned long start,
+ unsigned long length, int iommu_prot,
+ unsigned int flags);
int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
unsigned long length, unsigned long *dst_iova,
int iommu_prot, unsigned int flags);
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 065c28f..8c8d0ff 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1239,6 +1239,22 @@ struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
return pages;
}
+struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
+ unsigned long length, bool writable)
+
+{
+ struct iopt_pages *pages;
+ unsigned long start_down = ALIGN_DOWN(start, PAGE_SIZE);
+
+ pages = iopt_alloc_pages(length, start - start_down, writable);
+ if (IS_ERR(pages))
+ return pages;
+ pages->file = get_file(file);
+ pages->start = start_down;
+ pages->type = IOPT_ADDRESS_FILE;
+ return pages;
+}
+
void iopt_release_pages(struct kref *kref)
{
struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 7/9] iommufd: file mappings for mdev
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (5 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-18 14:52 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 8/9] iommufd: replace upages_len Steve Sistare
` (3 subsequent siblings)
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Support file mappings for mediated devices, aka mdevs. Access is initiated
by the vfio_pin_pages and vfio_dma_rw kernel interfaces.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/pages.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 8c8d0ff..4cb532e 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1706,11 +1706,11 @@ static int iopt_pages_fill_from_domain(struct iopt_pages *pages,
return 0;
}
-static int iopt_pages_fill_from_mm(struct iopt_pages *pages,
- struct pfn_reader_user *user,
- unsigned long start_index,
- unsigned long last_index,
- struct page **out_pages)
+static int iopt_pages_fill(struct iopt_pages *pages,
+ struct pfn_reader_user *user,
+ unsigned long start_index,
+ unsigned long last_index,
+ struct page **out_pages)
{
unsigned long cur_index = start_index;
int rc;
@@ -1784,8 +1784,8 @@ int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
/* hole */
cur_pages = out_pages + (span.start_hole - start_index);
- rc = iopt_pages_fill_from_mm(pages, &user, span.start_hole,
- span.last_hole, cur_pages);
+ rc = iopt_pages_fill(pages, &user, span.start_hole,
+ span.last_hole, cur_pages);
if (rc)
goto out_clean_xa;
rc = pages_to_xarray(&pages->pinned_pfns, span.start_hole,
@@ -1865,6 +1865,9 @@ static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
struct page *page = NULL;
int rc;
+ if (WARN_ON(pages->type != IOPT_ADDRESS_USER))
+ return -EINVAL;
+
if (!mmget_not_zero(pages->source_mm))
return iopt_pages_rw_slow(pages, index, index, offset, data,
length, flags);
@@ -1920,6 +1923,11 @@ int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
return -EPERM;
+ if (pages->type == IOPT_ADDRESS_FILE)
+ return iopt_pages_rw_slow(pages, start_index, last_index,
+ start_byte % PAGE_SIZE, data, length,
+ flags);
+
if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) {
if (start_index == last_index)
return iopt_pages_rw_page(pages, start_index,
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 8/9] iommufd: replace upages_len
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (6 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 7/9] iommufd: file mappings for mdev Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-14 13:05 ` [PATCH V1 9/9] iommufd: optimize file mapping Steve Sistare
` (2 subsequent siblings)
10 siblings, 0 replies; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Replace the byte length upages_len with the element count upages_num.
This simplifies a subsequent patch that needs element count.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/pages.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 4cb532e..f6c7c90 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -694,7 +694,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
/* pfn_reader_user is just the pin_user_pages() path */
struct pfn_reader_user {
struct page **upages;
- size_t upages_len;
+ unsigned long upages_num;
struct file *file;
unsigned long upages_start;
unsigned long upages_end;
@@ -783,6 +783,7 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
{
bool remote_mm = pages->source_mm != current->mm;
unsigned long npages, start;
+ size_t len;
uintptr_t uptr;
long rc;
@@ -792,11 +793,11 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
if (!user->upages) {
/* All undone in pfn_reader_destroy() */
- user->upages_len =
- (last_index - start_index + 1) * sizeof(*user->upages);
- user->upages = temp_kmalloc(&user->upages_len, NULL, 0);
+ len = (last_index - start_index + 1) * sizeof(*user->upages);
+ user->upages = temp_kmalloc(&len, NULL, 0);
if (!user->upages)
return -ENOMEM;
+ user->upages_num = len / sizeof(*user->upages);
}
if (!user->file && user->locked == -1) {
@@ -813,7 +814,7 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
}
npages = min_t(unsigned long, last_index - start_index + 1,
- user->upages_len / sizeof(*user->upages));
+ user->upages_num);
if (iommufd_should_fail())
return -EFAULT;
@@ -1756,7 +1757,7 @@ int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
lockdep_assert_held(&pages->mutex);
pfn_reader_user_init(&user, pages);
- user.upages_len = (last_index - start_index + 1) * sizeof(*out_pages);
+ user.upages_num = last_index - start_index + 1;
interval_tree_for_each_double_span(&span, &pages->access_itree,
&pages->domains_itree, start_index,
last_index) {
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V1 9/9] iommufd: optimize file mapping
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (7 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 8/9] iommufd: replace upages_len Steve Sistare
@ 2024-09-14 13:05 ` Steve Sistare
2024-09-15 20:59 ` Jason Gunthorpe
2024-09-14 13:21 ` [PATCH V1 0/9] iommu_ioas_map_file Steven Sistare
2024-09-15 20:30 ` Jason Gunthorpe
10 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-09-14 13:05 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Avoid expanding a huge folio into multiple upages[] in pin_memfd_pages by
adding a page count nupages[i] for each upages[i]. Provide the function
batch_from_pages_num to efficiently add counted pages to a pfn_batch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/pages.c | 75 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 63 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index f6c7c90..5cc0d57 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -347,26 +347,34 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
}
/* true if the pfn was added, false otherwise */
-static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+static bool batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn,
+ unsigned long nr)
{
const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));
+ unsigned long max_npfns = MAX_NPFNS - nr;
if (batch->end &&
pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
- batch->npfns[batch->end - 1] != MAX_NPFNS) {
- batch->npfns[batch->end - 1]++;
- batch->total_pfns++;
+ batch->npfns[batch->end - 1] <= max_npfns) {
+ batch->npfns[batch->end - 1] += nr;
+ batch->total_pfns += nr;
return true;
}
if (batch->end == batch->array_size)
return false;
- batch->total_pfns++;
+ batch->total_pfns += nr;
batch->pfns[batch->end] = pfn;
- batch->npfns[batch->end] = 1;
+ batch->npfns[batch->end] = nr;
batch->end++;
return true;
}
+/* true if the pfn was added, false otherwise */
+static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+{
+ return batch_add_pfn_num(batch, pfn, 1);
+}
+
/*
* Fill the batch with pfns from the domain. When the batch is full, or it
* reaches last_index, the function will return. The caller should use
@@ -622,6 +630,20 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
break;
}
+static void batch_from_pages_num(struct pfn_batch *batch, struct page **pages,
+ u32 *npages,
+ size_t total_pages)
+{
+ unsigned long num;
+
+ while (total_pages) {
+ num = min_t(unsigned long, *npages++, total_pages);
+ if (!batch_add_pfn_num(batch, page_to_pfn(*pages++), num))
+ break;
+ total_pages -= num;
+ }
+}
+
static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
unsigned int first_page_off, size_t npages)
{
@@ -695,6 +717,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
struct pfn_reader_user {
struct page **upages;
unsigned long upages_num;
+ u32 *nupages;
struct file *file;
unsigned long upages_start;
unsigned long upages_end;
@@ -714,6 +737,7 @@ static void pfn_reader_user_init(struct pfn_reader_user *user,
user->upages_end = 0;
user->locked = -1;
user->file = (pages->type == IOPT_ADDRESS_FILE) ? pages->file : NULL;
+ user->nupages = NULL;
user->gup_flags = FOLL_LONGTERM;
if (pages->writable)
@@ -733,6 +757,8 @@ static void pfn_reader_user_destroy(struct pfn_reader_user *user,
kfree(user->upages);
user->upages = NULL;
+ kfree(user->nupages);
+ user->nupages = NULL;
}
static long pin_memfd_pages(struct pfn_reader_user *user,
@@ -764,8 +790,13 @@ static long pin_memfd_pages(struct pfn_reader_user *user,
nr = folio_nr_pages(folio);
npin = min(nr - pgoff, npages);
repin_folio_unhugely(folio, npin);
- for (j = pgoff; j < pgoff + npin; j++)
- user->upages[k++] = folio_page(folio, j);
+ if (user->nupages) {
+ user->upages[k] = &folio->page;
+ user->nupages[k++] = npin;
+ } else {
+ for (j = pgoff; j < pgoff + npin; j++, k++)
+ user->upages[k] = folio_page(folio, j);
+ }
npages -= npin;
start += npin << PAGE_SHIFT;
pgoff = 0;
@@ -798,6 +829,17 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
if (!user->upages)
return -ENOMEM;
user->upages_num = len / sizeof(*user->upages);
+
+ if (user->file) {
+ len = user->upages_num * sizeof(*user->nupages);
+ user->nupages = temp_kmalloc(&len, NULL, 0);
+ if (!user->nupages ||
+ len != user->upages_num * sizeof(*user->nupages)) {
+ kfree(user->upages);
+ user->upages = NULL;
+ /* nupages is optional. proceed without it. */
+ }
+ }
}
if (!user->file && user->locked == -1) {
@@ -1025,6 +1067,7 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
{
struct interval_tree_double_span_iter *span = &pfns->span;
unsigned long start_index = pfns->batch_end_index;
+ unsigned long npages;
struct iopt_area *area;
int rc;
@@ -1062,10 +1105,18 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
return rc;
}
- batch_from_pages(&pfns->batch,
- pfns->user.upages +
- (start_index - pfns->user.upages_start),
- pfns->user.upages_end - start_index);
+ npages = pfns->user.upages_end - start_index;
+ start_index -= pfns->user.upages_start;
+
+ if (pfns->user.nupages)
+ batch_from_pages_num(&pfns->batch,
+ pfns->user.upages + start_index,
+ pfns->user.nupages + start_index,
+ npages);
+ else
+ batch_from_pages(&pfns->batch,
+ pfns->user.upages + start_index,
+ npages);
return 0;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-14 13:05 ` [PATCH V1 1/9] mm/gup: repin_folio_unhugely Steve Sistare
@ 2024-09-14 13:19 ` Steven Sistare
2024-09-17 12:25 ` David Hildenbrand
2024-09-15 20:37 ` Jason Gunthorpe
1 sibling, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-09-14 13:19 UTC (permalink / raw)
To: linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox
cc'ing linux-mm for review of this one patch of the series.
This proposes a new KAPI function repin_folio_unhugely(), for use in this
patch of the iommu_ioas_map_file series:
iommufd: IOMMU_IOAS_MAP_FILE implementation
https://lore.kernel.org/linux-iommu/1726319158-283074-7-git-send-email-steven.sistare@oracle.com
- Steve
On 9/14/2024 9:05 AM, Steve Sistare wrote:
> Export a function that repins a huge-page folio at small-page granularity.
> This allows any range of small pages within the folio to be unpinned later.
> For example, pages pinned via memfd_pin_folios and modified by
> repin_folio_unhugely could be unpinned via unpin_user_page(s).
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/linux/mm.h | 1 +
> mm/gup.c | 18 ++++++++++++++++++
> 2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1470736..ba8344f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2514,6 +2514,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
> struct folio **folios, unsigned int max_folios,
> pgoff_t *offset);
> +void repin_folio_unhugely(struct folio *folio, unsigned long npin);
>
> int get_user_pages_fast(unsigned long start, int nr_pages,
> unsigned int gup_flags, struct page **pages);
> diff --git a/mm/gup.c b/mm/gup.c
> index 947881ff..f8f3f2a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
> return ret;
> }
> EXPORT_SYMBOL_GPL(memfd_pin_folios);
> +
> +/**
> + * repin_folio_unhugely() - repin a folio at small page granularity
> + * @folio: the folio to repin
> + * @npin: the number of pages pinned in the folio
> + *
> + * Given a huge page folio that is already pinned, and the number of small
> + * pages that are pinned in it, adjust the pincount to reflect small-page
> + * granularity. Each small page can later be unpinned individually.
> + */
> +void repin_folio_unhugely(struct folio *folio, unsigned long npin)
> +{
> + if (!folio_test_large(folio) || is_huge_zero_folio(folio) || npin == 1)
> + return;
> + atomic_add(npin - 1, &folio->_refcount);
> + atomic_add(npin - 1, &folio->_pincount);
> +}
> +EXPORT_SYMBOL_GPL(repin_folio_unhugely);
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 0/9] iommu_ioas_map_file
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (8 preceding siblings ...)
2024-09-14 13:05 ` [PATCH V1 9/9] iommufd: optimize file mapping Steve Sistare
@ 2024-09-14 13:21 ` Steven Sistare
2024-09-15 20:30 ` Jason Gunthorpe
10 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-14 13:21 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen
I forgot to give credit, this ioctl was suggested by Jason.
- Steve
On 9/14/2024 9:05 AM, Steve Sistare wrote:
> Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
> memory by passing a memfd plus offset and length. Implement it using
> the memfd_map_folios KAPI, and the proposed repin_folio_unhugely KAPI.
> See the individual patches for details.
>
> Steve Sistare (9):
> mm/gup: repin_folio_unhugely
> iommufd: remove uptr from iopt_alloc_iova
> iommufd: generalize iopt_pages address
> iommufd: pfn reader for file mappings
> iommufd: IOMMU_IOAS_MAP_FILE interface
> iommufd: IOMMU_IOAS_MAP_FILE implementation
> iommufd: file mappings for mdev
> iommufd: replace upages_len
> iommufd: optimize file mapping
>
> drivers/iommu/iommufd/io_pagetable.c | 102 +++++++++++-----
> drivers/iommu/iommufd/io_pagetable.h | 20 ++-
> drivers/iommu/iommufd/ioas.c | 43 +++++++
> drivers/iommu/iommufd/iommufd_private.h | 6 +
> drivers/iommu/iommufd/main.c | 2 +
> drivers/iommu/iommufd/pages.c | 210 ++++++++++++++++++++++++++------
> include/linux/mm.h | 1 +
> include/uapi/linux/iommufd.h | 22 ++++
> mm/gup.c | 18 +++
> 9 files changed, 354 insertions(+), 70 deletions(-)
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 0/9] iommu_ioas_map_file
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
` (9 preceding siblings ...)
2024-09-14 13:21 ` [PATCH V1 0/9] iommu_ioas_map_file Steven Sistare
@ 2024-09-15 20:30 ` Jason Gunthorpe
2024-09-18 14:52 ` Steven Sistare
10 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 20:30 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Sat, Sep 14, 2024 at 06:05:49AM -0700, Steve Sistare wrote:
> Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
> memory by passing a memfd plus offset and length. Implement it using
> the memfd_map_folios KAPI, and the proposed repin_folio_unhugely KAPI.
> See the individual patches for details.
Do you want to do this and the other patches as well?
This seems much easier to understand, and over the long run I expect
to wrap VFIO in a DMABUF FD as well, so it should work well.
Thanks,
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-14 13:05 ` [PATCH V1 1/9] mm/gup: repin_folio_unhugely Steve Sistare
2024-09-14 13:19 ` Steven Sistare
@ 2024-09-15 20:37 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 20:37 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Sat, Sep 14, 2024 at 06:05:50AM -0700, Steve Sistare wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index 947881ff..f8f3f2a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
> return ret;
> }
> EXPORT_SYMBOL_GPL(memfd_pin_folios);
> +
> +/**
> + * repin_folio_unhugely() - repin a folio at small page granularity
> + * @folio: the folio to repin
> + * @npin: the number of pages pinned in the folio
> + *
> + * Given a huge page folio that is already pinned, and the number of small
> + * pages that are pinned in it, adjust the pincount to reflect small-page
> + * granularity. Each small page can later be unpinned individually.
> + */
I think the language choice here is probably not entirely consistent
with the rest of the code..
Maybe
folio_split_user_page_pin(folio, npages)
@npages: The new number of pages the folio pin reference should hold
Given a high order folio that is already pinned adjust the reference
count to allow unpin_user_page_range() and related to be called on a
the folio. npages is the number of pages that will be passed to a
future unpin_user_page_range().
Which anchors it in the purpose a little more
The implementation looked OK to me
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova
2024-09-14 13:05 ` [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova Steve Sistare
@ 2024-09-15 20:41 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 20:41 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Sat, Sep 14, 2024 at 06:05:51AM -0700, Steve Sistare wrote:
> iopt_alloc_iova takes a uptr argument but only checks for its alignment.
> Pass an offset from start instead to eliminate its dependence on a user
> address.
An offset from start is a meaningfully different thing for this, it
throws away all the information it is trying to extract (is the uptr
huge page aligned)
But that isn't what this does anyhow:
> - rc = iopt_alloc_iova(
> - iopt, dst_iova,
> - (uintptr_t)elm->pages->uptr + elm->start_byte, length);
> + start = elm->start_byte + (uintptr_t)elm->pages->uptr;
> + rc = iopt_alloc_iova(iopt, dst_iova, start, length);
It doesn't change anything but rename uptr?
Thats probably OK, as you will want to pass in an offset from the
start of a FD here eventually.
Just the existing uptr logic should stay the same..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 4/9] iommufd: pfn reader for file mappings
2024-09-14 13:05 ` [PATCH V1 4/9] iommufd: pfn reader for file mappings Steve Sistare
@ 2024-09-15 20:51 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 20:51 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Sat, Sep 14, 2024 at 06:05:53AM -0700, Steve Sistare wrote:
> Extend pfn_reader_user to pin file mappings, by calling memfd_pin_folios.
> Repin at small page granularity and unpack pages into upages[] to mesh
> with the existing code paths.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/iommu/iommufd/io_pagetable.h | 5 ++++
> drivers/iommu/iommufd/pages.c | 58 ++++++++++++++++++++++++++++++++----
> 2 files changed, 57 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
> index 7c4a338..3a28f46 100644
> --- a/drivers/iommu/iommufd/io_pagetable.h
> +++ b/drivers/iommu/iommufd/io_pagetable.h
> @@ -177,6 +177,7 @@ enum {
>
> enum iopt_address_type {
> IOPT_ADDRESS_USER = 0,
> + IOPT_ADDRESS_FILE = 1,
> };
>
> /*
> @@ -202,6 +203,10 @@ struct iopt_pages {
> enum iopt_address_type type;
> union {
> void __user *uptr; /* IOPT_ADDRESS_USER */
> + struct { /* IOPT_ADDRESS_FILE */
> + struct file *file;
> + unsigned long start;
> + };
> };
> bool writable:1;
> u8 account_mode;
> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
> index 69822d4..065c28f 100644
> --- a/drivers/iommu/iommufd/pages.c
> +++ b/drivers/iommu/iommufd/pages.c
> @@ -695,6 +695,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
> struct pfn_reader_user {
> struct page **upages;
> size_t upages_len;
> + struct file *file;
> unsigned long upages_start;
> unsigned long upages_end;
> unsigned int gup_flags;
> @@ -712,6 +713,7 @@ static void pfn_reader_user_init(struct pfn_reader_user *user,
> user->upages_start = 0;
> user->upages_end = 0;
> user->locked = -1;
> + user->file = (pages->type == IOPT_ADDRESS_FILE) ? pages->file : NULL;
>
> user->gup_flags = FOLL_LONGTERM;
> if (pages->writable)
> @@ -733,13 +735,54 @@ static void pfn_reader_user_destroy(struct pfn_reader_user *user,
> user->upages = NULL;
> }
>
> +static long pin_memfd_pages(struct pfn_reader_user *user,
> + unsigned long start,
> + unsigned long npages)
> +{
> + unsigned long end, nr, i, j, k, npin, nfolios, pgoff, max_folios;
> + unsigned long npages_orig = npages;
> + struct folio *folio;
> + size_t size = npages * sizeof(folio);
> + struct folio **folios = temp_kmalloc(&size, NULL, 0);
> +
> + if (!folios)
> + return -ENOMEM;
> +
> + k = 0;
> + max_folios = size / sizeof(folio);
> + end = start + (npages << PAGE_SHIFT) - 1;
> +
> + while (npages > 0) {
> + nfolios = memfd_pin_folios(user->file, start, end,
> + folios, max_folios, &pgoff);
> + if (nfolios <= 0)
> + return nfolios;
> +
> + pgoff >>= PAGE_SHIFT;
> + for (i = 0; i < nfolios; i++) {
> + folio = folios[i];
> + nr = folio_nr_pages(folio);
> + npin = min(nr - pgoff, npages);
> + repin_folio_unhugely(folio, npin);
> + for (j = pgoff; j < pgoff + npin; j++)
> + user->upages[k++] = folio_page(folio, j);
> + npages -= npin;
> + start += npin << PAGE_SHIFT;
> + pgoff = 0;
You should try harder to avoid this loop, the batch already can just
swallow a full folio, so this would be much better to copy more of the
code from it's caller and be more fully stand alone. Use only a temp
folio array and stick that directly into the batch full folio at a
time.
This above will be functionally the same, but alot slower..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface
2024-09-14 13:05 ` [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface Steve Sistare
@ 2024-09-15 20:52 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 20:52 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Sat, Sep 14, 2024 at 06:05:54AM -0700, Steve Sistare wrote:
> Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
> register memory by passing a memfd plus offset and length.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/iommu/iommufd/ioas.c | 5 +++++
> drivers/iommu/iommufd/iommufd_private.h | 1 +
> drivers/iommu/iommufd/main.c | 2 ++
> include/uapi/linux/iommufd.h | 22 ++++++++++++++++++++++
> 4 files changed, 30 insertions(+)
This looks OK to me
> /**
> + * struct iommu_ioas_map_file - ioctl(IOMMU_IOAS_MAP_FILE)
> + * @size: sizeof(struct iommu_ioas_map_file)
> + * @fd: the memfd to map
> + * @start: byte offset from start of file to map from
> + * @flags, @ios_id, @length, @iova: same as for iommu_ioas_map
But please check that the kdoc tools accept this syntax?
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 9/9] iommufd: optimize file mapping
2024-09-14 13:05 ` [PATCH V1 9/9] iommufd: optimize file mapping Steve Sistare
@ 2024-09-15 20:59 ` Jason Gunthorpe
2024-09-18 14:52 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 20:59 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Sat, Sep 14, 2024 at 06:05:58AM -0700, Steve Sistare wrote:
> Avoid expanding a huge folio into multiple upages[] in pin_memfd_pages by
> adding a page count nupages[i] for each upages[i]. Provide the function
> batch_from_pages_num to efficiently add counted pages to a pfn_batch.
> @@ -695,6 +717,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
> struct pfn_reader_user {
> struct page **upages;
> unsigned long upages_num;
> + u32 *nupages;
I think you'd be better to have
union {
struct page **upages;
struct folio **ufolio;
};
Add some batch_from_folios() to do the logic
The flow would want to work the same as the pages where you fill in
the ufolios array and then trickle it out into the batch
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation
2024-09-14 13:05 ` [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation Steve Sistare
@ 2024-09-17 1:48 ` kernel test robot
0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2024-09-17 1:48 UTC (permalink / raw)
To: Steve Sistare, iommu
Cc: oe-kbuild-all, Jason Gunthorpe, Kevin Tian, Nicolin Chen,
Steve Sistare
Hi Steve,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.11 next-20240916]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steve-Sistare/mm-gup-repin_folio_unhugely/20240914-210746
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/1726319158-283074-7-git-send-email-steven.sistare%40oracle.com
patch subject: [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation
config: i386-allmodconfig (https://download.01.org/0day-ci/archive/20240917/202409170940.No0O836I-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240917/202409170940.No0O836I-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409170940.No0O836I-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'ictx' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'iopt' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'iova' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'file' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'start' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'length' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'iommu_prot' not described in 'iopt_map_file_pages'
>> drivers/iommu/iommufd/io_pagetable.c:458: warning: Function parameter or struct member 'flags' not described in 'iopt_map_file_pages'
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for GET_FREE_REGION
Depends on [n]: SPARSEMEM [=n]
Selected by [m]:
- RESOURCE_KUNIT_TEST [=m] && RUNTIME_TESTING_MENU [=y] && KUNIT [=m]
vim +458 drivers/iommu/iommufd/io_pagetable.c
448
449 /**
450 * iopt_map_file_pages() - Like iopt_map_user_pages, but map @file starting
451 * at byte offset @start.
452 */
453 int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
454 unsigned long *iova,
455 struct file *file, unsigned long start,
456 unsigned long length, int iommu_prot,
457 unsigned int flags)
> 458 {
459 struct iopt_pages *pages;
460
461 pages = iopt_alloc_file_pages(file, start, length,
462 iommu_prot & IOMMU_WRITE);
463 if (IS_ERR(pages))
464 return PTR_ERR(pages);
465 return iopt_map_common(ictx, iopt, pages, iova, length,
466 start - pages->start,
467 iommu_prot, flags);
468 }
469
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-14 13:19 ` Steven Sistare
@ 2024-09-17 12:25 ` David Hildenbrand
2024-09-18 14:51 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2024-09-17 12:25 UTC (permalink / raw)
To: Steven Sistare, linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox
On 14.09.24 15:19, Steven Sistare wrote:
> cc'ing linux-mm for review of this one patch of the series.
>
> This proposes a new KAPI function repin_folio_unhugely(), for use in this
> patch of the iommu_ioas_map_file series:
>
> iommufd: IOMMU_IOAS_MAP_FILE implementation
> https://lore.kernel.org/linux-iommu/1726319158-283074-7-git-send-email-steven.sistare@oracle.com
>
> - Steve
>
> On 9/14/2024 9:05 AM, Steve Sistare wrote:
>> Export a function that repins a huge-page folio at small-page granularity.
>> This allows any range of small pages within the folio to be unpinned later.
>> For example, pages pinned via memfd_pin_folios and modified by
>> repin_folio_unhugely could be unpinned via unpin_user_page(s).
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/linux/mm.h | 1 +
>> mm/gup.c | 18 ++++++++++++++++++
>> 2 files changed, 19 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 1470736..ba8344f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2514,6 +2514,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>> struct folio **folios, unsigned int max_folios,
>> pgoff_t *offset);
>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin);
>>
>> int get_user_pages_fast(unsigned long start, int nr_pages,
>> unsigned int gup_flags, struct page **pages);
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 947881ff..f8f3f2a 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>> return ret;
>> }
>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>> +
>> +/**
>> + * repin_folio_unhugely() - repin a folio at small page granularity
>> + * @folio: the folio to repin
>> + * @npin: the number of pages pinned in the folio
>> + *
>> + * Given a huge page folio that is already pinned, and the number of small
s/huge page folio/large folio/
>> + * pages that are pinned in it, adjust the pincount to reflect small-page
>> + * granularity. Each small page can later be unpinned individually.
>> + */
>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin)
>> +{
>> + if (!folio_test_large(folio) || is_huge_zero_folio(folio) || npin == 1)
Why not the huge zero folio? That looks very odd here.
>> + return;
>> + atomic_add(npin - 1, &folio->_refcount);
>> + atomic_add(npin - 1, &folio->_pincount);
>> +}
>> +EXPORT_SYMBOL_GPL(repin_folio_unhugely);
Can we ... find a better name? For example, it's "large" folio not "huge"...
And repin is really misleading. We are simply adding more pins to an
already pinned one ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-15 20:37 ` Jason Gunthorpe
@ 2024-09-18 14:51 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:51 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, Kevin Tian, Nicolin Chen, David Hildenbrand, Andrew Morton,
linux-mm
On 9/15/2024 4:37 PM, Jason Gunthorpe wrote:
> On Sat, Sep 14, 2024 at 06:05:50AM -0700, Steve Sistare wrote:
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 947881ff..f8f3f2a 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>> return ret;
>> }
>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>> +
>> +/**
>> + * repin_folio_unhugely() - repin a folio at small page granularity
>> + * @folio: the folio to repin
>> + * @npin: the number of pages pinned in the folio
>> + *
>> + * Given a huge page folio that is already pinned, and the number of small
>> + * pages that are pinned in it, adjust the pincount to reflect small-page
>> + * granularity. Each small page can later be unpinned individually.
>> + */
>
> I think the language choice here is probably not entirely consistent
> with the rest of the code
Yes, a good name was not obvious to me, to I submitted a slightly flippant
name to generate some discussion :)
>
> Maybe
>
> folio_split_user_page_pin(folio, npages)
> @npages: The new number of pages the folio pin reference should hold
>
> Given a high order folio that is already pinned adjust the reference
> count to allow unpin_user_page_range() and related to be called on a
> the folio. npages is the number of pages that will be passed to a
> future unpin_user_page_range().
>
> Which anchors it in the purpose a little more
Thanks, I will use this if no one objects.
- Steve
> The implementation looked OK to me
>
> Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-17 12:25 ` David Hildenbrand
@ 2024-09-18 14:51 ` Steven Sistare
2024-09-19 8:11 ` David Hildenbrand
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:51 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox
On 9/17/2024 8:25 AM, David Hildenbrand wrote:
> On 14.09.24 15:19, Steven Sistare wrote:
>> cc'ing linux-mm for review of this one patch of the series.
>>
>> This proposes a new KAPI function repin_folio_unhugely(), for use in this
>> patch of the iommu_ioas_map_file series:
>>
>> iommufd: IOMMU_IOAS_MAP_FILE implementation
>> https://lore.kernel.org/linux-iommu/1726319158-283074-7-git-send-email-steven.sistare@oracle.com
>>
>> - Steve
>>
>> On 9/14/2024 9:05 AM, Steve Sistare wrote:
>>> Export a function that repins a huge-page folio at small-page granularity.
>>> This allows any range of small pages within the folio to be unpinned later.
>>> For example, pages pinned via memfd_pin_folios and modified by
>>> repin_folio_unhugely could be unpinned via unpin_user_page(s).
>>>
>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> include/linux/mm.h | 1 +
>>> mm/gup.c | 18 ++++++++++++++++++
>>> 2 files changed, 19 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 1470736..ba8344f 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2514,6 +2514,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>> struct folio **folios, unsigned int max_folios,
>>> pgoff_t *offset);
>>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin);
>>> int get_user_pages_fast(unsigned long start, int nr_pages,
>>> unsigned int gup_flags, struct page **pages);
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 947881ff..f8f3f2a 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>> return ret;
>>> }
>>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>>> +
>>> +/**
>>> + * repin_folio_unhugely() - repin a folio at small page granularity
>>> + * @folio: the folio to repin
>>> + * @npin: the number of pages pinned in the folio
>>> + *
>>> + * Given a huge page folio that is already pinned, and the number of small
>
> s/huge page folio/large folio/
>
>>> + * pages that are pinned in it, adjust the pincount to reflect small-page
>>> + * granularity. Each small page can later be unpinned individually.
>>> + */
>>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin)
>>> +{
>>> + if (!folio_test_large(folio) || is_huge_zero_folio(folio) || npin == 1)
>
> Why not the huge zero folio? That looks very odd here.
The zero page is treated specially here and elsewhere, it can never be deleted so
reference fiddling is skipped.
>>> + return;
>>> + atomic_add(npin - 1, &folio->_refcount);
>>> + atomic_add(npin - 1, &folio->_pincount);
>>> +}
>>> +EXPORT_SYMBOL_GPL(repin_folio_unhugely);
>
> Can we ... find a better name? For example, it's "large" folio not "huge"...
>
> And repin is really misleading. We are simply adding more pins to an already pinned one ...
Jason suggests a better name in the other thread.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova
2024-09-15 20:41 ` Jason Gunthorpe
@ 2024-09-18 14:51 ` Steven Sistare
2024-09-24 19:50 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:51 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 9/15/2024 4:41 PM, Jason Gunthorpe wrote:
> On Sat, Sep 14, 2024 at 06:05:51AM -0700, Steve Sistare wrote:
>> iopt_alloc_iova takes a uptr argument but only checks for its alignment.
>> Pass an offset from start instead to eliminate its dependence on a user
>> address.
>
> An offset from start is a meaningfully different thing for this, it
> throws away all the information it is trying to extract (is the uptr
> huge page aligned)
Interpret offset as "from the start of the address space, eg 0" and it
works. That was my intent, since I pass offset from the start of the file
address space in a subsequent patch. But instead, I can rename uptr to
to address and describe it this way:
iopt_alloc_iova takes a uptr argument but only checks for its alignment.
Generalize this to an unsigned address, which will be the offset from the
start of a file in a subsequent patch. No functional change.
> But that isn't what this does anyhow:
>
>> - rc = iopt_alloc_iova(
>> - iopt, dst_iova,
>> - (uintptr_t)elm->pages->uptr + elm->start_byte, length);
>> + start = elm->start_byte + (uintptr_t)elm->pages->uptr;
>> + rc = iopt_alloc_iova(iopt, dst_iova, start, length);
>
> It doesn't change anything but rename uptr?
Yes.
- Steve
> Thats probably OK, as you will want to pass in an offset from the
> start of a FD here eventually.
>
> Just the existing uptr logic should stay the same..
>
> Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 4/9] iommufd: pfn reader for file mappings
2024-09-15 20:51 ` Jason Gunthorpe
@ 2024-09-18 14:51 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:51 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 9/15/2024 4:51 PM, Jason Gunthorpe wrote:
> On Sat, Sep 14, 2024 at 06:05:53AM -0700, Steve Sistare wrote:
>> Extend pfn_reader_user to pin file mappings, by calling memfd_pin_folios.
>> Repin at small page granularity and unpack pages into upages[] to mesh
>> with the existing code paths.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> drivers/iommu/iommufd/io_pagetable.h | 5 ++++
>> drivers/iommu/iommufd/pages.c | 58 ++++++++++++++++++++++++++++++++----
>> 2 files changed, 57 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
>> index 7c4a338..3a28f46 100644
>> --- a/drivers/iommu/iommufd/io_pagetable.h
>> +++ b/drivers/iommu/iommufd/io_pagetable.h
>> @@ -177,6 +177,7 @@ enum {
>>
>> enum iopt_address_type {
>> IOPT_ADDRESS_USER = 0,
>> + IOPT_ADDRESS_FILE = 1,
>> };
>>
>> /*
>> @@ -202,6 +203,10 @@ struct iopt_pages {
>> enum iopt_address_type type;
>> union {
>> void __user *uptr; /* IOPT_ADDRESS_USER */
>> + struct { /* IOPT_ADDRESS_FILE */
>> + struct file *file;
>> + unsigned long start;
>> + };
>> };
>> bool writable:1;
>> u8 account_mode;
>> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
>> index 69822d4..065c28f 100644
>> --- a/drivers/iommu/iommufd/pages.c
>> +++ b/drivers/iommu/iommufd/pages.c
>> @@ -695,6 +695,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
>> struct pfn_reader_user {
>> struct page **upages;
>> size_t upages_len;
>> + struct file *file;
>> unsigned long upages_start;
>> unsigned long upages_end;
>> unsigned int gup_flags;
>> @@ -712,6 +713,7 @@ static void pfn_reader_user_init(struct pfn_reader_user *user,
>> user->upages_start = 0;
>> user->upages_end = 0;
>> user->locked = -1;
>> + user->file = (pages->type == IOPT_ADDRESS_FILE) ? pages->file : NULL;
>>
>> user->gup_flags = FOLL_LONGTERM;
>> if (pages->writable)
>> @@ -733,13 +735,54 @@ static void pfn_reader_user_destroy(struct pfn_reader_user *user,
>> user->upages = NULL;
>> }
>>
>> +static long pin_memfd_pages(struct pfn_reader_user *user,
>> + unsigned long start,
>> + unsigned long npages)
>> +{
>> + unsigned long end, nr, i, j, k, npin, nfolios, pgoff, max_folios;
>> + unsigned long npages_orig = npages;
>> + struct folio *folio;
>> + size_t size = npages * sizeof(folio);
>> + struct folio **folios = temp_kmalloc(&size, NULL, 0);
>> +
>> + if (!folios)
>> + return -ENOMEM;
>> +
>> + k = 0;
>> + max_folios = size / sizeof(folio);
>> + end = start + (npages << PAGE_SHIFT) - 1;
>> +
>> + while (npages > 0) {
>> + nfolios = memfd_pin_folios(user->file, start, end,
>> + folios, max_folios, &pgoff);
>> + if (nfolios <= 0)
>> + return nfolios;
>> +
>> + pgoff >>= PAGE_SHIFT;
>> + for (i = 0; i < nfolios; i++) {
>> + folio = folios[i];
>> + nr = folio_nr_pages(folio);
>> + npin = min(nr - pgoff, npages);
>> + repin_folio_unhugely(folio, npin);
>> + for (j = pgoff; j < pgoff + npin; j++)
>> + user->upages[k++] = folio_page(folio, j);
>> + npages -= npin;
>> + start += npin << PAGE_SHIFT;
>> + pgoff = 0;
>
> You should try harder to avoid this loop, the batch already can just
> swallow a full folio, so this would be much better to copy more of the
> code from it's caller and be more fully stand alone. Use only a temp
> folio array and stick that directly into the batch full folio at a
> time.
Yes, I optimize in a later patch as you already saw in the other thread.
I submitted this slow version as an intermediate step to show a working
implementation with less code to review. I forgot to mention that in the
commit message.
- Steve
> This above will be functionally the same, but alot slower..
>
> Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface
2024-09-15 20:52 ` Jason Gunthorpe
@ 2024-09-18 14:51 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:51 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 9/15/2024 4:52 PM, Jason Gunthorpe wrote:
> On Sat, Sep 14, 2024 at 06:05:54AM -0700, Steve Sistare wrote:
>> Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
>> register memory by passing a memfd plus offset and length.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> drivers/iommu/iommufd/ioas.c | 5 +++++
>> drivers/iommu/iommufd/iommufd_private.h | 1 +
>> drivers/iommu/iommufd/main.c | 2 ++
>> include/uapi/linux/iommufd.h | 22 ++++++++++++++++++++++
>> 4 files changed, 30 insertions(+)
>
> This looks OK to me
>
>> /**
>> + * struct iommu_ioas_map_file - ioctl(IOMMU_IOAS_MAP_FILE)
>> + * @size: sizeof(struct iommu_ioas_map_file)
>> + * @fd: the memfd to map
>> + * @start: byte offset from start of file to map from
>> + * @flags, @ios_id, @length, @iova: same as for iommu_ioas_map
>
> But please check that the kdoc tools accept this syntax?
Yeah, the robot already dinged me for it :) Will fix.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 9/9] iommufd: optimize file mapping
2024-09-15 20:59 ` Jason Gunthorpe
@ 2024-09-18 14:52 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:52 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 9/15/2024 4:59 PM, Jason Gunthorpe wrote:
> On Sat, Sep 14, 2024 at 06:05:58AM -0700, Steve Sistare wrote:
>> Avoid expanding a huge folio into multiple upages[] in pin_memfd_pages by
>> adding a page count nupages[i] for each upages[i]. Provide the function
>> batch_from_pages_num to efficiently add counted pages to a pfn_batch.
>
>
>> @@ -695,6 +717,7 @@ static unsigned long batch_rw(struct pfn_batch *batch, void *data,
>> struct pfn_reader_user {
>> struct page **upages;
>> unsigned long upages_num;
>> + u32 *nupages;
>
> I think you'd be better to have
>
> union {
> struct page **upages;
> struct folio **ufolio;
> };
>
> Add some batch_from_folios() to do the logic
>
> The flow would want to work the same as the pages where you fill in
> the ufolios array and then trickle it out into the batch
Sounds good, will do - steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 3/9] iommufd: generalize iopt_pages address
2024-09-14 13:05 ` [PATCH V1 3/9] iommufd: generalize iopt_pages address Steve Sistare
@ 2024-09-18 14:52 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:52 UTC (permalink / raw)
To: Jason Gunthorpe, Nicolin Chen; +Cc: Kevin Tian, iommu
Hi Jason, Nicolin, any comment or RB on this before I submit V2?
- Steve
On 9/14/2024 9:05 AM, Steve Sistare wrote:
> The starting address in iopt_pages is currently a __user *uptr. Generalize
> to allow other types of addresses. Refactor iopt_alloc_pages and
> iopt_map_user_pages into address-type specific and common functions.
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/iommu/iommufd/io_pagetable.c | 60 +++++++++++++++++++++++-------------
> drivers/iommu/iommufd/io_pagetable.h | 13 ++++++--
> drivers/iommu/iommufd/pages.c | 30 +++++++++++++-----
> 3 files changed, 71 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index 1c23a59..3f8f0de 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -376,6 +376,38 @@ int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
> return rc;
> }
>
> +static int iopt_map_common(struct iommufd_ctx *ictx,
> + struct io_pagetable *iopt,
> + struct iopt_pages *pages,
> + unsigned long *iova,
> + unsigned long length,
> + unsigned long start_byte,
> + int iommu_prot,
> + unsigned int flags)
> +{
> + struct iopt_pages_list elm = {};
> + LIST_HEAD(pages_list);
> + int rc;
> +
> + elm.pages = pages;
> + elm.start_byte = start_byte;
> + if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
> + elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
> + elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
> + elm.length = length;
> + list_add(&elm.next, &pages_list);
> +
> + rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
> + if (rc) {
> + if (elm.area)
> + iopt_abort_area(elm.area);
> + if (elm.pages)
> + iopt_put_pages(elm.pages);
> + return rc;
> + }
> + return 0;
> +}
> +
> /**
> * iopt_map_user_pages() - Map a user VA to an iova in the io page table
> * @ictx: iommufd_ctx the iopt is part of
> @@ -400,29 +432,15 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
> unsigned long length, int iommu_prot,
> unsigned int flags)
> {
> - struct iopt_pages_list elm = {};
> - LIST_HEAD(pages_list);
> - int rc;
> + struct iopt_pages *pages;
>
> - elm.pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
> - if (IS_ERR(elm.pages))
> - return PTR_ERR(elm.pages);
> - if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
> - elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
> - elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
> - elm.start_byte = uptr - elm.pages->uptr;
> - elm.length = length;
> - list_add(&elm.next, &pages_list);
> + pages = iopt_alloc_user_pages(uptr, length, iommu_prot & IOMMU_WRITE);
> + if (IS_ERR(pages))
> + return PTR_ERR(pages);
>
> - rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
> - if (rc) {
> - if (elm.area)
> - iopt_abort_area(elm.area);
> - if (elm.pages)
> - iopt_put_pages(elm.pages);
> - return rc;
> - }
> - return 0;
> + return iopt_map_common(ictx, iopt, pages, iova, length,
> + uptr - pages->uptr,
> + iommu_prot, flags);
> }
>
> struct iova_bitmap_fn_arg {
> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
> index 0ec3509..7c4a338 100644
> --- a/drivers/iommu/iommufd/io_pagetable.h
> +++ b/drivers/iommu/iommufd/io_pagetable.h
> @@ -175,6 +175,10 @@ enum {
> IOPT_PAGES_ACCOUNT_MM = 2,
> };
>
> +enum iopt_address_type {
> + IOPT_ADDRESS_USER = 0,
> +};
> +
> /*
> * This holds a pinned page list for multiple areas of IO address space. The
> * pages always originate from a linear chunk of userspace VA. Multiple
> @@ -195,7 +199,10 @@ struct iopt_pages {
> struct task_struct *source_task;
> struct mm_struct *source_mm;
> struct user_struct *source_user;
> - void __user *uptr;
> + enum iopt_address_type type;
> + union {
> + void __user *uptr; /* IOPT_ADDRESS_USER */
> + };
> bool writable:1;
> u8 account_mode;
>
> @@ -206,8 +213,8 @@ struct iopt_pages {
> struct rb_root_cached domains_itree;
> };
>
> -struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
> - bool writable);
> +struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
> + unsigned long length, bool writable);
> void iopt_release_pages(struct kref *kref);
> static inline void iopt_put_pages(struct iopt_pages *pages)
> {
> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
> index 117f644..69822d4 100644
> --- a/drivers/iommu/iommufd/pages.c
> +++ b/drivers/iommu/iommufd/pages.c
> @@ -1139,11 +1139,11 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
> return 0;
> }
>
> -struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
> - bool writable)
> +static struct iopt_pages *iopt_alloc_pages(unsigned long length,
> + unsigned long start_byte,
> + bool writable)
> {
> struct iopt_pages *pages;
> - unsigned long end;
>
> /*
> * The iommu API uses size_t as the length, and protect the DIV_ROUND_UP
> @@ -1152,9 +1152,6 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
> if (length > SIZE_MAX - PAGE_SIZE || length == 0)
> return ERR_PTR(-EINVAL);
>
> - if (check_add_overflow((unsigned long)uptr, length, &end))
> - return ERR_PTR(-EOVERFLOW);
> -
> pages = kzalloc(sizeof(*pages), GFP_KERNEL_ACCOUNT);
> if (!pages)
> return ERR_PTR(-ENOMEM);
> @@ -1164,8 +1161,7 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
> mutex_init(&pages->mutex);
> pages->source_mm = current->mm;
> mmgrab(pages->source_mm);
> - pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
> - pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
> + pages->npages = DIV_ROUND_UP(length + start_byte, PAGE_SIZE);
> pages->access_itree = RB_ROOT_CACHED;
> pages->domains_itree = RB_ROOT_CACHED;
> pages->writable = writable;
> @@ -1179,6 +1175,24 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
> return pages;
> }
>
> +struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
> + unsigned long length, bool writable)
> +{
> + struct iopt_pages *pages;
> + unsigned long end;
> + void __user *uptr_down = (void __user *) ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
> +
> + if (check_add_overflow((unsigned long)uptr, length, &end))
> + return ERR_PTR(-EOVERFLOW);
> +
> + pages = iopt_alloc_pages(length, uptr - uptr_down, writable);
> + if (IS_ERR(pages))
> + return pages;
> + pages->uptr = uptr_down;
> + pages->type = IOPT_ADDRESS_USER;
> + return pages;
> +}
> +
> void iopt_release_pages(struct kref *kref)
> {
> struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 7/9] iommufd: file mappings for mdev
2024-09-14 13:05 ` [PATCH V1 7/9] iommufd: file mappings for mdev Steve Sistare
@ 2024-09-18 14:52 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:52 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen
Hi Jason, any comments or RB on this before I submit V2? - steve
On 9/14/2024 9:05 AM, Steve Sistare wrote:
> Support file mappings for mediated devices, aka mdevs. Access is initiated
> by the vfio_pin_pages and vfio_dma_rw kernel interfaces.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/iommu/iommufd/pages.c | 22 +++++++++++++++-------
> 1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
> index 8c8d0ff..4cb532e 100644
> --- a/drivers/iommu/iommufd/pages.c
> +++ b/drivers/iommu/iommufd/pages.c
> @@ -1706,11 +1706,11 @@ static int iopt_pages_fill_from_domain(struct iopt_pages *pages,
> return 0;
> }
>
> -static int iopt_pages_fill_from_mm(struct iopt_pages *pages,
> - struct pfn_reader_user *user,
> - unsigned long start_index,
> - unsigned long last_index,
> - struct page **out_pages)
> +static int iopt_pages_fill(struct iopt_pages *pages,
> + struct pfn_reader_user *user,
> + unsigned long start_index,
> + unsigned long last_index,
> + struct page **out_pages)
> {
> unsigned long cur_index = start_index;
> int rc;
> @@ -1784,8 +1784,8 @@ int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
>
> /* hole */
> cur_pages = out_pages + (span.start_hole - start_index);
> - rc = iopt_pages_fill_from_mm(pages, &user, span.start_hole,
> - span.last_hole, cur_pages);
> + rc = iopt_pages_fill(pages, &user, span.start_hole,
> + span.last_hole, cur_pages);
> if (rc)
> goto out_clean_xa;
> rc = pages_to_xarray(&pages->pinned_pfns, span.start_hole,
> @@ -1865,6 +1865,9 @@ static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
> struct page *page = NULL;
> int rc;
>
> + if (WARN_ON(pages->type != IOPT_ADDRESS_USER))
> + return -EINVAL;
> +
> if (!mmget_not_zero(pages->source_mm))
> return iopt_pages_rw_slow(pages, index, index, offset, data,
> length, flags);
> @@ -1920,6 +1923,11 @@ int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
> if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
> return -EPERM;
>
> + if (pages->type == IOPT_ADDRESS_FILE)
> + return iopt_pages_rw_slow(pages, start_index, last_index,
> + start_byte % PAGE_SIZE, data, length,
> + flags);
> +
> if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) {
> if (start_index == last_index)
> return iopt_pages_rw_page(pages, start_index,
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 0/9] iommu_ioas_map_file
2024-09-15 20:30 ` Jason Gunthorpe
@ 2024-09-18 14:52 ` Steven Sistare
2024-09-23 17:33 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-09-18 14:52 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 9/15/2024 4:30 PM, Jason Gunthorpe wrote:
> On Sat, Sep 14, 2024 at 06:05:49AM -0700, Steve Sistare wrote:
>> Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
>> memory by passing a memfd plus offset and length. Implement it using
>> the memfd_map_folios KAPI, and the proposed repin_folio_unhugely KAPI.
>> See the individual patches for details.
>
> Do you want to do this and the other patches as well?
>
> This seems much easier to understand, and over the long run I expect
> to wrap VFIO in a DMABUF FD as well, so it should work well.
By "other patches", do you mean iommu_ioas_change_process and friends,
with VA manipulation removed? I will submit that as a separate series
after I submit iommu_ioas_map_file v2.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-18 14:51 ` Steven Sistare
@ 2024-09-19 8:11 ` David Hildenbrand
2024-09-19 21:06 ` Steven Sistare
2024-09-20 13:28 ` Jason Gunthorpe
0 siblings, 2 replies; 39+ messages in thread
From: David Hildenbrand @ 2024-09-19 8:11 UTC (permalink / raw)
To: Steven Sistare, linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox
On 18.09.24 16:51, Steven Sistare wrote:
> On 9/17/2024 8:25 AM, David Hildenbrand wrote:
>> On 14.09.24 15:19, Steven Sistare wrote:
>>> cc'ing linux-mm for review of this one patch of the series.
>>>
>>> This proposes a new KAPI function repin_folio_unhugely(), for use in this
>>> patch of the iommu_ioas_map_file series:
>>>
>>> iommufd: IOMMU_IOAS_MAP_FILE implementation
>>> https://lore.kernel.org/linux-iommu/1726319158-283074-7-git-send-email-steven.sistare@oracle.com
>>>
>>> - Steve
>>>
>>> On 9/14/2024 9:05 AM, Steve Sistare wrote:
>>>> Export a function that repins a huge-page folio at small-page granularity.
>>>> This allows any range of small pages within the folio to be unpinned later.
>>>> For example, pages pinned via memfd_pin_folios and modified by
>>>> repin_folio_unhugely could be unpinned via unpin_user_page(s).
>>>>
>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> include/linux/mm.h | 1 +
>>>> mm/gup.c | 18 ++++++++++++++++++
>>>> 2 files changed, 19 insertions(+)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 1470736..ba8344f 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -2514,6 +2514,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>>> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>>> struct folio **folios, unsigned int max_folios,
>>>> pgoff_t *offset);
>>>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin);
>>>> int get_user_pages_fast(unsigned long start, int nr_pages,
>>>> unsigned int gup_flags, struct page **pages);
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 947881ff..f8f3f2a 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>>> return ret;
>>>> }
>>>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>>>> +
>>>> +/**
>>>> + * repin_folio_unhugely() - repin a folio at small page granularity
>>>> + * @folio: the folio to repin
>>>> + * @npin: the number of pages pinned in the folio
>>>> + *
>>>> + * Given a huge page folio that is already pinned, and the number of small
>>
>> s/huge page folio/large folio/
>>
>>>> + * pages that are pinned in it, adjust the pincount to reflect small-page
>>>> + * granularity. Each small page can later be unpinned individually.
>>>> + */
>>>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin)
>>>> +{
>>>> + if (!folio_test_large(folio) || is_huge_zero_folio(folio) || npin == 1)
>>
>> Why not the huge zero folio? That looks very odd here.
>
> The zero page is treated specially here and elsewhere, it can never be deleted so
> reference fiddling is skipped.
Please point me in mm/gup.c at that handling.
IIRC is_zero_folio() does *not* include the huge zero page.
Yes, we should likely be special-casing the huge zeropage in mm/gup.c,
but it's not that easy because PINs can outlive MMs ... so *not*
grabbing a reference could currently be harmful.
But that has do be changed consistently, not with doing things here
different compared to other gup.c functions.
>
>>>> + return;
>>>> + atomic_add(npin - 1, &folio->_refcount);
>>>> + atomic_add(npin - 1, &folio->_pincount);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(repin_folio_unhugely);
>>
>> Can we ... find a better name? For example, it's "large" folio not "huge"...
>>
>> And repin is really misleading. We are simply adding more pins to an already pinned one ...
>
> Jason suggests a better name in the other thread.
I would prefer something that simply adds more pins to an already pinned
folio. Much easier to get.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-19 8:11 ` David Hildenbrand
@ 2024-09-19 21:06 ` Steven Sistare
2024-09-26 11:38 ` David Hildenbrand
2024-09-20 13:28 ` Jason Gunthorpe
1 sibling, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-09-19 21:06 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox
On 9/19/2024 4:11 AM, David Hildenbrand wrote:
> On 18.09.24 16:51, Steven Sistare wrote:
>> On 9/17/2024 8:25 AM, David Hildenbrand wrote:
>>> On 14.09.24 15:19, Steven Sistare wrote:
>>>> cc'ing linux-mm for review of this one patch of the series.
>>>>
>>>> This proposes a new KAPI function repin_folio_unhugely(), for use in this
>>>> patch of the iommu_ioas_map_file series:
>>>>
>>>> iommufd: IOMMU_IOAS_MAP_FILE implementation
>>>> https://lore.kernel.org/linux-iommu/1726319158-283074-7-git-send-email-steven.sistare@oracle.com
>>>>
>>>> - Steve
>>>>
>>>> On 9/14/2024 9:05 AM, Steve Sistare wrote:
>>>>> Export a function that repins a huge-page folio at small-page granularity.
>>>>> This allows any range of small pages within the folio to be unpinned later.
>>>>> For example, pages pinned via memfd_pin_folios and modified by
>>>>> repin_folio_unhugely could be unpinned via unpin_user_page(s).
>>>>>
>>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>> include/linux/mm.h | 1 +
>>>>> mm/gup.c | 18 ++++++++++++++++++
>>>>> 2 files changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index 1470736..ba8344f 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -2514,6 +2514,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>>>>> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>>>> struct folio **folios, unsigned int max_folios,
>>>>> pgoff_t *offset);
>>>>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin);
>>>>> int get_user_pages_fast(unsigned long start, int nr_pages,
>>>>> unsigned int gup_flags, struct page **pages);
>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>> index 947881ff..f8f3f2a 100644
>>>>> --- a/mm/gup.c
>>>>> +++ b/mm/gup.c
>>>>> @@ -3720,3 +3720,21 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>>>>> return ret;
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>>>>> +
>>>>> +/**
>>>>> + * repin_folio_unhugely() - repin a folio at small page granularity
>>>>> + * @folio: the folio to repin
>>>>> + * @npin: the number of pages pinned in the folio
>>>>> + *
>>>>> + * Given a huge page folio that is already pinned, and the number of small
>>>
>>> s/huge page folio/large folio/
>>>
>>>>> + * pages that are pinned in it, adjust the pincount to reflect small-page
>>>>> + * granularity. Each small page can later be unpinned individually.
>>>>> + */
>>>>> +void repin_folio_unhugely(struct folio *folio, unsigned long npin)
>>>>> +{
>>>>> + if (!folio_test_large(folio) || is_huge_zero_folio(folio) || npin == 1)
>>>
>>> Why not the huge zero folio? That looks very odd here.
>>
>> The zero page is treated specially here and elsewhere, it can never be deleted so
>> reference fiddling is skipped.
>
> Please point me in mm/gup.c at that handling.
>
> IIRC is_zero_folio() does *not* include the huge zero page.
>
> Yes, we should likely be special-casing the huge zeropage in mm/gup.c, but it's not that easy because PINs can outlive MMs ... so *not* grabbing a reference could currently be harmful.
>
> But that has do be changed consistently, not with doing things here different compared to other gup.c functions.
folios_put() -> folios_put_refs() -> is_huge_zero_folio()
I will run some tests with huge zero folios to verify the ref and pin
counts behave correctly.
>>>>> + return;
>>>>> + atomic_add(npin - 1, &folio->_refcount);
>>>>> + atomic_add(npin - 1, &folio->_pincount);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(repin_folio_unhugely);
>>>
>>> Can we ... find a better name? For example, it's "large" folio not "huge"...
>>>
>>> And repin is really misleading. We are simply adding more pins to an already pinned one ...
>>
>> Jason suggests a better name in the other thread.
>
> I would prefer something that simply adds more pins to an already pinned folio. Much easier to get.
How about folio_add_pins()?
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-19 8:11 ` David Hildenbrand
2024-09-19 21:06 ` Steven Sistare
@ 2024-09-20 13:28 ` Jason Gunthorpe
2024-09-26 11:32 ` David Hildenbrand
1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-20 13:28 UTC (permalink / raw)
To: David Hildenbrand
Cc: Steven Sistare, linux-mm, Kevin Tian, Nicolin Chen, iommu,
Andrew Morton, Matthew Wilcox
On Thu, Sep 19, 2024 at 10:11:38AM +0200, David Hildenbrand wrote:
> > > And repin is really misleading. We are simply adding more pins to an already pinned one ...
> >
> > Jason suggests a better name in the other thread.
>
> I would prefer something that simply adds more pins to an already pinned
> folio. Much easier to get.
Yes, but also nobody should ever want to do that operation, it should
always be part of some kind of "splitting" sort of behavior..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 0/9] iommu_ioas_map_file
2024-09-18 14:52 ` Steven Sistare
@ 2024-09-23 17:33 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-23 17:33 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Wed, Sep 18, 2024 at 10:52:57AM -0400, Steven Sistare wrote:
> On 9/15/2024 4:30 PM, Jason Gunthorpe wrote:
> > On Sat, Sep 14, 2024 at 06:05:49AM -0700, Steve Sistare wrote:
> > > Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
> > > memory by passing a memfd plus offset and length. Implement it using
> > > the memfd_map_folios KAPI, and the proposed repin_folio_unhugely KAPI.
> > > See the individual patches for details.
> >
> > Do you want to do this and the other patches as well?
> >
> > This seems much easier to understand, and over the long run I expect
> > to wrap VFIO in a DMABUF FD as well, so it should work well.
>
> By "other patches", do you mean iommu_ioas_change_process and friends,
> with VA manipulation removed?
I was meaning iommu_ioas_change_process as you already showed it, but
that seems to answer the question that we will rely on this approach
and drop the VA stuff
Thanks,
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova
2024-09-18 14:51 ` Steven Sistare
@ 2024-09-24 19:50 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-24 19:50 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Wed, Sep 18, 2024 at 10:51:32AM -0400, Steven Sistare wrote:
> On 9/15/2024 4:41 PM, Jason Gunthorpe wrote:
> > On Sat, Sep 14, 2024 at 06:05:51AM -0700, Steve Sistare wrote:
> > > iopt_alloc_iova takes a uptr argument but only checks for its alignment.
> > > Pass an offset from start instead to eliminate its dependence on a user
> > > address.
> >
> > An offset from start is a meaningfully different thing for this, it
> > throws away all the information it is trying to extract (is the uptr
> > huge page aligned)
>
> Interpret offset as "from the start of the address space, eg 0" and it
> works. That was my intent, since I pass offset from the start of the file
> address space in a subsequent patch. But instead, I can rename uptr to
> to address and describe it this way:
>
> iopt_alloc_iova takes a uptr argument but only checks for its alignment.
> Generalize this to an unsigned address, which will be the offset from the
> start of a file in a subsequent patch. No functional change.
Yes that makes sense
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-20 13:28 ` Jason Gunthorpe
@ 2024-09-26 11:32 ` David Hildenbrand
2024-09-26 11:40 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2024-09-26 11:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Steven Sistare, linux-mm, Kevin Tian, Nicolin Chen, iommu,
Andrew Morton, Matthew Wilcox
On 20.09.24 15:28, Jason Gunthorpe wrote:
> On Thu, Sep 19, 2024 at 10:11:38AM +0200, David Hildenbrand wrote:
>
>>>> And repin is really misleading. We are simply adding more pins to an already pinned one ...
>>>
>>> Jason suggests a better name in the other thread.
>>
>> I would prefer something that simply adds more pins to an already pinned
>> folio. Much easier to get.
>
> Yes, but also nobody should ever want to do that operation, it should
> always be part of some kind of "splitting" sort of behavior..
I remember patches from Dave Howells that needed that for O_DIRECT
handling. Never say never ;)
Adding is much more intuitive than splitting ... just like we add
references when splitting a THP, using folio_ref_add().
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-19 21:06 ` Steven Sistare
@ 2024-09-26 11:38 ` David Hildenbrand
0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2024-09-26 11:38 UTC (permalink / raw)
To: Steven Sistare, linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox
>>> reference fiddling is skipped.
>>
>> Please point me in mm/gup.c at that handling.
>>
>> IIRC is_zero_folio() does *not* include the huge zero page.
>>
>> Yes, we should likely be special-casing the huge zeropage in mm/gup.c, but it's not that easy because PINs can outlive MMs ... so *not* grabbing a reference could currently be harmful.
>>
>> But that has do be changed consistently, not with doing things here different compared to other gup.c functions.
>
Sorry for the late reply, conferences ...
> folios_put() -> folios_put_refs() -> is_huge_zero_folio()
unpin_user_folio() -> gup_put_folio() will do
1) atomic_sub(refs, &folio->_pincount);
2) folio_put_refs(folio, refs);
What you cite above is for releasing huge folios that are mapped into
user space.
>
> I will run some tests with huge zero folios to verify the ref and pin
> counts behave correctly.
Whatever is pinned is supposed to be released via the unpin interface.
We have to be consistent there, so this patch would be wrong (likely you
never get the huge zero folio).
>
>>>>>> + return;
>>>>>> + atomic_add(npin - 1, &folio->_refcount);
>>>>>> + atomic_add(npin - 1, &folio->_pincount);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(repin_folio_unhugely);
>>>>
>>>> Can we ... find a better name? For example, it's "large" folio not "huge"...
>>>>
>>>> And repin is really misleading. We are simply adding more pins to an already pinned one ...
>>>
>>> Jason suggests a better name in the other thread.
>>
>> I would prefer something that simply adds more pins to an already pinned folio. Much easier to get.
>
> How about folio_add_pins()?
Something like that, but more importantly, via which interface did we
obtain the pins? That will make a difference and should be documented.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-26 11:32 ` David Hildenbrand
@ 2024-09-26 11:40 ` Jason Gunthorpe
2024-09-26 12:57 ` David Hildenbrand
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-09-26 11:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: Steven Sistare, linux-mm, Kevin Tian, Nicolin Chen, iommu,
Andrew Morton, Matthew Wilcox
On Thu, Sep 26, 2024 at 01:32:09PM +0200, David Hildenbrand wrote:
> On 20.09.24 15:28, Jason Gunthorpe wrote:
> > On Thu, Sep 19, 2024 at 10:11:38AM +0200, David Hildenbrand wrote:
> >
> > > > > And repin is really misleading. We are simply adding more pins to an already pinned one ...
> > > >
> > > > Jason suggests a better name in the other thread.
> > >
> > > I would prefer something that simply adds more pins to an already pinned
> > > folio. Much easier to get.
> >
> > Yes, but also nobody should ever want to do that operation, it should
> > always be part of some kind of "splitting" sort of behavior..
>
> I remember patches from Dave Howells that needed that for O_DIRECT handling.
> Never say never ;)
Wouldn't O_DIRECT be the same splitting thing?
> Adding is much more intuitive than splitting ... just like we add references
> when splitting a THP, using folio_ref_add().
Well, sure, it just seems harder to document so people can use it
properly.
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-26 11:40 ` Jason Gunthorpe
@ 2024-09-26 12:57 ` David Hildenbrand
2024-09-26 12:58 ` David Hildenbrand
0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2024-09-26 12:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Steven Sistare, linux-mm, Kevin Tian, Nicolin Chen, iommu,
Andrew Morton, Matthew Wilcox
On 26.09.24 13:40, Jason Gunthorpe wrote:
> On Thu, Sep 26, 2024 at 01:32:09PM +0200, David Hildenbrand wrote:
>> On 20.09.24 15:28, Jason Gunthorpe wrote:
>>> On Thu, Sep 19, 2024 at 10:11:38AM +0200, David Hildenbrand wrote:
>>>
>>>>>> And repin is really misleading. We are simply adding more pins to an already pinned one ...
>>>>>
>>>>> Jason suggests a better name in the other thread.
>>>>
>>>> I would prefer something that simply adds more pins to an already pinned
>>>> folio. Much easier to get.
>>>
>>> Yes, but also nobody should ever want to do that operation, it should
>>> always be part of some kind of "splitting" sort of behavior..
>>
>> I remember patches from Dave Howells that needed that for O_DIRECT handling.
>> Never say never ;)
>
> Wouldn't O_DIRECT be the same splitting thing?
No, I recall the implementation had to duplicate pins, not split.
>
>> Adding is much more intuitive than splitting ... just like we add references
>> when splitting a THP, using folio_ref_add().
>
> Well, sure, it just seems harder to document so people can use it
> properly.
As soon as we have large GUP and splitting might go to multiple PMDs,
multiple PTEs or a mixture, just being able to add the number of pins
you actually need might be cleaner ...
Let me have a look at the latest reincarnation of this patch.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V1 1/9] mm/gup: repin_folio_unhugely
2024-09-26 12:57 ` David Hildenbrand
@ 2024-09-26 12:58 ` David Hildenbrand
0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2024-09-26 12:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Steven Sistare, linux-mm, Kevin Tian, Nicolin Chen, iommu,
Andrew Morton, Matthew Wilcox
On 26.09.24 14:57, David Hildenbrand wrote:
> On 26.09.24 13:40, Jason Gunthorpe wrote:
>> On Thu, Sep 26, 2024 at 01:32:09PM +0200, David Hildenbrand wrote:
>>> On 20.09.24 15:28, Jason Gunthorpe wrote:
>>>> On Thu, Sep 19, 2024 at 10:11:38AM +0200, David Hildenbrand wrote:
>>>>
>>>>>>> And repin is really misleading. We are simply adding more pins to an already pinned one ...
>>>>>>
>>>>>> Jason suggests a better name in the other thread.
>>>>>
>>>>> I would prefer something that simply adds more pins to an already pinned
>>>>> folio. Much easier to get.
>>>>
>>>> Yes, but also nobody should ever want to do that operation, it should
>>>> always be part of some kind of "splitting" sort of behavior..
>>>
>>> I remember patches from Dave Howells that needed that for O_DIRECT handling.
>>> Never say never ;)
>>
>> Wouldn't O_DIRECT be the same splitting thing?
>
> No, I recall the implementation had to duplicate pins, not split.
>
>>
>>> Adding is much more intuitive than splitting ... just like we add references
>>> when splitting a THP, using folio_ref_add().
>>
>> Well, sure, it just seems harder to document so people can use it
>> properly.
>
> As soon as we have large GUP
s/GUP/PUD/
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2024-09-26 12:58 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-14 13:05 [PATCH V1 0/9] iommu_ioas_map_file Steve Sistare
2024-09-14 13:05 ` [PATCH V1 1/9] mm/gup: repin_folio_unhugely Steve Sistare
2024-09-14 13:19 ` Steven Sistare
2024-09-17 12:25 ` David Hildenbrand
2024-09-18 14:51 ` Steven Sistare
2024-09-19 8:11 ` David Hildenbrand
2024-09-19 21:06 ` Steven Sistare
2024-09-26 11:38 ` David Hildenbrand
2024-09-20 13:28 ` Jason Gunthorpe
2024-09-26 11:32 ` David Hildenbrand
2024-09-26 11:40 ` Jason Gunthorpe
2024-09-26 12:57 ` David Hildenbrand
2024-09-26 12:58 ` David Hildenbrand
2024-09-15 20:37 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 2/9] iommufd: remove uptr from iopt_alloc_iova Steve Sistare
2024-09-15 20:41 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
2024-09-24 19:50 ` Jason Gunthorpe
2024-09-14 13:05 ` [PATCH V1 3/9] iommufd: generalize iopt_pages address Steve Sistare
2024-09-18 14:52 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 4/9] iommufd: pfn reader for file mappings Steve Sistare
2024-09-15 20:51 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 5/9] iommufd: IOMMU_IOAS_MAP_FILE interface Steve Sistare
2024-09-15 20:52 ` Jason Gunthorpe
2024-09-18 14:51 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 6/9] iommufd: IOMMU_IOAS_MAP_FILE implementation Steve Sistare
2024-09-17 1:48 ` kernel test robot
2024-09-14 13:05 ` [PATCH V1 7/9] iommufd: file mappings for mdev Steve Sistare
2024-09-18 14:52 ` Steven Sistare
2024-09-14 13:05 ` [PATCH V1 8/9] iommufd: replace upages_len Steve Sistare
2024-09-14 13:05 ` [PATCH V1 9/9] iommufd: optimize file mapping Steve Sistare
2024-09-15 20:59 ` Jason Gunthorpe
2024-09-18 14:52 ` Steven Sistare
2024-09-14 13:21 ` [PATCH V1 0/9] iommu_ioas_map_file Steven Sistare
2024-09-15 20:30 ` Jason Gunthorpe
2024-09-18 14:52 ` Steven Sistare
2024-09-23 17:33 ` Jason Gunthorpe
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.