* [PATCH V3 0/9] iommu_ioas_map_file
@ 2024-10-04 18:48 Steve Sistare
2024-10-04 18:48 ` [PATCH V3 1/9] mm/gup: folio_add_pins Steve Sistare
` (9 more replies)
0 siblings, 10 replies; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
memory by passing a memfd plus offset and length. Implement it using
the memfd_map_folios KAPI, and the proposed folio_add_pins KAPI.
See the individual patches for details.
Changes in V2:
* changed names and commit message in "rename uptr in iopt_alloc_iova"
* normalized comments describing the iopt_map_user_pages interface
* submitted folio_split_user_page_pin (fka folio_repin_unhugely) separately
* replaced nupages[] optimization with folio-to-batch optimization
* added selftests for map file
Changes in V3:
* fixed bug setting user->locked
* fixed bug in pages->file refs
* replaced lockdep_off with down_write_nest_lock
* added ufolios_next to track folio consumption in reader
* combined IOMMU_IOAS_MAP_FILE interface and implementation
* added patch folio_add_pins (fka folio_split_user_page_pin )
* reformatted patches using clang-format
* misc cosmetic changes in response to review comments
Steve Sistare (9):
mm/gup: folio_add_pins
iommufd: rename uptr in iopt_alloc_iova
iommufd: generalize iopt_pages address
iommufd: pfn reader for file mappings
iommufd: IOMMU_IOAS_MAP_FILE
iommufd: file mappings for mdev
iommufd: pfn reader local variables
iommufd: optimize file mapping
iommufd: map file selftest
drivers/iommu/iommufd/io_pagetable.c | 118 +++++++---
drivers/iommu/iommufd/io_pagetable.h | 20 +-
drivers/iommu/iommufd/ioas.c | 43 ++++
drivers/iommu/iommufd/iommufd_private.h | 5 +
drivers/iommu/iommufd/main.c | 2 +
drivers/iommu/iommufd/pages.c | 320 +++++++++++++++++++++-----
include/linux/mm.h | 1 +
include/uapi/linux/iommufd.h | 25 ++
mm/gup.c | 24 ++
tools/testing/selftests/iommu/iommufd.c | 151 ++++++++++--
tools/testing/selftests/iommu/iommufd_utils.h | 41 ++++
11 files changed, 644 insertions(+), 106 deletions(-)
--
1.8.3.1
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH V3 1/9] mm/gup: folio_add_pins
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-04 18:52 ` Steven Sistare
2024-10-16 12:17 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 2/9] iommufd: rename uptr in iopt_alloc_iova Steve Sistare
` (8 subsequent siblings)
9 siblings, 2 replies; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Export a function that adds pins to an already-pinned huge-page folio.
This allows any range of small pages within the folio to be unpinned later.
For example, pages pinned via memfd_pin_folios and modified by
folio_add_pins could be unpinned via unpin_user_page(s).
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/linux/mm.h | 1 +
mm/gup.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 13bff7c..70d5293 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2521,6 +2521,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
struct folio **folios, unsigned int max_folios,
pgoff_t *offset);
+int folio_add_pins(struct folio *folio, unsigned int pins);
int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index fcd602b..11c5f27 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3733,3 +3733,27 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
return ret;
}
EXPORT_SYMBOL_GPL(memfd_pin_folios);
+
+/**
+ * folio_add_pins() - add pins to an already-pinned folio
+ * @folio: the folio to add more pins to
+ * @pins: number of pins to add
+ *
+ * Try to add more pins to an already-pinned folio. The semantics
+ * of the pin (e.g., FOLL_WRITE) follow any existing pin and cannot
+ * be changed.
+ *
+ * This function is helpful when having obtained a pin on a large folio
+ * using memfd_pin_folios(), but wanting to logically unpin parts
+ * (e.g., individual pages) of the folio later, for example, using
+ * unpin_user_page_range_dirty_lock().
+ *
+ * This is not the right interface to initially pin a folio.
+ */
+int folio_add_pins(struct folio *folio, unsigned int pins)
+{
+ VM_WARN_ON_ONCE(!folio_maybe_dma_pinned(folio));
+
+ return try_grab_folio(folio, pins, FOLL_PIN);
+}
+EXPORT_SYMBOL_GPL(folio_add_pins);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 2/9] iommufd: rename uptr in iopt_alloc_iova
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
2024-10-04 18:48 ` [PATCH V3 1/9] mm/gup: folio_add_pins Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-04 18:48 ` [PATCH V3 3/9] iommufd: generalize iopt_pages address Steve Sistare
` (7 subsequent siblings)
9 siblings, 0 replies; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
iopt_alloc_iova takes a uptr argument but only checks for its alignment.
Generalize this to an unsigned address, which can be the offset from the
start of a file in a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommufd/io_pagetable.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 05fd9d3..20e31e1 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -107,9 +107,9 @@ static bool __alloc_iova_check_used(struct interval_tree_span_iter *span,
* Does not return a 0 IOVA even if it is valid.
*/
static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
- unsigned long uptr, unsigned long length)
+ unsigned long addr, unsigned long length)
{
- unsigned long page_offset = uptr % PAGE_SIZE;
+ unsigned long page_offset = addr % PAGE_SIZE;
struct interval_tree_double_span_iter used_span;
struct interval_tree_span_iter allowed_span;
unsigned long iova_alignment;
@@ -121,15 +121,15 @@ static int iopt_alloc_iova(struct io_pagetable *iopt, unsigned long *iova,
return -EOVERFLOW;
/*
- * Keep alignment present in the uptr when building the IOVA, this
+ * Keep alignment present in addr when building the IOVA, which
* increases the chance we can map a THP.
*/
- if (!uptr)
+ if (!addr)
iova_alignment = roundup_pow_of_two(length);
else
iova_alignment = min_t(unsigned long,
roundup_pow_of_two(length),
- 1UL << __ffs64(uptr));
+ 1UL << __ffs64(addr));
if (iova_alignment < iopt->iova_alignment)
return -EINVAL;
@@ -240,7 +240,7 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
int iommu_prot, unsigned int flags)
{
struct iopt_pages_list *elm;
- unsigned long iova;
+ unsigned long iova, start;
int rc = 0;
list_for_each_entry(elm, pages_list, next) {
@@ -259,9 +259,8 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
/* Use the first entry to guess the ideal IOVA alignment */
elm = list_first_entry(pages_list, struct iopt_pages_list,
next);
- rc = iopt_alloc_iova(
- iopt, dst_iova,
- (uintptr_t)elm->pages->uptr + elm->start_byte, length);
+ start = elm->start_byte + (uintptr_t)elm->pages->uptr;
+ rc = iopt_alloc_iova(iopt, dst_iova, start, length);
if (rc)
goto out_unlock;
if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 3/9] iommufd: generalize iopt_pages address
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
2024-10-04 18:48 ` [PATCH V3 1/9] mm/gup: folio_add_pins Steve Sistare
2024-10-04 18:48 ` [PATCH V3 2/9] iommufd: rename uptr in iopt_alloc_iova Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-04 18:48 ` [PATCH V3 4/9] iommufd: pfn reader for file mappings Steve Sistare
` (6 subsequent siblings)
9 siblings, 0 replies; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
The starting address in iopt_pages is currently a __user *uptr. Generalize
to allow other types of addresses. Refactor iopt_alloc_pages and
iopt_map_user_pages into address-type specific and common functions.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommufd/io_pagetable.c | 55 ++++++++++++++++++++++--------------
drivers/iommu/iommufd/io_pagetable.h | 13 +++++++--
drivers/iommu/iommufd/pages.c | 31 ++++++++++++++------
3 files changed, 67 insertions(+), 32 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 20e31e1..9dd7c03 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -375,6 +375,34 @@ int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
return rc;
}
+static int iopt_map_common(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+ struct iopt_pages *pages, unsigned long *iova,
+ unsigned long length, unsigned long start_byte,
+ int iommu_prot, unsigned int flags)
+{
+ struct iopt_pages_list elm = {};
+ LIST_HEAD(pages_list);
+ int rc;
+
+ elm.pages = pages;
+ elm.start_byte = start_byte;
+ if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
+ elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
+ elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
+ elm.length = length;
+ list_add(&elm.next, &pages_list);
+
+ rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
+ if (rc) {
+ if (elm.area)
+ iopt_abort_area(elm.area);
+ if (elm.pages)
+ iopt_put_pages(elm.pages);
+ return rc;
+ }
+ return 0;
+}
+
/**
* iopt_map_user_pages() - Map a user VA to an iova in the io page table
* @ictx: iommufd_ctx the iopt is part of
@@ -399,29 +427,14 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long length, int iommu_prot,
unsigned int flags)
{
- struct iopt_pages_list elm = {};
- LIST_HEAD(pages_list);
- int rc;
+ struct iopt_pages *pages;
- elm.pages = iopt_alloc_pages(uptr, length, iommu_prot & IOMMU_WRITE);
- if (IS_ERR(elm.pages))
- return PTR_ERR(elm.pages);
- if (ictx->account_mode == IOPT_PAGES_ACCOUNT_MM &&
- elm.pages->account_mode == IOPT_PAGES_ACCOUNT_USER)
- elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
- elm.start_byte = uptr - elm.pages->uptr;
- elm.length = length;
- list_add(&elm.next, &pages_list);
+ pages = iopt_alloc_user_pages(uptr, length, iommu_prot & IOMMU_WRITE);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
- rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
- if (rc) {
- if (elm.area)
- iopt_abort_area(elm.area);
- if (elm.pages)
- iopt_put_pages(elm.pages);
- return rc;
- }
- return 0;
+ return iopt_map_common(ictx, iopt, pages, iova, length,
+ uptr - pages->uptr, iommu_prot, flags);
}
struct iova_bitmap_fn_arg {
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 0ec3509..7c4a338 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -175,6 +175,10 @@ enum {
IOPT_PAGES_ACCOUNT_MM = 2,
};
+enum iopt_address_type {
+ IOPT_ADDRESS_USER = 0,
+};
+
/*
* This holds a pinned page list for multiple areas of IO address space. The
* pages always originate from a linear chunk of userspace VA. Multiple
@@ -195,7 +199,10 @@ struct iopt_pages {
struct task_struct *source_task;
struct mm_struct *source_mm;
struct user_struct *source_user;
- void __user *uptr;
+ enum iopt_address_type type;
+ union {
+ void __user *uptr; /* IOPT_ADDRESS_USER */
+ };
bool writable:1;
u8 account_mode;
@@ -206,8 +213,8 @@ struct iopt_pages {
struct rb_root_cached domains_itree;
};
-struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
- bool writable);
+struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
+ unsigned long length, bool writable);
void iopt_release_pages(struct kref *kref);
static inline void iopt_put_pages(struct iopt_pages *pages)
{
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 117f644..5419795 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1139,11 +1139,11 @@ static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
return 0;
}
-struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
- bool writable)
+static struct iopt_pages *iopt_alloc_pages(unsigned long length,
+ unsigned long start_byte,
+ bool writable)
{
struct iopt_pages *pages;
- unsigned long end;
/*
* The iommu API uses size_t as the length, and protect the DIV_ROUND_UP
@@ -1152,9 +1152,6 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
if (length > SIZE_MAX - PAGE_SIZE || length == 0)
return ERR_PTR(-EINVAL);
- if (check_add_overflow((unsigned long)uptr, length, &end))
- return ERR_PTR(-EOVERFLOW);
-
pages = kzalloc(sizeof(*pages), GFP_KERNEL_ACCOUNT);
if (!pages)
return ERR_PTR(-ENOMEM);
@@ -1164,8 +1161,7 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
mutex_init(&pages->mutex);
pages->source_mm = current->mm;
mmgrab(pages->source_mm);
- pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
- pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+ pages->npages = DIV_ROUND_UP(length + start_byte, PAGE_SIZE);
pages->access_itree = RB_ROOT_CACHED;
pages->domains_itree = RB_ROOT_CACHED;
pages->writable = writable;
@@ -1179,6 +1175,25 @@ struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
return pages;
}
+struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
+ unsigned long length, bool writable)
+{
+ struct iopt_pages *pages;
+ unsigned long end;
+ void __user *uptr_down =
+ (void __user *) ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+
+ if (check_add_overflow((unsigned long)uptr, length, &end))
+ return ERR_PTR(-EOVERFLOW);
+
+ pages = iopt_alloc_pages(length, uptr - uptr_down, writable);
+ if (IS_ERR(pages))
+ return pages;
+ pages->uptr = uptr_down;
+ pages->type = IOPT_ADDRESS_USER;
+ return pages;
+}
+
void iopt_release_pages(struct kref *kref)
{
struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 4/9] iommufd: pfn reader for file mappings
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (2 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 3/9] iommufd: generalize iopt_pages address Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-16 12:37 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE Steve Sistare
` (5 subsequent siblings)
9 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Extend pfn_reader_user to pin file mappings, by calling memfd_pin_folios.
Repin at small page granularity and unpack pages into upages[] to mesh
with the existing code paths. This is sub-optimal but simple, and will be
optimized in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/io_pagetable.h | 5 ++
drivers/iommu/iommufd/pages.c | 91 +++++++++++++++++++++++++++++++-----
2 files changed, 84 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 7c4a338..3a28f46 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -177,6 +177,7 @@ enum {
enum iopt_address_type {
IOPT_ADDRESS_USER = 0,
+ IOPT_ADDRESS_FILE = 1,
};
/*
@@ -202,6 +203,10 @@ struct iopt_pages {
enum iopt_address_type type;
union {
void __user *uptr; /* IOPT_ADDRESS_USER */
+ struct { /* IOPT_ADDRESS_FILE */
+ struct file *file;
+ unsigned long start;
+ };
};
bool writable:1;
u8 account_mode;
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 5419795..bb60af1 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -703,19 +703,28 @@ struct pfn_reader_user {
* neither
*/
int locked;
+
+ /* The following are only valid if file != NULL. */
+ struct file *file;
+ struct folio **ufolios;
+ unsigned long ufolios_len;
};
static void pfn_reader_user_init(struct pfn_reader_user *user,
struct iopt_pages *pages)
{
user->upages = NULL;
+ user->upages_len = 0;
user->upages_start = 0;
user->upages_end = 0;
user->locked = -1;
-
user->gup_flags = FOLL_LONGTERM;
if (pages->writable)
user->gup_flags |= FOLL_WRITE;
+
+ user->file = (pages->type == IOPT_ADDRESS_FILE) ? pages->file : NULL;
+ user->ufolios = NULL;
+ user->ufolios_len = 0;
}
static void pfn_reader_user_destroy(struct pfn_reader_user *user,
@@ -731,6 +740,47 @@ static void pfn_reader_user_destroy(struct pfn_reader_user *user,
kfree(user->upages);
user->upages = NULL;
+ kfree(user->ufolios);
+ user->ufolios = NULL;
+}
+
+static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
+ unsigned long npages)
+{
+ unsigned long end, nr, i, j, npin, offset, npages_out;
+ long nfolios;
+ int rc;
+ struct folio *folio;
+ struct page **upages = user->upages;
+
+ nfolios = user->ufolios_len / sizeof(*user->ufolios);
+ end = start + (npages << PAGE_SHIFT) - 1;
+
+ nfolios = memfd_pin_folios(user->file, start, end, user->ufolios,
+ nfolios, &offset);
+ if (nfolios <= 0)
+ return nfolios;
+
+ offset >>= PAGE_SHIFT;
+ npages_out = 0;
+
+ for (i = 0; i < nfolios; i++) {
+ folio = user->ufolios[i];
+ nr = folio_nr_pages(folio);
+ npin = min(nr - offset, npages);
+ if (npin > 1) {
+ rc = folio_add_pins(folio, npin - 1);
+ if (rc)
+ return rc;
+ }
+ for (j = offset; j < offset + npin; j++)
+ *upages++ = folio_page(folio, j);
+ npages -= npin;
+ npages_out += npin;
+ offset = 0;
+ }
+
+ return npages_out;
}
static int pfn_reader_user_pin(struct pfn_reader_user *user,
@@ -739,7 +789,8 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
unsigned long last_index)
{
bool remote_mm = pages->source_mm != current->mm;
- unsigned long npages;
+ unsigned long npages = last_index - start_index + 1;
+ unsigned long start, unum;
uintptr_t uptr;
long rc;
@@ -749,38 +800,53 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
if (!user->upages) {
/* All undone in pfn_reader_destroy() */
- user->upages_len =
- (last_index - start_index + 1) * sizeof(*user->upages);
+ user->upages_len = npages * sizeof(*user->upages);
user->upages = temp_kmalloc(&user->upages_len, NULL, 0);
if (!user->upages)
return -ENOMEM;
}
+ if (user->file && !user->ufolios) {
+ user->ufolios_len = npages * sizeof(*user->ufolios);
+ user->ufolios = temp_kmalloc(&user->ufolios_len, NULL, 0);
+ if (!user->ufolios)
+ return -ENOMEM;
+
+ /* Bail for now. Be more robust when we optimize for folios. */
+ if (user->ufolios_len / sizeof(*user->ufolios) <
+ user->upages_len / sizeof(*user->upages))
+ return -ENOMEM;
+ }
+
if (user->locked == -1) {
/*
* The majority of usages will run the map task within the mm
* providing the pages, so we can optimize into
* get_user_pages_fast()
*/
- if (remote_mm) {
+ if (!user->file && remote_mm) {
if (!mmget_not_zero(pages->source_mm))
return -EFAULT;
}
user->locked = 0;
}
- npages = min_t(unsigned long, last_index - start_index + 1,
- user->upages_len / sizeof(*user->upages));
-
+ unum = user->file ? user->ufolios_len / sizeof(*user->ufolios) :
+ user->upages_len / sizeof(*user->upages);
+ npages = min_t(unsigned long, npages, unum);
if (iommufd_should_fail())
return -EFAULT;
- uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
- if (!remote_mm)
+ if (user->file) {
+ start = pages->start + (start_index * PAGE_SIZE);
+ rc = pin_memfd_pages(user, start, npages);
+ } else if (!remote_mm) {
+ uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
user->upages);
- else {
+ } else {
+ uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
if (!user->locked) {
mmap_read_lock(pages->source_mm);
user->locked = 1;
@@ -838,7 +904,8 @@ static int update_mm_locked_vm(struct iopt_pages *pages, unsigned long npages,
mmap_read_unlock(pages->source_mm);
user->locked = 0;
/* If we had the lock then we also have a get */
- } else if ((!user || !user->upages) &&
+
+ } else if ((!user || (!user->upages && !user->ufolios)) &&
pages->source_mm != current->mm) {
if (!mmget_not_zero(pages->source_mm))
return -EINVAL;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (3 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 4/9] iommufd: pfn reader for file mappings Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-16 12:41 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 6/9] iommufd: file mappings for mdev Steve Sistare
` (4 subsequent siblings)
9 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Define the IOMMU_IOAS_MAP_FILE ioctl interface, which allows a user to
register memory by passing a memfd plus offset and length. Implement it
using the memfd_pin_folios KAPI.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/io_pagetable.c | 36 ++++++++++++++++++++++++++-
drivers/iommu/iommufd/io_pagetable.h | 2 ++
drivers/iommu/iommufd/ioas.c | 43 +++++++++++++++++++++++++++++++++
drivers/iommu/iommufd/iommufd_private.h | 5 ++++
drivers/iommu/iommufd/main.c | 2 ++
drivers/iommu/iommufd/pages.c | 19 +++++++++++++++
include/uapi/linux/iommufd.h | 25 +++++++++++++++++++
7 files changed, 131 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 9dd7c03..f8fe7bf 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -259,7 +259,14 @@ static int iopt_alloc_area_pages(struct io_pagetable *iopt,
/* Use the first entry to guess the ideal IOVA alignment */
elm = list_first_entry(pages_list, struct iopt_pages_list,
next);
- start = elm->start_byte + (uintptr_t)elm->pages->uptr;
+ switch (elm->pages->type) {
+ case IOPT_ADDRESS_USER:
+ start = elm->start_byte + (uintptr_t)elm->pages->uptr;
+ break;
+ case IOPT_ADDRESS_FILE:
+ start = elm->start_byte + elm->pages->start;
+ break;
+ }
rc = iopt_alloc_iova(iopt, dst_iova, start, length);
if (rc)
goto out_unlock;
@@ -437,6 +444,33 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
uptr - pages->uptr, iommu_prot, flags);
}
+/**
+ * iopt_map_file_pages() - Like iopt_map_user_pages, but map a file.
+ * @ictx: iommufd_ctx the iopt is part of
+ * @iopt: io_pagetable to act on
+ * @iova: If IOPT_ALLOC_IOVA is set this is unused on input and contains
+ * the chosen iova on output. Otherwise is the iova to map to on input
+ * @file: file to map
+ * @start: map file starting at this byte offset
+ * @length: Number of bytes to map
+ * @iommu_prot: Combination of IOMMU_READ/WRITE/etc bits for the mapping
+ * @flags: IOPT_ALLOC_IOVA or zero
+ */
+int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+ unsigned long *iova, struct file *file,
+ unsigned long start, unsigned long length,
+ int iommu_prot, unsigned int flags)
+{
+ struct iopt_pages *pages;
+
+ pages = iopt_alloc_file_pages(file, start, length,
+ iommu_prot & IOMMU_WRITE);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+ return iopt_map_common(ictx, iopt, pages, iova, length,
+ start - pages->start, iommu_prot, flags);
+}
+
struct iova_bitmap_fn_arg {
unsigned long flags;
struct io_pagetable *iopt;
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 3a28f46..909d396 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -220,6 +220,8 @@ struct iopt_pages {
struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
unsigned long length, bool writable);
+struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
+ unsigned long length, bool writable);
void iopt_release_pages(struct kref *kref);
static inline void iopt_put_pages(struct iopt_pages *pages)
{
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 157a89b..2b2e172 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -2,6 +2,7 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
*/
+#include <linux/file.h>
#include <linux/interval_tree.h>
#include <linux/iommufd.h>
#include <linux/iommu.h>
@@ -197,6 +198,48 @@ static int conv_iommu_prot(u32 map_flags)
return iommu_prot;
}
+int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_ioas_map_file *cmd = ucmd->cmd;
+ unsigned long iova = cmd->iova;
+ struct iommufd_ioas *ioas;
+ unsigned int flags = 0;
+ int rc;
+ struct file *file;
+ unsigned long end;
+
+ if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
+ return -EOVERFLOW;
+
+ if (cmd->length > 0 &&
+ check_add_overflow(cmd->start, cmd->length - 1, &end))
+ return -EOVERFLOW;
+
+ ioas = iommufd_get_ioas(ucmd->ictx, cmd->ioas_id);
+ if (IS_ERR(ioas))
+ return PTR_ERR(ioas);
+
+ if (!(cmd->flags & IOMMU_IOAS_MAP_FIXED_IOVA))
+ flags = IOPT_ALLOC_IOVA;
+
+ file = fget(cmd->fd);
+ if (!file)
+ return -EBADF;
+
+ rc = iopt_map_file_pages(ucmd->ictx, &ioas->iopt, &iova, file,
+ cmd->start, cmd->length,
+ conv_iommu_prot(cmd->flags), flags);
+ if (rc)
+ goto out_put;
+
+ cmd->iova = iova;
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+out_put:
+ iommufd_put_object(ucmd->ictx, &ioas->obj);
+ fput(file);
+ return rc;
+}
+
int iommufd_ioas_map(struct iommufd_ucmd *ucmd)
{
struct iommu_ioas_map *cmd = ucmd->cmd;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 92efe30..34b1f3a 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -68,6 +68,10 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
unsigned long *iova, void __user *uptr,
unsigned long length, int iommu_prot,
unsigned int flags);
+int iopt_map_file_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
+ unsigned long *iova, struct file *file,
+ unsigned long start, unsigned long length,
+ int iommu_prot, unsigned int flags);
int iopt_map_pages(struct io_pagetable *iopt, struct list_head *pages_list,
unsigned long length, unsigned long *dst_iova,
int iommu_prot, unsigned int flags);
@@ -275,6 +279,7 @@ static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ctx *ictx,
int iommufd_ioas_iova_ranges(struct iommufd_ucmd *ucmd);
int iommufd_ioas_allow_iovas(struct iommufd_ucmd *ucmd);
int iommufd_ioas_map(struct iommufd_ucmd *ucmd);
+int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd);
int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 83bbd7c..f57c3318 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -378,6 +378,8 @@ struct iommufd_ioctl_op {
struct iommu_ioas_iova_ranges, out_iova_alignment),
IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
iova),
+ IOCTL_OP(IOMMU_IOAS_MAP_FILE, iommufd_ioas_map_file,
+ struct iommu_ioas_map_file, iova),
IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
length),
IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index bb60af1..c4c83c4 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/iommu.h>
#include <linux/sched/mm.h>
+#include <linux/file.h>
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/iommufd.h>
@@ -1261,6 +1262,22 @@ struct iopt_pages *iopt_alloc_user_pages(void __user *uptr,
return pages;
}
+struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
+ unsigned long length, bool writable)
+
+{
+ struct iopt_pages *pages;
+ unsigned long start_down = ALIGN_DOWN(start, PAGE_SIZE);
+
+ pages = iopt_alloc_pages(length, start - start_down, writable);
+ if (IS_ERR(pages))
+ return pages;
+ pages->file = get_file(file);
+ pages->start = start_down;
+ pages->type = IOPT_ADDRESS_FILE;
+ return pages;
+}
+
void iopt_release_pages(struct kref *kref)
{
struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
@@ -1273,6 +1290,8 @@ void iopt_release_pages(struct kref *kref)
mutex_destroy(&pages->mutex);
put_task_struct(pages->source_task);
free_uid(pages->source_user);
+ if (pages->type == IOPT_ADDRESS_FILE)
+ fput(pages->file);
kfree(pages);
}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4dde745..c484981 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -51,6 +51,7 @@ enum {
IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP = 0x8c,
IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
+ IOMMUFD_CMD_IOAS_MAP_FILE = 0x8f,
};
/**
@@ -214,6 +215,30 @@ struct iommu_ioas_map {
#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
/**
+ * struct iommu_ioas_map_file - ioctl(IOMMU_IOAS_MAP_FILE)
+ * @size: sizeof(struct iommu_ioas_map_file)
+ * @fd: the memfd to map
+ * @start: byte offset from start of file to map from
+ * @flags: same as for iommu_ioas_map
+ * @ios_id: ditto
+ * @length: ditto
+ * @iova: ditto
+ *
+ * Set an IOVA mapping from a memfd file. All other arguments and semantics
+ * match those of IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_map_file {
+ __u32 size;
+ __u32 flags;
+ __u32 ioas_id;
+ __s32 fd;
+ __aligned_u64 start;
+ __aligned_u64 length;
+ __aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP_FILE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP_FILE)
+
+/**
* struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
* @size: sizeof(struct iommu_ioas_copy)
* @flags: Combination of enum iommufd_ioas_map_flags
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 6/9] iommufd: file mappings for mdev
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (4 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-04 18:48 ` [PATCH V3 7/9] iommufd: pfn reader local variables Steve Sistare
` (3 subsequent siblings)
9 siblings, 0 replies; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Support file mappings for mediated devices, aka mdevs. Access is initiated
by the vfio_pin_pages and vfio_dma_rw kernel interfaces.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/iommu/iommufd/pages.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index c4c83c4..8030226 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1731,11 +1731,11 @@ static int iopt_pages_fill_from_domain(struct iopt_pages *pages,
return 0;
}
-static int iopt_pages_fill_from_mm(struct iopt_pages *pages,
- struct pfn_reader_user *user,
- unsigned long start_index,
- unsigned long last_index,
- struct page **out_pages)
+static int iopt_pages_fill(struct iopt_pages *pages,
+ struct pfn_reader_user *user,
+ unsigned long start_index,
+ unsigned long last_index,
+ struct page **out_pages)
{
unsigned long cur_index = start_index;
int rc;
@@ -1809,8 +1809,8 @@ int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
/* hole */
cur_pages = out_pages + (span.start_hole - start_index);
- rc = iopt_pages_fill_from_mm(pages, &user, span.start_hole,
- span.last_hole, cur_pages);
+ rc = iopt_pages_fill(pages, &user, span.start_hole,
+ span.last_hole, cur_pages);
if (rc)
goto out_clean_xa;
rc = pages_to_xarray(&pages->pinned_pfns, span.start_hole,
@@ -1890,6 +1890,10 @@ static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
struct page *page = NULL;
int rc;
+ if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+ WARN_ON(pages->type != IOPT_ADDRESS_USER))
+ return -EINVAL;
+
if (!mmget_not_zero(pages->source_mm))
return iopt_pages_rw_slow(pages, index, index, offset, data,
length, flags);
@@ -1945,6 +1949,15 @@ int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
return -EPERM;
+ if (pages->type == IOPT_ADDRESS_FILE)
+ return iopt_pages_rw_slow(pages, start_index, last_index,
+ start_byte % PAGE_SIZE, data, length,
+ flags);
+
+ if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+ WARN_ON(pages->type != IOPT_ADDRESS_USER))
+ return -EINVAL;
+
if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) {
if (start_index == last_index)
return iopt_pages_rw_page(pages, start_index,
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 7/9] iommufd: pfn reader local variables
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (5 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 6/9] iommufd: file mappings for mdev Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-16 12:42 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 8/9] iommufd: optimize file mapping Steve Sistare
` (2 subsequent siblings)
9 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Add local variables for common sub-expressions needed by a subsequent
patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/pages.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 8030226..2db2b06 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1046,6 +1046,8 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
{
struct interval_tree_double_span_iter *span = &pfns->span;
unsigned long start_index = pfns->batch_end_index;
+ struct pfn_reader_user *user = &pfns->user;
+ unsigned long npages;
struct iopt_area *area;
int rc;
@@ -1083,10 +1085,9 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
return rc;
}
- batch_from_pages(&pfns->batch,
- pfns->user.upages +
- (start_index - pfns->user.upages_start),
- pfns->user.upages_end - start_index);
+ npages = user->upages_end - start_index;
+ start_index -= user->upages_start;
+ batch_from_pages(&pfns->batch, user->upages + start_index, npages);
return 0;
}
@@ -1160,16 +1161,17 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
static void pfn_reader_release_pins(struct pfn_reader *pfns)
{
struct iopt_pages *pages = pfns->pages;
+ struct pfn_reader_user *user = &pfns->user;
+ unsigned long npages, start_index;
- if (pfns->user.upages_end > pfns->batch_end_index) {
- size_t npages = pfns->user.upages_end - pfns->batch_end_index;
-
+ if (user->upages_end > pfns->batch_end_index) {
/* Any pages not transferred to the batch are just unpinned */
- unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
- pfns->user.upages_start),
- npages);
+
+ npages = user->upages_end - pfns->batch_end_index;
+ start_index = pfns->batch_end_index - user->upages_start;
+ unpin_user_pages(user->upages + start_index, npages);
iopt_pages_sub_npinned(pages, npages);
- pfns->user.upages_end = pfns->batch_end_index;
+ user->upages_end = pfns->batch_end_index;
}
if (pfns->batch_start_index != pfns->batch_end_index) {
pfn_reader_unpin(pfns);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (6 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 7/9] iommufd: pfn reader local variables Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-16 13:00 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 9/9] iommufd: map file selftest Steve Sistare
2024-10-16 12:23 ` [PATCH V3 0/9] iommu_ioas_map_file Jason Gunthorpe
9 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
When filling a batch, avoid the intermediate step of expanding folios
into upages[] in pin_memfd_pages, via the new helper functions
batch_from_folios_huge and batch_add_pfn_num.
However, we must still expand to upages for the iopt_pages_fill path.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
drivers/iommu/iommufd/pages.c | 146 +++++++++++++++++++++++++++++++++++-------
1 file changed, 123 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 2db2b06..b57c2b9 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -347,25 +347,41 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
kfree(batch->pfns);
}
-/* true if the pfn was added, false otherwise */
-static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+/* returns the number of pfn's added */
+static int batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn,
+ unsigned long nr)
{
- const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));
+ const unsigned long MAX_NPFNS = type_max(typeof(*batch->npfns));
+ unsigned long pfn_end, npfn_end;
+ unsigned long n = 0;
- if (batch->end &&
- pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
- batch->npfns[batch->end - 1] != MAX_NPFNS) {
- batch->npfns[batch->end - 1]++;
- batch->total_pfns++;
- return true;
+ if (batch->end) {
+ npfn_end = batch->npfns[batch->end - 1];
+ pfn_end = batch->pfns[batch->end - 1];
+
+ if (pfn == pfn_end + npfn_end && npfn_end < MAX_NPFNS) {
+ n = min_t(unsigned long, MAX_NPFNS - npfn_end, nr);
+ batch->npfns[batch->end - 1] += n;
+ batch->total_pfns += n;
+ if (nr == n)
+ return n;
+ nr -= n;
+ }
}
if (batch->end == batch->array_size)
- return false;
- batch->total_pfns++;
+ return n;
+ nr = min_t(unsigned long, MAX_NPFNS, nr);
+ batch->total_pfns += nr;
batch->pfns[batch->end] = pfn;
- batch->npfns[batch->end] = 1;
+ batch->npfns[batch->end] = nr;
batch->end++;
- return true;
+ return n + nr;
+}
+
+/* true if the pfn was added, false otherwise */
+static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
+{
+ return batch_add_pfn_num(batch, pfn, 1) == 1;
}
/*
@@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
break;
}
+static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
+ size_t npages)
+{
+ struct folio **end = folios + npages;
+
+ for (; folios != end; folios++)
+ if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
+ break;
+}
+
+static void batch_from_folios_huge(struct pfn_batch *batch,
+ struct folio ***folios_p,
+ unsigned long *offset_p,
+ unsigned long npages)
+{
+ unsigned long nr, n, pfn, i = 0;
+ struct folio *folio, **folios = *folios_p;
+ unsigned long offset = *offset_p;
+
+ while (npages) {
+ folio = folios[i++];
+ nr = folio_nr_pages(folio) - offset;
+ nr = min_t(unsigned long, nr, npages);
+ while (nr) {
+ pfn = page_to_pfn(folio_page(folio, offset));
+ n = batch_add_pfn_num(batch, pfn, nr);
+ if (n == 0) {
+ *folios_p = folios + i - 1;
+ *offset_p = offset;
+ return;
+ }
+ npages -= n;
+ nr -= n;
+ offset += n;
+ }
+ offset = 0;
+ }
+}
+
+static void folios_unpin_partial(struct folio **folios, unsigned long offset,
+ unsigned long npages)
+{
+ unsigned long nr, i = 0;
+ struct folio *folio;
+ struct page *page;
+
+ while (npages) {
+ folio = folios[i++];
+ nr = folio_nr_pages(folio);
+ if (offset == 0 && nr < npages) {
+ unpin_folio(folio);
+ } else {
+ nr = min_t(unsigned long, npages, nr - offset);
+ page = folio_page(folio, offset);
+ unpin_user_page_range_dirty_lock(page, nr, false);
+ offset = 0;
+ }
+ npages -= nr;
+ }
+}
+
static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
unsigned int first_page_off, size_t npages)
{
@@ -709,6 +786,9 @@ struct pfn_reader_user {
struct file *file;
struct folio **ufolios;
unsigned long ufolios_len;
+ unsigned long ufolios_offset;
+ struct folio **ufolios_next;
+ bool ufolios_huge;
};
static void pfn_reader_user_init(struct pfn_reader_user *user,
@@ -726,6 +806,9 @@ static void pfn_reader_user_init(struct pfn_reader_user *user,
user->file = (pages->type == IOPT_ADDRESS_FILE) ? pages->file : NULL;
user->ufolios = NULL;
user->ufolios_len = 0;
+ user->ufolios_huge = false;
+ user->ufolios_next = NULL;
+ user->ufolios_offset = 0;
}
static void pfn_reader_user_destroy(struct pfn_reader_user *user,
@@ -763,6 +846,8 @@ static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
return nfolios;
offset >>= PAGE_SHIFT;
+ user->ufolios_next = user->ufolios;
+ user->ufolios_offset = offset;
npages_out = 0;
for (i = 0; i < nfolios; i++) {
@@ -774,8 +859,11 @@ static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
if (rc)
return rc;
}
- for (j = offset; j < offset + npin; j++)
- *upages++ = folio_page(folio, j);
+ if (nr > 1)
+ user->ufolios_huge = true;
+ if (upages)
+ for (j = offset; j < offset + npin; j++)
+ *upages++ = folio_page(folio, j);
npages -= npin;
npages_out += npin;
offset = 0;
@@ -799,7 +887,7 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
WARN_ON(last_index < start_index))
return -EINVAL;
- if (!user->upages) {
+ if (!user->file && !user->upages) {
/* All undone in pfn_reader_destroy() */
user->upages_len = npages * sizeof(*user->upages);
user->upages = temp_kmalloc(&user->upages_len, NULL, 0);
@@ -812,11 +900,6 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
user->ufolios = temp_kmalloc(&user->ufolios_len, NULL, 0);
if (!user->ufolios)
return -ENOMEM;
-
- /* Bail for now. Be more robust when we optimize for folios. */
- if (user->ufolios_len / sizeof(*user->ufolios) <
- user->upages_len / sizeof(*user->upages))
- return -ENOMEM;
}
if (user->locked == -1) {
@@ -1087,7 +1170,16 @@ static int pfn_reader_fill_span(struct pfn_reader *pfns)
npages = user->upages_end - start_index;
start_index -= user->upages_start;
- batch_from_pages(&pfns->batch, user->upages + start_index, npages);
+
+ if (!user->file)
+ batch_from_pages(&pfns->batch, user->upages + start_index,
+ npages);
+ else if (!user->ufolios_huge)
+ batch_from_folios(&pfns->batch, user->ufolios + start_index,
+ npages);
+ else
+ batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
+ &user->ufolios_offset, npages);
return 0;
}
@@ -1169,7 +1261,15 @@ static void pfn_reader_release_pins(struct pfn_reader *pfns)
npages = user->upages_end - pfns->batch_end_index;
start_index = pfns->batch_end_index - user->upages_start;
- unpin_user_pages(user->upages + start_index, npages);
+
+ if (!user->file)
+ unpin_user_pages(user->upages + start_index, npages);
+ else if (!user->ufolios_huge)
+ unpin_folios(user->ufolios + start_index, npages);
+ else
+ folios_unpin_partial(user->ufolios_next,
+ user->ufolios_offset, npages);
+
iopt_pages_sub_npinned(pages, npages);
user->upages_end = pfns->batch_end_index;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH V3 9/9] iommufd: map file selftest
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (7 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 8/9] iommufd: optimize file mapping Steve Sistare
@ 2024-10-04 18:48 ` Steve Sistare
2024-10-16 12:23 ` [PATCH V3 0/9] iommu_ioas_map_file Jason Gunthorpe
9 siblings, 0 replies; 39+ messages in thread
From: Steve Sistare @ 2024-10-04 18:48 UTC (permalink / raw)
To: iommu; +Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, Steve Sistare
Add test cases to exercise IOMMU_IOAS_MAP_FILE.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
tools/testing/selftests/iommu/iommufd.c | 151 +++++++++++++++++++++++---
tools/testing/selftests/iommu/iommufd_utils.h | 41 +++++++
2 files changed, 177 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 4927b9a..3525670 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -3,6 +3,8 @@
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
+#include <sys/syscall.h>
+#include <asm/unistd.h>
#define __EXPORTED_HEADERS__
#include <linux/vfio.h>
@@ -10,6 +12,8 @@
#include "iommufd_utils.h"
static unsigned long HUGEPAGE_SIZE;
+static void *mfd_buffer;
+static int mfd;
#define MOCK_PAGE_SIZE (PAGE_SIZE / 2)
#define MOCK_HUGE_PAGE_SIZE (512 * MOCK_PAGE_SIZE)
@@ -33,6 +37,19 @@ static unsigned long get_huge_page_size(void)
return strtoul(buf, NULL, 10);
}
+static void *memfd_mmap(size_t length, int prot, int flags, int *mfd_p)
+{
+ int mfd_flags = (flags & MAP_HUGETLB) ? MFD_HUGETLB : 0;
+ int mfd = memfd_create("buffer", mfd_flags);
+
+ if (mfd <= 0)
+ return MAP_FAILED;
+ if (ftruncate(mfd, length))
+ return MAP_FAILED;
+ *mfd_p = mfd;
+ return mmap(0, length, prot, flags, mfd, 0);
+}
+
static __attribute__((constructor)) void setup_sizes(void)
{
void *vrc;
@@ -49,6 +66,9 @@ static __attribute__((constructor)) void setup_sizes(void)
vrc = mmap(buffer, BUFFER_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
assert(vrc == buffer);
+
+ mfd_buffer = memfd_mmap(BUFFER_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
+ &mfd);
}
FIXTURE(iommufd)
@@ -1372,6 +1392,7 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
{
unsigned int mock_domains;
bool hugepages;
+ bool file;
};
FIXTURE_SETUP(iommufd_mock_domain)
@@ -1410,26 +1431,45 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
{
.mock_domains = 1,
.hugepages = false,
+ .file = false,
};
FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains)
{
.mock_domains = 2,
.hugepages = false,
+ .file = false,
};
FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_hugepage)
{
.mock_domains = 1,
.hugepages = true,
+ .file = false,
};
FIXTURE_VARIANT_ADD(iommufd_mock_domain, two_domains_hugepage)
{
.mock_domains = 2,
.hugepages = true,
+ .file = false,
};
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_file)
+{
+ .mock_domains = 1,
+ .hugepages = false,
+ .file = true,
+};
+
+FIXTURE_VARIANT_ADD(iommufd_mock_domain, one_domain_file_hugepage)
+{
+ .mock_domains = 1,
+ .hugepages = true,
+ .file = true,
+};
+
+
/* Have the kernel check that the user pages made it to the iommu_domain */
#define check_mock_iova(_ptr, _iova, _length) \
({ \
@@ -1455,7 +1495,10 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
} \
})
-TEST_F(iommufd_mock_domain, basic)
+static void
+test_basic_mmap(struct __test_metadata *_metadata,
+ struct _test_data_iommufd_mock_domain *self,
+ const struct _fixture_variant_iommufd_mock_domain *variant)
{
size_t buf_size = self->mmap_buf_size;
uint8_t *buf;
@@ -1478,12 +1521,52 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
test_err_ioctl_ioas_map(EFAULT, buf, buf_size, &iova);
}
+static void
+test_basic_file(struct __test_metadata *_metadata,
+ struct _test_data_iommufd_mock_domain *self,
+ const struct _fixture_variant_iommufd_mock_domain *variant)
+{
+ size_t buf_size = self->mmap_buf_size;
+ uint8_t *buf;
+ __u64 iova;
+ int mfd_tmp;
+ int prot = PROT_READ | PROT_WRITE;
+
+ /* Simple one page map */
+ test_ioctl_ioas_map_file(mfd, 0, PAGE_SIZE, &iova);
+ check_mock_iova(mfd_buffer, iova, PAGE_SIZE);
+
+ buf = memfd_mmap(buf_size, prot, MAP_SHARED, &mfd_tmp);
+ ASSERT_NE(MAP_FAILED, buf);
+
+ /* EFAULT half way through mapping */
+ ASSERT_EQ(0, munmap(buf + buf_size / 2, buf_size / 2));
+ test_err_ioctl_ioas_map_file(EFAULT, 0, buf_size, &iova);
+
+ /* EFAULT on first page */
+ ASSERT_EQ(0, munmap(buf, buf_size / 2));
+ test_err_ioctl_ioas_map_file(EFAULT, 0, buf_size, &iova);
+
+ close(mfd_tmp);
+}
+
+TEST_F(iommufd_mock_domain, basic)
+{
+ if (variant->file)
+ test_basic_file(_metadata, self, variant);
+ else
+ test_basic_mmap(_metadata, self, variant);
+}
+
TEST_F(iommufd_mock_domain, ro_unshare)
{
uint8_t *buf;
__u64 iova;
int fd;
+ if (variant->file)
+ SKIP(return, "redundant test");
+
fd = open("/proc/self/exe", O_RDONLY);
ASSERT_NE(-1, fd);
@@ -1513,9 +1596,13 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
unsigned int start;
unsigned int end;
uint8_t *buf;
+ int prot = PROT_READ | PROT_WRITE;
+ int mfd;
- buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
- 0);
+ if (variant->file)
+ buf = memfd_mmap(buf_size, prot, MAP_SHARED, &mfd);
+ else
+ buf = mmap(0, buf_size, prot, self->mmap_flags, -1, 0);
ASSERT_NE(MAP_FAILED, buf);
check_refs(buf, buf_size, 0);
@@ -1532,7 +1619,12 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
size_t length = end - start;
__u64 iova;
- test_ioctl_ioas_map(buf + start, length, &iova);
+ if (variant->file) {
+ test_ioctl_ioas_map_file(mfd, start, length,
+ &iova);
+ } else {
+ test_ioctl_ioas_map(buf + start, length, &iova);
+ }
check_mock_iova(buf + start, iova, length);
check_refs(buf + start / PAGE_SIZE * PAGE_SIZE,
end / PAGE_SIZE * PAGE_SIZE -
@@ -1544,6 +1636,8 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
}
check_refs(buf, buf_size, 0);
ASSERT_EQ(0, munmap(buf, buf_size));
+ if (variant->file)
+ close(mfd);
}
TEST_F(iommufd_mock_domain, all_aligns_copy)
@@ -1554,9 +1648,13 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
unsigned int start;
unsigned int end;
uint8_t *buf;
+ int prot = PROT_READ | PROT_WRITE;
+ int mfd;
- buf = mmap(0, buf_size, PROT_READ | PROT_WRITE, self->mmap_flags, -1,
- 0);
+ if (variant->file)
+ buf = memfd_mmap(buf_size, prot, MAP_SHARED, &mfd);
+ else
+ buf = mmap(0, buf_size, prot, self->mmap_flags, -1, 0);
ASSERT_NE(MAP_FAILED, buf);
check_refs(buf, buf_size, 0);
@@ -1575,7 +1673,12 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
uint32_t mock_stdev_id;
__u64 iova;
- test_ioctl_ioas_map(buf + start, length, &iova);
+ if (variant->file) {
+ test_ioctl_ioas_map_file(mfd, start, length,
+ &iova);
+ } else {
+ test_ioctl_ioas_map(buf + start, length, &iova);
+ }
/* Add and destroy a domain while the area exists */
old_id = self->hwpt_ids[1];
@@ -1596,15 +1699,18 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
}
check_refs(buf, buf_size, 0);
ASSERT_EQ(0, munmap(buf, buf_size));
+ if (variant->file)
+ close(mfd);
}
TEST_F(iommufd_mock_domain, user_copy)
{
+ void *buf = variant->file ? mfd_buffer : buffer;
struct iommu_test_cmd access_cmd = {
.size = sizeof(access_cmd),
.op = IOMMU_TEST_OP_ACCESS_PAGES,
.access_pages = { .length = BUFFER_SIZE,
- .uptr = (uintptr_t)buffer },
+ .uptr = (uintptr_t)buf },
};
struct iommu_ioas_copy copy_cmd = {
.size = sizeof(copy_cmd),
@@ -1623,9 +1729,13 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
/* Pin the pages in an IOAS with no domains then copy to an IOAS with domains */
test_ioctl_ioas_alloc(&ioas_id);
- test_ioctl_ioas_map_id(ioas_id, buffer, BUFFER_SIZE,
- ©_cmd.src_iova);
-
+ if (variant->file) {
+ test_ioctl_ioas_map_id_file(ioas_id, mfd, 0, BUFFER_SIZE,
+ ©_cmd.src_iova);
+ } else {
+ test_ioctl_ioas_map_id(ioas_id, buf, BUFFER_SIZE,
+ ©_cmd.src_iova);
+ }
test_cmd_create_access(ioas_id, &access_cmd.id,
MOCK_FLAGS_ACCESS_CREATE_NEEDS_PIN_PAGES);
@@ -1635,12 +1745,17 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
&access_cmd));
copy_cmd.src_ioas_id = ioas_id;
ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, ©_cmd));
- check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE);
+ check_mock_iova(buf, MOCK_APERTURE_START, BUFFER_SIZE);
/* Now replace the ioas with a new one */
test_ioctl_ioas_alloc(&new_ioas_id);
- test_ioctl_ioas_map_id(new_ioas_id, buffer, BUFFER_SIZE,
- ©_cmd.src_iova);
+ if (variant->file) {
+ test_ioctl_ioas_map_id_file(new_ioas_id, mfd, 0, BUFFER_SIZE,
+ ©_cmd.src_iova);
+ } else {
+ test_ioctl_ioas_map_id(new_ioas_id, buf, BUFFER_SIZE,
+ ©_cmd.src_iova);
+ }
test_cmd_access_replace_ioas(access_cmd.id, new_ioas_id);
/* Destroy the old ioas and cleanup copied mapping */
@@ -1654,7 +1769,7 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
&access_cmd));
copy_cmd.src_ioas_id = new_ioas_id;
ASSERT_EQ(0, ioctl(self->fd, IOMMU_IOAS_COPY, ©_cmd));
- check_mock_iova(buffer, MOCK_APERTURE_START, BUFFER_SIZE);
+ check_mock_iova(buf, MOCK_APERTURE_START, BUFFER_SIZE);
test_cmd_destroy_access_pages(
access_cmd.id, access_cmd.access_pages.out_access_pages_id);
@@ -1667,6 +1782,9 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
{
uint32_t ioas_id;
+ if (variant->file)
+ SKIP(return, "redundant test");
+
test_ioctl_ioas_alloc(&ioas_id);
test_cmd_mock_domain_replace(self->stdev_ids[0], ioas_id);
@@ -1697,6 +1815,9 @@ static void check_access_rw(struct __test_metadata *_metadata, int fd,
{
int i;
+ if (variant->file)
+ SKIP(return, "redundant test");
+
for (i = 0; i != variant->mock_domains; i++) {
uint32_t hwpt_id[2];
uint32_t stddev_id;
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 40f6f14..7bc664e 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -589,6 +589,47 @@ static int _test_ioctl_ioas_unmap(int fd, unsigned int ioas_id, uint64_t iova,
EXPECT_ERRNO(_errno, _test_ioctl_ioas_unmap(self->fd, self->ioas_id, \
iova, length, NULL))
+static int _test_ioctl_ioas_map_file(int fd, unsigned int ioas_id, int mfd,
+ size_t start, size_t length, __u64 *iova,
+ unsigned int flags)
+{
+ struct iommu_ioas_map_file cmd = {
+ .size = sizeof(cmd),
+ .flags = flags,
+ .ioas_id = ioas_id,
+ .fd = mfd,
+ .start = start,
+ .length = length,
+ };
+ int ret;
+
+ if (flags & IOMMU_IOAS_MAP_FIXED_IOVA)
+ cmd.iova = *iova;
+
+ ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &cmd);
+ *iova = cmd.iova;
+ return ret;
+}
+
+#define test_ioctl_ioas_map_file(mfd, start, length, iova_p) \
+ ASSERT_EQ(0, _test_ioctl_ioas_map_file(self->fd, self->ioas_id, mfd, \
+ start, length, iova_p, \
+ IOMMU_IOAS_MAP_WRITEABLE | \
+ IOMMU_IOAS_MAP_READABLE))
+
+#define test_err_ioctl_ioas_map_file(_errno, start, length, iova_p) \
+ EXPECT_ERRNO(_errno, \
+ _test_ioctl_ioas_map_file(self->fd, self->ioas_id, mfd, \
+ start, length, iova_p, \
+ IOMMU_IOAS_MAP_WRITEABLE | \
+ IOMMU_IOAS_MAP_READABLE))
+
+#define test_ioctl_ioas_map_id_file(ioas_id, mfd, start, length, iova_p) \
+ ASSERT_EQ(0, _test_ioctl_ioas_map_file(self->fd, ioas_id, mfd, \
+ start, length, iova_p, \
+ IOMMU_IOAS_MAP_WRITEABLE | \
+ IOMMU_IOAS_MAP_READABLE))
+
static int _test_ioctl_set_temp_memory_limit(int fd, unsigned int limit)
{
struct iommu_test_cmd memlimit_cmd = {
--
1.8.3.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH V3 1/9] mm/gup: folio_add_pins
2024-10-04 18:48 ` [PATCH V3 1/9] mm/gup: folio_add_pins Steve Sistare
@ 2024-10-04 18:52 ` Steven Sistare
2024-10-04 20:20 ` David Hildenbrand
2024-10-16 12:17 ` Jason Gunthorpe
1 sibling, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-04 18:52 UTC (permalink / raw)
To: linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
David Hildenbrand, Matthew Wilcox
cc mm folks.
On 10/4/2024 2:48 PM, Steve Sistare wrote:
> Export a function that adds pins to an already-pinned huge-page folio.
> This allows any range of small pages within the folio to be unpinned later.
> For example, pages pinned via memfd_pin_folios and modified by
> folio_add_pins could be unpinned via unpin_user_page(s).
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/linux/mm.h | 1 +
> mm/gup.c | 24 ++++++++++++++++++++++++
> 2 files changed, 25 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 13bff7c..70d5293 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2521,6 +2521,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
> struct folio **folios, unsigned int max_folios,
> pgoff_t *offset);
> +int folio_add_pins(struct folio *folio, unsigned int pins);
>
> int get_user_pages_fast(unsigned long start, int nr_pages,
> unsigned int gup_flags, struct page **pages);
> diff --git a/mm/gup.c b/mm/gup.c
> index fcd602b..11c5f27 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3733,3 +3733,27 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
> return ret;
> }
> EXPORT_SYMBOL_GPL(memfd_pin_folios);
> +
> +/**
> + * folio_add_pins() - add pins to an already-pinned folio
> + * @folio: the folio to add more pins to
> + * @pins: number of pins to add
> + *
> + * Try to add more pins to an already-pinned folio. The semantics
> + * of the pin (e.g., FOLL_WRITE) follow any existing pin and cannot
> + * be changed.
> + *
> + * This function is helpful when having obtained a pin on a large folio
> + * using memfd_pin_folios(), but wanting to logically unpin parts
> + * (e.g., individual pages) of the folio later, for example, using
> + * unpin_user_page_range_dirty_lock().
> + *
> + * This is not the right interface to initially pin a folio.
> + */
> +int folio_add_pins(struct folio *folio, unsigned int pins)
> +{
> + VM_WARN_ON_ONCE(!folio_maybe_dma_pinned(folio));
> +
> + return try_grab_folio(folio, pins, FOLL_PIN);
> +}
> +EXPORT_SYMBOL_GPL(folio_add_pins);
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 1/9] mm/gup: folio_add_pins
2024-10-04 18:52 ` Steven Sistare
@ 2024-10-04 20:20 ` David Hildenbrand
0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2024-10-04 20:20 UTC (permalink / raw)
To: Steven Sistare, linux-mm
Cc: Jason Gunthorpe, Kevin Tian, Nicolin Chen, iommu, Andrew Morton,
Matthew Wilcox, Nico Pache
On 04.10.24 20:52, Steven Sistare wrote:
> cc mm folks.
>
> On 10/4/2024 2:48 PM, Steve Sistare wrote:
>> Export a function that adds pins to an already-pinned huge-page folio.
>> This allows any range of small pages within the folio to be unpinned later.
>> For example, pages pinned via memfd_pin_folios and modified by
>> folio_add_pins could be unpinned via unpin_user_page(s).
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/linux/mm.h | 1 +
>> mm/gup.c | 24 ++++++++++++++++++++++++
>> 2 files changed, 25 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 13bff7c..70d5293 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2521,6 +2521,7 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>> struct folio **folios, unsigned int max_folios,
>> pgoff_t *offset);
>> +int folio_add_pins(struct folio *folio, unsigned int pins);
>>
>> int get_user_pages_fast(unsigned long start, int nr_pages,
>> unsigned int gup_flags, struct page **pages);
>> diff --git a/mm/gup.c b/mm/gup.c
>> index fcd602b..11c5f27 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3733,3 +3733,27 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>> return ret;
>> }
>> EXPORT_SYMBOL_GPL(memfd_pin_folios);
>> +
>> +/**
>> + * folio_add_pins() - add pins to an already-pinned folio
>> + * @folio: the folio to add more pins to
>> + * @pins: number of pins to add
>> + *
>> + * Try to add more pins to an already-pinned folio. The semantics
>> + * of the pin (e.g., FOLL_WRITE) follow any existing pin and cannot
>> + * be changed.
>> + *
>> + * This function is helpful when having obtained a pin on a large folio
>> + * using memfd_pin_folios(), but wanting to logically unpin parts
>> + * (e.g., individual pages) of the folio later, for example, using
>> + * unpin_user_page_range_dirty_lock().
>> + *
>> + * This is not the right interface to initially pin a folio.
>> + */
>> +int folio_add_pins(struct folio *folio, unsigned int pins)
>> +{
>> + VM_WARN_ON_ONCE(!folio_maybe_dma_pinned(folio));
>> +
>> + return try_grab_folio(folio, pins, FOLL_PIN);
>> +}
>> +EXPORT_SYMBOL_GPL(folio_add_pins);
>
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 1/9] mm/gup: folio_add_pins
2024-10-04 18:48 ` [PATCH V3 1/9] mm/gup: folio_add_pins Steve Sistare
2024-10-04 18:52 ` Steven Sistare
@ 2024-10-16 12:17 ` Jason Gunthorpe
1 sibling, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 12:17 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 04, 2024 at 11:48:12AM -0700, Steve Sistare wrote:
> Export a function that adds pins to an already-pinned huge-page folio.
> This allows any range of small pages within the folio to be unpinned later.
> For example, pages pinned via memfd_pin_folios and modified by
> folio_add_pins could be unpinned via unpin_user_page(s).
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/linux/mm.h | 1 +
> mm/gup.c | 24 ++++++++++++++++++++++++
> 2 files changed, 25 insertions(+)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 0/9] iommu_ioas_map_file
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
` (8 preceding siblings ...)
2024-10-04 18:48 ` [PATCH V3 9/9] iommufd: map file selftest Steve Sistare
@ 2024-10-16 12:23 ` Jason Gunthorpe
9 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 12:23 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 04, 2024 at 11:48:11AM -0700, Steve Sistare wrote:
> Provide the IOMMU_IOAS_MAP_FILE ioctl, which allows a user to register
> memory by passing a memfd plus offset and length. Implement it using
> the memfd_map_folios KAPI, and the proposed folio_add_pins KAPI.
> See the individual patches for details.
>
> Changes in V2:
> * changed names and commit message in "rename uptr in iopt_alloc_iova"
> * normalized comments describing the iopt_map_user_pages interface
> * submitted folio_split_user_page_pin (fka folio_repin_unhugely) separately
> * replaced nupages[] optimization with folio-to-batch optimization
> * added selftests for map file
>
> Changes in V3:
> * fixed bug setting user->locked
> * fixed bug in pages->file refs
> * replaced lockdep_off with down_write_nest_lock
> * added ufolios_next to track folio consumption in reader
> * combined IOMMU_IOAS_MAP_FILE interface and implementation
> * added patch folio_add_pins (fka folio_split_user_page_pin )
> * reformatted patches using clang-format
> * misc cosmetic changes in response to review comments
>
> Steve Sistare (9):
> mm/gup: folio_add_pins
> iommufd: rename uptr in iopt_alloc_iova
> iommufd: generalize iopt_pages address
> iommufd: pfn reader for file mappings
> iommufd: IOMMU_IOAS_MAP_FILE
> iommufd: file mappings for mdev
> iommufd: pfn reader local variables
> iommufd: optimize file mapping
> iommufd: map file selftest
This doesn't apply, you need to base it on a clean rc release before
sending it..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 4/9] iommufd: pfn reader for file mappings
2024-10-04 18:48 ` [PATCH V3 4/9] iommufd: pfn reader for file mappings Steve Sistare
@ 2024-10-16 12:37 ` Jason Gunthorpe
2024-10-17 17:01 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 12:37 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 04, 2024 at 11:48:15AM -0700, Steve Sistare wrote:
> +static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
> + unsigned long npages)
> +{
> + unsigned long end, nr, i, j, npin, offset, npages_out;
> + long nfolios;
> + int rc;
> + struct folio *folio;
> + struct page **upages = user->upages;
> +
> + nfolios = user->ufolios_len / sizeof(*user->ufolios);
> + end = start + (npages << PAGE_SHIFT) - 1;
> +
> + nfolios = memfd_pin_folios(user->file, start, end, user->ufolios,
> + nfolios, &offset);
> + if (nfolios <= 0)
> + return nfolios;
> +
> + offset >>= PAGE_SHIFT;
> + npages_out = 0;
> +
> + for (i = 0; i < nfolios; i++) {
> + folio = user->ufolios[i];
> + nr = folio_nr_pages(folio);
> + npin = min(nr - offset, npages);
What happens if all the folios don't fit into the upages array once
expanded? It looks like they are leaked?
I suppose the later patch gets this all sane, but I wonder if we need
this inbetween step?
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE
2024-10-04 18:48 ` [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2024-10-16 12:41 ` Jason Gunthorpe
2024-10-17 17:00 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 12:41 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 04, 2024 at 11:48:16AM -0700, Steve Sistare wrote:
>
> +int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
> +{
> + struct iommu_ioas_map_file *cmd = ucmd->cmd;
> + unsigned long iova = cmd->iova;
> + struct iommufd_ioas *ioas;
> + unsigned int flags = 0;
> + int rc;
> + struct file *file;
> + unsigned long end;
> +
> + if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
> + return -EOVERFLOW;
> +
> + if (cmd->length > 0 &&
> + check_add_overflow(cmd->start, cmd->length - 1, &end))
> + return -EOVERFLOW;
Why is this here? iopt_alloc_pages() does this validation and I don't
see math on these values between here and there?
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 7/9] iommufd: pfn reader local variables
2024-10-04 18:48 ` [PATCH V3 7/9] iommufd: pfn reader local variables Steve Sistare
@ 2024-10-16 12:42 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 12:42 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 04, 2024 at 11:48:18AM -0700, Steve Sistare wrote:
> Add local variables for common sub-expressions needed by a subsequent
> patch. No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> drivers/iommu/iommufd/pages.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-04 18:48 ` [PATCH V3 8/9] iommufd: optimize file mapping Steve Sistare
@ 2024-10-16 13:00 ` Jason Gunthorpe
2024-10-16 13:09 ` Steven Sistare
2024-10-17 17:02 ` Steven Sistare
0 siblings, 2 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 13:00 UTC (permalink / raw)
To: Steve Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 04, 2024 at 11:48:19AM -0700, Steve Sistare wrote:
> @@ -347,25 +347,41 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
> kfree(batch->pfns);
> }
>
> -/* true if the pfn was added, false otherwise */
> -static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
> +/* returns the number of pfn's added */
> +static int batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn,
> + unsigned long nr)
This return type is truncating unsigned longs to int..
> @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
> break;
> }
>
> +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
> + size_t npages)
> +{
> + struct folio **end = folios + npages;
> +
> + for (; folios != end; folios++)
> + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
> + break;
This function is really weird, why would we have a function adding
only the first page of each item in a list of folios?
> +static void batch_from_folios_huge(struct pfn_batch *batch,
> + struct folio ***folios_p,
> + unsigned long *offset_p,
> + unsigned long npages)
> +{
> + unsigned long nr, n, pfn, i = 0;
> + struct folio *folio, **folios = *folios_p;
> + unsigned long offset = *offset_p;
Let's make more use of scope local values please (put folio, pfn, etc
in side the narrowest loop scope) and the style in iommufd avoids more
than one variable per declaration
I would like to see this patch when it is applied but I can't apply
it..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-16 13:00 ` Jason Gunthorpe
@ 2024-10-16 13:09 ` Steven Sistare
2024-10-16 13:21 ` Jason Gunthorpe
2024-10-17 17:02 ` Steven Sistare
1 sibling, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-16 13:09 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/16/2024 9:00 AM, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 11:48:19AM -0700, Steve Sistare wrote:
>
>> @@ -347,25 +347,41 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
>> kfree(batch->pfns);
>> }
>>
>> -/* true if the pfn was added, false otherwise */
>> -static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
>> +/* returns the number of pfn's added */
>> +static int batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn,
>> + unsigned long nr)
>
> This return type is truncating unsigned longs to int..
>
>> @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
>> break;
>> }
>>
>> +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
>> + size_t npages)
>> +{
>> + struct folio **end = folios + npages;
>> +
>> + for (; folios != end; folios++)
>> + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
>> + break;
>
> This function is really weird, why would we have a function adding
> only the first page of each item in a list of folios?
>
>> +static void batch_from_folios_huge(struct pfn_batch *batch,
>> + struct folio ***folios_p,
>> + unsigned long *offset_p,
>> + unsigned long npages)
>> +{
>> + unsigned long nr, n, pfn, i = 0;
>> + struct folio *folio, **folios = *folios_p;
>> + unsigned long offset = *offset_p;
>
> Let's make more use of scope local values please (put folio, pfn, etc
> in side the narrowest loop scope) and the style in iommufd avoids more
> than one variable per declaration
>
> I would like to see this patch when it is applied but I can't apply
> it..
The series is based on
baeb9a7d8b60 ("Merge tag 'sched-rt-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
I will respond to all your comments soon.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-16 13:09 ` Steven Sistare
@ 2024-10-16 13:21 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 13:21 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Wed, Oct 16, 2024 at 09:09:40AM -0400, Steven Sistare wrote:
> > I would like to see this patch when it is applied but I can't apply
> > it..
>
> The series is based on
> baeb9a7d8b60 ("Merge tag 'sched-rt-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> I will respond to all your comments soon.
Oh, don't post patches based on linux-next :\
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE
2024-10-16 12:41 ` Jason Gunthorpe
@ 2024-10-17 17:00 ` Steven Sistare
2024-10-17 19:20 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-17 17:00 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/16/2024 8:41 AM, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 11:48:16AM -0700, Steve Sistare wrote:
>>
>> +int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
>> +{
>> + struct iommu_ioas_map_file *cmd = ucmd->cmd;
>> + unsigned long iova = cmd->iova;
>> + struct iommufd_ioas *ioas;
>> + unsigned int flags = 0;
>> + int rc;
>> + struct file *file;
>> + unsigned long end;
>> +
>> + if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
>> + return -EOVERFLOW;
>> +
>> + if (cmd->length > 0 &&
>> + check_add_overflow(cmd->start, cmd->length - 1, &end))
>> + return -EOVERFLOW;
>
> Why is this here? iopt_alloc_pages() does this validation and I don't
> see math on these values between here and there?
Since the refactoring of iopt_alloc_pages into iopt_alloc_user_pages and
iopt_alloc_file_pages, the former checks for overflow of start + length,
but the latter does not.
To make it symmetric, I could push the check from here (iommufd_ioas_map_file)
down to iopt_alloc_file_pages. Or, hoist the check from iopt_alloc_user_pages to
iommufd_ioas_map. Your call.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 4/9] iommufd: pfn reader for file mappings
2024-10-16 12:37 ` Jason Gunthorpe
@ 2024-10-17 17:01 ` Steven Sistare
2024-10-17 19:20 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-17 17:01 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/16/2024 8:37 AM, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 11:48:15AM -0700, Steve Sistare wrote:
>
>> +static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
>> + unsigned long npages)
>> +{
>> + unsigned long end, nr, i, j, npin, offset, npages_out;
>> + long nfolios;
>> + int rc;
>> + struct folio *folio;
>> + struct page **upages = user->upages;
>> +
>> + nfolios = user->ufolios_len / sizeof(*user->ufolios);
>> + end = start + (npages << PAGE_SHIFT) - 1;
>> +
>> + nfolios = memfd_pin_folios(user->file, start, end, user->ufolios,
>> + nfolios, &offset);
>> + if (nfolios <= 0)
>> + return nfolios;
>> +
>> + offset >>= PAGE_SHIFT;
>> + npages_out = 0;
>> +
>> + for (i = 0; i < nfolios; i++) {
>> + folio = user->ufolios[i];
>> + nr = folio_nr_pages(folio);
>> + npin = min(nr - offset, npages);
>
> What happens if all the folios don't fit into the upages array once
> expanded? It looks like they are leaked?
>
> I suppose the later patch gets this all sane, but I wonder if we need
> this inbetween step?
Yup, I missed the leak if a folio is only partially converted to pages.
The optimization patch does get it right. I'll squash them and reduce its
size by making a separate patch for the new subroutines batch_add_pfn_num etc.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-16 13:00 ` Jason Gunthorpe
2024-10-16 13:09 ` Steven Sistare
@ 2024-10-17 17:02 ` Steven Sistare
2024-10-17 19:24 ` Jason Gunthorpe
1 sibling, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-17 17:02 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/16/2024 9:00 AM, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 11:48:19AM -0700, Steve Sistare wrote:
>
>> @@ -347,25 +347,41 @@ static void batch_destroy(struct pfn_batch *batch, void *backup)
>> kfree(batch->pfns);
>> }
>>
>> -/* true if the pfn was added, false otherwise */
>> -static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
>> +/* returns the number of pfn's added */
>> +static int batch_add_pfn_num(struct pfn_batch *batch, unsigned long pfn,
>> + unsigned long nr)
>
> This return type is truncating unsigned longs to int..
Thanks, will fix.
>> @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
>> break;
>> }
>>
>> +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
>> + size_t npages)
>> +{
>> + struct folio **end = folios + npages;
>> +
>> + for (; folios != end; folios++)
>> + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
>> + break;
>
> This function is really weird, why would we have a function adding
> only the first page of each item in a list of folios?
It is only called when we know that the number of pages in all folios is 1
(user->ufolios_huge is false).
>> +static void batch_from_folios_huge(struct pfn_batch *batch,
>> + struct folio ***folios_p,
>> + unsigned long *offset_p,
>> + unsigned long npages)
>> +{
>> + unsigned long nr, n, pfn, i = 0;
>> + struct folio *folio, **folios = *folios_p;
>> + unsigned long offset = *offset_p;
>
> Let's make more use of scope local values please (put folio, pfn, etc
> in side the narrowest loop scope)
OK. Some like this, some do not, now I know your preference.
I personally like to use local scope and initializers, but dislike when
cstyle it forces me to add a blank line after the declarations which disrupts
the flow/use of some quantity, eg:
{
long pfn = foo();
bar(pfn);
}
> and the style in iommufd avoids more
> than one variable per declaration
OK, I had not realized that.
- Steve
>
> I would like to see this patch when it is applied but I can't apply
> it..
I will send you my patched version of pages.c off list.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE
2024-10-17 17:00 ` Steven Sistare
@ 2024-10-17 19:20 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 19:20 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Thu, Oct 17, 2024 at 01:00:51PM -0400, Steven Sistare wrote:
> On 10/16/2024 8:41 AM, Jason Gunthorpe wrote:
> > On Fri, Oct 04, 2024 at 11:48:16AM -0700, Steve Sistare wrote:
> > > +int iommufd_ioas_map_file(struct iommufd_ucmd *ucmd)
> > > +{
> > > + struct iommu_ioas_map_file *cmd = ucmd->cmd;
> > > + unsigned long iova = cmd->iova;
> > > + struct iommufd_ioas *ioas;
> > > + unsigned int flags = 0;
> > > + int rc;
> > > + struct file *file;
> > > + unsigned long end;
> > > +
> > > + if (cmd->iova >= ULONG_MAX || cmd->length >= ULONG_MAX)
> > > + return -EOVERFLOW;
> > > +
> > > + if (cmd->length > 0 &&
> > > + check_add_overflow(cmd->start, cmd->length - 1, &end))
> > > + return -EOVERFLOW;
> >
> > Why is this here? iopt_alloc_pages() does this validation and I don't
> > see math on these values between here and there?
>
> Since the refactoring of iopt_alloc_pages into iopt_alloc_user_pages and
> iopt_alloc_file_pages, the former checks for overflow of start + length,
> but the latter does not.
>
> To make it symmetric, I could push the check from here (iommufd_ioas_map_file)
> down to iopt_alloc_file_pages. Or, hoist the check from iopt_alloc_user_pages to
> iommufd_ioas_map. Your call.
Put the check closer to the actual math it is protecting from overflow
please.
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 4/9] iommufd: pfn reader for file mappings
2024-10-17 17:01 ` Steven Sistare
@ 2024-10-17 19:20 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 19:20 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Thu, Oct 17, 2024 at 01:01:02PM -0400, Steven Sistare wrote:
> On 10/16/2024 8:37 AM, Jason Gunthorpe wrote:
> > On Fri, Oct 04, 2024 at 11:48:15AM -0700, Steve Sistare wrote:
> >
> > > +static long pin_memfd_pages(struct pfn_reader_user *user, unsigned long start,
> > > + unsigned long npages)
> > > +{
> > > + unsigned long end, nr, i, j, npin, offset, npages_out;
> > > + long nfolios;
> > > + int rc;
> > > + struct folio *folio;
> > > + struct page **upages = user->upages;
> > > +
> > > + nfolios = user->ufolios_len / sizeof(*user->ufolios);
> > > + end = start + (npages << PAGE_SHIFT) - 1;
> > > +
> > > + nfolios = memfd_pin_folios(user->file, start, end, user->ufolios,
> > > + nfolios, &offset);
> > > + if (nfolios <= 0)
> > > + return nfolios;
> > > +
> > > + offset >>= PAGE_SHIFT;
> > > + npages_out = 0;
> > > +
> > > + for (i = 0; i < nfolios; i++) {
> > > + folio = user->ufolios[i];
> > > + nr = folio_nr_pages(folio);
> > > + npin = min(nr - offset, npages);
> >
> > What happens if all the folios don't fit into the upages array once
> > expanded? It looks like they are leaked?
> >
> > I suppose the later patch gets this all sane, but I wonder if we need
> > this inbetween step?
>
> Yup, I missed the leak if a folio is only partially converted to pages.
> The optimization patch does get it right. I'll squash them and reduce its
> size by making a separate patch for the new subroutines
> batch_add_pfn_num etc.
That sounds good!
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-17 17:02 ` Steven Sistare
@ 2024-10-17 19:24 ` Jason Gunthorpe
2024-10-17 19:37 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 19:24 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Thu, Oct 17, 2024 at 01:02:14PM -0400, Steven Sistare wrote:
> > > @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
> > > break;
> > > }
> > > +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
> > > + size_t npages)
> > > +{
> > > + struct folio **end = folios + npages;
> > > +
> > > + for (; folios != end; folios++)
> > > + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
> > > + break;
> >
> > This function is really weird, why would we have a function adding
> > only the first page of each item in a list of folios?
>
> It is only called when we know that the number of pages in all folios is 1
> (user->ufolios_huge is false).
Oh, lets add some comments about this please, and maybe give the
function a more scary name. batch_from_folios doesn't suggest it does
something so strange. batch_from_single_page_folios ?
But, still, *why*? If the folios in the list all have 1 page size then
why not just add them in the normal way?
I still didn't quite entirely follow what the ufolios_huge was doing,
it is what I wanted to look at in an editor..
> > > +static void batch_from_folios_huge(struct pfn_batch *batch,
> > > + struct folio ***folios_p,
> > > + unsigned long *offset_p,
> > > + unsigned long npages)
> > > +{
> > > + unsigned long nr, n, pfn, i = 0;
> > > + struct folio *folio, **folios = *folios_p;
> > > + unsigned long offset = *offset_p;
> >
> > Let's make more use of scope local values please (put folio, pfn, etc
> > in side the narrowest loop scope)
>
> OK. Some like this, some do not, now I know your preference.
Yeah, I'm in the camp it is easier to follow
> I personally like to use local scope and initializers, but dislike when
> cstyle it forces me to add a blank line after the declarations which disrupts
> the flow/use of some quantity, eg:
> {
> long pfn = foo();
>
> bar(pfn);
> }
Yeah, sometimes, maybe don't initialize in those cases?
> > I would like to see this patch when it is applied but I can't apply
> > it..
>
> I will send you my patched version of pages.c off list.
include a git link in your next cover letter, that makes it easier
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-17 19:24 ` Jason Gunthorpe
@ 2024-10-17 19:37 ` Steven Sistare
2024-10-18 0:10 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-17 19:37 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/17/2024 3:24 PM, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 01:02:14PM -0400, Steven Sistare wrote:
>
>>>> @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
>>>> break;
>>>> }
>>>> +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
>>>> + size_t npages)
>>>> +{
>>>> + struct folio **end = folios + npages;
>>>> +
>>>> + for (; folios != end; folios++)
>>>> + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
>>>> + break;
>>>
>>> This function is really weird, why would we have a function adding
>>> only the first page of each item in a list of folios?
>>
>> It is only called when we know that the number of pages in all folios is 1
>> (user->ufolios_huge is false).
>
> Oh, lets add some comments about this please, and maybe give the
> function a more scary name. batch_from_folios doesn't suggest it does
> something so strange. batch_from_single_page_folios ?
>
> But, still, *why*? If the folios in the list all have 1 page size then
> why not just add them in the normal way?
>
> I still didn't quite entirely follow what the ufolios_huge was doing,
> it is what I wanted to look at in an editor..
pin_memfd_pages()
for (i = 0; i < nfolios; i++) {
nr = folio_nr_pages(folio);
if (nr > 1)
user->ufolios_huge = true;
batch_from_folios is an optimized version of batch_from_folios_huge,
usable when we know that nr==1 for all folios.
But, it is optimized for a sub-optimal case, all small pages.
- Steve
>>>> +static void batch_from_folios_huge(struct pfn_batch *batch,
>>>> + struct folio ***folios_p,
>>>> + unsigned long *offset_p,
>>>> + unsigned long npages)
>>>> +{
>>>> + unsigned long nr, n, pfn, i = 0;
>>>> + struct folio *folio, **folios = *folios_p;
>>>> + unsigned long offset = *offset_p;
>>>
>>> Let's make more use of scope local values please (put folio, pfn, etc
>>> in side the narrowest loop scope)
>>
>> OK. Some like this, some do not, now I know your preference.
>
> Yeah, I'm in the camp it is easier to follow
>
>> I personally like to use local scope and initializers, but dislike when
>> cstyle it forces me to add a blank line after the declarations which disrupts
>> the flow/use of some quantity, eg:
>> {
>> long pfn = foo();
>>
>> bar(pfn);
>> }
>
> Yeah, sometimes, maybe don't initialize in those cases?
>
>>> I would like to see this patch when it is applied but I can't apply
>>> it..
>>
>> I will send you my patched version of pages.c off list.
>
> include a git link in your next cover letter, that makes it easier
>
> Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-17 19:37 ` Steven Sistare
@ 2024-10-18 0:10 ` Jason Gunthorpe
2024-10-18 14:34 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-18 0:10 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Thu, Oct 17, 2024 at 03:37:49PM -0400, Steven Sistare wrote:
> On 10/17/2024 3:24 PM, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2024 at 01:02:14PM -0400, Steven Sistare wrote:
> >
> > > > > @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
> > > > > break;
> > > > > }
> > > > > +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
> > > > > + size_t npages)
> > > > > +{
> > > > > + struct folio **end = folios + npages;
> > > > > +
> > > > > + for (; folios != end; folios++)
> > > > > + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
> > > > > + break;
> > > >
> > > > This function is really weird, why would we have a function adding
> > > > only the first page of each item in a list of folios?
> > >
> > > It is only called when we know that the number of pages in all folios is 1
> > > (user->ufolios_huge is false).
> >
> > Oh, lets add some comments about this please, and maybe give the
> > function a more scary name. batch_from_folios doesn't suggest it does
> > something so strange. batch_from_single_page_folios ?
> >
> > But, still, *why*? If the folios in the list all have 1 page size then
> > why not just add them in the normal way?
> >
> > I still didn't quite entirely follow what the ufolios_huge was doing,
> > it is what I wanted to look at in an editor..
>
> pin_memfd_pages()
> for (i = 0; i < nfolios; i++) {
> nr = folio_nr_pages(folio);
> if (nr > 1)
> user->ufolios_huge = true;
>
> batch_from_folios is an optimized version of batch_from_folios_huge,
> usable when we know that nr==1 for all folios.
I see..
This seems to be a side effect of processing the folio list twice,
once in pin_memfd_pages() which figures out how big it is in terms of
pages, then again when filling the batch
If the list was iterated only once during the batch fill then there
wouldn't be any value in having two paths.
This seems to be like this largely because the upages_start/end are
counting in pages instead of ulist items.
Do you think that is necessary, or could things be done so it index
based and there is only one loop?
Ie instead of things like:
batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
&user->ufolios_offset, npages);
Where npages is the total number of pages inside the folio list, it
would be nfolios.
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 0:10 ` Jason Gunthorpe
@ 2024-10-18 14:34 ` Steven Sistare
2024-10-18 16:04 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-18 14:34 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/17/2024 8:10 PM, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 03:37:49PM -0400, Steven Sistare wrote:
>> On 10/17/2024 3:24 PM, Jason Gunthorpe wrote:
>>> On Thu, Oct 17, 2024 at 01:02:14PM -0400, Steven Sistare wrote:
>>>
>>>>>> @@ -623,6 +639,67 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
>>>>>> break;
>>>>>> }
>>>>>> +static void batch_from_folios(struct pfn_batch *batch, struct folio **folios,
>>>>>> + size_t npages)
>>>>>> +{
>>>>>> + struct folio **end = folios + npages;
>>>>>> +
>>>>>> + for (; folios != end; folios++)
>>>>>> + if (!batch_add_pfn(batch, page_to_pfn(&folios[0]->page)))
>>>>>> + break;
>>>>>
>>>>> This function is really weird, why would we have a function adding
>>>>> only the first page of each item in a list of folios?
>>>>
>>>> It is only called when we know that the number of pages in all folios is 1
>>>> (user->ufolios_huge is false).
>>>
>>> Oh, lets add some comments about this please, and maybe give the
>>> function a more scary name. batch_from_folios doesn't suggest it does
>>> something so strange. batch_from_single_page_folios ?
>>>
>>> But, still, *why*? If the folios in the list all have 1 page size then
>>> why not just add them in the normal way?
>>>
>>> I still didn't quite entirely follow what the ufolios_huge was doing,
>>> it is what I wanted to look at in an editor..
>>
>> pin_memfd_pages()
>> for (i = 0; i < nfolios; i++) {
>> nr = folio_nr_pages(folio);
>> if (nr > 1)
>> user->ufolios_huge = true;
>>
>> batch_from_folios is an optimized version of batch_from_folios_huge,
>> usable when we know that nr==1 for all folios.
>
> I see..
>
> This seems to be a side effect of processing the folio list twice,
> once in pin_memfd_pages() which figures out how big it is in terms of
> pages, then again when filling the batch
>
> If the list was iterated only once during the batch fill then there
> wouldn't be any value in having two paths.
>
> This seems to be like this largely because the upages_start/end are
> counting in pages instead of ulist items.
>
> Do you think that is necessary, or could things be done so it index
> based and there is only one loop?
>
>
> Ie instead of things like:
>
> batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
> &user->ufolios_offset, npages);
>
> Where npages is the total number of pages inside the folio list, it
> would be nfolios.
The problem is that a batch may reach capacity after accepting only part
of a folio, a condition I forced in testing by reducing array_size and
MAX_NPFNS. We need to track which folio is partial and what is the next index
to use in the folio. That is ufolios_next and ufolios_offset.
I could delete the ufolios_huge optimization entirely, and always call
batch_from_folios_huge, which I would rename to batch_from_folios.
And always call folios_unpin_partial in pfn_reader_release_pins.
It would cost a few more cycles per iteration for nr==1 folios, but the
code would be simpler. Your call.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 14:34 ` Steven Sistare
@ 2024-10-18 16:04 ` Jason Gunthorpe
2024-10-18 17:54 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-18 16:04 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 18, 2024 at 10:34:39AM -0400, Steven Sistare wrote:
> > Ie instead of things like:
> >
> > batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
> > &user->ufolios_offset, npages);
> >
> > Where npages is the total number of pages inside the folio list, it
> > would be nfolios.
>
> The problem is that a batch may reach capacity after accepting only part
> of a folio, a condition I forced in testing by reducing array_size and
> MAX_NPFNS. We need to track which folio is partial and what is the next index
> to use in the folio. That is ufolios_next and ufolios_offset.
But why is that such a problem? The ufolios_offset already handles a
partial folio.
If the batch fills you leave ufolios_next alone and adjust
ufolios_offset. The next iteration does exactly the same as the first
iteration and takes the partial folio.
The refcount has to be handled carefully as the the portion that has
been put in the batch needs to be refcounted at 4k and the remainder
remain a single ref, but that just means adjusting the refcount after
putting it in the batch.
It seems like it just works naturally and avoids the double loop
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 16:04 ` Jason Gunthorpe
@ 2024-10-18 17:54 ` Steven Sistare
2024-10-18 17:59 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-18 17:54 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/18/2024 12:04 PM, Jason Gunthorpe wrote:
> On Fri, Oct 18, 2024 at 10:34:39AM -0400, Steven Sistare wrote:
>>> Ie instead of things like:
>>>
>>> batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
>>> &user->ufolios_offset, npages);
>>>
>>> Where npages is the total number of pages inside the folio list, it
>>> would be nfolios.
>>
>> The problem is that a batch may reach capacity after accepting only part
>> of a folio, a condition I forced in testing by reducing array_size and
>> MAX_NPFNS. We need to track which folio is partial and what is the next index
>> to use in the folio. That is ufolios_next and ufolios_offset.
>
> But why is that such a problem? The ufolios_offset already handles a
> partial folio.
>
> If the batch fills you leave ufolios_next alone and adjust
> ufolios_offset. The next iteration does exactly the same as the first
> iteration and takes the partial folio.
>
> The refcount has to be handled carefully as the the portion that has
> been put in the batch needs to be refcounted at 4k and the remainder
> remain a single ref, but that just means adjusting the refcount after
> putting it in the batch.
>
> It seems like it just works naturally and avoids the double loop
Yes, I can squash the nested loop in batch_from_folios_huge. (I was off
point because I thought you were suggesting something else).
Do you want me to delete ufolios_huge and its special cases?
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 17:54 ` Steven Sistare
@ 2024-10-18 17:59 ` Jason Gunthorpe
2024-10-18 18:10 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-18 17:59 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 18, 2024 at 01:54:12PM -0400, Steven Sistare wrote:
> On 10/18/2024 12:04 PM, Jason Gunthorpe wrote:
> > On Fri, Oct 18, 2024 at 10:34:39AM -0400, Steven Sistare wrote:
> > > > Ie instead of things like:
> > > >
> > > > batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
> > > > &user->ufolios_offset, npages);
> > > >
> > > > Where npages is the total number of pages inside the folio list, it
> > > > would be nfolios.
> > >
> > > The problem is that a batch may reach capacity after accepting only part
> > > of a folio, a condition I forced in testing by reducing array_size and
> > > MAX_NPFNS. We need to track which folio is partial and what is the next index
> > > to use in the folio. That is ufolios_next and ufolios_offset.
> >
> > But why is that such a problem? The ufolios_offset already handles a
> > partial folio.
> >
> > If the batch fills you leave ufolios_next alone and adjust
> > ufolios_offset. The next iteration does exactly the same as the first
> > iteration and takes the partial folio.
> >
> > The refcount has to be handled carefully as the the portion that has
> > been put in the batch needs to be refcounted at 4k and the remainder
> > remain a single ref, but that just means adjusting the refcount after
> > putting it in the batch.
> >
> > It seems like it just works naturally and avoids the double loop
>
> Yes, I can squash the nested loop in batch_from_folios_huge. (I was off
> point because I thought you were suggesting something else).
>
> Do you want me to delete ufolios_huge and its special cases?
I think yes, if you can get to one loop then it would be a degradation
to sweep the folio list an extra time just to figure out ufolios_huge.
Just check nr=1 in the batch_from_folios_huge body if that helps
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 17:59 ` Jason Gunthorpe
@ 2024-10-18 18:10 ` Steven Sistare
2024-10-18 23:10 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-18 18:10 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/18/2024 1:59 PM, Jason Gunthorpe wrote:
> On Fri, Oct 18, 2024 at 01:54:12PM -0400, Steven Sistare wrote:
>> On 10/18/2024 12:04 PM, Jason Gunthorpe wrote:
>>> On Fri, Oct 18, 2024 at 10:34:39AM -0400, Steven Sistare wrote:
>>>>> Ie instead of things like:
>>>>>
>>>>> batch_from_folios_huge(&pfns->batch, &user->ufolios_next,
>>>>> &user->ufolios_offset, npages);
>>>>>
>>>>> Where npages is the total number of pages inside the folio list, it
>>>>> would be nfolios.
>>>>
>>>> The problem is that a batch may reach capacity after accepting only part
>>>> of a folio, a condition I forced in testing by reducing array_size and
>>>> MAX_NPFNS. We need to track which folio is partial and what is the next index
>>>> to use in the folio. That is ufolios_next and ufolios_offset.
>>>
>>> But why is that such a problem? The ufolios_offset already handles a
>>> partial folio.
>>>
>>> If the batch fills you leave ufolios_next alone and adjust
>>> ufolios_offset. The next iteration does exactly the same as the first
>>> iteration and takes the partial folio.
>>>
>>> The refcount has to be handled carefully as the the portion that has
>>> been put in the batch needs to be refcounted at 4k and the remainder
>>> remain a single ref, but that just means adjusting the refcount after
>>> putting it in the batch.
>>>
>>> It seems like it just works naturally and avoids the double loop
>>
>> Yes, I can squash the nested loop in batch_from_folios_huge. (I was off
>> point because I thought you were suggesting something else).
>>
>> Do you want me to delete ufolios_huge and its special cases?
>
> I think yes, if you can get to one loop then it would be a degradation
> to sweep the folio list an extra time just to figure out ufolios_huge.
So now I'm confused again and don't know which "double loops" you want me
to eliminate. There is still a loop in pin_memfd_pages. It calls folio_add_pins,
which I could push to batch_from_folios_huge, but it also counts and returns
pages. It must do the count because the caller calls iopt_pages_add_npinned().
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 18:10 ` Steven Sistare
@ 2024-10-18 23:10 ` Jason Gunthorpe
2024-10-21 14:06 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-18 23:10 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Fri, Oct 18, 2024 at 02:10:26PM -0400, Steven Sistare wrote:
> > > > But why is that such a problem? The ufolios_offset already handles a
> > > > partial folio.
> > > >
> > > > If the batch fills you leave ufolios_next alone and adjust
> > > > ufolios_offset. The next iteration does exactly the same as the first
> > > > iteration and takes the partial folio.
> > > >
> > > > The refcount has to be handled carefully as the the portion that has
> > > > been put in the batch needs to be refcounted at 4k and the remainder
> > > > remain a single ref, but that just means adjusting the refcount after
> > > > putting it in the batch.
> > > >
> > > > It seems like it just works naturally and avoids the double loop
> > >
> > > Yes, I can squash the nested loop in batch_from_folios_huge. (I was off
> > > point because I thought you were suggesting something else).
> > >
> > > Do you want me to delete ufolios_huge and its special cases?
> >
> > I think yes, if you can get to one loop then it would be a degradation
> > to sweep the folio list an extra time just to figure out ufolios_huge.
>
> So now I'm confused again and don't know which "double loops" you want me
> to eliminate. There is still a loop in pin_memfd_pages. It calls folio_add_pins,
> which I could push to batch_from_folios_huge, but it also counts and returns
> pages.
This is the one I was looking at, yes..
> It must do the count because the caller calls
> iopt_pages_add_npinned().
Ah, this is what I was trying to get to, if there was some reason why
this loop and this pages calculation was needed..
So, to fix that up you'd need to adjust the memfd_pin_folios() to also
return back the end va of what it returned - it calls it start_idx
internally. Then that can directly give the npages for add_npinned
with simple math and the loop in pin_memfd_pages can go away.
Then you have a single loop in the batch_from_folios(), which
is what I am thinking about.
What do you think?
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-18 23:10 ` Jason Gunthorpe
@ 2024-10-21 14:06 ` Steven Sistare
2024-10-21 14:29 ` Steven Sistare
2024-10-21 16:10 ` Jason Gunthorpe
0 siblings, 2 replies; 39+ messages in thread
From: Steven Sistare @ 2024-10-21 14:06 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/18/2024 7:10 PM, Jason Gunthorpe wrote:
> On Fri, Oct 18, 2024 at 02:10:26PM -0400, Steven Sistare wrote:
>
>>>>> But why is that such a problem? The ufolios_offset already handles a
>>>>> partial folio.
>>>>>
>>>>> If the batch fills you leave ufolios_next alone and adjust
>>>>> ufolios_offset. The next iteration does exactly the same as the first
>>>>> iteration and takes the partial folio.
>>>>>
>>>>> The refcount has to be handled carefully as the the portion that has
>>>>> been put in the batch needs to be refcounted at 4k and the remainder
>>>>> remain a single ref, but that just means adjusting the refcount after
>>>>> putting it in the batch.
>>>>>
>>>>> It seems like it just works naturally and avoids the double loop
>>>>
>>>> Yes, I can squash the nested loop in batch_from_folios_huge. (I was off
>>>> point because I thought you were suggesting something else).
>>>>
>>>> Do you want me to delete ufolios_huge and its special cases?
>>>
>>> I think yes, if you can get to one loop then it would be a degradation
>>> to sweep the folio list an extra time just to figure out ufolios_huge.
>>
>> So now I'm confused again and don't know which "double loops" you want me
>> to eliminate. There is still a loop in pin_memfd_pages. It calls folio_add_pins,
>> which I could push to batch_from_folios_huge, but it also counts and returns
>> pages.
>
> This is the one I was looking at, yes..
>
>> It must do the count because the caller calls
>> iopt_pages_add_npinned().
>
> Ah, this is what I was trying to get to, if there was some reason why
> this loop and this pages calculation was needed..
>
> So, to fix that up you'd need to adjust the memfd_pin_folios() to also
> return back the end va of what it returned - it calls it start_idx
> internally. Then that can directly give the npages for add_npinned
> with simple math and the loop in pin_memfd_pages can go away.
>
> Then you have a single loop in the batch_from_folios(), which
> is what I am thinking about.
>
> What do you think?
Yes, this would eliminate the extra loop in pin_memfd_pages for the !upages
case. However, we still need all of that code when upages must be filled,
so pin_memfd_pages does not get any simpler.
But, we cannot change the signature of the exported function memfd_pin_folios.
It would have been nice if the end parameter was an inout.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-21 14:06 ` Steven Sistare
@ 2024-10-21 14:29 ` Steven Sistare
2024-10-21 16:11 ` Jason Gunthorpe
2024-10-21 16:10 ` Jason Gunthorpe
1 sibling, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2024-10-21 14:29 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/21/2024 10:06 AM, Steven Sistare wrote:
> On 10/18/2024 7:10 PM, Jason Gunthorpe wrote:
>> On Fri, Oct 18, 2024 at 02:10:26PM -0400, Steven Sistare wrote:
>>
>>>>>> But why is that such a problem? The ufolios_offset already handles a
>>>>>> partial folio.
>>>>>>
>>>>>> If the batch fills you leave ufolios_next alone and adjust
>>>>>> ufolios_offset. The next iteration does exactly the same as the first
>>>>>> iteration and takes the partial folio.
>>>>>>
>>>>>> The refcount has to be handled carefully as the the portion that has
>>>>>> been put in the batch needs to be refcounted at 4k and the remainder
>>>>>> remain a single ref, but that just means adjusting the refcount after
>>>>>> putting it in the batch.
>>>>>>
>>>>>> It seems like it just works naturally and avoids the double loop
>>>>>
>>>>> Yes, I can squash the nested loop in batch_from_folios_huge. (I was off
>>>>> point because I thought you were suggesting something else).
>>>>>
>>>>> Do you want me to delete ufolios_huge and its special cases?
>>>>
>>>> I think yes, if you can get to one loop then it would be a degradation
>>>> to sweep the folio list an extra time just to figure out ufolios_huge.
>>>
>>> So now I'm confused again and don't know which "double loops" you want me
>>> to eliminate. There is still a loop in pin_memfd_pages. It calls folio_add_pins,
>>> which I could push to batch_from_folios_huge, but it also counts and returns
>>> pages.
>>
>> This is the one I was looking at, yes..
>>
>>> It must do the count because the caller calls
>>> iopt_pages_add_npinned().
>>
>> Ah, this is what I was trying to get to, if there was some reason why
>> this loop and this pages calculation was needed..
>>
>> So, to fix that up you'd need to adjust the memfd_pin_folios() to also
>> return back the end va of what it returned - it calls it start_idx
>> internally. Then that can directly give the npages for add_npinned
>> with simple math and the loop in pin_memfd_pages can go away.
>>
>> Then you have a single loop in the batch_from_folios(), which
>> is what I am thinking about.
>>
>> What do you think?
>
> Yes, this would eliminate the extra loop in pin_memfd_pages for the !upages
> case. However, we still need all of that code when upages must be filled,
> so pin_memfd_pages does not get any simpler.
>
> But, we cannot change the signature of the exported function memfd_pin_folios.
> It would have been nice if the end parameter was an inout.
Despite this I can make one further simplification of the V4 code. In pin_memfd_pages,
if !upages, I can skip folio_add_pins and instead call it from batch_from_folios. This
guarantees that in pfn_reader_release_pins, each folio to be released has only 1 pin,
so I can call the existing unpin_folios and delete folios_unpin_partial.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-21 14:06 ` Steven Sistare
2024-10-21 14:29 ` Steven Sistare
@ 2024-10-21 16:10 ` Jason Gunthorpe
1 sibling, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-21 16:10 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Mon, Oct 21, 2024 at 10:06:44AM -0400, Steven Sistare wrote:
> But, we cannot change the signature of the exported function
> memfd_pin_folios.
Well, we can, that is how the upstream kernel works...
Question is if it is worthwhile
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-21 14:29 ` Steven Sistare
@ 2024-10-21 16:11 ` Jason Gunthorpe
2024-10-21 17:29 ` Steven Sistare
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-10-21 16:11 UTC (permalink / raw)
To: Steven Sistare; +Cc: iommu, Kevin Tian, Nicolin Chen
On Mon, Oct 21, 2024 at 10:29:02AM -0400, Steven Sistare wrote:
> Despite this I can make one further simplification of the V4 code. In pin_memfd_pages,
> if !upages, I can skip folio_add_pins and instead call it from batch_from_folios. This
> guarantees that in pfn_reader_release_pins, each folio to be released has only 1 pin,
> so I can call the existing unpin_folios and delete folios_unpin_partial.
That sounds pretty good, if the loop in the memfd_pin_folios() can
just be computing npages then that is a reasonable place to come later
and optimized by changing the mm code.
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH V3 8/9] iommufd: optimize file mapping
2024-10-21 16:11 ` Jason Gunthorpe
@ 2024-10-21 17:29 ` Steven Sistare
0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2024-10-21 17:29 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: iommu, Kevin Tian, Nicolin Chen
On 10/21/2024 12:11 PM, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2024 at 10:29:02AM -0400, Steven Sistare wrote:
>
>> Despite this I can make one further simplification of the V4 code. In pin_memfd_pages,
>> if !upages, I can skip folio_add_pins and instead call it from batch_from_folios. This
>> guarantees that in pfn_reader_release_pins, each folio to be released has only 1 pin,
>> so I can call the existing unpin_folios and delete folios_unpin_partial.
>
> That sounds pretty good, if the loop in the memfd_pin_folios() can
> just be computing npages then that is a reasonable place to come later
> and optimized by changing the mm code.
OK, I'll post a V5 with that change. Don't bother reviewing V4.
- Steve
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2024-10-21 17:29 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-04 18:48 [PATCH V3 0/9] iommu_ioas_map_file Steve Sistare
2024-10-04 18:48 ` [PATCH V3 1/9] mm/gup: folio_add_pins Steve Sistare
2024-10-04 18:52 ` Steven Sistare
2024-10-04 20:20 ` David Hildenbrand
2024-10-16 12:17 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 2/9] iommufd: rename uptr in iopt_alloc_iova Steve Sistare
2024-10-04 18:48 ` [PATCH V3 3/9] iommufd: generalize iopt_pages address Steve Sistare
2024-10-04 18:48 ` [PATCH V3 4/9] iommufd: pfn reader for file mappings Steve Sistare
2024-10-16 12:37 ` Jason Gunthorpe
2024-10-17 17:01 ` Steven Sistare
2024-10-17 19:20 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 5/9] iommufd: IOMMU_IOAS_MAP_FILE Steve Sistare
2024-10-16 12:41 ` Jason Gunthorpe
2024-10-17 17:00 ` Steven Sistare
2024-10-17 19:20 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 6/9] iommufd: file mappings for mdev Steve Sistare
2024-10-04 18:48 ` [PATCH V3 7/9] iommufd: pfn reader local variables Steve Sistare
2024-10-16 12:42 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 8/9] iommufd: optimize file mapping Steve Sistare
2024-10-16 13:00 ` Jason Gunthorpe
2024-10-16 13:09 ` Steven Sistare
2024-10-16 13:21 ` Jason Gunthorpe
2024-10-17 17:02 ` Steven Sistare
2024-10-17 19:24 ` Jason Gunthorpe
2024-10-17 19:37 ` Steven Sistare
2024-10-18 0:10 ` Jason Gunthorpe
2024-10-18 14:34 ` Steven Sistare
2024-10-18 16:04 ` Jason Gunthorpe
2024-10-18 17:54 ` Steven Sistare
2024-10-18 17:59 ` Jason Gunthorpe
2024-10-18 18:10 ` Steven Sistare
2024-10-18 23:10 ` Jason Gunthorpe
2024-10-21 14:06 ` Steven Sistare
2024-10-21 14:29 ` Steven Sistare
2024-10-21 16:11 ` Jason Gunthorpe
2024-10-21 17:29 ` Steven Sistare
2024-10-21 16:10 ` Jason Gunthorpe
2024-10-04 18:48 ` [PATCH V3 9/9] iommufd: map file selftest Steve Sistare
2024-10-16 12:23 ` [PATCH V3 0/9] iommu_ioas_map_file Jason Gunthorpe
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.